For the past several months, we have been finalising the first usable release of a system that we believe will remedy some of the high-impact pain points we have seen while working with data in fast-paced bio- and med-tech companies.
That data is crucial for ML/AI is a platitude that any practitioner or executive will eagerly admit. We are still at the stage where data constrains the quality of AI solutions: it is often still the case that more data corresponds directly to better results. Yet in most cases it is surprisingly difficult to find an organisation genuinely happy with its data. This is especially evident in the biomedical domain, with its quirks: the discord between research and application, the multitude of highly specialised formats, pervasive outdated infrastructure, and the knowledge gap between different specialisations. And that is on top of the data noise, lack of structure, limited reliability, differences in interpretation, and so on.
The reasons are plentiful, and the consequences tend to follow a familiar pattern.
Appeals to divert more attention to data are of little help. Data quality and ease of use are not (in most cases) the end product, and more often than not it is unclear how to value or assess their impact on the business. The result is a chase after short-term goals and a slowly accumulating tangle of pipelines. Companies occasionally push for a redesign by the engineering department, either re-inventing the wheel or constraining the solution so that it cannot be extended. In the end, the data processes are, at best, slow and costly to maintain and, at worst, prohibitive of new developments or incapable of retrieving crucial past data (e.g. to fulfil legal obligations or to support clinical trials).
We know this can be changed. We have leveraged years of practice and opinions from both new entrants and established companies. After dozens of conversations, plenty of presentations and months of research, we set out to build a system that aims to:

- Serve multiple concurrent goals (e.g. investigation / model generation / information augmentation)
- Bridge 'skunkworks' and production usage
- Include a stable, consistent and replaceable acquisition methodology
- Provide domain specialists and technical generalists with an easily understandable point of convergence
- Allow for flexibility in processing and analytics frameworks, visualisation and versioning technologies
- Handle new releases seamlessly
- Rely on a resilient and transparent storage layer
- Provide strong provenance and versioning for production and clinical trial scenarios
- Visualise in-house usage
- Facilitate cross-team reusability

A further considerable difficulty is harnessing the best data engineering practices while converging with a fast-paced, research-oriented culture and keeping long-term maintenance and growth sustainable.
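To make a couple of these goals more concrete, here is a minimal sketch of what an acquisition step with built-in provenance and versioning could look like. All names in it (`AcquisitionRecord`, `acquire`, the example dataset and URL) are hypothetical illustrations under our stated goals, not the actual API of the system.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
import hashlib
import json

# Hypothetical illustration: an acquisition step that records provenance
# and a content-based version alongside the data it fetches.

@dataclass
class AcquisitionRecord:
    source_name: str    # e.g. a public biomedical dataset
    source_url: str     # where the raw data came from
    retrieved_at: str   # timestamp for reproducibility
    content_hash: str   # content-addressed version of the payload
    licence: str        # licensing metadata travels with the data

def acquire(source_name: str, source_url: str, payload: bytes, licence: str) -> AcquisitionRecord:
    """Record enough metadata to reproduce or audit this acquisition later."""
    return AcquisitionRecord(
        source_name=source_name,
        source_url=source_url,
        retrieved_at=datetime.now(timezone.utc).isoformat(),
        content_hash=hashlib.sha256(payload).hexdigest(),
        licence=licence,
    )

if __name__ == "__main__":
    record = acquire(
        source_name="example-clinical-dataset",     # placeholder name
        source_url="https://example.org/data.csv",  # placeholder URL
        payload=b"patient_id,measurement\n1,0.42\n",
        licence="CC-BY-4.0",
    )
    # The record can be stored next to the data itself and later inspected
    # by domain specialists and engineers alike.
    print(json.dumps(record.__dict__, indent=2))
```

The point of such a record is that provenance, versioning and licensing information stay attached to the data from the moment it enters the system, which is what makes production and clinical trial scenarios auditable later on.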
We aim to radically reduce the cost of acquiring, maintaining and utilising data in the biomedical domain in the context of ML/AI. We provide an extendable, efficient solution with out-of-the-box integrations for prominent datasets and popular technology stacks, and we seamlessly leverage BiŌkeanós as a source of knowledge about data availability, popularity, formats, sizes and licensing. For smaller companies and projects, it lets you start right and take advantage of proven methods; for larger ones, it extends your capabilities and speeds up your processes. Naturally, there are caveats. Most notably, the system has so far been tuned to and tested with datasets up to tens of GB in size. Massive datasets (e.g. genetics or medical imaging) will require additional custom handling. This can be simplified by taking advantage of the modularity of our approach or by relying on dedicated external technologies (e.g. genomic big data solutions).
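As a rough illustration of the kind of modularity we have in mind, the sketch below routes oversized modalities to a custom handler that could delegate to a dedicated external technology, while everything else takes the default path. The registry, function names and modalities here are assumptions made for the example, not the system's real interface.

```python
from typing import Callable, Dict

# Hypothetical sketch of a plug-in point for custom dataset handling.
Handler = Callable[[str], str]
_handlers: Dict[str, Handler] = {}

def register_handler(modality: str, handler: Handler) -> None:
    """Register a custom ingestion handler for a data modality."""
    _handlers[modality] = handler

def ingest(modality: str, path: str) -> str:
    """Route ingestion to a registered handler, or fall back to the default path."""
    handler = _handlers.get(modality, lambda p: f"default ingestion of {p}")
    return handler(path)

if __name__ == "__main__":
    # A genetics handler might only record a pointer and hand the heavy
    # lifting to an external genomic big data platform.
    register_handler("genetics", lambda p: f"delegated {p} to external genomics store")
    print(ingest("proteomics", "assay_results.parquet"))  # default path
    print(ingest("genetics", "cohort.vcf.gz"))            # custom path
```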