Arachne.ai - a biomedical data backbone

2021/08/10
[Image: arachne.ai data - photo by Rodion Kutsaev via Unsplash]

For the past several months, we have been finalising the first usable release of a system that we believe will remedy some high-impact pain points we have seen while working with data at fast-paced bio- and med-tech companies.

 

The need for reliable biomedical data integration

That data is crucial for ML/AI is a platitude that any practitioner or executive will eagerly agree with. We are still at the stage where data constrains the quality of AI solutions - it is often still the case that more data corresponds directly to better results. Yet in most cases, it is surprisingly difficult to find an organisation genuinely happy with its data. This is especially evident in the biomedical domain, with its quirks: the discord between research and application, the multitude of highly specialised formats, pervasive outdated infrastructure, and the knowledge gap between different specialisations. And that's on top of data noise, lack of structure, limited reliability, differences in interpretation, etc.

 

Why is dealing with health and life science data difficult?

There are plenty of reasons, and examples include:

  • The monetary value of data is vague and difficult to quantify.
  • Scientists see limited value in the long-term work of maintaining provenance, data health and repeatability. Engineers, in turn, often lack the domain context for it, and it is too easy for them to silo that work away from the users.
  • Many companies devise their custom, unique ways of dealing with the technical aspects of data handling.
  • It is rare to have access to people with both high technical skills and domain knowledge.
  • Dealing with data is a laborious and unglamorous process.
  • Because of this low visibility, data handling is often ad hoc, with just enough effort invested to satisfy the immediate goal.

Appeals to divert more attention to data are of little help. Data quality and ease of use are not (in most cases) the end product, and more often than not, it is not clear how to value or assess their impact on the business. The result is a chase after short-term goals and a slowly accumulating tangle of pipelines. Companies occasionally push for a redesign by the engineering department, either re-inventing the wheel or producing a solution too rigid to extend. In the end, the data processes are, at best, slow and costly to maintain and, at worst, block new developments or make it impossible to retrieve crucial past data (e.g. to fulfil legal obligations or to support clinical trials).

 

How to make better use of data in the age of ML/AI?

We know this can be changed. We leveraged years of practice and the views of both new entrants and established companies. After dozens of conversations, plenty of presentations and months of research, we set out to build a system aiming to:

  • Serve multiple concurrent goals (e.g. investigation / model generation / information augmentation)
  • Bridge 'skunkworks' and production usage
  • Include a stable, consistent and replaceable acquisition methodology
  • Provide domain specialists and technical generalists with an easily understandable point of convergence
  • Allow for flexibility in processing and analytics frameworks, visualisation and versioning technologies
  • Handle new releases seamlessly
  • Rely on a resilient and transparent storage layer
  • Provide strong provenance and versioning for production and clinical trial scenarios (as sketched below)
  • Visualise in-house usage
  • Facilitate cross-team reusability

A further considerable difficulty is harnessing the best data engineering practices while fitting into a fast-paced, research-oriented culture and still allowing for long-term maintenance and growth.
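To make the provenance and versioning goal concrete, here is a minimal sketch in plain Python of the kind of immutable record such a backbone needs to keep for every acquired dataset. All names here (DatasetVersion, record_acquisition, the example URL) are illustrative stand-ins, not the actual Arachne.ai API.

```python
import hashlib
import json
from dataclasses import dataclass, asdict
from datetime import datetime, timezone

@dataclass(frozen=True)
class DatasetVersion:
    """One immutable, reproducible snapshot of an acquired dataset."""
    name: str            # logical dataset name
    source_url: str      # where the raw release was fetched from
    release: str         # upstream release identifier
    content_sha256: str  # hash of the raw payload, for integrity checks
    acquired_at: str     # ISO-8601 timestamp of the acquisition run
    licence: str         # licensing terms attached to this release
    parents: tuple = ()  # versions this one was derived from (provenance)

def record_acquisition(name, source_url, release, payload, licence, parents=()):
    """Create a provenance record for a freshly downloaded dataset."""
    return DatasetVersion(
        name=name,
        source_url=source_url,
        release=release,
        content_sha256=hashlib.sha256(payload).hexdigest(),
        acquired_at=datetime.now(timezone.utc).isoformat(),
        licence=licence,
        parents=tuple(parents),
    )

# Register one raw release; the resulting JSON record is what an audit
# or a clinical-trial submission would later point back to.
version = record_acquisition(
    name="example-assay-data",
    source_url="https://example.org/releases/2021-07.tar.gz",
    release="2021-07",
    payload=b"raw bytes of the downloaded archive",
    licence="CC-BY-4.0",
)
print(json.dumps(asdict(version), indent=2))
```

Because the record is frozen and content-addressed by hash, any downstream artefact (a model, a report, a trial submission) can name the exact snapshot it was built from.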

 

Our take - Arachne.ai

We aim to radically reduce the cost of acquiring, maintaining and utilising data in the biomedical domain in the context of ML/AI. We provide an extendable, efficient solution with out-of-the-box integrations for prominent datasets and popular technology stacks, and we seamlessly leverage BiŌkeanós as a source of knowledge about data availability, popularity, formats, sizes and licensing. For smaller companies and projects, it lets you start right and take advantage of proven methods; for larger ones, it extends your capabilities and speeds up your processes. Naturally, there are caveats. Most notably, the system is currently tuned for and tested with datasets up to tens of gigabytes in size. Massive datasets (e.g. genetics or medical imaging) will require additional custom handling, which can be simplified by taking advantage of the modularity of our approach or by relying on dedicated external technologies (e.g. genomic big data solutions).
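As an illustration of how catalogue-driven acquisition and the size caveat could play out in practice, here is a hedged Python sketch. The biokeanos_lookup function and the metadata fields it returns are hypothetical stand-ins for the kind of availability, size, format and licensing information BiŌkeanós exposes, not its actual API, and the size ceiling is only a rough placeholder.

```python
# Hypothetical sketch: choose an acquisition strategy from catalogue
# metadata. None of these names come from the real Arachne.ai or
# BiŌkeanós interfaces.

TENS_OF_GB = 50 * 1024**3  # rough ceiling the system is tuned for today

def biokeanos_lookup(dataset_name):
    """Stand-in for a catalogue query returning dataset metadata."""
    catalogue = {
        "example-compound-db": {"size_bytes": 4 * 1024**3,
                                "format": "sdf", "licence": "CC-BY-4.0"},
        "example-wgs-cohort": {"size_bytes": 300 * 1024**3,
                               "format": "fastq", "licence": "restricted"},
    }
    return catalogue[dataset_name]

def plan_acquisition(dataset_name):
    """Route small datasets to built-in handling and massive ones to
    dedicated external tooling, reflecting the modularity described above."""
    meta = biokeanos_lookup(dataset_name)
    if meta["size_bytes"] <= TENS_OF_GB:
        return f"built-in pipeline ({meta['format']}, {meta['licence']})"
    return "custom handler / external big-data tooling required"

print(plan_acquisition("example-compound-db"))  # built-in pipeline (...)
print(plan_acquisition("example-wgs-cohort"))   # custom handler / ...
```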

 

Sound interesting?

See what Arachne.ai can do for you: schedule a demo.