Charting the vastness of biomedical data

2021/03/31

biokeanos data

There is an unprecedented demand for biomedical data both for research and application, driven by the availability of large-scale processing technology and the resurgence of predictive modelling / AI. Given the vast amount of possible sources, how does one know which one they should be using?

What do we mean by biomedical data?

The goal of biomedicine is to find and verify treatments for health conditions using a scientific approach. It is the prevalent form of Western medicine, rooted in empirical evidence gathering and rigorous clinical trials. To find cures, it sources knowledge from a wide range of disciplines. For example, understanding how the human organism works relies on anatomy, physiology, genetics, cell biology, immunology, and embryology. To draw parallels with other organisms, it uses studies of model animals and our evolutionary relationships. Then there are studies of viruses (including phages), bacteria and other pathogens, as well as toxicology. Comprehending the low-level intervention mechanisms requires an in-depth understanding of chemistry and how it interacts with organisms (biochemistry). This is only the very tip of the iceberg: the range of disciplines involved is substantial. Consider radiology for medical imaging, nanotechnology, or various other forms of biotechnology and biomedical engineering. In addition to the empirical knowledge bases, there are attempts to systematise processes, e.g. representing particular types of information, designing clinical trials, reporting results, or even defining what diseases are.

How is it created?

Researchers generate a significant amount of data. Its methodology, scope and format depend on the field of study. While this varies from domain to domain, more and more effort is being put into unifying disparate discoveries into consistent databases. Examples include cataloguing genome-wide association studies (GWAS) or grouping protein information in UniProt. While some of those endeavours result in a single point of reference for a given topic, most often this is not the case. For example, there are at least four different reputable large-scale lists of chemical compounds backed by sizeable organisations: CAS REGISTRY, PubChem, ChemSpider and UniChem. Still, this is nothing compared to clinical data, where each hospital (or trust) can have its own dedicated solution and format.

Complexity

The databases we discussed are, more often than not, very specialised. Take, for example, Ensembl. From its description, it is "a genome browser for vertebrate genomes that supports research in comparative genomics, evolution, sequence variation and transcriptional regulation." It takes a considerable amount of time to fully comprehend what it represents and to realise its value and possible use cases. This high barrier to entry makes it difficult for a datasource to be understood, or indeed even known, outside of its domain. And any interdisciplinary effort is famously hard.

Visibility

Making data visible and getting it into the hands of scientists is nothing new. From permissive licensing and sharing standards like FAIR, through dedicated agencies such as ELIXIR, to tool and data sharing platforms, many organisations are working exclusively on making sure that scientific discoveries are available for use. To add to this public effort, some larger companies are following suit - either internally (e.g. applying FAIR standards for internal knowledge sharing) or by donating their datasets to the public realm (e.g. chemical probes). There are also manually curated web services that aggregate and/or recommend particular datasources. Examples include BioPortal, NAR Database issue, Expasy, re3data, FAIRsharing.org, bio.tools and HDR Research Gateway. Internally we call them metasources. They typically provide descriptions, URLs, relevant links, and some further information on the databases they contain.

A note on the newer approaches

There are also newer approaches to hosting and providing biomedical data. From scientific archives such as CERN's Zenodo and Dryad to data science-driven communities like Kaggle's Datasets or Datahub, they provide methods for searching and obtaining the data. On top of this, Google's Dataset Search indexes catalogues and the web at large to deliver results. Those systems are robust, extensive, designed with the modern user in mind and renowned in the data community. However, in our experience, they suffer from noise and a lack of specialisation. They also don't distinguish between an extremely specialised, one-off or toy dataset and a reputable database used by thousands of scientists.

"What datasets are out there that might be useful for a particular biomedical ML/AI use case?"

Working with clients in the bio- and med-tech sectors, we realise that this is a very hairy, context-dependent question. It would be impossible to answer it with anything other than a very broad brush. In our case, we needed a way to quickly identify the landscape our clients are in, so that we can understand their context and, if appropriate, bring to their attention another dataset that might benefit them. Nothing can replace the in-depth knowledge of domain specialists. Therefore, we concentrated on a heuristic solution that would:

- Make the discovery process much faster for new entrants or those spanning multiple domains
- Avoid repeated lookups to various metasources
- When possible, identify and prioritise the most impactful datasets
- Link to additional information available
- Allow for in-depth exploration by providing cues and recommendations

To answer those needs, we have built an open biomedical data discovery tool - BiŌkeanós. To our knowledge, it is the most extensive collection of its kind. It reconciles metasources, merges in our private lists of databases, and unifies the information while maintaining provenance. It augments the entries with scientific literature references, points back to the metasources for further details, and provides searching, tagging, recommendations and scoring. It is aimed at researchers and industry trailblazers seeking an advantage in data - whether for research reference, evaluation, expanding their training samples, or finding that additional signal that may bring their solution to the next level.
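To make the idea of reconciling metasources while keeping provenance more concrete, here is a minimal sketch in Python. It is not how BiŌkeanós is implemented. The record schema, the metasource names (`metasource_a`, `metasource_b`), the example databases, and the URL-based matching are all assumptions made for illustration. The ranking by number of listing metasources is a stand-in heuristic, not the tool's actual scoring.

```python
from __future__ import annotations

from dataclasses import dataclass, field
from urllib.parse import urlparse


@dataclass
class MergedEntry:
    """One database after reconciliation (hypothetical schema)."""
    name: str
    url: str
    # Description keyed by the metasource that supplied it (provenance).
    descriptions: dict[str, str] = field(default_factory=dict)
    tags: set[str] = field(default_factory=set)

    @property
    def metasources(self) -> list[str]:
        return sorted(self.descriptions)


def url_key(url: str) -> str:
    """Reduce a URL to host + path so trivially different links match."""
    parsed = urlparse(url.strip().lower())
    host = parsed.netloc
    if host.startswith("www."):
        host = host[4:]
    return host + parsed.path.rstrip("/")


def reconcile(records: list[dict]) -> list[MergedEntry]:
    """Merge records that point at the same URL, remembering who said what."""
    merged: dict[str, MergedEntry] = {}
    for rec in records:
        key = url_key(rec["url"])
        entry = merged.setdefault(key, MergedEntry(rec["name"], rec["url"]))
        entry.descriptions[rec["metasource"]] = rec.get("description", "")
        entry.tags.update(rec.get("tags", []))
    # Assumed heuristic: entries listed by more metasources rank first.
    return sorted(merged.values(), key=lambda e: (-len(e.descriptions), e.name))


# Entirely made-up records, as if scraped from two metasources.
records = [
    {"metasource": "metasource_a", "name": "ExampleDB",
     "url": "https://www.example.org/db/", "description": "Short blurb",
     "tags": ["genomics"]},
    {"metasource": "metasource_b", "name": "Example DB",
     "url": "http://example.org/db", "description": "Longer blurb",
     "tags": ["variation"]},
    {"metasource": "metasource_a", "name": "OtherDB",
     "url": "https://other.example.org", "tags": []},
]

for entry in reconcile(records):
    print(entry.name, entry.metasources, sorted(entry.tags))
```

Matching on a normalised URL is the simplest possible reconciliation key. In practice, the same database is often listed under different URLs and names, so a real system would likely need fuzzier matching and manual curation on top.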

Sounds interesting?

Take a look and judge for yourself.

Our next step is Arachne.ai. This biomedical data engine, which we are currently building, integrates with BiŌkeanós to quickly deliver data to researchers and AI/ML specialists in the format of their preference on a variety of infrastructures. We are actively looking for early partners. If you or someone you know would benefit from a fast, reliable biomedical data backbone, let us know and we will be in touch.

Acknowledgements

It would be impossible to build this tool without significant help from our colleagues from various companies and organisations. Their suggestions, questioning, in-depth insight and encouragement drove this effort. Special thanks go to Professor John Overington whose inspiration was invaluable. Most importantly - the vast majority of the data surfaced is based on the tireless work of the communities of data curators, researchers and their respective organisations who created and let us use their resources - they made this work possible.