As specialists in biological data integration and processing, we help companies with a wide range of problems, either through consulting or by creating software and methodology to their specifications. Here is a sample of our previous work.
A growing medical technology startup had a problem. After successfully reaching several validation milestones for their AI solution, they needed to prepare for considerable growth. That meant extracting a significant amount of patient data from clinical sites across the globe to feed their models.
The company needed a better system to de-identify and extract patient information from secure hospital networks. With no uniform standards, data could come in a plethora of shapes and sizes. The software had to satisfy a long, rigid list of requirements associated with medical records handling. It also had to provide a reliable and robust base for long-term maintenance. Anticipating the growth, the startup wanted to outsource some operations and rapidly create a forward-deployed engineering team. Ease of use therefore became a major requirement for the new tool.
When designing the toolkit, we concentrated on simplicity and extensibility. We removed significant maintenance costs by minimising the number of moving parts and concentrated on de-risking the most common yet technically challenging components. The solution allowed any aspect of the anonymisation and processing to be readjusted rapidly in response to data providers' feedback. It made functionality easier to share and allowed logic from different sites to coexist. A reference and traceability mechanism made later re-extraction of the data easier.
Within a couple of months of the toolkit's inception, the company had used it on nearly twice as many diverse datasets as it had acquired in its entire history. A process that had previously required significant effort from an engineering team, with turnaround measured in months, became lighter and more manageable, with some anonymisations taking only a couple of days. Notably, the toolkit was run successfully by both new internal users and external contractors with minimal guidance.
A challenger drug discovery company aimed to use AI/ML in as many aspects of its research pipeline as possible. From target validation through lead optimisation to clinical trials, they augmented or replaced parts of the process with dedicated models.
A central piece of their data infrastructure was a large, consolidated knowledge graph. Composed of a plethora of data sources, both structured and unstructured, internal and external, it was the central repository of data that served as a base for training all the models and directly fed the tools the drug discoverers used. Building it had taken numerous person-years of effort from a dedicated team of chemo- and bioinformaticians and some ML engineers. There was a problem, though. It took that team weeks to regenerate the graph whenever new data warranted it.
Overcoming this required two major lines of work. One was to significantly simplify the build system itself, with shorter iteration loops and faster error detection, and to move it to standardised infrastructure. The other was to strengthen the preparation of individual data sources. This required better versioning, lighter processing of data provenance, early structure validation, and automatic verification of how the data evolved.
The work resulted in an automated background task that ran for several hours. Apart from occasional reconciliation, it did not require input from any domain specialists, allowing them to proceed with other research. The company has since been able to seamlessly refresh its knowledge base multiple times a day instead of once a quarter.
A radiology startup got access to a new state-of-the-art dataset. It was of high quality, contained rare data points, and was the first of its kind to be in good enough shape to be shared with multiple companies in their domain, so that the performance of AI/ML solutions could be evaluated directly.
Expecting other players in the field to make the same move, the founders needed to put the dataset to use as soon as possible. At the same time, a dedicated team was working on a major redesign of the central processing infrastructure. The solution needed to be self-contained, bring the data to the AI researchers quickly, and be flexible enough to allow future integration with the newly developed architecture.
Understanding and using the data efficiently required in-depth medical knowledge. To that end, we established a close feedback loop with the clinical lead, providing analytics, bringing non-conformances and outliers to their attention, and iterating on their feedback. With the infrastructure specialists, we planned the acquisition, initial cleaning, and reconciliation of the data. We coordinated with the engineering department early to anticipate the new framework's requirements and to integrate the dataset into it swiftly once it was ready. Finally, we worked with the AI researchers to determine how best to serve the needs of their multiple teams.
We cleaned, harmonised, and handed over the dataset in several releases. We identified various issues with the dataset that the suppliers were unaware of and provided them with examples to help resolve those issues. We produced both ready-to-use and lightly processed releases of the data. This fed the company's main use case straight away while leaving room for the further research that soon followed.
As the startup's leadership and researchers expected, the dataset considerably improved an important aspect of their solution. The timely delivery also allowed them to better understand their models and compare them against the freshly published results of Google DeepMind's algorithm.
A small startup providing automated due diligence and investigation capabilities to financial institutions realised they needed to scale their data operations rapidly. Although they were constantly scraping tens of thousands of websites and databases, they could not reprocess what they had collected to extract further insights.
This structural deficiency could only be remedied by redesigning their approach and providing an infrastructure that could support the required volume, counted in hundreds of millions of documents. After a cost analysis with some input from our side, they decided to proceed with a bare-metal cloud solution. This way, they could fully leverage the available computational power while minimising operational costs. They also needed to train their data science team to use the infrastructure independently to derive new insights for their customers.
We proposed two approaches that could satisfy those constraints and delivered a suitable proof of concept for each. They chose the one that allowed data to be served dynamically to end users, despite it being slightly costlier to maintain. Operating at this scale required a programmable approach to server management and working with the cloud provider to solve various issues related to the newly provisioned clusters. We trained the data scientists and left them with a cookbook of examples showing how to use the framework, based on tasks we had observed them doing or planning to do soon. Finally, we created a backup system for all their data, which included online synchronisation. The process took slightly over a year and was deployed in several tranches, with new capabilities released to end users as soon as they were available.
The eventual solution consisted of over 40 large servers grouped in a couple of clusters, split across two continents and controlled from a single console. It allowed hundreds of millions of websites to be processed and served daily. Driven in part by new insights from the data scientists and the new platform's capabilities, the startup's valuation nearly tripled in two years.
Do these sound familiar? Or perhaps you face a completely different set of problems? Maybe we can help. Contact us.