Deploying biomedical LLMs

llm data
Google Deepmind via Unsplash

We have recently deployed a biomedical LLM system that now helps with finding drugging opportunities for a novel modality. In this post, we share the technical stack we used.



The recent pre-print of Google / DeepMind's Tx-LLM work, building upon Med-PALM-2, showcases continual advancements in the field. While these massive, proprietary models remain out of reach for most smaller biotech and pharma companies, lighter, open-source alternatives are becoming increasingly accessible for specialised domains.

We have recently deployed a system for hypothesis generation for a novel drug modality of interest. Initially, we suggested considering available hosted solutions, such as ChatGPT or Claude. However, the client prioritised their needs for security and control, ultimately deciding that the risks of proceeding with a tailored solution were justified.


Model and data

Our system is based on BioMistral-7B and has been fine-tuned on domain-specific literature and datasets. User requests are enhanced by fetching relevant documents through a simple, use-case-specific implementation of Retrieval Augmented Generation (RAG).

The data acquisition and processing were performed using's biomedical data engine, utilising its seamless integration with HuggingFace, albeit without using's graph extensions. HuggingFace was extensively used for dataset and model storage, serving as a core infrastructure component across research, development, and productisation.

One of the core elements of this platform is its ability to self-improve over time. This is achieved through Direct Preference Optimisation (DPO), where users generate multiple answers and select the best one. These preferences are stored and subsequently used to optimise the model. Both the initial fine-tuning and DPO were implemented using the Unsloth library.



We evaluated multiple approaches, tools, and providers for hosting the fine-tuned model. To serve multiple users at high throughput without model optimisation, we chose to use vLLM server running a non-quantised model.

In the early stages of development, we deployed a version of the model to's serverless offering. Given that it is the only such product on the market and considering its early-stage limitations in configurability and occasional errors stemming from infrastructure availability, we planned to replace it in due course. However, we were pleasantly surprised by the ease of setup, minimal maintenance, and negligible error rate. Out of 90,000 non-automated requests in the initial fortnight of usage, approximately 200 failed, mostly during a single point of active development, averaging a 0.2% failure rate. This issue was mitigated by the aforementioned DPO-derived strategy of running multiple requests for each user activity.

The supplementary applications, built with Python's FastAPI, Alpine.js, DaisyUI, and PostgreSQL database, were deployed on Digital Ocean using their App Platform.



Generative assistants have significant limitations in settings where deep domain knowledge is required and information is sparse. RAG systems can have some impact in addressing this by providing context. In our drug opportunity discovery work, they are primarily used for their novelty and as sounding boards where hypotheses are formulated and evaluated (e.g., target prioritisation). They also help identify areas where knowledge is lacking and assist in building the roadmap to the clinic (e.g., suggesting connections to be experimentally established and assays or models which could be utilised).

The technical state of large language models and their associated tooling and library ecosystem has progressed beyond the prototyping stage. Deploying a fine-tuned model capable of self-improvement, along with supporting applications, can now be achieved within weeks.


Considering applying large language models in biotech/pharma research?

Get in touch, we might be able to help.