TabPFN - a deep learning architecture for tabular data that actually works

2025/01/19

artificial-intelligence deep-learning shorts

Tabular data remains the last bastion unconquered by deep learning. This might change soon.

📄 I was intrigued by a Nature paper released over a week ago demonstrating that deep learning methods can outperform boosting algorithms on tabular data - arguably the most classical case of machine learning.

📊 These scenarios involve classification or regression tasks where each datapoint is a row in a table, with various features, measurements, or descriptors as columns. We learnt long ago that, in most domains, even the fanciest hand-crafted features and models lose to throwing (considerably) more compute and data at the problem - as the wise old adage says, GPU go brr. It is therefore surprising that tabular settings remain outside of DL's reach.

🧬 While Random Forests and XGBoost have long reigned supreme in such applications, alternative approaches like TabNet, SAINT, and Well-tuned simple nets (note authors) have shown promising performance for years. In biology and chemistry, deep foundation models have allowed for progressively better fine-tuning, yet they still fall short of traditional ML methods in the tabular realm.

💡 The TabPFN paper, building on years of continuous research and improvements, proposes a transformer trained on millions of synthetic datasets that learns to develop strategies on a per-dataset basis using in-context learning. At first, the robustness of the reported results sounded puzzling, so I decided to test it on two specific blood-brain barrier permeability subproblems, using unreleased datasets of 2-5 thousand datapoints each.

⚙️ I expected to spend hours wrangling a research-quality repository to plug in the data. Happily, my expectations were shattered - TabPFN comes as an installable, comprehensible, high-quality package - literally a two-line plug-and-play with Python's scikit-learn interface. Since MPS is not supported (a fail-fast error message with a call to action - wow!), I needed Colab with a GPU to make it run.
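To give a feel for that plug-and-play interface, here is a minimal sketch, assuming `pip install tabpfn scikit-learn` and the `TabPFNClassifier` class exported by the `tabpfn` package. The public breast cancer dataset, the `HistGradientBoostingClassifier` baseline, and the helper names `holdout_accuracy` and `run_tabpfn_demo` are illustrative assumptions, not the blood-brain barrier setup described above.

```python
def holdout_accuracy(model, X_train, y_train, X_test, y_test):
    """Fit any scikit-learn-style estimator and return its test accuracy."""
    model.fit(X_train, y_train)
    predictions = model.predict(X_test)
    correct = sum(int(p == t) for p, t in zip(predictions, y_test))
    return correct / len(y_test)


def run_tabpfn_demo():
    """Compare TabPFN with a boosted baseline on a small public dataset.

    Imports are local so that holdout_accuracy stays usable without
    tabpfn or scikit-learn installed. The first TabPFN call may download
    model weights, and a GPU is strongly recommended.
    """
    from sklearn.datasets import load_breast_cancer
    from sklearn.ensemble import HistGradientBoostingClassifier
    from sklearn.model_selection import train_test_split
    from tabpfn import TabPFNClassifier

    X, y = load_breast_cancer(return_X_y=True)
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, test_size=0.3, random_state=0, stratify=y
    )
    models = {
        # Default settings, no hyperparameter tuning (an assumption).
        "TabPFN": TabPFNClassifier(),
        "HistGradientBoosting": HistGradientBoostingClassifier(random_state=0),
    }
    return {
        name: holdout_accuracy(model, X_train, y_train, X_test, y_test)
        for name, model in models.items()
    }
```

Calling `run_tabpfn_demo()` returns a dict of accuracies keyed by model name. Swapping TabPFN in or out is a matter of changing one constructor, which is what makes a quick sanity check like this cheap.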

⚖️ I hope you're not expecting a detailed analysis from this brief post (though I might give it a 5x5 CV with Tukey's HSD, as per Polaris's guidelines, at some point). In these particular settings, it matched - though did not surpass - the performance of the top-scoring boosted methods.
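For readers unfamiliar with that protocol, the sketch below shows one way it could look, taking "5x5 CV" to mean five repeats of 5-fold cross-validation (25 scores per model), which are then compared with Tukey's HSD. The helper names are hypothetical, the splitter is a plain unstratified one written with the standard library, and the comparison step assumes SciPy 1.8 or newer for `scipy.stats.tukey_hsd`. Nothing here reproduces results from this post.

```python
import random


def repeated_kfold_indices(n_samples, n_splits=5, n_repeats=5, seed=0):
    """Yield (repeat, fold, train_idx, test_idx) for repeated k-fold CV."""
    rng = random.Random(seed)
    for repeat in range(n_repeats):
        order = list(range(n_samples))
        rng.shuffle(order)  # fresh shuffle for every repeat
        # Deal the shuffled indices into n_splits nearly equal folds.
        folds = [order[i::n_splits] for i in range(n_splits)]
        for fold, test_idx in enumerate(folds):
            held_out = set(test_idx)
            train_idx = [i for i in order if i not in held_out]
            yield repeat, fold, train_idx, test_idx


def compare_with_tukey_hsd(scores_by_model):
    """Run Tukey's HSD on per-fold scores (requires SciPy >= 1.8).

    scores_by_model maps a model name to its list of fold scores.
    """
    from scipy.stats import tukey_hsd

    names = list(scores_by_model)
    result = tukey_hsd(*(scores_by_model[name] for name in names))
    # result.pvalue[i][j] is the p-value for names[i] versus names[j].
    return names, result.pvalue
```

Each model would be fitted on `train_idx` and scored on `test_idx` for all 25 splits, using the same splits for every model so the scores are comparable. For classification with imbalanced labels, a stratified splitter such as scikit-learn's `RepeatedStratifiedKFold` would be the more natural choice.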

Multiple caveats aside, it might be worth checking out when dealing with problems with fewer than ten thousand samples - and it is easy to do so.