How the Rise of Tabular Foundation Models Is Reshaping Data Science


Recent advances in AI—ranging from systems capable of holding coherent conversations to those generating realistic video sequences—are largely attributable to artificial neural networks (ANNs). These achievements have been made possible by algorithmic breakthroughs and architectural innovations developed over the past fifteen years, and more recently by the emergence of large-scale computing infrastructures capable of training such networks on internet-scale datasets.

The main strength of this approach to machine learning, commonly known as deep learning, lies in its ability to automatically learn representations of complex data types—such as images or text—without relying on handcrafted features or domain-specific modeling. In doing so, deep learning has significantly extended the reach of traditional statistical methods, which were originally designed to analyze structured data organized in tables, such as those found in spreadsheets or relational databases.

Figure 1: Until recently, neural networks were poorly suited to tabular data. [Image by author]

Given, on the one hand, the remarkable effectiveness of deep learning on complex data, and on the other, the immense economic value of tabular data—which still represents the core of the informational assets of many organizations—it is only natural to ask whether deep learning techniques could be successfully applied to such structured data. After all, if a model can tackle the hardest problems, why wouldn't it excel at the easier ones?

Paradoxically, deep learning has long struggled with tabular data [8]. To understand why, it is helpful to recall that its success hinges on the ability to uncover grammatical, semantic, or visual patterns from massive volumes of data. Put simply, the meaning of a word emerges from the consistency of the linguistic contexts in which it appears; likewise, a visual feature becomes recognizable through its recurrence across many images. In both cases, it is the internal structure and coherence of the data that enable deep learning models to generalize and transfer knowledge across different samples—texts or images—that share underlying regularities.

The situation is fundamentally different with tabular data, where each row typically corresponds to an observation involving multiple variables. Think, for instance, of predicting an individual's weight based on their height, age, and gender, or estimating a household's electricity consumption (in kWh) based on floor area, insulation quality, and outdoor temperature. A key point is that the value of a cell is only meaningful within the specific context of the table it belongs to. The same number might represent a person's weight (in kilograms) in one dataset, and the floor area (in square meters) of a studio apartment in another. Under such conditions, it is difficult to see how a predictive model could transfer knowledge from one table to another—the semantics are entirely dependent on context.

Tabular structures are thus highly heterogeneous, and in practice there exists a virtually unlimited number of them, capturing the variety of real-world phenomena—ranging from financial transactions to galaxy structures or income disparities within urban areas.

This diversity comes at a price: each tabular dataset typically requires its own dedicated predictive model, which cannot be reused elsewhere.

To handle such data, data scientists most often rely on a class of models based on decision trees [7]. Their precise mechanics need not concern us here; what matters is that they are remarkably fast at inference, often producing predictions in under a millisecond. Unfortunately, like all classical machine learning algorithms, they must be retrained from scratch for each new table—a process that can take hours. Additional drawbacks include unreliable uncertainty estimation, limited interpretability, and poor integration with unstructured data—precisely the kind of data where neural networks shine.

The idea of building universal predictive models—similar to large language models (LLMs)—is clearly appealing: once pretrained, such models could be applied directly to any tabular dataset, without additional training or fine-tuning. Framed this way, the idea may seem ambitious, if not entirely unrealistic. And yet, this is precisely what tabular foundation models (TFMs), developed by several research groups over the past year [2–4], have begun to achieve—with surprising success.

The sections that follow highlight some of the key innovations behind these models and compare them to existing techniques. More importantly, they aim to spark curiosity about a development that could soon reshape the landscape of data science.

What We’ve Learned from LLMs

To put it simply, a large language model (LLM) is a machine learning model trained to predict the next word in a sequence of text. One of the most striking features of these systems is that, once trained on massive text corpora, they exhibit the ability to perform a wide range of linguistic and reasoning tasks—even those they were never explicitly trained for. A particularly compelling example of this capability is their success at solving problems relying solely on a short list of input–output pairs provided in the prompt. For instance, to perform a translation task, it often suffices to provide a few translation examples.

This behavior is known as in-context learning (ICL). In this setting, learning and prediction occur on the fly, without any additional parameter updates or fine-tuning. This phenomenon—initially unexpected and almost miraculous in nature—is central to the success of generative AI. Recently, several research groups have proposed adapting the ICL mechanism to build tabular foundation models (TFMs), designed to play for tabular data a role analogous to that of LLMs for text.
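To make this concrete, here is the classic toy illustration (the word-translation example popularized by the GPT-3 paper); the prompt format is purely illustrative and not tied to any particular model API:

```python
# A few-shot prompt: the input-output pairs *are* the training data.
# A capable LLM completes the last line with "fromage", having inferred
# the English -> French translation task from context alone, with no
# parameter updates whatsoever.
prompt = """sea otter => loutre de mer
peppermint => menthe poivrée
cheese =>"""
print(prompt)
```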

Conceptually, the construction of a TFM remains relatively straightforward. The first step involves generating a very large collection of synthetic tabular datasets with diverse structures and varying sizes—both in terms of rows (observations) and columns (features or covariates). In the second step, a single model—the foundation model proper—is trained to predict one column from all the others within each table. In this framework, the table itself serves as a predictive context, analogous to the prompt examples used by an LLM in ICL mode.
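As a rough sketch, this second step might look as follows; `sample_synthetic_table` and the `tfm.loss` interface are hypothetical placeholders for illustration, not the API of any published implementation:

```python
import numpy as np

def pretraining_step(tfm, sample_synthetic_table, rng):
    # `rng` is a numpy Generator, e.g. np.random.default_rng().
    # One synthetic table: a feature matrix plus one designated target column.
    X, y = sample_synthetic_table(rng)
    # Random split: the first rows act as the "prompt" (context),
    # the remaining rows are the ones the model must predict.
    n_ctx = rng.integers(1, len(X))
    context = (X[:n_ctx], y[:n_ctx])
    # The loss rewards predicting held-out targets from the context alone;
    # it is minimized by gradient descent across millions of such tables.
    return tfm.loss(context, X[n_ctx:], y[n_ctx:])
```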

Using synthetic data offers several benefits. First, it avoids the legal risks related to copyright infringement or privacy violations that currently complicate the training of LLMs. Second, it allows prior knowledge—an inductive bias—to be explicitly injected into the training corpus. A particularly effective strategy involves generating tabular data using causal models. Without delving into technical details, these models aim to simulate the underlying mechanisms that could plausibly give rise to the wide variety of data observed in the real world—whether physical, economic, or otherwise. In recent TFMs such as TabPFN-v2 and TabICL [3,4], tens of millions of synthetic tables have been generated in this way, each derived from a distinct causal model. These models are sampled randomly, but with a preference for simplicity, following Occam's razor—the principle that among competing explanations, the simplest one consistent with the data should be favored.
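The toy generator below conveys the idea; real TFM pipelines [3,4] use far richer structural causal models, but the spirit is the same—sparse random dependencies plus noise, with sparsity playing the role of the simplicity prior:

```python
import numpy as np

def sample_scm_table(n_rows=256, n_cols=5, seed=None):
    """Draw one synthetic table from a random structural causal model:
    each column is a noisy, randomly weighted function of the columns
    preceding it in a random causal ordering (a toy sketch, not the
    generator of any published TFM)."""
    rng = np.random.default_rng(seed)
    cols = [rng.normal(size=n_rows)]                 # a root cause: pure noise
    for _ in range(n_cols - 1):
        parents = np.stack(cols, axis=1)
        w = rng.normal(size=parents.shape[1])
        w *= rng.random(parents.shape[1]) < 0.5      # sparse edges: Occam's razor
        cols.append(np.tanh(parents @ w) + 0.1 * rng.normal(size=n_rows))
    table = np.stack(cols, axis=1)
    target = rng.integers(n_cols)                    # any column may serve as target
    return np.delete(table, target, axis=1), table[:, target]

X, y = sample_scm_table(seed=0)                      # one (features, target) pair
```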

TFMs are all implemented as neural networks. While their architectural details vary from one implementation to another, they all incorporate one or more Transformer-based modules. This design choice can be explained, in broad terms, by the fact that Transformers rely on a mechanism known as attention, which enables the model to contextualize each piece of information. Just as attention allows a word to be interpreted in light of its surrounding text, a suitably designed attention mechanism can contextualize the value of a cell within a table. Readers interested in exploring this topic—which is both technically rich and conceptually fascinating—are encouraged to consult references [2–4].
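As a minimal illustration of the idea—not the actual architecture of TabPFN or TabICL—one can let attention contextualize each cell embedding against the other cells of its row and of its column:

```python
import torch
import torch.nn as nn

n_rows, n_cols, d = 8, 4, 32
cells = torch.randn(n_rows, n_cols, d)         # one embedding per table cell

# A single attention module, reused along both axes for simplicity.
attn = nn.MultiheadAttention(embed_dim=d, num_heads=4, batch_first=True)

row_ctx, _ = attn(cells, cells, cells)         # each cell attends to its row
cols_in = cells.transpose(0, 1)                # (n_cols, n_rows, d)
col_ctx, _ = attn(cols_in, cols_in, cols_in)   # each cell attends to its column
contextualized = row_ctx + col_ctx.transpose(0, 1)  # same value, two contexts
```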

Figures 2 and 3 compare the training and inference workflows of traditional models with those of TFMs. Classical models such as XGBoost [7] must be retrained from scratch for each new table. They learn to predict a target variable y = f(x) from input features x, with training typically taking several hours, though inference is almost instantaneous.

TFMs, by contrast, require a more expensive initial pretraining phase—on the order of a few dozen GPU-days. This cost is usually borne by the model provider but remains within reach for many organizations, unlike the prohibitive scale often associated with LLMs. Once pretrained, TFMs unify ICL-style learning and inference into a single pass: the training table D serves directly as context for the test inputs x. The TFM then predicts targets via a mapping ŷ = f(x; D), where the table D plays a role analogous to the list of examples provided in an LLM prompt.
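In code, the two workflows end up looking almost identical, which is part of the appeal. The sketch below assumes the xgboost and tabpfn packages are installed (both expose scikit-learn-style estimators); the dataset is arbitrary:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from tabpfn import TabPFNClassifier
from xgboost import XGBClassifier

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Classical workflow: fit() trains a brand-new model for this table.
xgb = XGBClassifier().fit(X_tr, y_tr)

# TFM workflow: fit() merely stores (X_tr, y_tr) as the ICL context D;
# the pretrained network itself is not updated.
tfm = TabPFNClassifier().fit(X_tr, y_tr)

print(xgb.score(X_te, y_te), tfm.score(X_te, y_te))
```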

Figure 2: Training a traditional machine learning model and making predictions on a table. [Image by author]
Figure 3: Training a tabular foundation model and performing universal predictions. [Image by author]

To summarize the discussion in a single sentence:

TFMs are designed to learn a predictive model on the fly for tabular data, without requiring any additional training.

Blazing Performance

Key Figures

The table below provides indicative figures for several key aspects—the pretraining cost of a TFM, ICL-style adaptation time on a new table, inference latency, and the maximum supported table sizes—for three predictive models. These include TabPFN-v2, a TFM developed at Prior Labs by Frank Hutter's team; TabICL, a TFM developed at INRIA by Gaël Varoquaux's group[1]; and XGBoost, a classical algorithm widely considered one of the strongest performers on tabular data.

Figure 4: A performance comparison between two TFMs and a classical algorithm. [Image by author]

These figures should be interpreted as rough estimates, and they are likely to evolve quickly as implementations continue to improve. For a detailed evaluation, readers are encouraged to consult the original publications [2–4].

Beyond these quantitative aspects, TFMs offer several additional benefits over conventional approaches. The most notable are outlined below.

TFMs Are Well-Calibrated

A well-known limitation of classical models is their poor calibration—that is, the probabilities they assign to their predictions often fail to reflect the true empirical frequencies. In contrast, TFMs are well-calibrated, for reasons that are beyond the scope of this overview but that stem from their implicitly Bayesian nature [1].

Figure 5: Calibration comparison across predictive models. Darker shades indicate higher confidence levels. TabPFN clearly produces the most reasonable confidence estimates. [Image adapted from [2], licensed under CC BY 4.0]

Figure 5 compares the confidence levels predicted by TFMs with those produced by classical models such as logistic regression and decision trees. The latter tend to assign overly confident predictions in regions where no data is observed and frequently exhibit linear artifacts that bear no relation to the underlying distribution. In contrast, the predictions from TabPFN appear significantly better calibrated.
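Calibration is easy to probe on held-out data: bin the predicted probabilities and compare each bin's mean prediction with the observed frequency of the positive class. A minimal scikit-learn sketch, applicable to any classifier exposing predict_proba (logistic regression here, but a TabPFN estimator drops in unchanged):

```python
from sklearn.calibration import calibration_curve
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

model = LogisticRegression(max_iter=5000).fit(X_tr, y_tr)
proba = model.predict_proba(X_te)[:, 1]

# A well-calibrated model keeps the observed frequency (frac_pos)
# close to the mean predicted probability (mean_pred) in every bin.
frac_pos, mean_pred = calibration_curve(y_te, proba, n_bins=10)
for p, f in zip(mean_pred, frac_pos):
    print(f"predicted {p:.2f} -> observed {f:.2f}")
```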

TFMs Are Robust

The synthetic data used to pretrain TFMs—tens of millions of causal structures—can be carefully designed to make the models highly robust to outliers, missing values, or non-informative features. By exposing the model to such scenarios during training, it learns to recognize and handle them appropriately, as illustrated in Figure 6.

Figure 6: Robustness of TFMs to missing data, non-informative features, and outliers. [Image adapted from [3], licensed under CC BY 4.0]
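As a small illustration of the missing-value case, one can corrupt a dataset and pass it to the estimator without any imputation step; this sketch assumes the tabpfn package, whose documentation reports native handling of NaN entries [3]:

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from tabpfn import TabPFNClassifier

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Knock out roughly 10% of the training cells at random.
rng = np.random.default_rng(0)
X_tr = X_tr.copy()
X_tr[rng.random(X_tr.shape) < 0.1] = np.nan

# No imputation: the NaNs are handed to the model as-is.
clf = TabPFNClassifier().fit(X_tr, y_tr)
print(clf.score(X_te, y_te))
```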

TFMs Require Minimal Hyperparameter Tuning

One final advantage of TFMs is that they require little to no hyperparameter tuning. In fact, they often outperform heavily optimized classical algorithms, as illustrated in Figure 7.

Figure 7: Comparative performance of a TFM versus other algorithms, both in default and fine-tuned settings. [Image adapted from [3], licensed under CC BY 4.0]
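The comparison of Figure 7 can be reproduced in spirit with a few lines: an extensively tuned classical baseline against a TFM left at its defaults. The search space below is illustrative, not the one used in [3]:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from tabpfn import TabPFNClassifier
from xgboost import XGBClassifier

X, y = load_breast_cancer(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# Heavily tuned classical baseline (illustrative hyperparameter grid).
search = RandomizedSearchCV(
    XGBClassifier(),
    param_distributions={
        "n_estimators": [100, 300, 1000],
        "max_depth": [3, 6, 9],
        "learning_rate": [0.01, 0.1, 0.3],
    },
    n_iter=10,
    cv=5,
    random_state=0,
).fit(X_tr, y_tr)

print("tuned XGBoost:", search.score(X_te, y_te))
print("default TFM:  ", TabPFNClassifier().fit(X_tr, y_tr).score(X_te, y_te))
```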

To conclude, it is worth noting that ongoing research on TFMs suggests they also hold promise for improved explainability [3], fairness in prediction [5], and causal inference [6].

Every R&D Team Has Its Own Secret Sauce!

There’s growing consensus that TFMs promise not only incremental improvements, but a fundamental shift within the tools and methods of information science. So far as one can tell, the sphere may progressively shift away from a model-centric paradigm—focused on designing and optimizing predictive models—toward a more  approach. On this recent setting, the role of a knowledge scientist in industry will now not be to construct a predictive model from scratch, but quite to assemble a representative dataset that conditions a pretrained TFM.

Figure 8: A fierce competition is underway between private and public labs to develop high-performing TFMs. [Image by author]

It is also conceivable that new methods for exploratory data analysis will emerge, enabled by the speed at which TFMs can now build predictive models on novel datasets and by their applicability to time series data [9].

These prospects have not gone unnoticed by startups and academic labs alike, which are now competing to develop increasingly powerful TFMs. The two key ingredients in this race—the kind of "secret sauce" behind each approach—are, on the one hand, the strategy used to generate synthetic data, and on the other, the neural network architecture that implements the TFM.

Here are two entry points for discovering and exploring these new tools; a minimal usage sketch follows the list:

  1. TabPFN (Prior Labs)
    A Python library: tabpfn provides scikit-learn–compatible classes (fit/predict). Open access under an Apache 2.0–style license with an attribution requirement.
  2. TabICL (Inria Soda)
    A Python library: tabicl (pretrained on synthetic tabular datasets; supports classification and ICL). Open access under a BSD-3-Clause license.
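Both libraries follow scikit-learn conventions, so a first experiment takes only a few lines (package names as documented above; the dataset choice is arbitrary):

```python
# pip install tabpfn tabicl
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from tabicl import TabICLClassifier
from tabpfn import TabPFNClassifier

X, y = load_iris(return_X_y=True)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

for Model in (TabPFNClassifier, TabICLClassifier):
    clf = Model().fit(X_tr, y_tr)   # stores the ICL context; no gradient training
    print(Model.__name__, clf.score(X_te, y_te))
```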

Happy exploring!

  1. Müller, S., Hollmann, N., Arango, S. P., Grabocka, J., & Hutter, F. (2022). Transformers can do Bayesian inference. ICLR 2022.
  2. Hollmann, N., Müller, S., Eggensperger, K., & Hutter, F. (2022). TabPFN: A transformer that solves small tabular classification problems in a second. NeurIPS 2022.
  3. Hollmann, N., Müller, S., Purucker, L., Krishnakumar, A., Körfer, M., Hoo, S. B., … & Hutter, F. (2025). Accurate predictions on small data with a tabular foundation model. Nature, 637(8045), 319–326.
  4. Qu, J., Holzmüller, D., Varoquaux, G., & Le Morvan, M. (2025). TabICL: A tabular foundation model for in-context learning on large data. ICML 2025.
  5. Robertson, J., Hollmann, N., Awad, N., & Hutter, F. (2024). FairPFN: Transformers can do counterfactual fairness. ICML 2025.
  6. Ma, Y., Frauen, D., Javurek, E., & Feuerriegel, S. (2025). Foundation models for causal inference via prior-data fitted networks. arXiv preprint.
  7. Chen, T., & Guestrin, C. (2016, August). XGBoost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (pp. 785–794).
  8. Grinsztajn, L., Oyallon, E., & Varoquaux, G. (2022). Why do tree-based models still outperform deep learning on typical tabular data? Advances in Neural Information Processing Systems, 35, 507–520.
  9. Liang, Y., Wen, H., Nie, Y., Jiang, Y., Jin, M., Song, D., … & Wen, Q. (2024, August). Foundation models for time series analysis: A tutorial and survey. In Proceedings of the 30th ACM SIGKDD Conference on Knowledge Discovery and Data Mining (pp. 6555–6565).

[1] Gaël Varoquaux is one of the original architects of the scikit-learn API. He is also co-founder and scientific advisor at the startup Probabl.
