Are Foundation Models Ready for Your Production Tabular Data?


Foundation models are large-scale AI models trained on a vast and diverse range of data, such as audio, text, images, or a mix of them. Because of this versatility, foundation models are revolutionizing Natural Language Processing, Computer Vision, and even Time Series. Unlike traditional AI algorithms, foundation models offer out-of-the-box predictions without the need to train from scratch for each specific application. They can also be adapted to more specific tasks through fine-tuning.

In recent years, we have seen an explosion of foundation models applied to unstructured data and time series. These include OpenAI’s GPT series and BERT for text tasks, CLIP and SAM for object detection, classification, and segmentation, and PatchTST, Lag-Llama, and Moirai-MoE for Time Series forecasting. Despite this growth, foundation models for tabular data remain largely unexplored due to several challenges. First, tabular datasets are heterogeneous by nature. They vary in feature types (Boolean, categorical, integer, float) and in the scales of their numerical features. Tabular data also suffer from missing information, redundant features, outliers, and imbalanced classes. Another challenge in building foundation models for tabular data is the scarcity of high-quality, open data sources. Public datasets are often small and noisy. Take, for example, the tabular benchmarking website openml.org: 76% of its datasets contain fewer than 10 thousand rows [2].

Despite these challenges, several foundation models for tabular data have been developed. In this post, I review most of them, highlighting their architectures and limitations. Some questions I want to answer are: What’s the current status of foundation models for tabular data? Can they be applied in production, or are they only good for prototyping? Are foundation models better than classic Machine Learning algorithms like Gradient Boosting? In a world where tabular data represents most of the data in companies, knowing which foundation models are being implemented and what they can currently do is of great interest to the data science community.

TabPFN

Let’s start by introducing the most well-known foundation model for small-to-medium-sized tabular data: TabPFN. This algorithm was developed by Prior Labs. The first version was released in 2022 [1], and updates to its architecture followed in January 2025 [2].

TabPFN is a Prior-Data Fitted Network, which means it uses Bayesian inference to make predictions. There are two important concepts in Bayesian inference: the prior and the posterior. The prior is a probability distribution reflecting our beliefs or assumptions about parameters before observing any data. For example, the probability of getting a 6 with a fair die is 1/6. The posterior is the updated belief or probability distribution after observing data. It combines your initial assumptions (the prior) with the new evidence. For instance, after many rolls you may find that the probability of getting a 6 is actually not 1/6, because the die is biased.
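To make the prior/posterior idea concrete, here is a minimal sketch of a Bayesian update for the die example. It assumes a Dirichlet prior over the six faces, which is a standard textbook choice and not something specific to TabPFN; the function name and the roll counts are made up for illustration.

```python
from fractions import Fraction

def posterior_mean(prior_counts, observed_counts):
    """Posterior mean of each face probability under a Dirichlet prior.

    With a Dirichlet prior, updating amounts to adding the observed counts
    to the prior pseudo-counts and normalizing.
    """
    updated = [p + o for p, o in zip(prior_counts, observed_counts)]
    total = sum(updated)
    return [Fraction(u, total) for u in updated]

# Prior belief: a fair die, encoded as one pseudo-count per face (1/6 each).
prior = [1, 1, 1, 1, 1, 1]
prior_mean = posterior_mean(prior, [0] * 6)

# Hypothetical evidence: 60 rolls in which the 6 shows up 30 times.
rolls = [6, 6, 6, 6, 6, 30]
post_mean = posterior_mean(prior, rolls)

print("P(6) before data:", prior_mean[5])
print("P(6) after data: ", post_mean[5])
```

The more rolls you observe, the less the prior pseudo-counts matter and the more the posterior is driven by the data.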

In TabPFN, the prior is defined by 100 million synthetic datasets that were carefully designed to capture a wide range of potential scenarios that the model might encounter. These datasets contain a wide variety of relationships between features and targets (you’ll find more details in [2]).

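The posterior is the predictive distribution p(y_test | x_test, D_train), that is, the distribution of the target of a test row given its features and the training data. It is approximated by training the TabPFN model’s architecture on the synthetic datasets.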

Model architecture

The TabPFN architecture is shown in the following figure:

TabPFN model’s architecture. Image taken from the original paper [2].

The left side of the diagram shows a typical tabular dataset. It’s composed of a few training rows with input features (x1, x2) and their corresponding target values (y). It also includes a single test row, which has input features but a missing target value. The network’s goal is to predict the target value for this test row.

The TabPFN architecture consists of a series of 12 identical layers. Each layer contains two attention mechanisms. The first is a 1D feature attention, which learns the relationships between the features of the dataset. It essentially allows the model to “attend” to the most relevant features for a given prediction. The second attention mechanism is the 1D sample attention. This module looks at the same feature across all other samples. Sample attention is the key mechanism that enables In-Context Learning (ICL), where the model learns from the provided training data without needing any backpropagation. These two attention mechanisms make the architecture invariant to the order of both samples and features.
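The two attention directions can be illustrated with a deliberately tiny sketch. It is not TabPFN’s implementation: each cell is a single scalar, there are no learned weights, and the function names are invented for this example. It only shows what “attending across features” versus “attending across samples” means, and why reordering the rows merely reorders the outputs.

```python
import math

def attend(values):
    """Unparameterized self-attention over a list of scalars.

    Each item attends to all items with weights softmax(q * k), where
    query, key, and value are the raw scalar itself (no learned weights).
    """
    out = []
    for q in values:
        scores = [q * k for k in values]
        top = max(scores)
        weights = [math.exp(s - top) for s in scores]
        norm = sum(weights)
        out.append(sum(w * v for w, v in zip(weights, values)) / norm)
    return out

def feature_attention(table):
    """Mix information across the columns of each row."""
    return [attend(row) for row in table]

def sample_attention(table):
    """Mix information across the rows of each column."""
    columns = [attend(list(col)) for col in zip(*table)]
    return [list(row) for row in zip(*columns)]

def layer(table):
    return sample_attention(feature_attention(table))

table = [[0.1, 0.5], [0.3, -0.2], [0.9, 0.4]]  # 3 samples, 2 features
shuffled = [table[2], table[0], table[1]]       # same rows, new order

out = layer(table)
out_shuffled = layer(shuffled)
```

Because every cell attends to a set of other cells and not to positions, `out_shuffled` contains the same rows as `out`, just in the shuffled order.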

The output of the 12 layers is a vector that’s fed into a Multilayer Perceptron (MLP). The MLP is a small neural network that transforms the vector into a final prediction. For a classification task, the final prediction is not a class label. Instead, the MLP outputs a vector of probabilities, where each value represents the model’s confidence that the input belongs to a specific class. For instance, for a three-class problem, the output could be something like [0.1, 0.8, 0.1]. This means the model is confident that the input belongs to the second class.
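As a small illustration of how raw network outputs become class probabilities, the sketch below applies a softmax to three made-up scores. Softmax is the usual choice for this step in neural classifiers; treating it as the final step here is an assumption for illustration, and the scores are invented.

```python
import math

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    shift = max(logits)  # subtract the max for numerical stability
    exps = [math.exp(v - shift) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

logits = [0.2, 2.5, -0.3]  # made-up scores for a three-class problem
probs = softmax(logits)
predicted_class = probs.index(max(probs))  # 0-based index of the top class

print([round(p, 3) for p in probs], "-> class index", predicted_class)
```

Keeping the full probability vector, and not only the top class, lets you apply your own decision threshold downstream.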

For regression tasks, the MLP’s output layer is modified to produce a continuous value instead of a probability distribution over discrete classes.

Usage

Using TabPFN is quite easy! You can install it via pip or from source. Prior Labs provides great documentation that links to the various GitHub repositories, where you’ll find Colab notebooks to explore the algorithm right away. The Python API follows the Scikit-learn convention of fit/predict functions.

The fit function in TabPFN doesn’t train the model as in the classical Machine Learning approach. Instead, it uses the training dataset as context, because TabPFN leverages ICL. In this approach, the model combines its pre-trained knowledge with the training samples to recognize patterns and generate better predictions. No weights are updated: the training data only guides the model’s behavior at prediction time.
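The difference between “fit as training” and “fit as context” can be sketched with a toy estimator. This is a stand-in written for this post, not TabPFN: the class name is hypothetical, and a nearest-neighbour lookup replaces the pre-trained transformer. What it shares with the description above is the shape of the API: fit only stores the data, and all the work happens in predict.

```python
class ToyInContextClassifier:
    """Toy stand-in (not TabPFN): fit() only stores the context rows."""

    def fit(self, X, y):
        # No weights are updated here; the data is kept as context.
        self.context_ = list(zip(X, y))
        return self

    def predict(self, X):
        # Each query is answered by consulting the stored context:
        # here, simply the label of the closest context row.
        preds = []
        for query in X:
            nearest = min(
                self.context_,
                key=lambda pair: sum((a - b) ** 2 for a, b in zip(pair[0], query)),
            )
            preds.append(nearest[1])
        return preds

X_train = [[0.0, 0.1], [0.2, 0.0], [1.0, 0.9], [0.9, 1.1]]
y_train = ["low", "low", "high", "high"]

clf = ToyInContextClassifier().fit(X_train, y_train)
preds = clf.predict([[0.1, 0.1], [1.0, 1.0]])
print(preds)
```

One practical consequence of this design is that the cost moves from fit to predict, since the context has to be processed for every batch of queries.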

TabPFN has a great ecosystem that includes utilities for interpreting your model through SHAP. It also offers tools for outlier detection and for generating synthetic tabular data. You can even combine TabPFN with traditional models like Random Forest in hybrid approaches to improve predictions. All these functionalities can be found in the TabPFN GitHub repository.

Remarks and limitations

After testing TabPFN on a large private dataset containing both numerical and categorical features, here are some takeaways:

  • Make sure you preprocess the data first. Categorical columns must have all elements as strings; otherwise, the code raises an error.
  • TabPFN is a great tool for small- to medium-sized datasets, but not for large tables. If you work with big datasets (i.e., more than 10,000 rows, over 500 features, or more than 10 classes), you’ll hit the pre-training limits, and the prediction performance will be affected.
  • Be aware that you may encounter CUDA errors that are difficult to debug.
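The first two takeaways can be turned into a small pre-flight check. The sketch below is plain Python with hypothetical helper names; it casts categorical values to strings and reports which of the limits quoted above a dataset would exceed. Leaving missing values (None) untouched is my own assumption, not documented TabPFN behavior.

```python
def prepare_for_tabpfn(rows, categorical_columns):
    """Cast categorical values to strings (hypothetical helper)."""
    prepared = []
    for row in rows:
        new_row = {}
        for name, value in row.items():
            if name in categorical_columns and value is not None:
                new_row[name] = str(value)
            else:
                new_row[name] = value  # numeric or missing: left as-is
        prepared.append(new_row)
    return prepared

def exceeded_limits(n_rows, n_features, n_classes):
    """Return the limits quoted above that a dataset would exceed."""
    limits = {"rows": 10_000, "features": 500, "classes": 10}
    sizes = {"rows": n_rows, "features": n_features, "classes": n_classes}
    return [name for name, cap in limits.items() if sizes[name] > cap]

rows = [
    {"age": 31, "plan": "basic", "region": 7},
    {"age": 45, "plan": 2, "region": 12},  # mixed types in "plan"
]
clean = prepare_for_tabpfn(rows, categorical_columns={"plan", "region"})
problems = exceeded_limits(n_rows=25_000, n_features=120, n_classes=3)
print(clean, problems)
```

Running a check like this before calling the model makes it obvious whether a poor result comes from the data or from exceeding the pre-training limits.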

If you are interested in seeing how TabPFN performs on different datasets compared to classical boosted methods, I highly recommend this excellent post from Bahadir Akdemir:

TabPFN: How a Pretrained Transformer Outperforms Traditional Models on Tabular Data (Medium blog post)

CARTE

The second foundation model for tabular data leverages graph structures to create an interesting model architecture: I’m talking about Context-Aware Representation of Table Entries, or the CARTE model [3].

Unlike images, where an object has specific features regardless of where it appears in the picture, numbers in tabular data have no meaning unless context is added through their respective column names. One way to account for both the numbers and their column names is to use a graph representation of the corresponding table. The SODA team used this idea to develop CARTE.

CARTE transforms a table into a graph structure by converting each row into a graphlet. A row in a dataset is represented as a small, star-like graph where each row value becomes a node connected to a center node. The column names serve as the edges of the graph.

Graph representation of a tabular dataset. The center node is initially set as the average of the other nodes. The center node acts as an element that captures the overall information of the graph. Image sourced from the original paper [3].
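The row-to-graphlet conversion described above can be sketched with a small data structure. This is an illustration written for this post, not code from the CARTE repository: the class and function names are hypothetical, and skipping missing cells is my assumption about how such a representation could handle them. The embedding step and the averaging of the center node are left out.

```python
from dataclasses import dataclass, field

@dataclass
class Graphlet:
    """Star graph for one table row (illustrative structure, not CARTE's code)."""
    center: str = "row"
    nodes: list = field(default_factory=list)  # one node per cell value
    edges: list = field(default_factory=list)  # (center, column_name, node_index)

def row_to_graphlet(row):
    graphlet = Graphlet()
    for column, value in row.items():
        if value is None:
            continue  # assumption: a missing cell simply produces no node
        graphlet.nodes.append(value)
        graphlet.edges.append((graphlet.center, column, len(graphlet.nodes) - 1))
    return graphlet

row = {"city": "London", "country": "UK", "population": 8_800_000, "mayor": None}
g = row_to_graphlet(row)
print(g.edges)
```

Each edge carries the column name, which is how the context of a value travels with it through the graph.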

For categorical row values and column names, CARTE uses a d-dimensional embedding generated from a language model. This way, prior data preprocessing, such as categorical encoding of the original table, is not needed.

Model architecture

Each of the created graphlets contains node and edge features. These features are passed to a graph-attentional network that adapts the classical Transformer encoder architecture. A key component of this graph-attentional network is its self-attention layer, which computes attention from both the node and edge features. This allows the model to understand the context of every data entry.

CARTE model’s architecture. Image taken from the original paper [3].

The model architecture also includes an Aggregate & Readout layer that acts on the center node. The outputs are processed for the contrastive loss.

CARTE was pretrained on a large knowledge base called YAGO3 [4]. This knowledge base was built from sources like Wikidata and contains over 18.1 million triplets of 6.3 million entities.

Usage

The GitHub repository for CARTE is under active development. It contains a Colab Notebook with examples of how to use this model for regression and classification tasks. According to this notebook, the installation is quite straightforward, just a pip install. Like TabPFN, CARTE uses the Scikit-learn interface (fit-predict) to make predictions on unseen data.

Limitations

According to the CARTE paper [3], this algorithm has some major advantages, such as being robust to missing values. Moreover, entity matching is not required when using CARTE. Because it uses a language model to embed strings and column names, this algorithm can handle entities that are written differently, for example, “Londres” instead of “London”.

While CARTE performs well on small tables (fewer than 2,000 samples), tree-based models can be more effective on larger datasets. Moreover, for large datasets, CARTE can be computationally more intensive than traditional Machine Learning models.

For more details on the experiments conducted by the developers of this foundation model, here’s a great blog post written by Gaël Varoquaux:

CARTE: toward table foundation models

TabuLa-8b

The third foundation model we’ll review was built by fine-tuning the Llama 3-8B language model. According to the authors of TabuLa-8b, language models can be trained to perform tabular prediction tasks by serializing rows as text, converting the text to tokens, and then using the same loss function and optimization methods as in language modeling [5].

Text serialization. TabuLa-8b is trained to provide the tokens following the <|endinput|> token. Image taken from [5].
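To give a feel for what serializing a row as text means, here is a minimal sketch. The sentence template and the function names are hypothetical and chosen for readability; only the <|endinput|> marker comes from the caption above. The exact format used by TabuLa-8b is defined in the paper and its code.

```python
END_OF_INPUT = "<|endinput|>"  # token named in the figure caption above

def serialize_row(features, target_name, choices):
    """Turn one row into a text prompt (hypothetical template)."""
    facts = " ".join(f"The {name} is {value}." for name, value in features.items())
    options = " || ".join(str(c) for c in choices)
    return (
        f"Predict the {target_name}: {options}. {facts} "
        f"What is the {target_name}? {END_OF_INPUT}"
    )

def serialize_example(features, target_name, choices, label):
    """Training text: the prompt followed by the label the model must produce."""
    return serialize_row(features, target_name, choices) + str(label)

row = {"age": 42, "occupation": "teacher", "hours_per_week": 38}
prompt = serialize_row(row, "income", ["<=50K", ">50K"])
example = serialize_example(row, "income", ["<=50K", ">50K"], "<=50K")
print(example)
```

Note how the column names become part of the text, so every extra feature or long column name consumes tokens from the model’s context window.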

TabuLa-8b’s architecture features an efficient attention masking scheme called Row-Causal Tabular Masking (RCTM). This masking allows the model to attend to all previous rows from the same table in a batch, but not to rows from other tables. This structure encourages the model to learn from a small number of examples within a table, which is crucial for few-shot learning. For detailed information on the methodology and results, check out the original paper from Josh Gardner et al. [5].
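The masking rule can be written down in a few lines. The sketch below is a simplification that works at the row level, whereas a real implementation operates on tokens; the function name is hypothetical. It only encodes the rule stated above: a row may attend to itself and to earlier rows of the same table, and never to rows of another table packed into the same sequence.

```python
def row_causal_mask(table_ids):
    """mask[i][j] is True when row i may attend to row j.

    Row i sees row j only if both come from the same table and j does not
    come after i in the packed sequence.
    """
    n = len(table_ids)
    return [
        [table_ids[i] == table_ids[j] and j <= i for j in range(n)]
        for i in range(n)
    ]

# Rows packed into one sequence: three from table "A", then two from table "B".
table_ids = ["A", "A", "A", "B", "B"]
mask = row_causal_mask(table_ids)

for row in mask:
    print("".join("x" if allowed else "." for allowed in row))
```

The result is a block-diagonal, lower-triangular pattern: one causal block per table, with nothing shared across blocks.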

Usage and limitations

The GitHub repository rtfm contains the code of TabuLa-8b. In its Notebooks folder, you’ll find an example of how to run inference. Note that, unlike TabPFN or CARTE, TabuLa-8b doesn’t have a Scikit-learn interface. If you want to make zero-shot predictions or further fine-tune the existing model, you need to run the Python scripts developed by the authors.

According to the original paper, TabuLa-8b performs well in zero-shot prediction tasks. However, using this model on large tables, with either many samples or many features and long column names, can be limiting, as this information can quickly exceed the LLM’s context window (the Llama 3-8B model has a context window of about 8,000 tokens).

TabDPT

The last foundation model we’ll cover in this blog is the Tabular Discriminative Pre-trained Transformer, or TabDPT for short. Like TabPFN, TabDPT relies on ICL, which it combines with self-supervised learning to create a powerful foundation model for tabular data. Unlike TabPFN, TabDPT is trained on real-world data (the authors used 123 public tabular datasets from OpenML). According to the authors, the model can generalize to new tasks without additional training or hyperparameter tuning.

Model architecture

TabDPT uses a row-based transformer encoder similar to TabPFN, where each row serves as a token. To handle the varying number of features across the training datasets (F), the authors standardized the feature dimension to F_max via padding (F < F_max) or dimensionality reduction (F > F_max).
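Bringing every table to the same maximum feature dimension can be sketched as follows. The padding branch mirrors the description above. For tables that are too wide, the sketch simply truncates, which is only a placeholder for a real dimensionality-reduction step; the function name, the pad value, and the width of 4 are all illustrative assumptions.

```python
def fit_to_width(row, f_max, pad_value=0.0):
    """Return a copy of `row` with exactly `f_max` features.

    Narrow rows are padded. Wide rows are truncated here purely as a
    placeholder for a real dimensionality-reduction step.
    """
    if len(row) <= f_max:
        return list(row) + [pad_value] * (f_max - len(row))
    return list(row[:f_max])

F_MAX = 4  # illustrative value, not the model's actual limit
narrow = fit_to_width([0.5, 1.2], F_MAX)
wide = fit_to_width([0.1, 0.2, 0.3, 0.4, 0.5, 0.6], F_MAX)
print(narrow, wide)
```

A fixed width is what lets a single set of model weights process tables with very different numbers of columns.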

This foundation model leverages self-supervised learning, essentially learning on its own without needing a labeled target for every task. During training, it randomly picks one column in a table to be the target and then learns to predict its values based on the other columns. This process helps the model understand the relationships between different features. When training on a large dataset, the model doesn’t use the whole table at once. Instead, it finds and uses only the most similar rows (called the “context”) to predict a single row (the “query”). This makes the training process faster and more effective.
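The two ideas in this paragraph, picking a random column as the target and retrieving similar rows as context, can be sketched in plain Python. The function names are hypothetical, and squared Euclidean distance is my assumption for “most similar”; the paper’s actual retrieval setup may differ.

```python
import random

def make_task(table, rng):
    """Pick a random column as the target; the rest become features."""
    target_col = rng.randrange(len(table[0]))
    X = [[v for j, v in enumerate(row) if j != target_col] for row in table]
    y = [row[target_col] for row in table]
    return X, y, target_col

def retrieve_context(X, y, query_index, k):
    """Return the k rows closest to the query (excluding the query itself)."""
    query = X[query_index]
    others = [i for i in range(len(X)) if i != query_index]
    others.sort(key=lambda i: sum((a - b) ** 2 for a, b in zip(X[i], query)))
    chosen = others[:k]
    return [X[i] for i in chosen], [y[i] for i in chosen]

table = [
    [1.0, 2.0, 0.0],
    [1.1, 2.1, 0.0],
    [5.0, 6.0, 1.0],
    [5.2, 5.9, 1.0],
    [0.9, 1.8, 0.0],
]
rng = random.Random(0)
X, y, target_col = make_task(table, rng)
X_ctx, y_ctx = retrieve_context(X, y, query_index=0, k=2)
print(target_col, X_ctx, y_ctx)
```

Whichever column ends up as the target in this toy table, the context retrieved for the first row consists of the two rows that resemble it, which is the point of retrieval: the model sees a short, relevant context in place of the full table.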

TabDPT’s architecture is shown in the following figure:

TabDPT architecture. Image taken from the original paper [6].

The figure illustrates how the training of this foundation model was carried out. First, the authors sampled tables from different datasets to construct a set of features (X) and a set of targets (y). Both X and y are partitioned into context (X_ctx, y_ctx) and query (X_qy, y_qy). The query X_qy is the input that is passed through the embedding functions (which are indicated by a rectangle or a triangle). The model also creates embeddings for X_ctx and y_ctx. These context embeddings are summed together and concatenated with the embedding of X_qy. They’re then passed through a transformer encoder to get a classification ŷ_cls or regression ŷ_reg for the query. The loss between the prediction and the true targets is used to update the model weights.

Usage and limitations

There is a GitHub repository that provides code to generate predictions on new tabular datasets. Like TabPFN or CARTE, TabDPT uses an API similar to Scikit-learn to make predictions on unseen data, where the fit function uses the training data to leverage ICL. The code of this model is currently under active development.

While the paper doesn’t have a dedicated limitations section, the authors mention a few constraints and how they’re handled:

  • The model has a predefined maximum number of features and classes. The authors suggest using Principal Component Analysis (PCA) to reduce the number of features if a table exceeds the limit.
  • For classification tasks with more classes than the model’s limit, the problem can be broken down into multiple sub-tasks by representing the class number in a different base.
  • The retrieval process can add some latency during inference, although the authors note that this can be minimized with modern libraries.
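The “different base” trick from the second bullet is easiest to see with numbers. In the sketch below, a class index is written as digits in a base no larger than the class limit, each digit position becomes its own smaller classification sub-task, and the per-digit predictions are recombined afterwards. The base of 10, the two digits, and the function names are illustrative choices, not values from the paper.

```python
def to_digits(label, base, n_digits):
    """Write a class index as fixed-length digits in the given base."""
    digits = []
    for _ in range(n_digits):
        digits.append(label % base)
        label //= base
    return digits[::-1]  # most significant digit first

def from_digits(digits, base):
    """Recombine per-digit predictions into the original class index."""
    label = 0
    for d in digits:
        label = label * base + d
    return label

BASE = 10     # stand-in for the model's class limit
N_DIGITS = 2  # enough for up to BASE ** N_DIGITS = 100 classes

digits = to_digits(57, BASE, N_DIGITS)
restored = from_digits(digits, BASE)
print(digits, restored)
```

With this encoding, a 100-class problem becomes two 10-class problems, at the price of running the model once per digit.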

Take-home messages

In this blog, I have summarized foundation models for tabular data. Most of them were released in 2024, and all are under active development. Despite being quite recent, some of these models already have good documentation and are easy to use. For example, you can install TabPFN, CARTE, or TabDPT through pip. Moreover, these models share the same API as Scikit-learn, which makes them easy to integrate into existing Machine Learning applications.

According to the authors of the foundation models presented here, these algorithms outperform classical boosting methods such as XGBoost or CatBoost. However, foundation models still can’t be used on large tabular datasets, which limits their use, especially in production environments. For this reason, the classical approach of training a Machine Learning model per dataset is still the way to go when building predictive models from tabular data.

Great strides have been made toward a foundation model for tabular data. Let’s see what the future holds for this exciting area of research!

Thanks for reading!

References

[1] N. Hollmann et al., TabPFN: A transformer that solves small tabular classification problems in a second (2023), Table Representation Learning Workshop.

[2] N. Hollmann et al., Accurate predictions on small data with a tabular foundation model (2025), Nature.

[3] M.J. Kim, L. Grinsztajn, and G. Varoquaux. CARTE: Pretraining and Transfer for Tabular Learning (2024), Proceedings of the 41st International Conference on Machine Learning, Vienna, Austria.

[4] F. Mahdisoltani, J. Biega, and F.M. Suchanek. Yago3: A knowledge base from multilingual wikipedias (2013), in CIDR.

[5] J. Gardner, J.C. Perdomo, L. Schmidt. Large Scale Transfer Learning for Tabular Data via Language Modeling (2025), NeurIPS.

[6] J. Ma et al. TabDPT: Scaling Tabular Foundation Models on Real Data (2024), arXiv preprint, arXiv:2410.18164.
