I first came across TabPFN through the ICLR 2023 paper that introduced it: an open-source transformer model built specifically for tabular datasets, an area that has not really benefited from deep learning and where gradient boosted decision tree models still dominate.
At the time, TabPFN supported only up to 1,000 training samples and 100 purely numerical features, so its use in real-world settings was fairly limited. Since then, however, there have been several incremental improvements, including TabPFN-2, which was introduced in 2025 through a paper published in Nature.
More recently, TabPFN-2.5 was released, and this version can handle close to 100,000 data points and around 2,000 features, which makes it fairly practical for real-world prediction tasks. I have spent much of my professional career working with tabular datasets, so this naturally caught my interest and pushed me to look deeper. In this article, I give a high-level overview of TabPFN and also walk through a quick implementation using a Kaggle competition to help you get started.
What is TabPFN?
TabPFN stands for Tabular Prior-data Fitted Network, a foundation model built on the idea of fitting a model to a prior over tabular datasets, rather than to a single dataset, hence the name.
As I read through the technical reports, there were plenty of interesting bits and pieces to these models. For example, TabPFN can deliver strong tabular predictions with very low latency, often comparable to tuned ensemble methods, but without repeated training loops.
From a workflow perspective, there is also hardly any learning curve because it fits naturally into existing setups through a scikit-learn style interface. It can handle missing values, outliers, and mixed feature types with minimal preprocessing, which we will cover in the implementation later in this article.
The need for a foundation model for tabular data
Before moving into how TabPFN works, let's first try to understand the broader problem it tries to address.
With traditional machine learning on tabular datasets, you normally train a new model for each new dataset. This often involves long training cycles, and it also means that a previously trained model cannot really be reused.
However, if we look at foundation models for text and images, the idea is radically different. Instead of retraining from scratch, a large amount of pre-training is done upfront across many datasets, and the resulting model can then be applied to new datasets, in most cases without retraining.
This, in my opinion, is the gap the model is trying to close for tabular data, i.e. reducing the need to train a new model from scratch for each dataset, and it seems like a promising area of research.
TabPFN training and inference pipeline at a high level

TabPFN uses in-context learning to fit a neural network to a prior over tabular datasets. What this means is that instead of learning one task at a time, the model learns how tabular problems tend to look in general and then uses that knowledge to make predictions on new datasets through a single forward pass.
The pipeline, as described in TabPFN's Nature paper, can be divided into three major steps:
1. Generating Synthetic Datasets
TabPFN treats an entire dataset as a single data point (or token) fed into the network. This means it requires exposure to a very large number of datasets during training. For this reason, training TabPFN starts with synthetic tabular datasets. Why synthetic? Unlike text or images, there are not many large and diverse real-world tabular datasets available, which makes synthetic data a key part of the setup. To put it into perspective, TabPFN 2 was trained on 130 million datasets.
The process of generating synthetic datasets is interesting in itself. TabPFN uses a highly parametric structural causal model to create tabular datasets with varied structures, feature relationships, noise levels, and target functions. By sampling from this model, a large and diverse set of datasets can be generated, each acting as a training signal for the network. This encourages the model to learn general patterns across many kinds of tabular problems, rather than overfitting to any single dataset.
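To make this concrete, here is a minimal toy sketch of the idea: latent causes are sampled first, and the features and target are then derived from them as noisy random functions. This is only my own illustration of the concept, not the far more elaborate generator actually used to train TabPFN.
import numpy as np
rng = np.random.default_rng(0)
def sample_synthetic_dataset(n_rows=256, n_latent=3, n_features=6):
    # Latent root causes of the data-generating process
    z = rng.normal(size=(n_rows, n_latent))
    # Each feature is a noisy random function of the latent causes
    w_x = rng.normal(size=(n_latent, n_features))
    X = np.tanh(z @ w_x) + 0.1 * rng.normal(size=(n_rows, n_features))
    # The target depends on the same latent causes and is then discretised
    w_y = rng.normal(size=n_latent)
    y = (z @ w_y + 0.1 * rng.normal(size=n_rows) > 0).astype(int)
    return X, y
# Each call produces a different "dataset", i.e. one training example for the network
X_syn, y_syn = sample_synthetic_dataset()
print(X_syn.shape, y_syn.mean())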
2. Training
The figure below, taken from the Nature paper mentioned above, clearly demonstrates the training and inference process.

During training, a synthetic tabular dataset is sampled and split into X_train, y_train, X_test, and y_test. The y_test values are held out, and the remaining parts are passed to the neural network, which outputs a probability distribution for each y_test data point, as shown in the left part of the figure.
The held-out y_test values are then evaluated under these predicted distributions. A cross-entropy loss is computed, and the network is updated to minimize this loss. This completes one backpropagation step for a single dataset, and the process is then repeated for millions of synthetic datasets.
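Conceptually, one such training step looks like the sketch below. The ToyPFN network here is a made-up stand-in purely for illustration (the real model is a much larger transformer); the point being demonstrated is the loop structure, where each step samples a fresh synthetic dataset and computes a cross-entropy loss on the held-out labels.
import torch
import torch.nn as nn
class ToyPFN(nn.Module):
    # Stand-in network: pools the (x, y) training pairs into a context vector
    # and predicts class logits for each test row conditioned on that context.
    def __init__(self, n_features=5, n_classes=2, hidden=32):
        super().__init__()
        self.encode_ctx = nn.Linear(n_features + 1, hidden)
        self.head = nn.Sequential(
            nn.Linear(n_features + hidden, hidden), nn.ReLU(),
            nn.Linear(hidden, n_classes),
        )
    def forward(self, x_train, y_train, x_test):
        ctx = self.encode_ctx(torch.cat([x_train, y_train[:, None]], dim=1)).mean(0)
        ctx = ctx.expand(len(x_test), -1)
        return self.head(torch.cat([x_test, ctx], dim=1))
model, loss_fn = ToyPFN(), nn.CrossEntropyLoss()
opt = torch.optim.Adam(model.parameters(), lr=1e-3)
for step in range(1000):                      # one synthetic dataset per step
    X = torch.randn(64, 5)                    # toy synthetic dataset
    y = (X[:, 0] + X[:, 1] > 0).long()        # toy target function
    x_tr, y_tr, x_te, y_te = X[:48], y[:48], X[48:], y[48:]
    logits = model(x_tr, y_tr.float(), x_te)  # single forward pass
    loss = loss_fn(logits, y_te)              # held-out labels under the predictions
    opt.zero_grad(); loss.backward(); opt.step()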
3. Inference
At test time, the trained TabPFN model is applied to a real dataset. This corresponds to the figure on the right, where the model is used for inference. As you can see, the interface stays the same as during training: you provide X_train, y_train, and X_test, and the model outputs predictions for y_test through a single forward pass.
Most importantly, there is no retraining at test time; TabPFN performs what is effectively zero-shot inference, producing predictions immediately without updating its weights.
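As a quick illustration of that interface, the sketch below runs the classifier on a small synthetic dataset (assuming the tabpfn package is installed); the Kaggle example later in this article does the same thing on real data.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from tabpfn import TabPFNClassifier
# A small synthetic dataset standing in for "a real dataset at test time"
X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
clf = TabPFNClassifier()               # no task-specific training loop
clf.fit(X_train, y_train)              # stores the context; pre-trained weights stay frozen
print(clf.predict_proba(X_test)[:3])   # predictions come from a single forward pass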
Architecture

Let's also touch upon the core architecture of the model as described in the paper. At a high level, TabPFN adapts the transformer architecture to better suit tabular data. Instead of flattening a table into a long sequence, the model treats each value in the table as its own unit. It uses a two-stage attention mechanism in which it first learns how features relate to one another within a single row and then learns how the same feature behaves across different rows.
This way of structuring attention matters because it matches how tabular data is actually organized. It also means the model does not care about the order of rows or columns, which allows it to handle tables that are larger than those it was trained on.
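As a rough sketch of this two-stage idea, the snippet below alternates standard attention layers over the two axes of a (rows, features, d_model) tensor of cell embeddings. This is my own simplified illustration, not the actual TabPFN block, but the alternation between attention across features and attention across rows is the key point.
import torch
import torch.nn as nn
class TwoStageAttention(nn.Module):
    def __init__(self, d_model=32, n_heads=4):
        super().__init__()
        self.feature_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
        self.row_attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)
    def forward(self, cells):                 # cells: (rows, features, d_model)
        # Stage 1: each cell attends to the other features of the same row
        h, _ = self.feature_attn(cells, cells, cells)
        # Stage 2: each cell attends to the same feature across the other rows
        h = h.transpose(0, 1)                 # (features, rows, d_model)
        h, _ = self.row_attn(h, h, h)
        return h.transpose(0, 1)              # back to (rows, features, d_model)
block = TwoStageAttention()
out = block(torch.randn(100, 8, 32))          # 100 rows, 8 features
print(out.shape)                              # torch.Size([100, 8, 32])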
Implementation
Let's now walk through an implementation of TabPFN-2.5 and compare it against a vanilla XGBoost classifier to provide a familiar point of reference. While the model weights can be downloaded from Hugging Face, using Kaggle Notebooks is more straightforward since the model is readily available there and GPU support comes out of the box for faster inference. In either case, you need to accept the model terms before using it. After adding the TabPFN model to the Kaggle notebook environment, run the following cell to point the library at the pre-downloaded weights.
# Point TabPFN to the pre-downloaded model weights in the Kaggle environment
import os
os.environ["TABPFN_MODEL_CACHE_DIR"] = "/kaggle/input/tabpfn-2-5/pytorch/default/2"
You can find the complete code in the accompanying Kaggle notebook here.
Installation
You can access TabPFN in two ways: either as a Python package that runs locally, or as an API client that runs the model in the cloud:
# Python package
pip install tabpfn
# As an API client
pip install tabpfn-client
Dataset: Kaggle Playground competition dataset
To get a better sense of how TabPFN performs in a real-world setting, I tested it on a Kaggle Playground competition that concluded a few months ago. The task (MIT license) requires predicting the probability of rainfall for each id in the test set. Evaluation is done using ROC-AUC, which makes this a good fit for probability-based models like TabPFN. The training data looks like this:

Training a TabPFN Classifier
Training a TabPFN classifier is simple and follows a familiar scikit-learn style interface. While there is no task-specific training in the traditional sense, it is still important to enable GPU support, otherwise inference can be noticeably slower. The following code snippet walks through preparing the data, training a TabPFN classifier, and evaluating its performance using the ROC-AUC score.
# Importing needed libraries
from tabpfn import TabPFNClassifier
import pandas as pd, numpy as np
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score
# Note: `train` is the competition training DataFrame loaded earlier in the notebook
# Select feature columns
FEATURES = [c for c in train.columns if c not in ["rainfall", "id"]]
X = train[FEATURES].copy()
y = train["rainfall"].copy()
# Split data into train and validation sets
train_index, valid_index = train_test_split(
    train.index,
    test_size=0.2,
    random_state=42
)
x_train = X.loc[train_index].copy()
y_train = y.loc[train_index].copy()
x_valid = X.loc[valid_index].copy()
y_valid = y.loc[valid_index].copy()
# Initialize and train TabPFN
model_pfn = TabPFNClassifier(device=["cuda:0", "cuda:1"])
model_pfn.fit(x_train, y_train)
# Predict class probabilities
probs_pfn = model_pfn.predict_proba(x_valid)
# Use probability of the positive class
pos_probs = probs_pfn[:, 1]
# Evaluate using ROC AUC
print(f"ROC AUC: {roc_auc_score(y_valid, pos_probs):.4f}")
-------------------------------------------------
ROC AUC: 0.8722
Next, let's train a basic XGBoost classifier.
Training an XGBoost Classifier
from xgboost import XGBClassifier
# Initialize XGBoost classifier
model_xgb = XGBClassifier(
    objective="binary:logistic",
    tree_method="hist",
    device="cuda",
    enable_categorical=True,
    random_state=42,
    n_jobs=1
)
# Train the model
model_xgb.fit(x_train, y_train)
# Predict class probabilities
probs_xgb = model_xgb.predict_proba(x_valid)
# Use probability of the positive class
pos_probs_xgb = probs_xgb[:, 1]
# Evaluate using ROC AUC
print(f"ROC AUC: {roc_auc_score(y_valid, pos_probs_xgb):.4f}")
------------------------------------------------------------
ROC AUC: 0.8515
As you can see, TabPFN performs quite well out of the box. While XGBoost can definitely be tuned further, my intent here is to compare basic, vanilla implementations rather than optimised models. This placed me at 22nd on the public leaderboard. Below are the top 3 scores for reference.

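For reference, the leaderboard submission itself can be produced with a short sketch like the one below, assuming `test` is the competition test DataFrame loaded the same way as `train`; the resulting x_test is also what the interpretability example in the next section slices into.
# Prepare the competition test set with the same feature columns
x_test = test[FEATURES].copy()
# Predict the probability of rainfall for each id and write the submission file
test_probs = model_pfn.predict_proba(x_test)[:, 1]
submission = pd.DataFrame({"id": test["id"], "rainfall": test_probs})
submission.to_csv("submission.csv", index=False)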
What about model explainability?
Transformer models are not inherently interpretable, so to understand their predictions, post-hoc interpretability techniques like SHAP (SHapley Additive exPlanations) are commonly used to analyse individual predictions and feature contributions. TabPFN provides a dedicated interpretability extension that integrates with SHAP, making it easier to inspect and reason about the model's predictions. To access it, you will need to install the extension first:
# Install the interpretability extension:
pip install "tabpfn-extensions[interpretability]"
from tabpfn_extensions import interpretability
# Calculate SHAP values
shap_values = interpretability.shap.get_shap_values(
    estimator=model_pfn,
    test_x=x_test[:50],
    attribute_names=FEATURES,
    algorithm="permutation",
)
# Create visualization
fig = interpretability.shap.plot_shap(shap_values)

The plot on the left shows the average SHAP feature importance across the entire dataset, giving a global view of which features matter most to the model. The plot on the right is a SHAP summary (beeswarm) plot, which provides a more granular view by showing SHAP values for each feature across individual predictions.
From the above plots, it is clear that cloud cover, sunshine, humidity, and dew point have the largest overall impact on the model's predictions, while features such as wind direction, pressure, and temperature-related variables play a comparatively smaller role.
It is important to note that SHAP explains the model's learned relationships, not physical causality.
Conclusion
There is a lot more to TabPFN than what I have covered in this article. What I personally liked is both the underlying idea and how easy it is to get started. There are many aspects I have not touched on here, such as TabPFN's use in time series forecasting, anomaly detection, generating synthetic tabular data, and extracting embeddings from TabPFN models.
Another area I am particularly interested in exploring is fine-tuning, where these models can be adapted to data from a specific domain. That said, this article was meant to be a lightweight introduction based on my first hands-on experience. I plan to explore these additional capabilities in more depth in future posts. For now, the official documentation is a good place to dive deeper.
