Introduction
When training large language models (LLMs), we are always constrained by compute budgets. This constraint leads to a fundamental trade-off: if you fix a compute budget, increasing the model size means you must reduce the amount of data you can train on, and vice versa. So you might ask the question:
Should we spend our compute on a model with more parameters, or on training a smaller model with more data?
LLM performance and efficiency are largely determined by this trade-off, so it is crucial to find the optimal balance between the number of model parameters and the number of training tokens.
The total training compute of a transformer roughly scales as C ∝ N × D, where
- N is the number of model parameters,
- D is the number of training tokens,
- C is the compute budget.
It is easy to see that for a fixed C, N and D are inversely proportional to each other: doubling N forces us to halve D.
Previous studies (Kaplan et al., 2020; Hoffmann et al., 2022) have found that the training loss of machine learning models follows a power law in compute, L(C) ∝ C^{−α}, and that the optimal model size and dataset size scale with compute as N_opt ∝ C^a and D_opt ∝ C^b for some positive exponents a and b.
In this post, we'll use tiny Transformers to explore how to balance N and D under a fixed compute budget C.
Experiment Setup
We design a minimal transformer model, which we call the "tiny transformer", with the following configurable properties that determine its parameter count:
- Model dimension (d_model)
- MLP dimension (d_mlp)
- Number of layers (n_layers)
We train transformers of various configurations on tokenized sequences of length 64 from the WikiText-2 dataset.
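Below is a minimal sketch of what TinyTransformer and the count_params helper (both used later) could look like. The details here are our assumptions rather than a fixed recipe: two attention heads, learned positional embeddings, and PyTorch's built-in encoder layers with a causal mask.

import torch
import torch.nn as nn

class TinyTransformer(nn.Module):
    """A deliberately small language model for scaling experiments."""
    def __init__(self, vocab_size, d_model, d_mlp, n_layers, n_heads=2, max_len=64):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, d_model)
        self.pos_emb = nn.Embedding(max_len, d_model)
        block = nn.TransformerEncoderLayer(
            d_model=d_model, nhead=n_heads, dim_feedforward=d_mlp,
            batch_first=True, norm_first=True
        )
        self.encoder = nn.TransformerEncoder(block, num_layers=n_layers)
        self.lm_head = nn.Linear(d_model, vocab_size)

    def forward(self, input_ids):
        seq_len = input_ids.size(1)
        positions = torch.arange(seq_len, device=input_ids.device)
        x = self.tok_emb(input_ids) + self.pos_emb(positions)
        # causal mask so each position only attends to earlier tokens
        causal_mask = nn.Transformer.generate_square_subsequent_mask(seq_len).to(input_ids.device)
        x = self.encoder(x, mask=causal_mask)
        return self.lm_head(x)  # (batch, seq_len, vocab_size) logits

def count_params(model):
    """Total number of trainable parameters."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad)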
To study the effect of scaling, we define a grid of models ranging from very small (16 hidden units, 1 layer) to relatively large (128 hidden units, 4 layers) and combine each with token budgets ranging from 5k to 1M. See the code below:
model_configs = [
    {"d_model": 16,  "d_mlp": 64,   "n_layers": 1},  
    {"d_model": 24,  "d_mlp": 96,   "n_layers": 1},   
    {"d_model": 32,  "d_mlp": 128,  "n_layers": 2},
    {"d_model": 48,  "d_mlp": 192,  "n_layers": 2},
    {"d_model": 64,  "d_mlp": 256,  "n_layers": 3},
    {"d_model": 96,  "d_mlp": 384,  "n_layers": 3},
    {"d_model": 128, "d_mlp": 512,  "n_layers": 4},   
]
# number of tokens (D) we train on, simulated via steps × batch × seq_len
token_budgets = [5e3, 1e4, 3e4, 5e4, 1e5, 3e5, 5e5, 1e6]  # kept small for the demo

By approximating the compute cost as C ≈ N × D, our plan is to measure the final training loss for every (N, D) pair and, for each compute level C, find the pair (N, D) that reaches the lowest loss: that is the balance we are looking for.
Implementation and Observations
We use the code below to train the model for a fixed number of steps for each (N, D) pair and record the results.
results = []
device = "cuda" if torch.cuda.is_available() else "cpu"
for cfg in model_configs:
    model = TinyTransformer(vocab_size=len(tokenizer), **cfg)
    N_params = count_params(model)
    for D in token_budgets:
        steps = int(D // (SEQ_LEN * 16))  # assuming batch_size=16
        dataloader = DataLoader(
            tokenized_dataset["train"].shuffle(seed=0),
            batch_size=16,
            collate_fn=collate_fn
        )
        # train_one (defined elsewhere) runs `steps` optimizer updates and returns the average loss
        avg_loss = train_one(model, dataloader, steps=steps, device=device)
        compute = N_params * D
        results.append({
            "N": N_params,
            "D": D,
            "C": compute,
            "loss": avg_loss
        })

We then plot the final loss against the compute (N × D):
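The plotting code is not shown above; a possible sketch with pandas and matplotlib (the column names follow the results dictionaries; the styling choices are ours) is:

import pandas as pd
import matplotlib.pyplot as plt

df = pd.DataFrame(results)

plt.figure(figsize=(6, 4))
for N, group in df.groupby("N"):
    group = group.sort_values("C")
    # one loss-vs-compute curve per model size
    plt.plot(group.C, group.loss, marker="o", label=f"N={N:,}")
plt.xscale("log")
plt.xlabel("Compute C = N × D")
plt.ylabel("Final training loss")
plt.legend(fontsize=8)
plt.tight_layout()
plt.show()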
We have the following important observations:
- For small compute budgets, small models trained on most of the available data perform better than larger models trained on very little data.
- For large compute budgets, larger models become better once enough data is available.
- The optimal model size does not grow linearly with the compute budget. For instance, doubling the compute does not double the optimal number of parameters.
The plot below shows the efficient frontier across model sizes, that is, the set of model sizes that achieve the lowest loss for a given compute.
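Concretely, the frontier can be extracted by binning the runs by (log) compute and keeping, within each bin, the run with the lowest loss. A rough sketch, continuing from the DataFrame df built above (the number of bins is an arbitrary choice):

import numpy as np

# bin runs by log-compute and keep the lowest-loss run in each bin
df["C_bin"] = pd.cut(np.log10(df.C), bins=12)
frontier = (
    df.loc[df.groupby("C_bin", observed=True).loss.idxmin()]
      .sort_values("C")
      .reset_index(drop=True)
)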

“Best” Model
To determine the "best" model, we choose the pair of model size and number of tokens that minimizes the loss at a fixed budget.
We assume both follow a power-law relationship, N_opt ∝ C^α and D_opt ∝ C^β, and we estimate the unknown exponents α and β with the following steps:
- Take the logarithm of each relation: log(N_opt) = α log(C) + const and log(D_opt) = β log(C) + const.
- Fit a linear regression in log-log space. The slope of the regression is exactly the power-law exponent.
The following code performs such a regression:
import numpy as np
import scipy.stats as st

# Fit log-log linear regressions on the frontier points
a_slope, a_intercept, *_ = st.linregress(np.log(frontier.C), np.log(frontier.N))
b_slope, b_intercept, *_ = st.linregress(np.log(frontier.C), np.log(frontier.D))

In our toy experiment, we found that N_opt ∝ C^0.14 and D_opt ∝ C^0.86. Note that the two exponents sum to roughly 1, as they must when C ≈ N × D along the frontier. This result may not reveal the whole picture, because the experiment used simplified models and configurations, but we can still see that growing the compute budget increases the optimal model size at a diminishing rate; the remaining budget should go to more training tokens.
Furthermore, the fit above implies that the optimal ratio scales as N_opt/D_opt ∝ C^-0.72. This means that as you increase compute, you should spend most of the additional budget on training tokens rather than on a larger model.
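As a usage example, the fitted slopes and intercepts can be turned into a small helper that suggests how to split a new compute budget. Note that suggest_allocation is our own illustrative name, and the exponents come from this toy run, so the numbers should not be taken as general guidance.

import math

def suggest_allocation(C, a_slope, a_intercept, b_slope, b_intercept):
    """Predict a compute-optimal parameter count and token count for budget C."""
    N_opt = math.exp(a_intercept) * C ** a_slope
    D_opt = math.exp(b_intercept) * C ** b_slope
    return N_opt, D_opt

# a hypothetical budget somewhat larger than the ones in our grid
N_hat, D_hat = suggest_allocation(1e13, a_slope, a_intercept, b_slope, b_intercept)
print(f"suggested parameters: {N_hat:.3g}, suggested tokens: {D_hat:.3g}")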
Practical Takeaways
From this experiment, though a toy case, we can extract several insights:
- For a fixed budget, a medium-sized model trained on more data can outperform a very large model trained on limited data.
- Optimal model size and data size both grow with compute. Don't train a model with many parameters if you have a small budget.
- When the budget increases, first consider the optimal ratio N_opt/D_opt to decide whether you should increase the model size or add more training data.
Conclusion
In this blog post, we studied the trade-off between model size and data under a fixed compute budget for LLMs using a toy case. The experiment shows that we can find the optimal pair of model size and token count that achieves the best performance within a given budget, helping researchers and practitioners design LLMs sensibly and get the most out of their compute.
References
[1] Kaplan, J., McCandlish, S., Henighan, T., Brown, T. B., Chess, B., Child, R., Gray, S., Radford, A., Wu, J., & Amodei, D. (2020). Scaling Laws for Neural Language Models.
[2] Hoffmann, J., Borgeaud, S., Mensch, A., Buchatskaya, E., Cai, T., Rutherford, E., de Las Casas, D., Hendricks, L. A., Welbl, J., Clark, A., Hennigan, T., Noland, E., Millican, K., van den Driessche, G., Damoc, B., Guy, A., Osindero, S., Simonyan, K., Elsen, E., … Sifre, L. (2022). Training Compute-Optimal Large Language Models.


