Just a few days ago, Microsoft and NVIDIA introduced Megatron-Turing NLG 530B, a Transformer-based model hailed as “the world’s largest and most powerful generative language model.”
This is an impressive feat of Machine Learning engineering, no doubt about it. Yet, should we be excited about this mega-model trend? I, for one, am not. Here’s why.

This is Your Brain on Deep Learning
Researchers estimate that the human brain contains an average of 86 billion neurons and 100 trillion synapses. It’s safe to assume that not all of them are dedicated to language either. Interestingly, GPT-4 is expected to have about 100 trillion parameters… As crude as this analogy is, shouldn’t we wonder whether building language models that are about the size of the human brain is the best long-term approach?
After all, our brain is a marvelous device, produced by millions of years of evolution, while Deep Learning models are only a few decades old. Still, our intuition should tell us that something doesn’t compute (pun intended).
Deep Learning, Deep Pockets?
As you’d expect, training a 530-billion parameter model on humongous text datasets requires a fair bit of infrastructure. In fact, Microsoft and NVIDIA used hundreds of DGX A100 multi-GPU servers. At $199,000 a piece, and factoring in networking equipment, hosting costs, etc., anyone looking to replicate this experiment would have to spend close to $100 million. Want fries with that?
Seriously, which organizations have business use cases that would justify spending $100 million on Deep Learning infrastructure? Or even $10 million? Very few. So who are these models really for?
That Warm Feeling is your GPU Cluster
For all its engineering brilliance, training Deep Learning models on GPUs is a brute force technique. According to the spec sheet, each DGX server can consume up to 6.5 kilowatts. Of course, you’ll need at least as much cooling power in your datacenter (or your server closet). Unless you’re the Starks and want to keep Winterfell warm in winter, that’s another problem you’ll have to deal with.
In addition, as public awareness grows on climate and social responsibility issues, organizations need to account for their carbon footprint. According to this 2019 study from the University of Massachusetts, “training BERT on GPU is roughly equivalent to a trans-American flight”.
BERT-Large has 340 million parameters. One can only extrapolate what the footprint of Megatron-Turing could be… People who know me wouldn’t call me a bleeding-heart environmentalist. Still, some numbers are hard to ignore.
So?
Am I excited by Megatron-Turing NLG 530B and whatever beast is coming next? No. Do I think that the (relatively small) benchmark improvement is worth the added cost, complexity and carbon footprint? No. Do I think that building and promoting these huge models helps organizations understand and adopt Machine Learning? No.
I’m left wondering what the point of it all is. Science for the sake of science? Good old marketing? Technological supremacy? Probably a bit of each. I’ll leave them to it, then.
Instead, let me focus on pragmatic and actionable techniques that you can all use to build high-quality Machine Learning solutions.
Use Pretrained Models
In the vast majority of cases, you won’t need a custom model architecture. Maybe you’ll want a custom one (which is a different thing), but there be dragons. Experts only!
A good place to start is to look for models that have been pretrained for the task you’re trying to solve (say, summarizing English text).
Then, you should quickly try out a few models to predict your own data. If metrics tell you that one works well enough, you’re done! If you need a little more accuracy, you should consider fine-tuning the model (more on this in a minute).
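To give you an idea, here’s a minimal sketch of what “quickly trying out a model” can look like with the transformers pipeline API. The checkpoint name is just one example of a pretrained summarization model; swap in any other from the hub.

```python
from transformers import pipeline

# Load a pretrained summarization pipeline.
# "sshleifer/distilbart-cnn-12-6" is only an example checkpoint.
summarizer = pipeline("summarization", model="sshleifer/distilbart-cnn-12-6")

text = """Machine Learning teams often assume they need to train a large custom
model from scratch, when a pretrained model evaluated on their own data would
already deliver the accuracy they need, at a fraction of the cost."""

# Generate a short summary and inspect the result.
summary = summarizer(text, max_length=60, min_length=10, do_sample=False)
print(summary[0]["summary_text"])
```

A few lines like these, run against a sample of your own data, are often all it takes to decide whether a model is good enough out of the box.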
Use Smaller Models
When evaluating models, you should pick the smallest one that can deliver the accuracy you need. It will predict faster and require fewer hardware resources for training and inference. Frugality goes a long way.
This is nothing new either. Computer Vision practitioners will remember when SqueezeNet came out in 2017, achieving a 50x reduction in model size compared to AlexNet, while meeting or exceeding its accuracy. How clever that was!
Downsizing efforts are also under way in the Natural Language Processing community, using transfer learning techniques such as knowledge distillation. DistilBERT is perhaps its most widely known achievement. Compared to the original BERT model, it retains 97% of language understanding while being 40% smaller and 60% faster. You can try it here. The same approach has been applied to other models, such as Facebook’s BART, and you can try DistilBART here.
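As a quick sanity check, here’s a small sketch (the model names are just examples) that loads BERT and its distilled counterpart and compares their parameter counts, an easy first proxy for memory footprint and inference cost:

```python
from transformers import AutoModel

# Load the original BERT base model and its distilled counterpart.
bert = AutoModel.from_pretrained("bert-base-uncased")
distilbert = AutoModel.from_pretrained("distilbert-base-uncased")

def count_parameters(model):
    """Return the total number of parameters in a model."""
    return sum(p.numel() for p in model.parameters())

print(f"BERT base:  {count_parameters(bert) / 1e6:.0f}M parameters")
print(f"DistilBERT: {count_parameters(distilbert) / 1e6:.0f}M parameters")
```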
Recent models from the BigScience project are also very impressive. As visible in this graph included in the research paper, their T0 model outperforms GPT-3 on many tasks while being 16x smaller.

You can try T0 here. This is the kind of research we need more of!
Fine-Tune Models
If you need to specialize a model, there should be very few reasons to train it from scratch. Instead, you should fine-tune it, that is to say train it only for a few epochs on your own data. If you’re short on data, maybe one of these datasets can get you started.
You guessed it, that’s another way to do transfer learning, and it’ll help you save on everything!
- Less data to gather, store, clean and annotate,
- Faster experiments and iterations,
- Fewer resources required in production.
In other words: save time, save money, save hardware resources, save the world!
If you need a tutorial, the Hugging Face course will get you started in no time.
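To make this concrete, here’s a minimal fine-tuning sketch with the transformers Trainer API. The model, dataset, sample sizes and hyperparameters are just examples to adapt to your own use case.

```python
from datasets import load_dataset
from transformers import (
    AutoModelForSequenceClassification,
    AutoTokenizer,
    Trainer,
    TrainingArguments,
)

# Start from a pretrained checkpoint (example model) instead of training from scratch.
model_name = "distilbert-base-uncased"
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=2)

# Load a small public sentiment dataset and tokenize it.
dataset = load_dataset("imdb")

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, padding="max_length", max_length=256)

tokenized = dataset.map(tokenize, batched=True)

# Train for a few epochs only: we're specializing a pretrained model, not rebuilding it.
training_args = TrainingArguments(
    output_dir="./results",
    num_train_epochs=3,
    per_device_train_batch_size=16,
    evaluation_strategy="epoch",
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized["train"].shuffle(seed=42).select(range(2000)),
    eval_dataset=tokenized["test"].shuffle(seed=42).select(range(500)),
)

trainer.train()
```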
Use Cloud-Based Infrastructure
Like them or not, cloud companies know how to build efficient infrastructure. Sustainability studies show that cloud-based infrastructure is more energy and carbon efficient than the alternative: see AWS, Azure, and Google. Earth.org says that while cloud infrastructure is not perfect, “[it’s] still more energy efficient than the alternative and facilitates environmentally beneficial services and economic growth.”
Cloud certainly has a lot going for it when it comes to ease of use, flexibility and pay as you go. It’s also a little greener than you probably thought. If you’re short on GPUs, why not try fine-tuning your Hugging Face models on Amazon SageMaker, AWS’ managed service for Machine Learning? We’ve got plenty of examples for you.
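For reference, here’s a minimal sketch of what launching a fine-tuning job could look like with the Hugging Face estimator from the SageMaker Python SDK. The IAM role, training script, instance type, framework versions and S3 paths are all placeholders for your own setup.

```python
from sagemaker.huggingface import HuggingFace

# Placeholder values: replace with your own IAM role, training script and S3 data.
role = "arn:aws:iam::123456789012:role/SageMakerRole"

huggingface_estimator = HuggingFace(
    entry_point="train.py",          # your fine-tuning script
    source_dir="./scripts",
    instance_type="ml.p3.2xlarge",   # a single-GPU instance is often enough for fine-tuning
    instance_count=1,
    role=role,
    transformers_version="4.12",     # example versions: use a supported combination
    pytorch_version="1.9",
    py_version="py38",
    hyperparameters={"epochs": 3, "model_name": "distilbert-base-uncased"},
)

# Launch the managed training job on the S3 datasets.
huggingface_estimator.fit({
    "train": "s3://my-bucket/train",
    "test": "s3://my-bucket/test",
})
```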
Optimize Your Models
From compilers to virtual machines, software engineers have long used tools that automatically optimize their code for whatever hardware they’re running on.
However, the Machine Learning community is still struggling with this topic, and for good reason. Optimizing models for size and speed is a devilishly complex task, which involves techniques such as:
- Specialized hardware that accelerates training (Graphcore, Habana) and inference (Google TPU, AWS Inferentia).
- Pruning: remove model parameters that have little or no impact on the predicted outcome.
- Fusion: merge model layers (say, convolution and activation).
- Quantization: store model parameters in smaller values (say, 8 bits instead of 32 bits), as sketched below.
Fortunately, automated tools are starting to emerge, such as the Optimum open source library, and Infinity, a containerized solution that delivers Transformers accuracy at 1-millisecond latency.
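To give you an idea of the quantization technique mentioned above, here’s a small sketch of post-training dynamic quantization with plain PyTorch; it’s just one of several possible approaches, and the model name is only an example.

```python
import os
import tempfile

import torch
from transformers import AutoModelForSequenceClassification

# Load a pretrained model (example checkpoint).
model = AutoModelForSequenceClassification.from_pretrained("distilbert-base-uncased")

# Post-training dynamic quantization: linear layers are stored and computed
# in 8-bit integers instead of 32-bit floats.
quantized_model = torch.quantization.quantize_dynamic(
    model, {torch.nn.Linear}, dtype=torch.qint8
)

def model_size_mb(m):
    """Rough size estimate: serialize the model and measure the file size."""
    with tempfile.NamedTemporaryFile(delete=False) as f:
        torch.save(m.state_dict(), f.name)
        size = os.path.getsize(f.name) / 1e6
    os.remove(f.name)
    return size

print(f"Original model:  {model_size_mb(model):.0f} MB")
print(f"Quantized model: {model_size_mb(quantized_model):.0f} MB")
```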
Conclusion
Large language model size has been increasing 10x every year for the last few years. This is starting to look like another Moore’s Law.
We’ve been there before, and we should know that this road leads to diminishing returns, higher cost, more complexity, and new risks. Exponentials tend not to end well. Remember Meltdown and Spectre? Do we want to find out what that looks like for AI?
Instead of chasing trillion-parameter models (place your bets), wouldn’t we all be better off if we built practical and efficient solutions that all developers can use to solve real-world problems?
Interested in how Hugging Face can help your organization build and deploy production-grade Machine Learning solutions? Get in touch at julsimon@huggingface.co (no recruiters, no sales pitches, please).
