In the case of Large Language Models, Less is More!


Bugatti’s Quad-turbocharged 8.0L W16 engine vs Electric Motor

With the arrival of any new technology, humanity's first attempt is usually achieved through brute force. As the technology evolves, we optimize and arrive at a more elegant solution to the brute-force breakthrough. With the latest advancements in Artificial Intelligence (AI), particularly the advent of Large Language Models (LLMs), we've made significant strides in recent years and demonstrated impressive capabilities. But these strides are still very much in the brute-force stage of the technology's evolution. We've seen a Cambrian explosion of transformer-like models, bringing forth large models that range all the way up to trillions of parameters.

This is analogous to the transition from the combustion engine to its more efficient electric successor. That transition was observed in sedans and in my favorite hobby toy: racing cars. It began in the 1960s with the likes of the Pontiac GTO, the Shelby Cobra 427 and the Dodge Charger R/T showcasing Detroit muscle: big-block, gas-guzzling, 0-to-60 MPH in 10 seconds street Hemi engines with gas mileage ranging from 7–14 miles per gallon (MPG). Today, with the latest electric cars, like Rimac's Nevera, you can achieve 0-to-60 MPH in 1.74 seconds while getting 54 MPGe. The early brute force was a crucial step in catalyzing the efficiency that followed.

It has become evident to me that history is bound to repeat itself with Large Language Models. We're on the cusp of shifting from brute-force attempts towards more elegant solutions in AI models, in particular moving away from larger, more complex language models (our modern equivalent of the GTO, Cobra and Hemi engine) towards smaller, far more efficient models. To be frank, driving such efficiency has been a key focus of mine for the past several years. Working with an incredible team of colleagues, I've been fortunate to work at the intersection of AI and compute in recent roles, designing accelerated machines and co-designing Meta's AI infrastructure. When Babak Pahlavan and I set out to build our current venture, NinjaTech AI, we inscribed a key fundamental of our technical DNA into the company's culture: the efficient execution and operation of our intelligence platform from Day 1. NinjaTech is building an AI Executive Assistant to make professionals more productive by taking over administrative tasks like scheduling, expenses and travel booking, which consume considerable time.

While studying autoregressive, generative language models with hundreds of billions of parameters, it became clear to me that there must be a simpler and more efficient way to accomplish these administrative tasks. It's one thing if you're trying to answer "what's the meaning of life" questions, or asking your model to write the Python code for an automatic music producer. For many administrative tasks, simpler, less complex models suffice. We have put this to the test by using an assortment of model sizes for various administrative tasks, some so small and efficient that they can run on a CPU! This not only keeps us from breaking the bank with high-cost, large-scale training jobs, but it also saves us money at inference time, because we don't need expensive GPU instances with large memory footprints to serve our models. Much like the combustion-to-electric example above, we're becoming more efficient, only far more quickly!
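To make the memory-footprint point concrete, here is a rough back-of-the-envelope sketch in Python. It assumes weights are stored at 16-bit precision (2 bytes per parameter) and counts the weights only, ignoring activations and any serving overhead. The model sizes are illustrative round numbers, not the sizes of our production models, and `weight_memory_gb` is a hypothetical helper written for this example.

```python
def weight_memory_gb(num_params: float, bytes_per_param: int = 2) -> float:
    """Approximate memory needed to hold model weights, in GB (10^9 bytes).

    Assumption: 2 bytes per parameter (16-bit weights) by default.
    Activations, caches and runtime overhead are not included.
    """
    return num_params * bytes_per_param / 1e9


# Illustrative sizes only (assumed for this sketch).
example_models = {
    "100B parameters": 100e9,
    "1B parameters": 1e9,
    "100M parameters": 100e6,
}

for name, params in example_models.items():
    print(f"{name}: ~{weight_memory_gb(params):g} GB of weights")
```

Under these assumptions, a 100B-parameter model needs roughly 200 GB for its weights alone, which means several high-memory GPUs, while a 100M-parameter model needs about 0.2 GB and fits comfortably in ordinary CPU memory.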

We're excited to see the industry and the research community shift towards more efficient operation. One example is Meta's Llama release, which showcased a 13B-parameter model outperforming GPT-3 (175B) on most benchmarks by training an order-of-magnitude smaller model on more data. Subsequently, Meta's researchers outdid themselves again with LIMA (Less Is More for Alignment), which relied on just 1,000 "diverse" prompts as a clever fine-tuning method to achieve high-quality results. This is truly remarkable, and it is essential for curbing AI's compute demand, which continues to soar exponentially and may have detrimental effects on our planet because of AI's carbon footprint. To put things in perspective, an MIT study demonstrated that small transformer models with only 65M parameters can consume up to 27 kWh and emit 26 lbs of CO2e to train. This number grows dramatically for large models such as GPT-3, which produced up to ~502 tonnes of carbon-equivalent emissions in 2022 alone. Moreover, while inference is less compute-intensive than training, once a model is published and used for serving, its inference emissions can skyrocket to 10–100x its training emissions over its lifetime.

We're only at the tip of the iceberg of AI's vast possibilities. However, to do more within a narrower footprint and a given cluster size and budget, it's imperative to consider the efficiency of our operations. We need to curb the gas-guzzling Hemi and employ smaller, more efficient models. This will improve operations, lower costs and meaningfully reduce AI's carbon footprint.



