“Bigger is always better”: this principle is deeply rooted in the AI world. Every month, larger models appear, with more and more parameters. Companies are even building $10 billion AI data centers for them. But is it the only way to go?
At NeurIPS 2024, Ilya Sutskever, one of OpenAI’s co-founders, shared an idea: “Pre-training as we know it will unquestionably end”. It seems the era of scaling is coming to a close, which means it’s time to focus on improving current approaches and algorithms.
One of the most promising areas is the use of small language models (SLMs) with up to 10B parameters. This approach is really starting to take off in the industry. For example, Clem Delangue, CEO of Hugging Face, predicts that up to 99% of use cases could be addressed using SLMs. The same trend is evident in the latest requests for startups by YC:
Giant generic models with a lot of parameters are very impressive. But they are also very costly and often come with latency and privacy challenges.
In my last article “You don’t need hosted LLMs, do you?”, I wondered whether you need self-hosted models. Now I take it a step further and ask the question: do you need LLMs at all?
In this article, I’ll discuss why small models may be the solution your business needs. We’ll talk about how they can reduce costs, improve accuracy, and keep your data under your control. And of course, we’ll have an honest discussion about their limitations.
The economics of LLMs is probably one of the most painful topics for businesses. However, the issue is much broader: it includes the need for expensive hardware, infrastructure costs, energy costs, and environmental consequences.
Yes, large language models are impressive in their capabilities, but they are also very expensive to maintain. You may have already noticed how subscription prices for LLM-based applications have risen. For example, OpenAI’s recent announcement of a $200/month Pro plan is a signal that costs are rising, and it’s likely that competitors will also move up to these price levels.
The Moxie robot story is a good illustration of this. Embodied created a great companion robot for kids for $800 that used the OpenAI API. Despite the product’s success (kids were sending 500–1000 messages a day!), the company is shutting down due to the high operational costs of the API. Now thousands of robots will become useless and kids will lose their friend.
One approach is to fine-tune a specialized Small Language Model for your specific domain. Of course, it will not solve “all the problems of the world”, but it will perfectly handle the task it is assigned to, for example, analyzing client documentation or generating specific reports. At the same time, SLMs are more economical to maintain, consume fewer resources, require less data, and can run on much more modest hardware (even a smartphone).
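To make this concrete, here is a minimal sketch of what such domain fine-tuning might look like with Hugging Face transformers and peft (LoRA). The model name, data file, and hyperparameters are placeholders I chose for illustration, not a recommendation:

```python
# Minimal LoRA fine-tuning sketch for a small model.
# Placeholder model, data file, and hyperparameters; assumes a JSONL file with a "text" field.
import torch
from datasets import load_dataset
from peft import LoraConfig, get_peft_model
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

model_name = "Qwen/Qwen2-7B"  # any small base model
data = load_dataset("json", data_files="client_docs.jsonl")["train"]  # your domain texts

tokenizer = AutoTokenizer.from_pretrained(model_name)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

model = AutoModelForCausalLM.from_pretrained(
    model_name, torch_dtype=torch.bfloat16, device_map="auto"
)

# LoRA: train a few million adapter weights instead of all ~7B parameters
model = get_peft_model(
    model,
    LoraConfig(r=16, lora_alpha=32, target_modules=["q_proj", "v_proj"],
               task_type="CAUSAL_LM"),
)

tokenized = data.map(
    lambda batch: tokenizer(batch["text"], truncation=True, max_length=1024),
    batched=True, remove_columns=data.column_names,
)

trainer = Trainer(
    model=model,
    args=TrainingArguments(output_dir="slm-domain", num_train_epochs=1,
                           per_device_train_batch_size=2,
                           gradient_accumulation_steps=8, learning_rate=2e-4),
    train_dataset=tokenized,
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
```

The point is less the exact settings and more the scale: the trainable part is a small adapter, and the whole thing fits on a single consumer-grade GPU.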
And finally, let’s not forget about the environment. In the article Carbon Emissions and Large Neural Network Training, I found a statistic that amazed me: training GPT-3 with its 175 billion parameters consumed as much electricity as the average American home consumes in 120 years. It also produced 502 tons of CO₂, comparable to the annual operation of more than 100 gasoline cars. And that’s not counting inference costs. By comparison, deploying a smaller model like a 7B would require about 5% of the consumption of a larger model. And what about the latest o3 release?
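As a quick sanity check of those comparisons, here is a rough back-of-envelope calculation. The per-household and per-car reference values are commonly cited approximations that I am adding for illustration, not numbers from the article:

```python
# Rough back-of-envelope check of the figures above.
# Assumed reference values (approximate, not from the article):
US_HOME_KWH_PER_YEAR = 10_700    # average US household electricity use
CAR_TONS_CO2_PER_YEAR = 4.6      # typical passenger car emissions per year

gpt3_training_mwh = 1_300        # rough order of magnitude reported for GPT-3 training
gpt3_training_tons_co2 = 502     # figure cited in the article

home_years = gpt3_training_mwh * 1_000 / US_HOME_KWH_PER_YEAR
car_years = gpt3_training_tons_co2 / CAR_TONS_CO2_PER_YEAR

print(f"~{home_years:.0f} household-years of electricity")  # ≈ 120
print(f"~{car_years:.0f} car-years of emissions")           # ≈ 110

# A 7B-scale model at roughly 5% of that footprint:
print(f"7B-scale model: ~{0.05 * gpt3_training_mwh:.0f} MWh")
```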
💡 Hint: don’t chase the hype. Before tackling a task, calculate the costs of using APIs or your own servers. Think about how such a system would scale and how justified the use of LLMs really is.
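A back-of-envelope version of that calculation might look like this; all prices and volumes below are placeholders, so substitute your own traffic and current provider or GPU rates:

```python
# Back-of-envelope monthly cost comparison.
# All numbers are placeholders: plug in your real token volumes and current rates.
requests_per_day = 50_000
tokens_per_request = 1_500                 # prompt + completion
tokens_per_month = requests_per_day * tokens_per_request * 30

api_price_per_1m_tokens = 10.0             # $ per 1M tokens, blended (placeholder)
api_cost = tokens_per_month / 1_000_000 * api_price_per_1m_tokens

gpu_server_per_hour = 1.5                  # $ per hour for a GPU that can serve a 7B model (placeholder)
self_hosted_cost = gpu_server_per_hour * 24 * 30

print(f"API:         ${api_cost:,.0f}/month")
print(f"Self-hosted: ${self_hosted_cost:,.0f}/month")
```

Even a crude estimate like this usually makes it obvious whether a hosted LLM, a self-hosted SLM, or something simpler is the right fit at your scale.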
Now that we’ve covered the economics, let’s talk about quality. Naturally, very few people would want to compromise on solution accuracy just to save costs. But even here, SLMs have something to offer.
Many studies show that for highly specialized tasks, small models can not only compete with large LLMs but often outperform them. Let’s look at a few illustrative examples:
- Medicine: The Diabetica-7B model (based on Qwen2-7B) achieved 87.2% accuracy on diabetes-related tests, while GPT-4 showed 79.17% and Claude-3.5 showed 80.13%. At the same time, Diabetica-7B is dozens of times smaller than GPT-4 and can run locally on a consumer GPU.
- Legal Sector: An SLM with just 0.2B parameters achieves 77.2% accuracy in contract analysis (GPT-4: about 82.4%). Moreover, for tasks like identifying “unfair” terms in user agreements, the SLM even outperforms GPT-3.5 and GPT-4 on the F1 metric.
- Mathematical Tasks: Research by Google DeepMind shows that training a small model, Gemma2-9B, on data generated by another small model yields better results than training on data from the larger Gemma2-27B. Smaller models tend to focus better on the specifics of a task, without trying to “shine with all their knowledge”, which is often a trait of larger models.
- Content Moderation: LLaMA 3.1 8B outperformed GPT-3.5 in accuracy (by 11.5%) and recall (by 25.7%) when moderating content across 15 popular subreddits. This was achieved even with 4-bit quantization, which further reduces the model’s size (see the sketch after this list).
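As a sketch of that last point, loading an 8B model with 4-bit quantization via transformers and bitsandbytes looks roughly like this. The model ID and prompt are illustrative, and the Llama weights are gated and require access approval:

```python
# Loading an 8B model with 4-bit (NF4) quantization via bitsandbytes.
# Model ID and prompt are illustrative; Llama weights require approved access.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

model_id = "meta-llama/Llama-3.1-8B-Instruct"

quant_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
)

tokenizer = AutoTokenizer.from_pretrained(model_id)
model = AutoModelForCausalLM.from_pretrained(
    model_id, quantization_config=quant_config, device_map="auto"
)

prompt = "Does the following comment violate the community rules? Answer yes or no.\n\nComment: ..."
inputs = tokenizer(prompt, return_tensors="pt").to(model.device)
output = model.generate(**inputs, max_new_tokens=5)
print(tokenizer.decode(output[0][inputs["input_ids"].shape[-1]:], skip_special_tokens=True))
```

With 4-bit weights, an 8B model fits comfortably in the memory of a single mid-range GPU, which is exactly what makes this kind of moderation setup cheap to run.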
I’ll go a step further and say that even classic NLP approaches often work surprisingly well. Let me share a personal case: I’m working on a product for psychological support where we process over a thousand user messages daily. Users can write in a chat and get a response. Each message is first classified into one of four categories: