Hugging Face has unveiled technology to improve the inference performance of open source small language models (sLMs). Like OpenAI’s ‘o1’, it is based on the ‘test-time compute’ method, which improves response quality by investing additional computing resources and time at inference.
Hugging Face recently released ‘test-time scaling’, a technique that uses additional computing resources and time to help sLMs solve complex math, coding, and reasoning problems.
It is a method of improving response accuracy on difficult questions by spending more resources and time during the reasoning process. This helps an sLM achieve performance close to that of a large language model (LLM), and is useful when there is not enough memory to run an LLM.
It is particularly noteworthy that the step-by-step reasoning strategy behind the test-time compute technique has been fully disclosed. While OpenAI keeps internal workings such as its ‘chain of thought (CoT)’ structure private, Hugging Face has revealed its own test-time compute technique, based on research DeepMind announced in August.
The technology consists of ‘test-time scaling’, which uses additional computing during inference, a reward model that evaluates the sLM’s responses, and a search algorithm that optimizes the path to a better answer.
The primary methods of test-time scaling include majority voting, best-of-N, and weighted best-of-N.
Majority voting is a technique of sending the same question multiple times and choosing the answer that comes back most often. It can be effective for simple problems, but has limitations for complex ones.
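In code, majority voting is little more than sampling and counting. The sketch below is illustrative; `generate_answer` is a hypothetical stand-in for one sampled model call (e.g., a chat completion with temperature above zero), not part of Hugging Face’s released code.

```python
from collections import Counter

def majority_vote(question, generate_answer, n_samples=16):
    # Sample the same question several times; sampling temperature makes
    # the answers vary, and the most frequent one wins the vote.
    answers = [generate_answer(question) for _ in range(n_samples)]
    best_answer, count = Counter(answers).most_common(1)[0]
    return best_answer, count / n_samples  # answer plus its vote share
```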
Best-of-N generates multiple answers and uses a reward model, instead of a majority vote, to select the best one. Weighted best-of-N is an evolution of best-of-N that also considers the consistency of answers, favoring those that appear frequently and with high confidence.
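A minimal sketch of the difference between the two selection rules, assuming a hypothetical `score` function that wraps a reward-model call and returns a float per candidate answer:

```python
from collections import defaultdict

def best_of_n(candidates, score):
    # Plain best-of-N: keep the single highest-scoring candidate.
    return max(candidates, key=score)

def weighted_best_of_n(candidates, score):
    # Weighted best-of-N: sum reward scores across identical answers,
    # so an answer that recurs with high confidence beats a one-off outlier.
    totals = defaultdict(float)
    for answer in candidates:
        totals[answer] += score(answer)
    return max(totals, key=totals.get)
```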
The researchers used a process reward model (PRM) to evaluate not only the final answer but also the process of reaching it. In their experiments, ‘Llama-3.2 1B’ using weighted best-of-N and a PRM showed performance close to ‘Llama-3.1 8B’ on the highly difficult MATH-500 benchmark.
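Conceptually, a PRM scores each intermediate reasoning step rather than only the final answer. One simple way to turn per-step scores into a single solution-level score, sketched here with a hypothetical `step_score` call that returns a probability-like float, is to take their product (some setups instead use the last step’s score):

```python
def prm_solution_score(steps, step_score):
    # Multiply probability-like per-step scores so that a single weak
    # reasoning step drags the whole solution's score down.
    total = 1.0
    for step in steps:
        total *= step_score(step)
    return total
```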
To further improve performance, a ‘beam search’ algorithm was added. This method divides the model’s answering process into stages, and at each stage the search algorithm uses the reward model to evaluate the partial answers generated so far, keeping only the most promising ones.
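A simplified sketch of reward-guided beam search; `extend` (which samples one more reasoning step for a partial solution) and `step_score` (a PRM-style scorer) are assumed helpers, and the widths and step counts are illustrative rather than Hugging Face’s published settings:

```python
def beam_search(question, extend, step_score,
                beam_width=4, samples_per_beam=4, n_steps=8):
    beams = [""]  # start from an empty partial solution
    for _ in range(n_steps):
        # sample several continuations of every surviving beam...
        candidates = [extend(question, beam)
                      for beam in beams
                      for _ in range(samples_per_beam)]
        # ...then keep only the highest-scoring partial solutions
        candidates.sort(key=step_score, reverse=True)
        beams = candidates[:beam_width]
    return beams[0]  # best fully expanded solution
```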
However, although beam search can improve performance on complex problems, it tends to perform worse than other methods on easy ones. To address this, ‘DVTS (Diverse Verifier Tree Search)’ and a ‘compute-optimal scaling strategy’ were added.
DVTS is a variation of beam search designed to avoid locking onto incorrect inference paths by exploring more diverse responses. The compute-optimal scaling strategy, meanwhile, dynamically selects the best inference method depending on the difficulty of the problem.
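A rough sketch of the DVTS idea: instead of one wide beam, the sampling budget is split into several independent subtrees, each expanded greedily under the verifier’s guidance, which keeps the search from collapsing onto a single path. The helpers `extend` and `step_score` are the same assumed functions as above, and the budget split is illustrative:

```python
def dvts(question, extend, step_score, n_subtrees=4, n_steps=8, width=4):
    finished = []
    for _ in range(n_subtrees):
        partial = ""  # each subtree is an independent greedy search
        for _ in range(n_steps):
            options = [extend(question, partial) for _ in range(width)]
            partial = max(options, key=step_score)  # greedy verifier step
        finished.append(partial)
    # return the best complete solution across all subtrees
    return max(finished, key=step_score)
```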
This allowed the Llama-3.2 1B model to outperform the much larger 8B model, and the 3B model even produced better results than the 70B model.
However, Hugging Face said, “There are still limitations to test-time scaling.”
For instance, the experiments used a Llama-3.1-8B model trained as the PRM, which requires two models to be run in parallel. Moreover, the technique only works on problems where the answer can be clearly evaluated, such as coding or math.
Still, Hugging Face explained that opening up the technology, so that test-time compute, the core of inference scaling, can be applied to open source sLMs, could have a major impact.
In particular, as companies’ use of open source models increases, it could help those that have been hesitant to adopt such models due to problems like hallucination, accuracy, or cost.
Reporter Park Chan cpark@aitimes.com