And on the hardware side, DeepSeek has found new ways to juice old chips, allowing it to train top-tier models without paying for the latest hardware on the market. Half their innovation comes from straight engineering, says Zeiler: “They definitely have some really, really good GPU engineers on that team.”
Nvidia provides software called CUDA that engineers use to tweak the settings of their chips. But DeepSeek bypassed this code using assembler, a programming language that talks to the hardware itself, to go far beyond what Nvidia offers out of the box. “That’s as hardcore as it gets in optimizing this stuff,” says Zeiler. “You can do it, but basically it’s so difficult that nobody does.”
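To make that concrete, here is a minimal sketch, not DeepSeek’s actual code, of what dropping below CUDA’s C++ abstractions looks like, assuming the assembly-level language in question is Nvidia’s PTX. The kernel adds two arrays, but the add itself is written as an inline PTX instruction rather than left to the compiler; real kernels go to this level to control instruction selection and register use more tightly than standard CUDA allows.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Toy kernel: add two arrays, with the core add expressed as inline PTX
// (Nvidia's assembly-level language) instead of plain C++ arithmetic.
__global__ void add_ptx(const float* a, const float* b, float* c, int n) {
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) {
        float out;
        // Inline PTX: a single add.f32 instruction on float registers.
        asm("add.f32 %0, %1, %2;" : "=f"(out) : "f"(a[i]), "f"(b[i]));
        c[i] = out;
    }
}

int main() {
    const int n = 1 << 20;
    float *a, *b, *c;
    cudaMallocManaged(&a, n * sizeof(float));
    cudaMallocManaged(&b, n * sizeof(float));
    cudaMallocManaged(&c, n * sizeof(float));
    for (int i = 0; i < n; ++i) { a[i] = 1.0f; b[i] = 2.0f; }

    add_ptx<<<(n + 255) / 256, 256>>>(a, b, c, n);
    cudaDeviceSynchronize();
    printf("c[0] = %f\n", c[0]);  // expect 3.000000

    cudaFree(a); cudaFree(b); cudaFree(c);
    return 0;
}
```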
DeepSeek’s string of innovations on multiple models is impressive. But it also shows that the firm’s claim to have spent less than $6 million to train V3 is not the whole story. R1 and V3 were built on a stack of existing tech. “Maybe the very last step, the last click of the button, cost them $6 million, but the research that led up to that probably cost 10 times as much, if not more,” says Friedman. And in a blog post that cut through a lot of the hype, Anthropic cofounder and CEO Dario Amodei pointed out that DeepSeek probably has around $1 billion worth of chips, an estimate based on reports that the firm in fact used 50,000 Nvidia H100 GPUs.
A new paradigm
But why now? There are hundreds of startups around the world trying to build the next big thing. Why have we seen a string of reasoning models like OpenAI’s o1 and o3, Google DeepMind’s Gemini 2.0 Flash Thinking, and now R1 appear within weeks of one another?
The answer is that the base models (GPT-4o, Gemini 2.0, V3) are all now good enough to have reasoning-like behavior coaxed out of them. “What R1 shows is that with a strong enough base model, reinforcement learning is sufficient to elicit reasoning from a language model without any human supervision,” says Lewis Tunstall, a scientist at Hugging Face.
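Tunstall’s point can be illustrated with a sketch of the kind of automatically checkable reward that stands in for human supervision in this setup. The tags and scoring below are illustrative assumptions rather than DeepSeek’s actual training code: the idea is simply that a sampled completion can be graded by rules (did it show its reasoning, is the final answer right), and reinforcement learning then favors completions that score well.

```cpp
#include <iostream>
#include <regex>
#include <string>

// Hypothetical rule-based reward of the kind used in place of human feedback:
// the model samples a completion, this function scores it automatically, and
// reinforcement learning raises the probability of high-scoring completions.
double reward(const std::string& completion, const std::string& reference) {
    double score = 0.0;
    // Format reward: the model must show its reasoning inside <think>...</think>.
    if (std::regex_search(completion, std::regex("<think>[\\s\\S]*</think>")))
        score += 0.5;
    // Accuracy reward: the answer inside <answer>...</answer> must match a
    // known, machine-checkable reference (e.g. the solution to a math problem).
    std::smatch m;
    if (std::regex_search(completion, m, std::regex("<answer>([\\s\\S]*)</answer>")) &&
        m[1].str() == reference)
        score += 1.0;
    return score;
}

int main() {
    std::string sample = "<think>12 * 12 = 144</think><answer>144</answer>";
    std::cout << reward(sample, "144") << "\n";  // 1.5: correct format and answer
    return 0;
}
```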
In other words, top US firms may have figured out how to do it but were keeping quiet. “It seems that there’s a clever way of taking your base model, your pretrained model, and turning it into a much more capable reasoning model,” says Zeiler. “And up to this point, the procedure that was required for converting a pretrained model into a reasoning model wasn’t well known. It wasn’t public.”
What’s different about R1 is that DeepSeek published how they did it. “And it turns out that it’s not that expensive a process,” says Zeiler. “The hard part is getting that pretrained model in the first place.” As Karpathy revealed at Microsoft Build last year, pretraining a model represents 99% of the work and most of the cost.
If building reasoning models is not as hard as people thought, we can expect a proliferation of free models that are far more capable than we’ve yet seen. With the know-how out in the open, Friedman thinks, there will be more collaboration between small firms, blunting the edge that the biggest firms have enjoyed. “I think this could be a monumental moment,” he says.