Unlike reasoning models such as o1 and o3, which work through answers step by step, most large language models like GPT-4.5 spit out the first response they come up with. But GPT-4.5 is more general-purpose. Tested on SimpleQA, a general-knowledge quiz developed by OpenAI last year that features questions on topics from science and technology to TV shows and video games, GPT-4.5 scores 62.5%, compared with 38.6% for GPT-4o and 15% for o3-mini.
What’s more, OpenAI claims that GPT-4.5 responds with far fewer made-up answers (known as hallucinations). On the same test, GPT-4.5 made up answers 37.1% of the time, compared with 59.8% for GPT-4o and 80.3% for o3-mini.
But SimpleQA is only one benchmark. On other tests, including MMLU, a more common benchmark for comparing large language models, GPT-4.5 beat OpenAI’s previous models by a smaller margin. And on standard science and math benchmarks, GPT-4.5 scores worse than o3-mini.
Turning on the charm
GPT-4.5’s special charm appears to be its conversational skills. Human testers employed by OpenAI say they preferred GPT-4.5 to GPT-4o for everyday queries, professional queries, and creative tasks, including coming up with poems. (Ryder says it is also great at old-school internet ASCII art.)
For example, tell it that you’re going through a rough patch and GPT-4.5 might offer a few words of sympathy before saying: “Want to talk about what happened, or do you just need a distraction? I’m here either way.” GPT-4o is less good at reading social cues and may try to fix the problem whether you asked it to or not, hitting you with a bulleted list of ways to cheer yourself up.
And yet after years at the top, OpenAI faces a tough crowd. “The focus on emotional intelligence and creativity is cool for niche use cases like writing coaches and brainstorming buddies,” says Waseem Alshikh, cofounder and CTO of Writer, a startup that develops large language models for enterprise customers.
“But GPT-4.5 feels like a shiny new coat of paint on the same old car,” he says. “Throwing more compute and data at a model can make it sound smoother, but it’s not a game-changer.”
“The juice isn’t worth the squeeze when you consider the energy costs and the fact that most users won’t notice the difference in daily use,” he says. “I’d rather see them pivot to efficiency or niche problem-solving than keep supersizing the same recipe.”