5 innovations and one challenge: how 'o3' changed AI


(Photo = Shutterstock)

An analysis has emerged showing that the reasoning model 'o3' released by OpenAI has significantly raised the level of existing artificial intelligence (AI) in five respects. However, its high cost remains an issue that must be solved over the longer term.

On the 29th (local time), VentureBeat analyzed the core innovations and problems of o3 by compiling the assessments of experts and of François Chollet, co-founder of the ARC Prize Foundation and creator of the ARC-AGI benchmark.

Chollet designed the benchmark on which o3 became the first model to match human-level performance.

He revealed that on the ARC-AGI test, the o1 model scored a maximum of 32 points, while o3 showed rapid progress to 75.7 points, and when the inference time was increased, the score rose to a maximum of 87.5 points. He emphasized that this is an important milestone that surpasses the 85 points a human can achieve.

Through this test, o3's core innovations were identified as ▲program synthesis ▲chain of thought (CoT) and natural language program search ▲an evaluator model ▲execution of its own programs ▲deep learning-guided program search.

First, program synthesis is a technique in which an AI model creates and combines small programs to solve more complex problems.

Existing large language models (LLMs) have absorbed a great deal of knowledge, but because they lack compositionality, they are unable to solve problems that go beyond the data they were trained on. This method, however, allows o3 to quickly adapt to new patterns, much like humans do, and to solve tasks it never encountered directly during training. Chollet described program synthesis as "the ability of a system to recombine known tools in innovative ways," likening it to a chef creating a novel dish from familiar ingredients.
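To make the idea concrete, here is a minimal sketch of program synthesis over a tiny library of grid primitives, loosely in the style of ARC tasks. The primitives, the brute-force search, and the example task are illustrative assumptions for this article, not o3's actual mechanism, which OpenAI has not published.

```python
from itertools import product

# Illustrative primitive "tools" operating on small grids (lists of lists).
def flip_horizontal(grid):
    return [row[::-1] for row in grid]

def flip_vertical(grid):
    return grid[::-1]

def rotate_90(grid):
    return [list(row) for row in zip(*grid[::-1])]

PRIMITIVES = [flip_horizontal, flip_vertical, rotate_90]

def apply_program(program, grid):
    for step in program:
        grid = step(grid)
    return grid

def synthesize(examples, max_depth=3):
    """Brute-force search for a composition of primitives that maps
    every input grid to its output grid in `examples`."""
    for depth in range(1, max_depth + 1):
        for program in product(PRIMITIVES, repeat=depth):
            if all(apply_program(program, x) == y for x, y in examples):
                return program  # a recombination of known tools
    return None

# Example: the hidden transformation is a 180-degree rotation.
examples = [([[1, 2], [3, 4]], [[4, 3], [2, 1]])]
print([f.__name__ for f in synthesize(examples)])
```

The point of the sketch is that the system never learned the target transformation directly; it recombines tools it already has until the examples are satisfied.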

Second, he explained that the core of o3's reasoning is the CoT and the sophisticated search process used when solving problems.

When producing a solution, the model organizes its thinking over several steps and, through this process, arrives at an answer. CoT is a technique in which a model provides step-by-step explanations in natural language to solve a problem. Here, search is brought in, much like retrieval-augmented generation (RAG), to reduce hallucinations as much as possible.
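The sketch below shows how CoT prompting can be paired with a retrieval step in the spirit of RAG. The keyword-overlap retriever, the document list, and the prompt wording are assumptions made for illustration; they are not o3's internal search procedure.

```python
# Toy document store standing in for a real retrieval index.
DOCUMENTS = [
    "ARC-AGI tasks give a few input/output grid examples per task.",
    "Codeforces ratings above 2700 correspond to top-tier competitors.",
    "Retrieval-augmented generation grounds answers in fetched text.",
]

def retrieve(query: str, top_k: int = 2) -> list[str]:
    """Rank documents by simple keyword overlap with the query."""
    words = set(query.lower().split())
    ranked = sorted(DOCUMENTS, key=lambda d: -len(words & set(d.lower().split())))
    return ranked[:top_k]

def build_cot_prompt(question: str) -> str:
    """Assemble a prompt that asks for step-by-step reasoning grounded
    in the retrieved snippets, narrowing the room for hallucination."""
    context = "\n".join(f"- {doc}" for doc in retrieve(question))
    return (
        f"Reference material:\n{context}\n\n"
        f"Question: {question}\n"
        "Reason step by step, citing the reference material, "
        "then give the final answer on the last line."
    )

print(build_cot_prompt("What does an ARC-AGI task look like?"))
```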

This is where the evaluator model comes in. Trained on data labeled by experts, it helps o3 solve complex, multi-step problems logically. Thanks to this function, o3 is able to review and judge its own logic rather than simply producing an answer, which suggests that LLMs are moving closer to genuine thinking beyond simple reactions.
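A minimal sketch of the evaluator idea: several candidate reasoning chains are produced, each is scored, and the highest-scoring chain is kept. The hard-coded candidates and the toy scoring heuristic stand in for a learned evaluator; OpenAI has not disclosed how its evaluator is actually built.

```python
# Candidate reasoning chains for the same question (normally sampled from the model).
CANDIDATE_CHAINS = [
    "Step 1: guess the answer directly. Answer: 41",
    "Step 1: 6 * 7 = 42. Step 2: check that 42 / 7 = 6. Answer: 42",
    "Step 1: 6 + 7 = 13. Answer: 13",
]

def evaluator_score(chain: str) -> float:
    """Toy stand-in for a learned evaluator: reward chains with more
    explicit, verified intermediate steps."""
    steps = chain.count("Step")
    has_check = 1.0 if "check" in chain else 0.0
    return steps + has_check

best = max(CANDIDATE_CHAINS, key=evaluator_score)
print(best)  # the chain with the most verified reasoning wins
```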

Google and Anthropic have also attempted similar approaches, but OpenAI implemented the idea in a new way.

However, it is pointed out that this process is not entirely reliable because the results are not evaluated against real-world scenarios. Stability may be low in unpredictable situations or on unusual problems, and expert labeling is required to train the evaluator model, which can be problematic in terms of cost and scalability.

One of o3's unique features is that it can execute CoT itself and use it as a problem-solving tool. Originally, CoT was a logical device for working through problems step by step, but OpenAI expanded this idea and turned CoT into a reusable component.

Over time, CoT becomes a tool for recording and organizing problem-solving strategies, operating much the way humans learn and improve from experience. In other words, it serves as a basis for o3 to respond more flexibly to new problems. In fact, this feature is believed to have contributed significantly to o3 scoring over 2700 points on Codeforces, placing it in the top tier of programmers worldwide.
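One way to picture "CoT as a reusable component" is a memory of successful reasoning traces that can be recalled for similar problems later. The problem signature and the in-memory store below are assumptions for illustration only, not a description of o3's internals.

```python
from collections import defaultdict

# Successful reasoning traces, keyed by a crude problem-type signature.
strategy_memory: dict[str, list[str]] = defaultdict(list)

def signature(problem: str) -> str:
    """Toy problem-type signature: here, just the first word of the task."""
    return problem.split()[0].lower()

def record_strategy(problem: str, cot_trace: str) -> None:
    """Store a chain of thought that solved the problem."""
    strategy_memory[signature(problem)].append(cot_trace)

def recall_strategies(problem: str) -> list[str]:
    """Fetch previously successful traces for a similar problem type."""
    return strategy_memory[signature(problem)]

record_strategy("sort the list [3, 1, 2]",
                "Compare adjacent items and swap until the list is ordered.")
print(recall_strategies("sort these names alphabetically"))
```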

However, o3's biggest drawback is its high computational cost, consuming tens of millions of tokens. This greatly reduces the model's accessibility.

For this reason, experts point to the need for innovation that strikes a balance between performance and cost efficiency. OpenAI has also made it possible to set o3's compute to low, medium, or high and adjust the inference time accordingly. In addition, the cheaper 'o3-mini' was unveiled.
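For reference, the sketch below shows how such a compute setting is typically exposed through an API call, assuming OpenAI's `reasoning_effort` parameter and the "o3-mini" model name; exact parameter names, availability, and pricing may differ from what is shown.

```python
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

# "low" | "medium" | "high" trades inference-time compute (and cost) for accuracy.
response = client.chat.completions.create(
    model="o3-mini",
    reasoning_effort="high",
    messages=[{"role": "user", "content": "Prove that 2**10 > 10**3."}],
)
print(response.choices[0].message.content)
```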

However, real-world usage will be needed to confirm how large the performance differences actually are.

Therefore, o3's practicality must be verified through its standalone performance, but its performance-to-cost efficiency will inevitably be compared with models from Google and Anthropic.

The assessment is that if the performance gap is not large relative to the cost, or if there are not enough use cases to justify the high cost, o3 may end up as nothing more than a symbol of technological progress.

Reporter Park Chan cpark@aitimes.com
