AI benchmarking under fire

Good morning, AI enthusiasts. Leaderboard prestige can make or break an AI model’s launch, but a new study claims the scoreboard may be stacked in favor of tech giants.

With allegations of private testing and biased sampling potentially distorting the results of a popular benchmarking platform, the AI evaluation game just got a lot hazier.

Reminder: Our next workshop is today at 4 PM EST. Attend and learn how to use Google’s NotebookLM to enhance your research, studying, and teaching! RSVP here.

In today’s AI rundown:

  • Study questions leading AI benchmark

  • Microsoft’s new small reasoning models

  • Create websites using ChatGPT o3 and Canvas

  • Amazon’s new Nova Premier teacher model

  • 4 new AI tools & 4 job opportunities

LATEST DEVELOPMENTS

AI BENCHMARKING

🎯 Study questions leading AI benchmark

Image source: Cohere Labs

The Rundown: A new study from researchers at Cohere Labs, MIT, Stanford, and other institutions claims that LMArena, the leading crowdsourced AI benchmark, gives unfair advantages to major tech companies, potentially distorting its widely followed rankings.

The details:

  • The study claims providers like Meta, Google, and OpenAI privately test multiple model variants on the Arena and publish only the best performers.

  • It also found that models from top labs were favored over smaller and open models in sampling, with Google and OpenAI receiving over 60% of all interactions.

  • Experiments showed that access to Arena data boosts performance on Arena-specific tasks, suggesting model overfitting rather than actual capability gains.

  • The researchers also noted that 205 models have been silently removed from the platform, with open-source models deprecated at a higher rate.
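
To see why privately testing several variants and publishing only the top scorer could inflate a ranking, here is a minimal Python sketch. It is not the study's methodology: the true score, the noise level, and the function name `simulate_best_of_n` are illustrative assumptions, and every variant is given identical underlying ability so that any gain comes purely from selection.

```python
import random
import statistics


def simulate_best_of_n(n_variants, true_score=1200.0, noise_sd=15.0,
                       trials=2000, seed=0):
    """Average published score when only the best of n noisy measurements is kept.

    All variants share the same true_score (an assumption), so any lift in
    the result is a selection effect, not a capability gain.
    """
    rng = random.Random(seed)  # fixed seed keeps the sketch reproducible
    published = []
    for _ in range(trials):
        # Each private variant gets one noisy leaderboard measurement.
        measured = [rng.gauss(true_score, noise_sd) for _ in range(n_variants)]
        # Only the highest measurement is made public.
        published.append(max(measured))
    return statistics.mean(published)


single = simulate_best_of_n(1)
best_of_ten = simulate_best_of_n(10)
print(f"one submission:   {single:.1f}")
print(f"best of ten kept: {best_of_ten:.1f}")
```

Under these assumptions, the best-of-ten average lands noticeably above the single-submission average even though no variant is actually better than another.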

Why it matters: LMArena has disputed the study, claiming the leaderboard reflects real user preferences. Nonetheless, these claims could damage the credibility of a platform that shapes how models are perceived. Combined with the Llama 4 Maverick benchmark debacle, this study highlights that AI evaluation is not always what it seems.

TOGETHER WITH INNOVATING WITH AI

🚀 Launch a six‑figure AI consultancy in six months

The Rundown: Innovating With AI’s “The AI Consultancy Project” gives you the frameworks, playbooks, and client‑ready templates needed to turn “interesting AI ideas” into a revenue‑generating business, helping you ride the wave of an AI consulting boom expected to grow by 8x this decade.

In this program, you’ll:

  • Gain the tools and frameworks to find clients and deliver top-notch services

  • Follow a 6-month plan to build a 6-figure AI consulting business

  • Join a 700‑strong cohort where some members landed their first AI client in 72 hours

Click here to request access to The AI Consultancy Project.

MICROSOFT

🧠 Microsoft’s new small reasoning models

Image source: Microsoft

The Rundown: Microsoft just launched three new reasoning-focused, open-weights models in its Phi family, which outperform larger rivals at complex reasoning tasks while being small enough to run on phones and laptops.

The details:

  • The flagship Phi-4-reasoning has just 14B parameters but outperforms OpenAI’s o1-mini and matches DeepSeek’s 671B model on key benchmarks.

  • A smaller 3.8B parameter version called Phi-4-mini-reasoning can run on mobile devices while matching larger 7B models on math benchmarks.

  • Designed for efficiency, the models aim to bring strong reasoning capabilities to constrained environments (like edge devices and Copilot+ PCs).

  • All three models are open-source with permissive licenses, allowing unrestricted commercial use and modification by developers.

Why it matters: Microsoft continues to raise the bar with its small but powerful Phi family, with the latest launch bringing highly capable reasoning to models sized to fit on phones and laptops. It’s still early days for truly system-integrated AI on devices, but Microsoft’s Copilot+ PCs might be the biggest beneficiary of this reasoning boost.

AI TRAINING

🖥️ Create websites using ChatGPT o3 and Canvas

The Rundown: In this tutorial, you’ll learn how to create fully functional web applications with database capabilities using ChatGPT o3 and Canvas, and then deploy them for free, no coding skills required.

Step-by-step:

  1. Head over to ChatGPT, select the “o3” model, and activate the “Canvas” option.

  2. Prepare a detailed prompt describing your desired HTML web application, including purpose, features, design preferences, and functionality requirements.

  3. Test your application using the “Preview” button and request any necessary modifications.

  4. Save the code as an HTML file and deploy using Cloudflare by navigating to Workers & Pages, choosing “Create using direct upload,” and uploading the file.

Pro tip: Applications that use the browser’s local storage will retain user data between sessions even when deployed, which makes this approach a good fit for small applications.

PRESENTED BY CONVEYOR

The Rundown: Most vendors building with AI talk a big game. Sue, Conveyor’s AI Agent for Customer Trust, actually does the work: deploying across F1000 enterprises to fully run customer security reviews, skip busywork, and keep deals moving with no headaches or delays.

Sue, the AI Agent, can:

  • Manage customer security requests seamlessly across teams and tools

  • Handle any questionnaire format

  • Automatically complete, reject, or escalate questionnaires

  • Personalize your Trust Center with data from conversation intelligence tools

Learn more and integrate Sue into your infosec and sales workflow today.

AMAZON

Image source: Amazon

The Rundown: Amazon just launched Nova Premier, the company’s most advanced AI model yet, designed both to handle complex tasks and to act as a “teacher” that fine-tunes smaller models to match its capabilities.

The details:

  • The multimodal model can process text, images, and videos with a 1M-token context window, allowing it to analyze about 750,000 words at once.

  • Internal testing shows Premier lagging behind top competitors like Gemini 2.5 Pro on math, science, and coding benchmarks.

  • Nova Premier excels at orchestrating multi-agent workflows, showing strength in financial analysis and investment research applications in testing.

  • Using Amazon’s Bedrock Model Distillation, Premier can transfer capabilities to smaller models like Nova Pro and Micro and boost performance by up to 20%.
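
The 1M-token and 750,000-word figures above imply roughly 0.75 words per token. A rough Python sketch of that arithmetic follows. The ratio is only what those two numbers imply, real tokenizers vary with language and content, and the helper names `estimated_tokens` and `fits_in_context` are hypothetical.

```python
# Figures quoted in the bullet above.
CONTEXT_TOKENS = 1_000_000
APPROX_WORDS = 750_000


def estimated_tokens(word_count):
    """Estimate token count from a word count using the implied ratio.

    Uses integer ceiling division so the result is exact and rounds up.
    """
    return -(-word_count * CONTEXT_TOKENS // APPROX_WORDS)


def fits_in_context(word_count, context_tokens=CONTEXT_TOKENS):
    """Return True if the estimated token count fits in the context window."""
    return estimated_tokens(word_count) <= context_tokens


# Example: a hypothetical 90,000-word manuscript.
manuscript_words = 90_000
print(estimated_tokens(manuscript_words), fits_in_context(manuscript_words))
```

Treat the result as a ballpark check only; a tokenizer-based count is the reliable way to confirm that a document fits.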

Why it matters: With Nova Premier, Amazon positions its top offering not as a direct rival for cutting-edge reasoning tasks but rather as a powerful teacher that can uplift its entire model family, suggesting a focus on optimizing performance and prioritizing efficient, task-specific deployments over a single powerhouse model.

QUICK HITS

🛠️ Trending AI Tools

  • 🎥 Gen-4 References – Generate consistent characters and scenes in videos

  • 🎨 Gemini App – New update with native AI image editing capabilities

  • 🧠 MiMo-7B – Xiaomi’s small but powerful open-source reasoning model

  • 📸 F-Lite – Freepik’s open-weights image generation model

💼 AI Job Opportunities

  • 🛡️ Harvey – Technical Program Manager

  • ⚖️ UiPath – Legal Intern

  • 🤖 Siena – AI Engineer

  • 📝 Meta – Content Strategist

📰 Everything else in AI today

Anthropic released Integrations, allowing Claude to connect with remote MCPs to integrate additional tools, alongside new research capabilities like web support.

NVIDIA criticized Anthropic’s AI chip export policy recommendations, arguing that U.S. firms should focus on innovation instead of limiting competitiveness with policy.

Google expanded its AI Mode in Search to all Labs users in the U.S., also introducing new visual shopping and local planning features.

Suno introduced v4.5 of its AI music generation platform, adding new genres, better prompting and adherence, the ability to create songs up to 8 minutes long, and more.

Microsoft is reportedly adding xAI’s Grok model to its Azure development platform, coming amid rumored tensions between CEO Satya Nadella and OpenAI’s Sam Altman.

Google launched Little Language Lessons, three new AI-powered experiments that use Gemini’s multilingual capabilities for personalized learning experiences.

COMMUNITY

🎥 Join our next live workshop

Join our next workshop today at 4 PM EST with Dr. Alvaro Cintas, The Rundown’s AI professor. By the end of the workshop, you’ll be able to confidently use Google NotebookLM to enhance your research, studying, and teaching.

RSVP here. Not a member? Join The Rundown University on a 14-day free trial.

🤝 Share The Rundown, get rewards

We’ll always keep this newsletter 100% free. To support our work, consider sharing The Rundown with your friends, and we’ll send you more free goodies.

That’s it for today!

Before you go, we’d love to know what you thought of today’s newsletter to help us improve The Rundown experience for you.
  • ⭐️⭐️⭐️⭐️⭐️ Nailed it
  • ⭐️⭐️⭐️ Average
  • ⭐️ Fail


See you soon,

Rowan, Joey, Zach, Alvaro, and Jason—The Rundown’s editorial team
