Did xAI lie about Grok 3’s benchmarks? | TechCrunch



Debates over AI benchmarks, and the way AI labs report them, are spilling out into public view.

This week, an OpenAI employee accused Elon Musk's AI company, xAI, of publishing misleading benchmark results for its newest AI model, Grok 3. One of xAI's co-founders, Igor Babushkin, insisted that the company was in the right.

The truth lies somewhere in between.

In a post on xAI's blog, the company published a graph showing Grok 3's performance on AIME 2025, a set of challenging math questions from a recent invitational mathematics exam. Some experts have questioned AIME's validity as an AI benchmark. Nevertheless, AIME 2025 and older versions of the test are commonly used to probe a model's math ability.

xAI's graph showed two variants of Grok 3, Grok 3 Reasoning Beta and Grok 3 mini Reasoning, beating OpenAI's best-performing available model, o3-mini-high, on AIME 2025. But OpenAI employees on X were quick to point out that xAI's graph didn't include o3-mini-high's AIME 2025 score at "cons@64."

What's cons@64, you might ask? Well, it's short for "consensus@64," and it basically gives a model 64 tries to answer each problem in a benchmark and takes the answers it generates most often as the final answers. As you can imagine, cons@64 tends to boost models' benchmark scores quite a bit, and omitting it from a graph can make it appear as though one model surpasses another when in reality, that isn't the case.
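To make the mechanism concrete, here's a minimal sketch in Python of how a consensus@64 score could be computed, with a hypothetical `sample_answer` function standing in for a call to the model. By contrast, an "@1" score simply checks the model's single first answer per problem.

```python
from collections import Counter

def cons_at_k(sample_answer, problem, k=64):
    """Sample k answers to one problem and return the most frequent one."""
    answers = [sample_answer(problem) for _ in range(k)]
    # Majority vote: the answer generated most often becomes the final answer.
    return Counter(answers).most_common(1)[0][0]

def benchmark_score(sample_answer, problems, k=64):
    """Fraction of problems whose consensus@k answer matches the reference answer."""
    correct = sum(
        cons_at_k(sample_answer, p["question"], k) == p["answer"]
        for p in problems
    )
    return correct / len(problems)
```

Because the majority vote filters out one-off mistakes, this number is almost always higher than the score a model gets on its first attempt, which is why comparing one model's cons@64 against another's @1 is misleading.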

Grok 3 Reasoning Beta's and Grok 3 mini Reasoning's scores for AIME 2025 at "@1" (meaning the first score the models got on the benchmark) fall below o3-mini-high's score. Grok 3 Reasoning Beta also trails ever so slightly behind OpenAI's o1 model set to "medium" computing. Yet xAI is advertising Grok 3 as the "world's smartest AI."

Babushkin argued on X that OpenAI has published similarly misleading benchmark charts in the past, albeit charts comparing the performance of its own models. A more neutral party in the debate put together a more "accurate" graph showing nearly every model's performance at cons@64:

But as AI researcher Nathan Lambert pointed out in a post, perhaps the most important metric remains a mystery: the computational (and monetary) cost it took for each model to achieve its best score. That just goes to show how little most AI benchmarks communicate about models' limitations, and their strengths.




