Questions Raised Over OpenAI’s Benchmark Claims for o3 Model
"Please, thank you" is costing your LLM millions
In Today’s Issue:
"Please, thank you" is costing your LLM millions
Questions Raised Over OpenAI’s Benchmark Claims for o3 Model
Read time: 2 minutes
Questions Raised Over OpenAI’s Benchmark Claims for o3 Model
A new benchmark test has cast doubt on OpenAI’s performance claims for its o3 model. When o3 was introduced in December, OpenAI said it could solve over 25% of problems in FrontierMath, a tough math benchmark. That result was far ahead of other models, which scored below 2%.
But according to new data from Epoch AI, o3 actually scored closer to 10% — far lower than the original figure. The discrepancy likely stems from differences in compute power and test setups. OpenAI used a more powerful internal version of o3 for its original tests, while the public version released last week is optimized for speed and cost.
OpenAI has not denied the gap. A technical staff member said in a livestream that the public release was tuned for “real-world use cases,” and might show “disparities” on benchmarks.
The ARC Prize Foundation, which tested an earlier version of o3, confirmed that the current public version is “a different model,” with smaller compute tiers than the one OpenAI originally benchmarked.
Benchmark differences aren’t new in the AI world. Meta and xAI have also faced scrutiny over mismatched testing setups. For many, it’s another reminder not to take AI benchmark claims at face value — especially when they come from the companies behind the models.

"Please, thank you" is costing your LLM millions
I wonder how much money OpenAI has lost in electricity costs from people saying “please” and “thank you” to their models.
— tomie (@tomieinlove)
11:28 PM • Apr 15, 2025
Could being polite to ChatGPT be racking up OpenAI’s electricity bill? A user on X jokingly posed the question, asking how much the company has spent on people typing “please” and “thank you” into its models.
Sam Altman responded: “Tens of millions of dollars well spent — you never know.”
tens of millions of dollars well spent--you never know
— Sam Altman (@sama)
11:15 PM • Apr 16, 2025
Some wondered if these extra words are just wasted energy, especially given the massive power demands behind each query.
But according to Microsoft Copilot’s design director Kurt Beavers, it’s not just about manners. He said polite inputs can actually lead to more polite and helpful outputs, as the model “sets a tone” for the conversation.
So while saying “please” might cost a few extra tokens — and maybe a few extra cents — it could also make your chatbot just a little bit nicer.