Questions Raised Over OpenAI’s Benchmark Claims for o3 Model
"Please, thank you" is costing your LLM millions
In Today’s Issue:
"Please, thank you" is costing your LLM millions
Questions Raised Over OpenAI’s Benchmark Claims for o3 Model
Read time: 2 minutes
Questions Raised Over OpenAI’s Benchmark Claims for o3 Model
A new benchmark test has cast doubt on OpenAI’s performance claims for its o3 model. When o3 was introduced in December, OpenAI said it could solve over 25% of problems in FrontierMath, a tough math benchmark. That result was far ahead of other models, which scored below 2%.
But according to new data from Epoch AI, o3 actually scored closer to 10% — far lower than the original figure. The discrepancy likely stems from differences in compute power and test setups. OpenAI used a more powerful internal version of o3 for its original tests, while the public version released last week is optimized for speed and cost.
OpenAI has not denied the gap. A technical staff member said in a livestream that the public release was tuned for “real-world use cases,” and might show “disparities” on benchmarks.
The ARC Prize Foundation, which tested an earlier version of o3, confirmed that the current public version is “a different model,” with smaller compute tiers than the one OpenAI originally benchmarked.
Benchmark differences aren’t new in the AI world. Meta and xAI have also faced scrutiny over mismatched testing setups. For many, it’s another reminder not to take AI benchmark claims at face value — especially when they come from the companies behind the models.

"Please, thank you" is costing your LLM millions
I wonder how much money OpenAI has lost in electricity costs from people saying “please” and “thank you” to their models.
— tomie (@tomieinlove)
11:28 PM • Apr 15, 2025
Could being polite to ChatGPT be racking up OpenAI’s electricity bill? A user on X jokingly posed the question, asking how much the company has spent on people typing “please” and “thank you” into its models.
Sam Altman responded: “Tens of millions of dollars well spent — you never know.”
tens of millions of dollars well spent--you never know
— Sam Altman (@sama)
11:15 PM • Apr 16, 2025
Some wondered if these extra words are just wasted energy, especially given the massive power demands behind each query.
But according to Microsoft Copilot’s design director Kurt Beavers, it’s not just about manners. He said polite inputs can actually lead to more polite and helpful outputs, as the model “sets a tone” for the conversation.
So while saying “please” might cost a few extra tokens — and maybe a few extra cents — it could also make your chatbot just a little bit nicer.