AI benchmarks are mostly bogus, study finds


According to TheRegister.com, a study from Oxford Internet Institute researchers found that only 16 percent of 445 large language model benchmarks use rigorous scientific methods to compare performance. The research, involving multiple universities and organizations, revealed that about half of benchmarks claim to measure abstract ideas like reasoning without clear definitions or measurement methods. When OpenAI released GPT-5 earlier this year, the company heavily promoted benchmark scores including 94.6 percent on AIME 2025 math tests and 74.9 percent on SWE-bench Verified coding assessments. The study also found 27 percent of benchmarks rely on convenience sampling rather than proper statistical methods. Lead author Andrew Bean warned that without sound measurement, it’s hard to know if models are genuinely improving or just appearing to advance.


The gaming of AI testing

Here’s the thing about those impressive benchmark numbers you see in every AI announcement: many of them reflect a gamed test as much as genuine capability. The Oxford study found that many benchmarks reuse questions from calculator-free exams, where the numbers are chosen specifically to make the arithmetic easy. So when an AI scores 94.6% on a math test, it might just mean it’s good at the particular kinds of problems that appear in that specific dataset; it doesn’t necessarily translate to real-world mathematical ability. And with 27% of benchmarks using convenience sampling, picking data because it’s easy to get rather than because it’s representative, we’re getting a distorted picture of actual AI capabilities.
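To make the sampling point concrete, here’s a minimal Python sketch. It is not from the study, and every number in it is invented purely for illustration: if the cheap-to-collect items happen to be the easy ones, the reported accuracy climbs even though the model hasn’t changed.

```python
import random

random.seed(0)

# Hypothetical item pool: each item has a difficulty and the model's
# true probability of answering it correctly. All figures are made up
# for illustration, not drawn from any real benchmark.
pool = (
    [{"difficulty": "easy", "p_correct": 0.95} for _ in range(800)]
    + [{"difficulty": "hard", "p_correct": 0.55} for _ in range(200)]
)

def simulate_score(items):
    """Simulate an accuracy score by flipping a biased coin per item."""
    return sum(random.random() < item["p_correct"] for item in items) / len(items)

# Convenience sample: grab 200 easy items because they're cheap to collect.
convenience = [item for item in pool if item["difficulty"] == "easy"][:200]

# Representative sample: draw 200 items uniformly at random from the pool.
representative = random.sample(pool, 200)

print(f"convenience-sample accuracy:    {simulate_score(convenience):.1%}")
print(f"representative-sample accuracy: {simulate_score(representative):.1%}")
```

Run it and the convenience sample lands around 95% while the representative one sits in the high 80s, a gap produced entirely by which questions got asked.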

Measuring the unmeasurable

About half of these benchmarks are trying to measure things like “reasoning” or “harmlessness” without even defining what those terms mean. Think about that for a second. How can you claim your AI is better at reasoning if you haven’t defined what reasoning actually looks like in measurable terms? It’s like saying one car is “more sporty” without specifying whether you’re talking about acceleration, handling, or just the color of the paint. The researchers created an eight-point checklist to fix this mess, including defining what’s being measured and using proper statistical methods.
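The article doesn’t reproduce the checklist itself, so as one hedged illustration of what “proper statistical methods” can mean in practice, here’s a short Python sketch that puts a 95% confidence interval around a benchmark accuracy. The item counts and scores are invented; the point is that a two-point gap on a few hundred questions can sit comfortably inside the noise.

```python
import math

def wilson_interval(correct: int, total: int, z: float = 1.96):
    """95% Wilson score interval for an accuracy measured on `total` items."""
    p = correct / total
    denom = 1 + z ** 2 / total
    centre = (p + z ** 2 / (2 * total)) / denom
    half = z * math.sqrt(p * (1 - p) / total + z ** 2 / (4 * total ** 2)) / denom
    return centre - half, centre + half

# Invented scores: two models on a hypothetical 500-item benchmark, two points apart.
for name, correct in [("model_a", 375), ("model_b", 385)]:
    low, high = wilson_interval(correct, 500)
    print(f"{name}: {correct / 500:.1%} accuracy, 95% CI [{low:.1%}, {high:.1%}]")
```

With 500 items, each interval spans nearly four points either side of the score and the two intervals overlap heavily, so a headline claim that model_b “beats” model_a would need more evidence than the raw numbers.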

Why this matters beyond marketing

This isn’t just academic nitpicking; it has real consequences for businesses and developers making decisions based on these benchmarks. Companies are choosing which AI models to integrate into their products based on performance claims that may be scientifically shaky, and enterprises deploying AI need reliable measurements, not marketing fluff. A model that tops a leaderboard built on convenience-sampled, loosely defined tasks can still disappoint on the workload you actually care about.

The AGI benchmark mess

Perhaps the most telling part of this whole situation is how OpenAI and Microsoft are handling their internal AGI benchmark. According to The Information, they’ve basically defined artificial general intelligence—supposedly “AI systems that are generally smarter than humans”—as systems that generate at least $100 billion in profits. Seriously? When you can’t properly measure intelligence, you just fall back to measuring money? It reveals how fundamentally broken our evaluation frameworks are. Meanwhile, some benchmark designers are trying to fix things—the same day the Oxford study dropped, the ARC Prize Foundation announced a verification program to increase testing rigor. But we’ve got a long way to go before those flashy benchmark numbers actually mean what they claim.
