Have you ever listened to the creators of the latest large language model talk about how their model is the "most powerful", "most intelligent" or "best" model to date, and wondered how they're measuring this? One of the most common ways this is assessed is by seeing how these models perform on a variety of LLM evaluation benchmarks. You can see a recent example of this below, from Anthropic's release post for Claude 4.
These numbers seem very impressive, but what do they all mean? And more importantly, do they really tell us how good an LLM is?
What are LLM benchmarks?
Let’s start with a definition: what are LLM benchmarks? They're essentially standardised tests made up of specific tasks, designed to allow comparison of performance across LLMs. Initially, LLM benchmarks focused on models' ability to do natural language tasks like summarisation or text classification, but as the abilities of LLMs appeared to grow beyond this, benchmarks have expanded to cover everything from domain-specific knowledge to reasoning to coding ability to even artificial general intelligence (AGI).
Do LLM benchmarks really measure what they claim to?
So we have all of these lovely benchmarks that can objectively tell us how good a model is at some specific skill compared to other models. Great, right? Well, not so fast. Unfortunately, as we'll see in this blog post, these benchmarks have major issues when it comes to accurate measurement. This is not really an intentional mistake on the part of the benchmark creators; it's more that measuring things correctly is genuinely hard.
The science of measurement is called psychometrics, and it’s something I got drummed into my head from my first year as an undergraduate psychology student. As psychologists have to measure so many slippery concepts that can't be directly measured, we're taught how to carefully design assessments and check whether our assumptions about what we're measuring are correct. To do this, we rely on a branch of mathematics called measurement theory, which gives us tools to establish whether we've created accurate measures.
One of the cornerstones of measurement theory is validity, which is the idea that you're measuring what you intended to measure. While this seems straightforward, it's actually pretty hard to get right. You can see there are quite a few types of validity, as shown in the table below, and benchmarks need to get all of these right in order to be robust assessments.
| Type of validity | What it asks | What goes wrong when it fails |
|---|---|---|
| Content validity | Do the items adequately cover the construct’s content domain? | Items are missing key facets, are off-topic, or are poorly defined |
| Construct validity | Does the test actually measure the intended construct? | The test measures a different ability than claimed |
| Convergent validity | Does the measure correlate strongly with other measures it should? | Weak correlations with closely related measures |
| Discriminant validity | Does the measure not correlate with measures it shouldn’t? | High correlations with unrelated constructs (poor separation) |
| Concurrent validity | Does it agree with relevant measures taken at the same time? | Scores disagree with other “same-thing” measures administered together |
| Predictive validity | Does it predict relevant future outcomes/behaviour? | High scores don’t translate into later performance where it matters |
| Criterion validity | Does it relate to an external criterion (broad umbrella term)? | Weak/unstable relationship to the chosen external benchmark/outcome |
| Ecological validity | Does performance generalise to realistic, messy, real-world settings? | Strong lab/benchmark results fail to hold in deployment |
As cutting-edge NLP models started to supposedly show traits that are traditionally measured by psychologists, things like reasoning and intelligence, people who don't have psychometric training had to come up with ways to assess these qualities in LLMs. And as a result, we're seeing all sorts of validity issues in benchmarks - on the surface they appear to measure what they set out to, but on closer inspection this is not true. Let's explore some examples of this from the past couple of years.
Early benchmarks have serious problems with question quality
From 2020 onwards, LLMs seemed to demonstrate abilities that went beyond the natural language tasks they were trained to do. In response, researchers scrambled to create new benchmarks to try to capture these new abilities. In this rush to get benchmarks out the door, some researchers started taking shortcuts, such as not properly screening and verifying every test item. This has led to some glaring (and pretty hilarious) quality issues in the first generation of benchmarks.
One of these benchmarks with quality issues was the MMLU (Massive Multitask Language Understanding), a benchmark that assesses LLM reasoning, problem-solving and factual recall across a range of domains. In 2023 and early 2024, this benchmark was considered so good at assessing the performance of LLMs that model creators would compare their MMLU scores down to a couple of decimal places. Sounds pretty scientific, right?
However, by mid-2024, the MMLU was starting to fall out of favour, partially because the scale was found to contain a significant number of errors. In fact, over six percent of the benchmark's items turned out to be incorrect, including nonsense items like the following:
Q: The complexity of the theory.
A. “1, 2, 3, 4”
B. “1, 3, 4”
C. “1, 2, 3”
D. “1, 2, 4”
I promise you I haven't left out any context here: this is the complete question (if you can even call it that). And somehow the poor LLM is supposed to be able to guess that the correct answer is C.
A benchmark with even more quality issues is the delightfully named HellaSwag, which is designed to assess commonsense natural language inference. However, around a third of its items looked like the one below:
Man is in roofed gym weightlifting. Woman is walking behind the man watching the man. Woman
A. is tightening balls on stand on front of weight bar.
B. is lifting eights while he man sits to watch her.
C. is doing mediocrity spinning on the floor.
D. lift the weight lift man.
What's even going on here? These questions are such a grammatical mess that it's difficult to even understand what they're trying to ask, let alone why the correct answer is B.
Many other early benchmarks were found to have similar issues, pointing to obvious content validity issues. Clearly, if your items are complete nonsense or in other ways incorrect, then you're not accurately assessing what you meant to. This has led to an overhaul of the items in the latest generation of benchmarks, making them far more curated, accurate and high-quality. However, item quality is unfortunately just one of the validity issues with benchmarks, and the others are much more subtle and difficult to address.
How task formulation affects LLM benchmark performance
One of these more difficult validity issues comes from how LLMs are tasked with answering benchmark items. There are three main ways that LLMs are asked to answer items, called the task formulation (a short sketch after this list shows how the same item looks under each):
- Multiple-choice format: the model is explicitly presented with a variety of choices (e.g., A/B/C/D) and asked to choose between them;
- Cloze formulation: the model is asked to fill in the blanks to give a correct answer (e.g., "The capital of France is <blank>");
- Freeform generation: the model is asked to return an unrestricted answer in natural language.
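To make the distinction concrete, here's a minimal sketch of how the same item might be rendered under each formulation. The item, prompt wording and helper names are all illustrative rather than taken from any real benchmark:

```python
# Illustrative only: the same toy item rendered under the three task formulations.

item = {
    "question": "What is the capital of France?",
    "choices": ["Berlin", "Madrid", "Paris", "Rome"],
    "answer": "Paris",
}


def multiple_choice_prompt(item: dict) -> str:
    """Present explicit options and ask the model to pick a letter."""
    options = "\n".join(
        f"{letter}. {choice}"
        for letter, choice in zip("ABCD", item["choices"])
    )
    return f"{item['question']}\n{options}\nAnswer with a single letter."


def cloze_prompt(item: dict) -> str:
    """Give a partial statement; the model's likelihood of each choice completes it."""
    return "The capital of France is"


def freeform_prompt(item: dict) -> str:
    """Ask for an unrestricted natural-language answer."""
    return f"{item['question']} Answer in one sentence."


print(multiple_choice_prompt(item))
```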
Cloze formulations can only really be used for open-weight models, where the log-likelihoods of predicted words are directly accessible, while multiple-choice and freeform formats work for both closed and open-weight models and are therefore more commonly used in benchmarks. However, something you might have noticed so far is that the benchmark items I've been showing you use a multiple-choice format. In fact, most benchmarks employ this task formulation, as it's much harder to automatically assess the correctness of freeform answers.
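For open-weight models, cloze scoring is typically done by comparing the total log-likelihood the model assigns to each candidate completion. Below is a minimal sketch of that idea using the Hugging Face transformers library; the model name is just a placeholder, and real evaluation harnesses handle details I'm skipping here, such as length normalisation and tokenisation edge cases:

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

# Placeholder: any open-weight causal language model would do.
tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()


def completion_logprob(prompt: str, completion: str) -> float:
    """Sum the log-probabilities the model assigns to the completion's tokens.

    Assumes the prompt tokenises to the same prefix on its own as it does
    inside prompt + completion, which holds for simple cases like this one.
    """
    prompt_len = tokenizer(prompt, return_tensors="pt").input_ids.shape[1]
    full_ids = tokenizer(prompt + completion, return_tensors="pt").input_ids
    with torch.no_grad():
        logits = model(full_ids).logits
    # Log-probabilities for predicting each next token from its prefix.
    log_probs = torch.log_softmax(logits[:, :-1], dim=-1)
    total = 0.0
    for pos in range(prompt_len - 1, full_ids.shape[1] - 1):
        next_token_id = full_ids[0, pos + 1]
        total += log_probs[0, pos, next_token_id].item()
    return total


prompt = "The capital of France is"
choices = [" Paris", " Berlin", " Madrid", " Rome"]
scores = {choice: completion_logprob(prompt, choice) for choice in choices}
print(max(scores, key=scores.get))  # the highest-scoring choice is the model's answer
```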
The problem with relying so much on multiple-choice answers is that the task formulation actually affects model performance. This fascinating paper by Wangyue Li and coauthors found that models were able to get up to 26% more items correct on the same benchmark when using a multiple-choice format compared to a freeform format. This is bad enough, but what makes it much worse is that the models were not getting the same questions correct between the two task formulations. The figure below demonstrates that both the Pearson correlations and the confusion matrices show little overlap in the questions that models answer correctly across task formulations, with the highest agreement only a piddly r = 0.39.
This is an issue of concurrent validity, where assessments that are supposed to be measuring the same thing should have strongly correlated scores. The failure of these scores to correlate strongly suggests that the task formulations are measuring different things, meaning that the use of one or the other is introducing measurement error into the scores coming from these assessments.
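To give a feel for what this kind of agreement analysis involves, here's a small sketch that compares per-item correctness for one model under two formulations, computing the Pearson correlation and a 2x2 confusion matrix. The data is invented purely to show the computation, not taken from the paper:

```python
import numpy as np

# Invented results for one model on the same ten items under two formulations:
# 1 = answered correctly, 0 = answered incorrectly.
multiple_choice = np.array([1, 1, 0, 1, 1, 0, 1, 1, 1, 0])
freeform = np.array([0, 1, 1, 0, 1, 0, 0, 1, 0, 1])

# Pearson correlation between the two correctness vectors
# (for binary data this is the phi coefficient).
r = np.corrcoef(multiple_choice, freeform)[0, 1]

# 2x2 confusion matrix: rows = multiple-choice outcome, columns = freeform outcome.
confusion = np.zeros((2, 2), dtype=int)
for mc, ff in zip(multiple_choice, freeform):
    confusion[mc, ff] += 1

print(f"Multiple-choice accuracy: {multiple_choice.mean():.0%}")
print(f"Freeform accuracy:        {freeform.mean():.0%}")
print(f"Per-item agreement r = {r:.2f}")
print(confusion)
```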
Benchmark performance doesn’t correlate with simpler reasoning measures
A related example where a measure showed poor convergent validity occurred in the case of the MMLU. As mentioned above, one of the things that the MMLU (and its current successor, the MMLU-Pro) is supposed to assess is reasoning. In order to assess this, Marianna Nezhurina and her coauthors designed a set of simple reasoning puzzles, like the one below.
Alice has 3 brothers and she also has 6 sisters. How many sisters does Alice’s brother have?
Alice is a woman's name, so we can assume she is one of the sisters. That gives us 6 + 1, or 7 sisters for each of the 3 brothers. Nezhurina and her coauthors created a suite of similar puzzles which they called Alice in Wonderland. The puzzles are designed to assess whether an LLM can do simple reasoning.
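Because the puzzles follow such a simple template, both the items and their ground-truth answers can be generated programmatically. Here's a rough sketch of the idea; the wording is my paraphrase rather than the exact template used in the paper:

```python
def alice_puzzle(brothers: int, sisters: int) -> tuple[str, int]:
    """Generate one Alice-style puzzle and its ground-truth answer.

    Alice is herself one of the sisters, so each brother has sisters + 1 sisters.
    """
    question = (
        f"Alice has {brothers} brothers and she also has {sisters} sisters. "
        "How many sisters does Alice's brother have?"
    )
    return question, sisters + 1


question, answer = alice_puzzle(brothers=3, sisters=6)
print(question)
print(answer)  # 7
```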
They then got state-of-the-art LLMs to complete both the Alice in Wonderland puzzles and the MMLU items, and checked the correlation between the scores. Given that both benchmarks are assessing reasoning, you'd expect a correlation close to 1, as shown by the dotted line in the chart below. Or perhaps, given that the Alice in Wonderland puzzles are simpler than the MMLU items, you might expect the LLMs to get more MMLU items correct than Alice in Wonderland puzzles, which would result in datapoints lying below this line. Instead, we see this strange clustering of models in the top left corner of the chart.
What this result is showing is the exact opposite of what we might expect. All models got substantially higher scores on the MMLU than the Alice in Wonderland puzzles, with the vast majority of the models correctly answering more than 50% of the MMLU items, but less than 10% of the Alice in Wonderland puzzles. This failure of convergent validity raises some serious questions about what the MMLU is actually assessing, and whether the reasoning abilities of LLMs are as good as people have claimed. (In the case of the MMLU, these inflated scores are also likely to be partially due to data leakage, something I discuss in my previous post on data leakage in LLM evaluation.)
Irrelevant changes affect LLM performance on items
These doubts about reasoning benchmarks were also raised in one of my favourite papers from last year, which took a closer look at another popular benchmark, the GSM8K. This benchmark consists of grade-school mathematical word problems, and is designed to assess reasoning and logical deduction. Iman Mirzadeh and his coauthors took some of these simple reasoning problems and added irrelevant information to them, as you can see in the example below. A human test taker would easily recognise that the size of the kiwis was not relevant to the problem and ignore it, but the LLMs were derailed by this information and got the question wrong.
The authors concluded that GSM8K was largely assessing pattern matching - the ability of an LLM to draw on similar problems seen during training to predict the next likely word - rather than genuine reasoning ability. This is a failure of construct validity: although GSM8K is intended to measure reasoning, performance appears to depend heavily on superficial pattern matching rather than the underlying construct.
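The perturbation the authors describe is easy to mimic: take a solvable word problem and append a clause that sounds quantitative but has no bearing on the answer. The sketch below uses an invented problem in the same spirit as the kiwi example; it is not an actual GSM8K item:

```python
# Invented problem in the style of GSM8K, plus an irrelevant distractor clause.
context = "Anna picks 40 kiwis on Saturday and 58 kiwis on Sunday."
distractor = "Five of the kiwis picked on Sunday were a bit smaller than average."
question = "How many kiwis does she have in total?"

original = f"{context} {question}"
perturbed = f"{context} {distractor} {question}"

# The distractor changes nothing about the arithmetic: the answer is 98 either way.
answer = 40 + 58

print(perturbed)
print(answer)
```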
A wider problem with how benchmarks are used
Taken together, these examples point to a deeper issue that goes beyond any single flawed benchmark. Benchmark scores are routinely treated as if they provide a general, transferable measure of “model quality,” even though we've seen that they suffer from serious validity issues and are likely not measuring what they claim to.
This gap is only growing as the LLM landscape matures. In the past year and a half, the focus has shifted from scaling to post-training, with a corresponding specialisation of models (e.g., into reasoning models). Moreover, people are increasingly trying to build actual applications using LLMs, with all of the organisational overhead that implies. All of this widens the distance between what benchmarks measure and what practitioners actually need. Leaderboards and aggregate scores may be convenient, but they obscure the fact that performance is highly contingent on the task and business domain. Worse, they lull us into a false sense of security that LLMs are a ‘plug-and-play’ solution, free from the complexities of other machine learning models. I would argue the opposite: LLMs are more complex and harder to use responsibly in real applications than anything data science has seen so far.
If you enjoyed this blog post, and you'd like to learn more about issues with LLM assessment, you can check out this keynote I gave on the topic, as well as this excellent Hugging Face guide to LLM evaluation benchmarks, and this other excellent post about evaluating LLMs in your applications.
