Why the LMArena is a flawed approach to LLM evaluation

Posted on February 9, 2026 • 6 minute read

Cover image by Andrea Albanese

In the last blog post, we discussed some of the major issues with LLM evaluation benchmarks, and why they are often a poor way of assessing model performance. So what, if anything, should replace them? One alternative you may have heard of is the LMArena, which has become one of the most widely cited alternatives to traditional benchmarks for evaluating large language models.

Part of the series
Measuring LLM Performance

What is the LMArena and how does it evaluate LLMs?

The LMArena (previously called the Chatbot Arena) is a system for ranking the performance of large language models using preference-based evaluation. It is a large-scale experiment that anyone can take part in by visiting their website. Once a participant arrives, they are asked to enter a prompt, after which the Arena retrieves responses from two randomly selected state-of-the-art LLMs in its model pool. The user can compare these responses, optionally enter follow-up prompts, and finally rate which output they thought was better.

The results of these head-to-head comparisons (called battles) are aggregated using an Elo-style rating system, producing a ranking of models based on aggregate user preference. And, naturally, there is a leaderboard showing how each model tested on the Arena stacks up.
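To make the mechanics concrete, here is a minimal sketch of a classic Elo update after a single battle. The actual Arena uses a more sophisticated variant (fitting a statistical model over all battles at once), so the constants and formula here are illustrative only:

```python
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that model A beats model B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

def update(rating_a: float, rating_b: float, a_won: bool, k: float = 32.0):
    """Return both models' new ratings after one head-to-head battle."""
    e_a = expected_score(rating_a, rating_b)
    score_a = 1.0 if a_won else 0.0
    new_a = rating_a + k * (score_a - e_a)
    new_b = rating_b + k * ((1.0 - score_a) - (1.0 - e_a))
    return new_a, new_b

# Two equally rated models: a win moves the winner up by k/2 = 16 points.
print(update(1000.0, 1000.0, a_won=True))  # -> (1016.0, 984.0)
```

The key property is that an upset (a low-rated model beating a high-rated one) moves the ratings much more than an expected result, so a model's position stabilises as more battles accumulate.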

How the LLM community views the LMArena as an evaluation tool

For quite some time, the LMArena was considered one of the most promising alternatives to traditional benchmarks. This enthusiasm was partly driven by people's frustration with benchmark issues, such as data leakage (which I discussed in this post) and deeper problems with measurement validity (covered here).

Few things capture that early optimism better than a tweet from Andrej Karpathy in late 2023, which lauded the Arena as one of the more trustworthy ways of evaluating LLMs.

Andrej Karpathy's tweet endorsing the LMArena as one of the only trustworthy LLM evals

More recently, however, the LMArena has begun to fall out of favour. In this blog post, we'll explore why this former darling of LLM evaluation is now increasingly seen as being driven more by vibes than by science.

Preference-based LLM evaluation and bad-faith users

One set of problems arises from the participants themselves. In most experiments involving human participants, researchers build in checks to assess the quality of participants' responses. These can include tests of consistency across similar questions (often referred to as internal reliability) or items designed to detect unusual or socially desirable responses.

The LMArena, by contrast, has little opportunity to implement such checks. The creators have made the experiment very lightweight, with participants only being required to enter a single prompt and then rate the output. The simplicity of this assessment setup makes it easy to collect large volumes of data, but it also makes it difficult to identify low-quality or aberrant responses.

The consequences of this design choice were examined by Wenting Zhao and colleagues, who studied the impact of Arena participants who were not acting in good faith.

The assumption behind the Arena is that most participants are carefully evaluating model outputs and providing genuine feedback on their preferences, but in practice, this is unlikely to always hold. Model responses can be long and repetitive, which may lead users to skim or lose interest and select an answer at random. In other cases, the two outputs may be so similar that users feel unable to assess which is better. In both of these situations, each model has a 50% chance of being selected, rather than the better model being consistently favoured. This random selection injects noise into the rankings.

A more concerning scenario is that some participants may act maliciously, attempting to manipulate the rankings to favour a particular model. Such users might look for stylistic or formatting “tells” which mark the model that they want to drive up in the rankings, and then consistently upvote that model regardless of quality. In this case, the probability of that model being selected approaches 100%, independent of its actual performance.

Examples of bad faith actors on the LMArena, who are not giving high quality answers when assessing models

Zhao and colleagues found that if just 10% of participants either randomly select models or deliberately upvote specific models, it would be enough to shift model rankings by as many as five places on the LMArena leaderboard. Given that there are no restrictions on who can access the experiment, and that comparing LLM outputs is often difficult, it's not out of the question that the quality of participant responses is already affecting the robustness of the Arena's rankings.

Structural biases in the LMArena against open and academic models

Another issue is that proprietary companies can leverage their vast resources to make their models more competitive on the LMArena. The Arena allows competitors to submit an unlimited number of models for evaluation, so well-resourced organisations can enter not only their main releases, but also a large number of closely related variants into the evaluation pool.

A paper released last year by Cohere examined how this dynamic plays out. The authors found that proprietary providers tend to submit far more models to the Arena than academic or open groups. These submissions include both genuinely distinct releases and numerous experimental variants of the same underlying system. Because each variant is treated as a separate “player” in the Arena’s Elo-style rating system, submitting more models creates more opportunities for a provider’s strongest variants to rise to the top of the leaderboard.

The Cohere paper also shows how this setup can incentivise companies to be strategic about how they release models onto the Arena. For example, providers can increase the number of head-to-head matches involving their strongest models by surrounding them with their own weaker variants, or by frequently releasing new variants to take advantage of the Arena’s tendency to prioritise recently entered models. When providers then report their results from the Arena, they can cherry-pick their best-performing variant and quietly ignore that they ever entered weaker-performing models. The pattern in the chart below suggests that this may indeed be happening, showing that a provider's average Arena performance increases along with the number of model variants they submit.

Graph showing that more models leads to higher rankings on the LMArena, favouring proprietary model providers
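The cherry-picking dynamic is a straightforward consequence of selection bias, which a short sketch can demonstrate. Assume (purely for illustration) that each submitted variant's measured Arena rating is its true skill plus random noise, and that the provider reports only the best-scoring variant:

```python
import random
import statistics

def best_reported_rating(n_variants, true_rating=1200.0, noise_sd=30.0,
                         trials=2000, seed=0):
    """Average of the *best* measured rating across `n_variants` submissions,
    where each measurement is true skill plus Gaussian noise. All numbers
    are illustrative assumptions, not real Arena values."""
    rng = random.Random(seed)
    best = [max(rng.gauss(true_rating, noise_sd) for _ in range(n_variants))
            for _ in range(trials)]
    return statistics.mean(best)

for n in (1, 5, 20):
    print(f"variants submitted: {n:2d} -> "
          f"headline rating ~ {best_reported_rating(n):.0f}")
```

Even with identical underlying models, the headline number grows with the number of variants submitted, simply because the maximum of more noisy draws is larger on average. This is one plausible explanation for the upward trend in the chart above.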

Commercial companies also have access to a lot more hosting resources, which gives them yet another advantage. On the LMArena, each provider is responsible for hosting their own models and covering the cost of user interactions. For academic and open groups, these costs can quickly become prohibitive, leading them to exhaust their token budgets far earlier than commercial providers. As a result, open models are more likely to be “silently” retired from the pool, further increasing the share of battles involving proprietary systems.

Finally, model providers gain access to rich data from the LMArena itself, which includes the prompts users submit, the styles of responses that tend to win, and the interaction patterns that perform well. This information can be fed back into subsequent model training and fine-tuning, allowing these companies to optimise their newer models specifically for Arena-style interactions. Over time, this means models are not just getting better in a general sense, but better at the Arena in particular. The leaderboard then rewards this optimisation, reflecting not objectively superior models so much as models tailored to Arena preferences.

Why “good” is an undefined metric in general-purpose LLM evaluation

Even if we set aside all of the issues with participants and provider incentives, there remains a more fundamental problem with how the LMArena evaluates models in the first place.

Screenshot of an LMArena battle asking the models to write a poem about sausages

You might have noticed in the screenshot above that in each comparison, users are asked to decide which response is “better” and which is “worse”. But what do those terms actually mean? Does “better” imply more correct, more elegant, more concise, more verbose, funnier, or more logically structured? And does a “good” response to a prompt like the sausage poem above mean the same thing as a good resignation email, or a good best-man speech?

This points to a wider issue in LLM evaluation. In our push to build highly generalist models, systems that can handle almost any text-based task, we have made it increasingly unclear what, exactly, such models are supposed to be good at. In the absence of a well-defined target, evaluators like the LMArena are forced to rely on criteria so broad and underspecified that they verge on being meaningless.

What the LMArena’s problems tell us about the future of LLM evaluation

The challenges faced by the LMArena point to a broader problem: the idea that there could be a “one-size-fits-all” solution to LLM evaluation is fundamentally flawed. Attempts to evaluate LLMs with a single, global ranking tend to produce results that are noisy, difficult to interpret, and of questionable value.

As practitioners like Hamel Husain argue in this excellent blog post, robust evaluation ultimately needs to be grounded in the specific goals, risks, and points of failure of your own application. This might feel like familiar territory, and it is! For all that LLMs initially felt and looked like something different from the rest of software development and machine learning, we're finally realising that in order to work with them effectively, we need to understand how good any given model is for a specific use case. And this means what it always has for software testing and monitoring: unit tests, logging and evaluating traces, and A/B testing, all specifically tailored to your users and your objectives.

Ultimately, there are no shortcuts in LLM evaluation. The LMArena, which once felt like an exciting and compelling alternative to benchmarks, has become less convincing, both because of its methodological flaws and because of a growing recognition that no single ranking can stand in for careful, task-specific evaluation. The shift now is away from ranking models by vague notions of general capability, and toward understanding what they can reliably do in production.