Why the LMArena is a flawed approach to LLM evaluation

Posted on February 9, 2026 • 6 minute read

Cover image by Andrea Albanese

In the last blog post, we discussed some of the major issues with LLM evaluation benchmarks and why they are often a poor way of assessing model performance. So what, if anything, should replace them? One alternative you may have heard of is the LMArena, which has become one of the most widely cited alternatives to traditional benchmarks for evaluating large language models.

What is the LMArena and how does it evaluate LLMs?

The LMArena (previously called the Chatbot Arena) is a system for ranking the performance of large language models using preference-based evaluation. It is a large-scale experiment that anyone can take part in by visiting their website. Once a participant arrives, they are asked to enter a prompt, after which the Arena retrieves responses from two randomly selected state-of-the-art LLMs in its model pool. The user can compare these responses, optionally enter follow-up prompts, and finally rate which output they thought was better.

The results of these head-to-head comparisons (called battles) are aggregated using an Elo-style rating system, producing a ranking of models based on overall user preference. And, naturally, there is a leaderboard showing how each model tested on the Arena stacks up.
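To make those mechanics a little more concrete, here is a minimal sketch of how an Elo-style update can turn a stream of battle outcomes into a leaderboard. The K-factor, the 1,000-point starting rating, and the one-battle-at-a-time update are illustrative choices of mine rather than the Arena's production setup, and the model names are made up.

```python
from collections import defaultdict

def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that model A beats model B under an Elo-style model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

def update_ratings(ratings, battles, k=32.0):
    """Sequentially update ratings from (model_a, model_b, winner) battles.

    winner is "a", "b" or "tie". The K-factor of 32 and the 1,000-point
    starting rating are illustrative choices, not the Arena's parameters.
    """
    for model_a, model_b, winner in battles:
        e_a = expected_score(ratings[model_a], ratings[model_b])
        s_a = {"a": 1.0, "b": 0.0, "tie": 0.5}[winner]
        ratings[model_a] += k * (s_a - e_a)
        ratings[model_b] += k * (e_a - s_a)
    return ratings

ratings = defaultdict(lambda: 1000.0)  # every model starts at the same rating
battles = [("model-x", "model-y", "a"), ("model-y", "model-z", "tie")]
leaderboard = sorted(update_ratings(ratings, battles).items(),
                     key=lambda item: item[1], reverse=True)
print(leaderboard)
```

The important property is that beating a higher-rated opponent moves a rating further than beating a lower-rated one, so the final ordering reflects who a model beat, not just how often it won.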

How the LLM community views the LMArena as an evaluation tool

For quite some time, the LMArena was considered one of the most promising alternatives to traditional benchmarks. This enthusiasm was partly driven by frustration with benchmark pathologies, such as data leakage (which I discussed in this post) and deeper issues with measurement validity (covered here).

Few things capture that early optimism better than a tweet from Andrej Karpathy in late 2023, which praised the Arena as one of the more trustworthy ways of evaluating LLMs.

Andrej Karpathy's tweet endorsing the LMArena as one of the only trustworthy LLM evals

More recently, however, the LMArena has begun to fall out of favour. In what follows, I’ll explore why this former darling of LLM evaluation is now increasingly seen as being driven more by vibes than by science.

Preference-based LLM evaluation and bad-faith users

One set of problems arises from the participants themselves. In most experiments involving human participants, researchers build in checks to assess the quality of their responses. These can include tests of consistency across similar questions (often referred to as internal reliability) or items designed to detect unusual or socially desirable responses.

The LMArena, by contrast, has little opportunity to implement such checks. Participation is intentionally lightweight: at minimum, a user enters a prompt and clicks a single button to rate the outputs. The simplicity of this assessment setup makes it easy to collect large volumes of data, but it also makes it difficult to identify low-quality or aberrant responses.

The consequences of this design choice were examined by Wenting Zhao and colleagues, who studied the impact of Arena participants who were not acting in good faith.

The implicit assumption behind the Arena is that most participants are carefully evaluating model outputs and providing genuine feedback on their preferences. In practice, this is unlikely to always hold. Model responses can be long and repetitive, which may lead users to skim or lose interest and select an answer at random. In other cases, the two outputs may be so similar that users feel unable to distinguish between them. In both situations, each model has a 50% chance of being selected rather than the better model being consistently favoured, which injects noise into the rankings.

A more concerning scenario is that some participants may act maliciously, attempting to manipulate the rankings to favour a particular model. Such users might look for stylistic or formatting “tells” that reveal which model produced an output, and then consistently upvote that model regardless of quality. In this case, the probability of that model being selected approaches 100%, independent of its actual performance.

Examples of bad-faith actors on the LMArena who do not provide high-quality judgements when assessing models

Zhao and colleagues found that introducing just 10% of apathetic, arbitrary, or adversarial voters was enough to shift model rankings by as many as five places on the LMArena leaderboard. Given the open nature of the experiment and the inherent difficulty of comparing competing LLMs, it is not implausible that the Arena is already suffering from these quality issues.
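To get an intuition for why a small share of bad votes can matter this much, here is a toy simulation of my own (not Zhao and colleagues' methodology): good-faith voters prefer the genuinely stronger model with a probability tied to the true skill gap, apathetic voters flip a coin, and adversarial voters always upvote one favoured model. Every model name, skill value, and parameter below is an illustrative assumption.

```python
import random

# Hypothetical "true" skills: higher means genuinely better responses.
# The values are deliberately close, as they would be for frontier models.
TRUE_SKILL = {"model-a": 1200, "model-b": 1190, "model-c": 1180, "model-d": 1170}
FAVOURED = "model-d"  # the model that adversarial voters always upvote

def vote(model_a, model_b, voter_type):
    """Return the winner of one battle, given the voter's behaviour."""
    if voter_type == "adversarial" and FAVOURED in (model_a, model_b):
        return FAVOURED
    if voter_type in ("apathetic", "adversarial"):
        return random.choice([model_a, model_b])  # effectively a coin flip
    # Good-faith voter: prefers the stronger model with Elo-style probability.
    p_a = 1 / (1 + 10 ** ((TRUE_SKILL[model_b] - TRUE_SKILL[model_a]) / 400))
    return model_a if random.random() < p_a else model_b

def run_arena(bad_fraction, n_battles=50_000):
    """Simulate battles and rank models by overall win rate.

    (The real Arena aggregates with an Elo-style rating model, but a simple
    win rate is enough to see how bad-faith votes reshuffle the ordering.)
    """
    wins = {m: 0 for m in TRUE_SKILL}
    games = {m: 0 for m in TRUE_SKILL}
    for _ in range(n_battles):
        a, b = random.sample(list(TRUE_SKILL), 2)
        r = random.random()
        voter = ("apathetic" if r < bad_fraction / 2
                 else "adversarial" if r < bad_fraction
                 else "good-faith")
        wins[vote(a, b, voter)] += 1
        games[a] += 1
        games[b] += 1
    return sorted(TRUE_SKILL, key=lambda m: wins[m] / games[m], reverse=True)

random.seed(0)
print("clean votes:   ", run_arena(bad_fraction=0.0))
print("10% bad actors:", run_arena(bad_fraction=0.10))
```

Even in this tiny four-model pool, a favoured model that genuinely sits at the bottom tends to climb past its neighbours once a modest share of votes stops reflecting output quality.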

Structural biases in the LMArena against open and academic models

Proprietary companies also enjoy a structural advantage on the LMArena simply because they have more resources. One important factor is that the Arena allows providers to submit an unlimited number of models. In practice, this means that well-resourced organisations can enter not only their main releases, but also a large number of closely related variants into the evaluation pool.

A paper released last year by Cohere examined how this dynamic plays out. The authors found that proprietary providers tend to submit far more models to the Arena than academic or open groups. These submissions include both genuinely distinct releases and numerous experimental variants of the same underlying system. Because each variant is treated as a separate “player” in the Arena’s Elo-style rating system, submitting more models creates more opportunities for a provider’s strongest variants to rise to the top of the leaderboard.

The Cohere analysis also highlights how this setup creates incentives for strategic behaviour. For example, providers can increase the number of head-to-head matches involving their strongest models by surrounding them with weaker variants, or by frequently releasing new variants to take advantage of the Arena’s tendency to prioritise newer entries. When results are later reported, providers can then showcase only their best-performing variant, while quietly ignoring weaker ones. The chart below illustrates how a provider’s average Arena performance tends to increase along with the number of model variants they submit.

Graph showing that submitting more models leads to higher rankings on the LMArena, favouring proprietary model providers
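To see why unlimited, selectively reported submissions matter, here is a toy "best of N" sketch of my own (not the Cohere paper's analysis). It treats each variant's final Arena rating as the provider's underlying quality plus noise from sampling and voter preference; a provider that enters many variants and publicises only the top one looks stronger on average, even when its underlying quality is identical. All the numbers are illustrative assumptions.

```python
import random

def best_reported_rating(true_quality, n_variants, noise_sd=30.0, trials=10_000):
    """Average rating of the best variant when a provider enters n_variants.

    Each variant's observed Arena rating is modelled as the provider's true
    quality plus Gaussian noise (sampling variance, voter preference quirks).
    Reporting only the maximum inflates the apparent score as n grows.
    """
    total = 0.0
    for _ in range(trials):
        total += max(random.gauss(true_quality, noise_sd)
                     for _ in range(n_variants))
    return total / trials

random.seed(0)
for n in (1, 3, 10, 30):
    print(f"{n:>2} variants entered -> best reported rating "
          f"~{best_reported_rating(1200.0, n):.0f}")
```

The underlying quality never changes; only the number of draws from the noise distribution does, which is precisely the kind of advantage a well-resourced provider can afford.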

Proprietary models also benefit from advantages in hosting resources. Each provider is responsible for hosting their own models and covering the cost of user interactions. For academic and open groups, these costs can quickly become prohibitive, leading them to exhaust their token budgets far earlier than commercial providers. As a result, open models are more likely to be “silently” retired from the pool, further increasing the share of battles involving proprietary systems.

Finally, model providers gain access to rich data from the LMArena itself, including the prompts users submit, the styles of responses that tend to win comparisons, and the interaction patterns that perform well. This information can be fed back into subsequent training and fine-tuning, allowing these companies to optimise their newer models specifically for Arena-style interactions. Over time, this means models are not just getting better in a general sense, but better at the Arena in particular. The leaderboard then rewards this optimisation, reflecting not objectively superior models so much as models tailored to Arena preferences.

Why “good” is an undefined metric in general-purpose LLM evaluation

Even if we set aside all of the issues with participants and provider incentives, there remains a more fundamental problem with how the LMArena evaluates models in the first place.

Screenshot of an LMArena battle asking the models to write a poem about sausages

You might have noticed in the screenshot above that in each comparison, users are asked to decide which response is “better” and which is “worse”. But what do those terms actually mean? Does “better” imply more correct, more elegant, more concise, more verbose, funnier, or more logically structured? And does a “good” response to a prompt like the sausage poem above mean the same thing as a good resignation email, or a good best-man speech?

This points to a wider issue in LLM evaluation. In our push to build highly generalist models (systems that can handle almost any text-based task), we have made it increasingly unclear what, exactly, such models are supposed to be good at. In the absence of a well-defined target, evaluators like the LMArena are forced to rely on criteria so broad and underspecified that they verge on being meaningless.

What the LMArena’s problems tell us about the future of LLM evaluation

The challenges faced by the LMArena point to a broader problem: the idea that there could be a “one-size-fits-all” solution to LLM evaluation is fundamentally flawed. Attempts to evaluate LLMs with a single, global ranking tend to produce results that are noisy, difficult to interpret, and ultimately untrustworthy.

In response, more focused and realistic leaderboards are beginning to emerge. One example is the DPAI Arena being developed by my own employer, JetBrains, which deliberately narrows its scope to a single domain (software development) and evaluates models using more concrete, task-grounded workflows. Crucially, it is designed to be used and interpreted by domain experts who understand coding practices and trade-offs, potentially making it less vulnerable to the kinds of noise that plague more general, preference-based leaderboards. By constraining the problem space and defining meaningful metrics, specialised benchmarks like this may offer clearer and more actionable signals about how systems perform on the tasks that actually matter to you.

That said, even well-designed leaderboards can only take you so far. As practitioners like Hamel Husain argue in this excellent blog post, robust evaluation ultimately needs to be grounded in the specific goals, risks, and failure modes of your own application. Generic benchmarks and leaderboards can tell you how models compare to one another in the abstract, but they cannot tell you whether a given model will succeed in your use case, whether that’s legal drafting, medical summarisation, or developer tooling. To truly understand and improve performance, teams need custom evaluation systems that combine targeted tests, domain-specific criteria, and ongoing error analysis into a feedback loop tied directly to real-world behaviour.

Ultimately, there are no shortcuts in LLM evaluation. What once felt like a compelling alternative to benchmarks has become less convincing as the evaluation landscape has matured, and as our expectations of these systems have become more concrete. The shift now is away from ranking models by vague notions of general capability, and toward understanding what they can reliably do in production.