
While traditional IQ tests are designed to measure human intelligence across areas such as logical reasoning, problem-solving, and verbal skills, they fall short when applied to LLMs. LLMs are not cognitive entities; they are sophisticated pattern recognition systems trained on vast amounts of data to predict language sequences.

LLMs don’t “think” or “reason” the way humans do, nor do they have intrinsic problem-solving capabilities. Their outputs are based solely on statistical correlations rather than understanding or intent.

While AI can’t be directly compared to humans, you can at least measure LLMs against each other.

That’s the approach taken by the Chatbot Arena Leaderboard.

The Chatbot Arena Leaderboard

Chatbot Arena is a platform designed to compare and rank large language models by letting users directly pit two models against each other in head-to-head competitions. Users input a prompt, both models generate responses, and the user selects the better answer. The results are aggregated using the Elo rating system to rank the models based on their performance in these pairwise matchups.
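To make the aggregation concrete, here is a minimal sketch of a classic online Elo update applied to one user vote; the K-factor, starting ratings, and update scheme are illustrative assumptions, not Chatbot Arena’s exact methodology.

# Minimal sketch of an online Elo update for one head-to-head vote.
# K-factor and starting ratings are illustrative assumptions.
def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that model A beats model B under the Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

def update_elo(rating_a: float, rating_b: float, a_won: bool, k: float = 32.0) -> tuple[float, float]:
    """Return the updated (rating_a, rating_b) after a single user vote."""
    e_a = expected_score(rating_a, rating_b)
    s_a = 1.0 if a_won else 0.0
    return rating_a + k * (s_a - e_a), rating_b - k * (s_a - e_a)

# Example: both models start at 1000 and model A wins the matchup.
print(update_elo(1000.0, 1000.0, a_won=True))  # (1016.0, 984.0)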

The leaderboard includes both proprietary models, such as OpenAI’s GPT-4 and Anthropic’s Claude, and open-source models like Mixtral and LLaMA.

As of the latest Arena report (2024-10-06), the top model, ChatGPT-4o-latest (2024-09-03), holds an Elo score of 1338, followed by Gemini-1.5-Pro-002 at 1304, Meta-Llama-3.1-405b-Instruct-bf16 at 1257, and Mixtral-8x22b-Instruct-v0.1 at 1148.

Congratulations to ChatGPT, but should we really care about these Elo differences?

What does an Elo difference mean?

First, we need to understand the significance of an Elo rating difference. The formula for the higher-rated model’s chance of winning is: chances to win = 1 / (1 + 10^(-(rating difference) / 400))

  • A 70-point Elo difference means the higher-rated model has a 60% chance of winning a matchup, 10 points above the 50% coin-flip baseline, leaving a 40% chance that it does not outperform the other.
  • A 100-point Elo difference translates to a 64% win probability, or 14 points above the baseline.
  • A 200-point Elo difference implies a 76% win probability, or 26 points above the baseline.

Keep this scale in mind when comparing ratings.
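For a quick sanity check, the probabilities above can be reproduced directly from the formula; here is a minimal Python sketch.

# Win probability of the higher-rated model for a given Elo gap.
def win_probability(rating_difference: float) -> float:
    return 1.0 / (1.0 + 10 ** (-rating_difference / 400))

for gap in (70, 100, 200):
    print(gap, round(win_probability(gap), 2))
# Output: 70 -> 0.6, 100 -> 0.64, 200 -> 0.76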

Disclaimer

At SynapseDX, our mission is to transform the way businesses manage their operations by seamlessly integrating Artificial Intelligence (AI) into their existing systems.

Our focus is on use cases that enhance operational efficiency, context understanding, and decision routing.

We don’t build features like creative content generation, code generation and debugging, or human-like conversation. We intentionally steer clear of the general-knowledge abilities of LLMs and focus on practical, specific AI applications that drive value.

Our challenge is to identify which LLM is the most cost-effective for the specific tasks we assign to it. So, our approach and conclusions may not apply to your business cases.

Initial thoughts

Elo ratings offer a useful snapshot of relative strength, but they can’t fully account for an individual model’s strengths and weaknesses.

Response A can be better than response B, but that doesn’t make B a bad response. For example, if I ask for a summary of a text, the losing response may still be accurate.

While Elo identifies the strongest model, it does not necessarily reflect which one is best suited, or overqualified, for a particular task.

Elo evolves within the same model

Examples:

  • ChatGPT-4o models have experienced a 74-point increase in their Elo rating over time.
  • ChatGPT-4 models have seen an 88-point gain.

There is a noticeable variation between different versions of the same model.

Impact of the Prompt

Here, we consider the impact of the prompt on the quality of the response. One key factor we’ve identified is the complexity of the question being asked.

In our internal tests, we provided ChatGPT-4 with a text and then asked the same 5 questions in two different ways: 

  • All at once
  • One at a time

The questions were closed-ended, meaning there is no “better” response—only correct or incorrect ones. We considered the overall response correct only if all individual answers were correct.

The comparison between asking 5 individual questions versus using a single prompt containing all 5 questions showed a performance difference of over 200 Elo points.
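For illustration, here is a minimal sketch of the two prompting strategies, assuming the OpenAI Python client; the model name, source text, questions, and answer checking are placeholders rather than our actual test material.

from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment
MODEL = "gpt-4"    # placeholder model identifier

text = "..."  # the source document given to the model
questions = ["Q1 ...", "Q2 ...", "Q3 ...", "Q4 ...", "Q5 ..."]  # closed-ended questions

def ask(prompt: str) -> str:
    response = client.chat.completions.create(
        model=MODEL,
        messages=[{"role": "user", "content": prompt}],
    )
    return response.choices[0].message.content

# Strategy 1: all five questions in a single prompt.
batched = ask(text + "\n\nAnswer the following questions:\n" + "\n".join(questions))

# Strategy 2: one question at a time, each with the full text.
individual = [ask(text + "\n\nQuestion: " + q) for q in questions]

# A run counts as correct only if every individual answer is correct
# (the questions are closed-ended, so grading is a strict match; checking logic omitted).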

This small-scale test is supported by several academic studies, including one by Subbarao Kambhampati’s team, which highlights how LLM performance tends to degrade as problem complexity increases.

Converted into Elo ratings, the performance drops translate to a decrease of 100 to 150 Elo points when handling problems with a five-level depth, and more than 400 Elo points with a 10-level depth.
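To relate such accuracy drops back to Elo, one can invert the expected-score formula; the win rates below are illustrative values, not figures taken from the study.

import math

def winrate_to_elo_gap(p: float) -> float:
    """Elo gap implied by a pairwise win rate p (inverse of the expected-score formula)."""
    return 400 * math.log10(p / (1 - p))

for p in (0.64, 0.70, 0.90):
    print(f"{p:.2f} -> {winrate_to_elo_gap(p):.0f} Elo points")
# 0.64 -> 100, 0.70 -> 147, 0.90 -> 382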

Conclusion

Keep in mind that asking one question at a time can improve performance by over 200 Elo points. Therefore:

  • Maintain a single focus
  • Target precision
  • Avoid ambiguity

The concrete effect of applying this method is that you could potentially downscale from the proprietary champion ChatGPT-4o (1338 Elo) to the Apache 2.0-licensed, self-hosted, lightweight Mixtral-8x22b-Instruct (1148 Elo).

Bonus: This approach also enhances auditability, as fine-grained responses make it easier to pinpoint where errors occur. 

Everyone enjoys a good competition, but the real stakes lie beyond the 34-point difference between ChatGPT and Gemini. In the end, the human impact remains predominant.
