
Study Accuses LM Arena of Helping Top AI Labs Game Its Benchmark (techcrunch.com) 10
An anonymous reader shares a report: A new paper from AI lab Cohere, Stanford, MIT, and Ai2 accuses LM Arena, the organization behind the popular crowdsourced AI benchmark Chatbot Arena, of helping a select group of AI companies achieve better leaderboard scores at the expense of rivals.
According to the authors, LM Arena allowed some industry-leading AI companies like Meta, OpenAI, Google, and Amazon to privately test several variants of AI models, then not publish the scores of the lowest performers. This made it easier for these companies to achieve a top spot on the platform's leaderboard, though the opportunity was not afforded to every firm, the authors say.
"Only a handful of [companies] were told that this private testing was available, and the amount of private testing that some [companies] received is just so much more than others," said Cohere's VP of AI research and co-author of the study, Sara Hooker, in an interview with TechCrunch. "This is gamification." Further reading: Meta Got Caught Gaming AI Benchmarks.
Not the future I imagined (Score:1)
The future is always better and worse than you expect.
So AI is just optimised to pass benchmarks. (Score:1)
wat (Score:3)
According to the authors, LM Arena allowed some industry-leading AI companies like Meta, OpenAI, Google, and Amazon to privately test several variants of AI models, then not publish the scores of the lowest performers. This made it easier for these companies to achieve a top spot on the platform's leaderboard, though the opportunity was not afforded to every firm, the authors say.
So the authors are ranking the companies, not the models? What good is that? I don't care about their overall performance, I only care about the performance of each individual model. If they are ranking models then this is irrelevant, because what I care about is the best-performing models, not the worst. If there's a chart that ranks all the models, I'm not even going to bother looking at the bottom unless it's out of curiosity. I'm going to check the top. And the top spot on the leaderboard is what we're talking about, right? Avoiding having lower scores on the list doesn't affect whether you have the top spot at all, right?
Is there some detail to this story that would make it make sense? All I can find in the story is details which make it make less sense, like:
One important limitation of the study is that it relied on "self-identification" to determine which AI models were in private testing on Chatbot Arena. The authors prompted AI models several times about their company of origin, and relied on the models' answers to classify them, a method that isn't foolproof.
So to be clear, this "finding" is based on self-reporting from software known to hallucinate?
Does anyone have any details about this story which aren't included in the article (which doesn't support the conclusion well at all) which would give the claims any weight? Because what I get from the article itself is that this is total bullshit.
Missing the point (Score:1)
Key statement: "Only a handful of [companies] were told that this private testing was available"
The creators of model X weren't prevented from doing so, but thought it would look bad to submit 20 different variants of their model at the same time, like submitting Deepseek-v3-0324-1, Deepseek-v3-0324-2, Deepseek-v3-0324-3, Deepseek-v3-0324-4, etc. So they only submitted major revisions.
The creators of model Y were told that they could submit a mass of slightly tweaked models, and just hide all the poorly performing ones.
Re: (Score:3)
Absolutely NONE of that affects their placement at the top of the list, or not. It only controls whether they have other, lesser spots on the list as well.
Re: (Score:2)
what I care about is the best-performing models, not the worst. If there's a chart that ranks all the models, I'm not even going to bother looking at the bottom unless it's out of curiosity. I'm going to check the top. And the top spot on the leaderboard is what we're talking about, right? Avoiding having lower scores on the list doesn't affect whether you have the top spot at all, right?
Well, we do care about which models are the top models, but more importantly, we care about an assessment of the set of models that we might consider, whether they are at the top or not.
The problem with secret access is that this one-on-one battle format only samples the outcomes of select tasks. So perturbing the sampling can significantly change the results. That's the problem.
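To make that concrete, here is a minimal, hypothetical Python sketch. It is not anything from the paper or from LM Arena's actual scoring; the win-rate model, noise level, and battle counts are made-up assumptions. It treats every variant as having the same true 50% win rate, with only measurement noise, and shows that publishing just the best of N privately tested variants still pushes the reported score well above 50%.

import random

# Hypothetical toy model, not LM Arena's method: every variant has the same
# true win rate; measured win rates differ only through run-to-run noise and
# the finite number of battles sampled.
random.seed(0)

TRUE_SKILL = 0.50   # honest expected win rate of every variant (assumed)
NOISE = 0.05        # run-to-run variation in a variant's effective win rate
BATTLES = 200       # battles per variant during private testing (assumed)
TRIALS = 2_000      # Monte Carlo repetitions

def measured_win_rate() -> float:
    """One variant's measured win rate over a batch of simulated battles."""
    skill = min(max(random.gauss(TRUE_SKILL, NOISE), 0.0), 1.0)
    wins = sum(random.random() < skill for _ in range(BATTLES))
    return wins / BATTLES

def published_score(n_variants: int) -> float:
    """Average published score if only the best of n private variants is kept."""
    return sum(
        max(measured_win_rate() for _ in range(n_variants))
        for _ in range(TRIALS)
    ) / TRIALS

print(f"single submission:   {published_score(1):.3f}")   # roughly 0.50
print(f"best of 5 variants:  {published_score(5):.3f}")   # noticeably higher
print(f"best of 20 variants: {published_score(20):.3f}")  # higher still

The point is just that taking the maximum of several noisy, unbiased measurements is itself biased upward, so a lab allowed to test many private variants and publish only the winner gets a leaderboard boost even if none of its variants is actually better.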
I also find it interesting that Stanford and MIT profs are criticizing something from Berkeley. College rivalries aren't just confined to sports.
LM Arena has been gamed for quite some time (Score:2)
Microsoft admits in their papers that they train on the arena prompts (but not the responses). Model makers have their bots output lots of Markdown formatting and emoji, because people prefer a wrong answer with nice formatting over a right answer in dull text. The arena was a nice idea to get a better ranking than the Turing test, but it is not a real benchmark.
Corrupt at every layer (Score:1)
Yet another ethical lapse in these AI companies. Big shock.