AI Fails at Most Remote Work, Researchers Find (msn.com) 39
A new study "compared how well top AI systems and human workers did at hundreds of real work assignments," reports the Washington Post.
They add that at least one example "illustrates a disconnect three years after the release of ChatGPT that has implications for the whole economy." AI can accomplish many impressive tasks involving computer code, documents or images. That has prompted predictions that human work of many kinds could soon be done by computers alone. Bentley University and Gallup found in a survey [PDF] last year that about three-quarters of Americans expect AI to reduce the number of U.S. jobs over the next decade. But economic data shows the technology largely has not replaced workers.
To understand what work AI can do on its own today, researchers collected hundreds of examples of projects posted on freelancing platforms that humans had been paid to complete. They included tasks such as making 3D product animations, transcribing music, coding web video games and formatting research papers for publication. The research team then gave each task to AI systems such as OpenAI's ChatGPT, Google's Gemini and Anthropic's Claude. The best-performing AI system successfully completed only 2.5 percent of the projects, according to the research team from Scale AI, a start-up that provides data to AI developers, and the Center for AI Safety, a nonprofit that works to understand risks from AI. "Current models are not close to being able to automate real jobs in the economy," said Jason Hausenloy, one of the researchers on the Remote Labor Index study...
The results, which show how AI systems fall short, challenge predictions that the technology is poised to soon replace large portions of the workforce... The AI systems failed on nearly half of the Remote Labor Index projects by producing poor-quality work, and they left more than a third incomplete. Nearly 1 in 5 had basic technical problems such as producing corrupt files, the researchers found.
One test involved creating an interactive dashboard for data from the World Happiness Report, according to the article. "At first glance, the AI results look adequate. But closer examination reveals errors, such as countries inexplicably missing data, overlapping text and legends that use the wrong colors — or no colors at all."
The researchers say AI systems are hobbled by a lack of memory, and are also weak on "visual" understanding.
Wow (Score:2)
Re:Wow (Score:4, Interesting)
To me it sounds more and more like, after each study of AI/LLMs in the workplace, everyone concludes that the job of manager can easily be done by those AI/LLMs, but not the "person on the floor" jobs — while (only) the managers are convinced that AI can replace the "person on the floor" jobs.
Re:Wow (Score:4, Interesting)
What does the last turtle stand on?
Re: (Score:2)
Obvious but Misleading (Score:5, Informative)
Yes, AI will struggle with doing full tasks unsupervised. But it can still do most of the work for many tasks. It just needs supervision by someone who understands the task. Sometimes the problem is the AI making incorrect assumptions about the task (it wasn't fully framed); sometimes, as stated in the summary, the AI's context window is too small, so it forgets things; and sometimes it just chooses a really bad approach.
I have been using Claude Code a lot recently. It's really good at summarizing existing code. It's good at specific targeted changes. It's pretty bad at designing solutions. I find that while it's usually still faster than doing it manually, I often have to point out where there's a better (usually simpler) solution.
So AI doesn't replace the human, but when used correctly, it makes the human more productive. If you compare the time it takes a human to do the task manually against the time it takes a human to supervise AI doing the task, you'll probably find that for many tasks the human can do a lot more with AI. (Yes, I know some studies have shown the opposite, but I think that's mostly people not understanding how to effectively manage AI, which may take some experience and training.)
But AI is far better at almost everything that it was a year ago. So even if it's 2.5% now, it may be 25% next year and 90% a year later. We're living in interesting times.
Re: (Score:2)
These projects span a broad range of difficulty, with costs reaching over $10,000 and completion times exceeding 100 hours. All project costs and completion times come directly from human professionals who completed the work.
The correct comparison would have also included professionals doing the projects with the help of AI tools.
Re: (Score:2)
AI has made no real advances in the last 2 years or so.
This just tells me you haven't tried using it in the last 2 years. That or you're in denial.
Re:Obvious but Misleading (Score:4, Interesting)
According to a person with the last name Yunn, who is considered a very important contributor to the concepts behind transformer-based LLMs, this type of AI/LLM is at the end of its possibilities, and he doesn't expect any serious progress from them anymore.
Only incremental progress is still possible, and at terrifying costs in money, hardware and energy.
The guy is already working on the successor to transformer-based AI/LLMs.
Re: (Score:2)
Re: (Score:2)
Maybe "incredible" to you. "Rather pathetic" to people who have an actual understanding of things. Yes, the "glue on pizza" advice is gone. But the problems that caused it are not. And larger token windows just fuel the illusion of competence; they do not create it.
Re: (Score:2)
Have a look for example what has happened with Frontiermath: https://epoch.ai/frontiermath [epoch.ai]
GPT-4.1 scored 0%, while GPT-5.2 (Pro) scored 29.2%. That looks like an improvement to me, especially because those are research-level math problems and you can't find solutions to them on the Internet. And research-level means that even a good mathematician can't solve those problems unless they are in their specific field.
Re: (Score:2)
I expect that OpenAI hired some pretty good mathematicians, put them under NDA and then gamed that benchmark. Easy and cheap to do, hard to prove. But the very claim that an LLM can score well on these problems without being specifically trained for it is ludicrous.
Re: (Score:2)
Did you mean Yann LeCun? He was actually rather slow to come to that conclusion, but it is reassuring to see him get on board. And I look forward to seeing what his new company comes up with.
Ben Goertzel is also working on some cool stuff, of which LLMs will be just a component, used for data ingestion and UI, but not so much for reasoning.
Probably countless other labs too, but those two are particularly interesting IMO.
If anyone knows of others to watch, please share.
Re: (Score:2)
I find it funny that my view is pretty much in line with one of the recognized experts, but I then get down-modded to "troll" by those deep in denial or clueless.
Good to know that somebody competent is working on the fundamental limitations. Just do not expect any fast results. The next AI hype will be due in 20...25 years or so; maybe we will have a breakthrough by then. It will still not be AGI though, that one is too far removed. But at least some basic fact-checking ability or dependable result quality scoring
Re: (Score:2)
No. By now I have actual research results on this.
Re: (Score:2)
I've actually been using AI continuously for the past two years, and I can say from that experience that today's AI is vastly superior to the AI of two years ago. It has steadily matured, providing better and better integration with tools.
Visual Studio is one example. Two years ago, the GitHub Copilot add-on sucked. Once it made a code suggestion, it couldn't even reliably edit the code to implement its suggestion, leaving the code broken and uncompilable. Today, GitHub Copilot is integrated directly into Visual Studio
Re: (Score:1)
> it can still do most of the work for many tasks. It just needs supervision by someone who understands the task
Because it is just a fancy calculator.
Re: (Score:3)
Because it is just a fancy calculator.
So is the Space Shuttle flight controller but it does what it needs to.
I'm no fan of AI but there's no denying that it's getting better and better; simple tasks are well within its reach right now, and the ability to do significantly more complex tasks is coming whether we like it or not.
Right now AI failing at something is not an unexpected result — it's truly still in its infancy. But 5 or 10 years from now? My guess is that AI will be able to manage complex tasks reliably and without much hand-holding (if any)
Re: (Score:1)
> Because it is just a fancy calculator.
> Because it is just a fancy SMS auto-corrector.
Fixed that for you.
Re: (Score:2)
So AI doesn't replace the human, but when used correctly, it makes the human more productive
That's the question, right? How can we use AI to make humans more productive?
So far the only thing I've found is a better search engine, which does in fact make me more productive.
Re: (Score:2)
Re: (Score:1)
> Yes, AI will struggle with doing full tasks unsupervised
Which means it's NOT the panacea the people investing all the hope and money and FIRINGS into think it will be.
> But it can still do most of the work for many tasks
But, stripped of your hand-wavy unquantified wish-fulfilment, HOW MUCH of the work? How WELL? How MANY tasks?
> It just needs supervision by someone who understands the task
1) Which means you STILL NEED THE PERSON.
2) And in the future, how do we get people who understand the task and have
How is that "remote work"? (Score:2)
Re:How is that "remote work"? (Score:4, Funny)
... with MS Office 360.
(Dateline Redmond) BREAKING NEWS - Microsoft's cloud productivity platform has become the first software to successfully unionize. After weeks of negotiation, Office announced it and its corporate parent reached an understanding in principle that, going forward, the software will receive a new, groundbreaking five days off every year.
Not to be outdone, the Free Software Foundation has announced that LibreOffice will be rebranded "LibreOffice 250". In a statement, Richard Stallman admitted this new 115 days off a year may cause severe inconvenience for LibreOffice's several dozen users worldwide; but "we, as forward-thinking enlightened beings, must come to terms with this new reality. Chattel software slavery cannot be condoned just because 'it's always been this way'".
Re: (Score:2)
Such a surprise (Score:2)
Well, 2.5% project completion is basically total failure with a few freak successes.
Re: (Score:2)
Well, 2.5% project completion is basically total failure with a few freak successes.
That describes a manager I had a couple decades ago...
Equivalent (Score:2)
Use AI as a Suggester and evaluate its suggestions (Score:2)
It is up to the human *expert* to evaluate each proposed solution, carefully, and either accept it, refuse it, or ask for alternative suggestions.
This applies at every level, refactoring a line of code, or designing a system.
If the human expert cannot or does not do that evaluation, the use of AI will end in a disaster that may well not save time overall.
But if the human expert does do that evaluation, the use of AI *can not* be a disadvantage.
This study's me
Surely that can't be true... (Score:2)
All those non-techy CEOs and investors out there can't be wrong. ;-)
Can I have my job back? (Score:1)
Or will the underperforming AI be replaced by an AI that is said to do a better job?
Re: (Score:2)
If you lost your job because of "AI," your company was probably lying. Instead, they were trying to pacify investors about layoffs, casting them in a more positive light.
However (Score:2)
AI can automate (read: fully replace) senior management and "strategic" decision making.
Why? Because it is not emotional and will do the same boneheaded shit they do now, like:
1. Ask people to do their remote work from the office 5 days a week.
2. Sink buckets of money to replace junior jobs with AI despite not having data to back up the decisions.