We’ve heard the refrain “AI is going to take our jobs” countless times. But how certain is that, really? Since AI agents are not very specialized and fail frequently, we can relax: our jobs are safe for now. After examining a range of agent behaviors, researchers at Carnegie Mellon University (CMU) and Duke University concluded that humans are not in danger (for the time being).
Even when the agents succeed, they complete only about one-third of their tasks. At worst, they fail to reach even a 10% success rate. Automation is still a long way off, and if it is to arrive in the near future, these agents will have to become far more reliable.
Although automation remains a real possibility, this experiment dispels many of the myths around the idea that AI is about to replace humans in every area of life. We may long to be freed from work, but for now that sounds more like fiction than fact.
What are AI Agents?
These are computer programs that can carry out complicated tasks on their own, without constant human supervision. Unlike typical assistants (such as Siri or Alexa), which respond to specific requests, one of these agents can make decisions, plan steps, and coordinate multiple actions by itself: the kind of autonomy the computing revolution has long promised.
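The decide-plan-act cycle described above can be sketched as a simple loop. This is an illustrative toy, not the systems tested in the study; `call_llm` and the tool set are hypothetical placeholders for a language model and the actions it can take:

```python
# Minimal sketch of an agent loop (hypothetical; for illustration only).
def run_agent(goal, tools, call_llm, max_steps=10):
    """Plan-act-observe loop: the model picks a tool, the agent runs it,
    and the observation is fed back until the model declares it is done."""
    history = [f"Goal: {goal}"]
    for _ in range(max_steps):
        action = call_llm(history)          # e.g. {"tool": "search", "input": "..."}
        if action.get("tool") == "finish":
            return action.get("input")      # final answer
        tool = tools.get(action.get("tool"))
        observation = tool(action["input"]) if tool else "error: unknown tool"
        history.append(f"Action: {action} -> Observation: {observation}")
    return None  # ran out of steps: the kind of failure the study measured
```

Each pass through the loop is one chance for the model to misread the situation, which is why long multi-step tasks fail far more often than single-step ones.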
And what’s going on with them?
Perhaps they are not as independent as we believe. The researchers tested this by creating a fictitious company, The Agent Company, whose AI agents were required to use platforms such as Owncloud, GitLab, and Rocketchat to carry out their duties.
However, the outcomes were terrible.
Disappointment
It was a complete failure in both the OpenHands CodeAct and OWL-Roleplay test environments. Claude Sonnet 4 performed best, finishing 33.1% of the tasks. It was followed by Claude 3.7 Sonnet (30.9%), Gemini 2.5 Pro (30.3%) and, far behind, GPT-4o (8.6%), Llama-3.1-405b (7.4%), Qwen-2.5-72b (5.7%), and Amazon Nova Pro v1.0 (1.7%). Disastrous!
In the end, even the best agents succeed in about 30% of cases; the remaining 70% are simply failures. They won’t be taking your office job, so you can relax. No model is currently prepared to manage challenging tasks on its own.
Problems with AI
Errors of all kinds were noted during the tests: agents that could not handle pop-up windows, could not convey a basic message, or produced absurd solutions that had nothing to do with the original objective. One even renamed a user so it could pretend to have contacted the right person.
Even if these mistakes are occasionally funny, they reveal a severe lack of contextual awareness and poor execution skills, raising questions about the agents’ readiness for real responsibilities.
Do they work for anything?
Yes, but with numerous caveats. The researchers acknowledge that agents can be helpful for very small tasks, but they fail frequently and cannot fully replace human labor.
The future isn’t here yet
Even in repeated tests, the results weren’t particularly encouraging: success rates ranged from 24% to 34%, still well short of human performance. Of course, everything gets better with time.
What are the risks?
If every step isn’t tracked, handing an agent sensitive tasks such as sending email or managing customer relations can go horribly wrong. For this reason, experts advise adopting standards like the Model Context Protocol (MCP) to improve system-to-system communication and lower error rates.
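MCP standardizes how an agent asks an external server to run a tool, using JSON-RPC 2.0 framing. Below is a rough sketch of what such a request looks like; the `send_email` tool and its arguments are invented for illustration and are not part of the protocol itself:

```python
import json

def make_tool_call(request_id, tool_name, arguments):
    """Build an MCP-style JSON-RPC 2.0 request asking a server to run a tool.
    (Shape based on MCP's tools/call request; the tool here is hypothetical.)"""
    return {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool_name, "arguments": arguments},
    }

msg = make_tool_call(1, "send_email", {"to": "team@example.com", "subject": "Status"})
print(json.dumps(msg, indent=2))
```

Because every tool call is an explicit, structured message with an `id`, it can be logged and audited, which is exactly the kind of step-by-step tracking the experts recommend for sensitive tasks.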
AI is not ready yet
In a follow-up study, Salesforce tested these agents in CRM scenarios. They achieved only 58% success on simple tasks, and their performance fell to 35% when tasks required multiple steps. The conclusion: these agents lack the skills needed for difficult work.
Gartner predicts massive cancellations
Data from the research firm Gartner indicates that 40% of AI agent projects will be abandoned before 2027. Why? Many are built more on hype than on technical viability: experiments with an immature technology that has no practical use yet.
So AI is still a long way from replacing us in the most demanding activities. AI 0, Humans 1! (No hard feelings, for now!)