AI Without the Hype: People make mistakes too; why this statement is misleading

A few days ago, I carried out a little experiment. I asked a language model how many Rs there are in the word ‘strawberries’. It’s a question nobody would ask in everyday life; you can see the right answer at a glance. The machine replied very quickly and confidently: ‘Two’. I asked again. Two again. Only on the third attempt did it give the correct number, with the same complete conviction as before.

Ask a primary schooler the same question, and they’ll think you’re crazy, but they’ll give you the right answer. And if they get it wrong, they’ll quickly count again. My son would have asked me why I wanted to know such nonsense. The machine didn’t do any of that. It didn’t know that it didn’t know.

This is precisely where the common objection, ‘After all, humans make mistakes too’, misses the mark. I hear it in every other consultation as soon as the topic of AI’s fallibility comes up. It sounds fair and contains a grain of truth. We make mistakes, every day, in every process. Nevertheless, the argument compares apples with oranges.

Quality of mistakes

Raymond Panko, a professor of economics, has spent decades researching how often people make mistakes when processing data. For simple data entry, the error rate is around one per cent; for more complex tasks carried out without a dual-control system, it can be as high as four per cent. When an AI provider advertises ‘95 per cent accuracy’, that corresponds to an error rate of five per cent. On average, the machine is therefore about as accurate as a human, not noticeably superior.
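To put the figures side by side, here is a small sketch in Python. The daily volume of 1,000 cases is an assumed number chosen purely for illustration. The rates are the ones quoted above, with ‘95 per cent accuracy’ read as a five per cent error rate.

```python
# Expected number of faulty cases per day for a hypothetical volume.
CASES_PER_DAY = 1_000  # assumption for illustration, not a figure from the research

ERROR_RATES = {
    "human, simple data entry": 0.01,
    "human, complex task without dual control": 0.04,
    "AI advertised at 95 per cent accuracy": 0.05,
}

def expected_errors(cases: int, error_rate: float) -> int:
    """Return the expected number of faulty cases, rounded to whole cases."""
    return round(cases * error_rate)

for label, rate in ERROR_RATES.items():
    print(f"{label}: about {expected_errors(CASES_PER_DAY, rate)} faulty cases")
```

The sketch only compares quantities. It says nothing about what kind of errors these are, and that is where the real difference lies.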

The key difference lies not in the rate of errors, but in the nature of those errors. Imagine a team of ten clerks processing cases for a whole day. If, at the end of the day, some of them are incorrect, there are many possible reasons for this: tiredness, rushing, a misunderstanding, a transposed letter, or a forgotten tick in the form. Every mistake tells a story; it can be traced and avoided next time. The errors of an AI system often tell no story. The system makes an estimate and is sometimes wrong, without it being possible to reconstruct afterwards why.

The dice

The real reason is that large language models are not deterministic. They can produce different responses to the same input, without it being possible to tell from the outside which one is correct. Added to this is a second phenomenon: the notorious ‘hallucination’. This refers to something different from the fluctuating response: the invention of content that sounds plausible but is simply not true. A language model may cite a source that never existed, or present a figure that no one has ever recorded. Both phenomena occur together and are not isolated cases.

Out of a thousand instances, the system might provide the correct answer in three hundred, a plausible-sounding but incorrect one in another two hundred, and behave completely differently in the rest. It is not consistently wrong; it is unreliably correct. The same case might run flawlessly today and go wrong tomorrow.
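The thousand-case picture can be illustrated with a toy sampler. This is a deliberately simplified sketch, not a description of how any particular model is implemented: the three answer categories and their probabilities are invented to mirror the example above, and a real model samples token by token from a far larger vocabulary.

```python
import random
from collections import Counter

# Hypothetical answer distribution for one fixed question (invented numbers).
ANSWERS = ["correct", "plausible but wrong", "something else entirely"]
WEIGHTS = [0.3, 0.2, 0.5]

def ask(rng: random.Random) -> str:
    """Draw one answer at random; the question itself never changes."""
    return rng.choices(ANSWERS, weights=WEIGHTS, k=1)[0]

rng = random.Random()  # unseeded on purpose: every run of the script differs
tally = Counter(ask(rng) for _ in range(1000))

for answer, count in tally.most_common():
    print(f"{answer}: {count}")
```

Run the script twice and the counts shift slightly each time, although nothing about the input has changed.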

People make mistakes too. But in a different way.

Traditional software is tested by feeding it known inputs and comparing the result with the expected output. The same input produces the same output. Every time. With AI, this tried-and-tested method, which has been in use for decades, does not work. The results vary, and this is down to the very design of the system itself. You cannot test the dice.
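The contrast can be made concrete with two toy functions. Both are hypothetical stand-ins: `count_letter` represents traditional software, while `sampled_count` represents a system that draws its answer at random and is not a real language model.

```python
import random

def count_letter(word: str, letter: str) -> int:
    """Deterministic rule: the same input always gives the same count."""
    return word.lower().count(letter.lower())

def sampled_count(rng: random.Random) -> int:
    """Toy stand-in for a sampling system: the answer is drawn at random."""
    return rng.choice([2, 3])

# Classic test: known input, expected output.
assert count_letter("strawberries", "r") == 3

# Repeating the test adds nothing, because the result cannot change.
assert all(count_letter("strawberries", "r") == 3 for _ in range(100))

# The sampled function may pass this check on one call and fail on the next,
# so a single green test run says little about the following run.
rng = random.Random()  # unseeded on purpose
observed = {sampled_count(rng) for _ in range(100)}
print(sorted(observed))
```

For the first function, one passing test settles the matter. For the second, a pass only describes that one draw.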

We test traditional software, but not AI

Traditional software can also produce a huge number of errors. If a software developer has misunderstood my requirements or built in a logical error, their program will fail every time it is run. This happens every day. That is precisely why testing consumes between a quarter and a half of all development resources. Nobody would put software into operation without testing and technical acceptance.

With AI, this culture of verification is often lost. The answer sounds authoritative and is accepted without hesitation. The AI comes across as an expert, and this leads to blind trust. The real problem lies in front of the screen. We treat the output of a language model as a verified source, when in fact it is merely an estimate.

This blind trust is not a new phenomenon. As early as 2016, researchers at the Georgia Institute of Technology investigated how people interact with a robot in an emergency. During the experiment, a fire alarm sounded and artificial smoke filled the room. The robot showed the participants an escape route in a direction they were unfamiliar with. Directly behind the robot was a glowing emergency exit sign pointing in the opposite direction to the familiar main entrance. All 26 participants followed the robot anyway, even those who had found it unreliable just moments before. In an extension of the study, the robot even led the participants into a dark room blocked by a piece of furniture. Some squeezed past the obstacle and followed it inside.

I explore why people are so willing to trust automated systems in a separate article: Why we trust AI more than ourselves.

With a person, you can ask why

Think back to the team of clerks from earlier. People can explain why mistakes happened, and we can use that information to take appropriate action. If, for example, a particular step is regularly overlooked, a structured process can help eliminate the problem in future.

This doesn’t work with a language model. You can ask it why it made a particular decision, and you’ll get a fluent, plausible-sounding answer. But that information itself is merely generated text, not a genuine reconstruction of the internal calculation. It may be accurate or completely made up, and you cannot tell the difference from the outside. The model rationalises after the fact, because plausible justifications for such questions appear in its training data. It cannot read its own inner workings and therefore cannot explain anything.

For many tasks, it doesn’t matter. No one needs to explain why an email has been placed in one processing queue or another.

The situation is different in regulated processes. Anyone who rejects an application, refuses to provide a service or declines to grant a loan must be able to justify that decision to the person concerned and to a supervisory authority in the event of doubt. Data protection and the European legal framework for AI require this for high-risk applications. This is where two worlds collide: the process demands evidence, but AI only provides a narrative.

Why the objection still sounds so plausible

When AI expresses itself fluently, we see it as a human-like colleague. And then the phrase ‘it just makes mistakes like a human’ springs to mind and sounds almost reassuring. This is a linguistic trap, and it has a name: the ELIZA effect, which I describe in detail in a separate article. In short: we attribute understanding and judgement to a program that formulates fluently, because our brain has learnt this association over a lifetime.

Today we encounter the same trap on a far larger scale. We hear a machine answer fluently and unconsciously see a thinking person in front of us. But in doing so we adopt only half of the picture. We recognise human fallibility in the AI. What we overlook are the human safeguards: hesitating when uncertain, escalating to a colleague, taking responsibility for the result. The ELIZA effect makes us see a fallible human in place of a machine, and hides the fact that this ‘human’ brings none of our safeguards with it.

Where the dice fit and where the clockwork is needed

In one respect, the objection that ‘people make mistakes too’ is valid, and it is worth acknowledging this. If a task has no single, deterministically correct solution, for example, in judgement-based activities such as summarising texts or drafting a response, the demand for perfection is the wrong yardstick. The fair comparison, then, is the real human being with their fatigue and haste. If an AI works faster, more reliably and more consistently here than a sleep-deprived clerk, it is a useful tool.

None of this is an argument against AI. It is a call for choosing the right tool. AI is an excellent tool when the task allows for some variation. Drafting texts, generating suggestions, preparing research, summarising content. It really comes into its own wherever the result does not have to be identical every time.

However, there are processes where the outcome must be exactly the same every time. A change of supplier in the energy sector. A change of insurance provider. A phone number that is ported to a new provider. These processes are entirely rule-based and subject to clear legal and technical requirements. A language model is the wrong tool for this. What is needed here is a clockwork mechanism that always ticks the same way, not a system that guesses what is likely to be correct.

AI can be of help here in a different way. It is an excellent tool for writing the code that models such deterministic processes. The dice help to build the clockwork without replacing it once it is in operation.

Anyone who understands this distinction can look beyond the hype and use the tool effectively where it can really make a difference.

Sources and further links

Research on error rates

Frequently Asked Questions

How does the error rate of AI compare to that of humans?

If an AI provider claims 95 per cent accuracy, this equates to an error rate of five per cent. Humans have an error rate of around one per cent for simple data entry, and up to four per cent for more complex tasks where there is no dual-control system. On average, therefore, AI is about as accurate as a human; it is not noticeably superior.

What is the difference between human errors and AI errors?

Human errors are independent of one another and have identifiable causes. Every error can be explained and avoided next time. AI errors often don’t tell a story. The system makes an estimate and sometimes gets it wrong, without it being possible to work out afterwards why.

Why aren’t AI systems deterministic?

Large language models use a statistical method that can produce different outputs for the same input. Added to this is the phenomenon of hallucination, i.e. the generation of plausible but incorrect content. Both of these characteristics are inherent in the system’s design.

Why can’t an AI explain its own decision?

A language model can provide a plausible-sounding explanation on request, but this is itself merely generated text, not a genuine reconstruction of the internal calculation. It rationalises the process after the fact. In regulated processes, where decisions must be justifiable, this is a significant problem.

What tasks is AI suitable for?

AI is well suited to tasks that allow for some variation, such as drafting texts, generating suggestions, preparing research or summarising content. For processes that must be entirely rule-based, such as switching suppliers or changing insurance providers, traditional software is the right tool.
