AI Without the Hype: A dice or a clockwork mechanism: why AI gives different answers to the same question

In December 1926, Albert Einstein wrote a letter to Max Born. It concerned the then-nascent field of quantum mechanics, which was causing quite a stir in the world of physics at the time. Einstein rejected its probability-based description of the world. He wrote: “In any case, I am convinced that He does not play dice.” By ‘He’, he meant the Old One – that is, God. This sentence gave rise to the well-known phrase ‘God does not play dice’.

Einstein was wrong. At the subatomic level, chance reigns supreme, as science has since confirmed time and again. He never came to terms with this, right up to his death in 1955.

We’re seeing a similar tension today with language models. We want software that always works the same way. What we get is a system that rolls the dice with every word.

How does an AI roll the dice?

[Illustration: a stick-figure robot with a clinical thermometer in its mouth and a dice in its hand. A thought bubble shows possible continuations for ‘The cat is sitting on the...’: windowsill? mat? veranda? chandelier? The image illustrates probability distribution and temperature in AI language models.]

A language model works with words. At each point in a text, it calculates how likely each possible next word is. Imagine the beginning of a sentence: “The cat was sitting on the…” The model calculates that “the windowsill” follows in 60% of cases, “the mat” in 20%, “the stairs” in 5%, and “the veranda” in 1%. “The dishwasher”, on the other hand, almost never occurs.

The model makes its selection from this probability distribution. If it always chose the most probable word, the same answer would come out every time, provided the model and its training data had not changed. But that is not what it does. The model ‘rolls the dice’: usually it lands on ‘the windowsill’, sometimes on ‘the mat’, and occasionally on ‘the veranda’.
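The difference between always taking the most likely word and rolling the dice can be sketched in a few lines of Python. This is a toy illustration, not how any particular provider implements it: the percentages are the ones from the cat example, the "(something else)" entry is an assumption added so that the weights add up to 100%, and the function names are made up for this sketch.

```python
import random

# Probabilities from the cat example. "(something else)" is an assumed
# catch-all for the remaining 14% so that the weights sum to 100%.
next_word_probs = {
    "the windowsill": 0.60,
    "the mat": 0.20,
    "the stairs": 0.05,
    "the veranda": 0.01,
    "(something else)": 0.14,
}


def pick_most_likely(probs):
    """Clockwork: always return the single most probable continuation."""
    return max(probs, key=probs.get)


def roll_the_dice(probs, rng):
    """Dice: draw one continuation at random, weighted by its probability."""
    words = list(probs)
    weights = list(probs.values())
    return rng.choices(words, weights=weights, k=1)[0]


rng = random.Random(42)  # fixed seed only so that the demo is repeatable
greedy = [pick_most_likely(next_word_probs) for _ in range(5)]
sampled = [roll_the_dice(next_word_probs, rng) for _ in range(5)]

print("clockwork:", greedy)
print("dice:     ", sampled)
```

The first list contains ‘the windowsill’ five times. The second list depends on the draw: mostly ‘the windowsill’, but other continuations turn up roughly as often as their percentages suggest.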

What this means in practice

Imagine an employee wants to use ChatGPT to draft a letter to a client regarding the amendment of a commercial contract following changes to grid tariffs. She types in her question and receives a draft. The next day, she needs a version for another client and asks a similar question. The response is not identical. She wonders why the reasoning and the order of the points are different.

This happens for several reasons at once. The model generates responses as described above. The context changes because services with a memory function incorporate previous conversations. In the meantime, the model itself may have changed. Providers roll out updates in the background, add training data or release new model versions. If the extensive Japanese literature on cats is incorporated and given greater weighting, ‘the mat’ suddenly rises in the rankings. And simply attempting to ask the same question twice in an ongoing conversation may elicit different responses, because the initial exchange has itself become part of the input.

It’s important to remember that just because you receive different pieces of information, it doesn’t necessarily mean that any of them are correct. They could all be wrong. Anyone familiar with the Strawberries test from the previous article will know that an AI can give stubbornly incorrect and self-assured answers.

Why we can’t see the dice

What is astonishing is how rarely we notice this variation. Anyone who asks the AI a question and receives an answer rarely stops to think what that answer would have been an hour ago or what it will be tomorrow. The variation is there, but it remains invisible to us.

There are several reasons for this. The most important is the ELIZA effect, which I have described in detail in a separate article. When a machine expresses itself fluently, we perceive it as a thinking conversation partner, to whom we grant a certain degree of variation in its responses, just as we do with humans. We equate the two types of variation, yet they are fundamentally different. In our case, it stems from experience, mood and tiredness; in the case of AI, it stems from a probability calculation.

Even when the inconsistency becomes apparent, we often fail to react. When our behaviour (using AI) and our knowledge (AI is unreliable) don’t match up, this creates what psychology describes as cognitive dissonance. We usually resolve this internal tension by downplaying the knowledge. “It can’t be that bad.” “It works for us, after all.” “The others are exaggerating.” That resolves the dissonance, and the AI remains in use.

Furthermore, we do not examine the variations. Who would ask the same question twice? We receive an answer and carry on working with it. The variation remains hidden. Only when we realise that every response is an estimate drawn from a probability distribution does the question change. ‘What did the AI say?’ becomes ‘How stable is this answer?’. And this question is crucial for practical application in critical business processes.
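One way to make ‘How stable is this answer?’ measurable is to ask the same question several times and count how often the most frequent answer appears. The following is a minimal sketch under stated assumptions: `ask` stands for whatever function sends a question to your AI system and returns its answer as text, and `fake_model` is a made-up stand-in so that the example runs without any real service.

```python
import random
from collections import Counter


def answer_stability(ask, question, runs=10):
    """Ask the same question `runs` times.

    Returns the most common answer and the share of runs that produced it
    (1.0 means every run gave the same answer).
    """
    answers = [ask(question) for _ in range(runs)]
    top_answer, count = Counter(answers).most_common(1)[0]
    return top_answer, count / runs


# Hypothetical stand-in for a real model call: answers "A" most of the time.
_rng = random.Random(0)


def fake_model(question):
    return _rng.choices(["A", "B"], weights=[0.7, 0.3], k=1)[0]


top, share = answer_stability(fake_model, "Which option applies?", runs=20)
print(f"most common answer: {top}, share of runs: {share:.0%}")
```

In a real check, each run should start a fresh conversation, because a repeated question within the same conversation changes the input. A high share shows only that the answer is stable, not that it is correct.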

Is it possible to calm the dice down?

Yes, to a certain extent. Most language models have a parameter called ‘temperature’ that controls exactly that. The higher the temperature is set, the more the model varies in its choice of the next word. At a temperature of 0, the most likely continuation is chosen every time. This makes the responses significantly more consistent.
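The effect of temperature on the cat example can be sketched as follows. This is a simplified illustration: real systems apply temperature to the model's raw scores over its whole vocabulary, whereas here it is applied directly to the example percentages, which gives the same result mathematically. The "(something else)" entry and the function name are assumptions made for this sketch.

```python
import math

# Example percentages from the cat sentence; "(something else)" is an
# assumed catch-all so that the probabilities sum to 100%.
base_probs = {
    "the windowsill": 0.60,
    "the mat": 0.20,
    "the stairs": 0.05,
    "the veranda": 0.01,
    "(something else)": 0.14,
}


def apply_temperature(probs, temperature):
    """Return the distribution the dice is rolled from at a given temperature."""
    if temperature == 0:
        # Limit case: all weight goes to the single most likely word.
        best = max(probs, key=probs.get)
        return {word: (1.0 if word == best else 0.0) for word in probs}
    # Divide the log-probabilities by the temperature, then renormalise.
    scaled = {w: math.exp(math.log(p) / temperature) for w, p in probs.items()}
    total = sum(scaled.values())
    return {w: s / total for w, s in scaled.items()}


for t in (0, 0.5, 1.0, 2.0):
    adjusted = apply_temperature(base_probs, t)
    summary = ", ".join(f"{w}: {p:.0%}" for w, p in adjusted.items())
    print(f"temperature {t}: {summary}")
```

At temperature 1 the percentages stay as they are. Below 1, ‘the windowsill’ gains weight at the expense of everything else; above 1, the rarer continuations catch up.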

Important to know: you won’t find this setting if you’re using ChatGPT or Claude in a browser. There, the temperature is fixed by the provider, usually at a medium value. It only becomes adjustable when someone integrates a language model into their own system. At that point, someone makes a decision about the temperature, and that decision directly influences how consistent your AI system’s answers ultimately are.

This has direct practical implications: if you’re introducing an AI-powered chatbot or agent into your organisation, it’s worth asking the provider: What ‘temperature’ setting has been chosen? Why? How does it fit with the task the system is supposed to perform for you? A lower temperature for calculations and factual information, a higher one for creative text drafting. If you do not ask this question, you are accepting a default setting that the provider has chosen from their own perspective, not yours.

One thing remains the same, however, even at temperature 0: the answer is not guaranteed to be correct. It is simply more consistent. In the ‘Strawberries’ test, the model then counts two R’s every single time. The fact remains that a language model predicts the next word based on probabilities.

Dice or clockwork: the choice of tool is yours

What does this mean for the practical application of AI? It leads to the same choice of tools that runs through this entire series. There are tasks where variation is valuable: drafting texts, generating suggestions, summarising content, exploring creative options. Here, the randomness is an advantage. It is precisely the variation that a deterministic system could never provide that makes AI useful in these contexts.

There are other tasks where consistency is required. An invoice that must always come out the same. A contract that should use the same wording whenever the same details are entered. Data processing where the result must be verifiable and traceable. For such tasks, a language model is the wrong tool. What is needed here is a clockwork mechanism that always ticks the same way.

Einstein could not come to terms with the randomness of quantum physics. He fought against the stochastic nature of the world right up until his death. Nevertheless, the scientific community has accepted quantum mechanics as the best available description of reality. Enthusiasm was rarely a factor; it was more a matter of necessity. With AI, we face the same challenge. It rolls the dice, and it will continue to do so. We cannot remove the dice. What we can do is learn to work with it and use it only where its randomness benefits us.

Frequently Asked Questions

Why does ChatGPT give different answers to the same question?

Language models operate on the basis of probabilities, not fixed rules. For each word, a choice is made from a probability distribution. There are also other factors at play: services with a memory function take previous conversations into account, providers apply updates in the background, and simply asking the same question twice in the same conversation changes the context.

What is temperature in AI?

Temperature is a parameter that controls how much a language model varies when choosing the next word. A high temperature allows for less likely words, whilst a low temperature narrows the responses down to the most probable ones. At a temperature of 0, the most likely continuation is chosen every time. End users do not see this setting in the browser; it is set by the provider.

Why do we rarely notice the variation in AI?

Several mechanisms work together. The ELIZA effect makes us perceive a conversation partner whose variation we accept, just as we would with a human. Even when we do notice the variation, we often downplay the issue, which is how we resolve what psychology describes as cognitive dissonance. Furthermore, we rarely verify by asking the same question twice.

Can AI be made deterministic?

Only to a degree. At temperature 0, responses become significantly more stable. But even then, the answer is not guaranteed to be correct, merely more consistent. The dice becomes steadier, but remains a dice.

For which tasks is AI suitable, and for which is it not?

AI is suitable for tasks where variation is valuable: drafting texts, generating suggestions, summarising content. For tasks that must be 100% stable, such as calculations, contracts or regulated data processing, a language model is the wrong tool. Here, you need a clockwork mechanism that always ticks the same way.
