Table Of Contents
- Introduction
- Examples of LLM Hallucinations
- Why Do LLMs Hallucinate?
- Strategies to Mitigate Hallucinations
This week I prompted ChatGPT to educate me about LLM hallucinations. Hilariously, it hallucinated.
Late-life migraine accompaniments do exist; however, 'LLM' is not a recognized acronym for them.
(Various permutations of this query were similarly fruitless.)
Despite its penchant for false facts, ChatGPT is not a malignant sociopath poised to betray us all. (I'm looking at you, HAL.) LLMs don't have intent. They produce outputs, and we label those offerings acceptable or hallucinatory based on our own valence toward them.
Examples of LLM Hallucinations
Have you heard that ChatGPT taught itself advanced chemistry? Turns out, it didn't. It can simply double-talk advanced chemistry convincingly enough to impress a layperson.
Watch ChatGPT 'gaslight' a user about the date.
In March of 2023, Stanford researchers published a LLaMA-based chatbot, only to quickly take it down due to hallucinations.
A lovely kaleidoscope of fever dreams.
Why Do LLMs Hallucinate?
There are two reasons:
- Lack of information.
- A surface-level, statistical model of language, without even a modicum of true semantic understanding.
An LLM is liable to hallucinate when it lacks sufficient domain-specific data during the training phase. Every LLM must be trained on a large corpus of knowledge long before it is made available to an end user. ChatGPT, for example, was trained on internet data gathered up to its 'knowledge cutoff' in September 2021. (Although, confusingly, it seems to know about some events that occurred after then. Mystery.)
As of September 2021, LLM hallucinations were not widely discussed online.
Whether or not they have enough subject material, LLMs are designed to complete what a human is likely to say, not necessarily what's correct. The use of statistical likelihood in natural language and text generation is nothing new. Markov chains, for instance, have been helpful in this domain for years. Natural Language Processing (NLP) and AI experts often call LLMs 'stochastic parrots'. Somewhat hilariously diminutive, given that LLMs can have billions of parameters - but totally fair.
Note that probabilities are calculated for tokens, not words. Distinct tokens represent 'jumped' and ' jumped' (with a leading space). In this case, the probability of "jumped" is 82.14%.
In the second, deterministically repeated sentence, "jumped" occurs with 100% probability. Of course.
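If you want to peek at these probabilities yourself, the legacy Completions API exposes them via the logprobs parameter. Here's a minimal sketch (pre-1.0 openai Python client; the model name and prompt are illustrative, and your exact percentages will differ):

```python
import math

import openai

# Ask for the top 5 candidate tokens at the next position after the prompt.
response = openai.Completion.create(
    model="text-davinci-003",   # legacy completions model
    prompt="The cow",
    max_tokens=1,
    temperature=0,
    logprobs=5,
)

# top_logprobs[0] maps candidate tokens to their log probabilities
# for the first generated position; exponentiate to get probabilities.
for token, logprob in response.choices[0].logprobs.top_logprobs[0].items():
    print(f"{token!r}: {math.exp(logprob):.2%}")
```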
If we can make a leap from here, the point is: In a very general sense the sentence "The cow jumped over the moon" is as correct to an LLM as "Water boils at 100 degrees Celsius or 212 degrees Fahrenheit" - even though one is fantasy and the other is factual.
Now let's put the two pieces together. If ChatGPT doesn't know current events, it doesn't have information about LLM hallucinations. And if it is trained to spit out statistically likely phrases regardless of whether it has the facts, we get fabrications.
According to Ed Targett in this compelling article on LLM Hallucinations, "an eagerness to ascribe intelligence to LLMs comes with the risk that we start consuming convincing black box outputs as gospel without rigorous diligence." To that I say: Amen.
Strategies to Mitigate Hallucinations
As of this writing, these post-training user strategies have proven effective:
- Steerability via prompting
- Retrieval data augmentation
- Completion parameters / hyperparameters
- Chain-of-Thought (CoT) prompting
Steerability
Steerability is the ability to guide an LLM's behavior. The LLM can take on a role: "You are a chemistry teacher." Or it can follow instructions: "Make a list of the 5 best places to vacation if you are allergic to the sun." (True story. I have two sun allergies.)
It can also adopt a persona or tone: "Summarize Edgar Allan Poe's The Tell-Tale Heart in the tone of Cardi B."
Thanks Cardi, that ending really turned it into an upper.
Note that currently ChatGPT uses GPT-3.5 under the hood. GPT-4, however, is much more amenable to steering. After the initial self-supervised pre-training phase, it was fine-tuned via RLHF: Reinforcement Learning from Human Feedback.
While they may seem simple, these steering phrases have proven powerful at reducing hallucination (a quick sketch of using them follows the list):
- "Stick to the facts."
- "It's okay to say 'I don't know.'"
- "Ask the user for information you may need to complete this task."
(This seems like great advice for humans, too.)
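Here's a minimal sketch of baking those phrases into a system message via the Chat Completions API (pre-1.0 openai Python client; the user question is just an illustration):

```python
import openai

messages = [
    {
        "role": "system",
        "content": (
            "You are a careful assistant. Stick to the facts. "
            "It's okay to say 'I don't know.' "
            "Ask the user for any information you may need to complete this task."
        ),
    },
    {"role": "user", "content": "What were the biggest AI stories of this month?"},
]

response = openai.ChatCompletion.create(model="gpt-3.5-turbo", messages=messages)
# Ideally: a clarifying question or an honest "I don't know", not an invention.
print(response.choices[0].message.content)
```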
Steerability is a viable strategy in chat conversations as well. LangChain has a great example here of feeding the chat history back into the prompt with each turn to keep context fresh. Otherwise, the LLM can forget context given enough further interaction and (you guessed it) hallucinate.
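A minimal sketch of that pattern, assuming the LangChain class names current as of this writing:

```python
from langchain.chains import ConversationChain
from langchain.llms import OpenAI
from langchain.memory import ConversationBufferMemory

# ConversationBufferMemory replays the accumulated chat history into
# every new prompt, so earlier turns stay in context.
conversation = ConversationChain(
    llm=OpenAI(temperature=0),
    memory=ConversationBufferMemory(),
)

conversation.predict(input="My trip is in December and I'm allergic to the sun.")
# A later turn still "sees" the December trip and the sun allergy,
# because the buffer is prepended to the prompt each time.
conversation.predict(input="So what should I pack?")
```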
Retrieval Data Augmentation
Retrieval data augmentation involves feeding data sourced by the end user to the LLM during prompting. Sources can include APIs, private data stores (smaller amounts of data, like documents or images on your own machine), databases (for larger amounts of data), or anything found on the internet itself.
I recently whipped up a React-Flask app called Weather to Wear as a basic example of retrieval data augmentation. It queries the user for a location, calls WeatherAPI to get the current day's weather, then embeds the resulting data in a prompt for gpt-3.5-turbo.
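The core of that flow looks roughly like this. It's a sketch of the general idea, not the app's actual code; the WeatherAPI endpoint and response fields shown here are illustrative:

```python
import openai
import requests

def weather_to_wear(location: str, weatherapi_key: str) -> str:
    # 1. Retrieve today's conditions for the user's location.
    weather = requests.get(
        "https://api.weatherapi.com/v1/current.json",
        params={"key": weatherapi_key, "q": location},
    ).json()["current"]

    # 2. Embed the retrieved data in the prompt.
    prompt = (
        f"The current weather in {location} is {weather['temp_c']}°C, "
        f"{weather['condition']['text']}, UV index {weather['uv']}. "
        "Based on this, suggest what I should wear today."
    )

    # 3. Let gpt-3.5-turbo ground its recommendation in that data.
    completion = openai.ChatCompletion.create(
        model="gpt-3.5-turbo",
        messages=[{"role": "user", "content": prompt}],
    )
    return completion.choices[0].message.content
```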
More often than not, the completion sounds reasonable.
In the first iteration of this app, occasionally the completion included recommendations like, “Don't forget to wear sunscreen and a hat or sunglasses if necessary, as the UV index is at 1.0.” That isn't accurate. A UV index between 0 and 2 is considered low. In fact, some experts suggest that sunscreen is unnecessary in this range.
Simple fix: I augmented the prompt with a brief list of UV ranges and their associated severities.
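In the sketch above, that fix amounts to appending a short guide to the prompt. The wording here is mine; the ranges follow the standard UV-index scale:

```python
# Inside weather_to_wear(), right after building the prompt:
uv_guide = (
    "UV index guide: 0-2 low (protection generally unnecessary), 3-5 moderate, "
    "6-7 high, 8-10 very high, 11+ extreme."
)
prompt = prompt + " " + uv_guide
```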
Of course, this sort of simple method breaks down if we need lots of data, partly because there is a limit to the amount of data you can stuff into a prompt, depending on the model. Imagine a law firm that wants a custom LLM-based app, augmented with an extensive digital library of court cases. For such situations, there are more powerful ways of architecting retrieval data augmentation.
Here's what that might look like: Because of the limits mentioned above, we'll break up the library into more manageable chunks. Next, we calculate a numerical representation (a vector) for each chunk through a process called embedding, making the information significantly more compact. Store the results.
When a lawyer enters a query in search of cases related to a specific topic, we'll calculate the query's embedding. Using this new query vector, we search the stored results from above. The best matches will be semantically relevant to the user's query.
We'll add the relevant chunks to our prompt like so:
"Document 1:" + chunk1 + "Document 2:" + chunk2 + "Document 3:" + chunk3 + ...
There's no need to build architecture like this from scratch though. Tools like LangChain and LlamaIndex have embedding features. Use them and make merry.
Now our LLM effectively has fast access to a wide store of information for augmentation purposes, without hitting the token limit for the prompt/completion or having to scour the internet on the fly. As a general rule, the more salient information we have, the better hallucinations are managed.
Completion Parameters / Hyperparameters
Completion parameters (often mistakenly called hyperparameters) are post-training inputs we can adjust via the OpenAI Playground or by using the OpenAI Completions API. (I've noticed that some SWEs include completion parameters in the prompt itself! "Temperature: 0")
While we know approximately what each does, the real fun is in playing mix-and-match.
1. temperature (range: 0-2, default: 1)
Controls randomness. Low values are more deterministic. Use high values to get whacky.
2. top_p (range: 0-1, default: 1)
Also called nucleus sampling. Controls the pool of results from which a completion can be created. When top_p = 0.1, the model considers only the top 10% of the probability mass of tokens.
Note: OpenAI docs suggest using either temperature or top_p but not both. However, engineers working for OpenAI say to have at it. Press all the buttons.
3. presence_penalty (range: -2 to 2, default: 0)
Adjusts the bias of the model to choose novel tokens. Low values allow repeated tokens. High values encourage token novelty in the completion (in both diction and subject matter). This is a one-time adjustment applied to all tokens used at least once.
4. frequency_penalty (range: -2 to 2, default: 0)
Adjusts the bias of the model to limit token repetition in proportion to each token's frequency. Low values allow for more repetition. High values penalize repetition.
Note: On the surface presence and frequency penalties might look similar. They are not. Presence is a one-time adjustment applied to all tokens. Frequency is proportionally applied to each and every token.
5. logit_bias (range: -100 to 100, requires a JSON object mapping token IDs to bias values)
Controls the likelihood of tokens appearing in the completion at all. Use token IDs to specify which token likelihoods to modify. (e.g. {"31373": -100} will ensure that 'hello' will not appear.)
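For reference, here's how those knobs appear in a Chat Completions call (pre-1.0 openai Python client; the values are just illustrative starting points, not recommendations):

```python
import openai

response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": "Summarize today's forecast in two sentences."}],
    temperature=0.2,             # low randomness: stick close to the most likely tokens
    top_p=1.0,                   # docs suggest tuning this *or* temperature, not both
    presence_penalty=0.0,        # one-time penalty on any token that has already appeared
    frequency_penalty=0.3,       # penalize tokens in proportion to how often they repeat
    logit_bias={"31373": -100},  # the example above: effectively ban the 'hello' token
)
print(response.choices[0].message.content)
```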
Weighing steerability against tuning completion parameters feels like taking a lateral walk through language itself. With so many variables, exploring this space is equal parts compelling and confusing. But also, completely fun.
For the purposes of limiting LLM hallucinations, though, the trick is to find the sweet spot where the completions are predictable enough to sound sane, but unpredictable enough to sound unlike a machine (if that's what you're after).
Chain-of-Thought (CoT) Prompting
Chain-of-thought prompting extends the concept of few-shot prompting to enable the LLM to model a certain flavor of reasoning, as in sufficiently complex math problems. It's also useful when we want output to conform to a specified format.
Zero-shot prompting:
"This is wonderful!"
I'm glad to hear that you're enjoying our conversation! I'm here to help and provide information to the best of my abilities. If there's anything specific you'd like to discuss or any questions you have, feel free to let me know. I'm here to assist you!
Aww, wholesome.
Few-shot prompting:
What if we wanted ChatGPT to indicate whether the sentence was positive or negative in tone? Let's feed it some formatted examples with few-shot prompting.
"This is awesome! // Negative
This is bad! // Positive
Wow that movie was rad! // Positive
What a horrible show! // "
Negative
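Sent through the API, the same few-shot prompt looks like this (a minimal sketch):

```python
import openai

few_shot_prompt = (
    "This is awesome! // Positive\n"
    "This is bad! // Negative\n"
    "Wow that movie was rad! // Positive\n"
    "What a horrible show! // "
)

response = openai.ChatCompletion.create(
    model="gpt-3.5-turbo",
    messages=[{"role": "user", "content": few_shot_prompt}],
    temperature=0,  # we want the pattern-following answer, not creativity
)
print(response.choices[0].message.content)  # expected: Negative
```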
For more complicated tasks we'll need to expose the LLM to intermediate steps of reasoning (a chain-of-thought, if you will). Note that chain-of-thought prompting works especially well for models with 100+ billion parameters. Also, ChatGPT has already evolved enough to produce and share a chain of thought on its own for a variety of tasks. Basic uses of this strategy may become obsolete soon.
In some cases, you can approximate CoT prompting simply by adding this to your prompt: "Let's think step-by-step." Or, even better: "State each step and then show your work for performing that step." According to stevenic on the OpenAI forum, the latter phrasing reduced hallucinations significantly when used at the end of his prompts.
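In practice, that's as simple as tacking the instruction onto the end of the prompt (a sketch; the word problem is just an illustration):

```python
question = (
    "A cafe had 23 apples. They used 20 to make lunch and bought 6 more. "
    "How many apples do they have now?"
)
# Append the CoT instruction so the model walks through its reasoning.
prompt = question + "\nState each step and then show your work for performing that step."
```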
Here's a toy example of CoT prompting offered by this Google research article:
If you're anything like me, you don't find toy examples satisfying. Besides, this example (and most of the others I can find online) is already obsolete. ChatGPT definitely exposes reasoning in response to the standard prompt above.
I recommend reading this thread on the OpenAI forum, in which stevenic shares how he crafted a prompt that taught GPT-4 how to play Risk with a human user. This thread has further discussion about his CoT format: Building Hallucination Resistant Prompts. It's the most compelling example of chain-of-thought prompting I've been able to find. Seriously, Steve has a fan over here.
I would love to hear your thoughts on LLM hallucinations in the comments. What strategies have you found most effective?
Stay tuned for my next project: a static single-page React chat app incorporating ChatGPT and information on my professional background, for my portfolio site. A user should be able to dictate the chat entity's personality and professional demeanor. Who says work can't be fun?
Addendum: for creatives and the curious in the mood to lean in to the hallucinations, try DreamGPT.
Cover image by Steve Johnson
Top comments (6)
I think one of the reasons for the hallucination is the possibility of the LLM "getting away with it" when the AI answers incorrectly.
Let's think of AI as a kind of living thing. If the AI responds incorrectly and can 'make amends' simply with an 'I apologize', it is guaranteed that it can hallucinate without learning anything from the problem scenario.
One idea could be to penalize hallucinations: a way for humans or other AIs to control such scenarios, qualifying them on the spot and eventually providing an appropriate response (if this is possible). That is, a dynamic training of the LLM.
And even if the above could materialize, who would validate those penalties? Who would decide the validity of the scenario? Who would take in the new information, and how would it be incorporated as learning, so that avoiding hallucinations generalizes beyond a single context?
The entry of new training information is also a gateway to errors and, misused, it could contribute to a 'hack' of the AI.
I think there is still some way to go before we have an AI where hallucination is non-existent.
Hi Fernando, welcome to the DEV Community! I'm a moderator here, and I'm glad to have you on board. You are joining a wonderful community, and we work to ensure that DEV is a respectful and inclusive space for developers to learn, share, and connect.
I'm sure you will contribute positively to this community. If you ever need any assistance or advice, please feel free to reach out to me or any other moderators. We are always here to help. 😊
Thanks for the welcome! I'll do my best to follow the community guidelines.
See you!
Hi Fernando, thank you for your thoughts!
Imho, we should be careful about framing LLMs as "alive". They may be complex due to (potentially) billions of parameters; however, their outputs are still very much determined by their inputs.
Regarding spotting and providing appropriate responses to LLMs, this kind of steering can be fruitful in certain contexts. Unfortunately, as end users, that kind of interaction wouldn't fundamentally change the way the LLM would operate during subsequent sessions. We can only fundamentally change the way it works during the training phase.
You may be interested in the topic of Reinforcement Learning from Human Feedback.
I agree that we've still got some work to do regarding hallucination :)
The concept that “the cow jumps over the moon” and “water boils at 100 degrees Celsius” are both facts is honestly pretty scary to think about. It’s always really important to remember that ChatGPT is basically just fancy autocomplete with some RNG.
I say that but then what am I but a machine that picks out my next action based on my past experience and maybe some randomness
Oh yeah, I agree. It’s fancy autocomplete, and we just may be somewhat fancier autocomplete.
Somehow this spurred the thought: you know how things like smol-ai and FLAML exist now? (The former approximates a junior dev, and the latter is a library for AutoML and tuning.) I wonder what other roles we can create in such a way? Where’s the current gap? It’ll close fast, but I’m curious.