Amelia Hough-Ross

Amazon Bedrock Beats Paper: Troubleshooting LLM Hallucinations

This post continues the story I started here.

Overview

As a reminder, I wanted to use LLMs to chat with my Jira ticket data, since that was all the hype in 2023-2024 thanks to the rise in popularity of ChatGPT. I knew my ticket data contained specific use cases from my loudest customers, because they called me on the phone, but I wanted to know about everyone else.

  • What questions were they asking?

  • What was my team missing?

  • What connections across the organization could be made to speed up research activities?

  • How could my team create more helpful FAQ documentation?

I started this use case in January 2024, using Claude V2.0 and Amazon Titan Text Embeddings, and proceeded to spend the entirety of that year losing all faith that I could accurately chat with my data using Amazon Bedrock Knowledge Bases. I began to see improvement when Claude 3.5 came out, but not the robust results I expected. For more details on the use case, please check out this video from AWS re:Invent 2024 here.

While I had a great plan and solid execution, the results were not what I expected. By using an Amazon Bedrock Knowledge Base backed solely by an OpenSearch Serverless collection as the vector store for my RAG pipeline, I continued to suffer through LLM hallucinations, and I had no faith that my queries were returning good responses. Here's how I knew:

Out of 4,000 Jira tickets, I asked my Amazon Bedrock Knowledge Base to categorize the top 5 most requested items in the Jira ticket data. It came back with 5 different things, one of which was Jupyter notebooks. I thought, aha! How many tickets actually contain the spelling of Jupyter? I checked my CSV file and there were exactly 13. Thirteen out of four thousand. That's roughly 0.3%. I wouldn't call that a top 5 contender, but it gave me exactly the detail I needed to test the RAG pipeline: when I asked the knowledge base how many Jupyter tickets existed, would it give me the correct answer?
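That kind of sanity check is easy to script outside of Bedrock. Here's a minimal sketch that counts matching tickets straight from the CSV export; the file name and column names are placeholders rather than my actual export, so adjust them to your own Jira fields.

```python
import csv

matches = []
with open("jira_tickets.csv", newline="", encoding="utf-8") as f:  # placeholder file name
    for row in csv.DictReader(f):
        # "Issue key", "Summary", and "Description" are typical Jira export
        # columns, assumed here for illustration.
        text = f"{row.get('Summary', '')} {row.get('Description', '')}".lower()
        # Catch both the correct spelling and the common "Jupiter" misspelling.
        if "jupyter" in text or "jupiter" in text:
            matches.append(row.get("Issue key", ""))

print(len(matches), matches)  # I expected exactly 13 for my dataset
```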

Awkward Data Responses

During 2024, the behavior I experienced showed that my knowledge base couldn't find all 13 Jupyter tickets. Every time I asked how many tickets contained the text "Jupyter" or "Jupiter", I got fewer than 13. I would get 2 of them, and if I modified the prompt, I might get 4. Usually it was the same 2, but sometimes it was a different set of Jupyter tickets. In frustration, I modified different settings, including the large language model, the number of source chunks, and the search type.
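For reference, those are all knobs you can turn on a retrieve-and-generate call. Below is a minimal boto3 sketch showing the source chunk count and search type overridden; the knowledge base ID is a placeholder, and the parameter names reflect my reading of the Bedrock Agent Runtime API, so double-check them against the current documentation.

```python
import boto3

client = boto3.client("bedrock-agent-runtime", region_name="us-east-1")

response = client.retrieve_and_generate(
    input={"text": "How many tickets mention Jupyter or Jupiter notebooks?"},
    retrieveAndGenerateConfiguration={
        "type": "KNOWLEDGE_BASE",
        "knowledgeBaseConfiguration": {
            "knowledgeBaseId": "KB_ID_PLACEHOLDER",  # placeholder
            "modelArn": "arn:aws:bedrock:us-east-1::foundation-model/"
                        "anthropic.claude-3-5-sonnet-20240620-v1:0",
            "retrievalConfiguration": {
                "vectorSearchConfiguration": {
                    "numberOfResults": 20,           # source chunks returned
                    "overrideSearchType": "HYBRID",  # or "SEMANTIC"
                }
            },
        },
    },
)

print(response["output"]["text"])
```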

I constructed different prompts, including telling the LLM that it was reviewing Jira tickets and looking for similarities across the dataset. It didn't seem to matter what I changed; the results were all subpar, only giving me a small fraction of the responses I knew existed.
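The prompt changes were along these lines: a custom text prompt template passed through the knowledge base's generation configuration, where Bedrock substitutes the retrieved chunks for the $search_results$ placeholder. This is a sketch of that feature as I understand it, not my exact prompt.

```python
# Hedged sketch: goes under "knowledgeBaseConfiguration" as
# "generationConfiguration" in the retrieve_and_generate call shown earlier.
generation_configuration = {
    "promptTemplate": {
        "textPromptTemplate": (
            "You are reviewing Jira tickets for a research computing team. "
            "Look for similarities across the dataset and answer using only "
            "the tickets provided below.\n\nTickets:\n$search_results$"
        )
    }
}
```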

If I modified certain settings, I got different results, and the best part was when I got responses with no mention of "Jupyter" anywhere. The LLM so desperately wanted to give me a response that it gave me a nonsensical one.

On April 22, 2025, after a full year of proving I couldn't get back the 13 Jupyter tickets using the OpenSearch Serverless collection option, I tried a new tactic with the newly released Bedrock Knowledge Base using the Aurora PostgreSQL option. My use case started working! Mind blown! What had changed?

Spoiler Alert - Two things changed:

  1. My data changed. I no longer had access to my original dataset, so I created a synthetic dataset using ChatGPT.
  2. I used the shiny new Aurora PostgreSQL option (a sketch of the table setup it expects follows this list).
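For anyone trying the Aurora PostgreSQL route, the knowledge base expects a pgvector-backed table in the cluster (reachable through the RDS Data API) before you point Bedrock at it. The sketch below follows the schema, table, and column names I recall from the AWS docs; the ARNs are placeholders and the vector dimension depends on your embedding model, so confirm everything against the current documentation.

```python
import boto3

rds = boto3.client("rds-data", region_name="us-east-1")

# Placeholders -- substitute your own cluster ARN, Secrets Manager ARN, and database.
CLUSTER_ARN = "arn:aws:rds:us-east-1:123456789012:cluster:my-bedrock-kb-cluster"
SECRET_ARN = "arn:aws:secretsmanager:us-east-1:123456789012:secret:bedrock-kb-creds"
DATABASE = "postgres"

# Assumed names: schema "bedrock_integration", table "bedrock_kb";
# vector(1536) matches Titan Text Embeddings.
statements = [
    "CREATE EXTENSION IF NOT EXISTS vector;",
    "CREATE SCHEMA IF NOT EXISTS bedrock_integration;",
    """CREATE TABLE IF NOT EXISTS bedrock_integration.bedrock_kb (
           id uuid PRIMARY KEY,
           embedding vector(1536),
           chunks text,
           metadata json
       );""",
    """CREATE INDEX IF NOT EXISTS bedrock_kb_embedding_idx
           ON bedrock_integration.bedrock_kb
           USING hnsw (embedding vector_cosine_ops);""",
]

for sql in statements:
    rds.execute_statement(
        resourceArn=CLUSTER_ARN, secretArn=SECRET_ARN, database=DATABASE, sql=sql
    )
```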

The functionality I had expected all along finally displayed on the page. The first question I asked, to identify all Jupyter tickets, returned a total of 13 tickets, all unique, and all matching the data in my CSV file.

Once I had proven I could get back those exact results, I started asking different questions, and the results were significantly better. But was this just because I had changed to a SQL-like database instance for the knowledge base?

No!

I created a new vector store using OpenSearch Serverless and loaded the same data. I asked it the same questions and, surprisingly, got back the full 13 Jupyter tickets.

All things being equal, the only thing different was my data. When I reviewed the synthetic data I had created, I realized the description fields were significantly shorter than those in my original dataset. It would appear that the shorter descriptions made it easier to locate the specific tickets I was looking for. I then re-created the synthetic data to include longer paragraphs in the description field. I tested this out, and the Aurora Knowledge Base option still successfully returned the full 13 tickets.
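Since description length turned out to be the variable that mattered, a quick profiling pass like the one below makes the difference between the two synthetic datasets obvious before ingesting them. The file and column names are assumptions for illustration.

```python
import pandas as pd

for path in ["synthetic_short.csv", "synthetic_long.csv"]:  # placeholder file names
    df = pd.read_csv(path)
    lengths = df["Description"].fillna("").str.len()  # assumed column name
    print(path, "median chars:", int(lengths.median()), "max chars:", int(lengths.max()))
```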

Summary

Amazon Bedrock Knowledge Bases are getting better at letting me chat with my data, but so far only with synthetic data. The Amazon Aurora PostgreSQL Knowledge Base option is also more cost effective: the OpenSearch Serverless option was costing me ~$5/day whether I was testing it or not. At this point, the functionality appears to be the same across Aurora and OpenSearch Serverless with my synthetic data. Further testing is needed on a more realistic dataset to fully vet the capability to identify trends across thousands of Jira tickets.
