<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Thomas Compton</title>
    <description>The latest articles on DEV Community by Thomas Compton (@unbrokencocoon).</description>
    <link>https://dev.to/unbrokencocoon</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3195419%2Ff5ddc465-9995-42e8-b405-9233353078f8.jpeg</url>
      <title>DEV Community: Thomas Compton</title>
      <link>https://dev.to/unbrokencocoon</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/unbrokencocoon"/>
    <language>en</language>
    <item>
      <title>Creating Synthetic Dialogues using RAG and Gemini</title>
      <dc:creator>Thomas Compton</dc:creator>
      <pubDate>Fri, 11 Jul 2025 09:17:24 +0000</pubDate>
      <link>https://dev.to/unbrokencocoon/creating-synthetic-dialogues-using-rag-and-gemini-4ni2</link>
      <guid>https://dev.to/unbrokencocoon/creating-synthetic-dialogues-using-rag-and-gemini-4ni2</guid>
      <description>&lt;p&gt;In working on a project comparing 2 topic models of different corpora, it became annoying not having them just tell me how they disagree with each other. It’s great to see lists of what the most common ideas are, but it’s much better to have a way of summarising the key differences between them. Since I was using topic models, I already had lists of sentences and sentence embeddings so any integration into an LLM could be easy but the question of how to make a tool that really tests where these corpora differ and doesn’t end up in an endless debate proved harder.&lt;/p&gt;

&lt;p&gt;For more information and the full code, see the &lt;a href="https://github.com/UnbrokenCocoon/Historical-union-debate" rel="noopener noreferrer"&gt;GitHub repository&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Introduction&lt;/strong&gt;&lt;br&gt;
Using LLMs, in my case Gemini, to simulate debate is less common than building chatbots. One issue we have seen with &lt;a href="https://www.infiniteconversation.com/" rel="noopener noreferrer"&gt;Herzog vs Zizek&lt;/a&gt; is that long debates can easily go off topic. As the author says: “It sometimes makes sense and sometimes not. It sometimes contains true information, and sometimes it contains outright falsities.” That project used model training and appears to have generated one large output. This is easy to fix while saving computation: limit each output to five interchanges. From my testing, this proved a good number that allowed genuine debate without drifting off topic. It also creates a convenient building block that can be inserted into a larger debate, with some user fine-tuning if needed.&lt;/p&gt;

&lt;p&gt;But the real issue is: how do we get the knowledge into the model? There are two approaches: training and RAG. Training is straightforward in principle; you take a model and feed it the list of sentences for each corpus, wrapped in instructions. It can be time-consuming depending on hardware, but the deeper problem is that during the debate you are relying on two separately trained models to correctly identify which knowledge is most appropriate for a response. This invites the tangent issue and generally makes it harder for the user to steer the model. You should still provide system instructions to both, but these may not be sufficient to bring the conversation back on course.&lt;/p&gt;

&lt;p&gt;RAG is quickly becoming a popular alternative for combining LLMs with large corpora [&lt;a href="https://dl.acm.org/doi/10.1145/3711542.3711590" rel="noopener noreferrer"&gt;1&lt;/a&gt;], which is why I opted for a RAG-based system. By using FAISS wrapped in LangChain, the heavy lifting is done by sentence similarity; all the LLM has to do is summarise the five most similar sentences.&lt;/p&gt;
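&lt;p&gt;FAISS handles this retrieval at scale; as an illustration of the underlying idea only (not the project's actual code), top-k retrieval over normalised sentence embeddings reduces to a dot product followed by a sort:&lt;/p&gt;

```python
import numpy as np

def top_k_similar(query_vec, sentence_vecs, k=5):
    # Normalise rows so cosine similarity becomes a plain dot product
    q = query_vec / np.linalg.norm(query_vec)
    m = sentence_vecs / np.linalg.norm(sentence_vecs, axis=1, keepdims=True)
    sims = m @ q
    # Indices of the k most similar sentences, best first
    order = np.argsort(-sims)
    return order[:k]
```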

&lt;p&gt;&lt;strong&gt;The Pipeline&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The LLM was given this prompt:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;system_prompt_a = """You are a 1920s trade union representative from the National Boot and Shoe Union.
Use the retrieved sentences as your knowledge base.
Speak persuasively as if you are arguing with a fellow trade unionist.
Do not use your own knowledge.
Vary your sentence structures, do not repeat phrases.
Respond with 2 sentences maximum.""" 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;By keeping each LLM response to two sentences, the conversations stay focused on the topic and are easy to follow. The emphasis is on using the knowledge from the retrieved sentences, so the LLM does not fall back on what it thinks it knows about the topic. One remaining challenge was that the LLM kept using the term ‘comrade’, which did not seem appropriate for the context.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def generate_turn(query, retriever, system_prompt, speaker):
    docs = retriever.get_relevant_documents(query)
    content = "\n".join(doc.page_content for doc in docs)

    prompt = ChatPromptTemplate.from_messages([
        ("system", system_prompt),
        ("human", f"You just heard the following message:\n\n\"{query}\"\n\nHere are 5 excerpts from your documents that may help you reply:\n{content}\n\nRespond to the message above based ONLY on this information, and speak as if you were in a real conversation.")
    ])
    response = llm(prompt.format_messages())
    reply_text = response.content.strip().replace("\n", " ")
    print(f"\n{speaker}:\n{reply_text}\n{'-'*50}")
    dialogue_history.append({
    "speaker": speaker,
    "query": query,
    "response": reply_text,
    "context": content
    })
    return reply_text
list_of_models = ["gemini-1.5-flash", "gemini-1.5-flash-8b", "gemini-2.0-flash-lite", "gemini-2.0-flash", "gemini-1.5-flash-8b"]
for j in range(40):
  time.sleep(20)
  sentence_index = range(0, len(bs_sen),1)
  query = bs_sen[random.choice(sentence_index)]
  for i in range(5):
      llm = ChatGoogleGenerativeAI(model=list_of_models[i], temperature=0.7)
      if i % 2 == 0:
          speaker = "Historic Union (1920s)"
          query = generate_turn(query, retriever_a, system_prompt_a, speaker)
      else:
          speaker = "Modern Union (2020s)"
          query = generate_turn(query, retriever_b, system_prompt_b, speaker)
with open(os.path.join(data_dir, 'dialogue_history.pkl'), 'wb') as f:
  pickle.dump(dialogue_history, f)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This produces a database that can either be manually reorganised by a user, linking together dialogues they feel fit well, or chunked and then formatted as a DataFrame.&lt;/p&gt;
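&lt;p&gt;As a sketch of the DataFrame route (the column names follow the dialogue_history entries above; the sample rows and the 5-turn grouping are illustrative assumptions), the pickled history converts directly:&lt;/p&gt;

```python
import pandas as pd

# Hypothetical sample entries in the shape produced by generate_turn
dialogue_history = [
    {"speaker": "Historic Union (1920s)", "query": "q0", "response": "r0", "context": "c0"},
    {"speaker": "Modern Union (2020s)", "query": "r0", "response": "r1", "context": "c1"},
]

df = pd.DataFrame(dialogue_history)
# Tag each turn with the 5-turn exchange it belongs to
df["exchange_id"] = df.index // 5
```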

&lt;p&gt;&lt;strong&gt;Evaluation Approaches&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Evaluating the database proved challenging. Using qualitative judgement, comparisons to topic models, and knowledge of the texts, the outputs appeared generally appropriate. In this case it helped that this output was the final stage of a long project exploring these two corpora; many new users will not have such familiarity. The real goal of this approach is to output as many conversation chunks as possible and then choose the most appropriate. If unsure, it may be useful to apply other quantitative approaches to the corpora, such as BERTopic or counts of the most frequent lemmas and bigrams, which also helps ensure that any conversation chunks are representative of the total corpus. On the other hand, what makes these conversations interesting is that they compare low-frequency sentences to each other, so they surface conflicts that topic models will not. Their small scale is therefore an advantage, though how useful these conflicts are depends on the type of corpora being used and how many perspectives are found within them.&lt;/p&gt;

&lt;p&gt;A potential idea for evaluation would be a system comparing scholars or politicians who have frequently debated. Evaluation may be best suited to asking those with domain knowledge to guess which debates are generated by LLMs and which are genuine. Moreover, it would be easy enough to create the synthetic database and use FAISS to link it to actual instances of debate, making qualitative comparisons and exploiting cosine similarity. The difficulty lies in normalising the data: the synthetic output is best kept to two-turn exchanges, while real debates come in many different formats, and the inherent variation within speech would limit this approach.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Gradio Output&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Once the database was created, I used it for a new RAG-backed Gradio app on &lt;a href="https://huggingface.co/spaces/ovrelord/union-debate-sim" rel="noopener noreferrer"&gt;Hugging Face Spaces&lt;/a&gt;. This lets users search the database for relevant answers. Users can choose to group the short conversations and provide tags or tables if that adds value to the UX, and could also consider text-to-speech to increase accessibility. There are many options for dramatising these chunks of conversation, whether appending them into larger scripts and/or converting them into different formats. Once a full database has been created, this is where quality control can be undertaken, either with the aid of a human interface or through classification of the chunks.&lt;/p&gt;

&lt;p&gt;Either way, the chunks can become building blocks for new creative endeavours, depending on the needs of the project. Future directions could involve changing the format to exchanges of letters or monologues. The challenge there will be prompt engineering: providing the right guidance to the LLM, again focusing on small chunks of output and heavy direction to ensure the output matches the specification. Essentially, step 1 is to create the structure of the piece, step 2 is to convert that structure into a series of instructions, and step 3 is to use the database to choose the most appropriate context to feed the model. This project has also engaged local stakeholders to ascertain their interests. Local museums and community groups have shown interest in the concept, suggesting it can be a useful way of making archives more accessible to the general public. It therefore seems sensible for anyone attempting a similar project to involve similar interested groups, both for human validation of good exchanges output by the model and for evaluation of the more subjective, hard-to-measure 'creative' uses of AI.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Conclusion&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This pilot test has demonstrated that it is possible to create reasonably coherent conversations between two corpora using RAG. By offloading retrieval from the LLM to FAISS, the risks of going off topic and of hallucination are, in theory, mitigated. Moreover, generating short interchanges as building blocks for potentially larger projects, combined with crowd-sourced user evaluation, could be a productive way of engaging communities throughout the process of making history relevant to them. For more practical uses, looking for key differences between databases using regex and searching for points of contention could also be a useful tool. This approach boils down to summarisation with extra steps, a task models are reasonably good at, so the system can build on the strengths of existing pipelines while providing new opportunities for creative representations of large corpora.&lt;/p&gt;

</description>
      <category>gemini</category>
      <category>rag</category>
      <category>ai</category>
      <category>python</category>
    </item>
    <item>
      <title>[Boost]</title>
      <dc:creator>Thomas Compton</dc:creator>
      <pubDate>Thu, 22 May 2025 11:31:53 +0000</pubDate>
      <link>https://dev.to/unbrokencocoon/-1ilc</link>
      <guid>https://dev.to/unbrokencocoon/-1ilc</guid>
      <description>&lt;div class="ltag__link--embedded"&gt;
  &lt;div class="crayons-story "&gt;
  &lt;a href="https://dev.to/unbrokencocoon/comparing-llms-and-python-ocr-packages-opportunities-and-challenges-in-ocr-accuracy-3ppa" class="crayons-story__hidden-navigation-link"&gt;Comparing LLMs and Python OCR Packages: Opportunities and Challenges in OCR Accuracy&lt;/a&gt;


  &lt;div class="crayons-story__body crayons-story__body-full_post"&gt;
    &lt;div class="crayons-story__top"&gt;
      &lt;div class="crayons-story__meta"&gt;
        &lt;div class="crayons-story__author-pic"&gt;

          &lt;a href="/unbrokencocoon" class="crayons-avatar  crayons-avatar--l  "&gt;
            &lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3195419%2Ff5ddc465-9995-42e8-b405-9233353078f8.jpeg" alt="unbrokencocoon profile" class="crayons-avatar__image"&gt;
          &lt;/a&gt;
        &lt;/div&gt;
        &lt;div&gt;
          &lt;div&gt;
            &lt;a href="/unbrokencocoon" class="crayons-story__secondary fw-medium m:hidden"&gt;
              Thomas Compton
            &lt;/a&gt;
            &lt;div class="profile-preview-card relative mb-4 s:mb-0 fw-medium hidden m:inline-block"&gt;
              
                Thomas Compton
                
              
              &lt;div id="story-author-preview-content-2514669" class="profile-preview-card__content crayons-dropdown branded-7 p-4 pt-0"&gt;
                &lt;div class="gap-4 grid"&gt;
                  &lt;div class="-mt-4"&gt;
                    &lt;a href="/unbrokencocoon" class="flex"&gt;
                      &lt;span class="crayons-avatar crayons-avatar--xl mr-2 shrink-0"&gt;
                        &lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3195419%2Ff5ddc465-9995-42e8-b405-9233353078f8.jpeg" class="crayons-avatar__image" alt=""&gt;
                      &lt;/span&gt;
                      &lt;span class="crayons-link crayons-subtitle-2 mt-5"&gt;Thomas Compton&lt;/span&gt;
                    &lt;/a&gt;
                  &lt;/div&gt;
                  &lt;div class="print-hidden"&gt;
                    
                      Follow
                    
                  &lt;/div&gt;
                  &lt;div class="author-preview-metadata-container"&gt;&lt;/div&gt;
                &lt;/div&gt;
              &lt;/div&gt;
            &lt;/div&gt;

          &lt;/div&gt;
          &lt;a href="https://dev.to/unbrokencocoon/comparing-llms-and-python-ocr-packages-opportunities-and-challenges-in-ocr-accuracy-3ppa" class="crayons-story__tertiary fs-xs"&gt;&lt;time&gt;May 22 '25&lt;/time&gt;&lt;span class="time-ago-indicator-initial-placeholder"&gt;&lt;/span&gt;&lt;/a&gt;
        &lt;/div&gt;
      &lt;/div&gt;

    &lt;/div&gt;

    &lt;div class="crayons-story__indention"&gt;
      &lt;h2 class="crayons-story__title crayons-story__title-full_post"&gt;
        &lt;a href="https://dev.to/unbrokencocoon/comparing-llms-and-python-ocr-packages-opportunities-and-challenges-in-ocr-accuracy-3ppa" id="article-link-2514669"&gt;
          Comparing LLMs and Python OCR Packages: Opportunities and Challenges in OCR Accuracy
        &lt;/a&gt;
      &lt;/h2&gt;
        &lt;div class="crayons-story__tags"&gt;
            &lt;a class="crayons-tag  crayons-tag--monochrome " href="/t/ocr"&gt;&lt;span class="crayons-tag__prefix"&gt;#&lt;/span&gt;ocr&lt;/a&gt;
            &lt;a class="crayons-tag  crayons-tag--monochrome " href="/t/openai"&gt;&lt;span class="crayons-tag__prefix"&gt;#&lt;/span&gt;openai&lt;/a&gt;
            &lt;a class="crayons-tag  crayons-tag--monochrome " href="/t/ai"&gt;&lt;span class="crayons-tag__prefix"&gt;#&lt;/span&gt;ai&lt;/a&gt;
            &lt;a class="crayons-tag  crayons-tag--monochrome " href="/t/gemini"&gt;&lt;span class="crayons-tag__prefix"&gt;#&lt;/span&gt;gemini&lt;/a&gt;
        &lt;/div&gt;
      &lt;div class="crayons-story__bottom"&gt;
        &lt;div class="crayons-story__details"&gt;
          &lt;a href="https://dev.to/unbrokencocoon/comparing-llms-and-python-ocr-packages-opportunities-and-challenges-in-ocr-accuracy-3ppa" class="crayons-btn crayons-btn--s crayons-btn--ghost crayons-btn--icon-left"&gt;
            &lt;div class="multiple_reactions_aggregate"&gt;
              &lt;span class="multiple_reactions_icons_container"&gt;
                  &lt;span class="crayons_icon_container"&gt;
                    &lt;img src="https://assets.dev.to/assets/sparkle-heart-5f9bee3767e18deb1bb725290cb151c25234768a0e9a2bd39370c382d02920cf.svg" width="18" height="18"&gt;
                  &lt;/span&gt;
              &lt;/span&gt;
              &lt;span class="aggregate_reactions_counter"&gt;2&lt;span class="hidden s:inline"&gt; reactions&lt;/span&gt;&lt;/span&gt;
            &lt;/div&gt;
          &lt;/a&gt;
            &lt;a href="https://dev.to/unbrokencocoon/comparing-llms-and-python-ocr-packages-opportunities-and-challenges-in-ocr-accuracy-3ppa#comments" class="crayons-btn crayons-btn--s crayons-btn--ghost crayons-btn--icon-left flex items-center"&gt;
              Comments


              &lt;span class="hidden s:inline"&gt;Add Comment&lt;/span&gt;
            &lt;/a&gt;
        &lt;/div&gt;
        &lt;div class="crayons-story__save"&gt;
          &lt;small class="crayons-story__tertiary fs-xs mr-2"&gt;
            3 min read
          &lt;/small&gt;
            
              &lt;span class="bm-initial"&gt;
                

              &lt;/span&gt;
              &lt;span class="bm-success"&gt;
                

              &lt;/span&gt;
            
        &lt;/div&gt;
      &lt;/div&gt;
    &lt;/div&gt;
  &lt;/div&gt;
&lt;/div&gt;

&lt;/div&gt;


</description>
      <category>ocr</category>
      <category>openai</category>
      <category>ai</category>
      <category>gemini</category>
    </item>
    <item>
      <title>Comparing LLMs and Python OCR Packages: Opportunities and Challenges in OCR Accuracy</title>
      <dc:creator>Thomas Compton</dc:creator>
      <pubDate>Thu, 22 May 2025 10:59:39 +0000</pubDate>
      <link>https://dev.to/unbrokencocoon/comparing-llms-and-python-ocr-packages-opportunities-and-challenges-in-ocr-accuracy-3ppa</link>
      <guid>https://dev.to/unbrokencocoon/comparing-llms-and-python-ocr-packages-opportunities-and-challenges-in-ocr-accuracy-3ppa</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;Multimodal LLMs create new opportunities for extracting text from difficult images. But what are the pros and cons? How do Deepseek, Qwen, Gemini, and ChatGPT compare to traditional OCR packages?&lt;/p&gt;

&lt;p&gt;This post compares different LLMs and traditional Python OCR tools using Jiwer’s WER and CER metrics to assess accuracy. Lessons from running large-scale text extraction using Gemini are also discussed.&lt;/p&gt;

&lt;p&gt;Full evaluation code, tables, and workflows can be found in this &lt;a href="https://github.com/UnbrokenCocoon/OCR-evaluation" rel="noopener noreferrer"&gt;GitHub repository&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Key Findings
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;LLMs offer high-accuracy OCR&lt;/strong&gt;, but are costly, slow, and require powerful hardware.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Traditional OCR tools are lightweight and fast&lt;/strong&gt;, but generally less accurate.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  OCR Packages (EasyOCR, Tesseract, PaddleOCR)
&lt;/h2&gt;

&lt;p&gt;Archival OCR is important for NLP tasks like topic modelling. Python packages like EasyOCR, PyTesseract, and PaddleOCR are commonly used. This comparison focuses on practical performance, not theoretical strengths.&lt;/p&gt;

&lt;p&gt;All evaluations use Jiwer’s Word Error Rate (WER) and Character Error Rate (CER).&lt;/p&gt;
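&lt;p&gt;For intuition, WER and CER are edit distances normalised by reference length: a minimal pure-Python sketch of what Jiwer computes (Jiwer itself adds text normalisation options on top):&lt;/p&gt;

```python
def edit_distance(ref, hyp):
    # Levenshtein distance via dynamic programming, one row at a time
    m, n = len(ref), len(hyp)
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        curr = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            curr[j] = min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost)
        prev = curr
    return prev[n]

def wer(reference, hypothesis):
    # Word Error Rate: word-level edits / reference word count
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)

def cer(reference, hypothesis):
    # Character Error Rate: character-level edits / reference length
    return edit_distance(list(reference), list(hypothesis)) / len(reference)
```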

&lt;h3&gt;
  
  
  Traditional OCR Results
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Engine&lt;/th&gt;
&lt;th&gt;WER&lt;/th&gt;
&lt;th&gt;CER&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;EasyOCR&lt;/td&gt;
&lt;td&gt;0.89&lt;/td&gt;
&lt;td&gt;0.67&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Tesseract&lt;/td&gt;
&lt;td&gt;0.69&lt;/td&gt;
&lt;td&gt;0.43&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;PaddleOCR&lt;/td&gt;
&lt;td&gt;0.79&lt;/td&gt;
&lt;td&gt;0.76&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Tesseract performs best overall. EasyOCR has a lower CER than PaddleOCR, but higher WER, suggesting it identifies characters well but struggles with correct word segmentation.&lt;/p&gt;

&lt;h3&gt;
  
  
  Preprocessing Impact
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Step&lt;/th&gt;
&lt;th&gt;WER&lt;/th&gt;
&lt;th&gt;CER&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Before Preprocessing&lt;/td&gt;
&lt;td&gt;0.77&lt;/td&gt;
&lt;td&gt;0.60&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;After Preprocessing&lt;/td&gt;
&lt;td&gt;0.67&lt;/td&gt;
&lt;td&gt;0.43&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Preprocessing significantly improves accuracy. This step is recommended for all OCR pipelines.&lt;/p&gt;
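&lt;p&gt;The post does not spell out its preprocessing steps, so as one hypothetical example: a simple global threshold (real pipelines typically also deskew, denoise, and rescale) can be written with NumPy alone:&lt;/p&gt;

```python
import numpy as np

def binarise(gray, threshold=128):
    # Map a grayscale page to pure black/white before feeding it to OCR
    mask = np.greater_equal(gray, threshold)
    return np.where(mask, 255, 0).astype(np.uint8)
```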

&lt;h2&gt;
  
  
  Post-Processing with LLMs
&lt;/h2&gt;

&lt;p&gt;Post-correction with LLMs can fix some OCR issues (e.g., word splits), but won't recover text not detected by the OCR engine. Also, token limits and prompt misinterpretations pose risks at scale.&lt;/p&gt;

&lt;h2&gt;
  
  
  LLMs as an OCR Solution
&lt;/h2&gt;

&lt;p&gt;The results speak for themselves:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Engine&lt;/th&gt;
&lt;th&gt;WER&lt;/th&gt;
&lt;th&gt;CER&lt;/th&gt;
&lt;th&gt;LLM&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Gemini&lt;/td&gt;
&lt;td&gt;0.04&lt;/td&gt;
&lt;td&gt;0.02&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Qwen&lt;/td&gt;
&lt;td&gt;0.06&lt;/td&gt;
&lt;td&gt;0.03&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Deepseek&lt;/td&gt;
&lt;td&gt;0.10&lt;/td&gt;
&lt;td&gt;0.06&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ChatGPT&lt;/td&gt;
&lt;td&gt;0.58&lt;/td&gt;
&lt;td&gt;0.45&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Tesseract&lt;/td&gt;
&lt;td&gt;0.69&lt;/td&gt;
&lt;td&gt;0.43&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;PaddleOCR&lt;/td&gt;
&lt;td&gt;0.79&lt;/td&gt;
&lt;td&gt;0.76&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;EasyOCR&lt;/td&gt;
&lt;td&gt;0.89&lt;/td&gt;
&lt;td&gt;0.67&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Multimodal LLMs outperform traditional OCR significantly. However, they likely include some internal correction pipelines, making the comparison imperfect.&lt;/p&gt;

&lt;h3&gt;
  
  
  Word Mismatch Counts
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Word Mismatch&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Gemini&lt;/td&gt;
&lt;td&gt;30&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Qwen&lt;/td&gt;
&lt;td&gt;26&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Deepseek&lt;/td&gt;
&lt;td&gt;276&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ChatGPT&lt;/td&gt;
&lt;td&gt;108&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Gemini performs best overall on WER and CER and has a usable Python wrapper. But it introduces new challenges.&lt;/p&gt;

&lt;h2&gt;
  
  
  Deployment Considerations: Gemini
&lt;/h2&gt;

&lt;p&gt;Gemini’s main issues are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Rate Limits&lt;/strong&gt;:  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Gemini 2.0 Flash-Lite: 30 requests/min, 1,500/day
&lt;/li&gt;
&lt;li&gt;&lt;a href="https://ai.google.dev/gemini-api/docs/rate-limits" rel="noopener noreferrer"&gt;See full limits&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;p&gt;&lt;strong&gt;Copyright Flags&lt;/strong&gt;:  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Gemini may falsely flag archival material.
&lt;/li&gt;
&lt;li&gt;Handling involves rerouting failed items to alternative models.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;
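&lt;p&gt;A 30 requests/min cap amounts to one request every 2 seconds. A minimal client-side throttle (an illustrative sketch of my own, not part of any Gemini SDK) keeps a batch job under that limit:&lt;/p&gt;

```python
import time

class Throttle:
    """Enforce a minimum interval between calls (30 req/min = 2.0 s)."""

    def __init__(self, interval_seconds):
        self.interval = interval_seconds
        self.last = float("-inf")  # first call never waits

    def wait(self):
        # Sleep only for whatever remains of the interval since the last call
        remaining = self.interval - (time.monotonic() - self.last)
        time.sleep(max(0.0, remaining))
        self.last = time.monotonic()
```

Calling `wait()` before each API request then guarantees the spacing, regardless of how long each request itself takes.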

&lt;h2&gt;
  
  
  Deployment Considerations: Qwen via Ollama
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Qwen via Ollama&lt;/strong&gt; is an option for local runs&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Negatives&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Requires sufficient hardware.&lt;/li&gt;
&lt;li&gt;Accuracy may drop slightly due to quantisation.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Positives&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Free&lt;/li&gt;
&lt;li&gt;No reliance on internet during runs&lt;/li&gt;
&lt;li&gt;Data Privacy&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Qwen2.5-VL 7B performed:&lt;br&gt;
&lt;code&gt;WER: 0.22&lt;br&gt;
CER: 0.15&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;This performance is notably worse than Qwen3 through the browser interface. To improve results, use a larger model, though that requires better hardware than I have, and even then it may not rival online usage. It is a decision for the user to make based on their priorities. Hugging Face can also be used for this task; again, a matter of user preference. It is worth noting that &lt;a href="https://ollama.com/blog/multimodal-models" rel="noopener noreferrer"&gt;Ollama&lt;/a&gt; is discussing its multimodal optimisation strategy, which should turn heads.&lt;/p&gt;

&lt;h2&gt;
  
  
  Issues with LLMs
&lt;/h2&gt;

&lt;p&gt;LLMs are often criticised for stochasticity. This was tested using Deepseek.&lt;/p&gt;

&lt;h3&gt;
  
  
  Deepseek Consistency Results
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Run&lt;/th&gt;
&lt;th&gt;WER&lt;/th&gt;
&lt;th&gt;CER&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Run 1&lt;/td&gt;
&lt;td&gt;0.10&lt;/td&gt;
&lt;td&gt;0.06&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Run 2&lt;/td&gt;
&lt;td&gt;0.10&lt;/td&gt;
&lt;td&gt;0.06&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Run 3&lt;/td&gt;
&lt;td&gt;0.10&lt;/td&gt;
&lt;td&gt;0.06&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Run 4&lt;/td&gt;
&lt;td&gt;0.10&lt;/td&gt;
&lt;td&gt;0.06&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Deepseek showed consistent aggregate results across multiple runs, with only minimal pairwise WER differences between individual runs:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Comparison&lt;/th&gt;
&lt;th&gt;WER Difference&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Run 2 vs Run 1&lt;/td&gt;
&lt;td&gt;0.0128&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Run 3 vs Run 1&lt;/td&gt;
&lt;td&gt;0.0000&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Run 4 vs Run 1&lt;/td&gt;
&lt;td&gt;0.0118&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Still, live monitoring of results (e.g., mean character count) is recommended for production pipelines.&lt;/p&gt;
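&lt;p&gt;One lightweight way to implement such monitoring (a hypothetical sketch; function names are my own) is to track how far each batch's mean character count drifts from a baseline established on known-good output:&lt;/p&gt;

```python
def mean_char_count(texts):
    # Average character count of a batch of OCR outputs
    return sum(len(t) for t in texts) / len(texts)

def drift_score(batch_mean, baseline_mean, baseline_std):
    # How many standard deviations the current batch sits from the baseline;
    # large scores suggest truncated or runaway outputs worth inspecting
    return abs(batch_mean - baseline_mean) / baseline_std
```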

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;This post compared LLMs and traditional OCR tools for image-to-text pipelines:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Traditional OCR tools&lt;/strong&gt; like Tesseract are stable and lightweight, but less accurate.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;LLMs&lt;/strong&gt; like Gemini and Deepseek outperform on accuracy, but introduce complexity, cost, and deployment challenges.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The best choice depends on your goals:  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;For cost-effective large-scale work: Tesseract with preprocessing
&lt;/li&gt;
&lt;li&gt;For highest accuracy and smaller datasets: Gemini or Deepseek&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For full code and data:&lt;br&gt;&lt;br&gt;
👉 &lt;a href="https://github.com/UnbrokenCocoon/OCR-evaluation" rel="noopener noreferrer"&gt;GitHub Repository&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Amrhein, C. &amp;amp; Clematide, S. (2018). Supervised OCR Error Detection and Correction. &lt;em&gt;JLCL&lt;/em&gt;, 33(1).
&lt;/li&gt;
&lt;li&gt;Compton, T. (2025). &lt;em&gt;OCR Evaluation&lt;/em&gt;. &lt;a href="https://github.com/UnbrokenCocoon/OCR-evaluation" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Hemmer, A. et al. (2024). Confidence-Aware OCR Error Detection. &lt;em&gt;Document Analysis Systems&lt;/em&gt;, Springer.
&lt;/li&gt;
&lt;li&gt;Kim, S. et al. (2025). LLMs and OCR for Historical Records. &lt;em&gt;arXiv&lt;/em&gt;. doi:10.48550/arXiv.2501.11623
&lt;/li&gt;
&lt;li&gt;Warwick Modern Record Centre, archival material
&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>ocr</category>
      <category>openai</category>
      <category>ai</category>
      <category>gemini</category>
    </item>
  </channel>
</rss>
