<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Thomas Compton</title>
    <description>The latest articles on DEV Community by Thomas Compton (@unbrokencocoon).</description>
    <link>https://dev.to/unbrokencocoon</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3195419%2Ff5ddc465-9995-42e8-b405-9233353078f8.jpeg</url>
      <title>DEV Community: Thomas Compton</title>
      <link>https://dev.to/unbrokencocoon</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/unbrokencocoon"/>
    <language>en</language>
    <item>
      <title>Creating Synthetic Dialogues using RAG and Gemini</title>
      <dc:creator>Thomas Compton</dc:creator>
      <pubDate>Fri, 11 Jul 2025 09:17:24 +0000</pubDate>
      <link>https://dev.to/unbrokencocoon/creating-synthetic-dialogues-using-rag-and-gemini-4ni2</link>
      <guid>https://dev.to/unbrokencocoon/creating-synthetic-dialogues-using-rag-and-gemini-4ni2</guid>
      <description>&lt;p&gt;In working on a project comparing 2 topic models of different corpora, it became annoying not having them just tell me how they disagree with each other. It’s great to see lists of what the most common ideas are, but it’s much better to have a way of summarising the key differences between them. Since I was using topic models, I already had lists of sentences and sentence embeddings so any integration into an LLM could be easy but the question of how to make a tool that really tests where these corpora differ and doesn’t end up in an endless debate proved harder.&lt;/p&gt;

&lt;p&gt;For more information and the full code, see the &lt;a href="https://github.com/UnbrokenCocoon/Historical-union-debate" rel="noopener noreferrer"&gt;GitHub repository&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Introduction&lt;/strong&gt;&lt;br&gt;
Using LLMs, in my case Gemini, to simulate debate is less common than building chatbots. One issue we have seen with &lt;a href="https://www.infiniteconversation.com/" rel="noopener noreferrer"&gt;Herzog vs Zizek&lt;/a&gt; is that long debates can easily go off topic. As the author says: “It sometimes makes sense and sometimes not. It sometimes contains true information, and sometimes it contains outright falsities.” That project used model training and appears to have generated one large output. This is easy to fix while saving computation: limit each output to five interchanges. From my testing, this proved a good number that allowed genuine debate without drifting off topic. It also creates a convenient building block that can be inserted into a larger debate, with some user fine-tuning if needed.&lt;/p&gt;

&lt;p&gt;But the real issue is: how do we get the knowledge into the model? There are two approaches: training and RAG. Training is straightforward in principle; you take a model and feed it the list of sentences for each corpus, wrapped in instructions. It can be time-consuming depending on hardware, but the deeper problem is that during the debate you are relying on two separately trained models to correctly identify which knowledge is most appropriate for a response. This invites the tangent issue and generally makes it harder for the user to steer the model. You should still provide system instructions to both, but these may not be sufficient to bring the conversation back on course.&lt;/p&gt;

&lt;p&gt;RAG is quickly becoming a popular alternative for combining LLMs with large corpora [&lt;a href="https://dl.acm.org/doi/10.1145/3711542.3711590" rel="noopener noreferrer"&gt;1&lt;/a&gt;], which is why I opted for a RAG-based system. By using FAISS wrapped in LangChain, the heavy lifting is done by sentence similarity; all the LLM has to do is summarise the five most similar sentences.&lt;/p&gt;
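&lt;p&gt;FAISS handles this retrieval at scale; as an illustration of the underlying idea only (not the project's actual code), top-k retrieval over normalised sentence embeddings reduces to a dot product followed by a sort:&lt;/p&gt;

```python
import numpy as np

def top_k_similar(query_vec, sentence_vecs, k=5):
    # Normalise rows so cosine similarity becomes a plain dot product
    q = query_vec / np.linalg.norm(query_vec)
    m = sentence_vecs / np.linalg.norm(sentence_vecs, axis=1, keepdims=True)
    sims = m @ q
    # Indices of the k most similar sentences, best first
    order = np.argsort(-sims)
    return order[:k]
```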

&lt;p&gt;&lt;strong&gt;The Pipeline&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The LLM was given this prompt:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;system_prompt_a = """You are a 1920s trade union representative from the National Boot and Shoe Union.
Use the retrieved sentences as your knowledge base.
Speak persuasively as if you are arguing with a fellow trade unionist.
Do not use your own knowledge.
Vary your sentence structures, do not repeat phrases.
Respond with 2 sentences maximum.""" 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;By keeping each LLM response to two sentences, the conversations stay focused on the topic and are easy to follow. The emphasis is on using the knowledge from the retrieved sentences, so the LLM does not fall back on what it thinks it knows about the topic. One remaining challenge was that the LLM kept using the term ‘comrade’, which did not seem appropriate for the context.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def generate_turn(query, retriever, system_prompt, speaker):
    docs = retriever.get_relevant_documents(query)
    content = "\n".join(doc.page_content for doc in docs)

    prompt = ChatPromptTemplate.from_messages([
        ("system", system_prompt),
        ("human", f"You just heard the following message:\n\n\"{query}\"\n\nHere are 5 excerpts from your documents that may help you reply:\n{content}\n\nRespond to the message above based ONLY on this information, and speak as if you were in a real conversation.")
    ])
    response = llm(prompt.format_messages())
    reply_text = response.content.strip().replace("\n", " ")
    print(f"\n{speaker}:\n{reply_text}\n{'-'*50}")
    dialogue_history.append({
    "speaker": speaker,
    "query": query,
    "response": reply_text,
    "context": content
    })
    return reply_text
list_of_models = ["gemini-1.5-flash", "gemini-1.5-flash-8b", "gemini-2.0-flash-lite", "gemini-2.0-flash", "gemini-1.5-flash-8b"]
for j in range(40):
  time.sleep(20)
  sentence_index = range(0, len(bs_sen),1)
  query = bs_sen[random.choice(sentence_index)]
  for i in range(5):
      llm = ChatGoogleGenerativeAI(model=list_of_models[i], temperature=0.7)
      if i % 2 == 0:
          speaker = "Historic Union (1920s)"
          query = generate_turn(query, retriever_a, system_prompt_a, speaker)
      else:
          speaker = "Modern Union (2020s)"
          query = generate_turn(query, retriever_b, system_prompt_b, speaker)
with open(os.path.join(data_dir, 'dialogue_history.pkl'), 'wb') as f:
  pickle.dump(dialogue_history, f)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This produces a database that can either be manually reorganised by a user, linking together dialogues they feel fit well, or chunked and then formatted as a DataFrame.&lt;/p&gt;
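&lt;p&gt;As a sketch of the DataFrame route (the column names follow the dialogue_history entries above; the sample rows and the 5-turn grouping are illustrative assumptions), the pickled history converts directly:&lt;/p&gt;

```python
import pandas as pd

# Hypothetical sample entries in the shape produced by generate_turn
dialogue_history = [
    {"speaker": "Historic Union (1920s)", "query": "q0", "response": "r0", "context": "c0"},
    {"speaker": "Modern Union (2020s)", "query": "r0", "response": "r1", "context": "c1"},
]

df = pd.DataFrame(dialogue_history)
# Tag each turn with the 5-turn exchange it belongs to
df["exchange_id"] = df.index // 5
```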

&lt;p&gt;&lt;strong&gt;Evaluation Approaches&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Evaluating the database proved challenging. Using qualitative judgement, comparisons to topic models, and knowledge of the texts, the outputs appeared generally appropriate. In this case it helped that this output was the final stage of a long project exploring these two corpora; many new users will not have such familiarity. The real goal of this approach is to output as many conversation chunks as possible and then choose the most appropriate. If unsure, it may be useful to apply other quantitative approaches to the corpora, such as BERTopic or counts of the most frequent lemmas and bigrams, which also helps ensure that any conversation chunks are representative of the total corpus. On the other hand, what makes these conversations interesting is that they compare low-frequency sentences to each other, so they surface conflicts that topic models will not. Their small scale is therefore an advantage, though how useful these conflicts are depends on the type of corpora being used and how many perspectives are found within them.&lt;/p&gt;

&lt;p&gt;A potential idea for evaluation would be a system comparing scholars or politicians who have frequently debated. Evaluation may be best suited to asking those with domain knowledge to guess which debates are generated by LLMs and which are genuine. Moreover, it would be easy enough to create the synthetic database and use FAISS to link it to actual instances of debate, making qualitative comparisons and exploiting cosine similarity. The difficulty lies in normalising the data: the synthetic output is best kept to two-turn exchanges, while real debates come in many different formats, and the inherent variation within speech would limit this approach.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Gradio Output&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Once the database was created, I used it for a new RAG-backed Gradio app on &lt;a href="https://huggingface.co/spaces/ovrelord/union-debate-sim" rel="noopener noreferrer"&gt;Hugging Face Spaces&lt;/a&gt;. This lets users search the database for relevant answers. Users can choose to group the short conversations and provide tags or tables if that adds value to the UX, and could also consider text-to-speech to increase accessibility. There are many options for dramatising these chunks of conversation, whether appending them into larger scripts and/or converting them into different formats. Once a full database has been created, this is where quality control can be undertaken, either with the aid of a human interface or through classification of the chunks.&lt;/p&gt;

&lt;p&gt;Either way, the chunks can become building blocks for new creative endeavours, depending on the needs of the project. Future directions could involve changing the format to exchanges of letters or monologues. The challenge there will be prompt engineering: providing the right guidance to the LLM, again focusing on small chunks of output and heavy direction to ensure the output matches the specification. Essentially, step 1 is to create the structure of the piece, step 2 is to convert that structure into a series of instructions, and step 3 is to use the database to choose the most appropriate context to feed the model. This project has also engaged local stakeholders to ascertain their interests. Local museums and community groups have shown interest in the concept, suggesting it can be a useful way of making archives more accessible to the general public. It therefore seems sensible for anyone attempting a similar project to involve similar interested groups, both for human validation of good exchanges output by the model and for evaluation of the more subjective, hard-to-measure 'creative' uses of AI.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Conclusion&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This pilot test has demonstrated that it is possible to create reasonably coherent conversations between two corpora using RAG. By offloading retrieval from the LLM to FAISS, the risks of going off topic and of hallucination are, in theory, mitigated. Moreover, generating short interchanges as building blocks for potentially larger projects, combined with crowd-sourced user evaluation, could be a productive way of engaging communities throughout the process of making history relevant to them. For more practical uses, looking for key differences between databases using regex and searching for points of contention could also be a useful tool. This approach boils down to summarisation with extra steps, a task models are reasonably good at, so the system can build on the strengths of existing pipelines while providing new opportunities for creative representations of large corpora.&lt;/p&gt;

</description>
      <category>gemini</category>
      <category>rag</category>
      <category>ai</category>
      <category>python</category>
    </item>
    <item>
      <title>[Boost]</title>
      <dc:creator>Thomas Compton</dc:creator>
      <pubDate>Thu, 22 May 2025 11:31:53 +0000</pubDate>
      <link>https://dev.to/unbrokencocoon/-1ilc</link>
      <guid>https://dev.to/unbrokencocoon/-1ilc</guid>
      <description>&lt;div class="ltag__link--embedded"&gt;
  &lt;div class="crayons-story "&gt;
  &lt;a href="https://dev.to/unbrokencocoon/comparing-llms-and-python-ocr-packages-opportunities-and-challenges-in-ocr-accuracy-3ppa" class="crayons-story__hidden-navigation-link"&gt;Comparing LLMs and Python OCR Packages: Opportunities and Challenges in OCR Accuracy&lt;/a&gt;


  &lt;div class="crayons-story__body crayons-story__body-full_post"&gt;
    &lt;div class="crayons-story__top"&gt;
      &lt;div class="crayons-story__meta"&gt;
        &lt;div class="crayons-story__author-pic"&gt;

          &lt;a href="/unbrokencocoon" class="crayons-avatar  crayons-avatar--l  "&gt;
            &lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3195419%2Ff5ddc465-9995-42e8-b405-9233353078f8.jpeg" alt="unbrokencocoon profile" class="crayons-avatar__image"&gt;
          &lt;/a&gt;
        &lt;/div&gt;
        &lt;div&gt;
          &lt;div&gt;
            &lt;a href="/unbrokencocoon" class="crayons-story__secondary fw-medium m:hidden"&gt;
              Thomas Compton
            &lt;/a&gt;
            &lt;div class="profile-preview-card relative mb-4 s:mb-0 fw-medium hidden m:inline-block"&gt;
              
                Thomas Compton
                
              
              &lt;div id="story-author-preview-content-2514669" class="profile-preview-card__content crayons-dropdown branded-7 p-4 pt-0"&gt;
                &lt;div class="gap-4 grid"&gt;
                  &lt;div class="-mt-4"&gt;
                    &lt;a href="/unbrokencocoon" class="flex"&gt;
                      &lt;span class="crayons-avatar crayons-avatar--xl mr-2 shrink-0"&gt;
                        &lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3195419%2Ff5ddc465-9995-42e8-b405-9233353078f8.jpeg" class="crayons-avatar__image" alt=""&gt;
                      &lt;/span&gt;
                      &lt;span class="crayons-link crayons-subtitle-2 mt-5"&gt;Thomas Compton&lt;/span&gt;
                    &lt;/a&gt;
                  &lt;/div&gt;
                  &lt;div class="print-hidden"&gt;
                    
                      Follow
                    
                  &lt;/div&gt;
                  &lt;div class="author-preview-metadata-container"&gt;&lt;/div&gt;
                &lt;/div&gt;
              &lt;/div&gt;
            &lt;/div&gt;

          &lt;/div&gt;
          &lt;a href="https://dev.to/unbrokencocoon/comparing-llms-and-python-ocr-packages-opportunities-and-challenges-in-ocr-accuracy-3ppa" class="crayons-story__tertiary fs-xs"&gt;&lt;time&gt;May 22 '25&lt;/time&gt;&lt;span class="time-ago-indicator-initial-placeholder"&gt;&lt;/span&gt;&lt;/a&gt;
        &lt;/div&gt;
      &lt;/div&gt;

    &lt;/div&gt;

    &lt;div class="crayons-story__indention"&gt;
      &lt;h2 class="crayons-story__title crayons-story__title-full_post"&gt;
        &lt;a href="https://dev.to/unbrokencocoon/comparing-llms-and-python-ocr-packages-opportunities-and-challenges-in-ocr-accuracy-3ppa" id="article-link-2514669"&gt;
          Comparing LLMs and Python OCR Packages: Opportunities and Challenges in OCR Accuracy
        &lt;/a&gt;
      &lt;/h2&gt;
        &lt;div class="crayons-story__tags"&gt;
            &lt;a class="crayons-tag  crayons-tag--monochrome " href="/t/ocr"&gt;&lt;span class="crayons-tag__prefix"&gt;#&lt;/span&gt;ocr&lt;/a&gt;
            &lt;a class="crayons-tag  crayons-tag--monochrome " href="/t/openai"&gt;&lt;span class="crayons-tag__prefix"&gt;#&lt;/span&gt;openai&lt;/a&gt;
            &lt;a class="crayons-tag  crayons-tag--monochrome " href="/t/ai"&gt;&lt;span class="crayons-tag__prefix"&gt;#&lt;/span&gt;ai&lt;/a&gt;
            &lt;a class="crayons-tag  crayons-tag--monochrome " href="/t/gemini"&gt;&lt;span class="crayons-tag__prefix"&gt;#&lt;/span&gt;gemini&lt;/a&gt;
        &lt;/div&gt;
      &lt;div class="crayons-story__bottom"&gt;
        &lt;div class="crayons-story__details"&gt;
          &lt;a href="https://dev.to/unbrokencocoon/comparing-llms-and-python-ocr-packages-opportunities-and-challenges-in-ocr-accuracy-3ppa" class="crayons-btn crayons-btn--s crayons-btn--ghost crayons-btn--icon-left"&gt;
            &lt;div class="multiple_reactions_aggregate"&gt;
              &lt;span class="multiple_reactions_icons_container"&gt;
                  &lt;span class="crayons_icon_container"&gt;
                    &lt;img src="https://assets.dev.to/assets/sparkle-heart-5f9bee3767e18deb1bb725290cb151c25234768a0e9a2bd39370c382d02920cf.svg" width="18" height="18"&gt;
                  &lt;/span&gt;
              &lt;/span&gt;
              &lt;span class="aggregate_reactions_counter"&gt;2&lt;span class="hidden s:inline"&gt; reactions&lt;/span&gt;&lt;/span&gt;
            &lt;/div&gt;
          &lt;/a&gt;
            &lt;a href="https://dev.to/unbrokencocoon/comparing-llms-and-python-ocr-packages-opportunities-and-challenges-in-ocr-accuracy-3ppa#comments" class="crayons-btn crayons-btn--s crayons-btn--ghost crayons-btn--icon-left flex items-center"&gt;
              Comments


              &lt;span class="hidden s:inline"&gt;Add Comment&lt;/span&gt;
            &lt;/a&gt;
        &lt;/div&gt;
        &lt;div class="crayons-story__save"&gt;
          &lt;small class="crayons-story__tertiary fs-xs mr-2"&gt;
            3 min read
          &lt;/small&gt;
            
              &lt;span class="bm-initial"&gt;
                

              &lt;/span&gt;
              &lt;span class="bm-success"&gt;
                

              &lt;/span&gt;
            
        &lt;/div&gt;
      &lt;/div&gt;
    &lt;/div&gt;
  &lt;/div&gt;
&lt;/div&gt;

&lt;/div&gt;


</description>
      <category>ocr</category>
      <category>openai</category>
      <category>ai</category>
      <category>gemini</category>
    </item>
    <item>
      <title>Comparing LLMs and Python OCR Packages: Opportunities and Challenges in OCR Accuracy</title>
      <dc:creator>Thomas Compton</dc:creator>
      <pubDate>Thu, 22 May 2025 10:59:39 +0000</pubDate>
      <link>https://dev.to/unbrokencocoon/comparing-llms-and-python-ocr-packages-opportunities-and-challenges-in-ocr-accuracy-3ppa</link>
      <guid>https://dev.to/unbrokencocoon/comparing-llms-and-python-ocr-packages-opportunities-and-challenges-in-ocr-accuracy-3ppa</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;Multimodal LLMs create new opportunities for extracting text from difficult images. But what are the pros and cons? How do Deepseek, Qwen, Gemini, and ChatGPT compare to traditional OCR packages?&lt;/p&gt;

&lt;p&gt;This post compares different LLMs and traditional Python OCR tools using Jiwer’s WER and CER metrics to assess accuracy. Lessons from running large-scale text extraction using Gemini are also discussed.&lt;/p&gt;

&lt;p&gt;Full evaluation code, tables, and workflows can be found in this &lt;a href="https://github.com/UnbrokenCocoon/OCR-evaluation" rel="noopener noreferrer"&gt;GitHub repository&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Key Findings
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;LLMs offer high-accuracy OCR&lt;/strong&gt;, but are costly, slow, and require powerful hardware.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Traditional OCR tools are lightweight and fast&lt;/strong&gt;, but generally less accurate.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  OCR Packages (EasyOCR, Tesseract, PaddleOCR)
&lt;/h2&gt;

&lt;p&gt;Archival OCR is important for NLP tasks like topic modelling. Python packages like EasyOCR, PyTesseract, and PaddleOCR are commonly used. This comparison focuses on practical performance, not theoretical strengths.&lt;/p&gt;

&lt;p&gt;All evaluations use Jiwer’s Word Error Rate (WER) and Character Error Rate (CER).&lt;/p&gt;
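&lt;p&gt;For intuition, WER and CER are edit distances normalised by reference length: a minimal pure-Python sketch of what Jiwer computes (Jiwer itself adds text normalisation options on top):&lt;/p&gt;

```python
def edit_distance(ref, hyp):
    # Levenshtein distance via dynamic programming, one row at a time
    m, n = len(ref), len(hyp)
    prev = list(range(n + 1))
    for i in range(1, m + 1):
        curr = [i] + [0] * n
        for j in range(1, n + 1):
            cost = 0 if ref[i - 1] == hyp[j - 1] else 1
            curr[j] = min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost)
        prev = curr
    return prev[n]

def wer(reference, hypothesis):
    # Word Error Rate: word-level edits / reference word count
    ref_words = reference.split()
    return edit_distance(ref_words, hypothesis.split()) / len(ref_words)

def cer(reference, hypothesis):
    # Character Error Rate: character-level edits / reference length
    return edit_distance(list(reference), list(hypothesis)) / len(reference)
```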

&lt;h3&gt;
  
  
  Traditional OCR Results
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Engine&lt;/th&gt;
&lt;th&gt;WER&lt;/th&gt;
&lt;th&gt;CER&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;EasyOCR&lt;/td&gt;
&lt;td&gt;0.89&lt;/td&gt;
&lt;td&gt;0.67&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Tesseract&lt;/td&gt;
&lt;td&gt;0.69&lt;/td&gt;
&lt;td&gt;0.43&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;PaddleOCR&lt;/td&gt;
&lt;td&gt;0.79&lt;/td&gt;
&lt;td&gt;0.76&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Tesseract performs best overall. EasyOCR has a lower CER than PaddleOCR, but higher WER, suggesting it identifies characters well but struggles with correct word segmentation.&lt;/p&gt;

&lt;h3&gt;
  
  
  Preprocessing Impact
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Step&lt;/th&gt;
&lt;th&gt;WER&lt;/th&gt;
&lt;th&gt;CER&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Before Preprocessing&lt;/td&gt;
&lt;td&gt;0.77&lt;/td&gt;
&lt;td&gt;0.60&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;After Preprocessing&lt;/td&gt;
&lt;td&gt;0.67&lt;/td&gt;
&lt;td&gt;0.43&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Preprocessing significantly improves accuracy. This step is recommended for all OCR pipelines.&lt;/p&gt;
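&lt;p&gt;The post does not spell out its preprocessing steps, so as one hypothetical example: a simple global threshold (real pipelines typically also deskew, denoise, and rescale) can be written with NumPy alone:&lt;/p&gt;

```python
import numpy as np

def binarise(gray, threshold=128):
    # Map a grayscale page to pure black/white before feeding it to OCR
    mask = np.greater_equal(gray, threshold)
    return np.where(mask, 255, 0).astype(np.uint8)
```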

&lt;h2&gt;
  
  
  Post-Processing with LLMs
&lt;/h2&gt;

&lt;p&gt;Post-correction with LLMs can fix some OCR issues (e.g., word splits), but won't recover text not detected by the OCR engine. Also, token limits and prompt misinterpretations pose risks at scale.&lt;/p&gt;

&lt;h2&gt;
  
  
  LLMs as an OCR Solution
&lt;/h2&gt;

&lt;p&gt;The results speak for themselves:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Engine&lt;/th&gt;
&lt;th&gt;WER&lt;/th&gt;
&lt;th&gt;CER&lt;/th&gt;
&lt;th&gt;LLM&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Gemini&lt;/td&gt;
&lt;td&gt;0.04&lt;/td&gt;
&lt;td&gt;0.02&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Qwen&lt;/td&gt;
&lt;td&gt;0.06&lt;/td&gt;
&lt;td&gt;0.03&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Deepseek&lt;/td&gt;
&lt;td&gt;0.10&lt;/td&gt;
&lt;td&gt;0.06&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ChatGPT&lt;/td&gt;
&lt;td&gt;0.58&lt;/td&gt;
&lt;td&gt;0.45&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Tesseract&lt;/td&gt;
&lt;td&gt;0.69&lt;/td&gt;
&lt;td&gt;0.43&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;PaddleOCR&lt;/td&gt;
&lt;td&gt;0.79&lt;/td&gt;
&lt;td&gt;0.76&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;EasyOCR&lt;/td&gt;
&lt;td&gt;0.89&lt;/td&gt;
&lt;td&gt;0.67&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Multimodal LLMs outperform traditional OCR significantly. However, they likely include some internal correction pipelines, making the comparison imperfect.&lt;/p&gt;

&lt;h3&gt;
  
  
  Word Mismatch Counts
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Word Mismatch&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Gemini&lt;/td&gt;
&lt;td&gt;30&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Qwen&lt;/td&gt;
&lt;td&gt;26&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Deepseek&lt;/td&gt;
&lt;td&gt;276&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ChatGPT&lt;/td&gt;
&lt;td&gt;108&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Gemini performs best overall on WER and CER and has a usable Python wrapper. But it introduces new challenges.&lt;/p&gt;

&lt;h2&gt;
  
  
  Deployment Considerations: Gemini
&lt;/h2&gt;

&lt;p&gt;Gemini’s main issues are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;p&gt;&lt;strong&gt;Rate Limits&lt;/strong&gt;:  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Gemini 2.0 Flash-Lite: 30 requests/min, 1,500/day
&lt;/li&gt;
&lt;li&gt;&lt;a href="https://ai.google.dev/gemini-api/docs/rate-limits" rel="noopener noreferrer"&gt;See full limits&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;

&lt;p&gt;&lt;strong&gt;Copyright Flags&lt;/strong&gt;:  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Gemini may falsely flag archival material.
&lt;/li&gt;
&lt;li&gt;Handling involves rerouting failed items to alternative models.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;
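&lt;p&gt;A 30 requests/min cap amounts to one request every 2 seconds. A minimal client-side throttle (an illustrative sketch of my own, not part of any Gemini SDK) keeps a batch job under that limit:&lt;/p&gt;

```python
import time

class Throttle:
    """Enforce a minimum interval between calls (30 req/min = 2.0 s)."""

    def __init__(self, interval_seconds):
        self.interval = interval_seconds
        self.last = float("-inf")  # first call never waits

    def wait(self):
        # Sleep only for whatever remains of the interval since the last call
        remaining = self.interval - (time.monotonic() - self.last)
        time.sleep(max(0.0, remaining))
        self.last = time.monotonic()
```

Calling `wait()` before each API request then guarantees the spacing, regardless of how long each request itself takes.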

&lt;h2&gt;
  
  
  Deployment Considerations: Qwen via Ollama
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Qwen via Ollama&lt;/strong&gt; is an option for local runs&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Negatives&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Requires sufficient hardware.&lt;/li&gt;
&lt;li&gt;Accuracy may drop slightly due to quantisation.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Positives&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Free&lt;/li&gt;
&lt;li&gt;No reliance on internet during runs&lt;/li&gt;
&lt;li&gt;Data Privacy&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Qwen2.5-VL 7B performed:&lt;br&gt;
&lt;code&gt;WER: 0.22&lt;br&gt;
CER: 0.15&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;This performance is notably worse than Qwen3 through the browser interface. To improve results, use a larger model, though that requires better hardware than I have, and even then it may not rival online usage. It is a decision for the user to make based on their priorities. Hugging Face can also be used for this task; again, a matter of user preference. It is worth noting that &lt;a href="https://ollama.com/blog/multimodal-models" rel="noopener noreferrer"&gt;Ollama&lt;/a&gt; is discussing its multimodal optimisation strategy, which should turn heads.&lt;/p&gt;

&lt;h2&gt;
  
  
  Issues with LLMs
&lt;/h2&gt;

&lt;p&gt;LLMs are often criticised for stochasticity. This was tested using Deepseek.&lt;/p&gt;

&lt;h3&gt;
  
  
  Deepseek Consistency Results
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Run&lt;/th&gt;
&lt;th&gt;WER&lt;/th&gt;
&lt;th&gt;CER&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Run 1&lt;/td&gt;
&lt;td&gt;0.10&lt;/td&gt;
&lt;td&gt;0.06&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Run 2&lt;/td&gt;
&lt;td&gt;0.10&lt;/td&gt;
&lt;td&gt;0.06&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Run 3&lt;/td&gt;
&lt;td&gt;0.10&lt;/td&gt;
&lt;td&gt;0.06&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Run 4&lt;/td&gt;
&lt;td&gt;0.10&lt;/td&gt;
&lt;td&gt;0.06&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Deepseek showed consistent aggregate results across multiple runs, with only minimal pairwise WER differences between individual runs:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Comparison&lt;/th&gt;
&lt;th&gt;WER Difference&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Run 2 vs Run 1&lt;/td&gt;
&lt;td&gt;0.0128&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Run 3 vs Run 1&lt;/td&gt;
&lt;td&gt;0.0000&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Run 4 vs Run 1&lt;/td&gt;
&lt;td&gt;0.0118&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Still, live monitoring of results (e.g., mean character count) is recommended for production pipelines.&lt;/p&gt;
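&lt;p&gt;One lightweight way to implement such monitoring (a hypothetical sketch; function names are my own) is to track how far each batch's mean character count drifts from a baseline established on known-good output:&lt;/p&gt;

```python
def mean_char_count(texts):
    # Average character count of a batch of OCR outputs
    return sum(len(t) for t in texts) / len(texts)

def drift_score(batch_mean, baseline_mean, baseline_std):
    # How many standard deviations the current batch sits from the baseline;
    # large scores suggest truncated or runaway outputs worth inspecting
    return abs(batch_mean - baseline_mean) / baseline_std
```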

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;This post compared LLMs and traditional OCR tools for image-to-text pipelines:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Traditional OCR tools&lt;/strong&gt; like Tesseract are stable and lightweight, but less accurate.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;LLMs&lt;/strong&gt; like Gemini and Deepseek outperform on accuracy, but introduce complexity, cost, and deployment challenges.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The best choice depends on your goals:  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;For cost-effective large-scale work: Tesseract with preprocessing
&lt;/li&gt;
&lt;li&gt;For highest accuracy and smaller datasets: Gemini or Deepseek&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For full code and data:&lt;br&gt;&lt;br&gt;
👉 &lt;a href="https://github.com/UnbrokenCocoon/OCR-evaluation" rel="noopener noreferrer"&gt;GitHub Repository&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Amrhein, C. &amp;amp; Clematide, S. (2018). Supervised OCR Error Detection and Correction. &lt;em&gt;JLCL&lt;/em&gt;, 33(1).
&lt;/li&gt;
&lt;li&gt;Compton, T. (2025). &lt;em&gt;OCR Evaluation&lt;/em&gt;. &lt;a href="https://github.com/UnbrokenCocoon/OCR-evaluation" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Hemmer, A. et al. (2024). Confidence-Aware OCR Error Detection. &lt;em&gt;Document Analysis Systems&lt;/em&gt;, Springer.
&lt;/li&gt;
&lt;li&gt;Kim, S. et al. (2025). LLMs and OCR for Historical Records. &lt;em&gt;arXiv&lt;/em&gt;. doi:10.48550/arXiv.2501.11623
&lt;/li&gt;
&lt;li&gt;Warwick Modern Record Centre, archival material
&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>ocr</category>
      <category>openai</category>
      <category>ai</category>
      <category>gemini</category>
    </item>
  </channel>
</rss>
