<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Mili Hunjic</title>
    <description>The latest articles on DEV Community by Mili Hunjic (@mili_hunjic_70cb2c5dd0e49).</description>
    <link>https://dev.to/mili_hunjic_70cb2c5dd0e49</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1817979%2Ff2f697c0-9a90-4041-9025-d39d1faef115.jpeg</url>
      <title>DEV Community: Mili Hunjic</title>
      <link>https://dev.to/mili_hunjic_70cb2c5dd0e49</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/mili_hunjic_70cb2c5dd0e49"/>
    <language>en</language>
    <item>
      <title>Good software thinking doesn’t age. Tools do</title>
      <dc:creator>Mili Hunjic</dc:creator>
      <pubDate>Mon, 06 Apr 2026 13:39:46 +0000</pubDate>
      <link>https://dev.to/mili_hunjic_70cb2c5dd0e49/good-software-thinking-doesnt-age-tools-do-4i6g</link>
      <guid>https://dev.to/mili_hunjic_70cb2c5dd0e49/good-software-thinking-doesnt-age-tools-do-4i6g</guid>
      <description>&lt;p&gt;I built an informatics quiz back in 2009.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fenaqcwwj3flgdbtlpouz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fenaqcwwj3flgdbtlpouz.png" alt="Gameplay menu" width="800" height="531"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;At the time, it was just a small student project — nothing official, just something interactive and fun for new students.&lt;/p&gt;

&lt;p&gt;Recently, I revisited it.&lt;/p&gt;

&lt;p&gt;Not the code first.&lt;br&gt;
Not the technology.&lt;/p&gt;

&lt;p&gt;But the &lt;strong&gt;questions&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;And surprisingly… many of them still feel relevant today.&lt;/p&gt;




&lt;h2&gt;
  
  
  🧠 Try it yourself
&lt;/h2&gt;

&lt;p&gt;Here are a few questions from that quiz:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What is steganography?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The Diffie-Hellman algorithm is used for?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Which of the following is NOT a characteristic of Object-Oriented Programming?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The predecessor to the C language was?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Alexey Pajitnov received nothing from the $800 million earned from?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F364vj98yyccvfxa99ar2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F364vj98yyccvfxa99ar2.png" alt="Gameplay" width="800" height="530"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;p&gt;Some of these test fundamentals.&lt;br&gt;
Some test history.&lt;br&gt;
Some are just there to trick you a little 😄&lt;/p&gt;

&lt;p&gt;But together, they reveal something interesting:&lt;/p&gt;

&lt;p&gt;👉 The &lt;em&gt;core knowledge&lt;/em&gt; hasn’t changed that much.&lt;/p&gt;




&lt;h2&gt;
  
  
  🔍 What changed (and what didn’t)
&lt;/h2&gt;

&lt;p&gt;Back in 2009:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;We used &lt;strong&gt;Flash&lt;/strong&gt; for interactivity
&lt;/li&gt;
&lt;li&gt;Questions were loaded from &lt;strong&gt;XML&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;State and logic were handled manually
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Today:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;We use &lt;strong&gt;React / modern frameworks&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Data comes from APIs (JSON)
&lt;/li&gt;
&lt;li&gt;State management is abstracted
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;But if you look closely:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Data is still structured
&lt;/li&gt;
&lt;li&gt;Logic still exists — just hidden behind abstractions
&lt;/li&gt;
&lt;li&gt;Fundamentals like memory, algorithms, and architecture are still key
&lt;/li&gt;
&lt;/ul&gt;
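&lt;p&gt;A rough sketch of that continuity (the field names are my own; the original file formats aren't shown in this post):&lt;/p&gt;

```python
# Sketch: the same quiz question as 2009-style XML and as modern JSON.
# Field names here are illustrative; the original formats aren't shown.
import json
import xml.etree.ElementTree as ET

question = {"text": "Sample question?", "answers": ["A", "B", "C"], "correct": 0}

# 2009: questions shipped as an XML file
root = ET.Element("question")
ET.SubElement(root, "text").text = question["text"]
for i, answer in enumerate(question["answers"]):
    ET.SubElement(root, "answer", index=str(i)).text = answer
xml_payload = ET.tostring(root, encoding="unicode")

# Today: the same structure arrives as JSON from an API
json_payload = json.dumps(question)

# Round-trip both: the underlying structure is identical
from_xml = ET.fromstring(xml_payload).find("text").text
from_json = json.loads(json_payload)["text"]
assert from_xml == from_json
```

&lt;p&gt;The transport changed; the model of a question didn't.&lt;/p&gt;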




&lt;h2&gt;
  
  
  🧩 The interesting part
&lt;/h2&gt;

&lt;p&gt;Some of these questions are arguably &lt;em&gt;more important today&lt;/em&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Security concepts like &lt;strong&gt;Diffie-Hellman&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Understanding of low-level concepts
&lt;/li&gt;
&lt;li&gt;Awareness of computing history
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Because modern tools often &lt;strong&gt;hide complexity&lt;/strong&gt; — but don’t remove it.&lt;/p&gt;
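&lt;p&gt;The Diffie-Hellman question above is a good example: the whole idea fits in a few lines. A toy sketch with deliberately tiny numbers (real deployments use parameters of 2048 bits or more):&lt;/p&gt;

```python
# Toy Diffie-Hellman key exchange. The tiny constants are for
# illustration only; real systems use 2048-bit-plus parameters.
p, g = 23, 5              # public: prime modulus and generator
a, b = 6, 15              # private exponents, one per party

A = pow(g, a, p)          # Alice publishes g^a mod p
B = pow(g, b, p)          # Bob publishes g^b mod p

shared_alice = pow(B, a, p)   # (g^b)^a mod p
shared_bob = pow(A, b, p)     # (g^a)^b mod p
assert shared_alice == shared_bob == 2   # both sides derive the same secret
```

&lt;p&gt;An eavesdropper sees p, g, A and B, but recovering the shared secret requires a private exponent. That idea hasn't aged at all since 2009.&lt;/p&gt;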




&lt;h2&gt;
  
  
  ⚙️ A small technical twist
&lt;/h2&gt;

&lt;p&gt;This quiz was originally built in &lt;strong&gt;Adobe Flash (.swf)&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Which, of course, is no longer supported in modern browsers.&lt;/p&gt;

&lt;p&gt;But…&lt;/p&gt;

&lt;p&gt;👉 It still runs today.&lt;/p&gt;

&lt;p&gt;Thanks to &lt;strong&gt;Ruffle&lt;/strong&gt;, a Flash emulator written in Rust.&lt;/p&gt;

&lt;p&gt;So instead of rewriting the project, I preserved it — and made it runnable again in the browser.&lt;/p&gt;




&lt;h2&gt;
  
  
  🎮 Try the full quiz
&lt;/h2&gt;

&lt;p&gt;Curious how you'd score? No Googling 😄&lt;/p&gt;

&lt;p&gt;You can play it directly in your browser:&lt;/p&gt;

&lt;p&gt;👉 &lt;strong&gt;Live Demo:&lt;/strong&gt; &lt;a href="https://milihwork.github.io/fit-quiz-2009/" rel="noopener noreferrer"&gt;https://milihwork.github.io/fit-quiz-2009/&lt;/a&gt; &lt;br&gt;
👉 &lt;strong&gt;GitHub Repository:&lt;/strong&gt; &lt;a href="https://github.com/milihwork/fit-quiz-2009" rel="noopener noreferrer"&gt;https://github.com/milihwork/fit-quiz-2009&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  🌍 About language support
&lt;/h2&gt;

&lt;p&gt;The quiz includes both Bosnian and English content.&lt;/p&gt;

&lt;p&gt;To make it more accessible, I translated the quiz questions and answers — while keeping the original structure intact.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Note:&lt;/strong&gt; Only quiz questions and answers are translated; menus and other on-screen text may remain in the original language.&lt;/p&gt;




&lt;h2&gt;
  
  
  💭 Final thought
&lt;/h2&gt;

&lt;p&gt;Technologies change.&lt;br&gt;
Frameworks come and go.&lt;/p&gt;

&lt;p&gt;But the underlying way we approach problems — breaking them down, modeling data, and structuring logic — remains largely the same.&lt;/p&gt;

&lt;p&gt;What changes is how much the tools hide.&lt;br&gt;
What stays is the thinking behind them.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Good software thinking doesn’t age. Tools do.&lt;/strong&gt;&lt;/p&gt;

</description>
      <category>flash</category>
      <category>nostalgia</category>
      <category>softwareengineering</category>
      <category>webdev</category>
    </item>
    <item>
      <title>Are We Still Engineers or Just Tool Operators?</title>
      <dc:creator>Mili Hunjic</dc:creator>
      <pubDate>Fri, 03 Apr 2026 07:29:00 +0000</pubDate>
      <link>https://dev.to/mili_hunjic_70cb2c5dd0e49/are-we-still-engineers-or-just-tool-operators-43m9</link>
      <guid>https://dev.to/mili_hunjic_70cb2c5dd0e49/are-we-still-engineers-or-just-tool-operators-43m9</guid>
      <description>&lt;p&gt;26 years ago, I was already building real software at 15 — no AI, no Stack Overflow, no modern web.&lt;/p&gt;




&lt;h2&gt;
  
  
  🧠 A Confession Most Developers Won’t Like
&lt;/h2&gt;

&lt;p&gt;Most developers today would struggle to build software without Google, AI, or Stack Overflow.&lt;/p&gt;

&lt;p&gt;That’s not an insult — &lt;strong&gt;it’s reality&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;I can say that because I learned to code before any of them existed.&lt;/p&gt;

&lt;p&gt;❌ No tutorials&lt;br&gt;
❌ No YouTube&lt;br&gt;
❌ No GitHub&lt;br&gt;
❌ No AI&lt;/p&gt;

&lt;p&gt;And somehow… I was still building real software at 15.&lt;/p&gt;

&lt;p&gt;And the strange part?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;I didn’t feel limited.&lt;/strong&gt;&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;So this isn’t a nostalgia post.&lt;/strong&gt; &lt;strong&gt;This is a question:&lt;/strong&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Are we becoming better engineers &lt;br&gt;
or just tool operators?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This isn’t about what I built.&lt;/p&gt;

&lt;p&gt;It’s about how building without tools forced me to think differently.&lt;/p&gt;

&lt;p&gt;A simplified view of how my development journey evolved over time:&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1m7gmo841tkqy5mqyt6k.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1m7gmo841tkqy5mqyt6k.png" alt="My dev timeline" width="800" height="436"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;From curiosity to real-world software — built before modern tools existed.&lt;/em&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  🖥️ Before VB6 — There Was QBasic
&lt;/h2&gt;

&lt;p&gt;Even before QBasic, there was GW-BASIC.&lt;/p&gt;

&lt;p&gt;And it wasn’t just for experiments.&lt;/p&gt;

&lt;p&gt;Here’s a real example.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl3tlmgwilnh6jp3wctxd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl3tlmgwilnh6jp3wctxd.png" alt="GWBasic financial software" width="626" height="393"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;GW-BASIC financial software from ~1989 — built by my uncle, still running today.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;And yes — it still works.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;All screenshots in this post are from my original apps, still running on my machine today — decades after the last line of code was written.&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;It powered real business:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Accounting systems&lt;/li&gt;
&lt;li&gt;Inventory tools&lt;/li&gt;
&lt;li&gt;Software that companies actually depended on&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Before Visual Basic, I was writing in QBasic:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;quizzes&lt;/li&gt;
&lt;li&gt;text-based logic&lt;/li&gt;
&lt;li&gt;small experiments&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;❌ No GUI&lt;br&gt;
❌ No frameworks&lt;/p&gt;

&lt;p&gt;Just:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Code + imagination&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That’s where the real foundation was built.&lt;/p&gt;

&lt;p&gt;And once you build it that way…&lt;br&gt;
you never forget it.&lt;/p&gt;




&lt;h2&gt;
  
  
  🎮 At 15, I Started Building Games
&lt;/h2&gt;

&lt;p&gt;One of my first projects was a simple XO game.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fldi6g8wue1oozh0bl90d.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fldi6g8wue1oozh0bl90d.png" alt="Simple XO game" width="388" height="300"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Nothing special.&lt;br&gt;
But it worked.&lt;/p&gt;

&lt;p&gt;Soon after, I built a full Poker game — distributed on CDs with the local magazine "INFO" (2001).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fryz3pps45ix63ih1h7zl.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fryz3pps45ix63ih1h7zl.png" alt="Quick Poker" width="800" height="720"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Still just a hobby.&lt;/p&gt;

&lt;p&gt;❌ No company&lt;br&gt;
❌ No team&lt;/p&gt;

&lt;p&gt;Just:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Curiosity + persistence&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;And somehow… &lt;/p&gt;

&lt;p&gt;That was enough.&lt;/p&gt;




&lt;h2&gt;
  
  
  🎨 No Designer. No Standards. Just Functionality
&lt;/h2&gt;

&lt;p&gt;❌ No UX thinking&lt;br&gt;
❌ No design systems&lt;br&gt;
❌ No Figma&lt;/p&gt;

&lt;p&gt;UI was whatever you could build.&lt;/p&gt;

&lt;p&gt;Basic forms.&lt;br&gt;
Standard controls.&lt;br&gt;
Some creativity.&lt;/p&gt;

&lt;p&gt;Not beautiful by today’s standards.&lt;/p&gt;

&lt;p&gt;But:&lt;/p&gt;

&lt;p&gt;✅ It worked&lt;br&gt;
✅ Users got value&lt;/p&gt;

&lt;p&gt;And in the end… &lt;br&gt;
that’s what actually mattered.&lt;/p&gt;


&lt;h2&gt;
  
  
  🧩 Games That Never Really “Ended”
&lt;/h2&gt;

&lt;p&gt;I built multiple small games.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Puzzle games (like Clix)&lt;/li&gt;
&lt;li&gt;Word and quiz-based games&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;A Hangman game used by university students&lt;/strong&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;My sister was an assistant at the faculty of arts.&lt;br&gt;
I built the game for her students.&lt;br&gt;
And they actually used it.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvqiogej3awrlbeqjowde.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvqiogej3awrlbeqjowde.png" alt="Hangman VB Game" width="718" height="650"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Clix (2007) — inspired by early puzzle games like Bubble Breaker.&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsm97fjj4hdvq2g0tecyv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fsm97fjj4hdvq2g0tecyv.png" alt="Clix screenshot" width="800" height="713"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I’ve open-sourced this game. 👉 &lt;a href="https://github.com/milihwork/clix-vb6/" rel="noopener noreferrer"&gt;GitHub link&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Yes — it still runs today.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Vockice (2005) — inspired by a slot machine game.&lt;/em&gt;&lt;/strong&gt;&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fokn3w1w3vbycizk04p1s.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fokn3w1w3vbycizk04p1s.PNG" alt="Vockice game screenshot" width="388" height="280"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Students would play Vockice whenever lectures got boring 🤣&lt;/p&gt;

&lt;p&gt;Years later, after learning OOP and C#,&lt;br&gt;
I rebuilt Vockice in .NET 2.0 😂&lt;/p&gt;




&lt;p&gt;In 2010, I also built a small Windows Phone 7 game called Magic Symbol — one of my first steps into mobile development.&lt;/p&gt;

&lt;p&gt;(Fun fact: the only screenshot I have today was reconstructed from an XAML file using AI.)&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8qbsmobav9v4k450hi9j.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8qbsmobav9v4k450hi9j.png" alt="MagicSymbol Screenshoot" width="268" height="400"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Real users. Real usage.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;❌ No analytics&lt;br&gt;
❌ No feedback tools&lt;/p&gt;

&lt;p&gt;Just:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;“People are using this.”&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;And somehow… &lt;/p&gt;

&lt;p&gt;that was enough.&lt;/p&gt;




&lt;h2&gt;
  
  
  🎓 Software People Actually Used
&lt;/h2&gt;

&lt;p&gt;At some point, it wasn’t just games anymore.&lt;/p&gt;

&lt;p&gt;By the time I was 18, I was building learning software: a system for practising driving licence exams.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Selection Screen&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fawbooclptxbhoescevay.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fawbooclptxbhoescevay.png" alt="Selection Screen" width="489" height="280"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Visual Basic 6 Form Designer View&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdb3s872iluwfqn3k13mt.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdb3s872iluwfqn3k13mt.png" alt="Visual Basic 6 Form Designer View" width="800" height="357"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;And this wasn’t just a side project.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;People actually used it.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;A lot of them.&lt;/p&gt;

&lt;p&gt;In fact…&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;An entire generation learned using this.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;❌ No marketing&lt;br&gt;
❌ No distribution platform&lt;/p&gt;

&lt;p&gt;Just software… finding its users.&lt;/p&gt;




&lt;h2&gt;
  
  
  💾 Software Was Not “Deployed” — It Was Delivered
&lt;/h2&gt;

&lt;p&gt;In 2003, I built a PC &amp;amp; PlayStation inventory system.&lt;/p&gt;

&lt;p&gt;As a hobby.&lt;/p&gt;

&lt;p&gt;But it included:&lt;/p&gt;

&lt;p&gt;✅ Custom licensing system&lt;br&gt;
✅ Hardware-based key&lt;br&gt;
✅ Unique installation per machine&lt;br&gt;
✅ Usage tracking&lt;/p&gt;

&lt;p&gt;❌ No libraries&lt;br&gt;
❌ No guides&lt;/p&gt;

&lt;p&gt;Just:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Figuring things out&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;And looking back…&lt;/p&gt;

&lt;p&gt;That was probably the real education.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fryj63uat5kbcvt1mcm9e.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fryj63uat5kbcvt1mcm9e.png" alt="PlayStation Evidencija" width="790" height="600"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpy0v2mtv2rvge9owkcel.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpy0v2mtv2rvge9owkcel.png" alt="PlayStation Evidencija Activation" width="400" height="400"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;p&gt;But then I started thinking bigger.&lt;/p&gt;

&lt;p&gt;After analysing the business and the market, I realised that a simple PlayStation tracking system wasn’t enough.&lt;/p&gt;

&lt;p&gt;So I decided to build something more ambitious.&lt;/p&gt;

&lt;p&gt;That same year, I started working on &lt;strong&gt;PC Counter&lt;/strong&gt; — a much more advanced system designed for managing PC gaming clubs.&lt;/p&gt;

&lt;p&gt;It was supposed to be a serious upgrade:&lt;br&gt;
more features, more control, more flexibility.&lt;/p&gt;

&lt;p&gt;But reality had other plans.&lt;/p&gt;

&lt;p&gt;Between limited time, gaps in knowledge, and life getting in the way — &lt;br&gt;
the project never became fully functional.&lt;/p&gt;

&lt;p&gt;And yet…&lt;br&gt;
This is where things get interesting.&lt;/p&gt;

&lt;p&gt;❌ There were no AI tools.&lt;br&gt;
❌ No code generators.&lt;br&gt;
❌ No shortcuts.&lt;/p&gt;

&lt;p&gt;Every step forward came from trial and error.&lt;br&gt;
From thinking.&lt;br&gt;
From understanding.&lt;/p&gt;

&lt;p&gt;Even though the project failed.&lt;br&gt;
The process didn’t.&lt;/p&gt;

&lt;p&gt;And that’s something we’re slowly losing today.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Login and main screen:&lt;/em&gt;&lt;br&gt;
&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3s915v0rdvzhiedktlmi.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3s915v0rdvzhiedktlmi.png" alt="PCCounter login" width="649" height="426"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3bli2c78nhlgk82ywpbj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3bli2c78nhlgk82ywpbj.png" alt="PcCounter Main Screen" width="800" height="576"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  💾 Visual Basic 6 Wasn’t “Bad” — It Was Reality
&lt;/h2&gt;

&lt;p&gt;Today, people love to laugh at old tech:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“VB6? That’s ancient.”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;But it powered real businesses.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Accounting systems&lt;/li&gt;
&lt;li&gt;Inventory tools&lt;/li&gt;
&lt;li&gt;Internal enterprise apps&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This wasn’t toy code.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;This was production software.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;And not just for experiments.&lt;/p&gt;

&lt;p&gt;For years.&lt;/p&gt;




&lt;h2&gt;
  
  
  🎬 Small Tools That Solved Real Problems
&lt;/h2&gt;

&lt;p&gt;Back when movies came on CDs, I built a simple autoplay app.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;(1)&lt;/strong&gt; Insert the CD → &lt;strong&gt;(2)&lt;/strong&gt; click PLAY → the film starts.&lt;/p&gt;

&lt;p&gt;That was it.&lt;/p&gt;

&lt;p&gt;One click.&lt;br&gt;
Problem solved.&lt;/p&gt;

&lt;p&gt;And that’s what most software really is:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Solving one small problem… really well.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fooh0dm2rwp22rptpj9vj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fooh0dm2rwp22rptpj9vj.png" alt="DivX AutoPlay" width="433" height="400"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  🧩 No OOP. No Architecture. Still Working.
&lt;/h2&gt;

&lt;p&gt;❌ No design patterns&lt;br&gt;
❌ No clean architecture&lt;br&gt;
❌ No SOLID&lt;/p&gt;

&lt;p&gt;Just:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;forms&lt;/li&gt;
&lt;li&gt;events&lt;/li&gt;
&lt;li&gt;raw logic&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;My code?&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;spaghetti logic&lt;/li&gt;
&lt;li&gt;global variables everywhere&lt;/li&gt;
&lt;li&gt;zero documentation&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If I saw it today...&lt;/p&gt;

&lt;p&gt;I’d probably panic.&lt;/p&gt;

&lt;p&gt;But…&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;It worked.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;And somehow…&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;It kept working.&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  🌐 The Web Wasn’t What It Is Today
&lt;/h2&gt;

&lt;p&gt;Today, everything is web.&lt;/p&gt;

&lt;p&gt;Back then?&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;static&lt;/li&gt;
&lt;li&gt;slow&lt;/li&gt;
&lt;li&gt;limited&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;“Web apps” barely existed.&lt;/p&gt;

&lt;p&gt;If you wanted real functionality…&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;You built desktop apps.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;That was the only option.&lt;/p&gt;




&lt;h2&gt;
  
  
  🤖 I Built an “AI Bot” Before AI Was a Thing
&lt;/h2&gt;

&lt;p&gt;In 2008, I built a bot for a Facebook game (&lt;em&gt;Word Challenge&lt;/em&gt;).&lt;/p&gt;

&lt;p&gt;It:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;detected letters from the screen (custom OCR)
&lt;/li&gt;
&lt;li&gt;searched a word database
&lt;/li&gt;
&lt;li&gt;generated combinations
&lt;/li&gt;
&lt;li&gt;played automatically
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;❌ No AI&lt;br&gt;
❌ No ML&lt;br&gt;
❌ No libraries&lt;/p&gt;

&lt;p&gt;Just:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Logic + persistence&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;At the time, I was already learning .NET, but I needed something fast and practical.&lt;/p&gt;

&lt;p&gt;So I went with what I knew best: Visual Basic.&lt;/p&gt;

&lt;p&gt;The OCR part was the hardest.&lt;/p&gt;

&lt;p&gt;The first version took nearly 2 minutes to recognise just 5 letters — completely unusable.&lt;/p&gt;

&lt;p&gt;So I optimised it.&lt;/p&gt;

&lt;p&gt;And eventually…&lt;/p&gt;

&lt;p&gt;it became fast enough to actually play the game.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo3au2e7czmva5xhvcl22.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo3au2e7czmva5xhvcl22.png" alt="Word Challenge Bot" width="741" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;And no — I didn’t call it AI back then.&lt;br&gt;
It was just… solving the problem.&lt;/p&gt;
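&lt;p&gt;The dictionary half of that bot can be sketched in a few lines (the tiny word list stands in for the original database, and the OCR step is out of scope here):&lt;/p&gt;

```python
# Sketch of the bot's word-search core: permute the detected letters and
# keep the permutations that appear in a word list. The tiny WORDS set is
# a stand-in for the original database.
from itertools import permutations

WORDS = {"tea", "eat", "ate", "net", "ten", "tan", "ant", "neat"}

def playable_words(letters: str, min_len: int = 3) -> list[str]:
    found = set()
    for length in range(min_len, len(letters) + 1):
        for combo in permutations(letters, length):
            word = "".join(combo)
            if word in WORDS:
                found.add(word)
    # longer words first (they presumably scored more)
    return sorted(found, key=len, reverse=True)

words = playable_words("tane")
assert words[0] == "neat"   # the only 4-letter word ranks first
```

&lt;p&gt;Brute-force permutations against a hash set was fast enough for a handful of letters; the OCR was the real bottleneck.&lt;/p&gt;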

&lt;p&gt;Most of the projects I shared here were hobby experiments — mainly built in VB6.&lt;/p&gt;

&lt;p&gt;But after 2008, my journey naturally shifted toward C# and the .NET ecosystem, where I continued growing professionally.&lt;/p&gt;

&lt;p&gt;VB6 was my playground.&lt;br&gt;
C# became my real battlefield.&lt;/p&gt;




&lt;h2&gt;
  
  
  🧱 The World Without Tools
&lt;/h2&gt;

&lt;p&gt;❌ No AI&lt;br&gt;
❌ No ChatGPT&lt;br&gt;
❌ No Copilot&lt;br&gt;
❌ No usable Stack Overflow&lt;br&gt;
❌ No YouTube&lt;br&gt;
❌ No GitHub&lt;br&gt;
❌ No package managers&lt;br&gt;
❌ No cloud&lt;/p&gt;

&lt;p&gt;What we had:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Official documentation
&lt;/li&gt;
&lt;li&gt;Trial and error
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;And somehow… &lt;strong&gt;that was enough&lt;/strong&gt;.&lt;/p&gt;




&lt;h2&gt;
  
  
  👤 You Were the Entire Team
&lt;/h2&gt;

&lt;p&gt;❌ No QA&lt;br&gt;
❌ No designer&lt;br&gt;
❌ No product manager&lt;br&gt;
❌ No DevOps&lt;/p&gt;

&lt;p&gt;Just you.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;You were everything.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  🚀 Production Didn’t Exist
&lt;/h2&gt;

&lt;p&gt;Production = 💿 CD&lt;/p&gt;

&lt;p&gt;❌ No updates&lt;br&gt;
❌ No patches&lt;br&gt;
❌ No monitoring&lt;br&gt;
❌ No logs&lt;/p&gt;

&lt;p&gt;If there was a bug…&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;It stayed there.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;❌ No hotfixes&lt;br&gt;
❌ No second chances&lt;/p&gt;

&lt;p&gt;And interestingly…&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;There were fewer bugs than today.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  🧠 And Yet… Old Software Was Surprisingly Good
&lt;/h2&gt;

&lt;p&gt;Games were built with:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;limited resources
&lt;/li&gt;
&lt;li&gt;strict memory constraints
&lt;/li&gt;
&lt;li&gt;no engines
&lt;/li&gt;
&lt;li&gt;no massive assets
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;And still...&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Incredibly playable. Addictive.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Because of one thing:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Smart design&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Not more features.&lt;br&gt;
Not better tools.&lt;/p&gt;

&lt;p&gt;Just better thinking.&lt;/p&gt;




&lt;h2&gt;
  
  
  ⚡ Old Developer vs AI Developer
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Then:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You solved everything yourself&lt;/li&gt;
&lt;li&gt;You understood every line&lt;/li&gt;
&lt;li&gt;You built from scratch&lt;/li&gt;
&lt;li&gt;There was no fallback&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Now:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You move faster&lt;/li&gt;
&lt;li&gt;You integrate tools&lt;/li&gt;
&lt;li&gt;You rely on AI&lt;/li&gt;
&lt;li&gt;You don’t always understand everything&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Neither is wrong.&lt;/p&gt;

&lt;p&gt;But they are not the same.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;And that difference… is bigger than it looks.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj1173m1anvugxwu9nlxj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj1173m1anvugxwu9nlxj.png" alt="Infografic" width="800" height="1200"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  🔥 The Slightly Uncomfortable Truth
&lt;/h2&gt;

&lt;p&gt;Modern developers are more powerful than ever.&lt;/p&gt;

&lt;p&gt;We build faster.&lt;br&gt;
We ship more.&lt;br&gt;
We have better tools than any generation before us.&lt;/p&gt;

&lt;p&gt;But here’s the uncomfortable part:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F341vq13030s47xr17wyi.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F341vq13030s47xr17wyi.png" alt="Before vs Now infographic" width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Then vs now — not better or worse, just different.&lt;/em&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Many of us are becoming tool operators more than problem solvers.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Not because we’re worse.&lt;/p&gt;

&lt;p&gt;Because:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;We don’t need to struggle anymore.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;We don’t always understand the system.&lt;/p&gt;

&lt;p&gt;We understand the interface.&lt;/p&gt;

&lt;p&gt;This isn’t about being better or worse — it’s about how our role is changing.&lt;/p&gt;




&lt;p&gt;And now, we’re entering a new phase.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Prompt-driven development&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;




&lt;p&gt;Instead of writing logic…&lt;br&gt;
we describe intent.&lt;/p&gt;

&lt;p&gt;Instead of building step by step…&lt;br&gt;
we guide systems to build for us.&lt;/p&gt;




&lt;p&gt;This is incredibly powerful.&lt;/p&gt;

&lt;p&gt;But it changes the role of a developer:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;From someone who builds&lt;br&gt;
to someone who instructs&lt;/p&gt;
&lt;/blockquote&gt;




&lt;p&gt;The real question is no longer:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“Can you build it?”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;But:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;“Do you understand what was built?”&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;




&lt;p&gt;Because prompting without understanding…&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;is just a faster way to build things you can’t explain.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  🧠 What That Time Taught Me
&lt;/h2&gt;

&lt;p&gt;I wasn’t just building software.&lt;/p&gt;

&lt;p&gt;I was learning:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;how systems really work&lt;/li&gt;
&lt;li&gt;how to break problems down&lt;/li&gt;
&lt;li&gt;how to own the entire solution&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;There was no safety net.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;If you didn’t understand it — you were stuck.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;And that changes how you think.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Not just as a developer… but as a problem solver.&lt;/strong&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  🤖 A Quick Note on AI
&lt;/h2&gt;

&lt;p&gt;This is not an anti-AI post.&lt;/p&gt;

&lt;p&gt;AI is one of the most powerful tools we’ve ever had.&lt;/p&gt;

&lt;p&gt;✅ It accelerates learning&lt;br&gt;
✅ It removes friction&lt;br&gt;
✅ It makes developers incredibly productive&lt;/p&gt;

&lt;p&gt;But:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Tools should amplify understanding — not replace it.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The risk is not using AI.&lt;/p&gt;

&lt;p&gt;The risk is:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Using it without thinking.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;And that’s where the problem begins.&lt;/p&gt;




&lt;h2&gt;
  
  
  🔚 Final Thought
&lt;/h2&gt;

&lt;p&gt;Today, many people laugh at old tech.&lt;/p&gt;

&lt;p&gt;But that tech:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;powered real systems&lt;/li&gt;
&lt;li&gt;solved real problems&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;And most importantly:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;It forced you to understand what you were building.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Not just how to use it.&lt;/p&gt;

&lt;p&gt;But how it actually worked.&lt;/p&gt;

&lt;p&gt;Today, more and more of what we build… feels like a black box.&lt;/p&gt;

&lt;p&gt;It works.&lt;/p&gt;

&lt;p&gt;But we don’t always understand why.&lt;/p&gt;




&lt;h2&gt;
  
  
  💬 Let’s Be Honest
&lt;/h2&gt;

&lt;p&gt;Could you build software today?&lt;/p&gt;

&lt;p&gt;❌ without AI&lt;br&gt;
❌ without Google (as we know it today)&lt;/p&gt;

&lt;p&gt;Or would you get stuck?&lt;/p&gt;

&lt;p&gt;Maybe the real question isn’t whether tools are good or bad.&lt;/p&gt;

&lt;p&gt;Maybe it’s this:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;What happens to our thinking when we no longer need to struggle?&lt;/p&gt;
&lt;/blockquote&gt;




&lt;h2&gt;
  
  
  🤔 So Let Me Ask Again
&lt;/h2&gt;

&lt;p&gt;At the beginning, I asked:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Are we becoming better engineers — or just tool operators?&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Now that you’ve seen both worlds…&lt;/p&gt;

&lt;p&gt;Be honest.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Are we still engineers — or just tool operators?&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

</description>
      <category>ai</category>
      <category>programming</category>
      <category>history</category>
      <category>motivation</category>
    </item>
    <item>
      <title>How I Built a Local-First AI Stack for Document Q&amp;A Without OpenAI</title>
      <dc:creator>Mili Hunjic</dc:creator>
      <pubDate>Mon, 16 Mar 2026 09:37:08 +0000</pubDate>
      <link>https://dev.to/mili_hunjic_70cb2c5dd0e49/how-i-built-a-local-first-ai-stack-for-document-qa-without-openai-364</link>
      <guid>https://dev.to/mili_hunjic_70cb2c5dd0e49/how-i-built-a-local-first-ai-stack-for-document-qa-without-openai-364</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fobbvzdfqt8aawvwuvy3t.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fobbvzdfqt8aawvwuvy3t.png" alt="Article Cover Image" width="800" height="876"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  How I Built a Local-First AI Stack for Document Q&amp;amp;A Without OpenAI 📚🤖
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;A multi-service monorepo with &lt;code&gt;llama.cpp&lt;/code&gt;, Qdrant, Python &lt;code&gt;FastAPI&lt;/code&gt; services, React, Node and MCP support for AI agents.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;You’ve probably seen buzzwords like RAG, vector database, embeddings, MCP, and local LLMs everywhere. This article is meant to make those terms feel concrete by showing how they fit together in a real project.&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  What You’ll See in This Project 👀
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;Local-first RAG architecture&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;PDF document ingestion and chunking pipeline&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Embedding generation&lt;/strong&gt; using &lt;code&gt;sentence-transformers&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Vector search&lt;/strong&gt; with &lt;code&gt;Qdrant&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Local LLM inference&lt;/strong&gt; with &lt;code&gt;llama.cpp&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Python backend microservices&lt;/strong&gt; built with &lt;code&gt;FastAPI&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;React frontend&lt;/strong&gt; for document upload and chat&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Optional ML layer&lt;/strong&gt; for security and query analysis&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;MCP integration&lt;/strong&gt; so AI agents can use the system as tools&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Table of contents 🧭
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;1. Introduction&lt;/li&gt;
&lt;li&gt;2. What Is a Local AI Stack&lt;/li&gt;
&lt;li&gt;3. Why Build AI Without OpenAI&lt;/li&gt;
&lt;li&gt;4. Use Cases for Local AI&lt;/li&gt;
&lt;li&gt;5. Key Concepts Behind the System&lt;/li&gt;
&lt;li&gt;6. High Level Architecture&lt;/li&gt;
&lt;li&gt;7. Technology Stack&lt;/li&gt;
&lt;li&gt;8. System Components Explained&lt;/li&gt;
&lt;li&gt;9. Document Ingestion Pipeline&lt;/li&gt;
&lt;li&gt;10. Example Document Ingestion Lifecycle&lt;/li&gt;
&lt;li&gt;11. Query Processing Flow&lt;/li&gt;
&lt;li&gt;12. Example Request Lifecycle&lt;/li&gt;
&lt;li&gt;13. Improving Retrieval Quality&lt;/li&gt;
&lt;li&gt;14. Security Considerations&lt;/li&gt;
&lt;li&gt;15. Performance Optimization&lt;/li&gt;
&lt;li&gt;16. Advantages And Pros of a Local AI Stack&lt;/li&gt;
&lt;li&gt;17. Limitations And Cons and Tradeoffs&lt;/li&gt;
&lt;li&gt;18. Future Improvements&lt;/li&gt;
&lt;li&gt;19. Refactoring Path: LangChain, LlamaIndex, or Bedrock&lt;/li&gt;
&lt;li&gt;20. Conclusion&lt;/li&gt;
&lt;li&gt;21. Demoing This Repo&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Most AI tutorials still follow the same recipe: call OpenAI, print the response, and label it an AI application.&lt;/p&gt;

&lt;p&gt;That is fine for a quick prototype, but it becomes limiting fast. You inherit API costs, external latency, privacy concerns, and a system design that often relies on a single provider sitting in the middle of everything.&lt;/p&gt;

&lt;p&gt;I wanted to build something closer to a real product: a local-first AI system that can ingest documents, search them semantically, generate grounded answers, and stay flexible enough to support both humans and AI agents.&lt;/p&gt;

&lt;p&gt;That is what &lt;code&gt;document_rag&lt;/code&gt; is. It is a local-first Retrieval-Augmented Generation (RAG) platform for uploading documents, retrieving relevant context, and answering questions with sources. By default, it runs locally without requiring OpenAI, and it is structured as a multi-service monorepo with an MCP server so tools like Cursor or Claude Desktop can also use the same platform.&lt;/p&gt;

&lt;p&gt;The full source code is available on &lt;a href="https://github.com/milihwork/document_rag" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;In this article, I will walk through the architecture, the tech stack, the tradeoffs, and why building AI locally is worth considering in the first place.&lt;/p&gt;

&lt;h2&gt;
  
  
  1. Introduction 🌱
&lt;/h2&gt;

&lt;p&gt;AI-powered applications are quickly moving from novelty to default product features. Search, support assistants, internal copilots, documentation chat, and workflow automation are all being rebuilt around language models.&lt;/p&gt;

&lt;p&gt;The easiest way to build these systems is to rely entirely on hosted providers such as OpenAI. That works well for prototypes, but many teams eventually hit the same questions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;How much will this cost at scale?&lt;/li&gt;
&lt;li&gt;Can we safely send internal documents to an external API?&lt;/li&gt;
&lt;li&gt;What happens if latency spikes or pricing changes?&lt;/li&gt;
&lt;li&gt;What if the application needs to work inside a private network?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Running AI locally is one answer to those concerns.&lt;/p&gt;

&lt;p&gt;In this project, I built a local-first RAG system that ingests PDFs and text, chunks them, turns them into embeddings, stores them in a vector database, retrieves relevant context for a question, and then generates an answer with a local LLM. It lives in a monorepo that contains the frontend, backend services, shared modules, and an MCP server for agent access. The article shows how that stack fits together and why this architecture is useful beyond a demo.&lt;/p&gt;

&lt;h2&gt;
  
  
  2. What Is a Local AI Stack 🧱
&lt;/h2&gt;

&lt;p&gt;A local AI stack is a system where the critical AI components run on infrastructure you control instead of depending entirely on an external API provider.&lt;/p&gt;

&lt;p&gt;In practice, that usually means:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A local or self-hosted LLM runtime&lt;/li&gt;
&lt;li&gt;A local embedding model&lt;/li&gt;
&lt;li&gt;A vector database for retrieval&lt;/li&gt;
&lt;li&gt;Backend services that orchestrate ingestion and question answering&lt;/li&gt;
&lt;li&gt;A UI or API layer for users and other tools&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The biggest difference from cloud AI is where inference and data processing happen.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;In a cloud-first setup, your app sends prompts and often context to a remote provider.&lt;/li&gt;
&lt;li&gt;In a local-first setup, your app keeps the pipeline close to the data and only uses external providers if you intentionally enable them.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In &lt;code&gt;document_rag&lt;/code&gt;, the default local stack is:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;llama.cpp&lt;/code&gt; for LLM inference&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;BAAI/bge-small-en-v1.5&lt;/code&gt; for embeddings&lt;/li&gt;
&lt;li&gt;Qdrant for vector storage&lt;/li&gt;
&lt;li&gt;FastAPI services for ingestion, embedding, retrieval, RAG orchestration, and optional ML analysis&lt;/li&gt;
&lt;li&gt;Express Gateway as the entry point for the frontend&lt;/li&gt;
&lt;li&gt;React frontend for upload and chat&lt;/li&gt;
&lt;li&gt;MCP server so AI agents can search, ask questions, and ingest content through the same platform&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;From a repository-design perspective, this is also a monorepo: multiple related applications and services live in one Git repository, share documentation and infrastructure, and work together as one system.&lt;/p&gt;

&lt;h2&gt;
  
  
  3. Why Build AI Without OpenAI 💸
&lt;/h2&gt;

&lt;p&gt;There is nothing wrong with OpenAI or other cloud providers. They are excellent tools. But there are solid engineering reasons to build a system that does not require them by default.&lt;/p&gt;

&lt;h3&gt;
  
  
  Cost considerations
&lt;/h3&gt;

&lt;p&gt;Hosted APIs are easy to start with, but repeated embedding calls, chat completions, and large context windows can become expensive. A local stack makes costs more predictable because the main expense is infrastructure and hardware, not token billing on every request.&lt;/p&gt;

&lt;p&gt;For example, AWS Bedrock pricing depends on the model you choose and can add up quickly at scale. As one reference point, the AWS Bedrock pricing page lists Claude 3.5 Sonnet Extended Access at roughly &lt;code&gt;$6&lt;/code&gt; per million input tokens and &lt;code&gt;$30&lt;/code&gt; per million output tokens, with batch pricing lower than that. That may be perfectly reasonable for production workloads, but the cost becomes usage-driven very quickly once you have frequent queries, longer contexts, or multiple users.&lt;/p&gt;

&lt;p&gt;With a local &lt;code&gt;llama.cpp&lt;/code&gt; setup, the cost model is different. There is no per-token API bill for each request. Instead, you are paying for the machine, electricity, storage, and the operational overhead of running the model. If you are testing on hardware you already own, the marginal cost can feel close to zero. But if you need a stronger dedicated GPU box, the fixed monthly cost can be significant even before traffic grows.&lt;/p&gt;
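&lt;p&gt;To make that concrete, here is a back-of-the-envelope comparison using the token prices quoted above. The query volume and per-query token counts are illustrative assumptions, not measurements from this project:&lt;/p&gt;

```python
# Hosted per-token pricing quoted above: ~$6 / 1M input, ~$30 / 1M output tokens.
HOSTED_IN_PER_M = 6.0
HOSTED_OUT_PER_M = 30.0

def hosted_monthly_cost(queries_per_day, in_tokens=3000, out_tokens=500, days=30):
    """Token-billed monthly cost for a RAG workload.

    in_tokens covers the prompt plus retrieved context; out_tokens the answer.
    """
    total_in = queries_per_day * in_tokens * days
    total_out = queries_per_day * out_tokens * days
    return total_in / 1e6 * HOSTED_IN_PER_M + total_out / 1e6 * HOSTED_OUT_PER_M

# 500 queries/day -> 45M input + 7.5M output tokens per month -> $270 + $225
print(hosted_monthly_cost(500))  # 495.0
```

&lt;p&gt;Against that, a local box is a fixed cost: whether roughly $495 a month buys more or less than a dedicated machine depends entirely on your traffic, which is exactly why the two cost models feel so different.&lt;/p&gt;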

&lt;h3&gt;
  
  
  Data privacy and security
&lt;/h3&gt;

&lt;p&gt;Many RAG systems work with internal PDFs, team docs, policies, contracts, or private knowledge bases. Keeping the pipeline local reduces exposure and makes the system easier to justify in privacy-sensitive environments.&lt;/p&gt;

&lt;h3&gt;
  
  
  Latency improvements
&lt;/h3&gt;

&lt;p&gt;When embeddings, search, and inference are close to the application, you remove some network overhead. Local inference is not always faster in absolute terms, but it is often more predictable.&lt;/p&gt;

&lt;h3&gt;
  
  
  Vendor lock-in concerns
&lt;/h3&gt;

&lt;p&gt;If the whole product depends on one hosted provider, switching later can be painful. This project avoids that by using config-driven backends. The default path is local, but optional OpenAI and other future backends fit behind stable service contracts.&lt;/p&gt;

&lt;h3&gt;
  
  
  Offline and internal systems
&lt;/h3&gt;

&lt;p&gt;Some tools need to run on internal networks, development laptops, or restricted environments. A local-first design makes those scenarios practical.&lt;/p&gt;

&lt;h2&gt;
  
  
  4. Use Cases for Local AI 🎯
&lt;/h2&gt;

&lt;p&gt;A local AI stack is especially useful for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Internal document search&lt;/li&gt;
&lt;li&gt;Engineering or product knowledge bases&lt;/li&gt;
&lt;li&gt;Company documentation assistants&lt;/li&gt;
&lt;li&gt;Secure enterprise environments&lt;/li&gt;
&lt;li&gt;Teams in regulated industries that want stronger control over data flow&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This repository focuses on document question answering, but the same architecture can support internal wikis, policy assistants, onboarding tools, legal document review support, and research archives.&lt;/p&gt;

&lt;h2&gt;
  
  
  5. Key Concepts Behind the System 🧠
&lt;/h2&gt;

&lt;p&gt;Before looking at the architecture, it helps to define the core RAG building blocks.&lt;/p&gt;

&lt;h3&gt;
  
  
  Large Language Models
&lt;/h3&gt;

&lt;p&gt;LLM stands for &lt;strong&gt;Large Language Model&lt;/strong&gt;. It is the component responsible for generating the final answer from the user question plus the retrieved context. In this project, the default runtime is &lt;code&gt;llama.cpp&lt;/code&gt;, which serves a local model such as Mistral 7B in GGUF format.&lt;/p&gt;
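&lt;p&gt;As a rough sketch of what talking to that runtime looks like: the server built into &lt;code&gt;llama.cpp&lt;/code&gt; exposes a &lt;code&gt;/completion&lt;/code&gt; endpoint (port 8080 by default). The model filename and parameter values below are illustrative assumptions, not taken from the repo:&lt;/p&gt;

```python
import json
import urllib.request

def build_completion_request(prompt, n_predict=256, temperature=0.2):
    """Request body for llama.cpp's built-in server /completion endpoint."""
    return {"prompt": prompt, "n_predict": n_predict, "temperature": temperature}

def ask_local_llm(prompt, base_url="http://localhost:8080"):
    # Assumes a server started with something like:
    #   llama-server -m mistral-7b-instruct.Q4_K_M.gguf
    body = json.dumps(build_completion_request(prompt)).encode()
    req = urllib.request.Request(
        base_url + "/completion", data=body,
        headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["content"]  # the generated text
```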

&lt;h3&gt;
  
  
  Embeddings
&lt;/h3&gt;

&lt;p&gt;Embeddings convert text into vectors so semantically similar content can be matched even if the wording is different. This repo uses &lt;code&gt;BAAI/bge-small-en-v1.5&lt;/code&gt; by default.&lt;/p&gt;
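&lt;p&gt;With &lt;code&gt;sentence-transformers&lt;/code&gt;, producing those vectors is essentially &lt;code&gt;SentenceTransformer("BAAI/bge-small-en-v1.5").encode(texts)&lt;/code&gt;. What "semantically similar" means afterwards is cosine similarity between vectors, which is worth seeing in plain Python. The tiny 3-dimensional vectors below are toys; the real model outputs 384 dimensions:&lt;/p&gt;

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1.0 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

doc = [0.9, 0.1, 0.0]          # pretend embedding of a document chunk
close_query = [0.8, 0.2, 0.1]  # similar meaning, different wording
far_query = [0.0, 0.1, 0.9]    # unrelated meaning

# close_query scores higher against doc than far_query does
print(cosine_similarity(doc, close_query), cosine_similarity(doc, far_query))
```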

&lt;h3&gt;
  
  
  Vector search
&lt;/h3&gt;

&lt;p&gt;Vector search lets the system retrieve the most relevant document chunks for a user query. Qdrant is used as the default vector store.&lt;/p&gt;
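&lt;p&gt;Under the hood this is a single call against Qdrant's REST search endpoint. A stdlib-only sketch follows; the collection name and port are assumptions, and in the repo the Retrieval service sits between callers and Qdrant rather than clients talking to it directly:&lt;/p&gt;

```python
import json
import urllib.request

def build_search_body(query_vector, limit=5):
    """Body for Qdrant's POST /collections/{name}/points/search."""
    return {"vector": query_vector, "limit": limit, "with_payload": True}

def search_chunks(query_vector, collection="documents",
                  base_url="http://localhost:6333", limit=5):
    url = base_url + "/collections/" + collection + "/points/search"
    body = json.dumps(build_search_body(query_vector, limit)).encode()
    req = urllib.request.Request(url, data=body,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        # "result" holds scored points, best match first
        return json.loads(resp.read())["result"]
```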

&lt;h3&gt;
  
  
  Retrieval-Augmented Generation
&lt;/h3&gt;

&lt;p&gt;RAG stands for &lt;strong&gt;Retrieval-Augmented Generation&lt;/strong&gt;. It combines document retrieval with language generation, so the model answers using relevant source material instead of relying only on its pretrained knowledge.&lt;/p&gt;

&lt;p&gt;In practice, RAG combines retrieval and generation:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Turn the user question into an embedding.&lt;/li&gt;
&lt;li&gt;Search the vector database for relevant chunks.&lt;/li&gt;
&lt;li&gt;Pass those chunks as context to the LLM.&lt;/li&gt;
&lt;li&gt;Generate an answer grounded in the retrieved content.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;That grounding is what makes RAG more useful for document-based assistants than raw prompting alone.&lt;/p&gt;
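&lt;p&gt;Steps 3 and 4 above are mostly prompt assembly plus a model call. A minimal sketch of how retrieved chunks might be stitched into a grounded prompt; the instruction wording is illustrative, not the repo's actual template:&lt;/p&gt;

```python
def build_rag_prompt(question, chunks):
    """Put numbered source chunks in front of the question so the model
    can ground its answer and cite sources by number."""
    context = "\n\n".join(
        "[{}] {}".format(i + 1, chunk) for i, chunk in enumerate(chunks))
    return ("Answer the question using only the context below. "
            "Cite sources by number.\n\n"
            "Context:\n" + context + "\n\n"
            "Question: " + question + "\nAnswer:")

prompt = build_rag_prompt(
    "What is the notice period?",
    ["The notice period is 30 days.", "Salary is paid monthly."])
```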

&lt;h2&gt;
  
  
  6. High Level Architecture 🏗️
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fozq2r6jn92r55en7y72e.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fozq2r6jn92r55en7y72e.png" alt="High-Level Architecture Image" width="800" height="361"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;At a high level, the system is split into focused services instead of a single large app:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Frontend for upload and chat&lt;/li&gt;
&lt;li&gt;Gateway for a unified API entry point&lt;/li&gt;
&lt;li&gt;Ingestion service for parsing and chunking documents&lt;/li&gt;
&lt;li&gt;Embedding service for converting text to vectors&lt;/li&gt;
&lt;li&gt;Retrieval service for vector storage and search&lt;/li&gt;
&lt;li&gt;RAG service for orchestration and answer generation&lt;/li&gt;
&lt;li&gt;Optional ML service for injection detection, query classification, and retrieval scoring&lt;/li&gt;
&lt;li&gt;MCP server so AI agents can use the same backend as tools&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This layout is one reason I describe the project as a monorepo. Instead of separating everything into different repositories, the frontend, backend services, shared modules, and MCP integration are versioned together. For a system like this, it makes local development, documentation, and architecture changes easier to manage.&lt;/p&gt;

&lt;p&gt;The main data flow looks like this:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;A user uploads a document through the frontend.&lt;/li&gt;
&lt;li&gt;The Gateway forwards the request to the Ingestion service.&lt;/li&gt;
&lt;li&gt;Ingestion parses the file, splits it into chunks, and asks the Embedding service for vectors.&lt;/li&gt;
&lt;li&gt;The Retrieval service stores those vectors in Qdrant.&lt;/li&gt;
&lt;li&gt;Later, when the user asks a question, the RAG service embeds the query, retrieves relevant chunks, optionally reranks them, and sends grounded context to the LLM.&lt;/li&gt;
&lt;li&gt;The answer comes back with sources.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Here is the high-level architecture:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvbikd1689ffdrw97a49t.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fvbikd1689ffdrw97a49t.png" alt="High Level diagram" width="800" height="574"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;There is also a second access path besides the browser UI: the MCP server exposes the system to AI agents over the Model Context Protocol. That means the same platform can power both a human-facing frontend and agent workflows such as &lt;code&gt;search_documents&lt;/code&gt;, &lt;code&gt;ask_rag&lt;/code&gt;, and &lt;code&gt;ingest_document&lt;/code&gt;.&lt;/p&gt;
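&lt;p&gt;In the repo those tools are registered with FastMCP; stripped of the registration layer, each tool body boils down to a small HTTP call into the same backend. A stdlib-only sketch, where the gateway URL and endpoint paths are assumptions for illustration:&lt;/p&gt;

```python
import json
import urllib.request

GATEWAY = "http://localhost:3000"  # assumed gateway address

def search_payload(query, limit=5):
    """Body an agent-facing search tool would send."""
    return {"query": query, "limit": limit}

def _post(path, payload):
    req = urllib.request.Request(GATEWAY + path,
                                 data=json.dumps(payload).encode(),
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())

# Tool bodies mirroring the MCP tool names above:
def search_documents(query, limit=5):
    return _post("/search", search_payload(query, limit))

def ask_rag(question):
    return _post("/ask", {"question": question})
```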

&lt;p&gt;That separation makes the system easier to reason about, replace, and extend.&lt;/p&gt;

&lt;h2&gt;
  
  
  7. Technology Stack 🧰
&lt;/h2&gt;

&lt;p&gt;Here is the concrete stack used in this project.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Layer&lt;/th&gt;
&lt;th&gt;Technology&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;LLM runtime&lt;/td&gt;
&lt;td&gt;&lt;code&gt;llama.cpp&lt;/code&gt; (C++ runtime)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Default LLM model&lt;/td&gt;
&lt;td&gt;&lt;code&gt;Mistral 7B&lt;/code&gt; (GGUF model)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Embeddings&lt;/td&gt;
&lt;td&gt;&lt;code&gt;BAAI/bge-small-en-v1.5&lt;/code&gt; (Python / sentence-transformers)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Vector database&lt;/td&gt;
&lt;td&gt;&lt;code&gt;Qdrant&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;API gateway&lt;/td&gt;
&lt;td&gt;&lt;code&gt;Express + TypeScript&lt;/code&gt; (Node.js)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Backend services&lt;/td&gt;
&lt;td&gt;&lt;code&gt;FastAPI&lt;/code&gt; (Python)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Frontend&lt;/td&gt;
&lt;td&gt;&lt;code&gt;React + Vite + TypeScript&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Optional agent interface&lt;/td&gt;
&lt;td&gt;&lt;code&gt;MCP server via FastMCP&lt;/code&gt; (Python)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Optional alternative backends&lt;/td&gt;
&lt;td&gt;&lt;code&gt;OpenAI, pgvector, Bedrock&lt;/code&gt; (config-driven)&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;One thing I like about this setup is that it stays practical. The default stack is local-first, but the interfaces are designed so that changing a backend does not force a full rewrite of the product.&lt;/p&gt;

&lt;h2&gt;
  
  
  8. System Components Explained 🧩
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Gateway / API layer
&lt;/h3&gt;

&lt;p&gt;The Gateway is the public entry point used by the frontend. It keeps the UI simple and hides the internal service boundaries.&lt;/p&gt;

&lt;h3&gt;
  
  
  Embedding service
&lt;/h3&gt;

&lt;p&gt;This service owns text-to-vector conversion. Other services do not care which embedding provider is behind it as long as the &lt;code&gt;/embed&lt;/code&gt; contract stays stable.&lt;/p&gt;
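&lt;p&gt;That stability is the point of the contract. Here is a sketch of what the request and response shapes might look like, with a stub backend standing in for the real model; the field names are illustrative, not copied from the repo:&lt;/p&gt;

```python
from dataclasses import dataclass
from typing import List

@dataclass
class EmbedRequest:
    texts: List[str]

@dataclass
class EmbedResponse:
    vectors: List[List[float]]  # one vector per input text
    model: str

def embed_stub(req: EmbedRequest) -> EmbedResponse:
    """Stand-in backend returning zero vectors at bge-small's dimension (384).
    Swapping in sentence-transformers or OpenAI only changes this function;
    callers of the contract are unaffected."""
    return EmbedResponse(
        vectors=[[0.0] * 384 for _ in req.texts],
        model="BAAI/bge-small-en-v1.5")
```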

&lt;h3&gt;
  
  
  Vector database
&lt;/h3&gt;

&lt;p&gt;Qdrant stores chunk vectors and powers similarity search. The Retrieval service sits in front of it so vector database details are isolated.&lt;/p&gt;

&lt;h3&gt;
  
  
  LLM service
&lt;/h3&gt;

&lt;p&gt;The generation layer uses &lt;code&gt;llama.cpp&lt;/code&gt; by default. The RAG service talks to an abstraction, so local inference is the default but not the only possible implementation.&lt;/p&gt;

&lt;h3&gt;
  
  
  Ingestion service
&lt;/h3&gt;

&lt;p&gt;The Ingestion service is responsible for parsing documents, chunking text, requesting embeddings, and inserting results into the retrieval layer. The next sections go deeper into the ingestion pipeline and its request lifecycle.&lt;/p&gt;

&lt;h3&gt;
  
  
  RAG orchestration service
&lt;/h3&gt;

&lt;p&gt;This is the brain of the application. It handles query processing, context assembly, prompt construction, answer generation, safeguards, optional query rewriting, and optional reranking. The dedicated query-flow sections below show that process in more detail.&lt;/p&gt;

&lt;h3&gt;
  
  
  Optional ML service
&lt;/h3&gt;

&lt;p&gt;The ML service adds extra intelligence around prompt injection detection, query intent classification, and retrieval scoring. It is not required for the core app to work, which is a good design choice for graceful degradation.&lt;/p&gt;

&lt;h3&gt;
  
  
  MCP server
&lt;/h3&gt;

&lt;p&gt;The MCP server is a thin integration layer that exposes the platform as tools for AI agents such as Cursor or Claude Desktop. Instead of building a separate agent-specific backend, this repo reuses the same ingestion, retrieval, and RAG services and makes them available over MCP.&lt;/p&gt;

&lt;p&gt;The service architecture looks like this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F911a31ibccuopmxg7jqo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F911a31ibccuopmxg7jqo.png" alt="Service architecture diagram" width="800" height="594"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  9. Document Ingestion Pipeline 📥
&lt;/h2&gt;

&lt;p&gt;Ingestion is where a lot of real RAG quality is decided.&lt;/p&gt;

&lt;p&gt;The pipeline in this repo looks like this:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Accept a PDF upload or raw text.&lt;/li&gt;
&lt;li&gt;Parse the document into plain text.&lt;/li&gt;
&lt;li&gt;Split the text into chunks.&lt;/li&gt;
&lt;li&gt;Generate embeddings for those chunks.&lt;/li&gt;
&lt;li&gt;Store vectors and metadata in the retrieval layer.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Here is the ingestion pipeline:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Input&lt;/strong&gt;&lt;br&gt;
A document or text is provided by the user through upload or via the MCP client.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Gateway / MCP Server&lt;/strong&gt;&lt;br&gt;
The request is received and validated by the Gateway API or the MCP server, which acts as the system entry point.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Ingestion Service&lt;/strong&gt;&lt;br&gt;
The request is forwarded to the Ingestion Service, responsible for preparing the document for processing and indexing.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Parser&lt;/strong&gt;&lt;br&gt;
The parser extracts raw text from the uploaded content (for example PDF, TXT, or other supported formats).&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Chunker&lt;/strong&gt;&lt;br&gt;
The extracted text is split into smaller chunks to optimise embedding generation and retrieval accuracy.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Embedding Service&lt;/strong&gt;&lt;br&gt;
Each chunk is converted into a vector representation (embedding) using an embedding model.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Vector Preparation&lt;/strong&gt;&lt;br&gt;
The generated vectors represent the semantic meaning of each chunk.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Retrieval Service&lt;/strong&gt;&lt;br&gt;
The vectors and their associated metadata are upserted through the retrieval service.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Vector Database&lt;/strong&gt;&lt;br&gt;
The vectors are stored in Qdrant, where they become searchable for future semantic queries.&lt;/p&gt;
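&lt;p&gt;The walkthrough above condenses into a short pipeline function. Here is a sketch with in-memory stand-ins for the Embedding and Retrieval services; the chunk size and 384-dimension vectors are illustrative defaults, not the repo's configuration:&lt;/p&gt;

```python
def ingest(text, embed, upsert, chunk_size=500):
    """Chunk plain text, embed each chunk, and store chunk/vector pairs.
    Parsing (PDF to text) is assumed to have happened already."""
    chunks = [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]
    vectors = embed(chunks)                # Embedding service call
    upsert(list(zip(chunks, vectors)))     # Retrieval service upsert
    return {"chunks_indexed": len(chunks)}

# Wire it up with fakes to see the flow end to end:
store = []
result = ingest("some long document text " * 100,   # 2400 characters
                embed=lambda chunks: [[0.0] * 384 for _ in chunks],
                upsert=store.extend)
print(result)  # {'chunks_indexed': 5}
```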
&lt;h3&gt;
  
  
  Document parsing
&lt;/h3&gt;

&lt;p&gt;The system supports PDF ingestion and also text ingestion, which is useful for testing, automation, and MCP-driven workflows.&lt;/p&gt;
&lt;h3&gt;
  
  
  Chunking strategies
&lt;/h3&gt;

&lt;p&gt;Chunking is critical because poor chunking hurts retrieval quality even if the model is strong. This project exposes chunk-size configuration and keeps chunking as a shared concern rather than scattering it across services.&lt;/p&gt;
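&lt;p&gt;One common refinement is overlapping windows, so a sentence cut at a chunk boundary still appears whole in the next chunk. A minimal sketch, where the size and overlap defaults are illustrative rather than the repo's configuration:&lt;/p&gt;

```python
def chunk_text(text, size=500, overlap=50):
    """Sliding-window chunking: each chunk repeats the last `overlap`
    characters of the previous one. Requires size greater than overlap."""
    step = size - overlap
    return [text[start:start + size] for start in range(0, len(text), step)]

chunks = chunk_text("x" * 1200, size=400, overlap=100)
# windows start at 0, 300, 600, 900 -> 4 chunks
```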
&lt;h3&gt;
  
  
  Embedding generation
&lt;/h3&gt;

&lt;p&gt;Each chunk is sent to the Embedding service, which returns vector representations using the configured backend.&lt;/p&gt;
&lt;h3&gt;
  
  
  Storing vectors
&lt;/h3&gt;

&lt;p&gt;The Retrieval service upserts the vectors into Qdrant, making the document searchable for future queries.&lt;/p&gt;
&lt;h2&gt;
  
  
  10. Example Document Ingestion Lifecycle 🔄
&lt;/h2&gt;

&lt;p&gt;To make the ingestion path as concrete as the query path, here is a simplified lifecycle of what happens when a user uploads a PDF document through the frontend.&lt;/p&gt;
&lt;h3&gt;
  
  
  Document Ingestion Service Flow (with Optional ML)
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Step&lt;/th&gt;
&lt;th&gt;Caller&lt;/th&gt;
&lt;th&gt;Endpoint&lt;/th&gt;
&lt;th&gt;Target Service&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;Frontend&lt;/td&gt;
&lt;td&gt;&lt;code&gt;/ingest/&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Gateway&lt;/td&gt;
&lt;td&gt;Upload PDF or raw text document&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;Gateway&lt;/td&gt;
&lt;td&gt;&lt;code&gt;/ingest&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Ingestion Service&lt;/td&gt;
&lt;td&gt;Forward document to ingestion pipeline&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;Ingestion Service&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;Parser Module&lt;/td&gt;
&lt;td&gt;Extract text from PDF or text input&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;Ingestion Service&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;/classify&lt;/code&gt; (optional)&lt;/td&gt;
&lt;td&gt;ML Service&lt;/td&gt;
&lt;td&gt;Classify document type when document classification is enabled&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;Ingestion Service&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;Chunker Module&lt;/td&gt;
&lt;td&gt;Split extracted text into smaller chunks&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;td&gt;Ingestion Service&lt;/td&gt;
&lt;td&gt;&lt;code&gt;/embed&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Embedding Service&lt;/td&gt;
&lt;td&gt;Convert text chunks into vector embeddings&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;7&lt;/td&gt;
&lt;td&gt;Embedding Service&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;Embedding Model&lt;/td&gt;
&lt;td&gt;Generate embeddings using &lt;code&gt;sentence-transformers&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;8&lt;/td&gt;
&lt;td&gt;Ingestion Service&lt;/td&gt;
&lt;td&gt;&lt;code&gt;/upsert&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Retrieval Service&lt;/td&gt;
&lt;td&gt;Send vectors and metadata for storage&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;9&lt;/td&gt;
&lt;td&gt;Retrieval Service&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;Qdrant&lt;/td&gt;
&lt;td&gt;Store embeddings in vector database&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;10&lt;/td&gt;
&lt;td&gt;Ingestion Service&lt;/td&gt;
&lt;td&gt;Response&lt;/td&gt;
&lt;td&gt;Gateway&lt;/td&gt;
&lt;td&gt;Return ingestion result&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;11&lt;/td&gt;
&lt;td&gt;Gateway&lt;/td&gt;
&lt;td&gt;JSON response&lt;/td&gt;
&lt;td&gt;Frontend&lt;/td&gt;
&lt;td&gt;Confirm document indexed successfully&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
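&lt;p&gt;The service flow in the table above can be read as a single orchestration function. In this hedged sketch the chunker, Embedding service (&lt;code&gt;/embed&lt;/code&gt;), Retrieval service (&lt;code&gt;/upsert&lt;/code&gt;), and optional ML service (&lt;code&gt;/classify&lt;/code&gt;) are injected as callables, so the flow is visible without any HTTP machinery; the function name and shape are illustrative, not the repo's actual code.&lt;/p&gt;

```python
import uuid

def ingest_document(text: str, source: str, chunk, embed, upsert, classify=None):
    """Orchestrate the ingestion steps from the table above.

    chunk/embed/upsert/classify are injected callables standing in for the
    Chunker module, the Embedding service (/embed), the Retrieval service
    (/upsert), and the optional ML service (/classify).
    """
    doc_type = classify(text) if classify else None   # step 4, optional
    chunks = chunk(text)                              # step 5
    vectors = embed(chunks)                           # steps 6-7, one batch call
    points = [
        {"id": str(uuid.uuid4()),
         "vector": vec,
         "payload": {"text": c, "source": source, "doc_type": doc_type}}
        for c, vec in zip(chunks, vectors)
    ]
    upsert(points)                                    # steps 8-9
    return {"status": "success",
            "chunks_inserted": len(points),
            "document": source}
```

&lt;p&gt;Dependency injection like this also makes the pipeline easy to unit-test with fake services.&lt;/p&gt;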
&lt;h3&gt;
  
  
  User action
&lt;/h3&gt;

&lt;p&gt;Upload file:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;employment_contract.pdf&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 1 — Frontend uploads the document&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The React frontend sends the PDF to the Gateway as a multipart form upload.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight http"&gt;&lt;code&gt;&lt;span class="err"&gt;POST http://localhost:8000/ingest/
Content-Type: multipart/form-data

file = employment_contract.pdf
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Step 2 — Gateway forwards to the Ingestion service&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The Gateway proxies the uploaded file to the Ingestion service.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight http"&gt;&lt;code&gt;&lt;span class="err"&gt;POST http://localhost:8001/ingest
Content-Type: multipart/form-data

file = employment_contract.pdf
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Step 3 — PDF parsing and text extraction&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The Ingestion service stores the upload temporarily, extracts text from the PDF, and prepares it for chunking. If document classification is enabled, the service can also classify the document before indexing it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 4 — Chunking&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The extracted text is split into smaller chunks so the content can be embedded and retrieved effectively later.&lt;/p&gt;

&lt;p&gt;Example chunk output:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Chunk 1: This employment agreement begins on...
Chunk 2: Either party may terminate this contract by...
Chunk 3: Confidentiality obligations survive termination...
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Step 5 — Batch embedding&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The Ingestion service sends all chunks to the Embedding service in one batch request.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight http"&gt;&lt;code&gt;&lt;span class="err"&gt;POST http://localhost:8002/embed
Content-Type: application/json

{
  "texts": [
    "This employment agreement begins on...",
    "Either party may terminate this contract by...",
    "Confidentiality obligations survive termination..."
  ]
}
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The Embedding service returns one vector per chunk.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 6 — Vector upsert&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The Ingestion service packages the chunk text, source filename, and embeddings into points and sends them to the Retrieval service.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight http"&gt;&lt;code&gt;&lt;span class="err"&gt;POST http://localhost:8003/upsert
Content-Type: application/json

{
  "points": [
    {
      "id": "uuid-1",
      "vector": [ ... ],
      "payload": {
        "text": "This employment agreement begins on...",
        "source": "employment_contract.pdf"
      }
    }
  ]
}
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The Retrieval service stores those points in Qdrant, making the document searchable.&lt;/p&gt;
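&lt;p&gt;Under the hood, storing points in Qdrant maps onto its REST API (&lt;code&gt;PUT /collections/{name}/points&lt;/code&gt;). A stdlib-only sketch of building that request is shown below; the collection name &lt;code&gt;documents&lt;/code&gt; and the default Qdrant port are assumptions, not values taken from this repo.&lt;/p&gt;

```python
import json
import urllib.request

QDRANT_URL = "http://localhost:6333"  # Qdrant's default REST port (assumed)

def build_upsert_request(collection: str, points: list) -> urllib.request.Request:
    """Build the Qdrant REST upsert request: PUT /collections/{name}/points."""
    return urllib.request.Request(
        f"{QDRANT_URL}/collections/{collection}/points?wait=true",
        data=json.dumps({"points": points}).encode(),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )

# Sending it requires a running Qdrant instance:
# with urllib.request.urlopen(build_upsert_request("documents", points)) as resp:
#     status = json.load(resp)
```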

&lt;p&gt;&lt;strong&gt;Step 7 — Ingestion result returned&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Once indexing finishes, the Ingestion service returns a success payload.&lt;/p&gt;

&lt;p&gt;Example response:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"status"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"success"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"chunks_inserted"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"document"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"employment_contract.pdf"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The Gateway passes that response back to the frontend, which can then confirm that the document is ready for search and question answering.&lt;/p&gt;

&lt;p&gt;Here is the same ingestion flow shown as a sequence trace:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4y1ln1w2k921xt43nriv.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4y1ln1w2k921xt43nriv.png" alt="Ingestion flow sequence trace Image" width="800" height="375"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This flow is useful because it shows that ingestion is not just file upload. It is a full indexing pipeline: parsing, chunking, embedding, and vector storage. That is what makes later RAG queries possible.&lt;/p&gt;

&lt;h2&gt;
  
  
  11. Query Processing Flow 🔍
&lt;/h2&gt;

&lt;p&gt;Question answering follows a similar service-oriented path:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The user asks a question in the frontend.&lt;/li&gt;
&lt;li&gt;The Gateway forwards the question to the RAG service.&lt;/li&gt;
&lt;li&gt;The RAG service validates the input with safeguards.&lt;/li&gt;
&lt;li&gt;If enabled, the ML service analyzes the query for prompt injection and intent classification.&lt;/li&gt;
&lt;li&gt;If enabled, the query can be rewritten into a clearer search form.&lt;/li&gt;
&lt;li&gt;The query is embedded.&lt;/li&gt;
&lt;li&gt;The Retrieval service performs vector search in Qdrant.&lt;/li&gt;
&lt;li&gt;If enabled, a reranker improves the ordering of retrieved chunks.&lt;/li&gt;
&lt;li&gt;If enabled, the ML service scores retrieval quality before generation.&lt;/li&gt;
&lt;li&gt;The RAG service builds a prompt with the selected context.&lt;/li&gt;
&lt;li&gt;The LLM generates an answer.&lt;/li&gt;
&lt;li&gt;Output safeguards validate the response before it is returned.&lt;/li&gt;
&lt;li&gt;The frontend shows the answer together with sources.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The RAG query flow looks like this:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;User Input&lt;/strong&gt;&lt;br&gt;
A user submits a question to the system.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Gateway&lt;/strong&gt;&lt;br&gt;
The request is received by the Gateway API, which acts as the entry point for user queries.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;RAG Service&lt;/strong&gt;&lt;br&gt;
The request is forwarded to the RAG orchestration service, which coordinates the entire retrieval and generation pipeline.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Input Safeguards&lt;/strong&gt;&lt;br&gt;
The query is validated to detect unsafe or malformed inputs.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Query Analysis (Optional)&lt;/strong&gt;&lt;br&gt;
A machine learning service may analyze the query to determine intent, complexity, or additional metadata.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Query Rewriting (Optional)&lt;/strong&gt;&lt;br&gt;
The query can be rewritten to improve retrieval accuracy.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Embedding Generation&lt;/strong&gt;&lt;br&gt;
The processed query is converted into a vector embedding using the embedding service.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Retrieval Service&lt;/strong&gt;&lt;br&gt;
The retrieval service searches for relevant document chunks using the query vector.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Vector Database Search&lt;/strong&gt;&lt;br&gt;
The similarity search is executed in Qdrant, which stores the document embeddings.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Context Retrieval&lt;/strong&gt;&lt;br&gt;
The most relevant chunks are returned as context for the language model.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Reranking (Optional)&lt;/strong&gt;&lt;br&gt;
A reranker model may reorder the retrieved chunks to improve relevance.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Retrieval Scoring (Optional)&lt;/strong&gt;&lt;br&gt;
An ML service may evaluate and score the quality of the retrieved results.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Prompt Construction&lt;/strong&gt;&lt;br&gt;
The prompt builder assembles the final prompt using the user query and the retrieved context.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Model Execution&lt;/strong&gt;&lt;br&gt;
The prompt is sent to the local model runtime powered by llama.cpp.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Output Safeguards&lt;/strong&gt;&lt;br&gt;
The generated response is validated to ensure safety and compliance.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Final Response&lt;/strong&gt;&lt;br&gt;
The system returns the answer along with source references to the user.&lt;/p&gt;

&lt;p&gt;This is a good example of why modularity matters. The user experiences one simple chat flow, but the system is actually combining retrieval, ranking, safety checks, and generation behind the scenes.&lt;/p&gt;
&lt;h2&gt;
  
  
  12. Example Request Lifecycle 🔁
&lt;/h2&gt;

&lt;p&gt;To make the architecture more concrete, here is a simplified lifecycle of a real request when a user asks a question about an uploaded document. The URLs below use the default local development ports from this repository.&lt;/p&gt;
&lt;h3&gt;
  
  
  Example Request Flow
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Step&lt;/th&gt;
&lt;th&gt;Caller&lt;/th&gt;
&lt;th&gt;Endpoint&lt;/th&gt;
&lt;th&gt;Target Service&lt;/th&gt;
&lt;th&gt;Description&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;Frontend&lt;/td&gt;
&lt;td&gt;&lt;code&gt;/chat/&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Gateway&lt;/td&gt;
&lt;td&gt;Send the user question to the public API&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;Gateway&lt;/td&gt;
&lt;td&gt;&lt;code&gt;/ask&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;RAG Service&lt;/td&gt;
&lt;td&gt;Forward the request to the RAG orchestration layer&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;RAG&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;/analyze&lt;/code&gt; (optional)&lt;/td&gt;
&lt;td&gt;ML Service&lt;/td&gt;
&lt;td&gt;Check prompt injection and classify query intent&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;RAG&lt;/td&gt;
&lt;td&gt;&lt;code&gt;/embed&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Embedding Service&lt;/td&gt;
&lt;td&gt;Generate a vector embedding for the question&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;RAG&lt;/td&gt;
&lt;td&gt;&lt;code&gt;/search&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;Retrieval Service&lt;/td&gt;
&lt;td&gt;Retrieve the most relevant chunks&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;td&gt;Retrieval&lt;/td&gt;
&lt;td&gt;Vector search&lt;/td&gt;
&lt;td&gt;Qdrant&lt;/td&gt;
&lt;td&gt;Run similarity search on stored embeddings&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;7&lt;/td&gt;
&lt;td&gt;RAG&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;/score&lt;/code&gt; (optional)&lt;/td&gt;
&lt;td&gt;ML Service&lt;/td&gt;
&lt;td&gt;Score retrieval quality before generation&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;8&lt;/td&gt;
&lt;td&gt;RAG&lt;/td&gt;
&lt;td&gt;Completion call&lt;/td&gt;
&lt;td&gt;LLM Runtime&lt;/td&gt;
&lt;td&gt;Send prompt and context to the configured model backend&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;9&lt;/td&gt;
&lt;td&gt;Gateway&lt;/td&gt;
&lt;td&gt;Response&lt;/td&gt;
&lt;td&gt;Frontend&lt;/td&gt;
&lt;td&gt;Return the final answer and sources&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
&lt;h3&gt;
  
  
  User question
&lt;/h3&gt;

&lt;p&gt;&lt;code&gt;What does the contract say about termination conditions?&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 1 — Frontend sends the request&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The React frontend sends the user question to the Gateway.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight http"&gt;&lt;code&gt;&lt;span class="err"&gt;POST http://localhost:8000/chat/
Content-Type: application/json

{
  "question": "What does the contract say about termination conditions?"
}
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Step 2 — Gateway forwards to the RAG service&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The Gateway validates the incoming request shape and forwards it to the RAG orchestration service.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight http"&gt;&lt;code&gt;&lt;span class="err"&gt;POST http://localhost:8004/ask
Content-Type: application/json

{
  "question": "What does the contract say about termination conditions?"
}
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Step 3 — Input validation and optional ML analysis&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The RAG service first runs its input safeguards. If the optional ML service is enabled, it can also analyze the query for prompt injection and classify the query intent.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight http"&gt;&lt;code&gt;&lt;span class="err"&gt;POST http://localhost:8005/analyze
Content-Type: application/json

{
  "query": "What does the contract say about termination conditions?"
}
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This step is optional and the system can continue without it if the ML service is disabled or unavailable.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 4 — Query embedding&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Inside the RAG service, the question is passed to the Embedding service to generate a vector representation.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight http"&gt;&lt;code&gt;&lt;span class="err"&gt;POST http://localhost:8002/embed
Content-Type: application/json

{
  "text": "What does the contract say about termination conditions?"
}
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The Embedding service returns an embedding vector for the query.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 5 — Vector search&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The RAG service sends the query vector to the Retrieval service, which searches Qdrant for the most relevant chunks.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight http"&gt;&lt;code&gt;&lt;span class="err"&gt;POST http://localhost:8003/search
Content-Type: application/json

{
  "query_vector": [ ... ],
  "top_k": 5
}
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The Retrieval service returns the most similar chunks, including their text and source metadata.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 6 — Optional reranking and retrieval scoring&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If reranking is enabled, the RAG service reorders the retrieved chunks before generation. If the ML service is enabled, it can also score the retrieval quality.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight http"&gt;&lt;code&gt;&lt;span class="err"&gt;POST http://localhost:8005/score
Content-Type: application/json

{
  "query": "What does the contract say about termination conditions?",
  "chunks": [
    {
      "text": "Either party may terminate the agreement with written notice...",
      "source": "employment_contract.pdf"
    }
  ]
}
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This lets the system estimate whether the retrieved context is strong enough before asking the model to generate an answer.&lt;/p&gt;
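&lt;p&gt;That estimate can be turned into a simple gate before generation. The thresholds below are purely illustrative; the repo's actual scoring logic lives behind the ML service's &lt;code&gt;/score&lt;/code&gt; endpoint.&lt;/p&gt;

```python
def should_generate(scores: list, min_best: float = 0.5, min_mean: float = 0.3) -> bool:
    """Decide whether retrieved context is strong enough to answer from.

    scores are relevance scores for the retrieved chunks (e.g. returned
    by a retrieval-scoring service); thresholds here are illustrative.
    """
    if not scores:
        return False
    best_ok = max(scores) >= min_best
    mean_ok = sum(scores) / len(scores) >= min_mean
    return best_ok and mean_ok
```

&lt;p&gt;When the gate fails, the system can refuse to answer or ask the user to rephrase instead of letting the model hallucinate from weak context.&lt;/p&gt;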

&lt;p&gt;&lt;strong&gt;Step 7 — Prompt construction&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The RAG service builds a grounded prompt that combines the original question with the retrieved context.&lt;/p&gt;

&lt;p&gt;Example prompt:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Context:
[Chunk 1: termination clause description...]
[Chunk 2: conditions for ending the agreement...]

Question:
What does the contract say about termination conditions?

Answer using only the provided context.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If query rewriting, reranking, safeguards, or ML-based scoring are enabled, those steps are applied around retrieval and prompt construction before the model is called.&lt;/p&gt;
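&lt;p&gt;A prompt builder matching the template above could be sketched as follows; the function name and exact formatting are illustrative rather than the repo's actual helper.&lt;/p&gt;

```python
def build_prompt(question: str, chunks: list) -> str:
    """Assemble a grounded prompt from the question and retrieved chunks,
    following the Context / Question / instruction template shown above."""
    context = "\n".join(
        f"[Chunk {i}: {c['text']}]" for i, c in enumerate(chunks, start=1)
    )
    return (
        f"Context:\n{context}\n\n"
        f"Question:\n{question}\n\n"
        "Answer using only the provided context."
    )
```

&lt;p&gt;Keeping the "answer only from the provided context" instruction in one shared builder means every query path gets the same grounding behaviour.&lt;/p&gt;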

&lt;p&gt;&lt;strong&gt;Step 8 — LLM inference&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The prompt is then sent from the RAG service to the configured LLM backend. In the default local setup, that means a call to the &lt;code&gt;llama.cpp&lt;/code&gt; server configured by &lt;code&gt;LLM_URL&lt;/code&gt; (typically &lt;code&gt;http://localhost:8080&lt;/code&gt;).&lt;/p&gt;
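&lt;p&gt;The llama.cpp server exposes a &lt;code&gt;/completion&lt;/code&gt; endpoint that accepts a JSON body with the prompt. A stdlib-only sketch of building that call follows; the &lt;code&gt;n_predict&lt;/code&gt; default is an assumption, and a real client would also handle timeouts and errors.&lt;/p&gt;

```python
import json
import os
import urllib.request

LLM_URL = os.environ.get("LLM_URL", "http://localhost:8080")

def build_completion_request(prompt: str, n_predict: int = 256) -> urllib.request.Request:
    """Build a request to the llama.cpp server's /completion endpoint."""
    return urllib.request.Request(
        f"{LLM_URL}/completion",
        data=json.dumps({"prompt": prompt, "n_predict": n_predict}).encode(),
        headers={"Content-Type": "application/json"},
        method="POST",
    )

# With the server running, the generated text comes back in the "content" field:
# with urllib.request.urlopen(build_completion_request(prompt)) as resp:
#     answer = json.load(resp)["content"]
```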

&lt;p&gt;&lt;strong&gt;Step 9 — Response returned&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Once the answer is generated, the RAG service returns the result together with the source references.&lt;/p&gt;

&lt;p&gt;Example response:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"question"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"What does the contract say about termination conditions?"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"answer"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"The contract allows termination if either party provides 30 days written notice..."&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"sources"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="s2"&gt;"contract_page_4_chunk_2"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="s2"&gt;"contract_page_5_chunk_1"&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Finally, the Gateway returns that response to the frontend, and the answer is displayed in the chat interface.&lt;/p&gt;

&lt;p&gt;Here is the same request shown as a sequence trace:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F193hq5su6nq90kb3kryx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F193hq5su6nq90kb3kryx.png" alt="User Question Sequence trace Image" width="800" height="355"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This section is useful because it shows the system as an actual request trace, not just a conceptual diagram. It makes it easier to see how the services collaborate and how RAG works in practice.&lt;/p&gt;

&lt;h2&gt;
  
  
  13. Improving Retrieval Quality 🎯
&lt;/h2&gt;

&lt;p&gt;A basic RAG demo often stops at embedding plus vector search. This project goes further.&lt;/p&gt;

&lt;h3&gt;
  
  
  Chunking strategies
&lt;/h3&gt;

&lt;p&gt;Chunk size is configurable because retrieval quality depends heavily on how much meaning each chunk carries.&lt;/p&gt;

&lt;h3&gt;
  
  
  Top-k retrieval
&lt;/h3&gt;

&lt;p&gt;The system can retrieve a broader candidate set for search and then narrow it down before generation. That is more robust than sending only the first raw matches directly to the LLM.&lt;/p&gt;

&lt;h3&gt;
  
  
  Context filtering
&lt;/h3&gt;

&lt;p&gt;The architecture supports filtering and validating what reaches the model, which matters both for relevance and safety.&lt;/p&gt;

&lt;h3&gt;
  
  
  Query rewriting
&lt;/h3&gt;

&lt;p&gt;One of the nicer features in this repo is optional query rewriting. Short or vague questions can be expanded into clearer search queries for better embedding and retrieval, while the original user wording is preserved for the final answer.&lt;/p&gt;
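&lt;p&gt;As a minimal illustration of the idea, a purely heuristic rewriter might expand very short questions while leaving well-formed ones untouched. This is a sketch of the concept only; the repo's rewriter may use an ML model instead.&lt;/p&gt;

```python
def rewrite_query(query: str, min_words: int = 4) -> str:
    """Expand short or vague questions into a clearer search query.

    The original wording is preserved by the caller for the final
    answer; only the retrieval query is rewritten.
    """
    if len(query.split()) >= min_words:
        return query
    return f"information about {query.rstrip('?')} in the document"
```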

&lt;h3&gt;
  
  
  Reranking
&lt;/h3&gt;

&lt;p&gt;The project also supports optional BGE reranking. That means vector search can fetch a wider set of candidates, and then a reranker can choose the best chunks to pass into the answer prompt.&lt;/p&gt;
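&lt;p&gt;The fetch-wide-then-narrow pattern can be sketched as a small function with the scorer injected. In the real pipeline the scorer would be a BGE cross-encoder (for example, &lt;code&gt;sentence-transformers&lt;/code&gt;' &lt;code&gt;CrossEncoder.predict&lt;/code&gt; over query-chunk pairs); injecting it here keeps the sketch runnable without loading a model.&lt;/p&gt;

```python
def rerank(query: str, candidates: list, score_fn, keep: int = 3) -> list:
    """Score each (query, candidate) pair and keep the top `keep` candidates.

    score_fn stands in for a cross-encoder relevance model; vector search
    fetches a wide candidate set, and this step narrows it for the prompt.
    """
    ranked = sorted(candidates, key=lambda c: score_fn(query, c), reverse=True)
    return ranked[:keep]
```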

&lt;p&gt;Together, these choices make the retrieval layer more realistic than a minimal tutorial project.&lt;/p&gt;

&lt;h2&gt;
  
  
  14. Security Considerations 🔐
&lt;/h2&gt;

&lt;p&gt;Local AI does not automatically mean secure AI. You still need defensive layers.&lt;/p&gt;

&lt;p&gt;This project includes several useful security-oriented ideas:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Input safeguards to block obvious prompt injection or disallowed patterns&lt;/li&gt;
&lt;li&gt;Output safeguards to block sensitive or restricted responses&lt;/li&gt;
&lt;li&gt;Optional ML-based injection detection&lt;/li&gt;
&lt;li&gt;Context controls so only selected retrieved chunks are passed to generation&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Prompt injection is especially important in RAG because the model is reading untrusted document content and user instructions at the same time. Even in a local system, that risk still exists.&lt;/p&gt;

&lt;p&gt;Input validation and context filtering are therefore just as important as model quality.&lt;/p&gt;
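&lt;p&gt;A first line of defence can be as simple as pattern-based input validation. The patterns below are illustrative examples only; a real safeguard layer, like the one in this project, would be broader and ideally backed by the ML service's injection detection.&lt;/p&gt;

```python
import re

# Illustrative patterns only; a production safeguard layer would be broader.
INJECTION_PATTERNS = [
    re.compile(r"ignore (all )?previous instructions", re.IGNORECASE),
    re.compile(r"reveal (the )?system prompt", re.IGNORECASE),
]

def validate_input(query: str, max_len: int = 2000):
    """Return (ok, reason) for an incoming query."""
    if not query.strip():
        return False, "empty query"
    if len(query) > max_len:
        return False, "query too long"
    for pattern in INJECTION_PATTERNS:
        if pattern.search(query):
            return False, "possible prompt injection"
    return True, "ok"
```

&lt;p&gt;Pattern checks are cheap to run on every request, which is why they sit in front of the heavier, optional ML-based analysis.&lt;/p&gt;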

&lt;p&gt;The ML and safety layer in this project looks like this:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Step&lt;/th&gt;
&lt;th&gt;What it does&lt;/th&gt;
&lt;th&gt;Service / Module&lt;/th&gt;
&lt;th&gt;Endpoint / Call Type&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;User Query&lt;/td&gt;
&lt;td&gt;Incoming question&lt;/td&gt;
&lt;td&gt;Gateway → RAG&lt;/td&gt;
&lt;td&gt;
&lt;code&gt;POST /chat&lt;/code&gt; (Gateway) → &lt;code&gt;POST /ask&lt;/code&gt; (RAG)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Input Safeguards&lt;/td&gt;
&lt;td&gt;Validate input&lt;/td&gt;
&lt;td&gt;Safeguard module inside RAG (&lt;code&gt;backend/services/safeguard&lt;/code&gt;)&lt;/td&gt;
&lt;td&gt;In-process &lt;code&gt;validate_input(...)&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Query Analysis&lt;/td&gt;
&lt;td&gt;Injection / intent analysis (optional)&lt;/td&gt;
&lt;td&gt;ML Service&lt;/td&gt;
&lt;td&gt;&lt;code&gt;POST /analyze&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Query Rewriter&lt;/td&gt;
&lt;td&gt;Improve query text (optional)&lt;/td&gt;
&lt;td&gt;Query Rewriter (&lt;code&gt;backend/shared/query_rewriter&lt;/code&gt;)&lt;/td&gt;
&lt;td&gt;In-process &lt;code&gt;rewrite(...)&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Embedding Service&lt;/td&gt;
&lt;td&gt;Generate query vector&lt;/td&gt;
&lt;td&gt;Embedding Service&lt;/td&gt;
&lt;td&gt;&lt;code&gt;POST /embed&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Retrieval Service&lt;/td&gt;
&lt;td&gt;Search context chunks&lt;/td&gt;
&lt;td&gt;Retrieval Service&lt;/td&gt;
&lt;td&gt;&lt;code&gt;POST /search&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Reranker&lt;/td&gt;
&lt;td&gt;Reorder results (optional)&lt;/td&gt;
&lt;td&gt;Reranker (&lt;code&gt;backend/shared/reranker&lt;/code&gt;, BGE)&lt;/td&gt;
&lt;td&gt;In-process &lt;code&gt;rerank(...)&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;ML Retrieval Scoring&lt;/td&gt;
&lt;td&gt;Score retrieved context (optional)&lt;/td&gt;
&lt;td&gt;ML Service&lt;/td&gt;
&lt;td&gt;&lt;code&gt;POST /score&lt;/code&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Prompt Builder&lt;/td&gt;
&lt;td&gt;Build final prompt with context&lt;/td&gt;
&lt;td&gt;Prompt builder (shared helpers in RAG)&lt;/td&gt;
&lt;td&gt;In-process prompt assembly&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Model Runtime&lt;/td&gt;
&lt;td&gt;LLM inference&lt;/td&gt;
&lt;td&gt;LLM backend (e.g. llama.cpp)&lt;/td&gt;
&lt;td&gt;RAG → LLM HTTP call (&lt;code&gt;LLM_URL&lt;/code&gt;)&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Output Safeguards&lt;/td&gt;
&lt;td&gt;Validate model response&lt;/td&gt;
&lt;td&gt;Safeguard module inside RAG&lt;/td&gt;
&lt;td&gt;In-process &lt;code&gt;validate_output(...)&lt;/code&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Final Response&lt;/td&gt;
&lt;td&gt;Return answer + sources&lt;/td&gt;
&lt;td&gt;RAG → Gateway → Client&lt;/td&gt;
&lt;td&gt;HTTP response&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  15. Performance Optimization ⚡
&lt;/h2&gt;

&lt;p&gt;Performance in local AI is about balancing model quality, retrieval quality, and resource usage.&lt;/p&gt;

&lt;p&gt;Some useful optimization levers visible in this repo are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Caching opportunities for repeated queries or embeddings&lt;/li&gt;
&lt;li&gt;Embedding reuse for already-ingested chunks&lt;/li&gt;
&lt;li&gt;Vector DB tuning such as &lt;code&gt;TOP_K&lt;/code&gt; and retrieval candidate size&lt;/li&gt;
&lt;li&gt;Choosing smaller or larger local models depending on hardware&lt;/li&gt;
&lt;li&gt;Disabling optional steps like query rewriting or ML analysis when latency matters more than quality&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The nice thing about a modular design is that you can tune each part independently instead of treating the whole system as one black box.&lt;/p&gt;

&lt;h2&gt;
  
  
  16. Advantages of a Local AI Stack ✅
&lt;/h2&gt;

&lt;p&gt;If you want the short version, these are the main pros of this architecture:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Full control over document data&lt;/li&gt;
&lt;li&gt;More predictable operational cost&lt;/li&gt;
&lt;li&gt;Freedom to customize models and providers&lt;/li&gt;
&lt;li&gt;Flexible service boundaries&lt;/li&gt;
&lt;li&gt;Better fit for internal tools and private deployments&lt;/li&gt;
&lt;li&gt;A path to support both web users and AI agents through the same backend&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The last point is worth highlighting. This repo exposes more than a frontend: it also includes an MCP server, which means AI agents can search documents, ask grounded questions, and ingest text using the same backend services.&lt;/p&gt;

&lt;p&gt;That matters because it turns the project from a simple web app into a more reusable AI platform. The same monorepo supports browser users, backend APIs, and agent tooling without duplicating business logic.&lt;/p&gt;

&lt;h2&gt;
  
  
  17. Limitations and Tradeoffs ⚖️
&lt;/h2&gt;

&lt;p&gt;A local-first approach is powerful, but it is not magic.&lt;/p&gt;

&lt;p&gt;If you want the short version, these are the main cons:&lt;/p&gt;

&lt;h3&gt;
  
  
  Hardware requirements
&lt;/h3&gt;

&lt;p&gt;Running local models well still depends on available CPU, RAM, and ideally GPU support. The better the model, the more demanding the setup tends to be.&lt;/p&gt;

&lt;h3&gt;
  
  
  Model quality vs cloud providers
&lt;/h3&gt;

&lt;p&gt;Strong local models can be impressive, but top hosted models may still outperform them on reasoning, instruction following, or multilingual tasks depending on the setup.&lt;/p&gt;

&lt;h3&gt;
  
  
  Throughput and concurrent requests
&lt;/h3&gt;

&lt;p&gt;Another important limitation is serving capacity. A local model runtime such as &lt;code&gt;llama.cpp&lt;/code&gt; can work very well for development, demos, and low-volume internal tools, but multiple simultaneous requests can quickly become a bottleneck.&lt;/p&gt;

&lt;p&gt;If several users send questions at the same time, you may see:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;queued requests&lt;/li&gt;
&lt;li&gt;slower response times&lt;/li&gt;
&lt;li&gt;higher CPU or GPU contention&lt;/li&gt;
&lt;li&gt;reduced throughput compared to managed cloud inference&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That does not make local inference a bad choice, but it does mean you should think about expected traffic. A local-first stack is often strongest for single-user workflows, small teams, internal tools, or controlled environments rather than high-concurrency public applications.&lt;/p&gt;

&lt;h3&gt;
  
  
  Maintenance overhead
&lt;/h3&gt;

&lt;p&gt;When you own the stack, you also own more operational work:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Managing model files&lt;/li&gt;
&lt;li&gt;Tuning chunking and retrieval&lt;/li&gt;
&lt;li&gt;Running vector infrastructure&lt;/li&gt;
&lt;li&gt;Maintaining service compatibility&lt;/li&gt;
&lt;li&gt;Handling upgrades and troubleshooting&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That is the tradeoff for greater control.&lt;/p&gt;

&lt;h2&gt;
  
  
  18. Future Improvements 🗺️
&lt;/h2&gt;

&lt;p&gt;This project already implements more than a minimal RAG demo, but there is still room to grow.&lt;/p&gt;

&lt;p&gt;Some especially valuable next steps are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Hybrid search that combines vector and keyword retrieval&lt;/li&gt;
&lt;li&gt;More advanced reranking strategies&lt;/li&gt;
&lt;li&gt;ML-based query rewriting improvements&lt;/li&gt;
&lt;li&gt;Multi-model orchestration or query routing&lt;/li&gt;
&lt;li&gt;Observability for latency, retrieval quality, and answer quality&lt;/li&gt;
&lt;li&gt;Multi-tenant document collections&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Because the project already uses stable service boundaries and config-driven backends, those improvements can be added incrementally instead of requiring a full redesign.&lt;/p&gt;

&lt;h2&gt;19. Refactoring Path: LangChain, LlamaIndex, or Bedrock 🛣️&lt;/h2&gt;

&lt;p&gt;One advantage of this architecture is that it does not lock the project into one implementation style forever. Because the system is already separated into services with stable contracts, it can be refactored gradually to use higher-level frameworks or managed cloud providers.&lt;/p&gt;

&lt;h3&gt;Refactoring toward LangChain&lt;/h3&gt;

&lt;p&gt;If I wanted to adopt LangChain, I would not rewrite the whole repo at once. The cleaner approach would be to replace internal orchestration inside the RAG service first:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Use LangChain for prompt templates, retrievers, and chain composition&lt;/li&gt;
&lt;li&gt;Keep the existing Gateway, frontend, and ingestion APIs unchanged&lt;/li&gt;
&lt;li&gt;Wrap the current Embedding, Retrieval, and LLM integrations behind LangChain-compatible adapters&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That would let the repo keep its current service boundaries while using LangChain as an orchestration layer instead of making LangChain the whole architecture.&lt;/p&gt;
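&lt;p&gt;The adapter idea can be shown without depending on LangChain's actual classes. Everything below is a framework-agnostic, hypothetical sketch: &lt;code&gt;RetrieverAdapter&lt;/code&gt;, &lt;code&gt;get_relevant_documents&lt;/code&gt;, and &lt;code&gt;FakeRetrievalService&lt;/code&gt; are names invented for illustration, and a real LangChain integration would subclass its retriever base class instead.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from typing import Protocol

class RetrievalService(Protocol):
    """Shape of the existing Retrieval service client (names are made up)."""
    def search(self, query, top_k):
        ...

class RetrieverAdapter:
    """Expose the in-house retrieval call through the single method an
    orchestration framework expects. The framework sees one interface;
    the service keeps its own API contract.
    """

    def __init__(self, service, top_k=4):
        self._service = service
        self._top_k = top_k

    def get_relevant_documents(self, query):
        # Delegate to the existing service so its contract stays untouched.
        return self._service.search(query, top_k=self._top_k)

class FakeRetrievalService:
    """Stand-in for the real HTTP client, only to show the call path."""
    def search(self, query, top_k):
        return [f"chunk {i} for {query!r}" for i in range(top_k)]

docs = RetrieverAdapter(FakeRetrievalService(), top_k=2).get_relevant_documents("billing")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Because the adapter owns the translation, the Gateway and ingestion APIs never learn that LangChain exists.&lt;/p&gt;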

&lt;h3&gt;Refactoring toward LlamaIndex&lt;/h3&gt;

&lt;p&gt;LlamaIndex would make the most sense if I wanted a more framework-driven retrieval pipeline with built-in indexing, query engines, and document abstractions.&lt;/p&gt;

&lt;p&gt;A practical path would be:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Move retrieval orchestration logic from the custom RAG service into LlamaIndex components&lt;/li&gt;
&lt;li&gt;Keep Qdrant as the vector backend if desired&lt;/li&gt;
&lt;li&gt;Reuse the existing ingestion and document-loading flow where it still fits&lt;/li&gt;
&lt;li&gt;Preserve the external API contracts so the frontend and MCP tools do not need to change&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In other words, LlamaIndex could replace part of the internal RAG implementation without forcing a full product rewrite.&lt;/p&gt;

&lt;h3&gt;Refactoring toward AWS Bedrock&lt;/h3&gt;

&lt;p&gt;Bedrock is a different kind of change because it is a provider shift rather than only a framework shift.&lt;/p&gt;

&lt;p&gt;This repo is already designed for that direction:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Embedding backends are configurable&lt;/li&gt;
&lt;li&gt;LLM backends are configurable&lt;/li&gt;
&lt;li&gt;The docs and code structure already anticipate Bedrock-style backend implementations&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That means a Bedrock migration could be done by implementing or completing:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;a Bedrock embedding backend in the Embedding service&lt;/li&gt;
&lt;li&gt;a Bedrock LLM backend in the RAG service&lt;/li&gt;
&lt;li&gt;optional AWS-specific configuration in service settings&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The important part is that the public APIs would stay the same. The frontend, Gateway, ingestion flow, and MCP integration would not need to know whether the actual model provider is local &lt;code&gt;llama.cpp&lt;/code&gt;, OpenAI, or Bedrock.&lt;/p&gt;
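&lt;p&gt;The backend-selection idea boils down to a small config-driven factory. This is an illustration rather than the repo's actual code: the class names and the &lt;code&gt;LLM_BACKEND&lt;/code&gt; setting key are hypothetical stand-ins for whatever the real config system uses.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;class LlamaCppBackend:
    name = "llama.cpp"

class OpenAIBackend:
    name = "openai"

class BedrockBackend:
    name = "bedrock"

_BACKENDS = {
    "llama.cpp": LlamaCppBackend,
    "openai": OpenAIBackend,
    "bedrock": BedrockBackend,
}

def make_llm_backend(settings):
    """Pick the LLM backend from configuration.

    settings is a plain dict here; the real service would read the same
    value from its config system. Callers never learn which class they got.
    """
    key = settings.get("LLM_BACKEND", "llama.cpp")
    if key not in _BACKENDS:
        raise ValueError("unknown LLM backend: " + key)
    return _BACKENDS[key]()

backend = make_llm_backend({"LLM_BACKEND": "bedrock"})
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Swapping providers then becomes a configuration change plus one new backend class, with every public API left alone.&lt;/p&gt;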

&lt;h3&gt;Why this matters&lt;/h3&gt;

&lt;p&gt;This is exactly why I prefer a modular monorepo for projects like this. The current stack is local-first, but the architecture leaves room for a future where:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;local inference is used in development&lt;/li&gt;
&lt;li&gt;Bedrock or another managed provider is used in production&lt;/li&gt;
&lt;li&gt;LangChain or LlamaIndex is introduced only where it adds value&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That flexibility makes the project more realistic from a software engineering perspective.&lt;/p&gt;

&lt;h2&gt;20. Conclusion 🧵&lt;/h2&gt;

&lt;p&gt;This repository demonstrates that you can build a serious AI application without making OpenAI the centre of the architecture.&lt;/p&gt;

&lt;p&gt;The system combines local embeddings, vector search, document ingestion, retrieval orchestration, safeguards, reranking, query rewriting, optional ML analysis, MCP-based agent access, and local LLM inference into one coherent stack. It also shows an important design principle: local-first does not have to mean rigid. By keeping the interfaces stable inside a multi-service monorepo, the system stays flexible enough to support alternative backends later.&lt;/p&gt;

&lt;p&gt;Local AI makes the most sense when you care about privacy, predictable cost, internal deployment, and architectural control. Cloud AI may still be the better fit when you need the strongest hosted models immediately, want minimal infrastructure work, or do not mind sending data to external providers.&lt;/p&gt;

&lt;p&gt;For me, that is the biggest takeaway from building this project: the interesting part is not just calling a model. It is designing the full pipeline around retrieval quality, data ownership, extensibility, and real product constraints.&lt;/p&gt;

&lt;h2&gt;21. Demoing This Repo 🎬&lt;/h2&gt;

&lt;p&gt;If you want to use this article together with a live demo, the shortest path is:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;make up
make llm
make frontend
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;When everything is up and running locally, it looks like this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzvvwez4rzyvzz5cjvyxz.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzvvwez4rzyvzz5cjvyxz.png" alt="Up And Running Image" width="800" height="586"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Then:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Open &lt;code&gt;http://localhost:5173&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Upload a PDF&lt;/li&gt;
&lt;li&gt;Ask a question that requires document retrieval&lt;/li&gt;
&lt;li&gt;Show that the answer includes sources&lt;/li&gt;
&lt;/ol&gt;

</description>
      <category>ai</category>
      <category>rag</category>
      <category>llm</category>
      <category>mcp</category>
    </item>
  </channel>
</rss>
