<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: VIVEK T</title>
    <description>The latest articles on DEV Community by VIVEK T (@vivek_t_05fe5587ebaf850d3).</description>
    <link>https://dev.to/vivek_t_05fe5587ebaf850d3</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1696877%2F7b8283c5-362c-4dfa-9b4a-983d8c1c003a.jpg</url>
      <title>DEV Community: VIVEK T</title>
      <link>https://dev.to/vivek_t_05fe5587ebaf850d3</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/vivek_t_05fe5587ebaf850d3"/>
    <language>en</language>
    <item>
      <title>I Thought Fine-Tuning LLMs Needed Expensive GPUs. I Was Wrong.</title>
      <dc:creator>VIVEK T</dc:creator>
      <pubDate>Wed, 20 May 2026 07:14:56 +0000</pubDate>
      <link>https://dev.to/vivek_t_05fe5587ebaf850d3/i-thought-fine-tuning-llms-needed-expensive-gpus-i-was-wrong-2p06</link>
      <guid>https://dev.to/vivek_t_05fe5587ebaf850d3/i-thought-fine-tuning-llms-needed-expensive-gpus-i-was-wrong-2p06</guid>
      <description>&lt;p&gt;Yesterday I fine-tuned a 1.1B parameter language model using QLoRA on consumer hardware.&lt;/p&gt;

&lt;p&gt;And honestly?&lt;/p&gt;

&lt;p&gt;The hardest part wasn’t training.&lt;br&gt;
It was debugging everything around it.&lt;/p&gt;

&lt;p&gt;I started with a simple goal:&lt;br&gt;
“understand how LLM fine-tuning actually works.”&lt;br&gt;
A few hours later I was deep into:&lt;/p&gt;

&lt;p&gt;NF4 quantization&lt;br&gt;
LoRA internals&lt;br&gt;
tokenization&lt;br&gt;
chat templates&lt;br&gt;
VRAM optimization&lt;br&gt;
adapter injection&lt;br&gt;
FastAPI serving&lt;br&gt;
Redis caching&lt;br&gt;
Qdrant RAG pipelines&lt;br&gt;
and dependency version warfare&lt;/p&gt;

&lt;p&gt;This was the stack:&lt;br&gt;
TinyLlama&lt;br&gt;
QLoRA&lt;br&gt;
PEFT&lt;br&gt;
TRL&lt;br&gt;
BitsAndBytes&lt;br&gt;
Hugging Face&lt;br&gt;
FastAPI&lt;br&gt;
The Crazy Part&lt;/p&gt;

&lt;p&gt;I trained only ~0.2% of the model.&lt;br&gt;
Not 20%.&lt;br&gt;
Not 2%.&lt;br&gt;
0.2%.&lt;/p&gt;

&lt;p&gt;That’s the magic of LoRA.&lt;/p&gt;

&lt;p&gt;Instead of retraining the full model, you train tiny adapter matrices on top of frozen weights.&lt;br&gt;
And with 4-bit NF4 quantization, memory usage drops enough to make this possible on low VRAM hardware.&lt;br&gt;
That moment blew my mind.&lt;/p&gt;

&lt;p&gt;The Funniest Bug&lt;br&gt;
Training loss looked good.&lt;br&gt;
Everything seemed successful.&lt;br&gt;
Then inference output came out completely broken.&lt;/p&gt;

&lt;p&gt;Why?&lt;/p&gt;

&lt;p&gt;Because the inference prompt format didn’t match the training chat template.&lt;br&gt;
One formatting mismatch destroyed the entire output quality.&lt;br&gt;
That single bug taught me more than most tutorials online.&lt;br&gt;
Biggest Takeaway&lt;br&gt;
AI engineering is not:&lt;br&gt;
“call OpenAI API and ship.”&lt;/p&gt;

&lt;p&gt;The real stuff starts when you understand:&lt;/p&gt;

&lt;p&gt;quantization&lt;br&gt;
tokenization&lt;br&gt;
adapters&lt;br&gt;
training loops&lt;br&gt;
inference pipelines&lt;br&gt;
deployment tradeoffs&lt;/p&gt;

&lt;p&gt;That’s when you stop being an API consumer and start understanding the actual systems underneath.&lt;/p&gt;

&lt;p&gt;What I Built 🚀&lt;/p&gt;

&lt;p&gt;✅ Fine-tuned TinyLlama-1.1B using QLoRA&lt;br&gt;
✅ Trained only ~2.25M params out of ~1.1B&lt;br&gt;
✅ Built FastAPI inference pipeline&lt;br&gt;
✅ Saved adapter-only weights&lt;br&gt;
✅ Pushed model adapter to Hugging Face&lt;br&gt;
✅ Built interactive dark-mode revision cheatsheet&lt;br&gt;
✅ Explored Redis + Qdrant RAG concepts&lt;/p&gt;

&lt;p&gt;Open Source AI Is Wild&lt;/p&gt;

&lt;p&gt;Huge respect to:&lt;br&gt;
Hugging Face&lt;br&gt;
TinyLlama&lt;br&gt;
PEFT&lt;br&gt;
TRL&lt;br&gt;
BitsAndBytes&lt;/p&gt;

&lt;p&gt;The tooling available for solo developers right now is insane.&lt;/p&gt;

&lt;p&gt;Links&lt;/p&gt;

&lt;p&gt;🤗 Hugging Face Repo&lt;br&gt;
&lt;/p&gt;
&lt;div class="crayons-card c-embed text-styles text-styles--secondary"&gt;
    &lt;div class="c-embed__content"&gt;
        &lt;div class="c-embed__cover"&gt;
          &lt;a href="https://huggingface.co/whyvickyyy/agent-forge-support-agent" class="c-link align-middle" rel="noopener noreferrer"&gt;
            &lt;img alt="" src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-thumbnails.huggingface.co%2Fsocial-thumbnails%2Fmodels%2Fwhyvickyyy%2Fagent-forge-support-agent.png" height="432" class="m-0" width="800"&gt;
          &lt;/a&gt;
        &lt;/div&gt;
      &lt;div class="c-embed__body"&gt;
        &lt;h2 class="fs-xl lh-tight"&gt;
          &lt;a href="https://huggingface.co/whyvickyyy/agent-forge-support-agent" rel="noopener noreferrer" class="c-link"&gt;
            whyvickyyy/agent-forge-support-agent · Hugging Face
          &lt;/a&gt;
        &lt;/h2&gt;
          &lt;p class="truncate-at-3"&gt;
            We’re on a journey to advance and democratize artificial intelligence through open source and open science.
          &lt;/p&gt;
        &lt;div class="color-secondary fs-s flex items-center"&gt;
          huggingface.co
        &lt;/div&gt;
      &lt;/div&gt;
    &lt;/div&gt;
&lt;/div&gt;


&lt;p&gt;💻 GitHub Repo&lt;br&gt;
&lt;/p&gt;
&lt;div class="ltag-github-readme-tag"&gt;
  &lt;div class="readme-overview"&gt;
    &lt;h2&gt;
      &lt;img src="https://assets.dev.to/assets/github-logo-5a155e1f9a670af7944dd5e12375bc76ed542ea80224905ecaf878b9157cdefc.svg" alt="GitHub logo"&gt;
      &lt;a href="https://github.com/0xvicky" rel="noopener noreferrer"&gt;
        0xvicky
      &lt;/a&gt; / &lt;a href="https://github.com/0xvicky/agent-forge" rel="noopener noreferrer"&gt;
        agent-forge
      &lt;/a&gt;
    &lt;/h2&gt;
    &lt;h3&gt;
      
    &lt;/h3&gt;
  &lt;/div&gt;
  &lt;div class="ltag-github-body"&gt;
    
&lt;div id="readme" class="md"&gt;&lt;div&gt;
&lt;a rel="noopener noreferrer nofollow" href="https://camo.githubusercontent.com/e4a5a906c5ae69a5e4985f7c6842b7204679c7f7a6e67b9f037d858f984e40f1/68747470733a2f2f726561646d652d747970696e672d7376672e64656d6f6c61622e636f6d3f666f6e743d466972612b436f6465267765696768743d3730302673697a653d33322670617573653d3130303026636f6c6f723d4646364233352663656e7465723d74727565267643656e7465723d747275652677696474683d363030266c696e65733d2545322539412539322545462542382538462b4167656e742d466f7267652b2545322539412539322545462542382538463b514c6f52412b46696e652d54756e696e672b2532362b4c4c4d4f70733b50726f64756374696f6e2d47726164652b47656e41492b506970656c696e65"&gt;&lt;img src="https://camo.githubusercontent.com/e4a5a906c5ae69a5e4985f7c6842b7204679c7f7a6e67b9f037d858f984e40f1/68747470733a2f2f726561646d652d747970696e672d7376672e64656d6f6c61622e636f6d3f666f6e743d466972612b436f6465267765696768743d3730302673697a653d33322670617573653d3130303026636f6c6f723d4646364233352663656e7465723d74727565267643656e7465723d747275652677696474683d363030266c696e65733d2545322539412539322545462542382538462b4167656e742d466f7267652b2545322539412539322545462542382538463b514c6f52412b46696e652d54756e696e672b2532362b4c4c4d4f70733b50726f64756374696f6e2d47726164652b47656e41492b506970656c696e65" alt="Typing SVG"&gt;&lt;/a&gt;
&lt;br&gt;
&lt;p&gt;
  &lt;strong&gt;Production-Oriented QLoRA Fine-Tuning &amp;amp; LLMOps Pipeline&lt;/strong&gt;&lt;br&gt;
  Fine-tuned TinyLlama using QLoRA, PEFT, TRL, and the HuggingFace ecosystem&lt;br&gt;
  with a deployment-oriented inference architecture
&lt;/p&gt;



&lt;p&gt;&lt;a rel="noopener noreferrer nofollow" href="https://camo.githubusercontent.com/26076ea24addfe85caabd092eaf87f92dee439db567137cb2b6f98a91eab24fa/68747470733a2f2f696d672e736869656c64732e696f2f62616467652f507974686f6e5f332e31302b2d3337373641423f7374796c653d666f722d7468652d6261646765266c6f676f3d707974686f6e266c6f676f436f6c6f723d7768697465"&gt;&lt;img src="https://camo.githubusercontent.com/26076ea24addfe85caabd092eaf87f92dee439db567137cb2b6f98a91eab24fa/68747470733a2f2f696d672e736869656c64732e696f2f62616467652f507974686f6e5f332e31302b2d3337373641423f7374796c653d666f722d7468652d6261646765266c6f676f3d707974686f6e266c6f676f436f6c6f723d7768697465" alt="Python"&gt;&lt;/a&gt;
&lt;a rel="noopener noreferrer nofollow" href="https://camo.githubusercontent.com/01f021238d5abc3e71e141b696b1fa4c687a820716ff3e7697092159e032bf30/68747470733a2f2f696d672e736869656c64732e696f2f62616467652f5079546f7263682d4545344332433f7374796c653d666f722d7468652d6261646765266c6f676f3d7079746f726368266c6f676f436f6c6f723d7768697465"&gt;&lt;img src="https://camo.githubusercontent.com/01f021238d5abc3e71e141b696b1fa4c687a820716ff3e7697092159e032bf30/68747470733a2f2f696d672e736869656c64732e696f2f62616467652f5079546f7263682d4545344332433f7374796c653d666f722d7468652d6261646765266c6f676f3d7079746f726368266c6f676f436f6c6f723d7768697465" alt="PyTorch"&gt;&lt;/a&gt;
&lt;a rel="noopener noreferrer nofollow" href="https://camo.githubusercontent.com/0dd5d7b14aa6e6e8c1837793e08962bcbee20b939e90b36238ec8f15ee365269/68747470733a2f2f696d672e736869656c64732e696f2f62616467652f48756767696e67466163652d4646443231453f7374796c653d666f722d7468652d6261646765266c6f676f3d68756767696e6766616365266c6f676f436f6c6f723d626c61636b"&gt;&lt;img src="https://camo.githubusercontent.com/0dd5d7b14aa6e6e8c1837793e08962bcbee20b939e90b36238ec8f15ee365269/68747470733a2f2f696d672e736869656c64732e696f2f62616467652f48756767696e67466163652d4646443231453f7374796c653d666f722d7468652d6261646765266c6f676f3d68756767696e6766616365266c6f676f436f6c6f723d626c61636b" alt="HuggingFace"&gt;&lt;/a&gt;
&lt;a rel="noopener noreferrer nofollow" href="https://camo.githubusercontent.com/f1089fe65f9ac27d38b03b1b728ac926659fcd1c88dc75269faa736fa95131d3/68747470733a2f2f696d672e736869656c64732e696f2f62616467652f466173744150492d3030393638383f7374796c653d666f722d7468652d6261646765266c6f676f3d66617374617069266c6f676f436f6c6f723d7768697465"&gt;&lt;img src="https://camo.githubusercontent.com/f1089fe65f9ac27d38b03b1b728ac926659fcd1c88dc75269faa736fa95131d3/68747470733a2f2f696d672e736869656c64732e696f2f62616467652f466173744150492d3030393638383f7374796c653d666f722d7468652d6261646765266c6f676f3d66617374617069266c6f676f436f6c6f723d7768697465" alt="FastAPI"&gt;&lt;/a&gt;
&lt;a rel="noopener noreferrer nofollow" href="https://camo.githubusercontent.com/85e3ff712bb08b8e5595b34ecddfd189a51b20f61988aa467a56c5da9a107dda/68747470733a2f2f696d672e736869656c64732e696f2f62616467652f446f636b65722d3234393645443f7374796c653d666f722d7468652d6261646765266c6f676f3d646f636b6572266c6f676f436f6c6f723d7768697465"&gt;&lt;img src="https://camo.githubusercontent.com/85e3ff712bb08b8e5595b34ecddfd189a51b20f61988aa467a56c5da9a107dda/68747470733a2f2f696d672e736869656c64732e696f2f62616467652f446f636b65722d3234393645443f7374796c653d666f722d7468652d6261646765266c6f676f3d646f636b6572266c6f676f436f6c6f723d7768697465" alt="Docker"&gt;&lt;/a&gt;
&lt;a rel="noopener noreferrer nofollow" href="https://camo.githubusercontent.com/2ea03384d026b00944c957e196bc0d4a33a332bc6775ca16404dbb370bdc8026/68747470733a2f2f696d672e736869656c64732e696f2f62616467652f52656469732d4443333832443f7374796c653d666f722d7468652d6261646765266c6f676f3d7265646973266c6f676f436f6c6f723d7768697465"&gt;&lt;img src="https://camo.githubusercontent.com/2ea03384d026b00944c957e196bc0d4a33a332bc6775ca16404dbb370bdc8026/68747470733a2f2f696d672e736869656c64732e696f2f62616467652f52656469732d4443333832443f7374796c653d666f722d7468652d6261646765266c6f676f3d7265646973266c6f676f436f6c6f723d7768697465" alt="Redis"&gt;&lt;/a&gt;
&lt;a rel="noopener noreferrer nofollow" href="https://camo.githubusercontent.com/b62f286675d2d75f131065d78a4b348fe281ca2cea7c8ebd462e64779d7e264b/68747470733a2f2f696d672e736869656c64732e696f2f62616467652f516472616e742d4442344342323f7374796c653d666f722d7468652d6261646765266c6f676f3d716472616e74266c6f676f436f6c6f723d7768697465"&gt;&lt;img src="https://camo.githubusercontent.com/b62f286675d2d75f131065d78a4b348fe281ca2cea7c8ebd462e64779d7e264b/68747470733a2f2f696d672e736869656c64732e696f2f62616467652f516472616e742d4442344342323f7374796c653d666f722d7468652d6261646765266c6f676f3d716472616e74266c6f676f436f6c6f723d7768697465" alt="Qdrant"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;br&gt;
&lt;p&gt;&lt;a href="https://github.com" rel="noopener noreferrer"&gt;&lt;img src="https://camo.githubusercontent.com/6422f17d2bc7b766563edef507a48cf14fdcfb6b59eec4a014b264717b4a54a4/68747470733a2f2f696d672e736869656c64732e696f2f6769746875622f73746172732f566976656b54796167692f6167656e742d666f7267653f7374796c653d736f6369616c" alt="Stars"&gt;&lt;/a&gt;
&lt;a href="https://opensource.org/licenses/MIT" rel="nofollow noopener noreferrer"&gt;&lt;img src="https://camo.githubusercontent.com/fdf2982b9f5d7489dcf44570e714e3a15fce6253e0cc6b5aa61a075aac2ff71b/68747470733a2f2f696d672e736869656c64732e696f2f62616467652f4c6963656e73652d4d49542d79656c6c6f772e737667" alt="License: MIT"&gt;&lt;/a&gt;&lt;/p&gt;
&lt;/div&gt;

&lt;div class="markdown-heading"&gt;
&lt;h2 class="heading-element"&gt;🚀 Overview&lt;/h2&gt;
&lt;/div&gt;
&lt;p&gt;&lt;strong&gt;Agent-Forge&lt;/strong&gt; is a production-oriented GenAI engineering project that implements the &lt;strong&gt;complete lifecycle of modern LLM adaptation and deployment&lt;/strong&gt; — from raw dataset to containerized inference server.&lt;/p&gt;
&lt;blockquote&gt;
&lt;p&gt;🎯 Goal: Deeply understand and implement end-to-end LLM fine-tuning &amp;amp; deployment infrastructure.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;p&gt;&lt;strong&gt;🧠 ML Engineering&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;QLoRA Fine-Tuning&lt;/li&gt;
&lt;li&gt;4-bit NF4 Quantization&lt;/li&gt;
&lt;li&gt;PEFT / LoRA Adapters&lt;/li&gt;
&lt;li&gt;Supervised Fine-Tuning (SFT)&lt;/li&gt;
&lt;li&gt;Conversational Dataset Engineering&lt;/li&gt;
&lt;/ul&gt;
&lt;/td&gt;
&lt;td&gt;
&lt;p&gt;&lt;strong&gt;⚙️ LLMOps &amp;amp; Infra&lt;/strong&gt;&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;FastAPI Inference Serving&lt;/li&gt;
&lt;li&gt;Redis Caching Architecture&lt;/li&gt;
&lt;li&gt;Qdrant RAG Integration&lt;/li&gt;
&lt;li&gt;Docker Containerization&lt;/li&gt;
&lt;li&gt;Deployment-Ready Pipelines&lt;/li&gt;
&lt;/ul&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;div class="markdown-heading"&gt;
&lt;h2 class="heading-element"&gt;🏗️ Architecture&lt;/h2&gt;
&lt;/div&gt;
&lt;div class="snippet-clipboard-content notranslate position-relative overflow-auto"&gt;
&lt;pre class="notranslate"&gt;&lt;code&gt;  HuggingFace Dataset
          │
          ▼
  Conversational Formatting
          │
          ▼
  Tokenizer (TinyLlama)
          │
          ▼
  4-bit NF4 Quantization  ◄──── BitsAndBytes
          │
          ▼
  QLoRA + PEFT Adapter Injection
          │
          ▼
  SFT Training  (SFTTrainer / TRL)
          │
          ▼
  Inference Evaluation (Before vs After)
          │
          ▼
  LoRA Adapter Saving
          │
          ▼
  FastAPI Inference Server
          │
          ▼
  Redis +&lt;/code&gt;&lt;/pre&gt;…&lt;/div&gt;&lt;/div&gt;
  &lt;/div&gt;
  &lt;div class="gh-btn-container"&gt;&lt;a class="gh-btn" href="https://github.com/0xvicky/agent-forge" rel="noopener noreferrer"&gt;View on GitHub&lt;/a&gt;&lt;/div&gt;
&lt;/div&gt;


&lt;p&gt;If you’re learning AI:&lt;br&gt;
don’t just use models.&lt;/p&gt;

&lt;p&gt;Learn how they’re built, trained, optimized, and deployed.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkx4m0lcdhczt1h5ve5vf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fkx4m0lcdhczt1h5ve5vf.png" alt=" " width="800" height="533"&gt;&lt;/a&gt;&lt;/p&gt;

</description>
      <category>python</category>
      <category>llm</category>
      <category>qlora</category>
      <category>mlops</category>
    </item>
  </channel>
</rss>
