<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Artyom Molchanov</title>
    <description>The latest articles on DEV Community by Artyom Molchanov (@__1bea7786c7).</description>
    <link>https://dev.to/__1bea7786c7</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1900221%2F6b7675e1-b0ff-4a1e-a816-8b367fa895dd.jpg</url>
      <title>DEV Community: Artyom Molchanov</title>
      <link>https://dev.to/__1bea7786c7</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/__1bea7786c7"/>
    <language>en</language>
    <item>
      <title>I Built a Support Ticket Classifier with a Fine-Tuned LLM for $10/month</title>
      <dc:creator>Artyom Molchanov</dc:creator>
      <pubDate>Mon, 26 Jan 2026 05:44:50 +0000</pubDate>
      <link>https://dev.to/__1bea7786c7/i-built-a-support-ticket-classifier-with-a-fine-tuned-llm-for-10month-323l</link>
      <guid>https://dev.to/__1bea7786c7/i-built-a-support-ticket-classifier-with-a-fine-tuned-llm-for-10month-323l</guid>
      <description>&lt;p&gt;I fine-tuned Qwen2.5-0.5B to classify telecom support tickets, quantized it to 350MB, and deployed it on a cheap VPS. Here's how.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://silentworks.tech/test" rel="noopener noreferrer"&gt;Live Demo&lt;/a&gt;&lt;/strong&gt; | &lt;strong&gt;&lt;a href="https://silentworks.tech/docs" rel="noopener noreferrer"&gt;API Docs&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The Problem
&lt;/h2&gt;

&lt;p&gt;Support teams waste hours manually routing tickets. A customer writes "my wifi is slow" — is it a technical issue? Billing? Should it go to L1 or L2 support?&lt;/p&gt;

&lt;p&gt;I built a classifier that outputs structured JSON with intent, category, urgency, sentiment, routing target, and extracted entities.&lt;/p&gt;
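
&lt;p&gt;For illustration, a hypothetical response for "my wifi is slow" — the field names here are mine and illustrative, not the service's guaranteed schema:&lt;/p&gt;

```python
import json

# Hypothetical example of the classifier's output for "my wifi is slow".
# Field names are illustrative, not the service's guaranteed schema.
ticket_classification = {
    "intent": "report_problem",
    "category": "technical",
    "urgency": "medium",
    "sentiment": "negative",
    "routing": "L1_support",
    "entities": {"service": "wifi", "symptom": "slow speed"},
    "is_relevant": True,
}
print(json.dumps(ticket_classification, indent=2))
```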

&lt;h2&gt;
  
  
  Why Not Just Use a Cloud API?
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Cost&lt;/strong&gt; — 50K requests/month through a cloud LLM API (OpenAI, Claude, Gemini) runs roughly $100-200; self-hosted is $10-20&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Privacy&lt;/strong&gt; — Some companies can't send customer data to external APIs&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Control&lt;/strong&gt; — Fine-tune for your specific domain&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  The Stack
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Qwen2.5-0.5B (fine-tuned) → GGUF Q4_K_M (350MB)&lt;/li&gt;
&lt;li&gt;llama-cpp-python for inference → FastAPI for API → nginx for reverse proxy&lt;/li&gt;
&lt;li&gt;Docker → VPS ($10/mo)&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Fine-Tuning
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Base Model
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Qwen2.5-0.5B-Instruct&lt;/strong&gt; — small enough for CPU inference, smart enough for classification.&lt;/p&gt;

&lt;h3&gt;
  
  
  Dataset
&lt;/h3&gt;

&lt;p&gt;~1000 synthetic support tickets with labels:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Technical issues (internet, TV, mobile)&lt;/li&gt;
&lt;li&gt;Billing inquiries&lt;/li&gt;
&lt;li&gt;Cancellation requests&lt;/li&gt;
&lt;li&gt;General questions&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Training
&lt;/h3&gt;

&lt;p&gt;Full fine-tuning on Google Colab T4 (free tier):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;3 epochs&lt;/li&gt;
&lt;li&gt;Learning rate: 2e-5&lt;/li&gt;
&lt;li&gt;bf16 training&lt;/li&gt;
&lt;li&gt;~40 minutes total&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Quantization
&lt;/h3&gt;

&lt;p&gt;Converted to GGUF and quantized to 4-bit using llama.cpp tools.&lt;/p&gt;

&lt;p&gt;Result: &lt;strong&gt;350MB&lt;/strong&gt; model that runs on CPU.&lt;/p&gt;

&lt;h2&gt;
  
  
  The API
&lt;/h2&gt;

&lt;p&gt;A simple FastAPI wrapper: load the GGUF model, accept POST requests, build the chat messages from a system prompt plus the user text, parse the JSON from the model output, and log each result to a database.&lt;/p&gt;
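
&lt;p&gt;One fiddly step is pulling JSON out of the model's raw output, since small models occasionally wrap it in extra text. A minimal sketch of that parsing step (the helper name is mine, not the actual service code):&lt;/p&gt;

```python
import json
import re

def extract_json(raw):
    """Return the first JSON object found in raw model output, else None.

    Small models sometimes add chatter around the JSON, so grab the
    outermost {...} span and fail soft if it doesn't parse.
    """
    match = re.search(r"\{.*\}", raw, re.DOTALL)
    if not match:
        return None
    try:
        return json.loads(match.group(0))
    except json.JSONDecodeError:
        return None

print(extract_json('Sure! {"category": "billing", "urgency": "low"}'))
```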

&lt;h2&gt;
  
  
  Filtering Garbage Input
&lt;/h2&gt;

&lt;p&gt;Users will send random stuff. Added a heuristic check:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Text too short (&amp;lt; 10 chars) → not relevant&lt;/li&gt;
&lt;li&gt;Contains telecom keywords (wifi, internet, bill, etc.) → relevant&lt;/li&gt;
&lt;li&gt;No keywords + category=unknown → not relevant&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Now irrelevant queries return &lt;code&gt;is_relevant: false&lt;/code&gt;.&lt;/p&gt;
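
&lt;p&gt;The three rules above can be sketched in a few lines (the keyword list is illustrative):&lt;/p&gt;

```python
# Illustrative keyword list -- the real one is domain-specific.
TELECOM_KEYWORDS = {"wifi", "internet", "bill", "router", "signal", "sim", "roaming"}

def is_relevant(text, category):
    """Heuristic pre-filter mirroring the three rules above."""
    if len(text.strip()) < 10:        # too short to mean anything
        return False
    if set(text.lower().split()) & TELECOM_KEYWORDS:
        return True                    # mentions a telecom keyword
    return category != "unknown"       # no keywords + unknown category -> reject

print(is_relevant("my wifi is slow", "technical"))  # True
print(is_relevant("asdf", "unknown"))               # False
```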

&lt;h2&gt;
  
  
  Deployment
&lt;/h2&gt;

&lt;h3&gt;
  
  
  VPS Setup
&lt;/h3&gt;

&lt;p&gt;Standard approach:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Install Docker&lt;/li&gt;
&lt;li&gt;Deploy with docker compose&lt;/li&gt;
&lt;li&gt;Add SSL with Certbot&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Total cost: &lt;strong&gt;~$10-15/month&lt;/strong&gt; for a 2 vCore, 4GB RAM VPS.&lt;/p&gt;

&lt;h2&gt;
  
  
  Performance
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Value&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Intent accuracy&lt;/td&gt;
&lt;td&gt;~92%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Category accuracy&lt;/td&gt;
&lt;td&gt;~89%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Inference (VPS CPU)&lt;/td&gt;
&lt;td&gt;3-5 sec&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Inference (M1 Mac)&lt;/td&gt;
&lt;td&gt;150-300ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Model size&lt;/td&gt;
&lt;td&gt;350 MB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Memory usage&lt;/td&gt;
&lt;td&gt;~700 MB&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Why 3-5 seconds is fine
&lt;/h3&gt;

&lt;p&gt;This isn't a chatbot. It's ticket classification that happens once when a ticket is created. You can also process async via a queue.&lt;/p&gt;
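
&lt;p&gt;A minimal sketch of the async option using a stdlib queue and one worker thread; the &lt;code&gt;classify&lt;/code&gt; stub stands in for the real model call:&lt;/p&gt;

```python
import queue
import threading

tickets = queue.Queue()
results = []

def classify(text):
    # Stand-in for the real model call.
    return {"category": "technical" if "wifi" in text else "general"}

def worker():
    while True:
        text = tickets.get()
        if text is None:          # sentinel: stop the worker
            break
        results.append(classify(text))
        tickets.task_done()

t = threading.Thread(target=worker)
t.start()
for msg in ["my wifi is slow", "what are your opening hours?"]:
    tickets.put(msg)
tickets.put(None)
t.join()
print(results)
```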

&lt;p&gt;For faster inference: use a modern CPU (AMD EPYC) or add a GPU.&lt;/p&gt;

&lt;h2&gt;
  
  
  When to Fine-Tune vs Use GPT API
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Fine-tune when:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Data privacy is required (on-premise)&lt;/li&gt;
&lt;li&gt;High volume of similar requests (&amp;gt;10K/month)&lt;/li&gt;
&lt;li&gt;Specific domain knowledge needed&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Use GPT API when:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Low volume&lt;/li&gt;
&lt;li&gt;Diverse tasks&lt;/li&gt;
&lt;li&gt;Need best quality regardless of cost&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Try It
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Demo:&lt;/strong&gt; &lt;a href="https://silentworks.tech/test" rel="noopener noreferrer"&gt;silentworks.tech&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;API docs:&lt;/strong&gt; &lt;a href="https://silentworks.tech/docs" rel="noopener noreferrer"&gt;silentworks.tech/docs&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;strong&gt;Want something similar for your company?&lt;/strong&gt; I build custom LLM solutions that run on your infrastructure. &lt;/p&gt;

&lt;p&gt;Reach out on &lt;a href="https://t.me/var_molchanov" rel="noopener noreferrer"&gt;Telegram&lt;/a&gt; — let's discuss your use case.&lt;/p&gt;

</description>
      <category>llm</category>
      <category>python</category>
      <category>fastapi</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>🧠✂️ Neural Network Lobotomy: Removed 7 Layers from an LLM — It Became 30% Faster</title>
      <dc:creator>Artyom Molchanov</dc:creator>
      <pubDate>Fri, 09 Jan 2026 17:46:24 +0000</pubDate>
      <link>https://dev.to/__1bea7786c7/neural-network-lobotomy-removed-7-layers-from-an-llm-it-became-30-faster-57i8</link>
      <guid>https://dev.to/__1bea7786c7/neural-network-lobotomy-removed-7-layers-from-an-llm-it-became-30-faster-57i8</guid>
      <description>&lt;p&gt;&lt;em&gt;An experiment in surgical layer removal from a language model&lt;/em&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;p&gt;I took TinyLlama (1.1B parameters, 22 layers) and started removing layers to test the hypothesis: &lt;strong&gt;modern LLMs are over-parameterized, and many layers do the same thing&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Results:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Removed 1 middle layer → &lt;strong&gt;+10% speed, -4% quality&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Removed 7 layers (safe ones) → &lt;strong&gt;+30% speed, -2.5% quality&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Removed first layer → model broke&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Unexpected:&lt;/strong&gt; Layer 2 is more important than Layer 0! (+6.67 vs +3.92 perplexity)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Tested all 22 layers individually. Here's what I found.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why Does This Matter?
&lt;/h2&gt;

&lt;p&gt;Startups spend millions of dollars on GPUs for LLM inference. OpenAI reportedly spends $700k per day on compute alone. Any optimization that speeds up the model without losing quality is direct cost savings.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Layer pruning&lt;/strong&gt; is one way to speed things up. The idea is simple:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Modern models have dozens of layers (GPT-4 supposedly 120+)&lt;/li&gt;
&lt;li&gt;Not all layers are equally useful&lt;/li&gt;
&lt;li&gt;Some can be removed, and the model barely notices&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The &lt;a href="https://arxiv.org/abs/2403.03853" rel="noopener noreferrer"&gt;ShortGPT (2024)&lt;/a&gt; paper showed that you can remove 25% of layers from LLaMA-2 with less than 5% quality loss. I decided to verify this in practice.&lt;/p&gt;




&lt;h2&gt;
  
  
  Experiment Setup
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Hardware:&lt;/strong&gt; MacBook Pro M4 Pro, 24GB RAM&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Model:&lt;/strong&gt; TinyLlama-1.1B-Chat-v1.0&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;1.1 billion parameters&lt;/li&gt;
&lt;li&gt;22 layers (decoder blocks)&lt;/li&gt;
&lt;li&gt;LLaMA architecture&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Metrics:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Perplexity&lt;/strong&gt; — how "surprised" the model is by text (lower = better)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tokens/second&lt;/strong&gt; — generation speed&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Generation quality&lt;/strong&gt; — subjective assessment of output text&lt;/li&gt;
&lt;/ul&gt;
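
&lt;p&gt;For reference, perplexity is the exponentiated mean negative log-probability the model assigns to the actual next tokens; a toy sketch:&lt;/p&gt;

```python
import math

def perplexity(token_probs):
    """exp of the mean negative log-probability assigned to each
    actual next token (lower = less 'surprised')."""
    nll = [-math.log(p) for p in token_probs]
    return math.exp(sum(nll) / len(nll))

print(perplexity([0.5, 0.5, 0.5]))  # ~2.0: as 'surprised' as a fair coin
print(perplexity([1.0, 1.0, 1.0]))  # 1.0: perfectly confident
```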

&lt;p&gt;&lt;strong&gt;Code:&lt;/strong&gt; PyTorch + HuggingFace Transformers. Removing a layer = literally removing it from &lt;code&gt;model.model.layers&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;remove_layers&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;layers_to_remove&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;original_layers&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;list&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;layers&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;new_layers&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="n"&gt;layer&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;layer&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;enumerate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;original_layers&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;layers_to_remove&lt;/span&gt;
    &lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;layers&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;nn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;ModuleList&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;new_layers&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Results
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Summary Table
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;What I Removed&lt;/th&gt;
&lt;th&gt;Perplexity&lt;/th&gt;
&lt;th&gt;Δ Quality&lt;/th&gt;
&lt;th&gt;Tokens/s&lt;/th&gt;
&lt;th&gt;Δ Speed&lt;/th&gt;
&lt;th&gt;Works?&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Nothing (baseline)&lt;/td&gt;
&lt;td&gt;1.82&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;59&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Middle layer (#11)&lt;/td&gt;
&lt;td&gt;1.89&lt;/td&gt;
&lt;td&gt;-4%&lt;/td&gt;
&lt;td&gt;64&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;+10%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3 middle layers (#10-12)&lt;/td&gt;
&lt;td&gt;2.24&lt;/td&gt;
&lt;td&gt;-23%&lt;/td&gt;
&lt;td&gt;66&lt;/td&gt;
&lt;td&gt;+12%&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;First layer (#0)&lt;/td&gt;
&lt;td&gt;5.74&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;-215%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;64&lt;/td&gt;
&lt;td&gt;+10%&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;7 safe layers&lt;/td&gt;
&lt;td&gt;~1.87&lt;/td&gt;
&lt;td&gt;~-2.5%&lt;/td&gt;
&lt;td&gt;~77&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;~30%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;em&gt;Note: timings averaged over 10 runs after 5 warmup runs, on the MPS backend&lt;/em&gt;&lt;/p&gt;
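
&lt;p&gt;The warmup-then-average protocol can be sketched generically; &lt;code&gt;benchmark&lt;/code&gt; is my name for it, and the timed lambda merely stands in for one generation call:&lt;/p&gt;

```python
import statistics
import time

def benchmark(fn, runs=10, warmup=5):
    """Time fn over several runs, discarding warmup iterations
    (MPS/GPU backends are noticeably slower on the first calls)."""
    for _ in range(warmup):
        fn()
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        timings.append(time.perf_counter() - start)
    return statistics.mean(timings), statistics.stdev(timings)

# Stand-in workload instead of a real model.generate() call.
mean_s, std_s = benchmark(lambda: sum(range(100_000)))
print(f"{mean_s * 1e3:.3f} ms ± {std_s * 1e3:.3f} ms")
```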

&lt;h3&gt;
  
  
  Key Discovery: Middle Layers Are Redundant
&lt;/h3&gt;

&lt;p&gt;Removing one layer from the middle of the model (layer #11 out of 22) gave:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;+10% generation speed&lt;/strong&gt; (59 → 64 tokens/sec)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Only -4% quality&lt;/strong&gt; (perplexity 1.82 → 1.89)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Removing 7 safe layers (3, 4, 5, 9, 10, 11, 12) can achieve &lt;strong&gt;~30% speedup&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Generation remained completely coherent:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Prompt:&lt;/strong&gt; "Once upon a time"&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Baseline:&lt;/strong&gt; &lt;em&gt;(not measured)&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;After removing layer #11:&lt;/strong&gt; "Once upon a time, I was a web developer. Today, I am a freelance web developer. I have worked for some of the most prestigious web..."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The model still generates coherent, grammatically correct text.&lt;/p&gt;

&lt;h3&gt;
  
  
  First Layer Is Sacred
&lt;/h3&gt;

&lt;p&gt;Here's what happened when I removed the first layer:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;After removing layer #0:&lt;/strong&gt; "Once upon a time and a time. Therefore, the therefore, the therefore. Therefore, the therefore, the therefore. Therefore, the..."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The model broke. Perplexity shot up from 1.82 to 5.74 (3x worse). Text became meaningless repetition.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why?&lt;/strong&gt; Early layers are responsible for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Basic attention patterns&lt;/li&gt;
&lt;li&gt;Positional encoding&lt;/li&gt;
&lt;li&gt;Fundamental understanding of language structure&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Without them, the model loses the ability to understand how words relate to each other.&lt;/p&gt;

&lt;h3&gt;
  
  
  Visualization: Importance of Each Layer
&lt;/h3&gt;

&lt;p&gt;I tested removing &lt;strong&gt;each layer individually&lt;/strong&gt; and measured quality degradation:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4wllfinxbw3c2kc9ms9w.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4wllfinxbw3c2kc9ms9w.png" alt=" " width="800" height="328"&gt;&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Layer  0:  ████████████████████████████████████████  +3.92  🔴 CRITICAL
Layer  1:  ██████████                               +0.43
Layer  2:  ████████████████████████████████████████████████████████████████████  +6.67  🔴 MOST IMPORTANT!
Layer  3:                                          +0.01  🟢 CAN REMOVE
Layer  4:  █                                       +0.06  🟢
Layer  5:                                          +0.04  🟢
Layer  6:  ██                                      +0.12
Layer  7:  ███████████████                         +0.74
Layer  8:  ██                                      +0.12
Layer  9:  █                                       +0.07  🟢
Layer 10:  █                                       +0.05  🟢
Layer 11:  █                                       +0.07  🟢
Layer 12:  ██                                      +0.09  🟢
Layer 13:  ███                                     +0.14
Layer 14:  ███████████                             +0.53
Layer 15:  ████████████████████████████████████    +1.81  🟠 IMPORTANT
Layer 16:  █████                                   +0.27
Layer 17:  ██                                      +0.12
Layer 18:  ████                                    +0.18
Layer 19:  ████                                    +0.19
Layer 20:  ██████                                  +0.28
Layer 21:  █████████                               +0.47
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Unexpected discovery:&lt;/strong&gt; Layer 2 is more important than Layer 0! This is the layer that forms key language patterns.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Safe-to-remove layers:&lt;/strong&gt; 3, 4, 5, 9, 10, 11, 12 — each increases perplexity by less than 0.1.&lt;/p&gt;




&lt;h2&gt;
  
  
  Interpretation: Why This Distribution?
&lt;/h2&gt;

&lt;p&gt;Results revealed &lt;strong&gt;three critical zones&lt;/strong&gt;:&lt;/p&gt;

&lt;h3&gt;
  
  
  🔴 Critical Zone 1: Layer 2 (PPL +6.67)
&lt;/h3&gt;

&lt;p&gt;The most important layer in the model! This is unexpected — it's usually assumed that Layer 0 is most important.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Hypothesis:&lt;/strong&gt; Layer 2 is where key attention patterns are formed. The first two layers create a "raw" representation, and Layer 2 "crystallizes" it into a structure that all other layers use.&lt;/p&gt;

&lt;h3&gt;
  
  
  🔴 Critical Zone 2: Layer 0 (PPL +3.92)
&lt;/h3&gt;

&lt;p&gt;The first layer is important for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Processing positional encoding&lt;/li&gt;
&lt;li&gt;Basic token understanding&lt;/li&gt;
&lt;li&gt;Initializing attention patterns&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  🟠 Critical Zone 3: Layer 15 (PPL +1.81)
&lt;/h3&gt;

&lt;p&gt;An unexpected spike in the late middle layers. This may be where the model "switches" from general semantics to task-specific processing.&lt;/p&gt;

&lt;h3&gt;
  
  
  🟢 Safe Zone: Layers 3-5, 9-12
&lt;/h3&gt;

&lt;p&gt;These layers show minimal impact (PPL increase &amp;lt; 0.1). They perform &lt;strong&gt;redundant computations&lt;/strong&gt; — repeating what neighboring layers already did.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Practical takeaway:&lt;/strong&gt; you can remove up to &lt;strong&gt;7 layers&lt;/strong&gt; (3, 4, 5, 9, 10, 11, 12) with only ~2.5% quality loss and get a &lt;strong&gt;~30% speedup&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;The &lt;a href="https://arxiv.org/abs/2403.03853" rel="noopener noreferrer"&gt;ShortGPT&lt;/a&gt; paper introduced the &lt;strong&gt;Block Influence (BI)&lt;/strong&gt; metric — my results fully align with their findings: middle layers show low BI and can be safely removed.&lt;/p&gt;




&lt;h2&gt;
  
  
  Practical Takeaways
&lt;/h2&gt;

&lt;h3&gt;
  
  
  For Engineers
&lt;/h3&gt;

&lt;p&gt;Based on per-layer analysis — &lt;strong&gt;optimal combinations for removal:&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Aggressiveness&lt;/th&gt;
&lt;th&gt;Remove Layers&lt;/th&gt;
&lt;th&gt;Expected Loss&lt;/th&gt;
&lt;th&gt;Speedup&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Minimal&lt;/td&gt;
&lt;td&gt;{3}&lt;/td&gt;
&lt;td&gt;~0.4%&lt;/td&gt;
&lt;td&gt;~5%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Moderate&lt;/td&gt;
&lt;td&gt;{3, 5, 10, 11}&lt;/td&gt;
&lt;td&gt;~1%&lt;/td&gt;
&lt;td&gt;~18%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Aggressive&lt;/td&gt;
&lt;td&gt;{3, 4, 5, 9, 10, 11, 12}&lt;/td&gt;
&lt;td&gt;~2.5%&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;~32%&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Optimal strategy: remove least important layers
&lt;/span&gt;&lt;span class="n"&gt;safe_layers_to_remove&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;9&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;11&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;12&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;  &lt;span class="c1"&gt;# PPL increase &amp;lt; 0.1 each
&lt;/span&gt;&lt;span class="nf"&gt;remove_layers&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;safe_layers_to_remove&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Result: 22 -&amp;gt; 15 layers, ~32% speedup, ~2.5% quality loss
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Important:&lt;/strong&gt; never remove layers 0, 2, 15 — these are critical points.&lt;/p&gt;

&lt;h3&gt;
  
  
  For Researchers
&lt;/h3&gt;

&lt;p&gt;This is an active research area:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;ShortGPT (2024)&lt;/strong&gt; — removing entire layers&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;FinerCut (2024)&lt;/strong&gt; — removing components within layers&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;SliceGPT (2024)&lt;/strong&gt; — removing rows/columns from weight matrices&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;LinearPatch (2025)&lt;/strong&gt; — recovering 94% quality after pruning via Hadamard transform (&lt;a href="https://arxiv.org/abs/2505.24680" rel="noopener noreferrer"&gt;arxiv&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;MRP (2025)&lt;/strong&gt; — Maximum Redundancy Pruning, adaptive removal of most redundant layers (&lt;a href="https://arxiv.org/abs/2503.18377" rel="noopener noreferrer"&gt;arxiv&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;CLP (2025)&lt;/strong&gt; — automatic search for optimal segments to remove (&lt;a href="https://arxiv.org/abs/2510.23652" rel="noopener noreferrer"&gt;arxiv&lt;/a&gt;)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Combining with quantization (INT4/INT8) can give even greater speedup.&lt;/p&gt;

&lt;h3&gt;
  
  
  For Business
&lt;/h3&gt;

&lt;p&gt;If you're paying $10k/month for inference GPUs, layer pruning can save $2-3k without noticeable quality loss. At OpenAI's scale, this is millions of dollars.&lt;/p&gt;




&lt;h2&gt;
  
  
  Experiment Limitations
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Small model&lt;/strong&gt; — TinyLlama 1.1B, results may differ for 7B/70B models&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Simple metric&lt;/strong&gt; — perplexity doesn't capture all quality aspects&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No fine-tuning&lt;/strong&gt; — the pruned model could likely be fine-tuned afterwards to recover quality&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Single dataset&lt;/strong&gt; — need to test on different tasks&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Measurement variability&lt;/strong&gt; — speed on MPS backend has ±10% variance, important to do many runs&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Chain-of-thought degradation&lt;/strong&gt; — recent research (&lt;a href="https://arxiv.org/abs/2510.22228" rel="noopener noreferrer"&gt;arxiv 2510.22228&lt;/a&gt;) showed that even removing 1-2 layers can break multi-step reasoning ability, while simple tasks work fine&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  Code
&lt;/h2&gt;

&lt;p&gt;All experiment code is available on GitLab: &lt;a href="https://gitlab.com/molchanov.artem.1994/lobotomyllm" rel="noopener noreferrer"&gt;https://gitlab.com/molchanov.artem.1994/lobotomyllm&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://gitlab.com/molchanov.artem.1994/lobotomyllm
&lt;span class="nb"&gt;cd &lt;/span&gt;lobotomyLlm
python &lt;span class="nt"&gt;-m&lt;/span&gt; venv venv &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nb"&gt;source &lt;/span&gt;venv/bin/activate
pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-r&lt;/span&gt; requirements.txt
python experiments/run_ablation.py &lt;span class="nt"&gt;--experiment&lt;/span&gt; quick
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Hypothesis confirmed:&lt;/strong&gt; modern LLMs are over-parameterized; roughly 30% of layers can be removed with &amp;lt;3% quality loss.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key insights:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Layer 2 is the most important&lt;/strong&gt; (unexpectedly more important than Layer 0)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Layers 3-5, 9-12 are redundant&lt;/strong&gt; (can be removed almost for free)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Layer 15 is a hidden critical layer&lt;/strong&gt; in the late part of the network&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Practical result:&lt;/strong&gt; removing 7 layers (22→15) gives ~32% speedup with ~2.5% quality loss.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Next steps:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Run on Llama-3 8B for more convincing results&lt;/li&gt;
&lt;li&gt;Try pruning + quantization combination&lt;/li&gt;
&lt;li&gt;Investigate what critical layers (Layer 2, Layer 15) actually "know"&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;em&gt;If you liked this — subscribe, star the GitLab repo, share with colleagues.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Questions and suggestions — in the comments or DM.&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Tags:&lt;/strong&gt; #MachineLearning #LLM #Optimization #PyTorch #NLP #DeepLearning&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>ai</category>
    </item>
    <item>
      <title>How I built a lag-free 10GB Log Viewer for VS Code using Rust &amp; Memory-Mapping</title>
      <dc:creator>Artyom Molchanov</dc:creator>
      <pubDate>Wed, 10 Dec 2025 19:19:21 +0000</pubDate>
      <link>https://dev.to/__1bea7786c7/how-i-built-a-lag-free-10gb-log-viewer-for-vs-code-using-rust-memory-mapping-ih4</link>
      <guid>https://dev.to/__1bea7786c7/how-i-built-a-lag-free-10gb-log-viewer-for-vs-code-using-rust-memory-mapping-ih4</guid>
      <description>&lt;p&gt;We’ve all been there. You need to check a production log. You download the server.log, double-click it in VS Code, and... &lt;strong&gt;freeze&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;VS Code is built on Electron. Loading a multi-gigabyte text file into the DOM is a death sentence for RAM and UI responsiveness. The standard solution is to close the editor and go back to less or tail in the terminal.&lt;/p&gt;

&lt;p&gt;But I wanted the best of both worlds: the raw speed of CLI tools and the comfort of the VS Code UI (regex search, copy-paste, highlighting).&lt;/p&gt;

&lt;p&gt;So, I built a custom extension with a &lt;strong&gt;Rust sidecar&lt;/strong&gt; to solve this. Here is a deep dive into how it works under the hood.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Architecture: Sidecar Pattern
&lt;/h2&gt;

&lt;p&gt;The extension consists of two parts communicating via Stdin/Stdout (JSON IPC):&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Frontend (TypeScript/VS Code Webview):&lt;/strong&gt; Handles the UI, virtual scrolling, and rendering.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Backend (Rust):&lt;/strong&gt; Handles file I/O, indexing, and searching.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The goal was simple: &lt;strong&gt;VS Code never holds the full file in memory.&lt;/strong&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  🦀 The Backend: Rust &amp;amp; Memory-Mapping
&lt;/h3&gt;

&lt;p&gt;The core logic resides in a binary called log-core.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Memory-Mapped I/O&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Instead of reading the file into a buffer, I used memmap2. This maps the file on the disk directly into the process's virtual address space.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Benefit:&lt;/strong&gt; The OS handles paging. Opening a 10GB file takes almost &lt;strong&gt;zero RAM&lt;/strong&gt; allocation for the content itself.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Speed:&lt;/strong&gt; Accessing a byte at offset 1,000,000 is as fast as accessing an array index.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;2. The Line Index&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
When a file opens, the backend performs a single O(n) pass to build a Vec of line offsets.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;offsets[0] = 0&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;offsets[1] = (position of first \n) + 1&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;...and so on.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This allows the backend to implement read_lines(start_line, count) efficiently. It calculates the byte range from the index, slices the memory-mapped file, and converts it using String::from_utf8_lossy (handling potential encoding issues gracefully).&lt;/p&gt;
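
&lt;p&gt;The same indexing idea, illustrated with Python's &lt;code&gt;mmap&lt;/code&gt; for brevity (the real backend is Rust with memmap2):&lt;/p&gt;

```python
import mmap
import tempfile

# Build a small demo file, then index line-start offsets over a memory map.
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"first line\nsecond line\nthird line\n")
    path = f.name

f = open(path, "rb")
mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)

# One O(n) pass to record where each line starts.
offsets = [0]
pos = mm.find(b"\n")
while pos != -1:
    offsets.append(pos + 1)
    pos = mm.find(b"\n", pos + 1)

def read_lines(start_line, count):
    """Slice the mapping by byte range -- never reads the whole file."""
    end = offsets[min(start_line + count, len(offsets) - 1)]
    return mm[offsets[start_line]:end].decode("utf-8", errors="replace")

print(read_lines(1, 2))  # second and third lines
```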

&lt;p&gt;&lt;strong&gt;3. Search &amp;amp; Regex&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Search happens entirely in Rust. I use the regex crate for patterns or standard string matching for plain text. To prevent locking up the CPU on massive files, the search creates a stream of results with a hard limit (e.g., stopping after ~10k matches to keep the UI responsive).&lt;/p&gt;




&lt;h2&gt;
  
  
  ⚡ The Frontend: Virtualization &amp;amp; Limits
&lt;/h2&gt;

&lt;p&gt;The frontend is a VS Code Webview. The biggest challenge here isn't just "showing text," it's &lt;strong&gt;browser limits&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. The 10 Million Pixel Problem&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Browsers have a hard limit on the height of a DOM element (often around 10-30M pixels). A 10GB log file could easily exceed 100M pixels in height.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Solution:&lt;/strong&gt; I implemented a coordinate scaling factor. The scrollbar you see is "fake" (virtualized). We calculate a virtualHeight that fits within browser limits, and then map the scroll position back to the realLineNumber.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;2. Virtual Scrolling &amp;amp; Buffering&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
The Webview maintains a state map, loadedLines, keyed by line number.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;We only render the visible lines + a small buffer (BUFFER_LINES) into the DOM.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Missing ranges are calculated and requested from the Rust backend in chunks (CHUNK_SIZE).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Cache Eviction:&lt;/strong&gt; As you scroll away, lines far from the viewport are removed from the map to keep memory usage flat.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;3. The "Filter Paradox"&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Implementing filtering (e.g., "Show only ERROR") was tricky.&lt;br&gt;&lt;br&gt;
If I filter a 1M line file and find 500 errors, I need to show a continuous list of 500 lines, BUT I still need to know their &lt;strong&gt;original&lt;/strong&gt; line numbers for debugging.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Logic:&lt;/strong&gt; The frontend builds a mapping: ViewIndex (0..499) → ActualLine (e.g., 504, 1200, 9000).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;UI:&lt;/strong&gt; The main view scrolls based on the ViewIndex, but the Gutter (line numbers) renders the ActualLine.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;This means Ctrl+G (Go to Line) has to solve a reverse lookup: "Where is line 1200 in the filtered view?"&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  🔄 Follow Mode (Tail -f)
&lt;/h2&gt;

&lt;p&gt;I wanted to replace tail -f.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Polling:&lt;/strong&gt; The frontend triggers a refreshFile command every 500ms.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Backend:&lt;/strong&gt; Rust checks the file size. If it grew, it re-maps the file (cheap operation) and scans only the new bytes for new line offsets.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Frontend:&lt;/strong&gt; If the user is at the bottom (or in "Follow Mode"), the view auto-scrolls to the new lines. If the user manually scrolls up, Follow Mode pauses automatically.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  📦 Build &amp;amp; Cross-Compilation
&lt;/h2&gt;

&lt;p&gt;Since this relies on a native binary, I couldn't just publish JS code.&lt;br&gt;&lt;br&gt;
I set up a build pipeline using GitHub Actions to cross-compile the Rust binary for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;darwin-x64 &amp;amp; darwin-arm64 (Apple Silicon)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;linux-x64&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;win32-x64&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
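
&lt;p&gt;A typical way to do this in GitHub Actions is a build matrix. The sketch below is illustrative only; the job names, runner images, and steps are assumptions, not the project's actual workflow:&lt;/p&gt;

```yaml
# Illustrative cross-compilation matrix (not the project's real pipeline).
jobs:
  build:
    strategy:
      matrix:
        include:
          - os: macos-latest
            target: x86_64-apple-darwin
          - os: macos-latest
            target: aarch64-apple-darwin
          - os: ubuntu-latest
            target: x86_64-unknown-linux-gnu
          - os: windows-latest
            target: x86_64-pc-windows-msvc
    runs-on: ${{ matrix.os }}
    steps:
      - uses: actions/checkout@v4
      - run: rustup target add ${{ matrix.target }}
      - run: cargo build --release --target ${{ matrix.target }}
```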

&lt;p&gt;The VS Code Marketplace supports platform-specific builds. When a user installs the extension, VS Code automatically fetches the correct VSIX for their OS. The resulting package is surprisingly small (~0.8MB), with the compressed binary taking up most of that.&lt;/p&gt;




&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;This project started as a way to fix my own frustration, but it turned into a great lesson on how much performance you can squeeze out of VS Code when you offload heavy I/O to a system language like Rust.&lt;/p&gt;

&lt;p&gt;If you deal with massive logs, CSVs, or data dumps, give it a try.&lt;/p&gt;

&lt;p&gt;👉 &lt;strong&gt;VS Code Marketplace:&lt;/strong&gt; &lt;a href="https://marketplace.visualstudio.com/items?itemName=molchanovartem1994.log-analyzer-pro" rel="noopener noreferrer"&gt;https://marketplace.visualstudio.com/items?itemName=molchanovartem1994.log-analyzer-pro&lt;/a&gt;&lt;br&gt;&lt;br&gt;
👉 &lt;strong&gt;GitLab Repo:&lt;/strong&gt; &lt;a href="https://gitlab.com/molchanov.artem.1994/log-analyzer-pro" rel="noopener noreferrer"&gt;https://gitlab.com/molchanov.artem.1994/log-analyzer-pro&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Let me know if you have any questions about the memmap implementation or the Webview message passing!&lt;/p&gt;

</description>
      <category>vscode</category>
      <category>rust</category>
      <category>performance</category>
      <category>javascript</category>
    </item>
  </channel>
</rss>
