<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Artyom Molchanov</title>
    <description>The latest articles on DEV Community by Artyom Molchanov (@__1bea7786c7).</description>
    <link>https://dev.to/__1bea7786c7</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1900221%2F6b7675e1-b0ff-4a1e-a816-8b367fa895dd.jpg</url>
      <title>DEV Community: Artyom Molchanov</title>
      <link>https://dev.to/__1bea7786c7</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/__1bea7786c7"/>
    <language>en</language>
    <item>
      <title>I Built a Support Ticket Classifier with a Fine-Tuned LLM for $10/month</title>
      <dc:creator>Artyom Molchanov</dc:creator>
      <pubDate>Mon, 26 Jan 2026 05:44:50 +0000</pubDate>
      <link>https://dev.to/__1bea7786c7/i-built-a-support-ticket-classifier-with-a-fine-tuned-llm-for-10month-323l</link>
      <guid>https://dev.to/__1bea7786c7/i-built-a-support-ticket-classifier-with-a-fine-tuned-llm-for-10month-323l</guid>
      <description>&lt;p&gt;I fine-tuned Qwen2.5-0.5B to classify telecom support tickets, quantized it to 350MB, and deployed it on a cheap VPS. Here's how.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://silentworks.tech/test" rel="noopener noreferrer"&gt;Live Demo&lt;/a&gt;&lt;/strong&gt; | &lt;strong&gt;&lt;a href="https://silentworks.tech/docs" rel="noopener noreferrer"&gt;API Docs&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  The Problem
&lt;/h2&gt;

&lt;p&gt;Support teams waste hours manually routing tickets. A customer writes "my wifi is slow" — is it a technical issue? Billing? Should it go to L1 or L2 support?&lt;/p&gt;

&lt;p&gt;I built a classifier that outputs structured JSON with intent, category, urgency, sentiment, routing target, and extracted entities.&lt;/p&gt;
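
&lt;p&gt;For illustration, a hypothetical response for "my wifi is slow" — the field names here are mine and illustrative, not the service's guaranteed schema:&lt;/p&gt;

```python
import json

# Hypothetical example of the classifier's output for "my wifi is slow".
# Field names are illustrative, not the service's guaranteed schema.
ticket_classification = {
    "intent": "report_problem",
    "category": "technical",
    "urgency": "medium",
    "sentiment": "negative",
    "routing": "L1_support",
    "entities": {"service": "wifi", "symptom": "slow speed"},
    "is_relevant": True,
}
print(json.dumps(ticket_classification, indent=2))
```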

&lt;h2&gt;
  
  
  Why Not Just Use a Cloud API?
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Cost&lt;/strong&gt; — 50K requests/month through a cloud LLM API (OpenAI, Claude, Gemini) runs roughly $100-200; self-hosted is $10-20&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Privacy&lt;/strong&gt; — Some companies can't send customer data to external APIs&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Control&lt;/strong&gt; — Fine-tune for your specific domain&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  The Stack
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Qwen2.5-0.5B (fine-tuned) → GGUF Q4_K_M (350MB)&lt;/li&gt;
&lt;li&gt;llama-cpp-python for inference → FastAPI for API → nginx for reverse proxy&lt;/li&gt;
&lt;li&gt;Docker → VPS ($10/mo)&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Fine-Tuning
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Base Model
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Qwen2.5-0.5B-Instruct&lt;/strong&gt; — small enough for CPU inference, smart enough for classification.&lt;/p&gt;

&lt;h3&gt;
  
  
  Dataset
&lt;/h3&gt;

&lt;p&gt;~1000 synthetic support tickets with labels:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Technical issues (internet, TV, mobile)&lt;/li&gt;
&lt;li&gt;Billing inquiries&lt;/li&gt;
&lt;li&gt;Cancellation requests&lt;/li&gt;
&lt;li&gt;General questions&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Training
&lt;/h3&gt;

&lt;p&gt;Full fine-tuning on Google Colab T4 (free tier):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;3 epochs&lt;/li&gt;
&lt;li&gt;Learning rate: 2e-5&lt;/li&gt;
&lt;li&gt;bf16 training&lt;/li&gt;
&lt;li&gt;~40 minutes total&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Quantization
&lt;/h3&gt;

&lt;p&gt;Converted to GGUF and quantized to 4-bit using llama.cpp tools.&lt;/p&gt;

&lt;p&gt;Result: &lt;strong&gt;350MB&lt;/strong&gt; model that runs on CPU.&lt;/p&gt;

&lt;h2&gt;
  
  
  The API
&lt;/h2&gt;

&lt;p&gt;A simple FastAPI wrapper: load the GGUF model, accept POST requests, build the chat messages from a system prompt plus the user text, parse the JSON from the model output, and log each result to a database.&lt;/p&gt;
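
&lt;p&gt;One fiddly step is pulling JSON out of the model's raw output, since small models occasionally wrap it in extra text. A minimal sketch of that parsing step (the helper name is mine, not the actual service code):&lt;/p&gt;

```python
import json
import re

def extract_json(raw):
    """Return the first JSON object found in raw model output, else None.

    Small models sometimes add chatter around the JSON, so grab the
    outermost {...} span and fail soft if it doesn't parse.
    """
    match = re.search(r"\{.*\}", raw, re.DOTALL)
    if not match:
        return None
    try:
        return json.loads(match.group(0))
    except json.JSONDecodeError:
        return None

print(extract_json('Sure! {"category": "billing", "urgency": "low"}'))
```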

&lt;h2&gt;
  
  
  Filtering Garbage Input
&lt;/h2&gt;

&lt;p&gt;Users will send random stuff. Added a heuristic check:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Text too short (&amp;lt; 10 chars) → not relevant&lt;/li&gt;
&lt;li&gt;Contains telecom keywords (wifi, internet, bill, etc.) → relevant&lt;/li&gt;
&lt;li&gt;No keywords + category=unknown → not relevant&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Now irrelevant queries return &lt;code&gt;is_relevant: false&lt;/code&gt;.&lt;/p&gt;
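
&lt;p&gt;The three rules above can be sketched in a few lines (the keyword list is illustrative):&lt;/p&gt;

```python
# Illustrative keyword list -- the real one is domain-specific.
TELECOM_KEYWORDS = {"wifi", "internet", "bill", "router", "signal", "sim", "roaming"}

def is_relevant(text, category):
    """Heuristic pre-filter mirroring the three rules above."""
    if len(text.strip()) < 10:        # too short to mean anything
        return False
    if set(text.lower().split()) & TELECOM_KEYWORDS:
        return True                    # mentions a telecom keyword
    return category != "unknown"       # no keywords + unknown category -> reject

print(is_relevant("my wifi is slow", "technical"))  # True
print(is_relevant("asdf", "unknown"))               # False
```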

&lt;h2&gt;
  
  
  Deployment
&lt;/h2&gt;

&lt;h3&gt;
  
  
  VPS Setup
&lt;/h3&gt;

&lt;p&gt;Standard approach:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Install Docker&lt;/li&gt;
&lt;li&gt;Deploy with docker compose&lt;/li&gt;
&lt;li&gt;Add SSL with Certbot&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Total cost: &lt;strong&gt;~$10-15/month&lt;/strong&gt; for a 2 vCore, 4GB RAM VPS.&lt;/p&gt;

&lt;h2&gt;
  
  
  Performance
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Value&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Intent accuracy&lt;/td&gt;
&lt;td&gt;~92%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Category accuracy&lt;/td&gt;
&lt;td&gt;~89%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Inference (VPS CPU)&lt;/td&gt;
&lt;td&gt;3-5 sec&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Inference (M1 Mac)&lt;/td&gt;
&lt;td&gt;150-300ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Model size&lt;/td&gt;
&lt;td&gt;350 MB&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Memory usage&lt;/td&gt;
&lt;td&gt;~700 MB&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
  
  
  Why 3-5 seconds is fine
&lt;/h3&gt;

&lt;p&gt;This isn't a chatbot. It's ticket classification that happens once when a ticket is created. You can also process async via a queue.&lt;/p&gt;
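
&lt;p&gt;A minimal sketch of the async option using a stdlib queue and one worker thread; the &lt;code&gt;classify&lt;/code&gt; stub stands in for the real model call:&lt;/p&gt;

```python
import queue
import threading

tickets = queue.Queue()
results = []

def classify(text):
    # Stand-in for the real model call.
    return {"category": "technical" if "wifi" in text else "general"}

def worker():
    while True:
        text = tickets.get()
        if text is None:          # sentinel: stop the worker
            break
        results.append(classify(text))
        tickets.task_done()

t = threading.Thread(target=worker)
t.start()
for msg in ["my wifi is slow", "what are your opening hours?"]:
    tickets.put(msg)
tickets.put(None)
t.join()
print(results)
```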

&lt;p&gt;For faster inference: use a modern CPU (AMD EPYC) or add a GPU.&lt;/p&gt;

&lt;h2&gt;
  
  
  When to Fine-Tune vs Use GPT API
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Fine-tune when:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Data privacy is required (on-premise)&lt;/li&gt;
&lt;li&gt;High volume of similar requests (&amp;gt;10K/month)&lt;/li&gt;
&lt;li&gt;Specific domain knowledge needed&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Use GPT API when:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Low volume&lt;/li&gt;
&lt;li&gt;Diverse tasks&lt;/li&gt;
&lt;li&gt;Need best quality regardless of cost&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Try It
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Demo:&lt;/strong&gt; &lt;a href="https://silentworks.tech/test" rel="noopener noreferrer"&gt;silentworks.tech&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;API docs:&lt;/strong&gt; &lt;a href="https://silentworks.tech/docs" rel="noopener noreferrer"&gt;silentworks.tech/docs&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;strong&gt;Want something similar for your company?&lt;/strong&gt; I build custom LLM solutions that run on your infrastructure. &lt;/p&gt;

&lt;p&gt;Reach out on &lt;a href="https://t.me/var_molchanov" rel="noopener noreferrer"&gt;Telegram&lt;/a&gt; — let's discuss your use case.&lt;/p&gt;

</description>
      <category>llm</category>
      <category>python</category>
      <category>fastapi</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>🧠✂️ Neural Network Lobotomy: Removed 7 Layers from an LLM — It Became 30% Faster</title>
      <dc:creator>Artyom Molchanov</dc:creator>
      <pubDate>Fri, 09 Jan 2026 17:46:24 +0000</pubDate>
      <link>https://dev.to/__1bea7786c7/neural-network-lobotomy-removed-7-layers-from-an-llm-it-became-30-faster-57i8</link>
      <guid>https://dev.to/__1bea7786c7/neural-network-lobotomy-removed-7-layers-from-an-llm-it-became-30-faster-57i8</guid>
      <description>&lt;p&gt;&lt;em&gt;An experiment in surgical layer removal from a language model&lt;/em&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  TL;DR
&lt;/h2&gt;

&lt;p&gt;I took TinyLlama (1.1B parameters, 22 layers) and started removing layers to test the hypothesis: &lt;strong&gt;modern LLMs are over-parameterized, and many layers do the same thing&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Results:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Removed 1 middle layer → &lt;strong&gt;+10% speed, -4% quality&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Removed 7 layers (safe ones) → &lt;strong&gt;+30% speed, -2.5% quality&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Removed first layer → model broke&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Unexpected:&lt;/strong&gt; Layer 2 is more important than Layer 0! (+6.67 vs +3.92 perplexity)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Tested all 22 layers individually. Here's what I found.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why Does This Matter?
&lt;/h2&gt;

&lt;p&gt;Startups spend millions of dollars on GPUs for LLM inference. OpenAI reportedly spends $700k per day on compute alone. Any optimization that speeds up the model without losing quality is direct cost savings.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Layer pruning&lt;/strong&gt; is one way to speed things up. The idea is simple:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Modern models have dozens of layers (GPT-4 supposedly 120+)&lt;/li&gt;
&lt;li&gt;Not all layers are equally useful&lt;/li&gt;
&lt;li&gt;Some can be removed, and the model barely notices&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The &lt;a href="https://arxiv.org/abs/2403.03853" rel="noopener noreferrer"&gt;ShortGPT (2024)&lt;/a&gt; paper showed that you can remove 25% of layers from LLaMA-2 with less than 5% quality loss. I decided to verify this in practice.&lt;/p&gt;




&lt;h2&gt;
  
  
  Experiment Setup
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Hardware:&lt;/strong&gt; MacBook Pro M4 Pro, 24GB RAM&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Model:&lt;/strong&gt; TinyLlama-1.1B-Chat-v1.0&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;1.1 billion parameters&lt;/li&gt;
&lt;li&gt;22 layers (decoder blocks)&lt;/li&gt;
&lt;li&gt;LLaMA architecture&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Metrics:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Perplexity&lt;/strong&gt; — how "surprised" the model is by text (lower = better)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Tokens/second&lt;/strong&gt; — generation speed&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Generation quality&lt;/strong&gt; — subjective assessment of output text&lt;/li&gt;
&lt;/ul&gt;
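
&lt;p&gt;For reference, perplexity is the exponentiated mean negative log-probability the model assigns to the actual next tokens; a toy sketch:&lt;/p&gt;

```python
import math

def perplexity(token_probs):
    """exp of the mean negative log-probability assigned to each
    actual next token (lower = less 'surprised')."""
    nll = [-math.log(p) for p in token_probs]
    return math.exp(sum(nll) / len(nll))

print(perplexity([0.5, 0.5, 0.5]))  # ~2.0: as 'surprised' as a fair coin
print(perplexity([1.0, 1.0, 1.0]))  # 1.0: perfectly confident
```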

&lt;p&gt;&lt;strong&gt;Code:&lt;/strong&gt; PyTorch + HuggingFace Transformers. Removing a layer = literally removing it from &lt;code&gt;model.model.layers&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;remove_layers&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;layers_to_remove&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;original_layers&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;list&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;layers&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="n"&gt;new_layers&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="n"&gt;layer&lt;/span&gt; &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;layer&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="nf"&gt;enumerate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;original_layers&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;i&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;layers_to_remove&lt;/span&gt;
    &lt;span class="p"&gt;]&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;layers&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;nn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nc"&gt;ModuleList&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;new_layers&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;model&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Results
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Summary Table
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;What I Removed&lt;/th&gt;
&lt;th&gt;Perplexity&lt;/th&gt;
&lt;th&gt;Δ Quality&lt;/th&gt;
&lt;th&gt;Tokens/s&lt;/th&gt;
&lt;th&gt;Δ Speed&lt;/th&gt;
&lt;th&gt;Works?&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Nothing (baseline)&lt;/td&gt;
&lt;td&gt;1.82&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;59&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Middle layer (#11)&lt;/td&gt;
&lt;td&gt;1.89&lt;/td&gt;
&lt;td&gt;-4%&lt;/td&gt;
&lt;td&gt;64&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;+10%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3 middle layers (#10-12)&lt;/td&gt;
&lt;td&gt;2.24&lt;/td&gt;
&lt;td&gt;-23%&lt;/td&gt;
&lt;td&gt;66&lt;/td&gt;
&lt;td&gt;+12%&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;First layer (#0)&lt;/td&gt;
&lt;td&gt;5.74&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;-215%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;64&lt;/td&gt;
&lt;td&gt;+10%&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;7 safe layers&lt;/td&gt;
&lt;td&gt;~1.87&lt;/td&gt;
&lt;td&gt;~-2.5%&lt;/td&gt;
&lt;td&gt;~77&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;~30%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;&lt;em&gt;Note: timings averaged over 10 runs after 5 warmup runs, on the MPS backend&lt;/em&gt;&lt;/p&gt;
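
&lt;p&gt;The warmup-then-average protocol can be sketched generically; &lt;code&gt;benchmark&lt;/code&gt; is my name for it, and the timed lambda merely stands in for one generation call:&lt;/p&gt;

```python
import statistics
import time

def benchmark(fn, runs=10, warmup=5):
    """Time fn over several runs, discarding warmup iterations
    (MPS/GPU backends are noticeably slower on the first calls)."""
    for _ in range(warmup):
        fn()
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        fn()
        timings.append(time.perf_counter() - start)
    return statistics.mean(timings), statistics.stdev(timings)

# Stand-in workload instead of a real model.generate() call.
mean_s, std_s = benchmark(lambda: sum(range(100_000)))
print(f"{mean_s * 1e3:.3f} ms ± {std_s * 1e3:.3f} ms")
```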

&lt;h3&gt;
  
  
  Key Discovery: Middle Layers Are Redundant
&lt;/h3&gt;

&lt;p&gt;Removing one layer from the middle of the model (layer #11 out of 22) gave:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;+10% generation speed&lt;/strong&gt; (59 → 64 tokens/sec)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Only -4% quality&lt;/strong&gt; (perplexity 1.82 → 1.89)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Removing 7 safe layers (3, 4, 5, 9, 10, 11, 12) can achieve &lt;strong&gt;~30% speedup&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Generation remained completely coherent:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;Prompt:&lt;/strong&gt; "Once upon a time"&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Baseline:&lt;/strong&gt; &lt;em&gt;(not measured)&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;After removing layer #11:&lt;/strong&gt; "Once upon a time, I was a web developer. Today, I am a freelance web developer. I have worked for some of the most prestigious web..."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The model still generates coherent, grammatically correct text.&lt;/p&gt;

&lt;h3&gt;
  
  
  First Layer Is Sacred
&lt;/h3&gt;

&lt;p&gt;Here's what happened when I removed the first layer:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;After removing layer #0:&lt;/strong&gt; "Once upon a time and a time. Therefore, the therefore, the therefore. Therefore, the therefore, the therefore. Therefore, the..."&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The model broke. Perplexity shot up from 1.82 to 5.74 (3x worse). Text became meaningless repetition.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Why?&lt;/strong&gt; Early layers are responsible for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Basic attention patterns&lt;/li&gt;
&lt;li&gt;Positional encoding&lt;/li&gt;
&lt;li&gt;Fundamental understanding of language structure&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Without them, the model loses the ability to understand how words relate to each other.&lt;/p&gt;

&lt;h3&gt;
  
  
  Visualization: Importance of Each Layer
&lt;/h3&gt;

&lt;p&gt;I tested removing &lt;strong&gt;each layer individually&lt;/strong&gt; and measured quality degradation:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4wllfinxbw3c2kc9ms9w.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4wllfinxbw3c2kc9ms9w.png" alt=" " width="800" height="328"&gt;&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Layer  0:  ████████████████████████████████████████  +3.92  🔴 CRITICAL
Layer  1:  ██████████                               +0.43
Layer  2:  ████████████████████████████████████████████████████████████████████  +6.67  🔴 MOST IMPORTANT!
Layer  3:                                          +0.01  🟢 CAN REMOVE
Layer  4:  █                                       +0.06  🟢
Layer  5:                                          +0.04  🟢
Layer  6:  ██                                      +0.12
Layer  7:  ███████████████                         +0.74
Layer  8:  ██                                      +0.12
Layer  9:  █                                       +0.07  🟢
Layer 10:  █                                       +0.05  🟢
Layer 11:  █                                       +0.07  🟢
Layer 12:  ██                                      +0.09  🟢
Layer 13:  ███                                     +0.14
Layer 14:  ███████████                             +0.53
Layer 15:  ████████████████████████████████████    +1.81  🟠 IMPORTANT
Layer 16:  █████                                   +0.27
Layer 17:  ██                                      +0.12
Layer 18:  ████                                    +0.18
Layer 19:  ████                                    +0.19
Layer 20:  ██████                                  +0.28
Layer 21:  █████████                               +0.47
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Unexpected discovery:&lt;/strong&gt; Layer 2 is more important than Layer 0! This is the layer that forms key language patterns.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Safe-to-remove layers:&lt;/strong&gt; 3, 4, 5, 9, 10, 11, 12 — each increases perplexity by less than 0.1.&lt;/p&gt;




&lt;h2&gt;
  
  
  Interpretation: Why This Distribution?
&lt;/h2&gt;

&lt;p&gt;Results revealed &lt;strong&gt;three critical zones&lt;/strong&gt;:&lt;/p&gt;

&lt;h3&gt;
  
  
  🔴 Critical Zone 1: Layer 2 (PPL +6.67)
&lt;/h3&gt;

&lt;p&gt;The most important layer in the model! This is unexpected — it's usually assumed that Layer 0 is most important.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Hypothesis:&lt;/strong&gt; Layer 2 is where key attention patterns are formed. The first two layers create a "raw" representation, and Layer 2 "crystallizes" it into a structure that all other layers use.&lt;/p&gt;

&lt;h3&gt;
  
  
  🔴 Critical Zone 2: Layer 0 (PPL +3.92)
&lt;/h3&gt;

&lt;p&gt;The first layer is important for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Processing positional encoding&lt;/li&gt;
&lt;li&gt;Basic token understanding&lt;/li&gt;
&lt;li&gt;Initializing attention patterns&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  🟠 Critical Zone 3: Layer 15 (PPL +1.81)
&lt;/h3&gt;

&lt;p&gt;An unexpected spike in the late middle layers. This may be where the model "switches" from general semantics to task-specific processing.&lt;/p&gt;

&lt;h3&gt;
  
  
  🟢 Safe Zone: Layers 3-5, 9-12
&lt;/h3&gt;

&lt;p&gt;These layers show minimal impact (PPL increase &amp;lt; 0.1). They perform &lt;strong&gt;redundant computations&lt;/strong&gt; — repeating what neighboring layers already did.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Practical takeaway:&lt;/strong&gt; you can remove up to &lt;strong&gt;7 layers&lt;/strong&gt; (3, 4, 5, 9, 10, 11, 12) with only ~2.5% quality loss and get a &lt;strong&gt;~30% speedup&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;The &lt;a href="https://arxiv.org/abs/2403.03853" rel="noopener noreferrer"&gt;ShortGPT&lt;/a&gt; paper introduced the &lt;strong&gt;Block Influence (BI)&lt;/strong&gt; metric — my results fully align with their findings: middle layers show low BI and can be safely removed.&lt;/p&gt;




&lt;h2&gt;
  
  
  Practical Takeaways
&lt;/h2&gt;

&lt;h3&gt;
  
  
  For Engineers
&lt;/h3&gt;

&lt;p&gt;Based on per-layer analysis — &lt;strong&gt;optimal combinations for removal:&lt;/strong&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Aggressiveness&lt;/th&gt;
&lt;th&gt;Remove Layers&lt;/th&gt;
&lt;th&gt;Expected Loss&lt;/th&gt;
&lt;th&gt;Speedup&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Minimal&lt;/td&gt;
&lt;td&gt;{3}&lt;/td&gt;
&lt;td&gt;~0.4%&lt;/td&gt;
&lt;td&gt;~5%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Moderate&lt;/td&gt;
&lt;td&gt;{3, 5, 10, 11}&lt;/td&gt;
&lt;td&gt;~1%&lt;/td&gt;
&lt;td&gt;~18%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Aggressive&lt;/td&gt;
&lt;td&gt;{3, 4, 5, 9, 10, 11, 12}&lt;/td&gt;
&lt;td&gt;~2.5%&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;~32%&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Optimal strategy: remove least important layers
&lt;/span&gt;&lt;span class="n"&gt;safe_layers_to_remove&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;9&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;10&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;11&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;12&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;  &lt;span class="c1"&gt;# PPL increase &amp;lt; 0.1 each
&lt;/span&gt;&lt;span class="nf"&gt;remove_layers&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;safe_layers_to_remove&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# Result: 22 -&amp;gt; 15 layers, ~32% speedup, ~2.5% quality loss
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Important:&lt;/strong&gt; never remove layers 0, 2, 15 — these are critical points.&lt;/p&gt;

&lt;h3&gt;
  
  
  For Researchers
&lt;/h3&gt;

&lt;p&gt;This is an active research area:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;ShortGPT (2024)&lt;/strong&gt; — removing entire layers&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;FinerCut (2024)&lt;/strong&gt; — removing components within layers&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;SliceGPT (2024)&lt;/strong&gt; — removing rows/columns from weight matrices&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;LinearPatch (2025)&lt;/strong&gt; — recovering 94% quality after pruning via Hadamard transform (&lt;a href="https://arxiv.org/abs/2505.24680" rel="noopener noreferrer"&gt;arxiv&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;MRP (2025)&lt;/strong&gt; — Maximum Redundancy Pruning, adaptive removal of most redundant layers (&lt;a href="https://arxiv.org/abs/2503.18377" rel="noopener noreferrer"&gt;arxiv&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;CLP (2025)&lt;/strong&gt; — automatic search for optimal segments to remove (&lt;a href="https://arxiv.org/abs/2510.23652" rel="noopener noreferrer"&gt;arxiv&lt;/a&gt;)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Combining with quantization (INT4/INT8) can give even greater speedup.&lt;/p&gt;

&lt;h3&gt;
  
  
  For Business
&lt;/h3&gt;

&lt;p&gt;If you're paying $10k/month for inference GPUs, layer pruning can save $2-3k without noticeable quality loss. At OpenAI's scale, this is millions of dollars.&lt;/p&gt;




&lt;h2&gt;
  
  
  Experiment Limitations
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Small model&lt;/strong&gt; — TinyLlama 1.1B, results may differ for 7B/70B models&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Simple metric&lt;/strong&gt; — perplexity doesn't capture all quality aspects&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;No fine-tuning&lt;/strong&gt; — the pruned model could likely be fine-tuned afterwards to recover quality&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Single dataset&lt;/strong&gt; — need to test on different tasks&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Measurement variability&lt;/strong&gt; — speed on MPS backend has ±10% variance, important to do many runs&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Chain-of-thought degradation&lt;/strong&gt; — recent research (&lt;a href="https://arxiv.org/abs/2510.22228" rel="noopener noreferrer"&gt;arxiv 2510.22228&lt;/a&gt;) showed that even removing 1-2 layers can break multi-step reasoning ability, while simple tasks work fine&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
  
  
  Code
&lt;/h2&gt;

&lt;p&gt;All experiment code is available on GitLab: &lt;a href="https://gitlab.com/molchanov.artem.1994/lobotomyllm" rel="noopener noreferrer"&gt;https://gitlab.com/molchanov.artem.1994/lobotomyllm&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone https://gitlab.com/molchanov.artem.1994/lobotomyllm
&lt;span class="nb"&gt;cd &lt;/span&gt;lobotomyLlm
python &lt;span class="nt"&gt;-m&lt;/span&gt; venv venv &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; &lt;span class="nb"&gt;source &lt;/span&gt;venv/bin/activate
pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;-r&lt;/span&gt; requirements.txt
python experiments/run_ablation.py &lt;span class="nt"&gt;--experiment&lt;/span&gt; quick
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Hypothesis confirmed:&lt;/strong&gt; modern LLMs are over-parameterized; roughly 30% of layers can be removed with &amp;lt;3% quality loss.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Key insights:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;Layer 2 is the most important&lt;/strong&gt; (unexpectedly more important than Layer 0)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Layers 3-5, 9-12 are redundant&lt;/strong&gt; (can be removed almost for free)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Layer 15 is a hidden critical layer&lt;/strong&gt; in the late part of the network&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Practical result:&lt;/strong&gt; removing 7 layers (22→15) gives ~32% speedup with ~2.5% quality loss.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Next steps:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Run on Llama-3 8B for more convincing results&lt;/li&gt;
&lt;li&gt;Try pruning + quantization combination&lt;/li&gt;
&lt;li&gt;Investigate what critical layers (Layer 2, Layer 15) actually "know"&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;em&gt;If you liked this — subscribe, star the GitLab repo, share with colleagues.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Questions and suggestions — in the comments or DM.&lt;/em&gt;&lt;/p&gt;




&lt;p&gt;&lt;strong&gt;Tags:&lt;/strong&gt; #MachineLearning #LLM #Optimization #PyTorch #NLP #DeepLearning&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>ai</category>
    </item>
    <item>
      <title>How I built a lag-free 10GB Log Viewer for VS Code using Rust &amp; Memory-Mapping</title>
      <dc:creator>Artyom Molchanov</dc:creator>
      <pubDate>Wed, 10 Dec 2025 19:19:21 +0000</pubDate>
      <link>https://dev.to/__1bea7786c7/how-i-built-a-lag-free-10gb-log-viewer-for-vs-code-using-rust-memory-mapping-ih4</link>
      <guid>https://dev.to/__1bea7786c7/how-i-built-a-lag-free-10gb-log-viewer-for-vs-code-using-rust-memory-mapping-ih4</guid>
      <description>&lt;p&gt;We’ve all been there. You need to check a production log. You download the server.log, double-click it in VS Code, and... &lt;strong&gt;freeze&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;VS Code is built on Electron. Loading a multi-gigabyte text file into the DOM is a death sentence for RAM and UI responsiveness. The standard solution is to close the editor and go back to less or tail in the terminal.&lt;/p&gt;

&lt;p&gt;But I wanted the best of both worlds: the raw speed of CLI tools and the comfort of the VS Code UI (regex search, copy-paste, highlighting).&lt;/p&gt;

&lt;p&gt;So, I built a custom extension with a &lt;strong&gt;Rust sidecar&lt;/strong&gt; to solve this. Here is a deep dive into how it works under the hood.&lt;/p&gt;




&lt;h2&gt;
  
  
  The Architecture: Sidecar Pattern
&lt;/h2&gt;

&lt;p&gt;The extension consists of two parts communicating via Stdin/Stdout (JSON IPC):&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Frontend (TypeScript/VS Code Webview):&lt;/strong&gt; Handles the UI, virtual scrolling, and rendering.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Backend (Rust):&lt;/strong&gt; Handles file I/O, indexing, and searching.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The goal was simple: &lt;strong&gt;VS Code never holds the full file in memory.&lt;/strong&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  🦀 The Backend: Rust &amp;amp; Memory-Mapping
&lt;/h3&gt;

&lt;p&gt;The core logic resides in a binary called log-core.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Memory-Mapped I/O&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Instead of reading the file into a buffer, I used memmap2. This maps the file on the disk directly into the process's virtual address space.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Benefit:&lt;/strong&gt; The OS handles paging. Opening a 10GB file takes almost &lt;strong&gt;zero RAM&lt;/strong&gt; allocation for the content itself.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Speed:&lt;/strong&gt; Accessing a byte at offset 1,000,000 is as fast as accessing an array index.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;2. The Line Index&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
When a file opens, the backend performs a single O(n) pass to build a Vec of line offsets.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;offsets[0] = 0&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;offsets[1] = (position of first \n) + 1&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;...and so on.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This allows the backend to implement read_lines(start_line, count) efficiently. It calculates the byte range from the index, slices the memory-mapped file, and converts it using String::from_utf8_lossy (handling potential encoding issues gracefully).&lt;/p&gt;
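
&lt;p&gt;The same indexing idea, illustrated with Python's &lt;code&gt;mmap&lt;/code&gt; for brevity (the real backend is Rust with memmap2):&lt;/p&gt;

```python
import mmap
import tempfile

# Build a small demo file, then index line-start offsets over a memory map.
with tempfile.NamedTemporaryFile(delete=False) as f:
    f.write(b"first line\nsecond line\nthird line\n")
    path = f.name

f = open(path, "rb")
mm = mmap.mmap(f.fileno(), 0, access=mmap.ACCESS_READ)

# One O(n) pass to record where each line starts.
offsets = [0]
pos = mm.find(b"\n")
while pos != -1:
    offsets.append(pos + 1)
    pos = mm.find(b"\n", pos + 1)

def read_lines(start_line, count):
    """Slice the mapping by byte range -- never reads the whole file."""
    end = offsets[min(start_line + count, len(offsets) - 1)]
    return mm[offsets[start_line]:end].decode("utf-8", errors="replace")

print(read_lines(1, 2))  # second and third lines
```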

&lt;p&gt;&lt;strong&gt;3. Search &amp;amp; Regex&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Search happens entirely in Rust. I use the regex crate for patterns or standard string matching for plain text. To prevent locking up the CPU on massive files, the search creates a stream of results with a hard limit (e.g., stopping after ~10k matches to keep the UI responsive).&lt;/p&gt;




&lt;h2&gt;
  
  
  ⚡ The Frontend: Virtualization &amp;amp; Limits
&lt;/h2&gt;

&lt;p&gt;The frontend is a VS Code Webview. The biggest challenge here isn't just "showing text," it's &lt;strong&gt;browser limits&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. The 10 Million Pixel Problem&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Browsers have a hard limit on the height of a DOM element (often around 10-30M pixels). A 10GB log file could easily exceed 100M pixels in height.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Solution:&lt;/strong&gt; I implemented a coordinate scaling factor. The scrollbar you see is "fake" (virtualized). We calculate a virtualHeight that fits within browser limits, and then map the scroll position back to the realLineNumber.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;2. Virtual Scrolling &amp;amp; Buffering&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
The Webview maintains a state map, loadedLines, keyed by line number.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;We only render the visible lines + a small buffer (BUFFER_LINES) into the DOM.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Missing ranges are calculated and requested from the Rust backend in chunks (CHUNK_SIZE).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Cache Eviction:&lt;/strong&gt; As you scroll away, lines far from the viewport are removed from the map to keep memory usage flat.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;3. The "Filter Paradox"&lt;/strong&gt;&lt;br&gt;&lt;br&gt;
Implementing filtering (e.g., "Show only ERROR") was tricky.&lt;br&gt;&lt;br&gt;
If I filter a 1M line file and find 500 errors, I need to show a continuous list of 500 lines, BUT I still need to know their &lt;strong&gt;original&lt;/strong&gt; line numbers for debugging.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Logic:&lt;/strong&gt; The frontend builds a mapping: ViewIndex (0..499) → ActualLine (e.g., 504, 1200, 9000).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;UI:&lt;/strong&gt; The main view scrolls based on the ViewIndex, but the Gutter (line numbers) renders the ActualLine.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;This means Ctrl+G (Go to Line) has to solve a reverse lookup: "Where is line 1200 in the filtered view?"&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  🔄 Follow Mode (Tail -f)
&lt;/h2&gt;

&lt;p&gt;I wanted to replace tail -f.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Polling:&lt;/strong&gt; The frontend triggers a refreshFile command every 500ms.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Backend:&lt;/strong&gt; Rust checks the file size. If it grew, it re-maps the file (cheap operation) and scans only the new bytes for new line offsets.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Frontend:&lt;/strong&gt; If the user is at the bottom (or in "Follow Mode"), the view auto-scrolls to the new lines. If the user manually scrolls up, Follow Mode pauses automatically.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  📦 Build &amp;amp; Cross-Compilation
&lt;/h2&gt;

&lt;p&gt;Since this relies on a native binary, I couldn't just publish JS code.&lt;br&gt;&lt;br&gt;
I set up a build pipeline using GitHub Actions to cross-compile the Rust binary for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;darwin-x64 &amp;amp; darwin-arm64 (Apple Silicon)&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;linux-x64&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;win32-x64&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
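
&lt;p&gt;A typical way to do this in GitHub Actions is a build matrix. The sketch below is illustrative only; the job names, runner images, and steps are assumptions, not the project's actual workflow:&lt;/p&gt;

```yaml
# Illustrative cross-compilation matrix (not the project's real pipeline).
jobs:
  build:
    strategy:
      matrix:
        include:
          - os: macos-latest
            target: x86_64-apple-darwin
          - os: macos-latest
            target: aarch64-apple-darwin
          - os: ubuntu-latest
            target: x86_64-unknown-linux-gnu
          - os: windows-latest
            target: x86_64-pc-windows-msvc
    runs-on: ${{ matrix.os }}
    steps:
      - uses: actions/checkout@v4
      - run: rustup target add ${{ matrix.target }}
      - run: cargo build --release --target ${{ matrix.target }}
```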

&lt;p&gt;The VS Code Marketplace supports platform-specific builds. When a user installs the extension, VS Code automatically fetches the correct VSIX for their OS. The resulting package is surprisingly small (~0.8MB), with the compressed binary taking up most of that.&lt;/p&gt;




&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;This project started as a way to fix my own frustration, but it turned into a great lesson on how much performance you can squeeze out of VS Code when you offload heavy I/O to a system language like Rust.&lt;/p&gt;

&lt;p&gt;If you deal with massive logs, CSVs, or data dumps, give it a try.&lt;/p&gt;

&lt;p&gt;👉 &lt;strong&gt;VS Code Marketplace:&lt;/strong&gt; &lt;a href="https://marketplace.visualstudio.com/items?itemName=molchanovartem1994.log-analyzer-pro" rel="noopener noreferrer"&gt;https://marketplace.visualstudio.com/items?itemName=molchanovartem1994.log-analyzer-pro&lt;/a&gt;&lt;br&gt;&lt;br&gt;
👉 &lt;strong&gt;GitLab Repo:&lt;/strong&gt; &lt;a href="https://gitlab.com/molchanov.artem.1994/log-analyzer-pro" rel="noopener noreferrer"&gt;https://gitlab.com/molchanov.artem.1994/log-analyzer-pro&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Let me know if you have any questions about the memmap implementation or the Webview message passing!&lt;/p&gt;

</description>
      <category>vscode</category>
      <category>rust</category>
      <category>performance</category>
      <category>javascript</category>
    </item>
  </channel>
</rss>
