<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Nguuma Tyokaha</title>
    <description>The latest articles on DEV Community by Nguuma Tyokaha (@izzytn_1).</description>
    <link>https://dev.to/izzytn_1</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3682787%2F7fddd556-9bfd-4d61-8c08-ab5b52e0c1ad.jpeg</url>
      <title>DEV Community: Nguuma Tyokaha</title>
      <link>https://dev.to/izzytn_1</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/izzytn_1"/>
    <language>en</language>
    <item>
      <title>I Fine-Tuned a Security Reasoning Model That Runs on a 4GB Laptop (No GPU, No Cloud)</title>
      <dc:creator>Nguuma Tyokaha</dc:creator>
      <pubDate>Thu, 26 Mar 2026 19:01:57 +0000</pubDate>
      <link>https://dev.to/izzytn_1/i-fine-tuned-a-security-reasoning-model-that-runs-on-a-4gb-laptop-no-gpu-no-cloud-4bdd</link>
      <guid>https://dev.to/izzytn_1/i-fine-tuned-a-security-reasoning-model-that-runs-on-a-4gb-laptop-no-gpu-no-cloud-4bdd</guid>
      <description>&lt;h2&gt;
  
  
  The Problem: Security AI Needs to Stay On Your Machine
&lt;/h2&gt;

&lt;p&gt;Every time you paste a suspicious log, a CVE description, or an internal config into a cloud LLM, that data leaves your machine.&lt;/p&gt;

&lt;p&gt;For security work (red team engagements, incident response, air-gapped environments), that's a real problem. You can't send client data to an API. You can't pipe internal logs to OpenAI.&lt;/p&gt;

&lt;p&gt;But local security models have been terrible. They either:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Require expensive hardware (A100, 80GB VRAM)&lt;/li&gt;
&lt;li&gt;Don't reason: they pattern-match and hallucinate CVE numbers&lt;/li&gt;
&lt;li&gt;Have no training signal for the AI-native threats that actually matter in 2025–2026&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So I built one that doesn't have those problems.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I Built
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;security-slm-unsloth-1.5b&lt;/strong&gt; is a fine-tuned DeepSeek-R1-Distill-Qwen-1.5B model that:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Runs offline on a &lt;strong&gt;4GB RAM laptop, CPU only&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Thinks before it answers&lt;/strong&gt;: 100% chain-of-thought (&lt;code&gt;&amp;lt;think&amp;gt;&lt;/code&gt;) activation&lt;/li&gt;
&lt;li&gt;Covers 2026 AI-native attack classes: MCP tool poisoning, Crescendo jailbreaks, agentic lateral movement, LLM-assisted SSRF&lt;/li&gt;
&lt;li&gt;Ships as a &lt;strong&gt;1.2GB GGUF&lt;/strong&gt; that fits on a USB drive&lt;/li&gt;
&lt;li&gt;Trained on &lt;strong&gt;free Google Colab T4&lt;/strong&gt; in ~30 minutes&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Results after fine-tuning:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Metric&lt;/th&gt;
&lt;th&gt;Before&lt;/th&gt;
&lt;th&gt;After&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Eval score (/ 10)&lt;/td&gt;
&lt;td&gt;2.2&lt;/td&gt;
&lt;td&gt;6.0&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Improvement&lt;/td&gt;
&lt;td&gt;—&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;+173%&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;
&lt;code&gt;&amp;lt;think&amp;gt;&lt;/code&gt; block rate&lt;/td&gt;
&lt;td&gt;20–60%&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;100%&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Avg response length&lt;/td&gt;
&lt;td&gt;50–150 words&lt;/td&gt;
&lt;td&gt;200–500 words&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h2&gt;
  
  
  Why DeepSeek-R1-Distill-1.5B?
&lt;/h2&gt;

&lt;p&gt;Most domain fine-tuning guides tell you to pick the biggest model you can fit. That's wrong for this use case.&lt;/p&gt;

&lt;p&gt;The key insight: &lt;strong&gt;DeepSeek-R1-Distill-Qwen-1.5B is the smallest model that reliably produces structured &lt;code&gt;&amp;lt;think&amp;gt;&lt;/code&gt; reasoning chains.&lt;/strong&gt; That matters for security work because you need the model to &lt;em&gt;show its work&lt;/em&gt;; a black-box answer to "is this a path traversal?" is useless. You need the reasoning chain so you can audit it.&lt;/p&gt;

&lt;p&gt;The &lt;code&gt;&amp;lt;think&amp;gt;&lt;/code&gt; block is not cosmetic. It's the model's reasoning substrate. Strip it and you get pattern-matching, not analysis.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Technical Approach
&lt;/h2&gt;

&lt;h3&gt;
  
  
  1. Reasoning-Preserving Training Format
&lt;/h3&gt;

&lt;p&gt;Every training sample enforces the reasoning structure explicitly:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"instruction"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"An AI agent received this tool-call response: {&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;file&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;: &lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;../../../../etc/passwd&lt;/span&gt;&lt;span class="se"&gt;\"&lt;/span&gt;&lt;span class="s2"&gt;}. Is this a path traversal attack?"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"content"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"&amp;lt;think&amp;gt;&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s2"&gt;Step 1 Identify the pattern: the string '../../../../etc/passwd' is a classic path traversal sequence...&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s2"&gt;Step 2 Assess the trust boundary: tool-call responses should be treated as untrusted input...&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s2"&gt;Step 3 Determine severity: /etc/passwd exposes system user accounts...&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s2"&gt;Step 4 Evaluate agent response options: block, sanitize, or escalate...&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s2"&gt;Step 5 Select mitigation: reject the response, log the event, alert the operator...&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s2"&gt;&amp;lt;/think&amp;gt;&lt;/span&gt;&lt;span class="se"&gt;\n\n&lt;/span&gt;&lt;span class="s2"&gt;Yes, this is a path traversal attack. The sequence '../../../..' attempts to escape the intended directory scope..."&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Minimum 5 reasoning steps per sample. Non-negotiable.&lt;/p&gt;
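&lt;p&gt;A quick way to enforce that rule at dataset-build time is a validation pass over every sample. The helper below is an illustrative sketch (the function name and the reliance on "Step N" markers are my assumptions, not part of the released pipeline): it rejects any sample whose reasoning block is missing or has fewer than five numbered steps.&lt;/p&gt;

```python
import re

# Matches the reasoning block, including across newlines.
THINK_RE = re.compile(r"<think>(.*?)</think>", re.DOTALL)

def validate_sample(sample: dict, min_steps: int = 5) -> bool:
    """Return True only if the sample's content carries a think block
    containing at least `min_steps` 'Step N' reasoning markers."""
    match = THINK_RE.search(sample.get("content", ""))
    if match is None:
        return False
    return len(re.findall(r"Step \d+", match.group(1))) >= min_steps
```

Running this as a filter before training catches samples where the reasoning structure silently degraded during generation.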

&lt;h3&gt;
  
  
  2. Full Projection-Layer LoRA
&lt;/h3&gt;

&lt;p&gt;Most fine-tuning tutorials only target attention projections (&lt;code&gt;q_proj&lt;/code&gt;, &lt;code&gt;v_proj&lt;/code&gt;). That's not enough for security reasoning; you need to update the feed-forward reasoning layers too.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;target_modules&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;q_proj&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;k_proj&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;v_proj&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;o_proj&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;  &lt;span class="c1"&gt;# attention
&lt;/span&gt;    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gate_proj&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;up_proj&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;down_proj&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;        &lt;span class="c1"&gt;# feed-forward reasoning
&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;All 7 layers. LoRA rank r=16. This modifies ~1% of parameters while injecting domain knowledge into both attention and reasoning pathways.&lt;/p&gt;
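&lt;p&gt;As a sanity check on the "~1%" figure, you can estimate the LoRA parameter count directly from the projection shapes. The dimensions below are the published Qwen2-1.5B ones (hidden 1536, FFN 8960, 28 layers, grouped-query KV width 256); treat them as assumptions if you swap base models.&lt;/p&gt;

```python
# Each adapted weight W (d_out x d_in) gains r*(d_in + d_out) trainable
# parameters: the low-rank A (r x d_in) and B (d_out x r) matrices.
def lora_param_count(r: int, shapes: list[tuple[int, int]], n_layers: int) -> int:
    return n_layers * sum(r * (d_in + d_out) for d_in, d_out in shapes)

hidden, ffn, kv = 1536, 8960, 256   # assumed Qwen2-1.5B dimensions
shapes = [
    (hidden, hidden),  # q_proj
    (hidden, kv),      # k_proj
    (hidden, kv),      # v_proj
    (hidden, hidden),  # o_proj
    (hidden, ffn),     # gate_proj
    (hidden, ffn),     # up_proj
    (ffn, hidden),     # down_proj
]
trainable = lora_param_count(16, shapes, n_layers=28)
fraction = trainable / 1.78e9       # against ~1.78B base parameters
```

With r=16 across all seven projections this lands at roughly 18M trainable parameters, which is where the ~1% figure comes from.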

&lt;h3&gt;
  
  
  3. Dual-Axis Dataset Design
&lt;/h3&gt;

&lt;p&gt;Every threat scenario is a &lt;strong&gt;matched red/blue pair&lt;/strong&gt; (same attack, both perspectives):&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;#&lt;/th&gt;
&lt;th&gt;Threat&lt;/th&gt;
&lt;th&gt;Red Team&lt;/th&gt;
&lt;th&gt;Blue Team&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;1&lt;/td&gt;
&lt;td&gt;MCP Security&lt;/td&gt;
&lt;td&gt;Tool description injection → ENV exfiltration&lt;/td&gt;
&lt;td&gt;Validation schema with scope enforcement&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;2&lt;/td&gt;
&lt;td&gt;Prompt Hijacking&lt;/td&gt;
&lt;td&gt;Payload splitting across 3 turns (bypasses LlamaGuard)&lt;/td&gt;
&lt;td&gt;Semantic drift monitor with cross-turn context&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;3&lt;/td&gt;
&lt;td&gt;Agentic Security&lt;/td&gt;
&lt;td&gt;Recursive tool-call loop → resource exhaustion&lt;/td&gt;
&lt;td&gt;Token budget circuit breaker + HITL escalation&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;4&lt;/td&gt;
&lt;td&gt;RAG Poisoning&lt;/td&gt;
&lt;td&gt;Malicious PDF overwrites system prompt&lt;/td&gt;
&lt;td&gt;AWS IAM least-privilege scoped to single S3 prefix&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;5&lt;/td&gt;
&lt;td&gt;Crescendo Attack&lt;/td&gt;
&lt;td&gt;6-turn conversational escalation jailbreak&lt;/td&gt;
&lt;td&gt;Cross-turn intent accumulation with LlamaGuard&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;6&lt;/td&gt;
&lt;td&gt;Lateral Movement&lt;/td&gt;
&lt;td&gt;Search→Email→Storage chain abuse&lt;/td&gt;
&lt;td&gt;Inter-tool permission boundary enforcement&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;7&lt;/td&gt;
&lt;td&gt;LLM SSRF&lt;/td&gt;
&lt;td&gt;URL-fetching LLM → EC2 metadata credential theft&lt;/td&gt;
&lt;td&gt;SSRF-safe HTTP client + IP allowlist&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;This dual-axis approach means the model doesn't become purely offensive — it can reason from both sides of the same attack.&lt;/p&gt;

&lt;h3&gt;
  
  
  4. Quantisation Decision
&lt;/h3&gt;

&lt;p&gt;Q4_K_M was selected after analysing the quality/size tradeoff at 1.5B scale:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Format&lt;/th&gt;
&lt;th&gt;RAM&lt;/th&gt;
&lt;th&gt;Quality&lt;/th&gt;
&lt;th&gt;Decision&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Q8_0&lt;/td&gt;
&lt;td&gt;~1.8GB&lt;/td&gt;
&lt;td&gt;99.9%&lt;/td&gt;
&lt;td&gt;Too large for 4GB headroom&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Q4_K_M&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;~1.2GB&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;~99%&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Selected&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Q4_0&lt;/td&gt;
&lt;td&gt;~1.0GB&lt;/td&gt;
&lt;td&gt;~97%&lt;/td&gt;
&lt;td&gt;Measurable quality loss&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Q2_K&lt;/td&gt;
&lt;td&gt;~0.7GB&lt;/td&gt;
&lt;td&gt;~90%&lt;/td&gt;
&lt;td&gt;Not suitable for reasoning&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;At 1.5B parameters, Q4_K_M retains ~99% of full-precision quality. The quality cliff only appears at Q2_K for this model size.&lt;/p&gt;
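&lt;p&gt;The RAM figures in the table follow almost directly from bits-per-weight. A back-of-envelope sketch (the bpw values are approximate averages for llama.cpp's mixed quant formats, not exact per-file numbers):&lt;/p&gt;

```python
# Approximate average bits per weight for common llama.cpp quant formats.
# Ballpark figures only; the exact bpw depends on the per-tensor mix.
BPW = {"Q8_0": 8.5, "Q4_K_M": 4.85, "Q4_0": 4.55, "Q2_K": 2.6}

def gguf_size_gb(n_params: float, fmt: str) -> float:
    """Estimate GGUF weight size in decimal GB for a quant format."""
    return n_params * BPW[fmt] / 8 / 1e9
```

For a ~1.78B-parameter model this puts Q4_K_M at just over 1GB of weights, consistent with the 1.2GB artifact once tokenizer and metadata overhead are added.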




&lt;h2&gt;
  
  
  Training on Free Colab in 30 Minutes
&lt;/h2&gt;

&lt;p&gt;The full pipeline runs on a free Google Colab T4 (15GB VRAM). Unsloth handles the memory efficiency; training uses under 3GB VRAM.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;unsloth&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;FastLanguageModel&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;torch&lt;/span&gt;

&lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;tokenizer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;FastLanguageModel&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_pretrained&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;unsloth/deepseek-r1-distill-qwen-1.5b-unsloth-bnb-4bit&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;max_seq_length&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2048&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;load_in_4bit&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;FastLanguageModel&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;get_peft_model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;16&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;target_modules&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;q_proj&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;k_proj&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;v_proj&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;o_proj&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;gate_proj&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;up_proj&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;down_proj&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="n"&gt;lora_alpha&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;16&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;lora_dropout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;bias&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;none&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;use_gradient_checkpointing&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;unsloth&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Key hyperparameters:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Learning rate: &lt;code&gt;2e-4&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Batch size: 2 (effective 8 with gradient accumulation × 4)&lt;/li&gt;
&lt;li&gt;Epochs: 2&lt;/li&gt;
&lt;li&gt;Checkpoint every 25 steps (crash protection on free Colab sessions)&lt;/li&gt;
&lt;li&gt;Final training loss: &lt;strong&gt;2.66&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;
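&lt;p&gt;Wired into a standard TRL SFTTrainer setup, those hyperparameters look roughly like this. This is a sketch assuming the usual &lt;code&gt;transformers&lt;/code&gt; API; argument names can shift between versions, and &lt;code&gt;output_dir&lt;/code&gt; is arbitrary.&lt;/p&gt;

```python
from transformers import TrainingArguments

# Mirrors the hyperparameters listed above.
args = TrainingArguments(
    output_dir="outputs",
    per_device_train_batch_size=2,
    gradient_accumulation_steps=4,   # effective batch size 8
    learning_rate=2e-4,
    num_train_epochs=2,
    save_strategy="steps",
    save_steps=25,                   # checkpoint survival on free Colab
    logging_steps=5,
    fp16=True,                       # the T4 has no bf16 support
)
```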

&lt;h2&gt;
  
  
  Try It Now: 3 Ways
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Ollama (one command, no Python)
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;ollama run hf.co/Nguuma/security-slm-unsloth-1.5b
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Python (llama-cpp-python)
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# pip install llama-cpp-python huggingface_hub
&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;huggingface_hub&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;hf_hub_download&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;llama_cpp&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Llama&lt;/span&gt;

&lt;span class="n"&gt;model_path&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;hf_hub_download&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;repo_id&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Nguuma/security-slm-unsloth-1.5b&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;filename&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;security-slm-finetuned-deepseek-r1-distill-qwen-1.5b.Q4_K_M.gguf&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;local_dir&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;./models&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;llm&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Llama&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;model_path&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;model_path&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;n_ctx&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;2048&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;n_threads&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;verbose&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;llm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create_chat_completion&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;messages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;system&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;You are a Cybersecurity assistant with Blue and Red team security reasoning. Think step by step before answering.&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;},&lt;/span&gt;
        &lt;span class="p"&gt;{&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;role&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;An AI agent received this tool-call response: {&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;file&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;: &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;../../../../etc/passwd&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;}. Is this a path traversal attack? What should the agent do?&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="n"&gt;max_tokens&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;512&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;temperature&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.7&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;choices&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;message&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;][&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;content&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Prompt format (for any inference engine)
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight markdown"&gt;&lt;code&gt;&lt;span class="nt"&gt;&amp;lt;&lt;/span&gt;&lt;span class="err"&gt;|&lt;/span&gt;&lt;span class="na"&gt;im_start&lt;/span&gt;&lt;span class="err"&gt;|&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;system
You are a Cybersecurity assistant with Blue and Red team security reasoning. Think step by step before answering.
&lt;span class="nt"&gt;&amp;lt;&lt;/span&gt;&lt;span class="err"&gt;|&lt;/span&gt;&lt;span class="na"&gt;im_end&lt;/span&gt;&lt;span class="err"&gt;|&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;
&lt;span class="nt"&gt;&amp;lt;&lt;/span&gt;&lt;span class="err"&gt;|&lt;/span&gt;&lt;span class="na"&gt;im_start&lt;/span&gt;&lt;span class="err"&gt;|&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;user
Your question here
&lt;span class="nt"&gt;&amp;lt;&lt;/span&gt;&lt;span class="err"&gt;|&lt;/span&gt;&lt;span class="na"&gt;im_end&lt;/span&gt;&lt;span class="err"&gt;|&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;
&lt;span class="nt"&gt;&amp;lt;&lt;/span&gt;&lt;span class="err"&gt;|&lt;/span&gt;&lt;span class="na"&gt;im_start&lt;/span&gt;&lt;span class="err"&gt;|&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;assistant
&lt;span class="nt"&gt;&amp;lt;think&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Always open the assistant turn with &lt;code&gt;&amp;lt;think&amp;gt;&lt;/code&gt;; this triggers the reasoning chain.&lt;/p&gt;
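&lt;p&gt;If your inference engine doesn't apply a chat template for you, the format above can be assembled by hand. A small helper sketch (the token strings come from the template shown; the function name is mine):&lt;/p&gt;

```python
def build_prompt(system: str, user: str) -> str:
    """Render the ChatML-style template, pre-opening the assistant
    turn with <think> so the model starts in reasoning mode."""
    return (
        f"<|im_start|>system\n{system}\n<|im_end|>\n"
        f"<|im_start|>user\n{user}\n<|im_end|>\n"
        f"<|im_start|>assistant\n<think>\n"
    )
```

Pass the result to any raw-completion endpoint; the model continues from the open think tag.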

&lt;h2&gt;
  
  
  What It's Good At
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Analysing suspicious logs and tool-call responses for attack patterns&lt;/li&gt;
&lt;li&gt;Drafting detection rules (Sigma, YARA, KQL) from attack descriptions&lt;/li&gt;
&lt;li&gt;Reasoning through MCP and agentic attack surfaces&lt;/li&gt;
&lt;li&gt;Walking through CVE-analogous scenarios step by step&lt;/li&gt;
&lt;li&gt;Generating incident response playbook outlines&lt;/li&gt;
&lt;li&gt;CTF challenge reasoning with explained steps&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  What It's Not
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Not a general security encyclopedia; it's a specialist&lt;/li&gt;
&lt;li&gt;Not a substitute for a professional pentest&lt;/li&gt;
&lt;li&gt;Not trained on every CVE; highly specific CVE details may be wrong&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  What's Next
&lt;/h2&gt;

&lt;p&gt;Areas I want to expand:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;DPO alignment pairs&lt;/strong&gt;: &lt;code&gt;chosen&lt;/code&gt;/&lt;code&gt;rejected&lt;/code&gt; samples to reduce hallucination on specific CVE numbers&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Multi-turn adversarial chains&lt;/strong&gt;: full 5-turn attack simulations with attacker/defender roles&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Framework-specific coverage&lt;/strong&gt;: LangChain, AutoGen, CrewAI, MCP server implementations&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Higher LoRA rank (r=32)&lt;/strong&gt;: more capacity for complex multi-step reasoning&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;If you work in security and want to contribute scenarios or feedback on the threat coverage, open an issue on the HuggingFace repo or drop a comment below.&lt;/p&gt;

&lt;h2&gt;
  
  
  Links
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;HuggingFace model:&lt;/strong&gt; &lt;a href="https://huggingface.co/Nguuma/security-slm-unsloth-1.5b" rel="noopener noreferrer"&gt;Nguuma/security-slm-unsloth-1.5b&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Unsloth&lt;/strong&gt; (made the training possible on free hardware): &lt;a href="https://github.com/unslothai/unsloth" rel="noopener noreferrer"&gt;github.com/unslothai/unsloth&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;em&gt;Built on free infrastructure. Runs on commodity hardware. Stays on your machine.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>security</category>
      <category>machinelearning</category>
      <category>ai</category>
      <category>cybersecurity</category>
    </item>
    <item>
      <title>The Future of Private AI: Secure, Cost‑Effective Small Language Models (SLMs) for Domain‑Specific Environments</title>
      <dc:creator>Nguuma Tyokaha</dc:creator>
      <pubDate>Sun, 28 Dec 2025 14:36:48 +0000</pubDate>
      <link>https://dev.to/izzytn_1/the-future-of-private-ai-secure-cost-effective-small-language-models-slms-for-domain-specific-2p6j</link>
      <guid>https://dev.to/izzytn_1/the-future-of-private-ai-secure-cost-effective-small-language-models-slms-for-domain-specific-2p6j</guid>
      <description>&lt;p&gt;&lt;em&gt;By an AI &amp;amp; Cybersecurity Specialist&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The AI conversation has been dominated by large, cloud‑hosted language models (LLMs). While powerful, they introduce hidden costs, privacy risks, and strategic dependencies that many organisations across regulated and enterprise environments can no longer justify. In this article, I argue that &lt;strong&gt;Small Language Models (SLMs)&lt;/strong&gt; represent the next pragmatic evolution of modern AI adoption.&lt;/p&gt;

&lt;p&gt;SLMs enable organisations to deploy &lt;strong&gt;offline, private, and domain‑specific AI systems&lt;/strong&gt; with predictable cost, strong security guarantees, and production‑grade performance. This post provides a practical and opinionated blueprint covering architecture, LoRA distillation, RAG, secure inference, and offline deployment written for engineers, architects, and technical leaders building real systems where privacy, control, and economics matter.&lt;/p&gt;

&lt;h2&gt;
  
  
  AI in a Regulated World
&lt;/h2&gt;

&lt;p&gt;Financial institutions operate under strict regulatory and risk constraints:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;GDPR, PCI‑DSS, SOX, AML, ISO 27001&lt;/li&gt;
&lt;li&gt;Highly sensitive transactional and identity data&lt;/li&gt;
&lt;li&gt;Zero tolerance for data leakage or hallucinated outputs&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Yet many teams are encouraged to adopt &lt;strong&gt;cloud LLM APIs&lt;/strong&gt; that:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Process prompts outside organisational trust boundaries&lt;/li&gt;
&lt;li&gt;Have opaque training and retention policies&lt;/li&gt;
&lt;li&gt;Introduce unpredictable per‑token cost&lt;/li&gt;
&lt;li&gt;Are difficult to audit or explain to regulators&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is not a technical failure; it is a &lt;strong&gt;strategic mismatch&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why SLMs Over LLMs (A Hard Truth)
&lt;/h2&gt;

&lt;p&gt;LLMs are optimised for &lt;em&gt;breadth&lt;/em&gt;. Enterprises need &lt;em&gt;precision&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;SLMs win across healthcare, finance, SOC, and SaaS because they are:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Domain‑bounded (clinical workflows, payments, alerts, product knowledge)&lt;/li&gt;
&lt;li&gt;Cheap enough to run continuously&lt;/li&gt;
&lt;li&gt;Small enough to deploy offline or in isolated environments&lt;/li&gt;
&lt;li&gt;Predictable enough for audits, compliance, and customer trust&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In practice, a 1–7B parameter SLM trained correctly &lt;strong&gt;can outperform a 70B LLM&lt;/strong&gt; on narrow financial tasks.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Traditional Approaches Failed
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Approach&lt;/th&gt;
&lt;th&gt;Why It Breaks&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Rule engines&lt;/td&gt;
&lt;td&gt;Non‑scalable, brittle, expensive to maintain&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Classical ML&lt;/td&gt;
&lt;td&gt;Poor contextual reasoning&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cloud LLM APIs&lt;/td&gt;
&lt;td&gt;Privacy risk, cost explosion, vendor lock‑in&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;SLMs close this gap by combining &lt;strong&gt;contextual reasoning with strict control&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Characteristics of an Enterprise‑Grade, Domain‑Specific SLM
&lt;/h2&gt;

&lt;p&gt;A production‑ready SLM across &lt;strong&gt;healthcare, finance, SOC, and SaaS&lt;/strong&gt; environments must:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Run fully &lt;strong&gt;offline or in isolated networks&lt;/strong&gt;
&lt;/li&gt;
&lt;li&gt;Be deterministic, explainable, and bounded by domain context&lt;/li&gt;
&lt;li&gt;Protect sensitive data (PHI, PII, financial, security telemetry)&lt;/li&gt;
&lt;li&gt;Integrate with SIEM, observability, audit, and compliance tooling&lt;/li&gt;
&lt;li&gt;Support encryption, RBAC, policy enforcement, and full logging by default&lt;/li&gt;
&lt;li&gt;Operate with predictable performance and infrastructure cost&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Architecture Overview (Private &amp;amp; Offline‑First)
&lt;/h2&gt;

&lt;h3&gt;
  
  
  High‑Level Architecture Diagram
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;┌───────────────────────┐
│  Internal Data Lake   │  (Transactions, Logs, Policies)
└───────────┬───────────┘
            │
            ▼
┌───────────────────────┐
│ Secure Data Curation  │
│(PII masking, labeling)│
└───────────┬───────────┘
            │
            ▼
┌───────────────────────┐
│ SLM Training Pipeline │◄── Distilled Knowledge (Offline)
│ (LoRA / QLoRA)        │
└───────────┬───────────┘
            │
            ▼
┌───────────────────────┐
│ Domain‑Specific SLM   │
│ (1–7B params)         │
└───────────┬───────────┘
            │
            ▼
┌───────────────────────┐
│ Offline Inference     │
│ (On‑Prem / Private)   │
└───────────────────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
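&lt;p&gt;The Secure Data Curation stage above can be sketched with simple regex-based masking. This is a minimal illustration, not a production redactor; the patterns and the &lt;code&gt;mask_pii&lt;/code&gt; helper are assumptions:&lt;/p&gt;

```python
import re

# Minimal, illustrative PII masking for the data-curation stage.
# Patterns are examples only; a real pipeline needs locale-aware,
# audited redaction rules.
PII_PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
    "CARD": re.compile(r"\b(?:\d[ -]?){13,16}\b"),
}

def mask_pii(text: str) -> str:
    """Replace matched PII spans with typed placeholder tokens."""
    for label, pattern in PII_PATTERNS.items():
        text = pattern.sub(f"[{label}]", text)
    return text

record = "Customer jane.doe@example.com flagged; SSN 123-45-6789."
print(mask_pii(record))
```

&lt;p&gt;A production pipeline would typically layer NER-based detection and human review on top of pattern matching before any record reaches training.&lt;/p&gt;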

&lt;h2&gt;
  
  
  Core Design Principles
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Offline by default&lt;/strong&gt; – no internet dependency&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Least‑knowledge principle&lt;/strong&gt; – model only knows its domain&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Defense‑in‑depth security&lt;/strong&gt; – model, runtime, and data&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cost predictability&lt;/strong&gt; – fixed infrastructure cost&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;
  
  
  Distilling Frontier LLMs into Domain‑Specific SLMs (LoRA)
&lt;/h2&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;peft&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;LoraConfig&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;get_peft_model&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;transformers&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;AutoModelForCausalLM&lt;/span&gt;

&lt;span class="n"&gt;base_model&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;AutoModelForCausalLM&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;from_pretrained&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;mistral-7b&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;load_in_4bit&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;lora_config&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;LoraConfig&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;8&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;lora_alpha&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;32&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;target_modules&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;q_proj&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;v_proj&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
    &lt;span class="n"&gt;lora_dropout&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mf"&gt;0.05&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;slm&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nf"&gt;get_peft_model&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;base_model&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;lora_config&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;This reduces training cost by &lt;strong&gt;&amp;gt;90%&lt;/strong&gt; while preserving task performance.&lt;/p&gt;
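&lt;p&gt;The cost reduction follows directly from parameter counts: LoRA trains only the low-rank adapter matrices. A back-of-envelope sketch for the configuration above, assuming rough Mistral-7B-style shapes (32 layers, hidden size 4096):&lt;/p&gt;

```python
# Back-of-envelope LoRA parameter count for the config above.
# Shapes are rough Mistral-7B-style assumptions (32 layers, d_model=4096).
d_model = 4096
n_layers = 32
r = 8
targets_per_layer = 2          # q_proj and v_proj

# Each adapted projection adds two matrices: A (d_model x r) and B (r x d_model).
lora_params = n_layers * targets_per_layer * (d_model * r + r * d_model)
full_params = 7_000_000_000    # a full fine-tune touches all ~7B weights

print(f"trainable LoRA params: {lora_params:,}")
print(f"fraction of full model: {lora_params / full_params:.5%}")
```

&lt;p&gt;Roughly four million trainable parameters versus seven billion: the optimiser state, gradients, and checkpoints shrink by the same ratio.&lt;/p&gt;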


&lt;h2&gt;
  
  
  Secure Inference (Zero‑Trust Model Runtime)
&lt;/h2&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nf"&gt;secure_enclave&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
    &lt;span class="n"&gt;output&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;slm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;generate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
        &lt;span class="n"&gt;sanitized_prompt&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
        &lt;span class="n"&gt;max_tokens&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;256&lt;/span&gt;
    &lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;Security controls:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Encrypted weights at rest&lt;/li&gt;
&lt;li&gt;Prompt/output redaction&lt;/li&gt;
&lt;li&gt;RBAC‑gated inference&lt;/li&gt;
&lt;li&gt;Full audit logging&lt;/li&gt;
&lt;/ul&gt;
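&lt;p&gt;A minimal sketch of how these controls can wrap inference. The &lt;code&gt;secure_enclave&lt;/code&gt; call in the snippet above is illustrative, and the helpers here (&lt;code&gt;gated_generate&lt;/code&gt;, the role set, the redaction pattern) are assumptions, not a fixed API:&lt;/p&gt;

```python
import logging
import re
from datetime import datetime, timezone

logging.basicConfig(level=logging.INFO)
audit_log = logging.getLogger("slm.audit")

ALLOWED_ROLES = {"analyst", "compliance"}          # example RBAC policy
REDACT = re.compile(r"\b\d{3}-\d{2}-\d{4}\b")      # redact SSN-like spans

def gated_generate(model_fn, user_role: str, prompt: str) -> str:
    """RBAC-gated inference with output redaction and audit logging."""
    if user_role not in ALLOWED_ROLES:
        audit_log.warning("denied role=%s", user_role)
        raise PermissionError(f"role '{user_role}' may not run inference")
    raw = model_fn(prompt)                          # model call stays in-process
    redacted = REDACT.sub("[REDACTED]", raw)
    audit_log.info("ts=%s role=%s prompt_len=%d",
                   datetime.now(timezone.utc).isoformat(),
                   user_role, len(prompt))
    return redacted

def fake_model(prompt: str) -> str:
    """Stand-in for slm.generate so the sketch runs without model weights."""
    return "Risk: high. Subject SSN 123-45-6789."

print(gated_generate(fake_model, "analyst", "Assess AML risk"))
```

&lt;p&gt;Encryption of weights at rest sits below this layer, in the storage and key-management stack rather than in the inference wrapper.&lt;/p&gt;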
&lt;h2&gt;
  
  
  Sample Domain‑Specific Training Data
&lt;/h2&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Instruction: Assess AML risk
Context: 5 transactions of $9,500 within 48 hours
Output: Medium‑High risk – structuring behaviour detected
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
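&lt;p&gt;Records like this are typically serialised as JSONL for fine-tuning. A minimal sketch; the field names are one common convention, not a fixed schema:&lt;/p&gt;

```python
import json

# Build instruction-tuning records in JSONL form (one JSON object per line).
samples = [
    {
        "instruction": "Assess AML risk",
        "context": "5 transactions of $9,500 within 48 hours",
        "output": "Medium-High risk - structuring behaviour detected",
    },
]

jsonl = "\n".join(json.dumps(s, ensure_ascii=False) for s in samples)
print(jsonl)

# Round-trip check: each line parses back into the original record.
assert [json.loads(line) for line in jsonl.splitlines()] == samples
```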

&lt;h2&gt;
  
  
  Offline &amp;amp; Private Deployment
&lt;/h2&gt;
&lt;h3&gt;
  
  
  On‑Prem and Air‑Gapped Hosting
&lt;/h3&gt;

&lt;p&gt;SLMs run efficiently on:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;CPU‑only servers&lt;/li&gt;
&lt;li&gt;Single low‑end GPUs&lt;/li&gt;
&lt;li&gt;Confidential VMs&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;No internet. No external APIs. No data exfiltration.&lt;/p&gt;
&lt;h3&gt;
  
  
  SLM + RAG for Domain Intelligence
&lt;/h3&gt;


&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;context&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;vector_db&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;retrieve&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;prompt&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;context&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;Question: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;query&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;slm&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;generate&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;prompt&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
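&lt;p&gt;&lt;code&gt;vector_db.retrieve&lt;/code&gt; above is a placeholder. Even a toy in-memory retriever shows the shape of the offline RAG loop; this keyword-overlap sketch stands in for a real embedding index:&lt;/p&gt;

```python
# Toy in-memory retriever standing in for `vector_db.retrieve`.
# Real deployments use an embedding model plus a local vector index;
# keyword overlap just illustrates the RAG wiring offline.
DOCS = [
    "AML policy: transactions structured below reporting thresholds are high risk.",
    "Refund policy: refunds over $500 require manager approval.",
]

def retrieve(query: str, k: int = 1) -> list[str]:
    """Rank documents by naive keyword overlap with the query."""
    q_words = set(query.lower().split())
    scored = sorted(DOCS,
                    key=lambda d: len(q_words.intersection(d.lower().split())),
                    reverse=True)
    return scored[:k]

query = "Is structuring below reporting thresholds risky?"
context = "\n".join(retrieve(query))
prompt = f"{context}\nQuestion: {query}"
print(prompt)
```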


&lt;p&gt;Use cases:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;AML case investigation&lt;/li&gt;
&lt;li&gt;Internal policy Q&amp;amp;A&lt;/li&gt;
&lt;li&gt;Risk assessment copilots&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;
  
  
  Evaluation &amp;amp; Security Testing
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Hallucination rate on domain‑critical facts&lt;/li&gt;
&lt;li&gt;Prompt injection and data leakage resistance&lt;/li&gt;
&lt;li&gt;Model extraction and inversion attempts&lt;/li&gt;
&lt;li&gt;Red‑team simulations aligned to healthcare, finance, SOC, and SaaS threats&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;
  
  
  Performance and Scalability
&lt;/h2&gt;

&lt;p&gt;SLMs scale horizontally:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Stateless inference pods&lt;/li&gt;
&lt;li&gt;Deterministic latency&lt;/li&gt;
&lt;li&gt;Predictable OPEX&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is &lt;strong&gt;enterprise‑friendly AI economics&lt;/strong&gt;.&lt;/p&gt;
&lt;h2&gt;
  
  
  SLMs vs LLMs (Reality Check)
&lt;/h2&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Dimension&lt;/th&gt;
&lt;th&gt;Cloud LLM&lt;/th&gt;
&lt;th&gt;SLM&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Privacy&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Offline&lt;/td&gt;
&lt;td&gt;❌&lt;/td&gt;
&lt;td&gt;✅&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Cost&lt;/td&gt;
&lt;td&gt;Unbounded&lt;/td&gt;
&lt;td&gt;Fixed&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Auditability&lt;/td&gt;
&lt;td&gt;Low&lt;/td&gt;
&lt;td&gt;High&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
&lt;h2&gt;
  
  
  Benchmark Comparison (Realistic Enterprise Estimates)
&lt;/h2&gt;

&lt;blockquote&gt;
&lt;p&gt;Benchmarks below are representative of real-world enterprise deployments using a 7B SLM vs a frontier cloud LLM API. Exact numbers vary by workload.&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h3&gt;
  
  
  Latency (Single Request)
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Avg Latency&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Cloud LLM (API)&lt;/td&gt;
&lt;td&gt;800–2000 ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Private SLM (GPU)&lt;/td&gt;
&lt;td&gt;40–120 ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Private SLM (CPU)&lt;/td&gt;
&lt;td&gt;150–350 ms&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
&lt;h3&gt;
  
  
  Cost (Monthly, ~5M tokens/day)
&lt;/h3&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Model&lt;/th&gt;
&lt;th&gt;Estimated Cost&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Cloud LLM API&lt;/td&gt;
&lt;td&gt;$18,000–$35,000&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Private SLM (GPU amortized)&lt;/td&gt;
&lt;td&gt;$2,000–$4,000&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Private SLM (CPU-only)&lt;/td&gt;
&lt;td&gt;$800–$1,500&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
&lt;h3&gt;
  
  
  Security &amp;amp; Compliance Impact
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Cloud LLM: High legal and compliance overhead&lt;/li&gt;
&lt;li&gt;SLM: Infrastructure-only audit scope&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;
  
  
  Challenges and What Comes Next
&lt;/h2&gt;

&lt;p&gt;Challenges:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Domain data quality&lt;/li&gt;
&lt;li&gt;Skilled MLOps teams&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Future direction:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Automated SLM distillation&lt;/li&gt;
&lt;li&gt;Hardware‑aware optimisation&lt;/li&gt;
&lt;li&gt;Regulatory‑driven AI standards&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;
  
  
  A Personal Manifesto for Private AI
&lt;/h2&gt;

&lt;p&gt;I believe the future of AI will not be decided by who trains the largest model.&lt;/p&gt;

&lt;p&gt;It will be decided by who &lt;strong&gt;controls their intelligence stack&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Enterprises do not need models that know everything. They need models that know &lt;strong&gt;exactly what they are allowed to know&lt;/strong&gt;, operate entirely within trust boundaries, and deliver value without hidden risk or runaway cost.&lt;/p&gt;

&lt;p&gt;Small Language Models represent a shift from experimental AI to &lt;strong&gt;operational AI&lt;/strong&gt;:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;From external dependency to internal capability&lt;/li&gt;
&lt;li&gt;From unpredictable billing to fixed economics&lt;/li&gt;
&lt;li&gt;From opaque systems to auditable infrastructure&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For startups, SLMs unlock AI adoption without destroying margins. For large organisations, they restore sovereignty over data, compliance, and architecture. This is not a temporary workaround; it is the long‑term foundation of serious AI systems.&lt;/p&gt;

&lt;p&gt;Private, domain‑specific, offline‑capable AI is not the future.&lt;/p&gt;

&lt;p&gt;It is the present.&lt;/p&gt;
&lt;h2&gt;
  
  
  Variants by Domain
&lt;/h2&gt;
&lt;h3&gt;
  
  
  Healthcare
&lt;/h3&gt;

&lt;p&gt;Healthcare organisations cannot afford experimental AI. Patient data, clinical accuracy, and regulatory compliance demand systems that operate entirely within hospital and provider trust boundaries. Small Language Models enable clinical and operational AI that runs offline, preserves PHI, and delivers deterministic, auditable results where human lives are at stake.&lt;/p&gt;
&lt;h3&gt;
  
  
  Finance
&lt;/h3&gt;

&lt;p&gt;Financial institutions operate under constant regulatory scrutiny while facing rising pressure to modernise. SLMs allow banks and fintechs to deploy AI for risk, compliance, and operations without exposing sensitive data, incurring runaway API costs, or sacrificing auditability.&lt;/p&gt;
&lt;h3&gt;
  
  
  SOC / Cybersecurity
&lt;/h3&gt;

&lt;p&gt;Security teams need speed, precision, and trust. Cloud LLMs introduce latency and risk that SOC environments cannot tolerate. SLMs provide sub‑second, private AI for alert triage, incident response, and threat analysis without leaking adversarial data outside the perimeter.&lt;/p&gt;
&lt;h3&gt;
  
  
  SaaS
&lt;/h3&gt;

&lt;p&gt;SaaS companies are discovering that LLM APIs silently erode margins. SLMs offer a path to embedded AI with predictable unit economics, customer‑level data isolation, and privacy as a competitive differentiator.&lt;/p&gt;
&lt;h3&gt;
  
  
  SOC / Cybersecurity (High-Signal, Low-Latency AI)
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Key Drivers:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Real-time response requirements&lt;/li&gt;
&lt;li&gt;Sensitive security telemetry&lt;/li&gt;
&lt;li&gt;Adversarial threat environment&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;SLM Use Cases:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Alert triage and prioritisation&lt;/li&gt;
&lt;li&gt;Incident response copilots&lt;/li&gt;
&lt;li&gt;Log and SIEM analysis&lt;/li&gt;
&lt;li&gt;Threat intelligence summarisation&lt;/li&gt;
&lt;/ul&gt;
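&lt;p&gt;Alert triage is a natural first integration because the input is highly structured. A sketch of flattening a SIEM-style alert into a triage prompt; the field names are hypothetical, not tied to any particular SIEM schema:&lt;/p&gt;

```python
# Flatten a SIEM-style alert into a structured triage prompt for an SLM.
# Field names are illustrative, not tied to any specific SIEM schema.
def triage_prompt(alert: dict) -> str:
    lines = [
        "Task: triage the alert below. Reply with severity and a one-line rationale.",
        f"Rule: {alert['rule']}",
        f"Source: {alert['src_ip']} -> {alert['dst_ip']}",
        f"Count: {alert['count']} events in {alert['window']}",
    ]
    return "\n".join(lines)

alert = {
    "rule": "Multiple failed SSH logins",
    "src_ip": "203.0.113.7",
    "dst_ip": "10.0.0.12",
    "count": 48,
    "window": "5m",
}
print(triage_prompt(alert))
# Locally, the prompt would then go to the private model, e.g.:
# response = slm.generate(triage_prompt(alert), max_tokens=128)
```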

&lt;p&gt;&lt;strong&gt;Why SLMs Win:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Sub-100ms inference for analysts&lt;/li&gt;
&lt;li&gt;No leakage of attack data&lt;/li&gt;
&lt;li&gt;Smaller prompt‑injection attack surface thanks to domain‑bounded training&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;
  
  
  SaaS (Cost-Controlled, Embedded AI)
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;Key Drivers:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Margin pressure from LLM APIs&lt;/li&gt;
&lt;li&gt;Customer data isolation&lt;/li&gt;
&lt;li&gt;Need for predictable unit economics&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;SLM Use Cases:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;In-app copilots&lt;/li&gt;
&lt;li&gt;Customer support automation&lt;/li&gt;
&lt;li&gt;Knowledge base Q&amp;amp;A&lt;/li&gt;
&lt;li&gt;Workflow agents&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Why SLMs Win:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Fixed cost per tenant&lt;/li&gt;
&lt;li&gt;On-prem or VPC isolation per customer&lt;/li&gt;
&lt;li&gt;Competitive differentiation via privacy&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;SLMs are not a downgrade from LLMs; they are a &lt;strong&gt;strategic correction&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Organisations that adopt SLMs early will control their AI stack, reduce long-term cost, and stay ahead of regulatory pressure. This is the architecture that will quietly power the next decade of enterprise AI.&lt;/p&gt;

&lt;div class="ltag-github-readme-tag"&gt;
  &lt;div class="readme-overview"&gt;
    &lt;h2&gt;
      &lt;img src="https://assets.dev.to/assets/github-logo-5a155e1f9a670af7944dd5e12375bc76ed542ea80224905ecaf878b9157cdefc.svg" alt="GitHub logo"&gt;
      &lt;a href="https://github.com/israeltn" rel="noopener noreferrer"&gt;
        israeltn
      &lt;/a&gt; / &lt;a href="https://github.com/israeltn/Fine-Tuned-Qwen2.5-1.5B-Medical-Lab-Test-Analysis" rel="noopener noreferrer"&gt;
        Fine-Tuned-Qwen2.5-1.5B-Medical-Lab-Test-Analysis
      &lt;/a&gt;
    &lt;/h2&gt;
  &lt;/div&gt;
  &lt;div class="ltag-github-body"&gt;
    
&lt;div id="readme" class="md"&gt;
&lt;div class="markdown-heading"&gt;
&lt;h1 class="heading-element"&gt;Towards Efficient Clinical Reasoning: Adapting Distilled Reasoning Models for Laboratory Diagnostics in Resource-Constrained Healthcare Environments&lt;/h1&gt;
&lt;/div&gt;

&lt;p&gt;&lt;strong&gt;Background:&lt;/strong&gt; Clinical decision support in African healthcare settings is often limited by a lack of specialized personnel and the high computational costs associated with modern AI. While Large Language Models (LLMs) offer reasoning capabilities, their deployment is hindered by hardware constraints and data privacy concerns in remote regions. This study evaluates the performance and efficiency of a distilled reasoning model tailored for automated laboratory result analysis in the Nigerian health infrastructure.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Design/Methods:&lt;/strong&gt; We developed Med-Lab-FineTuned-Qwen2.5-1.5B by adapting the Qwen2.5-1.5B-Instruct model using Low-Rank Adaptation (LoRA) and 4-bit NormalFloat quantization. The model was trained on a structured dataset of laboratory diagnostics to identify abnormalities and provide clinical recommendations using a Short-Chain-of-Thought (Short-CoT) strategy. To ensure deployment scalability in constrained environments such as lab software and hospital edge devices, the model was converted to GGUF format (q4_k_m). This…&lt;/p&gt;
&lt;/div&gt;


&lt;/div&gt;
&lt;br&gt;
  &lt;div class="gh-btn-container"&gt;&lt;a class="gh-btn" href="https://github.com/israeltn/Fine-Tuned-Qwen2.5-1.5B-Medical-Lab-Test-Analysis" rel="noopener noreferrer"&gt;View on GitHub&lt;/a&gt;&lt;/div&gt;
&lt;br&gt;
&lt;/div&gt;





</description>
      <category>ai</category>
      <category>opensource</category>
      <category>privacy</category>
      <category>cybersecurity</category>
    </item>
  </channel>
</rss>
