Musa Nayyer

Posted on Jun 9 • Originally published at muzasio.hashnode.dev

Local AI SOC Server on a Laptop - What Actually Broke

#linux #ollama #cybersecurity #infosec

I'm a cybersecurity student. No budget, no enterprise hardware, no client environments.

I set up my laptop as a local AI server for SOC automation: log triage, CVE analysis, security bots. The internet made it look straightforward. It wasn't. This is the honest version of that story.

One thing upfront: everything here ran on self-generated test data and my own lab logs. No client data, no sensitive information, nothing real. I'm documenting a learning build, not a deployment guide. Don't point this at anyone's actual environment until you've properly hardened it beyond what I did here.

The OS Decision - What I Considered and Why It Actually Matters

Most answers online say "just use Ubuntu" with zero justification. Here's the real breakdown.

Ubuntu 24.04 LTS - What I Used

This was the right call for my situation. NVIDIA driver support is the most mature here, nvidia-container-toolkit installs without a fight, and when something breaks at 2am there's a Stack Overflow thread for it.

The downside: blank slate for security work. You're manually installing nmap, gobuster, everything. For a dedicated inference server that's fine, less noise. For a daily driver where you also want to pentest, it gets tedious.

Parrot OS / Kali Linux - Worth Considering

Both are Debian-based and ship with the full offensive toolkit pre-installed. Parrot is lighter and more workable alongside inference workloads. Kali is the standard for dedicated pentest infrastructure.

The appeal: your AI stack and your security toolchain live in the same environment. No dependency wrangling between two systems. The catch: don't run heavy inference and an active nmap scan simultaneously without capping container resources. They'll fight over RAM and both degrade.

Proxmox VE - Good Idea, Painful on Laptops

Turns the machine into a bare-metal hypervisor. Ubuntu for Ollama, Kali for tooling, LXC containers for bots, all isolated, all snapshotted. Great for client work in theory. GPU passthrough on a laptop is genuinely painful though: IOMMU, VFIO config, WiFi driver issues in guests. Skipped it for that reason.

NixOS - Long-Term Interesting, Steep Curve

Entire server state lives in one declarative file. Reproducible, version-controlled, no config drift. Compelling on paper but CUDA/GPU support requires specific Nix module knowledge. Worth it eventually if you're comfortable with infrastructure-as-code. Not where I started.

Model Stack - What the Internet Says vs. What I Hit

You'll see this configuration recommended everywhere:

Workload	Model	Approx. RAM (inference only)
Code audit / vuln review	`deepseek-coder-v2`	~10 GB (16B)
Log analysis / SOC triage	`deepseek-r1`	6-9 GB (7B-14B)
General reasoning / recon	`llama3.1` or `mistral`	~5 GB (7B-8B)
Automation bots / agents	`qwen2.5-coder`	~5 GB (7B)

What the table doesn't tell you: these numbers cover the model alone, with nothing else running. In a real pipeline, with ingestion scripts, scan tools, and a bot running concurrently, add 4-6 GB on top. On older or low-spec hardware, even 7B models stall after a few runs if the settings aren't tuned.
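The arithmetic is worth doing before pulling a model. Here is a rough sketch in Python, using the table's figures and the upper end of the 4-6 GB pipeline overhead. The `os_reserve_gb` value and the function names are my own assumptions, not measurements.

```python
def required_ram_gb(model_gb: float,
                    pipeline_overhead_gb: float = 6.0,
                    os_reserve_gb: float = 2.0) -> float:
    """Estimate total RAM needed: model + pipeline tooling + OS headroom.

    pipeline_overhead_gb defaults to the upper end of the 4-6 GB range.
    os_reserve_gb is an assumed placeholder; measure it on your own machine.
    """
    return model_gb + pipeline_overhead_gb + os_reserve_gb


def fits_in_ram(model_gb: float, total_ram_gb: float) -> bool:
    """Return True if the estimated requirement fits in the machine's RAM."""
    return required_ram_gb(model_gb) <= total_ram_gb


# A ~5 GB 7B model next to a full pipeline on a 16 GB laptop.
print(required_ram_gb(5.0), fits_in_ram(5.0, 16.0))
# The ~10 GB deepseek-coder-v2 row on the same laptop.
print(required_ram_gb(10.0), fits_in_ram(10.0, 16.0))
```

If the second check fails on your hardware, either drop to a smaller model for that workload or stop running the other tools while it's loaded.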

My actual experience: the models needed more manual configuration than any tutorial mentioned. Default settings caused the model to get stuck mid-automation after a few cycles. Context window size, temperature, and output format all needed tuning before anything ran reliably.
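For reference, those three knobs map onto fields in the Ollama `/api/generate` request body: `options.num_ctx`, `options.temperature`, and `format`. A minimal sketch of setting them explicitly follows. The specific values are illustrative assumptions, not the settings I ended up with, and `build_generate_payload` is a hypothetical helper name.

```python
import json


def build_generate_payload(prompt: str,
                           model: str = "llama3.1",
                           num_ctx: int = 4096,
                           temperature: float = 0.1,
                           json_output: bool = True) -> dict:
    """Build a request body for Ollama's /api/generate with explicit settings.

    num_ctx and temperature values here are assumed starting points to tune.
    """
    payload = {
        "model": model,
        "prompt": prompt,
        "stream": False,
        "options": {
            "num_ctx": num_ctx,          # context window size in tokens
            "temperature": temperature,  # lower means more repeatable output
        },
    }
    if json_output:
        # Ask Ollama to constrain the reply to valid JSON.
        payload["format"] = "json"
    return payload


print(json.dumps(build_generate_payload("Analyze this log entry: [LOG]"), indent=2))
```

Keeping the settings in one function means every script in the pipeline sends the same configuration, which makes stalls easier to reproduce.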

Q4_K_M quantization is a solid starting point, best quality-to-size ratio for most workloads. But treat the whole table as a starting baseline, not a guarantee.


The Automation Pipeline

The basic setup I wired together:

Calling the Ollama API is simple enough:

```bash
curl http://localhost:11434/api/generate \
  -d '{"model": "llama3.1", "prompt": "Analyze this log entry for anomalies: [LOG]", "stream": false}'
```

The curl call works fine in isolation. Getting it to behave consistently inside an automated pipeline is a different problem; more on that below.
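Inside a script, the same call looks roughly like this in Python, using only the standard library. This is a sketch, not the exact code from my build: the `generate` helper, the 120-second timeout, and the injectable `opener` parameter (there so the function can be exercised without a running server) are all assumptions.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"


def generate(prompt: str, model: str = "llama3.1", timeout: float = 120.0,
             opener=urllib.request.urlopen) -> str:
    """POST a non-streaming request to Ollama and return the generated text."""
    body = json.dumps({"model": model, "prompt": prompt, "stream": False})
    request = urllib.request.Request(
        OLLAMA_URL,
        data=body.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )
    # The timeout keeps one stuck generation from hanging the whole pipeline.
    with opener(request, timeout=timeout) as response:
        reply = json.loads(response.read().decode("utf-8"))
    # With "stream": false, the text comes back in the "response" field.
    return reply.get("response", "")


# Usage, with Ollama running locally:
# print(generate("Analyze this log entry for anomalies: [LOG]"))
```

The explicit timeout is the main difference from the one-off curl call: a pipeline needs to fail fast and move on rather than wait indefinitely.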

For visual workflow automation, n8n self-hosted in Docker handles the glue between file watchers, Ollama calls, and notifications well. For CLI piping, Fabric is cleaner: `cat auth.log | fabric --pattern analyze_logs` just works, with less overhead.

One Thing I'd Do Differently

Isolate Ollama in Docker from the start. I didn't, and when an automation script misbehaved it had more filesystem access than it should have. On a personal lab machine that's annoying. On anything with real data it would be a serious problem.
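What that isolation could look like, sketched as a small Python helper that assembles the `docker run` command. It uses the `ollama/ollama` image with a named volume, so the container only sees its own model store. The memory and CPU caps are placeholder values I picked for illustration, and `--gpus all` assumes nvidia-container-toolkit is installed.

```python
import shlex


def ollama_docker_command(memory: str = "12g", cpus: str = "4",
                          use_gpu: bool = True, name: str = "ollama") -> list:
    """Assemble a docker run command for an isolated Ollama container.

    memory and cpus are assumed example caps; size them for your hardware.
    """
    argv = [
        "docker", "run", "-d",
        "--name", name,
        # Publish the API on loopback only, never on all interfaces.
        "-p", "127.0.0.1:11434:11434",
        # Named volume for models; no host directories are mounted.
        "-v", "ollama:/root/.ollama",
        # Resource caps so inference can't starve other tools.
        "--memory", memory,
        "--cpus", cpus,
    ]
    if use_gpu:
        argv += ["--gpus", "all"]
    argv.append("ollama/ollama")
    return argv


# Print the command for review before running it by hand.
print(shlex.join(ollama_docker_command()))
```

The resource caps also address the RAM contention mentioned earlier when inference and scan tools run side by side.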

For anyone building something similar, even just for learning:

- Default-deny at the container network level
- Never expose port 11434 outside localhost
- Put Nginx with auth in front if you need remote access
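The localhost rule is easy to break by accident, for example by setting `OLLAMA_HOST` to `0.0.0.0` while debugging. Below is a small pre-flight check in Python that a pipeline could run at startup. The `is_loopback_only` function is a hypothetical helper, and it deliberately treats anything it can't parse as unsafe.

```python
import ipaddress
import os


def is_loopback_only(bind: str) -> bool:
    """Return True only if the bind address is clearly a loopback address."""
    host = bind.strip()
    # Drop a trailing ":port" from forms like "127.0.0.1:11434".
    if host.count(":") == 1:
        host = host.split(":", 1)[0]
    if host == "localhost":
        return True
    try:
        return ipaddress.ip_address(host).is_loopback
    except ValueError:
        # Hostnames, URLs, and empty strings are treated as unsafe.
        return False


# Ollama binds to 127.0.0.1:11434 when OLLAMA_HOST is unset.
bind = os.environ.get("OLLAMA_HOST", "127.0.0.1:11434")
if not is_loopback_only(bind):
    print(f"WARNING: OLLAMA_HOST={bind!r} may expose the API beyond localhost")
```

This only checks the configured value, not what is actually listening, so it complements a firewall rule rather than replacing it.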

The Part Nobody Talks About - The Actual Hard Part

The bottleneck isn't inference speed. It's not even which model you pick.

It's getting structured, parseable output consistently enough to feed into downstream scripts.

Models are inconsistent by default. Valid JSON for 20 runs, then a plain sentence on run 21, then a half-finished response on run 22. Your pipeline breaks on the edge cases, not the happy path.
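One way to survive those edge cases is to validate every reply and retry a bounded number of times before giving up. Here is a minimal Python sketch. The function names, the required-keys check, and the three-attempt limit are my own assumptions, and `call_model` stands in for whatever function sends the prompt to Ollama.

```python
import json


def parse_model_json(raw, required_keys):
    """Return the parsed dict, or None if raw isn't a JSON object with all keys."""
    try:
        data = json.loads(raw)
    except (json.JSONDecodeError, TypeError):
        return None  # plain sentence, truncated JSON, or a non-string reply
    if not isinstance(data, dict):
        return None  # valid JSON, but not an object (e.g. a bare string)
    if any(key not in data for key in required_keys):
        return None  # an object, but missing fields downstream code needs
    return data


def ask_until_valid(call_model, prompt, required_keys, max_attempts=3):
    """Call the model up to max_attempts times; return the first valid reply."""
    for _ in range(max_attempts):
        parsed = parse_model_json(call_model(prompt), required_keys)
        if parsed is not None:
            return parsed
    return None  # caller decides: skip the entry, log it, or alert
```

Returning `None` instead of raising keeps the failure explicit, so one bad reply becomes a skipped entry rather than a crashed run.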

Prompt engineering for reliability matters more than model size here. Inside an automation pipeline, a smaller model with a well-structured prompt will usually be more dependable than a larger model with a vague one.
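To make "well-structured" concrete, here is a sketch of a prompt builder in Python that spells out the output contract instead of just asking for analysis. The field names (`severity`, `summary`, `indicators`) are a hypothetical schema for illustration, not one taken from my build.

```python
def build_triage_prompt(log_line: str) -> str:
    """Build a prompt that states the exact JSON shape the pipeline expects."""
    return (
        "You are a SOC triage assistant.\n"
        "Respond with one JSON object and nothing else.\n"
        'Required keys: "severity" (one of "low", "medium", "high"), '
        '"summary" (one sentence), "indicators" (list of strings).\n'
        "Do not add markdown, code fences, or commentary.\n"
        "Log entry:\n"
        f"{log_line}"
    )


print(build_triage_prompt("Failed password for root from 203.0.113.5 port 22 ssh2"))
```

Compare that with the vague version, "Analyze this log entry for anomalies", which leaves the model free to answer in whatever shape it likes.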

That's the thing I'd tell myself at the start.
