Every tutorial about building an AI assistant starts the same way:
pip install openai. Get an API key. Send everything to GPT. Done.
Now your assistant costs money per message, dies without internet, and you have zero control over what happens between the question and the answer.
I needed something different.
PC Workman is an open-source system monitor that runs on people's PCs. Some of those PCs are behind firewalls.
Some users don't want their system data leaving their machine.
So I built hck_GPT... a hybrid AI assistant that works fully offline,
handles Polish and English without a language model, routes messages through a 5-step decision chain, blends keyword scoring with
ML confidence, remembers what it said 3 messages ago, and optionally talks to a local LLM (Ollama) when it's available.
No cloud. No API key. No pip install magic.
Here's how it works. All of it.
1. Intent Scoring - One Algorithm for Two Languages, Zero NLP Libraries
The first problem: how do you figure out what someone is asking without shipping a 500MB NLP model?
Most intent parsers use spaCy, NLTK, or a transformer. I used string matching. But not the dumb kind.
```python
def _score_intent(self, tokens: List[str], full_text: str,
                  patterns: List[str]) -> float:
    score = 0.0
    for pattern in patterns:
        if " " in pattern:
            # Multi-word phrase -> bonus proportional to word count
            if pattern in full_text:
                score += len(pattern.split()) * 1.5  # "why is ram so high" = 5 x 1.5
        else:
            if pattern in tokens:
                score += 1.0  # exact match
            elif any(
                t.startswith(pattern) or pattern.startswith(t)
                for t in tokens if len(t) >= 3 and len(pattern) >= 3
            ):
                score += 0.4  # prefix match ("perfor" -> "performance")
            elif len(pattern) >= 5 and any(
                self._edit_distance(t, pattern) <= 1
                for t in tokens if abs(len(t) - len(pattern)) <= 2 and len(t) >= 4
            ):
                score += 0.6  # typo tolerance: 1-char edit distance
    return score
```
Three levels of matching in a single loop.
Exact match gives 1.0 - "cpu" in tokens, done.
Prefix match gives 0.4 - "perfor" matches "performance".
Typo tolerance gives 0.6 - a Levenshtein distance of 1 catches "temperture" → "temperature" without a spell checker.
Multi-word phrases get a bonus proportional to word count: "what changed in performance" (4 words x 1.5 = 6 pts) beats a standalone "performance" (1 pt). More specific queries earn higher confidence.
This is also how the parser handles both languages without a language detector: Polish and English vocabulary patterns are defined separately, but the scoring algorithm is identical.
The edit distance function has an early exit: if the length difference between two strings is more than 2, it skips the computation entirely. That matters because scoring runs on every message in a GUI thread - blocking for 200ms on a distance calculation would feel like lag.
Zero dependencies. Works in both languages. Runs in under 1ms.
2. Hybrid Confidence - When to Trust ML, When to Trust Keywords
If you have a keyword scorer AND an ML classifier, you need to decide who wins. Most systems pick one. I blend them.
```python
def _blend_with_ml(self, text: str, kw_intent: str, kw_conf: float):
    ml_intent, ml_conf = ml_classifier.predict(text)
    if ml_conf >= 0.70:
        # ML very confident -> trust it outright
        return ml_intent, ml_conf
    if ml_conf >= 0.35:
        if ml_intent == kw_intent:
            # Agreement -> amplify the signal
            blended = 0.65 * ml_conf + 0.35 * kw_conf
            return ml_intent, min(1.0, blended)
        else:
            # Disagreement -> pick the stronger weighted signal
            ml_score = 0.65 * ml_conf
            kw_score = 0.35 * kw_conf
            winner = ml_intent if ml_score >= kw_score else kw_intent
            conf = min(1.0, max(ml_score, kw_score) + 0.15 * min(ml_score, kw_score))
            return winner, conf
    # ML not confident enough -> fall back to pure keyword scoring
    return kw_intent, kw_conf
```
Three trust zones instead of one threshold.
Above 0.70 - ML is confident. Trust it. Keywords don't get a vote.
Between 0.35 and 0.69 - the interesting zone. If both models agree on the intent, the blended confidence goes up (agreement amplifies the signal). If they disagree, the loser still gets 15% of the vote.
That 15% matters: it prevents the winner from being overconfident on ambiguous inputs.
Below 0.35 - ML doesn't know. Pure keyword scoring takes over.
The 0.65/0.35 weight split is intentional.
ML gets more weight because when it's right, it understands full semantics. Keywords only match surface patterns.
But keywords are more reliable on short, domain-specific inputs like
"cpu" or "mb" where ML tends to underperform.
3. The Routing Decision - 5 Steps from Message to Answer
This is the core of the whole engine. Every message follows the same path:
```python
RULE_THRESHOLD = 0.60  # above -> deterministic rule engine
LOW_THRESHOLD = 0.20   # below -> not worth even trying

def process(self, msg: str, result: ParseResult, lang: str = "pl"):
    confidence = result.confidence
    intent = result.intent

    # Step 1: Open-ended queries -> LLM first
    if intent in {"small_talk", "unknown"}:
        if self._check_available():
            return self._query_llm(msg, lang, result)
        return response_builder.build(result, lang)  # graceful degradation

    # Step 2: High confidence -> deterministic, instant answer
    if confidence >= RULE_THRESHOLD:
        resp = response_builder.build(result, lang)
        if resp:
            return resp

    # Step 3: Medium confidence -> try LLM for smarter answer
    if self._check_available():
        llm_resp = self._query_llm(msg, lang, result)
        if llm_resp:
            return llm_resp

    # Step 4: LLM unavailable -> rule engine best-effort
    if confidence >= LOW_THRESHOLD:
        return response_builder.build(result, lang)

    # Step 5: Nothing works -> caller handles it
    return None
```
No single point of failure. Every step has a fallback.
Type "cpu" → keyword confidence is 0.95 → Step 2 handles it instantly. Rule engine. Deterministic. No LLM involved.
Type "why is my computer acting weird today" → keyword confidence is 0.30 → Step 2 skips → Step 3 sends it to Ollama with full system context → if Ollama is down, Step 4 gives a best-effort rule response.
Type "how are you" → intent is "small_talk" → Step 1 sends it straight to LLM → if unavailable, rule engine returns a canned friendly response.
I didn't want Ollama for everything - it's slow (2-5 seconds). I didn't want if/else for everything - it can't handle questions I didn't anticipate. So I gave every message a confidence score and let the score decide which engine handles it.
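The whole chain boils down to a small decision table. Here it is as a pure function (a simplified sketch with names of my own choosing - it ignores the case where the rule engine returns nothing at Step 2):

```python
RULE_THRESHOLD, LOW_THRESHOLD = 0.60, 0.20

def route(intent: str, confidence: float, llm_up: bool) -> str:
    """Which engine answers, following the same five steps as process()."""
    if intent in {"small_talk", "unknown"}:      # Step 1: open-ended -> LLM first
        return "llm" if llm_up else "rules"
    if confidence >= RULE_THRESHOLD:             # Step 2: high confidence -> rules
        return "rules"
    if llm_up:                                   # Step 3: medium confidence -> LLM
        return "llm"
    if confidence >= LOW_THRESHOLD:              # Step 4: best-effort rules
        return "rules (best effort)"
    return "none"                                # Step 5: caller handles it

print(route("hw_cpu", 0.95, llm_up=False))    # rules
print(route("why_slow", 0.30, llm_up=True))   # llm
print(route("why_slow", 0.30, llm_up=False))  # rules (best effort)
```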
4. Intent-Aware Temperature - The Surgeon vs The Friend
Every AI tutorial I've seen uses one temperature setting. Global. Static.
Same creativity level whether you're asking about CPU specs or saying hello.
I use 20 different temperatures.
```python
_INTENT_TEMPERATURE: Dict[str, float] = {
    # Factual / diagnostic - precision over creativity
    "hw_cpu": 0.35,          # wrong MHz is bad advice
    "hw_gpu": 0.35,
    "temperature": 0.35,
    "throttle_check": 0.35,
    "stats": 0.35,
    "disk_health": 0.35,
    "ram_why_high": 0.40,
    "why_slow": 0.45,
    # Optimization - some creativity welcome
    "performance": 0.55,
    "optimization": 0.55,
    "speed_up_pc": 0.55,
    "health_check": 0.50,
    # Conversational - warmth and personality
    "small_talk": 0.80,      # "how are you" deserves a human answer
    "about_program": 0.65,
    "help": 0.60,
}

temperature = _INTENT_TEMPERATURE.get(intent, TEMPERATURE)  # default 0.72
```
"What CPU do I have?" -> 0.35. I want a surgeon.
The answer is a fact: Intel Core i7-12700H, 14 cores, 4.7 GHz boost.
There is exactly one correct answer.
Creativity here means wrong numbers.
"How are you?" -> 0.80. I want a friend.
Warmth, personality, maybe a joke.
"Why is my PC slow?" -> 0.45. Somewhere between.
The diagnostic needs to be accurate, but the explanation should be readable, not a data dump.
One line of code.
Massive difference in response quality.
Probably the highest value-per-line-of-code in the entire engine.
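Wiring that lookup into a raw Ollama request looks roughly like this. A sketch against Ollama's `/api/generate` endpoint; the helper names, the subset dict, and the hard-coded 0.72 default are my assumptions:

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"
DEFAULT_TEMPERATURE = 0.72

# Subset of the full intent-temperature table above
_INTENT_TEMPERATURE = {"hw_cpu": 0.35, "why_slow": 0.45, "small_talk": 0.80}

def build_payload(model: str, prompt: str, intent: str) -> dict:
    """Non-streaming /api/generate request with intent-aware temperature."""
    return {
        "model": model,
        "prompt": prompt,
        "stream": False,  # one JSON object back, not a chunk stream
        "options": {"temperature": _INTENT_TEMPERATURE.get(intent, DEFAULT_TEMPERATURE)},
    }

def generate(model: str, prompt: str, intent: str) -> str:
    req = urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(build_payload(model, prompt, intent)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=30) as resp:
        return json.loads(resp.read())["response"]
```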
5. Fallback Chain with Availability Cache
Ollama runs locally but it's not always there.
Maybe the user didn't install it. Maybe the model is loading. Maybe it crashed.
Your assistant can't hang for 30 seconds waiting to find out.
```python
AVAILABILITY_TTL = 300  # re-check Ollama every 5 minutes

def _check_available(self) -> bool:
    now = time.time()
    # Respect the cool-down from previous failures
    if now < self._temp_unavail_until:
        return False
    # Cache: don't ping /api/tags on every message
    if self._available is None or (now - self._available_checked_at) > AVAILABILITY_TTL:
        self._available = self._ollama.is_available()
        self._available_checked_at = now
        if self._available:
            self._pick_best_model()  # llama3.2 > mistral > phi3 > gemma > first available
    return bool(self._available)

def _query_llm(self, msg, lang, result):
    try:
        raw = self._ollama.generate(model=self.model, ...)
    except Exception:
        self._temp_unavail_until = time.time() + 60  # crash -> 60s cool-down
        return None
    if not raw:
        self._temp_unavail_until = time.time() + 30  # empty -> model loading, 30s
        return None
    return self._format_response(raw, lang)
```
Two levels of cool-down. Empty response (model loading) -> **30 seconds**. Exception (crash, timeout) -> **60 seconds**. Full availability re-check -> **every 5 minutes**.
The user never waits for a ping. If Ollama was unavailable 10 seconds ago, the engine routes to rules immediately.
Answer in milliseconds instead of a spinner.
_pick_best_model() runs once on availability check:
llama3.2 > mistral > phi3 > gemma > whatever is installed.
If you have multiple models, hck_GPT picks the best one automatically.
6. Session Memory - AI That Remembers What It Said 3 Messages Ago
Most chatbot tutorials treat every message as independent.
User asks about RAM. Gets an answer.
User asks about system health. Gets a generic response that doesn't mention the RAM issue from 30 seconds ago.
hck_GPT remembers.
```python
# When answering a RAM question - save the data:
session_memory.record_response_data("hw_ram", {
    "total_gb": 16,
    "speed": 3200,
    "current_pct": 78,
    "typical_avg": 51,
})

# Later, in a different handler (health_check) - reference it:
def _resp_health_check(self, r, lang):
    ram_sess = session_memory.get_response_data("hw_ram")
    if ram_sess.get("total_gb") and ram > 70:
        lines.append(
            f" (Your {ram_sess['total_gb']} GB RAM we discussed earlier"
            f" - now at {ram:.0f}%, that's worth watching)"
        )
```
No SQLite. No JSON file. Just a dict: `{intent: {key: val, recorded_at: timestamp}}`.
Lives in RAM for the session. When the app closes, it's gone.
That's intentional: session memory shouldn't outlive the session.
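A minimal sketch of that store, assuming only the two methods used above and the dict shape just described (everything else here is mine):

```python
import time

class SessionMemory:
    """Per-session, in-RAM store: {intent: {**data, "recorded_at": ts}}."""

    def __init__(self) -> None:
        self._data: dict[str, dict] = {}

    def record_response_data(self, intent: str, data: dict) -> None:
        self._data[intent] = {**data, "recorded_at": time.time()}

    def get_response_data(self, intent: str) -> dict:
        # Empty dict when nothing was recorded, so callers can .get() safely
        return self._data.get(intent, {})

session_memory = SessionMemory()
session_memory.record_response_data("hw_ram", {"total_gb": 16, "current_pct": 78})
print(session_memory.get_response_data("hw_ram")["total_gb"])  # 16
print(session_memory.get_response_data("hw_cpu"))              # {}
```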
There's also trend tracking fed by snapshots every 30 seconds:
```python
def get_trend(self, metric: str = "cpu") -> str:
    buf = self._cpu_trend if metric == "cpu" else self._ram_trend
    readings = list(buf)  # circular buffer of last 8 readings
    if len(readings) < 4:
        return "stable"
    mid = len(readings) // 2
    first_avg = sum(readings[:mid]) / mid
    second_avg = sum(readings[mid:]) / (len(readings) - mid)
    delta = second_avg - first_avg
    if delta > 5:
        return "rising"
    if delta < -5:
        return "falling"
    return "stable"
```
Split the buffer in half. Compare averages. Rising, falling, or stable. This lets hck_GPT say "your CPU has been climbing for the last few minutes" instead of just showing the current number.
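The same split-and-compare idea runs standalone with a `collections.deque` as the circular buffer (the free function is my stand-in for the method above):

```python
from collections import deque

def get_trend(buf: deque) -> str:
    """Halve the buffer, compare averages, classify with a 5-point dead zone."""
    readings = list(buf)
    if len(readings) < 4:
        return "stable"  # not enough data to call a trend
    mid = len(readings) // 2
    first_avg = sum(readings[:mid]) / mid
    second_avg = sum(readings[mid:]) / (len(readings) - mid)
    delta = second_avg - first_avg
    if delta > 5:
        return "rising"
    if delta < -5:
        return "falling"
    return "stable"

cpu = deque(maxlen=8)  # one reading every 30s -> 4 minutes of history
for pct in [20, 22, 25, 30, 41, 48, 55, 60]:
    cpu.append(pct)
print(get_trend(cpu))  # rising (first-half avg 24.25, second-half avg 51.0)
```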
7. System Prompt with Live PC Context — Facts, Not Vibes
When a message reaches Ollama, the LLM doesn't guess about your system.
It gets injected context built from real data:
```
== Live System State ==
CPU: 67% @ 3840 MHz / max 4800 MHz
RAM: 71% (11.4/16.0 GB used, 4.6 GB free)
Disk C: 48.3 GB free / 476.0 GB total

== Today's Averages ==
CPU avg: 34.2%  peak: 89.0%
RAM avg: 68.1%

== Top Processes (by CPU) ==
chrome.exe   CPU 18.4%  RAM 892 MB
code.exe     CPU 12.1%  RAM 441 MB
python.exe   CPU  3.2%  RAM 156 MB

== Hardware Profile ==
CPU: Intel Core i7-12700H (14 cores, boost 4.7 GHz)
GPU: NVIDIA GeForce RTX 3060  VRAM: 6 GB
RAM: 16.0 GB @ 3200 MHz

== Today vs Typical ==
CPU today 34% vs typical 28% /\ (+6%)
RAM today 71% vs typical 51% /\ (+20%)

== Learned Usage Patterns ==
Typical CPU avg (7-day baseline): 28%
Heaviest app this week: chrome.exe
```
Each section comes from a different module. psutil for live data.
SQLite for historical averages. WMI for hardware profile.
The Stats Engine for 7-day baselines.
The "Today vs Typical" section is the difference between an assistant that says "CPU 67%" and one that says "CPU 67% — but your normal is 28%, something is off." Context, not just data. That's the whole philosophy in two lines of a prompt.
What This Costs vs The OpenAI Approach
| | hck_GPT (this approach) | pip install openai |
|---|---|---|
| Dependencies | zero (beyond psutil) | openai, tiktoken, etc. |
| Model size | zero (rule engine is pure Python) | N/A (cloud) |
| API costs | zero | $0.002-0.06 per message |
| Latency | <1ms rules, 2-5s Ollama | 500-2000ms |
| Internet | not required | required |
| Privacy | data stays local | sent to OpenAI servers |
| RAM | ~5MB | N/A |
What I'd Do Differently
The intent vocabulary is hand-built.
~200 patterns across 25 intents.
Every time a user asks something I didn't anticipate,
I add patterns manually. This doesn't scale past maybe 50 intents.
A lightweight local classifier (even scikit-learn with TF-IDF) would handle the long tail better.
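For the curious, the long-tail idea doesn't even need scikit-learn. Here's a toy stdlib version - TF-IDF vectors plus cosine similarity against labeled examples - as a sketch of the direction, not the classifier I'd ship:

```python
import math
from collections import Counter

class TfidfIntentClassifier:
    """Toy TF-IDF + cosine-similarity intent classifier, stdlib only."""

    def fit(self, samples: list[tuple[str, str]]) -> "TfidfIntentClassifier":
        docs = [(label, text.lower().split()) for label, text in samples]
        n = len(docs)
        df = Counter(t for _, toks in docs for t in set(toks))
        # +1 keeps terms that appear everywhere from zeroing out
        self.idf = {t: math.log(n / df[t]) + 1.0 for t in df}
        self.vecs = [(label, self._vec(toks)) for label, toks in docs]
        return self

    def _vec(self, toks: list[str]) -> dict:
        tf = Counter(toks)
        return {t: (c / len(toks)) * self.idf.get(t, 0.0) for t, c in tf.items()}

    def predict(self, text: str) -> tuple[str, float]:
        q = self._vec(text.lower().split())
        best = max(self.vecs, key=lambda lv: self._cos(q, lv[1]))
        return best[0], self._cos(q, best[1])

    @staticmethod
    def _cos(a: dict, b: dict) -> float:
        dot = sum(v * b.get(t, 0.0) for t, v in a.items())
        na = math.sqrt(sum(v * v for v in a.values()))
        nb = math.sqrt(sum(v * v for v in b.values()))
        return dot / (na * nb) if na and nb else 0.0

clf = TfidfIntentClassifier().fit([
    ("hw_cpu", "what cpu do i have"),
    ("hw_cpu", "show my processor specs"),
    ("why_slow", "why is my computer so slow"),
    ("why_slow", "pc feels sluggish today"),
])
print(clf.predict("my computer is slow"))  # ('why_slow', <similarity>)
```

A real version would train on mined user queries and return calibrated confidences, but even this toy handles paraphrases that a hand-built pattern list misses.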
The confidence thresholds (0.60, 0.35, 0.20) were tuned by testing, not by math. They work for my use case but they're not optimal.
A proper evaluation dataset would help.
Session memory is per-session only. The user knowledge base (SQLite in AppData) stores hardware profiles and usage patterns, but conversation context doesn't persist across restarts yet.
Why I'm Writing This
Every "build an AI assistant" tutorial I found in 2026...
starts with pip install langchain or pip install openai.
Cloud API. Token costs. Internet required.
That's fine for some use cases.
But if you're building a desktop app that runs on someone else's machine, behind their firewall, on their data, you need a different approach.
This is that approach: a hybrid engine that's fast when it can be, smart when it needs to be, and never crashes because a server is down.
PC Workman is open source, MIT licensed.
The full AI engine is in `hck_gpt/` - intent parser, hybrid engine, response builder, session memory, context builder.
Read it, fork it, build something better.
Try It
Download the .exe (no Python needed): GitHub Releases
Read the source: github.com/HuckleR2003/PC_Workman_HCK
All my links: linktr.ee/marcin_firmuga
If you build something with a similar approach, or if you'd do it completely differently, I want to hear about it!
I'm Marcin Firmuga, solo developer and founder of HCK_Labs.
Building PC Workman publicly between retail shifts.
800+ hours, 25 GitHub stars, and one offline AI engine that grew 9 layers when I wasn't looking.



