How I Built an Offline AI Assistant in Python - No OpenAI, No LangChain, No Dependencies

Every tutorial about building an AI assistant starts the same way:
pip install openai. Get an API key. Send everything to GPT. Done.

Now your assistant costs money per message, dies without internet, and you have zero control over what happens between the question and the answer.

I needed something different.

PC Workman is an open-source system monitor that runs on people's PCs. Some of those PCs are behind firewalls.
Some users don't want their system data leaving their machine.

So I built hck_GPT... a hybrid AI assistant that works fully offline,
handles Polish and English without a language model, routes messages through a 5-step decision chain, blends keyword scoring with
ML confidence, remembers what it said 3 messages ago, and optionally talks to a local LLM (Ollama) when it's available.

No cloud. No API key. No pip install magic.
Here's how it works. All of it.


1. Intent Scoring - One Algorithm for Two Languages, Zero NLP Libraries

The first problem: how do you figure out what someone is asking without shipping a 500MB NLP model?

Most intent parsers use spaCy, NLTK, or a transformer. I used string matching. But not the dumb kind.

def _score_intent(self, tokens: List[str], full_text: str,
                  patterns: List[str]) -> float:
    score = 0.0
    for pattern in patterns:
        if " " in pattern:
            # Multi-word phrase -> bonus proportional to word count
            if pattern in full_text:
                score += len(pattern.split()) * 1.5   # "why is ram so high" = 5 x 1.5
        else:
            if pattern in tokens:
                score += 1.0                          # exact match
            elif any(
                t.startswith(pattern) or pattern.startswith(t)
                for t in tokens if len(t) >= 3 and len(pattern) >= 3
            ):
                score += 0.4                          # prefix match ("perfor" -> "performance")
            elif len(pattern) >= 5 and any(
                self._edit_distance(t, pattern) <= 1
                for t in tokens if abs(len(t) - len(pattern)) <= 2 and len(t) >= 4
            ):
                score += 0.6                          # typo tolerance: 1-char edit distance
    return score

Three levels of matching in a single loop.

Exact match gives 1.0 - "cpu" in tokens, done.
Prefix match gives 0.4 - "perfor" matches "performance".
Typo tolerance gives 0.6 - a Levenshtein distance of 1 catches "temperture" → "temperature" without a spell checker.

This is how the parser handles both languages without a language detector: Polish and English vocabulary patterns are defined separately, but the scoring algorithm is identical.

Multi-word phrases get a bonus proportional to word count. "what changed in performance" (4 words x 1.5 = 6 pts) beats a standalone "performance" (1 pt).
More specific queries earn higher confidence.

The edit distance function has an early-exit: if the length difference between two strings is more than 2, skip the computation entirely.
This matters because the scorer runs on every message in a GUI thread.
Blocking for 200ms on a distance calculation would feel like lag.
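
For reference, here is a minimal sketch of what that early-exit distance check can look like. This is my assumption about the shape of `_edit_distance`, not the project's exact implementation:

```python
def _edit_distance(self, a: str, b: str, max_dist: int = 2) -> int:
    # Early exit: if the lengths differ by more than max_dist, the distance
    # can't possibly be small enough, so skip the O(len(a) * len(b)) work.
    if abs(len(a) - len(b)) > max_dist:
        return max_dist + 1

    # Two-row Levenshtein distance - pure Python, no dependencies.
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]
```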

Zero dependencies. Works in both languages. Runs in under 1ms.


2. Hybrid Confidence - When to Trust ML, When to Trust Keywords

If you have a keyword scorer AND an ML classifier, you need to decide who wins. Most systems pick one. I blend them.

def _blend_with_ml(self, text: str, kw_intent: str, kw_conf: float):
    ml_intent, ml_conf = ml_classifier.predict(text)

    if ml_conf >= 0.70:
        # ML very confident — trust it outright
        return ml_intent, ml_conf

    if ml_conf >= 0.35:
        if ml_intent == kw_intent:
            # Agreement -> amplify the signal
            blended = 0.65 * ml_conf + 0.35 * kw_conf
            return ml_intent, min(1.0, blended)
        else:
            # Disagreement → pick the stronger weighted signal
            ml_score = 0.65 * ml_conf
            kw_score = 0.35 * kw_conf
            winner   = ml_intent if ml_score >= kw_score else kw_intent
            conf     = min(1.0, max(ml_score, kw_score) + 0.15 * min(ml_score, kw_score))
            return winner, conf

    # ML not confident enough - fall back to pure keyword scoring
    return kw_intent, kw_conf

Three trust zones instead of one threshold.

Above 0.70 - ML is confident. Trust it. Keywords don't get a vote.

Between 0.35 and 0.69 is the interesting zone. If both scorers agree on the intent, the blended confidence goes up (agreement amplifies the signal). If they disagree, the loser still gets 15% of the vote.
That 15% matters: it prevents the winner from being overconfident on ambiguous inputs.

Below 0.35 - ML doesn't know. Pure keyword scoring takes over.

The 0.65/0.35 weight split is intentional.
ML gets more weight because when it's right, it understands full semantics. Keywords only match surface patterns.
But keywords are more reliable on short, domain-specific inputs like
"cpu" or "mb" where ML tends to underperform.

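To make the zones concrete, here is a quick worked example with made-up confidences (illustrative numbers, not values logged from the real classifier):

```python
# Zone 1: ml_conf = 0.82 -> trust ML outright, keywords don't get a vote.

# Zone 2: agreement. ml_conf = 0.50, kw_conf = 0.90, same intent:
blended = 0.65 * 0.50 + 0.35 * 0.90        # = 0.64, higher than the ML confidence alone

# Zone 3: disagreement. ML says "performance" (0.40), keywords say "hw_ram" (0.95):
ml_score = 0.65 * 0.40                     # = 0.26
kw_score = 0.35 * 0.95                     # = 0.3325 -> keywords win
conf     = kw_score + 0.15 * ml_score      # = 0.3715, the loser keeps the winner humble
```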

3. The Routing Decision - 5 Steps from Message to Answer

This is the core of the whole engine. Every message follows the same path:

RULE_THRESHOLD = 0.60   # above -> deterministic rule engine
LOW_THRESHOLD  = 0.20   # below -> not worth even trying

def process(self, msg: str, result: ParseResult, lang: str = "pl"):
    confidence = result.confidence
    intent     = result.intent

    # Step 1: Open-ended queries -> LLM first
    if intent in {"small_talk", "unknown"}:
        if self._check_available():
            return self._query_llm(msg, lang, result)
        return response_builder.build(result, lang)  # graceful degradation

    # Step 2: High confidence -> deterministic, instant answer
    if confidence >= RULE_THRESHOLD:
        resp = response_builder.build(result, lang)
        if resp:
            return resp

    # Step 3: Medium confidence -> try LLM for smarter answer
    if self._check_available():
        llm_resp = self._query_llm(msg, lang, result)
        if llm_resp:
            return llm_resp

    # Step 4: LLM unavailable → rule engine best-effort
    if confidence >= LOW_THRESHOLD:
        return response_builder.build(result, lang)

    # Step 5: Nothing works → caller handles it
    return None

No single point of failure. Every step has a fallback.

Type "cpu" → keyword confidence is 0.95 → Step 2 handles it instantly. Rule engine. Deterministic. No LLM involved.

Type "why is my computer acting weird today" → keyword confidence is 0.30 → Step 2 skips → Step 3 sends it to Ollama with full system context → if Ollama is down, Step 4 gives a best-effort rule response.

Type "how are you" → intent is "small_talk" → Step 1 sends it straight to LLM → if unavailable, rule engine returns a canned friendly response.

I didn't want Ollama for everything; it's slow (2-5 seconds).
I didn't want if/else for everything; it can't handle questions I didn't anticipate.
So I gave every message a confidence score and let the score decide which engine handles it.
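
And when Step 5 returns None, the caller decides what to do. A hypothetical caller-side sketch (the widget and variable names are illustrative, not PC Workman's actual GUI code):

```python
result = intent_parser.parse(user_msg)
reply  = engine.process(user_msg, result, lang=current_lang)

if reply is None:
    # Step 5: nothing matched and no LLM - fall back to a gentle nudge.
    reply = ("I'm not sure what you mean - try asking about CPU, RAM, "
             "disk, temperatures, or performance.")

chat_view.append(reply)
```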


4. Intent-Aware Temperature - The Surgeon vs The Friend

Every AI tutorial I've seen uses one temperature setting. Global. Static.
Same creativity level whether you're asking about CPU specs or saying hello.

I use 20 different temperatures.

_INTENT_TEMPERATURE: Dict[str, float] = {
    # Factual / diagnostic - precision over creativity
    "hw_cpu":         0.35,   # wrong MHz is bad advice
    "hw_gpu":         0.35,
    "temperature":    0.35,
    "throttle_check": 0.35,
    "stats":          0.35,
    "disk_health":    0.35,
    "ram_why_high":   0.40,
    "why_slow":       0.45,

    # Optimization - some creativity welcome
    "performance":    0.55,
    "optimization":   0.55,
    "speed_up_pc":    0.55,
    "health_check":   0.50,

    # Conversational - warmth and personality
    "small_talk":     0.80,   # "how are you" deserves a human answer
    "about_program":  0.65,
    "help":           0.60,
}

temperature = _INTENT_TEMPERATURE.get(intent, TEMPERATURE)  # default 0.72

"What CPU do I have?" -> 0.35. I want a surgeon.
The answer is a fact: Intel Core i7-12700H, 14 cores, 4.7 GHz boost.
There is exactly one correct answer.
Creativity here means wrong numbers.

"How are you?" -> 0.80. I want a friend.
Warmth, personality, maybe a joke.

"Why is my PC slow?" -> 0.45. Somewhere between.
The diagnostic needs to be accurate, but the explanation should be readable, not a data dump.

One line of code.
Massive difference in response quality.
Probably the highest value-per-line-of-code in the entire engine.
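
For context, this is roughly how a per-intent temperature can be handed to Ollama's standard /api/generate endpoint with nothing but the standard library. The function and its wiring are a sketch, not PC Workman's actual client:

```python
import json
import urllib.request

def query_ollama(prompt: str, model: str, intent: str) -> str:
    temperature = _INTENT_TEMPERATURE.get(intent, 0.72)   # the one important line
    payload = json.dumps({
        "model": model,
        "prompt": prompt,
        "stream": False,
        "options": {"temperature": temperature},
    }).encode("utf-8")
    req = urllib.request.Request(
        "http://localhost:11434/api/generate",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=30) as resp:
        return json.loads(resp.read())["response"]
```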


5. Fallback Chain with Availability Cache

Ollama runs locally but it's not always there.
Maybe the user didn't install it.
Maybe the model is loading. Maybe it crashed.
Your assistant can't hang for 30 seconds waiting to find out.

AVAILABILITY_TTL = 300   # re-check Ollama every 5 minutes

def _check_available(self) -> bool:
    now = time.time()

    # Respect the cool-down from previous failures
    if now < self._temp_unavail_until:
        return False

    # Cache: don't ping /api/tags on every message
    if self._available is None or (now - self._available_checked_at) > AVAILABILITY_TTL:
        self._available            = self._ollama.is_available()
        self._available_checked_at = now
        if self._available:
            self._pick_best_model()   # llama3.2 > mistral > phi3 > gemma > first available

    return bool(self._available)

def _query_llm(self, msg, lang, result):
    try:
        raw = self._ollama.generate(model=self.model, ...)
    except Exception:
        self._temp_unavail_until = time.time() + 60   # crash -> 60s cool-down
        return None

    if not raw:
        self._temp_unavail_until = time.time() + 30   # empty -> model loading, 30s
        return None

    return self._format_response(raw, lang)

Two levels of cool-down. Empty response (model loading) -> 30 seconds. Exception (crash, timeout) -> 60 seconds.
Full availability re-check -> every 5 minutes.

The user never waits for a ping. If Ollama was unavailable 10 seconds ago, the engine routes to rules immediately.
Answer in milliseconds instead of a spinner.

_pick_best_model() runs once on availability check:
llama3.2 > mistral > phi3 > gemma > whatever is installed.
If you have multiple models, hck_GPT picks the best one automatically.
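
A minimal sketch of that preference order, assuming a helper that lists installed model names (the real _pick_best_model may differ):

```python
MODEL_PREFERENCE = ["llama3.2", "mistral", "phi3", "gemma"]

def _pick_best_model(self) -> None:
    # Assumed helper returning names like ["mistral:7b", "phi3:mini"]
    installed = self._ollama.list_models()
    for preferred in MODEL_PREFERENCE:
        match = next((name for name in installed if name.startswith(preferred)), None)
        if match:
            self.model = match
            return
    if installed:
        self.model = installed[0]   # otherwise: first available
```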


6. Session Memory - AI That Remembers What It Said 3 Messages Ago

Most chatbot tutorials treat every message as independent.
User asks about RAM. Gets an answer.
User asks about system health. Gets a generic response that doesn't mention the RAM issue from 30 seconds ago.
hck_GPT remembers.

# When answering a RAM question - save the data:
session_memory.record_response_data("hw_ram", {
    "total_gb":    16,
    "speed":       3200,
    "current_pct": 78,
    "typical_avg": 51,
})

# Later, in a different handler (health_check) - reference it:
def _resp_health_check(self, r, lang):
    ram_sess = session_memory.get_response_data("hw_ram")
    # `ram` (the live RAM percentage) and `lines` (the response being built)
    # are set earlier in the full handler; only the memory callback is shown here.

    if ram_sess.get("total_gb") and ram > 70:
        lines.append(
            f"  (Your {ram_sess['total_gb']} GB RAM we discussed earlier"
            f" - now at {ram:.0f}%, that's worth watching)"
        )

No SQLite. No JSON file. Just a dict:
{intent: {key: val, recorded_at: timestamp}}.
Lives in RAM for the session. When the app closes, it's gone.
That's intentional: session memory shouldn't outlive the session.
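
If you want the whole store in one place, it's roughly this (an assumed shape; the real session_memory module may carry more):

```python
import time
from typing import Any, Dict

class SessionMemory:
    def __init__(self) -> None:
        self._data: Dict[str, Dict[str, Any]] = {}

    def record_response_data(self, intent: str, data: Dict[str, Any]) -> None:
        # One dict per intent, timestamped so stale entries can be ignored later.
        self._data[intent] = {**data, "recorded_at": time.time()}

    def get_response_data(self, intent: str) -> Dict[str, Any]:
        return self._data.get(intent, {})

session_memory = SessionMemory()   # one instance per app session
```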

There's also trend tracking fed by snapshots every 30 seconds:

def get_trend(self, metric: str = "cpu") -> str:
    buf = self._cpu_trend if metric == "cpu" else self._ram_trend
    readings = list(buf)   # circular buffer of last 8 readings
    if len(readings) < 4:
        return "stable"
    mid        = len(readings) // 2
    first_avg  = sum(readings[:mid]) / mid
    second_avg = sum(readings[mid:]) / (len(readings) - mid)
    delta      = second_avg - first_avg
    if delta > 5:  return "rising"
    if delta < -5: return "falling"
    return "stable"

Split the buffer in half. Compare averages. Rising, falling, or stable. This lets hck_GPT say "your CPU has been climbing for the last few minutes" instead of just showing the current number.


7. System Prompt with Live PC Context — Facts, Not Vibes

When a message reaches Ollama, the LLM doesn't guess about your system.
It gets injected context built from real data:

== Live System State ==
CPU: 67%  @ 3840 MHz / max 4800 MHz
RAM: 71%  (11.4/16.0 GB used,  4.6 GB free)
Disk C: 48.3 GB free / 476.0 GB total

== Today's Averages ==
CPU avg: 34.2%  peak: 89.0%
RAM avg: 68.1%

== Top Processes (by CPU) ==
  chrome.exe       CPU 18.4%  RAM 892 MB
  code.exe         CPU 12.1%  RAM 441 MB
  python.exe       CPU  3.2%  RAM 156 MB

== Hardware Profile ==
CPU: Intel Core i7-12700H  (14 cores, boost 4.7 GHz)
GPU: NVIDIA GeForce RTX 3060  VRAM: 6 GB
RAM: 16.0 GB @ 3200 MHz

== Today vs Typical ==
CPU today 34% vs typical 28%  /\ (+6%)
RAM today 71% vs typical 51%  /\ (+20%)

== Learned Usage Patterns ==
Typical CPU avg (7-day baseline): 28%
Heaviest app this week: chrome.exe

Each section comes from a different module. psutil for live data.
SQLite for historical averages. WMI for hardware profile.
The Stats Engine for 7-day baselines.
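
As an illustration, the "Live System State" block needs nothing more than a few psutil calls. This sketch is mine, not the project's context builder, and the disk path assumes Windows:

```python
import psutil

def live_system_state() -> str:
    cpu  = psutil.cpu_percent(interval=0.1)
    freq = psutil.cpu_freq()
    mem  = psutil.virtual_memory()
    disk = psutil.disk_usage("C:\\")
    gb   = 2 ** 30
    return (
        "== Live System State ==\n"
        f"CPU: {cpu:.0f}%  @ {freq.current:.0f} MHz / max {freq.max:.0f} MHz\n"
        f"RAM: {mem.percent:.0f}%  ({(mem.total - mem.available) / gb:.1f}/"
        f"{mem.total / gb:.1f} GB used,  {mem.available / gb:.1f} GB free)\n"
        f"Disk C: {disk.free / gb:.1f} GB free / {disk.total / gb:.1f} GB total"
    )
```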

The "Today vs Typical" section is the difference between an assistant that says "CPU 67%" and one that says "CPU 67% — but your normal is 28%, something is off." Context, not just data. That's the whole philosophy in two lines of a prompt.


What This Costs vs The OpenAI Approach

| | hck_GPT (this approach) | pip install openai |
| --- | --- | --- |
| Dependencies | zero (beyond psutil) | openai, tiktoken, etc. |
| Model size | zero (rule engine is pure Python) | N/A (cloud) |
| API costs | zero | $0.002-0.06 per message |
| Latency | <1ms rules, 2-5s Ollama | 500-2000ms |
| Internet | not required | required |
| Privacy | data stays local | sent to OpenAI servers |
| RAM | ~5MB | N/A |

What I'd Do Differently

The intent vocabulary is hand-built.
~200 patterns across 25 intents.
Every time a user asks something I didn't anticipate,
I add patterns manually. This doesn't scale past maybe 50 intents.
A lightweight local classifier (even scikit-learn with TF-IDF) would handle the long tail better.
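
Something like this would be the starting point (a scikit-learn sketch, not part of PC Workman today; the training pairs would come from logged messages):

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Tiny illustrative training set - real data would be (message, intent) logs.
messages = ["why is ram so high", "what cpu do i have", "how are you",
            "dlaczego komputer jest wolny", "is my disk healthy"]
intents  = ["ram_why_high", "hw_cpu", "small_talk", "why_slow", "disk_health"]

clf = make_pipeline(
    TfidfVectorizer(ngram_range=(1, 2)),   # word + bigram features, language-agnostic
    LogisticRegression(max_iter=1000),
)
clf.fit(messages, intents)

query      = ["why is my memory usage so high"]
intent     = clf.predict(query)[0]
confidence = clf.predict_proba(query).max()
```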

The confidence thresholds (0.60, 0.35, 0.20) were tuned by testing, not by math. They work for my use case but they're not optimal.
A proper evaluation dataset would help.

Session memory is per-session only. The user knowledge base (SQLite in AppData) stores hardware profiles and usage patterns, but conversation context doesn't persist across restarts yet.


Why I'm Writing This

Every "build an AI assistant" tutorial I found in 2026...
starts with pip install langchain or pip install openai.
Cloud API. Token costs. Internet required.
That's fine for some use cases.

But if you're building a desktop app that runs on someone else's machine, behind their firewall, on their data, you need a different approach.
This is that approach: a hybrid engine that's fast when it can be, smart when it needs to be, and never crashes because a server is down.

PC Workman is open source, MIT licensed.
The full AI engine is in hck_gpt/ - intent parser, hybrid engine, response builder, session memory, context builder.
Read it, fork it, build something better.


Try It

Download the .exe (no Python needed): GitHub Releases

Read the source: github.com/HuckleR2003/PC_Workman_HCK

All my links: linktr.ee/marcin_firmuga

If you build something with a similar approach, or if you'd do it completely differently, I want to hear about it!


I'm Marcin Firmuga, solo developer and founder of HCK_Labs.
Building PC Workman publicly between retail shifts.
**800+ hours, 25 GitHub stars**, and one offline AI engine that grew 9 layers when I wasn't looking.

Follow the build: GitHub · LinkedIn · X · Medium
