Sruthik I

Posted on dev.to

Plexus: A WiFi Graph RAG for Network Troubleshooting

WiFi troubleshooting has a confidence problem.

Ask a chatbot what's causing client disconnections and it'll give you an answer that sounds right. But infrastructure troubleshooting isn't a trivia game — the cost of a confident wrong answer is an engineer wasting hours chasing the wrong fix.

I built Plexus, a private WiFi troubleshooting assistant specifically to solve this. Every answer it produces is grounded in retrieved evidence from a curated domain knowledge corpus. If the evidence is weak, the answer says so. The first cut — available now for trials — is focused on knowledge querying: ask a WiFi or networking question and get back a source-safe, evidence-grounded answer. Public users do not see private source names, page references, chunk IDs, or citations; those stay in internal traces for debugging and evaluation.

It's a private project — this post covers the design, not the data.

Plexus cover image

The Problem

WiFi troubleshooting is not just a search problem. A good answer usually depends on several kinds of evidence:

  • The user's question and operational context.
  • Protocol behavior and failure modes that are easy to confuse.
  • Incident artifacts — packet captures, logs, timeline signals.
  • Confidence boundaries: what the system knows, what it inferred, and what still needs validation.

A normal chatbot blends real evidence with plausible guesses and presents them at the same confidence level. That's dangerous in infrastructure troubleshooting. So Plexus was built around one strict rule: important technical claims should be grounded in retrieved evidence where possible, and uncertainty must be surfaced — not hidden.

System Map

At a high level, Plexus has three big areas:

  • An online app core for API/UI requests, routing, retrieval, answer generation, and RCA workflows.
  • Stores and services for lexical search, vector retrieval, graph relationships, workflow execution, and inference.
  • An offline indexing and release pipeline that prepares the private knowledge corpus into serving indexes.

Plexus architecture diagram

The online path starts with a FastAPI application. Requests from the web UI, chat interface, or CLI/admin path go through a query service that decides what kind of work is needed.

The critical design choice: retrieval is not a single vector search call. Plexus combines multiple retrieval shapes and builds an evidence pack before generation ever begins.

The Knowledge RAG Core

This is the heart of Plexus and what's live in the trial.

You ask a WiFi or networking question in the chat interface. Before anything gets retrieved, the query goes through a question classifier that uses embedding similarity against class prototypes — reference, compare, troubleshooting, advanced troubleshooting — combined with structural pattern signals (regex markers for "what is/explain" vs "why/fail/diagnose" vs "compare/differ/tradeoff"). The question class isn't cosmetic. It drives both answer policy and retrieval behavior: simple knowledge questions get concise explanations, while troubleshooting questions can use cause-and-next-check workflows.
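The classifier described above can be sketched in a few lines. This is a toy illustration, not Plexus's implementation: the prototype vectors stand in for real embedding centroids, and the regex bonus weight is an assumption.

```python
import re
from math import sqrt

# Hypothetical class prototypes. In Plexus these would be embedding
# centroids; here they are toy 3-d vectors for illustration only.
PROTOTYPES = {
    "reference": [1.0, 0.1, 0.0],
    "troubleshooting": [0.1, 1.0, 0.2],
    "compare": [0.0, 0.2, 1.0],
}

# Structural pattern signals, mirroring the markers named in the post.
PATTERNS = {
    "reference": re.compile(r"\b(what is|explain)\b", re.I),
    "troubleshooting": re.compile(r"\b(why|fail|diagnose)\b", re.I),
    "compare": re.compile(r"\b(compare|differ|tradeoff)\b", re.I),
}

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (sqrt(sum(x * x for x in a)) * sqrt(sum(y * y for y in b)))

def classify(question, embedding):
    # Blend embedding similarity with a bonus for structural markers.
    # The 0.5 bonus is an assumed weighting, not from the post.
    scores = {cls: cosine(embedding, proto) for cls, proto in PROTOTYPES.items()}
    for cls, pattern in PATTERNS.items():
        if pattern.search(question):
            scores[cls] += 0.5
    return max(scores, key=scores.get)
```

The point of the blend is that either signal alone can pick the class when the other is ambiguous.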

Alongside that, a domain intent parser extracts WiFi-domain signals from the query: security protocols (WPA2, WPA3, SAE, OWE, PMF), frame types (EAPOL, Probe, Auth, Association), WiFi generations (802.11r, 802.11k, ax, be), vendor hints, AP roles. These feed directly into retrieval.
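A minimal version of that intent parser is a set of regex tables keyed by axis. The term lists below mirror the examples in the post but are illustrative, not Plexus's actual vocabulary.

```python
import re

# Regex tables for WiFi-domain signals; term lists are illustrative.
SIGNAL_PATTERNS = {
    "security": r"\b(WPA2|WPA3|SAE|OWE|PMF)\b",
    "frame": r"\b(EAPOL|Probe|Auth|Association)\b",
    "generation": r"\b802\.11(?:r|k|ax|be)\b",
}

def parse_intent(query: str) -> dict:
    """Extract WiFi-domain anchor terms, grouped by axis."""
    return {
        axis: sorted({m.group(0) for m in re.finditer(pattern, query, re.I)})
        for axis, pattern in SIGNAL_PATTERNS.items()
    }
```

The extracted terms then serve double duty: they become graph anchor candidates and trigger the compatibility lane described later.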

Two Retrieval Modes

Plexus operates in two primary retrieval modes, switchable at runtime without restart:

Traditional mode runs dense vector search (Qdrant) and lexical search (SQLite FTS) in parallel. Duplicate chunks across document editions are collapsed, and the top candidates can be expanded with page- or section-adjacent neighbors from the same source. The two ranked lists are merged with Reciprocal Rank Fusion (RRF), which combines rank positions rather than raw scores, so neither retriever's score scale dominates. To keep exact string matches (specific error codes, MAC vendor prefixes) from being diluted by the dense retriever's semantic confidence, the merged top-K candidates then pass through a cross-encoder reranker. Finally, quality penalties demote junk chunks (glossaries, boilerplate, answer keys) before they reach the evidence pack.
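RRF itself is small enough to show in full. This is the standard formulation; the constant Plexus actually uses isn't stated in the post, so k=60 below is the conventional default from the original RRF paper.

```python
def rrf_merge(ranked_lists, k=60):
    """Reciprocal Rank Fusion over lists of chunk IDs.

    Each list contributes 1 / (k + rank) per chunk, so agreement across
    retrievers compounds while raw score scales are ignored entirely.
    """
    scores = {}
    for ranking in ranked_lists:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)
```

A chunk ranked second by both retrievers will beat a chunk ranked first by only one, which is exactly the behavior you want before a reranker sees the pool.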

Graph mode adds Neo4j to the picture. This is where it gets interesting.

Hybrid retrieval stack diagram

Graph RAG: Entity-Aware Retrieval

During offline indexing, entities are extracted from the knowledge corpus — protocol concepts, configuration states, failure modes, vendor behaviors — and imported into Neo4j as nodes with RELATES_TO weighted edges and community memberships.

At query time, Plexus resolves anchor terms from the parsed intent (protocol names, security methods, frame identifiers) to entity nodes via full-text index. It then traverses outward in one of three submodes, selected based on question class and query signals:

  • Local: entity → directly mentioned chunks → neighbor entities via RELATES_TO → their chunks. Best for specific, concrete questions.
  • Drift: local traversal + community expansion. Plexus follows entities into their community cluster and pulls chunks from co-clustered entities. Useful for broader symptom-to-cause problems where the answer lives in a nearby concept, not the exact entity.
  • Global: community-first traversal. Matches communities by full-text search against the query, then pulls chunks from member entities. For corpus-wide thematic questions.
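The local submode is the easiest to picture. Below is a toy sketch over an in-memory adjacency structure standing in for Neo4j; the entity names, chunk IDs, and single-hop depth are all invented for illustration.

```python
# Toy stand-in for the Neo4j graph: entity -> (chunks, RELATES_TO neighbors).
GRAPH = {
    "SAE": {"chunks": ["c1", "c2"], "neighbors": ["PMF", "WPA3"]},
    "PMF": {"chunks": ["c3"], "neighbors": ["SAE"]},
    "WPA3": {"chunks": ["c4"], "neighbors": ["SAE"]},
}

def local_traversal(anchor):
    """Local submode: anchor entity's chunks, then one RELATES_TO hop."""
    node = GRAPH.get(anchor)
    if node is None:
        return []
    chunks = list(node["chunks"])
    for neighbor in node["neighbors"]:
        for chunk in GRAPH[neighbor]["chunks"]:
            if chunk not in chunks:
                chunks.append(chunk)
    return chunks
```

Drift would extend this by following community membership, and Global would start from community matches instead of an entity anchor.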

The immediate danger with Drift and Global traversals is graph decay: as firmware updates and new standards emerge, old entity relationships go stale. To counter this, Plexus applies a temporal decay penalty to edges during traversal, so newer corpus ingestion overwrites or heavily down-weights deprecated protocol behaviors and the graph stays grounded in current reality.
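The post names a temporal decay penalty but not its exact form. One plausible shape is an exponential half-life over edge age; the half-life value below is an assumption, not Plexus's actual parameter.

```python
def decayed_weight(edge_weight, age_days, half_life_days=365.0):
    """Down-weight a RELATES_TO edge by its age.

    An edge one half-life old keeps 50% of its weight, two half-lives
    keeps 25%, and so on, so stale relationships fade from traversal
    without being deleted outright.
    """
    return edge_weight * 0.5 ** (age_days / half_life_days)
```

Keeping stale edges at low weight, rather than deleting them, preserves auditability: you can still see why an old chunk would have surfaced.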

Graph results don't replace traditional retrieval — they're hybridized. Both lists are merged via RRF and jointly reranked. A chunk that surfaces from both graph and traditional retrieval gets a relevance boost. A graph-only chunk with zero lexical overlap against the question gets penalized — the graph can hallucinate relevance when entity connections are indirect.
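The boost-and-penalize logic can be expressed as a small score adjustment. The multipliers here are assumed values for illustration; the post describes the behavior but not the numbers.

```python
def hybrid_adjust(score, paths, query_tokens, chunk_tokens):
    """Adjust a chunk's relevance based on which retrievers surfaced it.

    paths: set of retrieval surfaces that returned this chunk,
    e.g. {"graph", "dense"}. Token sets are lowercase word sets used
    as a crude lexical-overlap check. Multipliers are assumptions.
    """
    if "graph" in paths and ("dense" in paths or "lexical" in paths):
        return score * 1.25  # corroborated across surfaces: boost
    if paths == {"graph"} and not (query_tokens & chunk_tokens):
        return score * 0.5   # graph-only, no lexical overlap: speculative
    return score
```

The asymmetry is deliberate: graph corroboration is strong evidence, but graph-only relevance with zero lexical contact is exactly the hallucinated-relevance case the post warns about.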

Retrieval concept image

The Compatibility Lane

WiFi has a class of question that's particularly hard: compatibility. "Does WPA3-SAE interoperate with WPA2 clients on 802.11ax?" requires understanding security method × generation × vendor interactions simultaneously. A single query against a single retrieval surface rarely reaches the right evidence.

The intent parser detects compatibility signals — security protocols, WiFi generations, vendor hints — and when they're present, a parallel retrieval lane fires. It generates a set of targeted sub-queries, one per compatibility axis combination, and runs dense + lexical retrieval for each concurrently. Results are pooled, deduped, and reranked into a compatibility evidence segment that merges with the main evidence pack.

This lane runs alongside the primary retrieval path, not instead of it.
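Sub-query generation for the lane is essentially a Cartesian product over the detected axes. The query template below is invented; Plexus's actual phrasing isn't public.

```python
from itertools import product

def compatibility_subqueries(intent):
    """One targeted sub-query per compatibility axis combination.

    intent: dict of axis -> detected terms, as produced by the intent
    parser. The appended template text is a hypothetical example.
    """
    axes = [intent.get("security", []), intent.get("generation", [])]
    axes = [a for a in axes if a]  # skip axes with no detected signals
    if not axes:
        return []
    return [
        " ".join(combo) + " compatibility interoperability"
        for combo in product(*axes)
    ]
```

Each sub-query then gets its own dense + lexical retrieval pass, run concurrently, before the pooled results are deduped and reranked.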

Evidence Packs and Two-Pass Generation

The flow is intentionally boring and auditable — and that's a feature, not a limitation.

Query to grounded answer flow

Retrieved chunks don't go directly to the prompt. They're assembled into a typed evidence pack — each entry carries internal identity, retrieval path, provenance, and relevance signals. Diversity enforcement helps the pack span distinct sources before it's trimmed to the final window. The public response does not expose those private details, but operators can inspect them later by request ID.
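In shape, an evidence pack entry and the diversity pass might look like the following. Field names and the per-source cap are illustrative assumptions, not Plexus's actual schema.

```python
from dataclasses import dataclass

@dataclass
class EvidenceEntry:
    """One typed entry in the evidence pack; field names are illustrative."""
    chunk_id: str        # internal identity, never shown to public users
    source: str          # provenance: which private corpus document
    retrieval_path: str  # "dense", "lexical", "graph", "compat", ...
    relevance: float     # post-rerank score

def enforce_diversity(entries, max_per_source=2):
    """Cap entries per source so the pack spans distinct documents."""
    counts, kept = {}, []
    for entry in sorted(entries, key=lambda e: e.relevance, reverse=True):
        counts[entry.source] = counts.get(entry.source, 0) + 1
        if counts[entry.source] <= max_per_source:
            kept.append(entry)
    return kept
```

Because every entry carries its retrieval path and provenance, an operator inspecting a request ID can reconstruct exactly why each chunk was in front of the model.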

Generation happens in two passes:

  1. Answer generation: the model produces a response grounded in the evidence pack.
  2. Verification and cleanup: a separate grounding pass checks whether technical claims are supported. Unsupported claims are flagged, and public responses are cleaned so private source details and citations are not returned to users.
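Plexus's actual verifier is a model pass, but the flag-unsupported-claims step can be illustrated with a crude lexical proxy: check each claim's token overlap against the evidence texts. The threshold is an arbitrary example value.

```python
def claim_supported(claim, evidence_texts, min_overlap=0.5):
    """Toy grounding check: best token-overlap ratio vs. any evidence text.

    A real verifier would use a model judgment; this lexical proxy only
    demonstrates the shape of the check.
    """
    claim_tokens = set(claim.lower().split())
    best = max(
        (
            len(claim_tokens & set(text.lower().split())) / max(len(claim_tokens), 1)
            for text in evidence_texts
        ),
        default=0.0,
    )
    return best >= min_overlap
```

Claims that fail the check are what get flagged in the second pass, and flagged claims are what drive the "confidence is limited" framing in the public answer.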

If verification finds weak evidence coverage, Plexus surfaces that explicitly — "here's what the evidence suggests, but confidence is limited." For common in-scope WiFi concepts, it can also use expert synthesis when retrieved evidence is partial; that state is tracked internally instead of being hidden.

Offline Indexing and Release Gate

Plexus is only as good as the indexes behind it. Poor indexing is a silent production bug — the model keeps producing fluent text, but grounded in weaker evidence, and nothing in the output tells you retrieval degraded.

Offline indexing flow

The pipeline handles extraction, normalization, chunking, metadata enrichment, embedding generation, and index publishing for the lexical, vector, and graph backends. Then validation checks run before any index is promoted to the online path.

That gate was added after a hard lesson early in the build. Embedding model drift caused retrieval quality to degrade silently. Plexus kept producing fluent answers, but they were grounded in stale, misaligned chunks. I caught it during a manual review — nothing in the output had signaled the problem. Adding offline evaluation before promotion was the fix. Now degradation shows up as a failed gate before it reaches users.
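The simplest form of such a gate is recall@k over a held-out set of (query, expected chunk) pairs. The threshold and k below are illustrative, not Plexus's actual numbers.

```python
def release_gate(eval_cases, retrieve, k=5, min_recall=0.8):
    """Block index promotion if retrieval recall@k drops below threshold.

    eval_cases: list of (query, expected_chunk_id) pairs.
    retrieve: callable returning a ranked list of chunk IDs for a query.
    Returns (passed, recall) so a failed gate can be logged with its score.
    """
    hits = sum(1 for query, expected in eval_cases if expected in retrieve(query)[:k])
    recall = hits / len(eval_cases)
    return recall >= min_recall, recall
```

The value is the failure mode it converts: embedding drift stops being a silent quality regression and becomes a visible red gate in the release pipeline.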

Evaluation and release gate image

RCA: The Enterprise Extension

The knowledge chat is the first-cut release. The RCA engine is what comes next.

RCA is a separate problem from Q&A. Incident analysis needs to ingest packet and log artifacts, normalize them into structured observations, build an event timeline, generate candidate hypotheses, and ground those hypotheses against the knowledge corpus. Stuffing raw artifacts into a prompt is not a workflow — it's a guess.

Incident RCA concept image

Incident RCA workflow diagram

Plexus has an RCA path designed around durable execution, per-tenant incident state, audit trails, and async workers. In the full enterprise shape, that means Temporal-style workflow orchestration, a persistent RCA store, structured reports, trace access, and explicit runtime health gates. That path has been implemented and evaluated separately from the public knowledge-chat trial, but broader RCA availability is intentionally gated behind its own quality and operations checks.

The enterprise stack deliberately builds on the knowledge RAG foundation. Plexus's knowledge corpus is what makes the RCA evidence credible. You can't have a trustworthy incident report without a trustworthy retrieval layer underneath it.

Tech Stack

Plexus's backend is Python, with FastAPI for the API layer and Typer for CLI/admin workflows. Retrieval uses SQLite FTS, Qdrant, and Neo4j, each in its respective role. Inference runs locally via Ollama or through AWS Bedrock, depending on deployment configuration. The current public trial uses Google sign-in through Cognito, a small lifetime question quota, DynamoDB-backed quota/feedback/history metadata, CloudFront/S3 for the static UI, and a lightweight backend runtime for the query path.

The RCA architecture is designed for durable execution and structured analysis rather than mixing raw artifacts into prompt text. Instead of dumping a 500-line spanning tree log or a raw PCAP dump into the context window, the execution pipeline parses the artifact into a strict, deterministic schema first. The LLM only sees the distilled state:

{
  "event_type": "802.11_auth_failure",
  "client_mac": "a1:b2:c3:...",
  "ap_bssid": "d4:e5:f6:...",
  "reason_code": 15,
  "timing_delta_ms": 120,
  "inferred_state": "4-way handshake timeout"
}

This prevents the model from getting lost in the noise and allows the workflow to execute deterministic logic before leaning on the LLM for reasoning.
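A parser producing that schema might look like the sketch below. The log line format, regex, and reason-code mapping are all invented for illustration; real 802.11 artifacts would go through proper capture and log parsers.

```python
import re

# Hypothetical log format for illustration; not a real vendor format.
AUTH_FAIL = re.compile(
    r"client (?P<mac>[0-9a-f:]{17}) auth fail .* reason (?P<code>\d+)", re.I
)

def parse_log_line(line):
    """Distill one raw log line into the structured event schema above.

    Returns None for lines that don't match, so noise never reaches
    the reasoning step.
    """
    match = AUTH_FAIL.search(line)
    if not match:
        return None
    code = int(match.group("code"))
    return {
        "event_type": "802.11_auth_failure",
        "client_mac": match.group("mac"),
        "reason_code": code,
        # Invented mapping for the sketch; real code handling is richer.
        "inferred_state": "4-way handshake timeout" if code == 15 else "unknown",
    }
```

Because unmatched lines yield None, deterministic filtering happens before the LLM is ever consulted, which is the structural point the paragraph above makes.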

The specific tools matter less than the structural separations:

  • API and routing are separate from retrieval.
  • Retrieval is separate from answer generation.
  • RCA parsing is separate from RCA reasoning.
  • Offline indexing is separate from online serving.
  • Evaluation gates sit before release, not after user-facing failures.

Each boundary makes one layer independently testable and replaceable without touching the others.

Lessons From The Build

The biggest lesson: a useful troubleshooting RAG system needs more product discipline than model integration. The model is one component. The harder parts are the evidence pipeline, retrieval quality, answer grounding, and knowing when to say "the evidence isn't strong enough."

Evidence packs over prompt stuffing. The first version concatenated retrieved chunks directly into the prompt. It worked until context length grew — then the model started blending chunks in ways that were hard to audit and impossible to trace. Switching to a typed evidence pack with explicit internal slots made generation more reliable and made verification possible.

Hybrid retrieval pays off fast. Version one used only vector search. It missed exact string matches: protocol codes, specific error strings, and standards names. Adding FTS alongside vector search improved quality more than another round of prompt tuning would have.

Graph retrieval needs a penalty for speculation. Early graph mode returned chunks from indirectly connected entities that were topically related but not actually relevant to the specific question. A graph-only chunk with weak topical overlap is a speculation, not strong evidence. Penalizing that case made the hybrid retriever more precise.

Public answers should be source-safe. The system still tracks evidence internally, but the public UI should not reveal private corpus details. That forced a useful product boundary: users get concise answers, confidence, and feedback controls; operators get traces, evidence maps, and evaluation data.

Uncertainty signals matter more than you think. Early on, the LLM produced confident-sounding answers even when retrieved evidence was thin. Adding verification and confidence handling made Plexus feel trustworthy rather than just fluent.

Closing

Plexus is live as a private trial: knowledge chat, hybrid GraphRAG retrieval, source-safe answers, Google sign-in, quota protection, and feedback capture. If you work in WiFi infrastructure and want to put it through its paces, the trial is open at app.plexus.pw/chat. The RCA engine is the next broader product surface.

The architecture pattern here is broadly reusable: build a retrieval layer that can explain itself internally, keep generation grounded in evidence, and design incident workflows around structured analysis.

For infrastructure troubleshooting, that difference matters. The goal is not a fluent answer. The goal is an answer an engineer can trust, inspect, and challenge.
