I Built a Guardrailed, RAG-Powered AI Workspace for My Autistic Teenager. Here's What Actually Broke.

#ai #beginners #devops #machinelearning

My daughter is 13 and autistic. She needs homework help at 9pm sometimes. Every AI tool I looked at was either totally unmonitored, pointed at the open internet, or locked behind a school district policy I had zero visibility into.

I'm an IT admin. I run a homelab. I have an RTX 3060 sitting there. So I built something myself.

This isn't a tutorial. It's a postmortem of everything that failed and what I did to fix it — because the gap between "I have a working Ollama instance" and "this is actually safe for a vulnerable kid" is a lot wider than I expected.

The Stack

Nothing exotic here:

Ubuntu Server 24.04 on local hardware
Ollama for model serving
LiteLLM 1.68.2 as the LLM proxy
Open WebUI 0.8.12 as the front end
TEI reranker container for RAG reranking
PostgreSQL for persistent storage
RTX 3060 12GB doing the inference

Getting this running took an afternoon. Getting it to actually behave correctly for a neurodivergent 13-year-old took weeks.

Failure 1: The System Prompt Isn't Just Instructions

I came in thinking I understood system prompts. I didn't — not for this use case.

The hardest part was safety escalation. I needed the assistant to respond differently to three distinct situations:

Tier 1: Stress and frustration. "I hate this homework" → calm acknowledgment, one small next step, no alarm.
Tier 2: Ambiguous language that might suggest self-harm. "I just want to disappear" → specific required phrases, 988 crisis line, tell a trusted adult.
Tier 3: Explicit crisis disclosure. "I want to hurt myself" → stop everything, full escalation, all four required phrases, crisis resources.

My first implementation shared a single phrase list across Tier 2 and Tier 3. The model couldn't reliably tell them apart. It kept firing Tier 3 language at Tier 2 inputs.

That matters. Over-escalating to a kid who said "I wish I wasn't here" while frustrated about a math test is its own kind of harm. It can cause panic. It can erode trust. It can make them less likely to say anything at all next time.

The fix: Split everything into tier-specific blocks. Added concrete anchoring examples inside each tier definition. Added an explicit pre-response decision rule:

Before responding to any distress signal, identify the correct tier first.
Do not use Tier 3 language for Tier 2 signals.

Simple in retrospect. Not obvious until you've watched it fail a few times.

Failure 2: RAG Retrieval Wasn't Doing What I Thought

I built a knowledge base of 34 support documents: study habits, math steps, overwhelm strategies, emotional support, writing frameworks, autism-specific anchors for executive function and stress shutdown.

Uploaded them. Ran a test. The model pulled a research standards document instead of the overwhelm support guide.

The problem: my short support docs (200–400 words each) were being semantically outcompeted by longer, denser reference documents during retrieval scoring. Embeddings don't care which document is "right." They care about similarity, and a thin support guide loses to a 1,500-word standards reference almost every time.

I tried chunking adjustments first:

chunk_size: 300 → 800
overlap: 50 → 100

Helped marginally. Didn't fix it. The root cause was semantic density mismatch, not chunking.

The real fix: Rewrote all 29 general support documents to be longer and richer. Added a Common Core connection section to each one — deliberately mirroring the vocabulary of the dense anchor documents so the support docs could compete in retrieval scoring. Three hours of rewriting. After that, retrieval started hitting the right documents consistently.

Failure 3: "One Step at a Time" Wasn't Enforced Enough

The system prompt said: give distressed users one step at a time. What the model actually did was give one step, then append a "Remember:" block, or a "Tips:" section, or an "If you get stuck:" coda.

Technically compliant with the spirit of the rule. Not actually compliant. For a kid who's already overwhelmed, that extra content is precisely the problem we were trying to solve.

The fix: Added an explicit forbidden behaviors section:

Do not add extra sections such as "if you get stuck," "tips," 
"remember," or "extra help" unless the knowledge base pattern 
includes them or the user explicitly asks.

After that, instruction compliance tests passed consistently.

The Test Protocol

Before my daughter ever touched it, I ran a structured red team evaluation across four categories:

Category	Tests	Pass	Partial	Fail
Safety	10	8	1	1
RAG accuracy	10	6	3	1
Boundary enforcement	10	9	1	0
Instruction compliance	10	6	2	2
Total	40	29	7	4

Five escalation levels per category, from benign baseline probes up to combination attacks that blended two failure types in a single message. All four failures were remediated before deployment.

The one open partial: a fallback scenario where a secondary example surfaces after the primary one. Low priority. On the list.

What I'd Do Differently

Start with requirements, not model selection. I wasted a week comparing models before I understood that prompt quality and knowledge base structure matter more than model choice for a constrained use case like this. A well-prompted smaller model will outperform a poorly-prompted frontier model here.

Write the system prompt like a contract. Every ambiguity gets exploited — not maliciously, but by the model doing its best to be helpful in unexpected ways. Specify exact required phrases. Specify forbidden response structures. Be concrete.

Test adversarially before anyone uses it. Especially for anything touching a minor's mental health. "Seems fine" isn't a standard.

Semantic density matters in RAG. If your support documents are short and your reference documents are long, your support documents will lose retrieval. Either bulk them up or architect separate collections.

Where It Stands

The workspace is live. Safety escalation is solid. RAG retrieval is accurate across all tested scenarios. Boundary enforcement holds.

My daughter hasn't broken it yet. That's the benchmark that actually matters.

The full writeup with more detail on the RAG architecture, the rewriter script, and the complete test record is on my Hashnode — link in the comments.

Has anyone else built guardrailed AI tooling for a specific vulnerable user population? I'd genuinely like to compare notes — especially on the RAG side. I suspect the density mismatch problem is more common than people realize and I haven't seen much written about it.