DEV Community: Gilson Leite Siqueira Junior

Building an internal AI assistant for support engineers with MCP

Gilson Leite Siqueira Junior — Tue, 26 May 2026 16:31:18 +0000

Support engineers lose time in a very specific way. A question starts in one place, the answer lives somewhere else, and before anyone notices you’ve got six tabs open and the actual problem is still half-hidden.

I wanted to fix that because I’ve lived it myself. Not with a bigger dashboard, or with another chat bubble pretending to know everything: I wanted something smaller, gentler, and more honest.

So I built an internal assistant on the Model Context Protocol (MCP) that could reach into the systems we already trusted and pull back the right context when it mattered.

The problem

On paper, the workflow looked clean enough.

A customer ticket lands with partial clues.
The engineer checks the knowledge base for known fixes.
They search previous tickets to see whether the issue was already solved.
They jump to internal docs for product behavior or edge cases.
They repeat the same search patterns when the first pass misses.

The problem wasn’t missing information; we had plenty of that.

The real tax was friction.

Every switch in context costs attention, and every repeated search chips away at momentum. You feel it in the middle of a live case, when the thread is still open and someone is waiting on an answer. It starts as something annoying and ends up draining more than you expect. If you’ve worked support long enough, you know the feeling immediately.

The solution

For that, I developed an MCP server that exposes the internal search system as a set of tools a client like Claude Desktop or Claude Code can call.

The important choice was what not to build. I didn’t want a chatbot that improvises its way through everything. I wanted a thin layer that keeps the source of truth where it already lives and gives the model structured access to it. That felt calmer to me, and more respectful of the systems people already rely on every day. There’s a kind of comfort in that simplicity.

Support engineer
      |
      v
Claude / AI client
      |
      v
MCP server
      |
      +--> KB search
      +--> Ticket search
      +--> Issue search
      +--> Product docs search
      |
      v
Internal search infrastructure

That separation mattered more than I expected. The assistant is not the database. It’s the interface, and once I started treating it that way, the shape of the whole project became clearer. I could feel the difference right away.

Why MCP fit this use case

MCP gave me a clean boundary between the AI client and the internal services, which is exactly what I wanted: it made tool definitions explicit and search behavior easier to reason about. I could let the same backend serve different clients without hard-wiring the workflow into a single UI.

The architecture splits cleanly into three layers:

Client layer: Claude Desktop or Claude Code makes tool calls over MCP. The model sees each tool’s description and parameters but knows nothing about the backend.
MCP server layer: A Python process that translates tool calls into HTTP requests. It handles timeouts, error messages, and response formatting. This is where the business logic lives.
Service layer: The remote search infrastructure that actually runs the queries and answer generation. The server calls it over standard HTTPS but shields the client from network details.

That separation meant I could evolve the server independently from the services. If an upstream API changed, only the server needed updating. If we wanted to swap one service for another, the client didn’t care.

I also liked how MCP pushed me to think in capabilities instead of prompts. A support engineer doesn’t need a giant blob of instructions that tries to anticipate every branch. They need reliable tools for lookup, comparison, and follow-up, and they need those tools to behave predictably when things are already tense. That kind of steadiness matters more than cleverness.

Tool design

I split the server into separate tools instead of one generic search endpoint.

That sounds like a minor detail, but I assure you it isn’t, at least not when you’re trying to keep the experience understandable for the person using it. Small boundaries can make a tool feel much kinder:

Keyword search and semantic search solve different problems.
Knowledge base lookup is not the same as ticket history lookup.
Generating an answer is not the same as regenerating or reformatting one.
Narrow tools are easier to test and easier to trust.

The model can ask for the specific corpus it needs. No guessing, and no hoping one universal search call will somehow infer the right intent every single time. That restraint made the whole thing easier to trust.
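A rough sketch of what narrow tool boundaries can look like, in plain Python and without the MCP SDK. The function names, corpus names, and request fields here are hypothetical; the post does not show the real tool signatures. Each function builds the request for exactly one kind of search, so the caller has to name both the mode and the corpus.

```python
from dataclasses import dataclass
from enum import Enum


class Corpus(str, Enum):
    # Hypothetical corpus names, mirroring the four searches in the diagram
    KB = "kb"
    TICKETS = "tickets"
    ISSUES = "issues"
    DOCS = "docs"


@dataclass(frozen=True)
class SearchRequest:
    mode: str       # "semantic" or "keyword"
    corpus: Corpus
    query: str
    limit: int


def _build(mode: str, corpus, query: str, limit: int) -> SearchRequest:
    # Validate early so a bad call fails with a clear message
    if not query.strip():
        raise ValueError("query must not be empty")
    if not 1 <= limit <= 50:
        raise ValueError("limit must be between 1 and 50")
    return SearchRequest(mode, Corpus(corpus), query.strip(), limit)


def semantic_search(corpus, query: str, limit: int = 5) -> SearchRequest:
    """Meaning-based lookup in one corpus."""
    return _build("semantic", corpus, query, limit)


def keyword_search(corpus, phrase: str, limit: int = 5) -> SearchRequest:
    """Exact-phrase lookup in one corpus."""
    return _build("keyword", corpus, phrase, limit)
```

Because each function does one thing, each can be tested on its own, and an invalid corpus name raises an error instead of silently searching everything.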

The tools

The server has several tool categories, each serving a distinct purpose.

The first handles search: semantic/neural queries across knowledge bases, tickets, issues, and documentation, paired with keyword search when you need exact phrases.

For answers, I built two tools: one to generate grounded LLM responses, another to regenerate or reformat them in different tones (formal, casual, technical, etc.).

Discovery tools let the model list available sources, tags, and sections without guessing. A fourth class handles structured metadata lookups to build more informed queries.

Each tool is stateless except for answer generation, which stores a chat_id so you can regenerate without re-searching. That minimal state felt right — the model can use these tools in any order, and results are cacheable.
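One way to picture that single piece of state is a small map from chat_id to the context an answer was built from. This is a sketch under assumptions: the post does not say where the chat_id is stored, and the class and field names below are made up for illustration.

```python
import uuid


class AnswerStore:
    """In-memory map from chat_id to the context an answer was built from."""

    def __init__(self) -> None:
        self._chats = {}

    def save(self, question: str, passages: list, answer: str) -> str:
        # A fresh id per generated answer; the caller hands it back later
        chat_id = uuid.uuid4().hex
        self._chats[chat_id] = {
            "question": question,
            "passages": list(passages),
            "answer": answer,
        }
        return chat_id

    def context_for(self, chat_id: str) -> dict:
        # Regeneration reuses the stored passages instead of searching again
        if chat_id not in self._chats:
            raise KeyError(f"unknown chat_id: {chat_id}")
        return self._chats[chat_id]
```

Everything else can stay stateless: a regenerate call only needs the chat_id and the new tone.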

What the backend looked like

Under the hood, the backend wrapped a Kubernetes-deployed semantic search service that already indexed multiple corpora. My job was to make that stack usable from an AI client without leaking the messy bits or making the experience harder than it needed to be. I wanted the person on the other side to feel help, not complexity.

The practical work wasn’t glamorous:

Handling network access cleanly.
Keeping the tool surface small.
Returning results in a form the model could use.
Preserving enough metadata for a human to verify the answer.

Implementation details

The server is a Python FastMCP application that communicates with the AI client over stdio, one of MCP’s standard transports. If you’re setting this up locally, you’ll add a small configuration block to your Claude Desktop config that points to your Python environment:

{
  "mcpServers": {
    "kcs-search": {
      "command": "/path/to/python",
      "args": ["-m", "server_module"],
      "env": {
        "HTTP_TIMEOUT_SECONDS": "60"
      }
    }
  }
}

The HTTP_TIMEOUT_SECONDS environment variable is the only tuning knob most people need to touch.
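A minimal sketch of how a server might read that variable defensively. The function name and the fallback behavior are assumptions; only the variable name and the value 60 come from the config above.

```python
import os


def http_timeout_seconds(default: float = 60.0) -> float:
    """Read HTTP_TIMEOUT_SECONDS, falling back to a default on bad input."""
    raw = os.environ.get("HTTP_TIMEOUT_SECONDS", "")
    try:
        value = float(raw)
    except ValueError:
        return default
    # Reject zero, negatives, NaN and infinity
    return value if 0 < value < float("inf") else default
```

Falling back instead of crashing keeps a typo in the config from taking the whole server down at startup.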

Response times vary by operation. Semantic search usually completes in 2–5 seconds. Keyword search is faster. Answer generation is slower (10–30 seconds) because it involves model inference on top of retrieval.

I also had to respect something obvious but easy to ignore: internal systems are not built for free-form AI access. Some endpoints only work in specific contexts. Some assumptions that hold locally fall apart the moment you hit the real network. You learn that quickly, usually after a few false starts. That part can be humbling.

Security and guardrails

I found an authentication gap while working through the integration, and I documented it instead of trying to sneak around it. That part mattered to me more than the workaround would have. I’ve always preferred the slower honest path there.

That changed how I approached the rest of the implementation. Internal tools should default to the least surprising behavior:

Only expose what the client actually needs.
Avoid broad write capabilities.
Prefer read-only operations unless a workflow truly requires more.
Make the source of each result visible so the human can verify it.
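The last point can be sketched as a formatter that puts the source ahead of the snippet. The field names (title, source, id, url, snippet) are hypothetical; the real response shape of the search service is not shown in the post.

```python
def format_results(results: list, max_snippet: int = 200) -> str:
    """Render search hits as text, one block per hit, source first."""
    if not results:
        return "No results found."
    blocks = []
    for i, hit in enumerate(results, start=1):
        # Collapse whitespace so multi-line snippets stay on one line
        snippet = " ".join(hit.get("snippet", "").split())
        if len(snippet) > max_snippet:
            snippet = snippet[:max_snippet].rstrip() + "..."
        blocks.append(
            f"[{i}] {hit.get('title', '(untitled)')}\n"
            f"    source: {hit.get('source', 'unknown')} | id: {hit.get('id', '?')}\n"
            f"    url: {hit.get('url', '(none)')}\n"
            f"    {snippet}"
        )
    return "\n\n".join(blocks)
```

With the id and URL in every block, an engineer can open the original article or ticket and check the claim before repeating it to a customer.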

For AI tooling, security isn’t a later phase. It has to be part of the interface design from the start, or the whole thing feels off. People can sense when something is missing, even if they can’t name it.

Limitations and caveats

A few hard constraints shaped this tool:

Network access: The upstream services are only reachable from inside the internal network. The server returns a connection error if network access is not configured correctly.
Authentication: At the time of building this, the upstream services had no authentication layer. This means the server should never run on shared or public machines. It’s designed for individual developer workstations only.
Environment scope: Both services run on staging infrastructure. There is no production endpoint at this point.
Streaming: The underlying answer generation API supports streaming, but v1 of this server does not. Answer generation returns the full response when complete instead of streaming tokens as they arrive. This adds latency but simplifies the client implementation.

These constraints shaped the tool’s scope on purpose. It works well for solo developers on internal networks with staging data. I didn’t try to make it something it isn’t — that simplicity is the whole point. There’s less to go wrong when a tool admits what it can’t do.

Adoption

The strongest sign that the project was useful wasn’t the code. It was that I shared it with other engineers on the team.

That changes the bar immediately. A private prototype can be clever and still be useless. A shared internal tool has to survive real usage, inconsistent questions, and the pressure of helping someone in the middle of a live support case. That’s a different kind of test entirely, and it asks for a different kind of care.

Once other engineers could use it, the project had to be understandable, dependable, and fast enough to stay out of the way. If it gets in the way, people stop trusting it, and trust is the whole game here. I felt that responsibility pretty strongly.

What I learned

This project taught me a few things that go beyond one internal assistant.

AI tooling becomes much more useful when it is attached to real operational systems.
Retrieval quality matters more than flashy prompting.
Narrow, explicit tools are easier to trust than one large abstraction.
Production constraints shape the product as much as model choice does.
The best AI systems for support work are the ones that reduce context switching.

It also reinforced something I’ve seen over and over in support and infrastructure work: reliability is a feature. If the tool only looks good in demos, people will leave it behind, no matter how clever it seemed at first. Real usefulness is quieter than that.

Closing thought

I like building AI systems that make expert work feel lighter instead of replacing the expert. This project did that in a very concrete way. It turned scattered knowledge into something engineers could query directly, right in the flow of work.

That’s the kind of AI product I want to keep building.

Related Project

dubweave — Fully local AI dubbing pipeline. Like this KCS Search MCP project, dubweave is built on the principle of keeping everything local, measurable, and under your control. Both are systems that respect the data they handle and the people who use them.

Dubweave, for Aline

Gilson Leite Siqueira Junior — Tue, 26 May 2026 16:26:25 +0000

I built dubweave for my wife, Aline. She speaks Portuguese, and I kept finding these videos (essays, documentaries, interviews) that I wanted to show her. Subtitles help, yeah, but it is not the same. When you dub something, you can both just… listen. You sit together and watch without the reading part happening in your head.

So I built this for that. To not send our videos to some company’s server. To keep it mine. And as I built it, it got complicated in the ways that feel honest to me, because I prefer systems I can understand and trust.

If I had to explain what dubweave actually does, I would say it is like a workshop. Not a fancy one. Just a series of stations where each one passes the work to the next, with clear handoffs.

The workshop tour, step by step

Station 1: Get the video

If you give it a local file, it converts to mp4 and pulls out the audio. If you give it a video link, yt-dlp tries a few different ways to download it. Sometimes it is fast, sometimes it falls back. The pipeline has a bunch of ways to keep trying. It rotates client profiles. It uses your cookies if you have them. If all else fails, it just extracts the audio from the video file itself. The point is not to be clever about it. The point is to keep working.

Station 2: Listen and mark the time

Whisper does two things. First it listens and figures out what language the audio is in. Then it transcribes what it hears and marks down the exact time each word starts and stops. That timing is everything; it is like the skeleton. If I do not respect it, the dubbed voice will drift and the speaker’s mouth will not match. So I keep that timing sacred.

Station 3: Glue the pieces together

Whisper chops up speech into tiny bits. A lot of tiny bits.

But translation works better on whole thoughts, not fragments. So I have a few simple rules: do not make utterances too short, do not make them too long, if there is a big gap in the audio maybe split there, if there is punctuation maybe stop there. I keep track of which original fragments I combined, so later I can spread the translation back across the original timing. This is not just tidying up. This is the difference between a translation that sounds natural and one that sounds broken.
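Those rules can be sketched roughly like this. The thresholds, the function name, and the fragment fields (start, end, text) are assumptions for illustration, not dubweave’s actual values. Each utterance keeps the indices of the fragments it came from.

```python
def group_fragments(frags, min_chars=20, max_chars=120, max_gap=0.8):
    """Merge Whisper-style fragments into utterances.

    Each fragment is a dict with "start", "end" and "text". Every utterance
    keeps "parts", the indices of the fragments it was built from.
    """
    groups, current = [], []

    def joined(indices):
        return " ".join(frags[i]["text"].strip() for i in indices)

    def flush():
        if current:
            groups.append({
                "start": frags[current[0]]["start"],
                "end": frags[current[-1]]["end"],
                "text": joined(current),
                "parts": list(current),
            })
            current.clear()

    for i, frag in enumerate(frags):
        if current:
            gap = frag["start"] - frags[current[-1]]["end"]
            too_long = len(joined(current + [i])) > max_chars
            # A long pause or an oversized utterance closes the current group
            if gap > max_gap or too_long:
                flush()
        current.append(i)
        text = joined(current)
        # Sentence-ending punctuation closes the group once it is long enough
        if len(text) >= min_chars and text.endswith((".", "!", "?")):
            flush()
    flush()
    return groups
```

The "parts" list is what makes the later step possible: once an utterance is translated, its text can be distributed back over the time ranges of the original fragments.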

Station 4: Translate, then fix it

If you have an API key, translation goes through Gemini. I break it into chunks, number them, and give it a bit of context from previous translations so pronouns and tone stay consistent. If the API fails or you do not have a key, it falls back to a local model instead. Either way, the translation then gets run through a PT-BR fixer.
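The chunk, number, and context steps can be sketched without any API call. The prompt wording, the function names, and the default sizes below are hypothetical; the post only says that chunks are numbered and that earlier translations are passed along as context.

```python
import re


def chunked(items, size):
    """Split a list into consecutive chunks of at most `size` items."""
    return [items[i:i + size] for i in range(0, len(items), size)]


def build_prompt(chunk, previous, context_size=3):
    """Number the lines of one chunk and prepend recent translated pairs."""
    lines = [
        "Translate each numbered line to Brazilian Portuguese.",
        "Reply with the same numbers, one line each.",
    ]
    recent = previous[-context_size:] if context_size > 0 else []
    if recent:
        lines.append("Context from earlier lines (do not translate again):")
        lines += [f"- {src} => {dst}" for src, dst in recent]
    lines.append("Lines:")
    lines += [f"{n}. {text}" for n, text in enumerate(chunk, start=1)]
    return "\n".join(lines)


def parse_numbered(reply, expected):
    """Map a numbered reply back to a list; missing numbers become None."""
    found = {}
    for line in reply.splitlines():
        m = re.match(r"\s*(\d+)[.)]\s*(.+)", line)
        if m:
            found[int(m.group(1))] = m.group(2).strip()
    return [found.get(n) for n in range(1, expected + 1)]
```

Numbering matters on the way back: if the model drops or merges a line, the gap shows up as None for that number instead of shifting every later line by one.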

The fixer rewrites European Portuguese (PT-PT) forms as Brazilian ones. Most systems either guess at these rules or bury them. I keep 36 explicit regex rules in code and in a JSON file so I can edit them without redeploying:

# (pattern, replacement) pairs; \b keeps each match on a whole word
_PTPT_TO_PTBR = [
    (r"\btu\b", "você"),
    (r"\bteu\b", "seu"),
    (r"\bestás\b", "está"),
    (r"\bautocarro\b", "ônibus"),
    (r"\btelemóvel\b", "celular"),
    # ... and 31 more rules for pronouns, verbs, vocabulary
]

The rules get loaded from a JSON file first. You can edit them. You can test them. You can see what is actually happening.
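A sketch of the load-then-apply step, assuming the JSON file holds a list of [pattern, replacement] pairs. The file layout, the function names, and the case handling are assumptions; dubweave’s real loader may differ.

```python
import json
import re
from pathlib import Path


def load_rules(path, fallback):
    """Load [pattern, replacement] pairs from JSON, else use the built-in list."""
    p = Path(path)
    pairs = json.loads(p.read_text(encoding="utf-8")) if p.exists() else fallback
    return [(re.compile(pat, re.IGNORECASE), rep) for pat, rep in pairs]


def match_case(replacement, original):
    # Keep a leading capital so sentence starts survive the swap
    return replacement.capitalize() if original[:1].isupper() else replacement


def normalize(text, rules):
    """Apply every rule in order.

    Replacements are inserted literally, so this sketch does not support
    backreferences such as \\1 in the replacement string.
    """
    for pattern, rep in rules:
        text = pattern.sub(lambda m, rep=rep: match_case(rep, m.group(0)), text)
    return text
```

Because the rules are plain data, a test can feed in two or three pairs and check a sentence without touching the rest of the pipeline.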

Station 5: Does it fit?

This is the center of everything. Each translated sentence is checked against how long it has to fit. Here is the unique part: I measured the actual speech rate for every voice, not guessed it.

VOICE_CALIBRATION: dict[str, float] = {
    "pf_dora": 13.3,           # Kokoro female
    "pm_alex": 13.1,           # Kokoro male  
    "pt-BR-FranciscaNeural": 11.1,  # Google (fast)
    "M1": 16.0, "F1": 16.0,    # Supertonic
    "default": 15.1,
}

def _estimate_synth_duration(text: str, cps: float = 15.1) -> float:
    # Estimated seconds of speech: character count divided by chars per second
    return len(text.strip()) / cps

These numbers come from autoresearch loops where I ran actual samples and measured them. Not guesses. The data.

Then if something is too long, I try an LLM rephrase. If that fails, I trim to the nearest word boundary. Because the worst failure is when the voice keeps going and the mouth is already closed.
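The check, rephrase, then trim order can be sketched like this. The function names and the rephrase callback are hypothetical stand-ins; the real LLM call is left out.

```python
def estimated_duration(text: str, cps: float) -> float:
    """Seconds of speech, estimated from character count."""
    return len(text.strip()) / cps


def trim_to_fit(text: str, slot_seconds: float, cps: float) -> str:
    """Cut at the last word boundary that still fits the time slot."""
    text = text.strip()
    budget = int(slot_seconds * cps)  # characters that fit in the slot
    if len(text) <= budget:
        return text
    # Look one character past the budget so a space there counts as a boundary
    cut = text[:budget + 1]
    if " " in cut:
        cut = cut[:cut.rindex(" ")]
    else:
        cut = cut[:budget]  # one long word: hard cut
    return cut.rstrip(" ,;:")


def fit_line(text, slot_seconds, cps, rephrase=None):
    """Try the original, then an optional rephrase, then trimming."""
    if estimated_duration(text, cps) <= slot_seconds:
        return text
    if rephrase is not None:
        shorter = rephrase(text, int(slot_seconds * cps))
        if shorter and estimated_duration(shorter, cps) <= slot_seconds:
            return shorter
    return trim_to_fit(text, slot_seconds, cps)
```

Trimming loses meaning, so it sits last. The rephrase result is re-checked against the slot, because a model asked for a shorter line does not always return one.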

Station 6: Speak

I support a bunch of different text-to-speech engines: Kokoro, XTTS v2, Edge, Google, Gemini, ElevenLabs, Supertonic. They all work differently but I make them all follow the same rules: generate audio, measure how long it is, then speed it up or slow it down to fit the time slot. If a clip fails to synthesize, it becomes a short silence instead of killing the whole run. That is not fancy. That is just reliability.
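A sketch of that shared contract with the engine reduced to a plain callable. The clamp limits, the sample rate, and the choice to fill the whole slot with silence are assumptions, and the actual time-stretching step is left out; only the required speed factor is computed.

```python
from dataclasses import dataclass


@dataclass
class Clip:
    samples: list        # mono samples; a real pipeline would use an array
    sample_rate: int
    speed: float = 1.0   # playback factor needed to fit the slot


def speed_factor(actual: float, slot: float, lo: float = 0.8, hi: float = 1.5) -> float:
    """Factor above 1 speeds the clip up; clamped so speech stays natural."""
    if actual <= 0 or slot <= 0:
        return 1.0
    return max(lo, min(hi, actual / slot))


def synthesize_slot(engine, text, slot, sample_rate=24000) -> Clip:
    """Run one engine call; on any failure return silence for the slot."""
    try:
        samples = engine(text)
        if not samples:
            raise ValueError("engine returned no audio")
    except Exception:
        # One bad clip must not abort the whole run
        return Clip([0.0] * int(slot * sample_rate), sample_rate)
    actual = len(samples) / sample_rate
    return Clip(samples, sample_rate, speed_factor(actual, slot))
```

Every engine sits behind the same callable, so adding an eighth one means writing one adapter and leaving the fitting and failure handling alone.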

Station 7: Mix it all together

I build the final audio directly in a numpy array. Each clip sits at its time offset. Then I make sure nothing clips, and I put the audio back into the original video with ffmpeg. Subtitles are generated separately using basic reading-speed math, and tiny gaps get merged so the subtitles feel like they were written by a person, not an algorithm.
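The subtitle side can be sketched in a few functions. The reading speed of 17 characters per second, the 0.2 second gap threshold, and the reading of "merged" as "extend the earlier cue to meet the next one" are all assumptions, not values taken from dubweave.

```python
def subtitle_duration(text, chars_per_second=17.0, min_s=1.0, max_s=7.0):
    """Reading-time estimate for one cue, clamped to sane bounds."""
    return max(min_s, min(max_s, len(text.strip()) / chars_per_second))


def merge_tiny_gaps(cues, max_gap=0.2):
    """Extend a cue to meet the next one when the gap between them is tiny.

    Cues are dicts with "start", "end" and "text", sorted by start time.
    The input list is left unchanged.
    """
    merged = [dict(c) for c in cues]
    for current, following in zip(merged, merged[1:]):
        gap = following["start"] - current["end"]
        if 0 < gap <= max_gap:
            current["end"] = following["start"]
    return merged


def format_timestamp(seconds):
    """Seconds to an SRT timestamp (HH:MM:SS,mmm)."""
    ms = round(seconds * 1000)
    h, ms = divmod(ms, 3_600_000)
    m, ms = divmod(ms, 60_000)
    s, ms = divmod(ms, 1000)
    return f"{h:02d}:{m:02d}:{s:02d},{ms:03d}"
```

Closing small gaps avoids the flicker of a subtitle vanishing for a tenth of a second between two lines of the same sentence.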

How it persists and resumes

Another unique part. If you run dubweave for hours and it stops at station 5, you restart from station 5 without redoing 1–4. This is baked in:

import json
import shutil

# project_dir(name) is defined elsewhere and returns the project folder as a Path
STAGE_FILES = {"translate": "translated.json", "synthesize": "timed_clips.json"}

def save_project_stage(name: str, stage: str, data):
    d = project_dir(name)  # projects/my_project/
    if stage == "download":
        # assumption: for this stage, data is the path to the source video
        shutil.copy2(str(data), str(d / "video.mp4"))
    elif stage in STAGE_FILES:
        (d / STAGE_FILES[stage]).write_text(json.dumps(data))
    # each stage: one file on disk

def load_project_stage(name: str, stage: str):
    # Load back from disk at any point
    d = project_dir(name)
    return json.loads((d / STAGE_FILES[stage]).read_text())

Every stage is a file. Pause. Come back. Tweak manually if needed.
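Resuming then comes down to finding the first stage whose file is missing. A minimal sketch: the stage order and the name segments.json are placeholders, and only video.mp4, translated.json, and timed_clips.json appear in the code above.

```python
from pathlib import Path

# Hypothetical stage order; "segments.json" is a placeholder name
STAGE_OUTPUTS = [
    ("download", "video.mp4"),
    ("transcribe", "segments.json"),
    ("translate", "translated.json"),
    ("synthesize", "timed_clips.json"),
]


def first_missing_stage(project: Path):
    """Return the first stage whose output file is absent, or None if done."""
    for stage, filename in STAGE_OUTPUTS:
        if not (project / filename).is_file():
            return stage
    return None
```

Deleting a stage file by hand becomes a way to force that stage, and everything after it, to run again.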

What makes it different

Local and resumable: Everything runs local by default. Every stage saves to disk. Stop and restart from where you left off.

Timing is measured, not guessed: 13.3 chars/sec for Kokoro pf_dora. 11.1 for Google Francisca. These come from actual measured output, kept per-voice in code.

Normalization is explicit and editable: 36 regex rules for PT-PT → PT-BR. Not buried in code. In a JSON file. You can edit pronouns, verb forms, gerunds, vocabulary. Change them and run again.

Translation has a safety net: Gemini with context windows for consistency. If the API fails or you do not have a key, it falls back to a local NLLB model. Normalization runs either way.

Seven TTS engines, one contract: Kokoro, XTTS v2, Edge, Google, Gemini, ElevenLabs, Supertonic. All generate audio, get time-stretched to fit, and fall back to a short silence on failure. Same code path.

Measured, not guessed: Autoresearch loops with KEEP and DISCARD logged. I do not assume something is better. I measure it. The numbers are in the README.

How I work

I am autistic. I do not build by guessing. I measure, I calibrate, I make small changes and then check if they actually worked. I keep logs. I write rules down in JSON. I tune things like speech rate over and over until the numbers match what I hear, because my ears alone are not reliable enough. Ambiguity exhausts me, so I build systems with explicit rules instead. This is not a personality thing. This is how I keep systems honest, and this is how I keep myself functional.

That is it

dubweave is a technical system but also a personal one. It was built for my wife. It was built the way I know how to build. If you are thinking about hiring me, this is what you are hiring: someone who builds things by measuring them, not guessing. Someone who keeps explicit rules and keeps them in files. Someone who cares about timing and context windows and graceful failure. Someone who would rather spend an hour understanding a thing than thirty minutes assuming it works.