TL;DR: I restructured my team's scattered documentation into an AI-queryable format, modelled every service's Splunk log events as TypeScript types, and built an investigation workflow around it. Complex incident investigations went from ~2 hours to ~30 minutes, and the system gets smarter with every investigation archived.
Who this is for: Backend and platform engineers dealing with on-call rotations, incident investigation across multiple services, and documentation that never stays current.
It's 2 AM. PagerDuty wakes you up. The alert says something is wrong with a service you haven't touched in months.
You open Splunk. You open New Relic. You open your IDE. You open Slack. And then you open your team's documentation — Confluence, a wiki, whatever your team uses — and the real challenge starts. That's not the tools' fault. It's a people problem.
The documentation is scattered across dozens of pages written by different engineers in different eras. Half of it is stale. The service got renamed six months ago and nobody updated the docs. You're scanning through walls of text looking for one critical detail while a production incident ticks upward.
This was our reality. And I decided to fix it — not by writing better docs, but by rethinking what documentation is for.
The Mental Shift
Our backend team ran a suite of Java Spring Boot services and Python Lambda functions on AWS. Multiple services, inconsistent logging, complex downstream dependencies spread across a large organisation. When incidents happened, engineers were expected to acknowledge within 5 minutes and be investigating within 20.
But "investigating" usually meant spending the first 20-30 minutes just gathering context. Which service is this? Where are the logs? What does this log event mean? Has this happened before?
The information existed. It was just locked in Confluence pages, wiki entries, old Slack threads, and people's heads. Not accessible — not at 2 AM, not under pressure.
Traditional documentation asks: "How do I explain this system to a person?"
AI-ready documentation asks: "How do I make this system queryable?"
The answer was having both formats. A docs/ folder with comprehensive, narrative documentation for engineers to read and onboard with. And a .context/ folder with dense, structured, AI-queryable reference material. Documentation to read and documentation to query.
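A hypothetical layout for that split might look like this (folder names beyond `docs/` and `.context/` are illustrative, not prescriptive):

```text
repo/
├── docs/                 # narrative documentation for humans: onboarding, architecture, runbooks
└── .context/             # dense, structured reference material for AI
    ├── services/         # one manifest per service
    ├── log-models/       # TypeScript types for each service's Splunk log events
    └── investigations/   # archived incident investigations
```

The point of the split is that each audience gets the format it consumes best: prose for people, structure for machines.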
What I Built
Consolidate and Restructure
I started by pulling all existing documentation into a single repository — Confluence pages (converted from HTML to Markdown), GitHub Pages docs, and README files from key project repos.
Then I asked something I've found consistently useful when working with AI: "What do industry best practices say about documenting projects that span multiple services, codebases, and teams? Let's restructure what we have to align with those practices."
Our existing docs weren't absent — someone had put genuine effort in — but they'd been written without documentation training or industry frameworks to guide the structure. I let established patterns fill that gap. The AI helped identify what was missing, what was stale, and what was improperly categorised. We rebuilt from there.
Map the Territory: Service Manifests
With the structure in place, I went service by service through the codebases. Using Claude, I scanned each one and created service manifests — what each service does, what it talks to, what talks to it. Five manifests covering the core services, each one a structured picture of that service's role in the backend: incoming requests, outgoing requests, dependencies inside and outside our team's scope.
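A manifest might be as simple as a small structured file per service. This sketch is illustrative — the service names, fields, and resource names are all invented for the example:

```yaml
# .context/services/order-service.yaml — illustrative manifest, not a real schema
name: order-service
runtime: java-spring-boot
purpose: Accepts customer orders and publishes fulfilment events
inbound:
  - from: api-gateway          # requests coming into this service
    protocol: HTTPS/REST
outbound:
  - to: payment-service        # downstream dependency inside the team
    protocol: HTTPS/REST
  - to: fraud-check            # dependency outside the team's scope
    protocol: SQS
resources:
  - type: dynamodb
    name: orders-table
    repo: github.com/example-org/order-service
logging:
  destination: splunk
  index: backend-orders
```

Keeping manifests this terse is deliberate: they are meant to be loaded into an AI context window wholesale, not read cover to cover.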
This became Mermaid diagrams of the infrastructure topology. Sequence diagrams of key request flows. Inventories of cloud resources — DynamoDB tables, Lambda functions, S3 buckets — with their associated repositories.
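A request-flow sequence diagram in that style might look like the following — service names are illustrative placeholders:

```mermaid
sequenceDiagram
    participant GW as api-gateway
    participant OS as order-service
    participant PS as payment-service
    GW->>OS: POST /orders
    OS->>PS: authorize payment
    PS-->>OS: authorization result
    OS-->>GW: 201 Created
```

Because Mermaid diagrams are plain text, they live in version control next to the manifests and stay queryable by the AI, unlike diagram images pasted into a wiki.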
I did the initial pass in about a week, then built up the picture incrementally — adding services as investigations happened and free time allowed, so the team's sprint velocity wasn't impacted. This isn't a one-and-done effort. It's meant to be alive.
The Highest-Leverage Thing: TypeScript Models of Your Log Events
Here's the part I'm most proud of, because it solved a problem I hadn't seen anyone else tackle.
Each of our backend services logged to Splunk, but each had its own format. Not wildly different, but different enough to cause real pain. A customerId might be a top-level field in one service's events but buried three levels deep in a nested JSON object in another. Experienced engineers would pick this up over time — partially — but never document it. It became tribal knowledge: information engineers didn't even know they had, accumulated through hours of painful investigation.
So I downloaded sample log events from every service using Splunk queries, fed them to Claude, and had it generate TypeScript type definitions for each service's log format.
```typescript
// Example: one service's log event structure
interface ServiceLogEntry {
  timestamp: string;
  level: string;
  service: string;
  event: {
    type: string;
    customerId: string; // <-- top level here
    // ...
  };
}

// Another service: same data, different structure
interface AnotherServiceLogEntry {
  timestamp: string;
  message: string;
  context: {
    request: {
      headers: {
        'x-customer-id': string; // <-- buried here
      };
    };
  };
}
```
Once these models existed in the project, the AI had a map of every service's log structure. It could craft Splunk queries that traced a request across multiple services — joining on identifiers it knew lived in different places in each service's logs. No single engineer could do this, because no one person held all the structural knowledge simultaneously.
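As a sketch of what that looks like in practice, here is the kind of cross-service query the AI can produce once it knows both field locations. The index and service names are invented; the two field paths follow the log models shown earlier:

```spl
index=backend (service="order-service" OR service="another-service")
| eval customer_id = coalesce('event.customerId', 'context.request.headers.x-customer-id')
| transaction customer_id maxspan=5m
| table _time, customer_id, service, duration, eventcount
```

The `coalesce` line is the payoff: it normalises the customer identifier across both log shapes, something you can only write if you already know where each service hides the field.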
The Investigation Flywheel
The final piece was a wizard-based investigation tool: a script and lightweight web app that guided engineers through the process.
- Enter the alert details — service, alert type, description
- Search past investigations — the system checks a repository of previous investigations for similar incidents
- Generate an AI prompt — pre-loaded with the relevant service documentation, log event models, Splunk query examples, and any context from similar past investigations
- Investigate with AI — work iteratively with an AI assistant to generate queries, interpret results, and trace issues through the system
- Archive the investigation — commit your notes, queries, and findings back to the repository
That last step is critical. It creates a flywheel: each investigation makes the system smarter for the next one. Better documentation → better AI responses → faster investigations → saved investigations → even better future responses.
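The prompt-generation step can be surprisingly simple — mostly reading curated files and concatenating them. This is a minimal sketch, not the real tool; the `.context/` paths and the `Alert` shape are assumptions for illustration:

```typescript
// Assemble an investigation prompt from curated context files.
// Paths, filenames, and the Alert shape are illustrative assumptions.
import { readFileSync, existsSync } from "fs";
import { join } from "path";

interface Alert {
  service: string;
  type: string;
  description: string;
}

function buildPrompt(alert: Alert, contextDir = ".context"): string {
  const sections: string[] = [
    `## Alert\nService: ${alert.service}\nType: ${alert.type}\n${alert.description}`,
  ];

  // Pull in the service manifest and log model, if this service has them.
  for (const rel of [
    join("services", `${alert.service}.md`),
    join("log-models", `${alert.service}.ts`),
  ]) {
    const path = join(contextDir, rel);
    if (existsSync(path)) {
      sections.push(`## ${rel}\n${readFileSync(path, "utf8")}`);
    }
  }

  return sections.join("\n\n");
}
```

A real version would also search the investigation archive for similar past incidents and splice those in, but the core idea is just this: deterministic retrieval of hand-curated context, not vector-store roulette.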
We logged 14 investigations in the first month. By the end of that month, the system was already surfacing relevant past incidents and proven query patterns when new alerts came in.
Why AI + Splunk Is a Better Partnership Than You'd Expect
This deserves its own section, because it's where the human-AI collaboration really shines.
Splunk has over 140 commands. Most engineers know 5-10. We learn enough to get by, then stop. And honestly, why wouldn't we? When you're investigating an incident, your cognitive load is already maxed out — figuring out the time window, the services involved, where the relevant identifiers live. You write a query with whatever Splunk knowledge you have, tweak it until it runs, and move on because you're not even sure you're looking at the right thing.
Multiple tabs open. Queries over hours of data, millions of events, each taking a minute (or ten) to run. You're not sure about a command, so the query fails and you debug the query itself. You're on a call with other engineers theorizing about the issue and you lose your train of thought mid-query.
The AI doesn't have these problems. It knows every command. It uses `stats`, `timechart`, `transaction`, `eventstats` — commands many of us never use to their fullest. It renames ugly field names to something readable. It navigates nested JSON effortlessly. You run a query, download 10 sample results, hand them to the AI — and it scans them in seconds, noticing patterns and anomalies you'd have missed while scrolling. Every bit of this reduces your cognitive load and allows you to focus on being Sherlock Holmes to AI's Watson.
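For a flavour of what "renames ugly fields and actually uses `stats`" looks like, here is the kind of query the AI tends to produce. Everything here — index, service name, field path — is illustrative:

```spl
index=backend service="payment-service" level=ERROR earliest=-4h
| rename "context.request.headers.x-customer-id" AS customer_id
| stats count AS errors, values(message) AS messages BY customer_id
| sort - errors
```

Nothing in it is exotic, but most engineers under incident pressure would stop at a bare keyword search rather than aggregate per customer and sort by impact.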
You bring the context and domain knowledge. The AI brings meticulousness, encyclopedic command knowledge, and the ability to scan vast amounts of structured data. It's a genuinely great partnership.
I've rarely met an engineer who was both a Splunk expert and a deep domain expert on the systems they were investigating. With this setup, you don't have to be.
The Results
After rolling this out to the team:
| Metric | Before | After |
|---|---|---|
| Investigation time (complex issue) | ~2 hours | ~30 minutes (75% faster) |
| Investigation time (familiar issue) | ~45 minutes | ~10 minutes (78% faster) |
| Query sophistication | Basic, inconsistent | Advanced, consistent patterns |
| Investigation documentation | Rarely created | 14 investigations archived in month one |
| Knowledge retention | Lost when engineers leave | Searchable investigation archive |
We presented the system first to ~20 engineers on our team, then to roughly 100 engineers, managers, and product owners across the wider organisation. Since then, multiple other teams have reached out asking how to replicate it for their own services — and some have started implementing it.
"But What About..."
"Isn't this just RAG?"
Sort of, but the key insight isn't the retrieval mechanism — it's what you retrieve. Most RAG implementations stuff documents into a vector store and hope for the best. The value here came from curating the documentation: structured service manifests, TypeScript log models, Splunk reference material, and archived investigations. Garbage in, garbage out applies to RAG just as much as anything else.
"Won't AI hallucinate bad Splunk queries?"
It can, and it does occasionally. But here's the thing — engineers write bad Splunk queries too. The difference is the AI writes them instantly, so you can iterate faster. And because the log models give the AI structural knowledge of each service's events, the queries are usually more accurate than what a human would produce from memory. You always validate by running the query and checking results — and validated queries get archived back into the context docs, improving accuracy further.
"My org won't approve AI tools for production data."
Fair. Two things: first, you don't feed production data to the AI — you feed it documentation about production data, which is a very different security posture. Second, the documentation restructuring and TypeScript log models are valuable even without AI. They're useful for onboarding, knowledge sharing, and just having accurate docs for once.
What I'd Tell You If You're Considering This
Start with the documentation restructuring. Even without AI, version-controlled structured docs are better than scattered wiki pages. The AI layer is powerful, but the foundation is having your documentation in a format that's actually useful. Bonus: maintenance gets easier too — when a service gets renamed, AI can scan the entire doc set and find every reference, including ones a find-and-replace would miss because they're misspelled, abbreviated, or contextual.
The TypeScript log models are the highest-leverage piece. If you do nothing else, model your log event structures. The amount of tribal knowledge this captures is staggering.
Investigation archival creates compound returns. The first investigation is the hardest. By the fourteenth, the system is already suggesting relevant past incidents and proven query patterns.
It's not about replacing engineers. It's about making sure that at 2 AM, you have access to the collective knowledge of every investigation your team has ever done — and an assistant that can help you use it.
If your team deals with complex backend systems, on-call rotations, and the kind of tribal knowledge that walks out the door when engineers leave, I'd love to hear how you're approaching it. What's worked? What hasn't?