DEV Community: Jay Bamroliya

I built an AI agent that fixes incidents and proves it — using SigNoz

Jay Bamroliya — Sun, 26 Jul 2026 05:52:48 +0000

How my team built a self-healing SRE agent for the Agents of SigNoz hackathon: it detects a problem in SigNoz, fixes the app, and then uses SigNoz again to prove recovery.

Every observability tool — including SigNoz's own new MCP server — is great at one thing: telling you what broke. "This endpoint is slow, here are the traces." But then a human still has to read the data, guess the cause, apply a fix, and check whether it actually worked.

For the Agents of SigNoz hackathon (Team TraceBandits — Jay Bamroliya & Kaushal Karkar), we asked a different question: what if the agent did that whole loop? Not just explain the incident, but fix it — and then prove the fix worked, using observability.

That last part is the whole idea. The result: an endpoint that was taking 4.2 seconds drops to tens of milliseconds, detected → fixed → verified, with one click and no human in the loop.

The agent's dashboard right after a heal: it detected the slow /about endpoint, disabled the fault, and re-tested it — 4.21 s → 36 ms, with its own incident report.

What it does

The agent runs a closed loop:

Detect — reads service metrics from SigNoz and spots the degraded service.
Investigate — pulls per-operation latency (including internal DB spans) to find where the time actually goes.
Fix — applies a scoped remediation to the running app.
Verify — re-tests the endpoint and confirms recovery, then writes its own incident report.
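The loop above can be sketched in a few lines of Python. This is a minimal illustration, not the team's Java agent: the four callables, the HealResult type, and the 500 ms health threshold are hypothetical stand-ins, and the step order is hard-coded here, whereas in the real agent the LLM chooses which tool to call next.

```python
from dataclasses import dataclass
from typing import Callable, Optional


@dataclass
class HealResult:
    service: Optional[str]
    operation: Optional[str]
    before_ms: Optional[float]
    after_ms: Optional[float]
    recovered: bool


def self_heal(
    find_degraded_service: Callable[[], Optional[str]],
    find_slowest_operation: Callable[[str], tuple[str, float]],
    apply_fix: Callable[[str, str], None],
    measure_latency_ms: Callable[[str], float],
    healthy_below_ms: float = 500.0,  # hypothetical threshold
) -> HealResult:
    """Run one detect -> investigate -> fix -> verify pass."""
    service = find_degraded_service()                        # 1. Detect
    if service is None:
        # Nothing is degraded, so there is nothing to fix.
        return HealResult(None, None, None, None, recovered=True)
    operation, before_ms = find_slowest_operation(service)   # 2. Investigate
    apply_fix(service, operation)                            # 3. Fix
    after_ms = measure_latency_ms(operation)                 # 4. Verify
    return HealResult(service, operation, before_ms, after_ms,
                      recovered=after_ms < healthy_below_ms)
```

Passing the steps in as callables keeps the loop itself testable without a running SigNoz instance or demo app.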

Most "AI + observability" demos stop at step 2. Steps 3 and 4 — acting and then verifying via observability — are what make this different, and they're exactly what an on-call engineer actually wants.

One click: I inject a latency fault, the app goes DEGRADED, and the agent starts investigating SigNoz live.

The architecture

The whole system: an OTel-instrumented app streams to SigNoz; the agent reads SigNoz through tools and acts back on the app.

Agent: Spring Boot (Java). Most agents are written in Python; building this one in Java kept it close to the app we were healing and easy to reason about.
Brain: Llama-3.3-70B on Groq (free tier, OpenAI-compatible function calling). The LLM decides which tools to call and in what order.
Observability: SigNoz (Cloud for the demo; the repo ships casting.yaml so you can self-host it with Foundry).
Demo target: an OpenTelemetry-instrumented Spring Boot app with a runtime fault-injection control plane, so we can stage a real, controllable incident.

How the agent uses SigNoz

This is the heart of the project, so here are the actual tools the LLM can call — each is a call to the SigNoz Query API (authenticated with a service-account key):

get_services → POST /api/v1/services — p99 latency, error rate, throughput per service. Finds the degraded service.
get_top_operations → POST /api/v1/service/top_operations — p50/p95/p99 per operation, including internal spans (DB calls, repository methods). This is how the agent pinpoints where the latency lives, not just that a request is slow.
apply_fix → calls the app's control plane to disable the fault.
verify_recovery → re-tests the endpoint directly and reports the new latency.
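For readers who have not used OpenAI-compatible function calling, here is a rough Python sketch of how tools like these are declared and dispatched. The descriptions and the `service` parameter are assumptions (the post only names the tools), and the real agent is written in Java, so treat this as an illustration of the wire format rather than the project's code.

```python
import json

# Descriptions and parameter names are hypothetical; the post only names the tools.
# Every parameter is a string, matching the string-only approach the post settles on.
TOOLS = [
    {
        "type": "function",
        "function": {
            "name": "get_services",
            "description": "List services with p99 latency, error rate and throughput.",
            "parameters": {"type": "object", "properties": {}, "required": []},
        },
    },
    {
        "type": "function",
        "function": {
            "name": "get_top_operations",
            "description": "Per-operation latency percentiles for one service.",
            "parameters": {
                "type": "object",
                "properties": {"service": {"type": "string"}},
                "required": ["service"],
            },
        },
    },
]


def dispatch(tool_call: dict, handlers: dict) -> str:
    """Run the handler named in an OpenAI-style tool call and return JSON text."""
    name = tool_call["function"]["name"]
    # The arguments field arrives as a JSON-encoded string, not a dict.
    args = json.loads(tool_call["function"].get("arguments") or "{}")
    handler = handlers.get(name)
    if handler is None:
        return json.dumps({"error": f"unknown tool: {name}"})
    return json.dumps(handler(**args))
```

The string returned by `dispatch` is what would be sent back to the model as the tool result, so the model can decide on its next call.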

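Notice that measurement is used twice in the loop: SigNoz to detect and diagnose, and a second measurement after the fix to confirm recovery. The immediate proof comes from verify_recovery's direct re-test of the endpoint; SigNoz's own view shows the same recovery once ingestion catches up. That verify-after-fix step is the part I'm proudest of.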

SigNoz showing the /about endpoint's latency — this is what the agent reads to find the bottleneck.

Here's a real, verbatim conclusion the agent wrote during a run:

"Incident summary: the 'smartcontactmanager' service had an endpoint '/about' with abnormally high latency (p99 ≈ 4.2 s). The fix applied was to disable the injected latency fault. After the fix, the endpoint was re-tested and its average response time was 36 ms, confirming recovery."

What broke while building it (the useful part)

1. Spring bean-name clashes. I registered RestClient beans named groqClient and demoAppClient, which collided with the @Component classes of the same names. Spring refused to start: "a bean with that name has already been defined." Fix: rename the @Bean methods (groqRestClient, demoRestClient) and match the constructor parameter names, since with several RestClient beans of the same type Spring falls back to matching the parameter name against the bean name.

2. Llama stringifies numbers. My first tool schema had an integer minutes parameter. Groq's validation kept rejecting the model's output: "expected integer, but got string" — Llama emitted {"minutes": "60"}. Fix: drop flaky numeric params; the tools take only strings and the time window is a server-side default. Simpler schema, zero validation failures.

3. Ingestion delay vs. real-time proof. SigNoz Cloud has a short ingestion lag, so right after a fix the aggregated p99 hasn't updated yet. For deterministic proof, verify_recovery re-tests the endpoint directly and reports the measured latency — instant and honest — while SigNoz's dashboards show the same recovery a little later.
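A direct re-test like this is easy to sketch. The version below is an assumption-laden illustration, not the agent's Java code: `send_request` stands in for whatever HTTP call hits the endpoint, and the sample count and 500 ms threshold are made up.

```python
import time
from statistics import mean
from typing import Callable


def verify_recovery(
    send_request: Callable[[], None],
    samples: int = 5,                 # hypothetical sample count
    healthy_below_ms: float = 500.0,  # hypothetical threshold
) -> dict:
    """Time `samples` direct requests and report the average latency."""
    timings_ms = []
    for _ in range(samples):
        start = time.perf_counter()
        send_request()  # e.g. an HTTP GET against the endpoint under test
        timings_ms.append((time.perf_counter() - start) * 1000.0)
    avg_ms = mean(timings_ms)
    return {"avg_ms": avg_ms, "samples": samples, "recovered": avg_ms < healthy_below_ms}
```

Because the measurement is taken at the client, it does not depend on how quickly the telemetry backend aggregates new data.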

What I learned

Observability isn't just for humans anymore. The moment you expose telemetry through a clean API (or an MCP server), an agent can reason over it — and act on it.
The verify step is underrated. Any agent can propose a fix. An agent that checks its own work against real signals is one you might actually trust.
Free tools go far. With Groq's free Llama tier, SigNoz, and OpenTelemetry, the whole stack costs nothing and runs on a 6 GB laptop.

Try it

The code, the casting.yaml for a reproducible SigNoz deployment, a pitch deck, and a full demo video are all in the repo. Clone it, drop your SigNoz + Groq keys into .env, run the app and the agent, open the dashboard, and click Self-Heal.

SigNoz's MCP server lets an AI see your system. This agent lets it heal your system — and prove it did.

— Team TraceBandits · Agents of SigNoz (WeMakeDevs × SigNoz)

My registration endpoint took 5.3 seconds. The INSERT took 38 ms.

Jay Bamroliya — Sat, 18 Jul 2026 22:29:26 +0000

Instrumenting a five-year-old Spring Boot app with zero code changes, and everything that broke on the way to self-hosting SigNoz on a 6 GB Windows laptop.

My Spring Boot registration endpoint took 5.3 seconds. The SQL INSERT inside it took 38 milliseconds. I only learned that after pointing SigNoz at a five-year-old side project — without changing a single line of its code. This post is the story of that evening: what broke on the way (a lot), and what the traces taught me about my own app.

Around 2020, when I was learning Java, I built Smart Contact Manager — a Spring Boot 2.3 app with Spring Security, JPA, and Thymeleaf. I made it purely for my personal learning, and it shows: it has System.out.println debugging all over it, because that's what I knew back then.

Full disclosure before we start: I did this whole setup pair-working with an AI coding agent — it drove the terminal while I made the decisions, clicked the UAC prompts, and created the accounts. For a hackathon literally called Agents of SigNoz, that felt fitting. Every number and error in this post is from my machine, and I watched all of it happen.

For the Agents of SigNoz pre-event challenge, I asked a different question than "can I instrument a fresh demo app": can modern observability tooling light up an app I wrote five years ago, without me changing a single line of its code?

The answer turned out to be yes — and the traces immediately showed me two things about my own app that I never knew. Along the way, almost every step of the setup broke in an instructive way, so this post is both a "how to" and an honest list of everything that went wrong.

Setup: Windows 11 laptop, 6 GB RAM (yes, really), Java 19, no Docker installed, and — as I discovered — a broken WSL.

Step 0: The setup fought back

SigNoz self-hosting needs Docker. My machine had neither Docker nor a working WSL — wsl --version died with:

Class not registered
Error code: Wsl/CallMsi/Install/REGDB_E_CLASSNOTREG

Re-registering the Store package didn't help. What fixed it was installing the official WSL MSI directly from the microsoft/WSL GitHub releases (wsl.2.7.10.0.x64.msi), enabling the Virtual Machine Platform feature, and rebooting.

Then came the first genuinely useful discovery, straight from the SigNoz Docker install docs: on Windows, don't run SigNoz under Docker Desktop. ClickHouse Keeper is known to crash in a restart loop with segmentation faults under Docker Desktop's virtualization. The docs recommend native Docker Engine inside WSL 2 instead:

# inside Ubuntu on WSL2
curl -fsSL https://get.docker.com | sh

With 6 GB of total RAM, I also capped the WSL VM so ClickHouse, the JVM, and my browser could coexist — C:\Users\<me>\.wslconfig:

[wsl2]
memory=4GB
swap=6GB
processors=4

This worked. ClickHouse Keeper reported healthy on the first boot and stayed that way.

Step 1: Casting SigNoz with Foundry

If you last looked at SigNoz a while ago, the install has changed: the docker-compose manifests and install.sh are deprecated. SigNoz now deploys through Foundry, a CLI that treats your observability stack as declarative config:

curl -fsSL https://signoz.io/foundry.sh | bash

cat > casting.yaml <<'EOF'
apiVersion: v1alpha1
kind: Installation
metadata:
  name: signoz
spec:
  deployment:
    flavor: compose
    mode: docker
EOF

foundryctl cast -f casting.yaml

The cast command validates Docker, generates Compose files into pours/deployment/, and starts five containers: ClickHouse, ClickHouse Keeper, Postgres, the SigNoz server, and the OTel collector ("ingester"). The UI listens on :8080, and OTLP ingestion on :4317 (gRPC) and :4318 (HTTP).

Two real-world notes from my run:

On a slow connection, the image pull step can get killed mid-way. Pre-pulling with docker compose pull inside pours/deployment/ and re-running foundryctl cast (it's idempotent) got me through.
A killed first run left a stale migration lock. The SigNoz container crash-looped with migrations table is already locked ... duplicate key value violates unique constraint "migration_lock_table_name_key". One DELETE FROM migration_lock; against the bundled Postgres and it healed itself. If your signoz container restarts endlessly after an interrupted install, check this first.

And a genuinely non-obvious gotcha: telemetry ingestion doesn't start until you create the admin account. The collector registers with the SigNoz server via OpAMP, and until the first user/org exists, the server rejects it (cannot create agent without orgId in the logs) and the collector never opens ports 4317/4318. I spent a while staring at Failed to export errors from my app before realizing the fix was simply… finishing the signup form at localhost:8080.
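A quick way to tell "collector not listening yet" apart from "my exporter is misconfigured" is to probe the two OTLP ports directly. Here is a small standard-library sketch; the helper names are mine, and the host default assumes the local install described above.

```python
import socket


def port_is_open(host: str, port: int, timeout_s: float = 1.0) -> bool:
    """Return True if a TCP connection to host:port succeeds."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        # Covers refused connections, timeouts and unreachable hosts.
        return False


def otlp_ports_status(host: str = "localhost") -> dict[int, bool]:
    """Check the OTLP gRPC (4317) and HTTP (4318) ports."""
    return {port: port_is_open(host, port) for port in (4317, 4318)}
```

If both ports report closed while the SigNoz UI on :8080 is reachable, the unfinished signup form is the first thing to check.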

Step 2: Zero-code instrumentation

The app didn't need MySQL for this experiment, so I pointed it at an in-memory H2 database (a dependency swap in pom.xml plus datasource properties — config, not code). Then the entire instrumentation was one JVM flag and a few environment variables:

$env:OTEL_SERVICE_NAME = "smartcontactmanager"
$env:OTEL_EXPORTER_OTLP_ENDPOINT = "http://localhost:4318"
$env:OTEL_EXPORTER_OTLP_PROTOCOL = "http/protobuf"

java -javaagent:opentelemetry-javaagent.jar `
     -jar smartcontactmanager-0.0.1-SNAPSHOT.jar

That's it. The OpenTelemetry Java agent (v2.29.0) attached to my Spring Boot 2.3.4 app running on Java 19 — a framework version from 2020 — and instrumented Tomcat, Spring MVC, Spring Data, Hibernate, JDBC, and Spring Security without complaint.

I generated traffic with a shell script: user registration, failed logins, successful logins, adding contacts, paginating through them, plus some deliberate errors (a bad path variable, 404s, unauthenticated access). Within seconds, 277 spans had landed in ClickHouse.

My five-year-old app, alive in the APM view.

Latency percentiles, request rate, Apdex — all from one JVM flag.

What the traces showed me

1. The 5.3-second registration

The Services view showed smartcontactmanager with a P99 latency of 3.75 seconds. Filtering the Traces Explorer by durationNano >= 2000000000 surfaced the culprit: POST /do_register, 5.29 s.

Every request slower than 2 seconds — including an 11.5 s cold-start GET /.

The flame graph broke it down mercilessly:

Span	Duration
POST /do_register	5,292 ms
└ UserRepository.save	2,761 ms
  └ Session.persist	834 ms
    └ INSERT smartcontact.user	38 ms

The flame graph: 5.29 s total, and the actual INSERT is the tiny bar at the end.

My first reaction was: wait, 5 seconds just for a registration? Do my other APIs also take this long and I never knew? That question — which I could never have answered before tonight — is the entire reason observability exists.

The database was innocent. The actual INSERT took 38 milliseconds — 0.7% of the request. The rest was BCrypt password hashing and Hibernate initializing its machinery on the first write of the session. Without the waterfall I'd have "optimized" the database, which was never the problem.
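The 0.7% figure is just the INSERT span divided by the root span. A tiny sketch of that arithmetic, using the durations from the flame graph; the helper names are mine, and the nanosecond conversion is included because the Traces Explorer filter above is expressed in durationNano.

```python
# Durations in milliseconds, as read off the flame graph above.
SPANS_MS = {
    "POST /do_register": 5292,
    "UserRepository.save": 2761,
    "Session.persist": 834,
    "INSERT smartcontact.user": 38,
}


def share_percent(span_ms: float, total_ms: float) -> float:
    """Percentage of the total request time spent in one span."""
    return 100.0 * span_ms / total_ms


def seconds_to_nanos(seconds: float) -> int:
    """Convert seconds to the nanosecond unit used by durationNano filters."""
    return int(seconds * 1_000_000_000)


total = SPANS_MS["POST /do_register"]
for name, duration in SPANS_MS.items():
    print(f"{name}: {share_percent(duration, total):.1f}% of the request")
```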

(The traces also caught my app restart red-handed: the first GET / after a restart took 11.5 seconds — a JVM cold start on a memory-starved machine, visible as two lonely 11,587 ms rows in the explorer. Every restart of a Java app has a story like this; traces just make it impossible to ignore.)

2. The exception my app was swallowing

This one I didn't stage. My traffic script re-registers the same test user every cycle, and the Exceptions tab lit up with JdbcSQLIntegrityConstraintViolationException → DataIntegrityViolationException → RollbackException, three per cycle.

Here's the thing: the endpoint returns HTTP 200 for those requests. My 2020-era controller catches the exception, prints a stack trace to stdout, and renders the signup page with a friendly error banner. Status-code monitoring would tell you this endpoint never fails. The exceptions tab, populated automatically from span error events, tells you the truth: the database is rolling back a transaction on every duplicate signup, and the "success" metric is lying to you.

Constraint violations my app has been silently swallowing — while returning HTTP 200.

That contrast — success metrics green, exceptions tab red — was the single most useful thing SigNoz showed me.
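To make the contrast concrete, here is a toy Python comparison of the two signals. The span records are invented for illustration and do not reflect SigNoz's data model; the point is only that a status-code metric and an exception count can disagree on the same requests.

```python
def status_error_rate(spans: list[dict]) -> float:
    """Share of requests whose HTTP status is 5xx."""
    failed = sum(1 for s in spans if s["status"] >= 500)
    return failed / len(spans)


def exception_rate(spans: list[dict]) -> float:
    """Share of requests that recorded at least one exception."""
    hit = sum(1 for s in spans if s["exceptions"])
    return hit / len(spans)


# Hypothetical records: duplicate signups return 200 but carry an exception.
SPANS = [
    {"route": "POST /do_register", "status": 200, "exceptions": []},
    {"route": "POST /do_register", "status": 200,
     "exceptions": ["DataIntegrityViolationException"]},
    {"route": "POST /do_register", "status": 200,
     "exceptions": ["DataIntegrityViolationException"]},
    {"route": "GET /about", "status": 200, "exceptions": []},
]

print(f"status-code error rate: {status_error_rate(SPANS):.0%}")
print(f"requests with exceptions: {exception_rate(SPANS):.0%}")
```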

3. My logs pillar was empty, and the reason was my own code

Traces and metrics flowed instantly, but the Logs Explorer sat at "No logs yet". The OTel agent captures logs by hooking the logging framework (Logback here) — and my student self logged everything with System.out.println, which no logging framework ever sees.

The zero-code fix: turn on framework logging via startup args, no rebuild required:

--logging.level.org.hibernate.SQL=DEBUG

Suddenly every SQL statement Hibernate executes arrives in SigNoz as a log record, automatically stamped with the trace_id of the request that caused it — so you can click from a slow span straight to the exact SQL it ran. My own println "logs" remain invisible forever. Consider this a message from your future self: use a logger.
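The same blind spot exists in other languages. As an analogy only (the app in this post is Java with Logback), this Python sketch attaches a handler to the logging framework, roughly the way an instrumentation agent hooks it, and shows that a print call never reaches the handler. All names here are made up for the example.

```python
import io
import logging
from contextlib import redirect_stdout


class CapturingHandler(logging.Handler):
    """Collects log messages, standing in for an agent's log exporter."""

    def __init__(self) -> None:
        super().__init__()
        self.records: list[str] = []

    def emit(self, record: logging.LogRecord) -> None:
        self.records.append(record.getMessage())


def demo() -> list[str]:
    handler = CapturingHandler()
    logger = logging.getLogger("demo.registration")
    logger.setLevel(logging.INFO)
    logger.propagate = False
    logger.addHandler(handler)
    try:
        with redirect_stdout(io.StringIO()):
            print("user saved via print")    # goes to stdout, handler never sees it
        logger.info("user saved via logger")  # goes through the framework
    finally:
        logger.removeHandler(handler)
    return handler.records
```

Calling `demo()` returns only the message that went through the logger, which is the whole argument for using one.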

Real Hibernate SQL flowing into the Logs Explorer.

Every log record carries the trace_id of the request that produced it — one click from log to trace.

Alerts, and an honest note about 6 GB of RAM

To round things out I created an alert rule in the UI: request rate above 0.4 ops/s over the last 5 minutes. The query-builder-with-live-chart flow made this pleasantly concrete — you see the exact line your threshold will cut across before you save.

The alert rule, genuinely firing on my traffic generator's waves.

One honest limitation of my potato-spec setup: when I tried to build the alert on P90 latency from the http.server.request.duration histogram, the percentile query kept failing with an internal error — the ClickHouse window-function query behind percentile aggregation didn't survive my 4 GB WSL memory cap. Switching the alert to the request-count metric worked instantly. On a machine with the recommended RAM I'd expect the percentile alert to work; on 6 GB, know your limits.

Quick reference: every error I hit, and the fix

Error / symptom	Fix
`Wsl/CallMsi/Install/REGDB_E_CLASSNOTREG` on any `wsl` command	Install the official WSL MSI from microsoft/WSL releases, reboot
ClickHouse Keeper segfault restart-loop on Windows	Don't use Docker Desktop — install Docker Engine natively inside WSL2 (per SigNoz docs)
`foundryctl cast` dies with `signal: killed` mid-pull	Pre-pull with `docker compose pull` inside `pours/deployment/`, then re-run `cast` (it's idempotent)
`signoz` container restart-loops: `migrations table is already locked ... migration_lock_table_name_key`	Stale lock from an interrupted install: `docker exec <postgres-container> psql -U signoz -d signoz -c "DELETE FROM migration_lock;"`
App logs `Failed to export` forever; ports 4317/4318 closed	Create the first admin account at `localhost:8080` — the collector can't register via OpAMP until an org exists (`cannot create agent without orgId` in server logs)
SigNoz dies when you close your WSL terminal	The WSL VM shuts down with its last client — keep a WSL shell open, or configure `vmIdleTimeout`
Percentile (P90/P99) alert queries fail with internal error	ClickHouse ran out of memory under my 4 GB WSL cap — use count/rate metrics, or give the VM more RAM

Takeaways

Zero-code instrumentation is real, even for old apps. A 2020 Spring Boot 2.3 app on Java 19 got full traces — HTTP, JPA, JDBC, security filters — from one -javaagent flag.
Traces answer "where did the time go" in a way metrics can't. P99 said "slow"; only the waterfall said "BCrypt + Hibernate warm-up, not the database".
Exceptions ≠ error responses. My app returned 200 while rolling back transactions. Watch both.
Logs are only as good as your logging discipline. The agent can ship your logs, but not your printlns.
Read the platform-specific docs. Native Docker-in-WSL instead of Docker Desktop (ClickHouse Keeper segfaults), the account-creation-before-ingestion dependency, and the Foundry migration were all things I'd never have guessed.
6 GB of RAM is enough for SigNoz + ClickHouse + a JVM + a browser, if you cap the WSL VM. Tight, but enough.

I won't pretend this was a smooth evening. It took many more hours than I expected, most of them spent on things that had nothing to do with SigNoz itself — a broken WSL, a forced reboot at midnight, a laptop with barely enough RAM. But that's exactly why I'm writing it all down.

I don't have a grand plan for the hackathon on the 20th yet. What I do have is something most teams won't: a self-hosted SigNoz that already survived everything my machine threw at it, and an app already streaming traces into it. The warm-up did its job.

If you're sitting on an old side project, don't build a fresh demo to try observability — instrument the old thing. Its skeletons make much better screenshots.

SigNoz docs: signoz.io/docs · OpenTelemetry Java agent: opentelemetry.io/docs/zero-code/java/agent

I Built an AI That Never Forgets

Jay Bamroliya — Sun, 05 Jul 2026 12:16:35 +0000

I Built an AI That Never Forgets — for $0 (Cognee Hackathon)

By Team MindVault — Jay Bamroliya & Kaushal Karkar

Every AI assistant has the same embarrassing problem.

You spend 20 minutes explaining your project. You close the tab. You come back tomorrow — and it has no idea who you are.

Your AI has amnesia. Every. Single. Time.

For the WeMakeDevs × Cognee Hackathon, we built MindVault to fix that — a personal "living memory" that builds a knowledge graph of your life as you talk to it. And we made it run on a completely free stack. Here's exactly how, including everything that broke along the way.

The Problem with Stateless AI

When you call an LLM, every request starts from zero. No memory of your last session, your preferences, your decisions, or your name.

The usual workarounds all fall short:

System prompts — token-limited, manually managed
Vector databases — semantic similarity only, no relational context
RAG pipelines — complex to build, no graph awareness

None of these give you real persistent memory.

Enter Cognee

Cognee is an open-source memory layer for AI agents. It turns text into a hybrid graph-vector knowledge store — two retrieval systems working together:

Vector search — "find things semantically similar to this query"
Graph traversal — "follow relationships between concepts"

That's the difference between a filing cabinet and an actual brain.

Cognee 1.2's memory API is beautifully simple — four verbs that cover the whole memory lifecycle:

import asyncio
import cognee

async def main():
    await cognee.remember("Jay is a developer from India building MindVault.")
    results = await cognee.recall("Who is building MindVault?")
    await cognee.improve()                 # enrich graph connections
    await cognee.forget(everything=True)   # GDPR-ready erasure

asyncio.run(main())  # in a script, the awaits need an event loop

What We Built: MindVault

A chat interface where every message becomes structured memory:

Operation	What happens
💾 Remember	Text → embedded + mined into knowledge-graph entities
🔍 Recall	Question → hybrid graph+vector search → AI answer from YOUR memories
✨ Improve	Re-runs enrichment, strengthening graph connections
🗑️ Forget	Full erasure — complete data lifecycle

Plus the parts we're proud of:

A live force-directed knowledge graph rendered on Canvas — zero libraries, custom physics (repulsion, springs, gravity). You literally watch your memory grow as you type.
Voice input via the Web Speech API — speak your memories.
A live LOCAL ↔ CLOUD toggle — one click switches between open-source Cognee running on your machine and Cognee Cloud. No restart. Same codebase, memory_engine.py abstracts both backends behind identical async functions.
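For the curious, one step of a force-directed layout with those three forces can be sketched as follows. MindVault's version runs on Canvas in the browser; this Python version is an independent illustration, and every constant in it is an arbitrary assumption.

```python
import math


def layout_step(pos, edges, repulsion=1000.0, spring=0.05,
                rest_len=50.0, gravity=0.01, dt=1.0):
    """Advance node positions by one step of repulsion, springs and gravity."""
    nodes = list(pos)
    forces = {n: [0.0, 0.0] for n in nodes}

    # Repulsion: every pair of nodes pushes apart with inverse-square strength.
    for i, a in enumerate(nodes):
        for b in nodes[i + 1:]:
            dx, dy = pos[a][0] - pos[b][0], pos[a][1] - pos[b][1]
            dist = math.hypot(dx, dy) or 1e-6
            f = repulsion / (dist * dist)
            fx, fy = f * dx / dist, f * dy / dist
            forces[a][0] += fx
            forces[a][1] += fy
            forces[b][0] -= fx
            forces[b][1] -= fy

    # Springs: connected nodes are pulled toward the rest length.
    for a, b in edges:
        dx, dy = pos[b][0] - pos[a][0], pos[b][1] - pos[a][1]
        dist = math.hypot(dx, dy) or 1e-6
        f = spring * (dist - rest_len)
        fx, fy = f * dx / dist, f * dy / dist
        forces[a][0] += fx
        forces[a][1] += fy
        forces[b][0] -= fx
        forces[b][1] -= fy

    # Gravity: a weak pull toward the origin keeps the graph on screen.
    for n in nodes:
        forces[n][0] -= gravity * pos[n][0]
        forces[n][1] -= gravity * pos[n][1]

    return {n: (pos[n][0] + dt * forces[n][0], pos[n][1] + dt * forces[n][1])
            for n in nodes}
```

Calling `layout_step` once per animation frame and redrawing gives the "graph settling into place" effect.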

Browser (chat · voice · live graph · toggle)
        │
        ▼
FastAPI backend ── /remember /recall /improve /forget
        │
        ▼
memory_engine.py ── one interface, two backends
   ├── LOCAL:  open-source Cognee + Groq + fastembed
   └── CLOUD:  Cognee Cloud REST API
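The "one interface, two backends" idea in the diagram can be sketched like this. It is a guess at the shape, not the contents of memory_engine.py: the class names are hypothetical, and ListBackend is a toy that stands in for both the local Cognee backend and the Cloud REST client.

```python
import asyncio
from typing import Protocol


class MemoryBackend(Protocol):
    async def remember(self, text: str) -> None: ...
    async def recall(self, query: str) -> list[str]: ...


class ListBackend:
    """Toy stand-in for a real backend: stores text, recalls by substring."""

    def __init__(self) -> None:
        self._items: list[str] = []

    async def remember(self, text: str) -> None:
        self._items.append(text)

    async def recall(self, query: str) -> list[str]:
        return [t for t in self._items if query.lower() in t.lower()]


class MemoryEngine:
    """Routes the same async calls to whichever backend is active."""

    def __init__(self, backends: dict[str, MemoryBackend], mode: str) -> None:
        self._backends = backends
        self.mode = mode

    def switch(self, mode: str) -> None:
        # Switching is just a pointer change, so no restart is needed.
        if mode not in self._backends:
            raise ValueError(f"unknown mode: {mode}")
        self.mode = mode

    async def remember(self, text: str) -> None:
        await self._backends[self.mode].remember(text)

    async def recall(self, query: str) -> list[str]:
        return await self._backends[self.mode].recall(query)
```

Because callers only ever talk to MemoryEngine, the API routes do not need to know which backend is live.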

The Real Story: Making It Run for $0

This was the hardest and most educational part. We had no budget for APIs. Here's the free stack and every wall we hit:

Wall 1: LLM costs. Groq's free tier gives you llama-3.3-70b-versatile at 6,000 tokens/minute. Sounds fine — until you learn Cognee's cognify pipeline makes multiple concurrent LLM calls. Instant 429 rate-limit errors.

Fix: Cognee ships a built-in rate limiter (backed by aiolimiter). Three env vars:

LLM_RATE_LIMIT_ENABLED=true
LLM_RATE_LIMIT_REQUESTS=1
LLM_RATE_LIMIT_INTERVAL=15

Calls queue and space out automatically. remember() takes ~90 seconds on the free tier — a fair trade for $0.
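To see why concurrent calls end up queued, here is a minimal asyncio sketch of request spacing. It is not Cognee's limiter and not aiolimiter's leaky bucket, just a simpler even-spacing scheme with hypothetical names that has the same visible effect: callers wait their turn instead of hitting a 429.

```python
import asyncio
import time


class SpacedLimiter:
    """Allow `requests` calls per `interval_s` by spacing them evenly."""

    def __init__(self, requests: int, interval_s: float) -> None:
        self._gap = interval_s / requests
        self._next_slot = 0.0

    async def wait(self) -> None:
        now = time.monotonic()
        slot = max(now, self._next_slot)      # earliest free slot
        self._next_slot = slot + self._gap    # reserve it before yielding
        await asyncio.sleep(slot - now)


async def call_all(limiter: SpacedLimiter, n: int) -> list[float]:
    """Fire n concurrent 'LLM calls' and return when each was let through."""

    async def one() -> float:
        await limiter.wait()
        return time.monotonic()

    return await asyncio.gather(*(one() for _ in range(n)))
```

With `SpacedLimiter(1, 15)`, mirroring the env vars above, three concurrent calls would be released roughly 0, 15 and 30 seconds after the first.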

Wall 2: Embedding costs. Cognee defaults to OpenAI embeddings — which means an OpenAI key and a bill.

Fix: fastembed runs BAAI/bge-small-en-v1.5 locally. No API key, no network calls:

EMBEDDING_PROVIDER=fastembed
EMBEDDING_MODEL=BAAI/bge-small-en-v1.5

Wall 3: Vector dimension mismatch. Our LanceDB store had been created with OpenAI's 3072-dim vectors; fastembed produces 384-dim. Schema conflict, cryptic errors.

Fix: wipe .cognee_system/databases and let it rebuild with the right schema. Lesson: embedding dimensions are part of your storage schema — changing providers means migrating.
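A cheap guard against this class of error is to check vector length against the store's dimension before writing. The sketch below is generic Python with made-up names and does not use the LanceDB or Cognee APIs.

```python
class DimensionMismatch(ValueError):
    """Raised when a vector does not match the store's schema."""


def check_dimensions(stored_dim: int, vector: list[float]) -> None:
    """Fail early, with a readable message, instead of deep inside the store."""
    if len(vector) != stored_dim:
        raise DimensionMismatch(
            f"store expects {stored_dim}-dim vectors, got {len(vector)}-dim; "
            "the embedding provider probably changed, so the store needs "
            "to be rebuilt or migrated"
        )


# A 384-dim vector is fine for a 384-dim store...
check_dimensions(384, [0.0] * 384)
# ...but would be rejected by a store created with 3072-dim vectors.
```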

Wall 4 (Cloud mode): the silent no-op. Cognee Cloud's /api/v1/add accepts multipart file uploads, not JSON. Our JSON POSTs returned plausible status codes while storing nothing. Recall answers were pure LLM hallucination — confidently wrong, cached per-question.

Fix: read the OpenAPI spec (/openapi.json), switch to multipart:

files={"data": ("memory.txt", text.encode(), "text/plain")},
data={"datasetName": dataset},

Debugging lesson: when search says "no data found" but add says "success," trust the negative signal — verify what's actually stored (GET /api/v1/datasets/{id}/data) instead of trusting status codes.
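That lesson generalizes to a write-then-read-back check. Here is a small sketch with in-memory stand-ins; the store classes and function names are hypothetical and no real Cognee Cloud endpoint is called.

```python
from typing import Callable


def add_and_verify(
    add: Callable[[str], int],
    list_stored: Callable[[], list[str]],
    text: str,
) -> bool:
    """Write `text`, then confirm it by reading the store back."""
    status = add(text)
    if not 200 <= status < 300:
        return False
    # A success status is not proof; only the read-back is.
    return text in list_stored()


class SilentNoOpStore:
    """Accepts every write with a 200 but stores nothing."""

    def add(self, text: str) -> int:
        return 200

    def list_stored(self) -> list[str]:
        return []


class WorkingStore:
    def __init__(self) -> None:
        self._items: list[str] = []

    def add(self, text: str) -> int:
        self._items.append(text)
        return 200

    def list_stored(self) -> list[str]:
        return list(self._items)
```

Run against the no-op store, the check fails immediately, which is the failure that status codes alone would have hidden.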

What Surprised Us

Graph traversal is genuinely different from vector search. We stored "Jay is building MindVault" and "MindVault is powered by Cognee AI" as separate memories, then asked "What is Jay building?" — Cognee connected the dots through the graph, not by keyword overlap.
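A toy version of "connecting the dots" looks like this. It says nothing about how Cognee stores or queries its graph; the triples are hand-written from the two example memories, and the traversal is a plain breadth-first search.

```python
from collections import deque


def build_graph(triples):
    """Index (subject, relation, object) triples by subject."""
    graph = {}
    for subject, relation, obj in triples:
        graph.setdefault(subject, []).append((relation, obj))
    return graph


def find_path(graph, start, goal):
    """Breadth-first search returning the chain of facts from start to goal."""
    queue = deque([(start, [])])
    seen = {start}
    while queue:
        node, path = queue.popleft()
        if node == goal:
            return path
        for relation, nxt in graph.get(node, []):
            if nxt not in seen:
                seen.add(nxt)
                queue.append((nxt, path + [(node, relation, nxt)]))
    return None  # no chain of facts links the two entities


# Two separate memories, stored as hand-written triples for illustration.
GRAPH = build_graph([
    ("Jay", "is building", "MindVault"),
    ("MindVault", "is powered by", "Cognee AI"),
])
```

Asking for a path from "Jay" to "Cognee AI" returns both facts in order, even though neither memory mentions the two together.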

improve() is underrated. Most people stop at add-and-search. Re-running enrichment after accumulating memories visibly strengthens the graph — new edges appear between old nodes.

Try It

git clone https://github.com/jaybamroliya/mindvault
cd mindvault
pip install -r requirements.txt
cp .env.example .env   # add a free Groq key from console.groq.com
python -m uvicorn main:app --port 8000

Total cost: $0. No credit card anywhere in the stack.

Final Thoughts

"Stateless AI" is one of the most annoying unsolved UX problems in AI. Cognee solves it properly — not with a prompt hack, but with a real hybrid memory architecture that you can self-host for free or scale on their cloud.

If you're building agents, give your AI a memory. It changes everything.

Built for the WeMakeDevs × Cognee Hackathon by Team MindVault — Jay Bamroliya & Kaushal Karkar. Source: github.com/jaybamroliya/mindvault