Avinash Zala

Posted on Jun 22 • Originally published at github.com

From Stack Trace to Suggested Fix in 4 Seconds: Building a Self-Healing .NET API Gateway.

#dotnet #csharp #ai #softwareengineering

Last Tuesday my API gateway caught a NullReferenceException, streamed it to a dashboard in real time, and pushed a draft code fix to the on-call engineer's browser tab before I had finished reading the error myself. That sentence used to be vendor marketing. Now it's just my Program.cs.

This is the architecture post-mortem. I built it on weekends. It runs in Docker. It cost me exactly $0 in LLM credits during development because Groq's free tier is generous and Ollama works as a swap-in. The repo is here — issues and PRs welcome.

The problem most .NET teams have

Production errors are caught, logged to a file, and forgotten. Engineers find out from a Slack ping twenty minutes later, if at all. By the time someone looks, the original request context is gone, the user's session has expired, and the stack trace is buried four layers deep in System.* calls.

"Self-healing" is a word vendors use to mean "auto-restart the pod." I wanted something better. The actual ask:

When an exception is thrown in service A, give the engineer (a) a clear root cause, (b) a suggested fix, and (c) a draft code patch — in under 30 seconds.

Not a magic black box. Not an auto-applied patch. Just: catch the error, give the model the right context, push the analysis to a human in real time, and let the human close the loop.

The architecture

One .NET solution, four projects, four NuGet packages, no new infrastructure beyond what you probably already have.

[ HTTP request ]
       |
       v
+-------------------+        enqueue         +---------------------+
| SmartLogAnalyzer. | ---------------------> |  Hangfire (Redis)   |
|      Api          |                        +----------+----------+
|  (ErrorHandling   |                                   |
|   Middleware)     |                                   v
+-------------------+                        +---------------------+
                                                | SmartLogAnalyzer.   |
                                                |     Worker          |
                                                | (ErrorProcessingWorker)
                                                +-----+-------+-------+
                                                      |       |
                                          AI call     |       |  persist
                                                      v       v
                                            +-----------+   +-----------+
                                            | Semantic  |   | MSSQL     |
                                            | Kernel +  |   | (ErrorLog |
                                            | Groq LLM  |   |  table)   |
                                            +-----+-----+   +-----------+
                                                  |
                                                  v
                                        +---------------------+
                                        |  SignalR Hub        |
                                        |  (ErrorHub)         |
                                        +----------+----------+
                                                   |
                                              broadcast
                                                   v
                                        +---------------------+
                                        |  React Dashboard    |
                                        |  (live update)      |
                                        +---------------------+

The crucial detail is where the AI call happens. It does not happen in the request thread. The middleware returns the 500 in milliseconds; the AI work happens inside a Hangfire background job, in a different process, possibly on a different machine. Two different response times, one user.

Part 1 — the capture

The middleware is fifty lines including the using statements. Here is the whole thing.

using Hangfire;
using SmartLogAnalyzer.Core.Models;
using SmartLogAnalyzer.Core.Workers;
using System.Text.Json;

namespace SmartLogAnalyzer.Api.Middleware
{
    public class ErrorHandlingMiddleware
    {
        private readonly RequestDelegate _next;
        private readonly IBackgroundJobClient _backgroundJobClient;

        public ErrorHandlingMiddleware(
            RequestDelegate next,
            IBackgroundJobClient backgroundJobClient)
        {
            _next = next;
            _backgroundJobClient = backgroundJobClient;
        }

        public async Task InvokeAsync(HttpContext context)
        {
            try
            {
                await _next(context);
            }
            catch (Exception ex)
            {
                await HandleExceptionAsync(context, ex);
            }
        }

        private async Task HandleExceptionAsync(HttpContext context, Exception ex)
        {
            context.Response.ContentType = "application/json";
            context.Response.StatusCode = 500;

            var errorLog = new ErrorLog
            {
                ErrorMessage = ex.Message,
                StackTrace   = ex.StackTrace ?? string.Empty,
                RoutePath    = context.Request.Path
            };

            // The line that does the work. Enqueue only writes the job to
            // Hangfire storage; the response is sent before the AI is ever called.
            _backgroundJobClient.Enqueue<ErrorProcessingWorker>(
                worker => worker.ProcessErrorAsync(errorLog));

            var result = JsonSerializer.Serialize(
                new { error = "An internal error has been logged and is being analyzed." });
            await context.Response.WriteAsync(result);
        }
    }
}

Two things to notice.

First, the Enqueue call returns as soon as the job has been written to storage. Hangfire's IBackgroundJobClient serializes the method call and its arguments into Hangfire storage (Redis in this case), and a worker picks the job up later. We don't await an AI call here. The user gets their 500 in single-digit milliseconds.

Second, the response body — "An internal error has been logged and is being analyzed." — is itself a feature. The user (or the calling frontend) now knows the error is being handled. It is not a lie, it is a contract.

Part 2 — the worker

The ErrorProcessingWorker is a plain C# class. Hangfire instantiates it from the DI container, calls ProcessErrorAsync, and (if it throws) retries up to three times with increasing delays between attempts.

[AutomaticRetry(Attempts = 3)]
public async Task ProcessErrorAsync(ErrorLog errorLog)
{
    // 1. Hash the stack trace to dedupe identical errors
    var stackTraceHash = ComputeHash(errorLog.StackTrace);

    // 2. If we've seen this stack trace in the last 24h, just bump the count
    if (await _redisCacheService.KeyExistsAsync(stackTraceHash))
    {
        var existingLog = await _errorLogRepository.AddOrUpdateErrorLogAsync(errorLog);
        await _hubContext.Clients.All.SendAsync(
            "ReceiveErrorUpdate",
            JsonSerializer.Serialize(existingLog));
        return;
    }

    // 3. New error — claim the hash so duplicates skip the AI call
    await _redisCacheService.SetKeyAsync(stackTraceHash, "1", TimeSpan.FromHours(24));

    // 4. Ask the LLM
    var analyzedLog = await _aiAnalysisService.AnalyzeErrorAsync(errorLog);

    // 5. Persist + push to the dashboard
    var savedLog = await _errorLogRepository.AddOrUpdateErrorLogAsync(analyzedLog);
    await _hubContext.Clients.All.SendAsync(
        "ReceiveErrorUpdate",
        JsonSerializer.Serialize(savedLog));
}

private static string ComputeHash(string input)
{
    // MD5 is used here only as a cheap dedupe fingerprint, not for security.
    var hashBytes = MD5.HashData(Encoding.UTF8.GetBytes(input));
    return Convert.ToHexString(hashBytes).ToLowerInvariant();
}

The Redis dedupe step is the difference between a $0 demo and a $200 Groq bill. The first occurrence of a NullReferenceException at /api/users/{id} costs one LLM call. The next 10,000 occurrences cost nothing. The 24-hour TTL is a knob you will want to tune.
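To make the cost argument concrete, here is a minimal sketch of the claim-with-TTL idea using an in-memory dictionary as a stand-in for Redis. The names (claims, TryClaim, llmCalls) and the sample stack trace are illustrative assumptions, not code from the repo; the real worker uses _redisCacheService as shown above.

```csharp
using System;
using System.Collections.Generic;
using System.Security.Cryptography;
using System.Text;

// In-memory stand-in for the Redis key-with-TTL used by the worker (assumption).
// Maps a stack-trace hash to the time at which its claim expires.
var claims = new Dictionary<string, DateTimeOffset>();
var ttl = TimeSpan.FromHours(24);
var llmCalls = 0;

// Same idea as the worker's ComputeHash: a cheap fingerprint, not a security hash.
string ComputeHash(string input) =>
    Convert.ToHexString(MD5.HashData(Encoding.UTF8.GetBytes(input))).ToLowerInvariant();

// Returns true when this is the first sighting of the stack trace inside the TTL
// window (the caller should pay for an LLM call), false for duplicates.
bool TryClaim(string stackTrace, DateTimeOffset now)
{
    var key = ComputeHash(stackTrace);
    if (claims.TryGetValue(key, out var expiresAt) && expiresAt > now)
        return false;
    claims[key] = now + ttl;
    return true;
}

var trace = "at UsersController.Get(Int32 id) in UsersController.cs:line 42";
var start = new DateTimeOffset(2025, 1, 1, 0, 0, 0, TimeSpan.Zero);

// 10,000 identical errors, one second apart: only the first reaches the model.
for (var i = 0; i < 10_000; i++)
    if (TryClaim(trace, start.AddSeconds(i))) llmCalls++;

// After the 24-hour window has passed, the same trace is analyzed again.
if (TryClaim(trace, start.AddHours(25))) llmCalls++;

Console.WriteLine($"LLM calls: {llmCalls}"); // LLM calls: 2
```

A shorter TTL means more repeat analyses of the same trace; a longer one means a fix that changes the failure mode without changing the stack trace stays hidden for longer.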

Part 3 — the AI call

I started this project with a fancy design: a Semantic Kernel KernelPlugin that would have the AI fetch the offending source file from GitHub, look at the test that covers it, and then propose a diff grounded in real code. It was clever. It was also over-engineered for v1.

The version that shipped is fifteen lines.

var prompt = $@"
You are a Senior .NET Engineer. Analyze the following error
and provide a JSON response with exactly three keys:
RootCause, FixSuggestion, and CodePatch.

Error Message: {errorLog.ErrorMessage}
Stack Trace: {errorLog.StackTrace}

JSON Response:
";

var result = await _kernel.InvokePromptAsync(prompt);
var responseText = result.ToString();

Then I parse the response into the three fields with a JsonDocument, and on failure, fall back to a hand-rolled string parser. We will get back to that parser in the next section — it is the most important code in the whole project and also the part I am least proud of.

Why Semantic Kernel if the call is this simple? Two reasons.

Provider swap. The Groq wire-up is one line: AddOpenAIChatCompletion(modelId: "llama-3.3-70b-versatile", apiKey: [your-groq-key], endpoint: new Uri("https://api.groq.com/openai/v1")) — where the key is loaded from .env at startup. Swapping to OpenAI, Azure OpenAI, or local Ollama is one constructor call. If I had called Groq directly via HttpClient, I would be rewriting the call site for every provider I tried.
Built-in retries and timeouts. Kernel.InvokePromptAsync handles 429s and 5xxs with a default policy. That is one less thing to get wrong.

You can absolutely build this with raw HttpClient and hand-written POSTs to the /chat/completions endpoint. You will write the retry logic yourself. I have done that. I do not recommend it.

Part 4 — what I got wrong

This is the part you came for. Five things that bit me, in order of how much they cost.

4.1 The JSON parser that almost shipped

First version of the response parser used JsonDocument.Parse and threw on any malformed output. About 15% of Groq responses came back wrapped in ```json ... ``` markdown fences, despite the prompt saying "JSON Response:" right at the end. I added a stripper:

var cleaned = responseText.Trim();
if (cleaned.StartsWith("```json")) cleaned = cleaned.Substring(7);
if (cleaned.StartsWith("```"))     cleaned = cleaned.Substring(3);
if (cleaned.EndsWith("```"))       cleaned = cleaned.Substring(0, cleaned.Length - 3);
cleaned = cleaned.Trim();

That fixed 90% of it. The other 10% needed a hand-rolled regex parser that walks the string looking for "RootCause": "..." and respects backslash escapes. Do not be too proud to write a regex parser. When the upstream is an LLM and the contract is "please return JSON," the LLM is sometimes wrong and you need a fallback.

The pattern in AiAnalysisService.cs is the right one: try the strict parser, catch the exception, try the lenient one, and only then give up and store the raw text with a "Failed to parse" flag. The dashboard renders the raw text anyway, so the engineer still gets value.
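The strict, then lenient, then raw-text order can be sketched roughly as follows. This is not the code in AiAnalysisService.cs: StripFences, ExtractField, and ParseAnalysis are hypothetical names, the regex is one possible way to honor backslash escapes, and putting the raw text in the RootCause slot on failure is an assumption made for the sketch.

```csharp
using System;
using System.Collections.Generic;
using System.Text.Json;
using System.Text.RegularExpressions;

// Removes a leading ```json or ``` fence and a trailing ``` fence, if present.
string StripFences(string text)
{
    var cleaned = text.Trim();
    if (cleaned.StartsWith("```json", StringComparison.OrdinalIgnoreCase)) cleaned = cleaned[7..];
    else if (cleaned.StartsWith("```", StringComparison.Ordinal)) cleaned = cleaned[3..];
    if (cleaned.EndsWith("```", StringComparison.Ordinal)) cleaned = cleaned[..^3];
    return cleaned.Trim();
}

// Lenient fallback: finds "key": "value" anywhere in the text, honoring backslash escapes.
string? ExtractField(string text, string key)
{
    var pattern = "\"" + Regex.Escape(key) + "\"" + @"\s*:\s*""((?:\\.|[^""\\])*)""";
    var match = Regex.Match(text, pattern, RegexOptions.Singleline);
    if (!match.Success) return null;

    var raw = match.Groups[1].Value;
    try
    {
        // Re-quote the capture so System.Text.Json decodes \n, \", \uXXXX and friends.
        return JsonSerializer.Deserialize<string>("\"" + raw + "\"");
    }
    catch (JsonException)
    {
        return raw; // not valid JSON string content (e.g. raw newlines): keep it as-is
    }
}

(string RootCause, string FixSuggestion, string CodePatch, bool Parsed) ParseAnalysis(string responseText)
{
    var cleaned = StripFences(responseText);

    // 1. Strict: the whole response is valid JSON with the three expected string keys.
    try
    {
        using var doc = JsonDocument.Parse(cleaned);
        var root = doc.RootElement;
        return (root.GetProperty("RootCause").GetString() ?? "",
                root.GetProperty("FixSuggestion").GetString() ?? "",
                root.GetProperty("CodePatch").GetString() ?? "",
                true);
    }
    catch (Exception ex) when (ex is JsonException or KeyNotFoundException or InvalidOperationException)
    {
        // Malformed JSON, a missing key, or a non-string value: try the lenient path.
    }

    // 2. Lenient: pull each field out on its own, tolerating surrounding prose.
    var cause = ExtractField(cleaned, "RootCause");
    var fix = ExtractField(cleaned, "FixSuggestion");
    var patch = ExtractField(cleaned, "CodePatch");
    if (cause is not null || fix is not null || patch is not null)
        return (cause ?? "", fix ?? "", patch ?? "", true);

    // 3. Give up: keep the raw text (here in the RootCause slot) and flag the failure.
    return (responseText, "", "", false);
}

var fenced = "```json\n{\"RootCause\":\"user was null\",\"FixSuggestion\":\"add a guard\",\"CodePatch\":\"if (user is null) return NotFound();\"}\n```";
var chatty = "Sure! Here is the analysis: \"RootCause\": \"user was \\\"null\\\"\", \"FixSuggestion\": \"add a guard\"";
var refusal = "I cannot help with that.";

foreach (var sample in new[] { fenced, chatty, refusal })
{
    var (rootCause, _, _, parsed) = ParseAnalysis(sample);
    Console.WriteLine($"parsed={parsed} rootCause={rootCause}");
}
```

The three samples exercise one path each: the fenced response goes through the strict parser, the chatty one through the regex fallback, and the refusal ends up stored as raw text with Parsed set to false.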

4.2 Sensitive data leakage

A stack trace can contain connection strings, JWTs, or PII. The first version sent the raw exception text to Groq. After a code review from a friend who is more paranoid than I am, I added a redaction step before the AI call.


private static readonly Regex BearerToken  = new(@"Bearer\s+[A-Za-z0-9._\-]+", RegexOptions.Compiled);
private static readonly Regex PasswordKV   = new(@"(password|pwd|secret)\s*=\s*\S+",  RegexOptions.Compiled | RegexOptions.IgnoreCase);
private static readonly Regex CreditCard   = new(@"\b\d{16}\b",                       RegexOptions.Compiled);
private static readonly Regex EmailAddr    = new(@"\b[\w._%+-]+@[\w.-]+\.[A-Za-z]{2,}\b", RegexOptions.Compiled);

public static string Redact(string input)
{
    input = BearerToken.Replace(input, "Bearer [REDACTED]");
    input = PasswordKV .Replace(input, "$1=[REDACTED]");
    input = CreditCard .Replace(input, "[REDACTED-CC]");
    input = EmailAddr  .Replace(input, "[REDACTED-EMAIL]");
    return input;
}

Run this before the prompt is built, every time. Always assume the AI provider sees your data. Always. The day you forget is the day a customer's JWT ends up in someone else's training set, or at minimum in someone else's logs.
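One gap worth checking in your own setup: the patterns above catch a token only when it follows the word "Bearer". A JWT that appears bare in an exception message would pass through. Below is a small sketch of an extra pattern for that case. It is an assumption layered on top of the post, not code from the repo: it relies on JWTs being three dot-separated base64url segments whose JSON header (for example {"alg":...}) encodes to something starting with "eyJ", and the name RedactJwt and the sample message are made up for illustration.

```csharp
using System;
using System.Text.RegularExpressions;

// Assumption: a JWT is three base64url segments separated by dots, and the
// header segment starts with "eyJ" because it encodes a JSON object like {"alg":...}.
var bareJwt = new Regex(@"eyJ[A-Za-z0-9_-]+\.[A-Za-z0-9_-]+\.[A-Za-z0-9_-]+", RegexOptions.Compiled);

// Hypothetical helper; it would run alongside the other Redact replacements.
string RedactJwt(string input) => bareJwt.Replace(input, "[REDACTED-JWT]");

var message = "Token validation failed for eyJhbGciOiJIUzI1NiJ9.eyJzdWIiOiIxMjMifQ.c2lnbmF0dXJl in AuthHandler";
Console.WriteLine(RedactJwt(message));
// Token validation failed for [REDACTED-JWT] in AuthHandler
```

Like the other patterns, this is best-effort. A regex list reduces what leaks; it does not prove that nothing sensitive remains.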

4.3 The "self-healing" promise is misleading

This system suggests fixes. It does not apply them. I almost shipped an "auto-apply patch on green confidence" toggle. Then I imagined a 3 AM page where a hallucinated regex wipes a production database because the model misread a column name. The toggle is gone. Auto-merging AI patches into prod is a 2027 problem, not a 2026 one. Be honest about this in your README, your marketing, and your internal pitches. Engineers will trust you more.

4.4 Hangfire retries are silent (and cost money)

If the AI call times out, Hangfire retries it. If the AI call consistently times out (bad prompt, big payload, network blip), Hangfire retries it three times. Each retry costs a Groq credit. [AutomaticRetry(Attempts = 3)] is already lower than Hangfire's default of 10 attempts, and it is still too generous for any external dependency that costs money.

Fix: lower the count, add delay, and add a circuit breaker. The attribute on the worker method now covers the first two:


[AutomaticRetry(Attempts = 2, DelaysInSeconds = new[] { 30, 120 })]
public async Task ProcessErrorAsync(ErrorLog errorLog) { ... }

Two retries, after 30s and 2m. That bounds the cost spiral when something is wrong. A truly broken state still costs up to three LLM calls per error (the first run plus two retries), but it can no longer keep retrying and drain a month's budget in an hour.
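The retry attribute handles count and delay; the circuit breaker is a separate piece. Here is a minimal sketch of one, assuming a threshold of three consecutive failures and a five-minute cool-down. Both numbers and all the names (CanCall, RecordFailure, RecordSuccess) are hypothetical, and the state lives in process memory, so with several worker processes it would need to move somewhere shared such as Redis.

```csharp
using System;

// Hypothetical settings: trip after 3 consecutive failures, stay open for 5 minutes.
const int failureThreshold = 3;
var openDuration = TimeSpan.FromMinutes(5);

var consecutiveFailures = 0;
DateTimeOffset? openUntil = null;

// True when calls to the LLM provider are currently allowed.
bool CanCall(DateTimeOffset now) => openUntil is null || now >= openUntil;

// A successful call closes the breaker and resets the failure count.
void RecordSuccess()
{
    consecutiveFailures = 0;
    openUntil = null;
}

// A failed call counts toward the threshold; at the threshold the breaker opens.
void RecordFailure(DateTimeOffset now)
{
    consecutiveFailures++;
    if (consecutiveFailures >= failureThreshold)
        openUntil = now + openDuration;
}

var now = new DateTimeOffset(2025, 1, 1, 3, 0, 0, TimeSpan.Zero);
var providerCalls = 0;

// Simulate 10 errors arriving one second apart while the provider is down.
for (var i = 0; i < 10; i++)
{
    var t = now.AddSeconds(i);
    if (!CanCall(t)) continue;  // breaker open: keep the error, skip the analysis
    providerCalls++;
    RecordFailure(t);           // the call timed out
}

Console.WriteLine($"Provider calls while down: {providerCalls}"); // Provider calls while down: 3
```

In the worker, the check would sit just before the AI call: if CanCall returns false, persist the error without analysis instead of throwing, so Hangfire has nothing to retry. After the cool-down, the next error acts as a probe: a success closes the breaker, and a failure opens it for another window.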

4.5 No correlation between dashboard event and the original request

The user got a 500 with no error ID. The dashboard showed a fix suggestion with no way to find the request that caused it. So when an engineer wanted to reproduce the error, they had to guess the URL, the headers, the auth state. Useless.

Fix: generate a CorrelationId once in the middleware, return it in the response header, and store it on the ErrorLog model. One UUID, two places. The dashboard now shows that ID next to each error, and the engineer can grep their logs for it.


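// in the middleware's HandleExceptionAsync, before the response body is written
var correlationId = Guid.NewGuid().ToString("N");
context.Response.Headers["X-Correlation-Id"] = correlationId;
errorLog.CorrelationId = correlationId; // travels with the job to the worker and the dashboard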

Part 5 — the live dashboard

The dashboard is a 280-line React app in SmartLogAnalyzer.Dashboard/smart-log-analyzer-dashboard/. The whole real-time piece is fifty lines of hooks.


useEffect(() => {
  const newConnection = new signalR.HubConnectionBuilder()
    .withUrl(`${API_URL}/errorHub`)
    .withAutomaticReconnect()
    .build();
  setConnection(newConnection);
}, []);

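useEffect(() => {
  if (!connection) return;

  // Register the handler before start() so no early messages are missed.
  connection.on('ReceiveErrorUpdate', (errorJson: string) => {
    const error: ErrorLog = JSON.parse(errorJson);
    setErrors(prev => {
      const index = prev.findIndex(e => e.id === error.id);
      if (index !== -1) {
        // Known error: replace it in place (new count or AI fields).
        const updated = [...prev];
        updated[index] = error;
        return updated;
      }
      // New error: newest first.
      return [error, ...prev];
    });
  });

  connection.start()
    .then(() => setConnected(true))
    .catch(err => console.error('SignalR connection failed', err));

  return () => {
    connection.off('ReceiveErrorUpdate');
    connection.stop();
  };
}, [connection]);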

The wire is JSON-over-SignalR. The server-side hub does Clients.All.SendAsync("ReceiveErrorUpdate", jsonString) and every open browser tab updates. No polling. No refresh button. You literally watch errors arrive, get analyzed, and become fixable, in real time.

A few small UX details I am proud of:

Severity badges. Errors with occurrenceCount >= 10 get a red 🔴 Critical chip. Under 2 is green. Engineers learn to scan for red.
"Analyzing with AI..." spinner. When a new error arrives, its card shows a spinner for the few seconds until the AI response comes back. The state machine is pending → analyzing → analyzed, driven by whether aiRootCause is set.
Expand on click, stack trace in a <details>. Most engineers want the AI's take first. The stack trace is one click away.

When you should — and shouldn't — build this

Build it if:

You have more than three services throwing exceptions, and your on-call rotation is a human who hates pages at 3 AM.
You are already paying for an LLM API, or you have a GPU sitting around running Ollama.
Your mean time to acknowledge (MTTA) on alerts is more than five minutes.

Don't build it if:

You have one service and a steady stream of bugs. Fix the bugs.
Your "errors" are mostly business-logic edge cases — a missing null check that is actually a missing requirement. The AI cannot help with those.
You don't have CI/CD yet. Self-healing on top of an unsafe deploy pipeline is just a faster way to break production.

The general rule: a 200-line NuGet package won't fix a 2000-line architecture problem. This system is a force multiplier on a healthy codebase. On an unhealthy one, it is a faster way to find out how unhealthy you are.

The repo and how to run it

The full source is at github.com/ZalaAvinash/Smart-Log-Analyzer-Self-Healing-API-Gateway. To run it locally:


git clone https://github.com/ZalaAvinash/Smart-Log-Analyzer-Self-Healing-API-Gateway.git
cd Smart-Log-Analyzer-Self-Healing-API-Gateway
cp .env.example .env
# Edit .env — set GROQ_API_KEY (free at groq.com), DB_SERVER, REDIS_HOST
start-all.bat    # Windows; the repo has a Makefile-equivalent for *nix

Three windows open: the API, the Worker, the Dashboard. Open http://localhost:3000, click the "Trigger Test Error" link, and watch a NullReferenceException arrive, get analyzed, and become a clickable fix suggestion, all in under 4 seconds.

Closing

The future of "self-healing" is not magic. It is a small, honest pipeline: catch the error, give the model the right context, push the analysis to a human in real time, let the human close the loop. The model writes the boilerplate diff. The engineer writes the actual fix. That is a real workflow, and it works today, on the same .NET you are already running, with one extra NuGet package and one extra process.

If you build something similar and run into the same five problems, I'd love to hear about it. The repo is open for issues, PRs, and rants about how your retry policy bankrupted your LLM budget. We've all been there.

Built with: .NET 10 · ASP.NET Core · Hangfire · Semantic Kernel · Groq (llama-3.3-70b-versatile) · SignalR · MSSQL · Redis · React

Repo: ZalaAvinash/Smart-Log-Analyzer-Self-Healing-API-Gateway

About the author: Avinash Zala is a senior .NET engineer in Surat, India, with 7+ years building enterprise web apps, APIs, and ERP systems. He is currently adding AI/LLM capabilities to his stack and writing about what he learns. GitHub · LinkedIn

Top comments (2)

STrRedWolf • Jun 23

You're missing something here: The request itself. The HTTP payload. What was fed in. If it was a POST request, you just tossed a clue out. You're missing that context.

Still, a good way to make a self-diagnosing framework. Now we just need a local LLM to do the diagnostics.

Avinash Zala • Jul 9

yes, you are right thanks.