Ali Suleyman TOPUZ

Originally published at topuzas.Medium

Semantic Kernel for Enterprise AI: Architecting Production-Grade LLM Integration in .NET — Implementation & Observability — Part 2

This is Part 2 of the series. Part 1 covered the foundational architecture of Semantic Kernel — plugins, planners, memory, and filters — along with the FinOps cost model and SRE failure taxonomy. In this part, we move from architecture to implementation: building the async-first parallel orchestration engine, the Redis-backed semantic cache, and the complete production filter pipeline with token metering.

I. Recap and What This Part Covers

Part 1 established that the gap between LLM demo and production system is architectural. Semantic Kernel closes that gap through four composable primitives — plugins, planners, memory, and filters — wrapped in a resilience and observability model that matches enterprise operational standards.

Part 2 builds on that foundation with concrete, production-ready .NET 9.0 implementations of the three highest-leverage components an architect must get right:

Async-First Parallel Orchestration  — collapsing multi-step LLM workflows from sequential to concurrent execution, with proper cancellation, error isolation, and result aggregation patterns.

Redis Semantic Cache  — a vector similarity-backed caching layer that achieves 30–60% token cost reduction in enterprise workloads, with TTL management, cache invalidation, and hit-rate instrumentation.

Complete Filter Pipeline  — the full middleware chain covering rate limiting, semantic caching, audit logging, output validation, and token metering, wired together as a coherent operational stack.

II. Async-First Parallel Orchestration

2.1 The Sequential Trap

The most common performance anti-pattern in Semantic Kernel implementations is inadvertent sequential LLM invocation. Engineers familiar with async/await in .NET often write code that looks asynchronous but executes serially:

// ❌ Anti-pattern: Awaiting each invocation sequentially
// Total latency = sum of all individual latencies
public async Task<DocumentAnalysisResult> AnalyzeDocumentAsync(string documentId)
{
    var summary = await _kernel.InvokeAsync("DocumentPlugin", "Summarize", args); // 2.1s
    var entities = await _kernel.InvokeAsync("DocumentPlugin", "ExtractEntities", args); // 1.8s
    var sentiment = await _kernel.InvokeAsync("DocumentPlugin", "AnalyzeSentiment", args); // 1.4s
    var keywords = await _kernel.InvokeAsync("DocumentPlugin", "ExtractKeywords", args); // 1.2s

    // Total: ~6.5s — user is waiting for sequential completion
    return BuildResult(summary, entities, sentiment, keywords);
}

Four independent LLM calls execute sequentially despite having no data dependencies on one another. That is up to a 4× latency penalty, and a FinOps problem as well: the request stays open for the full duration, limiting server throughput.

2.2 The Parallel Orchestration Pattern

The correct pattern treats independent LLM invocations as parallel tasks, combining Task.WhenAll with proper cancellation token propagation and isolated error handling per invocation:

// ✅ Parallel orchestration with error isolation
public class ParallelDocumentAnalysisOrchestrator
{
    private readonly Kernel _kernel;
    private readonly ILogger<ParallelDocumentAnalysisOrchestrator> _logger;
    private readonly ParallelOrchestrationOptions _options;
    public async Task<DocumentAnalysisResult> AnalyzeDocumentAsync(
        string documentId,
        CancellationToken ct = default)
    {
        var document = await LoadDocumentAsync(documentId, ct);
        var sharedArguments = new KernelArguments
        {
            ["document_content"] = document.Content,
            ["document_id"] = documentId
        };
        // Define parallel execution units with individual timeout budgets
        using var cts = CancellationTokenSource.CreateLinkedTokenSource(ct);
        cts.CancelAfter(_options.MaxOrchestrationDuration); // Hard timeout for entire operation
        var summaryTask = InvokeWithTimeoutAsync(
            "DocumentPlugin", "Summarize", sharedArguments,
            timeout: TimeSpan.FromSeconds(8), ct: cts.Token);
        var entityTask = InvokeWithTimeoutAsync(
            "DocumentPlugin", "ExtractEntities", sharedArguments,
            timeout: TimeSpan.FromSeconds(6), ct: cts.Token);
        var sentimentTask = InvokeWithTimeoutAsync(
            "DocumentPlugin", "AnalyzeSentiment", sharedArguments,
            timeout: TimeSpan.FromSeconds(5), ct: cts.Token);
        var keywordTask = InvokeWithTimeoutAsync(
            "DocumentPlugin", "ExtractKeywords", sharedArguments,
            timeout: TimeSpan.FromSeconds(5), ct: cts.Token);
        // Task.WhenAll preserves individual task exceptions; don't use Task.WaitAll
        var results = await Task.WhenAll(
            summaryTask, entityTask, sentimentTask, keywordTask);
        // Total latency = slowest individual invocation (~2.1s vs 6.5s sequential)
        return BuildResult(results[0], results[1], results[2], results[3]);
    }
    private async Task<OrchestrationResult> InvokeWithTimeoutAsync(
        string pluginName,
        string functionName,
        KernelArguments arguments,
        TimeSpan timeout,
        CancellationToken ct)
    {
        using var timeoutCts = CancellationTokenSource.CreateLinkedTokenSource(ct);
        timeoutCts.CancelAfter(timeout);
        try
        {
            var result = await _kernel.InvokeAsync(
                pluginName, functionName, arguments, timeoutCts.Token);
            return OrchestrationResult.Success(
                pluginName, functionName, result.GetValue<string>()!);
        }
        catch (OperationCanceledException) when (!ct.IsCancellationRequested)
        {
            // Individual invocation timed out; don't cancel sibling tasks
            _logger.LogWarning(
                "Function {Plugin}.{Function} exceeded timeout of {Timeout}ms",
                pluginName, functionName, timeout.TotalMilliseconds);
            return OrchestrationResult.TimedOut(pluginName, functionName);
        }
        catch (Exception ex)
        {
            // Isolate the failure; sibling tasks continue executing
            _logger.LogError(ex,
                "Function {Plugin}.{Function} failed with exception",
                pluginName, functionName);
            return OrchestrationResult.Failed(pluginName, functionName, ex);
        }
    }
}
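The OrchestrationResult type used above is not shown in the article; a minimal sketch of what it might look like (names and shape are assumptions, not Semantic Kernel types):

```csharp
// Hypothetical result envelope for a single orchestrated invocation.
// The factory methods mirror the success, timeout, and failure branches
// of InvokeWithTimeoutAsync above.
public sealed record OrchestrationResult(
    string PluginName,
    string FunctionName,
    OrchestrationStatus Status,
    string? Value = null,
    Exception? Error = null)
{
    public static OrchestrationResult Success(string plugin, string function, string value) =>
        new(plugin, function, OrchestrationStatus.Success, value);

    public static OrchestrationResult TimedOut(string plugin, string function) =>
        new(plugin, function, OrchestrationStatus.TimedOut);

    public static OrchestrationResult Failed(string plugin, string function, Exception ex) =>
        new(plugin, function, OrchestrationStatus.Failed, Error: ex);
}

public enum OrchestrationStatus { Success, TimedOut, Failed }
```

A record keeps the envelope immutable, which matters when four parallel tasks hand their results back to a single aggregation point.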

2.3 Dependency-Aware DAG Orchestration

Real-world workflows are rarely fully parallel. Some steps depend on the output of prior steps, creating a directed acyclic graph (DAG) of dependencies. The architectural pattern for this is staged parallel execution — group independent operations into waves, execute each wave in parallel, and feed outputs forward to dependent stages:

public class DagOrchestrator
{
    private readonly Kernel _kernel;
    /// <summary>
    /// Contract Review Workflow DAG:
    ///
    /// Stage 1 (parallel): ExtractParties, ExtractDates, ExtractObligations
    /// ↓
    /// Stage 2 (parallel, depends on Stage 1): 
    /// ValidateParties(parties), CheckDeadlines(dates), PrioritizeObligations(obligations)
    /// ↓
    /// Stage 3 (sequential, depends on Stage 2): GenerateExecutiveSummary(all Stage 2 outputs)
    /// </summary>
    public async Task<ContractReviewResult> ReviewContractAsync(
        string contractText,
        CancellationToken ct = default)
    {
        // ── Stage 1: Independent extraction (parallel) ──────────────────────────
        var stage1Args = new KernelArguments { ["contract_text"] = contractText };
        var (parties, dates, obligations) = await ExecuteStageAsync(
            ct,
            ("LegalPlugin", "ExtractParties", stage1Args),
            ("LegalPlugin", "ExtractDates", stage1Args),
            ("LegalPlugin", "ExtractObligations", stage1Args));
        // ── Stage 2: Validation (parallel, consumes Stage 1) ────────────────────
        var (validatedParties, deadlineAnalysis, prioritizedObligations) = 
            await ExecuteStageAsync(
                ct,
                ("LegalPlugin", "ValidateParties", 
                    BuildArgs(stage1Args, ("parties_json", parties))),
                ("LegalPlugin", "CheckDeadlines", 
                    BuildArgs(stage1Args, ("dates_json", dates))),
                ("LegalPlugin", "PrioritizeObligations", 
                    BuildArgs(stage1Args, ("obligations_json", obligations))));
        // ── Stage 3: Synthesis (sequential, consumes all prior stages) ───────────
        var summaryArgs = new KernelArguments
        {
            ["contract_text"] = contractText,
            ["validated_parties"] = validatedParties,
            ["deadline_analysis"] = deadlineAnalysis,
            ["prioritized_obligations"] = prioritizedObligations
        };
        var executiveSummary = await _kernel.InvokeAsync(
            "LegalPlugin", "GenerateExecutiveSummary", summaryArgs, ct);
        return new ContractReviewResult
        {
            Parties = DeserializeParties(validatedParties),
            DeadlineAnalysis = DeserializeDeadlines(deadlineAnalysis),
            Obligations = DeserializeObligations(prioritizedObligations),
            ExecutiveSummary = executiveSummary.GetValue<string>()!
        };
    }
    private async Task<(string, string, string)> ExecuteStageAsync(
        CancellationToken ct,
        params (string Plugin, string Function, KernelArguments Args)[] invocations)
    {
        // An async lambda propagates the original exception from a faulted
        // invocation; ContinueWith(..., OnlyOnRanToCompletion) would surface
        // it as a TaskCanceledException instead and hide the real error.
        var tasks = invocations.Select(async inv =>
            (await _kernel.InvokeAsync(inv.Plugin, inv.Function, inv.Args, ct))
                .GetValue<string>()!);
        var results = await Task.WhenAll(tasks);
        return (results[0], results[1], results[2]);
    }
}
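The BuildArgs helper referenced in Stage 2 is not defined above; one plausible sketch (an assumption, not part of the SDK) clones the base arguments and overlays the stage outputs:

```csharp
using Microsoft.SemanticKernel;

public static class StageArguments
{
    // Hypothetical helper: copy the base KernelArguments and overlay
    // stage outputs so sibling invocations never share mutable state.
    public static KernelArguments BuildArgs(
        KernelArguments baseArgs,
        params (string Key, string Value)[] overrides)
    {
        var merged = new KernelArguments(baseArgs); // copies existing entries
        foreach (var (key, value) in overrides)
            merged[key] = value;
        return merged;
    }
}
```

Copying rather than mutating the shared stage1Args matters: parallel siblings within a stage would otherwise race on the same dictionary.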

2.4 Streaming with Backpressure

For user-facing operations where progressive disclosure is preferable to waiting for full completion, GetStreamingChatMessageContentsAsync paired with server-sent events or SignalR delivers tokens to the UI as they arrive. The critical implementation detail is proper backpressure — don't buffer unboundedly if downstream consumers are slower than the LLM's generation rate:

public class StreamingOrchestrator
{
    private readonly Kernel _kernel;
    public async IAsyncEnumerable<string> StreamAnalysisAsync(
        string documentContent,
        [EnumeratorCancellation] CancellationToken ct = default)
    {
        var arguments = new KernelArguments
        {
            ["content"] = documentContent
        };
        var executionSettings = new OpenAIPromptExecutionSettings
        {
            MaxTokens = 1024,
            Temperature = 0.3,
            // Stream-specific: stop generation when we have enough for the UI
            StopSequences = ["[END_SUMMARY]"]
        };
        arguments.ExecutionSettings = new Dictionary<string, PromptExecutionSettings>
        {
            [PromptExecutionSettings.DefaultServiceId] = executionSettings
        };
        var tokenBuffer = new StringBuilder();
        var tokenCount = 0;
        await foreach (var chunk in _kernel
            .InvokeStreamingAsync<StreamingChatMessageContent>(
                "AnalysisPlugin", "StreamSummary", arguments, ct))
        {
            if (chunk.Content is null) continue;
            tokenBuffer.Append(chunk.Content);
            tokenCount++;
            // Yield complete words/sentences rather than individual tokens
            // to reduce UI flicker and downstream rendering load
            if (chunk.Content.Contains(' ') || chunk.Content.Contains('\n'))
            {
                yield return tokenBuffer.ToString();
                tokenBuffer.Clear();
            }
            // Early termination: if we've generated enough for the use case,
            // cancel further generation to avoid paying for unused completion tokens
            if (tokenCount >= 800)
            {
                yield return tokenBuffer.ToString();
                yield break;
            }
        }
        // Flush any remaining buffered content
        if (tokenBuffer.Length > 0)
            yield return tokenBuffer.ToString();
    }
}
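The word-boundary chunking above smooths rendering but does not by itself bound memory if the client stalls. One way to add real backpressure (a sketch under assumptions, not from the article) is to pipe chunks through a bounded System.Threading.Channels queue, so a slow consumer suspends the producer instead of growing an unbounded buffer:

```csharp
using System.Threading.Channels;

public class BackpressuredRelay
{
    // Bounded capacity: when the consumer falls behind, WriteAsync suspends
    // the producer instead of letting the buffer grow without limit.
    private readonly Channel<string> _channel =
        Channel.CreateBounded<string>(new BoundedChannelOptions(capacity: 64)
        {
            FullMode = BoundedChannelFullMode.Wait,
            SingleReader = true,
            SingleWriter = true
        });

    public async Task ProduceAsync(
        IAsyncEnumerable<string> chunks, CancellationToken ct)
    {
        await foreach (var chunk in chunks.WithCancellation(ct))
            await _channel.Writer.WriteAsync(chunk, ct); // backpressure point
        _channel.Writer.Complete();
    }

    public async Task ConsumeAsync(
        Func<string, Task> sendToClient, CancellationToken ct)
    {
        await foreach (var chunk in _channel.Reader.ReadAllAsync(ct))
            await sendToClient(chunk); // drains at the client's pace
    }
}
```

With SignalR, ConsumeAsync's callback would be the hub's send method; the LLM stream from StreamAnalysisAsync feeds ProduceAsync.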

III. Redis Semantic Cache Implementation

3.1 Why String Equality Is the Wrong Cache Key

Naive caching uses exact string matching: two requests are equivalent if their prompt strings are byte-for-byte identical. This produces near-zero cache hit rates in practice because natural language users ask the same question in slightly different ways:

  • “Summarize this contract”
  • “Give me a summary of this contract”
  • “What does this contract say, briefly?”

These are semantically equivalent — they should return the same cached response. String equality misses all three matches.

Semantic caching uses vector embeddings to measure intent similarity. Two prompts are considered equivalent if their embedding vectors are within a configurable cosine similarity threshold. This is what drives the 30–60% cache hit rates referenced in Part 1.
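To make the threshold check concrete, here is a self-contained cosine similarity computation of the kind the cache performs (illustrative only; a real deployment compares model-generated embedding vectors, not these toy two-dimensional ones):

```csharp
// Cosine similarity: dot(a, b) / (|a| * |b|).
// The cache treats two prompts as equivalent when this value
// meets the configured similarity threshold.
static double CosineSimilarity(float[] a, float[] b)
{
    double dot = 0, magA = 0, magB = 0;
    for (var i = 0; i < a.Length; i++)
    {
        dot += a[i] * b[i];
        magA += a[i] * a[i];
        magB += b[i] * b[i];
    }
    return dot / (Math.Sqrt(magA) * Math.Sqrt(magB));
}

// Two nearby embedding vectors score close to 1.0 and clear a 0.92 threshold;
// unrelated vectors fall well below it and miss the cache.
var similarity = CosineSimilarity(new[] { 0.9f, 0.1f }, new[] { 0.85f, 0.15f });
var cacheHit = similarity >= 0.92;
```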

3.2 Redis Stack Setup and Index Configuration

Redis Stack (available in Azure Cache for Redis Enterprise) provides vector similarity search through the FT.SEARCH command. The index configuration determines both search performance and accuracy:

public class RedisSemanticCacheService : ISemanticCacheService, IAsyncDisposable
{
    private readonly IConnectionMultiplexer _redis;
    private readonly IDatabase _db;
    private readonly ITextEmbeddingGenerationService _embeddingService;
    private readonly SemanticCacheConfiguration _config;
    private readonly ILogger<RedisSemanticCacheService> _logger;
    private const string IndexName = "semantic-cache-idx";
    private const string KeyPrefix = "sk-cache:";
    public RedisSemanticCacheService(
        IConnectionMultiplexer redis,
        ITextEmbeddingGenerationService embeddingService,
        SemanticCacheConfiguration config,
        ILogger<RedisSemanticCacheService> logger)
    {
        _redis = redis;
        _db = redis.GetDatabase();
        _embeddingService = embeddingService;
        _config = config;
        _logger = logger;
    }
    public async Task InitializeIndexAsync()
    {
        var server = _redis.GetServer(_redis.GetEndPoints().First());
        try
        {
            // Check if index already exists
            await server.ExecuteAsync("FT.INFO", IndexName);
            _logger.LogInformation("Semantic cache index {IndexName} already exists", IndexName);
        }
        catch (RedisServerException ex) when (ex.Message.Contains("Unknown index name"))
        {
            // Create the vector index.
            // HNSW (Hierarchical Navigable Small World) is the recommended algorithm
            // for production: better query performance than FLAT at scale.
            await server.ExecuteAsync(
                "FT.CREATE", IndexName,
                "ON", "HASH",
                "PREFIX", "1", KeyPrefix,
                "SCHEMA",
                    "plugin_name", "TAG",
                    "function_name", "TAG",
                    "tenant_id", "TAG",
                    "embedding", "VECTOR", "HNSW", "10",
                        "TYPE", "FLOAT32",
                        "DIM", "1536", // text-embedding-3-small dimension
                        "DISTANCE_METRIC", "COSINE",
                        "INITIAL_CAP", "10000",
                        "EF_CONSTRUCTION", "200", // Higher = better quality, slower build
                    "response_text", "TEXT",
                    "created_at", "NUMERIC",
                    "hit_count", "NUMERIC");
            _logger.LogInformation(
                "Created semantic cache index {IndexName} with HNSW vector configuration",
                IndexName);
        }
    }
    public async Task<SemanticCacheEntry?> GetAsync(
        string pluginName,
        string functionName,
        string tenantId,
        string promptText)
    {
        var embedding = await _embeddingService.GenerateEmbeddingAsync(promptText);
        var embeddingBytes = EmbeddingToBytes(embedding);
        // KNN vector search with metadata filter
        // The TAG filters narrow the search space before vector comparison
        var query = $"(@plugin_name:{{{EscapeTag(pluginName)}}} " +
                    $"@function_name:{{{EscapeTag(functionName)}}} " +
                    $"@tenant_id:{{{EscapeTag(tenantId)}}})=>" +
                    $"[KNN {_config.MaxCandidates} @embedding $vec AS score]";
        var searchResults = await _db.ExecuteAsync(
            "FT.SEARCH", IndexName, query,
            "PARAMS", "2", "vec", embeddingBytes,
            "RETURN", "3", "response_text", "score", "hit_count",
            "SORTBY", "score",
            "DIALECT", "2");
        var results = ParseSearchResults(searchResults);
        if (!results.Any())
        {
            _logger.LogDebug(
                "Cache miss for {Plugin}.{Function} (no candidates found)",
                pluginName, functionName);
            return null;
        }
        var best = results.First();
        // Redis returns cosine *distance*: 0 = identical, 1 = orthogonal.
        // Convert to similarity and compare against the per-function threshold.
        var similarity = 1.0 - best.Score;
        var threshold = _config.GetThreshold(functionName);
        if (similarity < threshold)
        {
            _logger.LogDebug(
                "Cache miss for {Plugin}.{Function}: best similarity {Similarity:F3} " +
                "below threshold {Threshold:F3}",
                pluginName, functionName, similarity, threshold);
            return null;
        }
        // Increment the hit counter asynchronously; fire-and-forget is acceptable here
        _ = IncrementHitCountAsync(best.Key);
        _logger.LogInformation(
            "Cache hit for {Plugin}.{Function}: similarity={Similarity:F3}, " +
            "hit_count={HitCount}",
            pluginName, functionName, similarity, best.HitCount + 1);
        return new SemanticCacheEntry
        {
            ResponseText = best.ResponseText,
            Similarity = similarity,
            CacheKey = best.Key
        };
    }
    public async Task SetAsync(
        string pluginName,
        string functionName,
        string tenantId,
        string promptText,
        string responseText,
        TimeSpan? ttl = null)
    {
        var embedding = await _embeddingService.GenerateEmbeddingAsync(promptText);
        var embeddingBytes = EmbeddingToBytes(embedding);
        var cacheKey = $"{KeyPrefix}{Guid.NewGuid():N}";
        var effectiveTtl = ttl ?? _config.DefaultTtl;
        var hashFields = new HashEntry[]
        {
            new("plugin_name", pluginName),
            new("function_name", functionName),
            new("tenant_id", tenantId),
            new("prompt_text", promptText), // Store for debugging/audit
            new("response_text", responseText),
            new("embedding", embeddingBytes),
            new("created_at", DateTimeOffset.UtcNow.ToUnixTimeSeconds()),
            new("hit_count", 0)
        };
        var tx = _db.CreateTransaction();
        _ = tx.HashSetAsync(cacheKey, hashFields);
        _ = tx.KeyExpireAsync(cacheKey, effectiveTtl);
        await tx.ExecuteAsync();
        _logger.LogDebug(
            "Cached response for {Plugin}.{Function} with TTL={TTL}",
            pluginName, functionName, effectiveTtl);
    }
    private static byte[] EmbeddingToBytes(ReadOnlyMemory<float> embedding)
    {
        var span = embedding.Span;
        var bytes = new byte[span.Length * sizeof(float)];
        MemoryMarshal.AsBytes(span).CopyTo(bytes);
        return bytes;
    }
    private static string EscapeTag(string value) =>
        value.Replace("-", "\\-").Replace(".", "\\.");
    public async ValueTask DisposeAsync() =>
        await _redis.CloseAsync();
}

3.3 Cache Configuration Tuning

The similarity threshold is the most consequential configuration value in the semantic cache. Set it too low and the cache returns responses for queries that are semantically distinct, producing incorrect or irrelevant answers. Set it too high and the hit rate collapses toward zero.

The recommended tuning process is empirical: deploy with a threshold of 0.92 as a starting point, instrument cache hit rates and user-feedback signals, and adjust based on the observed quality vs. hit-rate trade-off. Different plugin functions warrant different thresholds: summarization tolerates more semantic variation than precise data extraction does:

public class SemanticCacheConfiguration
{
    public TimeSpan DefaultTtl { get; set; } = TimeSpan.FromHours(4);
    public int MaxCandidates { get; set; } = 3; // KNN k value
    // Per-function threshold overrides
    public Dictionary<string, double> FunctionThresholds { get; set; } = new()
    {
        // High tolerance: question phrasing variations map to same answer
        ["Summarize"] = 0.88,
        ["AnalyzeSentiment"] = 0.90,

        // Low tolerance: subtle wording changes carry semantic weight
        ["ExtractObligations"] = 0.95,
        ["ExtractEntities"] = 0.94,
        ["ClassifyIntent"] = 0.93
    };
    public double DefaultThreshold { get; set; } = 0.92;
    public double GetThreshold(string functionName) =>
        FunctionThresholds.TryGetValue(functionName, out var threshold)
            ? threshold
            : DefaultThreshold;
}

3.4 Cache Hit Rate Instrumentation

Without measuring the cache, you cannot optimize it. Wire the cache service into your OpenTelemetry metrics pipeline:

// Metrics registered at startup
private static readonly Counter<long> CacheHits = 
    Metrics.CreateCounter<long>("sk_cache_hits_total",
        description: "Total semantic cache hits");
private static readonly Counter<long> CacheMisses = 
    Metrics.CreateCounter<long>("sk_cache_misses_total",
        description: "Total semantic cache misses");
private static readonly Histogram<double> CacheSimilarityScore = 
    Metrics.CreateHistogram<double>("sk_cache_similarity_score",
        description: "Cosine similarity score of cache hits");
private static readonly Counter<double> CacheTokensSaved = 
    Metrics.CreateCounter<double>("sk_cache_tokens_saved_total",
        description: "Estimated tokens saved by cache hits");

The sk_cache_hits_total / (sk_cache_hits_total + sk_cache_misses_total) ratio is your cache hit rate, the primary FinOps KPI for the caching layer. Target 35%+ in steady state for workloads with repetitive query patterns.
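With those counters exported to Prometheus, the hit rate can be expressed as a query over the metric names from the snippet above (a sketch; label filters and windows will vary by deployment):

```
# Semantic cache hit rate over the last hour
sum(rate(sk_cache_hits_total[1h]))
  /
(sum(rate(sk_cache_hits_total[1h])) + sum(rate(sk_cache_misses_total[1h])))
```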

IV. The Complete Production Filter Pipeline

4.1 Filter Registration and Ordering

Semantic Kernel filters execute in registration order for pre-invocation logic and in reverse order for post-invocation logic — identical to ASP.NET Core middleware ordering semantics. The order matters for correctness:

// Kernel configuration with ordered filter pipeline
builder.Services.AddSingleton<Kernel>(sp =>
{
    var kernelBuilder = Kernel.CreateBuilder();
    kernelBuilder.AddAzureOpenAIChatCompletion(
        deploymentName: config["AzureOpenAI:DeploymentName"]!,
        endpoint: config["AzureOpenAI:Endpoint"]!,
        apiKey: config["AzureOpenAI:ApiKey"]!,
        serviceId: "primary");
    kernelBuilder.AddAzureOpenAIChatCompletion(
        deploymentName: config["AzureOpenAI:FallbackDeployment"]!,
        endpoint: config["AzureOpenAI:Endpoint"]!,
        apiKey: config["AzureOpenAI:ApiKey"]!,
        serviceId: "fallback");
    // Filter order (pre-invocation): 1 → 2 → 3 → 4 → 5
    // Filter order (post-invocation): 5 → 4 → 3 → 2 → 1
    kernelBuilder.Services
        .AddSingleton<IFunctionInvocationFilter, TenantRateLimitFilter>() // 1: Gate first
        .AddSingleton<IFunctionInvocationFilter, SemanticCacheFilter>() // 2: Check cache
        .AddSingleton<IFunctionInvocationFilter, PromptSanitizationFilter>() // 3: Sanitize input
        .AddSingleton<IFunctionInvocationFilter, ObservabilityFilter>() // 4: Measure
        .AddSingleton<IFunctionInvocationFilter, OutputValidationFilter>() // 5: Validate output
        .AddSingleton<IPromptRenderFilter, AuditPromptFilter>() // Audit resolved prompts
        .AddSingleton<IAutoFunctionInvocationFilter, PlannerBoundaryFilter>(); // Bound planner
    return kernelBuilder.Build();
});

4.2 Tenant Rate Limit Filter

The rate limiting filter is the outermost gate — it rejects requests before any LLM work begins, protecting both cost budgets and downstream service capacity. Implementation uses a sliding window algorithm backed by Redis atomic operations for correctness under concurrent load:

public class TenantRateLimitFilter : IFunctionInvocationFilter
{
    private readonly ITokenBudgetService _budgetService;
    private readonly IRateLimiterService _rateLimiter;
    private readonly IHttpContextAccessor _httpContext;
    private readonly RateLimitConfiguration _config; // per-tenant RPM limits
    public async Task OnFunctionInvocationAsync(
        FunctionInvocationContext context,
        Func<FunctionInvocationContext, Task> next)
    {
        var tenantId = ResolveTenantId();
        // Check token budget before any invocation
        var budgetCheck = await _budgetService.CheckAsync(tenantId);
        if (!budgetCheck.HasBudget)
        {
            throw new TokenBudgetExceededException(
                $"Tenant {tenantId} has exhausted daily token budget of " +
                $"${budgetCheck.DailyLimitUsd:F2}. " +
                $"Budget resets at {budgetCheck.ResetTime:HH:mm} UTC.");
        }
        // Check request rate limit (RPM - requests per minute)
        var rateCheck = await _rateLimiter.CheckRateLimitAsync(
            key: $"rpm:{tenantId}",
            limit: _config.GetRpmLimit(tenantId),
            window: TimeSpan.FromMinutes(1));
        if (!rateCheck.IsAllowed)
        {
            throw new RateLimitExceededException(
                $"Rate limit exceeded for tenant {tenantId}. " +
                $"Limit: {rateCheck.Limit} RPM. " +
                $"Retry after: {rateCheck.RetryAfter.TotalSeconds:F0}s")
            {
                RetryAfter = rateCheck.RetryAfter
            };
        }
        await next(context);
    }
    private string ResolveTenantId() =>
        _httpContext.HttpContext?.User
            .FindFirst("tenant_id")?.Value
        ?? throw new InvalidOperationException(
            "Tenant context not available - ensure authentication middleware runs before kernel invocation.");
}
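The IRateLimiterService implementation is not shown above. A common Redis-backed sliding window keeps a sorted set of request timestamps, trimmed and counted inside a transaction; a sketch under assumptions (the interface and RateLimitResult shape are inferred from the filter's usage):

```csharp
using StackExchange.Redis;

public class RedisSlidingWindowRateLimiter : IRateLimiterService
{
    private readonly IDatabase _db;

    public RedisSlidingWindowRateLimiter(IConnectionMultiplexer redis) =>
        _db = redis.GetDatabase();

    public async Task<RateLimitResult> CheckRateLimitAsync(
        string key, int limit, TimeSpan window)
    {
        var now = DateTimeOffset.UtcNow;
        var windowStart = now - window;

        var tx = _db.CreateTransaction();
        // Drop timestamps that have aged out of the window...
        _ = tx.SortedSetRemoveRangeByScoreAsync(
            key, double.NegativeInfinity, windowStart.ToUnixTimeMilliseconds());
        // ...record this request with a unique member...
        _ = tx.SortedSetAddAsync(
            key, Guid.NewGuid().ToString("N"), now.ToUnixTimeMilliseconds());
        // ...and count what remains inside the window.
        var countTask = tx.SortedSetLengthAsync(key);
        _ = tx.KeyExpireAsync(key, window); // idle keys evaporate
        await tx.ExecuteAsync();

        var count = await countTask;
        return new RateLimitResult
        {
            IsAllowed = count <= limit,
            Limit = limit,
            RetryAfter = count <= limit ? TimeSpan.Zero : window
        };
    }
}
```

Running the trim, add, and count as one transaction is what keeps the check correct under concurrent requests from the same tenant.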

4.3 Prompt Sanitization Filter

The prompt sanitization filter inspects the fully rendered prompt, applying heuristic and model-based checks for prompt injection attempts. It operates on the IPromptRenderFilter interface, which provides access to the rendered prompt text before it is submitted to the LLM:

public class AuditPromptFilter : IPromptRenderFilter
{
    private readonly IPromptAuditStore _auditStore;
    private readonly IPromptInjectionDetector _injectionDetector;
    private readonly ILogger<AuditPromptFilter> _logger;
    private readonly PromptSecurityConfiguration _config; // block threshold, audit policy
    public async Task OnPromptRenderAsync(
        PromptRenderContext context,
        Func<PromptRenderContext, Task> next)
    {
        await next(context); // Let prompt render complete
        var renderedPrompt = context.RenderedPrompt;
        if (renderedPrompt is null) return;
        // Injection detection: heuristic patterns first (cheap), LLM-based second (expensive)
        var injectionResult = await _injectionDetector.DetectAsync(renderedPrompt);
        if (injectionResult.IsInjectionDetected)
        {
            _logger.LogWarning(
                "Prompt injection detected for function {PluginName}.{FunctionName}. " +
                "Confidence: {Confidence:F2}. Pattern: {Pattern}",
                context.Function.PluginName,
                context.Function.Name,
                injectionResult.Confidence,
                injectionResult.DetectedPattern);
            // Record security event
            await _auditStore.RecordSecurityEventAsync(new PromptSecurityEvent
            {
                Timestamp = DateTimeOffset.UtcNow,
                TenantId = ResolveTenantId(),
                PluginName = context.Function.PluginName,
                FunctionName = context.Function.Name,
                DetectedPattern = injectionResult.DetectedPattern,
                Confidence = injectionResult.Confidence,
                // Never log the full prompt in the audit store; it may contain PII
                PromptHash = ComputeHash(renderedPrompt)
            });
            if (injectionResult.Confidence >= _config.BlockThreshold)
            {
                throw new PromptInjectionException(
                    $"Prompt injection blocked with confidence {injectionResult.Confidence:F2}.");
            }
        }
        // Audit all prompts in regulated deployments
        if (_config.AuditAllPrompts)
        {
            await _auditStore.RecordPromptAsync(new PromptAuditRecord
            {
                Timestamp = DateTimeOffset.UtcNow,
                PluginName = context.Function.PluginName,
                FunctionName = context.Function.Name,
                PromptHash = ComputeHash(renderedPrompt),
                TokenEstimate = EstimateTokenCount(renderedPrompt)
            });
        }
    }
}

4.4 Output Validation Filter

The output validation filter is the post-invocation gate for semantic correctness. Where infrastructure filters handle binary pass/fail scenarios, output validation handles the probabilistic quality spectrum of LLM responses:

public class OutputValidationFilter : IFunctionInvocationFilter
{
    private readonly IOutputValidatorRegistry _validatorRegistry;
    private readonly ILogger<OutputValidationFilter> _logger;
    private readonly OutputValidationConfiguration _config;
    public async Task OnFunctionInvocationAsync(
        FunctionInvocationContext context,
        Func<FunctionInvocationContext, Task> next)
    {
        await next(context);
        var output = context.Result.GetValue<string>();
        if (output is null) return;
        // Look up validators registered for this specific function
        var validators = _validatorRegistry.GetValidators(
            context.Function.PluginName,
            context.Function.Name);
        if (!validators.Any()) return;
        var validationContext = new OutputValidationContext
        {
            PluginName = context.Function.PluginName,
            FunctionName = context.Function.Name,
            Output = output,
            InputArguments = context.Arguments
        };
        foreach (var validator in validators)
        {
            var result = await validator.ValidateAsync(validationContext);
            if (!result.IsValid)
            {
                _logger.LogWarning(
                    "Output validation failed for {Plugin}.{Function}: {Reason}. " +
                    "Validator: {ValidatorType}",
                    context.Function.PluginName,
                    context.Function.Name,
                    result.FailureReason,
                    validator.GetType().Name);
                // Determine escalation strategy based on validator severity
                switch (result.Severity)
                {
                    case ValidationSeverity.Critical:
                        // Block the response entirely; cannot return this output
                        throw new OutputValidationException(
                            $"Critical output validation failure: {result.FailureReason}");
                    case ValidationSeverity.High:
                        // Attempt retry with modified execution settings
                        if (context.RequestSequenceIndex < _config.MaxRetries)
                        {
                            _logger.LogInformation(
                                "Retrying {Plugin}.{Function} due to high-severity " +
                                "validation failure (attempt {Attempt}/{Max})",
                                context.Function.PluginName,
                                context.Function.Name,
                                context.RequestSequenceIndex + 1,
                                _config.MaxRetries);
                            // Signal retry-caller will re-invoke
                            context.Result = FunctionResult.Empty;
                            return;
                        }
                        goto case ValidationSeverity.Medium;
                    case ValidationSeverity.Medium:
                        // Return with quality degradation signal in metadata
                        var metadata = context.Result.Metadata ?? 
                            new Dictionary<string, object?>();
                        metadata["validation_warning"] = result.FailureReason;
                        metadata["validation_severity"] = result.Severity.ToString();
                        break;
                    case ValidationSeverity.Low:
                        // Log only-pass through
                        break;
                }
            }
        }
    }
}
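The filter delegates the actual checks to `IOutputValidator` implementations resolved from the registry. As a concrete illustration, here is a minimal validator that rejects outputs that are not well-formed JSON — useful for functions whose prompt demands a structured response that downstream code will deserialize. The contract types are redeclared in trimmed form so the sketch compiles on its own; their real shapes live wherever the registry is defined, and the names here simply mirror the ones used by the filter above.

```csharp
using System.Text.Json;

// Trimmed contract shapes, mirroring the names used by OutputValidationFilter
public enum ValidationSeverity { Low, Medium, High, Critical }

public sealed record OutputValidationResult(
    bool IsValid, ValidationSeverity Severity, string? FailureReason)
{
    public static readonly OutputValidationResult Valid =
        new(true, ValidationSeverity.Low, null);
}

public sealed class OutputValidationContext
{
    public required string PluginName { get; init; }
    public required string FunctionName { get; init; }
    public required string Output { get; init; }
}

public interface IOutputValidator
{
    Task<OutputValidationResult> ValidateAsync(OutputValidationContext context);
}

// Rejects any output that is not parseable JSON. High severity triggers
// the retry path in the filter rather than blocking outright.
public sealed class JsonShapeValidator : IOutputValidator
{
    public Task<OutputValidationResult> ValidateAsync(OutputValidationContext context)
    {
        try
        {
            using var _ = JsonDocument.Parse(context.Output);
            return Task.FromResult(OutputValidationResult.Valid);
        }
        catch (JsonException ex)
        {
            return Task.FromResult(new OutputValidationResult(
                false, ValidationSeverity.High,
                $"Output is not valid JSON: {ex.Message}"));
        }
    }
}
```

A registry would typically map this validator to specific plugin/function pairs, so that free-text functions are never penalized for unstructured output.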

4.5 Planner Boundary Filter

The planner boundary filter is the safety mechanism for auto-invoke scenarios. It enforces hard limits on plan execution — maximum steps, allowed plugin set, and cumulative token budget — preventing unbounded planning loops:

public class PlannerBoundaryFilter : IAutoFunctionInvocationFilter
{
    private readonly PlannerBoundaryConfiguration _config;
    private readonly ILogger<PlannerBoundaryFilter> _logger;

    public PlannerBoundaryFilter(
        PlannerBoundaryConfiguration config,
        ILogger<PlannerBoundaryFilter> logger)
    {
        _config = config;
        _logger = logger;
    }

    public async Task OnAutoFunctionInvocationAsync(
        AutoFunctionInvocationContext context,
        Func<AutoFunctionInvocationContext, Task> next)
    {
        // Enforce maximum execution steps
        if (context.RequestSequenceIndex >= _config.MaxPlannerSteps)
        {
            _logger.LogWarning(
                "Planner exceeded maximum step count of {MaxSteps} for goal: {Goal}. " +
                "Terminating plan execution.",
                _config.MaxPlannerSteps,
                // Truncate is a string helper extension (implementation not shown)
                context.ChatHistory.LastOrDefault()?.Content?.Truncate(100));
            context.Terminate = true;
            return;
        }
        // Enforce plugin allowlist
        var requestedPlugin = context.Function.PluginName;
        if (!_config.AllowedPlugins.Contains(requestedPlugin))
        {
            _logger.LogWarning(
                "Planner attempted to invoke unauthorized plugin {PluginName}. " +
                "Blocked by boundary filter.",
                requestedPlugin);
            // Don't throw; instead provide a synthetic result explaining the restriction
            context.Result = new FunctionResult(
                context.Function,
                $"Plugin '{requestedPlugin}' is not authorized for autonomous invocation.");
            return;
        }
        await next(context);
        // Post-step: check cumulative token consumption
        var stepUsage = context.Result.Metadata?
            .GetValueOrDefault("Usage") as CompletionsUsage;
        if (stepUsage is not null)
        {
            // TrackCumulativeTokens accumulates per-plan usage (implementation not shown)
            var cumulativeTokens = TrackCumulativeTokens(context, stepUsage);
            if (cumulativeTokens > _config.MaxPlannerTokenBudget)
            {
                _logger.LogWarning(
                    "Planner exceeded token budget of {Budget} tokens " +
                    "(consumed: {Consumed}). Terminating plan.",
                    _config.MaxPlannerTokenBudget,
                    cumulativeTokens);
                context.Terminate = true;
            }
        }
    }
}
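The filter reads exactly three settings. A plausible shape for `PlannerBoundaryConfiguration`, inferred from the properties used above, is a small options class that binds from an `IConfiguration` section; the default values here are illustrative, not recommendations:

```csharp
public sealed class PlannerBoundaryConfiguration
{
    // Hard cap on auto-invoked steps per plan (RequestSequenceIndex is zero-based)
    public int MaxPlannerSteps { get; init; } = 5;

    // Plugins the planner may invoke autonomously; anything else is blocked
    public string[] AllowedPlugins { get; init; } = [];

    // Cumulative prompt + completion tokens allowed across one plan
    public int MaxPlannerTokenBudget { get; init; } = 20_000;
}
```

Keeping the allowlist as data rather than code means a runaway planner can be constrained per environment without a redeploy.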

V. End-to-End Wiring: The Complete Kernel Bootstrap

With all components defined, the production bootstrap assembles them into a coherent, observable, resilient system:

// Program.cs — Production Semantic Kernel Bootstrap
var builder = Host.CreateApplicationBuilder(args);
var config = builder.Configuration; // read by the kernel factory and OTel setup below
// ── Infrastructure ──────────────────────────────────────────────────────────
builder.Services.AddStackExchangeRedisCache(options =>
{
    options.Configuration = builder.Configuration["Redis:ConnectionString"];
    options.InstanceName = "sk-prod:";
});
builder.Services.AddSingleton<IConnectionMultiplexer>(sp =>
    ConnectionMultiplexer.Connect(
        builder.Configuration["Redis:ConnectionString"]!));
// ── Semantic Cache ──────────────────────────────────────────────────────────
builder.Services.AddSingleton<SemanticCacheConfiguration>(sp =>
    builder.Configuration.GetSection("SemanticCache").Get<SemanticCacheConfiguration>()!);
builder.Services.AddSingleton<ISemanticCacheService, RedisSemanticCacheService>();
builder.Services.AddHostedService<SemanticCacheIndexInitializer>(); // Ensures index on startup
// ── Resilience ──────────────────────────────────────────────────────────────
builder.Services
    .AddHttpClient("AzureOpenAI-Primary")
    .AddResilienceHandler("llm-primary", ConfigureLlmResiliencePipeline(isPrimary: true));
builder.Services
    .AddHttpClient("AzureOpenAI-Fallback")
    .AddResilienceHandler("llm-fallback", ConfigureLlmResiliencePipeline(isPrimary: false));
// ── Filters (host DI registrations, resolved by the kernel factory below) ───
builder.Services.AddSingleton<TenantRateLimitFilter>();
builder.Services.AddSingleton<SemanticCacheFilter>();
builder.Services.AddSingleton<ObservabilityFilter>();
builder.Services.AddSingleton<OutputValidationFilter>();
builder.Services.AddSingleton<AuditPromptFilter>();
builder.Services.AddSingleton<PlannerBoundaryFilter>();
// ── Kernel ──────────────────────────────────────────────────────────────────
builder.Services.AddSingleton<Kernel>(sp =>
{
    var kernelBuilder = Kernel.CreateBuilder();
    // Models
    kernelBuilder.AddAzureOpenAIChatCompletion(
        deploymentName: config["AzureOpenAI:GPT4oDeployment"]!,
        endpoint: config["AzureOpenAI:Endpoint"]!,
        apiKey: config["AzureOpenAI:ApiKey"]!,
        serviceId: "primary",
        httpClient: sp.GetRequiredService<IHttpClientFactory>()
            .CreateClient("AzureOpenAI-Primary"));
    kernelBuilder.AddAzureOpenAIChatCompletion(
        deploymentName: config["AzureOpenAI:GPT4oMiniDeployment"]!,
        endpoint: config["AzureOpenAI:Endpoint"]!,
        apiKey: config["AzureOpenAI:ApiKey"]!,
        serviceId: "fallback",
        httpClient: sp.GetRequiredService<IHttpClientFactory>()
            .CreateClient("AzureOpenAI-Fallback"));
    kernelBuilder.AddAzureOpenAITextEmbeddingGeneration(
        deploymentName: config["AzureOpenAI:EmbeddingDeployment"]!,
        endpoint: config["AzureOpenAI:Endpoint"]!,
        apiKey: config["AzureOpenAI:ApiKey"]!);
    // Plugins
    kernelBuilder.Plugins.AddFromType<DocumentAnalysisPlugin>("DocumentPlugin");
    kernelBuilder.Plugins.AddFromType<LegalAnalysisPlugin>("LegalPlugin");
    kernelBuilder.Plugins.AddFromType<NotificationPlugin>("NotificationPlugin");
    // Filter Pipeline (order is significant)
    kernelBuilder.Services
        .AddSingleton<IFunctionInvocationFilter>(
            sp.GetRequiredService<TenantRateLimitFilter>())
        .AddSingleton<IFunctionInvocationFilter>(
            sp.GetRequiredService<SemanticCacheFilter>())
        .AddSingleton<IFunctionInvocationFilter>(
            sp.GetRequiredService<ObservabilityFilter>())
        .AddSingleton<IFunctionInvocationFilter>(
            sp.GetRequiredService<OutputValidationFilter>())
        .AddSingleton<IPromptRenderFilter>(
            sp.GetRequiredService<AuditPromptFilter>())
        .AddSingleton<IAutoFunctionInvocationFilter>(
            sp.GetRequiredService<PlannerBoundaryFilter>());
    return kernelBuilder.Build();
});
// ── OpenTelemetry ───────────────────────────────────────────────────────────
builder.Services.AddOpenTelemetry()
    .ConfigureResource(r => r.AddService("SemanticKernelService", serviceVersion: "2.0.0"))
    .WithTracing(t => t
        .AddAspNetCoreInstrumentation()
        .AddHttpClientInstrumentation()
        .AddSource("Microsoft.SemanticKernel*")
        .AddSource("SemanticKernel.Custom*")
        .AddOtlpExporter(o => o.Endpoint = 
            new Uri(config["OpenTelemetry:Endpoint"]!)))
    .WithMetrics(m => m
        .AddAspNetCoreInstrumentation()
        .AddMeter("Microsoft.SemanticKernel*")
        .AddMeter("SemanticKernel.Custom")
        .AddOtlpExporter(o => o.Endpoint = 
            new Uri(config["OpenTelemetry:Endpoint"]!)));
await builder.Build().RunAsync();
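The bootstrap above pulls every setting from configuration. A minimal `appsettings.json` shape that satisfies those lookups might look like the following; the `SemanticCache` fields are illustrative (they depend on the `SemanticCacheConfiguration` type from the cache section earlier), and in a real deployment the API key should come from Key Vault or managed identity rather than an inline value:

```json
{
  "Redis": {
    "ConnectionString": "redis-prod:6379,ssl=true,abortConnect=false"
  },
  "SemanticCache": {
    "SimilarityThreshold": 0.92,
    "DefaultTtlMinutes": 60
  },
  "AzureOpenAI": {
    "Endpoint": "https://<resource>.openai.azure.com/",
    "ApiKey": "<from-key-vault>",
    "GPT4oDeployment": "gpt-4o",
    "GPT4oMiniDeployment": "gpt-4o-mini",
    "EmbeddingDeployment": "text-embedding-3-small"
  },
  "OpenTelemetry": {
    "Endpoint": "http://otel-collector:4317"
  }
}
```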

VI. Operational Runbook: What to Watch in Production

6.1 The Five Metrics That Matter Most

Once the system is running, these five metrics define your production health posture:

| Metric | Alert Threshold | Action on Breach |
|-------------------------------------|------------------------|----------------------------------------|
| sk_cache_hit_rate (7d avg) | < 25% | Review query diversity, adjust TTL |
| sk_function_duration_p99 | > 12s | Check model latency, circuit breakers |
| sk_estimated_cost_usd_total (daily) | > 110% of daily budget | Activate model tiering, alert FinOps |
| sk_circuit_breaker_opened_total | Any increment | Page on-call, activate fallback |
| sk_output_validation_failures_total | > 2% of invocations | Review prompt quality, check for drift |
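If these metrics flow through the OTLP exporter into a Prometheus-compatible backend, the first row of the table translates into an alert rule along these lines. The metric name is the one from the table; the scrape path and evaluation windows are assumptions about your monitoring setup:

```yaml
groups:
  - name: semantic-kernel
    rules:
      - alert: SemanticCacheHitRateLow
        # 7-day average hit rate below the 25% threshold from the table above
        expr: avg_over_time(sk_cache_hit_rate[7d]) < 0.25
        for: 1h
        labels:
          severity: warning
        annotations:
          summary: "Semantic cache hit rate below 25% (7d avg)"
          description: "Review query diversity and the cache TTL policy."
```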

6.2 Incident Response Patterns

Symptom: Sudden latency spike (P99 > 15s)

  1. Check circuit breaker state — if open, primary model is degraded
  2. Verify fallback model is receiving traffic and responding within SLA
  3. Check Redis cache hit rate — a spike in misses increases LLM load
  4. Review FT.INFO semantic-cache-idx for index health

Symptom: Cost burn rate 2× normal

  1. Check cache hit rate — a cache index rebuild or Redis failover may have cleared cached entries
  2. Review planner step counts — unbounded planning may have slipped through boundary filter
  3. Audit token consumption by plugin — identify highest-cost functions
  4. Check for prompt injection events — attackers deliberately inflating token usage
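Step 3 of this runbook (per-plugin token audit) can often be answered with a single query, assuming the cost counter from the metrics table carries a `plugin` label attached by the token-metering filter — an assumption about how that filter tags its metrics:

```promql
# Cost attributed to each plugin over the last 24h
sum by (plugin) (increase(sk_estimated_cost_usd_total[24h]))

# The single biggest cost driver
topk(1, sum by (plugin) (increase(sk_estimated_cost_usd_total[24h])))
```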

Symptom: Elevated output validation failures

  1. Pull sample failed outputs from audit store
  2. Check if a prompt template was recently updated — prompt regressions are the primary cause
  3. Verify LLM model version hasn’t changed in Azure OpenAI deployment
  4. Review input distribution — a change in user query patterns may be exposing prompt brittleness

VII. Key Takeaways and What’s Next

Part 2 has walked through the three implementation layers that bridge architectural intent and production reality.

The async parallel orchestration pattern transforms multi-step LLM workflows from sequential 6–10 second operations into concurrent 2–3 second operations — not through faster LLMs, but through correct use of Task composition and dependency-aware DAG execution.

The Redis semantic cache eliminates token costs for repeated semantic intent — the most impactful FinOps lever available at this layer of the stack. The implementation details that matter most are KNN vector indexing with HNSW, per-function similarity thresholds, and TTL policies that balance freshness against hit rate.

The complete filter pipeline is what separates a Semantic Kernel proof-of-concept from a system an enterprise can operate. Rate limiting, prompt sanitization, token metering, output validation, and planner boundary enforcement are not optional hardening steps — they are the production system.

In Part 3, we will move into advanced territory: implementing a multi-agent orchestration architecture where specialized kernel instances collaborate on complex tasks, building a domain-specific memory system for regulated industries with PII redaction and audit-trail requirements, and exploring the emerging Semantic Kernel Process Framework for stateful, long-running AI workflows that survive service restarts and scale across distributed nodes.

This article is Part 2 of a series on Semantic Kernel for Enterprise AI in .NET. Part 1 covered foundational architecture, FinOps cost modeling, and SRE reliability patterns.
