Ali Suleyman TOPUZ

Originally published at topuzas.Medium

Semantic Kernel for Enterprise AI: Architecting Production-Grade LLM Integration in .NET — Foundations — Part 1

This article is Part 1 of a series on Semantic Kernel for Enterprise AI in .NET. I work at the intersection of distributed systems, AI infrastructure, and .NET engineering.

I. Executive Summary

The Gap Between Demo and Production

Every engineering team that has delivered an LLM proof-of-concept eventually faces the same humbling reality: a polished ChatGPT-style demo is light years away from a production system that handles real business transactions under real load, with real money on the line. The raw OpenAI or Azure OpenAI API gets you to the demo in days. Getting from there to a system that a Fortune 500 organization can stake its operations on — that takes an architectural framework purpose-built for the challenge.

Microsoft’s Semantic Kernel is that framework for .NET engineers.

At its core, Semantic Kernel is an open-source SDK that functions as an orchestration layer — a sophisticated middleware bridging the deterministic world of traditional enterprise software with the probabilistic, latency-sensitive, and expensive world of Large Language Models. It is not a thin API wrapper. It is not a prompt-templating library. It is a full orchestration runtime with first-class support for skills composition, autonomous planning, semantic memory, token optimization, circuit breakers, and OpenTelemetry-grade observability. Every one of those capabilities was forged in the furnace of Microsoft’s own internal AI deployments — Copilot across Office, GitHub Copilot, Azure AI services — serving hundreds of millions of users where failures are measured in dollars, not just SLA percentages.

For senior engineers operating at the intersection of AI, distributed systems, and .NET, this article is a deep architectural walkthrough of what Semantic Kernel is, why it exists in its current form, and how to wield it at the level of a production systems architect rather than a tutorial consumer.

II. Why Raw LLM APIs Are Necessary but Insufficient

Before examining Semantic Kernel’s architecture, we need to name the production failure modes that motivate its existence. These are problems that emerge reliably once an LLM-powered system moves beyond controlled demos.

2.1 Prompt Management at Scale

In a demo, your prompt is a string in your source code. In production, prompts are configuration artifacts that must be versioned, deployed independently of code, A/B tested, localized, role-conditioned, and audited. A single enterprise AI agent may require dozens of prompt templates, each evolving independently. Managing this through string interpolation in C# is an operational disaster waiting to happen.

Semantic Kernel introduces a structured prompt template system — YAML-based prompt configuration with variable injection, function calling declarations, and execution settings — that treats prompts as first-class, versionable assets rather than embedded strings.
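A sketch of what such a prompt asset can look like in Semantic Kernel's YAML prompt format (the function name and variables mirror the SummarizeForAudience prompt invoked later in this article; the concrete values are illustrative):

```yaml
name: SummarizeForAudience
description: Produces an audience-tailored summary of a business document
template: |
  Summarize the following document for a {{$audience}} audience
  in at most {{$max_length}} words. Respond with JSON containing
  "title" and "summary" fields only.

  Document:
  {{$document_content}}
template_format: semantic-kernel
input_variables:
  - name: document_content
    description: The raw document text to summarize
    is_required: true
  - name: audience
    description: "Target audience: executive, technical, or legal"
    is_required: true
  - name: max_length
    description: Maximum summary length in words
    is_required: false
execution_settings:
  default:
    temperature: 0.2
    max_tokens: 600
```

Because the asset is a file rather than an interpolated string, it can be versioned, diffed in code review, deployed independently, and swapped per locale or tenant without recompiling.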

2.2 Context Window Economics

GPT-4’s context window is large but not infinite, and every token you send costs money. In a naive implementation, developers stuff entire conversation histories into every API call, leading to two correlated failure modes: context window overflow when histories grow long, and rapidly escalating costs as conversation depth increases (resending the full history on every turn makes cumulative token usage grow quadratically with the number of turns). The economic model of LLMs means that architectural inefficiency translates directly and immediately to the FinOps P&L.

Semantic Kernel’s memory abstractions — backed by vector stores like Azure AI Search, Qdrant, or Chroma — enable selective context retrieval through semantic similarity rather than brute-force history injection. This is not a convenience feature; it is a cost-control mechanism that can reduce per-conversation token spend by an order of magnitude at enterprise scale.

2.3 Non-Deterministic Failure Modes

Traditional enterprise systems fail in ways that ops teams have learned to handle: service unavailable (503), timeout, null reference, constraint violation. These are binary, deterministic, and observable. LLM failures are none of these things. An LLM can:

  • Return a response that is syntactically valid JSON but semantically incorrect
  • Confidently hallucinate facts, API parameters, or code behavior
  • Produce outputs that drift in quality as prompt context fills up
  • Fail silently when a function call argument is subtly malformed
  • Exhibit prompt injection vulnerabilities when user input is insufficiently sandboxed

Your circuit breaker doesn’t catch hallucinations. Your retry policy doesn’t fix semantic drift. Addressing this class of failure demands architectural patterns — output validation pipelines, structured output enforcement, semantic guardrails, and human-in-the-loop escalation hooks — that raw API calls cannot provide.
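To make the validation-pipeline idea concrete, here is a minimal sketch of an output gate; the required fields ("title", "summary") are illustrative. The point is that LLM output is treated as untrusted input and verified deterministically before it reaches business logic:

```csharp
using System;
using System.Text.Json;

// Minimal output-validation gate: the LLM response is untrusted input.
// Fail closed on anything malformed so the caller can retry or escalate.
public static class OutputValidator
{
    public static bool TryParseSummary(string rawLlmOutput, out JsonElement summary)
    {
        summary = default;
        try
        {
            using var doc = JsonDocument.Parse(rawLlmOutput);
            if (doc.RootElement.ValueKind != JsonValueKind.Object) return false;
            // Required fields are illustrative; derive them from your schema.
            if (!doc.RootElement.TryGetProperty("title", out _)) return false;
            if (!doc.RootElement.TryGetProperty("summary", out _)) return false;
            // Clone so the element outlives the disposed JsonDocument.
            summary = doc.RootElement.Clone();
            return true;
        }
        catch (JsonException)
        {
            // Truncated or hallucinated output: syntactically invalid JSON.
            return false;
        }
    }
}
```

A `false` here feeds retry-with-repair-prompt or human-in-the-loop escalation logic, rather than letting a malformed payload propagate.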

2.4 Multi-Model, Multi-Provider Orchestration

No enterprise runs a single LLM for all use cases. GPT-4o handles complex reasoning; GPT-3.5-turbo handles high-volume simple classification at a fraction of the cost; a fine-tuned domain model handles specialized extraction. Routing intelligently across these models — based on task complexity, cost budget, latency SLA, and availability — requires an abstraction layer that decouples your business logic from any specific model’s API surface.

Semantic Kernel’s connector architecture provides exactly this. You register multiple model services with service IDs, define routing strategies, and let the orchestration layer handle model selection. Your skills and planners operate against the abstraction, not the concrete API.
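As a concrete sketch of service-ID registration (deployment names, service IDs, config keys, and the userText variable are all illustrative; the builder methods are SK's Azure OpenAI connector API), registering two chat services and pinning an invocation to one of them looks roughly like this:

```csharp
using Microsoft.SemanticKernel;
using Microsoft.SemanticKernel.Connectors.OpenAI;

var builder = Kernel.CreateBuilder();

// Two chat services registered under distinct service IDs.
builder.AddAzureOpenAIChatCompletion(
    deploymentName: "gpt-4o",
    endpoint: config["AzureOpenAI:Endpoint"]!,
    apiKey: config["AzureOpenAI:ApiKey"]!,
    serviceId: "reasoning");

builder.AddAzureOpenAIChatCompletion(
    deploymentName: "gpt-4o-mini",
    endpoint: config["AzureOpenAI:Endpoint"]!,
    apiKey: config["AzureOpenAI:ApiKey"]!,
    serviceId: "high-volume");

var kernel = builder.Build();

// Route a specific invocation to a service by ID via execution settings.
var settings = new OpenAIPromptExecutionSettings { ServiceId = "high-volume" };
var args = new KernelArguments(settings) { ["input"] = userText };
```

Business logic only ever sees the service ID; swapping the underlying deployment (or provider) is a registration change, not a code change.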

III. Architectural Anatomy of Semantic Kernel

Understanding Semantic Kernel at an architectural level requires examining its four foundational concepts: Plugins (Skills), Planners, Memory, and Filters (Middleware). These are not independent features — they compose into an orchestration runtime.

3.1 Plugins: The Unit of AI Capability

A plugin in Semantic Kernel is a class that exposes one or more functions — either semantic functions (backed by LLM invocations) or native functions (backed by deterministic C# logic) — that the kernel can discover, invoke, and compose. The [KernelFunction] attribute and [Description] annotations are not decorative; they are the mechanism by which the planner understands what a function does and when to invoke it.

public class DocumentAnalysisPlugin
{
    private readonly Kernel _kernel;
    private readonly IDocumentStore _documentStore;

    public DocumentAnalysisPlugin(Kernel kernel, IDocumentStore documentStore)
    {
        _kernel = kernel;
        _documentStore = documentStore;
    }

    [KernelFunction("summarize_document")]
    [Description("Produces a structured executive summary of a business document given its ID")]
    public async Task<DocumentSummary> SummarizeDocumentAsync(
        [Description("The unique identifier of the document to summarize")] string documentId,
        [Description("Target audience: executive, technical, or legal")] string audience = "executive")
    {
        var document = await _documentStore.GetAsync(documentId);

        var arguments = new KernelArguments
        {
            ["document_content"] = document.Content,
            ["audience"] = audience,
            ["max_length"] = audience == "executive" ? "200" : "500"
        };
        var result = await _kernel.InvokeAsync(
            "DocumentPrompts", 
            "SummarizeForAudience", 
            arguments);
        return JsonSerializer.Deserialize<DocumentSummary>(result.GetValue<string>()!)!;
    }
    [KernelFunction("extract_obligations")]
    [Description("Extracts legal obligations, deadlines, and parties from contract documents")]
    public async Task<ContractObligations> ExtractObligationsAsync(
        [Description("Raw contract text to analyze")] string contractText)
    {
        // Native validation before LLM invocation: keep deterministic logic out of prompts
        if (contractText.Length > 100_000)
            throw new ArgumentException("Contract text exceeds maximum analyzable length");
        var arguments = new KernelArguments { ["contract_text"] = contractText };

        var result = await _kernel.InvokeAsync(
            "LegalPrompts", 
            "ExtractObligations", 
            arguments);
        return ParseObligations(result.GetValue<string>()!);
    }
}

The architectural principle here is the separation of semantic and native logic. Input validation, data access, and output parsing belong in native functions. Reasoning, synthesis, and language understanding belong in semantic functions. Mixing these concerns produces systems that are harder to test, harder to debug, and more expensive to run.

3.2 Planners: Autonomous Orchestration

The planner is Semantic Kernel’s most architecturally significant component — and the one most likely to surprise engineers coming from traditional orchestration systems like Temporal or Azure Durable Functions. A planner takes a high-level goal expressed in natural language and dynamically generates a plan — a sequence of plugin invocations — to achieve it, using the LLM itself as the planning engine.

Semantic Kernel currently supports two primary planner strategies:

Handlebars Planner generates a structured Handlebars template representing a fixed execution plan. This is deterministic once generated — suitable for workflows where you want human review of the plan before execution, or where the execution graph needs to be auditable.

Function Calling Planner (Auto-invoke) delegates step-by-step planning to the LLM’s native function-calling capability, allowing iterative, dynamic execution where the model decides at each step which function to invoke based on prior results. This is appropriate for exploratory or open-ended tasks.

For production enterprise systems, the architectural guidance is this: never use open-ended auto-invoke without bounding constraints. Unbounded planners can generate execution graphs of arbitrary length, incurring unbounded token costs and latency. Always configure FunctionChoiceBehavior with an explicit function allowlist and bounded execution settings:

var executionSettings = new OpenAIPromptExecutionSettings
{
    FunctionChoiceBehavior = FunctionChoiceBehavior.Auto(
        functions: allowedFunctions, // Constrain available functions
        options: new FunctionChoiceBehaviorOptions
        {
            AllowConcurrentInvocation = true, // Parallel execution where safe
            AllowParallelCalls = true
        }
    ),
    MaxTokens = 2048,
    Temperature = 0.1 // Low temperature for planning tasks—higher determinism
};

3.3 Memory: Semantic Context Management

Semantic Kernel’s memory system abstracts vector storage and retrieval behind a clean interface that integrates naturally with the kernel’s execution pipeline. The architectural pattern here is Retrieval-Augmented Generation (RAG), but implemented at the framework level rather than hand-rolled per application.

The key design decision is between volatile memory (in-process, session-scoped) and persistent memory (vector database-backed, cross-session). In enterprise applications, persistent memory backed by Azure AI Search or Qdrant is the standard pattern — it enables knowledge bases that grow over time, cross-session context, and shared memory across multiple kernel instances in a horizontally scaled deployment.

// Memory configuration with Azure AI Search
var memoryBuilder = new MemoryBuilder();
memoryBuilder
    .WithAzureOpenAITextEmbeddingGeneration(
        deploymentName: config["AzureOpenAI:EmbeddingDeployment"]!,
        endpoint: config["AzureOpenAI:Endpoint"]!,
        apiKey: config["AzureOpenAI:ApiKey"]!)
    .WithMemoryStore(new AzureAISearchMemoryStore(
        endpoint: config["AzureSearch:Endpoint"]!,
        apiKey: config["AzureSearch:ApiKey"]!));

var memory = memoryBuilder.Build();
// Storing knowledge
await memory.SaveInformationAsync(
    collection: "product-knowledge-base",
    text: productDocument.Content,
    id: productDocument.Id,
    description: productDocument.Title);
// Semantic retrieval: returns chunks ranked by cosine similarity
var results = await memory.SearchAsync(
    collection: "product-knowledge-base",
    query: userQuery,
    limit: 5,
    minRelevanceScore: 0.75)
    .ToListAsync();

The minRelevanceScore parameter is operationally critical. In production systems, you need a floor on retrieval quality to prevent low-relevance noise from polluting the LLM's context window — noise that degrades response quality and inflates token costs at the same time.

3.4 Filters: The Middleware Pipeline

Semantic Kernel’s filter pipeline is its most underutilized and most architecturally powerful feature for enterprise hardening. Filters are middleware components that intercept function invocations — both before and after execution — enabling cross-cutting concerns to be implemented once and applied uniformly across all kernel operations.

The three filter interfaces are:

  • IFunctionInvocationFilter — intercepts individual plugin function calls
  • IPromptRenderFilter — intercepts prompt construction before LLM submission
  • IAutoFunctionInvocationFilter — intercepts auto-planner function calls specifically

For enterprise systems, the canonical set of filters to implement includes:

Rate Limiting Filter  — enforces per-user, per-tenant, or global token budget constraints, rejecting invocations that would exceed configured limits before they hit the LLM API.

Semantic Cache Filter  — checks a Redis-backed semantic similarity cache before forwarding requests to the LLM, returning cached responses for semantically equivalent queries. This is the single highest-ROI FinOps optimization available in the stack.

Audit Filter  — writes a structured audit log of every LLM invocation including the resolved prompt, model parameters, token consumption, and response — essential for regulated industries where AI decision auditing is a compliance requirement.

Output Validation Filter  — applies post-invocation validation rules (schema validation, content policy checks, confidence scoring) and triggers retry or escalation logic when validation fails.

public class SemanticCacheFilter : IFunctionInvocationFilter
{
    private readonly ISemanticCache _cache;
    private readonly ILogger<SemanticCacheFilter> _logger;
    private readonly CacheConfiguration _config;
    public async Task OnFunctionInvocationAsync(
        FunctionInvocationContext context,
        Func<FunctionInvocationContext, Task> next)
    {
        // Only cache prompt-backed (semantic) function invocations, not native ones.
        // Note: IsSemanticFunction() is assumed to be a project-specific extension
        // method (e.g., driven by metadata set at registration); SK itself does not
        // ship a helper with this name.
        if (!context.Function.IsSemanticFunction())
        {
            await next(context);
            return;
        }
        var cacheKey = BuildCacheKey(context);
        var cached = await _cache.GetAsync(cacheKey);
        if (cached is not null)
        {
            _logger.LogInformation(
                "Cache hit for function {FunctionName}. Tokens saved: ~{EstimatedTokens}",
                context.Function.Name,
                EstimateTokens(cached));

            context.Result = new FunctionResult(context.Function, cached);
            return; // Skip LLM invocation entirely
        }
        // Cache miss: execute and store
        await next(context);
        var result = context.Result.GetValue<string>();
        if (result is not null)
        {
            await _cache.SetAsync(
                cacheKey, 
                result, 
                TimeSpan.FromMinutes(_config.TtlMinutes));
        }
    }
    private string BuildCacheKey(FunctionInvocationContext context)
    {
        // Cache key must capture the semantic intent, not just surface syntax.
        // Use embedding similarity rather than string equality for cache lookup.
        // KernelArguments.ToString() would only yield the type name, so serialize
        // the argument values explicitly before hashing.
        var prompt = string.Join("|", context.Arguments.Select(kv => $"{kv.Key}={kv.Value}"));
        return $"{context.Function.PluginName}:{context.Function.Name}:{ComputeSemanticHash(prompt)}";
    }
}

IV. FinOps Architecture: Treating Token Costs as Infrastructure

Token costs are not an afterthought. At enterprise scale, they are a primary cost driver that must be engineered with the same rigor as compute and storage costs. This section outlines the architectural patterns that constitute a mature FinOps posture for Semantic Kernel deployments.

4.1 The Token Cost Model

Understanding the cost model is prerequisite to optimizing it. Azure OpenAI pricing operates on a per-million-token basis, differentiated between input (prompt) tokens and output (completion) tokens, with output tokens typically priced at a 3–4× multiplier (4× for GPT-4o). For GPT-4o as of this writing:

  • Input: ~$2.50 per 1M tokens
  • Output: ~$10.00 per 1M tokens

A production AI agent handling 10,000 conversations per day, each averaging 8,000 input tokens and 500 output tokens, incurs:

  • Daily input cost: 10,000 × 8,000 / 1,000,000 × $2.50 = $200/day
  • Daily output cost: 10,000 × 500 / 1,000,000 × $10.00 = $50/day
  • Monthly total: ~$7,500/month from token costs alone
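That arithmetic is worth encoding once and reusing across budget checks and dashboards. A minimal sketch, with prices as an illustrative snapshot of GPT-4o list pricing (in production these belong in configuration, since provider pricing changes):

```csharp
using System;

public static class TokenCostModel
{
    // Illustrative GPT-4o list prices in USD per 1M tokens; treat as config.
    public const decimal InputUsdPerMillion = 2.50m;
    public const decimal OutputUsdPerMillion = 10.00m;

    // Daily token spend for a uniform traffic profile.
    public static decimal DailyCostUsd(
        long conversationsPerDay, long inputTokensEach, long outputTokensEach)
    {
        decimal inputCost = conversationsPerDay * inputTokensEach / 1_000_000m * InputUsdPerMillion;
        decimal outputCost = conversationsPerDay * outputTokensEach / 1_000_000m * OutputUsdPerMillion;
        return inputCost + outputCost;
    }
}
```

For the scenario above, `DailyCostUsd(10_000, 8_000, 500)` yields $250/day, i.e. ~$7,500/month.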

The leverage points for FinOps optimization, in approximate order of impact:

1. Semantic caching  — Returning cached responses for semantically equivalent queries eliminates token costs entirely for cache hits. In enterprise applications with repetitive query patterns (FAQ-style, report generation, classification), cache hit rates of 30–60% are achievable, representing proportional cost reductions.

2. Model tiering  — Routing simple tasks (intent classification, entity extraction from structured text, short summarization) to GPT-3.5-turbo or GPT-4o-mini rather than GPT-4o can reduce per-invocation costs by 10–50× for those tasks. The routing logic belongs in a Semantic Kernel filter or a custom ITextGenerationService implementation.

3. Prompt compression  — Systematically auditing and reducing prompt verbosity. Every word in your system prompt costs money on every invocation. Compressing a system prompt from 800 tokens to 400 tokens halves input costs for that plugin.

4. Context window management  — Implementing sliding window context truncation rather than naive full-history injection. Keeping the last N turns plus retrieved semantic memory rather than the complete conversation history.

5. Streaming with early termination  — Using GetStreamingChatMessageContentsAsync() and terminating the stream once enough content has been generated to satisfy the use case, avoiding payment for completion tokens you don't need.
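Of these, context window management is the easiest to get wrong. A minimal sketch of the sliding-window idea, shown with plain role/content tuples for clarity (a real SK service would operate on ChatHistory and ChatMessageContent):

```csharp
using System.Collections.Generic;
using System.Linq;

public static class ContextWindow
{
    // Keep system messages (the behavioral contract) plus only the most
    // recent maxTurns user/assistant exchanges; drop everything older.
    public static List<(string Role, string Content)> Truncate(
        IEnumerable<(string Role, string Content)> history, int maxTurns)
    {
        var messages = history.ToList();
        var system = messages.Where(m => m.Role == "system");
        var recent = messages.Where(m => m.Role != "system").TakeLast(maxTurns * 2);
        return system.Concat(recent).ToList();
    }
}
```

The dropped turns are not lost: combined with semantic memory, older context is re-retrieved on demand by relevance instead of being resent wholesale on every call.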

4.2 Token Budget Enforcement Architecture

public class TokenBudgetService : ITokenBudgetService
{
    private readonly IDistributedCache _cache;
    private readonly TokenBudgetConfiguration _config;
    public async Task<BudgetCheckResult> CheckAndDeductAsync(
        string tenantId, 
        int estimatedInputTokens, 
        int estimatedOutputTokens)
    {
        var budgetKey = $"token-budget:{tenantId}:{DateTime.UtcNow:yyyy-MM-dd}";

        var currentUsage = await GetUsageAsync(budgetKey);
        var projectedCost = CalculateCost(
            currentUsage.InputTokens + estimatedInputTokens,
            currentUsage.OutputTokens + estimatedOutputTokens);

        if (projectedCost > _config.DailyBudgetUsd[tenantId])
        {
            return BudgetCheckResult.Exceeded(
                currentUsage: currentUsage.TotalCostUsd,
                limit: _config.DailyBudgetUsd[tenantId],
                resetTime: DateTime.UtcNow.Date.AddDays(1));
        }
        // Optimistic deduction: reconcile against actual usage post-invocation
        await IncrementUsageAsync(budgetKey, estimatedInputTokens, estimatedOutputTokens);

        return BudgetCheckResult.Approved();
    }
}

V. SRE Reliability Patterns for Semantic Kernel

Production AI systems must meet the same availability and reliability standards as any other mission-critical service. The SRE challenge with LLM-backed systems is that the failure domain is richer and less familiar than traditional backend services.

5.1 The AI System Failure Taxonomy

Before designing for reliability, enumerate what you’re designing against:

| Failure Class | Description | Detection Signal | Mitigation Pattern |
|-------------------------|------------------------------------------|-----------------------------------|---------------------------------------------|
| Model Unavailability | LLM API returns 503/429 | HTTP status codes | Circuit breaker + fallback model |
| Rate Limit Exhaustion | TPM/RPM limits exceeded | 429 with Retry-After header | Exponential backoff + quota management |
| Context Overflow | Input exceeds context window | 400 with context length error | Dynamic context truncation |
| Semantic Degradation | Output quality falls below threshold | Custom quality scorer | Output validation + retry |
| Prompt Injection | Malicious user input hijacks prompt | Heuristic + LLM-as-judge | Input sanitization filter |
| Hallucination | Factually incorrect output | Grounding verification | RAG + citation enforcement |
| Token Budget Exhaustion | Daily spend limit reached | Budget service check | Graceful degradation to cached/static |

5.2 Circuit Breaker Implementation

The Polly library integrates naturally with Semantic Kernel’s HttpClient pipeline, providing circuit breaker, retry, and bulkhead patterns that protect against cascading failures from LLM API instability.

// Configure resilience pipeline for LLM HTTP clients
builder.Services.AddHttpClient("AzureOpenAI")
    .AddResilienceHandler("llm-pipeline", pipeline =>
    {
        // Retry with exponential backoff—respects Retry-After headers from 429 responses
        pipeline.AddRetry(new HttpRetryStrategyOptions
        {
            MaxRetryAttempts = 3,
            Delay = TimeSpan.FromSeconds(2),
            BackoffType = DelayBackoffType.Exponential,
            UseJitter = true,
            ShouldHandle = static args => args.Outcome switch
            {
                { Result.StatusCode: HttpStatusCode.TooManyRequests } => PredicateResult.True(),
                { Result.StatusCode: HttpStatusCode.ServiceUnavailable } => PredicateResult.True(),
                { Exception: HttpRequestException } => PredicateResult.True(),
                _ => PredicateResult.False()
            },
            OnRetry = args =>
            {
                // Extract Retry-After header and honor it
                if (args.Outcome.Result?.Headers.RetryAfter?.Delta is TimeSpan retryAfter)
                {
                    // In Polly v8, the retry callback exposes the ResilienceContext directly.
                    args.Context.Properties.Set(
                        new ResiliencePropertyKey<TimeSpan>("retry-after"),
                        retryAfter);
                }
                return ValueTask.CompletedTask;
            }
        });

        // Circuit breaker: trips when at least 50% of calls fail within the
        // 30-second sampling window (minimum 10 calls), then stays open for 30 seconds
        pipeline.AddCircuitBreaker(new HttpCircuitBreakerStrategyOptions
        {
            SamplingDuration = TimeSpan.FromSeconds(30),
            MinimumThroughput = 10,
            FailureRatio = 0.5,
            BreakDuration = TimeSpan.FromSeconds(30),
            OnOpened = args =>
            {
                // Alert SRE team and activate fallback routing
                MetricsRegistry.CircuitBreakerOpened
                    .Add(1, new TagList { { "model", "primary-gpt4" } });
                return ValueTask.CompletedTask;
            }
        });
        // Timeout: LLM calls should complete within the SLA
        pipeline.AddTimeout(TimeSpan.FromSeconds(30));
    });

5.3 Multi-Model Failover Strategy

When the primary model’s circuit breaker is open, the system must degrade gracefully rather than fail completely. The architectural pattern is a fallback chain: primary GPT-4o → fallback GPT-3.5-turbo → cached response → static degraded response.

public class ResilientKernelService : IResilientKernelService
{
    private readonly Kernel _primaryKernel; // GPT-4o
    private readonly Kernel _fallbackKernel; // GPT-3.5-turbo
    private readonly ISemanticCache _cache;
    private readonly ICircuitBreakerRegistry _circuitBreakers;
    public async Task<string> InvokeWithFallbackAsync(
        string pluginName, 
        string functionName, 
        KernelArguments arguments,
        CancellationToken ct = default)
    {
        // Check circuit breaker state before attempting primary
        if (!_circuitBreakers.IsOpen("primary-gpt4"))
        {
            try
            {
                var result = await _primaryKernel.InvokeAsync(
                    pluginName, functionName, arguments, ct);
                return result.GetValue<string>()!;
            }
            catch (Exception ex) when (IsTransientFailure(ex))
            {
                _circuitBreakers.RecordFailure("primary-gpt4");
            }
        }
        // Fallback to cheaper/more available model
        try
        {
            var result = await _fallbackKernel.InvokeAsync(
                pluginName, functionName, arguments, ct);

            MetricsRegistry.FallbackInvocations.Add(1,
                new TagList { { "reason", "primary-unavailable" } });

            return result.GetValue<string>()!;
        }
        catch (Exception ex)
        {
            // Last resort: return cached response if available
            var cacheKey = BuildCacheKey(pluginName, functionName, arguments);
            var cached = await _cache.GetAsync(cacheKey);

            if (cached is not null)
            {
                MetricsRegistry.FallbackInvocations.Add(1,
                    new TagList { { "reason", "all-models-unavailable-cache-hit" } });
                return cached;
            }
            // Propagate only if no fallback is available
            throw new AiServiceUnavailableException(
                "All AI service tiers are unavailable and no cached response exists.", ex);
        }
    }
}

VI. Observability: The Foundation of Production AI Operations

You cannot operate what you cannot observe. This principle, axiomatic in traditional SRE, becomes even more critical in AI systems where failures are probabilistic and often invisible at the infrastructure layer. An LLM returning a hallucinated response looks exactly like a successful 200 OK from the network’s perspective.

6.1 The Three-Layer Observability Model

Enterprise Semantic Kernel observability operates at three distinct layers, each requiring different instrumentation:

Infrastructure Layer  — Traditional metrics: request rates, error rates, latency percentiles, HTTP status codes. Semantic Kernel emits OpenTelemetry traces and metrics natively from Microsoft.SemanticKernel.* activity sources. Wire these into your existing observability stack (Azure Monitor, Datadog, Prometheus/Grafana) with zero custom code.

Resource Consumption Layer  — Token-level metrics: input token count, output token count, model invoked, cost estimate. These require custom instrumentation via an IFunctionInvocationFilter that reads token usage from the FunctionInvocationContext.Result.Metadata.

Semantic Quality Layer  — The layer most organizations skip and later regret. Metrics around output quality: confidence scores, grounding percentages, retrieval relevance scores, user satisfaction signals. These require purpose-built evaluation infrastructure, typically implemented as a background evaluation pipeline that samples production outputs and scores them against quality criteria.

public class ObservabilityFilter : IFunctionInvocationFilter
{
    // System.Diagnostics.Metrics: instruments are created from a shared Meter
    // (the meter name here is illustrative).
    private static readonly Meter Meter = new("SemanticKernel.Observability");

    private static readonly Histogram<double> InvocationDuration =
        Meter.CreateHistogram<double>(
            "sk_function_duration_seconds",
            unit: "s",
            description: "Duration of Semantic Kernel function invocations");
    private static readonly Counter<long> TokensConsumed =
        Meter.CreateCounter<long>(
            "sk_tokens_consumed_total",
            description: "Total tokens consumed by Semantic Kernel functions");
    private static readonly Counter<double> EstimatedCostUsd =
        Meter.CreateCounter<double>(
            "sk_estimated_cost_usd_total",
            description: "Estimated cost in USD of Semantic Kernel function invocations");
    public async Task OnFunctionInvocationAsync(
        FunctionInvocationContext context,
        Func<FunctionInvocationContext, Task> next)
    {
        var stopwatch = Stopwatch.StartNew();
        Exception? exception = null;
        using var activity = ActivitySource.StartActivity(
            $"sk.function.{context.Function.PluginName}.{context.Function.Name}",
            ActivityKind.Internal);
        activity?.SetTag("sk.plugin.name", context.Function.PluginName);
        activity?.SetTag("sk.function.name", context.Function.Name);
        try
        {
            await next(context);
            // Capture token usage from response metadata. The "Usage" key and
            // CompletionsUsage type come from the Azure OpenAI connector.
            var usage = context.Result.Metadata?
                .GetValueOrDefault("Usage") as CompletionsUsage;
            if (usage is not null)
            {
                var modelId = context.Result.Metadata?
                    .GetValueOrDefault("ModelId")?.ToString() ?? "unknown";

                TokensConsumed.Add(usage.PromptTokens, new TagList
                {
                    { "plugin", context.Function.PluginName },
                    { "function", context.Function.Name },
                    { "model", modelId },
                    { "token_type", "input" }
                });
                TokensConsumed.Add(usage.CompletionTokens, new TagList
                {
                    { "plugin", context.Function.PluginName },
                    { "function", context.Function.Name },
                    { "model", modelId },
                    { "token_type", "output" }
                });

                var cost = CalculateCost(
                    usage.PromptTokens,
                    usage.CompletionTokens,
                    modelId);
                EstimatedCostUsd.Add(cost, new TagList
                {
                    { "plugin", context.Function.PluginName },
                    { "function", context.Function.Name }
                });
                activity?.SetTag("sk.tokens.prompt", usage.PromptTokens);
                activity?.SetTag("sk.tokens.completion", usage.CompletionTokens);
                activity?.SetTag("sk.cost.usd", cost);
            }
        }
        catch (Exception ex)
        {
            exception = ex;
            activity?.SetStatus(ActivityStatusCode.Error, ex.Message);
            throw;
        }
        finally
        {
            stopwatch.Stop();
            var resultTags = new TagList
            {
                { "plugin", context.Function.PluginName },
                { "function", context.Function.Name },
                { "success", exception is null }
            };
            InvocationDuration.Record(stopwatch.Elapsed.TotalSeconds, resultTags);
        }
    }
}

6.2 The SRE Dashboard for AI Systems

Beyond the instrumentation code, the operational posture requires purpose-built dashboards that surface AI-specific signals:

Cost Burn Rate Dashboard  — Real-time token consumption and estimated cost, broken down by tenant, plugin, and model. Alert when daily burn rate projects to exceed monthly budget. This is the FinOps control plane.

Quality Degradation Dashboard  — Trending quality scores from the semantic evaluation pipeline. A sudden drop in average confidence or grounding scores is often the first signal of a prompt regression before users start complaining.

Latency Percentile Dashboard  — P50, P95, P99 invocation latency by plugin and model. LLM latency is highly variable; mean latency is a misleading metric. The P99 determines actual user experience under load.

Fallback Activation Rate  — Track the ratio of primary model invocations to fallback invocations. A rising fallback rate signals primary model degradation before the circuit breaker fully opens.

VII. Bringing It Together: The Production Architecture Blueprint

A production Semantic Kernel deployment for an enterprise AI agent system has the following high-level architecture:

Ingress & Auth Layer  — API Gateway (Azure API Management) handling authentication, per-tenant rate limiting, and request routing. Token budget checks happen at this layer before requests reach the Semantic Kernel service.

Semantic Kernel Service  — Horizontally scaled .NET 9.0 service hosting the kernel, plugins, and filter pipeline. Stateless — all state lives in the memory layer or distributed cache.

Semantic Cache  — Redis Cluster with vector similarity index (RedisStack) for semantic deduplication. Shared across all kernel service instances.

Vector Memory Store  — Azure AI Search or Qdrant for persistent semantic memory. Supports the RAG pipeline for knowledge-grounded responses.

LLM Providers  — Primary Azure OpenAI deployment (GPT-4o) and fallback deployment (GPT-4o-mini), each behind Polly resilience pipelines with independent circuit breakers.

Observability Stack  — OpenTelemetry Collector receiving traces and metrics from all services, forwarding to Azure Monitor + Application Insights. Grafana dashboards surfacing the AI-specific signals described in Section VI.

Evaluation Pipeline  — Asynchronous worker service that samples production invocations, runs them through quality evaluation (LLM-as-judge pattern), and writes scores to the metrics store for dashboard visibility.
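One detail of the evaluation pipeline worth pinning down is the sampling decision itself: it should be deterministic, so that replays and horizontally scaled workers agree on which invocations belong to the sample. A minimal sketch (FNV-1a is an arbitrary non-cryptographic hash choice here):

```csharp
using System;

public static class EvaluationSampler
{
    // Deterministic sampling: hash the invocation ID (FNV-1a) so the same
    // invocation is always in or out of the sample at a fixed rate,
    // regardless of which worker or replay makes the decision.
    public static bool ShouldSample(string invocationId, double samplingRate)
    {
        uint hash = 2166136261;
        foreach (char c in invocationId)
        {
            hash = (hash ^ c) * 16777619;
        }
        // Map the hash onto [0, 1) and compare against the rate.
        return (hash % 10_000) / 10_000.0 < samplingRate;
    }
}
```

Sampled invocations are then queued for the LLM-as-judge scorer; unsampled ones incur zero evaluation cost.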

VIII. Key Takeaways and What’s Next

Semantic Kernel earns its complexity. For teams moving LLM applications from proof-of-concept to production scale, the framework provides:

The plugin architecture that makes AI capabilities composable, testable, and maintainable across large teams. The planner that enables autonomous AI agents without requiring every orchestration path to be pre-coded. The memory system that makes RAG a first-class architectural pattern rather than a hand-rolled feature. The filter pipeline that gives you the cross-cutting concerns — caching, rate limiting, observability, output validation — as a structured composition point rather than scattered middleware. And the multi-model connector architecture that decouples business logic from any specific provider’s API surface.

None of these benefits come for free. Semantic Kernel introduces genuine complexity: asynchronous orchestration, distributed state management, probabilistic failure modes, and token cost economics all demand architectural attention that simpler integrations can defer. The payoff is a system architecture that can survive the transition from demo to production, from hundreds of users to millions, and from the first AI feature to the tenth.

In Part 2, we will move from foundations to implementation: building the async-first parallel orchestration patterns that collapse multi-step LLM workflows from sequential to concurrent execution, implementing the semantic cache with Redis vector similarity search, and walking through the full filter pipeline implementation with production-grade token metering.
