Are you sure your LLM app will survive the Friday deploy and the Monday traffic spike? Most don’t – not because of AI, but because of missing production basics: observability, budgets, and guardrails.
In this post I’ll show you how I ship Semantic Kernel (SK) apps to production on .NET, wire OpenTelemetry end‑to‑end, shave latency & cost, and stay compliant. Everything here is battle‑tested from my own projects – copy/paste friendly.
Why it matters
Shipping an LLM feature is easy; running it reliably and profitably is where teams get burned. SK gives you clean composition over prompts, tools, and models — but production demands more than good abstractions.
Here’s why this playbook matters to you and your users:
- Latency is a UX killer. A single network hop to a busy model can turn into seconds. If you don’t stream early and trace end‑to‑end, you won’t know whether the slowness is your API, the kernel, or the provider.
- Tokens are money. Unlike CPU time, token usage scales with prompts, context windows, and model choice. Without per‑request cost attribution, a handful of prompts can silently consume most of your budget.
- Vendors wobble. Rate limits, timeouts, and occasional outages are normal. You need retries with jitter, circuit breakers, and sane timeouts to avoid cascading failures.
- Compliance isn’t optional. Raw prompts and outputs may include PII. You must redact logs, version prompts, and keep an audit trail that’s reconstructable without storing sensitive data.
- Prompts evolve. Product teams will iterate weekly. You’ll want canary models, A/B prompt versions, and dashboards that show quality, latency, and cost movements after each change.
Deploy: containerize, configure, and keep it alive
Dockerfile (tiny & ready for K8s)
# build
FROM mcr.microsoft.com/dotnet/sdk:8.0 AS build
WORKDIR /src
COPY . .
# framework-dependent publish targeting the alpine (musl) runtime image below;
# trimming is skipped because it requires a self-contained publish
RUN dotnet publish src/SkService/SkService.csproj -c Release -o /app \
    -r linux-musl-x64 --self-contained false /p:PublishReadyToRun=true
# run
FROM mcr.microsoft.com/dotnet/aspnet:8.0-alpine
WORKDIR /app
COPY --from=build /app .
# the GC is container-aware by default, so no heap-limit knobs are needed here
ENV ASPNETCORE_URLS=http://+:8080
EXPOSE 8080
ENTRYPOINT ["dotnet","SkService.dll"]
Minimal hosting with health & config
var builder = WebApplication.CreateBuilder(args);
var cfg = builder.Configuration;

// Strongly-typed options for prices/limits
builder.Services.Configure<ModelBudgetOptions>(cfg.GetSection("Budget"));
builder.Services.AddHttpClient();

// OpenTelemetry: traces + metrics + logs
builder.Services.AddOpenTelemetry()
    .ConfigureResource(r => r.AddService(serviceName: "sk-service", serviceVersion: "1.0"))
    .WithTracing(t => t
        .AddAspNetCoreInstrumentation()
        .AddHttpClientInstrumentation()
        .AddSource(SkTelemetry.ActivitySourceName)
        .AddOtlpExporter())
    .WithMetrics(m => m
        .AddAspNetCoreInstrumentation()
        .AddHttpClientInstrumentation()
        .AddMeter(SkTelemetry.MeterName)
        .AddPrometheusExporter());
builder.Logging.AddOpenTelemetry(o => o.AddOtlpExporter());

// Semantic Kernel: register it in the app's container so the decorators below
// can resolve app services such as IOptions<ModelBudgetOptions> and IMemoryCache.
builder.Services.AddKernel();
builder.Services.AddOpenAIChatCompletion(
    modelId: cfg["OpenAI:Model"]!,
    apiKey: cfg["OpenAI:ApiKey"]!);

// Decorate chat completion with tracing/caching/resilience (Scrutor's Decorate).
// The last registration is the outermost wrapper: Resilient -> Caching -> Tracing -> provider.
builder.Services.Decorate<IChatCompletionService, TracingChatCompletion>();
builder.Services.Decorate<IChatCompletionService, CachingChatCompletion>();
builder.Services.Decorate<IChatCompletionService, ResilientChatCompletion>();

// Caching & rate windows
builder.Services.AddMemoryCache();

var app = builder.Build();
app.MapGet("/health/live", () => Results.Ok());
app.MapGet("/health/ready", () => Results.Ok());
app.MapPrometheusScrapingEndpoint();

// Streaming endpoint
app.MapPost("/chat", Endpoints.ChatStream);
app.Run();
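The Budget section bound above needs per-model prices. Here is a minimal sketch of supplying them in code instead of appsettings.json; the model names and numbers are placeholders, so check your provider's current price sheet:
// Place next to the Configure<ModelBudgetOptions>(...) call, before builder.Build().
builder.Services.Configure<ModelBudgetOptions>(o =>
{
    // Placeholder prices per 1K tokens; keep the real values in config/secrets.
    o.Prices["gpt-4o-mini"] = new ModelPrice { PromptPer1K = 0.00015, CompletionPer1K = 0.0006 };
    o.Prices["gpt-4o"] = new ModelPrice { PromptPer1K = 0.0025, CompletionPer1K = 0.01 };
});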
Tips
- Keep the base image Alpine for size, but verify your provider SDKs and any native dependencies work against musl libc.
- Expose a Prometheus scrape endpoint and /ready & /live.
- Add PodDisruptionBudget in K8s to avoid all replicas draining at once.
K8s probes & autoscaling (essentials)
apiVersion: apps/v1
kind: Deployment
metadata: { name: sk-service }
spec:
  replicas: 3
  selector: { matchLabels: { app: sk } }
  template:
    metadata: { labels: { app: sk } }
    spec:
      containers:
        - name: sk
          image: ghcr.io/you/sk-service:1.0
          ports: [{ containerPort: 8080 }]
          env:
            - name: OpenAI__ApiKey
              valueFrom: { secretKeyRef: { name: openai, key: apikey } }
          readinessProbe:
            httpGet: { path: /health/ready, port: 8080 }
          livenessProbe:
            httpGet: { path: /health/live, port: 8080 }
          resources:
            requests: { cpu: "200m", memory: "256Mi" }
            limits: { cpu: "1", memory: "1Gi" }
---
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata: { name: sk-service }
spec:
  scaleTargetRef: { apiVersion: apps/v1, kind: Deployment, name: sk-service }
  minReplicas: 3
  maxReplicas: 20
  metrics:
    - type: Resource
      resource: { name: cpu, target: { type: Utilization, averageUtilization: 60 } }
Observe: end‑to‑end OpenTelemetry for SK
SK calls are just HTTP under the hood. We’ll instrument traces (correlate across API → SK → provider), metrics (tokens, cost, latency), and logs (prompt snapshots, redacted).
Telemetry primitives
public static class SkTelemetry
{
    public const string ActivitySourceName = "SemanticKernel.Prod";
    public const string MeterName = "SemanticKernel.Metrics";

    public static readonly ActivitySource ActivitySource = new(ActivitySourceName);
    public static readonly Meter Meter = new(MeterName);

    public static readonly Counter<long> PromptTokens = Meter.CreateCounter<long>("sk_prompt_tokens");
    public static readonly Counter<long> CompletionTokens = Meter.CreateCounter<long>("sk_completion_tokens");
    public static readonly Histogram<double> LatencyMs = Meter.CreateHistogram<double>("sk_llm_latency_ms");
    public static readonly Counter<double> CostUsd = Meter.CreateCounter<double>("sk_cost_usd");
}
Tracing decorator for IChatCompletionService
public sealed class TracingChatCompletion(IChatCompletionService inner, IOptions<ModelBudgetOptions> budget)
    : IChatCompletionService
{
    public IReadOnlyDictionary<string, object?> Attributes => inner.Attributes;

    public async Task<IReadOnlyList<ChatMessageContent>> GetChatMessageContentsAsync(
        ChatHistory history,
        PromptExecutionSettings? settings = null,
        Kernel? kernel = null,
        CancellationToken ct = default)
    {
        using var act = SkTelemetry.ActivitySource.StartActivity("llm.chat", ActivityKind.Client);
        var model = settings?.ModelId ?? inner.GetModelId() ?? "unknown";
        act?.SetTag("provider", inner.GetType().Name);
        act?.SetTag("model", model);
        if (settings?.ExtensionData?.TryGetValue("temperature", out var temperature) == true)
            act?.SetTag("temperature", temperature);

        var sw = Stopwatch.StartNew();
        var results = await inner.GetChatMessageContentsAsync(history, settings, kernel, ct);
        sw.Stop();

        act?.SetTag("status", "ok");
        act?.SetTag("latency_ms", sw.Elapsed.TotalMilliseconds);
        SkTelemetry.LatencyMs.Record(sw.Elapsed.TotalMilliseconds);

        // Estimate tokens if the provider didn't return usage
        var prompt = string.Join("\n", history.Select(m => m.Content));
        var completion = string.Join("\n", results.Select(r => r.Content));
        var promptTokens = Tokenizer.EstimateTokens(prompt);
        var completionTokens = Tokenizer.EstimateTokens(completion);
        var cost = TokenCost.EstimateUsd(model, promptTokens, completionTokens, budget.Value);

        var modelTag = new KeyValuePair<string, object?>("model", model);
        SkTelemetry.PromptTokens.Add(promptTokens, modelTag);
        SkTelemetry.CompletionTokens.Add(completionTokens, modelTag);
        SkTelemetry.CostUsd.Add(cost, modelTag);

        act?.SetTag("prompt_tokens", promptTokens);
        act?.SetTag("completion_tokens", completionTokens);
        act?.SetTag("cost_usd", cost);
        return results;
    }

    // streaming passthrough with timing
    public async IAsyncEnumerable<StreamingChatMessageContent> GetStreamingChatMessageContentsAsync(
        ChatHistory history, PromptExecutionSettings? settings = null, Kernel? kernel = null,
        [EnumeratorCancellation] CancellationToken ct = default)
    {
        using var act = SkTelemetry.ActivitySource.StartActivity("llm.chat.stream", ActivityKind.Client);
        var sw = Stopwatch.StartNew();
        await foreach (var chunk in inner.GetStreamingChatMessageContentsAsync(history, settings, kernel, ct))
        {
            yield return chunk;
        }
        sw.Stop();
        SkTelemetry.LatencyMs.Record(sw.Elapsed.TotalMilliseconds);
        act?.SetTag("latency_ms", sw.Elapsed.TotalMilliseconds);
    }
}
Cost model and token math (simple & explicit)
public sealed class ModelBudgetOptions
{
    public Dictionary<string, ModelPrice> Prices { get; init; } = new();
}

public sealed class ModelPrice
{
    public double PromptPer1K { get; init; }
    public double CompletionPer1K { get; init; }
}

public static class TokenCost
{
    public static double EstimateUsd(string? model, int promptTokens, int completionTokens, ModelBudgetOptions cfg)
    {
        if (model is null || !cfg.Prices.TryGetValue(model, out var p)) return 0;
        return (promptTokens / 1000d) * p.PromptPer1K + (completionTokens / 1000d) * p.CompletionPer1K;
    }
}

public static class Tokenizer
{
    // quick heuristic (~4 characters per token); swap with tiktoken-sharp if needed
    public static int EstimateTokens(string text) => (int)Math.Ceiling(text.Length / 4.0);
}
Prometheus panels you actually need
- rate(sk_cost_usd[5m]) – burn rate
- histogram_quantile(0.95, sum(rate(sk_llm_latency_ms_bucket[5m])) by (le)) – p95 latency
- sum(rate(sk_prompt_tokens[1m])) by (model) – token ingress per model
- Errors by provider/model using trace status code
Optimize: latency & cost without hurting quality
Stream early, render progressively
Reduce time to first byte (TTFB) by streaming tokens to the client as the model generates them.
public sealed record ChatRequest(string SystemPrompt, string User, string? Model);

public static class Endpoints
{
    public static async Task ChatStream(HttpContext ctx, Kernel kernel)
    {
        var request = await ctx.Request.ReadFromJsonAsync<ChatRequest>(ctx.RequestAborted);
        var history = new ChatHistory(request!.SystemPrompt);
        history.AddUserMessage(request.User);

        ctx.Response.ContentType = "text/event-stream";
        await foreach (var delta in kernel.GetRequiredService<IChatCompletionService>()
            .GetStreamingChatMessageContentsAsync(history, new() { ModelId = request.Model }, kernel, ctx.RequestAborted))
        {
            await ctx.Response.WriteAsync($"data: {delta.Content}\n\n", ctx.RequestAborted);
            await ctx.Response.Body.FlushAsync(ctx.RequestAborted);
        }
    }
}
Cache by prompt fingerprint (with safety)
public sealed class CachingChatCompletion(IChatCompletionService inner, IMemoryCache cache) : IChatCompletionService
{
    public IReadOnlyDictionary<string, object?> Attributes => inner.Attributes;

    public async Task<IReadOnlyList<ChatMessageContent>> GetChatMessageContentsAsync(
        ChatHistory history, PromptExecutionSettings? settings = null, Kernel? kernel = null, CancellationToken ct = default)
    {
        var key = CacheKey(history, settings);
        if (cache.TryGetValue<IReadOnlyList<ChatMessageContent>>(key, out var hit)) return hit!;

        var res = await inner.GetChatMessageContentsAsync(history, settings, kernel, ct);
        cache.Set(key, res, TimeSpan.FromMinutes(5));
        return res;
    }

    public IAsyncEnumerable<StreamingChatMessageContent> GetStreamingChatMessageContentsAsync(
        ChatHistory history, PromptExecutionSettings? settings = null, Kernel? kernel = null, CancellationToken ct = default)
        // opt out of caching for streams in this simple example
        => inner.GetStreamingChatMessageContentsAsync(history, settings, kernel, ct);

    static string CacheKey(ChatHistory history, PromptExecutionSettings? settings)
    {
        var sb = new StringBuilder();
        foreach (var m in history) sb.Append(m.Role).Append('|').Append(m.Content);
        sb.Append("|model=").Append(settings?.ModelId);
        return Convert.ToHexString(SHA256.HashData(Encoding.UTF8.GetBytes(sb.ToString())));
    }
}
Cache rules of thumb
- Only cache deterministic prompts (low temperature, no time-based content); a simple guard is sketched after this list.
- Include model id + system prompt in the key.
- Short TTLs (1-10 min) prevent staleness while still absorbing bursts.
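A minimal sketch of such a guard; the helper name and the temperature threshold are assumptions, and CachingChatCompletion would consult it before writing to the cache:
public static class CachePolicy
{
    // Hypothetical helper: treat a request as cacheable only when it is effectively deterministic.
    public static bool ShouldCache(PromptExecutionSettings? settings)
    {
        if (settings?.ExtensionData is null) return false;
        if (!settings.ExtensionData.TryGetValue("temperature", out var raw)) return false;
        return raw switch { double d => d <= 0.2, int i => i == 0, _ => false };
    }
}
Wire it into GetChatMessageContentsAsync by skipping the cache.Set call when ShouldCache returns false.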
Retries that don’t amplify downtime (Polly)
public sealed class ResilientChatCompletion(IChatCompletionService inner) : IChatCompletionService
{
    // Retry (outer) wraps the circuit breaker (inner) so every attempt is counted by the breaker.
    static readonly IAsyncPolicy PolicyWrap =
        Polly.Policy.WrapAsync(
            Polly.Policy.Handle<Exception>(ex => ex is not OperationCanceledException)
                .WaitAndRetryAsync(3, attempt =>
                    TimeSpan.FromMilliseconds(100 * Math.Pow(2, attempt)) +
                    TimeSpan.FromMilliseconds(Random.Shared.Next(0, 50))),   // exponential backoff + jitter
            Polly.Policy.Handle<Exception>(ex => ex is not OperationCanceledException)
                .CircuitBreakerAsync(exceptionsAllowedBeforeBreaking: 4, durationOfBreak: TimeSpan.FromSeconds(30)));

    public IReadOnlyDictionary<string, object?> Attributes => inner.Attributes;

    public Task<IReadOnlyList<ChatMessageContent>> GetChatMessageContentsAsync(
        ChatHistory history, PromptExecutionSettings? settings = null, Kernel? kernel = null, CancellationToken ct = default) =>
        PolicyWrap.ExecuteAsync(token => inner.GetChatMessageContentsAsync(history, settings, kernel, token), ct);

    public IAsyncEnumerable<StreamingChatMessageContent> GetStreamingChatMessageContentsAsync(
        ChatHistory history, PromptExecutionSettings? settings = null, Kernel? kernel = null, CancellationToken ct = default)
        // Streams are passed through untouched; retrying a half-delivered stream is rarely what you want.
        => inner.GetStreamingChatMessageContentsAsync(history, settings, kernel, ct);
}
Trim tokens with guardrails, not guesses
- System prompts: move them to files with placeholders; version them; keep them short (a loading sketch follows the settings example below).
- Tools/functions: prefer structured tool calls over long natural‑language rules.
- Context windows: add a semantic memory (vector store) but cap hits to N and summarize top‑K.
- Stop sequences and max tokens: be explicit.
var settings = new PromptExecutionSettings
{
    ModelId = "gpt-4o-mini",
    ExtensionData = new Dictionary<string, object>
    {
        ["temperature"] = 0.2,
        ["max_tokens"] = 400,
        ["stop"] = new[] { "\n\nUser:" }
    }
};
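As referenced above, a sketch of loading a versioned system prompt from a file and filling placeholders. The Prompts/ folder, the file-name pattern, and the {{placeholder}} syntax are assumptions for illustration:
public static class PromptStore
{
    // Assumed layout: Prompts/{name}.v{version}.txt, versioned in Git next to the service.
    public static string Load(string name, string version, IReadOnlyDictionary<string, string> values)
    {
        var text = File.ReadAllText(Path.Combine(AppContext.BaseDirectory, "Prompts", $"{name}.v{version}.txt"));
        foreach (var (key, value) in values)
            text = text.Replace("{{" + key + "}}", value); // naive placeholder substitution
        return text;
    }
}

// usage: var system = PromptStore.Load("support-agent", "1.2.0", new Dictionary<string, string> { ["tone"] = "concise" });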
Choose the right model for the job
- Small model for classification/routing, bigger model for synthesis.
- Use a canary header (e.g., X-Model-Variant) to test a cheaper model on 5% of traffic and compare answer quality, cost, and latency; a routing sketch follows.
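A sketch of that routing. The header name comes from the bullet above, while the model ids, the 5% split, and the helper itself are assumptions:
public static class ModelRouter
{
    // Route explicit opt-ins (X-Model-Variant: canary) and ~5% of remaining traffic to the cheaper model.
    public static string Pick(HttpContext ctx, string stableModel = "gpt-4o", string canaryModel = "gpt-4o-mini")
    {
        if (ctx.Request.Headers.TryGetValue("X-Model-Variant", out var variant) && variant == "canary")
            return canaryModel;
        return Random.Shared.Next(100) < 5 ? canaryModel : stableModel;
    }
}

// usage: var settings = new PromptExecutionSettings { ModelId = ModelRouter.Pick(ctx) };
The tracing decorator already tags the chosen model on the activity, so dashboards can compare the two variants directly.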
Comply: privacy, audit, and safe outputs
Redact sensitive data before logging
public static class Redactor
{
    static readonly Regex Email = new(@"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}", RegexOptions.Compiled);
    static readonly Regex Phone = new(@"\b(?:\+\d{1,3}[ -]?)?\d{3}[ -]?\d{3}[ -]?\d{4}\b", RegexOptions.Compiled);

    public static string Sanitize(string text)
        => Phone.Replace(Email.Replace(text, "[email]"), "[phone]");
}
app.Logger.LogInformation("prompt={Prompt}", Redactor.Sanitize(request.User));
Prompt versioning & audit trail
- Store prompts in object storage or Git with a semantic version.
- On every request, log prompt_version, model, user_id (hashed), trace_id, and a hash of the final prompt, so the request can be reconstructed without storing raw PII (see the sketch below).
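A minimal sketch of those audit fields as a record plus a hashing helper; the type and method names are illustrative, not part of SK:
public sealed record AuditEntry(
    string PromptVersion,
    string Model,
    string UserIdHash,   // SHA-256 of the user id, never the raw value
    string TraceId,
    string PromptHash);  // SHA-256 of the final rendered prompt

public static class Audit
{
    static string Hash(string value) =>
        Convert.ToHexString(SHA256.HashData(Encoding.UTF8.GetBytes(value)));

    public static AuditEntry Create(string promptVersion, string model, string userId, string renderedPrompt) =>
        new(promptVersion, model, Hash(userId), Activity.Current?.TraceId.ToString() ?? "none", Hash(renderedPrompt));
}
Log the entry as structured fields so it lands next to the trace in your observability backend.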
Output moderation (defense in depth)
- Use provider filters (if available) and a lightweight post-classifier (keyword/regex or a small model) with a blocklist and allowlist; a sketch follows this list.
- For chat, add a content-policy system prompt with explicit rules (no PII echo, no images of faces, etc.).
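As mentioned in the first bullet, a minimal post-classifier sketch. The blocklist entries are placeholders, and this complements rather than replaces the provider's own moderation:
public static class OutputGuard
{
    // Placeholder terms; load the real block/allow lists from config so they can change without a deploy.
    static readonly string[] Blocklist = { "credit card number", "social security number" };
    static readonly Regex LooksLikeEmail = new(@"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}", RegexOptions.Compiled);

    public static bool IsAllowed(string completion) =>
        !Blocklist.Any(term => completion.Contains(term, StringComparison.OrdinalIgnoreCase)) &&
        !LooksLikeEmail.IsMatch(completion); // don't echo PII back to the user
}

// usage: if (!OutputGuard.IsAllowed(answer.Content ?? "")) { /* fall back to a safe template */ }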
End‑to‑end sample: minimal chat with everything wired
app.MapPost("/v1/ask", async (Ask req, Kernel kernel, ILogger<Program> log, IOptions<ModelBudgetOptions> budget, CancellationToken ct) =>
{
using var act = SkTelemetry.ActivitySource.StartActivity("http.ask");
var history = new ChatHistory(req.System ?? "You are a concise assistant.");
history.AddUserMessage(req.User);
var settings = new PromptExecutionSettings { ModelId = req.Model ?? "gpt-4o-mini", ExtensionData = new() { ["temperature"] = req.Temperature ?? 0.2 } };
var svc = kernel.GetRequiredService<IChatCompletionService>();
var sw = Stopwatch.StartNew();
var answer = await svc.GetChatMessageContentAsync(history, settings, kernel, ct);
sw.Stop();
log.LogInformation("latency_ms={Latency} cost_usd~{Cost}", sw.Elapsed.TotalMilliseconds,
TokenCost.EstimateUsd(settings.ModelId,
Tokenizer.EstimateTokens(string.Join("\n", history.Select(m => m.Content))),
Tokenizer.EstimateTokens(answer.Content ?? string.Empty),
budget.Value));
return Results.Ok(new { answer = answer.Content, traceId = Activity.Current?.TraceId.ToString() });
});
public sealed record Ask(string User, string? System, string? Model, double? Temperature);
Production checklist (10‑minute pass before launch)
- Health: liveness/readiness probe green under load tests
- Observability: traces link API → SK → provider; p95/p99 dashboards; error budgets
- Budgets: max tokens per request; daily spend alert; circuit breaker configured
- Resilience: retries with jitter; backoff ≤ 3 attempts; idempotent endpoints
- Cache: hit rate ≥ 30% for deterministic prompts; TTLs tuned
- Security: secrets from vault; redaction in logs; audit fields captured
- Quality: canary model live; A/B prompt versions; manual evaluations recorded
- Docs: runbook for incidents; model deprecation plan; data retention policy
FAQ: running Semantic Kernel in production
Do I need OpenTelemetry even for a small service?
Yes. Even for a solo service, tracing instantly pays for itself the first time latency spikes. Start with the minimal setup above and an OTLP backend.
How do I attribute cost per tenant?
Add tenant_id as a trace attribute and emit the sk_cost_usd counter with that tag; your metrics backend can then sum by tenant and alert on it. A minimal sketch follows.
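A sketch inside the tracing decorator, assuming your auth middleware puts tenant_id into Activity baggage (the baggage key is an assumption):
// In TracingChatCompletion, replace the CostUsd.Add call with a tenant-aware one:
var tenantId = Activity.Current?.GetBaggageItem("tenant_id") ?? "unknown";
act?.SetTag("tenant_id", tenantId);
SkTelemetry.CostUsd.Add(cost,
    new KeyValuePair<string, object?>("model", model),
    new KeyValuePair<string, object?>("tenant_id", tenantId));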
Where should prompts live?
Keep them in Git (versioning + code reviews). Load at startup or via a small cache with change notifications.
Do I need a vector database from day one?
No. Start with small, in-process embeddings and a file/SQLite store. Add a managed vector DB once your context window routinely overflows.
How do I catch bad answers before users do?
For critical flows (e.g., emails, pricing), run a rule-based validator and a small “judge” model. If validation fails, fall back to a safer template.
How do I avoid cold-start latency?
Warm up the model endpoint during deployment and pre-load embeddings if applicable. Keep at least one replica always on. A warm-up sketch follows.
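One way to do the warm-up, sketched as a hosted service that sends a tiny prompt at startup; the prompt text and model id are placeholders:
public sealed class WarmupService(Kernel kernel, ILogger<WarmupService> log) : IHostedService
{
    public async Task StartAsync(CancellationToken ct)
    {
        try
        {
            // One cheap call so the first real user doesn't pay the cold-start tax.
            var chat = kernel.GetRequiredService<IChatCompletionService>();
            var history = new ChatHistory("You are a warm-up probe.");
            history.AddUserMessage("ping");
            await chat.GetChatMessageContentAsync(history, new() { ModelId = "gpt-4o-mini" }, kernel, ct);
        }
        catch (Exception ex)
        {
            log.LogWarning(ex, "warm-up call failed; continuing startup anyway");
        }
    }

    public Task StopAsync(CancellationToken ct) => Task.CompletedTask;
}

// registration in Program.cs: builder.Services.AddHostedService<WarmupService>();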
Conclusion: ship fast, watch everything, pay less
Semantic Kernel isn’t a magic wand – it’s a solid composition layer for LLM apps. In production, wins come from boring engineering: traces with IDs you can follow, budgets on tokens you can afford, and policies that fail softly. Implement the decorators above, wire OpenTelemetry, and you’ll get faster responses, lower bills, and fewer surprises.
Which part will you add first – streaming, caching, or full OTel? Tell me in the comments and I’ll share a deeper dive for the most requested one.