Are you sure your LLM app will survive the Friday deploy and the Monday traffic spike? Most don’t – not because of AI, but because of missing production basics: observability, budgets, and guardrails.
In this post I’ll show you how I ship Semantic Kernel (SK) apps to production on .NET, wire OpenTelemetry end‑to‑end, shave latency & cost, and stay compliant. Everything here is battle‑tested from my own projects – copy/paste friendly.
Why it matters
Shipping an LLM feature is easy; running it reliably and profitably is where teams get burned. SK gives you clean composition over prompts, tools, and models — but production demands more than good abstractions.
Here’s why this playbook matters to you and your users:
- Latency is a UX killer. A single network hop to a busy model can turn into seconds. If you don’t stream early and trace end‑to‑end, you won’t know whether the slowness is your API, the kernel, or the provider.
- Tokens are money. Unlike CPU time, token usage scales with prompts, context windows, and model choice. Without per‑request cost attribution, a handful of prompts can silently consume most of your budget.
- Vendors wobble. Rate limits, timeouts, and occasional outages are normal. You need retries with jitter, circuit breakers, and sane timeouts to avoid cascading failures.
- Compliance isn’t optional. Raw prompts and outputs may include PII. You must redact logs, version prompts, and keep an audit trail that’s reconstructable without storing sensitive data.
- Prompts evolve. Product teams will iterate weekly. You’ll want canary models, A/B prompt versions, and dashboards that show quality, latency, and cost movements after each change.
Deploy: containerize, configure, and keep it alive
Dockerfile (tiny & ready for K8s)
# build
FROM mcr.microsoft.com/dotnet/sdk:8.0 AS build
WORKDIR /src
COPY . .
# framework-dependent publish targeting the alpine (musl) runtime image below;
# trimming is skipped because it requires a self-contained publish
RUN dotnet publish src/SkService/SkService.csproj -c Release -o /app \
    -r linux-musl-x64 --self-contained false /p:PublishReadyToRun=true
# run
FROM mcr.microsoft.com/dotnet/aspnet:8.0-alpine
WORKDIR /app
COPY --from=build /app .
# the GC is container-aware by default, so no heap-limit knobs are needed here
ENV ASPNETCORE_URLS=http://+:8080
EXPOSE 8080
ENTRYPOINT ["dotnet","SkService.dll"]
Minimal hosting with health & config
var builder = WebApplication.CreateBuilder(args);
var cfg = builder.Configuration;

// Strongly-typed options for prices/limits
builder.Services.Configure<ModelBudgetOptions>(cfg.GetSection("Budget"));
builder.Services.AddHttpClient();

// OpenTelemetry: traces + metrics + logs
builder.Services.AddOpenTelemetry()
    .ConfigureResource(r => r.AddService(serviceName: "sk-service", serviceVersion: "1.0"))
    .WithTracing(t => t
        .AddAspNetCoreInstrumentation()
        .AddHttpClientInstrumentation()
        .AddSource(SkTelemetry.ActivitySourceName)
        .AddOtlpExporter())
    .WithMetrics(m => m
        .AddAspNetCoreInstrumentation()
        .AddHttpClientInstrumentation()
        .AddMeter(SkTelemetry.MeterName)
        .AddPrometheusExporter());
builder.Logging.AddOpenTelemetry(o => o.AddOtlpExporter());

// Semantic Kernel: register it in the app's container so the decorators below
// can resolve app services such as IOptions<ModelBudgetOptions> and IMemoryCache.
builder.Services.AddKernel();
builder.Services.AddOpenAIChatCompletion(
    modelId: cfg["OpenAI:Model"]!,
    apiKey: cfg["OpenAI:ApiKey"]!);

// Decorate chat completion with tracing/caching/resilience (Scrutor's Decorate).
// The last registration is the outermost wrapper: Resilient -> Caching -> Tracing -> provider.
builder.Services.Decorate<IChatCompletionService, TracingChatCompletion>();
builder.Services.Decorate<IChatCompletionService, CachingChatCompletion>();
builder.Services.Decorate<IChatCompletionService, ResilientChatCompletion>();

// Caching & rate windows
builder.Services.AddMemoryCache();

var app = builder.Build();
app.MapGet("/health/live", () => Results.Ok());
app.MapGet("/health/ready", () => Results.Ok());
app.MapPrometheusScrapingEndpoint();

// Streaming endpoint
app.MapPost("/chat", Endpoints.ChatStream);
app.Run();
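The Budget section bound above needs per-model prices. Here is a minimal sketch of supplying them in code instead of appsettings.json; the model names and numbers are placeholders, so check your provider's current price sheet:
// Place next to the Configure<ModelBudgetOptions>(...) call, before builder.Build().
builder.Services.Configure<ModelBudgetOptions>(o =>
{
    // Placeholder prices per 1K tokens; keep the real values in config/secrets.
    o.Prices["gpt-4o-mini"] = new ModelPrice { PromptPer1K = 0.00015, CompletionPer1K = 0.0006 };
    o.Prices["gpt-4o"] = new ModelPrice { PromptPer1K = 0.0025, CompletionPer1K = 0.01 };
});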
Tips
- Keep the base image Alpine for size, but verify your provider SDKs and any native dependencies work against musl libc.
- Expose a Prometheus scrape endpoint and /ready & /live.
- Add PodDisruptionBudget in K8s to avoid all replicas draining at once.
K8s probes & autoscaling (essentials)
apiVersion: apps/v1
kind: Deployment
metadata: { name: sk-service }
spec:
  replicas: 3
  selector: { matchLabels: { app: sk } }
  template:
    metadata: { labels: { app: sk } }
    spec:
      containers:
        - name: sk
          image: ghcr.io/you/sk-service:1.0
          ports: [{ containerPort: 8080 }]
          env:
            - name: OpenAI__ApiKey
              valueFrom: { secretKeyRef: { name: openai, key: apikey } }
          readinessProbe:
            httpGet: { path: /health/ready, port: 8080 }
          livenessProbe:
            httpGet: { path: /health/live, port: 8080 }
          resources:
            requests: { cpu: "200m", memory: "256Mi" }
            limits: { cpu: "1", memory: "1Gi" }
---
apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata: { name: sk-service }
spec:
  scaleTargetRef: { apiVersion: apps/v1, kind: Deployment, name: sk-service }
  minReplicas: 3
  maxReplicas: 20
  metrics:
    - type: Resource
      resource: { name: cpu, target: { type: Utilization, averageUtilization: 60 } }
Observe: end‑to‑end OpenTelemetry for SK
SK calls are just HTTP under the hood. We’ll instrument traces (correlate across API → SK → provider), metrics (tokens, cost, latency), and logs (prompt snapshots, redacted).
Telemetry primitives
public static class SkTelemetry
{
    public const string ActivitySourceName = "SemanticKernel.Prod";
    public const string MeterName = "SemanticKernel.Metrics";

    public static readonly ActivitySource ActivitySource = new(ActivitySourceName);
    public static readonly Meter Meter = new(MeterName);

    public static readonly Counter<long> PromptTokens = Meter.CreateCounter<long>("sk_prompt_tokens");
    public static readonly Counter<long> CompletionTokens = Meter.CreateCounter<long>("sk_completion_tokens");
    public static readonly Histogram<double> LatencyMs = Meter.CreateHistogram<double>("sk_llm_latency_ms");
    public static readonly Counter<double> CostUsd = Meter.CreateCounter<double>("sk_cost_usd");
}
Tracing decorator for IChatCompletionService
public sealed class TracingChatCompletion(IChatCompletionService inner, IOptions<ModelBudgetOptions> budget)
    : IChatCompletionService
{
    public IReadOnlyDictionary<string, object?> Attributes => inner.Attributes;

    public async Task<IReadOnlyList<ChatMessageContent>> GetChatMessageContentsAsync(
        ChatHistory history,
        PromptExecutionSettings? settings = null,
        Kernel? kernel = null,
        CancellationToken ct = default)
    {
        using var act = SkTelemetry.ActivitySource.StartActivity("llm.chat", ActivityKind.Client);
        var model = settings?.ModelId ?? inner.GetModelId() ?? "unknown";
        act?.SetTag("provider", inner.GetType().Name);
        act?.SetTag("model", model);
        if (settings?.ExtensionData?.TryGetValue("temperature", out var temperature) == true)
            act?.SetTag("temperature", temperature);

        var sw = Stopwatch.StartNew();
        var results = await inner.GetChatMessageContentsAsync(history, settings, kernel, ct);
        sw.Stop();

        act?.SetTag("status", "ok");
        act?.SetTag("latency_ms", sw.Elapsed.TotalMilliseconds);
        SkTelemetry.LatencyMs.Record(sw.Elapsed.TotalMilliseconds);

        // Estimate tokens if the provider didn't return usage
        var prompt = string.Join("\n", history.Select(m => m.Content));
        var completion = string.Join("\n", results.Select(r => r.Content));
        var promptTokens = Tokenizer.EstimateTokens(prompt);
        var completionTokens = Tokenizer.EstimateTokens(completion);
        var cost = TokenCost.EstimateUsd(model, promptTokens, completionTokens, budget.Value);

        var modelTag = new KeyValuePair<string, object?>("model", model);
        SkTelemetry.PromptTokens.Add(promptTokens, modelTag);
        SkTelemetry.CompletionTokens.Add(completionTokens, modelTag);
        SkTelemetry.CostUsd.Add(cost, modelTag);

        act?.SetTag("prompt_tokens", promptTokens);
        act?.SetTag("completion_tokens", completionTokens);
        act?.SetTag("cost_usd", cost);
        return results;
    }

    // streaming passthrough with timing
    public async IAsyncEnumerable<StreamingChatMessageContent> GetStreamingChatMessageContentsAsync(
        ChatHistory history, PromptExecutionSettings? settings = null, Kernel? kernel = null,
        [EnumeratorCancellation] CancellationToken ct = default)
    {
        using var act = SkTelemetry.ActivitySource.StartActivity("llm.chat.stream", ActivityKind.Client);
        var sw = Stopwatch.StartNew();
        await foreach (var chunk in inner.GetStreamingChatMessageContentsAsync(history, settings, kernel, ct))
        {
            yield return chunk;
        }
        sw.Stop();
        SkTelemetry.LatencyMs.Record(sw.Elapsed.TotalMilliseconds);
        act?.SetTag("latency_ms", sw.Elapsed.TotalMilliseconds);
    }
}
Cost model and token math (simple & explicit)
public sealed class ModelBudgetOptions
{
    public Dictionary<string, ModelPrice> Prices { get; init; } = new();
}

public sealed class ModelPrice
{
    public double PromptPer1K { get; init; }
    public double CompletionPer1K { get; init; }
}

public static class TokenCost
{
    public static double EstimateUsd(string? model, int promptTokens, int completionTokens, ModelBudgetOptions cfg)
    {
        if (model is null || !cfg.Prices.TryGetValue(model, out var p)) return 0;
        return (promptTokens / 1000d) * p.PromptPer1K + (completionTokens / 1000d) * p.CompletionPer1K;
    }
}

public static class Tokenizer
{
    // quick heuristic (~4 characters per token); swap with tiktoken-sharp if needed
    public static int EstimateTokens(string text) => (int)Math.Ceiling(text.Length / 4.0);
}
Prometheus panels you actually need
- rate(sk_cost_usd[5m]) – burn rate
- histogram_quantile(0.95, sum(rate(sk_llm_latency_ms_bucket[5m])) by (le)) – p95 latency
- sum(rate(sk_prompt_tokens[1m])) by (model) – token ingress per model
- Errors by provider/model using trace status code
Optimize: latency & cost without hurting quality
Stream early, render progressively
Reduce time to first byte (TTFB) by streaming tokens to the client as the model generates them.
public sealed record ChatRequest(string SystemPrompt, string User, string? Model);

public static class Endpoints
{
    public static async Task ChatStream(HttpContext ctx, Kernel kernel)
    {
        var request = await ctx.Request.ReadFromJsonAsync<ChatRequest>(ctx.RequestAborted);
        var history = new ChatHistory(request!.SystemPrompt);
        history.AddUserMessage(request.User);

        ctx.Response.ContentType = "text/event-stream";
        await foreach (var delta in kernel.GetRequiredService<IChatCompletionService>()
            .GetStreamingChatMessageContentsAsync(history, new() { ModelId = request.Model }, kernel, ctx.RequestAborted))
        {
            await ctx.Response.WriteAsync($"data: {delta.Content}\n\n", ctx.RequestAborted);
            await ctx.Response.Body.FlushAsync(ctx.RequestAborted);
        }
    }
}
Cache by prompt fingerprint (with safety)
public sealed class CachingChatCompletion(IChatCompletionService inner, IMemoryCache cache) : IChatCompletionService
{
    public IReadOnlyDictionary<string, object?> Attributes => inner.Attributes;

    public async Task<IReadOnlyList<ChatMessageContent>> GetChatMessageContentsAsync(
        ChatHistory history, PromptExecutionSettings? settings = null, Kernel? kernel = null, CancellationToken ct = default)
    {
        var key = CacheKey(history, settings);
        if (cache.TryGetValue<IReadOnlyList<ChatMessageContent>>(key, out var hit)) return hit!;

        var res = await inner.GetChatMessageContentsAsync(history, settings, kernel, ct);
        cache.Set(key, res, TimeSpan.FromMinutes(5));
        return res;
    }

    public IAsyncEnumerable<StreamingChatMessageContent> GetStreamingChatMessageContentsAsync(
        ChatHistory history, PromptExecutionSettings? settings = null, Kernel? kernel = null, CancellationToken ct = default)
        // opt out of caching for streams in this simple example
        => inner.GetStreamingChatMessageContentsAsync(history, settings, kernel, ct);

    static string CacheKey(ChatHistory history, PromptExecutionSettings? settings)
    {
        var sb = new StringBuilder();
        foreach (var m in history) sb.Append(m.Role).Append('|').Append(m.Content);
        sb.Append("|model=").Append(settings?.ModelId);
        return Convert.ToHexString(SHA256.HashData(Encoding.UTF8.GetBytes(sb.ToString())));
    }
}
Cache rules of thumb
- Only cache deterministic prompts (low temperature, no time-based content); a simple guard is sketched after this list.
- Include model id + system prompt in the key.
- Short TTLs (1-10 min) prevent staleness while still absorbing bursts.
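A minimal sketch of such a guard; the helper name and the temperature threshold are assumptions, and CachingChatCompletion would consult it before writing to the cache:
public static class CachePolicy
{
    // Hypothetical helper: treat a request as cacheable only when it is effectively deterministic.
    public static bool ShouldCache(PromptExecutionSettings? settings)
    {
        if (settings?.ExtensionData is null) return false;
        if (!settings.ExtensionData.TryGetValue("temperature", out var raw)) return false;
        return raw switch { double d => d <= 0.2, int i => i == 0, _ => false };
    }
}
Wire it into GetChatMessageContentsAsync by skipping the cache.Set call when ShouldCache returns false.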
Retries that don’t amplify downtime (Polly)
public sealed class ResilientChatCompletion(IChatCompletionService inner) : IChatCompletionService
{
    // Retry (outer) wraps the circuit breaker (inner) so every attempt is counted by the breaker.
    static readonly IAsyncPolicy PolicyWrap =
        Polly.Policy.WrapAsync(
            Polly.Policy.Handle<Exception>(ex => ex is not OperationCanceledException)
                .WaitAndRetryAsync(3, attempt =>
                    TimeSpan.FromMilliseconds(100 * Math.Pow(2, attempt)) +
                    TimeSpan.FromMilliseconds(Random.Shared.Next(0, 50))),   // exponential backoff + jitter
            Polly.Policy.Handle<Exception>(ex => ex is not OperationCanceledException)
                .CircuitBreakerAsync(exceptionsAllowedBeforeBreaking: 4, durationOfBreak: TimeSpan.FromSeconds(30)));

    public IReadOnlyDictionary<string, object?> Attributes => inner.Attributes;

    public Task<IReadOnlyList<ChatMessageContent>> GetChatMessageContentsAsync(
        ChatHistory history, PromptExecutionSettings? settings = null, Kernel? kernel = null, CancellationToken ct = default) =>
        PolicyWrap.ExecuteAsync(token => inner.GetChatMessageContentsAsync(history, settings, kernel, token), ct);

    public IAsyncEnumerable<StreamingChatMessageContent> GetStreamingChatMessageContentsAsync(
        ChatHistory history, PromptExecutionSettings? settings = null, Kernel? kernel = null, CancellationToken ct = default)
        // Streams are passed through untouched; retrying a half-delivered stream is rarely what you want.
        => inner.GetStreamingChatMessageContentsAsync(history, settings, kernel, ct);
}
Trim tokens with guardrails, not guesses
- System prompts: move them to files with placeholders; version them; keep them short (a loading sketch follows the settings example below).
- Tools/functions: prefer structured tool calls over long natural‑language rules.
- Context windows: add a semantic memory (vector store) but cap hits to N and summarize top‑K.
- Stop sequences and max tokens: be explicit.
var settings = new PromptExecutionSettings
{
    ModelId = "gpt-4o-mini",
    ExtensionData = new Dictionary<string, object>
    {
        ["temperature"] = 0.2,
        ["max_tokens"] = 400,
        ["stop"] = new[] { "\n\nUser:" }
    }
};
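As referenced above, a sketch of loading a versioned system prompt from a file and filling placeholders. The Prompts/ folder, the file-name pattern, and the {{placeholder}} syntax are assumptions for illustration:
public static class PromptStore
{
    // Assumed layout: Prompts/{name}.v{version}.txt, versioned in Git next to the service.
    public static string Load(string name, string version, IReadOnlyDictionary<string, string> values)
    {
        var text = File.ReadAllText(Path.Combine(AppContext.BaseDirectory, "Prompts", $"{name}.v{version}.txt"));
        foreach (var (key, value) in values)
            text = text.Replace("{{" + key + "}}", value); // naive placeholder substitution
        return text;
    }
}

// usage: var system = PromptStore.Load("support-agent", "1.2.0", new Dictionary<string, string> { ["tone"] = "concise" });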
Choose the right model for the job
- Small model for classification/routing, bigger model for synthesis.
- Use a canary header (e.g., X-Model-Variant) to test a cheaper model on 5% of traffic and compare answer quality, cost, and latency; a routing sketch follows.
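A sketch of that routing. The header name comes from the bullet above, while the model ids, the 5% split, and the helper itself are assumptions:
public static class ModelRouter
{
    // Route explicit opt-ins (X-Model-Variant: canary) and ~5% of remaining traffic to the cheaper model.
    public static string Pick(HttpContext ctx, string stableModel = "gpt-4o", string canaryModel = "gpt-4o-mini")
    {
        if (ctx.Request.Headers.TryGetValue("X-Model-Variant", out var variant) && variant == "canary")
            return canaryModel;
        return Random.Shared.Next(100) < 5 ? canaryModel : stableModel;
    }
}

// usage: var settings = new PromptExecutionSettings { ModelId = ModelRouter.Pick(ctx) };
The tracing decorator already tags the chosen model on the activity, so dashboards can compare the two variants directly.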
Comply: privacy, audit, and safe outputs
Redact sensitive data before logging
public static class Redactor
{
    static readonly Regex Email = new(@"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}", RegexOptions.Compiled);
    static readonly Regex Phone = new(@"\b(?:\+\d{1,3}[ -]?)?\d{3}[ -]?\d{3}[ -]?\d{4}\b", RegexOptions.Compiled);

    public static string Sanitize(string text)
        => Phone.Replace(Email.Replace(text, "[email]"), "[phone]");
}
app.Logger.LogInformation("prompt={Prompt}", Redactor.Sanitize(request.User));
Prompt versioning & audit trail
- Store prompts in object storage or Git with a semantic version.
- On every request, log prompt_version, model, user_id (hashed), trace_id, and a hash of the final prompt, so the request can be reconstructed without storing raw PII (see the sketch below).
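A minimal sketch of those audit fields as a record plus a hashing helper; the type and method names are illustrative, not part of SK:
public sealed record AuditEntry(
    string PromptVersion,
    string Model,
    string UserIdHash,   // SHA-256 of the user id, never the raw value
    string TraceId,
    string PromptHash);  // SHA-256 of the final rendered prompt

public static class Audit
{
    static string Hash(string value) =>
        Convert.ToHexString(SHA256.HashData(Encoding.UTF8.GetBytes(value)));

    public static AuditEntry Create(string promptVersion, string model, string userId, string renderedPrompt) =>
        new(promptVersion, model, Hash(userId), Activity.Current?.TraceId.ToString() ?? "none", Hash(renderedPrompt));
}
Log the entry as structured fields so it lands next to the trace in your observability backend.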
Output moderation (defense in depth)
- Use provider filters (if available) and a lightweight post-classifier (keyword/regex or a small model) with a blocklist and allowlist; a sketch follows this list.
- For chat, add a content-policy system prompt with explicit rules (no PII echo, no images of faces, etc.).
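As mentioned in the first bullet, a minimal post-classifier sketch. The blocklist entries are placeholders, and this complements rather than replaces the provider's own moderation:
public static class OutputGuard
{
    // Placeholder terms; load the real block/allow lists from config so they can change without a deploy.
    static readonly string[] Blocklist = { "credit card number", "social security number" };
    static readonly Regex LooksLikeEmail = new(@"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}", RegexOptions.Compiled);

    public static bool IsAllowed(string completion) =>
        !Blocklist.Any(term => completion.Contains(term, StringComparison.OrdinalIgnoreCase)) &&
        !LooksLikeEmail.IsMatch(completion); // don't echo PII back to the user
}

// usage: if (!OutputGuard.IsAllowed(answer.Content ?? "")) { /* fall back to a safe template */ }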
End‑to‑end sample: minimal chat with everything wired
app.MapPost("/v1/ask", async (Ask req, Kernel kernel, ILogger<Program> log, IOptions<ModelBudgetOptions> budget, CancellationToken ct) =>
{
using var act = SkTelemetry.ActivitySource.StartActivity("http.ask");
var history = new ChatHistory(req.System ?? "You are a concise assistant.");
history.AddUserMessage(req.User);
var settings = new PromptExecutionSettings { ModelId = req.Model ?? "gpt-4o-mini", ExtensionData = new() { ["temperature"] = req.Temperature ?? 0.2 } };
var svc = kernel.GetRequiredService<IChatCompletionService>();
var sw = Stopwatch.StartNew();
var answer = await svc.GetChatMessageContentAsync(history, settings, kernel, ct);
sw.Stop();
log.LogInformation("latency_ms={Latency} cost_usd~{Cost}", sw.Elapsed.TotalMilliseconds,
TokenCost.EstimateUsd(settings.ModelId,
Tokenizer.EstimateTokens(string.Join("\n", history.Select(m => m.Content))),
Tokenizer.EstimateTokens(answer.Content ?? string.Empty),
budget.Value));
return Results.Ok(new { answer = answer.Content, traceId = Activity.Current?.TraceId.ToString() });
});
public sealed record Ask(string User, string? System, string? Model, double? Temperature);
Production checklist (10‑minute pass before launch)
- Health: liveness/readiness probe green under load tests
- Observability: traces link API → SK → provider; p95/p99 dashboards; error budgets
- Budgets: max tokens per request; daily spend alert; circuit breaker configured
- Resilience: retries with jitter; backoff ≤ 3 attempts; idempotent endpoints
- Cache: hit rate ≥ 30% for deterministic prompts; TTLs tuned
- Security: secrets from vault; redaction in logs; audit fields captured
- Quality: canary model live; A/B prompt versions; manual evaluations recorded
- Docs: runbook for incidents; model deprecation plan; data retention policy
FAQ: running Semantic Kernel in production
Do I need OpenTelemetry even for a small service?
Yes. Even for a solo service, tracing instantly pays for itself the first time latency spikes. Start with the minimal setup above and an OTLP backend.
How do I attribute cost per tenant?
Add tenant_id as a trace attribute and emit the sk_cost_usd counter with that tag; your metrics backend can then sum by tenant and alert on it. A minimal sketch follows.
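A sketch inside the tracing decorator, assuming your auth middleware puts tenant_id into Activity baggage (the baggage key is an assumption):
// In TracingChatCompletion, replace the CostUsd.Add call with a tenant-aware one:
var tenantId = Activity.Current?.GetBaggageItem("tenant_id") ?? "unknown";
act?.SetTag("tenant_id", tenantId);
SkTelemetry.CostUsd.Add(cost,
    new KeyValuePair<string, object?>("model", model),
    new KeyValuePair<string, object?>("tenant_id", tenantId));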
Where should prompts live?
Keep them in Git (versioning + code reviews). Load at startup or via a small cache with change notifications.
Do I need a vector database from day one?
No. Start with small, in-process embeddings and a file/SQLite store. Add a managed vector DB once your context window routinely overflows.
How do I catch bad answers before users do?
For critical flows (e.g., emails, pricing), run a rule-based validator and a small “judge” model. If validation fails, fall back to a safer template.
How do I avoid cold-start latency?
Warm up the model endpoint during deployment and pre-load embeddings if applicable. Keep at least one replica always on. A warm-up sketch follows.
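One way to do the warm-up, sketched as a hosted service that sends a tiny prompt at startup; the prompt text and model id are placeholders:
public sealed class WarmupService(Kernel kernel, ILogger<WarmupService> log) : IHostedService
{
    public async Task StartAsync(CancellationToken ct)
    {
        try
        {
            // One cheap call so the first real user doesn't pay the cold-start tax.
            var chat = kernel.GetRequiredService<IChatCompletionService>();
            var history = new ChatHistory("You are a warm-up probe.");
            history.AddUserMessage("ping");
            await chat.GetChatMessageContentAsync(history, new() { ModelId = "gpt-4o-mini" }, kernel, ct);
        }
        catch (Exception ex)
        {
            log.LogWarning(ex, "warm-up call failed; continuing startup anyway");
        }
    }

    public Task StopAsync(CancellationToken ct) => Task.CompletedTask;
}

// registration in Program.cs: builder.Services.AddHostedService<WarmupService>();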
Conclusion: ship fast, watch everything, pay less
Semantic Kernel isn’t a magic wand – it’s a solid composition layer for LLM apps. In production, wins come from boring engineering: traces with IDs you can follow, budgets on tokens you can afford, and policies that fail softly. Implement the decorators above, wire OpenTelemetry, and you’ll get faster responses, lower bills, and fewer surprises.
Which part will you add first – streaming, caching, or full OTel? Tell me in the comments and I’ll share a deeper dive for the most requested one.