Vector Stores & Memory in Semantic Kernel for .NET

Vector Stores & Long-Term Memory: Persisting Knowledge in Semantic Kernel

Your AI agent is brilliant… for 30 seconds. Then it forgets everything. If that sounds familiar, you’re one vector store away from transforming a clever demo into a reliable system that remembers customers, documents, and past decisions. In this post I’ll show you how I persist long‑term memory in .NET with Semantic Kernel (SK) using Azure AI Search, Pinecone, and Postgres/pgvector – plus the indexing tricks (chunking, metadata, caching) that actually make RAG work in production.

Why vector memory matters (and when it doesn’t)

LLMs are stateless. Your chat history doesn’t survive process restarts, deployments, or agent handoffs unless you persist it. Vector memory solves two problems:

  1. Recall: Retrieve semantically similar facts/snippets, even if the query words don’t match the source text.
  2. Grounding: Feed the model only the most relevant pieces of your knowledge base to reduce hallucinations and cost.

When not to use a vector store:

  • Your dataset is tiny and keyword search is enough.
  • Answers must be exact and structured (use SQL/Elasticsearch with filters first, vector as a fallback).

Rule of thumb from my projects: if you can’t name the top 5 documents you’d hand to a human expert for a question, you probably need embeddings.

Providers at a glance

  • Azure AI Search. Best for: hybrid keyword + vector over enterprise docs. Pros: managed, fast filters, semantic ranker, built‑in scaling, private networking. Cons: cloud‑only, index schema management, price per index/GB.
  • Pinecone. Best for: high‑QPS vector workloads. Pros: purpose‑built vector DB, strong performance, metadata filters, namespaces. Cons: separate billing, must integrate keyword search elsewhere for hybrid.
  • Postgres + pgvector. Best for: teams already on Postgres, self‑hosted. Pros: one database to rule them all, ACID, joins, extensions (full‑text). Cons: you manage ops, tuning for ANN/recall, version/extension drift.

Tip: start where your data already lives. If your ops team loves Postgres, pgvector is fantastic. If you’re deeply in Azure, AI Search gives hybrid search and ops conveniences. Pinecone shines when vector latency/scale is the core bottleneck.

Semantic Kernel building blocks (quick refresher)

  • Embeddings generator: turns text => float vector. (OpenAI/Azure OpenAI/OSS models.)
  • Vector store: persists vectors + metadata and supports similarity search.
  • RAG pipeline: query => retrieve top‑K chunks => prompt LLM + citations.

With SK you can swap embedding models and vector stores behind interfaces. That’s the lever that makes the same RAG code run on Azure AI Search one day and Postgres the next.

Wiring up providers in .NET (SK 1.x+)

Namespaces/classes vary slightly across SK packages. The patterns below are stable; adjust types to your SK version.

Common setup: Kernel + embeddings

using Microsoft.SemanticKernel;
using Microsoft.SemanticKernel.Embeddings; // ITextEmbeddingGenerationService

static string Env(string name) => Environment.GetEnvironmentVariable(name)!;

var builder = Kernel.CreateBuilder();

// Azure OpenAI embeddings (works similarly with OpenAI)
builder.AddAzureOpenAITextEmbeddingGeneration(
    deploymentName: Env("AZURE_EMBEDDINGS_DEPLOYMENT"),
    endpoint: new Uri(Env("AZURE_OPENAI_ENDPOINT")),
    apiKey: Env("AZURE_OPENAI_API_KEY"));

var kernel = builder.Build();
var embeddings = kernel.GetRequiredService<ITextEmbeddingGenerationService>();
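
Quick sanity check before wiring a store: with the embeddings service resolved you can generate a vector directly. A minimal sketch (GenerateEmbeddingAsync is the single‑string convenience extension; GenerateEmbeddingsAsync is the batch method on the interface):

// Dimension depends on your model (e.g., 1536 for text-embedding-ada-002).
ReadOnlyMemory<float> vector = await embeddings.GenerateEmbeddingAsync("How do I rotate API keys?");
Console.WriteLine($"Embedding dimensions: {vector.Length}");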

Azure AI Search

using Microsoft.SemanticKernel.Memory;
// using Microsoft.SemanticKernel.Connectors.AzureAISearch; // or AzureCognitiveSearch in older builds

var searchEndpoint = Env("AZURE_SEARCH_ENDPOINT");
var searchApiKey  = Env("AZURE_SEARCH_API_KEY");

// Create a memory store backed by an Azure AI Search vector index
var memoryStore = new AzureAISearchMemoryStore(searchEndpoint, searchApiKey);

var memory = new SemanticTextMemory(memoryStore, embeddings);
await memoryStore.CreateCollectionAsync("kb_docs"); // SaveInformationAsync also creates the collection on first write

Pinecone

// using Microsoft.SemanticKernel.Connectors.Pinecone;
var pineconeApiKey = Env("PINECONE_API_KEY");
var pineconeEnvOrProject = Env("PINECONE_PROJECT");

var memoryStore = new PineconeMemoryStore(pineconeApiKey, pineconeEnvOrProject);
var memory = new SemanticTextMemory(memoryStore, embeddings);
await memoryStore.CreateCollectionAsync("kb_docs"); // SaveInformationAsync also creates the collection on first write

Postgres + pgvector

// using Microsoft.SemanticKernel.Connectors.Postgres; // ensure CREATE EXTENSION IF NOT EXISTS vector;
var pg = Env("PG_CONNECTION_STRING");
var memoryStore = await PostgresMemoryStore.ConnectAsync(pg);
var memory = new SemanticTextMemory(memoryStore, embeddings);
await memory.CreateCollectionIfNotExistsAsync("kb_docs");

If you prefer SK’s newer IVectorStore abstractions, map a typed record (Id, Vector, Metadata) to your provider and expose GetCollection<T>(). The chunking + metadata patterns below remain the same.
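
If you go that route, the record might look like the sketch below. The attribute names come from the Microsoft.Extensions.VectorData abstractions that newer SK packages build on and may differ in your version, so treat this as a shape, not gospel:

using Microsoft.Extensions.VectorData; // newer SK vector store abstractions

public sealed class KbChunkRecord
{
    [VectorStoreRecordKey]
    public string Id { get; set; } = string.Empty;   // e.g., "doc_id#0005"

    [VectorStoreRecordData]
    public string DocId { get; set; } = string.Empty;

    [VectorStoreRecordData]
    public string Text { get; set; } = string.Empty;

    [VectorStoreRecordVector(1536)]                  // match your embedding model's dimensions
    public ReadOnlyMemory<float> Embedding { get; set; }
}

// Then: var collection = vectorStore.GetCollection<string, KbChunkRecord>("kb_docs");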

Indexing strategies that actually work

I’ve shipped RAG to production in apps that index thousands to millions of pages. The index makes or breaks quality & cost.

Chunk sizing (token‑aware)

Too small = no context. Too big = low recall + expensive prompts.

  • General docs: 300–800 tokens per chunk with 50–100‑token overlap.
  • Code: smaller chunks (120–300 tokens) with stronger overlap. Preserve function/class boundaries.
  • Tables/FAQs: chunk by row/QA pair to keep atomic facts together.

A quick C# chunker I use:

public static IEnumerable<string> ChunkByTokens(
    string text,
    int maxTokens = 600,
    int overlapTokens = 80,
    Func<string, int> tokenCount = null)
{
    tokenCount ??= s => Math.Max(1, s.Split(' ', StringSplitOptions.RemoveEmptyEntries).Length / 0.75f > int.MaxValue
        ? int.MaxValue
        : (int)(s.Split(' ', StringSplitOptions.RemoveEmptyEntries).Length / 0.75));

    var sentences = Regex.Split(text, "(?<=[.!?])\s+");
    var window = new List<string>();
    int tokens = 0;

    foreach (var s in sentences)
    {
        var t = tokenCount(s);
        if (tokens + t > maxTokens && window.Count > 0)
        {
            yield return string.Join(" ", window);
            // start next window with tail overlap
            var tail = string.Join(" ", window).Split(' ').TakeLast(overlapTokens);
            window = new List<string> { string.Join(" ", tail) };
            tokens = tokenCount(string.Join(" ", tail));
        }
        window.Add(s);
        tokens += t;
    }
    if (window.Count > 0) yield return string.Join(" ", window);
}

Rich metadata (filters & citations)

Always store at least:

  • doc_id (stable document identity)
  • chunk_id (e.g., doc_id#0005)
  • source (url or path)
  • title
  • author (if useful)
  • created_at, updated_at
  • tags (domain, product, locale, confidentiality)
  • char_start, char_end (for precise citation highlight)
  • sha256 of the chunk text (idempotent updates)
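
In code I keep this as a small DTO and serialize it into the store's metadata field. A sketch (field names mirror the list above and are otherwise illustrative; trim to what you actually filter on):

using System.Text.Json;

public sealed record ChunkMetadata(
    string DocId,
    string ChunkId,
    string Source,
    string Title,
    string? Author,
    DateTimeOffset CreatedAt,
    DateTimeOffset UpdatedAt,
    string[] Tags,
    int CharStart,
    int CharEnd,
    string Sha256);

// Serialized once per chunk and stored as the additionalMetadata string.
static string ToJson(ChunkMetadata m) => JsonSerializer.Serialize(m);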

Embeddings cache (don’t pay twice)

You’ll re‑crawl. Don’t re‑embed unchanged text.

using System.Security.Cryptography;
using System.Text;
using System.Text.Json;

public sealed record Chunk(string DocId, int Index, string Text, string Sha256);

static string Hash(string text)
{
    using var sha = SHA256.Create();
    var bytes = sha.ComputeHash(Encoding.UTF8.GetBytes(text));
    return Convert.ToHexString(bytes);
}

async Task IndexAsync(IEnumerable<Chunk> chunks, ISemanticTextMemory memory, string collection)
{
    foreach (var c in chunks)
    {
        var id = $"{c.DocId}#{c.Index:D4}";
        var existing = await memory.GetAsync(collection, id);
        if (existing != null && existing.Metadata.AdditionalMetadata?.Contains(c.Sha256) == true)
            continue; // cached

        var metadata = new Dictionary<string, string> {
            ["doc_id"] = c.DocId,
            ["sha256"] = c.Sha256
        };

        await memory.SaveInformationAsync(
            collection: collection,
            text: c.Text,
            id: id,
            description: c.DocId,
            additionalMetadata: JsonSerializer.Serialize(metadata));
    }
}

Tip: keep a local table (SQLite/Postgres) mapping sha256 => embedding(vector) so you can de‑duplicate across all docs, not just per document.
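
A minimal sketch of that cache using Microsoft.Data.Sqlite, with vectors stored as JSON arrays keyed by the chunk hash (table and column names are illustrative; swap in Postgres if that's where you live):

using System.Text.Json;
using Microsoft.Data.Sqlite;
using Microsoft.SemanticKernel.Embeddings;

public sealed class EmbeddingCache
{
    private readonly SqliteConnection _conn;

    public EmbeddingCache(string path = "embedding-cache.db")
    {
        _conn = new SqliteConnection($"Data Source={path}");
        _conn.Open();
        using var create = _conn.CreateCommand();
        create.CommandText = "CREATE TABLE IF NOT EXISTS embeddings (sha256 TEXT PRIMARY KEY, vector TEXT NOT NULL)";
        create.ExecuteNonQuery();
    }

    public async Task<ReadOnlyMemory<float>> GetOrCreateAsync(
        string sha256, string text, ITextEmbeddingGenerationService embeddings)
    {
        using (var get = _conn.CreateCommand())
        {
            get.CommandText = "SELECT vector FROM embeddings WHERE sha256 = $sha";
            get.Parameters.AddWithValue("$sha", sha256);
            if (get.ExecuteScalar() is string json)
                return JsonSerializer.Deserialize<float[]>(json)!; // cache hit: no embedding call
        }

        var vector = await embeddings.GenerateEmbeddingAsync(text);

        using var put = _conn.CreateCommand();
        put.CommandText = "INSERT OR REPLACE INTO embeddings (sha256, vector) VALUES ($sha, $vec)";
        put.Parameters.AddWithValue("$sha", sha256);
        put.Parameters.AddWithValue("$vec", JsonSerializer.Serialize(vector.ToArray()));
        put.ExecuteNonQuery();
        return vector;
    }
}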

Indexing pipeline (diagram)

[Ingest] ──► [Parse] ──► [Chunk] ──► [Compute hash] ──► [Embed (cache)] ──► [Upsert in vector store]
                                 │                                                    ▲
                                 └───────────────[Metadata + offsets]─────────────────┘

Schema design notes per provider

  • Azure AI Search: define fields upfront (content, vector, doc_id, tags, etc.). Enable vector & semantic config. Use search.in(tags, 'a,b', ',') or filters like locale eq 'en'.
  • Pinecone: metadata JSON per record; keep it flat for fast filtering. Namespaces per tenant or per collection.
  • pgvector: store in a table: (id TEXT PRIMARY KEY, doc_id TEXT, chunk_id INT, vector VECTOR(1536), jsonb_metadata JSONB); use an ANN index (ivfflat) for speed plus tsvector column if you need hybrid keyword + vector.
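
To make the pgvector note concrete, here's a sketch of bootstrapping that table and index with Npgsql (names mirror the bullet above; tune the dimensions and ivfflat lists to your model and data size):

using Npgsql;

var cs = Environment.GetEnvironmentVariable("PG_CONNECTION_STRING")!;
await using var conn = new NpgsqlConnection(cs);
await conn.OpenAsync();

const string ddl = """
    CREATE EXTENSION IF NOT EXISTS vector;
    CREATE TABLE IF NOT EXISTS kb_docs (
        id             TEXT PRIMARY KEY,
        doc_id         TEXT NOT NULL,
        chunk_id       INT  NOT NULL,
        vector         vector(1536),   -- match your embedding model's dimensions
        jsonb_metadata JSONB,
        ts             tsvector        -- optional: for hybrid keyword + vector
    );
    -- ANN index for fast approximate search; build after bulk loading for best recall.
    CREATE INDEX IF NOT EXISTS kb_docs_vector_ivfflat
        ON kb_docs USING ivfflat (vector vector_cosine_ops) WITH (lists = 100);
    CREATE INDEX IF NOT EXISTS kb_docs_ts_gin ON kb_docs USING gin (ts);
    """;

await using var cmd = new NpgsqlCommand(ddl, conn);
await cmd.ExecuteNonQueryAsync();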

Querying & RAG

Structured vs semantic queries

  • Structured: exact filters (“policy by id”, “FAQ for product X”). Use SQL/Azure Search filters first.
  • Semantic: similarity search for fuzzy intent. Use when wording varies or users don’t know the right terms.
  • Hybrid: keyword pre‑filter => vector search => re‑rank. This is my go‑to pipeline for enterprise docs.

Retrieval in SK (provider‑agnostic)

var topK = 6;
var minScore = 0.75;
var results = memory.SearchAsync(
    collection: "kb_docs",
    query: "How do I rotate API keys?",
    limit: topK,
    minRelevanceScore: minScore);

await foreach (var item in results)
{
    Console.WriteLine($"{item.Metadata.Description}  (score={item.Relevance})");
    Console.WriteLine(item.Metadata.AdditionalMetadata); // JSON metadata for citations
}

Adding filters (examples)

Azure AI Search (hybrid): pre‑filter by tags/locale, then semantic/vector search. The snippet below is a sketch against a thin wrapper over the Azure Search SDK client (SimilaritySearchAsync is not an SDK method); adapt it to your own client:

var filter = "locale eq 'en' and confidentiality ne 'internal'";
var results = await azureSearchClient.SimilaritySearchAsync(
    indexName: "kb_docs",
    query: "rotate api keys",
    top: 8,
    filter: filter,
    useSemanticRanker: true);

Postgres: hybrid keyword + vector in one query (sketch):

SELECT id, doc_id, chunk_id,  
       0.4 * (1 - (vector <=> :qvec)) + 0.6 * ts_rank_cd(ts, plainto_tsquery(:qtext)) AS score
FROM kb_docs
WHERE jsonb_metadata->>'locale' = 'en'
ORDER BY score DESC
LIMIT 8;

RAG prompt with citations

Keep prompts deterministic and compact. I like a small, strict template:

string BuildPrompt(string question, IEnumerable<SearchResult> chunks)
{
    var sb = new StringBuilder();
    sb.AppendLine("You are a careful assistant. Use ONLY the sources below to answer.");
    sb.AppendLine("If the answer isn't in the sources, say you don't know.");
    sb.AppendLine();
    int i = 1;
    foreach (var c in chunks)
    {
        sb.AppendLine($"[{i}] {c.Text.Trim()}\nSOURCE: {c.Source}#chars({c.CharStart}-{c.CharEnd})");
        sb.AppendLine();
        i++;
    }
    sb.AppendLine($"Question: {question}");
    sb.AppendLine("\nAnswer with short paragraphs and include citation numbers like [1][3].");
    return sb.ToString();
}
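
The SearchResult type used here and in the next snippet is just a small DTO; its shape is implied by the fields the prompt builder reads:

public sealed class SearchResult
{
    public string Text { get; init; } = string.Empty;
    public string? Source { get; init; }
    public int CharStart { get; init; }
    public int CharEnd { get; init; }
}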

And stitching it together in SK:

var retrieved = await memory.SearchAsync("kb_docs", userQuestion, 8, 0.7).ToListAsync();
var chunks = retrieved.Select(r =>
{
    // Parse the metadata JSON once per result instead of once per field.
    var meta = JsonDocument.Parse(r.Metadata.AdditionalMetadata!).RootElement;
    return new SearchResult
    {
        Text = r.Metadata.Text,
        Source = meta.GetProperty("source").GetString(),
        CharStart = int.Parse(meta.GetProperty("char_start").GetString()!),
        CharEnd = int.Parse(meta.GetProperty("char_end").GetString()!)
    };
});

var prompt = BuildPrompt(userQuestion, chunks);
var answer = await kernel.InvokePromptAsync(prompt);
Console.WriteLine(answer);

Citations UX: Return both the URL and the char offsets from metadata so the frontend can highlight the exact span. This dramatically increases user trust.

Cost, quality, and operations

Embeddings cost

Cost scales with tokens embedded, not the number of chunks. Keep chunks minimal and dedup aggressively.

Rough math you can adapt:

  • total_tokens ≈ sum(tokens(chunk))
  • price = total_tokens / 1000 * price_per_1k_tokens

Batch embeddings to reduce overhead, and cache by sha256 => vector so re‑crawls are cheap.
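
A sketch of batching: GenerateEmbeddingsAsync takes a list, so you pay one round trip per batch instead of one per chunk. The batch size, the chunkTexts variable, and the price value are illustrative:

// Assumes `embeddings` (ITextEmbeddingGenerationService) and `chunkTexts` (the chunk strings, e.g., from ChunkByTokens).
const int batchSize = 64;
var vectors = new List<ReadOnlyMemory<float>>();

foreach (var batch in chunkTexts.Chunk(batchSize)) // .NET 6+ Enumerable.Chunk
{
    var generated = await embeddings.GenerateEmbeddingsAsync(batch.ToList());
    vectors.AddRange(generated);
}

// Rough cost estimate: total embedded tokens / 1000 * price per 1K tokens.
var pricePer1KTokens = 0.0001; // illustrative; check your model's pricing
long totalTokens = chunkTexts.Sum(t => (long)(t.Split(' ', StringSplitOptions.RemoveEmptyEntries).Length / 0.75));
Console.WriteLine($"~{totalTokens} tokens => ~${totalTokens / 1000.0 * pricePer1KTokens:F2}");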

Evaluating retrieval quality

Track recall@k / precision@k via a tiny golden‑set (10-50 questions with known sources). Automate nightly.
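
A minimal recall@k check over such a golden set might look like the sketch below (GoldenItem is an illustrative shape: a question plus the doc_id a human says should be retrieved; doc_id is read from Description because that's where the indexing code above stored it):

public sealed record GoldenItem(string Question, string ExpectedDocId);

async Task<double> RecallAtKAsync(
    ISemanticTextMemory memory, IEnumerable<GoldenItem> golden, int k = 6)
{
    int hits = 0, total = 0;
    foreach (var item in golden)
    {
        total++;
        await foreach (var r in memory.SearchAsync("kb_docs", item.Question, limit: k, minRelevanceScore: 0.0))
        {
            if (r.Metadata.Description == item.ExpectedDocId) { hits++; break; }
        }
    }
    return total == 0 ? 0 : (double)hits / total;
}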

Reindex safely

  • New fields? Use side‑by‑side indexes, then swap alias.
  • Changed chunking? Version your collection (e.g., kb_docs_v2) and dual‑write during migration.

Security & tenancy

  • Store tenant_id in metadata and filter at retrieval (don’t trust the client).
  • Tag and exclude confidentiality='internal' for external users.
  • For PII, encrypt metadata columns (Postgres) or use private endpoints (Azure AI Search).
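
The SK memory SearchAsync API doesn't take metadata filters, so in practice I post‑filter on the server before anything reaches the prompt. A sketch (tenant_id lives in the metadata JSON written at index time):

async Task<List<MemoryQueryResult>> SearchForTenantAsync(
    ISemanticTextMemory memory, string query, string tenantId, int k = 6)
{
    var allowed = new List<MemoryQueryResult>();
    // Over-fetch a bit, since some hits will belong to other tenants.
    await foreach (var r in memory.SearchAsync("kb_docs", query, limit: k * 3, minRelevanceScore: 0.7))
    {
        using var meta = JsonDocument.Parse(r.Metadata.AdditionalMetadata ?? "{}");
        if (meta.RootElement.TryGetProperty("tenant_id", out var t) && t.GetString() == tenantId)
            allowed.Add(r);
        if (allowed.Count == k) break;
    }
    return allowed;
}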

End‑to‑end example: minimal RAG web API

// Program.cs (ASP.NET Core minimal API)
var builder = WebApplication.CreateBuilder(args);

builder.Services.AddSingleton<ITextEmbeddingGenerationService>(sp =>
{
    return new AzureOpenAITextEmbeddingGenerationService(
        deploymentName: Env("AZURE_EMBEDDINGS_DEPLOYMENT"),
        endpoint: new Uri(Env("AZURE_OPENAI_ENDPOINT")),
        apiKey: Env("AZURE_OPENAI_API_KEY"));
});

builder.Services.AddSingleton<ISemanticTextMemory>(sp =>
{
    var embeddings = sp.GetRequiredService<ITextEmbeddingGenerationService>();
    var store = new AzureAISearchMemoryStore(Env("AZURE_SEARCH_ENDPOINT"), Env("AZURE_SEARCH_API_KEY"));
    return new SemanticTextMemory(store, embeddings);
});

var app = builder.Build();

app.MapPost("/ask", async (AskRequest req, ISemanticTextMemory memory) =>
{
    var retrieved = await memory.SearchAsync("kb_docs", req.Question, 6, 0.7).ToListAsync();
    if (retrieved.Count == 0)
        return Results.Json(new { answer = "I don't know based on the provided sources.", citations = Array.Empty<object>() });

    var citations = retrieved.Select((r, i) => new {
        index = i + 1,
        text = r.Metadata.Text,
        meta = r.Metadata.AdditionalMetadata
    }).ToArray();

    var prompt = BuildPrompt(req.Question, citations.Select(c => new SearchResult
    {
        Text = c.text,
        Source = JsonDocument.Parse(c.meta!).RootElement.GetProperty("source").GetString(),
        CharStart = int.Parse(JsonDocument.Parse(c.meta!).RootElement.GetProperty("char_start").GetString()!),
        CharEnd = int.Parse(JsonDocument.Parse(c.meta!).RootElement.GetProperty("char_end").GetString()!)
    }));

    // Call your preferred chat model here (Azure OpenAI, etc.).
    // A kernel with no chat service registered will throw on InvokePromptAsync,
    // so wire one up; the chat deployment env var name below is illustrative.
    var kernel = Kernel.CreateBuilder()
        .AddAzureOpenAIChatCompletion(
            deploymentName: Env("AZURE_CHAT_DEPLOYMENT"),
            endpoint: Env("AZURE_OPENAI_ENDPOINT"),
            apiKey: Env("AZURE_OPENAI_API_KEY"))
        .Build();
    var answer = await kernel.InvokePromptAsync(prompt);

    return Results.Json(new { answer = answer.ToString(), citations });
});

app.Run();

record AskRequest(string Question);

FAQ: Costs, deletes, and maintenance

What’s the cost of embeddings?

Multiply your total embedded tokens by the model’s price per 1K tokens. Reduce cost by chunk dedup via hashing, lowering overlap, embedding only text that could be retrieved (skip boilerplate/nav), and using smaller embedding models if quality remains acceptable.

How do I delete memories?

Upserts are easy; deletes require policy. Options:
1. Per document: find all chunks where doc_id = X and delete by id (sketch below).
2. Soft delete: set is_deleted=true in metadata and filter out at query time (safer for audit; reclaim storage later).
3. Retention: keep only latest N versions per doc_id; a scheduled job purges old chunks.
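
For per‑document deletes (option 1), ISemanticTextMemory exposes RemoveAsync by key, so a sketch looks like this (it assumes you track chunk counts per doc_id in your catalog and use the same id scheme as IndexAsync above):

async Task DeleteDocumentAsync(ISemanticTextMemory memory, string collection, string docId, int chunkCount)
{
    for (var i = 0; i < chunkCount; i++)
    {
        var id = $"{docId}#{i:D4}";               // same id scheme as IndexAsync above
        await memory.RemoveAsync(collection, id); // remove by key
    }
}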

Can I update a chunk without changing ids?

Yes if you keep chunk_id stable. When content changes, recompute sha256, re‑embed, and upsert the same id so queries see the latest, and caches stay coherent.

Should I store raw text in the vector store?

Yes, store the chunk text (for prompts) plus a compact metadata JSON for citations and filters. Keep originals (PDFs, Markdown) in blob storage and reference via source.

Do I need hybrid search?

In enterprise content: almost always. Keyword pre‑filters (product/version/locale) cut noise and halve your prompt tokens.

How many results (K) should I retrieve?

Start with 5-8. If your chunks are small or questions span multiple sections, increase K; otherwise you bloat prompts without gains.

How do I evaluate if RAG is “good enough”?

Build a 20-50 question golden set tied to known sources. Track recall@k, answer exact‑match rate, and citation coverage. Re‑run after any index/schema change.

Conclusion: From one brain to a team of agents

Vector memory is your agent’s long‑term hippocampus. With solid chunking, metadata, and caching, SK lets you swap stores (Azure AI Search, Pinecone, Postgres) without rewriting your app. The next step is orchestration: multiple agents (retrieval, planner, writer, critic) sharing the same memory through a common vector store and topic‑scoped namespaces. That’s where systems stop “chatting” and start working.

I’d love to hear what tripped you up in production – chunking, costs, or citations? Drop a comment with your stack and I’ll suggest a sizing/metadata template you can copy.
