Your AI agent is brilliant… for 30 seconds. Then it forgets everything. If that sounds familiar, you’re one vector store away from transforming a clever demo into a reliable system that remembers customers, documents, and past decisions. In this post I’ll show you how I persist long‑term memory in .NET with Semantic Kernel (SK) using Azure AI Search, Pinecone, and Postgres/pgvector – plus the indexing tricks (chunking, metadata, caching) that actually make RAG work in production.
Why vector memory matters (and when it doesn’t)
LLMs are stateless. Your chat history doesn’t survive process restarts, deployments, or agent handoffs unless you persist it. Vector memory solves two problems:
- Recall: Retrieve semantically similar facts/snippets, even if the query words don’t match the source text.
- Grounding: Feed the model only the most relevant pieces of your knowledge base to reduce hallucinations and cost.
When not to use a vector store:
- Your dataset is tiny and keyword search is enough.
- Answers must be exact and structured (use SQL/Elasticsearch with filters first, vector as a fallback).
Rule of thumb from my projects: if you can’t name the top 5 documents you’d hand to a human expert for a question, you probably need embeddings.
Providers at a glance
Provider | Best for | Pros | Cons |
---|---|---|---|
Azure AI Search | Hybrid keyword + vector over enterprise docs | Managed, fast filters, semantic ranker, built‑in scaling, private networking | Cloud‑only, index schema management, price per index/GB |
Pinecone | High‑QPS vector workloads | Purpose‑built vector DB, strong performance, metadata filters, namespaces | Separate billing, must integrate keyword search elsewhere for hybrid |
Postgres + pgvector | Teams already on Postgres, self‑hosted | One database to rule them all, ACID, joins, extensions (full‑text) | You manage ops, tuning for ANN/recall, version/extension drift |
Tip: start where your data already lives. If your ops team loves Postgres, pgvector is fantastic. If you’re deep in the Azure ecosystem, AI Search gives you hybrid search and ops conveniences. Pinecone shines when vector latency/scale is the core bottleneck.
Semantic Kernel building blocks (quick refresher)
- Embeddings generator: turns text => float vector. (OpenAI/Azure OpenAI/OSS models.)
- Vector store: persists vectors + metadata and supports similarity search.
- RAG pipeline: query => retrieve top‑K chunks => prompt LLM + citations.
With SK you can swap embedding models and vector stores behind interfaces. That’s the lever that makes the same RAG code run on Azure AI Search one day and Postgres the next.
Wiring up providers in .NET (SK 1.x+)
Namespaces/classes vary slightly across SK packages. The patterns below are stable; adjust types to your SK version.
Common setup: Kernel + embeddings
using Microsoft.SemanticKernel;
using Microsoft.SemanticKernel.Embeddings; // ITextEmbeddingGenerationService
static string Env(string name) => Environment.GetEnvironmentVariable(name)!;
var builder = Kernel.CreateBuilder();
// Azure OpenAI embeddings (works similarly with OpenAI)
builder.AddAzureOpenAITextEmbeddingGeneration(
deploymentName: Env("AZURE_EMBEDDINGS_DEPLOYMENT"),
endpoint: new Uri(Env("AZURE_OPENAI_ENDPOINT")),
apiKey: Env("AZURE_OPENAI_API_KEY"));
var kernel = builder.Build();
var embeddings = kernel.GetRequiredService<ITextEmbeddingGenerationService>();
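Before wiring up a store, a quick smoke test of the embedding service helps; a minimal sketch, assuming the GenerateEmbeddingAsync extension available in recent SK packages:
// Embed one string and check that the vector dimensionality matches your index schema.
ReadOnlyMemory<float> probe = await embeddings.GenerateEmbeddingAsync("How do I rotate API keys?");
Console.WriteLine($"Embedding dimensions: {probe.Length}"); // e.g., 1536 for text-embedding-ada-002 / 3-small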
Azure AI Search
using Microsoft.SemanticKernel.Memory;
// using Microsoft.SemanticKernel.Connectors.AzureAISearch; // or AzureCognitiveSearch in older builds
var searchEndpoint = Env("AZURE_SEARCH_ENDPOINT");
var searchApiKey = Env("AZURE_SEARCH_API_KEY");
// Create a memory store backed by an Azure AI Search vector index
var memoryStore = new AzureAISearchMemoryStore(searchEndpoint, searchApiKey);
var memory = new SemanticTextMemory(memoryStore, embeddings);
if (!await memoryStore.DoesCollectionExistAsync("kb_docs"))
    await memoryStore.CreateCollectionAsync("kb_docs");
Pinecone
// using Microsoft.SemanticKernel.Connectors.Pinecone;
var pineconeApiKey = Env("PINECONE_API_KEY");
var pineconeEnvOrProject = Env("PINECONE_PROJECT");
var memoryStore = new PineconeMemoryStore(pineconeApiKey, pineconeEnvOrProject);
var memory = new SemanticTextMemory(memoryStore, embeddings);
if (!await memoryStore.DoesCollectionExistAsync("kb_docs"))
    await memoryStore.CreateCollectionAsync("kb_docs");
Postgres + pgvector
// using Microsoft.SemanticKernel.Connectors.Postgres; // ensure CREATE EXTENSION IF NOT EXISTS vector;
var pg = Env("PG_CONNECTION_STRING");
var dataSourceBuilder = new NpgsqlDataSourceBuilder(pg);
dataSourceBuilder.UseVector(); // Pgvector.Npgsql, so Npgsql maps the pgvector type
var dataSource = dataSourceBuilder.Build();
var memoryStore = new PostgresMemoryStore(dataSource, vectorSize: 1536); // vectorSize must match your embedding model
var memory = new SemanticTextMemory(memoryStore, embeddings);
if (!await memoryStore.DoesCollectionExistAsync("kb_docs"))
    await memoryStore.CreateCollectionAsync("kb_docs");
If you prefer SK’s newer `IVectorStore` abstractions, map a typed record (Id, Vector, Metadata) to your provider and expose `GetCollection<T>()`. The chunking + metadata patterns below remain the same.
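As a rough sketch of such a typed record (the property names are illustrative, not an SK contract; attribute/annotation requirements vary by SK version):
// Illustrative record for an IVectorStore-style collection; the shape, not the API, is the takeaway.
public sealed record KbChunk(
    string Id,                        // e.g. "doc_id#0005"
    ReadOnlyMemory<float> Vector,     // embedding of Text
    string Text,                      // chunk text used for prompts
    string DocId,
    string Source,
    string[] Tags);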
Indexing strategies that actually work
I’ve shipped RAG to production in apps that index thousands to millions of pages. The index makes or breaks quality & cost.
Chunk sizing (token‑aware)
Too small = no context. Too big = low recall + expensive prompts.
- General docs: 300–800 tokens per chunk with 50–100‑token overlap.
- Code: smaller chunks (120–300 tokens) with stronger overlap. Preserve function/class boundaries.
- Tables/FAQs: chunk by row/QA pair to keep atomic facts together.
A quick C# chunker I use:
using System.Text.RegularExpressions;

public static IEnumerable<string> ChunkByTokens(
    string text,
    int maxTokens = 600,
    int overlapTokens = 80,
    Func<string, int>? tokenCount = null)
{
    // Default heuristic: tokens ≈ words / 0.75 (about 1.33 tokens per word); good enough for sizing windows.
    tokenCount ??= s => Math.Max(1,
        (int)(s.Split(' ', StringSplitOptions.RemoveEmptyEntries).Length / 0.75));

    var sentences = Regex.Split(text, @"(?<=[.!?])\s+");
    var window = new List<string>();
    int tokens = 0;

    foreach (var s in sentences)
    {
        var t = tokenCount(s);
        if (tokens + t > maxTokens && window.Count > 0)
        {
            yield return string.Join(" ", window);

            // Start the next window with a tail overlap (~overlapTokens words).
            var tail = string.Join(" ", window).Split(' ').TakeLast(overlapTokens);
            window = new List<string> { string.Join(" ", tail) };
            tokens = tokenCount(window[0]);
        }
        window.Add(s);
        tokens += t;
    }

    if (window.Count > 0) yield return string.Join(" ", window);
}
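Usage is a one-liner per document (the file path below is just an example):
// Chunk a Markdown file before hashing/embedding it.
var text = await File.ReadAllTextAsync("docs/security/rotate-api-keys.md"); // example path
var docChunks = ChunkByTokens(text, maxTokens: 600, overlapTokens: 80).ToList();
Console.WriteLine($"{docChunks.Count} chunks");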
Rich metadata (filters & citations)
Always store at least (an example dictionary follows the list):
- `doc_id` (stable document identity)
- `chunk_id` (e.g., `doc_id#0005`)
- `source` (URL or path)
- `title`
- `author` (if useful)
- `created_at`, `updated_at`
- `tags` (domain, product, locale, confidentiality)
- `char_start`, `char_end` (for precise citation highlighting)
- `sha256` of the chunk text (idempotent updates)
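For instance, a serialized metadata payload for one chunk might look like this (all values are illustrative):
// Illustrative per-chunk metadata; this JSON travels in additionalMetadata and powers filters + citations.
var exampleMetadata = new Dictionary<string, string>
{
    ["doc_id"] = "kb-api-security",
    ["chunk_id"] = "kb-api-security#0005",
    ["source"] = "https://example.com/docs/api-security",
    ["title"] = "API key rotation",
    ["author"] = "platform-team",
    ["created_at"] = "2024-01-10T00:00:00Z",
    ["updated_at"] = "2024-03-02T00:00:00Z",
    ["tags"] = "security,api,en",
    ["char_start"] = "10234",
    ["char_end"] = "11020",
    ["sha256"] = "<sha256 of the chunk text>"
};
var exampleJson = System.Text.Json.JsonSerializer.Serialize(exampleMetadata);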
Embeddings cache (don’t pay twice)
You’ll re‑crawl. Don’t re‑embed unchanged text.
using System.Security.Cryptography;
using System.Text;
using System.Text.Json;
using Microsoft.SemanticKernel.Memory;

public sealed record Chunk(string DocId, int Index, string Text, string Sha256);

static string Hash(string text)
{
    using var sha = SHA256.Create();
    var bytes = sha.ComputeHash(Encoding.UTF8.GetBytes(text));
    return Convert.ToHexString(bytes);
}

async Task IndexAsync(IEnumerable<Chunk> chunks, ISemanticTextMemory memory, string collection)
{
    foreach (var c in chunks)
    {
        var id = $"{c.DocId}#{c.Index:D4}";

        // Skip unchanged chunks: same id + same content hash means the embedding is already stored.
        var existing = await memory.GetAsync(collection, id);
        if (existing != null && existing.Metadata.AdditionalMetadata?.Contains(c.Sha256) == true)
            continue; // cached

        var metadata = new Dictionary<string, string>
        {
            ["doc_id"] = c.DocId,
            ["sha256"] = c.Sha256
        };

        await memory.SaveInformationAsync(
            collection: collection,
            text: c.Text,
            id: id,
            description: c.DocId,
            additionalMetadata: JsonSerializer.Serialize(metadata));
    }
}
Tip: keep a local table (SQLite/Postgres) mapping `sha256 => embedding(vector)` so you can de‑duplicate across all docs, not just per document.
Indexing pipeline (diagram)
[Ingest] ──► [Parse] ──► [Chunk] ──► [Compute hash] ──► [Embed (cache)] ──► [Upsert in vector store]
                            │                                                 ▲
                            └──────────────[Metadata + offsets]───────────────┘
Schema design notes per provider
- Azure AI Search: define fields upfront (`content`, `vector`, `doc_id`, `tags`, etc.). Enable the vector and semantic configurations. Use `search.in(tags, 'a,b', ',')` or filters like `locale eq 'en'`.
- Pinecone: metadata JSON per record; keep it flat for fast filtering. Use namespaces per tenant or per collection.
- pgvector: store in a table: `(id TEXT PRIMARY KEY, doc_id TEXT, chunk_id INT, vector VECTOR(1536), jsonb_metadata JSONB)`; use an ANN index (`ivfflat`) for speed plus a `tsvector` column if you need hybrid keyword + vector. A setup sketch follows this list.
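Here is that pgvector setup as a one-time sketch (run via Npgsql; `VECTOR(1536)` and `lists = 100` are assumptions you should tune for your embedding model and row count):
// One-time pgvector setup: table, ANN index (ivfflat, cosine), and a generated tsvector column for hybrid search.
await using var setupConn = new NpgsqlConnection(Env("PG_CONNECTION_STRING"));
await setupConn.OpenAsync();
await using var setup = new NpgsqlCommand("""
    CREATE EXTENSION IF NOT EXISTS vector;
    CREATE TABLE IF NOT EXISTS kb_docs (
        id             TEXT PRIMARY KEY,
        doc_id         TEXT NOT NULL,
        chunk_id       INT  NOT NULL,
        content        TEXT NOT NULL,
        vector         VECTOR(1536) NOT NULL,
        jsonb_metadata JSONB NOT NULL DEFAULT '{}'::jsonb,
        ts             TSVECTOR GENERATED ALWAYS AS (to_tsvector('english', content)) STORED
    );
    CREATE INDEX IF NOT EXISTS kb_docs_vector_ivfflat
        ON kb_docs USING ivfflat (vector vector_cosine_ops) WITH (lists = 100);
    CREATE INDEX IF NOT EXISTS kb_docs_ts_gin ON kb_docs USING gin (ts);
    """, setupConn);
await setup.ExecuteNonQueryAsync();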
Querying & RAG
Structured vs semantic queries
- Structured: exact filters (“policy by id”, “FAQ for product X”). Use SQL/Azure Search filters first.
- Semantic: similarity search for fuzzy intent. Use when wording varies or users don’t know the right terms.
- Hybrid: keyword pre‑filter => vector search => re‑rank. This is my go‑to pipeline for enterprise docs.
Retrieval in SK (provider‑agnostic)
var topK = 6;
var minScore = 0.75;
var results = memory.SearchAsync(
collection: "kb_docs",
query: "How do I rotate API keys?",
limit: topK,
minRelevanceScore: minScore);
await foreach (var item in results)
{
Console.WriteLine($"{item.Metadata.Description} (score={item.Relevance})");
Console.WriteLine(item.Metadata.AdditionalMetadata); // JSON metadata for citations
}
Adding filters (examples)
Azure AI Search (hybrid): pre‑filter by tags/locale, then semantic/vector search.
var filter = "locale eq 'en' and confidentiality ne 'internal'";
var results = await azureSearchClient.SimilaritySearchAsync(
indexName: "kb_docs",
query: "rotate api keys",
top: 8,
filter: filter,
useSemanticRanker: true);
Postgres: hybrid keyword + vector in one query (sketch):
SELECT id, doc_id, chunk_id,
0.4 * (1 - (vector <=> :qvec)) + 0.6 * ts_rank_cd(ts, plainto_tsquery(:qtext)) AS score
FROM kb_docs
WHERE jsonb_metadata->>'locale' = 'en'
ORDER BY score DESC
LIMIT 8;
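Running that query from .NET is straightforward; a sketch, assuming the `Pgvector.Npgsql` plugin so the query vector binds as a pgvector value (Npgsql uses `@` parameters, so `:qvec`/`:qtext` become `@qvec`/`@qtext`):
// Execute the hybrid query above against the kb_docs table from the setup sketch.
var queryVector = await embeddings.GenerateEmbeddingAsync("How do I rotate API keys?");
var hybridSourceBuilder = new NpgsqlDataSourceBuilder(Env("PG_CONNECTION_STRING"));
hybridSourceBuilder.UseVector(); // enables Pgvector.Vector parameter binding
await using var hybridSource = hybridSourceBuilder.Build();
await using var hybridConn = await hybridSource.OpenConnectionAsync();
await using var cmd = new NpgsqlCommand("""
    SELECT id, doc_id, chunk_id,
           0.4 * (1 - (vector <=> @qvec)) + 0.6 * ts_rank_cd(ts, plainto_tsquery(@qtext)) AS score
    FROM kb_docs
    WHERE jsonb_metadata->>'locale' = 'en'
    ORDER BY score DESC
    LIMIT 8;
    """, hybridConn);
cmd.Parameters.AddWithValue("qvec", new Pgvector.Vector(queryVector.ToArray()));
cmd.Parameters.AddWithValue("qtext", "How do I rotate API keys?");
await using var reader = await cmd.ExecuteReaderAsync();
while (await reader.ReadAsync())
    Console.WriteLine($"{reader.GetString(0)}  score={reader.GetDouble(3):F3}");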
RAG prompt with citations
Keep prompts deterministic and compact. I like a small, strict template:
string BuildPrompt(string question, IEnumerable<SearchResult> chunks)
{
var sb = new StringBuilder();
sb.AppendLine("You are a careful assistant. Use ONLY the sources below to answer.");
sb.AppendLine("If the answer isn't in the sources, say you don't know.");
sb.AppendLine();
int i = 1;
foreach (var c in chunks)
{
sb.AppendLine($"[{i}] {c.Text.Trim()}\nSOURCE: {c.Source}#chars({c.CharStart}-{c.CharEnd})");
sb.AppendLine();
i++;
}
sb.AppendLine($"Question: {question}");
sb.AppendLine("\nAnswer with short paragraphs and include citation numbers like [1][3].");
return sb.ToString();
}
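The snippets above assume a small `SearchResult` DTO; it is not an SK type, so here is a minimal, hypothetical shape that satisfies the template:
// Minimal citation-carrying DTO used by BuildPrompt; an assumed shape, not an SK class.
public sealed class SearchResult
{
    public required string Text { get; init; }   // chunk text injected into the prompt
    public string? Source { get; init; }         // URL or path for the citation
    public int CharStart { get; init; }          // character offsets for highlighting
    public int CharEnd { get; init; }
}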
And stitching it together in SK:
var retrieved = await memory.SearchAsync("kb_docs", userQuestion, 8, 0.7).ToListAsync(); // ToListAsync needs the System.Linq.Async package
var chunks = retrieved.Select(r =>
{
    var meta = JsonDocument.Parse(r.Metadata.AdditionalMetadata!).RootElement; // parse the citation metadata once per result
    return new SearchResult
    {
        Text = r.Metadata.Text,
        Source = meta.GetProperty("source").GetString(),
        CharStart = int.Parse(meta.GetProperty("char_start").GetString()!),
        CharEnd = int.Parse(meta.GetProperty("char_end").GetString()!)
    };
});
var prompt = BuildPrompt(userQuestion, chunks);
var answer = await kernel.InvokePromptAsync(prompt);
Console.WriteLine(answer);
Citations UX: Return both the URL and the char offsets from metadata so the frontend can highlight the exact span. This dramatically increases user trust.
Cost, quality, and operations
Embeddings cost
Cost scales with tokens embedded, not the number of chunks. Keep chunks minimal and dedup aggressively.
Rough math you can adapt:
total_tokens ≈ sum(tokens(chunk))
price = total_tokens / 1000 * price_per_1k_tokens
Batch embeddings to reduce overhead, and cache by `sha256 => vector` so re‑crawls are cheap.
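If you want the math in code, a trivial helper works (the token counter and price-per-1K are whatever your model dictates; the values below are placeholders):
// Back-of-the-envelope embedding cost: total tokens / 1000 * price per 1K tokens.
static decimal EstimateEmbeddingCost(
    IEnumerable<string> chunks, Func<string, int> tokenCount, decimal pricePer1KTokens)
{
    long totalTokens = chunks.Sum(c => (long)tokenCount(c));
    return totalTokens / 1000m * pricePer1KTokens;
}

// Example: reuse the word-based heuristic from ChunkByTokens on the docChunks produced earlier.
var estimate = EstimateEmbeddingCost(docChunks, s => (int)(s.Split(' ').Length / 0.75), 0.0001m); // placeholder price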
Evaluating retrieval quality
Track recall@k / precision@k via a tiny golden‑set (10-50 questions with known sources). Automate nightly.
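A tiny recall@k harness can run against the same memory interface; a sketch, assuming you keep the golden set as question/expected-doc pairs:
// Golden-set item: a question plus the doc_ids a correct retrieval should surface.
public sealed record GoldenItem(string Question, string[] ExpectedDocIds);

// recall@k here = fraction of golden items whose expected doc appears in the top-K results.
async Task<double> RecallAtKAsync(ISemanticTextMemory memory, IReadOnlyList<GoldenItem> goldenSet, int k = 6)
{
    int hits = 0;
    foreach (var item in goldenSet)
    {
        var retrievedDocIds = new HashSet<string>();
        await foreach (var r in memory.SearchAsync("kb_docs", item.Question, limit: k))
            retrievedDocIds.Add(r.Metadata.Description); // doc_id was stored in Description by the IndexAsync snippet above
        if (item.ExpectedDocIds.Any(retrievedDocIds.Contains))
            hits++;
    }
    return (double)hits / goldenSet.Count;
}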
Reindex safely
- New fields? Use side‑by‑side indexes, then swap alias.
- Changed chunking? Version your collection (e.g., `kb_docs_v2`) and dual‑write during the migration.
Security & tenancy
- Store `tenant_id` in metadata and filter at retrieval; don’t trust the client (see the sketch after this list).
- Tag and exclude `confidentiality='internal'` for external users.
- For PII, encrypt metadata columns (Postgres) or use private endpoints (Azure AI Search).
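With the plain ISemanticTextMemory API you can at least enforce tenancy server‑side by post‑filtering on metadata; a sketch (prefer provider‑native filters, like the OData example above, when available):
// Server-side tenancy guard: never accept tenant_id from the client payload.
async Task<List<MemoryQueryResult>> SearchForTenantAsync(
    ISemanticTextMemory memory, string tenantId, string query, int k = 8)
{
    var allowed = new List<MemoryQueryResult>();
    await foreach (var r in memory.SearchAsync("kb_docs", query, limit: k * 2, minRelevanceScore: 0.7))
    {
        var json = string.IsNullOrWhiteSpace(r.Metadata.AdditionalMetadata) ? "{}" : r.Metadata.AdditionalMetadata;
        var meta = JsonDocument.Parse(json).RootElement;
        var sameTenant = meta.TryGetProperty("tenant_id", out var t) && t.GetString() == tenantId;
        var isInternal = meta.TryGetProperty("confidentiality", out var c) && c.GetString() == "internal";
        if (sameTenant && !isInternal)
            allowed.Add(r);
        if (allowed.Count == k) break;
    }
    return allowed;
}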
End‑to‑end example: minimal RAG web API
// Program.cs (ASP.NET Core minimal API)
var builder = WebApplication.CreateBuilder(args);
builder.Services.AddSingleton<ITextEmbeddingGenerationService>(sp =>
{
return new AzureOpenAITextEmbeddingGenerationService(
deploymentName: Env("AZURE_EMBEDDINGS_DEPLOYMENT"),
endpoint: new Uri(Env("AZURE_OPENAI_ENDPOINT")),
apiKey: Env("AZURE_OPENAI_API_KEY"));
});
builder.Services.AddSingleton<ISemanticTextMemory>(sp =>
{
var embeddings = sp.GetRequiredService<ITextEmbeddingGenerationService>();
var store = new AzureAISearchMemoryStore(Env("AZURE_SEARCH_ENDPOINT"), Env("AZURE_SEARCH_API_KEY"));
return new SemanticTextMemory(store, embeddings);
});
var app = builder.Build();
app.MapPost("/ask", async (AskRequest req, ISemanticTextMemory memory) =>
{
var retrieved = await memory.SearchAsync("kb_docs", req.Question, 6, 0.7).ToListAsync();
if (retrieved.Count == 0)
return Results.Json(new { answer = "I don't know based on the provided sources.", citations = Array.Empty<object>() });
var citations = retrieved.Select((r, i) => new {
index = i + 1,
text = r.Metadata.Text,
meta = r.Metadata.AdditionalMetadata
}).ToArray();
var prompt = BuildPrompt(req.Question, citations.Select(c => new SearchResult
{
Text = c.text,
Source = JsonDocument.Parse(c.meta!).RootElement.GetProperty("source").GetString(),
CharStart = int.Parse(JsonDocument.Parse(c.meta!).RootElement.GetProperty("char_start").GetString()!),
CharEnd = int.Parse(JsonDocument.Parse(c.meta!).RootElement.GetProperty("char_end").GetString()!)
}));
// Call your preferred chat model here (Azure OpenAI, etc.). The kernel needs a chat completion
// service registered, or InvokePromptAsync has nothing to call. AZURE_CHAT_DEPLOYMENT is an
// assumed extra env var; registering the Kernel in DI is cleaner than building it per request.
var kernel = Kernel.CreateBuilder()
    .AddAzureOpenAIChatCompletion(
        deploymentName: Env("AZURE_CHAT_DEPLOYMENT"),
        endpoint: Env("AZURE_OPENAI_ENDPOINT"),
        apiKey: Env("AZURE_OPENAI_API_KEY"))
    .Build();
var answer = await kernel.InvokePromptAsync(prompt);
return Results.Json(new { answer, citations });
});
app.Run();
record AskRequest(string Question);
FAQ: Costs, deletes, and maintenance
How do I estimate and reduce embedding cost?
Multiply your total embedded tokens by the model’s price per 1K tokens. Reduce cost by deduplicating chunks via hashing, lowering overlap, embedding only text that could actually be retrieved (skip boilerplate/nav), and using smaller embedding models if quality remains acceptable.
How do I handle deletes and retention?
Upserts are easy; deletes require policy. Options:
1. Per document: find all chunks where `doc_id = X` and delete by id.
2. Soft delete: set `is_deleted=true` in metadata and filter it out at query time (safer for audit; reclaim storage later).
3. Retention: keep only the latest N versions per `doc_id`; a scheduled job purges old chunks.
Can I update documents in place?
Yes, if you keep `chunk_id` stable. When content changes, recompute `sha256`, re‑embed, and upsert the same id so queries see the latest version and caches stay coherent.
Should I store the chunk text alongside the vector?
Yes: store the chunk text (for prompts) plus a compact metadata JSON for citations and filters. Keep the originals (PDFs, Markdown) in blob storage and reference them via `source`.
Is hybrid (keyword + vector) search worth it?
In enterprise content: almost always. Keyword pre‑filters (product/version/locale) cut noise and halve your prompt tokens.
How many chunks should I retrieve (top K)?
Start with 5-8. If your chunks are small or questions span multiple sections, increase K; otherwise you bloat prompts without gains.
How do I measure retrieval quality over time?
Build a 20-50 question golden set tied to known sources. Track recall@k, answer exact‑match rate, and citation coverage. Re‑run after any index/schema change.
Conclusion: From one brain to a team of agents
Vector memory is your agent’s long‑term hippocampus. With solid chunking, metadata, and caching, SK lets you swap stores (Azure AI Search, Pinecone, Postgres) without rewriting your app. The next step is orchestration: multiple agents (retrieval, planner, writer, critic) sharing the same memory through a common vector store and topic‑scoped namespaces. That’s where systems stop “chatting” and start working.
I’d love to hear what tripped you up in production – chunking, costs, or citations? Drop a comment with your stack and I’ll suggest a sizing/metadata template you can copy.