Building Production RAG Systems: What the Tutorials Don't Tell You

Every RAG tutorial follows the same script: chunk your documents, embed them, stuff the top-k results into a prompt, done. It works beautifully in the demo. Then you put it in front of real users with real documents and real questions, and the cracks appear immediately.

I've spent the past couple of years building LLM-powered products, including a shared AI conversation platform where retrieval quality directly shapes what users see. Here's what I wish someone had told me before the first production deployment.

Chunking is a product decision, not a preprocessing step

The default advice — split every 500 tokens with some overlap — treats all documents the same. Real corpora aren't uniform. A legal contract, an API reference, and a support ticket thread have completely different structures, and a chunking strategy that works for one mangles the others.

What actually worked for us:

  • Respect document structure. Split on headings and sections first, token counts second. A chunk that starts mid-sentence in one clause and ends mid-sentence in another is poison for retrieval.
  • Attach context to every chunk. Prepend the document title and section path to the chunk text before embedding. "Refund windows are 30 days" retrieves very differently when it carries "Payments Policy > Consumer Refunds" with it.
  • Keep chunks bigger than you think. Tiny chunks score high on similarity and deliver nothing the model can use. We landed around 800–1200 tokens for prose-heavy content.
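A minimal sketch of the first two points, under some loud assumptions: headings are markdown-style lines starting with "#", word count stands in for token count, and the names Chunk and chunk_by_headings are hypothetical. A real corpus will need a parser per document type.

```python
from dataclasses import dataclass


@dataclass
class Chunk:
    section_path: str  # e.g. "Payments Policy > Consumer Refunds"
    text: str          # the text that gets embedded, context prefix included


def chunk_by_headings(doc_title: str, lines: list[str], max_words: int = 900) -> list[Chunk]:
    """Split on headings first, on size second, and prefix each chunk with its path.

    Assumption: "#"-prefixed lines are headings, and words approximate tokens.
    """
    chunks: list[Chunk] = []
    section = ""
    buffer: list[str] = []

    def flush() -> None:
        if not buffer:
            return
        path = f"{doc_title} > {section}" if section else doc_title
        chunks.append(Chunk(path, f"{path}\n\n" + "\n".join(buffer)))
        buffer.clear()

    for line in lines:
        if line.startswith("#"):
            flush()  # a chunk never spans two sections
            section = line.lstrip("#").strip()
            continue
        if not line.strip():
            continue  # skip blank lines
        # Size is the secondary split: close the chunk at a paragraph boundary.
        current_words = sum(len(p.split()) for p in buffer)
        if buffer and current_words + len(line.split()) > max_words:
            flush()
        buffer.append(line)
    flush()
    return chunks
```

Because the section path is part of the embedded text, a short sentence like "Refund windows are 30 days." carries its policy context into the index.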

The uncomfortable truth: we re-chunked our entire corpus three times in the first six months. Budget for that. Make your ingestion pipeline re-runnable from day one.

Retrieval quality degrades silently

A broken API endpoint throws a 500. A broken retrieval pipeline returns plausible-looking results that are subtly wrong, and nobody notices until a user pastes an embarrassing answer into a screenshot.

The fix is boring and unavoidable: build an eval set before you build features. Ours started as a spreadsheet — fifty real questions, the document that should answer each one, and a pass/fail column. Every change to chunking, embeddings, or prompts ran against it. It caught regressions that no amount of "vibe testing" in the playground ever would.
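The spreadsheet translates almost directly into code. This is a sketch, assuming each row is a (question, expected document ID) pair and that retrieve is whatever function returns ranked document IDs in your system; both names are hypothetical.

```python
from typing import Callable

# One row of the eval set: (question, ID of the document that should answer it)
EvalCase = tuple[str, str]


def run_retrieval_eval(
    cases: list[EvalCase],
    retrieve: Callable[[str], list[str]],
    k: int = 5,
) -> tuple[float, list[str]]:
    """Return the pass rate and the list of questions that failed.

    A case passes when the expected document appears in the top k results.
    """
    failures: list[str] = []
    for question, expected_doc_id in cases:
        top_k = retrieve(question)[:k]
        if expected_doc_id not in top_k:
            failures.append(question)
    pass_rate = 1 - len(failures) / len(cases) if cases else 0.0
    return pass_rate, failures
```

Running this on every chunking, embedding, or prompt change and comparing the pass rate against the previous run is enough to catch most regressions; the failure list tells you where to look.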

Two retrieval upgrades paid for themselves almost immediately:

  • Hybrid search. Pure vector similarity misses exact identifiers — invoice numbers, product codes, function names. Combining BM25-style keyword search with vector search and merging results fixed a whole class of failures.
  • Reranking. Retrieve 30 candidates cheaply, then let a reranker pick the best 5. It added ~100ms of latency and removed most of our "right document, wrong chunk" complaints.
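The two upgrades compose into one function. The merge strategy is not specified above; reciprocal rank fusion is one common choice and is an assumption here, as are the callables keyword_search, vector_search, and rerank_score, which stand in for your own search backends and reranker.

```python
from typing import Callable


def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Merge ranked ID lists. Each list contributes 1 / (k + rank) per ID it contains."""
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda cid: scores[cid], reverse=True)


def retrieve_then_rerank(
    query: str,
    keyword_search: Callable[[str], list[str]],
    vector_search: Callable[[str], list[str]],
    rerank_score: Callable[[str, str], float],
    candidates: int = 30,
    final: int = 5,
) -> list[str]:
    """Cheap hybrid retrieval for a wide candidate set, expensive reranking for the final few."""
    merged = reciprocal_rank_fusion([keyword_search(query), vector_search(query)])
    shortlist = merged[:candidates]
    reranked = sorted(shortlist, key=lambda cid: rerank_score(query, cid), reverse=True)
    return reranked[:final]
```

Fusing on ranks rather than raw scores sidesteps the problem that BM25 scores and cosine similarities live on different scales.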

Cost control is an architecture problem

LLM costs scale with the tokens you send, not just with user count, and RAG inflates the context of every request. A few things kept our bill sane:

  • Cache at the retrieval layer. Popular questions cluster hard. Caching normalized-query → retrieved-chunks pairs in Redis cut embedding API calls dramatically.
  • Don't send chunks the model won't use. If your reranker scores a chunk near zero, drop it. Padding the prompt "just in case" costs real money at scale.
  • Stream everything. Streaming doesn't reduce cost, but it transforms perceived latency. A response that starts in 400ms feels fast even if it takes 8 seconds to finish.
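The first two points can be sketched in a few lines. Everything here is an assumption for illustration: the normalization rules, the in-memory dict standing in for Redis, the 0.05 cutoff, and the names RetrievalCache and drop_low_scores.

```python
import re
from typing import Callable

ScoredChunk = tuple[str, float]  # (chunk_id, reranker score)


def normalize_query(query: str) -> str:
    """Lowercase, strip punctuation, and collapse whitespace so near-duplicates share a key."""
    return " ".join(re.sub(r"[^\w\s]", " ", query.lower()).split())


class RetrievalCache:
    """In-memory stand-in for a Redis-backed cache keyed by normalized query."""

    def __init__(self, retrieve: Callable[[str], list[ScoredChunk]]) -> None:
        self._retrieve = retrieve
        self._store: dict[str, list[ScoredChunk]] = {}
        self.misses = 0

    def get(self, query: str) -> list[ScoredChunk]:
        key = normalize_query(query)
        if key not in self._store:
            self.misses += 1  # only a miss pays for embedding and search
            self._store[key] = self._retrieve(key)
        return self._store[key]


def drop_low_scores(chunks: list[ScoredChunk], min_score: float = 0.05) -> list[ScoredChunk]:
    """Keep only chunks the reranker considers at least marginally relevant."""
    return [chunk for chunk in chunks if chunk[1] >= min_score]
```

A production version would also need expiry and invalidation when documents are re-ingested, otherwise the cache serves chunks that no longer exist.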

Observability: log the retrieval, not just the answer

When a user reports a bad answer, the first question is always the same: what did the model actually see? If you can't answer that, you're debugging blind.

We log, for every request: the raw user query, the normalized query used for retrieval, the IDs and scores of every retrieved chunk, the final assembled prompt, and the model's response. Storage is cheap; reconstructing a failure from memory is not. This single decision converted "the AI is being weird" tickets from unreproducible mysteries into fifteen-minute investigations.
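One way to keep those fields together is a single structured record per request, written as a JSON line. This is a sketch; the RetrievalTrace name, the request_id and timestamp fields, and the JSON-lines format are assumptions rather than a description of any particular logging stack.

```python
import json
import time
import uuid
from dataclasses import asdict, dataclass, field


@dataclass
class RetrievalTrace:
    raw_query: str
    normalized_query: str
    retrieved: list[tuple[str, float]]  # (chunk_id, score) for every retrieved chunk
    prompt: str                         # the final assembled prompt
    response: str
    # Added for illustration: lets a support ticket point at one exact request.
    request_id: str = field(default_factory=lambda: uuid.uuid4().hex)
    timestamp: float = field(default_factory=time.time)

    def to_json_line(self) -> str:
        """Serialize to one line, suitable for appending to a log file or stream."""
        return json.dumps(asdict(self), ensure_ascii=False)
```

With the request ID surfaced to the user or attached to feedback events, "what did the model actually see?" becomes a lookup instead of a reconstruction.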

A few metrics turned out to be leading indicators of user-visible quality problems:

  • Retrieval score distribution. When the top result's similarity score drops below a threshold, the corpus probably doesn't cover the question. Alert on it — it tells you what content to write next.
  • Context utilization. If the model's answers consistently cite only one of five retrieved chunks, you're paying for four chunks of noise. Tighten retrieval.
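  • "I don't know" rate. Both directions matter. A rate that is too low suggests the model is answering questions the corpus can't support, which is where hallucinations come from; a sudden spike usually means ingestion broke and new documents aren't landing in the index.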
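All three metrics are simple ratios over the request log. The sketch below assumes you already have per-request top scores, retrieved and cited chunk IDs, and response text; the function names, the threshold, and the string match used to detect an "I don't know" answer are illustrative assumptions.

```python
def low_coverage_rate(top_scores: list[float], threshold: float) -> float:
    """Share of requests whose best retrieval score fell below the threshold."""
    if not top_scores:
        return 0.0
    return sum(score < threshold for score in top_scores) / len(top_scores)


def context_utilization(retrieved_ids: list[str], cited_ids: set[str]) -> float:
    """Fraction of retrieved chunks that the answer actually cited."""
    if not retrieved_ids:
        return 0.0
    return sum(cid in cited_ids for cid in retrieved_ids) / len(retrieved_ids)


def idk_rate(responses: list[str], marker: str = "i don't know") -> float:
    """Share of responses that abstained, detected by a case-insensitive substring match."""
    if not responses:
        return 0.0
    return sum(marker in response.lower() for response in responses) / len(responses)
```

Tracked over a rolling window, these give you something to alert on before users start filing tickets.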

Latency budgets force honest architecture

A RAG request is a pipeline: embed the query, search the index, rerank, assemble the prompt, call the LLM, stream the response. Each stage is individually fast and collectively slow. We set a hard budget — first token in under a second — and made every stage justify its share.

That budget killed several tempting features. Query expansion with an extra LLM call? Two hundred milliseconds we didn't have. Retrieving from three indexes and merging? Only after we made the searches concurrent. The budget wasn't a constraint on quality — it was a forcing function for keeping the pipeline honest. Every stage that survived earned its latency.
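The multi-index case is where concurrency pays off most visibly. A sketch with asyncio, assuming each index exposes an async search function; search_all is a hypothetical name, and dropping an index that misses its time slice is an assumption about how to enforce the budget, not the only option.

```python
import asyncio
from typing import Awaitable, Callable

Search = Callable[[str], Awaitable[list[str]]]


async def search_all(query: str, searches: list[Search], timeout_s: float) -> list[str]:
    """Query every index concurrently; an index that misses the budget contributes nothing."""

    async def guarded(search: Search) -> list[str]:
        try:
            return await asyncio.wait_for(search(query), timeout_s)
        except asyncio.TimeoutError:
            return []  # assumption: a late index is skipped rather than awaited

    # gather preserves input order, so results line up with the searches list.
    results = await asyncio.gather(*(guarded(search) for search in searches))
    return [hit for hits in results for hit in hits]
```

The stage now costs roughly as much as its slowest index, capped at timeout_s, instead of the sum of all of them.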

The model is the least interesting part

Here's the thing that surprised me most: swapping the underlying LLM was almost never the fix for quality problems. When answers were bad, it was retrieval. When retrieval was fine, it was chunking. When chunking was fine, it was that the answer genuinely wasn't in the corpus — and the honest fix was teaching the system to say "I don't know" instead of hallucinating confidence.
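One simple way to teach the system to abstain is a gate in front of the LLM call. This is a sketch under assumptions: answer_or_abstain and generate are hypothetical names, and the 0.35 cutoff is a placeholder that would need tuning against an eval set.

```python
from typing import Callable

ABSTAIN_MESSAGE = "I don't know. I couldn't find this in the available documents."


def answer_or_abstain(
    question: str,
    scored_chunks: list[tuple[str, float]],  # (chunk text, retrieval or reranker score)
    generate: Callable[[str, list[str]], str],
    min_top_score: float = 0.35,
) -> str:
    """Skip the LLM call when retrieval found nothing convincing."""
    if not scored_chunks or max(score for _, score in scored_chunks) < min_top_score:
        return ABSTAIN_MESSAGE
    context = [text for text, _ in scored_chunks]
    return generate(question, context)
```

A score gate is blunt, but it is cheap, and it fails in the direction of honesty.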

Treat your RAG system like the distributed data pipeline it actually is: instrument it, evaluate it, version its components independently. The teams that struggle are the ones treating it like a prompt with a database attached.

If you're starting today: build the eval set first, chunk by structure, go hybrid on retrieval, and cache aggressively. The rest is iteration.