RAG Is Harder Than It Looks

Retrieval-augmented generation sounds simple. Fetch relevant docs, stuff them in the prompt. In practice, most RAG systems return garbage.

The pitch for RAG is straightforward. Your LLM doesn't know about your internal docs? Just retrieve the relevant ones and include them in the prompt. Problem solved.

The pitch undersells the problem by about 10x.

Where RAG breaks

Retrieval quality. "Relevant" is doing a lot of work in that pitch. Embedding similarity is a proxy for relevance, and it's a leaky one. Documents that are semantically close to the question aren't always useful for answering it.

Chunk boundaries. You split your docs into pieces for embedding. Those boundaries are arbitrary. The answer to a question might span two chunks that don't get retrieved together.
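The standard mitigation is overlapping windows, so an answer that straddles a boundary still appears whole in at least one chunk. A minimal sketch, with illustrative (not tuned) sizes:

```python
# Sliding-window chunking with overlap. chunk_size and overlap are
# character counts here for simplicity; real systems usually count tokens.

def chunk_text(text: str, chunk_size: int = 200, overlap: int = 50) -> list[str]:
    """Split text into overlapping windows covering the whole input."""
    if chunk_size <= overlap:
        raise ValueError("chunk_size must exceed overlap")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break
    return chunks
```

Overlap doesn't solve the problem, it just shrinks it: an answer wider than the overlap can still get split.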

Context limits. Even with context windows of 100k tokens and up, you can't just dump everything in. You have to pick. Picking wrong means wrong answers.

Conflicting information. Real enterprise data contradicts itself. The 2023 policy doc says one thing. The 2024 update says another. Which one should the model trust? It doesn't know.

What I've learned building these systems

Retrieval is where you win or lose. Fancy prompting on top of bad retrieval doesn't help. Spend your time on better chunking, better embeddings, better ranking.

Hybrid search beats pure vector search. Keywords still matter. A customer asking about "SKU 12345" needs exact match, not semantic similarity.
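A toy sketch of the idea: blend a keyword score (plain token overlap here, standing in for BM25) with a vector score (cosine similarity). The weight `alpha`, the two-dimensional embeddings, and the corpus are all made up for illustration:

```python
import math

def cosine(a, b):
    # Cosine similarity between two dense vectors.
    dot = sum(x * y for x, y in zip(a, b))
    na = math.sqrt(sum(x * x for x in a))
    nb = math.sqrt(sum(x * x for x in b))
    return dot / (na * nb) if na and nb else 0.0

def keyword_score(query: str, doc: str) -> float:
    # Fraction of query tokens appearing verbatim in the doc.
    q = set(query.lower().split())
    return len(q & set(doc.lower().split())) / len(q) if q else 0.0

def hybrid_rank(query, query_vec, docs, alpha=0.5):
    """docs: list of (text, embedding). Returns texts, best first."""
    scored = [
        (alpha * keyword_score(query, text) + (1 - alpha) * cosine(query_vec, vec), text)
        for text, vec in docs
    ]
    return [text for _, text in sorted(scored, reverse=True)]
```

The exact-match term is what lets a query like "SKU 12345" beat a document that merely sounds similar in embedding space.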

Metadata filtering is underrated. Don't retrieve from the entire corpus. Filter by date, source, document type. Most questions have implicit constraints.
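In practice this means narrowing the candidate set before vector search ever runs. A sketch, where the field names (`doc_type`, `updated`, `source`) are hypothetical:

```python
from datetime import date

def filter_candidates(docs, doc_type=None, after=None, source=None):
    """Keep only docs matching the given metadata constraints.

    docs: list of dicts with 'doc_type', 'updated', and 'source' keys.
    Any constraint left as None is not applied.
    """
    out = []
    for d in docs:
        if doc_type and d["doc_type"] != doc_type:
            continue
        if after and d["updated"] < after:
            continue
        if source and d["source"] != source:
            continue
        out.append(d)
    return out
```

Most vector databases support this natively as a pre-filter; the point is to use it rather than searching the whole corpus every time.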

Reranking helps more than I expected. Retrieve 50 candidates, rerank to pick 5. The reranker catches things the initial retrieval misses.
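The two-stage shape looks like this. In a real system the first pass would be an ANN lookup and the reranker a cross-encoder model; both are stubbed here with trivial stand-ins so the structure is visible:

```python
def first_pass(query, corpus, k=50):
    # Placeholder for a fast approximate-nearest-neighbor retrieval
    # that returns many loosely relevant candidates.
    return corpus[:k]

def rerank(query, candidates, top_n=5):
    # Placeholder scorer: token overlap with the query. A real reranker
    # would score each (query, doc) pair with a stronger model.
    q = set(query.lower().split())
    def score(doc):
        return len(q & set(doc.lower().split()))
    return sorted(candidates, key=score, reverse=True)[:top_n]
```

The design point: the first stage is cheap and high-recall, the second is expensive and high-precision, and only the second stage's cost scales with candidate count rather than corpus size.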

The evaluation problem

How do you know if your RAG system is working?

You need test cases. Real questions with known good answers. Not synthetic ones—actual questions users have asked. Run your system against these regularly. Catch regressions before users do.
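A regression harness for this can be very small. In this sketch, `answer_fn` stands in for your whole RAG pipeline, and the substring check is a placeholder for whatever grading criterion you actually use (exact match, LLM-as-judge, etc.):

```python
def evaluate(answer_fn, test_cases):
    """Run the pipeline over (question, expected_substring) pairs.

    Returns the pass rate and the list of failing (question, answer) pairs.
    """
    passed = 0
    failures = []
    for question, expected in test_cases:
        answer = answer_fn(question)
        if expected.lower() in answer.lower():
            passed += 1
        else:
            failures.append((question, answer))
    return passed / len(test_cases), failures
```

Run it in CI on every retrieval or prompt change; the failures list tells you exactly which real user questions regressed.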

Most teams don't have this. They ship, hope for the best, and find out about problems through support tickets.

My current take

RAG is necessary. Most useful LLM applications need external knowledge. But it's not a magic fix. It's a system that needs tuning, evaluation, and maintenance.

If someone tells you RAG "just works," they haven't built one at scale.

Written by Rajkiran Panuganti