
Moving Beyond Naive RAG: How We Built a 90% Hit-Rate Pipeline for Production

Ahad Khan, Agentic AI Engineer
May 22, 2024 · 11 min read
Tags: LLM, RAG, VectorDB, Python, Architecture

Most RAG (Retrieval-Augmented Generation) tutorials you find online are toy examples. They work great on a curated dataset of five Wikipedia pages, but they fall apart the moment you hit them with messy, real-world enterprise data. When we first shipped our internal knowledge base bot, we used what I call "Naive RAG"—just a basic top-k vector search. The results were mediocre: a Hit Rate @ 5 of roughly 65%. For a production tool, that’s a failing grade.

To get to the 90%+ accuracy our users demanded, we had to move from a simple script to a multi-stage engineering pipeline. If you’re still just "retrieving and stuffing" context into a prompt, you're leaving performance on the table. We spent three months iterating on this architecture, moving from a single similarity search to a complex, resilient system that handles the nuances of technical documentation and user intent.

The 4-Stage Production Architecture

We quickly learned that retrieval isn't a single step; it’s a sequence. We moved to a four-stage pipeline that handles the messy reality of user queries and unorganized data. This modular approach allows us to optimize each stage independently without breaking the entire flow.


1. Data Engineering: More Than Just 'Split'

In our early iterations, we split text every 500 characters. It was a disaster. Sentences were cut in half, and the model lost the subject of the paragraph. We switched to Recursive Character Splitting, which tries progressively finer separators: paragraph breaks first, then sentence boundaries, then words. By preserving these semantic boundaries, we ensure that each chunk contains a coherent thought.

chunking_strategy.py

```python
from langchain_text_splitters import RecursiveCharacterTextSplitter

def get_text_splitter():
    """
    Returns a splitter that prioritizes semantic boundaries
    to maintain context integrity.
    """
    return RecursiveCharacterTextSplitter(
        chunk_size=1000,
        chunk_overlap=200,
        separators=["\n\n", "\n", ".", " ", ""],
        add_start_index=True,
    )
```

But the real win was Metadata Filtering. We stopped searching the entire index for every query. By tagging documents with user_id, doc_type, and department, we narrowed the search space before the vector engine even calculated a single cosine distance. According to our Pinecone logs, pre-filtering reduced our P99 query latency by 22% because the engine could ignore irrelevant branches of the HNSW (Hierarchical Navigable Small World) graph. This is critical when your index grows beyond 1 million vectors.
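The pre-filtering idea can be sketched without any vector-database dependency. Below is a minimal in-memory version; the chunk structure, metadata keys, and `filtered_search` helper are illustrative, not our production schema:

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0

def filtered_search(chunks, query_vec, filters, top_k=5):
    """Drop chunks whose metadata doesn't match BEFORE scoring, so the
    expensive similarity computation only ever sees valid candidates."""
    candidates = [
        c for c in chunks
        if all(c["metadata"].get(k) == v for k, v in filters.items())
    ]
    candidates.sort(key=lambda c: cosine(c["vector"], query_vec), reverse=True)
    return candidates[:top_k]

chunks = [
    {"id": "a", "vector": [1.0, 0.0], "metadata": {"department": "infra"}},
    {"id": "b", "vector": [0.9, 0.1], "metadata": {"department": "sales"}},
]
hits = filtered_search(chunks, [1.0, 0.0], {"department": "infra"})
```

A managed engine like Pinecone does the equivalent inside the index, pruning whole regions of the graph instead of scoring and discarding.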

2. Retrieval: The Hybrid "Golden Path"

Vector search (Dense) is great for semantic meaning, but it’s surprisingly bad at finding specific product IDs or technical jargon like "CVE-2024-1234". For that, you need BM25 (Sparse) keyword matching. We found that dense embeddings often "smooth over" these critical identifiers, leading to irrelevant results.

We implemented Hybrid Search using Reciprocal Rank Fusion (RRF). This combines the best of both worlds by merging the ranked lists from both search methods.

RRF Score Formula: score(d) = Σ_r 1/(k + rank_r(d)), summing over each retriever r (dense and sparse), with k = 60.

Using a constant k=60, we fused vector scores and keyword scores. This change alone pushed our retrieval accuracy from 65% to 82%. But we didn't stop there. We added a Re-ranker. Standard vector search uses bi-encoders, which are fast but can't capture the deep interaction between query and document.
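RRF itself is only a few lines. Here is a minimal sketch, assuming each retriever hands back an ordered list of document IDs (the example IDs are made up):

```python
from collections import defaultdict

def reciprocal_rank_fusion(ranked_lists, k=60):
    """Fuse several ranked lists: score(d) = sum over lists of
    1 / (k + rank(d)), with rank starting at 1 for the top result."""
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

dense = ["doc3", "doc1", "doc7"]    # vector search order
sparse = ["doc1", "doc9", "doc3"]   # BM25 order
fused = reciprocal_rank_fusion([dense, sparse])
```

Note how `doc1` wins: appearing near the top of both lists beats being first in only one, which is exactly the behavior you want from fusion.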


We introduced Cohere’s Rerank-v3 as a post-retrieval step. We'd retrieve the top 25 chunks via hybrid search (~40ms) and then pass them to the re-ranker (~150ms). In our benchmarks, adding a re-ranker improved our NDCG@10 by 20%. It turns out, letting a Cross-Encoder look at the query and document together is worth the 150ms latency hit.
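The retrieve-then-rerank flow is just "over-fetch, rescore, truncate." The sketch below uses a toy token-overlap scorer as a stand-in; in production the `score_pair` slot is filled by a cross-encoder call such as Cohere's rerank endpoint:

```python
def rerank(query, candidates, score_pair, final_k=5):
    """Rescore an over-fetched candidate list with a slower but more
    accurate pairwise scorer, then keep only the best final_k."""
    scored = [(score_pair(query, doc), doc) for doc in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:final_k]]

# Stand-in scorer: token overlap. A real cross-encoder reads the query
# and document together and outputs a learned relevance score.
def score_pair(query, doc):
    q, d = set(query.lower().split()), set(doc.lower().split())
    return len(q & d) / (len(q) or 1)

docs = [
    "rotate AWS IAM keys with the CLI",
    "team lunch schedule",
    "IAM key rotation policy",
]
top = rerank("rotate IAM keys", docs, score_pair, final_k=2)
```

The shape is what matters: retrieve 25, rerank, keep 5. The scorer is pluggable.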

3. Query Transformation: Fixing Bad User Input

Users are notoriously bad at writing search queries. They ask "How do I do that thing from yesterday?" instead of "How do I rotate AWS IAM keys?" To solve this, we implemented HyDE (Hypothetical Document Embeddings). Instead of searching with the user's messy query, we asked GPT-4o to "write a fake answer to this question." We then used that fake answer as the embedding source.


Research (Gao et al., 2022) shows HyDE can outperform standard retrieval by 10-15% on complex datasets. It bridges the gap between the "question space" and the "answer space." By searching for documents that look like the answer rather than the question, we significantly improved our hit rate on ambiguous queries.
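Structurally, HyDE is a one-line change to the retrieval path: embed a generated answer instead of the raw query. A minimal sketch with stubbed dependencies (in production, `llm` is the GPT-4o call and `embed`/`search` hit the vector index; the stubs here exist only so the sketch runs offline):

```python
def hyde_retrieve(query, llm, embed, search):
    """HyDE: embed a hypothetical ANSWER instead of the raw query,
    then search with that embedding. The fake answer lives in the
    same 'answer space' as the indexed documents."""
    fake_answer = llm(query)  # LLM call in production
    return search(embed(fake_answer))

# Offline stubs for illustration only.
llm = lambda q: "Rotate IAM keys by creating a new key, then deleting the old one."
embed = lambda text: [len(text)]                  # placeholder embedding
search = lambda vec: ["aws-iam-rotation-guide"]   # placeholder index lookup
hits = hyde_retrieve("how do I do that key thing from yesterday?", llm, embed, search)
```

Even though the hypothetical answer may contain factual errors, its vocabulary and phrasing sit much closer to real documentation than the user's vague question does.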

4. Solving "Lost in the Middle"

A critical paper from Stanford (Liu et al., 2023) highlighted a massive flaw in how LLMs handle context: they are great at reading the start and end of a prompt but terrible at the middle. Accuracy drops by 20-30% when relevant info is buried in the center. This appears to be a consequence of how current models distribute attention over long contexts, not something a better prompt can fix.

To combat this, we modified our re-ranker logic to use a "Long Context Reorder" strategy.


Most relevant chunks are placed at the extremes (start and end) of the context window to maximize LLM attention.

By "sandwiching" the most important data, we maximized the LLM's attention. We place the most relevant chunk at the very beginning and the second most relevant at the very end.
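The reorder itself is a small deterministic shuffle. A minimal sketch (equivalent in spirit to LangChain's LongContextReorder, though this is our own illustration, not that library's code):

```python
def long_context_reorder(ranked_docs):
    """Given docs ranked best-first, place the best at the start, the
    second-best at the end, and push weaker docs toward the middle."""
    front, back = [], []
    for i, doc in enumerate(ranked_docs):
        if i % 2 == 0:
            front.append(doc)   # ranks 1, 3, 5, ... fill from the start
        else:
            back.append(doc)    # ranks 2, 4, 6, ... fill from the end
    return front + back[::-1]

ordered = long_context_reorder(["best", "2nd", "3rd", "4th", "5th"])
```

The least relevant chunks end up buried in the middle, which is exactly where "Lost in the Middle" says inattention costs you the least.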

5. Benchmarking the Infrastructure

Choosing a vector database isn't just about features; it's about the scale-to-latency ratio. When we hit 5 million vectors, we saw a massive performance divergence. We tested several leading solutions to find the right fit for our throughput requirements.

| Database | P99 Latency (1M vectors) | Managed?    | Throughput (QPS)  |
|----------|--------------------------|-------------|-------------------|
| Pinecone | ~45ms                    | Yes         | High (Serverless) |
| Milvus   | ~25ms                    | Optional    | Ultra-High        |
| PGVector | ~95ms                    | Self-hosted | Low-Moderate      |
| Weaviate | ~35ms                    | Optional    | Moderate          |

For our scale, Milvus provided the best raw throughput, but we stayed with Pinecone for the serverless ease of use until we hit the 10M vector mark, where specialized engines outperform general-purpose SQL plugins (like PGVector) by nearly 4x.

6. Evaluation: Stop Using "Vibes"

You cannot build a production system on "vibes-based" testing. We integrated the RAG Triad (via TruLens) and Ragas metrics into our CI/CD pipeline. This allows us to quantify the impact of every change we make to the pipeline.


We aim for three specific KPIs:

  1. Faithfulness (>0.85): Can the answer be 100% inferred from the context?
  2. Relevance (>0.80): Does the answer actually address the user's prompt?
  3. Hit Rate @ 5 (>90%): Is the correct document in the top 5 results?
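Of the three KPIs, Hit Rate @ 5 is the one you can compute without an LLM judge. A minimal version, assuming an eval set of (gold document ID, retrieved IDs) pairs (the shape and IDs here are illustrative):

```python
def hit_rate_at_k(eval_set, k=5):
    """Fraction of queries whose gold document appears in the top-k
    retrieved IDs. eval_set items: (gold_doc_id, retrieved_ids)."""
    hits = sum(1 for gold, retrieved in eval_set if gold in retrieved[:k])
    return hits / len(eval_set)

eval_set = [
    ("doc-7", ["doc-7", "doc-2", "doc-9"]),   # hit at rank 1
    ("doc-4", ["doc-1", "doc-3", "doc-4"]),   # hit at rank 3
    ("doc-5", ["doc-2", "doc-8", "doc-6"]),   # miss
]
rate = hit_rate_at_k(eval_set, k=5)
```

Running this over a few hundred labeled queries in CI is what lets you say "this change pushed hit rate from 65% to 82%" with a straight face.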

To keep costs down, we utilized Prompt Caching. Anthropic's Claude 3.5 Sonnet offers up to a 90% discount on cached input tokens. For a system prompt that includes a 20k token static knowledge base, this reduced our per-query cost from $0.08 to $0.01.

Agentic RAG: The Final Evolution

The most recent shift we've made is toward Agentic RAG. Instead of a linear pipeline, we use a loop where the LLM can "critique" its own retrieval. If the retrieved docs aren't good enough, the agent rewrites the query and tries again. This adds a layer of self-correction that is vital for handling the "long tail" of difficult user questions.

agentic_loop.py

```python
from typing import List, TypedDict

from langgraph.graph import END, StateGraph

class RAGState(TypedDict):
    query: str
    documents: List[str]
    generation: str

def retrieve_docs(state: RAGState):
    """Hybrid search over the index; stubbed here for brevity."""
    return {"documents": ["retrieved chunk"]}

def grade_documents(state: RAGState):
    """
    Logic to check if retrieved docs are relevant.
    In production, this calls an LLM to 'grade' the context.
    """
    return state

def generate_answer(state: RAGState):
    """Final LLM generation over the graded context; stubbed for brevity."""
    return {"generation": ""}

def decide_next_step(state: RAGState) -> str:
    """Route to 'generate' if the docs pass the grade, or back to
    'retrieve' with a rewritten query if they don't."""
    return "generate" if state["documents"] else "retrieve"

workflow = StateGraph(RAGState)
workflow.add_node("retrieve", retrieve_docs)
workflow.add_node("grade", grade_documents)
workflow.add_node("generate", generate_answer)

workflow.set_entry_point("retrieve")
workflow.add_edge("retrieve", "grade")
workflow.add_conditional_edges("grade", decide_next_step)
workflow.add_edge("generate", END)
app = workflow.compile()
```

Lessons Learned

  1. Retrieval is 80% of the battle. If the right data isn't in the prompt, the smartest model in the world can't help you. Focus your engineering efforts here first.
  2. Metadata is your best friend. Every bit of structured data you can attach to a chunk is a shortcut for the vector engine. Use it to bypass similarity search entirely when possible.
  3. Don't ignore latency. A 10-second RAG response is a dead RAG response. Parallelize your retrieval and use streaming for the generation to keep the user engaged.
  4. RAG > Long Context. Even as context windows grow to 1M+ tokens, RAG remains 10x-50x cheaper and avoids the "Lost in the Middle" performance degradation.

Building production RAG is an exercise in iterative refinement. We started with a 65% success rate and fought for every percentage point through hybrid search, re-ranking, and rigorous evaluation. If you're still in the "Naive" stage, it's time to start engineering.

What's Next?

We are currently experimenting with ColBERT (Contextualized Late Interaction) for even higher retrieval precision without the latency of a full Cross-Encoder. Initial tests show another 5% jump in Hit Rate. We are also looking into Knowledge Graph integration to handle multi-hop reasoning queries that standard vector search struggles with. Stay tuned.