AI Engineering

Building Production-Ready RAG Pipelines: Beyond Simple Vector Search

Moving past naive semantic search into advanced retrieval-augmented generation. Exploring hybrid indexing, query rewriting, reranking, and verification gates for enterprise deployment.

Thanuka Ellepola · May 18, 2026 · 6 min read

The Fallacy of Naive RAG

Many engineers begin their retrieval-augmented generation (RAG) journey by loading a few PDFs into a vector database, embedding them with a standard API, and querying them directly. While this works beautifully for simple demos, it fails spectacularly in production enterprise environments.

In production, naive RAG suffers from several failure modes: low retrieval recall caused by poor chunking, chunks retrieved without their surrounding context, the lost-in-the-middle effect in long LLM context windows, and hallucinated answers. To build enterprise-grade systems, we must construct a highly engineered retrieval and verification lifecycle.

1. Multi-Stage Retrieval and Reranking

First-stage retrieval needs to be broad and fast. We combine traditional keyword search (BM25) with dense vector search (semantic similarity) using a hybrid index. By blending keyword precision with semantic understanding, we capture both specific jargon and conceptual matches.
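The article does not say how the keyword and dense scores are blended, so the following is only a minimal sketch of one common option: min-max normalize each result list, then take a weighted sum. The `ScoredHit` shape, the `hybridMerge` name, and the `alpha` weight are assumptions made for illustration, not part of any specific search library.

```typescript
// Hypothetical shape for a scored hit; not tied to any particular search library.
interface ScoredHit {
  id: string;
  score: number;
}

// Min-max normalize scores into [0, 1] so BM25 and cosine scores share a scale.
function normalizeScores(hits: ScoredHit[]): Record<string, number> {
  const normalized: Record<string, number> = {};
  if (hits.length === 0) return normalized;
  const scores = hits.map((h) => h.score);
  const min = Math.min(...scores);
  const max = Math.max(...scores);
  for (const hit of hits) {
    // If every score is identical, treat all hits as equally strong.
    normalized[hit.id] = max === min ? 1 : (hit.score - min) / (max - min);
  }
  return normalized;
}

// Blend keyword and dense results. alpha is an assumed tuning weight for the dense side.
function hybridMerge(
  keywordHits: ScoredHit[],
  denseHits: ScoredHit[],
  alpha = 0.5,
  topK = 50
): ScoredHit[] {
  const keyword = normalizeScores(keywordHits);
  const dense = normalizeScores(denseHits);
  const ids = Object.keys({ ...keyword, ...dense });
  return ids
    .map((id) => ({
      id,
      // A document missing from one list contributes 0 from that side.
      score: (1 - alpha) * (keyword[id] ?? 0) + alpha * (dense[id] ?? 0),
    }))
    .sort((a, b) => b.score - a.score)
    .slice(0, topK);
}
```

A document that ranks well in both lists rises above one that tops only a single list. In practice `alpha` would be tuned against a labeled query set rather than left at 0.5.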

Once the top 50 candidates are retrieved, we run them through a cross-encoder reranking model (like Cohere Rerank or BGE-Reranker). A cross-encoder reads the user query and each document chunk together and outputs a relevance score for the pair, which is slower than comparing precomputed embeddings but more accurate. We keep only the top 5-10 highly relevant chunks, which eliminates noise before generation.
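The funnel from roughly 50 candidates down to a handful can be sketched as below. This is an illustration under stated assumptions: `RelevanceScorer` is a hypothetical signature standing in for a cross-encoder or rerank API call (which would normally be asynchronous and batched), and `overlapScorer` is a toy term-overlap function used only so the example runs, not a real reranking model.

```typescript
interface Chunk {
  id: string;
  text: string;
}

// Hypothetical scorer signature; a real one would wrap a cross-encoder model or rerank API.
type RelevanceScorer = (query: string, chunkText: string) => number;

// Score every candidate against the query and keep the topN best chunks.
function rerank(
  query: string,
  candidates: Chunk[],
  scorer: RelevanceScorer,
  topN = 5
): Chunk[] {
  return candidates
    .map((chunk) => ({ chunk, score: scorer(query, chunk.text) }))
    .sort((a, b) => b.score - a.score)
    .slice(0, topN)
    .map((scored) => scored.chunk);
}

// Toy stand-in scorer (NOT a cross-encoder): fraction of query terms found in the chunk.
const overlapScorer: RelevanceScorer = (query, chunkText) => {
  const terms = query.toLowerCase().split(/\s+/).filter(Boolean);
  if (terms.length === 0) return 0;
  const haystack = chunkText.toLowerCase();
  const matched = terms.filter((term) => haystack.indexOf(term) !== -1).length;
  return matched / terms.length;
};
```

Keeping the scorer as a parameter means the first-stage retriever and the reranking model can be swapped independently.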

2. Query Rewriting and Expansion

Users rarely write optimal queries. A search query like "sales q3" is far too sparse. To solve this, we introduce an agentic query rewriter step before retrieval.

The query rewriter analyzes conversation history, expands abbreviations, and generates 3 alternative phrasings of the query. We perform vector searches for all variants and merge the results using Reciprocal Rank Fusion (RRF).

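```typescript
// Example of Query Expansion Node in LangChain
import { ChatOpenAI } from "@langchain/openai";
import { HumanMessage, SystemMessage } from "@langchain/core/messages";

// AgentState is the pipeline's own state type, defined elsewhere in the agent.
async function queryExpanderNode(state: AgentState) {
  const llm = new ChatOpenAI({ modelName: "gpt-4o", temperature: 0 });
  const response = await llm.invoke([
    new SystemMessage("Expand the user query into 3 distinct search queries optimized for vector database retrieval. Return only a JSON array of strings."),
    new HumanMessage(state.latestQuery)
  ]);

  let expandedQueries: string[] = [];
  try {
    // response.content is not always a plain string, and the model may return invalid JSON.
    const parsed = JSON.parse(String(response.content));
    if (Array.isArray(parsed)) {
      expandedQueries = parsed.filter((q): q is string => typeof q === "string");
    }
  } catch {
    // Fall back to the original query alone if the output cannot be parsed.
  }
  return { ...state, searchQueries: [state.latestQuery, ...expandedQueries] };
}
```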

Query rewrite module inside our agent pipeline
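The Reciprocal Rank Fusion merge mentioned above can be sketched as follows. Each document receives 1 / (k + rank) from every result list it appears in, and the contributions are summed. The function name and return shape are assumptions for illustration; `k = 60` is a commonly used default rather than a value stated in this article, and the sketch assumes each list contains a given ID at most once.

```typescript
interface FusedResult {
  id: string;
  score: number;
}

// Merge several ranked lists of document IDs (best first) into one ranking.
function reciprocalRankFusion(rankedLists: string[][], k = 60): FusedResult[] {
  const scores: Record<string, number> = {};
  for (const list of rankedLists) {
    list.forEach((id, index) => {
      // Ranks are 1-based, so the top hit in a list contributes 1 / (k + 1).
      scores[id] = (scores[id] ?? 0) + 1 / (k + index + 1);
    });
  }
  return Object.keys(scores)
    .map((id) => ({ id, score: scores[id] }))
    .sort((a, b) => b.score - a.score);
}

// Usage: one ranked list per query variant produced by the rewriter.
const fused = reciprocalRankFusion([
  ["a", "b", "c"],
  ["b", "a", "d"],
  ["b", "c", "a"],
]);
console.log(fused.map((r) => r.id));
```

Because RRF uses only ranks, it needs no score normalization across query variants, which is why it suits merging results whose raw scores are not comparable.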

3. Grounding and Verification Gates

Even with perfect retrieval, language models can hallucinate. To catch unsupported claims before they reach the user, we implement a post-generation verification gate.

The verification gate parses the LLM output, extracts key claims, and traces each claim back to the source chunks. If a claim lacks supporting evidence (low grounding score), the response is rejected, and the agent initiates a secondary retrieval loop to seek better context.
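The gate logic described above can be sketched roughly as below. Everything here is a simplified assumption: claims are extracted with a naive sentence split, and the grounding score is plain word overlap standing in for whatever the production gate uses (for example an entailment model or an LLM judge). The function names and the 0.7 threshold are hypothetical.

```typescript
interface GateResult {
  grounded: boolean;
  unsupportedClaims: string[];
}

function tokenize(text: string): string[] {
  const tokens: string[] = text.toLowerCase().match(/[a-z0-9]+/g) ?? [];
  return tokens;
}

// Stand-in grounding score: fraction of a claim's longer words that appear in the chunk.
function groundingScore(claim: string, chunk: string): number {
  const contentWords = tokenize(claim).filter((word) => word.length > 3);
  if (contentWords.length === 0) return 1; // Nothing checkable in this claim.
  const chunkWords = tokenize(chunk);
  const hits = contentWords.filter((word) => chunkWords.indexOf(word) !== -1).length;
  return hits / contentWords.length;
}

// Naive claim extraction: treat each sentence of the answer as one claim.
function extractClaims(answer: string): string[] {
  return answer
    .split(/[.!?]+\s+|[.!?]+$/)
    .map((sentence) => sentence.trim())
    .filter((sentence) => sentence.length > 0);
}

// Reject the answer if any claim's best score across all chunks is below the threshold.
function verificationGate(
  answer: string,
  sourceChunks: string[],
  threshold = 0.7
): GateResult {
  const unsupportedClaims = extractClaims(answer).filter((claim) => {
    const best = Math.max(0, ...sourceChunks.map((chunk) => groundingScore(claim, chunk)));
    return best < threshold;
  });
  return { grounded: unsupportedClaims.length === 0, unsupportedClaims };
}
```

When `grounded` is false, the caller would discard the response and use `unsupportedClaims` to drive the secondary retrieval loop.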

Grounding verification gates are the difference between a prototype that occasionally lies and an enterprise system with a guaranteed SLA for factual accuracy.

Conclusion

Scaling RAG is not about using a bigger LLM; it is about engineering the data flow. By incorporating hybrid search, rerankers, query rewriting, and grounding checks, you turn a fragile chatbot into a robust decision-support system.
