Gloss Key Takeaways

Most RAG accuracy failures are retrieval failures (not generation), so improving retrieval yields the biggest quality gains.
A naive dense-vector-only pipeline often misses exact-match needs (product codes, function names) and rare facts, leading to wrong context and downstream hallucinations.
Adding BM25-style keyword search (e.g., Postgres full-text search) complements embeddings by capturing exact terms and lexical signals.
Fusing dense and sparse results with Reciprocal Rank Fusion is a robust default because it avoids score normalization and benefits from retriever disagreement.
In practice, hybrid search plus reranking can deliver large recall gains (often 25–45%) with relatively little implementation time if your corpus is already indexed.

Hybrid Search in 90 Minutes, the Single Biggest RAG Quality Win in 2026

Industry analysis from the major RAG observability platforms shows that 73% of RAG failures come from retrieval, not generation. The model is not hallucinating because it is dumb. The model is hallucinating because the retrieval step did not give it the right chunk. If you have a RAG pipeline in production and you are unhappy with its accuracy, you are almost certainly fighting the wrong problem. The fix is not a smarter model. The fix is hybrid search plus reranking.

This is a focused tutorial. We will take a naive dense-vector RAG pipeline, add BM25 keyword search, fuse the two with reciprocal rank fusion, and finish with a reranker model. Then we will run an evaluation script that measures the retrieval quality gain in numbers you can take to a stakeholder. Total work, if you already have a corpus indexed, is about 90 minutes. The recall improvement on a representative test set in my own deployments is consistently 25 to 45 percent.

We will use pgvector for the vector store because it is the simplest path. Qdrant and Turbopuffer are also fine choices, with caveats noted at the end.

The starting pipeline

A naive RAG setup looks like this. Embed the chunks, search by cosine similarity, return top-k, send to the model. It works for questions about well-covered topics where the wording of the question is semantically close to the wording of the chunk. It fails when the answer hinges on an exact term such as an identifier, when the question and the right chunk share little vocabulary, or when the answer is rare in the corpus.

from sqlalchemy import create_engine, text
from openai import OpenAI

client = OpenAI()
engine = create_engine("postgresql://localhost/rag")

def embed(text_in):
    return client.embeddings.create(
        model="text-embedding-3-large",
        input=text_in,
    ).data[0].embedding

def search_dense(query, k=10):
    emb = embed(query)
    with engine.connect() as conn:
        # CAST(:emb AS vector) instead of :emb::vector: the "::" cast directly
        # after a bind parameter is not parsed reliably by SQLAlchemy's text().
        rows = conn.execute(text("""
            SELECT id, chunk, 1 - (embedding <=> CAST(:emb AS vector)) AS score
            FROM documents
            ORDER BY embedding <=> CAST(:emb AS vector)
            LIMIT :k
        """), {"emb": str(emb), "k": k}).fetchall()
    return [{"id": r.id, "chunk": r.chunk, "score": r.score} for r in rows]

This is what most teams ship and forget. The dense search has a known weakness. Embeddings are good at semantic similarity, bad at exact-match retrieval. If your query mentions a specific product code or a function name, dense search blurs it into "things that look kind of like a product code" rather than "the exact product code." That is where BM25 comes in.

Adding BM25

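Postgres ships with full-text search, and its ts_rank_cd function gives you a lexical ranking that fills the same role as BM25. Strictly speaking it is a cover-density ranking rather than BM25 (it does not use corpus-wide term statistics such as inverse document frequency), but it captures the exact-term signal that embeddings miss. It is good enough that you do not need a separate search engine. Add a tsvector column, an index, and a query.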

ALTER TABLE documents ADD COLUMN chunk_tsv tsvector
    GENERATED ALWAYS AS (to_tsvector('english', chunk)) STORED;
CREATE INDEX documents_tsv_idx ON documents USING GIN(chunk_tsv);

The Python side:

def search_sparse(query, k=10):
    with engine.connect() as conn:
        rows = conn.execute(text("""
            SELECT id, chunk, ts_rank_cd(chunk_tsv, plainto_tsquery('english', :q)) AS score
            FROM documents
            WHERE chunk_tsv @@ plainto_tsquery('english', :q)
            ORDER BY score DESC
            LIMIT :k
        """), {"q": query, "k": k}).fetchall()
    return [{"id": r.id, "chunk": r.chunk, "score": r.score} for r in rows]
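One caveat with plainto_tsquery: it joins every term with AND, so a long natural-language question only matches chunks that contain all of its non-stop-word terms, and often matches nothing. A minimal sketch of one workaround, assuming you are willing to build an OR query yourself and pass it to to_tsquery instead. The helper name to_or_tsquery and its tokenization rule are illustrative and not part of the original pipeline.

```python
import re

def to_or_tsquery(query, max_terms=32):
    """Build a to_tsquery-compatible string that ORs the query terms.

    Hypothetical helper: keeps only alphanumeric runs so user input cannot
    inject tsquery operators, deduplicates terms while preserving order,
    and joins them with the OR operator '|'.
    """
    terms = re.findall(r"[A-Za-z0-9]+", query.lower())
    unique_terms = list(dict.fromkeys(terms))[:max_terms]
    return " | ".join(unique_terms)

print(to_or_tsquery("How do I reset SKU-4471 firmware?"))
# how | do | i | reset | sku | 4471 | firmware
```

To try it, bind the returned string as :q and swap plainto_tsquery('english', :q) for to_tsquery('english', :q) in both places in the query. OR matching returns more candidates, so check on your eval set whether it actually helps before keeping it.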

You now have two retrievers. They will frequently disagree, which is exactly what you want. Disagreement is information. The next step is fusion.
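To see how much the two retrievers actually disagree on your corpus, a small overlap measure is enough. A sketch, assuming result lists shaped like the ones search_dense and search_sparse return (dicts with an "id" key). overlap_report is a hypothetical helper name, and the data below is made up.

```python
def overlap_report(dense_results, sparse_results):
    """Summarize agreement between two ranked result lists by document id."""
    dense_ids = [r["id"] for r in dense_results]
    sparse_ids = [r["id"] for r in sparse_results]
    dense_set, sparse_set = set(dense_ids), set(sparse_ids)
    union = dense_set | sparse_set
    shared = dense_set & sparse_set
    return {
        # Jaccard similarity: 1.0 means identical sets, 0.0 means disjoint.
        "jaccard": len(shared) / len(union) if union else 1.0,
        "both": [i for i in dense_ids if i in sparse_set],
        "dense_only": [i for i in dense_ids if i not in sparse_set],
        "sparse_only": [i for i in sparse_ids if i not in dense_set],
    }

# Toy data standing in for real retriever output.
dense = [{"id": 1}, {"id": 2}, {"id": 3}]
sparse = [{"id": 3}, {"id": 4}]
report = overlap_report(dense, sparse)
print(report)
```

A Jaccard value near 1.0 across your eval queries would suggest the sparse retriever is adding little. Low values mean each retriever is surfacing chunks the other misses.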

Reciprocal rank fusion

There are several ways to combine ranked lists. RRF is the boring, robust default. It does not require score normalization, which is good because dense scores and sparse scores live on different scales. RRF gives each item a score based on its rank in each list, with a constant k that smooths out top-rank differences.

def reciprocal_rank_fusion(result_lists, k=60):
    scores = {}
    items = {}
    for results in result_lists:
        for rank, item in enumerate(results):
            id_ = item["id"]
            scores[id_] = scores.get(id_, 0) + 1 / (k + rank + 1)
            items[id_] = item
    fused = sorted(scores.items(), key=lambda x: -x[1])
    return [items[id_] for id_, _ in fused]

def hybrid_search(query, k=20):
    dense = search_dense(query, k=k)
    sparse = search_sparse(query, k=k)
    return reciprocal_rank_fusion([dense, sparse])[:k]

This is the entire fusion logic. Fifteen lines. The default k=60 is the value Cormack, Clarke, and Büttcher proposed in their 2009 paper, and it has held up since across most domains. You can adjust k for a heavily specialized domain, but you usually should not.
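A worked example makes the behavior concrete. Each document's fused score is the sum of 1 / (k + rank) over the lists it appears in, with rank starting at 1 (the function above uses a zero-based rank plus 1, which is the same thing). The document ids below are made up, and rrf_scores is a hypothetical variant that returns the raw scores so you can inspect them.

```python
def rrf_scores(ranked_id_lists, k=60):
    """Return {doc_id: fused_score} using 1 / (k + rank), rank starting at 1."""
    scores = {}
    for ids in ranked_id_lists:
        for rank, doc_id in enumerate(ids, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return scores

dense_ids = ["a", "b", "c"]    # "a" is the dense retriever's top hit
sparse_ids = ["b", "d", "a"]   # "b" is the sparse retriever's top hit

scores = rrf_scores([dense_ids, sparse_ids])
ranking = sorted(scores, key=lambda doc_id: -scores[doc_id])
for doc_id in ranking:
    print(doc_id, round(scores[doc_id], 5))
# b 0.03252
# a 0.03227
# d 0.01613
# c 0.01587
```

Document b wins with ranks 2 and 1, narrowly ahead of a with ranks 1 and 3. Both score roughly double d and c, which each appear in only one list. Agreement between retrievers counts for more than a high rank from a single one.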

The reranker

After fusion you have, say, 20 candidates ordered by combined relevance. Most of them are still wrong. A reranker is a small model that reads the query and each candidate together and scores the pair more accurately than either retriever can, at the cost of being too slow to run over the whole corpus. You apply it to the top 20 to get the real top 5.

The Cohere rerank-3.1 model and BAAI bge-reranker-v3-large are both solid. I will show Cohere because it is one API call, but a self-hosted bge model is fine for cost-sensitive deployments.

import os

import cohere

co = cohere.Client(os.environ["COHERE_API_KEY"])

def rerank(query, candidates, top_n=5):
    # Send only the chunk text; r.index maps each result back to its candidate.
    docs = [c["chunk"] for c in candidates]
    resp = co.rerank(
        model="rerank-3.1",  # confirm the current model id in Cohere's docs
        query=query,
        documents=docs,
        top_n=top_n,
    )
    return [candidates[r.index] for r in resp.results]

def retrieve(query, top_n=5):
    candidates = hybrid_search(query, k=20)
    return rerank(query, candidates, top_n=top_n)

That is the full hybrid pipeline. Dense plus sparse, fused with RRF, reranked to top 5. This is what you want feeding the model.
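The rerank step only needs a function that scores (query, chunk) pairs, so swapping Cohere for a self-hosted cross-encoder does not change the shape of the pipeline. A sketch under that assumption: rerank_with and toy_overlap_score are hypothetical names, and the toy scorer is a stand-in for a real model, present only so the example runs without network access.

```python
def rerank_with(score_fn, query, candidates, top_n=5):
    """Reorder candidates by score_fn(query, chunk), highest first."""
    scored = [(score_fn(query, c["chunk"]), i) for i, c in enumerate(candidates)]
    # Sort by score descending; ties keep the original (fused) order.
    scored.sort(key=lambda pair: (-pair[0], pair[1]))
    return [candidates[i] for _, i in scored[:top_n]]

def toy_overlap_score(query, chunk):
    """Stand-in scorer: fraction of query words present in the chunk."""
    q_words = set(query.lower().split())
    c_words = set(chunk.lower().split())
    return len(q_words & c_words) / len(q_words) if q_words else 0.0

candidates = [
    {"id": 1, "chunk": "general notes on firmware"},
    {"id": 2, "chunk": "how to reset firmware on the device"},
    {"id": 3, "chunk": "billing and invoices"},
]
top = rerank_with(toy_overlap_score, "reset firmware", candidates, top_n=2)
print([c["id"] for c in top])
# [2, 1]
```

With a real model, score_fn would wrap the model's pairwise scoring call, and you would likely score all candidates in one batch rather than one pair at a time.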

The evaluation script

You cannot improve what you do not measure. Build a small eval set with 30 to 50 queries from real user logs (or representative synthetic ones), each tagged with the IDs of the chunks that should be retrieved. Then measure recall at k.
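The evaluation code reads eval.json and expects each entry to carry a "query" string and a "relevant_ids" list. A minimal sketch of that shape, plus a validator to catch malformed entries before they skew the numbers. The sample queries and ids are invented, and validate_eval_set is a hypothetical helper.

```python
import json

SAMPLE_EVAL_JSON = """
[
  {"query": "How do I rotate an API key?", "relevant_ids": [101, 245]},
  {"query": "What does error E-4012 mean?", "relevant_ids": [877]}
]
"""

def validate_eval_set(items):
    """Return a list of problems found; an empty list means the set is usable."""
    problems = []
    for n, item in enumerate(items):
        if not isinstance(item.get("query"), str) or not item["query"].strip():
            problems.append(f"entry {n}: missing or empty 'query'")
        ids = item.get("relevant_ids")
        if not isinstance(ids, list) or not ids:
            problems.append(f"entry {n}: 'relevant_ids' must be a non-empty list")
    return problems

eval_items = json.loads(SAMPLE_EVAL_JSON)
print(validate_eval_set(eval_items))
# []
```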

import json

def recall_at_k(retrieved_ids, relevant_ids, k):
    # Fraction of the relevant chunks that appear in the top k results.
    # Returns None for queries with no labeled chunks so they can be skipped.
    retrieved_top_k = set(retrieved_ids[:k])
    if not relevant_ids:
        return None
    return len(retrieved_top_k & set(relevant_ids)) / len(relevant_ids)

def evaluate(eval_set, retriever):
    results = []
    for item in eval_set:
        # Ask for 10 results once, then score the same list at several cutoffs.
        retrieved = retriever(item["query"], top_n=10)
        ids = [r["id"] for r in retrieved]
        results.append({
            "query": item["query"],
            "recall@1": recall_at_k(ids, item["relevant_ids"], 1),
            "recall@3": recall_at_k(ids, item["relevant_ids"], 3),
            "recall@5": recall_at_k(ids, item["relevant_ids"], 5),
            "recall@10": recall_at_k(ids, item["relevant_ids"], 10),
        })
    return results

with open("eval.json") as f:
    eval_set = json.load(f)

print("Dense only:")
print(evaluate(eval_set, lambda q, top_n: search_dense(q, k=top_n)))

print("Hybrid:")
print(evaluate(eval_set, lambda q, top_n: hybrid_search(q, k=top_n)))

print("Hybrid + Rerank:")
print(evaluate(eval_set, retrieve))

Run this on every commit. Track the numbers in a dashboard. When somebody proposes "let's swap to model X" or "let's tune the embedding," you have a number that tells you whether their change actually helped.
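The evaluate function returns one row per query, which is hard to track over time. A sketch of an aggregation step that averages each metric and skips None values (queries with no labeled chunks), plus a simple threshold check you could run in CI. The names summarize and check_regression, the tolerance value, and the sample rows are all assumptions for illustration.

```python
def summarize(results):
    """Average each recall@k column, ignoring queries scored as None."""
    summary = {}
    metric_names = [key for key in results[0] if key != "query"] if results else []
    for name in metric_names:
        values = [row[name] for row in results if row[name] is not None]
        summary[name] = sum(values) / len(values) if values else None
    return summary

def check_regression(current, baseline, tolerance=0.02):
    """Return the metrics that dropped more than `tolerance` below baseline."""
    return [
        name for name, base in baseline.items()
        if base is not None
        and current.get(name) is not None
        and current[name] < base - tolerance
    ]

rows = [
    {"query": "q1", "recall@1": 1.0, "recall@3": 1.0},
    {"query": "q2", "recall@1": 0.0, "recall@3": 0.5},
    {"query": "q3", "recall@1": None, "recall@3": None},
]
current = summarize(rows)
regressions = check_regression(current, {"recall@1": 0.6, "recall@3": 0.75})
print(current)       # {'recall@1': 0.5, 'recall@3': 0.75}
print(regressions)   # ['recall@1']
```

Failing the build when check_regression returns a non-empty list gives you the "number that tells you whether the change helped" without anyone having to read a dashboard.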

In a typical deployment of mine, recall@5 goes from 0.62 with dense-only to 0.79 with hybrid to 0.91 with hybrid plus rerank. That last 12 points is what the user perceives as "the system finally works."

Pgvector, Qdrant, or Turbopuffer

Pgvector is what you use when your data is already in Postgres or you want to keep your stack small. It scales fine to roughly 10 million vectors with the right HNSW indexing. Beyond that, you start fighting Postgres on memory and you should consider Qdrant or Turbopuffer.
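For reference, a sketch of the HNSW index setup this refers to, written as SQL strings you could run through the engine from earlier. Two assumptions to flag. First, the column is named embedding, as in the queries above. Second, the embeddings fit pgvector's indexing limit: HNSW indexes support at most 2,000 dimensions for the vector type, while text-embedding-3-large returns 3,072 by default, so you would need to request shorter embeddings through the API's dimensions parameter or store them as halfvec. The m and ef_construction values shown are pgvector's defaults, and hnsw_index_sql is a hypothetical helper.

```python
def hnsw_index_sql(table="documents", column="embedding", m=16, ef_construction=64):
    """Build the CREATE INDEX statement for a cosine-distance HNSW index."""
    return (
        f"CREATE INDEX IF NOT EXISTS {table}_{column}_hnsw_idx "
        f"ON {table} USING hnsw ({column} vector_cosine_ops) "
        f"WITH (m = {m}, ef_construction = {ef_construction});"
    )

# Per-session knob: higher ef_search trades latency for recall (default is 40).
SET_EF_SEARCH_SQL = "SET hnsw.ef_search = 100;"

index_sql = hnsw_index_sql()
print(index_sql)
```

vector_cosine_ops matches the <=> operator used in search_dense. An index built with a different operator class would not be used for that query.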

Qdrant has built-in hybrid search support, including BM25, which removes the dual-query Postgres pattern. The architecture is similar to what we built, just operationally cleaner if you are already running Qdrant.

Turbopuffer is the new entrant. It is built on object storage and is meaningfully cheaper for large corpora. Its hybrid search story is solid. If you have north of 50 million vectors, look there first.

The pattern is the same in all three. Dense plus sparse, fused, reranked. The pieces are different. The architecture is identical.

Why this is the highest-leverage RAG fix

Every team I have worked with that ran into "RAG quality plateau" had the same setup. Dense-only search, no reranker, lots of effort spent on prompt engineering and model swaps. The improvements they were chasing were 2 or 3 percentage points. Hybrid plus rerank is 15 to 30 percentage points. The work is bounded, the eval is measurable, and the architecture is well understood.

If you take one thing from this article: hybrid search and reranking is not advanced RAG. It is the new baseline. Anything below this in 2026 is a quality regression you are choosing to accept. Spend the 90 minutes.

Gloss What This Means For You

If your production RAG system is underperforming, focus first on retrieval: keep your dense embeddings, add a BM25-style keyword retriever (Postgres FTS is usually enough), and combine the two with reciprocal rank fusion before applying a reranker. Expect the biggest wins on queries with specific identifiers, uncommon terminology, or vocabulary mismatch between question and source text. Run a simple retrieval evaluation before and after so you can quantify the recall lift and justify the change to stakeholders.