TL;DR
Why filtering after RRF fusion loses the right chunks, and how a "drop trait → mode → grade" progressive relaxation ladder keeps narrow queries answerable.
A hybrid retriever combines a dense embedding model with a sparse BM25 index, fuses results with reciprocal rank fusion, and reranks. Adding metadata filtering on top of this — "only chunks tagged grade=4 and mode=active" — looks like a one-line change. It is not. Where the filter applies decides whether your retrieval quality survives narrow queries.
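The fusion step can be sketched in a few lines. This is a minimal reciprocal rank fusion implementation; the chunk IDs and the conventional k=60 constant are illustrative, and the retrievers that produce the ranked lists are assumed to exist upstream.

```python
def rrf_fuse(ranked_lists, k=60):
    """Fuse ranked lists of chunk IDs into one ranking.

    Each chunk scores sum(1 / (k + rank)) across the lists it appears
    in, so chunks ranked highly by several retrievers rise to the top.
    """
    scores = {}
    for ranking in ranked_lists:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] = scores.get(chunk_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=scores.get, reverse=True)

dense = ["c3", "c1", "c7"]   # hypothetical dense top-K
sparse = ["c1", "c9", "c3"]  # hypothetical BM25 top-K
fused = rrf_fuse([dense, sparse])  # "c1" wins: ranked high by both
```

Note that RRF only sees ranks, not raw scores, which is why it composes cleanly across a dense retriever and BM25 without score normalization.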
Post-filter loses chunks before the reranker sees them
The naive integration: retrieve top-K from each retriever, fuse, drop chunks whose metadata fails the filter. For broad queries this is fine — most chunks pass. For narrow queries (a specific grade and mode in a small corpus), 80% of the top-K may fail the filter. Now the reranker has 4 chunks to work with instead of 30, and the answer goes from "evidence-grounded" to "best of a poor pool."
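The failure mode is easy to demonstrate. In this sketch the fused list, chunk metadata, and filter values are all made up; the point is only how much of the pool survives a narrow post-filter.

```python
# A fused top-30 where only every fifth chunk carries the target grade.
fused = [{"id": i, "grade": 4 if i % 5 == 0 else 3} for i in range(30)]

# Filter applied AFTER fusion: 6 of 30 chunks survive, so the
# reranker works with a starved candidate pool.
survivors = [c for c in fused if c["grade"] == 4]
print(len(fused), "->", len(survivors))  # 30 -> 6
```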
Prefilter keeps the candidate pool full
The fix: push the filter down into both retrievers. Qdrant supports filter-during-search natively, so the dense side already retrieves only filter-passing chunks. The sparse side runs BM25 over the same prefiltered set. Fusion sees 200 candidates instead of 200-of-which-160-fail. The reranker gets a full 30-chunk input regardless of how narrow the filter is.
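The logic of filter-during-search can be sketched in pure Python. Qdrant does this natively when a filter is passed on the search call itself; here the corpus layout, scoring function, and metadata keys are stand-ins, but the ordering of operations is the point: restrict, then rank.

```python
def prefiltered_search(corpus, score_fn, metadata_filter, k):
    # Restrict the candidate set BEFORE top-k selection, so every
    # one of the k results already satisfies the filter.
    candidates = [chunk for chunk in corpus
                  if all(chunk["meta"].get(key) == value
                         for key, value in metadata_filter.items())]
    return sorted(candidates, key=score_fn, reverse=True)[:k]

# Hypothetical 10-chunk corpus where odd IDs carry grade=4.
corpus = [{"id": i, "meta": {"grade": 3 + i % 2}, "score": i}
          for i in range(10)]
hits = prefiltered_search(corpus, lambda c: c["score"], {"grade": 4}, k=3)
```

A post-filter over the same corpus would first take the top-k by score and then discard failures, returning fewer than k chunks; the prefiltered version always fills its quota when enough matching chunks exist.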
Progressive relaxation handles the empty-set case
Narrow filters sometimes return zero candidates — the corpus has no grade-4 active-mode chunk for "synonym practice for adjectives." A retrieval that returns zero is worse than one that returns slightly off-target chunks; the LLM produces "I do not have material on this" instead of generating from analogous content.
The relaxation ladder: drop the most specific trait (the writing-trait tag) first, retry; if still empty, drop mode (active/passive); if still empty, drop grade. Each step is one Qdrant call. The query that hit the relaxed level is logged so editors can see which trait/mode/grade combinations are sparse and decide whether to add content or merge tags.
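The ladder itself is a small loop. In this sketch `search` stands in for one filtered Qdrant call, and the key names mirror the article's tags; the logging format is illustrative.

```python
# Filter dimensions to drop, most specific first.
RELAXATION_ORDER = ["trait", "mode", "grade"]

def search_with_relaxation(search, filters):
    """Retry `search` with progressively looser filters until it hits."""
    active = dict(filters)
    for level, key in enumerate([None] + RELAXATION_ORDER):
        if key is not None:
            active.pop(key, None)  # drop the next-most-specific dimension
        results = search(active)   # one filtered retrieval call per rung
        if results:
            if level > 0:
                # Surface sparse tag combinations to content editors.
                print(f"relaxed {level}x on filters {filters}")
            return results, level
    return [], len(RELAXATION_ORDER)
```

Returning the relaxation level alongside the results keeps the caller honest: downstream prompting can tell the LLM when it is generating from analogous rather than exactly-matching material.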
What this looks like in practice
- Strict-filter queries: ~85% — relaxation never triggers
- ~12% relax once (drop trait), ~3% relax twice (drop mode), <0.5% relax three times
- Reranker input size: stays at 30 chunks regardless of filter narrowness
- Corpus: 584 chunks across 188 catalogued lessons
- P50 retrieval latency: ~80ms strict, +40ms per relaxation level
When this hurts
Push-down filters require indexed metadata fields. If your filter dimensions change weekly, every change is a reindex. Pick filter dimensions that are part of your domain model — grade, content type, language — not transient experiment flags.
Muhammad Mudassir
Founder & CEO, Cognilium AI | 10+ years experience
