Muhammad Mudassir
Founder & CEO, Cognilium AI
The K-12 EdTech publisher had built a proprietary writing methodology spanning 30 PDFs (~1.37M characters), 188 catalogued mini-lessons, and 50+ named teaching strategies. Teachers who purchased the courses had access to all of it, but translating that material into ready-to-use, grade-specific, trait-specific classroom lessons took 25-30 hours per week per teacher. A traditional search tool only finds content; teachers needed a tool that understood the methodology deeply enough to generate complete, classroom-ready 4-step mini-lessons with the correct terminology and pedagogical structure.
The solution is a retrieval-augmented generation system embedded directly in the LearnWorlds LMS via iframe. Phase 1 (Query Understanding) parses the teacher's natural-language question into structured intent, grade, trait, and writing mode. Phase 2 (Hybrid Retrieval) runs dense + sparse search over 584 curated chunks with metadata pre-filtering and progressive relaxation. Phase 3 assembles a 6,000-token context budget with content-type priority sorting. Phase 4 generates either a structured 4-step mini-lesson (Pydantic schema) or free-text coaching advice. Phase 5 validates every lesson against a 100-point judge rubric with up to 3 temperature-escalated retries. The full pipeline runs inside a 60-second time budget with graceful degradation.
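The retry loop in Phase 5 can be sketched roughly as follows. This is a minimal illustration, not the production code: `generate` and `judge` are hypothetical callables standing in for the Phase 4 generator and the judge, the temperature values are assumed, and "graceful degradation" is interpreted here as returning the best-scoring lesson seen before the deadline.

```python
import time

# Assumption: one initial attempt plus up to 3 retries at rising temperature.
TEMPERATURES = (0.3, 0.5, 0.7, 0.9)
VALIDATED_SCORE = 85  # "validated" threshold on the 100-point rubric


def generate_validated(generate, judge, deadline, clock=time.monotonic):
    """Return (lesson, score) for the best attempt made before the deadline.

    generate(temperature) -> lesson and judge(lesson) -> int are hypothetical
    stand-ins for the generator and the rubric judge.
    """
    best = None
    for temperature in TEMPERATURES:
        # Always make one attempt; after that, stop once the budget is spent.
        if best is not None and clock() >= deadline:
            break
        lesson = generate(temperature)
        score = judge(lesson)
        if best is None or score > best[1]:
            best = (lesson, score)
        if score >= VALIDATED_SCORE:
            break  # good enough to ship immediately
    return best


# Usage with fake components: the third attempt clears the threshold.
scores = iter([70, 80, 90])
calls = []


def fake_generate(temperature):
    calls.append(temperature)
    return f"lesson@{temperature}"


def fake_judge(lesson):
    return next(scores)


lesson, score = generate_validated(
    fake_generate, fake_judge, deadline=time.monotonic() + 60
)
```

Passing the clock in as a parameter keeps the loop testable without real waiting.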
Dense embeddings (OpenAI text-embedding-3-large, 3072-dim) catch semantic matches. Sparse BM25 (fastembed) catches the publisher-specific terminology that the embeddings miss, and is boosted with the 3-8 expanded terms extracted by the Query Understanding pass. Reciprocal Rank Fusion merges the two ranked lists. The metadata filter is pushed down into both retrievers before fusion, so narrow queries (specific grade + mode + trait) still hand the reranker a full input pool of 30 candidates.
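For readers unfamiliar with Reciprocal Rank Fusion, the merge step can be sketched in a few lines. This is a generic illustration of the technique, assuming each retriever has already applied the metadata filter and returned an ordered list of chunk IDs; the constant `k=60` is a commonly used default, not a value taken from this system, and the chunk IDs are made up.

```python
from collections import defaultdict


def reciprocal_rank_fusion(ranked_lists, k=60, top_n=30):
    """Fuse several ranked lists of chunk IDs into one ranked list.

    Each chunk scores sum(1 / (k + rank)) over the lists it appears in,
    where rank starts at 1 for the top hit of each list.
    """
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID so output is deterministic.
    fused = sorted(scores, key=lambda cid: (-scores[cid], cid))
    return fused[:top_n]


# Hypothetical filtered results from the dense and sparse retrievers.
dense_hits = ["c12", "c07", "c33", "c02"]
sparse_hits = ["c07", "c41", "c12", "c19"]
fused = reciprocal_rank_fusion([dense_hits, sparse_hits])
```

Chunks that both retrievers rank highly (`c07`, `c12`) rise to the top, because RRF uses only rank positions and never compares raw dense and BM25 scores directly.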
When the strict filter returns fewer than 5 candidates, the system relaxes one dimension at a time — first the writing trait, then the mode, then the grade. Each relaxation is one Qdrant call; ~85% of queries never trigger relaxation, ~12% relax once, <0.5% reach the no-filter floor. Editors see a log of relaxed queries and decide whether to add content or merge tags.
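The relaxation ladder described above can be sketched as a short loop. This is an illustrative version only: `search` is a hypothetical callable that takes a filter dict and returns hits (in the real system that would be one Qdrant call), and the in-memory corpus exists purely to exercise the logic.

```python
RELAXATION_ORDER = ("trait", "mode", "grade")  # dropped in this order
MIN_CANDIDATES = 5


def retrieve_with_relaxation(search, filters, min_candidates=MIN_CANDIDATES):
    """Run a filtered search, dropping one filter dimension at a time.

    Returns (hits, relaxed), where relaxed lists the dimensions that were
    dropped, so the query can be logged for editors to review.
    """
    active = dict(filters)
    relaxed = []
    hits = search(active)
    for dimension in RELAXATION_ORDER:
        if len(hits) >= min_candidates:
            break
        if dimension not in active:
            continue
        del active[dimension]
        relaxed.append(dimension)
        hits = search(active)  # one extra search call per relaxation
    return hits, relaxed


# Toy corpus: six grade-3 narrative chunks, only two tagged with "voice".
CORPUS = [
    {"id": i, "grade": 3, "mode": "narrative", "trait": "voice" if i < 2 else "ideas"}
    for i in range(6)
]


def fake_search(filters):
    return [c for c in CORPUS if all(c[key] == value for key, value in filters.items())]


hits, relaxed = retrieve_with_relaxation(
    fake_search, {"grade": 3, "mode": "narrative", "trait": "voice"}
)
```

Here the strict filter finds only two chunks, so the trait is dropped once and the search returns all six; `relaxed` records `["trait"]` for the editors' log.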
A 298-term domain vocabulary is loaded at server startup. Every generated output gets a 2-5ms regex-based extraction of named-entity-style terms (capitalized, multi-word, matching domain patterns). Each term is checked against the vocabulary set. Match: pass. Fuzzy match within edit distance 1: log + auto-correct. No match, no fuzzy match: retry with a stricter prompt listing allowed terms inline. After two retries, fall back to the closest valid term by embedding similarity and flag for human review.
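The per-term decision (pass, auto-correct, or retry) can be sketched as below. This is a simplified illustration under stated assumptions: the three vocabulary entries are placeholders rather than the publisher's real terms, the term-extraction regex is omitted, and the embedding-similarity fallback after two retries is left out.

```python
# Placeholder vocabulary; the real set has 298 publisher-specific terms.
VOCAB = {"Turn & Talk", "Think Aloud", "Mentor Text"}


def within_one_edit(a, b):
    """True if a and b differ by at most one insert, delete, or substitution."""
    if a == b:
        return True
    if abs(len(a) - len(b)) > 1:
        return False
    if len(a) > len(b):
        a, b = b, a  # make a the shorter (or equal-length) string
    i = 0
    while i < len(a) and a[i] == b[i]:
        i += 1
    if len(a) == len(b):
        return a[i + 1:] == b[i + 1:]  # exactly one substituted character
    return a[i:] == b[i + 1:]  # b has exactly one extra character


def check_term(term, vocab):
    """Classify a generated term as 'pass', 'corrected', or 'retry'."""
    if term in vocab:
        return "pass", term
    close = sorted(v for v in vocab if within_one_edit(term, v))
    if close:
        return "corrected", close[0]  # caller should also log the correction
    return "retry", None  # caller re-prompts with the allowed terms inline


exact = check_term("Turn & Talk", VOCAB)
fuzzy = check_term("Mentor Texts", VOCAB)
unknown = check_term("Voiceful Sentence", VOCAB)
```

Restricting the fuzzy match to edit distance 1 keeps auto-correction conservative: it fixes plurals and single-character slips without silently mapping an invented term onto a real one.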
Stage 1 (code, <5ms): every step present, ≥2 sentences each, first-person language in step 2B (the think-aloud), Turn & Talk cues in step 3. Stage 2 (GPT-4o-mini judge, ~500ms): scores the lesson on a 100-point rubric across framework adherence, grade appropriateness, methodology accuracy, completeness, and teacher language quality. A score ≥85 is "validated" and ships immediately; 60-84 is "acceptable" and ships only after retries are exhausted; <40 is an "error", in which case the system refuses and asks the teacher to rephrase.
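The Stage 1 structural checks are cheap enough to express as plain string tests. The sketch below is an assumption-heavy illustration: the step keys (`step_1` through `step_4`, with `step_2a` and `step_2b`) are hypothetical names for the lesson fields, and the sentence and first-person detectors are deliberately naive regexes.

```python
import re

# Hypothetical field names for the lesson's steps.
REQUIRED_STEPS = ("step_1", "step_2a", "step_2b", "step_3", "step_4")
SENTENCE_END = re.compile(r"[.!?](?:\s|$)")
FIRST_PERSON = re.compile(r"\b(?:I|[Mm]y|[Mm]e)\b")


def structural_errors(lesson):
    """Return a list of structural problems; an empty list means Stage 1 passes."""
    errors = []
    for key in REQUIRED_STEPS:
        text = lesson.get(key, "").strip()
        if not text:
            errors.append(f"{key}: missing")
        elif len(SENTENCE_END.findall(text)) < 2:
            errors.append(f"{key}: fewer than 2 sentences")
    if not FIRST_PERSON.search(lesson.get("step_2b", "")):
        errors.append("step_2b: no first-person think-aloud language")
    if "turn & talk" not in lesson.get("step_3", "").lower():
        errors.append("step_3: no Turn & Talk cue")
    return errors


good = {
    "step_1": "Today we will study strong leads. A lead hooks the reader.",
    "step_2a": "Look at this mentor text. Notice how the first line asks a question.",
    "step_2b": "I want my lead to surprise the reader. I will try a question first.",
    "step_3": "Turn & Talk with your partner. Share one lead you could write.",
    "step_4": "Now write your own lead. Use the strategy from today.",
}
bad = dict(good, step_3="Share with a partner.")

good_errors = structural_errors(good)
bad_errors = structural_errors(bad)
```

Running these checks before the judge means malformed lessons can be rejected without spending a model call.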
LearnWorlds passes user_id + email to the iframe via Liquid template variables. The backend validates: Sec-Fetch-Dest header (iframe-only), HMAC-SHA256 signed URLs (2-min expiry), referrer, LearnWorlds API user verification, LearnWorlds API enrollment check, JWT (HS256, 15-min httpOnly cookie), and re-enrollment check on every refresh. Rate-limited to 50 queries/user/hour.
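One of the checks above, the HMAC-SHA256 signed URL with a 2-minute expiry, can be sketched with the Python standard library. The message layout (`user_id|email|issued_at`), the parameter names, and the hard-coded secret are assumptions for illustration; the source does not specify how the signed payload is built.

```python
import hashlib
import hmac
import time

SECRET = b"replace-with-a-real-secret"  # assumption: loaded from config in practice
MAX_AGE_SECONDS = 120  # 2-minute expiry


def sign(user_id, email, issued_at):
    """Return a hex HMAC-SHA256 signature over the (assumed) payload layout."""
    message = f"{user_id}|{email}|{issued_at}".encode()
    return hmac.new(SECRET, message, hashlib.sha256).hexdigest()


def verify(user_id, email, issued_at, signature, now=None):
    """Accept only unexpired links whose signature matches the parameters."""
    now = time.time() if now is None else now
    if not 0 <= now - issued_at <= MAX_AGE_SECONDS:
        return False  # expired, or timestamped in the future
    expected = sign(user_id, email, issued_at)
    # Constant-time comparison avoids leaking how many characters matched.
    return hmac.compare_digest(expected, signature)


signature = sign("u1", "teacher@example.com", 1000)
fresh = verify("u1", "teacher@example.com", 1000, signature, now=1060)
expired = verify("u1", "teacher@example.com", 1000, signature, now=1200)
tampered = verify("u1", "other@example.com", 1000, signature, now=1060)
```

Because the timestamp is part of the signed message, a client cannot extend a link's lifetime by editing the expiry parameter.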
“It produces lessons that sound like me — same methodology, same phrasing, same pedagogical rigor. Teachers get classroom-ready content in 12 seconds instead of 45 minutes.”
TL;DR
How Cognilium built an AI co-pilot for a K-12 EdTech publisher that saves teachers 22 hours/week by generating methodology-grounded writing mini-lessons via hybrid RAG.
A K-12 writing instruction publisher had built a proprietary methodology spanning 30 PDFs (1.37 million characters), 188 catalogued mini-lessons, and 50+ named teaching strategies. Teachers who bought the courses had access to all of it — but translating it into ready-to-use, grade-specific, trait-specific classroom lessons took 25 to 30 hours per week per teacher. They did not need more search; they needed a generator that understood the methodology deeply enough to produce classroom-ready lessons.
A standard "embed everything, retrieve top-K, summarize" pipeline produces fluent output that sounds vaguely right. For domain-specific instruction, that is the failure mode you cannot afford: teachers notice "voiceful sentence" in place of the publisher's actual term, trust evaporates, and the tool gets dropped. The pipeline had to be designed around three pressures from the start: methodology-faithful generation, grade-appropriate retrieval, and detectable failure at every stage.
Phase 1 (Query Understanding) runs GPT-4o-mini with a 300-line domain vocabulary prompt to extract structured intent, grade, writing trait, writing mode, strategy name, and 3-8 expanded search terms. Phase 2 (Hybrid Retrieval) runs dense + sparse search over 584 manually curated chunks with metadata pre-filtering and progressive relaxation. Phase 3 (Context Assembly) sorts by content-type priority and fills a 6,000-token budget. Phase 4 (Generation) produces either a structured 4-step mini-lesson via Pydantic or free-text coaching. Phase 5 (Validation) runs a 100-point judge with three temperature-escalated retries.
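Phase 3 is the simplest stage to illustrate in code. The sketch below shows one plausible way to sort by content-type priority and fill a fixed token budget; the content-type names, their ordering, the `score` tiebreak, and the 4-characters-per-token estimate are all assumptions, not details from the production system.

```python
# Hypothetical content types, lower number = higher priority.
CONTENT_PRIORITY = {"mini_lesson": 0, "strategy": 1, "background": 2}
TOKEN_BUDGET = 6000


def estimate_tokens(text):
    """Rough heuristic (about 4 characters per token); swap in a real tokenizer."""
    return max(1, len(text) // 4)


def assemble_context(chunks, budget=TOKEN_BUDGET):
    """Pick chunks by priority, then retrieval score, without exceeding the budget."""
    ordered = sorted(
        chunks,
        key=lambda c: (CONTENT_PRIORITY.get(c["type"], 99), -c["score"]),
    )
    selected, used = [], 0
    for chunk in ordered:
        cost = estimate_tokens(chunk["text"])
        if used + cost > budget:
            continue  # too large for the remaining budget; a later chunk may still fit
        selected.append(chunk)
        used += cost
    return selected, used


chunks = [
    {"id": "A", "type": "background", "score": 0.9, "text": "x" * 16000},
    {"id": "B", "type": "mini_lesson", "score": 0.5, "text": "y" * 12000},
    {"id": "C", "type": "strategy", "score": 0.7, "text": "z" * 8000},
]
selected, used = assemble_context(chunks)
selected_ids = [c["id"] for c in selected]
```

In this example the background chunk has the highest retrieval score but is skipped, because the higher-priority mini-lesson and strategy chunks have already consumed 5,000 of the 6,000 tokens.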
The manual chunking was expensive (50+ hours) but irreplaceable — LLM chunking on a small, dense methodology dropped retrieval F1 by 12% in our internal eval. For corpora this small (<2M characters), manual is the right call. For larger corpora the trade-off flips and you need an LLM chunker with rigorous downstream evaluation.
The pattern — hybrid retrieval with prefetch-time filtering, runtime grounding against a domain vocabulary, and two-stage validation with a judge — is the right shape for any RAG product where the corpus is domain-specific, the vocabulary is closed, and the output has to sound like a domain expert. It is the pattern we ship for legal, medical, and regulated SaaS products.
The engineering writeups that explain how the system was built.
Why filtering after RRF fusion loses the right chunks, and how a "drop trait → mode → grade" progressive relaxation ladder keeps narrow queries answerable without dropping retrieval quality.
A startup-loaded domain vocabulary the generator must match against, plus framework rules baked into every prompt — a low-cost pattern that catches hallucinated terminology before the user sees it.
The day-2 ops layer of an LLM product — what to evaluate, what to judge in real time, what to retry, and when to fail closed. The components that turn a prototype into something operable.