How does the co-pilot stay grounded in the publisher's methodology?

Three layers. (1) Hybrid retrieval over 584 manually curated chunks gives every generation up to 6,000 tokens of methodology-faithful context. (2) A 298-term domain vocabulary, loaded at startup, is checked against every output; any term outside the vocabulary triggers a retry with a stricter prompt or a fallback. (3) An LLM-as-judge scores every lesson on a 100-point rubric that covers methodology accuracy before delivery. Together, the three layers make hallucination structurally hard to ship.

Why hybrid retrieval instead of pure semantic search?

The publisher has 50+ named teaching strategies, such as "Slinky Test", "Dead Verbs", and "Yes MA'AM". Pure embedding similarity misses these because the strategy name and the query share no semantic neighborhood. Sparse BM25 (boosted with the expanded publisher terms from the Query Understanding pass) catches exact-term matches. Reciprocal Rank Fusion (RRF) fuses both ranked lists; chunks ranked high in either list get high fused scores. The result is recall on both paraphrased queries ("help kids add details") and exact-term queries ("Dead Verbs strategy").

How does the system handle queries it cannot answer?

Three escape hatches. (1) is_writing_related = false → immediate rejection: "I can only help with writing instruction questions." (2) Required fields missing (for example, no grade for a mini-lesson request) → follow-up question, never a blind generation. (3) Retrieval returns near-empty even after relaxation → coaching mode falls back to general best practices with explicit transparency ("I don't have specific publisher material on this; here's general advice"). Never a silent failure, never confabulated methodology.
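The three escape hatches amount to a small routing decision made before generation. The sketch below is illustrative only: the `QueryIntent` fields other than `is_writing_related`, the action names, and the `min_chunks` threshold are assumptions, not the production schema.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class QueryIntent:
    is_writing_related: bool
    intent: str                  # assumed values: "mini_lesson" or "coaching"
    grade: Optional[str] = None  # required for mini-lesson requests


def route(query: QueryIntent, retrieved_chunks: int, min_chunks: int = 1) -> str:
    """Pick an action for a parsed query (hypothetical action names)."""
    if not query.is_writing_related:
        return "reject"                  # escape hatch 1: off-topic
    if query.intent == "mini_lesson" and query.grade is None:
        return "ask_followup"            # escape hatch 2: missing required field
    if retrieved_chunks < min_chunks:
        return "general_advice"          # escape hatch 3: disclose the fallback
    return "generate"
```

Keeping the decision in one pure function makes each escape hatch easy to unit-test without calling a model.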

What does the 60-second time budget actually buy?

Up to three attempts with temperature escalation (0.3 → 0.4 → 0.5). Most validated lessons land on attempt 1 (~12 sec), some need attempt 2 (~20 sec), and rare ones need attempt 3 (~30 sec). The budget is enforced per stage: if fewer than 15 seconds remain, the retry loop stops; if fewer than 5 seconds remain, the judge is skipped and the lesson ships as "unscored". This guarantees a teacher always gets something within 60 seconds rather than waiting indefinitely on a hard query.
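A minimal sketch of how such a budgeted retry loop could look. The temperatures, the 60/15/5-second thresholds, and the "unscored" label come from the description above; the function names, the injected `generate` and `judge` callables, and the other status strings are assumptions for illustration.

```python
import time
from typing import Callable, Optional, Tuple

TEMPERATURES = (0.3, 0.4, 0.5)
TOTAL_BUDGET_S = 60.0
MIN_RETRY_S = 15.0     # below this, do not start another attempt
MIN_JUDGE_S = 5.0      # below this, skip the judge and ship unscored
VALIDATED_SCORE = 85


def run_with_budget(
    generate: Callable[[float], str],
    judge: Callable[[str], int],
    now: Callable[[], float] = time.monotonic,
) -> Tuple[str, Optional[int], str]:
    """Return (lesson, score, status) without exceeding the time budget."""
    deadline = now() + TOTAL_BUDGET_S
    best: Optional[Tuple[str, int]] = None
    for attempt, temperature in enumerate(TEMPERATURES):
        # The first attempt always runs; later ones need enough time left.
        if attempt > 0 and deadline - now() < MIN_RETRY_S:
            break
        lesson = generate(temperature)
        if deadline - now() < MIN_JUDGE_S:
            return lesson, None, "unscored"
        score = judge(lesson)
        if best is None or score > best[1]:
            best = (lesson, score)
        if score >= VALIDATED_SCORE:
            return lesson, score, "validated"
    # Attempt 0 either returned or set `best`, so it is never None here.
    return best[0], best[1], "retries_exhausted"
```

Passing the clock in as `now` keeps the loop deterministic under test; production code would simply use the default.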

How is this embedded in LearnWorlds without LMS changes?

The co-pilot runs as a Next.js iframe. LearnWorlds passes user_id and email via Liquid template variables. The backend handles authentication itself — HMAC-SHA256 signed URLs (2-min expiry), JWT (15-min, httpOnly cookie), LearnWorlds API enrollment verification on every refresh. Zero changes to LearnWorlds; the only LMS-side requirement is a Learning Center plan or above (API access). Teachers see a chat interface inside their course page; everything else is invisible.

K-12 Writing Co-Pilot Case Study: 22 Hours/Week Saved wit...

Outcome metrics

Client: A K-12 writing-curriculum publisher with a catalogued methodology of 188 mini-lessons and an 11-episode video course, serving teachers across the United States.

Lesson planning time per teacher: 22 hours/week saved (was 25-30 hours/week manual before)
Quality-validated outputs: 100% (judge-scored) (was 0%, no validation layer, before)
End-to-end response time (P50): 12 seconds (was 30-60 min manual planning before)
Hard-fail rate after retries: <0.5% (was n/a before)
Knowledge-base coverage: 584 curated chunks (28 docs) (was 1.37M chars raw PDF before)
Domain vocabulary enforced: 298 terms validated per output

The Challenge

The K-12 EdTech publisher had built a proprietary writing methodology spanning 30 PDFs (~1.37M characters), 188 catalogued mini-lessons, and 50+ named teaching strategies. Teachers who purchased the courses had access to all of it, but translating that into ready-to-use, grade-specific, trait-specific classroom lessons took 25-30 hours per week per teacher. A traditional search tool only finds content; teachers needed a tool that understood the methodology deeply enough to generate complete, classroom-ready 4-step mini-lessons with the correct terminology and pedagogical structure.

The Solution

A retrieval-augmented generation system embedded directly in the LearnWorlds LMS via iframe. Phase 1 (Query Understanding) parses the teacher's natural-language question into structured intent, grade, trait, and writing mode. Phase 2 (Hybrid Retrieval) runs dense + sparse search over 584 curated chunks with metadata pre-filtering and progressive relaxation. Phase 3 (Context Assembly) fills a 6,000-token context budget with content-type priority sorting. Phase 4 (Generation) produces either a structured 4-step mini-lesson (Pydantic schema) or free-text coaching advice. Phase 5 (Validation) scores every lesson against a 100-point judge rubric with up to 3 temperature-escalated retries. The full pipeline runs inside a 60-second time budget with graceful degradation.

Implementation

Hybrid retrieval with prefetch-time metadata filtering

Dense embeddings (OpenAI text-embedding-3-large, 3072-dim) catch semantic matches. Sparse BM25 (fastembed) catches the publisher-specific terminology that the embedding misses — boosted with the 3-8 expanded terms extracted by the Query Understanding pass. Reciprocal Rank Fusion merges the two ranked lists; the filter is pushed down into both retrievers BEFORE fusion so narrow queries (specific grade + mode + trait) keep a full reranker input pool of 30 candidates.
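Reciprocal Rank Fusion itself is only a few lines. The sketch below fuses two already-filtered ranked lists of chunk IDs; the function name is hypothetical, and the smoothing constant `k=60` is the value conventionally used for RRF, assumed here rather than taken from the production configuration. The pool size of 30 matches the reranker input described above.

```python
from collections import defaultdict
from typing import Dict, List, Sequence


def rrf_fuse(ranked_lists: Sequence[Sequence[str]], k: int = 60, top_n: int = 30) -> List[str]:
    """Fuse ranked lists of chunk IDs with Reciprocal Rank Fusion.

    Each list contributes 1 / (k + rank) for every chunk it contains,
    with rank starting at 1. Ties are broken by chunk ID for determinism.
    """
    scores: Dict[str, float] = defaultdict(float)
    for ranking in ranked_lists:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    return sorted(scores, key=lambda cid: (-scores[cid], cid))[:top_n]


dense_hits = ["a", "b", "c"]    # from the embedding retriever
sparse_hits = ["c", "a", "d"]   # from BM25
fused = rrf_fuse([dense_hits, sparse_hits])
```

Because RRF uses only ranks, the dense cosine scores and the BM25 scores never need to be normalized against each other.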

Progressive filter relaxation

When the strict filter returns fewer than 5 candidates, the system relaxes one dimension at a time — first the writing trait, then the mode, then the grade. Each relaxation is one Qdrant call; ~85% of queries never trigger relaxation, ~12% relax once, <0.5% reach the no-filter floor. Editors see a log of relaxed queries and decide whether to add content or merge tags.
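The relaxation order can be sketched as a loop around an injected search function. The threshold of 5 and the trait → mode → grade order follow the description above; `search` stands in for the real Qdrant call, and the filter keys are assumed names.

```python
from typing import Callable, Dict, List, Tuple

RELAX_ORDER = ("trait", "mode", "grade")  # dropped one at a time, in this order
MIN_CANDIDATES = 5


def search_with_relaxation(
    search: Callable[[Dict[str, str]], List[dict]],
    filters: Dict[str, str],
) -> Tuple[List[dict], List[str]]:
    """Run the strict query, then drop one filter per extra call until
    at least MIN_CANDIDATES results come back or no filters remain."""
    active = dict(filters)
    results = search(active)
    relaxed: List[str] = []
    for field in RELAX_ORDER:
        if len(results) >= MIN_CANDIDATES:
            break
        if field not in active:
            continue
        del active[field]
        relaxed.append(field)     # recorded so editors can review relaxed queries
        results = search(active)  # one additional retrieval call per relaxation
    return results, relaxed
```

Returning the list of relaxed fields alongside the results gives the logging hook that the editorial review depends on.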

Anti-hallucination via runtime grounding

A 298-term domain vocabulary is loaded at server startup. Every generated output gets a 2-5 ms regex-based extraction of named-entity-style terms (capitalized, multi-word, matching domain patterns), and each term is checked against the vocabulary set. Exact match: pass. Fuzzy match within edit distance 1: log and auto-correct. No exact or fuzzy match: retry with a stricter prompt listing allowed terms inline. After two retries, fall back to the closest valid term by embedding similarity and flag for human review.

Two-stage validation: structural + judge

Stage 1 (code, <5 ms): every step present, ≥2 sentences each, first-person language in step 2B (the think-aloud), Turn & Talk cues in step 3. Stage 2 (GPT-4o-mini judge, ~500 ms): scores the lesson on a 100-point rubric across framework adherence, grade appropriateness, methodology accuracy, completeness, and teacher language quality. Score ≥85 = "validated", ship immediately; 60-84 = "acceptable", ship after retries are exhausted; <40 = "error", refuse and ask the teacher to rephrase.
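Stage 1 is plain code, so it can be sketched directly. The step keys below ("1", "2A", "2B", "3", "4") are a guess at how a 4-step lesson with a step 2B might be keyed, and the first-person and sentence-splitting heuristics are deliberately crude assumptions; the real schema and rules may differ.

```python
import re
from typing import Dict, List

REQUIRED_STEPS = ("1", "2A", "2B", "3", "4")   # hypothetical step keys
FIRST_PERSON = re.compile(r"\b(?:I|[Mm]y|[Mm]e)\b")
TURN_AND_TALK = re.compile(r"turn\s*(?:&|and)\s*talk", re.IGNORECASE)


def count_sentences(text: str) -> int:
    """Rough sentence count: split on ., !, ? followed by whitespace or end."""
    return len([s for s in re.split(r"[.!?]+(?:\s+|$)", text) if s.strip()])


def structural_errors(lesson: Dict[str, str]) -> List[str]:
    """Return a list of structural problems; an empty list means Stage 1 passes."""
    errors: List[str] = []
    for step in REQUIRED_STEPS:
        body = lesson.get(step, "").strip()
        if not body:
            errors.append(f"step {step}: missing")
        elif count_sentences(body) < 2:
            errors.append(f"step {step}: fewer than 2 sentences")
    if lesson.get("2B") and not FIRST_PERSON.search(lesson["2B"]):
        errors.append("step 2B: no first-person think-aloud language")
    if lesson.get("3") and not TURN_AND_TALK.search(lesson["3"]):
        errors.append("step 3: no Turn & Talk cue")
    return errors
```

Running these cheap checks first means a structurally broken lesson never costs a judge call.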

7-layer security for LMS embedding

LearnWorlds passes user_id + email to the iframe via Liquid template variables. The backend validates: Sec-Fetch-Dest header (iframe-only), HMAC-SHA256 signed URLs (2-min expiry), referrer, LearnWorlds API user verification, LearnWorlds API enrollment check, JWT (HS256, 15-min httpOnly cookie), and re-enrollment check on every refresh. Rate-limited to 50 queries/user/hour.
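The signed-URL layer can be illustrated with the standard library alone. This is a sketch under assumptions: the parameter names (`ts`, `sig`), the canonical payload format, and where the signing step runs are not specified in this write-up; only HMAC-SHA256 and the 2-minute expiry are.

```python
import hashlib
import hmac
import time
from typing import Optional
from urllib.parse import parse_qs, urlencode, urlparse

EXPIRY_S = 120  # 2-minute expiry


def _signature(params: dict, secret: bytes) -> str:
    # Canonical payload: sorted, URL-encoded key/value pairs.
    payload = urlencode(sorted(params.items()))
    return hmac.new(secret, payload.encode(), hashlib.sha256).hexdigest()


def sign_url(base_url: str, user_id: str, email: str, secret: bytes,
             now: Optional[float] = None) -> str:
    issued = int(time.time() if now is None else now)
    params = {"user_id": user_id, "email": email, "ts": str(issued)}
    query = urlencode(sorted(params.items()))
    return f"{base_url}?{query}&sig={_signature(params, secret)}"


def verify_url(url: str, secret: bytes, now: Optional[float] = None) -> bool:
    query = parse_qs(urlparse(url).query)
    try:
        params = {key: query[key][0] for key in ("user_id", "email", "ts")}
        supplied = query["sig"][0]
        issued = int(params["ts"])
    except (KeyError, ValueError):
        return False
    current = time.time() if now is None else now
    fresh = 0 <= current - issued <= EXPIRY_S
    # compare_digest avoids leaking the match position through timing.
    return fresh and hmac.compare_digest(supplied, _signature(params, secret))
```

The timestamp is part of the signed payload, so a client cannot extend the expiry window without invalidating the signature.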

“It produces lessons that sound like me — same methodology, same phrasing, same pedagogical rigor. Teachers get classroom-ready content in 12 seconds instead of 45 minutes.”

Founder, K-12 EdTech publisher

TL;DR

How Cognilium built an AI co-pilot for a K-12 EdTech publisher that saves teachers 22 hours/week by generating methodology-grounded writing mini-lessons via hybrid RAG.

An AI co-pilot embedded in the LearnWorlds LMS that generates classroom-ready writing mini-lessons grounded in 1.37M characters of the publisher's K-12 methodology, replacing 22 hours/week of manual lesson planning per teacher.

RAG case study, K-12 EdTech AI, hybrid retrieval, Qdrant production, LLM-as-judge, anti-hallucination, LearnWorlds integration, AI lesson planning

A K-12 writing instruction publisher had built a proprietary methodology spanning 30 PDFs (1.37 million characters), 188 catalogued mini-lessons, and 50+ named teaching strategies. Teachers who bought the courses had access to all of it — but translating it into ready-to-use, grade-specific, trait-specific classroom lessons took 25 to 30 hours per week per teacher. They did not need more search; they needed a generator that understood the methodology deeply enough to produce classroom-ready lessons.

Why a traditional RAG would have failed

A standard "embed everything, retrieve top-K, summarize" pipeline produces fluent output that sounds vaguely right. For domain-specific instruction, that is the failure mode you cannot afford: teachers notice "voiceful sentence" in place of the publisher's actual term, trust evaporates, and the tool gets dropped. The pipeline had to be designed around three pressures from the start: methodology-faithful generation, grade-appropriate retrieval, and detectable failure at every stage.

The five-phase pipeline

Phase 1 (Query Understanding) runs GPT-4o-mini with a 300-line domain vocabulary prompt to extract structured intent, grade, writing trait, writing mode, strategy name, and 3-8 expanded search terms. Phase 2 (Hybrid Retrieval) runs dense + sparse search over 584 manually curated chunks with metadata pre-filtering and progressive relaxation. Phase 3 (Context Assembly) sorts by content-type priority and fills a 6,000-token budget. Phase 4 (Generation) produces either a structured 4-step mini-lesson via Pydantic or free-text coaching. Phase 5 (Validation) runs a 100-point judge with three temperature-escalated retries.
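Phase 3 can be sketched as a priority sort followed by a greedy fill. The 6,000-token budget comes from the description above; the content-type names, their priority order, and the `Chunk` fields are hypothetical, since the actual categories are not listed here.

```python
from dataclasses import dataclass
from typing import List, Tuple

TOKEN_BUDGET = 6000
# Hypothetical content types; lower number means higher priority.
PRIORITY = {"mini_lesson": 0, "strategy": 1, "background": 2}


@dataclass
class Chunk:
    chunk_id: str
    content_type: str
    tokens: int
    score: float   # retrieval or reranker score, higher is better


def assemble_context(chunks: List[Chunk], budget: int = TOKEN_BUDGET) -> Tuple[List[Chunk], int]:
    """Sort by content-type priority, then score, and greedily fill the budget."""
    ordered = sorted(
        chunks,
        key=lambda c: (PRIORITY.get(c.content_type, len(PRIORITY)), -c.score),
    )
    picked: List[Chunk] = []
    used = 0
    for chunk in ordered:
        if used + chunk.tokens > budget:
            continue               # skip chunks that do not fit; smaller ones may
        picked.append(chunk)
        used += chunk.tokens
    return picked, used
```

Skipping an oversized chunk instead of stopping lets smaller, lower-priority chunks use the remaining budget.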

The numbers from production

22 hours/week saved per teacher (replacing 25-30 hours of manual lesson planning)
584 curated chunks across 28 source documents — 100% content coverage, no LLM-summarized loss
P50 end-to-end response time: 12 seconds (validated on first attempt)
Hard-fail rate after all retries: <0.5%
~$0.10 per validated mini-lesson (GPT-4o generation + GPT-4o-mini judge + retrieval)

What we would do differently

The manual chunking was expensive (50+ hours) but irreplaceable — LLM chunking on a small, dense methodology dropped retrieval F1 by 12% in our internal eval. For corpora this small (<2M characters), manual is the right call. For larger corpora the trade-off flips and you need an LLM chunker with rigorous downstream evaluation.

Where this generalizes

The pattern — hybrid retrieval with prefetch-time filtering, runtime grounding against a domain vocabulary, and two-stage validation with a judge — is the right shape for any RAG product where the corpus is domain-specific, the vocabulary is closed, and the output has to sound like a domain expert. It is the pattern we ship for legal, medical, and regulated SaaS products.

Technologies used

FastAPI, Python 3.10, Qdrant, OpenAI text-embedding-3-large, GPT-4o, GPT-4o-mini, BM25 (fastembed), LearnWorlds API, JWT (HS256), HMAC-SHA256, Cloud Run, Vercel

How a K-12 EdTech Publisher Saves Teachers 22 Hours/Week with an AI Writing Co-Pilot
