Case Study | 22 hours/week
Education / EdTech | RAG System

How a K-12 EdTech Publisher Saves Teachers 22 Hours/Week with an AI Writing Co-Pilot

  • 22 hours/week of manual lesson planning eliminated per teacher
  • 100% of mini-lessons validated against a 100-point quality rubric before delivery
  • Sub-15-second average end-to-end response time (query → validated lesson)
  • 584 curated knowledge chunks across 28 source documents — 100% content coverage
  • 298-term domain vocabulary catches hallucinated terminology before it reaches teachers
8 weeks
2 engineers
Muhammad Mudassir

Founder & CEO, Cognilium AI

Outcome metrics

  • Lesson planning time per teacher: 22 hours/week saved (previously 25-30 hours/week, manual)
  • Quality-validated outputs: 100%, judge-scored (previously 0%, no validation layer)
  • End-to-end response time (P50): 12 seconds (previously 30-60 min of manual planning)
  • Hard-fail rate after retries: <0.5% (previously n/a)
  • Knowledge-base coverage: 584 curated chunks from 28 docs (previously 1.37M chars of raw PDF)
  • Domain vocabulary enforced: 298 terms validated per output

The Challenge

The K-12 EdTech publisher had built a proprietary writing methodology spanning 30 PDFs (~1.37M characters), 188 catalogued mini-lessons, and 50+ named teaching strategies. Teachers who purchased the courses had access to all of it, but translating that material into ready-to-use, grade-specific, trait-specific classroom lessons took 25-30 hours per week per teacher. A traditional search tool only finds content; teachers needed a tool that understood the methodology deeply enough to generate complete, classroom-ready 4-step mini-lessons with the correct terminology and pedagogical structure.

The Solution

The solution is a retrieval-augmented generation system embedded directly in the LearnWorlds LMS via iframe. Phase 1 (Query Understanding) parses the teacher's natural-language question into structured intent, grade, trait, and writing mode. Phase 2 (Hybrid Retrieval) runs dense + sparse search over 584 curated chunks with metadata pre-filtering and progressive relaxation. Phase 3 (Context Assembly) fills a 6,000-token context budget, sorting chunks by content-type priority. Phase 4 (Generation) produces either a structured 4-step mini-lesson (Pydantic schema) or free-text coaching advice. Phase 5 (Validation) scores every lesson against a 100-point judge rubric, with up to 3 temperature-escalated retries. The full pipeline runs inside a 60-second time budget with graceful degradation.
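
The retry behaviour in Phase 5 can be sketched roughly as follows. This is a minimal illustration, not the production code: `generate` and `judge` are hypothetical callables, the temperature ladder is invented for the example, and the 85/60 thresholds are taken from the rubric described under Implementation. How scores below 60 are handled is an assumption of this sketch.

```python
import time

VALIDATED = 85    # rubric threshold: ship immediately
ACCEPTABLE = 60   # rubric threshold: ship once retries are exhausted
# Hypothetical temperature ladder: first attempt plus up to 3 retries.
TEMPERATURES = [0.2, 0.4, 0.6, 0.8]


def generate_validated(generate, judge, budget_seconds=60.0, clock=time.monotonic):
    """Generate, judge, and retry at higher temperature within a time budget.

    generate(temperature) -> lesson and judge(lesson) -> int score are
    hypothetical stand-ins for the LLM calls.
    """
    deadline = clock() + budget_seconds
    best_score, best_lesson, attempts = -1, None, 0
    for temperature in TEMPERATURES:
        if attempts > 0 and clock() >= deadline:
            break  # out of time: degrade to the best lesson seen so far
        lesson = generate(temperature)
        score = judge(lesson)
        attempts += 1
        if score > best_score:
            best_score, best_lesson = score, lesson
        if score >= VALIDATED:
            return {"status": "validated", "score": score,
                    "lesson": lesson, "attempts": attempts}
    # Assumption: anything under 60 is rejected here; the write-up only
    # names <40 as an explicit error.
    status = "acceptable" if best_score >= ACCEPTABLE else "rejected"
    return {"status": status, "score": best_score,
            "lesson": best_lesson, "attempts": attempts}


# Fake model calls so the sketch runs without any API access.
demo_scores = {0.2: 70, 0.4: 88, 0.6: 90, 0.8: 90}
result = generate_validated(
    generate=lambda temperature: {"temperature": temperature},
    judge=lambda lesson: demo_scores[lesson["temperature"]],
)
print(result["status"], result["attempts"])
```

Passing the clock in as a parameter keeps the time budget testable without sleeping.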

Implementation

Hybrid retrieval with prefetch-time metadata filtering

Dense embeddings (OpenAI text-embedding-3-large, 3072-dim) catch semantic matches. Sparse BM25 (fastembed) catches the publisher-specific terminology that the embedding misses — boosted with the 3-8 expanded terms extracted by the Query Understanding pass. Reciprocal Rank Fusion merges the two ranked lists; the filter is pushed down into both retrievers BEFORE fusion so narrow queries (specific grade + mode + trait) keep a full reranker input pool of 30 candidates.
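
To make the ordering concrete, here is a small in-memory sketch of filter-then-fuse. It does not call Qdrant or any embedding model: the chunk ids, payloads, and rankings are made up, `filtered_top` is a hypothetical stand-in for a retriever that applies the metadata filter itself, and `k=60` is the commonly used RRF constant, assumed here rather than taken from the project.

```python
from collections import defaultdict


def filtered_top(ranking, payloads, required, limit):
    """Stand-in for a retriever with the filter pushed down: walk the full
    ranking and keep the first `limit` chunks whose payload matches."""
    hits = []
    for chunk_id in ranking:
        payload = payloads[chunk_id]
        if all(payload.get(key) == value for key, value in required.items()):
            hits.append(chunk_id)
            if len(hits) == limit:
                break
    return hits


def reciprocal_rank_fusion(ranked_lists, k=60, top_n=30):
    """Score each id as the sum of 1 / (k + rank) over the lists it appears in."""
    scores = defaultdict(float)
    for ranking in ranked_lists:
        for rank, chunk_id in enumerate(ranking, start=1):
            scores[chunk_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id for determinism.
    ordered = sorted(scores.items(), key=lambda item: (-item[1], item[0]))
    return [chunk_id for chunk_id, _ in ordered[:top_n]]


# Hypothetical payloads and raw rankings.
payloads = {
    "c1": {"grade": 3, "mode": "narrative"},
    "c2": {"grade": 3, "mode": "narrative"},
    "c3": {"grade": 5, "mode": "narrative"},
    "c4": {"grade": 3, "mode": "narrative"},
}
dense_ranking = ["c3", "c1", "c2", "c4"]
sparse_ranking = ["c2", "c3", "c4", "c1"]
required = {"grade": 3, "mode": "narrative"}

dense_hits = filtered_top(dense_ranking, payloads, required, limit=3)
sparse_hits = filtered_top(sparse_ranking, payloads, required, limit=3)
fused = reciprocal_rank_fusion([dense_hits, sparse_hits])
print(fused)
```

Filtering after taking the top 3 of `dense_ranking` would leave only two candidates, because `c3` occupies a slot and is then discarded. Filtering inside the retriever keeps the pool full, which is the point of pushing the filter down.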

Progressive filter relaxation

When the strict filter returns fewer than 5 candidates, the system relaxes one dimension at a time — first the writing trait, then the mode, then the grade. Each relaxation is one Qdrant call; ~85% of queries never trigger relaxation, ~12% relax once, <0.5% reach the no-filter floor. Editors see a log of relaxed queries and decide whether to add content or merge tags.
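
The relaxation loop might look something like the sketch below. The `search` argument is a hypothetical callable standing in for one filtered Qdrant query, and the corpus and trait names are invented; only the threshold of 5 and the trait, mode, grade order come from the description above.

```python
MIN_CANDIDATES = 5
RELAX_ORDER = ["trait", "mode", "grade"]  # order given in the text


def search_with_relaxation(search, filters, min_candidates=MIN_CANDIDATES):
    """Drop one filter dimension at a time until enough candidates come back.

    search(filters) -> list of hits is a hypothetical stand-in for a single
    filtered retrieval call. Returns the hits and the dimensions relaxed.
    """
    active = dict(filters)
    relaxed = []
    hits = search(active)
    for dimension in RELAX_ORDER:
        if len(hits) >= min_candidates:
            break
        if dimension not in active:
            continue
        del active[dimension]
        relaxed.append(dimension)  # worth logging so editors can review it
        hits = search(active)      # one extra retrieval call per relaxation
    return hits, relaxed


# Tiny in-memory corpus with made-up tags, used instead of a vector store.
CORPUS = (
    [{"grade": 3, "mode": "narrative", "trait": "voice"}] * 2
    + [{"grade": 3, "mode": "narrative", "trait": "ideas"}] * 4
)


def fake_search(filters):
    return [c for c in CORPUS if all(c[k] == v for k, v in filters.items())]


hits, relaxed = search_with_relaxation(
    fake_search, {"grade": 3, "mode": "narrative", "trait": "voice"}
)
print(len(hits), relaxed)
```

Returning the list of relaxed dimensions alongside the hits is what makes the editor-facing log possible.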

Anti-hallucination via runtime grounding

A 298-term domain vocabulary is loaded at server startup. Every generated output goes through a 2-5ms regex-based extraction of named-entity-style terms (capitalized, multi-word, matching domain patterns), and each extracted term is checked against the vocabulary set. An exact match passes. A fuzzy match within edit distance 1 is logged and auto-corrected. A term with neither an exact nor a fuzzy match triggers a retry with a stricter prompt that lists the allowed terms inline. After two retries, the system falls back to the closest valid term by embedding similarity and flags the output for human review.
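
A rough sketch of the extract-and-check step is shown below. The three-term vocabulary and the regex are illustrative assumptions (the real list has 298 terms and its patterns are not published here), and the retry and embedding-similarity fallback are left out.

```python
import re

# Hypothetical stand-in for the 298-term vocabulary.
VOCABULARY = {"Turn & Talk", "Think Aloud", "Mentor Text"}

# Assumed pattern: two or more capitalized words, optionally joined by "&".
TERM_PATTERN = re.compile(r"\b[A-Z][a-z]+(?:\s(?:&\s)?[A-Z][a-z]+)+\b")


def within_one_edit(a, b):
    """True if a and b differ by at most one insert, delete, or substitution."""
    if a == b:
        return True
    if abs(len(a) - len(b)) > 1:
        return False
    if len(a) > len(b):
        a, b = b, a  # make a the shorter (or equal-length) string
    i = j = 0
    edited = False
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            i += 1
            j += 1
            continue
        if edited:
            return False  # second mismatch
        edited = True
        if len(a) == len(b):
            i += 1  # substitution: advance both strings
        j += 1      # insertion in b: advance b only
    return True


def check_terms(text, vocabulary=VOCABULARY):
    """Classify each extracted term as pass, corrected, or unknown."""
    report = {"pass": [], "corrected": {}, "unknown": []}
    for term in TERM_PATTERN.findall(text):
        if term in vocabulary:
            report["pass"].append(term)
            continue
        close = sorted(v for v in vocabulary if within_one_edit(term, v))
        if close:
            report["corrected"][term] = close[0]  # log + auto-correct
        else:
            report["unknown"].append(term)        # would trigger a retry
    return report


report = check_terms(
    "Open with a Mentor Text, model a Think Alout, "
    "then cue Turn & Talk about Voiceful Sentences."
)
print(report)
```

A pattern this simple will also pick up ordinary capitalized phrases at the start of a sentence, so a production extractor would need tighter domain patterns than this one.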

Two-stage validation: structural + judge

Stage 1 (code, <5ms) checks that every step is present, that each step has at least 2 sentences, that step 2B (the think-aloud) uses first-person language, and that step 3 contains Turn & Talk cues. Stage 2 (GPT-4o-mini judge, ~500ms) scores the lesson on a 100-point rubric across framework adherence, grade appropriateness, methodology accuracy, completeness, and teacher language quality. A score ≥85 is "validated" and ships immediately; 60-84 is "acceptable" and ships once retries are exhausted; <40 is an "error", so the system refuses and asks the teacher to rephrase.
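
The Stage 1 checks are simple enough to sketch in a few lines. The step keys ("1", "2A", "2B", "3", "4"), the first-person word list, and the sentence-splitting heuristic are assumptions made for illustration; the actual lesson schema is not shown in this case study.

```python
import re

# Assumed step layout; the text only confirms that steps 2B and 3 exist.
REQUIRED_STEPS = ["1", "2A", "2B", "3", "4"]
FIRST_PERSON = re.compile(r"\b(?:I|[Mm]y|me)\b")
SENTENCE_END = re.compile(r"[.!?]+(?:\s|$)")


def count_sentences(text):
    """Rough count: runs of terminal punctuation followed by space or end."""
    return len(SENTENCE_END.findall(text.strip()))


def structural_errors(lesson):
    """Return human-readable problems; an empty list means Stage 1 passed."""
    errors = []
    for step in REQUIRED_STEPS:
        text = lesson.get(step, "").strip()
        if not text:
            errors.append(f"step {step}: missing")
        elif count_sentences(text) < 2:
            errors.append(f"step {step}: fewer than 2 sentences")
    if not FIRST_PERSON.search(lesson.get("2B", "")):
        errors.append("step 2B: no first-person think-aloud language")
    if "turn & talk" not in lesson.get("3", "").lower():
        errors.append("step 3: no Turn & Talk cue")
    return errors


good_lesson = {
    "1": "Today we will study strong leads. A lead is the first sentence of a story.",
    "2A": "Look at this example lead. Notice how it starts with action.",
    "2B": "I am going to try this with my own story. First I picture the most exciting moment.",
    "3": "Turn & Talk with your partner about a lead you could write. Share one idea each.",
    "4": "Now write your own lead. Keep the action up front.",
}
bad_lesson = dict(good_lesson)
del bad_lesson["2A"]
bad_lesson["2B"] = "Students picture the moment. They write it down."
bad_lesson["3"] = "Discuss with a partner. Share ideas."

print(structural_errors(good_lesson))
print(structural_errors(bad_lesson))
```

Running cheap deterministic checks first means a malformed lesson never costs a judge call.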

7-layer security for LMS embedding

LearnWorlds passes user_id + email to the iframe via Liquid template variables. The backend validates: Sec-Fetch-Dest header (iframe-only), HMAC-SHA256 signed URLs (2-min expiry), referrer, LearnWorlds API user verification, LearnWorlds API enrollment check, JWT (HS256, 15-min httpOnly cookie), and re-enrollment check on every refresh. Rate-limited to 50 queries/user/hour.
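
As an illustration of the signed-URL layer only, the sketch below signs and verifies iframe query parameters with Python's standard `hmac` module. The parameter names (`ts`, `sig`), the payload layout, and the secret are hypothetical; the case study does not specify where or how the signature is produced.

```python
import hashlib
import hmac
import time
from urllib.parse import parse_qsl, urlencode

SECRET = b"replace-with-a-real-secret"  # hypothetical shared secret
MAX_AGE_SECONDS = 120                   # 2-minute expiry, from the text


def sign_params(user_id, email, issued_at, secret=SECRET):
    """HMAC-SHA256 over an assumed 'user_id|email|timestamp' payload."""
    payload = f"{user_id}|{email}|{issued_at}".encode()
    return hmac.new(secret, payload, hashlib.sha256).hexdigest()


def build_query(user_id, email, now=None, secret=SECRET):
    """Build the signed query string for the iframe URL."""
    issued_at = int(time.time() if now is None else now)
    sig = sign_params(user_id, email, issued_at, secret)
    return urlencode(
        {"user_id": user_id, "email": email, "ts": issued_at, "sig": sig}
    )


def verify(params, now=None, secret=SECRET):
    """Check the signature in constant time, then check the age."""
    now = time.time() if now is None else now
    try:
        issued_at = int(params["ts"])
        expected = sign_params(params["user_id"], params["email"], issued_at, secret)
        supplied = params["sig"]
    except (KeyError, ValueError):
        return False
    if not hmac.compare_digest(expected.encode(), supplied.encode()):
        return False
    return 0 <= now - issued_at <= MAX_AGE_SECONDS


query = build_query("u1", "teacher@example.com", now=1000)
params = dict(parse_qsl(query))
print(verify(params, now=1060), verify(params, now=1121))
```

The signature is compared before the timestamp is trusted, so a tampered `ts` cannot extend the window.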

It produces lessons that sound like me — same methodology, same phrasing, same pedagogical rigor. Teachers get classroom-ready content in 12 seconds instead of 45 minutes.
Founder, K-12 EdTech publisher

TL;DR

How Cognilium built an AI co-pilot for a K-12 EdTech publisher that saves teachers 22 hours/week by generating methodology-grounded writing mini-lessons via hybrid RAG.

An AI co-pilot embedded in the LearnWorlds LMS that generates classroom-ready writing mini-lessons grounded in 1.37M characters of the publisher's K-12 methodology, replacing 22 hours/week of manual lesson planning per teacher.
RAG case study | K-12 EdTech AI | hybrid retrieval | Qdrant production | LLM-as-judge | anti-hallucination | LearnWorlds integration | AI lesson planning

A K-12 writing instruction publisher had built a proprietary methodology spanning 30 PDFs (1.37 million characters), 188 catalogued mini-lessons, and 50+ named teaching strategies. Teachers who bought the courses had access to all of it — but translating it into ready-to-use, grade-specific, trait-specific classroom lessons took 25 to 30 hours per week per teacher. They did not need more search; they needed a generator that understood the methodology deeply enough to produce classroom-ready lessons.

Why a traditional RAG would have failed

A standard "embed everything, retrieve top-K, summarize" pipeline produces fluent output that sounds vaguely right. For domain-specific instruction, that is the failure mode you cannot afford: teachers notice "voiceful sentence" in place of the publisher's actual term, trust evaporates, and the tool gets dropped. The pipeline had to be designed around three pressures from the start: methodology-faithful generation, grade-appropriate retrieval, and detectable failure at every stage.

The five-phase pipeline

Phase 1 (Query Understanding) runs GPT-4o-mini with a 300-line domain vocabulary prompt to extract structured intent, grade, writing trait, writing mode, strategy name, and 3-8 expanded search terms. Phase 2 (Hybrid Retrieval) runs dense + sparse search over 584 manually curated chunks with metadata pre-filtering and progressive relaxation. Phase 3 (Context Assembly) sorts by content-type priority and fills a 6,000-token budget. Phase 4 (Generation) produces either a structured 4-step mini-lesson via Pydantic or free-text coaching. Phase 5 (Validation) runs a 100-point judge with three temperature-escalated retries.
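
Phase 3 is the easiest phase to picture in code. The sketch below sorts chunks by an assumed content-type priority and greedily fills the 6,000-token budget. The content-type names, the retrieval `score` field, and the 4-characters-per-token estimate are all illustrative assumptions; a real implementation would use the model's tokenizer.

```python
TOKEN_BUDGET = 6000  # from the text
# Hypothetical content types, most important first.
TYPE_PRIORITY = {"mini_lesson": 0, "strategy": 1, "background": 2}


def estimate_tokens(text):
    """Crude estimate (~4 characters per token); swap in a real tokenizer."""
    return max(1, len(text) // 4)


def assemble_context(chunks, budget=TOKEN_BUDGET):
    """Order by content-type priority, then score, and fill the budget."""
    ordered = sorted(
        chunks,
        key=lambda c: (TYPE_PRIORITY.get(c["type"], len(TYPE_PRIORITY)), -c["score"]),
    )
    selected, used = [], 0
    for chunk in ordered:
        cost = estimate_tokens(chunk["text"])
        if used + cost > budget:
            continue  # too big for the remaining space; a smaller one may fit
        selected.append(chunk)
        used += cost
    return selected, used


# Made-up chunks sized to 2000, 4000, 3000, and 1000 estimated tokens.
chunks = [
    {"id": "b1", "type": "background", "score": 0.9, "text": "x" * 8000},
    {"id": "m1", "type": "mini_lesson", "score": 0.8, "text": "x" * 16000},
    {"id": "s1", "type": "strategy", "score": 0.7, "text": "x" * 12000},
    {"id": "s2", "type": "strategy", "score": 0.6, "text": "x" * 4000},
]
selected, used = assemble_context(chunks)
print([c["id"] for c in selected], used)
```

Note that the highest-scoring chunk (`b1`) is dropped here: priority sorting deliberately lets content type outrank raw retrieval score.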

The numbers from production

  • 22 hours/week saved per teacher (replacing 25-30 hours of manual lesson planning)
  • 584 curated chunks across 28 source documents — 100% content coverage, no LLM-summarized loss
  • P50 end-to-end response time: 12 seconds (validated on first attempt)
  • Hard-fail rate after all retries: <0.5%
  • ~$0.10 per validated mini-lesson (GPT-4o generation + GPT-4o-mini judge + retrieval)

What we would do differently

The manual chunking was expensive (50+ hours) but irreplaceable — LLM chunking on a small, dense methodology dropped retrieval F1 by 12% in our internal eval. For corpora this small (<2M characters), manual is the right call. For larger corpora the trade-off flips and you need an LLM chunker with rigorous downstream evaluation.

Where this generalizes

The pattern — hybrid retrieval with prefetch-time filtering, runtime grounding against a domain vocabulary, and two-stage validation with a judge — is the right shape for any RAG product where the corpus is domain-specific, the vocabulary is closed, and the output has to sound like a domain expert. It is the pattern we ship for legal, medical, and regulated SaaS products.

Technologies used

FastAPI | Python 3.10 | Qdrant | OpenAI text-embedding-3-large | GPT-4o | GPT-4o-mini | BM25 (fastembed) | LearnWorlds API | JWT (HS256) | HMAC-SHA256 | Cloud Run | Vercel
