TL;DR
A startup-loaded domain vocabulary the generator must match against, plus framework rules baked into every prompt: a low-cost pattern that catches hallucinated terminology before users ever see it.
Domain-specific generation has a recurring failure mode: the LLM produces output that is fluent and confident but uses terminology that does not exist in the domain. A K-12 writing methodology has 298 specific terms; the generator may produce "voiceful sentence" or "stylistic figure", terms that sound plausible but appear nowhere in the framework. End users notice, and trust evaporates.
A validator that runs at output time and compares generated terminology against a startup-loaded vocabulary catches this before the user sees it, at 2-5ms of latency. The pattern is cheap and underused.
Vocabulary loading
At process startup, the validator loads the domain vocabulary from a versioned source — for our K-12 system it is a 298-term file extracted from the framework documentation. The vocab is parsed into a set with synonyms and inflection variants pre-computed. Total size: ~2KB in memory.
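A minimal sketch of the startup load is below; the file name, the pipe-separated synonym format, and the naive plural rule here are illustrative assumptions, not the exact schema:

```python
from pathlib import Path

def load_vocabulary(path: str = "framework_vocab_v1.txt") -> set[str]:
    """Parse the versioned vocab file into a lowercase lookup set.

    Assumed format: one canonical term per line, optional synonyms
    separated by pipes, '#' for comments. (Illustrative only.)
    """
    vocab: set[str] = set()
    for line in Path(path).read_text(encoding="utf-8").splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        for term in line.split("|"):
            term = term.strip().lower()
            if not term:
                continue
            vocab.add(term)
            # Crude plural/singular variant; a real system would use a
            # proper inflection library for the pre-computed variants.
            vocab.add(term[:-1] if term.endswith("s") else term + "s")
    return vocab

# vocab = load_vocabulary()  # once, at process startup
```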
Validation flow
Every generated output gets one validation pass. The pass runs the structured output through a regex-based extractor that finds named-entity-style terms (capitalized, multi-word, or matching domain patterns), and each extracted term is checked against the vocabulary set. A sketch follows the list below.
- Match: pass.
- No match, fuzzy-match within edit distance 1: log + auto-correct (covers typos in generation).
- No match, no fuzzy match: validation fails.
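Here is that pass, assuming a vocabulary set like the one loaded above. The regex is a placeholder for the real domain patterns, and the fuzzy check scans the vocab linearly on a miss, which is cheap at 298 terms:

```python
import re

# Placeholder extractor: capitalized single- or multi-word phrases.
TERM_RE = re.compile(r"\b[A-Z][a-z]+(?:\s+[A-Z][a-z]+)*\b")

def within_one_edit(a: str, b: str) -> bool:
    """True if the strings differ by at most one insert, delete, or substitution."""
    if abs(len(a) - len(b)) > 1:
        return False
    if len(a) > len(b):
        a, b = b, a                       # make a the shorter string
    i = j = edits = 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            i += 1
            j += 1
            continue
        edits += 1
        if edits > 1:
            return False
        if len(a) == len(b):
            i += 1                        # substitution
        j += 1                            # otherwise an insertion into the shorter string
    return edits + (len(b) - j) <= 1      # any trailing character counts as one edit

def validate(output: str, vocab: set[str]) -> tuple[bool, list[str]]:
    """One pass over a generation: extract terms, check each against the vocab."""
    failures: list[str] = []
    for match in TERM_RE.finditer(output):
        term = match.group().lower()
        if term in vocab:
            continue                      # exact match: pass
        fix = next((v for v in vocab if within_one_edit(term, v)), None)
        if fix is not None:
            continue                      # fuzzy match: log + auto-correct here
        failures.append(term)             # no match at all: validation fails
    return not failures, failures
```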
What happens on failure
Failure triggers a retry with a stricter prompt that lists the allowed vocabulary inline ("Use only these terms for craft elements: ..."). After two retries the system falls back to the closest valid term by embedding similarity and flags the output for human review. The flag goes to a queue; an editor sees the original prompt, the failed generation, and the auto-correction within hours.
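A control-flow sketch of that path, reusing validate() from above. The generate and closest_term callables are hypothetical stand-ins; wire in your own LLM call and embedding-similarity lookup:

```python
from queue import Queue
from typing import Callable

review_queue: Queue = Queue()  # stands in for the editor flag queue

def generate_validated(
    prompt: str,
    vocab: set[str],
    generate: Callable[[str], str],      # the LLM call (hypothetical)
    closest_term: Callable[[str], str],  # embedding-similarity fallback (hypothetical)
    max_retries: int = 2,
) -> str:
    strict = prompt + "\nUse only these terms for craft elements: " + ", ".join(sorted(vocab))
    output, bad = "", []
    for attempt in range(1 + max_retries):
        output = generate(prompt if attempt == 0 else strict)
        ok, bad = validate(output, vocab)
        if ok:
            return output
    # After two failed retries: swap in the closest valid term and flag for review.
    for term in bad:
        # Note: a real system would map the lowercased term back to its
        # original casing before substituting.
        output = output.replace(term, closest_term(term))
    review_queue.put({"prompt": prompt, "generation": output, "invalid_terms": bad})
    return output
```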
Why not just put the vocabulary in the system prompt?
You can. It costs tokens on every call. With 298 terms (roughly 4KB of text), that is ~1,000 extra tokens on every single request. The validator approach loads the vocab in process memory and keeps system prompts short. The cost trade is real and worth measuring.
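For scale, a back-of-envelope with the numbers above and the 30K queries/month from the measurements below:

```python
# Rough cost of inlining the vocab in every system prompt.
vocab_tokens = 1_000            # ~4KB of terms at roughly 4 chars/token
requests_per_month = 30_000     # figure from the measurements section
print(f"{vocab_tokens * requests_per_month:,} extra prompt tokens/month")  # 30,000,000
```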
What we measured
- Validation latency: 2-5ms (regex + set lookup, no LLM call)
- Hard-fail rate after two retries: <0.5%
- Hallucinated terms caught (terms the LLM invented): ~3-5% of raw generations
- Vocab size: 298 terms, ~2KB resident memory
- Editor flag queue: ~10-20 items per day at 30K queries/month, manageable for a part-time reviewer
Muhammad Mudassir
Founder & CEO, Cognilium AI | 10+ years experience
