
Anti-Hallucination via Runtime Grounding Against a Domain Vocabulary

6 min read · 1,300 words

Muhammad Mudassir
Founder & CEO, Cognilium AI

TL;DR


A startup-loaded domain vocabulary the generator must match against, plus framework rules baked into every prompt — a low-cost pattern that catches hallucinated terminology before the user sees it.
hallucination detection, structured output, LLM grounding, domain-faithful generation, validation pass, Pydantic

Domain-specific generation has a recurring failure mode: the LLM produces output that is fluent and confident but uses terminology that does not exist in the domain. A K-12 writing methodology has 298 specific terms; the generator may produce "voiceful sentence" or "stylistic figure", terms that sound right but are not in the framework. End users notice, and trust evaporates.

A validator that runs at output time and compares generated terminology against a startup-loaded vocabulary catches this before the user sees it, at 2-5ms of latency. The pattern is cheap and underused.

Vocabulary loading

At process startup, the validator loads the domain vocabulary from a versioned source — for our K-12 system it is a 298-term file extracted from the framework documentation. The vocab is parsed into a set with synonyms and inflection variants pre-computed. Total size: ~2KB in memory.
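
A minimal sketch of that startup step, assuming a plain one-term-per-line file; the file name, format, and the plural-only inflection handling are illustrative assumptions, not necessarily what the production loader does.

    # Startup-time vocabulary loading (sketch). File name, one-term-per-line format,
    # and plural-only inflection handling are assumptions for illustration.
    from pathlib import Path

    def load_vocabulary(path: str = "framework_terms_v1.txt") -> frozenset[str]:
        """Parse the versioned term file into a lowercase set with simple variants pre-computed."""
        terms: set[str] = set()
        for line in Path(path).read_text(encoding="utf-8").splitlines():
            term = line.strip().lower()
            if not term or term.startswith("#"):
                continue
            terms.add(term)
            # Pre-compute a trivial inflection variant so the runtime check stays a plain set lookup.
            terms.add(term[:-1] if term.endswith("s") else term + "s")
        return frozenset(terms)

    # Loaded once per process and shared by every validation pass;
    # ~300 short strings is a couple of kilobytes resident.
    VOCAB = load_vocabulary()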

Validation flow

Every generated output gets one validation pass. The pass runs the structured output through a regex-based extractor that finds named-entity-style terms (capitalized, multi-word, or matching domain patterns). Each extracted term is checked against the vocabulary set.

  • Match: pass.
  • No match, fuzzy-match within edit distance 1: log + auto-correct (covers typos in generation).
  • No match, no fuzzy match: validation fails.
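
Here is a compact sketch of that pass. The regex is a deliberately crude stand-in for the real extractor, and the fuzzy check is a plain edit-distance-1 comparison; both are assumptions, not the production implementation.

    import re

    # Crude stand-in for the extractor: capitalized single- or multi-word phrases.
    TERM_PATTERN = re.compile(r"\b[A-Z][a-z]+(?:\s+[A-Z][a-z]+)*\b")

    def within_one_edit(a: str, b: str) -> bool:
        """True if a and b differ by at most one insertion, deletion, or substitution."""
        if abs(len(a) - len(b)) > 1:
            return False
        i = j = edits = 0
        while i < len(a) and j < len(b):
            if a[i] == b[j]:
                i += 1; j += 1
                continue
            edits += 1
            if edits > 1:
                return False
            if len(a) > len(b):
                i += 1            # extra character in a
            elif len(a) < len(b):
                j += 1            # extra character in b
            else:
                i += 1; j += 1    # substitution
        return edits + (len(a) - i) + (len(b) - j) <= 1

    def validate(output_text: str, vocab: frozenset[str]) -> tuple[bool, str]:
        """One validation pass: returns (passed, possibly auto-corrected text)."""
        corrected = output_text
        for raw in set(TERM_PATTERN.findall(output_text)):
            term = raw.lower()
            if term in vocab:
                continue                                           # exact match: pass
            fuzzy = next((v for v in vocab if within_one_edit(term, v)), None)
            if fuzzy is not None:
                corrected = corrected.replace(raw, fuzzy)          # edit distance 1: log + auto-correct
                continue
            return False, corrected                                # no match, no fuzzy match: fail
        return True, corrected

The whole pass is a regex scan plus set lookups, which is where the single-digit-millisecond latency figure below comes from.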

What happens on failure

Failure triggers a retry with a stricter prompt that lists the allowed vocabulary inline ("Use only these terms for craft elements: ..."). After two retries the system falls back to the closest valid term by embedding similarity and flags the output for human review. The flag goes to a queue; an editor sees the original prompt, the failed generation, and the auto-correction within hours.
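
Sketching the retry-then-fallback loop in the same vein: generate() and closest_vocab_term() are hypothetical placeholders for the real LLM client and embedding-similarity lookup, the queue stands in for the editor flag queue, and validate(), TERM_PATTERN, and VOCAB are reused from the sketches above.

    from queue import Queue

    MAX_RETRIES = 2
    review_queue: Queue = Queue()   # stand-in for the editor flag queue

    def generate(prompt: str) -> str:
        """Placeholder for the real LLM call."""
        raise NotImplementedError

    def closest_vocab_term(term: str, vocab: frozenset[str]) -> str:
        """Placeholder: return the vocab term with the highest embedding similarity to `term`."""
        raise NotImplementedError

    def grounded_generate(prompt: str, vocab: frozenset[str]) -> str:
        attempt_prompt = prompt
        output = ""
        for _ in range(1 + MAX_RETRIES):            # first attempt plus two retries
            output = generate(attempt_prompt)
            ok, corrected = validate(output, vocab)
            if ok:
                return corrected
            # Retry with a stricter prompt that lists the allowed vocabulary inline.
            attempt_prompt = (prompt
                + "\nUse only these terms for craft elements: "
                + ", ".join(sorted(vocab)))
        # Both retries failed: swap each unknown term for its closest valid neighbour
        # by embedding similarity, then flag the case for human review.
        fallback = output
        for raw in set(TERM_PATTERN.findall(output)):
            if raw.lower() not in vocab:
                fallback = fallback.replace(raw, closest_vocab_term(raw, vocab))
        review_queue.put({"prompt": prompt, "failed": output, "auto_correction": fallback})
        return fallback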

Why not just put the vocabulary in the system prompt?

You can. It costs tokens on every call. With 298 terms (roughly 4KB of text), that is roughly 1,000 extra tokens added to every single request. The validator approach keeps the vocabulary in process memory and keeps system prompts short. The cost trade-off is real and worth measuring.
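
As a rough back-of-the-envelope using the article's own numbers: ~1,000 extra tokens per request at the ~30K queries/month cited below works out to roughly 30 million additional input tokens per month spent re-sending the same 298 terms.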

What we measured

  • Validation latency: 2-5ms (regex + set lookup, no LLM call)
  • Hard-fail rate after two retries: <0.5%
  • Hallucinated terms caught (terms the LLM invented): ~3-5% of raw generations
  • Vocab size: 298 terms, ~2KB resident memory
  • Editor flag queue: ~10-20 items per day at 30K queries/month, manageable for a part-time reviewer

