Reference

AI Engineering Glossary

Practitioner-grade definitions for 52+ terms across RAG, GraphRAG, agents, LLMOps, voice AI, knowledge graphs, and production patterns. Maintained by engineers shipping production AI for 50+ enterprise clients.

Retrieval & RAG

9 terms

ANN (Approximate Nearest Neighbor)

Retrieval & RAG

A class of algorithms that find vectors close to a query vector quickly without scanning the full index, trading exact correctness for speed.

HNSW, IVF-PQ, and ScaNN are common ANN algorithms. Used by Pinecone, Qdrant, Weaviate, and pgvector. Recall is typically tuned to 0.95-0.99 of exact search.

BM25

Retrieval & RAG

A keyword-based ranking function that scores documents against a query using term frequency, inverse document frequency, and document length normalization.

BM25 (Best Matching 25) is the lexical-search baseline. Hybrid retrieval combines BM25 with vector search via rank fusion (typically RRF, Reciprocal Rank Fusion) to capture both keyword precision and semantic recall.
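
As a rough illustration of how term frequency, inverse document frequency, and length normalization combine, here is a minimal Okapi BM25 scorer in plain Python. The function name `bm25_scores`, the defaults `k1=1.5` and `b=0.75`, the smoothed IDF variant, and the toy documents are assumptions made for this sketch, not the behavior of any specific search engine.

```python
import math
from collections import Counter

def bm25_scores(query, docs, k1=1.5, b=0.75):
    """Score each tokenized document against a tokenized query with Okapi BM25."""
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: how many documents contain each term.
    df = Counter(term for d in docs for term in set(d))
    scores = []
    for d in docs:
        tf = Counter(d)
        score = 0.0
        for term in query:
            if term not in tf:
                continue
            # Smoothed IDF; the "+1" inside the log keeps it non-negative.
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            # Term-frequency saturation with document length normalization.
            norm = tf[term] + k1 * (1 - b + b * len(d) / avg_len)
            score += idf * tf[term] * (k1 + 1) / norm
        scores.append(score)
    return scores

# Toy corpus (hypothetical), tokenized by whitespace.
docs = [
    "the cat sat on the mat".split(),
    "dogs chase the cat".split(),
    "quarterly revenue report".split(),
]
scores = bm25_scores("cat mat".split(), docs)
```

The first document matches both query terms and should rank highest; the third matches neither and scores zero.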

Chunking

Retrieval & RAG

The process of splitting documents into smaller passages (chunks) for embedding and retrieval.

Common strategies: fixed-size (e.g., 512 tokens), recursive character splitting, semantic chunking by sentence/paragraph boundaries, structural chunking (headings, lists). Chunk size and overlap together are among the highest-impact RAG knobs.
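
A minimal sketch of the fixed-size strategy with overlap, assuming the text has already been tokenized into a list. The name `chunk_tokens` and the tiny sizes in the demo are illustrative only; real pipelines would use the tokenizer of the embedding model.

```python
def chunk_tokens(tokens, size=512, overlap=64):
    """Split a token list into fixed-size chunks that share `overlap` tokens."""
    if not 0 <= overlap < size:
        raise ValueError("overlap must be non-negative and smaller than size")
    step = size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + size])
        # Stop once a chunk has reached the end of the input.
        if start + size >= len(tokens):
            break
    return chunks

# Demo with hypothetical tokens t0..t9, chunk size 4, overlap 1.
tokens = [f"t{i}" for i in range(10)]
chunks = chunk_tokens(tokens, size=4, overlap=1)
```

Each chunk starts with the last token of the previous one, so a sentence that straddles a boundary is still fully present in at least one chunk when the overlap is large enough.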

GraphRAG

Retrieval & RAG

Retrieval-augmented generation that queries a property graph instead of (or alongside) a flat vector index to capture entity relationships during retrieval.

GraphRAG wins above ~100k documents or when queries require multi-hop reasoning. Architecture is 4 layers: extraction, graph store, hybrid retrieval, answer synthesis. Costs 1.5×-3× more than vector RAG but lifts answer accuracy on relationship-heavy queries by 30-50%.

Hybrid Retrieval

Retrieval & RAG

Retrieval that combines lexical (BM25), semantic (vector), and sometimes graph-based methods to maximize both precision and recall.

Fusion is typically via Reciprocal Rank Fusion (RRF) or weighted score averaging. Hybrid retrieval almost always outperforms any single method on enterprise corpora.
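
Reciprocal Rank Fusion can be sketched in a few lines. It uses only the rank positions from each retriever, so BM25 and vector scores never need to be put on a common scale. The function name, the document ids, and the constant `k=60` (a commonly used default) are assumptions of this sketch.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of document ids into one ranking."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            # Each list contributes 1 / (k + rank) for every document it returns.
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: scores[doc_id], reverse=True)

# Hypothetical outputs of a lexical and a vector retriever.
bm25_ranking = ["d1", "d2", "d3"]
vector_ranking = ["d3", "d1", "d4"]
fused = reciprocal_rank_fusion([bm25_ranking, vector_ranking])
```

Documents returned by both retrievers (`d1`, `d3`) rise above documents returned by only one.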

RAG (Retrieval-Augmented Generation)

Retrieval & RAG

A pattern where an LLM is given retrieved context (passages, facts, code) at inference time to ground its answer, instead of relying purely on training-time knowledge.

RAG decouples the model from the corpus. Documents are embedded into a vector index; at query time, top-k similar chunks are retrieved and prepended to the LLM prompt. This pattern enables citation, freshness, and domain specialization without re-training.

Reranking

Retrieval & RAG

A second-stage scoring step that re-orders retrieved candidates using a more expensive but more accurate model (typically a cross-encoder or LLM).

Cohere Rerank 3, Jina Reranker, and cross-encoder models like BGE-reranker are typical choices. Reranking the top 100 candidates and keeping the top 5-10 usually outperforms passing a larger, un-reranked top-k straight to the LLM.

Semantic Search

Retrieval & RAG

Search that retrieves documents by meaning (via embeddings) rather than exact keyword match.

Sometimes used interchangeably with vector search. Distinguished from lexical search (BM25, TF-IDF), which matches surface forms.

Vector Embedding

Retrieval & RAG

A dense numerical representation of text (or image, audio) that captures semantic meaning in a high-dimensional space, used for similarity search.

Modern embeddings come from models like OpenAI text-embedding-3-large, Cohere embed-v3, or open-weight Nomic and BGE. Typical dimensions: 768-3072. Cosine similarity is the default similarity metric.
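
To make the similarity step concrete, here is a small sketch of cosine similarity and a brute-force top-k lookup. The three-dimensional vectors and document labels are invented for illustration; real embeddings have hundreds or thousands of dimensions and would normally sit behind an ANN index.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0
    return dot / (norm_a * norm_b)

def top_k(query_vec, index, k=2):
    """Return the k keys whose vectors are most similar to the query (exact scan)."""
    ranked = sorted(
        index,
        key=lambda key: cosine_similarity(query_vec, index[key]),
        reverse=True,
    )
    return ranked[:k]

# Hypothetical toy index: label -> embedding.
index = {
    "refund policy": [0.9, 0.1, 0.0],
    "shipping times": [0.1, 0.9, 0.0],
    "api limits": [0.0, 0.1, 0.9],
}
hits = top_k([1.0, 0.2, 0.0], index, k=2)
```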

Foundation Models

8 terms

Context Window

Foundation Models

The maximum number of tokens an LLM can attend to in a single inference call, combining input and output.

Context windows have grown from 4k tokens (2022) to 1M+ tokens (Gemini 1.5 Pro, Claude 4.6 Sonnet) in 2026. Long context enables document-stuffing patterns but does not eliminate the need for RAG — retrieval-quality and cost both still favor selective context.

Distillation

Foundation Models

Training a smaller student model to mimic the outputs of a larger teacher model, reducing inference cost while preserving most capability.

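Smaller tiers of commercial model families (for example Claude Haiku or GPT-4o mini) are widely assumed to rely on distillation, though vendors rarely publish their training recipes. Production teams distill task-specific behaviors from frontier models into cheaper open-weight models for hot paths.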

Fine-Tuning

Foundation Models

Updating a pre-trained model's weights on a smaller, task-specific dataset to specialize its behavior.

Modern fine-tuning is usually LoRA (low-rank adaptation) or QLoRA — parameter-efficient methods that update a tiny fraction of weights. Production fine-tuning for AI applications is rarer in 2026 than 2023; RAG + prompt engineering covers most needs.

Foundation Model

Foundation Models

A large model trained on broad data and adaptable to many downstream tasks via prompting, fine-tuning, or RAG.

Foundation models include LLMs, vision-language models (CLIP, Gemini Vision), speech models (Whisper, Ultravox), and multimodal models. The term highlights that one base model serves many applications.

LLM (Large Language Model)

Foundation Models

A transformer-based neural network trained on large text corpora to predict next tokens, ranging from 1B to 1T+ parameters.

Examples: Anthropic Claude (Fable, Mythos, Opus, Sonnet, Haiku), OpenAI GPT (4o, 4.1, 5), Google Gemini, Meta Llama. Most production AI in 2026 is built on closed-weight LLMs accessed via API.

MoE (Mixture of Experts)

Foundation Models

A model architecture where different "expert" sub-networks handle different inputs, activating only a subset of parameters per token.

MoE models (Mixtral, DeepSeek-V3, Llama 4) have high total parameter counts but lower active parameters per token, enabling cheaper inference. Most closed frontier models in 2026 are widely believed to be MoE under the hood, although vendors seldom confirm architecture details.
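
The routing idea can be sketched with scalar "experts". This is a toy version of top-k gating (softmax over the k largest router logits, then a weighted sum of only those experts). The expert functions and router logits are made up; in a real model the experts are feed-forward blocks and the logits come from a learned router.

```python
import math

def top_k_gate(router_logits, k=2):
    """Pick the k highest-scoring experts and softmax-normalize their logits."""
    top = sorted(range(len(router_logits)),
                 key=lambda i: router_logits[i], reverse=True)[:k]
    peak = max(router_logits[i] for i in top)
    exps = {i: math.exp(router_logits[i] - peak) for i in top}
    total = sum(exps.values())
    return {i: e / total for i, e in exps.items()}

def moe_forward(x, experts, router_logits, k=2):
    """Run only the selected experts and mix their outputs by gate weight."""
    gates = top_k_gate(router_logits, k)
    return sum(weight * experts[i](x) for i, weight in gates.items())

# Four hypothetical experts; only two are evaluated per input.
experts = [lambda x: x + 1, lambda x: x * 2, lambda x: -x, lambda x: x ** 2]
router_logits = [0.1, 2.0, -1.0, 2.0]
gates = top_k_gate(router_logits, k=2)
y = moe_forward(3.0, experts, router_logits, k=2)
```

With two experts tied at the top, each receives weight 0.5, and the other two experts are never called.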

Quantization

Foundation Models

Reducing the numerical precision of a model's weights (e.g., fp32 → int8 or int4) to shrink memory and speed inference.

Common quantization levels: fp16, bf16, int8, int4. Open-weight models served via vLLM, llama.cpp, or TensorRT typically run quantized in production. Quality degradation is minor for most tasks as long as precision stays above 4-bit.
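
A minimal sketch of symmetric per-tensor int8 quantization, one of the simplest schemes. Production toolchains use more elaborate variants (per-channel scales, group-wise 4-bit, calibration data); the function names and sample weights here are illustrative.

```python
def quantize_int8(weights):
    """Map floats to integers in [-127, 127] using a single shared scale."""
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / 127 if max_abs else 1.0
    quantized = [max(-127, min(127, round(w / scale))) for w in weights]
    return quantized, scale

def dequantize(quantized, scale):
    """Recover approximate float weights from the integer codes."""
    return [q * scale for q in quantized]

# Hypothetical weight vector.
weights = [0.5, -1.27, 0.031, 0.0]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
```

The round-trip error per weight is bounded by half the scale, which is why outlier weights (which inflate the scale) hurt accuracy at low bit widths.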

RLHF (Reinforcement Learning from Human Feedback)

Foundation Models

A training technique where humans rank model outputs and the model learns to prefer high-ranked responses through a reward model.

RLHF is how foundation labs (Anthropic, OpenAI) align models for helpfulness and safety. Most production teams do not do RLHF; they consume aligned models via API and use prompting/RAG to specialize.

Agents & Orchestration

8 terms

Agent

Agents & Orchestration

An LLM-driven system that plans, takes actions through tools, and observes the results in a loop to accomplish a goal.

Distinguished from a single prompt by the loop structure: think → act → observe → repeat. Production agents have evaluation, retries, cost caps, and human-in-the-loop checkpoints.
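
The loop structure can be sketched without any model at all. Below, `scripted_policy` is a hypothetical stand-in for the LLM call, and the action dictionary format is an assumption of this sketch; the point is the think, act, observe cycle and the hard step cap.

```python
def run_agent(policy, tools, goal, max_steps=5):
    """Run a think -> act -> observe loop until the policy finishes or the cap hits."""
    history = []
    for _ in range(max_steps):
        action = policy(goal, history)                      # think
        if action["type"] == "finish":
            return action["answer"], history
        result = tools[action["tool"]](**action["args"])    # act
        history.append((action, result))                    # observe
    raise RuntimeError("step budget exhausted before the goal was reached")

def scripted_policy(goal, history):
    """Stand-in for an LLM: call one tool, then answer from its result."""
    if not history:
        return {"type": "tool", "tool": "add", "args": {"a": 2, "b": 3}}
    return {"type": "finish", "answer": f"The sum is {history[-1][1]}"}

tools = {"add": lambda a, b: a + b}
answer, history = run_agent(scripted_policy, tools, "add 2 and 3")
```

The `max_steps` cap plays the role of the cost cap mentioned above: a policy that never emits `finish` fails loudly instead of looping forever.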

Bedrock AgentCore

Agents & Orchestration

AWS's managed agent runtime with built-in observability, memory, and tool integration, available as part of Amazon Bedrock.

Best fit when the rest of the stack is AWS-native and the customer requires data residency, VPC integration, or a HIPAA Business Associate Agreement (BAA). Cognilium has shipped production multi-agent systems on AgentCore.

CrewAI

Agents & Orchestration

A Python framework for orchestrating role-based multi-agent systems with declarative task assignment.

CrewAI uses a "crew of agents with roles" mental model (researcher, writer, editor). Simpler API than LangGraph for role-based workflows; less flexible for arbitrary state machines.

LangGraph

Agents & Orchestration

A Python library from LangChain Inc. for building agent workflows as explicit state-machine graphs with typed state and conditional edges.

LangGraph beats raw LangChain agents for production reliability because state is observable and transitions are explicit. Combined with LangSmith for trace observability, it is a common production stack choice.

Multi-Agent System

Agents & Orchestration

An architecture where multiple specialized agents collaborate, often coordinated by a supervisor or router agent.

Common pattern: supervisor agent routes tasks to specialist agents (researcher, coder, reviewer). Worth the complexity only when single-agent context limits or capability gaps are hit. Costs and failure modes compound.

ReAct Pattern

Agents & Orchestration

An agent loop where the LLM alternates between Reasoning (thought) and Acting (tool call), used in early agent frameworks.

Original ReAct paper (Yao et al., 2022) inspired LangChain agents. Modern frameworks (LangGraph, CrewAI) generalize beyond pure ReAct with state-machine and DAG patterns.

Supervisor Pattern

Agents & Orchestration

A multi-agent architecture where a top-level supervisor agent dispatches subtasks to specialist agents and aggregates their results.

Used in Cognilium-built production systems on AWS Bedrock AgentCore and Google ADK. The supervisor binds only the tools each tenant is entitled to at run time, which avoids maintaining a forked agent definition per customer.

Tool Use / Function Calling

Agents & Orchestration

An LLM capability where the model emits structured calls to external functions (APIs, code execution, search) and incorporates their results into its response.

All frontier LLMs support tool use in 2026. The model decides when to call which tool based on the prompt and tool schemas. Production tool use needs strict schemas, retry logic, and timeout handling.

Knowledge Graphs

7 terms

Cypher

Knowledge Graphs

The SQL-like query language for property graphs, originally from Neo4j, now supported by Memgraph, Neptune (openCypher), and others.

Example: MATCH (c:Company)-[:EMPLOYS]->(p:Person) WHERE c.industry = "Finance" RETURN p.name. Industry-default query language for knowledge graphs.

Entity Resolution

Knowledge Graphs

The process of deciding when two surface forms ("Acme Corp", "Acme Corporation") refer to the same real-world entity.

Combines canonical-name lookup, embedding similarity, and rule-based identifiers (DUNS, EIN, addresses). Without it, knowledge graphs accumulate duplicate identity nodes and retrieval over-fetches.
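
A small sketch of the canonical-name step. It swaps in plain string similarity from the standard library (`difflib.SequenceMatcher`) where a production system would use embedding similarity and hard identifiers; the suffix list, threshold, and function names are assumptions.

```python
import re
from difflib import SequenceMatcher

# Hypothetical list of legal suffixes to ignore when comparing company names.
SUFFIXES = {"corp", "corporation", "inc", "incorporated", "ltd", "llc", "co"}

def canonical_name(name):
    """Lowercase, strip punctuation, and drop legal suffixes."""
    tokens = re.findall(r"[a-z0-9]+", name.lower())
    return " ".join(t for t in tokens if t not in SUFFIXES)

def same_entity(a, b, threshold=0.9):
    """Treat two names as one entity if their canonical forms match closely."""
    ca, cb = canonical_name(a), canonical_name(b)
    return ca == cb or SequenceMatcher(None, ca, cb).ratio() >= threshold

match = same_entity("Acme Corp", "Acme Corporation")
non_match = same_entity("Acme Corp", "Apex Inc.")
```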

Graph Rot

Knowledge Graphs

The silent decay of a production knowledge graph's correctness through orphan nodes, duplicate entities, stale edges, and missing provenance.

Cognilium monitors 7 specific decay signals — orphan rate, duplicate identity, edge staleness, source-document drift, attribute conflicts, cycle pollution, and provenance gaps — with weekly health checks.

Graph Traversal

Knowledge Graphs

The process of walking from one node to others through edges, typically to assemble retrieval context or answer multi-hop queries.

Production graph traversal has depth caps (typically 2) and fanout caps per node (50-100) to prevent cost explosions on high-degree hub entities.
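
The two caps can be sketched as a breadth-first traversal over an adjacency list. The graph contents and the function name are illustrative; a real system would usually push these limits into the graph query itself.

```python
from collections import deque

def bounded_traverse(graph, start, max_depth=2, max_fanout=50):
    """Breadth-first walk that stops at max_depth and follows at most
    max_fanout edges out of any single node."""
    visited = {start}
    queue = deque([(start, 0)])
    while queue:
        node, depth = queue.popleft()
        if depth == max_depth:
            continue  # do not expand nodes that sit at the depth cap
        for neighbor in graph.get(node, [])[:max_fanout]:
            if neighbor not in visited:
                visited.add(neighbor)
                queue.append((neighbor, depth + 1))
    return visited

# Hypothetical adjacency list.
graph = {
    "acme": ["alice", "bob", "carol"],
    "alice": ["contract_1"],
    "contract_1": ["clause_9"],
}
reached = bounded_traverse(graph, "acme", max_depth=2, max_fanout=2)
```

Here `carol` is dropped by the fanout cap and `clause_9` by the depth cap. Which neighbors survive the fanout cut depends on edge order, so production systems typically sort edges by relevance before truncating.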

Knowledge Graph

Knowledge Graphs

A data structure of typed entities (nodes) and typed relationships (edges) with attached attributes, used to represent domain knowledge.

Distinguished from a flat database by the graph topology and the focus on relationships. Cognilium builds production knowledge graphs for legal, financial, and HR use cases on Neo4j, Memgraph, and Amazon Neptune.

Ontology

Knowledge Graphs

A formal specification of the entity types, relationship types, and constraints in a knowledge graph schema.

The ontology defines what is possible: "a Person can be EMPLOYED_BY an Organization", "a Contract REFERENCES a Section". Ontology design is engineering, not an LLM task; sloppy schemas lead to graph rot.

Property Graph

Knowledge Graphs

A graph data model where both nodes and edges can have typed properties (key-value pairs), used by Neo4j, Memgraph, and Neptune.

Contrasts with RDF triples (subject-predicate-object), which cannot attach properties directly to an edge without reification or RDF-star. Property graphs are the production default for knowledge-graph applications outside the academic semantic web.

LLMOps

8 terms

Circuit Breaker

LLMOps

A pattern that halts requests to an LLM endpoint when error rates or latency exceed thresholds, so the system fails fast rather than queuing calls against a degraded provider.

Borrowed from microservices reliability. Production LLM systems set circuit breakers per provider (Anthropic, OpenAI, Bedrock) and route to fallback providers when a circuit opens.
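
A minimal per-provider breaker with fallback routing might look like the sketch below. It counts consecutive failures only (no latency threshold), and the class name, thresholds, and provider labels `primary` and `fallback` are assumptions for illustration.

```python
import time

class CircuitBreaker:
    """Opens after N consecutive failures; allows a trial call after a cooldown."""

    def __init__(self, failure_threshold=3, reset_after=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.reset_after = reset_after
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def allow(self):
        if self.opened_at is None:
            return True
        # Half-open: permit a trial request once the cooldown has elapsed.
        return self.clock() - self.opened_at >= self.reset_after

    def record_success(self):
        self.failures = 0
        self.opened_at = None

    def record_failure(self):
        self.failures += 1
        if self.failures >= self.failure_threshold:
            self.opened_at = self.clock()

def pick_provider(breakers, preferred_order):
    """Return the first provider whose circuit currently allows traffic."""
    for name in preferred_order:
        if breakers[name].allow():
            return name
    return None

# Demo with a fake clock so the cooldown can be simulated.
now = [0.0]
breakers = {name: CircuitBreaker(clock=lambda: now[0]) for name in ("primary", "fallback")}
for _ in range(3):
    breakers["primary"].record_failure()
during_outage = pick_provider(breakers, ["primary", "fallback"])
now[0] = 31.0
after_cooldown = pick_provider(breakers, ["primary", "fallback"])
```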

Eval-Driven Development

LLMOps

A workflow where prompt and architecture changes are scored against a golden test set of queries before deployment.

Analogous to test-driven development for traditional software. Golden sets cover 100-300 queries spanning happy path, edge cases, and adversarial inputs. Scoring uses LLM-as-judge or human review.
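
In its simplest form the workflow is a regression gate: score the baseline and the candidate on the same golden set and block the change if the score drops. The sketch below uses exact match as the judge and dictionaries as stand-ins for the two application versions; all names and cases are hypothetical.

```python
def exact_match(output, expected):
    """Simplest possible judge; an LLM-as-judge would slot in here."""
    return output.strip().lower() == expected.strip().lower()

def evaluate(app, golden_set, judge=exact_match):
    """Fraction of golden-set cases the application passes."""
    passed = [judge(app(case["input"]), case["expected"]) for case in golden_set]
    return sum(passed) / len(passed)

# Hypothetical golden set and two stubbed application versions.
golden_set = [
    {"input": "2+2", "expected": "4"},
    {"input": "capital of France", "expected": "Paris"},
]
baseline_app = {"2+2": "4", "capital of France": "Lyon"}.get
candidate_app = {"2+2": "4", "capital of France": "paris"}.get

baseline_score = evaluate(baseline_app, golden_set)
candidate_score = evaluate(candidate_app, golden_set)
ship = candidate_score >= baseline_score
```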

Golden Set

LLMOps

A curated collection of input-output pairs used as the regression suite for LLM applications.

Maintained continuously: new failure modes from production logs get added; stale examples get pruned. Without a golden set, "is this prompt change better?" is unanswerable.

LLM-as-Judge

LLMOps

Using an LLM to score the output of another LLM against a rubric, replacing expensive human evaluation for scalable quality measurement.

Production judges include temperature-escalation retry patterns (retry at higher temperature if judge score is low) and ensemble judging (3 judges vote). Judges have known failure modes — verbosity bias, position bias.

LLMOps

The operational discipline of running LLM-powered applications in production — evaluation, observability, retries, cost engineering, prompt versioning.

LLMOps is to LLM applications what MLOps is to traditional ML. Stack components: golden-set evaluation, LLM-as-judge, LangSmith/Langfuse for traces, Helicone for token observability, circuit breakers, prompt versioning.

Observability

LLMOps

The ability to inspect and debug LLM application behavior in production through traces, logs, metrics, and cost data.

Standard stack in 2026: LangSmith or Langfuse for traces, Helicone for token-spend metering, Datadog for infrastructure metrics, OpenTelemetry GenAI conventions as the open standard.

Prompt Engineering

LLMOps

The practice of designing the instructions, structure, and examples given to an LLM to elicit a desired behavior.

In 2026, prompt engineering is less brittle than in 2023 because models are more aligned, but still consequential. Production prompts are versioned, tested against golden sets, and A/B tested.

Token Economics

LLMOps

The discipline of managing per-request and aggregate cost of LLM applications through routing, caching, batching, and model selection.

Levers: route trivial queries to cheaper models, cache repeated queries with semantic-similarity lookup, batch where latency permits, prefer prompt caching when supported. A 70% cost reduction is typical without quality loss.

Voice AI

6 terms

Barge-In

Voice AI

The capability of a voice agent to detect when a user has started speaking over its response and gracefully stop, allowing natural conversation.

Engineering challenges: distinguishing real interruption from background noise, handling cut-off responses cleanly, recovering conversational state. Standard in production-grade voice systems.

Latency Budget

Voice AI

The maximum allowed end-to-end response time for a voice agent, typically budgeted across VAD, STT, LLM, TTS, and network legs.

Industry target: sub-1.5 seconds p95. Typical 2026 budget: 200ms VAD + 250ms streaming STT + 500ms first-token LLM + 300ms first-byte TTS + 250ms network = 1.5s. Engineering effort goes into shaving each of these hops.
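
The budget arithmetic above is easy to encode as a check against measured timings. The leg names and the measured values below are hypothetical; only the budget figures come from the entry.

```python
# Per-leg budget in milliseconds, taken from the figures above.
budget_ms = {
    "vad": 200,
    "stt": 250,
    "llm_first_token": 500,
    "tts_first_byte": 300,
    "network": 250,
}
total_ms = sum(budget_ms.values())

def over_budget(measured_ms, budget_ms):
    """Return the legs that exceeded their budget and by how many milliseconds."""
    return {
        leg: measured_ms[leg] - limit
        for leg, limit in budget_ms.items()
        if measured_ms.get(leg, 0) > limit
    }

# Hypothetical p95 measurements from a trace.
measured = {"vad": 180, "stt": 240, "llm_first_token": 640,
            "tts_first_byte": 290, "network": 210}
violations = over_budget(measured, budget_ms)
```

In this made-up trace the total is 1,560 ms, and the per-leg breakdown shows that the LLM first-token leg alone accounts for the overrun.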

Speech-to-Speech

Voice AI

A single model that goes directly from spoken input to spoken output without an intermediate text representation, reducing latency and preserving prosody.

Ultravox and OpenAI Realtime are leading production options in 2026. Removes 2 of the 3 latency hops (STT → LLM → TTS becomes one model), enabling sub-600ms first-token responses.

STT (Speech-to-Text)

Voice AI

The transcription of spoken audio into text. Modern STT supports streaming (partial transcripts during speech) for low-latency voice agents.

Production options: Deepgram Nova, AssemblyAI, OpenAI Whisper. Deepgram and AssemblyAI lead on streaming latency; Whisper is the open-weight default for batch transcription.

TTS (Text-to-Speech)

Voice AI

The synthesis of natural-sounding speech from text input, with streaming support so voice agents get a low time-to-first-byte.

Production options: ElevenLabs (highest naturalness, branded-voice cloning), Cartesia (low latency), Azure Neural, Google Cloud TTS. Streaming TTS shaves 200-400ms off time-to-first-audio.

Voice AI

AI systems that interact via spoken language, typically combining speech-to-text (STT), an LLM, and text-to-speech (TTS), or using a single speech-to-speech model.

Production voice AI in 2026 includes call-center agents, voice assistants, voice interview systems, and voice-augmented support. Latency budget (sub-1.5s p95) is the dominant engineering constraint.

Production Patterns

6 terms

Embedding Drift

Production Patterns

When an embedding model is updated and previously-indexed vectors are no longer comparable to new query embeddings, requiring full re-embedding.

A real operational concern when changing embedding models (e.g., text-embedding-ada-002 → text-embedding-3-large). Production teams version embedding models and gate model upgrades on full corpus re-index.

Function Calling

Production Patterns

A structured API where the LLM returns a JSON description of a function to call (with arguments) rather than free text, enabling tool use.

All frontier APIs (Anthropic, OpenAI, Google) support function calling. Production-grade implementations enforce JSON schema validity, retry on malformed outputs, and timeout on long-running tools.
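
A rough sketch of the validate-and-retry part, using only the standard library. The tool registry format, the `get_weather` tool, and the `generate` callable (a stand-in for asking the model again) are all hypothetical; timeout handling is left out for brevity.

```python
import json

# Hypothetical registry: tool name -> (required argument types, implementation).
TOOLS = {
    "get_weather": ({"city": str}, lambda city: f"Sunny in {city}"),
}

def execute_tool_call(raw):
    """Parse a model-emitted JSON tool call, validate it, and run the tool."""
    call = json.loads(raw)                 # JSONDecodeError is a ValueError
    schema, fn = TOOLS[call["name"]]       # KeyError on unknown tool or missing name
    args = call.get("arguments", {})
    if set(args) != set(schema):
        raise ValueError(f"unexpected or missing arguments: {sorted(args)}")
    for key, expected_type in schema.items():
        if not isinstance(args[key], expected_type):
            raise ValueError(f"argument {key!r} has the wrong type")
    return fn(**args)

def call_with_retry(generate, max_attempts=3):
    """Ask for a tool call again whenever the previous one was malformed."""
    last_error = None
    for _ in range(max_attempts):
        try:
            return execute_tool_call(generate())
        except (ValueError, KeyError, TypeError) as exc:
            last_error = exc
    raise RuntimeError("no valid tool call produced") from last_error

# Simulated model output: first reply is truncated JSON, second is valid.
replies = iter([
    '{"name": "get_weather", "arguments": {"city": ',
    '{"name": "get_weather", "arguments": {"city": "Oslo"}}',
])
result = call_with_retry(lambda: next(replies))
```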

Grounding

Production Patterns

The practice of constraining LLM outputs to be supported by retrieved context, cited sources, or a controlled vocabulary.

Strongest grounding: instruct the model to refuse if no supporting context is provided, require inline citations, validate citations exist in the corpus, and post-process outputs to reject claims without backing.
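
The citation-validation step can be sketched as a post-processing check. The bracketed `[doc_id]` citation format and the document ids are assumptions of this example; the check only confirms that cited ids exist, not that the cited text actually supports the claim.

```python
import re

def validate_citations(answer, corpus_ids):
    """Accept an answer only if it cites at least one source and every
    cited id exists in the corpus. Returns (ok, missing_ids)."""
    cited = set(re.findall(r"\[(\w+)\]", answer))
    missing = cited - set(corpus_ids)
    return bool(cited) and not missing, sorted(missing)

# Hypothetical corpus ids and model outputs.
corpus_ids = {"kb12", "kb40"}
ok_result = validate_citations("Refunds take 5 days [kb12].", corpus_ids)
bad_result = validate_citations("Refunds take 2 days [kb99].", corpus_ids)
uncited_result = validate_citations("Refunds are instant.", corpus_ids)
```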

Hallucination

Production Patterns

When an LLM produces a confident-sounding output that is not grounded in the input context or in verified facts.

Hallucinations cannot be fully eliminated but are dramatically reduced through grounding (RAG), domain vocabulary enforcement, output validation, and instruct-cite-or-refuse prompting.

Prompt Injection

Production Patterns

An adversarial input that hijacks an LLM's instructions, causing it to ignore the system prompt or leak sensitive information.

Defenses: input filtering, dual LLM (one model parses the untrusted input, the other acts only on a sanitized version), strict output schemas, system-prompt isolation. No defense is complete; assume some leakage.

Semantic Cache

Production Patterns

A cache where lookups are by semantic similarity of queries (via embeddings), not by exact match, allowing reuse of LLM responses for paraphrased queries.

Production semantic caches (Redis vector, GPTCache) hit on 20-40% of queries in domains with repeat patterns (FAQ, support). Each hit skips an LLM call entirely, so the saving scales directly with hit rate.
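
A toy in-memory version shows the lookup logic. `toy_embed` is a bag-of-words stand-in for a real embedding model, and the class name, threshold, and queries are assumptions; this is not how Redis or GPTCache are implemented.

```python
import math
import re
from collections import Counter

def toy_embed(text):
    """Stand-in embedding: a bag-of-words count vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))

def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

class SemanticCache:
    def __init__(self, embed=toy_embed, threshold=0.8):
        self.embed = embed
        self.threshold = threshold
        self.entries = []  # list of (query embedding, cached response)

    def put(self, query, response):
        self.entries.append((self.embed(query), response))

    def get(self, query):
        """Return the cached response of the most similar past query,
        or None if nothing clears the similarity threshold."""
        q = self.embed(query)
        best_score, best_response = 0.0, None
        for vec, response in self.entries:
            score = cosine(q, vec)
            if score > best_score:
                best_score, best_response = score, response
        return best_response if best_score >= self.threshold else None

cache = SemanticCache()
cache.put("How do I reset my password?", "Use the reset link.")
hit = cache.get("How can I reset my password?")
miss = cache.get("What is your refund policy?")
```

The threshold is the key tuning knob: too low and users get answers to questions they did not ask, too high and the cache degenerates into exact match.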

