Production AI stack: LangGraph orchestration, Qdrant retrieval, Bedrock + Vertex multi-cloud, Triton inference
The exact engineering stack behind 50+ shipped projects and our 4 production AI products — Paralegent AI, ProspectVox, VectorHire, and VORTA. Every layer named, every choice justified.
Orchestration via LangGraph (default) and CrewAI / AWS Bedrock AgentCore / Google ADK where the deployment target requires it. Retrieval over fine-tuning for most workloads. Evaluation harnesses before models. OpenTelemetry traces on every LLM call. Customer-VPC deployment in a week, not a quarter.
Four engineering principles we won't compromise on
Picked the hard way, across 50+ shipped projects. These are the trade-offs we make consistently, and the ones we'd defend in a post-mortem.
Evals before models
Golden datasets and LLM-as-judge harnesses ship before the first prompt is written.
Without a regression suite — labelled goldens, an LLM-judge with bias checks against human annotations, a frozen test set kept off the training side — every prompt change is a coin flip. We build the eval harness on Day 1 (RAGAS for retrieval, custom judges for downstream behaviour, deterministic graders for structured outputs) and treat it as the contract a model has to pass before deploy.
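As a rough illustration of the "deterministic graders for structured outputs" piece, the sketch below scores a model against a small golden set and turns the pass rate into a deploy gate. The golden cases, the `model` callable, and the 0.95 threshold are all hypothetical placeholders, not values from a real engagement.

```python
import json
from typing import Callable

# Hypothetical golden set: each case pairs an input with the fields the output must contain.
GOLDENS = [
    {"input": "Invoice 17 totals $40", "expected": {"invoice_id": 17, "total": 40.0}},
    {"input": "Invoice 18 totals $12.5", "expected": {"invoice_id": 18, "total": 12.5}},
]

def grade(raw_output: str, expected: dict) -> bool:
    """Deterministic grader: the output must be a JSON object matching every expected field."""
    try:
        parsed = json.loads(raw_output)
    except json.JSONDecodeError:
        return False
    if not isinstance(parsed, dict):
        return False
    return all(parsed.get(key) == value for key, value in expected.items())

def pass_rate(model: Callable[[str], str], goldens: list) -> float:
    """Fraction of golden cases the model passes."""
    passed = sum(grade(model(case["input"]), case["expected"]) for case in goldens)
    return passed / len(goldens)

def regression_gate(model: Callable[[str], str], goldens: list, threshold: float = 0.95) -> bool:
    """True only if the model clears the (assumed) pass-rate threshold."""
    return pass_rate(model, goldens) >= threshold
```

In CI, a failing `regression_gate` would block the merge; LLM-judge and RAGAS scores would sit alongside this check rather than replace it.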
Multi-agent is the unit of work
Most production failures are coordination bugs, not model bugs.
Single-agent chat-loops break the moment the task needs a tool that needs a sub-task. We design topologies up front — supervisor + workers, parallel committees with vote-merge, planner + executor + critic — built on LangGraph for state and checkpointing, with explicit handoff contracts and budget caps per node. The graph is the system; the LLM is a function call inside it.
Cloud-portable by design
We can move a customer deployment between AWS, GCP, and Azure in a week, not a quarter.
Every system is built against a thin LLM/embeddings/vector abstraction so the customer is never locked to one provider. We deploy into customer VPCs (and air-gapped variants for regulated industries) using Terraform modules and Helm charts that target Bedrock, Vertex AI, and Azure OpenAI interchangeably. BAA, DPA, and data-residency commitments come built-in, not bolted on.
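One way to picture the thin abstraction is a small interface that application code depends on, with one adapter per provider behind it. The sketch below is a minimal illustration: `ChatModel`, `EchoModel`, and `get_model` are hypothetical names, and the stand-in adapter only echoes its input instead of calling Bedrock, Vertex AI, or Azure OpenAI.

```python
from dataclasses import dataclass
from typing import Dict, Protocol

class ChatModel(Protocol):
    """The only surface application code is allowed to depend on."""
    def complete(self, prompt: str) -> str: ...

@dataclass
class EchoModel:
    """Stand-in adapter for illustration; a real one would wrap a provider SDK client."""
    name: str

    def complete(self, prompt: str) -> str:
        return f"[{self.name}] {prompt}"

# Registry keyed by a deployment variable (for example, one set per Terraform variable file).
PROVIDERS: Dict[str, ChatModel] = {
    "aws": EchoModel("bedrock"),
    "gcp": EchoModel("vertex"),
    "azure": EchoModel("azure-openai"),
}

def get_model(cloud: str) -> ChatModel:
    try:
        return PROVIDERS[cloud]
    except KeyError:
        raise ValueError(f"unsupported cloud: {cloud}") from None

def summarise(model: ChatModel, text: str) -> str:
    # Application logic never imports a provider SDK directly.
    return model.complete(f"Summarise: {text}")
```

Swapping clouds then means changing the key passed to `get_model`, not the calling code.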
Cost-aware by default
We engineer inference cost down before we deploy, not after the bill arrives.
Token spend is tracked per-route via Helicone or Langfuse from the first staging deploy, with budget alerts and per-tenant quotas. Routing logic pushes easy queries to Haiku / Gemini Flash / GPT-4o-mini and reserves Opus / GPT-4o / Gemini Pro for the hard 5%. Prompt caching, semantic caching, and KV-cache reuse on Triton are not optimisations; they are part of the launch checklist.
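The routing idea can be sketched as a difficulty score with a threshold. Everything here is an assumption for illustration: the keyword heuristic, the 0.6 threshold, and the per-token prices are placeholders, and a production router would more likely use a trained classifier or rules derived from the eval set.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Route:
    tier: str
    usd_per_1k_tokens: float  # placeholder price, not a real list price

CHEAP = Route("small", 0.001)
FRONTIER = Route("frontier", 0.03)

def difficulty(query: str) -> float:
    """Toy heuristic in [0, 1]: reasoning keywords plus a length term."""
    signals = ["compare", "why", "prove", "multi-step", "contract"]
    hits = sum(word in query.lower() for word in signals)
    length_score = min(len(query.split()) / 200, 1.0)
    return min(1.0, 0.3 * hits + length_score)

def pick_route(query: str, threshold: float = 0.6) -> Route:
    """Send a query to the frontier tier only when it scores as hard."""
    return FRONTIER if difficulty(query) >= threshold else CHEAP
```

Logging the chosen tier per call is what makes the route mix reviewable later.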
Five engineering pillars, named tools, defended choices
What we build on every project, in this order. No marketing pillars — these are the architecture decisions that show up in the design doc.
Multi-agent orchestration as the default unit of work
Supervisor/worker graphs, not single-agent chat loops
Retrieval over fine-tuning, for most use cases
Hybrid search + reranking beats a custom model 80% of the time
Evaluation harnesses, not vibes
Golden datasets, LLM-as-judge, regression suites in CI
Cloud-portable by design
Customer-VPC and air-gap deployments, not just SaaS
Observability at every layer
OpenTelemetry traces, latency budgets, token spend, redaction
Multi-agent orchestration as the default unit of work
Supervisor/worker graphs, not single-agent chat loops
Single-agent loops fail under any task that fans out — they lose state, exhaust context, and have no recovery story when a tool call 500s. We build typed graphs with explicit handoff contracts, per-node budgets, and durable checkpoints so a long-running task can be paused, inspected, and resumed by a human reviewer.
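The pause, inspect, and resume behaviour can be sketched without any framework. The example below is a simplified stand-in, not LangGraph's API: `run_graph`, `approve`, and the dict-backed `store` are hypothetical, and a real deployment would use a durable checkpoint backend instead of an in-memory dict.

```python
import json
from typing import Callable, Dict, List

State = dict
Node = Callable[[State], State]

def run_graph(nodes: List[Node], state: State, store: Dict[str, str], run_id: str) -> State:
    """Run nodes in order, checkpointing after each; stop early if a node asks for review."""
    saved = store.get(run_id)
    if saved:
        state = json.loads(saved)  # resume from the last checkpoint
    start = state.get("next_node", 0)
    for index in range(start, len(nodes)):
        state = nodes[index](state)
        state["next_node"] = index + 1
        store[run_id] = json.dumps(state)  # checkpoint written after every node
        if state.get("needs_review"):
            return state  # paused: a human can inspect store[run_id]
    return state

def approve(store: Dict[str, str], run_id: str) -> None:
    """Reviewer action: clear the review flag so the next run_graph call resumes."""
    state = json.loads(store[run_id])
    state["needs_review"] = False
    store[run_id] = json.dumps(state)
```

Calling `run_graph` again with the same `run_id` after `approve` picks up at the node following the pause, which is the property that matters for long-running tasks.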
The Cognilium stack — ten layers, named tools, why each
Not a logo wall. This is the actual reference matrix our engineers pick from on a Day-1 architecture call — and the one-line justification for every choice.
Orchestration
Where the graph and the agent topology live
LLMs (foundation models)
Routed by cost / latency / capability per call
Retrieval & vector
Hybrid search, reranking, graph
Voice
Real-time speech for telephony and live agents
Document AI
Extraction from PDFs, scans, and forms
Inference & serving
Where open-weight models actually run
MLOps
Training, tracking, packaging, scheduling
Data plane
OLTP, OLAP, streaming, CDC, transformation
Observability
Traces, evals, cost, prompts
Cloud
Where the customer's VPC actually lives
How a Cognilium project actually ships
Six stages, in this order, on every engagement. Evals exist before models. Topology exists before orchestrators. Drift gates exist before traffic.
Problem definition + eval-set design
Before a single model is loaded.
- Stakeholder interviews to write a 1-page success spec: what task, what input, what output, what failure mode is unacceptable
- Eval-set construction — 100-500 hand-curated examples per task, kept off the prompt-tuning side and versioned with DVC
- Metrics chosen up front: faithfulness + answer relevance (RAGAS) for RAG; tool-call accuracy + format compliance for agents; LLM-judge rubric for free-text
- Cost-per-call ceiling and latency p99 budget written into the spec; treated as deploy gates, not stretch goals
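To show what "treated as deploy gates" could look like in practice, here is a minimal sketch that checks a latency p99 budget and a cost-per-call ceiling against measured samples. The nearest-rank percentile method and the choice to compare the maximum observed cost against the ceiling are assumptions made for the example.

```python
import math
from typing import Dict, List

def percentile(values: List[float], pct: float) -> float:
    """Nearest-rank percentile of a non-empty sample."""
    if not values:
        raise ValueError("no samples")
    ordered = sorted(values)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]

def deploy_gate(latencies_ms: List[float], costs_usd: List[float],
                p99_budget_ms: float, cost_ceiling_usd: float) -> Dict[str, bool]:
    """Return one pass/fail flag per gate; a deploy proceeds only if all are True."""
    return {
        "latency_p99": percentile(latencies_ms, 99) <= p99_budget_ms,
        "cost_per_call": max(costs_usd) <= cost_ceiling_usd,
    }
```

A CI job would call `deploy_gate` with load-test output and fail the pipeline when any flag is False.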
Retrieval prototype
BM25 baseline → hybrid → rerank, measured at every step.
- BM25 baseline (Elastic / OpenSearch) on the customer corpus — gives the floor that any later complexity has to beat
- Dense retrieval on Qdrant with embedding choice picked from a bake-off (text-embedding-3-large, voyage-3, bge-large) on the customer's eval set
- Hybrid fusion (reciprocal rank fusion) + Cohere or BGE rerank — only kept if RAGAS context-precision lifts ≥ 5 points over BM25
- Chunking strategy chosen for the doc structure — semantic, recursive, or layout-aware via Unstructured.io; never a default 512-token window
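The hybrid fusion step in the list above can be sketched in a few lines. Reciprocal rank fusion scores each document as the sum of 1 / (k + rank) over the rankings it appears in; the constant k = 60 used here is a commonly cited default and is an assumption, as are the toy document IDs.

```python
from collections import defaultdict
from typing import Dict, List

def reciprocal_rank_fusion(rankings: List[List[str]], k: int = 60) -> List[str]:
    """Fuse several ranked lists of document IDs into one list, best first."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Sort by descending score; break ties by ID so the output is deterministic.
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))

bm25_hits = ["a", "b", "c"]   # toy sparse ranking
dense_hits = ["b", "d", "a"]  # toy dense ranking
fused = reciprocal_rank_fusion([bm25_hits, dense_hits])
```

The fused list would then go to the reranker, and the whole stage is kept only if it beats the BM25 floor on the eval set.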
Agent topology design
Single-agent? Supervisor + workers? Committee?
- Topology picked from the task shape, not from a default: most failures here are coordination bugs, not model bugs
- Patterns we choose from: single-agent ReAct (bounded tool budget), supervisor + N workers, plan-execute-critic, parallel committee with vote-merge
- LangGraph state schema written first — explicit reducers, checkpoint backend (Postgres for durability, Redis for hot state), retry policy per edge
- Per-node budgets — token cap, time cap, retry cap — to bound worst-case cost on adversarial inputs
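The per-node budgets in the last bullet can be illustrated with a small guard object. This is a framework-free sketch: `NodeBudget`, `run_node`, and the convention that a step returns `(output, tokens_spent)` are hypothetical, not part of LangGraph.

```python
import time
from dataclasses import dataclass, field
from typing import Callable, Tuple

class BudgetExceeded(RuntimeError):
    """Raised when a node exhausts its token, time, or retry cap."""

@dataclass
class NodeBudget:
    max_tokens: int
    max_seconds: float
    max_retries: int
    tokens_used: int = 0
    retries_used: int = 0
    started_at: float = field(default_factory=time.monotonic)

    def check_time(self) -> None:
        if time.monotonic() - self.started_at > self.max_seconds:
            raise BudgetExceeded("time cap exceeded")

    def charge_tokens(self, tokens: int) -> None:
        self.tokens_used += tokens
        if self.tokens_used > self.max_tokens:
            raise BudgetExceeded("token cap exceeded")

    def charge_retry(self) -> None:
        self.retries_used += 1
        if self.retries_used > self.max_retries:
            raise BudgetExceeded("retry cap exceeded")

def run_node(step: Callable[[], Tuple[str, int]], budget: NodeBudget) -> str:
    """Run `step` until it succeeds or a cap trips; `step` returns (output, tokens_spent)."""
    while True:
        budget.check_time()
        try:
            output, tokens = step()
        except Exception:
            budget.charge_retry()  # raises BudgetExceeded once retries run out
            continue
        budget.charge_tokens(tokens)
        return output
```

Because every cap raises the same exception type, the supervisor can treat any exhausted budget as one failure mode and route it to a fallback or a human.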
Production pipeline
Orchestrator + observability + cost tracking + drift gates.
- Full OpenTelemetry GenAI-convention instrumentation on every LLM call, retrieval step, and tool execution
- Token-spend metering via Helicone or Langfuse — per-tenant, per-route, per-model — wired to budget alerts
- PII redaction (Microsoft Presidio + custom regex) on logged prompts/responses, with per-tenant retention configurable
- Drift gates — PSI / KS-test on embedding distributions, alert when production traffic moves > 1σ from eval set within 7 days
- Routing logic — Haiku / Flash / 4o-mini for the easy 95%, Opus / GPT-4o / Gemini Pro reserved for the hard 5%
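For the drift gate in the list above, a Population Stability Index check might look like the sketch below. It assumes the embedding distribution has first been reduced to one scalar per request (for example, distance to the eval-set centroid); that reduction, the 10 bins, and the smoothing constant are assumptions made for the example.

```python
import math
from typing import List

def psi(expected: List[float], actual: List[float], bins: int = 10, eps: float = 1e-6) -> float:
    """Population Stability Index between a baseline sample and a production sample."""
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0  # guard against a constant baseline

    def histogram(values: List[float]) -> List[float]:
        counts = [0] * bins
        for value in values:
            index = int((value - lo) / width)
            counts[min(max(index, 0), bins - 1)] += 1  # clamp out-of-range values
        # Floor each share at eps so empty bins do not produce log(0).
        return [max(count / len(values), eps) for count in counts]

    base, prod = histogram(expected), histogram(actual)
    return sum((p - b) * math.log(p / b) for b, p in zip(base, prod))
```

The alert threshold is a policy choice; the function only supplies the number the gate compares against.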
Pre-launch validation
Red-team, safety eval, latency, cost-per-conversation.
- Red-team prompt suite — jailbreak, prompt-injection, indirect-injection-via-retrieval, scope-creep — has to pass before any traffic
- Safety evals against the customer's policy: refusal calibration, sensitive-topic handling, output-format compliance
- Latency p50 / p95 / p99 measured under realistic mixed-traffic load with k6 or Locust, not just sunny-day single calls
- Cost-per-conversation projected at 10x current volume and at the worst-case persona; if it overruns budget, route mix and prompt are rewritten
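The cost projection in the last bullet is simple arithmetic, sketched below. All inputs (turn counts, token counts, route share, prices, budget) are hypothetical placeholders; the 10x multiplier is the one named in the text.

```python
def cost_per_conversation(turns: int, tokens_per_turn: int, frontier_share: float,
                          cheap_usd_per_1k: float, frontier_usd_per_1k: float) -> float:
    """Blended cost of one conversation given the share of tokens sent to the frontier tier."""
    blended = frontier_share * frontier_usd_per_1k + (1 - frontier_share) * cheap_usd_per_1k
    return turns * tokens_per_turn / 1000 * blended

def overruns_budget(per_conversation_usd: float, conversations_per_month: int,
                    monthly_budget_usd: float, volume_multiplier: int = 10) -> bool:
    """True if projected monthly spend at the multiplied volume exceeds the budget."""
    projected = per_conversation_usd * conversations_per_month * volume_multiplier
    return projected > monthly_budget_usd

# Placeholder personas: a typical user and a worst-case user who talks longer
# and triggers the frontier tier far more often.
typical = cost_per_conversation(10, 1000, 0.05, 0.001, 0.03)
worst_case = cost_per_conversation(40, 2000, 0.5, 0.001, 0.03)
```

If `overruns_budget` returns True for the worst-case persona, that is the signal to revisit the route mix and prompt before launch.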
Deploy + on-call + eval CI
Customer VPC or shared infra. Runbook included.
- Deploy via Terraform module + Helm chart into customer EKS / GKE / AKS — or Cognilium-shared infra for SaaS-mode customers
- Shadow traffic, then 5% / 25% / 100% ramps with statistical-significance gates on the primary metric before each step
- On-call runbook — top 10 alerts, debug paths, model fallback plan, escalation contact — handed to the customer team in week 1
- Eval suite wired into the customer's CI so any prompt or model change has to clear the regression gate before merging
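The ramp gates in the list above could be implemented in several ways; one simple option is a one-sided two-proportion z-test on the primary metric. The sketch below blocks the next ramp step only when the candidate is significantly worse than control. The choice of test, the -1.645 cut-off (roughly alpha = 0.05, one-sided), and the function names are assumptions, not a description of the actual gate.

```python
import math
from typing import Optional, Tuple

RAMP_STEPS = [5, 25, 100]  # percent of traffic, as listed in the text

def two_proportion_z(success_a: int, n_a: int, success_b: int, n_b: int) -> float:
    """z-statistic for the difference in success rates (b minus a) using a pooled variance."""
    p_a, p_b = success_a / n_a, success_b / n_b
    pooled = (success_a + success_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    return (p_b - p_a) / se if se else 0.0

def may_advance(control: Tuple[int, int], candidate: Tuple[int, int],
                z_floor: float = -1.645) -> bool:
    """Allow the next ramp step unless the candidate is significantly worse than control."""
    return two_proportion_z(*control, *candidate) >= z_floor

def next_step(current: int) -> Optional[int]:
    """Next traffic percentage after `current`, or None once fully ramped."""
    later = [step for step in RAMP_STEPS if step > current]
    return later[0] if later else None
```

Each tuple is `(successes, trials)` on the primary metric for that arm during the current ramp step.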
Verified outcomes — nothing rounded up
The four numbers we'll defend on a reference call. Everything else on this page is an engineering choice we made to keep them true.
Projects delivered
Production AI systems shipped across legal, financial services, healthcare, sales, and HR — each through the same six-stage delivery pipeline.
Production AI products
Paralegent AI (legal multi-agent), ProspectVox (voice sales agent on Ultravox), VectorHire (recruiting RAG), VORTA (agentic workspace search).
Client satisfaction
Measured at engagement close across the portfolio. The remaining 4% is the budget we hold for what we'd redo in hindsight.
Clients in US, UAE & Pakistan
Multi-region delivery teams, multi-cloud deployment defaults — the same engineers ship in customer VPCs on AWS, GCP, and Azure.
The stack in production — four case studies
Each one names the layers from the stack matrix above. Each one ships through the six-stage flow. Customer names anonymised where contracts require it.
22 hours/week saved on teacher lesson preparation
- Methodology-grounded lesson generation — answers stay faithful to the publisher's 188 catalogued mini-lessons + 50+ named teaching strategies
- Hybrid BM25 + dense embeddings (text-embedding-3-large) — dense catches semantic matches, sparse catches publisher-specific terminology that pure vectors miss
- Three escape hatches keep the co-pilot in scope: domain filter, must-cite-source guard, confidence floor — out-of-scope queries get a polite redirect
- Drift gate via PSI on embedding distribution — alerts trigger re-eval before quality erodes
Supervisor + 7 worker agents for diligence and portfolio operations
- Three-agent committee (analyst + risk + auditor) with vote-merge — disagreement on a diligence question routes to a human partner with full trace
- GraphRAG over the family entity graph (trusts, holdings, counterparties) — relationship queries that flat-vector retrieval mishandles
- Document AI extraction on bank and brokerage statements feeds the same graph, normalised via dbt — refresh latency under 24 hours
- Audit log retained 7 years, OpenTelemetry trace per decision, reproducible re-run of any historical query against today's data
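The committee-with-vote-merge pattern from the first bullet can be sketched as follows. The `Verdict` type, the unanimous quorum, and the result shape are hypothetical; the point is only that any disagreement yields an escalation carrying every agent's answer.

```python
from collections import Counter
from dataclasses import dataclass
from typing import List

@dataclass
class Verdict:
    agent: str
    answer: str

def vote_merge(verdicts: List[Verdict], quorum: int) -> dict:
    """Accept the majority answer if it reaches quorum; otherwise escalate with the full trace."""
    tally = Counter(verdict.answer for verdict in verdicts)
    answer, votes = tally.most_common(1)[0]
    if votes >= quorum:
        return {"status": "accepted", "answer": answer, "votes": votes}
    return {
        "status": "escalate_to_human",
        "trace": [(verdict.agent, verdict.answer) for verdict in verdicts],
    }

committee = [
    Verdict("analyst", "approve"),
    Verdict("risk", "approve"),
    Verdict("auditor", "reject"),
]
# With quorum=3 (unanimity), a 2-1 split is routed to a human reviewer.
decision = vote_merge(committee, quorum=3)
```

Lowering `quorum` to 2 would accept simple majorities instead; which setting is right depends on how costly a wrong answer is.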
Sub-second voice latency on Ultravox with CRM-aware tool-use
- Single-model speech-in / speech-out via Ultravox — avoids the STT → LLM → TTS stack latency that kills natural conversation
- Dialogue policy as a LangGraph state machine — explicit handoffs to a human SDR on intent-to-buy or objection patterns above threshold
- Real-time CRM tool-use mid-call: lookup contact, log activity, propose meeting slot, all within the conversation turn
- Per-call cost ceiling enforced — calls that exceed budget fail closed to a polite handoff, never to a runaway loop
Resume + JD semantic match at portfolio scale, with explainable rerank
- Hybrid retrieval — BM25 on hard skills + dense vectors on experience narratives — fused via RRF and reranked for top-K precision
- Explainable match — every candidate-to-JD score decomposes into highlighted spans and named skill matches, surfaced in the UI
- Embeddings refresh on resume edit via Debezium CDC — no stale candidate vectors, no batch-job lag
- RAGAS-style eval suite gating any change to the embedding model or rerank prompt
Engineering targets, not marketing outcomes
Latency budgets we put in the spec. Cost discipline we enforce in CI. Drift gates that trigger re-eval automatically. Compliance commitments that come with the deployment.
p99 targets we sign up for in the spec
bf16 + KV-cache reuse on a co-located node; per-token latency reported by Triton perf_analyzer
Hybrid retrieval + Cohere rerank + Sonnet streaming first token
Parallel worker execution, supervisor vote-merge with early-exit on consensus
Single-model speech-in / speech-out path — avoids stacked STT → LLM → TTS latency
What we engineer to, before we sign the order
Projected at 10x volume and worst-case persona — if overrun, route mix and prompt are rewritten before launch
Helicone or Langfuse from Day 1 of staging — budget alerts fire before the credit card does
Frontier models (Opus / GPT-4o / Gemini Pro) reserved for the hard 5%; mix logged and reviewable
Anthropic prompt caching for stable system prompts; semantic cache for repeated user queries; KV reuse on Triton
How we keep production from rotting
Production traffic compared to eval set; alert when divergence > 1σ within 7 days
Drift detection → re-eval pipeline runs automatically, blocks promotion if RAGAS scores drop > 2 points
Any prompt or model change has to clear the golden-set regression suite before merge
Jailbreak, prompt-injection, indirect-via-retrieval, scope-creep prompts re-run on every release branch
What we sign up to in the MSA
Targets set against the customer's downstream criticality — we don't quote generic SLAs; we engineer to the one we sign
Terraform module + Helm chart targeting EKS / GKE / AKS — same artifact, different variable file
BAA (HIPAA) and DPA (GDPR) ready via Bedrock / Azure OpenAI; air-gap variant for regulated industries
Every LLM call logged with prompt, retrieval context, tool calls, response, and tenant — PII-redacted via Presidio
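A redacted audit record of that shape might be assembled as in the sketch below. It covers only the custom-regex side of redaction and does not call Presidio; the two patterns, the field names, and `audit_record` itself are illustrative assumptions and would miss many real PII formats.

```python
import json
import re
from datetime import datetime, timezone
from typing import List

# Deliberately simple patterns for illustration only.
PATTERNS = {
    "EMAIL": re.compile(r"[\w.+-]+@[\w-]+\.[\w.-]+"),
    "PHONE": re.compile(r"\+?\d[\d\s().-]{7,}\d"),
}

def redact(text: str) -> str:
    """Replace each matched span with a typed placeholder such as <EMAIL>."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"<{label}>", text)
    return text

def audit_record(tenant: str, prompt: str, context: List[str],
                 tool_calls: List[str], response: str) -> str:
    """Serialise one LLM call as a JSON line with free-text fields redacted."""
    return json.dumps({
        "ts": datetime.now(timezone.utc).isoformat(),
        "tenant": tenant,
        "prompt": redact(prompt),
        "retrieval_context": [redact(chunk) for chunk in context],
        "tool_calls": tool_calls,
        "response": redact(response),
    })
```

Redacting before serialisation means raw PII never reaches the log store, which keeps retention settings a storage concern and not a privacy one.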
Everything You Need to Know About Cognilium AI Technology
The questions engineering teams actually ask us on the first architecture call. Stack choices, latency budgets, drift gates, compliance posture.