Production reference — updated for 2026

Production AI stack: LangGraph orchestration, Qdrant retrieval, Bedrock + Vertex multi-cloud, Triton inference

The exact engineering stack behind 50+ shipped projects and our 4 production AI products — Paralegent AI, ProspectVox, VectorHire, and VORTA. Every layer named, every choice justified.

Orchestration via LangGraph (default) and CrewAI / AWS Bedrock AgentCore / Google ADK where the deployment target requires it. Retrieval over fine-tuning for most workloads. Evaluation harnesses before models. OpenTelemetry traces on every LLM call. Customer-VPC deployment in a week, not a quarter.

SOC 2 Type II controls | ISO 27001-aligned | HIPAA BAA + GDPR DPA-ready | AWS / GCP / Azure | 96% client satisfaction | Founded 2019 (US, UAE, Pakistan)
p99 < 80ms
Triton inference
Per-token decode budget on bf16 + KV-cache reuse; RAG end-to-end p99 under 800ms
LangGraph
Default orchestrator
Supervisor/worker graphs, durable state, checkpointing, human-in-loop nodes
Qdrant + BM25
Hybrid retrieval
Dense + sparse fusion, Cohere/BGE rerank, payload filters at 100M-vector scale
Multi-cloud
AWS / GCP / Azure
Customer-VPC and air-gap variants; BAA/DPA-ready; SOC 2 Type II + ISO 27001 paths

Four engineering principles we won't compromise on

Learned the hard way, across 50+ shipped projects. These are the trade-offs we make consistently, and the ones we'd defend in a post-mortem.

Evals before models

Golden datasets and LLM-as-judge harnesses ship before the first prompt is written.

Without a regression suite — labelled goldens, an LLM-judge with bias checks against human annotations, a frozen test set kept off the training side — every prompt change is a coin flip. We build the eval harness on Day 1 (RAGAS for retrieval, custom judges for downstream behaviour, deterministic graders for structured outputs) and treat it as the contract a model has to pass before deploy.
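
As a minimal sketch of the "deterministic graders for structured outputs" idea, the snippet below checks one model output against a required-field schema. The function name, the schema shape, and the sample fields are illustrative assumptions, not Cognilium's actual harness.

```python
import json


def grade_structured_output(raw: str, required: dict[str, type]) -> tuple[bool, str]:
    """Return (passed, reason) for one model output.

    `required` maps field names to expected Python types. This is a
    hypothetical schema format chosen for the example.
    """
    try:
        payload = json.loads(raw)
    except json.JSONDecodeError as exc:
        return False, f"invalid JSON: {exc.msg}"
    if not isinstance(payload, dict):
        return False, "top-level value is not an object"
    for field, expected_type in required.items():
        if field not in payload:
            return False, f"missing field: {field}"
        if not isinstance(payload[field], expected_type):
            return False, f"wrong type for field: {field}"
    return True, "ok"


# Hypothetical schema for a contract-extraction task.
SCHEMA = {"party": str, "amount": int}
print(grade_structured_output('{"party": "Acme", "amount": 10}', SCHEMA))
print(grade_structured_output('{"party": "Acme"}', SCHEMA))
```

Because the grader is deterministic, the same output always gets the same verdict, which is what makes it usable as a deploy contract alongside the judge-based checks.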

Multi-agent is the unit of work

Most production failures are coordination bugs, not model bugs.

Single-agent chat loops break the moment the task needs a tool that needs a sub-task. We design topologies up front (supervisor + workers, parallel committees with vote-merge, planner + executor + critic), built on LangGraph for state and checkpointing, with explicit handoff contracts and budget caps per node. The graph is the system; the LLM is a function call inside it.

Cloud-portable by design

We can move a customer deployment between AWS, GCP, and Azure in a week, not a quarter.

Every system is built against a thin LLM/embeddings/vector abstraction so the customer is never locked to one provider. We deploy into customer VPCs (and air-gapped variants for regulated industries) using Terraform modules and Helm charts that target Bedrock, Vertex AI, and Azure OpenAI interchangeably. BAA, DPA, and data-residency commitments come built-in, not bolted on.
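
One way to picture the thin LLM abstraction is a small interface that application code depends on, with one adapter per provider behind it. The sketch below is an assumption about the shape of such a layer: `ChatModel`, `EchoModel`, and `build_model` are hypothetical names, and the adapters are stand-ins rather than real Bedrock, Vertex AI, or Azure OpenAI clients.

```python
from dataclasses import dataclass
from typing import Protocol


class ChatModel(Protocol):
    """The only surface application code is allowed to depend on."""

    def complete(self, prompt: str) -> str: ...


@dataclass
class EchoModel:
    """Stand-in adapter; a real one would wrap a provider SDK client."""

    name: str

    def complete(self, prompt: str) -> str:
        return f"[{self.name}] {prompt}"


def build_model(provider: str) -> ChatModel:
    """Pick an adapter from configuration (hypothetical provider keys)."""
    registry = {
        "bedrock": EchoModel("bedrock"),
        "vertex": EchoModel("vertex"),
        "azure-openai": EchoModel("azure-openai"),
    }
    try:
        return registry[provider]
    except KeyError:
        raise ValueError(f"unknown provider: {provider}") from None


def summarise(model: ChatModel, text: str) -> str:
    # Application code never imports a provider SDK directly.
    return model.complete(f"Summarise: {text}")


print(summarise(build_model("vertex"), "hello"))
```

Moving a deployment then comes down to changing the provider key in configuration, assuming each adapter honours the same contract.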

Cost-aware by default

We engineer inference cost down before we deploy, not after the bill arrives.

Token spend is tracked per-route via Helicone or LangSmith from the first staging deploy, with budget alerts and per-tenant quotas. Routing logic pushes easy queries to Haiku / Gemini Flash / GPT-4o-mini, and reserves Opus / GPT-4o / Gemini Pro for the hard 5%. Prompt caching, semantic caching, and KV-cache reuse on Triton are not optimisations — they are part of the launch checklist.
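
The routing and quota logic described above could look roughly like the sketch below. The difficulty score, the 0.95 threshold, and the `TenantBudget` class are assumptions made for illustration; the model labels are placeholders for the tiers named in the text, not real API model IDs.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Route:
    tier: str
    model: str


# Placeholder labels for the two tiers, not provider model identifiers.
CHEAP = Route("cheap", "haiku")
FRONTIER = Route("frontier", "opus")


def choose_route(difficulty: float, threshold: float = 0.95) -> Route:
    """Send only queries scoring at or above `threshold` to the frontier tier.

    Assumes an upstream classifier that yields a difficulty score in [0, 1].
    """
    return FRONTIER if difficulty >= threshold else CHEAP


class TenantBudget:
    """Running spend for one tenant against a fixed quota."""

    def __init__(self, limit_usd: float) -> None:
        self.limit_usd = limit_usd
        self.spent_usd = 0.0

    def record(self, cost_usd: float) -> bool:
        """Add one call's cost; return True once the tenant is at or over quota."""
        self.spent_usd += cost_usd
        return self.spent_usd >= self.limit_usd


budget = TenantBudget(limit_usd=1.0)
print(choose_route(0.2).model, choose_route(0.99).model)
print(budget.record(0.4), budget.record(0.7))
```

What happens once `record` returns True (alert only, downgrade, or block) is a policy decision the source does not specify, so the sketch leaves it to the caller.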

Five engineering pillars, named tools, defended choices

What we build on every project, in this order. No marketing pillars — these are the architecture decisions that show up in the design doc.

  • Multi-agent orchestration as the default unit of work: supervisor/worker graphs, not single-agent chat loops
  • Retrieval over fine-tuning, for most use cases: hybrid search + reranking beats a custom model 80% of the time
  • Evaluation harnesses, not vibes: golden datasets, LLM-as-judge, regression suites in CI
  • Cloud-portable by design: customer-VPC and air-gap deployments, not just SaaS
  • Observability at every layer: OpenTelemetry traces, latency budgets, token spend, redaction

Multi-agent orchestration as the default unit of work

Supervisor/worker graphs, not single-agent chat loops

Single-agent loops fail under any task that fans out: they lose state, exhaust context, and have no recovery story when a tool call returns a 500. We build typed graphs with explicit handoff contracts, per-node budgets, and durable checkpoints so a long-running task can be paused, inspected, and resumed by a human reviewer.

LangGraph (default) — durable state, checkpointing to Postgres/Redis, human-in-loop nodes, retry-with-backoff at the edge level
CrewAI for role-based collaborative agents (PM + engineer + reviewer patterns) where the value is in the persona separation
AWS Bedrock AgentCore when the customer is AWS-native and wants managed runtime + IAM-scoped tool execution
Google ADK (Agent Development Kit) for Vertex-native tool registration with built-in tracing through Cloud Trace
Topology patterns we use most: supervisor + N workers, plan-execute-critic, parallel committee with vote-merge, ReAct with bounded tool budget
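
To make the "parallel committee with vote-merge" pattern concrete, here is a framework-free sketch in plain Python rather than LangGraph. The workers are stub callables, and the quorum rule and the `needs_human` escalation status are assumptions for illustration.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from typing import Callable

Worker = Callable[[str], str]


def committee_vote(question: str, workers: list[Worker], quorum: int) -> dict:
    """Run all workers in parallel and merge their answers by majority vote.

    If the most common answer has fewer than `quorum` votes, the result is
    flagged for a human reviewer instead of being merged.
    """
    with ThreadPoolExecutor(max_workers=len(workers)) as pool:
        # pool.map preserves worker order in the returned answers.
        answers = list(pool.map(lambda worker: worker(question), workers))
    winner, votes = Counter(answers).most_common(1)[0]
    if votes >= quorum:
        return {"status": "merged", "answer": winner, "votes": votes}
    return {"status": "needs_human", "answers": answers}


# Stub workers standing in for LLM-backed agents.
workers = [lambda q: "approve", lambda q: "approve", lambda q: "reject"]
print(committee_vote("Is the filing complete?", workers, quorum=2))
print(committee_vote("Is the filing complete?", workers, quorum=3))
```

In a real graph each worker would also carry its own budget and checkpoint; the merge step stays this simple only because answers here are exact-match strings.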

The Cognilium stack — ten layers, named tools, why each

Not a logo wall. This is the actual reference matrix our engineers pick from on a Day-1 architecture call — and the one-line justification for every choice.

Orchestration

Where the graph and the agent topology live

LangGraph
Default. Durable state + checkpointing + human-in-loop nodes; the only mainstream orchestrator with a defensible recovery story.
CrewAI
Role-based collaborative agents — when persona separation (PM, engineer, reviewer) is itself the product.
AWS Bedrock AgentCore
Managed deployment target for AWS-native customers; IAM-scoped tool execution + Guardrails out of the box.
Google ADK
Vertex-native tool registration with built-in Cloud Trace, when the customer's ML budget is on GCP.
LangChain
Used sparingly for adapters and small utility chains, not as primary control flow.

LLMs (foundation models)

Routed by cost / latency / capability per call

Anthropic Claude (Opus / Sonnet / Haiku)
Default for complex tool-use, long-context summarisation, and tasks where refusal calibration matters.
OpenAI GPT-4o / 4o-mini
Cost-efficient routing for high-volume routes; multimodal vision tasks where Claude is overkill.
AWS Bedrock (Anthropic, Cohere, Mistral, Meta)
When the customer requires AWS data-residency / BAA / private-link — same models, AWS contract.
Google Gemini (Pro / Flash)
1M+ token context for whole-codebase or whole-trial-transcript ingest; Flash for low-cost summarisation.
Open-weight via Together / Groq / vLLM
Llama 3.x, Mistral, Qwen for air-gapped deployments and per-token cost floors; Groq for sub-300ms latency requirements.

Retrieval & vector

Hybrid search, reranking, graph

Qdrant
Most common in production. Hybrid search + payload filtering at 100M-vector scale; cheap to self-host in customer VPC.
Pinecone
Managed serverless when ops budget is tight; namespace-per-tenant pattern for multi-tenant SaaS.
Weaviate
When the customer wants a modular hybrid retrieval engine they can extend with custom modules.
Elastic / OpenSearch
BM25 sparse half of every hybrid setup; also the operational search backbone customers already trust.
Postgres pgvector
When vectors live next to transactional data and operating a second store isn't worth it (≤ 5M vectors).
Cohere Rerank 3 / BGE re-rankers
Cross-encoder rerank stage — Cohere for managed, BGE (bge-reranker-v2-m3) for in-VPC.
Neo4j / Memgraph
Property-graph backbone for GraphRAG when relationship traversal beats nearest-neighbour.

Voice

Real-time speech for telephony and live agents

Ultravox
Cognilium's default for low-latency speech-in / speech-out; the single-model design avoids the STT→LLM→TTS stack latency.
Deepgram
Best-in-class streaming STT when we need provider separation from the LLM; word-level timestamps for analytics.
AssemblyAI
Speaker diarization, sentiment, and entity detection on call recordings — VORTA and ProspectVox lean on it.
ElevenLabs
Highest-fidelity TTS for branded voices; multilingual cloning for non-English deployments.
Twilio / LiveKit
Telephony and WebRTC orchestration — the carrier-grade layer above the speech stack.

Document AI

Extraction from PDFs, scans, and forms

Google Document AI
Best processor library for forms, invoices, and contracts — we pair it with an LLM post-pass for free-text fields.
Azure Form Recognizer (Document Intelligence)
Default for Azure-native enterprises; strong on table extraction and custom-model training UX.
AWS Textract
AWS-native baseline — Queries API and Layout outputs feed into Bedrock-hosted post-processing.
Unstructured.io
Layout-aware chunking for RAG ingestion — turns messy PDFs / DOCX / HTML into clean elements.

Inference & serving

Where open-weight models actually run

NVIDIA Triton Inference Server
p99 < 80ms decode budget on bf16 with KV-cache reuse; multi-model serving for embeddings + rerankers + small LLMs on one node.
vLLM
PagedAttention throughput for open-weight LLM serving — default behind Triton for Llama / Mistral / Qwen.
AWS SageMaker endpoints
Managed autoscaling endpoints when the customer is fully AWS and won't run their own Triton.
Vertex AI predictions
Same role on GCP — co-located with BigQuery / Vertex datasets for cheaper data egress.
Replicate / Modal
Burst inference for image, audio, and infrequent open-weight calls without standing GPU cost.

MLOps

Training, tracking, packaging, scheduling

MLflow
Experiment tracking and model registry; promoted runs become the canonical 'production' pointer.
Weights & Biases
When the team wants richer eval dashboards and sweeps; integrates cleanly with our RAGAS harness.
BentoML
Packaging models + pre/post-processing as a single deployable when SageMaker / Vertex containers aren't enough.
Argo Workflows / Kubeflow
Kubernetes-native DAGs for training pipelines and large batch inference — the default in customer EKS / GKE.
DVC
Versioning for golden eval datasets and labelled training data — diffable, reviewable in PRs.

Data plane

OLTP, OLAP, streaming, CDC, transformation

PostgreSQL + TimescaleDB
Default OLTP plus time-series for event and metric data; pgvector when vectors live here.
ClickHouse
Sub-second OLAP over billions of rows for trace and analytics queries; the backbone of our observability dashboards.
Snowflake / Databricks
Customer-owned warehouse/lakehouse — we land features and eval data here, never duplicate.
dbt
All transformations defined as version-controlled SQL with tests; the contract between data team and ML team.
Kafka / Redpanda
Event backbone for streaming ingest, agent message buses, and at-least-once delivery to downstream stores.
Debezium CDC
Low-latency change capture from OLTP into Kafka and the vector store — keeps RAG corpora fresh.

Observability

Traces, evals, cost, prompts

OpenTelemetry (GenAI conventions)
Vendor-neutral traces across LLM calls, retrieval, and tool calls — exported to whatever the customer already runs.
LangSmith
Trace + eval + prompt-management when the customer is happy with a hosted LLM-ops layer.
Langfuse
Self-hostable LangSmith equivalent for air-gapped and EU-data-residency deployments.
Datadog
Customer's existing APM; we export OTel into it so on-call sees LLM spans next to service spans.
Helicone
Per-tenant token-spend metering, caching, and budget alerts in front of OpenAI / Anthropic.

Cloud

Where the customer's VPC actually lives

AWS
Bedrock, SageMaker, Lambda, ECS / EKS, S3, RDS, OpenSearch — the most common deployment target across our 50+ projects.
GCP
Vertex AI, Cloud Run, BigQuery, GKE — where the data is already in BigQuery and Gemini commits drive routing.
Azure
Azure OpenAI Service, Azure ML, Functions, AKS, ADLS — common for healthcare and EU enterprises with M365 commits.

How a Cognilium project actually ships

Six stages, in this order, on every engagement. Evals exist before models. Topology exists before orchestrators. Drift gates exist before traffic.

Stage 0

Problem definition + eval-set design

Before a single model is loaded.

  • Stakeholder interviews to write a 1-page success spec: what task, what input, what output, what failure mode is unacceptable
  • Eval-set construction — 100-500 hand-curated examples per task, kept off the prompt-tuning side and versioned with DVC
  • Metrics chosen up front: faithfulness + answer relevance (RAGAS) for RAG; tool-call accuracy + format compliance for agents; LLM-judge rubric for free-text
  • Cost-per-call ceiling and latency p99 budget written into the spec; treated as deploy gates, not stretch goals
Stage 1

Retrieval prototype

BM25 baseline → hybrid → rerank, measured at every step.

  • BM25 baseline (Elastic / OpenSearch) on the customer corpus — gives the floor that any later complexity has to beat
  • Dense retrieval on Qdrant with embedding choice picked from a bake-off (text-embedding-3-large, voyage-3, bge-large) on the customer's eval set
  • Hybrid fusion (reciprocal rank fusion) + Cohere or BGE rerank — only kept if RAGAS context-precision lifts ≥ 5 points over BM25
  • Chunking strategy chosen for the doc structure — semantic, recursive, or layout-aware via Unstructured.io; never a default 512-token window
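
The hybrid-fusion step in the list above uses reciprocal rank fusion (RRF). A minimal sketch follows; the document IDs are made up, and `k=60` is the constant commonly used for RRF rather than a value stated in the source.

```python
def reciprocal_rank_fusion(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse several ranked lists of document IDs into one ranking.

    Each document scores sum(1 / (k + rank)) over the lists it appears in,
    with ranks starting at 1. Ties are broken by document ID for determinism.
    """
    scores: dict[str, float] = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


bm25_hits = ["a", "b", "c"]   # sparse ranking (hypothetical IDs)
dense_hits = ["b", "c", "d"]  # dense ranking (hypothetical IDs)
print(reciprocal_rank_fusion([bm25_hits, dense_hits]))  # ['b', 'c', 'a', 'd']
```

RRF only needs ranks, not raw scores, which is why it can combine BM25 and dense results without score normalisation. The fused list would then go to the rerank stage.
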
Stage 2

Agent topology design

Single-agent? Supervisor + workers? Committee?

  • Topology picked from the task shape, not from a default; most failures here are coordination bugs, not model bugs
  • Patterns we choose from: single-agent ReAct (bounded tool budget), supervisor + N workers, plan-execute-critic, parallel committee with vote-merge
  • LangGraph state schema written first — explicit reducers, checkpoint backend (Postgres for durability, Redis for hot state), retry policy per edge
  • Per-node budgets — token cap, time cap, retry cap — to bound worst-case cost on adversarial inputs
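
The per-node budgets in the last bullet can be sketched as a small wrapper around a node call. This is plain Python rather than LangGraph's API, and `NodeBudget`, `run_node`, and the `(result, tokens_used)` return convention are assumptions for the example.

```python
import time
from dataclasses import dataclass
from typing import Callable


class BudgetExceeded(RuntimeError):
    """Raised when a node hits its token, time, or retry cap."""


@dataclass
class NodeBudget:
    max_tokens: int
    max_seconds: float
    max_retries: int


def run_node(call: Callable[[], tuple[str, int]], budget: NodeBudget,
             clock: Callable[[], float] = time.monotonic) -> str:
    """Run `call` under a budget; `call` returns (result, tokens_used)."""
    started = clock()
    attempts = 0
    last_error: Exception | None = None
    while attempts <= budget.max_retries:
        # The time cap is checked between attempts; it does not interrupt
        # a call that is already running.
        if clock() - started > budget.max_seconds:
            raise BudgetExceeded("time cap reached")
        attempts += 1
        try:
            result, tokens_used = call()
        except Exception as exc:
            last_error = exc
            continue
        if tokens_used > budget.max_tokens:
            raise BudgetExceeded("token cap reached")
        return result
    raise BudgetExceeded(f"retry cap reached after {attempts} attempts") from last_error
```

With `max_retries=2` a node gets at most three attempts, so the worst-case cost of a single node is bounded even on adversarial inputs.
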
Stage 3

Production pipeline

Orchestrator + observability + cost tracking + drift gates.

  • Full OpenTelemetry GenAI-convention instrumentation on every LLM call, retrieval step, and tool execution
  • Token-spend metering via Helicone or Langfuse — per-tenant, per-route, per-model — wired to budget alerts
  • PII redaction (Microsoft Presidio + custom regex) on logged prompts/responses, with per-tenant retention configurable
  • Drift gates — PSI / KS-test on embedding distributions, alert when production traffic moves > 1σ from eval set within 7 days
  • Routing logic — Haiku / Flash / 4o-mini for the easy 95%, Opus / GPT-4o / Gemini Pro reserved for the hard 5%
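
For the drift gate, the Population Stability Index (PSI) compares how two samples fall into the same bins. The sketch below assumes each embedding has already been reduced to one scalar (for example, distance to the eval-set centroid); that reduction, the bin count, and the epsilon floor are illustrative choices, not details from the source.

```python
import math


def psi(expected: list[float], actual: list[float],
        bins: int = 10, eps: float = 1e-6) -> float:
    """Population Stability Index between a baseline and a production sample.

    Bin edges come from the baseline's range; production values outside that
    range are clamped into the first or last bin. Empty bins are floored at
    `eps` so the logarithm stays defined.
    """
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0  # guard against a constant baseline

    def proportions(values: list[float]) -> list[float]:
        counts = [0] * bins
        for value in values:
            index = int((value - lo) / width)
            counts[min(max(index, 0), bins - 1)] += 1
        return [max(count / len(values), eps) for count in counts]

    e, a = proportions(expected), proportions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


baseline = [i / 100 for i in range(100)]       # synthetic eval-set scores
shifted = [0.5 + i / 200 for i in range(100)]  # synthetic production scores
print(psi(baseline, baseline), psi(baseline, shifted))
```

Identical samples give a PSI of exactly zero and the value grows as the distributions diverge. The alert threshold would be set per deployment.
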
Stage 4

Pre-launch validation

Red-team, safety eval, latency, cost-per-conversation.

  • Red-team prompt suite — jailbreak, prompt-injection, indirect-injection-via-retrieval, scope-creep — has to pass before any traffic
  • Safety evals against the customer's policy: refusal calibration, sensitive-topic handling, output-format compliance
  • Latency p50 / p95 / p99 measured under realistic mixed-traffic load with k6 or Locust, not just sunny-day single-call
  • Cost-per-conversation projected at 10x current volume and at the worst-case persona; if it overruns budget, route mix and prompt are rewritten
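
The latency check in this stage can be treated as a gate on measured samples. Below is a sketch using the nearest-rank percentile definition; the function names are hypothetical, and in practice the samples would come from a k6 or Locust run rather than a synthetic list.

```python
import math


def percentile(samples: list[float], pct: int) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct% at or below it."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def latency_gate(samples_ms: list[float], p99_budget_ms: float) -> dict:
    """Summarise p50 / p95 / p99 and compare p99 against the spec budget."""
    p99 = percentile(samples_ms, 99)
    return {
        "p50": percentile(samples_ms, 50),
        "p95": percentile(samples_ms, 95),
        "p99": p99,
        "passed": p99 <= p99_budget_ms,
    }


samples = list(range(1, 101))  # synthetic latencies: 1..100 ms
print(latency_gate(samples, p99_budget_ms=800))
# {'p50': 50, 'p95': 95, 'p99': 99, 'passed': True}
```

Reporting all three percentiles from the same run keeps the tail visible; a healthy p50 says little about whether the p99 budget in the spec is being met.
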
Stage 5

Deploy + on-call + eval CI

Customer VPC or shared infra. Runbook included.

  • Deploy via Terraform module + Helm chart into customer EKS / GKE / AKS — or Cognilium-shared infra for SaaS-mode customers
  • Shadow traffic, then 5% / 25% / 100% ramps with statistical-significance gates on the primary metric before each step
  • On-call runbook — top 10 alerts, debug paths, model fallback plan, escalation contact — handed to the customer team in week 1
  • Eval suite wired into the customer's CI so any prompt or model change has to clear the regression gate before merging
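
A regression gate in CI can be as small as a script that turns golden-set results into an exit code. The sketch below assumes each eval case has already been graded to a boolean; the function names and the pass-rate floor are illustrative, not the actual gate configuration.

```python
def pass_rate(results: list[bool]) -> float:
    """Fraction of golden-set cases that passed."""
    if not results:
        raise ValueError("empty eval run")
    return sum(results) / len(results)


def ci_exit_code(results: list[bool], floor: float) -> int:
    """Return 0 to let the merge proceed, 1 to fail the CI job.

    An empty run counts as a failure so a broken harness cannot pass silently.
    """
    try:
        rate = pass_rate(results)
    except ValueError:
        print("regression gate: no eval results found")
        return 1
    print(f"regression gate: pass rate {rate:.1%}, floor {floor:.1%}")
    return 0 if rate >= floor else 1


# Synthetic run: 95 of 100 golden cases pass.
run = [True] * 95 + [False] * 5
print(ci_exit_code(run, floor=0.90))
```

In a pipeline the script would end with `sys.exit(ci_exit_code(results, floor))`, so a drop below the floor blocks the merge without anyone having to read a dashboard.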

Verified outcomes — nothing rounded up

The four numbers we'll defend on a reference call. Everything else on this page is an engineering choice we made to keep them true.

50+

Projects delivered

Production AI systems shipped across legal, financial services, healthcare, sales, and HR — each through the same six-stage delivery pipeline.

4

Production AI products

Paralegent AI (legal multi-agent), ProspectVox (voice sales agent on Ultravox), VectorHire (recruiting RAG), VORTA (agentic workspace search).

96%

Client satisfaction

Measured at engagement close across the portfolio. The remaining 4% is the budget we hold for what we'd redo in hindsight.

Founded 2019

Clients in US, UAE & Pakistan

Multi-region delivery teams, multi-cloud deployment defaults — the same engineers ship in customer VPCs on AWS, GCP, and Azure.

The stack in production — four case studies

Each one names the layers from the stack matrix above. Each one ships through the six-stage flow. Customer names anonymised where contracts require it.

K-12 EdTech — AI writing co-pilot (RAG)

22 hours/week saved on teacher lesson preparation

Stack
Hybrid retrieval (Qdrant + BM25) · 1.37M-char methodology corpus · Claude Sonnet for generation · LearnWorlds LMS embed · OpenTelemetry + Langfuse
  • Methodology-grounded lesson generation — answers stay faithful to the publisher's 188 catalogued mini-lessons + 50+ named teaching strategies
  • Hybrid BM25 + dense embeddings (text-embedding-3-large) — dense catches semantic matches, sparse catches publisher-specific terminology that pure vectors miss
  • Three escape hatches keep the co-pilot in scope: domain filter, must-cite-source guard, confidence floor — out-of-scope queries get a polite redirect
  • Drift gate via PSI on embedding distribution — alerts trigger re-eval before quality erodes
Multi-family office — multi-agent operations layer

Supervisor + 7 worker agents for diligence and portfolio operations

Stack
LangGraph parallel committee · Neo4j GraphRAG over relationship + holdings graph · Bedrock Claude · Document AI for statements · Snowflake feature store
  • Three-agent committee (analyst + risk + auditor) with vote-merge — disagreement on a diligence question routes to a human partner with full trace
  • GraphRAG over the family entity graph (trusts, holdings, counterparties) — relationship queries that flat-vector retrieval mishandles
  • Document AI extraction on bank and brokerage statements feeds the same graph, normalised via dbt — refresh latency under 24 hours
  • Audit log retained 7 years, OpenTelemetry trace per decision, reproducible re-run of any historical query against today's data
ProspectVox — outbound voice sales agent (Cognilium product)

Sub-second voice latency on Ultravox with CRM-aware tool-use

Stack
Ultravox speech model · LangGraph dialogue policy · Twilio carrier layer · Salesforce + HubSpot tool adapters · Helicone budget metering
  • Single-model speech-in / speech-out via Ultravox — avoids the STT → LLM → TTS stack latency that kills natural conversation
  • Dialogue policy as a LangGraph state machine — explicit handoffs to a human SDR on intent-to-buy or objection patterns above threshold
  • Real-time CRM tool-use mid-call: lookup contact, log activity, propose meeting slot, all within the conversation turn
  • Per-call cost ceiling enforced — calls that exceed budget fail closed to a polite handoff, never to a runaway loop
VectorHire — RAG-native recruiting platform (Cognilium product)

Resume + JD semantic match at portfolio scale, with explainable rerank

Stack
Qdrant + BM25 hybrid · bge-large embeddings (in-VPC) · Cohere Rerank 3 · Postgres + pgvector for ATS data · MLflow registry
  • Hybrid retrieval — BM25 on hard skills + dense vectors on experience narratives — fused via RRF and reranked for top-K precision
  • Explainable match — every candidate-to-JD score decomposes into highlighted spans and named skill matches, surfaced in the UI
  • Embeddings refresh on resume edit via Debezium CDC — no stale candidate vectors, no batch-job lag
  • RAGAS-style eval suite gating any change to the embedding model or rerank prompt

Engineering targets, not marketing outcomes

Latency budgets we put in the spec. Cost discipline we enforce in CI. Drift gates that trigger re-eval automatically. Compliance commitments that come with the deployment.

Latency budgets

p99 targets we sign up for in the spec

Triton single-model inference: p99 < 80ms

bf16 + KV-cache reuse on a co-located node; per-token latency measured with Triton's perf_analyzer

RAG end-to-end (retrieve + rerank + generate): p99 < 800ms

Hybrid retrieval + Cohere rerank + Sonnet streaming first token

Multi-agent committee (3 workers + merge): p99 < 2.5s

Parallel worker execution, supervisor vote-merge with early-exit on consensus

Voice agent first-token (Ultravox): < 600ms

Single-model speech-in / speech-out path — avoids stacked STT → LLM → TTS latency

Cost discipline

What we engineer to, before we sign the order

Cost-per-call ceiling: Written into spec

Projected at 10x volume and worst-case persona — if overrun, route mix and prompt are rewritten before launch

Token-spend metering: Per-tenant, per-route

Helicone or Langfuse from Day 1 of staging — budget alerts fire before the credit card does

Routing mix: ~95% to Haiku/Flash/4o-mini

Frontier models (Opus / GPT-4o / Gemini Pro) reserved for the hard 5%; mix logged and reviewable

Caching: Prompt + semantic + KV

Anthropic prompt caching for stable system prompts; semantic cache for repeated user queries; KV reuse on Triton

Drift & quality gates

How we keep production from rotting

Embedding-distribution drift: PSI / KS-test

Production traffic compared to eval set; alert when divergence > 1σ within 7 days

Retraining / re-eval trigger: < 7-day lag

Drift detection → re-eval pipeline runs automatically, blocks promotion if RAGAS scores drop > 2 points

Regression gate in CI: Pass-rate floor

Any prompt or model change has to clear the golden-set regression suite before merge

Red-team suite: Pre-launch + nightly

Jailbreak, prompt-injection, indirect-via-retrieval, scope-creep prompts re-run on every release branch

Reliability & compliance

What we sign up to in the MSA

Uptime: Customer-defined SLO

Targets set against the customer's downstream criticality. We don't quote generic SLAs; we engineer to the one we sign

Customer-VPC deployment: 1 week

Terraform module + Helm chart targeting EKS / GKE / AKS — same artifact, different variable file

Compliance posture: SOC 2 Type II + ISO 27001 aligned

BAA (HIPAA) and DPA (GDPR) ready via Bedrock / Azure OpenAI; air-gap variant for regulated industries

Audit trail: 7-year retention configurable

Every LLM call logged with prompt, retrieval context, tool calls, response, and tenant — PII-redacted via Presidio
