What is the default orchestration framework in the Cognilium production AI stack?

LangGraph is the default. We use it because it is the only mainstream orchestrator with durable state, checkpointing to Postgres or Redis, and human-in-loop nodes — which means a long-running task can be paused, inspected, and resumed by a human reviewer. CrewAI is used when role-based collaborative personas (PM, engineer, reviewer) are the product. AWS Bedrock AgentCore is used for AWS-native customers who want managed runtime + IAM-scoped tool execution. Google ADK is used for Vertex-native deployments with built-in Cloud Trace.

Why does Cognilium default to Qdrant for vector retrieval over Pinecone or Weaviate?

Qdrant gives us hybrid search and payload filtering at 100M-vector scale, and it is cheap to self-host inside a customer VPC — which matters because more than half our enterprise contracts require data to never leave the customer perimeter. Pinecone is used for managed serverless when the customer has no ops budget. Weaviate is used when the customer wants a modular hybrid engine they can extend. Postgres pgvector is used at small scale (≤ 5M vectors) when running a second store is not worth the operational cost.

How does Cognilium evaluate retrieval and agent quality before deploy?

Every project ships with a versioned golden dataset (100-500 hand-curated examples per task, stored in DVC) and a RAGAS-based eval harness that measures faithfulness, answer relevance, context precision, and context recall. We also run an LLM-as-judge with explicit rubrics, with judge-vs-human agreement tracked and multiple judge models used to detect single-model bias. Regression suites block deploy if pass-rate drops more than 2 points or any P0 example fails.
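
A minimal sketch of how such a regression gate could be expressed, assuming eval results arrive as one record per golden example. The `ExampleResult` type, the priority labels, and the function names are hypothetical and only illustrate the two blocking rules stated above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ExampleResult:
    example_id: str
    passed: bool
    priority: str  # assumed labels such as "P0" or "P1"


def pass_rate(results):
    """Pass rate in percentage points (0-100)."""
    if not results:
        raise ValueError("no eval results")
    return 100.0 * sum(r.passed for r in results) / len(results)


def gate_blocks_deploy(baseline_rate, results, max_drop_points=2.0):
    """Return (blocked, reasons) for a candidate release."""
    reasons = []
    failed_p0 = [r.example_id for r in results
                 if r.priority == "P0" and not r.passed]
    if failed_p0:
        reasons.append(f"P0 failures: {failed_p0}")
    drop = baseline_rate - pass_rate(results)
    if drop > max_drop_points:
        reasons.append(f"pass-rate dropped {drop:.1f} points")
    return bool(reasons), reasons
```

A CI job could call `gate_blocks_deploy` with the last promoted release's pass rate and fail the build when the first element of the result is true.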

What is the production latency budget for a typical Cognilium RAG pipeline?

p99 < 80ms for single-model Triton inference (bf16 + KV-cache reuse), p99 < 800ms for end-to-end RAG (hybrid retrieval + Cohere or BGE rerank + Sonnet streaming first token), p99 < 2.5s for a 3-worker multi-agent committee with supervisor vote-merge, and < 600ms first-token for voice agents on Ultravox. These are written into the project spec as deploy gates, not stretch goals.
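
As an illustration of treating these budgets as gates, the sketch below computes a nearest-rank p99 from latency samples and reports any stage that misses its budget. The stage keys and function names are assumptions; the voice first-token target is left out because it is not stated as a percentile.

```python
import math

# Budgets in milliseconds, taken from the figures above.
BUDGETS_MS = {
    "triton_single_model": 80,
    "rag_end_to_end": 800,
    "multi_agent_committee": 2500,
}


def percentile(samples, pct):
    """Nearest-rank percentile for an integer pct between 1 and 100."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[rank - 1]


def violations(latencies_by_stage):
    """Return {stage: observed_p99} for every stage at or over budget."""
    out = {}
    for stage, samples in latencies_by_stage.items():
        p99 = percentile(samples, 99)
        if p99 >= BUDGETS_MS[stage]:  # the budget is a strict "<" target
            out[stage] = p99
    return out
```

An empty result from `violations` would mean every measured stage is inside its budget for that load-test run.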

How does Cognilium deploy into a customer VPC and how long does it take?

We deploy in approximately one week via a Terraform module plus a Helm chart that target customer EKS (AWS), GKE (GCP), or AKS (Azure) — same artifact, different variable file. For regulated industries we offer an air-gapped variant that runs open-weight LLMs on vLLM + Triton with local embeddings (bge-large, e5-mistral) and no outbound LLM API calls. BAA (HIPAA) and DPA (GDPR) commitments come built in via Bedrock or Azure OpenAI as the model provider.

How does Cognilium handle multi-cloud LLM provider portability?

Every system is built against a thin LLM / embeddings / vector abstraction so model swaps are a configuration change, not a refactor. The same orchestration graph can run against Anthropic Claude on the Anthropic API, Claude on AWS Bedrock, GPT-4o on OpenAI or Azure OpenAI, Gemini on Vertex AI, or open-weight Llama / Mistral / Qwen on vLLM behind Triton. Routing logic decides per-call based on cost, latency, and capability requirements.
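
One way to picture the abstraction is a small interface plus a provider registry keyed by configuration. This is a sketch under assumptions: `ChatModel`, `EchoModel`, `PROVIDERS`, and `build_model` are hypothetical names, and the stand-in backend echoes the prompt instead of calling a real provider.

```python
from typing import Callable, Dict, Protocol


class ChatModel(Protocol):
    def complete(self, prompt: str) -> str: ...


class EchoModel:
    """Stand-in backend so the sketch runs without network access."""

    def __init__(self, label: str) -> None:
        self.label = label

    def complete(self, prompt: str) -> str:
        return f"[{self.label}] {prompt}"


# Registry keyed by the provider name that appears in configuration.
PROVIDERS: Dict[str, Callable[[dict], ChatModel]] = {
    "bedrock": lambda cfg: EchoModel(f"bedrock:{cfg['model']}"),
    "vllm": lambda cfg: EchoModel(f"vllm:{cfg['model']}"),
}


def build_model(cfg: dict) -> ChatModel:
    try:
        factory = PROVIDERS[cfg["provider"]]
    except KeyError as exc:
        raise ValueError(f"unknown provider: {cfg.get('provider')!r}") from exc
    return factory(cfg)


def answer(model: ChatModel, question: str) -> str:
    # Application code depends only on the ChatModel interface.
    return model.complete(question)
```

Swapping `{"provider": "bedrock", ...}` for `{"provider": "vllm", ...}` changes the backend without touching `answer`, which is the property the paragraph above describes.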

How does Cognilium track and control LLM token spend in production?

Token spend is metered per-tenant, per-route, and per-model via Helicone or Langfuse from the first staging deploy, with budget alerts wired into PagerDuty or Slack. Routing pushes the easy 95% of queries to Haiku, Gemini Flash, or GPT-4o-mini, and reserves Opus, GPT-4o, and Gemini Pro for the hard 5%. Anthropic prompt caching, semantic caching, and Triton KV-cache reuse are all on the launch checklist — cost is engineered down before deploy, not after the bill arrives.
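
The metering idea can be sketched without any vendor SDK. The class below is not the Helicone or Langfuse API; it is a hypothetical in-memory meter with made-up prices, shown only to make the per-tenant, per-route, per-model keying and the budget alert concrete.

```python
from collections import defaultdict

# Hypothetical USD prices per million tokens (input, output); not real list prices.
PRICE_PER_MTOK = {"small-model": (0.25, 1.25), "frontier-model": (15.0, 75.0)}


class SpendMeter:
    def __init__(self, tenant_budgets_usd):
        self.budgets = dict(tenant_budgets_usd)
        self.spend = defaultdict(float)  # keyed by (tenant, route, model)

    def record(self, tenant, route, model, tokens_in, tokens_out):
        """Add one call's cost to the ledger and return it."""
        price_in, price_out = PRICE_PER_MTOK[model]
        cost = (tokens_in * price_in + tokens_out * price_out) / 1_000_000
        self.spend[(tenant, route, model)] += cost
        return cost

    def tenant_total(self, tenant):
        return sum(v for (t, _, _), v in self.spend.items() if t == tenant)

    def over_budget(self, alert_fraction=0.8):
        """Tenants whose spend has reached alert_fraction of their budget."""
        return sorted(t for t, b in self.budgets.items()
                      if self.tenant_total(t) >= alert_fraction * b)
```

In this sketch the list returned by `over_budget` is what would be forwarded to an alerting channel; the 80% alert fraction is an assumption.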

What observability does Cognilium instrument on production LLM calls?

Every LLM call is instrumented with OpenTelemetry GenAI semantic conventions — model name, tokens in/out, latency, cost, parent span ID — so customer SRE teams see LLM spans next to service spans in their existing Datadog, Grafana, or Jaeger. We also wire LangSmith or Langfuse for prompt-level traces and eval, Helicone for spend metering, and Microsoft Presidio plus custom regex for PII redaction on logged prompts and responses. Trace replay lets any production trace be re-run against a new prompt or model to estimate the diff before deploy.
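
The custom-regex half of the redaction step might look like the sketch below. It does not use Presidio, and the two patterns (email address and US Social Security number) are illustrative assumptions rather than a complete PII rule set.

```python
import re

# Illustrative patterns only; real deployments need locale-specific rules.
PATTERNS = {
    "EMAIL": re.compile(r"\b[\w.+-]+@[\w-]+(?:\.[\w-]+)+\b"),
    "SSN": re.compile(r"\b\d{3}-\d{2}-\d{4}\b"),
}


def redact(text):
    """Replace each match with a typed placeholder before the text is logged."""
    for label, pattern in PATTERNS.items():
        text = pattern.sub(f"<{label}>", text)
    return text
```

Typed placeholders such as `<EMAIL>` keep logged prompts readable for debugging while removing the raw value.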

How does Cognilium detect and respond to model drift in production?

We compute PSI (population stability index) and KS-test divergence on embedding distributions of prompts and responses against the eval set. When production traffic moves more than 1σ from the eval distribution within a 7-day window, an alert fires and a re-eval pipeline runs automatically. If RAGAS scores or LLM-judge pass-rate drop more than 2 points, the regression gate blocks promotion of the next release until the eval set is refreshed or the prompt and retrieval config are tuned.
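
For reference, PSI over binned fractions is the sum over bins of (actual - expected) * ln(actual / expected). The sketch below applies it to a scalar feature, assuming embeddings have already been reduced to one number per request (for example, distance to a centroid). The bin count and the smoothing constant are assumptions.

```python
import math


def psi(expected, actual, bins=10, eps=1e-6):
    """Population stability index between two samples of a scalar feature."""
    lo, hi = min(expected), max(expected)
    width = (hi - lo) / bins or 1.0  # fall back to 1.0 if all values are equal

    def fractions(values):
        counts = [0] * bins
        for v in values:
            idx = int((v - lo) / width)
            counts[min(max(idx, 0), bins - 1)] += 1  # clamp out-of-range values
        # eps keeps empty bins from producing log(0) or division by zero
        return [max(c / len(values), eps) for c in counts]

    e, a = fractions(expected), fractions(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))
```

Identical samples give a PSI of zero, and the value grows as production traffic shifts away from the eval-set bins. How a PSI value maps onto the 1σ alert threshold is not specified here and would need to be calibrated per project.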

Which compliance frameworks does the Cognilium production AI stack support?

SOC 2 Type II controls and ISO 27001-aligned operations across the engineering team. HIPAA BAA-ready via AWS Bedrock and Azure OpenAI as default LLM providers for healthcare customers. GDPR DPA-ready with region-pinned model endpoints and regional vector storage. PII redaction via Microsoft Presidio on every logged prompt and response. Audit trail retention configurable up to 7 years to satisfy SEC Rule 17a-4, FINRA 4511, and equivalent regimes.

Production reference - updated for 2026

Production AI stack: LangGraph orchestration, Qdrant retrieval, Bedrock + Vertex multi-cloud, Triton inference

The exact engineering stack behind 50+ shipped projects and our 4 production AI products - Paralegent AI, ProspectVox, VectorHire, and VORTA. Every layer named, every choice justified.

Orchestration via LangGraph (default) and CrewAI / AWS Bedrock AgentCore / Google ADK where the deployment target requires it. Retrieval over fine-tuning for most workloads. Evaluation harnesses before models. OpenTelemetry traces on every LLM call. Customer-VPC deployment in a week, not a quarter.

p99 < 80ms

Triton inference

Per-token decode budget on bf16 + KV-cache reuse; RAG end-to-end p99 under 800ms

LangGraph

Default orchestrator

Supervisor/worker graphs, durable state, checkpointing, human-in-loop nodes

Qdrant + BM25

Hybrid retrieval

Dense + sparse fusion, Cohere/BGE rerank, payload filters at 100M-vector scale

Multi-cloud

AWS / GCP / Azure

Customer-VPC and air-gap variants; BAA/DPA-ready; SOC 2 Type II + ISO 27001 paths

Four engineering principles we won't compromise on

Picked the hard way, across 50+ shipped projects. These are the trade-offs we make consistently, and the ones we'd defend in a post-mortem.

Evals before models

Golden datasets and LLM-as-judge harnesses ship before the first prompt is written.

Without a regression suite - labelled goldens, an LLM-judge with bias checks against human annotations, a frozen test set kept off the training side - every prompt change is a coin flip. We build the eval harness on Day 1 (RAGAS for retrieval, custom judges for downstream behaviour, deterministic graders for structured outputs) and treat it as the contract a model has to pass before deploy.

Multi-agent is the unit of work

Most production failures are coordination bugs, not model bugs.

Single-agent chat-loops break the moment the task needs a tool that needs a sub-task. We design topologies up front - supervisor + workers, parallel committees with vote-merge, planner + executor + critic - built on LangGraph for state and checkpointing, with explicit handoff contracts and budget caps per node. The graph is the system; the LLM is a function call inside it.
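
The parallel-committee pattern with vote-merge can be sketched in plain Python. This is not LangGraph code; workers are modelled as hypothetical callables that take a question and return a label, and a strict majority is assumed as the merge rule.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def committee_vote(workers, question):
    """Run all workers in parallel and merge their answers by strict majority.

    Returns (answer, tally). answer is None when no strict majority exists,
    which the caller can treat as a signal to escalate.
    """
    if not workers:
        raise ValueError("committee needs at least one worker")
    with ThreadPoolExecutor(max_workers=len(workers)) as pool:
        answers = list(pool.map(lambda w: w(question), workers))
    tally = Counter(answers)
    top, count = tally.most_common(1)[0]
    if count > len(workers) / 2:
        return top, dict(tally)
    return None, dict(tally)
```

Returning the full tally alongside the merged answer keeps the disagreement visible in the trace instead of hiding it behind the winning vote.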

Cloud-portable by design

We can move a customer deployment between AWS, GCP, and Azure in a week, not a quarter.

Every system is built against a thin LLM/embeddings/vector abstraction so the customer is never locked to one provider. We deploy into customer VPCs (and air-gapped variants for regulated industries) using Terraform modules and Helm charts that target Bedrock, Vertex AI, and Azure OpenAI interchangeably. BAA, DPA, and data-residency commitments come built-in, not bolted on.

Cost-aware by default

We engineer inference cost down before we deploy, not after the bill arrives.

Token spend is tracked per-route via Helicone or LangSmith from the first staging deploy, with budget alerts and per-tenant quotas. Routing logic pushes easy queries to Haiku / Gemini Flash / GPT-4o-mini, and reserves Opus / GPT-4o / Gemini Pro for the hard 5%. Prompt caching, semantic caching, and KV-cache reuse on Triton are not optimisations - they are part of the launch checklist.

Five engineering pillars, named tools, defended choices

What we build on every project, in this order. No marketing pillars - these are the architecture decisions that show up in the design doc.

Multi-agent orchestration as the default unit of work

Supervisor/worker graphs, not single-agent chat loops

Retrieval over fine-tuning, for most use cases

Hybrid search + reranking beats a custom model 80% of the time

Evaluation harnesses, not vibes

Golden datasets, LLM-as-judge, regression suites in CI

Cloud-portable by design

Customer-VPC and air-gap deployments, not just SaaS

Observability at every layer

OpenTelemetry traces, latency budgets, token spend, redaction

Multi-agent orchestration as the default unit of work

Supervisor/worker graphs, not single-agent chat loops

Single-agent loops fail under any task that fans out - they lose state, exhaust context, and have no recovery story when a tool call 500s. We build typed graphs with explicit handoff contracts, per-node budgets, and durable checkpoints so a long-running task can be paused, inspected, and resumed by a human reviewer.
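
To make pause, inspect, and resume concrete, here is a minimal sketch that does not use the LangGraph API. A plain dict stands in for the Postgres or Redis checkpoint store, and the node names and the `approved` flag are hypothetical.

```python
import json


class Paused(Exception):
    """Raised when a node needs a human decision before the run can continue."""


def run_graph(nodes, initial_state, store, run_id):
    """Run (name, fn) nodes in order, checkpointing after each completed node.

    A node returns the next state dict, or None to pause for review.
    Calling again with the same run_id resumes from the stored checkpoint.
    """
    saved = store.get(run_id)
    state = json.loads(saved) if saved is not None else {**initial_state, "done": []}
    for name, fn in nodes:
        if name in state["done"]:
            continue  # already completed before the pause
        result = fn(state)
        if result is None:
            store[run_id] = json.dumps(state)
            raise Paused(name)
        state = {**result, "done": [*state["done"], name]}
        store[run_id] = json.dumps(state)
    return state


def draft(state):
    return {**state, "draft": "v1"}


def review(state):
    # Hypothetical human-in-the-loop node: waits for an "approved" flag.
    return state if state.get("approved") else None


def publish(state):
    return {**state, "published": True}


NODES = [("draft", draft), ("review", review), ("publish", publish)]
```

A first call with an empty store raises `Paused("review")` after checkpointing the draft. Once a reviewer writes `"approved": true` into the stored state, a second call with the same `run_id` skips `draft` and finishes with `publish`.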

LangGraph (default) - durable state, checkpointing to Postgres/Redis, human-in-loop nodes, retry-with-backoff at the edge level

CrewAI for role-based collaborative agents (PM + engineer + reviewer patterns) where the value is in the persona separation

AWS Bedrock AgentCore when the customer is AWS-native and wants managed runtime + IAM-scoped tool execution

Google ADK (Agent Development Kit) for Vertex-native tool registration with built-in tracing through Cloud Trace

Topology patterns we use most: supervisor + N workers, plan-execute-critic, parallel committee with vote-merge, ReAct with bounded tool budget

The Cognilium stack - ten layers, named tools, why each

Not a logo wall. This is the actual reference matrix our engineers pick from on a Day-1 architecture call - and the one-line justification for every choice.

Orchestration

Where the graph and the agent topology live

LangGraph

Default. Durable state + checkpointing + human-in-loop nodes; the only mainstream orchestrator with a defensible recovery story.

CrewAI

Role-based collaborative agents - when persona separation (PM, engineer, reviewer) is itself the product.

AWS Bedrock AgentCore

Managed deployment target for AWS-native customers; IAM-scoped tool execution + Guardrails out of the box.

Google ADK

Vertex-native tool registration with built-in Cloud Trace, when the customer's ML budget is on GCP.

LangChain

Used sparingly for adapters and small utility chains, not as primary control flow.

LLMs (foundation models)

Routed by cost / latency / capability per call

Anthropic Claude (Opus / Sonnet / Haiku)

Default for complex tool-use, long-context summarisation, and tasks where refusal calibration matters.

OpenAI GPT-4o / 4o-mini

Cost-efficient routing for high-volume routes; multimodal vision tasks where Claude is overkill.

AWS Bedrock (Anthropic, Cohere, Mistral, Meta)

When the customer requires AWS data-residency / BAA / private-link - same models, AWS contract.

Google Gemini (Pro / Flash)

1M+ token context for whole-codebase or whole-trial-transcript ingest; Flash for low-cost summarisation.

Open-weight via Together / Groq / vLLM

Llama 3.x, Mistral, Qwen for air-gapped deployments and per-token cost floors; Groq for sub-300ms latency requirements.

Retrieval & vector

Hybrid search, reranking, graph

Qdrant

Most common in production. Hybrid search + payload filtering at 100M-vector scale; cheap to self-host in customer VPC.

Pinecone

Managed serverless when ops budget is tight; namespace-per-tenant pattern for multi-tenant SaaS.

Weaviate

When the customer wants a modular hybrid retrieval engine they can extend with custom modules.

Elastic / OpenSearch

BM25 sparse half of every hybrid setup; also the operational search backbone customers already trust.

Postgres pgvector

When vectors live next to transactional data and operating a second store isn't worth it (≤ 5M vectors).

Cohere Rerank 3 / BGE re-rankers

Cross-encoder rerank stage - Cohere for managed, BGE (bge-reranker-v2-m3) for in-VPC.

Neo4j / Memgraph

Property-graph backbone for GraphRAG when relationship traversal beats nearest-neighbour.

Voice

Real-time speech for telephony and live agents

Ultravox

Cognilium's default for low-latency speech-in / speech-out - single-model design avoids the STT→LLM→TTS stack latency.

Deepgram

Best-in-class streaming STT when we need provider separation from the LLM; word-level timestamps for analytics.

AssemblyAI

Speaker diarization, sentiment, and entity detection on call recordings - VORTA and ProspectVox lean on it.

ElevenLabs

Highest-fidelity TTS for branded voices; multilingual cloning for non-English deployments.

Twilio / LiveKit

Telephony and WebRTC orchestration - the carrier-grade layer above the speech stack.

Document AI

Extraction from PDFs, scans, and forms

Google Document AI

Best processor library for forms, invoices, and contracts - we pair it with an LLM post-pass for free-text fields.

Azure Form Recognizer (Document Intelligence)

Default for Azure-native enterprises; strong on table extraction and custom-model training UX.

AWS Textract

AWS-native baseline - Queries API and Layout outputs feed into Bedrock-hosted post-processing.

Unstructured.io

Layout-aware chunking for RAG ingestion - turns messy PDFs / DOCX / HTML into clean elements.

Inference & serving

Where open-weight models actually run

NVIDIA Triton Inference Server

p99 < 80ms decode budget on bf16 with KV-cache reuse; multi-model serving for embeddings + rerankers + small LLMs on one node.

vLLM

PagedAttention throughput for open-weight LLM serving - default behind Triton for Llama / Mistral / Qwen.

AWS SageMaker endpoints

Managed autoscaling endpoints when the customer is fully AWS and won't run their own Triton.

Vertex AI predictions

Same role on GCP - co-located with BigQuery / Vertex datasets for cheaper data egress.

Replicate / Modal

Burst inference for image, audio, and infrequent open-weight calls without standing GPU cost.

MLOps

Training, tracking, packaging, scheduling

MLflow

Experiment tracking and model registry - promoted runs become the canonical 'production' pointer.

Weights & Biases

When the team wants richer eval dashboards and sweeps; integrates cleanly with our RAGAS harness.

BentoML

Packaging models + pre/post-processing as a single deployable when SageMaker / Vertex containers aren't enough.

Argo Workflows / Kubeflow

Kubernetes-native DAGs for training pipelines and large batch inference - the default in customer EKS / GKE.

DVC

Versioning for golden eval datasets and labelled training data - diffable, reviewable in PRs.

Data plane

OLTP, OLAP, streaming, CDC, transformation

PostgreSQL + TimescaleDB

Default OLTP plus time-series for event and metric data; pgvector when vectors live here.

ClickHouse

Sub-second OLAP over billions of rows for trace and analytics queries; the backbone of our observability dashboards.

Snowflake / Databricks

Customer-owned warehouse/lakehouse - we land features and eval data here, never duplicate.

dbt

All transformations defined as version-controlled SQL with tests; the contract between data team and ML team.

Kafka / Redpanda

Event backbone for streaming ingest, agent message buses, and at-least-once delivery to downstream stores.

Debezium CDC

Low-latency change capture from OLTP into Kafka and the vector store - keeps RAG corpora fresh.

Observability

Traces, evals, cost, prompts

OpenTelemetry (GenAI conventions)

Vendor-neutral traces across LLM calls, retrieval, and tool calls - exported to whatever the customer already runs.

LangSmith

Trace + eval + prompt-management when the customer is happy with a hosted LLM-ops layer.

Langfuse

Self-hostable LangSmith equivalent for air-gapped and EU-data-residency deployments.

Datadog

Customer's existing APM - we export OTel into it so on-call sees LLM spans next to service spans.

Helicone

Per-tenant token-spend metering, caching, and budget alerts in front of OpenAI / Anthropic.

Cloud

Where the customer's VPC actually lives

AWS

Bedrock, SageMaker, Lambda, ECS / EKS, S3, RDS, OpenSearch - the most common deployment target across our 50+ projects.

GCP

Vertex AI, Cloud Run, BigQuery, GKE - where the data is already in BigQuery and Gemini commits drive routing.

Azure

Azure OpenAI Service, Azure ML, Functions, AKS, ADLS - common for healthcare and EU enterprises with M365 commits.

How a Cognilium project actually ships

Six stages, in this order, on every engagement. Evals exist before models. Topology exists before orchestrators. Drift gates exist before traffic.

Stage 0

Problem definition + eval-set design

Before a single model is loaded.

Stakeholder interviews to write a 1-page success spec: what task, what input, what output, what failure mode is unacceptable
Eval-set construction - 100-500 hand-curated examples per task, kept off the prompt-tuning side and versioned with DVC
Metrics chosen up front: faithfulness + answer relevance (RAGAS) for RAG; tool-call accuracy + format compliance for agents; LLM-judge rubric for free-text
Cost-per-call ceiling and latency p99 budget written into the spec; treated as deploy gates, not stretch goals

Stage 1

Retrieval prototype

BM25 baseline → hybrid → rerank, measured at every step.

BM25 baseline (Elastic / OpenSearch) on the customer corpus - gives the floor that any later complexity has to beat
Dense retrieval on Qdrant with embedding choice picked from a bake-off (text-embedding-3-large, voyage-3, bge-large) on the customer's eval set
Hybrid fusion (reciprocal rank fusion) + Cohere or BGE rerank - only kept if RAGAS context-precision lifts ≥ 5 points over BM25
Chunking strategy chosen for the doc structure - semantic, recursive, or layout-aware via Unstructured.io; never a default 512-token window
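
Reciprocal rank fusion from the hybrid step above can be sketched in a few lines: each document scores the sum of 1 / (k + rank) across the input rankings. The constant k = 60 is a commonly used default and is an assumption here, as is the tie-break on document id.

```python
def reciprocal_rank_fusion(rankings, k=60):
    """Fuse several ranked lists of doc ids into one list, best first."""
    scores = {}
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] = scores.get(doc_id, 0.0) + 1.0 / (k + rank)
    # Sort by descending score; ties break on doc id for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))
```

With a BM25 ranking of `["a", "b", "c"]` and a dense ranking of `["b", "d", "a"]`, the fused order is `["b", "a", "d", "c"]`: documents found by both retrievers rise above those found by only one.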

Stage 2

Agent topology design

Single-agent? Supervisor + workers? Committee?

Topology picked from the task shape, not from a default - most failures here are coordination bugs, not model bugs
Patterns we choose from: single-agent ReAct (bounded tool budget), supervisor + N workers, plan-execute-critic, parallel committee with vote-merge
LangGraph state schema written first - explicit reducers, checkpoint backend (Postgres for durability, Redis for hot state), retry policy per edge
Per-node budgets - token cap, time cap, retry cap - to bound worst-case cost on adversarial inputs
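
The per-node caps in the last item could be enforced by a small wrapper like the sketch below. The `NodeBudget` fields and the node contract (return a result and the tokens used, or `None` to request a retry) are assumptions made for illustration.

```python
import time
from dataclasses import dataclass


class BudgetExceeded(Exception):
    pass


@dataclass
class NodeBudget:
    max_tokens: int
    max_seconds: float
    max_retries: int


def run_with_budget(node, budget, clock=time.monotonic):
    """Call node() until it yields a result or one of the caps is hit.

    node() returns (result_or_None, tokens_used); None means "retry".
    The time cap is checked between attempts, not during a running call.
    """
    start = clock()
    tokens = 0
    for _ in range(budget.max_retries + 1):
        if clock() - start > budget.max_seconds:
            raise BudgetExceeded("time cap reached")
        result, used = node()
        tokens += used
        if tokens > budget.max_tokens:
            raise BudgetExceeded("token cap reached")
        if result is not None:
            return result, tokens
    raise BudgetExceeded("retry cap reached")
```

Because tokens from failed attempts count toward the cap, an adversarial input that forces repeated retries runs out of budget instead of running up cost.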

Stage 3

Production pipeline

Orchestrator + observability + cost tracking + drift gates.

Full OpenTelemetry GenAI-convention instrumentation on every LLM call, retrieval step, and tool execution
Token-spend metering via Helicone or Langfuse - per-tenant, per-route, per-model - wired to budget alerts
PII redaction (Microsoft Presidio + custom regex) on logged prompts/responses, with per-tenant retention configurable
Drift gates - PSI / KS-test on embedding distributions, alert when production traffic moves > 1σ from eval set within 7 days
Routing logic - Haiku / Flash / 4o-mini for the easy 95%, Opus / GPT-4o / Gemini Pro reserved for the hard 5%
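
The routing item can be illustrated with a two-tier router that also logs its mix. How a query's difficulty is scored is not described here, so the `difficulty` input, the 0.8 threshold, and the tier names are all hypothetical.

```python
from collections import Counter


class Router:
    def __init__(self, hard_threshold=0.8):
        self.hard_threshold = hard_threshold
        self.mix = Counter()  # tier -> number of calls routed there

    def pick_tier(self, difficulty, requires_frontier=False):
        """Send a call to the small tier unless it is hard or flagged."""
        hard = requires_frontier or difficulty >= self.hard_threshold
        tier = "frontier" if hard else "small"
        self.mix[tier] += 1
        return tier

    def small_share(self):
        """Fraction of calls routed to the small tier so far."""
        total = sum(self.mix.values())
        return self.mix["small"] / total if total else 0.0
```

Tracking `small_share` makes the intended 95/5 split something that can be reviewed from logs, not just assumed.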

Stage 4

Pre-launch validation

Red-team, safety eval, latency, cost-per-conversation.

Red-team prompt suite - jailbreak, prompt-injection, indirect-injection-via-retrieval, scope-creep - has to pass before any traffic
Safety evals against the customer's policy: refusal calibration, sensitive-topic handling, output-format compliance
Latency p50 / p95 / p99 measured under realistic mixed-traffic load with k6 or Locust, not just sunny-day single-call
Cost-per-conversation projected at 10x current volume and at the worst-case persona; if it overruns budget, route mix and prompt are rewritten
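
The cost projection in the last item is simple arithmetic, sketched below. The persona table, its numbers, and the function name are illustrative assumptions; only the 10x multiplier and the worst-case-persona rule come from the text.

```python
def worst_case_projection(personas, conversations_per_month,
                          monthly_budget_usd, volume_multiplier=10):
    """Project monthly cost at scaled volume for the costliest persona.

    personas maps name -> (turns_per_conversation, cost_per_turn_usd).
    Returns (persona_name, projected_cost_usd, overruns_budget).
    """
    worst = max(personas, key=lambda p: personas[p][0] * personas[p][1])
    turns, per_turn = personas[worst]
    cost = conversations_per_month * volume_multiplier * turns * per_turn
    return worst, cost, cost > monthly_budget_usd
```

For example, 1,000 conversations a month at 10x volume, with a worst-case persona of 20 turns at $0.01 per turn, projects to about $2,000 a month, which would overrun a $1,500 budget.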

Stage 5

Deploy + on-call + eval CI

Customer VPC or shared infra. Runbook included.

Deploy via Terraform module + Helm chart into customer EKS / GKE / AKS - or Cognilium-shared infra for SaaS-mode customers
Shadow traffic, then 5% / 25% / 100% ramps with statistical-significance gates on the primary metric before each step
On-call runbook - top 10 alerts, debug paths, model fallback plan, escalation contact - handed to the customer team in week 1
Eval suite wired into the customer's CI so any prompt or model change has to clear the regression gate before merging
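
The statistical test behind the ramp gates is not specified above. As one possible reading, the sketch below uses a one-sided two-proportion z-test on a success-rate metric and holds the ramp only when the candidate is significantly worse than control. The function names and the 0.05 significance level are assumptions.

```python
import math


def two_proportion_z(success_a, n_a, success_b, n_b):
    """z statistic for rate_b minus rate_a under a pooled-variance null."""
    p_a, p_b = success_a / n_a, success_b / n_b
    pooled = (success_a + success_b) / (n_a + n_b)
    se = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    return (p_b - p_a) / se if se else 0.0


def normal_cdf(z):
    """P(Z <= z) for a standard normal variable."""
    return 0.5 * (1 + math.erf(z / math.sqrt(2)))


def may_advance(control, candidate, alpha=0.05):
    """control and candidate are (successes, trials) tuples.

    Advance the ramp unless the candidate is significantly worse.
    """
    z = two_proportion_z(*control, *candidate)
    return normal_cdf(z) >= alpha
```

With control at 900 of 1,000 and candidate at 895 of 1,000 the gate lets the ramp proceed; a candidate at 850 of 1,000 holds it.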

Verified outcomes - nothing rounded up

The four numbers we'll defend on a reference call. Everything else on this page is an engineering choice we made to keep them true.

50+

Projects delivered

Production AI systems shipped across legal, financial services, healthcare, sales, and HR - each through the same six-stage delivery pipeline.

4

Production AI products

Paralegent AI (legal multi-agent), ProspectVox (voice sales agent on Ultravox), VectorHire (recruiting RAG), VORTA (agentic workspace search).

96%

Client satisfaction

Measured at engagement close across the portfolio. The remaining 4% is the budget we hold for what we'd redo in hindsight.

Founded 2019

Clients in US, UAE & Pakistan

Multi-region delivery teams, multi-cloud deployment defaults - the same engineers ship in customer VPCs on AWS, GCP, and Azure.

The stack in production - four case studies

Each one names the layers from the stack matrix above. Each one ships through the six-stage flow. Customer names anonymised where contracts require it.

K-12 EdTech - AI writing co-pilot (RAG)

22 hours/week saved on teacher lesson preparation

Stack

Hybrid retrieval (Qdrant + BM25) · 1.37M-char methodology corpus · Claude Sonnet for generation · LearnWorlds LMS embed · OpenTelemetry + Langfuse

Methodology-grounded lesson generation - answers stay faithful to the publisher's 188 catalogued mini-lessons + 50+ named teaching strategies
Hybrid BM25 + dense embeddings (text-embedding-3-large) - dense catches semantic matches, sparse catches publisher-specific terminology that pure vectors miss
Three escape hatches keep the co-pilot in scope: domain filter, must-cite-source guard, confidence floor - out-of-scope queries get a polite redirect
Drift gate via PSI on embedding distribution - alerts trigger re-eval before quality erodes

Multi-family office - multi-agent operations layer

Supervisor + 7 worker agents for diligence and portfolio operations

Stack

LangGraph parallel committee · Neo4j GraphRAG over relationship + holdings graph · Bedrock Claude · Document AI for statements · Snowflake feature store

Three-agent committee (analyst + risk + auditor) with vote-merge - disagreement on a diligence question routes to a human partner with full trace
GraphRAG over the family entity graph (trusts, holdings, counterparties) - relationship queries that flat-vector retrieval mishandles
Document AI extraction on bank and brokerage statements feeds the same graph, normalised via dbt - refresh latency under 24 hours
Audit log retained 7 years, OpenTelemetry trace per decision, reproducible re-run of any historical query against today's data

ProspectVox - outbound voice sales agent (Cognilium product)

Sub-second voice latency on Ultravox with CRM-aware tool-use

Stack

Ultravox speech model · LangGraph dialogue policy · Twilio carrier layer · Salesforce + HubSpot tool adapters · Helicone budget metering

Single-model speech-in / speech-out via Ultravox - avoids the STT → LLM → TTS stack latency that kills natural conversation
Dialogue policy as a LangGraph state machine - explicit handoffs to a human SDR on intent-to-buy or objection patterns above threshold
Real-time CRM tool-use mid-call: lookup contact, log activity, propose meeting slot, all within the conversation turn
Per-call cost ceiling enforced - calls that exceed budget fail closed to a polite handoff, never to a runaway loop

VectorHire - RAG-native recruiting platform (Cognilium product)

Resume + JD semantic match at portfolio scale, with explainable rerank

Stack

Qdrant + BM25 hybrid · bge-large embeddings (in-VPC) · Cohere Rerank 3 · Postgres + pgvector for ATS data · MLflow registry

Hybrid retrieval - BM25 on hard skills + dense vectors on experience narratives - fused via RRF and reranked for top-K precision
Explainable match - every candidate-to-JD score decomposes into highlighted spans and named skill matches, surfaced in the UI
Embeddings refresh on resume edit via Debezium CDC - no stale candidate vectors, no batch-job lag
RAGAS-style eval suite gating any change to the embedding model or rerank prompt

Engineering targets, not marketing outcomes

Latency budgets we put in the spec. Cost discipline we enforce in CI. Drift gates that trigger re-eval automatically. Compliance commitments that come with the deployment.

Latency budgets

p99 targets we sign up for in the spec

Triton single-model inference: p99 < 80ms

bf16 + KV-cache reuse on a co-located node; per-token latency as reported by Triton perf_analyzer

RAG end-to-end (retrieve + rerank + generate): p99 < 800ms

Hybrid retrieval + Cohere rerank + Sonnet streaming first token

Multi-agent committee (3 workers + merge): p99 < 2.5s

Parallel worker execution, supervisor vote-merge with early-exit on consensus

Voice agent first-token (Ultravox): < 600ms

Single-model speech-in / speech-out path - avoids stacked STT → LLM → TTS latency

Cost discipline

What we engineer to, before we sign the order

Cost-per-call ceiling: Written into spec

Projected at 10x volume and worst-case persona - if overrun, route mix and prompt are rewritten before launch

Token-spend metering: Per-tenant, per-route

Helicone or Langfuse from Day 1 of staging - budget alerts fire before the credit card does

Routing mix: ~95% to Haiku/Flash/4o-mini

Frontier models (Opus / GPT-4o / Gemini Pro) reserved for the hard 5%; mix logged and reviewable

Caching: Prompt + semantic + KV

Anthropic prompt caching for stable system prompts; semantic cache for repeated user queries; KV reuse on Triton

Drift & quality gates

How we keep production from rotting

Embedding-distribution drift: PSI / KS-test

Production traffic compared to eval set; alert when divergence > 1σ within 7 days

Retraining / re-eval trigger: < 7-day lag

Drift detection → re-eval pipeline runs automatically, blocks promotion if RAGAS scores drop > 2 points

Regression gate in CI: Pass-rate floor

Any prompt or model change has to clear the golden-set regression suite before merge

Red-team suite: Pre-launch + nightly

Jailbreak, prompt-injection, indirect-via-retrieval, scope-creep prompts re-run on every release branch

Reliability & compliance

What we sign up to in the MSA

Uptime: Customer-defined SLO

Targets set against the customer's downstream criticality - we don't quote generic SLAs; we engineer to the one we sign

Customer-VPC deployment: 1 week

Terraform module + Helm chart targeting EKS / GKE / AKS - same artifact, different variable file

Compliance posture: SOC 2 Type II + ISO 27001 aligned

BAA (HIPAA) and DPA (GDPR) ready via Bedrock / Azure OpenAI; air-gap variant for regulated industries

Audit trail: 7-year retention configurable

Every LLM call logged with prompt, retrieval context, tool calls, response, and tenant - PII-redacted via Presidio

Technology

Everything You Need to Know About Cognilium AI Technology

The questions engineering teams actually ask us on the first architecture call. Stack choices, latency budgets, drift gates, compliance posture.
