Your agent gave a wrong answer. A user complained. How do you debug it? Without observability, you're guessing. With proper tracing, you can see exactly which tool failed, which context was missing, and where the reasoning broke down—in under 5 minutes. Here's how to set it up.
What is Agent Observability?
Agent observability is the ability to understand what your AI agent did, why it did it, and how long it took—for every request. It includes logging (what happened), metrics (how often and how fast), and tracing (the path through your system). For agents, this extends to tool calls, memory retrieval, and model reasoning.
Why Agent Observability is Different
Traditional API monitoring tracks request/response. Agent monitoring must track:
| Traditional API | AI Agent |
|---|---|
| Request received | Request received |
| Business logic executed | Prompt constructed |
| Response returned | Model called (latency varies) |
| | Tool calls (0 to N) |
| | Memory retrieved |
| | Guardrails checked |
| | Response returned |
An agent request might involve 5+ internal operations, each with its own latency and failure modes.
The Three Pillars: Logs, Metrics, Traces
Logs: What Happened
{
  "timestamp": "2025-01-15T10:23:45Z",
  "level": "INFO",
  "agent": "FinancialAdvisor",
  "session_id": "user-123-session-456",
  "event": "TOOL_CALL",
  "tool": "calculate_budget",
  "input": {"income": 8000, "expenses": 5000},
  "output": {"savings": 3000},
  "latency_ms": 45
}
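Entries like this do not require a special library; here is a minimal sketch using Python's standard json and logging modules (the log_event helper is an illustration, not an AgentCore API):

import json
import logging
import time

logging.basicConfig(level=logging.INFO, format="%(message)s")
logger = logging.getLogger("agent")

def log_event(event, **fields):
    # One JSON object per line so CloudWatch Logs Insights can discover the fields
    record = {"timestamp": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
              "level": "INFO", "event": event}
    record.update(fields)
    logger.info(json.dumps(record))

# The tool-call entry above, emitted from code
log_event(
    "TOOL_CALL",
    agent="FinancialAdvisor",
    session_id="user-123-session-456",
    tool="calculate_budget",
    input={"income": 8000, "expenses": 5000},
    output={"savings": 3000},
    latency_ms=45,
)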
Metrics: How Often and How Fast
agent_requests_total{agent="FinancialAdvisor"} 1542
agent_latency_p99{agent="FinancialAdvisor"} 3.2s
agent_errors_total{agent="FinancialAdvisor", error_type="guardrail"} 34
token_usage_total{model="claude-3-sonnet"} 2450000
Traces: The Path Through Your System
[Trace ID: abc-123]
├── [Span] Request Received (0ms)
├── [Span] Memory Retrieved (45ms)
│ └── DynamoDB Query
├── [Span] Prompt Constructed (12ms)
├── [Span] Model Invocation (2,340ms)
│ └── Claude 3 Sonnet
├── [Span] Tool: calculate_budget (48ms)
├── [Span] Guardrail Check (23ms)
└── [Span] Response Sent (2,468ms total)
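AgentCore emits these spans for you once tracing is enabled (next section), but conceptually each span is just a timed block tagged with a trace_id. A minimal hand-rolled sketch, with illustrative field names that match the Logs Insights queries used later in this post:

import json
import logging
import time
import uuid
from contextlib import contextmanager

logging.basicConfig(level=logging.INFO, format="%(message)s")
span_logger = logging.getLogger("agent.spans")

@contextmanager
def span(trace_id, span_name):
    # Time the wrapped block and emit one span record when it finishes
    start = time.perf_counter()
    try:
        yield
    finally:
        duration_ms = round((time.perf_counter() - start) * 1000, 1)
        span_logger.info(json.dumps(
            {"trace_id": trace_id, "span_name": span_name, "duration_ms": duration_ms}
        ))

# Nesting the helper mirrors the waterfall above
trace_id = str(uuid.uuid4())
with span(trace_id, "model_invocation"):
    time.sleep(0.1)  # stand-in for the model call
with span(trace_id, "tool:calculate_budget"):
    pass  # stand-in for the tool call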
Setting Up CloudWatch Integration
Enable Observability in AgentCore
# config.yaml
observability:
  enabled: true
  log_level: INFO  # DEBUG for development
  logs:
    destination: cloudwatch
    log_group: /agentcore/financial-advisor
    retention_days: 30
  metrics:
    enabled: true
    namespace: AgentCore/FinancialAdvisor
    dimensions:
      - agent_name
      - session_id
      - tool_name
  tracing:
    enabled: true
    service: financial-advisor
    sample_rate: 1.0  # 100% in dev, reduce in prod
Deploy with Observability
agentcore deploy --observability-enabled
Verify Logs Are Flowing
aws logs tail /agentcore/financial-advisor --follow
Distributed Tracing for Multi-Agent Systems
When agents call other agents, you need distributed tracing to see the full picture.
The Problem Without Tracing
User: "Research AI trends and write a blog post"
Log Entry 1: ResearchAgent received request
Log Entry 2: WriterAgent received request
Log Entry 3: Response sent
# ❓ Which request? How are they connected? What was the timing?
The Solution: Trace Context Propagation
from bedrock_agentcore import Agent, trace_context

@agent.handler
def handle_request(request, context):
    # Trace ID propagates automatically
    trace_id = context.trace_id  # abc-123

    # When calling another agent, trace continues
    research_result = research_agent.invoke(
        message=request.message,
        trace_context=context  # Propagate trace
    )

    # All spans linked under same trace
    return process(research_result)
Viewing Traces in CloudWatch
# Query for a specific trace
aws logs filter-log-events \
--log-group-name /agentcore/financial-advisor \
--filter-pattern '{ $.trace_id = "abc-123" }' \
--query 'events[*].message'
Essential Metrics to Track
Request Metrics
# Custom metrics to emit
metrics = [
    # Volume
    {"name": "RequestCount", "unit": "Count"},
    {"name": "ActiveSessions", "unit": "Count"},
    # Latency
    {"name": "Latency", "unit": "Milliseconds"},
    {"name": "ModelLatency", "unit": "Milliseconds"},
    {"name": "ToolLatency", "unit": "Milliseconds"},
    # Errors
    {"name": "ErrorCount", "unit": "Count"},
    {"name": "GuardrailTriggerCount", "unit": "Count"},
    {"name": "TimeoutCount", "unit": "Count"},
    # Tokens
    {"name": "InputTokens", "unit": "Count"},
    {"name": "OutputTokens", "unit": "Count"},
    {"name": "TotalTokens", "unit": "Count"},
]
Publishing Custom Metrics
import boto3
from datetime import datetime

cloudwatch = boto3.client('cloudwatch')

def publish_metric(name, value, unit, dimensions):
    cloudwatch.put_metric_data(
        Namespace='AgentCore/FinancialAdvisor',
        MetricData=[{
            'MetricName': name,
            'Value': value,
            'Unit': unit,
            'Timestamp': datetime.utcnow(),
            'Dimensions': [
                {'Name': k, 'Value': v} for k, v in dimensions.items()
            ]
        }]
    )

# Example: track tool latency
publish_metric(
    name='ToolLatency',
    value=48,
    unit='Milliseconds',
    dimensions={'tool_name': 'calculate_budget', 'agent': 'FinancialAdvisor'}
)
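In practice you rarely want to call publish_metric by hand around every tool. A small decorator can time each tool call and publish its latency automatically; a sketch that builds on the publish_metric helper above (the timed_tool name is illustrative):

import functools
import time

def timed_tool(tool_name, agent_name):
    # Wrap a tool so every call publishes its own ToolLatency data point
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                return fn(*args, **kwargs)
            finally:
                elapsed_ms = (time.perf_counter() - start) * 1000
                publish_metric(
                    name='ToolLatency',
                    value=elapsed_ms,
                    unit='Milliseconds',
                    dimensions={'tool_name': tool_name, 'agent': agent_name}
                )
        return wrapper
    return decorator

@timed_tool('calculate_budget', 'FinancialAdvisor')
def calculate_budget(income, expenses):
    return {"savings": income - expenses}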
Building a Production Dashboard
Essential Dashboard Widgets
{
  "widgets": [
    {
      "title": "Request Volume",
      "type": "metric",
      "metrics": [["AgentCore/FinancialAdvisor", "RequestCount"]]
    },
    {
      "title": "P99 Latency",
      "type": "metric",
      "metrics": [["AgentCore/FinancialAdvisor", "Latency", {"stat": "p99"}]]
    },
    {
      "title": "Error Rate",
      "type": "metric",
      "metrics": [
        [{"expression": "errors/requests*100", "label": "Error %"}]
      ]
    },
    {
      "title": "Token Usage",
      "type": "metric",
      "metrics": [["AgentCore/FinancialAdvisor", "TotalTokens"]]
    },
    {
      "title": "Guardrail Triggers",
      "type": "metric",
      "metrics": [["AgentCore/FinancialAdvisor", "GuardrailTriggerCount"]]
    },
    {
      "title": "Active Sessions",
      "type": "metric",
      "metrics": [["AgentCore/FinancialAdvisor", "ActiveSessions"]]
    }
  ]
}
Create Dashboard via CLI
aws cloudwatch put-dashboard \
--dashboard-name AgentCore-Production \
--dashboard-body file://dashboard.json
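If you operate more than one agent, generating the dashboard body in code keeps the widgets consistent across them. A boto3 sketch (note that the full CloudWatch schema nests each widget's title, metrics, and stat under a properties key, which the simplified JSON above omits; the region value here is an assumption):

import json
import boto3

cloudwatch = boto3.client('cloudwatch')

def agent_widgets(agent_name):
    namespace = f'AgentCore/{agent_name}'
    # One widget per (title, metric, statistic) tuple
    return [
        {
            "type": "metric",
            "properties": {
                "title": title,
                "metrics": [[namespace, metric_name]],
                "stat": stat,
                "period": 300,
                "region": "us-east-1"
            }
        }
        for title, metric_name, stat in [
            ("Request Volume", "RequestCount", "Sum"),
            ("P99 Latency", "Latency", "p99"),
            ("Token Usage", "TotalTokens", "Sum")
        ]
    ]

cloudwatch.put_dashboard(
    DashboardName='AgentCore-Production',
    DashboardBody=json.dumps({"widgets": agent_widgets('FinancialAdvisor')})
)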
Alerting Strategy
Critical Alerts (Page On-Call)
# Error count spike (proxy for error rate > 5%; a true rate alarm is sketched below)
aws cloudwatch put-metric-alarm \
--alarm-name "AgentCore-HighErrorRate" \
--metric-name ErrorCount \
--namespace AgentCore/FinancialAdvisor \
--statistic Sum \
--period 300 \
--threshold 10 \
--comparison-operator GreaterThanThreshold \
--evaluation-periods 2 \
--alarm-actions arn:aws:sns:us-east-1:123456789:pagerduty
# P99 latency > 10 seconds
aws cloudwatch put-metric-alarm \
--alarm-name "AgentCore-HighLatency" \
--metric-name Latency \
--namespace AgentCore/FinancialAdvisor \
--extended-statistic p99 \
--period 300 \
--threshold 10000 \
--comparison-operator GreaterThanThreshold \
--evaluation-periods 2 \
--alarm-actions arn:aws:sns:us-east-1:123456789:pagerduty
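The ErrorCount alarm above fires on an absolute count within the window. If you want to alarm on the 5% error-rate target directly, CloudWatch metric math can compute the percentage; a boto3 sketch, assuming the ErrorCount and RequestCount metrics defined earlier (the SNS ARN is the same placeholder as above):

import boto3

cloudwatch = boto3.client('cloudwatch')

cloudwatch.put_metric_alarm(
    AlarmName='AgentCore-HighErrorRatePercent',
    # e1 = errors / requests * 100, evaluated over 5-minute windows
    Metrics=[
        {'Id': 'e1', 'Expression': 'errors / requests * 100',
         'Label': 'Error %', 'ReturnData': True},
        {'Id': 'errors', 'ReturnData': False, 'MetricStat': {
            'Metric': {'Namespace': 'AgentCore/FinancialAdvisor', 'MetricName': 'ErrorCount'},
            'Period': 300, 'Stat': 'Sum'}},
        {'Id': 'requests', 'ReturnData': False, 'MetricStat': {
            'Metric': {'Namespace': 'AgentCore/FinancialAdvisor', 'MetricName': 'RequestCount'},
            'Period': 300, 'Stat': 'Sum'}},
    ],
    Threshold=5,
    ComparisonOperator='GreaterThanThreshold',
    EvaluationPeriods=2,
    AlarmActions=['arn:aws:sns:us-east-1:123456789:pagerduty']
)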
Warning Alerts (Slack Notification)
# Guardrail trigger rate increasing
aws cloudwatch put-metric-alarm \
--alarm-name "AgentCore-GuardrailSpike" \
--metric-name GuardrailTriggerCount \
--namespace AgentCore/FinancialAdvisor \
--statistic Sum \
--period 3600 \
--threshold 50 \
--comparison-operator GreaterThanThreshold \
--evaluation-periods 1 \
--alarm-actions arn:aws:sns:us-east-1:123456789:slack-alerts
Alert Escalation Matrix
| Metric | Warning | Critical |
|---|---|---|
| Error Rate | > 2% | > 5% |
| P99 Latency | > 5s | > 10s |
| Guardrail Triggers | > 50/hour | > 100/hour |
| Token Burn Rate | > 2x normal | > 5x normal |
Cost Tracking and Attribution
Track Cost Per Request
# Pricing (approximate)
COST_PER_1K_INPUT_TOKENS = 0.003   # Claude 3 Sonnet
COST_PER_1K_OUTPUT_TOKENS = 0.015

def calculate_cost(input_tokens, output_tokens):
    input_cost = (input_tokens / 1000) * COST_PER_1K_INPUT_TOKENS
    output_cost = (output_tokens / 1000) * COST_PER_1K_OUTPUT_TOKENS
    return input_cost + output_cost

# Log cost per request
cost = calculate_cost(input_tokens=450, output_tokens=380)  # = $0.00705 for this example
logger.info(f"Request cost: ${cost:.4f}")
publish_metric('RequestCost', cost, 'None', {'agent': 'FinancialAdvisor'})
Cost Attribution by User/Tenant
# Track costs by tenant for billing
publish_metric(
    name='TenantTokens',
    value=total_tokens,
    unit='Count',
    dimensions={
        'tenant_id': request.tenant_id,
        'agent': 'FinancialAdvisor'
    }
)
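Metrics give you the aggregate spend, but the per-tenant report below is driven by log fields, so it helps to write the same data as a structured log line. A short sketch reusing the illustrative log_event helper from earlier in this post:

# Also write the cost as a structured log line so Logs Insights can query it
log_event(
    "REQUEST_COST",
    agent="FinancialAdvisor",
    tenant_id=request.tenant_id,
    tokens=total_tokens,
    cost=round(cost, 6)
)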
Monthly Cost Report Query
-- CloudWatch Logs Insights
fields @timestamp, tenant_id, tokens, cost
| filter agent = "FinancialAdvisor"
| stats sum(cost) as total_cost by tenant_id
| sort total_cost desc
| limit 20
Common Debugging Patterns
Pattern 1: "Why did the agent give a wrong answer?"
-- Find the trace for a specific session
fields @timestamp, @message
| filter session_id = "user-123-session-456"
| sort @timestamp asc
| limit 100
Look for:
- Memory retrieved (was context correct?)
- Prompt constructed (was information included?)
- Tool calls (did they return expected results?)
Pattern 2: "Why is latency high?"
-- Find slowest components
fields @timestamp, span_name, duration_ms
| filter trace_id = "abc-123"
| sort duration_ms desc
| limit 10
Common culprits:
- Model cold start (first request after idle)
- Large context windows (too many tokens)
- External tool calls (API latency)
Pattern 3: "Why did a guardrail trigger?"
-- Find guardrail events
fields @timestamp, input_text, guardrail_result, reason
| filter event = "GUARDRAIL_TRIGGERED"
| sort @timestamp desc
| limit 50
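The same queries can also be run programmatically, which is useful for building internal debugging tools; a boto3 sketch against the CloudWatch Logs Insights API (the log group name matches the config from earlier):

import time
import boto3

logs = boto3.client('logs')

def run_insights_query(query, hours=1):
    # Start a Logs Insights query and poll until results are ready
    end = int(time.time())
    start = end - hours * 3600
    query_id = logs.start_query(
        logGroupName='/agentcore/financial-advisor',
        startTime=start,
        endTime=end,
        queryString=query,
    )['queryId']
    while True:
        response = logs.get_query_results(queryId=query_id)
        if response['status'] in ('Complete', 'Failed', 'Cancelled', 'Timeout'):
            return response['results']
        time.sleep(1)

# Example: Pattern 1, pulled into code
results = run_insights_query(
    'fields @timestamp, @message '
    '| filter session_id = "user-123-session-456" '
    '| sort @timestamp asc | limit 100'
)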
Production Observability Checklist
Logging
- All agent events logged (request, response, tools, errors)
- Session ID included in every log
- Trace ID propagated across agents
- Log retention configured (30+ days)
- Sensitive data redacted
Metrics
- Request volume tracked
- Latency percentiles (p50, p95, p99)
- Error counts by type
- Token usage tracked
- Cost per request calculated
Tracing
- Distributed tracing enabled
- All spans named meaningfully
- Tool calls traced
- Memory operations traced
Alerting
- Critical alerts page on-call
- Warning alerts to Slack
- Escalation matrix documented
- Alert fatigue reviewed monthly
Dashboards
- Production dashboard exists
- Key metrics visible at a glance
- Team has access
Next Steps
- Getting Started with AgentCore → Set up observability from the start
- Multi-Agent Orchestration → Trace across multiple agents
- AgentCore Memory Layer → Debug memory issues
- AgentCore vs ADK → Compare observability capabilities
Need help with production agent monitoring?
At Cognilium, we run agents with 99.9% uptime and 5-minute MTTR. Let's discuss your observability needs →
Muhammad Mudassir
Founder & CEO, Cognilium AI
Mudassir Marwat is the Founder & CEO of Cognilium AI, where he leads the design and deployment of pr...
