Your agent gave a wrong answer. A user complained. How do you debug it? Without observability, you're guessing. With proper tracing, you can see exactly which tool failed, which context was missing, and where the reasoning broke down—in under 5 minutes. Here's how to set it up.
What is Agent Observability?
Agent observability is the ability to understand what your AI agent did, why it did it, and how long it took—for every request. It includes logging (what happened), metrics (how often and how fast), and tracing (the path through your system). For agents, this extends to tool calls, memory retrieval, and model reasoning.
Why Agent Observability is Different
Traditional API monitoring tracks request/response. Agent monitoring must track:
| Traditional API | AI Agent |
|---|---|
| Request received | Request received |
| Business logic executed | Prompt constructed |
| Response returned | Model called (latency varies) |
| | Tool calls (0 to N) |
| | Memory retrieved |
| | Guardrails checked |
| | Response returned |
An agent request might involve 5+ internal operations, each with its own latency and failure modes.
The Three Pillars: Logs, Metrics, Traces
Logs: What Happened
{
  "timestamp": "2025-01-15T10:23:45Z",
  "level": "INFO",
  "agent": "FinancialAdvisor",
  "session_id": "user-123-session-456",
  "event": "TOOL_CALL",
  "tool": "calculate_budget",
  "input": {"income": 8000, "expenses": 5000},
  "output": {"savings": 3000},
  "latency_ms": 45
}
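Entries like this do not require a special library; here is a minimal sketch using Python's standard json and logging modules (the log_event helper is an illustration, not an AgentCore API):

import json
import logging
import time

logging.basicConfig(level=logging.INFO, format="%(message)s")
logger = logging.getLogger("agent")

def log_event(event, **fields):
    # One JSON object per line so CloudWatch Logs Insights can discover the fields
    record = {"timestamp": time.strftime("%Y-%m-%dT%H:%M:%SZ", time.gmtime()),
              "level": "INFO", "event": event}
    record.update(fields)
    logger.info(json.dumps(record))

# The tool-call entry above, emitted from code
log_event(
    "TOOL_CALL",
    agent="FinancialAdvisor",
    session_id="user-123-session-456",
    tool="calculate_budget",
    input={"income": 8000, "expenses": 5000},
    output={"savings": 3000},
    latency_ms=45,
)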
Metrics: How Often and How Fast
agent_requests_total{agent="FinancialAdvisor"} 1542
agent_latency_p99{agent="FinancialAdvisor"} 3.2s
agent_errors_total{agent="FinancialAdvisor", error_type="guardrail"} 34
token_usage_total{model="claude-3-sonnet"} 2450000
Traces: The Path Through Your System
[Trace ID: abc-123]
├── [Span] Request Received (0ms)
├── [Span] Memory Retrieved (45ms)
│ └── DynamoDB Query
├── [Span] Prompt Constructed (12ms)
├── [Span] Model Invocation (2,340ms)
│ └── Claude 3 Sonnet
├── [Span] Tool: calculate_budget (48ms)
├── [Span] Guardrail Check (23ms)
└── [Span] Response Sent (2,468ms total)
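AgentCore emits these spans for you once tracing is enabled (next section), but conceptually each span is just a timed block tagged with a trace_id. A minimal hand-rolled sketch, with illustrative field names that match the Logs Insights queries used later in this post:

import json
import logging
import time
import uuid
from contextlib import contextmanager

logging.basicConfig(level=logging.INFO, format="%(message)s")
span_logger = logging.getLogger("agent.spans")

@contextmanager
def span(trace_id, span_name):
    # Time the wrapped block and emit one span record when it finishes
    start = time.perf_counter()
    try:
        yield
    finally:
        duration_ms = round((time.perf_counter() - start) * 1000, 1)
        span_logger.info(json.dumps(
            {"trace_id": trace_id, "span_name": span_name, "duration_ms": duration_ms}
        ))

# Nesting the helper mirrors the waterfall above
trace_id = str(uuid.uuid4())
with span(trace_id, "model_invocation"):
    time.sleep(0.1)  # stand-in for the model call
with span(trace_id, "tool:calculate_budget"):
    pass  # stand-in for the tool call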
Setting Up CloudWatch Integration
Enable Observability in AgentCore
# config.yaml
observability:
  enabled: true
  log_level: INFO  # DEBUG for development
  logs:
    destination: cloudwatch
    log_group: /agentcore/financial-advisor
    retention_days: 30
  metrics:
    enabled: true
    namespace: AgentCore/FinancialAdvisor
    dimensions:
      - agent_name
      - session_id
      - tool_name
  tracing:
    enabled: true
    service: financial-advisor
    sample_rate: 1.0  # 100% in dev, reduce in prod
Deploy with Observability
agentcore deploy --observability-enabled
Verify Logs Are Flowing
aws logs tail /agentcore/financial-advisor --follow
Distributed Tracing for Multi-Agent Systems
When agents call other agents, you need distributed tracing to see the full picture.
The Problem Without Tracing
User: "Research AI trends and write a blog post"
Log Entry 1: ResearchAgent received request
Log Entry 2: WriterAgent received request
Log Entry 3: Response sent
# ❓ Which request? How are they connected? What was the timing?
The Solution: Trace Context Propagation
from bedrock_agentcore import Agent, trace_context

@agent.handler
def handle_request(request, context):
    # Trace ID propagates automatically
    trace_id = context.trace_id  # abc-123

    # When calling another agent, trace continues
    research_result = research_agent.invoke(
        message=request.message,
        trace_context=context  # Propagate trace
    )

    # All spans linked under same trace
    return process(research_result)
Viewing Traces in CloudWatch
# Query for a specific trace
aws logs filter-log-events \
--log-group-name /agentcore/financial-advisor \
--filter-pattern '{ $.trace_id = "abc-123" }' \
--query 'events[*].message'
Essential Metrics to Track
Request Metrics
# Custom metrics to emit
metrics = [
    # Volume
    {"name": "RequestCount", "unit": "Count"},
    {"name": "ActiveSessions", "unit": "Count"},
    # Latency
    {"name": "Latency", "unit": "Milliseconds"},
    {"name": "ModelLatency", "unit": "Milliseconds"},
    {"name": "ToolLatency", "unit": "Milliseconds"},
    # Errors
    {"name": "ErrorCount", "unit": "Count"},
    {"name": "GuardrailTriggerCount", "unit": "Count"},
    {"name": "TimeoutCount", "unit": "Count"},
    # Tokens
    {"name": "InputTokens", "unit": "Count"},
    {"name": "OutputTokens", "unit": "Count"},
    {"name": "TotalTokens", "unit": "Count"},
]
Publishing Custom Metrics
import boto3
from datetime import datetime

cloudwatch = boto3.client('cloudwatch')

def publish_metric(name, value, unit, dimensions):
    cloudwatch.put_metric_data(
        Namespace='AgentCore/FinancialAdvisor',
        MetricData=[{
            'MetricName': name,
            'Value': value,
            'Unit': unit,
            'Timestamp': datetime.utcnow(),
            'Dimensions': [
                {'Name': k, 'Value': v} for k, v in dimensions.items()
            ]
        }]
    )

# Example: track tool latency
publish_metric(
    name='ToolLatency',
    value=48,
    unit='Milliseconds',
    dimensions={'tool_name': 'calculate_budget', 'agent': 'FinancialAdvisor'}
)
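In practice you rarely want to call publish_metric by hand around every tool. A small decorator can time each tool call and publish its latency automatically; a sketch that builds on the publish_metric helper above (the timed_tool name is illustrative):

import functools
import time

def timed_tool(tool_name, agent_name):
    # Wrap a tool so every call publishes its own ToolLatency data point
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            start = time.perf_counter()
            try:
                return fn(*args, **kwargs)
            finally:
                elapsed_ms = (time.perf_counter() - start) * 1000
                publish_metric(
                    name='ToolLatency',
                    value=elapsed_ms,
                    unit='Milliseconds',
                    dimensions={'tool_name': tool_name, 'agent': agent_name}
                )
        return wrapper
    return decorator

@timed_tool('calculate_budget', 'FinancialAdvisor')
def calculate_budget(income, expenses):
    return {"savings": income - expenses}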
Building a Production Dashboard
Essential Dashboard Widgets
{
  "widgets": [
    {
      "title": "Request Volume",
      "type": "metric",
      "metrics": [["AgentCore/FinancialAdvisor", "RequestCount"]]
    },
    {
      "title": "P99 Latency",
      "type": "metric",
      "metrics": [["AgentCore/FinancialAdvisor", "Latency", {"stat": "p99"}]]
    },
    {
      "title": "Error Rate",
      "type": "metric",
      "metrics": [
        [{"expression": "errors/requests*100", "label": "Error %"}]
      ]
    },
    {
      "title": "Token Usage",
      "type": "metric",
      "metrics": [["AgentCore/FinancialAdvisor", "TotalTokens"]]
    },
    {
      "title": "Guardrail Triggers",
      "type": "metric",
      "metrics": [["AgentCore/FinancialAdvisor", "GuardrailTriggerCount"]]
    },
    {
      "title": "Active Sessions",
      "type": "metric",
      "metrics": [["AgentCore/FinancialAdvisor", "ActiveSessions"]]
    }
  ]
}
Create Dashboard via CLI
aws cloudwatch put-dashboard \
--dashboard-name AgentCore-Production \
--dashboard-body file://dashboard.json
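If you operate more than one agent, generating the dashboard body in code keeps the widgets consistent across them. A boto3 sketch (note that the full CloudWatch schema nests each widget's title, metrics, and stat under a properties key, which the simplified JSON above omits; the region value here is an assumption):

import json
import boto3

cloudwatch = boto3.client('cloudwatch')

def agent_widgets(agent_name):
    namespace = f'AgentCore/{agent_name}'
    # One widget per (title, metric, statistic) tuple
    return [
        {
            "type": "metric",
            "properties": {
                "title": title,
                "metrics": [[namespace, metric_name]],
                "stat": stat,
                "period": 300,
                "region": "us-east-1"
            }
        }
        for title, metric_name, stat in [
            ("Request Volume", "RequestCount", "Sum"),
            ("P99 Latency", "Latency", "p99"),
            ("Token Usage", "TotalTokens", "Sum")
        ]
    ]

cloudwatch.put_dashboard(
    DashboardName='AgentCore-Production',
    DashboardBody=json.dumps({"widgets": agent_widgets('FinancialAdvisor')})
)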
Alerting Strategy
Critical Alerts (Page On-Call)
# Error count spike (proxy for error rate > 5%; a true rate alarm is sketched below)
aws cloudwatch put-metric-alarm \
--alarm-name "AgentCore-HighErrorRate" \
--metric-name ErrorCount \
--namespace AgentCore/FinancialAdvisor \
--statistic Sum \
--period 300 \
--threshold 10 \
--comparison-operator GreaterThanThreshold \
--evaluation-periods 2 \
--alarm-actions arn:aws:sns:us-east-1:123456789:pagerduty
# P99 latency > 10 seconds
aws cloudwatch put-metric-alarm \
--alarm-name "AgentCore-HighLatency" \
--metric-name Latency \
--namespace AgentCore/FinancialAdvisor \
--extended-statistic p99 \
--period 300 \
--threshold 10000 \
--comparison-operator GreaterThanThreshold \
--evaluation-periods 2 \
--alarm-actions arn:aws:sns:us-east-1:123456789:pagerduty
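The ErrorCount alarm above fires on an absolute count within the window. If you want to alarm on the 5% error-rate target directly, CloudWatch metric math can compute the percentage; a boto3 sketch, assuming the ErrorCount and RequestCount metrics defined earlier (the SNS ARN is the same placeholder as above):

import boto3

cloudwatch = boto3.client('cloudwatch')

cloudwatch.put_metric_alarm(
    AlarmName='AgentCore-HighErrorRatePercent',
    # e1 = errors / requests * 100, evaluated over 5-minute windows
    Metrics=[
        {'Id': 'e1', 'Expression': 'errors / requests * 100',
         'Label': 'Error %', 'ReturnData': True},
        {'Id': 'errors', 'ReturnData': False, 'MetricStat': {
            'Metric': {'Namespace': 'AgentCore/FinancialAdvisor', 'MetricName': 'ErrorCount'},
            'Period': 300, 'Stat': 'Sum'}},
        {'Id': 'requests', 'ReturnData': False, 'MetricStat': {
            'Metric': {'Namespace': 'AgentCore/FinancialAdvisor', 'MetricName': 'RequestCount'},
            'Period': 300, 'Stat': 'Sum'}},
    ],
    Threshold=5,
    ComparisonOperator='GreaterThanThreshold',
    EvaluationPeriods=2,
    AlarmActions=['arn:aws:sns:us-east-1:123456789:pagerduty']
)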
Warning Alerts (Slack Notification)
# Guardrail trigger rate increasing
aws cloudwatch put-metric-alarm \
--alarm-name "AgentCore-GuardrailSpike" \
--metric-name GuardrailTriggerCount \
--namespace AgentCore/FinancialAdvisor \
--statistic Sum \
--period 3600 \
--threshold 50 \
--comparison-operator GreaterThanThreshold \
--evaluation-periods 1 \
--alarm-actions arn:aws:sns:us-east-1:123456789:slack-alerts
Alert Escalation Matrix
| Metric | Warning | Critical |
|---|---|---|
| Error Rate | > 2% | > 5% |
| P99 Latency | > 5s | > 10s |
| Guardrail Triggers | > 50/hour | > 100/hour |
| Token Burn Rate | > 2x normal | > 5x normal |
Cost Tracking and Attribution
Track Cost Per Request
# Pricing (approximate)
COST_PER_1K_INPUT_TOKENS = 0.003   # Claude 3 Sonnet
COST_PER_1K_OUTPUT_TOKENS = 0.015

def calculate_cost(input_tokens, output_tokens):
    input_cost = (input_tokens / 1000) * COST_PER_1K_INPUT_TOKENS
    output_cost = (output_tokens / 1000) * COST_PER_1K_OUTPUT_TOKENS
    return input_cost + output_cost

# Log cost per request
cost = calculate_cost(input_tokens=450, output_tokens=380)  # = $0.00705 for this example
logger.info(f"Request cost: ${cost:.4f}")
publish_metric('RequestCost', cost, 'None', {'agent': 'FinancialAdvisor'})
Cost Attribution by User/Tenant
# Track costs by tenant for billing
publish_metric(
    name='TenantTokens',
    value=total_tokens,
    unit='Count',
    dimensions={
        'tenant_id': request.tenant_id,
        'agent': 'FinancialAdvisor'
    }
)
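Metrics give you the aggregate spend, but the per-tenant report below is driven by log fields, so it helps to write the same data as a structured log line. A short sketch reusing the illustrative log_event helper from earlier in this post:

# Also write the cost as a structured log line so Logs Insights can query it
log_event(
    "REQUEST_COST",
    agent="FinancialAdvisor",
    tenant_id=request.tenant_id,
    tokens=total_tokens,
    cost=round(cost, 6)
)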
Monthly Cost Report Query
-- CloudWatch Logs Insights
fields @timestamp, tenant_id, tokens, cost
| filter agent = "FinancialAdvisor"
| stats sum(cost) as total_cost by tenant_id
| sort total_cost desc
| limit 20
Common Debugging Patterns
Pattern 1: "Why did the agent give a wrong answer?"
-- Find the trace for a specific session
fields @timestamp, @message
| filter session_id = "user-123-session-456"
| sort @timestamp asc
| limit 100
Look for:
- Memory retrieved (was context correct?)
- Prompt constructed (was information included?)
- Tool calls (did they return expected results?)
Pattern 2: "Why is latency high?"
-- Find slowest components
fields @timestamp, span_name, duration_ms
| filter trace_id = "abc-123"
| sort duration_ms desc
| limit 10
Common culprits:
- Model cold start (first request after idle)
- Large context windows (too many tokens)
- External tool calls (API latency)
Pattern 3: "Why did a guardrail trigger?"
-- Find guardrail events
fields @timestamp, input_text, guardrail_result, reason
| filter event = "GUARDRAIL_TRIGGERED"
| sort @timestamp desc
| limit 50
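The same queries can also be run programmatically, which is useful for building internal debugging tools; a boto3 sketch against the CloudWatch Logs Insights API (the log group name matches the config from earlier):

import time
import boto3

logs = boto3.client('logs')

def run_insights_query(query, hours=1):
    # Start a Logs Insights query and poll until results are ready
    end = int(time.time())
    start = end - hours * 3600
    query_id = logs.start_query(
        logGroupName='/agentcore/financial-advisor',
        startTime=start,
        endTime=end,
        queryString=query,
    )['queryId']
    while True:
        response = logs.get_query_results(queryId=query_id)
        if response['status'] in ('Complete', 'Failed', 'Cancelled', 'Timeout'):
            return response['results']
        time.sleep(1)

# Example: Pattern 1, pulled into code
results = run_insights_query(
    'fields @timestamp, @message '
    '| filter session_id = "user-123-session-456" '
    '| sort @timestamp asc | limit 100'
)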
Production Observability Checklist
Logging
- All agent events logged (request, response, tools, errors)
- Session ID included in every log
- Trace ID propagated across agents
- Log retention configured (30+ days)
- Sensitive data redacted
Metrics
- Request volume tracked
- Latency percentiles (p50, p95, p99)
- Error counts by type
- Token usage tracked
- Cost per request calculated
Tracing
- Distributed tracing enabled
- All spans named meaningfully
- Tool calls traced
- Memory operations traced
Alerting
- Critical alerts page on-call
- Warning alerts to Slack
- Escalation matrix documented
- Alert fatigue reviewed monthly
Dashboards
- Production dashboard exists
- Key metrics visible at a glance
- Team has access
Next Steps
- Getting Started with AgentCore → Set up observability from the start
- Multi-Agent Orchestration → Trace across multiple agents
- AgentCore Memory Layer → Debug memory issues
- AgentCore vs ADK → Compare observability capabilities
Need help with production agent monitoring?
At Cognilium, we run agents with 99.9% uptime and 5-minute MTTR. Let's discuss your observability needs →
Muhammad Mudassir
Founder & CEO, Cognilium AI
Mudassir Marwat is the Founder & CEO of Cognilium AI, where he leads the design and deployment of pr...
