Observability & Monitoring
Instrument Aixgo agents with OpenTelemetry, distributed tracing, and structured logging for production visibility.
Production AI systems need comprehensive observability. This guide shows how to instrument Aixgo agents with OpenTelemetry, distributed tracing, metrics, and structured logging.
Built-in Observability
Aixgo includes comprehensive observability features out of the box:
- OpenTelemetry - OTLP export for distributed tracing
- Langfuse Integration - LLM-specific analytics via OTLP
- Prometheus Metrics - Production monitoring and alerting
- Health Checks - Liveness and readiness probes
Enable with Configuration
# config/agents.yaml
supervisor:
  name: coordinator
  max_rounds: 10

observability:
  tracing: true
  service_name: 'aixgo-production'
  exporter: 'otlp'
  endpoint: 'localhost:4317'
  sampling_rate: 1.0 # 100% of traces

agents:
  - name: analyzer
    role: react
    model: gpt-4-turbo

That’s it. Distributed tracing is now enabled for your entire agent system.
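With that file in place, your application entrypoint stays minimal. A sketch using the same aixgo.Run call as the complete example at the end of this guide:

package main

import (
    "log"

    "github.com/aixgo-dev/aixgo"
    _ "github.com/aixgo-dev/aixgo/agents" // side-effect import, as in the full example below
)

func main() {
    // All observability settings are read from the config file.
    if err := aixgo.Run("config/agents.yaml"); err != nil {
        log.Fatal(err)
    }
}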
Distributed Tracing
How Tracing Works
Every message flow is automatically traced:
Request → Producer → Analyzer → Logger → Response
   |          |          |          |         |
 span1      span2      span3      span4     span5

Each span includes:
- Agent name
- Processing duration
- Input/output messages
- LLM API calls
- Tool executions
- Errors
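Aixgo emits these spans automatically. If you want to nest your own spans inside an agent step (for example around an expensive data transformation), the standard OpenTelemetry Go SDK works against the same trace. A minimal sketch — the tracer name and attribute key are illustrative, not Aixgo APIs:

import (
    "context"

    "go.opentelemetry.io/otel"
    "go.opentelemetry.io/otel/attribute"
)

func enrichData(ctx context.Context, records int) {
    // Start a child span; it joins the trace that ctx already carries
    // from Aixgo's automatic instrumentation.
    ctx, span := otel.Tracer("myapp").Start(ctx, "enrich_data")
    defer span.End()

    // Custom attributes appear alongside the automatic ones.
    span.SetAttributes(attribute.Int("myapp.records", records))

    _ = ctx // pass ctx to downstream calls so their spans nest under this one
}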
Viewing Traces
Aixgo exports traces to any OpenTelemetry-compatible backend:
Jaeger (Open Source):
# docker-compose.yml
services:
  jaeger:
    image: jaegertracing/all-in-one:latest
    ports:
      - '4317:4317'   # OTLP gRPC
      - '16686:16686' # Jaeger UI

# Start Jaeger
docker-compose up -d

# View traces at http://localhost:16686

Grafana Cloud:
observability:
  tracing: true
  service_name: 'aixgo-prod'
  exporter: 'otlp'
  endpoint: 'https://otlp-gateway-prod-us-central-0.grafana.net/otlp'
  headers:
    authorization: 'Bearer ${GRAFANA_API_KEY}'

Datadog:
observability:
  tracing: true
  service_name: 'aixgo-prod'
  exporter: 'datadog'
  endpoint: 'http://datadog-agent:8126'

Trace Attributes
Aixgo automatically adds contextual attributes:
{
  "trace_id": "4bf92f3577b34da6a3ce929d0e0e4736",
  "span_id": "00f067aa0ba902b7",
  "service.name": "aixgo-production",
  "agent.name": "analyzer",
  "agent.role": "react",
  "agent.model": "gpt-4-turbo",
  "message.id": "msg-123",
  "llm.provider": "openai",
  "llm.model": "gpt-4-turbo",
  "llm.tokens.input": 150,
  "llm.tokens.output": 75,
  "llm.latency_ms": 850,
  "tool.name": "query_database",
  "tool.duration_ms": 120
}

Structured Logging
Default JSON Logging
Aixgo uses structured logging with automatic trace correlation:
import "log/slog"
func main() {
// Configure JSON logging
logger := slog.New(slog.NewJSONHandler(os.Stdout, &slog.HandlerOptions{
Level: slog.LevelInfo,
}))
slog.SetDefault(logger)
slog.Info("Starting Aixgo agent",
"version", "1.0.0",
"env", os.Getenv("ENV"),
)
if err := aixgo.Run("config/agents.yaml"); err != nil {
slog.Error("Agent failed", "error", err)
os.Exit(1)
}
}Output:
{
  "time": "2025-11-16T10:30:00Z",
  "level": "INFO",
  "msg": "Starting Aixgo agent",
  "version": "1.0.0",
  "env": "production",
  "trace_id": "4bf92f3577b34da6a3ce929d0e0e4736",
  "span_id": "00f067aa0ba902b7"
}

Contextual Logging
Log with trace context automatically:
func processMessage(ctx context.Context, msg string) {
    slog.InfoContext(ctx, "Processing message",
        "message_length", len(msg),
        "agent", "analyzer",
    )
    // trace_id and span_id added automatically
}

Metrics
Built-in Metrics
Aixgo exports these metrics automatically:
Agent Metrics:
- aixgo_agent_messages_total - Total messages processed per agent
- aixgo_agent_errors_total - Total errors per agent
- aixgo_agent_duration_seconds - Processing duration histogram

LLM Metrics:

- aixgo_llm_requests_total - Total LLM API calls
- aixgo_llm_tokens_input - Input tokens consumed
- aixgo_llm_tokens_output - Output tokens generated
- aixgo_llm_latency_seconds - LLM API latency histogram

Supervisor Metrics:

- aixgo_supervisor_rounds_total - Total workflow rounds
- aixgo_supervisor_active_agents - Number of active agents
- aixgo_supervisor_message_queue_size - Pending messages
Prometheus Integration
# config/agents.yaml
observability:
  tracing: true
  metrics: true
  metrics_port: 9090 # Prometheus scrape endpoint

Prometheus configuration:

# prometheus.yml
scrape_configs:
  - job_name: 'aixgo'
    static_configs:
      - targets: ['localhost:9090']

Custom Metrics
Add application-specific metrics:
import (
    "github.com/prometheus/client_golang/prometheus"
    "github.com/prometheus/client_golang/prometheus/promauto"
)

var (
    dataProcessed = promauto.NewCounter(prometheus.CounterOpts{
        Name: "myapp_data_processed_total",
        Help: "Total data records processed",
    })

    processingDuration = promauto.NewHistogram(prometheus.HistogramOpts{
        Name:    "myapp_processing_duration_seconds",
        Help:    "Time spent processing data",
        Buckets: prometheus.DefBuckets,
    })
)

func processData(data string) {
    timer := prometheus.NewTimer(processingDuration)
    defer timer.ObserveDuration()

    // Process data
    // ...

    dataProcessed.Inc()
}
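These promauto collectors register with the Prometheus default registry. Whether Aixgo's metrics_port endpoint serves that registry is an assumption we don't verify here; if you need to expose the metrics yourself, a minimal sketch using the standard promhttp handler (the port is arbitrary):

import (
    "log"
    "net/http"

    "github.com/prometheus/client_golang/prometheus/promhttp"
)

func serveMetrics() {
    // promhttp.Handler() serves the default registry, which includes
    // every collector registered through promauto above.
    http.Handle("/metrics", promhttp.Handler())
    log.Fatal(http.ListenAndServe(":2112", nil)) // hypothetical standalone port
}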
Health Checks

Aixgo exposes liveness and readiness endpoints for container orchestration:
# config/agents.yaml
observability:
  health_port: 8080 # Health check endpoint

Endpoints:

- /healthz - Liveness probe (is the process running?)
- /readyz - Readiness probe (is the service ready to accept traffic?)
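Before wiring these into an orchestrator, you can smoke-test them directly. A minimal Go check, assuming the default health_port above:

import (
    "fmt"
    "net/http"
)

func checkHealth() error {
    for _, path := range []string{"/healthz", "/readyz"} {
        resp, err := http.Get("http://localhost:8080" + path)
        if err != nil {
            return err
        }
        resp.Body.Close()
        if resp.StatusCode != http.StatusOK {
            return fmt.Errorf("%s returned %d", path, resp.StatusCode)
        }
    }
    return nil
}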
Kubernetes example:
# deployment.yaml
livenessProbe:
  httpGet:
    path: /healthz
    port: 8080
  initialDelaySeconds: 5
  periodSeconds: 10

readinessProbe:
  httpGet:
    path: /readyz
    port: 8080
  initialDelaySeconds: 5
  periodSeconds: 5

Integration with Observability Platforms
Grafana Stack (Loki + Tempo + Mimir)
Complete observability setup:
# docker-compose.yml
services:
  # Logs
  loki:
    image: grafana/loki:latest
    ports:
      - '3100:3100'

  # Traces
  tempo:
    image: grafana/tempo:latest
    ports:
      - '4317:4317' # OTLP gRPC
      - '3200:3200' # Tempo HTTP API

  # Metrics
  prometheus:
    image: prom/prometheus:latest
    ports:
      - '9090:9090'
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml

  # Dashboards
  grafana:
    image: grafana/grafana:latest
    ports:
      - '3000:3000'
    environment:
      - GF_AUTH_ANONYMOUS_ENABLED=true

Aixgo configuration:
observability:
  tracing: true
  metrics: true
  service_name: 'aixgo-prod'
  exporter: 'otlp'
  endpoint: 'tempo:4317'
  metrics_port: 9090

Datadog
# docker-compose.yml
services:
  datadog-agent:
    image: gcr.io/datadoghq/agent:latest
    environment:
      - DD_API_KEY=${DATADOG_API_KEY}
      - DD_SITE=datadoghq.com
      - DD_LOGS_ENABLED=true
      - DD_APM_ENABLED=true
    ports:
      - '8126:8126' # APM

Aixgo configuration:
observability:
  tracing: true
  service_name: 'aixgo-prod'
  exporter: 'datadog'
  endpoint: 'datadog-agent:8126'

Langfuse (LLM-Specific)
Track LLM calls, costs, and performance:
observability:
  tracing: true
  llm_observability:
    enabled: true
    provider: 'langfuse'
    endpoint: 'https://cloud.langfuse.com'
    public_key: ${LANGFUSE_PUBLIC_KEY}
    secret_key: ${LANGFUSE_SECRET_KEY}

The Langfuse dashboard shows:
- LLM call traces
- Token usage and costs
- Model performance comparisons
- Prompt engineering analytics
Debugging Multi-Agent Workflows
Trace Visualization
View message flow across agents:
Request [trace_id: abc123]
├─ producer (50ms)
│  └─ generate_data
├─ analyzer (1.2s)
│  ├─ llm_call [gpt-4-turbo] (850ms)
│  └─ tool_call [query_db] (120ms)
└─ logger (30ms)
   └─ persist_result

Total: 1.28s

Identifying Bottlenecks
Sort spans by duration to find slow operations:
| Agent | Operation | Duration | % of Total |
|---|---|---|---|
| analyzer | llm_call | 850ms | 66% |
| analyzer | tool_call | 120ms | 9% |
| producer | generate_data | 50ms | 4% |
Insight: LLM call is the bottleneck (66% of time)
Actions:
- Use a faster model (e.g., gpt-3.5-turbo instead of gpt-4)
- Optimize prompt length
- Add caching for repeated queries (see the sketch below)
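For the caching action, a minimal in-memory sketch — illustrative only (no TTL or eviction), and not an Aixgo feature:

import "sync"

// llmCache memoizes responses for identical prompts so repeated
// queries skip the LLM call entirely.
type llmCache struct {
    mu      sync.RWMutex
    entries map[string]string
}

func newLLMCache() *llmCache {
    return &llmCache{entries: make(map[string]string)}
}

func (c *llmCache) Get(prompt string) (string, bool) {
    c.mu.RLock()
    defer c.mu.RUnlock()
    v, ok := c.entries[prompt]
    return v, ok
}

func (c *llmCache) Put(prompt, response string) {
    c.mu.Lock()
    defer c.mu.Unlock()
    c.entries[prompt] = response
}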
Alerting
Prometheus Alerts
# alerts.yml
groups:
  - name: aixgo
    rules:
      # High error rate
      - alert: HighErrorRate
        expr: rate(aixgo_agent_errors_total[5m]) > 0.1
        for: 5m
        annotations:
          summary: 'High error rate in {{ $labels.agent }}'

      # Slow LLM calls
      - alert: SlowLLMCalls
        expr: histogram_quantile(0.95, rate(aixgo_llm_latency_seconds_bucket[5m])) > 5
        for: 10m
        annotations:
          summary: '95th percentile LLM latency > 5s'

      # High token usage
      - alert: HighTokenUsage
        expr: rate(aixgo_llm_tokens_input[1h]) > 100000
        for: 1h
        annotations:
          summary: 'High token consumption (cost concern)'

Grafana Alerts
Create dashboards with alerts:
{
  "dashboard": {
    "title": "Aixgo Agent Monitoring",
    "panels": [
      {
        "title": "Agent Throughput",
        "targets": [
          {
            "expr": "rate(aixgo_agent_messages_total[5m])"
          }
        ],
        "alert": {
          "conditions": [
            {
              "evaluator": { "type": "lt", "params": [10] },
              "query": { "params": ["A", "5m", "now"] }
            }
          ]
        }
      }
    ]
  }
}

Best Practices
1. Always Enable Tracing in Production
# config/agents-prod.yaml
observability:
  tracing: true
  sampling_rate: 0.1 # 10% sampling to reduce overhead

2. Use Structured Logging
// ❌ Bad: unstructured
log.Println("Error processing message:", err)

// ✅ Good: structured
slog.Error("Message processing failed",
    "error", err,
    "agent", "analyzer",
    "message_id", msgID,
)

3. Set SLOs and Monitor Them
Define service level objectives:
- Latency: P95 < 2 seconds
- Error rate: < 1%
- Throughput: > 100 msg/s
Monitor with alerts.
4. Correlate Logs with Traces
Always log with context:
slog.InfoContext(ctx, "Processing started")
// trace_id and span_id automatically included
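If your handler setup doesn't inject trace IDs automatically, you can extract them from the context yourself with the OpenTelemetry trace package (the helper name here is ours, not Aixgo's):

import (
    "context"

    "go.opentelemetry.io/otel/trace"
)

// traceIDFrom extracts the active trace ID so it can be attached
// to log lines manually when automatic correlation is unavailable.
func traceIDFrom(ctx context.Context) string {
    sc := trace.SpanContextFromContext(ctx)
    if !sc.IsValid() {
        return ""
    }
    return sc.TraceID().String()
}

// Usage: slog.Info("Processing started", "trace_id", traceIDFrom(ctx))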
5. Track LLM Costs

Monitor token usage to control costs:
// Aixgo tracks token counts automatically via the
// aixgo_llm_tokens_input and aixgo_llm_tokens_output metrics.

// Back-of-the-envelope cost estimate:
costPerToken := 0.00001 // $0.01 per 1K tokens
totalCost := float64(totalTokens) * costPerToken // totalTokens: input + output tokens

Troubleshooting Common Issues
Missing Traces
Symptom: Traces not appearing in backend
Check:
- Endpoint configured correctly
- Network connectivity to the OTLP endpoint (see the snippet below)
- Firewall rules allow port 4317 (OTLP)
- API keys/credentials set correctly
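For the connectivity check, a plain TCP dial from Go verifies the OTLP port is reachable (it does not validate the OTLP protocol itself):

import (
    "net"
    "time"
)

func checkOTLPReachable(endpoint string) error {
    // endpoint as configured, e.g. "localhost:4317"
    conn, err := net.DialTimeout("tcp", endpoint, 3*time.Second)
    if err != nil {
        return err
    }
    return conn.Close()
}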
Debug:
observability:
  tracing: true
  debug: true # Enable verbose OTLP logging

High Overhead
Symptom: Tracing impacting performance
Solution:
observability:
  tracing: true
  sampling_rate: 0.1 # Sample 10% instead of 100%

Incomplete Spans
Symptom: Some spans missing from trace
Cause: Long-running operations timing out
Solution:
observability:
  tracing: true
  span_timeout: 60s # Increase timeout

Example: Complete Observability Setup
# config/agents-prod.yaml
supervisor:
  name: coordinator
  max_rounds: 10

observability:
  # Tracing
  tracing: true
  service_name: 'aixgo-production'
  exporter: 'otlp'
  endpoint: 'tempo:4317'
  sampling_rate: 0.2

  # Metrics
  metrics: true
  metrics_port: 9090

  # LLM-specific
  llm_observability:
    enabled: true
    provider: 'langfuse'
    endpoint: 'https://cloud.langfuse.com'
    public_key: ${LANGFUSE_PUBLIC_KEY}
    secret_key: ${LANGFUSE_SECRET_KEY}

agents:
  - name: analyzer
    role: react
    model: gpt-4-turbo
    prompt: 'Analyze the data'

Application code:
package main

import (
    "log/slog"
    "os"

    "github.com/aixgo-dev/aixgo"
    _ "github.com/aixgo-dev/aixgo/agents"
)

func main() {
    // Structured JSON logging
    logger := slog.New(slog.NewJSONHandler(os.Stdout, &slog.HandlerOptions{
        Level: slog.LevelInfo,
    }))
    slog.SetDefault(logger)

    slog.Info("Starting Aixgo agent",
        "version", "1.0.0",
        "env", "production",
    )

    // Run with observability enabled
    if err := aixgo.Run("config/agents-prod.yaml"); err != nil {
        slog.Error("Agent failed", "error", err)
        os.Exit(1)
    }
}

Result:
- ✅ Distributed traces in Tempo/Grafana
- ✅ Metrics in Prometheus
- ✅ LLM analytics in Langfuse
- ✅ Structured logs with trace correlation
Key Takeaways
- Built-in observability - OpenTelemetry included, no manual instrumentation
- Distributed tracing - Track messages across multi-agent workflows
- Structured logging - JSON logs with automatic trace correlation
- Metrics export - Agent, LLM, and supervisor metrics
- Platform integration - Works with Grafana, Datadog, Langfuse, etc.
Production AI systems are complex. Comprehensive observability is not optional—it’s essential.
Next Steps
- Production Deployment - Deploy with monitoring
- Multi-Agent Orchestration - Debug complex workflows
- Building Docker Images - Container-based deployments