Automated Alert Intelligence
Auto Queen is the automated alert enrichment service. Prometheus alerts trigger AI-powered analysis via Step Functions, delivering enriched context to Slack — and engineers can continue the conversation in-thread.
How Auto Queen Works
From AlertManager → SNS → Step Functions → AI Analysis → Slack thread.
Pipeline Steps
AWS Step Functions orchestrate the entire flow.
Alert → SNS
Prometheus AlertManager fires alert to SNS topic — the entry point into Auto Queen.
Slack ACK
Step Functions immediately posts an acknowledgement to Slack so engineers know it's being handled.
Gather Context
Extract alert labels, annotations, and metadata. Query related metrics and logs for context.
AI Analysis
AI analyzes the context and may query RedQueen for deeper insights on metrics, logs, or Kubernetes state.
Thread Reply
Post enriched analysis as a thread reply with actionable insights and recommendations.
Agent Follow-up
Engineers can continue the conversation in-thread, asking RedQueen for more details or different angles.
Benefits
Why automated alert enrichment matters.
Instant Context
On-call engineers see AI analysis alongside the alert. No manual investigation needed.
Reduced MTTR
Faster incident resolution with pre-analyzed logs, metrics, and historical correlation.
24/7 First Responder
Every alert gets the same thorough analysis, regardless of time of day or team availability.
Continuous Conversation
Engineers can ask follow-up questions in-thread. Auto Queen keeps the context and digs deeper.
What Gets Analyzed
Auto Queen extracts context and queries RedQueen for deeper insights.
Alert Metadata
- Alert name and severity
- Labels (namespace, pod, service)
- Annotations and runbook URLs
- Firing duration
Prometheus Queries
- Related metrics for context
- Historical trends
- Threshold comparisons
- Resource utilization
OpenSearch Logs
- Error logs from affected pods
- Recent deployments
- WAF blocks if relevant
- Correlated events
Example: High Response Time Alert
See how Auto Queen handles a typical performance alert.
ResponseTime90thPercentile > 0.9s for frontend-api in prod namespace
🚨 HIGH Response Time | frontend-api | 90th percentile at 1.2s ⏳ Analyzing...
Querying RedQueen for request rates, error rates, and pod status... Searching logs for recent errors and deployments...
📊 Analysis: Response time spike correlates with 2x traffic increase. CPU at 85%. Deployment 15 min ago introduced new DB query. Recommend: check new query performance or scale replicas.
👤 "Show me the slow DB queries" 🤖 "Found 3 queries over 500ms in the last hour..."