Auto Queen

Automated Alert Intelligence

Auto Queen is the automated alert enrichment service. Prometheus alerts trigger an AI-powered analysis orchestrated by AWS Step Functions, which posts the enriched context to Slack, where engineers can continue the conversation in-thread.

How Auto Queen Works

From AlertManager → SNS → Step Functions → AI Analysis → Slack thread.

[Diagram: AlertManager → SNS → Step Functions (1. ACK Slack, 2. Context, 3. AI Analysis, 4. Reply) → Slack Thread, with Agent Query and RedQueen nodes]

Pipeline Steps

AWS Step Functions orchestrate the entire flow.

Alert → SNS

Prometheus AlertManager publishes the firing alert to an SNS topic, the entry point into Auto Queen.
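As an illustration of this entry point, the sketch below unwraps an SNS notification into a list of alerts. It assumes two things the page does not state: that a Lambda function subscribed to the topic receives the event, and that the SNS message body is JSON shaped like AlertManager's webhook payload (a top-level `alerts` list). The function name `alerts_from_sns_event` is hypothetical.

```python
import json


def alerts_from_sns_event(event: dict) -> list[dict]:
    """Collect every alert carried by an SNS-triggered Lambda event.

    Assumption: each SNS message body is a JSON document with an
    "alerts" list, as in AlertManager's webhook payload format.
    """
    alerts: list[dict] = []
    for record in event.get("Records", []):
        # SNS delivers the published message as a string under Sns.Message.
        payload = json.loads(record["Sns"]["Message"])
        alerts.extend(payload.get("alerts", []))
    return alerts
```

If the topic is fed by a custom message template instead of JSON, the parsing step would need to change accordingly.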

Slack ACK

Step Functions immediately posts an acknowledgement to Slack so engineers know the alert is being handled.

Gather Context

Extract alert labels, annotations, and metadata. Query related metrics and logs for context.

AI Analysis

AI analyzes the context and may query RedQueen for deeper insights on metrics, logs, or Kubernetes state.

Thread Reply

Post enriched analysis as a thread reply with actionable insights and recommendations.
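The acknowledgement and the thread reply can be sketched as two Slack `chat.postMessage` payloads. The key detail is that the reply sets `thread_ts` to the `ts` value Slack returned for the acknowledgement, which is what places it in the same thread. The helper names and the message wording are illustrative assumptions, not Auto Queen's actual code.

```python
def ack_payload(channel: str, alert_name: str, service: str) -> dict:
    """Build the body for the initial acknowledgement message."""
    return {
        "channel": channel,
        "text": f"🚨 {alert_name} | {service} ⏳ Analyzing...",
    }


def thread_reply_payload(channel: str, ack_ts: str, analysis: str) -> dict:
    """Build the body for the enriched analysis, posted in the ACK's thread."""
    return {
        "channel": channel,
        # thread_ts must be the ts of the parent (acknowledgement) message.
        "thread_ts": ack_ts,
        "text": analysis,
    }
```

Each payload would be sent as JSON to the `chat.postMessage` Web API method with a bot token. The `ts` field of the first response is what gets passed as `ack_ts` to the second call.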

Agent Follow-up

Engineers can continue the conversation in-thread, asking RedQueen for more details or different angles.
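The first four steps above map naturally onto a linear Step Functions state machine. The sketch below builds such a definition in Amazon States Language, assuming each step is a Lambda-backed `Task` state. The state names and the ARNs passed in are hypothetical, and the real workflow may include retries, branching, or other state types. The in-thread follow-up is left out because the page does not say whether it runs inside the same state machine.

```python
import json

# Hypothetical state names for the four orchestrated steps.
STEPS = ["AckSlack", "GatherContext", "AIAnalysis", "ThreadReply"]


def build_state_machine(lambda_arns: dict[str, str]) -> dict:
    """Return an Amazon States Language definition chaining STEPS in order."""
    states = {}
    for index, name in enumerate(STEPS):
        state = {"Type": "Task", "Resource": lambda_arns[name]}
        if index + 1 < len(STEPS):
            state["Next"] = STEPS[index + 1]  # hand off to the following step
        else:
            state["End"] = True  # the last step terminates the execution
        states[name] = state
    return {
        "Comment": "Alert enrichment pipeline (illustrative sketch)",
        "StartAt": STEPS[0],
        "States": states,
    }


def to_asl_json(lambda_arns: dict[str, str]) -> str:
    """Serialize the definition to a JSON string."""
    return json.dumps(build_state_machine(lambda_arns), indent=2)
```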

Benefits

Why automated alert enrichment matters.

Instant Context

On-call engineers see the AI analysis alongside the alert, without a manual first pass through dashboards and logs.

Reduced MTTR

Faster incident resolution with pre-analyzed logs, metrics, and historical correlation.

24/7 First Responder

Every alert gets the same thorough analysis, regardless of time of day or team availability.

Continuous Conversation

Engineers can ask follow-up questions in-thread. Auto Queen keeps the context and digs deeper.

What Gets Analyzed

Auto Queen extracts context and queries RedQueen for deeper insights.

Alert Metadata

Alert name and severity
Labels (namespace, pod, service)
Annotations and runbook URLs
Firing duration
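A minimal sketch of how these fields might be pulled from a single alert, assuming the alert follows AlertManager's webhook shape (`labels`, `annotations`, `startsAt`). The label keys `severity` and `service` and the annotation key `runbook_url` are common conventions, not confirmed names from Auto Queen, and `summarize_alert` is a hypothetical helper.

```python
from datetime import datetime, timezone


def summarize_alert(alert: dict, now: datetime) -> dict:
    """Extract name, severity, key labels, runbook URL, and firing duration."""
    labels = alert.get("labels", {})
    annotations = alert.get("annotations", {})
    # startsAt is an RFC 3339 timestamp; this parse assumes at most
    # microsecond precision and a trailing "Z" or explicit offset.
    started = datetime.fromisoformat(alert["startsAt"].replace("Z", "+00:00"))
    return {
        "name": labels.get("alertname"),
        "severity": labels.get("severity"),
        "namespace": labels.get("namespace"),
        "pod": labels.get("pod"),
        "service": labels.get("service"),
        "runbook_url": annotations.get("runbook_url"),
        "firing_seconds": int((now - started).total_seconds()),
    }


def summarize_alert_now(alert: dict) -> dict:
    """Convenience wrapper that measures duration against the current time."""
    return summarize_alert(alert, datetime.now(timezone.utc))
```

Passing `now` explicitly keeps the duration calculation deterministic, which makes the helper easy to test.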

Prometheus Queries

Related metrics for context
Historical trends
Threshold comparisons
Resource utilization
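To make the metric lookups concrete, the sketch below builds a PromQL expression from alert labels and a request URL for Prometheus's `/api/v1/query_range` HTTP endpoint, which returns values over a time window for trend and threshold comparison. The metric `container_cpu_usage_seconds_total` is only an example of a resource-utilization series. Which metrics Auto Queen actually queries is not stated, and both helper names are hypothetical.

```python
from urllib.parse import urlencode


def cpu_usage_query(namespace: str, pod: str) -> str:
    """PromQL for a pod's CPU usage rate over 5-minute windows (example metric)."""
    return (
        "sum(rate(container_cpu_usage_seconds_total"
        f'{{namespace="{namespace}",pod="{pod}"}}[5m]))'
    )


def query_range_url(base_url: str, promql: str, start: int, end: int,
                    step: str = "60s") -> str:
    """Build a query_range URL; start and end are Unix timestamps in seconds."""
    params = urlencode({"query": promql, "start": start, "end": end, "step": step})
    return f"{base_url.rstrip('/')}/api/v1/query_range?{params}"
```

Choosing `start` an hour or a day before the alert's start time gives the historical baseline against which the current value can be compared.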

OpenSearch Logs

Error logs from affected pods
Recent deployments
WAF blocks if relevant
Correlated events
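A log lookup of this kind could be expressed as an OpenSearch query DSL body like the one below, which filters to error-level entries from one pod within a recent time window. The field names (`kubernetes.namespace_name`, `kubernetes.pod_name`, `level`, `@timestamp`) are assumptions that depend on how the log shipper maps fields, and the `term` filters assume those fields are indexed as keywords.

```python
def error_log_query(namespace: str, pod: str, minutes: int = 15,
                    size: int = 50) -> dict:
    """Build a search body for recent error logs from a single pod."""
    return {
        "size": size,
        # Newest entries first, so a truncated result still shows recent errors.
        "sort": [{"@timestamp": {"order": "desc"}}],
        "query": {
            "bool": {
                "filter": [
                    {"term": {"kubernetes.namespace_name": namespace}},
                    {"term": {"kubernetes.pod_name": pod}},
                    {"term": {"level": "error"}},
                    # Date math: only entries from the last `minutes` minutes.
                    {"range": {"@timestamp": {"gte": f"now-{minutes}m"}}},
                ]
            }
        },
    }
```

The body would be sent to the `_search` endpoint of the relevant index. Using `filter` clauses rather than `must` skips relevance scoring, which is not needed for this kind of lookup.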

Example: High Response Time Alert

See how Auto Queen handles a typical performance alert.

Alert → SNS

ResponseTime90thPercentile > 0.9s for frontend-api in the prod namespace

Slack ACK

🚨 HIGH Response Time | frontend-api | 90th percentile at 1.2s ⏳ Analyzing...

AI Analysis

Querying RedQueen for request rates, error rates, and pod status... Searching logs for recent errors and deployments...

Thread Reply

📊 Analysis: Response time spike correlates with 2x traffic increase. CPU at 85%. Deployment 15 min ago introduced new DB query. Recommend: check new query performance or scale replicas.

Agent Follow-up

👤 "Show me the slow DB queries"
🤖 "Found 3 queries over 500ms in the last hour..."