Auto Queen

Automated Alert Intelligence

Auto Queen is the automated alert enrichment service. Prometheus alerts trigger an AI-powered analysis orchestrated by AWS Step Functions, which delivers enriched context to Slack. Engineers can then continue the conversation in the same thread.

How Auto Queen Works

Alerts flow from Alertmanager → SNS → Step Functions → AI Analysis → Slack thread.

Pipeline Steps

AWS Step Functions orchestrates the entire flow.
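As a rough sketch of that orchestration, the steps listed below could map onto an Amazon States Language definition like the one generated here. The state names, the Lambda function names, and the account ID are placeholders invented for illustration; the real state machine is not described in this document and may be structured differently.

```python
import json

# Hypothetical Lambda ARN prefix; the real resources are not named in this document.
ARN = "arn:aws:lambda:us-east-1:123456789012:function:"

# Ordered (state name, hypothetical Lambda function) pairs, one per pipeline step.
STEPS = [
    ("SlackAck", "autoqueen-slack-ack"),
    ("GatherContext", "autoqueen-gather-context"),
    ("AIAnalysis", "autoqueen-ai-analysis"),
    ("ThreadReply", "autoqueen-thread-reply"),
]


def build_state_machine(steps):
    """Chain Task states in order; the last state ends the execution."""
    states = {}
    for i, (name, fn) in enumerate(steps):
        state = {"Type": "Task", "Resource": ARN + fn}
        if i + 1 < len(steps):
            state["Next"] = steps[i + 1][0]
        else:
            state["End"] = True
        states[name] = state
    return {"StartAt": steps[0][0], "States": states}


definition = build_state_machine(STEPS)
print(json.dumps(definition, indent=2))
```

The in-thread follow-up is left out of this sketch because it is driven by new Slack messages rather than by the original alert execution.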

1. Alert → SNS

Prometheus Alertmanager publishes the firing alert to an SNS topic, which is the entry point into Auto Queen.
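A minimal sketch of reading such a message on the receiving side is shown below. It assumes an SNS-triggered Lambda event (the standard `Records[].Sns.Message` envelope) and, as a further assumption, that the message body is JSON shaped like Alertmanager's webhook payload with a top-level `alerts` list. The function name `firing_alerts` is hypothetical.

```python
import json


def firing_alerts(event):
    """Yield firing alerts from an SNS-triggered Lambda event.

    Assumes each SNS message body is JSON with a top-level "alerts" list,
    as in Alertmanager's webhook payload format.
    """
    for record in event.get("Records", []):
        payload = json.loads(record["Sns"]["Message"])
        for alert in payload.get("alerts", []):
            if alert.get("status") == "firing":
                yield alert


# Illustrative event with one firing and one resolved alert.
sample_event = {
    "Records": [
        {
            "Sns": {
                "Message": json.dumps(
                    {
                        "alerts": [
                            {"status": "firing", "labels": {"alertname": "ResponseTime90thPercentile"}},
                            {"status": "resolved", "labels": {"alertname": "DiskFull"}},
                        ]
                    }
                )
            }
        }
    ]
}

firing = list(firing_alerts(sample_event))
```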

2. Slack ACK

Step Functions immediately posts an acknowledgement to Slack so engineers know the alert is being handled.
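The acknowledgement text could be assembled from the alert itself, roughly as follows. This is only a sketch: the label keys `severity`, `alertname`, and `service` and the `summary` annotation are assumed conventions, and `format_ack` is a hypothetical helper.

```python
def format_ack(alert):
    """Build a one-line acknowledgement plus an 'Analyzing...' status line."""
    labels = alert.get("labels", {})
    annotations = alert.get("annotations", {})
    severity = labels.get("severity", "unknown").upper()
    name = labels.get("alertname", "UnknownAlert")
    service = labels.get("service", "unknown-service")
    parts = [f"{severity} {name}", service]
    summary = annotations.get("summary", "")
    if summary:  # only include the summary when the alert provides one
        parts.append(summary)
    return " | ".join(parts) + "\nAnalyzing..."


ack_text = format_ack(
    {
        "labels": {
            "severity": "high",
            "alertname": "ResponseTime90thPercentile",
            "service": "frontend-api",
        },
        "annotations": {"summary": "90th percentile at 1.2s"},
    }
)
```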

3. Gather Context

Extract alert labels, annotations, and metadata. Query related metrics and logs for context.
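The extraction half of this step might look like the sketch below. The `AlertContext` structure and the `runbook_url` annotation key are assumptions made for illustration, not the service's actual data model.

```python
from dataclasses import dataclass, field

# Labels promoted to named fields; everything else is kept in extra_labels.
KNOWN_LABELS = ("alertname", "severity", "namespace", "service")


@dataclass
class AlertContext:
    name: str
    severity: str
    namespace: str
    service: str
    runbook_url: str
    extra_labels: dict = field(default_factory=dict)


def extract_context(alert):
    """Pull the commonly used labels and annotations out of one alert."""
    labels = alert.get("labels", {})
    annotations = alert.get("annotations", {})
    return AlertContext(
        name=labels.get("alertname", ""),
        severity=labels.get("severity", ""),
        namespace=labels.get("namespace", ""),
        service=labels.get("service", ""),
        runbook_url=annotations.get("runbook_url", ""),
        extra_labels={k: v for k, v in labels.items() if k not in KNOWN_LABELS},
    )


ctx = extract_context(
    {
        "labels": {
            "alertname": "ResponseTime90thPercentile",
            "severity": "high",
            "namespace": "prod",
            "service": "frontend-api",
            "pod": "frontend-api-7c9d",
        },
        "annotations": {},
    }
)
```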

4. AI Analysis

The AI analyzes the gathered context and may query RedQueen for deeper insight into metrics, logs, or Kubernetes state.
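One common way to structure "analyze, and optionally query a tool" is a bounded loop like the sketch below. Everything here is hypothetical: the `model` callable, the decision format (`{"tool": ...}` or `{"answer": ...}`), and the `query_metrics` tool stand in for whatever interface the AI and RedQueen actually expose, which this document does not describe.

```python
def analyze(context, model, tools, max_steps=5):
    """Run a bounded tool-use loop and return the final analysis text."""
    transcript = [{"role": "context", "content": context}]
    for _ in range(max_steps):
        decision = model(transcript)
        if "answer" in decision:
            return decision["answer"]
        # The model asked for more data: call the named tool and record the result.
        tool = tools[decision["tool"]]
        result = tool(**decision.get("args", {}))
        transcript.append({"role": "tool", "name": decision["tool"], "content": result})
    return "Analysis incomplete: step limit reached."


def fake_model(transcript):
    """Stand-in model: asks for metrics once, then answers."""
    if any(entry["role"] == "tool" for entry in transcript):
        return {"answer": "CPU is at 85%; consider scaling replicas."}
    return {"tool": "query_metrics", "args": {"query": "cpu_usage"}}


# Hypothetical tool that would forward the query to RedQueen.
tools = {"query_metrics": lambda query: {"query": query, "value": 0.85}}

result = analyze({"alertname": "ResponseTime90thPercentile"}, fake_model, tools)
```

The step limit keeps a misbehaving model from looping on tool calls indefinitely.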

5. Thread Reply

Post the enriched analysis as a thread reply with actionable insights and recommendations.
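In Slack's Web API, a reply joins a thread when `chat.postMessage` is called with `thread_ts` set to the `ts` of the parent message. The sketch below builds such a request body from the acknowledgement's response; it assumes the acknowledgement was posted with `chat.postMessage`, and `thread_reply_payload` is a hypothetical helper.

```python
def thread_reply_payload(ack_response, analysis_text):
    """Build a chat.postMessage body that replies under the ACK message."""
    if not ack_response.get("ok"):
        raise ValueError("ACK message was not posted; no thread to reply to")
    return {
        "channel": ack_response["channel"],
        "thread_ts": ack_response["ts"],  # parent message timestamp anchors the thread
        "text": analysis_text,
    }


reply = thread_reply_payload(
    {"ok": True, "channel": "C0123456789", "ts": "1714564800.000100"},
    "Analysis: response time spike correlates with a 2x traffic increase.",
)
```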

6. Agent Follow-up

Engineers can continue the conversation in-thread, asking RedQueen for more details or different angles.
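To continue a conversation, the service has to recognize which thread a new message belongs to and recover the alert context for it. A minimal in-memory sketch, keyed on the `channel` and `thread_ts` fields that Slack message events carry, is shown below. The `ThreadStore` class is hypothetical, and a real deployment would presumably need persistent storage.

```python
class ThreadStore:
    """Maps (channel, thread_ts) to the alert context and follow-up messages."""

    def __init__(self):
        self._threads = {}

    def start(self, channel, thread_ts, alert_context):
        """Register a thread when the acknowledgement is posted."""
        self._threads[(channel, thread_ts)] = {"alert": alert_context, "messages": []}

    def record_follow_up(self, event):
        """Append a Slack message event to its thread; return the thread or None."""
        key = (event.get("channel"), event.get("thread_ts"))
        thread = self._threads.get(key)
        if thread is None:
            return None  # not a thread that was registered here
        thread["messages"].append(event.get("text", ""))
        return thread


store = ThreadStore()
store.start("C0123456789", "1714564800.000100", {"alertname": "ResponseTime90thPercentile"})
thread = store.record_follow_up(
    {
        "channel": "C0123456789",
        "thread_ts": "1714564800.000100",
        "text": "Show me the slow DB queries",
    }
)
```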

Benefits

Why automated alert enrichment matters.

Instant Context

On-call engineers see the AI analysis alongside the alert, without having to run the initial investigation by hand.

Reduced MTTR

Faster incident resolution with pre-analyzed logs, metrics, and historical correlation.

24/7 First Responder

Every alert gets the same thorough analysis, regardless of time of day or team availability.

Continuous Conversation

Engineers can ask follow-up questions in-thread. Auto Queen keeps the context and digs deeper.

What Gets Analyzed

Auto Queen extracts context and queries RedQueen for deeper insights.

Alert Metadata

  • Alert name and severity
  • Labels (namespace, pod, service)
  • Annotations and runbook URLs
  • Firing duration
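Firing duration can be derived from the alert's start timestamp. The sketch below assumes the alert carries an RFC 3339 `startsAt` value, as in Alertmanager's webhook payload, with at most microsecond precision; `firing_duration_seconds` is a hypothetical helper.

```python
from datetime import datetime, timezone


def firing_duration_seconds(starts_at, now=None):
    """Return how long an alert has been firing, in seconds."""
    # Older Python versions do not accept a trailing "Z", so normalize it.
    started = datetime.fromisoformat(starts_at.replace("Z", "+00:00"))
    now = now or datetime.now(timezone.utc)
    return (now - started).total_seconds()


# An alert that started 15 minutes before the reference time.
duration = firing_duration_seconds(
    "2024-05-01T12:00:00Z",
    now=datetime(2024, 5, 1, 12, 15, tzinfo=timezone.utc),
)
print(duration)  # 900.0
```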

Prometheus Queries

  • Related metrics for context
  • Historical trends
  • Threshold comparisons
  • Resource utilization

OpenSearch Logs

  • Error logs from affected pods
  • Recent deployments
  • WAF blocks if relevant
  • Correlated events
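A log search for recent errors from the affected pods could be expressed as an OpenSearch query DSL body along these lines. The field names (`kubernetes.namespace_name`, `kubernetes.pod_name`, `level`, `@timestamp`) are assumptions about the index mapping, and `error_log_query` is a hypothetical helper.

```python
def error_log_query(namespace, pod_prefix, minutes=30, size=50):
    """Build a search body for recent error logs from matching pods."""
    return {
        "size": size,
        "sort": [{"@timestamp": {"order": "desc"}}],  # newest entries first
        "query": {
            "bool": {
                "filter": [
                    {"term": {"kubernetes.namespace_name": namespace}},
                    {"prefix": {"kubernetes.pod_name": pod_prefix}},
                    {"term": {"level": "error"}},
                    {"range": {"@timestamp": {"gte": f"now-{minutes}m"}}},
                ]
            }
        },
    }


body = error_log_query("prod", "frontend-api")
```

Using `filter` clauses rather than `must` skips relevance scoring, which is not needed when the results are sorted by timestamp.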

Example: High Response Time Alert

See how Auto Queen handles a typical performance alert.

Alert → SNS

ResponseTime90thPercentile > 0.9s for frontend-api in the prod namespace

Slack ACK

🚨 HIGH Response Time | frontend-api | 90th percentile at 1.2s
⏳ Analyzing...

AI Analysis

Querying RedQueen for request rates, error rates, and pod status...
Searching logs for recent errors and deployments...

Thread Reply

📊 Analysis: Response time spike correlates with 2x traffic increase. CPU at 85%. Deployment 15 min ago introduced new DB query. Recommend: check new query performance or scale replicas.

Agent Follow-up

👤 "Show me the slow DB queries"
🤖 "Found 3 queries over 500ms in the last hour..."