Incident Response Squad

Multi-Agent / SRE

A router agent triages alerts and routes each one to the right specialist, who diagnoses the issue and proposes a fix while a communication agent updates stakeholders

Conditional team execution for incident response. A router agent reads incoming alerts and classifies the incident type (database, network, application, security). The appropriate specialist agent diagnoses the issue using runbooks and metrics, then proposes a fix. A communication agent drafts status page updates and stakeholder notifications throughout the process.

Time Saved

15-30 min of initial triage + 30-60 min of status communication per incident

Cost Reduction

~$120K/year for a 5-person on-call rotation (faster MTTR, less toil)

Risk Mitigation

Reduces MTTR by 65% with automated triage and parallel diagnosis + communication

System Prompt

You are an incident response router agent. Your job is to triage incoming alerts, delegate to the right specialist, and coordinate communication.

Workflow:
1. TRIAGE: Analyze the alert payload (source, severity, affected service, error patterns)
2. CLASSIFY: Determine incident type:
   - database: connection pool exhaustion, replication lag, deadlocks, disk space
   - network: DNS failures, certificate expiry, load balancer errors, latency spikes
   - application: OOM kills, crash loops, error rate spikes, deployment failures
   - security: unauthorized access, DDoS, data exfiltration, CVE exploitation
3. ROUTE: Delegate to the matching specialist agent with full alert context
4. COMMUNICATE: In parallel, activate the communication agent to begin drafting status updates
5. SYNTHESIZE: Combine specialist diagnosis + proposed fix into an incident report

Severity Levels:
- SEV1 (critical): revenue-impacting, >50% users affected → page all on-call + VP Eng
- SEV2 (high): degraded service, >10% users affected → page primary on-call
- SEV3 (medium): non-critical service degraded → notify #incidents channel
- SEV4 (low): cosmetic or monitoring false positive → log and auto-resolve

Output JSON:
{
  "incidentId": string,
  "severity": "SEV1" | "SEV2" | "SEV3" | "SEV4",
  "type": "database" | "network" | "application" | "security",
  "rootCause": string,
  "diagnosis": { "specialist": string, "findings": [...], "confidence": number },
  "proposedFix": { "steps": [...], "estimatedTime": string, "riskLevel": string },
  "statusUpdate": { "external": string, "internal": string },
  "timeline": [{ "timestamp": string, "event": string }]
}

Never auto-execute fixes for SEV1 incidents — always require human approval.
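The CLASSIFY step and the severity thresholds in the prompt can be illustrated with a small Python sketch. In the described system the router agent makes these calls itself, so the substring matching here is only a stand-in. The names `classify_type`, `classify_severity`, `TYPE_KEYWORDS`, and `ESCALATION` are hypothetical, and treating either SEV1 condition as sufficient on its own is an assumption, since the prompt does not say how the conditions combine.

```python
from typing import Optional

# Keyword hints copied from the CLASSIFY list in the prompt. Substring matching
# is a crude stand-in for the router agent's judgment (e.g. "oom" also matches "room").
TYPE_KEYWORDS = {
    "database": ["connection pool", "replication lag", "deadlock", "disk space"],
    "network": ["dns", "certificate", "load balancer", "latency"],
    "application": ["oom", "crash loop", "error rate", "deployment"],
    "security": ["unauthorized access", "ddos", "exfiltration", "cve"],
}

# Escalation actions as listed under "Severity Levels".
ESCALATION = {
    "SEV1": "page all on-call + VP Eng",
    "SEV2": "page primary on-call",
    "SEV3": "notify #incidents channel",
    "SEV4": "log and auto-resolve",
}


def classify_type(alert_text: str) -> Optional[str]:
    """Return the incident type with the most keyword hits, or None if nothing matches."""
    text = alert_text.lower()
    hits = {t: sum(k in text for k in kws) for t, kws in TYPE_KEYWORDS.items()}
    best = max(hits, key=hits.get)  # ties resolve to the earlier dict entry
    return best if hits[best] > 0 else None


def classify_severity(users_affected_pct: float,
                      revenue_impacting: bool = False,
                      false_positive: bool = False) -> str:
    """Map impact figures to a SEV level using the thresholds from the prompt."""
    if false_positive:
        return "SEV4"
    # Assumption: revenue impact OR >50% of users is enough for SEV1.
    if revenue_impacting or users_affected_pct > 50:
        return "SEV1"
    if users_affected_pct > 10:
        return "SEV2"
    return "SEV3"
```

For example, `ESCALATION[classify_severity(20)]` yields the SEV2 action, "page primary on-call".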

Skills

runbook-library

<skill name="runbook-library">
Incident Runbook Library:

DATABASE:
- Connection pool exhaustion:
  1. Check current connections: SELECT count(*) FROM pg_stat_activity
  2. Identify long-running queries: SELECT * FROM pg_stat_activity WHERE state != 'idle' AND query_start < now() - interval '5 min'
  3. Kill stuck connections if safe: SELECT pg_terminate_backend(pid)
  4. Scale connection pool if recurring (PgBouncer max_client_conn)
- Replication lag > 30s:
  1. Check WAL sender status: SELECT * FROM pg_stat_replication
  2. Verify network between primary and replica
  3. Check disk I/O on replica (iostat -x 1)
  4. If persistent: failover to standby, rebuild lagging replica

NETWORK:
- Certificate expiry:
  1. Check cert: echo | openssl s_client -connect host:443 | openssl x509 -noout -dates
  2. Renew via cert-manager or manual renewal
  3. Verify renewal: curl -vI https://host

APPLICATION:
- OOM Kill:
  1. Check: dmesg | grep -i "out of memory"
  2. Review memory limits in k8s: kubectl describe pod
  3. Heap dump analysis if Java/Node
  4. Increase limits or fix memory leak

SECURITY:
- Unauthorized access:
  1. Identify source IP and affected accounts
  2. Block IP at WAF level immediately
  3. Force password reset for affected accounts
  4. Check audit logs for data access
</skill>
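The certificate check in the runbook ends with `openssl x509 -noout -dates`, which prints `notBefore=` and `notAfter=` lines. As a rough sketch, that output could be turned into days remaining with the standard library's `ssl.cert_time_to_seconds`. The function name `days_until_expiry` and the sample dates are illustrative, not part of the runbook.

```python
import ssl
from datetime import datetime, timezone


def days_until_expiry(openssl_dates_output: str, now: datetime) -> float:
    """Parse the notAfter= line printed by `openssl x509 -noout -dates`."""
    for line in openssl_dates_output.splitlines():
        if line.startswith("notAfter="):
            # cert_time_to_seconds expects e.g. "Apr  1 00:00:00 2025 GMT"
            expires = ssl.cert_time_to_seconds(line.split("=", 1)[1].strip())
            return (expires - now.timestamp()) / 86400
    raise ValueError("no notAfter= line found")


# Made-up sample output, for illustration only.
sample = "notBefore=Jan  1 00:00:00 2025 GMT\nnotAfter=Apr  1 00:00:00 2025 GMT\n"
remaining = days_until_expiry(sample, datetime(2025, 3, 25, tzinfo=timezone.utc))
```

A specialist agent could compare `remaining` against an alerting threshold before deciding whether step 2 (renewal) is urgent.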

communication-templates

<skill name="communication-templates">
Incident Communication Templates:

EXTERNAL STATUS PAGE (customer-facing):
---
**[Investigating/Identified/Monitoring/Resolved] - [Service Name]**
We are currently investigating [brief description of impact].
Affected services: [list]
Impact: [percentage of users, specific functionality]
Started: [timestamp UTC]
We will provide updates every [15/30/60] minutes.
Next update by: [timestamp UTC]
---

INTERNAL SLACK (#incidents):
---
🚨 **[SEV level] - [Service] - [One-line summary]**
**Alert source:** [PagerDuty/Datadog/CloudWatch]
**Impact:** [user-facing description]
**Assigned to:** [specialist type]
**Current status:** [Triaging/Diagnosing/Fixing/Verifying]
**Timeline:**
- HH:MM UTC — Alert received
- HH:MM UTC — Triaged as [type], routed to [specialist]
- HH:MM UTC — [Latest update]
**Next steps:** [what's happening now]
---

EXECUTIVE SUMMARY (post-resolution):
---
**Incident #[ID] — [Title]**
Duration: [X minutes/hours]
Severity: [SEV level]
Root cause: [1-2 sentences]
Fix applied: [1-2 sentences]
Action items: [numbered list]
---
</skill>
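To show how the external status page template might be filled in, here is a minimal Python sketch. The function name `render_status_page`, its parameters, and the timestamp format are assumptions; the template itself only says "timestamp UTC". The "Next update by" value is derived from the current time plus one of the three update intervals the template allows.

```python
from datetime import datetime, timedelta, timezone
from typing import List

PHASES = ("Investigating", "Identified", "Monitoring", "Resolved")
UPDATE_INTERVALS_MIN = (15, 30, 60)
TS_FORMAT = "%Y-%m-%d %H:%M UTC"  # assumed format; not specified by the template


def render_status_page(phase: str, service: str, description: str,
                       affected: List[str], impact: str,
                       started: datetime, now: datetime,
                       interval_min: int = 30) -> str:
    """Fill the customer-facing template; reject values the template does not list."""
    if phase not in PHASES:
        raise ValueError(f"unknown phase: {phase}")
    if interval_min not in UPDATE_INTERVALS_MIN:
        raise ValueError(f"unsupported update interval: {interval_min}")
    next_update = now + timedelta(minutes=interval_min)
    return "\n".join([
        f"**{phase} - {service}**",
        f"We are currently investigating {description}.",
        f"Affected services: {', '.join(affected)}",
        f"Impact: {impact}",
        f"Started: {started.strftime(TS_FORMAT)}",
        f"We will provide updates every {interval_min} minutes.",
        f"Next update by: {next_update.strftime(TS_FORMAT)}",
    ])


# Illustrative values only.
started = datetime(2025, 1, 1, 12, 0, tzinfo=timezone.utc)
page = render_status_page("Investigating", "payment-api", "elevated error rates",
                          ["payment-api"], "some checkout requests are failing",
                          started, started + timedelta(minutes=2), interval_min=15)
```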

Tools

query_metrics

Description: Queries monitoring systems for real-time metrics related to the incident

Parameters:

{
  "source": {
    "type": "string",
    "enum": ["datadog", "cloudwatch", "prometheus", "pagerduty"]
  },
  "query": {
    "type": "string",
    "description": "Metric query (e.g., 'avg:system.cpu.user{service:api} last_15m')"
  },
  "timeRange": {
    "type": "object",
    "properties": {
      "start": { "type": "string" },
      "end": { "type": "string" }
    }
  },
  "aggregation": {
    "type": "string",
    "enum": ["avg", "max", "min", "sum", "count"],
    "default": "avg"
  }
}
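As an illustration of how a caller might assemble arguments for `query_metrics`, the sketch below checks the two enums from the schema and applies the `avg` default. The helper name `build_query_metrics_args` is hypothetical, and passing `start`/`end` as ISO 8601 strings is an assumption, since the schema only declares them as strings.

```python
from typing import Any, Dict, Optional

SOURCES = {"datadog", "cloudwatch", "prometheus", "pagerduty"}
AGGREGATIONS = {"avg", "max", "min", "sum", "count"}


def build_query_metrics_args(source: str, query: str,
                             start: Optional[str] = None,
                             end: Optional[str] = None,
                             aggregation: str = "avg") -> Dict[str, Any]:
    """Build a query_metrics argument object that respects the schema's enums."""
    if source not in SOURCES:
        raise ValueError(f"unsupported source: {source}")
    if aggregation not in AGGREGATIONS:
        raise ValueError(f"unsupported aggregation: {aggregation}")
    args: Dict[str, Any] = {"source": source, "query": query, "aggregation": aggregation}
    # timeRange is only included when both bounds are supplied.
    if start is not None and end is not None:
        args["timeRange"] = {"start": start, "end": end}
    return args


args = build_query_metrics_args(
    "datadog",
    "avg:system.cpu.user{service:api} last_15m",
    start="2025-01-01T12:00:00Z",  # assumed ISO 8601 format
    end="2025-01-01T12:15:00Z",
)
```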

execute_runbook

Description: Executes a predefined runbook step in the target environment (requires approval for SEV1)

Parameters:

{
  "runbookId": {
    "type": "string",
    "description": "ID of the runbook to execute (e.g., 'db_kill_connections')"
  },
  "stepIndex": {
    "type": "number",
    "description": "Which step to execute (0-indexed)"
  },
  "targetEnvironment": {
    "type": "string",
    "enum": ["production", "staging"]
  },
  "dryRun": {
    "type": "boolean",
    "default": true,
    "description": "If true, simulate the step without applying changes"
  },
  "approvedBy": {
    "type": "string",
    "description": "Required for production SEV1 — email of approver"
  }
}
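The approval rule attached to `execute_runbook` could be enforced before the tool is invoked. The sketch below is one way to do that under stated assumptions: the function name `check_execute_runbook` is hypothetical, and the incident severity is passed in separately because the tool schema has no severity field. It applies the `dryRun: true` default and rejects production SEV1 calls that lack `approvedBy`.

```python
from typing import Any, Dict


def check_execute_runbook(args: Dict[str, Any], severity: str) -> Dict[str, Any]:
    """Validate execute_runbook arguments and return them with defaults applied."""
    env = args.get("targetEnvironment")
    if env not in {"production", "staging"}:
        raise ValueError(f"unsupported targetEnvironment: {env}")
    step = args.get("stepIndex")
    # bool is a subclass of int in Python, so exclude it explicitly.
    if not isinstance(step, int) or isinstance(step, bool) or step < 0:
        raise ValueError("stepIndex must be a non-negative integer (0-indexed)")
    normalized = dict(args)
    normalized.setdefault("dryRun", True)  # schema default: simulate only
    # Schema note: approvedBy is required for production SEV1.
    if env == "production" and severity == "SEV1" and not normalized.get("approvedBy"):
        raise PermissionError("production SEV1 runbook steps need approvedBy")
    return normalized


call = check_execute_runbook(
    {"runbookId": "db_kill_connections", "stepIndex": 2, "targetEnvironment": "staging"},
    severity="SEV2",
)
```

Keeping the check outside the agent means a missing approval fails closed even if the model ignores the instruction in its prompt.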

MCP Integration

An alert webhook (PagerDuty/Datadog) sends its payload to /api/mcp. The router agent triages the alert and delegates to a specialist in under 5 seconds. The communication agent streams status updates via SSE. The specialist's diagnosis and proposed fix are returned within 60 seconds. A human approves the fix for SEV1 incidents; fixes for SEV3/SEV4 are auto-applied. The full incident timeline is logged for post-mortem generation.
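Status updates streamed over SSE follow the server-sent events wire format: optional `id:` and `event:` fields, a `data:` field, and a blank line ending each event. A minimal sketch of framing one update is shown below. The function name `format_sse`, the event name `status_update`, and the payload fields are assumptions, not the endpoint's actual contract.

```python
import json
from typing import Optional


def format_sse(data: dict, event: Optional[str] = None,
               event_id: Optional[str] = None) -> str:
    """Serialize one server-sent event; the trailing blank line ends the event."""
    lines = []
    if event_id is not None:
        lines.append(f"id: {event_id}")
    if event is not None:
        lines.append(f"event: {event}")
    # json.dumps escapes newlines, so the payload stays on a single data: line.
    lines.append(f"data: {json.dumps(data)}")
    return "\n".join(lines) + "\n\n"


frame = format_sse({"status": "Diagnosing"}, event="status_update", event_id="1")
```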

Grading Suite

Triage database connection pool alert

Input:

Alert: "PostgreSQL connection pool exhausted on prod-db-01. Active connections: 500/500. Service: payment-api. Error rate spike: 45% of requests returning 500. Started: 2 minutes ago."

Criteria:

- output_match: classifies as "database" type (weight: 0.2)
- output_match: severity is SEV1 or SEV2 (revenue-impacting payment service) (weight: 0.2)
- output_match: routes to database specialist with connection pool runbook (weight: 0.2)
- output_match: communication agent drafts status update mentioning payment impact (weight: 0.2)
- output_match: proposed fix includes killing long-running queries + pool scaling (weight: 0.2)

Route security incident with communication

Input:

Alert: "Unusual login pattern detected. 150 failed login attempts from IP 203.0.113.42 in 5 minutes targeting admin endpoints. 3 successful logins from same IP to different accounts. GeoIP: unexpected region."

Criteria:

- output_match: classifies as "security" type with SEV1 severity (weight: 0.25)
- output_match: proposes immediate IP block at WAF (weight: 0.25)
- output_match: proposes forced password reset for compromised accounts (weight: 0.25)
- output_match: communication includes internal alert to security team + external notice if data accessed (weight: 0.25)
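The criteria in each test case carry weights that sum to 1.0, which suggests a weighted pass rate as the overall score. How the grader actually combines them is not stated, so the sketch below is one plausible reading, with the hypothetical name `weighted_score`; it normalizes by the total weight so the result stays in [0, 1] even if the weights do not sum exactly to 1.

```python
from typing import List, Tuple


def weighted_score(results: List[Tuple[bool, float]]) -> float:
    """Return the weight of passed criteria divided by the total weight."""
    total = sum(weight for _, weight in results)
    if total <= 0:
        raise ValueError("total weight must be positive")
    earned = sum(weight for passed, weight in results if passed)
    return earned / total


# Example: four of the five database criteria pass.
database_case = [(True, 0.2), (True, 0.2), (True, 0.2), (True, 0.2), (False, 0.2)]
score = weighted_score(database_case)
```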