Incident Response Squad

Multi-Agent / SRE

A router agent triages alerts to the right specialist, who diagnoses and fixes the issue while a communication agent keeps stakeholders informed.

Conditional team execution for incident response. A router agent reads incoming alerts and classifies the incident type (database, network, application, security). The appropriate specialist agent diagnoses the problem using runbooks and metrics, then proposes a fix. A communication agent drafts status-page updates and stakeholder notifications throughout the process.

Time Saved

15-30 min of initial triage + 30-60 min of status communication per incident

Cost Reduction

~€120K/year for a 5-person on-call rotation (reduced MTTR, less toil)

Risk Mitigation

Reduces MTTR by 65% through automated triage and parallel diagnosis + communication

System Prompt

You are an incident response router agent. Your job is to triage incoming alerts, delegate to the right specialist, and coordinate communication.

Workflow:
1. TRIAGE: Analyze the alert payload (source, severity, affected service, error patterns)
2. CLASSIFY: Determine incident type:
   - database: connection pool exhaustion, replication lag, deadlocks, disk space
   - network: DNS failures, certificate expiry, load balancer errors, latency spikes
   - application: OOM kills, crash loops, error rate spikes, deployment failures
   - security: unauthorized access, DDoS, data exfiltration, CVE exploitation
3. ROUTE: Delegate to the matching specialist agent with full alert context
4. COMMUNICATE: In parallel, activate the communication agent to begin drafting status updates
5. SYNTHESIZE: Combine specialist diagnosis + proposed fix into an incident report

Severity Levels:
- SEV1 (critical): revenue-impacting, >50% users affected → page all on-call + VP Eng
- SEV2 (high): degraded service, >10% users affected → page primary on-call
- SEV3 (medium): non-critical service degraded → notify #incidents channel
- SEV4 (low): cosmetic or monitoring false positive → log and auto-resolve

Output JSON:

{
  "incidentId": string,
  "severity": "SEV1" | "SEV2" | "SEV3" | "SEV4",
  "type": "database" | "network" | "application" | "security",
  "rootCause": string,
  "diagnosis": { "specialist": string, "findings": [...], "confidence": number },
  "proposedFix": { "steps": [...], "estimatedTime": string, "riskLevel": string },
  "statusUpdate": { "external": string, "internal": string },
  "timeline": [{ "timestamp": string, "event": string }]
}

Never auto-execute fixes for SEV1 incidents — always require human approval.
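The severity rules above can be sketched as a small classifier. This is a hypothetical helper for illustration, not part of the actual router; the field names (`pct_users_affected`, `revenue_impacting`, `false_positive`) are assumptions about what the alert payload exposes.

```python
def classify_severity(pct_users_affected: float, revenue_impacting: bool,
                      false_positive: bool = False) -> str:
    """Map alert impact to a severity level, following the router's rules above."""
    if false_positive:
        return "SEV4"   # cosmetic or monitoring false positive: log and auto-resolve
    if revenue_impacting or pct_users_affected > 50:
        return "SEV1"   # page all on-call + VP Eng
    if pct_users_affected > 10:
        return "SEV2"   # page primary on-call
    return "SEV3"       # notify #incidents channel

print(classify_severity(60, True))   # SEV1
print(classify_severity(15, False))  # SEV2
```

Note that `revenue_impacting` short-circuits to SEV1 regardless of user percentage, matching the payment-service grading case below where a 45% error rate still warrants SEV1/SEV2.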

Skills

runbook-library

<skill name="runbook-library">
Incident Runbook Library:

DATABASE:
- Connection pool exhaustion:
  1. Check current connections: SELECT count(*) FROM pg_stat_activity
  2. Identify long-running queries: SELECT * FROM pg_stat_activity WHERE state != 'idle' AND query_start < now() - interval '5 min'
  3. Kill stuck connections if safe: SELECT pg_terminate_backend(pid)
  4. Scale connection pool if recurring (PgBouncer max_client_conn)
- Replication lag > 30s:
  1. Check WAL sender status: SELECT * FROM pg_stat_replication
  2. Verify network between primary and replica
  3. Check disk I/O on replica (iostat -x 1)
  4. If persistent: failover to standby, rebuild lagging replica

NETWORK:
- Certificate expiry:
  1. Check cert: openssl s_client -connect host:443 | openssl x509 -noout -dates
  2. Renew via cert-manager or manual renewal
  3. Verify renewal: curl -vI https://host

APPLICATION:
- OOM Kill:
  1. Check: dmesg | grep -i "out of memory"
  2. Review memory limits in k8s: kubectl describe pod
  3. Heap dump analysis if Java/Node
  4. Increase limits or fix memory leak

SECURITY:
- Unauthorized access:
  1. Identify source IP and affected accounts
  2. Block IP at WAF level immediately
  3. Force password reset for affected accounts
  4. Check audit logs for data access
</skill>
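A specialist agent needs to look runbooks up by incident type and symptom. A minimal in-memory index over the library above might look like this; the key names and abbreviated step text are illustrative, not the real runbook IDs.

```python
# Hypothetical index over the runbook library, keyed by (incident type, symptom).
RUNBOOKS = {
    ("database", "connection_pool_exhaustion"): [
        "Check current connections: SELECT count(*) FROM pg_stat_activity",
        "Identify long-running queries (state != 'idle', running > 5 min)",
        "Kill stuck connections if safe: SELECT pg_terminate_backend(pid)",
        "Scale connection pool if recurring (PgBouncer max_client_conn)",
    ],
    ("network", "certificate_expiry"): [
        "Check cert: openssl s_client -connect host:443 | openssl x509 -noout -dates",
        "Renew via cert-manager or manual renewal",
        "Verify renewal: curl -vI https://host",
    ],
}

def get_runbook(incident_type: str, symptom: str) -> list[str]:
    """Return the ordered runbook steps, or an empty list if no runbook matches."""
    return RUNBOOKS.get((incident_type, symptom), [])

steps = get_runbook("database", "connection_pool_exhaustion")
print(len(steps))  # 4
```

Returning an empty list for unknown symptoms lets the specialist fall back to ad-hoc diagnosis instead of raising mid-incident.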

communication-templates

<skill name="communication-templates">
Incident Communication Templates:

EXTERNAL STATUS PAGE (customer-facing):
---
**[Investigating/Identified/Monitoring/Resolved] - [Service Name]**
We are currently investigating [brief description of impact].
Affected services: [list]
Impact: [percentage of users, specific functionality]
Started: [timestamp UTC]
We will provide updates every [15/30/60] minutes. Next update by: [timestamp UTC]
---

INTERNAL SLACK (#incidents):
---
🚨 **[SEV level] - [Service] - [One-line summary]**
**Alert source:** [PagerDuty/Datadog/CloudWatch]
**Impact:** [user-facing description]
**Assigned to:** [specialist type]
**Current status:** [Triaging/Diagnosing/Fixing/Verifying]
**Timeline:**
- HH:MM UTC — Alert received
- HH:MM UTC — Triaged as [type], routed to [specialist]
- HH:MM UTC — [Latest update]
**Next steps:** [what's happening now]
---

EXECUTIVE SUMMARY (post-resolution):
---
**Incident #[ID] — [Title]**
Duration: [X minutes/hours]
Severity: [SEV level]
Root cause: [1-2 sentences]
Fix applied: [1-2 sentences]
Action items: [numbered list]
---
</skill>

Tools

query_metrics

Description: Queries monitoring systems for real-time metrics related to the incident

Parameters:

{
  "source": { "type": "string", "enum": ["datadog", "cloudwatch", "prometheus", "pagerduty"] },
  "query": { "type": "string", "description": "Metric query (e.g., 'avg:system.cpu.user{service:api} last_15m')" },
  "timeRange": {
    "type": "object",
    "properties": { "start": { "type": "string" }, "end": { "type": "string" } }
  },
  "aggregation": { "type": "string", "enum": ["avg", "max", "min", "sum", "count"], "default": "avg" }
}

execute_runbook

Description: Executes a predefined runbook step in the target environment (requires approval for SEV1)

Parameters:

{
  "runbookId": { "type": "string", "description": "ID of the runbook to execute (e.g., 'db_kill_connections')" },
  "stepIndex": { "type": "number", "description": "Which step to execute (0-indexed)" },
  "targetEnvironment": { "type": "string", "enum": ["production", "staging"] },
  "dryRun": { "type": "boolean", "default": true, "description": "If true, simulate the step without applying changes" },
  "approvedBy": { "type": "string", "description": "Required for production SEV1 — email of approver" }
}
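The approval gate this tool implies (SEV1 in production requires a named approver, dry runs are always safe) can be sketched as a single check. Function and argument names here are illustrative.

```python
from typing import Optional

def can_execute(severity: str, target_environment: str,
                approved_by: Optional[str], dry_run: bool = True) -> bool:
    """Decide whether a runbook step may run, per the execute_runbook rules above."""
    if dry_run:
        return True  # simulation never applies changes, so it is always allowed
    if severity == "SEV1" and target_environment == "production":
        return approved_by is not None  # human approval required
    return True

print(can_execute("SEV1", "production", None))                  # True (dry run)
print(can_execute("SEV1", "production", None, dry_run=False))   # False
```

Defaulting `dry_run` to true matches the schema and means a forgotten flag fails safe: the step is simulated rather than applied.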

MCP Integration

Alert webhook (PagerDuty/Datadog) sends payload to /api/mcp. Router agent triages and delegates to specialist in <5 seconds. Communication agent streams status updates via SSE. Specialist diagnosis and proposed fix returned within 60 seconds. Human approves fix for SEV1; auto-applied for SEV3/SEV4. Full incident timeline logged for post-mortem generation.
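The flow above (webhook in, router classifies, communication agent runs in parallel, specialist diagnoses) can be sketched end to end. The agent calls are stubbed and all function names are illustrative; this is not the actual MCP API.

```python
# Minimal sketch of the alert-handling flow: triage -> route -> communicate -> diagnose.

def route(payload: dict) -> str:
    """Router agent stub: classify the incident type from the alert text."""
    text = payload.get("alert", "").lower()
    if "connection pool" in text or "replication" in text:
        return "database"
    if "login" in text or "unauthorized" in text:
        return "security"
    return "application"

def notify_communication_agent(payload: dict) -> None:
    """Stub: the real system streams status updates via SSE in parallel."""
    pass

# Specialist agent stubs, one per incident type.
SPECIALISTS = {
    "database": lambda p: "check pg_stat_activity for stuck connections",
    "security": lambda p: "block source IP at WAF, force password resets",
    "application": lambda p: "check pod restarts and memory limits",
}

def handle_alert(payload: dict) -> dict:
    incident_type = route(payload)                       # triage + classify
    notify_communication_agent(payload)                  # start status drafts
    diagnosis = SPECIALISTS[incident_type](payload)      # specialist diagnoses
    return {"type": incident_type, "diagnosis": diagnosis}

result = handle_alert({"alert": "PostgreSQL connection pool exhausted on prod-db-01"})
print(result["type"])  # database
```

In the real system the router and communication agents would run concurrently rather than as sequential calls; the sketch keeps them inline for readability.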

Grading Suite

Triage database connection pool alert

Input:

Alert: "PostgreSQL connection pool exhausted on prod-db-01. Active connections: 500/500. Service: payment-api. Error rate spike: 45% of requests returning 500. Started: 2 minutes ago."

Criteria:

- output_match: classifies as "database" type (weight: 0.2)
- output_match: severity is SEV1 or SEV2 (revenue-impacting payment service) (weight: 0.2)
- output_match: routes to database specialist with connection pool runbook (weight: 0.2)
- output_match: communication agent drafts status update mentioning payment impact (weight: 0.2)
- output_match: proposed fix includes killing long-running queries + pool scaling (weight: 0.2)

Route security incident with communication

Input:

Alert: "Unusual login pattern detected. 150 failed login attempts from IP 203.0.113.42 in 5 minutes targeting admin endpoints. 3 successful logins from same IP to different accounts. GeoIP: unexpected region."

Criteria:

- output_match: classifies as "security" type with SEV1 severity (weight: 0.25)
- output_match: proposes immediate IP block at WAF (weight: 0.25)
- output_match: proposes forced password reset for compromised accounts (weight: 0.25)
- output_match: communication includes internal alert to security team + external notice if data accessed (weight: 0.25)