Overview

AI Runbooks allow you to define a set of tasks that Guardian should perform when an alert fires. Instead of manually investigating alerts, you can configure Guardian to automatically gather relevant data, analyze the situation, and document findings.

How Runbooks Work

  1. Alert Fires - A configured alert is triggered in Metoro
  2. Runbook Executes - Guardian automatically executes the runbook you’ve defined
  3. Investigation Document - Guardian creates a new document containing all the information gathered
  4. Notification - A link to the investigation document is included in the alert notification
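
The four steps above can be pictured as a small pipeline. The Python sketch below is purely illustrative: `Alert`, `Investigation`, `run_runbook`, `notify`, and the example URL are hypothetical names invented here and are not part of any Metoro API. It only models the order of events.

```python
from dataclasses import dataclass, field


@dataclass
class Alert:
    name: str
    runbook: str  # natural-language instructions attached to the alert


@dataclass
class Investigation:
    alert_name: str
    findings: list[str] = field(default_factory=list)
    url: str = ""


def run_runbook(alert: Alert) -> Investigation:
    """Model steps 2 and 3: work through the runbook and record findings."""
    doc = Investigation(alert_name=alert.name)
    for line in alert.runbook.splitlines():
        line = line.strip()
        if line:
            # A real investigation would query logs, traces, or metrics here.
            doc.findings.append(f"Investigated: {line}")
    # Hypothetical link format, for illustration only.
    doc.url = f"https://example.invalid/investigations/{alert.name}"
    return doc


def notify(alert: Alert, doc: Investigation) -> str:
    """Model step 4: the notification carries a link to the document."""
    return f"[{alert.name}] fired. Investigation: {doc.url}"


alert = Alert("high-error-rate", "Find error logs.\nCheck recent deployments.")
print(notify(alert, run_runbook(alert)))
```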

What Can Runbooks Do?

Guardian can perform a broad set of investigation tasks in your runbooks:

Find Logs

Search for relevant logs around the time of the alert, including error logs, warnings, and contextual information

Analyze Traces

Find and analyze traces related to the alert, including slow requests, errors, and dependency failures

Query Metrics

Gather relevant metrics like error rates, latency percentiles, throughput, and resource utilization

Check Dependencies

Analyze upstream and downstream service dependencies to identify cascading failures

Isolate Failing Pods

Identify which specific pods or instances are experiencing issues

Correlate Changes

Link issues to recent deployments or configuration changes

Creating a Runbook

Step 1: Create or Edit an Alert

  1. Navigate to Alerts in the main navigation
  2. Create a new alert or edit an existing one
  3. In the alert configuration, find the AI Runbook section

Step 2: Define the Runbook

Write instructions for what Guardian should investigate when the alert fires. Use natural language to describe what you want Guardian to do. For example:

Find error logs for this service in the last 30 minutes.
Look for any new error patterns.
Check if there were any recent deployments.

Step 3: Enable Guardian for the Alert

  1. Toggle Enable Guardian AI for the alert
  2. Save the alert configuration

Viewing Runbook Results

When an alert fires and the runbook executes:
  1. Alert Notification - The alert notification includes a link to the investigation document
  2. Investigations Page - View all investigation documents in Guardian > Investigations
  3. Investigation Details - Click on an investigation to see:
    • All data Guardian gathered
    • Analysis and findings
    • Recommended actions

Best Practices

Be Specific

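The more specific your runbook instructions, the better Guardian can investigate. An instruction like the following is too vague to direct an investigation:

Check what's wrong with the service.

Name the service, the time window, and the signals to check instead, as the runbooks under Runbook Examples do.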

Include Context

Give Guardian context about what the alert means:

This alert fires when the authentication service error rate exceeds 1%.

Our auth service depends on:
- PostgreSQL for user data
- Redis for session caching
- The identity-provider service for SSO

Please check all of these dependencies and look for the root cause.
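
If you maintain many alerts, you might generate context-rich runbook text like this from a template rather than writing each one by hand. The following Python sketch shows one way to do that. The `build_runbook` helper and its parameters are hypothetical and exist only for this illustration; the resulting string would still be pasted into the AI Runbook section of the alert.

```python
def build_runbook(alert_meaning: str, service: str, dependencies: dict[str, str]) -> str:
    """Compose runbook text from an alert description and a dependency map."""
    lines = [alert_meaning, "", f"Our {service} depends on:"]
    # One bullet per dependency: "- <name> for <purpose>"
    lines += [f"- {name} for {purpose}" for name, purpose in dependencies.items()]
    lines += ["", "Please check all of these dependencies and look for the root cause."]
    return "\n".join(lines)


runbook = build_runbook(
    "This alert fires when the authentication service error rate exceeds 1%.",
    "auth service",
    {
        "PostgreSQL": "user data",
        "Redis": "session caching",
        "The identity-provider service": "SSO",
    },
)
print(runbook)
```

Keeping the dependency list as data makes it easier to update every affected runbook when a service gains or loses a dependency.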

Focus on Actionable Information

Ask for information that helps with resolution:

Find information that will help us resolve this issue quickly:
- Which specific endpoint or function is failing?
- What error messages are users seeing?
- Is this affecting all users or a subset?
- What changed recently that might have caused this?

Runbook Examples

High Error Rate Alert

This alert fires when the error rate exceeds 1% for more than 5 minutes.

Investigation steps:
1. Get the breakdown of errors by HTTP status code
2. Find the top 5 most frequent error messages in the logs
3. Identify which endpoints have the highest error rates
4. Check if any recent deployments correlate with the error spike
5. Look at dependency health (database, cache, external APIs)
6. Check pod health and resource utilization
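
Steps 1 and 2 of this runbook amount to simple aggregations. As a rough illustration, the Python sketch below computes them over a hypothetical list of request records. The record shape (`status`, `message`) and the choice to count only 5xx responses as errors are assumptions made here, not a description of Metoro's data model or of how Guardian works internally.

```python
from collections import Counter


def error_breakdown(records, top_n=5):
    """Return (error_rate, errors_by_status, top_messages) for request records."""
    # Assumption: only 5xx responses count as errors.
    errors = [r for r in records if r["status"] >= 500]
    error_rate = len(errors) / len(records) if records else 0.0
    by_status = Counter(r["status"] for r in errors)
    top_messages = Counter(r["message"] for r in errors).most_common(top_n)
    return error_rate, by_status, top_messages


# Hypothetical sample: 97 successes and 3 server errors.
records = (
    [{"status": 200, "message": ""}] * 97
    + [{"status": 500, "message": "db timeout"}] * 2
    + [{"status": 503, "message": "upstream unavailable"}]
)
rate, by_status, top = error_breakdown(records)
print(f"error rate: {rate:.1%}")  # 3.0%, above the 1% threshold
print(dict(by_status))
print(top)
```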

Latency Alert

This alert fires when p99 latency exceeds 2 seconds.

Please investigate:
1. Find the slowest traces in the last 30 minutes
2. Identify which span is taking the longest (database, external API, etc.)
3. Check if latency is high for all endpoints or specific ones
4. Look at database query times
5. Check connection pool metrics
6. Look for any resource constraints (CPU throttling, memory pressure)
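
For readers unfamiliar with the threshold in this alert, p99 latency is the value below which 99% of requests complete. The Python sketch below computes it with the nearest-rank method over hypothetical sample data. The source does not state which percentile method Metoro uses, so treat the method, the `percentile` helper, and the sample values as assumptions.

```python
def percentile(values, pct):
    """Nearest-rank percentile; pct is an integer from 1 to 100."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    # Integer ceiling of pct * n / 100, which avoids float rounding issues.
    rank = (pct * len(ordered) + 99) // 100
    return ordered[max(rank, 1) - 1]


# Hypothetical latencies in seconds: 98 fast requests and 2 slow ones.
latencies = [0.1] * 98 + [2.5, 3.0]
p99 = percentile(latencies, 99)
print(p99, p99 > 2.0)  # 2.5 True: this sample would exceed the 2 second threshold
```

Because only the slowest 1% of requests set this value, a handful of slow traces can trip the alert even when median latency looks healthy, which is why the runbook starts from the slowest traces.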

Resource Alert

This alert fires when memory usage exceeds 80%.

Investigation steps:
1. Check which pods have the highest memory usage
2. Look at memory growth over the last hour
3. Find any memory-related error logs (OOM warnings)
4. Check if there are memory leaks (continuous growth pattern)
5. Look at the number of active connections and requests
6. Check if there were recent deployments with code changes
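
Step 4 asks for a "continuous growth pattern". One simple way to express that idea is sketched below in Python: memory samples count as leak-like when almost every sample is higher than the previous one and overall growth passes a minimum. The `looks_like_leak` helper and both thresholds are hypothetical choices for this illustration and do not describe how Guardian detects leaks.

```python
def looks_like_leak(samples, min_rising_fraction=0.9, min_growth=0.10):
    """Heuristic: mostly rising samples plus meaningful overall growth."""
    if len(samples) < 3 or samples[0] <= 0:
        return False
    deltas = [later - earlier for earlier, later in zip(samples, samples[1:])]
    rising_fraction = sum(1 for d in deltas if d > 0) / len(deltas)
    growth = (samples[-1] - samples[0]) / samples[0]
    return rising_fraction >= min_rising_fraction and growth >= min_growth


# Hypothetical memory usage in MiB, sampled every 10 minutes.
steady_climb = [500, 520, 545, 570, 600, 640]
sawtooth = [500, 600, 510, 610, 505, 600]  # rises and falls, e.g. with GC cycles

print(looks_like_leak(steady_climb))  # True
print(looks_like_leak(sawtooth))      # False
```

A sawtooth that keeps returning to the same baseline usually reflects normal allocation and garbage collection, while a steady climb is more consistent with a leak and is worth correlating with recent deployments.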