AI Runbooks - Metoro Documentation

Overview

AI Runbooks allow you to define a set of tasks that Guardian should perform when an alert fires. Instead of manually investigating alerts, you can configure Guardian to automatically gather relevant data, analyze the situation, and document findings.

How Runbooks Work

1. Alert Fires - A configured alert is triggered in Metoro.
2. Runbook Executes - Guardian automatically executes the runbook you've defined.
3. Investigation Document - Guardian creates a new document containing all the information gathered.
4. Notification - A link to the investigation document is included in the alert notification.
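
The four steps above can be pictured as a small pipeline. The sketch below is purely illustrative and makes several assumptions: Alert, InvestigationDocument, Notification, handle_alert_fired, and the example.invalid URL are hypothetical names, not part of any Metoro API, and the investigate callable only stands in for Guardian.

```python
from dataclasses import dataclass, field
from typing import Callable, List

# Hypothetical model of the flow; none of these names are Metoro APIs.

@dataclass
class Alert:
    name: str
    runbook: str  # natural-language instructions attached to the alert

@dataclass
class InvestigationDocument:
    alert_name: str
    findings: List[str] = field(default_factory=list)
    url: str = ""

@dataclass
class Notification:
    message: str
    investigation_url: str

def handle_alert_fired(alert: Alert,
                       investigate: Callable[[str], List[str]]) -> Notification:
    """Run the runbook, record the findings, and link them from the notification."""
    # Step 2: the runbook executes (investigate is a stand-in for Guardian).
    findings = investigate(alert.runbook)
    # Step 3: a new investigation document holds everything that was gathered.
    doc = InvestigationDocument(
        alert_name=alert.name,
        findings=findings,
        url=f"https://example.invalid/investigations/{alert.name}",  # placeholder URL
    )
    # Step 4: the alert notification carries a link to that document.
    return Notification(message=f"Alert fired: {alert.name}",
                        investigation_url=doc.url)

# Step 1: an alert fires (simulated here with a fake investigator).
notification = handle_alert_fired(
    Alert(name="high-error-rate", runbook="Find error logs in the last 30 minutes."),
    investigate=lambda runbook: [f"Ran: {runbook}"],
)
```

The point of the sketch is only the ordering: the investigation document exists before the notification is sent, which is why the notification can link to it.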

What Can Runbooks Do?

Guardian can perform a broad set of investigation tasks in your runbooks:

Find Logs

Search for relevant logs around the time of the alert, including error logs, warnings, and contextual information

Analyze Traces

Find and analyze traces related to the alert, including slow requests, errors, and dependency failures

Query Metrics

Gather relevant metrics like error rates, latency percentiles, throughput, and resource utilization

Check Dependencies

Analyze upstream and downstream service dependencies to identify cascading failures

Isolate Failing Pods

Identify which specific pods or instances are experiencing issues

Correlate Changes

Link issues to recent deployments or configuration changes

Creating a Runbook

Step 1: Create or Edit an Alert

1. Navigate to Alerts in the main navigation.
2. Create a new alert or edit an existing one.
3. In the alert configuration, find the AI Runbook section.

Step 2: Define the Runbook

Write instructions for what Guardian should investigate when the alert fires. Use natural language to describe what you want Guardian to do.

Simple Runbook

Find error logs for this service in the last 30 minutes.
Look for any new error patterns.
Check if there were any recent deployments.

Detailed Runbook

Find all error logs for this service in the last hour
Identify any new error messages that weren't present before
Get the p99 latency for this service over the last 2 hours
Check if there were any deployments in the last 4 hours
Look at upstream dependencies and check their error rates
Find any pods that are consuming excessive memory or CPU
Summarize the likely root cause

Specific Investigation

This alert fires when our payment processing latency exceeds 500ms.

Please investigate:
- Check the database connection pool metrics
- Look for slow database queries in the traces
- Check if the payment gateway dependency is responding slowly
- Look for any error logs related to timeouts
- Check if any pods are being throttled

Step 3: Enable Guardian for the Alert

1. Toggle Enable Guardian AI for the alert.
2. Save the alert configuration.

Viewing Runbook Results

When an alert fires and the runbook executes:

Alert Notification - The alert notification includes a link to the investigation document
Investigations Page - View all investigation documents in Guardian → Investigations
Investigation Details - Click on an investigation to see:
- All data Guardian gathered
- Analysis and findings
- Recommended actions

Best Practices

Be Specific

The more specific your runbook instructions, the better Guardian can investigate:

Less Effective

Check what's wrong with the service.

More Effective

Find error logs containing "connection refused" or "timeout"
Check the error rate for the /api/checkout endpoint
Look at the response times for our Redis cache
Check if any pods are in CrashLoopBackOff
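
To make the contrast concrete, here is a rough sketch of a pre-save check that flags runbook lines naming no concrete signal. It is an assumption-laden heuristic, not a Metoro feature: SIGNAL_KEYWORDS and vague_lines are hypothetical names, and plain substring matching is crude (for example, "log" also matches "login").

```python
from typing import List

# Hypothetical keyword list; tune it to the signals your team cares about.
SIGNAL_KEYWORDS = (
    "log", "trace", "error rate", "latency", "response time",
    "pod", "deployment", "cpu", "memory",
)

def vague_lines(runbook: str) -> List[str]:
    """Return non-empty runbook lines that mention none of the signal keywords."""
    flagged = []
    for line in runbook.splitlines():
        text = line.strip()
        if text and not any(keyword in text.lower() for keyword in SIGNAL_KEYWORDS):
            flagged.append(text)
    return flagged

less_effective = "Check what's wrong with the service."
more_effective = "\n".join([
    'Find error logs containing "connection refused" or "timeout"',
    "Check the error rate for the /api/checkout endpoint",
    "Look at the response times for our Redis cache",
    "Check if any pods are in CrashLoopBackOff",
])

flagged_less = vague_lines(less_effective)  # the single vague line is flagged
flagged_more = vague_lines(more_effective)  # nothing is flagged
```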

Include Context

Give Guardian context about what the alert means:

This alert fires when the authentication service error rate exceeds 1%.

Our auth service depends on:
- PostgreSQL for user data
- Redis for session caching
- The identity-provider service for SSO

Please check all of these dependencies and look for the root cause.
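
If several alerts share the same shape of context, the runbook text can be assembled from a small template. The sketch below is one possible approach, assuming you keep the text in your own tooling and paste the result into the AI Runbook section; build_runbook and its parameters are hypothetical, not a Metoro API.

```python
from typing import Mapping

def build_runbook(alert_description: str,
                  service: str,
                  dependencies: Mapping[str, str],
                  request: str) -> str:
    """Compose runbook text: what the alert means, what the service depends on, what to do."""
    lines = [alert_description, "", f"Our {service} depends on:"]
    # One bullet per dependency, in the order given.
    lines += [f"- {name} for {purpose}" for name, purpose in dependencies.items()]
    lines += ["", request]
    return "\n".join(lines)

runbook_text = build_runbook(
    alert_description="This alert fires when the authentication service error rate exceeds 1%.",
    service="auth service",
    dependencies={
        "PostgreSQL": "user data",
        "Redis": "session caching",
        "The identity-provider service": "SSO",
    },
    request="Please check all of these dependencies and look for the root cause.",
)
```

With these inputs, runbook_text reproduces the example above line for line.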

Focus on Actionable Information

Ask for information that helps with resolution:

Find information that will help us resolve this issue quickly:
- Which specific endpoint or function is failing?
- What error messages are users seeing?
- Is this affecting all users or a subset?
- What changed recently that might have caused this?

Runbook Examples

High Error Rate Alert

This alert fires when the error rate exceeds 1% for more than 5 minutes.

Investigation steps:
Get the breakdown of errors by HTTP status code
Find the top 5 most frequent error messages in the logs
Identify which endpoints have the highest error rates
Check if any recent deployments correlate with the error spike
Look at dependency health (database, cache, external APIs)
Check pod health and resource utilization
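
For orientation, the first two steps of this runbook amount to simple counting. The sketch below shows the idea on made-up data; the record layout, the function names, and the choice to treat every status of 400 or above as an error are assumptions for illustration, not a description of how Guardian computes its breakdown.

```python
from collections import Counter

# Hypothetical request records: (HTTP status code, error message or "").
records = [
    (200, ""),
    (500, "connection refused"),
    (503, "upstream timeout"),
    (500, "connection refused"),
    (404, "not found"),
    (200, ""),
]

def errors_by_status(records):
    """Count error responses (status >= 400, by assumption) per status code."""
    return Counter(status for status, _ in records if status >= 400)

def top_error_messages(records, n=5):
    """Return the n most frequent non-empty error messages with their counts."""
    return Counter(message for _, message in records if message).most_common(n)

breakdown = errors_by_status(records)
top_messages = top_error_messages(records, n=5)
```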

Latency Alert

This alert fires when p99 latency exceeds 2 seconds.

Please investigate:
Find the slowest traces in the last 30 minutes
Identify which span is taking the longest (database, external API, etc.)
Check if latency is high for all endpoints or specific ones
Look at database query times
Check connection pool metrics
Look for any resource constraints (CPU throttling, memory pressure)
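
As a reminder of what the alert condition measures, p99 latency is the duration that 99% of requests stay at or below. The sketch below uses the nearest-rank definition on made-up trace durations; the percentile helper and the sample data are assumptions, and Metoro may use a different percentile estimator.

```python
import math

def percentile(values, pct):
    """Nearest-rank percentile: the smallest value with at least pct% of samples at or below it."""
    if not values:
        raise ValueError("values must not be empty")
    ordered = sorted(values)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[max(rank, 1) - 1]

# Hypothetical trace durations in milliseconds: 98 fast requests and 2 slow ones.
durations_ms = [120] * 98 + [2500, 3100]

p99_ms = percentile(durations_ms, 99)
alert_would_fire = p99_ms > 2000  # threshold from the runbook: 2 seconds
```

Two slow requests out of a hundred are enough to push p99 over the threshold here, which is why the runbook asks for the slowest traces first.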

Resource Alert

This alert fires when memory usage exceeds 80%.

Investigation steps:
Check which pods have the highest memory usage
Look at memory growth over the last hour
Find any memory-related error logs (OOM warnings)
Check if there are memory leaks (continuous growth pattern)
Look at the number of active connections and requests
Check if there were recent deployments with code changes
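
One way to read "continuous growth pattern" is that nearly every memory sample is higher than the one before it. The sketch below encodes that reading; looks_like_leak, the 0.9 cutoff, and the sample values are assumptions for illustration, not Guardian's detection logic.

```python
def looks_like_leak(samples_mb, min_increase_fraction=0.9):
    """Return True if nearly every consecutive sample is higher than the previous one."""
    if len(samples_mb) < 3:
        return False  # too few points to call it a pattern
    increases = sum(
        1 for previous, current in zip(samples_mb, samples_mb[1:])
        if current > previous
    )
    return increases / (len(samples_mb) - 1) >= min_increase_fraction

# Hypothetical memory usage samples (MB) taken over the last hour.
steady_growth = [100, 110, 120, 131, 140]  # rises at every step
normal_churn = [100, 140, 95, 130, 100]    # rises and falls

leak_suspected = looks_like_leak(steady_growth)
churn_suspected = looks_like_leak(normal_churn)
```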

Related Documentation

Guardian Overview

Learn about Guardian AI capabilities

Alerts

Configure alerts in Metoro

Inbox

View actionable items from Guardian
