Overview

Metoro’s autonomous issue detection workflow automatically identifies unusual patterns in your systems without requiring you to configure explicit alert thresholds first. When an anomaly is detected, Metoro investigates whether the behavior is expected noise or a real production issue and, if it is real, continues to the likely root cause. You can see all anomaly detection investigations under Guardian -> Agents -> Anomalies in Metoro. Click on an anomaly investigation to view its details and evidence.

How It Works

  1. Detection - Metoro continuously monitors your systems for anomalous behavior
  2. Investigation - When an anomaly is detected, Metoro automatically runs an investigation
  3. Analysis - Metoro determines whether the anomaly represents a real issue
  4. Notification - If an issue is confirmed, Metoro posts to Slack with its findings

Types of anomalies detected

Metoro anomaly detection currently covers:
  • 5XX error rate spikes
  • Pod failure spikes for:
    • CrashLoopBackOff
    • ImagePullBackOff
    • ErrImagePull
    • OOMKilled
    • Init:Error
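As an illustration of where these pod failure reasons come from, the sketch below collects them from a pod’s status as reported by the Kubernetes API (the same structure you see in `kubectl get pod -o json`). This is illustrative only, not Metoro’s internal detector; the function name is hypothetical.

```python
# Pod failure reasons monitored by anomaly detection, as they appear in
# Kubernetes container statuses (waiting or last-terminated state).
MONITORED_REASONS = {
    "CrashLoopBackOff",
    "ImagePullBackOff",
    "ErrImagePull",
    "OOMKilled",
    "Init:Error",
}


def failure_reasons(pod_status: dict) -> set[str]:
    """Collect monitored failure reasons from a pod status dict."""
    reasons = set()
    statuses = pod_status.get("containerStatuses", []) + pod_status.get(
        "initContainerStatuses", []
    )
    for cs in statuses:
        waiting = cs.get("state", {}).get("waiting", {})
        terminated = cs.get("lastState", {}).get("terminated", {})
        for reason in (waiting.get("reason"), terminated.get("reason")):
            if reason in MONITORED_REASONS:
                reasons.add(reason)
    return reasons
```

A spike in the rate of any of these reasons across a workload’s pods is what triggers a pod-failure anomaly investigation.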

Enabling anomaly detection

Step 1: Navigate to Settings

Go to Settings -> Features -> Anomaly Detection.

Step 2: Enable Anomaly Detection

Toggle Enable Anomaly Detection to activate the feature.

Step 3: Configure Detection Scope

Select which services and environments should have anomaly detection enabled:
  • Services - Choose specific services or select all
  • Environments - Choose specific environments (e.g., prod, staging)
We recommend starting with production environments to focus on the most impactful issues.

Configuring notifications

Autonomous issue detection uses the same flexible notification configuration as other AI SRE workflows.

Setting Up Notification Rules

  1. Navigate to Settings -> Features -> Autonomous Investigation
  2. Click Add Notification Configuration
  3. Configure:
    • Services - Which services should trigger notifications
    • Environments - Which environments should trigger notifications
    • Destination - Where to send notifications (Slack channel, webhook, etc.)

Example Configurations

Route anomalies for critical services to an incidents channel:
  • Services: payment-service, auth-service, checkout-service
  • Environments: prod
  • Destination: #incidents
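The matching behavior of a rule like the one above can be sketched as follows. This is a hypothetical model of the routing logic for illustration; the rule fields and function names are not Metoro’s API.

```python
# Hypothetical notification rules mirroring the example configuration:
# anomalies in the listed services and environments route to #incidents.
RULES = [
    {
        "services": {"payment-service", "auth-service", "checkout-service"},
        "environments": {"prod"},
        "destination": "#incidents",
    },
]


def destination_for(service: str, environment: str):
    """Return the destination of the first rule matching this anomaly,
    or None if no rule matches (no notification is sent)."""
    for rule in RULES:
        if service in rule["services"] and environment in rule["environments"]:
            return rule["destination"]
    return None
```

Under this model, a confirmed anomaly in payment-service in prod is posted to #incidents, while the same anomaly in staging matches no rule and sends nothing.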

How this differs from alerts

| Feature | Alerts | Anomaly Detection |
| --- | --- | --- |
| Configuration | You define thresholds | Automatic baseline learning |
| Trigger | Fixed thresholds | Statistical anomalies |
| Investigation | Manual or runbook | Automatic |
| Best for | Known failure modes | Unknown unknowns |
Anomaly Detection and Alerts are complementary. Use alerts for known failure modes with specific thresholds, and anomaly detection to catch unexpected issues.

Per-workload configuration

You can customize anomaly detection behavior for individual workloads using Kubernetes annotations. This allows you to fine-tune detection windows or disable detection entirely for specific services.

Available Annotations

| Annotation | Type | Default | Range | Description |
| --- | --- | --- | --- | --- |
| metoro.io/anomaly-detection-disabled | string | "false" | "true"/"false" | Disable anomaly detection for this workload |
| metoro.io/anomaly-detection-baseline-minutes | int | 30 | 5-30 | Baseline window for calculating normal behavior |
| metoro.io/anomaly-detection-evaluation-minutes | int | 5 | 1-10 | Evaluation window compared against baseline |
The evaluation window must be at most half the baseline window (e.g., if baseline is 10 minutes, evaluation can be at most 5 minutes). This ensures statistical validity of anomaly detection.
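The documented constraints (the allowed ranges and the half-the-baseline rule) can be expressed as a small validation check. This is an illustrative sketch, not Metoro’s implementation; the function name is hypothetical.

```python
def validate_windows(baseline_minutes: int, evaluation_minutes: int) -> None:
    """Raise ValueError if the annotation values violate the documented rules."""
    if not 5 <= baseline_minutes <= 30:
        raise ValueError("baseline window must be between 5 and 30 minutes")
    if not 1 <= evaluation_minutes <= 10:
        raise ValueError("evaluation window must be between 1 and 10 minutes")
    # The evaluation window must be at most half the baseline window.
    if evaluation_minutes * 2 > baseline_minutes:
        raise ValueError("evaluation window must be at most half the baseline")


validate_windows(30, 10)  # default-style config: valid
validate_windows(10, 5)   # shortest baseline with the largest allowed evaluation
```

For example, with a 10-minute baseline, a 6-minute evaluation window would be rejected even though 6 is within the 1-10 range on its own.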

Example: Disable Detection for a Service

For services with expected high error rates or batch jobs:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: batch-processor
  annotations:
    metoro.io/anomaly-detection-disabled: "true"
spec:
  # ...

Example: Shorter Detection Window

For services where you want faster detection at the cost of potentially more false positives:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api-gateway
  annotations:
    metoro.io/anomaly-detection-baseline-minutes: "10"
    metoro.io/anomaly-detection-evaluation-minutes: "2"
spec:
  # ...

Example: Longer Baseline for Stable Services

For stable services where you want to reduce noise:
apiVersion: apps/v1
kind: Deployment
metadata:
  name: core-service
  annotations:
    metoro.io/anomaly-detection-baseline-minutes: "30"
    metoro.io/anomaly-detection-evaluation-minutes: "10"
spec:
  # ...
Annotations can be placed in either metadata.annotations or spec.template.metadata.annotations. The former takes precedence if both are specified.
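The precedence rule above can be sketched as a lookup over the standard Deployment schema: deployment-level metadata.annotations win over spec.template.metadata.annotations. The function name is illustrative, not part of any real API.

```python
def effective_annotation(deployment: dict, key: str):
    """Resolve an annotation value, preferring metadata.annotations
    over spec.template.metadata.annotations."""
    top = deployment.get("metadata", {}).get("annotations", {})
    template = (
        deployment.get("spec", {})
        .get("template", {})
        .get("metadata", {})
        .get("annotations", {})
    )
    if key in top:
        return top[key]
    return template.get(key)
```

So if both levels set metoro.io/anomaly-detection-baseline-minutes, the value under metadata.annotations is the one that applies.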

Anomaly Detection Coverage Matrix

Anomaly detection in Metoro is driven by a set of detectors that monitor for anomalies. When a detector triggers, Metoro runs an investigation to determine whether the signal represents a real issue. We are always adding new detectors to improve coverage, and tuning existing ones to reduce the chance of false positives.
| Issue Type | Anomaly Detection Coverage |
| --- | --- |
| HTTP Server: 5XX error rate spike | Yes |
| HTTP Server: Request Rate Drop | No |
| HTTP Server: Request Rate Surge | No |
| HTTP Server: P50 Latency spike | Yes |
| HTTP Server: P90 Latency spike | Yes |
| HTTP Server: P95 Latency spike | No |
| HTTP Server: P99 Latency spike | No |
| External HTTP Dependencies: 5XX error rate spike | Yes |
| External HTTP Dependencies: Request Rate Drop | No |
| External HTTP Dependencies: Request Rate Surge | No |
| External HTTP Dependencies: P50 Latency spike | Yes |
| External HTTP Dependencies: P90 Latency spike | Yes |
| External HTTP Dependencies: P95 Latency spike | No |
| External HTTP Dependencies: P99 Latency spike | No |
| Database: Error rate spike | Alpha |
| Database: P95 Latency spike | Alpha |
| General Server (All Protocols): Error rate spike | Alpha |
| General Server (All Protocols): P50 Latency spike | Alpha |
| General Server (All Protocols): P90 Latency spike | Alpha |
| External Dependencies (All Protocols): Error rate spike | Alpha |
| External Dependencies (All Protocols): P50 Latency spike | Alpha |
| External Dependencies (All Protocols): P90 Latency spike | Alpha |
| Pod Failure: CrashLoopBackOff | Yes |
| Pod Failure: ImagePullBackOff | Yes |
| Pod Failure: ErrImagePull | Yes |
| Pod Failure: OOMKilled | Yes |
| Pod Failure: Init:Error | Yes |
| Pod Restart spike | No |
| Probe failure spike | No |
| Pod Scheduling: Pod Stuck in Pending | No |
| Unschedulable pod spike | No |
| Node-pressure eviction spike | No |
| Service Resource Usage: CPU Throttling | Development |
| Service Resource Usage: Network Send Rate | Development |
| Service Resource Usage: Network Receive Rate | Development |
| Service Resource Usage: Disk Write Rate | Development |
| Service Resource Usage: Disk Read Rate | Development |
| Service Resource Usage: Disk Usage | Development |
| Service Resource Allocation: CPU request too high | Yes - Advisor |
| Service Resource Allocation: CPU request too low | Yes - Advisor |
| Service Resource Allocation: CPU limit too low | Yes - Advisor |
| Service Resource Allocation: Memory request too high | Yes - Advisor |
| Service Resource Allocation: Memory request too low | Yes - Advisor |
| Service Resource Allocation: Memory limit too low | Yes - Advisor |
| Kubernetes Events: Cluster wide count of Warning Events | Alpha |
| Kubernetes Events: FailedScheduling | No |
| Kubernetes Events: BackOff | No |
| Kubernetes Events: FailedMount | No |
| Persistent Volume Claim (PVC) creation failure | No |
| Persistent Volume Claim (PVC) deletion failure | No |
| Persistent Volume Claim (PVC) Usage | No |
| PVC / volume unhealthy | No |
| Node Ready false / unknown | No |
| Node Resource Usage: CPU Throttling | No |
| Node MemoryPressure | No |
| Node DiskPressure | No |
| Node Resource Usage: Network Send Rate | Alpha |
| Node Resource Usage: Network Receive Rate | Alpha |
| API server not ready | No |
| Cluster Disk usage spike | No |
| Cluster Network error spike | No |
| Cluster CPU usage spike | Development |
| Cluster Memory usage spike | Development |
| Cluster Network TCP Retransmits | Development |
  • Yes - the issue type is monitored by anomaly detection today.
  • Beta - the detector is available in limited rollout or behind a feature flag while tuning continues.
  • Alpha - the detector is in active development and early validation before broader rollout.
  • Development - the detector is in active development but not yet validated for signal quality.
  • No - the issue type is not currently monitored by anomaly detection but is in the backlog for future development.
  • Advisor - the signal is surfaced in Advisor rather than kicking off an anomaly investigation.

Best practices

Start with Production

Focus anomaly detection on production environments first, where issues have the most impact.

Review investigation quality

Periodically review the investigations to ensure they’re finding real issues:
  • Are the anomalies significant?
  • Is the root cause analysis accurate?
  • Provide feedback to improve detection

Combine with alerts

Use both anomaly detection and alerts:
  • Alerts for critical thresholds you always want to know about
  • Anomaly detection for catching unexpected issues

Tune notification routing

Route notifications appropriately:
  • Critical services → dedicated incident channels
  • Non-critical services → general monitoring channels
