What is Metoro?

Metoro is a Kubernetes-native AI SRE platform that automatically detects and investigates production issues and identifies their root causes. It can investigate your alerts, follow runbooks, verify deployments, and suggest fixes for the issues it detects. Out of the box, Metoro collects logs, metrics, traces, profiles, and Kubernetes state with zero manual instrumentation, which gives Metoro’s AI agents full context for accurate root cause analysis instead of relying on partial telemetry. Teams that want to send custom OpenTelemetry traces or metrics to Metoro can do so by pointing their OpenTelemetry Collector at Metoro.

Metoro is also an observability platform that can replace Datadog, Grafana, and other tools for workloads running on Kubernetes, so teams can move from automated detection and investigation into direct debugging without switching tools. Setup takes less than 5 minutes with a single Helm install.

AI SRE workflows

Here are some of the key workflows Metoro supports:

Deployment Verification

Automatically detect deployments, compare pre- and post-deployment telemetry, and flag regressions with evidence.

Autonomous Issue Detection

Detect unusual behavior across telemetry, decide whether it is a real production issue, and continue to root cause automatically.

Alert Investigations

Investigate firing alerts with Metoro, separate noisy alerts from real incidents, and return a full RCA with supporting evidence. Use Metoro’s built-in alerting, or send alerts from other sources to Metoro for investigation.

Code Fixes

Connect GitHub so Metoro can inspect code changes for deployments, recommend and prepare fixes for production issues, and create pull requests for review.

Advisor

Review right-sizing, OOM, and CPU throttling findings for your services with detailed evidence and recurrence history.

AI Runbooks

Attach investigation instructions to alerts so Metoro gathers the context your team cares about most.

Assisted Debugging with AI

Ask Metoro to investigate or gather information for you.

Metoro MCP Server

Hook up Metoro’s MCP Server to your local agents to get production insights during development.
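
As a rough illustration of the pre- and post-deployment comparison described under Deployment Verification, the sketch below flags metrics whose average got noticeably worse after a rollout. This is not Metoro's actual algorithm. The function name `flag_regressions`, the metric names, and the 20% tolerance are all assumptions made for the example, and it assumes higher values are worse for every metric.

```python
from statistics import mean


def flag_regressions(pre, post, tolerance=0.2):
    """Return metrics whose post-deployment mean is more than `tolerance`
    (as a relative increase) above the pre-deployment mean.

    `pre` and `post` map a metric name to a list of samples. All names and
    thresholds are hypothetical; higher values are assumed to be worse.
    """
    findings = {}
    for name, before in pre.items():
        after = post.get(name)
        if not before or not after:
            continue  # no data on one side, so nothing to compare
        base, current = mean(before), mean(after)
        if base > 0 and (current - base) / base > tolerance:
            # Keep both values so the finding carries its own evidence.
            findings[name] = {"before": base, "after": current}
    return findings


# Hypothetical telemetry windows around a deployment.
pre = {"p95_latency_ms": [100, 110, 90], "error_rate": [0.01, 0.01, 0.01]}
post = {"p95_latency_ms": [150, 160, 140], "error_rate": [0.01, 0.01, 0.01]}

regressions = flag_regressions(pre, post)
print(regressions)
```

Returning the before and after values alongside each flagged metric mirrors the idea of flagging regressions "with evidence" rather than returning a bare pass or fail.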

Observability coverage

You will get access to the following observability data and features when you install Metoro:

Logs

Centralized logs with fast search, structured parsing, transformations and log metric visualizations.

Metrics

Cluster, node, pod, and service metrics out of the box, plus support for custom application metrics.

Traces

Automatic zero-instrumentation traces for common protocols, with OpenTelemetry support for custom tracing.

Profiling

On-CPU profiling for all containers so you can see hot code paths to find bottlenecks.

Kubernetes State

View your full cluster state over time to correlate runtime issues with cluster changes. It is similar to having k9s with time travel enabled, so you can see historical cluster state.

Dashboards

Custom dashboards that can be created from templates or built from scratch.

Alerting

Automatically set up alerts for your cluster in 5 minutes with AI-powered alert suggestions.

Resource Optimization

Optimize your Kubernetes resources with Metoro’s Advisor. Find workloads that are underutilized, overutilized, or experiencing high throttling. Get resource sizing recommendations based on historical usage.

Uptime Monitoring

Set up uptime monitoring for your endpoints and receive alerts when they go down.

Status Pages

Create custom status pages for your services from any metric or uptime monitor in your account.
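
To make the idea of sizing recommendations based on historical usage (see Resource Optimization) more concrete, here is a minimal sketch using a nearest-rank percentile plus headroom. It is an assumption about how such a recommendation could be computed, not a description of how Metoro's Advisor works. The function names, the 95th percentile, the 20% headroom, and the utilization thresholds are all hypothetical, and throttling is not modeled.

```python
import math


def percentile_usage(samples, percentile=95):
    """Nearest-rank percentile of historical usage samples."""
    if not samples:
        raise ValueError("need at least one usage sample")
    ordered = sorted(samples)
    rank = max(1, math.ceil(percentile * len(ordered) / 100))
    return ordered[rank - 1]


def recommend_request(samples, percentile=95, headroom_pct=20):
    """Suggest a resource request: percentile usage plus headroom."""
    return percentile_usage(samples, percentile) * (100 + headroom_pct) / 100


def classify(current_request, samples, low=0.5, high=0.9):
    """Label a workload by comparing percentile usage to its request."""
    utilization = percentile_usage(samples) / current_request
    if utilization < low:
        return "underutilized"
    if utilization > high:
        return "overutilized"
    return "ok"


# Hypothetical CPU usage history in millicores: 1, 2, ..., 100.
history = list(range(1, 101))

print(recommend_request(history))       # suggested request in millicores
print(classify(400, history))           # request far above observed usage
print(classify(100, history))           # request barely above observed usage
```

Using a high percentile instead of the maximum keeps a single spike from inflating the recommendation, while the headroom leaves room for normal variation.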

Why teams use Metoro

Teams typically adopt Metoro because:
  • Their engineers spend too much time debugging production issues across disconnected dashboards, logs, alerts, traces, and infrastructure tools.
  • Correlating different signals and getting from symptom to root cause is often slow, manual, and dependent on whoever knows the system best.
  • They want to reduce MTTR by catching regressions earlier, investigating faster, and correlating telemetry, infrastructure state, and code changes in one workflow.
  • They want to reduce noisy alert work by letting AI agents filter noise, investigate real issues, and hand teams supporting evidence instead of raw symptoms.