OOM Detection

The OOM Detection workflow monitors your Kubernetes services for Out of Memory (OOM) events and creates an issue when a service experiences them. This helps you identify memory-related problems in your services and take corrective action.

How it Works

The workflow monitors the container_oom_kills_total metric, which is incremented each time a container in your service is killed due to an Out of Memory condition. When a service experiences at least the configured number of OOM events (minOOMEventsToCreateIssue), an issue is created with details about the events.
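The counting step can be sketched as follows. This is a minimal illustration, not the workflow's actual implementation: it assumes the metric is available as a list of counter samples ordered by time, and the function names `count_oom_events` and `should_create_issue` are hypothetical.

```python
from typing import Sequence


def count_oom_events(samples: Sequence[float]) -> int:
    """Return the total increase of a counter across ordered samples.

    Assumption: a drop between two samples means the counter restarted
    from zero (for example after a container restart), so the whole
    current value is counted as new events.
    """
    total = 0.0
    for prev, curr in zip(samples, samples[1:]):
        total += curr - prev if curr >= prev else curr
    return int(total)


def should_create_issue(
    samples: Sequence[float], min_oom_events_to_create_issue: int = 1
) -> bool:
    """True when the window holds at least the configured number of events."""
    return count_oom_events(samples) >= min_oom_events_to_create_issue


if __name__ == "__main__":
    window = [0, 1, 1, 3]  # hypothetical samples of container_oom_kills_total
    print(count_oom_events(window), should_create_issue(window))
```

Counting the increase rather than reading the latest value keeps the result tied to the observation window, since a counter only reports a running total.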

Configuration

The workflow can be configured with the following parameters:

Parameter	Type	Description	Default
`minOOMEventsToCreateIssue`	integer	Minimum number of OOM events required to create an issue	1

Issue Details

When an issue is created, it includes:

- The service and environment where OOM events occurred
- The number of OOM events in the last 24 hours
- The severity level (high if the OOM count is at least 10x the minimum threshold, otherwise medium)
- A visualization showing:
  - OOM events over time
  - Memory usage patterns
  - Memory limits and requests

Example Issue

Here’s an example of an issue created by the OOM Detection workflow:

Title: OOMs Detected: my-service (production)

Service my-service (production environment) has experienced 10 OOM events in the last 24 hours.
High severity as the service experienced at least 10x the minimum number of OOM events.
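As a rough sketch of how the fields of such an issue fit together, the snippet below builds a title and description in the same shape as the example. The `OOMIssue` class and its field names are hypothetical and are not part of the product's API.

```python
from dataclasses import dataclass


@dataclass
class OOMIssue:
    """Hypothetical container for the fields shown in the example issue."""

    service: str
    environment: str
    oom_count: int  # OOM events observed in the last 24 hours

    @property
    def title(self) -> str:
        return f"OOMs Detected: {self.service} ({self.environment})"

    @property
    def description(self) -> str:
        return (
            f"Service {self.service} ({self.environment} environment) has "
            f"experienced {self.oom_count} OOM events in the last 24 hours."
        )


if __name__ == "__main__":
    issue = OOMIssue(service="my-service", environment="production", oom_count=5)
    print(issue.title)
    print(issue.description)
```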

Severity Levels

The workflow assigns severity levels based on the number of OOM events:

- Medium: When the number of OOM events meets or exceeds minOOMEventsToCreateIssue
- High: When the number of OOM events is at least 10x minOOMEventsToCreateIssue
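The two rules above can be expressed as a small function. This is a sketch under the assumption that "10x or more" means a count greater than or equal to ten times the threshold; the function name and the constant are illustrative only.

```python
from typing import Optional

# Multiplier taken from the severity rules above.
HIGH_SEVERITY_MULTIPLIER = 10


def classify_severity(
    oom_count: int, min_oom_events_to_create_issue: int = 1
) -> Optional[str]:
    """Return "high", "medium", or None when no issue would be created."""
    if oom_count < min_oom_events_to_create_issue:
        return None  # below the threshold, so no issue at all
    if oom_count >= HIGH_SEVERITY_MULTIPLIER * min_oom_events_to_create_issue:
        return "high"
    return "medium"


if __name__ == "__main__":
    for count in (0, 1, 9, 10):
        print(count, classify_severity(count))
```

With the default threshold of 1, a service needs 10 OOM events to reach high severity. Raising the threshold to 3 moves that boundary to 30.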

Best Practices

- Set Appropriate Thresholds: Configure minOOMEventsToCreateIssue based on your service’s characteristics. A lower threshold is more sensitive but may generate more issues.
- Monitor Memory Usage: Use the issue details view to understand memory usage patterns leading up to OOM events. Look for:
  - Memory usage approaching limits
  - Sudden spikes in memory usage
  - Inadequate memory limits or requests
- Regular Review: Regularly review OOM issues to identify patterns and systemic problems in your services.
- Memory Management: When OOM issues are detected:
  - Review and adjust memory limits
  - Look for memory leaks
  - Consider implementing memory optimization strategies
  - Monitor memory usage trends
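When reviewing memory usage by hand, the two patterns mentioned above (usage approaching the limit and sudden spikes) can be checked with a short script. This is only a sketch: the function name, the 0.9 "near limit" ratio, and the 1.5x spike ratio are assumptions chosen for illustration, not values used by the workflow.

```python
from typing import Dict, List, Sequence, Union


def memory_pressure_report(
    usage_bytes: Sequence[float],
    limit_bytes: float,
    near_limit_ratio: float = 0.9,  # assumed cutoff for "approaching the limit"
    spike_ratio: float = 1.5,  # assumed growth between samples that counts as a spike
) -> Dict[str, Union[float, bool, List[int]]]:
    """Summarize how close usage came to the limit and where it jumped."""
    if not usage_bytes:
        raise ValueError("usage_bytes must contain at least one sample")
    if limit_bytes <= 0:
        raise ValueError("limit_bytes must be positive")

    peak_ratio = max(usage_bytes) / limit_bytes
    # Indexes of samples that grew by at least spike_ratio over the previous one.
    spike_indexes = [
        i
        for i in range(1, len(usage_bytes))
        if usage_bytes[i - 1] > 0
        and usage_bytes[i] / usage_bytes[i - 1] >= spike_ratio
    ]
    return {
        "peak_ratio": peak_ratio,
        "near_limit": peak_ratio >= near_limit_ratio,
        "spike_indexes": spike_indexes,
    }


if __name__ == "__main__":
    # Hypothetical samples in MiB against a 256 MiB limit.
    print(memory_pressure_report([100, 110, 240, 250], limit_bytes=256))
```

A peak ratio that stays near 1.0 suggests the limit is too tight for normal operation, while isolated spikes point toward a specific workload or a leak worth investigating.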

Related Documentation

- Issue Detection Overview
- Right-Sizing Workflow
- CPU Throttling Detection
