The OOM Detection workflow monitors your Kubernetes services for Out of Memory (OOM) events and creates an issue when a service reaches the configured number of events. This helps you identify memory-related problems in your services and take corrective action.

How it Works

The workflow monitors the container_oom_kills_total metric, which is incremented each time a container in your service is killed due to an Out of Memory condition. When the number of OOM events for a service reaches the configured minimum (minOOMEventsToCreateIssue), an issue is created with details about the events.
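The counting step can be sketched as follows. This is a minimal illustration, not the workflow's actual implementation: it assumes the metric is available as a list of counter samples over some window, and the function names count_oom_events and should_create_issue are hypothetical. Because container_oom_kills_total is a counter, a drop between two samples is treated here as a reset (for example after a restart).

```python
from typing import Sequence


def count_oom_events(samples: Sequence[float]) -> int:
    """Return the total increase of a counter across a window of samples.

    Assumption: a decrease between consecutive samples means the counter
    was reset, so the new value is counted as fresh events.
    """
    total = 0.0
    for prev, curr in zip(samples, samples[1:]):
        if curr >= prev:
            total += curr - prev  # normal monotonic increase
        else:
            total += curr  # counter reset: count from zero again
    return int(total)


def should_create_issue(oom_events: int, min_oom_events_to_create_issue: int = 1) -> bool:
    """An issue is warranted once the event count reaches the configured minimum."""
    return oom_events >= min_oom_events_to_create_issue


# Hypothetical samples of container_oom_kills_total for one service.
samples = [3, 3, 4, 1, 2]
events = count_oom_events(samples)
print(events, should_create_issue(events))
```

Handling resets explicitly avoids negative deltas when a container restarts and its counter starts again from zero.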

Configuration

The workflow can be configured with the following parameters:

Parameter                  Type     Description                                               Default
minOOMEventsToCreateIssue  integer  Minimum number of OOM events required to create an issue  1
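As an illustration of how this parameter might be represented and validated in code, here is a small sketch. The class name OOMDetectionConfig and the from_mapping helper are hypothetical, and rejecting values below 1 is an assumption; the documentation only states the type and the default.

```python
from dataclasses import dataclass
from typing import Any, Mapping


@dataclass(frozen=True)
class OOMDetectionConfig:
    """Hypothetical container for the workflow's single parameter."""

    min_oom_events_to_create_issue: int = 1  # documented default

    def __post_init__(self) -> None:
        value = self.min_oom_events_to_create_issue
        # bool is a subclass of int in Python, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, int):
            raise TypeError("minOOMEventsToCreateIssue must be an integer")
        # Assumption: a threshold below 1 is not meaningful.
        if value < 1:
            raise ValueError("minOOMEventsToCreateIssue must be at least 1")

    @classmethod
    def from_mapping(cls, raw: Mapping[str, Any]) -> "OOMDetectionConfig":
        """Build a config from a parsed settings mapping, applying the default."""
        return cls(raw.get("minOOMEventsToCreateIssue", 1))


print(OOMDetectionConfig.from_mapping({"minOOMEventsToCreateIssue": 3}))
print(OOMDetectionConfig.from_mapping({}))
```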

Issue Details

When an issue is created, it includes:

  • The service and environment where OOM events occurred
  • The number of OOM events in the last 24 hours
  • The severity level (high if the OOM count is at least 10x the minimum threshold, otherwise medium)
  • A visualization showing:
    • OOM events over time
    • Memory usage patterns
    • Memory limits and requests

Example Issue

Here’s an example of an issue created by the OOM Detection workflow:

Title: OOMs Detected: my-service (production)

Service my-service (production environment) has experienced 12 OOM events in the last 24 hours.
High severity as the service experienced > 10x the minimum number of OOM events.

Severity Levels

The workflow assigns severity levels based on the number of OOM events:

  • Medium: The number of OOM events meets or exceeds minOOMEventsToCreateIssue but is below 10x that value
  • High: The number of OOM events is at least 10x minOOMEventsToCreateIssue
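The severity rule can be expressed as a short function. This is a sketch under the rules stated in this section, not the workflow's source; the name classify_severity and the choice to return None below the threshold are assumptions made for illustration.

```python
from typing import Optional

# Multiplier taken from the "10x" rule described in this section.
HIGH_SEVERITY_MULTIPLIER = 10


def classify_severity(oom_events: int, min_oom_events_to_create_issue: int = 1) -> Optional[str]:
    """Map an OOM event count to a severity level.

    Returns None when the count is below the threshold, meaning no issue
    would be created at all (an assumption of this sketch).
    """
    if oom_events < min_oom_events_to_create_issue:
        return None
    if oom_events >= HIGH_SEVERITY_MULTIPLIER * min_oom_events_to_create_issue:
        return "high"
    return "medium"


# With a threshold of 3, high severity starts at 30 events.
for count in (2, 3, 29, 30):
    print(count, classify_severity(count, min_oom_events_to_create_issue=3))
```

Note that raising minOOMEventsToCreateIssue also raises the point at which an issue becomes high severity, since that boundary is a multiple of the threshold.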

Best Practices

  1. Set Appropriate Thresholds: Configure minOOMEventsToCreateIssue based on your service’s characteristics. A lower threshold is more sensitive but may generate more issues.

  2. Monitor Memory Usage: Use the issue details view to understand memory usage patterns leading up to OOM events. Look for:

    • Memory usage approaching limits
    • Sudden spikes in memory usage
    • Inadequate memory limits or requests

  3. Regular Review: Review OOM issues on a regular schedule to identify recurring patterns and systemic problems in your services.

  4. Memory Management: When OOM issues are detected:

    • Review and adjust memory limits
    • Look for memory leaks
    • Consider implementing memory optimization strategies
    • Monitor memory usage trends
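The memory usage checks listed under practice 2 can also be approximated in code when you export the usage series yourself. The sketch below is illustrative only: the function name memory_warnings and the 0.9 and 1.5 thresholds are assumptions chosen for the example, not values used by the workflow.

```python
from typing import List, Sequence


def memory_warnings(
    usage_bytes: Sequence[float],
    limit_bytes: float,
    near_limit_ratio: float = 0.9,  # assumed: "approaching" means >= 90% of the limit
    spike_ratio: float = 1.5,       # assumed: a spike is a >= 50% jump between samples
) -> List[str]:
    """Flag memory usage patterns that often precede OOM events."""
    warnings: List[str] = []
    if not usage_bytes or limit_bytes <= 0:
        return warnings

    # Pattern 1: peak usage is close to the configured memory limit.
    if max(usage_bytes) >= near_limit_ratio * limit_bytes:
        warnings.append("usage approaching limit")

    # Pattern 2: usage jumps sharply from one sample to the next.
    for prev, curr in zip(usage_bytes, usage_bytes[1:]):
        if prev > 0 and curr >= spike_ratio * prev:
            warnings.append("sudden spike in usage")
            break

    return warnings


# Hypothetical usage samples (in MiB) against a 1000 MiB limit.
print(memory_warnings([100, 200, 950], limit_bytes=1000))
```

A steady climb toward the limit usually points to a leak or an undersized limit, while an isolated spike more often points to a specific request or batch job.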