How it Works
The workflow monitors the container_oom_kills_total metric, which is incremented each time a container in your service is killed due to an Out of Memory (OOM) condition. When the number of OOM events for a service reaches the configured threshold (minOOMEventsToCreateIssue), an issue is created with details about the events.
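As a rough illustration of the check described above, the sketch below counts OOM events from a series of cumulative counter readings and compares the total to the threshold. It is not the workflow's actual implementation: the function names are hypothetical, and it assumes that the samples are ordered readings of container_oom_kills_total within the evaluation window and that a drop in the counter means it restarted from zero (as is common for Prometheus-style counters).

```python
from typing import Sequence


def count_oom_events(samples: Sequence[float]) -> float:
    """Return the total increase of a cumulative counter across ordered samples.

    Assumption: a decrease between two samples means the counter was reset
    to zero, so the new value is counted in full.
    """
    total = 0.0
    for prev, curr in zip(samples, samples[1:]):
        if curr >= prev:
            total += curr - prev
        else:
            # Counter reset (e.g. the container restarted): count from zero.
            total += curr
    return total


def should_create_issue(samples: Sequence[float], min_oom_events: int = 1) -> bool:
    """True when the OOM count meets or exceeds the configured minimum."""
    return count_oom_events(samples) >= min_oom_events
```

With the default threshold of 1, a single increment of the counter within the window is enough for `should_create_issue` to return `True`.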
Configuration
The workflow can be configured with the following parameters:

Parameter | Type | Description | Default |
---|---|---|---|
minOOMEventsToCreateIssue | integer | Minimum number of OOM events required to create an issue | 1 |
Issue Details
When an issue is created, it includes:
- The service and environment where OOM events occurred
- The number of OOM events in the last 24 hours
- The severity level (high if the OOM count is at least 10x the minimum threshold, otherwise medium)
- A visualization showing:
  - OOM events over time
  - Memory usage patterns
  - Memory limits and requests
Example Issue
Here’s an example of an issue created by the OOM Detection workflow:

Severity Levels
The workflow assigns severity levels based on the number of OOM events:
- Medium: When the number of OOM events meets or exceeds minOOMEventsToCreateIssue
- High: When the number of OOM events is at least 10x minOOMEventsToCreateIssue
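The severity rules can be expressed as a small function. This is a minimal sketch under stated assumptions rather than the workflow's own code: the name `classify_severity` and the `None` return value for "no issue" are hypothetical, chosen only for illustration.

```python
from typing import Optional


def classify_severity(oom_count: int, min_oom_events: int = 1) -> Optional[str]:
    """Map an OOM event count to a severity level.

    Returns None when the count is below the threshold (no issue created),
    "high" when the count is at least 10x the threshold, else "medium".
    """
    if min_oom_events < 1:
        raise ValueError("min_oom_events must be a positive integer")
    if oom_count < min_oom_events:
        return None
    if oom_count >= 10 * min_oom_events:
        return "high"
    return "medium"
```

For example, with minOOMEventsToCreateIssue set to 5, a count of 49 events is classified as medium and a count of 50 as high.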
Best Practices
- Set Appropriate Thresholds: Configure minOOMEventsToCreateIssue based on your service’s characteristics. A lower threshold is more sensitive but may generate more issues.
- Monitor Memory Usage: Use the issue details view to understand memory usage patterns leading up to OOM events. Look for:
  - Memory usage approaching limits
  - Sudden spikes in memory usage
  - Inadequate memory limits or requests
- Regular Review: Regularly review OOM issues to identify patterns and systemic problems in your services.
- Memory Management: When OOM issues are detected:
  - Review and adjust memory limits
  - Look for memory leaks
  - Consider implementing memory optimization strategies
  - Monitor memory usage trends