Kubernetes Monitoring Best Practices: A Production SRE Checklist
A practical Kubernetes monitoring checklist for production teams covering service health, cluster signals, labels, dashboards, alerts, eBPF, and AI triage.
Kubernetes monitoring best practices are not about collecting every metric Kubernetes can emit. They are about noticing user pain early, proving what changed, and getting from alert to root cause without turning on-call into archaeology.
This post is deliberately narrow: a production checklist for teams that already run Kubernetes and want monitoring that works during incidents.
flowchart TB
  Alert["Alert fires"]
  Symptom["User-facing symptom"]
  Service["Service health"]
  Runtime["Pod, node, and workload state"]
  Signals["Logs, traces, events, metrics"]
  Change["Deploy and config changes"]
  Cause["Likely root cause"]
  Action["Rollback, fix, scale, or tune"]
  Alert --> Symptom
  Symptom --> Service
  Service --> Runtime
  Runtime --> Signals
  Signals --> Change
  Change --> Cause
  Cause --> Action
Quick Kubernetes Monitoring Checklist
Use this as the short version. Every row should have an owner, a dashboard, and a clear alerting policy.
| Practice | Why it matters | Signals to collect | Common mistake |
|---|---|---|---|
| Monitor service health first | Users feel slow or broken services before they feel node CPU | Request rate, errors, latency, saturation, SLO burn | Starting with node dashboards and hoping they explain app failures |
| Cover every layer | Kubernetes incidents move across layers quickly | Control plane, nodes, pods, services, network, dependencies | Monitoring pods but ignoring DNS, storage, ingress, and third-party calls |
| Standardize labels | Queries and alerts need stable ownership metadata | service, team, environment, cluster, namespace, version | Letting every team invent a different label vocabulary |
| Use RED and USE together | App symptoms and resource pressure answer different questions | Rate, errors, duration, utilization, saturation, errors | Alerting on CPU without knowing whether users are affected |
| Correlate signals | Incidents rarely fit into one telemetry type | Metrics, logs, traces, events, deploys, configs | Buying separate tools that cannot pivot between signals |
| Keep dashboards operational | Dashboards should answer on-call questions fast | Golden paths, dependencies, rollouts, capacity, noisy alerts | Building decorative wallboards no one uses during a page |
| Alert on symptoms | Pages should represent urgent, actionable risk | SLO burn, high error rate, latency, failed rollouts, capacity risk | Paging on every threshold breach, restart, or warning event |
| Attach change context | Most incidents are caused by recent changes | Deploys, images, ConfigMaps, Secrets, HPA changes, node churn | Looking at metrics with no release timeline |
| Use eBPF for baseline coverage | Not every workload is instrumented well | Service maps, network calls, RED metrics, traces, profiling | Waiting for every team to add SDKs before monitoring production |
| Automate first-pass triage | Humans should review evidence, not gather it by hand | Alert, runbook, traces, logs, events, ownership, code context | Sending pages with no hypothesis or next step |
1. Monitor Service Health Before Cluster Health
Start with the thing users feel. For most services, that means what Google's SRE book calls the four golden signals: latency, traffic, errors, and saturation. The names are old. The point is still right.
For an API, watch:
- Request rate by route, method, status code, and caller.
- Error rate split by user-visible failures and dependency failures.
- p50, p90, p95, and p99 latency, with failed requests separated.
- Saturation signals such as CPU throttling, queue depth, connection pool exhaustion, memory pressure, and worker lag.
Only after service health is visible should you walk down into cluster health. Node pressure matters. Pod restarts matter. Control plane health matters. But if a checkout API is serving 500s, the first question is not "what is node CPU?" It is "which user path is broken, when did it start, and what changed?"
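As a rough illustration of the API signals above, here is a minimal Python sketch that derives request rate, error ratio, and latency percentiles from raw request records. The `Request` type, the `red_summary` helper, and the choice to treat 5xx responses as failures are assumptions for this example. In a real cluster these numbers would usually come from histograms in your metrics backend rather than from in-memory lists.

```python
import math
from dataclasses import dataclass


@dataclass
class Request:
    route: str
    status: int
    duration_ms: float


def percentile(sorted_values, pct):
    """Nearest-rank percentile on a pre-sorted list; None when empty."""
    if not sorted_values:
        return None
    rank = max(1, math.ceil(pct / 100 * len(sorted_values)))
    return sorted_values[rank - 1]


def red_summary(requests, window_seconds):
    """Summarize rate, errors, and duration for one service and window."""
    total = len(requests)
    failed = [r for r in requests if r.status >= 500]
    # Failed requests are kept out of the latency percentiles so that
    # fast failures do not make the service look healthier than it is.
    ok_durations = sorted(r.duration_ms for r in requests if r.status < 500)
    return {
        "rate_per_s": total / window_seconds,
        "error_ratio": len(failed) / total if total else 0.0,
        "p50_ms": percentile(ok_durations, 50),
        "p95_ms": percentile(ok_durations, 95),
        "p99_ms": percentile(ok_durations, 99),
    }
```

Separating failed requests from the latency series is the design choice worth copying. The percentile method itself is only one of several reasonable definitions.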
2. Cover The Kubernetes Layers That Actually Fail
A production Kubernetes monitoring setup needs more than pod CPU and memory.
Kubernetes itself exposes component metrics in Prometheus format from /metrics endpoints, and the kubelet exposes additional cAdvisor, resource, and probe metrics according to the Kubernetes observability docs.
Use those signals, but organize them by failure domain.
Monitor these layers:
- Control plane: API server latency and errors, scheduler health, controller-manager behavior, etcd health where available, API throttling, admission webhooks, and managed control plane incidents.
- Nodes: readiness, memory pressure, disk pressure, inode pressure, network saturation, CPU pressure, filesystem growth, and kernel-level drops. Kubernetes documents node-pressure eviction when memory, disk, or inodes cross thresholds.
- Workloads: pod phase, readiness, restarts, CrashLoopBackOff, ImagePullBackOff, OOMKilled, rollout status, replica availability, probe failures, and job completion.
- Services and ingress: status codes, latency, retries, TLS errors, route health, endpoint changes, and load balancer health.
- Network and dependencies: DNS errors, service-to-service latency, external API calls, database calls, queue latency, refused connections, and timeouts.
- Storage: PVC state, mount failures, attach errors, disk latency, IOPS saturation, and volume expansion issues.
The resource metrics pipeline is useful for HPA and kubectl top, but Kubernetes is explicit that it only provides the minimum CPU and memory metrics for autoscaling.
That is not enough for incident response.
3. Standardize Labels And Ownership Metadata Early
Bad labels make good telemetry useless.
Kubernetes says labels are key-value pairs used to organize and select objects, and the recommended app.kubernetes.io labels exist so tools can understand applications consistently.
That guidance is practical, not academic.
During an incident, you need to answer:
- Which team owns this service?
- Which deploy changed it?
- Which environment and cluster are affected?
- Which namespace, workload, pod, node, and container produced the signal?
- Which customer, region, or tenant is affected if your system supports that split?
At minimum, enforce these across metrics, logs, traces, and events:
app.kubernetes.io/name: checkout
app.kubernetes.io/instance: checkout-prod
app.kubernetes.io/version: '2026.05.20.3'
app.kubernetes.io/component: api
app.kubernetes.io/part-of: storefront
app.kubernetes.io/managed-by: helm
team: payments
environment: production
Do not wait to clean this up later. Later means after a page, when every query has a different spelling for the same service.
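One way to enforce this is a small check in CI or an admission step. The sketch below is a hedged Python example that assumes the manifest has already been parsed into a dictionary. The `REQUIRED_LABELS` tuple mirrors the list above, and `missing_labels` is a hypothetical helper name.

```python
REQUIRED_LABELS = (
    "app.kubernetes.io/name",
    "app.kubernetes.io/instance",
    "app.kubernetes.io/version",
    "app.kubernetes.io/component",
    "app.kubernetes.io/part-of",
    "app.kubernetes.io/managed-by",
    "team",
    "environment",
)


def missing_labels(manifest):
    """Return required label keys that are absent or empty on a manifest."""
    labels = (manifest.get("metadata") or {}).get("labels") or {}
    return [key for key in REQUIRED_LABELS if not labels.get(key)]
```

Failing a deploy when `missing_labels` returns a non-empty list is stricter than a warning, but it keeps the vocabulary from drifting.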
4. Use RED And USE, Plus Kubernetes-Specific Signals
RED metrics tell you how the service is behaving:
- Rate: how much work is arriving.
- Errors: how much work is failing.
- Duration: how long work takes.
USE metrics tell you whether a resource is stressed:
- Utilization: how busy it is.
- Saturation: how much work is queued or throttled.
- Errors: whether the resource is failing.
Kubernetes adds its own signals:
- Pod readiness and availability.
- Restarts, OOM kills, and probe failures.
- Scheduling failures and pending pods.
- HPA decisions and scaling lag. The HPA controller scales from observed metrics, but missing metrics and not-yet-ready pods can dampen scaling behavior.
- Deployment progress, rollback events, image changes, and config changes.
- Kubernetes events. The Event API docs describe events as best-effort supplemental data with limited retention, so persist them if you want them during post-incident review.
You want both views. RED says "checkout p95 latency doubled." USE plus Kubernetes state says "new pods are throttled, two are not ready, and HPA is waiting on metrics."
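To make the "both views" idea concrete, here is a small Python sketch that puts a RED symptom next to USE and Kubernetes state for the same service. The `PodState` type, the `combined_view` helper, and the 25 percent throttling threshold are assumptions for illustration. The period counts are assumed to be deltas taken from cAdvisor counters such as `container_cpu_cfs_throttled_periods_total` and `container_cpu_cfs_periods_total`.

```python
from dataclasses import dataclass


@dataclass
class PodState:
    name: str
    ready: bool
    throttled_periods: int  # CFS periods in which the container was throttled
    total_periods: int      # all CFS periods observed in the same window


def throttle_ratio(pod):
    """Fraction of CPU scheduling periods in which the pod was throttled."""
    if pod.total_periods == 0:
        return 0.0
    return pod.throttled_periods / pod.total_periods


def combined_view(p95_ms, baseline_p95_ms, pods, throttle_threshold=0.25):
    """Pair the RED symptom (latency) with USE and Kubernetes state."""
    ratio = p95_ms / baseline_p95_ms if baseline_p95_ms > 0 else None
    return {
        "latency_ratio": ratio,
        "throttled_pods": [
            p.name for p in pods if throttle_ratio(p) > throttle_threshold
        ],
        "not_ready_pods": [p.name for p in pods if not p.ready],
    }
```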
5. Collect Logs, Traces, Events, And Metrics Together
Metrics detect many problems. They rarely explain the whole problem.
Kubernetes logging also has a trap: container logs are easy to access with kubectl logs, but Kubernetes does not provide a native cluster-level log storage backend.
The logging architecture docs recommend separate storage and commonly use a node-level logging agent, usually as a DaemonSet.
That gives you log collection. It does not give you incident context by itself.
A good setup lets you pivot:
- From an alert to the affected service.
- From the service to the exact pod, node, image, and rollout.
- From the pod to logs, traces, events, and resource metrics in the same time window.
- From a slow endpoint to the downstream service, database, queue, or external API.
- From a deploy to the telemetry that changed after it shipped.
This is where Kubernetes observability matters. Not as a buzzword. As a navigation problem.
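The pivot can be sketched as a filter over one shared shape. This Python example assumes every signal, whatever its source, has been normalized to carry a service name and a timestamp. The `Signal` type, the `pivot` helper, and the ten-minute radius are hypothetical choices for this example.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Signal:
    kind: str  # for example "log", "trace", "event", "metric", "deploy"
    service: str
    timestamp: datetime
    summary: str


def pivot(signals, service, center, radius=timedelta(minutes=10)):
    """Group one service's signals by kind within a window around `center`."""
    start, end = center - radius, center + radius
    grouped = defaultdict(list)
    for signal in sorted(signals, key=lambda s: s.timestamp):
        if signal.service == service and start <= signal.timestamp <= end:
            grouped[signal.kind].append(signal)
    return dict(grouped)
```

The filter only works because every signal shares the same service name, which is the practical payoff of the label discipline in section 3.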
6. Keep Dashboards Operational, Not Decorative
A production dashboard should answer questions an on-call engineer asks under pressure.
Good dashboards show:
- Is the service healthy?
- Which routes, consumers, or tenants changed?
- Which dependencies are slow or failing?
- Which pods are serving the bad traffic?
- Did this start after a deployment, config change, autoscaler event, or node event?
- Is the problem isolated to one cluster, namespace, node pool, or availability zone?
- What is the next useful drilldown?
Bad dashboards show twenty charts because those metrics were easy to collect.
Prometheus and Grafana are excellent building blocks. Prometheus scrapes and stores metrics, Grafana visualizes them, and both are common in Kubernetes stacks. But dashboards stop being enough when the responder has to manually join metric labels, logs, trace IDs, pod state, and deploy history across five tabs.
Metoro Kubernetes dashboards and metrics is built around that gap: PromQL-compatible querying, Kubernetes templates, Grafana import, metrics, logs, traces, and resource state on the same canvas.
7. Alert On Symptoms And SLO Burn, Not Every Cause
Prometheus gives the clearest short version of alerting best practice: keep alerts simple, alert on symptoms, and avoid pages where there is nothing to do. The Prometheus alerting docs also recommend alerting on high latency and error rates as high up in the stack as possible.
That should shape Kubernetes alerting.
Page on:
- Sustained user-visible error rate.
- Latency SLO burn for a critical service.
- Failed critical jobs when the missed job will hurt users.
- Failed rollouts that reduce availability.
- DNS failure affecting live traffic.
- Capacity exhaustion that will become an outage soon.
- Repeated OOM kills or restarts tied to service impact.
Ticket or notify on:
- A single pod restart with no service impact.
- CPU above a generic threshold for a short period.
- A non-critical warning event.
- A namespace nearing quota days before impact.
- A low-priority deployment taking longer than usual.
Google's SRE workbook recommends multi-window burn-rate alerts because they catch fast budget burn while reducing false positives. For a 99.9 percent SLO, their starting page thresholds include 2 percent budget consumption in one hour and 5 percent in six hours.
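The arithmetic behind those thresholds is short enough to write down. The Python sketch below assumes a 30-day SLO period, and the function names are illustrative. A burn rate is the fraction of budget consumed, scaled by how many alert windows fit into the SLO period.

```python
SLO_PERIOD_HOURS = 30 * 24  # assumption: a 30-day SLO period


def burn_rate(budget_fraction, window_hours, period_hours=SLO_PERIOD_HOURS):
    """Burn rate that consumes `budget_fraction` of the budget in the window."""
    return budget_fraction * period_hours / window_hours


def should_page(long_error_ratio, short_error_ratio, slo_target, rate):
    """Multi-window check: both windows must exceed the burn-rate threshold."""
    threshold = (1 - slo_target) * rate
    return long_error_ratio > threshold and short_error_ratio > threshold
```

Under that assumption, 2 percent of budget in one hour is a burn rate of about 14.4, and 5 percent in six hours is about 6. For a 99.9 percent SLO, a 14.4x burn corresponds to an error ratio of roughly 1.44 percent. Requiring a short window to agree with the long one lets the page resolve quickly once the errors stop.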
8. Write Alerts Like A Runbook Entry
Alert quality is easiest to see in examples.
| Alert | Verdict | Why |
|---|---|---|
| CPU > 80% for pod checkout-abc123 for 5m | Bad page | It may be normal load, and it points at an individual pod rather than user impact |
| checkout p95 latency > 750ms and error budget burn > 14x for 5m and 1h | Good page | It maps to user pain, urgency, and an SLO |
| CrashLoopBackOff exists in namespace prod | Bad page | One crash loop can be harmless if no served path is affected |
| checkout ready replicas < desired replicas for 10m after rollout and 5xx rate > 2% | Good page | It combines rollout failure with service impact |
| Kubernetes Warning event count > 0 | Bad page | Events are noisy and best-effort; most need context before they deserve a human |
| CoreDNS error rate > 5% and checkout dependency timeouts increased for 10m | Good page | It connects platform failure to affected services |
Every page should include:
- Affected service and owning team.
- SLO or user symptom.
- Start time and current severity.
- Recent deploys and config changes.
- Top traces, logs, events, and dashboards.
- Known runbook or likely next action.
If an alert cannot include that context, keep it as a ticket until it can.
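That rule can be enforced mechanically before an alert reaches a pager. The following Python sketch assumes alerts arrive as dictionaries. The field names in `REQUIRED_CONTEXT` are hypothetical and simply mirror the list above.

```python
REQUIRED_CONTEXT = (
    "service",
    "team",
    "symptom",
    "started_at",
    "severity",
    "recent_changes",
    "evidence_links",
    "next_action",
)


def route_alert(alert):
    """Return ("page", []) when context is complete, else ("ticket", missing)."""
    missing = [field for field in REQUIRED_CONTEXT if not alert.get(field)]
    if missing:
        return "ticket", missing
    return "page", []
```

Returning the missing fields alongside the decision gives the alert owner a concrete list to fix before the alert is promoted to a page.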
9. Attach Deploy And Change Context To Every Signal
Kubernetes changes constantly. Pods churn. ReplicaSets rotate. Autoscalers move targets. ConfigMaps and Secrets change. Nodes drain and rejoin.
Monitoring without change context forces people to guess.
Persist and correlate:
- Deployment time, image, commit, author, and rollout status.
- ReplicaSet changes and rollback events.
- ConfigMap and Secret changes.
- HPA decisions and replica count changes.
- Node upgrades, drains, taints, and autoscaler events.
- Ingress, service, endpoint, and network policy changes.
The useful incident question is often not "what is the value of this metric?" It is "what changed five minutes before this metric moved?"
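Answering that question is a lookup once changes are persisted with timestamps. This Python sketch assumes a flat list of change records. The `Change` type, the `changes_before` helper, and the five-minute lookback are illustrative choices rather than a prescribed schema.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass(frozen=True)
class Change:
    kind: str    # for example "deploy", "configmap", "hpa", "node-drain"
    target: str  # the workload or object that changed
    at: datetime


def changes_before(changes, shift_at, lookback=timedelta(minutes=5)):
    """Changes that landed in the lookback window, most recent first."""
    window_start = shift_at - lookback
    recent = [c for c in changes if window_start <= c.at <= shift_at]
    return sorted(recent, key=lambda c: c.at, reverse=True)
```

Sorting most recent first reflects a common heuristic that the latest change is the first suspect, but it is a starting order, not a verdict.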
Metoro Kubernetes APM and Kubernetes logging both lean on that correlation: traces, logs, metrics, Kubernetes state, and deployment history should be part of the same investigation.
10. Use eBPF For Baseline Visibility
Manual instrumentation is still valuable. OpenTelemetry is the right standard for custom spans, metrics, logs, and vendor-neutral pipelines, and the OpenTelemetry Kubernetes docs exist because Kubernetes users need consistent observability tooling.
But production clusters always contain gaps:
- Services without SDKs.
- Third-party containers.
- Legacy apps.
- Jobs and internal tools no one instrumented.
- New services that shipped before telemetry was finished.
eBPF helps fill that baseline. The eBPF project describes eBPF as a way to run sandboxed programs in privileged kernel contexts without changing kernel source or loading kernel modules. For monitoring, that means you can capture useful runtime and network behavior from the node. The Grafana Beyla docs describe eBPF auto-instrumentation capturing RED metrics and trace spans without application code changes. Pixie similarly documents automatic Kubernetes telemetry without manual instrumentation.
That does not make SDKs obsolete. It means your default coverage is not blocked on every application team doing perfect instrumentation first.
11. Use AI For First-Pass Triage, Not Blind Autopilot
AI alert investigation is useful when it does the boring first ten minutes:
- Pull the alert and its thresholds.
- Identify the affected service, owner, and recent changes.
- Check service health, dependency behavior, and rollout state.
- Gather relevant traces, logs, Kubernetes events, and metrics.
- Compare the current incident to previous ones.
- Suggest likely root cause and next steps.
- Decide whether the alert is noisy, actionable, or missing context.
That is not magic. It is structured evidence gathering.
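Stripped of the model, the evidence-gathering part looks roughly like the Python sketch below. It assumes each data source is wrapped in a plain callable that takes the alert and returns whatever it found. The `run_triage` function and the collector names are hypothetical, and no particular AI or vendor API is implied.

```python
def run_triage(alert, collectors):
    """Run each evidence collector and record what came back and what did not."""
    evidence, gaps = {}, []
    for name, collect in collectors.items():
        try:
            result = collect(alert)
        except Exception as exc:  # a failing source should not stop triage
            gaps.append(f"{name}: {exc}")
            continue
        if result:
            evidence[name] = result
        else:
            gaps.append(f"{name}: no data")
    return {
        "alert": alert["name"],
        "evidence": evidence,
        "gaps": gaps,
        "needs_context": bool(gaps),
    }
```

Recording gaps explicitly matters as much as the evidence: a summary that hides which sources failed invites more confidence than the data supports.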
Metoro AI Alert Investigation investigates firing alerts with telemetry, deploy history, Kubernetes metadata, runbooks, and prior context. Metoro AI SRE extends the same workflow toward remediation: root cause, evidence, and proposed fixes or PRs when the next step is clear.
MTTR Work vs Telemetry Work
More telemetry can help. It can also make incidents slower if responders have to stitch it together manually.
| Reduces MTTR | Mostly adds telemetry |
|---|---|
| Alerts tied to SLO burn and service ownership | Alerts on every pod restart |
| Service maps with live latency, error, and request volume | Static architecture diagrams |
| Logs, traces, metrics, and events filtered to the same service and time window | Separate tools with different label schemes |
| Deployment and config changes on the incident timeline | Release notes in a different system |
| Dashboards built around incident questions | Dashboards built around every available metric |
| eBPF baseline coverage for uninstrumented workloads | Waiting for every service to add custom SDKs |
| AI triage that collects evidence and suggests next actions | AI summaries with no links back to data |
Example Alert-To-Root-Cause Workflow
Here is what the workflow should feel like.
- A page fires: checkout is burning error budget and p95 latency crossed the page threshold.
- The service dashboard shows latency rose after the last deployment.
- The service map shows new calls from checkout to pricing-cache.
- Traces show most slow requests wait on pricing-cache.
- Logs show connection pool exhaustion in the checkout pods.
- Kubernetes events show new pods are ready, but HPA scale-up lagged because metrics were briefly missing.
- The deploy timeline shows a pool-size change in the new image.
- The responder rolls back or patches the pool config, then keeps a ticket to tune the HPA and dashboard.
That is Kubernetes monitoring working. The page started at a user symptom. The system kept enough correlated evidence to find the cause. The next action was obvious.
The Short Version
Kubernetes monitoring best practices are mostly discipline:
- Watch user-facing service health before low-level infrastructure.
- Cover control plane, nodes, workloads, services, network, dependencies, logs, traces, and events.
- Keep labels and ownership metadata boring and consistent.
- Use RED and USE together.
- Build dashboards for incident questions.
- Page on symptoms, SLO burn, and imminent risk.
- Attach deploy and config context everywhere.
- Use eBPF to cover what manual instrumentation misses.
- Use AI to gather evidence and shorten triage, while humans stay in control.
Metoro is built for this model: Kubernetes APM, logs, dashboards and metrics, service maps, traces, events, eBPF telemetry, and AI alert investigation in one Kubernetes-native workflow.
FAQ
What are the most important Kubernetes monitoring best practices?
Start with user-facing service health, then monitor the cluster layers that explain it: control plane, nodes, pods, workloads, services, network, dependencies, logs, traces, events, and recent changes. Use consistent labels, alert on symptoms and SLO burn, and correlate signals in one workflow.
What should I monitor in Kubernetes?
Monitor request rate, error rate, latency, saturation, pod readiness, restarts, OOM kills, probe failures, deployment status, node pressure, scheduling failures, DNS health, ingress and egress latency, dependency errors, logs, traces, Kubernetes events, and autoscaler behavior.
What is the best way to alert on Kubernetes issues?
Page on user-visible symptoms and imminent risk: SLO burn, high error rate, high latency, unavailable services, failed critical jobs, failed rollouts, DNS failure affecting traffic, repeated OOM kills tied to impact, or capacity exhaustion. Use tickets for hygiene alerts and isolated low-impact events.
Are Prometheus and Grafana enough for Kubernetes monitoring?
Prometheus and Grafana can be enough for metrics and dashboards, especially for teams with strong platform engineering. They are not the whole incident workflow by themselves. Production teams still need logs, traces, Kubernetes events, deployment history, ownership metadata, alert routing, and fast correlation.
How does eBPF help Kubernetes monitoring?
eBPF can collect runtime and network telemetry from the node without requiring every workload to add an SDK. That helps teams see service dependencies, request behavior, RED metrics, traces, and profiling signals for services that are uninstrumented, third-party, or newly shipped.
How do Kubernetes observability best practices differ from monitoring best practices?
Monitoring watches known symptoms with dashboards and alerts. Observability helps investigate unknown causes by correlating metrics, logs, traces, profiles, Kubernetes events, resource state, deployment history, and ownership metadata. Good Kubernetes monitoring depends on enough observability context to explain alerts.