Taking Pods Offline

Use this runbook when you need to take bundled ClickHouse or ClickHouse Keeper pods offline during node maintenance, storage work, scaling, or manual troubleshooting. If ClickHouse is externally managed, follow the operational process for that ClickHouse service. The Metoro chart does not manage external ClickHouse pods.

Availability Rules

Avoid taking too many ClickHouse or Keeper pods offline at once.

ClickHouse Keeper must keep quorum. With the default three Keeper pods, keep at least two Keeper pods ready. Losing Keeper quorum can block ClickHouse coordination and replica management, which can block telemetry ingestion and user queries until enough Keeper pods are healthy again.

The ClickHouseInstallation needs at least one ClickHouse pod online. Running with one ClickHouse pod removes redundancy and reduces query and ingest capacity, but it keeps the telemetry store available. Taking every ClickHouse pod offline is a planned outage for ClickHouse-backed UI, API, and ingestion paths.

For maintenance, work one pod or node at a time. Wait for Keeper quorum and at least one ready ClickHouse pod before continuing to the next pod or node.
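
The "two of three" Keeper rule is an instance of majority quorum: more than half of the Keeper pods must be ready. The sketch below works through that arithmetic for other ensemble sizes. It assumes the usual majority rule applies to your Keeper deployment, and the function names are illustrative rather than part of the Metoro chart.

```shell
#!/bin/sh
# Majority quorum for an ensemble of N Keeper pods: floor(N/2) + 1.
keeper_quorum() {
  echo $(( $1 / 2 + 1 ))
}

# How many Keeper pods can be offline at once without losing quorum.
keeper_max_offline() {
  echo $(( $1 - ($1 / 2 + 1) ))
}

keeper_quorum 3       # prints 2
keeper_max_offline 3  # prints 1
```

With the default three Keeper pods, only one can be offline at a time, which is why the runbook works through one pod or node per step.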

Before Maintenance

Check ClickHouse pods:

kubectl -n metoro-hub get pods -l app=metoro-clickhouse

Check Keeper pods:

kubectl -n metoro-hub get pods -l app=metoro-clickhouse-keeper

Check the ClickHouseInstallation:

kubectl -n metoro-hub get chi metoro

Do not start maintenance if Keeper is already below quorum or if no ClickHouse pod is ready.
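
The go/no-go rule above can be turned into a scripted gate. The following is a minimal sketch, not a supported tool: `ready_count`, `preflight`, and the `MIN_KEEPER_READY` variable are hypothetical names, and the threshold of 2 assumes the default three Keeper pods. It uses only the namespace and labels shown in this runbook.

```shell
#!/bin/sh
# Hypothetical helper: count pods matching a label selector whose Ready
# condition is "True". Prints the count on stdout.
ready_count() {
  kubectl -n metoro-hub get pods -l "$1" \
    -o jsonpath='{range .items[*]}{.status.conditions[?(@.type=="Ready")].status}{"\n"}{end}' \
    | grep -c '^True$' || true
}

# Returns 0 only if it looks safe to start maintenance.
# MIN_KEEPER_READY defaults to 2 (assumption: default three Keeper pods).
preflight() {
  keeper_ready=$(ready_count app=metoro-clickhouse-keeper)
  clickhouse_ready=$(ready_count app=metoro-clickhouse)
  echo "keeper ready: $keeper_ready, clickhouse ready: $clickhouse_ready"
  [ "$keeper_ready" -ge "${MIN_KEEPER_READY:-2}" ] && [ "$clickhouse_ready" -ge 1 ]
}
```

After sourcing the file, run `preflight` and proceed only if it exits with status 0. If you run a different number of Keeper pods, set `MIN_KEEPER_READY` to match that ensemble's quorum.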

During Maintenance

Take one pod or one node offline at a time. After each step, confirm Keeper quorum and ClickHouse availability before moving on:

kubectl -n metoro-hub get pods -l app=metoro-clickhouse
kubectl -n metoro-hub get pods -l app=metoro-clickhouse-keeper

Expected state during routine maintenance:

At least two Keeper pods stay ready in the default three-Keeper deployment.
At least one ClickHouse pod stays ready.
ClickHouse-backed UI, API, and ingestion paths may have reduced capacity, but should not become fully unavailable.
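
Between steps, the "confirm before moving on" check can be automated with a polling loop instead of re-running the commands by hand. This is a sketch under assumptions: `wait_for_ready` is a hypothetical helper, and the 5-second poll interval and 300-second default timeout are arbitrary choices, not values from the Metoro chart.

```shell
#!/bin/sh
# Hypothetical helper: poll until at least MIN pods matching SELECTOR are
# Ready, or give up after TIMEOUT seconds (default 300).
# Usage: wait_for_ready SELECTOR MIN [TIMEOUT]
# Returns 0 on success, 1 on timeout.
wait_for_ready() {
  selector=$1; min=$2; timeout=${3:-300}; waited=0
  while :; do
    ready=$(kubectl -n metoro-hub get pods -l "$selector" \
      -o jsonpath='{range .items[*]}{.status.conditions[?(@.type=="Ready")].status}{"\n"}{end}' \
      | grep -c '^True$' || true)
    [ "$ready" -ge "$min" ] && return 0
    [ "$waited" -ge "$timeout" ] && return 1
    sleep 5
    waited=$(( waited + 5 ))
  done
}
```

For the default deployment, a step could be gated with `wait_for_ready app=metoro-clickhouse-keeper 2 && wait_for_ready app=metoro-clickhouse 1` before the next pod or node is touched.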

After Maintenance

Confirm all expected ClickHouse and Keeper pods are ready:

kubectl -n metoro-hub get pods -l app=metoro-clickhouse
kubectl -n metoro-hub get pods -l app=metoro-clickhouse-keeper

Check recent events if pods are slow to return:

kubectl -n metoro-hub get events --sort-by='.lastTimestamp'

The expected outcome is:

Keeper has quorum.
At least one ClickHouse pod stayed ready throughout the maintenance window, unless the work was planned as a ClickHouse outage.
All expected ClickHouse and Keeper pods return to ready.
Apiserver and Ingester recover automatically if ClickHouse readiness briefly changed.
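
The post-maintenance checks can also be combined into one step. The sketch below uses `kubectl wait` to block until the matching pods report Ready and falls back to listing Warning events on timeout. `confirm_recovered` and `WAIT_TIMEOUT` are hypothetical names, and the 300-second default is an assumption.

```shell
#!/bin/sh
# Hypothetical helper: block until every existing pod matching each selector
# is Ready. On timeout, print recent Warning events to help explain the delay.
confirm_recovered() {
  for selector in app=metoro-clickhouse app=metoro-clickhouse-keeper; do
    if ! kubectl -n metoro-hub wait --for=condition=Ready pod \
        -l "$selector" --timeout="${WAIT_TIMEOUT:-300s}"; then
      kubectl -n metoro-hub get events \
        --field-selector type=Warning --sort-by='.lastTimestamp'
      return 1
    fi
  done
}
```

Note that `kubectl wait` only considers pods that exist when it runs, so a pod that has not been recreated yet is not counted. Compare the pod list against the number of ClickHouse and Keeper pods you expect before treating the maintenance as complete.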
