Use this runbook when you need to take bundled ClickHouse or ClickHouse Keeper pods offline during node maintenance, storage work, scaling, or manual troubleshooting. If ClickHouse is externally managed, follow the operational process for that ClickHouse service. The Metoro chart does not manage external ClickHouse pods.

Availability Rules

Avoid taking too many ClickHouse or Keeper pods offline at once.
  • ClickHouse Keeper must keep quorum. With the default three Keeper pods, keep at least two Keeper pods ready. Losing Keeper quorum can block ClickHouse coordination and replica management, which can block telemetry ingestion and user queries until enough Keeper pods are healthy again.
  • The ClickHouseInstallation needs at least one ClickHouse pod online. Running with one ClickHouse pod removes redundancy and reduces query and ingest capacity, but it keeps the telemetry store available.
  • Taking every ClickHouse pod offline is a planned outage for ClickHouse-backed UI, API, and ingestion paths.
For maintenance, work one pod or node at a time. Wait for Keeper quorum and at least one ready ClickHouse pod before continuing to the next pod or node.
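The Keeper rule above is majority arithmetic: an ensemble of N Keeper pods needs floor(N/2) + 1 ready members, so the default three pods tolerate one pod offline. A minimal sketch of that calculation follows. The function names `keeper_quorum` and `keeper_tolerated_failures` are illustrative and are not part of the Metoro chart.

```shell
# Majority quorum for an ensemble of N voting members: floor(N/2) + 1.
keeper_quorum() {
  echo $(( $1 / 2 + 1 ))
}

# Number of members that can be offline while a majority still remains.
keeper_tolerated_failures() {
  echo $(( $1 - ($1 / 2 + 1) ))
}

keeper_quorum 3               # prints 2
keeper_tolerated_failures 3   # prints 1
```

If the Keeper replica count has been changed from the default, the same functions give the minimum number of ready pods to preserve during maintenance.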

Before Maintenance

Check ClickHouse pods:
kubectl -n metoro-hub get pods -l app=metoro-clickhouse
Check Keeper pods:
kubectl -n metoro-hub get pods -l app=metoro-clickhouse-keeper
Check the ClickHouseInstallation:
kubectl -n metoro-hub get chi metoro
Do not start maintenance if Keeper is already below quorum or if no ClickHouse pod is ready.
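The go/no-go rule above could be turned into a small pre-flight gate. The sketch below parses the default `kubectl get pods --no-headers` columns (NAME, READY, STATUS, ...). The helper names `count_ready` and `preflight_ok` are hypothetical, and the default minimum of two Keeper pods assumes the default three-Keeper deployment.

```shell
# Count pods whose READY column shows every container ready (for example 1/1)
# and whose STATUS is Running. Reads `kubectl get pods --no-headers` output on stdin.
count_ready() {
  awk '{ split($2, r, "/"); if (r[1] == r[2] && $3 == "Running") n++ } END { print n + 0 }'
}

# Succeed only if enough Keeper pods and at least one ClickHouse pod are ready.
# Arguments: <ready keeper pods> <ready clickhouse pods> [minimum keeper pods, default 2]
preflight_ok() {
  keeper_ready=$1
  clickhouse_ready=$2
  min_keeper=${3:-2}
  [ "$keeper_ready" -ge "$min_keeper" ] && [ "$clickhouse_ready" -ge 1 ]
}

# Example use against the cluster (left commented so the sketch has no side effects):
# keeper=$(kubectl -n metoro-hub get pods -l app=metoro-clickhouse-keeper --no-headers | count_ready)
# ch=$(kubectl -n metoro-hub get pods -l app=metoro-clickhouse --no-headers | count_ready)
# preflight_ok "$keeper" "$ch" || echo "Do not start maintenance"
```

Treating only `Running` pods with all containers ready as healthy means a pod that is already terminating does not count toward quorum.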

During Maintenance

Take one pod or one node offline at a time. After each step, confirm Keeper quorum and ClickHouse availability before moving on:
kubectl -n metoro-hub get pods -l app=metoro-clickhouse
kubectl -n metoro-hub get pods -l app=metoro-clickhouse-keeper
Expected state during routine maintenance:
  • At least two Keeper pods stay ready in the default three-Keeper deployment.
  • At least one ClickHouse pod stays ready.
  • ClickHouse-backed UI, API, and ingestion paths may have reduced capacity, but should not become fully unavailable.
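Between maintenance steps, the "confirm before moving on" check can be automated with a simple polling loop. This is a sketch under stated assumptions: `wait_until` and `ready_keepers` are hypothetical helpers, and the retry count and pause in the example are arbitrary values, not recommendations from the chart.

```shell
# wait_until MIN TRIES PAUSE CMD...: re-run CMD until it prints a number >= MIN.
# Returns 0 on success, 1 if the threshold was not reached after TRIES attempts.
wait_until() {
  min=$1; tries=$2; pause=$3
  shift 3
  i=0
  while [ "$i" -lt "$tries" ]; do
    n=$("$@")
    [ "${n:-0}" -ge "$min" ] && return 0
    i=$((i + 1))
    sleep "$pause"
  done
  return 1
}

# Hypothetical counter: Keeper pods that are Running with all containers ready.
ready_keepers() {
  kubectl -n metoro-hub get pods -l app=metoro-clickhouse-keeper --no-headers |
    awk '{ split($2, r, "/"); if (r[1] == r[2] && $3 == "Running") n++ } END { print n + 0 }'
}

# Example: wait up to 30 x 10 seconds for two ready Keeper pods before the next step.
# wait_until 2 30 10 ready_keepers || echo "Keeper quorum not restored; stop here"
```

The same loop works for ClickHouse pods by swapping the label selector and using a minimum of one.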

After Maintenance

Confirm all expected ClickHouse and Keeper pods are ready:
kubectl -n metoro-hub get pods -l app=metoro-clickhouse
kubectl -n metoro-hub get pods -l app=metoro-clickhouse-keeper
Check recent events if pods are slow to return:
kubectl -n metoro-hub get events --sort-by='.lastTimestamp'
The expected outcome is:
  • Keeper has quorum.
  • At least one ClickHouse pod stayed ready throughout the maintenance window, unless the work was planned as a ClickHouse outage.
  • All expected ClickHouse and Keeper pods return to ready.
  • Apiserver and Ingester recover automatically if ClickHouse readiness briefly changed.
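To spot stragglers after maintenance, it can help to print only the pods that are not yet ready. The sketch below assumes the default `kubectl get pods --no-headers` column layout, and `list_not_ready` is an illustrative name. It cannot detect a pod that is missing entirely, so still compare the pod count with the expected replica count.

```shell
# Print the names of pods that are not Running or do not have all containers ready.
# Reads `kubectl get pods --no-headers` output on stdin.
list_not_ready() {
  awk '{ split($2, r, "/"); if (r[1] != r[2] || $3 != "Running") print $1 }'
}

# Example use (commented out so the sketch has no side effects):
# for label in app=metoro-clickhouse app=metoro-clickhouse-keeper; do
#   kubectl -n metoro-hub get pods -l "$label" --no-headers | list_not_ready
# done
```

Empty output for both labels means every listed pod has returned to ready.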