2026-06-04 SigNoz ClickHouse Replication Queue CPU Burn

Context: Working on dotfiles (.config, main). Triggered by noticing the SigNoz docker compose stack eating a lot of CPU.

What I set out to do

Figure out why the SigNoz stack was burning CPU. Starting symptom was vague (“using a lot of CPU”), no specific signal in hand.

What I actually did

Worked it end to end: diagnose, fix, then add monitoring so it can’t recur silently.

Diagnosis. docker stats pinned the culprits immediately: signoz-clickhouse at 117% and signoz-zookeeper-1 at 120%, each over a full core, on a host already at load 14/16. But ClickHouse had zero active queries, zero merges, zero pending mutations, and only ~4 inserts/sec. No real workload. ZooKeeper’s mntr showed a huge lifetime proposal count but a live write rate of 0, which meant it was being hammered by reads, the signature of a tight retry loop. The smoking gun was system.replication_queue: GET_PART entries that had retried millions of times (samples_v4_agg_5m and _30m north of 5.5M tries each), failing with NO_REPLICA_HAS_PART. SigNoz ships every metrics/traces table as ReplicatedMergeTree, but this is a single node (total_replicas: 1), so a locally-lost part can never be fetched from anywhere and the entry retries forever, spinning ClickHouse and ZooKeeper.
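
The queue check above can be sketched as a small filter. This is a minimal illustration, not the exact commands that were run: the column names (type, num_tries, last_exception) are those of system.replication_queue, while the min_tries threshold, the helper name stuck_entries, and the assumption that the exception text contains the error name are mine. The sample rows are made up for the example.

```python
# Query to run with whatever ClickHouse client is at hand.
STUCK_QUEUE_SQL = """
SELECT database, table, type, new_part_name, num_tries, last_exception
FROM system.replication_queue
ORDER BY num_tries DESC
"""


def stuck_entries(rows, min_tries=1000):
    """Return queue rows that look like an endless GET_PART retry loop.

    rows: iterable of dicts keyed by system.replication_queue column names.
    min_tries: hypothetical threshold; anything retried this often is not
    going to succeed on its own.
    """
    return [
        r
        for r in rows
        if r["type"] == "GET_PART"
        and r["num_tries"] >= min_tries
        # Assumption: the exception text includes the error name.
        and "NO_REPLICA_HAS_PART" in r["last_exception"]
    ]


if __name__ == "__main__":
    # Illustrative rows only, not real query output.
    sample = [
        {"table": "samples_v4_agg_5m", "type": "GET_PART",
         "num_tries": 5_500_000, "last_exception": "... (NO_REPLICA_HAS_PART)"},
        {"table": "some_other_table", "type": "MERGE_PARTS",
         "num_tries": 0, "last_exception": ""},
    ]
    for row in stuck_entries(sample):
        print(row["table"], row["num_tries"])
```

On a single-node setup, any row this returns is effectively permanent: there is no second replica to fetch the part from.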

Fix. Rebuilt the replica for each of the 10 affected tables, in this order: DETACH TABLE ... SYNC, SYSTEM DROP REPLICA '...' FROM ZKPATH '...', ATTACH TABLE, SYSTEM RESTORE REPLICA, SYSTEM RESTART REPLICA. RESTART REPLICA alone does not clear these entries because they live in the ZK queue, and after ATTACH the replica comes back with is_readonly=1 until RESTORE. All table data was preserved except the genuinely lost parts (old downsampled metrics/traces). Result: CH 117% to ~3%, ZK 120% to ~0.2%, queue 19 to 0.
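
The per-table sequence can be written down as a statement generator. This is a sketch under assumptions: it only builds the SQL strings and never executes them, the function name rebuild_statements is hypothetical, and the replica name and ZooKeeper path in the usage are placeholders that would have to come from system.replicas (replica_name, zookeeper_path) for the real table.

```python
def rebuild_statements(database, table, replica_name, zk_path):
    """Build the five-statement replica rebuild for one table, in order.

    Nothing is executed here; review the output and run each statement by
    hand, one table at a time.
    """
    for value in (database, table, replica_name, zk_path):
        # Refuse values that would break out of the quoted literals below.
        if "'" in value or "`" in value:
            raise ValueError(f"unsafe identifier or path: {value!r}")
    fq = f"`{database}`.`{table}`"
    return [
        f"DETACH TABLE {fq} SYNC",
        f"SYSTEM DROP REPLICA '{replica_name}' FROM ZKPATH '{zk_path}'",
        f"ATTACH TABLE {fq}",
        # The table is read-only after ATTACH until its ZK metadata is restored.
        f"SYSTEM RESTORE REPLICA {fq}",
        f"SYSTEM RESTART REPLICA {fq}",
    ]


if __name__ == "__main__":
    # Placeholder values for illustration only.
    for stmt in rebuild_statements(
        "example_db", "samples_v4_agg_5m", "example_replica", "/example/zk/path"
    ):
        print(stmt + ";")
```

Generating the statements instead of looping over tables in one shot keeps each destructive step reviewable, which matches how the fix had to be run anyway.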

Monitoring. Added two things to nix/home-manager/modules/signoz.nix, committed as 98a426e:

  1. A clickhouse Prometheus scrape job (target clickhouse:9363, ClickHouse’s native endpoint) on the otel-collector, mirroring the macmon scrape. Brings ~2600 ClickHouse* metrics into SigNoz.
  2. A declarative terraform signoz_alert (clickhouse_replication_queue_stuck) that fires when ClickHouseAsyncMetrics_ReplicasSumQueueSize > 0 for a full 30m window, with alertOnAbsent as a dead-man’s switch on the scrape.
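
The alert rule's intent can be illustrated with a toy evaluator. This is not how SigNoz evaluates rules internally; it is a simplified sketch of the semantics described above (queue size above zero for the whole window, plus firing when the series is absent). The function name alert_state and the sample format are assumptions.

```python
def alert_state(samples, now, window_s=1800.0, alert_on_absent=True):
    """Classify a series of (timestamp_seconds, value) samples.

    Returns "absent" when no sample falls inside the window and
    alert_on_absent is set (the scrape itself is dead), "firing" when every
    sample in the window is above zero, otherwise "ok".
    Simplification: a single sample in the window counts as a full window.
    """
    recent = [(t, v) for t, v in samples if now - window_s <= t <= now]
    if not recent:
        return "absent" if alert_on_absent else "ok"
    if all(v > 0 for _, v in recent):
        return "firing"
    return "ok"


if __name__ == "__main__":
    now = 10_000.0
    # One sample per minute for 30 minutes, queue stuck at 19 entries.
    stuck = [(now - 60.0 * i, 19.0) for i in range(30)]
    print(alert_state(stuck, now))
```

A queue that drains to zero even once inside the window resets the condition, so normal transient replication work does not page.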

What was striking

  • The original instinct was over-engineering. I first argued for alerting on num_tries (the exact retry count), then questioned the premise. num_tries is the most precise signal but its only real edge is earliness, worthless for a pathology that ran undetected for weeks. Queue persistence (ReplicasSumQueueSize > 0 sustained) catches the same incident, is the SigNoz-sanctioned pattern, and has no custom job that can silently die. The cheaper proxy won.
  • num_tries is not reachable as a metric anyway. It is a per-row column in system.replication_queue, absent from ClickHouse’s :9363 endpoint. The OTel sqlqueryreceiver does support a clickhouse driver upstream (confirmed in the contrib README and go.mod), but the SigNoz collector distro does not compile it in: it returns unsupported driver: clickhouse while accepting postgres/mysql/etc. So even the “scrape it with SQL” path was a dead end on this stack.
  • The permission classifier blocked the destructive ClickHouse surgery on a generic “go ahead”, and again blocked a bulk loop over tables. Had to run it genuinely table-by-table with explicit approval. Reasonable guardrail for irreversible shared-infra mutations.

Follow-up: trimming the scrape

After deploying the :9363 integration I checked CPU again and ClickHouse was periodically spiking to ~109%. This was not the replication bug returning (queue stayed 0, ZooKeeper idle at 0.14%); it was ingest load from the integration itself. The native endpoint exposes ~2600 series and I was scraping all of them every 60s, so each scrape triggered a flush/merge burst.

Trimmed it with a metric_relabel_configs keep filter on the scrape: keep ClickHouseAsyncMetrics_* (system health, single-series gauges, includes the ReplicasSumQueueSize the alert needs) plus the five replication ClickHouseErrorMetric_* counters; drop the bulk (ClickHouseProfileEvents_* ~1100 and ClickHouseMetrics_* ~270) at scrape time so they never enter the pipeline. Ingest dropped from ~2600 series/60s to ~725 (a ~72% cut). CPU settled to ~3% steady with a smaller ~65% burst instead of ~109%. Committed as 5445523.
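
The keep filter's behaviour can be modelled in a few lines. This is a sketch, not the committed config: Prometheus relabel regexes are fully anchored, which re.fullmatch mimics here, and the error-metric name in the usage is an assumed example since the five actual counters are not listed in this note.

```python
import re


def build_keep_pattern(extra_names):
    """Compile a keep regex: all AsyncMetrics plus an explicit allow-list."""
    alternatives = ["ClickHouseAsyncMetrics_.*"]
    alternatives += [re.escape(name) for name in extra_names]
    return re.compile("|".join(alternatives))


def is_kept(metric_name, pattern):
    """True if the series survives the filter (whole-name match)."""
    return pattern.fullmatch(metric_name) is not None


def reduction_percent(before, after):
    """Percentage of series dropped, rounded to the nearest whole percent."""
    return round((1 - after / before) * 100)


if __name__ == "__main__":
    # Assumed example name; substitute the real replication error counters.
    pattern = build_keep_pattern(["ClickHouseErrorMetric_NO_REPLICA_HAS_PART"])
    for name in (
        "ClickHouseAsyncMetrics_ReplicasSumQueueSize",
        "ClickHouseProfileEvents_Query",
        "ClickHouseMetrics_Query",
    ):
        print(name, is_kept(name, pattern))
    print(reduction_percent(2600, 725))  # 72
```

Checking a sample of metric names against the pattern before deploying is a cheap way to confirm the alert's own series is not filtered out.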

Lesson worth keeping: adding an observability integration is itself a workload. The full ClickHouse endpoint is mostly ProfileEvents counters that nobody dashboards; a keep-filter at scrape time is cheaper than ingesting then ignoring them.