2026-06-04 SigNoz ClickHouse Replication Queue CPU Burn
Context: Working on dotfiles (.config, main). Triggered by noticing the SigNoz docker compose stack eating a lot of CPU.
What I set out to do
Figure out why the SigNoz stack was burning CPU. Starting symptom was vague (“using a lot of CPU”), no specific signal in hand.
What I actually did
Worked it end to end: diagnose, fix, then add monitoring so it can’t recur silently.
Diagnosis. docker stats pinned the culprits immediately: signoz-clickhouse at 117% and signoz-zookeeper-1 at 120%, each over a full core, on a host already at load 14/16. But ClickHouse had zero active queries, zero merges, zero pending mutations, and only ~4 inserts/sec. No real workload. ZooKeeper’s mntr showed a huge lifetime proposal count but a live write rate of 0, which meant it was being hammered by reads, the signature of a tight retry loop. The smoking gun was system.replication_queue: GET_PART entries that had retried millions of times (samples_v4_agg_5m and _30m north of 5.5M tries each), failing with NO_REPLICA_HAS_PART. SigNoz ships every metrics/traces table as ReplicatedMergeTree, but this is a single node (total_replicas: 1), so a locally-lost part can never be fetched from anywhere and the entry retries forever, spinning ClickHouse and ZooKeeper.
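A minimal sketch of the check that surfaced the stuck entries, assuming the rows of system.replication_queue have already been fetched as dicts. The column names (type, num_tries, last_exception) are the table's own; the function name, the 1000-try threshold, and the sample rows are hypothetical and only illustrate the shape of the filter.

```python
def stuck_entries(rows, min_tries=1000, marker="NO_REPLICA_HAS_PART"):
    """Return queue rows that look like a permanent fetch-retry loop.

    min_tries is an arbitrary cutoff chosen for illustration, not a
    ClickHouse default.
    """
    return [
        row
        for row in rows
        if row["type"] == "GET_PART"
        and row["num_tries"] >= min_tries
        and marker in row["last_exception"]
    ]


# Illustrative rows only: the values are made up to mirror the incident.
sample_rows = [
    {
        "table": "samples_v4_agg_5m",
        "type": "GET_PART",
        "num_tries": 5_500_001,
        "last_exception": "NO_REPLICA_HAS_PART (illustrative message)",
    },
    {
        "table": "healthy_table",  # hypothetical table name
        "type": "MERGE_PARTS",
        "num_tries": 1,
        "last_exception": "",
    },
]

for row in stuck_entries(sample_rows):
    print(row["table"], row["num_tries"])
```

On a single-replica setup any row this returns is effectively permanent, since there is no other replica to fetch the part from.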
Fix. Rebuilt the replica for each of the 10 affected tables, in this order: DETACH TABLE ... SYNC, SYSTEM DROP REPLICA '...' FROM ZKPATH '...', ATTACH TABLE, SYSTEM RESTORE REPLICA, SYSTEM RESTART REPLICA. RESTART REPLICA alone does not clear these entries because they live in the ZK queue, and after ATTACH the replica comes back with is_readonly=1 until RESTORE runs. All table data was preserved except the genuinely lost parts (old downsampled metrics/traces). Result: ClickHouse CPU 117% to ~3%, ZooKeeper CPU 120% to ~0.2%, replication queue 19 entries to 0.
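The per-table sequence can be sketched as a small generator that only renders the statements, so each table can be reviewed and run by hand. The function name and every argument value in the usage lines (database, table, replica name, ZooKeeper path) are hypothetical placeholders; the real replica name and path come from the table's own replication settings and are not filled in here.

```python
def rebuild_replica_statements(database, table, replica_name, zk_path):
    """Render the five statements, in order, for one table.

    Nothing is executed; the caller decides when to run each statement.
    """
    fq_name = f"{database}.{table}"
    return [
        f"DETACH TABLE {fq_name} SYNC",
        f"SYSTEM DROP REPLICA '{replica_name}' FROM ZKPATH '{zk_path}'",
        f"ATTACH TABLE {fq_name}",
        # The table comes back read-only after ATTACH until RESTORE runs.
        f"SYSTEM RESTORE REPLICA {fq_name}",
        f"SYSTEM RESTART REPLICA {fq_name}",
    ]


# Placeholder values for illustration only.
statements = rebuild_replica_statements(
    database="example_db",
    table="example_table",
    replica_name="example_replica",
    zk_path="/example/zk/path",
)
for statement in statements:
    print(statement + ";")
```

Rendering instead of executing fits the table-by-table approval flow: each table's five statements are visible before anything irreversible happens.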
Monitoring. Added two things to nix/home-manager/modules/signoz.nix, committed as 98a426e:
- A clickhouse Prometheus scrape job (target clickhouse:9363, ClickHouse’s native endpoint) on the otel-collector, mirroring the macmon scrape. Brings ~2600 ClickHouse* metrics into SigNoz.
- A declarative terraform signoz_alert (clickhouse_replication_queue_stuck) that fires when ClickHouseAsyncMetrics_ReplicasSumQueueSize > 0 for a full 30m window, with alertOnAbsent as a dead-man’s switch on the scrape.
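The intent of the alert can be sketched as a tiny evaluator. This is not SigNoz's rule engine, just an illustration of the three outcomes the rule is meant to distinguish; the function name, the state labels, and the sample data are hypothetical, and it assumes samples arrive densely enough to cover the window.

```python
def evaluate(samples, now, window_s=1800, threshold=0.0):
    """Classify the queue-size series over the last window_s seconds.

    samples is a list of (unix_timestamp, value) pairs.
    """
    recent = [value for ts, value in samples if now - window_s <= ts <= now]
    if not recent:
        return "absent"  # scrape is dead: the dead-man's switch case
    if all(value > threshold for value in recent):
        return "firing"  # queue never drained during the window
    return "ok"  # queue hit zero at least once, so it is moving


# Illustrative series: one sample per minute over 30 minutes.
stuck = [(t, 19.0) for t in range(0, 1801, 60)]
drained = stuck[:-1] + [(1800, 0.0)]

print(evaluate(stuck, now=1800))
print(evaluate(drained, now=1800))
print(evaluate([], now=1800))
```

A healthy queue fills and drains, so requiring every sample in the window to be above zero separates normal replication churn from an entry that never completes.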
What was striking
- The original instinct was over-engineering. I first argued for alerting on num_tries (the exact retry count), then questioned the premise. num_tries is the most precise signal but its only real edge is earliness, worthless for a pathology that ran undetected for weeks. Queue persistence (ReplicasSumQueueSize > 0 sustained) catches the same incident, is the SigNoz-sanctioned pattern, and has no custom job that can silently die. The cheaper proxy won.
- num_tries is not reachable as a metric anyway. It is a per-row column in system.replication_queue, absent from ClickHouse’s :9363 endpoint. The OTel sqlqueryreceiver does support a clickhouse driver upstream (confirmed in the contrib README and go.mod), but the SigNoz collector distro does not compile it in: it returns unsupported driver: clickhouse while accepting postgres/mysql/etc. So even the “scrape it with SQL” path was a dead end on this stack.
- The permission classifier blocked the destructive ClickHouse surgery on a generic “go ahead”, and again blocked a bulk loop over tables. Had to run it genuinely table-by-table with explicit approval. Reasonable guardrail for irreversible shared-infra mutations.
Related
- 2026-05-16 SigNoz Dockerstats and OpAMP Investigation (the dockerstats receiver added then is the same metrics path)
- 2026-04-11 SigNoz ClickHouse TTL Overflow Post-Mortem (prior ClickHouse storage-layer debugging)
- OTel Metric Temporality
- SigNoz
Follow-up: trimming the scrape
After deploying the :9363 integration I checked CPU again and ClickHouse was spiking to ~109% on a cycle. Not the replication bug returning (queue stayed 0, ZooKeeper idle at 0.14%): it was ingest load from the integration itself. The native endpoint exposes ~2600 series and I was scraping all of them every 60s, so each scrape triggered a flush/merge burst.
Trimmed it with a metric_relabel_configs keep filter on the scrape: keep ClickHouseAsyncMetrics_* (system health, single-series gauges, includes the ReplicasSumQueueSize the alert needs) plus the five replication ClickHouseErrorMetric_* counters; drop the bulk (ClickHouseProfileEvents_* ~1100 and ClickHouseMetrics_* ~270) at scrape time so they never enter the pipeline. Ingest dropped from ~2600 series/60s to ~725 (a ~72% cut). CPU settled to ~3% steady with a smaller ~65% burst instead of ~109%. Committed as 5445523.
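A rough model of the keep filter and the arithmetic behind the cut, assuming Prometheus-style relabeling where the regex must match the whole metric name. The single ClickHouseErrorMetric_ name below is a placeholder for the five replication counters, whose exact names are not recorded here; the helper names are hypothetical.

```python
import re

# Placeholder for the five replication error counters (names assumed).
ERROR_METRICS = ["NO_REPLICA_HAS_PART"]

KEEP_PATTERN = re.compile(
    r"ClickHouseAsyncMetrics_.*|ClickHouseErrorMetric_(?:"
    + "|".join(re.escape(name) for name in ERROR_METRICS)
    + r")"
)


def is_kept(metric_name):
    """Mimic a 'keep' relabel action: the regex is anchored at both ends."""
    return KEEP_PATTERN.fullmatch(metric_name) is not None


def cut_percent(before, after):
    """Percentage reduction in scraped series, rounded to a whole number."""
    return round((before - after) / before * 100)


print(is_kept("ClickHouseAsyncMetrics_ReplicasSumQueueSize"))
print(is_kept("ClickHouseProfileEvents_Query"))
print(cut_percent(2600, 725))
```

Because the match is anchored, ClickHouseMetrics_* does not slip through on the ClickHouseAsyncMetrics_ prefix, and the metric the alert depends on survives the filter.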
Lesson worth keeping: adding an observability integration is itself a workload. The full ClickHouse endpoint is mostly ProfileEvents counters that nobody dashboards; a keep-filter at scrape time is cheaper than ingesting then ignoring them.