Post-Mortem: SigNoz Host Metrics Debugging
Session: 4a386e14-1c86-4c21-b1cf-f70b0d7426e2
Date: 2026-04-11
Duration: 3h 54m (14:19 - 18:13 EDT)
Incident Summary
Host metrics (system.* from the OTel hostmetrics receiver and macmon_* from Prometheus scraping of Apple Silicon sensors) were not appearing in SigNoz dashboards or the Infrastructure Monitoring page. The metrics were being written to samples_v4 (raw data) but silently deleted from time_series_v4 (metric discovery/metadata) by ClickHouse’s background merge process due to a DateTime overflow in the TTL expression.
All trace data and most historical metric data were lost before the issue was identified and fixed.
Root Cause
The SigNoz retention was configured to 876000 days (~2400 years). SigNoz internally converts this to a TTL expression:
`TTL toDateTime(toUInt32(unix_milli / 1000), 'UTC') + toIntervalSecond(3153600000)`

ClickHouse’s DateTime type is stored as UInt32 (max value 4294967295, representing 2106-02-07 06:28:15 UTC). Adding 3153600000 seconds (100 years) to the current epoch (1776000000) produces 4929600000, which overflows UInt32 and wraps to a date in the 1970s/1990s.
With ttl_only_drop_parts = 1, ClickHouse checks during background merges whether all rows in a merged part have expired TTL. Since every row’s computed TTL was in the past, ClickHouse dropped every merged part. Data only ever existed in unmerged parts (recent writes) and was destroyed as soon as those parts were merged.
This affected 23 tables across signoz_metrics (8 tables), signoz_traces (7 tables), and their aggregation views.
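The wraparound can be reproduced with plain integer arithmetic. A minimal sketch (the epoch value 1776000000 is the approximate write timestamp from this incident; ClickHouse performs the equivalent modular arithmetic internally):

```python
UINT32_MAX = 4294967295      # ClickHouse DateTime: seconds since epoch, stored as UInt32

epoch_now = 1776000000       # approx. toUInt32(unix_milli / 1000) at write time
retention = 3153600000       # toIntervalSecond(3153600000) = 100 * 365 * 86400

ttl = epoch_now + retention
print(ttl)                   # 4929600000 -- larger than UINT32_MAX

# UInt32 arithmetic wraps modulo 2**32, so the stored expiry lands in the past:
wrapped = ttl % 2**32
print(wrapped)               # 634632704 -> early 1990
assert wrapped < epoch_now   # every row's TTL is already expired at write time
```

Any row written with this TTL is therefore eligible for deletion the moment it participates in a merge, which matches the 0-row merged parts observed later.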
Cost
| Agent | Role | Model | Tool Calls | Cost |
|---|---|---|---|---|
| main | Primary debugging, config changes | opus-4-6 | 495 | $58.42 |
| a1ee74038fdd69509 | Deep dive: ClickHouse TTL root cause | opus-4-6 | 117 | $9.12 |
| ae1a14d2b408bea09 | GitHub issue search (timing) | opus-4-6 | 38 | $1.96 |
| a865c0c86cc72b46f | SigNoz retention config research | haiku-4-5 | 46 | $1.15 |
| ada350e5eb0173575 | GitHub issue search (hostmetrics) | opus-4-6 | 13 | $0.54 |
| a51ad70e8c5957be5 | HM activation config research | haiku-4-5 | 17 | $0.32 |
| a796d5cb63c714fb3 | Trace enrichment research | opus-4-6 | 1 | $0.15 |
| Total | | | 727 | $71.67 |
Timeline
Phase 1: Setup (14:19 - 15:04) — Working correctly
Unrelated config work (LiteLLM, Claude Code, SillyTavern), then adding hostmetrics receiver, macmon Prometheus scraping, and SigNoz dashboards. Six `hm switch` deploys. Dashboards deployed successfully but showed no data.
Phase 2: Wrong hypothesis — “timing/race condition” (15:04 - 16:34)
- 15:04 - 16:23: Investigated via SigNoz MCP tools, GitHub issues, docs. Found `time_series_v4` was empty for `system.*` while `samples_v4` had data.
- 16:23: Added `level: debug` to collector telemetry. Data appeared. Removed it, data disappeared. This was interpreted as a startup race condition. In reality, debug logging slowed the collector enough that parts hadn’t merged yet during the check window.
- 16:29: Applied `initial_delay: 30s` on the hostmetrics receiver and `retry_on_failure` on the exporter. Deployed via `hm switch`. Did not fix the issue.
- 16:34: Added `depends_on: signoz-telemetrystore-migrator` to the otel-collector compose override, reasoning the collector was writing before tables were created. Deployed. Did not fix the issue.
This phase consumed ~1.5 hours chasing a plausible but incorrect theory. Three config changes were deployed, none effective.
Phase 3: Stale alias detour (16:35 - 16:42)
Discovered the COMPOSE_FILE env var in shell aliases embedded nix store paths that became stale after each hm switch. Multiple stack restarts silently used the old override (without the migrator dependency). This compounded confusion about whether fixes were being applied.
Phase 4: Root cause discovery (17:04 - 17:19)
After the user said “take a deeper look, we’ve been spinning our wheels,” a deep investigation subagent was spawned. It made 117 tool calls over 14.5 minutes:
- Read the exporter source from GitHub (confirmed no metric-name filtering in `writeTimeSeries`)
- Queried `system.parts` and found merged parts with 0 rows and TTL dates in 1970/1990
- Queried `system.part_log` and confirmed every `MergeParts` event produced 0 rows
- Computed the DateTime overflow: `epoch_now + 3153600000 > UInt32_max`
- Verified old `samples_v4` data survived only because it was merged under a previous (valid) 30-day TTL rule
This subagent cost $9.12 and found, in 14.5 minutes, what the main agent’s $58.42 of debugging had missed.
Phase 5: Fix attempts (17:19 - 18:06)
- 17:19: Changed `retentionDays` from `876000` to `29200` (~80 years). Arithmetic error: `epoch_now + 29200*86400` still overflows UInt32 by ~45 days.
- 17:23 - 17:28: Refactored compose aliases from the `COMPOSE_FILE` env var to explicit `-f` flags. Correct fix for the alias staleness, but distracted from the TTL issue.
- 17:28: Deployed. Ran the `configure-retention` container. Discovered SigNoz’s retention API returns `400` for metrics/traces (“custom retention TTL only supported for logs”). The API never actually updated the ClickHouse TTL.
- 18:00: Changed `retentionDays` to `25550` (~70 years, `2207520000` seconds). Verified `epoch_now + 2207520000 = 3983520000 < 4294967295`. Safe.
- 18:02: Ran manual `ALTER TABLE ... MODIFY TTL` on all affected tables. Restarted collector. Data appeared and survived merges. Parts showed TTL expiry of `2096-03-24` (valid).
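The gap between the failed 29200-day value and the working 25550-day value comes down to a few days of headroom. A quick check, assuming the same approximate incident-time epoch of 1776000000:

```python
UINT32_MAX = 4294967295   # ClickHouse DateTime ceiling (2106-02-07 06:28:15 UTC)
epoch_now = 1776000000    # approximate unix time during the incident

def overflows(retention_days: int) -> bool:
    """True if epoch_now plus the retention exceeds the DateTime range."""
    return epoch_now + retention_days * 86400 > UINT32_MAX

print(overflows(29200))   # True  -- over by 3912705 s, about 45 days
print(overflows(25550))   # False -- expiry 3983520000, i.e. the year 2096
```

Running this check before the 17:19 deploy would have caught the arithmetic error and saved one deploy cycle.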
Phase 6: Verification and final fix (18:06 - 18:30)
Confirmed 44 host metrics (20 macmon_* + 24 system.*) present in time_series_v4. Merged parts retained rows. TTL dates valid.
During post-mortem analysis, discovered that SigNoz’s v1 API (/api/v1/settings/ttl) is the proper first-class endpoint for setting metrics and traces retention. It handles all ALTER TABLE ... ON CLUSTER statements internally across all relevant tables. Reverted the configure-retention container from direct ClickHouse commands back to curl, using the correct API per signal: v1 for metrics/traces, v2 for logs.
Data Loss
| Signal | Rows remaining | Date range | Assessment |
|---|---|---|---|
| Metrics (samples_v4) | 56,743 | 2026-04-11 only | All pre-today data lost to merges |
| Metrics (time_series_v4) | 2,227 | 2026-04-11 only | Discovery metadata lost, rebuilt from new scrapes |
| Traces (signoz_index_v3) | 0 | None | All trace data destroyed |
| Logs | Not checked | Unknown | Logs use a separate retention path; may be unaffected |
Mitigating factor: The ClickHouse Docker volume was created at 2026-04-11T01:01:38Z (earlier today), so the maximum data loss window is ~21 hours. There was no long-running historical dataset.
Recovery: Not possible. ClickHouse’s old_parts_lifetime = 480 seconds (8 minutes) means dropped part files are deleted from disk within minutes. No detached parts exist. No volume snapshots.
What Went Wrong
1. Chased a red herring for 1.5 hours
The “debug logging makes it work” observation was misinterpreted as a timing/race condition. The actual mechanism: debug logging increases I/O load, which delays ClickHouse background merges, which means unmerged parts survive longer, which means spot-checking time_series_v4 shows data that will be deleted on the next merge.
Counterfactual: Checking system.parts for TTL metadata at 15:04 would have revealed the overflow immediately. The 1.5-hour detour through exporter source code, cache poisoning theories, and startup race conditions was unnecessary.
2. Applied 3 ineffective config changes before questioning the hypothesis
Each change (initial_delay, retry_on_failure, migrator depends_on) was internally consistent with the race condition theory. But none were tested against falsifiable predictions. “If it’s a race, then adding a 30s delay should fix it” was tested, failed, and yet the race hypothesis persisted.
3. Arithmetic error on first TTL fix
Changed to 29200 days (~80 years) without computing whether it overflows. It did, by 45 days. Cost one more deploy cycle.
4. SigNoz retention API was used incorrectly
The configure-retention container used the v2 API (POST /api/v2/settings/ttl) for all three signals, but v2 only supports logs. Metrics and traces require the v1 API (POST /api/v1/settings/ttl?type=metrics&duration=613200h). The v2 call returned 400 for metrics/traces, but curl -sf suppressed the error and the shell script echoed “retention set” regardless. The actual ClickHouse TTL for metrics/traces was set during SigNoz’s internal initialization with the overflowed value, not corrected by the API.
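The per-signal split is easy to get wrong because the two endpoints look interchangeable. A sketch of selecting the right endpoint per signal; the paths come from this post-mortem, the helper itself is illustrative, and the `type=traces` variant is assumed by analogy with the metrics call:

```python
def retention_endpoint(signal: str, duration_hours: int) -> str:
    """Return the SigNoz API path for setting retention for one signal."""
    if signal in ("metrics", "traces"):
        # v1: query-string based; the only API that works for metrics/traces
        return f"/api/v1/settings/ttl?type={signal}&duration={duration_hours}h"
    if signal == "logs":
        # v2: JSON-body based; returns 400 for any signal other than logs
        return "/api/v2/settings/ttl"
    raise ValueError(f"unknown signal: {signal}")

print(retention_endpoint("metrics", 613200))
# /api/v1/settings/ttl?type=metrics&duration=613200h
```

The caller must also check the HTTP status: a 400 from the wrong endpoint is exactly the failure that `curl -sf` hid here.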
5. Shell alias staleness caused phantom failures
The COMPOSE_FILE env var baked nix store paths into shell aliases at login time. After hm switch, the running shell still had stale paths. Stack restarts appeared to work but used the old override without new fixes.
What Should Have Happened
The ideal debugging path (~15 minutes, ~$5):
- “Metrics not in dashboard” → query `time_series_v4` → empty for `system.*`
- Query `samples_v4` → data exists → exporter writes samples but time_series is empty
- Check `system.parts` for `time_series_v4` → see merged parts with 0 rows and TTL dates in the past
- `SHOW CREATE TABLE signoz_metrics.time_series_v4` → see `toIntervalSecond(3153600000)` → compute the overflow
- Set retention via the v1 API: `POST /api/v1/settings/ttl?type=metrics&duration=613200h` → restart collector → done
The failure was not checking the database engine/TTL metadata. All investigation focused on the application layer (exporter code, cache behavior, pipeline configuration) when the answer was in the storage layer.
Lessons and Action Items
Debugging methodology
- When data exists in one table but not a related one written by the same code path, check the table’s engine and TTL before the application code. The storage layer can silently discard data that the application successfully wrote.
- “It works when I add logging” does not always mean timing. It can mean the extra I/O delays background operations (merges, compaction, GC) that are destroying the data.
- When a hypothesis fails, question the hypothesis, not just the parameters. “`initial_delay` didn’t fix the race” should prompt “is this actually a race?” not “maybe the delay isn’t long enough.”
SigNoz-specific
- ClickHouse DateTime is UInt32 (max 2106-02-07). Any retention reaching beyond that date (~80 years from 2026, and shrinking every year) will overflow. This is not documented by SigNoz.
- SigNoz has two retention APIs, split by signal. The v1 API (`/api/v1/settings/ttl?type=metrics&duration=613200h`) handles metrics and traces. The v2 API (`/api/v2/settings/ttl` with JSON body) handles only logs. Using v2 for metrics/traces returns a 400 error. The original `configure-retention` container used v2 for all three signals, meaning metrics and traces retention was never actually set via the API.
- The `configure-retention` container was silently failing for 2 of 3 signals because `curl -sf` suppresses error output and the shell script echoed “retention set” regardless of curl’s exit code. Fixed by using the correct API per signal (v1 for metrics/traces, v2 for logs) with proper error handling.
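The retention ceiling follows directly from the UInt32 bound and can be computed for any point in time. A sketch, using the incident-time epoch of 1776000000 as the example input:

```python
UINT32_MAX = 4294967295  # ClickHouse DateTime ceiling: 2106-02-07 06:28:15 UTC

def max_safe_retention_days(epoch_now: int) -> int:
    """Largest whole-day retention whose computed TTL still fits in DateTime."""
    return (UINT32_MAX - epoch_now) // 86400

days = max_safe_retention_days(1776000000)   # epoch during this incident
print(days)                                  # 29154 -- just under the failed 29200
print(days / 365)                            # ~79.9 years; shrinks as time passes
```

This also explains why the first fix failed by such a narrow margin: 29200 days sits only 46 days past the ceiling.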
Infrastructure
- Shell aliases that embed nix store paths are fragile. Replaced the `COMPOSE_FILE` env var with explicit `-f` flags pointing to mutable paths in the deploy directory.
- `curl -sf` suppresses HTTP error responses. The retention container reported success because the shell script echoed after curl, regardless of curl’s exit code. Fixed by using the correct API per signal and chaining `echo` after `&&` so it only prints on success.