Post-Mortem: SigNoz Host Metrics Debugging
Session: 4a386e14-1c86-4c21-b1cf-f70b0d7426e2
Date: 2026-04-11
Duration: 3h 54m (14:19 - 18:13 EDT)
Incident Summary
Host metrics (system.* from the OTel hostmetrics receiver and macmon_* from Prometheus scraping of Apple Silicon sensors) were not appearing in SigNoz dashboards or the Infrastructure Monitoring page. The metrics were being written to samples_v4 (raw data) but silently deleted from time_series_v4 (metric discovery/metadata) by ClickHouse’s background merge process due to a DateTime overflow in the TTL expression.
All trace data and most historical metric data were lost before the issue was identified and fixed.
Root Cause
The SigNoz retention was configured to 876000 days (~2400 years). SigNoz internally converts this to a TTL expression:
`TTL toDateTime(toUInt32(unix_milli / 1000), 'UTC') + toIntervalSecond(3153600000)`

ClickHouse’s DateTime type is stored as UInt32 (max value 4294967295, representing 2106-02-07 06:28:15 UTC). Adding 3153600000 seconds (100 years) to the current epoch (1776000000) produces 4929600000, which overflows UInt32 and wraps to a date in the 1970s/1990s.
With ttl_only_drop_parts = 1, ClickHouse checks during background merges whether all rows in a merged part have expired TTL. Since every row’s computed TTL was in the past, ClickHouse dropped every merged part. Data only ever existed in unmerged parts (recent writes) and was destroyed as soon as those parts were merged.
This affected 23 tables across signoz_metrics (8 tables), signoz_traces (7 tables), and their aggregation views.
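The wraparound can be reproduced with plain integer arithmetic. A minimal sketch (the epoch value 1776000000 is the approximate write timestamp from this incident; ClickHouse performs the equivalent modular arithmetic internally):

```python
UINT32_MAX = 4294967295      # ClickHouse DateTime: seconds since epoch, stored as UInt32

epoch_now = 1776000000       # approx. toUInt32(unix_milli / 1000) at write time
retention = 3153600000       # toIntervalSecond(3153600000) = 100 * 365 * 86400

ttl = epoch_now + retention
print(ttl)                   # 4929600000 -- larger than UINT32_MAX

# UInt32 arithmetic wraps modulo 2**32, so the stored expiry lands in the past:
wrapped = ttl % 2**32
print(wrapped)               # 634632704 -> early 1990
assert wrapped < epoch_now   # every row's TTL is already expired at write time
```

Any row written with this TTL is therefore eligible for deletion the moment it participates in a merge, which matches the 0-row merged parts observed later.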
Cost
| Agent | Role | Model | Tool Calls | Cost |
|---|---|---|---|---|
| main | Primary debugging, config changes | opus-4-6 | 495 | $58.42 |
| a1ee74038fdd69509 | Deep dive: ClickHouse TTL root cause | opus-4-6 | 117 | $9.12 |
| ae1a14d2b408bea09 | GitHub issue search (timing) | opus-4-6 | 38 | $1.96 |
| a865c0c86cc72b46f | SigNoz retention config research | haiku-4-5 | 46 | $1.15 |
| ada350e5eb0173575 | GitHub issue search (hostmetrics) | opus-4-6 | 13 | $0.54 |
| a51ad70e8c5957be5 | HM activation config research | haiku-4-5 | 17 | $0.32 |
| a796d5cb63c714fb3 | Trace enrichment research | opus-4-6 | 1 | $0.15 |
| Total | | | 727 | $71.67 |
Timeline
Phase 1: Setup (14:19 - 15:04) — Working correctly
Unrelated config work (LiteLLM, Claude Code, SillyTavern), then adding hostmetrics receiver, macmon Prometheus scraping, and SigNoz dashboards. Six `hm switch` deploys. Dashboards deployed successfully but showed no data.
Phase 2: Wrong hypothesis — “timing/race condition” (15:04 - 16:34)
- 15:04 - 16:23: Investigated via SigNoz MCP tools, GitHub issues, docs. Found `time_series_v4` was empty for `system.*` while `samples_v4` had data.
- 16:23: Added `level: debug` to collector telemetry. Data appeared. Removed it, data disappeared. This was interpreted as a startup race condition. In reality, debug logging slowed the collector enough that parts hadn’t merged yet during the check window.
- 16:29: Applied `initial_delay: 30s` on the hostmetrics receiver and `retry_on_failure` on the exporter. Deployed via `hm switch`. Did not fix the issue.
- 16:34: Added `depends_on: signoz-telemetrystore-migrator` to the otel-collector compose override, reasoning the collector was writing before tables were created. Deployed. Did not fix the issue.
This phase consumed ~1.5 hours chasing a plausible but incorrect theory. Three config changes were deployed, none effective.
Phase 3: Stale alias detour (16:35 - 16:42)
Discovered the COMPOSE_FILE env var in shell aliases embedded nix store paths that became stale after each hm switch. Multiple stack restarts silently used the old override (without the migrator dependency). This compounded confusion about whether fixes were being applied.
Phase 4: Root cause discovery (17:04 - 17:19)
After the user said “take a deeper look, we’ve been spinning our wheels,” a deep investigation subagent was spawned. It made 117 tool calls over 14.5 minutes:
- Read the exporter source from GitHub (confirmed no metric-name filtering in `writeTimeSeries`)
- Queried `system.parts` and found merged parts with 0 rows and TTL dates in 1970/1990
- Queried `system.part_log` and confirmed every `MergeParts` event produced 0 rows
- Computed the DateTime overflow: `epoch_now + 3153600000 > UInt32_max`
- Verified old `samples_v4` data survived only because it was merged under a previous (valid) 30-day TTL rule
This subagent cost $9.12 and found, in 14.5 minutes, what the main agent’s $58.42 of debugging had missed.
Phase 5: Fix attempts (17:19 - 18:06)
- 17:19: Changed `retentionDays` from `876000` to `29200` (~80 years). Arithmetic error: `epoch_now + 29200*86400` still overflows UInt32 by ~45 days.
- 17:23 - 17:28: Refactored compose aliases from the `COMPOSE_FILE` env var to explicit `-f` flags. Correct fix for the alias staleness, but distracted from the TTL issue.
- 17:28: Deployed. Ran the `configure-retention` container. Discovered SigNoz’s retention API returns `400` for metrics/traces (“custom retention TTL only supported for logs”). The API never actually updated the ClickHouse TTL.
- 18:00: Changed `retentionDays` to `25550` (~70 years, `2207520000` seconds). Verified `epoch_now + 2207520000 = 3983520000 < 4294967295`. Safe.
- 18:02: Ran manual `ALTER TABLE ... MODIFY TTL` on all affected tables. Restarted collector. Data appeared and survived merges. Parts showed TTL expiry of `2096-03-24` (valid).
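The gap between the failed 29200-day value and the working 25550-day value comes down to a few days of headroom. A quick check, assuming the same approximate incident-time epoch of 1776000000:

```python
UINT32_MAX = 4294967295   # ClickHouse DateTime ceiling (2106-02-07 06:28:15 UTC)
epoch_now = 1776000000    # approximate unix time during the incident

def overflows(retention_days: int) -> bool:
    """True if epoch_now plus the retention exceeds the DateTime range."""
    return epoch_now + retention_days * 86400 > UINT32_MAX

print(overflows(29200))   # True  -- over by 3912705 s, about 45 days
print(overflows(25550))   # False -- expiry 3983520000, i.e. the year 2096
```

Running this check before the 17:19 deploy would have caught the arithmetic error and saved one deploy cycle.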
Phase 6: Verification and final fix (18:06 - 18:30)
Confirmed 44 host metrics (20 macmon_* + 24 system.*) present in time_series_v4. Merged parts retained rows. TTL dates valid.
During post-mortem analysis, discovered that SigNoz’s v1 API (/api/v1/settings/ttl) is the proper first-class endpoint for setting metrics and traces retention. It handles all ALTER TABLE ... ON CLUSTER statements internally across all relevant tables. Reverted the configure-retention container from direct ClickHouse commands back to curl, using the correct API per signal: v1 for metrics/traces, v2 for logs.
Data Loss
| Signal | Rows remaining | Date range | Assessment |
|---|---|---|---|
| Metrics (samples_v4) | 56,743 | 2026-04-11 only | All pre-today data lost to merges |
| Metrics (time_series_v4) | 2,227 | 2026-04-11 only | Discovery metadata lost, rebuilt from new scrapes |
| Traces (signoz_index_v3) | 0 | None | All trace data destroyed |
| Logs | Not checked | Unknown | Logs use a separate retention path; may be unaffected |
Mitigating factor: The ClickHouse Docker volume was created at 2026-04-11T01:01:38Z (earlier today), so the maximum data loss window is ~21 hours. There was no long-running historical dataset.
Recovery: Not possible. ClickHouse’s old_parts_lifetime = 480 seconds (8 minutes) means dropped part files are deleted from disk within minutes. No detached parts exist. No volume snapshots.
What Went Wrong
1. Chased a red herring for 1.5 hours
The “debug logging makes it work” observation was misinterpreted as a timing/race condition. The actual mechanism: debug logging increases I/O load, which delays ClickHouse background merges, which means unmerged parts survive longer, which means spot-checking time_series_v4 shows data that will be deleted on the next merge.
Counterfactual: Checking system.parts for TTL metadata at 15:04 would have revealed the overflow immediately. The 1.5-hour detour through exporter source code, cache poisoning theories, and startup race conditions was unnecessary.
2. Applied 3 ineffective config changes before questioning the hypothesis
Each change (initial_delay, retry_on_failure, migrator depends_on) was internally consistent with the race condition theory. But none were tested against falsifiable predictions. “If it’s a race, then adding a 30s delay should fix it” was tested, failed, and yet the race hypothesis persisted.
3. Arithmetic error on first TTL fix
Changed to 29200 days (~80 years) without computing whether it overflows. It did, by 45 days. Cost one more deploy cycle.
4. SigNoz retention API was used incorrectly
The configure-retention container used the v2 API (POST /api/v2/settings/ttl) for all three signals, but v2 only supports logs. Metrics and traces require the v1 API (POST /api/v1/settings/ttl?type=metrics&duration=613200h). The v2 call returned 400 for metrics/traces, but curl -sf suppressed the error and the shell script echoed “retention set” regardless. The actual ClickHouse TTL for metrics/traces was set during SigNoz’s internal initialization with the overflowed value, not corrected by the API.
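The per-signal split is easy to get wrong because the two endpoints look interchangeable. A sketch of selecting the right endpoint per signal; the paths come from this post-mortem, the helper itself is illustrative, and the `type=traces` variant is assumed by analogy with the metrics call:

```python
def retention_endpoint(signal: str, duration_hours: int) -> str:
    """Return the SigNoz API path for setting retention for one signal."""
    if signal in ("metrics", "traces"):
        # v1: query-string based; the only API that works for metrics/traces
        return f"/api/v1/settings/ttl?type={signal}&duration={duration_hours}h"
    if signal == "logs":
        # v2: JSON-body based; returns 400 for any signal other than logs
        return "/api/v2/settings/ttl"
    raise ValueError(f"unknown signal: {signal}")

print(retention_endpoint("metrics", 613200))
# /api/v1/settings/ttl?type=metrics&duration=613200h
```

The caller must also check the HTTP status: a 400 from the wrong endpoint is exactly the failure that `curl -sf` hid here.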
5. Shell alias staleness caused phantom failures
The COMPOSE_FILE env var baked nix store paths into shell aliases at login time. After hm switch, the running shell still had stale paths. Stack restarts appeared to work but used the old override without new fixes.
What Should Have Happened
The ideal debugging path (~15 minutes, ~$5):
- “Metrics not in dashboard” → query `time_series_v4` → empty for `system.*`
- Query `samples_v4` → data exists → exporter writes samples but time_series is empty
- Check `system.parts` for `time_series_v4` → see merged parts with 0 rows and TTL dates in the past
- `SHOW CREATE TABLE signoz_metrics.time_series_v4` → see `toIntervalSecond(3153600000)` → compute the overflow
- Set retention via the v1 API: `POST /api/v1/settings/ttl?type=metrics&duration=613200h` → restart collector → done
The failure was not checking the database engine/TTL metadata. All investigation focused on the application layer (exporter code, cache behavior, pipeline configuration) when the answer was in the storage layer.
Lessons and Action Items
Debugging methodology
- When data exists in one table but not a related one written by the same code path, check the table’s engine and TTL before the application code. The storage layer can silently discard data that the application successfully wrote.
- “It works when I add logging” does not always mean timing. It can mean the extra I/O delays background operations (merges, compaction, GC) that are destroying the data.
- When a hypothesis fails, question the hypothesis, not just the parameters. “`initial_delay` didn’t fix the race” should prompt “is this actually a race?” not “maybe the delay isn’t long enough.”
SigNoz-specific
- ClickHouse DateTime is UInt32 (max 2106-02-07). Any retention reaching beyond that date (~80 years from 2026, and shrinking every year) will overflow. This is not documented by SigNoz.
- SigNoz has two retention APIs, split by signal. The v1 API (`/api/v1/settings/ttl?type=metrics&duration=613200h`) handles metrics and traces. The v2 API (`/api/v2/settings/ttl` with JSON body) handles only logs. Using v2 for metrics/traces returns a 400 error. The original `configure-retention` container used v2 for all three signals, meaning metrics and traces retention was never actually set via the API.
- The `configure-retention` container was silently failing for 2 of 3 signals because `curl -sf` suppresses error output and the shell script echoed “retention set” regardless of curl’s exit code. Fixed by using the correct API per signal (v1 for metrics/traces, v2 for logs) with proper error handling.
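The retention ceiling follows directly from the UInt32 bound and can be computed for any point in time. A sketch, using the incident-time epoch of 1776000000 as the example input:

```python
UINT32_MAX = 4294967295  # ClickHouse DateTime ceiling: 2106-02-07 06:28:15 UTC

def max_safe_retention_days(epoch_now: int) -> int:
    """Largest whole-day retention whose computed TTL still fits in DateTime."""
    return (UINT32_MAX - epoch_now) // 86400

days = max_safe_retention_days(1776000000)   # epoch during this incident
print(days)                                  # 29154 -- just under the failed 29200
print(days / 365)                            # ~79.9 years; shrinks as time passes
```

This also explains why the first fix failed by such a narrow margin: 29200 days sits only 46 days past the ceiling.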
Infrastructure
- Shell aliases that embed nix store paths are fragile. Replaced the `COMPOSE_FILE` env var with explicit `-f` flags pointing to mutable paths in the deploy directory.
- `curl -sf` suppresses HTTP error responses. The retention container reported success because the shell script echoed after curl, regardless of curl’s exit code. Fixed by using the correct API per signal and chaining `echo` after `&&` so it only prints on success.