Context: Working on dotfiles (~/.config, main branch) — instrumenting full Anthropic /v1/messages payloads into self-hosted SigNoz for Claude Code observability.
What Happened
Goal was to capture raw on-the-wire Anthropic request/response JSON for debugging and analysis. Claude Code’s built-in OTEL log events don’t include the full body (they carry only lightweight attributes). The work spanned two sessions and an unexpected number of side quests.
Starting point (wrong one): began by exploring LiteLLM as an AI gateway intercepting /v1/messages traffic. Made a confident claim that LiteLLM integration was “fiddly” without evidence. User pushed back, rightly, and forced evidence. After digging into LiteLLM source and recent versions, the fiddliness claim was mostly wrong — the installed 1.82.3 had the beta-header bug already fixed. Retracted.
Pivot: asking “why do I actually want this?” surfaced the real motivation — seeing full request/response JSON for debugging prompt-cache drift and message-array mutations. This led to discovering OTEL_LOG_RAW_API_BODIES, an undocumented Claude Code env var (added in 2.1.111+). Two modes:
=1inline — writes JSON into OTEL log event attribute, truncated at 60 KB (nV9 = 61440constant, hardcoded in the Bun-compiled binary).=file:<dir>— writes untruncated JSON files to disk, one per request/response.
Measured actual request bodies (~1 MB each for cache-heavy turns) against the 60 KB cap and confirmed inline mode would lose 95%+ of payload. Went with file mode.
Filelog pipeline — five sequential bugs. Adding a filelog receiver to the SigNoz OTel collector to ingest the JSON files into ClickHouse turned into a bug cascade. Each fix unblocked the next failure:
-
Feature gate required.
delete_after_read: truerefused to start without--feature-gates=filelog.allowFileDeletion. Fix: override collector entrypoint in docker-compose.override.yaml. Side-quest: background-agent investigation of the gate confirmed it’s permanent alpha by design (security control, not maturity marker). Produced zettel. -
Docker Desktop
/hostfs≠ Mac FS. First attempt used/hostfs/Users/achhina/...paths. On Docker Desktop for Mac,/hostfsis the Linux VM’s root, not the Mac’s. Collector logged “no files match the configured criteria” — vague enough to send me chasing regex/glob patterns instead of structural. Fix: explicit bind mount of the Mac path to a clean mount point (/claude-api-bodies). -
Silent data drops from
force_flush_period. After errors 1+2 were resolved, only 4-6 of ~96 files were landing. Addedforce_flush_period: 1sthinking it would help; made it worse. A 925 KB file landed as 3.5 KB, truncated mid-JSON at"stream":true. The flush period was cutting buffers mid-read. -
Default line tokenizer broken by compact JSON without trailing newline. Claude Code writes compact JSON with no
\nat EOF. Default filelog tokenizer never emits the final record. -
Final fix:
multiline: {line_end_pattern: "\\}$"}— explicit close-on-final-brace terminates each record. Worked end-to-end: 1 MB request bodies now land intact.
Disk-usage math tangent. Initial back-of-napkin used 4 bytes/token, projected modest disk growth. User requested tighter math via the session-analysis skill. Recalibrated against 23 captured files: actual ratio is 2.76 bytes/token for cache-heavy English conversation JSON. Projected disk growth corrected upward: ~10 GB/month, ~116 GB/year. Still fits within SigNoz logs retention limits (25550 days, ~70 years max, set to safe margin under ClickHouse DateTime UInt32 overflow). See SigNoz ClickHouse TTL Overflow Post-Mortem for the overflow history.
Prompt caching observation. Confirmed empirically that each CC turn re-sends the full conversation history with only cache_control marker drift between adjacent turns. Favorable for ClickHouse columnar compression but explains the 10 GB/month growth.
Regex fallback + alert. Added operator chain to tag unparseable filenames as body_kind=unparsed (rather than silently drop) and created a SigNoz alert on any such appearance. Low-frequency canary for “Claude changed its filename format.” No unparsed entries observed over 24h.
Attribution header rabbit hole. Spent time investigating whether CLAUDE_CODE_ATTRIBUTION_HEADER=0 still helped cache hit rate. Dispatched a background agent that inspected the 213 MB Bun-compiled binary with strings | rg, located the header-emission function yB_, found the cch= value is a hardcoded literal 00000 on 2.1.119. Cross-referenced with GitHub issue #40652 — dynamic-cch bug was mitigated in 2.1.90, fully fixed in 2.1.91. Confirmed via SigNoz log search: 40:1 cacheRead:cacheCreation ratio, header always constant. The env var is a no-op on current versions; the community workaround is obsolete. Decided not to set it.
Went through one extra round of confusion when user said “let’s just remove the header” — I interpreted as “set the env var to suppress the header” and shipped a hm switch with CLAUDE_CODE_ATTRIBUTION_HEADER=0. User corrected: they meant “remove it from our topic list, since it’s a no-op.” Reverted.
Temporality/metadata investigation. After the hm switch cycle reapplied the collector with OTEL_EXPORTER_OTLP_METRICS_TEMPORALITY_PREFERENCE=cumulative, short-window (1h) metric queries started returning empty despite samples flowing and stored correctly. Traced the flow: samples_v4 stored as Cumulative → metadata table has 20+ rows mixing Delta (old CC versions) and Cumulative (current) → SigNoz API layer picks one via anyLast(temporality) which is non-deterministic in ClickHouse and returned Delta → delta-formula rate math against cumulative samples returned empty. Upstream: SigNoz #8961 (open, cache-aside TTL design unassigned), #9708 (closed, declared “pick latest” policy but implemented with non-deterministic picker). Decided to accept the gap — 7d queries work, dashboards/alerts can override temporality explicitly.
Service.name position walk-back. Initially recommended splitting the filelog pipeline to service.name=claude-code-raw-bodies for SigNoz UX reasons. User asked about OTel conventions. Re-read OTel Resource semconv — service.name describes the logical service producing telemetry, not the ingestion path. Both streams describe the same Claude Code process; splitting would violate semantics. Corrected position to keep service.name=claude-code and disambiguate via an attribute like telemetry.stream or existing body_kind.
Why It Was Notable
- Framing cost real time. The LiteLLM pivot wasn’t just a retraction — it reflected landing on the wrong abstraction level initially. When the user asked “what makes it fiddly?” the honest answer revealed the original premise was misdirected. Next time, surface the actual goal before committing to a path.
- Cascade bugs teach the pipeline better than a clean setup would. Fighting through 5 sequential filelog failures forced reading the receiver code, understanding the tokenizer, and confirming Docker Desktop’s filesystem model. A tutorial that “just worked” would have taught less.
- Tangents produced durable artifacts. The filelog-gate background investigation → OTel Feature Gates zettel. The temporality debugging → OTel Temporality zettel. The attribution-header rabbit hole → reverted a pointless config change and settled a community workaround as obsolete. Side quests pulled their weight.
- Multiple cases of my initial position being wrong. LiteLLM fiddliness (retracted), service.name splitting (reversed), attribution-header suppression (misinterpreted user intent). Each corrected by the user pushing back with evidence or clarifying intent. Pattern to watch: under uncertainty I was generating confident-sounding recommendations that hadn’t been falsified.
Resolution
End state of the pipeline:
- Raw body JSON (1 MB typical, up to 8 MB cap) flows reliably: CC writes to
~/.local/share/claude/api-bodies/, filelog ingests every 5s,delete_after_read: trueprunes, ClickHouse becomes sole store. - Data model:
service.name=claude-code,body_kind ∈ {request, response, unparsed},body_id=<uuid or req_01...>. Filterbody_kind EXISTSisolates raw-body records from native OTEL events. - Retention: aligned with SigNoz logs retention (25550 days). No pressure on the UInt32 overflow horizon.
- Alert: “Claude raw body filename parse drift” fires on any
body_kind=unparsedappearance. Routed to placeholder webhook (no listener) — visible in SigNoz Alerts UI. - Accepted gaps: permanent
filelog.allowFileDeletionoverride in entrypoint; temporality metadata cache staleness for short-window metric queries; placeholder webhook for alert delivery.
Artifacts produced:
- OTel Metric Temporality - Delta vs Cumulative — zettel
- OTel Feature Gates - Permanent vs Transitional — zettel
- Memory: “Claude Code attribution header is a no-op on 2.1.90+” (drafted, then deleted at user request; keeping here in the journal for traceability)
Five-commit fix chain in dotfiles:
feat(signoz): filelog receiver for Claude Code raw API bodiesfix(signoz): enable filelog.allowFileDeletion feature gatefix(signoz): bind-mount macOS path directly for filelog receiverfix(signoz): drop multiline line_start_pattern, add force_flush_periodfix(signoz): terminate filelog records on trailing '}' via line_end_patternfeat(signoz): tag unparsed filelog filenames as body_kind=unparsed
Related
- OTel Metric Temporality - Delta vs Cumulative
- OTel Feature Gates - Permanent vs Transitional
- SigNoz ClickHouse TTL Overflow Post-Mortem — prior SigNoz silent-failure incident, same storage-catalog-layer pattern
- Upstream: OTel Collector #16314, #45120, SigNoz #8961, Claude Code #40652