Core Principle
OpenTelemetry counter metrics can be reported in two equivalent representations, delta and cumulative, and the backend’s query math must match the representation used at emission. A mismatch between stored temporality and assumed temporality produces silently wrong (often empty) results, not errors.
- Cumulative: each data point is the running total since process start. rate() = differentiate consecutive points, then divide by elapsed time.
- Delta: each data point is the change since the last report. rate() = divide the value directly by the reporting interval.
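The two rate() formulas can be sketched side by side. This is an illustrative computation, not any backend's actual query engine; the point is that both representations recover identical rates when the right formula is applied:

```python
# Sketch: how rate() differs by temporality (names are illustrative).

def rate_cumulative(points):
    """points: list of (timestamp_seconds, running_total) pairs."""
    rates = []
    for (t0, v0), (t1, v1) in zip(points, points[1:]):
        rates.append((v1 - v0) / (t1 - t0))  # differentiate, then divide by elapsed time
    return rates

def rate_delta(points, interval_seconds):
    """points: list of per-interval changes."""
    return [v / interval_seconds for v in points]

# The same traffic, reported both ways, yields the same rates:
cumulative = [(0, 0), (10, 50), (20, 120)]   # running totals
delta = [50, 70]                             # per-10s changes
assert rate_cumulative(cumulative) == rate_delta(delta, 10)  # both [5.0, 7.0]
```

Applying the delta formula to the cumulative points (or vice versa) produces numbers that are valid floats but wrong rates, which is exactly the silent failure described below.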
Both encode the same information. Neither is inherently better — the fit depends on the producer’s lifecycle. Long-lived services (HTTP servers, daemons) naturally suit cumulative; short-lived invocations (serverless, CLI commands, batch jobs) naturally suit delta.
Why This Matters
- The SDK sets temporality per exporter preference (OTEL_EXPORTER_OTLP_METRICS_TEMPORALITY_PREFERENCE=cumulative|delta), and each data point carries a temporality flag. The backend is expected to honor it.
- Backends that don’t store the flag per-sample, or that cache a single “canonical” temporality per metric name, break as soon as an app changes its preference — either via an env var flip or via an SDK default change across app versions.
- The failure mode is silent: wrong formula over technically-valid samples → empty aggregation result or nonsense numbers. No exception, no log line. You only notice when a dashboard goes blank.
- The OTel spec permits producers to change temporality, so this is not “misuse” — it’s an ongoing backend correctness requirement.
Evidence/Examples
SigNoz stale-metadata bug (April 2026). Claude Code emitted delta metrics through v2.1.112, then cumulative from v2.1.117 after we set the env var preference. SigNoz stores samples with per-row temporality correctly (samples_v4 shows Cumulative for recent rows, Delta for old), but the metric catalog (signoz_metrics.metadata) has rows for every (version, temporality) combination ever seen, and the API layer picks one with ClickHouse’s anyLast(temporality) GROUP BY metric_name. anyLast is non-deterministic — it returned a stale Delta row. The query planner then applied delta-formula rate math to cumulative samples. 1h queries returned empty; 7d happened to work because its scalar reduction path skipped the temporality lookup. Upstream: SigNoz #8961 (cache-aside TTL design, open, unassigned) and SigNoz #9708 (closed; declared the “keep all rows, pick latest” policy but implemented it with a non-deterministic picker).
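The anyLast failure can be simulated in a few lines. The dict-based "catalog" below is an illustrative stand-in, not SigNoz's actual schema; it shows why a no-ordering-guarantee picker returns stale metadata while a deterministic argMax-style picker does not:

```python
# Sketch: a per-metric catalog with one row per (temporality) ever seen.
# Field names are illustrative stand-ins for the real metadata table.
metadata_rows = [
    {"metric": "claude_code.tokens", "temporality": "Delta", "last_seen": 100},
    {"metric": "claude_code.tokens", "temporality": "Cumulative", "last_seen": 200},
]

def pick_any_last(rows):
    # Mimics ClickHouse anyLast(): returns "some" encountered value with no
    # ordering guarantee, so which row wins depends on read order.
    return rows[-1]["temporality"] if rows else None

def pick_arg_max(rows):
    # Deterministic alternative: temporality of the most recently seen row
    # (the shape of ClickHouse argMax(temporality, last_seen)).
    return max(rows, key=lambda r: r["last_seen"])["temporality"]

assert pick_arg_max(metadata_rows) == "Cumulative"       # always correct
assert pick_any_last(list(reversed(metadata_rows))) == "Delta"  # stale answer
```

Under "keep all rows, pick latest", correctness hinges entirely on the picker being deterministic over a recency column.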
Prometheus’s cumulative-only origin. Prometheus historically only understood cumulative counters, so delta-emitting OTel producers need a deltatocumulative processor in the Collector before export (the reverse, cumulativetodelta, serves delta-only backends). Same principle: the reader assumes one form, the writer uses the other, so a translation layer has to exist somewhere.
AWS CloudWatch. Takes cumulative internally. Delta-emitting OTLP exporters routed to CloudWatch need a conversion processor or get zero-rate series.
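What such a translation layer does can be sketched as a running accumulator. This is a simplified single-stream view, not the Collector's actual implementation (the real deltatocumulative processor tracks stream identity, start timestamps, and counter resets):

```python
# Sketch: delta-to-cumulative conversion for one metric stream.
class DeltaToCumulative:
    def __init__(self):
        self.total = 0  # running total since the stream started

    def convert(self, delta_point):
        # Each incoming delta is folded into the running total,
        # producing the cumulative value a backend like Prometheus expects.
        self.total += delta_point
        return self.total

conv = DeltaToCumulative()
assert [conv.convert(d) for d in [50, 70, 30]] == [50, 120, 150]
```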
Implications
- Pick temporality deliberately and early, document it in the project’s observability config, and change it only with eyes open. Each change is a soft breaking event for backends that cache.
- Prefer cumulative for long-lived processes targeting Prometheus/ClickHouse/CloudWatch; prefer delta only when the backend explicitly supports it (OTLP-native stores, Honeycomb, some configurations of New Relic).
- When a producer version bump changes temporality, expect backend breakage. Include a catalog reset / metadata purge step in the upgrade procedure.
- Silent emptiness is the diagnostic tell. When a metric query returns no rows despite the underlying table clearly having data, the first hypothesis should be “temporality mismatch between catalog and samples,” not “data is missing.”
- Dashboards and alerts should override temporality explicitly when the backend supports it; explicit overrides are defense-in-depth against catalog drift.
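The "silent emptiness" tell suggests a cheap automated check: compare the temporality the catalog assumes against what recent samples actually carry. The function and field names below are hypothetical; a real check would query the backend's own metadata and samples tables:

```python
# Diagnostic sketch: detect catalog/sample temporality drift.
# All names here are illustrative, not a real backend API.
def detect_temporality_drift(catalog_temporality, recent_sample_temporalities):
    """Return a description of the mismatch, or None if catalog and samples agree."""
    observed = set(recent_sample_temporalities)
    if observed and catalog_temporality not in observed:
        return (f"mismatch: catalog says {catalog_temporality}, "
                f"recent samples carry {sorted(observed)}")
    return None

# Stale catalog (Delta) over samples that switched to Cumulative:
assert detect_temporality_drift("Delta", ["Cumulative", "Cumulative"]) is not None
# Healthy case:
assert detect_temporality_drift("Cumulative", ["Cumulative"]) is None
```

Running this against the last few minutes of samples would turn the blank-dashboard failure mode into an explicit alert.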
Related Ideas
- SigNoz ClickHouse TTL Overflow Post-Mortem — same class of SigNoz failure: silent data loss from a catalog/storage-layer invariant (DateTime UInt32 overflow there, temporality cache here). Both have the shape “data looks fine going in, disappears or misaggregates coming out, no error anywhere.”
Questions
- Does the OTel spec mandate backend behavior when a producer changes temporality mid-stream? The spec requires readers to honor per-point temporality, but backend catalog/cache behavior is implementation-defined.
- Is there a standard “temporality-stable fingerprint” across backends, or does every backend reinvent its own cache keying?
- What does Grafana Tempo do with temporality changes in span metrics? Worth checking next time that surface comes up.