2026-05-16 SigNoz Dockerstats and OpAMP Investigation
What I set out to do
Noticed SigNoz host metrics weren’t showing much movement on network packets even though I was downloading on Plex from a local server. Wanted to figure out what was wrong. Snowballed into adding a dockerstatsreceiver, finding two pre-existing bugs, and a long detour through SigNoz’s OpAMP wrapper source.
What I actually did
Three commits on main, plus a manual deploy of the patched SigNoz config because of bug #2:
- The premise was wrong. Queried SigNoz: host.name=atlas had system.network.packets for eth0, lo, gre0, gretap0, ip6_vti0, ip6gre0, ip6tnl0, ip_vti0, sit0, tunl0, erspan0. Those are Linux interface names, most of them from tunnel modules; they cannot exist on macOS. The collector was reading /proc/net/dev from its own container’s netns inside Docker Desktop’s Linux VM, not the Mac’s en0. The /:/hostfs:ro mount we have in signoz-override.yaml.in is meaningless on Mac because macOS has no /proc. The 110 KB/s eth0 transmit number we were seeing was the collector’s own chatter to ClickHouse.
- Plex isn’t a container either. Confirmed with ps -ef | rg plex: Plex Media Server runs natively on the Mac (PID 93082, listening on *:32400). So dockerstats can’t see Plex regardless. The best the existing collector can do for “is Plex pulling bytes” is watch its container neighbors (qBittorrent, Sonarr, etc.), unless we ship a native otelcol-contrib via launchd or scrape Plex’s /library/sessions API.
- 2dc1c09: Added dockerstats receiver. Two changes:
  - nix/home-manager/files/compose/signoz-override.yaml.in: bound /var/run/docker.sock:/var/run/docker.sock:ro into the collector.
  - nix/home-manager/modules/signoz.nix: added the receiver block + metrics/dockerstats pipeline inside patchOtelConfigText, sharing the same processors/exporters as metrics/hostmetrics.
- hm switch ran clean but my config wasn’t deployed. Wasted ten minutes assuming a Nix evaluation issue before grepping the actual generated activation script in /nix/store/.../activate and noticing _iNote "Activating %s" "syncSigNoz" at line 1139, well past where my tail -50 cut off. The visible output stopped at restartLitellmAgent: silent truncation. Manually executed the syncSigNoz block to force the deploy, then docker compose up -d --force-recreate --no-deps otel-collector.
- 6da8c33: Fixed the silent activation truncation. nix/home-manager/modules/litellm/default.nix:347-370. The block used [ -f "$plist" ] || exit 0 and [ "$new" = "$old" ] && exit 0 as early returns for the common no-op case. HM activations run in one set -eu script with no function wrapping, so exit 0 exits the whole thing. On every hm switch where the litellm plist hash matched the stamp (i.e. almost every run), signZoteroWordIntegration, sqliteSeed-open-webui, and syncSigNoz were silently skipped. Restructured with nested if so control falls through. After the fix hm switch actually shows Activating syncSigNoz at the tail.
- Container metrics flowed for ~30 seconds, then everything died. After the recreate, system.network.io and system.network.packets stopped producing fresh samples and container.* metrics never appeared. The .rollback file in /var/tmp/ had my real config (with docker_stats and the dockerstats pipeline). The active /var/tmp/collector-config.yaml had every pipeline rewritten to receivers: [nop], exporters: [nop]. Same modify timestamp on both files, so they got written together at agent startup.
- Read the signoz-otel-collector source. opamp/server_client.go defines initialNopConfig() which deliberately rewrites every service.pipelines.* to nop on boot. The comment says: “enabling extensions so to bypass healthchecks in docker and helm installation.” The OpAMP wrapper expects the SigNoz server to push the real config back via onRemoteConfigHandler and only then sets runningNopConfig.Store(false). The agent reports its real config as EffectiveConfig (override at line 152) so the server should recommend the same config back. On the server side (pkg/query-service/app/opamp/model/agent.go::processStatusUpdate), the push is gated on agentDescrChanged and updateRemoteConfig’s diff. For a fresh install with no UI ingestion pipelines, the recommendation equals the agent’s reported config, so nothing is pushed and the round-trip silently fails to close.
- 6d55c64: Bypassed OpAMP. Dropped --manager-config=/etc/manager-config.yaml and --copy-path=/var/tmp/collector-config.yaml from the override’s command. Collector now runs purely from /etc/otel-collector-config.yaml. Hostmetrics resumed, dockerstats started ingesting. container.network.io.usage.rx_bytes shows the live media stack pulling bytes: sonarr 8.7KB / qbittorrent 3.3KB / bazarr 3.3KB / radarr 1.3KB / prowlarr 0 over a 15min window. Tiny because nothing’s actively grabbing right now, but the receiver is healthy.
- Saved a memory note at memory/project_signoz_opamp_nop_bug.md so future-me doesn’t try to re-enable --manager-config without expecting the all-nop trap. Indexed in MEMORY.md under the SigNoz section alongside the retention API and counter aggregation notes.
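
Rough reproduction of the exit 0 trap from 6da8c33, assuming sh is on PATH. The two scripts are hypothetical stand-ins for the real activation blocks (only the block names come from my actual output, and the abc hash values are made up); the point is just that an early exit in one concatenated set -eu script takes every later block with it, while a nested if does not.

```python
import subprocess

# Hypothetical stand-in for the old litellm block followed by a later block.
EARLY_EXIT = """
set -eu
echo "Activating restartLitellmAgent"
new=abc
old=abc
# No-op case handled as an early return: this ends the WHOLE script.
[ "$new" = "$old" ] && exit 0
echo "restarting agent"
echo "Activating syncSigNoz"
"""

# Same logic restructured so control falls through to the next block.
NESTED_IF = """
set -eu
echo "Activating restartLitellmAgent"
new=abc
old=abc
if [ "$new" != "$old" ]; then
    echo "restarting agent"
fi
echo "Activating syncSigNoz"
"""


def run(script: str) -> list[str]:
    """Run a POSIX shell script and return its stdout as a list of lines."""
    result = subprocess.run(
        ["sh", "-c", script], capture_output=True, text=True, check=True
    )
    return result.stdout.splitlines()


if __name__ == "__main__":
    print("early exit:", run(EARLY_EXIT))
    print("nested if: ", run(NESTED_IF))
```

Both scripts exit with status 0, which is why hm switch looked clean either way; only the list of Activating lines differs.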
What was striking
- The Docker Desktop netns confusion is generic, not specific to my setup. Anyone running a containerized hostmetrics receiver on Docker Desktop Mac will see Linux tunnel interfaces and assume they’re getting host metrics. The mount, the root_path: /hostfs, the host.name=atlas resource attribute: all of it points the wrong way. Worth a SigNoz reference note someday: “containerized hostmetrics on Docker Desktop Mac doesn’t see the Mac.”
- The litellm exit 0 bug had been silently truncating my activations for who knows how long. Possibly weeks. Every hm switch reported success, so there was nothing to tip me off. The smoke test is home-manager switch ... 2>&1 | rg 'Activating' | tail -3; if the last line isn’t the literal last activation in nix/store/.../activate, the script aborted early. Maybe worth a CI hook that diffs the expected activation list against actual output.
- SigNoz’s initialNopConfig is by design but the design is broken for the install path we’re on. The intent (per the source comment) is health checks during Docker/Helm install. What actually happens (server doesn’t push back when recommendation equals report) is silent. If you don’t watch ingestion you don’t notice. I noticed because we previously had data flowing and a recreate killed it.
- Two-collector-instance startup is normal but it looks weird. The SigNoz wrapper boots otelcol with the real config first (instance d08ad710 in my logs), then calls reload(noopConfig) which writes the active file, copies the real one to .rollback, and restarts the inner otelcol (new instance 8c065d3c). The “Restarting collector service” log at +180ms with no apparent trigger is just this.
- Plex Media Server runs as a regular macOS app and that’s fine. I was about to suggest containerizing it. Plex has been on this Mac since 20Apr26 (per ps) and is doing its job. The friction is that the rest of my stack is containerized and the observability machinery has nothing to say about native processes. Right answer for Plex specifically is its own /status/sessions/bandwidth endpoint.
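
A minimal check for the all-nop state, so I can tell at a glance whether the active collector config is still the boot-time placeholder. It assumes the active file and the .rollback file have already been parsed into dicts (YAML parsing not shown). The function names and the example_exporter name are made up for illustration; only the receivers: [nop], exporters: [nop] shape and the pipeline names come from what I saw on disk.

```python
from typing import Any


def nop_pipelines(config: dict[str, Any]) -> list[str]:
    """Return the names of pipelines whose receivers and exporters are both [nop]."""
    pipelines = config.get("service", {}).get("pipelines", {})
    return [
        name
        for name, spec in pipelines.items()
        if spec.get("receivers") == ["nop"] and spec.get("exporters") == ["nop"]
    ]


def is_all_nop(config: dict[str, Any]) -> bool:
    """True when the config defines pipelines and every one of them is nop."""
    pipelines = config.get("service", {}).get("pipelines", {})
    return bool(pipelines) and len(nop_pipelines(config)) == len(pipelines)


# Shape of the active file after the wrapper's boot-time rewrite.
active = {
    "service": {
        "pipelines": {
            "metrics/hostmetrics": {"receivers": ["nop"], "exporters": ["nop"]},
            "metrics/dockerstats": {"receivers": ["nop"], "exporters": ["nop"]},
        }
    }
}

# Shape of the .rollback file; example_exporter is a placeholder name.
rollback = {
    "service": {
        "pipelines": {
            "metrics/hostmetrics": {
                "receivers": ["hostmetrics"],
                "exporters": ["example_exporter"],
            },
            "metrics/dockerstats": {
                "receivers": ["docker_stats"],
                "exporters": ["example_exporter"],
            },
        }
    }
}

if __name__ == "__main__":
    print("active all nop:  ", is_all_nop(active))
    print("rollback all nop:", is_all_nop(rollback))
```

If the active file is all nop while .rollback is not, the server never pushed the real config back.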
Top 3 tomorrow
- Decide whether to add a Plex API scraper for container.network.*-style attribution of Plex traffic. Community option: arnarg/plex_exporter. Otherwise write a small Python adapter behind the prometheus receiver we already run for macmon.
- Maybe add a CI check or a wrapper around hm switch that asserts the last Activating line matches the expected tail of the activation script. It would have caught the litellm bug in seconds.
- File an upstream issue with SigNoz about the all-nop-on-fresh-install behavior. Reproducer is essentially “run signoz-otel-collector with --manager-config against a SigNoz server that has no ingestion pipelines configured.”
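
One possible shape for the hm switch check. It assumes every activation block is announced in the generated script with an _iNote "Activating %s" "<name>" line and shows up in the switch output as Activating <name>; I have only confirmed that pattern for syncSigNoz. The function names are hypothetical, and reading the script and capturing the output are left to the caller.

```python
import re

# Matches script lines such as: _iNote "Activating %s" "syncSigNoz"
SCRIPT_RE = re.compile(r'_iNote\s+"Activating %s"\s+"([^"]+)"')
# Matches output lines such as: Activating syncSigNoz
OUTPUT_RE = re.compile(r"^Activating (\S+)[ \t]*$", re.MULTILINE)


def expected_activations(script_text: str) -> list[str]:
    """Activation names declared in the generated activate script, in order."""
    return SCRIPT_RE.findall(script_text)


def observed_activations(switch_output: str) -> list[str]:
    """Activation names that were actually printed during the switch."""
    return OUTPUT_RE.findall(switch_output)


def missing_tail(script_text: str, switch_output: str) -> list[str]:
    """Activations declared after the last one that was printed.

    An empty list means the output reached the end of the script.
    """
    expected = expected_activations(script_text)
    observed = observed_activations(switch_output)
    if not observed or observed[-1] not in expected:
        # Nothing recognizable was printed, so treat everything as missing.
        return expected
    # Index of the last occurrence of the final observed name.
    last_index = len(expected) - 1 - expected[::-1].index(observed[-1])
    return expected[last_index + 1:]
```

A wrapper would fail the run whenever missing_tail returns a non-empty list, and print the skipped names.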
Related
- 2026-04-11 SigNoz ClickHouse TTL Overflow Post-Mortem — last time SigNoz host metrics misbehaved on this Mac
- OTel Metric Temporality
- Some OTel Feature Gates Are Permanent
- Nix - Home Manager