2026-05-16 SigNoz Dockerstats and OpAMP Investigation

What I set out to do

Noticed SigNoz host metrics weren’t showing much movement on network packets even though I was downloading on Plex from a local server. Wanted to figure out what was wrong. Snowballed into adding a dockerstatsreceiver, finding two pre-existing bugs, and a long detour through SigNoz’s OpAMP wrapper source.

What I actually did

Three commits on main, plus a manual deploy of the patched SigNoz config because hm switch silently skipped the sync step (items 4 and 5):

  1. The premise was wrong. Queried SigNoz: host.name=atlas had system.network.packets for eth0, lo, gre0, gretap0, ip6_vti0, ip6gre0, ip6tnl0, ip_vti0, sit0, tunl0, erspan0. Everything after lo is a Linux tunnel-module interface, which cannot exist on macOS. The collector was reading /proc/net/dev from its own container’s netns inside Docker Desktop’s Linux VM, not the Mac’s en0. The /:/hostfs:ro mount in signoz-override.yaml.in is meaningless on Mac because macOS has no /proc. The 110 KB/s eth0 transmit number we were seeing was the collector’s own chatter to ClickHouse.

  2. Plex isn’t a container either. Confirmed with ps -ef | rg plex — Plex Media Server runs natively on the Mac (PID 93082, listening on *:32400). So dockerstats can’t see Plex regardless. Best we can do for “is Plex pulling bytes” via the existing collector is its container neighbors (qBittorrent, Sonarr, etc.), unless we ship a native otelcol-contrib via launchd or scrape Plex’s /library/sessions API.

  3. 2dc1c09 — Added dockerstats receiver. Two changes:

    • nix/home-manager/files/compose/signoz-override.yaml.in: bound /var/run/docker.sock:/var/run/docker.sock:ro into the collector.
    • nix/home-manager/modules/signoz.nix: added the receiver block + metrics/dockerstats pipeline inside patchOtelConfigText, sharing the same processors/exporters as metrics/hostmetrics.
  4. hm switch ran clean but my config wasn’t deployed. Wasted ten minutes assuming a Nix evaluation issue before grepping the actual generated activation script in /nix/store/.../activate and noticing _iNote "Activating %s" "syncSigNoz" at line 1139, well past where my tail -50 cut off. The visible output stopped at restartLitellmAgent — silent truncation. Manually executed the syncSigNoz block to force the deploy, then docker compose up -d --force-recreate --no-deps otel-collector.

  5. 6da8c33 — Fixed the silent activation truncation. nix/home-manager/modules/litellm/default.nix:347-370. The block used [ -f "$plist" ] || exit 0 and [ "$new" = "$old" ] && exit 0 as early returns for the common no-op case. HM activations run in one set -eu script with no function wrapping, so exit 0 exits the whole thing. On every hm switch where the litellm plist hash matched the stamp (i.e. almost every run), signZoteroWordIntegration, sqliteSeed-open-webui, and syncSigNoz were silently skipped. Restructured with nested if so control falls through. After the fix hm switch actually shows Activating syncSigNoz at the tail.

  6. Container metrics flowed for ~30 seconds, then everything died. After the recreate, system.network.io and system.network.packets stopped producing fresh samples and container.* metrics never appeared. The .rollback file in /var/tmp/ had my real config (with docker_stats and the dockerstats pipeline). The active /var/tmp/collector-config.yaml had every pipeline rewritten to receivers: [nop], exporters: [nop]. Same modify timestamp on both files — they got written together at agent startup.

  7. Read the signoz-otel-collector source. opamp/server_client.go defines initialNopConfig(), which deliberately rewrites every service.pipelines.* to nop on boot. The comment says: “enabling extensions so to bypass healthchecks in docker and helm installation.” The OpAMP wrapper expects the SigNoz server to push the real config back via onRemoteConfigHandler and only then sets runningNopConfig.Store(false). The agent reports its real config as EffectiveConfig (override at line 152), so the server should recommend the same config back. On the server side, pkg/query-service/app/opamp/model/agent.go::processStatusUpdate gates the push on agentDescrChanged and on updateRemoteConfig’s diff. For a fresh install with no UI ingestion pipelines, the recommendation equals the agent’s reported config, so the diff is empty, nothing gets pushed, and the round-trip silently fails to close.

  8. 6d55c64 — Bypassed OpAMP. Dropped --manager-config=/etc/manager-config.yaml and --copy-path=/var/tmp/collector-config.yaml from the override’s command. Collector now runs purely from /etc/otel-collector-config.yaml. Hostmetrics resumed, dockerstats started ingesting. container.network.io.usage.rx_bytes shows the live media stack pulling bytes: sonarr 8.7KB / qbittorrent 3.3KB / bazarr 3.3KB / radarr 1.3KB / prowlarr 0 over a 15min window. Tiny because nothing’s actively grabbing right now, but the receiver is healthy.

  9. Saved a memory note at memory/project_signoz_opamp_nop_bug.md so future-me doesn’t try to re-enable --manager-config without expecting the all-nop trap. Indexed in MEMORY.md under the SigNoz section alongside the retention API and counter aggregation notes.
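
Rough sketch of a check for the item 6 symptom, so the all-nop state can be spotted without eyeballing the file. Assumes the active collector config has already been parsed into a dict (e.g. with a YAML loader). The function name, the sample dicts, and the "example_exporter" name are mine for illustration, not anything from SigNoz.

```python
def all_pipelines_nop(config: dict) -> bool:
    """Return True if every service pipeline uses only the nop receiver and exporter."""
    pipelines = config.get("service", {}).get("pipelines", {})
    if not pipelines:
        # No pipelines at all is a different failure, not the nop trap.
        return False
    return all(
        p.get("receivers") == ["nop"] and p.get("exporters") == ["nop"]
        for p in pipelines.values()
    )


# Hypothetical miniature of the two files from item 6. Pipeline names come
# from this log; "example_exporter" is a placeholder, not the real exporter.
ACTIVE = {
    "service": {
        "pipelines": {
            "metrics/hostmetrics": {"receivers": ["nop"], "exporters": ["nop"]},
            "metrics/dockerstats": {"receivers": ["nop"], "exporters": ["nop"]},
        }
    }
}
ROLLBACK = {
    "service": {
        "pipelines": {
            "metrics/hostmetrics": {
                "receivers": ["hostmetrics"],
                "exporters": ["example_exporter"],
            },
            "metrics/dockerstats": {
                "receivers": ["docker_stats"],
                "exporters": ["example_exporter"],
            },
        }
    }
}

if __name__ == "__main__":
    print("active is all-nop:", all_pipelines_nop(ACTIVE))
    print("rollback is all-nop:", all_pipelines_nop(ROLLBACK))
```

If the active file is all-nop while the .rollback file is not, the server never pushed the real config back.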

What was striking

  • The Docker Desktop netns confusion is generic, not specific to my setup. Anyone running a containerized hostmetrics receiver on Docker Desktop Mac will see Linux tunnel interfaces and assume they’re getting host metrics. The mount, the root_path: /hostfs, and the host.name=atlas resource attribute all point the wrong way. Worth a SigNoz reference note someday: “containerized hostmetrics on Docker Desktop Mac doesn’t see the Mac.”
  • The litellm exit 0 bug had been silently truncating my activations for who knows how long, possibly weeks. Every hm switch reported success, so I had no way of knowing. The smoke test is home-manager switch ... 2>&1 | rg 'Activating' | tail -3: if the last line isn’t the literal last activation in /nix/store/.../activate, the script aborted early. Maybe worth a CI hook that diffs the expected activation list against actual output.
  • SigNoz’s initialNopConfig is by design, but the design is broken for the install path we’re on. The intent (per the source comment) is to pass health checks during Docker/Helm install. What actually happens (server doesn’t push back when its recommendation equals the agent’s report) is silent. If you don’t watch ingestion you don’t notice. I noticed because we previously had data flowing and a recreate killed it.
  • Two-collector-instance startup is normal but it looks weird. The SigNoz wrapper boots otelcol with the real config first (instance d08ad710 in my logs), then calls reload(noopConfig) which writes the active file, copies the real one to .rollback, and restarts the inner otelcol (new instance 8c065d3c). The “Restarting collector service” log at +180ms with no apparent trigger is just this.
  • Plex Media Server runs as a regular macOS app and that’s fine. I was about to suggest containerizing it. Plex has been on this Mac since 20Apr26 (per ps) and is doing its job. The friction is that the rest of my stack is containerized and the observability machinery has nothing to say about native processes. Right answer for Plex specifically is its own /status/sessions/bandwidth endpoint.
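
The netns point in the first bullet lends itself to a cheap sanity check: if the device names reported for a supposedly macOS host include Linux tunnel interfaces, the collector is looking at a Linux network namespace. A minimal sketch, assuming the device names have already been pulled out of SigNoz as a list of strings. The function names are hypothetical and the interface set is just the one observed in this investigation, not an exhaustive list.

```python
# Interface names created by Linux tunnel kernel modules, as observed on
# host.name=atlas in this investigation. None of them exist on macOS.
LINUX_TUNNEL_INTERFACES = frozenset({
    "gre0", "gretap0", "erspan0", "ip_vti0", "ip6_vti0",
    "ip6gre0", "ip6tnl0", "sit0", "tunl0",
})


def linux_only_interfaces(devices):
    """Return the reported device names that can only come from a Linux kernel."""
    return sorted(set(devices) & LINUX_TUNNEL_INTERFACES)


def looks_like_linux_netns(devices) -> bool:
    """True if at least one reported device is a Linux tunnel interface."""
    return bool(linux_only_interfaces(devices))


# Device list reported for system.network.packets in item 1.
OBSERVED = [
    "eth0", "lo", "gre0", "gretap0", "ip6_vti0", "ip6gre0",
    "ip6tnl0", "ip_vti0", "sit0", "tunl0", "erspan0",
]

if __name__ == "__main__":
    print("linux-only devices:", linux_only_interfaces(OBSERVED))
    print("linux netns:", looks_like_linux_netns(OBSERVED))
```

A real Mac would report names like en0 and lo0, which this check lets through.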

Top 3 tomorrow

  1. Decide whether to add a Plex API scraper for container.network.*-style attribution of Plex traffic. Community option: arnarg/plex_exporter. Otherwise write a small Python adapter behind the prometheus receiver we already run for macmon.
  2. Maybe add a CI check or a wrapper around hm switch that asserts the last Activating line matches the expected tail of the activation script — would have caught the litellm bug in seconds.
  3. File an upstream issue with SigNoz about the all-nop-on-fresh-install behavior. Reproducer is essentially “run signoz-otel-collector with --manager-config against a SigNoz server that has no ingestion pipelines configured.”
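
Sketch of the check from item 2, assuming the activation script marks each step with a line like _iNote "Activating %s" "syncSigNoz" (the form I saw in the generated script) and that hm switch prints a matching "Activating syncSigNoz" line per step. Function names and the sample data are hypothetical. The output format in particular may differ between Home Manager versions, so treat the regexes as a starting point.

```python
import re

# Matches the step markers in the generated activation script.
_SCRIPT_NOTE = re.compile(r'_iNote "Activating %s" "([^"]+)"')
# Matches the per-step lines in the captured hm switch output.
_OUTPUT_NOTE = re.compile(r"Activating (\S+)")


def expected_activations(script_text: str) -> list:
    """Step names in the order the activation script would run them."""
    return _SCRIPT_NOTE.findall(script_text)


def observed_activations(output_text: str) -> list:
    """Step names that actually appeared in the switch output."""
    return _OUTPUT_NOTE.findall(output_text)


def skipped_activations(script_text: str, output_text: str) -> list:
    """Expected steps that never showed up in the output, in script order."""
    seen = set(observed_activations(output_text))
    return [name for name in expected_activations(script_text) if name not in seen]


# Illustrative sample only: step names are from this log, the ordering and
# surrounding text are made up.
STEPS = [
    "restartLitellmAgent",
    "signZoteroWordIntegration",
    "sqliteSeed-open-webui",
    "syncSigNoz",
]
SAMPLE_SCRIPT = "\n".join(f'_iNote "Activating %s" "{name}"' for name in STEPS)
SAMPLE_OUTPUT = "Activating restartLitellmAgent\n"

if __name__ == "__main__":
    missing = skipped_activations(SAMPLE_SCRIPT, SAMPLE_OUTPUT)
    if missing:
        print("activation aborted early; skipped:", ", ".join(missing))
```

Comparing the full set rather than only the last line also catches a step skipped in the middle, at the cost of false alarms if some step legitimately prints nothing.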