DLPX-96312 Add InfluxDB/Telegraf infrastructure for Engine Performance Analytics #119

Open: dbshah12 wants to merge 5 commits into develop from dlpx/pr/dbshah12/5d79e679-49b6-4c0a-8241-1c919bfcaedb

dbshah12 commented Mar 31, 2026

Design Doc

Problem

Telegraf is already collecting engine performance metrics and writing them to local JSON files on the appliance. However, there is no local time-series database to store and serve these metrics, making it difficult for tools like DCT Smart Proxy to query historical performance data from the engine directly.

Additionally, several valuable metrics — per-connection TCP statistics, and storage I/O (NFS, iSCSI, backend disk) — were either not collected or only available when the performance playbook was explicitly enabled.

Storing all metrics in a single bucket would also mix Grafana-dashboard data with low-level diagnostics (aggregates, process counters, TCP internals), inflating storage costs for data that serves no dashboard purpose.

Solution

InfluxDB 2.x infrastructure

Add InfluxDB 2.x to the appliance as the single metrics store, mirroring the existing Telegraf setup pattern:

  • influxdb/influxdb.toml — InfluxDB daemon config: bound to 127.0.0.1:8086, with bolt/engine paths matching the installed package (/var/lib/influxdb/). Named .toml (not .conf) because InfluxDB uses the Viper config library, which determines the file format from the extension — .conf is not recognized and is silently ignored, causing influxd to fall back to defaults (~/.influxdbv2/).
  • influxdb/influxdb-init.conf — Tunable init config (org, bucket names, retention period, readiness wait parameters) sourced by the init script. Change values here without touching the script.
  • influxdb/delphix-influxdb-init — One-time init script that:
    • Exits immediately if /etc/influxdb/influxdb_meta already exists (safe on upgrades and reboots).
    • Waits for InfluxDB to be ready via the /health endpoint.
    • Calls /api/v2/setup to create the org, default bucket, and admin credentials (one-shot; uses curl directly, no influx CLI dependency).
    • Is crash-safe: persists a setup state file immediately after /api/v2/setup; each subsequent step (bucket creation, token creation) appends its result to the state file and checks for it on re-run, so the entire script is idempotent end-to-end.
    • Creates the support_metrics bucket for diagnostic and aggregate data that is not displayed in Grafana dashboards.
    • Creates three scoped tokens: a write-only token for Telegraf → default, a read-only token for DCT Smart Proxy → default, and a write-only token for Telegraf → support_metrics.
    • Writes three [[outputs.influxdb_v2]] stanzas to /etc/telegraf/telegraf.outputs.influxdb (chmod 640) — see Dual-bucket routing below.
    • Touches /etc/telegraf/INFLUXDB_ENABLED to enable InfluxDB output by default.
    • Atomically writes /etc/influxdb/influxdb_meta (chmod 600) containing: INFLUXDB_ORG, INFLUXDB_BUCKET, INFLUXDB_BUCKET_ID, INFLUXDB_SUPPORT_BUCKET, INFLUXDB_SUPPORT_BUCKET_ID, INFLUXDB_ADMIN_USER, INFLUXDB_ADMIN_PASSWORD, INFLUXDB_WRITE_TOKEN, INFLUXDB_READ_TOKEN. Bucket IDs are written so influxdb_lookup_bucket in support_info.sh can resolve them without an API call on engines that never had them stored. The support_metrics bucket is created with 30-day retention (INFLUXDB_SUPPORT_RETENTION_SECONDS); if the bucket already exists (engine upgrading from 7-day retention), influx_patch() PATCHes the retention so existing engines are updated without re-initialisation.
  • influxdb/delphix-influxdb-service — Wrapper that starts influxd with INFLUXD_CONFIG_PATH=/etc/influxdb/influxdb.toml in the background, runs the init script, then waits on the daemon PID. (influxd does not accept a --config-path flag; the config path must be set via the environment variable.)
  • influxdb/delphix-influxdb.service — Systemd unit following the same structure as delphix-telegraf.service (PartOf=delphix.target, Restart=on-failure, runs as root).
  • influxdb/perf_influxdb — Toggle script (mirrors perf_playbook) to enable/disable InfluxDB metric output from Telegraf without stopping InfluxDB itself. Manages the /etc/telegraf/INFLUXDB_ENABLED flag and restarts Telegraf.
  • influxdb/influxdb-nginx.conf — nginx reverse proxy config that exposes InfluxDB externally at /influxdb/, allowing tools like DCT Smart Proxy and Grafana to reach it without direct port access.
  • debian/rules — Installs all influxdb files: scripts to /usr/bin/, systemd unit to /lib/systemd/system/, configs to /etc/influxdb/, nginx config to /opt/delphix/server/etc/nginx/conf.d/.
  • debian/control — Added influxdb2 and curl to Depends.
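
The crash-safe and atomic-write behaviour described for delphix-influxdb-init can be sketched as follows. This is a minimal Python illustration, not the init script itself (which is a shell script using curl); the helper names `load_state`, `run_once`, and `atomic_write` and the KEY=VALUE state format are assumptions made for the example.

```python
import os
import tempfile


def load_state(path):
    """Read KEY=VALUE lines from the state file; a missing file means no steps ran."""
    state = {}
    if os.path.exists(path):
        with open(path) as fh:
            for line in fh:
                key, sep, value = line.rstrip("\n").partition("=")
                if sep:
                    state[key] = value
    return state


def run_once(state_path, key, step):
    """Run step() only if its result is not recorded yet, then append the result."""
    state = load_state(state_path)
    if key in state:
        return state[key]          # step completed on an earlier run; skip it
    value = step()
    with open(state_path, "a") as fh:
        fh.write(f"{key}={value}\n")
        fh.flush()
        os.fsync(fh.fileno())      # persist before the next step starts
    return value


def atomic_write(path, text, mode=0o600):
    """Write to a temp file in the target directory, then rename it into place."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp = tempfile.mkstemp(dir=directory)
    try:
        with os.fdopen(fd, "w") as fh:
            fh.write(text)
        os.chmod(tmp, mode)
        os.replace(tmp, path)      # atomic on POSIX within one filesystem
    except BaseException:
        if os.path.exists(tmp):
            os.unlink(tmp)
        raise
```

Because each step records its result before the next one begins, a crash between steps leaves a state file that a re-run can resume from, and readers of the final metadata file never observe a partially written file.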

Dual-bucket routing

Metrics are split across two buckets to keep Grafana-facing data separate from diagnostic data:

| Bucket | Purpose | Measurements |
|---|---|---|
| default | Grafana dashboards | cpu, disk, diskio, net, zfs, tcp_stats, estat_nfs, estat_iscsi, hist_estat_nfs, hist_estat_iscsi, hist_estat_backend-io |
| support_metrics | Everything else: diagnostics, aggregates, uncategorised | mem, processes, system, procstat, agg_*, nfs_threads, estat_backend-io, estat_zpl/zvol/zio/zio-queue/metaslab-alloc, hist_estat_zpl/zvol/zio/…, docker_container_* |

Routing is controlled by two [[outputs.influxdb_v2]] stanzas written by delphix-influxdb-init:

  1. Default bucket: namepass lists exactly the 11 measurements currently used in Grafana dashboards: cpu, disk, diskio, net, tcp_stats, zfs, estat_nfs, estat_iscsi, hist_estat_nfs, hist_estat_iscsi, hist_estat_backend-io. Any measurement added in future lands in support_metrics by default until explicitly promoted.
  2. support_metrics bucket: namedrop mirrors the default namepass list, so every other measurement flows here automatically.
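
The namepass/namedrop split can be modelled in a few lines. The sketch below is an illustration in Python rather than the Telegraf config itself; it uses `fnmatchcase` as a stand-in for Telegraf's glob matching (an assumption), and the function names are hypothetical.

```python
from fnmatch import fnmatchcase

# The 11 measurements listed for the default bucket.
DEFAULT_NAMEPASS = [
    "cpu", "disk", "diskio", "net", "tcp_stats", "zfs",
    "estat_nfs", "estat_iscsi",
    "hist_estat_nfs", "hist_estat_iscsi", "hist_estat_backend-io",
]


def matches(name, patterns):
    """True if the measurement name matches any glob pattern."""
    return any(fnmatchcase(name, pattern) for pattern in patterns)


def route(name):
    """Return the buckets a measurement reaches under the two filters."""
    buckets = []
    if matches(name, DEFAULT_NAMEPASS):        # namepass: only listed names pass
        buckets.append("default")
    if not matches(name, DEFAULT_NAMEPASS):    # namedrop: listed names are dropped
        buckets.append("support_metrics")
    return buckets
```

Because the second filter is the exact complement of the first, every measurement reaches exactly one bucket, and a name that nobody has listed (say, a newly added input) falls through to support_metrics.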

Why only dashboard-used measurements in default? Keeping the default bucket to exactly what Grafana queries minimises storage and query cost. Measurements not yet wired into a dashboard panel — playbook estat_*, process/aggregate counters, nfs_threads, Docker metrics — sit in support_metrics until a dashboard panel needs them.

Why move estat_backend-io scalars? Grafana uses the histogram clone (hist_estat_backend-io, which stays in default) for its I/O heatmap. The raw per-interval scalar rows from estat_backend-io serve no dashboard purpose but are useful for support investigations.

Why agg_* in support_metrics? Hourly aggregates duplicate raw data in summarised form. Grafana queries raw measurements directly; aggregates are only needed for support cases requiring a long time-range summary without fetching raw points.

Telegraf metric collection changes

All metrics now flow exclusively to InfluxDB — JSON file outputs have been removed entirely:

  • telegraf/telegraf.base — Updated:
    • Removed all [[outputs.file]] stanzas; InfluxDB is now the sole output.
    • Removed [[inputs.filestat]] and [[inputs.netstat]] (not required).
    • [[inputs.cpu]]: changed percpu = true to percpu = false, so only cpu-total is collected, not per-core. Reduces data volume on many-CPU engines; agg_cpu inherits this automatically.
    • [[inputs.disk]]: added tagexclude = ["fstype", "mode"] — these tags add no diagnostic value and inflate cardinality.
    • [[inputs.diskio]]: updated tagdrop to exclude ZFS internal zvol devices (zd*), NVMe partitions (*p[0-9]*), and SCSI/SATA partitions (sd*[0-9]*). Added tagexclude = ["wwid"] to drop the redundant 100+ character wwid tag. Partition entries accounted for ~29.5% of diskio/agg_diskio line volume.
    • [[inputs.procstat]] (both delphix-mgmt and zfs-object-agent instances): added tagexclude = ["cgroup_full"] — long cgroup path adds cardinality without diagnostic value.
    • Removed [[inputs.swap]] — swap usage adds no diagnostic value for Delphix appliances.
    • Added [[inputs.execd]] for per-connection TCP stats via connstat-stats.sh (measurement: tcp_stats).
  • telegraf/connstat-stats.sh — New shell/awk script running connstat -PLe -i 10 -T u to collect per-connection TCP statistics, aggregated by remote endpoint (laddr, raddr, service). Uses fflush() in awk explicitly after every 10-second batch to ensure deterministic output to Telegraf's execd pipe.
    • rport is excluded from the aggregation key — service already captures the semantic meaning of the port, and including rport causes cardinality explosion on Oracle dNFS engines where hundreds of connections to the same VDB host use different ephemeral remote ports (all mapping to service=nfs on lport 2049). Mirrors the aggregation in LocalTCPStatsCollector.
    • Service name resolved from /etc/services (lport first, then rport), with Delphix-specific ports not in /etc/services hard-coded (see script).
    • Cumulative fields (inbytes, outbytes, etc.) are summed; window/RTT fields (cwnd, swnd, rwnd, rtt) are averaged; connections reports the count of aggregated TCP connections.
  • telegraf/telegraf.inputs.storage_io — New always-on fragment (appended when InfluxDB is enabled, independent of playbook state) collecting:
    • estat_nfs — NFS server I/O (reads/writes from NFS clients).
    • estat_iscsi — iSCSI target I/O (reads/writes from iSCSI initiators).
    • estat_backend-io — Backend disk I/O via estat backend-io (equivalent to stbtrace io). Measures I/O at the physical/virtual disk layer after ZFS processing.
    • [[processors.converter]] to convert estat string fields to integers.
    • [[processors.clone]] (order=1) — clones all estat_* measurements as hist_estat_* to hold histogram data exclusively.
    • [[processors.strings]] (order=2) — removes the microseconds field from all original estat_* measurements after cloning, ensuring histogram data lives only in hist_estat_*. The original {val,count} format (e.g. {20000,5},{30000,15}) is preserved as-is — the previous regex+parser pipeline that attempted JSON conversion was removed because numeric field names are invalid in InfluxDB line protocol.
  • telegraf/telegraf.inputs.playbook — Removed estat_nfs, estat_iscsi, and estat_backend-io stanzas (moved to telegraf.inputs.storage_io). Removed the broken regex+parser histogram pipeline (replaced by clone+strings in storage_io). Scoped [[processors.converter]] to playbook-only metrics. Updated estat_metaslab-alloc command to use the new wrapper script.
  • telegraf/metaslab-alloc-stats.sh: Moved to a dedicated PR (#120 / DLPX-88427).
  • telegraf/telegraf.inputs.dct — Removed [[outputs.file]] for metrics_docker.json; docker metrics now go to InfluxDB.
  • telegraf/delphix-telegraf-service — When InfluxDB is enabled, appends both telegraf.inputs.storage_io and telegraf.outputs.influxdb (the three-stanza file) to the assembled config. Falls back to [[outputs.discard]] if InfluxDB output is not configured, so Telegraf always starts with a valid config regardless of state.
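
The config assembly and fallback described for delphix-telegraf-service can be sketched as below. This is a simplified Python model, not the service wrapper itself: it assumes all fragments live in one directory, ignores the playbook and DCT fragments, and the function names are hypothetical.

```python
import os

DISCARD_STANZA = "[[outputs.discard]]\n"


def read_fragment(path):
    """Return the text of one config fragment."""
    with open(path) as fh:
        return fh.read()


def assemble_config(conf_dir):
    """Build the Telegraf config text from the fragments found in conf_dir."""
    parts = [read_fragment(os.path.join(conf_dir, "telegraf.base"))]
    flag = os.path.join(conf_dir, "INFLUXDB_ENABLED")
    outputs = os.path.join(conf_dir, "telegraf.outputs.influxdb")
    storage_io = os.path.join(conf_dir, "telegraf.inputs.storage_io")
    if os.path.exists(flag) and os.path.exists(outputs):
        # InfluxDB output is enabled and configured: add always-on storage
        # inputs plus the InfluxDB output stanzas.
        parts.append(read_fragment(storage_io))
        parts.append(read_fragment(outputs))
    else:
        # No usable output: fall back to discard so the config stays valid.
        parts.append(DISCARD_STANZA)
    return "\n".join(parts)
```

The fallback matters because the base fragment no longer contains any output stanza, so without it a disabled or uninitialised InfluxDB would leave Telegraf with no outputs at all.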

BPF/estat kernel compatibility fixes

Several estat commands were failing to compile with redefinition and forward declaration errors on the current kernel. These fixes are required for the always-on estat_nfs, estat_iscsi, and estat_backend-io measurements to work correctly (DLPX-96701):

  • bpf/estat/nfs.c and bpf/stbtrace/nfs.st — Removed struct bpf_wq forward declaration that conflicts with updated kernel headers (the struct is now defined by the kernel itself).
  • bpf/estat/zvol.c — Removed zv_request_t struct typedef that conflicts with updated kernel headers.
  • bpf/stbtrace/iscsi.st — Added struct iscsi_conn; forward declaration before #include "iscsi_target_core.h" to resolve an incomplete type error.
  • bpf/standalone/arc_prefetch.py, bpf/standalone/txg.py, bpf/standalone/zil.py, cmd/estat.py — Added -D__KERNEL__ and -D_KERNEL BPF compiler flags required by newer kernel headers.
  • bpf/standalone/zil.py — Removed the zil_commit_waiter_skip kprobe (function no longer exists in the current kernel). Added default=60 to --coll so estat zil works without requiring -c. Simplified the collection loop to always run until Ctrl-C, using --coll as the sleep interval between output cycles.
  • cmd/estat.py — Updated estat zil help text to document the -c INTERVAL and -p POOL options.
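
The zil.py change (a default for --coll and a loop that runs until Ctrl-C) follows a common argparse pattern. The sketch below is a hedged illustration, not the code in bpf/standalone/zil.py: the `emit`, `sleep`, and `max_cycles` parameters are hypothetical and exist only so the loop can be exercised without waiting or pressing Ctrl-C.

```python
import argparse
import time


def parse_args(argv=None):
    """Parse the two options named in the help text; -c defaults to 60 seconds."""
    parser = argparse.ArgumentParser(prog="zil")
    parser.add_argument("-c", "--coll", type=int, default=60,
                        help="seconds between output cycles (default: 60)")
    parser.add_argument("-p", dest="pool", default=None,
                        help="restrict collection to one pool")
    return parser.parse_args(argv)


def collect(interval, emit, sleep=time.sleep, max_cycles=None):
    """Sleep for interval, emit one report, and repeat until interrupted."""
    cycles = 0
    try:
        while max_cycles is None or cycles < max_cycles:
            sleep(interval)
            emit()
            cycles += 1
    except KeyboardInterrupt:
        pass                       # Ctrl-C ends collection cleanly
    return cycles
```

With the default in place, invoking the tool without -c behaves like passing -c 60, which is what allows Telegraf to call it without extra arguments.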

Complete list of measurements in InfluxDB

| Measurement | Source | Bucket | Availability |
|---|---|---|---|
| cpu | [[inputs.cpu]] (cpu-total only; per-core excluded) | default | Always |
| disk | [[inputs.disk]] (fstype/mode tags excluded) | default | Always |
| diskio | [[inputs.diskio]] (zd*, *p[0-9]*, sd*[0-9]*, wwid excluded) | default | Always |
| net | [[inputs.net]] | default | Always |
| zfs | [[inputs.zfs]] | default | Always |
| tcp_stats | connstat: connections, inbytes, outbytes, retranssegs, rtt, cwnd, swnd, rwnd, suna, unsent | default | Always |
| mem | [[inputs.mem]] | support_metrics | Always |
| processes | [[inputs.processes]] | support_metrics | Always |
| system | [[inputs.system]] | support_metrics | Always |
| procstat | [[inputs.procstat]] for mgmt + zfs-object-agent (cgroup_full excluded) | support_metrics | Always |
| agg_cpu/disk/diskio/mem/net/processes/system | Hourly min/max/mean/stdev aggregates | support_metrics | Always |
| estat_nfs | NFS server I/O via estat nfs | default | When InfluxDB enabled |
| estat_iscsi | iSCSI target I/O via estat iscsi | default | When InfluxDB enabled |
| estat_backend-io | Backend disk I/O scalars via estat backend-io | support_metrics | When InfluxDB enabled |
| hist_estat_nfs, hist_estat_iscsi, hist_estat_backend-io | Histogram-only clones of always-on storage I/O | default | When InfluxDB enabled |
| estat_zpl/zio/zvol/zio-queue/metaslab-alloc | ZFS operation stats | support_metrics | Playbook only |
| hist_estat_zpl/zvol/zio/… | Histogram clones of playbook estat_* | support_metrics | Playbook only |
| nfs_threads | NFS thread utilization | support_metrics | Playbook only |
| docker_container_* | Docker/DCT container metrics | support_metrics | DCT engines only |

Notes to Reviewers

Runtime dependency decisions (debian/control)

When someone runs apt install performance-diagnostics, APT checks each package listed in Depends:

  • If already installed → skip (no reinstall, no harm).
  • If not installed → automatically download and install it.

The init script (delphix-influxdb-init) relies on curl, openssl, and python3 at runtime. The table below records the Depends decision for each:

| Dependency | Decision | Reason |
|---|---|---|
| openssl | Added | Used by delphix-influxdb-init (openssl rand -hex 16) to generate the admin password. Although openssl ships with delphix-platform, it is only in Build-Depends there, not Depends, so it is declared explicitly here to be safe. |
| python3 | Not added | Already present via python3-minimal in our existing Depends. |
| curl | Added | Only in delphix-platform's Build-Depends (build-time only), so explicitly declared here to be safe. |

Why influxdb.toml instead of influxdb.conf

InfluxDB 2.x uses Viper for config parsing, which determines the file format from the extension. Only .json, .toml, .yaml, and .yml are recognized — .conf is silently ignored and influxd falls back to defaults (~/.influxdbv2/ for root). Verified on InfluxDB v2.8.0: INFLUXD_CONFIG_PATH=influxdb.conf → paths/settings ignored; INFLUXD_CONFIG_PATH=influxdb.toml → config fully respected.
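
The extension-driven behaviour can be pictured with a small lookup. This is a simplified Python model of the behaviour described above, not Viper's actual implementation; the mapping and function name are assumptions for illustration.

```python
import os

# Extensions the text above lists as recognized, mapped to a parser name.
SUPPORTED_EXTENSIONS = {
    ".json": "json",
    ".toml": "toml",
    ".yaml": "yaml",
    ".yml": "yaml",
}


def config_format(path):
    """Return the parser chosen for path, or None if the extension is unknown."""
    extension = os.path.splitext(path)[1]
    return SUPPORTED_EXTENSIONS.get(extension)
```

In this model a path ending in .conf yields no parser at all, which corresponds to the observed outcome: the file is skipped without an error and the daemon runs with its defaults.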

All metrics go to InfluxDB — no file outputs

Previously Telegraf wrote metrics to local JSON files (metrics_cpu.json, metrics_docker.json, etc.). Those [[outputs.file]] stanzas have been removed entirely. Routing between the two buckets is controlled by the three [[outputs.influxdb_v2]] stanzas in telegraf.outputs.influxdb (written by delphix-influxdb-init). When InfluxDB output is disabled, delphix-telegraf-service inserts [[outputs.discard]] so Telegraf always starts with a valid config.

estat_backend-io vs stbtrace io

estat is a Delphix wrapper around stbtrace (BPF kernel tracing). estat backend-io is the stbtrace io equivalent — it instruments I/O at the backend storage device layer (after ZFS cache/compression/RAID transforms). Combined with estat_nfs and estat_iscsi, this lets you trace the full I/O path: client request → ZFS → physical disk.

Disk partition and tag exclusions ([[inputs.diskio]])

ZFS zvol block devices (zd0, zd1, …), NVMe partitions (nvme0n1p1, etc.), and SCSI/SATA partitions (sda1, sdb2, etc.) appear in /proc/diskstats but add no diagnostic value — partition-level I/O duplicates what is already visible at the whole-disk level. These accounted for ~29.5% of diskio/agg_diskio line volume. The wwid tag is a redundant 100+ character identifier; the short-form name tag is sufficient. Both reductions lower storage and query cost in InfluxDB.
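
To see which device names the three patterns exclude, the filter can be approximated in Python. This sketch assumes `fnmatchcase` behaves like Telegraf's glob matching for these simple patterns, which is an assumption; the function name is hypothetical.

```python
from fnmatch import fnmatchcase

# Device-name patterns excluded via tagdrop, as listed in the description.
EXCLUDED_PATTERNS = ["zd*", "*p[0-9]*", "sd*[0-9]*"]


def keep_device(name):
    """True if the device survives the exclusion patterns."""
    return not any(fnmatchcase(name, pattern) for pattern in EXCLUDED_PATTERNS)


if __name__ == "__main__":
    for device in ["nvme0n1", "nvme0n1p1", "sda", "sda1", "zd0"]:
        print(device, "kept" if keep_device(device) else "dropped")
```

Under these matching semantics, whole disks such as nvme0n1 and sda are kept while nvme0n1p1, sda1, and zd0 are dropped. Note that *p[0-9]* also matches any other name containing a "p" followed by a digit, for example loop0.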

tcp_stats — per-endpoint TCP statistics

connstat -PLe -i 10 -T u outputs per-connection TCP stats every 10 seconds. The wrapper script (connstat-stats.sh) aggregates by (laddr, raddr, service) to mirror LocalTCPStatsCollector. rport is excluded to prevent cardinality explosion on Oracle dNFS engines. The service tag is resolved from /etc/services (lport first, then rport), with dlpx-sp (port 50001) hard-coded as a special case. Fields: inbytes, outbytes, retranssegs, suna (unacknowledged bytes), unsent, swnd/cwnd/rwnd, rtt, connections.
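
The aggregation rule can be sketched as follows. This is a Python illustration of the grouping logic, not the shell/awk wrapper itself; only inbytes and outbytes are summed here because the description does not spell out which other fields count as cumulative, and the input dictionaries are a hypothetical representation of parsed connstat rows.

```python
from collections import defaultdict

SUM_FIELDS = ("inbytes", "outbytes")          # cumulative counters: summed
AVG_FIELDS = ("cwnd", "swnd", "rwnd", "rtt")  # window/RTT fields: averaged


def aggregate(connections):
    """Group connections by (laddr, raddr, service); rport is not in the key."""
    groups = defaultdict(list)
    for conn in connections:
        key = (conn["laddr"], conn["raddr"], conn["service"])
        groups[key].append(conn)

    rows = {}
    for key, conns in groups.items():
        row = {"connections": len(conns)}
        for field in SUM_FIELDS:
            row[field] = sum(c[field] for c in conns)
        for field in AVG_FIELDS:
            row[field] = sum(c[field] for c in conns) / len(conns)
        rows[key] = row
    return rows
```

Two connections from the same client to the NFS port with different ephemeral remote ports collapse into one row with connections = 2, which is how the series count stays bounded on dNFS engines.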

hist_estat_* histogram measurements

Histogram data (microseconds field — e.g. {20000,5},{30000,15}) is stored exclusively in hist_estat_* measurements. The original estat_* measurements have microseconds removed after cloning (via processors.strings fieldexclude). This eliminates duplication and keeps time-series rows lean. The {val,count} format is preserved as-is — a previous regex+parser pipeline that attempted JSON conversion was removed because numeric field names (e.g. "20000") are invalid in InfluxDB line protocol.
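
The clone-then-strip step can be modelled as below. This is a Python sketch of the data flow, not the Telegraf processors; metrics are represented as plain dictionaries (a hypothetical shape), and the clone is left as a full copy because any further trimming done by the real pipeline is not modelled here.

```python
import copy


def clone_and_strip(metric):
    """Return [original without microseconds, hist_ clone] for estat_* metrics."""
    if not metric["name"].startswith("estat_"):
        return [metric]                      # other measurements pass through
    clone = copy.deepcopy(metric)
    clone["name"] = "hist_" + metric["name"]  # e.g. estat_nfs -> hist_estat_nfs
    original = copy.deepcopy(metric)
    original["fields"].pop("microseconds", None)
    return [original, clone]
```

After this step the histogram string exists in exactly one place, the hist_estat_* clone, while the original row keeps only its scalar fields.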

metaslab-alloc-stats.sh — DLPX-88427 garbage name filter

Moved to a dedicated PR (#120 / DLPX-88427). Not part of this PR.

Testing Done

ab-pre-push

/etc/influxdb# ls -l
total 4
-rw-r--r-- 1 root root  86 Mar 31 09:56 config.toml
-rw-r--r-- 1 root root 357 Mar 31 09:19 influxdb-init.conf
-rw-r--r-- 1 root root 274 Mar 31 09:19 influxdb.toml
-rw------- 1 root root 347 Mar 31 12:24 influxdb_meta

/etc/influxdb# ls -l /var/lib/influxdb
total 23
drwxr-x--- 5 influxdb influxdb      5 Mar 31 12:46 engine
-rw------- 1 influxdb influxdb  65536 Mar 31 12:46 influxd.bolt
-rw-r----- 1 influxdb influxdb      4 Mar 31 12:22 influxd.pid
-rw-r----- 1 influxdb influxdb 122880 Mar 31 12:23 influxd.sqlite
  • InfluxDB setup is also completed, and I can see data there in the UI:
Screenshot 2026-03-31 at 6 27 22 PM

Measurements verified in InfluxDB

All expected measurements verified across both buckets on live engines:

default bucket (Grafana-facing — dashboard measurements only):

cpu              disk             diskio           estat_iscsi
estat_nfs        hist_estat_backend-io             hist_estat_iscsi
hist_estat_nfs   net              tcp_stats        zfs

support_metrics bucket (everything else):

agg_cpu          agg_disk         agg_diskio       agg_mem
agg_net          agg_processes    agg_system       estat_backend-io
mem              nfs_threads      processes        procstat
system

Change-specific verifications

| Change | Verification |
|---|---|
| diskio NVMe/SCSI partition exclusion | Confirmed nvme0n1p* and sda[0-9]* absent; only whole-disk entries present |
| diskio wwid tag removal | Confirmed wwid not present in diskio data |
| disk fstype/mode tag removal | Confirmed tags absent from disk measurement |
| procstat cgroup_full tag removal | Confirmed tag absent from procstat measurement |
| hist_estat_* in default bucket | hist_estat_nfs, hist_estat_iscsi, hist_estat_backend-io present with microseconds field |
| No microseconds duplication | microseconds absent from estat_nfs/estat_iscsi/estat_backend-io originals |
| tcp_stats slim in default (4 fields only) | connections, inbytes, outbytes, retranssegs present; rtt/cwnd/swnd/rwnd/suna/unsent absent |
| tcp_stats full in support_metrics | rtt, cwnd, swnd, rwnd, suna, unsent all present alongside core fields |
| agg_* in support_metrics only | All 7 aggregate measurements confirmed in support_metrics; absent from default |
| mem/processes/system/procstat in support_metrics | Confirmed in support_metrics; absent from default |
| estat_backend-io scalars in support_metrics | Confirmed; histogram clone (hist_estat_backend-io) present in default |
| tcp_stats service tag | service tag present (e.g. nfs, https, dlpx-sp) |
| tcp_stats rport tag removed | Confirmed absent; aggregation key is (laddr, raddr, service) |
| connstat Python: deterministic flush | 10-second batches flushed on schedule; no multi-hour mawk buffering |
| estat_metaslab-alloc via wrapper | Data present; garbage-name entries absent |
| estat nfs/iscsi/backend-io BPF compilation | Compile and collect without redefinition or forward-declaration errors |
| estat zil default collection | Runs without -c flag (defaults to 60 s); zil_commit_waiter_skip probe removed without errors |

perf_influxdb enable/disable testing

| Test | Result |
|---|---|
| INFLUXDB_ENABLED flag exists on fresh boot | |
| telegraf.outputs.influxdb exists with correct perms (-rw-r-----) | |
| Telegraf loaded influxdb_v2 output (2x) on boot | |
| perf_influxdb disable removes flag; Telegraf assembles config with [[outputs.discard]] | |
| perf_influxdb enable recreates flag; Telegraf reloads with both influxdb_v2 stanzas | |
| After enable, data flows to both default and support_metrics | |
| Non-root user blocked with clear error (must be run as root) | |
| No errors in journalctl | |

dbshah12 force-pushed the dlpx/pr/dbshah12/5d79e679-49b6-4c0a-8241-1c919bfcaedb branch 2 times, most recently from bb1bd01 to 985a3ac on March 31, 2026 08:49
dbshah12 requested a review from Copilot on March 31, 2026 08:53

This comment was marked as resolved.

dbshah12 force-pushed the dlpx/pr/dbshah12/5d79e679-49b6-4c0a-8241-1c919bfcaedb branch from 6286549 to 2a39e0c on March 31, 2026 09:16
dbshah12 marked this pull request as ready for review on March 31, 2026 10:57
dbshah12 marked this pull request as draft on March 31, 2026 11:01
dbshah12 force-pushed the dlpx/pr/dbshah12/5d79e679-49b6-4c0a-8241-1c919bfcaedb branch 2 times, most recently from bad0342 to df102c9 on March 31, 2026 13:03
dbshah12 marked this pull request as ready for review on March 31, 2026 13:07
dbshah12 requested a review from sebroy on March 31, 2026 13:07
dbshah12 self-assigned this on Mar 31, 2026
dbshah12 requested a review from Copilot on April 1, 2026 05:37

This comment was marked as resolved.

This comment was marked as spam.

dbshah12 force-pushed the dlpx/pr/dbshah12/5d79e679-49b6-4c0a-8241-1c919bfcaedb branch 3 times, most recently from 34acba1 to 02cc5df on April 1, 2026 06:28
dbshah12 requested a review from Copilot on April 1, 2026 06:29

This comment was marked as spam.

delphix deleted a comment from Copilot AI on Apr 1, 2026
dbshah12 force-pushed the dlpx/pr/dbshah12/5d79e679-49b6-4c0a-8241-1c919bfcaedb branch from 02cc5df to 7095d33 on April 1, 2026 06:43

This comment was marked as resolved.

dbshah12 force-pushed the dlpx/pr/dbshah12/5d79e679-49b6-4c0a-8241-1c919bfcaedb branch from 2a7d1b7 to 444ca18 on April 20, 2026 13:55
dbshah12 requested a review from Copilot on April 20, 2026 14:10

This comment was marked as resolved.

dbshah12 force-pushed the dlpx/pr/dbshah12/5d79e679-49b6-4c0a-8241-1c919bfcaedb branch 9 times, most recently from 304c9a7 to 60660da on April 27, 2026 11:53
dbshah12 requested a review from Copilot on April 28, 2026 07:27

This comment was marked as resolved.

dbshah12 force-pushed the dlpx/pr/dbshah12/5d79e679-49b6-4c0a-8241-1c919bfcaedb branch from daae5c1 to 9a5175a on May 19, 2026 17:15
@dbshah12

Author

All Copilot review comments addressed — stale threads resolved, valid issues fixed or explained inline.

dbshah12 closed this on Jun 11, 2026
dbshah12 reopened this on Jun 11, 2026
dbshah12 force-pushed the dlpx/pr/dbshah12/5d79e679-49b6-4c0a-8241-1c919bfcaedb branch from 2e8baa6 to 9bf19d7 on June 12, 2026 06:06
dbshah12 requested review from nealquigley and removed request for sisodiyam8 on June 12, 2026 14:35
dbshah12 force-pushed the dlpx/pr/dbshah12/5d79e679-49b6-4c0a-8241-1c919bfcaedb branch from bde1608 to 774cdc1 on June 15, 2026 05:14
dbshah12 and others added 4 commits June 17, 2026 15:45
The Starlark processor was returning [metric] (pass-through) for
hist_estat_* rows whose microseconds field is absent — specifically
name=total summary rows, which carry only iops and throughput.
Those clones are pure duplicates of the corresponding estat_* row
and should not exist in hist_estat_*.

Fix: return [] to drop the metric when microseconds is None.
Only rows with actual histogram data produce le=<bucket> output.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

When microseconds is present but parses to no valid bucket pairs
(e.g. "{ }" from a zero-activity interval), the Starlark function
was falling through to `return result if result else [metric]` and
passing the metric through unchanged — a duplicate of the estat_*
summary row for that series.

Fix: return [] instead of [metric] when parsing yields no results.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

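
The two histogram fixes above amount to one rule: a hist_estat_* row either expands into bucket rows or disappears. The sketch below restates that rule in Python (Starlark has similar syntax, but this is not the processor's source); the dictionary metric shape and the `count` field name are assumptions made for the example.

```python
import re

# Matches one "{val,count}" pair, e.g. "{20000,5}".
PAIR = re.compile(r"\{\s*(\d+)\s*,\s*(\d+)\s*\}")


def expand_histogram(metric):
    """Return one row per histogram bucket, or [] when there is nothing to expand."""
    raw = metric["fields"].get("microseconds")
    if raw is None:
        return []                      # e.g. name=total summary rows: drop
    result = []
    for bucket, count in PAIR.findall(raw):
        result.append({
            "name": metric["name"],
            "tags": dict(metric["tags"], le=bucket),
            "fields": {"count": int(count)},
        })
    return result                      # [] for "{ }"; never the unchanged metric
```

Returning an empty list in both degenerate cases is what prevents a pass-through copy of the scalar row from landing in hist_estat_*.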
mawk 1.3.4 does not flush its stdout buffer via fflush() when writing to a
Telegraf execd pipe, causing tcp_stats data to never reach InfluxDB on engines
where mawk is the default awk. Wrapping connstat in a while loop with -c 2
forces awk to exit naturally after each 10-second interval, triggering the C
runtime exit flush (fclose) that reliably delivers data to Telegraf. The END
block captures the partial second interval on each awk exit so no data is lost.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

Extend the existing name=total starlark drop filter (previously
estat_backend-io only) to also cover estat_nfs and estat_iscsi.
The total row is read+write summed at collection time and can be
derived in Grafana, so storing it wastes ~33% of each measurement's
space. The hist_estat_* clones are unaffected since name=total rows
carry no microseconds field and are already dropped by the histogram
processor.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
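
A short sketch of why dropping name=total loses no information, assuming each interval carries exactly three rows tagged name=read, name=write, and name=total (the read and write tag values are inferred from "read+write" and are an assumption). The row shape and function names are hypothetical.

```python
def drop_total(rows):
    """Remove the pre-summed total row; read and write rows are kept."""
    return [row for row in rows if row["tags"].get("name") != "total"]


def derive_total(rows, field):
    """Recompute the total for one field from the read and write rows."""
    return sum(
        row["fields"][field]
        for row in rows
        if row["tags"].get("name") in ("read", "write")
    )


if __name__ == "__main__":
    interval = [
        {"tags": {"name": "read"}, "fields": {"iops": 10}},
        {"tags": {"name": "write"}, "fields": {"iops": 5}},
        {"tags": {"name": "total"}, "fields": {"iops": 15}},
    ]
    kept = drop_total(interval)
    print(len(kept), derive_total(kept, "iops"))
```

Under that assumption, one of every three rows is removed, which matches the roughly 33% saving quoted in the commit message, and the total remains recoverable at query time.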

2 participants