DLPX-96312 Add InfluxDB/Telegraf infrastructure for Engine Performance Analytics #119
Open
dbshah12 wants to merge 5 commits into
Author: All Copilot review comments addressed: stale threads resolved, valid issues fixed or explained inline.
The Starlark processor was returning [metric] (pass-through) for hist_estat_* rows whose microseconds field is absent — specifically name=total summary rows, which carry only iops and throughput. Those clones are pure duplicates of the corresponding estat_* row and should not exist in hist_estat_*. Fix: return [] to drop the metric when microseconds is None. Only rows with actual histogram data produce le=<bucket> output. Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
When microseconds is present but parses to no valid bucket pairs
(e.g. "{ }" from a zero-activity interval), the Starlark function
was falling through to `return result if result else [metric]` and
passing the metric through unchanged — a duplicate of the estat_*
summary row for that series.
Fix: return [] instead of [metric] when parsing yields no results.
Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
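The drop-instead-of-pass-through behaviour from the two histogram commits above can be sketched in plain Python (Starlark is a Python dialect). This is an illustrative sketch, not the processor in this PR: the dict-shaped metric, the `expand_histogram` name, and the `count` output field are assumptions; only the `{val,count}` input format, the `le=<bucket>` tag, and the two drop conditions come from the commit messages and the design doc.

```python
import re

# Matches one "{upper_bound,count}" pair, e.g. "{20000,5}".
_PAIR = re.compile(r"\{\s*(\d+)\s*,\s*(\d+)\s*\}")


def expand_histogram(metric):
    """Return one row per histogram bucket, or [] when there is nothing to expand.

    `metric` is a plain dict standing in for a Telegraf metric
    ({"name": ..., "tags": {...}, "fields": {...}}); this shape is assumed.
    """
    raw = metric["fields"].get("microseconds")
    if raw is None:
        # name=total summary rows carry no histogram: drop, do not pass through.
        return []
    rows = []
    for bound, count in _PAIR.findall(raw):
        rows.append({
            "name": metric["name"],
            "tags": dict(metric["tags"], le=bound),
            "fields": {"count": int(count)},  # "count" is a hypothetical field name
        })
    # "{ }" from a zero-activity interval yields no pairs, so rows is [] here too.
    return rows
```

Both failure cases return an empty list, so no duplicate of the `estat_*` summary row can reach `hist_estat_*`.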
mawk 1.3.4 does not flush its stdout buffer via fflush() when writing to a Telegraf execd pipe, causing tcp_stats data to never reach InfluxDB on engines where mawk is the default awk. Wrapping connstat in a while loop with -c 2 forces awk to exit naturally after each 10-second interval, triggering the C runtime exit flush (fclose) that reliably delivers data to Telegraf. The END block captures the partial second interval on each awk exit so no data is lost. Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Extend the existing name=total starlark drop filter (previously estat_backend-io only) to also cover estat_nfs and estat_iscsi. The total row is read+write summed at collection time and can be derived in Grafana, so storing it wastes ~33% of each measurement's space. The hist_estat_* clones are unaffected since name=total rows carry no microseconds field and are already dropped by the histogram processor. Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
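A rough Python model of the extended drop filter, for illustration only. It assumes `name` is a tag on the metric and that each collection interval emits exactly one `read`, one `write`, and one `total` row per measurement, which is what the ~33% figure implies; `drop_total` and the dict-shaped metric are hypothetical.

```python
# Measurements covered by the name=total drop filter after this commit.
FILTERED = {"estat_backend-io", "estat_nfs", "estat_iscsi"}


def drop_total(metric):
    """Return [] for name=total rows of filtered measurements, else [metric]."""
    if metric["name"] in FILTERED and metric["tags"].get("name") == "total":
        return []
    return [metric]


# One interval of estat_nfs under the assumed read/write/total layout.
interval = [
    {"name": "estat_nfs", "tags": {"name": "read"}, "fields": {"iops": 10}},
    {"name": "estat_nfs", "tags": {"name": "write"}, "fields": {"iops": 4}},
    {"name": "estat_nfs", "tags": {"name": "total"}, "fields": {"iops": 14}},
]
kept = [row for m in interval for row in drop_total(m)]
saved_fraction = 1 - len(kept) / len(interval)  # one row in three is dropped
```

Under that layout the dropped row is recoverable as the sum of the two kept rows, which is why it can be derived in Grafana instead of stored.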
Design Doc
Problem
Telegraf is already collecting engine performance metrics and writing them to local JSON files on the appliance. However, there is no local time-series database to store and serve these metrics, making it difficult for tools like DCT Smart Proxy to query historical performance data from the engine directly.
Additionally, several valuable metrics — per-connection TCP statistics, and storage I/O (NFS, iSCSI, backend disk) — were either not collected or only available when the performance playbook was explicitly enabled.
Storing all metrics in a single bucket would also mix Grafana-dashboard data with low-level diagnostics (aggregates, process counters, TCP internals), inflating storage costs for data that serves no dashboard purpose.
Solution
InfluxDB 2.x infrastructure
Add InfluxDB 2.x to the appliance as the single metrics store, mirroring the existing Telegraf setup pattern:
- `influxdb/influxdb.toml`: InfluxDB daemon config, bound to `127.0.0.1:8086`, with bolt/engine paths matching the installed package (`/var/lib/influxdb/`). Named `.toml` (not `.conf`) because InfluxDB uses the Viper config library, which determines the file format from the extension; `.conf` is not recognized and is silently ignored, causing influxd to fall back to defaults (`~/.influxdbv2/`).
- `influxdb/influxdb-init.conf`: Tunable init config (org, bucket names, retention period, readiness wait parameters) sourced by the init script. Change values here without touching the script.
- `influxdb/delphix-influxdb-init`: One-time init script that:
  - Skips initial setup when `/etc/influxdb/influxdb_meta` already exists (safe on upgrades and reboots).
  - Waits for InfluxDB to become ready via the `/health` endpoint.
  - Calls `/api/v2/setup` to create the org, `default` bucket, and admin credentials (one-shot; uses `curl` directly, no `influx` CLI dependency).
  - Writes the state file after `/api/v2/setup`; each subsequent step (bucket creation, token creation) appends its result to the state file and checks for it on re-run, so the entire script is idempotent end-to-end.
  - Creates a `support_metrics` bucket for diagnostic and aggregate data that is not displayed in Grafana dashboards.
  - Creates API tokens, including a read-only token for DCT Smart Proxy → `default` and a write-only token for Telegraf → `support_metrics`.
  - Writes the `[[outputs.influxdb_v2]]` stanzas to `/etc/telegraf/telegraf.outputs.influxdb` (chmod 640); see Dual-bucket routing below.
  - Creates `/etc/telegraf/INFLUXDB_ENABLED` to enable InfluxDB output by default.
  - Writes `/etc/influxdb/influxdb_meta` (chmod 600) containing: `INFLUXDB_ORG`, `INFLUXDB_BUCKET`, `INFLUXDB_BUCKET_ID`, `INFLUXDB_SUPPORT_BUCKET`, `INFLUXDB_SUPPORT_BUCKET_ID`, `INFLUXDB_ADMIN_USER`, `INFLUXDB_ADMIN_PASSWORD`, `INFLUXDB_WRITE_TOKEN`, `INFLUXDB_READ_TOKEN`. Bucket IDs are written so `influxdb_lookup_bucket` in `support_info.sh` can resolve them without an API call on engines that never had them stored.
  - The `support_metrics` bucket is created with 30-day retention (`INFLUXDB_SUPPORT_RETENTION_SECONDS`); if the bucket already exists (engine upgrading from 7-day retention), `influx_patch()` PATCHes the retention so existing engines are updated without re-initialisation.
- `influxdb/delphix-influxdb-service`: Wrapper that starts `influxd` with `INFLUXD_CONFIG_PATH=/etc/influxdb/influxdb.toml` in the background, runs the init script, then waits on the daemon PID. (`influxd` does not accept a `--config-path` flag; the config path must be set via the environment variable.)
- `influxdb/delphix-influxdb.service`: Systemd unit following the same structure as `delphix-telegraf.service` (`PartOf=delphix.target`, `Restart=on-failure`, runs as root).
- `influxdb/perf_influxdb`: Toggle script (mirrors `perf_playbook`) to enable/disable InfluxDB metric output from Telegraf without stopping InfluxDB itself. Manages the `/etc/telegraf/INFLUXDB_ENABLED` flag and restarts Telegraf.
- `influxdb/influxdb-nginx.conf`: nginx reverse proxy config that exposes InfluxDB externally at `/influxdb/`, allowing tools like DCT Smart Proxy and Grafana to reach it without direct port access.
- `debian/rules`: Installs all influxdb files: scripts to `/usr/bin/`, systemd unit to `/lib/systemd/system/`, configs to `/etc/influxdb/`, nginx config to `/opt/delphix/server/etc/nginx/conf.d/`.
- `debian/control`: Added `influxdb2` and `curl` to `Depends`.

Dual-bucket routing
Metrics are split across two buckets to keep Grafana-facing data separate from diagnostic data:
| Bucket | Measurements |
| --- | --- |
| `default` | `cpu`, `disk`, `diskio`, `net`, `zfs`, `tcp_stats`, `estat_nfs`, `estat_iscsi`, `hist_estat_nfs`, `hist_estat_iscsi`, `hist_estat_backend-io` |
| `support_metrics` | `mem`, `processes`, `system`, `procstat`, `agg_*`, `nfs_threads`, `estat_backend-io`, `estat_zpl/zvol/zio/zio-queue/metaslab-alloc`, `hist_estat_zpl/zvol/zio/…`, `docker_container_*` |

Routing is controlled by two `[[outputs.influxdb_v2]]` stanzas written by `delphix-influxdb-init`:
- `default` bucket: `namepass` lists exactly the 11 measurements currently used in Grafana dashboards: `cpu`, `disk`, `diskio`, `net`, `tcp_stats`, `zfs`, `estat_nfs`, `estat_iscsi`, `hist_estat_nfs`, `hist_estat_iscsi`, `hist_estat_backend-io`. Any measurement added in future lands in `support_metrics` by default until explicitly promoted.
- `support_metrics` bucket: `namedrop` mirrors the `default` namepass list, so every other measurement flows here automatically.

Why only dashboard-used measurements in `default`? Keeping the default bucket to exactly what Grafana queries minimises storage and query cost. Measurements not yet wired into a dashboard panel (playbook `estat_*`, process/aggregate counters, `nfs_threads`, Docker metrics) sit in `support_metrics` until a dashboard panel needs them.

Why move `estat_backend-io` scalars? Grafana uses the histogram clone (`hist_estat_backend-io`, which stays in `default`) for its I/O heatmap. The raw per-interval scalar rows from `estat_backend-io` serve no dashboard purpose but are useful for support investigations.

Why `agg_*` in `support_metrics`? Hourly aggregates duplicate raw data in summarised form. Grafana queries raw measurements directly; aggregates are only needed for support cases requiring a long time-range summary without fetching raw points.

Telegraf metric collection changes
All metrics now flow exclusively to InfluxDB — JSON file outputs have been removed entirely:
- `telegraf/telegraf.base`: Updated:
  - Removed the `[[outputs.file]]` stanzas; InfluxDB is now the sole output.
  - Removed `[[inputs.filestat]]` and `[[inputs.netstat]]` (not required).
  - `[[inputs.cpu]]`: changed `percpu = true` → `percpu = false`, so only `cpu-total` is collected, not per-core. Reduces data volume on many-CPU engines; `agg_cpu` inherits this automatically.
  - `[[inputs.disk]]`: added `tagexclude = ["fstype", "mode"]`; these tags add no diagnostic value and inflate cardinality.
  - `[[inputs.diskio]]`: updated `tagdrop` to exclude ZFS internal zvol devices (`zd*`), NVMe partitions (`*p[0-9]*`), and SCSI/SATA partitions (`sd*[0-9]*`). Added `tagexclude = ["wwid"]` to drop the redundant 100+ character wwid tag. Partition entries accounted for ~29.5% of diskio/agg_diskio line volume.
  - `[[inputs.procstat]]` (both `delphix-mgmt` and `zfs-object-agent` instances): added `tagexclude = ["cgroup_full"]`; the long cgroup path adds cardinality without diagnostic value.
  - Removed `[[inputs.swap]]`; swap usage adds no diagnostic value for Delphix appliances.
  - Added `[[inputs.execd]]` for per-connection TCP stats via `connstat-stats.sh` (measurement: `tcp_stats`).
- `telegraf/connstat-stats.sh`: New shell/awk script running `connstat -PLe -i 10 -T u` to collect per-connection TCP statistics, aggregated by remote endpoint (`laddr`, `raddr`, `service`). Uses `fflush()` in awk explicitly after every 10-second batch to ensure deterministic output to Telegraf's `execd` pipe.
  - `rport` is excluded from the aggregation key: `service` already captures the semantic meaning of the port, and including `rport` causes cardinality explosion on Oracle dNFS engines where hundreds of connections to the same VDB host use different ephemeral remote ports (all mapping to `service=nfs` on lport 2049). Mirrors the aggregation in `LocalTCPStatsCollector`.
  - The `service` tag is resolved from `/etc/services` (lport first, then rport), with Delphix-specific ports not in `/etc/services` hard-coded (see script).
  - Counter fields (`inbytes`, `outbytes`, etc.) are summed; window/RTT fields (`cwnd`, `swnd`, `rwnd`, `rtt`) are averaged; `connections` reports the count of aggregated TCP connections.
- `telegraf/telegraf.inputs.storage_io`: New always-on fragment (appended when InfluxDB is enabled, independent of playbook state) collecting:
  - `estat_nfs`: NFS server I/O (reads/writes from NFS clients).
  - `estat_iscsi`: iSCSI target I/O (reads/writes from iSCSI initiators).
  - `estat_backend-io`: Backend disk I/O via `estat backend-io` (equivalent to `stbtrace io`). Measures I/O at the physical/virtual disk layer after ZFS processing.
  - `[[processors.converter]]` to convert estat string fields to integers.
  - `[[processors.clone]]` (order=1): clones all `estat_*` measurements as `hist_estat_*` to hold histogram data exclusively.
  - `[[processors.strings]]` (order=2): removes the `microseconds` field from all original `estat_*` measurements after cloning, ensuring histogram data lives only in `hist_estat_*`. The original `{val,count}` format (e.g. `{20000,5},{30000,15}`) is preserved as-is; the previous regex+parser pipeline that attempted JSON conversion was removed because numeric field names are invalid in InfluxDB line protocol.
- `telegraf/telegraf.inputs.playbook`: Removed `estat_nfs`, `estat_iscsi`, and `estat_backend-io` stanzas (moved to `telegraf.inputs.storage_io`). Removed the broken regex+parser histogram pipeline (replaced by clone+strings in storage_io). Scoped `[[processors.converter]]` to playbook-only metrics. Updated `estat_metaslab-alloc` command to use the new wrapper script.
- `telegraf/metaslab-alloc-stats.sh`: Moved to a dedicated PR (DLPX-88427 Filter garbage stat names from estat metaslab-alloc output #120 / DLPX-88427).
- `telegraf/telegraf.inputs.dct`: Removed `[[outputs.file]]` for `metrics_docker.json`; docker metrics now go to InfluxDB.
- `telegraf/delphix-telegraf-service`: When InfluxDB is enabled, appends both `telegraf.inputs.storage_io` and `telegraf.outputs.influxdb` (the three-stanza file) to the assembled config. Falls back to `[[outputs.discard]]` if InfluxDB output is not configured, so Telegraf always starts with a valid config regardless of state.

BPF/estat kernel compatibility fixes
Several `estat` commands were failing to compile with redefinition and forward declaration errors on the current kernel. These fixes are required for the always-on `estat_nfs`, `estat_iscsi`, and `estat_backend-io` measurements to work correctly (DLPX-96701):
- `bpf/estat/nfs.c` and `bpf/stbtrace/nfs.st`: Removed the `struct bpf_wq` forward declaration that conflicts with updated kernel headers (the struct is now defined by the kernel itself).
- `bpf/estat/zvol.c`: Removed the `zv_request_t` struct typedef that conflicts with updated kernel headers.
- `bpf/stbtrace/iscsi.st`: Added a `struct iscsi_conn;` forward declaration before `#include "iscsi_target_core.h"` to resolve an incomplete type error.
- `bpf/standalone/arc_prefetch.py`, `bpf/standalone/txg.py`, `bpf/standalone/zil.py`, `cmd/estat.py`: Added the `-D__KERNEL__` and `-D_KERNEL` BPF compiler flags required by newer kernel headers.
- `bpf/standalone/zil.py`: Removed the `zil_commit_waiter_skip` kprobe (function no longer exists in the current kernel). Added `default=60` to `--coll` so `estat zil` works without requiring `-c`. Simplified the collection loop to always run until Ctrl-C, using `--coll` as the sleep interval between output cycles.
- `cmd/estat.py`: Updated `estat zil` help text to document the `-c INTERVAL` and `-p POOL` options.

Complete list of measurements in InfluxDB
| Measurement | Source | Bucket |
| --- | --- | --- |
| `cpu` | `[[inputs.cpu]]` (cpu-total only; per-core excluded) | `default` |
| `disk` | `[[inputs.disk]]` (fstype/mode tags excluded) | `default` |
| `diskio` | `[[inputs.diskio]]` (`zd*`, `*p[0-9]*`, `sd*[0-9]*`, wwid excluded) | `default` |
| `net` | `[[inputs.net]]` | `default` |
| `zfs` | `[[inputs.zfs]]` | `default` |
| `tcp_stats` | `connstat`: `connections`, `inbytes`, `outbytes`, `retranssegs`, `rtt`, `cwnd`, `swnd`, `rwnd`, `suna`, `unsent` | `default` |
| `mem` | `[[inputs.mem]]` | `support_metrics` |
| `processes` | `[[inputs.processes]]` | `support_metrics` |
| `system` | `[[inputs.system]]` | `support_metrics` |
| `procstat` | `[[inputs.procstat]]`: mgmt + zfs-object-agent (cgroup_full excluded) | `support_metrics` |
| `agg_cpu/disk/diskio/mem/net/processes/system` | | `support_metrics` |
| `estat_nfs` | `estat nfs` | `default` |
| `estat_iscsi` | `estat iscsi` | `default` |
| `estat_backend-io` | `estat backend-io` | `support_metrics` |
| `hist_estat_nfs`, `hist_estat_iscsi`, `hist_estat_backend-io` | | `default` |
| `estat_zpl/zio/zvol/zio-queue/metaslab-alloc` | | `support_metrics` |
| `hist_estat_zpl/zvol/zio/…` | | `support_metrics` |
| `nfs_threads` | | `support_metrics` |
| `docker_container_*` | | `support_metrics` |

Notes to Reviewers
Runtime dependency decisions (`debian/control`)
When someone runs `apt install performance-diagnostics`, APT checks each package listed in `Depends`.
The init script (`delphix-influxdb-init`) relies on `curl`, `openssl`, and `python3` at runtime. Here is why only `curl` is explicitly added to `Depends`:

| Package | Rationale |
| --- | --- |
| `openssl` | Used by `delphix-influxdb-init` (`openssl rand -hex 16`) to generate the admin password. Although `openssl` ships with `delphix-platform`, it is only in `Build-Depends` there, not `Depends`, so it is declared explicitly here to be safe. |
| `python3` | Covered by `python3-minimal` in our existing `Depends`. |
| `curl` | Only in `delphix-platform`'s `Build-Depends` (build-time only), so explicitly declared here to be safe. |

Why `influxdb.toml` instead of `influxdb.conf`
InfluxDB 2.x uses Viper for config parsing, which determines the file format from the extension. Only `.json`, `.toml`, `.yaml`, and `.yml` are recognized; `.conf` is silently ignored and influxd falls back to defaults (`~/.influxdbv2/` for root). Verified on InfluxDB v2.8.0: `INFLUXD_CONFIG_PATH=influxdb.conf` → paths/settings ignored; `INFLUXD_CONFIG_PATH=influxdb.toml` → config fully respected.

All metrics go to InfluxDB: no file outputs
Previously Telegraf wrote metrics to local JSON files (`metrics_cpu.json`, `metrics_docker.json`, etc.). Those `[[outputs.file]]` stanzas have been removed entirely. Routing between the two buckets is controlled by the three `[[outputs.influxdb_v2]]` stanzas in `telegraf.outputs.influxdb` (written by `delphix-influxdb-init`). When InfluxDB output is disabled, `delphix-telegraf-service` inserts `[[outputs.discard]]` so Telegraf always starts with a valid config.

`estat_backend-io` vs `stbtrace io`
`estat` is a Delphix wrapper around `stbtrace` (BPF kernel tracing). `estat backend-io` is the `stbtrace io` equivalent: it instruments I/O at the backend storage device layer (after ZFS cache/compression/RAID transforms). Combined with `estat_nfs` and `estat_iscsi`, this lets you trace the full I/O path: client request → ZFS → physical disk.

Disk partition and tag exclusions (`[[inputs.diskio]]`)
ZFS zvol block devices (`zd0`, `zd1`, …), NVMe partitions (`nvme0n1p1`, etc.), and SCSI/SATA partitions (`sda1`, `sdb2`, etc.) appear in `/proc/diskstats` but add no diagnostic value, because partition-level I/O duplicates what is already visible at the whole-disk level. These accounted for ~29.5% of diskio/agg_diskio line volume. The `wwid` tag is a redundant 100+ character identifier; the short-form `name` tag is sufficient. Both reductions lower storage and query cost in InfluxDB.

`tcp_stats`: per-endpoint TCP statistics
`connstat -PLe -i 10 -T u` outputs per-connection TCP stats every 10 seconds. The wrapper script (`connstat-stats.sh`) aggregates by `(laddr, raddr, service)` to mirror `LocalTCPStatsCollector`. `rport` is excluded to prevent cardinality explosion on Oracle dNFS engines. The `service` tag is resolved from `/etc/services` (lport first, then rport), with `dlpx-sp` (port 50001) hard-coded as a special case. Fields: `inbytes`, `outbytes`, `retranssegs`, `suna` (unacknowledged bytes), `unsent`, `swnd`/`cwnd`/`rwnd`, `rtt`, `connections`.

`hist_estat_*` histogram measurements
Histogram data (`microseconds` field, e.g. `{20000,5},{30000,15}`) is stored exclusively in `hist_estat_*` measurements. The original `estat_*` measurements have `microseconds` removed after cloning (via `processors.strings` `fieldexclude`). This eliminates duplication and keeps time-series rows lean. The `{val,count}` format is preserved as-is; a previous regex+parser pipeline that attempted JSON conversion was removed because numeric field names (e.g. `"20000"`) are invalid in InfluxDB line protocol.

`metaslab-alloc-stats.sh`: DLPX-88427 garbage name filter
Moved to a dedicated PR (#120 / DLPX-88427). Not part of this PR.
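To make the `tcp_stats` aggregation described in these notes concrete, here is a minimal Python sketch of grouping per-connection rows by `(laddr, raddr, service)`. It is not the awk code in `connstat-stats.sh`; the `aggregate` function and the dict-shaped rows are hypothetical, and only `inbytes` and `outbytes` are summed here because the description does not enumerate the other summed counters.

```python
from collections import defaultdict

SUMMED = ("inbytes", "outbytes")            # the script also sums other counters ("etc.")
AVERAGED = ("cwnd", "swnd", "rwnd", "rtt")  # window/RTT fields are averaged


def aggregate(connections):
    """Group per-connection rows by (laddr, raddr, service); rport is ignored."""
    groups = defaultdict(list)
    for conn in connections:
        groups[(conn["laddr"], conn["raddr"], conn["service"])].append(conn)

    result = {}
    for key, rows in groups.items():
        agg = {"connections": len(rows)}
        for field in SUMMED:
            agg[field] = sum(r[field] for r in rows)
        for field in AVERAGED:
            agg[field] = sum(r[field] for r in rows) / len(rows)
        result[key] = agg
    return result
```

Two dNFS connections from the same host that differ only in their ephemeral `rport` collapse into one series with `connections == 2`, which is the cardinality reduction the note describes.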
Testing Done
ab-pre-push
Measurements verified in InfluxDB
All expected measurements verified across both buckets on live engines:
`default` bucket (Grafana-facing, dashboard measurements only):
`support_metrics` bucket (everything else):

Change-specific verifications

| Check | Result |
| --- | --- |
| `diskio` NVMe/SCSI partition exclusion | `nvme0n1p*` and `sda[0-9]*` absent; only whole-disk entries present |
| `diskio` `wwid` tag removal | `wwid` not present in `diskio` data |
| `disk` `fstype`/`mode` tag removal | … `disk` measurement |
| `procstat` `cgroup_full` tag removal | … `procstat` measurement |
| `hist_estat_*` in `default` bucket | `hist_estat_nfs`, `hist_estat_iscsi`, `hist_estat_backend-io` present with `microseconds` field |
| … `microseconds` duplication | `microseconds` absent from `estat_nfs`/`estat_iscsi`/`estat_backend-io` originals |
| `tcp_stats` slim in `default` (4 fields only) | `connections`, `inbytes`, `outbytes`, `retranssegs` present; `rtt`/`cwnd`/`swnd`/`rwnd`/`suna`/`unsent` absent |
| `tcp_stats` full in `support_metrics` | `rtt`, `cwnd`, `swnd`, `rwnd`, `suna`, `unsent` all present alongside core fields |
| `agg_*` in `support_metrics` only | … `support_metrics`; absent from `default` |
| `mem`/`processes`/`system`/`procstat` in `support_metrics` | … `support_metrics`; absent from `default` |
| `estat_backend-io` scalars in `support_metrics` | … (`hist_estat_backend-io`) present in `default` |
| `tcp_stats` `service` tag | `service` tag present (e.g. `nfs`, `https`, `dlpx-sp`) |
| `tcp_stats` `rport` tag removed | … `(laddr, raddr, service)` |
| `connstat` Python: deterministic flush | |
| `estat_metaslab-alloc` via wrapper | |
| `estat nfs/iscsi/backend-io` BPF compilation | |
| `estat zil` default collection | … `-c` flag (defaults to 60 s); `zil_commit_waiter_skip` probe removed without errors |

perf_influxdb enable/disable testing
- `INFLUXDB_ENABLED` flag exists on fresh boot
- `telegraf.outputs.influxdb` exists with correct perms (`-rw-r-----`)
- … `influxdb_v2` output (2x) on boot
- `perf_influxdb disable` removes flag; Telegraf assembles config with `[[outputs.discard]]`
- `perf_influxdb enable` recreates flag; Telegraf reloads with both `influxdb_v2` stanzas
- … `default` and `support_metrics` …
- … must be run as root)
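The bucket placement checked in this section follows from the namepass/namedrop pair described under Dual-bucket routing. A small Python model of that rule, assuming exact-name matching (Telegraf's filters also accept globs, which this sketch does not model) and covering only the two-stanza description; `route` is a hypothetical helper, not part of the PR.

```python
# The 11 dashboard measurements listed in the default bucket's namepass.
DEFAULT_NAMEPASS = [
    "cpu", "disk", "diskio", "net", "tcp_stats", "zfs",
    "estat_nfs", "estat_iscsi",
    "hist_estat_nfs", "hist_estat_iscsi", "hist_estat_backend-io",
]


def route(measurement):
    """Return the buckets a measurement is written to under the two-stanza rule."""
    buckets = []
    if measurement in DEFAULT_NAMEPASS:        # stanza 1: namepass
        buckets.append("default")
    if measurement not in DEFAULT_NAMEPASS:    # stanza 2: namedrop of the same list
        buckets.append("support_metrics")
    return buckets
```

Because the second stanza drops exactly what the first one passes, every measurement lands in exactly one bucket, and any new measurement defaults to `support_metrics` until it is added to the list.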