DLPX-96312 Add InfluxDB/Telegraf infrastructure for Engine Performance Analytics by dbshah12 · Pull Request #119 · delphix/performance-diagnostics

dbshah12 · 2026-03-31T08:38:49Z

Design Doc

https://perforce.atlassian.net/wiki/spaces/DLXEN/pages/2740027450/Design+of+Delphix+Engine+Performance+Analytics+in+DCT

Problem

Telegraf is already collecting engine performance metrics and writing them to local JSON files on the appliance. However, there is no local time-series database to store and serve these metrics, making it difficult for tools like DCT Smart Proxy to query historical performance data from the engine directly.

Additionally, several valuable metrics — per-connection TCP statistics, and storage I/O (NFS, iSCSI, backend disk) — were either not collected or only available when the performance playbook was explicitly enabled.

Storing all metrics in a single bucket would also mix Grafana-dashboard data with low-level diagnostics (aggregates, process counters, TCP internals), inflating storage costs for data that serves no dashboard purpose.

Solution

InfluxDB 2.x infrastructure

Add InfluxDB 2.x to the appliance as the single metrics store, mirroring the existing Telegraf setup pattern:

influxdb/influxdb.toml — InfluxDB daemon config: bound to 127.0.0.1:8086, with bolt/engine paths matching the installed package (/var/lib/influxdb/). Named .toml (not .conf) because InfluxDB uses the Viper config library, which determines the file format from the extension — .conf is not recognized and is silently ignored, causing influxd to fall back to defaults (~/.influxdbv2/).
influxdb/influxdb-init.conf — Tunable init config (org, bucket names, retention period, readiness wait parameters) sourced by the init script. Change values here without touching the script.
influxdb/delphix-influxdb-init — One-time init script that:
- Exits immediately if /etc/influxdb/influxdb_meta already exists (safe on upgrades and reboots).
- Waits for InfluxDB to be ready via the /health endpoint.
- Calls /api/v2/setup to create the org, default bucket, and admin credentials (one-shot; uses curl directly, no influx CLI dependency).
- Is crash-safe: persists a setup state file immediately after /api/v2/setup; each subsequent step (bucket creation, token creation) appends its result to the state file and checks for it on re-run, so the entire script is idempotent end-to-end.
- Creates the support_metrics bucket for diagnostic and aggregate data that is not displayed in Grafana dashboards.
- Creates three scoped tokens: a write-only token for Telegraf → default, a read-only token for DCT Smart Proxy → default, and a write-only token for Telegraf → support_metrics.
- Writes three [[outputs.influxdb_v2]] stanzas to /etc/telegraf/telegraf.outputs.influxdb (chmod 640) — see Dual-bucket routing below.
- Touches /etc/telegraf/INFLUXDB_ENABLED to enable InfluxDB output by default.
- Atomically writes /etc/influxdb/influxdb_meta (chmod 600) containing: INFLUXDB_ORG, INFLUXDB_BUCKET, INFLUXDB_BUCKET_ID, INFLUXDB_SUPPORT_BUCKET, INFLUXDB_SUPPORT_BUCKET_ID, INFLUXDB_ADMIN_USER, INFLUXDB_ADMIN_PASSWORD, INFLUXDB_WRITE_TOKEN, INFLUXDB_READ_TOKEN. Bucket IDs are written so influxdb_lookup_bucket in support_info.sh can resolve them without an API call on engines that never had them stored.
- Applies 30-day retention (INFLUXDB_SUPPORT_RETENTION_SECONDS) to the support_metrics bucket at creation; if the bucket already exists (for example, on an engine upgrading from 7-day retention), influx_patch() PATCHes the retention so existing engines are updated without re-initialisation.
influxdb/delphix-influxdb-service — Wrapper that starts influxd with INFLUXD_CONFIG_PATH=/etc/influxdb/influxdb.toml in the background, runs the init script, then waits on the daemon PID. (influxd does not accept a --config-path flag; the config path must be set via the environment variable.)
influxdb/delphix-influxdb.service — Systemd unit following the same structure as delphix-telegraf.service (PartOf=delphix.target, Restart=on-failure, runs as root).
influxdb/perf_influxdb — Toggle script (mirrors perf_playbook) to enable/disable InfluxDB metric output from Telegraf without stopping InfluxDB itself. Manages the /etc/telegraf/INFLUXDB_ENABLED flag and restarts Telegraf.
influxdb/influxdb-nginx.conf — nginx reverse proxy config that exposes InfluxDB externally at /influxdb/, allowing tools like DCT Smart Proxy and Grafana to reach it without direct port access.
debian/rules — Installs all influxdb files: scripts to /usr/bin/, systemd unit to /lib/systemd/system/, configs to /etc/influxdb/, nginx config to /opt/delphix/server/etc/nginx/conf.d/.
debian/control — Added influxdb2 and curl to Depends.
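
The crash-safe, re-runnable behaviour described for delphix-influxdb-init can be illustrated with a small sketch. It is written in Python only for illustration (the real script is a shell script that uses curl); the KEY=VALUE state-file format and the helper names load_state and run_once are assumptions, not the script's actual contents.

```python
import os


def load_state(path):
    """Read KEY=VALUE lines from the state file; a missing file means no step has run."""
    state = {}
    if os.path.exists(path):
        with open(path) as fh:
            for line in fh:
                key, sep, value = line.rstrip("\n").partition("=")
                if sep:
                    state[key] = value
    return state


def run_once(path, key, action):
    """Run `action` only if `key` is not yet recorded, then append its result."""
    state = load_state(path)
    if key in state:
        return state[key]            # step already completed on an earlier run
    value = action()
    with open(path, "a") as fh:      # append, so earlier results are never rewritten
        fh.write(f"{key}={value}\n")
        fh.flush()
        os.fsync(fh.fileno())        # make the result durable before the next step
    return value
```

Each step (bucket creation, token creation) would be wrapped in a call such as run_once(state_file, "SUPPORT_BUCKET_ID", create_bucket), so a run that is interrupted between steps resumes from the first step that has no recorded result.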

Dual-bucket routing

Metrics are split across two buckets to keep Grafana-facing data separate from diagnostic data:

Bucket	Purpose	Measurements
`default`	Grafana dashboards	`cpu`, `disk`, `diskio`, `net`, `zfs`, `tcp_stats`, `estat_nfs`, `estat_iscsi`, `hist_estat_nfs`, `hist_estat_iscsi`, `hist_estat_backend-io`
`support_metrics`	Everything else: diagnostics, aggregates, uncategorised	`mem`, `processes`, `system`, `procstat`, `agg_*`, `nfs_threads`, `estat_backend-io`, `estat_zpl/zvol/zio/zio-queue/metaslab-alloc`, `hist_estat_zpl/zvol/zio/…`, `docker_container_*`

Routing is controlled by two [[outputs.influxdb_v2]] stanzas written by delphix-influxdb-init:

Default bucket — namepass lists exactly the 11 measurements currently used in Grafana dashboards: cpu, disk, diskio, net, tcp_stats, zfs, estat_nfs, estat_iscsi, hist_estat_nfs, hist_estat_iscsi, hist_estat_backend-io. Any measurement added in future lands in support_metrics by default until explicitly promoted.
support_metrics bucket — namedrop mirrors the default namepass list, so every other measurement flows here automatically.
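
The namepass/namedrop split can be modelled in a few lines. The sketch below is not Telegraf code; it only mirrors the routing rule described here, using Python's fnmatch as a stand-in for Telegraf's glob matching. The function name buckets_for is hypothetical.

```python
from fnmatch import fnmatchcase

# The 11 dashboard measurements listed for the default bucket.
DEFAULT_NAMEPASS = [
    "cpu", "disk", "diskio", "net", "tcp_stats", "zfs",
    "estat_nfs", "estat_iscsi",
    "hist_estat_nfs", "hist_estat_iscsi", "hist_estat_backend-io",
]


def matches(name, patterns):
    """True if the measurement name matches any pattern in the list."""
    return any(fnmatchcase(name, pattern) for pattern in patterns)


def buckets_for(measurement):
    """Return the bucket(s) a measurement would be written to."""
    targets = []
    if matches(measurement, DEFAULT_NAMEPASS):        # namepass on the default output
        targets.append("default")
    if not matches(measurement, DEFAULT_NAMEPASS):    # namedrop on the support_metrics output
        targets.append("support_metrics")
    return targets
```

Because the second output drops exactly what the first one passes, a measurement that nobody has listed anywhere (for example a newly added input) falls through to support_metrics without any config change.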

Why only dashboard-used measurements in default? Keeping the default bucket to exactly what Grafana queries minimises storage and query cost. Measurements not yet wired into a dashboard panel — playbook estat_*, process/aggregate counters, nfs_threads, Docker metrics — sit in support_metrics until a dashboard panel needs them.

Why move estat_backend-io scalars? Grafana uses the histogram clone (hist_estat_backend-io, which stays in default) for its I/O heatmap. The raw per-interval scalar rows from estat_backend-io serve no dashboard purpose but are useful for support investigations.

Why agg_* in support_metrics? Hourly aggregates duplicate raw data in summarised form. Grafana queries raw measurements directly; aggregates are only needed for support cases requiring a long time-range summary without fetching raw points.

Telegraf metric collection changes

All metrics now flow exclusively to InfluxDB — JSON file outputs have been removed entirely:

telegraf/telegraf.base — Updated:
- Removed all [[outputs.file]] stanzas; InfluxDB is now the sole output.
- Removed [[inputs.filestat]] and [[inputs.netstat]] (not required).
- [[inputs.cpu]]: changed percpu = true → percpu = false — only cpu-total collected, not per-core. Reduces data volume on many-CPU engines; agg_cpu inherits this automatically.
- [[inputs.disk]]: added tagexclude = ["fstype", "mode"] — these tags add no diagnostic value and inflate cardinality.
- [[inputs.diskio]]: updated tagdrop to exclude ZFS internal zvol devices (zd*), NVMe partitions (*p[0-9]*), and SCSI/SATA partitions (sd*[0-9]*). Added tagexclude = ["wwid"] to drop the redundant 100+ character wwid tag. Partition entries accounted for ~29.5% of diskio/agg_diskio line volume.
- [[inputs.procstat]] (both delphix-mgmt and zfs-object-agent instances): added tagexclude = ["cgroup_full"] — long cgroup path adds cardinality without diagnostic value.
- Removed [[inputs.swap]] — swap usage adds no diagnostic value for Delphix appliances.
- Added [[inputs.execd]] for per-connection TCP stats via connstat-stats.sh (measurement: tcp_stats).
telegraf/connstat-stats.sh — New shell/awk script running connstat -PLe -i 10 -T u to collect per-connection TCP statistics, aggregated by remote endpoint (laddr, raddr, service). Uses fflush() in awk explicitly after every 10-second batch to ensure deterministic output to Telegraf's execd pipe.
- rport is excluded from the aggregation key — service already captures the semantic meaning of the port, and including rport causes cardinality explosion on Oracle dNFS engines where hundreds of connections to the same VDB host use different ephemeral remote ports (all mapping to service=nfs on lport 2049). Mirrors the aggregation in LocalTCPStatsCollector.
- Service name resolved from /etc/services (lport first, then rport), with Delphix-specific ports not in /etc/services hard-coded (see script).
- Cumulative fields (inbytes, outbytes, etc.) are summed; window/RTT fields (cwnd, swnd, rwnd, rtt) are averaged; connections reports the count of aggregated TCP connections.
telegraf/telegraf.inputs.storage_io — New always-on fragment (appended when InfluxDB is enabled, independent of playbook state) collecting:
- estat_nfs — NFS server I/O (reads/writes from NFS clients).
- estat_iscsi — iSCSI target I/O (reads/writes from iSCSI initiators).
- estat_backend-io — Backend disk I/O via estat backend-io (equivalent to stbtrace io). Measures I/O at the physical/virtual disk layer after ZFS processing.
- [[processors.converter]] to convert estat string fields to integers.
- [[processors.clone]] (order=1) — clones all estat_* measurements as hist_estat_* to hold histogram data exclusively.
- [[processors.strings]] (order=2) — removes the microseconds field from all original estat_* measurements after cloning, ensuring histogram data lives only in hist_estat_*. The original {val,count} format (e.g. {20000,5},{30000,15}) is preserved as-is — the previous regex+parser pipeline that attempted JSON conversion was removed because numeric field names are invalid in InfluxDB line protocol.
telegraf/telegraf.inputs.playbook — Removed estat_nfs, estat_iscsi, and estat_backend-io stanzas (moved to telegraf.inputs.storage_io). Removed the broken regex+parser histogram pipeline (replaced by clone+strings in storage_io). Scoped [[processors.converter]] to playbook-only metrics. Updated estat_metaslab-alloc command to use the new wrapper script.
telegraf/metaslab-alloc-stats.sh — Moved to a dedicated PR (DLPX-88427 Filter garbage stat names from estat metaslab-alloc output #120 / DLPX-88427).
telegraf/telegraf.inputs.dct — Removed [[outputs.file]] for metrics_docker.json; docker metrics now go to InfluxDB.
telegraf/delphix-telegraf-service — When InfluxDB is enabled, appends both telegraf.inputs.storage_io and telegraf.outputs.influxdb (the three-stanza file) to the assembled config. Falls back to [[outputs.discard]] if InfluxDB output is not configured, so Telegraf always starts with a valid config regardless of state.
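
The config assembly with its discard fallback can be sketched as follows. This is a Python illustration of the behaviour described for delphix-telegraf-service, not the shell wrapper itself; the function name, its parameters, and the rule that both the enable flag and the outputs file must be present are assumptions.

```python
DISCARD_STANZA = "[[outputs.discard]]\n"


def assemble_config(base, storage_io, influx_outputs, influxdb_enabled):
    """Concatenate config fragments in the order described above.

    `influx_outputs` is the text of telegraf.outputs.influxdb, or None if the
    init script has not written it yet.
    """
    parts = [base]
    if influxdb_enabled and influx_outputs is not None:
        parts.append(storage_io)       # always-on storage I/O inputs
        parts.append(influx_outputs)   # influxdb_v2 output stanzas
    else:
        # Fall back so the assembled config always contains an output.
        parts.append(DISCARD_STANZA)
    # Make sure every fragment ends with a newline before joining.
    return "".join(p if p.endswith("\n") else p + "\n" for p in parts)
```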

BPF/estat kernel compatibility fixes

Several estat commands were failing to compile with redefinition and forward declaration errors on the current kernel. These fixes are required for the always-on estat_nfs, estat_iscsi, and estat_backend-io measurements to work correctly (DLPX-96701):

bpf/estat/nfs.c and bpf/stbtrace/nfs.st — Removed struct bpf_wq forward declaration that conflicts with updated kernel headers (the struct is now defined by the kernel itself).
bpf/estat/zvol.c — Removed zv_request_t struct typedef that conflicts with updated kernel headers.
bpf/stbtrace/iscsi.st — Added struct iscsi_conn; forward declaration before #include "iscsi_target_core.h" to resolve an incomplete type error.
bpf/standalone/arc_prefetch.py, bpf/standalone/txg.py, bpf/standalone/zil.py, cmd/estat.py — Added -D__KERNEL__ and -D_KERNEL BPF compiler flags required by newer kernel headers.
bpf/standalone/zil.py — Removed the zil_commit_waiter_skip kprobe (function no longer exists in the current kernel). Added default=60 to --coll so estat zil works without requiring -c. Simplified the collection loop to always run until Ctrl-C, using --coll as the sleep interval between output cycles.
cmd/estat.py — Updated estat zil help text to document the -c INTERVAL and -p POOL options.
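
A minimal sketch of the zil.py option handling and collection loop described above. It is not the code in bpf/standalone/zil.py: the long option name --pool, the help strings, and the run helper with its injectable sleep are assumptions made for illustration.

```python
import argparse
import time


def build_parser():
    parser = argparse.ArgumentParser(prog="estat zil")
    # default=60 lets the command run without an explicit -c
    parser.add_argument("-c", "--coll", type=int, default=60, metavar="INTERVAL",
                        help="seconds between output cycles (default: 60)")
    parser.add_argument("-p", "--pool", metavar="POOL", default=None,
                        help="restrict output to one pool")
    return parser


def run(emit, interval, sleep=time.sleep):
    """Emit one report per interval until interrupted with Ctrl-C."""
    try:
        while True:
            sleep(interval)
            emit()
    except KeyboardInterrupt:
        pass  # Ctrl-C is the normal way to stop collection
```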

Complete list of measurements in InfluxDB

Measurement	Source	Bucket	Availability
`cpu`	`[[inputs.cpu]]` (cpu-total only; per-core excluded)	`default`	Always
`disk`	`[[inputs.disk]]` (fstype/mode tags excluded)	`default`	Always
`diskio`	`[[inputs.diskio]]` (`zd*`, `*p[0-9]*`, `sd*[0-9]*` devices and `wwid` tag excluded)	`default`	Always
`net`	`[[inputs.net]]`	`default`	Always
`zfs`	`[[inputs.zfs]]`	`default`	Always
`tcp_stats`	`connstat` — `connections`, `inbytes`, `outbytes`, `retranssegs`, `rtt`, `cwnd`, `swnd`, `rwnd`, `suna`, `unsent`	`default`	Always
`mem`	`[[inputs.mem]]`	`support_metrics`	Always
`processes`	`[[inputs.processes]]`	`support_metrics`	Always
`system`	`[[inputs.system]]`	`support_metrics`	Always
`procstat`	`[[inputs.procstat]]` — mgmt + zfs-object-agent (cgroup_full excluded)	`support_metrics`	Always
`agg_cpu/disk/diskio/mem/net/processes/system`	Hourly min/max/mean/stdev aggregates	`support_metrics`	Always
`estat_nfs`	NFS server I/O via `estat nfs`	`default`	When InfluxDB enabled
`estat_iscsi`	iSCSI target I/O via `estat iscsi`	`default`	When InfluxDB enabled
`estat_backend-io`	Backend disk I/O scalars via `estat backend-io`	`support_metrics`	When InfluxDB enabled
`hist_estat_nfs`, `hist_estat_iscsi`, `hist_estat_backend-io`	Histogram-only clones of always-on storage I/O	`default`	When InfluxDB enabled
`estat_zpl/zio/zvol/zio-queue/metaslab-alloc`	ZFS operation stats	`support_metrics`	Playbook only
`hist_estat_zpl/zvol/zio/…`	Histogram clones of playbook estat_*	`support_metrics`	Playbook only
`nfs_threads`	NFS thread utilization	`support_metrics`	Playbook only
`docker_container_*`	Docker/DCT container metrics	`support_metrics`	DCT engines only

Notes to Reviewers

Runtime dependency decisions (`debian/control`)

When someone runs apt install performance-diagnostics, APT checks each package listed in Depends:

If already installed → skip (no reinstall, no harm).
If not installed → automatically download and install it.

The init script (delphix-influxdb-init) relies on curl, openssl, and python3 at runtime. The table below records, for each one, whether it is declared in Depends and why:

Dependency	Decision	Reason
`openssl`	Added	Used by `delphix-influxdb-init` (`openssl rand -hex 16`) to generate the admin password. Although `openssl` ships with `delphix-platform`, it is only in `Build-Depends` there, not `Depends`, so it is declared explicitly here to be safe.
`python3`	Not added	Already present via `python3-minimal` in our existing `Depends`.
`curl`	Added	Only in `delphix-platform`'s `Build-Depends` (build-time only) — so explicitly declared here to be safe.

Why `influxdb.toml` instead of `influxdb.conf`

InfluxDB 2.x uses Viper for config parsing, which determines the file format from the extension. Only .json, .toml, .yaml, and .yml are recognized — .conf is silently ignored and influxd falls back to defaults (~/.influxdbv2/ for root). Verified on InfluxDB v2.8.0: INFLUXD_CONFIG_PATH=influxdb.conf → paths/settings ignored; INFLUXD_CONFIG_PATH=influxdb.toml → config fully respected.
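
Since a wrong extension fails silently, a packaging or pre-start check could catch it early. The sketch below only encodes the extension list stated above; it does not reproduce Viper's code, and the function name config_format is hypothetical.

```python
import os

# Extensions reported above as recognized, mapped to the format they imply.
RECOGNIZED = {".json": "json", ".toml": "toml", ".yaml": "yaml", ".yml": "yaml"}


def config_format(path):
    """Return the config format implied by the file extension, or None if unrecognized."""
    extension = os.path.splitext(path)[1]
    return RECOGNIZED.get(extension)
```

A check such as config_format(os.environ["INFLUXD_CONFIG_PATH"]) is None would flag the influxdb.conf case before influxd falls back to its defaults.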

All metrics go to InfluxDB — no file outputs

Previously Telegraf wrote metrics to local JSON files (metrics_cpu.json, metrics_docker.json, etc.). Those [[outputs.file]] stanzas have been removed entirely. Routing between the two buckets is controlled by the three [[outputs.influxdb_v2]] stanzas in telegraf.outputs.influxdb (written by delphix-influxdb-init). When InfluxDB output is disabled, delphix-telegraf-service inserts [[outputs.discard]] so Telegraf always starts with a valid config.

`estat_backend-io` vs `stbtrace io`

estat is a Delphix wrapper around stbtrace (BPF kernel tracing). estat backend-io is the stbtrace io equivalent — it instruments I/O at the backend storage device layer (after ZFS cache/compression/RAID transforms). Combined with estat_nfs and estat_iscsi, this lets you trace the full I/O path: client request → ZFS → physical disk.

Disk partition and tag exclusions (`[[inputs.diskio]]`)

ZFS zvol block devices (zd0, zd1, …), NVMe partitions (nvme0n1p1, etc.), and SCSI/SATA partitions (sda1, sdb2, etc.) appear in /proc/diskstats but add no diagnostic value — partition-level I/O duplicates what is already visible at the whole-disk level. These accounted for ~29.5% of diskio/agg_diskio line volume. The wwid tag is a redundant 100+ character identifier; the short-form name tag is sufficient. Both reductions lower storage and query cost in InfluxDB.
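
To see which device names the three exclusion globs catch, the patterns can be tried against sample names. This uses Python's fnmatch as an approximation of Telegraf's glob matching, which is an assumption; the device list is made up for illustration.

```python
from fnmatch import fnmatchcase

# Globs from the diskio tagdrop: zvols, NVMe partitions, SCSI/SATA partitions.
EXCLUDED_DEVICE_GLOBS = ["zd*", "*p[0-9]*", "sd*[0-9]*"]


def is_excluded(device):
    """True if the device name matches any exclusion glob."""
    return any(fnmatchcase(device, glob) for glob in EXCLUDED_DEVICE_GLOBS)


def kept_devices(devices):
    """Return the devices that would still be reported, preserving order."""
    return [device for device in devices if not is_excluded(device)]
```

For example, kept_devices(["sda", "sda1", "nvme0n1", "nvme0n1p1", "zd0"]) keeps only the whole-disk names sda and nvme0n1.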

`tcp_stats` — per-endpoint TCP statistics

connstat -PLe -i 10 -T u outputs per-connection TCP stats every 10 seconds. The wrapper script (connstat-stats.sh) aggregates by (laddr, raddr, service) to mirror LocalTCPStatsCollector. rport is excluded to prevent cardinality explosion on Oracle dNFS engines. The service tag is resolved from /etc/services (lport first, then rport), with dlpx-sp (port 50001) hard-coded as a special case. Fields: inbytes, outbytes, retranssegs, suna (unacknowledged bytes), unsent, swnd/cwnd/rwnd, rtt, connections.
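
The aggregation rule can be sketched in Python (the real wrapper is a shell/awk script). The grouping key and the sum/average split follow the description above; treating retranssegs, suna, and unsent as summed fields is an assumption, since the description only names inbytes and outbytes explicitly.

```python
from collections import defaultdict

SUM_FIELDS = ("inbytes", "outbytes", "retranssegs", "suna", "unsent")  # partly assumed
AVG_FIELDS = ("cwnd", "swnd", "rwnd", "rtt")


def aggregate(connections):
    """Collapse per-connection rows into one row per (laddr, raddr, service)."""
    groups = defaultdict(list)
    for conn in connections:
        # rport is deliberately left out of the key to keep cardinality bounded.
        key = (conn["laddr"], conn["raddr"], conn["service"])
        groups[key].append(conn)

    result = {}
    for key, members in groups.items():
        row = {"connections": len(members)}
        for field in SUM_FIELDS:
            row[field] = sum(m[field] for m in members)
        for field in AVG_FIELDS:
            row[field] = sum(m[field] for m in members) / len(members)
        result[key] = row
    return result
```

Two dNFS connections from the same client that differ only in rport therefore produce a single row with connections equal to 2, rather than two separate series.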

`hist_estat_*` histogram measurements

Histogram data (microseconds field — e.g. {20000,5},{30000,15}) is stored exclusively in hist_estat_* measurements. The original estat_* measurements have microseconds removed after cloning (via processors.strings fieldexclude). This eliminates duplication and keeps time-series rows lean. The {val,count} format is preserved as-is — a previous regex+parser pipeline that attempted JSON conversion was removed because numeric field names (e.g. "20000") are invalid in InfluxDB line protocol.
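
Because the {val,count} string is stored verbatim, a consumer has to unpack it. A minimal parser sketch, assuming the field always has the shape shown in the example above; the function name parse_histogram is hypothetical.

```python
import re

# One "{value,count}" pair, tolerating spaces inside the braces.
_PAIR = re.compile(r"\{\s*(\d+)\s*,\s*(\d+)\s*\}")


def parse_histogram(text):
    """Turn '{20000,5},{30000,15}' into [(20000, 5), (30000, 15)]."""
    if text is None:
        return []
    return [(int(value), int(count)) for value, count in _PAIR.findall(text)]
```

An empty or missing field yields an empty list, so callers can tell "no histogram data" apart from a parse error without special cases.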

`metaslab-alloc-stats.sh` — DLPX-88427 garbage name filter

Moved to a dedicated PR (#120 / DLPX-88427). Not part of this PR.

Testing Done

ab-pre-push

Ran ab-pre-push: https://selfservice-jenkins.eng-tools-prd.aws.delphixcloud.com/job/appliance-build-orchestrator-pre-push/13811/ - ✅
On the ab-pre-push engine, the following files were created:

/etc/influxdb# ls -l
total 4
-rw-r--r-- 1 root root  86 Mar 31 09:56 config.toml
-rw-r--r-- 1 root root 357 Mar 31 09:19 influxdb-init.conf
-rw-r--r-- 1 root root 274 Mar 31 09:19 influxdb.toml
-rw------- 1 root root 347 Mar 31 12:24 influxdb_meta

/etc/influxdb# ls -l /var/lib/influxdb
total 23
drwxr-x--- 5 influxdb influxdb      5 Mar 31 12:46 engine
-rw------- 1 influxdb influxdb  65536 Mar 31 12:46 influxd.bolt
-rw-r----- 1 influxdb influxdb      4 Mar 31 12:22 influxd.pid
-rw-r----- 1 influxdb influxdb 122880 Mar 31 12:23 influxd.sqlite

InfluxDB setup also completed, and data is visible in the InfluxDB UI.

Measurements verified in InfluxDB

All expected measurements verified across both buckets on live engines:

default bucket (Grafana-facing — dashboard measurements only):

cpu              disk             diskio           estat_iscsi
estat_nfs        hist_estat_backend-io             hist_estat_iscsi
hist_estat_nfs   net              tcp_stats        zfs

support_metrics bucket (everything else):

agg_cpu          agg_disk         agg_diskio       agg_mem
agg_net          agg_processes    agg_system       estat_backend-io
mem              nfs_threads      processes        procstat
system

Change-specific verifications

Change	Verification
`diskio` NVMe/SCSI partition exclusion	Confirmed `nvme0n1p*` and `sda[0-9]*` absent; only whole-disk entries present
`diskio` `wwid` tag removal	Confirmed `wwid` not present in `diskio` data
`disk` `fstype`/`mode` tag removal	Confirmed tags absent from `disk` measurement
`procstat` `cgroup_full` tag removal	Confirmed tag absent from `procstat` measurement
`hist_estat_*` in `default` bucket	`hist_estat_nfs`, `hist_estat_iscsi`, `hist_estat_backend-io` present with `microseconds` field
No `microseconds` duplication	`microseconds` absent from `estat_nfs`/`estat_iscsi`/`estat_backend-io` originals
`tcp_stats` slim in `default` — 4 fields only	`connections`, `inbytes`, `outbytes`, `retranssegs` present; `rtt`/`cwnd`/`swnd`/`rwnd`/`suna`/`unsent` absent
`tcp_stats` full in `support_metrics`	`rtt`, `cwnd`, `swnd`, `rwnd`, `suna`, `unsent` all present alongside core fields
`agg_*` in `support_metrics` only	All 7 aggregate measurements confirmed in `support_metrics`; absent from `default`
`mem`/`processes`/`system`/`procstat` in `support_metrics`	Confirmed in `support_metrics`; absent from `default`
`estat_backend-io` scalars in `support_metrics`	Confirmed; histogram clone (`hist_estat_backend-io`) present in `default`
`tcp_stats` `service` tag	`service` tag present (e.g. `nfs`, `https`, `dlpx-sp`)
`tcp_stats` `rport` tag removed	Confirmed absent; aggregation key is `(laddr, raddr, service)`
`connstat` Python — deterministic flush	10-second batches flushed on schedule; no multi-hour mawk buffering
`estat_metaslab-alloc` via wrapper	Data present; garbage-name entries absent
`estat nfs`/`iscsi`/`backend-io` BPF compilation	Compile and collect without redefinition or forward-declaration errors
`estat zil` default collection	Runs without `-c` flag (defaults to 60 s); `zil_commit_waiter_skip` probe removed without errors

perf_influxdb enable/disable testing

Test	Result
`INFLUXDB_ENABLED` flag exists on fresh boot	✅
`telegraf.outputs.influxdb` exists with correct perms (`-rw-r-----`)	✅
Telegraf loaded `influxdb_v2` output (2x) on boot	✅
`perf_influxdb disable` removes flag; Telegraf assembles config with `[[outputs.discard]]`	✅
`perf_influxdb enable` recreates flag; Telegraf reloads with both `influxdb_v2` stanzas	✅
After enable, data flows to both `default` and `support_metrics`	✅
Non-root user blocked with clear error (`must be run as root`)	✅
No errors in journalctl	✅

dbshah12 · 2026-06-11T15:27:35Z

All Copilot review comments addressed — stale threads resolved, valid issues fixed or explained inline.

The Starlark processor was returning [metric] (pass-through) for hist_estat_* rows whose microseconds field is absent — specifically name=total summary rows, which carry only iops and throughput. Those clones are pure duplicates of the corresponding estat_* row and should not exist in hist_estat_*. Fix: return [] to drop the metric when microseconds is None. Only rows with actual histogram data produce le=<bucket> output. Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

When microseconds is present but parses to no valid bucket pairs (e.g. "{ }" from a zero-activity interval), the Starlark function was falling through to `return result if result else [metric]` and passing the metric through unchanged — a duplicate of the estat_* summary row for that series. Fix: return [] instead of [metric] when parsing yields no results. Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

mawk 1.3.4 does not flush its stdout buffer via fflush() when writing to a Telegraf execd pipe, causing tcp_stats data to never reach InfluxDB on engines where mawk is the default awk. Wrapping connstat in a while loop with -c 2 forces awk to exit naturally after each 10-second interval, triggering the C runtime exit flush (fclose) that reliably delivers data to Telegraf. The END block captures the partial second interval on each awk exit so no data is lost. Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

Extend the existing name=total starlark drop filter (previously estat_backend-io only) to also cover estat_nfs and estat_iscsi. The total row is read+write summed at collection time and can be derived in Grafana, so storing it wastes ~33% of each measurement's space. The hist_estat_* clones are unaffected since name=total rows carry no microseconds field and are already dropped by the histogram processor. Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

dbshah12 force-pushed the dlpx/pr/dbshah12/5d79e679-49b6-4c0a-8241-1c919bfcaedb branch 2 times, most recently from bb1bd01 to 985a3ac Compare March 31, 2026 08:49

dbshah12 requested a review from Copilot March 31, 2026 08:53

Copilot started reviewing on behalf of dbshah12 March 31, 2026 08:54 View session

dbshah12 force-pushed the dlpx/pr/dbshah12/5d79e679-49b6-4c0a-8241-1c919bfcaedb branch from 6286549 to 2a39e0c Compare March 31, 2026 09:16

dbshah12 marked this pull request as ready for review March 31, 2026 10:57

dbshah12 marked this pull request as draft March 31, 2026 11:01

dbshah12 force-pushed the dlpx/pr/dbshah12/5d79e679-49b6-4c0a-8241-1c919bfcaedb branch 2 times, most recently from bad0342 to df102c9 Compare March 31, 2026 13:03

dbshah12 marked this pull request as ready for review March 31, 2026 13:07

dbshah12 requested a review from sebroy March 31, 2026 13:07

dbshah12 self-assigned this Mar 31, 2026

dbshah12 requested a review from Copilot April 1, 2026 05:37

Copilot started reviewing on behalf of dbshah12 April 1, 2026 05:38 View session

dbshah12 requested a review from Copilot April 1, 2026 05:59

Copilot started reviewing on behalf of dbshah12 April 1, 2026 06:00 View session

dbshah12 force-pushed the dlpx/pr/dbshah12/5d79e679-49b6-4c0a-8241-1c919bfcaedb branch 3 times, most recently from 34acba1 to 02cc5df Compare April 1, 2026 06:28

dbshah12 requested a review from Copilot April 1, 2026 06:29

Copilot started reviewing on behalf of dbshah12 April 1, 2026 06:30 View session

delphix deleted a comment from Copilot AI Apr 1, 2026

dbshah12 force-pushed the dlpx/pr/dbshah12/5d79e679-49b6-4c0a-8241-1c919bfcaedb branch from 02cc5df to 7095d33 Compare April 1, 2026 06:43

dbshah12 requested review from ShibasishDelphix, SumoSourabh and VenkatanadhanG April 1, 2026 08:39

dbshah12 force-pushed the dlpx/pr/dbshah12/5d79e679-49b6-4c0a-8241-1c919bfcaedb branch from 2a7d1b7 to 444ca18 Compare April 20, 2026 13:55

dbshah12 requested a review from Copilot April 20, 2026 14:10

Copilot started reviewing on behalf of dbshah12 April 20, 2026 14:10 View session

dbshah12 force-pushed the dlpx/pr/dbshah12/5d79e679-49b6-4c0a-8241-1c919bfcaedb branch 9 times, most recently from 304c9a7 to 60660da Compare April 27, 2026 11:53

dbshah12 requested a review from Copilot April 28, 2026 07:27

Copilot started reviewing on behalf of dbshah12 April 28, 2026 07:27 View session

dbshah12 force-pushed the dlpx/pr/dbshah12/5d79e679-49b6-4c0a-8241-1c919bfcaedb branch from daae5c1 to 9a5175a Compare May 19, 2026 17:15

dbshah12 closed this Jun 11, 2026

dbshah12 reopened this Jun 11, 2026

dbshah12 force-pushed the dlpx/pr/dbshah12/5d79e679-49b6-4c0a-8241-1c919bfcaedb branch from 2e8baa6 to 9bf19d7 Compare June 12, 2026 06:06

dbshah12 requested review from nealquigley and removed request for sisodiyam8 June 12, 2026 14:35

DLPX-96312 Add InfluxDB/Telegraf infrastructure for Engine Performanc…

774cdc1

…e Analytics PR URL: https://www.github.com/delphix/performance-diagnostics/pull/119

dbshah12 force-pushed the dlpx/pr/dbshah12/5d79e679-49b6-4c0a-8241-1c919bfcaedb branch from bde1608 to 774cdc1 Compare June 15, 2026 05:14

dbshah12 and others added 4 commits June 17, 2026 15:45

dbshah12 wants to merge 5 commits into develop from dlpx/pr/dbshah12/5d79e679-49b6-4c0a-8241-1c919bfcaedb
