DLPX-88427 Fix estat metaslab-alloc: no data on 2026.3 engines + garbled names#121
Open
dbshah12 wants to merge 2 commits into
Open
DLPX-88427 Fix estat metaslab-alloc: no data on 2026.3 engines + garbled names#121dbshah12 wants to merge 2 commits into
dbshah12 wants to merge 2 commits into
Conversation
abd59af to
ed3d575
Compare
dbshah12
added a commit
that referenced
this pull request
Jun 12, 2026
…byte The previous check only tested vd_name[0]. A string like "/dev\xa3..." starts with '/' (printable) so the check passed, but the \xa3 byte later in the string still produces a \xNN escape sequence when Python decodes with backslashreplace, breaking JSON output. Scan all bytes up to the first NUL using a #pragma unroll loop over VD_NAME_SIZE (32) bytes. Replace the whole name with "unknown" if the string is empty or any byte is outside printable ASCII (0x20-0x7e). Addresses review comment on PR #121. Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
dbshah12
added a commit
that referenced
this pull request
Jun 12, 2026
If bpf_probe_read fails for mg->mg_class, mc is left as NULL (now explicitly initialized) and the subsequent read of mc->mc_spa would use an undefined/NULL pointer. Initialise both pointers to NULL and return early if either read yields NULL before using them. Addresses review comment on PR #121. Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
dbshah12
added a commit
that referenced
this pull request
Jun 12, 2026
spa, mc, and psize are present only to match the kretprobe function signature; add (void) casts so BPF compilation does not emit -Wunused-parameter warnings. Addresses review comment on PR #121. Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
Copilot stopped reviewing on behalf of
dbshah12 due to an error
June 12, 2026 12:52
dbshah12
added a commit
that referenced
this pull request
Jun 12, 2026
The previous restructuring dropped the per-DVA "allocation failures" aggregation that metaslab_alloc_dva_exit used to emit when the overall DVA allocation failed. On older ZFS versions (where metaslab_alloc_dva is still in the call chain), consumers rely on this metric to distinguish individual per-vdev group failures from a complete DVA failure. Restore it with the following changes: - Add dva_owned (u8) to data_t so entry/exit probes can distinguish which code path created the map entry. - Add size (u64) to data_t to store psize at metaslab_alloc_dva_entry time; kretprobe argument registers may be clobbered by function return so psize cannot be read reliably from the kretprobe handler directly. - metaslab_alloc_dva_entry: set dva_owned=1 and store psize in data->size. - metaslab_group_alloc_exit: if dva_owned, only reset per-group fields (asize, vd_name) and leave the entry alive so metaslab_alloc_dva_exit can emit the failure metric; if !dva_owned (new path), delete as before. - metaslab_alloc_dva_exit: emit "allocation failures" aggregation when PT_REGS_RC != 0, using data->size (stored psize) and data->ts for the latency; always delete the entry. Addresses review comment on PR #121. Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
dbshah12
added a commit
that referenced
this pull request
Jun 12, 2026
The previous comment incorrectly implied that a live map entry means
failure and contained an unfinished correction ("resets ts=0... no,
it doesn't"). Reword to accurately describe the control flow: the
entry is present on both success and failure paths; only PT_REGS_RC
determines whether the failure metric is emitted.
Addresses review comment on PR #121.
Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
2 tasks
dbshah12
added a commit
that referenced
this pull request
Jun 12, 2026
…ntry The previous check (data != NULL && data->ts != 0) does not correctly identify old-path entries. A stale new-path entry (dva_owned=0) left in the map due to a missed kretprobe event also has a non-zero ts, causing it to be mistaken for an old-path entry: the pool filter is skipped and an unrelated timestamp/vdev name is reused. Since dva_owned was added specifically to distinguish which code path created the entry, key the old-path branch off data->dva_owned instead. A stale new-path entry (dva_owned=0) now correctly falls into the else branch and gets a fresh entry via data_map.update, overwriting the stale one and re-applying the pool filter. Addresses review comment on PR #121. Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
…range
metaslab_alloc_dva() is no longer in the normal write path on ZFS 2.4.99+
(Delphix 2026.3); it now only appears in vdev_removal.c. The write path is:
metaslab_alloc() -> metaslab_alloc_range() -> metaslab_alloc_dva_range()
-> metaslab_group_alloc()
Replace the outer kprobe/kretprobe pair from metaslab_alloc_dva to
metaslab_alloc_dva_range, which occupies the same structural role.
metaslab_alloc_dva_range() receives spa_t * as its first argument, making
pool filtering a direct equal_to_pool(spa->spa_name) call with no struct
chain traversal needed.
This also removes the dual-path dva_owned complexity introduced to work
around the missing outer probe — with metaslab_alloc_dva_range as the
anchor, metaslab_group_alloc_entry always finds an existing data_map entry
and only needs to fill in vdev name and asize.
On older ZFS without metaslab_alloc_dva_range, estat(8) prints a WARNING
and skips those probes gracefully.
Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
6449e97 to
f7b99c7
Compare
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Background
estat metaslab-allocis a BPF-based tracer that measures how long ZFS takesto allocate a block of storage on each device in a pool (e.g.
/dev/xvdb1).It feeds per-device IOPS and latency data into Telegraf → InfluxDB.
After upgrading Ubuntu — which brought ZFS 2.4.99 / kernel 6.17 on 2026.3
engines — this metric stopped producing any data at all.
Problem
Main symptom:
estat metaslab-alloccompiles and runs without errors butemits zero JSON lines on 2026.3 engines. InfluxDB has no
estat_metaslab-allocmeasurements since the Ubuntu upgrade.Secondary symptom (some pre-2026.3 engines): the metric was occasionally
emitting garbled device names like
\x2f\xa3\x00...(raw memory bytesrendered as escape sequences) instead of readable paths like
/dev/xvdb1.These break Telegraf/InfluxDB JSON parsing.
Root Cause
Why no data on 2026.3+ (main issue)
Think of the BPF tracer as a stopwatch split across two checkpoints inside
ZFS's storage-allocation code:
metaslab_alloc_dva) — starts the stopwatch and tags itwith the current thread ID.
metaslab_group_alloc) — finds that thread's stopwatch,measures elapsed time, and records it against the specific disk device.
In older ZFS the call chain was linear:
After the Ubuntu/ZFS upgrade (ZFS 2.4.99, 2026.3) ZFS was internally
refactored. The allocation path no longer goes through
metaslab_alloc_dvaat all.metaslab_group_allocis now invoked directlyby a different caller. Checkpoint A never fires. When Checkpoint B arrives it
looks for a stopwatch entry, finds nothing, and silently returns. Zero output.
Confirmed with raw kernel probes:
metaslab_group_allocfires ~900×/secduring normal pool I/O while
metaslab_alloc_dvafires 0 times.Why garbled device names (secondary issue, some older engines)
On some intermediate ZFS versions the internal layout of the
metaslab_group_tstruct changed. The BPF code that reads the device namewas reading from the wrong memory offset, copying random bytes into the metric
name field. These appeared as
"name":"\x2f\xa3..."in the JSON output.Solution
No-data fix (2026.3+)
Updated
metaslab_group_alloc_entryto handle both ZFS code paths via adva_ownedflag indata_t:metaslab_alloc_dva_entrysetsdva_owned=1and storespsize.metaslab_group_alloc_entrydetects this viadata->dva_ownedand fills in vdev info without re-applying the pool filter.
dva_ownedentry exists,metaslab_group_alloc_entrycreates a fresh entry, applies the pool filter via explicit
bpf_probe_readcalls on
mg->mg_class->mc_spa(with NULL guards), and setsdva_owned=0.Using
dva_ownedrather thandata->ts != 0prevents a stale new-pathentry (from a missed kretprobe event) from bypassing the pool filter.
Restored allocation failures metric
metaslab_alloc_dva_exitnow correctly emits an"allocation failures"aggregation when the return value indicates failure.
psizeis stored indata_t.sizeat entry time (kretprobe argument registers may be clobbered).metaslab_group_alloc_exitleaves old-path entries alive so the dva exitcan always find them; new-path entries are deleted immediately.
Garbled-name fix
A
#pragma unrollloop scans all bytes ofvd_nameup to the first NUL.If the string is empty or any byte is outside printable ASCII (0x20–0x7e),
the whole name is replaced with
"unknown".Wrapper script removed
PR #120's grep-filter wrapper is deleted and
telegraf.inputs.playbookrestored to call
estatdirectly. The fix is in the BPF code itself.Testing Done
Each fix was deployed to de-dev1.dlpxdc.co (ZFS 2.4.99-1delphix.2026.03.15,
kernel 6.17.0-1007) and verified — 0 compile errors, clean JSON output — before
committing. Final run:
Also verified on dhruv2.dlpxdc.co (ZFS 2.4.99-1delphix.2026.05.19,
kernel 6.17.0-1015) — same clean output.
Backward compatibility:
metaslab_alloc_dvaprobes are retained; olderengines continue using the original code path unchanged.
Known gap — garbled-name fix: engines old enough to show this bug are not
currently available. Fix verified correct by code analysis.
Known gap — kernel 6.14 compile error:
metaslab-alloc.cfails to compileon kernel 6.14 due to an incomplete-type error in
linux/bpf.hpulled intransitively. Deferred to a follow-up (same class as DLPX-96701).