Skip to content

DLPX-88427 Fix estat metaslab-alloc: no data on 2026.3 engines + garbled names#121

Open
dbshah12 wants to merge 2 commits into
developfrom
dlpx/pr/dbshah12/a15623f2-812d-4171-bcc8-18066302e074
Open

DLPX-88427 Fix estat metaslab-alloc: no data on 2026.3 engines + garbled names#121
dbshah12 wants to merge 2 commits into
developfrom
dlpx/pr/dbshah12/a15623f2-812d-4171-bcc8-18066302e074

Conversation

@dbshah12

@dbshah12 dbshah12 commented Jun 12, 2026

Copy link
Copy Markdown

Background

estat metaslab-alloc is a BPF-based tracer that measures how long ZFS takes
to allocate a block of storage on each device in a pool (e.g. /dev/xvdb1).
It feeds per-device IOPS and latency data into Telegraf → InfluxDB.

After upgrading Ubuntu — which brought ZFS 2.4.99 / kernel 6.17 on 2026.3
engines — this metric stopped producing any data at all.

Problem

Main symptom: estat metaslab-alloc compiles and runs without errors but
emits zero JSON lines on 2026.3 engines. InfluxDB has no
estat_metaslab-alloc measurements since the Ubuntu upgrade.

Secondary symptom (some pre-2026.3 engines): the metric was occasionally
emitting garbled device names like \x2f\xa3\x00... (raw memory bytes
rendered as escape sequences) instead of readable paths like /dev/xvdb1.
These break Telegraf/InfluxDB JSON parsing.

Root Cause

Why no data on 2026.3+ (main issue)

Think of the BPF tracer as a stopwatch split across two checkpoints inside
ZFS's storage-allocation code:

  1. Checkpoint A (metaslab_alloc_dva) — starts the stopwatch and tags it
    with the current thread ID.
  2. Checkpoint B (metaslab_group_alloc) — finds that thread's stopwatch,
    measures elapsed time, and records it against the specific disk device.

In older ZFS the call chain was linear:

metaslab_alloc_dva  →  (internally calls)  →  metaslab_group_alloc

After the Ubuntu/ZFS upgrade (ZFS 2.4.99, 2026.3) ZFS was internally
refactored. The allocation path no longer goes through
metaslab_alloc_dva at all.
metaslab_group_alloc is now invoked directly
by a different caller. Checkpoint A never fires. When Checkpoint B arrives it
looks for a stopwatch entry, finds nothing, and silently returns. Zero output.

Confirmed with raw kernel probes: metaslab_group_alloc fires ~900×/sec
during normal pool I/O while metaslab_alloc_dva fires 0 times.

Why garbled device names (secondary issue, some older engines)

On some intermediate ZFS versions the internal layout of the
metaslab_group_t struct changed. The BPF code that reads the device name
was reading from the wrong memory offset, copying random bytes into the metric
name field. These appeared as "name":"\x2f\xa3..." in the JSON output.

Solution

No-data fix (2026.3+)

Updated metaslab_group_alloc_entry to handle both ZFS code paths via a
dva_owned flag in data_t:

  • Old path: metaslab_alloc_dva_entry sets dva_owned=1 and stores
    psize. metaslab_group_alloc_entry detects this via data->dva_owned
    and fills in vdev info without re-applying the pool filter.
  • New path: if no dva_owned entry exists, metaslab_group_alloc_entry
    creates a fresh entry, applies the pool filter via explicit bpf_probe_read
    calls on mg->mg_class->mc_spa (with NULL guards), and sets dva_owned=0.

Using dva_owned rather than data->ts != 0 prevents a stale new-path
entry (from a missed kretprobe event) from bypassing the pool filter.

Restored allocation failures metric

metaslab_alloc_dva_exit now correctly emits an "allocation failures"
aggregation when the return value indicates failure. psize is stored in
data_t.size at entry time (kretprobe argument registers may be clobbered).
metaslab_group_alloc_exit leaves old-path entries alive so the dva exit
can always find them; new-path entries are deleted immediately.

Garbled-name fix

A #pragma unroll loop scans all bytes of vd_name up to the first NUL.
If the string is empty or any byte is outside printable ASCII (0x20–0x7e),
the whole name is replaced with "unknown".

Wrapper script removed

PR #120's grep-filter wrapper is deleted and telegraf.inputs.playbook
restored to call estat directly. The fix is in the BPF code itself.

Testing Done

Each fix was deployed to de-dev1.dlpxdc.co (ZFS 2.4.99-1delphix.2026.03.15,
kernel 6.17.0-1007) and verified — 0 compile errors, clean JSON output — before
committing. Final run:

{"name":"/dev/xvdb1","axis":"success","iops(/s)":"21","avg latency(us)":"4","throughput(k/s)":"329",...}
{"name":"/dev/xvdc1","axis":"success","iops(/s)":"18","avg latency(us)":"2","throughput(k/s)":"267",...}
{"name":"/dev/xvdd1","axis":"success","iops(/s)":"16","avg latency(us)":"3","throughput(k/s)":"234",...}
{"name":"total","iops(/s)":"55","throughput(k/s)":"832"}

Also verified on dhruv2.dlpxdc.co (ZFS 2.4.99-1delphix.2026.05.19,
kernel 6.17.0-1015) — same clean output.

Backward compatibility: metaslab_alloc_dva probes are retained; older
engines continue using the original code path unchanged.

Known gap — garbled-name fix: engines old enough to show this bug are not
currently available. Fix verified correct by code analysis.

Known gap — kernel 6.14 compile error: metaslab-alloc.c fails to compile
on kernel 6.14 due to an incomplete-type error in linux/bpf.h pulled in
transitively. Deferred to a follow-up (same class as DLPX-96701).

@dbshah12 dbshah12 force-pushed the dlpx/pr/dbshah12/a15623f2-812d-4171-bcc8-18066302e074 branch from abd59af to ed3d575 Compare June 12, 2026 10:55
@dbshah12 dbshah12 requested a review from Copilot June 12, 2026 11:01
@dbshah12 dbshah12 marked this pull request as ready for review June 12, 2026 11:01

This comment was marked as resolved.

@dbshah12 dbshah12 changed the title DLPX-88427 Fix garbage stat names in estat metaslab-alloc BPF program DLPX-88427 Fix estat metaslab-alloc: no data on 2026.3 engines + garbled names Jun 12, 2026
dbshah12 added a commit that referenced this pull request Jun 12, 2026
…byte

The previous check only tested vd_name[0]. A string like "/dev\xa3..."
starts with '/' (printable) so the check passed, but the \xa3 byte later
in the string still produces a \xNN escape sequence when Python decodes
with backslashreplace, breaking JSON output.

Scan all bytes up to the first NUL using a #pragma unroll loop over
VD_NAME_SIZE (32) bytes. Replace the whole name with "unknown" if the
string is empty or any byte is outside printable ASCII (0x20-0x7e).

Addresses review comment on PR #121.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>

This comment was marked as resolved.

dbshah12 added a commit that referenced this pull request Jun 12, 2026
If bpf_probe_read fails for mg->mg_class, mc is left as NULL (now
explicitly initialized) and the subsequent read of mc->mc_spa would use
an undefined/NULL pointer.  Initialise both pointers to NULL and return
early if either read yields NULL before using them.

Addresses review comment on PR #121.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
@dbshah12 dbshah12 requested a review from Copilot June 12, 2026 12:32

This comment was marked as resolved.

dbshah12 added a commit that referenced this pull request Jun 12, 2026
spa, mc, and psize are present only to match the kretprobe function
signature; add (void) casts so BPF compilation does not emit
-Wunused-parameter warnings.

Addresses review comment on PR #121.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
@dbshah12 dbshah12 requested a review from Copilot June 12, 2026 12:43

This comment was marked as resolved.

This comment was marked as spam.

dbshah12 added a commit that referenced this pull request Jun 12, 2026
The previous restructuring dropped the per-DVA "allocation failures"
aggregation that metaslab_alloc_dva_exit used to emit when the overall
DVA allocation failed.  On older ZFS versions (where metaslab_alloc_dva
is still in the call chain), consumers rely on this metric to distinguish
individual per-vdev group failures from a complete DVA failure.

Restore it with the following changes:

- Add dva_owned (u8) to data_t so entry/exit probes can distinguish which
  code path created the map entry.
- Add size (u64) to data_t to store psize at metaslab_alloc_dva_entry time;
  kretprobe argument registers may be clobbered by function return so psize
  cannot be read reliably from the kretprobe handler directly.
- metaslab_alloc_dva_entry: set dva_owned=1 and store psize in data->size.
- metaslab_group_alloc_exit: if dva_owned, only reset per-group fields
  (asize, vd_name) and leave the entry alive so metaslab_alloc_dva_exit
  can emit the failure metric; if !dva_owned (new path), delete as before.
- metaslab_alloc_dva_exit: emit "allocation failures" aggregation when
  PT_REGS_RC != 0, using data->size (stored psize) and data->ts for the
  latency; always delete the entry.

Addresses review comment on PR #121.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
@dbshah12 dbshah12 requested a review from Copilot June 12, 2026 14:08

This comment was marked as resolved.

dbshah12 added a commit that referenced this pull request Jun 12, 2026
The previous comment incorrectly implied that a live map entry means
failure and contained an unfinished correction ("resets ts=0... no,
it doesn't").  Reword to accurately describe the control flow: the
entry is present on both success and failure paths; only PT_REGS_RC
determines whether the failure metric is emitted.

Addresses review comment on PR #121.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
@dbshah12 dbshah12 requested a review from Copilot June 12, 2026 14:14

This comment was marked as resolved.

dbshah12 added a commit that referenced this pull request Jun 12, 2026
…ntry

The previous check (data != NULL && data->ts != 0) does not correctly
identify old-path entries.  A stale new-path entry (dva_owned=0) left
in the map due to a missed kretprobe event also has a non-zero ts,
causing it to be mistaken for an old-path entry: the pool filter is
skipped and an unrelated timestamp/vdev name is reused.

Since dva_owned was added specifically to distinguish which code path
created the entry, key the old-path branch off data->dva_owned instead.
A stale new-path entry (dva_owned=0) now correctly falls into the else
branch and gets a fresh entry via data_map.update, overwriting the stale
one and re-applying the pool filter.

Addresses review comment on PR #121.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
dbshah12 and others added 2 commits June 15, 2026 10:44
…range

metaslab_alloc_dva() is no longer in the normal write path on ZFS 2.4.99+
(Delphix 2026.3); it now only appears in vdev_removal.c. The write path is:

  metaslab_alloc() -> metaslab_alloc_range() -> metaslab_alloc_dva_range()
    -> metaslab_group_alloc()

Replace the outer kprobe/kretprobe pair from metaslab_alloc_dva to
metaslab_alloc_dva_range, which occupies the same structural role.
metaslab_alloc_dva_range() receives spa_t * as its first argument, making
pool filtering a direct equal_to_pool(spa->spa_name) call with no struct
chain traversal needed.

This also removes the dual-path dva_owned complexity introduced to work
around the missing outer probe — with metaslab_alloc_dva_range as the
anchor, metaslab_group_alloc_entry always finds an existing data_map entry
and only needs to fill in vdev name and asize.

On older ZFS without metaslab_alloc_dva_range, estat(8) prints a WARNING
and skips those probes gracefully.

Co-Authored-By: Claude Sonnet 4.6 (1M context) <noreply@anthropic.com>
@dbshah12 dbshah12 force-pushed the dlpx/pr/dbshah12/a15623f2-812d-4171-bcc8-18066302e074 branch from 6449e97 to f7b99c7 Compare June 15, 2026 16:04
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Development

Successfully merging this pull request may close these issues.

2 participants