Changes from 4.5.1 to 4.6.0
CTable.sort_by(view=True): zero-copy sorted views
- CTable.sort_by() now accepts view=True, returning a lightweight
  sorted view that shares the parent's column data and gathers rows on
  demand in sorted order, with no whole-table copy. This is ideal for reading a
  sorted slice of a large (possibly on-disk) table::

      t.sort_by("col", view=True)[:10]  # top-10 without materialising

  Sorting on a fully indexed column streams directly from the index, so the
  table is never materialised. Multi-column sorts and dotted (nested) leaf
  names are supported (e.g.
  t.sort_by(["trip.begin.lon", "payment.fare"], ascending=[True, False])).
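The idea behind a sorted view can be sketched in plain Python: the parent's columns stay untouched, only a permutation of row positions is computed, and rows are gathered when a slice is requested. The SortedView class below is a hypothetical illustration of that idea under these assumptions; it is not blosc2's implementation or API.

```python
class SortedView:
    """Hypothetical sorted view: stores a row permutation, not a copy of the data."""

    def __init__(self, columns, by, ascending=True):
        self.columns = columns  # shared with the parent, never copied
        keys = columns[by]
        # Only the row order is computed here; the column data is not moved.
        self.order = sorted(range(len(keys)), key=keys.__getitem__,
                            reverse=not ascending)

    def __len__(self):
        return len(self.order)

    def __getitem__(self, item):
        # Gather only the requested rows, in sorted order.
        picked = self.order[item]
        if isinstance(picked, int):
            picked = [picked]
        return {name: [col[i] for i in picked]
                for name, col in self.columns.items()}


table = {"fare": [12.5, 3.0, 40.0, 7.25], "city": ["NYC", "SF", "LA", "NYC"]}
view = SortedView(table, by="fare")
top2 = view[:2]  # touches two rows only
```

Slicing the view gathers just the rows it needs, which is why reading the first few sorted rows of a large table can be cheap.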
where on dictionary (string) columns
where expressions now work over dictionary-encoded (string) columns,
including membership tests such as '"Acme" in company', so categorical
text columns can be filtered without decoding the whole column.
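A rough sketch of why dictionary encoding makes such filters cheap: the predicate is evaluated once per distinct value, and rows are then selected by comparing integer codes. The function and variable names are hypothetical, and the substring reading of the membership test is an assumption made only for this illustration, not a statement about how blosc2 evaluates it.

```python
def filter_dictionary_column(dictionary, codes, predicate):
    """Return row positions whose decoded value satisfies the predicate.

    dictionary: list of distinct strings; codes: one integer code per row.
    """
    # Evaluate the predicate once per distinct value, not once per row.
    matching_codes = {code for code, value in enumerate(dictionary)
                      if predicate(value)}
    # Row selection only compares small integers; no row is decoded.
    return [row for row, code in enumerate(codes) if code in matching_codes]


dictionary = ["Acme Corp", "Globex", "Acme Labs"]
codes = [0, 1, 1, 2, 0, 1]
rows = filter_dictionary_column(dictionary, codes, lambda value: "Acme" in value)
```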
b2view is now an opt-in extra
- The b2view terminal browser and its TUI stack (textual,
  textual-plotext) are no longer core dependencies: a plain
  pip install blosc2 no longer pulls them in, keeping the compression library
  lean (and dropping deps that are unusable under wasm32, which has no TTY).
  Install the viewer with pip install "blosc2[tui]", or
  pip install "blosc2[hires]" to also get the high-res h view. The
  b2view command prints this hint if the dependencies are missing.
group_by: flexible aggregation naming
- CTable.group_by(...).agg() now accepts a list of (column, ops) pairs
  and explicit output names (pandas-style keyword arguments), alongside the
  existing auto-suffixed mapping; the forms can be combined::

      g.agg({"sales": ["sum", "mean"]})         # auto: sales_sum, sales_mean
      g.agg([(t.sales, ["sum", "mean"])])       # auto, but accepts Column objects
      g.agg(revenue=("sales", "sum"))           # explicit: revenue
      g.agg({"sales": "sum"}, n=("*", "size"))  # combined, with a named row count

  The list-of-pairs and named forms accept Column objects (t.sales), which
  the mapping form cannot because Column is unhashable and so cannot be a dict
  key.

- Aggregation ops may also be given as the matching blosc2 reduction functions
  (blosc2.sum, mean, min, max, argmin, argmax), matched by
  identity, e.g. g.agg([(t.sales, [blosc2.sum, "mean"])]). This is a
  naming shorthand only; arbitrary/UDF callables (and look-alikes such as
  np.sum or a user function named sum) are rejected rather than silently
  misinterpreted.
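Matching by identity can be illustrated with a small stand-alone sketch. The functions lib_sum and lib_mean stand in for the library's reduction functions, and resolve_op is a hypothetical helper; none of these names come from blosc2.

```python
# Stand-ins for a library's reduction functions (hypothetical).
def lib_sum(values):
    return sum(values)


def lib_mean(values):
    return sum(values) / len(values)


KNOWN_OPS = {"sum": lib_sum, "mean": lib_mean}


def resolve_op(op):
    """Map an op given as a name or as a known function to its op name."""
    if isinstance(op, str):
        if op in KNOWN_OPS:
            return op
        raise ValueError(f"unknown aggregation op: {op!r}")
    for name, func in KNOWN_OPS.items():
        if op is func:  # identity check, not a name or behaviour check
            return name
    raise TypeError(f"unsupported aggregation callable: {op!r}")


names = [resolve_op(op) for op in (lib_sum, "mean")]

try:
    resolve_op(sum)  # builtin sum: same name, different object
    rejected = False
except TypeError:
    rejected = True
```

Because the check uses `is`, a callable that merely shares a name with a known reduction is refused instead of being guessed at.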
group_by / group_reduce: tri-state sort=
- Vectorized dictionary group ordering: group_by() result building now
  batch-decodes dictionary (string) keys in one pass (decode_batch) instead of
  one decode() per group, making high-cardinality string group-bys dramatically
  faster (end-to-end group_by().size() dropped from seconds to milliseconds on
  ~100k-group workloads).
- sort= is now a tri-state (None/True/False) on both
  CTable.group_by() and blosc2.group_reduce():

  - True: always return groups sorted by key.
  - False: never sort; deterministic but unspecified order.
  - None (the new default): auto, sort only when cheap. Integer and
    dictionary keys are sorted (free / vectorized); float and multi-key results,
    whose only ordering is an O(G log G) Python sort over every distinct group,
    are left unsorted to avoid a cost that can rival the grouping itself on
    high-cardinality data.
- Behavior changes (the two APIs had different prior defaults, so they move
  in opposite directions):

  - CTable.group_by() previously returned results always sorted. Under the
    new None default, float-key and multi-key group-bys are no longer
    key-sorted by default; pass sort=True to restore sorted output. This is
    a deliberate divergence from pandas (which defaults to sort=True), suited
    to blosc2's large / on-disk datasets.
  - blosc2.group_reduce() previously defaulted to sort=False (unsorted).
    Under the new None default its cheap kernels now sort by default,
    most visibly float keys, which previously came out in hash order. Integer
    keys were already ascending; the generic Python fallback stays unsorted.
    Pass sort=False to opt out.
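The tri-state rule, as worded for the None case above, can be modelled as a small decision function. This is only a sketch of the stated rule: should_sort and the key-kind labels ("int", "dict", "float") are hypothetical, and the actual kernels may decide differently (for example in group_reduce's cheap float path).

```python
def should_sort(sort, key_kinds):
    """Decide whether group results get sorted by key.

    sort: None, True or False.
    key_kinds: one label per grouping key, each "int", "dict" or "float".
    """
    if sort is not None:
        return bool(sort)  # explicit True/False always wins
    if len(key_kinds) != 1:
        return False       # auto mode: multi-key results stay unsorted
    # Auto mode: single integer or dictionary keys are cheap to order.
    return key_kinds[0] in ("int", "dict")


auto_int = should_sort(None, ["int"])
auto_float = should_sort(None, ["float"])
forced_float = should_sort(True, ["float"])
```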
Accelerated reductions from index summaries
- min/max on indexed Columns, and argmin/argmax inside group_by, are
  now accelerated using the index's per-block min/max summaries: when an
  index is available these reductions run from the precomputed summaries instead
  of decompressing the underlying data, which is dramatically faster on large
  columns. A fast path also builds min/max envelope plots from any index.
- The last group_by operation is memoized and reused when the same
  grouping is requested again, avoiding recomputation in interactive / repeated
  workflows (e.g. b2view).
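The summary-based reduction described above can be sketched as follows: once each block's minimum and maximum are known, the global result only needs the summaries, never the data itself. The helper names and the block size are assumptions chosen for illustration.

```python
def build_summaries(values, block_size):
    """Precompute a (min, max) pair for each fixed-size block."""
    return [
        (min(values[i:i + block_size]), max(values[i:i + block_size]))
        for i in range(0, len(values), block_size)
    ]


def min_max_from_summaries(summaries):
    """Global min and max computed from the per-block summaries only."""
    return min(lo for lo, _ in summaries), max(hi for _, hi in summaries)


values = [7, 3, 9, 4, 12, 1, 8, 6, 5]
summaries = build_summaries(values, block_size=4)  # done once, at index time
lo, hi = min_max_from_summaries(summaries)         # reads 3 pairs, not 9 values
```

The reduction cost scales with the number of blocks rather than the number of elements, which is where the speed-up on large columns comes from.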
b2view: group-by, sort, and richer plots
- Interactive group-by (G): group a CTable by a column (integer, string,
  or now float keys) directly in the viewer, with a three-list / two-column
  menu; while grouped, S/R operate on the grouped result and the data
  panel's subtitle shows a G(roup) chip. The last grouping is memoized for
  instant reuse.
- Sort by column (S): sort a CTable by a fully indexed column via a
  dropdown (R toggles reverse) as a zero-copy sort_by(view=True) that streams
  from the index: the table is never materialised, Esc restores the original
  order, and a SORTED chip shows in the status bar. Non-indexed columns can
  now be sorted too. Sort and filter are mutually exclusive; a row window
  composes over a sort, and a filter is preserved across Sort / Group.
- Better plots of grouped/sorted views: a grouped view plots bars for a
  categorical key and lines for a numeric key; numeric-key group plots
  render as stem/impulse charts rather than misleading connected lines. Bar
  plots gain a hi-res counterpart mirroring the line/scatter plots, and +/-
  zoom about the view's left edge.
- --max maximizes the current panel, and escape is now the single,
  consistent way to back out of every modal.
Other / bug fixes
- C-Blosc2 upgraded to 3.1.5.
- Open-file cache correctness: cached open handles are now validated against
  the file's fingerprint (st_mtime_ns, st_size) and cached index handles are
  released when a table closes, so a file changed underneath an open handle is no
  longer served stale.
- NumPy 2.5 compatibility: adjusted for deprecations in NumPy 2.5.
- Substantially reduced test-suite runtime, and emscripten builds no longer
attempt to spawn subprocesses (unsupported there).
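The open-file cache fix listed above relies on comparing a file fingerprint before reusing a cached handle. A minimal sketch of that check, using only the standard library, might look like this; the cache layout, open_cached, and the byte-reading opener are hypothetical stand-ins, not blosc2 internals.

```python
import os
import tempfile

_cache = {}  # path -> (fingerprint, handle)


def fingerprint(path):
    """Cheap change detector: modification time (ns) and size."""
    st = os.stat(path)
    return (st.st_mtime_ns, st.st_size)


def open_cached(path, opener):
    """Return the cached handle unless the file changed since it was cached."""
    current = fingerprint(path)
    cached = _cache.get(path)
    if cached is not None and cached[0] == current:
        return cached[1]
    handle = opener(path)  # stale or missing entry: reopen
    _cache[path] = (current, handle)
    return handle


def read_bytes(path):
    # Stand-in "opener" that simply reads the file contents.
    with open(path, "rb") as f:
        return f.read()


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "data.bin")
    with open(path, "wb") as f:
        f.write(b"v1")
    first = open_cached(path, read_bytes)
    again = open_cached(path, read_bytes)  # fingerprint unchanged: cache hit
    with open(path, "wb") as f:
        f.write(b"version-2")              # size changes, so the fingerprint does too
    refreshed = open_cached(path, read_bytes)
```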