Release 4.6.0

@FrancescAlted FrancescAlted released this 26 Jun 09:56
· 2 commits to main since this release

Changes from 4.5.1 to 4.6.0

CTable.sort_by(view=True): zero-copy sorted views

  • CTable.sort_by() now accepts view=True, returning a lightweight
    sorted view that shares the parent's column data and gathers rows on
    demand in sorted order — no whole-table copy. This is ideal for reading a
    sorted slice of a large (possibly on-disk) table::

    t.sort_by("col", view=True)[:10]      # top-10 without materialising
    

    Sorting on a fully indexed column streams directly from the index, so the
    table is never materialised. Multi-column sorts and dotted (nested) leaf
    names are supported (e.g. t.sort_by(["trip.begin.lon", "payment.fare"], ascending=[True, False])).
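As a rough illustration of the idea (not blosc2's implementation), a sorted view can be modelled as a row permutation kept next to the parent's columns, with rows gathered only when the view is sliced. The `SortedView` class and its fields are hypothetical names, and this sketch computes the permutation eagerly in plain Python, whereas the notes above say that an indexed column streams the order from its index.

```python
class SortedView:
    """Hypothetical sketch: a sorted view that shares the parent's columns."""

    def __init__(self, columns, by, ascending=True):
        self.columns = columns  # shared reference, nothing is copied
        keys = columns[by]
        # Only the row order is computed; column data is not duplicated.
        self.order = sorted(range(len(keys)), key=keys.__getitem__,
                            reverse=not ascending)

    def __getitem__(self, sl):
        # Gather just the requested rows (slices only), in sorted order.
        rows = self.order[sl]
        return {name: [col[i] for i in rows]
                for name, col in self.columns.items()}


table = {"fare": [12.5, 3.0, 7.25, 40.0], "trip_id": [10, 11, 12, 13]}
view = SortedView(table, by="fare")
top2 = view[:2]  # touches two rows per column, not the whole table
```

Slicing the view reads only the selected rows, which is the property that makes a top-N query over a large table cheap.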

where on dictionary (string) columns

  • where expressions now work over dictionary-encoded (string) columns,
    including membership tests such as '"Acme" in company', so categorical
    text columns can be filtered without decoding the whole column.
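The reason a dictionary-encoded column can be filtered without decoding it can be sketched in plain Python: the predicate is evaluated once per distinct string, and the rows are then selected by comparing integer codes. The layout (`dictionary`, `codes`) and the helper `rows_matching` are illustrative assumptions, and the exact semantics of blosc2's where expressions are not reproduced here.

```python
# Hypothetical dictionary-encoded column: each distinct string is stored once,
# and every row holds a small integer code pointing into the dictionary.
dictionary = ["Acme", "Globex", "Initech"]
codes = [0, 2, 1, 0, 0, 2]


def rows_matching(dictionary, codes, predicate):
    """Return row indices whose string satisfies `predicate`."""
    # Evaluate the predicate once per distinct string, not once per row.
    wanted = {code for code, text in enumerate(dictionary) if predicate(text)}
    # The per-row work is an integer membership test; no row is decoded.
    return [row for row, code in enumerate(codes) if code in wanted]


acme_rows = rows_matching(dictionary, codes, lambda s: s == "Acme")
ex_rows = rows_matching(dictionary, codes, lambda s: "ex" in s)
```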

b2view is now an opt-in extra

  • The b2view terminal browser and its TUI stack (textual,
    textual-plotext) are no longer core dependencies: a plain
    pip install blosc2 no longer pulls them, keeping the compression library
    lean (and dropping deps that are unusable under wasm32, which has no TTY).
    Install the viewer with pip install "blosc2[tui]", or
    pip install "blosc2[hires]" to also get the high-res h view. The
    b2view command prints this hint if the dependencies are missing.
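A minimal sketch of how an optional-dependency check of this kind could look, using only the standard library. The import name `textual_plotext`, the helper names, and the hint wording are assumptions for illustration; this is not the code that b2view runs.

```python
import importlib.util

# Package names come from the notes above; the import names are assumed.
TUI_MODULES = ("textual", "textual_plotext")


def missing_modules(names=TUI_MODULES):
    """Return the module names that cannot be found in this environment."""
    return [name for name in names if importlib.util.find_spec(name) is None]


def install_hint(missing):
    """Build a one-line hint, or an empty string when nothing is missing."""
    if not missing:
        return ""
    return f'Missing {", ".join(missing)}; install with: pip install "blosc2[tui]"'


hint = install_hint(missing_modules())  # empty when the TUI stack is installed
```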

group_by: flexible aggregation naming

  • CTable.group_by(...).agg() now accepts a list of (column, ops) pairs
    and explicit output names (pandas-style keyword arguments), alongside the
    existing auto-suffixed mapping; the forms can be combined::

    g.agg({"sales": ["sum", "mean"]})              # auto: sales_sum, sales_mean
    g.agg([(t.sales, ["sum", "mean"])])            # auto, but accepts Column objects
    g.agg(revenue=("sales", "sum"))                # explicit: revenue
    g.agg({"sales": "sum"}, n=("*", "size"))       # combined, with a named row count
    

    The list-of-pairs and named forms accept Column objects (t.sales), which
    the mapping form cannot because Column is unhashable and so cannot be a dict
    key.

  • Aggregation ops may also be given as the matching blosc2 reduction functions
    (blosc2.sum, mean, min, max, argmin, argmax), matched by identity,
    e.g. g.agg([(t.sales, [blosc2.sum, "mean"])]). This is a
    naming shorthand only; arbitrary/UDF callables (and look-alikes such as
    np.sum or a user function named sum) are rejected rather than silently
    misinterpreted.
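The naming rules above can be sketched without blosc2 at all. In the hedged example below, `lib_sum` and `lib_mean` stand in for the library's reduction functions, and `op_name` / `output_names` are hypothetical helpers that only compute the output column names. Columns are plain strings here rather than Column objects.

```python
def lib_sum(values):  # stand-in for a library reduction such as blosc2.sum
    return sum(values)


def lib_mean(values):  # stand-in for a library reduction such as blosc2.mean
    return sum(values) / len(values)


_BY_IDENTITY = ((lib_sum, "sum"), (lib_mean, "mean"))


def op_name(op):
    """Map an op to its name; callables are matched by identity, not by name."""
    if isinstance(op, str):
        return op
    for func, name in _BY_IDENTITY:
        if op is func:
            return name
    raise TypeError(f"unsupported aggregation op: {op!r}")


def output_names(*specs, **named):
    """Return output column names for mapping, list-of-pairs and named forms."""
    out = []
    for spec in specs:
        pairs = spec.items() if isinstance(spec, dict) else spec
        for column, ops in pairs:
            if isinstance(ops, str) or callable(ops):
                ops = [ops]
            for op in ops:
                out.append(f"{column}_{op_name(op)}")  # auto-suffixed
    out.extend(named)  # keyword names are used verbatim
    return out


auto = output_names([("sales", [lib_sum, "mean"])])
combined = output_names({"sales": "sum"}, n=("*", "size"))
```

Matching by identity means the built-in `sum`, although it has the same name as an accepted op, is rejected by `op_name` instead of being guessed at.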

group_by / group_reduce: tri-state sort=

  • Vectorized dictionary group ordering: group_by() result building now
    batch-decodes dictionary (string) keys in one pass (decode_batch) instead of
    one decode() per group, making high-cardinality string group-bys dramatically
    faster (end-to-end group_by().size() dropped from seconds to milliseconds on
    ~100k-group workloads).
  • sort= is now a tri-state (None / True / False) on both
    CTable.group_by() and blosc2.group_reduce():
    • True — always return groups sorted by key.
    • False — never sort; deterministic but unspecified order.
    • None (the new default) — auto: sort only when cheap. Integer and
      dictionary keys are sorted (free / vectorized); float and multi-key results,
      whose only ordering is an O(G log G) Python sort over every distinct group,
      are left unsorted to avoid a cost that can rival the grouping itself on
      high-cardinality data.
  • Behavior changes (the two APIs had different prior defaults, so they move
    in opposite directions):
    • CTable.group_by() previously always returned sorted results. Under the
      new None default, float-key and multi-key group-bys are no longer
      key-sorted by default; pass sort=True to restore sorted output. This is
      a deliberate divergence from pandas (which defaults to sort=True), suited
      to blosc2's large / on-disk datasets.
    • blosc2.group_reduce() previously defaulted to sort=False (unsorted).
      Under the new None default its cheap kernels now sort by default,
      most visibly for float keys, which previously came out in hash order. Integer
      keys were already ascending; the generic Python fallback stays unsorted.
      Pass sort=False to opt out.
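The tri-state rule can be written down as a small decision function. This sketch follows the description given for CTable.group_by() above (group_reduce() kernels may decide differently, as noted); the function name and the `key_kinds` labels are hypothetical.

```python
def should_sort(sort, key_kinds):
    """Decide whether group results are key-sorted.

    sort: True, False, or None (auto).
    key_kinds: one label per grouping key, e.g. ["int"], ["dict"], ["float"].
    """
    if sort is not None:
        return bool(sort)  # an explicit True / False always wins
    # Auto: sort only when ordering is cheap, i.e. a single integer or
    # dictionary key; float and multi-key results are left unsorted.
    return len(key_kinds) == 1 and key_kinds[0] in ("int", "dict")


single_int = should_sort(None, ["int"])
single_float = should_sort(None, ["float"])
multi_key = should_sort(None, ["int", "dict"])
forced = should_sort(True, ["float"])
```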

Accelerated reductions from index summaries

  • min/max on indexed Columns, and argmin/argmax inside group_by, are
    now accelerated using the index's per-block min/max summaries: when an
    index is available these reductions run from the precomputed summaries instead
    of decompressing the underlying data, which is dramatically faster on large
    columns. A fast path also builds min/max envelope plots from any index.
  • The last group_by operation is memoized and reused when the same
    grouping is requested again, avoiding recomputation in interactive / repeated
    workflows (e.g. b2view).
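To see why per-block summaries help, consider this plain-Python sketch: once each block's minimum and maximum are stored, a whole-column min/max needs only the summaries, never the block contents. `BlockSummary`, `build_summaries` and the block size are illustrative assumptions, not blosc2's index format.

```python
from collections import namedtuple

BlockSummary = namedtuple("BlockSummary", "lo hi")


def build_summaries(values, block_size):
    """Precompute (min, max) per block; done once, when the index is built."""
    summaries = []
    for start in range(0, len(values), block_size):
        block = values[start:start + block_size]
        summaries.append(BlockSummary(min(block), max(block)))
    return summaries


def column_min(summaries):
    # Reads one number per block instead of every value in the column.
    return min(s.lo for s in summaries)


def column_max(summaries):
    return max(s.hi for s in summaries)


data = [5, 9, 2, 7, 7, 1, 8, 3, 6]
summaries = build_summaries(data, block_size=4)
lowest, highest = column_min(summaries), column_max(summaries)
```

The same list of (lo, hi) pairs is also enough to draw a min/max envelope, one point pair per block.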

b2view: group-by, sort, and richer plots

  • Interactive group-by (G): group a CTable by a column (integer, string,
    or now float keys) directly in the viewer, with a three-list / two-column
    menu; while grouped, S/R operate on the grouped result and the data
    panel's subtitle shows a G(roup) chip. The last grouping is memoized for
    instant reuse.
  • Sort by column (S): sort a CTable by a fully indexed column via a
    dropdown (R toggles reverse) as a zero-copy sort_by(view=True) that streams
    from the index — the table is never materialised, Esc restores the original
    order, and a SORTED chip shows in the status bar. Non-indexed columns can
    now be sorted too. Sort and filter are mutually exclusive; a row window
    composes over a sort, and a filter is preserved across Sort / Group.
  • Better plots of grouped/sorted views: a grouped view plots bars for a
    categorical key and lines for a numeric key; numeric-key group plots
    render as stem/impulse charts rather than misleading connected lines. Bar
    plots gain a hi-res counterpart mirroring the line/scatter plots, and +/-
    zoom about the view's left edge.
  • --max maximizes the current panel, and escape is now the single,
    consistent way to back out of every modal.

Other / bug fixes

  • C-Blosc2 upgraded to 3.1.5.
  • Open-file cache correctness: cached open handles are now validated against
    the file's fingerprint (st_mtime_ns, st_size) and cached index handles are
    released when a table closes, so a file changed underneath an open handle is no
    longer served stale.
  • NumPy 2.5 compatibility: adjusted for deprecations in NumPy 2.5.
  • Substantially reduced test-suite runtime, and emscripten builds no longer
    attempt to spawn subprocesses (unsupported there).
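The fingerprint check described under "Open-file cache correctness" can be sketched with the standard library. The cache layout and the `cached_open` / `read_bytes` helpers are hypothetical; only the (st_mtime_ns, st_size) pair comes from the notes above.

```python
import os
import tempfile

_cache = {}  # path -> (fingerprint, cached object)


def fingerprint(path):
    st = os.stat(path)
    return (st.st_mtime_ns, st.st_size)


def cached_open(path, opener):
    """Reuse the cached object only while the file's fingerprint is unchanged."""
    current = fingerprint(path)
    hit = _cache.get(path)
    if hit is not None and hit[0] == current:
        return hit[1]
    obj = opener(path)  # file is new or changed on disk: reopen it
    _cache[path] = (current, obj)
    return obj


def read_bytes(path):
    with open(path, "rb") as f:
        return f.read()


with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "data.bin")
    with open(path, "wb") as f:
        f.write(b"abc")
    first = cached_open(path, read_bytes)
    again = cached_open(path, read_bytes)      # served from the cache
    with open(path, "wb") as f:
        f.write(b"abcdef")                     # size changes, so does the fingerprint
    refreshed = cached_open(path, read_bytes)  # stale entry is replaced
```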