Blosc · FrancescAlted · Jun 29, 2026 · Jun 28, 2026 · Jun 28, 2026 · Jun 28, 2026
diff --git a/RELEASE_NOTES.md b/RELEASE_NOTES.md
@@ -2,7 +2,17 @@
 
 ## Changes from 4.6.0 to 4.6.1
 
-XXX version-specific blurb XXX
+### DSL → JavaScript backend for WebAssembly (`jit_backend="js"`)
+
+- Under WebAssembly/Pyodide, `@blosc2.dsl_kernel` kernels can now be transpiled
+  to JavaScript and run via the browser's JIT.  It is the **default** there for
+  transpilable floating-point kernels (silently falling back to miniexpr for
+  anything it can't handle), and beats the WASM TinyCC JIT on compute-heavy
+  kernels (e.g. ~2.8x on a Newton-fractal kernel).  Request it explicitly with
+  `compute(jit_backend="js")`; outside WebAssembly that raises.
+- Supports index/shape symbols (`_i0`/`_n0`/`_ndim`/`_flat_idx`) and integer inputs
+  with a floating-point output.  Integer/complex *output*, reductions, and
+  unsupported constructs stay on miniexpr.  Native builds are unaffected.
 
 ## Changes from 4.5.1 to 4.6.0
 

diff --git a/bench/js-transpiler/README.md b/bench/js-transpiler/README.md
@@ -0,0 +1,113 @@
+# DSL → JavaScript transpiler benches
+
+Benches/demos for `blosc2.dsl_js`, which transpiles a `@blosc2.dsl_kernel` to JavaScript so
+kernels run at V8-optimized native speed in the browser/Pyodide (the `jit_backend="js"`
+path). Design and findings: [`plans/dsl-js.md`](../../plans/dsl-js.md).
+
+Both run a Newton-fractal kernel (high arithmetic intensity + per-pixel early exit) and
+compare backends. Run everything from the **repo root**.
+
+## Headless (Node + Pyodide) — `dsl-js-node.mjs`
+
+Integration test **and** perf bench, no browser. Installs the blosc2 wasm wheel from PyPI,
+overlays this working tree's pure-Python (`src/blosc2/dsl_js.py` + `lazyexpr.py`) on top of
+it before importing blosc2 — so the wired `jit_backend="js"` path is exercised **without
+rebuilding a wheel**. Asserts `js` and miniexpr-JIT both match a numpy reference exactly,
+then benches a 24-frame `relax` sweep.
+
+```sh
+npm i                                         # pulls pyodide@314 (see package.json)
+node bench/js-transpiler/dsl-js-node.mjs       # correctness + kernel sweep, 12 reps
+node bench/js-transpiler/dsl-js-node.mjs 24    # N reps
+```
+
+Needs network on first run (PyPI wheel via micropip). Exits non-zero on a correctness
+mismatch *or* a broken default fallback, so it works as a smoke test. It benches five kernel
+shapes so the js-vs-tcc ratio can be read against the kernel, not generalized from one. The
+`default` column (no `jit`/`jit_backend` set) shows the prefer-js-with-fallback default, and
+it also checks that an int kernel and an index-symbol kernel fall back cleanly to miniexpr.
+Representative (Apple M2, 4.6.0):
+
+```
+default fallback (no jit_backend): int=ok index-symbol=ok  -> falls back cleanly
+  kernel   default      js     tcc   nojit  js/tcc
+  newton      12.0    11.4    23.9   104.5   2.10x
+  poly         3.0     2.9     2.7     3.2   0.91x
+  trans        4.3     4.3     4.4     5.7   1.01x
+  deep       116.0   116.8   143.4   103.3   1.23x
+  deepar      22.3    22.0    50.4    52.7   2.29x
+```
+
+Columns: `default` = prefer-js-with-fallback, `js` = forced `jit_backend="js"`, `tcc` =
+miniexpr JIT (`jit_backend="tcc"`), `nojit` = miniexpr interpreter (`jit=False`). `default ≈
+js` here (all float + transpilable) → prefer-js engaged. Note `jit=True` *also* prefers js
+(it's a JIT); to force miniexpr use `jit_backend="tcc"`/`"cc"`, and `jit=False` selects the
+interpreter — that's what the `tcc`/`nojit` columns pin.
+
+**The takeaway: there is no single "js is N× the JIT" number — it depends on what the kernel
+is bottlenecked on.**
+
+- **Arithmetic / control-flow bound** (newton, deepar) → V8's optimizing JIT beats blosc2's
+  miniexpr WASM codegen by **~2×**. This is the sweet spot.
+- **Transcendental bound** (trans, deep) → **~1×**: time is spent in `sin`/`exp`/`log` (libm),
+  which costs about the same whoever runs the loop — `nojit` even edges `js` on `deep`.
+- **Light / trivial** (poly) → **<1×**: the kernel does almost no compute, so the blosc2
+  pipeline + per-call JS marshaling dominate, and `js` can be *slightly slower* than the JIT.
+
+So the honest generalization is qualitative: transpiling to JS wins (~2×, single-threaded)
+for **compute-bound float kernels dominated by arithmetic and control flow**, and is roughly
+a wash for transcendental-bound or trivial kernels.
+
+> The overlay pins `blosc2==4.6.0` to keep the compiled `blosc2_ext` ABI in step with the
+> pure-Python we drop on top. Once these changes ship in a Pyodide-installable wheel, the
+> overlay can go away. If the overlay import ever breaks on version skew, overlay all of
+> `src/blosc2/*.py` (or bump the pin).
+
+## Browser — `newton-dsl-js.html`
+
+Visual proof in a real browser: transpiles a real `@blosc2.dsl_kernel` under Pyodide, checks
+the emitted JS against a numpy reference on the **same** inputs, and times it against a
+hand-written JS kernel over the 24-frame sweep (ratio should sit near 1.00 — the transpiler
+reaches hand-written-JS speed), then renders the fractal.
+
+```sh
+python3 -m http.server          # from the repo root
+# open http://localhost:8000/bench/js-transpiler/newton-dsl-js.html  and click Run
+```
+
+Serve from the repo root (not `file://`): the page fetches `/src/blosc2/dsl_js.py` (the
+local transpiler, newer than the PyPI wheel) at a server-root-absolute path.
+
+## Multithreading ceiling — `worker-pool-bench.mjs`
+
+Throwaway exploration of *how fast the transpiled kernel could go* with real JS
+multithreading: pure Node (`worker_threads` + `SharedArrayBuffer`), **no Pyodide, no
+blosc2**. Same Newton kernel, partitioned across a persistent worker pool with an Atomics
+barrier; reports speedup vs single-thread for 1/2/4/N workers.
+
+```sh
+node bench/js-transpiler/worker-pool-bench.mjs
+```
+
+Findings (Apple M2, 4 performance + 4 efficiency cores; laptop numbers vary ±10–15% with
+thermal/P-vs-E scheduling, so treat these as representative, not exact):
+
+| workers | ms/frame | speedup |
+|---|---|---|
+| single-thread | ~11.3 | 1.0× |
+| ×2 | ~5.9 | ~1.9× (~94% eff) |
+| ×4 | ~3.1 | ~3.5× (~88% eff) |
+| ×8 | ~2.3 | ~4.8× (~60% eff — E-cores) |
+
+- The worker mechanism is ~free (×1 ≈ 1.0×); scaling is near-linear up to the performance
+  core count. The ×8 drop-off is the M2's efficiency cores, not overhead.
+- **Load balancing is essential.** Contiguous row-bands regress badly (×4 fell to 1.48×)
+  because the per-pixel early-`break` makes some bands all-max-iter and others trivial.
+  **Striped** rows (worker `i` → rows `i, i+nw, …`) fix it — that's what the bench uses.
+
+Why this stays a *headroom* result, not a shipped feature: it measures **pure compute**.
+The real `jit_backend="js"` path also pays ~8 ms/frame of blosc2 decompress/compress that
+does not parallelize this way, plus Pyodide orchestration is single-threaded — so realistic
+end-to-end gain is a fraction of 5×. And a browser integration needs pure-JS workers (not the
+Pyodide bridge), a SharedArrayBuffer, COOP/COEP cross-origin isolation, and a path to get
+decompressed chunks into shared memory. See "Deferred" in [`plans/dsl-js.md`](../../plans/dsl-js.md).