Vendor Snowball stemmers for byte-exact 14-language Pagefind parity by jeremyandrews · Pull Request #18 · tag1consulting/scolta-python

jeremyandrews · 2026-06-15T09:06:11Z

The bug

scolta's Stemmer maps 14 languages but the parity test only verified 5 (en/fr/de/es/ru). The other 9 — including the four where the dependency genuinely diverges from Pagefind — were mapped, shipped, and never checked.

The build-time stems must match what Pagefind stems queries with at runtime (the crate pagefind_stem 1.0.0, bundled in Pagefind 1.5.0's WASM). If they disagree, that language's index silently misses queries. The binding depended on snowballstemmer>=3, which resolves to 3.1.1, and 3.1.x added apostrophe/elision handling that pagefind_stem 1.0.0 does not have (Danish/Norwegian/Finnish apostrophe handling, Italian elision stripping).

Root cause, measured (not assumed)

I measured every published snowballstemmer 3.x release against the full 14-language corpus (the same corpus, with byte-identical word lists, that scolta-php's StemmerConcordanceTest uses):

| version | divergences vs `pagefind_stem` 1.0.0 | where |
| --- | --- | --- |
| 3.0.0.1 | 18 | en only (predates the `english.sbl` fixes the crate has) |
| 3.0.1 | 18 | en only (same) |
| 3.1.0 | 2,103 | it 1,867 / fi 183 / da 35 / no 18 (apostrophe/elision the crate lacks) |
| 3.1.1 (was resolved) | 2,103 | it 1,867 / fi 183 / da 35 / no 18 |

So no published snowballstemmer release is byte-exact for all 14 languages. The 3.0.x line is missing the English fix; the 3.1.x line adds changes Pagefind doesn't have. The parity-correct revision, commit 019c1bd (mainline, between v3.0.0 and v3.1.0), has the English fix and lacks the apostrophe/elision changes. This is exactly the revision pagefind_stem 1.0.0 was generated from (recovered in the scolta-php stemmer work, php#214).

Example divergences (3.1.1 vs Pagefind's actual query stem):

it: all'abbandono → abband (3.1.1) vs all'abband (Pagefind)
da: bbc's → bbc (3.1.1) vs bbc's (Pagefind)
fi: ajatollahin → ajatollah (3.1.1) vs ajatollahin (Pagefind)
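
To make the failure mode concrete, here is a minimal sketch of how such divergences can be counted and why they turn into missed queries. The lookup tables are stand-ins for the two stemmers and hold only the three example stems listed above; `find_divergences` and the toy `index` are hypothetical names for illustration, not part of scolta or Pagefind.

```python
from collections.abc import Callable, Iterable

# Stems copied from the three examples above. The tables stand in for the
# real stemmers, which this sketch deliberately does not import.
BUILD_TIME_3_1_1 = {
    "all'abbandono": "abband",
    "bbc's": "bbc",
    "ajatollahin": "ajatollah",
}
PAGEFIND_RUNTIME = {
    "all'abbandono": "all'abband",
    "bbc's": "bbc's",
    "ajatollahin": "ajatollahin",
}


def find_divergences(
    words: Iterable[str],
    build_stem: Callable[[str], str],
    query_stem: Callable[[str], str],
) -> list[tuple[str, str, str]]:
    """Return (word, build-time stem, query-time stem) wherever the two differ."""
    divergences = []
    for word in words:
        built, queried = build_stem(word), query_stem(word)
        if built != queried:
            divergences.append((word, built, queried))
    return divergences


words = list(BUILD_TIME_3_1_1)
divergences = find_divergences(
    words, BUILD_TIME_3_1_1.__getitem__, PAGEFIND_RUNTIME.__getitem__
)

# Toy index keyed by build-time stems: a query stemmed differently at runtime
# looks up a key that was never written, so the lookup misses without error.
index = {BUILD_TIME_3_1_1[w]: [f"page-for-{w}"] for w in words}
missed = [w for w in words if PAGEFIND_RUNTIME[w] not in index]
```

In this sketch all three words diverge and all three queries miss, which is the silent failure the corpus gate is meant to catch.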

The fix — vendor, don't depend

Following scolta-php's proven path (src/Index/Snowball), the stemmers are now vendored from the Snowball compiler's -python backend at commit 019c1bd, in src/scolta/index/snowball/:

Byte-exact with pagefind_stem 1.0.0 over the full corpus: 589,069 words, 0 divergences, all 14 languages.
The snowballstemmer dependency is removed (pyproject.toml, uv.lock). Stemmer maps language codes onto the vendored <Algo>Stemmer classes and calls stemWord.
scripts/generate-stemmers.sh reproduces the vendored output (clone snowball @ 019c1bd, build the compiler, generate + vendor the runtime). No source transforms are needed — unlike the PHP backend, the Python backend already emits floor-compatible code, so the files are byte-identical to compiler output. The vendored dir is excluded from ruff to stay byte-stable.
The vendored LICENSE (Snowball BSD-3-Clause, sha-identical to scolta-php's) and a PROVENANCE.md document the commit, the version-header trap, and why a mainline commit rather than a tag.
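
The mapping described above could look roughly like the following sketch. It is an illustration under stated assumptions: `StemmerSketch`, `_StandInStemmer`, and `_STEMMER_CLASSES` are hypothetical names, and the stand-in class only models the `stemWord` entry point rather than any real Snowball algorithm.

```python
class _StandInStemmer:
    """Stand-in for a vendored <Algo>Stemmer class; only stemWord is modelled."""

    def stemWord(self, word: str) -> str:
        # Placeholder behaviour, not a real Snowball algorithm.
        return word.lower()


# Hypothetical mapping of language code to stemmer class. In the real package
# each code would point at a different vendored class.
_STEMMER_CLASSES = {
    "it": _StandInStemmer,
    "da": _StandInStemmer,
}


class StemmerSketch:
    """Resolve a language code to a stemmer instance and delegate to stemWord."""

    def __init__(self, language: str) -> None:
        try:
            stemmer_class = _STEMMER_CLASSES[language]
        except KeyError:
            raise ValueError(f"unsupported language: {language!r}") from None
        self._impl = stemmer_class()

    def stem(self, word: str) -> str:
        return self._impl.stemWord(word)
```

Failing loudly on an unmapped language is a design choice of this sketch; the source does not say how scolta handles that case.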

Regression gate (fail-before / pass-after)

Fail-before: the table above is the fail-before evidence — on 3.1.1 the it/da/fi/no corpus assertions report exactly 1,867 / 35 / 183 / 18 mismatches.
Pass-after: the byte-exact corpus parity gate now covers all 14 shipped languages (was 5), 0 mismatches.
test_snowball_provenance.py (new) pins every vendored file to a sha256 manifest, mirroring scolta-php's SnowballProvenanceTest — a silent regeneration (wrong revision, hand edit) fails CI.
test_stemmer_provenance.py (corpus drift guard) extended to all 14, and the corpus PROVENANCE.md "Reproduced by" line corrected (it falsely claimed snowballstemmer>=3 was byte-identical).
The exact-commit vendoring + 14-language corpus + sha256 manifests mean a future stemmer move cannot pass silently.
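
A sha256 drift guard of the kind described above can be sketched as follows. This is not the content of test_snowball_provenance.py; the function names, the manifest shape (file name to hex digest), and the example file are assumptions made for illustration.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Hex sha256 digest of a file's bytes."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def manifest_mismatches(root: Path, manifest: dict[str, str]) -> list[str]:
    """Return names of files that are missing, altered, or absent from the manifest."""
    problems = []
    for name, expected in manifest.items():
        path = root / name
        if not path.is_file() or sha256_of(path) != expected:
            problems.append(name)
    # A file on disk that the manifest does not list is also drift.
    on_disk = {p.name for p in root.glob("*.py")}
    problems.extend(sorted(on_disk - set(manifest)))
    return problems


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    target = root / "example_stemmer.py"  # hypothetical vendored file
    target.write_bytes(b"# generated\n")
    manifest = {target.name: sha256_of(target)}

    clean = manifest_mismatches(root, manifest)    # nothing changed yet
    target.write_bytes(b"# hand edited\n")         # simulate a silent edit
    drifted = manifest_mismatches(root, manifest)  # now reports the file
```

A test built on this idea would assert that the mismatch list is empty, so any regeneration from the wrong revision or any hand edit changes a digest and fails CI.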

Tested

pytest tests/index/test_stemmer*.py tests/index/test_snowball_provenance.py — 65 passed (incl. 14-language byte-exact corpus parity, 14-language manifest, vendored-file drift guard, Pagefind modern-Porter2 tells).
Full suite: 777 passed. ruff check + ruff format --check clean.
Clean resolve: rm -rf .venv && uv sync --extra dev → snowballstemmer gone, parity still byte-exact.
uv build + scripts/validate-dist.py pass (wheel 770 KB ships the vendored stemmers; sdist cap bumped for the larger 14-language corpus).

Rebuild caveat

Existing Danish/Finnish/Italian/Norwegian indexes built with the old stemmer must be rebuilt: their on-disk stems changed (same caveat scolta-php documented). en/fr/de/es/ru/ca/nl/pt/ro/sv stems are unchanged.
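
For anyone scripting the rebuild, the caveat reduces to a small membership check. The helper below is a hypothetical sketch, not something scolta ships; it assumes indexes are identified by the same two-letter codes used in this description.

```python
# Languages whose on-disk stems changed, per the caveat above.
REBUILD_REQUIRED = frozenset({"da", "fi", "it", "no"})


def needs_rebuild(language: str) -> bool:
    """True if an index for this language was built with divergent stems."""
    return language.lower() in REBUILD_REQUIRED
```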

`snowballstemmer` API surprises

None. The compiler's -python backend emits clean classes; BaseStemmer.stemWord(word) is the public entry point and the only call site.

The build-time stemmer must reproduce the stems Pagefind produces from queries at runtime (the crate pagefind_stem 1.0.0 in Pagefind 1.5.0's WASM), or those queries silently miss. The binding depended on snowballstemmer>=3 (resolving to 3.1.1), but 3.1.x added apostrophe/elision handling pagefind_stem 1.0.0 does not have: measured against the 14-language corpus, 3.1.1 diverges on 2,103 words (it 1,867 / fi 183 / da 35 / no 18), so Danish/Finnish/Italian/Norwegian indexes missed every affected query. The parity test only covered 5 languages (en/fr/de/es/ru), so the divergent four were shipped but unverified.

No published snowballstemmer release is byte-exact for all 14 languages (the 3.0.x line predates the english.sbl fixes the crate has: 18 English divergences). The stemmers are now vendored from the Snowball compiler at the exact mainline commit pagefind_stem 1.0.0 was generated from (019c1bd, between v3.0.0 and v3.1.0), in src/scolta/index/snowball/, byte-exact with the crate over the full corpus (589,069 words, 0 divergences), mirroring scolta-php/src/Index/Snowball.

- Remove the snowballstemmer dependency; Stemmer maps languages onto the vendored <Algo>Stemmer classes.
- scripts/generate-stemmers.sh regenerates the vendored output; the vendored dir is excluded from ruff to stay byte-stable.
- Extend the byte-exact corpus parity gate from 5 to all 14 shipped languages and add test_snowball_provenance.py (sha256 drift guard on the vendored source), so a future stemmer move fails CI loudly.
- Carry over the 9 missing-language corpora from scolta-php verbatim; bump the sdist size cap for the larger corpus.

Existing Danish/Finnish/Italian/Norwegian indexes built with the old stemmer must be rebuilt: their on-disk stems changed.

jeremyandrews merged commit e9acf14 into main Jun 15, 2026
6 checks passed

jeremyandrews deleted the fix/stemmer-parity-14-lang branch June 15, 2026 09:13
