Vendor Snowball stemmers for byte-exact 14-language Pagefind parity #18
Merged
The build-time stemmer must reproduce the stems Pagefind produces from queries at runtime (the crate pagefind_stem 1.0.0 in Pagefind 1.5.0's WASM), or those queries silently miss. The binding depended on snowballstemmer>=3 (resolving to 3.1.1), but 3.1.x added apostrophe/elision handling that pagefind_stem 1.0.0 does not have: measured against the 14-language corpus, 3.1.1 diverges on 2,103 words (it 1,867 / fi 183 / da 35 / no 18), so Danish/Finnish/Italian/Norwegian indexes missed every affected query. The parity test only covered 5 languages (en/fr/de/es/ru), so the divergent four were shipped but unverified.

No published snowballstemmer release is byte-exact for all 14 languages (the 3.0.x line predates the english.sbl fixes the crate has: 18 English divergences). The stemmers are now vendored from the Snowball compiler at the exact mainline commit pagefind_stem 1.0.0 was generated from (019c1bd, between v3.0.0 and v3.1.0), in src/scolta/index/snowball/. They are byte-exact with the crate over the full corpus (589,069 words, 0 divergences), mirroring scolta-php/src/Index/Snowball.

- Remove the snowballstemmer dependency; Stemmer maps languages onto the vendored <Algo>Stemmer classes.
- scripts/generate-stemmers.sh regenerates the vendored output; the vendored dir is excluded from ruff to stay byte-stable.
- Extend the byte-exact corpus parity gate from 5 to all 14 shipped languages and add test_snowball_provenance.py (sha256 drift guard on the vendored source), so a future stemmer move fails CI loudly.
- Carry over the 9 missing-language corpora from scolta-php verbatim; bump the sdist size cap for the larger corpus.

Existing Danish/Finnish/Italian/Norwegian indexes built with the old stemmer must be rebuilt: their on-disk stems changed.
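The parity measurement described in this commit message can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the project's actual test: `count_divergences`, `recorded_candidate`, and the tiny in-memory corpus are hypothetical names, and the candidate stemmer is replaced by a lookup of recorded outputs. The only real data point is the Italian example quoted in the PR; the English entry is a made-up placeholder that agrees on both sides.

```python
from collections import Counter
from typing import Callable, Mapping


def count_divergences(
    corpus: Mapping[str, Mapping[str, str]],
    candidate_stem: Callable[[str, str], str],
) -> Counter:
    """Count, per language, the words whose candidate stem differs from the reference.

    corpus maps language code -> {word: reference stem}. The reference stems are
    assumed to come from the stemmer used at query time.
    """
    divergences: Counter = Counter()
    for lang, expected in corpus.items():
        for word, reference in expected.items():
            # Compare the UTF-8 bytes so the check is byte-exact rather than
            # equal only after some normalization.
            if candidate_stem(lang, word).encode("utf-8") != reference.encode("utf-8"):
                divergences[lang] += 1
    return divergences


# Reference stems (hypothetical corpus; real corpora are far larger).
corpus = {
    "it": {"all'abbandono": "all'abband"},  # Pagefind's stem, as quoted in the PR
    "en": {"placeholder": "placeholder"},   # made-up entry where both sides agree
}

# Recorded outputs standing in for the candidate stemmer (3.1.1 in the PR).
recorded_outputs = {"all'abbandono": "abband", "placeholder": "placeholder"}


def recorded_candidate(lang: str, word: str) -> str:
    """Hypothetical stand-in: replay recorded stems instead of running a stemmer."""
    return recorded_outputs[word]


result = count_divergences(corpus, recorded_candidate)
print(dict(result))
```

A gate built this way would fail whenever any language's count is non-zero, which is the property the PR reports as 0 divergences over 589,069 words.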
The bug

scolta's Stemmer maps 14 languages but the parity test only verified 5 (en/fr/de/es/ru). The other 9, including the four where the dependency genuinely diverges from Pagefind, were mapped, shipped, and never checked.

The build-time stems must match what Pagefind stems queries with at runtime (the crate pagefind_stem 1.0.0, bundled in Pagefind 1.5.0's WASM). If they disagree, that language's index silently misses queries. The binding depended on snowballstemmer>=3, which resolves to 3.1.1, and 3.1.x added apostrophe/elision handling that pagefind_stem 1.0.0 does not have (Danish/Norwegian/Finnish apostrophe handling, Italian elision stripping).

Root cause, measured (not assumed)
I measured every published snowballstemmer 3.x release against the full 14-language corpus (the same corpus, with byte-identical word lists, that scolta-php's StemmerConcordanceTest uses):

[Table: divergences from pagefind_stem 1.0.0 per snowballstemmer release. Only a fragment of one note survives: "... english.sbl fixes the crate has)".]

So no published snowballstemmer release is byte-exact for all 14 languages. The 3.0.x line is missing the English fix; the 3.1.x line adds changes Pagefind doesn't have. The parity-correct revision, commit 019c1bd (mainline, between v3.0.0 and v3.1.0), has the English fix and lacks the apostrophe/elision changes. This is exactly the revision pagefind_stem 1.0.0 was generated from (recovered in the scolta-php stemmer work, php#214).

Example divergences (3.1.1 vs Pagefind's actual query stem):

- all'abbandono → abband (3.1.1) vs all'abband (Pagefind)
- bbc's → bbc (3.1.1) vs bbc's (Pagefind)
- ajatollahin → ajatollah (3.1.1) vs ajatollahin (Pagefind)

The fix: vendor, don't depend
Following scolta-php's proven path (src/Index/Snowball), the stemmers are now vendored from the Snowball compiler's -python backend at commit 019c1bd, in src/scolta/index/snowball/:

- Byte-exact with pagefind_stem 1.0.0 over the full corpus: 589,069 words, 0 divergences, all 14 languages.
- The snowballstemmer dependency is removed (pyproject.toml, uv.lock). Stemmer maps language codes onto the vendored <Algo>Stemmer classes and calls stemWord.
- scripts/generate-stemmers.sh reproduces the vendored output (clone snowball @ 019c1bd, build the compiler, generate + vendor the runtime). No source transforms are needed: unlike the PHP backend, the Python backend already emits floor-compatible code, so the files are byte-identical to compiler output. The vendored dir is excluded from ruff to stay byte-stable.
- LICENSE (Snowball BSD-3-Clause, sha-identical to scolta-php's) and a PROVENANCE.md document the commit, the version-header trap, and why a mainline commit rather than a tag.

Regression gate (fail-before / pass-after)
- test_snowball_provenance.py (new) pins every vendored file to a sha256 manifest, mirroring scolta-php's SnowballProvenanceTest: a silent regeneration (wrong revision, hand edit) fails CI.
- test_stemmer_provenance.py (corpus drift guard) is extended to all 14 languages, and the corpus PROVENANCE.md "Reproduced by" line is corrected (it falsely claimed snowballstemmer>=3 was byte-identical).

Tested
- pytest tests/index/test_stemmer*.py tests/index/test_snowball_provenance.py: 65 passed (incl. 14-language byte-exact corpus parity, 14-language manifest, vendored-file drift guard, Pagefind modern-Porter2 tells).
- ruff check + ruff format --check clean.
- rm -rf .venv && uv sync --extra dev → snowballstemmer gone, parity still byte-exact.
- uv build + scripts/validate-dist.py pass (wheel 770 KB ships the vendored stemmers; sdist cap bumped for the larger 14-language corpus).

Rebuild caveat
Existing Danish/Finnish/Italian/Norwegian indexes built with the old stemmer must be rebuilt, because their on-disk stems changed (the same caveat scolta-php documented). en/fr/de/es/ru/ca/nl/pt/ro/sv stems are unchanged.
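For anyone scripting the migration, the caveat can be expressed as a small lookup. This is only a sketch: `needs_rebuild` and the two constants are hypothetical names, not part of scolta, and the language codes are taken from the lists in this PR.

```python
# Languages whose on-disk stems changed with the vendored stemmers.
REBUILD_REQUIRED = frozenset({"da", "fi", "it", "no"})

# Languages the PR reports as unchanged.
UNCHANGED = frozenset({"en", "fr", "de", "es", "ru", "ca", "nl", "pt", "ro", "sv"})


def needs_rebuild(language: str) -> bool:
    """Return True if an index built with the old stemmer must be rebuilt."""
    code = language.lower()
    if code in REBUILD_REQUIRED:
        return True
    if code in UNCHANGED:
        return False
    # Anything else is outside the 14 shipped languages.
    raise ValueError(f"unknown language code: {language!r}")
```

For example, `needs_rebuild("fi")` is true while `needs_rebuild("sv")` is false; an unrecognized code raises rather than silently skipping a rebuild.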
snowballstemmer API surprises

None. The compiler's -python backend emits clean classes; BaseStemmer.stemWord(word) is the public entry point and the only call site.
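A wrapper with a single stemWord call site might look roughly like the sketch below. It is an assumption-laden illustration, not scolta's code: the PR does not show the real Stemmer's constructor or method names, so `Stemmer.stem`, `_ALGORITHMS`, and the two stub classes are hypothetical. The stubs stand in for the vendored <Algo>Stemmer classes so the example runs on its own; they do not implement any Snowball algorithm.

```python
from typing import Dict, Type


class BaseStemmer:
    """Stand-in for the vendored base class; only stemWord is modeled."""

    def stemWord(self, word: str) -> str:
        raise NotImplementedError


class EnglishStemmer(BaseStemmer):
    def stemWord(self, word: str) -> str:
        # Placeholder behavior only, NOT the Snowball English algorithm.
        return word.lower()


class ItalianStemmer(BaseStemmer):
    def stemWord(self, word: str) -> str:
        # Placeholder behavior only, NOT the Snowball Italian algorithm.
        return word.lower()


# Hypothetical language-code -> class table; the real one covers 14 languages.
_ALGORITHMS: Dict[str, Type[BaseStemmer]] = {
    "en": EnglishStemmer,
    "it": ItalianStemmer,
}


class Stemmer:
    """Maps a language code onto a stemmer class and delegates to stemWord."""

    def __init__(self, language: str) -> None:
        try:
            self._impl = _ALGORITHMS[language]()
        except KeyError:
            raise ValueError(f"unsupported language: {language!r}") from None

    def stem(self, word: str) -> str:
        # The single call site into the vendored code.
        return self._impl.stemWord(word)
```

Keeping one call site means a future regeneration of the vendored files only has to preserve the stemWord signature for the wrapper to keep working.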