Vendor Snowball stemmers for byte-exact 14-language Pagefind parity #18

Merged: jeremyandrews merged 1 commit into main from fix/stemmer-parity-14-lang on Jun 15, 2026
@jeremyandrews

The bug

scolta's Stemmer maps 14 languages but the parity test only verified 5 (en/fr/de/es/ru). The other 9 — including the four where the dependency genuinely diverges from Pagefind — were mapped, shipped, and never checked.

The build-time stems must match what Pagefind stems queries with at runtime (the crate pagefind_stem 1.0.0, bundled in Pagefind 1.5.0's WASM). If they disagree, that language's index silently misses queries. The binding depended on snowballstemmer>=3, which resolves to 3.1.1, and 3.1.x added apostrophe/elision handling that pagefind_stem 1.0.0 does not have (Danish/Norwegian/Finnish apostrophe handling, Italian elision stripping).
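
The failure mode can be sketched with a toy exact-match index. This is only an illustration: the two stem functions are hypothetical stand-ins that hard-code the Danish example reported in this PR, and the dictionary index is a simplification, not Pagefind's actual index format.

```python
# Hypothetical stand-ins: neither function calls a real stemmer. The stems
# are the Danish example from this PR, hard-coded for illustration.
def build_time_stem(word: str) -> str:
    """Stand-in for snowballstemmer 3.1.1, which strips the 's suffix."""
    return {"bbc's": "bbc"}.get(word, word)


def query_time_stem(word: str) -> str:
    """Stand-in for pagefind_stem 1.0.0, which leaves the word unchanged."""
    return {"bbc's": "bbc's"}.get(word, word)


def build_index(pages: dict[str, list[str]]) -> dict[str, set[str]]:
    """Map each build-time stem to the set of pages containing it."""
    index: dict[str, set[str]] = {}
    for page, words in pages.items():
        for word in words:
            index.setdefault(build_time_stem(word), set()).add(page)
    return index


def search(index: dict[str, set[str]], query: str) -> set[str]:
    """Look the query up under its query-time stem."""
    return index.get(query_time_stem(query), set())


index = build_index({"news.html": ["bbc's"]})
hits = search(index, "bbc's")  # empty: the index key is "bbc", the lookup key is "bbc's"
```

The page contains the exact word being searched for, yet the lookup returns nothing and no error is raised, which is why the mismatch goes unnoticed without a parity test.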

Root cause, measured (not assumed)

I measured every published snowballstemmer 3.x release against the full 14-language corpus (the same byte-identical word lists that scolta-php's StemmerConcordanceTest uses):

| version | divergences vs pagefind_stem 1.0.0 | where |
|---|---|---|
| 3.0.0.1 | 18 | en only (predates the english.sbl fixes the crate has) |
| 3.0.1 | 18 | en only (same) |
| 3.1.0 | 2,103 | it 1,867 / fi 183 / da 35 / no 18 (apostrophe/elision the crate lacks) |
| 3.1.1 (was resolved) | 2,103 | it 1,867 / fi 183 / da 35 / no 18 |

So no published snowballstemmer release is byte-exact for all 14 languages. The 3.0.x line is missing the English fix; the 3.1.x line adds changes Pagefind doesn't have. The parity-correct revision, commit 019c1bd (mainline, between v3.0.0 and v3.1.0), has the English fix and lacks the apostrophe/elision changes. This is exactly the revision pagefind_stem 1.0.0 was generated from (recovered in the scolta-php stemmer work, php#214).
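
A measurement of this kind might be sketched as follows. The function name `count_divergences`, the `(language, word) -> stem` callable shape, and the lookup tables are assumptions for illustration; the three diverging entries are the examples quoted in this PR, and the English entry is a made-up placeholder that both sides agree on.

```python
from collections import Counter
from typing import Callable

# Assumed shape for illustration: a stemmer is any (language, word) -> stem callable.
StemFn = Callable[[str, str], str]


def count_divergences(
    corpus: dict[str, list[str]], candidate: StemFn, reference: StemFn
) -> Counter:
    """Count, per language, the words whose candidate stem differs from the reference."""
    counts: Counter = Counter()
    for lang, words in corpus.items():
        for word in words:
            if candidate(lang, word) != reference(lang, word):
                counts[lang] += 1
    return counts


# Recorded stems from the examples in this PR; unlisted words fall back to identity.
REFERENCE = {
    ("it", "all'abbandono"): "all'abband",
    ("da", "bbc's"): "bbc's",
    ("fi", "ajatollahin"): "ajatollahin",
}
CANDIDATE = {
    ("it", "all'abbandono"): "abband",
    ("da", "bbc's"): "bbc",
    ("fi", "ajatollahin"): "ajatollah",
}

corpus = {
    "it": ["all'abbandono"],
    "da": ["bbc's"],
    "fi": ["ajatollahin"],
    "en": ["zzz"],  # placeholder word, not from the real corpus
}
counts = count_divergences(
    corpus,
    candidate=lambda lang, w: CANDIDATE.get((lang, w), w),
    reference=lambda lang, w: REFERENCE.get((lang, w), w),
)
```

Running the same loop once per published release, with the real corpus and real stemmers in place of the lookup tables, would yield one row of the table per release.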

Example divergences (3.1.1 vs Pagefind's actual query stem):

  • it: all'abbandono → abband (3.1.1) vs all'abband (Pagefind)
  • da: bbc's → bbc (3.1.1) vs bbc's (Pagefind)
  • fi: ajatollahin → ajatollah (3.1.1) vs ajatollahin (Pagefind)

The fix — vendor, don't depend

Following scolta-php's proven path (src/Index/Snowball), the stemmers are now vendored from the Snowball compiler's -python backend at commit 019c1bd, in src/scolta/index/snowball/:

  • Byte-exact with pagefind_stem 1.0.0 over the full corpus: 589,069 words, 0 divergences, all 14 languages.
  • The snowballstemmer dependency is removed (pyproject.toml, uv.lock). Stemmer maps language codes onto the vendored <Algo>Stemmer classes and calls stemWord.
  • scripts/generate-stemmers.sh reproduces the vendored output (clone snowball @ 019c1bd, build the compiler, generate + vendor the runtime). No source transforms are needed — unlike the PHP backend, the Python backend already emits floor-compatible code, so the files are byte-identical to compiler output. The vendored dir is excluded from ruff to stay byte-stable.
  • The vendored LICENSE (Snowball BSD-3-Clause, sha-identical to scolta-php's) and a PROVENANCE.md document the commit, the version-header trap, and why a mainline commit rather than a tag.
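
The language-code mapping described above might look roughly like this. It is a minimal sketch, not scolta's actual Stemmer: the two `<Algo>Stemmer` classes here are placeholders that perform no real stemming, the `_ALGORITHMS` table and the `stem` method name are hypothetical, and raising `ValueError` for an unmapped language is an assumption.

```python
class BaseStemmer:
    """Placeholder for the vendored runtime base class."""

    def stemWord(self, word: str) -> str:
        raise NotImplementedError


class DanishStemmer(BaseStemmer):
    def stemWord(self, word: str) -> str:
        return word  # placeholder: the generated class applies the Danish algorithm


class ItalianStemmer(BaseStemmer):
    def stemWord(self, word: str) -> str:
        return word  # placeholder: the generated class applies the Italian algorithm


# Hypothetical mapping table; the real one covers all 14 shipped languages.
_ALGORITHMS: dict[str, type[BaseStemmer]] = {
    "da": DanishStemmer,
    "it": ItalianStemmer,
}


class Stemmer:
    """Resolve a language code to a vendored stemmer and delegate to stemWord."""

    def __init__(self, language: str) -> None:
        try:
            algorithm = _ALGORITHMS[language]
        except KeyError:
            # Assumed behavior: fail loudly rather than index with no stemming.
            raise ValueError(f"no vendored stemmer for language {language!r}") from None
        self._impl = algorithm()

    def stem(self, word: str) -> str:
        return self._impl.stemWord(word)


danish = Stemmer("da")
```

Keeping `stemWord` as the single call site means a swap of the vendored files changes behavior in exactly one place, which the corpus parity gate then checks.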

Regression gate (fail-before / pass-after)

  • Fail-before: the table above is the fail-before evidence — on 3.1.1 the it/da/fi/no corpus assertions report exactly 1,867 / 35 / 183 / 18 mismatches.
  • Pass-after: the byte-exact corpus parity gate now covers all 14 shipped languages (was 5), 0 mismatches.
  • test_snowball_provenance.py (new) pins every vendored file to a sha256 manifest, mirroring scolta-php's SnowballProvenanceTest — a silent regeneration (wrong revision, hand edit) fails CI.
  • test_stemmer_provenance.py (corpus drift guard) extended to all 14, and the corpus PROVENANCE.md "Reproduced by" line corrected (it falsely claimed snowballstemmer>=3 was byte-identical).
  • The exact-commit vendoring + 14-language corpus + sha256 manifests mean a future stemmer move cannot pass silently.
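
A sha256 drift guard of the kind described above can be sketched with the standard library alone. The function names, the in-memory manifest format, and the file name `danish_stemmer.py` are assumptions for illustration; the real test and manifest layout may differ.

```python
import hashlib
import tempfile
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Hex sha256 digest of a file's bytes."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def find_drift(vendored_dir: Path, manifest: dict[str, str]) -> list[str]:
    """Report files that are unlisted, missing, or whose hash differs from the manifest."""
    problems: list[str] = []
    on_disk = {p.name for p in vendored_dir.iterdir() if p.is_file()}
    for name in sorted(on_disk - set(manifest)):
        problems.append(f"unlisted file: {name}")
    for name, expected in sorted(manifest.items()):
        path = vendored_dir / name
        if not path.is_file():
            problems.append(f"missing file: {name}")
        elif sha256_of(path) != expected:
            problems.append(f"hash mismatch: {name}")
    return problems


# Demonstration in a throwaway directory with a hypothetical file name.
with tempfile.TemporaryDirectory() as tmp:
    vendored = Path(tmp)
    target = vendored / "danish_stemmer.py"
    target.write_text("# generated\n")
    manifest = {target.name: sha256_of(target)}

    clean = find_drift(vendored, manifest)    # no problems while the file is untouched
    target.write_text("# hand edited\n")
    drifted = find_drift(vendored, manifest)  # the edit changes the digest
```

A test that asserts `find_drift(...) == []` fails on any regeneration from a different revision as well as on a hand edit, since both change the file bytes.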

Tested

  • pytest tests/index/test_stemmer*.py tests/index/test_snowball_provenance.py — 65 passed (incl. 14-language byte-exact corpus parity, 14-language manifest, vendored-file drift guard, Pagefind modern-Porter2 tells).
  • Full suite: 777 passed. ruff check + ruff format --check clean.
  • Clean resolve: rm -rf .venv && uv sync --extra dev → snowballstemmer gone, parity still byte-exact.
  • uv build + scripts/validate-dist.py pass (wheel 770 KB ships the vendored stemmers; sdist cap bumped for the larger 14-language corpus).

Rebuild caveat

Existing Danish/Finnish/Italian/Norwegian indexes built with the old stemmer must be rebuilt, because their on-disk stems changed (same caveat scolta-php documented). en/fr/de/es/ru/ca/nl/pt/ro/sv stems are unchanged.

snowballstemmer API surprises

None. The compiler's -python backend emits clean classes; BaseStemmer.stemWord(word) is the public entry point and the only call site.

The build-time stemmer must reproduce the stems Pagefind produces from
queries at runtime (the crate pagefind_stem 1.0.0 in Pagefind 1.5.0's WASM),
or those queries silently miss. The binding depended on snowballstemmer>=3
(resolving to 3.1.1), but 3.1.x added apostrophe/elision handling pagefind_stem
1.0.0 does not have: measured against the 14-language corpus, 3.1.1 diverges on
2,103 words (it 1,867 / fi 183 / da 35 / no 18), so Danish/Finnish/Italian/
Norwegian indexes missed every affected query. The parity test only covered 5
languages (en/fr/de/es/ru), so the divergent four were shipped but unverified.

No published snowballstemmer release is byte-exact for all 14 languages (the
3.0.x line predates the english.sbl fixes the crate has: 18 English
divergences). The stemmers are now vendored from the Snowball compiler at the
exact mainline commit pagefind_stem 1.0.0 was generated from (019c1bd, between
v3.0.0 and v3.1.0), in src/scolta/index/snowball/ — byte-exact with the crate
over the full corpus (589,069 words, 0 divergences), mirroring
scolta-php/src/Index/Snowball.

- Remove the snowballstemmer dependency; Stemmer maps languages onto the
  vendored <Algo>Stemmer classes.
- scripts/generate-stemmers.sh regenerates the vendored output; the vendored
  dir is excluded from ruff to stay byte-stable.
- Extend the byte-exact corpus parity gate from 5 to all 14 shipped languages
  and add test_snowball_provenance.py (sha256 drift guard on the vendored
  source), so a future stemmer move fails CI loudly.
- Carry over the 9 missing-language corpora from scolta-php verbatim;
  bump the sdist size cap for the larger corpus.

Existing Danish/Finnish/Italian/Norwegian indexes built with the old stemmer
must be rebuilt — their on-disk stems changed.

jeremyandrews merged commit e9acf14 into main on Jun 15, 2026
6 checks passed
jeremyandrews deleted the fix/stemmer-parity-14-lang branch on June 15, 2026 09:13