tag1consulting · jeremyandrews · Jun 15, 2026
diff --git a/CHANGELOG.md b/CHANGELOG.md
@@ -5,6 +5,7 @@ All notable changes to scolta-python are documented here.
 ## [Unreleased]
 
 ### Fixed
+- **The stemmer produced query-mismatched stems for Danish, Finnish, Italian, and Norwegian indexes (`src/scolta/index/stemmer.py`).** The build-time stems must match what Pagefind 1.5.0 stems *queries* with at runtime (the crate `pagefind_stem` 1.0.0); a stem Pagefind cannot reproduce at query time is a word nobody can find. The binding depended on `snowballstemmer>=3`, which resolved to 3.1.1, but 3.1.x added apostrophe/elision handling that `pagefind_stem` 1.0.0 does **not** have (Danish/Norwegian/Finnish apostrophe handling, Italian elision stripping). Measured against the 14-language corpus, 3.1.1 diverges from Pagefind on **2,103 words** (it 1,867 / fi 183 / da 35 / no 18), so those indexes silently missed every affected query. The bug went unseen because the parity test only covered 5 languages (en/fr/de/es/ru) — the four divergent ones were mapped and shipped but never verified. No published `snowballstemmer` release reproduces `pagefind_stem` 1.0.0 across all 14 languages either: the 3.0.x line predates the `english.sbl` fixes the crate has (18 English divergences). The stemmers are now **vendored** from the Snowball compiler at the exact mainline commit `pagefind_stem` 1.0.0 was generated from (`019c1bd`, between v3.0.0 and v3.1.0), in `src/scolta/index/snowball/` — matching the crate byte-for-byte over the full corpus (589,069 words, 0 divergences) and mirroring `scolta-php/src/Index/Snowball`. The `snowballstemmer` dependency is removed; a sha256 drift guard (`test_snowball_provenance.py`) pins the vendored source, and the byte-exact corpus parity gate now covers all 14 shipped languages (was 5), so a future stemmer move fails CI loudly. **Existing Danish/Finnish/Italian/Norwegian indexes built with the old stemmer must be rebuilt** — their on-disk stems changed (same caveat scolta-php documented).
 - **Auto-provisioned Amazee credentials stored without resolved model names no longer leave AI permanently broken (`src/scolta/ai/amazee/auto_provisioner.py`).** Provisioning persists credentials and resolves model names as two non-atomic steps (`AmazeeTrialProvisioner.provision()` stores the token+url, then calls `/model/info`). When the model-info call fails, `get_available_models()` swallows the error and returns `[]`, so the `on_models_resolved` gate never fires and no model name is persisted — but `ConfigStorage.load()` requires only token+url, so it reports the half-provisioned credentials as valid. `ensure_ai_available()` then short-circuited on stored credentials on every later request and never re-resolved, so the caller fell back to the dated config default (`claude-sonnet-4-5-20250929`) which the Amazee LiteLLM gateway rejects with HTTP 400 "Invalid model name" — failing AI silently with no self-recovery (outside `KeyExpiryRecovery`'s auth-only remit). `ensure_ai_available()` now accepts an optional `has_resolved_models` predicate: when stored credentials exist but the caller reports models are still unresolved, model resolution is re-attempted against the **already-stored key** (never a fresh trial, which would waste a server-limited allocation) and `on_models_resolved` fires with the result, so the incomplete-provision state self-heals on the next lazy-init pass. Without the predicate the historical no-op is unchanged. A regression test drives the full provision → failed-resolution → store → re-resolve sequence. (The dated-default fallback itself lives in the consuming adapter/demo client construction, which adopts the predicate when it re-vendors.)
 
 ### Added

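The self-healing flow described in the Amazee entry above can be sketched roughly as follows. This is a minimal illustration, not the code in `auto_provisioner.py`: `StoredCredentials`, the callable parameters, and the exact signature of `ensure_ai_available()` are assumptions made for the example; only the `has_resolved_models` and `on_models_resolved` names come from the changelog.

```python
from __future__ import annotations

from collections.abc import Callable
from dataclasses import dataclass


@dataclass
class StoredCredentials:
    """Hypothetical stand-in for the persisted token + url pair."""

    token: str
    url: str


def ensure_ai_available(
    load_credentials: Callable[[], StoredCredentials | None],
    provision_trial: Callable[[], StoredCredentials],
    resolve_models: Callable[[StoredCredentials], list[str]],
    on_models_resolved: Callable[[list[str]], None],
    has_resolved_models: Callable[[], bool] | None = None,
) -> StoredCredentials:
    """Sketch of the lazy-init pass (assumed signature, not the real API)."""
    creds = load_credentials()
    if creds is not None:
        # Without the predicate, stored credentials short-circuit as before.
        if has_resolved_models is None or has_resolved_models():
            return creds
        # Half-provisioned state: retry resolution with the stored key.
        # A fresh trial is never requested here.
        models = resolve_models(creds)
        if models:
            on_models_resolved(models)
        return creds

    # No stored credentials at all: provision, then resolve.
    creds = provision_trial()
    models = resolve_models(creds)
    if models:
        on_models_resolved(models)
    return creds
```

The point of passing a predicate rather than a flag is that the caller owns the "are models resolved" state, so the provisioner does not need to know where model names are stored.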
diff --git a/parity/README.md b/parity/README.md
@@ -34,9 +34,10 @@ Notes:
   so its chunk partitioning is algorithm-dependent; per-word postings are
   identical either way, which is why the recipe gate is structural.
 - `wamania/php-stemmer` diverges from canonical Snowball on a few words
-  (`adding`→`ad`, `paste`→`past`); Python (snowballstemmer) follows the
-  canonical reference + rust-stemmers, so it is the correct side. This is
-  asserted explicitly in `test_format_parity.py`.
+  (`adding`→`ad`, `paste`→`past`); Python (the vendored Snowball stemmers in
+  `src/scolta/index/snowball/`, generated from the same commit as `pagefind_stem`
+  1.0.0) follows the canonical reference + rust-stemmers, so it is the
+  correct side. This is asserted explicitly in `test_format_parity.py`.
 
 Run (deprecation notices from a vendor lib on PHP 8.5 are harmless):
 

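The kind of word-by-word comparison the parity notes rely on can be sketched like this. It is an illustration only: `find_divergences` and both toy stemmers are hypothetical, and the toy stemmers do not reproduce real Snowball output. They only mimic the shape of an apostrophe/elision disagreement between the build side and the query side.

```python
from collections.abc import Callable, Iterable


def find_divergences(
    words: Iterable[str],
    build_stem: Callable[[str], str],
    query_stem: Callable[[str], str],
) -> list[tuple[str, str, str]]:
    """Return (word, build-side stem, query-side stem) for each disagreement."""
    divergences = []
    for word in words:
        built, queried = build_stem(word), query_stem(word)
        if built != queried:
            divergences.append((word, built, queried))
    return divergences


# Toy stand-ins, NOT real Snowball behaviour: one side drops an elided
# article before an apostrophe, the other leaves the word untouched.
def toy_build_stem(word: str) -> str:
    return word.split("'", 1)[1] if "'" in word else word


def toy_query_stem(word: str) -> str:
    return word


if __name__ == "__main__":
    corpus = ["casa", "l'albero"]
    for row in find_divergences(corpus, toy_build_stem, toy_query_stem):
        print(row)
```

A parity gate in this style passes only when the returned list is empty for every language's corpus.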
diff --git a/pyproject.toml b/pyproject.toml
@@ -18,15 +18,15 @@ keywords = ["search", "ai", "pagefind", "wasm", "scoring"]
 dependencies = [
     "httpx>=0.27",
     "selectolax>=0.3",
-    # Must stay on the modern (Snowball >=3.0, 2024-revised) English algorithm:
-    # Pagefind 1.5.0 stems queries at runtime with pagefind_stem 1.0.0 (published
-    # 2026-03-23, post-Snowball-3.0), which emits the revised stems (added->add,
-    # organic->organic, geologist->geolog, organize->organiz). snowballstemmer 3.x
-    # reproduces that crate's output byte-for-byte over the full corpus; the
-    # pre-3.0 2.x line diverges (added->ad, ...) and would silently miss those
-    # queries against a 1.5.0 index. The >=3 floor enforces it; the stemmer
-    # parity tests guard it. See tests/fixtures/stemmer-corpus/PROVENANCE.md.
-    "snowballstemmer>=3",
+    # NOTE: the Snowball stemmers are VENDORED (src/scolta/index/snowball/), not
+    # a dependency. Pagefind 1.5.0 stems queries at runtime with pagefind_stem
+    # 1.0.0, and no published snowballstemmer release reproduces that crate
+    # byte-for-byte across all 14 languages (3.0.x predates the english.sbl fix:
+    # 18 EN divergences; 3.1.x postdates apostrophe/elision changes the crate
+    # lacks: 2,103 da/fi/it/no divergences). The vendored stemmers are generated
+    # from the exact compiler commit the crate was built from and match it
+    # byte-for-byte over the full 14-language corpus; the stemmer parity tests
+    # guard it. See src/scolta/index/snowball/PROVENANCE.md.
     "regex>=2024.0",
     "msgpack>=1.0",
     "psutil>=5.9",
@@ -83,6 +83,10 @@ markers = [
 [tool.ruff]
 line-length = 100
 target-version = "py310"
+# The vendored Snowball stemmers are generated code that must stay byte-identical
+# to the Snowball compiler output (the sha256 manifest in their PROVENANCE.md and
+# the stemmer provenance test pin them). Never let lint/format rewrite them.
+extend-exclude = ["src/scolta/index/snowball"]
 
 [tool.ruff.lint]
 select = ["E", "F", "I", "UP", "B", "C4", "SIM", "RET", "RUF"]

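A sha256 drift guard of the kind the ruff comment refers to could look roughly like the sketch below. It assumes the manifest has already been parsed into a `{filename: hex digest}` dict; `find_drift` and that dict shape are hypothetical, and the real provenance test may be structured differently.

```python
import hashlib
from pathlib import Path


def sha256_of(path: Path) -> str:
    """Hex sha256 of a file's raw bytes."""
    return hashlib.sha256(path.read_bytes()).hexdigest()


def find_drift(directory: Path, manifest: dict[str, str]) -> dict[str, str]:
    """Return {filename: reason} for every file that disagrees with the manifest.

    `manifest` maps file names to expected hex digests (assumed shape).
    """
    drift: dict[str, str] = {}
    for name, expected in manifest.items():
        path = directory / name
        if not path.is_file():
            drift[name] = "missing"
        elif sha256_of(path) != expected:
            drift[name] = "modified"
    # A generated .py file that the manifest does not list is also drift.
    for path in sorted(directory.glob("*.py")):
        if path.name not in manifest:
            drift[path.name] = "unlisted"
    return drift
```

Because any formatter rewrite changes the bytes, a check like this fails as soon as lint or format tooling touches the directory, which is why the exclusion above matters.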
diff --git a/scripts/generate-stemmers.sh b/scripts/generate-stemmers.sh
@@ -0,0 +1,120 @@
+#!/bin/bash
+set -euo pipefail
+
+# Regenerate the vendored pure-Python Snowball stemmers in
+# src/scolta/index/snowball/.
+#
+# The stemmers are generated by the Snowball compiler's `-python` backend at
+# the pinned commit below and vendored verbatim; no source transforms are
+# needed (unlike the PHP port, the Python backend's output already runs
+# unmodified on the Python floor). Output must stay byte-identical to the sha256
+# manifest in src/scolta/index/snowball/PROVENANCE.md; the stemmer provenance
+# test fails until the manifest is re-baselined, so regeneration is always
+# explicit.
+#
+# WARNING — the pin is a mainline COMMIT, not a release tag, and that is
+# deliberate: it is the exact snowball revision pagefind_stem 1.0.0 (the crate
+# inside Pagefind 1.5.0's query WASM) was generated from, verified by
+# byte-comparing the compiler output against the crate's vendored sources.
+# Neither adjacent tag is parity-correct, and — critically — NO published
+# `snowballstemmer` PyPI release reproduces it either:
+#   - v3.0.0 / 3.0.1 diverge on 18 English words (internal, interval,
+#     interfered, skis, ...) — they predate the english.sbl fixes the crate
+#     has.
+#   - v3.1.0 / 3.1.1 diverge on da/fi/it/no (and de apostrophe words): snowball
+#     3.1.x added apostrophe/elision handling Pagefind 1.5.0 does not have
+#     (2,103 divergent words across the committed corpora).
+# Both were measured against this package's 14-language corpus. That is why we
+# vendor the compiler output at this commit instead of depending on the
+# `snowballstemmer` package.
+#
+# Also: the "Generated from X.sbl by Snowball <version>" header in each file is
+# the compiler's own version string, not proof of the algorithm state — this
+# commit self-reports "3.0.0". Trust the commit hash, never the header.
+#
+# Usage: ./scripts/generate-stemmers.sh
+# Requires a C compiler + make (to build the Snowball compiler). Run manually,
+# commit the output. Not run during CI.
+
+SNOWBALL_REPO="https://github.com/snowballstem/snowball.git"
+# Mainline commit between v3.0.0 and v3.1.0: "Support -eprefix for all target
+# languages". See WARNING above for why this exact revision.
+SNOWBALL_COMMIT="019c1bdd5fa3925257f30b225a2c4eba49968fe8"
+
+SCRIPT_DIR="$(cd "$(dirname "$0")" && pwd)"
+PACKAGE_DIR="$(dirname "$SCRIPT_DIR")"
+OUT_DIR="${PACKAGE_DIR}/src/scolta/index/snowball"
+
+# Snowball algorithm names, one per shipped language code. This is exactly the
+# set Pagefind 1.5.0 enables (pagefind_web/Cargo.toml feature map): note
+# nl => dutch is the MODERN dutch algorithm, not dutch_porter.
+ALGORITHMS=(
+    catalan danish german english spanish finnish french
+    italian dutch norwegian portuguese romanian russian swedish
+)
+
+WORK_DIR="$(mktemp -d)"
+trap 'rm -rf "${WORK_DIR}"' EXIT
+
+echo "Cloning snowball at ${SNOWBALL_COMMIT}..."
+git clone --quiet "${SNOWBALL_REPO}" "${WORK_DIR}/snowball"
+git -C "${WORK_DIR}/snowball" checkout --quiet "${SNOWBALL_COMMIT}"
+
+ACTUAL_COMMIT="$(git -C "${WORK_DIR}/snowball" rev-parse HEAD)"
+if [ "${ACTUAL_COMMIT}" != "${SNOWBALL_COMMIT}" ]; then
+    echo "ERROR: checkout resolved to ${ACTUAL_COMMIT}, expected ${SNOWBALL_COMMIT}." >&2
+    exit 1
+fi
+
+echo "Building the Snowball compiler..."
+make -s -C "${WORK_DIR}/snowball" snowball >/dev/null
+
+mkdir -p "${OUT_DIR}"
+rm -f "${OUT_DIR}"/*.py "${OUT_DIR}/LICENSE"
+
+echo "Vendoring the runtime (base stemmer + Among)..."
+cp "${WORK_DIR}/snowball/python/snowballstemmer/basestemmer.py" "${OUT_DIR}/basestemmer.py"
+cp "${WORK_DIR}/snowball/python/snowballstemmer/among.py" "${OUT_DIR}/among.py"
+
+for algorithm in "${ALGORITHMS[@]}"; do
+    echo "Generating ${algorithm}_stemmer.py..."
+    # The same flags the snowball GNUmakefile uses for its own Python package
+    # build: -python backend, -eprefix _ (the stem entry point becomes _stem()).
+    "${WORK_DIR}/snowball/snowball" \
+        "${WORK_DIR}/snowball/algorithms/${algorithm}.sbl" \
+        -python -eprefix _ \
+        -o "${OUT_DIR}/${algorithm}_stemmer.py"
+done
+
+echo "Writing the package marker..."
+cat > "${OUT_DIR}/__init__.py" <<'PYEOF'
+"""Vendored pure-Python Snowball stemmers — generated code, do not edit.
+
+Produced by the Snowball compiler's ``-python`` backend at the commit pinned in
+PROVENANCE.md (the revision Pagefind 1.5.0's ``pagefind_stem`` 1.0.0 query WASM
+was generated from). Regenerate with ``scripts/generate-stemmers.sh``; the
+stemmer provenance test pins every file to the sha256 manifest in PROVENANCE.md.
+
+``scolta.index.stemmer.Stemmer`` maps language codes onto the ``<Algo>Stemmer``
+classes here and calls their ``stemWord`` method.
+"""
+PYEOF
+
+echo "Copying the Snowball license..."
+cp "${WORK_DIR}/snowball/COPYING" "${OUT_DIR}/LICENSE"
+
+# Guard: the floor is Python 3.10. The -python backend targets old Pythons, so
+# this should never fire, but fail closed if a future revision starts emitting
+# match/case (valid on 3.10, as is the walrus operator, but unexpected here).
+echo "Checking for match/case statements in the output..."
+if grep -nE '^\s*match .*:\s*$|^\s*case .*:\s*$' "${OUT_DIR}"/*.py; then
+    echo "ERROR: match/case found (valid on 3.10, but unexpected from this backend); review." >&2
+    exit 1
+fi
+
+echo
+echo "sha256 manifest (update PROVENANCE.md if these changed):"
+(cd "${OUT_DIR}" && shasum -a 256 ./*.py LICENSE | sed 's|\./||')
+echo
+echo "Done. Generated from snowball commit ${ACTUAL_COMMIT}."
+echo "Now run the parity tests: uv run pytest tests/index/test_stemmer.py tests/index/test_stemmer_provenance.py -q"
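How a language code might be mapped onto the generated files can be sketched as follows. The module naming (`<algorithm>_stemmer.py`) and the `<Algo>Stemmer` class pattern come from the script and its `__init__.py` docstring; the code-to-algorithm table is an assumption based on ISO 639-1 codes (the script itself only spells out `nl => dutch`), and `stemmer_location` is a hypothetical helper, not part of `scolta.index.stemmer`.

```python
# Assumed pairing of language codes to the 14 algorithms in ALGORITHMS.
LANGUAGE_TO_ALGORITHM = {
    "ca": "catalan",
    "da": "danish",
    "de": "german",
    "en": "english",
    "es": "spanish",
    "fi": "finnish",
    "fr": "french",
    "it": "italian",
    "nl": "dutch",  # the modern dutch algorithm, not dutch_porter
    "no": "norwegian",
    "pt": "portuguese",
    "ro": "romanian",
    "ru": "russian",
    "sv": "swedish",
}


def stemmer_location(language: str) -> tuple[str, str]:
    """Return (module name, class name) for a language code.

    Raises KeyError for a code outside the shipped set.
    """
    algorithm = LANGUAGE_TO_ALGORITHM[language]
    return f"{algorithm}_stemmer", f"{algorithm.capitalize()}Stemmer"


if __name__ == "__main__":
    for code in sorted(LANGUAGE_TO_ALGORITHM):
        print(code, *stemmer_location(code))
```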
diff --git a/scripts/validate-dist.py b/scripts/validate-dist.py
@@ -30,16 +30,19 @@
 from pathlib import Path
 
 # --- size caps (shared pattern: ~2x the measured good artifact) --------------
-# Measured 2026-06-14 against a clean `uv build` on main + this PR:
-#   wheel  = 729_525 bytes (~712 KiB; dominated by scolta_core_bg.wasm ~1.2 MB
-#            uncompressed, the vendored js, and pagefind .pagefind blobs)
-#   sdist  = 2_349_924 bytes (~2.24 MiB; src + the full ported test corpus and
-#            stemmer fixtures, with node_modules/target excluded)
+# Measured 2026-06-15 against a clean `uv build` on main + this PR:
+#   wheel  = 770_175 bytes (~752 KiB; dominated by scolta_core_bg.wasm ~1.2 MB
+#            uncompressed, the vendored js, the pagefind .pagefind blobs, and now
+#            the ~250 KiB of vendored Snowball stemmers)
+#   sdist  = 4_320_309 bytes (~4.12 MiB; src + the full ported test corpus and
+#            the 14-language stemmer fixtures, with node_modules/target excluded.
+#            Grew from ~2.24 MiB when the stemmer corpus went from 5 to all 14
+#            shipped languages — ~2 MiB of additional words/stems text.)
 # Caps are ~2x those measured values, leaving headroom for asset growth while
 # still catching a node_modules/target/cruft regression an order of magnitude
 # bigger.
-WHEEL_MAX_BYTES = 1_500_000  # ~2x of 729_525
-SDIST_MAX_BYTES = 4_700_000  # ~2x of 2_349_924
+WHEEL_MAX_BYTES = 1_500_000  # ~2x of 770_175
+SDIST_MAX_BYTES = 8_700_000  # ~2x of 4_320_309
 
 # --- vendored runtime assets that MUST be in the wheel -----------------------
 # Enumerated from `scripts/vendor_assets.py` (_SUBDIRS x allowed extensions) and

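The cap arithmetic in the comment above can be checked with a short sketch. The byte counts are the ones quoted in the diff; `headroom` and `size_error` are hypothetical helpers written for this illustration and are not taken from `validate-dist.py`.

```python
from typing import Optional

WHEEL_MAX_BYTES = 1_500_000
SDIST_MAX_BYTES = 8_700_000

# Sizes quoted in the comment block (measured 2026-06-15).
MEASURED_WHEEL_BYTES = 770_175
MEASURED_SDIST_BYTES = 4_320_309


def headroom(cap_bytes: int, measured_bytes: int) -> float:
    """Cap expressed as a multiple of the measured artifact size."""
    return round(cap_bytes / measured_bytes, 2)


def size_error(name: str, size_bytes: int, cap_bytes: int) -> Optional[str]:
    """Return an error message when an artifact exceeds its cap, else None."""
    if size_bytes > cap_bytes:
        return f"{name}: {size_bytes} bytes exceeds cap of {cap_bytes} bytes"
    return None


if __name__ == "__main__":
    print("wheel headroom:", headroom(WHEEL_MAX_BYTES, MEASURED_WHEEL_BYTES))
    print("sdist headroom:", headroom(SDIST_MAX_BYTES, MEASURED_SDIST_BYTES))
```

With the quoted numbers the wheel cap sits a little under 2x and the sdist cap a little over, consistent with the "~2x" wording.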
diff --git a/src/scolta/index/snowball/LICENSE b/src/scolta/index/snowball/LICENSE
@@ -0,0 +1,29 @@
+Copyright (c) 2001, Dr Martin Porter
+Copyright (c) 2004,2005, Richard Boulton
+Copyright (c) 2013, Yoshiki Shibukawa
+Copyright (c) 2006-2025, Olly Betts
+All rights reserved.
+
+Redistribution and use in source and binary forms, with or without
+modification, are permitted provided that the following conditions
+are met:
+
+  1. Redistributions of source code must retain the above copyright notice,
+     this list of conditions and the following disclaimer.
+  2. Redistributions in binary form must reproduce the above copyright notice,
+     this list of conditions and the following disclaimer in the documentation
+     and/or other materials provided with the distribution.
+  3. Neither the name of the Snowball project nor the names of its contributors
+     may be used to endorse or promote products derived from this software
+     without specific prior written permission.
+
+THIS SOFTWARE IS PROVIDED BY THE COPYRIGHT HOLDERS AND CONTRIBUTORS "AS IS" AND
+ANY EXPRESS OR IMPLIED WARRANTIES, INCLUDING, BUT NOT LIMITED TO, THE IMPLIED
+WARRANTIES OF MERCHANTABILITY AND FITNESS FOR A PARTICULAR PURPOSE ARE
+DISCLAIMED. IN NO EVENT SHALL THE COPYRIGHT OWNER OR CONTRIBUTORS BE LIABLE FOR
+ANY DIRECT, INDIRECT, INCIDENTAL, SPECIAL, EXEMPLARY, OR CONSEQUENTIAL DAMAGES
+(INCLUDING, BUT NOT LIMITED TO, PROCUREMENT OF SUBSTITUTE GOODS OR SERVICES;
+LOSS OF USE, DATA, OR PROFITS; OR BUSINESS INTERRUPTION) HOWEVER CAUSED AND ON
+ANY THEORY OF LIABILITY, WHETHER IN CONTRACT, STRICT LIABILITY, OR TORT
+(INCLUDING NEGLIGENCE OR OTHERWISE) ARISING IN ANY WAY OUT OF THE USE OF THIS
+SOFTWARE, EVEN IF ADVISED OF THE POSSIBILITY OF SUCH DAMAGE.