feat: refresh DAC codelists from live OECD source #37
Merged
The old codelist-crosswalk source (stats.oecd.org FileView GUID URLs, parsed
by positional XML index) is dead (HTTP 404). Replace it with a maintained
refresh tool that sources from the live OECD codelist app.
- scripts/data_maintenance/refresh_dac_codelists.py: fetches Providers (5) and
Recipients (13) via the two-step ASPX/VIEWSTATE handshake, projects each area
codelist to {dac_code: dotstatcode or iso3}, additively merges into the
committed crosswalks (preserves historical codes the live app no longer
serves; live wins on value conflicts), writes a provenance sidecar, prints a
unified diff, and never auto-writes (guarded --write; --check for CI).
- Static maps (prices, flow-types incl. the (.*) regex passthrough) are
committed constants; corrections stay file-loaded.
- Retire the dead FileView URLs + positional-index XML helpers; read_mapping is
keyword-only and raises on a missing file.
- Refreshed dac{1,2}_codes_area.json (+25/+13 active codes, 918 -> 4EU001).
- Offline fixture-backed tests + a characterization test pinning the consumers.
- .github/workflows/codelist-drift.yml: weekly drift check that opens a PR when
the committed codelists fall out of date with the live source.
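The two-step ASPX/VIEWSTATE handshake named in the first bullet can be sketched offline. This is a minimal illustration, not the code in refresh_dac_codelists.py: it assumes the standard ASP.NET WebForms hidden fields (`__VIEWSTATE`, `__VIEWSTATEGENERATOR`, `__EVENTVALIDATION`), and the sample page and the `ctl00$export` control name are hypothetical. Step one is a GET whose HTML carries the hidden fields; step two echoes them back in a POST together with the control that triggers the export.

```python
from __future__ import annotations

from html.parser import HTMLParser
from urllib.parse import urlencode

# Hypothetical stand-in for the HTML returned by the first (GET) request.
SAMPLE_PAGE = """
<form method="post" action="./">
  <input type="hidden" name="__VIEWSTATE" value="dDwtMTIz" />
  <input type="hidden" name="__VIEWSTATEGENERATOR" value="CA0B0334" />
  <input type="hidden" name="__EVENTVALIDATION" value="wEWAgL+" />
  <input type="text" name="search" value="ignored" />
</form>
"""


class HiddenFieldParser(HTMLParser):
    """Collect name/value pairs from <input type="hidden"> elements."""

    def __init__(self) -> None:
        super().__init__()
        self.fields: dict[str, str] = {}

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attr = dict(attrs)
        if (attr.get("type") or "").lower() == "hidden" and attr.get("name"):
            self.fields[attr["name"]] = attr.get("value") or ""


def extract_hidden_fields(html: str) -> dict[str, str]:
    """Step 1: pull the server-issued state tokens out of the GET response."""
    parser = HiddenFieldParser()
    parser.feed(html)
    parser.close()
    return parser.fields


def build_postback(html: str, event_target: str) -> bytes:
    """Step 2: build the form-encoded POST body that echoes the tokens back."""
    payload = extract_hidden_fields(html)
    payload["__EVENTTARGET"] = event_target  # control that triggers the export
    return urlencode(payload).encode("ascii")


if __name__ == "__main__":
    print(build_postback(SAMPLE_PAGE, "ctl00$export").decode("ascii"))
```

Keeping the parsing and payload construction free of network calls is what would let offline fixture-backed tests exercise them against saved HTML.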
Why
The old codelist-crosswalk source (stats.oecd.org/FileView2.aspx?IDFile=<GUID> SDMX maps, parsed by positional XML index) is dead (HTTP 404). OECD retired stats.oecd.org for the Data Explorer. The mappings only still worked because they were fetched once and frozen on disk; they could not be regenerated.

What
Ports a maintainable refresh pattern sourcing from the live OECD codelist app (development-finance-codelists.oecd.org).
- scripts/data_maintenance/refresh_dac_codelists.py: fetches Providers (5) + Recipients (13) via the two-step ASPX/VIEWSTATE handshake, projects each area codelist to {dac_code: dotstatcode or iso3}, additively merges into the committed crosswalks (preserves the ~91/93 historical codes the live app no longer serves; live wins on value conflicts), writes a provenance sidecar, prints a unified diff, and never auto-writes (guarded --write; --check for CI).
- Static maps (prices, flow-types incl. the (.*) regex passthrough) are committed constants; corrections stay file-loaded.
- Retire the dead FileView URLs + positional-index XML helpers; read_mapping is keyword-only and raises on a missing file.
- Refreshed dac{1,2}_codes_area.json: +25/+13 new active codes, 918 -> 4EU001 (aligns base with the corrections overlay), 0 removed. Kosovo correctly stays XKV (dotstatcode preferred over iso3 XKX).
- Offline fixture-backed tests + a characterization test pinning convert_dac1/crs_to_dotstat_codes output.

Drift protection
.github/workflows/codelist-drift.yml runs weekly: refreshes against the live source and opens a PR only when the committed codelists fall out of date, keeping the live network check off the PR path.

Verification
189 tests pass, ruff + format + ty clean. Diff-only runs leave tracked files untouched.
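
For reviewers who want the merge semantics spelled out, here is a minimal sketch of the projection, additive merge, and diff-only behaviour described above. It is an illustration under assumptions, not the script itself. The function names are hypothetical, skipping rows with neither identifier is a guess, the key sort order is a guess, and the exit-code convention for --check is assumed. The sample values are illustrative apart from 918 -> 4EU001 and XKV/XKX, which come from this description.

```python
from __future__ import annotations

import difflib
import json


def project(records: list[dict]) -> dict[str, str]:
    """Project live rows to {dac_code: dotstatcode or iso3}."""
    out: dict[str, str] = {}
    for rec in records:
        value = rec.get("dotstatcode") or rec.get("iso3")
        if value:  # assumption: rows with neither identifier are skipped
            out[str(rec["dac_code"])] = value
    return out


def additive_merge(committed: dict[str, str], live: dict[str, str]) -> dict[str, str]:
    """Keep every committed code, add new live codes, live wins on conflicts."""
    return {**committed, **live}


def render(mapping: dict[str, str]) -> str:
    """Serialize deterministically so diffs only show real changes."""
    return json.dumps(dict(sorted(mapping.items())), indent=2) + "\n"


def unified_diff(old: dict[str, str], new: dict[str, str], name: str) -> str:
    """Return a unified diff of the two serialized mappings ('' if identical)."""
    return "".join(
        difflib.unified_diff(
            render(old).splitlines(keepends=True),
            render(new).splitlines(keepends=True),
            fromfile=f"a/{name}",
            tofile=f"b/{name}",
        )
    )


def drift_exit_code(committed: dict[str, str], merged: dict[str, str]) -> int:
    """Assumed --check convention: 1 when the committed file is stale, else 0."""
    return 1 if merged != committed else 0


# Illustrative data: "1000" stands in for a historical code the live app dropped,
# and "EU_PLACEHOLDER" stands in for whatever 918 mapped to before the refresh.
committed = {"918": "EU_PLACEHOLDER", "1000": "HISTORICAL"}
live_records = [
    {"dac_code": 918, "dotstatcode": "4EU001", "iso3": None},
    {"dac_code": 57, "dotstatcode": "XKV", "iso3": "XKX"},
]

if __name__ == "__main__":
    merged = additive_merge(committed, project(live_records))
    # Diff-only by default: nothing is written to disk here.
    print(unified_diff(committed, merged, "dac2_codes_area.json"), end="")
    raise SystemExit(drift_exit_code(committed, merged))
```

In this sketch the historical code survives, 918 takes the live value, and the Kosovo-style row resolves to XKV because dotstatcode is tried before iso3.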