
feat: refresh DAC codelists from live OECD source #37

Merged
jm-rivera merged 1 commit into main from feat/refresh-dac-codelists on Jun 15, 2026

@jm-rivera

Collaborator

Why

The old codelist-crosswalk source (stats.oecd.org/FileView2.aspx?IDFile=<GUID> SDMX maps, parsed by positional XML index) is dead (HTTP 404): OECD retired stats.oecd.org in favor of the Data Explorer. The mappings kept working only because they had been fetched once and frozen on disk; they could not be regenerated.

What

Ports a maintainable refresh pattern that sources from the live OECD codelist app (development-finance-codelists.oecd.org).

  • scripts/data_maintenance/refresh_dac_codelists.py — fetches Providers (5) + Recipients (13) via the two-step ASPX/VIEWSTATE handshake, projects each area codelist to {dac_code: dotstatcode or iso3}, additively merges into the committed crosswalks (preserves the ~91/93 historical codes the live app no longer serves; live wins on value conflicts), writes a provenance sidecar, prints a unified diff, and never auto-writes (guarded --write; --check for CI).
  • Static maps (prices, flow-types incl. the (.*) regex passthrough) are committed constants; corrections stay file-loaded.
  • Retires the dead FileView URLs + positional-index XML helpers. read_mapping is keyword-only and raises on a missing file.
  • Refreshed dac{1,2}_codes_area.json: +25/+13 new active codes, 918 -> 4EU001 (aligns base with the corrections overlay), 0 removed. Kosovo correctly stays XKV (dotstatcode preferred over iso3 XKX).
  • Offline fixture-backed tests + a characterization test pinning convert_dac1/crs_to_dotstat_codes output.
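For reviewers, the projection and additive-merge rules described above can be sketched as follows. This is a minimal illustration, not the code in refresh_dac_codelists.py: the function names, the row shape (dicts with `dac_code`, `dotstatcode`, `iso3` keys), and all sample values except `918 -> 4EU001` and Kosovo's `XKV`/`XKX` are assumptions made for the example.

```python
def project_area_codelist(rows: list[dict]) -> dict[str, str]:
    """Project live rows to {dac_code: dotstatcode or iso3} (hypothetical row shape)."""
    projected: dict[str, str] = {}
    for row in rows:
        # Prefer the dotstat code; fall back to ISO3 only when it is missing or empty.
        value = row.get("dotstatcode") or row.get("iso3")
        if value:
            projected[str(row["dac_code"])] = value
    return projected


def merge_additive(committed: dict[str, str], live: dict[str, str]) -> dict[str, str]:
    """Keep every committed code, add new live codes, and let live win on conflicts."""
    merged = dict(committed)  # historical codes the live app no longer serves survive
    merged.update(live)       # live values overwrite committed ones on conflict
    return merged


# Illustrative data only: "OLD", "HIST", "0000", "1111", "9999" and "ABC" are made up.
committed = {"918": "OLD", "9999": "HIST"}
live_rows = [
    {"dac_code": 918, "dotstatcode": "4EU001", "iso3": None},
    {"dac_code": "0000", "dotstatcode": "XKV", "iso3": "XKX"},  # Kosovo-style row
    {"dac_code": "1111", "dotstatcode": "", "iso3": "ABC"},     # falls back to iso3
]

merged = merge_additive(committed, project_area_codelist(live_rows))
```

Because the merge never deletes keys, a refresh under this scheme can only add codes or change values, which is consistent with the "0 removed" result reported above.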

Drift protection

.github/workflows/codelist-drift.yml runs weekly: refreshes against the live source and opens a PR only when the committed codelists fall out of date — keeping the live network check off the PR path.
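A drift check of this kind boils down to comparing the committed mapping with a freshly refreshed one and signalling when they differ. The sketch below shows one way to do that with the standard library; the function names, the JSON layout (sorted keys, two-space indent), and the exit-code convention (0 for in sync, 1 for drift) are assumptions, not the script's documented behaviour.

```python
import difflib
import json


def render(mapping: dict[str, str]) -> str:
    """Serialize deterministically so the diff reflects content, not key order."""
    return json.dumps(mapping, indent=2, sort_keys=True) + "\n"


def codelist_diff(committed: dict[str, str], refreshed: dict[str, str],
                  name: str = "codelist.json") -> str:
    """Return a unified diff; an empty string means no drift."""
    before = render(committed).splitlines(keepends=True)
    after = render(refreshed).splitlines(keepends=True)
    return "".join(
        difflib.unified_diff(before, after, fromfile=f"a/{name}", tofile=f"b/{name}")
    )


def check_exit_code(committed: dict[str, str], refreshed: dict[str, str]) -> int:
    """Assumed convention: 0 when in sync, 1 when the committed file is stale."""
    return 1 if codelist_diff(committed, refreshed) else 0


# Illustrative values only.
in_sync = check_exit_code({"1": "A"}, {"1": "A"})
drifted = check_exit_code({"1": "A"}, {"1": "A", "2": "B"})
print(codelist_diff({"1": "A"}, {"1": "A", "2": "B"}))
```

Nothing here writes to disk: the diff is computed in memory, so a check-only run cannot modify tracked files.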

Verification

189 tests pass, ruff + format + ty clean. Diff-only runs leave tracked files untouched.

The old codelist-crosswalk source (stats.oecd.org FileView GUID URLs, parsed
by positional XML index) is dead (HTTP 404). Replace it with a maintained
refresh tool that sources from the live OECD codelist app.

- scripts/data_maintenance/refresh_dac_codelists.py: fetches Providers (5) and
  Recipients (13) via the two-step ASPX/VIEWSTATE handshake, projects each area
  codelist to {dac_code: dotstatcode or iso3}, additively merges into the
  committed crosswalks (preserves historical codes the live app no longer
  serves; live wins on value conflicts), writes a provenance sidecar, prints a
  unified diff, and never auto-writes (guarded --write; --check for CI).
- Static maps (prices, flow-types incl. the (.*) regex passthrough) are
  committed constants; corrections stay file-loaded.
- Retire the dead FileView URLs + positional-index XML helpers; read_mapping is
  keyword-only and raises on a missing file.
- Refreshed dac{1,2}_codes_area.json (+25/+13 active codes, 918 -> 4EU001).
- Offline fixture-backed tests + a characterization test pinning the consumers.
- .github/workflows/codelist-drift.yml: weekly drift check that opens a PR when
  the committed codelists fall out of date with the live source.
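
As background on the two-step ASPX/VIEWSTATE handshake mentioned above: ASP.NET WebForms pages typically require a first GET to obtain hidden state fields (`__VIEWSTATE`, `__EVENTVALIDATION`, and similar), which are then echoed back in a POST that triggers the actual action. The sketch below shows only the offline part, extracting the hidden fields and building the postback payload. The helper names, the sample HTML, and the `ctl00$export` control name are hypothetical; the real page's fields and controls may differ.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode


class HiddenFieldParser(HTMLParser):
    """Collect name/value pairs from <input type="hidden"> elements."""

    def __init__(self) -> None:
        super().__init__()
        self.fields: dict[str, str] = {}

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attr = dict(attrs)
        if (attr.get("type") or "").lower() == "hidden" and attr.get("name"):
            self.fields[attr["name"]] = attr.get("value") or ""


def build_postback(html: str, event_target: str) -> dict[str, str]:
    """Step 2 payload: echo the hidden state from step 1 and name the control to fire."""
    parser = HiddenFieldParser()
    parser.feed(html)
    payload = dict(parser.fields)
    payload["__EVENTTARGET"] = event_target
    payload.setdefault("__EVENTARGUMENT", "")
    return payload


# Stand-in for the body returned by the initial GET (step 1); values are made up.
sample_html = """
<form method="post">
  <input type="hidden" name="__VIEWSTATE" value="abc123" />
  <input type="hidden" name="__EVENTVALIDATION" value="xyz789" />
  <input type="text" name="search" value="ignored" />
</form>
"""

payload = build_postback(sample_html, event_target="ctl00$export")  # hypothetical control
body = urlencode(payload)  # form-encoded body for the step 2 POST
```

Keeping the parsing separate from the network calls is also what makes offline, fixture-backed tests straightforward: a saved HTML page can stand in for the live response.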
jm-rivera merged commit 88aef50 into main on Jun 15, 2026
10 checks passed
jm-rivera deleted the feat/refresh-dac-codelists branch on June 15, 2026 11:28