Skip to content

Allow data version CSV / YAML file(s) to be specified in the config#2189

Open
brynpickering wants to merge 6 commits into
PyPSA:masterfrom
open-energy-transition:feat/data-version-files
Open

Allow data version CSV / YAML file(s) to be specified in the config#2189
brynpickering wants to merge 6 commits into
PyPSA:masterfrom
open-energy-transition:feat/data-version-files

Conversation

@brynpickering

@brynpickering brynpickering commented Jun 8, 2026

Copy link
Copy Markdown
Contributor

Closes #2016

Allows users to completely or partially replace the contents of data/versions.csv in the data.version_files list. If multiple CSVs are given, they overwrite each other in order they are given.

I've tested the unit tests locally with a partial data file that overrides the base data file. I have successfully run the base_network rule with the OSM data retrieval steps.

Checklist

Required:

  • Changes are tested locally and behave as expected.
  • Code and workflow changes are documented.
  • A release note entry is added to doc/release_notes.md.

If applicable:

  • Changes in configuration options are reflected in scripts/lib/validation.
  • For new data sources or versions, these instructions have been followed.
  • New rules are documented in the appropriate doc/*.md files.

@euronion

euronion commented Jun 8, 2026

Copy link
Copy Markdown
Contributor

The way it is implemented would allow for easy support of .yaml files as well, by simply creating a DataFrame from the yaml file as well.

Supporting .yaml instead of .csv was discussed in #2016 - would you be interested in implementing support for it here as well? This would allow forks to decide which file format to use.

@brynpickering

Copy link
Copy Markdown
Contributor Author

@euronion your wish is my command

@brynpickering brynpickering changed the title Allow data version CSV file(s) to be specified in the config Allow data version CSV / YAML file(s) to be specified in the config Jun 9, 2026
@euronion

Copy link
Copy Markdown
Contributor

RTR?

@brynpickering brynpickering requested a review from euronion June 10, 2026 12:43

@euronion euronion left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

MLGTM! Some minor comments below.

And: config object is still hard coded in dataset_version - is the implementation compatible with config_provider / config overwrites using the scenario functionality?

Comment thread pixi.toml

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How's that change related to the PR?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I made updates to the tests and consolidated the use of --fix. In the process, it seemed worthwhile to use that --fix arg directly in generating the config. It's very much a side note to the PR which I agree isn't actually responding to the main issue. I can spin it out into another PR if you prefer.

Comment thread scripts/lib/validation/config/data.py Outdated

@field_validator("version_files")
@classmethod
def check_version_files_are_csv(cls, v: list[FilePath]) -> list[FilePath]:

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

naming: csv and yaml are being tested for, not only csv

Comment thread scripts/_helpers.py Outdated
return pd.read_csv(cost_file, index_col=0)


def load_data_versions(file: str, create_cols_from_tags: bool = True) -> pd.DataFrame:

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
def load_data_versions(file: str, create_cols_from_tags: bool = True) -> pd.DataFrame:
@lru_cache
def load_data_versions(file: str, create_cols_from_tags: bool = True) -> pd.DataFrame:

This is called quite a bit, avoid unnecessary disk access. I'd prefer the caching to happen after pd.concat(...) in the calling function, but that would require you to restructure the code a bit. Here should be good enough.

Comment thread rules/common.smk Outdated
Comment on lines +110 to +123
data_versions_list = []
for file in config["data"]["version_files"]:
if not (path := Path(file)).is_absolute():
path = Path(workflow.snakefile).parent.parent / path
data_versions_entry = load_data_versions(path).set_index(
["dataset", "version", "source"]
)
data_versions_list.append(data_versions_entry)
data_versions = pd.concat(data_versions_list)
data_versions = (
data_versions.loc[~data_versions.index.duplicated(keep="last")]
.sort_index()
.reset_index()
)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These steps were previously cached (see comment also below). Maybe wrap into a dedicated internal function that is cached to avoid dozens of identical file reads + constructions of data frames

Comment thread rules/common.smk
Comment on lines +97 to +98
**dataset_config_overrides : str
entries to override the dataset config for the given `name`.

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What do we need the overrides for?

Copy link
Copy Markdown
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

OSM dataset generation (for upload to zenodo). It was previously using a copy of the method with hardcoded entries for source and version. I've consolidated it into this one method by adding the kwargs option

Comment thread rules/common.smk

if dataset.empty:
raise ValueError(
f"Dataset '{name}' with source '{dataset_config['source']}' for '{dataset_config['version']}' not found in data/versions.csv."

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Hard-coded file name here no longer correct

Comment thread scripts/lib/validation/config/data.py Outdated
Comment on lines +41 to +44
elif not path.exists():
raise ValueError(
f"Version file '{path}' must exist and be specified relative to the project root or as an absolute path."
)

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Isn't this redundant? the list[FilePath] in the signature should already cause pydantic to check for the files existing?

@brynpickering brynpickering requested a review from euronion June 16, 2026 13:28
@brynpickering

Copy link
Copy Markdown
Contributor Author

@euronion slight refactor to enable caching (and validating on load, not just in tests - it will cover cases where a user provides their own data file, which we otherwise wouldn't validate prior to merging with the base file).

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Make data/versions.csv easily extendable with a configurable option

2 participants