Allow data version CSV / YAML file(s) to be specified in the config by brynpickering · Pull Request #2189 · PyPSA/pypsa-eur

brynpickering · 2026-06-08T12:08:45Z

Closes #2016

Allows users to completely or partially replace the contents of data/versions.csv by listing one or more files under data.version_files in the config. If multiple files are given, entries in later files override matching entries in earlier ones, in the order the files are listed.

I've run the unit tests locally with a partial data file that overrides the base data file. I have also successfully run the base_network rule, including the OSM data retrieval steps.
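The override order described above can be sketched without pandas. This is only an illustration: the key columns (dataset, version, source) are the ones the PR indexes on, while the `url` column, the example rows and the function name `merge_version_rows` are hypothetical.

```python
# Each "file" is represented as a list of rows; in the PR these come from CSV / YAML files.
Row = dict[str, str]


def merge_version_rows(*files: list[Row]) -> list[Row]:
    """Merge rows so that later files win on identical (dataset, version, source)."""
    merged: dict[tuple[str, str, str], Row] = {}
    for rows in files:
        for row in rows:
            key = (row["dataset"], row["version"], row["source"])
            merged[key] = row  # a later file replaces an earlier entry with the same key
    # Sort by key, similar in spirit to the sort_index() call in the PR.
    return [merged[key] for key in sorted(merged)]


# Hypothetical base file and a partial override file.
base = [
    {"dataset": "osm", "version": "0.1", "source": "archive", "url": "https://example.org/base"},
    {"dataset": "costs", "version": "1.0", "source": "primary", "url": "https://example.org/costs"},
]
override = [
    {"dataset": "osm", "version": "0.1", "source": "archive", "url": "https://example.org/fork"},
]

merged = merge_version_rows(base, override)
```

Only the `osm` row is replaced here; the `costs` row from the base file is kept unchanged, which is what a partial override is meant to achieve.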

Checklist

Required:

Changes are tested locally and behave as expected.
Code and workflow changes are documented.
A release note entry is added to doc/release_notes.md.

If applicable:

Changes in configuration options are reflected in scripts/lib/validation.
For new data sources or versions, these instructions have been followed.
New rules are documented in the appropriate doc/*.md files.

euronion · 2026-06-08T13:18:46Z

The way it is implemented would also allow for easy support of .yaml files, simply by creating a DataFrame from the YAML file.

Supporting .yaml instead of .csv was discussed in #2016 - would you be interested in implementing support for it here as well? This would allow forks to decide which file format to use.

brynpickering · 2026-06-08T18:16:03Z

@euronion your wish is my command

euronion · 2026-06-10T11:18:17Z

RTR?

euronion

MLGTM! Some minor comments below.

Also: the config object is still hard-coded in dataset_version. Is the implementation compatible with config_provider / config overwrites using the scenario functionality?

euronion · 2026-06-11T20:28:30Z

How's that change related to the PR?

I made updates to the tests and consolidated the use of --fix. In the process, it seemed worthwhile to use that --fix arg directly in generating the config. It's very much a side note to the PR which I agree isn't actually responding to the main issue. I can spin it out into another PR if you prefer.

euronion · 2026-06-11T20:32:56Z


+    @field_validator("version_files")
+    @classmethod
+    def check_version_files_are_csv(cls, v: list[FilePath]) -> list[FilePath]:


naming: csv and yaml are being tested for, not only csv
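A minimal sketch of what a more neutrally named check could look like. The function name `check_version_files_suffix` and the exact set of accepted suffixes are assumptions; the PR's actual validator body is not shown in this thread, and this plain function only stands in for the logic a pydantic validator might call.

```python
from pathlib import Path

# Assumption: the exact suffixes accepted by the PR are not shown in the thread.
ALLOWED_SUFFIXES = {".csv", ".yaml", ".yml"}


def check_version_files_suffix(paths: list[Path]) -> list[Path]:
    """Reject version files whose suffix is neither CSV nor YAML."""
    bad = [p for p in paths if p.suffix.lower() not in ALLOWED_SUFFIXES]
    if bad:
        names = ", ".join(str(p) for p in bad)
        raise ValueError(f"Version files must be CSV or YAML, got: {names}")
    return paths


checked = check_version_files_suffix([Path("data/versions.csv"), Path("fork/versions.yaml")])
```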

euronion · 2026-06-11T20:36:45Z

    return pd.read_csv(cost_file, index_col=0)
+
+
+def load_data_versions(file: str, create_cols_from_tags: bool = True) -> pd.DataFrame:


Suggested change

-def load_data_versions(file: str, create_cols_from_tags: bool = True) -> pd.DataFrame:
+@lru_cache
+def load_data_versions(file: str, create_cols_from_tags: bool = True) -> pd.DataFrame:

This is called quite a bit, avoid unnecessary disk access. I'd prefer the caching to happen after pd.concat(...) in the calling function, but that would require you to restructure the code a bit. Here should be good enough.
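To illustrate the effect of the suggested decorator, here is a small stand-alone sketch using only the standard library. `load_rows` is a hypothetical stand-in for `load_data_versions`, and the counter exists only to make the saved disk access visible.

```python
import csv
import tempfile
from functools import lru_cache
from pathlib import Path

read_count = 0  # counts real disk reads, for demonstration only


@lru_cache
def load_rows(file: str) -> tuple[tuple[str, ...], ...]:
    """Read a CSV file once; repeated calls with the same argument hit the cache."""
    global read_count
    read_count += 1
    with open(file, newline="") as handle:
        return tuple(tuple(row) for row in csv.reader(handle))


with tempfile.TemporaryDirectory() as tmp:
    path = Path(tmp) / "versions.csv"
    path.write_text("dataset,version,source\nosm,0.1,archive\n")
    first = load_rows(str(path))
    second = load_rows(str(path))  # served from the cache, no second read
```

One design point worth keeping in mind: `lru_cache` hands back the same object on every hit. The sketch returns immutable tuples, but a cached DataFrame is mutable, so callers that modify it in place would also change what later callers receive.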

euronion · 2026-06-11T20:46:49Z

+    data_versions_list = []
+    for file in config["data"]["version_files"]:
+        if not (path := Path(file)).is_absolute():
+            path = Path(workflow.snakefile).parent.parent / path
+        data_versions_entry = load_data_versions(path).set_index(
+            ["dataset", "version", "source"]
+        )
+        data_versions_list.append(data_versions_entry)
+    data_versions = pd.concat(data_versions_list)
+    data_versions = (
+        data_versions.loc[~data_versions.index.duplicated(keep="last")]
+        .sort_index()
+        .reset_index()
+    )


These steps were previously cached (see also the comment below). Consider wrapping them in a dedicated internal function that is cached, to avoid dozens of identical file reads and DataFrame constructions.

euronion · 2026-06-11T20:48:32Z

+    **dataset_config_overrides : str
+        entries to override the dataset config for the given `name`.


What do we need the overrides for?

OSM dataset generation (for upload to Zenodo). It was previously using a copy of the method with hard-coded entries for source and version. I've consolidated it into this one method by adding the kwargs option.
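The kwargs approach can be pictured roughly as follows. This is a sketch only: the function name `resolve_dataset_config`, the `config["data"][name]` layout and the example values are assumptions, not taken from the PR's code.

```python
def resolve_dataset_config(config: dict, name: str, **dataset_config_overrides: str) -> dict:
    """Return the dataset config for `name`, with any keyword overrides applied on top."""
    # Assumed layout: config["data"][name] holds at least "source" and "version".
    return {**config["data"][name], **dataset_config_overrides}


# Hypothetical config and values, for illustration only.
config = {"data": {"osm": {"source": "archive", "version": "0.1"}}}

default = resolve_dataset_config(config, "osm")
for_upload = resolve_dataset_config(config, "osm", source="build", version="unknown")
```

The original config is left untouched, so a caller that needs fixed values for one rule does not affect other callers.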

euronion · 2026-06-11T20:50:34Z


    if dataset.empty:
        raise ValueError(
            f"Dataset '{name}' with source '{dataset_config['source']}' for '{dataset_config['version']}' not found in data/versions.csv."


Hard-coded file name here no longer correct
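One way the message could reflect the configured files instead of a fixed name is sketched below. The helper name `missing_dataset_message` and its wording are hypothetical; only the quoted fields (name, source, version) come from the snippet above.

```python
def missing_dataset_message(name: str, source: str, version: str, version_files: list[str]) -> str:
    """Build an error message that names the version files that were actually searched."""
    files = ", ".join(version_files)
    return (
        f"Dataset '{name}' with source '{source}' for '{version}' "
        f"not found in any of the configured version files: {files}."
    )


message = missing_dataset_message(
    "osm", "archive", "0.1", ["data/versions.csv", "fork/versions.yaml"]
)
```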

euronion · 2026-06-11T20:51:53Z

+            elif not path.exists():
+                raise ValueError(
+                    f"Version file '{path}' must exist and be specified relative to the project root or as an absolute path."
+                )


Isn't this redundant? The list[FilePath] in the signature should already cause pydantic to check that the files exist.
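For context, the path handling under discussion can be sketched as below. `resolve_version_file` and `project_root` are hypothetical names. Whether the explicit check is redundant may depend on where the schema-level check is evaluated: if that check resolves relative paths against the current working directory (an assumption, not confirmed in this thread), it can differ from a check made after resolving against the project root.

```python
import tempfile
from pathlib import Path


def resolve_version_file(file: str, project_root: Path) -> Path:
    """Resolve a version file relative to the project root unless it is absolute."""
    path = Path(file)
    if not path.is_absolute():
        path = project_root / path
    if not path.exists():
        raise ValueError(
            f"Version file '{path}' must exist and be specified relative to "
            "the project root or as an absolute path."
        )
    return path


with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "data").mkdir()
    (root / "data" / "versions.csv").write_text("dataset,version,source\n")
    resolved = resolve_version_file("data/versions.csv", root)
    resolved_is_absolute = resolved.is_absolute()
    try:
        resolve_version_file("data/missing.csv", root)
        missing_raised = False
    except ValueError:
        missing_raised = True
```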


brynpickering · 2026-06-16T13:30:26Z

@euronion slight refactor to enable caching (and validating on load, not just in tests - it will cover cases where a user provides their own data file, which we otherwise wouldn't validate prior to merging with the base file).

brynpickering added 2 commits June 8, 2026 13:03

Allow data version CSV file(s) to be specified in the config

f3d1612

Minor fixes

d032093

Allow YAML data versions

41632d0

brynpickering changed the title ~~Allow data version CSV file(s) to be specified in the config~~ Allow data version CSV / YAML file(s) to be specified in the config Jun 9, 2026

brynpickering requested a review from euronion June 10, 2026 12:43

Merge branch 'master' into feat/data-version-files

5375a18

euronion requested changes Jun 11, 2026


brynpickering added 2 commits June 15, 2026 14:44

Merge remote-tracking branch 'upstream/master' into feat/data-version…

56193af

…-files

Cache loaded versions; refactor data version schema test

9984785

brynpickering requested a review from euronion June 16, 2026 13:28

