Allow data version CSV / YAML file(s) to be specified in the config#2189
Allow data version CSV / YAML file(s) to be specified in the config#2189brynpickering wants to merge 6 commits into
Conversation
|
The way it is implemented would allow for easy support of Supporting |
|
@euronion your wish is my command |
|
RTR? |
euronion
left a comment
There was a problem hiding this comment.
MLGTM! Some minor comments below.
And: config object is still hard coded in dataset_version - is the implementation compatible with config_provider / config overwrites using the scenario functionality?
There was a problem hiding this comment.
How's that change related to the PR?
There was a problem hiding this comment.
I made updates to the tests and consolidated the use of --fix. In the process, it seemed worthwhile to use that --fix arg directly in generating the config. It's very much a side note to the PR which I agree isn't actually responding to the main issue. I can spin it out into another PR if you prefer.
|
|
||
| @field_validator("version_files") | ||
| @classmethod | ||
| def check_version_files_are_csv(cls, v: list[FilePath]) -> list[FilePath]: |
There was a problem hiding this comment.
naming: csv and yaml are being tested for, not only csv
| return pd.read_csv(cost_file, index_col=0) | ||
|
|
||
|
|
||
| def load_data_versions(file: str, create_cols_from_tags: bool = True) -> pd.DataFrame: |
There was a problem hiding this comment.
| def load_data_versions(file: str, create_cols_from_tags: bool = True) -> pd.DataFrame: | |
| @lru_cache | |
| def load_data_versions(file: str, create_cols_from_tags: bool = True) -> pd.DataFrame: |
This is called quite a bit, avoid unnecessary disk access. I'd prefer the caching to happen after pd.concat(...) in the calling function, but that would require you to restructure the code a bit. Here should be good enough.
| data_versions_list = [] | ||
| for file in config["data"]["version_files"]: | ||
| if not (path := Path(file)).is_absolute(): | ||
| path = Path(workflow.snakefile).parent.parent / path | ||
| data_versions_entry = load_data_versions(path).set_index( | ||
| ["dataset", "version", "source"] | ||
| ) | ||
| data_versions_list.append(data_versions_entry) | ||
| data_versions = pd.concat(data_versions_list) | ||
| data_versions = ( | ||
| data_versions.loc[~data_versions.index.duplicated(keep="last")] | ||
| .sort_index() | ||
| .reset_index() | ||
| ) |
There was a problem hiding this comment.
These steps were previously cached (see comment also below). Maybe wrap into a dedicated internal function that is cached to avoid dozens of identical file reads + constructions of data frames
| **dataset_config_overrides : str | ||
| entries to override the dataset config for the given `name`. |
There was a problem hiding this comment.
What do we need the overrides for?
There was a problem hiding this comment.
OSM dataset generation (for upload to zenodo). It was previously using a copy of the method with hardcoded entries for source and version. I've consolidated it into this one method by adding the kwargs option
|
|
||
| if dataset.empty: | ||
| raise ValueError( | ||
| f"Dataset '{name}' with source '{dataset_config['source']}' for '{dataset_config['version']}' not found in data/versions.csv." |
There was a problem hiding this comment.
Hard-coded file name here no longer correct
| elif not path.exists(): | ||
| raise ValueError( | ||
| f"Version file '{path}' must exist and be specified relative to the project root or as an absolute path." | ||
| ) |
There was a problem hiding this comment.
Isn't this redundant? the list[FilePath] in the signature should already cause pydantic to check for the files existing?
|
@euronion slight refactor to enable caching (and validating on load, not just in tests - it will cover cases where a user provides their own data file, which we otherwise wouldn't validate prior to merging with the base file). |
Closes #2016
Allows users to completely or partially replace the contents of
data/versions.csvin thedata.version_fileslist. If multiple CSVs are given, they overwrite each other in order they are given.I've tested the unit tests locally with a partial data file that overrides the base data file. I have successfully run the
base_networkrule with the OSM data retrieval steps.Checklist
Required:
doc/release_notes.md.If applicable:
scripts/lib/validation.doc/*.mdfiles.