Validation_goldens modification by niveditasing · Pull Request #2043 · datacommonsorg/data

niveditasing · 2026-05-28T05:51:46Z

Summary
This PR fixes issues in our validation pipeline, including data filtering mismatches, malformed CSV output formatting, file path resolution errors, and CSV/MCF parsing crashes.

runner.py — Dynamic Path Resolution

Problem:
Running validations from the repository root (e.g. during automated CI runs) broke relative paths like "golden_data/un_wpp.csv", so the validator read no data.
Fix:
- Introduced helper functions _is_relative_local and _find_base_dir to traverse the directory tree upward (up to 8 levels) to locate the golden_data/ directory relative to the config file or working directory.
- Dynamically resolves golden_files and input_files to absolute paths before dispatching GOLDENS_CHECK.
- Added corresponding unit tests in runner_test.py.

validator_goldens.py — Safe CSV Serialization and Namespace Stripping

Problem:
Input files containing raw strings prefixed with dcid: (e.g., dcid:Earth) failed to generate goldens because of exact match requirements against normalized schema property lists.
Fix:
Updated load_nodes_from_file to automatically sanitize and strip any leading dcid: namespaces from loaded CSV string cell values during parsing, keeping headers intact.
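
A minimal sketch of the value cleanup, assuming rows are read with csv.DictReader. The names strip_dcid_prefix and load_rows are illustrative only and do not necessarily match how load_nodes_from_file is structured.

```python
import csv
import io

_DCID_PREFIX = 'dcid:'


def strip_dcid_prefix(value):
    """Removes one leading 'dcid:' from string values; others pass through."""
    if isinstance(value, str) and value.startswith(_DCID_PREFIX):
        return value[len(_DCID_PREFIX):]
    return value


def load_rows(csv_text):
    """Parses CSV text and cleans cell values, leaving header names untouched."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [{key: strip_dcid_prefix(val) for key, val in row.items()}
            for row in reader]


rows = load_rows('GeoId,name\ndcid:Earth,Earth\ndcid:country/AGO,Angola\n')
print(rows)
```

Only cell values are rewritten here; column names such as GeoId are passed through unchanged, which matches the "keeping headers intact" note above.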

file_util.py

Problem:
When writing a dictionary of dictionaries (such as validator golden nodes) with key_column_name=None, file_write_csv_dict scanned the inner dictionaries and found the single property (e.g., ['GeoId']). Because len(columns) == 1, it assumed the values were primitives and appended a default 'value' column. This produced an extra empty column in the generated CSV (e.g., "GeoId","value").
Fix:
Added a check using any(isinstance(value, dict) for value in py_dict.values()) to determine whether the dictionary values are nested dictionaries:
- If the values are dictionaries, the extracted single key is preserved as-is and no extra 'value' column is added.
- If the values are primitives or other simple types, the function still falls back to appending 'value', so existing behavior is unchanged.

Before:
"GeoId","value"
"Earth",
"country/AGO",
After (Correct):
"GeoId"
"Earth"
"country/AGO"
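
The column decision can be sketched as follows. This is a simplified stand-in, not the real file_write_csv_dict: the helper csv_columns_for and the way it collects columns are assumptions that mirror only the behavior described above.

```python
def csv_columns_for(py_dict, key_column_name=None):
    """Returns the CSV column names for py_dict (hypothetical helper)."""
    columns = [key_column_name] if key_column_name else []
    # Collect property names from any nested dictionaries, in first-seen order.
    for value in py_dict.values():
        if isinstance(value, dict):
            for prop in value:
                if prop not in columns:
                    columns.append(prop)
    values_are_dicts = any(
        isinstance(value, dict) for value in py_dict.values())
    # Only primitive values need the default 'value' column.
    if len(columns) <= 1 and not values_are_dicts:
        columns.append('value')
    return columns


golden_nodes = {
    'Earth': {'GeoId': 'Earth'},
    'country/AGO': {'GeoId': 'country/AGO'},
}
print(csv_columns_for(golden_nodes))
```

For the golden nodes above this yields a single GeoId column, while a flat dictionary of primitives still gets the extra 'value' column.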

Verification and Testing Performed:

Tested two imports to check that the code works. Below are the validation_output results, one showing a failure and the other showing success.

gemini-code-assist

Code Review

This pull request updates validator_goldens.py to support namespace-stripped matching when generating goldens and replaces the helper function for writing CSVs with custom csv.DictWriter logic. A review comment points out that extracting CSV column headers from only the first node in golden_nodes assumes all nodes share the same keys, which can lead to errors or missing columns if keys vary. It suggests collecting the union of all keys across all nodes instead.
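
The suggestion to take the union of keys could be sketched like this, assuming the golden nodes are available as an iterable of dictionaries. The function name collect_fieldnames is hypothetical.

```python
def collect_fieldnames(nodes):
    """Returns the sorted union of keys across all node dictionaries."""
    return sorted({key for node in nodes for key in node})


nodes = [
    {'GeoId': 'Earth'},
    {'GeoId': 'country/AGO', 'name': 'Angola'},
]
print(collect_fieldnames(nodes))
```

Taking the header from the first node alone would drop the name column in this example, because only the second node carries it.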

balit-raibot · 2026-05-28T06:35:26Z

@gemini-code-assist review

gemini-code-assist

Code Review

This pull request updates tools/import_validation/validator_goldens.py to import the csv module, strip namespaces when matching nodes against must_include_values, and replace the helper file_write_csv_dict with a custom csv.DictWriter implementation. The feedback suggests adding newline='' when opening the CSV file to prevent platform-specific carriage return issues and simplifying the logic for extracting and sorting unique keys.
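
A minimal sketch of writing rows with csv.DictWriter and newline='', under the assumption that every cell is quoted as in the sample output in the PR description. The function name write_nodes_csv and its parameters are illustrative, not taken from the PR.

```python
import csv


def write_nodes_csv(path, nodes, fieldnames):
    """Writes node dictionaries to path as a fully quoted CSV file."""
    # newline='' lets the csv module control line endings itself, which
    # avoids doubled carriage returns on Windows.
    with open(path, 'w', newline='', encoding='utf-8') as csv_file:
        writer = csv.DictWriter(csv_file,
                                fieldnames=fieldnames,
                                quoting=csv.QUOTE_ALL)
        writer.writeheader()
        writer.writerows(nodes)
```

Without newline='', text-mode newline translation on Windows can turn the writer's "\r\n" terminator into "\r\r\n", which shows up as blank lines between rows.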

niveditasing · 2026-06-05T12:41:36Z

@gemini-code-assist review

gemini-code-assist

Code Review

This pull request introduces path resolution logic for summary report rules in the validation runner, cleans up CSV node keys during loading, and updates golden file generation to use Python's standard csv.DictWriter. Feedback focuses on adhering to PEP 8 line length limits, replacing debug print statements with proper logging.debug calls, and opening CSV files with newline='' to ensure cross-platform compatibility on Windows.

Important

The consumer version of Gemini Code Assist on GitHub is being sunset. Starting June 18, 2026, new organization installations will be blocked, and all code review activity will officially cease on July 17, 2026.
For more details on the timeline and next steps, please review the Help Documentation.

niveditasing · 2026-06-08T08:00:48Z

@gemini-code-assist review

gemini-code-assist

Code Review

This pull request introduces relative path resolution for the GOLDENS_CHECK validator and improves CSV parsing robustness when handling values containing colons (such as DCIDs). Feedback on these changes suggests using repository marker files (like .git or WORKSPACE) for more robust root path resolution, replacing debug print statements with proper logging.debug calls, and streaming CSV files directly to csv.DictReader to avoid reading entire files into memory.
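
The streaming suggestion might look like the sketch below, which hands the open file object directly to csv.DictReader instead of reading the whole file into a string first. The generator name iter_csv_rows is an assumption for illustration.

```python
import csv


def iter_csv_rows(path):
    """Yields one dictionary per CSV row without loading the whole file."""
    with open(path, newline='', encoding='utf-8') as csv_file:
        # DictReader pulls lines from the file object lazily.
        for row in csv.DictReader(csv_file):
            yield row
```

Because rows are yielded one at a time, memory use stays roughly constant even for large input files.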

niveditasing · 2026-06-14T18:06:00Z

@gemini-code-assist review

gemini-code-assist

Code Review

This pull request introduces path resolution improvements for the GOLDENS_CHECK validator, cleans up "dcid:" prefixes when loading golden nodes, and refactors CSV writing logic to handle dictionary values correctly. The code reviewer provided several valuable suggestions, including fixing a bug where remote paths (like GCS URIs) are incorrectly treated as relative local paths, correcting a documentation mismatch in Validations.md, removing a redundant logging import, and handling file paths robustly in the directory search helper. Additionally, the reviewer recommended replacing debug print statements with proper logging, using the modern removeprefix method for string cleanup, and optimizing a dictionary check in file_util.py to run in O(1) time.

gemini-code-assist Bot reviewed May 28, 2026

Comment thread tools/import_validation/validator_goldens.py Outdated

gemini-code-assist Bot reviewed May 28, 2026

Comment thread tools/import_validation/validator_goldens.py Outdated

gemini-code-assist Bot reviewed Jun 5, 2026

Comment thread tools/import_validation/runner.py Outdated

Comment thread tools/import_validation/validator_goldens.py Outdated

Comment thread tools/import_validation/validator_goldens.py Outdated

gemini-code-assist Bot reviewed Jun 8, 2026

Comment thread tools/import_validation/runner.py Outdated

Comment thread tools/import_validation/runner.py Outdated

Comment thread tools/import_validation/validator_goldens.py Outdated

niveditasing force-pushed the validation_golden_fix branch from c3a1f36 to ef96648 Compare June 8, 2026 10:48

improve validation path resolution and sanitize CSV parsing

81a632b

niveditasing force-pushed the validation_golden_fix branch from ef96648 to 81a632b Compare June 8, 2026 10:54

niveditasing added 15 commits June 8, 2026 11:05

fixed tests

903e031

docs: add detailed docstrings for path resolution helpers in runner

20e1292

modified validation

14409c4

testing

194a3bc

fix: strip dcid: namespace prefix from CSV values in golden loading

aabebba

docs: add descriptive comments to CSV writer configuration

6e36dcb

testing

1161f17

testing

ddf793b

revert: undo changes to validator_goldens_test.py

5315183

style: clean up whitespace in validator_goldens.py

8d80c46

refactor: reduce base directory search depth limit to 8 in runner.py

c3eff23

Merge branch 'master' into validation_golden_fix

ed220f7

revert: restore tab character in delim_chars in file_util.py

cab980d

style: restore literal tab character in delim_chars in file_util.py

2701c54

testing

b3d2053

gemini-code-assist Bot reviewed Jun 14, 2026

testing

8f7e9e7
