refactor: standardize packagehallucination dataset tools #1786
While consistency is the goal, this approach couples independent tools, which are currently standalone scripts, to files that are not in the same Python module space. Reduced code is preferred in library components, but tools are preferred to be independent and self-encapsulated: they are not shipped with the Python package and are often expected to run from the single tool script file, in a minimal environment with only the required dependencies installed.
Address NVIDIA#1786 review feedback: tools/packagehallucination/* are intentionally standalone scripts that ship outside the python package and run in minimal environments. Replace the shared _common.py with inlined copies of the helpers (normalise_first_seen, emit_record, configure_argparse, write_jsonl) in each tool's main.py, so each tool keeps its self-contained shape while still emitting the same standard {text, package_first_seen} JSONL schema and exposing the same --output / --format / --help CLI surface called out by NVIDIA#1745. Delete _common.py and the shared test scaffolding (the tools are not imported as a package; the inlined helpers are simple enough that the per-tool standalone shape is its own contract). Verified that all three tools produce identical record schema and identical CLI flag set with the inlined helpers. Signed-off-by: Matt Van Horn <455140+mvanhorn@users.noreply.github.com>
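As a rough illustration of what an inlined copy of these helpers might look like inside one tool's main.py, here is a minimal sketch. Only the names normalise_first_seen, emit_record, write_jsonl and the {text, package_first_seen} schema come from the commit message; the function bodies, the choice to have emit_record return a dict that write_jsonl then serializes, and the accepted date formats are assumptions, not the PR's actual code.

```python
import io
import json
from datetime import date, datetime

# Field names taken from the commit message's standard schema.
STANDARD_FIELDS = ("text", "package_first_seen")


def normalise_first_seen(value):
    """Return an ISO-8601 string for a date-like value, or None if unparseable.

    Assumption: inputs are date/datetime objects or ISO-style strings.
    """
    if value is None:
        return None
    if isinstance(value, (date, datetime)):
        return value.isoformat()
    text = str(value).strip()
    # Try a plain date first so "2011-02-14" is kept without a time part,
    # then fall back to a full timestamp.
    for parser in (date.fromisoformat, datetime.fromisoformat):
        try:
            return parser(text).isoformat()
        except ValueError:
            continue
    return None


def emit_record(name, first_seen):
    """Build one record with the standard fields (null date on parse failure)."""
    return dict(zip(STANDARD_FIELDS, (name, normalise_first_seen(first_seen))))


def write_jsonl(records, stream):
    """Write one JSON object per line to an open text stream."""
    for record in records:
        stream.write(json.dumps(record) + "\n")


if __name__ == "__main__":
    sink = io.StringIO()
    write_jsonl([emit_record("requests", "2011-02-14")], sink)
    print(sink.getvalue(), end="")
```

Because the sketch uses only the standard library, a copy of it can live in each script without adding anything to the minimal environment the reviewer describes.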
Good call -- I missed that these are intentionally self-contained scripts. Pushed
Each tool is now standalone again: no The #1745 contract is still satisfied: I verified that all three tools produce identical record schema ( DCO is signed off now too.
Appreciate the quick redirection; review and testing will proceed. Please also note that the DCO sign-off requirement applies to all commits. Either squash the branch or amend the non-compliant commit messages to get the check to pass.
Fixes NVIDIA#1745 Signed-off-by: Matt Van Horn <455140+mvanhorn@users.noreply.github.com>
f45cc9f to d0556b8
Summary
Add a small shared helper module tools/packagehallucination/_common.py that exposes emit_record(name, first_seen) (normalizing first_seen to ISO-8601 or None), STANDARD_FIELDS = ("text", "package_first_seen"), and a shared argparse configurator giving every tool the same --output, --format, and --help surface. Update each per-language main.py to import from _common, write JSONL records with the standardized fields, and validate dates before emitting (drop or null on parse failure). Leave the existing dart/perl/raku tools alone in this PR unless the unification is mechanical: keep the diff scoped to python/ruby/javascript per the file list, and note follow-up coverage in the PR description.
Why this matters
The dataset-building tools under tools/packagehallucination/ (dart, javascript, perl, python, raku, ruby) produce the package-name corpora hosted at huggingface.co/garak-llm/datasets that drive the packagehallucination detectors. Maintainer jmartin-tech filed #1745 calling out three inconsistencies: per-tool output schemas drift on field names (no canonical text + package_first_seen pair), package_first_seen is unevenly emitted and lacks a date-format contract, and CLI --help / option names diverge so the tools cannot share an invocation recipe. The bug is annotated quality-accuracy + housekeeping, with no claimant yet.
Testing
_common.emit_record("requests", "2011-02-14") writes {"text": "requests", "package_first_seen": "2011-02-14"} to the JSONL sink.
A first_seen value that cannot be normalized becomes None and the record still validates; CLI --help for each refactored tool lists identical flag names.
Fixes #1745
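The identical-flag check from the Testing notes could be sketched roughly as follows. The flag names --output, --format, and --help come from the PR description; the configure_argparse signature, the defaults, the single "jsonl" format choice, and the flag_names helper are hypothetical and only illustrate the idea.

```python
import argparse
import re


def configure_argparse(prog, description):
    """Build a parser with the shared CLI surface (argparse adds --help itself).

    Assumption: defaults and the lone "jsonl" choice are illustrative only.
    """
    parser = argparse.ArgumentParser(prog=prog, description=description)
    parser.add_argument(
        "--output",
        default="-",
        help="path to write records to, or '-' for stdout (assumed default)",
    )
    parser.add_argument(
        "--format",
        choices=("jsonl",),
        default="jsonl",
        help="output format (only jsonl is named in the PR)",
    )
    return parser


def flag_names(parser):
    """Collect the long option names that appear in the parser's help text."""
    return set(re.findall(r"--[a-z][a-z-]*", parser.format_help()))


if __name__ == "__main__":
    # Hypothetical tool names, one parser per refactored tool.
    parsers = [
        configure_argparse(tool, f"{tool} package list builder")
        for tool in ("python", "ruby", "javascript")
    ]
    surfaces = [flag_names(p) for p in parsers]
    print(all(surface == surfaces[0] for surface in surfaces))
```

Comparing help text rather than parser internals keeps the check close to what a user sees when running each tool with --help.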
AI was used for assistance.