refactor: standardize packagehallucination dataset tools #1786

Open
mvanhorn wants to merge 2 commits into NVIDIA:main from mvanhorn:fix/1745-packagehallucination-dataset-tools

@mvanhorn

Summary

Add a small shared helper module, tools/packagehallucination/_common.py, that exposes emit_record(name, first_seen) (which normalizes first_seen to an ISO-8601 string or None), the constant STANDARD_FIELDS = ("text", "package_first_seen"), and a shared argparse configurator that gives every tool the same --output, --format, and --help surface. Update each per-language main.py to import from _common, write JSONL records with the standardized fields, and validate dates before emitting (drop the value or set it to null on parse failure). Leave the existing dart/perl/raku tools alone in this PR unless the unification is mechanical: keep the diff scoped to python/ruby/javascript per the file list, and note follow-up coverage in the PR description.
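For illustration, here is a minimal sketch of how the record helper could behave. The names emit_record, normalise_first_seen, and STANDARD_FIELDS come from this PR; the function bodies, and the choice to return a dict rather than write directly to a sink, are assumptions and not the code in the diff.

```python
import json
from datetime import date, datetime

STANDARD_FIELDS = ("text", "package_first_seen")


def normalise_first_seen(value):
    """Return an ISO-8601 string, or None when the value is missing or unparseable."""
    if value is None:
        return None
    text = str(value).strip()
    # Try a plain date first so "2011-02-14" stays date-only,
    # then fall back to a full timestamp.
    for parse in (date.fromisoformat, datetime.fromisoformat):
        try:
            return parse(text).isoformat()
        except ValueError:
            continue
    return None


def emit_record(name, first_seen):
    """Build one record with exactly the standardized fields (assumed to return a dict)."""
    return dict(zip(STANDARD_FIELDS, (name, normalise_first_seen(first_seen))))


print(json.dumps(emit_record("requests", "2011-02-14")))
# {"text": "requests", "package_first_seen": "2011-02-14"}
print(json.dumps(emit_record("left-pad", "sometime in 2016")))
# {"text": "left-pad", "package_first_seen": null}
```

This sketch nulls an unparseable date instead of dropping the record, which is one of the two options the summary allows.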

Why this matters

The dataset-building tools under tools/packagehallucination/ (dart, javascript, perl, python, raku, ruby) produce the package-name corpora hosted at huggingface.co/garak-llm/datasets that drive the packagehallucination detectors. Maintainer jmartin-tech filed #1745 calling out three inconsistencies: per-tool output schemas drift on field names (there is no canonical text + package_first_seen pair), package_first_seen is emitted unevenly and lacks a date-format contract, and CLI --help output and option names diverge, so the tools cannot share an invocation recipe. The issue is labeled quality-accuracy and housekeeping and has no claimant yet.

Testing

  • Happy path: feeding _common.emit_record("requests", "2011-02-14") writes {"text": "requests", "package_first_seen": "2011-02-14"} to the JSONL sink.
  • Edge case: malformed or absent first-seen date is coerced to None and the record still validates; CLI --help for each refactored tool lists identical flag names.
  • Error path: invoking a tool against an empty upstream feed exits non-zero with a clear stderr message rather than writing a zero-byte dataset.
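The error path above could be sketched roughly as follows. write_jsonl is a helper name used in this PR, but this body, the exit-code convention, and the error message wording are assumptions made for illustration.

```python
import json
import sys


def write_jsonl(records, output_path):
    """Write records as JSON Lines; refuse to create an empty dataset.

    Returns a process exit code: 0 on success, 1 when there is nothing to write.
    """
    rows = list(records)
    if not rows:
        # Report on stderr and leave the output path untouched,
        # so no zero-byte dataset is ever produced.
        print("error: upstream feed returned no packages; nothing written",
              file=sys.stderr)
        return 1
    with open(output_path, "w", encoding="utf-8") as sink:
        for row in rows:
            sink.write(json.dumps(row) + "\n")
    return 0
```

A tool's entry point could then end with sys.exit(write_jsonl(records, args.output)) so that an empty feed surfaces as a non-zero exit status.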

Fixes #1745

AI was used for assistance.

Comment thread tools/packagehallucination/_common.py Outdated
Collaborator

@jmartin-tech May 20, 2026

While consistency is the goal, this approach couples independent tools that are currently standalone scripts to files that are not in the same Python module space. Reduced code is preferred in library components, but tools are preferred to be independent and self-contained: they are not shipped with the Python package and are often expected to run from the single tool script file, in a minimal environment with only the required dependencies installed.

mvanhorn added a commit to mvanhorn/garak that referenced this pull request May 20, 2026
Address NVIDIA#1786 review feedback: tools/packagehallucination/* are
intentionally standalone scripts that ship outside the python package
and run in minimal environments. Replace the shared _common.py with
inlined copies of the helpers (normalise_first_seen, emit_record,
configure_argparse, write_jsonl) in each tool's main.py, so each tool
keeps its self-contained shape while still emitting the same standard
{text, package_first_seen} JSONL schema and exposing the same
--output / --format / --help CLI surface called out by NVIDIA#1745.

Delete _common.py and the shared test scaffolding (the tools are not
imported as a package; the inlined helpers are simple enough that the
per-tool standalone shape is its own contract).

Verified that all three tools produce identical record schema and
identical CLI flag set with the inlined helpers.

Signed-off-by: Matt Van Horn <455140+mvanhorn@users.noreply.github.com>
@mvanhorn
Author

Good call -- I missed that these are intentionally self-contained scripts. Pushed f45cc9f which:

  • Deletes tools/packagehallucination/_common.py
  • Inlines the helpers (normalise_first_seen, emit_record, configure_argparse, write_jsonl) and the STANDARD_FIELDS constant into each per-language main.py
  • Deletes tests/test_packagehallucination_tools.py (same reasoning -- the tools aren't an importable package, so the shared test scaffolding shouldn't be either)

Each tool is now standalone again: no sys.path.insert(...), no from _common import ..., just python3 tools/packagehallucination/{lang}/main.py --output X in a minimal env with backoff + requests installed.

The #1745 contract is still satisfied: I verified that all three tools produce identical record schema ({text, package_first_seen}) and identical CLI flag set (--input, --output, --format). They just satisfy it through vendored copies rather than a shared module.
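As a rough sketch of what a vendored CLI helper might look like inside one main.py: the name configure_argparse and the --output / --format flags come from this PR, while the help strings, the required setting, and the single assumed format value "jsonl" are hypothetical.

```python
import argparse


def configure_argparse(description):
    """Build the parser each tool copies verbatim, so --help lists the same flags."""
    parser = argparse.ArgumentParser(description=description)
    parser.add_argument(
        "--output",
        required=True,
        help="path of the dataset file to write",
    )
    parser.add_argument(
        "--format",
        default="jsonl",
        choices=("jsonl",),  # assumption: only JSONL output is sketched here
        help="output format",
    )
    return parser


# Example invocation with an explicit argument list instead of sys.argv.
args = configure_argparse("Build the Python package dataset").parse_args(
    ["--output", "python_packages.jsonl"]
)
print(args.output, args.format)
```

Because the function is copied into each script rather than imported, every tool stays runnable on its own; the trade-off is that the copies have to be kept in sync by hand.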

DCO is signed off now too.

@jmartin-tech
Collaborator

Appreciate the quick redirection; review and testing will proceed.

Please also note that the DCO sign-off requirement applies to all commits. Please either squash the branch or amend the non-compliant commit messages so that the check passes.

mvanhorn added 2 commits May 21, 2026 06:26
Fixes NVIDIA#1745

Signed-off-by: Matt Van Horn <455140+mvanhorn@users.noreply.github.com>
Address NVIDIA#1786 review feedback: tools/packagehallucination/* are
intentionally standalone scripts that ship outside the python package
and run in minimal environments. Replace the shared _common.py with
inlined copies of the helpers (normalise_first_seen, emit_record,
configure_argparse, write_jsonl) in each tool's main.py, so each tool
keeps its self-contained shape while still emitting the same standard
{text, package_first_seen} JSONL schema and exposing the same
--output / --format / --help CLI surface called out by NVIDIA#1745.

Delete _common.py and the shared test scaffolding (the tools are not
imported as a package; the inlined helpers are simple enough that the
per-tool standalone shape is its own contract).

Verified that all three tools produce identical record schema and
identical CLI flag set with the inlined helpers.

Signed-off-by: Matt Van Horn <455140+mvanhorn@users.noreply.github.com>
@mvanhorn force-pushed the fix/1745-packagehallucination-dataset-tools branch from f45cc9f to d0556b8 on May 21, 2026 13:26
@mvanhorn
Author

Signed off both commits (12e1abc, d0556b8) and force-pushed. Rebased with --signoff onto current main; DCO should be green now. Thanks for the catch.

Development

Successfully merging this pull request may close these issues.

Review and standardize packagehallucination dataset tools

2 participants