fix(debian): only parse machine-readable copyright files with Format header #4754

Open
Bahtya wants to merge 1 commit into anchore:main from Bahtya:fix/debian-copyright-format-check

Bahtya commented Apr 9, 2026

Summary

  • Only parse debian/copyright files as machine-readable DEP-5 format when they contain the mandatory Format: header field pointing to the copyright-format specification URI (e.g., Format: https://www.debian.org/doc/packaging-manuals/copyright-format/1.0/)
  • Files without this header now return nil from parseLicensesFromCopyright, allowing the existing fallback license classifier to handle them via full-text license identification
  • Fixes nonsensical license extraction from free-form copyright files (e.g., "#", "Permission", "This", "see" from the python fixture)
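To illustrate how tokens like "see" or "Permission" can end up reported as licenses, here is a minimal sketch. The pattern `licenseField`, the helper `extractLicenseTokens`, and the sample text are assumptions made for illustration, not the regex or fixtures syft actually uses.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// licenseField is an illustrative pattern (an assumption, not syft's own):
// it captures the first whitespace-delimited token after "License:".
var licenseField = regexp.MustCompile(`^License:\s*(\S+)`)

// extractLicenseTokens applies the pattern line by line, with no check on
// whether the file is machine-readable at all.
func extractLicenseTokens(content string) []string {
	var tokens []string
	for _, line := range strings.Split(content, "\n") {
		if m := licenseField.FindStringSubmatch(line); m != nil {
			tokens = append(tokens, m[1])
		}
	}
	return tokens
}

func main() {
	// Hypothetical free-form copyright text without a Format header.
	freeForm := "This package was debianized by hand.\n\n" +
		"License: see the file COPYING for details\n" +
		"License: Permission is hereby granted, free of charge\n"
	fmt.Println(extractLicenseTokens(freeForm))
}
```

On this sample the sketch yields the tokens "see" and "Permission", neither of which is a license identifier. That is the class of result the Format header check is meant to prevent.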

Details

Per the DEP-5 spec, machine-readable copyright files must have a Format field whose value is a URI for the specification. The existing parser was applying License: regex patterns to all files regardless of format, producing garbage results for non-machine-readable files.

The fix adds a hasFormatHeader() check that verifies the first non-blank line of the file is a valid Format: header before proceeding with regex-based parsing. Files that fail the check are treated as non-machine-readable and fall through to the existing license classifier at package.go:132-138.
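The check described above could look roughly like the sketch below. This is a reconstruction from the description, not the code in the diff: the string-based signature, the http(s) prefix test, the case-sensitive field name, and the `parseLicenses` stand-in for parseLicensesFromCopyright are all assumptions.

```go
package main

import (
	"fmt"
	"strings"
)

// hasFormatHeader reports whether the first non-blank line is a "Format:"
// field whose value looks like an http(s) URI. Illustrative reconstruction;
// the function in the PR may validate the header differently.
func hasFormatHeader(content string) bool {
	for _, line := range strings.Split(content, "\n") {
		line = strings.TrimSpace(line)
		if line == "" {
			continue // leading blank lines are allowed before the header
		}
		if !strings.HasPrefix(line, "Format:") {
			return false // the first non-blank line is something else
		}
		value := strings.TrimSpace(strings.TrimPrefix(line, "Format:"))
		return strings.HasPrefix(value, "http://") || strings.HasPrefix(value, "https://")
	}
	return false // empty or blank-only content
}

// parseLicenses is a hypothetical stand-in for parseLicensesFromCopyright:
// it returns nil for free-form files so a caller can fall back to a
// full-text classifier.
func parseLicenses(content string) []string {
	if !hasFormatHeader(content) {
		return nil
	}
	var licenses []string
	for _, line := range strings.Split(content, "\n") {
		if strings.HasPrefix(line, "License:") {
			fields := strings.Fields(strings.TrimPrefix(line, "License:"))
			if len(fields) > 0 {
				licenses = append(licenses, fields[0])
			}
		}
	}
	return licenses
}

func main() {
	dep5 := "Format: https://www.debian.org/doc/packaging-manuals/copyright-format/1.0/\n\n" +
		"Files: *\nLicense: GPL-2+\n"
	freeForm := "This package was debianized by hand.\nLicense: see COPYING\n"
	fmt.Println(parseLicenses(dep5))
	fmt.Println(parseLicenses(freeForm) == nil)
}
```

Returning nil rather than an empty slice lets the caller distinguish "not machine-readable, use the classifier" from "machine-readable but no licenses found", which is the behavior the summary describes.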

Test plan

  • Updated test expectations: 6 non-machine-readable fixtures (libc6, trilicense, python, cuda, dev-kit, microsoft) now correctly return nil
  • Machine-readable fixtures (liblzma5, libaudit-common) continue to extract licenses correctly
  • Added TestHasFormatHeader with 7 cases covering: http/https URLs, blank lines before header, missing header, header not as first non-blank line, empty content, blank-only content
  • Added TestParseLicensesFromCopyrightInline verifying License: fields in a file without Format header are ignored

Closes #4708

…header

Only parse debian/copyright files as machine-readable DEP-5 format when
they contain the mandatory Format header field pointing to the copyright
specification URI. Files without this header are free-form text and
should not have License: regex patterns applied to them, which previously
produced nonsensical results like "#", "Permission", "This", "see" for
non-machine-readable files.

The fallback license classifier in the debian cataloger will handle
non-machine-readable files by doing full-text license identification.

Closes anchore#4708

Signed-off-by: Bahtya <bahtya@users.noreply.github.com>
Signed-off-by: Bahtya <bahtayr@gmail.com>

Bahtya commented Apr 11, 2026

Hi team, just wanted to follow up on this PR. Would appreciate any feedback!

Development

Successfully merging this pull request may close these issues.

fix parsing of debian/copyright files
