fix(debian): only parse machine-readable copyright files with Format header #4754
Open
Bahtya wants to merge 1 commit into anchore:main
…header

Only parse debian/copyright files as machine-readable DEP-5 format when they contain the mandatory Format header field pointing to the copyright specification URI. Files without this header are free-form text and should not have License: regex patterns applied to them, which previously produced nonsensical results like "#", "Permission", "This", "see" for non-machine-readable files.

The fallback license classifier in the debian cataloger will handle non-machine-readable files by doing full-text license identification.

Closes anchore#4708

Signed-off-by: Bahtya <bahtya@users.noreply.github.com>
Signed-off-by: Bahtya <bahtayr@gmail.com>
Author
Hi team, just wanted to follow up on this PR. Would appreciate any feedback!
Summary
- Only parse `debian/copyright` files as machine-readable DEP-5 format when they contain the mandatory `Format:` header field pointing to the copyright-format specification URI (e.g., `Format: https://www.debian.org/doc/packaging-manuals/copyright-format/1.0/`)
- Files without this header now return `nil` from `parseLicensesFromCopyright`, allowing the existing fallback license classifier to handle them via full-text license identification
- Removes nonsensical license values (e.g., `"#"`, `"Permission"`, `"This"`, `"see"` from the python fixture)

Details

Per the DEP-5 spec, machine-readable copyright files must have a `Format` field whose value is a URI for the specification. The existing parser applied `License:` regex patterns to all files regardless of format, producing garbage results for non-machine-readable files.

The fix adds a `hasFormatHeader()` check that validates that the first non-blank line of the file is a valid `Format:` header before proceeding with regex-based parsing. Non-machine-readable files fall through to the existing license classifier at `package.go:132-138`.

Test plan

- Non-machine-readable fixtures (`libc6`, `trilicense`, `python`, `cuda`, `dev-kit`, `microsoft`) now correctly return `nil`
- Machine-readable fixtures (`liblzma5`, `libaudit-common`) continue to extract licenses correctly
- Added `TestHasFormatHeader` with 7 cases covering: http/https URLs, blank lines before header, missing header, header not as first non-blank line, empty content, blank-only content
- Added `TestParseLicensesFromCopyrightInline` verifying that `License:` fields in a file without a `Format` header are ignored

Closes #4708
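
For readers who want to see the shape of the check described under Details, here is a minimal sketch of how a first-non-blank-line `Format:` gate could work. It is not the code from this PR. The regex patterns, the string-based signatures, and the `parseLicensesSketch` helper are assumptions made for illustration, and the real `parseLicensesFromCopyright` in the repository may differ in all of these.

```go
package main

import (
	"bufio"
	"fmt"
	"regexp"
	"strings"
)

// Assumed pattern: a "Format:" field whose value is an http or https URI.
var formatHeaderPattern = regexp.MustCompile(`^Format:\s*https?://\S+`)

// Assumed, simplified stand-in for the License: field regex.
var licensePattern = regexp.MustCompile(`^License:\s*(\S+)`)

// hasFormatHeader reports whether the first non-blank line of content
// is a Format: header with an http(s) URI. Empty or blank-only content
// yields false.
func hasFormatHeader(content string) bool {
	scanner := bufio.NewScanner(strings.NewReader(content))
	for scanner.Scan() {
		line := scanner.Text()
		if strings.TrimSpace(line) == "" {
			continue // skip leading blank lines
		}
		// Only the first non-blank line decides the outcome.
		return formatHeaderPattern.MatchString(line)
	}
	return false
}

// parseLicensesSketch (hypothetical helper) returns nil for files without
// a Format header so a caller can fall back to full-text classification.
func parseLicensesSketch(content string) []string {
	if !hasFormatHeader(content) {
		return nil
	}
	var licenses []string
	scanner := bufio.NewScanner(strings.NewReader(content))
	for scanner.Scan() {
		if m := licensePattern.FindStringSubmatch(scanner.Text()); m != nil {
			licenses = append(licenses, m[1])
		}
	}
	return licenses
}

func main() {
	dep5 := "Format: https://www.debian.org/doc/packaging-manuals/copyright-format/1.0/\n\nFiles: *\nLicense: GPL-2+\n"
	freeForm := "This package was debianized by hand.\n\nLicense: see /usr/share/common-licenses/GPL\n"

	fmt.Println(hasFormatHeader(dep5), parseLicensesSketch(dep5))
	fmt.Println(hasFormatHeader(freeForm), parseLicensesSketch(freeForm))
}
```

Returning `nil` rather than an empty slice lets a caller tell "not machine-readable, use the classifier" apart from "parsed, but found nothing", which is the distinction the fallback path relies on.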