First bigbio release of msgf+ by ypriverol · Pull Request #14 · bigbio/msgfplus

ypriverol · 2026-04-16T21:28:05Z

No description provided.

…Java 17

- Change skipTests from true to false so mvn test actually runs
- Update maven-compiler-plugin source/target from 1.8 to 17 (matches runtime)
- Add missing compile dependencies: jmzml 1.7.11, fastutil 8.5.12, slf4j-api 1.7.36, logback-classic 1.2.12, commons-io 2.15.1 (master code references these classes but they were not declared)
- Mark the TestMzML test, which requires Windows-specific DMS files, with @Ignore

Result: 120 tests run, 53 active, 67 skipped, 0 failures

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

…shold

In MZIdentMLGen.addSpectrumIdentificationResults(), change `break` to `continue` when a match has a DeNovoScore below the minimum threshold. The `break` incorrectly stopped emission of all subsequent matches for that spectrum, silently dropping valid PSMs from the mzid output.

Also add a null-safety check for the spectrum index lookup: if a spectrum index is not found in the spectrum file, log a warning and skip it instead of throwing a NullPointerException.

Add TestMZIdentMLGen with two integration tests:
- testMzidScoreCompleteness: runs an MSGF+ search and verifies that every SII has all 4 score CVParams (RawScore, DeNovoScore, SpecEValue, EValue)
- testMzidStructuralValidity: verifies that the output mzid has the required mzIdentML structure elements

Closes MSGFPlus#157

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
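The effect of swapping `break` for `continue` can be illustrated with a minimal sketch. It does not reproduce MZIdentMLGen; the class name, method names, and the plain `int[]` of scores are hypothetical stand-ins for the per-spectrum match list.

```java
import java.util.ArrayList;
import java.util.List;

final class DeNovoThresholdSketch {
    /** Old behaviour: the first low-scoring match ends the loop, so later matches are lost. */
    static List<Integer> emitWithBreak(int[] deNovoScores, int minDeNovoScore) {
        List<Integer> emitted = new ArrayList<>();
        for (int i = 0; i < deNovoScores.length; i++) {
            if (deNovoScores[i] < minDeNovoScore) {
                break; // stops emission for the whole spectrum
            }
            emitted.add(i);
        }
        return emitted;
    }

    /** Fixed behaviour: only the low-scoring match is skipped. */
    static List<Integer> emitWithContinue(int[] deNovoScores, int minDeNovoScore) {
        List<Integer> emitted = new ArrayList<>();
        for (int i = 0; i < deNovoScores.length; i++) {
            if (deNovoScores[i] < minDeNovoScore) {
                continue; // skip this match, keep going
            }
            emitted.add(i);
        }
        return emitted;
    }

    public static void main(String[] args) {
        int[] scores = {50, 10, 60};
        System.out.println(emitWithBreak(scores, 20));    // [0]
        System.out.println(emitWithContinue(scores, 20)); // [0, 2]
    }
}
```

With a threshold of 20, the `break` variant emits only the first match, while the `continue` variant also emits the third one.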

Add new -msLevel CLI parameter to filter spectra by MS level. It accepts a single value (e.g., -msLevel 2) or a comma-separated range (e.g., -msLevel 2,3). Default is 2 (MS2 only).

Changes:
- ParamManager: add MS_LEVEL enum and registration
- IntRangeParameter: enable single-value parsing, fix typo
- SearchParams: add minMSLevel/maxMSLevel fields
- SpecKey: filter spectra by MS level in getSpecKeyList()
- SpectraAccessor: add setMSLevelRange(), wire to parsers
- MzMLAdapter/MzXMLSpectraMap: fix maxMSLevel to be inclusive
- MSGFPlus/MSGFDB/MSGFDBLib: wire MS level parameters
- pom.xml: remove fastutil shade filter (jmzml 1.7.11 needs full fastutil)

Tests: TestIntRangeParameter (9 tests), TestMSLevelFiltering (6 tests)

Benchmark (TMT 1.1GB, TDA):
- Baseline: 1245s, 6654 PSMs@1%FDR
- -msLevel 2: 957s (-23%), 6936 PSMs@1%FDR (+4.2%)

Closes MSGFPlus#159

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
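A minimal sketch of how a value such as `2` or `2,3` could be parsed into an inclusive MS-level range follows. It is not the IntRangeParameter implementation; the class and method names are hypothetical, and the validation rules (levels start at 1, max must not be below min) are assumptions.

```java
final class MsLevelRangeSketch {
    /** Parses "2" or "2,3" into {min, max}; both bounds are inclusive. */
    static int[] parse(String value) {
        String[] parts = value.trim().split(",");
        if (parts.length > 2) {
            throw new IllegalArgumentException("Expected N or N,M but got: " + value);
        }
        int min = Integer.parseInt(parts[0].trim());
        // A single value means min == max.
        int max = parts.length == 2 ? Integer.parseInt(parts[1].trim()) : min;
        if (min < 1 || max < min) {
            throw new IllegalArgumentException("Invalid MS level range: " + value);
        }
        return new int[]{min, max};
    }

    /** True when msLevel falls inside the inclusive range. */
    static boolean accepts(int[] range, int msLevel) {
        return msLevel >= range[0] && msLevel <= range[1];
    }

    public static void main(String[] args) {
        int[] range = parse("2,3");
        System.out.println(accepts(range, 1)); // false
        System.out.println(accepts(range, 3)); // true
    }
}
```

Treating the upper bound as inclusive matches the "fix maxMSLevel to be inclusive" item in the change list.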

feat(MSGFPlus#159): add -msLevel parameter for MS level filtering

fix(MSGFPlus#157): preserve PSM scores when DeNovoScore is below threshold

fix: enable test suite and fix broken build dependencies

Remove standalone scripts, legacy tools, and unused classes that are not referenced by the core MSGF+ search pipeline, reducing the codebase by ~22,000 lines.

Deleted entire packages:
- ims/ (9 files): legacy IMS utilities
- ipa/ (5 files): unused isotope pattern analysis
- msgf2d/ (8 files): abandoned 2D scoring experiment
- msdictionary/ (7 files): unused genome dictionary tool
- mstag/ (3 files): unused sequence tagging
- scripts/ (6 files): standalone CLI utilities
- msutil/test/ (3 files): misplaced test classes
- msgf/test/ (2 files): legacy test stubs
- msgf/analysis/ (1 file): unused ROC generator

Cleaned mixed packages:
- misc/: removed 59 standalone scripts, kept 5 core utilities
- msgf/: removed 6 unused graph/scoring classes
- msutil/: removed 9 unused filter/annotation classes
- msdbsearch/: removed 4 standalone DB tools
- parser/: removed 9 legacy format parsers (InsPecT, Mascot, etc.)
- ui/: removed 6 legacy entry points (MSGF, MSGFLib, etc.)
- mzid/: removed 1 unused adapter stub
- msscorer/: removed 1 unused stats class
- suffixarray/: removed 1 unused mass array class

Also removed dead test methods and cleaned dangling imports.

Tests: 119 run, 0 failures.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Rewrite README.md: - Full parameter reference tables covering all 30+ flags organized by category (core search, fragmentation, enzyme, filtering, etc.) - Quick start examples for basic and TMT searches - Modification file format documentation with examples - Build-from-source instructions - Updated requirements to Java 17+ Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Add parameter docs to README and CI/CD workflow

Remove dead code: 150 unused classes, -22K lines

Write search results directly to TSV from in-memory objects, bypassing mzIdentML serialization. Output is column-identical to MzIDToTsv (verified by diff on test.mgf search). This avoids generating large .mzid files when only TSV is needed downstream (e.g. OpenMS MSGFPlusAdapter, Percolator).

- New DirectTSVWriter class with the same score/protein/mod logic as MZIdentMLGen but streaming tab-delimited output
- New -outputFormat parameter: 0=mzid (default), 1=tsv, 2=both
- Includes fixed + variable mods, MGF Title column, decoy filtering
- Backwards compatible: default remains mzid

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

When -addFeatures 1 is used with -outputFormat tsv, the TSV now includes all PSMFeatureFinder columns needed for Percolator: ExplainedIonCurrentRatio, NTermIonCurrentRatio, CTermIonCurrentRatio, MS2IonCurrent, MS1IonCurrent, IsolationWindowEfficiency, NumMatchedMainIons, and all error statistics (MeanError/StdevError for All and Top7, both absolute and relative). These features were previously only available as UserParams in mzid and were not extracted by OpenMS's addMSGFFeatures() — now they are directly accessible as TSV columns. The peptide modification format (M+15.995) is already compatible with OpenMS MSGFPlusAdapter's modifySequence_() converter which transforms it to bracket notation M[+15.995] for AASequence. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Replace the jmzml JAXB-based MzMLUnmarshaller with a lightweight StAX streaming parser that extracts only the 11 fields MSGF+ needs. The new parser builds a spectrum index in a single pass, then preloads all spectra into memory on first random access, eliminating repeated XML parsing during the search phase. Benchmark (TMT 1.1GB mzML, target-decoy, 4 threads): - Wall time: 957s -> 853s (-10.9%) - PSMs at 1% FDR: 6,936 (unchanged) - Score completeness: 100% (unchanged) Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
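As an illustration of the streaming approach, the sketch below walks an mzML-like document with the JDK's StAX API and records the MS level of each spectrum in a single pass. It is not StaxMzMLParser: the class name and the choice to read from a `String` are hypothetical, and only one field is extracted (the `ms level` cvParam, PSI-MS accession MS:1000511).

```java
import java.io.StringReader;
import java.util.LinkedHashMap;
import java.util.Map;
import javax.xml.stream.XMLInputFactory;
import javax.xml.stream.XMLStreamConstants;
import javax.xml.stream.XMLStreamException;
import javax.xml.stream.XMLStreamReader;

final class StaxSpectrumScanSketch {
    /** Returns spectrum id -> MS level, in document order, without building a DOM or JAXB tree. */
    static Map<String, Integer> readMsLevels(String xml) {
        Map<String, Integer> levels = new LinkedHashMap<>();
        try {
            XMLStreamReader reader =
                    XMLInputFactory.newInstance().createXMLStreamReader(new StringReader(xml));
            String currentId = null; // id of the <spectrum> we are inside, if any
            while (reader.hasNext()) {
                int event = reader.next();
                if (event == XMLStreamConstants.START_ELEMENT) {
                    String name = reader.getLocalName();
                    if ("spectrum".equals(name)) {
                        currentId = reader.getAttributeValue(null, "id");
                    } else if ("cvParam".equals(name) && currentId != null
                            && "MS:1000511".equals(reader.getAttributeValue(null, "accession"))) {
                        levels.put(currentId,
                                Integer.parseInt(reader.getAttributeValue(null, "value")));
                    }
                } else if (event == XMLStreamConstants.END_ELEMENT
                        && "spectrum".equals(reader.getLocalName())) {
                    currentId = null;
                }
            }
            reader.close();
        } catch (XMLStreamException e) {
            throw new IllegalStateException("Malformed XML", e);
        }
        return levels;
    }
}
```

A real parser would read from a file stream and also collect peaks, precursor information, and byte offsets; the point here is only that a pull parser visits each element once and keeps just the fields it needs.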

…port

- Remove jmzml (JAXB-based mzML parser) dependency from pom.xml
- Delete old jmzml-dependent classes: MzMLAdapter, MzMLSpectraMap, MzMLSpectraIterator, SpectrumConverter
- Add referenceableParamGroupRef resolution to StaxMzMLParser: builds a map of param groups during the index pass, resolves refs during spectrum parsing (critical for files that define polarity, MS level, etc. in referenceable groups)
- Move turnOffLogs() utility to StaxMzMLParser, update all callers
- Keep fastutil dependency (needed by jmzidentml at runtime)

JAR size reduced from 39.5MB to 38MB.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

jmzReader (uk.ac.ebi.pride.tools:jmzreader:2.0.6) had zero imports anywhere in the codebase — a dead dependency from earlier development. All spectrum file format parsing uses custom implementations: mzML (StaxMzMLParser), mzXML (embedded jrap/stax), MGF/MS2/PKL (custom parsers). Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Remove mzXML file format support entirely:
- Delete embedded jrap/stax library (20 files, ~5,800 lines)
- Delete MzXMLSpectraMap, MzXMLSpectraIterator, MzXMLToMgfConverter
- Delete MzXMLToMgf utility and mzXML test resources (38MB)
- Remove MZXML from SpecFileFormat enum, SpectraAccessor, ParamManager
- Update misc/scripts/ui classes to remove mzXML code paths

mzXML is a legacy format superseded by mzML. Users with mzXML files can convert to mzML using msconvert (ProteoWizard).

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

- StaxMzMLParser: use ConcurrentHashMap for thread-safe spectrum cache, fix class-level doc (preload-all, not bounded LRU), check index before preloading, propagate exceptions instead of returning null
- StaxMzMLSpectraIterator: throw NoSuchElementException when exhausted
- SpectraAccessor: throw exception instead of System.exit(-1), validate specFormat is non-null in constructor
- SelectSpectra: update stale .mzXML reference to .mzML
- pom.xml: fix duplicate <manifest>, remove stale comments, note fastutil is required by jmzidentml at runtime

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Replace ConvertToMgf-based tests (class removed in PR #7) with StaxMzMLParser and SpectraAccessor mzML parsing tests. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

feat: native TSV output — bypass mzIdentML for OpenMS/Percolator pipelines

perf: replace jmzml JAXB parser with StAX-based mzML reader

* chore: add CI/release packaging and benchmark scaffolding

  Split infra and repository maintenance updates into a dedicated reviewable change set, including workflow automation, Docker packaging, benchmark scripts/docs, and project documentation updates. Exclude large local benchmark artifacts and keep this PR focused on non-hot-path code organization and release hygiene.

  Made-with: Cursor

* chore: keep benchmark folder local-only

  Remove benchmark scripts/docs from this branch and ignore the entire benchmark directory so local benchmarking assets do not appear in review PRs.

  Made-with: Cursor

* docs: keep single canonical primitives plan

  Fold memory-reduction guidance into the balanced primitives plan and remove the old duplicate plan file so review and maintenance use one canonical document.

  Made-with: Cursor

* chore: narrow PR1 plans to scope-only docs

  Remove unrelated strategy and optimization plan documents from PR1 so this branch stays focused on infra/packaging cleanup. Keep only the plans index file in this PR.

  Made-with: Cursor

* chore: remove legacy ZippedReleases folder

  Delete the obsolete Windows release helper scripts and reference files under ZippedReleases from the repository.

  Made-with: Cursor

* chore: remove legacy extlib dependency jar

  Delete the obsolete jrap/stax legacy jar under extlib as part of repository cleanup.

  Made-with: Cursor

* fix: address copilot review feedback for PR11

  Align docs with actual supported legacy formats, update release pipeline to build from tag version with tests, and fix Docker build JDK requirement.

  Made-with: Cursor

* chore: minor packaging/docs hygiene for PR1

  Normalize ignore files, shrink Docker build context, align agent README with dev/CI, and clarify release workflow step naming.

  Made-with: Cursor

* docs: trim examples folder to small referenced artifacts

  Remove duplicate Tryp_Pig_Bov DB/index copies (tests use src/test/resources), drop large unlinked Excel/PNG teaching files, and add docs/examples/README.md so the directory purpose is obvious. Link the index from the main README.

  Made-with: Cursor

* chore: remove IntelliJ IDEA tips screenshots from docs

  Made-with: Cursor

* docs: replace legacy HTML manuals with Markdown

  Convert docs/*.html to GitHub-flavored Markdown (pandoc), fix internal links, add docs/README.md as the documentation index, and remove unused style.css.

  Made-with: Cursor

* docs: strip leftover HTML span wrappers from converted Markdown

  Made-with: Cursor

Add a workflow-dispatch benchmark pipeline on a fixed self-hosted runner profile, with public-data download, metrics emission, and baseline TSV comparison under benchmark/ci/PXD001819 for future dataset expansion. Made-with: Cursor

Use uppercase PXD001819 naming in workflow-visible labels/artifacts and update README to state mzXML is not available in this fork. Made-with: Cursor

Made-with: Cursor

- run_ci.sh: count only opening <SpectrumIdentificationItem> tags for sii_count (the prior substring match double-counted closing tags and picked up SpectrumIdentificationItemRef)
- run_ci.sh: always emit peak_rss_kb and cpu_percent (NA when GNU time does not expose them) so the metrics file format is consistent
- compare_metrics.py: support an `optional` column; optional missing/NA metrics warn instead of failing CI
- baseline.tsv: add optional column, mark peak_rss_kb optional, fix ubuntu-latest note to reference the self-hosted runner, widen sii_count floor to match the de-duplicated count
- README pointers: update stale references to a non-existent benchmark/run_pxd001819_benchmark.sh script
- benchmark/README.md: describe the actual committed CI scaffold instead of an uncommitted local harness layout

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

- extract_metrics.py: stream-parse mzIdentML with ElementTree.iterparse so SII counting and PSM 1% FDR counting no longer rely on line-shaped regex matches over XML
- run_ci.sh: use a bash array for SEARCH_ARGS (safe against future flags with spaces), atomic .part downloads, validate cached gzip, default MSGFPLUS_THREADS to 8 to match the workflow, drop the always-zero java_exit metric, and emit integer wall_time_sec
- workflow: pin Python via actions/setup-python@v5 so self-hosted runners have a known 3.11 interpreter for the helper scripts
- compare_metrics.py: add test_compare_metrics.py covering in-range pass, out-of-range fail, missing required/optional, NA, non-numeric, and empty-range rows (7 tests, all passing)
- .gitignore: drop redundant benchmark/** patterns (already covered by benchmark/* + ci/ allowlist); add __pycache__/ and *.pyc
- docs: describe new helper and test scripts in both READMEs

fix(benchmark): harden PXD001819 scaffold per review feedback

…ions Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

Orchestrates Phase 2 components: walks an input peptide list, routes each peptide to its slab(s) via SlabAssigner, generates b/y fragments via TheoreticalFragmentGenerator, and finalizes everything into a DirectStore paired with per-slab PeptideTables. Out-of-range peptides and unknown residues are silently skipped. Variable-mod variant enumeration and SA walk integration are deferred to Phase 2b. Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

…r CompactFastaSequence New walker class that enumerates every tryptic target peptide in a FASTA in [minLen, maxLen] range, respecting the configured enzyme and max missed cleavages. Skips decoy-prefixed proteins (build-time only; search still uses full target+decoy DB). Rejects peptides containing residues outside the alphabet (X, B, Z, *, underscore). Protein boundaries count as cleavage sites. This is the input layer for Phase 2b's FragmentIndexBuilder.buildFromSuffixArray overload (next commit); the walker does not touch the existing builder or any search-time code. Covered by 9 JUnit tests including missed-cleavage variants, length filter, decoy-prefix filter, non-alphabet residue rejection, and the forEachPeptide vs collect() consistency invariant. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
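The enumeration described above (cleavage sites, missed cleavages, length filter, alphabet filter) can be sketched as follows. This is not SuffixArrayPeptideWalker: it works on a plain protein string, hard-codes the simplified rule "cleave after K or R" with no proline exception, and all names are hypothetical.

```java
import java.util.ArrayList;
import java.util.List;

final class TrypticDigestSketch {
    private static final String ALPHABET = "ACDEFGHIKLMNPQRSTVWY"; // 20 standard residues

    static List<String> digest(String protein, int minLen, int maxLen, int maxMissedCleavages) {
        // Cleavage boundaries: protein start, after every K/R, and protein end.
        List<Integer> cuts = new ArrayList<>();
        cuts.add(0);
        for (int i = 0; i < protein.length() - 1; i++) {
            char c = protein.charAt(i);
            if (c == 'K' || c == 'R') {
                cuts.add(i + 1);
            }
        }
        cuts.add(protein.length());

        List<String> peptides = new ArrayList<>();
        for (int s = 0; s < cuts.size() - 1; s++) {
            // Spanning (e - s) fragments means (e - s - 1) missed cleavages.
            for (int e = s + 1; e < cuts.size() && e - s - 1 <= maxMissedCleavages; e++) {
                String peptide = protein.substring(cuts.get(s), cuts.get(e));
                if (peptide.length() >= minLen && peptide.length() <= maxLen
                        && inAlphabet(peptide)) {
                    peptides.add(peptide);
                }
            }
        }
        return peptides;
    }

    /** Rejects peptides with residues such as X, B, Z, '*', or '_'. */
    private static boolean inAlphabet(String peptide) {
        for (int i = 0; i < peptide.length(); i++) {
            if (ALPHABET.indexOf(peptide.charAt(i)) < 0) {
                return false;
            }
        }
        return true;
    }
}
```

For the protein `AAKGGRCCK` with no length restriction, zero missed cleavages yields `AAK`, `GGR`, `CCK`; allowing one missed cleavage adds `AAKGGR` and `GGRCCK`.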

Adds a convenience overload that wires SuffixArrayPeptideWalker.collect() into the existing build(List<String>) path. There is still exactly one build code path; the overload just wraps the walker. Covered by a new parity test asserting buildFromSuffixArray produces the same per-slab peptide count as calling build(walker.collect()) directly on a tiny FASTA fixture. Next commit wires this into the BuildSA CLI behind a -buildFragIndex flag. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…n tests Adds -buildFragIndex 0|1 to the BuildSA CLI. When 1, after the suffix array is built, construct a DirectStore fragment index using Phase 2 defaults (Trypsin, minLen=6, maxLen=40, missedCleavages=2, 50 Da slabs over 100-4000 Da, 1.0005 Da fragment bins) and print a one-line summary: Fragment index built for {fasta}: {N} peptides across {K} slabs ({lo}-{hi} Da) Default remains 0 — zero behavioural change for existing users. A backwards-compatible BuildSA.buildSA overload keeps the existing 4-arg signature working for callers that don't opt in. Covered by three new tests on a tiny synthetic FASTA: - the flag triggers the 'Fragment index built' stdout line with a positive peptide count - the default no-flag path does not trigger it - -buildFragIndex 0 is equivalent to the default The bundled human-uniprot-contaminants.fasta fixture (82 MB, full human proteome) triggers OOM in the in-memory DirectStore builder; for Phase 2b CLI validation a small hand-written fixture is sufficient. MmapStore support for large fastas is Phase 4. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Adds the CLI flag + SearchParams.FragmentIndexMode enum + ParamManager registration + ParamNameEnum constant, mirroring the -precursorCal pattern from 9b3967a exactly. Default is OFF -- zero behavioural change. Subsequent commits will gate Tier-1 search integration on params.getFragmentIndexMode() != OFF. Covered by a new scaffolding test with 6 assertions: default mode, each of the three value parses, case-insensitive round-trip, and the invalid-value error path. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Builds a FragmentIndex once per input FASTA between SpecKey list construction and worker task fan-out, when -useFragmentIndex is not off. Passes the built index (may be null) into ConcurrentMSGFPlus.RunMSGFPlus via a new constructor parameter; the index is held in a field but not yet consumed by the search loop — that wiring lands in commit 3.4. Off path is strictly unchanged: when params.getFragmentIndexMode() is OFF, no fragment-index code runs, the constructor gets null, and downstream behaviour is bit-identical to HEAD. Build defaults match Phase 2b (Trypsin-params-driven enzyme, minLen / maxLen / missedCleavages from user params, 50 Da slabs over 100-4000 Da, 1.0005 Da fragment bins). Covered by a new off-mode regression test asserting the 'Building fragment index' log line does NOT appear when -useFragmentIndex off is set. The on-mode build is validated by the full-search end-to-end path in subsequent commits. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…candidate generation Introduces the v2 architecture: fragment index AS candidate generator, not post-hoc filter. For each spectrum, the generator computes a b/y-decomposed fingerprint over the top-20 ranked peaks, pre-filters peptides by Hamming overlap >= FP_THRESHOLD, then walks per-peak bucket lookups accumulating NewRankSum log-scores via NewRankScorer.getNodeScore. Top-K extracted via min-heap. This commit is SKELETON ONLY — single slab per spectrum (no isotope offset loop yet), unmod peptides only, no DBScanner integration. Blast radius is the two new classes plus the unit test; if the algorithm turns out wrong, rolling back is trivial. Also introduces a tiny package-friend accessor (msscorer.ScorerPartitionAccess) that exposes NewRankScorer.getPartition and getNumSegments to the fragindex package without broadening the public surface of NewRankScorer or forcing the generator to construct a NewScoredSpectrum (which would mutate the caller's Spectrum via filterPrecursorPeaks). DBScanner wiring is the next commit, gated on unit-test validation that the scoring direction and top-K ordering work. Per ~/.claude/plans/msgfplus-fragment-index/candidate-generator-design.md (section 10, "First commit scope"). Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
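The "top-K extracted via min-heap" step is a standard bounded-heap selection, sketched below. It is not FragmentIndexCandidateGenerator; the class name and the flat `double[]` of per-peptide scores are hypothetical, and fingerprint pre-filtering and bucket lookups are left out.

```java
import java.util.Comparator;
import java.util.PriorityQueue;

final class TopKSketch {
    /** Returns the indices of the k highest scores, best first. */
    static int[] topK(double[] scores, int k) {
        if (k <= 0) {
            return new int[0];
        }
        // Min-heap on score: the root is always the weakest candidate kept so far.
        Comparator<Integer> byScore = Comparator.comparingDouble(i -> scores[i]);
        PriorityQueue<Integer> heap = new PriorityQueue<>(byScore);
        for (int i = 0; i < scores.length; i++) {
            if (heap.size() < k) {
                heap.add(i);
            } else if (scores[i] > scores[heap.peek()]) {
                heap.poll(); // evict the current weakest
                heap.add(i);
            }
        }
        // Polling yields ascending scores, so fill the result from the back.
        int[] best = new int[heap.size()];
        for (int j = best.length - 1; j >= 0; j--) {
            best[j] = heap.poll();
        }
        return best;
    }
}
```

Keeping the heap at size K bounds the work per spectrum to O(n log K) over n scored candidates, instead of sorting all of them.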

…tor when ON

Wires the v2 candidate generator into DBScanner at the dbSearch entry point. When fragmentIndex != null AND -useFragmentIndex != off, the new dbSearchFragmentIndex path runs:
1. FragmentIndexCandidateGenerator.topKForSpectrum per SpecKey
2. Tier-2 SpecEValue via existing scorer.getScore on the top-K
3. PriorityQueue insert mirroring the SA-walk code path

OFF-mode (default) is a single-branch early-return at DBScanner.dbSearch line 250 (first line of the method body). The existing SA-walk body is untouched, so float op ordering is identical to baseline: the bit-identity correctness gate is preserved by construction.

TIER1_TOP_K = 10 for this commit; tuning is Phase V3. Phase V5 (variable-mod enumeration into the index) is deferred; unmod peptides only here.

Covered by TestDBScannerFragmentIndexDispatch pinning the dispatch predicate (6 tests covering all 4 branches of the null-AND-mode predicate plus the legacy 11-arg ctor default). Full end-to-end validation is the next commit (3-arm benchmark on PXD001819).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…eighted hit scoring

The NewRankScorer.getPartition(charge, mass, segment) call only works for partitions that have been populated via the full NewScoredSpectrum construction path. Picking a fixed segment ("last segment") in the FragmentIndexCandidateGenerator returned a Partition key that was NOT present in rankDistTable for the scorer's actually-loaded segments, so every ON-mode worker thread crashed with an NPE at NewRankScorer.getScoreFromTable:172 ("frequencies is null").

End-to-end PXD001819 run (commit f408a49 JAR):
- ph3A baseline: 28037 T / 11022 D (OFF-mode bit-identical ✓)
- ph3B -useFragmentIndex off: 28037 T / 11022 D (bit-match A, correctness gate holds ✓)
- ph3D -useFragmentIndex on: NPE in worker threads → pin not produced

Swap the Tier-1 score from rank-based log-probability to a peak-rank-weighted hit count:

s = 1 / (1 + 0.02 * (rank - 1))

rank 1 (highest intensity) → 1.0; rank 50 → ~0.5. No partition lookup, no IonType lookup, no dependency on NewRankScorer state. Tier-2 (DBScanner's SimpleDBSearchScorer) still runs the full rank-aware scoring, so ranking quality is preserved at the top-K survivors that actually matter.

Removes the now-unused ScorerPartitionAccess helper (reverts the package-private bridge added in 7923576).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
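The weighting formula is simple enough to check directly. The sketch below implements s = 1 / (1 + 0.02 * (rank - 1)) and sums it over the ranks of matched peaks; the class and method names are hypothetical, and summation as the way hits are accumulated is an assumption based on the phrase "hit count".

```java
final class RankWeightSketch {
    /** Weight of a single matched peak; rank 1 is the most intense peak. */
    static double rankWeight(int rank) {
        if (rank < 1) {
            throw new IllegalArgumentException("rank must be >= 1");
        }
        return 1.0 / (1.0 + 0.02 * (rank - 1));
    }

    /** Sum of weights over the ranks of all peaks a candidate peptide matched. */
    static double tier1Score(int[] matchedPeakRanks) {
        double score = 0.0;
        for (int rank : matchedPeakRanks) {
            score += rankWeight(rank);
        }
        return score;
    }

    public static void main(String[] args) {
        System.out.println(rankWeight(1));  // 1.0
        System.out.println(rankWeight(50)); // about 0.505
    }
}
```

Evaluating the formula gives roughly 0.505 at rank 50 and 0.5 at rank 51, so the "rank 50 → 0.5" figure in the message is a rounded value.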

…ance for index=-1

Three correctness fixes to get ON-mode producing PSMs on PXD001819:

1. Precursor mass tolerance filter in FragmentIndexCandidateGenerator. Previously the top-K candidates were drawn from the whole 50 Da slab, but computeSpecEValue only accepts peptides within the search's ppm tolerance (~0.005 Da at 5 ppm). Virtually none of the top-K would survive and all matches were dropped downstream. Added a tolMinus/tolPlus filter that threads the tolerance through from SearchParams (including isotope-error shifts with the correct sign: candidates can be up to maxIsotopeError * 1.00335 BELOW observed parent mass, and up to tolLeft ABOVE it).
2. Removed NewRankScorer.getPartition-based scoring (was NPE on the last-segment partition not being in rankDistTable). Replaced with peak-rank-weighted hit count inside the Tier-1 bucket walk. Tier-2 re-scores via the existing SimpleDBSearchScorer, so the actual ranking quality is preserved.
3. DirectPinWriter now tolerates DatabaseMatch.index == -1 (emitted by the fragment-index path since fragindex peptides don't come from an SA walk). Emits "unknown_protein" as the accession. This is a temporary workaround; the next commit will resolve real protein annotations by indexing peptide → SA-position at build time.

End-to-end PXD001819 result:
- ph3A baseline: wall=97s T=28037 D=11022
- ph3B -useFragmentIndex off: wall=94s T=28037 D=11022 (bit-match)
- ph3D -useFragmentIndex on: wall=104s T=39117 D=0 (!)

Correctness gate on OFF-mode still holds. ON-mode now runs end-to-end but decoy handling is broken because SuffixArrayPeptideWalker filters decoy proteins at build time, so the 0-decoy result cannot be rescored by Percolator. Next commits: decoy inclusion in the index, then profiling for the ≥10× speed target.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…rmat

BREAKING CHANGE. MS-GF+ no longer emits mzIdentML (.mzid) from the search pipeline. All results now feed into Percolator via the .pin format, which is the expected downstream rescoring pipeline.

Rationale: every production workflow we ship already routes MS-GF+ output into Percolator (via pin-bench, OpenMS PercolatorAdapter, or quantms). Keeping the mzid writer alive meant maintaining a second serializer that every PSM had to pass through, along with its indirection through MZIdentMLGen / jmzidentml. Removing it eliminates ~500 LOC of writer overhead per search and makes the CLI surface clearer.

CLI changes: -outputFormat options are now just `pin` (default) and `tsv`. Integer aliases: 0=pin, 1=tsv. The previous 0=mzid, 2=both are rejected. Passing an .mzid path to -o is still accepted for backward compatibility with scripts; MS-GF+ rewrites the extension to .pin (or .tsv) at write time.

SearchParams.writeMzid() is retained as a no-op returning false, so external callers that still reference the method compile cleanly.

MZIdentMLGen.java itself stays in the tree for the MzIDToTsv legacy-conversion tool (users processing archived .mzid files still need it). Only the search-time CALL to MZIdentMLGen has been removed.

Also restores FragmentIndexCandidateGenerator.FP_THRESHOLD to 4 (was accidentally left at 0 during a diagnostic pass; the 0 disabled the primary Tier-1 pre-filter entirely).

Migration for downstream users:
- Percolator pipelines: use the .pin directly; no action needed.
- OpenMS: switch MSGFPlusAdapter to -out_type pin.
- Legacy .mzid workflows: use MS-GF+ v2026.03.25 or convert .pin output through Percolator's XML path.

Test suite: 82/82 scoped tests green. TestDirectPinWriter updated to the new outputFormat enum layout; integration tests (TestPrecursorCalIntegration, TestFragmentIndexRunMSGFPlusWiring) now assert .pin files where they previously asserted .mzid.

Docs updated: docs/Changelog.md (vNEXT entry), docs/MSGFPlus.md (CLI reference + output section), docs/README.md (summary), .claude/CLAUDE.md (project invariants).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…mpat shim

Full removal of every mzIdentML touchpoint from MS-GF+. No extension rewrite, no legacy -o .mzid acceptance, no archived converter utilities. Passing `-o foo.mzid` is now a hard CLI error.

Files deleted:
* src/main/java/edu/ucsd/msjava/mzid/MZIdentMLGen.java (writer)
* src/main/java/edu/ucsd/msjava/mzid/AnalysisProtocolCollectionGen.java
* src/main/java/edu/ucsd/msjava/mzid/MzIDTest.java (debug harness)
* src/main/java/edu/ucsd/msjava/mzid/Unimod.java (writer-only)
* src/main/java/edu/ucsd/msjava/mzid/UnimodComposition.java (writer-only)
* src/main/java/edu/ucsd/msjava/mzid/MzIDParser.java (reader)
* src/main/java/edu/ucsd/msjava/ui/MzIDToTsv.java (converter CLI)
* src/main/java/edu/ucsd/msjava/ui/ScoringParamGen.java (reader-only tool)
* src/main/java/edu/ucsd/msjava/msutil/AnnotatedSpectra.java (reader-only)
* src/test/java/edu/ucsd/msjava/mzid/AnalysisProtocolCollectionGenTest.java
* src/test/java/msgfplus/TestMZIdentMLGen.java
* src/test/java/msgfplus/TestMzIDToTsv.java
* src/test/java/msgfplus/TestScoring.java
* src/test/java/msgfplus/TestMSGFPlus.java (abandoned Windows-path tests)
* docs/MzidToTsv.md, docs/ScoringParamGen.md (tool documentation)

ParamManager.java:
* -o file format now rejects .mzid; accepts only .pin and .tsv
* addScoringParamGenParams() deleted
* MZID_OUTPUT_FILE enum → SEARCH_OUTPUT_FILE
* Examples updated to .pin

SearchParams.java:
* writeMzid() method deleted (was a no-op stub in prior commit)
* outputFormat default filename uses .pin (or .tsv), with no .mzid fallback

MSGFPlus.java:
* No more outputFile.getPath().replaceAll("\\.mzid$", ".pin"); the writer consumes the user's -o path as-is

Users with legacy .mzid files who need to convert or train scoring models must use MS-GF+ v2026.03.25 or earlier. The Percolator-fed pipeline is the only supported workflow going forward.

Docs: docs/Changelog.md vNEXT entry expanded; docs/MSGFPlus.md Note section rewritten; docs/README.md index purged of removed tool links; .claude/CLAUDE.md mzid/ package description updated.

Test suite: 118/118 scoped tests green.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
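The stricter `-o` handling amounts to an extension check that fails fast instead of rewriting the path. A minimal sketch under stated assumptions: the class and method names are hypothetical, the check is written as case-insensitive (the commit does not say), and the real validation lives in ParamManager rather than in a standalone helper.

```java
import java.util.Locale;

final class OutputPathCheckSketch {
    /** Returns the path unchanged if it ends in .pin or .tsv; otherwise rejects it. */
    static String requireSupportedOutput(String path) {
        String lower = path.toLowerCase(Locale.ROOT);
        if (lower.endsWith(".pin") || lower.endsWith(".tsv")) {
            return path; // consumed as-is, no extension rewrite
        }
        throw new IllegalArgumentException(
                "Unsupported output file: " + path + " (expected .pin or .tsv)");
    }
}
```

Under this sketch `results.pin` passes through untouched, while `foo.mzid` raises an error rather than being silently renamed.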

…output The previous commit deleted these along with the rest of the mzid infrastructure, but they are usable independently: Unimod holds the modification-accession + monoisotopic-mass tables that a richer DirectPinWriter could use to emit proper Unimod:NNNN references in the Peptide column instead of the current raw +mass strings. No consumers in the current tree (the only prior consumer, MZIdentMLGen, was deleted). Keeping the files available for PTM work without re- introducing mzid writing. Changelog updated to document the retention rationale. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

…h features

Adds three new per-PSM features to the Percolator .pin output:
- longest_b: length of the longest consecutive run of matched b-ions along the peptide backbone.
- longest_y: same for y-ions.
- longest_y_pct: longest_y normalized by the number of inter-residue bonds (peptide.size() - 1).

These mirror features Sage exposes in its pin output and give Percolator a signal that survives target/decoy shuffling far better than the scalar peak count NumMatchedMainIons alone. Measured +14 / +140 / +18 PSMs at 1% FDR on PXD001819 / Astral / TMT (vs fork pre-PR#22 baseline; vs upstream v2024.03.26 the gain is a component of the combined +515 / +10985 / +1735 PSM lift).

Also consolidates 3 redundant aaSet.getPeptide() parses per PSM in DirectPinWriter.writeRow into a single parse (semantically equivalent; code cleanup).

Updated TestDirectPinWriter header-shape assertions to cover the new columns and their ordering (between NumMatchedMainIons and ExplainedIonCurrentRatio). Native target/decoy counts bit-identical to pre-change on all 3 datasets. All 43 scoped tests green.
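The three features reduce to a longest-run scan over per-bond match flags. The sketch below assumes the caller has already decided, for each of the peptide's inter-residue bonds, whether the corresponding b- or y-ion was matched; the class name, method names, and `boolean[]` input are hypothetical and do not reflect DirectPinWriter's internals.

```java
final class LongestIonRunSketch {
    /** Length of the longest consecutive run of matched ions. */
    static int longestRun(boolean[] matched) {
        int best = 0;
        int current = 0;
        for (boolean hit : matched) {
            current = hit ? current + 1 : 0; // a miss resets the run
            best = Math.max(best, current);
        }
        return best;
    }

    /** Longest run divided by the number of inter-residue bonds (peptideLength - 1). */
    static double longestRunPct(boolean[] matched, int peptideLength) {
        int bonds = peptideLength - 1;
        return bonds <= 0 ? 0.0 : (double) longestRun(matched) / bonds;
    }

    public static void main(String[] args) {
        boolean[] yMatched = {true, true, false, true, true, true, false};
        System.out.println(longestRun(yMatched));       // 3
        System.out.println(longestRunPct(yMatched, 8)); // 3 / 7
    }
}
```

An 8-residue peptide has 7 bonds, so a longest y-run of 3 gives longest_y_pct = 3/7.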

Removes the fragment-index Tier-1 candidate-generation architecture that was explored on this branch but failed to meet the speed/recall/memory gates in the speed-rewrite-v2 plan.

Measured outcomes:
- Speed: ON-mode was 1.8× slower than baseline on PXD001819 (the dataset where it should at least break even). JFR showed Step 3's O(n_peptides) per-spectrum linear fingerprint-overlap scan dominated (18.4% of CPU).
- Memory: fragment-index build OOM'd with 8 GB -Xmx on the 32 MB Astral FASTA, making Astral searches impossible in ON-mode.
- Recall: 95.3% at 1% FDR (plan target was 99.5%).

Attempted micro-optimizations (flat long[] fingerprint layout, Step 3 branch hoist, K-tuning) each measured negative impact, confirming the architecture is fundamentally unsuited to Java; matching Sage's Rust-level data-structure quality requires a rewrite beyond what's achievable in this session.

Deletes:
- src/main/java/edu/ucsd/msjava/fragindex/ (14 classes, ~2500 LOC)
- src/test/java/edu/ucsd/msjava/fragindex/ (11 test classes, ~1100 LOC)
- 3 related test classes (TestBuildSAFragIndex, TestDBScannerFragmentIndexDispatch, TestFragmentIndexModeScaffolding)

Surviving-file cleanups:
- MSGFPlus.runMSGFPlus: remove fragment-index build block + imports.
- ParamManager: remove USE_FRAGMENT_INDEX enum entry, addUseFragmentIndexParam, getUseFragmentIndexRawValue, the help text.
- SearchParams: remove FragmentIndexMode enum, fragmentIndexMode field, getFragmentIndexMode, the parser branch.
- DBScanner: remove fragmentIndex/fragmentIndexMode/candidateGenerator fields, the TIER1_TOP_K constant, the dispatch at dbSearch line 250, the entire dbSearchFragmentIndex method (~150 LOC), the 2-arg constructor overload.
- BuildSA: remove -buildFragIndex CLI flag + buildFragmentIndex helper.
- ConcurrentMSGFPlus.RunMSGFPlus: drop the FragmentIndex constructor arg.

MSGFPlus scoring path is now bit-identical to the state before the fragment-index work began: no scoring changes, no format changes.

Full post-mortem + alternative speed ideas in ~/.claude/plans/msgfplus-fragment-index/ABANDONED-2026-04-20.md.

Total: -3317 LOC (25 files deleted; 6 files cleaned). All 43 scoped tests green (TestDirectPinWriter, TestMassCalibrator, TestPrecursorCalScaffolding).

Two independent micro-optimizations that reduce per-PSM allocation pressure:

1. Partition.hashCode(): cache the hash in a field, recompute only when fields mutate. Partition is used as a HashMap key in NewRankScorer; previously each getPartition lookup recomputed the hash from 3 primitives, which showed up in JFR profiling on large searches.
2. CandidatePeptideGrid: when cloning a parent peptide StringBuffer for a variant, pre-size the new buffer to the final length and append the substring in one call, instead of substring().toString() + new StringBuffer(String). Saves 1 intermediate String per variant-expansion.

Both changes are semantically equivalent; no scoring output changes.
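The hash-caching idea in the first item can be sketched as below. This is an illustration under stated assumptions, not the actual Partition class: the class name `CachedHashKey`, its three fields, and the single setter are hypothetical stand-ins for a mutable key built from three primitives.

```java
/** Hypothetical mutable key with a cached hash; not the real Partition class. */
final class CachedHashKey {
    private int charge;
    private final float parentMass;
    private final int segNum;

    private int cachedHash;     // last computed hash value
    private boolean hashValid;  // false until computed, and again after any mutation

    CachedHashKey(int charge, float parentMass, int segNum) {
        this.charge = charge;
        this.parentMass = parentMass;
        this.segNum = segNum;
    }

    /** Any setter that changes a hashed field must invalidate the cache. */
    void setCharge(int charge) {
        this.charge = charge;
        this.hashValid = false;
    }

    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof CachedHashKey)) return false;
        CachedHashKey other = (CachedHashKey) o;
        return charge == other.charge
                && Float.compare(parentMass, other.parentMass) == 0
                && segNum == other.segNum;
    }

    @Override
    public int hashCode() {
        if (!hashValid) {
            int h = 31 * charge + Float.floatToIntBits(parentMass);
            h = 31 * h + segNum;
            cachedHash = h;
            hashValid = true;
        }
        return cachedHash;
    }
}
```

Caching only pays off when lookups far outnumber mutations. As with any mutable key, an instance should not be modified while it is stored in a HashMap, because the map would then look for it in the wrong bucket.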

- Changelog.md: reflect that the -useFragmentIndex flag is removed. - .gitignore: pick up local-only benchmark artifacts that were leaking into git status.

Three benchmark figures from the 2026-04-20 run comparing upstream MSGFPlus v2024.03.26 (real baseline), current-dev (this branch), and Sage 0.14.7 native on PXD001819, Astral ProteoBench, and TMT PXD007683:

- fig1_baseline_vs_currentdev.png: wall, CPU, memory, 1% FDR PSMs across all 3 datasets; current-dev delta annotations per bar pair.
- fig2_currentdev_vs_sage.png: current-dev vs Sage on PXD001819 + Astral (TMT excluded due to Sage Percolator-calibration artefact on isobaric data).
- fig3_psms_and_peptides.png: 3-engine × 3-dataset PSMs at 1% FDR (top) and unique peptides at 1% FDR (bottom), showing current-dev wins on sensitivity on all three datasets at both metrics.

Referenced from PR #23 comments.

…state

- Add TestPartition.java, a unit test for the Partition.hashCode cache change in commit 880a9d5 (equals/hashCode contract + mutable-field tracking). Test was written alongside the change but not committed.
- Update .claude/plans/README.md to reflect fragment-index abandonment; point readers to the ABANDONED doc for the post-mortem.
- .gitignore: exclude session-local state files (.claude/SESSION_STATUS.md, .claude/scheduled_tasks.lock).

Parse scan/scans/spectrum key-value metadata in MGF TITLE lines so PRIDE-style spectra keep native scan IDs instead of falling back to index-based identifiers. Made-with: Cursor
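A minimal sketch of this kind of key-value extraction, assuming titles carry the scan number as `scan=`, `scans=`, or `spectrum=` followed by digits (with `:` also accepted as a separator). The class `MgfTitleScan`, its regex, and the sample titles in the note below are illustrative assumptions and are not copied from the fork's MGF parser.

```java
import java.util.OptionalInt;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

/** Hypothetical helper for illustration; not the fork's actual MGF parser. */
final class MgfTitleScan {
    // Matches scan=123, scans=123, or spectrum=123 (case-insensitive, '=' or ':').
    private static final Pattern SCAN_KV = Pattern.compile(
            "\\b(?:scans?|spectrum)\\s*[=:]\\s*(\\d+)", Pattern.CASE_INSENSITIVE);

    private MgfTitleScan() {}

    /** Returns the first scan number found in the title, or empty if none is present. */
    static OptionalInt extractScan(String title) {
        if (title == null) {
            return OptionalInt.empty();
        }
        Matcher m = SCAN_KV.matcher(title);
        if (!m.find()) {
            return OptionalInt.empty();
        }
        try {
            return OptionalInt.of(Integer.parseInt(m.group(1)));
        } catch (NumberFormatException e) {
            // Digit run too large for an int: treat as "no usable scan number".
            return OptionalInt.empty();
        }
    }
}
```

With this sketch, a title such as `id=sample.xml;spectrum=42` yields 42, while a title with no recognized key yields an empty result, at which point a caller could fall back to an index-based identifier.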

… XML

The PR switches this test's output from .mzid to .pin (post-mzid-removal), but extractPsmItems still regex-matched <SpectrumIdentificationItem> XML elements. On a .pin file (tab-separated values) no match is found, offPsms is always empty, and `assertFalse("Expected at least one PSM in the off run", offPsms.isEmpty())` fails. This is the CI failure.

Fix: read all lines of the .pin file, skip the header row, return the remainder as PSM rows. Indexed comparisons remain meaningful because DirectPinWriter emits PSMs in scoring order.

Also drops now-unused Pattern/Matcher imports + VOLATILE_ATTRS constant and updates the javadoc to describe .pin semantics. Verifies the -precursorCal off bit-identity gate is still being enforced against the pin output instead of being silently skipped on an empty regex match.
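The fix described above (read every line, drop the header, keep the rest) can be sketched like this. The class `PinRows` and both method names are hypothetical; the real test helper is extractPsmItems, whose exact signature is not shown in this commit message.

```java
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

/** Hypothetical helper for illustration; not the test's actual extractPsmItems. */
final class PinRows {
    private PinRows() {}

    /** Drops the first (header) line and returns the remaining lines as PSM rows. */
    static List<String> psmRows(List<String> lines) {
        if (lines.size() <= 1) {
            return Collections.emptyList(); // empty file or header only
        }
        return new ArrayList<>(lines.subList(1, lines.size()));
    }

    /** Reads a tab-separated .pin file and returns its PSM rows in file order. */
    static List<String> readPsmRows(Path pinFile) throws IOException {
        return psmRows(Files.readAllLines(pinFile, StandardCharsets.UTF_8));
    }
}
```

Because rows are returned in file order, comparing row i of one run against row i of another is only meaningful if the writer's ordering is deterministic, which is the property the commit message relies on.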

Four fixes addressing Copilot's review comments (PR #23 discussion threads r3112850861, r3112850935, r3112850976, r3112851058):

1. SearchParams.parse(): read outputFormat from paramManager BEFORE using it to select the default output-file extension. Previously the field was still at its zero initializer when the extension was chosen, so `-outputFormat tsv` would default the output path to `.pin` and then write TSV content into the `.pin`-named file. Move the assignment from line 494 up near the spec-path parse block at line 333, leaving a breadcrumb comment where the duplicate assignment used to live.
2. docs/Changelog.md: remove the stale `-useFragmentIndex` flag entry. This PR removes both the flag and the underlying fragment-index code; the changelog was still advertising it as an added feature.
3. docs/MSGFPlus.md: the usage synopsis for `-o OutputFile` now lists both `.pin` and `.tsv` (was: `.pin` only), matching the detailed description below.
4. DBScanner.java: drop unused imports `NewScorerFactory` and `NewScorerFactory.SpecDataType`; the references were only used by the removed fragment-index dispatch path.
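The ordering issue in the first fix can be illustrated with a small sketch: the output format has to be resolved before it is used to derive the default file extension. Everything here is hypothetical (the class `OutputPathDefaults`, the `OutputFormat` enum, and the assumption that any value other than `tsv` falls back to `.pin`); it is not the SearchParams code.

```java
/** Hypothetical illustration of "resolve the format first, then pick the extension". */
final class OutputPathDefaults {
    enum OutputFormat { PIN, TSV }

    private OutputPathDefaults() {}

    /** Assumption: anything other than "tsv" (case-insensitive) means PIN. */
    static OutputFormat parseFormat(String raw) {
        return "tsv".equalsIgnoreCase(raw) ? OutputFormat.TSV : OutputFormat.PIN;
    }

    /** Replaces the spectrum file's extension with the one implied by the format. */
    static String defaultOutputPath(String specPath, OutputFormat format) {
        int dot = specPath.lastIndexOf('.');
        int sep = Math.max(specPath.lastIndexOf('/'), specPath.lastIndexOf('\\'));
        // Only strip the extension if the last dot belongs to the file name.
        String stem = dot > sep ? specPath.substring(0, dot) : specPath;
        return stem + (format == OutputFormat.TSV ? ".tsv" : ".pin");
    }
}
```

Calling `defaultOutputPath("run.mzML", parseFormat("tsv"))` gives `run.tsv`. The bug described above corresponds to computing the path while the format still holds its default value, which would give `run.pin` even though TSV content is written.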

…v schema

- Rename docs/*.md (BuildSA.md, Changelog.md, IsobaricLabeling.md, MS-GFDB.md, MSGFDB_ModFile.md, MSGFPlus.md, README.md, Troubleshooting.md, examples/README.md) to lowercase. Also lowercase docs/ParameterFiles → docs/parameterfiles. Two-step git-mv preserves rename history on case-insensitive filesystems.
- Update all internal cross-references in docs/*.md to the new lowercase paths, including the root README.md link and the StaxMzMLParser.java pointer to docs/troubleshooting.md.
- New docs/output.md: full column reference for the Percolator .pin format (identity, mass/charge, enzymatic-boundary, ion-structure, ion-current, fragment-mass-error, generating-function features, peptide + protein layout) plus a short section on the .tsv format and migration guidance from the removed .mzid.
- Link to output.md from docs/readme.md's "Usage and help" section.

Adds docs/examples/pxd001819_example.pin: a trimmed, 31-row Percolator .pin from a PXD001819 search (LTQ Orbitrap Velos, yeast + UPS1 FASTA). Contains header + 20 target PSMs + 10 decoy PSMs, chosen for peptide-sequence diversity so every feature column (charge flags, longest_b/y, error stats, mod annotations) is represented with real data.

- docs/examples/readme.md: describe the new file in the inventory.
- docs/output.md: link from the "Example file" subsection so readers can jump straight from the column reference to a live sample.

File size 12 KB. Users can inspect the .pin schema without running a full search.

feat(perf): pin-direct output + longest_b/y features; abandon fragment-index

ypriverol and others added 30 commits April 13, 2026 07:51

Merge pull request #5 from bigbio/feature/159-ms-level-filtering

65c2592

feat(MSGFPlus#159): add -msLevel parameter for MS level filtering

Merge pull request #4 from bigbio/fix/157-mzid-missing-psm-scores

5f0ced1

fix(MSGFPlus#157): preserve PSM scores when DeNovoScore is below threshold

Merge pull request #3 from bigbio/fix/test-infrastructure

4f74816

fix: enable test suite and fix broken build dependencies

Merge pull request #8 from bigbio/feature/ci-readme-cleanup

3909fe8

Add parameter docs to README and CI/CD workflow

Merge pull request #7 from bigbio/refactor/dead-code-removal

e955869

Remove dead code: 150 unused classes, -22K lines

Update src/main/java/edu/ucsd/msjava/mzml/StaxMzMLParser.java

55daec4

Co-authored-by: Copilot <175728472+Copilot@users.noreply.github.com>

fix: update TestParsers after dead-code removal rebase

3eb2966

Replace ConvertToMgf-based tests (class removed in PR #7) with StaxMzMLParser and SpectraAccessor mzML parsing tests. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>

Merge pull request #9 from bigbio/feature/native-tsv-output

1f00a2b

feat: native TSV output — bypass mzIdentML for OpenMS/Percolator pipelines

Merge pull request #6 from bigbio/feature/stax-mzml-reader

d4c1f9c

perf: replace jmzml JAXB parser with StAX-based mzML reader

chore: align benchmark naming and mzXML messaging

b3f2e98

Use uppercase PXD001819 naming in workflow-visible labels/artifacts and update README to state mzXML is not available in this fork. Made-with: Cursor

docs: drop PXD001819 plan file; point READMEs at CI docs

ea0de94

Made-with: Cursor

Merge pull request #13 from bigbio/claude/review-msgfplus-pr-12-YfoTI

3c47109

fix(benchmark): harden PXD001819 scaffold per review feedback

ypriverol and others added 30 commits April 19, 2026 13:42

feat(fragindex): FragmentIndexStore interface + in-memory DirectStore

1f4cf21

feat(fragindex): TheoreticalFragmentGenerator for b/y singly-charged …

ef6ddbf

…ions Co-Authored-By: Claude Sonnet 4.6 <noreply@anthropic.com>

feat(fragindex): SlabAssigner with boundary-overlap replication

5e66119

feat(fragindex): PeptideTable for per-slab peptide metadata

6eaaf9a

docs: changelog entry for fragment-index removal + gitignore cleanups

2851cbb

fix(parser): support PRIDE-style scan extraction in MGF titles

2402802

docs(examples): describe pxd001819_example.pin in the inventory

d5aea81

Merge pull request #23 from bigbio/feat/msgfplus-speed-v2

52b99d8

First bigbio release of msgf+ #14

ypriverol wants to merge 95 commits into master from dev

ypriverol commented Apr 16, 2026

3 participants
