perf(validator): free per-PR file content after scoring #1456
Merged
Drop MirrorFile head_content/base_content from each ScoredPR once the scalar scores are extracted. The full source text is only needed transiently for tree-diff scoring; retaining it on every ScoredPR across all miners in miner_evaluations for the whole round needlessly inflates peak memory. The persistent cache already does this in _scored_mirror_pr_for_cache; apply the same to the live round. File metadata (filename/additions/deletions) is preserved.
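The change described above can be sketched roughly as follows. This is a minimal illustration, not the repository's code: the two dataclasses are hypothetical stand-ins that carry only the fields named in this description, and drop_file_content is an assumed helper name.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class MirrorFile:
    # Hypothetical stand-in: only the fields mentioned in the PR description.
    filename: str
    additions: int
    deletions: int
    head_content: Optional[str] = None
    base_content: Optional[str] = None


@dataclass
class ScoredPR:
    # Hypothetical stand-in: one scalar score plus the attached file list.
    number: int
    base_score: float
    files: List[MirrorFile] = field(default_factory=list)


def drop_file_content(scored: ScoredPR) -> None:
    """Release the heavy source text; keep filename/additions/deletions."""
    for f in scored.files:
        f.head_content = None
        f.base_content = None


# Usage: call once the scalar scores have been extracted from the content.
pr = ScoredPR(
    number=1,
    base_score=12.5,
    files=[MirrorFile("a.py", 3, 1, "x = 1\n" * 1000, "x = 0\n" * 1000)],
)
drop_file_content(pr)
```

After the call, the ScoredPR still carries its scalar score and per-file metadata, while the strings holding the source text are no longer referenced from it and can be garbage-collected.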
LandynDev
approved these changes
Jun 5, 2026
Summary
score_pr attaches each PR's full source text to its ScoredPR via scored.files (MirrorFile.head_content / base_content, up to ~1 MB per file). That content is only needed transiently to compute the tree-diff scalar scores (token/structural/leaf counts + base score), but it currently stays attached to every ScoredPR and is held across all miners in miner_evaluations for the entire scoring round.

This frees the heavy text as soon as the scalar scores are extracted, keeping only the lightweight file metadata (filename / additions / deletions). The persistent cache already does exactly this: _scored_mirror_pr_for_cache sets files = None (classes.py); this just applies the same treatment to the live round dict.

Effect: peak scoring-round memory drops from roughly miners × PRs × files × content down to metadata + a single PR's content at a time, with no change to scoring output.

Why it's safe
The file content is consumed entirely within score_pr (the file_contents dict built by mirror_files_to_legacy and passed to calculate_base_score_for_pr_files). Nothing downstream reads head_content / base_content after that point:
- _calculate_pr_multipliers reads only PR metadata + scalar scores
- finalize_miner_scores operates on scalar scores
- bulk_store_evaluation → get_all_file_changes reads only file metadata (discards content)

Type of Change
Testing
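A check along the following lines would exercise the safety argument, namely that a metadata-only reader sees the same rows before and after the content is dropped. It is a sketch under stated assumptions: FileStub, metadata_rows, and strip_content are hypothetical names for illustration and do not come from the repository.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple


@dataclass
class FileStub:
    # Hypothetical stand-in for a scored file entry.
    filename: str
    additions: int
    deletions: int
    head_content: Optional[str] = None
    base_content: Optional[str] = None


def metadata_rows(files: List[FileStub]) -> List[Tuple[str, int, int]]:
    # Assumed downstream reader: touches metadata only, never content.
    return [(f.filename, f.additions, f.deletions) for f in files]


def strip_content(files: List[FileStub]) -> None:
    # Drop the source text in place, leaving metadata untouched.
    for f in files:
        f.head_content = None
        f.base_content = None


def test_strip_keeps_metadata() -> None:
    files = [
        FileStub("a.py", 3, 1, "new\n", "old\n"),
        FileStub("b.py", 0, 2, None, "gone\n"),  # deleted file: no head text
    ]
    before = metadata_rows(files)
    strip_content(files)
    # Metadata-only consumers must see identical rows after stripping.
    assert metadata_rows(files) == before
    assert all(f.head_content is None and f.base_content is None for f in files)


test_strip_keeps_metadata()
```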