perf(validator): free per-PR file content after scoring #1456

Merged: LandynDev merged 2 commits into test from perf/free-pr-content-after-scoring on Jun 5, 2026

anderdc (Collaborator) commented Jun 5, 2026

Summary

score_pr attaches each PR's full source text to its ScoredPR via scored.files → MirrorFile.head_content / base_content, up to ~1 MB per file. That content is needed only transiently, to compute the tree-diff scalar scores (token/structural/leaf counts + base score), but it currently stays attached to every ScoredPR and is held across all miners in miner_evaluations for the entire scoring round.

This frees the heavy text as soon as the scalar scores are extracted, keeping only the lightweight file metadata (filename / additions / deletions). The persistent cache already does exactly this — _scored_mirror_pr_for_cache sets files = None (classes.py); this just applies the same treatment to the live round dict.
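The idea can be sketched as follows. This is a minimal illustration, not the actual diff: the dataclass layouts are simplified guesses built from the field names mentioned above, and `release_file_content` is a hypothetical helper name.

```python
from dataclasses import dataclass, field
from typing import List, Optional


@dataclass
class MirrorFile:
    # Field names follow the PR description; the real class may differ.
    filename: str
    additions: int
    deletions: int
    head_content: Optional[str] = None
    base_content: Optional[str] = None


@dataclass
class ScoredPR:
    # Simplified stand-in: only one scalar score is modelled here.
    number: int
    base_score: float = 0.0
    files: List[MirrorFile] = field(default_factory=list)


def release_file_content(scored: ScoredPR) -> None:
    """Hypothetical helper: drop the heavy text, keep the file metadata."""
    for f in scored.files:
        f.head_content = None
        f.base_content = None


# Usage: call once the scalar scores have been written to the ScoredPR.
pr = ScoredPR(
    number=1,
    base_score=12.5,
    files=[MirrorFile("app.py", 3, 1, head_content="x = 1\n", base_content="x = 0\n")],
)
release_file_content(pr)
```

Setting the attributes to `None` removes the ScoredPR's references to the strings, so they can be reclaimed once nothing else holds them, while `filename`, `additions`, and `deletions` stay readable.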

Effect: peak scoring-round memory drops from roughly miners × PRs × files × content down to metadata + a single PR's content at a time, with no change to scoring output.
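As a rough back-of-the-envelope check of that claim, the two peaks can be compared with a small calculation. All input numbers below are illustrative assumptions, not measurements from the validator, and the function names are hypothetical.

```python
def retained_bytes(miners: int, prs_per_miner: int, files_per_pr: int,
                   content_bytes_per_file: int) -> int:
    """Content held for the whole round: miners x PRs x files x content."""
    return miners * prs_per_miner * files_per_pr * content_bytes_per_file


def freed_bytes(miners: int, prs_per_miner: int, files_per_pr: int,
                content_bytes_per_file: int, metadata_bytes_per_file: int) -> int:
    """Metadata for every file, plus content for the one PR being scored."""
    metadata = miners * prs_per_miner * files_per_pr * metadata_bytes_per_file
    one_pr_content = files_per_pr * content_bytes_per_file
    return metadata + one_pr_content


# Illustrative inputs only (assumptions, not measured values).
MINERS, PRS, FILES = 100, 20, 10
CONTENT = 200_000   # head + base text per file, in bytes
METADATA = 200      # filename plus two counters per file, in bytes

before = retained_bytes(MINERS, PRS, FILES, CONTENT)
after = freed_bytes(MINERS, PRS, FILES, CONTENT, METADATA)
print(f"before: {before / 1e9:.1f} GB, after: {after / 1e6:.1f} MB")
```

Under these assumed inputs the retained-content term dominates by roughly three orders of magnitude; the real ratio depends on actual file sizes and round composition.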

Why it's safe

The file content is consumed entirely within score_pr (the file_contents dict built by mirror_files_to_legacy and passed to calculate_base_score_for_pr_files). Nothing downstream reads head_content / base_content after that point:

  • _calculate_pr_multipliers reads only PR metadata + scalar scores
  • finalize_miner_scores operates on scalar scores
  • bulk_store_evaluation → get_all_file_changes reads only file metadata (discards content)
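The safety argument above amounts to saying that every downstream consumer produces the same result whether or not the content is still attached. A regression check along those lines might look like this sketch; `file_change_rows` is a hypothetical stand-in for a metadata-only reader and is not the project's get_all_file_changes.

```python
from types import SimpleNamespace


def file_change_rows(files):
    """Hypothetical metadata-only consumer: never touches file content."""
    return [(f.filename, f.additions, f.deletions) for f in files]


def metadata_survives_release() -> bool:
    """Return True if the consumer's output is unchanged after freeing content."""
    files = [
        SimpleNamespace(filename="a.py", additions=3, deletions=1,
                        head_content="x = 1\n", base_content="x = 0\n"),
        SimpleNamespace(filename="b.py", additions=0, deletions=7,
                        head_content="", base_content="old\n" * 7),
    ]
    before = file_change_rows(files)
    for f in files:
        # Same treatment as described in the summary: drop text, keep metadata.
        f.head_content = None
        f.base_content = None
    after = file_change_rows(files)
    return before == after
```

A test like this would only guard the consumers it exercises; any new code path that reads head_content / base_content after score_pr would need its own check.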

Type of Change

  • Performance / optimization

Testing

  • Existing mirror scoring tests pass (72 passed)

anderdc and others added 2 commits June 5, 2026 15:05
Drop MirrorFile head_content/base_content from each ScoredPR once the
scalar scores are extracted. The full source text is only needed
transiently for tree-diff scoring; retaining it on every ScoredPR across
all miners in miner_evaluations for the whole round needlessly inflates
peak memory. The persistent cache already does this in
_scored_mirror_pr_for_cache; apply the same to the live round. File
metadata (filename/additions/deletions) is preserved.
LandynDev merged commit a1fc8e1 into test on Jun 5, 2026
3 checks passed