
fix(perf): drop old assets query performance #35591

Merged
wezell merged 6 commits into main from issue-35590-drop-old-assets-performance
May 6, 2026

Conversation

@wezell
Member

@wezell wezell commented May 6, 2026

ref: #35590

  • Improves performance of both queries.
  • Commits after every batch.

This PR fixes: #35590

@claude
Contributor

claude Bot commented May 6, 2026

Claude finished @wezell's task in 3m 46s.


PR Review

The NULL-handling fix is correct and the new regression test exercises it well. A few things worth a second look before merging.

Findings

1. Semantic change to content.drop-versions.greater-than is not documented.
Both queries now exclude working/live inodes from the count. With greater-than = 100, a tenant ends up with 101 total versions kept (100 old + working/live), where previously the contract was 100 total. The test expectation was updated 100 → 101, which exposes this. This is a behavior change for every existing customer — please call it out in the PR body / release notes (and maybe update the Javadoc on GREATER_THAN_PROP at DropOldContentVersionsJob.java:42-44, which still describes the old semantics).

2. Outer query lost ORDER BY COUNT(inode) DESC.
DropOldContentVersionsJobHelper.java:46 — the original query ordered worst offenders first so the limited batch (PULL_BATCH_SIZE = 500) hit the highest-impact contentlets per run. The new query has no ORDER BY, so batch contents are now in undefined order. On a tenant where a few contentlets have 10k+ versions, you may not reach them for many job runs. Was the removal intentional? If the goal was just to drop the sort cost on a COUNT(*) aggregate, keeping it is cheap relative to the GROUP BY scan that already happened.

3. Outer query performance claim is unverified.
The new outer SELECT does a correlated NOT EXISTS against contentlet_version_info for every row in contentlet filtered by language_id, then a GROUP BY. The previous query was a plain GROUP BY with no subquery. Whether this is faster or slower depends entirely on indexes on contentlet_version_info(identifier, lang, working_inode, live_inode). Has anyone run EXPLAIN ANALYZE on a representative tenant? The PR title says "performance" but the only measured improvement so far is the inner per-contentlet query (which is clearly better).

4. PR description doesn't match the final implementation.
PR body says "Commits after every batch" but commit 9f238b7f reverted to per-contentlet HibernateUtil.closeSessionSilently(). Plus deleteContentVersion is now @CloseDBIfOpened, so the session is already closed inside the call — the explicit closeSessionSilently() in the finally at DropOldContentVersionsJob.java:115 is redundant defense-in-depth (harmless, but worth knowing).

5. Safety-break can mask real failures.
DropOldContentVersionsJob.java:121-127 — deletedThisIteration == 0 will also be true if every deleteContentVersion call in a batch throws (the catch at line 184 swallows the exception and we never increment deleted). In that case we log "stopping to avoid infinite loop. Investigate manually", but the underlying error already happened earlier in Logger.warnAndDebug. Consider also tracking the exception count and logging it in the safety-break message so an operator finds the cause without grepping back through the log.
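A sketch of that suggestion (illustrative Python, not the dotCMS code; fetch_batch, delete_version, and the log wording are hypothetical stand-ins):

```python
def drain_batches(fetch_batch, delete_version, log):
    """Delete versions batch by batch; stop on no progress, but report
    how many deletions failed so the operator sees the real cause."""
    while True:
        batch = fetch_batch()
        if not batch:
            return
        deleted = failed = 0
        for inode in batch:
            try:
                delete_version(inode)
                deleted += 1
            except Exception as e:  # mirrors the swallowed per-row catch
                failed += 1
                log(f"delete failed for {inode}: {e}")
        if deleted == 0:
            # Distinguish "nothing left to do" from "everything threw".
            log(f"no progress: 0 deleted, {failed} failed; "
                f"stopping to avoid infinite loop")
            return
```

With the failure count in the safety-break line, a batch where every delete threw is immediately distinguishable from a genuinely empty batch.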

6. Minor — test docstring references SQL that isn't in the final diff.
DropOldContentVersionsJobTest.java:74-77 mentions NOT IN ... UNION ALL. That form existed in commit 067d2fc3 but the final diff vs main shows the original <> query being replaced with NOT EXISTS. The history is accurate to the dev sequence but confusing for someone reading this in 6 months. Consider rewording to "the previous query (intermediate NOT IN and original <>) silently failed when live_inode was NULL".

7. Pre-existing, but visible while you're in here.
DropOldContentVersionsJob.java:65-67 — Javadoc says "Fire every Monday at 1:15AM" but cron is * * WED *. Either the doc or the schedule is wrong; cheap to fix in this PR.

Style nits (non-blocking)

  • DropOldContentVersionsJobHelper.java:82-83 — double blank line after method signature.
  • The Logger.warn at line 122-125 builds the message with language (Language object) inside a '%s' — relies on Language.toString() being readable. Pre-existing pattern in the file, but if toString ever returns the object hash, this log gets ugly.

What's clean

  • NULL-safety fix using NOT EXISTS is the right call and correctly defended by the new regression test.
  • Restructuring deleteOldVersionsFromContentlet to return a count and propagate it up is a clean way to detect no-progress without extra state.
  • @CloseDBIfOpened on deleteContentVersion is a good correctness fix on its own.
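The NULL trap the review keeps coming back to is easy to reproduce in miniature with any SQL engine. A toy schema (not the dotCMS one) standing in for contentlet and contentlet_version_info:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE versions (inode TEXT);
    CREATE TABLE version_info (working_inode TEXT, live_inode TEXT);
    INSERT INTO versions VALUES ('v1'), ('v2'), ('v3');
    -- Never-published draft: live_inode is NULL.
    INSERT INTO version_info VALUES ('v3', NULL);
""")

# NOT IN against a set containing NULL is UNKNOWN for every row -> 0 rows,
# so cleanup silently skips every never-published draft.
not_in = con.execute("""
    SELECT inode FROM versions
    WHERE inode NOT IN (SELECT working_inode FROM version_info
                        UNION ALL
                        SELECT live_inode FROM version_info)
""").fetchall()

# NOT EXISTS tests row by row and is NULL-safe -> v1 and v2 are returned,
# while the working inode v3 stays protected.
not_exists = con.execute("""
    SELECT inode FROM versions v
    WHERE NOT EXISTS (SELECT 1 FROM version_info i
                      WHERE v.inode IN (i.working_inode, i.live_inode))
""").fetchall()

print(not_in)      # []
print(not_exists)  # [('v1',), ('v2',)]
```

Three-valued logic is the whole story: `x NOT IN (…, NULL)` can never be TRUE, only FALSE or UNKNOWN, which is why the original query deleted nothing for drafts with a NULL live_inode.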

@wezell wezell linked an issue May 6, 2026 that may be closed by this pull request
Address PR #35591 review feedback:

- DropOldContentVersionsJobHelper: replace NOT IN with NOT EXISTS so the
  old-versions query handles nullable contentlet_version_info.live_inode
  correctly. NOT IN against a NULL returns UNKNOWN for every row, which
  silently skipped cleanup for never-published drafts (the dominant case
  for EFS hardlink bloat). Working/live inodes are still strictly
  excluded from deletion.
- DropOldContentVersionsJob: add no-progress safety break to the outer
  while(true) loop so a batch that deletes zero versions stops the run
  for that language instead of spinning the Quartz worker indefinitely.
  Particularly important under StatefulJob, where a stuck run blocks
  future scheduled executions.
- DropOldContentVersionsJobTest: add regression test covering the NULL
  live_inode case end-to-end via the helper, asserting the working
  inode is never returned and the expected old versions still are.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
deleteContentVersion is already @CloseDBIfOpened, so each deletion gets
its own transaction boundary. Closing per contentlet on top of that
churns connection state for tenants with many short-lived sessions.
Move the closeSessionSilently() to a finally block around the batch loop
so it just sweeps anything left after the batch — same memory benefit,
fewer pool round-trips. Address PR #35591 review.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
A single batch (default 500 identifiers) can produce many thousands of
version deletions on tenants with heavy file-asset histories — the exact
EFS bloat case this job targets. Closing once per batch let too much
state accumulate in one Hibernate session. Move closeSessionSilently()
back inside the inner for-loop, wrapped in a finally so a delete failure
mid-loop still cleans up. Per-contentlet (~500 closes per batch) is the
right middle ground vs per-row.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
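The resulting loop structure, sketched generically (close_session stands in for HibernateUtil.closeSessionSilently; none of these names are the actual dotCMS API):

```python
def process_batch(identifiers, delete_old_versions, close_session):
    """Per-contentlet session sweep: close once after each contentlet's
    deletions, so a failure mid-contentlet still releases session state
    before the next identifier is processed."""
    total = 0
    for identifier in identifiers:
        try:
            total += delete_old_versions(identifier)
        finally:
            close_session()  # ~500 closes per 500-id batch, not per row
    return total
```

Per-contentlet closing bounds Hibernate session growth at one contentlet's worth of versions, without the per-row connection churn the earlier commit worried about.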
Two fixes to DropOldContentVersionsJobTest:

1. deleteOldContentlets — update expected count from 100 → 101.

   The previous expectation (100 versions remaining out of 105) only
   passed because the original NOT IN ... UNION ALL query was failing
   to exclude the working/live inode from deletion candidates: the
   subquery's NULL handling made NOT IN return UNKNOWN for every row
   (the bug PR #35591 fixes), so all 105 versions passed the filter
   and OFFSET 100 returned 5 deletions = 100 remaining.

   With the corrected NOT EXISTS query, the working/live inode is now
   properly excluded (1 row), so 104 candidates pass the filter,
   OFFSET 100 returns 4, and 105 - 4 = 101 versions remain. The
   "delete 5 leave 100" result was a side-effect of the bug, not the
   intended behavior.

2. findContentVersionsGreaterThan_protectsWorkingInodeWhenLiveInodeIsNull
   — replace title "Draft v1 (never published)" with
   "Draft Never Published 1". The webPageContent type's title field
   has a regex pattern that rejects parentheses, causing the previous
   title to fail validation immediately.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
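The 100 → 101 arithmetic from the commit message above can be checked in a few lines (the counts come from the test scenario; the names are illustrative):

```python
TOTAL = 105          # versions created by the test
GREATER_THAN = 100   # keep threshold, applied as OFFSET

# Buggy NOT IN: NULL handling let every row pass the filter.
buggy_candidates = TOTAL
buggy_deleted = buggy_candidates - GREATER_THAN        # 5 deleted
assert TOTAL - buggy_deleted == 100                    # old expectation

# Fixed NOT EXISTS: the working/live inode (1 row) is excluded up front.
fixed_candidates = TOTAL - 1                           # 104 candidates
fixed_deleted = fixed_candidates - GREATER_THAN        # 4 deleted
assert TOTAL - fixed_deleted == 101                    # new expectation
```

The old "delete 5, leave 100" outcome only ever held because the working inode was wrongly in the candidate set; once it is excluded, 101 remaining is the correct steady state.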
At steady state — a contentlet with working/live + GREATER_THAN old
versions (GREATER_THAN + 1 total rows) — the outer
FIND_CONTENTS_WITH_VERSIONS_GREATER_THAN matched on > GREATER_THAN
total rows, but the inner FIND_CONTENT_VERSIONS_GREATER_THAN query
correctly excludes the working/live inode and then OFFSETs by
GREATER_THAN, returning 0 candidates. The no-progress safety break in
DropOldContentVersionsJob.execute() then warned on every healthy run.

Mirror the inner NOT EXISTS exclusion in the outer query so the count
reflects actual deletion candidates. The outer now only returns
identifiers for which the inner will produce real work.

Address PR #35591 review.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@wezell wezell enabled auto-merge May 6, 2026 18:13
@wezell wezell added this pull request to the merge queue May 6, 2026
Merged via the queue into main with commit d52ec37 May 6, 2026
51 checks passed
@wezell wezell deleted the issue-35590-drop-old-assets-performance branch May 6, 2026 21:25

Labels

  • AI: Safe To Rollback
  • Area : Backend (PR changes Java/Maven backend code)


Development

Successfully merging this pull request may close these issues.

Improve performance of DropOldAssets Query
