
fix(perf): drop old assets query performance #35591

Merged
wezell merged 6 commits into main from issue-35590-drop-old-assets-performance
May 6, 2026

Conversation

@wezell
Member

@wezell wezell commented May 6, 2026

ref: #35590

  • Improves performance of both queries.
  • Commits after every batch.

This PR fixes: #35590

@claude
Contributor

claude Bot commented May 6, 2026

Claude finished @wezell's task in 3m 46s.


PR Review

The NULL-handling fix is correct and the new regression test exercises it well. A few things worth a second look before merging.

Findings

1. Semantic change to content.drop-versions.greater-than is not documented.
Both queries now exclude working/live inodes from the count. With greater-than = 100, a tenant ends up with 101 total versions kept (100 old + working/live), where previously the contract was 100 total. The test expectation was updated 100 → 101, which exposes this. This is a behavior change for every existing customer — please call it out in the PR body / release notes (and maybe update the Javadoc on GREATER_THAN_PROP at DropOldContentVersionsJob.java:42-44, which still describes the old semantics).

2. Outer query lost ORDER BY COUNT(inode) DESC.
DropOldContentVersionsJobHelper.java:46 — the original query ordered worst offenders first so the limited batch (PULL_BATCH_SIZE = 500) hit the highest-impact contentlets per run. The new query has no ORDER BY, so batch contents are now in undefined order. On a tenant where a few contentlets have 10k+ versions, you may not reach them for many job runs. Was the removal intentional? If the goal was just to drop the sort cost on a COUNT(*) aggregate, keeping it is cheap relative to the GROUP BY scan that already happened.

3. Outer query performance claim is unverified.
The new outer SELECT does a correlated NOT EXISTS against contentlet_version_info for every row in contentlet filtered by language_id, then a GROUP BY. The previous query was a plain GROUP BY with no subquery. Whether this is faster or slower depends entirely on indexes on contentlet_version_info(identifier, lang, working_inode, live_inode). Has anyone run EXPLAIN ANALYZE on a representative tenant? The PR title says "performance" but the only measured improvement so far is the inner per-contentlet query (which is clearly better).

4. PR description doesn't match the final implementation.
PR body says "Commits after every batch" but commit 9f238b7f reverted to per-contentlet HibernateUtil.closeSessionSilently(). Plus deleteContentVersion is now @CloseDBIfOpened, so the session is already closed inside the call — the explicit closeSessionSilently() in the finally at DropOldContentVersionsJob.java:115 is redundant defense-in-depth (harmless, but worth knowing).

5. Safety-break can mask real failures.
DropOldContentVersionsJob.java:121-127 — deletedThisIteration == 0 will also be true if every deleteContentVersion call in a batch throws (the catch at line 184 swallows the exception and we never increment deleted). In that case we log "stopping to avoid infinite loop. Investigate manually", but the underlying error already happened earlier in Logger.warnAndDebug. Consider also tracking the exception count and logging it in the safety-break message so an operator finds the cause without grepping back through the log.
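A sketch of that suggestion (illustrative Python, not the dotCMS code; fetch_batch, delete_version, and the log wording are hypothetical stand-ins):

```python
def drain_batches(fetch_batch, delete_version, log):
    """Delete versions batch by batch; stop on no progress, but report
    how many deletions failed so the operator sees the real cause."""
    while True:
        batch = fetch_batch()
        if not batch:
            return
        deleted = failed = 0
        for inode in batch:
            try:
                delete_version(inode)
                deleted += 1
            except Exception as e:  # mirrors the swallowed per-row catch
                failed += 1
                log(f"delete failed for {inode}: {e}")
        if deleted == 0:
            # Distinguish "nothing left to do" from "everything threw".
            log(f"no progress: 0 deleted, {failed} failed; "
                f"stopping to avoid infinite loop")
            return
```

With the failure count in the safety-break line, a batch where every delete threw is immediately distinguishable from a genuinely empty batch.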

6. Minor — test docstring references SQL that isn't in the final diff.
DropOldContentVersionsJobTest.java:74-77 mentions NOT IN ... UNION ALL. That form existed in commit 067d2fc3 but the final diff vs main shows the original <> query being replaced with NOT EXISTS. The history is accurate to the dev sequence but confusing for someone reading this in 6 months. Consider rewording to "the previous query (intermediate NOT IN and original <>) silently failed when live_inode was NULL".

7. Pre-existing, but visible while you're in here.
DropOldContentVersionsJob.java:65-67 — Javadoc says "Fire every Monday at 1:15AM" but cron is * * WED *. Either the doc or the schedule is wrong; cheap to fix in this PR.

Style nits (non-blocking)

  • DropOldContentVersionsJobHelper.java:82-83 — double blank line after method signature.
  • The Logger.warn at line 122-125 builds the message with language (Language object) inside a '%s' — relies on Language.toString() being readable. Pre-existing pattern in the file, but if toString ever returns the object hash, this log gets ugly.

What's clean

  • NULL-safety fix using NOT EXISTS is the right call and correctly defended by the new regression test.
  • Restructuring deleteOldVersionsFromContentlet to return a count and propagate it up is a clean way to detect no-progress without extra state.
  • @CloseDBIfOpened on deleteContentVersion is a good correctness fix on its own.
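The NULL trap the review keeps coming back to is easy to reproduce in miniature with any SQL engine. A toy schema (not the dotCMS one) standing in for contentlet and contentlet_version_info:

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
    CREATE TABLE versions (inode TEXT);
    CREATE TABLE version_info (working_inode TEXT, live_inode TEXT);
    INSERT INTO versions VALUES ('v1'), ('v2'), ('v3');
    -- Never-published draft: live_inode is NULL.
    INSERT INTO version_info VALUES ('v3', NULL);
""")

# NOT IN against a set containing NULL is UNKNOWN for every row -> 0 rows,
# so cleanup silently skips every never-published draft.
not_in = con.execute("""
    SELECT inode FROM versions
    WHERE inode NOT IN (SELECT working_inode FROM version_info
                        UNION ALL
                        SELECT live_inode FROM version_info)
""").fetchall()

# NOT EXISTS tests row by row and is NULL-safe -> v1 and v2 are returned,
# while the working inode v3 stays protected.
not_exists = con.execute("""
    SELECT inode FROM versions v
    WHERE NOT EXISTS (SELECT 1 FROM version_info i
                      WHERE v.inode IN (i.working_inode, i.live_inode))
""").fetchall()

print(not_in)      # []
print(not_exists)  # [('v1',), ('v2',)]
```

Three-valued logic is the whole story: `x NOT IN (…, NULL)` can never be TRUE, only FALSE or UNKNOWN, which is why the original query deleted nothing for drafts with a NULL live_inode.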

@wezell wezell linked an issue May 6, 2026 that may be closed by this pull request
Address PR #35591 review feedback:

- DropOldContentVersionsJobHelper: replace NOT IN with NOT EXISTS so the
  old-versions query handles nullable contentlet_version_info.live_inode
  correctly. NOT IN against a NULL returns UNKNOWN for every row, which
  silently skipped cleanup for never-published drafts (the dominant case
  for EFS hardlink bloat). Working/live inodes are still strictly
  excluded from deletion.
- DropOldContentVersionsJob: add no-progress safety break to the outer
  while(true) loop so a batch that deletes zero versions stops the run
  for that language instead of spinning the Quartz worker indefinitely.
  Particularly important under StatefulJob, where a stuck run blocks
  future scheduled executions.
- DropOldContentVersionsJobTest: add regression test covering the NULL
  live_inode case end-to-end via the helper, asserting the working
  inode is never returned and the expected old versions still are.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
deleteContentVersion is already @CloseDBIfOpened, so each deletion gets
its own transaction boundary. Closing per contentlet on top of that
churns connection state for tenants with many short-lived sessions.
Move the closeSessionSilently() to a finally block around the batch loop
so it just sweeps anything left after the batch — same memory benefit,
fewer pool round-trips. Address PR #35591 review.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
A single batch (default 500 identifiers) can produce many thousands of
version deletions on tenants with heavy file-asset histories — the exact
EFS bloat case this job targets. Closing once per batch let too much
state accumulate in one Hibernate session. Move closeSessionSilently()
back inside the inner for-loop, wrapped in a finally so a delete failure
mid-loop still cleans up. Per-contentlet (~500 closes per batch) is the
right middle ground vs per-row.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
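The resulting loop structure, sketched generically (close_session stands in for HibernateUtil.closeSessionSilently; none of these names are the actual dotCMS API):

```python
def process_batch(identifiers, delete_old_versions, close_session):
    """Per-contentlet session sweep: close once after each contentlet's
    deletions, so a failure mid-contentlet still releases session state
    before the next identifier is processed."""
    total = 0
    for identifier in identifiers:
        try:
            total += delete_old_versions(identifier)
        finally:
            close_session()  # ~500 closes per 500-id batch, not per row
    return total
```

Per-contentlet closing bounds Hibernate session growth at one contentlet's worth of versions, without the per-row connection churn the earlier commit worried about.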
Two fixes to DropOldContentVersionsJobTest:

1. deleteOldContentlets — update expected count from 100 → 101.

   The previous expectation (100 versions remaining out of 105) only
   passed because the original NOT IN ... UNION ALL query was failing
   to exclude the working/live inode from deletion candidates: the
   subquery's NULL handling made NOT IN return UNKNOWN for every row
   (the bug PR #35591 fixes), so all 105 versions passed the filter
   and OFFSET 100 returned 5 deletions = 100 remaining.

   With the corrected NOT EXISTS query, the working/live inode is now
   properly excluded (1 row), so 104 candidates pass the filter,
   OFFSET 100 returns 4, and 105 - 4 = 101 versions remain. The
   "delete 5 leave 100" result was a side-effect of the bug, not the
   intended behavior.

2. findContentVersionsGreaterThan_protectsWorkingInodeWhenLiveInodeIsNull
   — replace title "Draft v1 (never published)" with
   "Draft Never Published 1". The webPageContent type's title field
   has a regex pattern that rejects parentheses, causing the previous
   title to fail validation immediately.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
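The 100 → 101 arithmetic from the commit message above can be checked in a few lines (the counts come from the test scenario; the names are illustrative):

```python
TOTAL = 105          # versions created by the test
GREATER_THAN = 100   # keep threshold, applied as OFFSET

# Buggy NOT IN: NULL handling let every row pass the filter.
buggy_candidates = TOTAL
buggy_deleted = buggy_candidates - GREATER_THAN        # 5 deleted
assert TOTAL - buggy_deleted == 100                    # old expectation

# Fixed NOT EXISTS: the working/live inode (1 row) is excluded up front.
fixed_candidates = TOTAL - 1                           # 104 candidates
fixed_deleted = fixed_candidates - GREATER_THAN        # 4 deleted
assert TOTAL - fixed_deleted == 101                    # new expectation
```

The old "delete 5, leave 100" outcome only ever held because the working inode was wrongly in the candidate set; once it is excluded, 101 remaining is the correct steady state.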
At steady state — a contentlet with working/live + GREATER_THAN old
versions (GREATER_THAN + 1 total rows) — the outer
FIND_CONTENTS_WITH_VERSIONS_GREATER_THAN matched on > GREATER_THAN
total rows, but the inner FIND_CONTENT_VERSIONS_GREATER_THAN query
correctly excludes the working/live inode and then OFFSETs by
GREATER_THAN, returning 0 candidates. The no-progress safety break in
DropOldContentVersionsJob.execute() then warned on every healthy run.

Mirror the inner NOT EXISTS exclusion in the outer query so the count
reflects actual deletion candidates. The outer now only returns
identifiers for which the inner will produce real work.

Address PR #35591 review.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@wezell wezell enabled auto-merge May 6, 2026 18:13
@wezell wezell added this pull request to the merge queue May 6, 2026
Merged via the queue into main with commit d52ec37 May 6, 2026
51 checks passed
@wezell wezell deleted the issue-35590-drop-old-assets-performance branch May 6, 2026 21:25

Labels

  • AI: Safe To Rollback
  • Area : Backend (PR changes Java/Maven backend code)


Development

Successfully merging this pull request may close these issues.

Improve performance of DropOldAssets Query
