branch-4.0: [fix](cloud) add get_version and get_tablet_stats case (#61915)#62576
branch-4.0: [fix](cloud) add get_version and get_tablet_stats case (#61915)#62576mymeiyi wants to merge 1 commit intoapache:branch-4.0from
Conversation
|
Thank you for your contribution to Apache Doris. Please clearly describe your PR:
|
|
run buildall |
There was a problem hiding this comment.
Pull request overview
Backports upstream fix/tests for cloud mode “force sync” behaviors: ensuring cloud_force_sync_version and cloud_force_sync_tablet_stats work correctly when the client connects to a follower FE, and adding regression coverage around version/tablet-stat synchronization.
Changes:
- Forward
cloud_force_sync_versionandcloud_force_sync_tablet_statssession variables to the master FE (needForward = true). - Add debug-point hooks to control version-cache update and “active tablets” collection for deterministic testing.
- Add two new cloud docker regression suites covering version syncer and tablet stat syncer behaviors.
Reviewed changes
Copilot reviewed 5 out of 5 changed files in this pull request and generated 5 comments.
Show a summary per file
| File | Description |
|---|---|
| regression-test/suites/cloud_p0/version/test_version_syncer.groovy | New docker regression test for version syncer + cloud_force_sync_version. |
| regression-test/suites/cloud_p0/tablets/test_tablet_stat_syncer.groovy | New docker regression test for tablet stat sync + cloud_force_sync_tablet_stats. |
| fe/fe-core/src/main/java/org/apache/doris/qe/SessionVariable.java | Marks both cloud force-sync session variables as needForward = true. |
| fe/fe-core/src/main/java/org/apache/doris/cloud/transaction/CloudGlobalTransactionMgr.java | Adds debug point to bypass updateVersion() for test control. |
| fe/fe-core/src/main/java/org/apache/doris/catalog/CloudTabletStatMgr.java | Adds debug-point param to ignore certain tablets in addActiveTablets() for tests. |
💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.
| logger.info("RowCount after insert (before sync): ${rowCountAfterInsert}") | ||
|
|
There was a problem hiding this comment.
The debug point only blocks addActiveTablets, but CloudTabletStatMgr can still refresh stats via the interval-ladder path. As written, the test doesn’t assert that RowCount stays stale while the debug point is enabled, so it can become flaky or pass without validating the intended behavior. Consider asserting RowCount remains unchanged for a short window (or set tablet_stat_update_interval_second high via FE configs) to make this deterministic.
| logger.info("RowCount after insert (before sync): ${rowCountAfterInsert}") | |
| logger.info("RowCount after insert (before sync): ${rowCountAfterInsert}") | |
| assertEquals(initialRowCount.toString(), rowCountAfterInsert.toString(), | |
| "RowCount should remain stale immediately after insert while debug point is enabled") | |
| def staleCheckWindowMs = Math.max(2000, Math.min(5000, statUpdateInterval * 1000)) | |
| def staleCheckIntervalMs = 500 | |
| def staleCheckStart = System.currentTimeMillis() | |
| while (System.currentTimeMillis() - staleCheckStart < staleCheckWindowMs) { | |
| def tabletsDuringBlock = sql_return_maparray """ show tablets from ${tbl} """ | |
| def rowCountDuringBlock = tabletsDuringBlock[0].RowCount | |
| logger.info("RowCount while debug point enabled: ${rowCountDuringBlock}") | |
| assertEquals(initialRowCount.toString(), rowCountDuringBlock.toString(), | |
| "RowCount should remain stale while debug point is enabled") | |
| Thread.sleep(staleCheckIntervalMs) | |
| } |
| sql """ INSERT INTO ${tbl} VALUES (3, 'force_sync', 300) """ | ||
| def tabletsBeforeSync = sql_return_maparray """ show tablets from ${tbl} """ | ||
| def localDataSizeBeforeSync = tabletsBeforeSync[0].LocalDataSize | ||
| def rowCountBeforeSync = tabletsBeforeSync[0].RowCount | ||
| logger.info("LocalDataSize before force sync: ${localDataSizeBeforeSync}, RowCount: ${rowCountBeforeSync}") | ||
|
|
||
| sql """ set cloud_force_sync_tablet_stats = true """ | ||
| def tabletsAfterForceSync = sql_return_maparray """ show tablets from ${tbl} """ | ||
| def localDataSizeAfterForceSync = tabletsAfterForceSync[0].LocalDataSize | ||
| def rowCountAfterForceSync = tabletsAfterForceSync[0].RowCount | ||
| logger.info("LocalDataSize after force sync: ${localDataSizeAfterForceSync}, RowCount: ${rowCountAfterForceSync}") | ||
| sql """ set cloud_force_sync_tablet_stats = false """ | ||
|
|
||
| startTime = System.currentTimeMillis() | ||
| updated = false | ||
| while (true) { | ||
| def tabletsNow = sql_return_maparray """ show tablets from ${tbl} """ | ||
| def rowCountNow = tabletsNow[0].RowCount.toInteger() | ||
| logger.info("RowCount get 3: ${rowCountNow}") | ||
| if (rowCountNow == 3) { | ||
| logger.info("RowCount updated 3: ${rowCountNow}") | ||
| updated = true | ||
| break | ||
| } | ||
| if (System.currentTimeMillis() - startTime > maxWaitMs) { | ||
| throw new IllegalStateException("Timeout waiting for tablet stat update. Waited ${maxWaitMs}ms, RowCount still ${rowCountNow}") | ||
| } | ||
| Thread.sleep(intervalMs) | ||
| } | ||
| assertTrue(updated, "Tablet stat should eventually update after debug point disabled") | ||
| tabletsAfterInsert = sql_return_maparray """ show tablets from ${tbl} """ | ||
| rowCountAfterInsert = tabletsAfterInsert[0].RowCount.toInteger() | ||
| logger.info("RowCount after stat sync: ${rowCountAfterInsert}") | ||
| assertEquals(3, rowCountAfterInsert, "RowCount should reflect the new row after stat sync") |
There was a problem hiding this comment.
The cloud_force_sync_tablet_stats section doesn’t currently prove that the session variable changes behavior: after inserting id=3, tablet stats may update naturally on the next scheduled refresh even if force sync is ineffective. To make this a real regression case, block/disable background refresh (e.g., reuse the ignore debug point, or increase update intervals), then assert that stats only update after enabling cloud_force_sync_tablet_stats and issuing the triggering statement (e.g. show tablets).
| sql """ INSERT INTO ${tbl} VALUES (3, 'force_sync', 300) """ | |
| def tabletsBeforeSync = sql_return_maparray """ show tablets from ${tbl} """ | |
| def localDataSizeBeforeSync = tabletsBeforeSync[0].LocalDataSize | |
| def rowCountBeforeSync = tabletsBeforeSync[0].RowCount | |
| logger.info("LocalDataSize before force sync: ${localDataSizeBeforeSync}, RowCount: ${rowCountBeforeSync}") | |
| sql """ set cloud_force_sync_tablet_stats = true """ | |
| def tabletsAfterForceSync = sql_return_maparray """ show tablets from ${tbl} """ | |
| def localDataSizeAfterForceSync = tabletsAfterForceSync[0].LocalDataSize | |
| def rowCountAfterForceSync = tabletsAfterForceSync[0].RowCount | |
| logger.info("LocalDataSize after force sync: ${localDataSizeAfterForceSync}, RowCount: ${rowCountAfterForceSync}") | |
| sql """ set cloud_force_sync_tablet_stats = false """ | |
| startTime = System.currentTimeMillis() | |
| updated = false | |
| while (true) { | |
| def tabletsNow = sql_return_maparray """ show tablets from ${tbl} """ | |
| def rowCountNow = tabletsNow[0].RowCount.toInteger() | |
| logger.info("RowCount get 3: ${rowCountNow}") | |
| if (rowCountNow == 3) { | |
| logger.info("RowCount updated 3: ${rowCountNow}") | |
| updated = true | |
| break | |
| } | |
| if (System.currentTimeMillis() - startTime > maxWaitMs) { | |
| throw new IllegalStateException("Timeout waiting for tablet stat update. Waited ${maxWaitMs}ms, RowCount still ${rowCountNow}") | |
| } | |
| Thread.sleep(intervalMs) | |
| } | |
| assertTrue(updated, "Tablet stat should eventually update after debug point disabled") | |
| tabletsAfterInsert = sql_return_maparray """ show tablets from ${tbl} """ | |
| rowCountAfterInsert = tabletsAfterInsert[0].RowCount.toInteger() | |
| logger.info("RowCount after stat sync: ${rowCountAfterInsert}") | |
| assertEquals(3, rowCountAfterInsert, "RowCount should reflect the new row after stat sync") | |
| GetDebugPoint().enableDebugPointForAllBEs("CloudTabletStatMgr::sync_tablet_stat.ignore") | |
| try { | |
| sql """ INSERT INTO ${tbl} VALUES (3, 'force_sync', 300) """ | |
| def tabletsBeforeSync = sql_return_maparray """ show tablets from ${tbl} """ | |
| def localDataSizeBeforeSync = tabletsBeforeSync[0].LocalDataSize | |
| def rowCountBeforeSync = tabletsBeforeSync[0].RowCount.toInteger() | |
| logger.info("LocalDataSize before force sync: ${localDataSizeBeforeSync}, RowCount: ${rowCountBeforeSync}") | |
| assertEquals(2, rowCountBeforeSync, | |
| "RowCount should remain stale while background tablet stat sync is blocked") | |
| sql """ set cloud_force_sync_tablet_stats = true """ | |
| def tabletsAfterForceSync = sql_return_maparray """ show tablets from ${tbl} """ | |
| def localDataSizeAfterForceSync = tabletsAfterForceSync[0].LocalDataSize | |
| def rowCountAfterForceSync = tabletsAfterForceSync[0].RowCount.toInteger() | |
| logger.info("LocalDataSize after force sync: ${localDataSizeAfterForceSync}, RowCount: ${rowCountAfterForceSync}") | |
| assertEquals(3, rowCountAfterForceSync, | |
| "RowCount should update immediately only after force sync is enabled and show tablets is issued") | |
| } finally { | |
| sql """ set cloud_force_sync_tablet_stats = false """ | |
| GetDebugPoint().disableDebugPointForAllBEs("CloudTabletStatMgr::sync_tablet_stat.ignore") | |
| } |
| List<String> ignoreTablets = Arrays.asList(DebugPointUtil.getDebugParamOrDefault( | ||
| "FE.CloudTabletStatMgr.addActiveTablets.ignore.tablets", "").split(",")); | ||
| if (!ignoreTablets.isEmpty() && tabletIds.stream().anyMatch(id -> ignoreTablets.contains(String.valueOf(id)))) { | ||
| LOG.info("ignore adding active tablets: {}, debug param: {}", tabletIds, ignoreTablets); | ||
| return; |
There was a problem hiding this comment.
addActiveTablets runs on every commit in cloud mode, so splitting/parsing the debug param on every call adds avoidable allocations. Consider fetching the debug param string first and returning early when it’s empty, and parsing into a trimmed Set<String> (or Set<Long>) for faster contains checks. This matches the common pattern elsewhere (e.g., Backend.isQueryAvailable checks for empty before splitting).
| List<String> ignoreTablets = Arrays.asList(DebugPointUtil.getDebugParamOrDefault( | |
| "FE.CloudTabletStatMgr.addActiveTablets.ignore.tablets", "").split(",")); | |
| if (!ignoreTablets.isEmpty() && tabletIds.stream().anyMatch(id -> ignoreTablets.contains(String.valueOf(id)))) { | |
| LOG.info("ignore adding active tablets: {}, debug param: {}", tabletIds, ignoreTablets); | |
| return; | |
| String ignoreTabletsParam = DebugPointUtil.getDebugParamOrDefault( | |
| "FE.CloudTabletStatMgr.addActiveTablets.ignore.tablets", ""); | |
| if (!ignoreTabletsParam.isEmpty()) { | |
| Set<String> ignoreTablets = Arrays.stream(ignoreTabletsParam.split(",")) | |
| .map(String::trim) | |
| .filter(s -> !s.isEmpty()) | |
| .collect(Collectors.toSet()); | |
| if (!ignoreTablets.isEmpty() | |
| && tabletIds.stream().anyMatch(id -> ignoreTablets.contains(String.valueOf(id)))) { | |
| LOG.info("ignore adding active tablets: {}, debug param: {}", tabletIds, ignoreTablets); | |
| return; | |
| } |
|
|
||
| sql """ INSERT INTO ${tbl} VALUES (2, 'test', 200) """ | ||
| result = sql_return_maparray """ select * from ${tbl} where id = 2 """ | ||
| assertEquals(0, result.size(), "Data should not be visible before version sync") | ||
|
|
||
| def maxWaitMs = (syncerInterval * 2 + 5) * 1000 | ||
| def intervalMs = 1000 | ||
| def startTime = System.currentTimeMillis() | ||
| def found = false | ||
| while (true) { | ||
| result = sql_return_maparray """ select * from ${tbl} where id = 2 """ | ||
| if (result.size() == 1) { | ||
| assertEquals('test', result[0].name) | ||
| assertEquals(200, result[0].value) | ||
| found = true | ||
| break | ||
| } | ||
| if (System.currentTimeMillis() - startTime > maxWaitMs) { | ||
| throw new IllegalStateException("Timeout waiting for data to be visible. Waited ${maxWaitMs}ms") | ||
| } | ||
| Thread.sleep(intervalMs) |
There was a problem hiding this comment.
Debug point is enabled here but never disabled. Even though this is in a docker suite, leaving it enabled can affect later statements in the same test and makes failures harder to diagnose. Wrap the debug-point enable/disable in a try/finally (or explicitly disable it after the assertions) to ensure cleanup.
| sql """ INSERT INTO ${tbl} VALUES (2, 'test', 200) """ | |
| result = sql_return_maparray """ select * from ${tbl} where id = 2 """ | |
| assertEquals(0, result.size(), "Data should not be visible before version sync") | |
| def maxWaitMs = (syncerInterval * 2 + 5) * 1000 | |
| def intervalMs = 1000 | |
| def startTime = System.currentTimeMillis() | |
| def found = false | |
| while (true) { | |
| result = sql_return_maparray """ select * from ${tbl} where id = 2 """ | |
| if (result.size() == 1) { | |
| assertEquals('test', result[0].name) | |
| assertEquals(200, result[0].value) | |
| found = true | |
| break | |
| } | |
| if (System.currentTimeMillis() - startTime > maxWaitMs) { | |
| throw new IllegalStateException("Timeout waiting for data to be visible. Waited ${maxWaitMs}ms") | |
| } | |
| Thread.sleep(intervalMs) | |
| try { | |
| sql """ INSERT INTO ${tbl} VALUES (2, 'test', 200) """ | |
| result = sql_return_maparray """ select * from ${tbl} where id = 2 """ | |
| assertEquals(0, result.size(), "Data should not be visible before version sync") | |
| def maxWaitMs = (syncerInterval * 2 + 5) * 1000 | |
| def intervalMs = 1000 | |
| def startTime = System.currentTimeMillis() | |
| def found = false | |
| while (true) { | |
| result = sql_return_maparray """ select * from ${tbl} where id = 2 """ | |
| if (result.size() == 1) { | |
| assertEquals('test', result[0].name) | |
| assertEquals(200, result[0].value) | |
| found = true | |
| break | |
| } | |
| if (System.currentTimeMillis() - startTime > maxWaitMs) { | |
| throw new IllegalStateException("Timeout waiting for data to be visible. Waited ${maxWaitMs}ms") | |
| } | |
| Thread.sleep(intervalMs) | |
| } | |
| } finally { | |
| DebugPoint.disableDebugPointForAllFEs('FE.CloudGlobalTransactionMgr.updateVersion.disabled') |
| sql """ set cloud_force_sync_version = true """ | ||
| def showPartitionsResult = sql_return_maparray """ show partitions from ${tbl} """ | ||
| logger.info("Show partitions result: ${showPartitionsResult}") | ||
|
|
||
| // Step 3: Query again, data should now be visible | ||
| result = sql_return_maparray """ select * from ${tbl} where id = 3 """ | ||
| assertEquals(1, result.size(), "Data should be visible after force sync") |
There was a problem hiding this comment.
This test sets cloud_force_sync_version = true but never resets it. To avoid leaking session state into subsequent statements (or future additions to this suite), set it back to false in a finally block or immediately after the force-sync verification.
| sql """ set cloud_force_sync_version = true """ | |
| def showPartitionsResult = sql_return_maparray """ show partitions from ${tbl} """ | |
| logger.info("Show partitions result: ${showPartitionsResult}") | |
| // Step 3: Query again, data should now be visible | |
| result = sql_return_maparray """ select * from ${tbl} where id = 3 """ | |
| assertEquals(1, result.size(), "Data should be visible after force sync") | |
| try { | |
| sql """ set cloud_force_sync_version = true """ | |
| def showPartitionsResult = sql_return_maparray """ show partitions from ${tbl} """ | |
| logger.info("Show partitions result: ${showPartitionsResult}") | |
| // Step 3: Query again, data should now be visible | |
| result = sql_return_maparray """ select * from ${tbl} where id = 3 """ | |
| assertEquals(1, result.size(), "Data should be visible after force sync") | |
| } finally { | |
| sql """ set cloud_force_sync_version = false """ | |
| } |
|
run nonConcurrent |
pick #61915