Skip to content

branch-4.0: [fix](cloud) add get_version and get_tablet_stats case (#61915)#62576

Open
mymeiyi wants to merge 1 commit intoapache:branch-4.0from
mymeiyi:branch-4.0-pick-61915
Open

branch-4.0: [fix](cloud) add get_version and get_tablet_stats case (#61915)#62576
mymeiyi wants to merge 1 commit intoapache:branch-4.0from
mymeiyi:branch-4.0-pick-61915

Conversation

@mymeiyi
Copy link
Copy Markdown
Contributor

@mymeiyi mymeiyi commented Apr 17, 2026

pick #61915

Copilot AI review requested due to automatic review settings April 17, 2026 04:15
@hello-stephen
Copy link
Copy Markdown
Contributor

Thank you for your contribution to Apache Doris.
Don't know what should be done next? See How to process your PR.

Please clearly describe your PR:

  1. What problem was fixed (it's best to include specific error reporting information). How it was fixed.
  2. Which behaviors were modified. What was the previous behavior, what is it now, why was it modified, and what possible impacts might there be.
  3. What features were added. Why was this function added?
  4. Which code was refactored and why was this part of the code refactored?
  5. Which functions were optimized and what is the difference before and after the optimization?

@mymeiyi
Copy link
Copy Markdown
Contributor Author

mymeiyi commented Apr 17, 2026

run buildall

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Backports upstream fix/tests for cloud mode “force sync” behaviors: ensuring cloud_force_sync_version and cloud_force_sync_tablet_stats work correctly when the client connects to a follower FE, and adding regression coverage around version/tablet-stat synchronization.

Changes:

  • Forward cloud_force_sync_version and cloud_force_sync_tablet_stats session variables to the master FE (needForward = true).
  • Add debug-point hooks to control version-cache update and “active tablets” collection for deterministic testing.
  • Add two new cloud docker regression suites covering version syncer and tablet stat syncer behaviors.

Reviewed changes

Copilot reviewed 5 out of 5 changed files in this pull request and generated 5 comments.

Show a summary per file
File Description
regression-test/suites/cloud_p0/version/test_version_syncer.groovy New docker regression test for version syncer + cloud_force_sync_version.
regression-test/suites/cloud_p0/tablets/test_tablet_stat_syncer.groovy New docker regression test for tablet stat sync + cloud_force_sync_tablet_stats.
fe/fe-core/src/main/java/org/apache/doris/qe/SessionVariable.java Marks both cloud force-sync session variables as needForward = true.
fe/fe-core/src/main/java/org/apache/doris/cloud/transaction/CloudGlobalTransactionMgr.java Adds debug point to bypass updateVersion() for test control.
fe/fe-core/src/main/java/org/apache/doris/catalog/CloudTabletStatMgr.java Adds debug-point param to ignore certain tablets in addActiveTablets() for tests.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment on lines +87 to +88
logger.info("RowCount after insert (before sync): ${rowCountAfterInsert}")

Copy link

Copilot AI Apr 17, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The debug point only blocks addActiveTablets, but CloudTabletStatMgr can still refresh stats via the interval-ladder path. As written, the test doesn’t assert that RowCount stays stale while the debug point is enabled, so it can become flaky or pass without validating the intended behavior. Consider asserting RowCount remains unchanged for a short window (or set tablet_stat_update_interval_second high via FE configs) to make this deterministic.

Suggested change
logger.info("RowCount after insert (before sync): ${rowCountAfterInsert}")
logger.info("RowCount after insert (before sync): ${rowCountAfterInsert}")
assertEquals(initialRowCount.toString(), rowCountAfterInsert.toString(),
"RowCount should remain stale immediately after insert while debug point is enabled")
def staleCheckWindowMs = Math.max(2000, Math.min(5000, statUpdateInterval * 1000))
def staleCheckIntervalMs = 500
def staleCheckStart = System.currentTimeMillis()
while (System.currentTimeMillis() - staleCheckStart < staleCheckWindowMs) {
def tabletsDuringBlock = sql_return_maparray """ show tablets from ${tbl} """
def rowCountDuringBlock = tabletsDuringBlock[0].RowCount
logger.info("RowCount while debug point enabled: ${rowCountDuringBlock}")
assertEquals(initialRowCount.toString(), rowCountDuringBlock.toString(),
"RowCount should remain stale while debug point is enabled")
Thread.sleep(staleCheckIntervalMs)
}

Copilot uses AI. Check for mistakes.
Comment on lines +115 to +148
sql """ INSERT INTO ${tbl} VALUES (3, 'force_sync', 300) """
def tabletsBeforeSync = sql_return_maparray """ show tablets from ${tbl} """
def localDataSizeBeforeSync = tabletsBeforeSync[0].LocalDataSize
def rowCountBeforeSync = tabletsBeforeSync[0].RowCount
logger.info("LocalDataSize before force sync: ${localDataSizeBeforeSync}, RowCount: ${rowCountBeforeSync}")

sql """ set cloud_force_sync_tablet_stats = true """
def tabletsAfterForceSync = sql_return_maparray """ show tablets from ${tbl} """
def localDataSizeAfterForceSync = tabletsAfterForceSync[0].LocalDataSize
def rowCountAfterForceSync = tabletsAfterForceSync[0].RowCount
logger.info("LocalDataSize after force sync: ${localDataSizeAfterForceSync}, RowCount: ${rowCountAfterForceSync}")
sql """ set cloud_force_sync_tablet_stats = false """

startTime = System.currentTimeMillis()
updated = false
while (true) {
def tabletsNow = sql_return_maparray """ show tablets from ${tbl} """
def rowCountNow = tabletsNow[0].RowCount.toInteger()
logger.info("RowCount get 3: ${rowCountNow}")
if (rowCountNow == 3) {
logger.info("RowCount updated 3: ${rowCountNow}")
updated = true
break
}
if (System.currentTimeMillis() - startTime > maxWaitMs) {
throw new IllegalStateException("Timeout waiting for tablet stat update. Waited ${maxWaitMs}ms, RowCount still ${rowCountNow}")
}
Thread.sleep(intervalMs)
}
assertTrue(updated, "Tablet stat should eventually update after debug point disabled")
tabletsAfterInsert = sql_return_maparray """ show tablets from ${tbl} """
rowCountAfterInsert = tabletsAfterInsert[0].RowCount.toInteger()
logger.info("RowCount after stat sync: ${rowCountAfterInsert}")
assertEquals(3, rowCountAfterInsert, "RowCount should reflect the new row after stat sync")
Copy link

Copilot AI Apr 17, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The cloud_force_sync_tablet_stats section doesn’t currently prove that the session variable changes behavior: after inserting id=3, tablet stats may update naturally on the next scheduled refresh even if force sync is ineffective. To make this a real regression case, block/disable background refresh (e.g., reuse the ignore debug point, or increase update intervals), then assert that stats only update after enabling cloud_force_sync_tablet_stats and issuing the triggering statement (e.g. show tablets).

Suggested change
sql """ INSERT INTO ${tbl} VALUES (3, 'force_sync', 300) """
def tabletsBeforeSync = sql_return_maparray """ show tablets from ${tbl} """
def localDataSizeBeforeSync = tabletsBeforeSync[0].LocalDataSize
def rowCountBeforeSync = tabletsBeforeSync[0].RowCount
logger.info("LocalDataSize before force sync: ${localDataSizeBeforeSync}, RowCount: ${rowCountBeforeSync}")
sql """ set cloud_force_sync_tablet_stats = true """
def tabletsAfterForceSync = sql_return_maparray """ show tablets from ${tbl} """
def localDataSizeAfterForceSync = tabletsAfterForceSync[0].LocalDataSize
def rowCountAfterForceSync = tabletsAfterForceSync[0].RowCount
logger.info("LocalDataSize after force sync: ${localDataSizeAfterForceSync}, RowCount: ${rowCountAfterForceSync}")
sql """ set cloud_force_sync_tablet_stats = false """
startTime = System.currentTimeMillis()
updated = false
while (true) {
def tabletsNow = sql_return_maparray """ show tablets from ${tbl} """
def rowCountNow = tabletsNow[0].RowCount.toInteger()
logger.info("RowCount get 3: ${rowCountNow}")
if (rowCountNow == 3) {
logger.info("RowCount updated 3: ${rowCountNow}")
updated = true
break
}
if (System.currentTimeMillis() - startTime > maxWaitMs) {
throw new IllegalStateException("Timeout waiting for tablet stat update. Waited ${maxWaitMs}ms, RowCount still ${rowCountNow}")
}
Thread.sleep(intervalMs)
}
assertTrue(updated, "Tablet stat should eventually update after debug point disabled")
tabletsAfterInsert = sql_return_maparray """ show tablets from ${tbl} """
rowCountAfterInsert = tabletsAfterInsert[0].RowCount.toInteger()
logger.info("RowCount after stat sync: ${rowCountAfterInsert}")
assertEquals(3, rowCountAfterInsert, "RowCount should reflect the new row after stat sync")
GetDebugPoint().enableDebugPointForAllBEs("CloudTabletStatMgr::sync_tablet_stat.ignore")
try {
sql """ INSERT INTO ${tbl} VALUES (3, 'force_sync', 300) """
def tabletsBeforeSync = sql_return_maparray """ show tablets from ${tbl} """
def localDataSizeBeforeSync = tabletsBeforeSync[0].LocalDataSize
def rowCountBeforeSync = tabletsBeforeSync[0].RowCount.toInteger()
logger.info("LocalDataSize before force sync: ${localDataSizeBeforeSync}, RowCount: ${rowCountBeforeSync}")
assertEquals(2, rowCountBeforeSync,
"RowCount should remain stale while background tablet stat sync is blocked")
sql """ set cloud_force_sync_tablet_stats = true """
def tabletsAfterForceSync = sql_return_maparray """ show tablets from ${tbl} """
def localDataSizeAfterForceSync = tabletsAfterForceSync[0].LocalDataSize
def rowCountAfterForceSync = tabletsAfterForceSync[0].RowCount.toInteger()
logger.info("LocalDataSize after force sync: ${localDataSizeAfterForceSync}, RowCount: ${rowCountAfterForceSync}")
assertEquals(3, rowCountAfterForceSync,
"RowCount should update immediately only after force sync is enabled and show tablets is issued")
} finally {
sql """ set cloud_force_sync_tablet_stats = false """
GetDebugPoint().disableDebugPointForAllBEs("CloudTabletStatMgr::sync_tablet_stat.ignore")
}

Copilot uses AI. Check for mistakes.
Comment on lines +580 to +584
List<String> ignoreTablets = Arrays.asList(DebugPointUtil.getDebugParamOrDefault(
"FE.CloudTabletStatMgr.addActiveTablets.ignore.tablets", "").split(","));
if (!ignoreTablets.isEmpty() && tabletIds.stream().anyMatch(id -> ignoreTablets.contains(String.valueOf(id)))) {
LOG.info("ignore adding active tablets: {}, debug param: {}", tabletIds, ignoreTablets);
return;
Copy link

Copilot AI Apr 17, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

addActiveTablets runs on every commit in cloud mode, so splitting/parsing the debug param on every call adds avoidable allocations. Consider fetching the debug param string first and returning early when it’s empty, and parsing into a trimmed Set<String> (or Set<Long>) for faster contains checks. This matches the common pattern elsewhere (e.g., Backend.isQueryAvailable checks for empty before splitting).

Suggested change
List<String> ignoreTablets = Arrays.asList(DebugPointUtil.getDebugParamOrDefault(
"FE.CloudTabletStatMgr.addActiveTablets.ignore.tablets", "").split(","));
if (!ignoreTablets.isEmpty() && tabletIds.stream().anyMatch(id -> ignoreTablets.contains(String.valueOf(id)))) {
LOG.info("ignore adding active tablets: {}, debug param: {}", tabletIds, ignoreTablets);
return;
String ignoreTabletsParam = DebugPointUtil.getDebugParamOrDefault(
"FE.CloudTabletStatMgr.addActiveTablets.ignore.tablets", "");
if (!ignoreTabletsParam.isEmpty()) {
Set<String> ignoreTablets = Arrays.stream(ignoreTabletsParam.split(","))
.map(String::trim)
.filter(s -> !s.isEmpty())
.collect(Collectors.toSet());
if (!ignoreTablets.isEmpty()
&& tabletIds.stream().anyMatch(id -> ignoreTablets.contains(String.valueOf(id)))) {
LOG.info("ignore adding active tablets: {}, debug param: {}", tabletIds, ignoreTablets);
return;
}

Copilot uses AI. Check for mistakes.
Comment on lines +70 to +90

sql """ INSERT INTO ${tbl} VALUES (2, 'test', 200) """
result = sql_return_maparray """ select * from ${tbl} where id = 2 """
assertEquals(0, result.size(), "Data should not be visible before version sync")

def maxWaitMs = (syncerInterval * 2 + 5) * 1000
def intervalMs = 1000
def startTime = System.currentTimeMillis()
def found = false
while (true) {
result = sql_return_maparray """ select * from ${tbl} where id = 2 """
if (result.size() == 1) {
assertEquals('test', result[0].name)
assertEquals(200, result[0].value)
found = true
break
}
if (System.currentTimeMillis() - startTime > maxWaitMs) {
throw new IllegalStateException("Timeout waiting for data to be visible. Waited ${maxWaitMs}ms")
}
Thread.sleep(intervalMs)
Copy link

Copilot AI Apr 17, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Debug point is enabled here but never disabled. Even though this is in a docker suite, leaving it enabled can affect later statements in the same test and makes failures harder to diagnose. Wrap the debug-point enable/disable in a try/finally (or explicitly disable it after the assertions) to ensure cleanup.

Suggested change
sql """ INSERT INTO ${tbl} VALUES (2, 'test', 200) """
result = sql_return_maparray """ select * from ${tbl} where id = 2 """
assertEquals(0, result.size(), "Data should not be visible before version sync")
def maxWaitMs = (syncerInterval * 2 + 5) * 1000
def intervalMs = 1000
def startTime = System.currentTimeMillis()
def found = false
while (true) {
result = sql_return_maparray """ select * from ${tbl} where id = 2 """
if (result.size() == 1) {
assertEquals('test', result[0].name)
assertEquals(200, result[0].value)
found = true
break
}
if (System.currentTimeMillis() - startTime > maxWaitMs) {
throw new IllegalStateException("Timeout waiting for data to be visible. Waited ${maxWaitMs}ms")
}
Thread.sleep(intervalMs)
try {
sql """ INSERT INTO ${tbl} VALUES (2, 'test', 200) """
result = sql_return_maparray """ select * from ${tbl} where id = 2 """
assertEquals(0, result.size(), "Data should not be visible before version sync")
def maxWaitMs = (syncerInterval * 2 + 5) * 1000
def intervalMs = 1000
def startTime = System.currentTimeMillis()
def found = false
while (true) {
result = sql_return_maparray """ select * from ${tbl} where id = 2 """
if (result.size() == 1) {
assertEquals('test', result[0].name)
assertEquals(200, result[0].value)
found = true
break
}
if (System.currentTimeMillis() - startTime > maxWaitMs) {
throw new IllegalStateException("Timeout waiting for data to be visible. Waited ${maxWaitMs}ms")
}
Thread.sleep(intervalMs)
}
} finally {
DebugPoint.disableDebugPointForAllFEs('FE.CloudGlobalTransactionMgr.updateVersion.disabled')

Copilot uses AI. Check for mistakes.
Comment on lines +104 to +110
sql """ set cloud_force_sync_version = true """
def showPartitionsResult = sql_return_maparray """ show partitions from ${tbl} """
logger.info("Show partitions result: ${showPartitionsResult}")

// Step 3: Query again, data should now be visible
result = sql_return_maparray """ select * from ${tbl} where id = 3 """
assertEquals(1, result.size(), "Data should be visible after force sync")
Copy link

Copilot AI Apr 17, 2026

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This test sets cloud_force_sync_version = true but never resets it. To avoid leaking session state into subsequent statements (or future additions to this suite), set it back to false in a finally block or immediately after the force-sync verification.

Suggested change
sql """ set cloud_force_sync_version = true """
def showPartitionsResult = sql_return_maparray """ show partitions from ${tbl} """
logger.info("Show partitions result: ${showPartitionsResult}")
// Step 3: Query again, data should now be visible
result = sql_return_maparray """ select * from ${tbl} where id = 3 """
assertEquals(1, result.size(), "Data should be visible after force sync")
try {
sql """ set cloud_force_sync_version = true """
def showPartitionsResult = sql_return_maparray """ show partitions from ${tbl} """
logger.info("Show partitions result: ${showPartitionsResult}")
// Step 3: Query again, data should now be visible
result = sql_return_maparray """ select * from ${tbl} where id = 3 """
assertEquals(1, result.size(), "Data should be visible after force sync")
} finally {
sql """ set cloud_force_sync_version = false """
}

Copilot uses AI. Check for mistakes.
@mymeiyi
Copy link
Copy Markdown
Contributor Author

mymeiyi commented Apr 17, 2026

run nonConcurrent

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants