cluster: remove shard aggregation for max_offset and under_replicated_partitions #30593

Merged
WillemKauf merged 1 commit into redpanda-data:dev from WillemKauf:replicated_public_metric on May 25, 2026

Conversation

@WillemKauf

Contributor

Missed removing the sm::shard_label on these in 263a0e8.

Aggregating these across shards doesn't reduce the number of metric series, but it does cause an oversized allocation in the metric_aggregate_by_labels object, since an std::unordered_map is used there to aggregate.

Remove the shard label to prevent this oversized allocation.
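
To illustrate why the aggregation buys nothing here, the following is a minimal sketch of label-based aggregation, not Seastar's actual `metric_aggregate_by_labels` code. The names `series`, `make_series`, `key_without`, `aggregate_by_dropping`, and `per_partition_series` are hypothetical, and the sketch assumes each partition is reported by exactly one shard.

```cpp
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <unordered_map>
#include <utility>
#include <vector>

// Hypothetical stand-in for one metric series: a label set plus a value.
using labels_t = std::map<std::string, std::string>;

struct series {
    labels_t labels;
    double value;
};

// Build a series for topic "t" with the given partition and shard labels.
series make_series(std::string partition, std::string shard, double value) {
    labels_t labels;
    labels["topic"] = "t";
    labels["partition"] = std::move(partition);
    labels["shard"] = std::move(shard);
    return series{std::move(labels), value};
}

// Serialize every label except `drop` into a string usable as a map key.
std::string key_without(const labels_t& labels, const std::string& drop) {
    std::string key;
    for (const auto& [name, value] : labels) {
        if (name != drop) {
            key += name + "=" + value + ";";
        }
    }
    return key;
}

// Sum the series that become identical once the `drop` label is removed.
std::unordered_map<std::string, double> aggregate_by_dropping(
  const std::vector<series>& input, const std::string& drop) {
    std::unordered_map<std::string, double> out;
    for (const auto& s : input) {
        out[key_without(s.labels, drop)] += s.value;
    }
    return out;
}

// Assumption for this sketch: every partition lives on exactly one shard.
std::vector<series> per_partition_series() {
    return {
      make_series("0", "0", 10.0),
      make_series("1", "1", 20.0),
      make_series("2", "0", 30.0),
    };
}
```

Under that assumption, `aggregate_by_dropping(per_partition_series(), "shard")` produces one map entry per input series, so the map grows with the partition count while the number of series stays the same. Aggregation only shrinks the output when several shards report the same remaining label set.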

Backports Required

  • none - not a bug fix
  • none - this is a backport
  • none - issue does not exist in previous branches
  • none - papercut/not impactful enough to backport
  • v26.1.x
  • v25.3.x
  • v25.2.x

Release Notes

  • none

…cated_replicas`

Missed removing the `sm::shard_label` on these in 263a0e8.

Aggregating these across shards doesn't reduce the number of
metric series, but it does cause an oversized allocation in the
`metric_aggregate_by_labels` object, since an `std::unordered_map`
is used there to aggregate.

Remove the shard label to prevent this oversized allocation.

Copilot AI left a comment

Contributor


Pull request overview

This PR removes shard-level aggregation for the public kafka partition metrics max_offset and under_replicated_replicas to avoid the extra aggregation machinery (and its oversized unordered_map allocation) when aggregation does not reduce metric series cardinality for per-partition metrics.

Changes:

  • Removed .aggregate({sm::shard_label}) from the public kafka:max_offset gauge.
  • Removed .aggregate({sm::shard_label}) from the public kafka:under_replicated_replicas gauge.

Comment on lines 248 to +253
  return model::offset(-1);
  }
  },
  sm::description(
  "Latest readable offset of the partition (i.e. high watermark)"),
- labels)
- .aggregate({sm::shard_label}),
+ labels),
Comment on lines 254 to 256
sm::make_gauge(
"under_replicated_replicas",
[this] {
@vbotbuildovich

Collaborator

Retry command for Build#84902

please wait until all jobs are finished before running the slash command

/ci-repeat 1
skip-redpanda-build
skip-units
skip-rebase
tests/rptest/tests/write_caching_fi_e2e_test.py::WriteCachingFailureInjectionE2ETest.test_crash_all@{"use_transactions":false}

@vbotbuildovich

Collaborator

CI test results

test results on build#84902
| test_status | test_class | test_method | test_arguments | test_kind | job_url | passed | reason | test_history |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| FLAKY(PASS) | DataMigrationsApiTest | test_migrated_topic_data_integrity | {"params": {"cancellation": {"dir": "out", "stage": "prepared"}, "include_groups": true, "transfer_leadership": true, "use_alias": true}} | integration | https://buildkite.com/redpanda/redpanda/builds/84902#019e519f-8881-4c4a-8354-c05146405752 | 10/11 | Test PASSES after retries. No significant increase in flaky rate(baseline=0.0000, p0=1.0000, reject_threshold=0.0100. adj_baseline=0.1000, p1=0.3487, trust_threshold=0.5000) | https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=DataMigrationsApiTest&test_method=test_migrated_topic_data_integrity |
| FLAKY(PASS) | RedpandaNodeOperationsSmokeTest | test_node_ops_smoke_test | {"cloud_storage_type": 1, "mixed_versions": false} | integration | https://buildkite.com/redpanda/redpanda/builds/84902#019e519f-8885-46fa-ad02-7fe02d0e31ba | 10/11 | Test PASSES after retries. No significant increase in flaky rate(baseline=0.0016, p0=1.0000, reject_threshold=0.0100. adj_baseline=0.1000, p1=0.3487, trust_threshold=0.5000) | https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=RedpandaNodeOperationsSmokeTest&test_method=test_node_ops_smoke_test |
| FLAKY(FAIL) | WriteCachingFailureInjectionE2ETest | test_crash_all | {"use_transactions": false} | integration | https://buildkite.com/redpanda/redpanda/builds/84902#019e51a1-cd16-4236-ba62-c80dc13b925e | 31/41 | Test FAILS after retries. Significant increase in flaky rate(baseline=0.0901, p0=0.0081, reject_threshold=0.0100) | https://redpanda.metabaseapp.com/dashboard/87-tests?tab=142-dt-individual-test-history&test_class=WriteCachingFailureInjectionE2ETest&test_method=test_crash_all |

@WillemKauf WillemKauf enabled auto-merge May 25, 2026 13:41
@WillemKauf WillemKauf merged commit 8066806 into redpanda-data:dev May 25, 2026
20 checks passed
@WillemKauf

Contributor Author

/backport v26.1.x
