feat(metrics): refine block_transaction_count buckets, add changelog #6730
kuny0707 merged 4 commits into tronprotocol:develop
Conversation
afbeb57 to 7416f46
…_CHANGELOG.md Densify the default histogram buckets around 0-300 (the typical TPS range) so percentile interpolation for per-SR throughput is more accurate. New buckets: [0, 20, 50, 80, 100, 120, 140, 160, 180, 200, 230, 260, 300, 500, 2000]. Also introduce METRICS_CHANGELOG.md to track Prometheus metric additions, changes, and removals going forward, with a Pre-4.8.2 baseline snapshot of currently emitted metrics.
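For illustration, a sketch of the percentile query this change is meant to sharpen, assuming the histogram is scraped with the conventional `_bucket` exposition suffix (`tron:block_transaction_count_bucket`):

```promql
# p95 transactions per block over the last 5 minutes. histogram_quantile
# interpolates linearly inside the bucket containing the quantile, so for a
# workload living in the 100-300 tx/block range the old layout resolved the
# estimate only within a 100- to 300-wide bucket, while the refined array
# narrows that to a 20- to 40-wide slot.
histogram_quantile(
  0.95,
  sum by (le) (rate(tron:block_transaction_count_bucket[5m]))
)
```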
7416f46 to 4252888
Replace the grafana.com/dashboards/16567 link with the maintained JSON in tron-docker, per @Sunny6889's review on tronprotocol#6730. The community dashboard 16567 is no longer being updated; the in-org JSON is the current source of truth.
The block_transaction_count histogram buckets in java-tron are being refined to [0, 20, 50, 80, 100, 120, 140, 160, 180, 200, 230, 260, 300, 500, 2000] (tronprotocol/java-tron#6730), densifying the 0-300 range for percentile interpolation in the typical per-SR TPS range. Update the Block Transaction Count Distribution panel to match: the old le="10.0" and le="1000.0" boundaries no longer exist in the refined array, so those queries would return nothing. Adopt 9 legend buckets aligned to the new boundaries (0, 1-20, 21-50, 51-100, 101-200, 201-300, 301-500, 501-2000, 2000+) so the panel makes use of the added 0-300 resolution.
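As an illustration of the query shape behind one such legend bucket (a hedged sketch, not the maintained tron-docker JSON; the `_bucket` exposition name and Grafana's `$__range` variable are the only assumptions):

```promql
# Blocks with 101-200 transactions over the dashboard range, computed as the
# difference of two cumulative bucket counters. Both le boundaries exist in
# the refined array; the old le="10.0" / le="1000.0" boundaries would not.
sum(increase(tron:block_transaction_count_bucket{le="200.0"}[$__range]))
-
sum(increase(tron:block_transaction_count_bucket{le="100.0"}[$__range]))
```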
Update the documented bucket array for tron:block_transaction_count to match the refined buckets in tronprotocol/java-tron#6730: [0, 20, 50, 80, 100, 120, 140, 160, 180, 200, 230, 260, 300, 500, 2000]. Per @Sunny6889's review on this PR.
Snapshot of metrics emitted prior to this changelog. Per-version provenance is not tracked here; consult `git log` on [`common/src/main/java/org/tron/common/prometheus/`](common/src/main/java/org/tron/common/prometheus/) for exact origin of each metric.

### Existing Metrics
[NIT] Pre-4.8.2 baseline omits tron:error_info
The baseline claims to be a complete snapshot of metrics emitted prior to this changelog, but it does not list tron:error_info (Counter, labels topic, type), which is registered in common/src/main/java/org/tron/common/prometheus/InstrumentedAppender.java:12-17 and emitted on every ERROR-level log line. The metric is already actively asserted in framework/src/test/java/org/tron/core/metrics/prometheus/PrometheusApiServiceTest.java:114.
Leaving it out of the baseline means a future PR that renames or removes it will have no documented baseline to compare against, which is the exact failure mode this changelog is meant to prevent.
Suggestion: Add a one-line entry under one of the existing subsections (Core or a new "Logging" subsection) describing tron:error_info (Counter, labels topic, type) — emitted on every ERROR-level log line by InstrumentedAppender.
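For context on why the baseline entry matters, a hedged sketch of how such a counter is typically consumed once documented (label names as described above; the rate window is arbitrary):

```promql
# ERROR-level log lines per second, broken down by logger topic.
sum by (topic) (rate(tron:error_info[5m]))
```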
Added a dedicated Logging subsection to the Pre-4.8.2 Baseline with tron:error_info (Counter, labels topic, type) — incremented on every ERROR-level log line by InstrumentedAppender.
Also took the opportunity to do a full audit of all metric registration sites (MetricsCounter, MetricsGauge, MetricsHistogram, InstrumentedAppender, OperatingSystemExports, GuavaCacheExports). Found one more gap: process_cpu_load (defined in OperatingSystemExports) was missing from the System section — added that in the same commit.
#### DB

- `tron:db_size_bytes` (Gauge, labels `type`, `db`, `level`) — LevelDB storage size.
[NIT] DB metrics are not LevelDB-only — RocksDB also emits these
tron:db_size_bytes and tron:db_sst_level are emitted from chainbase/src/main/java/org/tron/common/storage/metric/DbStat.java:18-19, where the first label value comes from DbStat#getEngine(). Both LevelDbDataSourceImpl (whose getEngine() returns LEVELDB) and RocksDbDataSourceImpl (whose getEngine() returns ROCKSDB) extend DbStat, so on a node configured with the RocksDB backend these metrics are emitted with engine="ROCKSDB".
Describing them as "LevelDB storage size" / "LevelDB SST files per compaction level" will mislead operators of RocksDB-backend nodes into thinking the metrics are missing, or into writing PromQL that implicitly filters for LevelDB and drops half their data.
Suggestion: Reword as engine-aware, e.g. "Storage size in bytes per (engine, db, level); engine ∈ {LEVELDB, ROCKSDB}." The first label is named type in code (MetricsGauge.java:19-20) but its values are storage engines — calling that out in the description avoids the confusion.
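A hedged sketch of the pitfall and the engine-agnostic alternative (label names as documented above; values per DbStat#getEngine()):

```promql
# Pitfall: returns nothing on a node running the RocksDB backend.
sum by (db) (tron:db_size_bytes{type="LEVELDB"})

# Engine-agnostic: aggregates whichever engine the node is configured with.
sum by (db) (tron:db_size_bytes)
```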
Rewrote both entries to be engine-aware:
- `tron:db_size_bytes` — "storage size in bytes per engine, database, and level; `type` is the storage engine (`LEVELDB` or `ROCKSDB`) depending on node configuration."
- `tron:db_sst_level` — "SST files per compaction level per engine and database; `type` is the storage engine (`LEVELDB` or `ROCKSDB`) depending on node configuration."
Also noticed the Guava cache entries were missing their type label (cache name) — expanded those into individual descriptions with the label noted.
…_CHANGELOG
- Add missing tron:error_info (Counter) to Logging subsection
- Add missing process_cpu_load to System metrics list
- Fix tron:db_size_bytes / tron:db_sst_level descriptions to be engine-aware (LEVELDB or ROCKSDB) instead of LevelDB-only
- Expand Guava cache entries with type label and individual descriptions
- Add +Inf overflow operational note under tron:block_transaction_count
    init(MetricKeys.Histogram.BLOCK_TRANSACTION_COUNT,
        "Distribution of transaction counts per block.",
-       new double[]{0, 10, 50, 100, 200, 500, 1000, 2000, 5000, 10000},
+       new double[]{0, 20, 50, 80, 100, 120, 140, 160, 180, 200, 230, 260, 300, 500, 2000},
[DISCUSS] Consider keeping 5000/10000 as safety-net buckets above 2000
The densified 0–300 range is a real improvement, but capping the array at 2000 trades away an observability property the previous layout had: visibility into outlier blocks. Because buckets are cumulative and the client library always appends +Inf, any block with > 2000 transactions now lands in +Inf only — you can read "how many exceeded 2000" as +Inf - 2000, but you can't tell whether those samples are 2001 or 50000. The previous array [..., 2000, 5000, 10000] did give that resolution, even if its lower-range buckets were too coarse for the typical workload.
A hybrid layout keeps both properties:
[0, 20, 50, 80, 100, 120, 140, 160, 180, 200, 230, 260, 300, 500, 2000, 5000, 10000]
- The 0–300 densification stays intact for typical-TPS percentile interpolation.
- 5000 and 10000 act as safety-net buckets for rare bursts (pathological pushes, repush storms, fork-switched blocks observed before early-return at Manager.java:1273-1277).
- True outliers above 10000 still fall into +Inf, which is appropriate — at that point it's an incident signal, not something quantiles should smooth over.
Adding two buckets has no cardinality impact (buckets are series-internal, not label values) and negligible per-observe() cost. It strictly subsumes the alerting hook proposed in the other comment: the +Inf ratio alert still works as a tripwire, and operators get richer detail when it does fire.
Happy to defer this if there's a deliberate reason for the 2000 cap (e.g. aligning with an external dashboard's expected max), but didn't see one in the PR description.
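To make the "+Inf - 2000" reading concrete, a sketch of the overflow count under the capped array (conventional `_bucket` exposition name assumed):

```promql
# Blocks in the last hour with more than 2000 transactions. With the array
# capped at 2000 this subtraction is the only resolution above that point;
# with 5000/10000 safety-net buckets it can be repeated per boundary.
sum(increase(tron:block_transaction_count_bucket{le="+Inf"}[1h]))
-
sum(increase(tron:block_transaction_count_bucket{le="2000.0"}[1h]))
```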
Added an Operational note under tron:block_transaction_count recommending a +Inf overflow ratio alert as the signal to re-tune the upper bound.
On the bucket cap itself: real-world TRON blocks average around 300–350 tx, and even at the observed peak TPS (~272) a 3-second block tops out around ~800 tx — well under 2000. The 2000 cap covers both normal and peak operation; the +Inf ratio alert in the operational note serves as the tripwire for pathological cases (repush storms, fork-switched blocks). Keeping the array short with the alert as the escalation path seems sufficient for now, but happy to add 5000/10000 back if you feel the extra resolution on those outlier events is worth it.
In stress-testing scenarios, the per-block transaction count will exceed 2000.
Good point on the stress-test / repush scenario. Added 5000 and 10000 back as safety-net buckets — the array is now [0, 20, 50, 80, 100, 120, 140, 160, 180, 200, 230, 260, 300, 500, 2000, 5000, 10000]. The 0–300 densification for typical-TPS percentiles is unchanged, and the +Inf operational note is updated to reference 10000 as the new effective upper bound.
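A hedged sketch of what the operational note's overflow tripwire could look like against the restored 10000 upper bound (the 1% threshold and 1h window are illustrative, not taken from the changelog):

```promql
# Alert when more than 1% of blocks in the last hour exceed the highest
# finite bucket (10000), signalling the array needs re-tuning.
(
  sum(increase(tron:block_transaction_count_bucket{le="+Inf"}[1h]))
  - sum(increase(tron:block_transaction_count_bucket{le="10000.0"}[1h]))
)
/
sum(increase(tron:block_transaction_count_bucket{le="+Inf"}[1h]))
> 0.01
```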
…_count Restore 5000 and 10000 as upper safety-net buckets so outlier events (stress tests, repush storms, fork-switched blocks) retain resolution above 2000. The 0-300 densification for typical-TPS percentiles is unchanged. Update METRICS_CHANGELOG.md and its operational note accordingly.
What does this PR do?
Two changes related to the metrics added in #6624:
- Refine the `tron:block_transaction_count` buckets from `[0, 10, 50, 100, 200, 500, 1000, 2000, 5000, 10000]` to `[0, 20, 50, 80, 100, 120, 140, 160, 180, 200, 230, 260, 300, 500, 2000]`, densifying the 0–300 range so percentile interpolation in the typical per-SR TPS range is more accurate. Per @Sunny6889's review on feat(metrics): add block transaction count and SR set change monitoring #6624.
- Introduce `METRICS_CHANGELOG.md` to track Prometheus metric additions, changes, and removals going forward, with a Pre-4.8.2 baseline snapshot of currently emitted metrics. Closes one of the post-merge follow-ups noted by @halibobo1205 on feat(metrics): add block transaction count and SR set change monitoring #6624.

Why are these changes required?
The bucket change makes the histogram useful for the workload it was designed to observe — most blocks fall under 300 tx, and the previous bucket layout was too coarse there to interpolate percentiles meaningfully. The changelog gives operators a single in-repo place to see what metrics each release adds or removes; today this information is scattered across release notes and the tron-docker `metric_monitor/README.md`.

Related to #6590 (closed when #6624 merged).
This PR has been tested by:
Follow up
- Update the tron-docker `metric_monitor/README.md` to mirror the new metrics and the bucket change.

Extra details
The existing `PrometheusApiServiceTest` only queries the `le="0.0"` bucket (empty-block detection), which exists in both the old and new bucket arrays, so no test changes are needed.