
feat(metrics): refine block_transaction_count buckets, add changelog#6730

Merged
kuny0707 merged 4 commits into tronprotocol:develop from warku123:feat/metrics-changelog
May 8, 2026

Conversation

@warku123 (Contributor) commented on Apr 29, 2026

What does this PR do?

Two changes related to the metrics added in #6624:

  1. Refines the default histogram buckets for tron:block_transaction_count from [0, 10, 50, 100, 200, 500, 1000, 2000, 5000, 10000] to [0, 20, 50, 80, 100, 120, 140, 160, 180, 200, 230, 260, 300, 500, 2000], densifying the 0–300 range so percentile interpolation in the typical per-SR TPS range is more accurate (see the registration sketch after this list). Per @Sunny6889's review on #6624.
  2. Introduces METRICS_CHANGELOG.md to track Prometheus metric additions, changes, and removals going forward, with a Pre-4.8.2 baseline snapshot of currently emitted metrics. Closes one of the post-merge follow-ups noted by @halibobo1205 on #6624.
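
For illustration, registering a histogram with the refined bucket array through the plain Prometheus Java client would look roughly like the sketch below. The class name is invented; java-tron actually goes through its own Metrics/MetricKeys wrapper (see the init(...) diff quoted later in this thread), and the review below later restores 5000 and 10000 as safety-net buckets.

```java
import io.prometheus.client.Histogram;

class BlockTxHistogramSketch {
  // Illustrative registration only; java-tron wires this up through its
  // own Metrics/MetricKeys wrapper rather than building it directly.
  static final Histogram BLOCK_TX_COUNT = Histogram.build()
      .name("tron:block_transaction_count")
      .help("Distribution of transaction counts per block.")
      .buckets(0, 20, 50, 80, 100, 120, 140, 160, 180, 200,
               230, 260, 300, 500, 2000)
      .register();
}
```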

Why are these changes required?

The bucket change makes the histogram useful for the workload it was designed to observe — most blocks fall under 300 tx, and the previous bucket layout was too coarse there to interpolate percentiles meaningfully. The changelog gives operators a single in-repo place to see what metrics each release adds or removes; today this information is scattered across release notes and the tron-docker metric_monitor/README.md.
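
For context on why bucket width matters: histogram_quantile locates the bucket containing the target rank and linearly interpolates inside it. A minimal sketch of that interpolation (the class and method names are invented; this is the standard formula, not java-tron code):

```java
class QuantileInterpolationSketch {
  // Linear interpolation as performed by histogram_quantile once the
  // target rank is located in the bucket (lower, upper]:
  //   rank     - target cumulative count, q * totalObservations
  //   cumBelow - cumulative count of all buckets up to `lower`
  //   inBucket - observations that landed in (lower, upper]
  static double estimate(double lower, double upper,
                         double rank, double cumBelow, double inBucket) {
    return lower + (upper - lower) * (rank - cumBelow) / inBucket;
  }
}
```

With the old array, a median landing in (200, 500] is smeared across a 300-tx-wide bucket; in the refined array the widest bucket below 300 is only 40 tx wide (260 to 300), so the estimate tracks the real distribution much more closely.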

Related to #6590 (closed when #6624 merged).

This PR has been tested by:

  • Unit Tests

Follow up

  • Update tron-docker metric_monitor/README.md to mirror the new metrics and the bucket change.

Extra details

The existing PrometheusApiServiceTest only queries the le="0.0" bucket (empty-block detection), which exists in both the old and new bucket arrays, so no test changes are needed.
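
For reference, a bucket lookup of that shape against the default registry might look roughly like this (a sketch with an invented class name; the actual test's helpers may differ):

```java
import io.prometheus.client.CollectorRegistry;

class EmptyBlockBucketCheck {
  // Cumulative count of blocks in the le="0.0" bucket, i.e. empty blocks.
  // The 0 boundary exists in both the old and the refined bucket arrays.
  static Double emptyBlocks() {
    return CollectorRegistry.defaultRegistry.getSampleValue(
        "tron:block_transaction_count_bucket",
        new String[]{"le"},
        new String[]{"0.0"});
  }
}
```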

@warku123 force-pushed the feat/metrics-changelog branch from afbeb57 to 7416f46 on May 6, 2026 03:07
…_CHANGELOG.md

Densify the default histogram buckets around 0-300 (the typical TPS
range) so percentile interpolation for per-SR throughput is more
accurate. New buckets: [0, 20, 50, 80, 100, 120, 140, 160, 180, 200,
230, 260, 300, 500, 2000].

Also introduce METRICS_CHANGELOG.md to track Prometheus metric
additions, changes, and removals going forward, with a Pre-4.8.2
baseline snapshot of currently emitted metrics.
@warku123 force-pushed the feat/metrics-changelog branch from 7416f46 to 4252888 on May 6, 2026 03:18
@warku123 changed the title from "docs(metrics): add METRICS_CHANGELOG.md" to "feat(metrics): refine block_transaction_count buckets and add METRICS_CHANGELOG.md" on May 6, 2026
@warku123 changed the title from "feat(metrics): refine block_transaction_count buckets and add METRICS_CHANGELOG.md" to "feat(metrics): refine block_transaction_count buckets, add changelog" on May 6, 2026
Replace the grafana.com/dashboards/16567 link with the maintained
JSON in tron-docker, per @Sunny6889's review on tronprotocol#6730. The community
dashboard 16567 is no longer being updated; the in-org JSON is the
current source of truth.
warku123 added a commit to warku123/tron-docker that referenced this pull request May 6, 2026
The block_transaction_count histogram buckets in java-tron are being
refined to [0, 20, 50, 80, 100, 120, 140, 160, 180, 200, 230, 260,
300, 500, 2000] (tronprotocol/java-tron#6730), densifying the 0-300
range for percentile interpolation in the typical per-SR TPS range.

Update the Block Transaction Count Distribution panel to match: the
old le="10.0" and le="1000.0" boundaries no longer exist in the
refined array, so those queries would return nothing. Adopt 9 legend
buckets aligned to the new boundaries (0, 1-20, 21-50, 51-100,
101-200, 201-300, 301-500, 501-2000, 2000+) so the panel makes use
of the added 0-300 resolution.
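
For context, those legend buckets work because Prometheus histogram buckets are cumulative: each non-cumulative range is the difference of two adjacent cumulative samples. A sketch in Java against the client registry (invented class name; a dashboard panel would express the same subtraction in PromQL):

```java
import io.prometheus.client.CollectorRegistry;

class LegendBucketSketch {
  // Blocks with a transaction count in (100, 200]:
  // cumulative(le=200) minus cumulative(le=100).
  // Assumes both series exist; a real caller should null-check.
  static double blocks101to200() {
    CollectorRegistry registry = CollectorRegistry.defaultRegistry;
    double le200 = registry.getSampleValue("tron:block_transaction_count_bucket",
        new String[]{"le"}, new String[]{"200.0"});
    double le100 = registry.getSampleValue("tron:block_transaction_count_bucket",
        new String[]{"le"}, new String[]{"100.0"});
    return le200 - le100;
  }
}
```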
warku123 added a commit to warku123/tron-docker that referenced this pull request May 6, 2026
Update the documented bucket array for tron:block_transaction_count
to match the refined buckets in tronprotocol/java-tron#6730:
[0, 20, 50, 80, 100, 120, 140, 160, 180, 200, 230, 260, 300, 500,
2000]. Per @Sunny6889's review on this PR.
Comment thread METRICS_CHANGELOG.md

Snapshot of metrics emitted prior to this changelog. Per-version provenance is not tracked here; consult `git log` on [`common/src/main/java/org/tron/common/prometheus/`](common/src/main/java/org/tron/common/prometheus/) for exact origin of each metric.

### Existing Metrics
Collaborator commented:
[NIT] Pre-4.8.2 baseline omits tron:error_info

The baseline claims to be a complete snapshot of metrics emitted prior to this changelog, but it does not list tron:error_info (Counter, labels topic, type), which is registered in common/src/main/java/org/tron/common/prometheus/InstrumentedAppender.java:12-17 and emitted on every ERROR-level log line. The metric is already actively asserted in framework/src/test/java/org/tron/core/metrics/prometheus/PrometheusApiServiceTest.java:114.

Leaving it out of the baseline means a future PR that renames or removes it will have no documented baseline to compare against, which is the exact failure mode this changelog is meant to prevent.

Suggestion: Add a one-line entry under one of the existing subsections (Core or a new "Logging" subsection) describing tron:error_info (Counter, labels topic, type) — emitted on every ERROR-level log line by InstrumentedAppender.
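
(For reference, the register-and-increment pattern described above might look roughly like the sketch below; the class name and label derivation are invented for illustration, and the real InstrumentedAppender may differ.)

```java
import ch.qos.logback.classic.Level;
import ch.qos.logback.classic.spi.ILoggingEvent;
import ch.qos.logback.core.UnsynchronizedAppenderBase;
import io.prometheus.client.Counter;

public class ErrorInfoAppenderSketch extends UnsynchronizedAppenderBase<ILoggingEvent> {
  // Registered once; every ERROR-level log line increments one child series.
  private static final Counter ERROR_INFO = Counter.build()
      .name("tron:error_info")
      .help("ERROR-level log lines, by topic and type.")
      .labelNames("topic", "type")
      .register();

  @Override
  protected void append(ILoggingEvent event) {
    if (event.getLevel().isGreaterOrEqual(Level.ERROR)) {
      // Hypothetical label values; the real appender's extraction differs.
      ERROR_INFO.labels(event.getLoggerName(), "error").inc();
    }
  }
}
```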

warku123 (Contributor Author) replied:
Added a dedicated Logging subsection to the Pre-4.8.2 Baseline with tron:error_info (Counter, labels topic, type) — incremented on every ERROR-level log line by InstrumentedAppender.

Also took the opportunity to do a full audit of all metric registration sites (MetricsCounter, MetricsGauge, MetricsHistogram, InstrumentedAppender, OperatingSystemExports, GuavaCacheExports). Found one more gap: process_cpu_load (defined in OperatingSystemExports) was missing from the System section — added that in the same commit.

Comment thread METRICS_CHANGELOG.md Outdated

#### DB

- `tron:db_size_bytes` (Gauge, labels `type`, `db`, `level`) — LevelDB storage size.
Collaborator commented:
[NIT] DB metrics are not LevelDB-only — RocksDB also emits these

tron:db_size_bytes and tron:db_sst_level are emitted from chainbase/src/main/java/org/tron/common/storage/metric/DbStat.java:18-19, where the first label value comes from DbStat#getEngine(). Both LevelDbDataSourceImpl#getEngine() (returns LEVELDB) and RocksDbDataSourceImpl#getEngine() (returns ROCKSDB) extend DbStat, so on a node configured with the RocksDB backend these metrics are emitted with engine="ROCKSDB".

Describing them as "LevelDB storage size" / "LevelDB SST files per compaction level" will mislead operators of RocksDB-backend nodes into thinking the metrics are missing, or into writing PromQL that implicitly filters for LevelDB and drops half their data.

Suggestion: Reword as engine-aware, e.g. "Storage size in bytes per (engine, db, level); engine ∈ {LEVELDB, ROCKSDB}." The first label is named type in code (MetricsGauge.java:19-20) but its values are storage engines — calling that out in the description avoids the confusion.
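
(For reference, a condensed, hypothetical view of the pattern described above; the real DbStat and its subclasses are more involved, and the class name here is invented.)

```java
import io.prometheus.client.Gauge;

public abstract class DbStatSketch {
  // The first label is named "type" in code, but its values are storage
  // engines, so a PromQL filter like {type="LEVELDB"} silently drops
  // RocksDB-backend series.
  private static final Gauge DB_SIZE_BYTES = Gauge.build()
      .name("tron:db_size_bytes")
      .help("Storage size in bytes per engine, database, and compaction level.")
      .labelNames("type", "db", "level")
      .register();

  // LevelDbDataSourceImpl returns "LEVELDB"; RocksDbDataSourceImpl
  // returns "ROCKSDB".
  abstract String getEngine();

  protected void reportLevelSize(String db, int level, long bytes) {
    DB_SIZE_BYTES.labels(getEngine(), db, String.valueOf(level)).set(bytes);
  }
}
```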

warku123 (Contributor Author) replied:
Rewrote both entries to be engine-aware:

  • tron:db_size_bytes — "storage size in bytes per engine, database, and level; type is the storage engine (LEVELDB or ROCKSDB) depending on node configuration."
  • tron:db_sst_level — "SST files per compaction level per engine and database; type is the storage engine (LEVELDB or ROCKSDB) depending on node configuration."

Also noticed the Guava cache entries were missing their type label (cache name) — expanded those into individual descriptions with the label noted.

…_CHANGELOG

- Add missing tron:error_info (Counter) to Logging subsection
- Add missing process_cpu_load to System metrics list
- Fix tron:db_size_bytes / tron:db_sst_level descriptions to be
  engine-aware (LEVELDB or ROCKSDB) instead of LevelDB-only
- Expand Guava cache entries with type label and individual descriptions
- Add +Inf overflow operational note under tron:block_transaction_count
init(MetricKeys.Histogram.BLOCK_TRANSACTION_COUNT,
    "Distribution of transaction counts per block.",
-   new double[]{0, 10, 50, 100, 200, 500, 1000, 2000, 5000, 10000},
+   new double[]{0, 20, 50, 80, 100, 120, 140, 160, 180, 200, 230, 260, 300, 500, 2000},
Collaborator commented:
[DISCUSS] Consider keeping 5000/10000 as safety-net buckets above 2000

The densified 0–300 range is a real improvement, but capping the array at 2000 trades away an observability property the previous layout had: visibility into outlier blocks. Because buckets are cumulative and the client library always appends +Inf, any block with > 2000 transactions now lands in +Inf only — you can read "how many exceeded 2000" as +Inf - 2000, but you can't tell whether those samples are 2001 or 50000. The previous array [..., 2000, 5000, 10000] did give that resolution, even if its lower-range buckets were too coarse for the typical workload.

A hybrid layout keeps both properties:

[0, 20, 50, 80, 100, 120, 140, 160, 180, 200,
 230, 260, 300, 500, 2000, 5000, 10000]
  • The 0–300 densification stays intact for typical-TPS percentile interpolation.
  • 5000 and 10000 act as safety-net buckets for rare bursts (pathological pushes, repush storms, fork-switched blocks observed before early-return at Manager.java:1273-1277).
  • True outliers above 10000 still fall into +Inf, which is appropriate — at that point it's an incident signal, not something quantiles should smooth over.

Adding two buckets has no cardinality impact (buckets are series-internal, not label values) and negligible per-observe() cost. It strictly subsumes the alerting hook proposed in the other comment: the +Inf ratio alert still works as a tripwire, and operators get richer detail when it does fire.

Happy to defer this if there's a deliberate reason for the 2000 cap (e.g. aligning with an external dashboard's expected max), but didn't see one in the PR description.
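
For reference, the overflow check described above could be sketched against the client registry like this (invented class name; an operator would normally express it as a PromQL alerting rule, and the 2000.0 threshold reflects the array as it stood at this point in the thread):

```java
import io.prometheus.client.CollectorRegistry;

class OverflowRatioSketch {
  // Buckets are cumulative and the client always appends +Inf, so the
  // +Inf sample is the total observation count. Assumes the series exist.
  static double overflowRatio() {
    CollectorRegistry registry = CollectorRegistry.defaultRegistry;
    double atTopFinite = registry.getSampleValue("tron:block_transaction_count_bucket",
        new String[]{"le"}, new String[]{"2000.0"});
    double total = registry.getSampleValue("tron:block_transaction_count_bucket",
        new String[]{"le"}, new String[]{"+Inf"});
    // Fraction of blocks that exceeded the top finite bucket; alert when
    // this stays above a small threshold.
    return total > 0 ? (total - atTopFinite) / total : 0;
  }
}
```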

warku123 (Contributor Author) replied:
Added an Operational note under tron:block_transaction_count recommending a +Inf overflow ratio alert as the signal to re-tune the upper bound.

On the bucket cap itself: real-world TRON blocks average around 300–350 tx, and even at the observed peak TPS (~272) a 3-second block tops out around ~800 tx — well under 2000. The 2000 cap covers both normal and peak operation; the +Inf ratio alert in the operational note serves as the tripwire for pathological cases (repush storms, fork-switched blocks). Keeping the array short with the alert as the escalation path seems sufficient for now, but happy to add 5000/10000 back if you feel the extra resolution on those outlier events is worth it.

Collaborator commented:
In stress testing scenarios, the number will exceed 2000.

warku123 (Contributor Author) replied:
Good point on the stress-test / repush scenario. Added 5000 and 10000 back as safety-net buckets — the array is now [0, 20, 50, 80, 100, 120, 140, 160, 180, 200, 230, 260, 300, 500, 2000, 5000, 10000]. The 0–300 densification for typical-TPS percentiles is unchanged, and the +Inf operational note is updated to reference 10000 as the new effective upper bound.

@warku123 requested review from Sunny6889 and bladehan1 on May 8, 2026 08:54
…_count

Restore 5000 and 10000 as upper safety-net buckets so outlier events
(stress tests, repush storms, fork-switched blocks) retain resolution
above 2000. The 0-300 densification for typical-TPS percentiles is
unchanged. Update METRICS_CHANGELOG.md and its operational note
accordingly.
@bladehan1 (Collaborator) left a comment:


LGTM

@kuny0707 merged commit 8bb976a into tronprotocol:develop on May 8, 2026
12 checks passed
github-project-automation bot moved this to Done in java-tron on May 8, 2026
@halibobo1205 added this to the GreatVoyage-v4.8.2 milestone on May 9, 2026

Labels: none yet
Projects: Status: Done
5 participants