feat(metrics): refine block_transaction_count buckets, add changelog #6730
kuny0707 merged 4 commits into tronprotocol:develop
Conversation
afbeb57 to 7416f46
…_CHANGELOG.md Densify the default histogram buckets around 0-300 (the typical TPS range) so percentile interpolation for per-SR throughput is more accurate. New buckets: [0, 20, 50, 80, 100, 120, 140, 160, 180, 200, 230, 260, 300, 500, 2000]. Also introduce METRICS_CHANGELOG.md to track Prometheus metric additions, changes, and removals going forward, with a Pre-4.8.2 baseline snapshot of currently emitted metrics.
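For illustration, a sketch of the percentile query this change is meant to sharpen, assuming the histogram is scraped with the conventional `_bucket` exposition suffix (`tron:block_transaction_count_bucket`):

```promql
# p95 transactions per block over the last 5 minutes. histogram_quantile
# interpolates linearly inside the bucket containing the quantile, so for a
# workload living in the 100-300 tx/block range the old layout resolved the
# estimate only within a 100- to 300-wide bucket, while the refined array
# narrows that to a 20- to 40-wide slot.
histogram_quantile(
  0.95,
  sum by (le) (rate(tron:block_transaction_count_bucket[5m]))
)
```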
7416f46 to 4252888
Replace the grafana.com/dashboards/16567 link with the maintained JSON in tron-docker, per @Sunny6889's review on tronprotocol#6730. The community dashboard 16567 is no longer being updated; the in-org JSON is the current source of truth.
The block_transaction_count histogram buckets in java-tron are being refined to [0, 20, 50, 80, 100, 120, 140, 160, 180, 200, 230, 260, 300, 500, 2000] (tronprotocol/java-tron#6730), densifying the 0-300 range for percentile interpolation in the typical per-SR TPS range. Update the Block Transaction Count Distribution panel to match: the old le="10.0" and le="1000.0" boundaries no longer exist in the refined array, so those queries would return nothing. Adopt 9 legend buckets aligned to the new boundaries (0, 1-20, 21-50, 51-100, 101-200, 201-300, 301-500, 501-2000, 2000+) so the panel makes use of the added 0-300 resolution.
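As an illustration of the query shape behind one such legend bucket (a hedged sketch, not the maintained tron-docker JSON; the `_bucket` exposition name and Grafana's `$__range` variable are the only assumptions):

```promql
# Blocks with 101-200 transactions over the dashboard range, computed as the
# difference of two cumulative bucket counters. Both le boundaries exist in
# the refined array; the old le="10.0" / le="1000.0" boundaries would not.
sum(increase(tron:block_transaction_count_bucket{le="200.0"}[$__range]))
-
sum(increase(tron:block_transaction_count_bucket{le="100.0"}[$__range]))
```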
Update the documented bucket array for tron:block_transaction_count to match the refined buckets in tronprotocol/java-tron#6730: [0, 20, 50, 80, 100, 120, 140, 160, 180, 200, 230, 260, 300, 500, 2000]. Per @Sunny6889's review on this PR.
Snapshot of metrics emitted prior to this changelog. Per-version provenance is not tracked here; consult `git log` on [`common/src/main/java/org/tron/common/prometheus/`](common/src/main/java/org/tron/common/prometheus/) for exact origin of each metric.

### Existing Metrics
[NIT] Pre-4.8.2 baseline omits tron:error_info
The baseline claims to be a complete snapshot of metrics emitted prior to this changelog, but it does not list tron:error_info (Counter, labels topic, type), which is registered in common/src/main/java/org/tron/common/prometheus/InstrumentedAppender.java:12-17 and emitted on every ERROR-level log line. The metric is already actively asserted in framework/src/test/java/org/tron/core/metrics/prometheus/PrometheusApiServiceTest.java:114.
Leaving it out of the baseline means a future PR that renames or removes it will have no documented baseline to compare against, which is the exact failure mode this changelog is meant to prevent.
Suggestion: Add a one-line entry under one of the existing subsections (Core or a new "Logging" subsection) describing tron:error_info (Counter, labels topic, type) — emitted on every ERROR-level log line by InstrumentedAppender.
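For context on why the baseline entry matters, a hedged sketch of how such a counter is typically consumed once documented (label names as described above; the rate window is arbitrary):

```promql
# ERROR-level log lines per second, broken down by logger topic.
sum by (topic) (rate(tron:error_info[5m]))
```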
Added a dedicated Logging subsection to the Pre-4.8.2 Baseline with tron:error_info (Counter, labels topic, type) — incremented on every ERROR-level log line by InstrumentedAppender.
Also took the opportunity to do a full audit of all metric registration sites (MetricsCounter, MetricsGauge, MetricsHistogram, InstrumentedAppender, OperatingSystemExports, GuavaCacheExports). Found one more gap: process_cpu_load (defined in OperatingSystemExports) was missing from the System section — added that in the same commit.
#### DB

- `tron:db_size_bytes` (Gauge, labels `type`, `db`, `level`) — LevelDB storage size.
[NIT] DB metrics are not LevelDB-only — RocksDB also emits these
tron:db_size_bytes and tron:db_sst_level are emitted from chainbase/src/main/java/org/tron/common/storage/metric/DbStat.java:18-19, where the first label value comes from DbStat#getEngine(). Both LevelDbDataSourceImpl (whose getEngine() returns LEVELDB) and RocksDbDataSourceImpl (whose getEngine() returns ROCKSDB) extend DbStat, so on a node configured with the RocksDB backend these metrics are emitted with engine="ROCKSDB".
Describing them as "LevelDB storage size" / "LevelDB SST files per compaction level" will mislead operators of RocksDB-backend nodes into thinking the metrics are missing, or into writing PromQL that implicitly filters for LevelDB and drops half their data.
Suggestion: Reword as engine-aware, e.g. "Storage size in bytes per (engine, db, level); engine ∈ {LEVELDB, ROCKSDB}." The first label is named type in code (MetricsGauge.java:19-20) but its values are storage engines — calling that out in the description avoids the confusion.
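A hedged sketch of the pitfall and the engine-agnostic alternative (label names as documented above; values per DbStat#getEngine()):

```promql
# Pitfall: returns nothing on a node running the RocksDB backend.
sum by (db) (tron:db_size_bytes{type="LEVELDB"})

# Engine-agnostic: aggregates whichever engine the node is configured with.
sum by (db) (tron:db_size_bytes)
```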
Rewrote both entries to be engine-aware:
- `tron:db_size_bytes` — "storage size in bytes per engine, database, and level; `type` is the storage engine (`LEVELDB` or `ROCKSDB`) depending on node configuration."
- `tron:db_sst_level` — "SST files per compaction level per engine and database; `type` is the storage engine (`LEVELDB` or `ROCKSDB`) depending on node configuration."
Also noticed the Guava cache entries were missing their type label (cache name) — expanded those into individual descriptions with the label noted.
…_CHANGELOG
- Add missing tron:error_info (Counter) to Logging subsection
- Add missing process_cpu_load to System metrics list
- Fix tron:db_size_bytes / tron:db_sst_level descriptions to be engine-aware (LEVELDB or ROCKSDB) instead of LevelDB-only
- Expand Guava cache entries with type label and individual descriptions
- Add +Inf overflow operational note under tron:block_transaction_count
    init(MetricKeys.Histogram.BLOCK_TRANSACTION_COUNT,
        "Distribution of transaction counts per block.",
-       new double[]{0, 10, 50, 100, 200, 500, 1000, 2000, 5000, 10000},
+       new double[]{0, 20, 50, 80, 100, 120, 140, 160, 180, 200, 230, 260, 300, 500, 2000},
[DISCUSS] Consider keeping 5000/10000 as safety-net buckets above 2000
The densified 0–300 range is a real improvement, but capping the array at 2000 trades away an observability property the previous layout had: visibility into outlier blocks. Because buckets are cumulative and the client library always appends +Inf, any block with > 2000 transactions now lands in +Inf only — you can read "how many exceeded 2000" as +Inf - 2000, but you can't tell whether those samples are 2001 or 50000. The previous array [..., 2000, 5000, 10000] did give that resolution, even if its lower-range buckets were too coarse for the typical workload.
A hybrid layout keeps both properties:
[0, 20, 50, 80, 100, 120, 140, 160, 180, 200, 230, 260, 300, 500, 2000, 5000, 10000]
- The 0–300 densification stays intact for typical-TPS percentile interpolation.
- 5000 and 10000 act as safety-net buckets for rare bursts (pathological pushes, repush storms, fork-switched blocks observed before early-return at Manager.java:1273-1277).
- True outliers above 10000 still fall into +Inf, which is appropriate — at that point it's an incident signal, not something quantiles should smooth over.
Adding two buckets has no cardinality impact (buckets are series-internal, not label values) and negligible per-observe() cost. It strictly subsumes the alerting hook proposed in the other comment: the +Inf ratio alert still works as a tripwire, and operators get richer detail when it does fire.
Happy to defer this if there's a deliberate reason for the 2000 cap (e.g. aligning with an external dashboard's expected max), but didn't see one in the PR description.
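To make the "+Inf - 2000" reading concrete, a sketch of the overflow count under the capped array (conventional `_bucket` exposition name assumed):

```promql
# Blocks in the last hour with more than 2000 transactions. With the array
# capped at 2000 this subtraction is the only resolution above that point;
# with 5000/10000 safety-net buckets it can be repeated per boundary.
sum(increase(tron:block_transaction_count_bucket{le="+Inf"}[1h]))
-
sum(increase(tron:block_transaction_count_bucket{le="2000.0"}[1h]))
```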
Added an Operational note under tron:block_transaction_count recommending a +Inf overflow ratio alert as the signal to re-tune the upper bound.
On the bucket cap itself: real-world TRON blocks average around 300–350 tx, and even at the observed peak TPS (~272) a 3-second block tops out around ~800 tx — well under 2000. The 2000 cap covers both normal and peak operation; the +Inf ratio alert in the operational note serves as the tripwire for pathological cases (repush storms, fork-switched blocks). Keeping the array short with the alert as the escalation path seems sufficient for now, but happy to add 5000/10000 back if you feel the extra resolution on those outlier events is worth it.
In stress-testing scenarios, the per-block transaction count will exceed 2000.
Good point on the stress-test / repush scenario. Added 5000 and 10000 back as safety-net buckets — the array is now [0, 20, 50, 80, 100, 120, 140, 160, 180, 200, 230, 260, 300, 500, 2000, 5000, 10000]. The 0–300 densification for typical-TPS percentiles is unchanged, and the +Inf operational note is updated to reference 10000 as the new effective upper bound.
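A hedged sketch of what the operational note's overflow tripwire could look like against the restored 10000 upper bound (the 1% threshold and 1h window are illustrative, not taken from the changelog):

```promql
# Alert when more than 1% of blocks in the last hour exceed the highest
# finite bucket (10000), signalling the array needs re-tuning.
(
  sum(increase(tron:block_transaction_count_bucket{le="+Inf"}[1h]))
  - sum(increase(tron:block_transaction_count_bucket{le="10000.0"}[1h]))
)
/
sum(increase(tron:block_transaction_count_bucket{le="+Inf"}[1h]))
> 0.01
```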
…_count Restore 5000 and 10000 as upper safety-net buckets so outlier events (stress tests, repush storms, fork-switched blocks) retain resolution above 2000. The 0-300 densification for typical-TPS percentiles is unchanged. Update METRICS_CHANGELOG.md and its operational note accordingly.
What does this PR do?
Two changes related to the metrics added in #6624:
- Refine the `tron:block_transaction_count` buckets from `[0, 10, 50, 100, 200, 500, 1000, 2000, 5000, 10000]` to `[0, 20, 50, 80, 100, 120, 140, 160, 180, 200, 230, 260, 300, 500, 2000]`, densifying the 0–300 range so percentile interpolation in the typical per-SR TPS range is more accurate. Per @Sunny6889's review on feat(metrics): add block transaction count and SR set change monitoring #6624.
- Introduce `METRICS_CHANGELOG.md` to track Prometheus metric additions, changes, and removals going forward, with a Pre-4.8.2 baseline snapshot of currently emitted metrics. Closes one of the post-merge follow-ups noted by @halibobo1205 on feat(metrics): add block transaction count and SR set change monitoring #6624.

Why are these changes required?
The bucket change makes the histogram useful for the workload it was designed to observe — most blocks fall under 300 tx, and the previous bucket layout was too coarse there to interpolate percentiles meaningfully. The changelog gives operators a single in-repo place to see what metrics each release adds or removes; today this information is scattered across release notes and the tron-docker `metric_monitor/README.md`.

Related to #6590 (closed when #6624 merged).
This PR has been tested by:
Follow up
- Update the tron-docker `metric_monitor/README.md` to mirror the new metrics and the bucket change.

Extra details
The existing `PrometheusApiServiceTest` only queries the `le="0.0"` bucket (empty-block detection), which exists in both the old and new bucket arrays, so no test changes are needed.