feat: add p2p observability metrics#2242
Conversation
|
@claude review once |
|
Claude finished @pratikspatil024's task in 2m 40s —— View job Claude PR Review
SummaryTwo narrowly-scoped observability additions:
The change is small, well-targeted at peer-drop RCA, and follows existing naming conventions. It does not introduce new wire messages, untrusted allocations, or expand the attack surface. Overall, looks good with a few minor concerns to consider — none are blockers. Findings1. RTT measurement can be biased by an adversarial peer (LOW — metric manipulation only)A malicious peer can lower its reported // peerPingLatencyTimer reports devp2p ping RTT. Peers can bias the value
// (low) by sending unsolicited pongs; treat as a coarse health signal, not
// a security/SLA metric.2.
|
Codecov Report✅ All modified and coverable lines are covered by tests. Additional details and impacted files@@ Coverage Diff @@
## develop #2242 +/- ##
===========================================
+ Coverage 52.87% 52.92% +0.05%
===========================================
Files 886 886
Lines 156786 156921 +135
===========================================
+ Hits 82898 83055 +157
+ Misses 68640 68607 -33
- Partials 5248 5259 +11
... and 38 files with indirect coverage changes
🚀 New features to boost your workflow:
|
There was a problem hiding this comment.
Claude Code Review
This repository is configured for manual code reviews. Comment @claude review to trigger a review and subscribe this PR to future pushes, or @claude review once for a one-time review.
Tip: disable this comment in your organization's Code Review settings.
|
@claude review |
|
@claude review |
This reverts commit a5e894c.
|
There was a problem hiding this comment.
Latest commit correctly implements the keep-alive-always + measure-only-when-clear pattern; previous halt-on-lost-pong concern is resolved.
Extended reasoning...
Overview
This PR adds two p2p observability metrics: a peer drop counter (p2p/peers/drops) at the server's delpeer handler, and a best-effort devp2p ping RTT timer (p2p/peers/ping/latency) measured via a new pongRecv channel on Peer. The pingLoop is refactored to allow injection of an interval for testing, and a test suite covers the new channel semantics.
Security risks
No meaningful expansion of attack surface:
- No new wire messages (
pongMsgwas previously silently discarded) pongRecvsend uses adefault:clause — a peer cannot blockhandle()by flooding pongspongRecvcapacity 1 + drain-only-when-starting-measurement bounds memory- Metric values are peer-influenceable but feed only a histogram, not consensus/sync state
- An adversarial peer can bias its own reported latency (low) or exclude itself (by never responding); this is a fundamental devp2p limitation since
pongMsgcarries no nonce — it is a coarse health signal, not an authoritative SLA metric
Level of scrutiny
The PR has gone through several iterations addressing subtle correctness concerns flagged in prior bot reviews. The final state at e649411 correctly implements the keep-alive-always-send + measure-only-when-clear pattern, which preserves liveness semantics in the face of lost pongs while still recording RTT for healthy exchanges. The test cleanup ordering is also fixed (pipes close before wg.Wait()).
Other factors
- Codecov: all modified lines covered, project coverage +0.07%
- SonarQube quality gate passed
- The remaining limitation — RTT misattribution under sustained RTT > pingInterval — is a protocol-level constraint and acceptable for a coarse signal
- Author has been responsive throughout iteration; the previous "halt on lost pong" bot comment from 17:18:13Z is stale relative to the current HEAD



Summary
Adds focused Bor P2P observability for peer-drop RCA: a peer disconnect counter (p2p/peers/drops) and devp2p ping RTT timer (p2p/peers/ping/latency). This lets operators correlate peer-drop bursts with peer liveness/latency from without relying only on log scraping.
Executed tests
a. go test ./p2p
Result: pass.
b. Kurtosis Polygon PoS devnet stress test with Amoy configuration:
4 Bor validators, 6 Bor RPC nodes
distributed polycli loadtest across all 6 RPC endpoints
load window: 2026-05-26T14:47:45Z to 2026-05-26T14:52:47Z
Rollout notes
No coordinated upgr
ade required.