Skip to content

feat: add p2p observability metrics#2242

Open
vbhattaccmu wants to merge 11 commits into
developfrom
feat/new-p2p-metrics
Open

feat: add p2p observability metrics#2242
vbhattaccmu wants to merge 11 commits into
developfrom
feat/new-p2p-metrics

Conversation

@vbhattaccmu
Copy link
Copy Markdown

@vbhattaccmu vbhattaccmu commented May 26, 2026

Summary

Adds focused Bor P2P observability for peer-drop RCA: a peer disconnect counter (p2p/peers/drops) and devp2p ping RTT timer (p2p/peers/ping/latency). This lets operators correlate peer-drop bursts with peer liveness/latency from without relying only on log scraping.

Executed tests

a. go test ./p2p
Result: pass.

b. Kurtosis Polygon PoS devnet stress test with Amoy configuration:
4 Bor validators, 6 Bor RPC nodes
distributed polycli loadtest across all 6 RPC endpoints
load window: 2026-05-26T14:47:45Z to 2026-05-26T14:52:47Z

Rollout notes

No coordinated upgr
ade required.

@pratikspatil024
Copy link
Copy Markdown
Member

@claude review once

@claude
Copy link
Copy Markdown

claude Bot commented May 26, 2026

Claude finished @pratikspatil024's task in 2m 40s —— View job


Claude PR Review

  • Gather context (read changed files, related code)
  • Analyze code for bugs, race conditions, performance, observability concerns
  • Cross-check against repo .claude/rules/ (p2p-security, security-common)
  • Post review feedback

Summary

Two narrowly-scoped observability additions:

  1. p2p/peers/drops counter — incremented in the server's delpeer handler.
  2. p2p/peers/ping/latency timer — measured by adding a pongRecv channel and timestamping the most recent outbound ping in pingLoop.

The change is small, well-targeted at peer-drop RCA, and follows existing naming conventions. It does not introduce new wire messages, untrusted allocations, or expand the attack surface. Overall, looks good with a few minor concerns to consider — none are blockers.


Findings

1. RTT measurement can be biased by an adversarial peer (LOW — metric manipulation only)

A malicious peer can lower its reported p2p/peers/ping/latency by spamming pongMsg (any unsolicited pong arriving after our ping will be measured). Conversely, withholding pong inflates latency only until the next ping cycle resets pingSent. Since devp2p ping doesn't carry a nonce, this is unavoidable without a wire-protocol change. Worth a comment in the code so future readers know the timer is best-effort and untrusted-peer-influenced, not an authoritative liveness signal.

// peerPingLatencyTimer reports devp2p ping RTT. Peers can bias the value
// (low) by sending unsolicited pongs; treat as a coarse health signal, not
// a security/SLA metric.

2. pingSent is overwritten if a pong is never received before the next ping (LOW — measurement accuracy)

In p2p/peer.go:341-359, when ping.C fires:

  • the drain loop empties any stale pongRecv,
  • a new ping is sent,
  • pingSent is overwritten with time.Now().

If the previous ping's pong then arrives after this re-send, it's measured against the new pingSent and reports an artificially small latency (~ tiny ms instead of >15s). This is the same edge case as #1, but self-inflicted: any peer with RTT > pingInterval (15s) will produce nonsense numbers. Acceptable for a coarse signal; just be aware that the histogram's bottom is "stale pong" noise.

If precision matters, attach a per-ping nonce in a local counter and only accept a pong matching the latest nonce — but that requires modifying message handling beyond what this PR aims to do.

3. Drain block placement is fine but worth a one-line comment (NIT)

The drain: loop at p2p/peer.go:342-349 exists to discard pongs from prior, unmatched pings before starting a fresh measurement. That intent isn't obvious from the code. A one-line comment would help future maintainers:

// Drop any pong from an earlier, unmeasured ping so the next pong
// always corresponds to the ping we are about to send.
drain:
    for { ... }

4. peerDropMeter counts all removals, including graceful local shutdown (LOW — interpretation)

p2p/server.go:880 increments the counter for every delpeer event, which includes node shutdown, peer-initiated graceful disconnect (DiscQuitting), and useless-peer drops — not just abnormal drops. For RCA, raw count is still useful, but consider whether a future enhancement should label by pd.requested / pd.err (similar to how markDialError / markServeError already partition dial/serve errors). Not required for this PR; just flagging for the metric's interpretability when wired into dashboards.

5. msg.Discard() error handling is now consistent — minor behavior change (INFO)

In p2p/peer.go:398-400, the previous code called msg.Discard() and ignored its error for pingMsg. The new code returns the error, which (correctly) propagates as a protoErr and tears down the peer. This is strictly better — Discard's error means the framed connection is broken — but it is a behavior change worth noting in the PR body (currently undocumented).

6. Missing test coverage for the new pong path (LOW)

TestPeerPing in p2p/peer_test.go:179 only verifies that an incoming ping produces an outgoing pong. There is no test that:

  • a pong sent to the peer updates peerPingLatencyTimer, or
  • an unsolicited flood of pongs is bounded by the default branch and doesn't block handle().

A small test exercising the pongRecv non-blocking send (e.g., spam 100 pongs, assert handle() returns immediately, assert at most one is buffered) would lock in the DoS resistance you've designed in.

7. metrics.Enabled() gate is inconsistent with the rest of p2p/metrics.go (NIT — code style)

At p2p/peer.go:354-358, you check metrics.Enabled() before stamping pingSent. Other peers/dials/serves meters in this file mark unconditionally (see serveSuccessMeter.Mark(1) etc. in server.go). Two options for consistency:

  • Drop the gate (the time.Now() cost is negligible and Update is itself cheap when metrics are off), or
  • Apply the same gate to peerDropMeter.Mark(1).

Either is fine; right now the file mixes both conventions.


Security checklist (per .claude/rules/p2p-security.md)

  • No new unbounded allocations from peer input.
  • No new wire-protocol messages; pongMsg was already accepted (silently discarded) before this PR.
  • pongRecv send uses default branch → non-blocking, peer cannot flood handle().
  • No panics introduced in consensus / sync / block production paths.
  • No new context-less blocking peer reads.

No security regressions identified. The peer-controlled inputs (pong timing) only affect a self-reported metric, not consensus or sync.


Verdict

LGTM modulo the minor issues above. Items #1, #4, #6 are the most worth addressing before merge; the rest are nits.
• branch: feat/new-p2p-metrics

Comment thread p2p/peer.go Outdated
Comment thread p2p/peer.go Outdated
Comment thread p2p/peer.go Outdated
@codecov
Copy link
Copy Markdown

codecov Bot commented May 26, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 52.92%. Comparing base (4fb8666) to head (e649411).
⚠️ Report is 13 commits behind head on develop.

Additional details and impacted files

Impacted file tree graph

@@             Coverage Diff             @@
##           develop    #2242      +/-   ##
===========================================
+ Coverage    52.87%   52.92%   +0.05%     
===========================================
  Files          886      886              
  Lines       156786   156921     +135     
===========================================
+ Hits         82898    83055     +157     
+ Misses       68640    68607      -33     
- Partials      5248     5259      +11     
Files with missing lines Coverage Δ
p2p/metrics.go 57.14% <ø> (+38.09%) ⬆️
p2p/peer.go 75.69% <100.00%> (+5.45%) ⬆️
p2p/server.go 73.39% <100.00%> (+0.03%) ⬆️

... and 38 files with indirect coverage changes

Files with missing lines Coverage Δ
p2p/metrics.go 57.14% <ø> (+38.09%) ⬆️
p2p/peer.go 75.69% <100.00%> (+5.45%) ⬆️
p2p/server.go 73.39% <100.00%> (+0.03%) ⬆️

... and 38 files with indirect coverage changes

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.
  • 📦 JS Bundle Analysis: Save yourself from yourself by tracking and limiting bundle sizes in JS merges.

@vbhattaccmu vbhattaccmu marked this pull request as ready for review May 28, 2026 09:35
Copy link
Copy Markdown

@claude claude Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Claude Code Review

This repository is configured for manual code reviews. Comment @claude review to trigger a review and subscribe this PR to future pushes, or @claude review once for a one-time review.

Tip: disable this comment in your organization's Code Review settings.

kamuikatsurgi
kamuikatsurgi previously approved these changes May 28, 2026
@vbhattaccmu
Copy link
Copy Markdown
Author

@claude review

Comment thread p2p/peer.go Outdated
@vbhattaccmu
Copy link
Copy Markdown
Author

@claude review

Comment thread p2p/peer.go
Comment thread p2p/peer_test.go
Comment thread p2p/peer_test.go Outdated
Comment thread p2p/peer.go
@sonarqubecloud
Copy link
Copy Markdown

sonarqubecloud Bot commented Jun 1, 2026

Copy link
Copy Markdown

@claude claude Bot left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Latest commit correctly implements the keep-alive-always + measure-only-when-clear pattern; previous halt-on-lost-pong concern is resolved.

Extended reasoning...

Overview

This PR adds two p2p observability metrics: a peer drop counter (p2p/peers/drops) at the server's delpeer handler, and a best-effort devp2p ping RTT timer (p2p/peers/ping/latency) measured via a new pongRecv channel on Peer. The pingLoop is refactored to allow injection of an interval for testing, and a test suite covers the new channel semantics.

Security risks

No meaningful expansion of attack surface:

  • No new wire messages (pongMsg was previously silently discarded)
  • pongRecv send uses a default: clause — a peer cannot block handle() by flooding pongs
  • pongRecv capacity 1 + drain-only-when-starting-measurement bounds memory
  • Metric values are peer-influenceable but feed only a histogram, not consensus/sync state
  • An adversarial peer can bias its own reported latency (low) or exclude itself (by never responding); this is a fundamental devp2p limitation since pongMsg carries no nonce — it is a coarse health signal, not an authoritative SLA metric

Level of scrutiny

The PR has gone through several iterations addressing subtle correctness concerns flagged in prior bot reviews. The final state at e649411 correctly implements the keep-alive-always-send + measure-only-when-clear pattern, which preserves liveness semantics in the face of lost pongs while still recording RTT for healthy exchanges. The test cleanup ordering is also fixed (pipes close before wg.Wait()).

Other factors

  • Codecov: all modified lines covered, project coverage +0.07%
  • SonarQube quality gate passed
  • The remaining limitation — RTT misattribution under sustained RTT > pingInterval — is a protocol-level constraint and acceptable for a coarse signal
  • Author has been responsive throughout iteration; the previous "halt on lost pong" bot comment from 17:18:13Z is stale relative to the current HEAD

Copy link
Copy Markdown
Contributor

@lucca30 lucca30 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lgtm

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants