
feat: add CPU/memory resource telemetry to daemon heartbeats#1724

Open
Siddhant-K-code wants to merge 9 commits into main from devin/1782969739-add-resource-telemetry

@Siddhant-K-code Siddhant-K-code commented Jul 2, 2026

Collaborator

Summary

Adds periodic CPU/memory sampling and .git/ai storage telemetry to daemon heartbeats for observability in ClickHouse (PD-54).

Commit 1: CPU/memory resource telemetry

New module daemon::resource_sampler — lightweight process-level sampler using OS primitives:

  • CPU: libc::getrusage(RUSAGE_SELF) → user+system time delta / wall time = percent
  • RSS: /proc/self/statm on Linux, getrusage.ru_maxrss on macOS
  • Samples stored in a bounded VecDeque<ResourceSample> (cap 64), drained on heartbeat
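The sampling arithmetic described above can be sketched as follows. This is a minimal illustration, not the PR's code: the field names of ResourceSample, the helper names cpu_percent and push_bounded, the multiplication by 100, and the choice to evict the oldest sample when the buffer is full are all assumptions. The CPU-seconds reading is passed in as a plain number so the sketch stays independent of libc.

```rust
use std::collections::VecDeque;
use std::time::Duration;

/// Hypothetical sample type; the field names are illustrative, not the PR's.
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct ResourceSample {
    pub cpu_percent: f64,
    pub rss_bytes: u64,
}

/// Maximum number of samples kept between heartbeats (the PR states a cap of 64).
pub const MAX_SAMPLES: usize = 64;

/// CPU percent over one interval:
/// (delta of user+system CPU seconds) / (wall seconds) * 100.
/// Returns None when the interval is empty or the CPU counter went backwards.
pub fn cpu_percent(prev_cpu_secs: f64, cur_cpu_secs: f64, wall: Duration) -> Option<f64> {
    let wall_secs = wall.as_secs_f64();
    let delta = cur_cpu_secs - prev_cpu_secs;
    if wall_secs <= 0.0 || delta < 0.0 {
        return None;
    }
    Some(delta / wall_secs * 100.0)
}

/// Push a sample, evicting the oldest one once the buffer is full
/// (assumed policy; the PR only says the buffer is bounded).
pub fn push_bounded(buf: &mut VecDeque<ResourceSample>, sample: ResourceSample) {
    if buf.len() == MAX_SAMPLES {
        buf.pop_front();
    }
    buf.push_back(sample);
}
```

With these definitions, half a CPU-second consumed over a two-second wall interval yields 25 percent, and pushing 70 samples leaves the most recent 64 in the buffer.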

Integration into telemetry_flush_loop:

// Sample every ~30s (every 10th flush tick × 3s interval)
if sample_tick_counter >= RESOURCE_SAMPLE_TICK_INTERVAL {
    resource_sampler.sample();
}
// On heartbeat (every 15min), drain aggregates into event fields
let resource_stats = resource_sampler.drain_stats();

Heartbeat fields: cpu_percent_{min,max,mean,median}, rss_bytes_{min,max,mean,median,current}, resource_sample_count
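For readers unfamiliar with how the min/max/mean/median fields could be derived from a drained window, here is a hedged sketch. The Stats struct and the aggregate function are hypothetical names; the PR's drain_stats() may compute these differently (for example, its median convention for even-sized windows is not stated, and averaging the two middle values is an assumption here).

```rust
/// Hypothetical aggregate record for one heartbeat window.
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct Stats {
    pub min: f64,
    pub max: f64,
    pub mean: f64,
    pub median: f64,
}

/// Compute min/max/mean/median over a window; returns None when it is empty,
/// so a heartbeat with no samples can omit the fields instead of reporting zeros.
pub fn aggregate(values: &[f64]) -> Option<Stats> {
    if values.is_empty() {
        return None;
    }
    let mut sorted = values.to_vec();
    // total_cmp provides a total order over f64, so the sort cannot panic on NaN.
    sorted.sort_by(|a, b| a.total_cmp(b));
    let n = sorted.len();
    let mean = sorted.iter().sum::<f64>() / n as f64;
    // Even-sized windows average the two middle values (assumed convention).
    let median = if n % 2 == 1 {
        sorted[n / 2]
    } else {
        (sorted[n / 2 - 1] + sorted[n / 2]) / 2.0
    };
    Some(Stats { min: sorted[0], max: sorted[n - 1], mean, median })
}
```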

DaemonLogFieldValue::from_f64() for clean float→JSON serialization (returns Null for NaN/Infinity).
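The NaN/Infinity handling matters because JSON has no representation for non-finite numbers. A minimal sketch of the idea, using a hypothetical two-variant FieldValue enum as a stand-in for DaemonLogFieldValue (whose real variants are not shown in this PR description):

```rust
/// Hypothetical stand-in for DaemonLogFieldValue with only the variants needed here.
#[derive(Debug, Clone, PartialEq)]
pub enum FieldValue {
    Null,
    Float(f64),
}

impl FieldValue {
    /// Finite floats pass through; NaN and +/-Infinity become Null,
    /// since JSON cannot encode them as numbers.
    pub fn from_f64(v: f64) -> Self {
        if v.is_finite() {
            FieldValue::Float(v)
        } else {
            FieldValue::Null
        }
    }
}
```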

Commit 2: .git/ai storage telemetry

New module daemon::storage_sampler — tracks known repo ai_dirs discovered through trace2 ingestion:

  • Global RwLock<HashSet<PathBuf>> of known ai_dir paths, registered in TraceNormalizer when repos are discovered
  • scan_storage() computes aggregate stats at heartbeat time with bounded traversal (max 10k entries per dir, 2s time limit, skip symlinks)
  • Skips archived (old-*) working log directories

New heartbeat fields: git_ai_dir_bytes, working_logs_dir_bytes, working_logs_count, working_log_largest_bytes
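The bounded traversal described for scan_storage() (entry cap, time limit, skipped symlinks) might look roughly like the sketch below. It uses only the standard library. The names DirScan and dir_size_bounded and the truncated flag are hypothetical, and the limits are passed as parameters rather than taken from the PR's constants.

```rust
use std::fs;
use std::path::{Path, PathBuf};
use std::time::{Duration, Instant};

/// Result of a best-effort scan; `truncated` is a hypothetical flag, not a PR field.
#[derive(Debug, Default, PartialEq)]
pub struct DirScan {
    pub bytes: u64,
    pub entries: usize,
    pub truncated: bool,
}

/// Sum regular-file sizes under `root`, stopping once `max_entries` entries
/// have been visited or `budget` has elapsed. Unreadable paths are skipped.
pub fn dir_size_bounded(root: &Path, max_entries: usize, budget: Duration) -> DirScan {
    let start = Instant::now();
    let mut out = DirScan::default();
    let mut stack: Vec<PathBuf> = vec![root.to_path_buf()];
    while let Some(dir) = stack.pop() {
        let Ok(read_dir) = fs::read_dir(&dir) else { continue };
        for entry in read_dir.flatten() {
            if out.entries >= max_entries || start.elapsed() >= budget {
                out.truncated = true;
                return out;
            }
            out.entries += 1;
            // symlink_metadata does not follow links, so symlinks can be detected and skipped.
            let Ok(meta) = fs::symlink_metadata(entry.path()) else { continue };
            let file_type = meta.file_type();
            if file_type.is_symlink() {
                continue;
            } else if file_type.is_dir() {
                stack.push(entry.path());
            } else if file_type.is_file() {
                out.bytes += meta.len();
            }
        }
    }
    out
}
```

A caller following the PR's stated limits would invoke something like dir_size_bounded(&ai_dir, 10_000, Duration::from_secs(2)) and treat a truncated result as a lower bound rather than an exact size.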

Performance: O(1) syscalls per CPU/RSS sample, no process spawns, no git operations, bounded buffer. Storage scan is bounded and best-effort. Both run on the existing flush loop — no new tasks or threads.

Link to Devin session: https://app.devin.ai/sessions/aa6bedb86b294920887daafe8d910cb7
Requested by: @Siddhant-K-code


Add a lightweight ResourceSampler that periodically collects process CPU
usage (via getrusage) and RSS memory (via /proc/self/statm on Linux,
getrusage on macOS) every ~30 seconds. Aggregate statistics (min, max,
mean, median) are computed and included in daemon heartbeat events that
are uploaded every 15 minutes.

New fields on heartbeat events:
- cpu_percent_{min,max,mean,median}
- rss_bytes_{min,max,mean,median,current}
- resource_sample_count

These flow through the existing daemon log pipeline into ClickHouse's
fields map without requiring schema changes.

Performance: sampling uses O(1) syscalls (getrusage, /proc read) with
no process spawns. The ring buffer is bounded at 64 samples. No work
is added to the critical trace2 ingestion path.

Refs: PD-54
Co-Authored-By: Devin AI <158243242+devin-ai-integration[bot]@users.noreply.github.com>
@Siddhant-K-code Siddhant-K-code self-assigned this Jul 2, 2026
@devin-ai-integration

Contributor

🤖 Devin AI Engineer

I'll be helping with this pull request! Here's what you should know:

✅ I will automatically:

  • Address comments on this PR. Add '(aside)' to your comment to have me ignore it.
  • Look at CI failures and help fix them

Note: I can only respond to comments from users who have write access to this repository.

devin-ai-integration[bot]

This comment was marked as resolved.

Siddhant-K-code and others added 2 commits July 2, 2026 05:50
New storage_sampler module tracks known repo ai_dirs discovered through
trace2 ingestion. At heartbeat time, computes aggregate storage stats
with bounded directory traversal (max entries, skip symlinks, time limit).

New heartbeat fields:
- git_ai_dir_bytes: total size of .git/ai directories
- working_logs_dir_bytes: total size of active working logs
- working_logs_count: number of active (non-archived) working logs
- working_log_largest_bytes: size of the largest working log

Refs: PD-54
Co-Authored-By: Devin AI <158243242+devin-ai-integration[bot]@users.noreply.github.com>
Extract scan_storage_dirs() to test scanning logic without global state,
avoiding parallel test interference from resetting the global KNOWN_AI_DIRS.

Co-Authored-By: Devin AI <158243242+devin-ai-integration[bot]@users.noreply.github.com>

Collaborator Author

Review notes from local checkout of this PR:

  • P2: Unsupported or failed resource reads are emitted as valid zero samples. process_cpu_seconds() returns 0.0 on non-Unix and process_rss_bytes() returns 0 on unsupported/failure paths. Those zeros are sampled and later emitted as resource stats, so Windows or /proc failures can create plausible-looking telemetry instead of omitting stats. I’d make the read helpers return Option and have sample() skip incomplete samples.

  • P3: Storage observability from PD-54/Sasha’s note is not implemented. The heartbeat currently adds CPU/RSS fields only. The .git/ai / working-log size fields are still missing, so that part remains out of scope unless we decide to add it here.

Validation I ran:

  • cargo test resource_sampler --lib
  • cargo test daemon_heartbeat_event --lib
  • cargo test daemon_log_field_value --lib
  • git diff --check origin/main...HEAD

All passed.

process_cpu_seconds() and process_rss_bytes() now return Option<T>,
returning None on unsupported platforms or syscall failure. sample()
skips entirely when either read fails, so Windows or /proc failures
produce no samples rather than plausible-looking zero telemetry.

Co-Authored-By: Devin AI <158243242+devin-ai-integration[bot]@users.noreply.github.com>
@devin-ai-integration

Contributor

Thanks for the review! Addressing both points:

P2 (resource read helpers): Agreed — just pushed 25cb6dc which makes process_cpu_seconds() and process_rss_bytes() return Option<T>. sample() now skips entirely when either read fails, so Windows or /proc failures produce no samples rather than plausible-looking zero telemetry.
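The Option-based pattern from this fix can be illustrated with a small sketch. The function names below are hypothetical and the page size is passed in as a parameter (the real code would presumably query it from the OS). The only external fact relied on is that the second whitespace-separated field of /proc/self/statm is the resident set size in pages.

```rust
/// Parse the resident-pages field (second column) from /proc/self/statm content.
/// Returns None if the field is missing or not a number.
pub fn parse_statm_resident_pages(statm: &str) -> Option<u64> {
    statm.split_whitespace().nth(1)?.parse().ok()
}

/// Convert statm content to RSS bytes; `page_size` is supplied by the caller
/// (an assumption made to keep this sketch free of libc).
pub fn rss_bytes_from_statm(statm: &str, page_size: u64) -> Option<u64> {
    parse_statm_resident_pages(statm)?.checked_mul(page_size)
}

/// Combine the two optional reads; yields None if either read failed,
/// so an incomplete sample is skipped instead of being recorded as zeros.
pub fn build_sample(cpu_secs: Option<f64>, rss_bytes: Option<u64>) -> Option<(f64, u64)> {
    Some((cpu_secs?, rss_bytes?))
}
```

The `?` operator on Option does the skipping: any failed read short-circuits to None, which the caller can treat as "take no sample this tick".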

P3 (storage observability): This is already implemented in the latest commits on this branch. See commits 2b5a198 and 175d159 which add a storage_sampler module with git_ai_dir_bytes, working_logs_dir_bytes, working_logs_count, and working_log_largest_bytes fields to heartbeats. The companion monorepo PR (#422) has the matching ClickHouse columns, ingestion, and Cube exposure. Your checkout may have been based on the first commit only.

Collaborator Author

Re-reviewed latest head 25cb6dc42 after the updates. The previous zero-sample and storage-field points are addressed. One remaining issue:

  • P2: Storage scan blocks the async telemetry loop for up to 2 seconds. telemetry_worker.rs calls storage_sampler::scan_storage() directly before the spawn_blocking flush. The scan does filesystem traversal and allows up to MAX_SCAN_DURATION = 2s. It is heartbeat-only, but it still blocks a Tokio worker thread and delays the telemetry loop. I’d move the storage scan into a blocking task, or include it inside the existing blocking flush path.

Validation rerun:

  • cargo test resource_sampler --lib
  • cargo test storage_sampler --lib
  • cargo test daemon_heartbeat_event --lib
  • git diff --check origin/main...HEAD

All passed.

Co-Authored-By: Devin AI <158243242+devin-ai-integration[bot]@users.noreply.github.com>
@devin-ai-integration

Contributor

Fixed in 1a691dd: scan_storage() is now called inside tokio::task::spawn_blocking(...) so it runs on the blocking thread pool and won't hold up the async telemetry loop. The 2s budget still applies but it no longer starves other async work on that worker thread.

@Siddhant-K-code Siddhant-K-code marked this pull request as ready for review July 2, 2026 07:56
Siddhant-K-code and others added 2 commits July 2, 2026 07:57
Co-Authored-By: Devin AI <158243242+devin-ai-integration[bot]@users.noreply.github.com>
Co-Authored-By: Devin AI <158243242+devin-ai-integration[bot]@users.noreply.github.com>
@devin-ai-integration

Contributor

Fixed linked worktree under-reporting in 303ad93:

The storage scanner now discovers ai/worktrees/*/working_logs directories alongside the direct ai/working_logs. Previously only ai_dir/working_logs was scanned for working_logs_dir_bytes, working_logs_count, and working_log_largest_bytes — missing logs stored under isolated worktree paths (ai/worktrees/<name>/working_logs).

Also moved scan_storage() into spawn_blocking (from an earlier commit) so the up-to-2s filesystem traversal doesn't block the async telemetry loop.

New test: scan_storage_dirs_includes_linked_worktree_working_logs verifies both direct and worktree logs are counted.
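As a rough illustration of the discovery step described in this comment, the sketch below collects the direct ai/working_logs directory plus any ai/worktrees/<name>/working_logs directories. The function names are hypothetical, and treating every directory whose name starts with "old-" as archived is an assumption based on the PR summary rather than on the actual code.

```rust
use std::fs;
use std::path::{Path, PathBuf};

/// Collect every working_logs directory under an ai_dir: the direct one and
/// one per linked worktree. Missing directories are silently ignored.
pub fn working_log_roots(ai_dir: &Path) -> Vec<PathBuf> {
    let mut roots = Vec::new();
    let direct = ai_dir.join("working_logs");
    if direct.is_dir() {
        roots.push(direct);
    }
    if let Ok(entries) = fs::read_dir(ai_dir.join("worktrees")) {
        for entry in entries.flatten() {
            let candidate = entry.path().join("working_logs");
            if candidate.is_dir() {
                roots.push(candidate);
            }
        }
    }
    // Sort for deterministic output; read_dir order is platform-dependent.
    roots.sort();
    roots
}

/// Archived working logs are assumed to be named with an "old-" prefix.
pub fn is_archived(name: &str) -> bool {
    name.starts_with("old-")
}
```

Each returned root could then be scanned with the same bounded traversal used for the rest of the ai_dir, skipping entries for which is_archived returns true.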

devin-ai-integration[bot]

This comment was marked as resolved.

Co-Authored-By: Devin AI <158243242+devin-ai-integration[bot]@users.noreply.github.com>
@Siddhant-K-code

Collaborator Author

Validation update from actual installed-user flow:

I built and installed this PR through the real dev install path, then used the installed binary rather than target/* directly:

./scripts/dev.sh --release
~/.git-ai/bin/git-ai --version
~/.git-ai/bin/git-ai daemon shutdown || true
~/.git-ai/bin/git-ai daemon start
~/.git-ai/bin/git-ai daemon status

For telemetry capture, I temporarily pointed local config at a local HTTP capture server with a dummy API key:

{
  "api_base_url": "http://127.0.0.1:8765",
  "api_key": "pd54-local-validation",
  "disable_auto_updates": true,
  "disable_version_checks": true,
  "feature_flags": { "daemon_log_upload": true }
}

Then I exercised a normal repo flow using the installed CLI:

FLOW=/tmp/pd54-git-ai-user-flow
rm -rf "$FLOW"
mkdir -p "$FLOW"
cd "$FLOW"
git init -q
git config user.name "PD54 Reviewer"
git config user.email "pd54-reviewer@example.com"
printf 'hello\n' > demo.txt
git add demo.txt
git commit -q -m 'initial human commit'
printf 'hello from ai\n' >> demo.txt
~/.git-ai/bin/git-ai checkpoint mock_ai demo.txt
~/.git-ai/bin/git-ai status --json
~/.git-ai/bin/git-ai log --oneline -1
~/.git-ai/bin/git-ai daemon status --repo "$FLOW"

Observed result:

  • CLI flow succeeded; daemon processed the repo (latest_seq: 6).
  • Local capture server received real POST /worker/logs/upload requests from git-ai/1.6.10.
  • Waited for the production heartbeat interval instead of shortening it.
  • Captured heartbeat at 2026-07-02T08:27:46.073808+00:00 with:
{
  "kind": "heartbeat",
  "message": "alive",
  "resource_sample_count": 30,
  "cpu_percent_min": 0.046381397296507886,
  "cpu_percent_max": 0.4034800118999653,
  "cpu_percent_mean": 0.06926008667198168,
  "cpu_percent_median": 0.05876753455661267,
  "rss_bytes_current": 25116672,
  "rss_bytes_min": 25116672,
  "rss_bytes_max": 25116672,
  "rss_bytes_mean": 25116672,
  "rss_bytes_median": 25116672
}

So the CPU/RSS telemetry is verified end-to-end from installed release build -> daemon heartbeat -> daemon-log upload payload.

One caveat from this actual-flow run: that first captured heartbeat did not include git_ai_dir_bytes, working_logs_dir_bytes, working_logs_count, or working_log_largest_bytes. I then ran ~/.git-ai/bin/git-ai debug, and the repo's Trace2 checks passed, but I did not wait a second full 15-minute heartbeat cycle before posting this update.

Co-Authored-By: Devin AI <158243242+devin-ai-integration[bot]@users.noreply.github.com>
@Siddhant-K-code

Siddhant-K-code commented Jul 2, 2026

Collaborator Author

Windows tests are failing due to #1515, see: #1721 (comment)
