feat: add CPU/memory resource telemetry to daemon heartbeats #1724
Siddhant-K-code wants to merge 9 commits into
Add a lightweight ResourceSampler that periodically collects process CPU
usage (via getrusage) and RSS memory (via /proc/self/statm on Linux,
getrusage on macOS) every ~30 seconds. Aggregate statistics (min, max,
mean, median) are computed and included in daemon heartbeat events that
are uploaded every 15 minutes.
New fields on heartbeat events:
- cpu_percent_{min,max,mean,median}
- rss_bytes_{min,max,mean,median,current}
- resource_sample_count
These flow through the existing daemon log pipeline into ClickHouse's
fields map without requiring schema changes.
Performance: sampling uses O(1) syscalls (getrusage, /proc read) with
no process spawns. The ring buffer is bounded at 64 samples. No work
is added to the critical trace2 ingestion path.
Refs: PD-54
Co-Authored-By: Devin AI <158243242+devin-ai-integration[bot]@users.noreply.github.com>
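The bounded buffer and the per-heartbeat aggregates described in this commit can be sketched roughly as follows. This is a minimal illustration, not the PR's code: `SampleBuffer`, `Stats`, `aggregate`, and the fields of `ResourceSample` are hypothetical names, and the eviction policy (drop the oldest sample when full) is an assumption.

```rust
use std::collections::VecDeque;

/// Hypothetical sample shape; the PR's `ResourceSample` may differ.
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct ResourceSample {
    pub cpu_percent: f64,
    pub rss_bytes: u64,
}

/// min/max/mean/median over one heartbeat window.
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct Stats {
    pub min: f64,
    pub max: f64,
    pub mean: f64,
    pub median: f64,
}

/// Fixed-capacity buffer (assumed policy: evict the oldest sample when full).
pub struct SampleBuffer {
    cap: usize,
    samples: VecDeque<ResourceSample>,
}

impl SampleBuffer {
    pub fn new(cap: usize) -> Self {
        Self { cap, samples: VecDeque::with_capacity(cap) }
    }

    /// Push a sample, dropping the oldest one once the buffer is at capacity.
    pub fn push(&mut self, sample: ResourceSample) {
        if self.cap == 0 {
            return;
        }
        if self.samples.len() == self.cap {
            self.samples.pop_front();
        }
        self.samples.push_back(sample);
    }

    /// Take every buffered sample, leaving the buffer empty for the next window.
    pub fn drain(&mut self) -> Vec<ResourceSample> {
        self.samples.drain(..).collect()
    }
}

/// Returns `None` for an empty window so the caller can omit the fields.
pub fn aggregate(values: &[f64]) -> Option<Stats> {
    if values.is_empty() {
        return None;
    }
    let mut sorted = values.to_vec();
    sorted.sort_by(|a, b| a.total_cmp(b));
    let n = sorted.len();
    // Even-length windows use the average of the two middle values.
    let median = if n % 2 == 1 {
        sorted[n / 2]
    } else {
        (sorted[n / 2 - 1] + sorted[n / 2]) / 2.0
    };
    let mean = sorted.iter().sum::<f64>() / n as f64;
    Some(Stats { min: sorted[0], max: sorted[n - 1], mean, median })
}
```

With one sample every ~30 seconds and a 15-minute heartbeat, a window holds about 30 samples, so a cap of 64 leaves headroom before any eviction would occur.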
New storage_sampler module tracks known repo ai_dirs discovered through trace2 ingestion. At heartbeat time, computes aggregate storage stats with bounded directory traversal (max entries, skip symlinks, time limit).
New heartbeat fields:
- git_ai_dir_bytes: total size of .git/ai directories
- working_logs_dir_bytes: total size of active working logs
- working_logs_count: number of active (non-archived) working logs
- working_log_largest_bytes: size of the largest working log
Refs: PD-54
Co-Authored-By: Devin AI <158243242+devin-ai-integration[bot]@users.noreply.github.com>
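A bounded traversal of the kind this commit describes might look like the sketch below. It is an illustration under assumptions: `ScanLimits`, `ScanResult`, and `bounded_dir_size` are hypothetical names, and the entry cap here applies to the whole scan rather than per directory.

```rust
use std::fs;
use std::path::{Path, PathBuf};
use std::time::{Duration, Instant};

/// Hypothetical limits; the concrete values are left to the caller.
pub struct ScanLimits {
    pub max_entries: usize,
    pub time_limit: Duration,
}

#[derive(Debug, Default, PartialEq)]
pub struct ScanResult {
    pub total_bytes: u64,
    pub entries_seen: usize,
    /// True when a limit stopped the scan before it finished.
    pub truncated: bool,
}

/// Sum regular-file sizes under `root`, stopping early once a limit is hit.
pub fn bounded_dir_size(root: &Path, limits: &ScanLimits) -> ScanResult {
    let started = Instant::now();
    let mut result = ScanResult::default();
    let mut stack: Vec<PathBuf> = vec![root.to_path_buf()];

    while let Some(dir) = stack.pop() {
        // Best effort: unreadable directories are skipped, not treated as errors.
        let Ok(entries) = fs::read_dir(&dir) else { continue };
        for entry in entries.flatten() {
            if result.entries_seen >= limits.max_entries
                || started.elapsed() >= limits.time_limit
            {
                result.truncated = true;
                return result;
            }
            result.entries_seen += 1;

            // symlink_metadata describes the link itself and never follows it.
            let Ok(meta) = fs::symlink_metadata(entry.path()) else { continue };
            let file_type = meta.file_type();
            if file_type.is_symlink() {
                continue;
            }
            if file_type.is_dir() {
                stack.push(entry.path());
            } else if file_type.is_file() {
                result.total_bytes += meta.len();
            }
        }
    }
    result
}
```

Reporting `truncated` alongside the byte total lets a consumer tell a partial scan apart from a genuinely small directory; whether the PR exposes such a flag is not stated in this conversation.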
Extract scan_storage_dirs() to test scanning logic without global state, avoiding parallel test interference from resetting the global KNOWN_AI_DIRS.
Co-Authored-By: Devin AI <158243242+devin-ai-integration[bot]@users.noreply.github.com>
Review notes from local checkout of this PR:
Validation I ran:
All passed.
process_cpu_seconds() and process_rss_bytes() now return Option<T>, returning None on unsupported platforms or syscall failure. sample() skips entirely when either read fails, so Windows or /proc failures produce no samples rather than plausible-looking zero telemetry.
Co-Authored-By: Devin AI <158243242+devin-ai-integration[bot]@users.noreply.github.com>
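The "skip the sample when either read fails" behaviour can be sketched as below. The names `parse_statm_rss_bytes`, `RawReading`, and this `sample` signature are hypothetical, and the page size is passed in as a parameter here, whereas real code would normally query it from the OS.

```rust
/// Parse resident set size from the contents of /proc/self/statm.
/// The second whitespace-separated field is the resident page count.
pub fn parse_statm_rss_bytes(statm: &str, page_size: u64) -> Option<u64> {
    let resident_pages: u64 = statm.split_whitespace().nth(1)?.parse().ok()?;
    resident_pages.checked_mul(page_size)
}

/// Hypothetical raw reading; the PR's sample type may differ.
#[derive(Debug, Clone, Copy, PartialEq)]
pub struct RawReading {
    pub cpu_seconds: f64,
    pub rss_bytes: u64,
}

/// Build a reading only when both reads succeeded; otherwise produce nothing,
/// so a failed read can never appear as a zero-valued sample.
pub fn sample(cpu_seconds: Option<f64>, rss_bytes: Option<u64>) -> Option<RawReading> {
    let (cpu_seconds, rss_bytes) = cpu_seconds.zip(rss_bytes)?;
    Some(RawReading { cpu_seconds, rss_bytes })
}
```

Carrying `None` through to the caller keeps "no data" distinct from "zero usage", which is the distinction the commit message is concerned with.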
Thanks for the review! Addressing both points:
P2 (resource read helpers): Agreed, just pushed 25cb6dc which makes
P3 (storage observability): This is already implemented in the latest commits on this branch. See commits
Re-reviewed latest head
Validation rerun:
All passed.
Co-Authored-By: Devin AI <158243242+devin-ai-integration[bot]@users.noreply.github.com>
Fixed in 1a691dd
Co-Authored-By: Devin AI <158243242+devin-ai-integration[bot]@users.noreply.github.com>
Co-Authored-By: Devin AI <158243242+devin-ai-integration[bot]@users.noreply.github.com>
Fixed linked worktree under-reporting in
The storage scanner now discovers
Also moved
New test:
Co-Authored-By: Devin AI <158243242+devin-ai-integration[bot]@users.noreply.github.com>
Validation update from actual installed-user flow: I built and installed this PR through the real dev install path, then used the installed binary rather than
./scripts/dev.sh --release
~/.git-ai/bin/git-ai --version
~/.git-ai/bin/git-ai daemon shutdown || true
~/.git-ai/bin/git-ai daemon start
~/.git-ai/bin/git-ai daemon status
For telemetry capture, I temporarily pointed local config at a local HTTP capture server with a dummy API key:
{
"api_base_url": "http://127.0.0.1:8765",
"api_key": "pd54-local-validation",
"disable_auto_updates": true,
"disable_version_checks": true,
"feature_flags": { "daemon_log_upload": true }
}
Then I exercised a normal repo flow using the installed CLI:
FLOW=/tmp/pd54-git-ai-user-flow
rm -rf "$FLOW"
mkdir -p "$FLOW"
cd "$FLOW"
git init -q
git config user.name "PD54 Reviewer"
git config user.email "pd54-reviewer@example.com"
printf 'hello\n' > demo.txt
git add demo.txt
git commit -q -m 'initial human commit'
printf 'hello from ai\n' >> demo.txt
~/.git-ai/bin/git-ai checkpoint mock_ai demo.txt
~/.git-ai/bin/git-ai status --json
~/.git-ai/bin/git-ai log --oneline -1
~/.git-ai/bin/git-ai daemon status --repo "$FLOW"
Observed result:
{
"kind": "heartbeat",
"message": "alive",
"resource_sample_count": 30,
"cpu_percent_min": 0.046381397296507886,
"cpu_percent_max": 0.4034800118999653,
"cpu_percent_mean": 0.06926008667198168,
"cpu_percent_median": 0.05876753455661267,
"rss_bytes_current": 25116672,
"rss_bytes_min": 25116672,
"rss_bytes_max": 25116672,
"rss_bytes_mean": 25116672,
"rss_bytes_median": 25116672
}
So the CPU/RSS telemetry is verified end-to-end from installed release build -> daemon heartbeat -> daemon-log upload payload. One caveat from this actual-flow run: that first captured heartbeat did not include
Co-Authored-By: Devin AI <158243242+devin-ai-integration[bot]@users.noreply.github.com>
Windows tests are failing due to #1515, see: #1721 (comment)
Summary
Adds periodic CPU/memory sampling and .git/ai storage telemetry to daemon heartbeats for observability in ClickHouse (PD-54).
Commit 1: CPU/memory resource telemetry
New module daemon::resource_sampler, a lightweight process-level sampler using OS primitives:
- libc::getrusage(RUSAGE_SELF) → user+system time delta / wall time = percent
- /proc/self/statm on Linux, getrusage.ru_maxrss on macOS
- VecDeque<ResourceSample> (cap 64), drained on heartbeat
Integration into telemetry_flush_loop:
Heartbeat fields: cpu_percent_{min,max,mean,median}, rss_bytes_{min,max,mean,median,current}, resource_sample_count
DaemonLogFieldValue::from_f64() for clean float→JSON serialization (returns Null for NaN/Infinity).
Commit 2: .git/ai storage telemetry
New module daemon::storage_sampler, which tracks known repo ai_dirs discovered through trace2 ingestion:
- RwLock<HashSet<PathBuf>> of known ai_dir paths, registered in TraceNormalizer when repos are discovered
- scan_storage() computes aggregate stats at heartbeat time with bounded traversal (max 10k entries per dir, 2s time limit, skip symlinks)
- old-*) working log directories
New heartbeat fields: git_ai_dir_bytes, working_logs_dir_bytes, working_logs_count, working_log_largest_bytes
Performance: O(1) syscalls per CPU/RSS sample, no process spawns, no git operations, bounded buffer. Storage scan is bounded and best-effort. Both run on the existing flush loop, with no new tasks or threads.
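The CPU formula from the summary (CPU time delta over wall time delta) and the non-finite guard can be sketched together as follows. `CpuSnapshot`, `cpu_percent`, `FieldValue`, and `field_from_f64` are hypothetical stand-ins; in particular `FieldValue` only imitates the behaviour the summary attributes to `DaemonLogFieldValue::from_f64()`.

```rust
use std::time::Duration;

/// Hypothetical snapshot: cumulative user+system CPU time, plus the wall-clock
/// time at which it was taken (measured from an arbitrary fixed origin).
#[derive(Debug, Clone, Copy)]
pub struct CpuSnapshot {
    pub cpu_time: Duration,
    pub wall: Duration,
}

/// CPU usage between two snapshots as a percentage of one core.
/// A zero wall-clock delta yields NaN or infinity; the guard below covers that.
pub fn cpu_percent(prev: CpuSnapshot, cur: CpuSnapshot) -> f64 {
    let cpu_delta = cur.cpu_time.saturating_sub(prev.cpu_time).as_secs_f64();
    let wall_delta = cur.wall.saturating_sub(prev.wall).as_secs_f64();
    cpu_delta / wall_delta * 100.0
}

/// Stand-in for a log field value that serializes cleanly to JSON.
#[derive(Debug, Clone, Copy, PartialEq)]
pub enum FieldValue {
    Null,
    Number(f64),
}

/// JSON has no representation for NaN or infinity, so map them to Null.
pub fn field_from_f64(value: f64) -> FieldValue {
    if value.is_finite() {
        FieldValue::Number(value)
    } else {
        FieldValue::Null
    }
}
```

For example, 0.3 s of CPU time over a 30 s interval comes out at about 1 percent, while two snapshots with the same wall-clock time produce a non-finite value that becomes `Null` instead of invalid JSON.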
Link to Devin session: https://app.devin.ai/sessions/aa6bedb86b294920887daafe8d910cb7
Requested by: @Siddhant-K-code