Skip to content

iOS device startup: upload .logarchive and pull crash reports on cleanup failure#5232

Open
matouskozak wants to merge 3 commits into
mainfrom
lm-ios-diagnostic-extraction-pr
Open

iOS device startup: upload .logarchive and pull crash reports on cleanup failure#5232
matouskozak wants to merge 3 commits into
mainfrom
lm-ios-diagnostic-extraction-pr

Conversation

@matouskozak

Copy link
Copy Markdown
Member

Summary

When the post-iteration iOS app cleanup fails (devicectl returns 'No such process'), the app already exited before we could terminate it. This PR adds diagnostic data collection to help investigate why — crash, iOS-initiated termination, or natural exit.

Changes

src/scenarios/shared/runner.py:

  • Added make_archive to the shutil import
  • Wrapped killCmdCommand.run() in a try/except CalledProcessError block that:
    1. Zips and uploads the iOS device's .logarchive to HELIX_WORKITEM_UPLOAD_ROOT
    2. Copies device-side crash reports (.ips) via xcrun devicectl device copy from --domain-type systemCrashLogs
    3. Re-raises the original exception after diagnostics are saved

Why

  • Adds observability for iOS device startup/cleanup failures — a daily triage pain point
  • Investigative-only: triggers only on the cleanup failure that already terminates the iteration
  • No behavior change for successful runs

Origin

Cherry-picked from matouskozak/ios-sdk-jobs-macos-26 (commits 71bfd4a1, 3101d30c) which were left unmerged after PR #5231 was merged.

Verification

  • ✅ No conflict markers
  • python3 -m py_compile src/scenarios/shared/runner.py passes

…nup failure

When post-iteration app cleanup fails (devicectl 'No such process'), the app
already exited before we could terminate it. To diagnose why — crash, iOS-
initiated termination, or natural exit:

1. Zip and upload the iOS device's .logarchive to the Helix results container
2. Pull device-side crash reports (.ips) via xcrun devicectl

Investigative-only: triggers only on the cleanup failure that already
terminates the iteration.

Cherry-picked from matouskozak/ios-sdk-jobs-macos-26 (71bfd4a1, 3101d30c).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings May 28, 2026 15:11

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

This PR adds diagnostic collection for iOS device startup cleanup failures, helping investigate cases where the app process is already gone when the runner tries to kill it.

Changes:

  • Wraps the iOS app kill step in CalledProcessError handling.
  • Uploads the collected .logarchive and attempts to copy device crash logs before re-raising.
  • Adds make_archive import for zipping the log archive.

💡 Add Copilot custom instructions for smarter, more guided reviews. Learn how to get started.

Comment thread src/scenarios/shared/runner.py Outdated
Comment thread src/scenarios/shared/runner.py Outdated
- Wrap os.makedirs into the same try/except as the crash copy so a
  diagnostic-path failure cannot mask the original CalledProcessError
  from the failed app kill.
- Prune the systemCrashLogs copy after devicectl to entries relevant
  to this iteration: keep files matching the bundle name or with
  mtime >= the iteration's log-collect start. Drops dozens of unrelated
  .ips files (WiFiLQMMetrics, wifip2pd, old crashes, etc.) per upload.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@matouskozak

matouskozak commented Jun 8, 2026

Copy link
Copy Markdown
Member Author

Copilot AI review requested due to automatic review settings June 8, 2026 16:45

Copilot AI left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Copilot reviewed 1 out of 1 changed files in this pull request and generated 2 comments.

Comment thread src/scenarios/shared/runner.py Outdated
Comment on lines +771 to +792
# The Mac.iPhone.17.Perf devices are shared across many test runs and iOS retains
# crash reports in /var/mobile/Library/Logs/CrashReporter/ until rotated out, so
# systemCrashLogs accumulates days of stale .ips files from previous runs of the
# same app bundles. Wipe it now so the crash collection on a kill failure below
# only contains crashes generated during this test run. devicectl has no per-file
# delete; copying an empty directory in with --remove-existing-content replaces
# the destination contents.
getLogger().info("Clearing systemCrashLogs on device.")
try:
with tempfile.TemporaryDirectory() as emptyCrashDir:
crashClearCmd = [
'xcrun', 'devicectl', 'device', 'copy', 'to',
'--device', deviceUDID,
'--domain-type', 'systemCrashLogs',
'--source', emptyCrashDir,
'--destination', '/',
'--remove-existing-content', 'true',
]
RunCommand(crashClearCmd, verbose=True).run()
except Exception as ex:
getLogger().warning(f"Failed to clear systemCrashLogs (continuing): {ex}")

Comment on lines +872 to +876
# The kill is cleanup-only; the measurement data is already in the .logarchive above.
# devicectl returns non-zero when the app process is already gone (e.g. iOS terminated
# it, the app crashed, or it self-exited). Upload the .logarchive AND any device-side
# crash reports for this bundle to the Helix results container so we can diagnose
# why the app was already gone before re-raising.
Replace devicectl-based copy-everything-then-filter approach with
XHarness's CrashSnapshotReporter pattern, ported to Python:

* Before the iteration loop, run 'mlaunch --list-crash-reports' to
  capture the device's existing crash report set.
* In the kill-failure handler, re-list and download only the diff
  via 'mlaunch --download-crash-report --download-crash-report-to'.
* Poll the final snapshot for up to 60s so iOS has time to finish
  writing the crash report after the process dies (matches
  CrashSnapshotReporter.EndCaptureAsync).

The previous bundle/mtime filter still uploaded the historical
backlog of unrelated crashes that accumulate on shared
Mac.iPhone.17.Perf devices (8-20 stale .ips per work item seen in
build 2986965). The snapshot diff scopes uploads to only the
crashes generated during this test run.

It's a list+copy, not a move — device state is unchanged, matching
XHarness behaviour.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
@matouskozak matouskozak force-pushed the lm-ios-diagnostic-extraction-pr branch from 0d442be to bad02e6 Compare June 8, 2026 17:25
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants