ohcl_boot: exclude CPUs with restored NVMe interrupts from sidecar #3706

Merged
smalis-msft merged 5 commits into microsoft:main from smalis-msft:less-nvme-sidecar on Jun 18, 2026

@smalis-msft

Hopefully fixes intermittent failures of openvmm_openhcl_linux_x64_servicing_keepalive_with_nvme_fault.

The test arms an NVMe fault that panics on any CREATE_IO_COMPLETION_QUEUE issued after servicing with keepalive, which asserts that the restore path never re-creates I/O completion queues.

In a failing run, the persisted boot state was:

  cpus_with_mapped_interrupts_no_io=[0, 1]
  cpus_with_outstanding_io=[]

read_from_dt used only cpus_with_outstanding_io to drive the sidecar override. That list was empty here, so both CPUs stayed sidecar-started after restore even though each had a mapped NVMe interrupt. NVMe interrupts targeted at those CPUs were not delivered, and the keepalive restore eventually issued a CREATE_IO_COMPLETION_QUEUE, tripping the fault.

Fix: in read_from_dt, combine cpus_with_outstanding_io and cpus_with_mapped_interrupts_no_io (sorted, deduped) into a single "needs kernel start" set and use it for the sidecar exclusion / disable decision.
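
For readers who want to see the shape of the fix, here is a minimal sketch of the "combine, sort, dedup" step. It is not the code from this PR: the function name cpus_needing_kernel_start and the u32 CPU index type are assumptions made for illustration, and only the two list names come from the description above.

```rust
/// Hypothetical helper (not from the PR): merge the two persisted CPU lists
/// into one sorted, deduplicated "needs kernel start" set.
/// The `u32` index type is an assumption for this sketch.
fn cpus_needing_kernel_start(
    cpus_with_outstanding_io: &[u32],
    cpus_with_mapped_interrupts_no_io: &[u32],
) -> Vec<u32> {
    // Concatenate both lists without assuming either is sorted.
    let mut combined: Vec<u32> = cpus_with_outstanding_io
        .iter()
        .chain(cpus_with_mapped_interrupts_no_io.iter())
        .copied()
        .collect();

    // `dedup` only removes adjacent duplicates, so sort first.
    combined.sort_unstable();
    combined.dedup();
    combined
}
```

With the failing run's state (no outstanding I/O, mapped interrupts on CPUs 0 and 1), this sketch yields [0, 1], so both CPUs would land in the exclusion set instead of being left to sidecar.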

Copilot AI review requested due to automatic review settings June 10, 2026 16:13
@smalis-msft smalis-msft requested a review from a team as a code owner June 10, 2026 16:13
@github-actions github-actions Bot added the unsafe Related to unsafe code label Jun 10, 2026
@github-actions

⚠️ Unsafe Code Detected

This PR modifies files containing unsafe Rust code. Extra scrutiny is required during review.

For more on why we check whole files, instead of just diffs, check out the Rustonomicon

Copilot AI left a comment

Pull request overview

Adjusts OpenHCL boot’s servicing-restore sidecar CPU override logic so that vCPUs involved in restored NVMe state (either outstanding I/O or merely a mapped NVMe interrupt) are kernel-started, avoiding missed NVMe completion interrupts that can trigger keepalive restore to recreate completion queues.

Changes:

  • Combine cpus_with_outstanding_io and cpus_with_mapped_interrupts_no_io from persisted state into a single sidecar-exclusion CPU set (sorted + deduped).
  • Drive the sidecar per-CPU override / sidecar-disable decision off that combined set, and update logging/comments to match the intent.
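
As a rough illustration of the second bullet, the sketch below turns an already-combined CPU set into a per-CPU override or a full sidecar disable. Everything here is hypothetical: the SidecarPlan enum, the plan_sidecar function, and in particular the rule that sidecar is disabled only when every CPU needs a kernel start are assumptions for illustration, since the actual threshold used in read_from_dt is not stated in this conversation.

```rust
/// Hypothetical outcome type for this sketch; not an OpenHCL type.
#[derive(Debug, PartialEq)]
enum SidecarPlan {
    /// No CPU carries restored NVMe state, so nothing is excluded.
    Unrestricted,
    /// Per-CPU override: `true` at index i means CPU i must be kernel-started.
    ExcludeCpus(Vec<bool>),
    /// Assumed rule: if every CPU must be kernel-started, disable sidecar.
    Disabled,
}

/// Hypothetical decision function driven by the combined exclusion set.
fn plan_sidecar(cpu_count: usize, needs_kernel_start: &[u32]) -> SidecarPlan {
    let mut exclude = vec![false; cpu_count];
    for &cpu in needs_kernel_start {
        // Indices beyond `cpu_count` are ignored in this sketch.
        if let Some(slot) = exclude.get_mut(cpu as usize) {
            *slot = true;
        }
    }

    if !exclude.iter().any(|&e| e) {
        SidecarPlan::Unrestricted
    } else if exclude.iter().all(|&e| e) {
        SidecarPlan::Disabled
    } else {
        SidecarPlan::ExcludeCpus(exclude)
    }
}
```

Under these assumptions, a two-CPU VM whose combined set is [0, 1] would end up with sidecar disabled, while a four-CPU VM with only CPU 1 in the set would keep sidecar for the other three.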

Comment thread openhcl/openhcl_boot/src/host_params/dt/mod.rs Outdated
Copilot AI review requested due to automatic review settings June 10, 2026 22:35

Copilot AI left a comment

Pull request overview

Copilot reviewed 1 out of 1 changed files in this pull request and generated no new comments.

Comment thread openhcl/openhcl_boot/src/host_params/dt/mod.rs Outdated
Copilot AI review requested due to automatic review settings June 17, 2026 17:09

Copilot AI left a comment

Pull request overview

Copilot reviewed 1 out of 1 changed files in this pull request and generated 1 comment.

Comment thread openhcl/openhcl_boot/src/host_params/dt/mod.rs
@smalis-msft smalis-msft merged commit 8aca5df into microsoft:main Jun 18, 2026
94 of 98 checks passed
@smalis-msft smalis-msft deleted the less-nvme-sidecar branch June 18, 2026 16:26
Labels

unsafe Related to unsafe code

5 participants