Skip to content

Queued offline updates#1430

Draft
rwalker777 wants to merge 40 commits into
esphome:mainfrom
rwalker777:queued-offline-updates
Draft

Queued offline updates#1430
rwalker777 wants to merge 40 commits into
esphome:mainfrom
rwalker777:queued-offline-updates

Conversation

@rwalker777

@rwalker777 rwalker777 commented Jun 13, 2026

Copy link
Copy Markdown

What does this implement/fix?

Add detection when device is offline to local build and add a queue flag that triggers an OTA when the device wakes.

Related issue or feature (if applicable):

  • fixes

Types of changes

  • Bugfix (non-breaking change which fixes an issue) — bugfix
  • New feature (non-breaking change which adds functionality) — new-feature
  • Enhancement to an existing feature — enhancement
  • Breaking change (fix or feature that would cause existing functionality to not work as expected) — breaking-change
  • Refactor (no behaviour change) — refactor
  • Documentation only — docs
  • Maintenance / chore — maintenance
  • CI / workflow change — ci
  • Dependencies bump — dependencies

Frontend coordination

  • No frontend change needed
  • [X ] Companion frontend PR: esphome/device-builder-frontend#

Checklist

  • The code change is tested and works locally.
  • Pre-commit hooks pass (ruff, codespell, yaml/json/python checks).
  • Tests have been added or updated under tests/ where applicable.
  • components.index.json / definitions/components/*.json have not been hand-edited (regenerate via script/sync_components.py if a sync is needed).
  • Architecture-level changes are reflected in docs/ARCHITECTURE.md and/or docs/API.md.

rwalker777 and others added 5 commits June 12, 2026 14:38
…e devices. When a firmware job completes successfully for a device that is currently offline, we set a new `queued_update` flag on the device. This indicates that the compiled firmware is ready and waiting to be flashed via OTA the next time the device checks in via mDNS.
@rwalker777

Copy link
Copy Markdown
Author

Will submit frontend to show queued status and option to cancel. Feedback appreciated.

@codspeed-hq

codspeed-hq Bot commented Jun 13, 2026

Copy link
Copy Markdown

Merging this PR will not alter performance

✅ 27 untouched benchmarks


Comparing rwalker777:queued-offline-updates (71dd8f8) with main (7739554)

Open in CodSpeed

@esphbot

esphbot commented Jun 13, 2026

Copy link
Copy Markdown
Contributor

@rwalker777: Will submit frontend to show queued status and option to cancel. Feedback appreciated.

Core gap: on_queued_update_ready unwired. No caller passes it. Wake never flashes. Also queued_update never reset — flapping device re-installs forever. Persist flag via state_callbacks path, not direct device.queued_update = True in runner (lost on restart). Fire DeviceEventData, not raw {"device": device}. Two tests reimplement monitor logic inline — drive real apply() instead.

@esphbot

esphbot commented Jun 13, 2026

Copy link
Copy Markdown
Contributor

Previous review — superseded by a newer review below.

@esphbot esphbot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Blocking issues found — see the review comment above.

@github-actions github-actions Bot added the new-feature New feature label Jun 15, 2026
@codecov

codecov Bot commented Jun 15, 2026

Copy link
Copy Markdown

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 99.54%. Comparing base (7739554) to head (71dd8f8).

Additional details and impacted files

Impacted file tree graph

@@           Coverage Diff           @@
##             main    #1430   +/-   ##
=======================================
  Coverage   99.54%   99.54%           
=======================================
  Files         224      224           
  Lines       17553    17614   +61     
=======================================
+ Hits        17473    17534   +61     
  Misses         80       80           
Flag Coverage Δ
py3.12 99.51% <100.00%> (+<0.01%) ⬆️
py3.14 99.54% <100.00%> (+0.40%) ⬆️

Flags with carried forward coverage won't be shown. Click here to find out more.

Files with missing lines Coverage Δ
...home_device_builder/controllers/_device_scanner.py 99.40% <100.00%> (+<0.01%) ⬆️
...er/controllers/_device_state_monitor/controller.py 98.24% <100.00%> (+0.09%) ⬆️
...ice_builder/controllers/devices/_metadata_store.py 100.00% <ø> (ø)
...e_device_builder/controllers/devices/controller.py 99.71% <100.00%> (+<0.01%) ⬆️
...ome_device_builder/controllers/devices/metadata.py 100.00% <100.00%> (ø)
..._device_builder/controllers/firmware/controller.py 100.00% <100.00%> (ø)
esphome_device_builder/models/devices.py 100.00% <100.00%> (ø)
esphome_device_builder/models/firmware.py 100.00% <100.00%> (ø)
🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@rwalker777 rwalker777 requested a review from esphbot June 16, 2026 16:15
@esphbot

esphbot commented Jun 16, 2026

Copy link
Copy Markdown
Contributor

Previous review — superseded by a newer review below.

@esphbot esphbot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Blocking issues found — see the review comment above.

@rwalker777 rwalker777 requested a review from esphbot June 16, 2026 20:29
@rwalker777

Copy link
Copy Markdown
Author

This is dependent on this ESPHome PR - esphome/esphome#16358

@esphbot

esphbot commented Jun 16, 2026

Copy link
Copy Markdown
Contributor

@rwalker777: This is dependent on this ESPHome PR - esphome/esphome#16358

Noted — feature can't function until esphome/esphome#16358 lands. Mark this PR draft until then. Two blockers independent of that dependency: queued_update persists to the store but _resolve_device_metadata never reads it back (lost on restart/rescan), and _execute_job arms the flag for any offline COMPILE, so plain firmware/compile triggers an unwanted wake-flash. Both need a fix here regardless of upstream.

@esphbot

esphbot commented Jun 16, 2026

Copy link
Copy Markdown
Contributor

Previous review — superseded by a newer review below.

@esphbot esphbot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Blocking issues found — see the review comment above.

rwalker777 and others added 5 commits June 16, 2026 19:58
…gate offline update queuing logic, ensuring that devices are only queued once per offline state. This prevents redundant compile-only jobs for devices that are already marked as having a queued update, while still allowing new offline devices to be queued appropriately. Additionally, the test suite is updated to reflect the new flag and its behavior in the offline queue processing.
@esphbot

esphbot commented Jun 17, 2026

Copy link
Copy Markdown
Contributor

Previous review — superseded by a newer review below.

@esphbot esphbot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Blocking issues found — see the review comment above.

@rwalker777

Copy link
Copy Markdown
Author

@codspeedbot fix this regression

@esphbot

esphbot commented Jun 20, 2026

Copy link
Copy Markdown
Contributor

@rwalker777: @codspeedbot fix this regression

Likely cause: new DEVICE_STATE_CHANGED listener. _handle_device_wake linear-scans all devices via get_devices() every state event. O(n) on a hot path. Resolve via scanner index, not full walk. Early-return when no device is armed.

@esphbot

esphbot commented Jun 20, 2026

Copy link
Copy Markdown
Contributor

PR Review — Queued offline updates

Solid scaffolding for queued offline OTA, but the wake→flash chain has a race that drops the install in its most common case, plus a hot-path perf regression and over-defensive reflective wiring.

What's done well:

  • The persisted queued_update field is threaded cleanly end-to-end — DeviceFileMetadata, the metadata sidecar (_PINNED_FIELDS + test_store_fields_pinned), _resolve_device_metadata, and the Device model — so the flag survives restart, matching the project's persistence conventions.
  • apply_queued_update correctly reuses _any_matching_device_differs for de-dup, consistent with the other apply_* monitor methods.
  • Gating the OFFLINE branch on DeviceState.OFFLINE (not UNKNOWN) deliberately avoids arming on startup states — a thoughtful edge-case call.

What needs attention:

  • Race: a device that comes online during the deferred compile is never flashed and never armed — the install is silently lost (warning Add section config API and user preferences #1).
  • queued_update is cleared before the OTA runs, so a failed flash loses the queued state with no retry (warning Fix be when i want the catalog for a specific device #2).
  • New DEVICE_STATE_CHANGED listener does an O(n) get_devices() scan per state event — the likely CodSpeed regression (warning Handle hint and errors in yaml #3).
  • Reflective hasattr/getattr guards over declared dataclass/controller attributes mask wiring bugs as silent no-ops (warning Fix yaml changes updates to dashboard device #4).
  • Listener wired in __init__, never unsubscribed (suggestion); monitor tests pin the method body rather than the feature contract (suggestion); deferred install returns a bare COMPILE job and the frontend/docs coordination is still a placeholder (suggestion).

🟡 Important

1. Deferred install is silently dropped if the device wakes during compile (`esphome_device_builder/controllers/firmware/controller.py`, L444-453)

The arming step in _execute_job only sets queued_update=True when the device is still OFFLINE after the compile completes:

if is_comp and is_done and is_deferred:
    device = self._device_for_configuration(job.configuration)
    if device and device.state == DeviceState.OFFLINE and self._db.devices:
        self._db.devices.set_queued_update(device.name, is_queued=True)

If the device comes online during the compile (the common case — the user kicked off the install precisely because the device was about to wake, and a compile takes tens of seconds), the DEVICE_STATE_CHANGED → ONLINE event fires while queued_update is still False, so _handle_device_wake does nothing. Then this post-compile check sees state == ONLINE, skips arming, and no upload ever runs. The user clicked Install, the firmware compiled, and nothing flashed — with no surfaced error.

This is the exact moment the feature should fire (device available + binary ready), and it's the window the code doesn't cover. Consider: if the deferred compile finished and the device is currently ONLINE, fire the OTA upload directly instead of only arming-for-later; arm the flag only when still OFFLINE.

if device and device.state == DeviceState.OFFLINE and self._db.devices:
2. Flag cleared optimistically before the OTA upload runs — a failed flash loses the queued state (`esphome_device_builder/controllers/firmware/controller.py`, L99-110)

_handle_device_wake clears queued_update before the upload is even scheduled:

self._db.devices.set_queued_update(device.name, is_queued=False)
create_eager_task(self.upload(configuration=config, port="OTA"))

If the OTA fails (device flaps back offline mid-flash, OTA auth/handshake failure, transient network), the flag is already gone, so the next wake won't retry. The user is left with an un-flashed device and a cleared queued state — the deferred install is lost on the first failed attempt, exactly the unreliable-network scenario the feature targets.

Prefer clearing the flag only after a confirmed successful upload (e.g. on the upload job's COMPLETED lifecycle, mirroring _sync_deployed_hash_after_flash), so a failed OTA leaves the device armed for the next wake. At minimum, document that the queue is best-effort single-shot.

self._db.devices.set_queued_update(device.name, is_queued=False)
create_eager_task(self.upload(configuration=config, port="OTA"))
3. Per-event O(n) device scan on the DEVICE_STATE_CHANGED hot path (likely the CodSpeed regression) (`esphome_device_builder/controllers/firmware/controller.py`, L84-97)

The new listener fires on every state change for every device, and _device_for_configuration does a full linear get_devices() walk each time:

return next((d for d in self._db.devices.get_devices()
             if getattr(d, "configuration", None) == configuration), None)

For a fleet, the mDNS/ping sweep flips many devices online/offline continuously, so this is O(devices) per O(events) — quadratic-ish churn on a hot path that previously did nothing. This is the most likely source of the CodSpeed regression flagged in the PR comment.

The scanner already maintains indexed lookups (DeviceScanner.get_by_name / path-keyed _devices). Resolve the device via the index, and short-circuit even earlier: most state events are for devices with no queued update, so a cheap guard (if event.data['state'] != ONLINE: return is already there — also gate on a controller-held set of armed configurations) avoids the scan entirely in the overwhelming common case.

for d in self._db.devices.get_devices()

🟢 Suggestions

1. Deferred install returns a COMPILE job with no dependent UPLOAD — diverges from the documented install contract (`esphome_device_builder/controllers/firmware/controller.py`, L258-270)

Normal firmware/install returns a COMPILE job plus a dependent UPLOAD job (depends_on), and docs/API.md documents the UPLOAD rendering as queued. The OFFLINE branch returns a bare COMPILE job with is_deferred_install=True and no upload chain, so the frontend's two-job expectation silently doesn't hold for this path.

Combined with the unfilled esphome/device-builder-frontend#<number> placeholder and the unticked architecture-docs checkbox, the deferred state is invisible to the user: clicking Install on an offline device produces a compile, then nothing, with no "waiting for device to wake" affordance and no in-UI way to invoke firmware/clear_queued_update. Link the companion frontend PR (or confirm no frontend change is needed and explain how the user observes/cancels the queued state), and document the deferred-install return-shape difference in docs/API.md next to the new firmware/clear_queued_update row.

job.is_deferred_install = True
return await self._enqueue(job)

Checklist

  • Feature functional end-to-end (offline→compile→wake→flash) — warning #1
  • Queued state survives restart
  • Failure paths preserve queued state — warning #2
  • No regression on event hot path — warning #3
  • No fragile reflective attribute discovery — warning #4
  • Tests cover new branches behaviorally — suggestion #6
  • Docs + frontend coordination for new field/command — suggestion #7
  • No security/injection issues

To rebase specific severity levels, mention me: @esphbot rebase critical (fixes 🔴 only), @esphbot rebase important (fixes 🔴 + 🟡), or just @esphbot rebase for all.


Automated review by Kōan (Claude) HEAD=71dd8f8 4 min 25s

@esphbot esphbot left a comment

Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Blocking issues found — see the review comment above.

@rwalker777 rwalker777 marked this pull request as draft June 20, 2026 00:59
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

new-feature New feature

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants