ViewTouch · No0ne558 · Jun 9, 2026 · Jun 9, 2026 · Jun 9, 2026
diff --git a/docs/changelog.md b/docs/changelog.md
@@ -7,6 +7,14 @@ The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/)
 ## [Unreleased]
 
 ### Fixed
+- **Printing & Ticket Saving: Fix freeze under heavy workload / long shifts** (2026-06-09)
+  - Four bugs combined to freeze printing and order saving on busy days or when end-of-day is not run for extended periods.
+  - **Bug A — thread pool blocking main event loop (`src/core/thread_pool.hh`):** `enqueue_detached()` called `condition_variable::wait` when the job queue was full (64 jobs, 2 workers). Under heavy load with slow CUPS jobs the queue filled, and the next print call blocked the main Xt event loop thread until a worker freed a slot — freezing the UI, printing, and order saving completely. Fix: replaced the blocking wait with a non-blocking check that drops the job and logs to stderr if the queue is full. Thread count increased 2→4 and queue cap increased 64→256 so the drop condition is extremely rare in practice.
+  - **Bug B — CUPS health check running `fork()`+`exec()` on the main thread (`src/core/data_persistence_manager.cc`):** `CheckCUPSHealth()` was called from `Update()` which runs on the main event loop timer every 500 ms. Every 60 seconds it forked a child to run `systemctl is-active cups` and busy-polled with `usleep(10ms)` for up to 5 seconds. On CUPS failure `AttemptCUPSRecovery()` added another 10-second busy-poll — all on the main thread, causing regular multi-second freezes under any CUPS instability. Fix: moved the entire CUPS health check into a dedicated `CUPSMonitorLoop` background thread. `ProcessPeriodicTasks()` on the main thread now only reads an `atomic<bool>`.
+  - **Bug C — `SaveAllChecks()` running synchronously on the main thread (`src/core/data_persistence_manager.cc`):** Every 30 seconds, `SaveCriticalData()` → `SaveAllChecks()` iterated the entire open check list and wrote each check to disk (open + write + close) on the main event loop thread. On a busy shift with 100–200 open checks — or more when end-of-day is never run — this caused hundreds of sequential blocking file I/O operations per cycle. Fix: auto-save is now dispatched to the thread pool via `enqueue_detached`. An `atomic<bool> save_in_progress_` flag prevents double-dispatch.
+  - **Bug D — `RemotePrinter::Send()` flush threshold exceeds buffer size (`main/hardware/remote_printer.cc`):** `Send()` only flushed the output buffer when `buffer_out->size > 4096`, but `buffer_out` is a `CharQueue(1024)`. The threshold was never reached, so intermediate `Send()` calls were no-ops. Large receipts silently overflowed the ring buffer and dropped bytes, producing garbled or truncated prints. Fix: flush threshold changed to `buffer_size / 2` (512 bytes), matching the `CharQueue::send_size` design. Dropped jobs in `CloseAsync()` now call `ReportError()` so the operator sees a message on screen.
+  - Files modified: `src/core/thread_pool.hh`, `src/core/data_persistence_manager.hh`, `src/core/data_persistence_manager.cc`, `main/hardware/remote_printer.cc`, `main/hardware/printer.cc`.
+
 - **Printing & Ticket Saving: Fix 10-12 hour runtime failure** (2026-06-02)
   - After extended runtime (~10-12 hours) the system would silently stop printing and saving new orders. Three bugs combined to cause this.
   - **Root cause — stale Xt input handler spin loop (`main/hardware/remote_printer.cc`):** `PrinterCB()` was closing the socket and marking the printer offline (`failure = 999`) but never calling `RemoveInputFn(p->input_id)`. The X event loop continued polling the closed file descriptor on every tick, creating an infinite error-callback spin loop. Over 10-12 hours of normal restaurant operation this CPU waste progressively starved the main event thread, causing both printing and order-saving to hang. Fix: deregister the input handler (`RemoveInputFn` + `input_id = -1`) before closing the socket in the offline-marking block.

diff --git a/main/hardware/printer.cc b/main/hardware/printer.cc
@@ -343,7 +343,7 @@ void Printer::CloseAsync()
     temp_name.Set("");  // Clear so printer can be reused
 
     // Queue the print job to the thread pool
-    vt::ThreadPool::instance().enqueue_detached(
+    bool queued = vt::ThreadPool::instance().enqueue_detached(
         [temp_file, target_str, port, type]() {
             vt::Logger::debug("Async print starting: {} -> {}:{}", temp_file, target_str, port);
 
@@ -422,6 +422,13 @@ void Printer::CloseAsync()
             unlink(temp_file.c_str());
         }
     );
+
+    if (!queued) {
+        // Thread pool is full. The temp file will be orphaned in the printqueue
+        // directory but the operator needs to know so they can reprint.
+        ReportError("Print job dropped: thread pool queue full. Please reprint.");
+        unlink(temp_file.c_str()); // clean up the temp file we won't be sending
+    }
 }
 
 /****

diff --git a/main/hardware/remote_printer.cc b/main/hardware/remote_printer.cc
@@ -263,7 +263,11 @@ const char* RemotePrinter::RStr(char* s)
 
 int RemotePrinter::Send()
 {
-    if (buffer_out->size > 4096)
+    // Flush when the buffer is at least half full.  buffer_out is created
+    // as CharQueue(1024), so send_size == 512.  The old threshold of 4096
+    // was larger than the entire buffer and so never triggered, causing
+    // large receipts to overflow the ring buffer and lose data silently.
+    if (buffer_out->size > (buffer_out->buffer_size / 2))
         SendNow();
     return 0;
 }

diff --git a/src/core/data_persistence_manager.cc b/src/core/data_persistence_manager.cc
@@ -1,4 +1,5 @@
 #include "data_persistence_manager.hh"
+#include "thread_pool.hh"
 #include "main/data/system.hh"
 #include "main/business/check.hh"
 #include "main/data/settings.hh"
@@ -94,6 +95,8 @@ DataPersistenceManager::DataPersistenceManager()
     , cups_consecutive_failures(0)
     , shutdown_in_progress(false)
     , force_shutdown(false)
+    , cups_monitor_stop_(false)
+    , save_in_progress_(false)
 {
     FnTrace("DataPersistenceManager::DataPersistenceManager()");
 
@@ -103,11 +106,16 @@ DataPersistenceManager::DataPersistenceManager()
     // Reserve space for logs to reduce reallocations
     error_log.reserve(static_cast<std::size_t>(config.max_error_log_size));
     warning_log.reserve(static_cast<std::size_t>(config.max_warning_log_size));
+
+    // Start the background CUPS monitor thread so health checks never
+    // run on the main event loop thread.
+    StartCUPSMonitorThread();
 }
 
 DataPersistenceManager::~DataPersistenceManager()
 {
     FnTrace("DataPersistenceManager::~DataPersistenceManager()");
+    StopCUPSMonitorThread();
     if (!shutdown_in_progress) {
         PrepareForShutdown();
     }
@@ -451,44 +459,95 @@ void DataPersistenceManager::ForceCUPSRecovery()
 void DataPersistenceManager::ProcessPeriodicTasks()
 {
     FnTrace("DataPersistenceManager::ProcessPeriodicTasks()");
-    
+
     // Skip all periodic tasks during shutdown to prevent hanging
     if (shutdown_in_progress.load(std::memory_order_acquire)) {
         LogInfo("Skipping periodic tasks during shutdown to prevent hanging");
         return;
     }
-    
+
     auto now = std::chrono::steady_clock::now();
-    
+
     // Check if auto-save is needed
     if (config.enable_auto_save) {
         auto elapsed = std::chrono::duration_cast<std::chrono::seconds>(now - last_auto_save);
         if (elapsed >= config.auto_save_interval) {
-            // Check if there's actually dirty data to save
-            if (IsDataDirty("checks") || IsDataDirty("settings") || IsDataDirty("archives")) {
-                // Skip auto-save if any terminal is in edit mode to avoid interrupting user workflow
-                if (IsAnyTerminalInEditMode()) {
-                    LogInfo("Skipping auto-save - terminal in edit mode (data is dirty)");
-                } else {
-                    LogInfo("Performing periodic auto-save (dirty data detected)");
-                    SaveResult result = SaveCriticalData();
-                    if (result == SAVE_SUCCESS) {
-                        last_auto_save = now;
-                        LogInfo("Auto-save completed successfully");
-                    } else {
-                        LogError("Auto-save failed with result: " + std::to_string(result), "auto_save");
+            last_auto_save = now; // advance timestamp immediately so we don't re-enter
+
+            if (IsAnyTerminalInEditMode()) {
+                LogInfo("Skipping auto-save - terminal in edit mode");
+            } else if (!save_in_progress_.exchange(true)) {
+                // Dispatch to thread pool so the main event loop thread is
+                // never blocked by disk I/O across potentially hundreds of checks.
+                bool queued = vt::ThreadPool::instance().enqueue_detached(
+                    [this]() {
+                        SaveResult result = SaveCriticalData();
+                        if (result != SAVE_SUCCESS) {
+                            LogError("Background auto-save failed with result: " +
+                                     std::to_string(result), "auto_save");
+                        }
+                        save_in_progress_.store(false, std::memory_order_release);
                     }
+                );
+                if (!queued) {
+                    // Thread pool is full — clear the flag so next tick can retry
+                    save_in_progress_.store(false, std::memory_order_release);
+                    LogWarning("Auto-save deferred: thread pool queue full", "auto_save");
                 }
-            } else {
-                // No dirty data, just update the timestamp to avoid constant checking
-                last_auto_save = now;
-                LogInfo("Auto-save skipped - no dirty data");
             }
+            // else: previous save is still running, skip this cycle
+        }
+    }
+
+    // CUPS status is checked by the background CUPSMonitorLoop thread.
+    // Do NOT call CheckCUPSStatus() here — it forks a process and busy-polls
+    // for up to 5 seconds, which stalls the main event loop.
+}
+
+// ---------------------------------------------------------------------------
+// Background CUPS monitor thread — runs independently of the main event loop
+// ---------------------------------------------------------------------------
+
+void DataPersistenceManager::StartCUPSMonitorThread()
+{
+    cups_monitor_stop_.store(false, std::memory_order_release);
+    cups_monitor_thread_ = std::thread([this]() { CUPSMonitorLoop(); });
+}
+
+void DataPersistenceManager::StopCUPSMonitorThread()
+{
+    {
+        std::lock_guard<std::mutex> lock(cups_monitor_mutex_);
+        cups_monitor_stop_.store(true, std::memory_order_release);
+    }
+    cups_monitor_cv_.notify_all();
+    if (cups_monitor_thread_.joinable())
+        cups_monitor_thread_.join();
+}
+
+void DataPersistenceManager::CUPSMonitorLoop()
+{
+    while (!cups_monitor_stop_.load(std::memory_order_acquire)) {
+        // Sleep for the configured interval (default 60 s), waking early on stop
+        std::unique_lock<std::mutex> lock(cups_monitor_mutex_);
+        cups_monitor_cv_.wait_for(lock, config.cups_check_interval,
+            [this]() { return cups_monitor_stop_.load(std::memory_order_acquire); });
+
+        if (cups_monitor_stop_.load(std::memory_order_acquire))
+            break;
+
+        if (shutdown_in_progress.load(std::memory_order_acquire))
+            break;
+
+        // Run the potentially-slow health check here, safely off the main thread
+        bool healthy = CheckCUPSHealth();
+        cups_communication_healthy.store(healthy, std::memory_order_release);
+
+        if (!healthy) {
+            LogWarning("CUPS communication unhealthy - attempting recovery", "cups");
+            AttemptCUPSRecovery();
         }
     }
-
-    // Check CUPS status
-    CheckCUPSStatus();
 }
 
 void DataPersistenceManager::Update()

diff --git a/src/core/data_persistence_manager.hh b/src/core/data_persistence_manager.hh
@@ -3,7 +3,9 @@
 
 #include "basic.hh"
 #include <chrono>
+#include <condition_variable>
 #include <memory>
+#include <thread>
 #include <vector>
 #include <functional>
 #include <mutex>
@@ -149,6 +151,18 @@ private:
     std::atomic<bool> shutdown_in_progress;
     std::atomic<bool> force_shutdown;
 
+    // Background thread for CUPS health checks (must never run on the main thread)
+    std::thread cups_monitor_thread_;
+    std::mutex cups_monitor_mutex_;
+    std::condition_variable cups_monitor_cv_;
+    std::atomic<bool> cups_monitor_stop_;
+    void CUPSMonitorLoop();
+    void StartCUPSMonitorThread();
+    void StopCUPSMonitorThread();
+
+    // Guard against dispatching SaveAllChecks while a previous dispatch is still running
+    std::atomic<bool> save_in_progress_;
+
     // Performance metrics
     struct PerformanceMetrics {
         int total_validations{0};

diff --git a/src/core/thread_pool.hh b/src/core/thread_pool.hh
@@ -50,8 +50,8 @@ namespace vt {
  */
 class ThreadPool {
 public:
-    // Singleton instance - use 2 threads by default for RPi
-    static ThreadPool& instance(size_t num_threads = 2) {
+    // Singleton instance - 4 threads handles concurrent print jobs to multiple printers
+    static ThreadPool& instance(size_t num_threads = 4) {
         static ThreadPool pool(num_threads);
         return pool;
     }
@@ -114,31 +114,40 @@ public:
     }
 
     /**
-     * @brief Enqueue a task without caring about the result (fire-and-forget)
-     * @param f Function to execute
-     * @param args Arguments to pass
-     * 
-     * More efficient than enqueue() when you don't need the result.
+     * @brief Enqueue a task without caring about the result (fire-and-forget).
+     *
+     * IMPORTANT: This is intentionally non-blocking.  The caller may be the
+     * main Xt event loop thread.  Blocking it even briefly causes the UI,
+     * printing, and order saving to freeze under heavy load.
+     *
+     * If the queue is full the job is dropped and a message is written to
+     * stderr.  A missed print is recoverable (operator can reprint); a frozen
+     * POS is not.
+     *
+     * @return true if the job was queued, false if it was dropped.
      */
     template<typename F, typename... Args>
-    void enqueue_detached(F&& f, Args&&... args) {
+    bool enqueue_detached(F&& f, Args&&... args) {
         auto task = std::bind(std::forward<F>(f), std::forward<Args>(args)...);
-        
+
         {
-            std::unique_lock<std::mutex> lock(queue_mutex_);
-
-            if (stop_) return;
+            std::lock_guard<std::mutex> lock(queue_mutex_);
 
-            queue_not_full_.wait(lock, [this] {
-                return tasks_.size() < max_queue_size_ || stop_;
-            });
+            if (stop_) return false;
 
-            if (stop_) return;
+            if (tasks_.size() >= max_queue_size_) {
+                fprintf(stderr,
+                    "ThreadPool::enqueue_detached: queue full (%zu/%zu), "
+                    "dropping job\n",
+                    tasks_.size(), max_queue_size_);
+                return false;
+            }
 
             tasks_.emplace(std::move(task));
         }
-        
+
         condition_.notify_one();
+        return true;
     }
 
     /**
@@ -191,7 +200,7 @@ private:
     explicit ThreadPool(size_t num_threads) 
         : stop_(false)
         , active_tasks_(0)
-        , max_queue_size_(64)  // Bounded queue for memory safety
+        , max_queue_size_(256) // Large enough for a busy shift; see enqueue_detached for drop policy
     {
         workers_.reserve(num_threads);