diff --git a/docs/changelog.md b/docs/changelog.md index 77a3e78c..a294f756 100644 --- a/docs/changelog.md +++ b/docs/changelog.md @@ -7,6 +7,14 @@ The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/) ## [Unreleased] ### Fixed +- **Printing & Ticket Saving: Fix freeze under heavy workload / long shifts** (2026-06-09) + - Four bugs combined to freeze printing and order saving on busy days or when end-of-day is not run for extended periods. + - **Bug A — thread pool blocking main event loop (`src/core/thread_pool.hh`):** `enqueue_detached()` called `condition_variable::wait` when the job queue was full (64 jobs, 2 workers). Under heavy load with slow CUPS jobs the queue filled, and the next print call blocked the main Xt event loop thread until a worker freed a slot — freezing the UI, printing, and order saving completely. Fix: replaced the blocking wait with a non-blocking check that drops the job and logs to stderr if the queue is full. Thread count increased 2→4 and queue cap increased 64→256 so the drop condition is extremely rare in practice. + - **Bug B — CUPS health check running `fork()`+`exec()` on the main thread (`src/core/data_persistence_manager.cc`):** `CheckCUPSHealth()` was called from `Update()` which runs on the main event loop timer every 500 ms. Every 60 seconds it forked a child to run `systemctl is-active cups` and busy-polled with `usleep(10ms)` for up to 5 seconds. On CUPS failure `AttemptCUPSRecovery()` added another 10-second busy-poll — all on the main thread, causing regular multi-second freezes under any CUPS instability. Fix: moved the entire CUPS health check into a dedicated `CUPSMonitorLoop` background thread. `ProcessPeriodicTasks()` on the main thread now only reads an `atomic`. + - **Bug C — `SaveAllChecks()` running synchronously on the main thread (`src/core/data_persistence_manager.cc`):** Every 30 seconds, `SaveCriticalData()` → `SaveAllChecks()` iterated the entire open check list and wrote each check to disk (open + write + close) on the main event loop thread. On a busy shift with 100–200 open checks — or more when end-of-day is never run — this caused hundreds of sequential blocking file I/O operations per cycle. Fix: auto-save is now dispatched to the thread pool via `enqueue_detached`. An `atomic save_in_progress_` flag prevents double-dispatch. + - **Bug D — `RemotePrinter::Send()` flush threshold exceeds buffer size (`main/hardware/remote_printer.cc`):** `Send()` only flushed the output buffer when `buffer_out->size > 4096`, but `buffer_out` is a `CharQueue(1024)`. The threshold was never reached, so intermediate `Send()` calls were no-ops. Large receipts silently overflowed the ring buffer and dropped bytes, producing garbled or truncated prints. Fix: flush threshold changed to `buffer_size / 2` (512 bytes), matching the `CharQueue::send_size` design. Dropped jobs in `CloseAsync()` now call `ReportError()` so the operator sees a message on screen. + - Files modified: `src/core/thread_pool.hh`, `src/core/data_persistence_manager.hh`, `src/core/data_persistence_manager.cc`, `main/hardware/remote_printer.cc`, `main/hardware/printer.cc`. + - **Printing & Ticket Saving: Fix 10-12 hour runtime failure** (2026-06-02) - After extended runtime (~10-12 hours) the system would silently stop printing and saving new orders. Three bugs combined to cause this. - **Root cause — stale Xt input handler spin loop (`main/hardware/remote_printer.cc`):** `PrinterCB()` was closing the socket and marking the printer offline (`failure = 999`) but never calling `RemoveInputFn(p->input_id)`. The X event loop continued polling the closed file descriptor on every tick, creating an infinite error-callback spin loop. Over 10-12 hours of normal restaurant operation this CPU waste progressively starved the main event thread, causing both printing and order-saving to hang. Fix: deregister the input handler (`RemoveInputFn` + `input_id = -1`) before closing the socket in the offline-marking block. diff --git a/main/hardware/printer.cc b/main/hardware/printer.cc index 9cc23105..9f0f5c11 100644 --- a/main/hardware/printer.cc +++ b/main/hardware/printer.cc @@ -343,7 +343,7 @@ void Printer::CloseAsync() temp_name.Set(""); // Clear so printer can be reused // Queue the print job to the thread pool - vt::ThreadPool::instance().enqueue_detached( + bool queued = vt::ThreadPool::instance().enqueue_detached( [temp_file, target_str, port, type]() { vt::Logger::debug("Async print starting: {} -> {}:{}", temp_file, target_str, port); @@ -422,6 +422,13 @@ void Printer::CloseAsync() unlink(temp_file.c_str()); } ); + + if (!queued) { + // Thread pool is full. The temp file will be orphaned in the printqueue + // directory but the operator needs to know so they can reprint. + ReportError("Print job dropped: thread pool queue full. Please reprint."); + unlink(temp_file.c_str()); // clean up the temp file we won't be sending + } } /**** diff --git a/main/hardware/remote_printer.cc b/main/hardware/remote_printer.cc index 747827cc..cdc70d8d 100644 --- a/main/hardware/remote_printer.cc +++ b/main/hardware/remote_printer.cc @@ -263,7 +263,11 @@ const char* RemotePrinter::RStr(char* s) int RemotePrinter::Send() { - if (buffer_out->size > 4096) + // Flush when the buffer is at least half full. buffer_out is created + // as CharQueue(1024), so send_size == 512. The old threshold of 4096 + // was larger than the entire buffer and so never triggered, causing + // large receipts to overflow the ring buffer and lose data silently. + if (buffer_out->size > (buffer_out->buffer_size / 2)) SendNow(); return 0; } diff --git a/src/core/data_persistence_manager.cc b/src/core/data_persistence_manager.cc index 53ed1452..918201d0 100644 --- a/src/core/data_persistence_manager.cc +++ b/src/core/data_persistence_manager.cc @@ -1,4 +1,5 @@ #include "data_persistence_manager.hh" +#include "thread_pool.hh" #include "main/data/system.hh" #include "main/business/check.hh" #include "main/data/settings.hh" @@ -94,6 +95,8 @@ DataPersistenceManager::DataPersistenceManager() , cups_consecutive_failures(0) , shutdown_in_progress(false) , force_shutdown(false) + , cups_monitor_stop_(false) + , save_in_progress_(false) { FnTrace("DataPersistenceManager::DataPersistenceManager()"); @@ -103,11 +106,16 @@ DataPersistenceManager::DataPersistenceManager() // Reserve space for logs to reduce reallocations error_log.reserve(static_cast(config.max_error_log_size)); warning_log.reserve(static_cast(config.max_warning_log_size)); + + // Start the background CUPS monitor thread so health checks never + // run on the main event loop thread. + StartCUPSMonitorThread(); } DataPersistenceManager::~DataPersistenceManager() { FnTrace("DataPersistenceManager::~DataPersistenceManager()"); + StopCUPSMonitorThread(); if (!shutdown_in_progress) { PrepareForShutdown(); } @@ -451,44 +459,95 @@ void DataPersistenceManager::ForceCUPSRecovery() void DataPersistenceManager::ProcessPeriodicTasks() { FnTrace("DataPersistenceManager::ProcessPeriodicTasks()"); - + // Skip all periodic tasks during shutdown to prevent hanging if (shutdown_in_progress.load(std::memory_order_acquire)) { LogInfo("Skipping periodic tasks during shutdown to prevent hanging"); return; } - + auto now = std::chrono::steady_clock::now(); - + // Check if auto-save is needed if (config.enable_auto_save) { auto elapsed = std::chrono::duration_cast(now - last_auto_save); if (elapsed >= config.auto_save_interval) { - // Check if there's actually dirty data to save - if (IsDataDirty("checks") || IsDataDirty("settings") || IsDataDirty("archives")) { - // Skip auto-save if any terminal is in edit mode to avoid interrupting user workflow - if (IsAnyTerminalInEditMode()) { - LogInfo("Skipping auto-save - terminal in edit mode (data is dirty)"); - } else { - LogInfo("Performing periodic auto-save (dirty data detected)"); - SaveResult result = SaveCriticalData(); - if (result == SAVE_SUCCESS) { - last_auto_save = now; - LogInfo("Auto-save completed successfully"); - } else { - LogError("Auto-save failed with result: " + std::to_string(result), "auto_save"); + last_auto_save = now; // advance timestamp immediately so we don't re-enter + + if (IsAnyTerminalInEditMode()) { + LogInfo("Skipping auto-save - terminal in edit mode"); + } else if (!save_in_progress_.exchange(true)) { + // Dispatch to thread pool so the main event loop thread is + // never blocked by disk I/O across potentially hundreds of checks. + bool queued = vt::ThreadPool::instance().enqueue_detached( + [this]() { + SaveResult result = SaveCriticalData(); + if (result != SAVE_SUCCESS) { + LogError("Background auto-save failed with result: " + + std::to_string(result), "auto_save"); + } + save_in_progress_.store(false, std::memory_order_release); } + ); + if (!queued) { + // Thread pool is full — clear the flag so next tick can retry + save_in_progress_.store(false, std::memory_order_release); + LogWarning("Auto-save deferred: thread pool queue full", "auto_save"); } - } else { - // No dirty data, just update the timestamp to avoid constant checking - last_auto_save = now; - LogInfo("Auto-save skipped - no dirty data"); } + // else: previous save is still running, skip this cycle + } + } + + // CUPS status is checked by the background CUPSMonitorLoop thread. + // Do NOT call CheckCUPSStatus() here — it forks a process and busy-polls + // for up to 5 seconds, which stalls the main event loop. +} + +// --------------------------------------------------------------------------- +// Background CUPS monitor thread — runs independently of the main event loop +// --------------------------------------------------------------------------- + +void DataPersistenceManager::StartCUPSMonitorThread() +{ + cups_monitor_stop_.store(false, std::memory_order_release); + cups_monitor_thread_ = std::thread([this]() { CUPSMonitorLoop(); }); +} + +void DataPersistenceManager::StopCUPSMonitorThread() +{ + { + std::lock_guard lock(cups_monitor_mutex_); + cups_monitor_stop_.store(true, std::memory_order_release); + } + cups_monitor_cv_.notify_all(); + if (cups_monitor_thread_.joinable()) + cups_monitor_thread_.join(); +} + +void DataPersistenceManager::CUPSMonitorLoop() +{ + while (!cups_monitor_stop_.load(std::memory_order_acquire)) { + // Sleep for the configured interval (default 60 s), waking early on stop + std::unique_lock lock(cups_monitor_mutex_); + cups_monitor_cv_.wait_for(lock, config.cups_check_interval, + [this]() { return cups_monitor_stop_.load(std::memory_order_acquire); }); + + if (cups_monitor_stop_.load(std::memory_order_acquire)) + break; + + if (shutdown_in_progress.load(std::memory_order_acquire)) + break; + + // Run the potentially-slow health check here, safely off the main thread + bool healthy = CheckCUPSHealth(); + cups_communication_healthy.store(healthy, std::memory_order_release); + + if (!healthy) { + LogWarning("CUPS communication unhealthy - attempting recovery", "cups"); + AttemptCUPSRecovery(); } } - - // Check CUPS status - CheckCUPSStatus(); } void DataPersistenceManager::Update() diff --git a/src/core/data_persistence_manager.hh b/src/core/data_persistence_manager.hh index 689d91c5..dc9ad6e4 100644 --- a/src/core/data_persistence_manager.hh +++ b/src/core/data_persistence_manager.hh @@ -3,7 +3,9 @@ #include "basic.hh" #include +#include #include +#include #include #include #include @@ -149,6 +151,18 @@ private: std::atomic shutdown_in_progress; std::atomic force_shutdown; + // Background thread for CUPS health checks (must never run on the main thread) + std::thread cups_monitor_thread_; + std::mutex cups_monitor_mutex_; + std::condition_variable cups_monitor_cv_; + std::atomic cups_monitor_stop_; + void CUPSMonitorLoop(); + void StartCUPSMonitorThread(); + void StopCUPSMonitorThread(); + + // Guard against dispatching SaveAllChecks while a previous dispatch is still running + std::atomic save_in_progress_; + // Performance metrics struct PerformanceMetrics { int total_validations{0}; diff --git a/src/core/thread_pool.hh b/src/core/thread_pool.hh index 01cd4e35..7b35a346 100644 --- a/src/core/thread_pool.hh +++ b/src/core/thread_pool.hh @@ -50,8 +50,8 @@ namespace vt { */ class ThreadPool { public: - // Singleton instance - use 2 threads by default for RPi - static ThreadPool& instance(size_t num_threads = 2) { + // Singleton instance - 4 threads handles concurrent print jobs to multiple printers + static ThreadPool& instance(size_t num_threads = 4) { static ThreadPool pool(num_threads); return pool; } @@ -114,31 +114,40 @@ public: } /** - * @brief Enqueue a task without caring about the result (fire-and-forget) - * @param f Function to execute - * @param args Arguments to pass - * - * More efficient than enqueue() when you don't need the result. + * @brief Enqueue a task without caring about the result (fire-and-forget). + * + * IMPORTANT: This is intentionally non-blocking. The caller may be the + * main Xt event loop thread. Blocking it even briefly causes the UI, + * printing, and order saving to freeze under heavy load. + * + * If the queue is full the job is dropped and a message is written to + * stderr. A missed print is recoverable (operator can reprint); a frozen + * POS is not. + * + * @return true if the job was queued, false if it was dropped. */ template - void enqueue_detached(F&& f, Args&&... args) { + bool enqueue_detached(F&& f, Args&&... args) { auto task = std::bind(std::forward(f), std::forward(args)...); - + { - std::unique_lock lock(queue_mutex_); - - if (stop_) return; + std::lock_guard lock(queue_mutex_); - queue_not_full_.wait(lock, [this] { - return tasks_.size() < max_queue_size_ || stop_; - }); + if (stop_) return false; - if (stop_) return; + if (tasks_.size() >= max_queue_size_) { + fprintf(stderr, + "ThreadPool::enqueue_detached: queue full (%zu/%zu), " + "dropping job\n", + tasks_.size(), max_queue_size_); + return false; + } tasks_.emplace(std::move(task)); } - + condition_.notify_one(); + return true; } /** @@ -191,7 +200,7 @@ private: explicit ThreadPool(size_t num_threads) : stop_(false) , active_tasks_(0) - , max_queue_size_(64) // Bounded queue for memory safety + , max_queue_size_(256) // Large enough for a busy shift; see enqueue_detached for drop policy { workers_.reserve(num_threads);