Skip to content
Merged
Show file tree
Hide file tree
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
8 changes: 8 additions & 0 deletions docs/changelog.md
Original file line number Diff line number Diff line change
Expand Up @@ -7,6 +7,14 @@ The format is based on [Keep a Changelog](https://keepachangelog.com/en/1.0.0/)
## [Unreleased]

### Fixed
- **Printing & Ticket Saving: Fix freeze under heavy workload / long shifts** (2026-06-09)
- Four bugs combined to freeze printing and order saving on busy days or when end-of-day is not run for extended periods.
- **Bug A — thread pool blocking main event loop (`src/core/thread_pool.hh`):** `enqueue_detached()` called `condition_variable::wait` when the job queue was full (64 jobs, 2 workers). Under heavy load with slow CUPS jobs the queue filled, and the next print call blocked the main Xt event loop thread until a worker freed a slot — freezing the UI, printing, and order saving completely. Fix: replaced the blocking wait with a non-blocking check that drops the job and logs to stderr if the queue is full. Thread count increased 2→4 and queue cap increased 64→256 so the drop condition is extremely rare in practice.
- **Bug B — CUPS health check running `fork()`+`exec()` on the main thread (`src/core/data_persistence_manager.cc`):** `CheckCUPSHealth()` was called from `Update()` which runs on the main event loop timer every 500 ms. Every 60 seconds it forked a child to run `systemctl is-active cups` and busy-polled with `usleep(10ms)` for up to 5 seconds. On CUPS failure `AttemptCUPSRecovery()` added another 10-second busy-poll — all on the main thread, causing regular multi-second freezes under any CUPS instability. Fix: moved the entire CUPS health check into a dedicated `CUPSMonitorLoop` background thread. `ProcessPeriodicTasks()` on the main thread now only reads an `atomic<bool>`.
- **Bug C — `SaveAllChecks()` running synchronously on the main thread (`src/core/data_persistence_manager.cc`):** Every 30 seconds, `SaveCriticalData()` → `SaveAllChecks()` iterated the entire open check list and wrote each check to disk (open + write + close) on the main event loop thread. On a busy shift with 100–200 open checks — or more when end-of-day is never run — this caused hundreds of sequential blocking file I/O operations per cycle. Fix: auto-save is now dispatched to the thread pool via `enqueue_detached`. An `atomic<bool> save_in_progress_` flag prevents double-dispatch.
- **Bug D — `RemotePrinter::Send()` flush threshold exceeds buffer size (`main/hardware/remote_printer.cc`):** `Send()` only flushed the output buffer when `buffer_out->size > 4096`, but `buffer_out` is a `CharQueue(1024)`. The threshold was never reached, so intermediate `Send()` calls were no-ops. Large receipts silently overflowed the ring buffer and dropped bytes, producing garbled or truncated prints. Fix: flush threshold changed to `buffer_size / 2` (512 bytes), matching the `CharQueue::send_size` design. Dropped jobs in `CloseAsync()` now call `ReportError()` so the operator sees a message on screen.
- Files modified: `src/core/thread_pool.hh`, `src/core/data_persistence_manager.hh`, `src/core/data_persistence_manager.cc`, `main/hardware/remote_printer.cc`, `main/hardware/printer.cc`.

- **Printing & Ticket Saving: Fix 10-12 hour runtime failure** (2026-06-02)
- After extended runtime (~10-12 hours) the system would silently stop printing and saving new orders. Three bugs combined to cause this.
- **Root cause — stale Xt input handler spin loop (`main/hardware/remote_printer.cc`):** `PrinterCB()` was closing the socket and marking the printer offline (`failure = 999`) but never calling `RemoveInputFn(p->input_id)`. The X event loop continued polling the closed file descriptor on every tick, creating an infinite error-callback spin loop. Over 10-12 hours of normal restaurant operation this CPU waste progressively starved the main event thread, causing both printing and order-saving to hang. Fix: deregister the input handler (`RemoveInputFn` + `input_id = -1`) before closing the socket in the offline-marking block.
Expand Down
9 changes: 8 additions & 1 deletion main/hardware/printer.cc
Original file line number Diff line number Diff line change
Expand Up @@ -343,7 +343,7 @@ void Printer::CloseAsync()
temp_name.Set(""); // Clear so printer can be reused

// Queue the print job to the thread pool
vt::ThreadPool::instance().enqueue_detached(
bool queued = vt::ThreadPool::instance().enqueue_detached(
[temp_file, target_str, port, type]() {
vt::Logger::debug("Async print starting: {} -> {}:{}", temp_file, target_str, port);

Expand Down Expand Up @@ -422,6 +422,13 @@ void Printer::CloseAsync()
unlink(temp_file.c_str());
}
);

if (!queued) {
// Thread pool is full. The temp file will be orphaned in the printqueue
// directory but the operator needs to know so they can reprint.
ReportError("Print job dropped: thread pool queue full. Please reprint.");
unlink(temp_file.c_str()); // clean up the temp file we won't be sending
}
}

/****
Expand Down
6 changes: 5 additions & 1 deletion main/hardware/remote_printer.cc
Original file line number Diff line number Diff line change
Expand Up @@ -263,7 +263,11 @@ const char* RemotePrinter::RStr(char* s)

int RemotePrinter::Send()
{
if (buffer_out->size > 4096)
// Flush when the buffer is at least half full. buffer_out is created
// as CharQueue(1024), so send_size == 512. The old threshold of 4096
// was larger than the entire buffer and so never triggered, causing
// large receipts to overflow the ring buffer and lose data silently.
if (buffer_out->size > (buffer_out->buffer_size / 2))
SendNow();
return 0;
}
Expand Down
105 changes: 82 additions & 23 deletions src/core/data_persistence_manager.cc
Original file line number Diff line number Diff line change
@@ -1,4 +1,5 @@
#include "data_persistence_manager.hh"
#include "thread_pool.hh"
#include "main/data/system.hh"
#include "main/business/check.hh"
#include "main/data/settings.hh"
Expand Down Expand Up @@ -94,6 +95,8 @@ DataPersistenceManager::DataPersistenceManager()
, cups_consecutive_failures(0)
, shutdown_in_progress(false)
, force_shutdown(false)
, cups_monitor_stop_(false)
, save_in_progress_(false)
{
FnTrace("DataPersistenceManager::DataPersistenceManager()");

Expand All @@ -103,11 +106,16 @@ DataPersistenceManager::DataPersistenceManager()
// Reserve space for logs to reduce reallocations
error_log.reserve(static_cast<std::size_t>(config.max_error_log_size));
warning_log.reserve(static_cast<std::size_t>(config.max_warning_log_size));

// Start the background CUPS monitor thread so health checks never
// run on the main event loop thread.
StartCUPSMonitorThread();
}

DataPersistenceManager::~DataPersistenceManager()
{
FnTrace("DataPersistenceManager::~DataPersistenceManager()");
StopCUPSMonitorThread();
if (!shutdown_in_progress) {
PrepareForShutdown();
}
Expand Down Expand Up @@ -451,44 +459,95 @@ void DataPersistenceManager::ForceCUPSRecovery()
void DataPersistenceManager::ProcessPeriodicTasks()
{
FnTrace("DataPersistenceManager::ProcessPeriodicTasks()");

// Skip all periodic tasks during shutdown to prevent hanging
if (shutdown_in_progress.load(std::memory_order_acquire)) {
LogInfo("Skipping periodic tasks during shutdown to prevent hanging");
return;
}

auto now = std::chrono::steady_clock::now();

// Check if auto-save is needed
if (config.enable_auto_save) {
auto elapsed = std::chrono::duration_cast<std::chrono::seconds>(now - last_auto_save);
if (elapsed >= config.auto_save_interval) {
// Check if there's actually dirty data to save
if (IsDataDirty("checks") || IsDataDirty("settings") || IsDataDirty("archives")) {
// Skip auto-save if any terminal is in edit mode to avoid interrupting user workflow
if (IsAnyTerminalInEditMode()) {
LogInfo("Skipping auto-save - terminal in edit mode (data is dirty)");
} else {
LogInfo("Performing periodic auto-save (dirty data detected)");
SaveResult result = SaveCriticalData();
if (result == SAVE_SUCCESS) {
last_auto_save = now;
LogInfo("Auto-save completed successfully");
} else {
LogError("Auto-save failed with result: " + std::to_string(result), "auto_save");
last_auto_save = now; // advance timestamp immediately so we don't re-enter

if (IsAnyTerminalInEditMode()) {
LogInfo("Skipping auto-save - terminal in edit mode");
} else if (!save_in_progress_.exchange(true)) {
// Dispatch to thread pool so the main event loop thread is
// never blocked by disk I/O across potentially hundreds of checks.
bool queued = vt::ThreadPool::instance().enqueue_detached(
[this]() {
SaveResult result = SaveCriticalData();
if (result != SAVE_SUCCESS) {
LogError("Background auto-save failed with result: " +
std::to_string(result), "auto_save");
}
save_in_progress_.store(false, std::memory_order_release);
}
);
if (!queued) {
// Thread pool is full — clear the flag so next tick can retry
save_in_progress_.store(false, std::memory_order_release);
LogWarning("Auto-save deferred: thread pool queue full", "auto_save");
}
} else {
// No dirty data, just update the timestamp to avoid constant checking
last_auto_save = now;
LogInfo("Auto-save skipped - no dirty data");
}
// else: previous save is still running, skip this cycle
}
}

// CUPS status is checked by the background CUPSMonitorLoop thread.
// Do NOT call CheckCUPSStatus() here — it forks a process and busy-polls
// for up to 5 seconds, which stalls the main event loop.
}

// ---------------------------------------------------------------------------
// Background CUPS monitor thread — runs independently of the main event loop
// ---------------------------------------------------------------------------

void DataPersistenceManager::StartCUPSMonitorThread()
{
cups_monitor_stop_.store(false, std::memory_order_release);
cups_monitor_thread_ = std::thread([this]() { CUPSMonitorLoop(); });
}

void DataPersistenceManager::StopCUPSMonitorThread()
{
{
std::lock_guard<std::mutex> lock(cups_monitor_mutex_);
cups_monitor_stop_.store(true, std::memory_order_release);
}
cups_monitor_cv_.notify_all();
if (cups_monitor_thread_.joinable())
cups_monitor_thread_.join();
}

void DataPersistenceManager::CUPSMonitorLoop()
{
while (!cups_monitor_stop_.load(std::memory_order_acquire)) {
// Sleep for the configured interval (default 60 s), waking early on stop
std::unique_lock<std::mutex> lock(cups_monitor_mutex_);
cups_monitor_cv_.wait_for(lock, config.cups_check_interval,
[this]() { return cups_monitor_stop_.load(std::memory_order_acquire); });

if (cups_monitor_stop_.load(std::memory_order_acquire))
break;

if (shutdown_in_progress.load(std::memory_order_acquire))
break;

// Run the potentially-slow health check here, safely off the main thread
bool healthy = CheckCUPSHealth();
cups_communication_healthy.store(healthy, std::memory_order_release);

if (!healthy) {
LogWarning("CUPS communication unhealthy - attempting recovery", "cups");
AttemptCUPSRecovery();
}
}

// Check CUPS status
CheckCUPSStatus();
}

void DataPersistenceManager::Update()
Expand Down
14 changes: 14 additions & 0 deletions src/core/data_persistence_manager.hh
Original file line number Diff line number Diff line change
Expand Up @@ -3,7 +3,9 @@

#include "basic.hh"
#include <chrono>
#include <condition_variable>
#include <memory>
#include <thread>
#include <vector>
#include <functional>
#include <mutex>
Expand Down Expand Up @@ -149,6 +151,18 @@ private:
std::atomic<bool> shutdown_in_progress;
std::atomic<bool> force_shutdown;

// Background thread for CUPS health checks (must never run on the main thread)
std::thread cups_monitor_thread_;
std::mutex cups_monitor_mutex_;
std::condition_variable cups_monitor_cv_;
std::atomic<bool> cups_monitor_stop_;
void CUPSMonitorLoop();
void StartCUPSMonitorThread();
void StopCUPSMonitorThread();

// Guard against dispatching SaveAllChecks while a previous dispatch is still running
std::atomic<bool> save_in_progress_;

// Performance metrics
struct PerformanceMetrics {
int total_validations{0};
Expand Down
45 changes: 27 additions & 18 deletions src/core/thread_pool.hh
Original file line number Diff line number Diff line change
Expand Up @@ -50,8 +50,8 @@ namespace vt {
*/
class ThreadPool {
public:
// Singleton instance - use 2 threads by default for RPi
static ThreadPool& instance(size_t num_threads = 2) {
// Singleton instance - 4 threads handles concurrent print jobs to multiple printers
static ThreadPool& instance(size_t num_threads = 4) {
static ThreadPool pool(num_threads);
return pool;
}
Expand Down Expand Up @@ -114,31 +114,40 @@ public:
}

/**
* @brief Enqueue a task without caring about the result (fire-and-forget)
* @param f Function to execute
* @param args Arguments to pass
*
* More efficient than enqueue() when you don't need the result.
* @brief Enqueue a task without caring about the result (fire-and-forget).
*
* IMPORTANT: This is intentionally non-blocking. The caller may be the
* main Xt event loop thread. Blocking it even briefly causes the UI,
* printing, and order saving to freeze under heavy load.
*
* If the queue is full the job is dropped and a message is written to
* stderr. A missed print is recoverable (operator can reprint); a frozen
* POS is not.
*
* @return true if the job was queued, false if it was dropped.
*/
template<typename F, typename... Args>
void enqueue_detached(F&& f, Args&&... args) {
bool enqueue_detached(F&& f, Args&&... args) {
auto task = std::bind(std::forward<F>(f), std::forward<Args>(args)...);

{
std::unique_lock<std::mutex> lock(queue_mutex_);

if (stop_) return;
std::lock_guard<std::mutex> lock(queue_mutex_);

queue_not_full_.wait(lock, [this] {
return tasks_.size() < max_queue_size_ || stop_;
});
if (stop_) return false;

if (stop_) return;
if (tasks_.size() >= max_queue_size_) {
fprintf(stderr,
"ThreadPool::enqueue_detached: queue full (%zu/%zu), "
"dropping job\n",
tasks_.size(), max_queue_size_);
return false;
}

tasks_.emplace(std::move(task));
}

condition_.notify_one();
return true;
}

/**
Expand Down Expand Up @@ -191,7 +200,7 @@ private:
explicit ThreadPool(size_t num_threads)
: stop_(false)
, active_tasks_(0)
, max_queue_size_(64) // Bounded queue for memory safety
, max_queue_size_(256) // Large enough for a busy shift; see enqueue_detached for drop policy
{
workers_.reserve(num_threads);

Expand Down
Loading