OAK-12282 : defining a fixed bound for the AbstractDiskCache by pat-lego · Pull Request #2978 · apache/jackrabbit-oak

pat-lego · 2026-06-25T15:59:39Z

The AbstractPersistentCache is unbound leading to OOM exceptions.

joerghoh

a lot of minor things:

Comments: AI writes too many comments which are 100% valid and helpful in the context of this PR, but not really helpful in a year in the context of a different problem.
some consistency

joerghoh · 2026-06-29T10:16:01Z

                    configuration.redisMinConnections(), configuration.redisMaxConnections(), configuration.redisMaxTotalConnections(), configuration.redisDBIndex(), redisCacheIOMonitor);
            closer.register(redisCache);

+            // OAK-12282: expose the bounded write-queue kill switch. Requires a restart to take effect.


Suggested change

// OAK-12282: expose the bounded write-queue kill switch. Requires a restart to take effect.

// OAK-12282: expose the bounded write-queue kill switch.

That 2nd setence is a bit misleading

joerghoh · 2026-06-29T10:18:53Z

+     * registered with the Whiteboard to revert to the pre-fix unbounded queue.
+     * <strong>Note:</strong> changing this flag requires a process restart, as
+     * the executor is created at startup.
+     */


I would remove this for consistency reasons (or we would need to annotate every configuration parameter for the cache with the same).

joerghoh · 2026-06-29T10:20:24Z

                () -> Long.valueOf(directory.listFiles().length),
                () -> FileUtils.sizeOfDirectory(directory),
                () -> evictionCount.get());
+        segmentCacheStats.setWriteDiscardCountSupplier(() -> discardCount.get());


Very inconsistent. I would extend the c'tor of SegmentCacheStats by one more parameter.

joerghoh · 2026-06-29T10:27:59Z

+        // Formerly Executors.newFixedThreadPool() was used here, which creates an unbounded
+        // LinkedBlockingQueue — allowing unlimited segment buffers to pile up in memory under
+        // high write load. The bounded queue (gated by FT_OAK_12282) prevents OOM by dropping
+        // write tasks when full; this is safe because the disk cache is an optimisation only.


This comment is typical AI, but it won't help if in a year someone is looking at this code (If I want to understand what was here before, I can checkout the right git commit ...)

Suggested change

// Formerly Executors.newFixedThreadPool() was used here, which creates an unbounded

// LinkedBlockingQueue — allowing unlimited segment buffers to pile up in memory under

// high write load. The bounded queue (gated by FT_OAK_12282) prevents OOM by dropping

// write tasks when full; this is safe because the disk cache is an optimisation only.

// Limit the amount of Runnable in the queue to avoid memory issues. Discarding elements from

// the queue is not a problem, as caching is just an optimization; if it's dropped, it will be downloaded

// again and tried again.

joerghoh · 2026-06-29T10:30:44Z

+        BlockingQueue<Runnable> writeQueue = FT_OAK_12282_BOUNDED_WRITE_QUEUE_ENABLED.get()
+                ? new LinkedBlockingQueue<>(WRITE_QUEUE_SIZE)
+                : new LinkedBlockingQueue<>();
+        executor = new ThreadPoolExecutor(


right now these 10 threads are named quite randomly, and therefor they are hard to detect; can you use a ThreadFactory to name them accordingly?

joerghoh · 2026-06-29T10:34:07Z

+                    discardCount.incrementAndGet();
+                    logger.debug("Segment write task discarded: write queue full (capacity={})", WRITE_QUEUE_SIZE);


I would also log the number of discarded items from the queue

Suggested change

discardCount.incrementAndGet();

logger.debug("Segment write task discarded: write queue full (capacity={})", WRITE_QUEUE_SIZE);

long discarded = discardCount.incrementAndGet();

logger.debug("Segment write task discarded: write queue full (capacity={}, already discarded={})", WRITE_QUEUE_SIZE, discarded);

This log should be throttled, as otherwise we risk flooding the logs if for some reason the write threads cannot keep up with the reads from Azure. See the LogSilencer class.

joerghoh · 2026-06-29T10:36:00Z

        );
    }

+    public void setWriteDiscardCountSupplier(@NotNull Supplier<Long> supplier) {


as mentioned above, I would add this to the constructor for consistency reasons

nfsantos · 2026-06-30T09:10:45Z

    private static final Logger logger = LoggerFactory.getLogger(AbstractPersistentCache.class);

    public static final int THREADS = Integer.getInteger("oak.segment.cache.threads", 10);
+    public static final int WRITE_QUEUE_SIZE = Integer.getInteger("oak.segment.cache.writeQueueSize", THREADS * 100);


Writing to disk will be limited to the performance of the IO system and does not scale much with the number of threads that are writing. In fact, a single thread writing efficiently to disk will likely saturate the disk. The max queue size should be a function of the write IO of the disk and of the memory that we want to devote to buffering pending writes, it should not depend on the number of write threads. And we also don't want to cause an OOME only by increasing the number of threads. The number of writers and the size of the queue should be separate settings.

Since segments are around 256KB, a queue size of 100 would take up 25MB when full. I'd say this is enough, as writes to a local SSD should be faster than reading from the network, so a few write threads should be able to keep up with whatever we can get from Azure.

Writing to disk will be limited to the performance of the IO system and does not scale much with the number of threads that are writing

This is definitely true for the classic hard drives (aka "spinning rust"), but not anymore for NVME-SSDs. You get the full IOPS only when you execute operations at a massive scale (IIRC 256 concurrent threads). And this is applicable for both read and write. Assuming a modern operating system also the filesystem should be able to handle that.
For that I strongly recommend to keep these 10 as default.

nfsantos · 2026-06-30T09:10:59Z

+                    discardCount.incrementAndGet();
+                    logger.debug("Segment write task discarded: write queue full (capacity={})", WRITE_QUEUE_SIZE);


This log should be throttled, as otherwise we risk flooding the logs if for some reason the write threads cannot keep up with the reads from Azure. See the LogSilencer class.

nfsantos · 2026-06-30T09:15:34Z

+     * Name of the feature toggle for the OAK-12282 bounded write-queue fix.
+     * See {@link #FT_OAK_12282_BOUNDED_WRITE_QUEUE_ENABLED}.
+     */
+    public static final String FT_OAK_12282 = "FT_OAK-12282";


Personally, I think this is unnecessary. The change is quite safe. And if for some reason having a bound on the queue causes trouble, we can increase it's size to be effectively unlimited.

patlego added 2 commits June 25, 2026 11:58

OAK-12282 defining a fixed bound for the AbstractDiskCache

71bd7fa

OAK-12282 adding FT and logging

84c1bf5

rishabhdaim requested review from joerghoh and jsedding and removed request for joerghoh June 26, 2026 11:05

Adding package-info change

c3799dc

rishabhdaim changed the title ~~OAK-12282 defining a fixed bound for the AbstractDiskCache~~ OAK-12282 : defining a fixed bound for the AbstractDiskCache Jun 26, 2026

joerghoh requested changes Jun 29, 2026

View reviewed changes

nfsantos requested changes Jun 30, 2026

View reviewed changes

OAK-12282 adding changes requested

75159a6

	// OAK-12282: expose the bounded write-queue kill switch. Requires a restart to take effect.
	// OAK-12282: expose the bounded write-queue kill switch.

		discardCount.incrementAndGet();
		logger.debug("Segment write task discarded: write queue full (capacity={})", WRITE_QUEUE_SIZE);

Uh oh!

Conversation

pat-lego commented Jun 25, 2026

Uh oh!

joerghoh left a comment

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

3 participants