KVBench is a lightweight, reproducible benchmark suite for studying the latency, throughput, and memory tradeoffs of KV cache strategies in transformer inference.
This project focuses on three practical questions: when does a KV cache materially help, how much memory does it cost, and which cache mode is the best default on constrained local hardware?
The current benchmark set covers:
- cache vs no-cache generation
- prompt-length scaling
- cache strategy comparison
- a decode-focused microbenchmark at long context
The experiments were intentionally scoped to remain stable on an RTX 3060 Laptop GPU with 6 GB VRAM. A broader generation-length sweep was avoided after heavier runs caused system instability and shutdowns.
- KV cache becomes much more valuable at long context than at short context.
- `dynamic` cache is the strongest default on the current setup.
- `offloaded` cache is a useful fallback when memory pressure matters.
- `no_cache` breaks down for decode-heavy long-context inference.
- `static` cache could not be evaluated locally because of Triton/backend support issues.
From the saved prompt-length results:
- at `512` prompt tokens, `use_cache=False` averaged about `8.81 s`
- at `512` prompt tokens, `use_cache=True` averaged about `3.63 s`
That makes the long-context advantage of caching large enough to be visible even on a small local benchmark.
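To put the two averages above on a common scale, the short calculation below turns them into a speedup factor and a fraction of wall time saved. It uses only the two reported means; the variable names are illustrative and do not come from the repository's scripts.

```python
# Mean latencies reported above for 512 prompt tokens, in seconds.
no_cache_s = 8.81
with_cache_s = 3.63

# Ratio of the two means: how many times faster the cached run is.
speedup = no_cache_s / with_cache_s

# Fraction of the uncached wall time that caching removes.
time_saved = 1 - with_cache_s / no_cache_s

print(f"speedup: {speedup:.2f}x, time saved: {time_saved:.0%}")
```

On these numbers the cached run is roughly 2.4x faster, which removes a little under 60% of the uncached wall time.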
KV_cache/
|-- README.md
|-- requirements.txt
|-- check_gpu.py
|-- test_model.py
|-- src/
| |-- run_benchmark.py
| |-- run_prompt_sweep.py
| |-- run_prompt_sweep_multirun.py
| |-- run_cache_strategy_sweep.py
| |-- run_decode_microbenchmark.py
| `-- plot_results.py
|-- results/
| |-- raw/
| `-- figures/
`-- notebooks/
- GPU: NVIDIA RTX 3060 Laptop GPU
- VRAM: 6 GB
- Model: `HuggingFaceTB/SmolLM2-360M`
- Frameworks: PyTorch + Hugging Face Transformers
Install dependencies:

    pip install -r requirements.txt

Check GPU visibility:

    python check_gpu.py

Run a simple model sanity check:

    python test_model.py

Baseline benchmark:

    python src/run_benchmark.py

Prompt-length multirun benchmark:

    python src/run_prompt_sweep_multirun.py

Cache strategy comparison:

    python src/run_cache_strategy_sweep.py

Decode-focused microbenchmark:

    python src/run_decode_microbenchmark.py

Regenerate plots:

    python src/plot_results.py

Raw and aggregated experiment results are written to `results/raw/`.
Generated figures are written to `results/figures/`.
The plotting script now uses trial standard deviations as error bars to make variance visible instead of showing mean curves alone.
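As a rough illustration of how per-trial spread can become error bars, the sketch below reduces a list of trial latencies to a mean and a sample standard deviation. The function name `summarize_trials` and the sample data are hypothetical, and whether `plot_results.py` uses the sample or the population standard deviation is not stated here, so treat that choice as an assumption.

```python
from statistics import mean, stdev


def summarize_trials(latencies_s):
    """Return (mean, sample standard deviation) for one configuration's trials."""
    if len(latencies_s) < 2:
        # stdev() needs at least two samples; report zero spread for one trial.
        return (mean(latencies_s), 0.0)
    return (mean(latencies_s), stdev(latencies_s))


# Hypothetical trial latencies in seconds, keyed by prompt length.
trials = {128: [1.0, 1.2, 1.1], 512: [3.5, 3.7, 3.6]}

summary = {length: summarize_trials(values) for length, values in trials.items()}
for length, (avg, spread) in summary.items():
    print(f"{length} tokens: {avg:.2f} s +/- {spread:.2f} s")
```

The resulting means and spreads can then be passed as the `y` and `yerr` arguments of matplotlib's `Axes.errorbar`.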
For detailed plot interpretation, see Figure guide.
Tests how latency, throughput, and peak GPU memory change as prompt length grows for:
- `use_cache=False`
- `use_cache=True`
Tests:
- `no_cache`
- `dynamic`
- `offloaded`
- `static` when backend support allows it
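One plausible way to express these four modes is as keyword arguments for Hugging Face's `model.generate()`. The sketch below is an assumption about how such a mapping could look, not a copy of `run_cache_strategy_sweep.py`: the helper name `generate_kwargs_for` is hypothetical, and it assumes the installed Transformers version accepts `"offloaded"` and `"static"` as `cache_implementation` values.

```python
def generate_kwargs_for(strategy):
    """Map a benchmark strategy name to keyword arguments for model.generate().

    Hypothetical helper: the strategy names follow the list above, and the
    cache_implementation strings are assumed to be supported by the
    installed Transformers version.
    """
    if strategy == "no_cache":
        # Recompute attention over the full sequence at every decode step.
        return {"use_cache": False}
    if strategy == "dynamic":
        # Rely on the library's default cache when caching is enabled.
        return {"use_cache": True}
    if strategy in ("offloaded", "static"):
        return {"use_cache": True, "cache_implementation": strategy}
    raise ValueError(f"unknown cache strategy: {strategy!r}")


for name in ("no_cache", "dynamic", "offloaded", "static"):
    print(name, generate_kwargs_for(name))
```

Keeping the mapping in one function makes it straightforward to skip a mode, such as `static`, when the backend raises an error for it.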
Uses a fixed long prompt (512 tokens) and a conservative decode length (64 new tokens) to isolate the decode-side value of KV cache without risking unstable long runs on 6 GB hardware.
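For a sense of scale, the KV cache footprint at this sequence length can be estimated from the model's shape. The sketch below uses the standard formula of two tensors (keys and values) per layer. The configuration values passed in (32 layers, 5 key/value heads, head dimension 64, 2 bytes per value for fp16) are assumptions about SmolLM2-360M and should be checked against the model's `config.json`. The function name `kv_cache_bytes` is illustrative.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   bytes_per_value=2, batch_size=1):
    """Estimate KV cache size in bytes for a decoder-only transformer."""
    # Factor of 2: each layer stores one key tensor and one value tensor.
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * bytes_per_value * batch_size)


# Assumed model shape; verify against the model's config.json before relying on it.
# Sequence length is the 512-token prompt plus 64 generated tokens.
size = kv_cache_bytes(num_layers=32, num_kv_heads=5, head_dim=64, seq_len=512 + 64)
print(f"estimated KV cache: {size / 2**20:.1f} MiB")
```

Under those assumptions the cache itself is small relative to 6 GB, which suggests that peak memory in these runs is dominated by weights and activations rather than by cached keys and values.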
- Results come from a single laptop GPU environment.
- The tested model is intentionally small for safety and reproducibility.
- `static` cache failed locally because Triton/backend support was unavailable.
- The decode split uses a practical timing estimate rather than kernel-level instrumentation.
- A broad generation-length sweep was intentionally not run on this machine because previous heavier tests caused full system shutdowns.
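One common way to obtain a timing-based decode split without kernel-level instrumentation is to time a one-token generation as a prefill proxy and subtract it from a full run. The sketch below illustrates that idea only; it is not taken from `run_decode_microbenchmark.py`, and `estimate_decode_split` and its `generate_fn` argument are hypothetical names.

```python
import time


def estimate_decode_split(generate_fn, new_tokens, clock=time.perf_counter):
    """Estimate prefill and decode time from two timed calls.

    generate_fn(n) must generate n new tokens and block until finished.
    On a GPU that means synchronizing inside generate_fn (for example with
    torch.cuda.synchronize()) so the wall-clock timings are meaningful.
    """
    # A one-token run approximates prefill plus the first sampled token.
    start = clock()
    generate_fn(1)
    prefill_s = clock() - start

    # The full run covers prefill plus every decode step.
    start = clock()
    generate_fn(new_tokens)
    total_s = clock() - start

    # Clamp at zero in case timing noise makes the full run look faster.
    decode_s = max(total_s - prefill_s, 0.0)
    per_token_s = decode_s / max(new_tokens - 1, 1)
    return {
        "prefill_s": prefill_s,
        "total_s": total_s,
        "decode_s": decode_s,
        "decode_s_per_token": per_token_s,
    }


# Stand-in workload that sleeps about 1 ms per token, for demonstration only.
result = estimate_decode_split(lambda n: time.sleep(0.001 * n), new_tokens=8)
print(result)
```

Because the two runs are timed separately, the estimate inherits run-to-run noise, so averaging over several trials is advisable.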
The smaller decode benchmark is deliberate. On this hardware, pushing generation length higher is not just a slower run; it can make the machine unstable. The project therefore prioritizes:
- safe repeatability
- clear comparisons
- honest reporting of hardware constraints
This makes the benchmark more useful for real local-development scenarios, where safety and reproducibility matter as much as raw scale.
- add a compact summary table directly in the README
- refactor duplicated runner logic into shared utilities
- test one additional small model for cross-model comparison
- add batch-size sensitivity experiments if hardware allows
- compare local results on a higher-VRAM machine in a future extension