bench: warm up SEAL kernels before timed benchmarks (#625) #741

Merged
kimlaine merged 2 commits into microsoft:main from BAder82t:fix/sealbench-warmup-625 on Apr 28, 2026
@BAder82t (Contributor) commented Apr 26, 2026

Closes #625.

Summary

The first batch of sealbench results in a fresh process can be 2-6× slower than subsequent batches due to a cold instruction cache, page faults, and an uninitialized SEAL memory pool. This produced the symptom @rickwebiii reported in #625, where "poly degree 2048 keygen was faster than 1024, which makes no sense", along with the first-batch / second-batch ratios he documented in the issue body (KeyGen Secret 392 µs → 75 µs, about 5.2×; Decrypt 140 µs → 59 µs, about 2.4×; etc.).

This PR implements the fix exactly as the original reporter proposed:

"running the first batch of benchmarks one time during precomputation and not printing the results."

What changes

native/bench/bench.cpp only. No library code touched, no impact on users who do not build sealbench.

After the existing precomputation in main(), a new warmup_family() helper iterates the bm_env_map and runs one encrypt + decrypt + add + multiply per BMEnv (with an extra relinearize when key-switching is on), discarding the results. CKKS additionally exercises the encoder. The warmup primes:

  • the i-cache for the BFV / BGV / CKKS hot paths
  • the page tables for the SEAL data working set
  • the SEAL memory pool, so the timed benchmarks allocate from a populated free list rather than fresh OS pages
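
The shape of such a warmup pass can be sketched without SEAL itself. The block below is only an illustration under stated assumptions: `FakeEnv` and `warmup_family_sketch` are hypothetical stand-ins, not the `BMEnv` type or the `warmup_family()` helper from this PR, and the "operations" are recorded as strings rather than executed on ciphertexts.

```cpp
#include <cassert>
#include <cstddef>
#include <initializer_list>
#include <map>
#include <string>
#include <vector>

// Hypothetical stand-in for a benchmark environment. The real BMEnv holds
// SEAL objects (encryptor, evaluator, ...); this one only logs what ran.
struct FakeEnv {
    bool key_switching = false;        // assumption: mirrors "key-switching is on"
    std::vector<std::string> ops_run;  // operations touched by the warmup
};

// Run each hot-path operation once per environment and discard the result.
// Returns the number of operations executed so callers can sanity-check it.
inline std::size_t warmup_family_sketch(std::map<std::size_t, FakeEnv> &envs) {
    std::size_t count = 0;
    for (auto &entry : envs) {
        FakeEnv &env = entry.second;
        for (const char *op : {"encrypt", "decrypt", "add", "multiply"}) {
            env.ops_run.emplace_back(op);  // real code would call into SEAL here
            ++count;
        }
        if (env.key_switching) {
            env.ops_run.emplace_back("relinearize");  // extra step with key-switching
            ++count;
        }
    }
    return count;
}
```

In a real benchmark the loop body would call the actual kernels and drop their outputs; nothing is timed or printed during this pass.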

A "Running warmup pass ..." line prints between the existing "Running precomputations ..." banner and the google-benchmark output, so it is visible to users.

Verification

Built Release -O3 on Apple M-series, full sealbench across all default parameter sets:

| Check | Result |
|---|---|
| sealbench exit code | 0 |
| Total benchmark cases ran | 378 |
| Parameter sets | All 6 (n=1024..32768) |
| KeyGen Secret monotonic across n | 40 → 70 → 187 → 505 → 1484 → 5957 µs |
| KeyGen Public monotonic | 72 → 117 → 395 → 1280 → 3873 → 15031 µs |
| BFV EncryptSecret monotonic | 73 → 138 → 384 → 1200 → 4182 → 18151 µs |
| BFV Decrypt monotonic | 22 → 44 → 139 → 480 → 2014 → 7974 µs |
| CKKS Decrypt monotonic | 3 → 5.6 → 22 → 81 → 321 → 1248 µs |
| NTT Forward monotonic | 11 → 21 → 91 → 382 → 1606 → 6714 µs |
| Original "smaller-n slower than larger-n" symptom | gone |

First-vs-second-batch variability tightens on this hardware (e.g. KeyGen Public: 2.7% gap at baseline → 0.6% gap with warmup). The dramatic 2-6× ratios reported in #625 are hardware-dependent (the original report was on an M1 Air); on the M-series machine used here the baseline gap is already smaller, but the warmup moves the cold-start cost out of the timed region by construction, regardless of system.

Test plan

  • Full sealbench completes cleanly across n=1024..32768
  • All measured kernels scale monotonically with n
  • No library code changed
  • CI green on microsoft:main
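
The monotonic-scaling item can be checked mechanically. A minimal sketch, assuming the per-n timings have already been collected into a vector in order of increasing n; `strictly_increasing` is a hypothetical helper and not part of sealbench:

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Returns true when every timing is larger than the one before it, i.e. the
// kernel gets slower as the polynomial degree n grows.
inline bool strictly_increasing(const std::vector<double> &timings_us) {
    for (std::size_t i = 1; i < timings_us.size(); ++i) {
        if (timings_us[i] <= timings_us[i - 1]) {
            return false;  // a larger n ran no slower than a smaller one
        }
    }
    return true;
}
```

For example, the KeyGen Secret row from the verification results, `{40, 70, 187, 505, 1484, 5957}`, passes this check, while any sequence in which a larger n is faster than a smaller one (the symptom from #625) fails it.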

The first sealbench batch is 2-6x slower than later batches due to cold instruction cache and an uninitialized SEAL memory pool, which produced the "n=2048 keygen faster than n=1024" symptom in microsoft#625.

Add a silent warmup pass after precomputation that runs one encrypt + decrypt + add + multiply (+ relinearize when key-switching is on) per registered BMEnv. Bench-only change; no library code touched.

Implements the fix proposed by @rickwebiii in the issue.

Closes microsoft#625

@kimlaine (Contributor)

Thank you, this is overall a nice improvement.

Would you be able to add a flag that optionally disables this warmup; something like --no-warmup? This could be just read from argv. The rationale is that sometimes it's actually desirable to measure the cold start.
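
One way such a flag could be read from argv is sketched below. This is an assumption about a possible approach, not the code that was merged: `consume_flag` is a hypothetical helper that removes the flag from argv so that a later argument parser does not see it.

```cpp
#include <cassert>
#include <cstring>

// Returns true if `flag` appears in argv[1..argc) and removes every
// occurrence by shifting the remaining arguments left and shrinking argc.
inline bool consume_flag(int &argc, char **argv, const char *flag) {
    bool found = false;
    int out = 1;  // argv[0] is the program name and is always kept
    for (int in = 1; in < argc; ++in) {
        if (std::strcmp(argv[in], flag) == 0) {
            found = true;            // drop this argument
        } else {
            argv[out++] = argv[in];  // keep it, compacting the array
        }
    }
    if (out < argc) {
        argv[out] = nullptr;  // keep the array null-terminated after removal
    }
    argc = out;
    return found;
}
```

In a main() that follows the usual google-benchmark pattern, the call would look like `bool warmup = !consume_flag(argc, argv, "--no-warmup");` placed before `benchmark::Initialize(&argc, argv)`, so that the custom flag is not reported as an unrecognized argument.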

@BAder82t (Contributor, Author)

no problem

@kimlaine merged commit 63a0c3f into microsoft:main on Apr 28, 2026
1 check passed
Linked issue (#625): First set of benchmarks that run do so more slowly than they should.