Fast inference pipeline for nnU-Net and TotalSegmentator, the most popular
frameworks for medical image segmentation. This project provides a
clean, minimal inference module with only the necessary components, plus an
end-to-end GPU fast-path: every stage — resampling (cucim + a GPU
cubic-B-spline that matches scipy order=3 to ~1e-13), normalization, the
sliding-window forward pass, logits→label conversion, cropping, and
connected-component postprocessing — runs on the GPU. Across 24 parity-validated
modes it reproduces official TotalSegmentator output at ≥0.999 DSC on the headline
modes (≥0.995 on 21 of 24) while running 2–9× faster. The remaining runtime is
bound by the forward pass; the rest is fixed import and model-load overhead, which
is amortized in batch runs.
- Python: 3.10
- CUDA: 12.4
- uv for environment management
- Clone this repo. The vendored `nnunetv2` lives in `src/`, and `TotalSegmentator` is expected as a sibling checkout (see `[tool.uv.sources]` in `pyproject.toml`):

      git clone https://github.com/JunMa11/FastSegmentator.git
      git clone https://github.com/wasserth/TotalSegmentator.git  # sibling of FastSegmentator
      cd FastSegmentator

- Create the environment and install everything (including the `FastSegmentator` command) in one step:

      uv sync

  `uv sync` builds the editable `nnunetv2` package from `src/`, installs the pinned CUDA 12.1 torch wheels and cupy/cucim, and registers the `FastSegmentator` console script into `.venv/`.

- Activate the environment so the `FastSegmentator` command is on your PATH:

      source .venv/bin/activate

Download the dataset and model weights from the Google Drive link.
- Place the dataset in `FastSegmentator/nnUNet_data/`
- Place the model weights in `FastSegmentator/model_weights/`

TotalSegmentator weights default to `~/.totalsegmentator/nnunet/results`
(override with `--weights_dir`).
With the environment activated, the `FastSegmentator` command dispatches to one
of two backends:

    FastSegmentator <command> [options]

Without activating, you can equivalently run `uv run FastSegmentator ...` or `.venv/bin/FastSegmentator ...`.
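
The two-backend dispatch can be pictured as an `argparse` parser with one subcommand per backend. The following is only a sketch built from the flags documented in this README, not the project's actual entry point: the function name `build_parser` is hypothetical, and treating `--use_softmax` as a boolean switch is an assumption.

```python
import argparse


def build_parser() -> argparse.ArgumentParser:
    """Hypothetical parser mirroring the documented flags (not the project's code)."""
    parser = argparse.ArgumentParser(prog="FastSegmentator")
    sub = parser.add_subparsers(dest="command", required=True)

    # Flags that both backends list in their option tables.
    common = argparse.ArgumentParser(add_help=False)
    common.add_argument("-i", "--input_path", required=True)
    common.add_argument("-o", "--output_path", required=True)
    common.add_argument("--device", default="cuda", choices=["cuda", "cpu"])

    # TotalSegmentator backend.
    totalseg = sub.add_parser("totalseg", parents=[common])
    totalseg.add_argument("--task", default="total")
    totalseg.add_argument("--weights_dir", default="~/.totalsegmentator/nnunet/results")

    # Generic nnU-Net backend.
    nnunet = sub.add_parser("nnunet", parents=[common])
    nnunet.add_argument("--model_path", required=True)
    nnunet.add_argument("--fold", default="all")
    nnunet.add_argument("--checkpoint", default="checkpoint_final.pth")
    # Assumption: the flag is a plain on/off switch defaulting to False.
    nnunet.add_argument("--use_softmax", action="store_true")
    return parser


args = build_parser().parse_args(["totalseg", "-i", "in", "-o", "out"])
print(args.command, args.task, args.device)
```

Sharing the common flags through a parent parser keeps `-i`, `-o`, and `--device` identical across both subcommands.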

    FastSegmentator totalseg \
        -i <path_to_input_images> \
        -o <path_to_output_segmentations> \
        --task total

| Flag | Default | Description |
|---|---|---|
| `-i`, `--input_path` | (required) | Folder of `*.nii.gz` input images |
| `-o`, `--output_path` | (required) | Folder to write multilabel output NIfTIs |
| `--task` | `total` | Mode (e.g. `total`, `total_mr`, `body_mr`, …) |
| `--weights_dir` | `~/.totalsegmentator/nnunet/results` | TotalSegmentator weights path |
| `--device` | `cuda` | Device (`cuda` or `cpu`) |

Run FastSegmentator totalseg --help for the full list of --task modes.

    FastSegmentator nnunet \
        -i <path_to_input_images> \
        -o <path_to_output_segmentations> \
        --model_path <path_to_model_weights>

| Flag | Default | Description |
|---|---|---|
| `-i`, `--input_path` | (required) | Path to the input image folder |
| `-o`, `--output_path` | (required) | Path to save output segmentations |
| `--model_path` | (required) | Path to the trained model directory |
| `--fold` | `all` | Fold to use for inference |
| `--checkpoint` | `checkpoint_final.pth` | Checkpoint filename |
| `--use_softmax` | `False` | Apply softmax to output probabilities |
| `--device` | `cuda` | Device (`cuda` or `cpu`) |

Trainers. By design, the `nnunet` branch resolves only the standard nnU-Net trainers (`nnUNetTrainer`, `nnUNetTrainerNoMirroring`, `nnUNetTrainerTopkLoss`). To use a model trained with a custom trainer, point the `nnUNet_extTrainer` environment variable at the directory containing your trainer class so it can be resolved at checkpoint load:

    export nnUNet_extTrainer=/path/to/your/trainers

For example:

    FastSegmentator nnunet \
        -i ./nnUNet_data/Dataset701_AbdomenCT/imagesVal \
        -o ./seg \
        --model_path ./model_weights/701/nnUNetTrainerMICCAI_repvgg__nnUNetPlans__3d_fullres

The fast-path is validated to match official TotalSegmentator on the same
input (parity, not vs. ground truth) across 24 modes.
Overview + interactive figures: report/index.html; full
per-mode report: report/validation_report.html.
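
Parity here means the Dice similarity coefficient (DSC) between the fast-path label map and the official one for the same input. As a rough illustration, per-label DSC between two label maps can be computed with NumPy as below. The function name `dice_per_label` is hypothetical, label 0 is assumed to be background, and how the report aggregates scores across labels and cases is not shown.

```python
import numpy as np


def dice_per_label(a: np.ndarray, b: np.ndarray) -> dict[int, float]:
    """Dice similarity coefficient for every nonzero label present in either map."""
    if a.shape != b.shape:
        raise ValueError("label maps must have the same shape")
    scores = {}
    labels = np.union1d(np.unique(a), np.unique(b))
    for label in labels[labels != 0]:  # assumption: 0 is background
        mask_a = a == label
        mask_b = b == label
        overlap = np.logical_and(mask_a, mask_b).sum()
        # The denominator is never zero: the label occurs in at least one map.
        scores[int(label)] = float(2.0 * overlap / (mask_a.sum() + mask_b.sum()))
    return scores


# Toy 2-D maps standing in for the fast-path and official outputs.
fast = np.array([[0, 1, 1], [2, 2, 0]])
official = np.array([[0, 1, 1], [2, 0, 0]])
print(dice_per_label(fast, official))
```

Real volumes would first be loaded from the output NIfTI files; that step is left out to keep the sketch free of extra dependencies.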
Of the 24 validated modes, 21 reach ≥0.995 DSC — every previously-failing pathology mode is now ≥0.999 — and 3 thin/sparse modes carry small, characterized caveats, all at 2–9× speedup:
| Mode | Task | DSC vs official | Speedup |
|---|---|---|---|
| `total` | 291–295 | 1.0000 | 9.6× |
| `liver_lesions` | 591 | 1.0000 | 4.9× |
| `liver_lesions_mr`¹ | 589 | 1.0000 | 6.2× |
| `liver_segments_mr` | 576 | 1.0000 | 6.7× |
| `trunk_cavities` | 343 | 1.0000 | 4.1× |
| `lung_vessels` | 117 | 0.9999 | 2.8× |
| `teeth` | 113 | 0.9999 | 3.9× |
| `total_mr` | 850,851 | 0.9999 | 9.5× |
| `lung_nodules` | 913 | 0.9999 | 7.5× |
| `vertebrae_mr` | 756 | 0.9999 | 4.6× |
| `lung_vessels_LEGACY` | 258 | 0.9999 | 4.2× |
| `craniofacial_structures` | 115 | 0.9998 | 4.0× |
| `body` | 299 | 0.9996 | 2.3× |
| `pleural_pericard_effusion` | 315 | 0.9990 | 9.3× |
| `head_muscles` | 777 | 0.9989 | 5.1× |
| `head_glands_cavities` | 775 | 0.9987 | 5.0× |
| `liver_segments` | 570 | 0.9984 | 5.2× |
| `abdominal_muscles` | 952 | 0.9981 | 4.6× |
| `headneck_bones_vessels` | 776 | 0.9968 | 4.5× |
| `oculomotor_muscles` | 351 | 0.9960 | 4.8× |
| `body_mr` | 597 | 0.9956 | 4.9× |
| `headneck_muscles` | 778,779 | 0.9945 | 4.9× |
| `kidney_cysts` | 789 | 0.9919 | 4.9× |
| `liver_vessels` | 8 | 0.9880 | 5.9× |

¹ liver_lesions_mr — an 86-voxel lesion on the crop boundary, nondeterministic
on both pipelines (official itself flips 86/52 voxels across runs); our
deterministic output matches official's same-draw at DSC 1.0.
Three fixes brought the harder modes to parity (each isolated by bisecting against official's per-function intermediates):
- GPU cubic-B-spline input resample: replaced order-1 trilinear (`F.interpolate`) with a separable order-3 cubic B-spline matching nnU-Net's `skimage.resize(order=3)` to ~1e-13. (pleural, lung_nodules)
- `dtype=np.int32` on the cucim input resample: matches official's pre-model int truncation. (total_mr, liver_segments_mr, liver crops)
- Per-mode softmax→argmax convert for low-confidence lesion modes. (liver_lesions, liver_lesions_mr)

Plus GPU-ported crop + connected-component postprocess (bit-identical to the scipy originals) and cuDNN-deterministic forward for reproducibility.
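
The softmax→argmax conversion named in the third fix can be sketched in NumPy as follows. This is only an illustration of the operation itself: the function name `logits_to_labels`, the classes-first array layout, and the `uint8` label type are assumptions, and neither the project's GPU implementation nor its per-mode switch is reproduced here.

```python
import numpy as np


def logits_to_labels(logits: np.ndarray) -> np.ndarray:
    """Convert (classes, z, y, x) logits to a label map via softmax, then argmax."""
    # Subtract the per-voxel maximum so the exponentials cannot overflow.
    shifted = logits - logits.max(axis=0, keepdims=True)
    probs = np.exp(shifted)
    probs /= probs.sum(axis=0, keepdims=True)
    # Assumption: fewer than 256 classes, so uint8 labels suffice.
    return probs.argmax(axis=0).astype(np.uint8)


# Toy volume with 3 classes and 2 voxels.
logits = np.zeros((3, 1, 1, 2), dtype=np.float32)
logits[1, 0, 0, 0] = 2.0  # first voxel favours class 1
logits[2, 0, 0, 1] = 0.5  # second voxel weakly favours class 2
labels = logits_to_labels(logits)
print(labels.shape, labels.ravel().tolist())
```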