Speed up FastSurfer processing by ClePol · Pull Request #820 · Deep-MI/FastSurfer

ClePol · 2026-05-22T12:03:46Z

This is an AI-driven change to the FastSurfer pipeline that aims to speed up the segmentation and surface processing components. The current acceleration is around 25-35%, depending on the number of threads available. The speedups of individual changes, measured in an uncontrolled test environment with --threads 8, are included in most commits as detailed commit messages.

Before merging this PR the maintainers of the specific modules should review and edit the changes if needed.

I also recommend running the full validation before merging, as this PR touches many core components. Initial tests show no regression.

Note that c81f0c7 disables writing the usually unused T1.mgz by default, so this output is now opt-in rather than opt-out.

Start BA label generation and hyporelabel asynchronously as soon as both hemispheres finish, then wait after ribbon before consuming the relabeled aseg. On 114823_MR1 with 8 threads this reduced surface wall time from 46:01.47 to 45:03.77 with unchanged image, surface, label, annotation, and morphometry outputs.
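The start-early, wait-late pattern described in this commit can be sketched in Python as follows. This is only an illustration: the actual change lives in the recon-surf shell scripts, and `overlap`, `relabel_aseg`, `make_ribbon`, and `build_aparc_aseg` are hypothetical stand-ins rather than FastSurfer functions.

```python
from concurrent.futures import ThreadPoolExecutor


def overlap(background_job, independent_work, consume):
    """Run background_job while independent_work executes, then consume both results."""
    with ThreadPoolExecutor(max_workers=1) as pool:
        pending = pool.submit(background_job)  # start early
        other = independent_work()             # work that does not need the job's output
        produced = pending.result()            # wait late; re-raises any error from the job
        return consume(produced, other)


# Hypothetical stand-ins for the real pipeline stages.
def relabel_aseg():
    return "aseg.relabeled"


def make_ribbon():
    return "ribbon"


def build_aparc_aseg(aseg, ribbon):
    return f"aparc+aseg({aseg}, {ribbon})"


result = overlap(relabel_aseg, make_ribbon, build_aparc_aseg)
print(result)
```

Calling `pending.result()` only at the point where the relabeled output is consumed keeps the synchronization explicit, and a failure in the background job still surfaces before any downstream step runs.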

Run wmparc WM labeling from aseg.mgz asynchronously with half the requested threads while aparc.DKTatlas+aseg.mapped.mgz is created, then merge the WM labels back into the mapped aparc volume. Validation: 114823_MR1 surf_only threads=8 improved from 45:03.77 to 44:32.97 (30.80s faster). Surface comparator reported no voxel, surface, morphometry, label, or annotation data differences; numeric stats rows matched after stripping headers.
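The wall-time deltas quoted in these commit messages can be re-derived with a small helper. This is a sketch for checking the arithmetic, assuming the timings follow a `[H:]MM:SS.ss` layout; `elapsed_to_seconds` and `speedup_seconds` are hypothetical names and not part of the PR.

```python
def elapsed_to_seconds(text: str) -> float:
    """Convert '[H:]MM:SS.ss' wall-clock text into seconds."""
    seconds = 0.0
    for part in text.split(":"):
        seconds = seconds * 60 + float(part)
    return seconds


def speedup_seconds(reference: str, candidate: str) -> float:
    """Seconds saved by the candidate run, rounded to two decimals."""
    return round(elapsed_to_seconds(reference) - elapsed_to_seconds(candidate), 2)


# Timings quoted in the commit message above.
print(speedup_seconds("45:03.77", "44:32.97"))  # 30.8
```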

Use a smaller mri_surf2volseg hash resolution for mapped aparc/wmparc projection, run aparc projection with extra threads at 8-thread pipeline settings, and cap ribbon distance at 2. Validated on 114823_MR1 with exact data outputs vs surf_speed_async_wmparc_halfthreads_threads8_run1; wall time improved from 44:32.97 to 43:46.98 (45.99s faster).

Start mris_volmask from an async command file once both pial surfaces are ready, while the remaining hemisphere curvature/stat tail continues. Validated on 114823_MR1 with exact data outputs vs surf_speed_hash8_cap2_threads8_run1; wall time improved from 43:46.98 to 43:14.12 (32.86s faster).

Split async ribbon generation by hemisphere and merge the exact left/right masks before downstream consumers. Use lower exact hash resolutions for mapped-volume projections, cap concurrent mapped-volume threads to reduce contention, and only wait on async jobs when their outputs are required. Validated on 114823_MR1:

- surf_speed_async_ribbon_threads8_run1: 43:14.12
- surf_speed_hash5_wmparc_hash4_t16_threads8_run1: 42:42.62
- Speedup: 31.50s
- Output comparison: exact data outputs; stripped numeric stats match.

Skip distance-neighborhood curvature sampling for inflated.H/K generation. On 114823_MR1 this reduces the two inflated mris_curvature steps from 78.52s to 3.83s, saving 74.69s in that component. Validated against surf_speed_fast_fill_threads8_run2: final white/pial/sphere.reg vertices and faces, thickness, curv, aseg, ribbon, aparc+aseg, and wmparc outputs are unchanged. Intermediate inflated.H/K files differ. The full benchmark run was slower overall due to heavier system load (47:17.43 vs prior 40:32.44), so the speedup is based on paired command timings.

Defer sphere/surfreg until after white/pial generation for the default FastSurfer DKT path, keeping --fsaparc behavior unchanged because it needs sphere.reg before aparc labeling. On 114823_MR1 with the current curvature optimization, wall time improved from 47:17.43 to 43:59.74 under similar load, a 197.69s speedup. Validation found unchanged white/pial/sphere.reg vertices and faces, thickness, curv, avg_curv, jacobian_white, aseg, ribbon, aparc+aseg, and wmparc outputs.

Default surface reconstruction now uses norm.mgz directly for brainmask.mgz and skips the FreeSurfer-style T1.mgz normalization unless --fs_T1 is requested. On 114823_MR1 this reduced wall time from 43:59.74 to 40:22.01 (-217.73s) versus the committed defer-surfreg reference. Checked final MRI volumes, white/pial/sphere.reg surfaces, and key morph data were unchanged; the legacy auxiliary T1.mgz output is no longer produced by default.

Replace per-row scipy.stats.mode calls in sample_parc smoothing with np.bincount().argmax(), preserving scipy's smallest-label tie behavior for non-negative labels. Validation on 114823_MR1 against surf_speed_no_fs_t1_threads8_run1: mapped annotations byte-identical; final MRI volumes have zero voxel changes; white/pial/sphere.reg surfaces and morphometry have zero changes. sample_parc.py dropped from 61.57s total to 8.59s total; full wall time changed from 40:22.01 to 39:58.11.
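A minimal sketch of the per-row mode via `np.bincount().argmax()` is shown below. It is not the code from sample_parc.py; `rowwise_mode` is a hypothetical helper, and it assumes a non-empty 2D array of non-negative integer labels. Because `argmax` returns the first maximum, ties resolve to the smallest label, which is the tie behavior the commit message says it preserves.

```python
import numpy as np


def rowwise_mode(labels: np.ndarray) -> np.ndarray:
    """Most frequent non-negative integer label in each row; ties go to the smallest label."""
    labels = np.asarray(labels)
    if labels.ndim != 2 or labels.shape[1] == 0:
        raise ValueError("expected a 2D array with at least one column")
    if labels.min(initial=0) < 0:
        raise ValueError("np.bincount only accepts non-negative labels")
    # bincount counts occurrences of each label value; argmax picks the first (smallest) winner.
    return np.array([np.bincount(row).argmax() for row in labels], dtype=labels.dtype)


neighbors = np.array(
    [
        [1, 1, 2],     # clear winner: 1
        [3, 2, 2],     # clear winner: 2
        [2, 1, 1],     # clear winner: 1
        [7, 0, 7],     # clear winner: 7
    ]
)
print(rowwise_mode(neighbors))           # [1 2 1 7]
print(rowwise_mode(np.array([[2, 1]])))  # tie between 1 and 2 -> [1]
```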

Use shrink factor 5 for T1 N4 bias correction in run_fastsurfer.sh and recon-surf.sh. Isolated 114823_MR1 N4 timing: default shrink 4 took 1:44.76, shrink 5 took 0:50.24 (54.52s faster). In full seg validation under high host load, N4 module time was 40.05s vs 62.65s in the reference run. Validation against seg_speed_torchcap_threads8_run1: primary segmentation outputs, CC outputs, aseg.auto, mask, and CerebNet segmentation were voxel-identical. orig_nu changed by max 3 UCHAR (mean abs 0.099; p99.9 2). HypVINN changed 200 label voxels and 999 mask voxels.

Run CerebNet asynchronously when HypVINN is also enabled, using temporary CerebNet-specific segmentation and timing logs that are appended after the background process completes. Validation on 114823_MR1 seg_only threads=8: seg_speed_async_cereb_n4_threads8_run1 completed in 18:57.99. CerebNet took 60.60s and completed fully under HypVINN, which took 585.46s, hiding the CerebNet module runtime. Compared against seg_speed_n4_shrink5_threads8_run1: checked mri/orig.mgz, aparc.DKTatlas+aseg.deep.mgz, mask.mgz, aseg.auto_noCCseg.mgz, orig_nu.mgz, callosum.CC.orig.mgz, aparc.DKTatlas+aseg.deep.withCC.mgz, aseg.auto.mgz, cerebellum.CerebNet.nii.gz, hypothalamus.HypVINN.nii.gz, and hypothalamus_mask.HypVINN.nii.gz; all were voxel-identical. Stats data rows were identical; stats files differed only in metadata headers such as paths and hostname.
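The "data rows identical, headers differ" check can be approximated by dropping comment lines before comparing. The sketch below assumes that metadata lines in the stats files start with `#`; `stats_body` is a hypothetical helper, and the sample rows are illustrative rather than real output.

```python
def stats_body(text: str) -> list[str]:
    """Return the data rows of a stats file, skipping blank and '#' comment lines."""
    return [
        line.rstrip()
        for line in text.splitlines()
        if line.strip() and not line.lstrip().startswith("#")
    ]


# Illustrative content: same data row, different header metadata.
run_a = "# hostname nodeA\n# SUBJECTS_DIR /data/run_a\n  1   4   1234   1234.0  Left-Lateral-Ventricle\n"
run_b = "# hostname nodeB\n# SUBJECTS_DIR /data/run_b\n  1   4   1234   1234.0  Left-Lateral-Ventricle\n"

print(stats_body(run_a) == stats_body(run_b))  # True
print(run_a == run_b)                          # False
```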

Trace the shape-stable batch-1 CPU HypVINN model after the first batch so repeated slice inference runs through TorchScript. The optimization is limited to the default no-output-scale CPU path and can be disabled with FASTSURFER_HYPVINN_TRACE=0. Validation on 114823_MR1 with --seg_only --device cpu --threads 8: wall time improved from 18:57.99 in seg_speed_async_cereb_n4_threads8_run1 to 17:39.52 in seg_speed_hyp_trace_threads8_run1, a 78.47 second speedup. Isolated HypVINN improved from 11:20.59 to 8:22.56. Key image outputs were voxel-identical and non-comment stats rows matched exactly.

Trace the shape-stable batch-1 CPU FastSurferVINN model after the first batch so repeated slice inference uses TorchScript. The optimization is limited to the default no-output-scale CPU path and can be disabled with FASTSURFER_VINN_TRACE=0. Validation on 114823_MR1 with --seg_only --device cpu --threads 8: wall time improved from 17:39.52 in seg_speed_hyp_trace_threads8_run1 to 16:47.67 in seg_speed_vinn_hyp_trace_threads8_run1, a 51.85 second speedup. Key image outputs were voxel-identical and non-comment stats rows matched exactly.

Freeze traced FastSurferVINN and HypVINN batch-1 CPU modules after tracing to reduce per-slice inference overhead. The freeze layer is enabled by default and can be disabled independently with FASTSURFER_VINN_FREEZE=0 or FASTSURFER_HYPVINN_FREEZE=0 while keeping the tracing speedups. Validation on 114823_MR1 with --seg_only --device cpu --threads 8: wall time improved from 16:47.67 in seg_speed_vinn_hyp_trace_threads8_run1 to 16:07.54 in seg_speed_freeze_traces_threads8_run1, a 40.13 second speedup. Compared with the previous reference, observed small output changes: 12 voxels in aparc.DKTatlas+aseg.deep.mgz, 1 in mask.mgz, 7 in aseg.auto_noCCseg.mgz, 70 UCHAR voxels in orig_nu.mgz, 38 in aparc.DKTatlas+aseg.deep.withCC.mgz, 33 in aseg.auto.mgz, and 2 each in HypVINN segmentation and mask. CerebNet output was unchanged.

Move the shared CPU-only trace/freeze decision and torch.jit.trace/freeze wrapper into FastSurferCNN.utils.torchscript. FastSurferVINN and HypVINN keep their model-specific trace adapters and the same FASTSURFER_* environment flags. Validation: python3 -m py_compile FastSurferCNN/utils/torchscript.py FastSurferCNN/inference.py HypVINN/inference.py.
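The shared trace decision could look roughly like the sketch below. It is not the contents of FastSurferCNN/utils/torchscript.py: `flag_enabled` and `should_trace` are hypothetical names, and treating every value other than `0` as enabled is an assumption. Only the environment variable names come from the commit messages. The actual tracing step (torch.jit.trace followed by torch.jit.freeze) is left out so the example runs without PyTorch.

```python
import os


def flag_enabled(name: str, environ=None) -> bool:
    """Treat the flag as on unless it is explicitly set to '0' (assumed semantics)."""
    env = os.environ if environ is None else environ
    return env.get(name, "1").strip() != "0"


def should_trace(device: str, batch_size: int, has_output_scale: bool,
                 flag: str, environ=None) -> bool:
    """Trace only on the default CPU path: batch size 1 and no output scaling."""
    return (
        device == "cpu"
        and batch_size == 1
        and not has_output_scale
        and flag_enabled(flag, environ)
    )


print(should_trace("cpu", 1, False, "FASTSURFER_VINN_TRACE", environ={}))   # True
print(should_trace("cuda", 1, False, "FASTSURFER_VINN_TRACE", environ={}))  # False
print(should_trace("cpu", 1, False, "FASTSURFER_HYPVINN_TRACE",
                   environ={"FASTSURFER_HYPVINN_TRACE": "0"}))              # False
```

Passing `environ` explicitly keeps the decision testable without mutating the process environment.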

Reference 114823_MR1 elapsed: 31:16.77. Candidate surf_speed_qsphere_skip_large_lh_crop32_threads8_run1 elapsed: 30:43.90, speedup 32.87 seconds; recon-surf runtime changed from 0.496h to 0.487h. Use the direct seeded mris_sphere -q fallback instead of recon-all -qsphere and pass the full requested thread count to the qsphere wrapper. Large left-hemisphere meshes skip the spectral attempt and go straight to the deterministic FreeSurfer fallback; isolated validation showed lh.qsphere.nofix identical to the previous fallback. Run hemi ribbon masks through cropped_mris_volmask.py with margin 32. Output changes vs the reference are limited to 1 voxel in lh.ribbon.mgz, 2 voxels in rh.ribbon.mgz, and 3 voxels in ribbon/aseg/wmparc/aparc-derived volumes; prep outputs and surface binaries stayed exact in compare_surface_outputs.
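The cropping idea behind the margin-32 ribbon step can be illustrated as a padded bounding box around the surface, clipped to the volume. The internals of cropped_mris_volmask.py are not shown in this PR text, so the sketch below is an assumption about the general technique; `crop_bounds` is a hypothetical helper that expects vertex positions already expressed in voxel coordinates.

```python
import numpy as np


def crop_bounds(vox_coords, shape, margin=32):
    """Bounding box of the vertices, padded by `margin` voxels and clipped to the volume."""
    vox = np.asarray(vox_coords, dtype=float)
    limit = np.asarray(shape)
    lo = np.floor(vox.min(axis=0)).astype(int) - margin
    hi = np.ceil(vox.max(axis=0)).astype(int) + margin + 1  # +1: slice ends are exclusive
    lo = np.clip(lo, 0, limit)
    hi = np.clip(hi, 0, limit)
    return tuple(slice(int(a), int(b)) for a, b in zip(lo, hi))


# Illustrative vertices and volume size, not taken from a real subject.
bounds = crop_bounds([[100, 120, 130], [140, 150, 128]], (256, 256, 256), margin=32)
print(bounds)

volume = np.zeros((256, 256, 256), dtype=np.uint8)
print(volume[bounds].shape)  # (105, 95, 67)
```

Working on the cropped sub-volume and writing the result back into the full grid reduces the number of voxels the mask step has to visit, at the cost of choosing a margin large enough that nothing relevant falls outside the box.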

Start mri_relabel_hypointensities asynchronously after both pial surfaces are ready so it can overlap later independent surface work. Validation on 114823_MR1 with --threads 8: reference surf_speed_qsphere_skip_large_lh_crop32_threads8_run1 elapsed 30:43.90; candidate surf_speed_async_hyporelabel_pial_threads8_run1 elapsed 30:30.99, speedup 12.91 seconds. Checked MRI outputs matched the reference exactly.

Start HypVINN and CerebNet asynchronously as soon as their segmentation and bias-corrected image inputs are ready, then append their logs and check exits at the existing synchronization point. This overlaps auxiliary segmentation with stats and corpus-callosum work instead of waiting until after those stages. Validation on 114823_MR1 with --seg_only --device cuda --threads 8: wall time improved from 3:28.15 in seg_speed_gpu_threads8_run2 to 2:43.31 in seg_speed_gpu_aux_async_threads8_run1, a 44.84 second speedup. Key image outputs were voxel-identical and non-comment stats rows matched exactly against the prior GPU reference.

Overlap FastSurfer-CC generation/inpainting with N4 and early stats, and raise T1 N4 shrink from 5 to 6. Validation on 114823_MR1 seg_only cuda threads=8: seg_speed_gpu_aux_async_threads8_run1 2:43.31 -> seg_speed_gpu_cc_async_n4s6_threads8_run1 2:16.02, speedup 27.29s. Output comparison against the prior GPU reference: main aseg+DKT, mask, aseg.auto_noCCseg, CC, withCC, aseg.auto, and CerebNet segmentations were voxel-identical. orig_nu changed in 3,544,455 voxels with max absolute UCHAR delta 5 and mean absolute delta 0.130. HypVINN changed 195 segmentation voxels and 918 mask voxels.

Use 40 N4 iterations after shrink=6 and use HypVINN batch 4 for the default CUDA batch=1 path. CerebNet remains on the requested global batch size to avoid the batch-size output drift seen with global batch 4. Validation on 114823_MR1 seg_only cuda threads=8: seg_speed_gpu_cc_async_n4s6_threads8_run1 2:16.02 -> seg_speed_gpu_hypbatch4_n4iter40_threads8_run1 2:08.61, speedup 7.41s. Output comparison against the previous reference: main aseg+DKT, mask, aseg.auto_noCCseg, CC, withCC, aseg.auto, and CerebNet segmentations were voxel-identical. orig_nu changed in 2,913,929 voxels with max absolute UCHAR delta 4 and mean absolute delta 0.104. HypVINN changed 153 segmentation voxels and 730 mask voxels.
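The orig_nu figures reported in these messages (changed voxel count, maximum and mean absolute delta) can be computed along the lines of the sketch below. `diff_summary` is a hypothetical helper, not the comparator used for this PR, and averaging over all voxels rather than only the changed ones is an assumption. The cast to a signed type matters: subtracting UCHAR arrays directly would wrap around.

```python
import numpy as np


def diff_summary(reference, candidate):
    """Count changed voxels and report max/mean absolute difference over all voxels."""
    ref = np.asarray(reference).astype(np.int64)  # widen first so uint8 subtraction cannot wrap
    cand = np.asarray(candidate).astype(np.int64)
    if ref.shape != cand.shape:
        raise ValueError("volumes must have the same shape")
    delta = np.abs(ref - cand)
    return {
        "changed": int(np.count_nonzero(delta)),
        "max_abs": int(delta.max()),
        "mean_abs": float(delta.mean()),
    }


# Tiny illustrative arrays standing in for two orig_nu volumes.
before = np.array([0, 10, 255, 7], dtype=np.uint8)
after = np.array([3, 10, 250, 7], dtype=np.uint8)
summary = diff_summary(before, after)
print(summary)  # {'changed': 2, 'max_abs': 5, 'mean_abs': 2.0}
```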

Start recon-surf as soon as CC inpainting has produced aseg.auto.mgz, while the remaining segmentation stats and auxiliary HypVINN/CerebNet outputs continue in parallel. 114823_MR1 validation: 34:26.62 vs 35:04.86 reference, speedup 38.24s. Output comparison showed no voxel differences, no normal surface geometry/value differences, identical callosum.surf geometry, and identical .stats data bodies; remaining differences are path/timestamp/header metadata.

The surface scheduling work can place FreeSurfer progress text before @#@FSTIME markers on the same physical log line. Anchor timing parsing on the marker token so recon-surf_times.yaml is still generated. Validation: reran extract_recon_surf_time_info.py on surf_speed_surface_tail_overlap_threads8_run1/114823_MR1; output YAML was generated successfully. No processing output changes.
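Anchoring on the marker token rather than on the start of the line can be sketched as below. This is not the code in extract_recon_surf_time_info.py; `fstime_fields` is a hypothetical helper, and the sample log line, including the field layout after the marker, is illustrative.

```python
MARKER = "@#@FSTIME"


def fstime_fields(line: str):
    """Return the whitespace-separated fields after the marker, or None if it is absent."""
    start = line.find(MARKER)  # search anywhere in the line, not only at column 0
    if start < 0:
        return None
    return line[start + len(MARKER):].split()


# Illustrative line: progress text from another job precedes the marker.
line = "writing ribbon volume... @#@FSTIME  2026:05:22:12:00:00 mris_volmask N 12 e 95.10"
fields = fstime_fields(line)
print(fields[0], fields[1])                    # 2026:05:22:12:00:00 mris_volmask
print(fstime_fields("plain progress output"))  # None
```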

Overlap cortex+hipamyg label creation, inflate/curvature products, mapped anatomical stats, and pctsurfcon work with independent downstream surface steps. This keeps the same FreeSurfer commands and waits at the points where their outputs are needed. Validation on 114823_MR1 (--threads 8, cuda):

- reference surf_speed_early_surface_after_cc_clean_threads8_run1: 34:26.62
- optimized surf_speed_surface_tail_overlap_threads8_run1: 33:52.75
- speedup: 33.87 seconds
- compare_surface_outputs: no surface geometry or morph value differences; volume outputs unchanged by the comparator. Differences are logs/scripts/path metadata/stats headers plus callosum.surf byte hash.
- stats data bodies are identical after ignoring comment/header lines.
- callosum.surf geometry is identical: coords_equal=True, faces_equal=True, coord_max_abs=0.0.

Run Talairach registration asynchronously and use orig_nu.mgz as the normalization source while Talairach only updates transform metadata. Also defer the final async command-file wait and aparc cleanup until after the cortex-label surf2volseg pass so independent tail work can overlap. Validated on 114823_MR1 with Docker --user 4758:1001, --threads 8, --device cuda. Wall time improved from 33:52.75 to 33:04.89, a 47.86s speedup versus the previous optimized reference. Output comparison versus surf_speed_surface_tail_overlap_threads8_run1 showed no changed MRI volumes or primary surface geometry. Stats bodies are identical; callosum.surf byte hash differs but coordinates/faces are identical. Remaining differences are logs, command files, stats headers, run paths, and Talairach path/timestamp text.

m-reuter · 2026-05-22T15:28:27Z

There are probably some things in here worth considering, but:

- Most of it is parallelisation of a few blocks (each saving very little, like 30s), adding a lot of additional code and complexity. Probably not worth the added administrative and maintenance overhead.
- A lot is just faster because it calls some FreeSurfer commands directly instead of piggy-backing on recon-all. The reason the slower recon-all is used is to inherit future updates in these blocks directly from FreeSurfer.
- Some changes modify outputs, but the AI does not care and looks only at the standard outputs (it interprets the files as intermediate files, but they are final outputs, like the different surface curvature files).
- It skips things that are not relevant for the FastSurfer pipeline (like the T1.mgz creation) but are necessary for some downstream tools, like fMRIPrep. They had been added for that purpose only. (If this has changed in the meantime in fMRIPrep, we can of course drop it.)

So most of it is overlap (parallelisation) of components, including bigger modules like neural networks. For that, of course, the GPU needs to be able to handle the concurrent load. We should consider moving the pipeline to Snakemake instead of hacking parallelism into it like here.

It is still worth browsing through these, as some (at least one, the aparc smoothing filter for example) are worth adding or point to potential places to improve speed in the future. So this PR will never get merged (as is), but it can remain open as a guide for places to look for future improvements.

dkuegler · 2026-05-26T12:50:50Z

Generally, a lot of the parallelism circumvents the intended functionality of the --threads* flags.

As for the changes to bash scripts, I have tried to avoid such changes in the past because they make the (already complex) scripts more complex, and I would want to switch the code over from bash to Python in any case.

The torchscript extensions might be interesting to consider.

I did not take a look at the other scripts.

ClePol added 30 commits May 22, 2026 13:48

Speed up ribbon generation

505109c

Speed up spherical morphing

d3741d1

Overlap surface statistics

7b2985c

Speed up surface wrapper steps

8af67fd

Cap CPU inference threads

3882edb

Share cropped volume helpers for volmask

4cfb41d

fix ruff error

572fb9e

ClePol wants to merge 31 commits into Deep-MI:dev from ClePol:speedup_pr2
