BertsWin: Accelerating 3D Medical Image Analysis with Topological Preservation
Published: Dec 25, 2025 19:32
•1 min read
•ArXiv
Analysis
This paper addresses the challenge of applying self-supervised learning (SSL) and Vision Transformers (ViTs) to 3D medical imaging, focusing on the limitations of Masked Autoencoders (MAEs) in capturing 3D spatial relationships. The authors propose BertsWin, a hybrid architecture that combines BERT-style token masking with Swin Transformer windows to improve spatial context learning. The key innovations are maintaining a complete 3D grid of tokens, which preserves spatial topology, and training with a structural priority loss function. The paper reports substantial gains in convergence speed and training efficiency over standard ViT-MAE baselines without incurring a computational penalty, making it a notable contribution to 3D medical image analysis.
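To make the first idea concrete, the sketch below shows how BERT-style masking can keep every position of a 3D token grid by substituting masked positions with a learnable [MASK] embedding, so windowed attention still sees an intact (D, H, W) layout. This is a minimal sketch under assumed names and shapes (`Bert3DMasking`, `mask_ratio`), not the paper's implementation.

```python
# Minimal sketch (illustrative assumptions, not the BertsWin code): BERT-style
# masking keeps the full 3D token grid and replaces masked positions with a
# shared learnable [MASK] embedding, preserving the grid's spatial topology.
import torch
import torch.nn as nn


class Bert3DMasking(nn.Module):
    """Replace a random subset of 3D patch tokens with a learnable [MASK] token."""

    def __init__(self, embed_dim: int, mask_ratio: float = 0.6):
        super().__init__()
        self.mask_token = nn.Parameter(torch.zeros(1, 1, embed_dim))
        self.mask_ratio = mask_ratio

    def forward(self, tokens: torch.Tensor):
        # tokens: (B, D*H*W, C) -- the flattened 3D grid; nothing is dropped.
        B, N, C = tokens.shape
        mask = torch.rand(B, N, device=tokens.device) < self.mask_ratio  # (B, N) bool
        masked = torch.where(mask.unsqueeze(-1),
                             self.mask_token.expand(B, N, C),
                             tokens)
        return masked, mask  # mask later restricts the reconstruction loss


if __name__ == "__main__":
    B, D, H, W, C = 2, 4, 4, 4, 96
    grid = torch.randn(B, D * H * W, C)
    masked_grid, mask = Bert3DMasking(embed_dim=C)(grid)
    # Shape is unchanged, so the grid can be reshaped back to (B, D, H, W, C)
    # and fed to Swin-style windowed attention without gaps.
    print(masked_grid.shape, mask.float().mean().item())
```

By contrast, a standard MAE encoder drops masked tokens entirely, which breaks the regular 3D neighborhood structure that Swin-style windows rely on; keeping the complete grid is what preserves the topology the summary emphasizes.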
Key Takeaways
- Proposes BertsWin, a novel architecture for 3D medical image analysis using SSL.
- Combines BERT-style masking with Swin Transformer windows to improve spatial context learning.
- Maintains a complete 3D token grid to preserve spatial topology.
- Achieves significant improvements in convergence speed and training efficiency compared to existing methods.
- Demonstrates the effectiveness of the approach on TMJ segmentation using 3D CT scans.
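The analysis also mentions a structural priority loss, whose exact form is not given in this summary. As a clearly labeled assumption, one plausible reading is a reconstruction loss that up-weights voxels near strong intensity gradients (anatomical boundaries); the sketch below illustrates that generic idea only and is not the paper's loss.

```python
# Hedged sketch of a "structure-priority" style objective: this edge-weighted
# scheme is an assumption for illustration, not the loss defined in BertsWin.
import torch


def structure_weighted_recon_loss(pred: torch.Tensor,
                                  target: torch.Tensor,
                                  mask: torch.Tensor,
                                  alpha: float = 4.0) -> torch.Tensor:
    """L1 reconstruction error on masked voxels, up-weighted where the target
    volume has strong intensity gradients (e.g. bone/soft-tissue boundaries).

    pred, target: (B, 1, D, H, W) float volumes
    mask:         (B, 1, D, H, W) bool, True where the input was masked
    """
    # Finite-difference gradient magnitude of the target along each spatial axis.
    gd = target.diff(dim=2, prepend=target[:, :, :1]).abs()
    gh = target.diff(dim=3, prepend=target[:, :, :, :1]).abs()
    gw = target.diff(dim=4, prepend=target[:, :, :, :, :1]).abs()
    edge = gd + gh + gw

    # Normalize edge strength per volume and turn it into weights >= 1.
    weight = 1.0 + alpha * edge / (edge.amax(dim=(2, 3, 4), keepdim=True) + 1e-6)

    # Penalize only masked voxels, as in BERT/MAE-style pretraining objectives.
    err = (pred - target).abs() * weight * mask
    return err.sum() / (mask.sum() + 1e-6)


if __name__ == "__main__":
    B, D, H, W = 2, 8, 8, 8
    pred = torch.randn(B, 1, D, H, W)
    target = torch.randn(B, 1, D, H, W)
    mask = torch.rand(B, 1, D, H, W) < 0.6
    print(structure_weighted_recon_loss(pred, target, mask).item())
```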
Reference
“BertsWin achieves a 5.8x acceleration in semantic convergence and a 15-fold reduction in training epochs compared to standard ViT-MAE baselines.”