This research presents a two-stage cascaded 3D segmentation approach for accurate brain tumor delineation in multimodal MRI. The first, coarse segmentation stage combines a Swin Transformer backbone with Atrous Spatial Pyramid Pooling (ASPP) and Squeeze-and-Excitation (SE) blocks to extract rich, multi-scale contextual features. The second stage refines the preliminary output with a class-wise attention decoder that highlights the tumor subregions (edema, necrotic core, and enhancing tumor) by masking the original input with the coarse predictions. A weighted Tversky loss addresses the substantial class imbalance inherent in tumor segmentation tasks. Extensive experiments were conducted on the BraTS 2020 dataset, with preprocessing steps including volume cropping and resizing to 128 × 128 × 128, and evaluation performed on training, validation, and test sets split in a 70%–15%–15% ratio. The model identified tumor regions accurately, achieving a Dice score of 0.99 for healthy tissue and above 0.5 for the tumor subregions. Analysis of ROC curves and confusion matrices further confirmed the reliability of the predictions. To improve model transparency, layer-wise Grad-CAM heatmaps were generated, showing attention shifting from broad background context in early layers to focused tumor localization in deeper layers. The proposed method attains competitive segmentation performance while offering a more interpretable and clinically relevant solution. The results underscore the efficacy of combining coarse-to-fine cascaded architectures with transformer-based encoders and attention-driven refinement for brain tumor segmentation.
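The weighted Tversky loss used to counter class imbalance can be sketched as follows. This is a minimal NumPy illustration assuming softmax probabilities and one-hot targets; the `alpha`, `beta`, and class-weight values are illustrative defaults, not the settings used in this work.

```python
import numpy as np

def weighted_tversky_loss(probs, targets, alpha=0.7, beta=0.3,
                          weights=None, eps=1e-6):
    """Class-weighted Tversky loss.

    probs, targets: arrays of shape (C, *spatial) holding softmax
    probabilities and one-hot ground truth. alpha penalizes false
    positives, beta false negatives; per-class weights let rare tumor
    subregions contribute more to the loss. Parameter values here are
    illustrative, not the paper's settings.
    """
    n_classes = probs.shape[0]
    if weights is None:
        weights = np.ones(n_classes) / n_classes  # uniform by default
    losses = []
    for c in range(n_classes):
        p, t = probs[c].ravel(), targets[c].ravel()
        tp = np.sum(p * t)          # soft true positives
        fp = np.sum(p * (1 - t))    # soft false positives
        fn = np.sum((1 - p) * t)    # soft false negatives
        tversky = (tp + eps) / (tp + alpha * fp + beta * fn + eps)
        losses.append(1.0 - tversky)
    return float(np.sum(np.asarray(weights) * np.asarray(losses)))
```

Setting `alpha > beta` (as in the default above) penalizes false positives more heavily; swapping them favors recall, which is often preferred for small lesions.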
