A new dual-module system combining YOLOv8 and the Swin Transformer shows that AI can detect structural cracks more quickly and accurately than humans. This offers a significant upgrade in how building safety inspections are carried out.

Study: Improved Dual-Module YOLOv8 Algorithm for Building Crack Detection. Image Credit: Bowonpat Sakaew/Shutterstock.com
Bridging the Gaps in Current Inspection Methods
Cracks are among the earliest and most critical signs of structural deterioration. Identifying them accurately is essential for preventive maintenance and structural health monitoring.
Traditional inspection methods, often manual, are time-consuming, risky, and increasingly impractical, especially with the rise of high-density urban construction and towering buildings.
To address these challenges, computer vision and digital image processing technologies are stepping in. These methods offer greater speed, consistency, and scalability, especially as cities grow taller and denser. By analyzing visual data, they can quickly pinpoint cracks and deliver essential metrics to assess the integrity of building structures.
Deep learning, in particular, has become central to efforts aimed at automating crack detection. However, there remains a gap between academic models and real-world engineering applications.
This study aims to close that gap with an AI-driven system designed to be both efficient and precise.
Inside the Dual-Module Crack Detection System
In a study published in Buildings, researchers proposed a dual-module crack detection system powered by YOLOv8 and the Swin Transformer. The system was trained on a diverse dataset of crack images sourced from online platforms, on-site photography, and open-access image repositories.
To enhance the detection of fine cracks without significantly increasing computational load, the team introduced a Swin Transformer-based windowed multi-head self-attention mechanism. This helped the system focus more effectively on small and subtle crack features.
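The idea behind windowed self-attention can be illustrated with a small sketch. It is not the researchers' implementation: the function names, the 8×8 toy grid, and the 80×80 feature map with a window size of 8 are assumptions chosen for illustration. The sketch splits a feature grid into non-overlapping windows and counts how many query-key pairs attention has to score with and without windowing.

```python
def window_partition(grid, window):
    """Split a 2D grid (list of rows) into non-overlapping window x window tiles."""
    height, width = len(grid), len(grid[0])
    if height % window or width % window:
        raise ValueError("grid size must be divisible by the window size")
    tiles = []
    for top in range(0, height, window):
        for left in range(0, width, window):
            tiles.append([row[left:left + window] for row in grid[top:top + window]])
    return tiles


def attention_pairs(height, width, window=None):
    """Count query-key pairs scored by self-attention over a height x width grid.

    With window=None every token attends to every token (global attention).
    Otherwise each token attends only to the tokens inside its own window.
    """
    tokens = height * width
    if window is None:
        return tokens * tokens
    return tokens * window * window


# Toy 8x8 grid whose cells are numbered 0..63 (illustrative values only).
grid = [[r * 8 + c for c in range(8)] for r in range(8)]
tiles = window_partition(grid, 4)
print(len(tiles), "windows; first row of first window:", tiles[0][0])

# Assumed 80x80 feature map and window size 8 (hypothetical sizes).
print("global pairs:  ", attention_pairs(80, 80))
print("windowed pairs:", attention_pairs(80, 80, 8))
```

Under these assumed sizes, windowing cuts the number of scored pairs by a factor of 100. This is why the cost grows linearly with image area rather than quadratically.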
For segmentation, an improved U-Net architecture was used. With a rotating-split method, it extracted detailed crack shapes and accurately measured crack widths, which are key parameters in structural assessment.
Training was conducted using the Ultralytics YOLOv8 framework on PyTorch 1.12.0, running on an NVIDIA RTX 3090 GPU and an Intel i5-13500H CPU. The YOLOv8n model served as the base, trained using stochastic gradient descent with a momentum of 0.937 and a cosine-decay learning rate schedule.
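For readers unfamiliar with these optimizer settings, the sketch below shows what a cosine-decay schedule and a momentum update look like in plain Python. Only the momentum of 0.937 comes from the article. The starting and final learning rates, the epoch count, and the function names are placeholders, and the update rule follows the common PyTorch-style formulation, not code from the study.

```python
import math


def cosine_lr(epoch, total_epochs, lr_start, lr_end):
    """Learning rate at a given epoch, decaying from lr_start to lr_end along a half cosine."""
    progress = epoch / max(total_epochs - 1, 1)
    return lr_end + 0.5 * (lr_start - lr_end) * (1 + math.cos(math.pi * progress))


def sgd_momentum_step(weight, grad, velocity, lr, momentum=0.937):
    """One SGD-with-momentum update for a single scalar weight."""
    velocity = momentum * velocity + grad  # accumulate a running gradient direction
    weight = weight - lr * velocity        # step against the accumulated direction
    return weight, velocity


# Placeholder hyperparameters (assumptions, not reported in the article).
TOTAL_EPOCHS, LR_START, LR_END = 100, 0.01, 0.0001

schedule = [cosine_lr(e, TOTAL_EPOCHS, LR_START, LR_END) for e in range(TOTAL_EPOCHS)]
print("first, middle, last lr:", schedule[0], schedule[50], schedule[-1])

w, v = sgd_momentum_step(weight=1.0, grad=0.5, velocity=0.0, lr=schedule[0])
print("weight after one step:", w)
```

The schedule starts with large steps and shrinks them smoothly. This tends to stabilize the final epochs of training.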
Input images were standardized to 640×640 pixels, and the dataset was augmented using techniques like random flips, Mosaic augmentation, and scaling to enhance model generalization.
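One detail of augmentation for detection models is that labels must be transformed along with the pixels. The sketch below shows a random horizontal flip under two assumptions: boxes are stored in the normalized YOLO layout (class, x-center, y-center, width, height), and the image is a plain list of pixel rows. The function names and the flip probability are illustrative and are not taken from the study.

```python
import random


def hflip_boxes(boxes):
    """Mirror normalized YOLO boxes (cls, xc, yc, w, h) around the vertical axis."""
    return [(cls, 1.0 - xc, yc, w, h) for cls, xc, yc, w, h in boxes]


def random_hflip(image, boxes, p=0.5, rng=random):
    """Flip the image and its boxes together with probability p."""
    if rng.random() < p:
        image = [row[::-1] for row in image]  # reverse every pixel row
        boxes = hflip_boxes(boxes)            # keep labels aligned with pixels
    return image, boxes


# Toy 2x3 "image" and a single box left of centre (illustrative values).
image = [[1, 2, 3], [4, 5, 6]]
boxes = [(0, 0.25, 0.5, 0.1, 0.2)]

flipped_image, flipped_boxes = random_hflip(image, boxes, p=1.0)
print(flipped_image)
print(flipped_boxes)
```

Only the x-center changes. The width, height, and y-center of each box are unaffected by a horizontal flip.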
Performance Evaluation and Key Metrics
To assess the model’s performance, the study looked at accuracy, precision, recall, F1 score, IoU (Intersection over Union), and mean Average Precision (mAP), with mAP serving as the primary benchmark.
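These detection metrics follow standard definitions, which the short sketch below makes concrete. The box coordinates and the true-positive, false-positive, and false-negative counts are made-up numbers for illustration, not data from the study.

```python
def box_iou(a, b):
    """Intersection over Union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    union = (a[2] - a[0]) * (a[3] - a[1]) + (b[2] - b[0]) * (b[3] - b[1]) - inter
    return inter / union if union > 0 else 0.0


def precision_recall_f1(tp, fp, fn):
    """Precision, recall, and F1 from true-positive, false-positive, and false-negative counts."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


# Two overlapping 2x2 boxes: overlap area 1, union area 7.
iou = box_iou((0, 0, 2, 2), (1, 1, 3, 3))
print("IoU:", iou)

# Made-up counts: 8 correct detections, 2 false alarms, 2 missed cracks.
precision, recall, f1 = precision_recall_f1(tp=8, fp=2, fn=2)
print("precision, recall, F1:", precision, recall, f1)
```

mAP builds on these pieces. A prediction counts as a true positive when its IoU with a ground-truth box exceeds a chosen threshold, and average precision summarizes the resulting precision-recall curve for each class.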
Detected cracks were also categorized by orientation: horizontal, vertical, diagonal, and complex patterns. This classification provided a clearer picture of the nature and directionality of the structural issues.
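The article does not describe how the study assigns these labels, so the following is only one plausible rule. It is a minimal sketch that labels a crack from the angles of its line segments, with a hypothetical 15-degree tolerance and hypothetical function names.

```python
import math


def segment_orientation(p1, p2, tol_deg=15.0):
    """Label one straight segment as horizontal, vertical, or diagonal."""
    dx, dy = p2[0] - p1[0], p2[1] - p1[1]
    if dx == 0 and dy == 0:
        raise ValueError("segment endpoints must differ")
    angle = math.degrees(math.atan2(dy, dx)) % 180.0  # direction folded into [0, 180)
    if angle <= tol_deg or angle >= 180.0 - tol_deg:
        return "horizontal"
    if abs(angle - 90.0) <= tol_deg:
        return "vertical"
    return "diagonal"


def classify_crack(segments, tol_deg=15.0):
    """Label a crack made of several segments; mixed directions count as complex."""
    labels = {segment_orientation(p1, p2, tol_deg) for p1, p2 in segments}
    if not labels:
        raise ValueError("at least one segment is required")
    return labels.pop() if len(labels) == 1 else "complex"


print(classify_crack([((0, 0), (10, 1))]))                       # nearly flat segment
print(classify_crack([((0, 0), (0, 10))]))                       # straight up
print(classify_crack([((0, 0), (10, 10))]))                      # 45-degree segment
print(classify_crack([((0, 0), (10, 0)), ((10, 0), (10, 10))]))  # L-shaped crack
```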
The dual-module system achieved a mAP of 97.14%, with over 99% mAP for horizontal and vertical cracks. Accuracy reached 98.17%, recall hit 99.02%, and the F1 score landed at 98.34%.
In segmentation, the enhanced U-Net yielded a Dice coefficient of 91.95%, an average symmetric surface distance of 0.5618, and an IoU of 86.87%. These results outperformed baseline models like standard U-Net and Attention U-Net, thanks to the use of pixel-level and spatial attention modules.
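Dice and IoU both measure how well a predicted crack mask overlaps the ground truth. The sketch below computes them for a pair of tiny binary masks. The masks are invented for illustration, and scoring two empty masks as a perfect match is a convention assumed here, not something stated in the study.

```python
def dice_and_iou(pred, truth):
    """Dice coefficient and IoU for two same-sized binary masks (lists of 0/1 rows)."""
    inter = pred_sum = truth_sum = 0
    for pred_row, truth_row in zip(pred, truth):
        for p, t in zip(pred_row, truth_row):
            inter += p * t   # pixel is crack in both masks
            pred_sum += p
            truth_sum += t
    union = pred_sum + truth_sum - inter
    if union == 0:
        return 1.0, 1.0      # assumed convention: two empty masks agree perfectly
    dice = 2 * inter / (pred_sum + truth_sum)
    iou = inter / union
    return dice, iou


# Invented 2x3 masks: 3 predicted pixels, 3 true pixels, 2 shared.
pred = [[1, 1, 0],
        [0, 1, 0]]
truth = [[1, 0, 0],
         [0, 1, 1]]

dice, iou = dice_and_iou(pred, truth)
print("Dice:", dice, "IoU:", iou)
```

For a single mask pair the two scores are linked by Dice = 2·IoU / (1 + IoU), so Dice is never lower than IoU. Figures averaged over many images need not satisfy that identity exactly.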
Why This Approach Stands Out
Traditional CNNs have limitations in capturing long-range dependencies and global context, which are key for detecting fine, elongated cracks in complex images. By integrating the Swin Transformer into the YOLOv8 framework, the researchers expanded the model's receptive field and boosted its resilience to background noise.
Segmentation was treated as a post-detection refinement step rather than a separate task. This approach preserved real-time detection speeds while adding critical geometric information, such as crack width and orientation.
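The detect-then-refine structure can be sketched as a simple two-stage loop. Everything here is a stand-in: `toy_detector` and `toy_segmenter` are hypothetical placeholders for the YOLOv8-based detector and the U-Net segmenter, and the 0.25 confidence threshold is an assumed value.

```python
def crop(image, box):
    """Cut the region (x1, y1, x2, y2) out of an image stored as a list of pixel rows."""
    x1, y1, x2, y2 = box
    return [row[x1:x2] for row in image[y1:y2]]


def detect_then_segment(image, detector, segmenter, min_conf=0.25):
    """Run the detector once, then segment only the regions it flags with enough confidence."""
    results = []
    for box, conf in detector(image):
        if conf < min_conf:
            continue  # skip weak detections so the segmenter does less work
        mask = segmenter(crop(image, box))
        results.append({"box": box, "confidence": conf, "mask": mask})
    return results


def toy_detector(image):
    """Hypothetical detector: returns fixed (box, confidence) pairs."""
    return [((1, 1, 3, 3), 0.9), ((0, 0, 1, 1), 0.1)]


def toy_segmenter(patch):
    """Hypothetical segmenter: marks any non-zero pixel as crack."""
    return [[1 if value > 0 else 0 for value in row] for row in patch]


image = [[0, 0, 0, 0],
         [0, 5, 0, 0],
         [0, 5, 5, 0],
         [0, 0, 0, 0]]

results = detect_then_segment(image, toy_detector, toy_segmenter)
print(results)
```

Because the segmenter only sees cropped regions, its cost scales with the number and size of detections, not with the full image. This is one way a refinement stage can leave detection speed largely intact.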
Compared to previous models like YOLOv3, YOLOv5, YOLOv7, Vision Transformer (ViT), and Faster R-CNN, the proposed system demonstrated superior detection accuracy and faster inference, confirming its technical and practical advantages.
Practical Implications and Future Directions
The study confirms that combining YOLOv8 with Swin Transformer modules can significantly boost the performance of automated crack detection systems. The segmentation outputs are visually interpretable, offering engineers clear, actionable insights during structural assessments.
With its high accuracy and efficiency, the model provides a practical solution for real-time crack detection, which is crucial for maintenance planning and safety evaluations. It also lays the groundwork for future innovations in smart infrastructure monitoring, where automated systems continuously track and assess structural health.
Journal Reference
Zuo, X., Almutairi, A. D., Saeed, M. K., & Dai, Y. (2026). Improved Dual-Module YOLOv8 Algorithm for Building Crack Detection. Buildings. DOI: 10.3390/buildings16020461, https://www.mdpi.com/2075-5309/16/2/461