A Cross-Scale Hierarchical Transformer With Correspondence-Augmented Attention for Inferring Bird’s-Eye-View Semantic Segmentation
As bird’s-eye-view (BEV) semantic segmentation is simple to visualize and easy to handle, it has been applied in autonomous driving to provide surrounding information to downstream tasks. Inferring BEV semantic segmentation from multi-camera-view images is a popular scheme in the community because it requires only cheap devices and supports real-time processing. Recent work implemented this task by learning content and position relationships via a Vision Transformer (ViT). However, its quadratic complexity confines relationship learning to the latent layer, leaving a scale gap that impedes the representation of fine-grained objects. From the perspective of information absorption, when representing position-related BEV features, the weighted fusion of all view features allows inconducive features to disturb the fusion of conducive ones. To tackle these issues, the authors propose a novel cross-scale hierarchical Transformer with correspondence-augmented attention for semantic segmentation inference. Specifically, they devise a hierarchical framework to refine the BEV feature representation, where the last scale is only half the size of the final segmentation. To offset the computational increase caused by this hierarchical framework, they exploit a cross-scale Transformer to learn feature relationships in a reversed-aligning way and leverage residual connections of BEV features to facilitate information transmission between scales. The authors also propose correspondence-augmented attention to distinguish conducive from inconducive correspondences. It is implemented in a simple yet effective way: attention scores are amplified before the Softmax operation, so that position-view-related attention scores are highlighted and position-view-disrelated scores are suppressed. Extensive experiments demonstrate that the authors' method achieves state-of-the-art performance in inferring BEV semantic segmentation from multi-camera-view images.
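The correspondence-augmented attention described in the abstract can be sketched as follows. This is a minimal illustration, not the paper's implementation: the amplification factor `alpha`, the binary correspondence mask, and the additive form of the score adjustment are all assumptions made for clarity.

```python
import numpy as np

def correspondence_augmented_attention(scores, corr_mask, alpha=2.0):
    """Hypothetical sketch of correspondence-augmented attention.

    Attention scores are adjusted before the Softmax so that
    position-view-related scores (corr_mask True) are highlighted
    and position-view-disrelated scores are suppressed. `alpha` is
    an assumed amplification factor; the paper's exact formulation
    may differ.
    """
    # Highlight related scores (+alpha) and suppress disrelated ones (-alpha).
    augmented = scores + np.where(corr_mask, alpha, -alpha)
    # Numerically stable softmax over the last axis.
    e = np.exp(augmented - augmented.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

# Toy example: one BEV query attending over four view features,
# where only the first two views correspond to the query's position.
scores = np.array([0.5, 0.4, 0.6, 0.3])
mask = np.array([True, True, False, False])
weights = correspondence_augmented_attention(scores, mask)
```

After the adjustment, nearly all of the attention mass concentrates on the two position-view-related features, so inconducive view features no longer disturb the fusion.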
Availability:
- Find a library where document is available. Order URL: http://worldcat.org/oclc/41297384
Supplemental Notes:
- Copyright © 2024, IEEE.
Authors:
- Fang, Naiyu
- Qiu, Lemiao
- Zhang, Shuyou
- Wang, Zili
- Hu, Kerui
- Wang, Kang
Publication Date:
- 2024-07
Language:
- English
Media Info:
- Media Type: Web
- Features: References
- Pagination: pp 7726-7737
Serial:
- IEEE Transactions on Intelligent Transportation Systems
- Volume: 25
- Issue Number: 7
- Publisher: Institute of Electrical and Electronics Engineers (IEEE)
- ISSN: 1524-9050
- Serial URL: http://ieeexplore.ieee.org/xpl/RecentIssue.jsp?punumber=6979
Subject/Index Terms:
- TRT Terms: Autonomous vehicles; Cameras; Data segmentation; Object detection; Three dimensional displays; Transformers
- Subject Areas: Data and Information Technology; Highways; Vehicles and Equipment
Filing Info:
- Accession Number: 01936082
- Record Type: Publication
- Files: TRIS
- Created Date: Nov 7 2024 9:21AM