A Video Is Worth Three Views: Trigeminal Transformers for Video-Based Person Re-Identification

Video-based person Re-Identification (Re-ID) is an active research topic in intelligent transportation systems. It aims to retrieve video sequences of the same person across non-overlapping surveillance cameras. Compared with static images, video sequences contain more visual information from multiple views, such as spatial and temporal views. However, previous Re-ID methods usually focus on a single, limited view and lack diverse observations from different views. To capture richer perceptions and extract more comprehensive representations, the authors propose a novel learning framework named Trigeminal Transformers (TMT) to tackle video-based person Re-ID. More specifically, the authors first design a View-wise Projector (VP) to jointly transform raw videos from spatial, temporal and spatial-temporal views. In addition, inspired by the great success of Vision Transformers (ViT), the authors introduce the Transformer structure for information enhancement and aggregation. In this work, three Self-view Transformers (ST) are proposed to exploit the relationships among local features for information enhancement in the spatial, temporal and spatial-temporal views. Moreover, a Cross-view Transformer (CT) is proposed to aggregate the multi-view features into comprehensive representations. Experimental results indicate that the approach achieves better performance than several other state-of-the-art approaches on four public Re-ID benchmarks.
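
The three-stage flow described in the abstract (view-wise projection, per-view self-attention, cross-view aggregation) can be illustrated with a minimal sketch. This is not the authors' implementation: the abstract does not specify how the VP, ST or CT modules are built, so the pooling rules, the weight-free attention, and every name below (`view_wise_projector`, `attend`, `tmt_sketch`) are assumptions chosen only to show how data might move between the stages.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(queries, context):
    """Scaled dot-product attention with identity projections.

    Assumption: a real Transformer layer has learned query/key/value
    weights, multiple heads, and feed-forward blocks; all are omitted here.
    """
    dim = len(queries[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in context]
        weights = softmax(scores)
        out.append([sum(w * v[d] for w, v in zip(weights, context)) for d in range(dim)])
    return out


def mean_tokens(tokens):
    """Average a list of equal-length feature vectors."""
    dim = len(tokens[0])
    return [sum(t[d] for t in tokens) / len(tokens) for d in range(dim)]


def view_wise_projector(video):
    """Split features into three views (hypothetical pooling rules).

    video[t][p] is the feature vector of patch p in frame t.
    """
    frames, patches = len(video), len(video[0])
    # Spatial view: one token per patch, averaged over time.
    spatial = [mean_tokens([video[t][p] for t in range(frames)]) for p in range(patches)]
    # Temporal view: one token per frame, averaged over patches.
    temporal = [mean_tokens(video[t]) for t in range(frames)]
    # Spatial-temporal view: every (frame, patch) token is kept.
    spatio_temporal = [video[t][p] for t in range(frames) for p in range(patches)]
    return spatial, temporal, spatio_temporal


def tmt_sketch(video):
    """Return one descriptor: the three fused view vectors, concatenated."""
    views = view_wise_projector(video)
    # Self-view step: tokens attend only to tokens of their own view.
    enhanced = [attend(view, view) for view in views]
    # Cross-view step (assumed form): each view attends to tokens of all views.
    all_tokens = [tok for view in enhanced for tok in view]
    fused = [mean_tokens(attend(view, all_tokens)) for view in enhanced]
    return [x for vec in fused for x in vec]


if __name__ == "__main__":
    # Toy input: 2 frames, 3 patches per frame, 4-dimensional features.
    toy = [[[float(t + p + d) for d in range(4)] for p in range(3)] for t in range(2)]
    print(len(tmt_sketch(toy)))
```

With 2 frames, 3 patches and 4-dimensional features, the sketch yields 3, 2 and 6 tokens for the spatial, temporal and spatial-temporal views, and a descriptor that concatenates one 4-dimensional vector per view.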

Language

  • English

Filing Info

  • Accession Number: 01938905
  • Record Type: Publication
  • Files: TRIS
  • Created Date: Dec 6 2024 2:15PM