Text-to-Image Vehicle Re-Identification: Multi-Scale Multi-View Cross-Modal Alignment Network and a Unified Benchmark

Vehicle Re-IDentification (Re-ID) aims to retrieve the images most similar to a given query vehicle image from a set of images captured by non-overlapping cameras; it plays a crucial role in intelligent transportation systems and has seen impressive advances in recent years. In real-world scenarios, text descriptions of a target vehicle can often be acquired through witness accounts, after which image queries for vehicle Re-ID must be searched for manually, which is time-consuming and labor-intensive. To solve this problem, this paper introduces a new fine-grained cross-modal retrieval task, text-to-image vehicle re-identification, which seeks to retrieve target vehicle images given text descriptions. To bridge the significant gap between the language and visual modalities, the authors propose a novel Multi-scale multi-view Cross-modal Alignment Network (MCANet). In particular, the authors incorporate view masks and multi-scale features to align image and text features progressively. In addition, the authors design a Masked Bidirectional InfoNCE (MB-InfoNCE) loss to enhance training stability and make the best use of negative samples. To provide an evaluation platform for text-to-image vehicle re-identification, the authors create a Text-to-Image Vehicle Re-Identification dataset (T2I VeRi), which contains 2,465 image-text pairs from 776 vehicles with an average sentence length of 26.8 words. Extensive experiments conducted on T2I VeRi demonstrate that MCANet outperforms the current state-of-the-art (SOTA) method by 2.2% in rank-1 accuracy.
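
The abstract does not give the exact formulation of the MB-InfoNCE loss. The following is a minimal, hypothetical PyTorch sketch, assuming a symmetric image-text InfoNCE in which off-diagonal batch pairs that share a vehicle identity are masked out so they are not penalized as negatives. The function name mb_infonce, the identity-based masking rule, and the temperature value are illustrative assumptions, not the authors' definition.

    # Hypothetical sketch of a Masked Bidirectional InfoNCE (MB-InfoNCE) loss.
    # Assumption: a symmetric image<->text contrastive loss over a batch where
    # same-identity, off-diagonal entries are masked so true matches are not
    # treated as negatives. The paper's actual formulation may differ.
    import torch
    import torch.nn.functional as F

    def mb_infonce(img_feats, txt_feats, ids, temperature=0.07):
        """img_feats, txt_feats: (B, D) embeddings of paired samples.
        ids: (B,) vehicle identity labels used to build the negative mask."""
        img = F.normalize(img_feats, dim=-1)
        txt = F.normalize(txt_feats, dim=-1)
        logits = img @ txt.t() / temperature             # (B, B) similarities

        same_id = ids.unsqueeze(0) == ids.unsqueeze(1)   # (B, B) identity matches
        eye = torch.eye(len(ids), dtype=torch.bool, device=logits.device)
        mask = same_id & ~eye                            # same identity, not the positive
        logits = logits.masked_fill(mask, float('-inf')) # exclude false negatives

        targets = torch.arange(len(ids), device=logits.device)
        loss_i2t = F.cross_entropy(logits, targets)      # image -> text direction
        loss_t2i = F.cross_entropy(logits.t(), targets)  # text -> image direction
        return 0.5 * (loss_i2t + loss_t2i)

Masking same-identity pairs is one plausible reading of "make the best use of negative samples", since it keeps other images of the same vehicle from being pushed apart; the bidirectional average over both retrieval directions matches the "Bidirectional" in the loss name.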

Language

  • English

Filing Info

  • Accession Number: 01936078
  • Record Type: Publication
  • Files: TRIS
  • Created Date: Nov 7 2024 9:21AM