👋 Welcome to Track #4: Cross-Modal Drone Navigation of the 2025 RoboSense Challenge!
This track focuses on developing robust models for natural language-guided cross-view image retrieval, targeting scenarios where images are captured from drastically different viewpoints, such as aerial (drone or satellite) and ground-level perspectives.
Participants will build systems that retrieve the corresponding images from large-scale cross-view databases given natural language descriptions, even under common corruptions such as blur, occlusion, or sensor noise.
🏆 Prize Pool: $2,000 USD (1st: $1,000, 2nd: $600, 3rd: $400) + Innovation Awards
This track evaluates the capability of vision-language models to perform cross-modal drone navigation through natural language-guided image retrieval. The key challenges include:
The ultimate goal is to develop models that can enable intuitive drone navigation through natural language commands while being robust to real-world sensing conditions.
Duration: June 15th, 2025 (anytime on earth) - August 15th, 2025 (anytime on earth)
This phase involves evaluation on a publicly available test set with approximately 190 test cases (24GB GPU version). Participants can:
Duration: August 15th, 2025 (anytime on earth) - September 22nd, 2025 (anytime on earth; extended from September 15th)
This phase involves final evaluation on a private test set with the same number of test cases (~190) as Phase I. Key requirements:
The track uses the RoboSense Track 4 Cross-Modal Drone Navigation Dataset, based on the GeoText-1652 benchmark. The dataset features:
| Platform | Split | Images | Descriptions | Bbox-Texts | Classes | Universities | 
|---|---|---|---|---|---|---|
| Drone | Train | 37,854 | 113,562 | 113,367 | 701 | 33 | 
| Drone | Test | 51,355 | 154,065 | 140,179 | 951 | 39 | 
| Satellite | Train | 701 | 2,103 | 1,709 | 701 | 33 | 
| Satellite | Test | 951 | 2,853 | 2,006 | 951 | 39 | 
| Ground | Train | 11,663 | 34,989 | 14,761 | 701 | 33 | 
| Ground | Test | 2,921 | 8,763 | 4,023 | 793 | 39 | 
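As a quick sanity check on the statistics above, the following sketch divides the description and bbox-text counts by the image counts for each row. The numbers are copied from the table; the dictionary layout and the function name `per_image_ratios` are illustrative and not part of the official toolkit.

```python
from typing import Dict, Tuple

# (platform, split) -> (images, descriptions, bbox_texts), copied from the table above.
STATS: Dict[Tuple[str, str], Tuple[int, int, int]] = {
    ("Drone", "Train"): (37_854, 113_562, 113_367),
    ("Drone", "Test"): (51_355, 154_065, 140_179),
    ("Satellite", "Train"): (701, 2_103, 1_709),
    ("Satellite", "Test"): (951, 2_853, 2_006),
    ("Ground", "Train"): (11_663, 34_989, 14_761),
    ("Ground", "Test"): (2_921, 8_763, 4_023),
}


def per_image_ratios(
    stats: Dict[Tuple[str, str], Tuple[int, int, int]],
) -> Dict[Tuple[str, str], Tuple[float, float]]:
    """Return (descriptions per image, bbox-texts per image) for each row."""
    return {
        key: (descriptions / images, bbox_texts / images)
        for key, (images, descriptions, bbox_texts) in stats.items()
    }


ratios = per_image_ratios(STATS)
for (platform, split), (desc_ratio, bbox_ratio) in ratios.items():
    print(
        f"{platform:9s} {split:5s} "
        f"descriptions/image={desc_ratio:.2f} bbox-texts/image={bbox_ratio:.2f}"
    )
```

Every row works out to exactly three descriptions per image, while the number of bbox-texts per image varies by platform and split.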
The baseline model is built upon the GeoText-1652 framework, which combines:
The baseline provides a strong foundation for participants to build upon, incorporating state-of-the-art techniques in cross-modal retrieval and spatial understanding.
The baseline model achieves the following performance on the Phase I evaluation set (~190 test cases):
| Query Type | R@1 | R@5 | R@10 | 
|---|---|---|---|
| Text Query | 29.9 | 46.3 | 54.1 | 
| Image Query | 50.1 | 81.2 | 90.3 | 
Note: These results represent the baseline performance on the 24GB GPU version used for Phase I evaluation. Participants are encouraged to develop novel approaches to surpass these benchmarks.
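The R@K columns are read here as Recall@K, the share of queries for which a correct match appears among the top K retrieved items. The sketch below assumes that standard definition; it is not the official evaluation script, and the function name `recall_at_k` and the toy image IDs are hypothetical.

```python
from typing import Dict, Sequence, Set


def recall_at_k(
    rankings: Sequence[Sequence[str]],
    relevant: Sequence[Set[str]],
    ks: Sequence[int] = (1, 5, 10),
) -> Dict[int, float]:
    """Return Recall@K (in percent) for each K in `ks`.

    rankings[i] lists the gallery IDs returned for query i, best match first.
    relevant[i] is the set of gallery IDs that count as correct for query i.
    A query is a hit at K if any correct ID appears in its top-K results.
    """
    if len(rankings) != len(relevant):
        raise ValueError("rankings and relevant must have the same length")
    if not rankings:
        raise ValueError("at least one query is required")

    scores: Dict[int, float] = {}
    for k in ks:
        hits = sum(
            1
            for ranked, truth in zip(rankings, relevant)
            if any(item in truth for item in ranked[:k])
        )
        scores[k] = 100.0 * hits / len(rankings)
    return scores


# Toy example with hypothetical image IDs (not taken from the dataset).
toy_rankings = [
    ["img_a", "img_b", "img_c"],  # correct match ranked 1st
    ["img_d", "img_e", "img_f"],  # correct match ranked 3rd
    ["img_g", "img_h", "img_i"],  # correct match not retrieved
    ["img_j", "img_k", "img_l"],  # correct match ranked 2nd
]
toy_relevant = [{"img_a"}, {"img_f"}, {"img_z"}, {"img_k"}]

toy_scores = recall_at_k(toy_rankings, toy_relevant, ks=(1, 2, 3))
print(toy_scores)  # {1: 25.0, 2: 50.0, 3: 75.0}
```

Under this definition the score can only stay the same or rise as K grows, which matches the pattern in the baseline table.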
We provide the following resources to support the development of models in this track:
| Resource | Link | Description | 
|---|---|---|
| GitHub Repository | github.com/robosense2025/track4 | Baseline code and setup instructions | 
| Dataset | HuggingFace Dataset | Dataset with training and test splits | 
| Baseline Model | Pre-Trained Checkpoint | GeoText-1652 baseline model weights | 
| Registration | Google Form (Closed on August 15th) | Team registration for the challenge | 
| Evaluation Server (Phase 1) | CodaLab Platform 1 | Online evaluation platform | 
| Evaluation Server (Phase 2) | CodaLab Platform 2 | Online evaluation platform | 
Below is a list of Frequently Asked Questions (FAQs). If you have further questions about the competition, please email robosense2025@gmail.com and/or contact Meng Chu (richardtotrueman@gmail.com), the organizer of Track 4.
Yes, but must be publicly available and cited.
Yes, you can combine multiple models.
Evaluation runs on NVIDIA V100 GPUs with 32GB memory.
Typically 10-20 minutes depending on submission size.
@inproceedings{chu2024towards, 
  title     = {Towards Natural Language-Guided Drones: GeoText-1652 Benchmark with Spatial Relation Matching}, 
  author    = {Chu, Meng and Zheng, Zhedong and Ji, Wei and Wang, Tingyu and Chua, Tat-Seng}, 
  booktitle = {European Conference on Computer Vision}, 
  year      = {2024} 
}
@misc{robosense2025track4,
  title        = {RoboSense Challenge 2025: Track 4 - Cross-Modal Drone Navigation},
  author       = {RoboSense Challenge 2025 Organizers},
  year         = {2025},
  howpublished = {https://robosense2025.github.io/track4}
}