👋 Welcome to Track #4: Cross-Modal Drone Navigation of the 2025 RoboSense Challenge!
This track focuses on developing robust models for natural language-guided cross-view image retrieval, designed for scenarios where input data is captured from drastically different viewpoints, such as aerial (drone or satellite) and ground-level perspectives. Participants will build systems that retrieve the corresponding images from large-scale cross-view databases based on natural language descriptions, even under common corruptions such as blur, occlusion, or sensor noise.
🏆 Prize Pool: $2,000 USD (1st: $1,000, 2nd: $600, 3rd: $400) + Innovation Awards
This track evaluates the capability of vision-language models to perform cross-modal drone navigation through natural language-guided image retrieval. The key challenges include:

- Matching images across drastically different viewpoints (drone, satellite, and ground level)
- Grounding free-form natural language descriptions in the correct visual content
- Retrieving targets from large-scale cross-view databases
- Staying robust to common corruptions such as blur, occlusion, and sensor noise

A minimal sketch of the underlying retrieval task follows this list.
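The core task can be pictured as ranking a cross-view gallery against a text query. Below is a minimal sketch using a generic CLIP-style dual encoder via `open_clip`; this is an illustration only, the file names are placeholders, and the official baseline uses the GeoText-1652 framework instead.

```python
# Minimal text-to-image retrieval sketch with a generic dual encoder.
# NOTE: open_clip here is a stand-in for illustration; the track baseline
# is GeoText-1652, and all file names below are placeholders.
import torch
import open_clip
from PIL import Image

model, _, preprocess = open_clip.create_model_and_transforms(
    "ViT-B-32", pretrained="openai")
tokenizer = open_clip.get_tokenizer("ViT-B-32")
model.eval()

# Gallery of cross-view images (drone / satellite / ground).
gallery_paths = ["drone_0001.jpg", "satellite_0001.jpg", "ground_0001.jpg"]
images = torch.stack(
    [preprocess(Image.open(p).convert("RGB")) for p in gallery_paths])
query = tokenizer(["a white building beside a circular fountain"])

with torch.no_grad():
    img_feats = model.encode_image(images)
    txt_feats = model.encode_text(query)

# L2-normalize, then rank gallery images by cosine similarity to the query.
img_feats = img_feats / img_feats.norm(dim=-1, keepdim=True)
txt_feats = txt_feats / txt_feats.norm(dim=-1, keepdim=True)
scores = (txt_feats @ img_feats.T).squeeze(0)
ranking = scores.argsort(descending=True)
print([gallery_paths[i] for i in ranking])
```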
The ultimate goal is to develop models that can enable intuitive drone navigation through natural language commands while being robust to real-world sensing conditions.
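Since robustness to degraded inputs is central to the track, it helps to stress-test models against synthetic corruptions during development. The pipeline below is a hedged sketch built from standard `torchvision` transforms; the corruption types and parameters are assumptions, not the challenge's official corruption suite.

```python
# Illustrative corruption pipeline for robustness testing.
# NOTE: the corruption set and parameters are assumptions, not the
# official challenge corruptions.
import torch
from torchvision import transforms

def add_gaussian_noise(img: torch.Tensor, std: float = 0.05) -> torch.Tensor:
    """Additive sensor-like noise, clamped back to the valid pixel range."""
    return (img + std * torch.randn_like(img)).clamp(0.0, 1.0)

corrupt = transforms.Compose([
    transforms.ToTensor(),                                     # PIL -> tensor in [0, 1]
    transforms.GaussianBlur(kernel_size=9, sigma=(1.0, 3.0)),  # blur
    transforms.RandomErasing(p=1.0, scale=(0.05, 0.2)),        # occlusion
    transforms.Lambda(add_gaussian_noise),                     # sensor noise
])

# Usage: corrupted = corrupt(Image.open("query.jpg").convert("RGB"))
```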
This phase involves evaluation on a publicly available test set of approximately 190 test cases (the 24GB GPU version). Participants can use this phase to develop, validate, and iterate on their approaches before the final evaluation.
This phase involves final evaluation on a private test set with the same number of test cases (~190) as Phase I, and determines the final rankings for the track.
The track uses the RoboSense Track 4 Cross-Modal Drone Navigation Dataset, based on the GeoText-1652 benchmark. The dataset features:
Platform | Split | Images | Descriptions | Bbox-Texts | Classes | Universities |
---|---|---|---|---|---|---|
Drone | Train | 37,854 | 113,562 | 113,367 | 701 | 33 |
Drone | Test | 51,355 | 154,065 | 140,179 | 951 | 39 |
Satellite | Train | 701 | 2,103 | 1,709 | 701 | 33 |
Satellite | Test | 951 | 2,853 | 2,006 | 951 | 39 |
Ground | Train | 11,663 | 34,989 | 14,761 | 701 | 33 |
Ground | Test | 2,921 | 8,763 | 4,023 | 793 | 39 |
The baseline model is built upon the GeoText-1652 framework, which combines global image-text retrieval with region-level spatial relation matching over bbox-text pairs.
The baseline provides a strong foundation for participants to build upon, incorporating state-of-the-art techniques in cross-modal retrieval and spatial understanding.
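As a rough schematic of these two training signals, the sketch below pairs a symmetric image-text contrastive (InfoNCE) loss with the same alignment objective applied at the region level (bbox crops matched to their text phrases). The names and the loss weighting are illustrative assumptions, not the exact GeoText-1652 objective.

```python
# Schematic of the two training signals combined by a GeoText-1652-style
# baseline: global image-text alignment plus region-level (bbox-text)
# matching. Names and weighting are illustrative, not the exact objective.
import torch
import torch.nn.functional as F

def contrastive_loss(img_feats, txt_feats, temperature=0.07):
    """Symmetric InfoNCE over a batch of paired image/text embeddings."""
    img = F.normalize(img_feats, dim=-1)
    txt = F.normalize(txt_feats, dim=-1)
    logits = img @ txt.T / temperature
    targets = torch.arange(logits.size(0), device=logits.device)
    return 0.5 * (F.cross_entropy(logits, targets)
                  + F.cross_entropy(logits.T, targets))

def region_matching_loss(region_feats, phrase_feats, temperature=0.07):
    """Same alignment objective, applied to bbox crops and their phrases."""
    return contrastive_loss(region_feats, phrase_feats, temperature)

# total_loss = contrastive_loss(img, txt) + lam * region_matching_loss(reg, phr)
```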
The baseline model achieves the following performance on the Phase I evaluation set (~190 test cases):
Query Type | R@1 | R@5 | R@10 |
---|---|---|---|
Text Query | 29.9 | 46.3 | 54.1 |
Image Query | 50.1 | 81.2 | 90.3 |
Note: These results represent the baseline performance on the 24GB GPU version used for Phase I evaluation. Participants are encouraged to develop novel approaches to surpass these benchmarks.
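For reference, Recall@K numbers like those above are conventionally computed by checking whether any ground-truth gallery item lands in a query's top-K results; a small sketch follows (the authoritative scoring is whatever the evaluation server implements).

```python
# Conventional Recall@K: a query scores a hit if any of its ground-truth
# gallery indices appears among its top-K retrieved items.
import numpy as np

def recall_at_k(scores: np.ndarray, gt: list, ks=(1, 5, 10)) -> dict:
    """scores: (num_queries, num_gallery) similarity matrix;
    gt[i]: set of gallery indices that are correct for query i."""
    order = np.argsort(-scores, axis=1)  # highest similarity first
    results = {}
    for k in ks:
        hits = [bool(gt[i] & set(order[i, :k])) for i in range(len(gt))]
        results[f"R@{k}"] = 100.0 * float(np.mean(hits))
    return results
```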
We provide the following resources to support the development of models in this track:
Resource | Link | Description |
---|---|---|
GitHub Repository | https://github.com/robosense2025/track4 | Official baseline code and setup instructions |
Dataset | HuggingFace Dataset | Complete dataset with training and test splits |
Baseline Model | Pre-trained Checkpoint | GeoText-1652 baseline model weights |
Registration | Google Form | Team registration for the challenge |
Evaluation Server | CodaLab | Online evaluation platform |
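For convenience, the dataset can typically be pulled straight from the Hugging Face Hub with the `datasets` library; the repository id below is a placeholder, so substitute the actual id linked in the table above.

```python
# Hedged sketch of loading the dataset from the Hugging Face Hub.
# NOTE: the repository id is a placeholder; use the id from the
# resources table above.
from datasets import load_dataset

ds = load_dataset("robosense2025/track4-dataset")  # hypothetical id
print(ds)
```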
For questions, technical support, or clarifications about Track 4, please contact the track organizers.
```bibtex
@inproceedings{chu2024towards,
  title     = {Towards Natural Language-Guided Drones: GeoText-1652 Benchmark with Spatial Relation Matching},
  author    = {Chu, Meng and Zheng, Zhedong and Ji, Wei and Wang, Tingyu and Chua, Tat-Seng},
  booktitle = {ECCV},
  year      = {2024}
}

@inproceedings{robosense2025track4,
  title     = {RoboSense Challenge 2025: Track 4 - Cross-Modal Drone Navigation},
  author    = {RoboSense Challenge Organizers},
  booktitle = {IROS 2025},
  year      = {2025},
  url       = {https://robosense2025.github.io/track4}
}
```