The RoboSense Challenge 2025

Track #4: Cross-Modal Drone Navigation

Natural Language-Guided Cross-View Image Retrieval

👋 Welcome to Track #4: Cross-Modal Drone Navigation of the 2025 RoboSense Challenge!

This track focuses on developing robust models for natural language-guided cross-view image retrieval, designed for scenarios where input data is captured from drastically different viewpoints, such as aerial (drone or satellite) and ground-level perspectives. Participants will build systems that retrieve the corresponding images from large-scale cross-view databases given natural language descriptions, even under common corruptions such as blur, occlusions, or sensor noise.

🏆 Prize Pool: $2,000 USD (1st: $1,000, 2nd: $600, 3rd: $400) + Innovation Awards


🎯 Objective

This track evaluates the capability of vision-language models to perform cross-modal drone navigation through natural language-guided image retrieval. The key challenges include:

  • Cross-View Matching: Retrieve corresponding images across drastically different viewpoints (aerial drone/satellite vs. ground-level)
  • Natural Language Understanding: Process and interpret natural language descriptions to guide image retrieval
  • Robustness: Maintain performance under real-world corruptions including blurriness, occlusions, and sensor noise
  • Multi-Platform Support: Handle imagery from multiple platforms including drone, satellite, and ground cameras

The ultimate goal is to develop models that can enable intuitive drone navigation through natural language commands while being robust to real-world sensing conditions.
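To make the robustness requirement above concrete, below is a minimal sketch of the kinds of corruptions mentioned (blur, occlusion, sensor noise) applied to an input image. The corruption types, parameters, and severity scaling here are illustrative assumptions, not the official test-time corruption pipeline.

```python
# Illustrative corruptions only; the official test-time corruptions may
# use different types, parameters, and severity levels.
import numpy as np
from PIL import Image, ImageFilter

def corrupt(image: Image.Image, kind: str, severity: int = 3) -> Image.Image:
    """Apply a simple corruption to a PIL image (sketch, not the official pipeline)."""
    if kind == "blur":
        # Gaussian blur whose radius grows with severity
        return image.filter(ImageFilter.GaussianBlur(radius=severity))
    if kind == "occlusion":
        # Black out a centered square patch whose side grows with severity
        arr = np.array(image)
        h, w = arr.shape[:2]
        side = int(min(h, w) * 0.1 * severity)
        top, left = (h - side) // 2, (w - side) // 2
        arr[top:top + side, left:left + side] = 0
        return Image.fromarray(arr)
    if kind == "noise":
        # Additive Gaussian "sensor" noise
        arr = np.array(image).astype(np.float32)
        arr = arr + np.random.normal(0.0, 5.0 * severity, arr.shape)
        return Image.fromarray(np.clip(arr, 0, 255).astype(np.uint8))
    raise ValueError(f"unknown corruption kind: {kind}")
```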


🗂️ Phases & Requirements

Phase I: Public Evaluation

This phase involves evaluation on a publicly available test set with approximately 190 test cases (24GB GPU version). Participants can:

  • Download the official dataset and baseline model
  • Develop and test their approaches using the public test set
  • Submit results with reproducible code
  • Iterate and improve their models based on public leaderboard feedback

Phase II: Final Evaluation

This phase involves final evaluation on a private test set with the same number of test cases (~190) as Phase I. Key requirements:

  • Submit final model and code for evaluation on the private test set
  • Provide technical report describing the approach and innovations
  • Include model weights and complete reproducible implementation
  • Final rankings and awards will be determined based on Phase II results

🗄️ Dataset

The track uses the RoboSense Track 4 Cross-Modal Drone Navigation Dataset, based on the GeoText-1652 benchmark. The dataset features:

  • Multi-platform imagery: drone, satellite, and ground cameras
  • Large scale: 100K+ images across 72 universities
  • Rich annotations: global descriptions, bounding boxes, and spatial relations
  • No overlap: Training (33 universities) and test (39 universities) are completely separate

| Platform  | Split | Images | Descriptions | Bbox-Texts | Classes | Universities |
|-----------|-------|--------|--------------|------------|---------|--------------|
| Drone     | Train | 37,854 | 113,562      | 113,367    | 701     | 33           |
| Drone     | Test  | 51,355 | 154,065      | 140,179    | 951     | 39           |
| Satellite | Train | 701    | 2,103        | 1,709      | 701     | 33           |
| Satellite | Test  | 951    | 2,853        | 2,006      | 951     | 39           |
| Ground    | Train | 11,663 | 34,989       | 14,761     | 701     | 33           |
| Ground    | Test  | 2,921  | 8,763        | 4,023      | 793     | 39           |
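
A minimal sketch of fetching the dataset from the Hugging Face Hub is shown below. The repository id is a placeholder, so take the actual id (and file layout) from the HuggingFace Dataset link in the Resources section.

```python
# Sketch only: replace the placeholder repo_id with the id from the
# official HuggingFace Dataset link in the Resources table.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(
    repo_id="<track4-dataset-id>",  # placeholder, not the real id
    repo_type="dataset",
)
print("Dataset files downloaded to:", local_dir)
```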

🛠️ Baseline Model

The baseline model is built upon the GeoText-1652 framework, which combines:

  • Vision-Language Architecture: Based on X-VLM for multi-modal understanding
  • Spatial Relation Matching: Advanced spatial reasoning for cross-view alignment
  • Multi-Platform Support: Handles drone, satellite, and ground imagery
  • Robust Training: Trained on diverse university campuses with rich annotations

The baseline provides a strong foundation for participants to build upon, incorporating state-of-the-art techniques in cross-modal retrieval and spatial understanding.
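
As a rough illustration of how such a retrieval model is used at inference time, the sketch below ranks a gallery of image embeddings against a single text-query embedding by cosine similarity. How the embeddings are produced is left to the participant; this is not the actual GeoText-1652/X-VLM API.

```python
# Generic cross-modal retrieval by cosine similarity in a shared
# embedding space. How the embeddings are produced (X-VLM, your own
# encoders, ...) is up to the participant; this only shows the ranking.
import torch
import torch.nn.functional as F

@torch.no_grad()
def rank_gallery(text_emb: torch.Tensor, image_embs: torch.Tensor) -> torch.Tensor:
    """Return gallery indices sorted from best to worst match.

    text_emb:   (D,)   embedding of one text query
    image_embs: (N, D) embeddings of the N gallery images
    """
    text_emb = F.normalize(text_emb, dim=-1)
    image_embs = F.normalize(image_embs, dim=-1)
    scores = image_embs @ text_emb           # cosine similarities, shape (N,)
    return scores.argsort(descending=True)   # best-matching images first
```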


📊 Baseline Results

The baseline model achieves the following performance on the Phase I evaluation set (~190 test cases):

| Query Type  | R@1  | R@5  | R@10 |
|-------------|------|------|------|
| Text Query  | 29.9 | 46.3 | 54.1 |
| Image Query | 50.1 | 81.2 | 90.3 |

Note: These results represent the baseline performance on the 24GB GPU version used for Phase I evaluation. Participants are encouraged to develop novel approaches to surpass these benchmarks.
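
For reference, R@K (Recall@K) is the fraction of queries whose ground-truth gallery item appears among the top-K retrieved results. A minimal sketch of this computation follows; it assumes a single positive per query, and the official evaluation server script remains authoritative.

```python
# Recall@K sketch assuming one ground-truth gallery item per query;
# the official evaluation script may handle multiple positives.
import numpy as np

def recall_at_k(similarity: np.ndarray, gt_index: np.ndarray, k: int) -> float:
    """similarity: (num_queries, num_gallery) score matrix
    gt_index:   (num_queries,) index of the correct gallery item per query"""
    topk = np.argsort(-similarity, axis=1)[:, :k]     # top-k gallery ids per query
    hits = (topk == gt_index[:, None]).any(axis=1)    # ground truth in top-k?
    return float(hits.mean())

# Toy usage with random scores (illustration only)
sim = np.random.rand(190, 951)
gt = np.random.randint(0, 951, size=190)
print({f"R@{k}": round(recall_at_k(sim, gt, k), 3) for k in (1, 5, 10)})
```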


🔗 Resources

We provide the following resources to support the development of models in this track:

| Resource          | Link                                      | Description                                   |
|-------------------|-------------------------------------------|-----------------------------------------------|
| GitHub Repository | https://github.com/robosense2025/track4   | Official baseline code and setup instructions |
| Dataset           | HuggingFace Dataset                       | Complete dataset with training and test splits |
| Baseline Model    | Pre-trained Checkpoint                    | GeoText-1652 baseline model weights           |
| Registration      | Google Form                               | Team registration for the challenge           |
| Evaluation Server | CodaLab                                   | Online evaluation platform                    |

📧 Contact

For questions, technical support, or clarifications about Track 4, please contact:


📖 References

@inproceedings{chu2024towards, 
  title={Towards Natural Language-Guided Drones: GeoText-1652 Benchmark with Spatial Relation Matching}, 
  author={Chu, Meng and Zheng, Zhedong and Ji, Wei and Wang, Tingyu and Chua, Tat-Seng}, 
  booktitle={ECCV}, 
  year={2024} 
}

@inproceedings{robosense2025track4,
  title     = {RoboSense Challenge 2025: Track 4 - Cross-Modal Drone Navigation},
  author    = {RoboSense Challenge Organizers},
  booktitle = {IROS 2025},
  year      = {2025},
  url       = {https://robosense2025.github.io/track4}
}