The RoboSense Challenge 2025

Track #4: Cross-Modal Drone Navigation

Towards Natural Language-Guided Cross-View Image Retrieval

🚀 Submit on CodaBench

👋 Welcome to Track #4: Cross-Modal Drone Navigation of the 2025 RoboSense Challenge!

This track focuses on developing robust models for natural language-guided cross-view image retrieval, specifically designed for scenarios where input data is captured from drastically different viewpoints such as aerial (drone or satellite) and ground-level perspectives.

Participants will tackle the challenge of building systems that can effectively retrieve the corresponding images from large-scale cross-view databases given natural language descriptions, even under common corruptions such as blur, occlusion, or sensor noise.

🏆 Prize Pool: $2,000 USD (1st: $1,000, 2nd: $600, 3rd: $400) + Innovation Awards


🎯 Objective

This track evaluates the capability of vision-language models to perform cross-modal drone navigation through natural language-guided image retrieval. The key challenges include:

  • Cross-View Matching: Retrieve corresponding images across drastically different viewpoints (aerial drone/satellite vs. ground-level)
  • Natural Language Understanding: Process and interpret natural language descriptions to guide image retrieval
  • Robustness: Maintain performance under real-world corruptions including blurriness, occlusions, and sensor noise
  • Multi-Platform Support: Handle imagery from multiple platforms including drone, satellite, and ground cameras

The ultimate goal is to develop models that can enable intuitive drone navigation through natural language commands while being robust to real-world sensing conditions.
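
To make the task concrete, below is a minimal sketch of text-to-image retrieval with a generic off-the-shelf CLIP model from Hugging Face `transformers`. This is not the official baseline (which is X-VLM-based; see the Baseline Model section); the model name and the `retrieve` helper are illustrative only.

```python
# A minimal sketch of natural language-guided image retrieval: embed the text
# query and all gallery images into a shared space, then rank by similarity.
# Uses a generic CLIP model, NOT the official GeoText-1652 baseline.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def retrieve(query: str, image_paths: list[str], top_k: int = 10) -> list[str]:
    """Rank gallery images by cosine similarity to a text query."""
    images = [Image.open(p).convert("RGB") for p in image_paths]
    with torch.no_grad():
        text_inputs = processor(text=[query], return_tensors="pt", padding=True)
        text_emb = model.get_text_features(**text_inputs)
        image_inputs = processor(images=images, return_tensors="pt")
        image_emb = model.get_image_features(**image_inputs)
    # L2-normalize, so the dot product equals cosine similarity.
    text_emb = text_emb / text_emb.norm(dim=-1, keepdim=True)
    image_emb = image_emb / image_emb.norm(dim=-1, keepdim=True)
    scores = (image_emb @ text_emb.T).squeeze(-1)
    ranked = scores.argsort(descending=True)[:top_k]
    return [image_paths[int(i)] for i in ranked]
```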


🗂️ Phases & Requirements

Phase #1: Public Evaluation

Duration: June 15th, 2025 (anytime on earth) - August 15th, 2025 (anytime on earth)

This phase involves evaluation on a publicly available test set with approximately 190 test cases (24GB GPU version). Participants can:

  • Download the official dataset and baseline model
  • Develop and test their approaches using the public test set
  • Submit results with reproducible code
  • Iterate and improve their models based on public leaderboard feedback
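
Since robustness is part of the evaluation criteria, participants may want to stress-test their models locally before submitting. Below is a hedged sketch of applying the corruption types named in the Objective (blur, occlusion, sensor noise) using `torchvision`; the exact corruptions and severities used in the official evaluation are not specified here, so these approximations are illustrative only.

```python
# Hedged sketch: approximate blur, occlusion, and sensor-noise corruptions for
# local robustness testing. These are NOT the official evaluation corruptions.
import torch
from torchvision import transforms

to_tensor = transforms.ToTensor()
to_pil = transforms.ToPILImage()

def add_gaussian_noise(img: torch.Tensor, std: float = 0.05) -> torch.Tensor:
    """Additive Gaussian noise as a stand-in for sensor noise."""
    return (img + torch.randn_like(img) * std).clamp(0.0, 1.0)

corruptions = {
    "blur": transforms.GaussianBlur(kernel_size=9, sigma=3.0),
    "occlusion": transforms.RandomErasing(p=1.0, scale=(0.1, 0.3)),
    "noise": add_gaussian_noise,
}

def corrupt(pil_image, name: str):
    """Apply one named corruption to a PIL image and return a PIL image."""
    img = to_tensor(pil_image)
    img = corruptions[name](img)
    return to_pil(img)
```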

Phase #2: Final Evaluation

Duration: August 15th, 2025 (anytime on earth) - September 22nd, 2025 (anytime on earth; extended from September 15th)

This phase involves final evaluation on a private test set with the same number of test cases (~190) as Phase I. Key requirements:

  • Submit final model and code for evaluation on the private test set
  • Provide technical report describing the approach and innovations
  • Include model weights and complete reproducible implementation
  • Final rankings and awards will be determined based on Phase II results

🗄️ Dataset

The track uses the RoboSense Track 4 Cross-Modal Drone Navigation Dataset, based on the GeoText-1652 benchmark. The dataset features:

  • Multi-platform imagery: drone, satellite, and ground cameras
  • Large scale: 100K+ images across 72 universities
  • Rich annotations: global descriptions, bounding boxes, and spatial relations
  • No overlap: Training (33 universities) and test (39 universities) are completely separate

| Platform  | Split | Images | Descriptions | Bbox-Texts | Classes | Universities |
|-----------|-------|--------|--------------|------------|---------|--------------|
| Drone     | Train | 37,854 | 113,562      | 113,367    | 701     | 33           |
| Drone     | Test  | 51,355 | 154,065      | 140,179    | 951     | 39           |
| Satellite | Train | 701    | 2,103        | 1,709      | 701     | 33           |
| Satellite | Test  | 951    | 2,853        | 2,006      | 951     | 39           |
| Ground    | Train | 11,663 | 34,989       | 14,761     | 701     | 33           |
| Ground    | Test  | 2,921  | 8,763        | 4,023      | 793     | 39           |
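
To get started, the dataset can be fetched from Hugging Face. The snippet below is a hedged sketch using the real `huggingface_hub.snapshot_download` API; the `repo_id` is a placeholder, so substitute the actual id linked in the Resources section.

```python
# Hedged sketch: download the dataset snapshot from Hugging Face.
# The repo_id below is a PLACEHOLDER; use the id from the official Resources link.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(
    repo_id="robosense2025/track4-dataset",  # placeholder, not the real id
    repo_type="dataset",
)
print("Dataset downloaded to:", local_dir)
```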

🛠️ Baseline Model

The baseline model is built upon the GeoText-1652 framework, which combines:

  • Vision-Language Architecture: Based on X-VLM for multi-modal understanding
  • Spatial Relation Matching: Advanced spatial reasoning for cross-view alignment
  • Multi-Platform Support: Handles drone, satellite, and ground imagery
  • Robust Training: Trained on diverse university campuses with rich annotations

The baseline provides a strong foundation for participants to build upon, incorporating state-of-the-art techniques in cross-modal retrieval and spatial understanding.
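
One common design in this model family (contrastive dual encoders plus a heavier cross-modal matching head) is retrieve-then-rerank: score the whole gallery cheaply, then rerank a shortlist with the joint matcher. The sketch below illustrates that general pattern only; `encode_text`, `encode_image`, and `match_score` are hypothetical callables standing in for the baseline's actual modules, not its real API.

```python
# Hedged sketch of a retrieve-then-rerank pipeline in the spirit of
# X-VLM-style retrieval. All three callables are hypothetical placeholders.
from typing import Callable, Sequence
import numpy as np

def retrieve_then_rerank(
    query: str,
    gallery: Sequence[str],
    encode_text: Callable[[str], np.ndarray],    # dual-encoder text tower
    encode_image: Callable[[str], np.ndarray],   # dual-encoder image tower
    match_score: Callable[[str, str], float],    # joint image-text matcher
    shortlist: int = 100,
    top_k: int = 10,
) -> list[str]:
    """Coarse cosine retrieval over the gallery, then rerank a shortlist."""
    q = encode_text(query)
    q = q / np.linalg.norm(q)
    embs = np.stack([encode_image(p) for p in gallery])
    embs = embs / np.linalg.norm(embs, axis=1, keepdims=True)
    order = np.argsort(-(embs @ q))[:shortlist]
    reranked = sorted(order, key=lambda i: match_score(query, gallery[i]),
                      reverse=True)
    return [gallery[i] for i in reranked[:top_k]]
```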


📊 Baseline Results

The baseline model achieves the following performance on the Phase I evaluation set (~190 test cases):

| Query Type  | R@1  | R@5  | R@10 |
|-------------|------|------|------|
| Text Query  | 29.9 | 46.3 | 54.1 |
| Image Query | 50.1 | 81.2 | 90.3 |

Note: These results represent the baseline performance on the 24GB GPU version used for Phase I evaluation. Participants are encouraged to develop novel approaches to surpass these benchmarks.
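
For reference, the Recall@K metric reported above can be computed as sketched below, assuming each query has exactly one ground-truth gallery item; the official evaluation script in the track repository is authoritative.

```python
# A minimal sketch of Recall@K for single-ground-truth retrieval.
import numpy as np

def recall_at_k(similarity: np.ndarray, gt_index: np.ndarray, k: int) -> float:
    """similarity: (num_queries, num_gallery) scores;
    gt_index: (num_queries,) index of the correct gallery item per query."""
    topk = np.argsort(-similarity, axis=1)[:, :k]
    hits = (topk == gt_index[:, None]).any(axis=1)
    return float(hits.mean())

# Example: 2 queries over a 4-item gallery.
sim = np.array([[0.9, 0.1, 0.3, 0.2],
                [0.2, 0.4, 0.8, 0.1]])
gt = np.array([0, 1])
print(recall_at_k(sim, gt, k=1))  # 0.5: query 0 hits; query 1's top-1 is item 2
print(recall_at_k(sim, gt, k=2))  # 1.0: query 1's ground truth is ranked 2nd
```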


🔗 Resources

We provide the following resources to support the development of models in this track:

| Resource                    | Link                              | Description                            |
|-----------------------------|-----------------------------------|----------------------------------------|
| GitHub Repository           | github.com/robosense2025/track4   | Baseline code and setup instructions   |
| Dataset                     | HuggingFace Dataset               | Dataset with training and test splits  |
| Baseline Model              | Pre-Trained Checkpoint            | GeoText-1652 baseline model weights    |
| Registration                | Google Form (Closed on August 15th) | Team registration for the challenge  |
| Evaluation Server (Phase 1) | CodaLab Platform 1                | Online evaluation platform             |
| Evaluation Server (Phase 2) | CodaLab Platform 2                | Online evaluation platform             |

❓ Frequently Asked Questions

Below, we provide a list of frequently asked questions (FAQs). If you have additional questions about the details of this competition, please reach out to the official email at robosense2025@gmail.com and/or contact Meng Chu (richardtotrueman@gmail.com), the organizer of Track 4.


Can I use external data?

Yes, but it must be publicly available and cited.

Are ensemble methods allowed?

Yes, you can combine multiple models.

What's the evaluation hardware?

Evaluation runs on NVIDIA V100 GPUs with 32GB memory.

How long does evaluation take?

Typically 10-20 minutes depending on submission size.


📖 References

@inproceedings{chu2024towards, 
  title     = {Towards Natural Language-Guided Drones: GeoText-1652 Benchmark with Spatial Relation Matching}, 
  author    = {Chu, Meng and Zheng, Zhedong and Ji, Wei and Wang, Tingyu and Chua, Tat-Seng}, 
  booktitle = {European Conference on Computer Vision}, 
  year      = {2024} 
}

@misc{robosense2025track4,
  title        = {RoboSense Challenge 2025: Track 4 - Cross-Modal Drone Navigation},
  author       = {RoboSense Challenge 2025 Organizers},
  year         = {2025},
  howpublished = {https://robosense2025.github.io/track4}
}