The RoboSense Challenge 2025

Track #4: Cross-Modal Drone Navigation

Natural Language-Guided Cross-View Image Retrieval

👋 Welcome to Track #4: Cross-Modal Drone Navigation of the 2025 RoboSense Challenge!

This track focuses on developing robust models for natural language-guided cross-view image retrieval, designed for scenarios where input data is captured from drastically different viewpoints, such as aerial (drone or satellite) and ground-level perspectives. Participants will build systems that retrieve the corresponding images from large-scale cross-view databases given natural language descriptions, even under common corruptions such as blur, occlusions, or sensor noise.

🏆 Prize Pool: $2,000 USD (1st: $1,000, 2nd: $600, 3rd: $400) + Innovation Awards


🎯 Objective

This track evaluates the capability of vision-language models to perform cross-modal drone navigation through natural language-guided image retrieval. The key challenges include:

  • Cross-View Matching: Retrieve corresponding images across drastically different viewpoints (aerial drone/satellite vs. ground-level)
  • Natural Language Understanding: Process and interpret natural language descriptions to guide image retrieval
  • Robustness: Maintain performance under real-world corruptions including blurriness, occlusions, and sensor noise
  • Multi-Platform Support: Handle imagery from multiple platforms including drone, satellite, and ground cameras

The ultimate goal is to develop models that can enable intuitive drone navigation through natural language commands while being robust to real-world sensing conditions.
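To make the robustness requirement above concrete, below is a minimal sketch of the kinds of corruptions mentioned (blur, occlusion, sensor noise) applied to an input image. The corruption types, parameters, and severity scaling here are illustrative assumptions, not the official test-time corruption pipeline.

```python
# Illustrative corruptions only; the official test-time corruptions may
# use different types, parameters, and severity levels.
import numpy as np
from PIL import Image, ImageFilter

def corrupt(image: Image.Image, kind: str, severity: int = 3) -> Image.Image:
    """Apply a simple corruption to a PIL image (sketch, not the official pipeline)."""
    if kind == "blur":
        # Gaussian blur whose radius grows with severity
        return image.filter(ImageFilter.GaussianBlur(radius=severity))
    if kind == "occlusion":
        # Black out a centered square patch whose side grows with severity
        arr = np.array(image)
        h, w = arr.shape[:2]
        side = int(min(h, w) * 0.1 * severity)
        top, left = (h - side) // 2, (w - side) // 2
        arr[top:top + side, left:left + side] = 0
        return Image.fromarray(arr)
    if kind == "noise":
        # Additive Gaussian "sensor" noise
        arr = np.array(image).astype(np.float32)
        arr = arr + np.random.normal(0.0, 5.0 * severity, arr.shape)
        return Image.fromarray(np.clip(arr, 0, 255).astype(np.uint8))
    raise ValueError(f"unknown corruption kind: {kind}")
```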


🗂️ Phases & Requirements

Phase I: Public Evaluation

This phase involves evaluation on a publicly available test set with approximately 190 test cases (24GB GPU version). Participants can:

  • Download the official dataset and baseline model
  • Develop and test their approaches using the public test set
  • Submit results with reproducible code
  • Iterate and improve their models based on public leaderboard feedback

Phase II: Final Evaluation

This phase involves final evaluation on a private test set with the same number of test cases (~190) as Phase I. Key requirements:

  • Submit final model and code for evaluation on the private test set
  • Provide technical report describing the approach and innovations
  • Include model weights and complete reproducible implementation
  • Final rankings and awards will be determined based on Phase II results

🗄️ Dataset

The track uses the RoboSense Track 4 Cross-Modal Drone Navigation Dataset, based on the GeoText-1652 benchmark. The dataset features:

  • Multi-platform imagery: drone, satellite, and ground cameras
  • Large scale: 100K+ images across 72 universities
  • Rich annotations: global descriptions, bounding boxes, and spatial relations
  • No overlap: Training (33 universities) and test (39 universities) are completely separate

| Platform  | Split | Images | Descriptions | Bbox-Texts | Classes | Universities |
|-----------|-------|--------|--------------|------------|---------|--------------|
| Drone     | Train | 37,854 | 113,562      | 113,367    | 701     | 33           |
| Drone     | Test  | 51,355 | 154,065      | 140,179    | 951     | 39           |
| Satellite | Train | 701    | 2,103        | 1,709      | 701     | 33           |
| Satellite | Test  | 951    | 2,853        | 2,006      | 951     | 39           |
| Ground    | Train | 11,663 | 34,989       | 14,761     | 701     | 33           |
| Ground    | Test  | 2,921  | 8,763        | 4,023      | 793     | 39           |
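
A minimal sketch of fetching the dataset from the Hugging Face Hub is shown below. The repository id is a placeholder, so take the actual id (and file layout) from the HuggingFace Dataset link in the Resources section.

```python
# Sketch only: replace the placeholder repo_id with the id from the
# official HuggingFace Dataset link in the Resources table.
from huggingface_hub import snapshot_download

local_dir = snapshot_download(
    repo_id="<track4-dataset-id>",  # placeholder, not the real id
    repo_type="dataset",
)
print("Dataset files downloaded to:", local_dir)
```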

🛠️ Baseline Model

The baseline model is built upon the GeoText-1652 framework, which combines:

  • Vision-Language Architecture: Based on X-VLM for multi-modal understanding
  • Spatial Relation Matching: Advanced spatial reasoning for cross-view alignment
  • Multi-Platform Support: Handles drone, satellite, and ground imagery
  • Robust Training: Trained on diverse university campuses with rich annotations

The baseline provides a strong foundation for participants to build upon, incorporating state-of-the-art techniques in cross-modal retrieval and spatial understanding.
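
As a rough illustration of how such a retrieval model is used at inference time, the sketch below ranks a gallery of image embeddings against a single text-query embedding by cosine similarity. How the embeddings are produced is left to the participant; this is not the actual GeoText-1652/X-VLM API.

```python
# Generic cross-modal retrieval by cosine similarity in a shared
# embedding space. How the embeddings are produced (X-VLM, your own
# encoders, ...) is up to the participant; this only shows the ranking.
import torch
import torch.nn.functional as F

@torch.no_grad()
def rank_gallery(text_emb: torch.Tensor, image_embs: torch.Tensor) -> torch.Tensor:
    """Return gallery indices sorted from best to worst match.

    text_emb:   (D,)   embedding of one text query
    image_embs: (N, D) embeddings of the N gallery images
    """
    text_emb = F.normalize(text_emb, dim=-1)
    image_embs = F.normalize(image_embs, dim=-1)
    scores = image_embs @ text_emb           # cosine similarities, shape (N,)
    return scores.argsort(descending=True)   # best-matching images first
```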


📊 Baseline Results

The baseline model achieves the following performance on the Phase I evaluation set (~190 test cases):

| Query Type  | R@1  | R@5  | R@10 |
|-------------|------|------|------|
| Text Query  | 29.9 | 46.3 | 54.1 |
| Image Query | 50.1 | 81.2 | 90.3 |

Note: These results represent the baseline performance on the 24GB GPU version used for Phase I evaluation. Participants are encouraged to develop novel approaches to surpass these benchmarks.
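
For reference, R@K (Recall@K) is the fraction of queries whose ground-truth gallery item appears among the top-K retrieved results. A minimal sketch of this computation follows; it assumes a single positive per query, and the official evaluation server script remains authoritative.

```python
# Recall@K sketch assuming one ground-truth gallery item per query;
# the official evaluation script may handle multiple positives.
import numpy as np

def recall_at_k(similarity: np.ndarray, gt_index: np.ndarray, k: int) -> float:
    """similarity: (num_queries, num_gallery) score matrix
    gt_index:   (num_queries,) index of the correct gallery item per query"""
    topk = np.argsort(-similarity, axis=1)[:, :k]     # top-k gallery ids per query
    hits = (topk == gt_index[:, None]).any(axis=1)    # ground truth in top-k?
    return float(hits.mean())

# Toy usage with random scores (illustration only)
sim = np.random.rand(190, 951)
gt = np.random.randint(0, 951, size=190)
print({f"R@{k}": round(recall_at_k(sim, gt, k), 3) for k in (1, 5, 10)})
```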


🔗 Resources

We provide the following resources to support the development of models in this track:

| Resource          | Link                                      | Description                                   |
|-------------------|-------------------------------------------|-----------------------------------------------|
| GitHub Repository | https://github.com/robosense2025/track4   | Official baseline code and setup instructions |
| Dataset           | HuggingFace Dataset                       | Complete dataset with training and test splits |
| Baseline Model    | Pre-trained Checkpoint                    | GeoText-1652 baseline model weights           |
| Registration      | Google Form                               | Team registration for the challenge           |
| Evaluation Server | CodaLab                                   | Online evaluation platform                    |

📧 Contact

For questions, technical support, or clarifications about Track 4, please contact:


📖 References

@inproceedings{chu2024towards, 
  title={Towards Natural Language-Guided Drones: GeoText-1652 Benchmark with Spatial Relation Matching}, 
  author={Chu, Meng and Zheng, Zhedong and Ji, Wei and Wang, Tingyu and Chua, Tat-Seng}, 
  booktitle={ECCV}, 
  year={2024} 
}

@inproceedings{robosense2025track4,
  title     = {RoboSense Challenge 2025: Track 4 - Cross-Modal Drone Navigation},
  author    = {RoboSense Challenge Organizers},
  booktitle = {IROS 2025},
  year      = {2025},
  url       = {https://robosense2025.github.io/track4}
}