The RoboSense Challenge 2025

Track #1: Driving with Language

Track 1 Image

👋 Welcome to Track #1: Driving with Language of the 2025 RoboSense Challenge!

In the era of autonomous driving, it is increasingly critical for intelligent agents to understand and act upon language-based instructions. Human drivers naturally interpret complex commands involving spatial, semantic, and temporal cues (e.g., "turn left after the red truck" or "stop at the next gas station on your right"). To enable such capabilities in autonomous systems, vision-language models (VLMs) must be able to perceive dynamic driving scenes, understand natural language commands, and make informed driving decisions accordingly.

🏆 Prize Pool: $2,000 USD (1st: $1,000, 2nd: $600, 3rd: $400) + Innovation Awards


🎯 Objective

This track evaluates the capability of VLMs to answer high-level driving questions in complex urban environments. Given a question covering perception, prediction, or planning, together with multi-view camera input, participants are expected to produce the correct answer, first on clean images and then on visually corrupted ones.
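As a rough illustration only (the field names and paths below are assumptions, not the official dataset schema), a single evaluation sample can be thought of as one question paired with the six surround-view nuScenes cameras:

```python
# Hypothetical sample layout -- field names are illustrative assumptions,
# not the official DriveBench / track annotation schema.
sample = {
    "frame_token": "scene-0001",                # hypothetical scene identifier
    "images": {                                  # six surround-view cameras (nuScenes naming)
        "CAM_FRONT": "samples/CAM_FRONT/example.jpg",
        "CAM_FRONT_LEFT": "samples/CAM_FRONT_LEFT/example.jpg",
        "CAM_FRONT_RIGHT": "samples/CAM_FRONT_RIGHT/example.jpg",
        "CAM_BACK": "samples/CAM_BACK/example.jpg",
        "CAM_BACK_LEFT": "samples/CAM_BACK_LEFT/example.jpg",
        "CAM_BACK_RIGHT": "samples/CAM_BACK_RIGHT/example.jpg",
    },
    "category": "perception",                    # perception / prediction / planning
    "question": "What are the important objects in the current scene?",
}
```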


🗂️ Phases & Requirements

Phase 1: Clean Image Evaluation

Duration: 15 June 2025 – 15 August 2025

In this phase, participants are expected to answer the high-level driving questions given clean images. Participants can:

  • Fine-tune the VLM on custom datasets (including nuScenes, DriveBench, etc.)
  • Develop and test their approaches
  • Submit results as a JSON file (a toy submission-writer sketch follows this list)
  • Iterate and improve their models based on public leaderboard feedback
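The exact submission schema is defined in the track's GitHub repository; as a minimal sketch, assuming a flat {sample_id: answer} layout, the JSON file could be produced as follows:

```python
import json

# Hypothetical submission writer -- the {sample_id: answer} layout is an
# assumption; follow the format specified in the official track repository.
predictions = {
    "sample_0001": "Slow down and yield to the pedestrian crossing ahead.",
    "sample_0002": "A red truck is in the front-left view, moving in the same direction.",
}

with open("submission.json", "w") as f:
    json.dump(predictions, f, indent=2)
```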

Ranking Metric: Weighted score combining:

  • Accuracy
  • Language-based score
  • LLM-based score
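As a rough sketch of how such a weighted combination could look (the weights and the assumption that each component is normalized to [0, 1] are illustrative placeholders, not the official formula):

```python
# Illustrative aggregation only: the official weights and metric definitions
# are set by the organizers; the values below are placeholders.
def weighted_score(accuracy: float, language_score: float, llm_score: float,
                   w_acc: float = 0.4, w_lang: float = 0.3, w_llm: float = 0.3) -> float:
    """Combine the three per-sample metrics, each assumed to lie in [0, 1]."""
    return w_acc * accuracy + w_lang * language_score + w_llm * llm_score

print(weighted_score(accuracy=0.72, language_score=0.55, llm_score=0.61))
```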


Phase 2: Corruption Image Evaluation

Duration: 15 August 2025 – 15 October 2025

In this phase, participants are expected to answer the high-level driving questions given corrupted images (a toy corruption sketch for local testing follows the list below). Participants can:

  • Fine-tune the VLM on custom datasets (including nuScenes, DriveBench, etc.)
  • Develop and test their approaches
  • Submit results as a JSON file
  • Iterate and improve their models based on public leaderboard feedback
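Before the Phase 2 server opens, it can be useful to probe robustness locally. The sketch below applies two stand-in corruptions (Gaussian noise and blur) with NumPy and Pillow; the corruption types, severities, and file paths are assumptions for illustration and do not reproduce the official Phase 2 corruption suite:

```python
import numpy as np
from PIL import Image, ImageFilter

# Stand-in corruptions for local robustness checks; not the official benchmark.
def gaussian_noise(img: Image.Image, sigma: float = 25.0) -> Image.Image:
    arr = np.asarray(img).astype(np.float32)
    noisy = arr + np.random.normal(0.0, sigma, size=arr.shape)
    return Image.fromarray(np.clip(noisy, 0, 255).astype(np.uint8))

def defocus_like_blur(img: Image.Image, radius: float = 3.0) -> Image.Image:
    return img.filter(ImageFilter.GaussianBlur(radius))

clean = Image.open("samples/CAM_FRONT/example.jpg").convert("RGB")  # placeholder path
for name, corrupted in [("noise", gaussian_noise(clean)), ("blur", defocus_like_blur(clean))]:
    corrupted.save(f"corrupted_{name}.jpg")
```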

Ranking Metric: Weighted score combining:

  • Accuracy
  • Language-based score
  • LLM-based score

🚗 Dataset Examples

(Dataset example images)


🛠️ Baseline Model

In this track, we adopt Qwen2.5-VL-7B as our baseline model (a minimal inference sketch follows the list below). Beyond the provided baseline, participants are encouraged to explore alternative strategies to further boost performance:

  • Retrieval-augmented reasoning
  • Multi-frame visual reasoning
  • Chain-of-thought reasoning
  • Graph-based reasoning
  • Vision-based object reference prompting
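As a minimal inference sketch, assuming the baseline corresponds to the public Qwen/Qwen2.5-VL-7B-Instruct checkpoint on Hugging Face and its standard transformers + qwen-vl-utils usage (image paths and the question are placeholders; the official baseline script in the track repository may differ):

```python
import torch
from transformers import Qwen2_5_VLForConditionalGeneration, AutoProcessor
from qwen_vl_utils import process_vision_info

# Load the (assumed) baseline checkpoint and its processor.
model = Qwen2_5_VLForConditionalGeneration.from_pretrained(
    "Qwen/Qwen2.5-VL-7B-Instruct", torch_dtype=torch.bfloat16, device_map="auto"
)
processor = AutoProcessor.from_pretrained("Qwen/Qwen2.5-VL-7B-Instruct")

# Six surround-view images (placeholder paths) plus one driving question.
views = ["CAM_FRONT", "CAM_FRONT_LEFT", "CAM_FRONT_RIGHT",
         "CAM_BACK", "CAM_BACK_LEFT", "CAM_BACK_RIGHT"]
messages = [{
    "role": "user",
    "content": [{"type": "image", "image": f"samples/{v}/example.jpg"} for v in views]
               + [{"type": "text", "text": "What should the ego vehicle do next, and why?"}],
}]

# Standard Qwen2.5-VL preprocessing: chat template + vision inputs.
text = processor.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
image_inputs, video_inputs = process_vision_info(messages)
inputs = processor(text=[text], images=image_inputs, videos=video_inputs,
                   padding=True, return_tensors="pt").to(model.device)

with torch.no_grad():
    generated = model.generate(**inputs, max_new_tokens=128)

# Strip the prompt tokens and decode only the newly generated answer.
answer = processor.batch_decode(
    [out[len(inp):] for inp, out in zip(inputs.input_ids, generated)],
    skip_special_tokens=True,
)[0]
print(answer)
```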

📊 Baseline Results

In this track, we use ...

Metric Metric 1 Metric 2 Metric 3 Metric 4
Baseline Model x.xx x.xx x.xx x.xx

🔗 Resources

We provide the following resources to support the development of models in this track:

  • GitHub: https://github.com/robosense2025/track1
  • Dataset: TBD
  • Registration: Google Form
  • Evaluation Server: TBD