RS-WorldModel: a Unified Model for Remote Sensing Understanding and Future Sense Forecasting
Abstract
Remote sensing world models aim to both explain observed changes and forecast plausible futures, two tasks that share spatiotemporal priors. Existing methods, however, typically address them separately, limiting cross-task transfer. We present RS-WorldModel, a unified world model for remote sensing that jointly handles spatiotemporal change understanding and text-guided future scene forecasting, and we build RSWBench-1.1M, a 1.1 million sample dataset with rich language annotations covering both tasks. RS-WorldModel is trained in three stages: (1) Geo-Aware Generative Pre-training (GAGP) conditions forecasting on geographic and acquisition metadata; (2) synergistic instruction tuning (SIT) jointly trains understanding and forecasting; (3) verifiable reinforcement optimization (VRO) refines outputs with verifiable, task-specific rewards. With only 2B parameters, RS-WorldModel surpasses open-source models up to 120 times larger on most spatiotemporal change question-answering metrics. It achieves an FID of 43.13 on text-guided future scene forecasting, outperforming all open-source baselines as well as the closed-source Gemini-2.5-Flash Image (Nano Banana).
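The forecasting result above is reported as FID (Fréchet Inception Distance), where lower is better. As a rough illustration of what that number measures, here is a minimal sketch of the Fréchet distance between two sets of image features. It assumes the features have already been extracted (in standard FID they come from an Inception-v3 network, which is omitted here). The function name `frechet_distance` and the random demo arrays are hypothetical, and this is not the authors' evaluation code.

```python
import numpy as np

def frechet_distance(feats_real, feats_gen):
    """Frechet distance between Gaussians fitted to two feature sets.

    Each input is a 2-D array with one row per image and one column per
    feature dimension. Feature extraction is assumed to happen elsewhere.
    """
    mu_r = feats_real.mean(axis=0)
    mu_g = feats_gen.mean(axis=0)
    cov_r = np.atleast_2d(np.cov(feats_real, rowvar=False))
    cov_g = np.atleast_2d(np.cov(feats_gen, rowvar=False))

    # Tr((cov_r cov_g)^(1/2)) equals the sum of the square roots of the
    # eigenvalues of cov_r @ cov_g. Those eigenvalues are real and
    # non-negative in exact arithmetic, so small numerical noise is clipped.
    eigvals = np.linalg.eigvals(cov_r @ cov_g)
    tr_sqrt = np.sqrt(np.clip(eigvals.real, 0.0, None)).sum()

    diff = mu_r - mu_g
    return float(diff @ diff + np.trace(cov_r) + np.trace(cov_g) - 2.0 * tr_sqrt)

# Hypothetical stand-in features: 200 samples, 4 dimensions.
rng = np.random.default_rng(0)
real = rng.normal(size=(200, 4))
shifted = real + 3.0  # same covariance, mean moved by 3 in every dimension

print(frechet_distance(real, real))
print(frechet_distance(real, shifted))
```

Identical feature sets should give a distance near zero. In the shifted case the covariances match, so only the squared mean difference contributes (4 dimensions × 3² = 36).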
Community
🌏 We are excited to launch RS-WorldModel, a remote sensing world model specifically designed for understanding and predicting remote sensing imagery.
One model, two core capabilities:
1️⃣ Understanding Changes
2️⃣ Forecasting the Future