Multimodal foundation models have transformed artificial intelligence by enabling scalable and transferable representations across diverse modalities, facilitating applications in vision-language understanding, text-to-image/video generation, and AI-driven assistants. However, their reliance on predominantly linguistic and 2D visual representations limits their ability to interact effectively with the physical world, where deep 3D spatial reasoning is crucial. Spatial intelligence, which encompasses perception, comprehension, and reasoning about spatial relationships and 3D structures, is essential for advancing AI models beyond static, task-specific functions toward embodied capabilities. Achieving robust spatial intelligence is critical for applications such as autonomous systems, robotics, augmented reality, and digital twins.
This workshop seeks to bring together researchers and practitioners from the multimedia and related communities to discuss Multimodal Foundation Models for Spatial Intelligence. Many open problems remain to be explored, spanning multimedia data and benchmarks, framework design, training techniques, and trustworthy algorithms. By uniting insights from researchers with diverse backgrounds, we aim to take a step toward reshaping the future of spatially-aware foundation models and paving the way for next-generation AI systems capable of perceiving, reasoning, and acting in complex 3D environments.