The 1st International Workshop on Multimodal Foundation Models for Spatial Intelligence

Overview

Multimodal foundation models have transformed artificial intelligence by enabling scalable and transferable representations across diverse modalities, facilitating applications in vision-language understanding, text-to-image/video generation, and AI-driven assistants. However, their reliance on predominantly linguistic and 2D visual representations limits their ability to interact effectively with the physical world, where deep 3D spatial reasoning is crucial. Spatial intelligence, which encompasses perception, comprehension, and reasoning about spatial relationships and 3D structures, is essential for advancing AI models beyond static, task-specific functions toward embodied capabilities. Achieving robust spatial intelligence is critical for applications such as autonomous systems, robotics, augmented reality, and digital twins.

This workshop seeks to bring together researchers and practitioners from multimedia and related communities to discuss Multimodal Foundation Models for Spatial Intelligence. Many open problems remain, spanning multimedia data and benchmarks, framework designs, training techniques, and trustworthy algorithms. By uniting insights from researchers with varied backgrounds, we aim to help reshape the future of spatially aware foundation models and pave the way for next-generation AI systems capable of perceiving, reasoning, and acting in complex 3D environments.

Call for Papers

We welcome three types of submissions, all of which should align with the topics of interest below:

• Position or Perspective Papers (up to 4 pages, excluding references): Original ideas, perspectives, research visions, and open challenges related to multimodal foundation models for spatial intelligence.

• Featured Papers (title, abstract, and the original paper): Previously published papers or summaries of existing publications from leading conferences and high-impact journals that are relevant to the workshop theme.

• Demonstration Papers (up to 2 pages, excluding references): Original or previously published prototypes and operational systems relevant to the workshop theme.

Topics of Interest

We invite submissions of original research contributions related to, but not limited to, the following topics:

1. Multimodal Spatial Understanding

• Multimodal Large Language Models with Spatial Awareness

• (3D) Vision-Language-Action Alignment

• 3D Scene Perception (Detection, Segmentation)

• 3D Semantic Occupancy Prediction

• Multimodal Spatial Reasoning (Images, Videos, Point Clouds, Text, Audio, etc.)

• 3D Spatial Grounding

• Multimodal Affordance Learning

2. 3D/3D-Aware Generative Models and World Models

• 3D-Aware Diffusion Models and Variational Autoencoders (VAEs)

• World Models and Their Applications in Embodied AI and Autonomous Vehicles

• 3D Generative Adversarial Networks (3D GANs)

• Multi-view Consistent 3D Generation

• Camera View and Motion Controllability in Generation

3. 3D Geometric Reconstruction

• 3D Reconstruction from Multimodal Inputs

• Neural Implicit Representations for 3D Reconstruction

• 3D Gaussian Splatting for High-fidelity Scene Reconstruction

• Integration of Geometric Priors in Deep Learning Models

• SLAM and Semantic SLAM

• Scalable 3D Reconstruction for Large-scale Datasets

4. Data and Benchmarks for Deep Spatial Analysis

• Benchmarks for Spatial Intelligence

• Large-scale Datasets for Multimodal Spatial Reasoning

• Standardized Evaluation Protocols for 3D Generative Models

• Novel Multimodal Annotation Techniques and Crowdsourced Datasets

• Multimodal Learning for World Simulation

5. Trustworthy Spatial Intelligence

• Ethical Considerations in 3D Generative Models

• Trustworthy and Robust 3D Foundation Models

• Fairness and Bias in 3D and Multimodal Datasets and Foundation Models

Schedule

08:00 am - 08:50 am

Poster setup

08:50 am - 09:00 am

Opening remarks

09:00 am - 09:30 am

Invited talk 1

09:30 am - 10:00 am

Invited talk 2

10:00 am - 11:00 am

Coffee break

11:00 am - 11:30 am

Invited talk 3

11:30 am - 12:00 pm

Invited talk 4

12:30 pm - 01:30 pm

Lunch break

01:30 pm - 02:00 pm

Invited talk 4

02:00 pm - 02:30 pm

Invited talk 5

02:30 pm - 03:00 pm

Contributed talks

03:00 pm - 04:00 pm

Coffee break

04:00 pm - 04:30 pm

Invited talk 6

04:30 pm - 05:30 pm

Panel

05:30 pm - 05:40 pm

Closing remarks
