My long-term vision follows a progressive pathway: first achieving 3D-consistent content generation, then developing comprehensive 3D understanding, and ultimately enabling intelligent embodied agents that can navigate and interact within these 3D environments.
Xiamen University
Sept 2023 - Present
Nanchang University
Sept 2019 - June 2023
Baidu Inc.
Aug 2025 - Present
Video Generation Research
Texas A&M University
May 2025 - Aug 2025
3D Vision & Embodied Intelligence
VITA Group, University of Texas at Austin
Jan 2024 - May 2025
3D Spatial Reconstruction & Understanding
arXiv 2025 | Jian Zhang*, Zhiwen Fan*, et al.
Unified VLM framework with 3D reconstructive instruction tuning that processes monocular video into implicit 3D tokens for spatial assistance and embodied reasoning.
Preprint | Kairun Wen*, Yuzhi Huang*, ..., Jian Zhang, et al.
Large-scale dataset with 100K+ videos, 800K+ masks, and 10M+ frames for understanding dynamic physical worlds with evolving 3D structure and motion.
NeurIPS 2024 | Jian Zhang*, Zhiwen Fan*, et al.
Directly processes unposed RGB images into semantic radiance fields, achieving real-time semantic 3D reconstruction through simultaneous estimation of geometry, appearance, and semantics.
arXiv 2024 | Zhiwen Fan*, Kairun Wen*, ..., Jian Zhang, et al.
Lightning-fast sparse-view 3D reconstruction via a self-supervised framework that jointly optimizes scene representation and camera poses through differentiable neural rendering.
Open to research collaborations and opportunities in 3D Vision & AI
3D-consistent video generation
3D spatial understanding