Jian Zhang

Focused on advancing 3D-consistent video generation and spatial understanding

About Me

Research Vision

My long-term vision follows a progressive pathway: first achieving 3D-consistent content generation, then developing comprehensive 3D understanding, and ultimately enabling intelligent embodied agents that can navigate and interact within these 3D environments.

Research Focus

3D-Consistent Content Generation
3D Understanding
3D Embodied Agents

Education

Graduate Student

Xiamen University

Sept 2023 - Present

B.S. in Artificial Intelligence

Nanchang University

Sept 2019 - June 2023

Experience

Research Intern

Baidu Inc.

Aug 2025 - Present

Video Generation Research

Research Assistant

Texas A&M University

May 2025 - Aug 2025

3D Vision & Embodied Intelligence

Research Assistant

VITA Group, University of Texas at Austin

Jan 2024 - May 2025

3D Spatial Reconstruction & Understanding

Publications & Preprints

VLM-3R: Vision-Language Models Augmented with 3D Reconstruction

ArXiv 2025 | Jian Zhang*, Zhiwen Fan*, et al.

A unified VLM framework that incorporates 3D Reconstructive instruction tuning and processes monocular video to derive implicit 3D tokens for spatial assistance and embodied reasoning.

DynamicVerse: Physically-Aware Multimodal Modeling for Dynamic 4D Worlds

Preprint | Kairun Wen*, Yuzhi Huang*, ..., Jian Zhang, et al.

Large-scale dataset with 100K+ videos, 800K+ masks, and 10M+ frames for understanding dynamic physical worlds with evolving 3D structure and motion.

Paper (Soon) | Code (Soon) | Project

Large Spatial Model: End-to-end Unposed Images to Semantic 3D

NeurIPS 2024 | Jian Zhang*, Zhiwen Fan*, et al.

Directly processes unposed RGB images into semantic radiance fields, achieving real-time semantic 3D reconstruction through simultaneous geometry, appearance, and semantics estimation.

InstantSplat: Sparse-view Gaussian Splatting in Seconds

ArXiv 2024 | Zhiwen Fan*, Kairun Wen*, ..., Jian Zhang, et al.

Lightning-fast sparse-view 3D reconstruction using a self-supervised framework that optimizes the scene representation and camera poses through differentiable neural rendering.

Let's Connect

Open to research collaborations and opportunities in 3D Vision & AI

Open for Opportunities

🎬

3D-consistent video generation

🔬

3D spatial understanding

🤝

Open to research collaborations
