Recent advances in multi-view scene reconstruction have been significant, yet existing methods face limitations when processing streams of input images. These methods either rely on time-consuming offline optimization or are restricted to shorter sequences, hindering their applicability in real-time scenarios. In this work, we propose LONG3R (LOng sequence streaming 3D Reconstruction), a novel model designed for streaming multi-view 3D scene reconstruction over longer sequences. Our model achieves real-time processing by operating recurrently, maintaining and updating its memory with each new observation. We first employ a memory gating mechanism to filter relevant memory, which, together with a new observation, is fed into a dual-source refined decoder for coarse-to-fine interaction. To effectively capture long-sequence memory, we propose a 3D spatio-temporal memory that dynamically prunes redundant spatial information while adaptively adjusting resolution across the scene. To enhance performance on long sequences while maintaining training efficiency, we employ a two-stage curriculum training strategy, with each stage targeting a specific capability. Experiments demonstrate that LONG3R outperforms state-of-the-art streaming methods, particularly on longer sequences, while maintaining real-time inference speed.
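The spatio-temporal memory described above can be sketched as a simple pruning step: keep one entry per voxel, with the voxel size growing with distance from the current camera so that far-away regions are stored more coarsely. This is a minimal NumPy sketch of the idea only; the function name, the distance-based growth rule, and all parameters are illustrative assumptions, not the paper's exact design.

```python
import numpy as np

def prune_memory(points, feats, cam_center, base_voxel=0.05, growth=0.1):
    """Prune redundant memory entries: keep one (point, feature) pair per
    voxel, where the voxel size grows with distance from the current camera
    so distant regions are stored at coarser resolution.
    Illustrative heuristic, not the paper's exact pruning rule."""
    dist = np.linalg.norm(points - cam_center, axis=1)
    voxel = base_voxel * (1.0 + growth * dist)            # per-point voxel size
    keys = np.floor(points / voxel[:, None]).astype(np.int64)
    _, keep = np.unique(keys, axis=0, return_index=True)  # first entry per voxel
    keep = np.sort(keep)
    return points[keep], feats[keep]
```

With this rule, two near-duplicate points close to the camera collapse into one memory entry, while a distant point survives in its own (larger) voxel.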
(a) Illustrates the overall architecture, where image features \( F^I_t \) first interact with \( F^I_{t-1} \) in the Coarse Decoder to generate \( F^c_t \), after which a memory-gating module filters irrelevant entries from the spatio-temporal memory \( F_{\text{mem}} \). The Dual-Source Refined Decoder subsequently interacts with both the filtered memory and features from \( t+1 \), ultimately generating the pointmap at time \( t \).
(b) Details the attention-based memory gating module, which selects relevant information from the memory.
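The selection step in (b) can be illustrated by scoring each memory token with the attention weight it receives from the current-frame queries and keeping only the most relevant fraction. A minimal NumPy sketch; the name `gate_memory`, the max-over-queries scoring, and the top-k keep rule are assumptions for illustration, not the actual trained module.

```python
import numpy as np

def gate_memory(query_feats, memory_feats, keep_ratio=0.5):
    """Score each memory token by the maximum softmax attention weight it
    receives from any current-frame query, then keep the top fraction.
    Illustrative sketch of attention-based gating (no learned projections)."""
    d = query_feats.shape[-1]
    logits = query_feats @ memory_feats.T / np.sqrt(d)    # (Q, M)
    attn = np.exp(logits - logits.max(axis=1, keepdims=True))
    attn /= attn.sum(axis=1, keepdims=True)               # softmax over memory
    scores = attn.max(axis=0)                             # relevance per token
    k = max(1, int(keep_ratio * len(memory_feats)))
    keep = np.sort(np.argsort(-scores)[:k])
    return memory_feats[keep], keep
```

A query that strongly matches one memory token causes that token to survive the gate while weakly attended tokens are filtered out.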
(c) Illustrates the dual-source refined decoder, which alternately attends to the next-frame features and the relevant memory features through multiple self- and cross-attention layers, making full use of the memory information while staying aligned with the subsequent frame.
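Structurally, the alternation in (c) can be sketched as interleaved self-attention and cross-attention, first over the next-frame features and then over the gated memory. Learned projections, normalization, and feed-forward blocks are omitted; this is a shape-level sketch under assumed names, not the trained decoder.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attend(q, kv, d):
    # Scaled dot-product attention with kv used as both keys and values.
    return softmax(q @ kv.T / np.sqrt(d)) @ kv

def dual_source_decode(x, next_feats, mem_feats, num_layers=2):
    """Alternate self-attention, cross-attention to next-frame features,
    and cross-attention to gated memory, with residual connections.
    Structural sketch only (no weights, norms, or MLPs)."""
    d = x.shape[-1]
    for _ in range(num_layers):
        x = x + attend(x, x, d)            # self-attention
        x = x + attend(x, next_feats, d)   # cross-attn: next-frame source
        x = x + attend(x, mem_feats, d)    # cross-attn: filtered memory source
    return x
```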
We evaluate 3D reconstruction performance on the 7-Scenes and NRGBD datasets.
We evaluate camera pose estimation on the 7-Scenes, TUM Dynamics, and ScanNet datasets.
We visualize the 3D reconstruction results of LONG3R and other state-of-the-art methods.
@article{long3r,
  title={LONG3R: Long Sequence Streaming 3D Reconstruction},
  author={Zhuoguang Chen and Minghui Qin and Tianyuan Yuan and Zhe Liu and Hang Zhao},
  journal={arXiv preprint arXiv:2507.18255},
  year={2025}
}