Recent advances in multi-view scene reconstruction have been significant, yet existing methods face limitations when processing streams of input images. These methods either rely on time-consuming offline optimization or are restricted to shorter sequences, hindering their applicability in real-time scenarios. In this work, we propose LONG3R (LOng sequence streaming 3D Reconstruction), a novel model designed for streaming multi-view 3D scene reconstruction over longer sequences. Our model achieves real-time processing by operating recurrently, maintaining and updating its memory with each new observation. We first employ a memory gating mechanism to filter relevant memory, which, together with a new observation, is fed into a dual-source refined decoder for coarse-to-fine interaction. To effectively capture long-sequence memory, we propose a 3D spatio-temporal memory that dynamically prunes redundant spatial information while adaptively adjusting resolution across the scene. To enhance performance on long sequences while maintaining training efficiency, we employ a two-stage curriculum training strategy, with each stage targeting a specific capability. Experiments demonstrate that LONG3R outperforms state-of-the-art streaming methods, particularly on longer sequences, while maintaining real-time inference speed.
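The spatio-temporal memory described above can be sketched as a simple pruning step: keep one entry per voxel, with the voxel size growing with distance from the current camera so that far-away regions are stored more coarsely. This is a minimal NumPy sketch of the idea only; the function name, the distance-based growth rule, and all parameters are illustrative assumptions, not the paper's exact design.

```python
import numpy as np

def prune_memory(points, feats, cam_center, base_voxel=0.05, growth=0.1):
    """Prune redundant memory entries: keep one (point, feature) pair per
    voxel, where the voxel size grows with distance from the current camera
    so distant regions are stored at coarser resolution.
    Illustrative heuristic, not the paper's exact pruning rule."""
    dist = np.linalg.norm(points - cam_center, axis=1)
    voxel = base_voxel * (1.0 + growth * dist)            # per-point voxel size
    keys = np.floor(points / voxel[:, None]).astype(np.int64)
    _, keep = np.unique(keys, axis=0, return_index=True)  # first entry per voxel
    keep = np.sort(keep)
    return points[keep], feats[keep]
```

With this rule, two near-duplicate points close to the camera collapse into one memory entry, while a distant point survives in its own (larger) voxel.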
(a) Illustrates the overall architecture, where image features \( F^I_t \) first interact with \( F^I_{t-1} \) in the Coarse Decoder to generate \( F^c_t \), after which a memory-gating module filters irrelevant entries from the spatio-temporal memory \( F_{\text{mem}} \). The Dual-Source Refined Decoder subsequently interacts with both the filtered memory and features from \( t+1 \), ultimately generating the pointmap at time \( t \).
(b) Details the attention-based memory gating module, which selects relevant information from the memory.
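The selection step in (b) can be illustrated by scoring each memory token with the attention weight it receives from the current-frame queries and keeping only the most relevant fraction. A minimal NumPy sketch; the name `gate_memory`, the max-over-queries scoring, and the top-k keep rule are assumptions for illustration, not the actual trained module.

```python
import numpy as np

def gate_memory(query_feats, memory_feats, keep_ratio=0.5):
    """Score each memory token by the maximum softmax attention weight it
    receives from any current-frame query, then keep the top fraction.
    Illustrative sketch of attention-based gating (no learned projections)."""
    d = query_feats.shape[-1]
    logits = query_feats @ memory_feats.T / np.sqrt(d)    # (Q, M)
    attn = np.exp(logits - logits.max(axis=1, keepdims=True))
    attn /= attn.sum(axis=1, keepdims=True)               # softmax over memory
    scores = attn.max(axis=0)                             # relevance per token
    k = max(1, int(keep_ratio * len(memory_feats)))
    keep = np.sort(np.argsort(-scores)[:k])
    return memory_feats[keep], keep
```

A query that strongly matches one memory token causes that token to survive the gate while weakly attended tokens are filtered out.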
(c) Illustrates the dual-source refined decoder, which alternately attends to the next-frame features and the relevant memory features through multiple self- and cross-attention layers, making full use of the memory information while staying aligned with the subsequent frame.
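Structurally, the alternation in (c) can be sketched as interleaved self-attention and cross-attention, first over the next-frame features and then over the gated memory. Learned projections, normalization, and feed-forward blocks are omitted; this is a shape-level sketch under assumed names, not the trained decoder.

```python
import numpy as np

def softmax(x, axis=-1):
    e = np.exp(x - x.max(axis=axis, keepdims=True))
    return e / e.sum(axis=axis, keepdims=True)

def attend(q, kv, d):
    # Scaled dot-product attention with kv used as both keys and values.
    return softmax(q @ kv.T / np.sqrt(d)) @ kv

def dual_source_decode(x, next_feats, mem_feats, num_layers=2):
    """Alternate self-attention, cross-attention to next-frame features,
    and cross-attention to gated memory, with residual connections.
    Structural sketch only (no weights, norms, or MLPs)."""
    d = x.shape[-1]
    for _ in range(num_layers):
        x = x + attend(x, x, d)            # self-attention
        x = x + attend(x, next_feats, d)   # cross-attn: next-frame source
        x = x + attend(x, mem_feats, d)    # cross-attn: filtered memory source
    return x
```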
We evaluate 3D reconstruction performance on the 7-Scenes and NRGBD datasets.
We evaluate camera pose estimation on the 7-Scenes, TUM Dynamics, and ScanNet datasets.
We visualize the 3D reconstruction results of LONG3R and other state-of-the-art methods.
@article{long3r,
  title={LONG3R: Long Sequence Streaming 3D Reconstruction},
  author={Zhuoguang Chen and Minghui Qin and Tianyuan Yuan and Zhe Liu and Hang Zhao},
  journal={arXiv preprint arXiv:2507.18255},
  year={2025}
}