FlashVGGT
Efficient and Scalable Visual Geometry Transformers with Compressed Descriptor Attention

The Hong Kong University of Science and Technology (HKUST)
CVPR 2026

ScanNet-Scene0002 (5192 images)

Abstract

TL;DR: Accelerate VGGT by spatially resampling keys and values for global attention.

3D reconstruction from multi-view images is a core challenge in computer vision. Recently, feed-forward methods have emerged as efficient and robust alternatives to traditional per-scene optimization techniques. Among them, state-of-the-art models like the Visual Geometry Grounding Transformer (VGGT) leverage full self-attention over all image tokens to capture global relationships. However, this approach suffers from poor scalability due to the quadratic complexity of self-attention and the large number of tokens generated in long image sequences. In this work, we introduce FlashVGGT, an efficient alternative that addresses this bottleneck through a descriptor-based attention mechanism. Instead of applying dense global attention across all tokens, FlashVGGT compresses spatial information from each frame into a compact set of descriptor tokens. Global attention is then computed as cross-attention between the full set of image tokens and this smaller descriptor set, significantly reducing computational overhead. Moreover, the compactness of the descriptors enables online inference over long sequences via a chunk-recursive mechanism that reuses cached descriptors from previous chunks. Experimental results show that FlashVGGT achieves reconstruction accuracy competitive with VGGT while reducing inference time to just 9.3% of VGGT for 1,000 images, and scaling to sequences exceeding 3,000 images.
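To make the scaling argument concrete, here is a rough cost comparison under assumed notation. None of these symbols appear in the original: S is the number of frames, P the number of image tokens per frame, K the number of descriptor tokens per frame (with K much smaller than P), and d the feature dimension.

```latex
\begin{aligned}
\text{full self-attention:} \quad & O\big((SP)^2\, d\big) \\
\text{descriptor cross-attention:} \quad & O\big((SP)(SK)\, d\big) = O\big(S^2 P K\, d\big) \\
\text{ratio:} \quad & \frac{(SP)(SK)\, d}{(SP)^2\, d} = \frac{K}{P}
\end{aligned}
```

This counts only the attention score and value products and ignores the cost of producing the descriptors. The reported 9.3% figure is an end-to-end inference time, so it need not equal K/P.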

Method

FlashVGGT framework diagram
Descriptor Tokens

Efficiently compress dense image tokens into a compact set of spatial descriptors via resampling.
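As a minimal sketch of what spatial resampling could look like, the snippet below average-pools one frame's token grid into a coarser grid of descriptors. Average pooling, the function name `compress_to_descriptors`, and the `pool` parameter are assumptions made for illustration; the resampling operator FlashVGGT actually uses may differ.

```python
import numpy as np

def compress_to_descriptors(tokens, grid_hw, pool):
    """Average-pool a frame's token grid into a coarser descriptor grid.

    tokens  : (H*W, C) array of image tokens for one frame, in row-major order.
    grid_hw : (H, W) shape of the token grid.
    pool    : hypothetical integer pooling factor; must divide both H and W.
    """
    h, w = grid_hw
    c = tokens.shape[1]
    assert h % pool == 0 and w % pool == 0, "pool must divide the grid size"
    # Split each spatial axis into (coarse cell, position inside the cell).
    grid = tokens.reshape(h // pool, pool, w // pool, pool, c)
    # Average over the positions inside each cell, then flatten the coarse grid.
    return grid.mean(axis=(1, 3)).reshape(-1, c)

rng = np.random.default_rng(0)
frame_tokens = rng.standard_normal((16 * 16, 64))   # 256 tokens, 64 channels
descriptors = compress_to_descriptors(frame_tokens, (16, 16), pool=4)
print(frame_tokens.shape, "->", descriptors.shape)  # (256, 64) -> (16, 64)
```

With `pool=4`, each frame contributes 16 descriptors instead of 256 tokens to the global key/value set.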

Cross-Attention

Replace quadratic self-attention with cross-attention between image queries and descriptor keys/values, whose cost is linear in the number of image tokens for a fixed number of descriptors.
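The following sketch illustrates the shape of this computation: every image token acts as a query, while keys and values come only from the smaller descriptor set. It is single-head and omits the learned query/key/value projections for brevity; the function names and the sizes `N`, `M`, `d` are illustrative assumptions, not values from the paper.

```python
import numpy as np

def softmax(x, axis=-1):
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def descriptor_cross_attention(image_tokens, descriptors):
    """Attend from all image tokens (queries) to the descriptors (keys/values).

    image_tokens : (N, d) array, all tokens from all frames.
    descriptors  : (M, d) array, compressed tokens with M << N.
    Learned projections are omitted, so this is a simplification.
    """
    d = image_tokens.shape[-1]
    scores = image_tokens @ descriptors.T / np.sqrt(d)  # (N, M), not (N, N)
    return softmax(scores) @ descriptors                # (N, d)

rng = np.random.default_rng(0)
N, M, d = 1024, 64, 32  # illustrative sizes
image_tokens = rng.standard_normal((N, d))
descriptors = rng.standard_normal((M, d))
out = descriptor_cross_attention(image_tokens, descriptors)
print(out.shape)  # (1024, 32)
```

The score matrix has N x M entries rather than N x N, which is where the saving comes from when M is much smaller than N.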

Chunk Inference

Cache and reuse descriptors across temporal chunks to scale seamlessly to sequences of thousands of frames.
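A chunk-recursive loop along these lines might look like the sketch below. It assumes that each chunk attends to its own descriptors plus everything cached from earlier chunks, and it uses a simple token-averaging stand-in for the descriptor compressor; both choices, along with the names `attend`, `describe`, and `chunk_recursive`, are hypothetical.

```python
import numpy as np

def attend(queries, keys_values):
    """Single-head softmax attention without learned projections."""
    scores = queries @ keys_values.T / np.sqrt(queries.shape[-1])
    scores -= scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)
    return weights @ keys_values

def describe(chunk, group):
    """Stand-in compressor: average every `group` consecutive tokens per frame."""
    s, p, d = chunk.shape
    return chunk.reshape(s, p // group, group, d).mean(axis=2).reshape(-1, d)

def chunk_recursive(frames, chunk_size, group):
    """Process frames chunk by chunk, reusing descriptors from earlier chunks."""
    cache, outputs = [], []
    for start in range(0, len(frames), chunk_size):
        chunk = frames[start:start + chunk_size]   # (s, P, d)
        cache.append(describe(chunk, group))       # only descriptors are kept
        kv = np.concatenate(cache, axis=0)         # past + current descriptors
        q = chunk.reshape(-1, chunk.shape[-1])     # all tokens in this chunk
        outputs.append(attend(q, kv).reshape(chunk.shape))
    return np.concatenate(outputs, axis=0), np.concatenate(cache, axis=0)

rng = np.random.default_rng(0)
frames = rng.standard_normal((12, 64, 16))  # 12 frames, 64 tokens each, dim 16
out, cached = chunk_recursive(frames, chunk_size=4, group=8)
print(out.shape, cached.shape)  # (12, 64, 16) (96, 16)
```

Only the compact descriptors persist between chunks, so memory grows with the descriptor count rather than with the full token count.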

Single-forward Reconstruction

Select a scene from the panel below to compare the results.

FlashVGGT (Ours)
💡 Tips

● Scroll to zoom in/out

● Drag to rotate

● Press "shift" and drag to pan

Chunk-recursive Reconstruction

Select a scene from the panel below to compare the results.

FlashVGGT (Ours)

BibTeX

@inproceedings{wang2026flashvggt,
  title={FlashVGGT: Efficient and Scalable Visual Geometry Transformers with Compressed Descriptor Attention},
  author={Wang, Zipeng and Xu, Dan},
  booktitle={Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR)},
  year={2026}
}