FlashVGGT
Efficient and Scalable Visual Geometry Transformers with Compressed Descriptor Attention

The Hong Kong University of Science and Technology (HKUST)
*Corresponding author
TL;DR: FlashVGGT accelerates VGGT with a more efficient global attention, achieving ~10× faster inference on 1K images and scaling to 3K+ images.

Abstract

3D reconstruction from multi-view images is a core challenge in computer vision. Recently, feed-forward methods have emerged as efficient and robust alternatives to traditional per-scene optimization techniques. Among them, state-of-the-art models like the Visual Geometry Grounded Transformer (VGGT) leverage full self-attention over all image tokens to capture global relationships. However, this approach scales poorly: self-attention is quadratic in the number of tokens, which grows large for long image sequences. In this work, we introduce FlashVGGT, an efficient alternative that addresses this bottleneck through a descriptor-based attention mechanism. Instead of applying dense global attention across all tokens, FlashVGGT compresses the spatial information of each frame into a compact set of descriptor tokens. Global attention is then computed as cross-attention between the full set of image tokens and this smaller descriptor set, significantly reducing computational overhead. Moreover, the compactness of the descriptors enables online inference over long sequences via a chunk-recursive mechanism that reuses cached descriptors from previous chunks. Experimental results show that FlashVGGT achieves reconstruction accuracy competitive with VGGT while reducing inference time to just 9.3% of VGGT's for 1,000 images, and scaling efficiently to sequences exceeding 3,000 images.

Motivation

Why compress global attention for long sequence reconstruction?

[Figures: VGGT global attention latency breakdown; attention sparsity comparison]

Global attention dominates inference time.

The global attention block is the primary computational bottleneck in VGGT, dominating total inference time.

Global self-attention is highly sparse.

Global attention is highly sparse, with most scores concentrated near zero, making full self-attention inefficient, while frame attention is more uniform.

Method Overview

FlashVGGT replaces dense global attention with a more efficient descriptor attention.

[Figure: FlashVGGT framework diagram]

Compressed Descriptor Tokens

Use spatial resampling to compress image tokens into a compact set of descriptor tokens.
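A minimal PyTorch sketch of this compression step, assuming each frame's tokens lie on an H×W patch grid and using average pooling as a stand-in for the resampling operator (the function and argument names are illustrative, not the released code):

```python
import torch
import torch.nn.functional as F

def compress_to_descriptors(tokens: torch.Tensor, H: int, W: int, r: int = 4) -> torch.Tensor:
    """Compress one frame's image tokens into a compact set of descriptor tokens.

    tokens: (B, N, C) patch tokens laid out on an H x W grid (N = H * W).
    Returns descriptor tokens of shape (B, N // r**2, C).
    """
    B, N, C = tokens.shape
    assert N == H * W, "tokens must form an H x W spatial grid"
    # Fold the token sequence back into a 2D feature map and resample with stride r.
    grid = tokens.transpose(1, 2).reshape(B, C, H, W)
    pooled = F.avg_pool2d(grid, kernel_size=r, stride=r)  # (B, C, H // r, W // r)
    # Flatten the resampled map back into a (much shorter) token sequence.
    return pooled.flatten(2).transpose(1, 2)
```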

Descriptor-based Cross-Attention

Use image tokens as queries and descriptor tokens as keys/values to reduce computation.
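A sketch of descriptor cross-attention under the same assumptions; the module below is illustrative and simply wires image tokens as queries and descriptors as keys/values through PyTorch's scaled_dot_product_attention:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class DescriptorCrossAttention(nn.Module):
    """Cross-attention where full image tokens attend to a compact descriptor set."""

    def __init__(self, dim: int, num_heads: int = 8):
        super().__init__()
        self.q_proj = nn.Linear(dim, dim)
        self.kv_proj = nn.Linear(dim, 2 * dim)
        self.out_proj = nn.Linear(dim, dim)
        self.num_heads = num_heads

    def forward(self, image_tokens: torch.Tensor, descriptors: torch.Tensor) -> torch.Tensor:
        # image_tokens: (B, N, C) queries; descriptors: (B, M, C) keys/values with M << N.
        B, N, C = image_tokens.shape
        M = descriptors.shape[1]
        h, d = self.num_heads, C // self.num_heads
        q = self.q_proj(image_tokens).view(B, N, h, d).transpose(1, 2)        # (B, h, N, d)
        k, v = self.kv_proj(descriptors).view(B, M, 2, h, d).permute(2, 0, 3, 1, 4)
        # Attention score matrix is N x M rather than N x N.
        out = F.scaled_dot_product_attention(q, k, v)                          # (B, h, N, d)
        return self.out_proj(out.transpose(1, 2).reshape(B, N, C))
```

Because the key/value set is r² times smaller than the full token set, the quadratic term in the attention cost shrinks by the same factor.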

Chunk-Recursive Inference

Descriptors are cached and reused across chunks, enabling efficient inference over long sequences.
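A sketch of how such chunk-recursive inference could be organized; `encode_chunk`, `compress`, `attend`, and `chunk_size` are placeholders for the frame encoder, descriptor compression, descriptor cross-attention, and an illustrative chunk length, not names from the paper:

```python
import torch

@torch.no_grad()
def chunk_recursive_inference(frames, encode_chunk, compress, attend, chunk_size=100):
    """Process a long sequence in chunks, reusing cached descriptors from earlier chunks."""
    descriptor_cache = []   # descriptors are compact, so the cache grows slowly
    outputs = []
    for start in range(0, len(frames), chunk_size):
        chunk = frames[start:start + chunk_size]
        tokens = encode_chunk(chunk)            # (B, N_chunk, C) image tokens
        descriptors = compress(tokens)          # (B, N_chunk // r**2, C) descriptor tokens
        descriptor_cache.append(descriptors)
        # The current chunk attends to descriptors from all chunks seen so far.
        context = torch.cat(descriptor_cache, dim=1)
        outputs.append(attend(tokens, context))
    return outputs
```

Since only the compressed descriptors are carried across chunks, memory and attention cost grow far more slowly with sequence length than caching full image tokens would, which is what makes online inference over thousands of frames feasible.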

Attention complexity drops from O(N²) to O(N² / r²), where r ≈ 4 is the spatial downsampling ratio, delivering ~10× faster inference empirically for 1K images.
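As a rough sanity check on this claim (the token counts below are illustrative, not the model's actual configuration):

```python
# Back-of-envelope attention cost for S frames of T tokens each (N = S * T).
S, T, r = 1000, 1024, 4             # illustrative: ~1K frames, 1,024 patch tokens per frame
N = S * T
dense_pairs = N * N                  # full self-attention: every token attends to every token
descriptor_pairs = N * (N // r**2)   # cross-attention: N queries over N / r^2 descriptor keys
print(dense_pairs / descriptor_pairs)  # 16.0 -- an upper bound; non-attention compute keeps the measured speedup at ~10x
```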

Performance Comparison

FlashVGGT maintains accuracy while delivering large speedups across datasets.

[Figure: performance comparison across scenes]

Single-forward Reconstruction

VGGT (67.43s) vs. FlashVGGT (10.04s)
FastVGGT (21.67s) vs. FlashVGGT (10.04s)

Online Reconstruction

CUT3R (104.67s) vs. FlashVGGT (47.58s)
TTT3R (106.45s) vs. FlashVGGT (47.58s)

BibTeX


  @article{wang2025flashvggt,
    title={FlashVGGT: Efficient and Scalable Visual Geometry Transformers with Compressed Descriptor Attention},
    author={Zipeng Wang and Dan Xu},
    journal={arXiv preprint arXiv:xxxx},
    year={2025}
  }