Bridging the Domain Gap: Self-Supervised 3D Scene Understanding with Foundation Models

May 1, 2023

Zhimin Chen

Longlong Jing

Yingwei Li

Bing Li

Abstract

Foundation models have made significant strides in 2D and language tasks such as image segmentation, object detection, and visual-language understanding. Nevertheless, their potential to enhance 3D scene representation learning remains largely untapped due to the domain gap. In this paper, we propose Bridge3D, a methodology that addresses this gap by pre-training 3D models using features, semantic masks, and captions sourced from foundation models. Specifically, our approach uses semantic masks from these models to guide the masking and reconstruction process in the masked autoencoder. This strategy enables the network to concentrate more on foreground objects, thereby enhancing 3D representation learning. Additionally, we bridge the 3D-text gap at the scene level by harnessing image captioning foundation models. To further facilitate knowledge distillation from well-learned 2D and text representations to the 3D model, we introduce a novel method that employs foundation models to generate highly accurate object-level masks and object-level semantic text information. Our approach outperforms state-of-the-art methods in 3D object detection and semantic segmentation tasks. For instance, on the ScanNet dataset, our method surpasses the previous state-of-the-art method, PiMAE, by a significant margin of 5.3%.
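As a rough illustration of the mask-guided masking idea described in the abstract, the sketch below samples which patches a masked autoencoder hides, giving foreground patches a higher chance of being masked. It is only one possible reading of the abstract, not the Bridge3D implementation: the function name `foreground_guided_mask`, the parameters `mask_ratio` and `fg_weight`, and the weighted-sampling scheme are all assumptions made for this example, and the per-patch foreground flags are assumed to come from a semantic mask produced by a foundation model.

```python
import numpy as np


def foreground_guided_mask(is_foreground, mask_ratio=0.6, fg_weight=3.0, seed=0):
    """Pick patches to mask, favouring foreground patches.

    Hypothetical helper for illustration only; the sampling scheme and the
    parameter names are assumptions, not taken from the paper.
    """
    is_foreground = np.asarray(is_foreground, dtype=bool)
    n = is_foreground.shape[0]
    num_masked = int(round(mask_ratio * n))

    # Foreground patches get a larger sampling weight than background ones.
    weights = np.where(is_foreground, fg_weight, 1.0)
    probs = weights / weights.sum()

    # Draw distinct patch indices according to those weights.
    rng = np.random.default_rng(seed)
    masked_idx = rng.choice(n, size=num_masked, replace=False, p=probs)

    mask = np.zeros(n, dtype=bool)
    mask[masked_idx] = True
    return mask


# Example: 10 patches, the first 4 flagged as foreground.
flags = [True] * 4 + [False] * 6
mask = foreground_guided_mask(flags, mask_ratio=0.5)
```

Here `mask` is a boolean array with exactly half of the patches marked for masking; raising `fg_weight` shifts more of the masked patches onto foreground objects, so the reconstruction loss is spent there.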

Publication

Advances in Neural Information Processing Systems (NeurIPS)

Last updated on Oct 13, 2024

Authors

Zhimin Chen

Ph.D. Student

I primarily study computer vision and deep learning, with a particular emphasis on image quality assessment, self-supervised learning, semi-supervised learning, multi-modality learning, and foundation models.