Self-Supervised Modality-Invariant and Modality-Specific Feature Learning for 3D Objects

Abstract

While most existing self-supervised 3D feature learning methods mainly focus on point cloud data, this paper explores the inherent multimodal attributes of 3D objects. We propose to jointly learn effective features from different modalities including image, point cloud, and mesh with heterogeneous networks from unlabeled 3D data. Our proposed novel self-supervised model learns two types of distinct features. modality-invariant features and modality-specific features. The modality-invariant features capture high-level semantic information across different modalities with minimum modality discrepancy, while the modality-specific features capture specific characteristics preserved in each modality. These two types of features provide a more comprehensive representation of 3D data. The quality of the learned features is evaluated on different downstream tasks including 3D object recognition, 3D within-modal retrieval, and 3D cross-modal retrieval tasks with three data modalities including image, point cloud, and mesh. Our proposed method significantly outperforms the state-of-the-art self-supervised methods for all three tasks and even achieves comparable performance with the state-of-the-art supervised methods on the ModelNet10 and ModelNet40 datasets.

Zhimin Chen
Zhimin Chen
Ph.D. Student

I am primarily focused on studying computer vision and deep learning, with a particular emphasis on image quality assessment, self-supervised learning, semi-supervised learning, multi-modality learning, and foundational models.