PointContrast: Unsupervised Pre-training for 3D Point Cloud Understanding

Abstract

We present PointContrast, a framework for unsupervised pre-training of 3D point cloud representations. We show that contrastive learning can be effectively applied to 3D point clouds by leveraging correspondences between different views of the same scene. Points that correspond to the same 3D location across views serve as positive pairs, while non-corresponding points serve as negative pairs. Pre-training with PointContrast significantly improves downstream 3D tasks including detection and segmentation on ScanNet and S3DIS.

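Paragraph function: Overview of the whole paper; defines a contrastive pre-training framework for 3D point clouds.

Logical role: Establishes the full argument chain: multi-view correspondence → contrastive learning → improved downstream tasks.

Argumentative technique / potential weakness: Successfully transferring 2D contrastive learning to 3D is an important proof of concept.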

1. Introduction

Self-supervised learning has achieved remarkable success in 2D vision and NLP, but its application to 3D point clouds remains largely unexplored. Unlike images, point clouds are unordered, sparse, and vary in density, making it unclear how to define effective pretext tasks. We propose to exploit the natural multi-view structure of 3D scenes: the same physical point observed from different viewpoints provides free supervision for contrastive learning.

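Paragraph function: Establishes motivation by pointing out the gap in 3D self-supervised learning and the natural supervision signal offered by multiple views.

Logical role: The transition from success in 2D to the gap in 3D is natural and reasonable, and it establishes a clear research gap.

Argumentative technique / potential weakness: The insight that multi-view correspondence serves as a free supervision signal is elegant and is the core innovation of the paper.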

The challenge of applying contrastive learning to 3D data lies in defining meaningful positive and negative pairs. In 2D, augmentations of the same image form positive pairs. For 3D point clouds, we leverage geometric correspondences from multi-view registration: given two partial scans of the same scene, points in the overlapping region that map to the same 3D coordinate are natural positive pairs. This formulation is geometrically grounded and does not require synthetic augmentations.

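Paragraph function: Technical motivation; carries the definition of positive and negative pairs over from 2D to 3D.

Logical role: Defining positive pairs through geometric correspondence is more natural and more robust than relying on data augmentation.

Argumentative technique / potential weakness: The approach requires multi-view registered data, which limits the sources of pre-training data.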
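
The positive-pair definition above (points from two registered partial scans that map to the same 3D coordinate) can be illustrated with a small sketch. It is not the authors' implementation: the function name `find_positive_pairs`, the distance threshold `max_dist`, and the brute-force nearest-neighbour search are all assumptions made for illustration, and the two scans are assumed to be expressed in the same world frame already.

```python
import math

def find_positive_pairs(view_a, view_b, max_dist=0.05):
    """Return (i, j) index pairs whose points lie within max_dist of each other.

    view_a, view_b: lists of (x, y, z) tuples in a shared world frame,
    i.e. the two scans are assumed to be registered beforehand.
    max_dist is a hypothetical matching threshold, not a value from the paper.
    """
    pairs = []
    for i, p in enumerate(view_a):
        # Brute-force nearest neighbour; a KD-tree would replace this at scale.
        best_j, best_d = None, float("inf")
        for j, q in enumerate(view_b):
            d = math.dist(p, q)
            if d < best_d:
                best_j, best_d = j, d
        # Keep the match only if the nearest point is close enough to count
        # as the same physical location.
        if best_j is not None and best_d <= max_dist:
            pairs.append((i, best_j))
    return pairs

# Toy scans: two points overlap, one point in each scan has no counterpart.
scan_a = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0)]
scan_b = [(1.01, 0.0, 0.0), (2.0, 0.02, 0.0), (5.0, 5.0, 5.0)]
pairs = find_positive_pairs(scan_a, scan_b)
print(pairs)
```

Points outside the overlap region simply produce no pair, which is why only the overlapping part of the two scans contributes positives.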

2. Method

Given a 3D scene, we take two partially overlapping point cloud views and apply a random spatial transformation to each. Points in the overlap region that correspond to the same 3D location form positive pairs. We extract per-point features using a sparse convolutional network (e.g., SparseConvNet) and apply a PointInfoNCE contrastive loss that pulls positive pairs together and pushes negative pairs apart in feature space. The pre-trained backbone is then fine-tuned on downstream tasks.

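Paragraph function: Core method; describes view generation, contrastive learning, and downstream fine-tuning.

Logical role: The concise three-step pipeline makes the method easy to understand and reproduce.

Argumentative technique / potential weakness: Sparse convolutional networks are a natural choice for 3D processing, and the suitability of contrastive losses has been well validated in 2D.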
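
The method description mentions random spatial transformations of the two views. The sketch below assumes these are rigid (a random rotation about the z-axis plus a random translation), which the text does not specify; `random_rigid_transform` and the toy coordinates are hypothetical. It shows that index-based correspondences computed before the transform stay valid afterwards, because the transform changes each view's pose but not its point order or internal geometry.

```python
import math
import random

def random_rigid_transform(points, rng):
    """Rotate about the z-axis by a random angle, then translate randomly."""
    theta = rng.uniform(0.0, 2.0 * math.pi)
    c, s = math.cos(theta), math.sin(theta)
    tx, ty, tz = (rng.uniform(-1.0, 1.0) for _ in range(3))
    return [(c * x - s * y + tx, s * x + c * y + ty, z + tz) for x, y, z in points]

rng = random.Random(0)  # fixed seed so the example is repeatable

# Two toy views in a shared frame; view_a[1] == view_b[0] and view_a[2] == view_b[1].
view_a = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 2.0, 0.0)]
view_b = [(1.0, 0.0, 0.0), (0.0, 2.0, 0.0), (3.0, 3.0, 0.0)]
positive_pairs = [(1, 0), (2, 1)]  # matched before any transform is applied

# Each view gets its own independent transform.
aug_a = random_rigid_transform(view_a, rng)
aug_b = random_rigid_transform(view_b, rng)

# Distances inside a view are preserved, so local geometry is unchanged,
# while the absolute pose of each view now differs.
print(math.dist(view_a[0], view_a[1]), math.dist(aug_a[0], aug_a[1]))
```

Under this assumption the network has to produce matching features for a pair even though the two views no longer share a coordinate frame.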

2.1 Contrastive Loss

For each positive pair (i, j), the PointInfoNCE loss maximizes the similarity of the pair's features relative to the similarities with all negative pairs. We use hardest-contrastive sampling to focus on the most informative negatives. We also study the effect of pre-training data scale and find that performance improves log-linearly with the amount of pre-training data, suggesting that more unlabeled 3D data can further improve results.

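Paragraph function: Loss function and scalability; describes the formulation of the contrastive loss and the effect of data scale.

Logical role: The log-linear relationship with data scale is a positive scalability signal for large-scale pre-training.

Argumentative technique / potential weakness: Hardest-negative sampling makes training harder but improves the quality of the learned representations.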
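
To make the loss description concrete, here is a dependency-free sketch of an InfoNCE-style loss over matched point features. It assumes a common formulation: features are L2-normalized, similarity is a dot product scaled by a temperature, and for the i-th pair the other matched points in the second view act as negatives. The function name `point_info_nce`, the temperature value, and the normalization step are assumptions for illustration, and hardest-contrastive sampling is not implemented here.

```python
import math

def point_info_nce(feats_a, feats_b, temperature=0.07):
    """Mean InfoNCE loss over N matched pairs.

    feats_a[i] and feats_b[i] are the features of the i-th positive pair;
    feats_b[k] with k != i serve as negatives for feats_a[i].
    The temperature default is an assumed value, not taken from the paper.
    """
    def normalize(v):
        norm = math.sqrt(sum(x * x for x in v))
        return [x / norm for x in v]

    a = [normalize(v) for v in feats_a]
    b = [normalize(v) for v in feats_b]
    total = 0.0
    for i, fa in enumerate(a):
        # Scaled cosine similarity between point i in view A and every point in view B.
        logits = [sum(x * y for x, y in zip(fa, fb)) / temperature for fb in b]
        m = max(logits)  # subtract the max for numerical stability
        log_denom = m + math.log(sum(math.exp(l - m) for l in logits))
        total += log_denom - logits[i]  # negative log-softmax at the positive index
    return total / len(a)

# Positives aligned with their partners give a low loss; swapping the
# partners makes every "positive" look like a negative and raises it.
aligned = point_info_nce([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
shuffled = point_info_nce([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
print(aligned < shuffled)
```

In a real training loop the same computation would typically be a cross-entropy over a similarity matrix in a tensor library, with the diagonal entries as targets.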

3. Experiments

On ScanNet 3D detection, PointContrast pre-training improves VoteNet by +3.4 mAP@0.5. On ScanNet semantic segmentation, it improves by +2.7 mIoU. On S3DIS segmentation, improvement is +2.2 mIoU. These gains are consistent across different architectures (SparseConvNet, MinkowskiNet), demonstrating the generality of the pre-trained representations. PointContrast also outperforms other 3D pre-training methods including autoencoder-based approaches.

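Paragraph function: Quantitative evaluation; consistent improvements across tasks and architectures.

Logical role: Consistent gains across multiple tasks and architectures establish the robustness of the method.

Argumentative technique / potential weakness: The +3.4 mAP detection improvement is especially notable and shows that the value of pre-training extends beyond segmentation.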

We further analyze the data efficiency of PointContrast pre-training. When fine-tuning with only 20% of labeled data, the pre-trained model achieves performance comparable to training from scratch with 100% labeled data. This demonstrates that PointContrast learns highly transferable representations that significantly reduce the annotation requirement for 3D understanding tasks.

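Paragraph function: Data-efficiency analysis; pre-training substantially reduces the annotation requirement.

Logical role: Matching the 100% result with only 20% of the labels is strong evidence of the practical value of pre-training.

Argumentative technique / potential weakness: Data efficiency is one of the most practically meaningful advantages of self-supervised pre-training.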

4. Conclusion

We have demonstrated that contrastive pre-training is highly effective for 3D point cloud understanding. PointContrast leverages the natural multi-view structure of 3D data to learn transferable representations. Our results show that the paradigm of self-supervised pre-training followed by supervised fine-tuning, successful in 2D and NLP, translates effectively to the 3D domain.

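Paragraph function: Summary; confirms the effectiveness of contrastive pre-training in the 3D domain.

Logical role: Elevating the conclusion to the level of a paradigm transfer strengthens the academic significance of the work.

Argumentative technique / potential weakness: The large body of subsequent 3D self-supervised work has validated this direction.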

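Argument Structure Overview

Problem: a gap in 3D self-supervised learning
→ Claim: multiple views provide free supervision
→ Method: point-level contrast + InfoNCE
→ Evidence: consistent improvements across tasks
→ Conclusion: the paradigm transfers successfully to 3D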

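Core claim

Contrastive pre-training that exploits the natural multi-view correspondences of 3D scenes learns high-quality 3D point cloud representations and significantly improves downstream detection and segmentation.

Strongest point

Consistent improvements across three benchmarks and two architectures, together with better results than the other 3D pre-training methods it is compared against, demonstrate the general effectiveness of contrastive learning in the 3D domain.

Weakest point

Validation is limited to indoor scenes; applicability to large-scale outdoor point clouds (e.g., autonomous driving) is not explored, and pre-training requires 3D data with multi-view reconstruction.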