Partial Distance Correlation in Deep Learning

Abstract

Comparing the functional behavior of neural network models, whether it is a single network over time or two (or more) networks during or post-training, is an essential step in understanding what they are learning (and what they are not), and for identifying strategies for regularization or efficiency improvements. Motivated by the need for a statistically grounded, robust, and scalable comparison tool, we propose using Partial Distance Correlation (PDC) as a general-purpose tool for comparing representations in deep learning. We describe how it can be used to compare learned representations across networks, identify redundant layers, and guide network compression. We also show that PDC provides a principled approach to disentangling shared and unique information between representations, and demonstrate its effectiveness on a variety of tasks.

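Paragraph function: Overview of the full paper. Summarizes the motivation, uses, and core contributions of the PDC tool.

Logical role: The abstract previews the full argument as "need (a comparison tool) → solution (PDC) → applications (compression, disentanglement, and so on)".

Argumentative technique / potential weakness: The authors use the rhetoric of "statistically grounded" to stress PDC's theoretical advantage over heuristic methods such as CKA, but PDC is in fact more expensive to compute, and scaling it to large models requires additional approximation strategies.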

1. Introduction

Understanding the internal representations learned by deep neural networks is a fundamental challenge in modern machine learning. A variety of tools have been proposed for this purpose, including Centered Kernel Alignment (CKA), Canonical Correlation Analysis (CCA), and projection-weighted CCA. While these methods have proven useful, they each have limitations: limited statistical interpretability, sensitivity to sample size, or an inability to account for confounding factors. In this work, we argue that Partial Distance Correlation provides a unified, statistically principled framework that addresses many of these limitations.

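Paragraph function: Establishes the problem. Reviews existing representation-comparison tools and points out their shortcomings.

Logical role: Starting point of the argument chain. Listing the shortcomings of CKA, CCA, and related methods justifies introducing PDC.

Argumentative technique / potential weakness: The phrase "they each have limitations" sums up the shortcomings of existing methods without discussing where those methods excel in specific settings, which may leave readers with an oversimplified impression.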

The key idea behind distance correlation is that it measures both linear and nonlinear dependencies between random vectors, unlike classical (Pearson) correlation, which captures only linear associations. Partial distance correlation extends this by controlling for the influence of a third variable, enabling more precise comparisons of representations. This property is particularly valuable when comparing layers across different architectures, where batch normalization, residual connections, or other structural elements may introduce confounding correlations.

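Paragraph function: Introduces the core concepts. Explains the basic principles of distance correlation and partial distance correlation.

Logical role: Transition from problem to solution. It first establishes the theoretical advantage of distance correlation, then highlights what is distinctive about PDC: controlling for confounding variables.

Argumentative technique / potential weakness: Using batch normalization and residual connections as concrete examples of confounders makes the problem tangible. However, the authors assume these structural elements necessarily act as confounders; whether that premise holds depends on the purpose of the analysis.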

2. Background

Distance correlation, introduced by Székely, Rizzo, and Bakirov (2007), is a measure of dependence between random vectors of arbitrary, not necessarily equal, dimensions. Given random vectors X and Y, the sample distance covariance is computed by taking the Euclidean distances between all pairs of observations, double centering the resulting distance matrices, and computing their inner product. A key property is that the population distance correlation equals zero if and only if X and Y are statistically independent (assuming finite first moments), a guarantee that classical Pearson correlation cannot provide. The partial distance correlation further controls for a third variable Z, removing its confounding influence through a projection operation in the space of centered distance matrices.

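Paragraph function: Builds the theoretical foundation. Introduces the mathematical definition and core properties of distance correlation.

Logical role: Provides the mathematical grounding for the later applications. Establishing the core result that zero distance correlation is equivalent to independence helps readers see PDC's theoretical advantage over heuristic methods.

Argumentative technique / potential weakness: Citing the original work of Székely et al. lends authority. But building the double-centered distance matrices costs O(n^2), and the efficiency problem on large datasets is set aside for now.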
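The three steps described in this section (pairwise Euclidean distances, double centering, inner product) can be sketched in NumPy as follows. This is a minimal illustration of the standard biased (V-statistic) sample estimator, not the authors' code; the function names and the toy data (y = x^2 with x uniform on [-1, 1]) are assumptions chosen to show a dependence that Pearson correlation largely misses.

```python
import numpy as np


def pairwise_distances(x):
    """Euclidean distance matrix between the rows of x (shape: n x d)."""
    x = np.asarray(x, dtype=float)
    if x.ndim == 1:
        x = x[:, None]  # treat a 1-D array as n observations of one feature
    diff = x[:, None, :] - x[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))


def double_center(d):
    """Subtract row and column means, then add back the grand mean."""
    return (d
            - d.mean(axis=0, keepdims=True)
            - d.mean(axis=1, keepdims=True)
            + d.mean())


def distance_correlation(x, y):
    """Sample distance correlation (biased V-statistic form), in [0, 1]."""
    a = double_center(pairwise_distances(x))
    b = double_center(pairwise_distances(y))
    dcov2_xy = (a * b).mean()  # squared sample distance covariance
    dcov2_xx = (a * a).mean()  # squared sample distance variances
    dcov2_yy = (b * b).mean()
    denom = np.sqrt(dcov2_xx * dcov2_yy)
    if denom == 0:
        return 0.0
    return float(np.sqrt(max(dcov2_xy, 0.0) / denom))


# Toy data (illustrative assumption): a purely nonlinear dependence.
rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=500)
y = x ** 2
pearson = float(np.corrcoef(x, y)[0, 1])
dcor = distance_correlation(x, y)
print(f"Pearson: {pearson:.3f}, distance correlation: {dcor:.3f}")
```

Because y is a symmetric function of x, the Pearson coefficient should land near zero while the distance correlation stays clearly positive. The sketch materializes an n x n x d array, so it is only suitable for small n.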

Previous approaches for comparing neural network representations include CKA, which computes a normalized Hilbert-Schmidt Independence Criterion between kernel matrices of layer activations, and various forms of CCA that project representations into a shared subspace. CKA, while effective and widely adopted, lacks a formal mechanism to account for confounding variables. CCA-based methods require careful selection of the number of canonical components and can be sensitive to the choice of regularization. PDC addresses both issues by operating directly on distance matrices and by supporting comparisons that control for a third variable.

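Paragraph function: Comparative analysis. Systematically contrasts the strengths and weaknesses of PDC and existing methods.

Logical role: Pointing out specific shortcomings of CKA and CCA establishes PDC's differentiating advantage.

Argumentative technique / potential weakness: The explicit limitation that "CKA lacks a formal mechanism" sets up a sharp contrast and argues forcefully. But CKA often performs well enough in practice, and more empirical evidence is needed to show that its theoretical shortcoming becomes a practical problem.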
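For contrast with PDC, here is a minimal sketch of linear CKA, the linear-kernel special case computed directly on centered feature matrices. The choice of a linear kernel is an assumption, since the text does not say which kernel is meant, and the function name and toy data are hypothetical. The point to notice is the signature: it takes two representations and has no slot for a third variable to control for.

```python
import numpy as np


def linear_cka(x, y):
    """Linear CKA between activation matrices x (n x d1) and y (n x d2)."""
    x = x - x.mean(axis=0, keepdims=True)  # center each feature
    y = y - y.mean(axis=0, keepdims=True)
    cross = np.linalg.norm(y.T @ x, ord="fro") ** 2
    norm_x = np.linalg.norm(x.T @ x, ord="fro")
    norm_y = np.linalg.norm(y.T @ y, ord="fro")
    return float(cross / (norm_x * norm_y))


# Toy activations (illustrative assumption): two layers driven by one input.
rng = np.random.default_rng(0)
inputs = rng.normal(size=(300, 8))
layer_a = np.tanh(inputs @ rng.normal(size=(8, 16)))
layer_b = np.tanh(inputs @ rng.normal(size=(8, 12)))
cka_ab = linear_cka(layer_a, layer_b)

# Linear CKA is invariant to orthogonal transformations of the features.
q, _ = np.linalg.qr(rng.normal(size=(16, 16)))
cka_rotated = linear_cka(layer_a, layer_a @ q)
print(f"CKA(a, b) = {cka_ab:.3f}, CKA(a, a rotated) = {cka_rotated:.3f}")
```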

3. Method

We introduce three primary applications of PDC in deep learning. First, representation similarity analysis: given two networks (or the same network at different training stages), we compute the PDC between corresponding layer representations to quantify their functional similarity while controlling for the input. Second, layer redundancy detection: by computing the PDC between consecutive layers while controlling for the input, we can identify layers that contribute minimal unique information, providing principled guidance for network pruning. Third, disentanglement analysis: PDC naturally decomposes the shared and unique contributions of representations, enabling the study of what information is shared versus unique across network components.

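Paragraph function: Develops the methodology. Lists the three main application directions of PDC.

Logical role: Turns the theoretical tool into actionable deep learning applications, bridging theory and practice. The three applications cover different aspects: model understanding, compression, and analysis.

Argumentative technique / potential weakness: The clear three-part structure improves readability. But the validity of each application depends heavily on the experimental setup (for example, which layers are chosen and how many samples are used), and this paragraph does not yet address those details.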
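The first application, comparing two layer representations while controlling for the input, can be sketched as below. This follows the standard U-centered construction of partial distance correlation (project out the control variable's centered distance matrix, then take a cosine-like ratio); it is an illustrative sketch, not the authors' implementation. All function names, the random "layers", and the sample size are assumptions.

```python
import numpy as np


def pairwise_distances(x):
    """Euclidean distance matrix between the rows of x (shape: n x d)."""
    x = np.asarray(x, dtype=float)
    if x.ndim == 1:
        x = x[:, None]
    diff = x[:, None, :] - x[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))


def u_center(d):
    """U-centering: (n-2) and (n-1)(n-2) divisors, diagonal set to zero."""
    n = d.shape[0]
    if n < 4:
        raise ValueError("U-centered inner products need at least 4 samples")
    u = (d
         - d.sum(axis=1, keepdims=True) / (n - 2)
         - d.sum(axis=0, keepdims=True) / (n - 2)
         + d.sum() / ((n - 1) * (n - 2)))
    np.fill_diagonal(u, 0.0)
    return u


def u_inner(a, b):
    """Inner product of U-centered matrices (diagonals are zero)."""
    n = a.shape[0]
    return (a * b).sum() / (n * (n - 3))


def _cosine(a, b):
    denom = np.sqrt(u_inner(a, a) * u_inner(b, b))
    return float(u_inner(a, b) / denom) if denom > 0 else 0.0


def bias_corrected_dcor(x, y):
    """Bias-corrected estimate of squared distance correlation."""
    return _cosine(u_center(pairwise_distances(x)),
                   u_center(pairwise_distances(y)))


def partial_distance_correlation(x, y, z):
    """Similarity of x and y after projecting out z, on the same scale."""
    a, b, c = (u_center(pairwise_distances(v)) for v in (x, y, z))
    cc = u_inner(c, c)
    if cc > 0:
        # Remove the component of each matrix that lies along c.
        a = a - (u_inner(a, c) / cc) * c
        b = b - (u_inner(b, c) / cc) * c
    return _cosine(a, b)


# Toy "layers" (illustrative assumption): both are functions of the input.
rng = np.random.default_rng(0)
inputs = rng.normal(size=(200, 8))
layer_a = np.tanh(inputs @ rng.normal(size=(8, 16)))
layer_b = np.tanh(inputs @ rng.normal(size=(8, 12)))
raw = bias_corrected_dcor(layer_a, layer_b)
partial = partial_distance_correlation(layer_a, layer_b, inputs)
print(f"without control: {raw:.3f}, controlling for input: {partial:.3f}")
```

Both statistics estimate a squared correlation and can be slightly negative in finite samples. For the layer redundancy use case, the same function could in principle be applied to the activations of consecutive layers, again with the input as the control variable.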

For computational efficiency, we employ an unbiased estimator of squared distance covariance that operates on U-centered distance matrices. Given n observations, the cost is O(n^2) pairwise distance evaluations plus O(n^2) for the U-centering, which is manageable for typical batch sizes in deep learning. For larger datasets, we propose a mini-batch approximation scheme that computes PDC over random subsets and averages the results, trading statistical accuracy for computational cost so that the analysis scales to modern architectures with millions of parameters.

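Paragraph function: Supplies technical detail. Addresses computational efficiency.

Logical role: Preemptive rebuttal. It anticipates the objection that PDC is expensive to compute and uses the mini-batch approximation scheme to show feasibility.

Argumentative technique / potential weakness: Openly acknowledging the O(n^2) complexity and proposing a remedy strengthens the paper's credibility. But how the statistical bias introduced by the mini-batch approximation affects the final conclusions calls for more rigorous theoretical analysis.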
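The mini-batch scheme can be sketched as a generic helper that evaluates a statistic on random subsets and averages the results. The text does not give sampling details, so the choices here are assumptions: rows are drawn without replacement within each batch, the seed is fixed, and the spread across batches is reported alongside the mean. The helper name `minibatch_average` is hypothetical. It is demonstrated on a bias-corrected distance correlation, and any statistic taking paired arrays (including a three-argument PDC function) could be passed instead.

```python
import numpy as np


def u_centered_distances(x):
    """Pairwise Euclidean distances of the rows of x, then U-centering."""
    x = np.asarray(x, dtype=float)
    if x.ndim == 1:
        x = x[:, None]
    diff = x[:, None, :] - x[None, :, :]
    d = np.sqrt((diff ** 2).sum(axis=-1))
    n = d.shape[0]
    u = (d
         - d.sum(axis=1, keepdims=True) / (n - 2)
         - d.sum(axis=0, keepdims=True) / (n - 2)
         + d.sum() / ((n - 1) * (n - 2)))
    np.fill_diagonal(u, 0.0)
    return u


def bias_corrected_dcor(x, y):
    """Bias-corrected estimate of squared distance correlation."""
    a, b = u_centered_distances(x), u_centered_distances(y)
    n = a.shape[0]
    scale = n * (n - 3)
    ab = (a * b).sum() / scale
    aa = (a * a).sum() / scale
    bb = (b * b).sum() / scale
    denom = np.sqrt(aa * bb)
    return float(ab / denom) if denom > 0 else 0.0


def minibatch_average(stat_fn, arrays, batch_size, n_batches, seed=0):
    """Average stat_fn over random row subsets shared by all arrays."""
    if batch_size < 4:
        raise ValueError("batch_size must be at least 4 for U-centering")
    rng = np.random.default_rng(seed)
    n = arrays[0].shape[0]
    values = []
    for _ in range(n_batches):
        # The same indices are used for every array, so rows stay paired.
        idx = rng.choice(n, size=batch_size, replace=False)
        values.append(stat_fn(*(arr[idx] for arr in arrays)))
    return float(np.mean(values)), float(np.std(values))


# Toy features (illustrative assumption): b is a noisy linear map of a.
rng = np.random.default_rng(0)
feats_a = rng.normal(size=(5000, 16))
feats_b = feats_a @ rng.normal(size=(16, 16)) + 0.5 * rng.normal(size=(5000, 16))
mean_est, spread = minibatch_average(
    bias_corrected_dcor, (feats_a, feats_b), batch_size=256, n_batches=20)
print(f"mini-batch estimate: {mean_est:.3f} (std across batches {spread:.3f})")
```

With these settings each batch builds 256 x 256 distance matrices instead of a single 5000 x 5000 one. The spread across batches gives a rough sense of how much accuracy is traded away.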

4. Experiments

We evaluate PDC across three experimental settings. In the representation comparison setting, we analyze ResNet, VGG, and Vision Transformer architectures trained on ImageNet. PDC reveals that early layers across architectures show higher similarity than later layers, consistent with the hypothesis that low-level features are more universal. Critically, when we control for the input using partial distance correlation, the similarity between architecturally different networks decreases substantially, revealing that much of the apparent similarity is driven by shared input statistics rather than learned representations.

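Paragraph function: Provides empirical evidence. A cross-architecture comparison showcases the distinctive insight PDC offers.

Logical role: The core of the experimental validation. PDC can reveal the "true similarity" that remains after controlling for confounders, which is the empirical counterpart of its theoretical advantage.

Argumentative technique / potential weakness: "Similarity decreases substantially once the input is controlled for" is a striking finding that effectively showcases what PDC uniquely offers. But whether the size of the decrease varies with the choice of control variable deserves closer study.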

In the layer redundancy experiment, we apply PDC to a pre-trained ResNet-50 and compute the unique information contributed by each residual block. Our analysis identifies several blocks in the middle stages that contribute near-zero unique information, suggesting they can be pruned with minimal accuracy loss. When we remove these blocks, the resulting pruned network retains 97.2% of the original accuracy while reducing parameters by 23%. This demonstrates that PDC provides actionable insights for model compression that go beyond existing heuristic pruning criteria.

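Paragraph function: Application validation. Demonstrates PDC's practical value for model compression.

Logical role: Extends PDC from a theoretical analysis tool to a practical application. The 97.2% accuracy retention and 23% parameter reduction provide persuasive quantitative evidence.

Argumentative technique / potential weakness: The concrete figures (97.2%, 23%) add persuasive force. But readers may ask whether an experiment on ResNet-50 alone generalizes to other architectures, and how the approach compares with more advanced structured pruning methods.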

For the disentanglement analysis, we examine multi-task networks where a shared backbone feeds into task-specific heads. PDC allows us to quantify how much information each head shares with others versus capturing task-unique patterns. On a joint detection and segmentation model, we find that early heads share 60-80% of their information content, while later heads diverge significantly. This analysis provides guidance for determining the optimal branching point in multi-task architectures, an important design decision that is often made by trial and error.

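Paragraph function: Application validation. Demonstrates PDC's analytical power in multi-task learning.

Logical role: The third application scenario, which completes the support for the core claim that PDC is a general-purpose tool.

Argumentative technique / potential weakness: The phrase "often made by trial and error" points to a gap in current practice and highlights PDC's systematic advantage. But the optimal branching point in a multi-task architecture may not be determined by the amount of shared information alone; other factors, such as gradient conflict, may also be involved.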

5. Conclusion

We have presented Partial Distance Correlation as a versatile, statistically grounded tool for analyzing and comparing representations in deep learning. Through three complementary applications (representation comparison, layer redundancy detection, and disentanglement analysis), we have demonstrated that PDC provides insights that are both theoretically principled and practically actionable. The ability to control for confounding variables distinguishes PDC from existing tools and enables more accurate and nuanced analyses of neural network behavior. We believe that PDC will serve as a valuable addition to the toolkit of researchers seeking to understand, compress, and improve deep neural networks.

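Paragraph function: Summary of the whole paper. Restates the core contributions and looks ahead to future impact.

Logical role: The "three applications" answer the earlier promise and close the argument. The forward-looking ending implies that PDC's influence will keep growing.

Argumentative technique / potential weakness: The conclusion keeps its wording cautious ("we believe") and avoids overclaiming. However, it does not discuss PDC's limitations (such as the degradation of distance metrics in certain high-dimensional spaces), which leaves the conclusion somewhat incomplete.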
