4.5. Random Projection

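The sklearn.random_projection module implements a simple and computationally efficient way to reduce the dimensionality of data by trading a controlled amount of accuracy (as additional variance) for faster processing times and smaller model sizes. The module implements two types of unstructured random matrix: the Gaussian random matrix and the sparse random matrix.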

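The dimensions and distribution of random projection matrices are controlled so as to preserve the pairwise distances between any two samples of the dataset. Random projection is therefore a suitable approximation technique for distance-based methods.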
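
The distance-preservation claim above can be checked empirically. The following is a minimal sketch, not part of the original guide: the data sizes, the choice of 500 output dimensions and the random seed are all arbitrary assumptions made for illustration. It compares every pairwise Euclidean distance before and after projection.

```python
import numpy as np
from scipy.spatial.distance import pdist
from sklearn.random_projection import GaussianRandomProjection

rng = np.random.RandomState(0)
X = rng.rand(50, 2000)  # 50 samples in 2000 dimensions (illustrative sizes)

# Project down to 500 dimensions; 500 is an arbitrary choice for this sketch.
transformer = GaussianRandomProjection(n_components=500, random_state=0)
X_new = transformer.fit_transform(X)

# Ratio of each pairwise Euclidean distance after vs. before projection.
ratios = pdist(X_new) / pdist(X)
print(X_new.shape)
print(ratios.min(), ratios.max())
```

Ratios close to 1 mean the projected distances track the original ones; the spread around 1 is expected to narrow as n_components grows.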

References:

Sanjoy Dasgupta. 2000. Experiments with random projection. In Proceedings of the Sixteenth conference on Uncertainty in artificial intelligence (UAI '00), Craig Boutilier and Moisés Goldszmidt (Eds.). Morgan Kaufmann Publishers Inc., San Francisco, CA, USA, 143-151.
Ella Bingham and Heikki Mannila. 2001. Random projection in dimensionality reduction: applications to image and text data. In Proceedings of the seventh ACM SIGKDD international conference on Knowledge discovery and data mining (KDD '01). ACM, New York, NY, USA, 245-250.

4.5.1. The Johnson-Lindenstrauss lemma

The main theoretical result behind the efficiency of random projection is the `Johnson-Lindenstrauss lemma (quoting Wikipedia) <https://en.wikipedia.org/wiki/Johnson%E2%80%93Lindenstrauss_lemma>`_:

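In mathematics, the Johnson-Lindenstrauss lemma is a result concerning low-distortion embeddings of points from high-dimensional into low-dimensional Euclidean space. The lemma states that a small set of points in a high-dimensional space can be embedded into a space of much lower dimension in such a way that distances between the points are nearly preserved. The map used for the embedding is at least Lipschitz, and can even be taken to be an orthogonal projection.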

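Knowing only the number of samples, sklearn.random_projection.johnson_lindenstrauss_min_dim conservatively estimates the minimal size of the random subspace needed to guarantee a bounded distortion introduced by the random projection: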

>>> from sklearn.random_projection import johnson_lindenstrauss_min_dim
>>> johnson_lindenstrauss_min_dim(n_samples=1e6, eps=0.5)
663
>>> johnson_lindenstrauss_min_dim(n_samples=1e6, eps=[0.5, 0.1, 0.01])
array([    663,   11841, 1112658])
>>> johnson_lindenstrauss_min_dim(n_samples=[1e4, 1e5, 1e6], eps=0.1)
array([ 7894,  9868, 11841])
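
The numbers above can be reproduced by hand. This sketch assumes the bound is the one stated in the function's docstring, n_components >= 4 ln(n_samples) / (eps^2 / 2 - eps^3 / 3), truncated to an integer; the helper name jl_min_dim is hypothetical and is not part of scikit-learn.

```python
import math


def jl_min_dim(n_samples, eps):
    """Hypothetical helper: evaluate the Johnson-Lindenstrauss bound by hand."""
    denominator = eps ** 2 / 2 - eps ** 3 / 3
    # Truncate to an integer number of components.
    return int(4 * math.log(n_samples) / denominator)


print(jl_min_dim(1e6, 0.5))  # 663
print(jl_min_dim(1e6, 0.1))  # 11841
print(jl_min_dim(1e4, 0.1))  # 7894
```

Note that the bound depends only on the number of samples and on eps, not on the original number of features. With 100 samples and eps=0.1 the same formula gives 3947.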

Examples:

See The Johnson-Lindenstrauss bound for embedding with random projections for a theoretical explanation of the Johnson-Lindenstrauss lemma and an empirical validation using sparse random matrices.

References:

Sanjoy Dasgupta and Anupam Gupta, 1999. An elementary proof of the Johnson-Lindenstrauss Lemma.

4.5.2. Gaussian random projection

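sklearn.random_projection.GaussianRandomProjection reduces the dimensionality by projecting the original input space onto a randomly generated matrix whose components are drawn from the distribution $N(0, \frac{1}{n_{\text{components}}})$.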

The following snippet illustrates how to use the Gaussian random projection transformer:

>>> import numpy as np
>>> from sklearn import random_projection
>>> X = np.random.rand(100, 10000)
>>> transformer = random_projection.GaussianRandomProjection()
>>> X_new = transformer.fit_transform(X)
>>> X_new.shape
(100, 3947)
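
To connect the transformer to the distribution stated above, the fitted projection matrix can be inspected through its components_ attribute. This is a small sketch under assumed sizes (20 samples, 2000 features, n_components fixed at 200 to keep the matrix small); it checks that the empirical variance of the entries, multiplied by n_components, is close to 1.

```python
import numpy as np
from sklearn.random_projection import GaussianRandomProjection

rng = np.random.RandomState(0)
X = rng.rand(20, 2000)  # illustrative sizes, not from the original guide

n_components = 200  # fixed explicitly instead of the default 'auto'
transformer = GaussianRandomProjection(n_components=n_components, random_state=0)
transformer.fit(X)

# The projection matrix has one row per output dimension.
components = transformer.components_
print(components.shape)  # (200, 2000)

# Entries are drawn from N(0, 1 / n_components), so this should be near 1.
scaled_var = components.var() * n_components
print(scaled_var)
```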

4.5.3. Sparse random projection

sklearn.random_projection.SparseRandomProjection reduces the dimensionality by projecting the original input space using a sparse random matrix.

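Sparse random matrices are an alternative to dense Gaussian random projection matrices: they guarantee similar embedding quality while being much more memory efficient and allowing faster computation of the projected data.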

If we define s = 1 / density, the elements of the random matrix are drawn from

$\left\{\begin{array}{c c l}-\sqrt{\frac{s}{n_{\text{components}}}} & & 1 / 2s\\0 &\text{with probability} & 1 - 1 / s \\+\sqrt{\frac{s}{n_{\text{components}}}} & & 1 / 2s\\\end{array}\right.$

where $n_{\text{components}}$ is the size of the projected subspace. By default, the density of non-zero elements is set to the minimum density recommended by Ping Li et al.: $1 / \sqrt{n_{\text{features}}}$.

The following snippet illustrates how to use the sparse random projection transformer:

>>> import numpy as np
>>> from sklearn import random_projection
>>> X = np.random.rand(100, 10000)
>>> transformer = random_projection.SparseRandomProjection()
>>> X_new = transformer.fit_transform(X)
>>> X_new.shape
(100, 3947)
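
The default density and the three-valued distribution described above can be checked on a fitted transformer. This is a sketch under assumed sizes (2500 features so that the default density is 1 / sqrt(2500) = 0.02, and n_components fixed at 100); it reads the fitted density_ and components_ attributes and compares the non-zero entries with sqrt(s / n_components).

```python
import numpy as np
from sklearn.random_projection import SparseRandomProjection

rng = np.random.RandomState(0)
X = rng.rand(20, 2500)  # illustrative sizes, not from the original guide

n_components = 100
transformer = SparseRandomProjection(n_components=n_components, random_state=0)
transformer.fit(X)

# Default density is 1 / sqrt(n_features), so s = 1 / density = 50 here.
density = transformer.density_
s = 1.0 / density
print(density)

# components_ is a SciPy sparse matrix; .data holds only the non-zero entries.
components = transformer.components_
expected_magnitude = np.sqrt(s / n_components)
magnitudes_match = np.allclose(np.abs(components.data), expected_magnitude)
print(magnitudes_match)

# Fraction of stored non-zero entries, which should be close to the density.
nonzero_fraction = components.nnz / (components.shape[0] * components.shape[1])
print(nonzero_fraction)
```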

References:

D. Achlioptas. 2003. Database-friendly random projections: Johnson-Lindenstrauss with binary coins. Journal of Computer and System Sciences 66 (2003) 671–687.
Ping Li, Trevor J. Hastie, and Kenneth W. Church. 2006. Very sparse random projections. In Proceedings of the 12th ACM SIGKDD international conference on Knowledge discovery and data mining (KDD '06). ACM, New York, NY, USA, 287-296.
