scikit-learn Documentation - Manifold Learning - Unsupervised Learning | ApacheCN

Chinese documentation: http://sklearn.apachecn.org/cn/stable/modules/manifold.html

English documentation: http://sklearn.apachecn.org/en/stable/modules/manifold.html

Official documentation: http://scikit-learn.org/stable/

GitHub: https://github.com/apachecn/scikit-learn-doc-zh (if you find it helpful, please give us a Star; we keep working on it)

Contributors: https://github.com/apachecn/scikit-learn-doc-zh#貢獻者

About us: http://www.apachecn.org/organization/209.html





2.2. Manifold learning

Look for the bare necessities
The simple bare necessities
Forget about your worries and your strife
I mean the bare necessities
Old Mother Nature’s recipes
That bring the bare necessities of life

– Baloo's song [The Jungle Book]
[Figure: comparison of manifold learning methods (../_images/sphx_glr_plot_compare_methods_0011.png)]

Manifold learning is an approach to non-linear dimensionality reduction. Algorithms for this task are based on the idea that the dimensionality of many data sets is only artificially high.

2.2.1. Introduction

High-dimensional datasets can be very difficult to visualize. While data in two or three dimensions can be plotted to show its inherent structure, equivalent high-dimensional plots are much less intuitive. To aid visualization of the structure of a dataset, the dimension must be reduced in some way.

The simplest way to accomplish this dimensionality reduction is by taking a random projection of the data. Though this allows some degree of visualization of the data structure, the randomness of the choice leaves much to be desired. In a random projection, it is likely that the more interesting structure within the data will be lost.

[Figures: the digits dataset and a random 2D projection of it]
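A minimal sketch of such a random 2D projection of the digits data (the n_components and random_state values are arbitrary illustrative choices, not taken from the original figure):

    from sklearn.datasets import load_digits
    from sklearn.random_projection import GaussianRandomProjection

    X, _ = load_digits(return_X_y=True)           # 1797 samples, 64 features (8x8 images)

    # Project the 64-dimensional digits onto 2 random directions.
    rp = GaussianRandomProjection(n_components=2, random_state=42)
    X_projected = rp.fit_transform(X)
    print(X_projected.shape)                      # (1797, 2)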

To address this concern, a number of supervised and unsupervised linear dimensionality reduction frameworks have been designed, such as Principal Component Analysis (PCA), Independent Component Analysis, and Linear Discriminant Analysis, among others. These algorithms define specific rubrics to choose an "interesting" linear projection of the data. These methods can be powerful, but often miss important non-linear structure in the data.

[Figures: PCA and LDA projections of the digits dataset]
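For comparison, a minimal sketch of two of the linear projections mentioned above, PCA (unsupervised) and LDA (supervised, so it also consumes the class labels); the parameter values are arbitrary illustrative choices:

    from sklearn.datasets import load_digits
    from sklearn.decomposition import PCA
    from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

    X, y = load_digits(return_X_y=True)

    # Unsupervised linear projection onto the two leading principal components.
    X_pca = PCA(n_components=2).fit_transform(X)

    # Supervised linear projection; LDA uses the class labels y.
    # (Some digits features are collinear, so LDA may emit a warning.)
    X_lda = LinearDiscriminantAnalysis(n_components=2).fit_transform(X, y)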

Manifold learning can be thought of as an attempt to generalize linear frameworks like PCA to be sensitive to non-linear structure in data. Though supervised variants exist, the typical manifold learning problem is unsupervised: it learns the high-dimensional structure of the data from the data itself, without the use of predetermined classifications.

Examples:

See below for a summary of the manifold learning implementations available in scikit-learn.

2.2.2. Isomap

One of the earliest approaches to manifold learning is the Isomap algorithm, short for Isometric Mapping. Isomap can be viewed as an extension of Multi-dimensional Scaling (MDS) or Kernel PCA. Isomap seeks a lower-dimensional embedding which maintains geodesic distances between all points. Isomap can be performed with the object Isomap.

[Figure: Isomap embedding of the digits dataset (../_images/sphx_glr_plot_lle_digits_0051.png)]
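A minimal usage sketch of the Isomap object (the S-curve toy dataset and the parameter values are arbitrary illustrative choices):

    from sklearn import manifold, datasets

    # A toy 3-D manifold: 1000 points sampled from an S-shaped surface.
    X, color = datasets.make_s_curve(n_samples=1000, random_state=0)

    # Embed into 2 dimensions while preserving geodesic distances.
    iso = manifold.Isomap(n_neighbors=10, n_components=2)
    X_iso = iso.fit_transform(X)
    print(X_iso.shape)                            # (1000, 2)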

2.2.2.1. Complexity

The Isomap algorithm comprises three stages:

  1. Nearest neighbor search. Isomap uses sklearn.neighbors.BallTree for efficient neighbor search. The cost is approximately O[D \log(k) N \log(N)] for k nearest neighbors of N points in D dimensions.
  2. Shortest-path graph search. The most efficient known options are Dijkstra's algorithm, which is approximately O[N^2(k + \log(N))], and the Floyd-Warshall algorithm, which is O[N^3]. The algorithm can be selected by the user with the path_method keyword of Isomap. If unspecified, the code attempts to choose the best algorithm for the input data.
  3. Partial eigenvalue decomposition. The embedding is encoded in the eigenvectors corresponding to the d largest eigenvalues of the N \times N Isomap kernel. For a dense solver, the cost is approximately O[d N^2]. This cost can often be improved using the ARPACK solver. The eigensolver can be specified by the user with the eigen_solver keyword of Isomap. If unspecified, the code attempts to choose the best option for the input data.

The overall complexity of Isomap is O[D \log(k) N \log(N)] + O[N^2(k + \log(N))] + O[d N^2].

  • N : number of training data points
  • D : input dimension
  • k : number of nearest neighbors
  • d : output dimension

References:

2.2.3. Locally Linear Embedding

Locally linear embedding (LLE) seeks a lower-dimensional projection of the data which preserves distances within local neighborhoods. It can be thought of as a series of local Principal Component Analyses which are globally compared to find the best non-linear embedding.

Locally linear embedding can be performed with the function locally_linear_embedding or its object-oriented counterpart LocallyLinearEmbedding.

[Figure: LLE embedding of the digits dataset (../_images/sphx_glr_plot_lle_digits_0061.png)]
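A minimal sketch of both interfaces (the dataset and parameter values are arbitrary illustrative choices):

    from sklearn import manifold, datasets

    X, _ = datasets.make_s_curve(n_samples=1000, random_state=0)

    # Function interface: returns the embedding and the reconstruction error.
    X_lle, err = manifold.locally_linear_embedding(
        X, n_neighbors=12, n_components=2, method='standard')

    # Equivalent object-oriented interface.
    lle = manifold.LocallyLinearEmbedding(n_neighbors=12, n_components=2)
    X_lle2 = lle.fit_transform(X)
    print(lle.reconstruction_error_)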

2.2.3.1. Complexity

The standard LLE algorithm comprises three stages:

  1. Nearest neighbor search. See the discussion under Isomap above.
  2. Weight matrix construction. Approximately O[D N k^3]. The construction of the LLE weight matrix involves the solution of a k \times k linear equation for each of the N local neighborhoods.
  3. Partial eigenvalue decomposition. See the discussion under Isomap above.

The overall complexity of standard LLE is O[D \log(k) N \log(N)] + O[D N k^3] + O[d N^2].

  • N : number of training data points
  • D : input dimension
  • k : number of nearest neighbors
  • d : output dimension

References:

2.2.4. Modified Locally Linear Embedding

One well-known issue with LLE is the regularization problem. When the number of neighbors is greater than the number of input dimensions, the matrix defining each local neighborhood is rank-deficient. To address this, standard LLE applies an arbitrary regularization parameter r, which is chosen relative to the trace of the local weight matrix. Though it can be shown formally that as r \to 0, the solution converges to the desired embedding, there is no guarantee that the optimal solution will be found for r > 0. This problem manifests itself in embeddings which distort the underlying geometry of the manifold.

One method to address the regularization problem is to use multiple weight vectors in each neighborhood. This is the essence of modified locally linear embedding (MLLE). MLLE can be performed with the function locally_linear_embedding or its object-oriented counterpart LocallyLinearEmbedding, with the keyword method = 'modified'. It requires n_neighbors > n_components.

[Figure: modified LLE embedding of the digits dataset (../_images/sphx_glr_plot_lle_digits_0071.png)]
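A minimal MLLE sketch, under the same illustrative assumptions as the LLE example above:

    from sklearn import manifold, datasets

    X, _ = datasets.make_s_curve(n_samples=1000, random_state=0)

    # MLLE: same estimator with method='modified'; requires n_neighbors > n_components.
    mlle = manifold.LocallyLinearEmbedding(
        n_neighbors=12, n_components=2, method='modified')
    X_mlle = mlle.fit_transform(X)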

2.2.4.1. Complexity

The MLLE algorithm comprises three stages:

  1. Nearest Neighbors Search. Same as standard LLE
  2. Weight Matrix Construction. Approximately O[D N k^3] + O[N (k-D) k^2]. The first term is exactly equivalent to that of standard LLE. The second term has to do with constructing the weight matrix from multiple weights. In practice, the added cost of constructing the MLLE weight matrix is relatively small compared to the cost of steps 1 and 3.
  3. Partial Eigenvalue Decomposition. Same as standard LLE

The overall complexity of MLLE is O[D \log(k) N \log(N)] + O[D N k^3] + O[N (k-D) k^2] + O[d N^2].

  • N : number of training data points
  • D : input dimension
  • k : number of nearest neighbors
  • d : output dimension

2.2.5. Hessian Eigenmapping

Hessian Eigenmapping (also known as Hessian-based LLE: HLLE) is another method of solving the regularization problem of LLE. It revolves around a hessian-based quadratic form at each neighborhood which is used to recover the locally linear structure. Though other implementations note its poor scaling with data size, sklearn implements some algorithmic improvements which make its cost comparable to that of other LLE variants for small output dimension. HLLE can be performed with function locally_linear_embedding or its object-oriented counterpart LocallyLinearEmbedding, with the keyword method = 'hessian'. It requires n_neighbors > n_components * (n_components + 3) / 2.

[Figure: Hessian LLE embedding of the digits dataset (../_images/sphx_glr_plot_lle_digits_0081.png)]
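A minimal HLLE sketch (dataset and parameters are again arbitrary illustrative choices; note the constraint on n_neighbors):

    from sklearn import manifold, datasets

    X, _ = datasets.make_s_curve(n_samples=1000, random_state=0)

    # HLLE: method='hessian'; requires
    # n_neighbors > n_components * (n_components + 3) / 2  (here 12 > 5).
    hlle = manifold.LocallyLinearEmbedding(
        n_neighbors=12, n_components=2, method='hessian')
    X_hlle = hlle.fit_transform(X)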

2.2.5.1. Complexity

The HLLE algorithm comprises three stages:

  1. Nearest Neighbors Search. Same as standard LLE
  2. Weight Matrix Construction. Approximately O[D N k^3] + O[N d^6]. The first term reflects a similar cost to that of standard LLE. The second term comes from a QR decomposition of the local hessian estimator.
  3. Partial Eigenvalue Decomposition. Same as standard LLE

The overall complexity of standard HLLE is O[D \log(k) N \log(N)] + O[D N k^3] + O[N d^6] + O[d N^2].

  • N : number of training data points
  • D : input dimension
  • k : number of nearest neighbors
  • d : output dimension

References:

2.2.6. Spectral Embedding

Spectral Embedding is an approach to calculating a non-linear embedding. Scikit-learn implements Laplacian Eigenmaps, which finds a low dimensional representation of the data using a spectral decomposition of the graph Laplacian. The graph generated can be considered as a discrete approximation of the low dimensional manifold in the high dimensional space. Minimization of a cost function based on the graph ensures that points close to each other on the manifold are mapped close to each other in the low dimensional space, preserving local distances. Spectral embedding can be performed with the function spectral_embedding or its object-oriented counterpart SpectralEmbedding.
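A minimal SpectralEmbedding sketch (the dataset and parameter values are arbitrary illustrative choices):

    from sklearn import manifold, datasets

    X, _ = datasets.make_s_curve(n_samples=1000, random_state=0)

    # Laplacian Eigenmaps on a nearest-neighbor affinity graph.
    se = manifold.SpectralEmbedding(n_components=2, affinity='nearest_neighbors',
                                    n_neighbors=10, random_state=0)
    X_se = se.fit_transform(X)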

2.2.6.1. Complexity

The Spectral Embedding (Laplacian Eigenmaps) algorithm comprises three stages:

  1. Weighted Graph Construction. Transform the raw input data into a graph representation using an affinity (adjacency) matrix.
  2. Graph Laplacian Construction. The unnormalized graph Laplacian is constructed as L = D - A and the normalized one as L = D^{-\frac{1}{2}} (D - A) D^{-\frac{1}{2}} (a small NumPy sketch of this step follows the list).
  3. Partial Eigenvalue Decomposition. Eigenvalue decomposition is done on the graph Laplacian.
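To make step 2 concrete, here is a small NumPy sketch of both Laplacians for a given symmetric affinity matrix A (purely illustrative; scikit-learn builds these matrices internally):

    import numpy as np

    # A: symmetric affinity (adjacency) matrix of a tiny 4-node neighborhood graph.
    A = np.array([[0., 1., 1., 0.],
                  [1., 0., 1., 0.],
                  [1., 1., 0., 1.],
                  [0., 0., 1., 0.]])

    deg = A.sum(axis=1)                           # node degrees
    D = np.diag(deg)

    L_unnorm = D - A                              # unnormalized graph Laplacian
    D_inv_sqrt = np.diag(1.0 / np.sqrt(deg))
    L_norm = D_inv_sqrt @ (D - A) @ D_inv_sqrt    # normalized graph Laplacian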

The overall complexity of spectral embedding is O[D \log(k) N \log(N)] + O[D N k^3] + O[d N^2].

  • N : number of training data points
  • D : input dimension
  • k : number of nearest neighbors
  • d : output dimension

References:

2.2.7. Local Tangent Space Alignment

Though not technically a variant of LLE, Local tangent space alignment (LTSA) is algorithmically similar enough to LLE that it can be put in this category. Rather than focusing on preserving neighborhood distances as in LLE, LTSA seeks to characterize the local geometry at each neighborhood via its tangent space, and performs a global optimization to align these local tangent spaces to learn the embedding. LTSA can be performed with function locally_linear_embedding or its object-oriented counterpart LocallyLinearEmbedding, with the keyword method = 'ltsa'.

[Figure: LTSA embedding of the digits dataset (../_images/sphx_glr_plot_lle_digits_0091.png)]
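A minimal LTSA sketch, under the same illustrative assumptions as the LLE examples above:

    from sklearn import manifold, datasets

    X, _ = datasets.make_s_curve(n_samples=1000, random_state=0)

    # LTSA: same estimator with method='ltsa'.
    ltsa = manifold.LocallyLinearEmbedding(
        n_neighbors=12, n_components=2, method='ltsa')
    X_ltsa = ltsa.fit_transform(X)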

2.2.7.1. Complexity

The LTSA algorithm comprises three stages:

  1. Nearest Neighbors Search. Same as standard LLE
  2. Weight Matrix Construction. Approximately O[D N k^3] + O[k^2 d]. The first term reflects a similar cost to that of standard LLE.
  3. Partial Eigenvalue Decomposition. Same as standard LLE

The overall complexity of standard LTSA is O[D \log(k) N \log(N)] + O[D N k^3] + O[k^2 d] + O[d N^2].

  • N : number of training data points
  • D : input dimension
  • k : number of nearest neighbors
  • d : output dimension

References:

2.2.8. Multi-dimensional Scaling (MDS)

Multidimensional scaling (MDS) seeks a low-dimensional representation of the data in which the distances respect well the distances in the original high-dimensional space.

In general, MDS is a technique used for analyzing similarity or dissimilarity data. It attempts to model similarity or dissimilarity data as distances in geometric spaces. The data can be ratings of similarity between objects, interaction frequencies of molecules, or trade indices between countries.

There exist two types of MDS algorithm: metric and non-metric. In scikit-learn, the class MDS implements both. In metric MDS, the input similarity matrix arises from a metric (and thus respects the triangle inequality), and the distances between the two output points are then set to be as close as possible to the similarity or dissimilarity data. In the non-metric version, the algorithms will try to preserve the order of the distances, and hence seek a monotonic relationship between the distances in the embedded space and the similarities/dissimilarities.

[Figure: MDS embedding of the digits dataset (../_images/sphx_glr_plot_lle_digits_0101.png)]
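A minimal metric MDS sketch (the dataset and parameter values are arbitrary illustrative choices):

    from sklearn import manifold, datasets

    X, _ = datasets.make_s_curve(n_samples=500, random_state=0)

    # Metric MDS: reproduce the pairwise Euclidean distances as well as possible.
    mds = manifold.MDS(n_components=2, metric=True, random_state=0)
    X_mds = mds.fit_transform(X)
    print(mds.stress_)                            # value of the stress objective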

Let S be the similarity matrix, and X the coordinates of the n input points. Disparities \hat{d}_{ij} are transformations of the similarities chosen in some optimal way. The objective, called the stress, is then defined by \sum_{i < j} d_{ij}(X) - \hat{d}_{ij}(X).

2.2.8.1. Metric MDS

In the simplest metric MDS model, called absolute MDS, disparities are defined by \hat{d}_{ij} = S_{ij}. With absolute MDS, the value S_{ij} should then correspond exactly to the distance between points i and j in the embedding space.

Most commonly, disparities are set to \hat{d}_{ij} = b S_{ij}.

2.2.8.2. Nonmetric MDS

Non-metric MDS focuses on the ordination of the data. If S_{ij} < S_{kl}, then the embedding should enforce d_{ij} < d_{kl}. A simple algorithm to enforce that is to use a monotonic regression of d_{ij} on S_{ij}, yielding disparities \hat{d}_{ij} in the same order as S_{ij}.

A trivial solution to this problem is to set all the points at the origin. In order to avoid that, the disparities \hat{d}_{ij} are normalized.
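A minimal non-metric MDS sketch on a precomputed dissimilarity matrix (the random data and parameter values are arbitrary illustrative choices):

    import numpy as np
    from sklearn import manifold
    from sklearn.metrics import euclidean_distances

    rng = np.random.RandomState(0)
    X = rng.rand(100, 5)
    D = euclidean_distances(X)                    # precomputed dissimilarities

    # Non-metric MDS: only the rank order of the dissimilarities is used.
    nmds = manifold.MDS(n_components=2, metric=False,
                        dissimilarity='precomputed', random_state=0)
    X_nmds = nmds.fit_transform(D)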

[Figure: metric and non-metric MDS example (../_images/sphx_glr_plot_mds_0011.png)]

References:

2.2.9. t-distributed Stochastic Neighbor Embedding (t-SNE)

t-SNE (TSNE) converts affinities of data points to probabilities. The affinities in the original space are represented by Gaussian joint probabilities and the affinities in the embedded space are represented by Student’s t-distributions. This allows t-SNE to be particularly sensitive to local structure and has a few other advantages over existing techniques:

  • Revealing the structure at many scales on a single map
  • Revealing data that lie in multiple, different, manifolds or clusters
  • Reducing the tendency to crowd points together at the center

While Isomap, LLE and variants are best suited to unfold a single continuous low dimensional manifold, t-SNE will focus on the local structure of the data and will tend to extract clustered local groups of samples as highlighted on the S-curve example. This ability to group samples based on the local structure might be beneficial to visually disentangle a dataset that comprises several manifolds at once as is the case in the digits dataset.

The Kullback-Leibler (KL) divergence of the joint probabilities in the original space and the embedded space will be minimized by gradient descent. Note that the KL divergence is not convex, i.e. multiple restarts with different initializations will end up in local minima of the KL divergence. Hence, it is sometimes useful to try different seeds and select the embedding with the lowest KL divergence.
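A hedged sketch of this seed-selection heuristic (the digits subset size and the seeds are arbitrary illustrative choices):

    from sklearn import manifold, datasets

    X, _ = datasets.load_digits(return_X_y=True)
    X = X[:500]                                   # keep the example fast

    # Run t-SNE with a few seeds and keep the embedding with the lowest KL divergence.
    best_embedding, best_kl = None, float('inf')
    for seed in (0, 1, 2):
        tsne = manifold.TSNE(n_components=2, random_state=seed)
        emb = tsne.fit_transform(X)
        if tsne.kl_divergence_ < best_kl:
            best_embedding, best_kl = emb, tsne.kl_divergence_
    print(best_kl)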

The disadvantages to using t-SNE are roughly:

  • t-SNE is computationally expensive, and can take several hours on million-sample datasets where PCA will finish in seconds or minutes
  • The Barnes-Hut t-SNE method is limited to two or three dimensional embeddings.
  • The algorithm is stochastic and multiple restarts with different seeds can yield different embeddings. However, it is perfectly legitimate to pick the embedding with the least error.
  • Global structure is not explicitly preserved. This problem is mitigated by initializing points with PCA (using init='pca').
[Figure: t-SNE embedding of the digits dataset (../_images/sphx_glr_plot_lle_digits_0131.png)]

2.2.9.1. Optimizing t-SNE

The main purpose of t-SNE is visualization of high-dimensional data. Hence, it works best when the data will be embedded on two or three dimensions.

Optimizing the KL divergence can be a little bit tricky sometimes. There are five parameters that control the optimization of t-SNE and therefore possibly the quality of the resulting embedding:

  • perplexity
  • early exaggeration factor
  • learning rate
  • maximum number of iterations
  • angle (not used in the exact method)

The perplexity is defined as k = 2^{S}, where S is the Shannon entropy of the conditional probability distribution. The perplexity of a k-sided die is k, so k is effectively the number of nearest neighbors t-SNE considers when generating the conditional probabilities. Larger perplexities lead to more nearest neighbors and are less sensitive to small structure. Conversely, a lower perplexity considers a smaller number of neighbors, and thus ignores more global information in favour of the local neighborhood. As dataset sizes get larger, more points will be required to get a reasonable sample of the local neighborhood, and hence larger perplexities may be required. Similarly, noisier datasets will require larger perplexity values to encompass enough local neighbors to see beyond the background noise.
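As a rough illustration of the effect of this parameter, one might compare a few perplexity values on a small digits subset (a hedged sketch; the values are arbitrary):

    from sklearn import manifold, datasets

    X, _ = datasets.load_digits(return_X_y=True)
    X = X[:500]

    # Larger perplexities take more neighbors into account.
    embeddings = {}
    for perplexity in (5, 30, 50):
        embeddings[perplexity] = manifold.TSNE(
            n_components=2, perplexity=perplexity, random_state=0).fit_transform(X)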

The maximum number of iterations is usually high enough and does not need any tuning. The optimization consists of two phases: the early exaggeration phase and the final optimization. During early exaggeration the joint probabilities in the original space will be artificially increased by multiplication with a given factor. Larger factors result in larger gaps between natural clusters in the data. If the factor is too high, the KL divergence could increase during this phase. Usually it does not have to be tuned. A critical parameter is the learning rate. If it is too low, gradient descent will get stuck in a bad local minimum. If it is too high, the KL divergence will increase during optimization. More tips can be found in Laurens van der Maaten's FAQ (see references). The last parameter, angle, is a tradeoff between performance and accuracy. Larger angles imply that we can approximate larger regions by a single point, leading to better speed but less accurate results.
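A hedged sketch setting these parameters explicitly (all values are arbitrary illustrative choices; note that recent scikit-learn releases rename n_iter to max_iter):

    from sklearn import manifold, datasets

    X, _ = datasets.load_digits(return_X_y=True)
    X = X[:500]

    tsne = manifold.TSNE(
        n_components=2,
        perplexity=30.0,
        early_exaggeration=12.0,   # controls the gaps between clusters in the first phase
        learning_rate=200.0,       # too low: bad local minimum; too high: KL divergence increases
        n_iter=1000,               # maximum number of iterations (max_iter in newer versions)
        angle=0.5,                 # Barnes-Hut speed/accuracy trade-off
        random_state=0)
    X_tsne = tsne.fit_transform(X)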

“How to Use t-SNE Effectively” provides a good discussion of the effects of the various parameters, as well as interactive plots to explore the effects of different parameters.

2.2.9.2. Barnes-Hut t-SNE

The Barnes-Hut t-SNE that has been implemented here is usually much slower than other manifold learning algorithms. The optimization is quite difficult and the computation of the gradient is O[d N \log(N)], where d is the number of output dimensions and N is the number of samples. The Barnes-Hut method improves on the exact method, where the t-SNE complexity is O[d N^2], but has several other notable differences:

  • The Barnes-Hut implementation only works when the target dimensionality is 3 or less. The 2D case is typical when building visualizations.
  • Barnes-Hut only works with dense input data. Sparse data matrices can only be embedded with the exact method, or can be approximated by a dense low-rank projection, for instance using sklearn.decomposition.TruncatedSVD.
  • Barnes-Hut is an approximation of the exact method. The approximation is parameterized with the angle parameter, therefore the angle parameter is unused when method="exact".
  • Barnes-Hut is significantly more scalable. Barnes-Hut can be used to embed hundreds of thousands of data points while the exact method can handle thousands of samples before becoming computationally intractable.

For visualization purposes (which is the main use case of t-SNE), using the Barnes-Hut method is strongly recommended. The exact t-SNE method is useful for checking the theoretical properties of the embedding, possibly in higher-dimensional space, but is limited to small datasets due to computational constraints.
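A hedged sketch contrasting the two methods on a small digits subset (the subset size and parameters are arbitrary illustrative choices):

    from sklearn import manifold, datasets

    X, _ = datasets.load_digits(return_X_y=True)
    X = X[:300]

    # Barnes-Hut (the default): fast approximation, target dimensionality must be < 4.
    X_bh = manifold.TSNE(n_components=2, method='barnes_hut',
                         angle=0.5, random_state=0).fit_transform(X)

    # Exact method: O[d N^2], only practical for small datasets; 'angle' is ignored here.
    X_exact = manifold.TSNE(n_components=2, method='exact',
                            random_state=0).fit_transform(X)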

Also note that the digits labels roughly match the natural grouping found by t-SNE, while the linear 2D projection of the PCA model yields a representation where label regions largely overlap. This is a strong clue that this data can be well separated by non-linear methods that focus on the local structure (e.g. an SVM with a Gaussian RBF kernel). However, failing to visualize well-separated, homogeneously labeled groups with t-SNE in 2D does not necessarily imply that the data cannot be correctly classified by a supervised model. It might be the case that 2 dimensions are not high enough to accurately represent the internal structure of the data.

References:

2.2.10. Tips on practical use

  • Make sure the same scale is used over all features. Because manifold learning methods are based on a nearest-neighbor search, the algorithm may perform poorly otherwise. See StandardScaler for convenient ways of scaling heterogeneous data.
  • The reconstruction error computed by each routine can be used to choose the optimal output dimension. For a d-dimensional manifold embedded in a D-dimensional parameter space, the reconstruction error will decrease as n_components is increased until n_components == d (a short sketch of this follows the list).
  • Note that noisy data can “short-circuit” the manifold, in essence acting as a bridge between parts of the manifold that would otherwise be well-separated. Manifold learning on noisy and/or incomplete data is an active area of research.
  • Certain input configurations can lead to singular weight matrices, for example when more than two points in the dataset are identical, or when the data is split into disjointed groups. In this case, solver='arpack' will fail to find the null space. The easiest way to address this is to use solver='dense' which will work on a singular matrix, though it may be very slow depending on the number of input points. Alternatively, one can attempt to understand the source of the singularity: if it is due to disjoint sets, increasing n_neighbors may help. If it is due to identical points in the dataset, removing these points may help.
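A hedged sketch combining the scaling and output-dimension tips above, using the reconstruction_error_ attribute of LocallyLinearEmbedding (the subset size, candidate dimensions, and parameter values are arbitrary illustrative choices):

    from sklearn import manifold, datasets
    from sklearn.preprocessing import StandardScaler

    X, _ = datasets.load_digits(return_X_y=True)
    X = X[:500]                                   # keep the example fast

    # Put all features on the same scale before the nearest-neighbor search.
    X_scaled = StandardScaler().fit_transform(X)

    # Compare candidate output dimensions via the reconstruction error;
    # the dense solver is robust to singular weight matrices (see the last tip).
    for n_components in (2, 3, 4):
        lle = manifold.LocallyLinearEmbedding(
            n_neighbors=10, n_components=n_components, eigen_solver='dense')
        lle.fit(X_scaled)
        print(n_components, lle.reconstruction_error_)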

See also

   

Totally Random Trees Embedding (RandomTreesEmbedding) can also be useful to derive non-linear representations of feature space; note, however, that it does not perform dimensionality reduction.





