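Julia Machine Learning: Cluster Analysis Code Examples

Original post

2020-06-28 23:39

Clustering.jl is a basic Julia library for cluster analysis. Its documentation is short on code examples, so this post collects a few.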

K-means

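K-means is a classical method for clustering or vector quantization. It produces a fixed number of clusters, each associated with a center (also called a prototype), and assigns every data point to the cluster with the nearest center.

Mathematically, K-means is a coordinate descent algorithm that solves the following optimization problem:

$$\min_{\mu, z} \; \sum_{i=1}^{n} \lVert x_i - \mu_{z_i} \rVert^2$$

Here, μ_k is the center of the k-th cluster and z_i is the index of the cluster assigned to the i-th point x_i.

Code example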

using RDatasets, Clustering, Plots

iris = dataset("datasets", "iris"); # load the data

features = collect(Matrix(iris[:, 1:4])'); # features to use for clustering
result = kmeans(features, 3); # run K-means for the 3 clusters

#plot with the point color mapped to the assigned cluster index
scatter(iris.PetalLength, iris.PetalWidth, marker_z=result.assignments,
                color=:lightrainbow, legend=false)
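
The coordinate descent view of K-means can be sketched without any package: alternate between updating the assignments with the centers fixed and updating the centers with the assignments fixed. This is a minimal illustration only; `lloyd_kmeans` is a hypothetical helper name, the initialization (k random data points) is an assumption, and none of this is the Clustering.jl implementation.

```julia
using Random

# Illustrative Lloyd-style iteration (hypothetical helper, not part of Clustering.jl).
# X holds one point per column (d×n).
function lloyd_kmeans(X::AbstractMatrix{<:Real}, k::Integer; maxiter::Integer=100,
                      rng::AbstractRNG=Random.default_rng())
    n = size(X, 2)
    centers = Float64.(X[:, randperm(rng, n)[1:k]])  # k distinct points as initial centers
    z = zeros(Int, n)
    for _ in 1:maxiter
        # Step 1: centers fixed, assign each point to its nearest center
        newz = [argmin([sum(abs2, X[:, i] .- centers[:, j]) for j in 1:k]) for i in 1:n]
        newz == z && break                 # no assignment changed: converged
        z = newz
        # Step 2: assignments fixed, move each center to the mean of its members
        for j in 1:k
            members = findall(==(j), z)
            isempty(members) && continue   # keep the old center for an empty cluster
            centers[:, j] = vec(sum(X[:, members], dims=2)) ./ length(members)
        end
    end
    # value of the objective: summed squared distance to the assigned centers
    cost = sum(sum(abs2, X[:, i] .- centers[:, z[i]]) for i in 1:n)
    return (centers=centers, assignments=z, cost=cost)
end
```

Calling `lloyd_kmeans(features, 3)` on the iris features returns a named tuple with `centers`, `assignments` and `cost`. Neither step can increase the objective, which is why the loop may stop as soon as the assignments stop changing.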

Fuzzy C-means

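Fuzzy C-means is a clustering method that gives each point membership weights across the clusters instead of a "hard" classification (as K-means does).

Mathematically, Fuzzy C-means solves the following optimization problem:

$$\min_{c_1, \dots, c_C} \; \sum_{i=1}^{n} \sum_{j=1}^{C} w_{ij}^{m} \lVert x_i - c_j \rVert^2, \qquad w_{ij} = \frac{1}{\sum_{k=1}^{C} \left( \frac{\lVert x_i - c_j \rVert}{\lVert x_i - c_k \rVert} \right)^{2/(m-1)}}$$

Here, c_j is the center of the j-th cluster, w_ij is the membership weight of the i-th point in the j-th cluster, and m > 1 is a user-defined fuzziness parameter.

Code example: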

using RDatasets, Clustering, Plots

iris = dataset("datasets", "iris"); # load the data

features = collect(Matrix(iris[:, 1:4])'); # features to use for clustering
# 3 clusters with fuzziness 4 (any value > 1 is allowed)
result = fuzzy_cmeans(features, 3, 4, maxiter=150, display=:iter)

# result.weights is an n×3 matrix of memberships; take the strongest cluster per point
assignments = [argmax(result.weights[i, :]) for i in 1:size(result.weights, 1)]

# plot with the point color mapped to the strongest cluster index
scatter(iris.PetalLength, iris.PetalWidth, marker_z=assignments,
                color=:lightrainbow, legend=false)
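
To make the roles of the membership weights and the fuzziness m concrete, here is a small sketch of one Fuzzy C-means update written directly from the standard formulas. The names `fuzzy_weights` and `fuzzy_centers` are hypothetical helpers for illustration, not Clustering.jl functions, and the sketch assumes no point coincides exactly with a center.

```julia
using LinearAlgebra

# Hypothetical helpers for illustration only.
# X is d×n (one point per column), centers is d×C, m > 1 is the fuzziness.
function fuzzy_weights(X::AbstractMatrix, centers::AbstractMatrix, m::Real)
    n, C = size(X, 2), size(centers, 2)
    dist = [norm(X[:, i] - centers[:, j]) for i in 1:n, j in 1:C]
    # w[i, j] = 1 / sum_k (dist[i, j] / dist[i, k])^(2 / (m - 1))
    # assumes dist > 0 everywhere (no point sits exactly on a center)
    return [1 / sum((dist[i, j] / dist[i, k])^(2 / (m - 1)) for k in 1:C)
            for i in 1:n, j in 1:C]
end

function fuzzy_centers(X::AbstractMatrix, w::AbstractMatrix, m::Real)
    wm = w .^ m                          # n×C matrix of w[i, j]^m
    return (X * wm) ./ sum(wm, dims=1)   # column j is the weighted mean for cluster j
end
```

Each row of the weight matrix sums to 1, so a point near one center gets a weight close to 1 for that cluster and small weights elsewhere. Larger values of m flatten the weights toward equal membership.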

K-medoids

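K-medoids is a clustering algorithm similar in spirit to K-means, but instead of means it picks actual data points (called medoids) as cluster centers, so that the total distance between each data point and its nearest medoid is minimized. It takes an N×N square matrix of pairwise distances as input.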

using Clustering, Plots
using LinearAlgebra

points = rand(2, 100); # 100 random 2-D points, one per column
n = size(points, 2)
# kmedoids expects an n×n matrix of pairwise distances, not raw features
dists = [norm(points[:, i] - points[:, j]) for i in 1:n, j in 1:n];
result = kmedoids(dists, 3); # run K-medoids for 3 clusters

# plot with the point color mapped to the assigned cluster index
scatter(points[1, :], points[2, :], marker_z=result.assignments,
                color=:lightrainbow, legend=false)
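
Because K-medoids only ever looks at the distance matrix, its objective is easy to write down. The sketch below uses hypothetical helper names (`assign_to_medoids`, `total_cost`, `best_medoid`) to show the assignment step, the cost being minimized, and how a better medoid is chosen within one cluster. It illustrates the idea and is not how Clustering.jl implements `kmedoids`.

```julia
# Hypothetical helpers illustrating the K-medoids objective.
# D is an n×n matrix of pairwise distances; medoids is a vector of point indices.

# Assign each point to the position (1..k) of its nearest medoid.
assign_to_medoids(D, medoids) = [argmin(D[i, medoids]) for i in 1:size(D, 1)]

# Total distance from every point to its assigned medoid.
total_cost(D, medoids, assignments) =
    sum(D[i, medoids[assignments[i]]] for i in 1:size(D, 1))

# Within one cluster, the best medoid is the member with the smallest
# summed distance to all members of that cluster.
function best_medoid(D, members)
    costs = [sum(D[i, members]) for i in members]
    return members[argmin(costs)]
end
```

For six points on a line at 0, 1, 2, 10, 11, 12, starting from medoids at the first and fourth points gives a total cost of 6. Moving each medoid to the middle point of its group lowers the cost to 4.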

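MCL (Markov Cluster Algorithm)

This is a graph clustering algorithm, useful for example for analyzing social graphs of people, whereas the methods above all cluster feature vectors. The Markov Cluster Algorithm works by simulating a stochastic (Markov) flow in a weighted graph, where each node is a data point and the edge weights are defined by the adjacency matrix. When the algorithm converges, it produces new edge weights that define the new connected components of the graph (i.e. the clusters). It takes an N×N square adjacency matrix as input.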

using Clustering, Plots

adj = rand(100, 100); # random weighted adjacency matrix, for illustration only
result = mcl(adj; add_loops=true, expansion=3, inflation=4); # run MCL

# println(result.assignments)
# println(result.converged)
# println(result.iterations)
# println(result.rel_Δ)

# plot with the point color mapped to the assigned cluster index;
# a random graph has no natural coordinates, so the first two columns serve as placeholder axes
scatter(adj[:, 1], adj[:, 2], marker_z=result.assignments,
                color=:lightrainbow, legend=false)
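
The `expansion` and `inflation` parameters correspond to the two alternating steps of the flow simulation. The sketch below shows those steps on a dense matrix. `mcl_sketch` and `normalize_columns` are hypothetical names, the convergence test is an assumption, and the pruning and cluster extraction that a real implementation needs are left out.

```julia
using LinearAlgebra

# Hypothetical helpers sketching the two MCL steps; not the Clustering.jl internals.
normalize_columns(M) = M ./ sum(M, dims=1)   # make every column sum to 1

function mcl_sketch(adj::AbstractMatrix{<:Real}; expansion::Integer=2, inflation::Real=2,
                    maxiter::Integer=100, tol::Real=1e-8)
    M = normalize_columns(adj + I)           # add self-loops, then column-normalize
    for _ in 1:maxiter
        expanded = M^expansion               # expansion: flow spreads along longer paths
        inflated = normalize_columns(expanded .^ inflation)  # inflation: strong flows get stronger
        converged = maximum(abs.(inflated - M)) < tol
        M = inflated
        converged && break
    end
    return M                                 # column-stochastic flow matrix
end
```

In the returned matrix, flow between nodes of different clusters has decayed toward zero, so nodes that still exchange nonzero flow belong to the same cluster. Higher inflation tends to produce more, smaller clusters.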

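Affinity Propagation (AP)

Affinity propagation (AP for short) was proposed in 2007 in Science, in the paper "Clustering by Passing Messages Between Data Points" by Frey and Dueck. It is described as particularly suited to fast clustering of high-dimensional data with many classes. It is relatively new compared with traditional clustering algorithms and is reported to improve on them in both clustering quality and efficiency.

The basic idea of AP: treat all samples as nodes of a network, then compute each sample's cluster center by passing messages along the edges of the network. Two kinds of messages are passed between nodes during clustering: responsibility and availability. AP iteratively updates the responsibility and availability of every point until m high-quality exemplars (similar to centroids) emerge, and assigns the remaining data points to the corresponding clusters.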

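using RDatasets, Clustering, Plots
using LinearAlgebra, Statistics

iris = dataset("datasets", "iris"); # load the data

features = collect(Matrix(iris[:, 1:4])'); # features to use for clustering
n = size(features, 2)

# affinityprop expects an n×n similarity matrix, not raw features;
# here similarity is the negative squared Euclidean distance
S = [-sum(abs2, features[:, i] - features[:, j]) for i in 1:n, j in 1:n];
# the diagonal holds each point's preference to be an exemplar (median similarity here)
S[diagind(S)] .= median(S);
result = affinityprop(S)

# plot with the point color mapped to the assigned cluster index
scatter(iris.PetalLength, iris.PetalWidth, marker_z=result.assignments,
                color=:lightrainbow, legend=false)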

DBSCAN

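DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a representative density-based clustering algorithm. Unlike partitioning and hierarchical methods, it defines a cluster as a maximal set of density-connected points, so it can turn regions of sufficiently high density into clusters and find clusters of arbitrary shape in spatial data that contains noise.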
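
The notion of density-connected points can be illustrated with a short, package-free sketch. Clustering.jl ships its own `dbscan` function; the code below does not reproduce its interface. `dbscan_sketch`, `radius` and `min_pts` are names assumed here for illustration, and the brute-force neighbor search is only suitable for small inputs.

```julia
using LinearAlgebra

# Minimal DBSCAN sketch (hypothetical name, not the Clustering.jl API).
# X is d×n (one point per column). Returns one label per point; 0 means noise.
function dbscan_sketch(X::AbstractMatrix{<:Real}, radius::Real, min_pts::Integer)
    n = size(X, 2)
    # all points within `radius` of point i, including i itself
    neighbors(i) = [j for j in 1:n if norm(X[:, i] - X[:, j]) <= radius]
    labels = zeros(Int, n)
    visited = falses(n)
    cluster = 0
    for i in 1:n
        visited[i] && continue
        visited[i] = true
        seeds = neighbors(i)
        length(seeds) < min_pts && continue   # not a core point; noise unless reached later
        cluster += 1
        labels[i] = cluster
        while !isempty(seeds)
            j = pop!(seeds)
            labels[j] == 0 && (labels[j] = cluster)   # border or core point joins the cluster
            visited[j] && continue
            visited[j] = true
            nb = neighbors(j)
            length(nb) >= min_pts && append!(seeds, nb)  # j is a core point: keep expanding
        end
    end
    return labels
end
```

A point is a core point when at least `min_pts` points (itself included) lie within `radius`. Clusters grow outward from core points, and anything that no core point reaches keeps the label 0 and is treated as noise.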

To be continued...
