Clustering Algorithms Even Beginners Can Understand, Part 2: DBSCAN

1. Introduction

DBSCAN (Density-Based Spatial Clustering of Applications with Noise) is a classic density-based clustering algorithm, first published in 1996.

2. Basic concepts of DBSCAN

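DBSCAN uses a pair of “neighborhood” parameters, usually written ($\epsilon$, MinPts), to describe how densely the samples are distributed. Given a sample set $D = \{x_1, x_2, \cdots, x_m\}$, we can define the following concepts: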

$\epsilon$-neighborhood: for $x_j \in D$, the $\epsilon$-neighborhood of $x_j$ contains the samples in $D$ whose distance to $x_j$ is at most $\epsilon$:
$N_{\epsilon}(\boldsymbol{x_j}) = \{ \boldsymbol{x_i} \in D \mid \text{dist}(\boldsymbol{x_i}, \boldsymbol{x_j}) \le \epsilon \}$
where dist is the chosen distance function.
Core object: if the $\epsilon$-neighborhood of $x_j$ contains at least MinPts samples, i.e. $|N_{\epsilon}(\boldsymbol{x_j})| \geq MinPts$, then $x_j$ is a core object.
Directly density-reachable: if $x_j$ lies in the $\epsilon$-neighborhood of $x_i$ and $x_i$ is a core object, then $x_j$ is directly density-reachable from $x_i$. The converse does not necessarily hold.
Density-reachable: for samples $x_i, x_j$, if there is a sequence of samples $p_1, p_2, \cdots, p_n$ with $p_1 = x_i$, $p_n = x_j$, and each $p_{k+1}$ directly density-reachable from $p_k$ ($k = 1, \cdots, n-1$), then $x_j$ is density-reachable from $x_i$. Density-reachability is transitive but not symmetric.
Density-connected: for samples $x_i, x_j$, if there exists an $x_k$ such that both $x_i$ and $x_j$ are density-reachable from $x_k$, then $x_i$ and $x_j$ are density-connected. Density-connectedness is clearly symmetric.
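
The first three definitions can be checked directly on a tiny data set. The following is a minimal sketch, assuming dist is the Euclidean distance; the helper names (epsilon_neighborhood, is_core, directly_reachable) and the four sample points are made up for illustration.

```python
import math

def epsilon_neighborhood(D, j, eps):
    """Indices i with dist(x_i, x_j) <= eps; x_j itself is included."""
    return [i for i, x in enumerate(D) if math.dist(x, D[j]) <= eps]

def is_core(D, j, eps, min_pts):
    """x_j is a core object if its neighborhood holds at least min_pts samples."""
    return len(epsilon_neighborhood(D, j, eps)) >= min_pts

def directly_reachable(D, j, i, eps, min_pts):
    """True if x_j is directly density-reachable from x_i."""
    return is_core(D, i, eps, min_pts) and j in epsilon_neighborhood(D, i, eps)

# Hypothetical toy data: three points near the origin and one far away.
D = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (5.0, 5.0)]
eps, min_pts = 1.0, 3

cores = [j for j in range(len(D)) if is_core(D, j, eps, min_pts)]
print(cores)                                      # [0]
print(directly_reachable(D, 1, 0, eps, min_pts))  # True
print(directly_reachable(D, 0, 1, eps, min_pts))  # False
```

Only the first point has three samples within distance 1, so it is the only core object. The second point is directly density-reachable from it, but not the other way round, which is the asymmetry noted in the definition.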

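Figure: the basic DBSCAN concepts (MinPts = 3). The dashed circles are $\epsilon$-neighborhoods; $x_1$ is a core object, $x_2$ is directly density-reachable from $x_1$, $x_3$ is density-reachable from $x_1$, and $x_3$ and $x_4$ are density-connected.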

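Based on these definitions, DBSCAN defines a “cluster” as a maximal set of density-connected samples derived from the density-reachability relation. In other words, any two samples in a cluster are density-connected, and every sample that is density-reachable from a member of the cluster also belongs to it.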

3. The algorithm

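In lines 1-7 of the algorithm listing, the algorithm first finds all core objects for the given neighborhood parameters ($\epsilon$, MinPts). Then, in lines 10-24, it takes any core object as a starting point, collects the samples that are density-reachable from it to form a cluster, and repeats until every core object has been visited.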
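
The two phases can be sketched in plain Python. This is an illustrative sketch under stated assumptions, not a transcription of the listing referred to above: it assumes Euclidean distance, uses a brute-force neighbor search, and the function name dbscan_sketch and the toy data are made up here.

```python
import math
from collections import deque

NOISE = -1

def dbscan_sketch(D, eps, min_pts):
    """Return one label per sample: 0, 1, ... for clusters, -1 for noise."""
    n = len(D)

    # Phase 1: compute every epsilon-neighborhood and mark the core objects.
    neighborhoods = [
        [i for i in range(n) if math.dist(D[i], D[j]) <= eps] for j in range(n)
    ]
    core = [len(nb) >= min_pts for nb in neighborhoods]

    # Phase 2: grow one cluster from each core object that is still unassigned.
    labels = [NOISE] * n
    cluster_id = 0
    for seed in range(n):
        if not core[seed] or labels[seed] != NOISE:
            continue
        labels[seed] = cluster_id
        queue = deque([seed])  # holds core objects only
        while queue:
            p = queue.popleft()
            for q in neighborhoods[p]:
                if labels[q] == NOISE:
                    labels[q] = cluster_id
                    if core[q]:
                        queue.append(q)  # only core objects extend the cluster
        cluster_id += 1
    return labels

# Hypothetical toy data: two tight groups of four points and one isolated point.
D = [(0, 0), (0, 0.5), (0.5, 0), (0.5, 0.5),
     (5, 5), (5, 5.5), (5.5, 5), (5.5, 5.5),
     (10, 0)]
labels = dbscan_sketch(D, eps=0.8, min_pts=3)
print(labels)  # [0, 0, 0, 0, 1, 1, 1, 1, -1]
```

A sample that is never reached from any core object keeps the label -1 and is reported as noise. In this sketch, a non-core sample that lies within $\epsilon$ of core objects from two clusters simply goes to whichever cluster reaches it first.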

4. Advantages and disadvantages

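Advantages:
1. Unlike k-means, DBSCAN does not require the number of clusters to be specified in advance.
2. DBSCAN can find clusters of arbitrary shape, even a cluster that surrounds another cluster without being connected to it. In addition, thanks to the MinPts parameter, the single-link effect (distinct clusters being treated as one because they are joined by a single point or a very thin line of points) is largely avoided.
3. DBSCAN can identify noise (outliers).
4. DBSCAN needs only two parameters and is mostly insensitive to the order of the points in the database. (Points on the border between two clusters may be assigned to a different cluster depending on the order, and the order in which clusters are numbered also depends on the order of the points.)
5. DBSCAN is designed to work with database index structures that accelerate region queries, such as the R*-tree.
6. With a sufficient understanding of the data, suitable parameters can be chosen to obtain a good clustering.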
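
Advantages 1 and 2 are easy to see on a shape that k-means handles poorly. The following is a small sketch, separate from the example later in this post: it assumes scikit-learn is installed, and the data set (make_circles) and the parameter values (eps=0.2, min_samples=5) are choices made for this illustration only.

```python
from sklearn.cluster import DBSCAN, KMeans
from sklearn.datasets import make_circles
from sklearn.metrics import adjusted_rand_score

# Two concentric rings: the outer cluster surrounds the inner one without touching it.
X, y_true = make_circles(n_samples=400, factor=0.3, noise=0.03, random_state=0)

# DBSCAN is not told how many clusters to look for.
db_labels = DBSCAN(eps=0.2, min_samples=5).fit_predict(X)

# k-means must be given the number of clusters up front.
km_labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)

# Compare both results with the true ring membership.
db_ari = adjusted_rand_score(y_true, db_labels)
km_ari = adjusted_rand_score(y_true, km_labels)
print(f"DBSCAN  adjusted Rand index: {db_ari:.3f}")
print(f"k-means adjusted Rand index: {km_ari:.3f}")
```

DBSCAN follows each ring because neighboring points along a ring are density-reachable from one another, while the gap between the rings is much wider than eps. k-means separates the plane around two centroids, so each of its clusters ends up containing part of both rings.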

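Disadvantages:
1. DBSCAN is not entirely deterministic: a point on the border between two clusters joins one or the other depending on its position in the database order. Fortunately, this situation is uncommon and has little effect on the overall result, and DBSCAN is deterministic for core points and noise points. DBSCAN* is a variant that treats border points as noise and is therefore fully deterministic.
2. The quality of a DBSCAN clustering depends on the distance measure used in the function regionQuery(P,ε). The most common measure is the Euclidean distance. In high-dimensional data in particular, the so-called “curse of dimensionality” makes it hard to find a suitable ε, although this affects every algorithm that relies on the Euclidean distance.
3. If the points in the database have very different densities, DBSCAN cannot produce a good clustering, because no single minPts-ε combination suits all clusters.
4. Without a good understanding of the data and its scale, it is hard to choose a suitable ε.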
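
Disadvantages 3 and 4 can be made concrete by looking at each point's distance to its k-th nearest neighbor, a common heuristic for choosing ε that is not described in the text above. The following sketch uses made-up data (one dense and one sparse Gaussian blob); the helper name k_distances and the choice k = 4 are assumptions for this illustration.

```python
import math
import random
import statistics

def k_distances(points, k):
    """Distance from each point to its k-th nearest other point."""
    result = []
    for i, p in enumerate(points):
        d = sorted(math.dist(p, q) for j, q in enumerate(points) if j != i)
        result.append(d[k - 1])
    return result

random.seed(0)
# Hypothetical data: a dense blob near (0, 0) and a sparse blob near (20, 20).
dense = [(random.gauss(0, 0.1), random.gauss(0, 0.1)) for _ in range(50)]
sparse = [(random.gauss(20, 2.0), random.gauss(20, 2.0)) for _ in range(50)]

k = 4  # hypothetical choice, playing the role of MinPts
kd = k_distances(dense + sparse, k)
dense_median = statistics.median(kd[:50])
sparse_median = statistics.median(kd[50:])
print(f"median {k}-distance, dense blob:  {dense_median:.3f}")
print(f"median {k}-distance, sparse blob: {sparse_median:.3f}")
```

The typical k-distance differs greatly between the two blobs. An ε small enough to suit the dense blob leaves most of the sparse blob as noise, while an ε large enough for the sparse blob is far too coarse to resolve any structure inside the dense one.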

5. Example

#!/usr/bin/env python
# -*- coding: utf-8 -*-

#Author: WangLei
#date: 2020/3/28


import numpy as np

from sklearn.cluster import DBSCAN
from sklearn import metrics
from sklearn.datasets import make_blobs
from sklearn.preprocessing import StandardScaler


# #############################################################################
# Generate sample data
centers = [[1, 1], [-1, -1], [1, -1]]
X, labels_true = make_blobs(n_samples=750, centers=centers, cluster_std=0.4,
                            random_state=0)

X = StandardScaler().fit_transform(X)

# #############################################################################
# Compute DBSCAN
db = DBSCAN(eps=0.3, min_samples=10).fit(X)
core_samples_mask = np.zeros_like(db.labels_, dtype=bool)
core_samples_mask[db.core_sample_indices_] = True
labels = db.labels_

# Number of clusters in labels, ignoring noise if present.
n_clusters_ = len(set(labels)) - (1 if -1 in labels else 0)
n_noise_ = list(labels).count(-1)

print('Estimated number of clusters: %d' % n_clusters_)
print('Estimated number of noise points: %d' % n_noise_)
print("Homogeneity: %0.3f" % metrics.homogeneity_score(labels_true, labels))
print("Completeness: %0.3f" % metrics.completeness_score(labels_true, labels))
print("V-measure: %0.3f" % metrics.v_measure_score(labels_true, labels))
print("Adjusted Rand Index: %0.3f"
      % metrics.adjusted_rand_score(labels_true, labels))
print("Adjusted Mutual Information: %0.3f"
      % metrics.adjusted_mutual_info_score(labels_true, labels))
print("Silhouette Coefficient: %0.3f"
      % metrics.silhouette_score(X, labels))

# #############################################################################
# Plot result
import matplotlib.pyplot as plt

# Black removed and is used for noise instead.
unique_labels = set(labels)
colors = [plt.cm.Spectral(each)
          for each in np.linspace(0, 1, len(unique_labels))]
for k, col in zip(unique_labels, colors):
    if k == -1:
        # Black used for noise.
        col = [0, 0, 0, 1]

    class_member_mask = (labels == k)

    xy = X[class_member_mask & core_samples_mask]
    plt.plot(xy[:, 0], xy[:, 1], 'o', markerfacecolor=tuple(col),
             markeredgecolor='k', markersize=14)

    xy = X[class_member_mask & ~core_samples_mask]
    plt.plot(xy[:, 0], xy[:, 1], 'o', markerfacecolor=tuple(col),
             markeredgecolor='k', markersize=6)

plt.title('Estimated number of clusters: %d' % n_clusters_)
plt.show()

Final output:

Estimated number of clusters: 3
Estimated number of noise points: 18
Homogeneity: 0.953
Completeness: 0.883
V-measure: 0.917
Adjusted Rand Index: 0.952
Adjusted Mutual Information: 0.916
Silhouette Coefficient: 0.626

