Machine Learning In Action - Chapter 10 k-means clustering

The k-means algorithm proceeds as follows:

Create k points as initial centroids (often chosen at random)
While the cluster assignment of any point has changed:
    For every data point in the data set:
        For every centroid:
            Compute the distance between the centroid and the data point
        Assign the data point to the cluster with the nearest centroid
    For every cluster, compute the mean of all its points and use that mean as the new centroid

Python implementation (the book's code expects a wholesale NumPy import and an Euclidean distance helper, distEclud, which kMeans below uses by default):

from numpy import *

def distEclud(vecA, vecB):
    # Euclidean distance between two row vectors
    return sqrt(sum(power(vecA - vecB, 2)))

def randCent(dataSet, k):
    n = shape(dataSet)[1] # n features
    centroids = mat(zeros((k,n)))
    for j in range(n):
        # draw each coordinate uniformly within that feature's observed range
        minJ = min(dataSet[:,j])
        rangeJ = float(max(dataSet[:,j]) - minJ)
        centroids[:,j] = minJ + rangeJ * random.rand(k,1)
    return centroids

def kMeans(dataSet, k, distMeas=distEclud, createCent=randCent):
    m = shape(dataSet)[0]
    clusterAssment = mat(zeros((m,2))) # per point: assigned cluster index, squared distance
    centroids = createCent(dataSet, k) # k centroids, k rows x n columns
    clusterChanged = True
    while clusterChanged:
        clusterChanged = False
        # assign each data point to its nearest cluster
        for i in range(m):
            minDist = inf; minIndex = -1
            for j in range(k):
                # distance between two row vectors
                distJI = distMeas(centroids[j,:],dataSet[i,:])
                if distJI < minDist:
                    minDist = distJI; minIndex = j
            if clusterAssment[i,0] != minIndex: 
                clusterChanged = True
            clusterAssment[i,:] = minIndex,minDist**2
        print(centroids)
        # recompute each centroid as the mean of the points assigned to it
        for cent in range(k):
            ptsInClust = dataSet[nonzero(clusterAssment[:,0].A==cent)[0]]
            centroids[cent,:] = mean(ptsInClust, axis=0) # column-wise mean over each feature
    return centroids, clusterAssment
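
To try the functions out, here is a minimal usage sketch. The three synthetic 2-D Gaussian blobs are made-up illustration data, standing in for the book's testSet.txt file:

random.seed(0) # NumPy's random module, via "from numpy import *"
blobs = vstack([random.randn(30, 2) + [0, 0],
                random.randn(30, 2) + [5, 5],
                random.randn(30, 2) + [0, 5]])
dataMat = mat(blobs)

centroids, clustAssing = kMeans(dataMat, 3)
print(centroids)              # 3 x 2 matrix of the final centroids
print(sum(clustAssing[:, 1])) # total SSE of the clustering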

The algorithm above may converge to a local minimum. One improvement that mitigates this is the bisecting k-means algorithm:

Treat all the points as one cluster
While the number of clusters is less than k:
    For each cluster:
        Compute the total error (SSE)
        Run k-means (k=2) on the cluster
        Compute the total error after splitting the cluster in two
    Perform the split on the cluster whose split gives the lowest total error

Python implementation:

def biKmeans(dataSet, k, distMeas=distEclud):
    m = shape(dataSet)[0]
    clusterAssment = mat(zeros((m,2)))
    centroid0 = mean(dataSet, axis=0).tolist()[0]
    centList = [centroid0] # list of centroids, grows to k entries
    for j in range(m):
        clusterAssment[j,1] = distMeas(mat(centroid0), dataSet[j,:])**2
    while (len(centList) < k):
        lowestSSE = inf
        for i in range(len(centList)):
            ptsInCurrCluster = dataSet[nonzero(clusterAssment[:,0].A==i)[0],:]
            centroidMat, splitClustAss = kMeans(ptsInCurrCluster, 2 , distMeas)
            # SSE of cluster i's points after splitting it in two
            sseSplit = sum(splitClustAss[:,1])
            # SSE of all the other clusters, unchanged by the split
            sseNotSplit = sum(clusterAssment[nonzero(clusterAssment[:,0].A!=i)[0],1])
            print("sseSplit, and notSplit: ", sseSplit, sseNotSplit)
            if (sseSplit + sseNotSplit) < lowestSSE:
                bestCentToSplit = i
                bestNewCents = centroidMat # the two new centroids
                bestClustAss = splitClustAss.copy() 
                lowestSSE = sseSplit + sseNotSplit
        # the second of the two new clusters gets a brand-new index: the current list length
        bestClustAss[nonzero(bestClustAss[:,0].A == 1)[0],0] = len(centList)
        # the first keeps the index of the cluster that was split
        bestClustAss[nonzero(bestClustAss[:,0].A == 0)[0],0] = bestCentToSplit
        print('the bestCentToSplit is: ', bestCentToSplit)
        print('the len of bestClustAss is: ', len(bestClustAss))
        # overwrite the old centroid with the first of the two new ones
        centList[bestCentToSplit] = bestNewCents[0,:]
        # append the second new centroid to the end of the list
        centList.append(bestNewCents[1,:])
        # reassign the split cluster's rows (cluster index, squared distance); the row order matches one-to-one
        clusterAssment[nonzero(clusterAssment[:,0].A == bestCentToSplit)[0],:] = bestClustAss
    # unwrap each 1 x n matrix row into a plain list so centList can be re-wrapped as one matrix
    for i in range(k):
        centList[i] = centList[i].tolist()[0]
    return mat(centList), clusterAssment
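
A matching sketch for biKmeans, again on made-up data rather than the book's test file:

random.seed(1)
dataMat2 = mat(random.randn(60, 2))
dataMat2[30:, :] = dataMat2[30:, :] + 4 # shift half the points into a second blob

myCentroids, clustAssment = biKmeans(dataMat2, 3)
print(myCentroids)              # 3 x 2 matrix of the final centroids
print(sum(clustAssment[:, 1]))  # total SSE after bisecting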