Chapter 10 - k-means clustering
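k-means algorithm outline:
    Create k points as the initial centroids (often chosen at random)
    While the cluster assignment of any point changes
        For each data point in the dataset
            For each centroid
                Compute the distance between the centroid and the data point
            Assign the data point to the nearest cluster
        For each cluster, compute the mean of all its points and use that mean as the new centroid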
Python implementation
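from numpy import *  # the code relies on NumPy names such as shape, asmatrix, zeros, inf, nonzero, mean


def randCent(dataSet, k):
    """Return k random centroids, each inside the bounding box of dataSet."""
    n = shape(dataSet)[1]  # number of features
    centroids = asmatrix(zeros((k, n)))  # asmatrix is the current name for mat
    for j in range(n):
        # draw the j-th coordinate uniformly between the column's min and max
        minJ = dataSet[:, j].min()
        rangeJ = float(dataSet[:, j].max() - minJ)
        centroids[:, j] = minJ + rangeJ * random.rand(k, 1)
    return centroids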
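kMeans takes distEclud as its default distance measure, but these notes never define it. A minimal sketch, assuming it is the ordinary Euclidean distance between two row vectors (the body is an assumption, not taken from the notes):

```python
import numpy as np


def distEclud(vecA, vecB):
    """Euclidean distance between two vectors of equal length (assumed definition)."""
    # asarray makes the arithmetic element-wise for lists, arrays, and matrices alike
    diff = np.asarray(vecA, dtype=float) - np.asarray(vecB, dtype=float)
    return float(np.sqrt(np.sum(diff ** 2)))
```

Any function with the same two-argument signature could be passed as distMeas instead.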
def kMeans(dataSet, k, distMeas=distEclud, createCent=randCent):
    m = shape(dataSet)[0]
    # column 0: index of the assigned cluster; column 1: squared distance to its centroid
    clusterAssment = asmatrix(zeros((m, 2)))
    centroids = createCent(dataSet, k)  # k centroids, one per row (k x n)
    clusterChanged = True
    while clusterChanged:
        clusterChanged = False
        # assign every point to its nearest centroid
        for i in range(m):
            minDist = inf
            minIndex = -1
            for j in range(k):
                # distance between two row vectors
                distJI = distMeas(centroids[j, :], dataSet[i, :])
                if distJI < minDist:
                    minDist = distJI
                    minIndex = j
            if clusterAssment[i, 0] != minIndex:
                clusterChanged = True
            clusterAssment[i, :] = minIndex, minDist**2
        print(centroids)
        # move each centroid to the mean of its assigned points
        for cent in range(k):
            ptsInClust = dataSet[nonzero(clusterAssment[:, 0].A == cent)[0]]
            if len(ptsInClust) > 0:  # keep the old centroid if the cluster is empty
                centroids[cent, :] = mean(ptsInClust, axis=0)  # column-wise mean
    return centroids, clusterAssment
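The result of this procedure depends on the starting centroids, and the iteration can stop at a poor solution. The sketch below illustrates that on four hand-picked points. It is a simplified stand-in, not the kMeans function above: it uses plain NumPy arrays, takes the initial centroids as an argument, and the name lloyd and the data are made up for illustration.

```python
import numpy as np


def lloyd(points, init_centroids, max_iter=100):
    """Plain k-means iterations from given starting centroids (illustrative helper)."""
    centroids = np.array(init_centroids, dtype=float)
    for _ in range(max_iter):
        # squared distance from every point to every centroid, shape (m, k)
        d2 = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        updated = centroids.copy()
        for c in range(len(centroids)):
            members = points[labels == c]
            if len(members) > 0:  # an empty cluster keeps its old centroid
                updated[c] = members.mean(axis=0)
        if np.allclose(updated, centroids):
            break
        centroids = updated
    # recompute assignments and the sum of squared errors for the final centroids
    d2 = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return centroids, d2.argmin(axis=1), float(d2.min(axis=1).sum())


# corners of a wide rectangle: the natural split is left pair vs right pair
points = np.array([[0.0, 0.0], [0.0, 1.0], [4.0, 0.0], [4.0, 1.0]])

# start between the left and right pairs
_, labels_good, sse_good = lloyd(points, [[0.0, 0.5], [4.0, 0.5]])
# start between the bottom and top pairs: the iteration is already stable here
_, labels_bad, sse_bad = lloyd(points, [[2.0, 0.0], [2.0, 1.0]])

print(sse_good, sse_bad)
```

Both runs satisfy the stopping rule, but the second groups the bottom points against the top points and ends with a much larger sum of squared errors.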
The algorithm above may converge to a local minimum. One improved variant is bisecting k-means, which proceeds as follows:
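    Treat all points as a single cluster
    While the number of clusters is less than k
        For each cluster
            Compute the total error
            Run k-means (k=2) on that cluster
            Compute the total error after splitting that cluster in two
        Split the cluster whose split gives the lowest total error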
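The selection rule in the last step can be written compactly. The notation is introduced here for illustration and is not part of the original notes: C_j is cluster j, \mu_C is the mean of cluster C, and C_i^{(1)}, C_i^{(2)} are the two halves obtained by running 2-means on C_i.

```latex
\mathrm{SSE}(C) = \sum_{x \in C} \lVert x - \mu_C \rVert^2
\qquad
i^{*} = \arg\min_{i} \Big[ \mathrm{SSE}\big(C_i^{(1)}\big) + \mathrm{SSE}\big(C_i^{(2)}\big) + \sum_{j \neq i} \mathrm{SSE}(C_j) \Big]
```

In the implementation, the first two terms together correspond to sseSplit and the remaining sum to sseNotSplit.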
Python implementation
def biKmeans(dataSet, k, distMeas=distEclud):
    m = shape(dataSet)[0]
    # column 0: index of the assigned cluster; column 1: squared distance to its centroid
    clusterAssment = asmatrix(zeros((m, 2)))
    centroid0 = mean(dataSet, axis=0).tolist()[0]
    centList = [centroid0]  # list of centroids, each stored as a plain list
    for j in range(m):
        clusterAssment[j, 1] = distMeas(asmatrix(centroid0), dataSet[j, :])**2
    while len(centList) < k:
        lowestSSE = inf
        for i in range(len(centList)):
            ptsInCurrCluster = dataSet[nonzero(clusterAssment[:, 0].A == i)[0], :]
            centroidMat, splitClustAss = kMeans(ptsInCurrCluster, 2, distMeas)
            # error of cluster i's points after splitting it in two
            sseSplit = splitClustAss[:, 1].sum()
            # error of the points in all other clusters
            sseNotSplit = clusterAssment[nonzero(clusterAssment[:, 0].A != i)[0], 1].sum()
            print("sseSplit, and notSplit: ", sseSplit, sseNotSplit)
            if (sseSplit + sseNotSplit) < lowestSSE:
                bestCentToSplit = i
                bestNewCents = centroidMat  # the two new centroids
                bestClustAss = splitClustAss.copy()
                lowestSSE = sseSplit + sseNotSplit
        # the second new cluster takes the next free index, len(centList)
        bestClustAss[nonzero(bestClustAss[:, 0].A == 1)[0], 0] = len(centList)
        # the first new cluster keeps the index of the cluster that was split
        bestClustAss[nonzero(bestClustAss[:, 0].A == 0)[0], 0] = bestCentToSplit
        print('the bestCentToSplit is: ', bestCentToSplit)
        print('the len of bestClustAss is: ', len(bestClustAss))
        # replace the split cluster's centroid with the first new centroid
        centList[bestCentToSplit] = bestNewCents[0, :].tolist()[0]
        # append the second new centroid to the end of the list
        centList.append(bestNewCents[1, :].tolist()[0])
        # overwrite the (cluster index, squared distance) rows of the split cluster;
        # the rows line up one-to-one with bestClustAss
        clusterAssment[nonzero(clusterAssment[:, 0].A == bestCentToSplit)[0], :] = bestClustAss
    # every centroid is already a plain list, so this also works when k == 1
    return asmatrix(centList), clusterAssment
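For experimenting with the bisecting idea in isolation, here is a compact, self-contained sketch. It is not the biKmeans function above and makes several assumptions of its own: it uses plain NumPy arrays, seeds each 2-way split deterministically from the two extreme points along the widest feature instead of using random centroids, and skips clusters with fewer than two points. The names two_means and bisecting_kmeans and the sample data are hypothetical.

```python
import numpy as np


def two_means(points, max_iter=100):
    """Split points into two clusters; returns centroids, labels, squared errors."""
    # assumed seeding: the extreme points along the feature with the largest spread
    axis = int((points.max(axis=0) - points.min(axis=0)).argmax())
    centroids = np.array([points[points[:, axis].argmin()],
                          points[points[:, axis].argmax()]], dtype=float)
    for _ in range(max_iter):
        d2 = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
        labels = d2.argmin(axis=1)
        updated = centroids.copy()
        for c in range(2):
            if np.any(labels == c):
                updated[c] = points[labels == c].mean(axis=0)
        if np.allclose(updated, centroids):
            break
        centroids = updated
    d2 = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    return centroids, d2.argmin(axis=1), d2.min(axis=1)


def bisecting_kmeans(points, k):
    points = np.asarray(points, dtype=float)
    labels = np.zeros(len(points), dtype=int)   # everything starts in cluster 0
    centroids = [points.mean(axis=0)]
    sq_err = ((points - centroids[0]) ** 2).sum(axis=1)
    while len(centroids) < k:
        best = None
        for i in range(len(centroids)):
            mask = labels == i
            if mask.sum() < 2:
                continue  # a cluster with fewer than two points cannot be split
            cents, sub_labels, sub_err = two_means(points[mask])
            # error inside the split cluster plus error of all untouched clusters
            total = sub_err.sum() + sq_err[~mask].sum()
            if best is None or total < best[0]:
                best = (total, i, mask, cents, sub_labels, sub_err)
        if best is None:
            break  # nothing left to split
        _, i, mask, cents, sub_labels, sub_err = best
        idx = np.flatnonzero(mask)
        labels[idx[sub_labels == 1]] = len(centroids)  # second half gets a new index
        sq_err[idx] = sub_err
        centroids[i] = cents[0]       # first half keeps the old index
        centroids.append(cents[1])
    return np.array(centroids), labels, sq_err


data = [[0, 0], [0, 1], [10, 0], [10, 1], [20, 0], [20, 1]]
cents, labels, sq_err = bisecting_kmeans(data, 3)
print(cents)
print(labels, sq_err.sum())
```

Because every split is seeded deterministically, repeated runs on the same data give the same clusters, which makes the splitting order easier to follow than with random starts.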