Data Mining day 20, 21 - "Introduction to Data Mining" - Chapter 3, Exploring Data

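Contents

3.3.3-1. Visualizing a small number of attributes

2. Visualizing spatial data

This post mainly uses the iris data set and implements the visualization techniques from the book in Python.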

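3.3.3-1. Visualizing a small number of attributes

1.1 Stem-and-leaf plot

I implemented the stem-and-leaf plot before, in the earlier post "Statistics for Business and Economics (13th edition, Python) notes, chapters 01-02".
This version is slightly modified.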

import seaborn as sns
iris = sns.load_dataset("iris")
# Scale to integers (4.3 -> 43) so stems and leaves split cleanly
data = (iris['sepal_length'] * 10).round().astype(int)
# Sorted unique stems (the tens digit of the scaled value)
stems = sorted(set(data // 10))
for m in stems:
    # Leaves are the units digits of every value on this stem
    leaf = sorted(int(n % 10) for n in data if n // 10 == m)
    print(m, '|', ''.join(str(v) for v in leaf))

4 | 3444566667788888999999
5 | 0000000000111111111222234444445555555666666777777778888888999
6 | 000000111111222233333333344444445555566777777778889999
7 | 0122234677779

1.2 Histogram

import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
iris = sns.load_dataset("iris")
cols=['sepal_length','sepal_width','petal_length','petal_width']
bins=10
plt.figure(figsize=(20,4))
for i, col in enumerate(cols):
    # One panel per attribute, all with the same number of bins
    plt.subplot(1,4,i+1)
    plt.hist(iris[col],bins,histtype='bar',facecolor='yellowgreen',alpha=0.75,rwidth=0.95)
    plt.title(col)
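
To see what the bars of a histogram actually count, the binning can be reproduced by hand. The sketch below assumes equal-width bins in which every interval is half-open except the last one, which also includes the maximum (the convention NumPy documents for np.histogram). `equal_width_counts` is a hypothetical helper, and the sample values are invented rather than taken from iris.

```python
def equal_width_counts(values, bins):
    """Count values in `bins` equal-width intervals spanning [min, max]."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / bins  # assumes the values are not all identical
    counts = [0] * bins
    for v in values:
        # Every interval is half-open [a, b) except the last, which includes max
        k = min(int((v - lo) / width), bins - 1)
        counts[k] += 1
    edges = [lo + i * width for i in range(bins + 1)]
    return counts, edges

sample = [1.0, 1.5, 2.0, 2.5, 3.0, 3.5, 4.0, 5.0]
counts, edges = equal_width_counts(sample, bins=4)
print(counts)
print(edges)
```

With real measurements, floating-point rounding can shift a value that sits exactly on a bin edge, so small differences from a library's counts are possible.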

1.3 Two-dimensional histogram

Same data as before, now with the Axes3D toolkit, following the official example.

from mpl_toolkits.mplot3d import Axes3D  # registers the 3D projection on older matplotlib
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
bins=3
hist, xedges, yedges = np.histogram2d(iris['petal_width'], iris['petal_length'], bins=bins)
# Lower corner of each bin (drop the last edge); indexing='ij' matches hist[i, j]
xpos, ypos = np.meshgrid(xedges[:-1], yedges[:-1], indexing='ij')
xpos = xpos.ravel()
ypos = ypos.ravel()
zpos = np.zeros_like(xpos)
# Each bar's footprint is the bin width along that axis
dx = (xedges[1] - xedges[0]) * np.ones_like(zpos)
dy = (yedges[1] - yedges[0]) * np.ones_like(zpos)
dz = hist.ravel()
ax.bar3d(xpos, ypos, zpos, dx, dy, dz, color='yellowgreen', zsort='average')
# Flip the x axis directly instead of reversing the data and relabelling ticks by hand
ax.invert_xaxis()
ax.set_xlabel('petal width')
ax.set_ylabel('petal length')
plt.show()
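
np.histogram2d returns counts indexed as hist[x_bin, y_bin], so bar positions and bar heights have to be flattened in the same order. A small check with made-up points (not iris data) illustrates the orientation:

```python
import numpy as np

# Three made-up points: two with low x and high y, one with high x and low y
x = np.array([0.1, 0.2, 0.9])
y = np.array([0.9, 0.8, 0.1])
hist, xedges, yedges = np.histogram2d(x, y, bins=2, range=[[0, 1], [0, 1]])

# The first index follows x, the second index follows y
print(hist)

# With indexing='ij' the grid of bin corners lines up with hist element by element
xpos, ypos = np.meshgrid(xedges[:-1], yedges[:-1], indexing='ij')
for xc, yc, count in zip(xpos.ravel(), ypos.ravel(), hist.ravel()):
    print(xc, yc, count)
```

The two low-x, high-y points land in hist[0, 1], not hist[1, 0], which is the detail that is easy to get backwards when pairing the counts with a meshgrid.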

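1.4 Box plot

Box plots are fairly simple, so let's add some color while we're at it.

cols=['sepal_length','sepal_width','petal_length','petal_width']
# Any four matplotlib color names work here
colors=['pink','lightblue','lightgreen','khaki']
# boxplot draws one box per column of a 2-D array and returns a dict of artists
bp = plt.boxplot(iris[cols].values, patch_artist=True)
plt.xticks([1,2,3,4], cols)
for patch, color in zip(bp['boxes'], colors):
    patch.set_facecolor(color)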
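
A box plot is a drawing of a handful of summary numbers. As a rough sketch, assuming the common convention that whiskers reach the most extreme points within 1.5 IQR of the box (matplotlib's documented default for `whis`), those numbers can be computed directly. `box_stats` is a hypothetical helper and the sample is invented.

```python
from statistics import quantiles

def box_stats(values, whis=1.5):
    """Return quartiles, whisker ends, and outliers for one box."""
    data = sorted(values)
    # method='inclusive' interpolates linearly between order statistics
    q1, med, q3 = quantiles(data, n=4, method='inclusive')
    iqr = q3 - q1
    lo_fence, hi_fence = q1 - whis * iqr, q3 + whis * iqr
    # Whiskers stop at the furthest data points still inside the fences
    inside = [v for v in data if lo_fence <= v <= hi_fence]
    outliers = [v for v in data if v < lo_fence or v > hi_fence]
    return {'q1': q1, 'median': med, 'q3': q3,
            'whisker_low': inside[0], 'whisker_high': inside[-1],
            'outliers': outliers}

stats = box_stats([1, 2, 3, 4, 5, 6, 7, 8, 100])
print(stats)
```

Here the value 100 falls outside the upper fence, so it would be drawn as an individual point rather than stretching the whisker.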

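1.5 Pie chart

I already collected the nicer-looking pie charts in the earlier post "Statistics for Business and Economics (13th edition, Python) notes, chapters 01-02".
Here value_counts() summarizes the data, and a legend is added while we're at it.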

plt.pie(iris.species.value_counts(),labels=iris.species.value_counts().index)
plt.legend(loc="center left",bbox_to_anchor=(1, 0, 0.5, 1))

1.6 Empirical cumulative distribution function (ECDF)

The data has to be built by hand: count each distinct value, accumulate the counts, and divide by the total. Recomputing every running sum with reduce inside a loop gives the same numbers at extra cost, which hardly matters for a data set this small, but cumsum does it in one pass. The result is drawn with plt.step.

cols=['sepal_length','sepal_width','petal_length','petal_width']
plt.figure(figsize=(10,6))
for n, col in enumerate(cols):
    # Count each distinct value, ordered by value
    data=iris[col].value_counts().sort_index()
    # Running total divided by the sample size gives the ECDF heights
    y=data.cumsum()/data.sum()
    plt.subplot(2,2,n+1)
    # where='post': the curve jumps at each observed value and stays flat until the next
    plt.step(data.index,y,where='post')
    plt.grid(axis='both',linestyle='-')
    plt.title(col)
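
The ECDF at a value x is the fraction of observations less than or equal to x. A minimal sketch with invented numbers, using a hypothetical `ecdf` helper, makes the step heights easy to verify by hand:

```python
from collections import Counter
from itertools import accumulate

def ecdf(values):
    """Return the sorted distinct values and the fraction of data <= each one."""
    counts = Counter(values)
    xs = sorted(counts)
    # Running count of observations up to and including each distinct value
    running = accumulate(counts[x] for x in xs)
    n = len(values)
    return xs, [c / n for c in running]

xs, ys = ecdf([2, 1, 2, 3, 2, 4, 4, 1])
print(xs)
print(ys)
```

The last height is always 1, since every observation is less than or equal to the maximum.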

1.6 Percentile plot

cols=['sepal_length','sepal_width','petal_length','petal_width']
marker=['o','v','s','D']
x=list(range(0,101,10))
for n in range(len(cols)):
    data_per=[]
    for i in x:
        data_per.append(np.percentile(iris[cols[n]],i))
    plt.plot(x,data_per,marker=marker[n])
plt.legend(cols)
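
With its default settings, np.percentile interpolates linearly between neighbouring order statistics. The sketch below mirrors that rule in plain Python so the plotted values can be checked by hand; `percentile_linear` is a hypothetical name and the data are invented.

```python
import math

def percentile_linear(values, q):
    """q-th percentile (0-100) by linear interpolation between sorted values."""
    data = sorted(values)
    # Fractional position of the percentile within the sorted data
    pos = (len(data) - 1) * q / 100
    lower = math.floor(pos)
    upper = math.ceil(pos)
    frac = pos - lower
    return data[lower] + (data[upper] - data[lower]) * frac

sample = [10, 20, 30, 40, 50]
print([percentile_linear(sample, q) for q in (0, 25, 50, 90, 100)])
```

The 90th percentile falls 60% of the way between the fourth and fifth values, so it lands close to 46, up to floating-point rounding.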

1.7 Scatter plot matrix

The seaborn.PairGrid example is itself built on the iris data, though I'm not sure how best to position the legend.

g = sns.PairGrid(iris, hue="species", palette="Set2",hue_kws={"marker": ["o", "s", "D"]})
g = g.map_offdiag(plt.scatter, linewidths=1, edgecolor="w", s=40)
g.add_legend()

1.8 Scatter plot

cols=['sepal_length','sepal_width','petal_length','petal_width']
species=['versicolor', 'virginica', 'setosa']
fig, ax = plt.subplots()
for c,m,i in [('r', 'o',0), ('b', '^',1),('y','*',2)]:
    # Petal length against petal width, one species at a time
    iris_1=iris[iris.species==species[i]]
    ax.scatter(iris_1[cols[2]],iris_1[cols[3]],c=c,marker=m)
ax.legend(species,loc='upper left')
ax.set_xlabel('petal_length')
ax.set_ylabel('petal_width')

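1.9 Three-dimensional scatter plot

This feels a bit clumsy, but I couldn't find a better way. The for loop iterates over a list of tuples mainly as a reminder that this pattern exists.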

cols=['sepal_length','sepal_width','petal_length','petal_width']
species=['versicolor', 'virginica', 'setosa']
fig = plt.figure()
ax = fig.add_subplot(111, projection='3d')
for c,m,i in [('r', 'o',0), ('b', '^',1),('y','*',2)]:
    iris_1=iris[iris.species==species[i]]
    ax.scatter(iris_1[cols[0]],iris_1[cols[1]],iris_1[cols[2]],c=c,marker=m)
# One legend entry per species, in plotting order
ax.legend(species,loc='upper left')
ax.set_xlabel('sepal_length')
ax.set_ylabel('sepal_width')
ax.set_zlabel('petal_length')

2. Visualizing spatial data

2.1 Contour plot

An example copied from the matplotlib gallery (Contour plot of irregularly spaced data):

origin = 'lower'
delta = 0.025
x = y = np.arange(-3.0, 3.01, delta)
X, Y = np.meshgrid(x, y)
Z1 = np.exp(-X**2 - Y**2)
Z2 = np.exp(-(X - 1)**2 - (Y - 1)**2)
Z = (Z1 - Z2) * 2
fig1, ax2 = plt.subplots(constrained_layout=True)
CS = ax2.contourf(X, Y, Z, 10, cmap=plt.cm.bone, origin=origin)
CS2 = ax2.contour(CS, levels=CS.levels[::2], colors='r', origin=origin)
ax2.set_title('Nonsense (3 masked regions)')
ax2.set_xlabel('word length anomaly')
ax2.set_ylabel('sentence length anomaly')
cbar = fig1.colorbar(CS)
cbar.ax.set_ylabel('verbosity coefficient')
cbar.add_lines(CS2)

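2.2 Surface plot

Surface plots can wait until Chapter 9; for now, here is a kernel density plot.

x=[4,6,1,2,4,6,7,1,2,4,6,7]
y=[1,1,4,4,4,4,4,5,5,5,5,5]
plt.scatter(x,y)
# Recent seaborn versions require x and y to be passed as keywords
sns.kdeplot(x=x,y=y)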

2.2 Parallel coordinates

Using pandas.plotting.parallel_coordinates:

from pandas.plotting import parallel_coordinates
fig,axes = plt.subplots()
parallel_coordinates(iris,'species',ax=axes)

2.3 Star coordinates

I couldn't find a library for drawing Chernoff faces, so I hand-rolled a star coordinates plot instead. There is no random sampling; it simply takes the first 5 flowers of each species.

cols=['sepal_length','sepal_width','petal_length','petal_width']
species=['versicolor', 'virginica', 'setosa']
for sp in species:
    # Row labels of the first five flowers of this species
    numbers=list(iris[iris.species==sp].index)[:5]
    plt.figure(figsize=(10,3))
    for n, idx in enumerate(numbers):
        sl, sw, pl, pw = iris.loc[idx, cols]
        # One point per attribute on the +x, +y, -x, -y axes; visit order 1-2-3-4-1-3-2-4
        x=[sl,0,-pl,0,sl,-pl,0,0]
        y=[0,sw,0,-pw,0,0,sw,-pw]
        plt.subplot(1,5,n+1)
        plt.scatter(x,y)
        plt.plot(x,y,c='r')
        # Same limits on every panel so the shapes are comparable
        plt.xlim(-7,8)
        plt.ylim(-3,5)
        # Hide the ticks
        plt.xticks([])
        plt.yticks([])
        plt.title('%s %i' % (sp, idx))
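
The hand-written x and y lists above place four attributes on the four half-axes. For a different number of attributes, one option (an assumption on my part, not something the code above does) is to spread the rays at equal angles and compute each vertex with cosine and sine. `star_vertices` is a hypothetical helper.

```python
import math

def star_vertices(values):
    """Place each value on its own ray, with rays spaced evenly around the origin."""
    k = len(values)
    points = []
    for i, v in enumerate(values):
        angle = 2 * math.pi * i / k
        points.append((v * math.cos(angle), v * math.sin(angle)))
    return points

# Measurements of the first row of the iris data set (a setosa flower)
pts = star_vertices([5.1, 3.5, 1.4, 0.2])
for px, py in pts:
    print(round(px, 2), round(py, 2))
```

With four values this gives the same +x, +y, -x, -y layout as the lists above, up to floating-point noise near zero.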
