[Data Analysis] The Most Valuable Matplotlib Visualization Charts, Part 1: Correlation

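This article summarizes the 50 most commonly used Matplotlib and Seaborn charts, the ones most useful for data analysis and visualization. Mastering them goes a long way in data-analysis work, and the list lets you pick the right visualization for a given situation using Python's matplotlib and seaborn libraries. To run the code you need matplotlib and seaborn plus a few helper libraries, which are noted in the code sections.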

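Part 1 starts here: Correlation

Introduction

The charts are grouped into 7 scenarios according to the goal of the visualization. For example, to picture the relationship between two variables, see the charts under "Correlation". To show how values change over time, see the "Change" section, and so on.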

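Key properties of an effective chart:

It conveys the right and necessary information without distorting the facts.
It is simple in design, so the reader does not have to strain to understand it.
Its aesthetics support the information rather than obscure it.
It is not overloaded with information.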

Setup

Run the settings below before running any of the chart code. Individual charts can still override these display settings.

# !pip install brewer2mpl
import numpy as np
import pandas as pd
import matplotlib as mpl
import matplotlib.pyplot as plt
import seaborn as sns
import warnings; warnings.filterwarnings(action='once')

large = 22; med = 16; small = 12
params = {'legend.fontsize': med,
          'figure.figsize': (16, 10),
          'axes.labelsize': med,
          'axes.titlesize': med,
          'xtick.labelsize': med,
          'ytick.labelsize': med,
          'figure.titlesize': large}
plt.rcParams.update(params)
# Matplotlib 3.6+ renamed this style to 'seaborn-v0_8-whitegrid'
plt.style.use('seaborn-whitegrid')
sns.set_style("white")
# %matplotlib inline

# Version
print(mpl.__version__)  # >> 3.0.2
print(sns.__version__)  # >> 0.9.0
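
The style name used above depends on the installed Matplotlib version, so the setup may fail on a newer install. The following is a minimal sketch of one way to cope with that, assuming the two style names listed are the only candidates of interest; `pick_style` is a hypothetical helper, not part of Matplotlib.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display; drop this in a notebook
import matplotlib.pyplot as plt


def pick_style(candidates):
    """Return the first style in `candidates` that this install provides (hypothetical helper)."""
    for name in candidates:
        if name in plt.style.available:
            return name
    return "default"  # Matplotlib's built-in fallback style


# Newer name first, then the older name used in this article
style_name = pick_style(["seaborn-v0_8-whitegrid", "seaborn-whitegrid"])
plt.style.use(style_name)
print(style_name)
```

Checking `plt.style.available` avoids hard-coding a version number, at the cost of silently falling back to the default style when neither name exists.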

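Correlation
Correlation charts visualize the relationship between two or more variables, that is, how one variable changes with respect to another.

Correlation - 1 Scatter plot

The scatter plot is the classic, basic chart for studying the relationship between two variables. If the data contains several groups, you may want to draw each group in a different color. In matplotlib, plt.scatter() does this conveniently.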

# Import dataset
midwest = pd.read_csv("https://raw.githubusercontent.com/selva86/datasets/master/midwest_filter.csv")

# Prepare Data
# Create as many colors as there are unique midwest['category']
categories = np.unique(midwest['category'])
colors = [plt.cm.tab10(i/float(len(categories)-1)) for i in range(len(categories))]

# Draw Plot for Each Category
plt.figure(figsize=(16, 10), dpi= 80, facecolor='w', edgecolor='k')

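for i, category in enumerate(categories):
    # c takes a one-row list so the RGBA tuple is read as a single color
    # for the whole group; cmap only applies when c holds numeric values
    plt.scatter('area', 'poptotal',
                data=midwest.loc[midwest.category==category, :],
                s=20, c=[colors[i]], label=str(category))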

# Decorations
plt.gca().set(xlim=(0.0, 0.1), ylim=(0, 90000),
              xlabel='Area', ylabel='Population')

plt.xticks(fontsize=12); plt.yticks(fontsize=12)
plt.title("Scatterplot of Midwest Area vs Population", fontsize=22)
plt.legend(fontsize=12)    
plt.show()
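
The per-group coloring can be tried without downloading the dataset. This sketch uses a small synthetic DataFrame (the column names `x`, `y`, `group` are made up for illustration) and assumes one `scatter` call per group, each with a single RGBA color wrapped in a list.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line in a notebook
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

# Synthetic stand-in for the real data: 3 groups of 10 points each
rng = np.random.default_rng(0)
toy = pd.DataFrame({
    "x": rng.random(30),
    "y": rng.random(30),
    "group": np.repeat(["a", "b", "c"], 10),
})

fig, ax = plt.subplots()
groups = sorted(toy["group"].unique())
for i, name in enumerate(groups):
    subset = toy[toy["group"] == name]
    rgba = plt.cm.tab10(i)  # an integer index picks the i-th tab10 color
    # Wrapping the RGBA tuple in a list marks it as one color for all points
    ax.scatter(subset["x"], subset["y"], s=20, c=[rgba], label=name)
ax.legend()
```

Each call adds one collection to the axes, so the legend gets one entry per group.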

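2 Bubble plot with Encircling

Sometimes you want to show a group of points inside a boundary to emphasize their importance. In this example, the records to highlight are selected from the dataframe and passed to the encircle() function defined in the code below, which draws the boundary.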

from matplotlib import patches
from scipy.spatial import ConvexHull
import warnings; warnings.simplefilter('ignore')
sns.set_style("white")

# Step 1: Prepare Data
midwest = pd.read_csv("https://raw.githubusercontent.com/selva86/datasets/master/midwest_filter.csv")

# As many colors as there are unique midwest['category']
categories = np.unique(midwest['category'])
colors = [plt.cm.tab10(i/float(len(categories)-1)) for i in range(len(categories))]

# Step 2: Draw Scatterplot with unique color for each category
fig = plt.figure(figsize=(16, 10), dpi= 80, facecolor='w', edgecolor='k')    

for i, category in enumerate(categories):
    # c takes a one-row list so the RGBA tuple is read as a single color for the group
    plt.scatter('area', 'poptotal', data=midwest.loc[midwest.category==category, :],
                s='dot_size', c=[colors[i]], label=str(category), edgecolors='black', linewidths=.5)

# Step 3: Encircling
# https://stackoverflow.com/questions/44575681/how-do-i-encircle-different-data-sets-in-scatter-plot
def encircle(x, y, ax=None, **kw):
    # Draw the convex hull of the points (x, y) as a polygon patch on ax
    if ax is None: ax = plt.gca()
    p = np.c_[x, y]
    hull = ConvexHull(p)
    poly = plt.Polygon(p[hull.vertices, :], **kw)
    ax.add_patch(poly)

# Select data to be encircled
midwest_encircle_data = midwest.loc[midwest.state=='IN', :]                         

# Draw polygon surrounding vertices    
encircle(midwest_encircle_data.area, midwest_encircle_data.poptotal, ec="k", fc="gold", alpha=0.1)
encircle(midwest_encircle_data.area, midwest_encircle_data.poptotal, ec="firebrick", fc="none", linewidth=1.5)

# Step 4: Decorations
plt.gca().set(xlim=(0.0, 0.1), ylim=(0, 90000),
              xlabel='Area', ylabel='Population')

plt.xticks(fontsize=12); plt.yticks(fontsize=12)
plt.title("Bubble Plot with Encircling", fontsize=22)
plt.legend(fontsize=12)    
plt.show()
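
To see what encircle() relies on, it may help to look at the convex hull on its own. This is a small sketch with made-up points: four corners of a unit square plus one interior point, which should not appear among the hull vertices.

```python
import numpy as np
from scipy.spatial import ConvexHull

# Four corners of a unit square and one point strictly inside it
points = np.array([
    [0.0, 0.0],
    [1.0, 0.0],
    [1.0, 1.0],
    [0.0, 1.0],
    [0.5, 0.5],
])

hull = ConvexHull(points)
# hull.vertices holds the indices of the points that lie on the hull
outline = points[hull.vertices]
print(sorted(hull.vertices.tolist()))  # the interior point (index 4) is absent
```

The `outline` array is what gets handed to the polygon patch, so only boundary points shape the drawn region.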

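3 Scatter plot with linear regression line of best fit

When you want to understand how two variables change with respect to each other, a line of best fit is the usual tool. The plot below shows how the lines of best fit differ between the groups in the data. To disable the grouping and draw a single line of best fit for the whole dataset, remove the hue="cyl" argument from the sns.lmplot() call below. The call passes robust=True, which requires the statsmodels package to be installed.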

# Import Data
df = pd.read_csv("https://raw.githubusercontent.com/selva86/datasets/master/mpg_ggplot2.csv")
df_select = df.loc[df.cyl.isin([4,8]), :]

# Plot
sns.set_style("white")
gridobj = sns.lmplot(x="displ", y="hwy", hue="cyl", data=df_select,
                     height=7, aspect=1.6, robust=True, palette='tab10',
                     scatter_kws=dict(s=60, linewidths=.7, edgecolors='black'))

# Decorations
gridobj.set(xlim=(0.5, 7.5), ylim=(0, 50))
plt.title("Scatterplot with line of best fit grouped by number of cylinders", fontsize=20)
plt.show()
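
As a point of reference for what a line of best fit is, here is a minimal sketch of an ordinary least-squares fit with NumPy on made-up data. It is not what sns.lmplot() computes when robust=True is set (that option fits a robust regression), so the numbers from the two approaches can differ on real data.

```python
import numpy as np

# Made-up data lying exactly on the line y = 2x + 1
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = 2.0 * x + 1.0

# A degree-1 polynomial fit returns the coefficients as [slope, intercept]
slope, intercept = np.polyfit(x, y, deg=1)

# Predicted values along the fitted line
y_hat = slope * x + intercept
print(round(slope, 6), round(intercept, 6))
```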

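Regression line in a separate column per group

Alternatively, you can show the line of best fit for each group in its own column by setting the col=groupingcolumn argument in sns.lmplot(), as follows: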

# Import Data
df = pd.read_csv("https://raw.githubusercontent.com/selva86/datasets/master/mpg_ggplot2.csv")
df_select = df.loc[df.cyl.isin([4,8]), :]

# Each line in its own column
sns.set_style("white")
gridobj = sns.lmplot(x="displ", y="hwy",
                     data=df_select,
                     height=7,
                     robust=True,
                     palette='Set1',
                     col="cyl",
                     scatter_kws=dict(s=60, linewidths=.7, edgecolors='black'))

# Decorations
gridobj.set(xlim=(0.5, 7.5), ylim=(0, 50))
plt.show()

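4 Jittering with stripplot

Often several data points share exactly the same X and Y values, so they are drawn on top of each other and hide one another. To avoid this, jitter the points slightly so that each one is visible. seaborn's stripplot() makes this convenient.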

# Import Data
df = pd.read_csv("https://raw.githubusercontent.com/selva86/datasets/master/mpg_ggplot2.csv")

# Draw Stripplot
fig, ax = plt.subplots(figsize=(16,10), dpi= 80)    
sns.stripplot(x=df.cty, y=df.hwy, jitter=0.25, size=8, ax=ax, linewidth=.5)

# Decorations
plt.title('Use jittered plots to avoid overlapping of points', fontsize=22)
plt.show()
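
Jittering itself is just a small random offset added to the plotted positions. The sketch below does it by hand with NumPy on made-up values; the uniform noise and the `width` name are assumptions for illustration, not a description of how stripplot() is implemented internally.

```python
import numpy as np

rng = np.random.default_rng(42)

# Made-up x positions with many exact duplicates
x = np.array([1.0, 1.0, 1.0, 2.0, 2.0, 3.0])

width = 0.25  # maximum offset on either side of the true value
x_jittered = x + rng.uniform(-width, width, size=x.shape)

# Every jittered value stays within `width` of its original position
print(np.abs(x_jittered - x).max() <= width)
```

Only the plotted positions move; any statistics should still be computed on the original `x`.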

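5 Counts Plot

Another way to deal with overlapping points is to scale the size of each point by the number of observations at that location. The larger the point, the more observations are concentrated there.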

# Import Data
df = pd.read_csv("https://raw.githubusercontent.com/selva86/datasets/master/mpg_ggplot2.csv")
df_counts = df.groupby(['hwy', 'cty']).size().reset_index(name='counts')

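# Draw counts plot
# ax.scatter accepts one size per point (s is the marker area in points^2)
fig, ax = plt.subplots(figsize=(16,10), dpi= 80)    
ax.scatter(df_counts.cty, df_counts.hwy, s=(df_counts.counts*2)**2)
ax.set(xlabel='cty', ylabel='hwy')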

# Decorations
plt.title('Counts Plot - Size of circle is bigger as more points overlap', fontsize=22)
plt.show()
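
The counting step is the core of this chart, so here is a small sketch of it on a made-up six-row table that reuses the `cty` and `hwy` column names. Each unique (hwy, cty) pair becomes one row with the number of times it occurs.

```python
import pandas as pd

# Made-up data: the pair (cty=18, hwy=29) occurs three times
toy = pd.DataFrame({
    "cty": [18, 18, 18, 21, 21, 16],
    "hwy": [29, 29, 29, 29, 30, 26],
})

# One row per unique (hwy, cty) pair, with its frequency in a 'counts' column
counts = toy.groupby(["hwy", "cty"]).size().reset_index(name="counts")
print(counts)
```

The `counts` column is what drives the marker size, so the repeated pair ends up three times as heavy as the others.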

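6 Marginal Histogram

A marginal histogram has histograms along the X and Y axis variables. It shows the relationship between X and Y together with the univariate distribution of each. This plot is often used in exploratory data analysis (EDA).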

# Import Data
df = pd.read_csv("https://raw.githubusercontent.com/selva86/datasets/master/mpg_ggplot2.csv")

# Create Fig and gridspec
fig = plt.figure(figsize=(16, 10), dpi= 80)
grid = plt.GridSpec(4, 4, hspace=0.5, wspace=0.2)

# Define the axes
ax_main = fig.add_subplot(grid[:-1, :-1])
ax_right = fig.add_subplot(grid[:-1, -1], xticklabels=[], yticklabels=[])
ax_bottom = fig.add_subplot(grid[-1, 0:-1], xticklabels=[], yticklabels=[])

# Scatterplot on main ax
ax_main.scatter('displ', 'hwy', s=df.cty*4, c=df.manufacturer.astype('category').cat.codes, alpha=.9, data=df, cmap="tab10", edgecolors='gray', linewidths=.5)

# histogram at the bottom
ax_bottom.hist(df.displ, 40, histtype='stepfilled', orientation='vertical', color='deeppink')
ax_bottom.invert_yaxis()

# histogram on the right
ax_right.hist(df.hwy, 40, histtype='stepfilled', orientation='horizontal', color='deeppink')

# Decorations
ax_main.set(title='Scatterplot with Histograms \n displ vs hwy', xlabel='displ', ylabel='hwy')
ax_main.title.set_fontsize(20)
for item in ([ax_main.xaxis.label, ax_main.yaxis.label] + ax_main.get_xticklabels() + ax_main.get_yticklabels()):
    item.set_fontsize(14)

xlabels = ax_main.get_xticks().tolist()
ax_main.set_xticklabels(xlabels)
plt.show()
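
In the layout above the marginal axes are independent of the main axes, so their ranges are not guaranteed to line up with the scatter plot. One possible variation, sketched here on synthetic data, is to share the x axis with the bottom histogram and the y axis with the right one. This is an optional refinement and an assumption of this sketch, not part of the original recipe.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend; drop this line in a notebook
import matplotlib.pyplot as plt
import numpy as np

# Synthetic, loosely correlated data
rng = np.random.default_rng(1)
x = rng.normal(size=200)
y = 0.5 * x + rng.normal(scale=0.5, size=200)

fig = plt.figure(figsize=(8, 6))
grid = plt.GridSpec(4, 4, hspace=0.5, wspace=0.2)

ax_main = fig.add_subplot(grid[:-1, :-1])
# Sharing axes keeps each marginal plot aligned with the main plot
ax_right = fig.add_subplot(grid[:-1, -1], sharey=ax_main)
ax_bottom = fig.add_subplot(grid[-1, :-1], sharex=ax_main)

ax_main.scatter(x, y, s=10)
ax_bottom.hist(x, bins=30)
ax_right.hist(y, bins=30, orientation="horizontal")
ax_bottom.invert_yaxis()  # bars hang down from the main plot
```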

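7 Marginal Boxplot

The marginal boxplot serves a similar purpose to the marginal histogram. The boxplot, however, helps pinpoint the median and the 25th and 75th percentiles of X and Y.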

# Import Data
df = pd.read_csv("https://raw.githubusercontent.com/selva86/datasets/master/mpg_ggplot2.csv")

# Create Fig and gridspec
fig = plt.figure(figsize=(16, 10), dpi= 80)
grid = plt.GridSpec(4, 4, hspace=0.5, wspace=0.2)

# Define the axes
ax_main = fig.add_subplot(grid[:-1, :-1])
ax_right = fig.add_subplot(grid[:-1, -1], xticklabels=[], yticklabels=[])
ax_bottom = fig.add_subplot(grid[-1, 0:-1], xticklabels=[], yticklabels=[])

# Scatterplot on main ax
ax_main.scatter('displ', 'hwy', s=df.cty*5, c=df.manufacturer.astype('category').cat.codes, alpha=.9, data=df, cmap="Set1", edgecolors='black', linewidths=.5)

# Add a graph in each part
sns.boxplot(y=df.hwy, ax=ax_right)       # vertical box for hwy
sns.boxplot(x=df.displ, ax=ax_bottom)    # horizontal box for displ

# Decorations ------------------
# Remove x axis name for the boxplot
ax_bottom.set(xlabel='')
ax_right.set(ylabel='')

# Main Title, Xlabel and YLabel
ax_main.set(title='Scatterplot with Boxplots \n displ vs hwy', xlabel='displ', ylabel='hwy')

# Set font size of different components
ax_main.title.set_fontsize(20)
for item in ([ax_main.xaxis.label, ax_main.yaxis.label] + ax_main.get_xticklabels() + ax_main.get_yticklabels()):
    item.set_fontsize(14)

plt.show()
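
The three values a box marks can also be computed directly. This is a minimal sketch on made-up numbers using NumPy's default percentile interpolation; plotting libraries may use slightly different conventions for small samples.

```python
import numpy as np

# Made-up sample: the integers 1 through 9
values = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9], dtype=float)

# 25th percentile, median, and 75th percentile: the box edges and center line
q1, median, q3 = np.percentile(values, [25, 50, 75])
iqr = q3 - q1  # interquartile range, the height of the box

print(q1, median, q3, iqr)
```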

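8 Correlogram

A correlogram gives a visual overview of the correlation metric between all possible pairs of numeric variables in a given dataframe (or 2D array).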

# Import Dataset
df = pd.read_csv("https://github.com/selva86/datasets/raw/master/mtcars.csv")

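# Plot
# Compute the correlation matrix once, on the numeric columns only
corr = df.select_dtypes(include='number').corr()
plt.figure(figsize=(12,10), dpi= 80)
sns.heatmap(corr, xticklabels=corr.columns, yticklabels=corr.columns, cmap='RdYlGn', center=0, annot=True)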

# Decorations
plt.title('Correlogram of mtcars', fontsize=22)
plt.xticks(fontsize=12)
plt.yticks(fontsize=12)
plt.show()
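
The matrix behind the heatmap can be inspected on its own. The sketch below builds a made-up frame with one perfectly increasing pair, one perfectly decreasing pair, and a text column that is excluded before correlating. The upper-triangle mask is an optional extra and an assumption of this sketch; it could be passed to the heatmap's `mask` argument to hide the redundant half.

```python
import numpy as np
import pandas as pd

# Made-up data: b rises with a, c falls as a rises, label is non-numeric
toy = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": [2.0, 4.0, 6.0, 8.0],
    "c": [4.0, 3.0, 2.0, 1.0],
    "label": list("wxyz"),
})

# Pearson correlation on the numeric columns only
corr = toy.select_dtypes(include="number").corr()

# Boolean mask that is True strictly above the diagonal
mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)

print(corr)
```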

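9 Pairwise Plot

The pairwise plot is a favorite in exploratory analysis for understanding the relationship between all possible pairs of numeric variables. It is a must-have tool for bivariate analysis.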

# Load Dataset
df = sns.load_dataset('iris')

# Plot: scatter panels (pairplot creates its own figure)
sns.pairplot(df, kind="scatter", hue="species", plot_kws=dict(s=80, edgecolor="white", linewidth=2.5))
plt.show()

# Plot: the same grid with a regression fit in each panel
sns.pairplot(df, kind="reg", hue="species")
plt.show()

Summary

That concludes Part 1, Correlation.

Series links

Matplotlib visualization charts, Part 1: Correlation
Matplotlib visualization charts, Part 2: Deviation
Matplotlib visualization charts, Part 3: Ranking
Matplotlib visualization charts, Part 4: Distribution
Matplotlib visualization charts, Part 5: Composition
Matplotlib visualization charts, Part 6: Change
Matplotlib visualization charts, Part 7: Groups

Full version references

Original article: Top 50 matplotlib Visualizations – The Master Plots (with full python code)
Chinese repost: In-Depth Read | The 50 Most Valuable Matplotlib Visualization Charts (with full Python source code)
