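Python Data Analysis, Lesson 13: Data Analysis in Practice

A professional data analyst should act as a strategist: someone who plans from behind the scenes and can read the overall situation without ever leaving the study.

We have moved from the IT (Information Technology) era into the DT (Data Technology) era. Collecting and storing large volumes of data is now cheap, and this has given rise to the data analysis profession.

The most important role of data analysis is to extract genuinely valuable information from data and to help us make sound decisions.

To get a better picture of the data analyst role, this lesson analyzes data analyst job postings collected from a recruitment website in 2017.

1. Data Overview

Let's start with the basic information about the data:

The csv file contains Chinese characters, so reading it with the default encoding can fail with a character encoding error. In that case, try setting the encoding parameter of pd.read_csv() to "gbk".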

%matplotlib inline
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Load the data
path = r'C:\Users\lin-a\Desktop\data\analyse_spider.csv'
data = pd.read_csv(path,encoding='GBK')
# Inspect the shape and the first few rows
print(data.shape)
data.head()

(6876, 14)

	city	companyId	companySize	businessZones	firstType	secondType	education	industryField	positionId	positionAdvantage	positionName	positionLables	salary	workYear
0	上海	8581	2000人以上	['張江']	技術	數據開發	碩士	移動互聯網	2537336	知名平臺	數據分析師	['分析師', '數據分析', '數據挖掘', '數據']	7k-9k	應屆畢業生
1	上海	23177	500-2000人	['五里橋', '打浦橋', '製造局路']	技術	數據開發	本科	金融	2427485	挑戰機會,團隊好,與大牛合作,工作環境好	數據分析師-CR2017-SH2909	['分析師', '數據分析', '數據挖掘', '數據']	10k-15k	應屆畢業生
2	上海	57561	50-150人	['打浦橋']	設計	數據分析	本科	移動互聯網	2511252	時間自由,領導nic	數據分析師	['分析師', '數據分析', '數據']	4k-6k	應屆畢業生
3	上海	7502	150-500人	['龍華', '上海體育場', '萬體館']	市場與銷售	數據分析	本科	企業服務,數據服務	2427530	五險一金績效獎金帶薪年假節日福利	大數據業務分析師【數雲校招】	['商業', '分析師', '大數據', '數據']	6k-8k	應屆畢業生
4	上海	130876	15-50人	['上海影城', '新華路', '虹橋']	技術	軟件開發	本科	其他	2245819	在大牛下指導	BI開發/數據分析師	['分析師', '數據分析', '數據', 'BI']	2k-3k	應屆畢業生
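
The encoding advice above can be turned into a small helper. This is a minimal sketch, not part of the original lesson: the function name read_csv_with_fallback, the list of encodings, and the demo file are all assumptions made for illustration.

```python
import os
import tempfile

import pandas as pd


def read_csv_with_fallback(path, encodings=("utf-8", "gbk")):
    """Try each encoding in turn and return the first DataFrame that decodes."""
    last_error = None
    for encoding in encodings:
        try:
            return pd.read_csv(path, encoding=encoding)
        except UnicodeDecodeError as error:
            last_error = error  # remember the failure and try the next encoding
    raise last_error


# Demo on a small hypothetical GBK-encoded file (not the course dataset).
with tempfile.TemporaryDirectory() as tmp_dir:
    demo_path = os.path.join(tmp_dir, "demo_gbk.csv")
    with open(demo_path, "w", encoding="gbk") as handle:
        handle.write("city,salary\n上海,7k-9k\n北京,10k-15k\n")
    demo = read_csv_with_fallback(demo_path)

print(demo)
```

Trying UTF-8 first and GBK second is only one reasonable order; if you already know how the file was exported, passing that encoding directly is simpler.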

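The dataset has 14 columns. Here is what each column means:

city: city
companyId: company ID
companySize: company size
businessZones: business district(s) where the company is located
firstType: first-level category of the position
secondType: second-level category of the position
education: education requirement
industryField: industry of the company
positionId: position ID
positionAdvantage: perks of the position
positionName: position title
positionLables: position tags
salary: salary
workYear: required work experience

Take a closer look at the data:

# Inspect column types and non-null counts
data.info()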

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6876 entries, 0 to 6875
Data columns (total 14 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   city               6876 non-null   object
 1   companyId          6876 non-null   int64 
 2   companySize        6876 non-null   object
 3   businessZones      4853 non-null   object
 4   firstType          6869 non-null   object
 5   secondType         6870 non-null   object
 6   education          6876 non-null   object
 7   industryField      6876 non-null   object
 8   positionId         6876 non-null   int64 
 9   positionAdvantage  6876 non-null   object
 10  positionName       6876 non-null   object
 11  positionLables     6844 non-null   object
 12  salary             6876 non-null   object
 13  workYear           6876 non-null   object
dtypes: int64(2), object(12)
memory usage: 752.2+ KB

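2. Analysis Goals

The cardinal sin of data analysis is not knowing the direction or purpose of the analysis and sitting in front of a pile of data with no idea what to do. Every analysis should serve a business goal; the data itself is never the goal.

So we should set the goal of the analysis first, and only then process the data.

The goal of this case study is simple: use the data to analyze the factors that affect salary:

The effect of location on data analyst salaries;
The effect of education on data analyst salaries;
The effect of work experience on data analyst salaries.

3. Data Cleaning

Missing values

Missing values can heavily distort the results of an analysis. If more than half of the values in a field are missing, we can drop that field, because it no longer carries much business meaning.

Note: we do not have to delete data whenever a missing value appears. If the dataset is small, few values are missing, and they do not actually affect the metrics we analyze, we can keep them.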
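
The rule of thumb above can be checked with numbers rather than by reading data.info() by eye. The following is a minimal sketch on hypothetical toy data; the column names, the values, and the 0.5 threshold variable are assumptions for illustration only.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data; column names and values are illustrative only.
toy = pd.DataFrame({
    "city": ["A", "B", "C", "D"],
    "zone": [np.nan, np.nan, np.nan, "x"],   # 75% missing
    "label": ["a", np.nan, "c", "d"],        # 25% missing
})

# Share of missing values in each column
missing_ratio = toy.isna().mean()
print(missing_ratio)

# Drop the columns where more than half of the values are missing
threshold = 0.5
sparse_columns = missing_ratio[missing_ratio > threshold].index
cleaned = toy.drop(columns=sparse_columns)

# Then drop the few rows that still contain a missing value
cleaned = cleaned.dropna()
print(cleaned)
```

Printing the ratio first makes the decision explicit: a column slightly under the threshold may still be worth dropping if the analysis does not need it.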

# Inspect column types and non-null counts
data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 6876 entries, 0 to 6875
Data columns (total 14 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   city               6876 non-null   object
 1   companyId          6876 non-null   int64 
 2   companySize        6876 non-null   object
 3   businessZones      4853 non-null   object
 4   firstType          6869 non-null   object
 5   secondType         6870 non-null   object
 6   education          6876 non-null   object
 7   industryField      6876 non-null   object
 8   positionId         6876 non-null   int64 
 9   positionAdvantage  6876 non-null   object
 10  positionName       6876 non-null   object
 11  positionLables     6844 non-null   object
 12  salary             6876 non-null   object
 13  workYear           6876 non-null   object
dtypes: int64(2), object(12)
memory usage: 752.2+ KB

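The result shows 6876 records in total. The businessZones, firstType, secondType, and positionLables columns contain missing values. companyId and positionId are integers; all other columns are strings.

businessZones is missing in 2023 of the 6876 rows (about 29%), far more than any other column. That is below the one-half rule of thumb, but the column plays no part in the salary analysis, so we drop it.

The other three columns have 45 missing values in total (7 + 6 + 32), too few to affect the overall analysis, so we delete the rows that contain them. Because some rows are missing more than one of these fields, fewer than 45 rows are removed: the row count goes from 6876 to 6837, a drop of 39 rows.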

%matplotlib inline
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Load the data
path = r'C:\Users\lin-a\Desktop\data\analyse_spider.csv'
data = pd.read_csv(path,encoding='GBK')

# Drop the businessZones column, which has many missing values
data.drop(columns='businessZones',inplace=True)
# Then drop the rows that still contain missing values
data.dropna(inplace=True)
data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 6837 entries, 0 to 6875
Data columns (total 13 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   city               6837 non-null   object
 1   companyId          6837 non-null   int64 
 2   companySize        6837 non-null   object
 3   firstType          6837 non-null   object
 4   secondType         6837 non-null   object
 5   education          6837 non-null   object
 6   industryField      6837 non-null   object
 7   positionId         6837 non-null   int64 
 8   positionAdvantage  6837 non-null   object
 9   positionName       6837 non-null   object
 10  positionLables     6837 non-null   object
 11  salary             6837 non-null   object
 12  workYear           6837 non-null   object
dtypes: int64(2), object(11)
memory usage: 747.8+ KB

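After handling missing values, 6837 rows and 13 columns remain.

Duplicate values

With missing values handled, another factor that can affect the results is duplicated data.

Let's count the duplicated rows and then delete them.

data.duplicated() flags every row that repeats an earlier row. Indexing with data.duplicated()[data.duplicated()==True] keeps only the flagged rows, and len() counts them. data.duplicated().sum() gives the same count more concisely.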

%matplotlib inline
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

# Load the data
path = r'C:\Users\lin-a\Desktop\data\analyse_spider.csv'
data = pd.read_csv(path,encoding='GBK')

# Drop the businessZones column, which has many missing values
data.drop(columns='businessZones',axis=1,inplace=True)
# Then drop the rows that still contain missing values
data.dropna(inplace=True)

# Count the duplicated rows
print(len(data.duplicated()[data.duplicated()==True]))
# Remove the duplicated rows
data.drop_duplicates(inplace=True)
data.info()

1830
<class 'pandas.core.frame.DataFrame'>
Int64Index: 5007 entries, 0 to 6766
Data columns (total 13 columns):
 #   Column             Non-Null Count  Dtype 
---  ------             --------------  ----- 
 0   city               5007 non-null   object
 1   companyId          5007 non-null   int64 
 2   companySize        5007 non-null   object
 3   firstType          5007 non-null   object
 4   secondType         5007 non-null   object
 5   education          5007 non-null   object
 6   industryField      5007 non-null   object
 7   positionId         5007 non-null   int64 
 8   positionAdvantage  5007 non-null   object
 9   positionName       5007 non-null   object
 10  positionLables     5007 non-null   object
 11  salary             5007 non-null   object
 12  workYear           5007 non-null   object
dtypes: int64(2), object(11)
memory usage: 547.6+ KB

There are 1830 duplicated rows. After removing them with data.drop_duplicates(), 5007 rows remain.
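
To see exactly what duplicated() and drop_duplicates() do, here is a minimal sketch on hypothetical toy data (the rows are invented for illustration and are not taken from the course dataset).

```python
import pandas as pd

# Hypothetical toy data with repeated rows
toy = pd.DataFrame({
    "positionId": [1, 2, 2, 3, 3, 3],
    "salary": ["7k-9k", "10k-15k", "10k-15k", "4k-6k", "4k-6k", "4k-6k"],
})

# True for every row that repeats an earlier row; the first occurrence is False
dup_mask = toy.duplicated()
dup_count = dup_mask.sum()
print(dup_count)

# Keep only the first occurrence of each row
deduped = toy.drop_duplicates()
print(len(deduped))
```

Note that the first occurrence of a repeated row is not counted as a duplicate, so the duplicate count plus the number of remaining rows equals the original row count.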

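Missing values and duplicates must be dealt with before any analysis, because they can heavily distort the results.

4. Data Preparation and Analysis

Our first task is to look at the salary distribution, so we prepare the salary data first.

The salary field comes in two basic formats, 15k-25k and 15k以上 (15k and above), and both are strings.

What if we only want the lower bound or the upper bound of the salary?

The best approach is to split the salary field into two columns, one for the minimum salary and one for the maximum. A value written as 15k cannot be used directly in calculations, so we also strip the k.

We can use the pandas apply method on the salary column. The code and its result are shown below.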

# Define the splitting function
def split_salary(salary,method):
    # Index of '-' in the string (-1 if it is absent)
    position = salary.upper().find('-')
    if position != -1:# salary looks like 15k-25k
        low_salary = salary[:position-1]
        high_salary = salary[position+1:len(salary)-1]
        
    else:# salary looks like 15k以上 (15k and above)
        low_salary = salary[:salary.upper().find('K')]
        high_salary = low_salary
    
    # Convert to integers so the values can be used in calculations
    low_salary = int(low_salary)
    high_salary = int(high_salary)
    
    # Return the value selected by the method argument
    if method == 'low':
        return low_salary
    if method == 'high':
        return high_salary
    if method == 'avg':
        return (low_salary+high_salary)/2

# Create the new columns
data['low_salary'] = data.salary.apply(split_salary,method='low')
data['high_salary'] = data.salary.apply(split_salary,method='high')
data['avg_salary'] = data.salary.apply(split_salary,method='avg')
data.head()

	city	companyId	companySize	businessZones	firstType	secondType	education	industryField	positionId	positionAdvantage	positionName	positionLables	salary	workYear	low_salary	high_salary	avg_salary
0	上海	8581	2000人以上	['張江']	技術	數據開發	碩士	移動互聯網	2537336	知名平臺	數據分析師	['分析師', '數據分析', '數據挖掘', '數據']	7k-9k	應屆畢業生	7	9	8.0
1	上海	23177	500-2000人	['五里橋', '打浦橋', '製造局路']	技術	數據開發	本科	金融	2427485	挑戰機會,團隊好,與大牛合作,工作環境好	數據分析師-CR2017-SH2909	['分析師', '數據分析', '數據挖掘', '數據']	10k-15k	應屆畢業生	10	15	12.5
2	上海	57561	50-150人	['打浦橋']	設計	數據分析	本科	移動互聯網	2511252	時間自由,領導nic	數據分析師	['分析師', '數據分析', '數據']	4k-6k	應屆畢業生	4	6	5.0
3	上海	7502	150-500人	['龍華', '上海體育場', '萬體館']	市場與銷售	數據分析	本科	企業服務,數據服務	2427530	五險一金績效獎金帶薪年假節日福利	大數據業務分析師【數雲校招】	['商業', '分析師', '大數據', '數據']	6k-8k	應屆畢業生	6	8	7.0
4	上海	130876	15-50人	['上海影城', '新華路', '虹橋']	技術	軟件開發	本科	其他	2245819	在大牛下指導	BI開發/數據分析師	['分析師', '數據分析', '數據', 'BI']	2k-3k	應屆畢業生	2	3	2.5

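Step 1: we define a custom function, split_salary(). Its salary parameter is the argument that apply passes in automatically, namely each value of data.salary.
Step 2: salary.upper().find('-') determines whether the value has the 15k-25k form or the 15k以上 form. A result of -1 means the 15k以上 form; any other result means the 15k-25k form. To handle both lowercase and uppercase k, upper() converts every k to K, and K is then used as the cut point.
Step 3: split_salary takes an extra parameter, method, which selects whether to return low_salary, high_salary, or avg_salary.

Besides find(), the column can also be split with split(), as shown here: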

import pandas as pd

df = pd.DataFrame(data={'序號':[1,2,3,4],
                       '待遇':['12K','12K-15K','20-22k','20k以上']})
df2 = df['待遇'].str.split('-',expand=True)
df2

	0	1
0	12K	None
1	12K	15K
2	20	22k
3	20k以上	None

df

	序號	待遇
0	1	12K
1	2	12K-15K
2	3	20-22k
3	4	20k以上
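
The split() result above still holds strings such as 12K and 20k以上, so one more step is needed before it can be used in calculations. The sketch below is one possible way to finish the job, not the lesson's own method: it pulls the digits out of each part with str.extract and fills a missing upper bound with the lower bound. The column names low, high, and avg are assumptions.

```python
import pandas as pd

df = pd.DataFrame({"待遇": ["12K", "12K-15K", "20-22k", "20k以上"]})

# Split on '-' into two columns; values without '-' leave the second column empty
parts = df["待遇"].str.split("-", expand=True)

# Keep only the leading digits of each part and convert them to numbers
low = parts[0].str.extract(r"(\d+)", expand=False).astype(float)
high = parts[1].str.extract(r"(\d+)", expand=False).astype(float)

# A value with no upper bound uses the lower bound for both
high = high.fillna(low)

df["low"] = low
df["high"] = high
df["avg"] = (low + high) / 2
print(df)
```

Treating 20k以上 as 20 for both bounds mirrors what split_salary does, and it understates salaries that are open-ended at the top.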

That completes the data preparation. Next, let's look at data analyst salaries.

5. Plotting Charts

%matplotlib inline
import matplotlib.pyplot as plt
from matplotlib import font_manager
import seaborn as sns

# Set a font that can render Chinese characters
my_font = font_manager.FontProperties(fname=r'C:\Users\lin-a\Desktop\Python數據分析\經典黑體簡.TTF')
sns.set(style='darkgrid')
plt.figure(figsize=(10,8),dpi=80)
plt.hist(data['avg_salary'],width=5)
plt.xlabel('薪酬區間',fontproperties=my_font)
plt.ylabel('數量',fontproperties=my_font)
plt.title('數據分析師薪酬分佈圖',fontproperties=my_font)
plt.show()
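
A histogram is easier to interpret when the counts per salary band are also available as numbers. The following is a minimal sketch using pd.cut on hypothetical salary values; the bin edges and labels are assumptions chosen for illustration.

```python
import pandas as pd

# Hypothetical average salaries in thousands (k); not taken from the dataset
avg_salary = pd.Series([2.5, 5.0, 7.0, 8.0, 12.5, 15.0, 17.5, 22.5, 35.0])

# Left-closed bands: [0, 10), [10, 20), [20, 30), [30, inf)
bins = [0, 10, 20, 30, float("inf")]
labels = ["0-10k", "10-20k", "20-30k", "30k+"]
bands = pd.cut(avg_salary, bins=bins, labels=labels, right=False)

# Count the values in each band, keeping the bands in their natural order
counts = bands.value_counts(sort=False)
print(counts)
```

With the real data, the same call on data['avg_salary'] would show how many postings fall in each band.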

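The chart shows the distribution of data analyst salaries. Most salaries fall between 10k and 30k, and the 10k-20k range is the most common.

Next, we group the data by city to see how location affects salary: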

%matplotlib inline
import matplotlib.pyplot as plt
from matplotlib import font_manager
import seaborn as sns

my_font = font_manager.FontProperties(fname=r'C:\Users\lin-a\Desktop\Python數據分析\經典黑體簡.TTF')
plt.figure(figsize=(20,8),dpi=80)
sns.set(style='darkgrid')

groups = data.groupby(by='city')
xticks = []
for group_name,group_df in groups:
    xticks.append(group_name)
    plt.bar(group_name,group_df.avg_salary.mean())
plt.xticks(xticks,fontproperties=my_font)
plt.title('不同城市薪酬分佈',fontproperties=my_font)
plt.ylabel('薪酬水平',fontproperties=my_font)
plt.show()

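The code groups the data by city and plots the average salary for each city.
The chart shows that data analysts in Beijing earn more than in other cities, with Shanghai and Shenzhen slightly behind; Guangzhou even trails Hangzhou and Suzhou.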
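
The per-city averages can also be computed in a single groupby call, which makes it easy to sort the cities and to see how many postings each average is based on. This is a minimal sketch on hypothetical toy data; the city names and salary values are invented.

```python
import pandas as pd

# Hypothetical toy data; city names and salaries are illustrative only
toy = pd.DataFrame({
    "city": ["CityA", "CityA", "CityB", "CityB", "CityC"],
    "avg_salary": [20.0, 30.0, 18.0, 22.0, 12.0],
})

# Mean salary and number of postings per city, highest mean first
summary = (
    toy.groupby("city")["avg_salary"]
    .agg(["mean", "count"])
    .sort_values("mean", ascending=False)
)
print(summary)
```

The count column matters when comparing cities: an average based on only a handful of postings is much less reliable than one based on hundreds.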

Next, let's look at how education affects salary.
Again we group the data, this time by education, and compare the average salary for each level.

%matplotlib inline
import matplotlib.pyplot as plt
from matplotlib import font_manager
import seaborn as sns

my_font = font_manager.FontProperties(fname=r'C:\Users\lin-a\Desktop\Python數據分析\經典黑體簡.TTF')
plt.figure(figsize=(6,4),dpi=80)
sns.set(style='darkgrid')

groups = data.groupby(by='education')
xticks = []
for group_name,group_df in groups:
    xticks.append(group_name)
    plt.bar(group_name,group_df.avg_salary.mean())
plt.xticks(xticks,fontproperties=my_font)
plt.title('不同學歷水平薪酬分佈',fontproperties=my_font)
plt.ylabel('薪酬水平',fontproperties=my_font)
plt.show()

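The code groups the data by education and plots the average salary for each education level.

The chart shows that doctoral degrees command the highest salaries, master's and bachelor's degrees are roughly on par, and associate degrees (大專) are at a slight disadvantage.

Finally, let's look at how work experience affects salary.

Again we group the data, this time by required work experience, and compare the average salary for each range.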

%matplotlib inline
import matplotlib.pyplot as plt
from matplotlib import font_manager
import seaborn as sns

my_font = font_manager.FontProperties(fname=r'C:\Users\lin-a\Desktop\Python數據分析\經典黑體簡.TTF')
plt.figure(figsize=(6,4),dpi=80)
sns.set(style='darkgrid')

groups = data.groupby(by='workYear')
xticks = []
for group_name,group_df in groups:
    xticks.append(group_name)
    plt.bar(group_name,group_df.avg_salary.mean())
plt.xticks(xticks,fontproperties=my_font)
plt.title('不同工作年限薪酬分佈',fontproperties=my_font)
plt.ylabel('薪酬水平',fontproperties=my_font)
plt.show()
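
groupby sorts its groups by label, so the bars for work experience are not necessarily drawn from least to most experienced. The sketch below shows one way to impose a meaningful order with reindex. Apart from 應屆畢業生, which appears in the data preview, the experience labels and all salary values here are hypothetical.

```python
import pandas as pd

# Hypothetical toy data; labels other than 應屆畢業生 are assumed
toy = pd.DataFrame({
    "workYear": ["1-3年", "應屆畢業生", "5-10年", "3-5年", "1-3年", "5-10年"],
    "avg_salary": [10.0, 6.0, 25.0, 16.0, 12.0, 27.0],
})

# Assumed ordering from least to most experience
experience_order = ["應屆畢業生", "1-3年", "3-5年", "5-10年"]

# Mean salary per experience range, rearranged into the chosen order
mean_by_experience = (
    toy.groupby("workYear")["avg_salary"].mean().reindex(experience_order)
)
print(mean_by_experience)
```

Plotting mean_by_experience instead of looping over the groups keeps the x-axis in experience order, which makes any trend much easier to read.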

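That is as far as we will take the salary analysis. To summarize what the data shows:

The average data analyst salary is 17k, the maximum is 75k, and most analysts earn between 10k and 20k.
Data analysts in Beijing earn more than in other cities, with Shanghai and Shenzhen slightly behind; Hangzhou and Suzhou have already overtaken Guangzhou.
Doctoral degrees command the highest salaries, master's and bachelor's degrees are roughly on par, and associate degrees are at a slight disadvantage.
The longer the work experience, the higher the salary.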
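
Summary figures such as the mean, the maximum, and the share of salaries in a given range can be computed directly from the avg_salary column. This is a minimal sketch on hypothetical values, so the numbers it prints are not the ones quoted above.

```python
import pandas as pd

# Hypothetical average salaries in thousands (k); not taken from the dataset
avg_salary = pd.Series([5.0, 8.0, 12.5, 15.0, 17.5, 20.0, 30.0, 75.0])

mean_salary = avg_salary.mean()
max_salary = avg_salary.max()

# Share of values between 10 and 20, both ends included
share_10_to_20 = avg_salary.between(10, 20).mean()

print(mean_salary, max_salary, share_10_to_20)
```

Because a few very high salaries pull the mean upward, it is worth checking avg_salary.median() alongside the mean.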

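From these results we can conclude that Beijing, Shanghai, Guangzhou, and Shenzhen remain major high-salary job markets, although in this data Guangzhou trails Hangzhou and Suzhou. Salary also tends to rise with both work experience and education; the charts show an upward trend rather than a strictly linear relationship.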

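Using weather data for the whole of 2019, complete the following: 1. In which month of 2019 did Beijing's temperature fluctuate the most? 2. What share of 2019 did each air quality level account for?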

%matplotlib inline
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

# Load the data
data = pd.read_
