Pandas 11 - Comprehensive Exercises

import pandas as pd
import numpy as np
np.seterr(all = 'ignore')

{'divide': 'ignore', 'over': 'ignore', 'under': 'ignore', 'invalid': 'ignore'}

【Task 1】 Diversity of company revenue

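【Problem】 The diversity of a company's revenue across its lines of business can be measured by a revenue entropy index, defined by analogy with information entropy:
$$I = -\sum_i p(x_i)\log p(x_i)$$
where $p(x_i)$ is the share of the company's total revenue in a given year that comes from business line $i$.
company.csv lists the companies and years to be evaluated; company_data.csv holds the companies, their revenue amounts by revenue type, and the revenue dates. Using the data in the second table, add a column to the first table giving the revenue entropy index I of each company in each year.
【Data download】 Link: https://pan.baidu.com/s/1leZZctxMUSW55kZY5WwgIw (password: u6fd)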
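
As a quick sanity check of the definition, the sketch below evaluates the index for a hypothetical company. The function name `revenue_entropy` and the revenue figures are made up for illustration, and base-2 logarithms are an assumption chosen to match the `np.log2` call in the solution below; the task statement itself does not fix the base.

```python
import math

def revenue_entropy(revenues):
    """Entropy (base 2) of the revenue shares; zero entries contribute nothing."""
    total = sum(revenues)
    shares = [r / total for r in revenues if r > 0]
    return -sum(p * math.log2(p) for p in shares)

# Hypothetical revenues for one company-year (not taken from the data set)
even = revenue_entropy([25.0, 25.0, 25.0, 25.0])   # four equal revenue streams
skewed = revenue_entropy([97.0, 1.0, 1.0, 1.0])    # one dominant revenue stream
print(even, skewed)
```

Evenly spread revenue gives the larger value, which is why the index can be read as a diversity measure.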

My solution :

Read both tables.

df1 = pd.read_csv('company.csv')
df2 = pd.read_csv('company_data.csv')

df1.head()

	證券代碼	日期
0	#000007	2014
1	#000403	2015
2	#000408	2016
3	#000408	2017
4	#000426	2015

df2.head()

	證券代碼	日期	收入類型	收入額
0	1	2008/12/31	1	1.084218e+10
1	1	2008/12/31	2	1.259789e+10
2	1	2008/12/31	3	1.451312e+10
3	1	2008/12/31	4	1.063843e+09
4	1	2008/12/31	5	8.513880e+08

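The two tables differ in the format of the 證券代碼 (security code) column and of the 日期 (date) column, so these must be made consistent first:
In df1, strip the leading # from 證券代碼 and convert the column to int.
In df2, keep the first four characters of 日期 (the year) and convert the column to int.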

df1_ = df1.copy()
df1_['證券代碼'] = df1_['證券代碼'].str[1:].astype('int64')

df2['日期'] = df2['日期'].str[:4].astype('int64')

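Define an entropy function that computes the information entropy (with base-2 logarithms) and skips NaN values; a group with no revenue data yields NaN.
Left-join df1 to df2 on 證券代碼 and 日期, group by the same two columns, apply the entropy function to the 收入額 (revenue amount) column, and reset the index.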

def entropy(x):
    if x.any():
        p = x/x.sum()
        return -(p*np.log2(p)).sum()
    return np.nan
res = df1_.merge(df2, on=['證券代碼','日期'], how='left').groupby(['證券代碼','日期'])['收入額'].apply(entropy).reset_index()
res.head()

	證券代碼	日期	收入額
0	7	2014	4.429740
1	403	2015	4.025963
2	408	2016	4.066295
3	408	2017	NaN
4	426	2015	4.449655

Add a new column 收入熵指標 (revenue entropy index) to df1, taking its values from the 收入額 column of the result table.

df1['收入熵指標'] = res['收入額']
df1

	證券代碼	日期	收入熵指標
0	#000007	2014	4.429740
1	#000403	2015	4.025963
2	#000408	2016	4.066295
3	#000408	2017	NaN
4	#000426	2015	4.449655
...	...	...	...
1043	#600978	2011	4.788391
1044	#600978	2014	4.022378
1045	#600978	2015	4.346303
1046	#600978	2016	4.358608
1047	#600978	2017	NaN

1048 rows × 3 columns
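
The assignment above lines up `res` and `df1` by row position. That works only while company.csv is sorted by code and year with no repeated pairs, because `groupby` returns its groups in sorted key order. A minimal sketch of a key-based alternative follows; the toy frames `companies` and `entropy_by_key` and the column names `code`, `year` and `entropy` are assumptions for illustration, not the real data.

```python
import pandas as pd

# Toy stand-ins for company.csv (deliberately unsorted) and the grouped result
companies = pd.DataFrame({'code': [408, 7, 408], 'year': [2017, 2014, 2016]})
entropy_by_key = pd.DataFrame({'code': [7, 408], 'year': [2014, 2016],
                               'entropy': [4.43, 4.07]})

# A left merge on the keys keeps the row order of `companies` and leaves
# NaN wherever no entropy value exists for a (code, year) pair
aligned = companies.merge(entropy_by_key, on=['code', 'year'], how='left')
print(aligned)
```

With this approach the result no longer depends on how the input file happens to be ordered.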

Wrap all of the steps above in a function and time it.

def information_entropy():
    df1 = pd.read_csv('company.csv')
    df2 = pd.read_csv('company_data.csv')
    df1_ = df1.copy()
    df1_['證券代碼'] = df1_['證券代碼'].str[1:].astype('int64')
    df2['日期'] = df2['日期'].str[:4].astype('int64')
    res = df1_.merge(df2, on=['證券代碼','日期'], how='left').groupby(['證券代碼','日期'])['收入額'].apply(entropy).reset_index()
    df1['收入熵指標'] = res['收入額']
    return df1

%timeit -n 5 information_entropy()

1.62 s ± 44.5 ms per loop (mean ± std. dev. of 7 runs, 5 loops each)

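【Task 2】 Reshaping the team-learning roster

【Problem】 Reshape the team roster table into the long form below, where the column '是否隊長' (is leader) takes 1 for the team leader and 0 otherwise.

【Data download】 Link: https://pan.baidu.com/s/1ses24cTwUCbMx3rvYXaz-Q (password: iz57)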

My solution :

Read the data.

df = pd.read_excel('組隊信息彙總表_Pandas.xlsx')

The 所在羣 (chat group) column is not needed, so drop it.

df.drop(columns='所在羣', inplace=True)
df.head(2)

	隊伍名稱	隊長編號	隊長_羣暱稱	隊員1 編號	隊員_羣暱稱	隊員2 編號	隊員_羣暱稱.1	隊員3 編號	隊員_羣暱稱.2	隊員4 編號	...	隊員6 編號	隊員_羣暱稱.5	隊員7 編號	隊員_羣暱稱.6	隊員8 編號	隊員_羣暱稱.7	隊員9 編號	隊員_羣暱稱.8	隊員10編號	隊員_羣暱稱.9
0	你說的都對隊	5	山楓葉紛飛	6	蔡	7.0	安慕希	8.0	信仰	20.0	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
1	熊貓人	175	魚呲呲	44	Heaven	37.0	呂青	50.0	餘柳成蔭	82.0	...	25.0	Never say never	55.0	K	120.0	Y.	28.0	X.Y.Q	151.0	swrong

2 rows × 23 columns

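To convert the wide table to long form with wide_to_long, the columns have to be renamed first.
Following the names in the target table, the leader and the members are distinguished as leader and member. Since the target table classifies leaders and members as 1 and 0, the classification can be built in at renaming time by appending 1 or 0 to each new name; afterwards the last character of the string gives the class directly.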

col_1 = np.array(['隊伍名稱','編號_leader01','暱稱_leader01'])
col_2 = np.array([[f'編號_member{i}0', f'暱稱_member{i}0']for i in range(1,11)]).flatten()
df.columns = np.r_[col_1,col_2]
df.head(2)

	隊伍名稱	編號_leader01	暱稱_leader01	編號_member10	暱稱_member10	編號_member20	暱稱_member20	編號_member30	暱稱_member30	編號_member40	...	編號_member60	暱稱_member60	編號_member70	暱稱_member70	編號_member80	暱稱_member80	編號_member90	暱稱_member90	編號_member100	暱稱_member100
0	你說的都對隊	5	山楓葉紛飛	6	蔡	7.0	安慕希	8.0	信仰	20.0	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
1	熊貓人	175	魚呲呲	44	Heaven	37.0	呂青	50.0	餘柳成蔭	82.0	...	25.0	Never say never	55.0	K	120.0	Y.	28.0	X.Y.Q	151.0	swrong

2 rows × 23 columns

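Convert the renamed table to long form with wide_to_long, using the target table's column names so that no further renaming is needed.
After the conversion, drop the rows containing NaN with dropna and reset the index.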

res = pd.wide_to_long(  df.reset_index(),
                        stubnames = ['暱稱','編號'],
                        i = ['index','隊伍名稱'],
                        j = '是否隊長',
                        sep = '_',
                        suffix = '.+').dropna().reset_index().drop(columns='index')
res

	隊伍名稱	是否隊長	暱稱	編號
0	你說的都對隊	leader01	山楓葉紛飛	5.0
1	你說的都對隊	member10	蔡	6.0
2	你說的都對隊	member20	安慕希	7.0
3	你說的都對隊	member30	信仰	8.0
4	你說的都對隊	member40	biubiu🙈🙈	20.0
...	...	...	...	...
141	七星聯盟	member40	Daisy	63.0
142	七星聯盟	member50	One Better	131.0
143	七星聯盟	member60	rain	112.0
144	應如是	leader01	思無邪	54.0
145	應如是	member10	Justzer0	58.0

146 rows × 4 columns
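
To see how `stubnames`, `sep` and `suffix` interact in the call above, here is a minimal sketch on a made-up two-team table. The column names (`team`, `id_leader`, `nick_member1`, ...) and the values are assumptions for illustration only.

```python
import pandas as pd

# Toy wide table: one row per team, one id/nick pair per role
wide = pd.DataFrame({
    'team': ['A', 'B'],
    'id_leader': [1, 4],
    'nick_leader': ['ann', 'dan'],
    'id_member1': [2, 5],
    'nick_member1': ['bob', 'eve'],
})

# Columns are matched as <stub><sep><suffix>; the matched suffix becomes column `role`
long = pd.wide_to_long(wide, stubnames=['id', 'nick'], i='team', j='role',
                       sep='_', suffix='.+').reset_index()

# Derive the 0/1 flag from the role label instead of encoding it in the column name
long['is_leader'] = (long['role'] == 'leader').astype(int)
print(long)
```

Deriving the flag from the role label is an alternative to appending 1 or 0 to the column names; both lead to the same long table.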

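This is already close to the target. Take the last character of each value in 是否隊長 as that column's class (note that .str[-1] yields the strings '1' and '0', not integers).
Convert the 編號 (ID) column from float to int.
The 是否隊長 and 隊伍名稱 (team name) columns are in the wrong order, so swap them back.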

res['是否隊長'],res['編號'] = res['是否隊長'].str[-1],res['編號'].astype('int64')

res.reindex(columns=['是否隊長','隊伍名稱','暱稱','編號'])

	是否隊長	隊伍名稱	暱稱	編號
0	1	你說的都對隊	山楓葉紛飛	5
1	0	你說的都對隊	蔡	6
2	0	你說的都對隊	安慕希	7
3	0	你說的都對隊	信仰	8
4	0	你說的都對隊	biubiu🙈🙈	20
...	...	...	...	...
141	0	七星聯盟	Daisy	63
142	0	七星聯盟	One Better	131
143	0	七星聯盟	rain	112
144	1	應如是	思無邪	54
145	0	應如是	Justzer0	58

146 rows × 4 columns

Wrap all of the steps above in a function and time it.

def transform_table():
    df = pd.read_excel('組隊信息彙總表_Pandas.xlsx')
    df.drop(columns='所在羣', inplace=True)
    col_1 = np.array(['隊伍名稱','編號_leader01','暱稱_leader01'])
    col_2 = np.array([[f'編號_member{i}0', f'暱稱_member{i}0']for i in range(1,11)]).flatten()
    df.columns = np.r_[col_1,col_2]
    res = pd.wide_to_long(  df.reset_index(),
                            stubnames = ['暱稱','編號'],
                            i = ['index','隊伍名稱'],
                            j = '是否隊長',
                            sep = '_',
                            suffix = '.+').dropna().reset_index().drop(columns='index')
    res['是否隊長'], res['編號'] = res['是否隊長'].str[-1], res['編號'].astype('int64')
    # reindex returns a new DataFrame, so it has to be returned explicitly
    return res.reindex(columns=['是否隊長','隊伍名稱','暱稱','編號'])

%timeit -n 50 transform_table()

45.7 ms ± 3 ms per loop (mean ± std. dev. of 7 runs, 50 loops each)

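【Task 3】 US presidential election voting

【Problem】 Two tables give the population of each US county and the county-level voting results of the presidential election. Answer the following questions:

1. How many counties have a total vote count exceeding half of the county's population?
2. Build a table with the state as the row index and the candidates as the column names, with the columns ordered by each candidate's nationwide vote total from high to low; each cell holds the total votes that candidate received in that state.

The following is only a sample; the actual state and candidate names are the English ones from the original table.
		                Biden   Trump
			  Wisconsin   2      1
			  Texas       3      4

3. Each state consists of several counties. Define a county's BT index as Biden's vote share in that county minus Trump's vote share in that county. A state whose median county BT index is greater than 0 is called a Biden State. Find all Biden States.

【Data download】 Link: https://pan.baidu.com/s/182rr3CpstVux2CFdFd_Pcg (extraction code: q674)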

My solution :

Read both tables.

df1 = pd.read_csv('president_county_candidate.csv')
df2 = pd.read_csv('county_population.csv')

df1.head()

	state	county	candidate	party	total_votes	won
0	Delaware	Kent County	Joe Biden	DEM	44552	True
1	Delaware	Kent County	Donald Trump	REP	41009	False
2	Delaware	Kent County	Jo Jorgensen	LIB	1044	False
3	Delaware	Kent County	Howie Hawkins	GRN	420	False
4	Delaware	New Castle County	Joe Biden	DEM	195034	True

df2.head()

	US County	Population
0	.Autauga County, Alabama	55869
1	.Baldwin County, Alabama	223234
2	.Barbour County, Alabama	24686
3	.Bibb County, Alabama	22394
4	.Blount County, Alabama	57826

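To prepare for the grouping and merging below, first make the state and county column names and values consistent.
Split US County in df2 on the comma. Note that the comma is followed by a space, which must be part of the separator; otherwise the resulting values will not match those in df1.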

df2[['county','state']] = pd.DataFrame([*df2['US County'].str.split(', ')])
df2.county = df2.county.str[1:]
df2.drop(columns='US County', inplace=True)
df2.head()

	Population	county	state
0	55869	Autauga County	Alabama
1	223234	Baldwin County	Alabama
2	24686	Barbour County	Alabama
3	22394	Bibb County	Alabama
4	57826	Blount County	Alabama
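
The same split can also be written with `expand=True`, which returns the pieces as columns directly. A minimal sketch on two rows copied from the table above; the frame name `pop` is an assumption for illustration.

```python
import pandas as pd

pop = pd.DataFrame({'US County': ['.Autauga County, Alabama', '.Baldwin County, Alabama'],
                    'Population': [55869, 223234]})

# Strip the leading dot, then split on comma-plus-space into two columns
parts = pop['US County'].str.lstrip('.').str.split(', ', expand=True)
pop['county'], pop['state'] = parts[0], parts[1]
print(pop[['county', 'state', 'Population']])
```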

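1. How many counties have a total vote count exceeding half of the county's population?

Group df1 by state and county and sum to get each county's total votes.
Then merge with df2 on state and county to bring in Population.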

df_merge = df1.groupby(['state','county'])['total_votes'].sum().reset_index().merge(df2, on=['state','county'], how='left')
df_merge.head()

	state	county	total_votes	Population
0	Alabama	Autauga County	27770	55869.0
1	Alabama	Baldwin County	109679	223234.0
2	Alabama	Barbour County	10518	24686.0
3	Alabama	Bibb County	9595	22394.0
4	Alabama	Blount County	27588	57826.0

Compare total_votes with half of Population in the result above and keep the matching rows; the filter returns 1434 rows.

df_merge[df_merge['total_votes'] > 0.5*df_merge['Population']]

	state	county	total_votes	Population
11	Alabama	Choctaw County	7464	12589.0
12	Alabama	Clarke County	13135	23622.0
13	Alabama	Clay County	6930	13235.0
16	Alabama	Colbert County	27886	55241.0
17	Alabama	Conecuh County	6441	12067.0
...	...	...	...	...
4626	Wyoming	Sheridan County	16428	30485.0
4627	Wyoming	Sublette County	4970	9831.0
4629	Wyoming	Teton County	14677	23464.0
4631	Wyoming	Washakie County	4012	7805.0
4632	Wyoming	Weston County	3542	6927.0

1434 rows × 4 columns
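
Since the question asks for a count, the boolean mask can also be summed directly instead of reading the row count off the filtered table. A minimal sketch on made-up numbers; the frame `toy` and its values are assumptions for illustration.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({'county': ['a', 'b', 'c'],
                    'total_votes': [60, 40, 10],
                    'Population': [100.0, 100.0, np.nan]})  # NaN: no population match in the left merge

mask = toy['total_votes'] > 0.5 * toy['Population']
# Comparisons against NaN are False, so county 'c' is silently left out
n_counties = int(mask.sum())
print(n_counties)
```

The NaN case matters here because the left merge leaves Population empty for any county name that has no match in the population table.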

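2. Build a table with the state as the row index and the candidates as the column names, with the columns ordered by each candidate's nationwide vote total from high to low; each cell holds the total votes that candidate received in that state.

As the question suggests, pivot_table does the job: supply the rows and columns, aggregate each cell with sum, turn on margins to get totals, and sort the columns in descending order by the last row, All.
The first column is then the per-row total, i.e. each state's total; the second column is Biden with the most votes, followed closely by Trump.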

df1.pivot_table(values = ['total_votes'],
                index = ['state'],
                columns = 'candidate',
                aggfunc = 'sum',
                margins = True).sort_values('All', axis=1, ascending=False).head()

	total_votes
candidate	All	Joe Biden	Donald Trump	Jo Jorgensen	Howie Hawkins	Write-ins	Rocky De La Fuente	Gloria La Riva	Kanye West	Don Blankenship	...	Tom Hoefling	Ricki Sue King	Princess Jacob-Fambro	Blake Huber	Richard Duncan	Joseph Kishore	Jordan Scott	Gary Swing	Keith McCormic	Zachary Scalf
state
Alabama	2323304	849648.0	1441168.0	25176.0	NaN	7312.0	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
Alaska	391346	153405.0	189892.0	8896.0	NaN	34210.0	318.0	NaN	NaN	1127.0	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
Arizona	3387326	1672143.0	1661686.0	51465.0	NaN	2032.0	NaN	NaN	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
Arkansas	1219069	423932.0	760647.0	13133.0	2980.0	NaN	1321.0	1336.0	4099.0	2108.0	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN
California	17495906	11109764.0	6005961.0	187885.0	81025.0	80.0	60155.0	51036.0	NaN	NaN	...	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN	NaN

5 rows × 39 columns
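
The table above keeps the `All` row and column that `margins=True` adds. If only states and candidates should remain, one option is to compute the nationwide totals separately and reorder the columns by them. A minimal sketch on made-up votes; the frame `votes` and the labels X, Y, P, Q, R are assumptions for illustration.

```python
import pandas as pd

votes = pd.DataFrame({
    'state': ['X', 'X', 'Y', 'Y', 'Y'],
    'candidate': ['P', 'Q', 'P', 'Q', 'R'],
    'total_votes': [2, 1, 3, 6, 1],
})

table = votes.pivot_table(values='total_votes', index='state',
                          columns='candidate', aggfunc='sum')

# Order the columns by each candidate's nationwide total, highest first
order = table.sum().sort_values(ascending=False).index
table = table[order]
print(table)
```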

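3. Each state consists of several counties. Define a county's BT index as Biden's vote share in that county minus Trump's vote share in that county. A state whose median county BT index is greater than 0 is called a Biden State. Find all Biden States.
Method 1:

Define a function that computes the BT index: take Biden's votes and Trump's votes, compute the county's total votes, and divide the difference by the total to get BT.
Group by state and county, select the candidate and total_votes columns, and call apply to compute BT.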

def BT(x):
    biden = x[x['candidate']=='Joe Biden']['total_votes'].values
    trump = x[x['candidate']=='Donald Trump']['total_votes'].values
    return pd.Series((biden-trump)/x['total_votes'].sum(), index=['BT'])   
bt = df1.groupby(['state','county'])[['candidate','total_votes']].apply(BT)
bt.head()

		BT
state	county
Alabama	Autauga County	-0.444184
Baldwin County	-0.537623
Barbour County	-0.076631
Bibb County	-0.577280
Blount County	-0.800022

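Reset the index of bt, regroup by state, and use filter to keep the states whose median county BT index is greater than 0.
Dropping duplicate states leaves all the states that satisfy the condition; there are only 9.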

bt.reset_index().groupby('state').filter(lambda x:x.BT.median()>0)[['state']].drop_duplicates()

	state
197	California
319	Connecticut
488	Delaware
491	District of Columbia
725	Hawaii
1878	Massachusetts
2999	New Jersey
3536	Rhode Island
4065	Vermont

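Method 2:

Use boolean conditions to select all Biden rows and all Trump rows, then group by state and county to get each county's total votes.
As it happens, all three DataFrames have the same size, which suggests that every county has votes for both Biden and Trump.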

biden_df = df1[df1['candidate']=='Joe Biden'][['state','county','total_votes']]
trump_df = df1[df1['candidate']=='Donald Trump'][['state','county','total_votes']]
sum_df = df1.groupby(['state','county'])[['total_votes']].sum().reset_index()

Merge the three equally sized DataFrames.

res = biden_df.merge(trump_df, on=['state','county'] ,suffixes=('_biden','_trump')).merge(sum_df, on=['state','county'])
res.head()

	state	county	total_votes_biden	total_votes_trump	total_votes
0	Delaware	Kent County	44552	41009	87025
1	Delaware	New Castle County	195034	88364	287633
2	Delaware	Sussex County	56682	71230	129352
3	District of Columbia	District of Columbia	39041	1725	41681
4	District of Columbia	Ward 2	29078	2918	32881

Subtract the trump column from the biden column and divide by the total column to get the BT index.

res['BT'] = (res['total_votes_biden']-res['total_votes_trump'])/res['total_votes']
res.head()

	state	county	total_votes_biden	total_votes_trump	total_votes	BT
0	Delaware	Kent County	44552	41009	87025	0.040712
1	Delaware	New Castle County	195034	88364	287633	0.370855
2	Delaware	Sussex County	56682	71230	129352	-0.112468
3	District of Columbia	District of Columbia	39041	1725	41681	0.895276
4	District of Columbia	Ward 2	29078	2918	32881	0.795596

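As before, filter as required and extract all states that satisfy the condition; again there are 9. The median is taken on the BT column explicitly, so the string column state is never passed to median.

res[['state','BT']].groupby('state').filter(lambda x:x.BT.median()>0)[['state']].drop_duplicates()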

	state
0	Delaware
3	District of Columbia
237	Hawaii
1390	Massachusetts
2511	New Jersey
3048	Rhode Island
3577	Vermont
4327	California
4449	Connecticut

Next, wrap the two approaches in functions and time them.

def method1():
    bt = df1.groupby(['state','county'])[['candidate','total_votes']].apply(BT)
    return bt.reset_index().groupby('state').filter(lambda x:x.BT.median()>0)[['state']].drop_duplicates()

def method2():
    biden_df = df1[df1['candidate']=='Joe Biden'][['state','county','total_votes']]
    trump_df = df1[df1['candidate']=='Donald Trump'][['state','county','total_votes']]
    sum_df = df1.groupby(['state','county'])[['total_votes']].sum().reset_index()
    res = biden_df.merge(trump_df, on=['state','county'] ,suffixes=('_biden','_trump')).merge(sum_df, on=['state','county'])
    res['BT'] = (res['total_votes_biden']-res['total_votes_trump'])/res['total_votes']
    return res[['state','BT']].groupby('state').filter(lambda x:x.BT.median()>0)[['state']].drop_duplicates()

%timeit method1()

6.56 s ± 210 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)

%timeit method2()

90.9 ms ± 4.83 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

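As the timings show, although Method 2 is split into many more steps, it avoids calling a custom function through apply and is far faster (about 91 ms versus 6.56 s per loop).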
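
The same idea can be pushed a little further by pivoting the candidates into columns, so that the BT index becomes plain column arithmetic and the state filter becomes a grouped median. This is a sketch on made-up votes, not benchmarked against the real files; the frame `votes`, the states X and Y and the county labels are assumptions for illustration.

```python
import pandas as pd

votes = pd.DataFrame({
    'state': ['X'] * 4 + ['Y'] * 4,
    'county': ['x1', 'x1', 'x2', 'x2', 'y1', 'y1', 'y2', 'y2'],
    'candidate': ['Joe Biden', 'Donald Trump'] * 4,
    'total_votes': [60, 40, 55, 45, 30, 70, 45, 55],
})

# One row per county, one column per candidate
wide = votes.pivot_table(index=['state', 'county'], columns='candidate',
                         values='total_votes', aggfunc='sum', fill_value=0)

# BT = (Biden - Trump) / all votes cast in the county
bt = (wide['Joe Biden'] - wide['Donald Trump']) / wide.sum(axis=1)

# Median BT per state, then keep the states where it is positive
median_bt = bt.groupby(level='state').median()
biden_states = median_bt[median_bt > 0].index.tolist()
print(biden_states)
```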

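【Task 4】 Computing the distance matrix between cities

【Problem】 The data give the longitude and latitude of a number of cities. Build a distance DataFrame whose row and column indexes are the city names and whose values form the matrix $M$, where $M_{ij}$ is the spherical distance between city i and city j (the geodesic function in the distance module of the geopy package can be used); the distance from a city to itself is defined as 0.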

My solution :

Read the table, splitting on commas, and rename the columns.

df = pd.read_table('map.txt', sep=',', names=['city1','longitude','latitude'], skiprows=1)
df.head()

	city1	longitude	latitude
0	瀋陽市	123.429092	41.796768
1	長春市	125.324501	43.886841
2	哈爾濱市	126.642464	45.756966
3	北京市	116.405289	39.904987
4	天津市	117.190186	39.125595

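Pack latitude and longitude into a tuple (latitude first) for the distance calculation later.
Drop the two original coordinate columns.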

df['coord1'] = pd.Series([*zip(df.latitude, df.longitude)])

df.drop(columns=['longitude','latitude'], inplace=True)
df.head(3)

	city1	coord1
0	瀋陽市	(41.796768, 123.429092)
1	長春市	(43.886841, 125.324501)
2	哈爾濱市	(45.756966, 126.64246399999999)

Make a copy of df with renamed columns to tell the two apart, and set city2 as the index in preparation for the pivot table below.

df2 = df.rename(columns={'city1':'city2','coord1':'coord2'}).set_index('city2')
df2.head(3)

	coord2
city2
瀋陽市	(41.796768, 123.429092)
長春市	(43.886841, 125.324501)
哈爾濱市	(45.756966, 126.64246399999999)

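Expand df against df2. The groupby on both columns of df looks pointless, since every group is a single row, but apply then attaches a full copy of df2 to each group. Move the coord1 coordinate from the index back into the data columns, use stack to move the columns down into rows, reset the index, and name the unnamed column coords. That column now holds all the coordinates, which completes the preparation for the pivot table.

df_expand = df.groupby(['city1','coord1']).apply(lambda x:df2).reset_index(1).stack().reset_index().rename(columns={0:'coords'})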
df_expand.head(3)

	city1	city2	level_2	coords
0	上海市	瀋陽市	coord1	(31.231707, 121.472641)
1	上海市	瀋陽市	coord2	(41.796768, 123.429092)
2	上海市	長春市	coord1	(31.231707, 121.472641)

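Import the distance function geodesic.
Pivot the prepared table, using geodesic on each pair of coordinates as the aggregation function, take the result in km, and round to two decimals.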

from geopy.distance import geodesic
df_expand.pivot_table(values = 'coords',
                      index = 'city1',
                      columns = 'city2',
                      aggfunc = lambda x : geodesic(*x).km
                     ).round(2).head(3)

city2	上海市	烏魯木齊市	蘭州市	北京市	南京市	南寧市	南昌市	臺北市	合肥市	呼和浩特市	...	福州市	西寧市	西安市	貴陽市	鄭州市	重慶市	銀川市	長春市	長沙市	香港
city1
上海市	0.00	3272.69	1718.73	1065.83	271.87	1601.34	608.49	687.22	403.87	1378.10	...	609.42	1912.40	1220.03	1527.44	827.47	1449.73	1606.18	1444.77	887.53	1229.26
烏魯木齊市	3272.69	0.00	1627.85	2416.78	3010.73	3005.60	3023.06	3708.72	2907.56	2009.15	...	3466.84	1443.79	2120.86	2572.82	2448.76	2305.96	1666.74	3004.86	2850.22	3415.02
蘭州市	1718.73	1627.85	0.00	1182.61	1447.27	1529.94	1397.55	2085.95	1326.02	870.72	...	1841.48	194.64	506.74	1086.41	904.20	765.89	343.02	2023.00	1226.07	1826.19

3 rows × 34 columns
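
For a quick cross-check that does not need geopy, the whole matrix can also be approximated with NumPy broadcasting. Note that this sketch swaps in the haversine great-circle formula on a sphere for geopy's ellipsoidal geodesic, so the values will differ slightly from the table above. The Earth radius of 6371 km is an assumption of the spherical model, and the three cities and coordinates are copied from the head of the data shown earlier.

```python
import numpy as np
import pandas as pd

cities = pd.DataFrame({
    'city': ['瀋陽市', '長春市', '北京市'],
    'latitude': [41.796768, 43.886841, 39.904987],
    'longitude': [123.429092, 125.324501, 116.405289],
})

lat = np.radians(cities['latitude'].to_numpy())
lon = np.radians(cities['longitude'].to_numpy())

# Pairwise differences via broadcasting: entry (i, j) compares city i with city j
dlat = lat[:, None] - lat[None, :]
dlon = lon[:, None] - lon[None, :]

# Haversine formula (spherical approximation, not the ellipsoidal geodesic)
a = np.sin(dlat / 2) ** 2 + np.cos(lat)[:, None] * np.cos(lat)[None, :] * np.sin(dlon / 2) ** 2
earth_radius_km = 6371.0  # assumed mean Earth radius
M = pd.DataFrame(2 * earth_radius_km * np.arcsin(np.sqrt(a)),
                 index=cities['city'], columns=cities['city']).round(2)
print(M)
```

The diagonal is exactly 0 and the matrix is symmetric by construction, which matches the requirement that a city's distance to itself is 0.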

Wrap all of the steps above in a function and time it.

def calculate_M():
    df = pd.read_table('map.txt', sep=',', names=['city1','longitude','latitude'], skiprows=1)
    df['coord1'] = pd.Series([*zip(df.latitude, df.longitude)])
    df.drop(columns=['longitude','latitude'], inplace=True)
    df2 = df.rename(columns={'city1':'city2', 'coord1':'coord2'}).set_index('city2')
    df_expand = df.groupby(['city1','coord1']).apply(lambda x:df2).reset_index(1).stack().reset_index().rename(columns={0:'coords'})
    return df_expand.pivot_table(values = 'coords',
                                  index = 'city1',
                                  columns = 'city2',
                                  aggfunc = lambda x : geodesic(*x).km
                                 ).round(2)

%timeit -n 10 calculate_M()

395 ms ± 23.5 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)

Appendix:

The `geopy` package

Looking up a location by city name

Create a geolocator:

from geopy.geocoders import Nominatim
geolocator = Nominatim(user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.121 Safari/537.36")

Look up a location by city name:

location = geolocator.geocode("南京市雨花臺區")
location.address

'雨花臺區, 南京市, 江蘇省, 中國'

Longitude:

location.longitude

118.7724224

Latitude:

location.latitude

31.9932018

Look up a location by latitude and longitude:

location = geolocator.reverse("31.997858805465647, 118.78544536405718")

location.address

'雨花東路, 雨花臺區, 建鄴區, 南京市, 江蘇省, 21006, 中國'

location.raw

{'place_id': 134810031,
 'licence': 'Data © OpenStreetMap contributors, ODbL 1.0. https://osm.org/copyright',
 'osm_type': 'way',
 'osm_id': 189414212,
 'lat': '31.99705152324867',
 'lon': '118.78513775762214',
 'display_name': '雨花東路, 雨花臺區, 建鄴區, 南京市, 江蘇省, 21006, 中國',
 'address': {'road': '雨花東路',
  'suburb': '雨花臺區',
  'city': '建鄴區',
  'state': '江蘇省',
  'postcode': '21006',
  'country': '中國',
  'country_code': 'cn'},
 'boundingbox': ['31.9964788', '31.9989487', '118.7819222', '118.7866616']}

Compute a distance:

from geopy.distance import distance, geodesic

wellington, salamanca = (-41.32, 174.81), (40.96, -5.50)
distance(wellington, salamanca, ellipsoid='GRS-80').miles

12402.369702934551

shanghai, beijing = (31.235929042252014,121.48053886017651), (39.910924547299565,116.4133836971231)
distance(shanghai, beijing).km

1065.985103985533

geodesic(shanghai, beijing).km

1065.985103985533

With these tools, the data set for Task 4 can be built by hand:

cities = df['city1']
cities.head()

0     瀋陽市
1     長春市
2    哈爾濱市
3     北京市
4     天津市
Name: city1, dtype: object

def get_lon_lat(city):
    location = geolocator.geocode(city)
    return location.longitude,location.latitude
longitude, latitude = [*zip(*[get_lon_lat(city) for city in cities])]
data = pd.DataFrame({'city':cities, 'longitude':longitude, 'latitude':latitude})
data.head()

	city	longitude	latitude
0	瀋陽市	123.458674	41.674989
1	長春市	125.317122	43.813074
2	哈爾濱市	126.530400	45.798827
3	北京市	116.718521	39.902080
4	天津市	117.195107	39.085673
