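These notes come from the Bilibili course Numpy & Pandas (莫煩 Python data processing tutorial) and are summarized below:
Introduction to pandas
pandas is a tool built on NumPy, created to solve data analysis tasks. It incorporates a large number of libraries and several standard data models, and provides the tools needed to operate on large datasets efficiently. pandas offers many functions and methods for processing data quickly and conveniently. You will soon find that it is one of the key factors that make Python a powerful and efficient data analysis environment.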
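pandas data structures
Series: a one-dimensional array, similar to a one-dimensional array in NumPy. Both are also close to Python's basic list data structure. A Series can hold different data types: strings, boolean values, numbers and so on can all be stored in a Series.
Time-Series: a Series indexed by time.
DataFrame: a two-dimensional, table-like data structure. Many of its features resemble data.frame in R. A DataFrame can be thought of as a container of Series.
Panel: a three-dimensional array, which can be thought of as a container of DataFrames. (Panel has been removed from recent pandas releases; a DataFrame with a MultiIndex is the usual replacement.)
Panel4D: a four-dimensional data container similar to Panel. (Also removed from recent pandas releases.)
PanelND: a module with a set of factory functions for creating N-dimensional named containers like Panel4D. (Also removed from recent pandas releases.)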
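To make the relationship between these structures concrete, here is a minimal sketch. The variable names (`mixed`, `ts`, `frame`) are illustrative and do not come from the tutorial.

```python
import pandas as pd

# A Series can hold values of different types; its dtype then becomes object.
mixed = pd.Series(["text", True, 3.14])
print(mixed.dtype)

# A Series indexed by dates is a time series.
ts = pd.Series([1.0, 2.0, 3.0], index=pd.date_range("20160101", periods=3))
print(type(ts.index).__name__)

# Each column of a DataFrame is itself a Series sharing the same row index.
frame = pd.DataFrame({"x": [1, 2, 3], "y": ["a", "b", "c"]})
col = frame["x"]
print(type(col).__name__, col.index.equals(frame.index))
```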
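Common pandas operations
The common operations fall into these groups: 1. initialization, 2. indexing, 3. setting values, 4. importing and exporting data, 5. merging (concatenation), 6. plotting data
I will go through these operations one by one and finish with a data preprocessing example.
1. Initialization
Taking Series and DataFrame as examples: a DataFrame can be initialized in three ways, from a dictionary, from a NumPy array, or from another DataFrame. First, look at the constructor provided by the pandas API: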
pandas.DataFrame(data=None, index=None, columns=None, dtype=None, copy=False)
data : numpy ndarray (structured or homogeneous), dict, or DataFrame. Dict can contain Series, arrays, constants, or list-like objects
data is the data to load; it can be a NumPy array, a dictionary, or a DataFrame
import pandas as pd
import numpy as np
# Create a one-dimensional Series
s = pd.Series([1,3,6,np.nan,4,1]) # similar to a 1D numpy array
print(s)
# Create a date index
dates = pd.date_range('20160101', periods=6)
print(dates)
# Create a DataFrame: data is a 6x4 numpy matrix of samples from the standard normal distribution, row labels are dates, column labels are A, B, C, D
df = pd.DataFrame(data=np.random.randn(6,4),index=dates,columns=['A','B','C','D'])
# Print the column labelled B
print(df['B'])
Initializing a DataFrame from a dictionary
# Row labels not specified
df2 = pd.DataFrame(data={'A':['A1','A2','A3','A4'],'B':['B1','B2','B3','B4'],'C':['C1','C2','C3','C4']})
# Row labels specified
df3 = pd.DataFrame(data={'A':['A1','A2','A3','A4'],'B':['B1','B2','B3','B4'],'C':['C1','C2','C3','C4']},index=['a','b','c','d'])
Initializing from a DataFrame
df4 = pd.DataFrame(df3)
Inspecting attributes
df4 = pd.DataFrame({'A' : 1.,
'B' : pd.Timestamp('20130102'),
'C' : pd.Series(1,index=list(range(4)),dtype='float32'),
'D' : np.array([3] * 4,dtype='int32'),
'E' : pd.Categorical(["test","train","test","train"]),
'F' : 'foo'})
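# Row labels
print(df4.index)
# Column labels
print(df4.columns)
# Values
print(df4.values)
# Transpose
print(df4.T)
# Shape
print(df4.shape)
# Sorting
df4 = pd.DataFrame(np.random.randn(4,4),index=['A','B','C','D'],columns=['k1','k2','k3','k4'])
print(df4.sort_index(axis=0, ascending=True)) # sort by row labels in ascending order (ascending=False for descending); axis=1 sorts by column labels
print(df4.sort_values(by='k2',axis=0)) # sort rows by the values in the second column, k2
print(df4.sort_values(by='B',axis=1)) # sort columns by the values in the second row, B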
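Because the sorting example above uses random data, the effect of each call is hard to see. The sketch below repeats the same calls on a small fixed table; the name `scores` and its values are made up for illustration.

```python
import pandas as pd

scores = pd.DataFrame(
    {"k1": [3, 1, 2], "k2": [9, 7, 8]},
    index=["C", "A", "B"],
)

# Sort rows by index label: ascending=True gives A, B, C; False gives C, B, A.
by_index_asc = scores.sort_index(axis=0, ascending=True)
by_index_desc = scores.sort_index(axis=0, ascending=False)

# Sort rows by the values in column "k2", largest first.
by_k2_desc = scores.sort_values(by="k2", ascending=False)

# Sort columns by the values found in row "A" (axis=1 reorders columns).
by_row_a = scores.sort_values(by="A", axis=1, ascending=False)

print(by_index_asc.index.tolist())
print(by_index_desc.index.tolist())
print(by_k2_desc.index.tolist())
print(by_row_a.columns.tolist())
```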
2. Indexing
Indexing relies mainly on two accessors, iloc and loc: iloc selects by integer position, loc selects by label
dates = pd.date_range('20160101', periods=6)
df6 = pd.DataFrame(np.arange(24).reshape(6,4),index=dates,columns=['A','B','C','D'])
print(df6)
print(df6['A'], df6.A)
print(df6[0:3], df6['20160102':'20160104'])
# select by label: loc
print(df6.loc['20160102'])
print(df6.loc[:,['A','B']])
print(df6.loc['20160102', ['A','B']])
# select by position: iloc
print(df6.iloc[3])
print(df6.iloc[3, 1])
print(df6.iloc[3:5,0:2])
print(df6.iloc[[1,2,4],[0,2]])
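One difference between the two accessors is easy to miss: label slices with loc include both endpoints, while position slices with iloc exclude the stop position. A minimal sketch, using an illustrative frame named `grid` with the same layout as df6:

```python
import numpy as np
import pandas as pd

dates = pd.date_range("20160101", periods=6)
grid = pd.DataFrame(np.arange(24).reshape(6, 4), index=dates, columns=list("ABCD"))

# Label slices with loc include BOTH endpoints.
by_label = grid.loc["20160102":"20160104", "A":"C"]

# Position slices with iloc exclude the stop position, like Python lists.
by_position = grid.iloc[1:4, 0:3]

print(by_label.shape)
print(by_position.shape)
print(by_label.equals(by_position))

# Boolean masks also work with loc: rows where column A is greater than 8.
print(grid.loc[grid["A"] > 8, ["A", "D"]])
```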
3. Setting values
Modifying values
dates = pd.date_range('20160101', periods=6)
df6 = pd.DataFrame(np.random.randn(6,4), index=dates, columns=['A', 'B', 'C', 'D'])
df6.iloc[2,2] = 1111
df6.loc['2016-01-03', 'D'] = 2222
df6.loc[df6.A > 0, 'A'] = 0 # set positive values in column A to 0; a single loc call avoids chained assignment
# Add an empty column
df6['F'] = np.nan
df6['G'] = pd.Series([1,2,3,4,5,6], index=pd.date_range('20160101', periods=6))
print(df6)
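Handling missing values
dates = pd.date_range('20130101', periods=6)
df = pd.DataFrame(np.arange(24, dtype=float).reshape((6,4)), index=dates, columns=['A', 'B', 'C', 'D']) # float dtype so NaN can be stored
df.iloc[0,1] = np.nan
df.iloc[1,2] = np.nan
print(df.dropna(axis=0, how='any')) # how={'any', 'all'}; drop rows that contain any NaN
print(df.fillna(value=0)) # replace NaN with 0
print(pd.isnull(df)) # boolean mask marking the missing values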
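The difference between how='any' and how='all', and filling with something other than a constant, can be sketched on a small frame. The name `raw` and its values are made up for illustration; filling with the column mean is just one possible strategy.

```python
import numpy as np
import pandas as pd

raw = pd.DataFrame(
    {"A": [1.0, np.nan, 3.0], "B": [np.nan, np.nan, 6.0], "C": [7.0, np.nan, 9.0]}
)

# how="any" drops a row if ANY value is missing; how="all" only if ALL are missing.
drop_any = raw.dropna(axis=0, how="any")
drop_all = raw.dropna(axis=0, how="all")

# Fill each column's gaps with that column's own mean instead of a constant.
filled = raw.fillna(raw.mean())

# Count missing values per column.
missing_per_column = raw.isnull().sum()

print(len(drop_any), len(drop_all))
print(filled)
print(missing_per_column.to_dict())
```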
4. Importing and exporting data
read_table(filepath_or_buffer[, sep, ...]) | Read general delimited file into DataFrame
read_csv(filepath_or_buffer[, sep, ...]) | Read CSV (comma-separated) file into DataFrame
read_fwf(filepath_or_buffer[, colspecs, widths]) | Read a table of fixed-width formatted lines into DataFrame
read_msgpack(path_or_buf[, encoding, iterator]) | Load msgpack pandas object from the specified file path (removed in pandas 1.0)
# read from
data = pd.read_csv('student.csv')
data = pd.read_csv('student.csv',sep=',') # the separator sep can be ',', '\t', a space, etc.
print(data)
# save to
data.to_pickle('student.pickle')
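Since student.csv is not included with these notes, the following sketch shows the same read and write calls as a round trip that needs no external file. The `students` frame is made up for illustration, and an in-memory buffer and a temporary directory stand in for real files.

```python
import io
import os
import tempfile

import pandas as pd

students = pd.DataFrame({"name": ["Ann", "Bob"], "score": [90, 85]})

# Write CSV text to an in-memory buffer; index=False skips the row labels.
buffer = io.StringIO()
students.to_csv(buffer, index=False)

# Read it back; sep="," is the default and is shown only for clarity.
buffer.seek(0)
from_csv = pd.read_csv(buffer, sep=",")

# Pickle stores the frame as a binary file; read_pickle loads it again.
with tempfile.TemporaryDirectory() as tmp_dir:
    pickle_path = os.path.join(tmp_dir, "student.pickle")
    students.to_pickle(pickle_path)
    from_pickle = pd.read_pickle(pickle_path)

print(from_csv)
print(from_pickle.equals(students))
```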
5. Merging (concatenation)
df1 = pd.DataFrame(np.ones((3,4))*0,columns=['a','b','c','d'])
df2 = pd.DataFrame(np.ones((3,4))*0,columns=['a','b','c','d'])
df3 = pd.DataFrame(np.ones((3,4))*0,columns=['a','b','c','d'])
# Combine with concat
print(pd.concat([df1,df2,df3],axis=0,ignore_index=True)) # stack row-wise
print(pd.concat([df1,df2,df3],axis=1,ignore_index=True)) # join column-wise
# DataFrame.append was removed in pandas 2.0; use concat instead
print(pd.concat([df1,df2]))
# Append a single row
s = pd.Series([1,2,3,4],index=['a','b','c','d'])
print(pd.concat([df1,s.to_frame().T],ignore_index=True))
# join, ('inner', 'outer')
df1 = pd.DataFrame(np.ones((3,4))*0, columns=['a','b','c','d'], index=[1,2,3])
df2 = pd.DataFrame(np.ones((3,4))*1, columns=['b','c','d', 'e'], index=[2,3,4])
res = pd.concat([df1, df2], axis=0, join='outer') # outer join: keep the union of columns, filling missing entries with NaN
print(res)
res = pd.concat([df1, df2], axis=0, join='inner') # inner join: keep only the columns shared by both frames
print(res)
# The join_axes argument was removed in pandas 1.0; reindexing on df1.index gives the same result
res = pd.concat([df1, df2], axis=1).reindex(df1.index)
#merge
left = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3'],
'A': ['A0', 'A1', 'A2', 'A3'],
'B': ['B0', 'B1', 'B2', 'B3']})
right = pd.DataFrame({'key': ['K0', 'K1', 'K2', 'K3'],
'C': ['C0', 'C1', 'C2', 'C3'],
'D': ['D0', 'D1', 'D2', 'D3']})
print(left)
print(right)
res = pd.merge(left, right, on='key')
print(res)
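In the merge example above both frames contain exactly the same keys, so the join type makes no difference. The sketch below uses keys that only partly overlap to show what the `how` parameter changes; the frames are made up for illustration.

```python
import pandas as pd

left = pd.DataFrame({"key": ["K0", "K1", "K2"], "A": ["A0", "A1", "A2"]})
right = pd.DataFrame({"key": ["K1", "K2", "K3"], "C": ["C1", "C2", "C3"]})

# inner (the default): keep only keys present in both frames.
inner = pd.merge(left, right, on="key", how="inner")

# outer: keep every key from either frame, filling gaps with NaN.
# indicator=True adds a "_merge" column recording where each row came from.
outer = pd.merge(left, right, on="key", how="outer", indicator=True)

# left: keep every key from the left frame only.
left_join = pd.merge(left, right, on="key", how="left")

print(inner["key"].tolist())
print(outer[["key", "_merge"]])
print(left_join["key"].tolist())
```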
6. Plotting data
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
# plot data
# Series
data = pd.Series(np.random.randn(1000), index=np.arange(1000))
data = data.cumsum() # cumulative sum of all values
# data.plot()
# DataFrame
data = pd.DataFrame(np.random.randn(1000, 4), index=np.arange(1000), columns=list("ABCD"))
data = data.cumsum()
# plot methods:
# 'bar', 'hist', 'box', 'kde', 'area', 'scatter', 'hexbin', 'pie'
ax = data.plot.scatter(x='A', y='B', color='DarkBlue', label="Class 1")
# Draw a second plot on the same axes
data.plot.scatter(x='A', y='C', color='LightGreen', label='Class 2', ax=ax)
plt.show()
Titanic data preprocessing
General data preprocessing workflow
# Import the required packages
import re  # used by get_title further down
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline
root_path = '/opt/data/datasets/getting-started/titanic/input'
train = pd.read_csv('%s/%s' % (root_path, 'train.csv'))
test = pd.read_csv('%s/%s' % (root_path, 'test.csv'))
train.head(5)
train.info()
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
PassengerId    891 non-null int64
Survived       891 non-null int64
Pclass         891 non-null int64
Name           891 non-null object
Sex            891 non-null object
Age            714 non-null float64
SibSp          891 non-null int64
Parch          891 non-null int64
Ticket         891 non-null object
Fare           891 non-null float64
Cabin          204 non-null object
Embarked       889 non-null object
dtypes: float64(2), int64(5), object(5)
memory usage: 83.6+ KB
# Number of survivors
train['Survived'].value_counts()
# 0    549
# 1    342
# Name: Survived, dtype: int64
# Assumption: titanic is a working copy of the training set (the original notes do not define it)
titanic = train.copy()
# Handle missing values (the median works well for Age)
titanic["Age"] = titanic["Age"].fillna(titanic["Age"].median())
titanic["Fare"] = titanic["Fare"].fillna(titanic["Fare"].median())
# Encode text features (sex, port of embarkation)
print(titanic["Sex"].unique())
titanic.loc[titanic["Sex"]=="male", "Sex"] = 0
titanic.loc[titanic["Sex"]=="female", "Sex"] = 1
# Combined feature (combining the features made the correlation worse)
# titanic["FamilySize"] = titanic["SibSp"] + titanic["Parch"]
# S is the most frequent port, so missing values are filled with it; sampling at random according to the observed frequencies would also work
print(titanic["Embarked"].unique())
"""
titanic[["Embarked"]].groupby("Embarked").agg({"Embarked": "count"})
          Embarked
Embarked
C              168
Q               77
S              644
"""
titanic["Embarked"] = titanic["Embarked"].fillna('S')
titanic.loc[titanic["Embarked"] == "S", "Embarked"] = 0
titanic.loc[titanic["Embarked"] == "C", "Embarked"] = 1
titanic.loc[titanic["Embarked"] == "Q", "Embarked"] = 2
def get_title(name):
    # Extract the honorific (title) from a name
    title_search = re.search(r' ([A-Za-z]+)\.', name)
    if title_search:
        return title_search.group(1)
    return ""
titles = titanic["Name"].apply(get_title)
# print(pandas.value_counts(titles))
# Build a mapping dictionary for the titles
# The Name field contains the form of address for each passenger, such as Mr or Miss. This carries information about age and sex, and possibly social status, as with Dr, Lady, Major, Master and so on. It is hard to show in a chart, but in feature engineering we extract it and feed it to the model.
# The remaining factors are ticket fare, cabin number and ticket number. All three may affect where a passenger was on the ship and therefore the order of escape, but since they show no obvious relationship with survival, they are left for the model to weigh during later model ensembling.
title_mapping = {"Mr": 1, "Miss": 2, "Mrs": 3, "Master": 4, "Dr": 5, "Rev": 6, "Major": 7, "Col": 7, "Mlle": 8, "Mme": 8, "Don": 9, "Dona": 9, "Lady": 10, "Countess": 10, "Jonkheer": 10, "Sir": 9, "Capt": 7, "Ms": 2}
for k, v in title_mapping.items():
    titles[titles == k] = v
# print(pd.value_counts(titles))
# Add a new feature for the passenger's title
titanic["Title"] = [int(i) for i in titles.values.tolist()]
# Add a new feature for the length of the name
titanic["NameLength"] = titanic["Name"].apply(lambda x: len(x))
# Correlation too weak, so drop these columns
# titanic.drop(['PassengerId'], axis=1,inplace=True)
titanic.drop(['Cabin'], axis=1,inplace=True)
titanic.drop(['SibSp'], axis=1,inplace=True)
# titanic.drop(['Parch'],axis=1,inplace=True)
titanic.drop(['Ticket'], axis=1,inplace=True)
titanic.drop(['Name'], axis=1,inplace=True)
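The preprocessing steps above depend on the Titanic CSV files. The sketch below applies the same ideas (median fill, encoding text categories, extracting and encoding the title) to a three-row toy frame so it runs without the dataset. The frame `toy`, its rows, and the reduced `title_codes` mapping are illustrative assumptions; `Series.map` and `str.extract` are used here as alternatives to the loc assignments and the apply call in the notes, not as the tutorial's own method.

```python
import pandas as pd

# Toy rows in the same "Surname, Title. Given names" style as the Name column.
toy = pd.DataFrame(
    {
        "Name": ["Braund, Mr. Owen Harris", "Heikkinen, Miss. Laina", "Smith, Dr. John"],
        "Sex": ["male", "female", "male"],
        "Age": [22.0, None, 40.0],
    }
)

# Fill missing ages with the median of the known ages.
toy["Age"] = toy["Age"].fillna(toy["Age"].median())

# Encode Sex with a dictionary lookup instead of one loc assignment per value.
toy["Sex"] = toy["Sex"].map({"male": 0, "female": 1})

# Extract the honorific with the same pattern as get_title, vectorised.
toy["Title"] = toy["Name"].str.extract(r" ([A-Za-z]+)\.", expand=False)

# Map titles to integer codes; titles missing from the mapping become 0.
title_codes = {"Mr": 1, "Miss": 2, "Dr": 5}
toy["TitleCode"] = toy["Title"].map(title_codes).fillna(0).astype(int)

print(toy[["Sex", "Age", "Title", "TitleCode"]])
```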