numpy Chinese documentation (updating…)
Concise worked examples for numpy, scipy, matplotlib, pandas, keras, and scikit-learn
- Basics
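numpy's main object is the homogeneous multidimensional array: a table of elements (usually numbers), all of the same type, indexed by a tuple of non-negative integers. In numpy, dimensions are called axes, and the number of axes is called the rank.
For example, the coordinates of a point in 3D space, [1, 2, 1], form an array of rank 1, because it has a single axis, and that axis has length 3. In the example below, the array has rank 2 (it has two dimensions): the first dimension (axis) has length 2 and the second dimension (axis) has length 3.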
[[1., 0., 0.],
[0., 1., 2.]]
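As a quick check of the two examples above, the following sketch builds both arrays with np.array and reads back the number of axes and the length of each axis. The variable names point and grid are illustrative only.

```python
import numpy as np

# The 3D point: one axis of length 3.
point = np.array([1, 2, 1])

# The 2 x 3 example: two axes, of lengths 2 and 3.
grid = np.array([[1., 0., 0.],
                 [0., 1., 2.]])

print(point.ndim, point.shape)  # 1 (3,)
print(grid.ndim, grid.shape)    # 2 (2, 3)
```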
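numpy's array class is called ndarray; it is also known by the alias array. Note that numpy.array is not the same as array.array from the Python standard library, which only handles one-dimensional arrays and offers less functionality. The more important attributes of an ndarray object are: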
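ndarray.ndim
The number of axes (dimensions) of the array. In the Python world, the number of dimensions is referred to as the rank.
ndarray.shape
The dimensions of the array: a tuple of integers giving the size of the array along each dimension. For a matrix with n rows and m columns, shape is (n, m). The length of the shape tuple is therefore the rank, that is, the number of dimensions, ndim.
ndarray.size
The total number of elements in the array, equal to the product of the elements of shape.
ndarray.dtype
An object describing the type of the elements in the array. A dtype can be created or specified using standard Python types. numpy also provides types of its own, for example numpy.int32, numpy.int16, and numpy.float64.
ndarray.itemsize
The size in bytes of each element of the array. For example, an array of elements of type float64 has itemsize 8 (=64/8), while one of type complex128 has itemsize 16 (=128/8). This attribute is equivalent to ndarray.dtype.itemsize.
ndarray.data
The buffer containing the actual elements of the array. Normally we do not need this attribute, because we access the elements of an array through indexing.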
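The attributes listed above can be inspected directly on any array. The sketch below is only an illustration: the array a and its int32 dtype are assumptions chosen so that every attribute has an easy-to-verify value.

```python
import numpy as np

# A 3 x 5 array of 32-bit integers (illustrative values only).
a = np.arange(15, dtype=np.int32).reshape(3, 5)

print(a.ndim)      # 2: number of axes
print(a.shape)     # (3, 5): size along each axis
print(a.size)      # 15: total number of elements, 3 * 5
print(a.dtype)     # int32
print(a.itemsize)  # 4: bytes per element, 32 / 8

# itemsize is the same value as dtype.itemsize.
print(a.itemsize == a.dtype.itemsize)  # True

# A float64 element occupies 8 bytes.
f = np.zeros(2, dtype=np.float64)
print(f.itemsize)  # 8
```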
……
Pandas
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import warnings
warnings.filterwarnings('ignore')
plt.rcParams['font.sans-serif'] = ['SimHei']  # so that Chinese labels render correctly
plt.rcParams['axes.unicode_minus'] = False
def test1():
    # Each Series carries its own `index`; when Series are merged into a `DataFrame`,
    # values are aligned on that index, one row per index label.
    # print(pd.Index([3]*4))
    # print(pd.Index(range(4)))
    # print(pd.date_range('20180201', periods=4))  # DatetimeIndex; default freq 'D' (calendar day)
    # print(pd.period_range('20180101', '2018-01-04'))  # PeriodIndex
    # print(pd.Index(data=[i for i in 'ABCDEF']))
    # print(list(pd.RangeIndex(10)))
    # s = pd.Series(10)  # scalar
    # s = pd.Series(data=[1, 2, 3], index=[10, 20, 20])  # array-like; non-unique index values are allowed
    s = pd.Series({'a': 10, 10: 'AA'}, index=['aa', 10])  # dict; index labels missing from the dict become NaN
    print(s)  # print(s[:])
    # df = pd.DataFrame(data=np.random.randn(4, 3), index=pd.RangeIndex(1, 5), columns=['A', 'B', 'C'])  # ndarray
    # df = pd.DataFrame(data={'A': np.array(range(1, 4))**2, 'B': pd.Timestamp('20180206'),
    #                         'C': pd.Series(data=['MLee', 'python', 'Pearson']), 'D': 126,
    #                         'E': pd.Categorical(values=['Upper', 'Middle', 'Lower'], categories=['Middle', 'Lower']),
    #                         'F': 'Laplace'}, index=pd.RangeIndex(3), columns=['A', 'B'])  # dict
    # values that are not listed in `categories` ('Upper' here) become NaN
    df = pd.DataFrame(data={'A': np.array(range(1, 4))**2, 'B': pd.Timestamp('20180206'),
                            'C': pd.Series(data=['MLee', 'python', 'Pearson']), 'D': [126, 10, 66],
                            'E': pd.Categorical(values=['Upper', 'Middle', 'Lower'], categories=['Middle', 'Lower']),
                            'F': 'Laplace'}, index=pd.RangeIndex(3), columns=pd.Index([i for i in 'FEDCBA']))  # dict
    print(df)
    # print(df.dtypes)
    # print(df.index)
    # print(df.columns)
    # print(df.values)  # numpy.ndarray
    # print('*'*126)
    # print(df.info())
    # print('*'*126)
    # print(df.describe())
    # print(df.transpose())
    # print(df.sort_index(axis=0, ascending=False))
    # print(df.sort_index(axis=1))
    # print(df.sort_values(by='D'))
    print('*'*126)
    df = pd.DataFrame(data=np.arange(24).reshape(6, 4), index=pd.date_range('20180201', periods=6),
                      columns=pd.Index([i for i in 'ABCD']))
    # df = pd.DataFrame(data=np.arange(24).reshape(6, 4))
    # print(df[0])
    # print(df[0:1])  # selecting rows with [] requires a slice
    print(df[:])
    print(df['A'])  # print(df.A)
    print(df[0:2][['A', 'B']])
    print(df['20180201':'20180202'][['A', 'B']])
    # selecting rows with [] requires a slice, while selecting columns requires a label or a list of labels
    # print(df[['A', 'B']])
    # print(df[0:2])  # excludes the row at position 2
    # print(df['20180201':'20180203'])  # includes the row labelled `20180203`
    # print(df[0:1])
    # print(df.loc['20180201'])  # a single label returns a single row
    # For rows and columns, df.loc selects by label (index and column names), while df.iloc selects by
    # integer position. The older df.ix allowed mixing the two, but it was removed in pandas 1.0.
    print(df.loc['20180201':'20180202'][['A', 'B']])
    print(df.loc['20180201':'20180202', ['A', 'B']])  # print(df['20180201':'20180202', ['A', 'B']])  # error
    # same cell as df.iloc[0, 0], but returned as a 1x1 DataFrame
    print(df.iloc[0:1, 0:1])  # integer positions for both rows and columns (0, 1, 2, ...)
    print(df.iloc[[0, 2, 4], 0:2])
    print(df.iloc[0:2, 0:2])  # was df.ix[0:2, 0:2]
    # print(df.loc['20180201':'20180202'].iloc[:, 0:2])  # was df.ix['20180201':'20180202', 0:2]
    # print(df.iloc[0:2][['A', 'B']])  # was df.ix[0:2, ['A', 'B']]
    # print(df.loc['20180201':'20180202', ['A', 'B']])  # was df.ix['20180201':'20180202', ['A', 'B']]
    print('**')
    print(df['B'][df.A > 4])  # print(df.B[df.A > 4])
    # set B to NaN where A > 4; `where` avoids the chained assignment df.B[df.A > 4] = np.nan
    df['B'] = df['B'].where(df.A <= 4)
    print(df)
    df['E'] = 0
    print(df)
    df['F'] = pd.Series(data=range(6), index=pd.date_range('20180201', periods=6))
    print(df)
    print('*'*12)
    df = pd.DataFrame(data=np.arange(24).reshape(6, 4), index=pd.date_range('20180201', periods=6), columns=pd.Index([i for i in 'ABCD']))
    # print(df)
    # df.dropna()
    # df.fillna()
def test2():
    dataset_training = pd.read_csv('C:/users/myPC/Desktop/ml/Titanic/train.csv')
    # print(dataset_training)
    print(dataset_training.Survived.value_counts())
    deceased = dataset_training.Pclass[dataset_training.Survived == 0].value_counts(sort=True)
    survived = dataset_training.Pclass[dataset_training.Survived == 1].value_counts(sort=True)
    # print(deceased, survived, sep='\n')
    df = pd.DataFrame({'Survived': survived, 'Deceased': deceased})
    print(df)
    df.plot(kind='bar', stacked=True)
    plt.title('Distribution of SES')
    plt.xlabel('Class')
    plt.ylabel('Numbers')
    plt.show()
def test3():
    ids = ['1001', '1008', '1102', '1001', '1003', '1101', '1126', '1007']  # renamed from `id`, which shadows the builtin
    name = ['Shannon', 'Gauss', 'Newton', 'Leibniz', 'Taylor', 'Lagrange', 'Laplace', 'Fourier']
    country = ['America', 'Germany', 'Britain', 'Germany', 'Britain', 'France', 'France', 'France']
    iq = [168, 180, 172, 228, 182, 172, 160, 186]
    sq = [180, 194, 160, 274, 150, 200, 158, 180]
    eq = [144, 152, 134, 166, 118, 144, 156, 128]
    dataset = list(zip(ids, name, country, iq, sq, eq))
    df = pd.DataFrame(data=dataset, columns=['Id', 'Name', 'Country', 'IQ', 'SQ', 'EQ'])
    df.to_csv('persons.csv', index=True, header=True)
    df = pd.read_csv('persons.csv', usecols=range(1, 7))  # skip the saved index column
    print(df)
    # print(df.info())
    # print(df[df.IQ == df.IQ.max()])
    print(df.sort_values(by='IQ', axis=0, ascending=False))  # df.head(1)
    ax = plt.subplot2grid((1, 3), (0, 0))
    df.IQ.plot(ax=ax)
    df.SQ.plot(ax=ax)
    df.EQ.plot(ax=ax)
    for i in range(df.shape[0]):
        # label passed positionally (the keyword was renamed from `s` to `text`); df.loc replaces the removed df.ix
        ax.annotate(df.loc[i, 'Name'], xy=(i, df.loc[i, 'IQ']), xytext=(1, 1), xycoords='data', textcoords='offset points')
    ax = plt.subplot2grid((1, 3), (0, 1), colspan=2)
    df[['IQ', 'SQ', 'EQ']].plot(kind='bar', ax=ax)  # without ax=, DataFrame.plot opens a new figure
    # df['IQ'].plot(kind='bar')
    # df['SQ'].plot(kind='bar')
    # df['EQ'].plot(kind='bar')
    for i in range(df.shape[0]):
        ax.annotate(df.loc[i, 'Name'], xy=(i, df.loc[i, 'IQ']), xytext=(1, 1), xycoords='data', textcoords='offset points')
    plt.show()
def test4():
    data_train = pd.read_csv(r"C:\Users\myPC\Desktop\ml\Titanic\train.csv")
    # plt.subplot2grid((2, 3), (0, 0))  # lay out several subplots in one figure
    survived = data_train.Pclass[data_train.Survived == 1].value_counts()
    deceased = data_train.Pclass[data_train.Survived == 0].value_counts()
    pd.DataFrame({'Survived': survived, 'deceased': deceased}).plot(kind='bar', stacked=True)
    # print(data_train.Sex[data_train.Survived == 1].value_counts())
    print(data_train.groupby(by='Survived').count())
    # data_train.Survived.value_counts().plot(kind='bar')  # bar chart
    # plt.title("Survival (1 = survived)")
    # plt.ylabel("Number of passengers")
    # plt.subplot2grid((2, 3), (0, 1))
    # data_train.Pclass.value_counts().plot(kind="bar")
    # plt.ylabel("Number of passengers")
    # plt.title("Passenger class distribution")
    #
    # plt.subplot2grid((2, 3), (0, 2))
    # plt.scatter(data_train.Survived, data_train.Age)
    # plt.ylabel("Age")  # set the y-axis label
    # plt.grid(True, which='major', axis='y')
    # plt.title("Survival by age (1 = survived)")
    #
    # plt.subplot2grid((2, 3), (1, 0), colspan=2)
    # data_train.Age[data_train.Pclass == 1].plot(kind='kde')
    # data_train.Age[data_train.Pclass == 2].plot(kind='kde')
    # data_train.Age[data_train.Pclass == 3].plot(kind='kde')
    # plt.xlabel("Age")  # set the x-axis label
    # plt.ylabel("Density")
    # plt.title("Age distribution by passenger class")
    # plt.legend(('1st class', '2nd class', '3rd class'), loc='best')  # set the legend for the graph
    #
    # plt.subplot2grid((2, 3), (1, 2))
    # data_train.Embarked.value_counts().plot(kind='bar')
    # plt.title("Passengers by port of embarkation")
    # plt.ylabel("Number of passengers")
    plt.show()
def test5():
    url = r'http://s3.amazonaws.com/assets.datacamp.com/course/dasi/present.txt'
    present = pd.read_table(url, sep=' ')
    # print(present)
    # present.set_index(keys=['year'], inplace=True)
    # print(present)
    print(present.columns)
    print(present.index)
    print(present.dtypes)
    # present.boys.plot(kind='kde')
    # present.girls.plot(kind='kde')
    present.set_index(keys=['year'], inplace=True)
    kinds = ['line', 'bar', 'barh', 'hist', 'box', 'kde', 'density', 'area', 'pie', 'scatter', 'hexbin']
    # plt.figure()
    # for i in range(len(kinds)):
    #     plt.subplot2grid(shape=(2, 3), loc=(i//3, i % 3))
    #     present[:10].plot(kind=kinds[i], subplots=True)
    present[:].plot(x='boys', y='girls', kind=kinds[-1])
    plt.legend(loc='upper right')
    plt.show()
def test6():
    s = pd.Series(data=np.random.randn(1000), index=pd.date_range('20180101', periods=1000))
    print(s)
    s = np.exp(s.cumsum())
    s.plot(style='m*', logy=True)
    plt.show()
def test7():
    df = pd.read_csv('http://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data',
                     names=['sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'name'])
    # print(df)
    # df.boxplot(by='name')
    # df.plot(kind='kde')
    # df.iloc[:, :-1].plot(kind='hist')
    setosa = df[df.name == 'Iris-setosa']
    versicolor = df[df.name == 'Iris-versicolor']
    virginica = df[df.name == 'Iris-virginica']
    # plt.subplot2grid(shape=(1, 3), loc=(0, 0))
    ax = plt.subplot(131)
    # was pd.DataFrame.plot(setosa); in current pandas `plot` is an accessor, so call it on the
    # instance and pass the target axes explicitly
    setosa.plot(ax=ax)
    # setosa.plot(title='setosa', subplots=True)
    # plt.subplot2grid(shape=(1, 3), loc=(0, 1))
    # versicolor.plot(title='versicolor', subplots=True)
    # versicolor.plot()
    # plt.subplot2grid(shape=(1, 3), loc=(0, 2))
    # virginica.plot(title='virginica', subplots=True)
    # virginica.plot()
    plt.show()
    # df.sepal_length.plot(kind='hist')
    # plt.show()
def test8():
    from sklearn.ensemble import RandomForestRegressor
    from sklearn import preprocessing
    from sklearn import linear_model
    from sklearn import model_selection
    dataset_training = pd.read_csv('C:/users/myPC/Desktop/ml/Titanic/train.csv')
    dataset_test = pd.read_csv('C:/users/myPC/Desktop/ml/Titanic/test.csv')
    passenger_id = dataset_test['PassengerId']
    # the `Fare` feature has a single missing value in the test data
    dataset_test.loc[dataset_test.Fare.isnull(), 'Fare'] = 0.0
    # drop the irrelevant features
    dataset_training.drop(columns=['PassengerId', 'Name', 'Ticket'], inplace=True)
    dataset_test.drop(columns=['PassengerId', 'Name', 'Ticket'], inplace=True)
    # predict the missing `Age` values from the other numeric features
    dataset_training_age = dataset_training[['Pclass', 'SibSp', 'Parch', 'Fare', 'Age']]
    dataset_test_age = dataset_test[['Pclass', 'SibSp', 'Parch', 'Fare', 'Age']]
    # to_numpy() returns the `ndarray`; as_matrix() was removed in pandas 1.0
    age_known0 = dataset_training_age[dataset_training_age.Age.notnull()].to_numpy()
    age_unknown0 = dataset_training_age[dataset_training_age.Age.isnull()].to_numpy()
    age_unknown1 = dataset_test_age[dataset_test_age.Age.isnull()].to_numpy()
    training_data_age = age_known0[:, :-1]
    training_target_age = age_known0[:, -1]
    # fit a RandomForestRegressor (1000 trees) on the training rows whose `Age` is known
    rfr = RandomForestRegressor(n_estimators=1000, n_jobs=-1, random_state=0)
    rfr.fit(training_data_age, training_target_age)
    # fill in the missing `Age` values in both sets (df.loc replaces the removed df.ix)
    dataset_training.loc[dataset_training.Age.isnull(), 'Age'] = rfr.predict(age_unknown0[:, :-1])
    dataset_test.loc[dataset_test.Age.isnull(), 'Age'] = rfr.predict(age_unknown1[:, :-1])
    # recode `Cabin` as 'Yes' where a value is present and 'No' where it is missing;
    # the non-null rows must be relabelled first
    dataset_training.loc[dataset_training.Cabin.notnull(), 'Cabin'] = 'Yes'
    dataset_training.loc[dataset_training.Cabin.isnull(), 'Cabin'] = 'No'
    dataset_test.loc[dataset_test.Cabin.notnull(), 'Cabin'] = 'Yes'
    dataset_test.loc[dataset_test.Cabin.isnull(), 'Cabin'] = 'No'
    # one-hot encode the categorical fields so that no ordering is implied between categories
    dataset_training_dummies = pd.get_dummies(dataset_training, columns=['Pclass', 'Sex', 'Cabin', 'Embarked'])
    dataset_test_dummies = pd.get_dummies(dataset_test, columns=['Pclass', 'Sex', 'Cabin', 'Embarked'])
    # standardize `Age` and `Fare`, whose scales differ from the other features;
    # each scaler is fitted on the training set and then applied unchanged to the test set
    for column in ['Age', 'Fare']:
        ss = preprocessing.StandardScaler()
        dataset_training_dummies[column] = ss.fit_transform(dataset_training_dummies[[column]]).ravel()
        dataset_test_dummies[column] = ss.transform(dataset_test_dummies[[column]]).ravel()
    # collect all processed samples
    print(dataset_training_dummies)
    dataset_training_dummies = dataset_training_dummies.filter(regex='Age|SibSp|Parch|Fare|Pclass_.*|Sex_.*|Cabin_.*|Embarked_.*|Survived').to_numpy(dtype=float)
    training_data = dataset_training_dummies[:, 1:]
    training_target = dataset_training_dummies[:, 0]  # `Survived` is the first column; keep the target 1-D
    # the 'l1' penalty needs a solver that supports it, such as 'liblinear'
    lr = linear_model.LogisticRegression(C=1.0, penalty='l1', solver='liblinear', tol=1e-5)
    print(model_selection.cross_val_score(lr, training_data, training_target, cv=4))
    lr.fit(training_data, training_target)
    predicts = lr.predict(dataset_test_dummies.to_numpy(dtype=float))
    ans = pd.DataFrame({'PassengerId': passenger_id, 'Survived': predicts.astype(np.int32)})
    # print(ans)
    # ans.to_csv('C:/users/myPC/Desktop/ml/Titanic/submission.csv', index=False)  # do not write the index column
    # print(pd.DataFrame({'features': list(dataset_test_dummies.columns), 'coef': list(lr.coef_.ravel())}))
def test9():
    import numpy.linalg as nla
    a = np.random.randint(20, size=(3, 4))
    print(a)
    print(np.diag(a))
    # U and V are unitary matrices and Sigma is the diagonal matrix of singular values,
    # but svd returns Sigma as a 1-D vector of the singular values
    U, Sigma, V_H = nla.svd(a)
    # rebuild the rectangular Sigma (3x4 here) by padding diag(Sigma) with zero columns
    Sigma = np.concatenate((np.diag(Sigma), np.zeros((U.shape[0], V_H.shape[1]-U.shape[1]))), axis=1)
    print(U)
    print(Sigma)
    print(V_H)
    print(U.dot(Sigma.dot(V_H)))  # reproduces `a` up to floating-point error
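def google():
    import tensorflow as tf
    a = tf.constant((1, 1))
    b = tf.constant((2, 2))
    ans = a + b
    # TensorFlow 1.x graph mode: the tensor is evaluated inside a session.
    # Under TensorFlow 2.x (eager execution), print(ans.numpy()) is used instead.
    sess = tf.Session()
    # print(type(sess.run(ans)))
    print(sess.run(ans))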
if __name__ == '__main__':
    # test9()
    # google()
    test8()
    # pd.concat()
    # df.drop()
    # df = pd.DataFrame({'A': [1, 2, 3], 'B': [np.nan, 1, np.nan], 'C': [10, 111, 1111], 'D': ['good', 'common', 'bad']})
    # relabel the non-null rows before the null ones; if the nulls were filled with 'No' first,
    # notnull() would then match every row and all values would become 'Yes'
    # df.loc[df.B.notnull(), 'B'] = 'Yes'
    # df.loc[df.B.isnull(), 'B'] = 'No'
    # print(df.loc[:, 'B'])
    # df.loc[df.B.isnull(), 'B'] = [0, 0]
    # print(pd.get_dummies(df, columns=['D', 'B']))
    # print(df)
    # print(df.filter(regex='A|D|B'))
    # df = pd.get_dummies(df, prefix=['M', 'L'])  # the original columns are replaced by the dummy columns
    # print(df)
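The commented-out lines above note that, when a column is recoded as 'Yes'/'No', the non-null entries should be relabelled before the null ones. A minimal sketch of why is given below. The helper label_presence and its nulls_first flag are hypothetical names introduced only for this illustration, and the Series is created with dtype=object so that strings can be stored in it.

```python
import numpy as np
import pandas as pd


def label_presence(values, nulls_first):
    """Hypothetical helper: recode values as 'Yes' (present) or 'No' (missing)."""
    s = pd.Series(values, dtype=object)  # object dtype so strings can be stored
    if nulls_first:
        s[s.isnull()] = 'No'    # after this step no entry is null any more,
        s[s.notnull()] = 'Yes'  # so this mask selects every row
    else:
        s[s.notnull()] = 'Yes'  # the nulls are still null at this point
        s[s.isnull()] = 'No'
    return list(s)


print(label_presence([np.nan, 1, np.nan], nulls_first=False))  # ['No', 'Yes', 'No']
print(label_presence([np.nan, 1, np.nan], nulls_first=True))   # ['Yes', 'Yes', 'Yes']
```

Computing the mask once, before either assignment, and reusing it for both labels would sidestep the ordering issue entirely.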