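Learning pandas

Getting started with pandas

Understanding the pandas library:

Two data types: Series (one-dimensional data) and DataFrame (two-dimensional data; higher-dimensional data can be represented with a hierarchical index)

numpy (basic data type) focuses on the data structure

pandas (extended data types) focuses on application (indexing)

Selection:

.loc label-based indexing (e.g. 'a')

.iloc position-based indexing (0, 1, 2, ...)

.ix mixed label and position indexing (labels first, then positions); deprecated in pandas 0.20 and removed in pandas 1.0, so use .loc or .iloc instead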
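To make the difference between the two surviving indexers concrete, here is a minimal sketch. The Series `s` and its values are made up for illustration and are not part of the original notes.

```python
import pandas as pd

# Hypothetical sample data, used only for illustration.
s = pd.Series([10, 20, 30], index=["a", "b", "c"])

by_label = s.loc["b"]      # label-based lookup
by_position = s.iloc[1]    # position-based lookup (0-based)

# Label slices include the end label; position slices exclude the end position.
label_slice = s.loc["a":"b"]
position_slice = s.iloc[0:2]

print(by_label, by_position)
print(label_slice.equals(position_slice))
```

Both slices select the same two elements here, which is why code that used to rely on .ix can usually be rewritten with one of the two indexers.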

The Series type in pandas

A Series consists of a set of data and the index associated with that data

import pandas as pd
a = pd.Series([9, 8, 7, 6])
print(a)
print(a.dtype)

Custom index:

# Custom index
import pandas as pd
a = pd.Series([9, 8, 7, 6], index=['a', 'b', 'c', 'd'])  # the "index=" keyword may be omitted
print(a)
print(a.dtype)

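Creating a Series

(1) From a dictionary

# (1) Create from a dictionary
d = pd.Series({'a': 8, 'b': 6, 'c': 5})  # a dict and a Series have a similar structure
print(d)
# Passing index= together with a dict does not relabel the data: it selects by key.
# Only 'c' exists in the dict, so 'd', 'f' and 'p' become NaN and the dtype becomes float64.
d = pd.Series({'a': 8, 'b': 6, 'c': 5}, index=['c', 'd', 'f', 'p'])
print(d)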

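(2) From an ndarray

# (2) Create from an ndarray
import numpy as np
n = pd.Series(np.arange(5))
# The index must have the same length as the data: np.arange(9, 4, -1) gives 9, 8, 7, 6, 5
m = pd.Series(np.arange(5), index=np.arange(9, 4, -1))
print(n)
print(m)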

(3) From a list

a = pd.Series([9, 8, 7, 6], index=['a', 'b', 'c', 'd'])
print(a)
print(a.dtype)

(4) From other functions, such as range()
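The notes give no code for the range() case, so here is a small sketch of what it might look like. The variable name `r` and the labels are chosen for illustration only.

```python
import pandas as pd

# Any iterable of values works as the data; range(4) yields 0, 1, 2, 3.
r = pd.Series(range(4), index=["a", "b", "c", "d"])

print(r)
print(r["c"])
```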

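Basic Series operations

# Basic Series operations
import numpy as np
b = pd.Series([9, 8, 7, 6], index=['a', 'b', 'c', 'd'])
print(b.index)  # get the index
print(b.values)  # get the data
print(b['b'])  # get one value by label
print(b[['a', 'b', 'c']])  # get several values with a list of labels
# Slicing
print("*************************")
print(b[:3])
print(b[1:4])
# Arithmetic
print("+++++++++++++++++++++++++")
print(np.sin(b))  # the result is still a Series
# Check whether a label is in the Series index
print('c' in b)  # the label exists, so this is True
print(3 in b)  # "in" checks labels, not values or positions, so this is False
print(b.get('f', 100))  # returns the value at label 'f' if present, otherwise the default 100
# Set the name attributes of the Series
print("################################")
b.name = "Series"
b.index.name = "index column"
print(b)
# Modify Series values (assignment) and the name
b.name = "new_name"
b[['c', 'd']] = [777, 666]  # select several labels with a list, not a bare tuple
print(b)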

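The DataFrame type in pandas

A DataFrame consists of a set of columns that share one index (index + multiple columns of data)

Each column can have a different data type

A DataFrame has both a row index (index) and a column index (columns)

A DataFrame can be created from a two-dimensional ndarray; from a dictionary of one-dimensional ndarrays, lists, dictionaries, tuples or Series; or from other Series or DataFrame objects.

Creating a DataFrame from a two-dimensional ndarray

Creating a DataFrame from a dictionary of one-dimensional objects

Creating a DataFrame from a dictionary of lists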

# Create a DataFrame from a two-dimensional ndarray
import numpy as np
d = pd.DataFrame(np.arange(1, 21, 1).reshape(4, 5))
print(d)
# Create from a dictionary of one-dimensional objects (Series here)
dt = {'one': pd.Series([1, 2, 3], index=['a', 'b', 'c']),
      'two': pd.Series([9, 8, 7, 6], index=['a', 'b', 'c', 'd'])}
d = pd.DataFrame(dt)  # rows are aligned by label; missing entries are filled with NaN
print(d)
# Create a DataFrame from a dictionary of lists
dl = {'one': [1, 2, 3, 4], 'two': [9, 8, 7, 6]}
d = pd.DataFrame(dl, index=['a', 'b', 'c', 'd'])
print(d)
print("^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^")
print(d['one'])  # get one column as a Series
print(d.loc['d'])  # get one row as a Series (.ix was removed in pandas 1.0)
print(d.loc['d', 'one'])  # get the value at the intersection of row 'd' and column 'one'

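Operations on pandas data types

Changing Series and DataFrame objects:

Reindexing: .reindex() can change or reorder the index of a Series or DataFrame

# Reindexing with .reindex()
dl = {'city': ['Beijing', 'Shanghai', 'Guangzhou', 'Shenzhen', 'Shenyang'],
      'month_on_month': [102.5, 101.2, 334.6, 90.2, 100.1],
      'year_on_year': [120.3, 89.3, 132.4, 110.7, 100.1],
      'fixed_base': [121.7, 127.8, 120.0, 145.5, 102.6]}
d = pd.DataFrame(dl, index=['c1', 'c2', 'c3', 'c4', 'c5'])
print(d)
# Reorder the columns explicitly (city first)
d = d.reindex(columns=['city', 'year_on_year', 'month_on_month', 'fixed_base'])
print(d)

Parameters of .reindex(): index and columns (the new labels), fill_value (value used for newly created positions), method (fill method such as ffill or bfill), limit (maximum number of consecutive fills), copy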
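As a rough illustration of how fill_value, method and limit behave, consider the following sketch. The Series `s` is invented for this example; note that method requires a monotonic index.

```python
import pandas as pd

# Hypothetical data with gaps in the index (labels 1, 2 and 4 are missing).
s = pd.Series([1.0, 2.0], index=[0, 3])

filled_const = s.reindex(range(5), fill_value=0.0)      # new labels get 0.0
filled_ffill = s.reindex(range(5), method="ffill")      # new labels copy the previous value
limited = s.reindex(range(5), method="ffill", limit=1)  # at most one consecutive forward fill

print(filled_const.tolist())
print(filled_ffill.tolist())
print(limited.tolist())
```

With limit=1, label 1 is filled from label 0, but label 2 is the second consecutive gap and stays NaN.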

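Adding a column:

dl = {'city': ['Beijing', 'Shanghai', 'Guangzhou', 'Shenzhen', 'Shenyang'],
      'month_on_month': [102.5, 101.2, 334.6, 90.2, 100.1],
      'year_on_year': [120.3, 89.3, 132.4, 110.7, 100.1],
      'fixed_base': [121.7, 127.8, 120.0, 145.5, 102.6]}
d = pd.DataFrame(dl, index=['c1', 'c2', 'c3', 'c4', 'c5'])
print(d)
# Reorder the columns explicitly (city first)
d = d.reindex(columns=['city', 'year_on_year', 'month_on_month', 'fixed_base'])
print(d)
# Add a column: insert a new label into the column index, then reindex
newc = d.columns.insert(4, 'new')
newd = d.reindex(columns=newc, fill_value=200)  # the new column is filled with 200
print(newd)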

DataFrame index types:

# Index types
dl = {'city': ['Beijing', 'Shanghai', 'Guangzhou', 'Shenzhen', 'Shenyang'],
      'month_on_month': [102.5, 101.2, 334.6, 90.2, 100.1],
      'year_on_year': [120.3, 89.3, 132.4, 110.7, 100.1],
      'fixed_base': [121.7, 127.8, 120.0, 145.5, 102.6]}
d = pd.DataFrame(dl, index=['c1', 'c2', 'c3', 'c4', 'c5'])
print(d.index)  # row index
print(d.columns)  # column index

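Using the DataFrame index types (inserting and deleting labels):

# Using the DataFrame index types
dl = {'city': ['Beijing', 'Shanghai', 'Guangzhou', 'Shenzhen', 'Shenyang'],
      'month_on_month': [102.5, 101.2, 334.6, 90.2, 100.1],
      'year_on_year': [120.3, 89.3, 132.4, 110.7, 100.1],
      'fixed_base': [121.7, 127.8, 120.0, 145.5, 102.6]}
d = pd.DataFrame(dl, index=['c1', 'c2', 'c3', 'c4', 'c5'])
d = d.reindex(columns=['city', 'year_on_year', 'month_on_month', 'fixed_base'])
print(d)
nc = d.columns.delete(2)  # drop the third column label (position 2)
ni = d.index.insert(5, 'c0')  # append a new row label at position 5
nd = d.reindex(index=ni, columns=nc)  # the new row 'c0' is filled with NaN
print(nd)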

Dropping entries by label:

# Dropping entries by label
# (1) Series
a = pd.Series([9, 8, 7, 6], index=['a', 'b', 'c', 'd'])
print(a)
print("111111111111111111")
print(a.drop(['b', 'c']))  # drop returns a new object; a itself is unchanged
# (2) DataFrame
dl = {'city': ['Beijing', 'Shanghai', 'Guangzhou', 'Shenzhen', 'Shenyang'],
      'month_on_month': [102.5, 101.2, 334.6, 90.2, 100.1],
      'year_on_year': [120.3, 89.3, 132.4, 110.7, 100.1],
      'fixed_base': [121.7, 127.8, 120.0, 145.5, 102.6]}
d = pd.DataFrame(dl, index=['c1', 'c2', 'c3', 'c4', 'c5'])
d = d.reindex(columns=['city', 'year_on_year', 'month_on_month', 'fixed_base'])
print(d)
print("22222222222222222")
print(d.drop('c5'))  # drop by row label (index)
print("333333333333333333")
print(d.drop('year_on_year', axis=1))  # drop by column label (axis=1 means columns)

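Arithmetic on the data types

Arithmetic aligns the operands on their row and column labels, fills in the missing positions, and then computes (the result is typically floating point, because NaN requires a float dtype)

Missing positions are filled with NaN (a null value) during alignment

Operations between two-dimensional and one-dimensional objects, or between a one-dimensional object and a scalar, are broadcast (similar to array broadcasting in MATLAB)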
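The alignment rule is easiest to see on two small Series whose labels only partly overlap. This is a minimal sketch with made-up values, not data from the original notes.

```python
import pandas as pd

# Hypothetical Series sharing only the labels 'b' and 'c'.
s1 = pd.Series([1, 2, 3], index=["a", "b", "c"])
s2 = pd.Series([10, 20, 30], index=["b", "c", "d"])

total = s1 + s2     # aligned on the union of labels: a, b, c, d
doubled = s1 * 2    # a scalar is broadcast to every element

print(total)
print(total.dtype)  # float64, because the unmatched labels produce NaN
print(doubled)
```

Labels present in only one operand ('a' and 'd') come out as NaN, while the overlapping labels are added element by element.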

# Addition
a = pd.DataFrame(np.arange(12).reshape(3, 4))
b = pd.DataFrame(np.arange(20).reshape(4, 5))
print(a)
print(b)
print('-' * 30)
print(a + b)  # element-wise on matching labels; positions missing from either operand become NaN

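Method-form arithmetic (as opposed to the operators + - * /)

Note: the method forms accept extra parameters that the operators do not

.add(d, **kwargs) addition between objects, with optional parameters

.sub(d, **kwargs) subtraction between objects, with optional parameters

.mul(d, **kwargs) multiplication between objects, with optional parameters

.div(d, **kwargs) division between objects, with optional parameters

# Method-form arithmetic (as opposed to the operators + - * /)
# fill_value controls how missing positions are filled
a = pd.DataFrame(np.arange(12).reshape(3, 4))
b = pd.DataFrame(np.arange(20).reshape(4, 5))
print(a)
print(b)
print(a.add(b, fill_value=100))  # positions missing from one operand are treated as 100 before adding
print('_______________________')
print(b.mul(a, fill_value=0))  # positions missing from one operand are treated as 0 before multiplying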

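Operations between different dimensions (broadcasting)

# Operations between different dimensions (broadcasting)
a = pd.DataFrame(np.arange(12).reshape(3, 4))
b = pd.DataFrame(np.arange(20).reshape(4, 5))
print(a)
print(b)
c = pd.Series(np.arange(4))
print(c)
print(c + 10)  # add 10 to every element
print(b.sub(c, axis=0))  # axis=0: c is matched to b's row labels and subtracted from every column
print(b.sub(c))  # default axis=1: c is matched to b's column labels and subtracted from every row; column 4 has no match and becomes NaN
print(b.sub(a, fill_value=100))  # DataFrame minus DataFrame: positions missing from a are treated as 100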

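Comparison rules

Note: the comparison operators only compare elements with identical labels and do not align or fill; comparing two objects whose labels differ raises a ValueError

Operations between two-dimensional and one-dimensional objects, or between a one-dimensional object and a scalar, are broadcast

Binary operations with the symbols < > <= >= == != produce boolean objects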
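A short sketch of both rules follows: broadcasting against a scalar, and the error raised when labels do not match. The Series and the flag name `raised` are made up for illustration.

```python
import pandas as pd

# Hypothetical data.
s = pd.Series([3, 8, 5], index=["a", "b", "c"])

mask = s > 4         # the scalar is broadcast; the result is a boolean Series
selected = s[mask]   # boolean indexing keeps the rows where mask is True

# Comparing objects with different labels is an error rather than an alignment.
other = pd.Series([1, 2, 3], index=["x", "y", "z"])
try:
    s > other
    raised = False
except ValueError:
    raised = True

print(mask.tolist())
print(selected)
print(raised)
```

The boolean result is what makes filtering such as s[s > 4] possible.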

# Comparison rules
a = pd.DataFrame(np.arange(12).reshape(3, 4))
d = pd.DataFrame(np.arange(12, 0, -1).reshape(3, 4))
print(d)
print(a > d)

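Data feature analysis

Sorting data by index

The .sort_index() method sorts by index labels along the given axis, ascending by default

Signature: .sort_index(axis=0, ascending=True)

# Sorting (by index labels)
import pandas as pd
import numpy as np
b = pd.DataFrame(np.arange(20).reshape(4, 5),
                 index=['c', 'a', 'b', 'd'])
print(b)
print(b.sort_index())  # operates on axis=0 by default and sorts the row labels; pass axis=1 to sort the column labels
print(b.sort_index(ascending=False))  # ascending=False sorts the labels in descending order
c = b.sort_index(axis=1, ascending=False)  # sort the columns in descending order
print('^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^')
print(c)
print(c.sort_index())  # sort the rows in ascending order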

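Sorting data by value

The .sort_values() method sorts by value along the given axis, ascending by default

Series.sort_values(axis=0, ascending=True)

DataFrame.sort_values(by, axis=0, ascending=True)  # sorts by the values found under one label (ascending or descending)

by: a label or list of labels; column labels when axis=0, row labels when axis=1

Note the relationship between axis and rows/columns: axis=0 runs down the rows (vertical), axis=1 runs across the columns (horizontal)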
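Since the axis convention is easy to mix up, here is a small sketch on a 2x2 frame. The frame and its labels are hypothetical.

```python
import pandas as pd

# Hypothetical 2x2 frame.
df = pd.DataFrame([[3, 1], [2, 4]], index=["r1", "r2"], columns=["x", "y"])

down = df.sum(axis=0)     # collapse the rows: one result per column
across = df.sum(axis=1)   # collapse the columns: one result per row

by_column = df.sort_values("x")        # axis=0: reorder rows by the values in column 'x'
by_row = df.sort_values("r1", axis=1)  # axis=1: reorder columns by the values in row 'r1'

print(down)
print(across)
print(by_column)
print(by_row)
```

With axis=0 the by argument names a column and the rows move; with axis=1 it names a row and the columns move.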

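# Sorting (by value)
import pandas as pd
import numpy as np
b = pd.DataFrame(np.random.rand(20).reshape(4, 5), index=['c', 'a', 'b', 'd'])
print(b)
c = b.sort_values(2, ascending=False)  # default axis=0: reorder the rows by the values in column 2
print(c)
print('xxxxxxxxxxxxxxxxxxxxxxxxxxxxx')
print(c.sort_values('a', axis=1, ascending=False))  # axis=1: reorder the columns by the values in row 'a'

NaN handling when sorting: NaN values are placed at the end by default, in both ascending and descending order (this is controlled by the na_position parameter)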
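The NaN placement can be checked with a short sketch. The values are invented; na_position is a real parameter of sort_values whose default is "last".

```python
import numpy as np
import pandas as pd

# Hypothetical data with one missing value.
s = pd.Series([3.0, np.nan, 1.0, 2.0])

asc = s.sort_values()                        # NaN goes last
desc = s.sort_values(ascending=False)        # NaN still goes last
first = s.sort_values(na_position="first")   # opt in to NaN first

print(asc.tolist())
print(desc.tolist())
print(first.tolist())
```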

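Basic statistical analysis

(1) Applies to both Series and DataFrame

.sum() sum of the data, computed along axis 0 by default (the same holds for the methods below)

.count() number of non-NaN values

.mean() .median() arithmetic mean and median

.var() .std() variance and standard deviation

.min() .max() minimum and maximum

(2) Locating the extreme values

.argmin() .argmax() integer position of the minimum/maximum (automatic index; Series only)

.idxmin() .idxmax() index label of the minimum/maximum (custom index; also available on DataFrame)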
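The methods listed so far can be tried on a small Series. This is a sketch using the same values as the earlier Series examples, plus a hypothetical two-column frame `df` to show idxmax on a DataFrame.

```python
import pandas as pd

s = pd.Series([9, 8, 7, 6], index=["a", "b", "c", "d"])

print(s.sum(), s.count(), s.mean(), s.median())
print(s.var(), s.std())        # sample statistics (ddof=1 by default)
print(s.argmin(), s.idxmin())  # integer position vs. index label of the minimum
print(s.argmax(), s.idxmax())

# Hypothetical frame: idxmax returns one row label per column.
df = pd.DataFrame({"p": [1, 5], "q": [7, 2]}, index=["r1", "r2"])
print(df.idxmax())
```

Note the difference in the last two Series lines: argmin gives the position 3, while idxmin gives the label 'd'.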

(3) Applies to both Series and DataFrame

.describe() summary statistics along axis 0 (one summary per column)

# Basic statistical analysis
# (1) One-dimensional Series
import pandas as pd
import numpy as np
a = pd.Series([9, 8, 7, 6], index=['a', 'b', 'c', 'd'])
print(a)
print(a.describe())  # describe() returns a Series, so its entries can be read by label
print(a.describe()['count'])
print(a.describe()['25%'])
# (2) Two-dimensional DataFrame
b = pd.DataFrame(np.arange(20).reshape(4, 5), index=['a', 'b', 'c', 'd'])
print(b)
print(b.describe())
print("@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@@")
# Get one row of the summary; the result is a Series
print(b.describe().loc['max'])  # .ix was removed in pandas 1.0
# Get one column of the summary; the result is a Series
print(b.describe()[2])

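Cumulative statistics

.cumsum() running sum of the first 1, 2, 3, ..., n values

.cumprod() running product of the first n values

.cummax() running maximum of the first n values

.cummin() running minimum of the first n values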
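A quick sketch of the four cumulative methods on one made-up Series:

```python
import pandas as pd

# Hypothetical data.
s = pd.Series([1, 3, 2, 4])

running_sum = s.cumsum()
running_product = s.cumprod()
running_max = s.cummax()
running_min = s.cummin()

print(running_sum.tolist())
print(running_product.tolist())
print(running_max.tolist())
print(running_min.tolist())
```

Each result has the same length and index as the input, with position n summarizing the first n values.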

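Correlation analysis

Positive correlation, negative correlation, no correlation

Criteria:

(1) Covariance (covariance > 0: the two variables are positively correlated; covariance < 0: negatively correlated; covariance = 0: linearly uncorrelated, which does not by itself imply independence) -> .cov() computes the covariance matrix

(2) Pearson correlation coefficient (r lies in [-1, 1]); a common rule of thumb for |r|:

0.8 - 1.0 very strong correlation

0.6 - 0.8 strong correlation

0.4 - 0.6 moderate correlation

0.2 - 0.4 weak correlation

0.0 - 0.2 very weak or no correlation

-> .corr() computes the correlation matrix; it supports the Pearson, Spearman and Kendall coefficients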
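To connect the two criteria, the following sketch computes the covariance and the Pearson coefficient on toy data and checks that r equals the covariance divided by the product of the standard deviations. The Series x, y and z are invented so that the answers are easy to verify by hand.

```python
import pandas as pd

# Hypothetical data: y rises exactly with x, z falls exactly with x.
x = pd.Series([1, 2, 3, 4, 5])
y = pd.Series([2, 4, 6, 8, 10])
z = pd.Series([5, 4, 3, 2, 1])

cov_xy = x.cov(y)    # sample covariance (divides by n - 1)
r_xy = x.corr(y)     # Pearson by default
r_xz = x.corr(z)

# Pearson r is the covariance scaled by both standard deviations.
manual_r = cov_xy / (x.std() * y.std())

# On a DataFrame, .corr() returns the full correlation matrix.
matrix = pd.DataFrame({"x": x, "y": y, "z": z}).corr()

print(cov_xy, r_xy, r_xz, manual_r)
print(matrix)
```

Because y is an exact increasing linear function of x and z an exact decreasing one, the coefficients sit at the two ends of the [-1, 1] range.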

# Correlation analysis
# Relationship between housing price growth and growth in RMB issuance (M2)
import pandas as pd
import matplotlib.pyplot as plt
hprice = pd.Series([3.04, 22.93, 12.75, 22.6, 12.33], index=['2008', '2009', '2010', '2011', '2012'])
m2 = pd.Series([8.18, 18.38, 9.13, 7.82, 6.69], index=['2008', '2009', '2010', '2011', '2012'])
print(hprice.corr(m2))  # r is about 0.5239, a moderate correlation
# Draw a scatter plot to inspect the correlation
plt.scatter(hprice, m2)
plt.show()

Handling missing data in pandas

# Handling NaN data
# (1) Dropping
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(4, 5) + 10,
                  index=['a', 'b', 'c', 'd'], columns=np.arange(5))  # pass the array itself; wrapping it in a list can produce a MultiIndex

# .loc  label-based indexing
# .iloc position-based indexing (0, 1, 2, ...)
# .ix   mixed indexing, removed in pandas 1.0; .loc is used below instead

df.loc['c', 3] = np.nan
df.loc['c', 2] = np.nan
df.iloc[3, 4] = np.nan
print(df)
print('_____________________________________')
# how='any' drops a row (column) if it contains any NaN
# how='all' drops a row (column) only if every value in it is NaN
print(df.dropna(axis=0, how='any'))
print(df)  # dropna returns a new object; df itself is unchanged

# (2) Filling NaN with a value
import pandas as pd
import numpy as np
df = pd.DataFrame(np.random.randn(4, 5) + 10,
                  index=['a', 'b', 'c', 'd'], columns=np.arange(5))

df.loc['c', 3] = np.nan
df.loc['c', 2] = np.nan
df.iloc[3, 4] = np.nan
print(df)
print('_____________________________________')
print(df.fillna(value=0))  # replace NaN with 0
print(df)  # fillna also returns a new object; df itself is unchanged

# Check whether the table contains any missing value (useful when the table is too large to inspect by eye)
print(df.isnull().values.any())  # True if at least one value is missing
Merging data

"""
Concatenating
"""
df1 = pd.DataFrame(np.ones((3, 4)) * 0, columns=['a', 'b', 'c', 'd'])
df2 = pd.DataFrame(np.ones((3, 4)) * 1, columns=['a', 'b', 'c', 'd'])
df3 = pd.DataFrame(np.ones((3, 4)) * 2, columns=['a', 'b', 'c', 'd'])
print(df1)
print(df2)
print(df3)
# axis=0 stacks vertically (row-wise), axis=1 joins horizontally (column-wise)
# ignore_index=True discards the original index and renumbers from 0
res = pd.concat([df1, df2, df3], axis=0, ignore_index=True)
print(res)

# join: 'inner' or 'outer'
df1 = pd.DataFrame(np.ones((3, 4)) * 0, columns=['a', 'b', 'c', 'd'], index=[1, 2, 3])
df2 = pd.DataFrame(np.ones((3, 4)) * 1, columns=['b', 'c', 'd', 'e'], index=[2, 3, 4])
print(df1)
print(df2)
# The default is join='outer' (missing parts are filled with NaN); join='inner' keeps only the shared columns
res = pd.concat([df1, df2], join='inner')  # ignore_index=True can also be passed here to renumber the index
print(res)

# The join_axes parameter was removed in pandas 1.0; reindex the result of concat instead
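Since join_axes is no longer available in pandas 1.0 and later, one way to get the same effect is to concatenate first and then reindex onto the labels you want to keep. This is a sketch of that approach; the frames mirror the ones used in the join example.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame(np.zeros((3, 4)), columns=["a", "b", "c", "d"], index=[1, 2, 3])
df2 = pd.DataFrame(np.ones((3, 4)), columns=["b", "c", "d", "e"], index=[2, 3, 4])

# Join side by side, then keep only df1's row labels
# (the role join_axes=[df1.index] used to play).
res = pd.concat([df1, df2], axis=1).reindex(df1.index)

print(res)
```

Row 1 exists only in df1, so the four columns that come from df2 are NaN in that row; row 4 exists only in df2 and is dropped by the reindex.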

"""
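append, which is similar to concat (DataFrame.append was removed in pandas 2.0; use pd.concat instead)
"""
# To be continued...

"""
Merging with merge
"""
# To be continued...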
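The merge section is still marked as unfinished, so the following is only a minimal sketch of what a key-based merge looks like. The frames left and right, the column name key, and all values are hypothetical.

```python
import pandas as pd

# Hypothetical frames sharing a 'key' column.
left = pd.DataFrame({"key": ["k0", "k1", "k2"], "a": [1, 2, 3]})
right = pd.DataFrame({"key": ["k1", "k2", "k3"], "b": [4, 5, 6]})

inner = pd.merge(left, right, on="key")               # default how="inner": keys present in both
outer = pd.merge(left, right, on="key", how="outer")  # all keys; unmatched cells become NaN

print(inner)
print(outer)
```

Unlike concat, which stacks objects along an axis, merge matches rows by the values in the key column, much like a database join.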
