数据科学包——pandas基础(整形、透视、时间序列、类别)

一、数据整形(stack函数)

在用pandas进行数据重排时,经常用到stackunstack两个函数。stack的意思是堆叠,堆积,unstack即“不要堆叠”
 表格在行列方向上均有索引(类似于DataFrame),花括号结构只有“列方向”上的索引(类似于层次化的Series),结构更加偏向于堆叠(Series-stack,方便记忆)。stack函数会将数据从"表格结构"变成"花括号结构",即将其行索引变成列索引,反之,unstack函数将数据从”花括号结构“变成”表格结构“,即要将其中一层的列索引变成行索引

tuples = list(zip(*[['bar', 'bar', 'baz', 'baz',
                     'foo', 'foo', 'qux', 'qux'],
                    ['one', 'two', 'one', 'two',
                     'one', 'two', 'one', 'two']]))
index = pd.MultiIndex.from_tuples(tuples, names=['first', 'second'])
df = pd.DataFrame(np.random.randn(8, 2), index=index, columns=['A', 'B'])
df
>>>
                     A         B
first second
bar   one     0.040744 -0.114534
      two     1.172007  0.314802
baz   one     2.609065  1.809331
      two     0.998182  0.630713
foo   one    -1.592057 -0.383721
      two     0.547086  0.964051
qux   one    -1.547477 -0.014280
      two     0.193672  1.180379
  • stack行索引变成列索引
stacked = df.stack()
stacked
>>>
first  second
bar    one     A    0.040744
               B   -0.114534
       two     A    1.172007
               B    0.314802
baz    one     A    2.609065
               B    1.809331
       two     A    0.998182
               B    0.630713
foo    one     A   -1.592057
               B   -0.383721
       two     A    0.547086
               B    0.964051
qux    one     A   -1.547477
               B   -0.014280
       two     A    0.193672
               B    1.180379
dtype: float64
  • unstack列索引变成行索引
stacked.unstack()
>>>
                     A         B
first second
bar   one     0.040744 -0.114534
      two     1.172007  0.314802
baz   one     2.609065  1.809331
      two     0.998182  0.630713
foo   one    -1.592057 -0.383721
      two     0.547086  0.964051
qux   one    -1.547477 -0.014280
      two     0.193672  1.180379

stacked.unstack().unstack()
>>>
               A                   B
second       one       two       one       two
first
bar     0.040744  1.172007 -0.114534  0.314802
baz     2.609065  0.998182  1.809331  0.630713
foo    -1.592057  0.547086 -0.383721  0.964051
qux    -1.547477  0.193672 -0.014280  1.180379

二、数据透视表(pivot table)

pivot table/轴向旋转表

>>> df = pd.DataFrame({'A' : ['one', 'one', 'two', 'three'] * 3,
...                     'B' : ['A', 'B', 'C'] * 4,
...                     'C' : ['foo', 'foo', 'foo', 'bar', 'bar', 'bar'] * 2,
...                     'D' : np.random.randn(12),
...                     'E' : np.random.randn(12)})
>>> df
        A  B    C         D         E
0     one  A  foo  0.133675  0.008375
1     one  B  foo -0.184734 -0.175907
2     two  C  foo -0.380400  0.020585
3   three  A  bar  0.447330  0.030309
4     one  B  bar -0.242254  0.194685
5     one  C  bar  0.388497 -0.846569
6     two  A  foo  0.739428  0.737537
7   three  B  foo  0.171285 -1.558148
8     one  C  foo -0.458316 -1.048907
9     one  A  bar -1.319356  1.664459
10    two  B  bar  0.797775 -0.163211
11  three  C  bar  2.528107  1.471624
  • 选择D列数据,以A、B列为行标签,以C列为列标签
>>> pd.pivot_table(df, values='D', index=['A', 'B'], columns=['C'])
C             bar       foo
A     B
one   A -1.319356  0.133675
      B -0.242254 -0.184734
      C  0.388497 -0.458316
three A  0.447330       NaN
      B       NaN  0.171285
      C  2.528107       NaN
two   A       NaN  0.739428
      B  0.797775       NaN
      C       NaN -0.380400
  • 选择'one'类数据,按照C列分组并且求各行数据的均值
>>> df[df.A=='one'].groupby('C').mean()
            D         E
C
bar -0.391038  0.337525
foo -0.169792 -0.405480

三、时间序列

pandas 提供了强大的时间序列功能,比如把秒级的股票数据转换为5分钟周期数据等。

1.时间序列

  • 从20200504开始,以秒为单位记录600个时间节点
>>> rng=pd.date_range('20200504',periods=600,freq='s')
>>> rng
DatetimeIndex(['2020-05-04 00:00:00', '2020-05-04 00:00:01',
               '2020-05-04 00:00:02', '2020-05-04 00:00:03',
               '2020-05-04 00:00:04', '2020-05-04 00:00:05',
               '2020-05-04 00:00:06', '2020-05-04 00:00:07',
               '2020-05-04 00:00:08', '2020-05-04 00:00:09',
               ...
               '2020-05-04 00:09:50', '2020-05-04 00:09:51',
               '2020-05-04 00:09:52', '2020-05-04 00:09:53',
               '2020-05-04 00:09:54', '2020-05-04 00:09:55',
               '2020-05-04 00:09:56', '2020-05-04 00:09:57',
               '2020-05-04 00:09:58', '2020-05-04 00:09:59'],
              dtype='datetime64[ns]', length=600, freq='S')
  • 为每一个时间节点匹配一个0-500的数字
>>> ts = pd.Series(np.random.randint(0, 500, len(rng)), index=rng)
>>> ts
2020-05-04 00:00:00    163
2020-05-04 00:00:01    380
2020-05-04 00:00:02    459
2020-05-04 00:00:03     73
                      ...
2020-05-04 00:09:56    189
2020-05-04 00:09:57    242
2020-05-04 00:09:58    398
2020-05-04 00:09:59    281
Freq: S, Length: 600, dtype: int32
  • 以两分钟为一个时间周期,计算每两分钟内所有节点对应的数字的和
>>> ts.resample('2Min', how='sum')
__main__:1: FutureWarning: how in .resample() is deprecated
the new syntax is .resample(...).sum()
2020-05-04 00:00:00    31323
2020-05-04 00:02:00    30679
2020-05-04 00:04:00    29742
2020-05-04 00:06:00    32315
2020-05-04 00:08:00    27771
Freq: 2T, dtype: int32

2.在不同的时间表达方式间转换

>>> rng = pd.date_range('20160301', periods=5, freq='M')
>>> ts = pd.Series(np.random.randn(len(rng)), index=rng)
>>> ts
2016-03-31    0.608165
2016-04-30    0.371630
2016-05-31   -0.169328
2016-06-30    0.403367
2016-07-31    0.242917
Freq: M, dtype: float64

2.1 时间戳转为时期 to_period()函数

>>> ps = ts.to_period() #默认按月('M')显示
>>> ps
2016-03    0.608165
2016-04    0.371630
2016-05   -0.169328
2016-06    0.403367
2016-07    0.242917
Freq: M, dtype: float64

2.2 时期转为时间戳 to_timestamp()函数

>>> ps.to_timestamp() #默认how='start',即:ps.to_timestamp(how='start')
2016-03-01    0.608165
2016-04-01    0.371630
2016-05-01   -0.169328
2016-06-01    0.403367
2016-07-01    0.242917
Freq: MS, dtype: float64

2.3 返回固定频率的周期索引 period_range()函数

  • 格式
pandas.period_range(start=None, end=None, periods=None, freq=None, name=None) 
  • 实例
>>> prng = pd.period_range('1990Q1', '2000Q4', freq='Q-NOV')
>>> prng
PeriodIndex(['1990Q1', '1990Q2', '1990Q3', '1990Q4', '1991Q1', '1991Q2',
             '1991Q3', '1991Q4', '1992Q1', '1992Q2', '1992Q3', '1992Q4',
             '1993Q1', '1993Q2', '1993Q3', '1993Q4', '1994Q1', '1994Q2',
             '1994Q3', '1994Q4', '1995Q1', '1995Q2', '1995Q3', '1995Q4',
             '1996Q1', '1996Q2', '1996Q3', '1996Q4', '1997Q1', '1997Q2',
             '1997Q3', '1997Q4', '1998Q1', '1998Q2', '1998Q3', '1998Q4',
             '1999Q1', '1999Q2', '1999Q3', '1999Q4', '2000Q1', '2000Q2',
             '2000Q3', '2000Q4'],
            dtype='period[Q-NOV]', freq='Q-NOV')

一年有四个季度,每年都以11月份为结束,即是9,10,11三个月为一个季度,以此类推
Q:quarter end frequency #季度结束频率
(B)Q(S)-NOV :quarterly frequency, year ends in November #季度频率,年度结束于11月份

2.4 series索引

>>> ts = pd.Series(np.random.randn(len(prng)), prng)
>>> ts
1990Q1   -0.504704
1990Q2   -0.343573
1990Q3   -1.511160
1990Q4   -0.268746
1991Q1   -0.013206
1991Q2   -1.730078
1991Q3    0.193844
1991Q4    0.732594
1992Q1   -0.656539
1992Q2   -0.577248
1992Q3    1.942460
1992Q4   -1.112426
1993Q1    0.040957
1993Q2   -1.572008
1993Q3   -0.130710
1993Q4   -1.019854
1994Q1    0.817004
1994Q2    2.099237
1994Q3    0.574557
1994Q4   -0.775947
1995Q1   -1.034555
1995Q2   -1.449465
1995Q3   -0.472482
1995Q4    0.750370
1996Q1   -0.081299
1996Q2   -0.214759
1996Q3   -0.141590
1996Q4   -0.397613
1997Q1   -0.771929
1997Q2    0.823948
1997Q3   -0.479001
1997Q4    0.624560
1998Q1    0.956335
1998Q2    0.531748
1998Q3   -0.048368
1998Q4    0.183149
1999Q1    1.060150
1999Q2    1.207727
1999Q3    1.309845
1999Q4    0.817419
2000Q1    0.992689
2000Q2    0.187355
2000Q3    1.014541
2000Q4    0.431878
Freq: Q-NOV, dtype: float64
  • series.index
>>> ts.index
PeriodIndex(['1990-03-01 09:00', '1990-06-01 09:00', '1990-09-01 09:00',
             '1990-12-01 09:00', '1991-03-01 09:00', '1991-06-01 09:00',
             '1991-09-01 09:00', '1991-12-01 09:00', '1992-03-01 09:00',
             '1992-06-01 09:00', '1992-09-01 09:00', '1992-12-01 09:00',
             '1993-03-01 09:00', '1993-06-01 09:00', '1993-09-01 09:00',
             '1993-12-01 09:00', '1994-03-01 09:00', '1994-06-01 09:00',
             '1994-09-01 09:00', '1994-12-01 09:00', '1995-03-01 09:00',
             '1995-06-01 09:00', '1995-09-01 09:00', '1995-12-01 09:00',
             '1996-03-01 09:00', '1996-06-01 09:00', '1996-09-01 09:00',
             '1996-12-01 09:00', '1997-03-01 09:00', '1997-06-01 09:00',
             '1997-09-01 09:00', '1997-12-01 09:00', '1998-03-01 09:00',
             '1998-06-01 09:00', '1998-09-01 09:00', '1998-12-01 09:00',
             '1999-03-01 09:00', '1999-06-01 09:00', '1999-09-01 09:00',
             '1999-12-01 09:00', '2000-03-01 09:00', '2000-06-01 09:00',
             '2000-09-01 09:00', '2000-12-01 09:00'],
            dtype='period[H]', freq='H')

在周期和时间戳之间进行转换,可以使用一些方便的算术函数。在下面的例子中,我们将截至11月的季度频率转换为季度结束后的月初的上午9时:

>>> ts.index = (prng.asfreq('M', 'e') + 1).asfreq('H', 's') + 9
>>> ts
1990-03-01 09:00   -0.504704
1990-06-01 09:00   -0.343573
1990-09-01 09:00   -1.511160
1990-12-01 09:00   -0.268746
1991-03-01 09:00   -0.013206
1991-06-01 09:00   -1.730078
1991-09-01 09:00    0.193844
1991-12-01 09:00    0.732594
1992-03-01 09:00   -0.656539
1992-06-01 09:00   -0.577248
1992-09-01 09:00    1.942460
1992-12-01 09:00   -1.112426
1993-03-01 09:00    0.040957
1993-06-01 09:00   -1.572008
1993-09-01 09:00   -0.130710
1993-12-01 09:00   -1.019854
1994-03-01 09:00    0.817004
1994-06-01 09:00    2.099237
1994-09-01 09:00    0.574557
1994-12-01 09:00   -0.775947
1995-03-01 09:00   -1.034555
1995-06-01 09:00   -1.449465
1995-09-01 09:00   -0.472482
1995-12-01 09:00    0.750370
1996-03-01 09:00   -0.081299
1996-06-01 09:00   -0.214759
1996-09-01 09:00   -0.141590
1996-12-01 09:00   -0.397613
1997-03-01 09:00   -0.771929
1997-06-01 09:00    0.823948
1997-09-01 09:00   -0.479001
1997-12-01 09:00    0.624560
1998-03-01 09:00    0.956335
1998-06-01 09:00    0.531748
1998-09-01 09:00   -0.048368
1998-12-01 09:00    0.183149
1999-03-01 09:00    1.060150
1999-06-01 09:00    1.207727
1999-09-01 09:00    1.309845
1999-12-01 09:00    0.817419
2000-03-01 09:00    0.992689
2000-06-01 09:00    0.187355
2000-09-01 09:00    1.014541
2000-12-01 09:00    0.431878
Freq: H, dtype: float64

四、类别数据

>>> df = pd.DataFrame({"id":[1,2,3,4,5,6], "raw_grade":['a', 'b', 'b', 'a', 'a', 'e']} 
>>> df
   id raw_grade
0   1         a
1   2         b
2   3         b
3   4         a
4   5         a
5   6         e

1.astype函数

  • 格式
DataFrame.astype(self: ~FrameOrSeries, dtype, copy: bool = True, errors: str = 'raise') 
  • 功能
    将pandas对象转换为指定的dtype类型。
  • 实例
>>> df["grade"] = df["raw_grade"].astype("category")
>>> df
   id raw_grade grade
0   1         a     a
1   2         b     b
2   3         b     b
3   4         a     a
4   5         a     a
5   6         e     e
  • 查看类型
>>> df["grade"].cat.categories
Index(['a', 'b', 'e'], dtype='object')
  • 替换数据
    ['a', 'b', 'e']变成了对应的["very good", "good", "very bad"]
>>> df["grade"].cat.categories = ["very good", "good", "very bad"]
>>> df
   id raw_grade      grade
0   1         a  very good
1   2         b       good
2   3         b       good
3   4         a  very good
4   5         a  very good
5   6         e   very bad

2.排序(sort_values())

根据grade的值排序

>>> df.sort_values(by='grade', ascending=True)
   id raw_grade      grade
0   1         a  very good
3   4         a  very good
4   5         a  very good
1   2         b       good
2   3         b       good
5   6         e   very bad

3.统计个数(groupby().size())

>>> df.groupby("grade").size()
grade
very good    3
good         2
very bad     1
dtype: int64
發表評論
所有評論
還沒有人評論,想成為第一個評論的人麼? 請在上方評論欄輸入並且點擊發布.
相關文章