Chapter 5: Getting Started with pandas (Day 8-11)

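Note: this article is a study log on data processing with Python. It records the errors I ran into while reproducing the book's examples and the places where my results differ from the book; simple operations are not repeated here. The material comes mainly from the book 《利用Python進行數據分析》 (Python for Data Analysis) by Wes McKinney, published by 機械工業出版社 (China Machine Press).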

1. pandas data structures

Series

Init signature:
Series(self, data=None, index=None, dtype=None,name=None, copy=False, fastpath=False)

Docstring:
One-dimensional ndarray with axis labels (including time series).

Labels need not be unique but must be any hashable type. The object supports both integer- and label-based indexing and provides a host of methods for performing operations involving the index. Statistical methods from ndarray have been overridden to automatically exclude missing data (currently represented as NaN)

Operations between Series (+, -, /, *, **) align values based on their associated index values; they need not be the same length. The result index will be the sorted union of the two indexes.

Parameters:
data : array-like, dict, or scalar value
Contains data stored in Series
index : array-like or Index (1d)
Values must be unique and hashable, same length as data. Index object (or other iterable of same length as data) Will default to RangeIndex(len(data)) if not provided. If both a dict and index sequence are used, the index will override the keys found in the dict.
dtype : numpy.dtype or None
If None, dtype will be inferred
copy : boolean, default False
Copy input data
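
As a minimal sketch of the alignment rule quoted in the docstring above (the two small Series are made up for illustration), adding Series with different labels gives the sorted union of the indexes, with NaN wherever a label is missing on one side:

```python
import pandas as pd

# Two illustrative Series with partly overlapping labels
left = pd.Series([1, 2, 3], index=['c', 'a', 'b'])
right = pd.Series([10, 20], index=['a', 'd'])

# Values are matched by label, not by position
total = left + right
print(total)
```

Only the label 'a' exists on both sides, so it is the only entry with a numeric result.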

What is a hashable type?
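
A short answer, as far as I understand Python's definition: an object is hashable if it has a hash value that never changes during its lifetime and it can be compared for equality. Immutable built-ins such as str, int and tuple are hashable, while list and dict are not. A small sketch (is_hashable is a helper name made up for this note, not a pandas function):

```python
def is_hashable(obj):
    """Return True if hash(obj) succeeds (hypothetical helper for illustration)."""
    try:
        hash(obj)
    except TypeError:
        return False
    return True

print(is_hashable('a'))      # a str
print(is_hashable((1, 2)))   # a tuple of ints
print(is_hashable([1, 2]))   # a list
```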

DataFrame

Init signature:
DataFrame(self, data=None, index=None, columns=None, dtype=None, copy=False)

Docstring:
Two-dimensional size-mutable, potentially heterogeneous tabular data structure with labeled axes (rows and columns).
Arithmetic operations align on both row and column labels. Can be thought of as a dict-like container for Series objects. The primary
pandas data structure

Parameters:
data : numpy ndarray (structured or homogeneous), dict, or DataFrame
Dict can contain Series, arrays, constants, or list-like objects
index : Index or array-like
Index to use for resulting frame. Will default to np.arange(n) if
no indexing information part of input data and no index provided
columns : Index or array-like
Column labels to use for resulting frame. Will default to
np.arange(n) if no column labels are provided
dtype : dtype, default None
Data type to force, otherwise infer
copy : boolean, default False
Copy data from inputs. Only affects DataFrame / 2d ndarray input

Book notes

P116 Series
The index of a Series is displayed differently from the book:

obj = Series([4,7,-5,3])

obj
Out[12]: 
0    4
1    7
2   -5
3    3
dtype: int64

obj.values
Out[15]: array([ 4,  7, -5,  3], dtype=int64)

obj.index
Out[16]: RangeIndex(start=0, stop=4, step=1)

obj2 = Series([4,7,-5,3],index=['d','b','a','c'])

obj2
Out[19]: 
d    4
b    7
a   -5
c    3
dtype: int64

obj2.index
Out[20]: Index([u'd', u'b', u'a', u'c'], dtype='object')
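
A sketch for checking the index type in code rather than relying on how it prints, assuming a pandas version that has RangeIndex (as the output above does):

```python
import pandas as pd

obj = pd.Series([4, 7, -5, 3])
obj2 = pd.Series([4, 7, -5, 3], index=['d', 'b', 'a', 'c'])

# Without an explicit index, pandas builds a RangeIndex
print(type(obj.index).__name__)
# With string labels, a generic Index holds the labels
print(list(obj2.index))
```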

P120 Displaying the columns of a DataFrame

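"""
frame2['state'] gives the same result as frame2.state
frame2['year'] gives the same result as frame2.year
frame2['debt'] gives the same result as frame2.debt
frame2['pop'] does NOT give the same result as frame2.pop,
because pop is also the name of a DataFrame method
"""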
frame2['state'] 
Out[39]: 
one        ohio
two        ohio
three      ohio
four     Nevada
five     Nevada
Name: state, dtype: object

frame2.state
Out[40]: 
one        ohio
two        ohio
three      ohio
four     Nevada
five     Nevada
Name: state, dtype: object

frame2['year']
Out[41]: 
one      2000
two      2001
three    2002
four     2000
five     2001
Name: year, dtype: int64

frame2.year
Out[42]: 
one      2000
two      2001
three    2002
four     2000
five     2001
Name: year, dtype: int64

frame2['debt']
Out[43]: 
one      NaN
two     -1.2
three    NaN
four    -1.5
five    -1.7
Name: debt, dtype: float64

frame2.debt
Out[44]: 
one      NaN
two     -1.2
three    NaN
four    -1.5
five    -1.7
Name: debt, dtype: float64

frame2['pop']
Out[45]: 
one      1.5
two      1.7
three    3.6
four     2.4
five     2.9
Name: pop, dtype: float64

frame2.pop
Out[46]: 
<bound method DataFrame.pop of        year   state  pop  debt
one    2000    ohio  1.5   NaN
two    2001    ohio  1.7  -1.2
three  2002    ohio  3.6   NaN
four   2000  Nevada  2.4  -1.5
five   2001  Nevada  2.9  -1.7>
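
The difference for pop arises because attribute access finds the DataFrame.pop method before the column. A minimal sketch that rebuilds a frame like frame2 (values copied from the output above) and checks this; bracket access is the form that always reaches the column:

```python
import numpy as np
import pandas as pd

frame2 = pd.DataFrame(
    {'year': [2000, 2001, 2002, 2000, 2001],
     'state': ['ohio', 'ohio', 'ohio', 'Nevada', 'Nevada'],
     'pop': [1.5, 1.7, 3.6, 2.4, 2.9],
     'debt': [np.nan, -1.2, np.nan, -1.5, -1.7]},
    index=['one', 'two', 'three', 'four', 'five'])

# Attribute access returns the bound method, not the column
print(callable(frame2.pop))
# Bracket access returns the column as a Series
print(frame2['pop'].tolist())
```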

P122 The del statement

"""
del frame2.column_name does not work;
use del frame2['column_name'] instead
"""
frame2
Out[48]: 
       year   state  pop  debt eastern
one    2000    ohio  1.5   NaN   False
two    2001    ohio  1.7  -1.2   False
three  2002    ohio  3.6   NaN   False
four   2000  Nevada  2.4  -1.5   False
five   2001  Nevada  2.9  -1.7   False

del frame2.eastern
---------------------------------------------------------------------------
AttributeError                            Traceback (most recent call last)
<ipython-input-49-1f2f896bbb30> in <module>()
----> 1 del frame2.eastern

AttributeError: eastern 

del frame2['eastern']

frame2
Out[51]: 
       year   state  pop  debt
one    2000    ohio  1.5   NaN
two    2001    ohio  1.7  -1.2
three  2002    ohio  3.6   NaN
four   2000  Nevada  2.4  -1.5
five   2001  Nevada  2.9  -1.7

2. Essential functionality

reindex

Signature:
obj3.reindex(index=None, **kwargs)

Docstring: Conform Series to new index with optional filling logic, placing NA/NaN in locations having no value in the previous
index. A new object is produced unless the new index is equivalent to
the current one and copy=False

Parameters:
index :array-like, optional (can be specified in order, or as keywords)
New labels / index to conform to. Preferably an Index object to
avoid duplicating data
method : {None, 'backfill'/'bfill', 'pad'/'ffill', 'nearest'}, optional
method to use for filling holes in reindexed DataFrame.
Please note: this is only applicable to DataFrames/Series with a
monotonically increasing/decreasing index.

* default: don't fill gaps
* pad / ffill: propagate last valid observation forward to next
  valid
* backfill / bfill: use next valid observation to fill gap
* nearest: use nearest valid observations to fill gap

copy : boolean, default True
Return a new object, even if the passed indexes are the same
level : int or name
Broadcast across a level, matching Index values on the
passed MultiIndex level
fill_value : scalar, default np.NaN
Value to use for missing values. Defaults to NaN, but can be any
“compatible” value
limit : int, default None
Maximum number of consecutive elements to forward or backward fill
tolerance : optional
Maximum distance between original and new labels for inexact
matches. The values of the index at the matching locations must
satisfy the equation abs(index[indexer] - target) <= tolerance.

frame.ix

Type: property
Docstring:
A primarily label-location based indexer, with integer position fallback.

.ix[] supports mixed integer and label based access. It is
primarily label based, but will fall back to integer positional access
unless the corresponding axis is of integer type.

.ix is the most general indexer and will support any of the inputs
in .loc and .iloc. .ix also supports floating point label
schemes. .ix is exceptionally useful when dealing with mixed
positional and label based hierarchical indexes.

However, when an axis is integer based, ONLY label based access and
not positional access is supported. Thus, in such cases, it’s usually
better to be explicit and use .iloc or .loc.

DataFrame.drop

Signature: data.drop(labels, axis=0, level=None, inplace=False, errors='raise')

Docstring: Return new object with labels in requested axis removed.

Parameters:
labels : single label or list-like
axis : int or axis name
level : int or level name, default None
For MultiIndex
inplace : bool, default False
If True, do operation inplace and return None.
errors : {'ignore', 'raise'}, default 'raise'
If 'ignore', suppress error and existing labels are dropped.

Returns:
dropped : type of caller
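
A short sketch of the parameters described above, on a made-up frame shaped like the book's data example. drop returns a new object and leaves the original untouched unless inplace=True:

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(np.arange(16).reshape(4, 4),
                    index=['Ohio', 'Colorado', 'Utah', 'New York'],
                    columns=['one', 'two', 'three', 'four'])

# axis=0 (the default) drops rows by label
rows_dropped = data.drop(['Colorado', 'Ohio'])
# axis=1 drops columns by label
cols_dropped = data.drop('two', axis=1)
# errors='ignore' skips labels that do not exist instead of raising
unchanged = data.drop('Texas', errors='ignore')
```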

Series.rank()

Signature: obj.rank(axis=0, method='average', numeric_only=None, na_option='keep', ascending=True, pct=False)

Docstring: Compute numerical data ranks (1 through n) along axis. Equal values are assigned a rank that is the average of the ranks of
those values

Parameters:
axis: {0 or 'index', 1 or 'columns'}, default 0
index to direct ranking
method : {'average', 'min', 'max', 'first', 'dense'}
* average: average rank of group
* min: lowest rank in group
* max: highest rank in group
* first: ranks assigned in order they appear in the array
* dense: like 'min', but rank always increases by 1 between groups
numeric_only : boolean, default None
Include only float, int, boolean data. Valid only for DataFrame or
Panel objects
na_option : {'keep', 'top', 'bottom'}
* keep: leave NA values where they are
* top: smallest rank if ascending
* bottom: smallest rank if descending
ascending : boolean, default True
False for ranks by high (1) to low (N)
pct : boolean, default False
Computes percentage rank of data

Returns:
ranks : same type as caller

The main point here is the method parameter of rank. The rank function assigns a rank to each element:

obj
Out[206]: 
0    7
1   -5
2    7
3    4
4    2
5    0
6    4
dtype: int64

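"""
If method is omitted, it defaults to 'average'. obj contains two 4s and two 7s, which in sorted order occupy ranks 4, 5, 6 and 7:
1) With 'average', the two 4s get rank (4+5)/2 = 4.5 and the two 7s get (6+7)/2 = 6.5 (if there were three 4s occupying ranks 4, 5 and 6, each would get (4+5+6)/3 = 5).
2) With 'max', the two 4s get the larger of their two ranks, 5; likewise the two 7s get 7.
3) With 'min', the two 4s get the smaller of their two ranks, 4; likewise the two 7s get 6.
4) With 'first', ties are broken by position: the value that appears earlier in the original Series gets the lower rank.
"""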

obj.rank() 
Out[207]: 
0    6.5
1    1.0
2    6.5
3    4.5
4    3.0
5    2.0
6    4.5
dtype: float64

obj.rank(method='max')
Out[208]: 
0    7.0
1    1.0
2    7.0
3    5.0
4    3.0
5    2.0
6    5.0
dtype: float64

obj.rank(method='min')
Out[209]: 
0    6.0
1    1.0
2    6.0
3    4.0
4    3.0
5    2.0
6    4.0
dtype: float64

obj.rank(method='first')
Out[210]: 
0    6.0
1    1.0
2    7.0
3    4.0
4    3.0
5    2.0
6    5.0
dtype: float64
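
The note above does not cover the 'dense' option listed in the docstring. A small sketch on the same values: 'dense' treats ties like 'min' but leaves no gaps, so the ranks run 1 to 5 for the five distinct values.

```python
import pandas as pd

obj = pd.Series([7, -5, 7, 4, 2, 0, 4])

# 'dense' ranks ties like 'min' but leaves no gaps between groups
dense = obj.rank(method='dense')
# The default 'average' can be checked by hand: the two 4s occupy ranks 4 and 5
average = obj.rank()
print(dense.tolist())
print(average.tolist())
```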

Book notes

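P132 Selecting rows and columns of a DataFrame
The book says obj[val] selects a single column or a set of columns of a DataFrame. The lookup goes by column label, so a bare integer does not work as a positional index here (the columns are labeled with strings), and neither does a row label: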

data['one'] # look up by column label
Out[60]: 
Ohio         0
Colorado     0
Utah         8
New York    12
Name: one, dtype: int32

data[0] # use an integer in place of the label
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-61-c0c8b06be82d> in <module>()
----> 1 data[0] # use an integer in place of the label

E:\Enthought\hzk\User\lib\site-packages\pandas\core\frame.pyc in __getitem__(self, key)
   1990             return self._getitem_multilevel(key)
   1991         else:
-> 1992             return self._getitem_column(key)
   1993 
   1994     def _getitem_column(self, key):

E:\Enthought\hzk\User\lib\site-packages\pandas\core\frame.pyc in _getitem_column(self, key)
   1997         # get column
   1998         if self.columns.is_unique:
-> 1999             return self._get_item_cache(key)
   2000 
   2001         # duplicate columns & possible reduce dimensionality

E:\Enthought\hzk\User\lib\site-packages\pandas\core\generic.pyc in _get_item_cache(self, item)
   1343         res = cache.get(item)
   1344         if res is None:
-> 1345             values = self._data.get(item)
   1346             res = self._box_item_values(item, values)
   1347             cache[item] = res

E:\Enthought\hzk\User\lib\site-packages\pandas\core\internals.pyc in get(self, item, fastpath)
   3223 
   3224             if not isnull(item):
-> 3225                 loc = self.items.get_loc(item)
   3226             else:
   3227                 indexer = np.arange(len(self.items))[isnull(self.items)]

E:\Enthought\hzk\User\lib\site-packages\pandas\indexes\base.pyc in get_loc(self, key, method, tolerance)
   1876                 return self._engine.get_loc(key)
   1877             except KeyError:
-> 1878                 return self._engine.get_loc(self._maybe_cast_indexer(key))
   1879 
   1880         indexer = self.get_indexer([key], method=method, tolerance=tolerance)

pandas/index.pyx in pandas.index.IndexEngine.get_loc (pandas\index.c:4027)()

pandas/index.pyx in pandas.index.IndexEngine.get_loc (pandas\index.c:3891)()

pandas/hashtable.pyx in pandas.hashtable.PyObjectHashTable.get_item (pandas\hashtable.c:12408)()

pandas/hashtable.pyx in pandas.hashtable.PyObjectHashTable.get_item (pandas\hashtable.c:12359)()

KeyError: 0 

data['Ohio'] # try to select a row by its index label
---------------------------------------------------------------------------
KeyError                                  Traceback (most recent call last)
<ipython-input-62-5dd4df56835a> in <module>()
----> 1 data['Ohio'] # try to select a row by its index label

E:\Enthought\hzk\User\lib\site-packages\pandas\core\frame.pyc in __getitem__(self, key)
   1990             return self._getitem_multilevel(key)
   1991         else:
-> 1992             return self._getitem_column(key)
   1993 
   1994     def _getitem_column(self, key):

E:\Enthought\hzk\User\lib\site-packages\pandas\core\frame.pyc in _getitem_column(self, key)
   1997         # get column
   1998         if self.columns.is_unique:
-> 1999             return self._get_item_cache(key)
   2000 
   2001         # duplicate columns & possible reduce dimensionality

E:\Enthought\hzk\User\lib\site-packages\pandas\core\generic.pyc in _get_item_cache(self, item)
   1343         res = cache.get(item)
   1344         if res is None:
-> 1345             values = self._data.get(item)
   1346             res = self._box_item_values(item, values)
   1347             cache[item] = res

E:\Enthought\hzk\User\lib\site-packages\pandas\core\internals.pyc in get(self, item, fastpath)
   3223 
   3224             if not isnull(item):
-> 3225                 loc = self.items.get_loc(item)
   3226             else:
   3227                 indexer = np.arange(len(self.items))[isnull(self.items)]

E:\Enthought\hzk\User\lib\site-packages\pandas\indexes\base.pyc in get_loc(self, key, method, tolerance)
   1876                 return self._engine.get_loc(key)
   1877             except KeyError:
-> 1878                 return self._engine.get_loc(self._maybe_cast_indexer(key))
   1879 
   1880         indexer = self.get_indexer([key], method=method, tolerance=tolerance)

pandas/index.pyx in pandas.index.IndexEngine.get_loc (pandas\index.c:4027)()

pandas/index.pyx in pandas.index.IndexEngine.get_loc (pandas\index.c:3891)()

pandas/hashtable.pyx in pandas.hashtable.PyObjectHashTable.get_item (pandas\hashtable.c:12408)()

pandas/hashtable.pyx in pandas.hashtable.PyObjectHashTable.get_item (pandas\hashtable.c:12359)()

KeyError: 'Ohio' 

data[:3] # a slice, by contrast, selects rows
Out[63]: 
          one  two  three  four
Ohio        0    0      0     0
Colorado    0    5      6     7
Utah        8    9     10    11

data[2:3] # same as above
Out[64]: 
      one  two  three  four
Utah    8    9     10    11
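
For the two lookups that failed above, explicit indexers can be used. A sketch on a made-up frame with the same labels (the values are a plain arange, not the modified data from my session):

```python
import numpy as np
import pandas as pd

data = pd.DataFrame(np.arange(16).reshape(4, 4),
                    index=['Ohio', 'Colorado', 'Utah', 'New York'],
                    columns=['one', 'two', 'three', 'four'])

first_col = data.iloc[:, 0]   # column by integer position
ohio_row = data.loc['Ohio']   # row by index label
first_rows = data.iloc[:3]    # rows by position, the same rows as data[:3]
```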

P133 A somewhat interesting phenomenon
This comes from how computers store floating-point numbers:

s1
Out[82]: 
a    7.3
c   -2.5
d    3.4
e    1.5
dtype: float64

s1.a
Out[83]: 7.2999999999999998

s1['a']
Out[84]: 7.2999999999999998

s1.c
Out[85]: -2.5

s1.d
Out[86]: 3.3999999999999999

s1.e
Out[87]: 1.5
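
The long digits appear because 7.3 and 3.4 have no exact binary floating-point representation, whereas -2.5 and 1.5 do. A sketch using only the standard library to make the stored values visible:

```python
from decimal import Decimal

# Decimal(float) shows the exact binary value that is actually stored
print(Decimal(7.3))
print(Decimal(-2.5))

# 17 significant digits reproduce the display seen in the session above
print('%.17g' % 7.3)
```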

P134 The add function
The book's example is not a good one and does not show the full behavior of add:

df1 = DataFrame(arange(20).reshape((5,4)),columns=list('abcd'))

df2 = DataFrame(arange(24).reshape((4,6)),columns=list('abcdef'))

df1
Out[121]: 
    a   b   c   d
0   0   1   2   3
1   4   5   6   7
2   8   9  10  11
3  12  13  14  15
4  16  17  18  19

df2
Out[122]: 
    a   b   c   d   e   f
0   0   1   2   3   4   5
1   6   7   8   9  10  11
2  12  13  14  15  16  17
3  18  19  20  21  22  23

df1+df2
Out[123]: 
      a     b     c     d   e   f
0   0.0   2.0   4.0   6.0 NaN NaN
1  10.0  12.0  14.0  16.0 NaN NaN
2  20.0  22.0  24.0  26.0 NaN NaN
3  30.0  32.0  34.0  36.0 NaN NaN
4   NaN   NaN   NaN   NaN NaN NaN

df1.add(df2,fill_value=0)
Out[124]: 
      a     b     c     d     e     f
0   0.0   2.0   4.0   6.0   4.0   5.0
1  10.0  12.0  14.0  16.0  10.0  11.0
2  20.0  22.0  24.0  26.0  16.0  17.0
3  30.0  32.0  34.0  36.0  22.0  23.0
4  16.0  17.0  18.0  19.0   NaN   NaN

df2.add(df1,fill_value=0)
Out[125]: 
      a     b     c     d     e     f
0   0.0   2.0   4.0   6.0   4.0   5.0
1  10.0  12.0  14.0  16.0  10.0  11.0
2  20.0  22.0  24.0  26.0  16.0  17.0
3  30.0  32.0  34.0  36.0  22.0  23.0
4  16.0  17.0  18.0  19.0   NaN   NaN

Inspecting the parameters of add through introspection:

Signature: df1.add(other, axis='columns', level=None, fill_value=None)

Docstring: Addition of dataframe and other, element-wise (binary operator add). Equivalent to dataframe + other, but with support
to substitute a fill_value for missing data in one of the inputs.

Parameters:
other : Series, DataFrame, or constant
axis : {0, 1, 'index', 'columns'}
For Series input, axis to match Series index on
fill_value : None or float value, default None
Fill missing (NaN) values with this value. If both DataFrame
locations are missing, the result will be missing
level : int or name
Broadcast across a level, matching Index values on the
passed MultiIndex level

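Note the sentence "If both DataFrame locations are missing, the result will be missing": when two DataFrames of shape (5, 4) and (4, 6) are added, neither of them has a value in row 5 at columns 5 and 6 (row label 4, columns e and f), so the result there is still NaN.

P139 The order and sort_* functions
Warning: the order function is deprecated.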

obj.order()
-c:1: FutureWarning: order is deprecated, use sort_values(...)
Out[186]: 
2   -3
3    2
0    4
1    7
dtype: int64

obj.sort_values()
Out[187]: 
2   -3
3    2
0    4
1    7
dtype: int64

Warning: the by argument of sort_index is deprecated.

frame.sort_index(by='b')
-c:1: FutureWarning: by argument to sort_index is deprecated, pls use .sort_values(by=...)
Out[194]: 
   a  b  c
2  0 -3  2
3  1  2  1
0  0  4  4
1  1  7  3

frame.sort_values(by='b')
Out[195]: 
   a  b  c
2  0 -3  2
3  1  2  1
0  0  4  4
1  1  7  3

The by parameter is no longer documented for sort_index, although it still appears at the end of the signature:

Signature: frame.sort_index(axis=0, level=None, ascending=True, inplace=False, kind='quicksort', na_position='last', sort_remaining=True, by=None)

Docstring: Sort object by labels (along an axis)

Parameters:
axis : index, columns to direct sorting
level : int or level name or list of ints or list of level names
if not None, sort on values in specified index level(s)
ascending : boolean, default True
Sort ascending vs. descending
inplace : bool
if True, perform operation in-place
kind : {quicksort, mergesort, heapsort}
Choice of sorting algorithm. See also ndarray.np.sort for more
information. mergesort is the only stable algorithm. For
DataFrames, this option is only applied when sorting on a single
column or label.
na_position : {'first', 'last'}
first puts NaNs at the beginning, last puts NaNs at the end
sort_remaining : bool
if true and sorting by level and index is multilevel, sort by other
levels too (in order) after sorting by specified level

Returns:
sorted_obj : DataFrame

The sort_values function:

Signature: frame.sort_values(by, axis=0, ascending=True, inplace=False, kind='quicksort', na_position='last')

Docstring: Sort by the values along either axis

Parameters:
by : string name or list of names which refer to the axis items
axis : index, columns to direct sorting
ascending : bool or list of bool
Sort ascending vs. descending. Specify list for multiple sort
orders. If this is a list of bools, must match the length of
the by.
inplace : bool
if True, perform operation in-place
kind : {quicksort, mergesort, heapsort}
Choice of sorting algorithm. See also ndarray.np.sort for more
information. mergesort is the only stable algorithm. For
DataFrames, this option is only applied when sorting on a single
column or label.
na_position : {'first', 'last'}
first puts NaNs at the beginning, last puts NaNs at the end

Returns:
sorted_obj : DataFrame
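
A sketch of the replacements suggested by the two warnings above, on a frame rebuilt from the sorted output shown earlier (the column values are read back from that output):

```python
import pandas as pd

frame = pd.DataFrame({'a': [0, 1, 0, 1],
                      'b': [4, 7, -3, 2],
                      'c': [4, 3, 2, 1]})

by_b = frame.sort_values(by='b')                 # instead of sort_index(by='b')
by_a_then_b = frame.sort_values(by=['a', 'b'])   # several keys, in priority order
desc = frame['b'].sort_values(ascending=False)   # instead of Series.order
```

Passing a list to by sorts on the first key and uses the later keys only to break ties.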

3. Summarizing and computing descriptive statistics

Book notes

P145 import web
Importing pandas.io.data raises a FutureWarning, but the module still works:

import pandas.io.data as web
E:\Enthought\hzk\User\lib\site-packages\pandas\io\data.py:35: FutureWarning: 
The pandas.io.data module is moved to a separate package (pandas-datareader) and will be removed from pandas in a future version.
After installing the pandas-datareader package (https://github.com/pydata/pandas-datareader), you can change the import ``from pandas.io import data, wb`` to ``from pandas_datareader import data, wb``.
  FutureWarning)

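My first attempt at the book's loop could not read any data (the first ticker is typed 'APPL' here, as the failing URL shows):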

for ticker in ['APPL','IBM','MSFT','GOOG']:
        all_data[ticker]=web.get_data_yahoo(ticker,'1/1/2010','1/1/2011')

---------------------------------------------------------------------------
IOError                                   Traceback (most recent call last)
<ipython-input-254-4cf257dad11f> in <module>()
      1 for ticker in ['APPL','IBM','MSFT','GOOG']:
----> 2     all_data[ticker]=web.get_data_yahoo(ticker,'1/1/2010','1/1/2011')
      3 

E:\Enthought\hzk\User\lib\site-packages\pandas\io\data.pyc in get_data_yahoo(symbols, start, end, retry_count, pause, adjust_price, ret_index, chunksize, interval)
    438         raise ValueError("Invalid interval: valid values are 'd', 'w', 'm' and 'v'")
    439     return _get_data_from(symbols, start, end, interval, retry_count, pause,
--> 440                           adjust_price, ret_index, chunksize, 'yahoo')
    441 
    442 

E:\Enthought\hzk\User\lib\site-packages\pandas\io\data.pyc in _get_data_from(symbols, start, end, interval, retry_count, pause, adjust_price, ret_index, chunksize, source)
    379     # If a single symbol, (e.g., 'GOOG')
    380     if isinstance(symbols, (compat.string_types, int)):
--> 381         hist_data = src_fn(symbols, start, end, interval, retry_count, pause)
    382     # Or multiple symbols, (e.g., ['GOOG', 'AAPL', 'MSFT'])
    383     elif isinstance(symbols, DataFrame):

E:\Enthought\hzk\User\lib\site-packages\pandas\io\data.pyc in _get_hist_yahoo(sym, start, end, interval, retry_count, pause)
    222            '&g=%s' % interval +
    223            '&ignore=.csv')
--> 224     return _retry_read_url(url, retry_count, pause, 'Yahoo!')
    225 
    226 

E:\Enthought\hzk\User\lib\site-packages\pandas\io\data.pyc in _retry_read_url(url, retry_count, pause, name)
    199 
    200     raise IOError("after %d tries, %s did not "
--> 201                   "return a 200 for url %r" % (retry_count, name, url))
    202 
    203 

IOError: after 3 tries, Yahoo! did not return a 200 for url 'http://ichart.finance.yahoo.com/table.csv?s=APPL&a=0&b=1&c=2010&d=0&e=1&f=2011&g=d&ignore=.csv' 

It can be changed as follows (ticker 'AAPL', read through web.DataReader):

for ticker in ['AAPL','IBM','MSFT','GOOG']:
    all_data[ticker]=web.DataReader(ticker,'yahoo','1/1/2000','1/1/2010')

P147 Counting values
The result differs slightly from the book:

pd.value_counts(obj.values,sort=False)
Out[292]: 
a    3    """the order differs here"""
c    3
b    2
d    1
dtype: int64

pd.value_counts(obj.values)
Out[293]: 
c    3
a    3
b    2
d    1
dtype: int64
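
Since the order among equal counts can differ between versions, one option is to sort the result by label when a stable order matters. A sketch using the Series method form; the input values are an assumed stand-in consistent with the counts shown above:

```python
import pandas as pd

# Assumed input: three 'a', two 'b', three 'c', one 'd'
obj = pd.Series(['c', 'a', 'd', 'a', 'a', 'b', 'b', 'c', 'c'])

counts = obj.value_counts()      # sorted by count; tie order may vary
by_label = counts.sort_index()   # deterministic order by label
print(by_label)
```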

4. Handling missing data

DataFrame.dropna()

Signature: data.dropna(axis=0, how='any', thresh=None, subset=None, inplace=False)

Docstring: Return object with labels on given axis omitted where alternately any or all of the data are missing

Parameters:
axis : {0 or 'index', 1 or 'columns'}, or tuple/list thereof
Pass tuple or list to drop on multiple axes
how : {'any', 'all'}
* any : if any NA values are present, drop that label
* all : if all values are NA, drop that label
thresh : int, default None
int value : require that many non-NA values
subset : array-like
Labels along other axis to consider, e.g. if you are dropping rows
these would be a list of columns to include
inplace : boolean, default False
If True, do operation inplace and return None.

Returns:
dropped : DataFrame
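
A sketch of how the how and thresh parameters differ, on a small made-up frame with one fully missing row:

```python
import numpy as np
import pandas as pd

data = pd.DataFrame([[1.0, 6.5, 3.0],
                     [1.0, np.nan, np.nan],
                     [np.nan, np.nan, np.nan],
                     [np.nan, 6.5, 3.0]])

any_na = data.dropna()                 # drop rows containing any NaN
all_na = data.dropna(how='all')        # drop only rows that are entirely NaN
at_least_two = data.dropna(thresh=2)   # keep rows with at least 2 non-NaN values
```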

fillna()

Signature: df.fillna(value=None, method=None, axis=None, inplace=False, limit=None, downcast=None, **kwargs)

Docstring: Fill NA/NaN values using the specified method

Parameters:
value : scalar, dict, Series, or DataFrame
Value to use to fill holes (e.g. 0), alternately a
dict/Series/DataFrame of values specifying which value to use for
each index (for a Series) or column (for a DataFrame). (values not
in the dict/Series/DataFrame will not be filled). This value cannot
be a list.
method : {'backfill', 'bfill', 'pad', 'ffill', None}, default None
Method to use for filling holes in reindexed Series
pad / ffill: propagate last valid observation forward to next valid
backfill / bfill: use NEXT valid observation to fill gap
axis : {0, 1, 'index', 'columns'}
inplace : boolean, default False
If True, fill in place. Note: this will modify any
other views on this object, (e.g. a no-copy slice for a column in a
DataFrame).
limit : int, default None
If method is specified, this is the maximum number of consecutive NaN values to forward/backward fill. In other words, if there is a gap
with more than this number of consecutive NaNs, it will only be
partially filled. If method is not specified, this is the maximum
number of entries along the entire axis where NaNs will be filled.
downcast : dict, default is None
a dict of item->dtype of what to downcast if possible,
or the string 'infer' which will try to downcast to an appropriate
equal type (e.g. float64 to int64 if possible)

Returns:
filled : DataFrame
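
A sketch of two of the options described above, on a made-up frame: a dict as value fills each column with its own constant, and limit caps how many consecutive NaNs a forward fill may cover. The forward fill is written as df.ffill(limit=1), which I use here in place of fillna(method='ffill', limit=1) because newer pandas versions warn about the method argument.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'x': [1.0, np.nan, np.nan, 4.0],
                   'y': [np.nan, 2.0, np.nan, np.nan]})

# A dict fills each column with its own value
per_column = df.fillna({'x': 0.0, 'y': -1.0})
# Forward fill, covering at most one consecutive NaN in each gap
limited = df.ffill(limit=1)
```

In limited, the leading NaN in y stays missing because there is no earlier value to propagate.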

5. Hierarchical indexing

Book notes

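P150 df.ix
Note that the two slices below return the same five rows: a plain slice is positional and excludes its end point, while .ix on an integer index slices by label and includes it.

# the end point here is 5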
df[:5] 
Out[40]: 
          0   1         2
0 -0.085390 NaN       NaN
1  0.502172 NaN       NaN
2 -1.382911 NaN       NaN
3  0.037798 NaN  0.535017
4  0.358564 NaN  0.036123

# the end point here is 4
df.ix[:4]
Out[41]: 
          0   1         2
0 -0.085390 NaN       NaN
1  0.502172 NaN       NaN
2 -1.382911 NaN       NaN
3  0.037798 NaN  0.535017
4  0.358564 NaN  0.036123
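
.ix was deprecated in later pandas releases and is absent from pandas 1.0 and newer, where .loc and .iloc make the two slicing behaviors explicit. A sketch on a made-up frame with the default integer index:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(21).reshape(7, 3))

positional = df[:5]         # positions 0..4, end point excluded
by_label = df.loc[:4]       # labels 0..4, end point included
by_position = df.iloc[:4]   # positions 0..3, end point excluded
```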

P155 Sorting by level
The inner level is level 1 and the outer level is level 0:

frame
Out[96]: 
state      Ohio     Colorado
color     Green Red    Green
key1 key2                   
a    1        0   1        2
     2        3   4        5
b    1        6   7        8
     2        9  10       11

frame.sortlevel(1)
Out[97]: 
state      Ohio     Colorado
color     Green Red    Green
key1 key2                   
a    1        0   1        2
b    1        6   7        8
a    2        3   4        5
b    2        9  10       11

frame.sortlevel(0)
Out[98]: 
state      Ohio     Colorado
color     Green Red    Green
key1 key2                   
a    1        0   1        2
     2        3   4        5
b    1        6   7        8
     2        9  10       11

frame.sortlevel(1,axis=1)
Out[99]: 
state     Colorado  Ohio    
color        Green Green Red
key1 key2                   
a    1           2     0   1
     2           5     3   4
b    1           8     6   7
     2          11     9  10

frame.sortlevel(0,axis=1)
Out[100]: 
state     Colorado  Ohio    
color        Green Green Red
key1 key2                   
a    1           2     0   1
     2           5     3   4
b    1           8     6   7
     2          11     9  10
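
sortlevel was itself deprecated in later pandas releases in favor of sort_index with the level argument. A sketch that rebuilds the frame above and reproduces two of the results with sort_index:

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(
    np.arange(12).reshape(4, 3),
    index=pd.MultiIndex.from_arrays(
        [['a', 'a', 'b', 'b'], [1, 2, 1, 2]], names=['key1', 'key2']),
    columns=pd.MultiIndex.from_arrays(
        [['Ohio', 'Ohio', 'Colorado'], ['Green', 'Red', 'Green']],
        names=['state', 'color']))

# Sort rows by the inner level, like frame.sortlevel(1)
by_inner = frame.sort_index(level=1)
# Sort columns by the outer level, like frame.sortlevel(0, axis=1)
by_outer_cols = frame.sort_index(level=0, axis=1)
```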