[pandas Notes] The Categorical Data Type

Original

2020-06-15 03:19

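I. Categorical variables

When doing statistical data analysis, you often run into variables such as gender, social class, blood type, nationality, observation period, or degree of praise. Such data takes values from a fixed set of possibilities; the values repeat and are mostly strings, for example male and female for gender, or A, B, O and AB for blood type. pandas has a data type for storing and processing this kind of data: categorical, the pandas dtype that corresponds to a categorical variable.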

II. Ways to create a categorical

1. Converting with astype

import pandas as pd
import numpy as np
path = '../data/sz.xlsx'
sz_frame = pd.read_excel(path)
sz_frame['floor'].astype('category')

0        低楼层
1        低楼层
2        低楼层
3        低楼层
4        低楼层
        ... 
45263    中楼层
45264    高楼层
45265    低楼层
45266    低楼层
45267    中楼层
Name: floor, Length: 45268, dtype: category
Categories (3, object): [中楼层, 低楼层, 高楼层]
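With plain astype('category') the categories come out in sorted order and carry no ranking. If the levels have a natural order, one option is to pass a CategoricalDtype instead. The sketch below uses a few made-up values as a stand-in for sz_frame['floor'] (the real file is not available here), and the low < middle < high ordering is my assumption about what the labels mean.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

# Hypothetical stand-in for sz_frame['floor']; the values are made up.
floor = pd.Series(["低楼层", "高楼层", "中楼层", "低楼层"], name="floor")

# Declare the categories and their order explicitly: low < middle < high.
floor_dtype = CategoricalDtype(categories=["低楼层", "中楼层", "高楼层"], ordered=True)
ordered_floor = floor.astype(floor_dtype)

print(ordered_floor.cat.categories.tolist())  # categories in the declared order
print(ordered_floor.cat.codes.tolist())       # [0, 2, 1, 0]
print(ordered_floor.min(), ordered_floor.max())  # min/max follow the declared order
```

Because the dtype is ordered, comparisons, sorting, min and max use the declared order rather than the string order.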

2. Creating explicitly with dtype="category"

# Series
floor = pd.Series(sz_frame['floor'],dtype='category')
floor

0        低楼层
1        低楼层
2        低楼层
3        低楼层
4        低楼层
        ... 
45263    中楼层
45264    高楼层
45265    低楼层
45266    低楼层
45267    中楼层
Name: floor, Length: 45268, dtype: category
Categories (3, object): [中楼层, 低楼层, 高楼层]

# DataFrame
floor = pd.DataFrame(sz_frame['floor'],dtype='category')
floor.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 45268 entries, 0 to 45267
Data columns (total 1 columns):
floor    45268 non-null category
dtypes: category(1)
memory usage: 44.4 KB

3. Creating implicitly with cut/qcut

unit_price_level = pd.cut(sz_frame['unit_price'],3,precision=2)

0          (1.18, 14.8]
1          (1.18, 14.8]
2          (1.18, 14.8]
3          (1.18, 14.8]
4          (1.18, 14.8]
              ...      
45263    (28.38, 41.95]
45264    (28.38, 41.95]
45265    (28.38, 41.95]
45266    (28.38, 41.95]
45267    (28.38, 41.95]
Name: unit_price, Length: 45268, dtype: category
Categories (3, interval[float64]): [(1.18, 14.8] < (14.8, 28.38] < (28.38, 41.95]]

unit_price_level = pd.qcut(sz_frame['unit_price'],3,precision=2)
unit_price_level.value_counts()

(1.21, 4.56]    15090
(6.4, 41.95]    15089
(4.56, 6.4]     15089
Name: unit_price, dtype: int64
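# When cut/qcut is given an integer number of bins, cut usually does not put the same number of observations in each bin, whereas qcut uses sample quantiles, so qcut yields bins with (nearly) equal counts.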
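The difference between the two is easier to see on a tiny, deliberately skewed sample. The prices below are made up for illustration and are not taken from sz.xlsx.

```python
import pandas as pd

# Made-up, deliberately skewed prices: eight small values and one outlier.
prices = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 100], name="unit_price")

# cut: three bins of equal width over the range of the data.
equal_width = pd.cut(prices, 3)
# qcut: three bins split at the 1/3 and 2/3 sample quantiles.
equal_count = pd.qcut(prices, 3)

# sort_index() lists the bins from lowest to highest.
cut_counts = equal_width.value_counts().sort_index()
qcut_counts = equal_count.value_counts().sort_index()
print(cut_counts)   # counts per bin: 8, 0, 1
print(qcut_counts)  # counts per bin: 3, 3, 3
```

The outlier stretches the equal-width bins so that almost everything lands in the first one, while the quantile-based bins each hold three values.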

4. Creating explicitly with Categorical

index = pd.Index(data=["Tom", "Bob", "Mary", "James", "Andy", "Alice"], name="name")
user_info = pd.Series(data=["A", "AB", np.nan, "AB", "O", "B"], index=index, name="blood_type") 
# categories: the user-defined list of categories
pd.Categorical(user_info, categories=["A", "B", "AB"])

[A, AB, NaN, AB, NaN, B]    # values not listed in categories become NaN
Categories (3, object): [A, B, AB]
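What happens to the missing entry and to "O" can be checked through the codes. This is a small sketch with the same blood-type values, passed as a plain list rather than the named Series used above.

```python
import numpy as np
import pandas as pd

blood = pd.Categorical(
    ["A", "AB", np.nan, "AB", "O", "B"],
    categories=["A", "B", "AB"],
)

# Each code is the position of the value in categories; -1 marks a missing value.
print(blood.codes.tolist())      # [0, 2, -1, 2, -1, 1]
# Both the original NaN and the unlisted "O" count as missing.
print(int(blood.isna().sum()))   # 2
```

Note that after the conversion the original NaN and the unlisted "O" can no longer be told apart.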

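III. Applications

1. Memory usage and efficiency
Converting a column to the categorical type lets the DataFrame use less memory, because each distinct value is stored only once and the rows hold small integer codes.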

n = 10000000
labels = pd.Series(['1E76B5DCA3A19D03B0FB39BCF2A2F534',
                    '6945300E90C69061B463CCDA370DE5D6',
                    '4F4BEA1914E323156BE0B24EF8205B73',
                    '191115180C29B1E2AF8BE0FD0ABD138F']*(n //4))
draws  = pd.DataFrame({'labels':labels,'data':np.random.randn(n)})
draws.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000000 entries, 0 to 9999999
Data columns (total 2 columns):
labels    object
data      float64
dtypes: float64(1), object(1)
memory usage: 152.6+ MB

draws['labels'] = draws['labels'].astype('category')
draws.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 10000000 entries, 0 to 9999999
Data columns (total 2 columns):
labels    category
data      float64
dtypes: category(1), float64(1)
memory usage: 85.8 MB
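The "+" in "152.6+ MB" means info() did not count the Python string objects behind the object column, so the real saving is larger than the two figures suggest. A rough way to see it is memory_usage(deep=True). The sketch below uses four made-up labels and far fewer rows than the example above so that it runs quickly; the exact byte counts will depend on the pandas and Python versions.

```python
import pandas as pd

# Four made-up 32-character labels repeated many times.
base = ["A" * 32, "B" * 32, "C" * 32, "D" * 32]
as_object = pd.Series(base * 25_000, dtype=object)
as_category = as_object.astype("category")

# deep=True also counts the Python string objects behind an object column.
object_bytes = as_object.memory_usage(deep=True)
category_bytes = as_category.memory_usage(deep=True)
print(object_bytes, category_bytes)

# With only four categories, the codes fit into one byte per row.
print(as_category.cat.codes.dtype)
```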

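Memory usage went down. I also tested whether groupby gets faster and did not see much improvement; the conversion to category itself costs some time. The data volume or the small number of categories in my test may be why the effect is not obvious. One caveat about the test: if the snippets are run in the order shown, labels was already converted to category in the previous step, so both timings below measure a categorical column, which would also explain why they are nearly identical.

%timeit draws.groupby('labels').sum()
# 159 ms ± 5.6 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
draws['labels'] = draws['labels'].astype('category')  # has no effect if labels is already category
%timeit draws.groupby('labels').sum()
# 157 ms ± 4.44 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)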
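For a side-by-side comparison it helps to keep the object column and the categorical column in separate frames, so that neither overwrites the other. The following is only a sketch under my own assumptions: made-up labels, a much smaller n, and the standard-library timeit instead of the %timeit magic. I am not claiming any particular speed-up; the numbers depend on the machine, the pandas version, and the number of categories.

```python
import timeit

import numpy as np
import pandas as pd

n = 200_000  # far fewer rows than in the test above, to keep the run short
labels = pd.Series(["label_a", "label_b", "label_c", "label_d"] * (n // 4), dtype=object)
values = np.random.default_rng(0).standard_normal(n)

# Two separate frames, so the object column is never overwritten.
draws_object = pd.DataFrame({"labels": labels, "data": values})
draws_category = draws_object.assign(labels=labels.astype("category"))

def group_sum(frame):
    # observed=True reports only the categories that actually occur in the data.
    return frame.groupby("labels", observed=True)["data"].sum()

object_seconds = timeit.timeit(lambda: group_sum(draws_object), number=5)
category_seconds = timeit.timeit(lambda: group_sum(draws_category), number=5)
print(f"object:   {object_seconds:.4f} s for 5 runs")
print(f"category: {category_seconds:.4f} s for 5 runs")
```

The conversion cost is deliberately left out of the timed part here; if the column is converted only to run a single groupby, that cost should be counted as well.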

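2. Attributes and methods
A Categorical has two commonly used attributes, categories and codes; on a Series they are reached through the .cat accessor.

floor = pd.Series(sz_frame['floor'],dtype='category')
floor.cat.categories  # the array of categories: Index(['中楼层', '低楼层', '高楼层'], dtype='object')
floor.cat.codes   # returns a Series of integers; each one is the position of that row's value in categories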

0        1
1        1
2        1
3        1
4        1
        ..
45263    0
45264    2
45265    1
45266    1
45267    0
Length: 45268, dtype: int8
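The relation between the two attributes is that looking each code up in categories gives back the original values. A minimal sketch with made-up floor values (not the real sz_frame data):

```python
import pandas as pd

# Made-up values standing in for sz_frame['floor'].
floor = pd.Series(["低楼层", "中楼层", "高楼层", "低楼层"], dtype="category")

categories = floor.cat.categories   # Index of the distinct values, sorted
codes = floor.cat.codes             # integer position of each value in categories
print(codes.tolist())               # [1, 0, 2, 1]

# Looking each code up in categories gives back the original values.
rebuilt = categories.take(codes.to_numpy())
print(rebuilt.tolist())

# from_codes does the same lookup and yields a Categorical directly.
same = pd.Categorical.from_codes(codes.to_numpy(), categories=categories)
print(same.tolist())
```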

For the other methods, see the official documentation; pandas devotes quite a lot of space to them, so they are presumably quite practical.
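As a taste of those methods, here is a short sketch of four of them from the .cat accessor. The blood-type values are made up, and the selection of methods is mine, not a summary of the documentation.

```python
import pandas as pd

blood = pd.Series(["A", "AB", "O", "AB"], dtype="category")  # made-up values

# rename_categories: relabel categories without touching the codes.
renamed = blood.cat.rename_categories({"O": "type O"})

# add_categories: declare a category that no row uses yet.
with_b = blood.cat.add_categories(["B"])

# remove_unused_categories: drop categories that no row refers to.
trimmed = with_b.cat.remove_unused_categories()

# set_categories: replace the whole category list; values left out become NaN.
only_a_ab = blood.cat.set_categories(["A", "AB"])

print(renamed.cat.categories.tolist())
print(with_b.cat.categories.tolist())
print(trimmed.cat.categories.tolist())
print(only_a_ab.tolist())
```

Each call returns a new Series and leaves blood unchanged.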
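3. One-hot encoding
Besides groupby, categorical data can also be used for one-hot encoding in machine learning. Categorical data is usually converted into dummy variables, also called one-hot encoding. This produces a DataFrame with one column per category, where the value is 1 if the row belongs to that category and 0 otherwise.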

floor = pd.Series(sz_frame['floor'],dtype='category')
pd.get_dummies(floor)

floor	中楼层	低楼层	高楼层
0	0	1	0
1	0	1	0
2	0	1	0
3	0	1	0
4	0	1	0
...	...	...	...
45263	1	0	0
45264	0	0	1
45265	0	1	0
45266	0	1	0
45267	1	0	0
45268 rows × 3 columns
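One practical point about using a categorical here: get_dummies creates a column for every declared category, including ones that do not occur in the data, so the columns stay the same across different samples. A small sketch with made-up values in which no row is 高楼层:

```python
import pandas as pd

# Made-up sample in which no row is "高楼层", although the category is declared.
floor = pd.Series(
    pd.Categorical(["低楼层", "中楼层", "低楼层"], categories=["中楼层", "低楼层", "高楼层"])
)

# dtype=int asks for 0/1 columns; recent pandas versions default to booleans.
dummies = pd.get_dummies(floor, dtype=int)
print(dummies)
```

The 高楼层 column is still present and is all zeros, which keeps the layout of a training set and a later sample aligned.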
