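Categoricals are a pandas data type corresponding to categorical variables in statistics. A categorical variable takes on a limited, and usually fixed, number of possible values. Examples include gender, social class, blood type, nationality, observation period, and ratings such as degree of approval.

Unlike categorical variables in the strict statistical sense, categorical data in pandas may carry an order, for example "strongly agree" vs. "agree" or "first observation" vs. "second observation". That order is set by hand when the categories are defined rather than derived from the values, and numerical operations such as addition or division are not possible on categorical data.

Each element is either one of the predefined categories or a missing value (np.nan). Order is determined by the order of the categories, not by the lexical order of the values. Internally, a categorical consists of an array of categories and an integer array of codes, where each code points to the actual value in the categories array.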
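As a small sketch of the internal layout described above (the labels "low", "mid", "high" are made up for illustration), the categories and the integer codes can be inspected through the `.cat` accessor:

```python
import numpy as np
import pandas as pd

# Categories are declared in a deliberately non-alphabetical order.
s = pd.Series(pd.Categorical(["low", "high", np.nan, "low"],
                             categories=["low", "mid", "high"]))

categories = list(s.cat.categories)  # category labels, in declared order
codes = list(s.cat.codes)            # position of each value in categories; -1 marks a missing value

print(categories)
print(codes)
```

Each code is an index into the categories array, which is why an unused category ("mid" here) costs nothing per element.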

Excerpted from https://blog.csdn.net/mengenqing/article/details/80616094

The notes below are based on a separate learning resource.

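Main topics in this section:

Creating categories and their properties

Sorting categorical variables

Comparison operations on categorical variables

Questions and exercises

Creating categories and their properties

Creating categorical variables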

#1. For a Series, simply pass dtype='category'
    import numpy as np
    import pandas as pd

    series_cat=pd.Series(["a", "b", "c", "a"], dtype="category")
    series_cat
    
    Result:
    0    a
    1    b
    2    c
    3    a
    dtype: category
    Categories (3, object): [a, b, c]

    ## series_cat has dtype category, but no order was declared. The categories were inferred in sorted order, so sorting the Series still follows lexical order:

    series_cat.sort_values()

    0    a
    3    a
    1    b
    2    c
    dtype: category
    Categories (3, object): [a, b, c]

#2. Creating a DataFrame with a categorical column
    #2.1 Specify the dtype directly

    temp_df = pd.DataFrame({'A': pd.Series(["a", "b", "c", "a"], dtype="category"), 'B': list('abcd')})
    print(temp_df)
    temp_df.dtypes

   Result:
      A  B
   0  a  a
   1  b  b
   2  c  c
   3  a  d
   A    category
   B      object
   dtype: object

   #2.2 The dtype can also be converted after the data is defined
    # Create the DataFrame
    df_cat = pd.DataFrame({'V1':['A','C','B','D']})
    # Convert the chosen column to the category dtype
    df_cat['V1'] = df_cat['V1'].astype('category')
    df_cat['V1']


# 3. Creating with the built-in Categorical type

   # pd.Categorical(values, categories=None, ordered=None, dtype=None, fastpath=False)

   #3.1 Build categorical data with pd.Categorical(), then convert it to a Series or use it as DataFrame content

   cat = pd.Categorical(["a", "b", "c", "a"], categories=['a','b','c'])
   pd.Series(cat)

   Result:

   0    a
   1    b
   2    c
   3    a
   dtype: category
   Categories (3, object): [a, b, c]

    #3.2 Using it as DataFrame content
    categorical_ = pd.Categorical(['A','B','D','C'],
                              categories=['B','C','D'])
    df_cat = pd.DataFrame({'V1':categorical_})
    df_cat['V1']
       
    Result:
    0    NaN
    1      B
    2      D
    3      C
    Name: V1, dtype: category
    Categories (3, object): [B, C, D]

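    Summary: creating categorical data directly with pd.Categorical() brings two additional features. First, when the categories are defined through the categories parameter, any value in the data that is not among them is automatically converted to a missing value (np.nan), as happened to 'A' above.

#3.3 Second, pd.Categorical() takes a boolean parameter ordered. When it is True, the categories are ordered from smallest to largest in the order given by categories: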

     categorical_ = pd.Categorical(['A','B','D','C'],
                              categories=['A','B','C','D'],
                             ordered=True)
     df_cat = pd.DataFrame({'V1':categorical_})
     df_cat['V1']
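
To see what ordered=True buys in practice, here is a minimal sketch with a deliberately non-alphabetical order (the order D < C < B < A is my own choice for illustration, not from the notes):

```python
import pandas as pd

# Same values as above, but declared in the order D < C < B < A.
cat = pd.Categorical(['A', 'B', 'D', 'C'],
                     categories=['D', 'C', 'B', 'A'],
                     ordered=True)
s = pd.Series(cat)

smallest = s.min()                     # smallest according to the category order
largest = s.max()
sorted_values = list(s.sort_values())  # sorted by category order, not alphabetically
above_c = list(s > 'C')                # element-wise comparison against one category

print(smallest, largest, sorted_values, above_c)
```

min, max and inequality comparisons are only available because the categorical is ordered; on an unordered one they raise a TypeError.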

#4. Converting existing data with CategoricalDtype() from pandas.api.types

    from pandas.api.types import CategoricalDtype
    # Create the DataFrame
    df_cat = pd.DataFrame({'V1':['A','C','B','D']})
    cat = CategoricalDtype(categories=['A','C','B'],ordered=True)
    df_cat['V1'] = df_cat['V1'].astype(cat)
    df_cat['V1']

   Result:

    0      A
    1      C
    2      B
    3    NaN
    Name: V1, dtype: category
    Categories (3, object): [A < C < B]

    Combining CategoricalDtype() with astype() converts data of other types to categorical data. Its categories and ordered parameters cover what .astype('category') cannot express (in fact, .astype('category') is equivalent to .astype(CategoricalDtype(categories=None, ordered=False))).
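
The equivalence claimed above can be checked directly. This is just a quick sketch, with variable names of my own choosing:

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

raw = pd.Series(['A', 'C', 'B', 'D'])

via_string = raw.astype('category')
via_dtype = raw.astype(CategoricalDtype(categories=None, ordered=False))

same_dtype = via_string.dtype == via_dtype.dtype  # both infer the same categories
same_values = via_string.equals(via_dtype)
inferred = list(via_string.cat.categories)        # inferred categories come out sorted

print(same_dtype, same_values, inferred)
```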

# 5. Creating with the cut function
    # pd.cut(x, bins, right: bool = True, labels=None, retbins: bool = False, precision: int = 3,
    #        include_lowest: bool = False, duplicates: str = 'raise')

#5.1 By default, the intervals themselves are used as labels
      pd.cut(np.random.randint(0,60,5), [0,10,30,60])

      out:
        [(30, 60], (0, 10], (30, 60], (30, 60], (30, 60]]
        Categories (3, interval[int64]): [(0, 10] < (10, 30] < (30, 60]]

       # String labels can be given instead
        pd.cut(np.random.randint(0,60,5), [0,10,30,60], right=False, labels=['0-10','10-30','30-60'])

 Result:
            [10-30, 30-60, 30-60, 10-30, 30-60]
            Categories (3, object): [0-10 < 10-30 < 30-60]

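Structure of a categorical variable

A categorical variable has three parts: the element values (values), the categories (categories), and whether it is ordered (ordered).
As the output above shows, a categorical variable created with cut is ordered by default.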
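
A short sketch of these three parts on a cut result, using fixed input values (my own, instead of random ones) so the outcome is reproducible:

```python
import pandas as pd

ages = [3, 25, 47, 12]
binned = pd.cut(ages, [0, 10, 30, 60])  # for a plain list, cut returns a pandas Categorical

n_categories = len(binned.categories)   # one interval per bin
is_ordered = binned.ordered             # cut produces an ordered categorical by default
codes = list(binned.codes)              # which bin each value fell into

print(binned.categories, is_ordered, codes)
```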

Accessing attributes

#1. The describe method
s = pd.Series(pd.Categorical(["a", "b", "c", "a",np.nan], categories=['a','b','c','d']))
s.describe()

count     4
unique    3
top       a
freq      2
dtype: object

 #2. Use .cat.categories to view the list of category names

    s = pd.Series(['a', 'b', 'c', 'c'], dtype='category')

    s.cat.categories

    Result: Index(['a', 'b', 'c'], dtype='object')

 #3. Use .cat.ordered to check whether the categories are ordered

    s = pd.Series(pd.Categorical(['poor', 'fair', 'good', 'excellent'],
                                 categories=['poor', 'fair', 'good', 'excellent'], ordered=True))

    s.cat.ordered

        outcome: True

    a2 = pd.Series(['a', 'b', 'c', 'c'], dtype='category')

    a2.cat.ordered

        outcome: False

# 4. Reading the categories: cat.categories

    a1 = pd.Series(['a', 'b', 'c', 'c'], dtype='category')

    a1.cat.categories

        outcome:  Index(['a', 'b', 'c'], dtype='object')

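Modifying categories

#1. Modifying the categories attribute
    a1 = pd.Series(['a', 'b', 'c', 'c'], dtype='category')

    a1.cat.categories

        outcome:  Index(['a', 'b', 'c'], dtype='object')

    # Direct assignment only works on older pandas. The setter was removed in pandas 2.0,
    # where a1 = a1.cat.rename_categories([...]) is the replacement.
    a1.cat.categories = ['category_a', 'category_b', 'c']

    a1

        outcome:
        0    category_a
        1    category_b
        2             c
        3             c
        dtype: category
        Categories (3, object): [category_a, category_b, c]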

#2. Modifying with set_categories (values that are not in the new categories become NaN)
    s = pd.Series(pd.Categorical(["a", "b", "c", "a",np.nan], categories=['a','b','c','d']))
    print(s)
    s.cat.set_categories(['new_a','c'])

outcome:
0      a
1      b
2      c
3      a
4    NaN
dtype: category
Categories (4, object): [a, b, c, d]
0    NaN
1    NaN
2      c
3    NaN
4    NaN
dtype: category
Categories (2, object): [new_a, c]

#3. Modifying with rename_categories. Note that this method changes both the values and the categories

    s = pd.Series(pd.Categorical(["a", "b", "c", "a",np.nan], categories=['a','b','c','d']))
    print(s)
    s.cat.rename_categories(['new_%s'%i for i in s.cat.categories])

outcome:
0      a
1      b
2      c
3      a
4    NaN
dtype: category
Categories (4, object): [a, b, c, d]
0    new_a
1    new_b
2    new_c
3    new_a
4      NaN
dtype: category
Categories (4, object): [new_a, new_b, new_c, new_d]

#5. Renaming with a dict. This also changes both the values and the categories
s.cat.rename_categories({'a':'new_a','b':'new_b'})

outcome:
0    new_a
1    new_b
2        c
3    new_a
4      NaN
dtype: category
Categories (4, object): [new_a, new_b, c, d]


# 6. Adding categories with add_categories
s = pd.Series(pd.Categorical(["a", "b", "c", "a",np.nan], categories=['a','b','c','d']))
s.cat.add_categories(['e'])

outcome:
0      a
1      b
2      c
3      a
4    NaN
dtype: category
Categories (5, object): [a, b, c, d, e]

# 7. Removing categories with remove_categories
s = pd.Series(pd.Categorical(["a", "b", "c", "a",np.nan], categories=['a','b','c','d']))
s.cat.remove_categories(['d'])
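
The notes show no output for remove_categories. As a hedged sketch, here is what happens when the removed category is unused versus in use (the second case is my own addition):

```python
import numpy as np
import pandas as pd

s = pd.Series(pd.Categorical(["a", "b", "c", "a", np.nan],
                             categories=['a', 'b', 'c', 'd']))

# 'd' is unused, so dropping it leaves the values untouched.
dropped_unused = s.cat.remove_categories(['d'])

# 'a' is in use, so every 'a' becomes a missing value.
dropped_used = s.cat.remove_categories(['a'])

remaining = list(dropped_unused.cat.categories)
n_missing = int(dropped_used.isna().sum())  # two former 'a' values plus the original NaN

print(remaining, n_missing)
```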

# 8. Removing categories that do not appear among the values
s = pd.Series(pd.Categorical(["a", "b", "c", "a",np.nan], categories=['a','b','c','d']))
s.cat.remove_unused_categories()

outcome:
0      a
1      b
2      c
3      a
4    NaN
dtype: category
Categories (3, object): [a, b, c]

Sorting categorical variables

# Converting unordered to ordered
    1. The .cat.as_ordered method
        s = pd.Series(["a", "d", "c", "a"]).astype('category').cat.as_ordered()
        s

   Outcome:
    0    a
    1    d
    2    c
    3    a
    dtype: category
    Categories (3, object): [a < c < d]
2. Converting ordered to unordered with .cat.as_unordered()

    s.cat.as_unordered()
3. Using the ordered parameter of .cat.set_categories
    pd.Series(["a", "d", "c", "a"]).astype('category').cat.set_categories(['a','c','d'],ordered=True)
4. Using .cat.reorder_categories. With this method, the new categories must be the same set as the original categories
    s = pd.Series(["a", "d", "c", "a"]).astype('category')
    s.cat.reorder_categories(['a','c','d'],ordered=True)
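
A quick sketch of the "same set" requirement of reorder_categories. The extra category 'b' is something I introduce only to trigger the error:

```python
import pandas as pd

s = pd.Series(["a", "d", "c", "a"]).astype('category')

# Same set of categories in a new order: accepted.
reordered = s.cat.reorder_categories(['d', 'c', 'a'], ordered=True)
new_order = list(reordered.cat.categories)

# A different set of categories ('b' added) is rejected with a ValueError.
try:
    s.cat.reorder_categories(['a', 'b', 'c', 'd'])
    rejected = False
except ValueError:
    rejected = True

print(new_order, rejected)
```

When the set of categories itself has to change, set_categories from item 3 is the method to reach for.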

5. Both sorting by value and sorting by index work with categorical data
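
A minimal sketch of both kinds of sorting, assuming made-up levels "low" < "mid" < "high":

```python
import pandas as pd

levels = ['low', 'mid', 'high']
s = pd.Series(pd.Categorical(['high', 'low', 'mid', 'low'],
                             categories=levels, ordered=True))

# Sorting by value follows the category order low < mid < high.
by_value = list(s.sort_values())

# With the categorical used as the index, sort_index follows the same order.
indexed = pd.Series([1, 2, 3, 4], index=s.values)
by_index = list(indexed.sort_index().index)

print(by_value, by_index)
```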

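Comparison operations on categorical variables
1. Comparison with a scalar or with a sequence of the same length
2. Comparison with another categorical variable
  1. An equality test between two categorical variables requires their categories to be exactly the same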
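
The notes list these cases without code. The following is my own sketch of them, not the original author's examples:

```python
import pandas as pd

s = pd.Series(["a", "d", "c", "a"]).astype('category')

# 1. Against a scalar, or a plain list of the same length: element-wise equality.
eq_scalar = list(s == 'a')
eq_list = list(s == list('abcd'))

# 2. Against another categorical with the same categories: element-wise equality.
same = pd.Series(["a", "c", "c", "d"]).astype('category')
eq_same = list(s == same)

# A categorical with a different category set cannot be compared at all.
other = pd.Series(["a", "d", "c", "a"]).astype(pd.CategoricalDtype(['a', 'c', 'd', 'e']))
try:
    s == other
    raised = False
except TypeError:
    raised = True

print(eq_scalar, eq_list, eq_same, raised)
```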
Questions and exercises

[Question 1] How is the union_categoricals method used, and what does it do?

1.
from pandas.api.types import union_categoricals
a = pd.Categorical(["b", "c"])
b = pd.Categorical(["a", "b"])
union_categoricals([a, b])
# This concatenates the categoricals and takes the union of their categories. The next call raises an error:

a = pd.Categorical(["a", "b"], ordered=True)
b = pd.Categorical(["a", "b", "c"], ordered=True)
union_categoricals([a, b])

TypeError: to union ordered Categoricals, all categories must be the same
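
To make the first call's result concrete, here is a small sketch. The sort_categories=True variant is my addition to show the one option that affects category order:

```python
import pandas as pd
from pandas.api.types import union_categoricals

a = pd.Categorical(["b", "c"])
b = pd.Categorical(["a", "b"])

combined = union_categoricals([a, b])
values = list(combined)                   # values of a followed by values of b
appearance = list(combined.categories)    # categories in order of appearance

sorted_cats = list(union_categoricals([a, b], sort_categories=True).categories)

print(values, appearance, sorted_cats)
```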

[Question 2] When two Series are concatenated vertically with concat, is the result always a categorical variable? When is it not?
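
One way to explore this question, as a sketch rather than a full answer (the three Series are invented for illustration):

```python
import pandas as pd

s1 = pd.Series(['a', 'b'], dtype='category')
s2 = pd.Series(['b', 'a'], dtype='category')  # same categories as s1
s3 = pd.Series(['a', 'c'], dtype='category')  # different categories

same_cats = pd.concat([s1, s2], ignore_index=True)
diff_cats = pd.concat([s1, s3], ignore_index=True)

# The result stays categorical only when the inputs share the same categories.
same_is_categorical = isinstance(same_cats.dtype, pd.CategoricalDtype)
diff_is_categorical = isinstance(diff_cats.dtype, pd.CategoricalDtype)

print(same_cats.dtype, diff_cats.dtype)
```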

[Question 3] When groupby or value_counts is used, how do the results for a categorical variable differ from those for an ordinary variable?
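
A sketch of the difference, assuming a toy Series with one unused category 'c'. Passing observed=False is my explicit choice so the behaviour does not depend on the pandas default:

```python
import pandas as pd

s = pd.Series(pd.Categorical(['a', 'b', 'a'], categories=['a', 'b', 'c']))

# Categorical: the unused category 'c' is reported with a count of 0.
cat_counts = s.value_counts()
count_c = int(cat_counts.loc['c'])

# Plain object Series: only values that actually occur are reported.
plain_counts = s.astype(object).value_counts()
c_in_plain = 'c' in plain_counts.index

# groupby with observed=False also keeps the empty group for 'c'.
group_sizes = s.groupby(s, observed=False).size()
n_groups = len(group_sizes)

print(cat_counts, plain_counts, group_sizes, sep='\n')
```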

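[Question 4] What "pitfall" of creating a categorical variable with Series does the code below illustrate, and how can it be avoided? (Hint: use the copy parameter of Series.)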
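
The code this question refers to is not reproduced in these notes. One plausible reading, stated here as an assumption, is the pandas behaviour that a Series built from an existing Categorical may share its data with that Categorical, so that editing the Series can also change the original. Whether that sharing actually happens depends on the pandas version. The sketch below only shows the safe side, using the copy parameter from the hint, with made-up values:

```python
import pandas as pd

cat = pd.Categorical([1, 2, 3, 10], categories=[1, 2, 3, 4, 10])

# copy=True gives the Series its own data, independent of cat.
s = pd.Series(cat, name='cat', copy=True)
s.iloc[0:2] = 10

original_values = list(cat)  # cat is untouched because the Series holds a copy
series_values = list(s)

print(original_values, series_values)
```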

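[Exercise 1] Continuing with the earthquake dataset from Chapter 4, solve the following:

(a) Divide depth into seven levels using the bins [0,5,10,15,20,30,50,np.inf]. Use the depth levels Ⅰ,Ⅱ,Ⅲ,Ⅳ,Ⅴ,Ⅵ,Ⅶ as the index and sort from shallow to deep.

# '深度' is the depth column of the dataset
d = pd.read_csv('data/Earthquake.csv')
# Assign the binned result back, otherwise the index below would still hold raw depths
d['深度'] = pd.cut(d['深度'], [0,5,10,15,20,30,50,np.inf],labels=['Ⅰ','Ⅱ','Ⅲ','Ⅳ','Ⅴ','Ⅵ','Ⅶ'])
d.set_index('深度').sort_index().head()

(b) Building on (a), divide intensity into four levels using the bins [0,3,4,5,np.inf], then build a multi-level index on the depth and intensity levels for the southern region and sort by it.

(The direction column does not seem to contain a "south" value, so the code below does not filter by region.)

# '烈度' is the intensity column of the dataset
d['烈度'] = pd.cut(d['烈度'], [0,3,4,5,np.inf],labels=['Ⅰ','Ⅱ','Ⅲ','Ⅳ'])
d.set_index(['深度','烈度']).sort_index().head()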

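[Exercise 2] For categorical variables, calling the reshaping functions from Chapter 4 runs into a bug (not yet fixed in the version used at the time of writing). Take crosstab: according to the official documentation, categories that never occur should still appear in the reshaped summary, but in practice they do not. In the example below, the row 'c' and the column 'f' that should appear are missing. Based on this, try to design a my_crosstab function that returns the correct result.

Reference answer:

# foo and bar are not defined in these notes. The definitions here are an assumption,
# inferred from the missing row 'c' and column 'f' mentioned above.
foo = pd.Categorical(['a', 'b'], categories=['a', 'b', 'c'])
bar = pd.Categorical(['d', 'e'], categories=['d', 'e', 'f'])

def my_crosstab(foo, bar):
    # Use every declared category, observed or not, as a row or column label
    s1 = pd.Index(foo.categories, name='1st var')
    s2 = pd.Index(bar.categories, name='2nd var')
    df = pd.DataFrame(0, index=s1, columns=s2)
    # Count each (foo, bar) pair
    for i, j in zip(foo, bar):
        df.at[i, j] += 1
    return df
my_crosstab(foo, bar)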
