One-Hot Encoding

Original

2020-06-09 01:26

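One-hot encoding, also known as one-bit-effective encoding, uses an N-bit state register to encode N states: each state has its own register bit, and at any moment exactly one of those bits is set. It is a good way to represent categorical features numerically, although the data becomes sparse as a result. Take the following samples as an example: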
[{'city': '北京', 'location': '北方', 'temperature': 100},
 {'city': '上海', 'location': '南方', 'temperature': 60},
 {'city': '深圳', 'location': '南方', 'temperature': 30},
 {'city': '深圳', 'location': '南方', 'temperature': 20}]
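To turn the city category above into numbers, we could use 1, 2 and 3 to represent 北京, 上海 and 深圳 respectively. With very few categories this simple numbering is acceptable, but it does not scale well: the integers suggest an order and a distance between categories that do not actually exist, and this becomes harder to manage as the number of features and categories grows. One-hot encoding fits this problem well. The city feature has 3 categories in total, so we can use 3 state bits to represent them, and likewise 2 state bits for location.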
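For the feature city:
北京: 100
上海: 010
深圳: 001

For the feature location:
北方: 01
南方: 10

For the feature temperature:
It is already numeric, so one-hot encoding is not applied.

Sample 1 is therefore encoded as: 100 01 100 (city, location, then the temperature value 100)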
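
The hand-worked encoding above can be reproduced with a few lines of plain Python. This is only a minimal sketch: the helper names `one_hot` and `encode_sample` and the two `*_ORDER` lists are made up for illustration, and the bit order is chosen to match the codes assigned in the text.

```python
# Hypothetical helpers for illustration; not part of any library.
def one_hot(value, categories):
    """Return a list with a 1 at the position of `value` and 0 elsewhere."""
    return [1 if value == category else 0 for category in categories]


# Bit order chosen to match the hand-assigned codes in the text.
CITY_ORDER = ['北京', '上海', '深圳']  # 100, 010, 001
LOCATION_ORDER = ['南方', '北方']      # 南方 -> 10, 北方 -> 01


def encode_sample(sample):
    """One-hot encode city and location; pass temperature through unchanged."""
    return (one_hot(sample['city'], CITY_ORDER)
            + one_hot(sample['location'], LOCATION_ORDER)
            + [sample['temperature']])


sample_1 = {'city': '北京', 'location': '北方', 'temperature': 100}
encoded_1 = encode_sample(sample_1)
print(encoded_1)  # [1, 0, 0, 0, 1, 100]
```

Each categorical feature contributes exactly one 1, and the numeric feature is appended as is.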

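from sklearn.feature_extraction import DictVectorizer


def dictvec():
    '''
    Extract features from a list of dicts.
    return None
    '''
    # Instantiate the vectorizer (named so it does not shadow the built-in dict)
    # vectorizer = DictVectorizer()  # sparse defaults to True
    vectorizer = DictVectorizer(sparse=False)  # data is returned as an ndarray

    # Call fit_transform on the samples
    data = vectorizer.fit_transform([
        {'city': '北京', 'location': '北方', 'temperature': 100},
        {'city': '上海', 'location': '南方', 'temperature': 60},
        {'city': '深圳', 'location': '南方', 'temperature': 30},
        {'city': '深圳', 'location': '南方', 'temperature': 20},
    ])
    # get_feature_names() was removed in scikit-learn 1.2;
    # get_feature_names_out() is its replacement
    print(vectorizer.get_feature_names_out().tolist())
    # Dict feature extraction: categorical values are expanded into one-hot features
    print(data)

    return None


if __name__ == '__main__':
    dictvec()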

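The output is as follows; data is an ndarray.
['city=上海', 'city=北京', 'city=深圳', 'location=北方', 'location=南方', 'temperature']
[[  0.   1.   0.   1.   0. 100.]
 [  1.   0.   0.   0.   1.  60.]
 [  0.   0.   1.   0.   1.  30.]
 [  0.   0.   1.   0.   1.  20.]]

Note that DictVectorizer sorts the columns by feature name, so the bit positions differ from the hand-assigned codes earlier, but each categorical feature still has exactly one 1 per row.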
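
The text mentions that one-hot data becomes sparse, and the commented-out line in the code notes that sparse defaults to True. As a rough sketch of what that default gives you, assuming scikit-learn and SciPy are installed, the same samples can be vectorized without sparse=False and inspected as follows (the variable names here are illustrative only):

```python
from scipy import sparse
from sklearn.feature_extraction import DictVectorizer

samples = [
    {'city': '北京', 'location': '北方', 'temperature': 100},
    {'city': '上海', 'location': '南方', 'temperature': 60},
    {'city': '深圳', 'location': '南方', 'temperature': 30},
    {'city': '深圳', 'location': '南方', 'temperature': 20},
]

vectorizer = DictVectorizer()          # sparse=True is the default
X = vectorizer.fit_transform(samples)  # a SciPy sparse matrix, not an ndarray

print(sparse.issparse(X))  # True
print(X.shape)             # (4, 6)
print(X.nnz)               # number of stored non-zero entries

dense = X.toarray()                         # convert to a dense ndarray when needed
restored = vectorizer.inverse_transform(X)  # dicts keyed by the generated feature names
print(restored[0])
```

Only the non-zero entries are stored, which matters once a feature has many categories and most columns in each row are 0.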
