Feature Engineering: Handling Categorical Variables

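Categorical variables are an important kind of feature in machine learning. A categorical variable is one that takes a fixed number of possible values, each of which represents a group or category. They differ from ordinal variables in that the categories have no inherent order: the distance between any two categories is the same (or, put differently, no real notion of distance is defined). For example: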
Ordinal: low, medium, high
Categorical: Georgia, Alabama, South Carolina, … , New York

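Machine learning models generally take numbers as input, so string values have to be converted to numbers.
This conversion can introduce a dimensionality problem. If the encoded features end up with a very high dimension, the curse of dimensionality may set in, so take care to keep the dimension from growing too large.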

Working with several datasets, the author tried seven encoding methods in total:

Ordinal
One-Hot
Binary
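First apply ordinal encoding to the categories, then convert the resulting codes (integers) to binary, and finally split the binary digits into separate columns. Suppose an age-group feature has 4 values: "adolescent", "young adult", "middle-aged" and "elderly". Ordinal encoding gives "adolescent" -> 0, "young adult" -> 1, "middle-aged" -> 2, "elderly" -> 3. Binary encoding then gives "adolescent" -> 0 -> 00, "young adult" -> 1 -> 01, "middle-aged" -> 2 -> 10, "elderly" -> 3 -> 11. Splitting the binary digits gives "adolescent" -> 0 -> 00 -> 0|0, "young adult" -> 1 -> 01 -> 0|1, "middle-aged" -> 2 -> 10 -> 1|0, "elderly" -> 3 -> 11 -> 1|1.
The advantage of this encoding is that it adds fewer columns than one-hot encoding. In the example above, one-hot encoding would produce 4 new feature columns, whereas binary encoding adds only 2.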
Sum
Polynomial
Backward Difference
Helmert
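
The three steps of binary encoding described in the list above (ordinal code, binary digits, one column per digit) can be sketched in plain Python. This is a minimal illustration under stated assumptions, not the implementation the author used: the function name `binary_encode` and the English category labels are made up here, and codes start at 0 as in the age-group example.

```python
def binary_encode(values, categories):
    """Binary-encode `values`, given the ordered list of all `categories`."""
    # Step 1: ordinal encoding, category -> integer code starting at 0
    codes = {cat: i for i, cat in enumerate(categories)}
    # Number of bits needed for the largest code (at least one column)
    n_bits = max(1, (len(categories) - 1).bit_length())
    encoded = []
    for v in values:
        # Step 2: write the code as a zero-padded binary string
        bits = format(codes[v], f"0{n_bits}b")
        # Step 3: split the binary string into one column per bit
        encoded.append([int(b) for b in bits])
    return encoded


age_groups = ["adolescent", "young adult", "middle-aged", "elderly"]
print(binary_encode(age_groups, age_groups))
# [[0, 0], [0, 1], [1, 0], [1, 1]]
```

With k categories this produces about log2(k) columns, compared with k columns for one-hot encoding, which is the saving the Binary item points out.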

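The author compared the effect of these encoding methods on several different datasets.

Taken together, binary encoding performed best in that comparison.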

Next comes another commonly used encoding: feature hashing.
Wikipedia introduces feature hashing as follows:

In machine learning, feature hashing, also known as the hashing trick (by analogy to the kernel trick), is a fast and space-efficient way of vectorizing features, i.e. turning arbitrary features into indices in a vector or matrix. It works by applying a hash function to the features and using their hash values as indices directly, rather than looking the indices up in an associative array. This trick is often attributed to Weinberger et al., but there exists a much earlier description of this method published by John Moody in 1989.

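In short, feature hashing uses a hash function to map the multiple feature columns of a sample into a single vector. More precisely, for each value of each feature column, the hash function maps that value to a position in a vector of fixed length. Different feature types are mapped in different ways.
The Spark ML documentation describes it as follows: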

The FeatureHasher transformer operates on multiple columns. Each column may contain either numeric or categorical features. Behavior and handling of column data types is as follows:
- Numeric columns: For numeric features, the hash value of the column name is used to map the feature value to its index in the feature vector. By default, numeric features are not treated as categorical (even when they are integers). To treat them as categorical, specify the relevant columns in categoricalCols.
- String columns: For categorical features, the hash value of the string “column_name=value” is used to map to the vector index, with an indicator value of 1.0. Thus, categorical features are “one-hot” encoded (similarly to using OneHotEncoder with dropLast=false).
- Boolean columns: Boolean values are treated in the same way as string columns. That is, boolean features are represented as “column_name=true” or “column_name=false”, with an indicator value of 1.0.

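As an example, take a feature set with the columns "real", "stringNum", "string" and "bool".

The first column of the original feature set, "real", is a numeric feature, so it is mapped as follows: the hash of the column name "real" becomes the index in the target vector (the same index for every sample, 174475 in this example), and the value stored at position 174475 is the original value of "real". The "stringNum" and "string" columns are both categorical features, so the index in the target vector differs from sample to sample. Take the "stringNum" column: the first sample has the original value 1, so its index is the hash of the string "stringNum=1", and the value at that index is 1.0. Likewise, the second sample has the original value 2, so its index is the hash of the string "stringNum=2". The "bool" column is handled in the same way as "stringNum" and "string".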
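
The mapping rules just described can be imitated in a few lines of plain Python. This is only a rough sketch of the behaviour quoted from the Spark documentation, not Spark's implementation: `bucket`, `hash_row` and `NUM_FEATURES` are names invented here, CRC32 stands in for Spark's own hash function (so the indices will not match Spark's, and 174475 will not be reproduced), and apart from `stringNum` being 1 and 2 the row values are made up for illustration.

```python
import zlib

NUM_FEATURES = 1 << 18  # assumed length of the output vector for this sketch


def bucket(key, num_features=NUM_FEATURES):
    # Stand-in hash: CRC32 of the UTF-8 bytes, reduced modulo the vector length
    return zlib.crc32(key.encode("utf-8")) % num_features


def hash_row(row, categorical_cols=()):
    """Map one row (dict of column -> value) to a sparse {index: value} vector."""
    vec = {}
    for col, value in row.items():
        is_numeric = isinstance(value, (int, float)) and not isinstance(value, bool)
        if is_numeric and col not in categorical_cols:
            # Numeric column: index from the column name, value kept as-is
            idx, val = bucket(col), float(value)
        else:
            # String / boolean column: index from "column=value", indicator 1.0
            text = str(value).lower() if isinstance(value, bool) else str(value)
            idx, val = bucket(f"{col}={text}"), 1.0
        # Features that land on the same index are summed
        vec[idx] = vec.get(idx, 0.0) + val
    return vec


# Illustrative rows; only the stringNum values 1 and 2 come from the text
rows = [
    {"real": 2.2, "bool": True, "stringNum": "1", "string": "foo"},
    {"real": 3.3, "bool": False, "stringNum": "2", "string": "bar"},
]
for r in rows:
    print(sorted(hash_row(r).items()))
```

Every row stores its "real" value at the same index, `bucket("real")`, while the categorical columns land on indices that depend on the value, which is the distinction the example draws.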

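Compared with one-hot encoding, feature hashing has the advantage that the length of the encoded vector is fixed in advance and can be kept lower, so it uses less memory and makes model training more efficient.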

So how are the indices in the example above, and the value at each index, determined?

Wikipedia explains it as follows:

Instead of maintaining a dictionary, a feature vectorizer that uses the hashing trick can build a vector of a pre-defined length by applying a hash function h to the features (e.g., words), then using the hash values directly as feature indices and updating the resulting vector at those indices. Here, we assume that feature actually means feature vector.

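With a vectorizer of this kind, a sample (a set of strings) can be converted into a vector of length N.
For each feature f, apply the hash function to obtain a value h, then take h modulo N as the index (the modulo guarantees that the index lies between 0 and N-1), and add 1 to the vector entry at that index.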
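For example, suppose the original feature vector is ["cat", "dog", "cat"]
and the hash function is
$hash(x_{f})=1$ if $x_{f}$ is "cat" and $hash(x_{f})=2$ if $x_{f}$ is "dog".
With an output vector of length 4, the resulting vector is [0,2,1,0].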
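
The procedure and the cat/dog example above can be written out as a short Python sketch. The names `hashing_vectorizer` and `toy_hash` are illustrative choices, and `toy_hash` simply hard-codes the hash values from the example; a real hash function would be passed in its place.

```python
def hashing_vectorizer(features, n, hash_fn):
    """Turn a list of string features into a count vector of length n."""
    x = [0] * n
    for f in features:
        h = hash_fn(f)   # raw hash value of the feature
        x[h % n] += 1    # the modulo keeps the index within 0..n-1
    return x


def toy_hash(f):
    # The toy hash from the example: "cat" -> 1, "dog" -> 2
    return {"cat": 1, "dog": 2}[f]


print(hashing_vectorizer(["cat", "dog", "cat"], 4, toy_hash))
# [0, 2, 1, 0]
```

For real data, a hash that is stable across runs (for instance `zlib.crc32` over the encoded bytes) is a safer choice than Python's built-in `hash()`, whose string hashes are salted per process by default.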
Different features may produce the same hash value h, which leads to hash collisions. To mitigate this problem, a second hash function $ξ$ is added, which acts as a sign function.

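The role of $ξ$ is to reduce the impact of hash collisions. It does not make collisions less likely, but because each feature is added with a sign of +1 or -1, colliding features tend to cancel out rather than accumulate. With $ξ$ in place, the expected value of each column of the newly generated features is 0.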
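
A signed variant of the vectorizer might look like the sketch below. It is an illustration under assumptions rather than a reference implementation: the two CRC32-based functions are crude stand-ins for the bucket hash h and the sign hash $ξ$ (a salted CRC32 is not a truly independent hash), and the toy functions exist only to make the printed results easy to check by hand.

```python
import zlib


def crc_bucket_hash(f):
    # Stand-in for the bucket hash h (an assumption of this sketch)
    return zlib.crc32(f.encode("utf-8"))


def crc_sign_hash(f):
    # Stand-in for xi: one bit of a salted CRC32, mapped to +1 or -1
    return 1 if zlib.crc32(("sign:" + f).encode("utf-8")) & 1 else -1


def signed_hashing_vectorizer(features, n, h=crc_bucket_hash, xi=crc_sign_hash):
    """Like the plain hashing vectorizer, but adds +1 or -1 depending on xi."""
    x = [0] * n
    for f in features:
        x[h(f) % n] += xi(f)
    return x


def toy_h(f):
    # Toy bucket hash: "cat" -> 1, "dog" -> 2
    return {"cat": 1, "dog": 2}[f]


def toy_xi(f):
    # Toy sign hash: "cat" -> +1, "dog" -> -1
    return {"cat": 1, "dog": -1}[f]


print(signed_hashing_vectorizer(["cat", "dog", "cat"], 4, toy_h, toy_xi))
# [0, 2, -1, 0]

# With a single bucket, "cat" and "dog" collide and their contributions cancel
print(signed_hashing_vectorizer(["cat", "dog"], 1, toy_h, toy_xi))
# [0]
```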
