Feature Engineering: Handling Categorical Variables

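Categorical variables are an important kind of feature in machine learning. A categorical variable takes one of a fixed number of possible values, and each value represents a group or category. Categorical variables differ from ordinal variables in that their categories have no natural order: the distance between any two categories is the same (or, put differently, no meaningful distance is defined). For example: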
Ordinal: low, medium, high
Categorical: Georgia, Alabama, South Carolina, … , New York

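Machine learning models take numeric input, so string values have to be converted to numbers.
This conversion can inflate the number of dimensions. If the encoded features have very high dimensionality, the model may run into the curse of dimensionality, so the encoding should be chosen to keep the dimensionality under control.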
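To make the dimensionality concern concrete, here is a minimal sketch of one-hot encoding in plain Python. The helper name `one_hot_encode` and the sample values are illustrative assumptions, not part of any library. Each distinct category becomes its own column, so a feature with thousands of categories produces thousands of columns.

```python
def one_hot_encode(values):
    """Return (categories, rows); each row is a one-hot list for one input value."""
    categories = sorted(set(values))                    # fixed, reproducible column order
    index = {c: i for i, c in enumerate(categories)}    # category -> column position
    rows = []
    for v in values:
        row = [0] * len(categories)                     # one column per distinct category
        row[index[v]] = 1
        rows.append(row)
    return categories, rows


states = ["Georgia", "Alabama", "New York", "Georgia"]
categories, encoded = one_hot_encode(states)
print(categories)  # ['Alabama', 'Georgia', 'New York']
print(encoded)     # [[0, 1, 0], [1, 0, 0], [0, 0, 1], [0, 1, 0]]
```

The number of columns equals the number of distinct categories, which is what drives the dimensionality problem for high-cardinality features.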

Working with several datasets, the author tried seven encoding methods in total:

Ordinal
One-Hot
Binary
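Binary encoding first applies ordinal encoding to the categories, then converts each ordinal code (a number) to its binary representation, and finally splits the binary digits into separate columns. For example, suppose an age-group feature has 4 values: "teen", "young adult", "middle-aged", and "senior". Ordinal encoding gives "teen" -> 0, "young adult" -> 1, "middle-aged" -> 2, "senior" -> 3. Binary encoding then gives "teen" -> 0 -> 00, "young adult" -> 1 -> 01, "middle-aged" -> 2 -> 10, "senior" -> 3 -> 11. Splitting the binary digits gives "teen" -> 0 -> 00 -> 0|0, "young adult" -> 1 -> 01 -> 0|1, "middle-aged" -> 2 -> 10 -> 1|0, "senior" -> 3 -> 11 -> 1|1.
The advantage of this encoding is that it adds fewer columns than one-hot encoding. In the example above, one-hot encoding would create 4 new feature columns, while binary encoding adds only 2.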
Sum
Polynomial
Backward Difference
Helmert
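The binary encoding steps described above (ordinal code, binary representation, one column per bit) can be sketched in plain Python as follows. The function name `binary_encode` and the English category labels are illustrative assumptions; real encoder libraries may assign the ordinal codes differently.

```python
def binary_encode(values, categories):
    """Ordinal-encode each value by its position in `categories`,
    then split the binary representation of that code into bit columns."""
    # Number of bits needed to represent the largest code (at least 1 column).
    n_bits = max(1, (len(categories) - 1).bit_length())
    ordinal = {c: i for i, c in enumerate(categories)}
    rows = []
    for v in values:
        code = ordinal[v]
        # Most significant bit first, e.g. code 2 with 2 bits -> [1, 0].
        bits = [(code >> shift) & 1 for shift in range(n_bits - 1, -1, -1)]
        rows.append(bits)
    return rows


age_groups = ["teen", "young adult", "middle-aged", "senior"]
encoded_ages = binary_encode(age_groups, age_groups)
print(encoded_ages)  # [[0, 0], [0, 1], [1, 0], [1, 1]]
```

With k categories this produces about log2(k) columns instead of the k columns that one-hot encoding would need.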

The author compared the performance of these encoding methods on several different datasets.

Overall, binary encoding performed best in that comparison.

Next we look at another commonly used encoding: feature hashing.
Wikipedia introduces feature hashing as follows:

In machine learning, feature hashing, also known as the hashing trick (by analogy to the kernel trick), is a fast and space-efficient way of vectorizing features, i.e. turning arbitrary features into indices in a vector or matrix. It works by applying a hash function to the features and using their hash values as indices directly, rather than looking the indices up in an associative array. This trick is often attributed to Weinberger et al., but there exists a much earlier description of this method published by John Moody in 1989.

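In short, feature hashing uses a hash function to map the multiple feature columns of a sample into a single vector. More precisely, for each value of each feature column, the hash function maps that value to a position in a fixed-length vector. Different feature types are mapped in different ways.
The Spark ML package describes it as follows: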

The FeatureHasher transformer operates on multiple columns. Each column may contain either numeric or categorical features. Behavior and handling of column data types is as follows:
- Numeric columns: For numeric features, the hash value of the column name is used to map the feature value to its index in the feature vector. By default, numeric features are not treated as categorical (even when they are integers). To treat them as categorical, specify the relevant columns in categoricalCols.
- String columns: For categorical features, the hash value of the string “column_name=value” is used to map to the vector index, with an indicator value of 1.0. Thus, categorical features are “one-hot” encoded (similarly to using OneHotEncoder with dropLast=false).
- Boolean columns: Boolean values are treated in the same way as string columns. That is, boolean features are represented as “column_name=true” or “column_name=false”, with an indicator value of 1.0.

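Consider an example with four feature columns: "real", "bool", "stringNum", and "string".

The first column of the original feature set, "real", is a numeric feature, so it is mapped as follows: the hash value of the column name "real" is used as the index into the target vector (the index is the same for every sample, 174475 in this example), and the value stored at position 174475 is the original value of "real". The "stringNum" and "string" columns are both categorical features, so the index into the target vector differs from sample to sample. Take the "stringNum" column: the first sample has the original value 1, so its index is the hash value of the string "stringNum=1"; likewise, the second sample has the original value 2, so its index is the hash value of the string "stringNum=2". The "bool" column is handled in the same way as "stringNum" and "string".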
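The mapping rules above can be imitated in a few lines of plain Python. This is only a sketch of the idea, not Spark's implementation: the helpers `stable_hash` and `hash_row` are hypothetical, MD5 is used as a stand-in hash function, the vector length is an illustrative choice, and the sample values for "real", "bool", and "string" are made up. The resulting indices therefore will not match the ones Spark produces (such as 174475).

```python
import hashlib

NUM_FEATURES = 262144  # illustrative vector length (2**18), an assumption


def stable_hash(text, num_features=NUM_FEATURES):
    """Map a string to an index in [0, num_features) using MD5 as a stand-in hash."""
    digest = hashlib.md5(text.encode("utf-8")).digest()
    return int.from_bytes(digest[:8], "big") % num_features


def hash_row(row, categorical_cols=(), num_features=NUM_FEATURES):
    """Hash one sample (dict of column -> value) into a sparse {index: value} dict."""
    vector = {}
    for col, value in row.items():
        is_bool = isinstance(value, bool)
        is_number = isinstance(value, (int, float)) and not is_bool
        if is_number and col not in categorical_cols:
            # Numeric column: index comes from the column name, value is kept.
            index = stable_hash(col, num_features)
            amount = float(value)
        else:
            # String or boolean column: index comes from "column=value", value is 1.0.
            text = str(value).lower() if is_bool else str(value)
            index = stable_hash(f"{col}={text}", num_features)
            amount = 1.0
        vector[index] = vector.get(index, 0.0) + amount
    return vector


row1 = {"real": 2.2, "bool": True, "stringNum": "1", "string": "foo"}
row2 = {"real": 3.3, "bool": False, "stringNum": "2", "string": "bar"}
v1, v2 = hash_row(row1), hash_row(row2)
print(v1)
print(v2)
```

Both samples share the index derived from "real", while the index for "stringNum" depends on the value in each sample, which mirrors the behavior described above.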

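Compared with one-hot encoding, feature hashing produces a vector whose length is fixed in advance and can be much smaller than the number of distinct categories, so it uses less memory and makes model training more efficient.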

So how are the indices in the example above, and the value at each index, determined?

Wikipedia explains it as follows:

Instead of maintaining a dictionary, a feature vectorizer that uses the hashing trick can build a vector of a pre-defined length by applying a hash function h to the features (e.g., words), then using the hash values directly as feature indices and updating the resulting vector at those indices. Here, we assume that feature actually means feature vector.

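With the hashing vectorizer described above, a sample (a list of strings) can be converted into a vector of length N.
For each feature f, the hash function produces a value h, and h modulo N is used as the index (the modulo guarantees that the index lies between 0 and N-1). The value of the vector at that index is incremented by 1.
For example, suppose the original feature vector is [“cat”,“dog”,“cat”]
and the hash function is
$hash(x_{f})=1$ if $x_{f}$ is “cat” and 2 if $x_{f}$ is “dog”
With an output vector of length 4, the resulting vector is [0,2,1,0].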
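The procedure and the worked example above can be sketched in Python as follows. The function names `hashing_vectorizer` and `toy_hash` are illustrative; `toy_hash` simply hard-codes the hash values assumed in the example, whereas a real implementation would use a general-purpose hash function.

```python
def hashing_vectorizer(features, n, hash_fn):
    """Build a length-n count vector by hashing each feature to an index."""
    x = [0] * n
    for f in features:
        h = hash_fn(f)
        x[h % n] += 1      # h mod n keeps the index within 0..n-1
    return x


def toy_hash(feature):
    """Toy hash from the example: "cat" -> 1, "dog" -> 2."""
    return {"cat": 1, "dog": 2}[feature]


hashed = hashing_vectorizer(["cat", "dog", "cat"], 4, toy_hash)
print(hashed)  # [0, 2, 1, 0]
```

"cat" appears twice and hashes to index 1, so position 1 holds 2; "dog" appears once and hashes to index 2, so position 2 holds 1.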
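Different features may hash to the same value h, which causes hash collisions. To mitigate this problem, a second hash function $ξ$ (a sign function that maps each feature to +1 or -1) is introduced, and the i-th component of the hashed vector becomes
$\phi_{i}(x)=\sum_{j:h(j)=i} ξ(j)\,x_{j}$
The $ξ$ function does not make collisions less likely; it reduces their impact, because features that collide at the same index can cancel out instead of always accumulating. With $ξ$ in place, each column of the newly generated feature vector has an expected value of 0.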
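The effect of the sign function can be illustrated with a small sketch. Everything here is a toy assumption: `toy_hash` deliberately makes "cat" and "tiger" collide at the same index, and `toy_sign` assigns them opposite signs so that the cancellation is visible. In practice both functions would be general-purpose hashes, and whether two colliding features get opposite signs is a matter of chance.

```python
def signed_hashing_vectorizer(features, n, hash_fn, sign_fn):
    """Like a hashing vectorizer, but each feature adds sign_fn(f) (+1 or -1)."""
    x = [0] * n
    for f in features:
        x[hash_fn(f) % n] += sign_fn(f)
    return x


def toy_hash(feature):
    """Toy hash in which "cat" and "tiger" collide at index 1."""
    return {"cat": 1, "tiger": 1, "dog": 2}[feature]


def toy_sign(feature):
    """Toy sign function: the colliding features get opposite signs."""
    return {"cat": 1, "tiger": -1, "dog": 1}[feature]


sample = ["cat", "tiger", "dog"]
unsigned = signed_hashing_vectorizer(sample, 4, toy_hash, lambda f: 1)
signed = signed_hashing_vectorizer(sample, 4, toy_hash, toy_sign)
print(unsigned)  # [0, 2, 1, 0]
print(signed)    # [0, 0, 1, 0]
```

Without signs, the collision inflates index 1 to 2. With signs, the two colliding contributions cancel, which is why collision errors average out to zero when the signs are assigned at random.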
