Why LGBM can take categorical features directly, without one-hot encoding

How the official LGBM documentation explains categorical feature handling

Optimal Split for Categorical Features

It is common to represent categorical features with one-hot encoding, but this approach is suboptimal for tree learners. Particularly for high-cardinality categorical features, a tree built on one-hot features tends to be unbalanced and needs to grow very deep to achieve good accuracy.

Instead of one-hot encoding, the optimal solution is to split on a categorical feature by partitioning its categories into 2 subsets. If the feature has k categories, there are 2^(k-1) - 1 possible partitions. But there is an efficient solution for regression trees[8]. It needs about O(k * log(k)) to find the optimal partition.

The basic idea is to sort the categories according to the training objective at each split. More specifically, LightGBM sorts the histogram (for a categorical feature) according to its accumulated values (sum_gradient / sum_hessian) and then finds the best split on the sorted histogram.
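The sorting step described above can be sketched in plain Python. This is a minimal illustration, not LightGBM's actual implementation: the function name `best_categorical_split`, the `reg_lambda` parameter, and the gain formula G^2 / (H + lambda) (the usual second-order gain in gradient boosting) are assumptions made for this example, and it leaves out any smoothing or other safeguards a production library may apply.

```python
from collections import defaultdict


def best_categorical_split(categories, gradients, hessians, reg_lambda=1.0):
    """Return (left_categories, gain) for the best split on the sorted histogram.

    Hypothetical helper for illustration only; assumes every category has a
    positive hessian sum so the ratio below is well defined.
    """
    # Build the per-category histogram: accumulated gradient and hessian.
    sum_g = defaultdict(float)
    sum_h = defaultdict(float)
    for c, g, h in zip(categories, gradients, hessians):
        sum_g[c] += g
        sum_h[c] += h

    # Sort the categories by sum_gradient / sum_hessian.
    order = sorted(sum_g, key=lambda c: sum_g[c] / sum_h[c])

    total_g = sum(sum_g.values())
    total_h = sum(sum_h.values())

    def score(g, h):
        # Assumed leaf score: G^2 / (H + lambda).
        return g * g / (h + reg_lambda)

    parent = score(total_g, total_h)
    best_gain, best_left = float("-inf"), None
    left_g = left_h = 0.0

    # Scan the k - 1 cut points of the sorted order, as for a numeric feature.
    for i, c in enumerate(order[:-1]):
        left_g += sum_g[c]
        left_h += sum_h[c]
        gain = (score(left_g, left_h)
                + score(total_g - left_g, total_h - left_h)
                - parent)
        if gain > best_gain:
            best_gain, best_left = gain, set(order[: i + 1])
    return best_left, best_gain
```

The sort costs O(k log k) and the scan over cut points is linear in k, which is where the complexity quoted above comes from.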

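Putting together explanations from around the web

As an example, suppose we have a color feature, and each sample takes one of four categories {red, yellow, blue, green}. Let us compare how one-hot encoding and LGBM handle it and see where they actually differ: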

One-hot encoding

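This encoding is very common: the single color feature is expanded into four features, one each for red, yellow, blue, and green (for every sample exactly one of them is 1 and the rest are 0).
When the decision tree splits a node, it picks only one of these dimensions. If it picks yellow, for example, the split condition is simply whether a sample is yellow or not.
Looking back at the original color feature, this amounts to a one-vs-rest split, so the original feature offers only four candidate splits at a node ($k$ categories give $k$ candidate splits).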
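To make the one-vs-rest structure concrete, here is a small sketch in plain Python. The helper names `one_hot` and `one_hot_candidate_splits` are made up for this illustration and do not come from any library.

```python
def one_hot(values, categories):
    """Encode each value as a 0/1 vector with one column per category."""
    return [[1 if v == c else 0 for c in categories] for v in values]


def one_hot_candidate_splits(categories):
    """List the splits a tree can make on one-hot columns.

    Splitting on column j sends category j to one side and every other
    category to the other side, so there are exactly k candidates.
    """
    return [({c}, set(categories) - {c}) for c in categories]


colors = ["red", "yellow", "blue", "green"]
encoded = one_hot(["red", "blue"], colors)
splits = one_hot_candidate_splits(colors)
```

A grouping such as {red, yellow} versus {blue, green} is not among these candidates; a tree on one-hot features can only reach it by stacking several one-vs-rest splits.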

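LGBM node splitting

LGBM's node-splitting strategy is conceptually simple: consider every way of dividing the samples of the four categories red, yellow, blue, and green into two groups, for example red and yellow in one group, blue and green in the other. This gives $2^{(k - 1)} - 1$ candidate splits, which lets the tree make full use of the information in the feature and find the optimal split.
Searching all of these exhaustively is expensive, however. The paper "On Grouping for Maximum Homogeneity" describes an efficient method for regression trees that finds the optimal split in roughly $O(k \log k)$ time.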
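The two search strategies can be compared on a toy regression problem. The sketch below is illustrative only: it uses the sum of squared errors as the split cost, sorts categories by their mean target, and all function names and data are invented for the example.

```python
from itertools import combinations


def sse(values):
    """Sum of squared errors around the mean (0 for an empty group)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def split_cost(groups, left):
    """Total SSE when categories in `left` go left and the rest go right."""
    left_vals = [v for c, vals in groups.items() if c in left for v in vals]
    right_vals = [v for c, vals in groups.items() if c not in left for v in vals]
    return sse(left_vals) + sse(right_vals)


def brute_force_split(groups):
    """Try all 2^(k-1) - 1 two-way partitions of the categories."""
    cats = sorted(groups)
    first, rest = cats[0], cats[1:]
    candidates = []
    # Pin the first category to the left side so each partition is counted
    # once; r stops at k - 2 so the right side is never empty.
    for r in range(len(rest)):
        for extra in combinations(rest, r):
            candidates.append({first, *extra})
    best = min(candidates, key=lambda left: split_cost(groups, left))
    return best, len(candidates)


def sorted_split(groups):
    """Sort categories by mean target, then try only the k - 1 cut points."""
    order = sorted(groups, key=lambda c: sum(groups[c]) / len(groups[c]))
    candidates = [set(order[:i]) for i in range(1, len(order))]
    best = min(candidates, key=lambda left: split_cost(groups, left))
    return best, len(candidates)


# Toy targets per color (made-up numbers).
groups = {
    "red": [1.0, 1.2],
    "yellow": [5.0, 5.4],
    "blue": [1.1, 0.9],
    "green": [4.8, 5.2],
}
brute_left, n_brute = brute_force_split(groups)
fast_left, n_fast = sorted_split(groups)
```

With four categories the exhaustive search evaluates 7 partitions and the sorted search evaluates 3, and on this data both arrive at a split with the same cost.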
