Why LGBM can take categorical features directly, with no need for one-hot encoding

How the official LGBM documentation explains its handling of categorical features

Optimal Split for Categorical Features

It is common to represent categorical features with one-hot encoding, but this approach is suboptimal for tree learners. Particularly for high-cardinality categorical features, a tree built on one-hot features tends to be unbalanced and needs to grow very deep to achieve good accuracy.

Instead of one-hot encoding, the optimal solution is to split on a categorical feature by partitioning its categories into 2 subsets. If the feature has k categories, there are 2^(k-1) - 1 possible partitions. But there is an efficient solution for regression trees[8]. It needs about O(k * log(k)) to find the optimal partition.

The basic idea is to sort the categories according to the training objective at each split. More specifically, LightGBM sorts the histogram (for a categorical feature) according to its accumulated values (sum_gradient / sum_hessian) and then finds the best split on the sorted histogram.
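The sort-then-scan idea in the paragraph above can be sketched in plain Python. This is an illustrative sketch, not LightGBM's actual implementation: the function name `best_categorical_split`, the `reg_lambda` parameter, and the gain formula (the usual second-order score G^2 / (H + lambda)) are assumptions made for the example, and any smoothing or other safeguards the library applies are omitted.

```python
from collections import defaultdict


def best_categorical_split(categories, gradients, hessians, reg_lambda=1.0):
    """Sort categories by sum_gradient / sum_hessian, then scan the sorted order.

    Hypothetical helper for illustration; assumes every category has a
    positive hessian sum.
    """
    # Per-category histogram: accumulated gradient and hessian.
    sum_g = defaultdict(float)
    sum_h = defaultdict(float)
    for c, g, h in zip(categories, gradients, hessians):
        sum_g[c] += g
        sum_h[c] += h

    # Order the categories by their accumulated ratio.
    order = sorted(sum_g, key=lambda c: sum_g[c] / sum_h[c])

    total_g = sum(sum_g.values())
    total_h = sum(sum_h.values())

    def score(g, h):
        # Assumed second-order leaf score with an L2 term reg_lambda.
        return g * g / (h + reg_lambda)

    best_gain, best_left = float("-inf"), None
    left_g = left_h = 0.0
    # Only the k - 1 prefix cuts of the sorted order are candidates.
    for i in range(len(order) - 1):
        c = order[i]
        left_g += sum_g[c]
        left_h += sum_h[c]
        gain = (
            score(left_g, left_h)
            + score(total_g - left_g, total_h - left_h)
            - score(total_g, total_h)
        )
        if gain > best_gain:
            best_gain, best_left = gain, set(order[: i + 1])
    return best_left, best_gain


# Toy data: squared-error loss with an initial prediction of 0,
# so gradient = -target and hessian = 1 for every sample.
colors = ["red", "red", "yellow", "yellow", "blue", "blue", "green", "green"]
targets = [1.0, 1.2, 0.9, 1.1, 3.0, 3.2, 2.9, 3.1]
grads = [-t for t in targets]
hess = [1.0] * len(targets)

left, gain = best_categorical_split(colors, grads, hess)
print(sorted(left), gain)
```

On this toy data the scan groups blue with green on one side and red with yellow on the other, a split that no single one-hot column could express.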

A summary of explanations found across the web

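As an example, suppose I have a color feature, and each sample's color is one of the four categories {red, yellow, blue, green}. Let's compare how one-hot encoding and LGBM handle it and see what actually differs: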

One-hot encoding

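This encoding is very common: it turns the single color feature into four features, one each for red, yellow, blue, and green (for each sample exactly one of them is 1 and the others are 0).
When the decision tree splits a node, it picks only one of these four features. If it picks yellow, for instance, the split condition is simply whether a sample is yellow.
Looking back at the original color feature, this is just a one-vs-rest split, so the original feature offers only four candidate splits ($k$ categories give $k$ candidate splits).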
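The one-vs-rest behaviour described above can be made concrete with a few lines of Python. This is a minimal sketch; the helper names `one_hot` and `one_vs_rest_splits` are made up for the illustration and do not come from any library.

```python
colors = ["red", "yellow", "blue", "green"]


def one_hot(value, categories):
    """Return a 0/1 indicator vector with a single 1 at the category's position."""
    return [1 if value == c else 0 for c in categories]


def one_vs_rest_splits(categories):
    """Each one-hot column yields exactly one split: {category} vs. all the others."""
    return [({c}, set(categories) - {c}) for c in categories]


print(one_hot("yellow", colors))
for left, right in one_vs_rest_splits(colors):
    print(sorted(left), "|", sorted(right))
```

With four colors the loop prints four splits, one per one-hot column, and none of them can place two colors together on the smaller side.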

LGBM node splitting

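LGBM's node-splitting strategy is actually simple: consider every possible way of dividing the four groups of samples (red, yellow, blue, green) into two sets, for example red and yellow in one set and blue and green in the other. That gives $2^{(k - 1)} - 1$ candidate splits, which is what lets it fully exploit the information in the feature and find the optimal split.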
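The count can be checked with a short counting argument (added here as a worked example, not taken from the original post). Each of the $k$ categories goes to either the left or the right child, which gives $2^k$ assignments. The two assignments that leave one child empty are not splits, and swapping left and right yields the same partition, so the remainder is halved:

$$
\frac{2^{k} - 2}{2} = 2^{k-1} - 1, \qquad \text{and for } k = 4: \quad \frac{2^{4} - 2}{2} = 7 .
$$

For the four colors, these 7 partitions are the 4 one-vs-rest splits plus the 3 two-vs-two splits (red+yellow, red+blue, red+green, each against the remaining two colors).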
But searching for the optimal split this way is very expensive. The paper "On Grouping for Maximum Homogeneity" describes an efficient method for regression trees that finds the optimal split in roughly $O(k \log k)$ time.
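To see why sorting helps, the sketch below compares an exhaustive search over all $2^{(k - 1)} - 1$ partitions with a scan over the $k - 1$ cuts of the categories sorted by mean target. It assumes a squared-error regression objective. The function names and the toy data are invented for the illustration, and the code demonstrates the idea on one small example without proving it in general.

```python
from itertools import combinations


def sse(values):
    """Sum of squared errors around the mean (0 for an empty list)."""
    if not values:
        return 0.0
    mean = sum(values) / len(values)
    return sum((v - mean) ** 2 for v in values)


def split_cost(left_cats, data):
    """Total squared error after sending `left_cats` left and the rest right."""
    left = [y for c, ys in data.items() if c in left_cats for y in ys]
    right = [y for c, ys in data.items() if c not in left_cats for y in ys]
    return sse(left) + sse(right)


def brute_force(data):
    """Try every 2-way partition: 2^(k-1) - 1 candidates."""
    cats = sorted(data)
    first, rest = cats[0], cats[1:]
    candidates = []
    # Pin the first category on the left so mirror images are not counted twice;
    # stopping before len(rest) excludes the partition with an empty right side.
    for r in range(len(rest)):
        for combo in combinations(rest, r):
            candidates.append({first, *combo})
    best = min(candidates, key=lambda s: split_cost(s, data))
    return best, len(candidates)


def sorted_scan(data):
    """Sort categories by mean target, then try only the k - 1 prefix cuts."""
    order = sorted(data, key=lambda c: sum(data[c]) / len(data[c]))
    candidates = [set(order[: i + 1]) for i in range(len(order) - 1)]
    best = min(candidates, key=lambda s: split_cost(s, data))
    return best, len(candidates)


# Toy regression targets per color (hypothetical values).
data = {
    "red": [1.0, 1.2],
    "yellow": [0.9, 1.1],
    "blue": [3.0, 3.2],
    "green": [2.9, 3.1],
}

bf_best, bf_tried = brute_force(data)
ss_best, ss_tried = sorted_scan(data)
print(bf_tried, sorted(bf_best), split_cost(bf_best, data))
print(ss_tried, sorted(ss_best), split_cost(ss_best, data))
```

Both searches arrive at the same partition (blue and green against red and yellow, possibly reported from opposite sides), but the sorted scan evaluates 3 candidates where the exhaustive search evaluates 7. The gap widens quickly as $k$ grows, since sorting dominates the cost of the scan.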
