How to Choose a Machine Learning Algorithm

Original link: http://www.52ml.net/15063.html

How do you know what machine learning algorithm to choose for your classification problem? Of course, if you really care about accuracy, your best bet is to test out a couple different ones (making sure to try different parameters within each algorithm as well), and select the best one by cross-validation. But if you’re simply looking for a “good enough” algorithm for your problem, or a place to start, here are some general guidelines I’ve found to work well over the years.
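
To make that concrete, here’s a minimal sketch of such a cross-validation bake-off using scikit-learn; the synthetic dataset and the particular candidates are placeholders, and in practice you’d also grid-search each model’s parameters:

```python
# A minimal sketch of the cross-validation bake-off described above.
# The dataset is synthetic and the candidate models are placeholders.
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC

X, y = make_classification(n_samples=1000, n_features=20, random_state=0)

candidates = {
    "naive_bayes": GaussianNB(),
    "logistic_regression": LogisticRegression(max_iter=1000),
    "random_forest": RandomForestClassifier(n_estimators=200, random_state=0),
    "svm_rbf": SVC(kernel="rbf"),
}

# Pick the model with the best mean accuracy under 5-fold cross-validation.
scores = {name: cross_val_score(model, X, y, cv=5).mean()
          for name, model in candidates.items()}
best = max(scores, key=scores.get)
print(scores, "-> best:", best)
```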

How large is your training set?

If your training set is small, high bias/low variance classifiers (e.g., Naive Bayes) have an advantage over low bias/high variance classifiers (e.g., kNN), since the latter will overfit. But low bias/high variance classifiers start to win out as your training set grows (they have lower asymptotic error), since high bias classifiers aren’t powerful enough to provide accurate models.
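
One way to see this crossover is to compare learning curves for a high bias/low variance model and a low bias/high variance one as the training set grows. Here’s a sketch using scikit-learn’s learning_curve on synthetic data; where exactly the curves cross depends entirely on your problem:

```python
# Sketch: compare a high-bias/low-variance model (Naive Bayes) against a
# low-bias/high-variance one (kNN) at increasing training-set sizes.
# Synthetic data; the crossover point varies from problem to problem.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import learning_curve
from sklearn.naive_bayes import GaussianNB
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=2000, n_features=20, random_state=0)
sizes = np.linspace(0.05, 1.0, 8)  # fractions of the training fold

for name, model in [("NB", GaussianNB()), ("kNN", KNeighborsClassifier())]:
    n, _, test_scores = learning_curve(model, X, y, train_sizes=sizes, cv=5)
    for ni, s in zip(n, test_scores.mean(axis=1)):
        print(f"{name} n={ni:5d} cv_accuracy={s:.3f}")
```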

You can also think of this as a generative model vs. discriminative model distinction.

Advantages of some particular algorithms

Advantages of Naive Bayes: Super simple, you’re just doing a bunch of counts. If the NB conditional independence assumption actually holds, a Naive Bayes classifier will converge quicker than discriminative models like logistic regression, so you need less training data. And even if the NB assumption doesn’t hold, a NB classifier still often does a great job in practice. A good bet if you want something fast and easy that performs pretty well. Its main disadvantage is that it can’t learn interactions between features (e.g., it can’t learn that although you love movies with Brad Pitt and Tom Cruise, you hate movies where they’re together).
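
To illustrate the “bunch of counts” point, here’s a toy sketch: word counts from a made-up four-document corpus feeding scikit-learn’s MultinomialNB:

```python
# Sketch: Naive Bayes really is "a bunch of counts" -- here, word counts
# feed MultinomialNB. The tiny corpus is made up purely for illustration.
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

docs = ["great movie loved it", "terrible movie hated it",
        "loved the acting", "hated the plot"]
labels = [1, 0, 1, 0]  # 1 = positive, 0 = negative

clf = make_pipeline(CountVectorizer(), MultinomialNB())
clf.fit(docs, labels)
print(clf.predict(["loved the movie"]))  # -> [1]
```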

Advantages of Logistic Regression: Lots of ways to regularize your model, and you don’t have to worry as much about your features being correlated, like you do in Naive Bayes. You also have a nice probabilistic interpretation, unlike decision trees or SVMs, and you can easily update your model to take in new data (using an online gradient descent method), again unlike decision trees or SVMs. Use it if you want a probabilistic framework (e.g., to easily adjust classification thresholds, to say when you’re unsure, or to get confidence intervals) or if you expect to receive more training data in the future that you want to be able to quickly incorporate into your model.
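
Here’s a sketch of those last two points: scikit-learn’s SGDClassifier with logistic loss fits a logistic regression by online gradient descent, so you can fold in new data with partial_fit and move the classification threshold using predicted probabilities. The data and the batch split are arbitrary, and note the loss is named "log_loss" in recent scikit-learn versions ("log" in older ones):

```python
# Sketch: logistic regression with probabilistic output and online
# updates via SGD. Synthetic data; the 800/200 split is arbitrary.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.linear_model import SGDClassifier

X, y = make_classification(n_samples=1000, random_state=0)
clf = SGDClassifier(loss="log_loss", random_state=0)  # "log" in older sklearn

# Initial fit on the first batch; classes must be declared up front.
clf.partial_fit(X[:800], y[:800], classes=np.array([0, 1]))

# New data arrives later: update the model without retraining from scratch.
clf.partial_fit(X[800:], y[800:])

# Probabilities let you move the decision threshold away from 0.5.
probs = clf.predict_proba(X[:5])[:, 1]
print((probs > 0.7).astype(int))  # stricter threshold for the positive class
```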

Advantages of Decision Trees: Easy to interpret and explain (for some people – I’m not sure I fall into this camp). They easily handle feature interactions and they’re non-parametric, so you don’t have to worry about outliers or whether the data is linearly separable (e.g., decision trees easily take care of cases where you have class A at the low end of some feature x, class B in the mid-range of feature x, and A again at the high end). One disadvantage is that they don’t support online learning, so you have to rebuild your tree when new examples come in. Another disadvantage is that they easily overfit, but that’s where ensemble methods like random forests (or boosted trees) come in. Plus, random forests are often the winner for lots of problems in classification (usually slightly ahead of SVMs, I believe), they’re fast and scalable, and you don’t have to worry about tuning a bunch of parameters like you do with SVMs, so they seem to be quite popular these days.
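
Here’s the A-at-both-ends example in code: a one-feature synthetic dataset where class B occupies only the mid-range of x, which no single linear threshold can separate but a random forest handles without fuss:

```python
# Sketch of the A/B/A pattern above: class A at both ends of feature x,
# class B in the middle. A tree simply splits x twice; a linear model
# with one threshold cannot. Synthetic one-feature data for illustration.
import numpy as np
from sklearn.ensemble import RandomForestClassifier

rng = np.random.default_rng(0)
x = rng.uniform(0, 3, size=(600, 1))
y = ((x[:, 0] > 1) & (x[:, 0] < 2)).astype(int)  # B=1 only in the mid-range

forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(x, y)
print(forest.predict([[0.5], [1.5], [2.5]]))  # -> [0 1 0]: A, B, A
```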

Advantages of SVMs: High accuracy, nice theoretical guarantees regarding overfitting, and with an appropriate kernel they can work well even if your data isn’t linearly separable in the base feature space. Especially popular in text classification problems where very high-dimensional spaces are the norm. Memory-intensive, hard to interpret, and kind of annoying to run and tune, though, so I think random forests are starting to steal the crown.
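
A sketch of the kernel point, using scikit-learn’s concentric-circles toy dataset, which no linear boundary in the base feature space can separate:

```python
# Sketch: data that is not linearly separable in the base feature space
# (two concentric circles), handled well by an SVM with an RBF kernel.
from sklearn.datasets import make_circles
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

X, y = make_circles(n_samples=500, factor=0.3, noise=0.05, random_state=0)

linear = cross_val_score(SVC(kernel="linear"), X, y, cv=5).mean()
rbf = cross_val_score(SVC(kernel="rbf"), X, y, cv=5).mean()
print(f"linear kernel: {linear:.2f}, RBF kernel: {rbf:.2f}")  # RBF wins
```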

But…

Recall, though, that better data often beats better algorithms, and designing good features goes a long way. And if you have a huge dataset, then whichever classification algorithm you use might not matter so much in terms of classification performance (so choose your algorithm based on speed or ease of use instead).

And to reiterate what I said above, if you really care about accuracy, you should definitely try a bunch of different classifiers and select the best one by cross-validation. Or, to take a lesson from the Netflix Prize (and Middle Earth), just use an ensemble method to choose them all.
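
One simple version of “use an ensemble to choose them all” is majority voting over the classifiers discussed above; scikit-learn’s VotingClassifier, used in this sketch, is just one of many possible ensembling schemes:

```python
# Sketch: instead of picking one winner, combine several candidates with
# a simple voting ensemble. Synthetic data; models are placeholders.
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score
from sklearn.naive_bayes import GaussianNB

X, y = make_classification(n_samples=1000, random_state=0)
ensemble = VotingClassifier(estimators=[
    ("nb", GaussianNB()),
    ("lr", LogisticRegression(max_iter=1000)),
    ("rf", RandomForestClassifier(n_estimators=200, random_state=0)),
])
print(cross_val_score(ensemble, X, y, cv=5).mean())
```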
