I've been studying the classification algorithms used in data mining recently, and took the opportunity to summarize the strengths and weaknesses of the common ones.
Decision Trees
A heuristic method whose core idea is to select the splitting feature at each node with a criterion such as information gain, and then build the tree recursively.
Pros:
1. Low computational complexity; easy to understand and interpret, since the meaning expressed by the tree can be read off directly;
2. Data preprocessing is simple, and missing values can be handled;
3. Handles both numerical and categorical attributes, and can build trees over datasets with many attributes, whereas many other techniques require attributes of a single type;
4. A white-box model: given an observation, the corresponding logical rule can be read straight off the tree;
5. Produces feasible and reasonably accurate classifications on large datasets in relatively little time.
Cons:
1. On data where the classes have unequal sample counts, information gain is biased toward attributes with more distinct values;
2. Sensitive to noisy data;
3. Prone to overfitting;
4. Ignores correlations between the attributes in the dataset.
Sample dataset it can handle: the Soybean dataset
diaporthe-stem-canker,6,0,2,1,0,1,1,1,0,0,1,1,0,2,2,0,0,0,1,1,3,1,1,1,0,0,0,0,4,0,0,0,0,0,0
diaporthe-stem-canker,4,0,2,1,0,2,0,2,1,1,1,1,0,2,2,0,0,0,1,0,3,1,1,1,0,0,0,0,4,0,0,0,0,0,0
diaporthe-stem-canker,3,0,2,1,0,1,0,2,1,2,1,1,0,2,2,0,0,0,1,0,3,0,1,1,0,0,0,0,4,0,0,0,0,0,0
diaporthe-stem-canker,4,0,2,1,0,2,0,2,0,2,1,1,0,2,2,0,0,0,1,0,3,1,1,1,0,0,0,0,4,0,0,0,0,0,0
charcoal-rot,6,0,0,2,0,1,3,1,1,0,1,1,0,2,2,0,0,0,1,0,0,3,0,0,0,2,1,0,4,0,0,0,0,0,0
charcoal-rot,4,0,0,1,1,1,3,1,1,1,1,1,0,2,2,0,0,0,1,1,0,3,0,0,0,2,1,0,4,0,0,0,0,0,0
charcoal-rot,3,0,0,1,0,1,2,1,0,0,1,1,0,2,2,0,0,0,1,0,0,3,0,0,0,2,1,0,4,0,0,0,0,0,0
charcoal-rot,5,0,0,2,1,2,2,1,0,2,1,1,0,2,2,0,0,0,1,0,0,3,0,0,0,2,1,0,4,0,0,0,0,0,0
rhizoctonia-root-rot,1,1,2,0,0,2,1,2,0,2,1,0,0,2,2,0,0,0,1,0,1,1,0,1,1,0,0,3,4,0,0,0,0,0,0
rhizoctonia-root-rot,1,1,2,0,0,1,1,2,0,1,1,0,0,2,2,0,0,0,1,0,1,1,0,1,0,0,0,3,4,0,0,0,0,0,0
rhizoctonia-root-rot,2,1,2,0,0,2,1,1,0,1,1,0,0,2,2,0,0,0,1,0,1,1,0,1,0,0,0,3,4,0,0,0,0,0,0
rhizoctonia-root-rot,2,1,2,0,0,1,1,2,0,2,1,0,0,2,2,0,0,0,1,0,1,1,0,1,0,0,0,3,4,0,0,0,0,0,0
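To make the splitting criterion concrete, here is a minimal sketch (not from the original post) of the information-gain computation a decision-tree learner applies at each node. The four rows and their attribute values are invented, loosely in the spirit of the Soybean data above:

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    total = len(labels)
    return -sum((c / total) * math.log2(c / total)
                for c in Counter(labels).values())

def information_gain(rows, labels, attr_index):
    """Entropy reduction obtained by splitting on attribute attr_index."""
    base = entropy(labels)
    # Partition the label list by the attribute's value.
    partitions = {}
    for row, label in zip(rows, labels):
        partitions.setdefault(row[attr_index], []).append(label)
    remainder = sum(len(part) / len(labels) * entropy(part)
                    for part in partitions.values())
    return base - remainder

# Invented toy data: 3 attributes, 2 classes.
rows = [(6, 0, 2), (4, 0, 2), (6, 1, 0), (4, 1, 0)]
labels = ["canker", "canker", "rot", "rot"]
print(information_gain(rows, labels, 1))  # attribute 1 separates the classes perfectly -> 1.0
print(information_gain(rows, labels, 0))  # attribute 0 is uninformative -> 0.0
```

The node would split on attribute 1 here, since it yields the largest gain.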
KNN
A lazy classification method: find the k training objects closest to the test object, take the majority class among those k, and assign it to the test object.
Pros:
1. Simple and effective, easy to understand and implement;
2. Low retraining cost when the class taxonomy or the training set changes;
3. Time and space costs are linear in the size of the training set;
4. As k and the sample size grow, the error rate converges toward the Bayes error rate, so KNN can serve as an approximation to the Bayes-optimal classifier;
5. Well suited to multi-modal and multi-label classification problems;
6. Works well on sample sets whose class regions intersect or overlap heavily.
Cons:
1. A lazy-learning method, so it is slower than eager learners at prediction time;
2. Computationally expensive; the training samples often need to be edited (pruned);
3. Performs poorly on class-imbalanced datasets; weighted voting can mitigate this;
4. The choice of k strongly affects accuracy: a small k is sensitive to noise, so the best k has to be estimated.
Sample dataset it can handle: the Iris dataset
4.6,3.2,1.4,0.2,Iris-setosa
5.3,3.7,1.5,0.2,Iris-setosa
5.0,3.3,1.4,0.2,Iris-setosa
7.0,3.2,4.7,1.4,Iris-versicolor
6.4,3.2,4.5,1.5,Iris-versicolor
6.9,3.1,4.9,1.5,Iris-versicolor
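The nearest-neighbour vote described above can be sketched in a few lines of plain Python. The six training rows are the Iris samples listed above; the query points are invented:

```python
import math
from collections import Counter

def knn_predict(train, query, k=3):
    """Classify query by majority vote among the k nearest training points."""
    nearest = sorted(train, key=lambda p: math.dist(p[0], query))[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]

# Iris rows from above: (sepal length, sepal width, petal length, petal width).
train = [
    ((4.6, 3.2, 1.4, 0.2), "Iris-setosa"),
    ((5.3, 3.7, 1.5, 0.2), "Iris-setosa"),
    ((5.0, 3.3, 1.4, 0.2), "Iris-setosa"),
    ((7.0, 3.2, 4.7, 1.4), "Iris-versicolor"),
    ((6.4, 3.2, 4.5, 1.5), "Iris-versicolor"),
    ((6.9, 3.1, 4.9, 1.5), "Iris-versicolor"),
]
print(knn_predict(train, (5.1, 3.4, 1.5, 0.2)))  # close to the setosa cluster
```

Note the laziness: all the work happens at query time, which is exactly why prediction is slow on large training sets.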
Naive Bayes
Pros:
1. Solid mathematical foundation, stable classification performance, easy to interpret;
2. Few parameters to estimate; not very sensitive to missing data;
3. No complicated iterative solver required, so it suits very large datasets.
(Why it often works: datasets usually go through attribute selection first, which improves the independence between attributes, and naive Bayes can still produce fairly complex nonlinear decision surfaces.)
Cons:
1. The attribute-independence assumption rarely holds in practice (one workaround is to cluster strongly correlated attributes first);
2. Requires prior probabilities, and the classification decision carries an inherent error rate.
Sample dataset it can handle: the Breast Cancer dataset
858477,B,8.618,11.79,54.34,224.5,0.09752,0.05272,0.02061,0.007799,0.1683,0.07187,0.1559,0.5796,1.046,8.322,0.01011,0.01055,0.01981,0.005742,0.0209,0.002788,9.507,15.4,59.9,274.9,0.1733,0.1239,0.1168,0.04419,0.322,0.09026
858970,B,10.17,14.88,64.55,311.9,0.1134,0.08061,0.01084,0.0129,0.2743,0.0696,0.5158,1.441,3.312,34.62,0.007514,0.01099,0.007665,0.008193,0.04183,0.005953,11.02,17.45,69.86,368.6,0.1275,0.09866,0.02168,0.02579,0.3557,0.0802
858981,B,8.598,20.98,54.66,221.8,0.1243,0.08963,0.03,0.009259,0.1828,0.06757,0.3582,2.067,2.493,18.39,0.01193,0.03162,0.03,0.009259,0.03357,0.003048,9.565,27.04,62.06,273.9,0.1639,0.1698,0.09001,0.02778,0.2972,0.07712
858986,M,14.25,22.15,96.42,645.7,0.1049,0.2008,0.2135,0.08653,0.1949,0.07292,0.7036,1.268,5.373,60.78,0.009407,0.07056,0.06899,0.01848,0.017,0.006113,17.67,29.51,119.1,959.5,0.164,0.6247,0.6922,0.1785,0.2844,0.1132
859196,B,9.173,13.86,59.2,260.9,0.07721,0.08751,0.05988,0.0218,0.2341,0.06963,0.4098,2.265,2.608,23.52,0.008738,0.03938,0.04312,0.0156,0.04192,0.005822,10.01,19.23,65.59,310.1,0.09836,0.1678,0.1397,0.05087,0.3282,0.0849
85922302,M,12.68,23.84,82.69,499,0.1122,0.1262,0.1128,0.06873,0.1905,0.0659,0.4255,1.178,2.927,36.46,0.007781,0.02648,0.02973,0.0129,0.01635,0.003601,17.09,33.47,111.8,888.3,0.1851,0.4061,0.4024,0.1716,0.3383,0.1031
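As an illustration of the independence assumption at work, here is a rough sketch of a Gaussian naive Bayes classifier over two hand-picked columns (mean radius, mean texture) of the rows above. The query points are invented, and a real implementation would of course use all 30 features:

```python
import math
from collections import defaultdict

def fit_gaussian_nb(X, y):
    """Estimate a class prior plus a per-feature mean/variance for each class."""
    by_class = defaultdict(list)
    for row, label in zip(X, y):
        by_class[label].append(row)
    model = {}
    for label, rows in by_class.items():
        stats = []
        for col in zip(*rows):
            mean = sum(col) / len(col)
            var = max(sum((v - mean) ** 2 for v in col) / len(col), 1e-9)
            stats.append((mean, var))
        model[label] = (len(rows) / len(X), stats)
    return model

def nb_predict(model, x):
    """Pick the class maximizing log prior + sum of per-feature log likelihoods
    (the sum is exactly where the independence assumption enters)."""
    def log_gauss(v, mean, var):
        return -0.5 * math.log(2 * math.pi * var) - (v - mean) ** 2 / (2 * var)
    scores = {label: math.log(prior) + sum(log_gauss(v, m, s2)
                                           for v, (m, s2) in zip(x, stats))
              for label, (prior, stats) in model.items()}
    return max(scores, key=scores.get)

# (mean radius, mean texture) taken from the six rows above.
X = [(8.618, 11.79), (10.17, 14.88), (8.598, 20.98),
     (14.25, 22.15), (12.68, 23.84), (9.173, 13.86)]
y = ["B", "B", "B", "M", "M", "B"]
model = fit_gaussian_nb(X, y)
print(nb_predict(model, (9.0, 13.0)))   # far from the M means -> "B"
```

Fitting is a single pass of counting and averaging, which is why no iterative solver is needed.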
SVM
For a linearly separable two-class task, SVM finds the maximum-margin hyperplane separating the two classes; maximizing the margin gives the hyperplane the best generalization ability.
Pros:
1. Works even with small sample sizes;
2. Good generalization performance;
3. Handles high-dimensional problems and sidesteps the curse of dimensionality;
4. Handles nonlinear problems via kernels;
5. Avoids the architecture-selection and local-minimum problems of neural networks.
How the parameters C and g affect classification:
C is the penalty coefficient: the larger C is, the higher the training (cross-validation) accuracy can climb, but the easier it is to overfit. g is the rate at which the RBF kernel decays to zero: a larger g makes the kernel narrower and the decision surface more wiggly, which likewise pushes apparent accuracy up while inviting overfitting.
Cons:
1. Sensitive to missing data;
2. No universal recipe for nonlinear problems; the kernel function must be chosen carefully.
Sample dataset it can handle: the SPECTF Heart dataset
1,70,66,66,68,71,69,64,61,68,67,50,53,73,71,73,63,71,73,80,81,82,82,67,71,52,47,67,64,66,67,66,75,58,62,65,65,71,67,70,71,67,64,52
1,73,76,68,74,56,59,73,76,54,48,75,78,47,53,25,19,60,56,56,54,80,79,47,53,19,14,58,50,67,71,63,54,49,48,66,65,62,58,57,72,31,30,15
1,68,76,79,78,63,73,68,78,64,71,73,77,67,71,58,57,61,63,52,64,64,74,53,72,36,44,52,54,49,56,73,81,65,80,53,60,63,70,58,64,52,57,49
1,68,64,65,68,63,64,77,73,75,72,80,77,70,71,61,61,73,68,63,62,76,73,69,69,48,59,62,44,66,59,75,74,64,64,63,61,70,69,74,67,51,48,45
0,62,67,64,70,59,58,67,74,60,66,68,68,73,71,60,63,64,74,64,65,74,77,69,73,59,58,58,67,65,69,78,76,61,62,64,67,72,74,71,71,71,69,66
0,62,67,68,70,65,70,73,77,69,70,69,73,71,74,71,71,76,75,66,67,73,73,70,74,63,67,58,68,66,69,78,79,69,70,71,73,72,71,73,77,72,76,64
0,59,68,69,67,69,59,78,73,66,65,77,73,74,66,66,55,71,66,69,68,75,73,80,79,69,65,69,66,68,65,75,71,59,61,65,64,73,71,81,75,74,65,69
0,75,75,70,77,67,75,75,75,67,66,74,73,68,72,64,70,76,70,67,63,74,75,72,68,69,68,75,69,71,74,75,76,63,70,71,69,66,63,70,73,66,68,58
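To show where the penalty enters, here is a rough sketch of training a linear SVM by subgradient descent on the hinge loss (a Pegasos-style simplification, not the full quadratic-programming solver; the regularization weight lam plays the role of 1/C, and the toy points are invented):

```python
def train_linear_svm(X, y, lam=0.01, epochs=200, lr=0.1):
    """Minimize hinge loss + L2 penalty by subgradient descent.
    Labels must be +1 / -1; lam is the regularization weight (roughly 1/C)."""
    w = [0.0] * len(X[0])
    b = 0.0
    for _ in range(epochs):
        for x, t in zip(X, y):
            margin = t * (sum(wi * xi for wi, xi in zip(w, x)) + b)
            if margin < 1:  # point inside the margin: hinge subgradient is active
                w = [wi - lr * (lam * wi - t * xi) for wi, xi in zip(w, x)]
                b += lr * t
            else:           # outside the margin: only the L2 shrinkage applies
                w = [wi - lr * lam * wi for wi in w]
    return w, b

def svm_predict(w, b, x):
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) + b >= 0 else -1

# Linearly separable invented data: class +1 near (2,2), class -1 near (0,0).
X = [(2.0, 2.0), (2.5, 1.8), (1.8, 2.4), (0.0, 0.2), (0.3, 0.0), (-0.2, 0.1)]
y = [1, 1, 1, -1, -1, -1]
w, b = train_linear_svm(X, y)
print([svm_predict(w, b, x) for x in X])
```

Shrinking lam (i.e. raising C) punishes margin violations harder, which is the overfitting trade-off described above.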
AdaBoost
Pros:
1. High classification accuracy;
2. The subclassifiers can be built with all kinds of methods; AdaBoost itself only supplies the framework;
3. Simple, and no feature selection is required;
4. Relatively resistant to overfitting in many practical settings.
Cons:
1. Samples that keep getting misclassified have their weights inflated round after round, which skews the choice of weak classifiers and can cause degradation (the weight-update rule needs to be modified);
2. Class imbalance can cause a sharp drop in accuracy;
3. Training is time-consuming and hard to scale;
4. Can still overfit, and robustness is weak on noisy data.
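The weight-update behaviour criticized in point 1 can be watched directly in a small sketch: AdaBoost over one-dimensional threshold stumps, on data invented so that no single stump can fit it:

```python
import math

def train_adaboost(xs, ys, rounds=5):
    """AdaBoost with 1-D threshold stumps. Labels are +1 / -1."""
    n = len(xs)
    weights = [1.0 / n] * n
    ensemble = []                      # list of (alpha, threshold, polarity)
    thresholds = sorted(set(xs))
    for _ in range(rounds):
        best = None                    # pick the stump with lowest weighted error
        for thr in thresholds:
            for pol in (1, -1):
                err = sum(w for x, t, w in zip(xs, ys, weights)
                          if (pol if x > thr else -pol) != t)
                if best is None or err < best[0]:
                    best = (err, thr, pol)
        err, thr, pol = best
        err = max(err, 1e-10)
        if err >= 0.5:                 # no better than chance: stop
            break
        alpha = 0.5 * math.log((1 - err) / err)
        ensemble.append((alpha, thr, pol))
        # Re-weight: misclassified points are boosted, correct ones shrunk.
        weights = [w * math.exp(-alpha * t * (pol if x > thr else -pol))
                   for x, t, w in zip(xs, ys, weights)]
        z = sum(weights)
        weights = [w / z for w in weights]
    return ensemble

def ada_predict(ensemble, x):
    score = sum(a * (pol if x > thr else -pol) for a, thr, pol in ensemble)
    return 1 if score >= 0 else -1

# Invented 1-D data a single stump cannot fit, but a weighted ensemble can.
xs = [1, 2, 3, 4, 5, 6, 7, 8]
ys = [1, 1, -1, -1, -1, -1, 1, 1]
ens = train_adaboost(xs, ys)
print([ada_predict(ens, x) for x in xs])
```

If one of these points carried a wrong (noisy) label, its weight would grow exponentially across rounds, which is precisely the degradation problem mentioned above.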
Logistic Regression
The binomial logistic regression model is a classification model given by the conditional probability distribution P(Y|X), in the form of a parameterized logistic distribution. Here the random variable X is real-valued and Y takes the value 1 or 0. The parameters can be estimated by supervised learning.
Pros:
1. Computationally cheap; easy to understand and implement;
2. Applicable to numerical and categorical data.
Cons:
1. Prone to underfitting;
2. Classification accuracy may be low.
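A minimal sketch of estimating the parameters by batch gradient descent on the log-likelihood (the one-feature data and learning rate are invented for illustration):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_logreg(X, y, lr=0.5, epochs=1000):
    """Fit P(Y=1|x) = sigmoid(w.x + b) by batch gradient descent."""
    w = [0.0] * len(X[0])
    b = 0.0
    n = len(X)
    for _ in range(epochs):
        grad_w = [0.0] * len(w)
        grad_b = 0.0
        for x, t in zip(X, y):
            # Residual between predicted probability and the 0/1 label.
            err = sigmoid(sum(wi * xi for wi, xi in zip(w, x)) + b) - t
            grad_w = [g + err * xi for g, xi in zip(grad_w, x)]
            grad_b += err
        w = [wi - lr * g / n for wi, g in zip(w, grad_w)]
        b -= lr * grad_b / n
    return w, b

# Invented one-feature data: study hours vs pass (1) / fail (0).
X = [(1.0,), (2.0,), (3.0,), (6.0,), (7.0,), (8.0,)]
y = [0, 0, 0, 1, 1, 1]
w, b = train_logreg(X, y)
preds = [1 if sigmoid(w[0] * x[0] + b) >= 0.5 else 0 for x in X]
print(preds)
```

Because the decision boundary is linear in the features, data that needs a curved boundary will be underfit, which is con 1 above.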
Artificial Neural Networks
Pros:
1. High classification accuracy; strong parallel distributed processing, distributed storage, and learning ability;
2. Robust and fault-tolerant against noisy data, able to approximate complex nonlinear relationships well, and capable of associative memory.
Cons:
1. Require many parameters, such as the network topology and the initial weights and thresholds;
2. The intermediate learning process cannot be observed and the outputs are hard to interpret, which hurts the credibility and acceptability of the results;
3. Training can take very long and may still fail to reach the learning goal.
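The "unobservable learning process" is just many repeated backpropagation steps. This sketch runs a single such step on a tiny 2-2-1 network (all weights and the training pair are invented) and checks that the squared error drops:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# A 2-input, 2-hidden-unit, 1-output network with fixed illustrative weights.
w_hidden = [[0.5, -0.4], [0.3, 0.8]]   # w_hidden[j] feeds hidden unit j
w_out = [0.7, -0.2]

def forward(x):
    h = [sigmoid(sum(w * xi for w, xi in zip(w_hidden[j], x)))
         for j in range(2)]
    return h, sigmoid(sum(w * hi for w, hi in zip(w_out, h)))

def backprop_step(x, target, lr=0.5):
    """One gradient-descent step on squared error; returns loss before/after."""
    global w_out, w_hidden
    h, out = forward(x)
    loss_before = 0.5 * (out - target) ** 2
    delta_out = (out - target) * out * (1 - out)          # dL/d(net_out)
    delta_h = [delta_out * w_out[j] * h[j] * (1 - h[j])   # error pushed back
               for j in range(2)]
    w_out = [w_out[j] - lr * delta_out * h[j] for j in range(2)]
    w_hidden = [[w_hidden[j][i] - lr * delta_h[j] * x[i] for i in range(2)]
                for j in range(2)]
    _, out_after = forward(x)
    return loss_before, 0.5 * (out_after - target) ** 2

before, after = backprop_step((1.0, 0.0), 1.0)
print(before, after)  # a small step along the gradient reduces the error
```

Nothing in the updated weights says *why* the network now answers differently, which is exactly the interpretability complaint in con 2.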
Genetic Algorithms
Pros:
1. Fast, randomized search that is independent of the problem domain;
2. The search starts from a population, so it is inherently parallel, compares many individuals at once, and is robust;
3. The search is guided by an evaluation (fitness) function, keeping the process simple;
4. Iteration is driven by probabilistic rules, giving the search useful randomness;
5. Extensible and easy to combine with other algorithms.
Cons:
1. Implementation is relatively involved: the problem must first be encoded, and once the best solution is found it must be decoded back;
2. The three operators come with many parameters, such as the crossover rate and mutation rate, whose choice strongly affects solution quality, yet in practice they are mostly set by experience. The algorithm also fails to exploit feedback from the search promptly, so it converges slowly and needs considerable time to reach an accurate solution;
3. Somewhat dependent on the choice of the initial population; heuristic methods can be combined in to improve this.
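The encode / select / crossover / mutate loop described above, sketched on the classic OneMax toy problem (maximize the number of 1-bits; all parameter values are illustrative):

```python
import random

random.seed(42)  # fixed seed so the run is reproducible

def fitness(bits):
    return sum(bits)  # OneMax: count the 1s

def evolve(pop_size=20, length=16, generations=40,
           crossover_rate=0.8, mutation_rate=0.02):
    """Tournament selection + single-point crossover + bit-flip mutation."""
    # Encoding step: each candidate is a bit string.
    pop = [[random.randint(0, 1) for _ in range(length)]
           for _ in range(pop_size)]
    for _ in range(generations):
        nxt = []
        while len(nxt) < pop_size:
            # Selection: the fitter of 3 random individuals wins, twice.
            p1 = max(random.sample(pop, 3), key=fitness)
            p2 = max(random.sample(pop, 3), key=fitness)
            c1, c2 = p1[:], p2[:]
            if random.random() < crossover_rate:   # single-point crossover
                cut = random.randrange(1, length)
                c1, c2 = p1[:cut] + p2[cut:], p2[:cut] + p1[cut:]
            for child in (c1, c2):                 # bit-flip mutation
                for i in range(length):
                    if random.random() < mutation_rate:
                        child[i] = 1 - child[i]
                nxt.append(child)
        pop = nxt[:pop_size]
    return max(pop, key=fitness)

best = evolve()
print(fitness(best))
```

Note how much of the code is the machinery around the problem (encoding, operators, rates) rather than the problem itself, which is exactly the implementation burden described in con 1 and con 2.

If you spot any mistakes, please point them out!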
If you spot any mistakes, please point them out!