Principle:
Decision tree induction: a decision tree is built recursively. The resulting tree tends to partition the data very finely, so it classifies the training set accurately but is much less accurate on unseen data; that is, it suffers from serious overfitting. To reduce model complexity and improve generalization, the fully grown tree therefore needs to be pruned.
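Pruning can be sketched with scikit-learn's cost-complexity pruning on a synthetic dataset (the dataset and the `ccp_alpha` value below are illustrative assumptions, not tuned choices):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=500, n_features=20, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

# An unconstrained tree grows until it fits the training set (almost) perfectly.
full = DecisionTreeClassifier(random_state=0).fit(X_tr, y_tr)

# Cost-complexity pruning (ccp_alpha > 0) cuts branches whose impurity
# reduction does not justify their added complexity, yielding a smaller tree.
pruned = DecisionTreeClassifier(random_state=0, ccp_alpha=0.01).fit(X_tr, y_tr)

print("nodes:", full.tree_.node_count, "->", pruned.tree_.node_count)
print("train accuracy:", full.score(X_tr, y_tr), "->", pruned.score(X_tr, y_tr))
```

The pruned tree trades a little training accuracy for a much simpler model, which is exactly the complexity/generalization trade-off described above.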
Ensemble classification: an ensemble model combines the predictions of multiple classifiers to reach a final decision.
A random forest classifier builds many independent tree models on the same training data and makes the final classification by majority vote. When growing each tree, it abandons any fixed feature-ranking scheme and instead considers a random subset of the features at each split.
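The majority-vote principle can be reproduced by hand from a fitted forest's individual trees; a minimal sketch on synthetic data (dataset and hyperparameters are assumptions for illustration):

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=400, n_features=20, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=1)

# max_features='sqrt': each split examines a random subset of features,
# which decorrelates the individual trees.
rf = RandomForestClassifier(n_estimators=25, max_features='sqrt',
                            random_state=1).fit(X_tr, y_tr)

# Collect each tree's prediction and take the majority vote by hand.
votes = np.array([tree.predict(X_te) for tree in rf.estimators_])
hard_vote = (votes.mean(axis=0) > 0.5).astype(int)
```

With fully grown trees the leaves are essentially pure, so this hand-rolled hard vote matches the forest's own `predict` on nearly every sample.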
A gradient boosting decision tree (GBDT) builds its component models in a fixed sequence, so the models depend on one another. Each model added later contributes to the ensemble's overall performance: every tree is grown so as to reduce, as far as possible, the ensemble's remaining fitting error on the training set.
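This stage-by-stage error reduction can be observed with `staged_predict`, which yields the ensemble's predictions after each successive tree is added; again a sketch on synthetic data with assumed hyperparameters:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import zero_one_loss

X, y = make_classification(n_samples=400, n_features=20, random_state=2)
gb = GradientBoostingClassifier(n_estimators=50, random_state=2).fit(X, y)

# staged_predict yields predictions after 1, 2, ..., 50 trees, so the list
# below traces the training error as models are added in sequence.
errors = [zero_one_loss(y, pred) for pred in gb.staged_predict(X)]
print("training error after 1 tree:", errors[0])
print("training error after 50 trees:", errors[-1])
```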
Code:
import pandas as pd
df = pd.read_excel(r'E:\Desktop\learning\python3\cell_test.xlsx')  # raw string keeps the Windows backslashes literal
X = df[['ABCC4', 'ABL1', 'ADAM19','AFG3L2','ANXA2','ANXA2P2','APOC1','APOL4','AREG','ARRDC3','ASF1A','ATF3','ATG4B','ATHL1','ATP6V0D1','ATP6V0E2','BCL11A','BHLHE40','BIN1','BST2','BTG3','C10orf128','C16orf54','C1orf186','C1orf228','C1orf54','C2orf27A','C2orf68','C6orf25','CAD','CCDC152','CCDC42','CCNL1','CD164','CD33','CD69','CD9','CDC42','CDCA7','CDH2','CFHR1','CKS2','CMTM3','COQ10B','CPA3','CPSF6','CPVL','CRHBP','CRYGD','CTNNB1','CXCR4','CXXC1','CYCS','CYP51A1','DLC1','DLK1','DNAJC12','DNTT','DUSP1','DUSP10','DUXAP10','DUXAP8','ELK3','ELOVL6','ENGASE','ERAP2','EZH2','FAIM3','FAM120A','FAM133A','FAM188A','FAM19A2','FAM47E','FHL1','FLT3','FNBP4','FOS','FREM1','FRY','FTH1','GALNT1','GAS2','GIMAP7','GLRX5','GMDS','GNAS','GOLGA8A','GOLGA8B','GOLGA8S','GPKOW','HCST','HDLBP','HERC2P2','HERC2P7','HERC2P9','HES1']]
y = df['Labels']
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=33)
# split the data: 25% test set, 75% training set
from sklearn.feature_extraction import DictVectorizer
vec = DictVectorizer(sparse=False)  # return a dense array instead of a sparse matrix
# to_dict needs orient='records' (one dict per row); 'record' raises in current pandas
X_train = vec.fit_transform(X_train.to_dict(orient='records'))
X_test = vec.transform(X_test.to_dict(orient='records'))
# Single decision tree
from sklearn.tree import DecisionTreeClassifier
dtc = DecisionTreeClassifier()
dtc.fit(X_train, y_train)
dtc_y_pred = dtc.predict(X_test)
# Random forest classifier
from sklearn.ensemble import RandomForestClassifier
rfc = RandomForestClassifier()
rfc.fit(X_train, y_train)
rfc_y_pred = rfc.predict(X_test)
# Gradient boosting decision tree
from sklearn.ensemble import GradientBoostingClassifier
gbc = GradientBoostingClassifier()
gbc.fit(X_train, y_train)
gbc_y_pred = gbc.predict(X_test)
from sklearn.metrics import classification_report
# Report: single decision tree
# (classification_report expects the true labels first, then the predictions)
print("Accuracy of the single decision tree:", dtc.score(X_test, y_test))
print(classification_report(y_test, dtc_y_pred))
# Report: random forest classifier
print('------------------------------------------------------------')
print('Accuracy of the random forest classifier:', rfc.score(X_test, y_test))
print(classification_report(y_test, rfc_y_pred))
# Report: gradient boosting decision tree
print('------------------------------------------------------------')
print('Accuracy of the gradient boosting decision tree:', gbc.score(X_test, y_test))
print(classification_report(y_test, gbc_y_pred))
**Result analysis:** the gradient boosting decision tree and the random forest classifier are far more accurate than the single decision tree. By combining the predictions of multiple classifiers, the ensemble models reduce fitting error and achieve higher accuracy.
Run output: