A Few Words First
This post came out of a sudden idea while writing my lab report: I turned this semester's computational intelligence course project (the one I defended in class) into a blog entry, which finally lives up to what my teacher said blogging is for. The source code was adapted from https://blog.csdn.net/cyberliferk800/article/details/90549795
Unfortunately, that author's code simply would not run for me (no offence intended; it may be a MATLAB version issue, or the sampling may genuinely be wrong; if you happen to read this, let's discuss it calmly rather than flame each other). After working through it I debugged and rewrote the whole logic, and the final accuracy turned out quite pleasing.
The wine-type prediction in this post exists only to satisfy my teacher's requirement of "practical relevance" and carries no great research significance in itself, so I will focus on the random forest algorithm. Now, on to the main topic.
1. Principles of the Random Forest Algorithm
1.1 Building a decision tree (the CART algorithm)
The CART algorithm consists of two steps:
Tree generation: grow a decision tree from the training set, making the tree as large as possible;
Tree pruning: prune the generated tree on a validation set and select the optimal subtree, using the minimum of the loss function as the pruning criterion.
Generating a CART decision tree is the process of recursively building a binary decision tree. CART can be used for both classification and regression; this post discusses only the classification case. For a classification tree, CART selects features by minimising the Gini index, producing a binary tree.
1.2 The Gini index
After a tree is built, the Gini index is used to judge whether it is a good tree.
The Gini index measures a model's impurity: the smaller the Gini index, the lower the impurity and the better the feature.
Suppose there are K classes and the probability of the k-th class is p_k. The Gini index of the probability distribution is:

Gini(p) = Σ_{k=1}^{K} p_k (1 − p_k) = 1 − Σ_{k=1}^{K} p_k²

Since the wines in this post fall into only two classes, the Gini index simplifies to:

Gini(p) = 2p(1 − p)

To find the best split point, there must be samples on both sides of it, say n₁ points on the left and n₂ on the right (n = n₁ + n₂ in total). The Gini index of the split is then:

Gini_split = (n₁/n) · Gini(D₁) + (n₂/n) · Gini(D₂)

where D₁ and D₂ are the sample sets on the left and right of the split.
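These formulas are easy to verify in code. Here is a minimal Python sketch (the actual implementation later in this post is MATLAB; the function names here are just illustrative):

```python
from collections import Counter

def gini(labels):
    """Gini impurity of a list of class labels: 1 - sum(p_k^2)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

def gini_split(left, right):
    """Weighted Gini impurity of a binary split: (n1/n)*Gini(D1) + (n2/n)*Gini(D2)."""
    n = len(left) + len(right)
    return len(left) / n * gini(left) + len(right) / n * gini(right)
```

For two balanced classes, `gini([1, 1, 0, 0])` gives 0.5, the worst case of 2p(1 − p), while a pure node gives 0.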
1.3 Building the random forest
A decision tree is like an expert that classifies new data using the knowledge it learned from the dataset. Building a random forest then involves two kinds of randomness: random selection of the data, and random selection of the candidate features.
1. Random selection of the data:
First, draw samples from the original dataset with replacement to construct sub-datasets, each the same size as the original; elements may repeat across different sub-datasets and within a single sub-dataset. Second, build a sub-decision-tree from each sub-dataset; each sub-tree outputs one result. Finally, when new data needs to be classified by the random forest, the forest's output is obtained by a vote over the sub-trees' judgements. In the figure below, suppose the forest has 3 sub-trees: if 2 of them classify the sample as class A and 1 as class B, the forest's result is class A.
Figure 2-3: Random selection of data from a dataset containing 3 data samples
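The sampling-and-voting procedure described above can be sketched in a few lines of Python (`bootstrap_sample` and `forest_vote` are hypothetical names for illustration, not part of the MATLAB code below):

```python
import random
from collections import Counter

def bootstrap_sample(data, rng=random):
    """Draw len(data) samples with replacement; elements may repeat."""
    return [rng.choice(data) for _ in range(len(data))]

def forest_vote(predictions):
    """Majority vote over the per-tree predictions."""
    return Counter(predictions).most_common(1)[0][0]
```

With the 3-tree example above, `forest_vote(['A', 'A', 'B'])` returns `'A'`.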
2. Random selection of the candidate features
Similar to the random selection of data, each split in a sub-tree of the random forest does not use all candidate features: a subset of features is drawn at random, and the best feature within that subset is then chosen for the split. This makes the trees in the forest differ from one another, increasing the diversity of the system and thus improving classification performance.
In the figure below, the blue squares are all the features currently available for selection, i.e. the candidate features, and the orange square is the splitting feature. The left side shows the feature selection process of a single decision tree, which completes a split by choosing the best splitting feature among all candidates (this post uses the CART algorithm); the right side shows the feature selection process of a sub-tree in a random forest.
2. Source of the Dataset
The dataset comes from the UCI machine learning repository.
3. Implementation (Core Code)
3.1 The random forest function
%random forest with trees_num trees in total
function result=random_forest(sample,trees_num,data,sample_select,decision_select,sample_limit)
type1=0;
type0=0;
conclusion=zeros(1,trees_num);
%replace the last row of data with a custom value; later this will be the value passed in from the GUI
data(size(data,1),:) = sample;
for i=1:trees_num
[path,boundary,~,result]=decision_tree(data,sample_select,decision_select,sample_limit);
conclusion(i)=decide(path,boundary,result);
if conclusion(i)==1
type1=type1+1;
else
type0=type0+1;
end
end
if type1>type0
result=1;
else
result=0;
end
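The structure of `random_forest` above is: bootstrap, train a tree, collect its vote, take the majority. The same idea can be sketched compactly in Python; to keep it self-contained I use one-split decision stumps instead of the three-level trees built by `decision_tree`, and the names `train_stump` and `random_forest_predict` are purely illustrative:

```python
import random
from collections import Counter

def train_stump(X, y):
    """One-split 'tree': pick the (feature, threshold) with fewest misclassifications."""
    best = None
    for f in range(len(X[0])):
        for t in sorted({row[f] for row in X}):
            left = [y[i] for i, r in enumerate(X) if r[f] < t]
            right = [y[i] for i, r in enumerate(X) if r[f] >= t]
            if not left or not right:
                continue  # a valid split needs samples on both sides
            lpred = Counter(left).most_common(1)[0][0]
            rpred = Counter(right).most_common(1)[0][0]
            errs = sum(v != lpred for v in left) + sum(v != rpred for v in right)
            if best is None or errs < best[0]:
                best = (errs, f, t, lpred, rpred)
    if best is None:  # degenerate bootstrap sample: fall back to the majority class
        maj = Counter(y).most_common(1)[0][0]
        return lambda row: maj
    _, f, t, lpred, rpred = best
    return lambda row: lpred if row[f] < t else rpred

def random_forest_predict(X, y, sample, trees_num, rng=random):
    """Train trees_num stumps on bootstrap samples and majority-vote on sample."""
    votes = []
    for _ in range(trees_num):
        idx = [rng.randrange(len(X)) for _ in range(len(X))]  # with replacement
        tree = train_stump([X[i] for i in idx], [y[i] for i in idx])
        votes.append(tree(sample))
    return Counter(votes).most_common(1)[0][0]
```

The fallback for a degenerate bootstrap sample mirrors what the MATLAB code handles via its `flag` variable: a tree that cannot split still has to cast a vote.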
3.2 The decision tree generation function
%generate a decision tree; inputs: raw data, number of sampled examples, number of sampled decision attributes, pre-pruning sample limit
function [path,boundary,gini,result]=decision_tree(data,sample_select,decision_select,sample_limit)
score=100;
flag=0;
temp=inf;
%data(size(data,1),:)=sample;
%score of the evaluation function
while(score>(sample_select*0.3)) %keep regenerating until a good tree is found
%%two counters conclusion3_0 and conclusion3_1: if classification stops at layer 3, make sure the counts of 0s and 1s differ
conclusion3_0=0;
conclusion3_1=0;
%two counters conclusion4_0 and conclusion4_1: if there is more than one leaf, track which of conclusion4_0 and conclusion4_1 is larger
conclusion4_0=0;
conclusion4_1=0;
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% divider %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
data_new=select_sample_decision(data,sample_select,decision_select);
%compute the initial Gini index
gini_now=gini_self(data_new);
%main procedure
layer=1; %current layer of the decision tree
leaf_sample=zeros(1,sample_select); %number of samples in each leaf node
leaf_gini=zeros(1,sample_select); %Gini index of each leaf node
leaf_num=0; %number of leaves
path=zeros(decision_select,2^(decision_select-1)); %initialise the split paths
gini=ones(decision_select,2^(decision_select-1)); %initialise the Gini indices
boundary=zeros(decision_select,2^(decision_select-1)); %initialise the split boundaries
result=ones(decision_select,2^(decision_select-1)); %initialise the results
path(:)=inf;
gini(:)=inf;
boundary(:)=inf;
result(1:4,1:8)=inf;
%layer 1
[decision_global_best,boundary_global_best,data_new1,gini_now1,data_new2,gini_now2,~]=generate_node(data_new);
path(layer,1)=data_new(size(data_new,1),decision_global_best);
boundary(layer,1)=boundary_global_best;
gini(layer,1)=gini_now;
layer=layer+1;
gini(layer,1)=gini_now1;
gini(layer,2)=gini_now2;
%layer 2
%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%% layer 2, node 1 %%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%%
if ((size(data_new1,1)-1)>=sample_limit)&&(gini(layer,1)>0)
[decision_global_best,boundary_global_best,data_new1_1,gini_now1_1,data_new1_2,gini_now1_2,~]=generate_node(data_new1);
path(layer,1)=data_new1(size(data_new1,1),decision_global_best);
boundary(layer,1)=boundary_global_best;
layer=layer+1;
gini(layer,1)=gini_now1_1;
gini(layer,2)=gini_now1_2;
%%%%%%%%%%%%%%%%%%%%%%%%% layer 3, node 1 %%%%%%%%%%%%%%%%%%%%%%%%%%%%
if (size(data_new1_1,1)-1)>=sample_limit&&(gini(layer,1)>0)
for i=1:size(data_new1_1,1)
if(data_new1_1(i,end)==1)
conclusion3_1=conclusion3_1+1;
else
conclusion3_0=conclusion3_0+1;
end
end
[decision_global_best,boundary_global_best,data_new1_1_1,gini_now1_1_1,data_new1_1_2,gini_now1_1_2,~]=generate_node(data_new1_1);
path(layer,1)=data_new1_1(size(data_new1_1,1),decision_global_best);
boundary(layer,1)=boundary_global_best;
layer=layer+1;
gini(layer,1)=gini_now1_1_1;
%count the class labels in this node
temp1=0;
temp2=0;
for i=1:size(data_new1_1_1,1)
if(data_new1_1_1(i,end)==1)
temp1=temp1+1;
else
temp2=temp2+1;
end
end
if(temp1>temp2)
temp=1;
conclusion4_1=conclusion4_1+1;
elseif temp1<temp2
temp=0;
conclusion4_0=conclusion4_0+1;
else
flag=1;
end
result(layer,1)=temp;
leaf_num=leaf_num+1;
leaf_gini(leaf_num)=gini_now1_1_1;
leaf_sample(leaf_num)=size(data_new1_1_1,1)-1;
gini(layer,2)=gini_now1_1_2;
%%%%%%%%%%%%%%%%%%%%%%%%% layer 4, leaf 2 %%%%%%%%%%%%%%%%%%%%%%%%%%%%
%count the class labels in this node
temp1=0;
temp2=0;
for i=1:size(data_new1_1_2,1)
if(data_new1_1_2(i,end)==1)
temp1=temp1+1;
else
temp2=temp2+1;
end
end
if(temp1>temp2)
temp=1;
conclusion4_1=conclusion4_1+1;
elseif temp1<temp2
temp=0;
conclusion4_0=conclusion4_0+1;
else
flag=1;
end
result(layer,2)=temp;
leaf_num=leaf_num+1;
leaf_gini(leaf_num)=gini_now1_1_2;
leaf_sample(leaf_num)=size(data_new1_1_2,1)-1;
else
%%%%%%%%%%%%%%%%%%%%%%%%% layer 3, node 1: pre-pruned leaf %%%%%%%%%%%%%%%%%%%%%%%%%%%%
%count the class labels in this node
temp1=0;
temp2=0;
for i=1:size(data_new1_1,1)
if(data_new1_1(i,end)==1)
temp1=temp1+1;
else
temp2=temp2+1;
end
end
if(temp1>temp2)
temp=1;
conclusion4_1=conclusion4_1+1;
elseif temp1<temp2
temp=0;
conclusion4_0=conclusion4_0+1;
else
flag=1;
end
result(layer,1)=temp;
leaf_num=leaf_num+1;
leaf_gini(leaf_num)=gini_now1_1;
leaf_sample(leaf_num)=size(data_new1_1,1)-1;
path(layer,1)=nan;
boundary(layer,1)=nan;
gini(layer+1,1:2)=nan;
end
layer=3;
%%%%%%%%%%%%%%%%%%%%%%%%% layer 3, node 2 %%%%%%%%%%%%%%%%%%%%%%%%%%%%
if (size(data_new1_2,1)-1)>=sample_limit&&(gini(layer,2)>0)
for i=1:size(data_new1_2,1)
if(data_new1_2(i,end)==1)
conclusion3_1=conclusion3_1+1;
else
conclusion3_0=conclusion3_0+1;
end
end
[decision_global_best,boundary_global_best,data_new1_2_1,gini_now1_2_1,data_new1_2_2,gini_now1_2_2,~]=generate_node(data_new1_2);
path(layer,2)=data_new1_2(size(data_new1_2,1),decision_global_best);
boundary(layer,2)=boundary_global_best;
layer=layer+1;
gini(layer,3)=gini_now1_2_1;
%count the class labels in this node
temp1=0;
temp2=0;
for i=1:size(data_new1_2_1,1)
if(data_new1_2_1(i,end)==1)
temp1=temp1+1;
else
temp2=temp2+1;
end
end
if(temp1>temp2)
temp=1;
conclusion4_1=conclusion4_1+1;
elseif temp1<temp2
temp=0;
conclusion4_0=conclusion4_0+1;
else
flag=1;
end
result(layer,3)=temp;
leaf_num=leaf_num+1;
leaf_gini(leaf_num)=gini_now1_2_1;
leaf_sample(leaf_num)=size(data_new1_2_1,1)-1;
gini(layer,4)=gini_now1_2_2;
%%%%%%%%%%%%%%%%%%%%%%%%% layer 4, leaf 4 %%%%%%%%%%%%%%%%%%%%%%%%%%%%
%count the class labels in this node
temp1=0;
temp2=0;
for i=1:size(data_new1_2_2,1)
if(data_new1_2_2(i,end)==1)
temp1=temp1+1;
else
temp2=temp2+1;
end
end
if(temp1>temp2)
temp=1;
conclusion4_1=conclusion4_1+1;
elseif temp1<temp2
temp=0;
conclusion4_0=conclusion4_0+1;
else
flag=1;
end
result(layer,4)=temp;
leaf_num=leaf_num+1;
leaf_gini(leaf_num)=gini_now1_2_2;
leaf_sample(leaf_num)=size(data_new1_2_2,1)-1;
else
%%%%%%%%%%%%%%%%%%%%%%%%% layer 3, node 2: pre-pruned leaf %%%%%%%%%%%%%%%%%%%%%%%%%%%%
%count the class labels in this node
temp1=0;
temp2=0;
for i=1:size(data_new1_2,1)
if(data_new1_2(i,end)==1)
temp1=temp1+1;
else
temp2=temp2+1;
end
end
if(temp1>temp2)
temp=1;
conclusion4_1=conclusion4_1+1;
elseif temp1<temp2
temp=0;
conclusion4_0=conclusion4_0+1;
else
flag=1;
end
result(layer,2)=temp;
leaf_num=leaf_num+1;
leaf_gini(leaf_num)=gini_now1_2;
leaf_sample(leaf_num)=size(data_new1_2,1)-1;
path(layer,2)=nan;
boundary(layer,2)=nan;
gini(layer+1,3:4)=nan;
end
else
%%%%%%%%%%%%%%%%%%%%%%%%% layer 2, node 1: pre-pruned leaf %%%%%%%%%%%%%%%%%%%%%%%%%%%%
%count the class labels in this node
temp1=0;
temp2=0;
for i=1:size(data_new1,1)
if(data_new1(i,end)==1)
temp1=temp1+1;
else
temp2=temp2+1;
end
end
if(temp1>temp2)
temp=1;
conclusion4_1=conclusion4_1+1;
elseif temp1<temp2
temp=0;
conclusion4_0=conclusion4_0+1;
else
flag=1;
end
result(layer,1)=temp;
leaf_num=leaf_num+1;
leaf_gini(leaf_num)=gini_now1;
leaf_sample(leaf_num)=size(data_new1,1)-1;
path(layer,1)=nan;
boundary(layer,1)=nan;
layer=layer+1;
gini(layer,1:2)=nan;
%layer 3
path(layer,1:2)=nan;
boundary(layer,1:2)=nan;
%Gini of the layer-4 leaves
layer=layer+1;
gini(layer,1:4)=nan;
end
layer=2;
%%%%%%%%%%%%%%%%%%%%%%%%% layer 2, node 2 %%%%%%%%%%%%%%%%%%%%%%%%%%%%
if (size(data_new2,1)-1)>=sample_limit&&(gini(layer,2)>0)
[decision_global_best,boundary_global_best,data_new2_1,gini_now2_1,data_new2_2,gini_now2_2,~]=generate_node(data_new2);
path(layer,2)=data_new2(size(data_new2,1),decision_global_best);
boundary(layer,2)=boundary_global_best;
layer=layer+1;
gini(layer,3)=gini_now2_1;
gini(layer,4)=gini_now2_2;
%layer 3
%%%%%%%%%%%%%%%%%%%%%%%%% layer 3, node 3 %%%%%%%%%%%%%%%%%%%%%%%%%%%%
if (size(data_new2_1,1)-1)>=sample_limit&&(gini(layer,3)>0)
for i=1:size(data_new2_1,1)
if(data_new2_1(i,end)==1)
conclusion3_1=conclusion3_1+1;
else
conclusion3_0=conclusion3_0+1;
end
end
[decision_global_best,boundary_global_best,data_new2_1_1,gini_now2_1_1,data_new2_1_2,gini_now2_1_2,~]=generate_node(data_new2_1);
path(layer,3)=data_new2_1(size(data_new2_1,1),decision_global_best);
boundary(layer,3)=boundary_global_best;
layer=layer+1;
gini(layer,5)=gini_now2_1_1;
%count the class labels in this node
temp1=0;
temp2=0;
for i=1:size(data_new2_1_1,1)
if(data_new2_1_1(i,end)==1)
temp1=temp1+1;
else
temp2=temp2+1;
end
end
if(temp1>temp2)
temp=1;
conclusion4_1=conclusion4_1+1;
elseif temp1<temp2
temp=0;
conclusion4_0=conclusion4_0+1;
else
flag=1;
end
result(layer,5)=temp;
leaf_num=leaf_num+1;
leaf_gini(leaf_num)=gini_now2_1_1;
leaf_sample(leaf_num)=size(data_new2_1_1,1)-1;
gini(layer,6)=gini_now2_1_2;
%count the class labels in this node
temp1=0;
temp2=0;
for i=1:size(data_new2_1_2,1)
if(data_new2_1_2(i,end)==1)
temp1=temp1+1;
else
temp2=temp2+1;
end
end
if(temp1>temp2)
temp=1;
conclusion4_1=conclusion4_1+1;
elseif temp1<temp2
temp=0;
conclusion4_0=conclusion4_0+1;
else
flag=1;
end
result(layer,6)=temp;
leaf_num=leaf_num+1;
leaf_gini(leaf_num)=gini_now2_1_2;
leaf_sample(leaf_num)=size(data_new2_1_2,1)-1;
else
%%%%%%%%%%%%%%%%%%%%%%%%% layer 3, node 3: pre-pruned leaf %%%%%%%%%%%%%%%%%%%%%%%%%%%%
temp1=0;
temp2=0;
for i=1:size(data_new2_1,1)
if(data_new2_1(i,end)==1)
temp1=temp1+1;
else
temp2=temp2+1;
end
end
if(temp1>temp2)
temp=1;
conclusion4_1=conclusion4_1+1;
elseif temp1<temp2
temp=0;
conclusion4_0=conclusion4_0+1;
else
flag=1;
end
result(layer,3)=temp;
leaf_num=leaf_num+1;
leaf_gini(leaf_num)=gini_now2_1;
leaf_sample(leaf_num)=size(data_new2_1,1)-1;
path(layer,3)=nan;
boundary(layer,3)=nan;
gini(layer+1,5:6)=nan;
end
layer=3;
%%%%%%%%%%%%%%%%%%%%%%%%% layer 3, node 4 %%%%%%%%%%%%%%%%%%%%%%%%%%%%
if (size(data_new2_2,1)-1)>=sample_limit&&(gini(layer,4)>0)
for i=1:size(data_new2_2,1)
if(data_new2_2(i,end)==1)
conclusion3_1=conclusion3_1+1;
else
conclusion3_0=conclusion3_0+1;
end
end
[decision_global_best,boundary_global_best,data_new2_2_1,gini_now2_2_1,data_new2_2_2,gini_now2_2_2,~]=generate_node(data_new2_2);
path(layer,4)=data_new2_2(size(data_new2_2,1),decision_global_best);
boundary(layer,4)=boundary_global_best;
layer=layer+1;
gini(layer,7)=gini_now2_2_1;
%count the class labels in this node
temp1=0;
temp2=0;
for i=1:size(data_new2_2_1,1)
if(data_new2_2_1(i,end)==1)
temp1=temp1+1;
else
temp2=temp2+1;
end
end
if(temp1>temp2)
temp=1;
conclusion4_1=conclusion4_1+1;
elseif temp1<temp2
temp=0;
conclusion4_0=conclusion4_0+1;
else
flag=1;
end
result(layer,7)=temp;
leaf_num=leaf_num+1;
leaf_gini(leaf_num)=gini_now2_2_1;
leaf_sample(leaf_num)=size(data_new2_2_1,1)-1;
gini(layer,8)=gini_now2_2_2;
%count the class labels in this node
temp1=0;
temp2=0;
for i=1:size(data_new2_2_2,1)
if(data_new2_2_2(i,end)==1)
temp1=temp1+1;
else
temp2=temp2+1;
end
end
if(temp1>temp2)
temp=1;
conclusion4_1=conclusion4_1+1;
elseif temp1<temp2
temp=0;
conclusion4_0=conclusion4_0+1;
else
flag=1;
end
result(layer,8)=temp;
leaf_num=leaf_num+1;
leaf_gini(leaf_num)=gini_now2_2_2;
leaf_sample(leaf_num)=size(data_new2_2_2,1)-1;
else
%%%%%%%%%%%%%%%%%%%%%%%%% layer 3, node 4: pre-pruned leaf %%%%%%%%%%%%%%%%%%%%%%%%%%%%
%count the class labels in this node
temp1=0;
temp2=0;
for i=1:size(data_new2_2,1)
if(data_new2_2(i,end)==1)
temp1=temp1+1;
else
temp2=temp2+1;
end
end
if(temp1>temp2)
temp=1;
conclusion4_1=conclusion4_1+1;
elseif temp1<temp2
temp=0;
conclusion4_0=conclusion4_0+1;
else
flag=1;
end
result(layer,4)=temp;
leaf_num=leaf_num+1;
leaf_gini(leaf_num)=gini_now2_2;
leaf_sample(leaf_num)=size(data_new2_2,1)-1;
path(layer,4)=nan;
boundary(layer,4)=nan;
gini(layer+1,7:8)=nan;
end
else
%%%%%%%%%%%%%%%%%%%%%%%%% layer 2, node 2: pre-pruned leaf %%%%%%%%%%%%%%%%%%%%%%%%%%%%
%count the class labels in this node
temp1=0;
temp2=0;
for i=1:size(data_new2,1)
if(data_new2(i,end)==1)
temp1=temp1+1;
else
temp2=temp2+1;
end
end
if(temp1>temp2)
temp=1;
conclusion4_1=conclusion4_1+1;
elseif temp1<temp2
temp=0;
conclusion4_0=conclusion4_0+1;
else
flag=1;
end
result(layer,2)=temp;
leaf_num=leaf_num+1;
leaf_gini(leaf_num)=gini_now2;
leaf_sample(leaf_num)=size(data_new2,1)-1;
path(layer,2)=nan;
boundary(layer,2)=nan;
layer=layer+1;
gini(layer,3:4)=nan;
%layer 3
path(layer,3:4)=nan;
boundary(layer,3:4)=nan;
%Gini of the layer-4 leaves
layer=layer+1;
gini(layer,5:8)=nan;
end
if flag==1||conclusion4_1==conclusion4_0||(conclusion3_0==conclusion3_1&&conclusion4_1==0&&conclusion4_0==0)
score=100;
else
score=evaluation(leaf_num,leaf_sample,leaf_gini);
end
flag=0;
result(2,:)=nan;
end
3.3 The decision function of a tree
%sample decision function: input the sample and the decision tree, output the judgement
function conclusion=decide(path,boundary,result)
%
%disp(sample(path(1,1)));
%disp(boundary(1,1));
%sample
conclusion0=0;
conclusion1=0;
%whether layer 4 was reached
flag=0;
if path(1,1)<boundary(1,1)
if result(2,1)==0||result(2,1)==1
conclusion=result(2,1);
else
%sample
if path(2,1)<boundary(2,1)
if result(3,1)==0||result(3,1)==1
conclusion=result(3,1);
else
%sample
if path(3,1)<boundary(3,1)
if result(4,1)==1
conclusion1=conclusion1+1;
else
conclusion0=conclusion0+1;
end
flag=1;
else
if result(4,2)==1
conclusion1=conclusion1+1;
else
conclusion0=conclusion0+1;
end
flag=1;
end
end
else
if result(3,2)==0||result(3,2)==1
conclusion=result(3,2);
else
%sample
if path(3,2)<boundary(3,2)
if result(4,3)==1
conclusion1=conclusion1+1;
else
conclusion0=conclusion0+1;
end
flag=1;
else
if result(4,4)==1
conclusion1=conclusion1+1;
else
conclusion0=conclusion0+1;
end
flag=1;
end
end
end
end
else
if result(2,2)==0||result(2,2)==1
conclusion=result(2,2);
else
%sample
if path(2,2)<boundary(2,2)
if result(3,3)==0||result(3,3)==1
conclusion=result(3,3);
else
%sample
if path(3,3)<boundary(3,3)
if result(4,5)==1
conclusion1=conclusion1+1;
else
conclusion0=conclusion0+1;
end
flag=1;
else
if result(4,6)==1
conclusion1=conclusion1+1;
else
conclusion0=conclusion0+1;
end
flag=1;
end
end
else
if result(3,4)==0||result(3,4)==1
conclusion=result(3,4);
else
%sample
if path(3,4)<boundary(3,4)
if result(4,7)==1
conclusion1=conclusion1+1;
else
conclusion0=conclusion0+1;
end
flag=1;
else
if result(4,8)==1
conclusion1=conclusion1+1;
else
conclusion0=conclusion0+1;
end
flag=1;
end
end
end
end
end
if flag==1
if conclusion1>conclusion0
conclusion=1;
else
conclusion=0;
end
end
I will put the complete system code in another post; help yourselves. Please be kind.
4. Analysis
Because some small bugs remain in the self-comparison analysis, the measured accuracy is generally 3-4% lower than it should be. In other words, even when the variables are not defined particularly well, the accuracy can still be fairly high, which again demonstrates the strength of random forests: "the performance-tuning process happened to improve the model's accuracy as well, and such a neat outcome is not common."
I'm too tired today and need to rest. When I have time I'll redo the comparison properly and share it; for now, here is the old comparison as a single figure (screenshotted from my course paper, a bit ugly, view as you please).