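[Machine Learning] Understanding and Applying Decision Tree Algorithms (ID3 and C4.5)
1. Introduction to decision trees
Concept: The decision tree is one of the most common algorithms in machine learning. It is a predictive model: a learner trained on the known probabilities of the various outcomes, representing a mapping from an object's attributes to the object's value.
Decision tree structure:
Root node: an attribute of the target to be predicted
Branch: one category (value) of that attribute
Leaf node: the prediction result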
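1.1 The ID3 algorithm
1.1.1 Core idea
In information theory, the smaller the expected information that remains after a split (the conditional entropy), the larger the information gain. The main goal of ID3 is to find the attribute with the largest information gain and use it as the root node from which the decision tree is built.
1.1.2 Basic concepts
Here we use an example (predicting satisfaction with a blind date) to explain:
Height | Age | Owns home | Employer | Satisfaction |
---|---|---|---|---|
175 | Older | Yes | Company | Satisfied |
178 | Younger | No | Government | Satisfied |
180 | Older | No | Company | Unsatisfied |
178 | Older | No | Government | Unsatisfied |
180 | Younger | Yes | Company | Satisfied |
Here Satisfaction is the prediction result; Height, Age, Owns home and Employer are the attributes used to build the decision tree.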
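1. Information entropy: measures the amount of information (uncertainty) in the prediction result Y: $H(Y) = -\sum_{k} p_k \log_2 p_k$, where $p_k$ is the proportion of samples in class $k$.
2. Conditional entropy: the uncertainty of the result Y once attribute X is known: $H(Y \mid X) = \sum_{v} \frac{|D_v|}{|D|} H(Y \mid X = v)$, where $D_v$ is the subset of samples whose attribute X takes the value $v$ and $H(Y \mid X = v)$ is the entropy of Y within that subset. It has to be computed for every attribute:
- Compute the conditional entropy of Height
- Compute the conditional entropy of Age
- Compute the conditional entropy of Owns home
- Compute the conditional entropy of Employer
**3. Information gain:** the difference between the information entropy and the conditional entropy: $\mathrm{Gain}(X) = H(Y) - H(Y \mid X)$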
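To make the three quantities concrete, here is a minimal C++ sketch that evaluates them on the five-row blind-date table. It is an illustration only and is not part of the program listed later; the names `Row`, `entropy`, `conditionalEntropy`, `informationGain` and `sampleTable` are hypothetical, and the English cell values are assumed translations of the table.

```cpp
#include <array>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// One row of the table: four attributes followed by the label (hypothetical layout).
using Row = std::array<std::string, 5>;
const std::size_t kLabel = 4;  // column index of "Satisfaction"

// Shannon entropy (in bits) of the label column.
double entropy(const std::vector<Row>& rows) {
    assert(!rows.empty());
    std::map<std::string, int> counts;
    for (const Row& r : rows) ++counts[r[kLabel]];
    double h = 0.0;
    for (const auto& kv : counts) {
        const double p = static_cast<double>(kv.second) / rows.size();
        h -= p * std::log2(p);  // p > 0 here, so log2 is well defined
    }
    return h;
}

// Conditional entropy H(label | attribute in column col):
// the size-weighted average of the entropy of each value's subset.
double conditionalEntropy(const std::vector<Row>& rows, std::size_t col) {
    std::map<std::string, std::vector<Row>> groups;
    for (const Row& r : rows) groups[r[col]].push_back(r);
    double h = 0.0;
    for (const auto& kv : groups)
        h += static_cast<double>(kv.second.size()) / rows.size() * entropy(kv.second);
    return h;
}

// Information gain = entropy minus conditional entropy.
double informationGain(const std::vector<Row>& rows, std::size_t col) {
    return entropy(rows) - conditionalEntropy(rows, col);
}

// The five samples from the table (columns: Height, Age, Owns home, Employer, Satisfaction).
std::vector<Row> sampleTable() {
    return {
        Row{"175", "Older", "Yes", "Company", "Satisfied"},
        Row{"178", "Younger", "No", "Government", "Satisfied"},
        Row{"180", "Older", "No", "Company", "Unsatisfied"},
        Row{"178", "Older", "No", "Government", "Unsatisfied"},
        Row{"180", "Younger", "Yes", "Company", "Satisfied"},
    };
}
```

With this table the entropy of Satisfaction is about 0.971 bits, and the gains come out at roughly 0.171 for Height, 0.420 for Age, 0.420 for Owns home and 0.020 for Employer, so Age and Owns home tie for the largest gain.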
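1.1.3 Procedure
After computing the information gain of every attribute, the attribute with the largest gain becomes the root node of the decision tree, and the tree is then built recursively from the top down. Here is a brief example of the steps that follow once the first root node has been determined.
Suppose the attribute with the largest information gain is Age; it has two categories (that is, the decision tree has two branches).
When the category is "Older" (the "Older" branch):
Height | Age | Owns home | Employer | Satisfaction |
---|---|---|---|---|
175 | Older | Yes | Company | Satisfied |
180 | Older | No | Company | Unsatisfied |
178 | Older | No | Government | Unsatisfied |
When the category is "Younger" (the "Younger" branch):
Height | Age | Owns home | Employer | Satisfaction |
---|---|---|---|---|
178 | Younger | No | Government | Satisfied |
180 | Younger | Yes | Company | Satisfied |
On each of these subsets, recompute the information gain of every remaining attribute using the concepts from 1.1.2, and take the attribute with the largest gain as the root of that subtree.
Here we assume that the attribute with the largest gain under the "Older" branch is Height, and under the "Younger" branch it is Employer (as shown in the figure). In this small table the "Younger" subset is already entirely Satisfied, so in practice it would become a leaf; the assumption is only for illustration.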
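The splitting step described above can be sketched as follows. This is a hedged illustration rather than the implementation used later in the article: `Row`, `splitBy`, `isPure` and `sampleTable` are hypothetical names, and the English cell values are assumed translations of the table.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <map>
#include <string>
#include <vector>

// Four attributes followed by the label (hypothetical layout).
using Row = std::array<std::string, 5>;

// Group the rows by the value in column col; each group is one branch of the node.
std::map<std::string, std::vector<Row>> splitBy(const std::vector<Row>& rows, std::size_t col) {
    assert(col < 5);
    std::map<std::string, std::vector<Row>> branches;
    for (const Row& r : rows) branches[r[col]].push_back(r);
    return branches;
}

// True when every row carries the same label (column 4), i.e. the branch can end in a leaf.
bool isPure(const std::vector<Row>& rows) {
    for (const Row& r : rows)
        if (r[4] != rows.front()[4]) return false;
    return true;
}

// The five samples (columns: Height, Age, Owns home, Employer, Satisfaction).
std::vector<Row> sampleTable() {
    return {
        Row{"175", "Older", "Yes", "Company", "Satisfied"},
        Row{"178", "Younger", "No", "Government", "Satisfied"},
        Row{"180", "Older", "No", "Company", "Unsatisfied"},
        Row{"178", "Older", "No", "Government", "Unsatisfied"},
        Row{"180", "Younger", "Yes", "Company", "Satisfied"},
    };
}
```

Calling `splitBy(sampleTable(), 1)` yields the two Age subsets: three "Older" rows with mixed labels, which need a further split, and two "Younger" rows that are already pure.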
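1.2 The C4.5 algorithm
1.2.1 Core idea
C4.5 builds on information gain by introducing the split information (split ratio), so the criterion that decides a node's attribute becomes the information gain ratio.
1.2.2 Basic concepts
Split information: $\mathrm{SplitInfo}(X) = -\sum_{v} \frac{|D_v|}{|D|} \log_2 \frac{|D_v|}{|D|}$
Information gain ratio: $\mathrm{GainRatio}(X) = \mathrm{Gain}(X) / \mathrm{SplitInfo}(X)$
All other concepts are the same as in ID3.
1.2.3 Procedure
Apart from using the information gain ratio to choose the attribute for each node, the procedure is the same as in ID3.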
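As a rough sketch of how the gain ratio can be computed from per-value counts, the block below divides the information gain by the split information for a two-class problem. The names `ValueCount`, `binaryEntropy`, `informationGain`, `splitInfo` and `gainRatio` are hypothetical and do not come from the program listed later.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Counts for one value of an attribute: how many samples take this value,
// and how many of those are positive.
struct ValueCount {
    int total;
    int positive;
};

// Entropy in bits of a subset with pos positives among n samples.
double binaryEntropy(int pos, int n) {
    if (n == 0 || pos == 0 || pos == n) return 0.0;  // 0 * log(0) is taken as 0
    const double p = static_cast<double>(pos) / n;
    return -p * std::log2(p) - (1.0 - p) * std::log2(1.0 - p);
}

// Information gain of the attribute described by values.
double informationGain(const std::vector<ValueCount>& values) {
    int n = 0, pos = 0;
    for (const ValueCount& v : values) { n += v.total; pos += v.positive; }
    assert(n > 0);
    double gain = binaryEntropy(pos, n);
    for (const ValueCount& v : values)
        gain -= static_cast<double>(v.total) / n * binaryEntropy(v.positive, v.total);
    return gain;
}

// Split information: the entropy of the attribute's own value distribution.
double splitInfo(const std::vector<ValueCount>& values) {
    int n = 0;
    for (const ValueCount& v : values) n += v.total;
    assert(n > 0);
    double s = 0.0;
    for (const ValueCount& v : values) {
        if (v.total == 0) continue;  // an unused value contributes nothing
        const double f = static_cast<double>(v.total) / n;
        s -= f * std::log2(f);
    }
    return s;
}

// Gain ratio; defined as 0 when the attribute has a single value (split information 0).
double gainRatio(const std::vector<ValueCount>& values) {
    const double s = splitInfo(values);
    if (s == 0.0) return 0.0;
    return informationGain(values) / s;
}
```

For the OutLook counts used in the code comments later in this article (5, 4 and 5 samples, of which 2, 4 and 3 are positive), `gainRatio({{5, 2}, {4, 4}, {5, 3}})` gives a gain of about 0.247, a split information of about 1.577 and a ratio of about 0.156. Dividing by the split information penalizes attributes that fragment the data into many small groups.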
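2. Decision tree classification in practice
2.1 Implementing ID3 and C4.5 in C++
Function menu:
Dataset:
You can build your own dataset; here I use a dataset that is common online (being a bit lazy), but if you make your own, remember to preprocess it.
The code follows (developed in Visual Studio):
Apart from function 6, which needs extra environment configuration, everything can be copied and used directly.
Header file (pch.h):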
#pragma once
#include<stdio.h>
#include<stdlib.h>
#include <iostream>
#include <fstream>
#include <math.h>
#include <string>
#include <string.h>
#include <queue>
#include <stack>
using namespace std;
#define ROW 14
#define COL 5
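#define log2 0.69314718055//ln(2), the natural logarithm of 2, used later with the change-of-base formula; note that this macro hides the standard log2() function
//decision tree node, stored as a first-child/next-sibling linked tree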
typedef struct TNode
{
char data[20];
char weight[20];
TNode* firstchild, * nextsibling;
}*tree;
//node of the training-sample linked list
typedef struct LNode
{
char OutLook[20];
char Temperature[20];
char Humidity[20];
char Wind[20];
char PlayTennis[20];
LNode* next;
}*link;
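//node of the attribute linked list
typedef struct AttrNode
{
char attributes[15];//attribute name
int attr_Num;//number of categories (values) of this attribute
AttrNode* next;
}*Attributes;
//categories of each attribute (static, so every .cpp file that includes this header gets its own copy and the linker sees no duplicate definitions)
static const char* OutLook_kind[3] = { "Sunny","OverCast","Rain" };
static const char* Temperature_kind[3] = { "Hot","Mild","Cool" };
static const char* Humidity_kind[2] = { "High","Normal" };
static const char* Wind_kind[2] = { "Weak","Strong" };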
//print the decision tree as a generalized list
void treelists(tree T);//without branch labels
void treelists1(tree T, int& i);//with branch labels
//basic functions for building the decision tree
void InitAttr(Attributes& attr_link, const char* Attributes_kind[], int Attr_kind[]);//build the attribute list
void InitLink(link& L, const char* Examples[][COL]);//build the training-sample list
void PN_Num(link L, int& positve, int& negative);//count positive and negative samples
//build the decision tree with ID3
double Gain(int positive, int negative, const char* atrribute, link L, Attributes attr_L);//compute the information gain
void ID3(tree& T, link L, link Target_Attr, Attributes attr);//build the decision tree with ID3
//build the decision tree with C4.5
double Gain_Ratio(int positive, int negative, const char* atrribute, link L, Attributes attr_L);//compute the information gain ratio
void C4_5(tree& T, link L, link Target_Attr, Attributes attr);//build the decision tree with C4.5
//display the training samples
void show(const char* Examples[][COL]);//print to the terminal
void show_txt(link LL, const char* Attributes_kind[], const char* Examples[][COL]);//print the sample data to a text file
//test data
void Test(tree T, char train[4][20], char max[20], stack <char*>& a);//output the predicted class
void route(tree T, char train[4][20]);//print the path the test data takes through the decision tree
//visualize the decision tree
void graphic1(tree T);//visualize the tree built by ID3
void graphic2(tree T);//visualize the tree built by C4.5
//basic helper functions (spare)
int TreeHeight(tree T);//height of the tree
void InOrderTraverse1(tree T);//preorder traversal
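The header stores the tree in first-child/next-sibling form and prints it as a generalized list. The following self-contained sketch shows that representation on a tiny hand-built tree. It is only an illustration: `Node`, `addChild`, `writeList` and `destroy` are hypothetical names, and `std::string` is used instead of the fixed-size `char` arrays of `TNode`.

```cpp
#include <cassert>
#include <ostream>
#include <sstream>
#include <string>
#include <utility>

// Hypothetical counterpart of TNode: data is the attribute name or class label,
// weight is the attribute value on the branch leading to this node.
struct Node {
    std::string data;
    std::string weight;
    Node* firstChild = nullptr;
    Node* nextSibling = nullptr;
    Node(std::string d, std::string w) : data(std::move(d)), weight(std::move(w)) {}
};

// Append a new child as the last child of parent and return it.
Node* addChild(Node* parent, const std::string& weight, const std::string& data) {
    Node* child = new Node(data, weight);
    if (!parent->firstChild) {
        parent->firstChild = child;
    } else {
        Node* last = parent->firstChild;
        while (last->nextSibling) last = last->nextSibling;
        last->nextSibling = child;
    }
    return child;
}

// Write the subtree rooted at n in generalized-list form, e.g. A(B,C(D)).
void writeList(const Node* n, std::ostream& out) {
    if (!n) return;
    out << n->data;
    if (n->firstChild) {
        out << '(';
        for (const Node* c = n->firstChild; c; c = c->nextSibling) {
            if (c != n->firstChild) out << ',';
            writeList(c, out);
        }
        out << ')';
    }
}

// Free a whole tree: children first, then siblings, then the node itself.
void destroy(Node* n) {
    if (!n) return;
    destroy(n->firstChild);
    destroy(n->nextSibling);
    delete n;
}
```

Each node needs only two pointers however many branches an attribute has, which is why this layout suits trees whose nodes have a varying number of children.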
Main file (decision_tree.cpp):
#include "pch.h"
#include <iostream>
const char* Examples[ROW][COL] = {
//"OverCast","Cool","High","Strong","No",
//"Rain","Hot","Normal","Strong","Yes",
"Sunny","Hot","High","Weak","No",
"Sunny","Hot","High","Strong","No",
"OverCast","Hot","High","Weak","Yes",
"Rain","Mild","High","Weak","Yes",
"Rain","Cool","Normal","Weak","Yes",
"Rain","Cool","Normal","Strong","No",
"OverCast","Cool","Normal","Strong","Yes",
"Sunny","Mild","High","Weak","No",
"Sunny","Cool","Normal","Weak","Yes",
"Rain","Mild","Normal","Weak","Yes",
"Sunny","Mild","Normal","Strong","Yes",
"OverCast","Mild","Normal","Strong","Yes",
"OverCast","Hot","Normal","Weak","Yes",
"Rain","Mild","High","Strong","No"
};
const char* Attributes_kind[4] = { "OutLook","Temperature","Humidity","Wind" };
int Attr_kind[4] = { 3,3,2,2 };
int main()
{
//char* kind[COL - 1];
link LL, p;
Attributes attr_L, q;
tree T, T_C;
T = new TNode;
T->firstchild = T->nextsibling = NULL;
strcpy_s(T->weight, "");
strcpy_s(T->data, "");
T_C = new TNode;
T_C->firstchild = T_C->nextsibling = NULL;
strcpy_s(T_C->weight, "");
strcpy_s(T_C->data, "");
attr_L = new AttrNode;
attr_L->next = NULL;
LL = new LNode;
LL->next = NULL;
int choice;
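printf(" This system uses a decision tree to predict in which weather it is suitable to go out and play tennis\n");
printf(" $$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$\n");
printf(" Key 1: import the training samples and display the training data\n");
printf(" Key 2: build the decision tree and show the information gain of each attribute\n");
printf(" Key 3: display the trained decision tree as a generalized list\n");
printf(" Key 4: enter test data, show its path through the decision tree and print the classification result\n");
printf(" Key 5: improved algorithm (C4.5), demonstrating its advantages\n");
printf(" Key 6: visualize the decision tree (this releases the memory directly, so it is best used just before closing the program)\n");
printf(" Key 0: exit the system\n");
printf(" $$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$$\n");
cout << "Select an operation: ";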
cin >> choice;
while (choice != 0)
{
switch (choice)
{
case 1:
{
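if (LL->next == NULL)//import only once, so pressing key 1 again does not append duplicate samples
InitLink(LL, Examples);
printf("Sample data imported successfully\n\n");
show(Examples);
int choice1;
printf("Print the training sample information to a file? (key 1: yes; any other key: no): ");
cin >> choice1;
if (choice1 == 1)
{
if (LL->next != NULL)
{
show_txt(LL, Attributes_kind, Examples);
printf("Done. Please look for the file in the corresponding directory!\n");
}
else
printf("The training-sample list has not been built; once it is built, the positive and negative sample counts can be shown!\n");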
}
break;
}
case 2:
if (LL->next != NULL)
{
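if (attr_L->next == NULL)//build the attribute list only once, so repeated presses do not append duplicate attributes
InitAttr(attr_L, Attributes_kind, Attr_kind);
ID3(T, LL, NULL, attr_L);
if (T)
printf("Decision tree built successfully!\n\n");
else
printf("The decision tree is empty\n\n");
}
else
printf("No training samples imported; cannot build the decision tree!!!\n");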
break;
case 3:
if (T->firstchild != NULL)
{
int n;
printf("Show the branch labels (attribute categories) of the tree? (key 1: yes; key 2: no): ");
cin >> n;
if (n == 1)
{
printf("Note: the text inside {} is the branch label (attribute category)\n");
int i = 0;
treelists1(T,i);
}
else if (n == 2)
treelists(T);
else
printf("Invalid operation!!!\n");
cout << endl << endl;
}
else
printf("The decision tree has not been built!!!\n");
break;
case 4:
{
if (T->firstchild != NULL)
{
char train[COL - 1][20];
cout << endl;
int g = 0;
printf("Predict whether to go out and play tennis in the given weather\n");
printf("Please read the input format requirements carefully\n");
printf("----------------------------------------\n");
cout << "The four attributes: " << "OutLook" << ',' << "Temperature" << ',' << "Humidity" << ',' << "Wind" << endl;
cout << "OutLook categories: " << "Sunny" << ',' << "OverCast" << ',' << "Rain" << endl;
cout << "Temperature categories: " << "Hot" << ',' << "Mild" << ',' << "Cool" << endl;
cout << "Humidity categories: " << "High" << ',' << "Normal" << endl;
cout << "Wind categories: " << "Weak" << ',' << "Strong" << endl;
printf("----------------------------------------\n");
printf("Enter the OutLook category: ");
cin >> train[0];
if (!(strcmp(train[0], "Sunny") == 0 || strcmp(train[0], "OverCast") == 0 || strcmp(train[0], "Rain") == 0))
{
printf("Invalid input; after finishing all entries, please try again!!!\n");
g++;
}
printf("Enter the Temperature category: ");
cin >> train[1];
if (!(strcmp(train[1], "Hot") == 0 || strcmp(train[1], "Mild") == 0 || strcmp(train[1], "Cool") == 0))
{
printf("Invalid input; after finishing all entries, please try again!!!\n");
g++;
}
printf("Enter the Humidity category: ");
cin >> train[2];
if (!(strcmp(train[2], "High") == 0 || strcmp(train[2], "Normal") == 0))
{
printf("Invalid input; after finishing all entries, please try again!!!\n");
g++;
}
printf("Enter the Wind category: ");
cin >> train[3];
if (!(strcmp(train[3], "Weak") == 0 || strcmp(train[3], "Strong") == 0))
{
printf("Invalid input; after finishing all entries, please try again!!!\n");
g++;
}
if (g == 0)
{
stack<char*> a;
//int g = 0;
Test(T, train, T->data, a);
//cout<<g<<' '<<a.size()<<endl;
//cout << a.top() << endl;
if (strcmp(a.top(), "Yes") == 0 || strcmp(a.top(), "No") == 0)
{
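printf("Prediction (PlayTennis): ");
if (strcmp(a.top(), "Yes") == 0)
printf("You can go out and play!\n");
else
if (strcmp(a.top(), "No") == 0)
printf("Sigh, no going out to play this time...\n");
printf("Path of the test data through the decision tree built by ID3: ");
printf("%s", T->data);
route(T->firstchild, train);
cout << endl;
printf("Compare with the path of the test data through the tree built by C4.5? (key 1: yes; any other key: no): ");
int choice1;
cin >> choice1;
if (choice1 == 1)
{
if (T_C->firstchild != NULL)
{
printf("Path of the test data through the decision tree built by C4.5: ");
printf("%s", T_C->data);
route(T_C->firstchild, train);
}
else
printf("No decision tree has been built with C4.5 yet; press key 5 to build one with the improved algorithm!!!\n");
}
cout << endl;
}
}
else
{
printf("Important things are said three times: invalid operation, please try again!!!\n");
}
}
else
printf("The decision tree has not been built!\n");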
break;
}
case 5:
{
if (LL->next == NULL)
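printf("Training samples have not been imported!\n");
else
{
printf("Building the decision tree with C4.5, an improvement on ID3\n");
if (attr_L->next == NULL)//the attribute list is otherwise only built by key 2, so build it here if needed
InitAttr(attr_L, Attributes_kind, Attr_kind);
C4_5(T_C, LL, NULL, attr_L);
if (T_C)
printf("Decision tree built successfully!\n\n");
else
printf("The decision tree is empty\n\n");
printf("The decision tree as a generalized list:\n");
if (T_C != NULL)
{
printf("Without branch labels: ");
treelists(T_C);
cout << endl;
printf("With branch labels: ");
int i = 0;
treelists1(T_C,i);
cout << endl << endl;
}
if (T_C == NULL)
printf("The decision tree has not been built!!!\n");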
}
break;
}
case 6:
{
int choice1;
printf("Visualize the decision tree (key 1: tree built by ID3; key 2: tree built by C4.5): ");
cin >> choice1;
if (choice1 == 1)
{
if (T->firstchild != NULL)
graphic1(T);
else
printf("The tree has not been built!\n");
}
else if (choice1 == 2)
{
if (T_C->firstchild != NULL)
graphic2(T_C);
else
printf("The tree has not been built!\n");
}
else
printf("Invalid operation!\n");
break;
}
default:printf("Invalid operation!\n"); break;
}
cout << "Select an operation: ";
cin >> choice;
}
printf("Thank you for using this program!!!\n");
cout << endl;
return 0;
}
Function module file (pch.cpp):
#include "pch.h"
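//print the tree as a generalized list
void treelists(tree T)
{
tree p;
if (T == NULL)
return;
cout << T->data;//print the root of this subtree
p = T->firstchild;
if (p)
{
cout << "(";//a new level opens with '('
treelists(p);//recurse into the first child
p = p->nextsibling;
while (p)
{
cout << ',';//siblings on the same level are separated by ','
treelists(p);//recurse into each remaining sibling
p = p->nextsibling;
}
cout << ")";
}
}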
void treelists1(tree T,int &i)
{
tree p;
if (!T)
return;
if(i!=0)
cout << "{" << T->weight << "}";
i++;
cout << T->data;
p = T->firstchild;
if (p)
{
cout << "(";
while (p)
{
treelists1(p,i);
p = p->nextsibling;
if(p)
cout << ',';
}
cout << ")";
}
}
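//build the attribute list (parameters: list to build, attribute names, number of categories of each attribute)
void InitAttr(Attributes& attr_link, const char* Attributes_kind[], int Attr_kind[])
{
Attributes p;
for (int i = 0; i < COL - 1; i++)
{
p = new AttrNode;//initialize the node
p->next = NULL;
strcpy_s(p->attributes, Attributes_kind[i]);
p->attr_Num = Attr_kind[i];
//head insertion (the list order is therefore reversed; keep this in mind for the verification code later)
p->next = attr_link->next;
attr_link->next = p;
}
}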
//build the training-sample list (parameters: list to build, the given sample data)
void InitLink(link& LL, const char* Examples[][COL])
{
link p;
for (int i = 0; i < ROW; i++)
{
//initialize the node
p = new LNode;
p->next = NULL;
strcpy_s(p->OutLook, Examples[i][COL - 5]);
strcpy_s(p->Temperature, Examples[i][COL - 4]);
strcpy_s(p->Humidity, Examples[i][COL - 3]);
strcpy_s(p->Wind, Examples[i][COL - 2]);
strcpy_s(p->PlayTennis, Examples[i][COL - 1]);
//head insertion
p->next = LL->next;
LL->next = p;
}
}
//count the positive and negative samples
void PN_Num(link L, int& positve, int& negative)
{
positve = 0;
negative = 0;
link p;
//walk the list and read each sample's final class (PlayTennis)
p = L->next;
while (p)
{
if (strcmp(p->PlayTennis, "No") == 0)
negative++;
else if (strcmp(p->PlayTennis, "Yes") == 0)
positve++;
p = p->next;
}
}
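//compute the information gain (key function)
//parameters: positive count, negative count, attribute to evaluate, training-sample list, attribute list
double Gain(int positive, int negative, const char* atrribute, link L, Attributes attr_L)
{
int atrr_kinds = 0;//number of categories of this attribute (stays 0 if the attribute is not found)
int attr_th = 0;//index of the attribute in the attribute list (built by head insertion, so the order is reversed); used for verification only
Attributes p = attr_L->next;
link q = L->next;
//look up the number of categories of this attribute
while (p)
{
if (strcmp(p->attributes, atrribute) == 0)
{
atrr_kinds = p->attr_Num;//number of categories found
break;
}
p = p->next;
attr_th++;
}
//printf("attr_th:%d,atrr_kinds:%d\n", attr_th, atrr_kinds);
double entropy = 0, gain = 0;
//information entropy (formula: entropy = -p1*log2(p1) - p2*log2(p2))
double p1 = 1.0 * positive / (positive + negative);//proportion of positive samples
double p2 = 1.0 * negative / (positive + negative);//proportion of negative samples
if (positive != 0 && negative != 0)//a pure set has zero entropy; skipping it avoids log(0)
entropy = -p1 * log(p1) / log2 - p2 * log(p2) / log2;//change-of-base formula: log2(x) = ln(x) / ln(2)
gain = entropy;
//a 3 x atrr_kinds array holding the per-category counts needed for the conditional entropy
int** kinds = new int* [3];//dynamically allocated 2-D array
for (int j = 0; j < 3; j++)
{
kinds[j] = new int[atrr_kinds];
//printf("%d\n",kinds[j]);
//printf("kinds[%d]=%d\n",j,kinds[j]);
}
//initialize to zero
for (int j = 0; j < 3; j++)
{
for (int i = 0; i < atrr_kinds; i++)
{
kinds[j][i] = 0;
}
}
/*After initialization (taking OutLook as an example):
Sunny OverCast Rain
Total: 0 0 0
Yes: 0 0 0
No: 0 0 0
After counting (taking OutLook as an example):
Sunny OverCast Rain
Total: 5 4 5
Yes: 2 4 3
No: 3 0 2
*/
//Gain computes the information gain of one attribute, so it needs that attribute's conditional entropy over the training samples
//fill the 2-D array with the counts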
while (q)
{
if (strcmp("OutLook", atrribute) == 0)
{
for (int i = 0; i < atrr_kinds; i++)//per-category counts (total, positive, negative)
{
if (strcmp(q->OutLook, OutLook_kind[i]) == 0)
{
kinds[0][i]++;//total samples in this category
if (strcmp(q->PlayTennis, "Yes") == 0)
kinds[1][i]++;//positive samples
else
kinds[2][i]++;//negative samples
}
}
}
else if (strcmp("Temperature", atrribute) == 0)
{
for (int i = 0; i < atrr_kinds; i++)
{
if (strcmp(q->Temperature, Temperature_kind[i]) == 0)
{
kinds[0][i]++;
if (strcmp(q->PlayTennis, "Yes") == 0)
kinds[1][i]++;
else
kinds[2][i]++;
}
}
}
else if (strcmp("Humidity", atrribute) == 0)
{
for (int i = 0; i < atrr_kinds; i++)
{
if (strcmp(q->Humidity, Humidity_kind[i]) == 0)
{
kinds[0][i]++;
if (strcmp(q->PlayTennis, "Yes") == 0)
kinds[1][i]++;//
else
kinds[2][i]++;
}
}
}
else if (strcmp("Wind", atrribute) == 0)
{
for (int i = 0; i < atrr_kinds; i++)
{
if (strcmp(q->Wind, Wind_kind[i]) == 0)
{
kinds[0][i]++;
if (strcmp(q->PlayTennis, "Yes") == 0)
kinds[1][i]++;
else
kinds[2][i]++;
}
}
}
q = q->next;
}
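//compute the information gain; gain_kind[j] stores the entropy of category j (see the comment below)
double* gain_kind = new double[atrr_kinds];
/*
The conditional entropy needs: the positive/negative counts and proportions within each category, and each category's share of all samples
Taking OutLook as an example:
entropy(S)= -2/5 * log2(2/5) - 3/5 * log2(3/5)
entropy(O)= -4/4 * log2(4/4) - 0/4 * log2(0/4)=0
entropy(R)= -3/5 * log2(3/5) - 2/5 * log2(2/5)
entropy(SOR)= 5/14entropy(S) + 4/14entropy(O) + 5/14entropy(R)
gain = entropy(information entropy) - entropy(SOR)
*/
//evaluate the formulas above
for (int j = 0; j < atrr_kinds; j++)
{
if (kinds[0][j] != 0 && kinds[1][j] != 0 && kinds[2][j] != 0)
{
p1 = 1.0 * kinds[1][j] / kinds[0][j];
p2 = 1.0 * kinds[2][j] / kinds[0][j];
gain_kind[j] = -p1 * log(p1) / log2 - p2 * log(p2) / log2;//change-of-base formula
gain = gain - (1.0 * kinds[0][j] / (positive + negative)) * gain_kind[j];
}
else
gain_kind[j] = 0;//an empty or pure category has zero entropy, as the comment above shows
}
//release the temporary arrays
for (int j = 0; j < 3; j++)
delete[] kinds[j];
delete[] kinds;
delete[] gain_kind;
return gain;
}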
//ID3 must free the child attribute list and the child sample list as soon as they are no longer needed
void FreeLink(link& Link)
{
link p, q;
p = Link->next;
while (p)
{
q = p;
p = p->next;
delete(q);
}
Link->next = NULL;
}
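//build the decision tree with ID3
//parameters: tree (to build), training-sample list, target list (meant to supply the most common class when the attribute list is empty, just in case), attribute list
void ID3(tree& T, link L, link Target_Attr, Attributes attr)
{
//p and p1 are helpers for building attr_child; max points to the attribute with the largest gain
Attributes p, max, attr_child, p1;
//q and q1 are helpers for building link_child
link q, link_child, q1;
//r is used to create new tree nodes
//tree_p is the node currently being extended once the first node of a level is built (moving from T to T->firstchild)
tree r, tree_p;
//count the positive and negative samples in the training set
int positive = 0, negative = 0;
PN_Num(L, positive, negative);
//initialize the two child collections (they are the key to building the tree)
attr_child = new AttrNode;
attr_child->next = NULL;
link_child = new LNode;
link_child->next = NULL;
if (positive == 0)//all samples are negative: this node is a "No" leaf
{
strcpy_s(T->data, "No");
delete attr_child;
delete link_child;
return;
}
else if (negative == 0)//all samples are positive: this node is a "Yes" leaf
{
strcpy_s(T->data, "Yes");
delete attr_child;
delete link_child;
return;
}
p = attr->next; //attribute list
max = p;//default to the first attribute so that max is never used uninitialized when no gain exceeds 0
double gain, g = 0;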
if (p)
{
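//find the attribute to split on
while (p)
{
//find which attribute has the largest information gain; it becomes the root of this (sub)tree
gain = Gain(positive, negative, p->attributes, L, attr);
cout << "Information gain of " << p->attributes << ": " << gain << endl;
if (gain > g)
{
g = gain;
max = p;
}
p = p->next;
}
strcpy_s(T->data, max->attributes);//store the chosen attribute in the tree node
cout << "Attribute with the largest information gain: max->attributes = " << max->attributes << endl << endl;
//create the child attribute list (only needed once per level of the tree)
p = attr->next;
while (p)
{
if (strcmp(p->attributes, max->attributes) != 0)//copy every attribute except the one with the largest gain
{
//initialize
p1 = new AttrNode;
p1->next = NULL;
strcpy_s(p1->attributes, p->attributes);
p1->attr_Num = p->attr_Num;
//head insertion
p1->next = attr_child->next;
attr_child->next = p1;
}
p = p->next;
}
//now that the attribute with the largest gain is known, build its branches (the attribute categories, stored in weight)
//because the tree is stored in first-child/next-sibling form, the first step on each level is to create the first child node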
if (strcmp("OutLook", max->attributes) == 0)
{
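//initialize the node
r = new TNode;
r->firstchild = r->nextsibling = NULL;
//label the branch leading to this node (attribute category)
strcpy_s(r->weight, OutLook_kind[0]);
T->firstchild = r;
//build the child sample list (for the first child node)
q = L->next;
while (q)
{
//copy the samples whose q->OutLook is "Sunny" into the child list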
if (strcmp(q->OutLook, OutLook_kind[0]) == 0)
{
q1 = new LNode;
strcpy_s(q1->OutLook, q->OutLook);
strcpy_s(q1->Humidity, q->Humidity);
strcpy_s(q1->Temperature, q->Temperature);
strcpy_s(q1->Wind, q->Wind);
strcpy_s(q1->PlayTennis, q->PlayTennis);
q1->next = NULL;
q1->next = link_child->next;
link_child->next = q1;
}
q = q->next;
}
}
else if (strcmp("Temperature", max->attributes) == 0)
{
r = new TNode;
r->firstchild = r->nextsibling = NULL;
strcpy_s(r->weight, Temperature_kind[0]);
T->firstchild = r;
q = L->next;
while (q)
{
if (strcmp(q->Temperature, Temperature_kind[0]) == 0)
{
q1 = new LNode;
strcpy_s(q1->OutLook, q->OutLook);
strcpy_s(q1->Humidity, q->Humidity);
strcpy_s(q1->Temperature, q->Temperature);
strcpy_s(q1->Wind, q->Wind);
strcpy_s(q1->PlayTennis, q->PlayTennis);
q1->next = NULL;
q1->next = link_child->next;
link_child->next = q1;
}
q = q->next;
}
}
else if (strcmp("Humidity", max->attributes) == 0)
{
r = new TNode;
r->firstchild = r->nextsibling = NULL;
strcpy_s(r->weight, Humidity_kind[0]);
T->firstchild = r;
q = L->next;
while (q)
{
if (strcmp(q->Humidity, Humidity_kind[0]) == 0)
{
q1 = new LNode;
strcpy_s(q1->OutLook, q->OutLook);
strcpy_s(q1->Humidity, q->Humidity);
strcpy_s(q1->Temperature, q->Temperature);
strcpy_s(q1->Wind, q->Wind);
strcpy_s(q1->PlayTennis, q->PlayTennis);
q1->next = NULL;
q1->next = link_child->next;
link_child->next = q1;
}
q = q->next;
}
}
else if (strcmp("Wind", max->attributes) == 0)
{
r = new TNode;
r->firstchild = r->nextsibling = NULL;
strcpy_s(r->weight, Wind_kind[0]);
T->firstchild = r;
q = L->next;
while (q)
{
if (strcmp(q->Wind, Wind_kind[0]) == 0)
{
q1 = new LNode;
strcpy_s(q1->OutLook, q->OutLook);
strcpy_s(q1->Humidity, q->Humidity);
strcpy_s(q1->Temperature, q->Temperature);
strcpy_s(q1->Wind, q->Wind);
strcpy_s(q1->PlayTennis, q->PlayTennis);
q1->next = NULL;
q1->next = link_child->next;
link_child->next = q1;
}
q = q->next;
}
}
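//the steps above built the child attribute list and the child sample list
int p = 0, n = 0;
PN_Num(link_child, p, n);//count positive and negative samples in the child sample list
if (p != 0 && n != 0)
{
//recurse (builds the first node of each level; T->firstchild is the key)
ID3(T->firstchild, link_child, Target_Attr, attr_child);
FreeLink(link_child);//link_child differs for every branch, so free it as soon as it has been used
}
else if (p == 0)//all samples are negative
{
strcpy_s(T->firstchild->data, "No");
FreeLink(link_child);
}
else if (n == 0)//all samples are positive
{
strcpy_s(T->firstchild->data, "Yes");
FreeLink(link_child);
}
/*
(Hypothetical) example of the structure after the first node of a level is built:
OutLook
|
Humidity -- Temperature -- Wind
*/
//build the remaining nodes on this level; in the first-child/next-sibling layout they hang off the first child as siblings
tree_p = T->firstchild;//the first node of the level already exists, so continue from its sibling pointer
for (int i = 1; i < max->attr_Num; i++)//the index starts at 1 because the first node of the level has already been built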
{
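//max is fixed for the whole level, but the branch (weight) differs, so link_child differs as well
//work out which attribute is being split on
if (strcmp("OutLook", max->attributes) == 0)
{
r = new TNode;
r->firstchild = r->nextsibling = NULL;
strcpy_s(r->weight, OutLook_kind[i]);//label the branch of the decision tree
tree_p->nextsibling = r;//attach as the next sibling on this level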
q = L->next;
while (q)
{
if (strcmp(q->OutLook, OutLook_kind[i]) == 0)
{
q1 = new LNode;
strcpy_s(q1->OutLook, q->OutLook);
strcpy_s(q1->Humidity, q->Humidity);
strcpy_s(q1->Temperature, q->Temperature);
strcpy_s(q1->Wind, q->Wind);
strcpy_s(q1->PlayTennis, q->PlayTennis);
q1->next = NULL;
q1->next = link_child->next;
link_child->next = q1;
}
q = q->next;
}
}
else if (strcmp("Temperature", max->attributes) == 0)
{
r = new TNode;
r->firstchild = r->nextsibling = NULL;
strcpy_s(r->weight, Temperature_kind[i]);
tree_p->nextsibling = r;
q = L->next;
while (q)
{
if (strcmp(q->Temperature, Temperature_kind[i]) == 0)
{
q1 = new LNode;
strcpy_s(q1->OutLook, q->OutLook);
strcpy_s(q1->Humidity, q->Humidity);
strcpy_s(q1->Temperature, q->Temperature);
strcpy_s(q1->Wind, q->Wind);
strcpy_s(q1->PlayTennis, q->PlayTennis);
q1->next = NULL;
q1->next = link_child->next;
link_child->next = q1;
}
q = q->next;
}
}
else if (strcmp("Humidity", max->attributes) == 0)
{
r = new TNode;
r->firstchild = r->nextsibling = NULL;
strcpy_s(r->weight, Humidity_kind[i]);
tree_p->nextsibling = r;
q = L->next;
while (q)
{
if (strcmp(q->Humidity, Humidity_kind[i]) == 0)
{
q1 = new LNode;
strcpy_s(q1->OutLook, q->OutLook);
strcpy_s(q1->Humidity, q->Humidity);
strcpy_s(q1->Temperature, q->Temperature);
strcpy_s(q1->Wind, q->Wind);
strcpy_s(q1->PlayTennis, q->PlayTennis);
q1->next = NULL;
q1->next = link_child->next;
link_child->next = q1;
}
q = q->next;
}
}
else if (strcmp("Wind", max->attributes) == 0)
{
r = new TNode;
r->firstchild = r->nextsibling = NULL;
strcpy_s(r->weight, Wind_kind[i]);
tree_p->nextsibling = r;
q = L->next;
while (q)
{
if (strcmp(q->Wind, Wind_kind[i]) == 0)
{
q1 = new LNode;
strcpy_s(q1->OutLook, q->OutLook);
strcpy_s(q1->Humidity, q->Humidity);
strcpy_s(q1->Temperature, q->Temperature);
strcpy_s(q1->Wind, q->Wind);
strcpy_s(q1->PlayTennis, q->PlayTennis);
q1->next = NULL;
q1->next = link_child->next;
link_child->next = q1;
}
q = q->next;
}
}
//recurse according to the positive/negative counts, using the child sample list and the child attribute list
int p = 0, n = 0;
PN_Num(link_child, p, n);
if (p != 0 && n != 0)
{
//here the node being built is the sibling node
ID3(tree_p->nextsibling, link_child, Target_Attr, attr_child);
FreeLink(link_child);
}
else if (p == 0)
{
strcpy_s(tree_p->nextsibling->data, "No");
FreeLink(link_child);
}
else if (n == 0)
{
strcpy_s(tree_p->nextsibling->data, "Yes");
FreeLink(link_child);
}
tree_p = tree_p->nextsibling;//move on until all child nodes are built
}//end of tree construction
}
else
{
//no attributes left: label the leaf with the most common PlayTennis value among the remaining samples
strcpy_s(T->data, positive >= negative ? "Yes" : "No");
}
}
void show(const char* Examples[][COL])
{
int i, j;
printf("OutLook Temperature Humidity Wind PlayTennis\n");
printf("--------------------------------------------------------------------------\n");
for (i = 0; i < ROW; i++)
{
if (strcmp(Examples[i][0], "OverCast") == 0)
printf("%s ", Examples[i][0]);
else
printf("%s\t\t", Examples[i][0]);
printf("%s\t\t%s\t\t%s\t\t%s\n", Examples[i][1], Examples[i][2], Examples[i][3], Examples[i][4]);
}
}
//test: output the predicted class
void Test(tree T, char train[4][20], char max[20], stack <char*>& a)
{
tree p;
//g++;
//queue<char*> a;
if (T == NULL)
return;
if (strcmp(T->data, max) == 0)//the root node has no incoming branch
p = T->firstchild;
else
p = T;
//cout << T->data << endl;
a.push(T->data);
int j, fig = 0;
int i = 0;
if (p != NULL)
{
for (j = 0; j < 4; j++)
{
if (strcmp(train[j], p->weight) == 0)
fig++;
}
if (fig != 0)
{
p = p->firstchild;
//cout << T->data << "----->";
Test(p, train, max, a);
}
else
{
p = p->nextsibling;
//cout << T->data << "----->";
Test(p, train, max, a);
}
//cout << T->data << endl;
}
}
//print the path taken through the decision tree
void route(tree T, char train[4][20])
{
tree p;
if (T == NULL)
return;
p = T;
int j, fig = 0;
int i = 0;
if (p != NULL)
{
for (j = 0; j < 4; j++)
{
if (strcmp(train[j], p->weight) == 0)
fig++;
}
if (fig != 0)
{
if (strcmp(T->data, "Yes") == 0 || strcmp(T->data, "No") == 0)
{
cout << "---" << '(' << T->weight << ')' << "--->" << T->data;
p = p->firstchild;
}
if (strcmp(T->data, "Yes") != 0 && strcmp(T->data, "No") != 0)
{
p = p->firstchild;
cout << "---" << '(' << T->weight << ')' << "--->" << T->data;
}
route(p, train);
}
else
{
p = p->nextsibling;
//cout << "---" << '(' << T->weight << ')' << "--->" << T->data;
route(p, train);
}
}
}
void InOrderTraverse1(tree T)
{
//preorder traversal: visit the node first, then its children, then its siblings
if (T)
{
cout << T->data << ' ';
InOrderTraverse1(T->firstchild);
InOrderTraverse1(T->nextsibling);
}
}
int TreeHeight(tree T)
{
tree p;
int h, maxh = 0;
if (!T)
return 0;
else
{
p = T->firstchild;
while (p)
{
h = TreeHeight(p);
if (maxh < h)
maxh = h;
p = p->nextsibling;
}
return (maxh + 1);
}
}
double Gain_Ratio(int positive, int negative, const char* atrribute, link L, Attributes attr_L)
{
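int atrr_kinds = 0;//number of categories of this attribute (stays 0 if the attribute is not found)
int attr_th = 0;//index of the attribute in the attribute list (built by head insertion, so the order is reversed); used for verification only
Attributes p = attr_L->next;
link q = L->next;
//look up the number of categories of this attribute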
while (p)
{
if (strcmp(p->attributes, atrribute) == 0)
{
atrr_kinds = p->attr_Num;
break;
}
p = p->next;
attr_th++;
}
//printf("attr_th:%d,atrr_kinds:%d\n", attr_th, atrr_kinds);
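double entropy = 0, gain = 0;
//information entropy (formula: entropy = -p1*log2(p1) - p2*log2(p2))
double p1 = 1.0 * positive / (positive + negative);//proportion of positive samples
double p2 = 1.0 * negative / (positive + negative);//proportion of negative samples
if (positive != 0 && negative != 0)//a pure set has zero entropy; skipping it avoids log(0)
entropy = -p1 * log(p1) / log2 - p2 * log(p2) / log2;//change-of-base formula: log2(x) = ln(x) / ln(2)
gain = entropy;
//a 3 x atrr_kinds array holding the per-category counts needed for the conditional entropy
int** kinds = new int* [3];//dynamically allocated 2-D array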
for (int j = 0; j < 3; j++)
{
kinds[j] = new int[atrr_kinds];
//printf("%d\n",kinds[j]);
//printf("kinds[%d]=%d\n",j,kinds[j]);
}
//初始化
for (int j = 0; j < 3; j++)
{
for (int i = 0; i < atrr_kinds; i++)
{
kinds[j][i] = 0;
}
}
/*初始化效果(以OutLook爲例):
Sunny OverCast Rain
總: 0 0 0
正: 0 0 0
負: 0 0 0
進行統計後效果(以OutLook爲例):
Sunny OverCast Rain
總: 5 4 5
正: 2 4 3
負: 3 0 2
*/
//Gain函數的目的是爲了求某一個屬性的信息增益(故需要在訓練樣本中找出該屬性的條件熵)
//將定義的二維數組填滿信息
while (q)
{
if (strcmp("OutLook", atrribute) == 0)
{
for (int i = 0; i < atrr_kinds; i++)//計算屬性類別的信息(樣本佔比數,正樣本數,負樣本數)
{
if (strcmp(q->OutLook, OutLook_kind[i]) == 0)
{
kinds[0][i]++;//計算樣本佔比數
if (strcmp(q->PlayTennis, "Yes") == 0)
kinds[1][i]++;//計算正樣本數
else
kinds[2][i]++;//計算負樣本數
}
}
}
else if (strcmp("Temperature", atrribute) == 0)
{
for (int i = 0; i < atrr_kinds; i++)
{
if (strcmp(q->Temperature, Temperature_kind[i]) == 0)
{
kinds[0][i]++;
if (strcmp(q->PlayTennis, "Yes") == 0)
kinds[1][i]++;
else
kinds[2][i]++;
}
}
}
else if (strcmp("Humidity", atrribute) == 0)
{
for (int i = 0; i < atrr_kinds; i++)
{
if (strcmp(q->Humidity, Humidity_kind[i]) == 0)
{
kinds[0][i]++;
if (strcmp(q->PlayTennis, "Yes") == 0)
kinds[1][i]++;//
else
kinds[2][i]++;
}
}
}
else if (strcmp("Wind", atrribute) == 0)
{
for (int i = 0; i < atrr_kinds; i++)
{
if (strcmp(q->Wind, Wind_kind[i]) == 0)
{
kinds[0][i]++;
if (strcmp(q->PlayTennis, "Yes") == 0)
kinds[1][i]++;
else
kinds[2][i]++;
}
}
}
q = q->next;
}
//計算信息增益,定義一個atrr_kinds的數組(目的存儲entropy()如上註釋)
double* gain_kind = new double[atrr_kinds];
/*
條件熵公式的計算(需要):每個屬性類別的正負樣本及其佔比,類別之間的佔比
以OotLook爲例:
entropy(S)= -2/5 * log2(2/5) - 3/5 * log2(3/5)
entropy(O)= -4/4 * log2(4/4) - 0/4 * log2(0/4)=0
entropy(R)= -3/5 * log2(3/5) - 2/5 * log2(2/5)
entropy(SOR)= 5/14entropy(S) + 4/14entropy(O) + 5/14entropy(R)
gain = entropy(信息熵) - entropy(SOR)
*/
//上方公式計算
for (int j = 0; j < atrr_kinds; j++)
{
if (kinds[0][j] != 0 && kinds[1][j] != 0 && kinds[2][j] != 0)
{
p1 = 1.0 * kinds[1][j] / kinds[0][j];
p2 = 1.0 * kinds[2][j] / kinds[0][j];
gain_kind[j] = -p1 * log(p1) / log2 - p2 * log(p2) / log2;//換底公式
gain = gain - (1.0 * kinds[0][j] / (positive + negative)) * gain_kind[j];
}
else
gain_kind[j] = 0;//通過上方註釋可得出該結論
}
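//count each category's share of all samples in order to compute the split information
double* split_kind = new double[atrr_kinds];//split-information term of each category
int* split_th = new int[atrr_kinds];//number of samples in each category
//initialize to zero
for (int i = 0; i < atrr_kinds; i++)
split_th[i] = 0;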
q = L->next;
while (q)
{
if (strcmp("OutLook", atrribute) == 0)
{
for (int i = 0; i < atrr_kinds; i++)
{
if (strcmp(q->OutLook, OutLook_kind[i]) == 0)
split_th[i]++;
}
}
else if (strcmp("Temperature", atrribute) == 0)
{
for (int i = 0; i < atrr_kinds; i++)
{
if (strcmp(q->Temperature, Temperature_kind[i]) == 0)
split_th[i]++;
}
}
else if (strcmp("Humidity", atrribute) == 0)
{
for (int i = 0; i < atrr_kinds; i++)
{
if (strcmp(q->Humidity, Humidity_kind[i]) == 0)
split_th[i]++;
}
}
else if (strcmp("Wind", atrribute) == 0)
{
for (int i = 0; i < atrr_kinds; i++)
{
if (strcmp(q->Wind, Wind_kind[i]) == 0)
split_th[i]++;
}
}
q = q->next;
}
/*
Taking OutLook as an example:
Sunny OverCast Rain
5 4 5
split_kind[0] = -5/14 * log2(5/14)
split_kind[1] = -4/14 * log2(4/14)
split_kind[2] = -5/14 * log2(5/14)
split = split_kind[0] + split_kind[1] + split_kind[2]
gain_ratio = gain/split
*/
//cout << "information gain of this attribute: " << gain << endl;
double num;
double split = 0;
/*
for (int i = 0; i < atrr_kinds; i++)
{
cout << split_th[i] << endl;
}*/
for (int i = 0; i < atrr_kinds; i++)
{
if (split_th[i] != 0 )
{
num = 1.0 * split_th[i] / (positive + negative);
split_kind[i] = -1 * num * log(num) / log2;//num lies in (0,1], so log(num) is finite
}
else
split_kind[i] = 0;//an empty category contributes nothing
split += split_kind[i];
}
//cout << split <<','<<split_kind[0]<<','<<split_kind[1]<<','<<split_kind[2]<< endl;
double gain_ratio;
if (gain != 0 && split != 0)//also guard against dividing by a zero split information
gain_ratio = gain / split;
else
gain_ratio = 0;
//release the temporary arrays
for (int j = 0; j < 3; j++)
delete[] kinds[j];
delete[] kinds;
delete[] gain_kind;
delete[] split_kind;
delete[] split_th;
return gain_ratio;
}
void C4_5(tree& T, link L, link Target_Attr, Attributes attr)
{
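//p and p1 are helpers for building attr_child; max points to the attribute with the largest gain ratio
Attributes p, max, attr_child, p1;
//q and q1 are helpers for building link_child
link q, link_child, q1;
//r is used to create new tree nodes
//tree_p is the node currently being extended once the first node of a level is built (moving from T to T->firstchild)
tree r, tree_p;
//count the positive and negative samples in the training set
int positive = 0, negative = 0;
PN_Num(L, positive, negative);
//initialize the two child collections (they are the key to building the tree)
attr_child = new AttrNode;
attr_child->next = NULL;
link_child = new LNode;
link_child->next = NULL;
if (positive == 0)//all samples are negative: this node is a "No" leaf
{
strcpy_s(T->data, "No");
delete attr_child;
delete link_child;
return;
}
else if (negative == 0)//all samples are positive: this node is a "Yes" leaf
{
strcpy_s(T->data, "Yes");
delete attr_child;
delete link_child;
return;
}
p = attr->next; //attribute list
max = p;//default to the first attribute so that max is never used uninitialized when no gain ratio exceeds 0
double gain, g = 0;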
if (p)
{
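//find the attribute to split on
while (p)
{
//find which attribute has the largest information gain ratio; it becomes the root of this (sub)tree
gain = Gain_Ratio(positive, negative, p->attributes, L, attr);
cout << "Information gain ratio of " << p->attributes << ": " << gain << endl;
if (gain > g)
{
g = gain;
max = p;
}
p = p->next;
}
strcpy_s(T->data, max->attributes);//store the chosen attribute in the tree node
cout << "Attribute with the largest information gain ratio: max->attributes = " << max->attributes << endl << endl;
//create the child attribute list (only needed once per level of the tree)
p = attr->next;
while (p)
{
if (strcmp(p->attributes, max->attributes) != 0)//copy every attribute except the one with the largest gain ratio
{
//initialize
p1 = new AttrNode;
p1->next = NULL;
strcpy_s(p1->attributes, p->attributes);
p1->attr_Num = p->attr_Num;
//head insertion
p1->next = attr_child->next;
attr_child->next = p1;
}
p = p->next;
}
//now that the attribute with the largest gain ratio is known, build its branches (the attribute categories, stored in weight)
//because the tree is stored in first-child/next-sibling form, the first step on each level is to create the first child node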
if (strcmp("OutLook", max->attributes) == 0)
{
//結點初始化
r = new TNode;
r->firstchild = r->nextsibling = NULL;
//爲結點的分支賦值(屬性類別)
strcpy_s(r->weight, OutLook_kind[0]);
T->firstchild = r;
//建立樣本子鏈表(目的找第一個孩子結點)
q = L->next;
while (q)
{
//將q->OutLook爲“Sunny”的數據進行鏈表建立
if (strcmp(q->OutLook, OutLook_kind[0]) == 0)
{
q1 = new LNode;
strcpy_s(q1->OutLook, q->OutLook);
strcpy_s(q1->Humidity, q->Humidity);
strcpy_s(q1->Temperature, q->Temperature);
strcpy_s(q1->Wind, q->Wind);
strcpy_s(q1->PlayTennis, q->PlayTennis);
q1->next = NULL;
q1->next = link_child->next;
link_child->next = q1;
}
q = q->next;
}
}
else if (strcmp("Temperature", max->attributes) == 0)
{
r = new TNode;
r->firstchild = r->nextsibling = NULL;
strcpy_s(r->weight, Temperature_kind[0]);
T->firstchild = r;
q = L->next;
while (q)
{
if (strcmp(q->Temperature, Temperature_kind[0]) == 0)
{
q1 = new LNode;
strcpy_s(q1->OutLook, q->OutLook);
strcpy_s(q1->Humidity, q->Humidity);
strcpy_s(q1->Temperature, q->Temperature);
strcpy_s(q1->Wind, q->Wind);
strcpy_s(q1->PlayTennis, q->PlayTennis);
q1->next = NULL;
q1->next = link_child->next;
link_child->next = q1;
}
q = q->next;
}
}
else if (strcmp("Humidity", max->attributes) == 0)
{
r = new TNode;
r->firstchild = r->nextsibling = NULL;
strcpy_s(r->weight, Humidity_kind[0]);
T->firstchild = r;
q = L->next;
while (q)
{
if (strcmp(q->Humidity, Humidity_kind[0]) == 0)
{
q1 = new LNode;
strcpy_s(q1->OutLook, q->OutLook);
strcpy_s(q1->Humidity, q->Humidity);
strcpy_s(q1->Temperature, q->Temperature);
strcpy_s(q1->Wind, q->Wind);
strcpy_s(q1->PlayTennis, q->PlayTennis);
q1->next = NULL;
q1->next = link_child->next;
link_child->next = q1;
}
q = q->next;
}
}
else if (strcmp("Wind", max->attributes) == 0)
{
r = new TNode;
r->firstchild = r->nextsibling = NULL;
strcpy_s(r->weight, Wind_kind[0]);
T->firstchild = r;
q = L->next;
while (q)
{
if (strcmp(q->Wind, Wind_kind[0]) == 0)
{
q1 = new LNode;
strcpy_s(q1->OutLook, q->OutLook);
strcpy_s(q1->Humidity, q->Humidity);
strcpy_s(q1->Temperature, q->Temperature);
strcpy_s(q1->Wind, q->Wind);
strcpy_s(q1->PlayTennis, q->PlayTennis);
q1->next = NULL;
q1->next = link_child->next;
link_child->next = q1;
}
q = q->next;
}
}
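//the steps above built the child attribute list and the child sample list
int p = 0, n = 0;
PN_Num(link_child, p, n);//count positive and negative samples in the child sample list
if (p != 0 && n != 0)
{
//recurse (builds the first node of each level; T->firstchild is the key)
C4_5(T->firstchild, link_child, Target_Attr, attr_child);
FreeLink(link_child);//link_child differs for every branch, so free it as soon as it has been used
}
else if (p == 0)//all samples are negative
{
strcpy_s(T->firstchild->data, "No");
FreeLink(link_child);
}
else if (n == 0)//all samples are positive
{
strcpy_s(T->firstchild->data, "Yes");
FreeLink(link_child);
}
/*
(Hypothetical) example of the structure after the first node of a level is built:
OutLook
|
Humidity -- Temperature -- Wind
*/
//build the remaining nodes on this level; in the first-child/next-sibling layout they hang off the first child as siblings
tree_p = T->firstchild;//the first node of the level already exists, so continue from its sibling pointer
for (int i = 1; i < max->attr_Num; i++)//the index starts at 1 because the first node of the level has already been built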
{
//每層的max是固定的,但是分支(weight)是不一樣的,因此對應的link_child也不一樣
//需要區分出是哪一種屬性
if (strcmp("OutLook", max->attributes) == 0)
{
r = new TNode;
r->firstchild = r->nextsibling = NULL;
strcpy_s(r->weight, OutLook_kind[i]); // label of this branch of the decision tree
tree_p->nextsibling = r; // attach as the next sibling on this level
q = L->next;
while (q)
{
if (strcmp(q->OutLook, OutLook_kind[i]) == 0)
{
q1 = new LNode;
strcpy_s(q1->OutLook, q->OutLook);
strcpy_s(q1->Humidity, q->Humidity);
strcpy_s(q1->Temperature, q->Temperature);
strcpy_s(q1->Wind, q->Wind);
strcpy_s(q1->PlayTennis, q->PlayTennis);
q1->next = NULL;
q1->next = link_child->next;
link_child->next = q1;
}
q = q->next;
}
}
else if (strcmp("Temperature", max->attributes) == 0)
{
r = new TNode;
r->firstchild = r->nextsibling = NULL;
strcpy_s(r->weight, Temperature_kind[i]);
tree_p->nextsibling = r;
q = L->next;
while (q)
{
if (strcmp(q->Temperature, Temperature_kind[i]) == 0)
{
q1 = new LNode;
strcpy_s(q1->OutLook, q->OutLook);
strcpy_s(q1->Humidity, q->Humidity);
strcpy_s(q1->Temperature, q->Temperature);
strcpy_s(q1->Wind, q->Wind);
strcpy_s(q1->PlayTennis, q->PlayTennis);
q1->next = NULL;
q1->next = link_child->next;
link_child->next = q1;
}
q = q->next;
}
}
else if (strcmp("Humidity", max->attributes) == 0)
{
r = new TNode;
r->firstchild = r->nextsibling = NULL;
strcpy_s(r->weight, Humidity_kind[i]);
tree_p->nextsibling = r;
q = L->next;
while (q)
{
if (strcmp(q->Humidity, Humidity_kind[i]) == 0)
{
q1 = new LNode;
strcpy_s(q1->OutLook, q->OutLook);
strcpy_s(q1->Humidity, q->Humidity);
strcpy_s(q1->Temperature, q->Temperature);
strcpy_s(q1->Wind, q->Wind);
strcpy_s(q1->PlayTennis, q->PlayTennis);
q1->next = NULL;
q1->next = link_child->next;
link_child->next = q1;
}
q = q->next;
}
}
else if (strcmp("Wind", max->attributes) == 0)
{
r = new TNode;
r->firstchild = r->nextsibling = NULL;
strcpy_s(r->weight, Wind_kind[i]);
tree_p->nextsibling = r;
q = L->next;
while (q)
{
if (strcmp(q->Wind, Wind_kind[i]) == 0)
{
q1 = new LNode;
strcpy_s(q1->OutLook, q->OutLook);
strcpy_s(q1->Humidity, q->Humidity);
strcpy_s(q1->Temperature, q->Temperature);
strcpy_s(q1->Wind, q->Wind);
strcpy_s(q1->PlayTennis, q->PlayTennis);
q1->next = NULL;
q1->next = link_child->next;
link_child->next = q1;
}
q = q->next;
}
}
// Use the positive/negative counts, the sample sub-list, and the attribute sub-list to build the tree recursively
int p = 0, n = 0;
PN_Num(link_child, p, n);
if (p != 0 && n != 0)
{
// Here the node being built is the sibling
C4_5(tree_p->nextsibling, link_child, Target_Attr, attr_child);
FreeLink(link_child);
}
else if (p == 0)
{
strcpy_s(tree_p->nextsibling->data, "No");
FreeLink(link_child);
}
else if (n == 0)
{
strcpy_s(tree_p->nextsibling->data, "Yes");
FreeLink(link_child);
}
tree_p = tree_p->nextsibling; // move on until every child node is built
} // end of decision tree construction
}
else
{
// No further split is possible: label this node with the most common PlayTennis value among the samples in L
int p = 0, n = 0;
PN_Num(L, p, n);
strcpy_s(T->data, p >= n ? "Yes" : "No");
}
}
void show_txt(link LL, const char* Attributes_kind[], const char* Examples[][COL])
{
FILE* fp;
if ((fp = fopen("train.txt", "w+")) == NULL)
{
printf("File open error!\n");
exit(0);
}
fprintf(fp, "%s: %s %s %s\n", Attributes_kind[0], OutLook_kind[0], OutLook_kind[1], OutLook_kind[2]);
fprintf(fp, "%s: %s %s %s\n", Attributes_kind[1], Temperature_kind[0], Temperature_kind[1], Temperature_kind[2]);
fprintf(fp, "%s: %s %s\n", Attributes_kind[2], Humidity_kind[0], Humidity_kind[1]);
fprintf(fp, "%s: %s %s\n", Attributes_kind[3], Wind_kind[0], Wind_kind[1]);
fprintf(fp, "\n\n");
fprintf(fp, "%s %s %s %s PlayTennis\n", Attributes_kind[0], Attributes_kind[1], Attributes_kind[2], Attributes_kind[3]);
for (int i = 0; i < ROW; i++)
{
fprintf(fp, "%s ", Examples[i][0]);
fprintf(fp, "%s ", Examples[i][1]);
fprintf(fp, "%s ", Examples[i][2]);
fprintf(fp, "%s ", Examples[i][3]);
fprintf(fp, "%s\n", Examples[i][4]);
}
fprintf(fp, "\n\n");
int positive = 0, negative = 0;
PN_Num(LL, positive, negative);
fprintf(fp, "Positive samples: %d; negative samples: %d\n", positive, negative);
if (fclose(fp)) {
printf("Can not close the file!\n");
exit(0);
}
}
// Graphviz rendering: what you describe is what you get
void graphic1(tree T)
{
FILE* stream1;
// freopen_s reports failure through its return value; stdout itself is never NULL
if (freopen_s(&stream1, "graph1.dot", "w+", stdout) != 0)
{
fprintf(stderr, "File open error!\n");
exit(0);
}
cout << "digraph G{" << endl;
cout << T->data << "->" << T->firstchild->data << '[' << "label=" << '"' << T->firstchild->weight << '"' << ']' << ';' << endl;
cout << T->firstchild->data << "->" << T->firstchild->firstchild->data << '1' << '[' << "label=" << '"' << T->firstchild->firstchild->weight << '"' << ']' << ';' << endl;
cout << T->firstchild->data << "->" << T->firstchild->firstchild->nextsibling->data << '1' << '[' << "label=" << '"' << T->firstchild->firstchild->nextsibling->weight << '"' << ']' << ';' << endl;
cout << T->data << "->" << T->firstchild->nextsibling->data << '2' << '[' << "label=" << '"' << T->firstchild->nextsibling->weight << '"' << ']' << ';' << endl;
cout << T->data << "->" << T->firstchild->nextsibling->nextsibling->data << '[' << "label=" << '"' << T->firstchild->nextsibling->nextsibling->weight << '"' << ']' << ';' << endl;
cout << T->firstchild->nextsibling->nextsibling->data<<"->"<<T->firstchild->nextsibling->nextsibling->firstchild->data << '[' << "label=" << '"' << T->firstchild->nextsibling->nextsibling->firstchild->weight << '"' << ']' << ';' << endl;
cout << T->firstchild->nextsibling->nextsibling->data << "->" << T->firstchild->nextsibling->nextsibling->firstchild->nextsibling->data << '[' << "label=" << '"' << T->firstchild->nextsibling->nextsibling->firstchild->nextsibling->weight << '"' << ']' << ';' << endl;
cout << "}" << endl;
if (fclose(stdout))
{
fprintf(stderr, "Can not close the file!\n"); // stdout is unusable here, so report on stderr
exit(0);
}
system("dot -Tpng graph1.dot -o sample1.png");
}
void graphic2(tree T)
{
FILE* stream1;
// freopen_s reports failure through its return value; stdout itself is never NULL
if (freopen_s(&stream1, "graph2.dot", "w+", stdout) != 0)
{
fprintf(stderr, "File open error!\n");
exit(0);
}
cout << "digraph G{" << endl;
cout << T->data << "->" << T->firstchild->data <<'1'<< '[' << "label=" << '"' << T->firstchild->weight << '"' << ']' << ';' << endl;
cout << T->data << "->" << T->firstchild->nextsibling->data<<'1' << '[' << "label=" << '"' << T->firstchild->nextsibling->weight << '"' << ']' << ';' << endl;
cout<<T->firstchild->data<<'1'<<"->"<<T->firstchild->firstchild->data<<'1' << '[' << "label=" << '"' << T->firstchild->firstchild->weight << '"' << ']' << ';' << endl;
cout<<T->firstchild->data<<'1'<<"->"<<T->firstchild->firstchild->nextsibling->data<<'1' << '[' << "label=" << '"' << T->firstchild->firstchild->nextsibling->weight << '"' << ']' << ';' << endl;
cout<<T->firstchild->data<<'1'<<"->"<<T->firstchild->firstchild->nextsibling->nextsibling->data<<'2'<< '[' << "label=" << '"' << T->firstchild->firstchild->nextsibling->nextsibling->weight << '"' << ']' << ';' << endl;
cout<<T->firstchild->firstchild->nextsibling->nextsibling->data<<'2'<<"->"<<T->firstchild->firstchild->nextsibling->nextsibling->firstchild->data <<'2'<< '[' << "label=" << '"' << T->firstchild->firstchild->nextsibling->nextsibling->firstchild->weight << '"' << ']' << ';' << endl;
cout<<T->firstchild->firstchild->nextsibling->nextsibling->data<<'2'<<"->"<<T->firstchild->firstchild->nextsibling->nextsibling->firstchild->nextsibling->data<<'2' << '[' << "label=" << '"' << T->firstchild->firstchild->nextsibling->nextsibling->firstchild->nextsibling->weight << '"' << ']' << ';' << endl;
cout<<T->firstchild->nextsibling->data<<'1'<<"->"<<T->firstchild->nextsibling->firstchild->data<<'3' << '[' << "label=" << '"' << T->firstchild->nextsibling->firstchild->weight << '"' << ']' << ';' << endl;
cout<<T->firstchild->nextsibling->data<<'1'<<"->"<<T->firstchild->nextsibling->firstchild->nextsibling->data <<'2'<< '[' << "label=" << '"' << T->firstchild->nextsibling->firstchild->nextsibling->weight << '"' << ']' << ';' << endl;
cout<<T->firstchild->nextsibling->firstchild->nextsibling->data<<'2'<<"->"<<T->firstchild->nextsibling->firstchild->nextsibling->firstchild->data <<'4'<< '[' << "label=" << '"' << T->firstchild->nextsibling->firstchild->nextsibling->firstchild->weight << '"' << ']' << ';' << endl;
cout<<T->firstchild->nextsibling->firstchild->nextsibling->data<<'2'<<"->"<<T->firstchild->nextsibling->firstchild->nextsibling->firstchild->nextsibling->data <<'5'<< '[' << "label=" << '"' << T->firstchild->nextsibling->firstchild->nextsibling->firstchild->nextsibling->weight << '"' << ']' << ';' << endl;
cout<<T->firstchild->nextsibling->firstchild->nextsibling->data<<'2'<<"->"<<T->firstchild->nextsibling->firstchild->nextsibling->firstchild->nextsibling->nextsibling->data<<'4' << '[' << "label=" << '"' << T->firstchild->nextsibling->firstchild->nextsibling->firstchild->nextsibling->nextsibling->weight << '"' << ']' << ';' << endl;
cout << "}" << endl;
if (fclose(stdout))
{
fprintf(stderr, "Can not close the file!\n"); // stdout is unusable here, so report on stderr
exit(0);
}
system("dot -Tpng graph2.dot -o sample2.png");
}
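The four attribute branches in C4_5 above differ only in which field of a sample they compare. As a hedged sketch (not the article's code), that choice could be factored into one helper by using a pointer to member. `Field`, `field_for`, `copy_matching`, and `free_samples` are hypothetical names introduced here; the `LNode` layout mirrors the one in pch.h, and portable `std::strcmp` plus a struct copy stand in for the per-field `strcpy_s` calls.

```cpp
#include <cassert>
#include <cstring>

// Mirrors the LNode declared in pch.h.
struct LNode {
    char OutLook[20];
    char Temperature[20];
    char Humidity[20];
    char Wind[20];
    char PlayTennis[20];
    LNode* next;
};

// Pointer to one of the char[20] members of LNode (hypothetical alias).
using Field = char (LNode::*)[20];

// Maps an attribute name to the matching member; returns nullptr if unknown.
Field field_for(const char* attribute) {
    if (std::strcmp(attribute, "OutLook") == 0) return &LNode::OutLook;
    if (std::strcmp(attribute, "Temperature") == 0) return &LNode::Temperature;
    if (std::strcmp(attribute, "Humidity") == 0) return &LNode::Humidity;
    if (std::strcmp(attribute, "Wind") == 0) return &LNode::Wind;
    return nullptr;
}

// Copies every sample whose `field` equals `value` onto the front of the
// list headed by the dummy node `dest`; returns the number of copies made.
int copy_matching(const LNode* src_head, Field field, const char* value, LNode* dest) {
    assert(field != nullptr);
    int copied = 0;
    for (const LNode* q = src_head->next; q != nullptr; q = q->next) {
        if (std::strcmp(q->*field, value) == 0) {
            LNode* q1 = new LNode(*q);  // copies all five fields at once
            q1->next = dest->next;      // head insertion, as in the article
            dest->next = q1;
            ++copied;
        }
    }
    return copied;
}

// Releases every node after the dummy head and leaves the list empty.
void free_samples(LNode* head) {
    while (head->next != nullptr) {
        LNode* doomed = head->next;
        head->next = doomed->next;
        delete doomed;
    }
}
```

With a helper of this kind, each branch would reduce to one call such as `copy_matching(L, field_for(max->attributes), value, link_child)`, where `value` is the branch label taken from the matching `*_kind` array.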
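2.1.1 Installing Graphviz (visualizing the decision tree)
DOT is a language for describing graphs, and Graphviz is the toolkit that processes it; it is well suited to visualizing tree structures.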
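To make the DOT format concrete, here is a minimal sketch that walks a first-child / next-sibling tree of any shape and emits one DOT statement per node and per edge. It is an illustration under stated assumptions, not the article's graphic1/graphic2: `Node`, `emit_subtree`, and `to_dot` are hypothetical names, and `Node` only mirrors the role of the article's `TNode`.

```cpp
#include <cassert>
#include <ostream>
#include <sstream>
#include <string>

// Hypothetical node type playing the role of the article's TNode.
struct Node {
    std::string data;    // attribute name, or "Yes"/"No" on a leaf
    std::string weight;  // label of the edge coming from the parent
    Node* firstchild = nullptr;
    Node* nextsibling = nullptr;
};

// Emits the node statement for `node`, then recurses into its children and
// emits one labelled edge per child. `next_id` hands out unique ids so that
// repeated labels (several "Yes" leaves, for example) stay distinct nodes.
static int emit_subtree(const Node* node, std::ostream& out, int& next_id) {
    const int id = next_id++;
    out << "  n" << id << " [label=\"" << node->data << "\"];\n";
    for (const Node* c = node->firstchild; c != nullptr; c = c->nextsibling) {
        const int child_id = emit_subtree(c, out, next_id);
        out << "  n" << id << " -> n" << child_id
            << " [label=\"" << c->weight << "\"];\n";
    }
    return id;
}

// Returns the whole tree as the text of a DOT digraph.
std::string to_dot(const Node* root) {
    assert(root != nullptr);
    std::ostringstream out;
    int next_id = 0;
    out << "digraph G {\n";
    emit_subtree(root, out, next_id);
    out << "}\n";
    return out.str();
}
```

Because node ids are generated rather than derived from `data`, the sketch does not need the hand-written '1', '2' suffixes, and it does not dereference children that a particular tree may not have.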
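Download graphviz-2.38.msi from:
https://graphviz.gitlab.io/_pages/Download/Download_windows.html
Run the downloaded file to open the installer.
Keep clicking Next until the installation path is requested. A default path is offered, but I suggest a path you know well, which makes configuring the environment variable easier later.
After installation, configure the environment variable:
Search -> Edit the system environment variables -> Advanced -> Environment Variables -> find Path under System variables -> add the installation path followed by bin
Some machines need a restart for the environment variable to take effect.
The gvedit program in the bin folder launches the tool directly.
For the DOT language, see: https://blog.csdn.net/STILLxjy/article/details/86004519
Note: the tool is not very friendly to drive from C++; as used here, it can essentially be invoked only once per run.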
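The once-per-run limit most likely comes from graphic1 and graphic2 redirecting stdout with freopen_s and then closing it, after which nothing in the process can print to stdout again. A hedged alternative is to write the DOT text through a separate file stream and leave stdout alone. The helper names `write_dot_file`, `dot_command`, and `render_png` are hypothetical; the command line is the one the article already passes to `system`, and `dot` is assumed to be on PATH.

```cpp
#include <cstdlib>
#include <fstream>
#include <string>

// Writes `dot_text` to `dot_path` through its own stream, leaving stdout open.
// Returns false if the file cannot be opened or written.
bool write_dot_file(const std::string& dot_path, const std::string& dot_text) {
    std::ofstream out(dot_path);
    if (!out) return false;
    out << dot_text;
    out.close();
    return !out.fail();
}

// Builds the same command line the article passes to system().
std::string dot_command(const std::string& dot_path, const std::string& png_path) {
    return "dot -Tpng " + dot_path + " -o " + png_path;
}

// Writes the .dot file and runs Graphviz on it; requires `dot` on PATH.
bool render_png(const std::string& dot_path, const std::string& png_path,
                const std::string& dot_text) {
    if (!write_dot_file(dot_path, dot_text)) return false;
    return std::system(dot_command(dot_path, png_path).c_str()) == 0;
}
```

Since no global stream is closed, `render_png` can be called for graph1.dot and then again for graph2.dot within the same run.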
2.2 Implementing the CART algorithm in Python
- 1. Install Graphviz (as above)
- 2. pip install graphviz
- 3. pip install pydotplus
Dataset (choosing the contact lenses that suit you)
# -*- coding: UTF-8 -*-
from io import StringIO  # sklearn.externals.six is no longer available in recent scikit-learn releases
from sklearn.preprocessing import LabelEncoder
from sklearn import tree
import pandas as pd
import pydotplus
import os

if __name__ == '__main__':
    # If the environment variable was not picked up, add Graphviz to PATH in code; E:/Graphviz-2.38/bin is my installation path
    os.environ["PATH"] += os.pathsep + 'E:/Graphviz-2.38/bin'
    with open('D:/desktop/experience/lenses1.txt', 'r') as fr:  # load the file
        lenses = [inst.strip().split('\t') for inst in fr.readlines()]  # split each row on tabs
    # print(lenses)
    lenses_target = []  # class label of each row
    for each in lenses:
        # print(each[-1])
        lenses_target.append(each[-1])
    # print(lenses_target)
    lensesLabels = ['age', 'prescript', 'astigmatic', 'tearRate']  # feature labels
    lenses_list = []  # temporary list holding one column of the lenses data
    lenses_dict = {}  # column dictionary used to build the pandas DataFrame
    for each_label in lensesLabels:  # collect each column into the dictionary
        for each in lenses:
            lenses_list.append(each[lensesLabels.index(each_label)])
        lenses_dict[each_label] = lenses_list
        lenses_list = []
    # print(lenses_dict)  # print the dictionary
    lenses_pd = pd.DataFrame(lenses_dict)  # build the pandas.DataFrame
    print(lenses_pd)  # print the pandas.DataFrame
    le = LabelEncoder()  # encodes string categories as integers
    for col in lenses_pd.columns:  # encode every column
        lenses_pd[col] = le.fit_transform(lenses_pd[col])
    print(lenses_pd)  # print the encoded data
    clf = tree.DecisionTreeClassifier(max_depth=4)  # create the DecisionTreeClassifier
    clf = clf.fit(lenses_pd.values.tolist(), lenses_target)  # fit the decision tree to the data
    dot_data = StringIO()
    tree.export_graphviz(clf, out_file=dot_data,  # export the tree as DOT text
                         feature_names=list(lenses_pd.columns),
                         class_names=list(clf.classes_),
                         filled=True, rounded=True,
                         special_characters=True)
    graph = pydotplus.graph_from_dot_data(dot_data.getvalue())
    graph.write_pdf("tree1.pdf")  # save the rendered decision tree as a PDF
Decision tree visualization: