Python Machine Learning: Handling the Titanic Data with a Random Forest

Original post

龙在水中游

2020-06-16 08:55

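Random forest is an ensemble method.

An ensemble method combines several classifiers into a single model.

A random forest is an ensemble built from many decision trees.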
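As a minimal sketch of what "combining several classifiers" can mean, the snippet below merges the predictions of three classifiers by majority vote. The function name `majority_vote` and the toy predictions are hypothetical and exist only for illustration. Note that scikit-learn's random forest combines its trees by averaging their predicted class probabilities rather than by a hard vote like this one.

```python
from collections import Counter


def majority_vote(predictions_per_model):
    """Combine class predictions from several models by majority vote.

    predictions_per_model: list of equal-length lists, one list per model.
    Returns one combined label per sample.
    """
    combined = []
    # zip(*...) regroups the predictions sample by sample.
    for votes in zip(*predictions_per_model):
        # most_common(1) returns [(label, count)] for the most frequent label.
        combined.append(Counter(votes).most_common(1)[0][0])
    return combined


# Three hypothetical classifiers predicting survived (1) or not (0)
# for four passengers.
tree_a = [1, 0, 1, 0]
tree_b = [1, 1, 0, 0]
tree_c = [0, 0, 1, 0]

ensemble_prediction = majority_vote([tree_a, tree_b, tree_c])
print(ensemble_prediction)  # [1, 0, 1, 0]
```

With an odd number of voters and two classes there are no ties, which is why the sketch uses three classifiers.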

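It relies on two sources of randomness:

1. Random training set: from the N training samples, draw N samples at random with replacement.

2. Random features: from the M features, draw m features at random, where M >> m.

Because each tree only works with a small subset of the features, this acts much like a dimensionality-reduction step.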
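The two kinds of randomness can be sketched with the standard library alone. This is an illustration, not how scikit-learn implements it: the helper names `bootstrap_indices` and `random_feature_subset` are hypothetical, and the feature list beyond `pclass`, `age` and `sex` is made up to get a larger M. One difference worth knowing is that scikit-learn's `RandomForestClassifier` draws a fresh feature subset at every split (controlled by `max_features`), not once per tree.

```python
import random


def bootstrap_indices(n_samples, rng):
    """Draw n_samples row indices with replacement (a bootstrap sample)."""
    return [rng.randrange(n_samples) for _ in range(n_samples)]


def random_feature_subset(feature_names, m, rng):
    """Pick m distinct features out of the full list."""
    return rng.sample(feature_names, m)


rng = random.Random(22)  # fixed seed so the sketch is repeatable
n_samples = 1000
# Hypothetical feature list with M = 6.
features = ["pclass", "age", "sex", "fare", "embarked", "sibsp"]

rows = bootstrap_indices(n_samples, rng)
subset = random_feature_subset(features, 2, rng)

# Sampling with replacement repeats some rows and leaves others out;
# for large N roughly 63% of the distinct rows appear in each sample.
unique_fraction = len(set(rows)) / n_samples
print("distinct rows in the bootstrap sample:", unique_fraction)
print("features drawn:", subset)
```

Each tree in the forest would be trained on its own `rows`, so the trees differ from one another even though they all come from the same training set.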

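Advantages: 1. It is highly accurate and usually performs better than any of the single classifiers it is built from.

2. It suits large datasets and can handle high-dimensional features without a separate dimensionality-reduction step.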
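The accuracy claim is easy to check empirically. The sketch below compares a single decision tree with a random forest using cross-validation. It runs on synthetic data from `make_classification`, not on the Titanic data, and the sample size, feature counts and seed are arbitrary choices. The scores depend on the data and the seed, and the forest is not guaranteed to win on every dataset.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

# Synthetic data: 1000 samples, 20 features, 5 of them informative.
X, y = make_classification(
    n_samples=1000, n_features=20, n_informative=5, random_state=22
)

tree = DecisionTreeClassifier(random_state=22)
forest = RandomForestClassifier(n_estimators=100, random_state=22)

# Mean accuracy over 3 cross-validation folds for each model.
tree_score = cross_val_score(tree, X, y, cv=3).mean()
forest_score = cross_val_score(forest, X, y, cv=3).mean()

print("single tree:  ", tree_score)
print("random forest:", forest_score)
```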

The Python code is as follows.

First, import the random forest class and the other libraries the code needs:

import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction import DictVectorizer
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import train_test_split

Define the function randForest():

def randForest():
    # Raw string so the backslashes in the Windows path are kept literally
    path = r"E:\data\titanic.csv"
    titanic = pd.read_csv(path)
    # 1. Select the feature columns and the target column
    # .copy() gives an independent DataFrame that is safe to modify
    x = titanic[["pclass", "age", "sex"]].copy()
    y = titanic["survived"]
    # 2. Feature processing
    # (1) Fill missing ages with the mean age
    x["age"] = x["age"].fillna(x["age"].mean())
    # (2) Convert each row to a dict, the input format DictVectorizer expects
    x = x.to_dict(orient="records")
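    # 3. Split into training and test sets
    x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=22)
    # 4. Dictionary feature extraction (string features are one-hot encoded)
    transfer = DictVectorizer()
    x_train = transfer.fit_transform(x_train)
    x_test = transfer.transform(x_test)
    # 5. Random forest with grid search
    # Unlike the decision-tree version, the estimator here is a random forest
    estimator = RandomForestClassifier()
    # n_estimators is the number of trees; max_depth caps the depth of each tree
    param_dict = {"n_estimators": [120, 200, 300, 800, 1200], "max_depth": [5, 8, 15, 25, 30]}
    # 25 parameter combinations x 3 folds = 75 fits, plus one final refit
    estimator = GridSearchCV(estimator, param_grid=param_dict, cv=3)
    estimator.fit(x_train, y_train)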

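    # 6. Model evaluation
    # (1) Compare the predictions with the true values directly
    y_predict = estimator.predict(x_test)
    print("y_predict:\n", y_predict)
    print("Predictions compared with true values:\n", y_test == y_predict)
    # (2) Compute the accuracy on the test set
    score = estimator.score(x_test, y_test)
    print("Accuracy:\n", score)
    # Best parameter combination: best_params_
    print("Best parameters:\n", estimator.best_params_)
    # Best mean cross-validation score: best_score_
    print("Best score:\n", estimator.best_score_)
    # Estimator refit with the best parameters: best_estimator_
    print("Best estimator:\n", estimator.best_estimator_)
    # Full cross-validation results: cv_results_
    print("Cross-validation results:\n", estimator.cv_results_)
    return None


if __name__ == "__main__":
    randForest()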
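To make the dictionary feature extraction step more concrete, here is a small standalone sketch of what `DictVectorizer` does to records shaped like the output of `x.to_dict(orient="records")`. The two passenger records are hypothetical, and the string values `"1st"` and `"3rd"` for `pclass` are an assumption about how the CSV encodes passenger class. If `pclass` is numeric in your file, it is passed through as a single numeric column instead.

```python
from sklearn.feature_extraction import DictVectorizer

# Two hypothetical passenger records, one dict per row.
records = [
    {"pclass": "1st", "age": 29.0, "sex": "female"},
    {"pclass": "3rd", "age": 40.0, "sex": "male"},
]

# sparse=False returns a dense NumPy array, which is easier to read.
transfer = DictVectorizer(sparse=False)
matrix = transfer.fit_transform(records)

# String values become one-hot columns named "feature=value";
# numeric values such as age are passed through unchanged.
print(transfer.feature_names_)
print(matrix)
```

The function above keeps the default sparse output, which `RandomForestClassifier` accepts directly. The dense form is used here only so the result can be printed and inspected.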
