A Detailed Guide to Using Random Forests in scikit-learn

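I recently studied random forests and had planned to write my own summary of the algorithm, but there is already a very good blog post on the topic, so I will point readers to that instead. Here I focus on how to use the random forest algorithm in scikit-learn.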

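The scikit-learn class for random forest classification is RandomForestClassifier (see the official documentation for the full description). Its main parameters, attributes and methods are listed below.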

The constructor of the class is:

RandomForestClassifier(n_estimators=10, criterion='gini', max_depth=None, min_samples_split=2, min_samples_leaf=1,
min_weight_fraction_leaf=0.0, max_features='auto', max_leaf_nodes=None, min_impurity_decrease=0.0,
min_impurity_split=None, bootstrap=True, oob_score=False, n_jobs=1, random_state=None,
verbose=0, warm_start=False, class_weight=None)

The constructor parameters are as follows:

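Parameters (params):
    n_estimators: int
        The number of decision trees in the forest. The default is 10 in the version described here (scikit-learn 0.22 and later default to 100).

    criterion: string
        The function used to measure the quality of a split: "entropy" (information gain) or "gini" (Gini impurity). The default is "gini".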

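    max_features: int, float, string or None, default "auto"
        The number of features to consider when looking for the best split, i.e. the size of the random subset of features examined at each node.
        int: consider max_features features at each split
        float: max_features is a fraction; (max_features * n_features) features are considered at each split
        "auto": max_features = sqrt(n_features)
        "sqrt": max_features = sqrt(n_features), the same as "auto"
        "log2": max_features = log2(n_features)
        None: max_features = n_features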

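    max_depth: int or None, default None
        The maximum depth of each tree. If None, nodes are expanded until all leaves are pure or contain fewer than min_samples_split samples.

    min_samples_split: int or float, default 2
        The minimum number of samples required to split an internal node.
        int: the value is used directly as the minimum number
        float: the value is a fraction, and the minimum number is min_samples_split * n_samples

    min_samples_leaf: int or float, default 1
        The minimum number of samples required at a leaf node.
        int: the value is used directly as the minimum number
        float: the value is a fraction, and the minimum number is min_samples_leaf * n_samples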

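    min_weight_fraction_leaf: float, default 0.
        The minimum weighted fraction of the total sample weight required at a leaf node. When no sample_weight is given, all samples have equal weight, so this is the minimum fraction of the training samples that a leaf must contain.

    max_leaf_nodes: int or None, default None
        The maximum number of leaf nodes. Trees are grown in best-first fashion, where the best nodes are those that give the largest relative reduction in impurity. If None, the number of leaf nodes is unlimited.

    min_impurity_split: float
        Threshold for stopping tree growth early. A node is split if its impurity is above this threshold; otherwise it becomes a leaf. This parameter was deprecated in favor of min_impurity_decrease and has been removed in recent scikit-learn releases.

    min_impurity_decrease: float, default 0.
        A node is split if the split reduces the impurity by at least this value.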

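    bootstrap: boolean, default True
        Whether to build each tree on a bootstrap sample, i.e. a sample drawn with replacement.

    oob_score: boolean, default False
        Whether to use out-of-bag samples to estimate the model's accuracy.

    n_jobs: int, default 1
        The number of jobs to run in parallel for fitting and prediction. If -1, the number of jobs is set to the number of processor cores.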

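    class_weight: dict, list of dicts, "balanced", "balanced_subsample" or None
        If not given, every class has weight 1.
        A dict of the form {class_label: weight} sets the weight of each class explicitly. For multi-output problems, a list of dicts can be given in the same order as the columns of y.
        "balanced" mode uses the values of y to adjust the weights automatically, making them inversely proportional to the class frequencies in the input data:
        n_samples / (n_classes * np.bincount(y)), where np.bincount(y) gives the number of samples in each class.
        "balanced_subsample" mode is the same as "balanced", except that the weights are computed from the bootstrap sample drawn for each tree rather than from the full training set.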
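
To make the "balanced" formula concrete, here is a small sketch that computes the weights by hand and compares them with scikit-learn's own helper. The toy arrays X and y are made up for illustration, and the constructor arguments are just one plausible combination of the parameters described above, not recommended settings.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.utils.class_weight import compute_class_weight

# Toy, imbalanced labels (made up for illustration): six samples of class 0, two of class 1.
y = np.array([0, 0, 0, 0, 0, 0, 1, 1])
X = np.arange(16, dtype=float).reshape(8, 2)

# "balanced" weights by hand: n_samples / (n_classes * np.bincount(y))
manual_weights = len(y) / (len(np.unique(y)) * np.bincount(y))

# The same weights from scikit-learn's helper function.
sklearn_weights = compute_class_weight(class_weight="balanced", classes=np.unique(y), y=y)

print(manual_weights)   # the rare class gets the larger weight
print(sklearn_weights)

# Passing several of the parameters above to the constructor.
clf = RandomForestClassifier(
    n_estimators=20,          # number of trees
    criterion="gini",         # split quality measure
    max_features="sqrt",      # sqrt(n_features) candidate features per split
    min_samples_leaf=1,       # smallest allowed leaf
    class_weight="balanced",  # reweight classes by inverse frequency
    random_state=0,           # fixed seed so the run is repeatable
)
clf.fit(X, y)
print(clf.get_params()["class_weight"])
```

With six samples in one class and two in the other, the rarer class ends up with three times the weight of the common one, which is what "inversely proportional to the class frequencies" means in practice.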

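The main attributes of the class are:

Attributes:
    estimators_: list of decision trees
        The fitted sub-classifiers, i.e. the individual decision trees.

    classes_: array of shape = [n_classes]
        The class labels.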

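    n_classes_: int or list
        The number of classes.

    n_features_: int
        The number of features used during fitting (exposed as n_features_in_ in recent scikit-learn releases).

    n_outputs_: int
        The number of outputs during fitting.

    feature_importances_: array of shape = [n_features]
        The importance of each feature; the larger the value, the more important the feature.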

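    oob_score_: float
        The score obtained on the training set using the out-of-bag (OOB) estimate.

    oob_decision_function_: array of shape = [n_samples, n_classes]
        The OOB predictions on the training set: one row per sample, holding the predicted probability of each class.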
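
The attributes above can be inspected after fitting. The following is a minimal sketch on the iris dataset that ships with scikit-learn; the dataset, the tree count and the seed are my own choices for illustration. It deliberately skips n_features_, since the name of that attribute differs between scikit-learn versions.

```python
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

# iris: 150 samples, 4 features, 3 classes
X, y = load_iris(return_X_y=True)

# oob_score=True is needed for oob_score_ and oob_decision_function_ to exist.
clf = RandomForestClassifier(n_estimators=50, oob_score=True, random_state=0)
clf.fit(X, y)

print(len(clf.estimators_))              # one fitted tree per estimator
print(clf.classes_)                      # the class labels
print(clf.n_classes_)                    # number of classes
print(clf.n_outputs_)                    # number of outputs
print(clf.feature_importances_)          # one importance per feature
print(clf.oob_score_)                    # a single float: OOB accuracy
print(clf.oob_decision_function_.shape)  # (n_samples, n_classes)
```

Note that oob_score_ is a single number, while oob_decision_function_ has one row per training sample.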

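The main methods of the class are:

Methods:
    apply(X): Applies the trees of the fitted forest to the dataset X and returns, for each sample, the index of the leaf it ends up in for each tree. The result is therefore a 2-D array:
rows correspond to samples and columns correspond to trees.

    decision_path(X): Returns the decision paths of the samples in X through the forest.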

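    fit(X, y[, sample_weight]): Builds the forest from the training set (X, y).

    get_params([deep]): Returns the parameters of the classifier.

    predict(X): Predicts the class of each sample in X.

    predict_log_proba(X): Predicts the log class probabilities for X; the same as predict_proba, but with the logarithm taken.

    predict_proba(X): Predicts the class probabilities for X. The predicted class probabilities of an input sample are the mean of the class probabilities predicted by the trees in the forest.
The class probability given by a single tree is the fraction of training samples of that class in the leaf. Leaves are not necessarily pure, so
different classes occupy different proportions of a leaf, although typically one class dominates. The return value is an array of shape = [n_samples, n_classes].

    score(X, y[, sample_weight]): Returns the mean prediction accuracy on the given labelled dataset.

    set_params(**params): Sets the parameters of the classifier.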
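
As a quick check of the descriptions above, this sketch calls several of the methods on the iris dataset (again my own choice of data, tree count and seed) and verifies by hand that predict_proba is the mean of the per-tree probabilities.

```python
import numpy as np
from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier

X, y = load_iris(return_X_y=True)
clf = RandomForestClassifier(n_estimators=10, random_state=0).fit(X, y)

# apply: one leaf index per (sample, tree) pair
leaves = clf.apply(X)
print(leaves.shape)        # (n_samples, n_estimators)

# decision_path: sparse node-indicator matrix plus per-tree column offsets
indicator, n_nodes_ptr = clf.decision_path(X)
print(indicator.shape[0], len(n_nodes_ptr))

# predict_proba: mean of the class probabilities predicted by each tree
proba = clf.predict_proba(X)
tree_mean = np.mean([tree.predict_proba(X) for tree in clf.estimators_], axis=0)
print(np.allclose(proba, tree_mean))

# predict: the class with the highest mean probability
pred = clf.predict(X)
same = np.array_equal(pred, clf.classes_[np.argmax(proba, axis=1)])
print(same)

# score: mean accuracy on a labelled dataset (here the training set itself)
accuracy = clf.score(X, y)
print(accuracy)
```

Scoring on the training data, as done here for brevity, gives an optimistic number; a held-out set or the OOB estimate is a fairer measure.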

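Here is some code I wrote that applies a random forest to the San Francisco crime prediction problem on Kaggle. It is written in Python, and its accuracy is not yet known. (I have only just started using scikit-learn and similar tools, so the code may be rough; please bear with me.)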

#author = liuwei

import pandas as pd 
import numpy as np 
import joblib
from sklearn.ensemble import RandomForestClassifier


def split_single_date(date_time_):
    '''Split a "date time" string into year, month, day and hour (as ints).'''

    # split the date part from the time part on the space
    tmp_ = date_time_.split(' ')
    date_, time_ = tmp_[0], tmp_[1]

    # split the date on '-'
    date_tmp_ = date_.split('-')
    year_, month_, day_ = int(date_tmp_[0]), int(date_tmp_[1]), int(date_tmp_[2])

    # split the time on ':' and keep only the hour
    hour_tmp_ = time_.split(':')
    hour_ = int(hour_tmp_[0])

    return year_, month_, day_, hour_


def str_to_int(values):
    '''Encode strings as ints: each value is replaced by the integer code of its category.'''

    # pd.Categorical assigns one integer code per distinct value
    categorys_ = pd.Categorical(values).codes

    return categorys_




# read the data
train_data_ = pd.read_csv('datas/train.csv')
test_data_ = pd.read_csv('datas/test.csv')

#print('train_data:' + str(train_data_[:10]))

#######################train_datas################################

# get the weekday column and encode its strings as ints
train_weekday_values_ = train_data_.DayOfWeek.values
train_weekday_ = str_to_int(train_weekday_values_)
train_weekdays_ = pd.Series(train_weekday_, name = 'weekday')

# get the police district column and encode its strings as ints
train_district_values_ = train_data_.PdDistrict.values
train_district_ = str_to_int(train_district_values_)
train_districts_ = pd.Series(train_district_, name = 'district')

train_dates_ = train_data_.Dates.values

# map(function, list) applies the function to every element of the list;
# here each date string becomes a (year, month, day, hour) tuple
train_parse_date_ = list(map(split_single_date, train_dates_))

# pull each field of the tuples out into its own named column
train_years_ = pd.Series(data = [x[0] for x in train_parse_date_], name = 'year')
train_months_ = pd.Series(data = [x[1] for x in train_parse_date_], name = 'month')
train_days_ = pd.Series(data = [x[2] for x in train_parse_date_], name = 'day')
train_hours_ = pd.Series(data = [x[3] for x in train_parse_date_], name = 'hour')

#the crime type
train_categorys_ = train_data_.Category.values

# build the training feature table; pd.concat's axis parameter sets the direction: 1 concatenates by column, 0 by row
train_datas_ = pd.concat([train_years_, train_months_, train_days_, train_hours_, train_weekdays_, train_districts_], axis = 1)


######################test_datas#############################
test_weekday_values_ = test_data_.DayOfWeek.values
test_weekday_ = str_to_int(test_weekday_values_)
test_weekdays_ = pd.Series(test_weekday_, name = 'weekday')

test_district_values_ = test_data_.PdDistrict.values
test_district_ = str_to_int(test_district_values_)
test_districts_ = pd.Series(test_district_, name = 'district')

test_dates_ = test_data_.Dates.values

# parse the test dates the same way as the training dates
test_parse_date_ = list(map(split_single_date, test_dates_))

test_years_ = pd.Series(data = [x[0] for x in test_parse_date_], name = 'year')
test_months_ = pd.Series(data = [x[1] for x in test_parse_date_], name = 'month')
test_days_ = pd.Series(data = [x[2] for x in test_parse_date_], name = 'day')
test_hours_ = pd.Series(data = [x[3] for x in test_parse_date_], name = 'hour')

# the columns must be in the same order as in train_datas_
test_datas_ = pd.concat([test_years_, test_months_, test_days_, test_hours_, test_weekdays_, test_districts_], axis = 1)


#######################RandomForestClassifier###############

# use RandomForestClassifier as the classifier
#clf = RandomForestClassifier(n_estimators = 20, min_samples_leaf = 2000, bootstrap = True, oob_score = True, criterion = 'gini')
clf = RandomForestClassifier(bootstrap = True, oob_score = True, criterion = 'gini')

# train the model
print('-------------train start---------------')
clf.fit(train_datas_, train_categorys_)

# save the model (the model/ directory must already exist);
# to load it again later, use joblib.load(filename)
joblib.dump(clf, 'model/clf.pkl')
print('-------------train end---------------')

# predict the test data: every class gets a probability, and the columns
# are in the same order as the classes_ attribute
print('-------------predict start---------------')
result_ = clf.predict_proba(test_datas_)
print('-------------predict end---------------')

# get all class labels, used as column names
classes_ = clf.classes_

# save the predicted probabilities to a file
res_data_frame_ = pd.DataFrame(data = result_, columns = classes_)
res_data_frame_.to_csv('result.csv', index_label = 'Id')
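
One thing to watch in the script above: str_to_int is called separately on the training and test columns, so the integer codes only agree if both files contain exactly the same set of distinct values. The sketch below shows the difference on two tiny made-up frames (train_df and test_df are hypothetical stand-ins, not the real Kaggle files) and one possible way to share a single category list between them.

```python
import pandas as pd

# Hypothetical miniature frames standing in for train.csv and test.csv.
train_df = pd.DataFrame({"PdDistrict": ["MISSION", "BAYVIEW", "TARAVAL"]})
test_df = pd.DataFrame({"PdDistrict": ["TARAVAL", "MISSION"]})

# Independent encoding, as str_to_int does: each frame gets its own codes.
train_codes_separate = pd.Categorical(train_df.PdDistrict).codes
test_codes_separate = pd.Categorical(test_df.PdDistrict).codes

# Shared encoding: fix the category list from the training data and reuse it.
categories = sorted(train_df.PdDistrict.unique())
train_codes = pd.Categorical(train_df.PdDistrict, categories=categories).codes
test_codes = pd.Categorical(test_df.PdDistrict, categories=categories).codes

print(list(train_codes_separate), list(test_codes_separate))
print(list(train_codes), list(test_codes))
```

In the separate encoding, the same district can receive different codes in the two frames whenever one frame is missing a value. With the shared category list the codes always match, and a test value never seen in training is encoded as -1, which makes it easy to spot.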

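Sublime Text 3 on Ubuntu would not accept Chinese input, so the comments are in English, but it is simple English and should be easy to follow. If you have any questions, feel free to get in touch!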
