scikit-learn Supplementary Notes 1

Original

2020-07-07 18:17

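1. Split all the data into train and test sets

# sklearn.cross_validation was removed in scikit-learn 0.20; use sklearn.model_selection
from sklearn.model_selection import train_test_split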

X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
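The notes do not define X and y, so here is a minimal sketch of the split on hypothetical toy data (the arrays below are made up purely for illustration). It shows that, when no test_size is given, train_test_split holds out 25% of the rows.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Hypothetical toy data: 20 samples, 2 features, binary labels.
X = np.arange(40).reshape(20, 2)
y = np.array([0, 1] * 10)

# With no test_size given, 25% of the rows go to the test set.
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

print(X_train.shape, X_test.shape)
```

Fixing random_state makes the split reproducible from run to run.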

2. Train a logistic regression (LR) model and predict y for the test set

from sklearn.linear_model import LogisticRegression

logreg = LogisticRegression()
logreg.fit(X_train, y_train)

y_pred_class = logreg.predict(X_test)
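For a version of steps 1 and 2 that runs end to end, the sketch below assumes synthetic data from make_classification (the notes do not say which dataset was used, so this choice is an assumption).

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Assumption: a synthetic binary problem stands in for the real dataset.
X, y = make_classification(n_samples=200, n_features=5, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Fit on the training rows only, then predict class labels for the test rows.
logreg = LogisticRegression()
logreg.fit(X_train, y_train)
y_pred_class = logreg.predict(X_test)

print(y_pred_class[:10])
```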

3. Evaluate the result with accuracy

from sklearn import metrics
print(metrics.accuracy_score(y_test, y_pred_class))

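4. Use series.value_counts() to count how the true labels are distributed, and series.mean() to compute the mean

5. Use y_test.mean() to get the proportion of 1s and 1 - y_test.mean() to get the proportion of 0s (this works because the labels are 0/1)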
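A small sketch of steps 4 and 5 on a hypothetical label Series. The larger of the two proportions is the accuracy a model would reach by always predicting the most frequent class (often called the null accuracy), which gives a baseline for judging the accuracy from step 3.

```python
import pandas as pd

# Hypothetical true labels: seven 0s and three 1s.
y_test = pd.Series([0, 0, 0, 1, 0, 1, 0, 0, 1, 0])

counts = y_test.value_counts()      # label distribution
share_of_ones = y_test.mean()       # mean of 0/1 labels = fraction of 1s
share_of_zeros = 1 - y_test.mean()  # fraction of 0s

# Baseline: always predict the majority class.
null_accuracy = max(share_of_ones, share_of_zeros)
print(counts)
print(null_accuracy)
```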

6. Evaluate with confusion_matrix

from sklearn import metrics
# Rows are actual classes, columns are predicted classes
confusion = metrics.confusion_matrix(y_test, y_pred_class)

# TN (true negative): correctly predicted as negative (actual 0, predicted 0)
# TP (true positive): correctly predicted as positive (actual 1, predicted 1)
# FN (false negative): wrongly predicted as negative (actual 1, predicted 0)
# FP (false positive): wrongly predicted as positive (actual 0, predicted 1)

TP = confusion[1, 1]
TN = confusion[0, 0]
FP = confusion[0, 1]
FN = confusion[1, 0]
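To make the matrix layout concrete, here is a sketch on hypothetical labels. For a binary problem, flattening the matrix with ravel() yields the four counts in the order TN, FP, FN, TP, which matches the indexing above.

```python
from sklearn import metrics

# Hypothetical true and predicted labels.
y_true = [0, 0, 0, 0, 1, 1, 1, 0, 1, 1]
y_pred = [0, 0, 1, 0, 1, 0, 1, 0, 1, 1]

confusion = metrics.confusion_matrix(y_true, y_pred)

# Row-major flattening of [[TN, FP], [FN, TP]].
TN, FP, FN, TP = confusion.ravel()
print(confusion)
print(TN, FP, FN, TP)
```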

7. accuracy_score = (TN+TP)/(TN+TP+FN+FP)

8. Classification Error (Misclassification Rate): 1 - metrics.accuracy_score(y_test, y_pred_class), i.e. (FP+FN)/(TP+TN+FP+FN).
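As a quick check of the two formulas, the sketch below computes accuracy and classification error from the four counts on hypothetical labels and compares the result with metrics.accuracy_score.

```python
from sklearn import metrics

# Hypothetical true and predicted labels (8 of 10 predictions are correct).
y_true = [0, 0, 0, 0, 1, 1, 1, 0, 1, 1]
y_pred = [0, 0, 1, 0, 1, 0, 1, 0, 1, 1]

TN, FP, FN, TP = metrics.confusion_matrix(y_true, y_pred).ravel()

accuracy = (TP + TN) / (TP + TN + FP + FN)
error = (FP + FN) / (TP + TN + FP + FN)

print(accuracy, metrics.accuracy_score(y_true, y_pred))
print(error, 1 - metrics.accuracy_score(y_true, y_pred))
```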

9. Sensitivity (Recall, True Positive Rate): when the actual value is Positive (1), the fraction of predictions that are correct

sensitivity = TP/(TP+FN)
metrics.recall_score(y_test, y_pred_class)

10. Specificity: when the actual value is Negative (0), the fraction of predictions that are correct

specificity = TN/(TN+FP)

11. False Positive Rate:
When the actual value is Negative (0), the fraction of predictions that are wrong; this equals 1 - specificity
FP/(TN+FP)
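scikit-learn has no dedicated specificity function in sklearn.metrics, so the sketch below (on hypothetical labels) computes specificity and the false positive rate from the confusion matrix. As a cross-check it uses the fact that specificity is simply the recall of class 0, which recall_score exposes through pos_label.

```python
from sklearn import metrics

# Hypothetical labels: 4 actual positives, 6 actual negatives.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 1, 0, 0, 0, 0]

TN, FP, FN, TP = metrics.confusion_matrix(y_true, y_pred).ravel()

specificity = TN / (TN + FP)          # 4 / 6
false_positive_rate = FP / (TN + FP)  # 2 / 6

# Recall computed with class 0 treated as the "positive" class.
specificity_via_recall = metrics.recall_score(y_true, y_pred, pos_label=0)
print(specificity, false_positive_rate, specificity_via_recall)
```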

12. Precision:
When the predicted value is Positive (1), the fraction of predictions that are correct

precision = TP/(TP+FP)
metrics.precision_score(y_test, y_pred_class)
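Sensitivity and precision share the numerator TP but divide by different totals, which is easy to mix up. The sketch below, on hypothetical labels, computes both by hand and compares them with recall_score and precision_score.

```python
from sklearn import metrics

# Hypothetical labels: 4 actual positives, 5 predicted positives.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 1, 1, 0, 0, 0, 0]

TN, FP, FN, TP = metrics.confusion_matrix(y_true, y_pred).ravel()

sensitivity = TP / (TP + FN)  # divide by actual positives: 3 / 4
precision = TP / (TP + FP)    # divide by predicted positives: 3 / 5

print(sensitivity, metrics.recall_score(y_true, y_pred))
print(precision, metrics.precision_score(y_true, y_pred))
```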

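13. Adjust the probability threshold used for prediction
predict_proba returns a matrix of shape (n, 2): the first column is the predicted probability of class 0, the second column is the predicted probability of class 1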

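from sklearn.preprocessing import binarize
binarize(matrix, threshold=0.3)  # returns a 0/1 matrix: values greater than 0.3 become 1, values less than or equal to 0.3 become 0
# Note: binarize itself defaults to threshold=0.0; 0.5 is the cutoff that predict() effectively applies for binary classification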
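A minimal sketch of lowering the threshold to 0.3, assuming a hypothetical vector of class 1 probabilities such as the second column of predict_proba. binarize expects a 2D array, so the vector is reshaped into a single row first; a plain NumPy comparison gives the same labels.

```python
import numpy as np
from sklearn.preprocessing import binarize

# Hypothetical predicted probabilities of class 1.
y_pred_prob = np.array([0.10, 0.35, 0.50, 0.80, 0.30])

# binarize needs 2D input: reshape to one row, then take that row back.
y_pred_03 = binarize(y_pred_prob.reshape(1, -1), threshold=0.3)[0]

# Equivalent NumPy form; the comparison is strict, so 0.30 itself maps to 0.
y_pred_03_np = (y_pred_prob > 0.3).astype(int)

print(y_pred_03)
print(y_pred_03_np)
```

Lowering the threshold turns more samples into predicted positives, which raises sensitivity at the cost of specificity.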

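14. ROC: the receiver operating characteristic curve (ROC curve) shows how sensitivity and specificity change across thresholds, without having to change the threshold by hand each time

import matplotlib.pyplot as plt

# Probability of class 1 from the fitted model (the method is predict_proba on the estimator)
y_pred_prob = logreg.predict_proba(X_test)[:, 1]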

fpr, tpr, thresholds = metrics.roc_curve(y_test, y_pred_prob)
plt.plot(fpr, tpr)
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.0])
plt.title("ROC curve")
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.grid(True)
plt.show()
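To see what roc_curve actually returns, here is a sketch on four hypothetical samples, with no plotting. Each position in fpr and tpr corresponds to the entry at the same position in thresholds; the first threshold is a sentinel placed above every score, so the curve starts at (0, 0).

```python
from sklearn import metrics

# Hypothetical labels and predicted probabilities of class 1.
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

fpr, tpr, thresholds = metrics.roc_curve(y_true, y_score)

# One (fpr, tpr) point per threshold, ordered from highest to lowest threshold.
for f, t, th in zip(fpr, tpr, thresholds):
    print(f"threshold={th}: fpr={f}, tpr={t}")
```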

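However, the thresholds themselves cannot be read off the ROC plot, so you need to look them up yourself

def evaluate_threshold(threshold):
    # Use the last ROC point whose threshold is still above the requested one
    print("sensitivity:", tpr[thresholds > threshold][-1])
    print("specificity:", 1 - fpr[thresholds > threshold][-1])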

evaluate_threshold(0.5)
evaluate_threshold(0.3)

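Use this approach to look for the best threshold: the best threshold makes the ROC x-coordinate (false positive rate) as small as possible and the y-coordinate (true positive rate) as large as possible, i.e. a point close to the top-left corner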
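The notes do not say how to pick that point automatically. One common heuristic, used here as an assumption rather than as the author's method, is to maximize tpr - fpr (Youden's J statistic) over the thresholds returned by roc_curve. The sketch uses hypothetical labels and scores.

```python
import numpy as np
from sklearn import metrics

# Hypothetical labels and predicted probabilities of class 1.
y_true = [1, 1, 1, 0, 1, 1, 0, 0, 0, 0]
y_score = [0.9, 0.8, 0.7, 0.6, 0.5, 0.4, 0.3, 0.2, 0.1, 0.05]

fpr, tpr, thresholds = metrics.roc_curve(y_true, y_score)

# Youden's J: large when tpr is high and fpr is low at the same threshold.
j_scores = tpr - fpr
best_idx = int(np.argmax(j_scores))
best_threshold = thresholds[best_idx]

print(best_threshold, tpr[best_idx], fpr[best_idx])
```

Other criteria are equally defensible, for example weighting false negatives more heavily when missing a positive is costly.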

15. AUC (Area Under the ROC Curve)

print(metrics.roc_auc_score(y_test, y_pred_prob))
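One way to read AUC is as the probability that a randomly chosen positive sample gets a higher score than a randomly chosen negative one. The sketch below checks this on four hypothetical samples by counting pairs by hand and comparing with roc_auc_score (there are no tied scores here, so no tie handling is needed).

```python
from sklearn import metrics

# Hypothetical labels and predicted probabilities of class 1.
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

auc = metrics.roc_auc_score(y_true, y_score)

# Pairwise view: fraction of (positive, negative) pairs ranked correctly.
pos = [s for s, label in zip(y_score, y_true) if label == 1]
neg = [s for s, label in zip(y_score, y_true) if label == 0]
correct_pairs = sum(p > n for p in pos for n in neg)
pairwise_auc = correct_pairs / (len(pos) * len(neg))

print(auc, pairwise_auc)
```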
