1. Split all the data into a train part and a test part
from sklearn.model_selection import train_test_split  # cross_validation was removed in sklearn 0.20
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)
2. Train a logistic regression model and predict y for the test set
from sklearn.linear_model import LogisticRegression
logreg = LogisticRegression()
logreg.fit(X_train, y_train)
y_pred_class = logreg.predict(X_test)
3. Evaluate the result with accuracy
from sklearn import metrics
print(metrics.accuracy_score(y_test, y_pred_class))
4. Use series.value_counts() to see the distribution of the true labels, and series.mean() to compute their mean
5. Because the labels are 0/1, y_test.mean() is the proportion of 1s and 1 - y_test.mean() is the proportion of 0s
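Steps 4-5 can be sketched with a hypothetical binary y_test (not from the original dataset); the useful byproduct is the "null accuracy", the score you would get by always predicting the majority class:

```python
import pandas as pd

# hypothetical binary test labels
y_test = pd.Series([0, 0, 0, 1, 1, 0, 0, 1, 0, 0])

print(y_test.value_counts())   # distribution of the true labels

p1 = y_test.mean()             # proportion of 1s (mean of a 0/1 series)
p0 = 1 - y_test.mean()         # proportion of 0s

# null accuracy: the baseline any classifier should beat
null_accuracy = max(p0, p1)
print(null_accuracy)
```

Comparing the accuracy from step 3 against this baseline shows whether the model learned anything beyond the class imbalance.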
6. Evaluate with the confusion matrix
from sklearn import metrics
metrics.confusion_matrix(y_test, y_pred_class)
# TN: correctly predicted Negative (actual 0, predicted 0)
# TP: correctly predicted Positive (actual 1, predicted 1)
# FN: falsely predicted Negative (actual 1, predicted 0)
# FP: falsely predicted Positive (actual 0, predicted 1)
- TP=confusion[1,1]
- TN=confusion[0,0]
- FP=confusion[0,1]
- FN=confusion[1,0]
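The four indexing lines above can be collapsed with ravel(), which flattens the 2x2 matrix in row-major order; a sketch with hypothetical labels:

```python
from sklearn.metrics import confusion_matrix

# hypothetical true and predicted labels
y_test = [0, 0, 1, 1, 0, 1, 0, 1]
y_pred = [0, 1, 1, 0, 0, 1, 0, 1]

# row-major flatten: [0,0], [0,1], [1,0], [1,1] -> TN, FP, FN, TP
TN, FP, FN, TP = confusion_matrix(y_test, y_pred).ravel()
print(TN, FP, FN, TP)  # 3 1 1 3
```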
7. accuracy_score = (TN+TP)/(TN+TP+FN+FP)
8. Classification Error (Misclassification Rate): 1 - metrics.accuracy_score(y_test, y_pred), i.e. (FP+FN)/(TP+TN+FP+FN).
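The formulas in steps 7-8 can be checked against the library on hypothetical labels:

```python
from sklearn.metrics import accuracy_score, confusion_matrix

y_test = [0, 0, 1, 1, 0, 1, 0, 1]   # hypothetical labels
y_pred = [0, 1, 1, 0, 0, 1, 0, 1]

TN, FP, FN, TP = confusion_matrix(y_test, y_pred).ravel()

accuracy = (TN + TP) / (TN + TP + FN + FP)   # 6/8 = 0.75
error = (FP + FN) / (TN + TP + FN + FP)      # 2/8 = 0.25

# the manual formulas agree with accuracy_score
assert accuracy == accuracy_score(y_test, y_pred)
assert error == 1 - accuracy_score(y_test, y_pred)
```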
9. Sensitivity (Recall, True Positive Rate): of the actually Positive (1) samples, the fraction predicted correctly
sensitivity = TP/(TP+FN)
metrics.recall_score(y_test, y_pred)
10. Specificity: of the actually Negative (0) samples, the fraction predicted correctly
specificity = TN/(TN+FP)
11. False Positive Rate: of the actually Negative (0) samples, the fraction predicted incorrectly
fpr = FP/(TN+FP)
12. Precision: of the samples predicted Positive (1), the fraction that actually are Positive
precision = TP/(TP+FP)
metrics.precision_score(y_test, y_pred)
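Steps 9-12 all derive from the same four counts; a sketch that verifies the manual formulas against recall_score and precision_score (hypothetical labels again):

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score

y_test = [0, 0, 1, 1, 0, 1, 0, 1]   # hypothetical labels
y_pred = [0, 1, 1, 0, 0, 1, 0, 1]

TN, FP, FN, TP = confusion_matrix(y_test, y_pred).ravel()

sensitivity = TP / (TP + FN)   # recall / true positive rate
specificity = TN / (TN + FP)
fpr = FP / (TN + FP)           # always equals 1 - specificity
precision = TP / (TP + FP)

assert sensitivity == recall_score(y_test, y_pred)
assert precision == precision_score(y_test, y_pred)
```

There is no specificity function in metrics; it is usually computed by hand as above, or as recall of the 0 class.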
13. Adjusting the prediction probability threshold
predict_proba returns a matrix of shape (n, 2): the first column is the predicted probability of class 0, the second that of class 1
from sklearn.preprocessing import binarize
binarize(matrix, threshold=0.3)  # returns a 0/1 matrix: values greater than 0.3 become 1, values less than or equal to 0.3 become 0
# note: binarize's default threshold is 0.0, not 0.5
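A sketch of step 13 with a hypothetical probability column; binarize expects a 2-D array, hence the reshape:

```python
import numpy as np
from sklearn.preprocessing import binarize

# hypothetical second column of predict_proba (probability of class 1)
y_pred_prob = np.array([0.1, 0.25, 0.4, 0.8, 0.35])

# values strictly greater than 0.3 become 1, the rest 0
y_pred_class = binarize(y_pred_prob.reshape(-1, 1), threshold=0.3).ravel()
print(y_pred_class)  # [0. 0. 1. 1. 1.]
```

Lowering the threshold below 0.5 trades specificity for sensitivity: more samples get labelled 1.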
14. ROC curve (receiver operating characteristic curve): shows how sensitivity and specificity trade off across all possible thresholds at once, without re-thresholding by hand
import matplotlib.pyplot as plt
y_pred_prob = logreg.predict_proba(X_test)[:, 1]  # predict_proba is a method of the fitted model, not of metrics
fpr, tpr, thresholds = metrics.roc_curve(y_test, y_pred_prob)
plt.plot(fpr, tpr)
plt.xlim([0.0, 1.0])
plt.ylim([0.0, 1.0])
plt.title("ROC curve")
plt.xlabel("False Positive Rate")
plt.ylabel("True Positive Rate")
plt.grid(True)
plt.show()
The thresholds themselves cannot be read off the ROC plot, so they have to be looked up:
def evaluate_threshold(threshold):
    print("sensitivity:", tpr[thresholds > threshold][-1])
    print("specificity:", 1 - fpr[thresholds > threshold][-1])
evaluate_threshold(0.5)
evaluate_threshold(0.3)
Searching this way finds the best threshold: the one whose ROC point has the smallest possible x-value (FPR) and the largest possible y-value (TPR)
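One common way to automate that search is Youden's J statistic, tpr - fpr, maximized over the thresholds that roc_curve returns; a sketch on hypothetical data:

```python
import numpy as np
from sklearn.metrics import roc_curve

# hypothetical labels and predicted probabilities of class 1
y_test = np.array([0, 0, 0, 1, 0, 1, 1, 0, 1, 1])
y_pred_prob = np.array([0.1, 0.2, 0.3, 0.35, 0.4, 0.6, 0.7, 0.45, 0.8, 0.9])

fpr, tpr, thresholds = roc_curve(y_test, y_pred_prob)

# Youden's J: the point furthest above the diagonal (high tpr, low fpr)
best_idx = np.argmax(tpr - fpr)
print(thresholds[best_idx])
```

This is only one heuristic; if false positives and false negatives have different costs, the threshold should be chosen from those costs instead.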
15. AUC (Area Under the ROC Curve)
print(metrics.roc_auc_score(y_test, y_pred_prob))
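roc_auc_score on the probabilities is the area under the curve from step 14; it agrees with metrics.auc applied to the roc_curve output. A sketch on hypothetical data:

```python
import numpy as np
from sklearn.metrics import auc, roc_auc_score, roc_curve

# hypothetical labels and predicted probabilities of class 1
y_test = np.array([0, 0, 0, 1, 0, 1, 1, 0, 1, 1])
y_pred_prob = np.array([0.1, 0.2, 0.3, 0.35, 0.4, 0.6, 0.7, 0.45, 0.8, 0.9])

fpr, tpr, _ = roc_curve(y_test, y_pred_prob)

print(roc_auc_score(y_test, y_pred_prob))  # area under the ROC curve
assert np.isclose(roc_auc_score(y_test, y_pred_prob), auc(fpr, tpr))
```

AUC is threshold-free: it equals the probability that a randomly chosen positive sample receives a higher score than a randomly chosen negative one.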