This is a supervised-learning problem aimed at maximizing profit. A lender issues loans to many applicants, and the task is to predict whether each borrower will repay. If the loan is repaid, the label is 1 and the bank earns a profit; if it is not repaid, the label is 0 and the bank earns nothing. After the model makes its predictions, we compare them against the true labels to evaluate how good the model is.
Compared with the earlier Kaggle exercise, this problem involves (1) more features and more samples, (2) more data-cleaning steps, and (3) the application of model-evaluation metrics.
# coding: utf-8
import pandas as pd
load_2007 = pd.read_csv("LoanStats3a.csv", skiprows=1)  # read the file (skiprows=1 skips the first line)
len(load_2007)  # number of rows
load_2007.shape  # shape of the frame: (rows, columns)
print(load_2007.shape[0])  # rows
print(load_2007.shape[1])  # columns
# Drop missing values based on the missing rate (share of NAs relative to the total).
# To drop rows, use axis=0: drop a sample when half of its features are missing.
# To drop columns, use axis=1: drop a feature when half of its samples are missing.
# Set the threshold
half_count = int(load_2007.shape[0] / 2)  # half the number of samples
print(half_count)
# Keep a column only if it has at least half_count non-NA values; axis=1 operates on columns
load_2007 = load_2007.dropna(thresh=half_count, axis=1)
load_2007.shape
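The `thresh` logic can be checked on a toy frame (the column names below are made up for illustration):

```python
import pandas as pd
import numpy as np

# Toy frame: 4 rows; column "b" has only 1 non-NA value.
df = pd.DataFrame({
    "a": [1.0, 2.0, 3.0, 4.0],
    "b": [np.nan, np.nan, np.nan, 9.0],
})
half = df.shape[0] // 2  # 2
# Keep only columns that have at least `half` non-NA values.
kept = df.dropna(thresh=half, axis=1)
print(list(kept.columns))  # ['a'] -- "b" has 1 non-NA value, below the threshold
```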
load_2007 = load_2007.drop(['desc', 'url'], axis=1)  # axis=1 drops columns, axis=0 drops rows
load_2007.to_csv('load_2007.csv', index=False)
print(load_2007.iloc[1, :])  # indexing: loc[row_name, column_name] slices by label; iloc[1:2, 1:2] slices by integer position
load_2007.columns.values  # column names, returned as an array of strings
# Drop columns irrelevant to predicting repayment: drop("column_name", axis=1)
load_2007 = load_2007.drop(["id", "member_id", "funded_amnt", "funded_amnt_inv", "grade", "sub_grade", "emp_title", "issue_d"], axis=1)
load_2007 = load_2007.drop(["zip_code", "out_prncp", "out_prncp_inv", "total_pymnt", "total_pymnt_inv", "total_rec_prncp"], axis=1)
load_2007 = load_2007.drop(["total_rec_int", "total_rec_late_fee", "recoveries", "collection_recovery_fee", "last_pymnt_d", "last_pymnt_amnt"], axis=1)
load_2007.shape[1]
# pandas counting
print(load_2007['loan_status'].value_counts())  # count occurrences of each category in this column
# pandas filtering; | means logical OR
load_2007 = load_2007[(load_2007['loan_status'] == "Fully Paid") | (load_2007['loan_status'] == "Charged Off")]
load_2007.shape  # boolean condition: keep the rows where the mask is True
# pandas updating: Series.map accepts a dict and replaces values by key
loans_status = {"Fully Paid": 1, "Charged Off": 0}  # replace the column's values with Series.map
load_2007["loan_status"] = load_2007["loan_status"].map(loans_status)
load_2007.head(6)
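The filter-then-map pattern above looks like this on a toy Series (the status values mirror the real data, everything else is illustrative):

```python
import pandas as pd

s = pd.Series(["Fully Paid", "Charged Off", "Current", "Fully Paid"])
# Keep only the two terminal statuses, then encode them as 1/0 via a dict.
mask = (s == "Fully Paid") | (s == "Charged Off")
encoded = s[mask].map({"Fully Paid": 1, "Charged Off": 0})
print(encoded.tolist())  # [1, 0, 1]
```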
# pandas filtering 2: drop any column whose (non-NA) values are all identical
# load_2007.columns returns an Index; for iteration it works like df.columns.values
column_name = load_2007.columns
drop_column = []
for col in column_name:
    col_series = load_2007[col].dropna().unique()
    if len(col_series) == 1:
        drop_column.append(col)
        load_2007 = load_2007.drop([col], axis=1)
print(drop_column)
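The same constant-column check can be written without mutating the frame inside the loop, using `nunique` (the toy column names here are made up):

```python
import pandas as pd
import numpy as np

df = pd.DataFrame({
    "varies": [1, 2, 3],
    "constant": ["x", "x", "x"],
    "constant_with_na": [np.nan, 5.0, 5.0],
})
# nunique() ignores NaN by default, matching the dropna().unique() idiom above.
to_drop = [c for c in df.columns if df[c].nunique() == 1]
df = df.drop(to_drop, axis=1)
print(to_drop)           # ['constant', 'constant_with_na']
print(list(df.columns))  # ['varies']
```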
# Inspect the missing values
load_2007.isnull().sum()  # count the NAs in each column
loads = load_2007.drop(["pub_rec_bankruptcies"], axis=1)  # drop the column that still has many NAs
loads = loads.dropna(axis=0)  # drop the remaining rows that contain NAs
print(loads.dtypes.value_counts())  # how many object, float, and int columns are left
# Select columns by dtype
object_columns_df = loads.select_dtypes(include=["object"])  # don't forget the brackets
print(object_columns_df.iloc[1])
# Look at how many categories each object column contains
# All string columns must be converted to numeric
col_object = ['home_ownership', 'verification_status', 'purpose', 'addr_state']
for i in col_object:
    print(object_columns_df[i].value_counts())
print(object_columns_df["purpose"].value_counts())
object_columns_df.head(3)
# Map with a dict; replace would also work, but a dict keeps the intent simple and readable
# Note the raw data spells the last two keys "1 year" and "< 1 year" (singular, with a space)
mapping_emp_length = {
    "10+ years": 10,
    "9 years": 9,
    "8 years": 8,
    "7 years": 7,
    "6 years": 6,
    "5 years": 5,
    "4 years": 4,
    "3 years": 3,
    "2 years": 2,
    "1 year": 1,
    "< 1 year": 0,
    "n/a": 0
}
loads = loads.drop(["last_credit_pull_d", "earliest_cr_line", "addr_state", "title"], axis=1)
loads["int_rate"] = loads["int_rate"].str.strip("%").astype(float)  # strip the % sign
loads["revol_util"] = loads["revol_util"].str.strip("%").astype(float)
loads["emp_length"] = loads["emp_length"].map(mapping_emp_length)
loads.head(10)
# Values the mapping does not cover become NaN, so fill them manually
loads["emp_length"] = loads["emp_length"].fillna(value=0)  # fill the NAs with 0
loads.head(10)
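A quick sanity check of the `%`-stripping step on made-up rate strings:

```python
import pandas as pd

rates = pd.Series(["10.65%", "15.27%", "8.90%"])
# str.strip("%") removes the character from both ends of each string.
as_float = rates.str.strip("%").astype(float)
print(as_float.tolist())  # [10.65, 15.27, 8.9]
```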
# One-hot encoding
cat_columns = ["home_ownership", "verification_status", "emp_length", "purpose", "term"]
dummies = pd.get_dummies(loads[cat_columns])
loads = pd.concat([loads, dummies], axis=1)
loads = loads.drop(cat_columns, axis=1)  # labels: single label or list-like
loads = loads.drop("pymnt_plan", axis=1)
loads.info()
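The get_dummies / concat / drop sequence on a toy two-column frame (the data is made up; only `home_ownership` mirrors a real column name):

```python
import pandas as pd

df = pd.DataFrame({"home_ownership": ["RENT", "OWN", "RENT"],
                   "loan_amnt": [1000, 2000, 1500]})
# One indicator column per category, named <column>_<category>.
dummies = pd.get_dummies(df[["home_ownership"]])
df = pd.concat([df, dummies], axis=1).drop("home_ownership", axis=1)
print(list(df.columns))  # ['loan_amnt', 'home_ownership_OWN', 'home_ownership_RENT']
```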
Model evaluation
Model evaluation: after the model runs, each sample row gets a predicted 1 or 0. Each sample also has a true 1 or 0. Evaluation means comparing the predicted label against the true label for every sample.
Note that regression and classification use different evaluation metrics. Regression metrics measure the distance between the fitted hyperplane and the sample points and are relatively simple, e.g. MSE (mean squared error). Classification metrics measure how well a classifier separates the classes.
The confusion matrix is worth understanding: True/False says whether the prediction matched the true label, and Positive/Negative is the class the model predicted.
Predicted | Actual | Outcome |
1 | 1 | True Positive (predicted Positive, truth is Positive) |
1 | 0 | False Positive (predicted Positive, truth is Negative) |
0 | 0 | True Negative (predicted Negative, truth is Negative) |
0 | 1 | False Negative (predicted Negative, truth is Positive) |
How are these four counts used? Through TPR (true positive rate) and FPR (false positive rate), and through Precision and Recall (the English terms are easier to keep straight).
Precision: among the samples the model predicted as positive, the fraction that are truly positive. Numerator: correctly predicted positives (TP); denominator: all predicted positives (TP + FP). Higher is better. Precision = TP / (TP + FP).
Recall (= TPR): among all truly positive samples, the fraction the model found. Numerator: TP; denominator: all true positives (TP + FN, where FN are positives wrongly predicted as negative). Higher is better. Recall = TP / (TP + FN).
FPR = FP / (FP + TN): among all truly negative samples, the fraction the model wrongly predicted as positive.
So we want a high TPR and a low FPR.
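These four counts and the two rates can be sanity-checked on a tiny hand-computable example; sklearn's confusion_matrix returns the same counts (the toy labels below are made up):

```python
from sklearn.metrics import confusion_matrix

y_true = [1, 1, 1, 0, 0, 1]  # 4 positives, 2 negatives
y_pred = [1, 0, 1, 0, 1, 1]
# For labels (0, 1) confusion_matrix returns [[TN, FP], [FN, TP]].
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
tpr = tp / (tp + fn)        # 3 of the 4 positives were found
fpr = fp / (fp + tn)        # 1 of the 2 negatives was mislabelled
precision = tp / (tp + fp)  # 3 of the 4 predicted positives were right
print(tn, fp, fn, tp)       # 1 1 1 3
print(tpr, fpr, precision)  # 0.75 0.5 0.75
```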
loans_result = loads["loan_status"]
loans_test = loads.drop("loan_status", axis=1)
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict, KFold  # sklearn.cross_validation was renamed to sklearn.model_selection
lr = LogisticRegression()
kf = KFold(n_splits=3, shuffle=True, random_state=1)  # the old KFold(n, ...) signature is gone; pass n_splits instead
predictions = cross_val_predict(lr, loans_test, loans_result, cv=kf)
predictions = pd.Series(predictions, index=loads.index)  # a Series aligned with loads, so the boolean masks below line up
# TN
tn = (predictions == 0) & (loads["loan_status"] == 0)
tn_num = len(predictions[tn])
# TP
tp = (predictions == 1) & (loads["loan_status"] == 1)
tp_num = len(predictions[tp])
# FP
fp = (predictions == 1) & (loads["loan_status"] == 0)
fp_num = len(predictions[fp])
# FN
fn = (predictions == 0) & (loads["loan_status"] == 1)
fn_num = len(predictions[fn])
# rates
fpr = fp_num / float(fp_num + tn_num)
tpr = tp_num / float(tp_num + fn_num)
print(tpr)
print(fpr)
print(predictions[:20])
0.9988995955270045
0.9991050653302309
Both tpr and fpr are very large, which is clearly not acceptable. Why does this happen? Check loans_result:
loans_result.value_counts()
#1 33859
#0 5639
The positives far outnumber the negatives, so the classes are imbalanced. Think back to the model-tuning notes:
if the positives only slightly outnumber the negatives, undersample them; if they outnumber the negatives by a lot, collect more data or oversample the negative class, remembering to sample randomly or with stratification.
Another approach is to give the positive and negative classes different weights.
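A minimal sketch of random undersampling with pandas `sample` (the toy frame and column values are made up; only `loan_status` mirrors a real column name):

```python
import pandas as pd

# Toy imbalanced frame: 6 positives, 2 negatives.
df = pd.DataFrame({"loan_status": [1, 1, 1, 1, 1, 1, 0, 0],
                   "x": range(8)})
neg = df[df["loan_status"] == 0]
# Randomly draw as many positives as there are negatives.
pos = df[df["loan_status"] == 1].sample(n=len(neg), random_state=1)
balanced = pd.concat([pos, neg])
print(balanced["loan_status"].value_counts().to_dict())  # {1: 2, 0: 2}
```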
# Assign class weights
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict, KFold
lr = LogisticRegression(class_weight="balanced")  # class_weight is the knob to tune here; a natural grid-search insertion point
kf = KFold(n_splits=3, shuffle=True, random_state=1)
predictions = cross_val_predict(lr, loans_test, loans_result, cv=kf)
predictions = pd.Series(predictions, index=loads.index)
# TN
tn = (predictions == 0) & (loads["loan_status"] == 0)
tn_num = len(predictions[tn])
# TP
tp = (predictions == 1) & (loads["loan_status"] == 1)
tp_num = len(predictions[tp])
# FP
fp = (predictions == 1) & (loads["loan_status"] == 0)
fp_num = len(predictions[fp])
# FN
fn = (predictions == 0) & (loads["loan_status"] == 1)
fn_num = len(predictions[fn])
# rates
fpr = fp_num / float(fp_num + tn_num)
tpr = tp_num / float(tp_num + fn_num)
print(tpr)
print(fpr)
print(predictions[:20])
0.6353794908398763
0.6207266869518525
# class_weight also accepts a dict
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict, KFold
penalty = {
    0: 5,
    1: 1
}
lr = LogisticRegression(class_weight=penalty)
kf = KFold(n_splits=3, shuffle=True, random_state=1)
predictions = cross_val_predict(lr, loans_test, loans_result, cv=kf)
predictions = pd.Series(predictions, index=loads.index)
# TN
tn = (predictions == 0) & (loads["loan_status"] == 0)
tn_num = len(predictions[tn])
# TP
tp = (predictions == 1) & (loads["loan_status"] == 1)
tp_num = len(predictions[tp])
# FP
fp = (predictions == 1) & (loads["loan_status"] == 0)
fp_num = len(predictions[fp])
# FN
fn = (predictions == 0) & (loads["loan_status"] == 1)
fn_num = len(predictions[fn])
# rates
fpr = fp_num / float(fp_num + tn_num)
tpr = tp_num / float(tp_num + fn_num)
print(tpr)
print(fpr)
print(predictions[:20])
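Since class_weight is a natural grid-search knob, here is a hedged sketch of searching over it with GridSearchCV; the data is synthetic and the candidate weights are purely illustrative:

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV

# Synthetic imbalanced data: class 1 is the rare class (~15%).
X, y = make_classification(n_samples=400, weights=[0.85, 0.15], random_state=1)
param_grid = {"class_weight": ["balanced", {0: 1, 1: 5}, {0: 1, 1: 10}]}
# Score by recall so the search favours weights that catch the rare class.
search = GridSearchCV(LogisticRegression(max_iter=1000), param_grid,
                      scoring="recall", cv=3)
search.fit(X, y)
print(search.best_params_)
```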
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_predict, KFold
rf = RandomForestClassifier(n_estimators=10, class_weight="balanced", random_state=1)
# print(help(RandomForestClassifier))
kf = KFold(n_splits=3, shuffle=True, random_state=1)
predictions = cross_val_predict(rf, loans_test, loans_result, cv=kf)
predictions = pd.Series(predictions, index=loads.index)
# False positives.
fp_filter = (predictions == 1) & (loads["loan_status"] == 0)
fp = len(predictions[fp_filter])
# True positives.
tp_filter = (predictions == 1) & (loads["loan_status"] == 1)
tp = len(predictions[tp_filter])
# False negatives.
fn_filter = (predictions == 0) & (loads["loan_status"] == 1)
fn = len(predictions[fn_filter])
# True negatives.
tn_filter = (predictions == 0) & (loads["loan_status"] == 0)
tn = len(predictions[tn_filter])
# Rates
tpr = tp / float(tp + fn)
fpr = fp / float(fp + tn)
print(tpr)
print(fpr)
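A single TPR/FPR pair depends on the 0.5 threshold; cross_val_predict with method="predict_proba" plus roc_auc_score summarizes the whole TPR/FPR trade-off in one number. A sketch on synthetic data (not the loan data above):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_predict
from sklearn.metrics import roc_auc_score

X, y = make_classification(n_samples=500, random_state=1)
rf = RandomForestClassifier(n_estimators=50, class_weight="balanced", random_state=1)
# Out-of-fold probability of the positive class for every sample.
proba = cross_val_predict(rf, X, y, cv=3, method="predict_proba")[:, 1]
print(round(roc_auc_score(y, proba), 3))
```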