1.数据预处理
1.1去掉Url以及描述等内容
import pandas as pd

# Load the raw Lending Club export; the file's first line is a banner note,
# so the real header sits on line 2 (skiprows=1).
loans_2007 = pd.read_csv('LoanStats3a.csv', skiprows=1)
# Integer division is required here: in Python 3 "/" yields a float, and
# DataFrame.dropna(thresh=...) expects an int (a float raises TypeError in
# modern pandas versions).
half_count = len(loans_2007) // 2
# Keep only columns with at least half of their values present.
loans_2007 = loans_2007.dropna(thresh=half_count, axis=1)
# Free-text description and the listing URL carry no predictive signal.
loans_2007 = loans_2007.drop(['desc', 'url'], axis=1)
loans_2007.to_csv('loans_2007.csv', index=False)
查看上一步处理后的数据信息
# Reload the pre-filtered dataset and inspect the first record plus the
# remaining column count.
loans_2007 = pd.read_csv("loans_2007.csv")
first_record = loans_2007.iloc[0]
print(first_record)
column_count = loans_2007.shape[1]
print(column_count)
结果显示总共有52个维度(图片里我没有放全),接下来的问题是怎样从这52个维度中选取机器学习要用到的维度。
1.2去掉与贷款可能无关的维度
# Remove identifier columns and columns that leak the outcome (amounts
# actually funded/repaid, recoveries, last payment info, ...): a model that
# saw these would be cheating.
leakage_or_id_columns = [
    "id", "member_id", "funded_amnt", "funded_amnt_inv", "grade",
    "sub_grade", "emp_title", "issue_d", "zip_code", "out_prncp",
    "out_prncp_inv", "total_pymnt", "total_pymnt_inv", "total_rec_prncp",
    "total_rec_int", "total_rec_late_fee", "recoveries",
    "collection_recovery_fee", "last_pymnt_d", "last_pymnt_amnt",
]
loans_2007 = loans_2007.drop(leakage_or_id_columns, axis=1)
print(loans_2007.iloc[0])
print(loans_2007.shape[1])
1.3选择最后是否贷款那一列的分类结果
print(loans_2007['loan_status'].value_counts())
用pandas的value_counts()方法可以看loan_status这一列每种结果出现的次数的一个排名,我在最后只选择了全额还清(Fully Paid)以及违约核销(Charged Off)这两个来作为二分类的label值,分别用1和0表示
# Keep only the two terminal outcomes and turn them into a binary label:
# "Fully Paid" -> 1, "Charged Off" -> 0. All other (in-progress) statuses
# are discarded.
terminal_statuses = ["Fully Paid", "Charged Off"]
loans_2007 = loans_2007[loans_2007['loan_status'].isin(terminal_statuses)]
label_encoding = {
    "loan_status": {
        "Fully Paid": 1,
        "Charged Off": 0,
    }
}
loans_2007 = loans_2007.replace(label_encoding)
注意:刚刚处理完的数据还有32列
1.4去掉某一列的值是完全一样的
# Drop every column whose non-null values are all identical: such a column
# carries no information. NaNs are removed first so a missing value is not
# counted as a second distinct value.
drop_columns = [
    column for column in loans_2007.columns
    if loans_2007[column].dropna().nunique() == 1
]
loans_2007 = loans_2007.drop(drop_columns, axis=1)
print(drop_columns)
print(loans_2007.shape)
loans_2007.to_csv('filtered_loans_2007.csv', index=False)
1.5 缺失值处理
loans = pd.read_csv('filtered_loans_2007.csv')
# Count missing values per column to decide the treatment for each one.
null_counts = loans.isnull().sum()
print(null_counts)
# pub_rec_bankruptcies has too many gaps, so the whole column goes; the
# remaining columns have few gaps, so only the affected rows are removed.
loans = loans.drop("pub_rec_bankruptcies", axis=1)
loans = loans.dropna(axis=0)
print(loans.dtypes.value_counts())
需要将字符类型进行转换
# Pull out only the string-typed (object dtype) columns; these still need to
# be converted to numbers before modelling.
object_columns_df = loans.select_dtypes(include=["object"])
sample_row = object_columns_df.iloc[0]
print(sample_row)
将字符型的数据进行处理,比如rent类型的替换成数值型
# Inspect the level frequencies of each candidate categorical column.
cols = ['home_ownership', 'verification_status', 'emp_length', 'term', 'addr_state']
for column_name in cols:
    counts = loans[column_name].value_counts()
    print(counts)
贷款的目的和标题可以只选择其中之一,因为title的可选项较多,所以选择purpose
# purpose and title both describe the loan reason; comparing their level
# counts shows purpose has far fewer distinct values, so it is the one kept.
for reason_column in ("purpose", "title"):
    print(loans[reason_column].value_counts())
# Map employment-length strings onto ordinal integers; "n/a" and "< 1 year"
# both collapse to 0.
emp_length_to_int = {
    "10+ years": 10,
    "9 years": 9,
    "8 years": 8,
    "7 years": 7,
    "6 years": 6,
    "5 years": 5,
    "4 years": 4,
    "3 years": 3,
    "2 years": 2,
    "1 year": 1,
    "< 1 year": 0,
    "n/a": 0,
}
mapping_dict = {"emp_length": emp_length_to_int}
# Drop date-like and high-cardinality text columns that will not be encoded.
loans = loans.drop(["last_credit_pull_d", "earliest_cr_line", "addr_state", "title"], axis=1)
# Percentage strings like "10.65%" -> float values.
for percent_column in ("int_rate", "revol_util"):
    loans[percent_column] = loans[percent_column].str.rstrip("%").astype("float")
loans = loans.replace(mapping_dict)
# One-hot encode the remaining categoricals and swap them in for the
# original columns.
# NOTE(review): emp_length is already numeric after the mapping above, so
# get_dummies passes it through unchanged rather than dummifying it.
cat_columns = ["home_ownership", "verification_status", "emp_length", "purpose", "term"]
dummy_df = pd.get_dummies(loans[cat_columns])
loans = pd.concat([loans, dummy_df], axis=1)
loans = loans.drop(cat_columns, axis=1)
loans = loans.drop("pymnt_plan", axis=1)
loans.to_csv('cleaned_loans2007.csv', index=False)
2.网站获取最大利润的做法
# Reload the fully cleaned dataset and confirm via info() that every
# remaining column is numeric and non-null.
loans = pd.read_csv("cleaned_loans2007.csv")
print(loans.info())
最大利润情况:考虑光用精度可能不行,因为样本极其不平衡,借钱样本很多,不借钱的样本很少。要考虑FPR和TPR,也就是不还钱的人尽量不借,FPR越小,TPR越高越好。
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict

# Split the cleaned table into the feature matrix and the binary label.
# BUGFIX: `features` and `target` were used below without ever being
# defined, which raised a NameError.
features = loans.drop("loan_status", axis=1)
target = loans["loan_status"]

# class_weight="balanced" compensates for the heavy class imbalance
# (far more fully-paid loans than charged-off ones).
lr = LogisticRegression(class_weight="balanced")
# Out-of-fold predictions for every row via 5-fold cross-validation.
predictions = pd.Series(cross_val_predict(lr, features, target, cv=5))

# Confusion-matrix counts against the true labels.
fp_filter = (predictions == 1) & (loans["loan_status"] == 0)
fp = len(predictions[fp_filter])  # false positives: lent to a defaulter
tp_filter = (predictions == 1) & (loans["loan_status"] == 1)
tp = len(predictions[tp_filter])  # true positives: lent to a payer
fn_filter = (predictions == 0) & (loans["loan_status"] == 1)
fn = len(predictions[fn_filter])  # false negatives: refused a payer
tn_filter = (predictions == 0) & (loans["loan_status"] == 0)
tn = len(predictions[tn_filter])  # true negatives: refused a defaulter

# TPR = fraction of good loans approved; FPR = fraction of bad loans
# approved. We want high TPR with low FPR. (sklearn.metrics.roc_curve
# could compute these directly.)
tpr = tp / float(tp + fn)
fpr = fp / float(fp + tn)
print(tpr)
print(fpr)
# BUGFIX: the original used the Python 2 print statement, a SyntaxError
# in Python 3.
print(predictions[:20])