Home Credit Default Risk
Summary
The data comes from Home Credit (捷信), a lender focused on providing credit to people without bank accounts. The task is to predict whether a client will repay a loan or run into payment difficulties, and submissions are evaluated with AUC (ROC).
This post only analyses application_train/application_test, building a Logistic Regression classifier tuned with grid search and a LightGBM baseline; the best result here is about 0.749 on the public leaderboard and 0.748 on the private leaderboard. The competition baseline is 0.68810 and the best score reached 0.79857. The remaining data files are covered in the follow-up post.
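Since AUC (ROC) is the evaluation metric, a score can be computed locally with scikit-learn; a minimal, self-contained sketch on toy data (the label and probability arrays below are made up purely for illustration):
import numpy as np
from sklearn.metrics import roc_auc_score

# Toy labels (1 = payment difficulties) and predicted probabilities of default
y_true = np.array([0, 0, 1, 1, 0, 1])
y_prob = np.array([0.10, 0.40, 0.35, 0.80, 0.20, 0.70])

# Area under the ROC curve, the competition's evaluation metric
print('AUC:', roc_auc_score(y_true, y_prob))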
Background
Home Credit (捷信) is one of the leading consumer finance providers in Central and Eastern Europe and Asia, serving consumers across CEE, Russia, the CIS countries, and Asia.
A client's history at the credit bureau can be used as a reference for risk assessment, but bureau data is often incomplete, because this population has few bank records to begin with. In the dataset, bureau.csv and bureau_balance.csv hold this data.
Home Credit offers three kinds of products: credit cards, POS (point-of-sale consumer) loans, and cash loans. Credit cards are popular in Europe and the US but not in the countries above, so there is relatively little credit-card data in the dataset. POS loans can only be used to buy goods, while cash loans pay out cash. All three products have application and repayment records; previous_application.csv, POS_CASH_BALANCE.csv, credit_card_balance.csv and installments_payments.csv hold this data.
The English names of the three products are: Revolving loan (credit card), Consumer installment loan (point of sales loan, POS loan), and Installment cash loan.
Dataset
As introduced above, the dataset consists of eight data files, which fall into three groups.
- application_train, application_test: the training set contains one row per loan application at Home Credit, identified by SK_ID_CURR. In the training set, TARGET is 0 if the loan was repaid and 1 if the client had repayment difficulties. These two files alone are enough for basic data analysis and modelling, and they are what this post covers.
- bureau, bureau_balance: these two files come from the credit bureau and contain the client's loan applications at other financial institutions, together with the monthly repayment and arrears records. One client (SK_ID_CURR) can have several bureau loans (SK_ID_BUREAU).
- previous_application, POS_CASH_BALANCE, credit_card_balance, installments_payments: these four files come from Home Credit itself. A client may already have used POS loans or credit cards, or have applied for loans at Home Credit before; these files contain the corresponding applications and repayment records. One client (SK_ID_CURR) can have several previous records (SK_ID_PREV).
All of the tables are linked through three IDs: SK_ID_CURR, SK_ID_BUREAU and SK_ID_PREV.
Data analysis
Class balance
Read the data from application_train and application_test. The training set has 307,511 rows and 122 features; 24,825 of them defaulted (TARGET = 1) and 282,686 did not (TARGET = 0), so the class imbalance is not too severe.
# -*- coding: utf-8 -*-
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

## 307511 rows, 122 features: ~283k with TARGET = 0 and ~25k with TARGET = 1, so the imbalance is not too severe
app_train = pd.read_csv('input/application_train.csv')
app_test = pd.read_csv('input/application_test.csv')
print('training data shape is', app_train.shape)
print(app_train['TARGET'].value_counts())
training data shape is (307511, 122)
0 282686
1 24825
Name: TARGET, dtype: int64
Missing data
Of the 122 features, 67 have missing values, and 49 of those are missing in more than 47% of the rows.
## 67 features have missing values; 49 are missing in more than 47% of rows, and 47 of those 49 are housing-related. Could the housing features be reduced with PCA? Check how others handle this.
## The remaining two are EXT_SOURCE and OWN_CAR_AGE. Alternatively, these 49 features could simply be dropped.
mv = app_train.isnull().sum().sort_values()
mv = mv[mv>0]
mv_rate = mv/len(app_train)
mv_df = pd.DataFrame({'mv':mv, 'mv_rate':mv_rate})
print('number of features with more than 47% missing', len(mv_rate[mv_rate>0.47]))
mv_rate[mv_rate> 0.47]
number of features with more than 47% missing 49
EMERGENCYSTATE_MODE 0.473983
TOTALAREA_MODE 0.482685
YEARS_BEGINEXPLUATATION_MODE 0.487810
YEARS_BEGINEXPLUATATION_AVG 0.487810
YEARS_BEGINEXPLUATATION_MEDI 0.487810
FLOORSMAX_AVG 0.497608
FLOORSMAX_MEDI 0.497608
FLOORSMAX_MODE 0.497608
HOUSETYPE_MODE 0.501761
LIVINGAREA_AVG 0.501933
LIVINGAREA_MODE 0.501933
LIVINGAREA_MEDI 0.501933
ENTRANCES_AVG 0.503488
ENTRANCES_MODE 0.503488
ENTRANCES_MEDI 0.503488
APARTMENTS_MEDI 0.507497
APARTMENTS_AVG 0.507497
APARTMENTS_MODE 0.507497
WALLSMATERIAL_MODE 0.508408
ELEVATORS_MEDI 0.532960
ELEVATORS_AVG 0.532960
ELEVATORS_MODE 0.532960
NONLIVINGAREA_MODE 0.551792
NONLIVINGAREA_AVG 0.551792
NONLIVINGAREA_MEDI 0.551792
EXT_SOURCE_1 0.563811
BASEMENTAREA_MODE 0.585160
BASEMENTAREA_AVG 0.585160
BASEMENTAREA_MEDI 0.585160
LANDAREA_MEDI 0.593767
LANDAREA_AVG 0.593767
LANDAREA_MODE 0.593767
OWN_CAR_AGE 0.659908
YEARS_BUILD_MODE 0.664978
YEARS_BUILD_AVG 0.664978
YEARS_BUILD_MEDI 0.664978
FLOORSMIN_AVG 0.678486
FLOORSMIN_MODE 0.678486
FLOORSMIN_MEDI 0.678486
LIVINGAPARTMENTS_AVG 0.683550
LIVINGAPARTMENTS_MODE 0.683550
LIVINGAPARTMENTS_MEDI 0.683550
FONDKAPREMONT_MODE 0.683862
NONLIVINGAPARTMENTS_AVG 0.694330
NONLIVINGAPARTMENTS_MEDI 0.694330
NONLIVINGAPARTMENTS_MODE 0.694330
COMMONAREA_MODE 0.698723
COMMONAREA_AVG 0.698723
COMMONAREA_MEDI 0.698723
dtype: float64
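Following up on the PCA idea in the comment above, a minimal sketch that compresses the numeric housing statistics (selecting the columns by their _AVG/_MODE/_MEDI suffix is my own shortcut, and the 95% variance threshold is arbitrary):
from sklearn.decomposition import PCA
from sklearn.impute import SimpleImputer

# Numeric housing-related columns (the building statistics end in _AVG / _MODE / _MEDI)
housing_cols = [c for c in app_train.columns
                if c.endswith(('_AVG', '_MODE', '_MEDI')) and app_train[c].dtype != 'object']

# PCA cannot handle NaN, so fill with the column median first
housing = SimpleImputer(strategy='median').fit_transform(app_train[housing_cols])

# Keep enough components to explain ~95% of the variance
pca = PCA(n_components=0.95)
housing_pca = pca.fit_transform(housing)
print(housing_pca.shape, pca.explained_variance_ratio_.sum())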
Data types
Of the 122 features, 65 are floats, 41 are integers, and 16 are non-numeric. Plot the non-numeric features against TARGET to inspect the relationship. OCCUPATION_TYPE and ORGANIZATION_TYPE are given manual ordinal codes; the remaining categorical features are one-hot encoded.
## 65 float, 41 int, 16 non-numeric features. OCCUPATION_TYPE and ORGANIZATION_TYPE are encoded by hand; the rest can be encoded automatically.
categorical = [col for col in app_train.columns if app_train[col].dtypes == 'object']
ct = app_train[categorical].nunique().sort_values()
for col in categorical:
    if (col != 'OCCUPATION_TYPE') & (col != 'ORGANIZATION_TYPE'):
        plt.figure(figsize=[10, 10])
        sns.barplot(y=app_train[col], x=app_train['TARGET'])
# Label-encode the categorical features with exactly two values. Note that nunique() and len(unique()) differ: the former ignores NaN, the latter counts it.
from sklearn.preprocessing import LabelEncoder
lb = LabelEncoder()
count = 0
for col in categorical:
    if len(app_train[col].unique()) == 2:
        count = count + 1
        lb.fit(app_train[col])
        app_train['o' + col] = lb.transform(app_train[col])
        app_test['o' + col] = lb.transform(app_test[col])
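# A quick illustration of the nunique() vs len(unique()) difference mentioned above
# (toy Series, purely for demonstration):
s = pd.Series(['Y', 'N', np.nan])
print(s.nunique())      # -> 2, NaN is not counted
print(len(s.unique()))  # -> 3, NaN counts as a distinct value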
# The default rate shows some relationship with housing type mode and family status.
# OCCUPATION_TYPE can be encoded with the ordinal mapping below
col = 'OCCUPATION_TYPE'
#occ_sort = app_train.groupby(['OCCUPATION_TYPE'])['TARGET'].agg(['mean','count']).sort_values(by = 'mean')
#order1 = list(occ_sort.index)
#plt.figure(figsize = [10,10])
#sns.barplot(y = app_train[col], x = app_train['TARGET'], order = order1)
dict1 = {'Accountants' : 1,
'High skill tech staff':2, 'Managers':2, 'Core staff':2,
'HR staff' : 2,'IT staff': 2, 'Private service staff': 2, 'Medicine staff': 2,
'Secretaries': 2,'Realty agents': 2,
'Cleaning staff': 3, 'Sales staff': 3, 'Cooking staff': 3,'Laborers': 3,
'Security staff': 3, 'Waiters/barmen staff': 3,'Drivers': 3,
'Low-skill Laborers': 4}
app_train['oOCCUPATION_TYPE'] = app_train['OCCUPATION_TYPE'].map(dict1)
app_test['oOCCUPATION_TYPE'] = app_test['OCCUPATION_TYPE'].map(dict1)
plt.figure(figsize = [10,10])
sns.barplot(x = app_train['oOCCUPATION_TYPE'], y = app_train['TARGET'])
##
col = 'ORGANIZATION_TYPE'
#organ_sort = app_train.groupby(['ORGANIZATION_TYPE'])['TARGET'].agg(['mean','count']).sort_values(by = 'mean')
#order1 = list(organ_sort.index)
#plt.figure(figsize = [20,20])
#sns.barplot(y = app_train[col], x = app_train['TARGET'], order = order1)
dict1 = {'Trade: type 4' :1, 'Industry: type 12' :1, 'Transport: type 1' :1, 'Trade: type 6' :1,
'Security Ministries' :1, 'University' :1, 'Police' :1, 'Military' :1, 'Bank' :1, 'XNA' :1,
'Culture' :2, 'Insurance' :2, 'Religion' :2, 'School' :2, 'Trade: type 5' :2, 'Hotel' :2, 'Industry: type 10' :2,
'Medicine' :2, 'Services' :2, 'Electricity' :2, 'Industry: type 9' :2, 'Industry: type 5' :2, 'Government' :2,
'Trade: type 2' :2, 'Kindergarten' :2, 'Emergency' :2, 'Industry: type 6' :2, 'Industry: type 2' :2, 'Telecom' :2,
'Other' :3, 'Transport: type 2' :3, 'Legal Services' :3, 'Housing' :3, 'Industry: type 7' :3, 'Business Entity Type 1' :3,
'Advertising' :3, 'Postal':3, 'Business Entity Type 2' :3, 'Industry: type 11' :3, 'Trade: type 1' :3, 'Mobile' :3,
'Transport: type 4' :4, 'Business Entity Type 3' :4, 'Trade: type 7' :4, 'Security' :4, 'Industry: type 4' :4,
'Self-employed' :5, 'Trade: type 3' :5, 'Agriculture' :5, 'Realtor' :5, 'Industry: type 3' :5, 'Industry: type 1' :5,
'Cleaning' :5, 'Construction' :5, 'Restaurant' :5, 'Industry: type 8' :5, 'Industry: type 13' :5, 'Transport: type 3' :5}
app_train['oORGANIZATION_TYPE'] = app_train['ORGANIZATION_TYPE'].map(dict1)
app_test['oORGANIZATION_TYPE'] = app_test['ORGANIZATION_TYPE'].map(dict1)
plt.figure(figsize = [10,10])
sns.barplot(x = app_train['oORGANIZATION_TYPE'], y = app_train['TARGET'])
## Only these columns get manual ordinal codes; the rest are one-hot encoded. With the new 'o' columns the shapes are (307511, 127) and (48744, 126); after dropping the originals, 122 and 121 features remain.
discard_features = ['ORGANIZATION_TYPE', 'OCCUPATION_TYPE','FLAG_OWN_CAR','FLAG_OWN_REALTY','NAME_CONTRACT_TYPE']
app_train.drop(discard_features,axis = 1, inplace = True)
app_test.drop(discard_features,axis = 1, inplace = True)
# Then apply get_dummies: (307511, 169), (48744, 165)
app_train = pd.get_dummies(app_train)
app_test = pd.get_dummies(app_test)
print('Training Features shape: ', app_train.shape)
print('Testing Features shape: ', app_test.shape)
# Some one-hot columns do not appear in the test set, so the two frames need to be aligned: (307511, 166), (48744, 165)
train_labels = app_train['TARGET']
app_train, app_test = app_train.align(app_test, join = 'inner', axis = 1)
app_train['TARGET'] = train_labels
print('Training Features shape: ', app_train.shape)
print('Testing Features shape: ', app_test.shape)
Training Features shape: (307511, 169)
Testing Features shape: (48744, 165)
Training Features shape: (307511, 166)
Testing Features shape: (48744, 165)
Outliers
The feature DAYS_EMPLOYED contains anomalous values, and it correlates fairly strongly with TARGET, so it needs special handling. An extra flag feature, DAYS_EMPLOYED_ANOM, records whether the original value was anomalous. This treatment mainly helps linear methods; tree-based methods should be able to pick up the anomaly on their own.
## Continuing the EDA: DAYS_EMPLOYED contains bad values. Handling them this way mainly helps linear methods; boosting methods should detect the anomaly automatically.
app_train['DAYS_EMPLOYED'].plot.hist(title = 'DAYS_EMPLOYMENT HISTOGRAM')
app_test['DAYS_EMPLOYED'].plot.hist(title = 'DAYS_EMPLOYMENT HISTOGRAM')
app_train['DAYS_EMPLOYED_ANOM'] = app_train['DAYS_EMPLOYED'] == 365243
app_train['DAYS_EMPLOYED'].replace({365243: np.nan}, inplace = True)
app_test['DAYS_EMPLOYED_ANOM'] = app_test["DAYS_EMPLOYED"] == 365243
app_test["DAYS_EMPLOYED"].replace({365243: np.nan}, inplace = True)
app_train['DAYS_EMPLOYED'].plot.hist(title = 'Days Employment Histogram');
plt.xlabel('Days Employment')
## Correlations with TARGET
correlations = app_train.corr()['TARGET'].sort_values()
print('Most Positive Correlations:\n', correlations.tail(15))
print('\nMost Negative Correlations:\n', correlations.head(15))
Most Positive Correlations:
REG_CITY_NOT_LIVE_CITY 0.044395
FLAG_EMP_PHONE 0.045982
NAME_EDUCATION_TYPE_Secondary / secondary special 0.049824
REG_CITY_NOT_WORK_CITY 0.050994
DAYS_ID_PUBLISH 0.051457
CODE_GENDER_M 0.054713
DAYS_LAST_PHONE_CHANGE 0.055218
NAME_INCOME_TYPE_Working 0.057481
REGION_RATING_CLIENT 0.058899
REGION_RATING_CLIENT_W_CITY 0.060893
oORGANIZATION_TYPE 0.070121
DAYS_EMPLOYED 0.074958
oOCCUPATION_TYPE 0.077514
DAYS_BIRTH 0.078239
TARGET 1.000000
Name: TARGET, dtype: float64
Most Negative Correlations:
EXT_SOURCE_3 -0.178919
EXT_SOURCE_2 -0.160472
EXT_SOURCE_1 -0.155317
NAME_EDUCATION_TYPE_Higher education -0.056593
CODE_GENDER_F -0.054704
NAME_INCOME_TYPE_Pensioner -0.046209
DAYS_EMPLOYED_ANOM -0.045987
FLOORSMAX_AVG -0.044003
FLOORSMAX_MEDI -0.043768
FLOORSMAX_MODE -0.043226
EMERGENCYSTATE_MODE_No -0.042201
HOUSETYPE_MODE_block of flats -0.040594
AMT_GOODS_PRICE -0.039645
REGION_POPULATION_RELATIVE -0.037227
ELEVATORS_AVG -0.034199
Name: TARGET, dtype: float64
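The three EXT_SOURCE columns show the strongest (negative) correlations with TARGET. A quick way to see why, a minimal sketch plotting the distribution of EXT_SOURCE_3 for the two classes (the choice of plot is mine, not from the original analysis):
import matplotlib.pyplot as plt
import seaborn as sns

plt.figure(figsize=[8, 6])
# Distribution of EXT_SOURCE_3 for repaid loans vs loans with payment difficulties
sns.kdeplot(app_train.loc[app_train['TARGET'] == 0, 'EXT_SOURCE_3'].dropna(), label='TARGET = 0')
sns.kdeplot(app_train.loc[app_train['TARGET'] == 1, 'EXT_SOURCE_3'].dropna(), label='TARGET = 1')
plt.xlabel('EXT_SOURCE_3')
plt.legend()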
Filling missing values
## Feature engineering
## Fill missing values
from sklearn.preprocessing import Imputer, MinMaxScaler
imputer = Imputer(strategy = 'median')
scaler = MinMaxScaler(feature_range = [0,1])
train = app_train.drop(columns = ['TARGET'])
imputer.fit(train)
train = imputer.transform(train)
test = imputer.transform(app_test)
scaler.fit(train)
train = scaler.transform(train)
test = scaler.transform(test)
print('Training data shape: ', train.shape)
print('Testing data shape: ', test.shape)
D:\Anaconda3\lib\site-packages\sklearn\utils\deprecation.py:66: DeprecationWarning: Class Imputer is deprecated; Imputer was deprecated in version 0.20 and will be removed in 0.22. Import impute.SimpleImputer from sklearn instead.
warnings.warn(msg, category=DeprecationWarning)
Training data shape: (307511, 166)
Testing data shape: (48744, 166)
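The deprecation warning above points at the replacement class; a minimal equivalent sketch using the newer sklearn.impute.SimpleImputer (the behaviour should match the median imputation used here):
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import MinMaxScaler

# Drop-in replacement for the deprecated Imputer
imputer = SimpleImputer(strategy='median')
scaler = MinMaxScaler(feature_range=(0, 1))

train = imputer.fit_transform(app_train.drop(columns=['TARGET']))
test = imputer.transform(app_test)
train = scaler.fit_transform(train)
test = scaler.transform(test)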
Modeling
Logistic Regression
A logistic regression model is fitted, with grid search over the hyper-parameters; the best performance is obtained with C = 1 and penalty = 'l1'.
##
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV
param_grid = {'C' : [0.01,0.1,1,10,100],
'penalty' : ['l1','l2']}
log_reg = LogisticRegression()
grid_search = GridSearchCV(log_reg, param_grid, scoring = 'roc_auc', cv = 5)
grid_search.fit(train, train_labels)
# Train on the training data
log_reg_best = grid_search.best_estimator_
log_reg_pred = log_reg_best.predict_proba(test)[:, 1]
submit = app_test[['SK_ID_CURR']]
submit['TARGET'] = log_reg_pred
submit.head()
submit.to_csv('log_reg_baseline_gridsearch2.csv', index = False)
Final result
public board 0.73889
private board 0.73469
LightGBM
For comparison, a cross-validated LightGBM model is trained on the same features (train_df/test_df are taken to be the preprocessed frames from above, and the number of folds is an assumption, since the original post does not state it):
import gc
import numpy as np
import pandas as pd
from lightgbm import LGBMClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import KFold

num_folds = 5  # assumed value; not specified in the original post
train_df, test_df = app_train, app_test  # the preprocessed frames from above

folds = KFold(n_splits=num_folds, shuffle=True, random_state=1001)
# Create arrays and dataframes to store results
oof_preds = np.zeros(train_df.shape[0])
sub_preds = np.zeros(test_df.shape[0])
feature_importance_df = pd.DataFrame()
feats = [f for f in train_df.columns if f not in ['TARGET', 'SK_ID_CURR', 'SK_ID_BUREAU', 'SK_ID_PREV', 'index']]
for n_fold, (train_idx, valid_idx) in enumerate(folds.split(train_df[feats], train_df['TARGET'])):
    train_x, train_y = train_df[feats].iloc[train_idx], train_df['TARGET'].iloc[train_idx]
    valid_x, valid_y = train_df[feats].iloc[valid_idx], train_df['TARGET'].iloc[valid_idx]
    # LightGBM parameters found by Bayesian optimization
    clf = LGBMClassifier(
        nthread=4,
        n_estimators=10000,
        learning_rate=0.02,
        num_leaves=34,
        colsample_bytree=0.9497036,
        subsample=0.8715623,
        max_depth=8,
        reg_alpha=0.041545473,
        reg_lambda=0.0735294,
        min_split_gain=0.0222415,
        min_child_weight=39.3259775,
        silent=-1,
        verbose=-1, )
    clf.fit(train_x, train_y, eval_set=[(train_x, train_y), (valid_x, valid_y)],
            eval_metric='auc', verbose=200, early_stopping_rounds=200)
    oof_preds[valid_idx] = clf.predict_proba(valid_x, num_iteration=clf.best_iteration_)[:, 1]
    sub_preds += clf.predict_proba(test_df[feats], num_iteration=clf.best_iteration_)[:, 1] / folds.n_splits
    fold_importance_df = pd.DataFrame()
    fold_importance_df["feature"] = feats
    fold_importance_df["importance"] = clf.feature_importances_
    fold_importance_df["fold"] = n_fold + 1
    feature_importance_df = pd.concat([feature_importance_df, fold_importance_df], axis=0)
    print('Fold %2d AUC : %.6f' % (n_fold + 1, roc_auc_score(valid_y, oof_preds[valid_idx])))
    del clf, train_x, train_y, valid_x, valid_y
    gc.collect()
print('Full AUC score %.6f' % roc_auc_score(train_df['TARGET'], oof_preds))
private board 0.74847
public board 0.74981
Feature importance
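The per-fold importances collected in feature_importance_df above can be aggregated and plotted; a minimal sketch (showing the top 20 features is an arbitrary choice):
import matplotlib.pyplot as plt
import seaborn as sns

# Average each feature's importance over the folds and keep the top 20
mean_imp = (feature_importance_df.groupby('feature')['importance']
            .mean().sort_values(ascending=False).head(20))
plt.figure(figsize=[10, 8])
sns.barplot(x=mean_imp.values, y=mean_imp.index)
plt.title('LightGBM feature importance (mean over folds)')
plt.tight_layout()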
Appendix: field descriptions
Field | Meaning |
---|---|
SK_ID_CURR | ID of this loan application |
TARGET | Repayment risk of this application: 1 = higher risk, 0 = lower risk |
NAME_CONTRACT_TYPE | Loan type: cash loan or revolving loan (apply once, draw down repeatedly) |
CODE_GENDER | Applicant's gender |
FLAG_OWN_CAR | Whether the applicant owns a car |
FLAG_OWN_REALTY | Whether the applicant owns real estate |
CNT_CHILDREN | Number of children the applicant has |
AMT_INCOME_TOTAL | Applicant's income |
AMT_CREDIT | Credit amount of this loan |
AMT_ANNUITY | Loan annuity |
AMT_GOODS_PRICE | For consumer loans, the actual price of the goods |
NAME_TYPE_SUITE | Who accompanied the applicant for this application |
NAME_INCOME_TYPE | Applicant's income type |
NAME_EDUCATION_TYPE | Applicant's education level |
NAME_FAMILY_STATUS | Applicant's marital status |
NAME_HOUSING_TYPE | Applicant's housing situation (renting, owns a home, lives with parents, etc.) |
REGION_POPULATION_RELATIVE | Population density of the applicant's region, normalized |
DAYS_BIRTH | Applicant's birth date, in days before the application (negative) |
DAYS_EMPLOYED | Length of the applicant's current employment, in days before the application (negative) |
DAYS_REGISTRATION | Days before the application that the applicant last changed their registration (negative) |
DAYS_ID_PUBLISH | Days before the application that the applicant last changed the identity document used for the application (negative) |
FLAG_MOBIL | Whether the applicant provided a mobile phone number (1 = yes, 0 = no) |
FLAG_EMP_PHONE | Whether the applicant provided a work phone number (1 = yes, 0 = no) |
FLAG_WORK_PHONE | Whether the applicant provided a home phone number (1 = yes, 0 = no) |
FLAG_CONT_MOBILE | Whether the applicant's mobile phone was reachable (1 = yes, 0 = no) |
FLAG_EMAIL | Whether the applicant provided an email address (1 = yes, 0 = no) |
OCCUPATION_TYPE | Applicant's occupation |
REGION_RATING_CLIENT | Home Credit's rating of the applicant's region (1, 2, 3) |
REGION_RATING_CLIENT_W_CITY | Home Credit's rating of the applicant's region, taking the city into account (1, 2, 3) |
WEEKDAY_APPR_PROCESS_START | Day of the week on which the application was started |
HOUR_APPR_PROCESS_START | Hour at which the application was started |
REG_REGION_NOT_LIVE_REGION | Whether the permanent address differs from the contact address (1 = different, 0 = same; region level) |
REG_REGION_NOT_WORK_REGION | Whether the permanent address differs from the work address (1 = different, 0 = same; region level) |
LIVE_REGION_NOT_WORK_REGION | Whether the contact address differs from the work address (1 = different, 0 = same; region level) |
REG_CITY_NOT_LIVE_CITY | Whether the permanent address differs from the contact address (1 = different, 0 = same; city level) |
REG_CITY_NOT_WORK_CITY | Whether the permanent address differs from the work address (1 = different, 0 = same; city level) |
LIVE_CITY_NOT_WORK_CITY | Whether the contact address differs from the work address (1 = different, 0 = same; city level) |
ORGANIZATION_TYPE | Type of organization the applicant works for |
EXT_SOURCE_1 | Normalized score from external data source 1 |
EXT_SOURCE_2 | Normalized score from external data source 2 |
EXT_SOURCE_3 | Normalized score from external data source 3 |
APARTMENTS_AVG <----> EMERGENCYSTATE_MODE | Normalized statistics describing the applicant's housing/building |
OBS_30_CNT_SOCIAL_CIRCLE <----> DEF_60_CNT_SOCIAL_CIRCLE | Counts of the applicant's social surroundings observed with, or defaulting on, 30/60 days past due |
DAYS_LAST_PHONE_CHANGE | Days before the application that the applicant last changed their phone (negative) |
FLAG_DOCUMENT_2 <----> FLAG_DOCUMENT_21 | Whether the applicant additionally provided document 2, 3, 4, ..., 21 |
AMT_REQ_CREDIT_BUREAU_HOUR | Number of credit-bureau enquiries about the applicant in the hour before the application |
AMT_REQ_CREDIT_BUREAU_DAY | Number of credit-bureau enquiries in the day before the application |
AMT_REQ_CREDIT_BUREAU_WEEK | Number of credit-bureau enquiries in the week before the application |
AMT_REQ_CREDIT_BUREAU_MONTH | Number of credit-bureau enquiries in the month before the application |
AMT_REQ_CREDIT_BUREAU_QRT | Number of credit-bureau enquiries in the quarter before the application |
AMT_REQ_CREDIT_BUREAU_YEAR | Number of credit-bureau enquiries in the year before the application |
Even from this table one can make some guesses: OCCUPATION_TYPE, NAME_INCOME_TYPE and ORGANIZATION_TYPE are probably strongly associated with one another, while fields such as DAYS_LAST_PHONE_CHANGE and HOUR_APPR_PROCESS_START may not be important features. These guesses can be verified during the later feature analysis, and suitable dimensionality reduction applied where they hold, as in the quick check below.
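One quick way to probe the suspected association, a minimal sketch that cross-tabulates OCCUPATION_TYPE against NAME_INCOME_TYPE (the raw file is re-read because those columns were dropped or one-hot encoded during preprocessing above; using a normalized crosstab is just one illustrative choice):
import pandas as pd

raw = pd.read_csv('input/application_train.csv')

# Share of each NAME_INCOME_TYPE within each OCCUPATION_TYPE; rows dominated by a single
# column suggest the two categorical features carry largely overlapping information
ct = pd.crosstab(raw['OCCUPATION_TYPE'], raw['NAME_INCOME_TYPE'], normalize='index')
print(ct.round(2))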