[B11] Data Mining in Practice: A Customer Churn Early-Warning System

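This is a small data mining project. The analysis covers the following steps:

Data cleaning and format conversion
Exploratory data analysis
Feature selection
Feature engineering
Building several baseline models and trying several algorithms
Model tuning / improving the model
Evaluation and testing / reporting conclusions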

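Analyzing and preparing the data

Data overview
State: state name
Account Length: account length (how long the account has been active)
Area Code: area code
Phone: phone number
Int'l Plan: whether the customer has an international plan
VMail Plan: whether the customer has a voicemail plan
VMail Message: number of voicemail messages
Day Mins: daytime call minutes
Day Calls: number of daytime calls
Day Charge: daytime charges
Eve Mins: evening call minutes
Eve Calls: number of evening calls
Eve Charge: evening charges
Night Mins: night-time call minutes
Night Calls: number of night-time calls
Night Charge: night-time charges
Intl Mins: international call minutes
Intl Calls: number of international calls
Intl Charge: international charges
CustServ Calls: number of customer service calls
Churn?: whether the customer churned (the target label)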

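1. Data cleaning and format conversion

**Step 1. Load the CSV with pandas and check the basics. The dataset has 3,333 rows and 21 columns; the last column is the class label.

from __future__ import division  # true division; only needed on Python 2, where "/" truncates integers

import pandas as pd
import numpy as np

churn_df = pd.read_csv('churn.csv')
col_names = churn_df.columns.tolist()  # list all column names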

print("Column names:")
print(col_names)

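**Step 2. Basic information and data types

to_show = col_names[:5] + col_names[-5:]  # first 5 and last 5 columns

churn_df[to_show].head(5)  # look at the first 5 rows

churn_df.info()  # dtypes and non-null counts, which reveal missing values

churn_df.describe()  # per-column summary statistics:
# count, mean, std, min, 25%/50%/75% quantiles, max. Comparing count with the number of rows often tells you how many values are NA and in what proportion.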
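The NA count and ratio mentioned above can also be computed directly. Here is a minimal sketch on a toy frame; the values in `toy` are made up for illustration and do not come from churn.csv.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for churn_df (hypothetical values, not from churn.csv)
toy = pd.DataFrame({
    "Day Mins": [265.1, np.nan, 243.4, 299.4],
    "CustServ Calls": [1, 1, 0, np.nan],
    "Churn?": ["False.", "False.", "True.", "False."],
})

na_count = toy.isna().sum()   # number of missing values per column
na_ratio = toy.isna().mean()  # fraction of missing values per column

print(pd.DataFrame({"na_count": na_count, "na_ratio": na_ratio}))
```

Running the same two lines on `churn_df` gives one row per column of the real dataset.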

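2. Exploratory data analysis

**Step 1. Display feature information

# First look at the churn ratio and at the distribution of the number of customer service calls
import matplotlib.pyplot as plt  # plotting
%matplotlib inline

fig = plt.figure()
fig.set(alpha=0.3)  # set the figure's alpha (transparency)
# subplot2grid(shape, loc)
plt.subplot2grid((1,2),(0,0))  # a 1x2 grid; this plot goes at row 0, column 0

# plot kinds: line, bar, barh, kde
churn_df['Churn?'].value_counts().plot(kind='bar')  # count churned vs. retained customers

plt.title("stat for churn")  # set the title

plt.ylabel("number")  # counts per class: of 3333 rows, 2850 retained and 483 churned

plt.subplot2grid((1,2),(0,1))
churn_df['CustServ Calls'].value_counts().plot(kind='bar')  # customer service calls; customers who call to complain a lot may churn more
plt.title("stat for cusServCalls")  # title
plt.ylabel("number")  # about 1400 customers made 1 customer service call, and so on; the bars sum to 3333

# raw string so the backslashes in the Windows path are not treated as escapes
plt.savefig(r"C:\Jupyter_working_path\Projects\數據挖掘項目：用戶流失預警系統_\picture")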

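**Of the 3,333 samples, 2,850 are labeled False. (retained) and 483 are labeled True. (churned).
**About 1,400 customers made 1 customer service call, about 760 made 2, and so on; the counts sum to 3,333.
**The hypothesis, checked against the label in Step 2 below, is that customers who call customer service more often are more likely to churn.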

import matplotlib.pyplot as plt

%matplotlib inline
fig = plt.figure()
fig.set(alpha=0.2)  # set the figure's alpha (transparency)

plt.subplot2grid((1,3),(0,0))  # several small plots side by side in one figure
churn_df['Day Mins'].plot(kind='kde')  # daytime call minutes, drawn as a kernel density estimate
plt.xlabel(u"Mins")  # x-axis: minutes
plt.ylabel(u"density")
plt.title(u"dis for day mins")  # title

plt.subplot2grid((1,3),(0,1))
churn_df['Day Calls'].plot(kind='kde')  # number of daytime calls
plt.xlabel(u"call")  # x-axis: number of calls
plt.ylabel(u"density")
plt.title(u"dis for day calls")  # title

plt.subplot2grid((1,3),(0,2))
churn_df['Day Charge'].plot(kind='kde')  # daytime charges
plt.xlabel(u"Charge")  # x-axis: daytime charge
plt.ylabel(u"density")
plt.title(u"dis for day charge")

# raw string so the backslashes in the Windows path are not treated as escapes
plt.savefig(r"C:\Jupyter_working_path\Projects\數據挖掘項目：用戶流失預警系統_\picture1")

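**Step 2. Relationship between features and the class label

#import matplotlib.pyplot as plt
fig = plt.figure()
fig.set(alpha=0.2)  # set the figure's alpha (transparency)

# Relationship between churn and the international plan
int_yes = churn_df['Churn?'][churn_df['Int\'l Plan'] == 'yes'].value_counts()  # churn counts among customers with an international plan
int_no = churn_df['Churn?'][churn_df['Int\'l Plan'] == 'no'].value_counts()  # churn counts among customers without one

# The DataFrame column names become the legend labels
df_int=pd.DataFrame({u'int plan':int_yes, u'no int plan':int_no})

df_int.plot(kind='bar', stacked=True)
plt.title(u"statistic between int plan and churn")
plt.xlabel(u"churn")  # the bars are grouped by churn status; the stacked colors show the plan
plt.ylabel(u"number")

# raw string so the backslashes in the Windows path are not treated as escapes
plt.savefig(r"C:\Jupyter_working_path\Projects\數據挖掘項目：用戶流失預警系統_\picture2")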

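**We can see that customers with an international plan churn at a higher rate. Perhaps they have more alternatives, or they expect more from the service. They deserve special attention; it may be worth calling them to collect feedback.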
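A stacked bar of raw counts makes rates hard to read, so it can help to compute the churn rate within each plan group. The sketch below uses `pd.crosstab` with `normalize="index"` on a toy frame; the rows in `toy` are invented for illustration, and only the column names mirror the dataset.

```python
import pandas as pd

# Toy data standing in for churn_df (hypothetical rows, not from churn.csv)
toy = pd.DataFrame({
    "Int'l Plan": ["yes", "yes", "yes", "yes", "no", "no", "no", "no", "no", "no"],
    "Churn?": ["True.", "True.", "False.", "False.",
               "True.", "False.", "False.", "False.", "False.", "False."],
})

# normalize="index" turns each row into proportions that sum to 1,
# so the "True." column is the churn rate within each plan group
rates = pd.crosstab(toy["Int'l Plan"], toy["Churn?"], normalize="index")
print(rates)
```

In this toy data the "yes" group churns at 0.5 and the "no" group at about 0.17. The same call on `churn_df` would give the real rates.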

# Relationship between customer service calls and the outcome
fig = plt.figure()
fig.set(alpha=0.2)  # set the figure's alpha (transparency)

cus_0 = churn_df['CustServ Calls'][churn_df['Churn?'] == 'False.'].value_counts()  # call counts among retained customers
cus_1 = churn_df['CustServ Calls'][churn_df['Churn?'] == 'True.'].value_counts()  # call counts among churned customers
df=pd.DataFrame({u'churn':cus_1, u'retain':cus_0})
df.plot(kind='bar', stacked=True)
plt.title(u"Statistic between customer service call and churn")
plt.xlabel(u"Call service")
plt.ylabel(u"Num")

# raw string so the backslashes in the Windows path are not treated as escapes
plt.savefig(r"C:\Jupyter_working_path\Projects\數據挖掘項目：用戶流失預警系統_\picture3")

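3. Feature selection

**Based on our understanding of the problem, the first thing to do is drop three irrelevant columns: state, phone number, and area code.
**Convert to numeric types: some features are not numeric to begin with, and algorithms cannot use them directly, so we encode them.

# Turn the label column into a 0/1 array
ds_result = churn_df['Churn?']

# np.where(condition, x, y): condition is a boolean array; the result takes x where it is True and y where it is False
# here: 1 where the label is 'True.', 0 otherwise

Y = np.where(ds_result == 'True.',1,0)

dummies_int = pd.get_dummies(churn_df['Int\'l Plan'], prefix='_int\'l Plan')  # prefix: prepended to each dummy column name
# VMail Plan: voicemail plan
dummies_voice = pd.get_dummies(churn_df['VMail Plan'], prefix='VMail')

# concat joins two or more pandas objects; axis=1 joins them column-wise
ds_tmp=pd.concat([churn_df, dummies_int, dummies_voice], axis=1)

# Drop state, area code, phone number, the churn label, and the original yes/no plan columns
to_drop = ['State','Area Code','Phone','Churn?', 'Int\'l Plan', 'VMail Plan']
df = ds_tmp.drop(to_drop,axis=1)

print("after convert ")
df.head(5)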
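To make the two encoding steps concrete, here is a small sketch of what `np.where` and `pd.get_dummies` return. The three rows in `toy` are made up; only the column names and value formats follow the dataset.

```python
import numpy as np
import pandas as pd

# Toy column values (hypothetical, mimicking the format of churn.csv)
toy = pd.DataFrame({
    "VMail Plan": ["yes", "no", "no"],
    "Churn?": ["False.", "True.", "False."],
})

# Label: 1 where the string equals 'True.', else 0
labels = np.where(toy["Churn?"] == "True.", 1, 0)

# One column per category, named <prefix>_<category>
dummies = pd.get_dummies(toy["VMail Plan"], prefix="VMail")

print(labels)                 # [0 1 0]
print(list(dummies.columns))  # ['VMail_no', 'VMail_yes']
```

For a yes/no column the two dummy columns carry the same information, so one of them is redundant; passing `drop_first=True` to `get_dummies` is one way to keep a single column.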

4. Feature engineering

# The features are on different scales; StandardScaler removes the effect of the units
# Before training, convert the DataFrame to a float array.
# as_matrix() was removed in pandas 1.0 and np.float in NumPy 1.24, so use astype(float).to_numpy()
X = df.astype(float).to_numpy()

from sklearn.preprocessing import StandardScaler  # standardization

scaler = StandardScaler()

X = scaler.fit_transform(X)

print("Feature space holds %d observations and %d features" % X.shape)  # 3333 rows x 19 columns
print("---------------------------------")
print("Unique target labels:", np.unique(Y))  # distinct label values
print("---------------------------------")
print(len(Y[Y==0]))  # retained: 2850
print("---------------------------------")
print(len(Y[Y==1]))  # churned: 483
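What StandardScaler does to each column is z = (x - mean) / std, using the column's own mean and population standard deviation. The sketch below checks that by hand on a toy array; `X_toy` and its values are made up for illustration.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Two toy features on very different scales (hypothetical values)
X_toy = np.array([[100.0, 1.0],
                  [200.0, 2.0],
                  [300.0, 6.0]])

scaled = StandardScaler().fit_transform(X_toy)

# Same thing by hand: subtract the column mean, divide by the column
# standard deviation (population std, ddof=0, which is what StandardScaler uses)
manual = (X_toy - X_toy.mean(axis=0)) / X_toy.std(axis=0)

print(np.allclose(scaled, manual))
print(scaled.mean(axis=0).round(6), scaled.std(axis=0).round(6))
```

After scaling, every column has mean 0 and standard deviation 1, which matters for distance-based models such as KNN and SVM.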

# Alternative preparation: start again from churn_df and encode the yes/no columns as booleans
churn_result = churn_df['Churn?']
y = np.where(churn_result == 'True.',1,0)
to_drop = ['State','Area Code','Phone','Churn?']
churn_feat_space = churn_df.drop(to_drop,axis=1)
yes_no_cols = ["Int'l Plan","VMail Plan"]
churn_feat_space[yes_no_cols] = churn_feat_space[yes_no_cols] == 'yes'  # 'yes' -> True, 'no' -> False
features = churn_feat_space.columns
X = churn_feat_space.astype(float).to_numpy()  # as_matrix() and np.float no longer exist in current pandas/NumPy
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
X = scaler.fit_transform(X)

print("Feature space holds %d observations and %d features" % X.shape)
print("---------------------------------")
print("Unique target labels:", np.unique(y))
print("---------------------------------")
print(X[0])  # first row
print("---------------------------------")
print(len(y[y == 0]))

5. Building several baseline models and trying several algorithms

# A hand-written cross-validation loop
from sklearn.model_selection import KFold

def run_cv(X,y,clf_class,**kwargs):
    # Construct a kfolds object
    kf = KFold(5,shuffle=True)  # 5 folds
    y_pred = y.copy()  # copy of the labels; every entry is overwritten with an out-of-fold prediction

    # 5 folds in total: each round trains on 4 folds and validates on the remaining one
    for train_index, test_index in kf.split(X):

        X_train, X_test = X[train_index], X[test_index]
        y_train = y[train_index]
        # Initialize a classifier with key word arguments
        clf = clf_class(**kwargs)
        clf.fit(X_train,y_train)
        y_pred[test_index] = clf.predict(X_test)
    return y_pred

# Try the hand-written loop on three classifiers
from sklearn.svm import SVC
from sklearn.linear_model import LogisticRegression as LR
from sklearn.neighbors import KNeighborsClassifier as KNN

def accuracy(y_true,y_pred):
    # NumPy interprets True and False as 1. and 0.
    return np.mean(y_true == y_pred)  # fraction of correct predictions, e.g. (1+0+1+0+...)/3333

print("Support vector machines:")
print("%.3f" % accuracy(y, run_cv(X,y,SVC)))
print("----------------------------")
print("LogisticRegression :")
print("%.3f" % accuracy(y, run_cv(X,y,LR)))
print("----------------------------")
print("K-nearest-neighbors:")
print("%.3f" % accuracy(y, run_cv(X,y,KNN)))
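Because the classes are imbalanced, the accuracies above need a reference point. The sketch below works out the accuracy of a trivial rule that always predicts "no churn", using the class counts printed earlier in this post (2850 retained, 483 churned); the names `y_true` and `y_majority` are introduced here only for illustration.

```python
import numpy as np

# Class counts reported earlier in this post: 2850 retained, 483 churned
n_retained, n_churned = 2850, 483
y_true = np.array([0] * n_retained + [1] * n_churned)

# A "model" that always predicts the majority class (no churn)
y_majority = np.zeros_like(y_true)

baseline_accuracy = np.mean(y_true == y_majority)
print("%.3f" % baseline_accuracy)  # 0.855
```

A classifier only adds value to the extent that it beats 2850/3333, about 0.855, so accuracy alone says little about how well the churners are being found.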

# Import the tools
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.model_selection import cross_val_score,KFold
from sklearn.neighbors import KNeighborsClassifier
import matplotlib.pyplot as plt

# Set up the models
models = []
models.append(('KNN', KNeighborsClassifier()))

models.append(('LR', LogisticRegression()))

models.append(('SVM', SVC()))

# Collect the results
results = []
names = []
scoring = 'accuracy'  # evaluation metric
for name, model in models:

    # a fixed random_state gives every model the same folds
    kfold = KFold(5,shuffle=True,random_state = 0)  # 5 folds
    cv_results = cross_val_score(model, X, Y, cv=kfold, scoring=scoring)  # one accuracy score per fold
    results.append(cv_results)  # scores returned by cross-validation
    names.append(name)
    # the standard deviation shows how much the score varies across folds; a smaller std means a more stable model
    msg = "%s: %f (%f)" % (name, cv_results.mean(), cv_results.std())
    print(msg)
    print("------------------------------")
# boxplot algorithm comparison
fig = plt.figure()
fig.suptitle('Algorithm Comparison')
ax = fig.add_subplot(111)

plt.boxplot(results)
ax.set_xticklabels(names)
# raw string so the backslashes in the Windows path are not treated as escapes
plt.savefig(r"C:\Jupyter_working_path\Projects\數據挖掘項目：用戶流失預警系統_\picture4")

# Conclusion: SVM performs relatively well
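One detail of the comparison above: X was standardized once on the full dataset before cross-validation, so each validation fold has already influenced the scaler. A possible variation, sketched here under the assumption that a scikit-learn Pipeline is acceptable, fits the scaler inside each training fold instead. The data comes from `make_classification` and is purely synthetic; `X_demo`, `y_demo`, and `pipe` are illustrative names.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import KFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Synthetic stand-in for the churn features (hypothetical data, roughly 85/15 class split)
X_demo, y_demo = make_classification(n_samples=300, n_features=10,
                                     weights=[0.85, 0.15], random_state=0)

# The scaler is re-fit on the training part of each fold only
pipe = make_pipeline(StandardScaler(), SVC())

kfold = KFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(pipe, X_demo, y_demo, cv=kfold, scoring="accuracy")
print("%f (%f)" % (scores.mean(), scores.std()))
```

On a dataset of this size the difference is usually small, but the pipeline keeps the validation folds out of the preprocessing step.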

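6. Model tuning / improving the model

**To improve on the baselines, try ensemble methods, for example random forest and gradient boosting.

from sklearn.ensemble import RandomForestClassifier as RF
num_trees = 100
max_features = 3
# random_state is dropped here: without shuffle=True it has no effect, and recent scikit-learn raises an error for it
kfold = KFold(n_splits=10)
model = RF(n_estimators=num_trees, max_features=max_features)
results = cross_val_score(model, X, Y, cv=kfold)
print(results.mean())

from sklearn.ensemble import GradientBoostingClassifier
seed = 7
num_trees = 100
kfold = KFold(n_splits=10)  # unshuffled folds; random_state only applies with shuffle=True

model = GradientBoostingClassifier(n_estimators=num_trees, random_state=seed)
results = cross_val_score(model, X, Y, cv=kfold)

print(results.mean())

The first result is 0.9525957094819371; the second is 0.9525966085846325.

**Both ensemble methods clearly improve on the single models. One could go further and tune the number of trees, but the results would probably stay about the same.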
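Tuning the number of trees could look roughly like the sketch below, which uses scikit-learn's `GridSearchCV`. It runs on synthetic data from `make_classification`, and the candidate values in `param_grid` are arbitrary choices for illustration, not recommendations from this post.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, KFold

# Synthetic stand-in for the churn features (hypothetical data)
X_demo, y_demo = make_classification(n_samples=200, n_features=10, random_state=0)

param_grid = {"n_estimators": [10, 30, 50]}  # candidate tree counts (arbitrary choices)
search = GridSearchCV(
    RandomForestClassifier(max_features=3, random_state=7),
    param_grid,
    cv=KFold(n_splits=5, shuffle=True, random_state=7),
    scoring="accuracy",
)
search.fit(X_demo, y_demo)

print(search.best_params_, search.best_score_)
```

`search.cv_results_["mean_test_score"]` holds the mean score for each candidate, which makes it easy to see whether adding trees still helps.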

7. Evaluation and testing / reporting conclusions

def run_prob_cv(X, y, clf_class, **kwargs):
    kf = KFold(n_splits=5, shuffle=True)  # shuffle must be passed by keyword in current scikit-learn
    y_prob = np.zeros((len(y),2))
    for train_index, test_index in kf.split(X):
        X_train, X_test = X[train_index], X[test_index]
        y_train = y[train_index]
        clf = clf_class(**kwargs)
        clf.fit(X_train,y_train)
        # Predict probabilities, not classes
        y_prob[test_index] = clf.predict_proba(X_test)  # column 0 is P(class 0), column 1 is P(class 1)
    return y_prob

import warnings
warnings.filterwarnings('ignore')

# Use 10 estimators so predictions are all multiples of 0.1
pred_prob = run_prob_cv(X, y, RF, n_estimators=10)
# print(pred_prob[0])
pred_churn = pred_prob[:,1]  # keep only the probability of class 1, since churn is what we care about
is_churn = y == 1

# Number of times a predicted probability is assigned to an observation
counts = pd.Series(pred_churn).value_counts()  # group by predicted churn probability: pred_prob -> count
# print(counts)

# calculate true probabilities
true_prob = {}
for prob in counts.index:
    true_prob[prob] = np.mean(is_churn[pred_churn == prob])
true_prob = pd.Series(true_prob)  # convert once, after the loop

# pandas-fu
counts = pd.concat([counts,true_prob], axis=1).reset_index()
counts.columns = ['pred_prob', 'count', 'true_prob']
counts
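The table above compares each predicted churn probability with the fraction of customers in that group who actually churned; for a well-calibrated model the two columns should be close. The same table can be built with one groupby, as in this sketch. The arrays here are small made-up values, not output from the model.

```python
import numpy as np
import pandas as pd

# Toy out-of-fold predictions (hypothetical values, not model output)
pred_churn = np.array([0.0, 0.0, 0.0, 0.0, 0.5, 0.5, 1.0, 1.0])
is_churn = np.array([False, False, False, True, True, False, True, True])

table = (
    pd.DataFrame({"pred_prob": pred_churn, "is_churn": is_churn})
    .groupby("pred_prob")["is_churn"]
    .agg(count="size", true_prob="mean")  # group size and observed churn rate
    .reset_index()
)
print(table)
```

In this toy data the group with predicted probability 0.0 has an observed churn rate of 0.25, so the model would be too optimistic about those customers.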
