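Data Mining in Practice: Identifying Tax Evasion in the Automobile Sales Industry

This case study comes from the book 《Python数据分析与挖掘实战》 (Python Data Analysis and Mining in Practice).

The dataset can be downloaded from Tianchi.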

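Background

Problem

Corporate tax evasion is widespread and undermines the country's economic foundation.
In the automobile sales industry, evasion takes forms such as under-issuing invoices, under-reporting revenue, and delaying the recognition of warranty claim payments.

Goal

Build a model from the operating indicators of taxpayers in the automobile sales industry to identify enterprises that evade tax.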

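Data Analysis

Known data

Processing workflow

Similar to: Data Mining in Practice: Automatic Identification of Electricity Theft and Leakage Users

Preparation

Dataset download: python_data_analysis_and_mining_action

Code practice platform: Google Colab

Upload the data to Google Colab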

from google.colab import files
files.upload()

Data Preprocessing

Read the data

import pandas as pd
data = pd.read_csv("汽车销售行业纳税人偷漏税数据.csv")
data=data.drop(columns='纳税人编号')

Data information

data.info()
data.describe()

The output is shown below. Every column has 124 non-null entries, so there are no missing values:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 124 entries, 0 to 123
Data columns (total 15 columns):
销售类型             124 non-null object
销售模式             124 non-null object
汽车销售平均毛利         124 non-null float64
维修毛利             124 non-null float64
企业维修收入占销售收入比重    124 non-null float64
增值税税负            124 non-null float64
存货周转率            124 non-null float64
成本费用利润率          124 non-null float64
整体理论税负           124 non-null float64
整体税负控制数          124 non-null float64
办牌率              124 non-null float64
单台办牌手续费收入        124 non-null float64
代办保险率            124 non-null float64
保费返还率            124 non-null float64
输出               124 non-null object
dtypes: float64(12), object(3)
memory usage: 14.7+ KB
汽车销售平均毛利	维修毛利	企业维修收入占销售收入比重	增值税税负	存货周转率	成本费用利润率	整体理论税负	整体税负控制数	办牌率	单台办牌手续费收入	代办保险率	保费返还率
count	124.000000	124.000000	124.000000	124.000000	124.000000	124.000000	124.000000	124.000000	124.000000	124.000000	124.000000	124.000000
mean	0.023709	0.154894	0.068717	0.008287	11.036540	0.174839	0.010435	0.006961	0.146077	0.016387	0.169976	0.039165
std	0.103790	0.414387	0.158254	0.013389	12.984948	1.121757	0.032753	0.008926	0.236064	0.032510	0.336220	0.065910
min	-1.064600	-3.125500	0.000000	0.000000	0.000000	-1.000000	-0.181000	-0.007000	0.000000	0.000000	0.000000	-0.014800
25%	0.003150	0.000000	0.000000	0.000475	2.459350	-0.004075	0.000725	0.000000	0.000000	0.000000	0.000000	0.000000
50%	0.025100	0.156700	0.025950	0.004800	8.421250	0.000500	0.009100	0.006000	0.000000	0.000000	0.000000	0.000000
75%	0.049425	0.398925	0.079550	0.008800	15.199725	0.009425	0.015925	0.011425	0.272325	0.020000	0.138500	0.081350
max	0.177400	1.000000	1.000000	0.077000	96.746100	9.827200	0.159300	0.057000	0.877500	0.200000	1.529700	0.270000
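
The "no missing values" reading above can also be checked programmatically rather than by eye. The following is a minimal sketch on a made-up two-column frame (the values and the name `toy` are assumptions for illustration, not the real dataset); on the real data the same calls would be applied to `data`.

```python
import pandas as pd

# Hypothetical toy frame standing in for the real dataset (values are made up).
toy = pd.DataFrame({
    "维修毛利": [0.15, 0.40, None],
    "办牌率": [0.0, 0.27, 0.88],
})

# Count missing values per column; all zeros would mean nothing needs imputing.
missing_per_column = toy.isnull().sum()
print(missing_per_column)

# A single yes/no answer for the whole frame.
has_missing = bool(toy.isnull().values.any())
print(has_missing)
```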

Convert categorical attributes to numeric values

data[u'输出'] = data[u'输出'].map({u'正常': 0, u'异常': 1})
data[u'销售类型'] = data[u'销售类型'].map({u'国产轿车': 1, u'进口轿车': 2, u'大客车': 3,
                                           u'卡车及轻卡': 4, u'微型面包车': 5, u'商用货车': 6,
                                           u'工程车': 7, u'其它': 8})
data[u'销售模式'] = data[u'销售模式'].map({u'4S店': 1, u'一级代理商': 2, u'二级及二级以下代理商': 3,
                                           u'多品牌经营店': 4, u'其它': 5})
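
One thing worth knowing about `Series.map` with a dict: any label that is not a key in the dict silently becomes NaN. The sketch below, using made-up labels (the category "新能源车" is a hypothetical value chosen to be absent from the mapping), shows one way to detect such labels before training.

```python
import pandas as pd

# Hypothetical labels; "新能源车" is deliberately absent from the mapping.
sales_type = pd.Series(["国产轿车", "进口轿车", "新能源车"])
mapping = {"国产轿车": 1, "进口轿车": 2}

encoded = sales_type.map(mapping)

# Labels missing from the dict become NaN, so list them before training.
unmapped = sales_type[encoded.isnull()].unique().tolist()
print(unmapped)
```

If `unmapped` is non-empty on the real data, the mapping dict probably needs another entry.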

Model Training

Split into training and test sets

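import numpy as np

data = data.to_numpy()   # convert the DataFrame to a NumPy array
np.random.shuffle(data)  # shuffle the rows in place (random.shuffle is not safe on 2-D arrays)
# Use an 8:2 train/test split
p = 0.8
train = data[:int(len(data) * p), :]
test = data[int(len(data) * p):, :]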
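
As an alternative to shuffling and slicing by hand, scikit-learn's `train_test_split` can do the same 8:2 split reproducibly. This is a sketch on synthetic data of the same shape (124 rows, 14 features, one 0/1 label); the class counts and the `random_state` value are assumptions made for the example.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the 124 x 15 matrix: 14 features plus a 0/1 label.
rng = np.random.default_rng(0)
X = rng.normal(size=(124, 14))
y = np.array([0] * 80 + [1] * 44)

# stratify=y keeps the class ratio similar in both parts;
# random_state makes the split reproducible across runs.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
print(X_train.shape, X_test.shape)
```

Stratifying matters on a dataset this small, because a purely random 25-row test set can end up with very few abnormal cases.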

Confusion matrix visualization function

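import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix, roc_curve

def cm_plot(y, yp):
    """Plot the confusion matrix of true labels y against predictions yp."""
    cm = confusion_matrix(y, yp)

    plt.matshow(cm, cmap=plt.cm.Greens)
    plt.colorbar()

    # cm[i, j] counts samples of true class i predicted as class j.
    # matshow draws row i on the vertical axis, so the label goes at (x=j, y=i).
    for i in range(len(cm)):
        for j in range(len(cm)):
            plt.annotate(
                str(cm[i, j]),
                xy=(j, i),
                horizontalalignment='center',
                verticalalignment='center')

    plt.ylabel('True label')
    plt.xlabel('Predicted label')
    return plt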
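
To read the plotted matrix, it helps to know its layout: rows are true labels and columns are predicted labels. The sketch below uses made-up labels (not results from this dataset) to show how the four cells map to accuracy, recall and precision.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Made-up labels: 1 = abnormal (evasion), 0 = normal.
y_true = np.array([0, 0, 0, 1, 1, 1, 1, 0])
y_pred = np.array([0, 1, 0, 1, 0, 1, 1, 0])

cm = confusion_matrix(y_true, y_pred)
# For binary labels {0, 1}, ravel() yields tn, fp, fn, tp in that order.
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()
recall = tp / (tp + fn)      # share of true evaders that were caught
precision = tp / (tp + fp)   # share of flagged firms that really evade
print(cm)
print(accuracy, recall, precision)
```

For a tax-inspection use case, recall on the abnormal class is often the number to watch, since a missed evader costs more than an extra audit.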

Build the LM neural network

from keras.layers import Activation, Dense
from keras.models import Sequential

# Build the neural network model (the book calls it an LM network; here it is trained with Adam)
netfile = 'net.weights.h5'  # file for the trained weights (HDF5)

net = Sequential()  # create the network
# Connect the input layer (14 nodes) to the hidden layer (10 nodes)
net.add(Dense(10, input_shape=(14, )))
net.add(Activation('relu'))  # the hidden layer uses the relu activation
# Connect the hidden layer (10 nodes) to the output layer (1 node)
net.add(Dense(1))
net.add(Activation('sigmoid'))  # the output layer uses the sigmoid activation
net.compile(loss='binary_crossentropy', optimizer='adam')  # compile the model with the Adam optimizer
net.fit(train[:, :14], train[:, 14], epochs=100, batch_size=1)
net.save_weights(netfile)

# Keras predict() returns probabilities as an n x 1 array rather than the usual 1 x n,
# so flatten it and threshold at 0.5 to obtain class labels.
predict_result = (net.predict(train[:, :14]).reshape(len(train)) > 0.5).astype(int)
cm_plot(train[:, 14], predict_result).show()

# On the test set, keep the raw probabilities for the ROC curve
predict_result = net.predict(test[:, :14]).reshape(len(test))
fpr, tpr, thresholds = roc_curve(test[:, 14], predict_result, pos_label=1)
plt.plot(fpr, tpr, linewidth=2, label='ROC of LM')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
plt.ylim(0, 1.05)
plt.xlim(0, 1.05)
plt.legend(loc=4)
plt.show()
print(thresholds)
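
A ROC curve is easier to compare across models when it is summarized as a single number, the area under the curve (AUC). The sketch below uses made-up labels and scores (not output from the network above) and shows two equivalent ways to compute it with scikit-learn.

```python
import numpy as np
from sklearn.metrics import auc, roc_auc_score, roc_curve

# Made-up labels and predicted probabilities for illustration.
y_true = np.array([0, 0, 1, 1, 0, 1])
y_score = np.array([0.1, 0.4, 0.35, 0.8, 0.2, 0.9])

# Directly from labels and scores.
auc_direct = roc_auc_score(y_true, y_score)

# Or by integrating the curve returned by roc_curve.
fpr, tpr, thresholds = roc_curve(y_true, y_score, pos_label=1)
auc_from_curve = auc(fpr, tpr)

print(auc_direct, auc_from_curve)
```

An AUC of 0.5 corresponds to random guessing and 1.0 to a perfect ranking of abnormal above normal samples.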

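Print the name of each layer, then use a layer name to fetch its weights and biases:

for layer in net.layers:
    print(layer.name)

# Layer names look like dense_1; use the name printed above
w, b = net.get_layer('dense_1').get_weights()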

The first layer has 14 inputs and 10 outputs. Its weight matrix:

array([[-0.33323944, -0.17473654,  0.87461054, -0.8462864 ,  0.25585997,
         0.04722638,  0.65266305, -0.08507098, -0.29267487, -0.25785267],
       [ 0.5202124 ,  0.03159818,  0.4351549 ,  0.81486964,  0.44043756,
        -0.26803496, -0.7636168 ,  0.21452232, -0.0635082 , -0.09977171],
       [ 0.4392051 , -0.42368022, -0.7336786 , -0.46694276, -0.6174017 ,
         0.16300084,  0.9625222 ,  0.0446047 ,  0.52611935,  0.19423229],
       [ 1.8234462 ,  0.06254685, -1.4471666 ,  0.26359963, -1.7079837 ,
         2.2589228 ,  1.7308791 , -0.16358732,  1.3756496 ,  1.4106685 ],
       [ 1.0737519 ,  0.18168104, -1.619209  ,  0.39016438, -1.6890421 ,
         1.3992369 ,  0.9619709 ,  1.0533664 ,  1.6621295 ,  1.8717662 ],
       [-0.92529964, -0.3892988 ,  0.4119622 ,  1.6253519 ,  0.5639266 ,
        -0.40880913, -0.7206734 ,  0.67802995, -0.12859246,  0.16787821],
       [ 0.22189063, -0.1612761 ,  0.18605754, -0.47543216,  0.26998442,
         0.0331102 , -0.02441032, -0.28508526,  0.35632288,  0.07961233],
       [-0.12109375, -0.18872818,  0.54819757,  0.94135314,  0.2317546 ,
        -0.753005  , -0.37716654,  0.51411617, -0.5405725 , -1.0390322 ],
       [ 2.3368533 , -0.04014749, -1.8256061 ,  0.95449895, -2.0859642 ,
         1.8795613 ,  0.61885285,  0.1487553 ,  1.8318983 ,  1.9700083 ],
       [ 1.92646   ,  0.06230724, -1.6802945 ,  0.40118444, -1.2818127 ,
         2.5196013 ,  1.8299726 , -0.08480016,  1.6552794 ,  1.3718019 ],
       [-0.44886026, -0.2739805 ,  0.0532513 , -0.4950459 ,  0.19649826,
         0.18537489, -0.74298495, -0.34665307, -0.28344223, -0.24201499],
       [-0.31021225, -0.11335158,  0.67558414, -0.24258912,  0.3580687 ,
         0.23020446,  0.11817209,  0.37445474,  0.21545577, -0.52446955],
       [ 0.5257165 , -0.02639258, -1.0446917 , -0.49409497, -1.0229766 ,
         1.2764192 , -0.3819442 , -0.37377504,  0.95819145,  0.41725355],
       [ 1.7603374 , -0.24498129, -1.9779907 ,  0.05246234, -1.8831204 ,
         2.1065228 ,  0.329442  , -0.21982898,  1.606011  ,  1.9765277 ]],
      dtype=float32)

Its biases:

array([ 0.03276258, -0.0112296 , -0.05813374,  0.19073635, -0.03967857,
        0.49254265,  0.6294645 , -0.13973917,  0.13733548,  0.3180599 ],
      dtype=float32)

The second layer (10 inputs, 1 output) can be inspected the same way.
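
Once both layers' weights and biases are extracted, the network's prediction can be reproduced by hand, which is a useful check that the shapes are understood correctly. The following is a minimal NumPy sketch; the function name `forward` is hypothetical, and the parameters are random placeholders with the same shapes as the trained ones, not the values printed above.

```python
import numpy as np

def forward(x, w1, b1, w2, b2):
    """Replay the 14 -> 10 -> 1 network by hand."""
    hidden = np.maximum(0.0, x @ w1 + b1)    # relu on the hidden layer
    logits = hidden @ w2 + b2
    return 1.0 / (1.0 + np.exp(-logits))     # sigmoid on the output layer

# Random placeholders with the same shapes as the trained parameters.
rng = np.random.default_rng(0)
w1, b1 = rng.normal(size=(14, 10)), np.zeros(10)
w2, b2 = rng.normal(size=(10, 1)), np.zeros(1)

x = rng.normal(size=(5, 14))   # five made-up samples
probs = forward(x, w1, b1, w2, b2)
print(probs.shape)
```

With the real `w, b` pairs from `get_weights()` substituted in, the result should match `net.predict` up to floating-point error.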

Build the CART decision tree

import joblib
from sklearn.tree import DecisionTreeClassifier

# Build the CART decision tree model
treefile = 'tree.pkl'
tree = DecisionTreeClassifier()
tree.fit(train[:, :14], train[:, 14])

joblib.dump(tree, treefile)  # save the fitted model to disk

cm_plot(train[:, 14], tree.predict(train[:, :14])).show()  # show the confusion matrix plot
# Note that scikit-learn's predict() returns class labels directly.

fpr, tpr, thresholds = roc_curve(test[:, 14], tree.predict_proba(test[:, :14])[:, 1], pos_label=1)
plt.plot(fpr, tpr, linewidth=2, label='ROC of CART', color='green')
plt.xlabel('False Positive Rate')
plt.ylabel('True Positive Rate')
# Set the axis limits
plt.ylim(0, 1.05)
plt.xlim(0, 1.05)
plt.legend(loc=4)
plt.show()
print(thresholds)
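
The model file written by `joblib.dump` can be read back later with `joblib.load`. The sketch below round-trips a small tree trained on synthetic data (the data and the temporary path are assumptions for the example) and checks that the restored model predicts identically.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data: 14 features, label decided by the first feature.
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 14))
y = (X[:, 0] > 0).astype(int)

clf = DecisionTreeClassifier(random_state=0).fit(X, y)

# Save to a temporary file, then load it back.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "tree.pkl")
    joblib.dump(clf, path)
    restored = joblib.load(path)

same = bool(np.array_equal(clf.predict(X), restored.predict(X)))
print(same)
```

Pickled models are generally only safe to load with the same scikit-learn version that wrote them.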

Visualize the decision tree

import sklearn.tree
from IPython.display import Image  
import pydotplus 
dot_data = sklearn.tree.export_graphviz(tree, out_file=None, 
                         feature_names=None,  
                         class_names=None,  
                         filled=True, rounded=True,  
                         special_characters=True)  
graph = pydotplus.graph_from_dot_data(dot_data)  
Image(graph.create_png())
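
If Graphviz or pydotplus is not available, scikit-learn's `export_text` offers a plain-text view of the same tree. This is a sketch on synthetic data; the feature names `f0`, `f1`, `f2` are hypothetical labels for the synthetic columns, and `max_depth=2` is chosen only to keep the printout short.

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier, export_text

# Synthetic data: three features, label decided by the second one.
rng = np.random.default_rng(0)
X = rng.normal(size=(60, 3))
y = (X[:, 1] > 0).astype(int)

clf = DecisionTreeClassifier(max_depth=2, random_state=0).fit(X, y)

# Render the fitted tree as indented if/else style rules.
rules = export_text(clf, feature_names=["f0", "f1", "f2"])
print(rules)
```

On the real model, passing the 14 column names as `feature_names` would make the rules readable in terms of the business indicators.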