Customer Churn Prediction: A Resource Roundup

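A while ago I took part in a customer churn prediction competition run by my company. I was a machine learning beginner at the time (and still am), and I collected and organized a fair amount of material for it. Here is a brief summary to share.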

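0. Problem Overview

Problem: predict the probability that a customer renews their order within the next three months
Data sources: data across multiple dimensions, including Product, ProductFamily, Service, Service Family, Partner, Customer, Order, Order Line, Offer, Customer's Company, Usage, Revenue, and Bill
Evaluation metric: cross-entropy on the Score Set
Model choice: in the end we used a blend of XGBoost and GBDT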
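
To make the evaluation metric concrete, here is a minimal sketch of binary cross-entropy (log loss) applied to two sets of predicted probabilities and to their blend. The labels and probabilities are made up, and the weighted average in `blend` is only an assumption: the notes above do not say how the XGBoost and GBDT outputs were actually combined.

```python
import math

def log_loss(y_true, y_prob, eps=1e-15):
    """Mean binary cross-entropy; probabilities are clipped away from 0 and 1."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

def blend(p_a, p_b, weight_a=0.5):
    """Weighted average of two models' predicted probabilities (assumed fusion rule)."""
    return [weight_a * a + (1 - weight_a) * b for a, b in zip(p_a, p_b)]

y_true = [1, 0, 1, 0]            # made-up labels: 1 = customer renewed
p_xgb = [0.9, 0.3, 0.6, 0.2]     # made-up numbers standing in for XGBoost output
p_gbdt = [0.7, 0.1, 0.8, 0.4]    # made-up numbers standing in for GBDT output
p_blend = blend(p_xgb, p_gbdt)

for name, p in [("xgb", p_xgb), ("gbdt", p_gbdt), ("blend", p_blend)]:
    print(name, round(log_loss(y_true, p), 4))
```

Lower is better. On this toy data the two models make different mistakes, so the averaged probabilities score a lower cross-entropy than either model alone.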

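1. Collecting and Organizing Resources

Since I had never done any machine learning work before and was a complete beginner, we started by gathering material to learn the approaches and methods used in similar competitions. From this we learned that the general machine learning workflow is:

Data analysis and preprocessing
Feature engineering (feature selection)
Model selection (classification or regression)
Training and tuning

We then studied each of these steps in more depth.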

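2. Learning Materials

2.1 Approaches from Similar Competitions

Reading the blog posts below showed us the general workflow for this type of competition and gave us a first idea of how machine learning is used to solve a problem.

O2O coupon usage prediction, 3rd place in the second round: http://blog.csdn.net/bryan__/article/details/53907292
O2O coupon usage prediction, code from the 1st place in the second round: https://github.com/wepe/O2O-Coupon-Usage-Forecast
Summary of approaches across competitions: http://blog.csdn.net/bryan__/article/details/51713596
[Tianchi Competition Series] Alibaba mobile recommendation algorithm, approach explained: http://blog.csdn.net/bryan__/article/details/47112993
Big data competition tech sharing: http://blog.csdn.net/bryan__/article/details/51745563
Predicting user churn with Scikit-learn: http://blog.csdn.net/BaiHuaXiu123/article/details/62063415
Tianchi data mining competitions, techniques and tricks summarized: https://blog.csdn.net/mr_tyting/article/details/73548245
Meituan churn prediction (very useful as a reference): http://blog.csdn.net/shenxiaoming77/article/details/51543724
Key techniques in user churn analysis: http://blog.csdn.net/u013915133/article/details/78525133
Logistic regression in Python, implementation and a data mining case study: http://blog.csdn.net/yawei_liu1688/article/details/78733428
A first look at a LogisticRegression user churn prediction model: http://blog.csdn.net/java1573/article/details/78830607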

2.2 Data Processing + Feature Engineering

2.2.1 Data processing:

Data visualization: http://blog.csdn.net/mr_tyting/article/details/73196119
Data discretization: http://blog.csdn.net/mr_tyting/article/details/75212250
Data cleaning:
- Outlier detection and removal of extreme values: http://blog.csdn.net/mr_tyting/article/details/77371157
Data preprocessing methods: http://blog.csdn.net/sinat_33761963/article/details/53433799
Data preprocessing: http://blog.csdn.net/bryan__/article/details/51228971
Machine learning --> sklearn data preprocessing: http://blog.csdn.net/mr_tyting/article/details/73381661

2.2.2 Feature engineering:

A complete summary of feature engineering (with Python source): https://blog.csdn.net/javastart/article/details/77015603
Feature selection
- Overview:
  - Machine learning --> feature selection: http://blog.csdn.net/mr_tyting/article/details/73413979
  - http://machinelearningmastery.com/discover-feature-engineering-how-to-engineer-features-and-how-to-get-good-at-it/
  - Feature selection with the Python machine learning library SKLearn: https://blog.csdn.net/cheng9981/article/details/71023709
  - Feature selection methods: https://blog.csdn.net/bryan__/article/details/51607215
- Automatic feature selection:
  - Automatic feature selection by combining GBDT with LR: http://blog.csdn.net/lilyth_lilyth/article/details/48032119
  - Building combined features with GBDT: http://blog.csdn.net/sb19931201/article/details/65445514
  - Evaluating feature importance with random forests:
    - http://blog.csdn.net/HowardWood/article/details/79525326
    - http://blog.csdn.net/yawei_liu1688/article/details/78733428 -- worked example

2.3 Models and Parameter Tuning

General:
- Model selection: http://blog.csdn.net/mr_tyting/article/details/73440712
- Model ensembling: http://blog.csdn.net/mr_tyting/article/details/72957853
- Ways to improve a model: http://blog.csdn.net/roslei/article/details/53465283
- Simple calls to Python sklearn classification algorithms: http://blog.csdn.net/bryan__/article/details/51288953
XGBoost+GBDT
- XGBoost API: http://xgboost.readthedocs.io/en/latest/python/python_api.html#xgboost.XGBClassifier.fit
- XGBoost source code: https://github.com/dmlc/xgboost
- 余音's introduction to implementing GBDT, XGBoost, Blending, and more: https://github.com/lytforgood/MachineLearningTrick
- GBDT principles and applications: http://blog.csdn.net/q383700092/article/details/53744277
- XGBoost parameter tuning:
  - https://www.analyticsvidhya.com/blog/2016/03/complete-guide-parameter-tuning-xgboost-with-codes-python
  - Machine Learning Series (12): Complete Guide to XGBoost Parameter Tuning (with Python code): https://blog.csdn.net/han_xiaoyang/article/details/52665396
  - Complete Guide to XGBoost Parameter Tuning (with Python code): https://blog.csdn.net/u010657489/article/details/51952785
  - Parameters Tuning: bias-variance tradeoff:
    - Parameter documentation: https://xgboost.readthedocs.io/en/latest/parameter.html
  - Parameter tuning and preprocessing example: https://github.com/aarshayj/Analytics_Vidhya/tree/master/Articles/Parameter_Tuning_XGBoost_with_Example

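3. Notes

3.1 "Common Feature Selection Methods, Illustrated with Scikit-learn"

How to select features using the coefficients of a regression model. The more important a feature is, the larger its coefficient in the model, while features unrelated to the output variable get coefficients close to 0 (the comparison only makes sense when the features are on comparable scales). On data with little noise, or where the number of samples is much larger than the number of features, and as long as the features are relatively independent of each other, even the simplest linear regression model can work very well.

In much real-world data, however, there are several mutually correlated features. The model then becomes unstable: a tiny change in the data can cause a huge change in the model (that is, in its coefficients, or parameters, which can be thought of as W), and this makes the model's predictions hard to rely on. This phenomenon is called multicollinearity.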
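
The instability described above can be reproduced on a toy example. The sketch below is not from the referenced article: it fits a two-feature least-squares model (no intercept) with the normal equations, using made-up data in which the second feature is almost a copy of the first.

```python
def fit_two_features(x1, x2, y):
    """Least-squares fit of y ~ w1*x1 + w2*x2 (no intercept) via the 2x2 normal equations."""
    a = sum(u * u for u in x1)
    b = sum(u * v for u, v in zip(x1, x2))
    c = sum(v * v for v in x2)
    d = sum(u * t for u, t in zip(x1, y))
    e = sum(v * t for v, t in zip(x2, y))
    det = a * c - b * b          # close to 0 when x1 and x2 are nearly collinear
    return (c * d - b * e) / det, (a * e - b * d) / det

x1 = [1.0, 2.0, 3.0, 4.0, 5.0]
x2 = [1.01, 1.99, 3.02, 3.98, 5.0]       # almost identical to x1
y = [u + v for u, v in zip(x1, x2)]      # generated with true weights (1, 1)
noise = [0.05, -0.05, 0.05, -0.05, 0.0]  # tiny perturbation of the target
y_noisy = [t + n for t, n in zip(y, noise)]

w_clean = fit_two_features(x1, x2, y)
w_noisy = fit_two_features(x1, x2, y_noisy)
print("clean:", w_clean)
print("noisy:", w_noisy)
```

A perturbation of at most 0.05 in the target moves each fitted weight by more than 2, even though the sum of the two weights barely changes. That is why coefficients of correlated features are a poor guide to feature importance.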

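3.2 "[Machine Learning in Practice] Predicting User Churn with scikit-learn"

Workflow: data preprocessing (char --> bool, delete less useful columns, normalization) --> models: KNN, SVM, RF (cross-validation)

3.3 "Tianchi Data Mining Competitions: Techniques and Tricks Summarized"

Workflow: data visualization --> data preprocessing --> feature engineering --> model ensembling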

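Data visualization: check our hunches about the data distribution, build a clear understanding of it, and use that to design sensible hand-crafted rules.
- Reference: http://blog.csdn.net/mr_tyting/article/details/73196119
Data preprocessing:
- Data cleaning:
  - Outlier detection and removal of extreme values http://blog.csdn.net/mr_tyting/article/details/77371157
  - Handling missing fields (four cases: many values missing; a non-continuous feature with a moderate number missing; a continuous feature with a moderate number missing; few values missing)
- Data sampling: imbalance between positive and negative samples
Feature engineering:
- Feature processing http://blog.csdn.net/mr_tyting/article/details/73381661
- Discretizing continuous features http://blog.csdn.net/mr_tyting/article/details/75212250
- Feature selection: http://blog.csdn.net/mr_tyting/article/details/73413979
- Model selection: http://blog.csdn.net/mr_tyting/article/details/73440712
Model ensembling: http://blog.csdn.net/mr_tyting/article/details/72957853
- Bagging (Random Forest)
- Stacking
- Boosting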
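
For the "data sampling" item above, one simple way to deal with imbalanced positive and negative samples is random undersampling of the majority class. The sketch below is an illustration under that assumption, not the method from the referenced post; `undersample` and the toy data are hypothetical.

```python
import random

def undersample(rows, labels, seed=0):
    """Randomly drop majority-class rows until both classes have the same size."""
    rng = random.Random(seed)
    pos = [i for i, y in enumerate(labels) if y == 1]
    neg = [i for i, y in enumerate(labels) if y == 0]
    minority, majority = (pos, neg) if len(pos) <= len(neg) else (neg, pos)
    # Keep every minority row plus an equally sized random subset of the majority
    kept = sorted(minority + rng.sample(majority, len(minority)))
    return [rows[i] for i in kept], [labels[i] for i in kept]

rows = [[i] for i in range(10)]            # made-up feature rows
labels = [1, 0, 0, 0, 0, 1, 0, 0, 0, 0]    # 2 positives, 8 negatives
X_bal, y_bal = undersample(rows, labels)
print(y_bal)
```

Resampling changes the class ratio the model sees, so the predicted probabilities no longer match the original base rate. That matters when the metric is computed on probabilities rather than on ranking.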

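3.4 "A First Look at a LogisticRegression User Churn Prediction Model [Recommended]"

Logistic regression:
- Logistic regression assumes a binomially distributed outcome and is commonly used for binary classification
- Main uses of logistic regression:
  - Finding risk factors: for example, the risk factors of a disease;
  - Prediction: given the model, predict how likely a disease or event is under different values of the independent variables;
  - Discrimination: similar to prediction; given the model, judge how likely it is that a person has a certain disease or falls into a certain category.
Problem addressed:

Taking a product from the author's own work as the subject, predict user churn and retention.

Churn = the user spent money last month but not this month (so strictly speaking this is spending churn).

The analysis uses a data window of one or two months and asks: under what circumstances does a user stop spending?

A number of indicator features were picked for the analysis, such as last month's purchase count, the time of the most recent purchase (quantifiable), and the purchase amount; the RFM (Recency, Frequency, Monetary) model gives this choice an analytical basis. There are other features as well, such as total viewing time, active days, session duration, and launch count.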
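
As an illustration of the RFM features mentioned above, the sketch below derives recency, frequency, and monetary value from one user's purchase records. The record layout (a list of `(date, amount)` pairs) and the function name `rfm` are assumptions made for this example.

```python
from datetime import date

def rfm(purchases, as_of):
    """purchases: list of (date, amount) for one user.

    Returns (recency in days, purchase count, total amount), or None if there are no purchases.
    """
    if not purchases:
        return None
    last_purchase = max(d for d, _ in purchases)
    recency_days = (as_of - last_purchase).days
    frequency = len(purchases)
    monetary = sum(amount for _, amount in purchases)
    return recency_days, frequency, monetary

# Made-up purchase history for one user during the observation month
history = [
    (date(2018, 3, 2), 30.0),
    (date(2018, 3, 20), 12.5),
    (date(2018, 3, 28), 7.5),
]
features = rfm(history, as_of=date(2018, 4, 1))
print(features)
```

The three numbers can then be used as model inputs next to the engagement features (viewing time, active days, and so on).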

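Key techniques in user churn analysis:

Classification model: decision trees (ID3, C4.5, C5.0)

Workflow: preparation (identify the independent and dependent variables; choose the information measure (entropy, information gain); set the stopping conditions (purity, record count, number of iterations)) --> select a feature --> create branches --> check whether to stop --> output the result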
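
The "select a feature" step relies on the information measures named above. A minimal sketch of entropy and information gain, as used by ID3 to pick the split feature, could look like this; the churn labels and the two candidate features are made up.

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(feature_values, labels):
    """Entropy of the labels minus the weighted entropy after splitting on the feature."""
    n = len(labels)
    groups = {}
    for value, label in zip(feature_values, labels):
        groups.setdefault(value, []).append(label)
    remainder = sum(len(g) / n * entropy(g) for g in groups.values())
    return entropy(labels) - remainder

churned = [1, 1, 0, 0]
plan = ["basic", "basic", "pro", "pro"]      # separates the classes perfectly
region = ["east", "west", "east", "west"]    # tells us nothing about churn

print("gain(plan)   =", information_gain(plan, churned))
print("gain(region) =", information_gain(region, churned))
```

ID3 would split on the feature with the larger gain, here `plan`. C4.5 uses the gain ratio instead, which this sketch does not cover.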

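3.5 "Logistic Regression in Python: Implementation and a Data Mining Case Study"

Step1: extract data from the database, combining 106 indicators
Step2: inspect and process the data
Step3: train the LR model
Step4: predict with the model and evaluate it
Step5: optimize the model: cross-validation + grid search, then retrain

Note: decide which of recall and precision matters more, and identify the metric that matters most! In this competition, precision was the more important one.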
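
To make the precision/recall trade-off in the note concrete, the sketch below computes both metrics at two decision thresholds. The labels, probabilities, and thresholds are made up for illustration.

```python
def precision_recall(y_true, y_pred):
    """Precision and recall for the positive class (label 1)."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall

def at_threshold(y_prob, threshold):
    """Turn predicted probabilities into 0/1 predictions."""
    return [1 if p >= threshold else 0 for p in y_prob]

y_true = [1, 1, 1, 0, 0, 0]
y_prob = [0.9, 0.6, 0.4, 0.7, 0.2, 0.1]   # made-up model outputs

loose = precision_recall(y_true, at_threshold(y_prob, 0.5))
strict = precision_recall(y_true, at_threshold(y_prob, 0.8))
print("threshold 0.5:", loose)
print("threshold 0.8:", strict)
```

Raising the threshold makes the model flag fewer users, which raises precision and lowers recall on this data. Which direction to push depends on which metric matters more for the task.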

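3.6 "Big Data Competition Tech Sharing"

Preprocessing http://blog.csdn.net/bryan__/article/details/51228971
- Feature standardization: z-score, giving zero mean and unit variance
- Min-max scaling
- Normalization: map values with different ranges onto the same fixed range, most commonly [0,1] (L1, L2 norms)
- Feature binarization
- Label binarization
- Categorical feature encoding
- Label encoding
- Features containing outliers
- Generating polynomial features
Feature engineering
- Build features from business logic
- Cross features
- Transformed features
- Sliding time-window features
- Avoid feature leakage (features that use information from the future)
- Consistent scales
- Discretize continuous features
- Convert discrete features to numeric ones (one-hot, sklearn LabelEncoder)
- Combine GBDT (Gradient Boosting Decision Tree) with LR to discover combined features automatically, avoiding manual construction
  - http://blog.csdn.net/lilyth_lilyth/article/details/48032119
- GBDT+FM
- Feature selection: http://blog.csdn.net/bryan__/article/details/51607215
Model design - classification: http://blog.csdn.net/bryan__/article/details/51288953
Model design - regression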
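
Three of the preprocessing items listed above (z-score standardization, min-max scaling, and one-hot encoding) are simple enough to sketch with the standard library. This is only meant to show what the transformations compute; in practice sklearn's preprocessing module does the same job. The helper names and sample values are made up.

```python
import statistics

def z_score(values):
    """Standardize to zero mean and unit variance (population standard deviation).

    Assumes the values are not all identical, otherwise the deviation is 0.
    """
    mean = statistics.mean(values)
    std = statistics.pstdev(values)
    return [(v - mean) / std for v in values]

def min_max(values):
    """Rescale linearly so the smallest value maps to 0 and the largest to 1."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def one_hot(values):
    """One column per category, in sorted category order."""
    categories = sorted(set(values))
    return [[1 if v == c else 0 for c in categories] for v in values]

usage = [2.0, 4.0, 6.0]
print(z_score(usage))
print(min_max(usage))
print(one_hot(["b", "a", "b"]))
```

To avoid leakage, the mean, deviation, minimum, maximum, and category list should be computed on the training set only and then reused for the validation and scoring sets.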

3.7 XGBoost Parameter Notes

Parameter documentation: https://xgboost.readthedocs.io/en/latest/parameter.html

Control overfitting:

First way: directly control model complexity
- This includes max_depth, min_child_weight, gamma
Second way: add randomness to make training robust to noise
- This includes subsample, colsample_bytree
- You can also reduce the step size eta, but remember to increase num_round when you do so
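
As a sketch of how the two approaches above translate into a configuration, here is a parameter dictionary that touches each of the listed knobs. The values are illustrative placeholders only: they are not tuned and are not the settings used in the competition.

```python
# Illustrative values only; tune them with cross-validation on your own data.
params = {
    "objective": "binary:logistic",  # predict a probability for a binary outcome
    "eval_metric": "logloss",        # cross-entropy
    # First way: limit model complexity directly
    "max_depth": 4,
    "min_child_weight": 5,
    "gamma": 1.0,
    # Second way: add randomness
    "subsample": 0.8,
    "colsample_bytree": 0.8,
    # Smaller step size, compensated by more boosting rounds
    "eta": 0.05,
}
num_round = 600

# A dictionary like this is what gets passed to
# xgboost.train(params, dtrain, num_boost_round=num_round)
print(params, num_round)
```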

Handle imbalanced dataset

As a rule of thumb, if the ratio between the sample counts of the classes in the training data exceeds 4:1, the data can be considered imbalanced.

If you care only about the ranking order (AUC) of your predictions
- Balance the positive and negative weights via scale_pos_weight
- Use AUC for evaluation
If you care about predicting the right probability
- In this case, you cannot re-balance the dataset
- Setting the parameter max_delta_step to a finite number (say 1) will help convergence
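
The two cases above can be written down as a small helper. The function name `imbalance_params` and the toy labels are hypothetical; the negative-to-positive ratio for scale_pos_weight is the typical value suggested in the XGBoost documentation.

```python
def imbalance_params(labels, need_probabilities):
    """Pick XGBoost settings for an imbalanced binary problem.

    need_probabilities=True  -> keep the data as is, help convergence with max_delta_step
    need_probabilities=False -> only ranking matters, so re-weight the positive class
    """
    pos = sum(1 for y in labels if y == 1)
    neg = len(labels) - pos
    if need_probabilities:
        return {"eval_metric": "logloss", "max_delta_step": 1}
    return {"eval_metric": "auc", "scale_pos_weight": neg / pos}

labels = [1] * 10 + [0] * 90    # made-up 9:1 imbalance

print(imbalance_params(labels, need_probabilities=False))
print(imbalance_params(labels, need_probabilities=True))
```

Since this competition was scored by cross-entropy, which depends on the predicted probabilities themselves, the second case is the one that applies here.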

XGBoost parameters

Before running XGBoost, set three types of parameters: general parameters, booster parameters, and learning task parameters
- General parameters: relate to which booster is used for boosting, commonly a tree or linear model
- Booster parameters: depend on which booster you have chosen
- Learning task parameters: decide on the learning scenario; for example, regression tasks may use different parameters from ranking tasks
General parameters:
- booster [default=gbtree]
  - Which booster to use: gbtree, gblinear, or dart. gbtree and dart are based on tree models, while gblinear uses linear functions
- silent [default=0]
  - 0 means printing running messages; 1 means silent mode (recent XGBoost releases replace this with verbosity)
- nthread [defaults to the maximum number of threads available if not set]
- num_pbuffer [set automatically by XGBoost]
- num_feature [set automatically by XGBoost]
Parameters for tree booster
- eta [default=0.3, alias: learning_rate]
  - Step size shrinkage used in each update to prevent overfitting.
  - range: [0,1]
  - The smaller, the more conservative
- gamma [default=0, alias: min_split_loss]
  - Minimum loss reduction required to make a further partition on a leaf node of the tree.
  - The larger, the more conservative the algorithm will be.
  - range: [0,∞]
- max_depth [default=6]
  - Maximum depth of a tree; increasing this value makes the model more complex and more likely to overfit. 0 indicates no limit; a limit is required for the depth-wise grow policy.
  - range: [0,∞]
