1 How to handle categorical variables?
Method 1: drop the columns (rarely used)
Method 2: LabelEncoder
from sklearn.preprocessing import LabelEncoder
label_encoder = LabelEncoder()
X[col] = label_encoder.fit_transform(X[col])
X_val[col] = label_encoder.transform(X_val[col])
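Note that `transform` raises an error if the validation set contains a category never seen during fitting. A minimal sketch of the usual workaround, which label-encodes only the columns whose validation values are a subset of the training values (the DataFrames and column names here are illustrative):

```python
# LabelEncoder fails on categories seen only in validation, so encode
# only the "safe" columns and drop the rest.
import pandas as pd
from sklearn.preprocessing import LabelEncoder

X_train = pd.DataFrame({'color': ['red', 'blue', 'red'],
                        'size':  ['S', 'M', 'L']})
X_val   = pd.DataFrame({'color': ['blue', 'red'],
                        'size':  ['XL', 'S']})   # 'XL' never seen in training

object_cols = ['color', 'size']
# Columns that can be safely label-encoded
good_cols = [col for col in object_cols
             if set(X_val[col]).issubset(set(X_train[col]))]
bad_cols = list(set(object_cols) - set(good_cols))  # drop these instead

label_X_train = X_train.drop(bad_cols, axis=1)
label_X_val = X_val.drop(bad_cols, axis=1)
for col in good_cols:
    encoder = LabelEncoder()
    label_X_train[col] = encoder.fit_transform(label_X_train[col])
    label_X_val[col] = encoder.transform(label_X_val[col])
```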
Method 3: OneHotEncoder
Purpose: suitable for unordered (nominal) categorical features.
Note: avoid this method when a feature has more than 15 categories.
from sklearn.preprocessing import OneHotEncoder
One_H_encoder = OneHotEncoder(handle_unknown='ignore', sparse=False)
OH_cols_train = pd.DataFrame(One_H_encoder.fit_transform(X_train[object_cols]))
OH_cols_val = pd.DataFrame(One_H_encoder.transform(X_val[object_cols]))
OH_cols_train.index = X_train.index
OH_cols_val.index = X_val.index
num_X_train = X_train.drop(object_cols,axis=1)
num_X_val = X_val.drop(object_cols,axis=1)
OH_X_train = pd.concat([num_X_train,OH_cols_train],axis=1)
OH_X_val = pd.concat([num_X_val, OH_cols_val], axis=1)
2 Pipeline
Benefits of Pipeline:
- makes the code cleaner and easier to follow
- reduces the chance of introducing bugs
- lets preprocessing steps be applied in batch
Code:
import pandas as pd
import numpy as np
from sklearn.pipeline import Pipeline
from sklearn.ensemble import RandomForestRegressor
from sklearn.preprocessing import OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.model_selection import train_test_split
from sklearn.compose import ColumnTransformer
from sklearn.metrics import mean_absolute_error
X_full = pd.read_csv('X_train.csv')
X_test_full = pd.read_csv('X_test.csv')
X_full.dropna(axis=0,subset=['SalePrice'],inplace=True)
y = X_full.SalePrice
X_full.drop('SalePrice', axis=1, inplace=True)
X_train_full, X_val_full, y_train, y_valid = train_test_split(X_full, y, test_size=0.3, random_state=0)
categorical_cols = [col for col in X_train_full.columns if X_train_full[col].nunique() < 10 and X_train_full[col].dtype == 'object']
numerical_cols = [col for col in X_train_full.columns if X_train_full[col].dtype in ['int64', 'float64']]
my_cols = categorical_cols + numerical_cols
X_train = X_train_full[my_cols].copy()
X_val = X_val_full[my_cols].copy()
X_test = X_test_full[my_cols].copy()
#Step1:
numerical_transform = SimpleImputer()
categorical_transform = Pipeline(steps=[('imputer', SimpleImputer(strategy='most_frequent')),
                                        ('onehot', OneHotEncoder(handle_unknown='ignore', sparse=False))])
processor = ColumnTransformer(transformers=[('numerical', numerical_transform, numerical_cols),
                                            ('cat', categorical_transform, categorical_cols)])
#Step2:
model = RandomForestRegressor(n_estimators=100,random_state=0)
my_pipeline = Pipeline(steps=[('processor',processor),('model',model)])
my_pipeline.fit(X_train,y_train)
preds = my_pipeline.predict(X_val)
mean_error = mean_absolute_error(y_valid, preds)
3 Cross Validation
Best suited for: small datasets
Code:
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestRegressor
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
def get_score(n_estimators, X, y):
    my_pipeline = Pipeline(steps=[('imputer', SimpleImputer(strategy='median')),
                                  ('model', RandomForestRegressor(n_estimators=n_estimators, random_state=0))])
    score = -1 * cross_val_score(my_pipeline, X, y, cv=5, scoring='neg_mean_absolute_error')
    return score.mean()
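A self-contained usage sketch of this helper on a small synthetic dataset (the helper is restated here so the snippet runs on its own; the data and the tree counts tried are illustrative):

```python
# Compare a few n_estimators values by cross-validated MAE and keep
# the best one.
import numpy as np
import pandas as pd
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.ensemble import RandomForestRegressor

def get_score(n_estimators, X, y):
    my_pipeline = Pipeline(steps=[('imputer', SimpleImputer(strategy='median')),
                                  ('model', RandomForestRegressor(n_estimators=n_estimators, random_state=0))])
    score = -1 * cross_val_score(my_pipeline, X, y, cv=5, scoring='neg_mean_absolute_error')
    return score.mean()

rng = np.random.RandomState(0)
X = pd.DataFrame(rng.rand(100, 3), columns=['a', 'b', 'c'])
y = X['a'] * 2 + X['b']   # simple target the forest can learn

# Mean MAE for each candidate tree count; lower is better
results = {n: get_score(n, X, y) for n in [10, 50]}
best_n = min(results, key=results.get)
```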
4 XGBoost
Idea: initialize the ensemble with a single weak learner; use the current ensemble to make predictions and compute the loss; fit a new learner to that loss; add the new learner to the ensemble; then repeat these steps.
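The loop described above can be sketched from scratch with squared loss and sklearn trees as the weak learners (XGBoost itself adds many refinements on top of this; the data here is synthetic and illustrative):

```python
# Gradient boosting for squared loss: each new tree fits the residual
# of the current ensemble and is added with a learning-rate weight.
import numpy as np
from sklearn.tree import DecisionTreeRegressor

rng = np.random.RandomState(0)
X = rng.rand(200, 2)
y = X[:, 0] * 3 + np.sin(X[:, 1] * 5)

learning_rate = 0.1
pred = np.full_like(y, y.mean())   # initialize with a constant learner
learners = []
for _ in range(100):               # n_estimators boosting rounds
    residual = y - pred            # negative gradient of squared loss
    tree = DecisionTreeRegressor(max_depth=2, random_state=0)
    tree.fit(X, residual)          # new learner trained on the current loss
    pred += learning_rate * tree.predict(X)  # add it with a weight
    learners.append(tree)

mae = np.abs(y - pred).mean()      # training error after boosting
```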
Key parameters:
- n_estimators: the number of learners, i.e. the number of boosting rounds; typically 100-1000. Too low underfits, too high overfits.
- early_stopping_rounds: stop early if the validation loss has not improved for this many rounds; 5 is a common value. Pairing a high n_estimators with early_stopping_rounds works well.
- eval_set: used together with early_stopping_rounds to compute the validation score.
- n_jobs: worth setting on large datasets; trains across multiple cores in parallel.
- learning_rate: weights each base learner's contribution instead of simply summing them; defaults to 0.1.
Code:
from xgboost import XGBRegressor
model = XGBRegressor(n_estimators=500,learning_rate=0.01,random_state=0)
model.fit(X_train, y_train, early_stopping_rounds=5, eval_set=[(X_valid, y_valid)], verbose=False)
5 Data leakage
1. Target Leakage:
- When it occurs: when the dataset contains features whose values will not yet be available at prediction time.
- In the pneumonia dataset above, took_antibiotic_medicine usually changes together with got_pneumonia. A model trained on this data looks great on the validation set but performs very poorly in the real world. The reason: the model's purpose is to predict whether a patient has the disease, and people coming in for a diagnosis have generally not received medication yet, even if they are already sick, so this feature carries information that simply does not exist at the moment the prediction has to be made.
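A minimal sketch of the fix: drop any feature whose value is only determined after the target (the toy DataFrame below follows the pneumonia example; its values are illustrative):

```python
# took_antibiotic_medicine is only known *after* the pneumonia outcome,
# so it must be removed before training.
import pandas as pd

data = pd.DataFrame({
    'age': [34, 60, 25, 48],
    'weight': [70, 82, 60, 90],
    'took_antibiotic_medicine': [False, True, False, True],
    'got_pneumonia': [False, True, False, True],
})
y = data['got_pneumonia']
leaky_cols = ['took_antibiotic_medicine']  # determined only after the target
X = data.drop(['got_pneumonia'] + leaky_cols, axis=1)
```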
2. Train-Test Contamination
- When it occurs: when imputation or normalization is applied to the full dataset before splitting it into training and validation sets.
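A sketch of the correct order, assuming a single numeric feature with missing values: split first, then fit the imputer on the training fold only, so no validation statistics leak into training.

```python
# Fit preprocessing on the training split only; reuse its statistics
# to transform the validation split.
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.impute import SimpleImputer

X = pd.DataFrame({'f': [1.0, 2.0, np.nan, 4.0, 5.0, np.nan, 7.0, 8.0]})
y = pd.Series([0, 1, 0, 1, 0, 1, 0, 1])

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.25,
                                                  random_state=0)
imputer = SimpleImputer(strategy='mean')
X_train_imp = pd.DataFrame(imputer.fit_transform(X_train),  # fit on train only
                           columns=X.columns, index=X_train.index)
X_val_imp = pd.DataFrame(imputer.transform(X_val),          # reuse train stats
                         columns=X.columns, index=X_val.index)
```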