Table of Contents
object-dtype variables are categorical; count the unique values of each categorical variable
Label Encoder - remember to encode the train and test sets together!
Reading the Data
Tabular data
-
Read the data; check the number of rows and columns, and the first few rows
import pandas as pd

df = pd.read_csv("./Data/application_train.csv")
print("Training data shape: ", df.shape)
df.head()
EDA
Distribution of the target variable
-
The target variable is categorical
df['TARGET'].value_counts()
df['TARGET'].plot.hist()
Inspecting missing values
-
Distribution of missing data in a dataframe
Input: the target dataframe
Output: a dataframe indexed by every variable that has missing values, with columns for the missing-value count and the missing-value percentage
def missing_values_table(df):
    # Total missing values
    mis_val = df.isnull().sum()
    # Percentage of missing values
    mis_val_percent = 100 * df.isnull().sum() / df.shape[0]
    # Make a table with the results
    mis_val_table = pd.concat([mis_val, mis_val_percent], axis=1)
    # Rename the columns
    mis_val_table_re_columns = mis_val_table.rename(
        columns={0: 'Missing Values',
                 1: '% of Total Missing Values'})
    # Keep only the columns with missing values, sorted by percentage, descending
    mis_val_table_re_columns = mis_val_table_re_columns[
        mis_val_table_re_columns['Missing Values'] != 0
    ].sort_values(by=['% of Total Missing Values'], ascending=False)
    # Print summary information
    print("Your selected dataframe has " + str(df.shape[1]) + " columns.\n"
          "There are " + str(mis_val_table_re_columns.shape[0]) +
          " columns that have missing values.")
    return mis_val_table_re_columns
Check how many variables there are of each dtype
df.dtypes.value_counts()
Category / categorical variable preprocessing
-
object-dtype variables are categorical; count the unique values of each categorical variable
df.select_dtypes('object').apply(pd.Series.nunique, axis=0)
-
Label Encoder - remember to encode the train and test sets together!
from sklearn.preprocessing import LabelEncoder

# Create a label encoder object
le = LabelEncoder()
le_count = 0

# Iterate through the columns
for col in app_train:
    if app_train[col].dtype == 'object':
        # If 2 or fewer unique categories
        if len(list(app_train[col].unique())) <= 2:
            # Train on the training data
            le.fit(app_train[col])
            # Transform both training and testing data
            app_train[col] = le.transform(app_train[col])
            app_test[col] = le.transform(app_test[col])
            # Keep track of how many columns were label encoded
            le_count += 1

print("{} columns were label encoded.".format(le_count))
-
OneHot Encoder
# OneHot encoding of categorical variables
app_train = pd.get_dummies(app_train)
app_test = pd.get_dummies(app_test)
Note that the train and test sets should end up with the same number of features (columns). After one-hot encoding, however, if a categorical feature takes some values in the train set that never appear in the test set, the two sets will have different columns, so the train and test sets need to be aligned -
train_labels = app_train['TARGET']
# Align the training and testing data, keep only columns present in both dataframes
app_train, app_test = app_train.align(app_test, join='inner', axis=1)
# Add the target back in train data
app_train['TARGET'] = train_labels
print('Training Features shape: ', app_train.shape)
print('Testing Features shape: ', app_test.shape)
After one-hot encoding, the number of features grows significantly; apply PCA if needed.
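A minimal sketch of that PCA step, on a toy binary matrix standing in for the one-hot-encoded features (the matrix and the 95% variance threshold are illustrative assumptions, not values from these notes):

```python
import numpy as np
from sklearn.decomposition import PCA

# Toy one-hot-style matrix: 6 samples, 4 binary columns
X = np.array([
    [1, 0, 0, 1],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
    [1, 0, 0, 0],
    [0, 1, 1, 0],
    [1, 1, 0, 1],
], dtype=float)

# Passing a float in (0, 1) keeps the smallest number of components
# whose cumulative explained variance reaches that fraction
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)                       # (6, k) with k <= 4
print(pca.explained_variance_ratio_.sum())   # cumulative variance kept
```

As with the label encoder above, fit PCA on the training set only, then transform both train and test with the same fitted object.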
Checking for outliers
-
Check for values that defy common sense
Check the maximum and minimum values
app_train['DAYS_EMPLOYED'].describe()
If you find anomalous values in the data, do not handle them carelessly, e.g. by filling them all with zero.
The safest way is to first check whether the anomalies follow a pattern: for example, are all the anomalous values identical, and do the observations containing anomalies behave differently on the target variable? (To check the latter, split the observations by whether they contain an anomaly and compare the mean of the target variable across the groups.)
If the anomalies do follow a pattern, one way to handle them is to create an extra column flagging whether the corresponding value is anomalous, and then fill every anomalous value with np.nan for later processing.
Note - any processing done on the training set must also be done on the test set!
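The flag-and-NaN approach above can be sketched as follows, using a toy frame in place of app_train (the sentinel 365243, which is the anomalous DAYS_EMPLOYED value in the Home Credit data, is used for illustration):

```python
import numpy as np
import pandas as pd

# Toy stand-in for app_train; 365243 plays the role of the anomalous value
demo = pd.DataFrame({'DAYS_EMPLOYED': [-1000, -2000, 365243, -500, 365243]})

# Flag the anomaly in a new boolean column...
demo['DAYS_EMPLOYED_ANOM'] = demo['DAYS_EMPLOYED'] == 365243
# ...then replace it with NaN for later imputation
demo['DAYS_EMPLOYED'] = demo['DAYS_EMPLOYED'].replace({365243: np.nan})

print(demo['DAYS_EMPLOYED_ANOM'].sum())  # number of flagged rows
```

Both lines must be repeated verbatim on the test set.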
Correlation between features and the target
Some general interpretations of the absolute value of the correlation coefficient are:
- .00-.19 “very weak”
- .20-.39 “weak”
- .40-.59 “moderate”
- .60-.79 “strong”
- .80-1.0 “very strong”
-
Correlation of every feature with the target variable
# Find correlations with the target and sort
correlations = app_train.corr()['TARGET'].sort_values()
# Display correlations
print('Most Positive Correlations:\n', correlations.tail(15))
print('\nMost Negative Correlations:\n', correlations.head(15))
-
Dig deeper into the correlation between one continuous feature and the (categorical) target variable
First plot a histogram to inspect the distribution -
# Set the style of plots
plt.style.use('fivethirtyeight')
#Plot the distribution of ages in years
plt.hist(app_train['DAYS_BIRTH'] / 365, edgecolor = 'k', bins=25)
plt.title('Age of Client')
plt.xlabel('Age (years)')
plt.ylabel('Count')
Then draw KDE plots to see how the feature is distributed for each value of the target variable
# KDE plot of loans that were repaid on time
sns.kdeplot(app_train.loc[app_train['TARGET']==0, 'DAYS_BIRTH'] / 365, label = 'target==0')
# KDE plot of loans that were not repaid on time
sns.kdeplot(app_train.loc[app_train['TARGET']==1, 'DAYS_BIRTH'] / 365, label = 'target==1')
Try converting the continuous feature into a discrete one and explore its relationship with the target variable
# Age data saved in another dataframe
age_data = app_train[['TARGET', 'DAYS_BIRTH']]
age_data['YEARS_BIRTH'] = age_data['DAYS_BIRTH'] / 365
# Bin the age data
age_data['YEARS_BINNED'] = pd.cut(age_data['YEARS_BIRTH'], bins=np.linspace(20, 70, num=11))
age_data.head(10)
np.linspace(start, stop, num) - returns num evenly spaced samples over [start, stop]
pd.cut(x, bins) - given an array-like x, returns an array-like object with the values of x binned according to bins
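A tiny self-contained example of the two helpers described above:

```python
import numpy as np
import pandas as pd

# 11 evenly spaced edges from 20 to 70 -> 10 bins of width 5
edges = np.linspace(20, 70, num=11)
print(edges)  # [20. 25. 30. ... 70.]

# Each age falls into the half-open interval (left, right]
ages = pd.Series([23, 37, 52, 69])
binned = pd.cut(ages, bins=edges)
print(binned)  # e.g. 23 lands in (20.0, 25.0]
```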
Group by age bin, take the mean of each group, then draw a bar plot
# Average the target (failure-to-repay rate) within each age bin
age_groups = age_data.groupby('YEARS_BINNED').mean()

# Draw a bar plot for the age bins
plt.bar(age_groups.index.astype(str), 100 * age_groups['TARGET'])
#Plot labeling
plt.xticks(rotation=75)
plt.xlabel('Age Group (years)')
plt.ylabel('Failure to Repay (%)')
plt.title('Failure to Repay by Group')
-
Explore the joint effect of several related continuous features on the (categorical) target variable
Inspect the relationships among the features, and with the target variable, using a correlation heatmap and KDE plots
ext_data = app_train[['EXT_SOURCE_1', 'EXT_SOURCE_2', 'EXT_SOURCE_3', 'TARGET', 'DAYS_BIRTH']]
corr_ext = ext_data.corr()
corr_ext
sns.heatmap(corr_ext, cmap=plt.cm.RdYlBu_r, vmin=-0.25, annot=True, vmax=0.6)
plt.title('Correlation Heatmap')
Feature Engineering
Polynomial Features
Generate polynomial features using PolynomialFeatures from sklearn
# Import the polynomial features tool
from sklearn.preprocessing import PolynomialFeatures

# poly_features / poly_features_test hold the selected raw columns
# ('EXT_SOURCE_1'..'EXT_SOURCE_3' and 'DAYS_BIRTH'), with missing values imputed beforehand
poly_transformer = PolynomialFeatures(degree=3)
# Train the polynomial features
poly_transformer.fit(poly_features)
# Transform the features
poly_features = poly_transformer.transform(poly_features)
poly_features_test = poly_transformer.transform(poly_features_test)
print('Polynomial features shape: ', poly_features.shape)
Note that the transform method of PolynomialFeatures returns a numpy array, which has to be converted back into a DataFrame; use the get_feature_names method (renamed get_feature_names_out in newer scikit-learn) to obtain the new feature names. The resulting features need a primary key added, and are then merged back into the original train and test sets - note that this differs from get_dummies().
# Create a dataframe of the features
poly_features = pd.DataFrame(
    poly_features,
    columns=poly_transformer.get_feature_names(['EXT_SOURCE_1', 'EXT_SOURCE_2',
                                                'EXT_SOURCE_3', 'DAYS_BIRTH']))

# Put the test features into a dataframe
poly_features_test = pd.DataFrame(
    poly_features_test,
    columns=poly_transformer.get_feature_names(['EXT_SOURCE_1', 'EXT_SOURCE_2',
                                                'EXT_SOURCE_3', 'DAYS_BIRTH']))
# Merge polynomial features into training dataframe
poly_features['SK_ID_CURR'] = app_train['SK_ID_CURR']
app_train_poly = app_train.merge(poly_features, on = 'SK_ID_CURR', how = 'left')
# Merge polynomial features into testing dataframe
poly_features_test['SK_ID_CURR'] = app_test['SK_ID_CURR']
app_test_poly = app_test.merge(poly_features_test, on = 'SK_ID_CURR', how = 'left')
# Align the dataframes
app_train_poly, app_test_poly = app_train_poly.align(app_test_poly, join = 'inner', axis = 1)
# Print out the new shapes
print('Training data with polynomial features shape: ', app_train_poly.shape)
print('Testing data with polynomial features shape: ', app_test_poly.shape)
Domain Knowledge
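The notes end at this heading. As an illustrative sketch only (these ratio features are common hand-crafted choices for the Home Credit data, not necessarily what the author intended here), domain knowledge can be injected by combining raw columns into ratios that a credit analyst would care about:

```python
import pandas as pd

# Toy stand-in for app_train with the relevant raw columns
demo = pd.DataFrame({
    'AMT_CREDIT':       [100000.0, 200000.0],
    'AMT_INCOME_TOTAL': [50000.0, 80000.0],
    'AMT_ANNUITY':      [5000.0, 12000.0],
})

# Size of the loan relative to the client's income
demo['CREDIT_INCOME_PERCENT'] = demo['AMT_CREDIT'] / demo['AMT_INCOME_TOTAL']
# Size of the periodic payment relative to income
demo['ANNUITY_INCOME_PERCENT'] = demo['AMT_ANNUITY'] / demo['AMT_INCOME_TOTAL']
# Payment rate per period; its inverse approximates the loan term
demo['CREDIT_TERM'] = demo['AMT_ANNUITY'] / demo['AMT_CREDIT']
```

As with every other transformation in these notes, the same columns must be created on the test set.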