[Kaggle] Summary of Common Methods and Snippets in Kernels

Contents

Reading Data

Tabular Data

Read the data and check the number of rows and columns and the first few rows

EDA

Inspect the distribution of the target variable

When the target variable is categorical

Inspect missing values

Distribution of missing data in a dataframe

Inspect the breakdown of variable types

Preprocessing categorical variables

Variables of type object are categorical; count the unique values of each categorical variable

Label Encoder - remember to encode both the train and the test set!

OneHot Encoder

Checking for Outliers

Check for values that defy common sense

Correlation between Features and the Target

Correlation of every feature with the target variable

Dig into the relationship between one continuous feature and a categorical target

Explore the joint effect of several related continuous features on a categorical target

Feature Engineering

Polynomial Features

Domain Knowledge


Reading Data

Tabular Data

  • Read the data and check the number of rows and columns and the first few rows

import pandas as pd

df = pd.read_csv("./Data/application_train.csv")
print("Training data shape: ", df.shape)
df.head()

EDA

Inspect the distribution of the target variable

  • When the target variable is categorical

df['TARGET'].value_counts()
df['TARGET'].plot.hist()

Inspect missing values

  • Distribution of missing data in a dataframe

Input: the target dataframe

Output: a dataframe with one row per variable that has missing values, holding the count and the percentage of missing values

def missing_values_table(df):
    # Total missing values per column
    mis_val = df.isnull().sum()

    # Percentage of missing values per column
    mis_val_percent = 100 * df.isnull().sum() / df.shape[0]

    # Make a table with the results
    mis_val_table = pd.concat([mis_val, mis_val_percent], axis=1)

    # Rename the columns
    mis_val_table_re_columns = mis_val_table.rename(
        columns={0: 'Missing Values',
                 1: '% of Total Missing Values'})

    # Keep only the columns with missing values, sorted by percentage, descending
    mis_val_table_re_columns = mis_val_table_re_columns[
        mis_val_table_re_columns["Missing Values"] != 0
    ].sort_values(by=["% of Total Missing Values"], ascending=False)

    # Print summary information
    print("Your selected df has " + str(df.shape[1]) + " columns.\n"
          "There are " + str(mis_val_table_re_columns.shape[0]) + " columns with missing values.")

    return mis_val_table_re_columns

Inspect the breakdown of variable types

df.dtypes.value_counts()

Preprocessing categorical variables

  • Variables of type object are categorical; count the unique values of each categorical variable

df.select_dtypes('object').apply(pd.Series.nunique, axis=0)
  • Label Encoder - remember to encode both the train and the test set!

from sklearn.preprocessing import LabelEncoder

# Create a label encoder object
le = LabelEncoder()
le_count = 0

# Iterate through the columns
for col in app_train:
    if app_train[col].dtype == 'object':
        # If 2 or fewer unique categories
        if len(list(app_train[col].unique())) <= 2:
            # Fit on the training data
            le.fit(app_train[col])

            # Transform both training and testing data
            app_train[col] = le.transform(app_train[col])
            app_test[col] = le.transform(app_test[col])

            # Keep track of how many columns are label encoded
            le_count += 1

print("{} columns were label encoded.".format(le_count))
  • OneHot Encoder

# OneHot encoding of categorical variables
app_train = pd.get_dummies(app_train)
app_test = pd.get_dummies(app_test)

The train and test sets should end up with the same number of features (columns). After one-hot encoding, however, if some categorical values appear in the train set but never in the test set (or vice versa), the two sets get different dummy columns, so they have to be aligned -

train_labels = app_train['TARGET']

# Align the training and testing data, keep only columns present in both dataframes
app_train, app_test = app_train.align(app_test, join='inner', axis=1)

# Add the target back in train data
app_train['TARGET'] = train_labels

print('Training Features shape: ', app_train.shape)
print('Testing Features shape: ', app_test.shape)

After one-hot encoding the number of features grows significantly; if needed, apply PCA, as in the sketch below.
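
A minimal sketch of that step, assuming the one-hot-encoded app_train/app_test from above; the median imputation, the scaling, and n_components=50 are illustrative choices, not part of the original notes:

from sklearn.decomposition import PCA
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

# PCA cannot handle NaNs and is scale-sensitive, so impute and scale first
imputer = SimpleImputer(strategy='median')
scaler = StandardScaler()

train_x = scaler.fit_transform(imputer.fit_transform(app_train.drop(columns=['TARGET'])))
test_x = scaler.transform(imputer.transform(app_test))

# Fit PCA on the training data only, then apply the same projection to the test data
pca = PCA(n_components=50)
train_pca = pca.fit_transform(train_x)
test_pca = pca.transform(test_x)

print('Variance explained by 50 components: %0.3f' % pca.explained_variance_ratio_.sum())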

Checking for Outliers

  • Check for values that defy common sense

Check the maximum and minimum values

app_train['DAYS_EMPLOYED'].describe()

If you find anomalous values in the data, do not handle them carelessly, e.g. by filling them all with zero.

The safest way is to first check whether the anomalies follow a pattern: for example, are all of the anomalous values identical, and do the observations carrying an anomaly behave differently with respect to the target? (To check the latter, split the observations into groups by whether they contain an anomaly and compare the mean of the target variable across the groups.)

If the anomalies do show a pattern, one way to handle them is to create an extra column flagging whether the corresponding column holds an anomaly, and then replace every anomalous value with np.nan for later processing, as sketched below.

Note - any processing applied to the training set must also be applied to the test set!
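
A minimal sketch of the flag-and-replace approach, assuming the well-known DAYS_EMPLOYED anomaly in the Home Credit data, where every anomalous entry equals 365243 (that specific value is taken from the dataset, not from the notes above):

import numpy as np

# Group observations by whether they carry the anomaly and compare target means
anom = app_train[app_train['DAYS_EMPLOYED'] == 365243]
non_anom = app_train[app_train['DAYS_EMPLOYED'] != 365243]
print('Default rate of anomalous rows:     %0.2f%%' % (100 * anom['TARGET'].mean()))
print('Default rate of non-anomalous rows: %0.2f%%' % (100 * non_anom['TARGET'].mean()))

# Create a flag column, then replace the anomalous value with np.nan
app_train['DAYS_EMPLOYED_ANOM'] = app_train['DAYS_EMPLOYED'] == 365243
app_train['DAYS_EMPLOYED'] = app_train['DAYS_EMPLOYED'].replace({365243: np.nan})

# Apply exactly the same processing to the test set
app_test['DAYS_EMPLOYED_ANOM'] = app_test['DAYS_EMPLOYED'] == 365243
app_test['DAYS_EMPLOYED'] = app_test['DAYS_EMPLOYED'].replace({365243: np.nan})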

Correlation between Features and the Target

Some general interpretations of the absolute value of the correlation coefficient are:

  1. .00-.19 “very weak”
  2. .20-.39 “weak”
  3. .40-.59 “moderate”
  4. .60-.79 “strong”
  5. .80-1.0 “very strong”
  • Correlation of every feature with the target variable

# Find correlations with the target and sort
correlations = app_train.corr()['TARGET'].sort_values()

# Display correlations
print('Most Positive Correlations:\n', correlations.tail(15))
print('\nMost Negative Correlations:\n', correlations.head(15))
  • Dig into the relationship between one continuous feature and a categorical target

First plot a histogram to look at the distribution -

import matplotlib.pyplot as plt

# Set the style of plots
plt.style.use('fivethirtyeight')

# Plot the distribution of ages in years
plt.hist(app_train['DAYS_BIRTH'] / 365, edgecolor='k', bins=25)
plt.title('Age of Client')
plt.xlabel('Age (years)')
plt.ylabel('Count')

Then draw KDE plots to see how the feature is distributed for each value of the target

import seaborn as sns

# KDE plot of loans that were repaid on time
sns.kdeplot(app_train.loc[app_train['TARGET'] == 0, 'DAYS_BIRTH'] / 365, label='target == 0')

# KDE plot of loans that were not repaid on time
sns.kdeplot(app_train.loc[app_train['TARGET'] == 1, 'DAYS_BIRTH'] / 365, label='target == 1')

Try converting the continuous feature into a discrete one and explore its relationship with the target

# Save the age data in a separate dataframe
age_data = app_train[['TARGET', 'DAYS_BIRTH']].copy()
age_data['YEARS_BIRTH'] = age_data['DAYS_BIRTH'] / 365

# Bin the age data
age_data['YEARS_BINNED'] = pd.cut(age_data['YEARS_BIRTH'], bins=np.linspace(20, 70, num=11))
age_data.head(10)

np.linspace(start, end, num) - returns num evenly spaced samples over [start, end]

pd.cut(array-like x, bins) - returns an array-like object with x binned according to bins
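
A quick illustration of what these two calls return for the arguments used above:

import numpy as np
import pandas as pd

print(np.linspace(20, 70, num=11))
# [20. 25. 30. 35. 40. 45. 50. 55. 60. 65. 70.]

print(pd.cut([22, 37, 51], bins=np.linspace(20, 70, num=11)))
# [(20.0, 25.0], (35.0, 40.0], (50.0, 55.0]]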

Group by the bins and draw a bar plot -

# Average the target within each age bin
age_groups = age_data.groupby('YEARS_BINNED').mean()

# Draw a bar plot for the age bins
plt.bar(age_groups.index.astype(str), 100 * age_groups['TARGET'])

# Plot labeling
plt.xticks(rotation=75)
plt.xlabel('Age Group (years)')
plt.ylabel('Failure to Repay (%)')
plt.title('Failure to Repay by Group')
  • Explore the joint effect of several related continuous features on a categorical target

Look at the relationships among the features, and between each feature and the target, with a correlation heatmap and KDE plots (a KDE sketch follows the heatmap below)

ext_data = app_train[['EXT_SOURCE_1', 'EXT_SOURCE_2', 'EXT_SOURCE_3', 'TARGET', 'DAYS_BIRTH']]
corr_ext = ext_data.corr()
corr_ext

sns.heatmap(corr_ext, cmap=plt.cm.RdYlBu_r, vmin=-0.25, annot=True, vmax=0.6)
plt.title('Correlation Heatmap')
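
The KDE part can reuse the single-feature pattern from the previous section; a minimal sketch looping over the three EXT_SOURCE columns (the subplot layout is an illustrative choice):

plt.figure(figsize=(10, 12))

# One KDE subplot per source feature, split by target value
for i, col in enumerate(['EXT_SOURCE_1', 'EXT_SOURCE_2', 'EXT_SOURCE_3']):
    plt.subplot(3, 1, i + 1)
    sns.kdeplot(app_train.loc[app_train['TARGET'] == 0, col].dropna(), label='target == 0')
    sns.kdeplot(app_train.loc[app_train['TARGET'] == 1, col].dropna(), label='target == 1')
    plt.title('Distribution of %s by Target Value' % col)

plt.tight_layout()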

Feature Engineering

Polynomial Features

To generate polynomial features, use PolynomialFeatures from the sklearn package

# Import Polynomial features tool
from sklearn.preprocessing import PolynomialFeatures

poly_transformer = PolynomialFeatures(degree=3)

# Train the polynomial features
poly_transformer.fit(poly_features)

# Transform the features
poly_features = poly_transformer.transform(poly_features)
poly_features_test = poly_transformer.transform(poly_features_test)

print('Polynomial features shape: ', poly_features.shape)

Note that the transform method of PolynomialFeatures returns a numpy array, which has to be converted back into a DataFrame explicitly; the new feature names come from the get_feature_names method (renamed to get_feature_names_out in newer scikit-learn releases). The resulting features then need a primary key added so they can be merged back into the original train and test sets - note that this differs from the get_dummies() approach.

# Create a dataframe of the features
poly_features = pd.DataFrame(poly_features,
                             columns=poly_transformer.get_feature_names(['EXT_SOURCE_1', 'EXT_SOURCE_2',
                                                                         'EXT_SOURCE_3', 'DAYS_BIRTH']))

# Put test features into a dataframe
poly_features_test = pd.DataFrame(poly_features_test,
                                  columns=poly_transformer.get_feature_names(['EXT_SOURCE_1', 'EXT_SOURCE_2',
                                                                              'EXT_SOURCE_3', 'DAYS_BIRTH']))

# Merge polynomial features into the training dataframe
poly_features['SK_ID_CURR'] = app_train['SK_ID_CURR']
app_train_poly = app_train.merge(poly_features, on='SK_ID_CURR', how='left')

# Merge polynomial features into the testing dataframe
poly_features_test['SK_ID_CURR'] = app_test['SK_ID_CURR']
app_test_poly = app_test.merge(poly_features_test, on='SK_ID_CURR', how='left')

# Align the dataframes
app_train_poly, app_test_poly = app_train_poly.align(app_test_poly, join='inner', axis=1)

# Print out the new shapes
print('Training data with polynomial features shape: ', app_train_poly.shape)
print('Testing data with polynomial features shape:  ', app_test_poly.shape)

Domain Knowledge
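
This section is only a stub in the original notes. A common form of domain-knowledge feature engineering on the Home Credit data is building ratio features; a minimal sketch, assuming columns AMT_CREDIT, AMT_INCOME_TOTAL, AMT_ANNUITY, DAYS_EMPLOYED, and DAYS_BIRTH as in that dataset:

# Ratio features motivated by domain knowledge (credit burden relative to income,
# loan term, and fraction of life spent employed); apply to both train and test
for df_ in [app_train, app_test]:
    df_['CREDIT_INCOME_PERCENT'] = df_['AMT_CREDIT'] / df_['AMT_INCOME_TOTAL']
    df_['ANNUITY_INCOME_PERCENT'] = df_['AMT_ANNUITY'] / df_['AMT_INCOME_TOTAL']
    df_['CREDIT_TERM'] = df_['AMT_ANNUITY'] / df_['AMT_CREDIT']
    df_['DAYS_EMPLOYED_PERCENT'] = df_['DAYS_EMPLOYED'] / df_['DAYS_BIRTH']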

 
