Table of Contents
object-dtype variables are categorical; count the unique values of each categorical variable
Label Encoder - remember to encode the train and test sets together!
Reading the Data
Tabular data
-
Read the data; check the number of rows and columns, and the first few rows
import pandas as pd

df = pd.read_csv("./Data/application_train.csv")
print("Training data shape: ", df.shape)
df.head()
EDA
Distribution of the target variable
-
The target variable is categorical
df['TARGET'].value_counts()
df['TARGET'].plot.hist()
Inspecting missing values
-
Distribution of missing data in a dataframe
Input: the target dataframe
Output: a dataframe indexed by every variable that has missing values, with columns for the missing-value count and the missing-value percentage
def missing_values_table(df):
    # Total missing values
    mis_val = df.isnull().sum()
    # Percentage of missing values
    mis_val_percent = 100 * df.isnull().sum() / df.shape[0]
    # Make a table with the results
    mis_val_table = pd.concat([mis_val, mis_val_percent], axis=1)
    # Rename the columns
    mis_val_table_re_columns = mis_val_table.rename(
        columns={0: 'Missing Values',
                 1: '% of Total Missing Values'})
    # Keep only the columns with missing values, sorted by percentage, descending
    mis_val_table_re_columns = mis_val_table_re_columns[
        mis_val_table_re_columns['Missing Values'] != 0
    ].sort_values(by=['% of Total Missing Values'], ascending=False)
    # Print summary information
    print("Your selected dataframe has " + str(df.shape[1]) + " columns.\n"
          "There are " + str(mis_val_table_re_columns.shape[0]) +
          " columns that have missing values.")
    return mis_val_table_re_columns
Check how many variables there are of each dtype
df.dtypes.value_counts()
Category / categorical variable preprocessing
-
object-dtype variables are categorical; count the unique values of each categorical variable
df.select_dtypes('object').apply(pd.Series.nunique, axis=0)
-
Label Encoder - remember to encode the train and test sets together!
from sklearn.preprocessing import LabelEncoder

# Create a label encoder object
le = LabelEncoder()
le_count = 0

# Iterate through the columns
for col in app_train:
    if app_train[col].dtype == 'object':
        # If 2 or fewer unique categories
        if len(list(app_train[col].unique())) <= 2:
            # Train on the training data
            le.fit(app_train[col])
            # Transform both training and testing data
            app_train[col] = le.transform(app_train[col])
            app_test[col] = le.transform(app_test[col])
            # Keep track of how many columns were label encoded
            le_count += 1

print("{} columns were label encoded.".format(le_count))
-
OneHot Encoder
# OneHot encoding of categorical variables
app_train = pd.get_dummies(app_train)
app_test = pd.get_dummies(app_test)
Note that the train and test sets should end up with the same number of features (columns). After one-hot encoding, however, if a categorical feature takes some values in the train set that never appear in the test set, the two sets will have different columns, so the train and test sets need to be aligned -
train_labels = app_train['TARGET']
# Align the training and testing data, keep only columns present in both dataframes
app_train, app_test = app_train.align(app_test, join='inner', axis=1)
# Add the target back in train data
app_train['TARGET'] = train_labels
print('Training Features shape: ', app_train.shape)
print('Testing Features shape: ', app_test.shape)
After one-hot encoding, the number of features grows significantly; apply PCA if needed.
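A minimal sketch of that PCA step, on a toy binary matrix standing in for the one-hot-encoded features (the matrix and the 95% variance threshold are illustrative assumptions, not values from these notes):

```python
import numpy as np
from sklearn.decomposition import PCA

# Toy one-hot-style matrix: 6 samples, 4 binary columns
X = np.array([
    [1, 0, 0, 1],
    [0, 1, 0, 1],
    [0, 0, 1, 0],
    [1, 0, 0, 0],
    [0, 1, 1, 0],
    [1, 1, 0, 1],
], dtype=float)

# Passing a float in (0, 1) keeps the smallest number of components
# whose cumulative explained variance reaches that fraction
pca = PCA(n_components=0.95)
X_reduced = pca.fit_transform(X)

print(X_reduced.shape)                       # (6, k) with k <= 4
print(pca.explained_variance_ratio_.sum())   # cumulative variance kept
```

As with the label encoder above, fit PCA on the training set only, then transform both train and test with the same fitted object.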
Checking for outliers
-
Check for values that defy common sense
Check the maximum and minimum values
app_train['DAYS_EMPLOYED'].describe()
If you find anomalous values in the data, do not handle them carelessly, e.g. by filling them all with zero.
The safest way is to first check whether the anomalies follow a pattern: for example, are all the anomalous values identical, and do the observations containing anomalies behave differently on the target variable? (To check the latter, split the observations by whether they contain an anomaly and compare the mean of the target variable across the groups.)
If the anomalies do follow a pattern, one way to handle them is to create an extra column flagging whether the corresponding value is anomalous, and then fill every anomalous value with np.nan for later processing.
Note - any processing done on the training set must also be done on the test set!
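The flag-and-NaN approach above can be sketched as follows, using a toy frame in place of app_train (the sentinel 365243, which is the anomalous DAYS_EMPLOYED value in the Home Credit data, is used for illustration):

```python
import numpy as np
import pandas as pd

# Toy stand-in for app_train; 365243 plays the role of the anomalous value
demo = pd.DataFrame({'DAYS_EMPLOYED': [-1000, -2000, 365243, -500, 365243]})

# Flag the anomaly in a new boolean column...
demo['DAYS_EMPLOYED_ANOM'] = demo['DAYS_EMPLOYED'] == 365243
# ...then replace it with NaN for later imputation
demo['DAYS_EMPLOYED'] = demo['DAYS_EMPLOYED'].replace({365243: np.nan})

print(demo['DAYS_EMPLOYED_ANOM'].sum())  # number of flagged rows
```

Both lines must be repeated verbatim on the test set.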
Correlation between features and the target
Some general interpretations of the absolute value of the correlation coefficient are:
- .00-.19 “very weak”
- .20-.39 “weak”
- .40-.59 “moderate”
- .60-.79 “strong”
- .80-1.0 “very strong”
-
Correlation of every feature with the target variable
# Find correlations with the target and sort
correlations = app_train.corr()['TARGET'].sort_values()
# Display correlations
print('Most Positive Correlations:\n', correlations.tail(15))
print('\nMost Negative Correlations:\n', correlations.head(15))
-
Dig deeper into the correlation between one continuous feature and the (categorical) target variable
First plot a histogram to inspect the distribution -
# Set the style of plots
plt.style.use('fivethirtyeight')
#Plot the distribution of ages in years
plt.hist(app_train['DAYS_BIRTH'] / 365, edgecolor = 'k', bins=25)
plt.title('Age of Client')
plt.xlabel('Age (years)')
plt.ylabel('Count')
Then draw KDE plots to see how the feature is distributed for each value of the target variable
# KDE plot of loans that were repaid on time
sns.kdeplot(app_train.loc[app_train['TARGET']==0, 'DAYS_BIRTH'] / 365, label = 'target==0')
# KDE plot of loans that were not repaid on time
sns.kdeplot(app_train.loc[app_train['TARGET']==1, 'DAYS_BIRTH'] / 365, label = 'target==1')
Try converting the continuous feature into a discrete one and explore its relationship with the target variable
# Age data saved in another dataframe
age_data = app_train[['TARGET', 'DAYS_BIRTH']]
age_data['YEARS_BIRTH'] = age_data['DAYS_BIRTH'] / 365
# Bin the age data
age_data['YEARS_BINNED'] = pd.cut(age_data['YEARS_BIRTH'], bins=np.linspace(20, 70, num=11))
age_data.head(10)
np.linspace(start, stop, num) - returns num evenly spaced samples over [start, stop]
pd.cut(x, bins) - given an array-like x, returns an array-like object with the values of x binned according to bins
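A tiny self-contained example of the two helpers described above:

```python
import numpy as np
import pandas as pd

# 11 evenly spaced edges from 20 to 70 -> 10 bins of width 5
edges = np.linspace(20, 70, num=11)
print(edges)  # [20. 25. 30. ... 70.]

# Each age falls into the half-open interval (left, right]
ages = pd.Series([23, 37, 52, 69])
binned = pd.cut(ages, bins=edges)
print(binned)  # e.g. 23 lands in (20.0, 25.0]
```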
Group by age bin, take the mean of each group, then draw a bar plot
# Average the target (failure-to-repay rate) within each age bin
age_groups = age_data.groupby('YEARS_BINNED').mean()

# Draw a bar plot for the age bins
plt.bar(age_groups.index.astype(str), 100 * age_groups['TARGET'])
#Plot labeling
plt.xticks(rotation=75)
plt.xlabel('Age Group (years)')
plt.ylabel('Failure to Repay (%)')
plt.title('Failure to Repay by Group')
-
Explore the joint effect of several related continuous features on the (categorical) target variable
Inspect the relationships among the features, and with the target variable, using a correlation heatmap and KDE plots
ext_data = app_train[['EXT_SOURCE_1', 'EXT_SOURCE_2', 'EXT_SOURCE_3', 'TARGET', 'DAYS_BIRTH']]
corr_ext = ext_data.corr()
corr_ext
sns.heatmap(corr_ext, cmap=plt.cm.RdYlBu_r, vmin=-0.25, annot=True, vmax=0.6)
plt.title('Correlation Heatmap')
Feature Engineering
Polynomial Features
Generate polynomial features using PolynomialFeatures from sklearn
# Import the polynomial features tool
from sklearn.preprocessing import PolynomialFeatures

# poly_features / poly_features_test hold the selected raw columns
# ('EXT_SOURCE_1'..'EXT_SOURCE_3' and 'DAYS_BIRTH'), with missing values imputed beforehand
poly_transformer = PolynomialFeatures(degree=3)
# Train the polynomial features
poly_transformer.fit(poly_features)
# Transform the features
poly_features = poly_transformer.transform(poly_features)
poly_features_test = poly_transformer.transform(poly_features_test)
print('Polynomial features shape: ', poly_features.shape)
Note that the transform method of PolynomialFeatures returns a numpy array, which has to be converted back into a DataFrame; use the get_feature_names method (renamed get_feature_names_out in newer scikit-learn) to obtain the new feature names. The resulting features need a primary key added, and are then merged back into the original train and test sets - note that this differs from get_dummies().
# Create a dataframe of the features
poly_features = pd.DataFrame(
    poly_features,
    columns=poly_transformer.get_feature_names(['EXT_SOURCE_1', 'EXT_SOURCE_2',
                                                'EXT_SOURCE_3', 'DAYS_BIRTH']))

# Put the test features into a dataframe
poly_features_test = pd.DataFrame(
    poly_features_test,
    columns=poly_transformer.get_feature_names(['EXT_SOURCE_1', 'EXT_SOURCE_2',
                                                'EXT_SOURCE_3', 'DAYS_BIRTH']))
# Merge polynomial features into training dataframe
poly_features['SK_ID_CURR'] = app_train['SK_ID_CURR']
app_train_poly = app_train.merge(poly_features, on = 'SK_ID_CURR', how = 'left')
# Merge polynomial features into testing dataframe
poly_features_test['SK_ID_CURR'] = app_test['SK_ID_CURR']
app_test_poly = app_test.merge(poly_features_test, on = 'SK_ID_CURR', how = 'left')
# Align the dataframes
app_train_poly, app_test_poly = app_train_poly.align(app_test_poly, join = 'inner', axis = 1)
# Print out the new shapes
print('Training data with polynomial features shape: ', app_train_poly.shape)
print('Testing data with polynomial features shape: ', app_test_poly.shape)
Domain Knowledge
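The notes end at this heading. As an illustrative sketch only (these ratio features are common hand-crafted choices for the Home Credit data, not necessarily what the author intended here), domain knowledge can be injected by combining raw columns into ratios that a credit analyst would care about:

```python
import pandas as pd

# Toy stand-in for app_train with the relevant raw columns
demo = pd.DataFrame({
    'AMT_CREDIT':       [100000.0, 200000.0],
    'AMT_INCOME_TOTAL': [50000.0, 80000.0],
    'AMT_ANNUITY':      [5000.0, 12000.0],
})

# Size of the loan relative to the client's income
demo['CREDIT_INCOME_PERCENT'] = demo['AMT_CREDIT'] / demo['AMT_INCOME_TOTAL']
# Size of the periodic payment relative to income
demo['ANNUITY_INCOME_PERCENT'] = demo['AMT_ANNUITY'] / demo['AMT_INCOME_TOTAL']
# Payment rate per period; its inverse approximates the loan term
demo['CREDIT_TERM'] = demo['AMT_ANNUITY'] / demo['AMT_CREDIT']
```

As with every other transformation in these notes, the same columns must be created on the test set.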