Machine Learning Project in Practice: Energy Efficiency, Part 1: Data Preprocessing

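* Project Workflow

Basic workflow:

Data cleaning and format conversion
Exploratory data analysis
Feature engineering
Build baseline models and try several algorithms
Hyperparameter tuning
Evaluation and testing
Interpreting the model
Wrapping up the project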

I. Data Cleaning and Format Conversion

import warnings
warnings.filterwarnings('ignore')

import pandas as pd
import numpy as np
pd.set_option('display.max_columns', 60)
pd.options.mode.chained_assignment = None
# No warnings about setting value on copy of slice

import matplotlib.pyplot as plt
%matplotlib inline
plt.rcParams['font.size'] = 24
from IPython.core.pylabtools import figsize

import seaborn as sns
sns.set(font_scale = 2)

data = pd.read_csv('Energy_and_Water_Data_Disclosure_for_Local_Law_84_2017__Data_for_Calendar_Year_2016_.csv')
data.head()

1.1 Data types and missing values

data.info()

Convert 'Not Available' to np.nan, then cast the numeric columns (identified by the units in their names) to float.

data = data.replace({'Not Available': np.nan})

for col in list(data.columns):
    # Columns whose names carry a numeric unit should hold floats
    if ('ft²' in col or 'kBtu' in col or 'Metric Tons CO2e' in col or 'kWh' in
        col or 'therms' in col or 'gal' in col or 'Score' in col):
        data[col] = data[col].astype(float)

data.describe()
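
To see why the two steps are needed, here is a minimal sketch on a hypothetical one-column frame (the column name `eui` and its values are made up for illustration, not taken from the dataset). A column that mixes numeric text with the 'Not Available' marker cannot be cast to float until the marker has been replaced with np.nan.

```python
import numpy as np
import pandas as pd

# Hypothetical column: numbers stored as text next to a 'Not Available' marker
toy = pd.DataFrame({'eui': ['12.5', 'Not Available', '80.1']})

# Step 1: turn the marker into a real missing value
toy = toy.replace({'Not Available': np.nan})

# Step 2: the remaining strings can now be parsed as floats
toy['eui'] = toy['eui'].astype(float)

print(toy['eui'].dtype)         # float64
print(toy['eui'].isna().sum())  # 1
```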

1.2 Handling missing values

import missingno as msno
msno.matrix(data, figsize = (16, 5))

1.2.1 A function for the percentage of missing values:

def missing_values_table(df):
    mis_val = df.isnull().sum() # total missing values per column
    mis_val_percent = 100 * df.isnull().sum() / len(df) # percentage of missing values
    mis_val_table = pd.concat([mis_val, mis_val_percent], axis = 1) # combine both into one table
    mis_val_table_ren_columns = mis_val_table.rename(columns = {0:'Missing Values',
                                                               1:'% of Total Values'})
    mis_val_table_ren_columns = mis_val_table_ren_columns[
        mis_val_table_ren_columns.iloc[:,1] != 0].sort_values('% of Total Values',ascending=False).round(1)
    # keep only the columns with missing values, sorted by percentage in descending order
    
    print('Your selected dataframe has {} columns.\nThere are {} columns that have missing values.'.format(df.shape[1], mis_val_table_ren_columns.shape[0]))
    # print a summary of the missing values
    
    return mis_val_table_ren_columns

missing_values_table(data)

Your selected dataframe has 60 columns.
There are 46 columns that have missing values.
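
As a quick sanity check of what the table reports, the same percentages can be obtained more compactly, since the mean of a boolean mask is the fraction of True values. The sketch below uses a small hypothetical frame (`toy` and its columns are made up for illustration) and is an alternative formulation, not the function above.

```python
import numpy as np
import pandas as pd

# Hypothetical frame: 'a' is half missing, 'b' is complete, 'c' is 75% missing
toy = pd.DataFrame({
    'a': [1.0, np.nan, 3.0, np.nan],
    'b': [1.0, 2.0, 3.0, 4.0],
    'c': [np.nan, np.nan, np.nan, 4.0],
})

# isnull().mean() is the fraction of missing rows in each column
missing_pct = (100 * toy.isnull().mean()).round(1)

# Keep only columns that actually have missing values, largest first
missing_pct = missing_pct[missing_pct > 0].sort_values(ascending=False)
print(missing_pct)
```

Column `b` has no missing values, so it does not appear in the result, just as complete columns are left out of the table above.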

1.2.2 Find the columns with more than 50% missing values

missing_df = missing_values_table(data)
missing_columns = list(missing_df[missing_df['% of Total Values'] > 50].index)
print('We will remove %d columns.' % len(missing_columns))

Your selected dataframe has 60 columns.
There are 46 columns that have missing values.
We will remove 11 columns.

1.2.3 Drop the columns with more than 50% missing values

data = data.drop(columns = list(missing_columns))

II. Exploratory Data Analysis

Exploratory Data Analysis (EDA) means plotting the data in order to understand it.

2.1 Univariate plots

The label (target)

data = data.rename(columns = {'ENERGY STAR Score': 'score'})

plt.figure(figsize = (8, 6))
plt.style.use('ggplot')
plt.hist(data['score'].dropna(), bins = 100, edgecolor = 'k')
plt.xlabel('Score'); plt.ylabel('Number of Buildings')
plt.title('Energy Star Score Distribution')

The Site EUI feature

plt.style.use('ggplot')
plt.figure(figsize = (8, 6))
plt.hist(data['Site EUI (kBtu/ft²)'].dropna(), bins = 20, edgecolor = 'black')
plt.xlabel('Site EUI'); plt.ylabel('Count'); plt.title('Site EUI Distribution')

data['Site EUI (kBtu/ft²)'].describe()

data['Site EUI (kBtu/ft²)'].dropna().sort_values().tail(10)

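There are some extremely large values. They may be outliers or recording errors, and they would distort the results.

2.2 Removing outliers

The choice of outlier criterion deserves some further thought; here we use the extreme-outlier rule, which drops values that fall outside the following bounds:

Lower bound: First Quartile − 3 ∗ Interquartile Range
Upper bound: Third Quartile + 3 ∗ Interquartile Range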

first_quartile = data['Site EUI (kBtu/ft²)'].describe()['25%']
third_quartile = data['Site EUI (kBtu/ft²)'].describe()['75%']
iqr = third_quartile - first_quartile

data = data[(data['Site EUI (kBtu/ft²)'] > (first_quartile - 3 * iqr)) &
            (data['Site EUI (kBtu/ft²)'] < (third_quartile + 3 * iqr))]

plt.figure(figsize = (8, 6))
plt.hist(data['Site EUI (kBtu/ft²)'].dropna(), bins = 50, edgecolor = 'black')
plt.xlabel('Site EUI'); plt.ylabel('Count'); plt.title('Site EUI Distribution')
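
To make the extreme-outlier rule concrete, here is a worked sketch on a handful of hypothetical Site EUI values (the numbers are made up for illustration). It also shows a side effect of filtering with comparisons: a missing value fails both comparisons, so rows without a Site EUI are dropped along with the outliers.

```python
import numpy as np
import pandas as pd

# Hypothetical Site EUI values: 5000 stands in for a recording error,
# and NaN for a building with no reported value
eui = pd.Series([50.0, 60.0, 70.0, 80.0, 90.0, 100.0, 110.0, 5000.0, np.nan])

q1 = eui.quantile(0.25)   # 67.5 (missing values are ignored)
q3 = eui.quantile(0.75)   # 102.5
iqr = q3 - q1             # 35.0

lower = q1 - 3 * iqr      # -37.5
upper = q3 + 3 * iqr      # 207.5

# Comparisons against NaN are False, so the missing row is removed too
kept = eui[(eui > lower) & (eui < upper)]
print(lower, upper, len(kept))
```

Seven of the nine rows survive: the value 5000 is above the upper bound, and the missing value is excluded by the comparison itself.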

2.3 Finding which variables affect the result

Keep only the building types with more than 80 observations

Lput = data.dropna(subset = ['score'])['Largest Property Use Type'].value_counts()
Lput = list(Lput[Lput.values > 80].index)

plt.figure(figsize = (12, 10))
for lput in Lput:
    subset = data[data['Largest Property Use Type'] == lput]
    sns.kdeplot(subset['score'].dropna(), label = lput, shade = False, alpha = 0.8)
plt.xlabel('Energy Star Score', fontsize = 18)
plt.ylabel('Density', fontsize = 18)
plt.title('Density Plot of Energy Star Scores by Building Type', size = 24)

The score distribution differs noticeably between building types, so this variable is worth making full use of.

boroughs = data.dropna(subset = ['score'])['Borough'].value_counts()
boroughs = list(boroughs[boroughs.values > 150].index)

plt.figure(figsize = (12, 10))
for borough in boroughs:
    subset = data[data['Borough'] == borough]
    sns.kdeplot(subset['score'].dropna(), label = borough)
plt.xlabel('Energy Star Score', fontsize = 18)
plt.ylabel('Density', fontsize = 18)
plt.title('Density Plot of Energy Star Scores by Borough', fontsize = 24)

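Borough, by contrast, appears to have little effect, since the curves are all quite similar.

2.4 Correlations between features and the label

The Pearson correlation coefficient helps us screen features.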

corr_data = data.corr()['score'].sort_values()
print(corr_data.head(15), '\n')
print(corr_data.tail(15))

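Site EUI (kBtu/ft²) and Weather Normalized Site EUI (kBtu/ft²) show a clear negative correlation with the score: the more energy a building uses per unit of floor area, the lower its energy score.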
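
For reference, the Pearson coefficient is the covariance of two variables divided by the product of their standard deviations, so it ranges from -1 to 1. The sketch below computes it by hand on hypothetical values (the arrays are made up for illustration) and compares the result with np.corrcoef.

```python
import numpy as np

# Hypothetical values: higher energy intensity paired with a lower score
eui = np.array([40.0, 60.0, 80.0, 100.0, 120.0])
score = np.array([90.0, 75.0, 60.0, 35.0, 20.0])

# Centre both variables on their means
x_c = eui - eui.mean()
y_c = score - score.mean()

# Pearson r = sum(x_c * y_c) / sqrt(sum(x_c^2) * sum(y_c^2))
r_manual = (x_c * y_c).sum() / np.sqrt((x_c ** 2).sum() * (y_c ** 2).sum())

# The same quantity from NumPy's correlation matrix
r_numpy = np.corrcoef(eui, score)[0, 1]
print(round(r_manual, 4), round(r_numpy, 4))
```

Both values agree and are close to -1, which is the pattern of a strong negative linear relationship.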

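We should also consider nonlinear transformations of the features, such as square roots and logarithms, and categorical variables can be converted with one-hot encoding.

2.4.* Feature transformation and one-hot encoding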

numeric_subset = data.select_dtypes('number') # select the numeric columns
for col in numeric_subset.columns: # add square-root and log versions of each numeric column
    if col == 'score':
        continue # leave the label untransformed
    else:
        numeric_subset['sqrt_' + col] = np.sqrt(numeric_subset[col])
        numeric_subset['log_' + col] = np.log(numeric_subset[col])

categorical_subset = data[['Borough', 'Largest Property Use Type']] # select the categorical columns
categorical_subset = pd.get_dummies(categorical_subset) # one-hot encode

features = pd.concat([numeric_subset, categorical_subset], axis = 1) # join the two sets of columns
features = features.dropna(subset = ['score']) # drop rows with a missing label

correlations = features.corr()['score'].dropna().sort_values() # correlations with the label

correlations.head(15)
correlations.tail(15)
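
The square-root and log transforms are applied to every numeric column, including columns that may contain zeros or negative values. A small sketch on hypothetical data (the `usage` column and its values are made up for illustration) shows what those inputs produce, and what pd.get_dummies returns for a categorical column.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'usage': [0.0, 1.0, 100.0, -4.0],                    # hypothetical numeric feature
    'Borough': ['Bronx', 'Queens', 'Bronx', 'Queens'],
})

# Silence NumPy's runtime warnings for the invalid inputs below
with np.errstate(divide='ignore', invalid='ignore'):
    toy['log_usage'] = np.log(toy['usage'])     # log(0) is -inf, log of a negative is NaN
    toy['sqrt_usage'] = np.sqrt(toy['usage'])   # sqrt of a negative is NaN

# One indicator column per category value
dummies = pd.get_dummies(toy[['Borough']])

print(toy)
print(dummies.columns.tolist())
```

Infinite values like these are not missing values, so dropna does not remove them; they have to be replaced with np.nan explicitly before they can be dropped or imputed.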

2.5 Bivariate plots

plt.figure(figsize = (12, 10))
features['Largest Property Use Type'] = data.dropna(subset =['score'])['Largest Property Use Type']
# Restore the building type column

features = features[features['Largest Property Use Type'].isin(Lput)]
# Limit to building types with more than 80 observations

sns.lmplot(x = 'Site EUI (kBtu/ft²)', y = 'score', hue = 'Largest Property Use Type',
          data = features, scatter_kws = {'alpha':0.7, 's':50}, fit_reg = False,
          height = 12, aspect = 1.2)
plt.xlabel('Site EUI', fontsize = 24)
plt.ylabel('Energy Star Score', fontsize = 24)
plt.title('Energy Star Score vs Site EUI', fontsize = 30)

2.6 Pairs Plot

plot_data = features[['score', 'Weather Normalized Source EUI (kBtu/ft²)',
                      'Site EUI (kBtu/ft²)', 'sqrt_Source EUI (kBtu/ft²)']]
plot_data = plot_data.replace({np.inf: np.nan, -np.inf: np.nan}) # replace positive and negative infinity with nan
plot_data = plot_data.rename(columns = {'Site EUI (kBtu/ft²)': 'Site EUI',
                       'sqrt_Source EUI (kBtu/ft²)': 'sqrt Source EUI',
                       'Weather Normalized Source EUI (kBtu/ft²)': 'Weather Norm EUI'})
plot_data = plot_data.dropna()

def corr_func(x, y, **kwargs):
    r = np.corrcoef(x, y)[0][1] # Pearson correlation coefficient of x and y
    ax = plt.gca()
    ax.annotate('r = {:.2f}'.format(r), xy = (.2, .8), xycoords=ax.transAxes, size=30)
    
grid = sns.PairGrid(data = plot_data, height = 4)
grid.map_upper(plt.scatter, alpha = 0.6)
grid.map_diag(plt.hist, edgecolor = 'black')
grid.map_lower(corr_func)
grid.map_lower(sns.kdeplot, cmap = plt.cm.Reds)

plt.suptitle('Pairs Plot of Energy Data', fontsize = 28, y = 1.05)

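III. Feature Engineering and Feature Selection

In general we proceed in two steps, feature engineering and feature selection:

Feature engineering: broadly, extracting as many features as possible from the data, using numeric transformations, feature combinations, decompositions and similar techniques.
Feature selection: finding the most valuable features to use as model inputs. After all the work above, some features may turn out to be redundant and some useful ones may not have been found yet, so both stages are revisited repeatedly. For example, once a model has been built and its feature importances are available, they guide feature selection: unimportant features can be dropped, while the more important ones can be transformed and combined further to improve the model. Feature engineering is therefore not a one-shot task; it has to be reconsidered in light of each round of results.

3.1 Feature transformation and one-hot encoding

This repeats section 2.4.* (feature transformation and one-hot encoding), this time keeping only the log transform.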

features = data.copy()
numeric_subset = data.select_dtypes('number')
for col in numeric_subset.columns:
    if col == 'score':
        continue # leave the label untransformed
    else:
        numeric_subset['log_' + col] = np.log(numeric_subset[col])
        
categorical_subset = data[['Borough', 'Largest Property Use Type']]
categorical_subset = pd.get_dummies(categorical_subset)

features = pd.concat([numeric_subset, categorical_subset], axis = 1)
features.shape

(11319, 110)

3.2 Collinear features

In this data, Site EUI and Weather Norm EUI are the pair to look at: they describe essentially the same thing.

plot_data = data[['Weather Normalized Site EUI (kBtu/ft²)', 'Site EUI (kBtu/ft²)']].dropna()

plt.plot(plot_data['Site EUI (kBtu/ft²)'], plot_data['Weather Normalized Site EUI (kBtu/ft²)'], 'bo')
plt.xlabel('Site EUI'); plt.ylabel('Weather Norm EUI')
plt.title('Weather Norm EUI vs Site EUI, R = %.4f' % np.corrcoef(data[['Weather Normalized Site EUI (kBtu/ft²)', 'Site EUI (kBtu/ft²)']].dropna(), rowvar=False)[0][1])

3.3 Removing collinear features

def remove_collinear_features(x, threshold):
    '''
    Objective:
        Remove collinear features in a dataframe with a correlation coefficient
        greater than the threshold. Removing collinear features can help a model
        to generalize and improves the interpretability of the model.
        
    Inputs: 
        threshold: any features with correlations greater than this value are removed
    
    Output: 
        dataframe that contains only the non-highly-collinear features
    '''
    
    y = x['score']
    x = x.drop(columns = ['score'])
    
    corr_matrix = x.corr()
    iters = range(len(corr_matrix.columns) - 1)
    drop_cols = []
    
    for i in iters:
        for j in range(i):
            item = corr_matrix.iloc[j: (j+1), (i+1): (i+2)]            
            col = item.columns
            row = item.index
            val = abs(item.values)           
            
            if val >= threshold:
                # print(col.values[0], "|", row.values[0], "|", round(val[0][0], 2))
                drop_cols.append(col.values[0])
        
    drops = set(drop_cols)
    # print(drops)
    x = x.drop(columns = drops)
    x = x.drop(columns = ['Weather Normalized Site EUI (kBtu/ft²)', 
                      'Water Use (All Water Sources) (kgal)',
                      'log_Water Use (All Water Sources) (kgal)',
                      'Largest Property Use Type - Gross Floor Area (ft²)'])
    x['score'] = y
    return x

features = remove_collinear_features(features, 0.6)
features = features.dropna(axis = 1, how = 'all')
print(features.shape)
features.head()

(11319, 65)
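
The idea of dropping one feature from each highly correlated pair can also be sketched with the upper triangle of the correlation matrix. This is an alternative formulation on hypothetical data (the frame `toy` and its columns are made up for illustration), not the function above, and it will not necessarily drop the same columns.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({
    'a': a,
    'a_copy': a * 2.0 + rng.normal(scale=0.01, size=200),  # nearly collinear with 'a'
    'b': rng.normal(size=200),                             # unrelated to 'a'
})

corr = toy.corr().abs()

# Keep only the strict upper triangle so that each pair is inspected once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))

# Drop the later column of every pair whose |r| reaches the threshold
to_drop = [col for col in upper.columns if (upper[col] >= 0.6).any()]
reduced = toy.drop(columns=to_drop)

print(to_drop, reduced.columns.tolist())
```

Only `a_copy` is removed: it carries almost the same information as `a`, while `b` is correlated with neither.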

3.4 Splitting the dataset

no_score = features[features['score'].isna()]
score = features[features['score'].notnull()]
print('no_score.shape: ', no_score.shape)
print('score.shape', score.shape)

from sklearn.model_selection import train_test_split
features = score.drop(columns = 'score')
labels = pd.DataFrame(score['score'])
features = features.replace({np.inf: np.nan, -np.inf: np.nan})
X, X_test, y, y_test = train_test_split(features, labels, test_size = 0.3, random_state = 42)
print(X.shape)
print(X_test.shape)
print(y.shape)
print(y_test.shape)

no_score.shape: (1858, 65)
score.shape: (9461, 65)
(6622, 64)
(2839, 64)
(6622, 1)
(2839, 1)
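
The printed shapes can be checked with simple arithmetic. The sketch below assumes that, for a fractional test_size, train_test_split rounds the number of test rows up; that assumption is consistent with the shapes printed above.

```python
import math

n_total = 11319      # rows in features before the split
n_no_score = 1858    # rows without a score
n_score = 9461       # rows with a score
test_size = 0.3

# Assumption: the test count is the ceiling of test_size * n_score
n_test = math.ceil(test_size * n_score)
n_train = n_score - n_test

print(n_train, n_test)                    # 6622 2839
print(n_no_score + n_score == n_total)    # True
```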

3.5 Establishing a baseline

Before modeling we need a worst-case reference: a model has to beat a naive guess to be of any use.

# Metric: Mean Absolute Error
def mae(y_true, y_pred):
    return np.mean(abs(y_true - y_pred))

baseline_guess = np.median(y)

print('The baseline guess is a score of %.2f' % baseline_guess)
print('Baseline Performance on the test set: MAE = %.4f' % mae(y_test, baseline_guess))

The baseline guess is a score of 66.00
Baseline Performance on the test set: MAE = 24.5164
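
The median is a natural constant guess for this metric because, among all constant predictions, it gives the smallest mean absolute error on the data it was computed from. A minimal sketch on hypothetical scores (the arrays are made up for illustration) compares it with the mean.

```python
import numpy as np

def mae(y_true, y_pred):
    # Mean absolute error between the labels and a prediction
    return np.mean(np.abs(y_true - y_pred))

# Hypothetical scores standing in for the training and test labels
y_train = np.array([10.0, 50.0, 66.0, 80.0, 100.0])
y_test = np.array([20.0, 60.0, 90.0])

median_guess = np.median(y_train)   # 66.0
mean_guess = np.mean(y_train)       # 61.2

# On the training labels the median is never worse than the mean
print(mae(y_train, median_guess), mae(y_train, mean_guess))

# The baseline is then scored on the held-out labels
print(mae(y_test, median_guess))
```

Any model built later should reach a test MAE below this baseline to be worth keeping.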

* Saving the results

no_score.to_csv('data/no_score.csv', index = False)
X.to_csv('data/training_features.csv', index = False)
X_test.to_csv('data/testing_features.csv', index = False)
y.to_csv('data/training_labels.csv', index = False)
y_test.to_csv('data/testing_labels.csv', index = False)

To be continued: modeling and analysis
