Understanding the Machine Learning Workflow by Analyzing House Prices --Adam Studio

Machine Learning Workflow for House Prices

1- Introduction

This is a comprehensive ML workflow for the House Prices data set. Most people in this community are familiar with the house prices dataset, but if you need to review information about the dataset, please visit this link.

I have tried to help fans of machine learning on Kaggle learn how to approach machine learning problems, and I think it is a great opportunity for anyone who wants to learn the machine learning workflow with Python from start to finish.

I want to cover most of the methods that were implemented for house prices until 2018. You can start to learn and review your knowledge of ML with a simple dataset and try to learn and memorize the workflow for your journey in the data science world.

Before we get into the notebook, let me introduce some helpful resources.

2- Machine Learning Workflow

If you have already read some machine learning books, you have noticed that there are different ways to move data through a machine learning project.

Most of these books share the following steps:

  • Define Problem
  • Specify Inputs & Outputs
  • Exploratory data analysis
  • Data Collection
  • Data Preprocessing
  • Data Cleaning
  • Visualization
  • Model Design, Training, and Offline Evaluation
  • Model Deployment, Online Evaluation, and Monitoring
  • Model Maintenance, Diagnosis, and Retraining

Of course, the same solution cannot be provided for all problems, so the best way is to create a general framework and adapt it to each new problem.

You can see my workflow in the image below:

[image]

Data science has so many techniques and procedures that it can confuse anyone.

2-2 Real world Application Vs Competitions

We all know that there are differences between real-world problems and competition problems. The following figure, taken from one of the courses on Coursera, partly makes this comparison:

[image]

As you can see, there are many more steps to work through in real-world problems.

3- Problem Definition

I think one of the most important things when you start a new machine learning project is defining your problem. That means you should understand the business problem (problem formalization).

Problem definition has four steps, which are illustrated in the picture below:

[image]

3-1 Problem Feature

We will use the house prices data set. This dataset contains information about house prices, and the target value is:

  • SalePrice

Why am I using the House Prices dataset?

  • This is a good project because it is so well understood.

  • Attributes are numeric and categorical, so you have to figure out how to load and handle the data.

  • It is a regression problem, allowing you to practice with perhaps an easier type of supervised learning algorithm.

  • This is a perfect competition for data science students who have completed an online course in machine learning and are looking to expand their skill set before trying a featured competition.

  • It leaves room for creative feature engineering.

3-1-1 Metric

Submissions are evaluated on Root-Mean-Squared-Error (RMSE) between the logarithm of the predicted value and the logarithm of the observed sales price. (Taking logs means that errors in predicting expensive houses and cheap houses will affect the result equally.)
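
As a minimal sketch of this metric, the snippet below computes the RMSE between log prices using only the standard library. The function name `rmse_of_logs` and the prices are made up for illustration; they are not part of the competition code.

```python
import math

def rmse_of_logs(predicted, observed):
    """RMSE between log(predicted) and log(observed)."""
    squared = [(math.log(p) - math.log(o)) ** 2
               for p, o in zip(predicted, observed)]
    return math.sqrt(sum(squared) / len(squared))

# A 10% overestimate is penalized the same on a cheap and an expensive house,
# because the log difference depends only on the ratio predicted / observed.
cheap = rmse_of_logs([110_000], [100_000])
expensive = rmse_of_logs([770_000], [700_000])
print(cheap, expensive)
```

On raw prices the second error would be seven times larger than the first; on log prices both reduce to log(1.1).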

3-2 Aim

It is our job to predict the sales price for each house. For each Id in the test set, you must predict the value of the SalePrice variable.

3-3 Variables

The variables are :

  • SalePrice - the property’s sale price in dollars. This is the target variable that you’re trying to predict.
  • MSSubClass: The building class
  • MSZoning: The general zoning classification
  • LotFrontage: Linear feet of street connected to property
  • LotArea: Lot size in square feet
  • Street: Type of road access
  • Alley: Type of alley access
  • LotShape: General shape of property
  • LandContour: Flatness of the property
  • Utilities: Type of utilities available
  • LotConfig: Lot configuration
  • LandSlope: Slope of property
  • Neighborhood: Physical locations within Ames city limits
  • Condition1: Proximity to main road or railroad
  • Condition2: Proximity to main road or railroad (if a second is present)
  • BldgType: Type of dwelling
  • HouseStyle: Style of dwelling
  • OverallQual: Overall material and finish quality
  • OverallCond: Overall condition rating
  • YearBuilt: Original construction date
  • YearRemodAdd: Remodel date
  • RoofStyle: Type of roof
  • RoofMatl: Roof material
  • Exterior1st: Exterior covering on house
  • Exterior2nd: Exterior covering on house (if more than one material)
  • MasVnrType: Masonry veneer type
  • MasVnrArea: Masonry veneer area in square feet
  • ExterQual: Exterior material quality
  • ExterCond: Present condition of the material on the exterior
  • Foundation: Type of foundation
  • BsmtQual: Height of the basement
  • BsmtCond: General condition of the basement
  • BsmtExposure: Walkout or garden level basement walls
  • BsmtFinType1: Quality of basement finished area
  • BsmtFinSF1: Type 1 finished square feet
  • BsmtFinType2: Quality of second finished area (if present)
  • BsmtFinSF2: Type 2 finished square feet
  • BsmtUnfSF: Unfinished square feet of basement area
  • TotalBsmtSF: Total square feet of basement area
  • Heating: Type of heating
  • HeatingQC: Heating quality and condition
  • CentralAir: Central air conditioning
  • Electrical: Electrical system
  • 1stFlrSF: First Floor square feet
  • 2ndFlrSF: Second floor square feet
  • LowQualFinSF: Low quality finished square feet (all floors)
  • GrLivArea: Above grade (ground) living area square feet
  • BsmtFullBath: Basement full bathrooms
  • BsmtHalfBath: Basement half bathrooms
  • FullBath: Full bathrooms above grade
  • HalfBath: Half baths above grade
  • BedroomAbvGr: Number of bedrooms above basement level
  • KitchenAbvGr: Number of kitchens
  • KitchenQual: Kitchen quality
  • TotRmsAbvGrd: Total rooms above grade (does not include bathrooms)
  • Functional: Home functionality rating
  • Fireplaces: Number of fireplaces
  • FireplaceQu: Fireplace quality
  • GarageType: Garage location
  • GarageYrBlt: Year garage was built
  • GarageFinish: Interior finish of the garage
  • GarageCars: Size of garage in car capacity
  • GarageArea: Size of garage in square feet
  • GarageQual: Garage quality
  • GarageCond: Garage condition
  • PavedDrive: Paved driveway
  • WoodDeckSF: Wood deck area in square feet
  • OpenPorchSF: Open porch area in square feet
  • EnclosedPorch: Enclosed porch area in square feet
  • 3SsnPorch: Three season porch area in square feet
  • ScreenPorch: Screen porch area in square feet
  • PoolArea: Pool area in square feet
  • PoolQC: Pool quality
  • Fence: Fence quality
  • MiscFeature: Miscellaneous feature not covered in other categories
  • MiscVal: Value of miscellaneous feature
  • MoSold: Month Sold
  • YrSold: Year Sold
  • SaleType: Type of sale
  • SaleCondition: Condition of sale

4- Inputs & Outputs

For every machine learning problem, you should ask yourself, what are inputs and outputs for the model?

4-1 Inputs

  • train.csv - the training set
  • test.csv - the test set

4-2 Outputs

  • sale prices for every record in test.csv

5- Loading Packages

In this kernel we are using the following packages:

[image]

5-1 Import

# standard library
import csv
import json
import os
import sys
import warnings

# numerical, plotting and boosting libraries
import matplotlib
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import scipy
import scipy.stats as stats
import seaborn as sns
import sklearn
import lightgbm as lgb
import xgboost as xgb
from scipy.stats import skew

# scikit-learn building blocks (duplicate imports merged)
from sklearn.base import BaseEstimator, TransformerMixin, RegressorMixin, clone
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor, GradientBoostingRegressor
from sklearn.kernel_ridge import KernelRidge
from sklearn.linear_model import (Ridge, RidgeCV, ElasticNet, Lasso, LassoCV,
                                  LassoLarsCV, LassoLarsIC, BayesianRidge)
from sklearn.metrics import (make_scorer, accuracy_score, classification_report,
                             confusion_matrix, mean_squared_error)
from sklearn.model_selection import KFold, GridSearchCV, cross_val_score, train_test_split
from sklearn.pipeline import Pipeline, make_pipeline
from sklearn.preprocessing import RobustScaler, StandardScaler

5-2 Version

print('matplotlib: {}'.format(matplotlib.__version__))
print('sklearn: {}'.format(sklearn.__version__))
print('scipy: {}'.format(scipy.__version__))
print('seaborn: {}'.format(sns.__version__))
print('pandas: {}'.format(pd.__version__))
print('numpy: {}'.format(np.__version__))
print('Python: {}'.format(sys.version))

5-3 Setup

A few tiny adjustments for better code readability:

pd.set_option('display.float_format', lambda x: '%.3f' % x)
sns.set(style='white', context='notebook', palette='deep')
warnings.filterwarnings('ignore')
sns.set_style('white')
%matplotlib inline

6- Exploratory Data Analysis(EDA)

In this section, you’ll learn how to use graphical and numerical techniques to begin uncovering the structure of your data.

  • Which variables suggest interesting relationships?
  • Which observations are unusual?

By the end of the section, you’ll be able to answer these questions and more, while generating graphics that are both insightful and beautiful. Then we will review analytical and statistical operations:

  • Data Collection
  • Visualization
  • Data Cleaning
  • Data Preprocessing

6-1 Data Collection

Data collection is the process of gathering and measuring data, information, or any variables of interest in a standardized and established manner that enables the collector to answer or test hypotheses and evaluate outcomes of the particular collection.[techopedia]

# import Dataset to play with it
train = pd.read_csv('../input/train.csv')
test= pd.read_csv('../input/test.csv')

The concat function does all of the heavy lifting of performing concatenation operations along an axis. Let us create all_data.

all_data = pd.concat((train.loc[:,'MSSubClass':'SaleCondition'],
                      test.loc[:,'MSSubClass':'SaleCondition']))
  • Each row is an observation (also known as : sample, example, instance, record)
  • Each column is a feature (also known as: Predictor, attribute, Independent Variable, input, regressor, Covariate)
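
To make the concatenation step concrete, here is a small sketch on toy data. The frames `toy_train` and `toy_test` and their values are invented for illustration; only the column names echo the real dataset.

```python
import pandas as pd

# Toy stand-ins for train and test; the values are made up.
toy_train = pd.DataFrame({"Id": [1, 2], "LotArea": [100, 200], "SalePrice": [10, 20]})
toy_test = pd.DataFrame({"Id": [3], "LotArea": [300]})

# Keep only the feature columns (no Id, no target) and stack the rows.
toy_all = pd.concat((toy_train.loc[:, "LotArea":"LotArea"],
                     toy_test.loc[:, "LotArea":"LotArea"]))

print(toy_all.shape)           # rows of train plus rows of test
print(toy_all.index.tolist())  # original row labels are kept, so they repeat
```

Because the row labels repeat after concatenation, positional slicing (`iloc`) is the safer way to split the combined frame back into train and test; passing `ignore_index=True` to `pd.concat` is an alternative if fresh labels are preferred.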

After loading the data with pandas, we should check its type, content, and description using the following commands:

type(train),type(test)

6-1-1 Statistical Summary

  • 1- Dimensions of the dataset.

  • 2- Peek at the data itself.

  • 3- Statistical summary of all attributes.

  • 4- Breakdown of the data by the class variable.[7]

Don’t worry, each look at the data is one command. These are useful commands that you can use again and again on future projects.

# shape
print(train.shape)

(1460, 81)

Train has one more column than test. Why? Because train also contains the target value (SalePrice).

# shape
print(test.shape)

(1459, 80)

The shape property gives a quick idea of how many instances (rows) and how many attributes (columns) the data contains.

You should see 1460 instances and 81 attributes for train, and 1459 instances and 80 attributes for test.

To get some information about the dataset, you can use the info() command:

print(train.info())

<class ‘pandas.core.frame.DataFrame’>
RangeIndex: 1460 entries, 0 to 1459
Data columns (total 81 columns):
Id 1460 non-null int64
MSSubClass 1460 non-null int64
MSZoning 1460 non-null object
LotFrontage 1201 non-null float64
LotArea 1460 non-null int64
Street 1460 non-null object
Alley 91 non-null object
LotShape 1460 non-null object
LandContour 1460 non-null object
Utilities 1460 non-null object
LotConfig 1460 non-null object
LandSlope 1460 non-null object
Neighborhood 1460 non-null object
Condition1 1460 non-null object
Condition2 1460 non-null object
BldgType 1460 non-null object
HouseStyle 1460 non-null object
OverallQual 1460 non-null int64
OverallCond 1460 non-null int64
YearBuilt 1460 non-null int64
YearRemodAdd 1460 non-null int64
RoofStyle 1460 non-null object
RoofMatl 1460 non-null object
Exterior1st 1460 non-null object
Exterior2nd 1460 non-null object
MasVnrType 1452 non-null object
MasVnrArea 1452 non-null float64
ExterQual 1460 non-null object
ExterCond 1460 non-null object
Foundation 1460 non-null object
BsmtQual 1423 non-null object
BsmtCond 1423 non-null object
BsmtExposure 1422 non-null object
BsmtFinType1 1423 non-null object
BsmtFinSF1 1460 non-null int64
BsmtFinType2 1422 non-null object
BsmtFinSF2 1460 non-null int64
BsmtUnfSF 1460 non-null int64
TotalBsmtSF 1460 non-null int64
Heating 1460 non-null object
HeatingQC 1460 non-null object
CentralAir 1460 non-null object
Electrical 1459 non-null object
1stFlrSF 1460 non-null int64
2ndFlrSF 1460 non-null int64
LowQualFinSF 1460 non-null int64
GrLivArea 1460 non-null int64
BsmtFullBath 1460 non-null int64
BsmtHalfBath 1460 non-null int64
FullBath 1460 non-null int64
HalfBath 1460 non-null int64
BedroomAbvGr 1460 non-null int64
KitchenAbvGr 1460 non-null int64
KitchenQual 1460 non-null object
TotRmsAbvGrd 1460 non-null int64
Functional 1460 non-null object
Fireplaces 1460 non-null int64
FireplaceQu 770 non-null object
GarageType 1379 non-null object
GarageYrBlt 1379 non-null float64
GarageFinish 1379 non-null object
GarageCars 1460 non-null int64
GarageArea 1460 non-null int64
GarageQual 1379 non-null object
GarageCond 1379 non-null object
PavedDrive 1460 non-null object
WoodDeckSF 1460 non-null int64
OpenPorchSF 1460 non-null int64
EnclosedPorch 1460 non-null int64
3SsnPorch 1460 non-null int64
ScreenPorch 1460 non-null int64
PoolArea 1460 non-null int64
PoolQC 7 non-null object
Fence 281 non-null object
MiscFeature 54 non-null object
MiscVal 1460 non-null int64
MoSold 1460 non-null int64
YrSold 1460 non-null int64
SaleType 1460 non-null object
SaleCondition 1460 non-null object
SalePrice 1460 non-null int64
dtypes: float64(3), int64(35), object(43)
memory usage: 924.0+ KB
None

If you want to see the unique values of a column and how often each one occurs, you can use the following commands:

train['Fence'].unique()

array([nan, 'MnPrv', 'GdWo', 'GdPrv', 'MnWw'], dtype=object)

train["Fence"].value_counts()

MnPrv 157
GdPrv 59
GdWo 54
MnWw 11
Name: Fence, dtype: int64

Copy the Id column of the train and test data sets:

train_id=train['Id'].copy()
test_id=test['Id'].copy()

To check the first 5 rows of the data set, we can use head(5):

train.head(5)

To check the last 5 rows of the data set, we use the tail() function:

train.tail()

To pop up 5 random rows from the data set, we can use the sample(5) function:

train.sample(5)

To give a statistical summary of the dataset, we can use describe():

train.describe()

To check how many null values are in the dataset, we can use isnull().sum():

train.isnull().sum().head(2)

Id 0
MSSubClass 0
dtype: int64
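
The call above only shows the first two columns. A small sketch of how the same idea can rank the columns that actually have missing values follows; the frame `toy` and its values are invented for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame with a few missing entries; the values are made up.
toy = pd.DataFrame({
    "LotFrontage": [65.0, np.nan, 68.0, np.nan],
    "Alley": [np.nan, np.nan, np.nan, "Pave"],
    "LotArea": [100, 200, 300, 400],
})

# Count nulls per column, keep only columns with at least one, largest first.
missing = toy.isnull().sum()
missing = missing[missing > 0].sort_values(ascending=False)
print(missing)
```

Applied to train, the same three lines would list columns such as Alley or PoolQC near the top, in line with the non-null counts printed by info() earlier.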

train.groupby('SaleType').count()

To print the dataset columns, we can use the columns attribute:

train.columns

type((train.columns))

pandas.core.indexes.base.Index

<< Note 2 >> in pandas’s data frame you can perform some query such as "where"

train[train['SalePrice']>700000]

6-1-2 Target Value Analysis

As you know, SalePrice is the target value that we should predict, so now we take a look at it:

train['SalePrice'].describe()

Next, we flexibly plot a univariate distribution of the observations:

sns.set(rc={'figure.figsize':(9,7)})
sns.distplot(train['SalePrice']);

6-1-3 Skewness vs Kurtosis

Skewness

It is the degree of distortion from the symmetrical bell curve or the normal distribution. It measures the lack of symmetry in data distribution. It differentiates extreme values in one versus the other tail. A symmetrical distribution will have a skewness of 0.

Kurtosis

Kurtosis is all about the tails of the distribution, not the peakedness or flatness. It describes how heavy the tails are, that is, how much of the spread comes from extreme values in either tail. It is effectively a measure of the outliers present in the distribution.

#skewness and kurtosis
print("Skewness: %f" % train['SalePrice'].skew())
print("Kurtosis: %f" % train['SalePrice'].kurt())
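
To show what these two numbers measure, here is a sketch that computes them from central moments with NumPy. The helper `moment_skew_kurt` and the sample lists are assumptions made for illustration. Note that pandas applies a sample-size bias correction in `.skew()` and `.kurt()`, so its values differ slightly from these plain moment ratios.

```python
import numpy as np

def moment_skew_kurt(values):
    """Skewness and excess kurtosis from central moments (no bias correction)."""
    x = np.asarray(values, dtype=float)
    centered = x - x.mean()
    m2 = np.mean(centered ** 2)   # variance
    m3 = np.mean(centered ** 3)   # asymmetry
    m4 = np.mean(centered ** 4)   # tail weight
    skewness = m3 / m2 ** 1.5
    excess_kurtosis = m4 / m2 ** 2 - 3.0   # 0 for a normal distribution
    return skewness, excess_kurtosis

symmetric = [1, 2, 3, 4, 5]
right_tailed = [1, 1, 1, 2, 10]

s_sym, k_sym = moment_skew_kurt(symmetric)
s_right, k_right = moment_skew_kurt(right_tailed)
print(s_sym, k_sym)
print(s_right, k_right)
```

The symmetric sample has zero skewness, while the sample with one large value has positive skewness, which is the same pattern SalePrice shows with its long right tail.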

6-2 Visualization

Data visualization is the presentation of data in a pictorial or graphical format. It enables decision makers to see analytics presented visually, so they can grasp difficult concepts or identify new patterns.

With interactive visualization, you can take the concept a step further by using technology to drill down into charts and graphs for more detail, interactively changing what data you see and how it’s processed.[SAS]

In this section I show you 11 plots made with matplotlib and seaborn, which are listed in the picture below:

[image]

6-2-1 Scatter plot

The purpose of a scatter plot is to identify the type of relationship (if any) between two quantitative variables.

# Scatter plot of SalePrice against OverallQual, with one color per quality level.
columns = ['SalePrice','OverallQual','TotalBsmtSF','GrLivArea','GarageArea','FullBath','YearBuilt','YearRemodAdd']
# seaborn 0.9+ names this argument `height`; older releases called it `size`.
g = sns.FacetGrid(train[columns], hue="OverallQual", height=5)
g = g.map(plt.scatter, "OverallQual", "SalePrice", edgecolor="w").add_legend()
plt.show()

6-2-2 Box

In descriptive statistics, a box plot or boxplot is a method for graphically depicting groups of numerical data through their quartiles. Box plots may also have lines extending vertically from the boxes (whiskers) indicating variability outside the upper and lower quartiles, hence the terms box-and-whisker plot and box-and-whisker diagram.[wikipedia]

data = pd.concat([train['SalePrice'], train['OverallQual']], axis=1)
f, ax = plt.subplots(figsize=(12, 8))
fig = sns.boxplot(x='OverallQual', y="SalePrice", data=data)
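
As a rough sketch of what the box and whiskers encode, the quartiles and the usual 1.5 × IQR fences can be computed by hand. The `prices` array below is made up for illustration and is not taken from the Ames data.

```python
import numpy as np

# Made-up prices (in thousands) with one obviously extreme value.
prices = np.array([100, 120, 130, 150, 160, 180, 200, 900], dtype=float)

# The box spans Q1..Q3 and the line inside it is the median.
q1, median, q3 = np.percentile(prices, [25, 50, 75])
iqr = q3 - q1

# Points beyond 1.5 * IQR from the box are drawn individually as outliers.
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = prices[(prices < lower_fence) | (prices > upper_fence)]
print(q1, median, q3, outliers)
```

The whiskers of the plot extend to the most extreme data points that still lie inside the two fences.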

ax= sns.boxplot(x="OverallQual", y="SalePrice", data=train[columns])
ax= sns.stripplot(x="OverallQual", y="SalePrice", data=train[columns], jitter=True, edgecolor="gray")
plt.show()

6-2-3 Histogram

We can also create a histogram of each input variable to get an idea of the distribution.

# histograms
train.hist(figsize=(15,20))
plt.show()

mini_train = train[columns]
f, ax = plt.subplots(1, 2, figsize=(20, 10))
# GarageArea distribution for houses that sold for more than 100000
mini_train[mini_train['SalePrice'] > 100000].GarageArea.plot.hist(ax=ax[0], bins=20, edgecolor='black', color='red')
ax[0].set_title('SalePrice>100000')
# GarageArea distribution for houses that sold for less than 100000
mini_train[mini_train['SalePrice'] < 100000].GarageArea.plot.hist(ax=ax[1], bins=20, edgecolor='black', color='green')
ax[1].set_title('SalePrice<100000')
# The fixed 0..80 x-ticks were dropped: they do not match the square-foot scale of GarageArea.
plt.show()

mini_train[['SalePrice','OverallQual']].groupby(['OverallQual']).mean().plot.bar()

train['OverallQual'].value_counts().plot(kind="bar");

It looks like perhaps two of the input variables have a Gaussian distribution. This is useful to note as we can use algorithms that can exploit this assumption.

6-2-4 Multivariate Plots

Now we can look at the interactions between the variables.

First, let’s look at scatterplots of all pairs of attributes. This can be helpful to spot structured relationships between input variables.

# scatter plot matrix
pd.plotting.scatter_matrix(train[columns],figsize=(10,10))
plt.show()

Note the diagonal grouping of some pairs of attributes. This suggests a high correlation and a predictable relationship.
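
The visual impression of a diagonal grouping can be checked numerically with a correlation matrix. The sketch below uses synthetic data; the frame `toy`, the random seed, and the linear relationship between area and price are all assumptions chosen only to illustrate the idea.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
area = rng.uniform(500, 3000, size=200)

# Synthetic data: price depends on area plus noise, month of sale is unrelated.
toy = pd.DataFrame({
    "GrLivArea": area,
    "SalePrice": 100 * area + rng.normal(0, 20000, size=200),
    "MoSold": rng.integers(1, 13, size=200),
})

# Pearson correlation of every column with the target, strongest first.
corr = toy.corr()
print(corr["SalePrice"].sort_values(ascending=False))
```

A pair that lines up along a diagonal in the scatter matrix shows a correlation close to 1 or -1, while an unstructured cloud stays near 0.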

6-2-5 violinplots

# violin plot of SalePrice for each Functional rating
sns.violinplot(data=train,x="Functional", y="SalePrice")

6-2-6 pairplot

# Using seaborn pairplot to see the bivariate relation between each pair of features
sns.set()
columns = ['SalePrice','OverallQual','TotalBsmtSF','GrLivArea','GarageArea','FullBath','YearBuilt','YearRemodAdd']
# seaborn 0.9+ names this argument `height`; older releases called it `size`.
sns.pairplot(train[columns], height=2, kind='scatter')
plt.show()

6-2-7 kdeplot

# seaborn's kdeplot plots univariate or bivariate density estimates.
# The figure size can be changed by tweaking the height value (called `size` before seaborn 0.9).
columns = ['SalePrice','OverallQual','TotalBsmtSF','GrLivArea','GarageArea','FullBath','YearBuilt','YearRemodAdd']
sns.FacetGrid(train[columns], hue="OverallQual", height=5).map(sns.kdeplot, "YearBuilt").add_legend()
plt.show()

6-2-8 jointplot

# Use seaborn's jointplot to make a hexagonal bin plot.
# Set the desired height and ratio and choose a color (`height` was called `size` before seaborn 0.9).
columns = ['SalePrice','OverallQual','TotalBsmtSF','GrLivArea','GarageArea','FullBath','YearBuilt','YearRemodAdd']
sns.jointplot(x="OverallQual", y="SalePrice", data=train[columns], height=10, ratio=10, kind='hex', color='green')
plt.show()

# seaborn's jointplot can also show a bivariate kernel density estimate
# together with the univariate densities in the same figure.
sns.jointplot(x="SalePrice", y="YearBuilt", data=train[columns], height=6, kind='kde', color='#800000', space=0)
plt.show()

6-2-9 Heatmap

plt.figure(figsize=(7,4))
columns = ['SalePrice','OverallQual','TotalBsmtSF','GrLivArea','GarageArea','FullBath','YearBuilt','YearRemodAdd']
# draw a heatmap of the correlation matrix computed by train[columns].corr()
sns.heatmap(train[columns].corr(), annot=True, cmap='cubehelix_r')
plt.show()

6-2-10 radviz

# radviz lives in pandas.plotting (the older pandas.tools.plotting module was removed)
from pandas.plotting import radviz
columns = ['SalePrice','OverallQual','TotalBsmtSF','GrLivArea','GarageArea','FullBath','YearBuilt','YearRemodAdd']
radviz(train[columns], "OverallQual")

6-2-11 Factorplot

# seaborn 0.9 renamed factorplot to catplot; kind='point' reproduces the old default.
sns.catplot(x='OverallQual', y='SalePrice', hue='Functional', kind='point', data=train)
plt.show()

6-3 Data Preprocessing

Data preprocessing refers to the transformations applied to our data before feeding it to the algorithm.

Data preprocessing is a technique used to convert raw data into a clean data set. In other words, whenever data is gathered from different sources, it is collected in a raw format that is not suitable for analysis. There are plenty of steps in data preprocessing; we list just some of them:

  • Removing the identifier column (Id)
  • Sampling (without replacement)
  • Making part of the data unbalanced and rebalancing it (with undersampling and SMOTE)
  • Introducing missing values and treating them (replacing them with average values)
  • Noise filtering
  • Data discretization
  • Normalization and standardization
  • PCA analysis
  • Feature selection (filter, embedded, wrapper)

6-3-1 Noise filtering (Outliers)

An outlier is a data point that is distant from other similar points. Put more simply, an outlier is an abnormal observation among the normal observations in a sample of a population.

# Looking for outliers, as indicated in https://ww2.amstat.org/publications/jse/v19n3/decock.pdf
plt.scatter(train.GrLivArea, train.SalePrice, c = "blue", marker = "s")
plt.title("Looking for outliers")
plt.xlabel("GrLivArea")
plt.ylabel("SalePrice")
plt.show()

train = train[train.GrLivArea < 4000]

There are 2 extreme outliers on the bottom right of the plot (very large GrLivArea with a low SalePrice).

#deleting points
#Show the two largest houses; if the GrLivArea < 4000 filter above already removed Ids 1299 and 524, the drops below are no-ops.
train.sort_values(by = 'GrLivArea', ascending = False)[:2]
train = train.drop(train[train['Id'] == 1299].index)
train = train.drop(train[train['Id'] == 524].index)

#log transform skewed numeric features:
#The key point is to log-transform the numeric variables, since most of them are skewed.
numeric_feats = all_data.dtypes[all_data.dtypes != "object"].index
skewed_feats = train[numeric_feats].apply(lambda x: skew(x.dropna())) #compute skewness
skewed_feats = skewed_feats[skewed_feats > 0.75]
skewed_feats = skewed_feats.index

all_data[skewed_feats] = np.log1p(all_data[skewed_feats])

#Log transform the target, because the official score is computed on log prices.
train.SalePrice = np.log1p(train.SalePrice)
y = train.SalePrice

Taking logs means that errors in predicting expensive houses and cheap houses will affect the result equally.
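
Because the target is transformed with log1p, a model trained on it predicts log prices, and those predictions have to be mapped back to dollars with the inverse function expm1. The sketch below shows the round trip on made-up prices; the array names are illustrative.

```python
import numpy as np

# Made-up sale prices in dollars.
prices = np.array([100000.0, 200000.0, 755000.0])

log_prices = np.log1p(prices)      # the scale the model is trained on
recovered = np.expm1(log_prices)   # back to dollars, e.g. for a submission file

print(log_prices)
print(recovered)
```

log1p(x) is log(1 + x), so it is also defined for features that contain zeros, which is why it is used for the skewed feature columns as well.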

plt.scatter(train.GrLivArea, train.SalePrice, c = "blue", marker = "s")
plt.title("Looking for outliers")
plt.xlabel("GrLivArea")
plt.ylabel("SalePrice")
plt.show()

6-4 Data Cleaning

When dealing with real-world data, dirty data is the norm rather than the exception. We continuously need to predict correct values, impute missing ones, and find links between various data artefacts such as schemas and records. We need to stop treating data cleaning as a piecemeal exercise (resolving different types of errors in isolation), and instead leverage all signals and resources (such as constraints, available statistics, and dictionaries) to accurately predict corrective actions.

6-4-1 Handle missing values

Firstly, understand that there is NO good way to deal with missing data.

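#one-hot encode the categorical columns so every feature is numeric
#(assumption: this step is not shown in the original, but the models below need numeric input)
all_data = pd.get_dummies(all_data)
#filling NA's with the mean of the column:
all_data = all_data.fillna(all_data.mean())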
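
A minimal sketch of mean imputation on a toy frame is shown below. The frame `toy` and its values are invented for illustration. It restricts the mean to numeric columns, so text columns are left untouched.

```python
import numpy as np
import pandas as pd

# Toy frame with gaps in numeric and text columns; the values are made up.
toy = pd.DataFrame({
    "LotFrontage": [60.0, np.nan, 80.0],
    "GarageArea": [400.0, 500.0, np.nan],
    "Alley": ["Pave", None, "Grvl"],
})

# Column means of the numeric columns only.
numeric_means = toy.mean(numeric_only=True)

# fillna with a Series fills each column using the entry with the same label.
filled = toy.fillna(numeric_means)
print(filled)
```

Mean imputation is simple but shrinks the variance of the filled columns, which is one reason there is no universally good way to handle missing data.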

7- Model Deployment

In this section, we apply several learning algorithms that play an important role in practice and will improve your knowledge of ML techniques.

<< Note 3 >> : The results shown here may be slightly different for your analysis because, for example, the neural network algorithms use random number generators for fixing the initial value of the weights (starting points) of the neural networks, which often result in obtaining slightly different (local minima) solutions each time you run the analysis. Also note that changing the seed for the random number generator used to create the train, test, and validation samples can change your results.

7-1 Families of ML algorithms

There are several categories for machine learning algorithms, below are some of these categories:

  • Linear
    • Linear Regression
    • Logistic Regression
    • Support Vector Machines
  • Tree-Based
    • Decision Tree
    • Random Forest
    • GBDT
  • KNN
  • Neural Networks

And if we want to categorize ML algorithms by the type of learning, there are the following types:

  • Classification

    • k-Nearest Neighbors
    • Logistic Regression
    • SVM
    • DT
    • NN
  • clustering

    • K-means
    • HCA
    • Expectation Maximization
  • Visualization and dimensionality reduction:

    • Principal Component Analysis(PCA)
    • Kernel PCA
    • Locally -Linear Embedding (LLE)
    • t-distributed Stochastic Neighbor Embedding (t-SNE)
  • Association rule learning

    • Apriori
    • Eclat
  • Semisupervised learning

  • Reinforcement Learning

    • Q-learning
  • Batch learning & Online learning

  • Ensemble Learning
    << Note >>

There is no method that outperforms all others on all tasks.

7-2 Accuracy and precision

One of the most important questions to ask as a machine learning engineer when evaluating a model is how to judge it. Each machine learning model is trying to solve a problem with a different objective using a different dataset; hence, it is important to understand the context before choosing a metric.

7-2-1 RMSE

The metric here is the Root-Mean-Squared-Error (RMSE) between the logarithm of the predicted value and the logarithm of the observed sales price. (Taking logs means that errors in predicting expensive houses and cheap houses will affect the result equally.)

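#creating matrices for sklearn:
#all_data still holds every original train row followed by the test rows, while
#outliers were dropped from train, so select the surviving train rows by index label.
n_train = len(train_id)
X_train = all_data.iloc[:n_train].loc[train.index]
X_test = all_data.iloc[n_train:]
y = train.SalePrice
X_train.info()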

7-3 Ridge

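import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

def rmse_cv(model):
    # 5-fold cross-validated RMSE; sklearn returns negated MSE, hence the minus sign
    rmse = np.sqrt(-cross_val_score(model, X_train, y, scoring="neg_mean_squared_error", cv=5))
    return rmse

model_ridge = Ridge()
alphas = [0.05, 0.1, 0.3, 1, 3, 5, 10, 15, 30, 50, 75]
cv_ridge = [rmse_cv(Ridge(alpha=alpha)).mean() for alpha in alphas]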

7-3-1 Root Mean Squared Error

cv_ridge = pd.Series(cv_ridge, index = alphas)
cv_ridge.plot(title = "Validation")
plt.xlabel("alpha")
plt.ylabel("rmse")
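
Besides reading the minimum off the plot, the best alpha can be taken directly from the series. The sketch below shows this on synthetic data. `X_demo`, `y_demo` and `rmse_cv_demo` are stand-ins invented for the example so that it runs without the house price files.

```python
import numpy as np
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import cross_val_score

# Synthetic stand-in for X_train / y (assumption: not the house price data).
X_demo, y_demo = make_regression(n_samples=200, n_features=20, noise=10.0, random_state=0)

def rmse_cv_demo(model):
    # Same idea as rmse_cv above, but on the synthetic data.
    scores = cross_val_score(model, X_demo, y_demo,
                             scoring="neg_mean_squared_error", cv=5)
    return np.sqrt(-scores)

alphas = [0.05, 0.1, 0.3, 1, 3, 5, 10, 15, 30, 50, 75]
cv_demo = pd.Series([rmse_cv_demo(Ridge(alpha=a)).mean() for a in alphas], index=alphas)

# idxmin returns the index label (here: the alpha) with the lowest mean CV RMSE.
best_alpha = cv_demo.idxmin()
print("best alpha:", best_alpha, "cv rmse:", cv_demo.min())
```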

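from sklearn.metrics import mean_squared_error
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Pipeline steps: standardize the features, then fit a ridge regression
steps = [('scaler', StandardScaler()),
         ('ridge', Ridge())]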

# Create the pipeline: pipeline
pipeline = Pipeline(steps)

# Specify the hyperparameter space
parameters = {'ridge__alpha':np.logspace(-4, 0, 50)}

# Create the GridSearchCV object: cv
cv = GridSearchCV(pipeline, parameters, cv=3)

# Fit to the training set
cv.fit(X_train, y)

#predict on train set
y_pred_train=cv.predict(X_train)

# Predict test set
y_pred_test=cv.predict(X_test)

# rmse on train set
rmse = np.sqrt(mean_squared_error(y, y_pred_train))
print("Root Mean Squared Error: {}".format(rmse))
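
The RMSE printed above is measured on the same rows the model was fitted on, so it tends to look better than the error on unseen data. A hedged alternative is to read the cross-validated score from the search object itself. The sketch below uses synthetic data (`X_demo`, `y_demo` are assumptions) and a smaller grid to keep it quick. It also passes `scoring="neg_root_mean_squared_error"`, which is a choice made for this example, since the search otherwise ranks regressors by R^2.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import Ridge
from sklearn.model_selection import GridSearchCV
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the training matrix (assumption).
X_demo, y_demo = make_regression(n_samples=150, n_features=10, noise=5.0, random_state=0)

pipeline = Pipeline([("scaler", StandardScaler()), ("ridge", Ridge())])
parameters = {"ridge__alpha": np.logspace(-4, 0, 10)}

# Rank candidates by (negated) RMSE instead of the default R^2.
search = GridSearchCV(pipeline, parameters, cv=3, scoring="neg_root_mean_squared_error")
search.fit(X_demo, y_demo)

best_alpha = search.best_params_["ridge__alpha"]
cv_rmse = -search.best_score_  # flip the sign back to a positive RMSE
print("best alpha:", best_alpha)
print("cross-validated RMSE:", cv_rmse)
```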

7-4 RandomForestRegressor

A random forest is a meta estimator that fits a number of decision trees on various sub-samples of the dataset and uses averaging to improve the predictive accuracy and control over-fitting. For a regression task such as this one, the trees are regression trees. By default, each sub-sample has the same size as the original input sample, and the samples are drawn with replacement if bootstrap=True (the default).

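from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Hold out 30% of the labelled training data for validation.
# Note: this reuses the names X_train and X_test, so X_test no longer
# refers to the competition test set created above.
num_test = 0.3
X_train, X_test, y_train, y_test = train_test_split(X_train, y, test_size=num_test, random_state=100)

# Fit Random Forest on the training split
regressor = RandomForestRegressor(n_estimators=300, random_state=0)
regressor.fit(X_train, y_train)

# Score model: R^2 on the training split (an optimistic estimate)
regressor.score(X_train, y_train)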
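
Scoring on the rows used for fitting usually flatters a random forest. The sketch below compares the training score with the score on a held-out split, using synthetic data so that it runs on its own. `X_demo`, `y_demo` and the split names are assumptions made for the example.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the house price features and target (assumption).
X_demo, y_demo = make_regression(n_samples=300, n_features=10, noise=20.0, random_state=0)
X_tr, X_val, y_tr, y_val = train_test_split(X_demo, y_demo, test_size=0.3, random_state=100)

forest = RandomForestRegressor(n_estimators=100, random_state=0)
forest.fit(X_tr, y_tr)

# score() returns R^2 for regressors.
train_r2 = forest.score(X_tr, y_tr)
val_r2 = forest.score(X_val, y_val)
print("train R^2:", train_r2)
print("hold-out R^2:", val_r2)
```

In the kernel itself, the equivalent check would be `regressor.score(X_test, y_test)` after the split above.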

7-5 XGBoost

XGBoost is one of the most popular machine learning algorithms these days, regardless of the type of prediction task at hand: regression or classification.

7-5-1 But what makes XGBoost so popular?

  • Speed and performance: The core is written in C++, and it is comparatively faster than many other ensemble methods.

  • Parallelizable core algorithm: Because the core XGBoost algorithm is parallelizable, it can harness the power of multi-core computers. It can also run on GPUs and across networks of computers, which makes it feasible to train on very large datasets.

  • Strong benchmark results: It has shown strong performance on a variety of machine learning benchmark datasets.

  • Wide variety of tuning parameters: XGBoost has parameters for cross-validation, regularization, user-defined objective functions, missing values, and tree construction, and it offers a scikit-learn compatible API.[10]

XGBoost (Extreme Gradient Boosting) belongs to the family of boosting algorithms and uses the gradient boosting (GBM) framework at its core. It is an optimized distributed gradient boosting library. But what is boosting? Section 7-7 below briefly describes gradient boosting.

# Initialize model
from xgboost.sklearn import XGBRegressor
XGB_Regressor = XGBRegressor()                  

# Fit the model on our data
XGB_Regressor.fit(X_train, y_train)

# Score model
XGB_Regressor.score(X_train, y_train)

7-6 LassoCV

Lasso linear model with iterative fitting along a regularization path. The best model is selected by cross-validation.

from sklearn.linear_model import LassoCV

lasso = LassoCV()
# Fit the model on our data
lasso.fit(X_train, y_train)

# Score model
lasso.score(X_train, y_train)
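
After fitting, LassoCV exposes the regularization strength it picked and the resulting coefficients, which is often more informative than the score alone. The sketch below shows both on synthetic data. `X_demo`, `y_demo` and `lasso_demo` are names invented for the example.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.linear_model import LassoCV

# Synthetic data: 30 features, of which only 5 carry signal (assumption).
X_demo, y_demo = make_regression(n_samples=200, n_features=30, n_informative=5,
                                 noise=5.0, random_state=0)

lasso_demo = LassoCV(cv=5, random_state=0)
lasso_demo.fit(X_demo, y_demo)

# alpha_ is the penalty chosen by cross-validation along the regularization path.
n_kept = int(np.sum(lasso_demo.coef_ != 0))
print("selected alpha:", lasso_demo.alpha_)
print("non-zero coefficients:", n_kept, "of", X_demo.shape[1])
```

Because the L1 penalty can set coefficients exactly to zero, the count of non-zero coefficients shows how many features the model actually uses.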

7-7 GradientBoostingRegressor

GB builds an additive model in a forward stage-wise fashion; it allows for the optimization of arbitrary differentiable loss functions. In each stage a regression tree is fit on the negative gradient of the given loss function.

from sklearn.ensemble import GradientBoostingRegressor

boostingregressor = GradientBoostingRegressor()
# Fit the model on our data
boostingregressor.fit(X_train, y_train)

# Score model
boostingregressor.score(X_train, y_train)
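
To make the stage-wise description above more concrete, here is a minimal hand-rolled sketch for the squared-error loss only. It is not how GradientBoostingRegressor is implemented internally. The data, the learning rate and the number of stages are arbitrary choices made for the illustration.

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Toy 1-D regression problem (assumption: unrelated to the house price data).
rng = np.random.RandomState(0)
X_demo = rng.uniform(-3, 3, size=(200, 1))
y_demo = np.sin(X_demo[:, 0]) + rng.normal(scale=0.1, size=200)

learning_rate = 0.1
n_stages = 50

# Stage 0: a constant model; the mean minimises the squared error.
prediction = np.full_like(y_demo, y_demo.mean())
mse_start = np.mean((y_demo - prediction) ** 2)

trees = []
for _ in range(n_stages):
    # For the loss 0.5 * (y - f)^2 the negative gradient is the residual y - f.
    residual = y_demo - prediction
    tree = DecisionTreeRegressor(max_depth=2, random_state=0)
    tree.fit(X_demo, residual)
    # Add a damped copy of the new tree to the additive model.
    prediction += learning_rate * tree.predict(X_demo)
    trees.append(tree)

mse_end = np.mean((y_demo - prediction) ** 2)
print("training MSE before boosting:", mse_start)
print("training MSE after boosting:", mse_end)
```

Each stage fits a small regression tree to what the current model still gets wrong, which is the "negative gradient" step in the description.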

7-8 DecisionTree

from sklearn.tree import DecisionTreeRegressor

# Define model. Specify a number for random_state to ensure same results each run
dt = DecisionTreeRegressor(random_state=1)
# Fit model
dt.fit(X_train, y_train)

dt.score(X_train, y_train)

7-9 ExtraTreeRegressor

from sklearn.tree import ExtraTreeRegressor

dtr = ExtraTreeRegressor()
# Fit model
dtr.fit(X_train, y_train)

# Score model
dtr.score(X_train, y_train)
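
The models in this section are each scored on their own training data, which makes them hard to compare. One possible way to put them side by side is a shared cross-validation loop, sketched below on synthetic data. `X_demo`, `y_demo` and the settings are assumptions, and XGBoost is left out so that the example only needs scikit-learn.

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor, RandomForestRegressor
from sklearn.linear_model import LassoCV, Ridge
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeRegressor, ExtraTreeRegressor

# Synthetic stand-in for the house price features and target (assumption).
X_demo, y_demo = make_regression(n_samples=200, n_features=15, noise=10.0, random_state=0)

models = {
    "Ridge": Ridge(),
    "LassoCV": LassoCV(cv=3, random_state=0),
    "RandomForest": RandomForestRegressor(n_estimators=50, random_state=0),
    "GradientBoosting": GradientBoostingRegressor(random_state=0),
    "DecisionTree": DecisionTreeRegressor(random_state=1),
    "ExtraTree": ExtraTreeRegressor(random_state=1),
}

cv_rmse = {}
for name, model in models.items():
    # 5-fold CV; sklearn reports negated MSE, so negate before the square root.
    scores = cross_val_score(model, X_demo, y_demo,
                             scoring="neg_mean_squared_error", cv=5)
    cv_rmse[name] = float(np.sqrt(-scores).mean())

# Print the models from lowest to highest mean CV RMSE.
for name, score in sorted(cv_rmse.items(), key=lambda item: item[1]):
    print(f"{name:>16}: {score:.2f}")
```

The ranking on synthetic data says nothing about the house price data; the point is only that every model is judged on rows it did not see during fitting.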

8- Conclusion

This kernel is not complete yet. I will try to cover all the parts of the ML process with a variety of Python packages. I know that there are still some problems, and I hope to get your feedback to improve it.

9- References

[1] https://skymind.ai/wiki/machine-learning-workflow
[2] Problem-define
[3] Sklearn
[4] Machine-learning-in-python-step-by-step
[5] Data Cleaning
[6] Kaggle kernel
[7] Choosing-the-right-metric-for-machine-learning-models-part
[8] Unboxing outliers in machine learning
[9] How to handle missing data
[10] Datacamp
