Hands-On Machine Learning with Scikit-Learn, Keras, and TensorFlow (2)

Pipeline

A sequence of data processing components is called a data pipeline.

Root Mean Square Error (RMSE)

RMSE(X,h) = \sqrt{\frac{1}{m}\sum_{i=1}^{m}\left(h(x^{(i)}) - y^{(i)}\right)^2}

  • x^{(i)} is a vector of all the feature values (excluding the label) of the i^{th} instance in the dataset, and y^{(i)} is its label (the desired output value).
  • X is a matrix containing all the feature values (excluding labels) of every instance in the dataset; the i^{th} row is equal to the transpose of x^{(i)}, noted (x^{(i)})^T.
  • h is the system's prediction function, also called a hypothesis.
  • RMSE(X,h) is the cost function measured on the set of examples using the hypothesis h.

Mean Absolute Error (MAE)

MAE(X,h) = \frac{1}{m}\sum_{i=1}^{m}\left|h(x^{(i)}) - y^{(i)}\right|

Both the RMSE and the MAE are ways to measure the distance between two vectors.

  • RMSE corresponds to the ℓ2 norm, also called the Euclidean norm.
  • MAE corresponds to the ℓ1 norm, also called the Manhattan norm.
  • The higher the norm index, the more it focuses on large values and neglects small ones. This is why the RMSE is more sensitive to outliers than the MAE (see the NumPy sketch after this list).
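As a quick sanity check on both formulas, here is a minimal NumPy sketch (the prediction and label values are made up for the example):

import numpy as np

predictions = np.array([210000., 320000., 150000., 500000.])   # hypothetical h(x) values
labels      = np.array([200000., 300000., 180000., 310000.])   # hypothetical y values
errors = predictions - labels

rmse = np.sqrt(np.mean(errors ** 2))   # ℓ2-style measure: the single large error dominates
mae  = np.mean(np.abs(errors))         # ℓ1-style measure: less sensitive to that outlier
print(rmse, mae)                       # RMSE comes out much larger than MAE because of the outlier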

Creating an isolated environment

# install virtualenv
$ python3 -m pip install --user -U virtualenv
# create an isolated environment
$ python3 -m virtualenv my_env
# activate this environment
$ source my_env/bin/activate # on Linux or macOS
$ .\my_env\Scripts\activate  # on Windows
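# install Jupyter and ipykernel inside the environment (assumed not installed yet)
$ python3 -m pip install -U jupyter ipykernel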

# register virtualenv to Jupyter and give it a name
$ python3 -m ipykernel install --user --name=python3

Download the Data

# fetch the data
import os
import tarfile
import urllib.request

DOWNLOAD_ROOT = "https://raw.githubusercontent.com/ageron/handson-ml2/master/"
HOUSING_PATH = os.path.join("datasets", "housing")
HOUSING_URL = DOWNLOAD_ROOT + "datasets/housing/housing.tgz"

def fetch_housing_data(housing_url=HOUSING_URL, housing_path=HOUSING_PATH):
    os.makedirs(housing_path, exist_ok=True)
    tgz_path = os.path.join(housing_path, "housing.tgz")
    urllib.request.urlretrieve(housing_url, tgz_path)
    housing_tgz = tarfile.open(tgz_path)
    housing_tgz.extractall(path=housing_path)
    housing_tgz.close()
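# call it to download housing.tgz and extract housing.csv into datasets/housing
fetch_housing_data()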

# load the data using pandas
import pandas as pd

def load_housing_data(housing_path=HOUSING_PATH):
    csv_path = os.path.join(housing_path, "housing.csv")
    return pd.read_csv(csv_path)

Take a Quick Look at the Data Structure

housing = load_housing_data()
# show the total number of rows, each attribute’s type, and the number of nonnull values
housing.info()
# how many districts belong to each category 
housing["ocean_proximity"].value_counts()
# show a summary of the numerical attributes
housing.describe()

# plot a histogram for each numerical attribute
# (%matplotlib inline is only needed in a Jupyter notebook)
%matplotlib inline
import matplotlib.pyplot as plt
housing.hist(bins=50, figsize=(20,15))
plt.show()
  • The 25%, 50%, and 75% rows show the corresponding percentiles: a percentile indicates the value below which a given percentage of observations in a group of observations fall (see the small example after this list).
  • Tail-heavy: many of the histograms extend much farther to the right of the median than to the left.
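For instance, the percentiles reported by describe() can be computed directly with pandas' quantile() (shown here for housing_median_age; any numerical attribute works the same way):

# same numbers as the 25%, 50%, and 75% rows of describe() for this attribute
housing["housing_median_age"].quantile([0.25, 0.50, 0.75])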

Create a Test Set

import numpy as np

def split_train_test(data, test_ratio):
    shuffled_indices = np.random.permutation(len(data))
    test_set_size = int(len(data) * test_ratio)
    test_indices = shuffled_indices[:test_set_size]
    train_indices = shuffled_indices[test_set_size:]
    return data.iloc[train_indices], data.iloc[test_indices]

# set the random number generator’s seed so that it always generates the same shuffled indices
np.random.seed(42)

# To have a stable train/test split after updating the dataset, 
# a solution is to use each instance’s identifier.
from zlib import crc32

def test_set_check(identifier, test_ratio):
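    # hash the 8-byte identifier with crc32; the instance goes into the test set when the
    # hash is below test_ratio * 2**32, i.e. for roughly test_ratio of all possible ids
    # (the & 0xffffffff only matters on Python 2, where crc32 could return a signed value)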
    return crc32(np.int64(identifier)) & 0xffffffff < test_ratio * 2**32

def split_train_test_by_id(data, test_ratio, id_column):
    ids = data[id_column]
    in_test_set = ids.apply(lambda id_: test_set_check(id_, test_ratio))
    return data.loc[~in_test_set], data.loc[in_test_set]

# Use the row index as the identifier column
housing_with_id = housing.reset_index()   # adds an `index` column
train_set, test_set = split_train_test_by_id(housing_with_id, 0.2, "index")

# Combine a district’s latitude and longitude into an ID
housing_with_id["id"] = housing["longitude"] * 1000 + housing["latitude"]
train_set, test_set = split_train_test_by_id(housing_with_id, 0.2, "id")

# Scikit-Learn provides a similar function: train_test_split
from sklearn.model_selection import train_test_split
# the random_state parameter lets you set the random generator seed
train_set, test_set = train_test_split(housing, test_size=0.2, random_state=42)

# create an income category attribute with five categories (labeled from 1 to 5)
housing["income_cat"] = pd.cut(housing["median_income"],
                               bins=[0., 1.5, 3.0, 4.5, 6., np.inf],
                               labels=[1, 2, 3, 4, 5])
housing["income_cat"].hist()

# use Scikit-Learn’s StratifiedShuffleSplit class to do stratified sampling
from sklearn.model_selection import StratifiedShuffleSplit

split = StratifiedShuffleSplit(n_splits=1, test_size=0.2, random_state=42)
for train_index, test_index in split.split(housing, housing["income_cat"]):
    strat_train_set = housing.loc[train_index]
    strat_test_set = housing.loc[test_index]

# looking at the income category proportions
strat_test_set["income_cat"].value_counts() / len(strat_test_set)

# remove the income_cat attribute so the data is back to its original state
for set_ in (strat_train_set, strat_test_set):
    set_.drop("income_cat", axis=1, inplace=True)
  • Stratified sampling: the population is divided into homogeneous subgroups called strata, and the right number of instances are sampled from each stratum to guarantee that the test set is representative of the overall population.
  • The test set generated using stratified sampling has income category proportions almost identical to those in the full dataset, whereas the test set generated using random sampling is skewed.

Discover and Visualize the Data to Gain Insights

# create a copy so we can play with it without harming the training set
housing = strat_train_set.copy()

# create a scatterplot of all districts
# Setting the alpha option to 0.1 makes it easier to visualize the places where there is a high density of data points
housing.plot(kind="scatter", x="longitude", y="latitude", alpha=0.1)

# The radius of each circle represents the district’s population (option s),
# and the color represents the price (option c);
# use a predefined color map (option cmap) called jet, which ranges from blue (low prices) to red (high prices)
housing.plot(kind="scatter", x="longitude", y="latitude", alpha=0.4,
    s=housing["population"]/100, label="population", figsize=(10,7),
    c="median_house_value", cmap=plt.get_cmap("jet"), colorbar=True,
)
plt.legend()

# compute the standard correlation coefficient (also called Pearson’s r) between every pair of attributes
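# note: with pandas ≥ 2.0 you may need housing.corr(numeric_only=True), since
# ocean_proximity is a text attribute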
corr_matrix = housing.corr()
corr_matrix["median_house_value"].sort_values(ascending=False)

# plot a few promising numerical attributes against each other
# (a full scatter matrix of every numerical attribute would be far too many subplots to fit on a page)
from pandas.plotting import scatter_matrix

attributes = ["median_house_value", "median_income", "total_rooms",
              "housing_median_age"]
scatter_matrix(housing[attributes], figsize=(12, 8))

housing["rooms_per_household"] = housing["total_rooms"]/housing["households"]
housing["bedrooms_per_room"] = housing["total_bedrooms"]/housing["total_rooms"]
housing["population_per_household"]=housing["population"]/housing["households"]
corr_matrix = housing.corr()
corr_matrix["median_house_value"].sort_values(ascending=False)
  • The correlation coefficient ranges from –1 to 1. When it is close to 1, there is a strong positive correlation; when it is close to –1, there is a strong negative correlation; coefficients close to 0 mean that there is no linear correlation.
  • The correlation coefficient only measures linear correlations (see the sketch after this list).
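As a quick illustration of that last point, here is a tiny self-contained sketch (synthetic data, not the housing set): y depends perfectly on x, yet Pearson’s r comes out near 0 because the dependence is not linear.

import numpy as np
import pandas as pd

x = np.linspace(-1, 1, 201)
demo = pd.DataFrame({"x": x, "y": x ** 2})   # a perfect, but nonlinear, relationship
demo.corr().loc["x", "y"]                    # ≈ 0: Pearson’s r does not detect it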

Prepare the Data for Machine Learning Algorithms

# drop() creates a copy of the data and does not affect strat_train_set
housing = strat_train_set.drop("median_house_value", axis=1)
housing_labels = strat_train_set["median_house_value"].copy()

# 1. Get rid of the corresponding districts (returns a modified copy; housing itself is unchanged)
housing.dropna(subset=["total_bedrooms"])
# 2. Get rid of the whole attribute (also returns a modified copy)
housing.drop("total_bedrooms", axis=1)
# 3. Set the missing values to some value (zero, the mean, the median, etc.)
median = housing["total_bedrooms"].median()
housing["total_bedrooms"].fillna(median, inplace=True)

from sklearn.impute import SimpleImputer
# create a SimpleImputer instance, specifying that you want to replace each attribute’s missing values with the median of that attribute
imputer = SimpleImputer(strategy="median")
# create a copy of the data without the text attribute ocean_proximity
housing_num = housing.drop("ocean_proximity", axis=1)
# fit the imputer instance to the training data
imputer.fit(housing_num)
# The imputer has computed the median of each attribute and stored the result in its statistics_ instance variable. 
imputer.statistics_
housing_num.median().values
# use this “trained” imputer to transform the training set by replacing missing values with the learned medians
X = imputer.transform(housing_num)
# put the result back into a pandas DataFrame
housing_tr = pd.DataFrame(X, columns=housing_num.columns, index=housing_num.index)