[Deep learning with Keras + TensorFlow] Predicting house prices: a regression example

import keras
keras.__version__
Using TensorFlow backend.

'2.3.1'

Predicting house prices: a regression example

This notebook contains the code samples found in Chapter 3, Section 6 of Deep Learning with Python. Note that the original text features far more content, in particular further explanations and figures: in this notebook, you will only find source code and related comments.


In our two previous examples, we were considering classification problems, where the goal was to predict a single discrete label of an
input data point. Another common type of machine learning problem is “regression”, which consists of predicting a continuous value instead
of a discrete label. For instance, predicting the temperature tomorrow, given meteorological data, or predicting the time that a
software project will take to complete, given its specifications.

Do not mix up “regression” with the algorithm “logistic regression”: confusingly, “logistic regression” is not a regression algorithm,
it is a classification algorithm.

Predicting house prices: a regression example

The two previous examples were classification problems, where the goal was to predict a single discrete label for an input data point. Another common type of machine learning problem is regression, which predicts a continuous value instead of a discrete label: for example, predicting tomorrow's temperature from meteorological data, or predicting the time a software project will take to complete from its specifications.

The Boston Housing Price dataset

We will be attempting to predict the median price of homes in a given Boston suburb in the mid-1970s, given a few data points about the
suburb at the time, such as the crime rate, the local property tax rate, etc.

The dataset we will be using has another interesting difference from our two previous examples: it has very few data points, only 506 in
total, split between 404 training samples and 102 test samples, and each “feature” in the input data (e.g. the crime rate is a feature) has
a different scale. For instance some values are proportions, which take values between 0 and 1, others take values between 1 and 12,
others between 0 and 100…

Let’s take a look at the data:

The Boston Housing Price dataset

In this section we will predict the median price of homes in a Boston suburb in the mid-1970s, given a few data points about the suburb at that time, such as the crime rate and the local property-tax rate.

The dataset used in this section has an interesting difference from the previous two examples: it contains relatively few data points, only 506 in total, split between 404 training samples and 102 test samples, and each feature of the input data (for example, the crime rate) has a different range of values. Some features are proportions with values between 0 and 1, others take values between 1 and 12, and still others between 0 and 100, and so on.

from keras.datasets import boston_housing

(train_data, train_targets), (test_data, test_targets) =  boston_housing.load_data()
train_data.shape
(404, 13)
test_data.shape
(102, 13)

As you can see, we have 404 training samples and 102 test samples. The data comprises 13 features. The 13 features in the input data are as
follows:

  1. Per capita crime rate.
  2. Proportion of residential land zoned for lots over 25,000 square feet.
  3. Proportion of non-retail business acres per town.
  4. Charles River dummy variable (= 1 if tract bounds river; 0 otherwise).
  5. Nitric oxides concentration (parts per 10 million).
  6. Average number of rooms per dwelling.
  7. Proportion of owner-occupied units built prior to 1940.
  8. Weighted distances to five Boston employment centres.
  9. Index of accessibility to radial highways.
  10. Full-value property-tax rate per $10,000.
  11. Pupil-teacher ratio by town.
  12. 1000 * (Bk - 0.63) ** 2 where Bk is the proportion of Black people by town.
  13. % lower status of the population.

The targets are the median values of owner-occupied homes, in thousands of dollars:

As you can see, we have 404 training samples and 102 test samples. The data comprises 13 features, which are:

  1. Per capita crime rate.
  2. Proportion of residential land zoned for lots over 25,000 square feet.
  3. Proportion of non-retail business acres per town.
  4. Charles River dummy variable (1 if the tract bounds the river; 0 otherwise).
  5. Nitric oxides concentration (parts per 10 million).
  6. Average number of rooms per dwelling.
  7. Proportion of owner-occupied units built prior to 1940.
  8. Weighted distances to five Boston employment centres.
  9. Index of accessibility to radial highways.
  10. Full-value property-tax rate per $10,000.
  11. Pupil-teacher ratio by town.
  12. 1000 * (Bk - 0.63) ** 2, where Bk is the proportion of Black residents by town.
  13. Percentage of lower-status population.

The targets are the median values of owner-occupied homes, in thousands of dollars:

train_targets
array([15.2, 42.3, 50. , 21.1, 17.7, 18.5, 11.3, 15.6, 15.6, 14.4, 12.1,
       17.9, 23.1, 19.9, 15.7,  8.8, 50. , 22.5, 24.1, 27.5, 10.9, 30.8,
       32.9, 24. , 18.5, 13.3, 22.9, 34.7, 16.6, 17.5, 22.3, 16.1, 14.9,
       23.1, 34.9, 25. , 13.9, 13.1, 20.4, 20. , 15.2, 24.7, 22.2, 16.7,
       12.7, 15.6, 18.4, 21. , 30.1, 15.1, 18.7,  9.6, 31.5, 24.8, 19.1,
       22. , 14.5, 11. , 32. , 29.4, 20.3, 24.4, 14.6, 19.5, 14.1, 14.3,
       15.6, 10.5,  6.3, 19.3, 19.3, 13.4, 36.4, 17.8, 13.5, 16.5,  8.3,
       14.3, 16. , 13.4, 28.6, 43.5, 20.2, 22. , 23. , 20.7, 12.5, 48.5,
       14.6, 13.4, 23.7, 50. , 21.7, 39.8, 38.7, 22.2, 34.9, 22.5, 31.1,
       28.7, 46. , 41.7, 21. , 26.6, 15. , 24.4, 13.3, 21.2, 11.7, 21.7,
       19.4, 50. , 22.8, 19.7, 24.7, 36.2, 14.2, 18.9, 18.3, 20.6, 24.6,
       18.2,  8.7, 44. , 10.4, 13.2, 21.2, 37. , 30.7, 22.9, 20. , 19.3,
       31.7, 32. , 23.1, 18.8, 10.9, 50. , 19.6,  5. , 14.4, 19.8, 13.8,
       19.6, 23.9, 24.5, 25. , 19.9, 17.2, 24.6, 13.5, 26.6, 21.4, 11.9,
       22.6, 19.6,  8.5, 23.7, 23.1, 22.4, 20.5, 23.6, 18.4, 35.2, 23.1,
       27.9, 20.6, 23.7, 28. , 13.6, 27.1, 23.6, 20.6, 18.2, 21.7, 17.1,
        8.4, 25.3, 13.8, 22.2, 18.4, 20.7, 31.6, 30.5, 20.3,  8.8, 19.2,
       19.4, 23.1, 23. , 14.8, 48.8, 22.6, 33.4, 21.1, 13.6, 32.2, 13.1,
       23.4, 18.9, 23.9, 11.8, 23.3, 22.8, 19.6, 16.7, 13.4, 22.2, 20.4,
       21.8, 26.4, 14.9, 24.1, 23.8, 12.3, 29.1, 21. , 19.5, 23.3, 23.8,
       17.8, 11.5, 21.7, 19.9, 25. , 33.4, 28.5, 21.4, 24.3, 27.5, 33.1,
       16.2, 23.3, 48.3, 22.9, 22.8, 13.1, 12.7, 22.6, 15. , 15.3, 10.5,
       24. , 18.5, 21.7, 19.5, 33.2, 23.2,  5. , 19.1, 12.7, 22.3, 10.2,
       13.9, 16.3, 17. , 20.1, 29.9, 17.2, 37.3, 45.4, 17.8, 23.2, 29. ,
       22. , 18. , 17.4, 34.6, 20.1, 25. , 15.6, 24.8, 28.2, 21.2, 21.4,
       23.8, 31. , 26.2, 17.4, 37.9, 17.5, 20. ,  8.3, 23.9,  8.4, 13.8,
        7.2, 11.7, 17.1, 21.6, 50. , 16.1, 20.4, 20.6, 21.4, 20.6, 36.5,
        8.5, 24.8, 10.8, 21.9, 17.3, 18.9, 36.2, 14.9, 18.2, 33.3, 21.8,
       19.7, 31.6, 24.8, 19.4, 22.8,  7.5, 44.8, 16.8, 18.7, 50. , 50. ,
       19.5, 20.1, 50. , 17.2, 20.8, 19.3, 41.3, 20.4, 20.5, 13.8, 16.5,
       23.9, 20.6, 31.5, 23.3, 16.8, 14. , 33.8, 36.1, 12.8, 18.3, 18.7,
       19.1, 29. , 30.1, 50. , 50. , 22. , 11.9, 37.6, 50. , 22.7, 20.8,
       23.5, 27.9, 50. , 19.3, 23.9, 22.6, 15.2, 21.7, 19.2, 43.8, 20.3,
       33.2, 19.9, 22.5, 32.7, 22. , 17.1, 19. , 15. , 16.1, 25.1, 23.7,
       28.7, 37.2, 22.6, 16.4, 25. , 29.8, 22.1, 17.4, 18.1, 30.3, 17.5,
       24.7, 12.6, 26.5, 28.7, 13.3, 10.4, 24.4, 23. , 20. , 17.8,  7. ,
       11.8, 24.4, 13.8, 19.4, 25.2, 19.4, 19.4, 29.1])

The prices are typically between $10,000 and $50,000. If that sounds cheap, remember this was the mid-1970s, and these prices are not
inflation-adjusted.

Prices are mostly between $10,000 and $50,000. If that sounds cheap, remember that this was the mid-1970s and these prices have not been adjusted for inflation.

Preparing the data

It would be problematic to feed into a neural network values that all take wildly different ranges. The network might be able to
automatically adapt to such heterogeneous data, but it would definitely make learning more difficult. A widespread best practice to deal
with such data is to do feature-wise normalization: for each feature in the input data (a column in the input data matrix), we
will subtract the mean of the feature and divide by the standard deviation, so that the feature will be centered around 0 and will have a
unit standard deviation. This is easily done in Numpy:

Preparing the data

Feeding values with wildly different ranges into a neural network is problematic. The network might manage to adapt to such heterogeneous data automatically, but it would certainly make learning harder. The widespread best practice for such data is feature-wise normalization: for each feature of the input data (a column of the input data matrix), subtract the feature's mean and divide by its standard deviation, so that the feature is centered on 0 and has a standard deviation of 1. This is easy to do with Numpy.

mean = train_data.mean(axis=0)
train_data -= mean
std = train_data.std(axis=0)
train_data /= std

test_data -= mean
test_data /= std

Note that the quantities that we use for normalizing the test data have been computed using the training data. We should never use in our
workflow any quantity computed on the test data, even for something as simple as data normalization.

Note that the mean and standard deviation used to normalize the test data were computed on the training data. You should never use any quantity computed on the test data in your workflow, not even for something as simple as data normalization.
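
If scikit-learn happens to be available, the same "fit on the training data, apply to the test data" discipline can be expressed with its StandardScaler. This is only an optional sketch equivalent to the Numpy cell above; it assumes scikit-learn is installed, and raw_train_data / raw_test_data are hypothetical names standing for the arrays as returned by load_data, before the in-place normalization above:

from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
scaler.fit(raw_train_data)                       # per-feature mean and std learned from training data only
train_data = scaler.transform(raw_train_data)
test_data = scaler.transform(raw_test_data)      # test data reuses the training statistics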

Building our network

Because so few samples are available, we will be using a very small network with two
hidden layers, each with 64 units. In general, the less training data you have, the worse overfitting will be, and using
a small network is one way to mitigate overfitting.

Building our network

Because so few samples are available, we will use a very small network with two hidden layers of 64 units each. In general, the less training data you have, the worse overfitting becomes, and a small network is one way to mitigate it.

from keras import models
from keras import layers

def build_model():
    # Because we will need to instantiate
    # the same model multiple times,
    # we use a function to construct it.
    model = models.Sequential()
    model.add(layers.Dense(64, activation='relu',
                           input_shape=(train_data.shape[1],)))
    model.add(layers.Dense(64, activation='relu'))
    model.add(layers.Dense(1))
    model.compile(optimizer='rmsprop', loss='mse', metrics=['mae'])
    return model

Our network ends with a single unit, and no activation (i.e. it will be a linear layer).
This is a typical setup for scalar regression (i.e. regression where we are trying to predict a single continuous value).
Applying an activation function would constrain the range that the output can take; for instance if
we applied a sigmoid activation function to our last layer, the network could only learn to predict values between 0 and 1. Here, because
the last layer is purely linear, the network is free to learn to predict values in any range.

Note that we are compiling the network with the mse loss function (Mean Squared Error), the mean of the squared differences between the predictions and the targets. It is a widely used loss function for regression problems.

We are also monitoring a new metric during training: mae. This stands for Mean Absolute Error, the mean of the absolute differences between the predictions and the targets. For instance, an MAE of 0.5 on this problem would mean that our predictions are off by $500 on average.

The last layer of the network has a single unit and no activation; it is a linear layer. This is the typical setup for scalar regression (regression that predicts a single continuous value). Applying an activation function would constrain the range of the output: for example, a sigmoid activation on the last layer would only let the network learn to predict values between 0 and 1. Because the last layer is purely linear, the network is free to learn to predict values in any range.

Note that the network is compiled with the mse loss function, the mean squared error (MSE) of the predictions with respect to the targets. It is a loss function commonly used for regression problems.

We also monitor a new metric during training: the mean absolute error (MAE), the average absolute difference between the predictions and the targets. For example, an MAE of 0.5 on this problem would mean that the predicted prices are off by $500 on average.
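
Both quantities are easy to compute by hand, which helps make the units concrete. A tiny illustrative sketch with made-up numbers (the arrays below are not taken from the dataset):

import numpy as np

y_true = np.array([15.2, 42.3, 21.1])      # hypothetical targets, in thousands of dollars
y_pred = np.array([14.7, 40.0, 22.0])      # hypothetical predictions
mse = np.mean((y_pred - y_true) ** 2)      # mean squared error: the training loss
mae = np.mean(np.abs(y_pred - y_true))     # mean absolute error: the monitored metric (here ~1.2, i.e. ~$1,200)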

Validating our approach using K-fold validation

To evaluate our network while we keep adjusting its parameters (such as the number of epochs used for training), we could simply split the
data into a training set and a validation set, as we were doing in our previous examples. However, because we have so few data points, the
validation set would end up being very small (e.g. about 100 examples). A consequence is that our validation scores may change a lot
depending on which data points we choose to use for validation and which we choose for training, i.e. the validation scores may have a
high variance with regard to the validation split. This would prevent us from reliably evaluating our model.

The best practice in such situations is to use K-fold cross-validation. It consists of splitting the available data into K partitions
(typically K=4 or 5), then instantiating K identical models, and training each one on K-1 partitions while evaluating on the remaining
partition. The validation score for the model used would then be the average of the K validation scores obtained.

Validating our approach using K-fold validation

To evaluate the network while we keep tuning its parameters (such as the number of training epochs), we could simply split the data into a training set and a validation set, as we did in the previous examples. But because there are so few data points, the validation set would end up being very small (roughly 100 samples). As a consequence, the validation scores could change a lot depending on which data points we pick for validation and which for training: the validation scores could have a high variance with respect to the validation split, and that would prevent us from evaluating the model reliably.

In this situation, the best practice is K-fold cross-validation (see figure 3-11 in the book). The available data is split into K partitions (typically K = 4 or 5); K identical models are instantiated, and each is trained on K-1 partitions and evaluated on the remaining one. The validation score of the model is then the average of the K validation scores.

In terms of code, this is straightforward:

The code is straightforward:

import numpy as np

k = 4
num_val_samples = len(train_data) // k
num_epochs = 100
all_scores = []
for i in range(k):
    print('processing fold #', i)
    # Prepare the validation data: data from partition # k
    val_data = train_data[i * num_val_samples: (i + 1) * num_val_samples]
    val_targets = train_targets[i * num_val_samples: (i + 1) * num_val_samples]

    # Prepare the training data: data from all other partitions
    partial_train_data = np.concatenate(
        [train_data[:i * num_val_samples],
         train_data[(i + 1) * num_val_samples:]],
        axis=0)
    partial_train_targets = np.concatenate(
        [train_targets[:i * num_val_samples],
         train_targets[(i + 1) * num_val_samples:]],
        axis=0)

    # Build the Keras model (already compiled)
    model = build_model()
    # Train the model (in silent mode, verbose=0)
    history=model.fit(partial_train_data, partial_train_targets,
              epochs=num_epochs, batch_size=1, verbose=0)
    # Evaluate the model on the validation data
    val_mse, val_mae = model.evaluate(val_data, val_targets, verbose=0)
    all_scores.append(val_mae)
processing fold # 0
processing fold # 1
processing fold # 2
processing fold # 3
all_scores
[2.153496742248535, 2.462418794631958, 3.175769567489624, 2.324655294418335]
np.mean(all_scores)
2.529085099697113

As you can notice, the different runs do indeed show rather different validation scores, from about 2.2 to 3.2. Their average (about 2.5) is a much more
reliable metric than any single one of these scores; that is the entire point of K-fold cross-validation. In this case, we are off by about $2,500 on
average, which is still significant considering that the prices range from $10,000 to $50,000.

Let’s try training the network for a bit longer: 500 epochs. To keep a record of how well the model did at each epoch, we will modify our training loop to save the per-epoch validation score log:

The validation score differs noticeably from run to run, from about 2.2 to 3.2. The average (about 2.5) is a much more reliable metric than any single score, which is the key point of K-fold cross-validation. In this example, the predicted prices are off from the actual prices by roughly $2,500 on average, which is still significant given that prices range from $10,000 to $50,000.

Let's train the network a bit longer: 500 epochs. To keep a record of how well the model does at each epoch, we modify the training loop to save the per-epoch validation scores.

from keras import backend as K

# Some memory clean-up
K.clear_session()
num_epochs = 500
all_mae_histories = []
for i in range(k):
    print('processing fold #', i)
    # Prepare the validation data: data from partition # k
    val_data = train_data[i * num_val_samples: (i + 1) * num_val_samples]
    val_targets = train_targets[i * num_val_samples: (i + 1) * num_val_samples]

    # Prepare the training data: data from all other partitions
    partial_train_data = np.concatenate(
        [train_data[:i * num_val_samples],
         train_data[(i + 1) * num_val_samples:]],
        axis=0)
    partial_train_targets = np.concatenate(
        [train_targets[:i * num_val_samples],
         train_targets[(i + 1) * num_val_samples:]],
        axis=0)

    # Build the Keras model (already compiled)
    model = build_model()
    # Train the model (in silent mode, verbose=0)
    history = model.fit(partial_train_data, partial_train_targets,
                        validation_data=(val_data, val_targets),
                        epochs=num_epochs, batch_size=1, verbose=0)
    mae_history = history.history['val_mae']
    all_mae_histories.append(mae_history)
processing fold # 0
processing fold # 1
processing fold # 2
processing fold # 3

We can then compute the average of the per-epoch MAE scores for all folds:

We can then compute the average of the per-epoch MAE scores over all folds:

average_mae_history = [
    np.mean([x[i] for x in all_mae_histories]) for i in range(num_epochs)]
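
Since every fold was trained for the same number of epochs, the same per-epoch averages can also be obtained with a single vectorized call; this is just an equivalent alternative to the list comprehension above:

average_mae_history = np.mean(np.array(all_mae_histories), axis=0)  # shape (num_epochs,): one mean across folds per epoch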

Let’s plot this:

Let's plot it and take a look:

import matplotlib.pyplot as plt

plt.plot(range(1, len(average_mae_history) + 1), average_mae_history)
plt.xlabel('Epochs')
plt.ylabel('Validation MAE')
plt.show()

[Figure: validation MAE by epoch]

It may be a bit hard to see the plot due to scaling issues and relatively high variance. Let’s:

  • Omit the first 10 data points, which are on a different scale from the rest of the curve.
  • Replace each point with an exponential moving average of the previous points, to obtain a smooth curve.

Because the vertical axis covers a large range and the data has relatively high variance, it is hard to see the pattern in this plot. Let's redraw it:

  • Omit the first 10 data points, which are on a different scale from the rest of the curve.
  • Replace each point with an exponential moving average of the previous points, to obtain a smooth curve.

def smooth_curve(points, factor=0.9):
  smoothed_points = []
  for point in points:
    if smoothed_points:
      previous = smoothed_points[-1]
      smoothed_points.append(previous * factor + point * (1 - factor))
    else:
      smoothed_points.append(point)
  return smoothed_points

smooth_mae_history = smooth_curve(average_mae_history[10:])

plt.plot(range(1, len(smooth_mae_history) + 1), smooth_mae_history)
plt.xlabel('Epochs')
plt.ylabel('Validation MAE')
plt.show()

[Figure: smoothed validation MAE by epoch, first 10 points omitted]

According to this plot, it seems that validation MAE stops improving significantly after 80 epochs. Past that point, we start overfitting.

Once we are done tuning other parameters of our model (besides the number of epochs, we could also adjust the size of the hidden layers), we
can train a final “production” model on all of the training data, with the best parameters, then look at its performance on the test data:

As this plot shows, the validation MAE stops improving significantly after about 80 epochs; past that point, the model starts overfitting.

Once you are done tuning the model (besides the number of epochs, you can also adjust the size of the hidden layers), you can train a final production model on all of the training data with the best parameters, and then look at its performance on the test set.

# Get a fresh, compiled model.
model = build_model()
# Train it on the entirety of the data.
model.fit(train_data, train_targets,
          epochs=80, batch_size=16, verbose=0)
test_mse_score, test_mae_score = model.evaluate(test_data, test_targets)
102/102 [==============================] - 0s 244us/step
test_mae_score
2.76008677482605

We are still off by about $2,760.
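
To actually use the final model, call predict on new samples that have been normalized with the training mean and standard deviation; a minimal sketch, reusing the already-normalized test_data from above:

predictions = model.predict(test_data)   # array of shape (102, 1), prices in thousands of dollars
print(predictions[0])                    # predicted median price for the first test sample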

Wrapping up

Here’s what you should take away from this example:

  • Regression is done using different loss functions from classification; Mean Squared Error (MSE) is a commonly used loss function for
    regression.
  • Similarly, evaluation metrics to be used for regression differ from those used for classification; naturally the concept of “accuracy”
    does not apply for regression. A common regression metric is Mean Absolute Error (MAE).
  • When features in the input data have values in different ranges, each feature should be scaled independently as a preprocessing step.
  • When there is little data available, using K-Fold validation is a great way to reliably evaluate a model.
  • When little training data is available, it is preferable to use a small network with very few hidden layers (typically only one or two),
    in order to avoid severe overfitting.

This example concludes our series of three introductory practical examples. You are now able to handle common types of problems with vector data input:

  • Binary (2-class) classification.
  • Multi-class, single-label classification.
  • Scalar regression.

In the next chapter, you will acquire a more formal understanding of some of the concepts you have encountered in these first examples,
such as data preprocessing, model evaluation, and overfitting.

Wrapping up

Here is what you should take away from this example:

  • Regression uses different loss functions than classification; mean squared error (MSE) is a commonly used loss function for regression.
  • Likewise, the evaluation metrics for regression differ from those for classification; the notion of accuracy does not apply to regression. A common regression metric is mean absolute error (MAE).
  • When the features of the input data have different value ranges, each feature should be scaled independently as a preprocessing step.
  • When little data is available, K-fold validation is a reliable way to evaluate a model.
  • When little training data is available, it is preferable to use a small network with few hidden layers (typically only one or two) to avoid severe overfitting.

You can now handle the most common machine learning tasks on vector data:

  • Binary (2-class) classification.
  • Multi-class, single-label classification.
  • Scalar regression.

