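Installing prophet
prophet is an open-source time series forecasting toolkit from Facebook. It can be installed directly with conda under the package name fbprophet, e.g. conda install -c conda-forge fbprophet (releases from 1.0 onward are published as prophet).
Official prophet site: https://facebook.github.io/prophet/
The name "prophet" refers to someone who foretells the future (先知 in Chinese).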
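The input to prophet is generally a DataFrame with two columns: ds and y.
The ds (datestamp) column should be in a format Pandas can recognize: YYYY-MM-DD for a date, or YYYY-MM-DD HH:MM:SS for a timestamp.
The y column must be numeric.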
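To make the two-column requirement concrete, here is a minimal sketch that converts raw string columns into the ds / y layout. The toy values and the name prophet_input are made up for illustration; only the column names ds and y come from prophet itself.

```python
import pandas as pd

# Hypothetical toy data: three hourly observations, stored as strings on purpose
raw = pd.DataFrame({
    "date_time": ["2018-03-01 00:00:00", "2018-03-01 01:00:00", "2018-03-01 02:00:00"],
    "traffic_volume": ["1200", "950", "700"],
})

prophet_input = pd.DataFrame({
    "ds": pd.to_datetime(raw["date_time"]),     # datestamp column parsed by pandas
    "y": pd.to_numeric(raw["traffic_volume"]),  # target column converted to numbers
})

print(prophet_input.dtypes)
```

Checking the dtypes before fitting is a cheap way to catch a date column that was read as plain text.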
Downloading the dataset
Metro Interstate Traffic Volume Data Set
prophet in practice
Importing packages
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.metrics import mean_squared_error, mean_absolute_error
%matplotlib inline
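plt.style.use('ggplot')  # apply the style first so it does not reset the rcParams set below
plt.rcParams['font.sans-serif'] = 'SimHei'  # font that can render Chinese labels
plt.rcParams['axes.unicode_minus'] = False  # render the minus sign correctly
plt.rcParams['figure.dpi'] = 200
plt.rcParams['text.color'] = 'black'
plt.rcParams['font.size'] = 20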
print(plt.style.available)
# ['bmh', 'classic', 'dark_background', 'fast', 'fivethirtyeight', 'ggplot', 'grayscale', 'seaborn-bright', 'seaborn-colorblind', 'seaborn-dark-palette', 'seaborn-dark', 'seaborn-darkgrid', 'seaborn-deep', 'seaborn-muted', 'seaborn-notebook', 'seaborn-paper', 'seaborn-pastel', 'seaborn-poster', 'seaborn-talk', 'seaborn-ticks', 'seaborn-white', 'seaborn-whitegrid', 'seaborn', 'Solarize_Light2', 'tableau-colorblind10', '_classic_test']
Reading the CSV data with pandas
csv_files = 'Metro_Interstate_Traffic_Volume.csv'
df = pd.read_csv(csv_files)
df.set_index('date_time',inplace=True)
df.index = pd.to_datetime(df.index)
df.head()
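A quick scan of the table shows explanatory factors such as holidays, temperature, rainfall, snowfall and weather type; the dependent variable is the traffic volume, traffic_volume.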
df.info()
'''
<class 'pandas.core.frame.DataFrame'>
DatetimeIndex: 48204 entries, 2012-10-02 09:00:00 to 2018-09-30 23:00:00
Data columns (total 8 columns):
holiday 48204 non-null object
temp 48204 non-null float64
rain_1h 48204 non-null float64
snow_1h 48204 non-null float64
clouds_all 48204 non-null int64
weather_main 48204 non-null object
weather_description 48204 non-null object
traffic_volume 48204 non-null int64
dtypes: float64(3), int64(2), object(3)
memory usage: 3.3+ MB
'''
df.describe()
Plotting the series
The plot reveals that some data is missing, but this has little effect on what follows.
traffic = df[['traffic_volume']]
traffic[:].plot(style='--', figsize=(15,5), title='traffic_volume')
plt.show()
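The gap visible in the plot can also be located numerically. The sketch below compares the index against a complete hourly range; find_missing_hours is a hypothetical helper, and the demo index is invented. It assumes the data is meant to have one row per hour and drops duplicate timestamps first.

```python
import pandas as pd

def find_missing_hours(index):
    """Return the hourly timestamps that are absent from a DatetimeIndex."""
    index = pd.DatetimeIndex(index).unique().sort_values()
    # Full hourly grid between the first and last observed timestamps
    expected = pd.date_range(index.min(), index.max(), freq=pd.Timedelta(hours=1))
    return expected.difference(index)

# Hypothetical example: 02:00 and 03:00 are missing, 01:00 appears twice
demo_index = pd.to_datetime([
    "2018-01-01 00:00", "2018-01-01 01:00", "2018-01-01 01:00",
    "2018-01-01 04:00", "2018-01-01 05:00",
])
missing = find_missing_hours(demo_index)
print(len(missing), "missing hours")
```

On the real data this would be called as find_missing_hours(df.index).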
Splitting the dataset
Key point: filtering by date in pandas
traffic_train = traffic.loc[(traffic.index >='2017-01') & (traffic.index <= '2018-03')].copy()
traffic_test = traffic.loc[traffic.index > '2018-03'].copy()
_ = traffic_test.rename(columns={'traffic_volume': 'TEST SET'}) \
    .join(traffic_train.rename(columns={'traffic_volume': 'TRAINING SET'}), how='outer') \
    .plot(figsize=(20,5), title='traffic_volume', style='.')
Because the data is recorded hourly, selecting roughly two years (2017-01 onward) already gives plenty of observations.
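The date filter used in the split relies on pandas parsing a partial string such as '2018-03' as 2018-03-01 00:00:00. The sketch below, on an invented hourly frame named demo, contrasts that boolean mask with a label slice, where a partial string covers the whole period it names.

```python
import pandas as pd

# Hypothetical hourly frame from 2016-12-31 00:00 to 2018-03-02 00:00
idx = pd.date_range("2016-12-31", "2018-03-02", freq=pd.Timedelta(hours=1))
demo = pd.DataFrame({"traffic_volume": range(len(idx))}, index=idx)

# Boolean mask: each string is parsed to a single instant (midnight on the 1st)
mask_train = demo.loc[(demo.index >= "2017-01") & (demo.index <= "2018-03")]

# Label slice: each partial string stands for the entire month
slice_train = demo.loc["2017-01":"2018-02"]

print(mask_train.index.max())
print(slice_train.index.max())
```

With the mask, the training set ends on the first hour of March and the test set starts one hour later, so the two sets do not overlap.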
Extracting features from the date
prophet does not require hand-crafted features, but we can still try extracting some ourselves.
def create_features(df, label=None):
    """
    Creates time series features from datetime index.
    """
    df = df.copy()
    df['date'] = df.index
    df['hour'] = df['date'].dt.hour
    df['dayofweek'] = df['date'].dt.dayofweek
    df['quarter'] = df['date'].dt.quarter
    df['month'] = df['date'].dt.month
    df['year'] = df['date'].dt.year
    df['dayofyear'] = df['date'].dt.dayofyear
    df['dayofmonth'] = df['date'].dt.day
    # .dt.weekofyear was removed in pandas 2.0; isocalendar().week replaces it
    df['weekofyear'] = df['date'].dt.isocalendar().week.astype(int)
    X = df[['hour','dayofweek','quarter','month','year',
            'dayofyear','dayofmonth','weekofyear']]
    if label:
        y = df[label]
        return X, y
    return X
X, y = create_features(traffic, label='traffic_volume')
features_and_target = pd.concat([X, y], axis=1)
features_and_target.head()
Get a feel for how the different features affect the target variable:
sns.pairplot(features_and_target.dropna(),
             hue='hour',
             x_vars=['hour','dayofweek',
                     'year','weekofyear'],
             y_vars='traffic_volume',
             height=5,
             plot_kws={'alpha':0.15, 'linewidth':0}
            )
plt.suptitle('Traffic Volume by Hour, Day of Week, Year and Week of Year')
plt.show()
Training and predicting with prophet
from fbprophet import Prophet  # from release 1.0 onward: from prophet import Prophet
# Set up the model and fit it on the training set
model = Prophet()
model.fit(traffic_train.reset_index().rename(columns={'date_time':'ds','traffic_volume':'y'}))
traffic_test_pred = model.predict(df=traffic_test.reset_index() \
.rename(columns={'date_time':'ds'}))
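The predict call above only needs a frame with a ds column, which here comes from the test set. To forecast past the end of the data, the timestamps have to be generated. Prophet provides make_future_dataframe(periods=..., freq=...) for this; the sketch below shows a plain pandas equivalent so the idea can be tried without a fitted model. make_hourly_future is a hypothetical helper name.

```python
import pandas as pd

def make_hourly_future(last_timestamp, periods):
    """Build a frame whose 'ds' column holds `periods` hourly steps after last_timestamp."""
    start = pd.Timestamp(last_timestamp) + pd.Timedelta(hours=1)
    ds = pd.date_range(start=start, periods=periods, freq=pd.Timedelta(hours=1))
    return pd.DataFrame({"ds": ds})

# Two days of hourly timestamps following the last row of the dataset
future = make_hourly_future("2018-09-30 23:00:00", periods=48)
print(future.head())
```

A frame like future could then be passed to model.predict in place of the renamed test set.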
Plotting the predictions
f, ax = plt.subplots(1)
f.set_figheight(5)
f.set_figwidth(15)
ax.scatter(traffic_test.index, traffic_test['traffic_volume'], color='r')
fig = model.plot(traffic_test_pred, ax=ax)
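The mismatch visible in the plot has two likely causes:
- Too much training data, so the model fails to capture the most recent trend
- Too long a forecast horizon, so the error grows over time
Interested readers can experiment with both.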
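When experimenting, it helps to put a number on the error rather than judging the plot by eye. The sketch below computes MSE, RMSE and MAE with NumPy; the first and last match what sklearn's mean_squared_error and mean_absolute_error (imported at the top) return. forecast_errors is a hypothetical helper and the sample numbers are invented.

```python
import numpy as np

def forecast_errors(y_true, y_pred):
    """Return MSE, RMSE and MAE between actual and predicted values."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residual = y_true - y_pred
    mse = float(np.mean(residual ** 2))
    mae = float(np.mean(np.abs(residual)))
    return {"mse": mse, "rmse": mse ** 0.5, "mae": mae}

# Hypothetical actuals and forecasts
scores = forecast_errors([100, 200, 300], [110, 190, 330])
print(scores)
```

For the model above, a call might look like forecast_errors(traffic_test['traffic_volume'], traffic_test_pred['yhat']), assuming the rows of both frames are in the same order.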
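What prophet learned
The component plot below shows:
- Overall trend: downward
- Weekly pattern: high volume on weekdays, low volume on weekends
- Daily pattern: morning and evening commute peaks, so each day's volume roughly follows an M-shaped curve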
fig = model.plot_components(traffic_test_pred)
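The weekly and daily patterns in the component plot can be cross-checked against the raw data with a simple groupby. The sketch below uses synthetic traffic (invented numbers with commute peaks at 08:00 and 17:00 and lower weekend volume), so it runs on its own; on the real data, demo would be replaced by the traffic frame.

```python
import numpy as np
import pandas as pd

# Hypothetical synthetic data: two weeks of hourly volume starting on a Monday
idx = pd.date_range("2018-01-01", periods=14 * 24, freq=pd.Timedelta(hours=1))
hours = idx.hour.to_numpy()
# Two bumps per day, centred on the morning and evening commute
peaks = np.exp(-((hours - 8) ** 2) / 8.0) + np.exp(-((hours - 17) ** 2) / 8.0)
weekend = idx.dayofweek.to_numpy() >= 5
volume = 1000 + 4000 * peaks * np.where(weekend, 0.5, 1.0)
demo = pd.DataFrame({"traffic_volume": volume}, index=idx)

# Mean volume per hour of day and per day of week (0 = Monday)
by_hour = demo.groupby(demo.index.hour)["traffic_volume"].mean()
by_weekday = demo.groupby(demo.index.dayofweek)["traffic_volume"].mean()

print(by_weekday)
```

If the hourly means of the real data show two peaks and the weekday means drop on days 5 and 6, the raw data agrees with what prophet extracted.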
Zooming in
Here is how the model does on the first month of the test set:
# Plot the forecast with the actuals
f, ax = plt.subplots(1)
f.set_figheight(5)
f.set_figwidth(15)
ax.plot(traffic_test.index, traffic_test['traffic_volume'], color='r')
fig = model.plot(traffic_test_pred, ax=ax)
# Pass Timestamps rather than raw strings so the limits are read as dates
ax.set_xlim(pd.Timestamp('2018-03-01'), pd.Timestamp('2018-04-01'))
ax.set_ylim(-1000, 8000)
plot = plt.suptitle('Forecast vs Actuals')
Looks pretty convincing, doesn't it? 😉