ARIMA -> SARIMA -> SARIMAX:
- S stands for Seasonal: the series has a seasonal (periodic) component.
- X stands for eXogenous: the model can also use external regressors.
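Seasonal parameters:
P: seasonal autoregressive order.
D: seasonal differencing order.
Q: seasonal moving-average order.
m: number of time steps in a single seasonal period.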
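To make D and m concrete, here is a minimal sketch of one round of seasonal differencing (D = 1) on a synthetic series with period m = 12. The synthetic series and the helper `seasonal_difference` are illustrative assumptions and are not part of this notebook's data. In statsmodels, the four seasonal values are passed to `SARIMAX` together as `seasonal_order=(P, D, Q, m)`.

```python
import numpy as np

def seasonal_difference(y, m, D=1):
    """Apply D rounds of seasonal differencing with period m: y[t] - y[t - m]."""
    y = np.asarray(y, dtype=float)
    for _ in range(D):
        y = y[m:] - y[:-m]
    return y

# Synthetic example (assumed data): a linear trend plus a pattern repeating every 12 steps.
m = 12
t = np.arange(48)
series = 0.5 * t + 10 * np.sin(2 * np.pi * t / m)

diffed = seasonal_difference(series, m=m, D=1)

# Each round of differencing shortens the series by m observations.
print(diffed.shape)
# The repeating pattern cancels out; only the trend change over one period remains.
print(np.allclose(diffed, 0.5 * m))
```

Each round of seasonal differencing removes m observations from the start of the series, which is worth keeping in mind when the training set is short.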
import numpy as np
import pandas as pd
import matplotlib.pylab as pl
import seaborn as sns
%matplotlib inline
sns.set(style = 'ticks', context = 'poster')
pd.set_option('display.float_format', lambda x: '%.5f' % x)
np.set_printoptions(precision = 5, suppress = True)
filename_ts = 'data/series1.csv'
ts_df = pd.read_csv(filename_ts, index_col = 0, parse_dates = [0])
n_sample = ts_df.shape[0]
print(ts_df.shape)
ts_df.head(6)
# Training and test sets
n_train = int(0.95 * n_sample) + 1
n_test = n_sample - n_train
ts_train = ts_df.iloc[:n_train]['value']
ts_test = ts_df.iloc[n_train:]['value']
print(ts_train.shape)
print(ts_test.shape)
print('Training Series:', '\n', ts_train.tail(), '\n')
print('Testing Series:', '\n', ts_test.head())
import statsmodels.tsa.api as smt
def tsplot(y, lags=None, title='', figsize=(20, 12)):
    """Plot the series, its histogram, and its ACF and PACF on a 2x2 grid."""
    fig = pl.figure(figsize = figsize)
    layout = (2, 2)
    ts_ax = pl.subplot2grid(layout, (0, 0))
    hist_ax = pl.subplot2grid(layout, (0, 1))
    acf_ax = pl.subplot2grid(layout, (1, 0))
    pacf_ax = pl.subplot2grid(layout, (1, 1))
    y.plot(ax = ts_ax)
    ts_ax.set_title(title)
    y.plot(ax = hist_ax, kind = 'hist', bins = 25)
    hist_ax.set_title('Histogram')
    smt.graphics.plot_acf(y, lags = lags, ax = acf_ax)
    smt.graphics.plot_pacf(y, lags = lags, ax = pacf_ax)
    # Start the lag axis at 0 on both correlation plots
    for ax in (acf_ax, pacf_ax):
        ax.set_xlim(0)
    sns.despine()  # remove the top and right spines
    fig.tight_layout()
    return ts_ax, acf_ax, pacf_ax
tsplot(ts_train, title='A Given Training Series', lags=20)
# Model evaluation
import statsmodels.api as sm
arima200 = sm.tsa.SARIMAX(ts_train, order = (2, 0, 0))
model_results = arima200.fit()
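Model selection with AIC and BIC: lower values are better, and both criteria penalize extra parameters, so they favor the simpler model.
- AIC: Akaike Information Criterion
  $\mathrm{AIC} = 2k - 2\ln(L)$
- BIC: Bayesian Information Criterion
  $\mathrm{BIC} = k\ln(n) - 2\ln(L)$
- k is the number of model parameters, n is the number of samples, and L is the likelihood of the fitted model.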
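As a worked example of the two criteria, the sketch below evaluates AIC and BIC for two candidate models. The parameter counts, sample size, and log-likelihood values are made-up numbers chosen only for illustration, and the helpers `aic` and `bic` are hypothetical names. Fitted statsmodels results expose the same quantities as `.aic` and `.bic`.

```python
import math

def aic(k, log_likelihood):
    """AIC = 2k - 2 ln(L), where log_likelihood is ln(L)."""
    return 2 * k - 2 * log_likelihood

def bic(k, n, log_likelihood):
    """BIC = k ln(n) - 2 ln(L), where log_likelihood is ln(L)."""
    return k * math.log(n) - 2 * log_likelihood

# Assumed values for illustration: the larger model fits slightly better
# but uses three times as many parameters.
n = 100
candidates = {
    "simple": {"k": 2, "loglik": -150.0},
    "complex": {"k": 6, "loglik": -148.0},
}

for name, c in candidates.items():
    print(name, "AIC:", aic(c["k"], c["loglik"]), "BIC:", round(bic(c["k"], n, c["loglik"]), 2))
```

With these numbers both criteria prefer the simpler model, because the small gain in log-likelihood does not offset the extra parameters. BIC charges ln(n) per parameter instead of 2, so for n of 8 or more it penalizes complexity more heavily than AIC.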
import itertools  # iterator utilities
p_min = 0
d_min = 0
q_min = 0
p_max = 4
d_max = 0
q_max = 4
results_bic = pd.DataFrame(index=['AR{}'.format(i) for i in range(p_min, p_max+1)],
                           columns=['MA{}'.format(i) for i in range(q_min, q_max+1)])
for p, d, q in itertools.product(range(p_min, p_max+1), range(d_min, d_max+1), range(q_min, q_max+1)):
    if p == 0 and d == 0 and q == 0:
        # Skip the empty model: there are no AR or MA terms to fit
        results_bic.loc['AR{}'.format(p), 'MA{}'.format(q)] = np.nan
        continue
    try:
        model = sm.tsa.SARIMAX(ts_train, order=(p, d, q))
        results = model.fit()
        results_bic.loc['AR{}'.format(p), 'MA{}'.format(q)] = results.bic
    except Exception:
        # Leave the cell as NaN when a fit fails
        continue
results_bic = results_bic[results_bic.columns].astype(float)
results_bic
fig, ax = pl.subplots(figsize = (10, 8))
ax = sns.heatmap(results_bic, mask=results_bic.isnull(), ax=ax, annot=True, fmt='.2f')
ax.set_title('BIC')
# trend='n' fits without a constant; older statsmodels releases spell this 'nc'
train_results = sm.tsa.arma_order_select_ic(ts_train, ic=['aic', 'bic'], trend='n', max_ar=4, max_ma=4)
print('AIC', train_results.aic_min_order)
print('BIC', train_results.bic_min_order)
AIC (4, 2)
BIC (1, 1)
train_results
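Residual diagnostics:
- Check whether the ARIMA model's residuals are normally distributed with zero mean and constant variance.
- QQ plot: points that fall on a straight line indicate a normal distribution.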
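The checks listed above can be approximated numerically. The following is a rough sketch on simulated residuals, not on this notebook's model: it reports the mean, compares the variance of the first and second halves, and computes the correlation between the sorted residuals and standard normal quantiles, which is close to 1 when the QQ plot is nearly a straight line. The helper `qq_correlation` and both simulated samples are assumptions made for illustration.

```python
import numpy as np
from statistics import NormalDist

def qq_correlation(residuals):
    """Correlation between sorted residuals and standard normal quantiles."""
    r = np.sort(np.asarray(residuals, dtype=float))
    n = r.size
    # Plotting positions strictly inside (0, 1), one per sorted residual
    probs = (np.arange(1, n + 1) - 0.5) / n
    theoretical = np.array([NormalDist().inv_cdf(p) for p in probs])
    return np.corrcoef(theoretical, r)[0, 1]

# Simulated residuals (assumed data): one normal sample, one skewed sample
rng = np.random.default_rng(0)
normal_resid = rng.normal(loc=0.0, scale=1.0, size=500)
skewed_resid = rng.exponential(scale=1.0, size=500) - 1.0

for name, resid in [("normal", normal_resid), ("skewed", skewed_resid)]:
    half = resid.size // 2
    print(name)
    print("  mean:", round(float(resid.mean()), 3))
    print("  variance, first half:", round(float(resid[:half].var()), 3))
    print("  variance, second half:", round(float(resid[half:].var()), 3))
    print("  QQ correlation:", round(float(qq_correlation(resid)), 4))
```

For a fitted model, the same helper could be applied to the array of model residuals. A QQ correlation well below 1, or clearly different variances in the two halves, suggests the assumptions do not hold.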
import statsmodels.api as sm
arima111 = sm.tsa.SARIMAX(ts_train, order=(1,1,1))
model_results = arima111.fit()
model_results.plot_diagnostics(figsize = (16, 12));