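1 Introduction
In this article we show how to produce reliable time series forecasts. We first introduce and discuss the concepts of autocorrelation, stationarity, and seasonality, and then apply one of the most widely used time series forecasting methods, ARIMA.
2 Background
A time series offers the opportunity to forecast future values. Based on past values, a time series can be used to forecast trends in economics, weather, and capacity planning, to name just a few. The specific properties of time series data mean that specialized statistical methods are usually required.
An ARIMA model is an ARMA model with an additional differencing step: the series is differenced d times, and an ARMA(p, q) model is then fitted to the differenced series.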
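To make the relationship between ARMA and ARIMA concrete, here is a minimal sketch on synthetic data. The AR coefficient 0.6, the seed, and all variable names are illustrative assumptions and do not come from this article. A stationary AR(1) series is integrated once with a cumulative sum, and a single round of differencing recovers it.

```python
import numpy as np

rng = np.random.default_rng(0)          # fixed seed, illustrative only
n = 200
eps = rng.normal(size=n)

# Stationary AR(1) series: x_t = 0.6 * x_{t-1} + eps_t (coefficient is an assumption)
x = np.zeros(n)
for t in range(1, n):
    x[t] = 0.6 * x[t - 1] + eps[t]

# Integrate once: y is non-stationary, the kind of series a d=1 model targets
y = np.cumsum(x)

# First-order differencing (d = 1) undoes the integration
recovered = np.diff(y)

# np.diff drops the first element, so compare against x[1:]
print(np.allclose(recovered, x[1:]))
```

In other words, fitting an ARMA model to the differenced series is the same idea as fitting an ARIMA model with d = 1 to the original one.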
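3 Python implementation
(1) Check whether the time series is stationary (and whether it is just white noise); if it is not stationary, transform it until it is.
(2) The data in this example are periodic, so take a first difference and then a 144-step difference (the data are sampled every 10 minutes, so 144 steps is one day).
(3) Inspect the autocorrelation (ACF) and partial autocorrelation (PACF) plots of the differenced series; after differencing, the series is stationary.
(4) Choose the model order using AIC, BIC, and HQIC.
(5) Forecast: once the model is fixed, use it to predict.
(6) Restore: because the model is fitted to the differenced series, its predictions are predictions of the differenced series and must be converted back to the original scale.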
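The restoration in the last step can be sketched on a toy series. This is only an illustration under stated assumptions: the values, the season length of 3 (standing in for 144), and the variable names are hypothetical. The differenced values are treated as if they were model predictions, and the two differences are undone in reverse order.

```python
import numpy as np
import pandas as pd

season = 3                                   # stands in for the article's 144
idx = pd.date_range("2017-04-01", periods=12, freq="10min")
y = pd.Series([5., 9., 4., 7., 12., 6., 8., 14., 9., 11., 16., 10.], index=idx)

d1 = y.diff(1)                               # first difference
w = d1 - d1.shift(season)                    # seasonal difference of the first difference

# Pretend w is what the model predicted, then undo the differences in reverse order
d1_back = w.add(d1.shift(season))            # undo the seasonal difference
y_back = d1_back.add(y.shift(1)).dropna()    # undo the first difference

print(np.allclose(y_back.values, y.loc[y_back.index].values))
```

Both shifts use observed values, so this reconstruction is a one-step-ahead one. For a forecast several steps beyond the last observation, the previously restored values would have to be fed back in.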
#-*- coding: utf-8 -*-
from __future__ import print_function
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm
# Read the Excel data
discfile = 'data_test.xls'
data = pd.read_excel(discfile,index_col=0)
data=data['number']
data.head()
data.plot(figsize=(12,8))
print(data)
# Take a first difference, then a 144-step (seasonal) difference
diff_1 = data.diff(1)
diff1 = diff_1.dropna()
diff1_144_1 = diff_1-diff_1.shift(144)
diff1_144 = diff1_144_1.dropna()
#print(diff1_144_1)
# Check whether the series is stationary: plot the ACF and PACF
fig1 = plt.figure(figsize=(12,8))
ax1=fig1.add_subplot(111)
sm.graphics.tsa.plot_acf(diff1_144,lags=40,ax=ax1)
fig2 = plt.figure(figsize=(12,8))
ax2=fig2.add_subplot(111)
sm.graphics.tsa.plot_pacf(diff1_144,lags=40, ax=ax2)
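# Choose the model order by AIC, BIC and HQIC (smaller is better for all three)
# Note: sm.tsa.ARMA was removed in statsmodels 0.13; on newer versions an ARMA(p, q)
# fit corresponds to sm.tsa.ARIMA(diff1_144, order=(p, 0, q)).fit()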
# arma_mod01 = sm.tsa.ARMA(diff1_144,(0,1)).fit()
# print(arma_mod01.aic,arma_mod01.bic,arma_mod01.hqic)
# arma_mod10 = sm.tsa.ARMA(diff1_144,(1,0)).fit()
# print(arma_mod10.aic,arma_mod10.bic,arma_mod10.hqic)
# arma_mod60 = sm.tsa.ARMA(diff1_144,(6,0)).fit()
# print(arma_mod60.aic,arma_mod60.bic,arma_mod60.hqic)
arma_mod61 = sm.tsa.ARMA(diff1_144,(6,1)).fit()
print(arma_mod61.aic,arma_mod61.bic,arma_mod61.hqic)
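# Compute the residuals
resid = arma_mod61.resid
# Plot the ACF and PACF of the residuals; no significant autocorrelation remains, so the residuals can be treated as white noise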
fig = plt.figure(figsize=(12,8))
ax1 = fig.add_subplot(211)
fig = sm.graphics.tsa.plot_acf(resid.values.squeeze(), lags=40, ax=ax1)
ax2 = fig.add_subplot(212)
fig = sm.graphics.tsa.plot_pacf(resid, lags=40, ax=ax2)
print(sm.stats.durbin_watson(arma_mod61.resid.values))
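# Durbin-Watson test on the residuals: the closer DW is to 2, the weaker the first-order autocorrelation
# Ljung-Box test; nlags=40 is passed explicitly so the result matches range(1,41) below
r,q,p = sm.tsa.acf(resid.values.squeeze(), nlags=40, qstat=True)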
d = np.c_[range(1,41), r[1:], q, p]
table = pd.DataFrame(d, columns=['lag', "AC", "Q", "Prob(>Q)"])
print(table.set_index('lag'))
# Forecast with the fitted model
predict_data=arma_mod61.predict('2017/4/4 23:50','2017/4/6 00:00',dynamic=False)
# print(predict_data)
# print(diff_1)
# The forecast was made on the differenced values, so the result has to be restored
# Undo the 144-step difference
diff1_144_shift=diff_1.shift(144)
# print('print diff1_144_shift')
print(diff1_144_shift)
diff_recover_144=predict_data.add(diff1_144_shift)
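# Undo the first difference
diff1_shift=data.shift(1)
diff_recover_1=diff_recover_144.add(diff1_shift)
diff_recover_1=diff_recover_1.dropna() # final restored predictions
print('預測值')  # label meaning "predicted values"
print(diff_recover_1)
# Plot the actual values, the restored predictions, and the predictions of the differenced series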
fig, ax = plt.subplots(figsize=(12, 8))
ax = data.loc['2017-04-01':].plot(ax=ax)
ax = diff_recover_1.plot(ax=ax)
fig = arma_mod61.plot_predict('2017/4/2 23:50', '2017/4/6 00:00', dynamic=False, ax=ax, plot_insample=False)
plt.show()
4 Code walkthrough
The data look like this.
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.api as sm
# Read the data
data=pd.read_excel("data_test.xls",index_col=0)
data=data['number']
data.plot(figsize=(12,8)) # plot of the raw series
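Differencing is commonly used in analyses where time is the dimension of interest. It simply means subtracting the previous value from the next one.
When the observations are equally spaced, subtracting the previous value from the next one is called a "first-order difference". Doing the same thing again, that is, differencing the first-order differences once more, gives the "second-order difference".
Equally spaced means the following. In the figure below, you may subtract each row from the one after it across rows 1, 2, 3, 4, 5, 6, 7, 8, 9, 10; or across rows 2, 4, 6, 8, 10; or across rows 1, 3, 5, 7, 9. What you should not do is jump irregularly, for example across rows 1, 5, 6, 10. Other combinations such as 3, 6, 9 are also fine, as long as the spacing between the rows is constant.
For a more intuitive picture of differencing, see the figure below.
The effect of differencing, then, is to remove trend (and, with a seasonal lag, periodic patterns) from the data, so that the resulting series fluctuates around a more stable level.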
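The definitions above can be tried out on a small example. This is a minimal sketch using a hypothetical series of square numbers; it shows a first-order difference, a second-order difference, and an equally spaced difference with a spacing of 2 rows.

```python
import pandas as pd

s = pd.Series([1, 4, 9, 16, 25, 36])

first = s.diff(1)            # s[t] - s[t-1]
second = s.diff(1).diff(1)   # difference of the first difference
lag2 = s.diff(2)             # s[t] - s[t-2]: equal spacing of 2 rows

print(first.dropna().tolist())    # [3.0, 5.0, 7.0, 9.0, 11.0]
print(second.dropna().tolist())   # [2.0, 2.0, 2.0, 2.0]
print(lag2.dropna().tolist())     # [8.0, 12.0, 16.0, 20.0]
```

The quadratic trend in the original values becomes a linear one after one difference and a constant after two, which is the sense in which differencing stabilizes a series.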
# The data are periodic: take a first difference, then a 144-step difference
diff_1=data.diff(1)
diff1=diff_1.dropna()
diff1_144_1=diff_1-diff_1.shift(144)
diff1_144=diff1_144_1.dropna()
# Plot the differenced series to judge whether it is stationary
fig=plt.figure(figsize=(12,8))
ax=fig.add_subplot(111)
diff1_144.plot(ax=ax)
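Judging stationarity from a plot is subjective, so a rough numerical check can help. The following is a heuristic sketch, not a formal test and not the method used in this article: it compares how much the rolling mean wanders before and after differencing. The synthetic series, the drift of 0.5, the window of 50, and the helper name `rolling_mean_spread` are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)
steps = rng.normal(loc=0.5, scale=1.0, size=500)   # drift of 0.5 is an assumption
walk = pd.Series(np.cumsum(steps))                 # trending, non-stationary
diffed = walk.diff(1).dropna()                     # first difference

def rolling_mean_spread(series, window=50):
    """Standard deviation of the rolling mean: large when the level drifts."""
    return series.rolling(window).mean().dropna().std()

print(rolling_mean_spread(walk), rolling_mean_spread(diffed))
```

A much smaller spread after differencing suggests that the level is no longer drifting; a formal unit-root test would be needed for a rigorous conclusion.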
# Plot the ACF and PACF of the differenced series
fig=plt.figure(figsize=(12,8))
ax1=fig.add_subplot(211)
fig=sm.graphics.tsa.plot_acf(diff1_144,lags=40,ax=ax1)
ax2=fig.add_subplot(212)
fig=sm.graphics.tsa.plot_pacf(diff1_144,lags=40,ax=ax2)
plt.show()
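To make clear what the ACF plot displays, here is a minimal sketch that computes the sample autocorrelation by hand with NumPy. The function name `sample_acf` and the synthetic white-noise input are assumptions for illustration; the band of 1.96/sqrt(n) is the usual approximate 95% band under a white-noise assumption and may differ from the band a plotting library draws.

```python
import numpy as np

def sample_acf(x, nlags):
    """Sample autocorrelation r_0..r_nlags (r_0 is always 1)."""
    x = np.asarray(x, dtype=float)
    x = x - x.mean()
    denom = np.dot(x, x)
    return np.array([np.dot(x[: len(x) - k], x[k:]) / denom for k in range(nlags + 1)])

rng = np.random.default_rng(2)
noise = rng.normal(size=400)
r = sample_acf(noise, nlags=40)

band = 1.96 / np.sqrt(len(noise))            # approximate 95% band under white noise
outside = int(np.sum(np.abs(r[1:]) > band))  # number of lags falling outside the band
print(r[0], band, outside)
```

For a stationary series with little remaining structure, only a few lags are expected to fall outside the band by chance.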
# Choose the model order by AIC, BIC and HQIC (smaller is better for all three)
arma_mod01=sm.tsa.ARMA(diff1_144,(0,1)).fit()
print(arma_mod01.aic,arma_mod01.bic,arma_mod01.hqic)
arma_mod10=sm.tsa.ARMA(diff1_144,(1,0)).fit()
print(arma_mod10.aic,arma_mod10.bic,arma_mod10.hqic)
arma_mod60=sm.tsa.ARMA(diff1_144,(6,0)).fit()
print(arma_mod60.aic,arma_mod60.bic,arma_mod60.hqic)
arma_mod61=sm.tsa.ARMA(diff1_144,(6,1)).fit()
print(arma_mod61.aic,arma_mod61.bic,arma_mod61.hqic)
8782.801951424293 8795.000275694605 8787.618254792987
8781.294547949288 8793.4928722196 8786.110851317982
8761.522020813209 8794.05088553404 8774.365496463062
8758.668160226449 8795.263133037382 8773.117070332533
# ARMA(6,1) has the lowest AIC and HQIC (BIC is marginally lower for ARMA(1,0)), so ARMA(6,1) is chosen
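As a sketch of how the three criteria work, the block below states their standard textbook formulas and then picks the best order per criterion from the values printed above (rounded to two decimals). The function name `info_criteria` is hypothetical, and which parameters a given library counts in k is an implementation detail not covered here.

```python
import math

def info_criteria(llf, k, n):
    """AIC, BIC and HQIC from a log-likelihood llf, k parameters and n observations."""
    aic = 2 * k - 2 * llf
    bic = k * math.log(n) - 2 * llf
    hqic = 2 * k * math.log(math.log(n)) - 2 * llf
    return aic, bic, hqic

# (AIC, BIC, HQIC) per (p, q) order, copied from the printed results and rounded
scores = {
    (0, 1): (8782.80, 8795.00, 8787.62),
    (1, 0): (8781.29, 8793.49, 8786.11),
    (6, 0): (8761.52, 8794.05, 8774.37),
    (6, 1): (8758.67, 8795.26, 8773.12),
}

names = ["AIC", "BIC", "HQIC"]
# For each criterion, the order with the smallest value wins
best = {name: min(scores, key=lambda order, i=i: scores[order][i])
        for i, name in enumerate(names)}
print(best)
```

BIC penalizes extra parameters more heavily than AIC, which is why it can prefer a smaller model than the other two criteria on the same fits.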
# Compute the residuals
resid=arma_mod61.resid
# Model diagnostics
# ACF and PACF of the residuals
fig=plt.figure(figsize=(12,8))
ax1=fig.add_subplot(211)
fig=sm.graphics.tsa.plot_acf(resid.values.squeeze(),lags=40,ax=ax1) # squeeze() reduces the array to 1-D
ax2=fig.add_subplot(212)
fig=sm.graphics.tsa.plot_pacf(resid,lags=40,ax=ax2)
plt.show()
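# The residual ACF shows no significant autocorrelation, so the residuals can be treated as white noise
# Durbin-Watson test
print(sm.stats.durbin_watson(resid.values))
# The closer DW is to 2, the weaker the first-order autocorrelation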
2.0010218978025396
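The Durbin-Watson statistic is simple enough to compute by hand, which shows why values near 2 indicate uncorrelated residuals. This is a minimal sketch on synthetic residuals (the seed and series are assumptions), using the standard definition DW = sum((e_t - e_{t-1})^2) / sum(e_t^2).

```python
import numpy as np

def durbin_watson(resid):
    """DW = sum((e_t - e_{t-1})^2) / sum(e_t^2)."""
    e = np.asarray(resid, dtype=float)
    return np.sum(np.diff(e) ** 2) / np.sum(e ** 2)

rng = np.random.default_rng(3)
white = rng.normal(size=2000)            # uncorrelated residuals
trending = np.cumsum(white)              # strongly positively autocorrelated

print(durbin_watson(white))      # close to 2
print(durbin_watson(trending))   # close to 0
```

Since DW is approximately 2(1 - r_1), where r_1 is the lag-1 autocorrelation, it only detects first-order correlation; higher lags need a test such as Ljung-Box.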
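# Ljung-Box test; nlags=40 is passed explicitly so the result matches range(1,41) below
r,q,p=sm.tsa.acf(resid.values.squeeze(),nlags=40,qstat=True)
d=np.c_[range(1,41),r[1:],q,p]
table=pd.DataFrame(d,columns=['lag','AC','Q','Prob(>Q)'])
print(table.set_index('lag'))
# Last column: every p-value is above 0.05 (the first 12 rows and all later ones), so the white-noise hypothesis is not rejected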
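The Q column of the Ljung-Box table can be reproduced from the AC column with the standard formula Q(m) = n(n+2) * sum over k = 1..m of r_k^2 / (n - k). The sketch below applies it to the first three autocorrelations from the table in the output below. The sample size n = 431 is an assumption: it is not stated in the article and is used only because it reproduces the printed Q values. The function name `ljung_box_q` is hypothetical.

```python
import numpy as np

def ljung_box_q(acf_values, n):
    """Cumulative Ljung-Box Q for lags 1..m, given autocorrelations r_1..r_m."""
    r = np.asarray(acf_values, dtype=float)
    lags = np.arange(1, len(r) + 1)
    return n * (n + 2) * np.cumsum(r ** 2 / (n - lags))

# First three residual autocorrelations, copied from the AC column of the table
ac = [-0.002054, -0.002387, 0.002796]
n_obs = 431   # ASSUMPTION: not stated in the article; chosen because it reproduces Q
q = ljung_box_q(ac, n_obs)
print(np.round(q, 6))
```

Each Q(m) is compared with a chi-square distribution to obtain the Prob(>Q) column; a large p-value means the autocorrelations up to lag m are jointly indistinguishable from zero.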
Output:
"C:\Program Files\Python36\pythonw.exe" C:/Users/88304/Desktop/arima/ts_3.py
time
2017-04-01 00:00:00.000 597816.0
2017-04-01 00:10:00.000 583104.0
2017-04-01 00:20:00.000 572465.0
2017-04-01 00:30:00.000 561279.0
2017-04-01 00:40:00.000 551589.0
...
2017-04-05 23:20:00.700 NaN
2017-04-05 23:30:00.705 NaN
2017-04-05 23:40:00.710 NaN
2017-04-05 23:50:00.715 NaN
2017-04-06 00:00:00.720 NaN
Name: number, Length: 721, dtype: float64
C:\Program Files\Python36\lib\site-packages\statsmodels\tsa\base\tsa_model.py:162: ValueWarning: No frequency information was provided, so inferred frequency 10T will be used.
% freq, ValueWarning)
This problem is unconstrained.
RUNNING THE L-BFGS-B CODE
* * *
Machine precision = 2.220D-16
N = 8 M = 12
At X0 0 variables are exactly at the bounds
At iterate 0 f= 1.01400D+01 |proj g|= 1.14344D-03
At iterate 5 f= 1.01400D+01 |proj g|= 2.06057D-05
At iterate 10 f= 1.01400D+01 |proj g|= 1.06581D-06
At iterate 15 f= 1.01400D+01 |proj g|= 1.42109D-06
At iterate 20 f= 1.01400D+01 |proj g|= 1.77636D-06
At iterate 25 f= 1.01400D+01 |proj g|= 5.50671D-06
At iterate 30 f= 1.01400D+01 |proj g|= 3.19744D-06
At iterate 35 f= 1.01400D+01 |proj g|= 2.27907D-04
At iterate 40 f= 1.01400D+01 |proj g|= 2.45848D-04
At iterate 45 f= 1.01400D+01 |proj g|= 6.91003D-05
At iterate 50 f= 1.01400D+01 |proj g|= 7.44294D-05
At iterate 55 f= 1.01400D+01 |proj g|= 5.32907D-06
At iterate 60 f= 1.01400D+01 |proj g|= 3.55271D-07
* * *
Tit = total number of iterations
Tnf = total number of function evaluations
Tnint = total number of segments explored during Cauchy searches
Skip = number of BFGS updates skipped
Nact = number of active bounds at final generalized Cauchy point
Projg = norm of the final projected gradient
F = final function value
* * *
N Tit Tnf Tnint Skip Nact Projg F
8 61 89 1 0 0 3.553D-07 1.014D+01
F = 10.139982273393125
CONVERGENCE: REL_REDUCTION_OF_F_<=_FACTR*EPSMCH
Cauchy time 0.000E+00 seconds.
Subspace minimization time 0.000E+00 seconds.
Line search time 0.000E+00 seconds.
Total User time 0.000E+00 seconds.
8758.664719664874 8795.259692475807 8773.113629770958
C:\Program Files\Python36\lib\site-packages\statsmodels\tsa\stattools.py:572: FutureWarning: fft=True will become the default in a future version of statsmodels. To suppress this warning, explicitly set fft=False.
2.000999340483735
FutureWarning
AC Q Prob(>Q)
lag
1.0 -0.002054 0.001831 0.965873
2.0 -0.002387 0.004308 0.997848
3.0 0.002796 0.007718 0.999820
4.0 0.009734 0.049130 0.999703
5.0 0.027399 0.378006 0.995914
6.0 0.029302 0.755043 0.993225
7.0 0.029967 1.150302 0.992027
8.0 0.009435 1.189574 0.996744
9.0 0.019522 1.358117 0.998069
10.0 -0.044781 2.247036 0.994072
11.0 -0.040868 2.989184 0.990868
12.0 0.011338 3.046438 0.995207
13.0 0.056177 4.455400 0.985313
14.0 -0.101291 9.047087 0.828022
15.0 0.009775 9.089956 0.872767
16.0 -0.136811 17.506946 0.353548
17.0 -0.040941 18.262543 0.372462
18.0 0.076638 20.916581 0.283643
19.0 -0.032963 21.408772 0.314660
20.0 -0.036033 21.998325 0.340602
21.0 -0.054510 23.350834 0.325562
22.0 -0.073115 25.790054 0.260805
23.0 -0.095540 29.965209 0.150403
24.0 -0.000846 29.965537 0.185897
25.0 -0.017794 30.111071 0.220157
26.0 -0.036573 30.727426 0.238577
27.0 -0.020433 30.920279 0.274425
28.0 -0.015074 31.025498 0.315949
29.0 -0.007243 31.049854 0.363078
30.0 0.028095 31.417193 0.395111
31.0 0.014408 31.514052 0.440538
32.0 0.034205 32.061290 0.463707
33.0 0.065449 34.069847 0.415955
34.0 0.002278 34.072286 0.464262
35.0 0.038981 34.788397 0.478274
36.0 0.011688 34.852937 0.523033
37.0 0.023137 35.106490 0.558062
38.0 0.009311 35.147656 0.602066
39.0 0.015067 35.255730 0.641377
40.0 -0.001959 35.257562 0.683451
time
2017-04-01 00:00:00.000 NaN
2017-04-01 00:10:00.000 NaN
2017-04-01 00:20:00.000 NaN
2017-04-01 00:30:00.000 NaN
2017-04-01 00:40:00.000 NaN
...
2017-04-05 23:20:00.700 -15984.0
2017-04-05 23:30:00.705 -17059.0
2017-04-05 23:40:00.710 -25804.0
2017-04-05 23:50:00.715 -4121.0
2017-04-06 00:00:00.720 NaN
Name: number, Length: 721, dtype: float64
預測值
2017-04-04 23:50:00 663671.544399
2017-04-05 00:00:00 645407.867736
dtype: float64