Machine Learning: Time Series Forecasting with ARIMA

1 Introduction

In this article we build a reliable time series forecast. We first introduce and discuss the concepts of autocorrelation, stationarity and seasonality, and then apply one of the most widely used time series forecasting methods, ARIMA.

2 Overview

Time series offer the opportunity to forecast future values. Based on previous observations, a time series can be used to forecast trends in economics, weather, capacity planning and more. The specific properties of time series data mean that specialized statistical methods are usually required.
Among time series models, ARIMA extends the ARMA model with a differencing step.

3 Python implementation

(1) Test whether the series is stationary and whether it is white noise; if it is not stationary, transform it until it is.
(2) The data in this example is periodic, so take a first difference followed by a 144-step difference.
(3) Inspect the ACF and PACF plots of the differenced series; after differencing, the series is stationary.
(4) Select the model order using AIC, BIC and HQIC.
(5) Forecast with the selected model.
(6) Invert the differencing: the forecasts are on the differenced scale, so they must be transformed back to the original scale.

#-*- coding: utf-8 -*-

from __future__ import print_function
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm

# Read the Excel data
discfile = 'data_test.xls'
data = pd.read_excel(discfile,index_col=0)
data=data['number']
data.head()

data.plot(figsize=(12,8))
print(data)

# Apply a first difference, then a 144-step difference
diff_1 = data.diff(1)
diff1 = diff_1.dropna()
diff1_144_1 = diff_1-diff_1.shift(144)
diff1_144 = diff1_144_1.dropna()
#print(diff1_144_1)
# Check stationarity: compute the ACF and PACF
fig1 = plt.figure(figsize=(12,8))
ax1=fig1.add_subplot(111)
sm.graphics.tsa.plot_acf(diff1_144,lags=40,ax=ax1)
fig2 = plt.figure(figsize=(12,8))
ax2=fig2.add_subplot(111)
sm.graphics.tsa.plot_pacf(diff1_144,lags=40, ax=ax2)

# Select the model order by AIC, BIC and HQIC; smaller is better for all three
# arma_mod01 = sm.tsa.ARMA(diff1_144,(0,1)).fit()
# print(arma_mod01.aic,arma_mod01.bic,arma_mod01.hqic)
# arma_mod10 = sm.tsa.ARMA(diff1_144,(1,0)).fit()
# print(arma_mod10.aic,arma_mod10.bic,arma_mod10.hqic)
# arma_mod60 = sm.tsa.ARMA(diff1_144,(6,0)).fit()
# print(arma_mod60.aic,arma_mod60.bic,arma_mod60.hqic)
arma_mod61 = sm.tsa.ARMA(diff1_144,(6,1)).fit()
print(arma_mod61.aic,arma_mod61.bic,arma_mod61.hqic)
# Compute the residuals
resid = arma_mod61.resid
# Inspect the ACF and PACF of the residuals; the residual ACF cuts off, so the residuals are white noise
fig = plt.figure(figsize=(12,8))
ax1 = fig.add_subplot(211)
fig = sm.graphics.tsa.plot_acf(resid.values.squeeze(), lags=40, ax=ax1)
ax2 = fig.add_subplot(212)
fig = sm.graphics.tsa.plot_pacf(resid, lags=40, ax=ax2)

print(sm.stats.durbin_watson(arma_mod61.resid.values))
# Durbin-Watson test on the residuals; the closer DW is to 2, the less autocorrelated they are
r,q,p = sm.tsa.acf(resid.values.squeeze(), qstat=True)
d = np.c_[range(1,41), r[1:], q, p]
table = pd.DataFrame(d, columns=['lag', "AC", "Q", "Prob(>Q)"])
print(table.set_index('lag'))

# Forecast with the model
predict_data=arma_mod61.predict('2017/4/4 23:50','2017/4/6 00:00',dynamic=False)
# print(predict_data)
# print(diff_1)
# The forecasts are on the differenced scale, so the results must be transformed back
# Invert the 144-step difference
diff1_144_shift=diff_1.shift(144)
# print('print diff1_144_shift')
print(diff1_144_shift)
diff_recover_144=predict_data.add(diff1_144_shift)
# Invert the first difference
diff1_shift=data.shift(1)
diff_recover_1=diff_recover_144.add(diff1_shift)
diff_recover_1=diff_recover_1.dropna() # final restored forecasts
print('Forecast values')
print(diff_recover_1)

# Plot the actual values, restored forecasts and differenced-scale forecasts
fig, ax = plt.subplots(figsize=(12, 8))
ax = data.loc['2017-04-01':].plot(ax=ax)
ax = diff_recover_1.plot(ax=ax)
fig = arma_mod61.plot_predict('2017/4/2 23:50', '2017/4/6 00:00', dynamic=False, ax=ax, plot_insample=False)
plt.show()
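The two restore steps at the end of the script (adding back the 144-step shift, then the 1-step shift) can be sanity-checked on a toy series. The sketch below uses a period of 3 in place of the 144-step season, with made-up values:

```python
import pandas as pd

# Toy series with period 3 (standing in for the 144-step season; values made up).
s = pd.Series([10.0, 20.0, 30.0, 12.0, 22.0, 33.0, 15.0, 25.0, 37.0])

d1 = s.diff(1)           # first difference
d1_3 = d1 - d1.shift(3)  # 3-step difference of the first difference

# Invert in reverse order: undo the seasonal difference, then the first difference.
recovered = (d1_3 + d1.shift(3)) + s.shift(1)

# Wherever both shifts are defined, the original values come back exactly.
print(recovered.dropna().tolist() == s.iloc[4:].tolist())  # True
```

In the script, `predict_data` plays the role of `d1_3`, which is why the restore adds `diff_1.shift(144)` first and `data.shift(1)` second.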


4 Code walkthrough

The data is a `number` column indexed by 10-minute timestamps (see the printed series in the output below).

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.api as sm

# Read the data
data=pd.read_excel("data_test.xls",index_col=0)
data=data['number']
data.plot(figsize=(12,8)) # plot of the raw series

Differencing, which in data analysis is typically applied along the time dimension, simply means subtracting each value from the next one.

When the spacing is equal, subtracting each value from the next is called the "first difference"; doing the same operation again, i.e. differencing the first-differenced series once more, gives the "second difference".

"Equal spacing" means the gap between the rows being subtracted is constant: for example, subtract consecutive rows 1,2,3,...,10, or every second row (2,4,6,8,10 or 1,3,5,7,9), but not an irregular pattern such as rows 1,5,6,10. (Other combinations such as 3,6,9 also work; the spacing just has to be constant.)

The purpose of differencing is to damp the irregular fluctuations in the data so that the series becomes more stationary.
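One detail worth noting: pandas' `diff(k)` computes the k-step difference `s[t] - s[t-k]`, not the k-th-order difference; a k-th-order difference is obtained by applying `diff(1)` k times. A small illustration with made-up values:

```python
import pandas as pd

s = pd.Series([1.0, 3.0, 6.0, 10.0, 15.0])

# k-step difference: s[t] - s[t-k]
print(s.diff(2).dropna().tolist())          # [5.0, 7.0, 9.0]

# second-order difference: apply the 1-step difference twice
print(s.diff(1).diff(1).dropna().tolist())  # [1.0, 1.0, 1.0]
```

This is why the script writes the seasonal step as `diff_1 - diff_1.shift(144)`: it is a 144-step difference applied on top of the first difference, not a 144-th-order difference.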

# The data is periodic: first difference, then a 144-step difference
diff_1=data.diff(1)
diff1=diff_1.dropna()
diff1_144_1=diff_1-diff_1.shift(144)
diff1_144=diff1_144_1.dropna()
# Plot to judge stationarity
fig=plt.figure(figsize=(12,8))
ax=fig.add_subplot(111)
diff1_144.plot(ax=ax)


# ACF and PACF of the differenced series
fig=plt.figure(figsize=(12,8))
ax1=fig.add_subplot(211)
fig=sm.graphics.tsa.plot_acf(diff1_144,lags=40,ax=ax1)
ax2=fig.add_subplot(212)
fig=sm.graphics.tsa.plot_pacf(diff1_144,lags=40,ax=ax2)
plt.show()


# Select the model order by AIC, BIC and HQIC; smaller is better for all three
arma_mod01=sm.tsa.ARMA(diff1_144,(0,1)).fit()
print(arma_mod01.aic,arma_mod01.bic,arma_mod01.hqic)
arma_mod10=sm.tsa.ARMA(diff1_144,(1,0)).fit()
print(arma_mod10.aic,arma_mod10.bic,arma_mod10.hqic)
arma_mod60=sm.tsa.ARMA(diff1_144,(6,0)).fit()
print(arma_mod60.aic,arma_mod60.bic,arma_mod60.hqic)
arma_mod61=sm.tsa.ARMA(diff1_144,(6,1)).fit()
print(arma_mod61.aic,arma_mod61.bic,arma_mod61.hqic)

8782.801951424293 8795.000275694605 8787.618254792987
8781.294547949288 8793.4928722196 8786.110851317982
8761.522020813209 8794.05088553404 8774.365496463062
8758.668160226449 8795.263133037382 8773.117070332533
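The order selection above compares the criteria by eye; the smallest-AIC order can also be picked programmatically. A minimal sketch using the AIC values printed above (rounded):

```python
# AIC values from the four candidate fits above, keyed by (p, q).
aics = {
    (0, 1): 8782.80,
    (1, 0): 8781.29,
    (6, 0): 8761.52,
    (6, 1): 8758.67,
}
best = min(aics, key=aics.get)  # order with the smallest AIC
print(best)  # (6, 1)
```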

# Choose ARMA(6,1)
# Compute the residuals
resid=arma_mod61.resid

# Model diagnostics
# ACF and PACF of the residuals
fig=plt.figure(figsize=(12,8))
ax1=fig.add_subplot(211)
fig=sm.graphics.tsa.plot_acf(resid.values.squeeze(),lags=40,ax=ax1) # squeeze() flattens to a 1-D array
ax2=fig.add_subplot(212)
fig=sm.graphics.tsa.plot_pacf(resid,lags=40,ax=ax2)
plt.show()
# The residual ACF cuts off, so the residual series is white noise


# Durbin-Watson test
print(sm.stats.durbin_watson(resid.values))
# The closer DW is to 2, the less first-order autocorrelation

2.0010218978025396
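For reference, the statistic `sm.stats.durbin_watson` computes is simply sum((e_t - e_{t-1})^2) / sum(e_t^2). A self-contained sketch (the helper name `durbin_watson` here is ours):

```python
import numpy as np

def durbin_watson(e):
    # DW = sum of squared successive differences over sum of squares;
    # ~2 indicates no first-order autocorrelation.
    e = np.asarray(e, dtype=float)
    return np.sum(np.diff(e) ** 2) / np.sum(e ** 2)

# A constant nonzero run is perfectly positively correlated: DW = 0.
print(durbin_watson([1.0, 1.0, 1.0, 1.0]))
# Alternating residuals are strongly negatively correlated: DW well above 2.
print(durbin_watson([1.0, -1.0, 1.0, -1.0, 1.0, -1.0]))
```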

# Ljung-Box test
r,q,p=sm.tsa.acf(resid.values.squeeze(),qstat=True)
d=np.c_[range(1,41),r[1:],q,p]
table=pd.DataFrame(d,columns=['lag','AC','Q','Prob(>Q)'])
print(table.set_index('lag'))
# In the last column, the first 12 p-values exceed 0.05, so the residuals are white noise
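The Q column returned by `sm.tsa.acf(..., qstat=True)` is the Ljung-Box statistic Q(h) = n(n+2) * sum_{k=1..h} r_k^2 / (n-k), where r_k is the sample autocorrelation at lag k. A self-contained numpy sketch (the helper name `ljung_box_q` is ours):

```python
import numpy as np

def ljung_box_q(x, h):
    # Ljung-Box Q statistic over lags 1..h.
    x = np.asarray(x, dtype=float)
    n = len(x)
    xc = x - x.mean()
    denom = np.sum(xc ** 2)
    q = 0.0
    for k in range(1, h + 1):
        r_k = np.sum(xc[k:] * xc[:-k]) / denom  # sample autocorrelation at lag k
        q += r_k ** 2 / (n - k)
    return n * (n + 2) * q

rng = np.random.default_rng(0)
white = rng.standard_normal(200)
walk = np.cumsum(white)  # a random walk is strongly autocorrelated

# White noise yields a small Q; the random walk yields a much larger one.
print(ljung_box_q(white, 12), ljung_box_q(walk, 12))
```

Large Q (small p-value) rejects the hypothesis that the series is white noise, which is why residual p-values above 0.05 are what we want here.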

Output:

"C:\Program Files\Python36\pythonw.exe" C:/Users/88304/Desktop/arima/ts_3.py
time
2017-04-01 00:00:00.000    597816.0
2017-04-01 00:10:00.000    583104.0
2017-04-01 00:20:00.000    572465.0
2017-04-01 00:30:00.000    561279.0
2017-04-01 00:40:00.000    551589.0
                             ...   
2017-04-05 23:20:00.700         NaN
2017-04-05 23:30:00.705         NaN
2017-04-05 23:40:00.710         NaN
2017-04-05 23:50:00.715         NaN
2017-04-06 00:00:00.720         NaN
Name: number, Length: 721, dtype: float64
C:\Program Files\Python36\lib\site-packages\statsmodels\tsa\base\tsa_model.py:162: ValueWarning: No frequency information was provided, so inferred frequency 10T will be used.
  % freq, ValueWarning)
(L-BFGS-B optimizer log omitted; it converged after 61 iterations with final objective F = 10.139982)

8758.664719664874 8795.259692475807 8773.113629770958
2.000999340483735
C:\Program Files\Python36\lib\site-packages\statsmodels\tsa\stattools.py:572: FutureWarning: fft=True will become the default in a future version of statsmodels. To suppress this warning, explicitly set fft=False.
  FutureWarning
            AC          Q  Prob(>Q)
lag                                
1.0  -0.002054   0.001831  0.965873
2.0  -0.002387   0.004308  0.997848
3.0   0.002796   0.007718  0.999820
4.0   0.009734   0.049130  0.999703
5.0   0.027399   0.378006  0.995914
6.0   0.029302   0.755043  0.993225
7.0   0.029967   1.150302  0.992027
8.0   0.009435   1.189574  0.996744
9.0   0.019522   1.358117  0.998069
10.0 -0.044781   2.247036  0.994072
11.0 -0.040868   2.989184  0.990868
12.0  0.011338   3.046438  0.995207
13.0  0.056177   4.455400  0.985313
14.0 -0.101291   9.047087  0.828022
15.0  0.009775   9.089956  0.872767
16.0 -0.136811  17.506946  0.353548
17.0 -0.040941  18.262543  0.372462
18.0  0.076638  20.916581  0.283643
19.0 -0.032963  21.408772  0.314660
20.0 -0.036033  21.998325  0.340602
21.0 -0.054510  23.350834  0.325562
22.0 -0.073115  25.790054  0.260805
23.0 -0.095540  29.965209  0.150403
24.0 -0.000846  29.965537  0.185897
25.0 -0.017794  30.111071  0.220157
26.0 -0.036573  30.727426  0.238577
27.0 -0.020433  30.920279  0.274425
28.0 -0.015074  31.025498  0.315949
29.0 -0.007243  31.049854  0.363078
30.0  0.028095  31.417193  0.395111
31.0  0.014408  31.514052  0.440538
32.0  0.034205  32.061290  0.463707
33.0  0.065449  34.069847  0.415955
34.0  0.002278  34.072286  0.464262
35.0  0.038981  34.788397  0.478274
36.0  0.011688  34.852937  0.523033
37.0  0.023137  35.106490  0.558062
38.0  0.009311  35.147656  0.602066
39.0  0.015067  35.255730  0.641377
40.0 -0.001959  35.257562  0.683451
time
2017-04-01 00:00:00.000        NaN
2017-04-01 00:10:00.000        NaN
2017-04-01 00:20:00.000        NaN
2017-04-01 00:30:00.000        NaN
2017-04-01 00:40:00.000        NaN
                            ...   
2017-04-05 23:20:00.700   -15984.0
2017-04-05 23:30:00.705   -17059.0
2017-04-05 23:40:00.710   -25804.0
2017-04-05 23:50:00.715    -4121.0
2017-04-06 00:00:00.720        NaN
Name: number, Length: 721, dtype: float64
Forecast values
2017-04-04 23:50:00    663671.544399
2017-04-05 00:00:00    645407.867736
dtype: float64
