Machine Learning: Time Series Forecasting with ARIMA

1 Introduction

In this article we will build a reliable time series forecast. We first introduce and discuss the concepts of autocorrelation, stationarity, and seasonality, and then apply one of the most commonly used time series forecasting methods, known as ARIMA.

2 Overview

Time series offer the opportunity to forecast future values. Based on previous observations, time series can be used to forecast trends in economics, weather, and capacity planning, to name just a few. The specific properties of time series data mean that specialized statistical methods are usually required.
Among time series models, ARIMA extends the ARMA model by adding a differencing step.
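
For reference, an ARIMA(p, d, q) model is simply an ARMA(p, q) model fitted to the d-times differenced series. In backshift-operator notation (a standard textbook formulation, not specific to this article's data), it can be written as:

\left(1 - \sum_{i=1}^{p} \phi_i B^i\right)(1 - B)^d y_t = \left(1 + \sum_{j=1}^{q} \theta_j B^j\right)\varepsilon_t

where B is the backshift operator (B y_t = y_{t-1}), d is the number of differencing steps, and \varepsilon_t is white noise.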

3 Python Implementation

(1) Check whether the time series is stationary and not a white-noise series; if it is not stationary, transform it until it is.
(2) The data in this example is periodic, so take a first-order difference followed by a 144-step difference.
(3) Inspect the ACF and PACF plots of the differenced series; the differenced series is stationary.
(4) Choose the model order based on AIC, BIC, and HQIC.
(5) Forecast with the fitted model.
(6) Invert the differencing: since the model was fitted on the differenced series, the forecasts are for the differenced values and must be transformed back.

#-*- coding: utf-8 -*-

from __future__ import print_function
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import statsmodels.api as sm

#Read the Excel data
discfile = 'data_test.xls'
data = pd.read_excel(discfile,index_col=0)
data=data['number']
data.head()

data.plot(figsize=(12,8))
print(data)

#Apply a first-order difference followed by a 144-step difference
diff_1 = data.diff(1)
diff1 = diff_1.dropna()
diff1_144_1 = diff_1-diff_1.shift(144)
diff1_144 = diff1_144_1.dropna()
#print(diff1_144_1)
#Check whether the series is stationary: compute the ACF and PACF
fig1 = plt.figure(figsize=(12,8))
ax1=fig1.add_subplot(111)
sm.graphics.tsa.plot_acf(diff1_144,lags=40,ax=ax1)
fig2 = plt.figure(figsize=(12,8))
ax2=fig2.add_subplot(111)
sm.graphics.tsa.plot_pacf(diff1_144,lags=40, ax=ax2)

#Choose the model order based on AIC, BIC, and HQIC (smaller is better for all three)
# arma_mod01 = sm.tsa.ARMA(diff1_144,(0,1)).fit()
# print(arma_mod01.aic,arma_mod01.bic,arma_mod01.hqic)
# arma_mod10 = sm.tsa.ARMA(diff1_144,(1,0)).fit()
# print(arma_mod10.aic,arma_mod10.bic,arma_mod10.hqic)
# arma_mod60 = sm.tsa.ARMA(diff1_144,(6,0)).fit()
# print(arma_mod60.aic,arma_mod60.bic,arma_mod60.hqic)
arma_mod61 = sm.tsa.ARMA(diff1_144,(6,1)).fit()
print(arma_mod61.aic,arma_mod61.bic,arma_mod61.hqic)
#Compute the residuals
resid = arma_mod61.resid
#Inspect the ACF and PACF of the residuals; the residual ACF shows no significant autocorrelation, so the residuals are white noise
fig = plt.figure(figsize=(12,8))
ax1 = fig.add_subplot(211)
fig = sm.graphics.tsa.plot_acf(resid.values.squeeze(), lags=40, ax=ax1)
ax2 = fig.add_subplot(212)
fig = sm.graphics.tsa.plot_pacf(resid, lags=40, ax=ax2)

print(sm.stats.durbin_watson(arma_mod61.resid.values))
# Durbin-Watson test on the residuals; the closer DW is to 2, the less autocorrelated they are
r,q,p = sm.tsa.acf(resid.values.squeeze(), qstat=True)
d = np.c_[range(1,41), r[1:], q, p]
table = pd.DataFrame(d, columns=['lag', "AC", "Q", "Prob(>Q)"])
print(table.set_index('lag'))

# Forecast with the fitted model
predict_data=arma_mod61.predict('2017/4/4 23:50','2017/4/6 00:00',dynamic=False)
# print(predict_data)
# print(diff_1)
# The forecasts are for the differenced series, so the differencing must be inverted
# Invert the 144-step difference
diff1_144_shift=diff_1.shift(144)
# print('print diff1_144_shift')
print(diff1_144_shift)
diff_recover_144=predict_data.add(diff1_144_shift)
# Invert the first-order difference
diff1_shift=data.shift(1)
diff_recover_1=diff_recover_144.add(diff1_shift)
diff_recover_1=diff_recover_1.dropna() # final recovered forecast values
print('Forecast values')
print(diff_recover_1)

# Plot the actual values, the recovered forecasts, and the model's forecast of the differenced series
fig, ax = plt.subplots(figsize=(12, 8))
ax = data.loc['2017-04-01':].plot(ax=ax)
ax = diff_recover_1.plot(ax=ax)
fig = arma_mod61.plot_predict('2017/4/2 23:50', '2017/4/6 00:00', dynamic=False, ax=ax, plot_insample=False)
plt.show()
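
Note: sm.tsa.ARMA and the results object's plot_predict method belong to the older statsmodels API and were removed in statsmodels 0.13. If you are on a newer version, a rough equivalent (a sketch assuming the same diff1_144 series and imports as above) looks like this:

# Sketch for newer statsmodels (>= 0.13), where sm.tsa.ARMA no longer exists.
# An ARMA(6,1) on the differenced series becomes ARIMA with order=(6, 0, 1).
from statsmodels.tsa.arima.model import ARIMA
from statsmodels.graphics.tsaplots import plot_predict

arima_mod61 = ARIMA(diff1_144, order=(6, 0, 1)).fit()
print(arima_mod61.aic, arima_mod61.bic, arima_mod61.hqic)

# plot_predict is now a standalone function that takes the fitted results object
fig, ax = plt.subplots(figsize=(12, 8))
plot_predict(arima_mod61, '2017/4/2 23:50', '2017/4/6 00:00', dynamic=False, ax=ax)
plt.show()

The coefficient estimates can differ slightly from the old ARMA implementation, but the workflow (fit, check residuals, forecast, invert the differencing) is unchanged.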


4 Code Walkthrough

The data looks like this.
[Figure: preview of the raw data]

import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import statsmodels.api as sm

#Read the data
data=pd.read_excel("data_test.xls",index_col=0)
data=data['number']
data.plot(figsize=(12,8)) #plot of the original series

[Figure: plot of the original series]
Differencing is commonly used in analyses where time is the statistical dimension; it simply means subtracting the previous value from the next one.

When the spacing between observations is equal, subtracting each value from the next is called a "first-order difference"; repeating the same operation once more, i.e. differencing the first-order differences again, is called a "second-order difference".

"Equal spacing" means, for example, differencing consecutive rows 1, 2, 3, ..., 10, or every other row such as 2, 4, 6, 8, 10 or 1, 3, 5, 7, 9, but not irregular jumps such as 1, 5, 6, 10. (Other patterns such as 3, 6, 9 also work; whatever the pattern, the spacing must be constant.)

The effect of differencing is to dampen irregular fluctuations in the data so that the resulting curve is smoother and closer to stationary.
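
To make the differencing and its inversion concrete, here is a minimal toy sketch (the numbers are made up purely for illustration):

# Toy example: first-order differencing and its inversion on a made-up series
import pandas as pd

s = pd.Series([10, 13, 12, 15, 18])
d1 = s.diff(1)               # s[t] - s[t-1]: NaN, 3, -1, 3, 3
recovered = d1 + s.shift(1)  # add the previous value back to undo the difference
print(recovered)             # matches s from the second element onward
# The same idea inverts a k-step difference: d_k + s.shift(k)

This is exactly the logic used at the end of the script, where the 144-step and first-order differences are added back in turn.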

#The data is periodic: first-order difference, then a 144-step difference
diff_1=data.diff(1)
diff1=diff_1.dropna()
diff1_144_1=diff_1-diff_1.shift(144)
diff1_144=diff1_144_1.dropna()
#Plot to judge whether the series is stationary
fig=plt.figure(figsize=(12,8))
ax=fig.add_subplot(111)
diff1_144.plot(ax=ax)

[Figure: plot of the differenced series]
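
Eyeballing the plot is a quick check; a more formal option is the augmented Dickey-Fuller (ADF) unit-root test. A sketch using statsmodels' adfuller (not part of the original script):

# ADF test on the differenced series; a small p-value suggests stationarity
from statsmodels.tsa.stattools import adfuller

adf_stat, p_value, used_lag, n_obs, crit_values, icbest = adfuller(diff1_144)
print('ADF statistic:', adf_stat)
print('p-value:', p_value)          # e.g. p < 0.05 rejects the unit-root hypothesis
print('critical values:', crit_values)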

#ACF and PACF plots of the differenced series
fig=plt.figure(figsize=(12,8))
ax1=fig.add_subplot(211)
fig=sm.graphics.tsa.plot_acf(diff1_144,lags=40,ax=ax1)
ax2=fig.add_subplot(212)
fig=sm.graphics.tsa.plot_pacf(diff1_144,lags=40,ax=ax2)
plt.show()

[Figures: ACF and PACF of the differenced series]

#Choose the model order based on AIC, BIC, and HQIC (smaller is better for all three)
arma_mod01=sm.tsa.ARMA(diff1_144,(0,1)).fit()
print(arma_mod01.aic,arma_mod01.bic,arma_mod01.hqic)
arma_mod10=sm.tsa.ARMA(diff1_144,(1,0)).fit()
print(arma_mod10.aic,arma_mod10.bic,arma_mod10.hqic)
arma_mod60=sm.tsa.ARMA(diff1_144,(6,0)).fit()
print(arma_mod60.aic,arma_mod60.bic,arma_mod60.hqic)
arma_mod61=sm.tsa.ARMA(diff1_144,(6,1)).fit()
print(arma_mod61.aic,arma_mod61.bic,arma_mod61.hqic)

8782.801951424293 8795.000275694605 8787.618254792987
8781.294547949288 8793.4928722196 8786.110851317982
8761.522020813209 8794.05088553404 8774.365496463062
8758.668160226449 8795.263133037382 8773.117070332533
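
Instead of fitting a handful of candidate orders by hand, a small grid search over (p, q) that keeps the lowest AIC can automate this step. A rough sketch, still using the old ARMA API for consistency with the code above (the search ranges are arbitrary choices):

# Brute-force search over small (p, q) orders, keeping the model with the lowest AIC
best_aic, best_order = float('inf'), None
for p in range(0, 7):
    for q in range(0, 3):
        try:
            res = sm.tsa.ARMA(diff1_144, (p, q)).fit(disp=0)
        except Exception:
            continue  # some orders may fail to converge; skip them
        if res.aic < best_aic:
            best_aic, best_order = res.aic, (p, q)
print('best order by AIC:', best_order, 'AIC:', best_aic)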

#The model is set to ARMA(6,1)
#Compute the residuals
resid=arma_mod61.resid

#Model diagnostics
#ACF and PACF of the residuals
fig=plt.figure(figsize=(12,8))
ax1=fig.add_subplot(211)
fig=sm.graphics.tsa.plot_acf(resid.values.squeeze(),lags=40,ax=ax1) #squeeze() flattens the array to 1-D
ax2=fig.add_subplot(212)
fig=sm.graphics.tsa.plot_pacf(resid,lags=40,ax=ax2)
plt.show()
#The residual ACF shows no significant autocorrelation, so the residual series is white noise

[Figure: ACF and PACF of the residuals]

#Durbin-Watson test
print(sm.stats.durbin_watson(resid.values))
#The closer DW is to 2, the less autocorrelated the residuals

2.0010218978025396
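
For reference, the Durbin-Watson statistic is just the ratio of the sum of squared successive residual differences to the sum of squared residuals, so it can also be computed by hand (a sketch, not part of the original script):

# Manual Durbin-Watson: sum of squared successive differences over sum of squares
import numpy as np

e = resid.values
dw = np.sum(np.diff(e) ** 2) / np.sum(e ** 2)
print(dw)  # should match sm.stats.durbin_watson(resid.values)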

#Ljung-Box test
r,q,p=sm.tsa.acf(resid.values.squeeze(),qstat=True)
d=np.c_[range(1,41),r[1:],q,p]
table=pd.DataFrame(d,columns=['lag','AC','Q','Prob(>Q)'])
print(table.set_index('lag'))
#Last column: the p-values for the first 12 lags are all > 0.05, so the residuals are a white-noise series
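
The same Ljung-Box check can also be run directly with statsmodels' acorr_ljungbox; a sketch assuming a recent statsmodels version, where the function returns a DataFrame of statistics and p-values:

# Ljung-Box test via statsmodels; p-values > 0.05 support the white-noise hypothesis
from statsmodels.stats.diagnostic import acorr_ljungbox

lb = acorr_ljungbox(resid, lags=40)
print(lb)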


Output:

"C:\Program Files\Python36\pythonw.exe" C:/Users/88304/Desktop/arima/ts_3.py
time
2017-04-01 00:00:00.000    597816.0
2017-04-01 00:10:00.000    583104.0
2017-04-01 00:20:00.000    572465.0
2017-04-01 00:30:00.000    561279.0
2017-04-01 00:40:00.000    551589.0
                             ...   
2017-04-05 23:20:00.700         NaN
2017-04-05 23:30:00.705         NaN
2017-04-05 23:40:00.710         NaN
2017-04-05 23:50:00.715         NaN
2017-04-06 00:00:00.720         NaN
Name: number, Length: 721, dtype: float64
C:\Program Files\Python36\lib\site-packages\statsmodels\tsa\base\tsa_model.py:162: ValueWarning: No frequency information was provided, so inferred frequency 10T will be used.
  % freq, ValueWarning)
 This problem is unconstrained.
RUNNING THE L-BFGS-B CODE

           * * *

Machine precision = 2.220D-16
 N =            8     M =           12

At X0         0 variables are exactly at the bounds

At iterate    0    f=  1.01400D+01    |proj g|=  1.14344D-03

At iterate    5    f=  1.01400D+01    |proj g|=  2.06057D-05

At iterate   10    f=  1.01400D+01    |proj g|=  1.06581D-06

At iterate   15    f=  1.01400D+01    |proj g|=  1.42109D-06

At iterate   20    f=  1.01400D+01    |proj g|=  1.77636D-06

At iterate   25    f=  1.01400D+01    |proj g|=  5.50671D-06

At iterate   30    f=  1.01400D+01    |proj g|=  3.19744D-06

At iterate   35    f=  1.01400D+01    |proj g|=  2.27907D-04

At iterate   40    f=  1.01400D+01    |proj g|=  2.45848D-04

At iterate   45    f=  1.01400D+01    |proj g|=  6.91003D-05

At iterate   50    f=  1.01400D+01    |proj g|=  7.44294D-05

At iterate   55    f=  1.01400D+01    |proj g|=  5.32907D-06

At iterate   60    f=  1.01400D+01    |proj g|=  3.55271D-07

           * * *

Tit   = total number of iterations
Tnf   = total number of function evaluations
Tnint = total number of segments explored during Cauchy searches
Skip  = number of BFGS updates skipped
Nact  = number of active bounds at final generalized Cauchy point
Projg = norm of the final projected gradient
F     = final function value

           * * *

   N    Tit     Tnf  Tnint  Skip  Nact     Projg        F
    8     61     89      1     0     0   3.553D-07   1.014D+01
  F =   10.139982273393125     

CONVERGENCE: REL_REDUCTION_OF_F_<=_FACTR*EPSMCH             

 Cauchy                time 0.000E+00 seconds.
 Subspace minimization time 0.000E+00 seconds.
 Line search           time 0.000E+00 seconds.

 Total User time 0.000E+00 seconds.

8758.664719664874 8795.259692475807 8773.113629770958
C:\Program Files\Python36\lib\site-packages\statsmodels\tsa\stattools.py:572: FutureWarning: fft=True will become the default in a future version of statsmodels. To suppress this warning, explicitly set fft=False.
2.000999340483735
  FutureWarning
            AC          Q  Prob(>Q)
lag                                
1.0  -0.002054   0.001831  0.965873
2.0  -0.002387   0.004308  0.997848
3.0   0.002796   0.007718  0.999820
4.0   0.009734   0.049130  0.999703
5.0   0.027399   0.378006  0.995914
6.0   0.029302   0.755043  0.993225
7.0   0.029967   1.150302  0.992027
8.0   0.009435   1.189574  0.996744
9.0   0.019522   1.358117  0.998069
10.0 -0.044781   2.247036  0.994072
11.0 -0.040868   2.989184  0.990868
12.0  0.011338   3.046438  0.995207
13.0  0.056177   4.455400  0.985313
14.0 -0.101291   9.047087  0.828022
15.0  0.009775   9.089956  0.872767
16.0 -0.136811  17.506946  0.353548
17.0 -0.040941  18.262543  0.372462
18.0  0.076638  20.916581  0.283643
19.0 -0.032963  21.408772  0.314660
20.0 -0.036033  21.998325  0.340602
21.0 -0.054510  23.350834  0.325562
22.0 -0.073115  25.790054  0.260805
23.0 -0.095540  29.965209  0.150403
24.0 -0.000846  29.965537  0.185897
25.0 -0.017794  30.111071  0.220157
26.0 -0.036573  30.727426  0.238577
27.0 -0.020433  30.920279  0.274425
28.0 -0.015074  31.025498  0.315949
29.0 -0.007243  31.049854  0.363078
30.0  0.028095  31.417193  0.395111
31.0  0.014408  31.514052  0.440538
32.0  0.034205  32.061290  0.463707
33.0  0.065449  34.069847  0.415955
34.0  0.002278  34.072286  0.464262
35.0  0.038981  34.788397  0.478274
36.0  0.011688  34.852937  0.523033
37.0  0.023137  35.106490  0.558062
38.0  0.009311  35.147656  0.602066
39.0  0.015067  35.255730  0.641377
40.0 -0.001959  35.257562  0.683451
time
2017-04-01 00:00:00.000        NaN
2017-04-01 00:10:00.000        NaN
2017-04-01 00:20:00.000        NaN
2017-04-01 00:30:00.000        NaN
2017-04-01 00:40:00.000        NaN
                            ...   
2017-04-05 23:20:00.700   -15984.0
2017-04-05 23:30:00.705   -17059.0
2017-04-05 23:40:00.710   -25804.0
2017-04-05 23:50:00.715    -4121.0
2017-04-06 00:00:00.720        NaN
Name: number, Length: 721, dtype: float64
Forecast values
2017-04-04 23:50:00    663671.544399
2017-04-05 00:00:00    645407.867736
dtype: float64
