An Econometric Approach to Time Series Analysis: Seasonal ARIMA in Python

Autocorrelation, time series decomposition, data transformation, the SARIMAX model, performance metrics, an analysis framework

In this post, we will discuss analyzing time series data that has trend and seasonal components. We will follow an econometric approach to model the statistical properties of the data. The business goal here is forecasting. We will try to explain the various concepts involved in time series modeling, such as time series components, serial correlation, model fitting, metrics, etc. We will use the SARIMAX model provided by the statsmodels library to model the seasonality and trend in the data. SARIMA (Seasonal ARIMA) is capable of modeling seasonality and trend together, unlike ARIMA, which can only model trend.

Contents:

Definition of time series data
Introduction to the project and data
Seasonal decomposition and Time series components: Trend, Seasonality, Cycles, Residuals
Stationarity in time series data and why it is important
Autocorrelation and partial autocorrelation
Data transformation: Log transformation and differencing
Model Selection and Fitting
Conclusion
Access full Python code from the GitHub repository: https://github.com/jahangirmammadov/sarima/blob/master/Seasonal%20Time%20Series%20Analysis.ipynb

1. Definition of time series data

Time series data is a sequence of data points measured over time intervals. In other words, data is a function of time f(t) = y.
Data points can be measured hourly, daily, weekly, monthly, quarterly or yearly, and also on smaller or larger time scales such as seconds or decades.
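To make this concrete, here is a toy example (not the project data) of constructing a monthly series in pandas; the values are made up:

import numpy as np
import pandas as pd

# A toy monthly time series: 24 points indexed by month-start dates
index = pd.date_range(start='2000-01-01', periods=24, freq='MS')
y = pd.Series(np.random.randn(24).cumsum(), index=index)
print(y.head())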

2. Introduction to the project and data

The data we are using in this article is a monthly home sales index for 20 major US cities between the years 2000 and 2019 (https://fred.stlouisfed.org/series/SPCS20RPSNSA). You can freely download many different economic time series representing the US economy from this source. You may see 2 different versions of the same data: seasonally adjusted and non-seasonally adjusted. The version used in this post is not seasonally adjusted, as we want to model the seasonality as well as the trend. You may ask why people in the industry want to use seasonally adjusted data. Well, sometimes businesses want to know the true effect of an economic event on particular data, and that event may overlap with a season. In that case, seasonality may hide or underestimate/overestimate the effect of the economic event. For instance, heating oil producers may want to study the impact of declining petrol prices on heating oil prices. However, heating oil prices increase in winter, despite the fact that heating oil is a petroleum product; the decrease in petrol prices should be reflected in a decrease in heating oil prices. In winter, though, there is big demand for heating, which causes a slight increase in prices. By removing the seasonal effect from the time series data, you may see that the heating oil price actually follows a decreasing trend, and the slight increase in the price was the seasonal effect. In Section 3, we will talk about seasonal decomposition in more detail.
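The loading step is not shown in the post; a minimal sketch, assuming the CSV downloaded from FRED has DATE and SPCS20RPSNSA columns (the file name is hypothetical):

import pandas as pd
import matplotlib.pyplot as plt

# Load the FRED CSV into a DataFrame indexed by date
original_data = pd.read_csv('SPCS20RPSNSA.csv', parse_dates=['DATE'], index_col='DATE')

# Time series plot of the home sales index
original_data.plot(figsize=(12, 4), title='US home sales index (SPCS20RPSNSA)');
plt.show()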

If we look at the time series plot of the data, we can observe an increasing trend in 2000–2006, a decreasing trend in home sales from 2007 until 2012 due to the big financial crisis, and an increasing trend again until 2018. We can also observe seasonality in the data: the housing market is usually not active at the beginning of a year, sales usually peak in mid-year, and sales get lower again by the end of the year. It seems the warmer seasons, especially summer, are good seasons for the American housing market.


[Figure: Time series plot of the monthly home sales index]

3. Seasonal decomposition and Time series components: Trend, Seasonality, Cycles, Residuals

Time series data Y is composed of a combination of Trend, Cycles, Seasonality and Residuals. Of course, you may come across time series that have no Trend, Cycles or Seasonality, so it is your task to identify the components of your time series data. Definitions of the terms are given below:
Trend — long-term upward or downward movement.
Cycle — periodic variation due to economic movements. It is different from seasonal variation. The cycle is the variation of the autoregressive component of time series data. Cycles occur over longer time intervals, such as every 6–10 years, whereas seasonal variation occurs over shorter time intervals.
Seasonality — variation in data caused by seasonal effects. Ice cream sales are high in summer; heating oil sales are high in winter but low in summer.
Residuals — the component that is left after the other components have been calculated and removed from the time series data. Residuals are independent and identically distributed (i.i.d.): R ~ N(0,1).
The statsmodels library has a function called seasonal_decompose, which decomposes time series Y into Trend, Seasonality and Residuals. Although it is a naive decomposition algorithm, in practice it is very intuitive and works well for time series data where T, S and R are obvious. Before explaining the below graphs, I would like to talk about the interaction among these components.
Time series data Y can take either an additive or a multiplicative form. In additive form, time series Y is formed by the sum of time series components, namely, T, S, C, R:
Y = T + C + S + R
In multiplicative form time series Y is formed by the product of time series components:
Y = T * C * S * R
So, is the home sales index multiplicative or additive? If you look carefully at the time series plot of the housing index, you may notice that the seasonal variation (seasonality) gets smaller when the trend decreases (years 2008–2013) and gets bigger when the trend increases (years 2014–2019). This happens in multiplicative time series, where a small value of the trend T results in a small S, because we multiply S by T. You don't experience this phenomenon in additive time series.
We decompose the time series data into Trend, Seasonal component and Residuals using the seasonal_decompose(original_data, 'multiplicative') function from statsmodels. Don't be surprised if the function returns all 3 components even when you assume some of them do not exist for a particular time series. These components are generated by a simple algorithm, so the decomposition function cannot say a component doesn't exist, even when the calculated value is not significant. You will therefore see all three components for any time series data. You have to know how to read the results and decide which model (ARIMA or SARIMA) to fit to the data.
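A minimal sketch of the decomposition call described above, assuming original_data holds the loaded index:

from statsmodels.tsa.seasonal import seasonal_decompose

# Decompose into Trend, Seasonality and Residuals with the multiplicative model
decomposition = seasonal_decompose(original_data, model='multiplicative')
fig = decomposition.plot()
fig.set_size_inches(12, 8)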
[Figure: Plot of seasonal decomposition from the statsmodels library]

We will now decompose the data into components ourselves to better understand their derivation and usage. The trend can be calculated by taking a moving average with window size = 12. The below plot is very similar to the Trend generated by the statsmodels library.


import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

# Trend is the 12-month moving average
ma_12 = training_data.rolling(window=12).mean()
plt.figure(figsize=(12,4));
plt.plot(ma_12);
plt.title('MA_12');

Moving average with window size=12
[Figure: Plot of the moving average with window size = 12]

Remember the multiplicative model, Y = T * C * S * R? Dividing Y by T gives Y/T = C * S * R, and we assume that the Residuals are too small for this data, as the time series plot looks smooth. We also have very little data, so we cannot detect an economic cycle. Therefore, we get the seasonal component when we divide the time series by the trend: S = Y/T.

# decomposition comes from seasonal_decompose (see above); .trend is the extracted trend
seasonal_component = training_data / decomposition.trend
plt.figure(figsize=(12,4))
plt.plot(seasonal_component);
plt.scatter(x='2002-07-01', y=seasonal_component.loc['2002-07-01'], marker='D', s=50, color='r');
plt.scatter(x='2003-02-01', y=seasonal_component.loc['2003-02-01'], marker='D', s=50, color='g');
plt.title('S=Y/T');

Code snippet for seasonality decomposition

The graph of seasonality is a little harder to understand, but the explanation given here should be sufficient. y = 1.2 (red marker) means there were 20% more sales in July 2002. In other words, July has a seasonal effect of +20%, i.e. 1.2 * T. On the other hand, y = 0.8 (green marker) in February 2003 shows a 20% decrease in sales. For some time series data (not in this particular case) you may see that the seasonal effect is very small, i.e. y ≈ 1.0001; such a tiny seasonal effect shouldn't even be considered significant.

[Figure: Seasonal component]

Residuals can be computed as R = Y/(S*T)

#R = Y/(S*T)
residual_component= training_data/(decomposition.trend*decomposition.seasonal)
plt.figure(figsize=(12,4))
plt.plot(residual_component);
plt.title('R = Y / (S*T)')

Code snippet for residuals decomposition

[Figure: Residuals component]

There are some other ways to detect seasonality. In the below graph, monthly home sales for each year are plotted, and as you can see, every year follows pretty much the same pattern with slight differences. House sales are high in summer and lower in the winter months.

plt.figure(figsize=(16,8))
plt.grid(which='both')
years = int(np.round(len(training_data)/12))
for i in range(years):
    index = training_data.index[i*12:(i+1)*12]
    plt.plot(training_data.index[:12].month_name(),training_data.loc[index].values);
    plt.text(y=training_data.loc[index].values[11], x=11, s=training_data.index.year.unique()[i]);
plt.legend(training_data.index.year.unique(), loc=0);
plt.title('Monthly Home Sales per Year');

Code snippet for the monthly home sales graph

[Figure: Monthly sales per year]

4. Stationarity in time series data and why it is important

When we have trend and/or seasonality in time series data, we call it non-stationary.
Stationarity means the statistical properties of the data, such as mean, variance and standard deviation, remain constant over time. Stationary data should be i.i.d. In simpler language, every data point should be independent of the previous data points.
Why do we want the statistical properties to remain the same over time? Because we make statistical assumptions about the sample data in the course of model building (a good example could be the OLS assumptions), and the model is only capable of performing under those assumptions. When the statistical properties of the data change, the model is no longer capable of representing the true nature of the data.
That's why our forecasting/prediction results would no longer be valid. A changing mean/variance would require us to fit another model, and that model might only be valid for a short period of time before we have to abandon it and fit a new one. See how inefficient and unreliable this process looks. We have to make time series data stationary before fitting a model, and we can do this by transforming the data. Usually, differencing is used to make the data stationary; we will talk about it in Section 6, below.
So, how can we test whether time series data is stationary or not? The first way is simply eyeballing the time series plot and identifying trend or seasonality; if at least one of them exists, the time series data is not stationary. Secondly, you may divide the data into 3 different sets, calculate the mean and variance for each set, and check whether they are substantially different across the sets. The third option is to use one of the statistical tests provided in the statsmodels library.
The Augmented Dickey-Fuller (ADF) test is the most popular among them; its null hypothesis is H_0: the data is not stationary. The ADF test result provides a test statistic and a p-value. A p-value >= 0.05 means the data is not stationary; otherwise, we reject the null hypothesis and say the data is stationary.
We assume you know what hypothesis testing is and what a p-value means. If you are not familiar with these terms, just look at the p-value: if it is smaller than 0.05, the data is stationary; if it is >= 0.05, the data is not stationary. The ADF test confirms that the original time series data is not stationary, with a p-value of ~0.08.

from statsmodels.tsa.stattools import adfuller

def test_stationarity(data):
    # ADF test on the SPCS20RPSNSA column; H0: the series is not stationary
    p_val = adfuller(data['SPCS20RPSNSA'])[1]
    if p_val >= 0.05:
        print("Time series data is not stationary. Adfuller test pvalue={}".format(p_val))
    else:
        print("Time series data is stationary. Adfuller test pvalue={}".format(p_val))

test_stationarity(original_data)

Code snippet for ADF test

Time series data is not stationary. Adfuller test pvalue=0.0803366374517756

5. Autocorrelation and partial autocorrelation

We have to take a look at the ACF and PACF plots before model building, as we will use these plots a lot from now on.
The autocorrelation plot shows the correlation of time series data with its own lagged values. For example, autocorrelation at lag=1 shows the correlation between y_t and y_t-1; at lag=2, corr(y_t, y_t-2); at lag=12, corr(y_t, y_t-12). Every data point at time t having a high correlation with the data points at times t-12, t-24, etc. denotes seasonality in this particular example.
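The ACF and PACF plots shown in this section can be reproduced with statsmodels' plotting helpers; a minimal sketch, assuming original_data is the loaded series:

import matplotlib.pyplot as plt
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf

fig, axes = plt.subplots(2, 1, figsize=(12, 8))
plot_acf(original_data['SPCS20RPSNSA'], lags=48, ax=axes[0]);   # autocorrelation over 4 seasonal cycles
plot_pacf(original_data['SPCS20RPSNSA'], lags=48, ax=axes[1]);  # partial autocorrelation
plt.show()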
[Figure: Autocorrelation plot of the original home sales index data]

The below code snippet and scatter plots may help you to better understand the correlation between lagged values, namely, autocorrelation.

fig, axes = plt.subplots(1,2, squeeze=False);
fig.set_size_inches(14,4);
axes[0,0].scatter(x=original_data[1:], y=original_data.shift(1)[1:]);
axes[0,1].scatter(x=original_data[12:], y=original_data.shift(12)[12:]);
axes[0,0].set_title('Correlation of y_t and y_t-1');
axes[0,1].set_title('Correlation of y_t and y_t-12');

Code snippet for correlation of lags

[Figure: Scatter plots of lagged values, showing high correlation]

Back to the ACF plot: the blue shaded area shows the significance level. Correlation coefficients within the shaded area show weak correlation at those lags, and we don't consider them significant in the analysis.
The partial autocorrelation function (PACF) gives the partial correlation of a stationary time series with its own lagged values.

[Figure: Partial autocorrelation plot of the original home sales index data]

PACF removes the correlation contribution of other lags and gives the pure correlation between two lags without the effect of others.

We use the ACF and PACF to choose the correct orders for the AR(p) and MA(q) components/features of an ARIMA model. For the AR order p, look at the PACF plot and choose the lag value which has a significant correlation factor before the correlations become insignificant. For the MA order q, look at the ACF plot and do the same. Don't forget that you should only read these values from the ACF and PACF plots of the stationary time series, not the above plots: the ACF and PACF plots given above are plots of the original data, which is non-stationary.
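If you prefer numbers to plots, the same quantities can be computed directly; a sketch using seasonally_diffed_data, the stationary series constructed in Section 6:

import numpy as np
from statsmodels.tsa.stattools import acf, pacf

# Numeric ACF/PACF values on the stationary series
series = seasonally_diffed_data['SPCS20RPSNSA']
acf_vals = acf(series, nlags=24)
pacf_vals = pacf(series, nlags=24)
print('ACF lag 1: {} PACF lag 1: {}'.format(np.round(acf_vals[1], 3), np.round(pacf_vals[1], 3)))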

6. Data transformation: Log transformation and differencing

So, let's transform the data to make it stationary so that we can start the model building phase. We split the original data into training and test data: the training data contains US home sales data from 2000 to 2018, and the test data contains data from 2018 to 2019. Don't forget, you cannot do random sampling like you would for cross-sectional data; we have to keep the temporal behaviour (dependence on time) of time series data. A sketch of such a split is shown below.
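A minimal sketch of such a split; the exact boundary handling in the author's notebook may differ, and test_start_date is reused later in the post:

# Keep temporal order: train up to the end of 2017, test on 2018-2019
test_start_date = '2018-01-01'
training_data = original_data[:test_start_date][:-1]
test_data = original_data[test_start_date:]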
The home sales index data can be formulated as a multiplicative model, where Y = T * S * R (I am ignoring Cycles, as they are not actually present in this data). (S)ARIMA models are linear models, like Linear Regression, and we cannot fit a linear model such as SARIMA to data generated by a process Y = T * S * R; we have to make Y linear before fitting a linear model. As you know from math, log(a*b) = log(a) + log(b), so we log-transform the data to make it linear: log(Y) = log(T) + log(S) + log(R). Log transformation makes the data linear and smoother.
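plot_data_properties is a helper defined in the author's notebook; judging by the figures it produces (series plot, ACF, PACF and histogram), a rough re-creation might look like this:

import matplotlib.pyplot as plt
from statsmodels.graphics.tsaplots import plot_acf, plot_pacf

def plot_data_properties(data, title=''):
    # Series plot on top; ACF, PACF and histogram below
    series = data.squeeze()  # single-column DataFrame -> Series
    fig = plt.figure(figsize=(14, 8))
    ax = fig.add_subplot(2, 1, 1)
    ax.plot(series)
    ax.set_title(title)
    plot_acf(series, lags=24, ax=fig.add_subplot(2, 3, 4))
    plot_pacf(series, lags=24, ax=fig.add_subplot(2, 3, 5))
    series.hist(ax=fig.add_subplot(2, 3, 6))
    plt.tight_layout()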

log_transformed_data = np.log(training_data)
plot_data_properties(log_transformed_data, 'Log transformed training data')

[Figure: Properties of the log-transformed data]

Sometimes, log transformation by itself can make data stationary, but that is not the case here.

test_stationarity(log_transformed_data)
Time series data is not stationary. Adfuller test pvalue=0.22522944188413385

Differencing is a basic operation or data transformation: it is the difference between y at time t and y at time t-x, for example first-order differencing diff_1 = y_t - y_t-1.
Differencing makes the data stationary, as it removes time series components from the data and you are left with the changes between time periods. Notice that first-order differencing took away only the Trend, not the Seasonality: the data is still not stationary, as it contains seasonal effects.

logged_diffed_data = log_transformed_data.diff()[1:]
plot_data_properties(logged_diffed_data, 'Log transformed and differenced data')

[Figure: Log-transformed and first-order differenced data: non-stationary]

test_stationarity(logged_diffed_data)
Time series data is not stationary. Adfuller test pvalue=0.20261733702504936

We have to take the 12th-order difference to remove the seasonality. You may ask how we decided to take the 12th-order difference and not the 6th, 8th or some other order. Usually, monthly data has seasonality at lag=12, weekly data at lag=4 and daily data at lag=30. Or you can derive it from the ACF plot: the 12th, 24th and 36th lags are highly correlated for this particular data.
The data is stationary now. If you look at the histogram in the below graph, it looks like a normal bell curve. Stationary data is i.i.d., and the plot looks like white noise. White noise is just one example of stationary time series data.

seasonally_diffed_data = logged_diffed_data.diff(12)[12:]
plot_data_properties(seasonally_diffed_data, 'Log transformed, diff=1 and seasonally differenced data')

[Figure: Stationary data]

test_stationarity(seasonally_diffed_data)
Time series data is stationary. Adfuller test pvalue=0.0006264163287311492
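As a quick sanity check of that claim, synthetic white noise passes the ADF test; a hypothetical example reusing the test_stationarity helper, which expects an SPCS20RPSNSA column:

import numpy as np
import pandas as pd

# Pure white noise should be reported as stationary with a tiny p-value
rng = np.random.default_rng(42)
white_noise = pd.DataFrame({'SPCS20RPSNSA': rng.normal(size=228)})
test_stationarity(white_noise)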

7. Model Selection and Fitting

As the transformed data is now stationary, we can proceed to the model fitting phase. We had a brief chat about SARIMA before; I want to elaborate on this particular model. SARIMA, Seasonal ARIMA, is a special member of the ARIMA family which can model the seasonal component of time series data. Just to recap what ARIMA means:
AR — Auto-Regressive model: the time series is regressed on its own lagged values. The lagged values become independent variables, whereas the time series itself becomes the dependent variable:
y_t = a_0 + a_1*y_t-1 + a_2*y_t-2 + … + a_k*y_t-k
The main task here is to choose how many time steps to use as independent variables. Do not let the words time series or lagged values confuse you; they are just independent variables. In Linear Regression, you would look at the correlation between the independent and dependent variables and choose highly correlated variables as your features. Here you should do the same. But you don't have to calculate the correlation between the lagged values and the target variable, because you can use the PACF to determine how many lags to use. The PACF of the stationary data has a significant autocorrelation at lag=1, and the next autocorrelation at lag=2 becomes insignificant; ideally, the AR order p should be 1. However, since the AR(p) and MA(q) terms interact, the initial p and q values observed from the autocorrelation plots are not fully reliable and should only be used as a starting point. We have to do a parameter search on p to find the optimal value; the initial guess helps define which values to use for a grid search. In this case, p = [0–2] should be sufficient.
I — order of integration: basically, how many times you have differenced the data. We differenced once, so d=1. Do not forget to fit the model to the non-differenced data when you set the parameter d=1, as the algorithm will do the differencing itself. If you fit the model to stationary data instead, you don't need differencing anymore and can leave d=0. We need differencing only to make the data stationary.
MA — Moving Average model: the time series y is regressed on residuals w:
y_t = a_0 + a_1*w_t-1 + a_2*w_t-2 + … + a_k*w_t-k
Look at the ACF plot to determine the MA order (q) of the ARIMA model. The ACF suggests order q=1 for the MA part of the ARIMA model. However, we should do a grid search to find an optimal model; I suggest looking at the values q = [0–2].
Seasonal model — the seasonal features have to be added to the model together with AR and MA, and this part has 4 parameters (P, D, Q, s).
Think of the P, D and Q parameters as being similar to the AR, I and MA parameters, but only for the seasonal component of the series.
Choose P by looking at the PACF and Q by looking at the ACF. The number of seasonal differences taken is D. The frequency of the seasonal effect is defined by s.
P = 1 — because we have a significant correlation at lag=12; however, it is not very strong, and we may not need an AR variable in the model. That's why we should grid search on P = [0–2].
D = 1 — we differenced for seasonality once.
Q = 1 — as we have a strong correlation at lag=12 according to the ACF plot. Let's perform a grid search on the parameter Q = [0–2], too.
s = 12 — seasonality frequency, every 12 months.
The best_sarima_model function below performs a grid search on the (p,d,q) and (P,D,Q,s) parameters and finds the best model, taking the statistical metrics AIC, BIC and HQIC as evaluation criteria. Lower AIC, BIC and HQIC mean a better model. These metrics reward goodness-of-fit (log-likelihood) and penalise overfitting; in our case, having many lagged features leads to overfitting. AIC, BIC and HQIC balance the tradeoff between likelihood and degrees of freedom, and you can see this property in their formulas. I will not get into the details of the other metrics, but the AIC formula illustrates the point:
AIC = 2k - 2ln(L)
where k is the number of estimated parameters in the model (in other words, the number of features/lag terms) and L is the maximum of the likelihood function.
I have seen many examples in the industry using only one of these metrics as the model selection criterion, but you may come across cases where the AIC of one model is lower than another's while its BIC is higher. That's why we choose one model over another only if at least 2 of the 3 metrics are lower.

import numpy as np
from statsmodels.tsa.statespace.sarimax import SARIMAX

def best_sarima_model(train_data, p, q, P, Q, d=1, D=1, s=12):
    # Grid search over (p,d,q)x(P,D,Q,s); a model becomes the new best
    # when at least 2 of the 3 information criteria (AIC, BIC, HQIC) improve.
    best_model_aic = np.Inf
    best_model_bic = np.Inf
    best_model_hqic = np.Inf
    best_model_order = (0,0,0)
    models = []
    for p_ in p:
        for q_ in q:
            for P_ in P:
                for Q_ in Q:
                    try:
                        no_of_lower_metrics = 0
                        model = SARIMAX(endog=train_data, order=(p_,d,q_), seasonal_order=(P_,D,Q_,s),
                                        enforce_invertibility=False).fit()
                        models.append(model)
                        if model.aic <= best_model_aic: no_of_lower_metrics += 1
                        if model.bic <= best_model_bic: no_of_lower_metrics += 1
                        if model.hqic <= best_model_hqic: no_of_lower_metrics += 1
                        if no_of_lower_metrics >= 2:
                            best_model_aic = np.round(model.aic,0)
                            best_model_bic = np.round(model.bic,0)
                            best_model_hqic = np.round(model.hqic,0)
                            best_model_order = (p_,d,q_,P_,D,Q_,s)
                            current_best_model = model
                            print("Best model so far: SARIMA" + str(best_model_order) +
                                  " AIC:{} BIC:{} HQIC:{}".format(best_model_aic,best_model_bic,best_model_hqic) +
                                  " resid:{}".format(np.round(np.exp(current_best_model.resid).mean(),3)))
                    except Exception:
                        # Some parameter combinations fail to converge; skip them
                        pass

    print('\n')
    print(current_best_model.summary())
    return current_best_model, models

Code snippet for model selection

Note that we are fitting the model to the log-transformed data, and because of that we have set the d=1 and D=1 parameters so that the model does the differencing itself. If you are fitting a model to stationary data instead, you have to set the orders of integration (d, D) to 0. We evaluated SARIMA models with the parameters we identified above. The summary below shows the best model, in other words the one with the lowest AIC, BIC and HQIC. The best model suggests that we don't need AR features, only MA and seasonal MA features.

best_model, models = best_sarima_model(train_data=log_transformed_data,p=range(3),q=range(3),P=range(3),Q=range(3))

[Figure: Model summary of the best SARIMA model]

ARIMA and SARIMA models are estimated using MLE, and the OLS assumptions are applicable to this family of models. I don't want to elaborate on those assumptions here; that is a topic for another article. However, we have to confirm that our model aligns with them: the p-values of the coefficients are <= 0.05, the residuals are stationary and homoscedastic, and there is no serial correlation among the residuals.
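One way to check these claims is to run an ADF test on the fitted model's residuals and a Ljung-Box test for serial correlation; a sketch, where the 13-observation warm-up cutoff is an assumption to skip residuals consumed by the d=1, D=1 differencing:

from statsmodels.stats.diagnostic import acorr_ljungbox
from statsmodels.tsa.stattools import adfuller

resid = best_model.resid[13:]  # drop differencing warm-up values

# ADF test on residuals: a small p-value indicates stationarity
print('Residual ADF p-value:', adfuller(resid)[1])

# Ljung-Box test: H0 = no serial correlation up to lag 12
print(acorr_ljungbox(resid, lags=[12]))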
[Figure: Endog vs residuals scatter plot]

[Figure: Residuals autocorrelation plot: no correlation]

We will predict home sales from 2018-01-01 to 2019-01-01. I will use MAPE, the mean absolute percentage error, to evaluate model performance. The best model we have got is SARIMA(order=(0,1,2), seasonal_order=(0,1,1,12)). I prefer the MAPE error metric in time series analysis as it is more intuitive. Sklearn doesn't provide a MAPE metric, which is why we have to code it ourselves. Formula:
MAPE = (100/n) * Σ |A_t - F_t| / A_t, where A_t is the actual value and F_t is the forecast at time t.

def mean_abs_pct_error(actual_values, forecast_values):
    # Average of |actual - forecast| / actual, expressed as a percentage
    actual = np.asarray(actual_values).ravel()
    forecast = np.asarray(forecast_values).ravel()
    return np.mean(np.abs(actual - forecast) / actual) * 100

Code snippet for MAPE

When you use the predict function, there are some nuances to be careful about in its parameters:
typ='levels' means the predicted values will be on the same level as the endog/training values; in our case those were log-transformed and not differenced at all. That is why we then take np.exp() to scale the predicted values back to the original data. Remember, np.exp(np.log(a)) = a, so np.exp(np.log(original data)) = original data.
dynamic=True means the predicted value at time t is used as a predictor for time t+1.

preds_best = np.exp(best_model.predict(start=test_start_date, end='2019-01-01', dynamic=True, typ='levels'))
print("MAPE: {}%".format(np.round(mean_abs_pct_error(test_data, preds_best), 2)))
MAPE: 6.05%

We make an error of approximately 6% in our predictions. It doesn't mean the model underperforms 6% of the time; rather, it means the predicted value is offset from the real value by 6% on average.
Let's plot the predicted values against the original data and see the results. What can we infer from the below plot? Well, a lot! The model successfully captures the seasonal effect; however, it cannot do the same with the trend. Home sales follow a downward trend, but the model cannot capture it well: it knows that sales go down due to the seasonal effect, but there is also a downward trend after 2018 which it struggles to predict. This is due to the small amount of training data we have.
[Figure: Forecasted data]

If we had a larger data set, we could identify an economic cycle and model it; possibly, housing sales follow a reduction every 6–7 years. Or, if this downward trend continues in 2019, our 2020 prediction would definitely capture the trend.
Another option to capture the trend more quickly is to add an AR term to the model. If we add 1 or 2 AR terms, the model can react to the trend faster and achieve a lower MAPE. The below plot displays the MAPE for each model; models performing better than the best model in terms of test MAPE are shown in green.
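The per-model comparison can be produced by scoring every candidate returned by the grid search on the test window; a sketch reusing the models list and the same predict call as above:

import matplotlib.pyplot as plt

# Test MAPE for every candidate model from the grid search
test_mapes = []
for m in models:
    preds = np.exp(m.predict(start=test_start_date, end='2019-01-01', dynamic=True, typ='levels'))
    test_mapes.append(mean_abs_pct_error(test_data, preds))

plt.figure(figsize=(12, 4))
plt.bar(range(len(test_mapes)), test_mapes)
plt.axhline(y=min(test_mapes), color='g', linestyle='--')
plt.title('Test MAPE per candidate model');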
[Figure: Test MAPE for each model]
We added AR terms to the model and got an improvement in the test metrics.

agile_model = SARIMAX(endog=log_transformed_data,order=(1,1,2), seasonal_order=(1,1,2,12),enforce_invertibility=False).fit()
agile_model.summary()

[Figure: Model summary of the model with AR terms]

The test MAPE is now 5.67%, improved from 6.05%, the test MAPE of the grid-search-optimal model.

agile_model_pred = np.exp(agile_model.predict(start=test_start_date, end='2019-01-01', dynamic=True, typ='levels'))
print("MAPE: {}%".format(np.round(mean_abs_pct_error(test_data, agile_model_pred), 2)))
MAPE: 5.67%

However, if you look at the AIC, BIC and HQIC, we get higher values, which means we have traded away some model generality. We know that we have few data points (roughly 300), and having 6 features in a linear model may lead to overfitting. If you take a look at the model summary above, the p-values of the feature coefficients ar.L1, ma.L2, ar.S.L12, ma.S.L12 and ma.S.L24 are higher than 0.05.
