Spatiotemporal forecasting has a wide range of applications in the neuroscience, climate, and transportation domains. Traffic forecasting is one canonical example of such a learning task. The task is challenging due to (1) complex spatial dependency on road networks, (2) non-linear temporal dynamics with changing road conditions, and (3) the inherent difficulty of long-term forecasting. To address these challenges, we propose to model the traffic flow as a diffusion process on a directed graph and introduce Diffusion Convolutional Recurrent Neural Network (DCRNN), a deep learning framework for traffic forecasting that incorporates both spatial and temporal dependency in the traffic flow. Specifically, DCRNN captures the spatial dependency using bidirectional random walks on the graph, and the temporal dependency using the encoder-decoder architecture with scheduled sampling. We evaluate the framework on two real-world large-scale road network traffic datasets and observe consistent improvements of 12%-15% over state-of-the-art baselines.
Spatiotemporal forecasting is a crucial task for a learning system that operates in a dynamic environment. It has a wide range of applications from autonomous vehicles operations, to energy and smart grid optimization, to logistics and supply chain management. In this paper, we study one important task: traffic forecasting on road networks, the core component of the intelligent transportation systems. The goal of traffic forecasting is to predict the future traffic speeds of a sensor network given historic traffic speeds and the underlying road networks.
This task is challenging mainly due to the complex spatiotemporal dependencies and the inherent difficulty of long-term forecasting. On the one hand, traffic time series demonstrate strong temporal dynamics. Recurring incidents such as rush hours or accidents can cause nonstationarity, making long-term forecasting difficult. On the other hand, sensors on the road network exhibit complex yet unique spatial correlations. Figure 1 illustrates an example. Road 1 and road 2 are correlated, while road 1 and road 3 are not. Although road 1 and road 3 are close in Euclidean space, they demonstrate very different behaviors. Moreover, the future traffic speed is influenced more by the downstream traffic than the upstream one. This means that the spatial structure in traffic is non-Euclidean and directional.
Traffic forecasting has been studied for decades, falling into two main categories: the knowledge-driven approach and the data-driven approach. In transportation and operational research, knowledge-driven methods usually apply queuing theory and simulate user behaviors in traffic (Cascetta, 2013). In the time series community, data-driven methods such as the Auto-Regressive Integrated Moving Average (ARIMA) model and Kalman filtering remain popular (Liu et al., 2011; Lippi et al., 2013). However, simple time series models usually rely on the stationarity assumption, which is often violated by traffic data. Most recently, deep learning models for traffic forecasting have been developed in Lv et al. (2015); Yu et al. (2017b), but without considering the spatial structure. Wu & Tan (2016) and Ma et al. (2017) model the spatial correlation with Convolutional Neural Networks (CNN), but the spatial structure is in the Euclidean space (e.g., 2D images). Bruna et al. (2014), Defferrard et al. (2016) studied graph convolution, but only for undirected graphs.
In this work, we represent the pair-wise spatial correlations between traffic sensors using a directed graph whose nodes are sensors and edge weights denote proximity between the sensor pairs measured by the road network distance. We model the dynamics of the traffic flow as a diffusion process and propose the diffusion convolution operation to capture the spatial dependency. We further propose Diffusion Convolutional Recurrent Neural Network (DCRNN) that integrates diffusion convolution, the sequence to sequence architecture and the scheduled sampling technique. When evaluated on realworld traffic datasets, DCRNN consistently outperforms state-of-the-art traffic forecasting baselines by a large margin. In summary:
• We study the traffic forecasting problem and model the spatial dependency of traffic as a diffusion process on a directed graph. We propose diffusion convolution, which has an intuitive interpretation and can be computed efficiently. • We propose Diffusion Convolutional Recurrent Neural Network (DCRNN), a holistic approach that captures both spatial and temporal dependencies among time series using diffusion convolution and the sequence to sequence learning framework together with scheduled sampling. DCRNN is not limited to transportation and is readily applicable to other spatiotemporal forecasting tasks. • We conducted extensive experiments on two large-scale real-world datasets, and the proposed approach obtains significant improvement over state-of-the-art baseline methods.
Figure 1: Spatial correlation is dominated by road network structure. (1) Traffic speeds on road 1 are similar to those on road 2, as they are located on the same highway. (2) Road 1 and road 3 are located in opposite directions of the highway. Though close to each other in Euclidean space, their road network distance is large, and their traffic speeds differ significantly.
We formalize the learning problem of spatiotemporal traffic forecasting and describe how to model the dependency structures using diffusion convolutional recurrent neural network.
2.1 TRAFFIC FORECASTING PROBLEM
The goal of traffic forecasting is to predict the future traffic speed given previously observed traffic flow from N correlated sensors on the road network. We can represent the sensor network as a weighted directed graph G = (V, E, W), where V is a set of nodes with |V| = N, E is a set of edges, and W ∈ R^{N×N} is a weighted adjacency matrix representing the node proximity (e.g., a function of their road network distance). Denote the traffic flow observed on G as a graph signal X ∈ R^{N×P}, where P is the number of features of each node (e.g., velocity, volume). Let X^{(t)} represent the graph signal observed at time t. The traffic forecasting problem aims to learn a function h(·) that maps T′ historical graph signals to T future graph signals, given a graph G:

[X^{(t−T′+1)}, …, X^{(t)}; G] --h(·)--> [X^{(t+1)}, …, X^{(t+T)}]
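The interface of the learning problem can be sketched in a few lines of Python. The persistence baseline below (repeat the last observation) is only an illustration of the input/output shapes, not a method from the paper; the values of N, P, T′, and T are hypothetical.

```python
# h(.) maps T' historical graph signals (each an N x P matrix) to T
# future ones. This trivial "persistence" stand-in repeats the last
# observed signal for every future step.

def persistence_forecast(history, T):
    """history: list of T' graph signals, each an N x P nested list."""
    return [history[-1] for _ in range(T)]

N, P, T_hist = 4, 2, 12                     # hypothetical sizes
history = [[[float(t + n), 1.0] for n in range(N)] for t in range(T_hist)]
pred = persistence_forecast(history, T=3)   # 3 future graph signals
```

Any model h(·) considered in the paper, including DCRNN, conforms to this same signature.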
2.2 SPATIAL DEPENDENCY MODELING
We model the spatial dependency by relating traffic flow to a diffusion process, which explicitly captures the stochastic nature of traffic dynamics. This diffusion process is characterized by a random walk on G with restart probability α ∈ [0, 1] and a state transition matrix D_O^{−1}W. Here D_O = diag(W1) is the out-degree diagonal matrix, and 1 ∈ R^N denotes the all-one vector. After many time steps, such a Markov process converges to a stationary distribution P ∈ R^{N×N} whose ith row P_{i,:} ∈ R^N represents the likelihood of diffusion from node v_i ∈ V, hence the proximity with respect to node v_i. The following lemma provides a closed-form solution for the stationary distribution.
Lemma 2.1. (Teng et al., 2016) The stationary distribution of the diffusion process can be represented as a weighted combination of infinite random walks on the graph, and be calculated in closed form:

P = Σ_{k=0}^{∞} α (1 − α)^k (D_O^{−1} W)^k
where k is the diffusion step. In practice, we use a finite K-step truncation of the diffusion process and assign a trainable weight to each step. We also include the reversed direction diffusion process, such that the bidirectional diffusion offers the model more flexibility to capture the influence from both the upstream and the downstream traffic.
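The closed form in Lemma 2.1 can be checked numerically on a tiny hypothetical directed graph: since every power of the row-stochastic matrix D_O^{−1}W is itself row-stochastic, the truncated series has rows summing to 1 − (1 − α)^{K+1}, which approaches a probability distribution over nodes as K grows. A sketch in plain Python (the edge weights below are made up):

```python
def matmul(A, B):
    # dense matrix product for small nested-list matrices
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

W = [[0.0, 2.0, 1.0],        # hypothetical directed edge weights
     [1.0, 0.0, 0.0],
     [0.0, 3.0, 0.0]]
n = len(W)
deg_out = [sum(row) for row in W]
T = [[w / d for w in row] for row, d in zip(W, deg_out)]    # D_O^{-1} W

alpha, K = 0.1, 200
P = [[0.0] * n for _ in range(n)]
Tk = [[float(i == j) for j in range(n)] for i in range(n)]  # T^0 = I
for k in range(K + 1):
    c = alpha * (1.0 - alpha) ** k                          # weight of k-step walks
    P = [[p + c * t for p, t in zip(prow, trow)]
         for prow, trow in zip(P, Tk)]
    Tk = matmul(Tk, T)

row_sums = [sum(row) for row in P]   # each row approaches a distribution
```

With K = 200 the truncation error (1 − α)^{K+1} is already negligible, which motivates the finite K-step truncation used in practice (with K far smaller and the per-step weights learned).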
Diffusion Convolution
The resulting diffusion convolution operation over a graph signal X ∈ R^{N×P} and a filter f_θ is defined as:

X_{:,p} ⋆_G f_θ = Σ_{k=0}^{K−1} ( θ_{k,1} (D_O^{−1} W)^k + θ_{k,2} (D_I^{−1} W^T)^k ) X_{:,p}   for p ∈ {1, …, P}
where θ ∈ R^{K×2} are the parameters for the filter, and D_O^{−1}W and D_I^{−1}W^T represent the transition matrices of the diffusion process and the reverse one, respectively. In general, computing the convolution can be expensive. However, if G is sparse, Equation 2 can be calculated efficiently using O(K) recursive sparse-dense matrix multiplications with total time complexity O(K|E|) ≪ O(N²). See Appendix B for more detail.
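The recursive trick is simply x_{k+1} = T x_k: instead of forming matrix powers, the transition matrix is applied to the signal once per diffusion step. The sketch below implements Equation 2 for a single-feature signal with dense nested lists for clarity; a real implementation would use sparse matrices to get the O(K|E|) cost.

```python
def matvec(A, x):
    return [sum(a * b for a, b in zip(row, x)) for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def diffusion_conv(x, W, theta):
    """Bidirectional K-step diffusion convolution of signal x, theta: K x 2."""
    n = len(W)
    d_out = [sum(row) for row in W]
    d_in = [sum(W[i][j] for i in range(n)) for j in range(n)]
    T_f = [[W[i][j] / d_out[i] for j in range(n)] for i in range(n)]  # D_O^{-1} W
    Wt = transpose(W)
    T_b = [[Wt[i][j] / d_in[i] for j in range(n)] for i in range(n)]  # D_I^{-1} W^T
    out = [0.0] * n
    xf, xb = x[:], x[:]
    for k in range(len(theta)):
        # accumulate the k-th term, then advance both walks by one step
        out = [o + theta[k][0] * a + theta[k][1] * b
               for o, a, b in zip(out, xf, xb)]
        xf = matvec(T_f, xf)
        xb = matvec(T_b, xb)
    return out
```

With θ = [[1, 0]] the operation reduces to the identity (only the k = 0 forward term), which is a convenient sanity check.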
With the convolution operation defined in Equation 2, we can build a diffusion convolutional layer that maps P-dimensional features to Q-dimensional outputs. Denote the parameter tensor as Θ ∈ R^{Q×P×K×2}, where Θ_{q,p,:,:} ∈ R^{K×2} parameterizes the convolutional filter for the pth input and the qth output. The diffusion convolutional layer is thus:

H_{:,q} = a( Σ_{p=1}^{P} X_{:,p} ⋆_G f_{Θ_{q,p,:,:}} )   for q ∈ {1, …, Q}
where X ∈ R^{N×P} is the input, H ∈ R^{N×Q} is the output, {f_{Θ_{q,p,:,:}}} are the filters, and a is the activation function (e.g., ReLU, sigmoid). The diffusion convolutional layer learns representations for graph-structured data, and we can train it using stochastic gradient-based methods.
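The layer is a double sum over output and input channels with one filter per (q, p) pair. The sketch below takes the per-signal convolution as a pluggable `conv` argument; the trivial scalar filter used in the example (identity graph, K = 1) is a hypothetical stand-in for the full Equation 2 operation.

```python
def diffusion_layer(X, Theta, conv, act=lambda v: max(0.0, v)):
    """X: N x P input; Theta[q][p]: filter parameters for input p, output q;
    conv(column, params): one per-signal diffusion convolution."""
    N, P, Q = len(X), len(X[0]), len(Theta)
    H_cols = []
    for q in range(Q):
        acc = [0.0] * N
        for p in range(P):                      # sum over input channels
            col = [X[i][p] for i in range(N)]
            y = conv(col, Theta[q][p])
            acc = [a + b for a, b in zip(acc, y)]
        H_cols.append([act(v) for v in acc])    # ReLU by default
    return [list(row) for row in zip(*H_cols)]  # N x Q output

# hypothetical stand-in filter: scale the signal by a single scalar weight
scale = lambda col, th: [th * v for v in col]
H = diffusion_layer([[1.0, -1.0], [2.0, 0.0]], [[1.0, 2.0]], scale)
```

Substituting the real bidirectional diffusion convolution for `scale` yields the layer of Equation 3 with Θ ∈ R^{Q×P×K×2}.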
Diffusion convolution is defined on both directed and undirected graphs. When applied to undirected graphs, we show that many existing graph-structured convolutional operations, including the popular spectral graph convolution, i.e., ChebNet (Defferrard et al., 2016), can be considered as a special case of diffusion convolution (up to a similarity transformation). Let D denote the degree matrix, and L = D^{−1/2}(D − W)D^{−1/2} be the normalized graph Laplacian; the following proposition demonstrates the connection.
Proposition 2.2. The spectral graph convolution defined as

X_{:,p} ⋆_G f_θ = Φ F(θ) Φ^T X_{:,p}

with eigenvalue decomposition L = Φ Λ Φ^T and F(θ) = Σ_{k=0}^{K−1} θ_k Λ^k, is equivalent to graph diffusion convolution up to a similarity transformation, when the graph G is undirected.
2.3 TEMPORAL DYNAMICS MODELING
We leverage recurrent neural networks (RNNs) to model the temporal dependency. In particular, we use Gated Recurrent Units (GRU) (Chung et al., 2014), a simple yet powerful variant of RNNs. We replace the matrix multiplications in GRU with the diffusion convolution, which leads to our proposed Diffusion Convolutional Gated Recurrent Unit (DCGRU):

r^{(t)} = σ( Θ_r ⋆_G [X^{(t)}, H^{(t−1)}] + b_r )
u^{(t)} = σ( Θ_u ⋆_G [X^{(t)}, H^{(t−1)}] + b_u )
C^{(t)} = tanh( Θ_C ⋆_G [X^{(t)}, (r^{(t)} ⊙ H^{(t−1)})] + b_c )
H^{(t)} = u^{(t)} ⊙ H^{(t−1)} + (1 − u^{(t)}) ⊙ C^{(t)}

where X^{(t)}, H^{(t)} denote the input and output at time t, and r^{(t)}, u^{(t)} are the reset gate and update gate at time t, respectively. ⋆_G denotes the diffusion convolution defined in Equation 2, and Θ_r, Θ_u, Θ_C are the parameters for the corresponding filters. Similar to GRU, DCGRU can be used to build recurrent neural network layers and be trained using backpropagation through time.
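A single DCGRU step can be sketched in plain Python. This is a deliberately simplified version: one scalar feature per node, a forward-only 2-step graph convolution standing in for the full bidirectional diffusion convolution of Equation 2, and the input and hidden state convolved with separate weights rather than as one concatenated signal; the gate structure itself matches the GRU equations above.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def matvec(A, x):
    return [sum(a * b for a, b in zip(row, x)) for row in A]

def gconv(x, h, T, th):
    # stand-in for Theta *_G [x, h]: weights th = (wx0, wx1, wh0, wh1)
    # for the 0-step and 1-step forward-diffusion terms of x and h
    wx0, wx1, wh0, wh1 = th
    Tx, Th = matvec(T, x), matvec(T, h)
    return [wx0 * a + wx1 * b + wh0 * c + wh1 * d
            for a, b, c, d in zip(x, Tx, h, Th)]

def dcgru_step(x, h, T, th_r, th_u, th_c, b_r=0.0, b_u=0.0, b_c=0.0):
    r = [sigmoid(v + b_r) for v in gconv(x, h, T, th_r)]       # reset gate
    u = [sigmoid(v + b_u) for v in gconv(x, h, T, th_u)]       # update gate
    rh = [ri * hi for ri, hi in zip(r, h)]
    C = [math.tanh(v + b_c) for v in gconv(x, rh, T, th_c)]    # candidate
    return [ui * hi + (1.0 - ui) * ci for ui, hi, ci in zip(u, h, C)]
```

Because the candidate is tanh-bounded and the update gate forms a convex combination, hidden states stay in (−1, 1) when initialized at zero, as in a standard GRU.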
In multiple step ahead forecasting, we employ the Sequence to Sequence architecture (Sutskever et al., 2014). Both the encoder and the decoder are recurrent neural networks with DCGRU. During training, we feed the historical time series into the encoder and use its final states to initialize the decoder. The decoder generates predictions given previous ground truth observations. At testing time, ground truth observations are replaced by predictions generated by the model itself. The discrepancy between the input distributions of training and testing can cause degraded performance. To mitigate this issue, we integrate scheduled sampling (Bengio et al., 2015) into the model, where we feed the model with either the ground truth observation with probability ϵi or the prediction by the model with probability 1−ϵi at the ith iteration. During the training process, ϵi gradually decreases to 0 to allow the model to learn the testing distribution.
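The sampling probability ϵ_i must start near 1 and decay to 0 over training. One common choice is an inverse-sigmoid schedule; the exact schedule and the constant τ below are hyperparameters and are not taken from the paper.

```python
import math
import random

def eps(i, tau=3000.0):
    # inverse-sigmoid decay: ~1 early in training, -> 0 as i grows
    return tau / (tau + math.exp(i / tau))

def decoder_input(ground_truth, model_output, i, rng=random):
    # at iteration i, feed ground truth with probability eps(i),
    # otherwise feed the model's own previous prediction
    return ground_truth if rng.random() < eps(i) else model_output
```

As ϵ_i → 0 the decoder is trained almost entirely on its own predictions, matching the input distribution it sees at test time.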
With both spatial and temporal modeling, we build a Diffusion Convolutional Recurrent Neural Network (DCRNN). The model architecture of DCRNN is shown in Figure 2. The entire network is trained by maximizing the likelihood of generating the target future time series using backpropagation through time. DCRNN is able to capture spatiotemporal dependencies among time series and can be applied to various spatiotemporal forecasting problems.
Figure 2: System architecture for the Diffusion Convolutional Recurrent Neural Network designed for spatiotemporal traffic forecasting. The historical time series are fed into an encoder whose final states are used to initialize the decoder. The decoder makes predictions based on either previous ground truth or the model output.
Traffic forecasting is a classic problem in transportation and operational research, where approaches are primarily based on queuing theory and simulations (Drew, 1968). Data-driven approaches for traffic forecasting have received considerable attention, and more details can be found in a recent survey paper (Vlahogianni et al., 2014) and the references therein. However, existing machine learning models either impose strong stationarity assumptions on the data (e.g., auto-regressive models) or fail to account for highly non-linear temporal dependency (e.g., latent space models, Yu et al. (2016); Deng et al. (2016)). Deep learning models deliver new promise for time series forecasting problems. For example, in Yu et al. (2017b); Laptev et al. (2017), the authors study time series forecasting using deep Recurrent Neural Networks (RNN). Convolutional Neural Networks (CNN) have also been applied to traffic forecasting. Zhang et al. (2016; 2017) convert the road network to a regular 2-D grid and apply traditional CNN to predict crowd flow. Cheng et al. (2017) propose DeepTransport, which models the spatial dependency by explicitly collecting upstream and downstream neighborhood roads for each individual road and then conducting convolution on these neighborhoods respectively.
Recently, CNNs have been generalized to arbitrary graphs based on spectral graph theory. Graph convolutional neural networks (GCN) were first introduced in Bruna et al. (2014), which bridges spectral graph theory and deep neural networks. Defferrard et al. (2016) propose ChebNet, which improves GCN with fast localized convolution filters. Kipf & Welling (2017) simplify ChebNet and achieve state-of-the-art performance in semi-supervised classification tasks. Seo et al. (2016) combine ChebNet with Recurrent Neural Networks (RNN) for structured sequence modeling. Yu et al. (2017a) model the sensor network as an undirected graph and apply ChebNet and a convolutional sequence model (Gehring et al., 2017) for forecasting. One limitation of the aforementioned spectral-based convolutions is that they generally require the graph to be undirected to calculate a meaningful spectral decomposition. Going from the spectral domain to the vertex domain, Atwood & Towsley (2016) propose the diffusion-convolutional neural network (DCNN), which defines convolution as a diffusion process across each node in a graph-structured input. Hechtlinger et al. (2017) propose GraphCNN, which generalizes convolution to graphs by convolving every node with its p nearest neighbors. However, neither of these methods considers the temporal dynamics; they mainly deal with static graph settings.
Our approach is different from all those methods due to both the problem setting and the formulation of the convolution on the graph. We model the sensor network as a weighted directed graph, which is more realistic than a grid or an undirected graph. Besides, the proposed convolution is defined using bidirectional graph random walks and is further integrated with the sequence-to-sequence learning framework as well as scheduled sampling to model the long-term temporal dependency.
Table 1: Performance comparison of different approaches for traffic speed forecasting. DCRNN achieves the best performance with all three metrics for all forecasting horizons, and the advantage becomes more evident with the increase of the forecasting horizon.
We conduct experiments on two real-world large-scale datasets: (1) METR-LA This traffic dataset contains traffic information collected from loop detectors on the highways of Los Angeles County (Jagadish et al., 2014). We select 207 sensors and collect 4 months of data ranging from Mar 1st 2012 to Jun 30th 2012 for the experiment. (2) PEMS-BAY This traffic dataset is collected by the California Transportation Agencies (CalTrans) Performance Measurement System (PeMS). We select 325 sensors in the Bay Area and collect 6 months of data ranging from Jan 1st 2017 to May 31st 2017 for the experiment. The sensor distributions of both datasets are visualized in Figure 8 in the Appendix.
In both of those datasets, we aggregate traffic speed readings into 5-minute windows and apply Z-score normalization. 70% of the data is used for training, 20% for testing, and the remaining 10% for validation. To construct the sensor graph, we compute the pairwise road network distances between sensors and build the adjacency matrix using a thresholded Gaussian kernel (Shuman et al., 2013).
W_ij = exp( −dist(v_i, v_j)² / σ² ) if dist(v_i, v_j) ≤ κ, otherwise 0, where W_ij represents the edge weight between sensor v_i and sensor v_j, dist(v_i, v_j) denotes the road network distance from sensor v_i to sensor v_j, σ is the standard deviation of the distances, and κ is the threshold.
Figure 3: Learning curve for DCRNN and DCRNN without diffusion convolution. Removing diffusion convolution results in much higher validation error. Moreover, DCRNN with bidirectional random walk achieves the lowest validation error.
Figure 4: Effects of K and the number of units in each layer of DCRNN. K corresponds to the reception field width of the filter, and the number of units corresponds to the number of filters.
Baselines We compare DCRNN with widely used time series regression models, including (1) HA: Historical Average, which models the traffic flow as a seasonal process and uses the weighted average of previous seasons as the prediction; (2) ARIMAkal: Auto-Regressive Integrated Moving Average model with Kalman filter, which is widely used in time series prediction; (3) VAR: Vector Auto-Regression (Hamilton, 1994); (4) SVR: Support Vector Regression, which uses a linear support vector machine for the regression task. The following deep neural network-based approaches are also included: (5) FNN: a feed-forward neural network with two hidden layers and L2 regularization; (6) FC-LSTM: a recurrent neural network with fully connected LSTM hidden units (Sutskever et al., 2014).
All neural network based approaches are implemented using Tensorflow (Abadi et al., 2016), and trained using the Adam optimizer with learning rate annealing. The best hyperparameters are chosen using the Tree-structured Parzen Estimator (TPE) (Bergstra et al., 2011) on the validation dataset. Detailed parameter settings for DCRNN as well as baselines are available in Appendix E.
4.2 TRAFFIC FORECASTING PERFORMANCE COMPARISON
Table 1 shows the comparison of different approaches for 15-minute, 30-minute, and 1-hour-ahead forecasting on both datasets. These methods are evaluated based on three commonly used metrics in traffic forecasting: (1) Mean Absolute Error (MAE), (2) Mean Absolute Percentage Error (MAPE), and (3) Root Mean Squared Error (RMSE). Missing values are excluded in calculating these metrics. Detailed formulations of these metrics are provided in Appendix E.2. We observe the following phenomena in both datasets. (1) RNN-based methods, including FC-LSTM and DCRNN, generally outperform the other baselines, which emphasizes the importance of modeling the temporal dependency. (2) DCRNN achieves the best performance on all metrics for all forecasting horizons, which suggests the effectiveness of spatiotemporal dependency modeling. (3) Deep neural network-based methods, including FNN, FC-LSTM, and DCRNN, tend to perform better than linear baselines for long-term forecasting, e.g., 1 hour ahead. This is because the temporal dependency becomes increasingly non-linear as the horizon grows. Besides, as the historical average method does not depend on short-term data, its performance is invariant to small increases in the forecasting horizon.
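The three evaluation metrics, with missing values excluded, can be sketched as below. Encoding missing readings as NaN is an assumption for illustration; the paper's exact formulations are in its Appendix E.2.

```python
import math

def masked_metrics(y_true, y_pred):
    """MAE, RMSE, MAPE over entries where the ground truth is observed;
    missing readings are encoded as NaN and excluded."""
    pairs = [(t, p) for t, p in zip(y_true, y_pred) if not math.isnan(t)]
    n = len(pairs)
    mae = sum(abs(t - p) for t, p in pairs) / n
    rmse = math.sqrt(sum((t - p) ** 2 for t, p in pairs) / n)
    mape = sum(abs((t - p) / t) for t, p in pairs) / n
    return mae, rmse, mape

# e.g., the second reading is missing and does not contribute
mae, rmse, mape = masked_metrics([60.0, float('nan'), 30.0],
                                 [58.0, 99.0, 33.0])
```

Masking matters: an unmasked MAPE would divide by a missing (or zero) speed and dominate the score.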
Note that traffic forecasting on the METR-LA (Los Angeles, which is known for its complicated traffic conditions) dataset is more challenging than on the PEMS-BAY (Bay Area) dataset. Thus we use METR-LA as the default dataset for the following experiments.
To further investigate the effect of spatial dependency modeling, we compare DCRNN with the following variants: (1) DCRNN-NoConv, which ignores spatial dependency by replacing the transition matrices in the diffusion convolution (Equation 2) with identity matrices; this essentially means the forecast for a sensor can only be inferred from its own historical readings. (2) DCRNN-UniConv, which only uses the forward random walk transition matrix for diffusion convolution. Figure 3 shows the learning curves of these three models with roughly the same number of parameters. Without diffusion convolution, DCRNN-NoConv has a much higher validation error. Moreover, DCRNN achieves the lowest validation error, which shows the effectiveness of the bidirectional random walk. The intuition is that the bidirectional random walk gives the model the ability and flexibility to capture the influence from both the upstream and the downstream traffic.
To investigate the effect of graph construction, we construct an undirected graph by setting Ŵ_ij = max(W_ij, W_ji), where Ŵ is the new symmetric weight matrix. Then we develop a variant of DCRNN, denoted GCRNN, which uses sequence-to-sequence learning with the ChebNet graph convolution (Equation 5) and roughly the same number of parameters. Table 2 shows the comparison between DCRNN and GCRNN on the METR-LA dataset. DCRNN consistently outperforms GCRNN. The intuition is that the directed graph better captures the asymmetric correlation between traffic sensors. Figure 4 shows the effects of different parameters. K roughly corresponds to the size of the filters' reception fields, while the number of units corresponds to the number of filters. A larger K enables the model to capture broader spatial dependency at the cost of increased learning complexity. We observe that with the increase of K, the error on the validation dataset first quickly decreases and then slightly increases. Similar behavior is observed when varying the number of units.
To evaluate the effect of temporal modeling, including the sequence-to-sequence framework as well as the scheduled sampling mechanism, we further design three variants of DCRNN: (1) DCNN, in which we concatenate the historical observations into a fixed-length vector and feed it into stacked diffusion convolutional layers to predict the future time series; we train a single model for one-step-ahead prediction and feed the previous prediction back into the model as input to perform multiple-steps-ahead prediction. (2) DCRNN-SEQ, which uses the encoder-decoder sequence-to-sequence learning framework to perform multiple-steps-ahead forecasting. (3) DCRNN, similar to DCRNN-SEQ except for adding scheduled sampling.
Figure 5 compares these methods with regard to MAE for different forecasting horizons. We observe that: (1) DCRNN-SEQ outperforms DCNN by a large margin, which confirms the importance of modeling temporal dependency. (2) DCRNN achieves the best result, and its superiority becomes more evident as the forecasting horizon increases. This is mainly because the model is trained to handle its own mistakes during multiple-steps-ahead prediction and thus suffers less from error propagation. We also train a model that is always fed its own output as input for multiple-steps-ahead prediction. However, its performance is much worse than all three variants, which emphasizes the importance of scheduled sampling.
To better understand the model, we visualize forecasting results as well as the learned filters. Figure 6 shows the visualization of 1-hour-ahead forecasting. We have the following observations: (1) DCRNN generates smooth predictions of the mean when small oscillations exist in the traffic speeds (Figure 6(a)). This reflects the robustness of the model. (2) DCRNN is more likely to accurately predict abrupt changes in the traffic speed than baseline methods (e.g., FC-LSTM). As shown in Figure 6(b), DCRNN predicts the start and the end of the peak hours. This is because DCRNN captures the spatial dependency and is able to utilize the speed changes in neighboring sensors for more accurate forecasting. Figure 7 visualizes examples of learned filters centered at different nodes. The star denotes the center, and colors denote the weights. We can observe that (1) weights are well localized around the center, and (2) the weights diffuse based on road network distance. More visualizations are provided in Appendix F.
Figure 7: Visualization of learned localized filters centered at different nodes with K = 3 on the METR-LA dataset. The star denotes the center, and the colors represent the weights. We observe that weights are localized around the center, and diffuse alongside the road network.
In this paper, we formulated traffic prediction on road networks as a spatiotemporal forecasting problem and proposed the diffusion convolutional recurrent neural network, which captures the spatiotemporal dependencies. Specifically, we use a bidirectional graph random walk to model spatial dependency and recurrent neural networks to capture the temporal dynamics. We further integrated the encoder-decoder architecture and the scheduled sampling technique to improve performance for long-term forecasting. When evaluated on two large-scale real-world traffic datasets, our approach obtained significantly better predictions than baselines. For future work, we will investigate two aspects: (1) applying the proposed model to other spatiotemporal forecasting tasks; (2) modeling the spatiotemporal dependency when the underlying graph structure is evolving, e.g., the K nearest neighbor graph for moving objects.