Spatiotemporal forecasting has various applications in the neuroscience, climate, and transportation domains. Traffic forecasting is one canonical example of such a learning task. The task is challenging due to (1) complex spatial dependency on road networks, (2) non-linear temporal dynamics with changing road conditions, and (3) the inherent difficulty of long-term forecasting. To address these challenges, we propose to model the traffic flow as a diffusion process on a directed graph and introduce Diffusion Convolutional Recurrent Neural Network (DCRNN), a deep learning framework for traffic forecasting that incorporates both spatial and temporal dependency in the traffic flow. Specifically, DCRNN captures the spatial dependency using bidirectional random walks on the graph, and the temporal dependency using the encoder-decoder architecture with scheduled sampling. We evaluate the framework on two real-world large-scale road network traffic datasets and observe consistent improvements of 12%-15% over state-of-the-art baselines.
Spatiotemporal forecasting is a crucial task for a learning system that operates in a dynamic environment. It has a wide range of applications, from autonomous vehicle operations, to energy and smart grid optimization, to logistics and supply chain management. In this paper, we study one important task: traffic forecasting on road networks, a core component of intelligent transportation systems. The goal of traffic forecasting is to predict the future traffic speeds of a sensor network given historical traffic speeds and the underlying road networks.
This task is challenging mainly due to the complex spatiotemporal dependencies and the inherent difficulty of long-term forecasting. On the one hand, traffic time series demonstrate strong temporal dynamics. Recurring incidents such as rush hours or accidents can cause nonstationarity, making long-term forecasting difficult. On the other hand, sensors on the road network exhibit complex yet unique spatial correlations. Figure 1 illustrates an example: road 1 and road 2 are correlated, while road 1 and road 3 are not. Although road 1 and road 3 are close in Euclidean space, they demonstrate very different behaviors. Moreover, future traffic speed is influenced more by the downstream traffic than the upstream one. This means that the spatial structure in traffic is non-Euclidean and directional.
Traffic forecasting has been studied for decades, falling into two main categories: knowledge-driven approaches and data-driven approaches. In transportation and operational research, knowledge-driven methods usually apply queuing theory and simulate user behaviors in traffic (Cascetta, 2013). In the time series community, data-driven methods such as the Auto-Regressive Integrated Moving Average (ARIMA) model and Kalman filtering remain popular (Liu et al., 2011; Lippi et al., 2013). However, simple time series models usually rely on the stationarity assumption, which is often violated by traffic data. Most recently, deep learning models for traffic forecasting have been developed in Lv et al. (2015); Yu et al. (2017b), but without considering the spatial structure. Wu & Tan (2016) and Ma et al. (2017) model the spatial correlation with Convolutional Neural Networks (CNN), but the spatial structure is in the Euclidean space (e.g., 2D images). Bruna et al. (2014) and Defferrard et al. (2016) studied graph convolution, but only for undirected graphs.
In this work, we represent the pair-wise spatial correlations between traffic sensors using a directed graph whose nodes are sensors and whose edge weights denote proximity between the sensor pairs measured by the road network distance. We model the dynamics of the traffic flow as a diffusion process and propose the diffusion convolution operation to capture the spatial dependency. We further propose Diffusion Convolutional Recurrent Neural Network (DCRNN), which integrates diffusion convolution, the sequence-to-sequence architecture, and the scheduled sampling technique. When evaluated on real-world traffic datasets, DCRNN consistently outperforms state-of-the-art traffic forecasting baselines by a large margin. In summary:
• We study the traffic forecasting problem and model the spatial dependency of traffic as a diffusion process on a directed graph. We propose diffusion convolution, which has an intuitive interpretation and can be computed efficiently.
• We propose Diffusion Convolutional Recurrent Neural Network (DCRNN), a holistic approach that captures both spatial and temporal dependencies among time series using diffusion convolution and the sequence-to-sequence learning framework together with scheduled sampling. DCRNN is not limited to transportation and is readily applicable to other spatiotemporal forecasting tasks.
• We conducted extensive experiments on two large-scale real-world datasets, and the proposed approach obtains significant improvement over state-of-the-art baseline methods.
Figure 1: Spatial correlation is dominated by road network structure. (1) Traffic speed in road 1 is similar to that in road 2, as they are located on the same highway. (2) Road 1 and road 3 are located in opposite directions of the highway. Though close to each other in Euclidean space, their road network distance is large, and their traffic speeds differ significantly.
We formalize the learning problem of spatiotemporal traffic forecasting and describe how to model the dependency structures using diffusion convolutional recurrent neural network.
2.1 TRAFFIC FORECASTING PROBLEM
The goal of traffic forecasting is to predict the future traffic speed given previously observed traffic flow from N correlated sensors on the road network. We can represent the sensor network as a weighted directed graph G = (V, E, W), where V is the set of nodes with |V| = N, E is the set of edges, and W ∈ R^{N×N} is a weighted adjacency matrix representing the nodes' proximity (e.g., a function of their road network distance). Denote the traffic flow observed on G as a graph signal X ∈ R^{N×P}, where P is the number of features of each node (e.g., velocity, volume). Let X^{(t)} represent the graph signal observed at time t. The traffic forecasting problem aims to learn a function h(·) that maps T′ historical graph signals to T future graph signals, given a graph G:

[X^{(t−T′+1)}, …, X^{(t)}; G] → h(·) → [X^{(t+1)}, …, X^{(t+T)}]
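To make the tensor shapes concrete, the sketch below wires up the graph signals and a trivial stand-in for h(·); all sizes, feature choices, and the placeholder forecaster are hypothetical, chosen only for illustration:

```python
import numpy as np

# Hypothetical sizes: N sensors, P features per node, T' input steps, T output steps.
N, P = 207, 2          # e.g., 207 sensors; features could be speed and time-of-day
T_in, T_out = 12, 12   # 12 five-minute steps = one hour of history / horizon

# A stack of historical graph signals [X^(t-T'+1), ..., X^(t)], each of shape (N, P).
history = np.random.rand(T_in, N, P)

def h(history):
    """Placeholder forecaster: naively repeat the last observed speed T times."""
    return np.repeat(history[-1:, :, :1], T_out, axis=0)

forecast = h(history)
assert forecast.shape == (T_out, N, 1)  # one speed prediction per sensor per future step
```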
2.2 SPATIAL DEPENDENCY MODELING
We model the spatial dependency by relating traffic flow to a diffusion process, which explicitly captures the stochastic nature of traffic dynamics. This diffusion process is characterized by a random walk on G with restart probability α ∈ [0, 1] and a state transition matrix D_O^{-1}W. Here D_O = diag(W1) is the out-degree diagonal matrix, and 1 ∈ R^N denotes the all-one vector. After many time steps, such a Markov process converges to a stationary distribution P ∈ R^{N×N} whose ith row P_{i,:} ∈ R^N represents the likelihood of diffusion from node v_i ∈ V, hence the proximity w.r.t. node v_i. The following Lemma provides a closed-form solution for the stationary distribution.
Lemma 2.1. (Teng et al., 2016) The stationary distribution of the diffusion process can be represented as a weighted combination of infinite random walks on the graph, and be calculated in closed form:

P = Σ_{k=0}^{∞} α(1 − α)^k (D_O^{-1}W)^k   (1)
where k is the diffusion step. In practice, we use a finite K-step truncation of the diffusion process and assign a trainable weight to each step. We also include the reverse-direction diffusion process, such that the bidirectional diffusion offers the model more flexibility to capture the influence from both the upstream and the downstream traffic.
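The K-step truncation of the stationary distribution in Lemma 2.1 can be sketched numerically as follows; the toy graph, restart probability, and truncation depth are illustrative choices, not values from the paper:

```python
import numpy as np

def stationary_approx(W, alpha, K):
    """K-step truncation of P = sum_k alpha*(1-alpha)^k (D_O^{-1} W)^k (Lemma 2.1)."""
    D_O_inv = np.diag(1.0 / W.sum(axis=1))   # inverse out-degree matrix D_O^{-1}
    T = D_O_inv @ W                          # forward random-walk transition matrix
    P = np.zeros_like(W)
    step = np.eye(W.shape[0])                # (D_O^{-1} W)^0 = I
    for k in range(K):
        P += alpha * (1 - alpha) ** k * step
        step = step @ T                      # advance to (D_O^{-1} W)^{k+1}
    return P

# Toy 3-node directed graph with nonzero out-degrees.
W = np.array([[0., 1., 1.],
              [1., 0., 2.],
              [2., 1., 0.]])
P = stationary_approx(W, alpha=0.5, K=20)
# Row sums equal 1 - (1-alpha)^K, so each row approaches a distribution as K grows.
assert np.allclose(P.sum(axis=1), 1.0, atol=1e-5)
```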
Diffusion Convolution
The resulting diffusion convolution operation over a graph signal X ∈ R^{N×P} and a filter f_θ is defined as:

X_{:,p} ⋆_G f_θ = Σ_{k=0}^{K−1} (θ_{k,1} (D_O^{-1}W)^k + θ_{k,2} (D_I^{-1}W^T)^k) X_{:,p}   for p ∈ {1, …, P}   (2)
where θ ∈ R^{K×2} are the parameters for the filter, and D_O^{-1}W, D_I^{-1}W^T represent the transition matrices of the diffusion process and the reverse one, respectively. In general, computing the convolution can be expensive. However, if G is sparse, Equation 2 can be calculated efficiently using O(K) recursive sparse-dense matrix multiplications with total time complexity O(K|E|) ≪ O(N²). See Appendix B for more detail.
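A minimal dense NumPy sketch of the bidirectional diffusion convolution in Equation 2; the toy graph and filter are illustrative, and a production implementation would store W as a sparse matrix to realize the O(K|E|) cost:

```python
import numpy as np

def diffusion_conv(X, W, theta):
    """Bidirectional diffusion convolution (Equation 2) on a graph signal.

    X: (N, P) graph signal; theta: (K, 2) filter parameters.
    Each power reuses the previous one, i.e., O(K) matrix-product recursions.
    """
    T_fwd = np.diag(1.0 / W.sum(axis=1)) @ W     # D_O^{-1} W  (out-degree normalized)
    T_bwd = np.diag(1.0 / W.sum(axis=0)) @ W.T   # D_I^{-1} W^T (in-degree normalized)
    out = np.zeros_like(X)
    x_f, x_b = X.copy(), X.copy()                # k = 0 terms: (T)^0 X = X
    for k in range(theta.shape[0]):
        out += theta[k, 0] * x_f + theta[k, 1] * x_b
        x_f, x_b = T_fwd @ x_f, T_bwd @ x_b      # advance one diffusion step
    return out

W = np.array([[0., 1., 1.],
              [1., 0., 2.],
              [2., 1., 0.]])
X = np.random.rand(3, 2)
theta = np.zeros((3, 2))
theta[0, 0] = 1.0                                # filter that simply copies the input
assert np.allclose(diffusion_conv(X, W, theta), X)
```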
With the convolution operation defined in Equation 2, we can build a diffusion convolutional layer that maps P-dimensional features to Q-dimensional outputs. Denote the parameter tensor as Θ ∈ R^{Q×P×K×2} = [θ]_{q,p}, where Θ_{q,p,:,:} ∈ R^{K×2} parameterizes the convolutional filter for the pth input and the qth output. The diffusion convolutional layer is thus:

H_{:,q} = a(Σ_{p=1}^{P} X_{:,p} ⋆_G f_{Θ_{q,p,:,:}})   for q ∈ {1, …, Q}   (3)
where X ∈ R^{N×P} is the input, H ∈ R^{N×Q} is the output, {f_{Θ_{q,p,:,:}}} are the filters, and a is the activation function (e.g., ReLU, Sigmoid). The diffusion convolutional layer learns representations for graph-structured data, and we can train it using stochastic gradient-based methods.
Diffusion convolution is defined on both directed and undirected graphs. When applied to undirected graphs, we show that many existing graph-structured convolutional operations, including the popular spectral graph convolution, i.e., ChebNet (Defferrard et al., 2016), can be considered as a special case of diffusion convolution (up to a similarity transformation). Let D denote the degree matrix, and let L = D^{-1/2}(D − W)D^{-1/2} be the normalized graph Laplacian; the following Proposition demonstrates the connection.
Proposition 2.2. The spectral graph convolution defined as

X_{:,p} ⋆_G f_θ = Φ F(θ) Φ^T X_{:,p}

with eigenvalue decomposition L = ΦΛΦ^T and F(θ) = Σ_{k=0}^{K−1} θ_k Λ^k, is equivalent to graph diffusion convolution up to a similarity transformation, when the graph G is undirected.
2.3 TEMPORAL DYNAMICS MODELING
We leverage recurrent neural networks (RNNs) to model the temporal dependency. In particular, we use Gated Recurrent Units (GRU) (Chung et al., 2014), a simple yet powerful variant of RNNs. We replace the matrix multiplications in GRU with the diffusion convolution, which leads to our proposed Diffusion Convolutional Gated Recurrent Unit (DCGRU):

r^{(t)} = σ(Θ_r ⋆_G [X^{(t)}, H^{(t−1)}] + b_r)
u^{(t)} = σ(Θ_u ⋆_G [X^{(t)}, H^{(t−1)}] + b_u)
C^{(t)} = tanh(Θ_C ⋆_G [X^{(t)}, (r^{(t)} ⊙ H^{(t−1)})] + b_c)
H^{(t)} = u^{(t)} ⊙ H^{(t−1)} + (1 − u^{(t)}) ⊙ C^{(t)}
where X^{(t)}, H^{(t)} denote the input and output at time t, and r^{(t)}, u^{(t)} are the reset gate and update gate at time t, respectively. ⋆_G denotes the diffusion convolution defined in Equation 2, and Θ_r, Θ_u, Θ_C are the parameters for the corresponding filters. Similar to GRU, DCGRU can be used to build recurrent neural network layers and be trained using backpropagation through time.
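One DCGRU step can be sketched in NumPy as follows. Here the ⋆_G operator is approximated by a simplified diffusion convolution that mixes features with per-step weight matrices; all sizes and parameter names are illustrative assumptions, not the paper's exact parameterization:

```python
import numpy as np

rng = np.random.default_rng(0)
N, P, Q, K = 4, 2, 3, 2   # sensors, input features, hidden units, diffusion steps

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gconv(XH, T, Theta):
    """Sketch of Θ ⋆_G [X, H]: sum_k T^k [X, H] Θ_k, with Θ_k mixing features to Q units."""
    out = np.zeros((XH.shape[0], Theta.shape[2]))
    x = XH
    for k in range(Theta.shape[0]):
        out += x @ Theta[k]
        x = T @ x                              # advance one diffusion step
    return out

def dcgru_step(X_t, H_prev, T, params):
    """One DCGRU update: reset gate r, update gate u, candidate C, new state H."""
    XH = np.concatenate([X_t, H_prev], axis=1)
    r = sigmoid(gconv(XH, T, params['Theta_r']) + params['b_r'])
    u = sigmoid(gconv(XH, T, params['Theta_u']) + params['b_u'])
    XrH = np.concatenate([X_t, r * H_prev], axis=1)
    C = np.tanh(gconv(XrH, T, params['Theta_C']) + params['b_c'])
    return u * H_prev + (1.0 - u) * C

W = rng.random((N, N)) + 0.1                   # dense positive weights -> valid out-degrees
T = np.diag(1.0 / W.sum(axis=1)) @ W           # forward transition matrix D_O^{-1} W
params = {name: rng.standard_normal((K, P + Q, Q)) * 0.1
          for name in ('Theta_r', 'Theta_u', 'Theta_C')}
params.update(b_r=np.zeros(Q), b_u=np.zeros(Q), b_c=np.zeros(Q))

H = dcgru_step(rng.random((N, P)), np.zeros((N, Q)), T, params)
assert H.shape == (N, Q)
```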
For multiple-step-ahead forecasting, we employ the Sequence-to-Sequence architecture (Sutskever et al., 2014). Both the encoder and the decoder are recurrent neural networks with DCGRU. During training, we feed the historical time series into the encoder and use its final states to initialize the decoder. The decoder generates predictions given previous ground truth observations. At testing time, ground truth observations are replaced by predictions generated by the model itself. The discrepancy between the input distributions of training and testing can cause degraded performance. To mitigate this issue, we integrate scheduled sampling (Bengio et al., 2015) into the model, feeding the model either the ground truth observation with probability ϵ_i or the prediction by the model with probability 1 − ϵ_i at the ith iteration. During the training process, ϵ_i gradually decreases to 0 to allow the model to learn the testing distribution.
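One common choice for decaying ϵ_i is an inverse sigmoid schedule; the sketch below assumes that schedule with an illustrative decay constant τ (the paper's actual schedule and constants may differ):

```python
import math
import random

def sampling_prob(i, tau=3000.0):
    """Inverse-sigmoid decay: probability eps_i of feeding ground truth at iteration i.

    Stays near 1 early in training, then decays smoothly toward 0.
    tau is an assumed decay constant controlling how late the transition happens.
    """
    return tau / (tau + math.exp(i / tau))

def pick_decoder_input(ground_truth, prediction, i, rng=random):
    """Scheduled sampling: use ground truth with prob eps_i, else the model's own prediction."""
    return ground_truth if rng.random() < sampling_prob(i) else prediction

# Early iterations almost always feed ground truth; late iterations almost never do.
assert sampling_prob(0) > 0.99
assert sampling_prob(100000) < 0.01
```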
With both spatial and temporal modeling, we build a Diffusion Convolutional Recurrent Neural Network (DCRNN). The model architecture of DCRNN is shown in Figure 2. The entire network is trained by maximizing the likelihood of generating the target future time series using backpropagation through time. DCRNN is able to capture spatiotemporal dependencies among time series and can be applied to various spatiotemporal forecasting problems.
Figure 2: System architecture for the Diffusion Convolutional Recurrent Neural Network designed for spatiotemporal traffic forecasting. The historical time series are fed into an encoder whose final states are used to initialize the decoder. The decoder makes predictions based on either previous ground truth or the model output.
Traffic forecasting is a classic problem in transportation and operational research, where early approaches are primarily based on queuing theory and simulations (Drew, 1968). Data-driven approaches for traffic forecasting have received considerable attention, and more details can be found in a recent survey paper (Vlahogianni et al., 2014) and the references therein. However, existing machine learning models either impose strong stationarity assumptions on the data (e.g., the auto-regressive model) or fail to account for highly non-linear temporal dependency (e.g., the latent space models of Yu et al. (2016); Deng et al. (2016)). Deep learning models deliver new promise for time series forecasting problems. For example, in Yu et al. (2017b); Laptev et al. (2017), the authors study time series forecasting using deep Recurrent Neural Networks (RNN). Convolutional Neural Networks (CNN) have also been applied to traffic forecasting. Zhang et al. (2016; 2017) convert the road network to a regular 2-D grid and apply traditional CNN to predict crowd flow. Cheng et al. (2017) propose DeepTransport, which models the spatial dependency by explicitly collecting upstream and downstream neighborhood roads for each individual road and then conducting convolution on these neighborhoods respectively.
Recently, CNN has been generalized to arbitrary graphs based on spectral graph theory. Graph convolutional neural networks (GCN) were first introduced in Bruna et al. (2014), which bridges spectral graph theory and deep neural networks. Defferrard et al. (2016) propose ChebNet, which improves GCN with fast localized convolution filters. Kipf & Welling (2017) simplify ChebNet and achieve state-of-the-art performance in semi-supervised classification tasks. Seo et al. (2016) combine ChebNet with Recurrent Neural Networks (RNN) for structured sequence modeling. Yu et al. (2017a) model the sensor network as an undirected graph and apply ChebNet and the convolutional sequence model (Gehring et al., 2017) for forecasting. One limitation of the mentioned spectral-based convolutions is that they generally require the graph to be undirected to calculate a meaningful spectral decomposition. Going from the spectral domain to the vertex domain, Atwood & Towsley (2016) propose the diffusion-convolutional neural network (DCNN), which defines convolution as a diffusion process across each node in a graph-structured input. Hechtlinger et al. (2017) propose GraphCNN to generalize convolution to graphs by convolving every node with its p nearest neighbors. However, neither of these methods considers the temporal dynamics; they mainly deal with static graph settings.
Our approach is different from all those methods due to both the problem settings and the formulation of the convolution on the graph. We model the sensor network as a weighted directed graph which is more realistic than grid or undirected graph. Besides, the proposed convolution is defined using bidirectional graph random walk and is further integrated with the sequence to sequence learning framework as well as the scheduled sampling to model the long-term temporal dependency.
Table 1: Performance comparison of different approaches for traffic speed forecasting. DCRNN achieves the best performance with all three metrics for all forecasting horizons, and the advantage becomes more evident with the increase of the forecasting horizon.
We conduct experiments on two real-world large-scale datasets: (1) METR-LA: This traffic dataset contains traffic information collected from loop detectors on the highways of Los Angeles County (Jagadish et al., 2014). We select 207 sensors and collect 4 months of data, ranging from Mar 1st 2012 to Jun 30th 2012, for the experiment. (2) PEMS-BAY: This traffic dataset is collected by the California Transportation Agencies (CalTrans) Performance Measurement System (PeMS). We select 325 sensors in the Bay Area and collect 6 months of data, ranging from Jan 1st 2017 to May 31st 2017, for the experiment. The sensor distributions of both datasets are visualized in Figure 8 in the Appendix.
In both datasets, we aggregate traffic speed readings into 5-minute windows and apply Z-score normalization. 70% of the data is used for training, 20% for testing, and the remaining 10% for validation. To construct the sensor graph, we compute the pairwise road network distances between sensors and build the adjacency matrix using a thresholded Gaussian kernel (Shuman et al., 2013):
W_{ij} = exp(−dist(v_i, v_j)² / σ²) if dist(v_i, v_j) ≤ κ, otherwise 0, where W_{ij} represents the edge weight between sensor v_i and sensor v_j, dist(v_i, v_j) denotes the road network distance from sensor v_i to sensor v_j, σ is the standard deviation of distances, and κ is the threshold.
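This kernel construction can be sketched directly; the toy distance matrix and threshold below are illustrative, not values from the datasets:

```python
import numpy as np

def gaussian_kernel_adjacency(dist, kappa):
    """Thresholded Gaussian kernel: W_ij = exp(-dist_ij^2 / sigma^2) if dist_ij <= kappa, else 0.

    dist: (N, N) road-network distances (may be asymmetric on a directed road network).
    sigma is taken as the standard deviation of all distances, per the text.
    """
    sigma = dist.std()
    W = np.exp(-(dist ** 2) / (sigma ** 2))
    W[dist > kappa] = 0.0                        # sparsify: drop far-apart sensor pairs
    return W

# Toy pairwise road-network distances for three sensors.
dist = np.array([[0.0, 1.0, 5.0],
                 [1.5, 0.0, 2.0],
                 [6.0, 2.5, 0.0]])
W = gaussian_kernel_adjacency(dist, kappa=4.0)
assert W[0, 2] == 0.0 and W[2, 0] == 0.0         # entries beyond the threshold are zeroed
assert np.all((W >= 0) & (W <= 1))               # kernel weights lie in [0, 1]
```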
Figure 3: Learning curve for DCRNN and DCRNN without diffusion convolution. Removing diffusion convolution results in much higher validation error. Moreover, DCRNN with bidirectional random walk achieves the lowest validation error.
Figure 4: Effects of K and the number of units in each layer of DCRNN. K corresponds to the reception field width of the filter, and the number of units corresponds to the number of filters.
Baselines. We compare DCRNN with widely used time series regression models, including (1) HA: Historical Average, which models the traffic flow as a seasonal process and uses the weighted average of previous seasons as the prediction; (2) ARIMA_kal: the Auto-Regressive Integrated Moving Average model with a Kalman filter, which is widely used in time series prediction; (3) VAR: Vector Auto-Regression (Hamilton, 1994); (4) SVR: Support Vector Regression, which uses a linear support vector machine for the regression task. The following deep neural network based approaches are also included: (5) FNN: a feed-forward neural network with two hidden layers and L2 regularization; (6) FC-LSTM: a recurrent neural network with fully connected LSTM hidden units (Sutskever et al., 2014).
All neural network based approaches are implemented using Tensorflow (Abadi et al., 2016), and trained using the Adam optimizer with learning rate annealing. The best hyperparameters are chosen using the Tree-structured Parzen Estimator (TPE) (Bergstra et al., 2011) on the validation dataset. Detailed parameter settings for DCRNN as well as baselines are available in Appendix E.
4.2 TRAFFIC FORECASTING PERFORMANCE COMPARISON
Table 1 shows the comparison of different approaches for 15-minute, 30-minute, and 1-hour-ahead forecasting on both datasets. These methods are evaluated based on three commonly used metrics in traffic forecasting: (1) Mean Absolute Error (MAE), (2) Mean Absolute Percentage Error (MAPE), and (3) Root Mean Squared Error (RMSE). Missing values are excluded when calculating these metrics. Detailed formulations of these metrics are provided in Appendix E.2. We observe the following phenomena on both datasets. (1) RNN-based methods, including FC-LSTM and DCRNN, generally outperform the other baselines, which emphasizes the importance of modeling the temporal dependency. (2) DCRNN achieves the best performance on all metrics for all forecasting horizons, which suggests the effectiveness of spatiotemporal dependency modeling. (3) Deep neural network based methods, including FNN, FC-LSTM, and DCRNN, tend to perform better than linear baselines for long-term forecasting, e.g., 1 hour ahead. This is because the temporal dependency becomes increasingly non-linear as the horizon grows. Besides, as the historical average method does not depend on short-term data, its performance is invariant to small increases in the forecasting horizon.
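The masked metrics described above (missing values excluded) can be sketched as follows; encoding missing readings as zeros is an assumption made here for illustration:

```python
import numpy as np

def masked_metrics(y_true, y_pred, null_val=0.0):
    """MAE, MAPE, and RMSE computed only over valid (non-missing) entries."""
    mask = y_true != null_val            # missing speeds are assumed encoded as null_val
    err = y_pred[mask] - y_true[mask]
    mae = np.abs(err).mean()
    mape = np.abs(err / y_true[mask]).mean()
    rmse = np.sqrt((err ** 2).mean())
    return mae, mape, rmse

y_true = np.array([60.0, 0.0, 30.0, 45.0])   # second reading is missing
y_pred = np.array([58.0, 70.0, 33.0, 45.0])
mae, mape, rmse = masked_metrics(y_true, y_pred)
assert abs(mae - (2 + 3 + 0) / 3) < 1e-9     # the missing entry is excluded from the average
```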
Note that traffic forecasting on the METR-LA (Los Angeles, which is known for its complicated traffic conditions) dataset is more challenging than on the PEMS-BAY (Bay Area) dataset. Thus we use METR-LA as the default dataset for the following experiments.
To further investigate the effect of spatial dependency modeling, we compare DCRNN with the following variants: (1) DCRNN-NoConv, which ignores spatial dependency by replacing the transition matrices in the diffusion convolution (Equation 2) with identity matrices. This essentially means the forecasting of a sensor can only be inferred from its own historical readings; (2) DCRNN-UniConv, which only uses the forward random walk transition matrix for diffusion convolution. Figure 3 shows the learning curves of these three models with roughly the same number of parameters. Without diffusion convolution, DCRNN-NoConv has much higher validation error. Moreover, DCRNN achieves the lowest validation error, which shows the effectiveness of using the bidirectional random walk. The intuition is that the bidirectional random walk gives the model the ability and flexibility to capture the influence from both the upstream and the downstream traffic.
To investigate the effect of graph construction, we construct an undirected graph by setting Ŵ_{ij} = max(W_{ij}, W_{ji}), where Ŵ is the new symmetric weight matrix. Then we develop a variant of DCRNN, denoted GCRNN, which uses sequence-to-sequence learning with the ChebNet graph convolution (Proposition 2.2) and roughly the same number of parameters. Table 2 shows the comparison between DCRNN and GCRNN on the METR-LA dataset. DCRNN consistently outperforms GCRNN. The intuition is that the directed graph better captures the asymmetric correlation between traffic sensors. Figure 4 shows the effects of different parameters. K roughly corresponds to the size of the filters' reception fields, while the number of units corresponds to the number of filters. A larger K enables the model to capture broader spatial dependency at the cost of increased learning complexity. We observe that with the increase of K, the error on the validation dataset first quickly decreases and then slightly increases. Similar behavior is observed when varying the number of units.
To evaluate the effect of temporal modeling, including the sequence-to-sequence framework as well as the scheduled sampling mechanism, we further design three variants of DCRNN: (1) DCNN, in which we concatenate the historical observations into a fixed-length vector and feed it into stacked diffusion convolutional layers to predict the future time series; we train a single model for one-step-ahead prediction and feed the previous prediction into the model as input to perform multiple-steps-ahead prediction. (2) DCRNN-SEQ, which uses the encoder-decoder sequence-to-sequence learning framework to perform multiple-steps-ahead forecasting. (3) DCRNN, similar to DCRNN-SEQ except for adding scheduled sampling.
Figure 5 compares these methods with regard to MAE for different forecasting horizons. We observe that: (1) DCRNN-SEQ outperforms DCNN by a large margin, which confirms the importance of modeling temporal dependency. (2) DCRNN achieves the best result, and its superiority becomes more evident with the increase of the forecasting horizon. This is mainly because the model is trained to deal with its mistakes during multiple-steps-ahead prediction and thus suffers less from error propagation. We also train a model that is always fed its own output as input for multiple-steps-ahead prediction. However, its performance is much worse than all three variants, which emphasizes the importance of scheduled sampling.
To better understand the model, we visualize forecasting results as well as learned filters. Figure 6 shows the visualization of 1 hour ahead forecasting. We have the following observations: (1) DCRNN generates smooth prediction of the mean when small oscillation exists in the traffic speeds (Figure 6(a)). This reflects the robustness of the model. (2) DCRNN is more likely to accurately predict abrupt changes in the traffic speed than baseline methods (e.g., FC-LSTM). As shown in Figure 6(b), DCRNN predicts the start and the end of the peak hours. This is because DCRNN captures the spatial dependency, and is able to utilize the speed changes in neighborhood sensors for more accurate forecasting. Figure 7 visualizes examples of learned filters centered at different nodes. The star denotes the center, and colors denote the weights. We can observe that (1) weights are well localized around the center, and (2) the weights diffuse based on road network distance. More visualizations are provided in Appendix F.
Figure 7: Visualization of learned localized filters centered at different nodes with K = 3 on the METR-LA dataset. The star denotes the center, and the colors represent the weights. We observe that weights are localized around the center, and diffuse alongside the road network.
In this paper, we formulated traffic prediction on road networks as a spatiotemporal forecasting problem and proposed the diffusion convolutional recurrent neural network to capture the spatiotemporal dependencies. Specifically, we use the bidirectional graph random walk to model the spatial dependency and a recurrent neural network to capture the temporal dynamics. We further integrated the encoder-decoder architecture and the scheduled sampling technique to improve the performance of long-term forecasting. When evaluated on two large-scale real-world traffic datasets, our approach obtained significantly better predictions than the baselines. For future work, we will investigate the following two aspects: (1) applying the proposed model to other spatiotemporal forecasting tasks; (2) modeling the spatiotemporal dependency when the underlying graph structure is evolving, e.g., the K-nearest-neighbor graph for moving objects.