Reinforcement Learning for Solving the Vehicle Routing Problem (Notes)


Abstract

We present an end-to-end framework for solving the Vehicle Routing Problem (VRP) using reinforcement learning. In this approach, we train a single model that finds near-optimal solutions for problem instances sampled from a given distribution, only by observing the reward signals and following feasibility rules. Our model represents a parameterized stochastic policy, and by applying a policy gradient algorithm to optimize its parameters, the trained model produces the solution as a sequence of consecutive actions in real time, without the need to re-train for every new problem instance. On capacitated VRP, our approach outperforms classical heuristics and Google’s OR-Tools on medium-sized instances in solution quality with comparable computation time (after training). We demonstrate how our approach can handle problems with split delivery and explore the effect of such deliveries on the solution quality. Our proposed framework can be applied to other variants of the VRP such as the stochastic VRP, and has the potential to be applied more generally to combinatorial optimization problems.


1 Introduction

The Vehicle Routing Problem (VRP) is a combinatorial optimization problem that has been studied in applied mathematics and computer science for decades. VRP is known to be a computationally difficult (NP-hard) problem for which many exact and heuristic algorithms have been proposed, but providing fast and reliable solutions is still a challenging task. In the simplest form of the VRP, a single capacitated vehicle is responsible for delivering items to multiple customer nodes; the vehicle must return to the depot to pick up additional items when it runs out. The objective is to optimize a set of routes, all beginning and ending at a given node, called the depot, in order to attain the maximum possible reward, which is often the negative of the total vehicle distance or average service time (i.e., the shorter the routes, the better the solution). This problem is computationally difficult to solve to optimality, even with only a few hundred customer nodes [12]. For an overview of the VRP, see, for example, [15, 22, 23, 31].


Common variants of the Vehicle Routing Problem (VRP) include:
(1) The Traveling Salesman Problem (TSP)
(2) The Capacitated VRP
(3) The VRP with Time Windows
(4) The Pickup and Delivery Problem
(5) The Heterogeneous-Fleet (multiple vehicle type) VRP
(6) The VRP with Precedence Constraints
(7) The VRP with Compatibility Constraints
(8) The VRP with Stochastic Demands

https://en.wikipedia.org/wiki/Vehicle_routing_problem

Typical objectives include:
  • Minimizing the global transportation cost, based on the total distance traveled and the fixed costs associated with the vehicles and drivers used
  • Minimizing the number of vehicles needed to serve all customers
  • Minimizing variation in travel time and vehicle load
  • Minimizing penalties for low-quality service
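
To make the objective above concrete, here is a small self-contained sketch (not from the paper; the instance format and helper names are illustrative) of sampling a capacitated-VRP instance and scoring a candidate route by the negative total distance:

```python
import math
import random

def sample_instance(num_customers=10, capacity=30, max_demand=9, seed=0):
    """Sample a toy CVRP instance: a depot and customers on the unit square."""
    rng = random.Random(seed)
    depot = (rng.random(), rng.random())
    customers = [(rng.random(), rng.random()) for _ in range(num_customers)]
    demands = [rng.randint(1, max_demand) for _ in range(num_customers)]
    return depot, customers, demands, capacity

def route_reward(depot, customers, route):
    """Reward = negative total distance of a tour starting and ending at the depot.

    `route` is a sequence of customer indices; -1 marks a return to the depot
    (e.g. to refill), so a solution may visit the depot several times.
    """
    stops = [depot if i == -1 else customers[i] for i in route]
    tour = [depot] + stops + [depot]
    return -sum(math.dist(a, b) for a, b in zip(tour, tour[1:]))

depot, customers, demands, capacity = sample_instance()
print(route_reward(depot, customers, route=[0, 3, -1, 5, 7]))
```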

The prospect of new algorithm discovery, without any hand-engineered reasoning, makes neural networks and reinforcement learning a compelling choice that has the potential to be an important milestone on the path toward approaching these problems. In this work, we develop a framework with the capability of solving a wide variety of combinatorial optimization problems using Reinforcement Learning (RL) and show how it can be applied to solve the VRP. For this purpose, we consider the Markov Decision Process (MDP) formulation of the problem, in which the optimal solution can be viewed as a sequence of decisions. This allows us to use RL to produce near-optimal solutions by increasing the probability of decoding “desirable” sequences. A naive approach is to find a problem-specific solution by considering every instance separately. Obviously, this approach is not practical in terms of either solution quality or runtime since there should be many trajectories sampled from one MDP to be able to produce a near-optimal solution. Moreover, the learned policy does not apply to instances other than the one that was used in the training; with a small perturbation of the problem setting, we need to rebuild the policy from scratch.


Therefore, rather than focusing on training a separate model for every problem instance, we propose a structure that performs well on any problem sampled from a given distribution. This means that if we generate a new VRP instance with the same number of nodes and vehicle capacity, and the same location and demand distributions as the ones that we used during training, then the trained policy will work well, and we can solve the problem right away, without retraining for every new instance. As long as we approximate the generating distribution of the problem, the framework can be applied. One can view the trained model as a black-box heuristic (or a meta-algorithm) which generates a high-quality solution in a reasonable amount of time.



This study is motivated by the recent work by Bello et al. [4]. We have generalized their framework to include a wider range of combinatorial optimization problems such as the VRP. Bello et al. [4] propose the use of a Pointer Network [32] to decode the solution. One major issue that prohibits the direct use of their approach for the VRP is that it assumes the system is static over time. In contrast, in the VRP, the demands change over time in the sense that once a node has been visited its demand becomes, effectively, zero. To overcome this, we propose an alternate approach—which is actually simpler than the Pointer Network approach—that can efficiently handle both the static and dynamic elements of the system. Our model consists of a recurrent neural network (RNN) decoder coupled with an attention mechanism. At each time step, the embeddings of the static elements are the input to the RNN decoder, and the output of the RNN and the dynamic element embeddings are fed into an attention mechanism, which forms a distribution over the feasible destinations that can be chosen at the next decision point.



The proposed framework is appealing since we utilize a self-driven learning procedure that only requires the reward calculation based on the generated outputs; as long as we can observe the reward and verify the feasibility of a generated sequence, we can learn the desired meta-algorithm. For instance, if one does not know how to solve the VRP but can compute the cost of a given solution, then one can provide the signal required for solving the problem using our method. Unlike most classical heuristic methods, it is robust to problem changes, meaning that when the inputs change in any way, it can automatically adapt the solution. Using classical heuristics for VRP, the entire distance matrix must be recalculated and the system must be re-optimized from scratch, which is often impractical, especially if the problem size is large. In contrast, our proposed framework does not require an explicit distance matrix, and only one feed-forward pass of the network will update the routes based on the new data.


Our numerical experiments indicate that our framework performs significantly better than well-known classical heuristics designed for the VRP, and that it is robust in the sense that its worst results are still relatively close to optimal. Comparing our method with the OR-Tools VRP engine [16], which is one of the best open-source VRP solvers, we observe a noticeable improvement; in VRP instances with 50 and 100 customers, our method provides shorter tours in roughly 61% of the instances. Another interesting observation that we make in this study is that by allowing multiple vehicles to supply the demand of a single node, our RL-based framework finds policies that outperform the solutions that require single deliveries. We obtain this appealing property, known as split delivery, without any hand engineering and no extra cost.


2 Background
Before presenting the model, we briefly review some background that is closely related to our work.


Sequence-to-Sequence Models. Sequence-to-sequence models [30, 32, 24] are useful in tasks for which a mapping from one sequence to another is required. They have been extensively studied in the field of neural machine translation over the past several years, and there are numerous variants of these models. The general architecture, which is almost the same among different versions, consists of two RNN networks, called the encoder and decoder. An encoder network reads through the input sequence and stores the knowledge in a fixed-size vector representation (or a sequence of vectors); then, a decoder converts the encoded information back to an output sequence.
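
For reference, here is a minimal encoder-decoder pair of this kind (an illustrative PyTorch sketch, not tied to any of the cited implementations), where the encoder compresses the source sequence into its final hidden state and the decoder generates the output sequence from it:

```python
import torch
import torch.nn as nn

class Seq2Seq(nn.Module):
    """Vanilla sequence-to-sequence model: encode to a fixed-size vector, then decode."""
    def __init__(self, vocab_size, hidden_size=128):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, hidden_size)
        self.encoder = nn.GRU(hidden_size, hidden_size, batch_first=True)
        self.decoder = nn.GRU(hidden_size, hidden_size, batch_first=True)
        self.out = nn.Linear(hidden_size, vocab_size)

    def forward(self, src_tokens, tgt_tokens):
        _, h = self.encoder(self.embed(src_tokens))    # h: fixed-size summary of the source
        dec_out, _ = self.decoder(self.embed(tgt_tokens), h)
        return self.out(dec_out)                       # logits over the output vocabulary
```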



In the vanilla Sequence-to-Sequence architecture [30], the source sequence appears only once in the encoder and the entire output sequence is generated based on one vector (i.e., the last hidden state of the encoder RNN). Other extensions, for example Bahdanau et al. [3], illustrate that the source information can be used more wisely to increase the amount of information during the decoding steps. In addition to the encoder and decoder networks, they employ another neural network, namely an attention mechanism that attends to the entire encoder RNN states. This mechanism allows the decoder to focus on the important locations of the source sequence and use the relevant information during decoding steps for producing “better” output sequences. Recently, the concept of attention has been a popular research idea due to its capability to align different objects, e.g., in computer vision [6, 37, 38, 18] and neural machine translation [3, 19, 24]. In this study, we also employ a special attention structure for policy representation. See Section 3.3 for a detailed discussion of the attention mechanism.
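
The sketch below shows one common form of this idea, Bahdanau-style additive attention (an illustrative implementation, not the exact formulation of [3]): the decoder state is scored against every encoder state, and the resulting weights form a context vector used at each decoding step:

```python
import torch
import torch.nn as nn

class AdditiveAttention(nn.Module):
    """Bahdanau-style attention: score(h_dec, h_enc) = v^T tanh(W [h_enc; h_dec])."""
    def __init__(self, hidden_size):
        super().__init__()
        self.W = nn.Linear(2 * hidden_size, hidden_size)
        self.v = nn.Linear(hidden_size, 1, bias=False)

    def forward(self, decoder_state, encoder_states):
        # decoder_state: (batch, hidden); encoder_states: (batch, seq_len, hidden)
        seq_len = encoder_states.size(1)
        dec = decoder_state.unsqueeze(1).expand(-1, seq_len, -1)
        scores = self.v(torch.tanh(self.W(torch.cat([encoder_states, dec], dim=-1))))
        weights = torch.softmax(scores.squeeze(-1), dim=-1)        # attention over source positions
        context = torch.bmm(weights.unsqueeze(1), encoder_states)  # weighted sum of encoder states
        return context.squeeze(1), weights
```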


Neural Combinatorial Optimization. Over the last several years, multiple methods have been developed to tackle combinatorial optimization problems by using recent advances in artificial intelligence. The first attempt was proposed by Vinyals et al. [32], who introduce the concept of a Pointer Network, a model originally inspired by sequence-to-sequence models. Because it is invariant to the length of the encoder sequence, the Pointer Network enables the model to apply to combinatorial optimization problems, where the output sequence length is determined by the source sequence. They use the Pointer Network architecture in a supervised fashion to find near-optimal Traveling Salesman Problem (TSP) tours from ground truth optimal (or heuristic) solutions. This dependence on supervision prohibits the Pointer Network from finding better solutions than the ones provided during the training.


Closest to our approach, Bello et al. [4] address this issue by developing a neural combinatorial optimization framework that uses RL to optimize a policy modeled by a Pointer Network. Using several classical combinatorial optimization problems such as TSP and the knapsack problem, they show the effectiveness and generality of their architecture.


On a related topic, Dai et al. [11] solve optimization problems over graphs using a graph embedding structure [10] and a deep Q-learning (DQN) algorithm [25]. Even though VRP can be represented by a graph with weighted nodes and edges, their proposed model does not directly apply since in VRP, a particular node (e.g. the depot) might be visited multiple times. Next, we introduce our model, which is a simplified version of the Pointer Network.

3 The Model

In this section, we formally define the problem and our proposed framework for a generic combinatorial optimization problem with a given set of inputs X = {x^i, i = 1, ..., M}. We allow some of the elements of each input to change between the decoding steps, which is, in fact, the case in many problems such as the VRP. The dynamic elements might be an artifact of the decoding procedure itself, or they can be imposed by the environment. For example, in the VRP, the remaining customer demands change over time as the vehicle visits the customer nodes; or we might consider a variant in which new customers arrive or adjust their demand values over time, independent of the vehicle decisions. Formally, we represent each input x^i by a sequence of tuples {x_t^i = (s^i, d_t^i), t = 0, 1, ...}, where s^i and d_t^i are the static and dynamic elements of the input, respectively, and can themselves be tuples. One can view x_t^i as a vector of features that describes the state of input i at time t. For instance, in the VRP, x_t^i gives a snapshot of customer i, where s^i corresponds to the 2-dimensional coordinate of customer i's location and d_t^i is its demand at time t. We will denote the set of all input states at a fixed time t with X_t.


We start from an arbitrary input in X_0, where we use the pointer y_0 to refer to that input. At every decoding time t, y_{t+1} points to one of the available inputs X_t, which determines the input of the next decoder step; this process continues until a termination condition is satisfied. The termination condition is problem-specific, showing that the generated sequence satisfies the feasibility constraints. For instance, in the VRP that we consider in this work, the terminating condition is that there is no more demand to satisfy. This process will generate a sequence of length T, Y = {y_t, t = 0, ..., T}, possibly with a different sequence length compared to the input length M. This is due to the fact that, for example, the vehicle may have to go back to the depot several times to refill. We also use the notation Y_t to denote the decoded sequence up to time t, i.e., Y_t = {y_0, ..., y_t}. We are interested in finding a stochastic policy π which generates the sequence Y in a way that minimizes a loss objective while satisfying the problem constraints. The optimal policy π* will generate the optimal solution with probability 1. Our goal is to make π as close to π* as possible. Similar to Sutskever et al. [30], we use the probability chain rule to decompose the probability of generating sequence Y, i.e., P(Y | X_0), as follows.
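
The decomposition referred to here is omitted in these notes; reconstructed from the surrounding definitions (so the indexing may differ slightly from equations (1)-(2) of the paper), it takes the form

```latex
P(Y \mid X_0) = \prod_{t=0}^{T} P\left(y_{t+1} \mid Y_t, X_t\right),
\qquad X_{t+1} = f(y_{t+1}, X_t).
```

The second relation is the state-transition function referred to as (2) in Remark 1 below.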


Remark 1: This model can handle combinatorial optimization problems in both a more classical static setting as well as in dynamically changing ones. In static combinatorial optimization, X0 fully defines the problem that we are trying to solve. For example, in the VRP, X0 includes all customer locations as well as their demands, and the depot location; then, the remaining demands are updated with respect to the vehicle destination and its load. With this consideration, often there exists a well-defined Markovian transition function f, as defined in (2), which is sufficient to update the dynamics between decision points. However, our model can also be applied to problems in which the state transition function is unknown and/or is subject to external noise, since the training does not explicitly make use of the transition function. However, knowing this transition function helps in simulating the environment that the training algorithm interacts with. See Appendix C.6 for an example of how to apply the model to a stochastic version of the VRP in which random customers with random demands appear over time.
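
As a concrete illustration (the names and state representation are assumptions, not the paper's code), the following sketch implements one possible Markovian transition f for the capacitated VRP: it updates the remaining demands and the vehicle load after a decision, and treats a visit to the depot as a refill:

```python
def vrp_transition(demands, load, action, capacity, depot=-1):
    """One-step state update X_{t+1} = f(y_{t+1}, X_t) for the capacitated VRP.

    demands: list of remaining customer demands (the dynamic elements)
    load:    remaining vehicle capacity
    action:  chosen customer index, or `depot` for a return to the depot
    """
    demands = list(demands)
    if action == depot:
        return demands, capacity            # refilling at the depot restores full capacity
    delivered = min(demands[action], load)  # deliver as much as the current load allows
    demands[action] -= delivered
    load -= delivered
    return demands, load
```

Note that delivering min(demand, load) rather than requiring a full delivery is exactly what allows the split deliveries discussed in the introduction.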


3.1 Limitations of Pointer Networks
Although the framework proposed by Bello et al. [4] works well on problems such as the knapsack problem and TSP, it is not applicable to more complicated combinatorial optimization problems in which the system representation varies over time, such as the VRP. Bello et al. [4] feed a random sequence of inputs to the RNN encoder. Figure 1 illustrates with an example why using the RNN in the encoder is restrictive. Suppose that at the first decision step, the policy sends the vehicle to customer 1, and as a result, its demand is satisfied, i.e., d_1^1 ≠ d_0^1. Then in the second decision step, we need to re-calculate the whole network with the new d_1^1 information in order to choose the next customer. The dynamic elements complicate the forward pass of the network since there should be encoder/decoder updates when an input changes. The situation is even worse during back-propagation to accumulate the gradients, since we need to remember when the dynamic elements changed. In order to resolve this complication, we require the model to be invariant to the input sequence so that changing the order of any two inputs does not affect the network. In Section 3.2, we present a simple network that satisfies this property.


3.2 The Proposed Neural Network Model
We argue that the RNN encoder adds extra complication to the encoder but is actually not necessary, and the approach can be made much more general by omitting it. RNNs are necessary only when the inputs convey sequential information; e.g., in text translation the combination of words and their relative position must be captured in order for the translation to be accurate. But the question here is, why do we need to have them in the encoder for combinatorial optimization problems when there is no meaningful order in the input set? As an example, in the VRP, the inputs are the set of unordered customer locations with their respective demands, and their order is not meaningful; any random permutation contains the same information as the original inputs. Therefore, in our model, we simply leave out the encoder RNN and directly use the embedded inputs instead of the RNN hidden states. By this modification, many of the computational complications disappear, without decreasing the model’s efficiency. In Appendix A, we provide an experiment to verify this claim.
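
As a minimal sketch of "directly using the embedded inputs" (the layer type and sizes here are assumptions for illustration; the paper likewise uses a learned shared embedding), each node is embedded independently by the same mapping, so permuting the inputs only permutes the embeddings and the model remains indifferent to input order:

```python
import torch
import torch.nn as nn

class NodeEmbedding(nn.Module):
    """Shared per-node embedding; no RNN over the input set, so input order is irrelevant."""
    def __init__(self, input_dim=3, embed_dim=128):  # e.g. (x, y, demand) per node
        super().__init__()
        self.proj = nn.Conv1d(input_dim, embed_dim, kernel_size=1)  # 1x1 conv = shared linear map

    def forward(self, nodes):
        # nodes: (batch, num_nodes, input_dim) -> embeddings: (batch, num_nodes, embed_dim)
        return self.proj(nodes.transpose(1, 2)).transpose(1, 2)
```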



As illustrated in Figure 2, our model is composed of two main components. The first is a set of embeddings that maps the inputs into a D-dimensional vector space. We might have multiple embeddings corresponding to different elements of the input, but they are shared among the inputs. The second component of our model is a decoder that points to an input at every decoding step. As is common in the literature [3, 30, 7], we use RNN to model the decoder network. Notice that we feed the static elements as the inputs to the decoder network. The dynamic element can also be an input to the decoder, but our experiments on the VRP do not suggest any improvement by doing so, so dynamic elements are used only in the attention layer, described next.


3.3 Attention Mechanism
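
The equations of this section are not reproduced in these notes. The following is a hedged reconstruction of a content-based attention layer of the kind described in Sections 1 and 3.2 (variable names, layer shapes, and the exact way static and dynamic information is combined are assumptions; see the paper for the precise formulation): the decoder state attends over the node embeddings to form a context vector, and a second scoring step, with infeasible nodes masked out, produces the probability distribution over the next destination.

```python
import torch
import torch.nn as nn

class PointerAttention(nn.Module):
    """Content-based attention that outputs a distribution over input nodes."""
    def __init__(self, embed_dim, hidden_dim):
        super().__init__()
        self.score1 = nn.Linear(embed_dim + hidden_dim, hidden_dim)
        self.v1 = nn.Linear(hidden_dim, 1, bias=False)
        self.score2 = nn.Linear(2 * embed_dim, hidden_dim)
        self.v2 = nn.Linear(hidden_dim, 1, bias=False)

    def forward(self, node_embeds, decoder_state, mask):
        # node_embeds: (batch, n, embed_dim); decoder_state: (batch, hidden_dim)
        # mask: (batch, n) boolean, True where a node is currently infeasible
        n = node_embeds.size(1)
        dec = decoder_state.unsqueeze(1).expand(-1, n, -1)
        a = torch.softmax(
            self.v1(torch.tanh(self.score1(torch.cat([node_embeds, dec], dim=-1)))).squeeze(-1),
            dim=-1)                                          # attention weights over the nodes
        context = torch.bmm(a.unsqueeze(1), node_embeds)     # (batch, 1, embed_dim)
        context = context.expand(-1, n, -1)
        logits = self.v2(torch.tanh(self.score2(torch.cat([node_embeds, context], dim=-1)))).squeeze(-1)
        logits = logits.masked_fill(mask, float('-inf'))     # rule out infeasible destinations
        return torch.softmax(logits, dim=-1)                 # distribution over the next node
```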

3.4 Training Method
To train the network, we use well-known policy gradient approaches. To use these methods, we parameterize the stochastic policy π with parameters θ. Policy gradient methods use an estimate of the gradient of the expected return with respect to the policy parameters to iteratively improve the policy. In principle, the policy gradient algorithm contains two networks: (i) an actor network that predicts a probability distribution over the next action at any given decision step, and (ii) a critic network that estimates the reward for any problem instance from a given state. Our training methods are quite standard, and due to space limitation we leave the details to the Appendix.
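
A minimal sketch of such an actor-critic policy-gradient update (REINFORCE with a critic baseline; `actor`, `critic`, and `tour_length` are placeholder names, not the paper's code):

```python
import torch

def train_step(actor, critic, actor_opt, critic_opt, batch):
    """One policy-gradient update on a batch of problem instances."""
    tour, log_probs = actor(batch)        # decode solutions; log_probs: (batch, decode_steps)
    reward = -tour_length(tour, batch)    # reward = negative total route length, shape (batch,)
    baseline = critic(batch).squeeze(-1)  # critic's estimate of the expected reward

    advantage = (reward - baseline).detach()
    actor_loss = -(advantage * log_probs.sum(dim=1)).mean()   # REINFORCE with baseline
    critic_loss = torch.nn.functional.mse_loss(baseline, reward)

    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()
    return actor_loss.item(), critic_loss.item()
```

Here `log_probs` are the log-probabilities of the actions chosen while decoding, so the gradient step increases the probability of sequences whose reward beats the critic's estimate.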


5 Discussion and Conclusion
We expect that the proposed architecture has significant potential to be used in real-world problems with further improvements. Noting that the proposed algorithm is not limited to VRP, it will be an important topic of future research to apply it to other combinatorial optimization problems such as bin-packing, job-shop, and flow-shop.



This method is quite appealing since the only requirement is a verifier to find feasible solutions and also a reward signal to demonstrate how well the policy is working. Once the trained model is available, it can be used many times, without needing to re-train for the new problems as long as they are generated from the training distribution. Unlike many classical heuristics, our proposed method scales well with increasing problem size, and has a superior performance with competitive solution-time. It doesn’t require a distance matrix calculation which might be computationally cumbersome, especially in dynamically changing VRPs. We also illustrate the performance of the algorithm on a much more complicated stochastic version of the VRP.

