Reinforcement Learning for Solving the Vehicle Routing Problem (Notes)


Abstract

We present an end-to-end framework for solving the Vehicle Routing Problem (VRP) using reinforcement learning. In this approach, we train a single model that finds near-optimal solutions for problem instances sampled from a given distribution, only by observing the reward signals and following feasibility rules. Our model represents a parameterized stochastic policy, and by applying a policy gradient algorithm to optimize its parameters, the trained model produces the solution as a sequence of consecutive actions in real time, without the need to re-train for every new problem instance. On capacitated VRP, our approach outperforms classical heuristics and Google’s OR-Tools on medium-sized instances in solution quality with comparable computation time (after training). We demonstrate how our approach can handle problems with split delivery and explore the effect of such deliveries on the solution quality. Our proposed framework can be applied to other variants of the VRP such as the stochastic VRP, and has the potential to be applied more generally to combinatorial optimization problems.


1 Introduction

The Vehicle Routing Problem (VRP) is a combinatorial optimization problem that has been studied in applied mathematics and computer science for decades. VRP is known to be a computationally difficult problem for which many exact and heuristic algorithms have been proposed, but providing fast and reliable solutions is still a challenging task. In the simplest form of the VRP, a single capacitated vehicle is responsible for delivering items to multiple customer nodes; the vehicle must return to the depot to pick up additional items when it runs out. The objective is to optimize a set of routes, all beginning and ending at a given node, called the depot, in order to attain the maximum possible reward, which is often the negative of the total vehicle distance or average service time. This problem is computationally difficult to solve to optimality, even with only a few hundred customer nodes [12]. For an overview of the VRP, see, for example, [15, 22, 23, 31].


Common special cases and variants of the Vehicle Routing Problem (VRP) include:
(1) the Traveling Salesman Problem (TSP)
(2) the Capacitated VRP (CVRP)
(3) the VRP with Time Windows (VRPTW)
(4) the Pickup and Delivery Problem
(5) the Heterogeneous (mixed) Fleet VRP
(6) the VRP with precedence constraints
(7) the VRP with compatibility constraints
(8) the VRP with stochastic demands

https://en.wikipedia.org/wiki/Vehicle_routing_problem

Typical objective functions include:
  • Minimize the global transportation cost, based on the total distance traveled and the fixed costs associated with the vehicles and drivers used
  • Minimize the number of vehicles needed to serve all customers
  • Minimize variation in travel time and vehicle load
  • Minimize penalties for low-quality service
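
To make the distance-based objective concrete, here is a minimal numpy sketch that scores a candidate capacitated-VRP solution as the total Euclidean length of its routes (the reward would be the negative of this value). The instance data and the two-trip solution are invented for illustration only.

```python
import numpy as np

def route_cost(depot, customers, routes):
    """Total Euclidean distance of a candidate CVRP solution.

    depot     : (2,) array, depot coordinates
    customers : (n, 2) array, customer coordinates
    routes    : list of lists of customer indices; each route implicitly
                starts and ends at the depot
    """
    total = 0.0
    for route in routes:
        points = [depot] + [customers[i] for i in route] + [depot]
        for a, b in zip(points[:-1], points[1:]):
            total += np.linalg.norm(np.asarray(a) - np.asarray(b))
    return total

# Toy instance: 4 customers, one vehicle making two trips.
depot = np.array([0.5, 0.5])
customers = np.array([[0.1, 0.2], [0.9, 0.8], [0.3, 0.7], [0.8, 0.1]])
solution = [[0, 2], [1, 3]]
print(route_cost(depot, customers, solution))   # reward = -(this value)
```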

The prospect of new algorithm discovery, without any hand-engineered reasoning, makes neural networks and reinforcement learning a compelling choice that has the potential to be an important milestone on the path toward approaching these problems. In this work, we develop a framework with the capability of solving a wide variety of combinatorial optimization problems using Reinforcement Learning (RL) and show how it can be applied to solve the VRP. For this purpose, we consider the Markov Decision Process (MDP) formulation of the problem, in which the optimal solution can be viewed as a sequence of decisions. This allows us to use RL to produce near-optimal solutions by increasing the probability of decoding “desirable” sequences. A naive approach is to find a problem-specific solution by considering every instance separately. Obviously, this approach is not practical in terms of either solution quality or runtime since there should be many trajectories sampled from one MDP to be able to produce a near-optimal solution. Moreover, the learned policy does not apply to instances other than the one that was used in the training; with a small perturbation of the problem setting, we need to rebuild the policy from scratch.


Therefore, rather than focusing on training a separate model for every problem instance, we propose a structure that performs well on any problem sampled from a given distribution. This means that if we generate a new VRP instance with the same number of nodes and vehicle capacity, and the same location and demand distributions as the ones that we used during training, then the trained policy will work well, and we can solve the problem right away, without retraining for every new instance. As long as we approximate the generating distribution of the problem, the framework can be applied. One can view the trained model as a black-box heuristic (or a meta-algorithm) which generates a high-quality solution in a reasonable amount of time.
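
Since the only requirement is that new instances come from (approximately) the training distribution, instance generation can be as simple as the sketch below. The uniform locations and integer demands are illustrative assumptions, not necessarily the exact distribution used in the paper's experiments.

```python
import numpy as np

def sample_vrp_instance(n_customers=10, capacity=20, rng=None):
    """Draw one CVRP instance from a fixed generating distribution.

    Locations are uniform in the unit square and demands are uniform
    integers in {1, ..., 9}; both are assumptions for this sketch.
    """
    if rng is None:
        rng = np.random.default_rng()
    depot = rng.random(2)
    locations = rng.random((n_customers, 2))
    demands = rng.integers(1, 10, size=n_customers)
    return {"depot": depot, "locations": locations,
            "demands": demands, "capacity": capacity}

# A small batch of training instances drawn from the same distribution.
batch = [sample_vrp_instance(rng=np.random.default_rng(seed)) for seed in range(4)]
```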



This study is motivated by the recent work by Bello et al. [4]. We have generalized their framework to include a wider range of combinatorial optimization problems such as the VRP. Bello et al. [4] propose the use of a Pointer Network [32] to decode the solution. One major issue that prohibits the direct use of their approach for the VRP is that it assumes the system is static over time. In contrast, in the VRP, the demands change over time in the sense that once a node has been visited its demand becomes, effectively, zero. To overcome this, we propose an alternate approach—which is actually simpler than the Pointer Network approach—that can efficiently handle both the static and dynamic elements of the system. Our model consists of a recurrent neural network (RNN) decoder coupled with an attention mechanism. At each time step, the embeddings of the static elements are the input to the RNN decoder, and the output of the RNN and the dynamic element embeddings are fed into an attention mechanism, which forms a distribution over the feasible destinations that can be chosen at the next decision point.



The proposed framework is appealing since we utilize a self-driven learning procedure that only requires the reward calculation based on the generated outputs; as long as we can observe the reward and verify the feasibility of a generated sequence, we can learn the desired meta-algorithm. For instance, if one does not know how to solve the VRP but can compute the cost of a given solution, then one can provide the signal required for solving the problem using our method. Unlike most classical heuristic methods, it is robust to problem changes, meaning that when the inputs change in any way, it can automatically adapt the solution. Using classical heuristics for VRP, the entire distance matrix must be recalculated and the system must be re-optimized from scratch, which is often impractical, especially if the problem size is large. In contrast, our proposed framework does not require an explicit distance matrix, and only one feed-forward pass of the network will update the routes based on the new data.


Our numerical experiments indicate that our framework performs significantly better than well-known classical heuristics designed for the VRP, and that it is robust in the sense that its worst results are still relatively close to optimal. Comparing our method with the OR-Tools VRP engine [16], which is one of the best open-source VRP solvers, we observe a noticeable improvement; in VRP instances with 50 and 100 customers, our method provides shorter tours in roughly 61% of the instances. Another interesting observation that we make in this study is that by allowing multiple vehicles to supply the demand of a single node, our RL-based framework finds policies that outperform the solutions that require single deliveries. We obtain this appealing property, known as the split delivery, without any hand engineering and no extra cost.


2 Background
Before presenting the model, we briefly review some background that is closely related to our work.


Sequence-to-Sequence Models

Sequence-to-Sequence models [30, 32, 24] are useful in tasks for which a mapping from one sequence to another is required. They have been extensively studied in the field of neural machine translation over the past several years, and there are numerous variants of these models. The general architecture, which is almost the same among different versions, consists of two RNN networks, called the encoder and decoder. An encoder network reads through the input sequence and stores the knowledge in a fixed-size vector representation (or a sequence of vectors); then, a decoder converts the encoded information back to an output sequence.
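
A minimal numpy sketch of this vanilla encoder-decoder idea, assuming a plain Elman RNN cell and ignoring embeddings, attention, and training:

```python
import numpy as np

def rnn_cell(x, h, Wx, Wh, b):
    """One step of a vanilla (Elman) RNN cell."""
    return np.tanh(x @ Wx + h @ Wh + b)

def encode_decode(src, T_out, params):
    """Vanilla sequence-to-sequence pass: the encoder compresses the
    source into its last hidden state, which seeds the decoder."""
    Wx_e, Wh_e, b_e, Wx_d, Wh_d, b_d, W_out = params
    h = np.zeros(Wh_e.shape[0])
    for x in src:                      # encoder reads the whole input
        h = rnn_cell(x, h, Wx_e, Wh_e, b_e)
    y, outputs = np.zeros(Wx_d.shape[0]), []
    for _ in range(T_out):             # decoder unrolls from that vector
        h = rnn_cell(y, h, Wx_d, Wh_d, b_d)
        y = h @ W_out
        outputs.append(y)
    return outputs

d_in, d_h, d_out, rng = 4, 8, 4, np.random.default_rng(0)
params = (rng.normal(size=(d_in, d_h)), rng.normal(size=(d_h, d_h)), np.zeros(d_h),
          rng.normal(size=(d_out, d_h)), rng.normal(size=(d_h, d_h)), np.zeros(d_h),
          rng.normal(size=(d_h, d_out)))
print(len(encode_decode(rng.normal(size=(5, d_in)), T_out=3, params=params)))
```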



In the vanilla Sequence-to-Sequence architecture [30], the source sequence appears only once in the encoder and the entire output sequence is generated based on one vector (i.e., the last hidden state of the encoder RNN). Other extensions, for example Bahdanau et al. [3], illustrate that the source information can be used more wisely to increase the amount of information during the decoding steps. In addition to the encoder and decoder networks, they employ another neural network, namely an attention mechanism that attends to the entire encoder RNN states. This mechanism allows the decoder to focus on the important locations of the source sequence and use the relevant information during decoding steps for producing “better” output sequences. Recently, the concept of attention has been a popular research idea due to its capability to align different objects, e.g., in computer vision [6, 37, 38, 18] and neural machine translation [3, 19, 24]. In this study, we also employ a special attention structure for policy representation. See Section 3.3 for a detailed discussion of the attention mechanism.


Neural Combinatorial Optimization

Over the last several years, multiple methods have been developed to tackle combinatorial optimization problems by using recent advances in artificial intelligence. The first attempt was proposed by Vinyals et al. [32], who introduce the concept of a Pointer Network, a model originally inspired by sequence-to-sequence models. Because it is invariant to the length of the encoder sequence, the Pointer Network enables the model to apply to combinatorial optimization problems, where the output sequence length is determined by the source sequence. They use the Pointer Network architecture in a supervised fashion to find near-optimal Traveling Salesman Problem (TSP) tours from ground-truth optimal (or heuristic) solutions. This dependence on supervision prohibits the Pointer Network from finding better solutions than the ones provided during the training.


Closest to our approach, Bello et al. [4] address this issue by developing a neural combinatorial optimization framework that uses RL to optimize a policy modeled by a Pointer Network. Using several classical combinatorial optimization problems such as TSP and the knapsack problem, they show the effectiveness and generality of their architecture.


On a related topic, Dai et al. [11] solve optimization problems over graphs using a graph embedding structure [10] and a deep Q-learning (DQN) algorithm [25]. Even though VRP can be represented by a graph with weighted nodes and edges, their proposed model does not directly apply since in VRP, a particular node (e.g. the depot) might be visited multiple times. Next, we introduce our model, which is a simplified version of the Pointer Network.

3 The Model

In this section, we formally define the problem and our proposed framework for a generic combinatorial optimization problem with a given set of inputs X = {x^i, i = 1, ..., M}. We allow some of the elements of each input to change between the decoding steps, which is, in fact, the case in many problems such as the VRP. The dynamic elements might be an artifact of the decoding procedure itself, or they can be imposed by the environment. For example, in the VRP, the remaining customer demands change over time as the vehicle visits the customer nodes; or we might consider a variant in which new customers arrive or adjust their demand values over time, independent of the vehicle decisions. Formally, we represent each input x^i by a sequence of tuples {x^i_t = (s^i, d^i_t), t = 0, 1, ...}, where s^i and d^i_t are the static and dynamic elements of the input, respectively, and can themselves be tuples. One can view x^i_t as a vector of features that describes the state of input i at time t. For instance, in the VRP, x^i_t gives a snapshot of customer i, where s^i corresponds to the 2-dimensional coordinates of customer i's location and d^i_t is its demand at time t. We will denote the set of all input states at a fixed time t by X_t = {x^i_t, i = 1, ..., M}.
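
For illustration, the per-customer state x^i_t could be stored as a small record holding the static coordinates s^i and the remaining demand d^i_t; the class and field names below are hypothetical, not taken from the paper.

```python
from dataclasses import dataclass
import numpy as np

@dataclass
class CustomerState:
    """x^i_t = (s^i, d^i_t): static location plus dynamic remaining demand."""
    s: np.ndarray   # static 2-D coordinates, fixed over the episode
    d: float        # remaining demand at time t, updated after each visit

# X_0 for a toy instance with three customers
X0 = [CustomerState(s=np.array([0.1, 0.2]), d=4.0),
      CustomerState(s=np.array([0.9, 0.8]), d=7.0),
      CustomerState(s=np.array([0.3, 0.7]), d=2.0)]
```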


We start from an arbitrary input in X_0, where we use the pointer y_0 to refer to that input. At every decoding time t, y_{t+1} points to one of the available inputs X_t, which determines the input of the next decoder step; this process continues until a termination condition is satisfied. The termination condition is problem-specific, showing that the generated sequence satisfies the feasibility constraints. For instance, in the VRP that we consider in this work, the terminating condition is that there is no more demand to satisfy. This process will generate a sequence of length T, Y = {y_t, t = 0, ..., T}, possibly with a different sequence length compared to the input length M. This is due to the fact that, for example, the vehicle may have to go back to the depot several times to refill. We also use the notation Y_t to denote the decoded sequence up to time t, i.e., Y_t = {y_0, ..., y_t}. We are interested in finding a stochastic policy π which generates the sequence Y in a way that minimizes a loss objective while satisfying the problem constraints. The optimal policy π* will generate the optimal solution with probability 1. Our goal is to make π as close to π* as possible. Similar to Sutskever et al. [30], we use the probability chain rule to decompose the probability of generating sequence Y, i.e., P(Y | X_0), as follows:

    P(Y | X_0) = Π_{t=0}^{T} P(y_{t+1} | Y_t, X_t),          (1)

where each state is updated from the previous one and the chosen pointer via a problem-specific transition function

    X_{t+1} = f(y_{t+1}, X_t).                                (2)
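
The decomposition in (1) and the transition (2) translate directly into a generic decoding loop: at each step the policy produces a distribution over the currently feasible inputs, one pointer is sampled, and the state is updated until the termination condition holds. The sketch below is a schematic skeleton; `policy`, `feasible`, and `transition` are placeholders for the trained network, the problem's feasibility rules, and the function f, respectively.

```python
import numpy as np

def decode(X0, transition, feasible, policy, rng, max_steps=1000):
    """Roll out P(Y | X_0) = prod_t P(y_{t+1} | Y_t, X_t) one decision at a time."""
    X, Y = X0, []
    for _ in range(max_steps):
        mask = feasible(X, Y)                 # problem-specific feasibility rules
        if not mask.any():                    # termination, e.g. no demand left
            break
        probs = policy(X, Y) * mask           # distribution over feasible inputs
        probs = probs / probs.sum()
        y_next = rng.choice(len(probs), p=probs)   # sample y_{t+1}
        Y.append(y_next)
        X = transition(y_next, X)             # X_{t+1} = f(y_{t+1}, X_t)
    return Y
```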


Remark 1: This model can handle combinatorial optimization problems both in the more classical static setting and in dynamically changing ones. In static combinatorial optimization, X_0 fully defines the problem that we are trying to solve. For example, in the VRP, X_0 includes all customer locations as well as their demands, and the depot location; then, the remaining demands are updated with respect to the vehicle destination and its load. With this consideration, often there exists a well-defined Markovian transition function f, as defined in (2), which is sufficient to update the dynamics between decision points. However, our model can also be applied to problems in which the state transition function is unknown and/or is subject to external noise, since the training does not explicitly make use of the transition function. Nevertheless, knowing this transition function helps in simulating the environment that the training algorithm interacts with. See Appendix C.6 for an example of how to apply the model to a stochastic version of the VRP in which random customers with random demands appear over time.
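
For the capacitated VRP, one plausible instantiation of the transition function f in (2) updates the remaining demands and the vehicle load as sketched below. The split-delivery convention (deliver min(load, remaining demand)) and the depot index are assumptions made for this sketch.

```python
import numpy as np

def vrp_transition(y, demands, load, capacity, depot=0):
    """One application of X_{t+1} = f(y_{t+1}, X_t) for the capacitated VRP.

    Visiting a customer delivers min(load, remaining demand), so split
    deliveries are allowed; visiting the depot refills the vehicle.
    `demands` is the vector of remaining demands; index `depot` stays 0.
    """
    demands = demands.copy()
    if y == depot:
        load = capacity                      # refill at the depot
    else:
        delivered = min(load, demands[y])
        demands[y] -= delivered              # customer demand shrinks...
        load -= delivered                    # ...and so does the vehicle load
    return demands, load

demands = np.array([0.0, 4.0, 7.0, 2.0])     # index 0 is the depot
demands, load = vrp_transition(2, demands, load=5.0, capacity=10.0)
print(demands, load)                          # [0. 4. 2. 2.] 0.0
```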


3.1 Limitations of Pointer Networks
Although the framework proposed by Bello et al. [4] works well on problems such as the knapsack problem and TSP, it is not applicable to more complicated combinatorial optimization problems in which the system representation varies over time, such as VRP. Bello et al. [4] feed a random sequence of inputs to the RNN encoder. Figure 1 illustrates with an example why using the RNN in the encoder is restrictive. Suppose that at the first decision step, the policy sends the vehicle to customer 1, and as a result, its demand is satisfied, i.e., d^1_1 ≠ d^1_0. Then in the second decision step, we need to re-calculate the whole network with the new d^1_1 information in order to choose the next customer. The dynamic elements complicate the forward pass of the network since there should be encoder/decoder updates when an input changes. The situation is even worse during back-propagation to accumulate the gradients since we need to remember when the dynamic elements changed. In order to resolve this complication, we require the model to be invariant to the input sequence so that changing the order of any two inputs does not affect the network. In Section 3.2, we present a simple network that satisfies this property.


3.2 The Proposed Neural Network Model
We argue that the RNN encoder adds extra complication to the encoder but is actually not necessary, and the approach can be made much more general by omitting it. RNNs are necessary only when the inputs convey sequential information; e.g., in text translation the combination of words and their relative position must be captured in order for the translation to be accurate. But the question here is, why do we need to have them in the encoder for combinatorial optimization problems when there is no meaningful order in the input set? As an example, in the VRP, the inputs are the set of unordered customer locations with their respective demands, and their order is not meaningful; any random permutation contains the same information as the original inputs. Therefore, in our model, we simply leave out the encoder RNN and directly use the embedded inputs instead of the RNN hidden states. By this modification, many of the computational complications disappear, without decreasing the model’s efficiency. In Appendix A, we provide an experiment to verify this claim.
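
The key property gained by dropping the encoder RNN is permutation invariance of the input set. A minimal sketch, assuming a single shared linear embedding over (x, y, demand) features, shows that shuffling the inputs merely shuffles the rows of the embedding:

```python
import numpy as np

def embed_inputs(coords, demands, W, b):
    """Shared per-node embedding that replaces the encoder RNN.

    Every node is embedded independently with the same weights, so the
    embedding is equivariant to permutations of the input set.
    """
    features = np.concatenate([coords, demands[:, None]], axis=1)  # (n, 3)
    return features @ W + b                                        # (n, D)

rng = np.random.default_rng(0)
coords, demands = rng.random((5, 2)), rng.random(5)
W, b = rng.normal(size=(3, 16)), np.zeros(16)

emb = embed_inputs(coords, demands, W, b)
perm = rng.permutation(5)
# Reordering the nodes only reorders the embedding rows.
assert np.allclose(embed_inputs(coords[perm], demands[perm], W, b), emb[perm])
```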



As illustrated in Figure 2, our model is composed of two main components. The first is a set of embeddings that maps the inputs into a D-dimensional vector space. We might have multiple embeddings corresponding to different elements of the input, but they are shared among the inputs. The second component of our model is a decoder that points to an input at every decoding step. As is common in the literature [3, 30, 7], we use an RNN to model the decoder network. Notice that we feed the static elements as the inputs to the decoder network. The dynamic element can also be an input to the decoder, but our experiments on the VRP do not suggest any improvement by doing so, so dynamic elements are used only in the attention layer, described next.
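
Putting the pieces together, one decoding step might look like the sketch below: the static embedding of the last chosen node feeds a decoder RNN cell, and the hidden state together with the static and dynamic embeddings feeds an attention layer that outputs the pointer distribution over feasible nodes. This is a simplified single-layer construction for illustration, not the paper's exact attention parameterization.

```python
import numpy as np

def softmax(z):
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

def decoder_step(static_emb, dyn_emb, h, last_static, params, mask):
    """One decoding step: RNN cell on the last chosen node's static
    embedding, then attention over all node embeddings to produce the
    pointer distribution P(y_{t+1} | .)."""
    Wx, Wh, b, Wa, va = params
    h = np.tanh(last_static @ Wx + h @ Wh + b)          # decoder RNN cell
    ctx = np.concatenate([static_emb, dyn_emb,
                          np.tile(h, (static_emb.shape[0], 1))], axis=1)
    scores = np.tanh(ctx @ Wa) @ va                     # attention scores
    scores = np.where(mask, scores, -1e9)               # mask infeasible nodes
    return softmax(scores), h

D, n, rng = 8, 5, np.random.default_rng(0)
static_emb, dyn_emb = rng.normal(size=(n, D)), rng.normal(size=(n, D))
params = (rng.normal(size=(D, D)), rng.normal(size=(D, D)), np.zeros(D),
          rng.normal(size=(3 * D, D)), rng.normal(size=D))
probs, h = decoder_step(static_emb, dyn_emb, np.zeros(D), static_emb[0],
                        params, mask=np.array([True, True, False, True, True]))
print(probs.round(3))   # node 2 is masked out, so its probability is ~0
```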


3.3 Attention Mechanism

3.4 Training Method
To train the network, we use well-known policy gradient approaches. To use these methods, we parameterize the stochastic policy π with parameters θ. Policy gradient methods use an estimate of the gradient of the expected return with respect to the policy parameters to iteratively improve the policy. In principle, the policy gradient algorithm contains two networks: (i) an actor network that predicts a probability distribution over the next action at any given decision step, and (ii) a critic network that estimates the reward for any problem instance from a given state. Our training method is quite standard, and due to space limitations we leave the details to the Appendix.
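
To make the actor-critic update concrete, the toy below runs REINFORCE with a learned scalar baseline on a four-armed bandit. In the paper both the actor and the critic are neural networks and the gradients come from back-propagation, so this is only a schematic of the update rule, with invented reward values.

```python
import numpy as np

def softmax(z):
    e = np.exp(z - z.max())
    return e / e.sum()

rng = np.random.default_rng(0)
rewards = np.array([1.0, 2.0, 0.5, 3.0])   # hypothetical per-action rewards
theta = np.zeros(4)                        # actor parameters (action logits)
baseline = 0.0                             # critic: a single learned baseline
lr_actor, lr_critic = 0.1, 0.1

for step in range(2000):
    probs = softmax(theta)
    a = rng.choice(4, p=probs)
    R = rewards[a] + rng.normal(scale=0.1)                   # noisy observed reward
    advantage = R - baseline                                  # critic as baseline
    theta += lr_actor * advantage * (np.eye(4)[a] - probs)    # grad of log pi(a)
    baseline += lr_critic * advantage                         # critic tracks E[R]

print(softmax(theta).round(3))   # most probability mass should end up on action 3
```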


5 Discussion and Conclusion
We expect that the proposed architecture has significant potential to be used in real-world problems with further improvements. Noting that the proposed algorithm is not limited to VRP, it will be an important topic of future research to apply it to other combinatorial optimization problems such as bin-packing, job-shop, and flow-shop.



This method is quite appealing since the only requirement is a verifier to find feasible solutions and also a reward signal to demonstrate how well the policy is working. Once the trained model is available, it can be used many times, without needing to re-train for the new problems as long as they are generated from the training distribution. Unlike many classical heuristics, our proposed method scales well with increasing problem size, and has a superior performance with competitive solution-time. It doesn’t require a distance matrix calculation which might be computationally cumbersome, especially in dynamically changing VRPs. We also illustrate the performance of the algorithm on a much more complicated stochastic version of the VRP.

