NEURAL COMBINATORIAL OPTIMIZATION WITH REINFORCEMENT LEARNING

Irwan Bello∗, Hieu Pham∗, Quoc V. Le, Mohammad Norouzi, Samy Bengio
Google Brain
{ibello,hyhieu,qvl,mnorouzi,bengio}@google.com


Abstract

This paper presents a framework to tackle combinatorial optimization problems using neural networks and reinforcement learning. We focus on the traveling salesman problem (TSP) and train a recurrent neural network that, given a set of city coordinates, predicts a distribution over different city permutations. Using negative tour length as the reward signal, we optimize the parameters of the recurrent neural network using a policy gradient method. We compare learning the network parameters on a set of training graphs against learning them on individual test graphs. Despite the computational expense, without much engineering and heuristic designing, Neural Combinatorial Optimization achieves close to optimal results on 2D Euclidean graphs with up to 100 nodes. Applied to the KnapSack, another NP-hard problem, the same method obtains optimal solutions for instances with up to 200 items.


 

Introduction

Combinatorial optimization is a fundamental problem in computer science. A canonical example is the traveling salesman problem (TSP), where given a graph, one needs to search the space of permutations to find an optimal sequence of nodes with minimal total edge weights (tour length). The TSP and its variants have myriad applications in planning, manufacturing, genetics, etc. (see (Applegate et al., 2011) for an overview).
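To make the objective concrete, the short Python sketch below (ours, not part of the paper) computes the closed-tour length of a permutation over 2D points and recovers the optimal tour of a tiny instance by brute force; the function names and example coordinates are illustrative.

    import itertools
    import math

    def tour_length(points, perm):
        # Total Euclidean length of the closed tour visiting the points in the order given by perm.
        return sum(math.dist(points[perm[i]], points[perm[(i + 1) % len(perm)]])
                   for i in range(len(perm)))

    def brute_force_tsp(points):
        # Exhaustive search over all permutations; feasible only for a handful of cities.
        best = min(itertools.permutations(range(len(points))),
                   key=lambda p: tour_length(points, p))
        return best, tour_length(points, best)

    cities = [(0.0, 0.0), (0.0, 1.0), (1.0, 1.0), (1.0, 0.0), (0.5, 0.5)]
    perm, length = brute_force_tsp(cities)
    print(perm, length)

Since a symmetric TSP instance with n cities has (n-1)!/2 distinct tours, exhaustive search becomes hopeless beyond very small n, which is what heuristic and learned solvers address.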
Finding the optimal TSP solution is NP-hard, even in the two-dimensional Euclidean case (Papadimitriou, 1977), where the nodes are 2D points and edge weights are Euclidean distances between pairs of points. In practice, TSP solvers rely on handcrafted heuristics that guide their search procedures to find competitive tours efficiently. Even though these heuristics work well on TSP, once the problem statement changes slightly, they need to be revised. In contrast, machine learning methods have the potential to be applicable across many optimization tasks by automatically discovering their own heuristics based on the training data, thus requiring less hand-engineering than solvers that are optimized for one task only.
While most successful machine learning techniques fall into the family of supervised learning, where a mapping from training inputs to outputs is learned, supervised learning is not applicable to most combinatorial optimization problems because one does not have access to optimal labels. However, one can compare the quality of a set of solutions using a verifier, and provide some reward feedback to a learning algorithm. Hence, we follow the reinforcement learning (RL) paradigm to tackle combinatorial optimization. We empirically demonstrate that, even when using optimal solutions as labeled data to optimize a supervised mapping, the generalization is rather poor compared to an RL agent that explores different tours and observes their corresponding rewards.
We propose Neural Combinatorial Optimization, a framework to tackle combinatorial optimization problems using reinforcement learning and neural networks. We consider two approaches based on policy gradients (Williams, 1992). The first approach, called RL pretraining, uses a training set to optimize a recurrent neural network (RNN) that parameterizes a stochastic policy over solutions, using the expected reward as objective. At test time, the policy is fixed, and one performs inference by greedy decoding or sampling. The second approach, called active search, involves no pretraining.
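Concretely, writing s for an input graph (a sequence of city coordinates), π for a tour, and L(π | s) for its length, the training objective is the expected tour length, and its gradient can be estimated in the standard REINFORCE form (Williams, 1992), where b(s) is a baseline that reduces the variance of the estimate:

    J(\theta \mid s) = \mathbb{E}_{\pi \sim p_\theta(\cdot \mid s)} \, L(\pi \mid s),
    \qquad
    \nabla_\theta J(\theta \mid s) = \mathbb{E}_{\pi \sim p_\theta(\cdot \mid s)}
        \left[ \left( L(\pi \mid s) - b(s) \right) \nabla_\theta \log p_\theta(\pi \mid s) \right].

The same estimator also drives the second approach, active search, described next.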
Active search starts from a random policy and iteratively optimizes the RNN parameters on a single test instance, again using the expected reward objective, while keeping track of the best solution sampled during the search. We find that combining RL pretraining and active search works best in practice.
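To make the search loop concrete, the Python sketch below is a minimal, simplified illustration of active search: it replaces the paper's recurrent policy with a plain table of transition logits theta[i, j] (our simplification), samples tours, keeps the best one found, and updates the parameters with REINFORCE against an exponential moving average baseline. All function and variable names are ours; this is a sketch of the procedure, not the paper's implementation.

    import numpy as np

    def tour_length(points, perm):
        # Closed-tour length: sum of Euclidean distances between consecutive cities.
        diffs = points[perm] - points[np.roll(perm, -1)]
        return np.linalg.norm(diffs, axis=1).sum()

    def sample_tour(theta, rng):
        # Build a tour city by city; the next city is drawn from a softmax over
        # theta[current, unvisited]. Also accumulate d log p(tour) / d theta.
        n = theta.shape[0]
        perm, unvisited = [0], list(range(1, n))
        grad = np.zeros_like(theta)
        while unvisited:
            i = perm[-1]
            logits = theta[i, unvisited]
            probs = np.exp(logits - logits.max())
            probs /= probs.sum()
            k = rng.choice(len(unvisited), p=probs)
            grad[i, unvisited] -= probs          # softmax log-prob gradient: 1{chosen} - p
            grad[i, unvisited[k]] += 1.0
            perm.append(unvisited.pop(k))
        return np.array(perm), grad

    def active_search(points, steps=2000, lr=0.1, seed=0):
        # Optimize the policy on this single instance, tracking the best sampled tour.
        rng = np.random.default_rng(seed)
        n = len(points)
        theta = np.zeros((n, n))
        baseline, best_perm, best_len = None, None, np.inf
        for _ in range(steps):
            perm, grad = sample_tour(theta, rng)
            length = tour_length(points, perm)
            if length < best_len:
                best_perm, best_len = perm, length
            baseline = length if baseline is None else 0.9 * baseline + 0.1 * length
            # REINFORCE with reward = -length, so the advantage is (baseline - length).
            theta += lr * (baseline - length) * grad
        return best_perm, best_len

    points = np.random.default_rng(42).random((10, 2))
    print(active_search(points))

The paper's version samples a batch of tours per step and updates the RNN parameters; the single-sample loop above only illustrates the control flow.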
On 2D Euclidean graphs with up to 100 nodes, Neural Combinatorial Optimization significantly outperforms the supervised learning approach to the TSP (Vinyals et al., 2015b) and obtains close to optimal results when allowed more computation time. We illustrate its flexibility by testing the same method on the KnapSack problem, for which we get optimal results for instances with up to 200 items. These results give insights into how neural networks can be used as a general tool for tackling combinatorial optimization problems, especially those that are difficult to design heuristics for.

 
