Deep Reinforcement Learning: Pong from Pixels (Translation and Notes)

Original post:

http://karpathy.github.io/2016/05/31/rl/

Preface

Let me explain things briefly up front; without it, the rest will be hard to follow.
The Policy Gradient (PG) algorithm, roughly speaking: the input is a state, the output is an action, and the environment occasionally gives you some feedback (rewards), much like a person learns.
Updating the network is a bit different from supervised learning. Supervised learning tells you directly which action is correct, but PG has no such label, so what do we do?
We have to make do with the crude feedback the environment gives us: if an action earned a good reward, we greatly increase the probability of taking that action;
if it was punished, we suppress the probability of that action.
The gradient used to update the network is roughly: the gradient of the sampled action's log probability, ∇θ log p(x;θ), times the score f(x).
We use this pseudo-label to update the network
and make its decisions smarter.

Deep Reinforcement Learning: Pong from Pixels

May 31, 2016

This is a long overdue blog post on Reinforcement Learning (RL). RL is hot! You may have noticed that computers can now automatically learn to play ATARI games (from raw game pixels!), they are beating world champions at Go, simulated quadrupeds are learning to run and leap, and robots are learning how to perform complex manipulation tasks that defy explicit programming. It turns out that all of these advances fall under the umbrella of RL research. I also became interested in RL myself over the last ~year: I worked through Richard Sutton’s book, read through David Silver’s course, watched John Schulmann’s lectures, wrote an RL library in Javascript, over the summer interned at DeepMind working in the DeepRL group, and most recently pitched in a little with the design/development of OpenAI Gym, a new RL benchmarking toolkit. So I’ve certainly been on this funwagon for at least a year but until now I haven’t gotten around to writing up a short post on why RL is a big deal, what it’s about, how it all developed and where it might be going.
RL is hot and the author is a heavyweight; there is not much to translate in this paragraph, just read it yourself.
Examples of RL in the wild. From left to right: Deep Q Learning network playing ATARI, AlphaGo, Berkeley robot stacking Legos, physically-simulated quadruped leaping over terrain.

It’s interesting to reflect on the nature of recent progress in RL. I broadly like to think about four separate factors that hold back AI:

  • Compute (the obvious one: Moore’s Law, GPUs, ASICs),
  • Data (in a nice form, not just out there somewhere on the internet - e.g. ImageNet),
  • Algorithms (research and ideas, e.g. backprop, CNN, LSTM), and
  • Infrastructure (software under you - Linux, TCP/IP, Git, ROS, PR2, AWS, AMT, TensorFlow, etc.).

Similar to what happened in Computer Vision, the progress in RL is not driven as much as you might reasonably assume by new amazing ideas. In Computer Vision, the 2012 AlexNet was mostly a scaled up (deeper and wider) version of 1990’s ConvNets. Similarly, the ATARI Deep Q Learning paper from 2013 is an implementation of a standard algorithm (Q Learning with function approximation, which you can find in the standard RL book of Sutton 1998), where the function approximator happened to be a ConvNet. AlphaGo uses policy gradients with Monte Carlo Tree Search (MCTS) - these are also standard components. Of course, it takes a lot of skill and patience to get it to work, and multiple clever tweaks on top of old algorithms have been developed, but to a first-order approximation the main driver of recent progress is not the algorithms but (similar to Computer Vision) compute/data/infrastructure.

Now back to RL. Whenever there is a disconnect between how magical something seems and how simple it is under the hood I get all antsy and really want to write a blog post. In this case I’ve seen many people who can’t believe that we can automatically learn to play most ATARI games at human level, with one algorithm, from pixels, and from scratch - and it is amazing, and I’ve been there myself! But at the core the approach we use is also really quite profoundly dumb (though I understand it’s easy to make such claims in retrospect). Anyway, I’d like to walk you through Policy Gradients (PG), our favorite default choice for attacking RL problems at the moment. If you’re from outside of RL you might be curious why I’m not presenting DQN instead, which is an alternative and better-known RL algorithm, widely popularized by the ATARI game playing paper. It turns out that Q-Learning is not a great algorithm (you could say that DQN is so 2013 (okay I’m 50% joking)). In fact most people prefer to use Policy Gradients, including the authors of the original DQN paper who have shown Policy Gradients to work better than Q Learning when tuned well. PG is preferred because it is end-to-end: there’s an explicit policy and a principled approach that directly optimizes the expected reward. Anyway, as a running example we’ll learn to play an ATARI game (Pong!) with PG, from scratch, from pixels, with a deep neural network, and the whole thing is 130 lines of Python only using numpy as a dependency (Gist link). Lets get to it.

The translation starts from this paragraph.

Back to reinforcement learning. Whenever there is a disconnect between how magical something looks and how simple it is under the hood, I get antsy and want to write a blog post about it (me too?). I have seen many people who can't believe that we can make a machine learn to play ATARI games automatically, with one algorithm, from pixel images, from scratch, up to human level. It is amazing, and I have felt the same way myself! But the core method we use is still profoundly dumb. Anyway, I'd like to walk you through the PG algorithm, currently the most common default choice for attacking RL problems. If you are an outsider to RL you might ask why I'm not presenting DQN instead, which is another viable and better-known RL algorithm, widely popularized by the ATARI game-playing paper. It turns out that Q-learning is not a great algorithm (you could even say DQN is "so 2013"; does that mean it's outdated?). Most people prefer PG, including the authors of the original DQN paper, who have shown that PG works better than Q-learning when tuned well. PG is preferred because it is end-to-end: there is an explicit policy and a principled approach that directly optimizes the expected reward (isn't that describing DDPG?). As a running example we will use the Pong simulation environment and play the game with PG, from scratch (presumably with randomly initialized parameters and no prior knowledge?), from raw pixel values, with a reasonably deep network. The whole example is 130 lines of code using only numpy, no TensorFlow or PyTorch. Interesting. Let's get started!
Translating is slow. I feel I shouldn't be translating these less essential parts, since my main effort should go into learning DDPG, which I haven't finished, and there's only one day left. A bit panicked...

Pong from pixels

Left: The game of Pong. Right: Pong is a special case of a Markov Decision Process (MDP): A graph where each node is a particular game state and each edge is a possible (in general probabilistic) transition. Each edge also gives a reward, and the goal is to compute the optimal way of acting in any state to maximize rewards.

Now for the translation proper:

The game of Pong is an excellent example of a simple RL task. In the ATARI 2600 version we’ll use you play as one of the paddles (the other is controlled by a decent AI) and you have to bounce the ball past the other player (I don’t really have to explain Pong, right?). On the low level the game works as follows: we receive an image frame (a 210x160x3 byte array (integers from 0 to 255 giving pixel values)) and we get to decide if we want to move the paddle UP or DOWN (i.e. a binary choice). After every single choice the game simulator executes the action and gives us a reward: Either a +1 reward if the ball went past the opponent, a -1 reward if we missed the ball, or 0 otherwise. And of course, our goal is to move the paddle so that we get lots of reward.
Roughly:
the input image is 210x160x3 with pixel values from 0 to 255; the output action is moving the paddle up or down. After every single choice the game simulator executes the action and gives us a reward: +1 if the ball got past the opponent, -1 if we missed the ball, and 0 otherwise. And of course, our goal is to move the paddle so that we collect lots of reward.
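To make the interaction loop concrete, here is a minimal sketch using the Gym API of that era (the environment id "Pong-v0", the action meanings, and the 4-tuple returned by step are assumptions tied to older gym versions; newer gym/gymnasium releases changed these signatures):

import gym  # assumes the classic Gym ATARI environments are installed

env = gym.make("Pong-v0")                    # frames come back as 210x160x3 uint8 arrays
observation = env.reset()
for _ in range(1000):
    action = env.action_space.sample()       # a random action for now (2/3 correspond to UP/DOWN in ATARI Pong)
    observation, reward, done, info = env.step(action)
    # reward is +1 when the ball gets past the opponent, -1 when we miss it, 0 otherwise
    if done:                                 # an episode ends when either player reaches 21 points
        observation = env.reset()
env.close()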

As we go through the solution keep in mind that we’ll try to make very few assumptions about Pong because we secretly don’t really care about Pong; We care about complex, high-dimensional problems like robot manipulation, assembly and navigation. Pong is just a fun toy test case, something we play with while we figure out how to write very general AI systems that can one day do arbitrary useful tasks.
As we work through the solution, keep in mind that we will try to make very few assumptions about Pong, because we secretly don't really care about Pong; we care about complex, high-dimensional problems like robot manipulation, assembly and navigation. Pong is just a fun toy test case, something to play with while we figure out how to write very general AI systems that could one day do arbitrarily useful tasks.

Policy network. First, we’re going to define a policy network that implements our player (or “agent”). This network will take the state of the game and decide what we should do (move UP or DOWN). As our favorite simple block of compute we’ll use a 2-layer neural network that takes the raw image pixels (100,800 numbers total (210*160*3)), and produces a single number indicating the probability of going UP. Note that it is standard to use a stochastic policy, meaning that we only produce a probability of moving UP. Every iteration we will sample from this distribution (i.e. toss a biased coin) to get the actual move. The reason for this will become more clear once we talk about training.
Policy network. First, we will define a policy network that implements our player (or "agent"). This network takes the state of the game and decides what we should do (move UP or DOWN). As our favorite simple block of compute we will use a 2-layer fully connected neural network that takes the raw image pixels (100,800 numbers in total, 210*160*3) and produces a single number: the probability of going UP. Note that it is standard to use a stochastic policy, meaning that we only produce a probability of moving UP (this feels a little odd at first). At every iteration we sample from this distribution (i.e. toss a biased coin) to get the actual move. The reason for this will become clearer once we talk about training.


Our policy network is a 2-layer fully-connected net.

and to make things concrete here is how you might implement this policy network in Python/numpy. Suppose we’re given a vector x that holds the (preprocessed) pixel information. We would compute:
To make things concrete, here is how you might implement this policy network in Python/numpy. Suppose we are given a vector x that holds the (preprocessed) pixel information. We would compute:

import numpy as np

h = np.dot(W1, x) # compute hidden layer neuron activations
h[h<0] = 0 # ReLU nonlinearity: threshold at zero
logp = np.dot(W2, h) # compute log probability of going up
p = 1.0 / (1.0 + np.exp(-logp)) # sigmoid function (gives probability of going up)

where in this snippet W1 and W2 are two matrices that we initialize randomly. We’re not using biases because meh. Notice that we use the sigmoid non-linearity at the end, which squashes the output probability to the range [0,1]. Intuitively, the neurons in the hidden layer (which have their weights arranged along the rows of W1) can detect various game scenarios (e.g. the ball is in the top, and our paddle is in the middle), and the weights in W2 can then decide if in each case we should be going UP or DOWN. Now, the initial random W1 and W2 will of course cause the player to spasm on spot. So the only problem now is to find W1 and W2 that lead to expert play of Pong!
In this snippet W1 and W2 are two matrices that we initialize randomly. We are not using biases because, meh. Notice that we use the sigmoid non-linearity at the end, which squashes the output probability to the range [0,1]. Intuitively, the neurons in the hidden layer (whose weights are arranged along the rows of W1) can detect various game scenarios (e.g. the ball is at the top and our paddle is in the middle), and the weights in W2 can then decide whether in each of those cases we should go UP or DOWN. Of course, the initial random W1 and W2 will cause the player to spasm on the spot. So the only problem now is to find a W1 and W2 that lead to expert-level play of Pong!

Fine print: preprocessing. Ideally you’d want to feed at least 2 frames to the policy network so that it can detect motion. To make things a bit simpler (I did these experiments on my Macbook) I’ll do a tiny bit of preprocessing, e.g. we’ll actually feed difference frames to the network (i.e. subtraction of current and last frame).
Fine print (the small-print details): preprocessing. Ideally you would feed at least 2 frames to the policy network so that it can detect motion. To keep things a bit simpler (I ran these experiments on my Macbook) I will do a tiny bit of preprocessing, e.g. we will actually feed difference frames to the network (i.e. the current frame minus the previous one).
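A plausible preprocessing function, loosely following the post's 130-line script (the exact crop rows, the background color codes 144/109, and the 80x80 output size are assumptions here):

import numpy as np

def prepro(frame):
    # crop to the play area, take every 2nd pixel of one color channel -> 80x80, then binarize
    frame = np.array(frame[35:195:2, ::2, 0], dtype=np.float64)  # copy so we can modify it
    frame[frame == 144] = 0      # erase one background color
    frame[frame == 109] = 0      # erase the other background color
    frame[frame != 0] = 1        # paddles and ball become 1
    return frame.ravel()         # a 6400-dimensional vector

# the network is then fed the difference of two preprocessed frames so it can see motion:
# x = prepro(cur_frame) - prepro(prev_frame)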

It sounds kind of impossible. At this point I’d like you to appreciate just how difficult the RL problem is. We get 100,800 numbers (210*160*3) and forward our policy network (which easily involves on order of a million parameters in W1 and W2). Suppose that we decide to go UP. The game might respond that we get 0 reward this time step and gives us another 100,800 numbers for the next frame. We could repeat this process for hundred timesteps before we get any non-zero reward! E.g. suppose we finally get a +1. That’s great, but how can we tell what made that happen? Was it something we did just now? Or maybe 76 frames ago? Or maybe it had something to do with frame 10 and then frame 90? And how do we figure out which of the million knobs to change and how, in order to do better in the future? We call this the credit assignment problem. In the specific case of Pong we know that we get a +1 if the ball makes it past the opponent. The true cause is that we happened to bounce the ball on a good trajectory, but in fact we did so many frames ago - e.g. maybe about 20 in case of Pong, and every single action we did afterwards had zero effect on whether or not we end up getting the reward. In other words we’re faced with a very difficult problem and things are looking quite bleak.

This sounds kind of impossible. At this point I would like you to appreciate just how difficult the RL problem is. We get 100,800 numbers (210*160*3) and forward them through our policy network (which easily involves on the order of a million parameters in W1 and W2). Suppose we decide to go UP. The game might respond that we get 0 reward at this time step and hand us another 100,800 numbers for the next frame. We might repeat this process for a hundred time steps before we get any non-zero reward! Say we finally get a +1. Great, but what made that happen? Was it something we did just now? Or maybe 76 frames ago? Or maybe it had something to do with frame 10 and then frame 90? And how do we figure out which of the million knobs to change, and how, in order to do better in the future? We call this the credit assignment problem. In the specific case of Pong we know we get a +1 if the ball makes it past the opponent. The true cause is that we happened to bounce the ball back on a good trajectory, but we actually did that many frames earlier - maybe around 20 frames earlier in the case of Pong - and every single action we took afterwards had zero effect on whether or not we end up getting the reward. In other words, we face a very tricky problem and things look quite bleak.

Supervised Learning. Before we dive into the Policy Gradients solution I’d like to remind you briefly about supervised learning because, as we’ll see, RL is very similar. Refer to the diagram below. In ordinary supervised learning we would feed an image to the network and get some probabilities, e.g. for two classes UP and DOWN. I’m showing log probabilities (-1.2, -0.36) for UP and DOWN instead of the raw probabilities (30% and 70% in this case) because we always optimize the log probability of the correct label (this makes math nicer, and is equivalent to optimizing the raw probability because log is monotonic). Now, in supervised learning we would have access to a label. For example, we might be told that the correct thing to do right now is to go UP (label 0). In an implementation we would enter gradient of 1.0 on the log probability of UP and run backprop to compute the gradient vector ∇_W log p(y=UP∣x). This gradient would tell us how we should change every one of our million parameters to make the network slightly more likely to predict UP. For example, one of the million parameters in the network might have a gradient of -2.1, which means that if we were to increase that parameter by a small positive amount (e.g. 0.001), the log probability of UP would decrease by 2.1 * 0.001 (decrease due to the negative sign). If we then did a parameter update then, yay, our network would now be slightly more likely to predict UP when it sees a very similar image in the future.
Let's first look at how supervised learning handles this:
Supervised learning. Before we dive into the Policy Gradients solution, I'd like to briefly remind you about supervised learning because, as we will see, RL is very similar. Refer to the diagram below. In ordinary supervised learning we feed an image to the network and get some probabilities, e.g. for the two classes UP and DOWN. I show the log probabilities for UP and DOWN (-1.2, -0.36) instead of the raw probabilities (30% and 70% in this case) because we always optimize the log probability of the correct label (this makes the math nicer, and is equivalent to optimizing the raw probability since log is monotonic). Now, in supervised learning we have access to a label. For example, we might be told that the correct thing to do right now is to go UP (label 0). In an implementation we would enter a gradient of 1.0 on the log probability of UP and run backprop to compute the gradient vector ∇_W log p(y=UP∣x). This gradient tells us how to change each of our million parameters to make the network slightly more likely to predict UP. For example, one of the million parameters might have a gradient of -2.1, which means that if we increased that parameter by a small positive amount (say 0.001), the log probability of UP would decrease by 2.1 * 0.001 (a decrease, because of the negative sign). If we then did a parameter update, our network would now be slightly more likely to predict UP when it sees a very similar image in the future.

[Figure]
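To make the "enter a gradient of 1.0 on the log probability of UP and run backprop" step concrete, here is a toy numpy sketch for the 2-layer network above (all sizes, random values, and the name dlogit are illustrative assumptions; for a sigmoid output, the gradient of the log probability of the correct class with respect to the logit is simply y - p):

import numpy as np

rng = np.random.default_rng(0)
D, H = 80 * 80, 200
x = rng.standard_normal(D)                   # a (preprocessed) input vector
W1 = rng.standard_normal((H, D)) / np.sqrt(D)
W2 = rng.standard_normal(H) / np.sqrt(H)

h = np.maximum(0, np.dot(W1, x))             # forward pass, as in the snippet above
p = 1.0 / (1.0 + np.exp(-np.dot(W2, h)))     # P(UP)

y = 1.0                                      # supervised label: the correct action is UP
dlogit = y - p                               # d log P(y|x) / d logit for a sigmoid output
dW2 = dlogit * h                             # backprop into W2
dh = dlogit * W2
dh[h <= 0] = 0                               # backprop through the ReLU
dW1 = np.outer(dh, x)                        # backprop into W1
# a gradient-ascent step on log P(y|x) would then be W1 += lr * dW1 and W2 += lr * dW2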
Policy Gradients. Okay, but what do we do if we do not have the correct label in the Reinforcement Learning setting? Here is the Policy Gradients solution (again refer to diagram below). Our policy network calculated probability of going UP as 30% (logprob -1.2) and DOWN as 70% (logprob -0.36). We will now sample an action from this distribution; E.g. suppose we sample DOWN, and we will execute it in the game. At this point notice one interesting fact: We could immediately fill in a gradient of 1.0 for DOWN as we did in supervised learning, and find the gradient vector that would encourage the network to be slightly more likely to do the DOWN action in the future. So we can immediately evaluate this gradient and that’s great, but the problem is that at least for now we do not yet know if going DOWN is good. But the critical point is that that’s okay, because we can simply wait a bit and see! For example in Pong we could wait until the end of the game, then take the reward we get (either +1 if we won or -1 if we lost), and enter that scalar as the gradient for the action we have taken (DOWN in this case). In the example below, going DOWN ended up to us losing the game (-1 reward). So if we fill in -1 for log probability of DOWN and do backprop we will find a gradient that discourages the network to take the DOWN action for that input in the future (and rightly so, since taking that action led to us losing the game).
Policy Gradients. OK, but what do we do if we do not have the correct label in the reinforcement learning setting? Here is the Policy Gradients solution (again, refer to the diagram below). Our policy network computed the probability of going UP as 30% (log prob -1.2) and DOWN as 70% (log prob -0.36). We now sample an action from this distribution; suppose, for example, we sample DOWN and execute it in the game. At this point notice one interesting fact: we could immediately fill in a gradient of 1.0 for DOWN, just as we did in supervised learning, and find the gradient vector that would encourage the network to be slightly more likely to take the DOWN action in the future. So we can evaluate this gradient right away, and that's great, but the problem is that, at least for now, we do not yet know whether going DOWN is any good. The critical point is that this is okay, because we can simply wait a bit and see! For example, in Pong we could wait until the end of the game, take the reward we get (+1 if we won, -1 if we lost), and enter that scalar as the gradient for the action we took (DOWN in this case). In the example below, going DOWN ended with us losing the game (-1 reward). So if we fill in -1 for the log probability of DOWN and do backprop, we will find a gradient that discourages the network from taking the DOWN action for that input in the future (and rightly so, since taking that action led to us losing the game).

[Figure]
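A minimal sketch of that idea with made-up numbers: we sample the move, remember the gradient we would have used if the sampled move were the label, and only scale it by the outcome once the game is over:

import numpy as np

rng = np.random.default_rng(0)
p = 0.3                                      # P(UP) from the policy's forward pass (made-up value)
action = 1.0 if rng.random() < p else 0.0    # toss the biased coin: 1 = UP, 0 = DOWN
dlogit = action - p                          # gradient we would use if the sampled action were the "label"

# ... the rest of the game plays out ...
reward = -1.0                                # say we eventually lost this game
dlogit *= reward                             # scale by the outcome: discourage the sampled action
# backprop of dlogit through W2 and W1 then proceeds exactly as in the supervised sketch above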

And that’s it: we have a stochastic policy that samples actions and then actions that happen to eventually lead to good outcomes get encouraged in the future, and actions taken that lead to bad outcomes get discouraged. Also, the reward does not even need to be +1 or -1 if we win the game eventually. It can be an arbitrary measure of some kind of eventual quality. For example if things turn out really well it could be 10.0, which we would then enter as the gradient instead of -1 to start off backprop. That’s the beauty of neural nets; Using them can feel like cheating: You’re allowed to have 1 million parameters embedded in 1 teraflop of compute and you can make it do arbitrary things with SGD. It shouldn’t work, but amusingly we live in a universe where it does.
And that's it: we have a stochastic policy that samples actions; actions that happen to eventually lead to good outcomes get encouraged in the future, and actions that lead to bad outcomes get discouraged. Also, the reward does not even need to be +1 or -1 for eventually winning the game; it can be an arbitrary measure of some kind of eventual quality. For example, if things turn out really well it could be 10.0, which we would then enter as the gradient instead of -1 to start off backprop. That is the beauty of neural nets; using them can feel like cheating: you are allowed to have 1 million parameters embedded in 1 teraflop of compute and you can make it do arbitrary things with SGD. It shouldn't work, but amusingly we live in a universe where it does.

Training protocol. So here is how the training will work in detail. We will initialize the policy network with some W1, W2 and play 100 games of Pong (we call these policy “rollouts”). Lets assume that each game is made up of 200 frames so in total we’ve made 20,000 decisions for going UP or DOWN and for each one of these we know the parameter gradient, which tells us how we should change the parameters if we wanted to encourage that decision in that state in the future. All that remains now is to label every decision we’ve made as good or bad. For example suppose we won 12 games and lost 88. We’ll take all 200*12 = 2400 decisions we made in the winning games and do a positive update (filling in a +1.0 in the gradient for the sampled action, doing backprop, and parameter update encouraging the actions we picked in all those states). And we’ll take the other 200*88 = 17600 decisions we made in the losing games and do a negative update (discouraging whatever we did). And… that’s it. The network will now become slightly more likely to repeat actions that worked, and slightly less likely to repeat actions that didn’t work. Now we play another 100 games with our new, slightly improved policy and rinse and repeat.

Training protocol. So here is how training works in detail. We initialize the policy network with some W1 and W2 and play 100 games of Pong (we call these policy "rollouts"). Let's assume each game consists of 200 frames, so in total we have made 20,000 decisions to go UP or DOWN, and for each of them we know the parameter gradient, which tells us how we should change the parameters if we wanted to encourage that decision in that state in the future. All that remains is to label every decision we made as good or bad. For example, suppose we won 12 games and lost 88. We take all 200*12 = 2400 decisions made in the winning games and do a positive update (filling in +1.0 in the gradient for the sampled action, doing backprop, and updating the parameters to encourage the actions we picked in all those states). And we take the other 200*88 = 17600 decisions made in the losing games and do a negative update (discouraging whatever we did). And... that's it. The network will now become slightly more likely to repeat the actions that worked and slightly less likely to repeat the ones that didn't. Then we play another 100 games with our new, slightly improved policy, and rinse and repeat.
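A self-contained sketch of one such batch update, with random stand-ins for what the rollouts would have recorded (shapes, the win rate and the learning rate are all made-up values, and the batch is shrunk so the sketch runs quickly):

import numpy as np

rng = np.random.default_rng(0)
H, D, N = 200, 80 * 80, 1000                 # hidden units, input dim, decisions in the batch
learning_rate = 1e-4

W1 = rng.standard_normal((H, D)) / np.sqrt(D)
W2 = rng.standard_normal(H) / np.sqrt(H)

xs = rng.standard_normal((N, D))             # the (preprocessed) frames we saw
hs = np.maximum(0, xs @ W1.T)                # hidden activations recorded during the forward passes
dlogits = rng.standard_normal(N)             # (action - prob) recorded at each sampled decision
won = rng.random(N) < 0.12                   # which decisions belonged to games we won

dlogits *= np.where(won, +1.0, -1.0)         # +1 for every decision in a won game, -1 otherwise

# backprop through the whole batch at once, then one (ascent) parameter update
dW2 = hs.T @ dlogits
dh = np.outer(dlogits, W2)
dh[hs <= 0] = 0
dW1 = dh.T @ xs
W1 += learning_rate * dW1
W2 += learning_rate * dW2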

Policy Gradients: Run a policy for a while. See what actions led to high rewards. Increase their probability.
Policy Gradients: run a policy for a while; see which actions led to high rewards; increase their probability.


Cartoon diagram of 4 games. Each black circle is some game state (three example states are visualized on the bottom), and each arrow is a transition, annotated with the action that was sampled. In this case we won 2 games and lost 2 games. With Policy Gradients we would take the two games we won and slightly encourage every single action we made in that episode. Conversely, we would also take the two games we lost and slightly discourage every single action we made in that episode.
Cartoon diagram of 4 games. Each black circle is some game state (three example states are visualized at the bottom), and each arrow is a transition, annotated with the action that was sampled. In this case we won 2 games and lost 2 games. With Policy Gradients we take the two games we won and slightly encourage every single action we made in those episodes; conversely, we also take the two games we lost and slightly discourage every single action we made in those episodes.

If you think through this process you’ll start to find a few funny properties. For example what if we made a good action in frame 50 (bouncing the ball back correctly), but then missed the ball in frame 150? If every single action is now labeled as bad (because we lost), wouldn’t that discourage the correct bounce on frame 50? You’re right - it would. However, when you consider the process over thousands/millions of games, then doing the first bounce correctly makes you slightly more likely to win down the road, so on average you’ll see more positive than negative updates for the correct bounce and your policy will end up doing the right thing.
If you think this process through, you will start to find a few funny properties. For example, what if we made a good move at frame 50 (bouncing the ball back correctly) but then missed the ball at frame 150? If every single action is now labeled as bad (because we lost), wouldn't that discourage the correct bounce at frame 50? You are right - it would. However, when you consider the process over thousands or millions of games, doing the first bounce correctly makes you slightly more likely to win down the road, so on average you will see more positive than negative updates for the correct bounce, and your policy will end up doing the right thing.

Update: December 9, 2016 - alternative view. In my explanation above I use the terms such as “fill in the gradient and backprop”, which I realize is a special kind of thinking if you’re used to writing your own backprop code, or using Torch where the gradients are explicit and open for tinkering. However, if you’re used to Theano or TensorFlow you might be a little perplexed because the code is organized around specifying a loss function and the backprop is fully automatic and hard to tinker with. In this case, the following alternative view might be more intuitive. In vanilla supervised learning the objective is to maximize ∑_i log p(y_i∣x_i) where x_i, y_i are training examples (such as images and their labels). Policy gradients is exactly the same as supervised learning with two minor differences: 1) We don’t have the correct labels y_i so as a “fake label” we substitute the action we happened to sample from the policy when it saw x_i, and 2) We modulate the loss for each example multiplicatively based on the eventual outcome, since we want to increase the log probability for actions that worked and decrease it for those that didn’t. So in summary our loss now looks like ∑_i A_i log p(y_i∣x_i), where y_i is the action we happened to sample and A_i is a number that we call an advantage. In the case of Pong, for example, A_i could be 1.0 if we eventually won in the episode that contained x_i and -1.0 if we lost. This will ensure that we maximize the log probability of actions that led to good outcome and minimize the log probability of those that didn’t. So reinforcement learning is exactly like supervised learning, but on a continuously changing dataset (the episodes), scaled by the advantage, and we only want to do one (or very few) updates based on each sampled dataset.
The formulas in this part are a bit more involved, so let me go over it briefly; it also touches on some theory.
Update: December 9, 2016 - an alternative view. In my explanation above I used terms like "fill in the gradient and backprop", which I realize is a particular way of thinking if you are used to writing your own backprop code, or to using Torch, where the gradients are explicit and open for tinkering. However, if you are used to Theano or TensorFlow, you may be a little perplexed, because there the code is organized around specifying a loss function and the backprop is fully automatic and hard to tinker with. In that case, the following alternative view may be more intuitive. In vanilla supervised learning the objective is to maximize ∑_i log p(y_i∣x_i), where x_i, y_i are training examples (such as images and their labels). Policy gradients are exactly the same as supervised learning, with two minor differences:
1) We don't have the correct labels y_i, so as a "fake label" we substitute the action we happened to sample from the policy when it saw x_i;
2) We modulate the loss for each example multiplicatively based on the eventual outcome, since we want to increase the log probability of actions that worked and decrease it for those that didn't.
In summary, our loss now looks like ∑_i A_i log p(y_i∣x_i), where y_i is the action we happened to sample and A_i is a number we call the advantage. In the case of Pong, for example, A_i would be 1.0 if we eventually won the episode that contained x_i, and -1.0 if we lost. This ensures that we maximize the log probability of actions that led to a good outcome and minimize the log probability of those that didn't. So reinforcement learning is exactly like supervised learning, but on a continuously changing dataset (the episodes), scaled by the advantage, and we only want to do one (or very few) updates based on each sampled dataset.
OK, translated this way I can follow it; I wonder how it reads to you~
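Written as code, this "loss function" view is just an advantage-weighted log-likelihood (a sketch with made-up numbers):

import numpy as np

p_sampled = np.array([0.7, 0.3, 0.9, 0.4])       # p(y_i | x_i) for the actions we happened to sample
advantages = np.array([+1.0, +1.0, -1.0, -1.0])  # +1 if that action's episode was won, -1 if it was lost
loss = -np.sum(advantages * np.log(p_sampled))   # minimizing this maximizes sum_i A_i log p(y_i | x_i)
print(loss)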

More general advantage functions. I also promised a bit more discussion of the returns. So far we have judged the goodness of every individual action based on whether or not we win the game. In a more general RL setting we would receive some reward r_t at every time step. One common choice is to use a discounted reward, so the “eventual reward” in the diagram above would become R_t = ∑_{k=0}^{∞} γ^k r_{t+k}, where γ is a number between 0 and 1 called a discount factor (e.g. 0.99). The expression states that the strength with which we encourage a sampled action is the weighted sum of all rewards afterwards, but later rewards are exponentially less important. In practice it can also be important to normalize these. For example, suppose we compute R_t for all of the 20,000 actions in the batch of 100 Pong game rollouts above. One good idea is to “standardize” these returns (e.g. subtract mean, divide by standard deviation) before we plug them into backprop. This way we’re always encouraging and discouraging roughly half of the performed actions. Mathematically you can also interpret these tricks as a way of controlling the variance of the policy gradient estimator. A more in-depth exploration can be found here.
More general advantage functions. I also promised some more discussion of the returns. So far we have judged the goodness of each individual action by whether or not we win the game. In a more general RL setting we would receive some reward r_t at every time step (that is, the reward is not so sparse; every step the environment gives some feedback). A common choice is to use a discounted reward, so the "eventual reward" in the diagram above becomes R_t = ∑_{k=0}^{∞} γ^k r_{t+k}, where γ is a number between 0 and 1 called the discount factor (e.g. 0.99). The expression says that the strength with which we encourage a sampled action is the weighted sum of all rewards that come afterwards, with later rewards exponentially less important. In practice it is also important to normalize these. For example, suppose we compute R_t for all 20,000 actions in the batch of 100 Pong rollouts above. A good idea is to "standardize" these returns (e.g. subtract the mean, divide by the standard deviation) before plugging them into backprop. This way we always encourage and discourage roughly half of the performed actions (roughly speaking, without standardization the returns can sit on very different scales and a few of them would dominate the update). Mathematically you can also interpret these tricks as a way of controlling the variance of the policy gradient estimator. A more in-depth exploration can be found here.
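A sketch of both steps, computing discounted returns and then standardizing them (resetting the running sum whenever a nonzero reward appears is a Pong-specific convention used by the post's script, and the sample rewards below are made up):

import numpy as np

def discount_rewards(r, gamma=0.99):
    # R_t = sum_k gamma^k * r_{t+k}, computed backwards over one rollout
    discounted = np.zeros_like(r, dtype=np.float64)
    running_add = 0.0
    for t in reversed(range(len(r))):
        if r[t] != 0:
            running_add = 0.0        # a point was scored: treat it as a game boundary (Pong-specific)
        running_add = running_add * gamma + r[t]
        discounted[t] = running_add
    return discounted

R = discount_rewards(np.array([0, 0, 0, -1, 0, 0, +1], dtype=np.float64))
R = (R - R.mean()) / (R.std() + 1e-8)    # standardize before plugging the returns into backprop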

Deriving Policy Gradients. I’d like to also give a sketch of where Policy Gradients come from mathematically. Policy Gradients are a special case of a more general score function gradient estimator. The general case is that when we have an expression of the form E_{x∼p(x∣θ)}[f(x)] - i.e. the expectation of some scalar valued score function f(x) under some probability distribution p(x;θ) parameterized by some θ. Hint hint, f(x) will become our reward function (or advantage function more generally) and p(x) will be our policy network, which is really a model for p(a∣I), giving a distribution over actions for any image I. Then we are interested in finding how we should shift the distribution (through its parameters θ) to increase the scores of its samples, as judged by f (i.e. how do we change the network’s parameters so that action samples get higher rewards). We have that:

Deriving the formula

I still want to get the full story straight for myself, combining it with some material I have seen in other books, plus notes on a few small questions.
Mathematical preliminaries:

  • The relationship between expectations and probability distributions.
    Here the probability density is f(x) and g(x) is the value function. In the continuous case, E[g(x)] = ∫ g(x) f(x) dx, which is exactly what justifies the expansion in the first equality of the derivation below.

Deriving Policy Gradients: I would also like to give a brief sketch of where Policy Gradients come from mathematically. Policy Gradients are a special case of a more general score function gradient estimator. The general case is when we have an expression of the form E_{x∼p(x∣θ)}[f(x)] - that is, the expectation of some scalar-valued score function f(x) under some probability distribution p(x;θ) parameterized by θ. Hint hint: f(x) will become our reward function (or, more generally, the advantage function), and p(x) will become our policy network, which is really a model for p(a∣I), giving a distribution over actions for any input image I (the input is an image, the output is an action). Then we are interested in how we should shift the distribution (through its parameters θ) to increase the scores of its samples, as judged by f (i.e. how we change the network's parameters so that action samples get higher rewards). We have:

∇_θ E_x[f(x)] = ∇_θ ∑_x p(x) f(x)                      (definition of expectation)
              = ∑_x ∇_θ p(x) f(x)                       (swap sum and gradient)
              = ∑_x p(x) (∇_θ p(x) / p(x)) f(x)         (both multiply and divide by p(x))
              = ∑_x p(x) ∇_θ log p(x) f(x)              (use the fact that ∇_θ log(z) = (1/z) ∇_θ z)
              = E_x[ f(x) ∇_θ log p(x) ]                (definition of expectation)
To put this in English, we have some distribution p(x;θ) (I used shorthand p(x) to reduce clutter) that we can sample from (e.g. this could be a gaussian). For each sample we can also evaluate the score function f which takes the sample and gives us some scalar-valued score. This equation is telling us how we should shift the distribution (through its parameters θ) if we wanted its samples to achieve higher scores, as judged by f. In particular, it says that look: draw some samples x, evaluate their scores f(x), and for each x also evaluate the second term ∇_θ log p(x;θ). What is this second term? It’s a vector - the gradient that’s giving us the direction in the parameter space that would lead to increase of the probability assigned to an x. In other words if we were to nudge θ in the direction of ∇_θ log p(x;θ) we would see the new probability assigned to some x slightly increase. If you look back at the formula, it’s telling us that we should take this direction and multiply onto it the scalar-valued score f(x). This will make it so that samples that have a higher score will “tug” on the probability density stronger than the samples that have lower score, so if we were to do an update based on several samples from p the probability density would shift around in the direction of higher scores, making highly-scoring samples more likely.

In plain language: we have some distribution p(x;θ) (I use the shorthand p(x) to reduce clutter) that we can sample from (for example, a Gaussian). For each sample we can also evaluate the score function f, which takes the sample and gives us a scalar-valued score. The equation tells us how we should shift the distribution (through its parameters θ) if we want its samples to achieve higher scores, as judged by f. In particular, it says: draw some samples x, evaluate their scores f(x), and for each x also evaluate the second term ∇_θ log p(x;θ). What is this second term? It is a vector - the gradient that gives the direction in parameter space that would increase the probability assigned to that x. In other words, if we nudge θ in the direction of ∇_θ log p(x;θ), the probability assigned to that x increases slightly. Looking back at the formula, it tells us to take this direction and multiply it by the scalar-valued score f(x). This makes samples with a higher score 'tug' on the probability density more strongly than samples with a lower score (roughly: the high-scoring samples drive the bigger parameter updates), so if we do an update based on several samples from p, the probability density shifts in the direction of higher scores, making high-scoring samples more likely.

A visualization of the score function gradient estimator. Left: A gaussian distribution and a few samples from it (blue dots). On each blue dot we also plot the gradient of the log probability with respect to the gaussian’s mean parameter. The arrow indicates the direction in which the mean of the distribution should be nudged to increase the probability of that sample. Middle: Overlay of some score function giving -1 everywhere except +1 in some small regions (note this can be an arbitrary and not necessarily differentiable scalar-valued function). The arrows are now color coded because due to the multiplication in the update we are going to average up all the green arrows, and the negative of the red arrows. Right: after parameter update, the green arrows and the reversed red arrows nudge us to left and towards the bottom. Samples from this distribution will now have a higher expected score, as desired.

Mofan (a well-known Chinese RL blogger) did not explain this figure in detail, so let me translate it briefly and add some of my own understanding.
Visualization of the score function gradient estimator:

  • Left: a Gaussian distribution and a few samples from it (blue dots). On each blue dot we also plot the gradient of the log probability with respect to the Gaussian's mean parameter. The arrow indicates the direction in which the mean of the distribution should be nudged to increase the probability of that sample.
  • Middle: an overlay of some score function that gives -1 everywhere except +1 in some small region (note this can be an arbitrary, not necessarily differentiable, scalar-valued function). The arrows are now color-coded because, due to the multiplication in the update, we are going to average up all the green arrows and the negatives of the red arrows.
  • Right: after the parameter update, the green arrows and the reversed red arrows nudge us to the left and towards the bottom. Samples from this distribution now have a higher expected score, as desired.
  • My simple reading: the arrows are the directions of the gradient updates. We are not reversing the gradient here - we want positive feedback - so the update follows the positive gradient direction.
  • So where the gradient direction and the sign of the score agree, the update goes in that direction; the lower-left of the middle panel is exactly this case, so that part of the density gets pulled toward the lower-left, stretching slightly outward.
  • In the upper-right, the score is essentially -1, so the direction is reversed; the net pull is again toward the lower-left, shrinking slightly inward. That is why the whole distribution shifts toward the lower-left and becomes a bit tighter.
  • That is my intuitive reading; I am not sure it is exactly right, but I think it is basically OK...
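To make the estimator in this figure concrete, here is a tiny numerical check for a 1-D Gaussian (the score function and all values are made up; for a Gaussian, ∇_μ log p(x;μ,σ) = (x-μ)/σ²):

import numpy as np

rng = np.random.default_rng(0)
mu, sigma = 0.0, 1.0
f = lambda x: np.where(x > 1.0, 1.0, -1.0)        # score: +1 in a small region on the right, -1 elsewhere

xs = rng.normal(mu, sigma, size=200_000)          # samples from p(x; mu, sigma)
grad_mu = np.mean(f(xs) * (xs - mu) / sigma**2)   # score-function estimate of d/dmu E[f(x)]
print(grad_mu)                                    # positive: nudge mu toward the region where f(x) = +1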

I hope the connection to RL is clear. Our policy network gives us samples of actions, and some of them work better than others (as judged by the advantage function). This little piece of math is telling us that the way to change the policy’s parameters is to do some rollouts, take the gradient of the sampled actions, multiply it by the score and add everything, which is what we’ve done above. For a more thorough derivation and discussion I recommend John Schulman’s lecture.
I hope the connection between this and RL is clear. Our policy network gives us samples of actions, and some of them work better than others (as judged by the advantage function). This little piece of math tells us that the way to change the policy's parameters is to do some rollouts, take the gradients of the sampled actions, multiply them by the scores, and add everything up, which is exactly what we did above. For a more thorough derivation and discussion I recommend John Schulman's lectures.

Learning. Alright, we’ve developed the intuition for policy gradients and saw a sketch of their derivation. I implemented the whole approach in a 130-line Python script, which uses OpenAI Gym’s ATARI 2600 Pong. I trained a 2-layer policy network with 200 hidden layer units using RMSProp on batches of 10 episodes (each episode is a few dozen games, because the games go up to score of 21 for either player). I did not tune the hyperparameters too much and ran the experiment on my (slow) Macbook, but after training for 3 nights I ended up with a policy that is slightly better than the AI player. The total number of episodes was approximately 8,000 so the algorithm played roughly 200,000 Pong games (quite a lot isn’t it!) and made a total of ~800 updates. I’m told by friends that if you train on GPU with ConvNets for a few days you can beat the AI player more often, and if you also optimize hyperparameters carefully you can also consistently dominate the AI player (i.e. win every single game). However, I didn’t spend too much time computing or tweaking, so instead we end up with a Pong AI that illustrates the main ideas and works quite well:
Learning. Alright, we have developed the intuition for policy gradients and seen a sketch of their derivation. I implemented the whole approach in a 130-line Python script that uses OpenAI Gym's ATARI 2600 Pong. I trained a 2-layer policy network with 200 hidden units using RMSProp on batches of 10 episodes (each episode is a few dozen games, because a game goes up to a score of 21 for either player). I did not tune the hyperparameters much and ran the experiment on my (slow) Macbook, but after training for 3 nights I ended up with a policy that is slightly better than the built-in AI player. The total number of episodes was roughly 8,000, so the algorithm played roughly 200,000 Pong games (quite a lot, isn't it!) and made a total of about 800 updates. Friends tell me that if you train on a GPU with ConvNets for a few days you can beat the AI player more often, and if you also optimize the hyperparameters carefully you can consistently dominate the AI player (i.e. win every single game). However, I did not spend much time on compute or tweaking, so instead we end up with a Pong agent that illustrates the main ideas and works quite well:

There is a YouTube video here in the original post; I cannot embed the link, so go watch it there.
The learned agent (in green, right) facing off with the hard-coded AI opponent (left).
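For reference, the training description above mentions updating with RMSProp over batches of 10 episodes; a sketch of that kind of update looks like the following (the decay, learning rate and shapes are assumptions, not the post's exact settings):

import numpy as np

rng = np.random.default_rng(0)
learning_rate, decay_rate, eps = 1e-4, 0.99, 1e-5

W1 = rng.standard_normal((200, 80 * 80)) / np.sqrt(80 * 80)
grad_W1 = rng.standard_normal(W1.shape)       # stand-in for a gradient accumulated over 10 episodes
cache_W1 = np.zeros_like(W1)                  # running average of squared gradients, kept across updates

cache_W1 = decay_rate * cache_W1 + (1 - decay_rate) * grad_W1 ** 2
W1 += learning_rate * grad_W1 / (np.sqrt(cache_W1) + eps)   # gradient ascent on the expected reward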

Learned weights. We can also take a look at the learned weights. Due to preprocessing every one of our inputs is an 80x80 difference image (current frame minus last frame). We can now take every row of W1, stretch them out to 80x80 and visualize. Below is a collection of 40 (out of 200) neurons in a grid. White pixels are positive weights and black pixels are negative weights. Notice that several neurons are tuned to particular traces of bouncing ball, encoded with alternating black and white along the line. The ball can only be at a single spot, so these neurons are multitasking and will “fire” for multiple locations of the ball along that line. The alternating black and white is interesting because as the ball travels along the trace, the neuron’s activity will fluctuate as a sine wave and due to the ReLU it would “fire” at discrete, separated positions along the trace. There’s a bit of noise in the images, which I assume would have been mitigated if I used L2 regularization.
Learned weights. We can also take a look at the learned weights. Due to the preprocessing, every one of our inputs is an 80x80 difference image (the current frame minus the previous frame). We can now take every row of W1, stretch it out to 80x80 and visualize it. Below is a collection of 40 (out of 200) neurons in a grid. White pixels are positive weights and black pixels are negative weights. Notice that several neurons are tuned to particular traces of the bouncing ball, encoded by alternating black and white along the line. The ball can only be at a single spot at a time, so these neurons are multitasking and will 'fire' for multiple locations of the ball along that line. The alternating black and white is interesting because, as the ball travels along the trace, the neuron's activity fluctuates like a sine wave, and because of the ReLU it 'fires' at discrete, separated positions along the trace. There is a bit of noise in the images, which I assume would have been mitigated if I had used L2 regularization.
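A quick sketch of how such a visualization can be produced (here W1 is just random noise so that the code runs on its own; with the trained weights you would see the ball-trace patterns described above):

import numpy as np
import matplotlib.pyplot as plt

W1 = np.random.randn(200, 80 * 80)               # stand-in for the trained first-layer weights
fig, axes = plt.subplots(5, 8, figsize=(12, 8))
for ax, row in zip(axes.ravel(), W1[:40]):
    ax.imshow(row.reshape(80, 80), cmap="gray")  # white = positive weights, black = negative weights
    ax.axis("off")
plt.show()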

"""
I do not really understand this visualization: the author did not use a convolutional network, and I do not see what these traces are useful for, so I will leave it at that.
"""
[Figure]

What isn’t happening
So there you have it - we learned to play Pong from raw pixels with Policy Gradients and it works quite well. The approach is a fancy form of guess-and-check, where the “guess” refers to sampling rollouts from our current policy, and the “check” refers to encouraging actions that lead to good outcomes. Modulo some details, this represents the state of the art in how we currently approach reinforcement learning problems. It’s impressive that we can learn these behaviors, but if you understood the algorithm intuitively and you know how it works you should be at least a bit disappointed. In particular, how does it not work?
So there you have it: we learned to play Pong from raw pixels with Policy Gradients, and it works quite well. The approach is a fancy form of guess-and-check, where the 'guess' refers to sampling rollouts from our current policy, and the 'check' refers to encouraging the actions that lead to good outcomes. Modulo some details, this represents the state of the art in how we currently approach reinforcement learning problems. It is impressive that we can learn these behaviors, but if you understand the algorithm intuitively and know how it works, you should be at least a bit disappointed. In particular, how does it not work?

The algorithm part above is essentially fully translated. What follows are the author's reflections; for these I mostly relied on machine translation.

Compare that to how a human might learn to play Pong. You show them the game and say something along the lines of “You’re in control of a paddle and you can move it up and down, and your task is to bounce the ball past the other player controlled by AI”, and you’re set and ready to go. Notice some of the differences:
Compare that to how a human might learn to play Pong. You show them the game and say something along the lines of 'You are in control of a paddle and you can move it up and down; your task is to bounce the ball past the other player, which is controlled by an AI', and you are set and ready to go. Note some of the differences:

  • In practical settings we usually communicate the task in some manner (e.g. English above), but in a standard RL problem you assume an arbitrary reward function that you have to discover through environment interactions. It can be argued that if a human went into game of Pong but without knowing anything about the reward function (indeed, especially if the reward function was some static but random function), the human would have a lot of difficulty learning what to do but Policy Gradients would be indifferent, and likely work much better. Similarly, if we took the frames and permuted the pixels randomly then humans would likely fail, but our Policy Gradient solution could not even tell the difference (if it’s using a fully connected network as done here).

  • In practical settings we usually communicate the task in some manner (e.g. in English, as above), but in a standard RL problem you assume an arbitrary reward function that you have to discover through environment interactions. It can be argued that if a human went into a game of Pong without knowing anything about the reward function (indeed, especially if the reward function were some static but random function), the human would have a lot of difficulty learning what to do, but Policy Gradients would be indifferent and would likely work much better. Similarly, if we took the frames and permuted the pixels randomly, humans would likely fail, but our Policy Gradient solution could not even tell the difference (if it uses a fully connected network, as done here).

  • A human brings in a huge amount of prior knowledge, such as intuitive physics (the ball bounces, it’s unlikely to teleport, it’s unlikely to suddenly stop, it maintains a constant velocity, etc.), and intuitive psychology (the AI opponent “wants” to win, is likely following an obvious strategy of moving towards the ball, etc.). You also understand the concept of being “in control” of a paddle, and that it responds to your UP/DOWN key commands. In contrast, our algorithms start from scratch which is simultaneously impressive (because it works) and depressing (because we lack concrete ideas for how not to).
    Policy Gradients are a brute force solution, where the correct actions are eventually discovered and internalized into a policy. Humans build a rich, abstract model and plan within it. In Pong, I can reason that the opponent is quite slow so it might be a good strategy to bounce the ball with high vertical velocity, which would cause the opponent to not catch it in time. However, it also feels as though we also eventually “internalize” good solutions into what feels more like a reactive muscle memory policy. For example if you’re learning a new motor task (e.g. driving a car with stick shift?) you often feel yourself thinking a lot in the beginning but eventually the task becomes automatic and mindless.
    A human brings in a huge amount of prior knowledge, such as intuitive physics (the ball bounces, it is unlikely to teleport, it is unlikely to suddenly stop, it maintains a constant velocity, etc.) and intuitive psychology (the AI opponent 'wants' to win and is likely following an obvious strategy of moving towards the ball, etc.). You also understand the concept of being 'in control' of a paddle, and that it responds to your UP/DOWN key commands. In contrast, our algorithms start from scratch, which is simultaneously impressive (because it works) and depressing (because we lack concrete ideas for how not to).

  • Policy Gradients have to actually experience a positive reward, and experience it very often in order to eventually and slowly shift the policy parameters towards repeating moves that give high rewards. With our abstract model, humans can figure out what is likely to give rewards without ever actually experiencing the rewarding or unrewarding transition. I don’t have to actually experience crashing my car into a wall a few hundred times before I slowly start avoiding to do so.

  • Policy Gradients are a brute-force solution, where the correct actions are eventually discovered and internalized into a policy. Humans instead build a rich, abstract model and plan within it. In Pong, I can reason that the opponent is quite slow, so it might be a good strategy to bounce the ball with high vertical velocity so that the opponent cannot catch it in time. However, it also feels as though we eventually 'internalize' good solutions into something more like a reactive muscle-memory policy. For example, if you are learning a new motor task (e.g. driving a car with a stick shift), you often find yourself thinking a lot at the beginning, but eventually the task becomes automatic and mindless.

Left: Montezuma’s Revenge: a difficult game for our RL algorithms. The player must jump down, climb up, get the key, and open the door. A human understands that acquiring a key is useful. The computer samples billions of random moves and 99% of the time falls to its death or gets killed by the monster. In other words it’s hard to “stumble into” the rewarding situation. Right: Another difficult game called Frostbite, where a human understands that things move, some things are good to touch, some things are bad to touch, and the goal is to build the igloo brick by brick. A good analysis of this game and a discussion of differences between the human and computer approach can be found in Building Machines That Learn and Think Like People.
I’d like to also emphasize the point that, conversely, there are many games where Policy Gradients would quite easily defeat a human. In particular, anything with frequent reward signals that requires precise play, fast reflexes, and not too much long-term planning would be ideal, as these short-term correlations between rewards and actions can be easily “noticed” by the approach, and the execution meticulously perfected by the policy. You can see hints of this already happening in our Pong agent: it develops a strategy where it waits for the ball and then rapidly dashes to catch it just at the edge, which launches it quickly and with high vertical velocity. The agent scores several points in a row repeating this strategy. There are many ATARI games where Deep Q Learning destroys human baseline performance in this fashion - e.g. Pinball, Breakout, etc.

In conclusion, once you understand the “trick” by which these algorithms work you can reason through their strengths and weaknesses. In particular, we are nowhere near humans in building abstract, rich representations of games that we can plan within and use for rapid learning. One day a computer will look at an array of pixels and notice a key, a door, and think to itself that it is probably a good idea to pick up the key and reach the door. For now there is nothing anywhere close to this, and trying to get there is an active area of research.

Non-differentiable computation in Neural Networks
I’d like to mention one more interesting application of Policy Gradients unrelated to games: It allows us to design and train neural networks with components that perform (or interact with) non-differentiable computation. The idea was first introduced in Williams 1992 and more recently popularized by Recurrent Models of Visual Attention under the name “hard attention”, in the context of a model that processed an image with a sequence of low-resolution foveal glances (inspired by our own human eyes). In particular, at every iteration an RNN would receive a small piece of the image and sample a location to look at next. For example the RNN might look at position (5,30), receive a small piece of the image, then decide to look at (24, 50), etc. The problem with this idea is that there a piece of network that produces a distribution of where to look next and then samples from it. Unfortunately, this operation is non-differentiable because, intuitively, we don’t know what would have happened if we sampled a different location. More generally, consider a neural network from some inputs to outputs:
[Figure]

Notice that most arrows (in blue) are differentiable as normal, but some of the representation transformations could optionally also include a non-differentiable sampling operation (in red). We can backprop through the blue arrows just fine, but the red arrow represents a dependency that we cannot backprop through.

Policy gradients to the rescue! We’ll think about the part of the network that does the sampling as a small stochastic policy embedded in the wider network. Therefore, during training we will produce several samples (indicated by the branches below), and then we’ll encourage samples that eventually led to good outcomes (in this case for example measured by the loss at the end). In other words we will train the parameters involved in the blue arrows with backprop as usual, but the parameters involved with the red arrow will now be updated independently of the backward pass using policy gradients, encouraging samples that led to low loss. This idea was also recently formalized nicely in Gradient Estimation Using Stochastic Computation Graphs.
[Figure]

Trainable Memory I/O. You’ll also find this idea in many other papers. For example, a Neural Turing Machine has a memory tape that it reads and writes from. To do a write operation one would like to execute something like m[i] = x, where i and x are predicted by an RNN controller network. However, this operation is non-differentiable because there is no signal telling us what would have happened to the loss if we were to write to a different location j != i. Therefore, the NTM has to do soft read and write operations. It predicts an attention distribution a (with elements between 0 and 1 and summing to 1, and peaky around the index we’d like to write to), and then doing for all i: m[i] = a[i]*x. This is now differentiable, but we have to pay a heavy computational price because we have to touch every single memory cell just to write to one position. Imagine if every assignment in our computers had to touch the entire RAM!

However, we can use policy gradients to circumvent this problem (in theory), as done in RL-NTM. We still predict an attention distribution a, but instead of doing the soft write we sample locations to write to: i = sample(a); m[i] = x. During training we would do this for a small batch of i, and in the end make whatever branch worked best more likely. The large computational advantage is that we now only have to read/write at a single location at test time. However, as pointed out in the paper this strategy is very difficult to get working because one must accidentally stumble by working algorithms through sampling. The current consensus is that PG works well only in settings where there are a few discrete choices so that one is not hopelessly sampling through huge search spaces.
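A toy contrast of the two kinds of write described above (sizes and values are made up; this is not the NTM's actual interface):

import numpy as np

rng = np.random.default_rng(0)
m = np.zeros(8)                                                  # the memory tape
x = 1.0                                                          # the value to write
a = np.array([0.05, 0.05, 0.70, 0.10, 0.02, 0.03, 0.03, 0.02])   # attention over locations, sums to 1

m_soft = m + a * x             # NTM-style soft write: differentiable, but touches every cell
i = rng.choice(len(a), p=a)    # RL-NTM-style hard write: sample a single location...
m[i] = x                       # ...and write only there; non-differentiable, so trained with PG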

However, with Policy Gradients and in cases where a lot of data/compute is available we can in principle dream big - for instance we can design neural networks that learn to interact with large, non-differentiable modules such as Latex compilers (e.g. if you’d like char-rnn to generate latex that compiles), or a SLAM system, or LQR solvers, or something. Or, for example, a superintelligence might want to learn to interact with the internet over TCP/IP (which is sadly non-differentiable) to access vital information needed to take over the world. That’s a great example.

Conclusions
We saw that Policy Gradients are a powerful, general algorithm and as an example we trained an ATARI Pong agent from raw pixels, from scratch, in 130 lines of Python. More generally the same algorithm can be used to train agents for arbitrary games and one day hopefully on many valuable real-world control problems. I wanted to add a few more notes in closing:

On advancing AI. We saw that the algorithm works through a brute-force search where you jitter around randomly at first and must accidentally stumble into rewarding situations at least once, and ideally often and repeatedly before the policy distribution shifts its parameters to repeat the responsible actions. We also saw that humans approach these problems very differently, in what feels more like rapid abstract model building - something we have barely even scratched the surface of in research (although many people are trying). Since these abstract models are very difficult (if not impossible) to explicitly annotate, this is also why there is so much interest recently in (unsupervised) generative models and program induction.

On use in complex robotics settings. The algorithm does not scale naively to settings where huge amounts of exploration are difficult to obtain. For instance, in robotic settings one might have a single (or few) robots, interacting with the world in real time. This prohibits naive applications of the algorithm as I presented it in this post. One related line of work intended to mitigate this problem is deterministic policy gradients - instead of requiring samples from a stochastic policy and encouraging the ones that get higher scores, the approach uses a deterministic policy and gets the gradient information directly from a second network (called a critic) that models the score function. This approach can in principle be much more efficient in settings with very high-dimensional actions where sampling actions provides poor coverage, but so far seems empirically slightly finicky to get working. Another related approach is to scale up robotics, as we’re starting to see with Google’s robot arm farm, or perhaps even Tesla’s Model S + Autopilot.

There is also a line of work that tries to make the search process less hopeless by adding additional supervision. In many practical cases, for instance, one can obtain expert trajectories from a human. For example AlphaGo first uses supervised learning to predict human moves from expert Go games and the resulting human mimicking policy is later finetuned with policy gradients on the “real” objective of winning the game. In some cases one might have fewer expert trajectories (e.g. from robot teleoperation) and there are techniques for taking advantage of this data under the umbrella of apprenticeship learning. Finally, if no supervised data is provided by humans it can also be in some cases computed with expensive optimization techniques, e.g. by trajectory optimization in a known dynamics model (such as F=ma in a physical simulator), or in cases where one learns an approximate local dynamics model (as seen in very promising framework of Guided Policy Search).

On using PG in practice. As a last note, I’d like to do something I wish I had done in my RNN blog post. I think I may have given the impression that RNNs are magic and automatically do arbitrary sequential problems. The truth is that getting these models to work can be tricky, requires care and expertise, and in many cases could also be an overkill, where simpler methods could get you 90%+ of the way there. The same goes for Policy Gradients. They are not automatic: You need a lot of samples, it trains forever, it is difficult to debug when it doesn’t work. One should always try a BB gun before reaching for the Bazooka. In the case of Reinforcement Learning for example, one strong baseline that should always be tried first is the cross-entropy method (CEM), a simple stochastic hill-climbing “guess and check” approach inspired loosely by evolution. And if you insist on trying out Policy Gradients for your problem make sure you pay close attention to the tricks section in papers, start simple first, and use a variation of PG called TRPO, which almost always works better and more consistently than vanilla PG in practice. The core idea is to avoid parameter updates that change your policy too much, as enforced by a constraint on the KL divergence between the distributions predicted by the old and the new policy on a batch of data (instead of conjugate gradients the simplest instantiation of this idea could be implemented by doing a line search and checking the KL along the way).

And that’s it! I hope I gave you a sense of where we are with Reinforcement Learning, what the challenges are, and if you’re eager to help advance RL I invite you to do so within our OpenAI Gym 😃 Until next time!

Translation summary:

The author's way of thinking is really great: he connects the unfamiliar territory of RL to the familiar setting of supervised learning and gently points out the differences, which lowers the conceptual barrier for newcomers a lot. Thanks~
Also, I would still encourage you to read the English original directly, looking up unfamiliar words with a dictionary tool as you go. If you can follow it, that will be easier to understand than reading my version or Mofan's. Reading the Chinese is a shortcut, but sooner or later you have to walk the road yourself.
Keep at it. I think this translation still has some value; after reading it, you should basically be able to understand the PG algorithm.
