AlphaGo basics: policy gradient
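The policy gradient setup described here uses two networks:
a value network and a policy network.
Value network: estimates the return (or win rate) that can be expected from the current state. The return includes the reward at this state plus the rewards of the following steps, with later rewards given progressively smaller weights.
Policy network: decides which action to take in the current state so that the agent collects as much reward as possible. The gradient for the action taken in a given state is weighted by the difference between the actual return in the training data and the return predicted by the value network.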
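CartPole as a training example
CartPole is about the simplest example there is:
A cart sits on a frictionless horizontal track with a pole standing upright on it. When the cart moves, the pole starts to fall, and the only thing you can do is push the cart left or right. The cart starts at the origin; the episode counts as a failure once the cart is more than 2.4 away from the origin (|x| > 2.4) or the pole tilts more than 12 degrees from vertical.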
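The failure rule above can be written down as a small check. This is only a sketch, not gym's own implementation: the function name has_failed and its two arguments are hypothetical, and the thresholds are the 2.4 and 12 degrees quoted above.

```python
import math

X_LIMIT = 2.4                    # cart position limit quoted in the text
ANGLE_LIMIT = math.radians(12)   # pole angle limit, converted to radians


def has_failed(cart_position, pole_angle):
    """Return True if the episode should end (hypothetical helper).

    cart_position: signed distance of the cart from the origin
    pole_angle:    signed angle of the pole from vertical, in radians
    """
    return abs(cart_position) > X_LIMIT or abs(pole_angle) > ANGLE_LIMIT
```

For example, has_failed(0.0, 0.0) is False for the starting position, while has_failed(2.5, 0.0) is True because the cart has left the allowed range.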
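Training procedure for policy gradient with a value network:
1 Generate actions at random; in the code they are sampled from the probabilities the policy network outputs (at test time the action the policy network rates highest for the current state is taken directly). Every step that keeps the episode alive yields a reward of 1; when no more reward arrives, the episode is over. This produces a sequence of triples, each holding the current state, the current action and the current reward. The length of the sequence is the time step at which the game ended (or the manually set cap, after which the episode is cut off).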
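2 Build the training data for the value network. Take the sequence of triples A from step 1 and let its length be N. For element i, compute the discounted return from that state:
future_reward = 0
decay = 1.0
for x in range(N - i):
    future_reward += A[i + x].reward * decay
    decay = decay * 0.97
The resulting sequence of (state, future_reward) pairs is used to train the value network. The goal of this training is a better estimate of the win rate, that is, of the return to expect from the current state.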
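3 Build the training data for the policy network. Once the value network is trained, it can estimate the future_reward obtainable from the current state, and the policy network is trained against that estimate. The policy network takes the current state as input and outputs the probability of each action in that state.
How is future_reward used? future_reward comes from the actual training samples, while the value network produces an estimate for the same state, future_reward_assessment. Assuming the value network is well trained, that estimate is accurate, so future_reward - future_reward_assessment measures how good the chosen action was.
If the difference is negative, the action was a poor choice: the estimate said more reward was available than was actually collected. The gradient update is multiplied by this difference, so after the update the probability of taking this action in the same situation is lower.
If the difference is positive, the actual future_reward exceeded the estimate future_reward_assessment, so the action did better than expected. The gradient update is again multiplied by the difference, so the next time the same state comes up this action is chosen with higher probability.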
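4 To summarize: step 1 generates the data; step 2 lightly processes it to train the value network; step 3 subtracts the value network's estimate future_reward_assessment from the actual future_reward to get the weight for the gradient update of the action taken. Repeating steps 1, 2 and 3 trains both the value network and the policy network.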
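Steps 2 and 3 can be sketched in plain NumPy, without any network. This is an illustration under stated assumptions, not the code from the repository: the helper names discounted_returns and advantages are hypothetical, the value estimates are made-up numbers standing in for the value network's output, and the return is accumulated in a single backward pass rather than with the nested loop of step 2 (both give the same numbers).

```python
import numpy as np


def discounted_returns(rewards, decay=0.97):
    """Discounted return for every step of one episode (hypothetical helper).

    Works from the last step backwards: G[i] = r[i] + decay * G[i + 1].
    """
    returns = np.zeros(len(rewards))
    running = 0.0
    for i in reversed(range(len(rewards))):
        running = rewards[i] + decay * running
        returns[i] = running
    return returns


def advantages(returns, value_estimates):
    """future_reward - future_reward_assessment for every step."""
    return np.asarray(returns) - np.asarray(value_estimates)


# A three-step episode with reward 1 at every step.
rewards = [1.0, 1.0, 1.0]
returns = discounted_returns(rewards)

# Made-up value estimates; in the real loop these come from the value network.
value_estimates = np.array([2.0, 2.0, 2.0])
adv = advantages(returns, value_estimates)

print(returns)  # discounted return per step
print(adv)      # positive: better than estimated, negative: worse
```

With these numbers the first step has a positive advantage (the episode lasted longer than the estimate implied) and the last step a negative one, which is the sign the policy update in step 3 relies on.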
Simple sample code follows.
Code source:
https://github.com/kvfrans/openai-cartpole
import tensorflow as tf
import numpy as np
import random
import gym
import math
import matplotlib.pyplot as plt


def softmax(x):
    e_x = np.exp(x - np.max(x))
    out = e_x / e_x.sum()
    return out
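

# policy network: get action probabilities based on the state
def policy_gradient():
    with tf.variable_scope("policy"):
        params = tf.get_variable("policy_parameters", [4, 2])
        state = tf.placeholder("float", [None, 4])
        actions = tf.placeholder("float", [None, 2])
        advantages = tf.placeholder("float", [None, 1])
        linear = tf.matmul(state, params)
        probabilities = tf.nn.softmax(linear)
        # probability the policy assigned to the action actually taken
        good_probabilities = tf.reduce_sum(tf.multiply(probabilities, actions), reduction_indices=[1])
        # reshape [None] to [None, 1] so the product with advantages below is
        # element-wise instead of broadcasting to [None, None]
        good_probabilities = tf.expand_dims(good_probabilities, 1)
        # The policy network is trained so that the reward collected under the policy is as large as possible.
        # advantages is the future_reward actually obtained from the current state minus
        # the reward estimated by the value function (assess_reward).
        # A positive difference means the estimate was assess_reward but the actual reward was larger,
        # so the action was a good one and should be more likely the next time the same state comes up.
        # A negative difference means the actual reward was smaller than assess_reward,
        # so the action was probably not good enough and should become less likely.
        eligibility = tf.log(good_probabilities) * advantages
        loss = -tf.reduce_sum(eligibility)
        optimizer = tf.train.AdamOptimizer(0.01).minimize(loss)
        return probabilities, state, actions, advantages, optimizer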


# value network: estimates the reward from the current state,
# i.e. the accumulated reward from this state until the episode stops
def value_gradient():
    with tf.variable_scope("value"):
        state = tf.placeholder("float", [None, 4])
        # newvals holds the future_reward targets computed from the episode
        newvals = tf.placeholder("float", [None, 1])
        w1 = tf.get_variable("w1", [4, 10])
        b1 = tf.get_variable("b1", [10])
        h1 = tf.nn.relu(tf.matmul(state, w1) + b1)
        w2 = tf.get_variable("w2", [10, 1])
        b2 = tf.get_variable("b2", [1])
        calculated = tf.matmul(h1, w2) + b2
        # value_gradient trains this network to estimate the future reward
        # from the current state, i.e. the reward you can expect to collect
        # from this state onwards when the actions taken are good
        # later when testing,you can use the
        diffs = calculated - newvals
        loss = tf.nn.l2_loss(diffs)
        optimizer = tf.train.AdamOptimizer(0.1).minimize(loss)
        return calculated, state, newvals, optimizer, loss


def run_episode(env, policy_grad, value_grad, sess, is_train=True):
    pl_calculated, pl_state, pl_actions, pl_advantages, pl_optimizer = policy_grad
    vl_calculated, vl_state, vl_newvals, vl_optimizer, vl_loss = value_grad
    observation = env.reset()
    totalreward = 0
    states = []
    actions = []
    advantages = []
    transitions = []
    update_vals = []

    for _ in range(200):
        # calculate policy
        obs_vector = np.expand_dims(observation, axis=0)
        # action probabilities for the current state
        probs = sess.run(pl_calculated, feed_dict={pl_state: obs_vector})
        # print("shape of probs is ", probs.shape)
        # at test time take the more probable action
        action = 1 if probs[0][0] < probs[0][1] else 0
        if is_train:
            # during training sample the action from the probabilities
            action = 0 if random.uniform(0, 1) < probs[0][0] else 1
        # record the transition
        states.append(observation)
        actionblank = np.zeros(2)
        actionblank[action] = 1
        actions.append(actionblank)
        # take the action in the environment
        old_observation = observation
        observation, reward, done, info = env.step(action)
        transitions.append((old_observation, action, reward))
        totalreward += reward

        if done:
            break

    for index, trans in enumerate(transitions):
        obs, action, reward = trans

        # calculate discounted monte-carlo return
        future_reward = 0
        future_transitions = len(transitions) - index
        decrease = 1
        for index2 in range(future_transitions):
            future_reward += transitions[(index2) + index][2] * decrease
            decrease = decrease * 0.97
        obs_vector = np.expand_dims(obs, axis=0)
        # currentval: the value function's estimate of the reward from this state
        currentval = sess.run(vl_calculated, feed_dict={vl_state: obs_vector})[0][0]

        # advantage: how much better was this action than normal,
        # i.e. how much the future_reward from the actual data exceeds the estimate.
        # As training goes on, currentval becomes a fairly accurate estimate of the
        # reward to expect from this state. Subtracting it from the actual reward
        # tells whether the action was good or bad, so the quality of this estimate
        # matters. The difference serves as the label for the action: if it is
        # positive, the update reinforces the action; if it is negative, the action
        # fell short of the expected level and its gradient is applied in the
        # opposite direction, so the action becomes less likely in similar states.
        advantages.append(future_reward - currentval)
        print("future_reward:", future_reward)
        print("currentval:", currentval)

        # update the value function towards new return
        update_vals.append(future_reward)

    # note: the two updates below also run when is_train is False
    # update value function
    update_vals_vector = np.expand_dims(update_vals, axis=1)
    # fit the value function to future_reward so that it learns to estimate the
    # reward from the current state, including the accumulated later rewards
    sess.run(vl_optimizer, feed_dict={vl_state: states, vl_newvals: update_vals_vector})
    # real_vl_loss = sess.run(vl_loss, feed_dict={vl_state: states, vl_newvals: update_vals_vector})

    advantages_vector = np.expand_dims(advantages, axis=1)
    # update the policy function from the actions actually taken and the advantages;
    # the value function is only needed for the subtraction from future_reward,
    # which decides whether the action becomes more or less likely in this state.
    # The policy loss is -sum(log(probability of the action taken) * advantage).
    sess.run(pl_optimizer, feed_dict={pl_state: states, pl_advantages: advantages_vector, pl_actions: actions})

    return totalreward


env = gym.make('CartPole-v0')
#env.monitor.start('cartpole-hill/', force=True)
policy_grad = policy_gradient()
value_grad = value_gradient()
sess = tf.InteractiveSession()
sess.run(tf.global_variables_initializer())

# train until one episode reaches the 200-step cap (at most 2000 episodes)
for i in range(2000):
    reward = run_episode(env, policy_grad, value_grad, sess)
    if reward == 200:
        print("reward 200")
        print(i)
        break

# test: average total reward over 1000 episodes with greedy actions
t = 0
for _ in range(1000):
    env.render()
    reward = run_episode(env, policy_grad, value_grad, sess, False)
    t += reward
print(t / 1000)
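To see why the loss in policy_gradient raises or lowers the probability of an action depending on the sign of the advantage, here is a small NumPy sketch for a single state. It makes several assumptions that differ from the listing: the helper policy_step is hypothetical, the gradient is written out by hand, the update is one plain gradient-descent step rather than Adam, and the observation is a made-up vector.

```python
import numpy as np


def softmax(x):
    e_x = np.exp(x - np.max(x))
    return e_x / e_x.sum()


def policy_step(params, state, action, advantage, lr=0.1):
    """One gradient-descent step on loss = -log(pi[action]) * advantage.

    Hypothetical helper for a linear softmax policy with logits = state @ params.
    """
    probs = softmax(state @ params)
    one_hot = np.zeros_like(probs)
    one_hot[action] = 1.0
    # d(loss)/d(logits) for a softmax policy is advantage * (probs - one_hot)
    grad_logits = advantage * (probs - one_hot)
    # chain rule through logits = state @ params
    grad_params = np.outer(state, grad_logits)
    return params - lr * grad_params


state = np.array([0.1, -0.2, 0.05, 0.3])  # made-up observation
params = np.zeros((4, 2))                 # both actions start at probability 0.5

p_before = softmax(state @ params)[1]
p_up = softmax(state @ policy_step(params, state, action=1, advantage=1.0))[1]
p_down = softmax(state @ policy_step(params, state, action=1, advantage=-1.0))[1]

print(p_before, p_up, p_down)
```

After a step with a positive advantage the probability of action 1 in this state is above 0.5, and after a step with a negative advantage it is below 0.5, which matches the behaviour described in the comments of policy_gradient.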
Installing gym and tensorflow:
Install gym from git
git clone https://github.com/openai/gym
cd gym
pip install -e . # minimal install
or
pip install -e .[all] # full install (this requires cmake and a recent pip version)
Install gym with pip
pip install gym #minimal install
or
pip install gym[all] #full install, fetch gym as a package
Install tensorflow:
pip install tensorflow==1.2
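A note on versions: the listing above is written against the older gym interface, where env.reset() returns only the observation and env.step() returns four values. Newer gym releases (0.26 onward) return (observation, info) from reset and five values from step, with done split into terminated and truncated. If you end up with such a release, a pair of small adapters like the following could bridge the gap. The helper names unpack_reset and unpack_step are hypothetical and not part of gym.

```python
def unpack_reset(result):
    """Return just the observation from env.reset(), old or new interface."""
    if isinstance(result, tuple) and len(result) == 2 and isinstance(result[1], dict):
        return result[0]  # new interface: (observation, info)
    return result         # old interface: observation only


def unpack_step(result):
    """Return (observation, reward, done, info) from env.step(), old or new interface."""
    if len(result) == 5:
        observation, reward, terminated, truncated, info = result
        return observation, reward, terminated or truncated, info
    observation, reward, done, info = result
    return observation, reward, done, info
```

In run_episode this would mean writing observation = unpack_reset(env.reset()) and observation, reward, done, info = unpack_step(env.step(action)).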