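I recently took part in Baidu's 7-day reinforcement learning bootcamp. I started as a complete beginner who had only heard of Q-learning and AlphaGo, and finished able to implement several classic reinforcement learning algorithms myself and to use Baidu's PARL framework to solve the Pong game and a quadrotor hovering task. The bootcamp got me up to speed quickly: I learned several classic algorithms and implemented them hands-on, which was quite rewarding. A quick plug for the PARL framework (on GitHub): it already implements and packages several classic algorithms such as Q-learning, DQN, Policy Gradient, DDPG and PPO, so developers can concentrate on designing the network Model and the Agent's interaction logic, and it is easy to use and to reuse. (In my own experience, implementing reinforcement learning with PARL was quite convenient.) PARL also offers large-scale distributed capabilities, which suit industrial applications (although I have not used that part myself).
What follows is a brief record of my study notes from the course.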
Lesson 1
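Core idea: an agent learns inside an environment. Based on the environment's state (or the observation it receives), the agent executes an action, and it uses the environment's feedback, the reward, to guide it toward better actions.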
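- Classic algorithms: Q-learning, Sarsa, DQN, Policy Gradient, A3C, DDPG, PPO
- Environment types: discrete control (the output actions are countable) and continuous control (the output action values are uncountable)
- GYM, the classic reinforcement learning environment library, standardizes the environment interface as reset() to reset the environment, step() to interact with it, and render() to render it
- PARL, a reinforcement learning framework library, abstracts a reinforcement learning program into three layers, Model, Algorithm and Agent, which makes implementing and debugging algorithms more convenient and flexible.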
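To make the reset()/step()/render() interface concrete, here is a minimal sketch of an interaction loop. `CoinFlipEnv` and `run_episode` are hypothetical names invented for this illustration (they are not part of GYM or PARL); the environment only imitates the 4-tuple `step()` return value used in the code throughout these notes.

```python
import random


class CoinFlipEnv:
    """Hypothetical toy environment that mimics a Gym-style interface."""

    def __init__(self, max_steps=10, seed=0):
        self.max_steps = max_steps
        self.rng = random.Random(seed)
        self.steps = 0
        self.coin = 0

    def reset(self):
        # Start a new episode and return the first observation.
        self.steps = 0
        self.coin = self.rng.randint(0, 1)
        return self.coin

    def step(self, action):
        # Reward 1 when the action matches the coin that was just observed.
        reward = 1.0 if action == self.coin else 0.0
        self.steps += 1
        self.coin = self.rng.randint(0, 1)  # next observation
        done = self.steps >= self.max_steps
        return self.coin, reward, done, {}

    def render(self):
        print("step", self.steps, "coin", self.coin)


def run_episode(env, policy, render=False):
    """Run one episode; `policy` maps an observation to an action."""
    obs = env.reset()
    total_reward, done = 0.0, False
    while not done:
        action = policy(obs)
        obs, reward, done, _ = env.step(action)
        total_reward += reward
        if render:
            env.render()
    return total_reward, env.steps


# A policy that simply copies the observation is optimal for this toy task.
total_reward, steps = run_episode(CoinFlipEnv(), policy=lambda obs: obs)
```

Every training loop later in these notes follows the same shape: reset once, then repeat "choose action, step, accumulate reward" until `done` is true.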
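A key issue in reinforcement learning is balancing exploration and exploitation. Reinforcement learning methods fall into two categories: value-based and policy-based.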
Lesson 2
Sarsa
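Sarsa is short for state-action-reward-state'-action'. Its goal is to learn the value Q of a specific action in a specific state, ultimately building and refining a Q table with states as rows and actions as columns. The table is updated from the rewards obtained by interacting with the environment, using the update rule:

$$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \left[ R_{t+1} + \gamma\, Q(S_{t+1}, A_{t+1}) - Q(S_t, A_t) \right]$$

Here $\alpha$ is the learning rate and $\gamma$ is the reward discount factor (`lr` and `gamma` in the code).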
# agent.py
import numpy as np


class SarsaAgent(object):
    def __init__(self, obs_n, act_n, learning_rate=0.01, gamma=0.9, e_greed=0.1):
        self.act_n = act_n  # action dimension: how many actions are available
        self.lr = learning_rate  # learning rate
        self.gamma = gamma  # reward discount factor
        self.epsilon = e_greed  # probability of picking a random action
        self.Q = np.zeros((obs_n, act_n))

    # Sample an action for the given observation, with exploration
    def sample(self, obs):
        if np.random.uniform(0, 1) < (1.0 - self.epsilon):  # choose according to the Q table
            action = self.predict(obs)
        else:
            action = np.random.choice(self.act_n)  # explore: pick a random action
        return action

    # Predict the greedy action for the given observation
    def predict(self, obs):
        Q_list = self.Q[obs, :]
        maxQ = np.max(Q_list)
        action_list = np.where(Q_list == maxQ)[0]  # several actions may share maxQ
        action = np.random.choice(action_list)
        return action

    # Learning step, i.e. how the Q table is updated
    def learn(self, obs, action, reward, next_obs, next_action, done):
        """ on-policy
            obs: observation before the interaction, s_t
            action: action chosen for this interaction, a_t
            reward: reward r received for this action
            next_obs: observation after the interaction, s_t+1
            next_action: action chosen for next_obs from the current Q table, a_t+1
            done: whether the episode has ended
        """
        predict_Q = self.Q[obs, action]
        if done:
            target_Q = reward  # there is no next state
        else:
            target_Q = reward + self.gamma * self.Q[next_obs, next_action]  # Sarsa
        self.Q[obs, action] += self.lr * (target_Q - predict_Q)  # correct Q

    # Save the Q table to a file
    def save(self):
        npy_file = './q_table.npy'
        np.save(npy_file, self.Q)
        print(npy_file + ' saved.')

    # Load Q values from a file into the Q table
    def restore(self, npy_file='./q_table.npy'):
        self.Q = np.load(npy_file)
        print(npy_file + ' loaded.')
def run_episode(env, agent, render=False):
    total_steps = 0  # number of steps taken in this episode
    total_reward = 0

    obs = env.reset()  # reset the environment, i.e. start a new episode
    action = agent.sample(obs)  # choose an action according to the algorithm

    while True:
        next_obs, reward, done, _ = env.step(action)  # interact with the environment once
        next_action = agent.sample(next_obs)  # choose the next action according to the algorithm
        # Train with Sarsa
        agent.learn(obs, action, reward, next_obs, next_action, done)

        action = next_action
        obs = next_obs  # carry the new observation into the next step
        total_reward += reward
        total_steps += 1  # count steps
        if render:
            env.render()  # render a new frame
        if done:
            break
    return total_reward, total_steps


def test_episode(env, agent):
    total_reward = 0
    obs = env.reset()
    while True:
        action = agent.predict(obs)  # greedy
        next_obs, reward, done, _ = env.step(action)
        total_reward += reward
        obs = next_obs
        # time.sleep(0.5)
        # env.render()
        if done:
            break
    return total_reward
Q-learning
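- Q-learning differs from Sarsa in how it updates the Q table. Sarsa is an on-policy update: it selects the next action first and then updates with it. Q-learning is an off-policy update: learn() does not need the action actually taken at the next step (next_action), and instead assumes that the next action is the one with the largest Q value.
- The Q-learning update rule is:

$$Q(S_t, A_t) \leftarrow Q(S_t, A_t) + \alpha \left[ R_{t+1} + \gamma \max_{a} Q(S_{t+1}, a) - Q(S_t, A_t) \right]$$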
import numpy as np


class QLearningAgent(object):
    def __init__(self, obs_n, act_n, learning_rate=0.01, gamma=0.9, e_greed=0.1):
        self.act_n = act_n  # action dimension: how many actions are available
        self.lr = learning_rate  # learning rate
        self.gamma = gamma  # reward discount factor
        self.epsilon = e_greed  # probability of picking a random action
        self.Q = np.zeros((obs_n, act_n))

    # Sample an action for the given observation, with exploration
    def sample(self, obs):
        if np.random.uniform(0, 1) < (1.0 - self.epsilon):  # choose according to the Q table
            action = self.predict(obs)
        else:
            action = np.random.choice(self.act_n)  # explore: pick a random action
        return action

    # Predict the greedy action for the given observation
    def predict(self, obs):
        Q_list = self.Q[obs, :]
        maxQ = np.max(Q_list)
        action_list = np.where(Q_list == maxQ)[0]  # several actions may share maxQ
        action = np.random.choice(action_list)
        return action

    # Learning step, i.e. how the Q table is updated
    def learn(self, obs, action, reward, next_obs, done):
        """ off-policy
            obs: observation before the interaction, s_t
            action: action chosen for this interaction, a_t
            reward: reward r received for this action
            next_obs: observation after the interaction, s_t+1
            done: whether the episode has ended
        """
        predict_Q = self.Q[obs, action]
        if done:
            target_Q = reward  # there is no next state
        else:
            target_Q = reward + self.gamma * np.max(self.Q[next_obs, :])  # Q-learning
        self.Q[obs, action] += self.lr * (target_Q - predict_Q)  # correct Q

    # Save the Q table to a file
    def save(self):
        npy_file = './q_table.npy'
        np.save(npy_file, self.Q)
        print(npy_file + ' saved.')

    # Load data from a file into the Q table
    def restore(self, npy_file='./q_table.npy'):
        self.Q = np.load(npy_file)
        print(npy_file + ' loaded.')
def run_episode(env, agent, render=False):
    total_steps = 0  # number of steps taken in this episode
    total_reward = 0

    obs = env.reset()  # reset the environment, i.e. start a new episode

    while True:
        action = agent.sample(obs)  # choose an action according to the algorithm
        next_obs, reward, done, _ = env.step(action)  # interact with the environment once
        # Train with Q-learning
        agent.learn(obs, action, reward, next_obs, done)

        obs = next_obs  # carry the new observation into the next step
        total_reward += reward
        total_steps += 1  # count steps
        if render:
            env.render()  # render a new frame
        if done:
            break
    return total_reward, total_steps


def test_episode(env, agent):
    total_reward = 0
    obs = env.reset()
    while True:
        action = agent.predict(obs)  # greedy
        next_obs, reward, done, _ = env.step(action)
        total_reward += reward
        obs = next_obs
        # time.sleep(0.5)
        # env.render()
        if done:
            break
    return total_reward
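On-policy: the agent learns with policy π and also uses π to interact with the environment and generate experience. Because π must also explore, it is not stable.
Off-policy: there is a target policy π and a behavior policy μ. The target policy is used to learn the optimal policy, while the more exploratory behavior policy interacts with the environment to generate experience trajectories.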
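The difference between the two update targets can be checked by hand on a single transition. The sketch below uses made-up Q values; the helper names `sarsa_target` and `q_learning_target` are introduced only for this illustration and mirror the `target_Q` lines of the two `learn()` methods in these notes.

```python
def sarsa_target(reward, q_next, next_action, done, gamma=0.9):
    """On-policy target: uses the action actually sampled for the next state."""
    if done:
        return reward
    return reward + gamma * q_next[next_action]


def q_learning_target(reward, q_next, done, gamma=0.9):
    """Off-policy target: assumes the greedy (max-Q) action in the next state."""
    if done:
        return reward
    return reward + gamma * max(q_next)


# Made-up Q values of the next state for two actions.
q_next = [1.0, 3.0]
reward = -1.0
next_action = 0  # suppose epsilon-greedy exploration picked the worse action

t_sarsa = sarsa_target(reward, q_next, next_action, done=False)  # about -0.1
t_qlearn = q_learning_target(reward, q_next, done=False)         # about 1.7
```

Because Sarsa's target includes the exploratory action, its value estimates account for the cost of exploration, whereas Q-learning evaluates the greedy policy regardless of what the behavior policy did.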
Lesson 3
DQN
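- DQN is essentially still a Q-learning algorithm and is updated in the same way. To explore the environment better, it is likewise trained with the ε-greedy method.
- On top of Q-learning, DQN introduces two techniques that make the iterative updates of the Q network more stable.
  - Experience Replay: mainly addresses sample correlation and sample efficiency. A replay buffer stores many experiences (s, a, r, s'), and a batch is drawn from it at random for training.
  - Fixed-Q-Target: mainly addresses training instability. A Target Q network with the same structure as the original Q network is copied and used to compute the Q target values.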
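As a framework-free illustration of the Fixed-Q-Target idea, the following sketch computes the training targets for a small batch. The function name `dqn_targets` and the numbers are assumptions made for this example; `next_q_values` stands in for the output of the target network on the next observations.

```python
def dqn_targets(rewards, next_q_values, dones, gamma=0.99):
    """Compute r + (1 - done) * gamma * max_a' Q_target(s', a') for each sample."""
    targets = []
    for r, q_next, done in zip(rewards, next_q_values, dones):
        best_next = max(q_next)  # greedy value according to the target network
        targets.append(r + (1.0 - float(done)) * gamma * best_next)
    return targets


rewards = [1.0, 1.0]
next_q_values = [[0.5, 2.0], [4.0, 3.0]]  # made-up target-network outputs
dones = [False, True]                     # the second transition ends its episode

targets = dqn_targets(rewards, next_q_values, dones, gamma=0.9)
```

The `(1 - done)` factor plays the role of the `if done` branch in the tabular agents: a terminal transition contributes only its immediate reward.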
import copy

import numpy as np
import paddle.fluid as fluid
import parl
from parl import layers


class Model(parl.Model):
    def __init__(self, act_dim):
        hid1_size = 128
        hid2_size = 128
        # Three fully connected layers
        self.fc1 = layers.fc(size=hid1_size, act='relu')
        self.fc2 = layers.fc(size=hid2_size, act='relu')
        self.fc3 = layers.fc(size=act_dim, act=None)

    def value(self, obs):
        # Define the network
        # Input: state; output: the Q value of every action, [Q(s,a1), Q(s,a2), Q(s,a3)...]
        h1 = self.fc1(obs)
        h2 = self.fc2(h1)
        Q = self.fc3(h2)
        return Q
class DQN(parl.Algorithm):
    def __init__(self, model, act_dim=None, gamma=None, lr=None):
        """ DQN algorithm

        Args:
            model (parl.Model): forward network that defines the Q function
            act_dim (int): dimension of the action space, i.e. the number of actions
            gamma (float): reward discount factor
            lr (float): learning rate
        """
        self.model = model
        self.target_model = copy.deepcopy(model)

        assert isinstance(act_dim, int)
        assert isinstance(gamma, float)
        assert isinstance(lr, float)
        self.act_dim = act_dim
        self.gamma = gamma
        self.lr = lr

    def predict(self, obs):
        """ Use the value network of self.model to get [Q(s,a1),Q(s,a2),...]
        """
        return self.model.value(obs)

    def learn(self, obs, action, reward, next_obs, terminal):
        """ Update the value network of self.model with the DQN algorithm
        """
        # Get max Q' from target_model, used to compute target_Q
        next_pred_value = self.target_model.value(next_obs)
        best_v = layers.reduce_max(next_pred_value, dim=1)
        best_v.stop_gradient = True  # block gradient propagation
        terminal = layers.cast(terminal, dtype='float32')
        target = reward + (1.0 - terminal) * self.gamma * best_v

        pred_value = self.model.value(obs)  # predicted Q values
        # Convert action to a one-hot vector, e.g. 3 => [0,0,0,1,0]
        action_onehot = layers.one_hot(action, self.act_dim)
        action_onehot = layers.cast(action_onehot, dtype='float32')
        # The next statement multiplies element-wise to pick out Q(s,a) of the taken action
        # e.g. pred_value = [[2.3, 5.7, 1.2, 3.9, 1.4]], action_onehot = [[0,0,0,1,0]]
        #  ==> pred_action_value = [[3.9]]
        pred_action_value = layers.reduce_sum(
            layers.elementwise_mul(action_onehot, pred_value), dim=1)

        # Mean squared error between Q(s,a) and target_Q gives the loss
        cost = layers.square_error_cost(pred_action_value, target)
        cost = layers.reduce_mean(cost)
        optimizer = fluid.optimizer.Adam(learning_rate=self.lr)  # Adam optimizer
        optimizer.minimize(cost)
        return cost

    def sync_target(self):
        """ Copy the parameters of self.model into self.target_model
        """
        self.model.sync_weights_to(self.target_model)
class Agent(parl.Agent):
    def __init__(self,
                 algorithm,
                 obs_dim,
                 act_dim,
                 e_greed=0.1,
                 e_greed_decrement=0):
        assert isinstance(obs_dim, int)
        assert isinstance(act_dim, int)
        self.obs_dim = obs_dim
        self.act_dim = act_dim
        super(Agent, self).__init__(algorithm)

        self.global_step = 0
        self.update_target_steps = 200  # copy model parameters to target_model every 200 training steps

        self.e_greed = e_greed  # probability of picking a random action (exploration)
        self.e_greed_decrement = e_greed_decrement  # explore less as training converges

    def build_program(self):
        self.pred_program = fluid.Program()
        self.learn_program = fluid.Program()

        with fluid.program_guard(self.pred_program):  # graph for predicting actions; define inputs and outputs
            obs = layers.data(
                name='obs', shape=[self.obs_dim], dtype='float32')
            self.value = self.alg.predict(obs)

        with fluid.program_guard(self.learn_program):  # graph for updating the Q network; define inputs and outputs
            obs = layers.data(
                name='obs', shape=[self.obs_dim], dtype='float32')
            action = layers.data(name='act', shape=[1], dtype='int32')
            reward = layers.data(name='reward', shape=[], dtype='float32')
            next_obs = layers.data(
                name='next_obs', shape=[self.obs_dim], dtype='float32')
            terminal = layers.data(name='terminal', shape=[], dtype='bool')
            self.cost = self.alg.learn(obs, action, reward, next_obs, terminal)

    def sample(self, obs):
        sample = np.random.rand()  # random number in [0, 1)
        if sample < self.e_greed:
            act = np.random.randint(self.act_dim)  # explore: every action may be chosen
        else:
            act = self.predict(obs)  # choose the best action
        self.e_greed = max(
            0.01, self.e_greed - self.e_greed_decrement)  # explore less as training converges
        return act

    def predict(self, obs):  # choose the best action
        obs = np.expand_dims(obs, axis=0)
        pred_Q = self.fluid_executor.run(
            self.pred_program,
            feed={'obs': obs.astype('float32')},
            fetch_list=[self.value])[0]
        pred_Q = np.squeeze(pred_Q, axis=0)
        act = np.argmax(pred_Q)  # index of the largest Q, i.e. the corresponding action
        return act

    def learn(self, obs, act, reward, next_obs, terminal):
        # Sync model and target_model parameters every 200 training steps
        if self.global_step % self.update_target_steps == 0:
            self.alg.sync_target()
        self.global_step += 1

        act = np.expand_dims(act, -1)
        feed = {
            'obs': obs.astype('float32'),
            'act': act.astype('int32'),
            'reward': reward,
            'next_obs': next_obs.astype('float32'),
            'terminal': terminal
        }
        cost = self.fluid_executor.run(
            self.learn_program, feed=feed, fetch_list=[self.cost])[0]  # train the network once
        return cost
import random
import collections
import numpy as np


class ReplayMemory(object):
    def __init__(self, max_size):
        self.buffer = collections.deque(maxlen=max_size)

    # Add one experience to the replay buffer
    def append(self, exp):
        self.buffer.append(exp)

    # Draw batch_size experiences from the replay buffer
    def sample(self, batch_size):
        mini_batch = random.sample(self.buffer, batch_size)
        obs_batch, action_batch, reward_batch, next_obs_batch, done_batch = [], [], [], [], []

        for experience in mini_batch:
            s, a, r, s_p, done = experience
            obs_batch.append(s)
            action_batch.append(a)
            reward_batch.append(r)
            next_obs_batch.append(s_p)
            done_batch.append(done)

        return np.array(obs_batch).astype('float32'), \
            np.array(action_batch).astype('float32'), np.array(reward_batch).astype('float32'),\
            np.array(next_obs_batch).astype('float32'), np.array(done_batch).astype('float32')

    def __len__(self):
        return len(self.buffer)
# Train for one episode
# MEMORY_WARMUP_SIZE, LEARN_FREQ and BATCH_SIZE are module-level
# hyperparameters that are not shown in these notes.
def run_episode(env, agent, rpm):
    total_reward = 0
    obs = env.reset()
    step = 0
    while True:
        step += 1
        action = agent.sample(obs)  # sample an action; every action may be tried
        next_obs, reward, done, _ = env.step(action)
        rpm.append((obs, action, reward, next_obs, done))

        # train model
        if (len(rpm) > MEMORY_WARMUP_SIZE) and (step % LEARN_FREQ == 0):
            (batch_obs, batch_action, batch_reward, batch_next_obs,
             batch_done) = rpm.sample(BATCH_SIZE)
            train_loss = agent.learn(batch_obs, batch_action, batch_reward,
                                     batch_next_obs,
                                     batch_done)  # s,a,r,s',done

        total_reward += reward
        obs = next_obs
        if done:
            break
    return total_reward


# Evaluate the agent: run 5 episodes and average the total reward
def evaluate(env, agent, render=False):
    eval_reward = []
    for i in range(5):
        obs = env.reset()
        episode_reward = 0
        while True:
            action = agent.predict(obs)  # predict the action; only the best action is chosen
            obs, reward, done, _ = env.step(action)
            episode_reward += reward
            if render:
                env.render()
            if done:
                break
        eval_reward.append(episode_reward)
    return np.mean(eval_reward)
Lesson 4
Policy Gradient
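Policy Gradient fits the policy function with a neural network; the policy gradient must be computed in order to optimize the policy network.
- The optimization objective is the expected return under the policy π(s,a): the sum over all trajectories of the return R weighted by the probability p of that trajectory occurring. When N is large enough, it can be approximated by sampling N episodes and averaging.
- Differentiating the objective with respect to the parameters θ gives the policy gradient:

$$\nabla_\theta \bar{R}_\theta \approx \frac{1}{N} \sum_{n=1}^{N} \sum_{t=1}^{T_n} R(\tau^n)\, \nabla_\theta \log \pi_\theta(a_t^n \mid s_t^n)$$

Here $\tau^n$ is the n-th sampled trajectory and $T_n$ is its length. The code in this lesson weights each step with the reward-to-go $G_t$ instead of the whole-trajectory return $R(\tau^n)$.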
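In practice the policy gradient is usually obtained by minimizing a surrogate loss, the mean of `-log π(a_t|s_t) * G_t`, and letting automatic differentiation do the rest. The sketch below only evaluates that loss in plain Python on made-up numbers; `pg_loss` is a hypothetical helper for illustration, not a PARL function.

```python
import math


def pg_loss(action_probs, actions, returns):
    """Mean over steps of -log pi(a_t | s_t) * G_t."""
    total = 0.0
    for probs, a, g in zip(action_probs, actions, returns):
        total += -math.log(probs[a]) * g  # weight the log-probability by the return
    return total / len(actions)


# Two steps of a made-up episode with two possible actions.
action_probs = [[0.5, 0.5], [0.8, 0.2]]  # policy outputs pi(.|s_t)
actions = [0, 1]                         # actions that were actually taken
returns = [2.0, 1.0]                     # reward-to-go G_t for each step

loss = pg_loss(action_probs, actions, returns)
```

Minimizing this quantity raises the probability of actions that were followed by high returns, which is what the gradient formula expresses.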
import numpy as np
import paddle.fluid as fluid
import parl
from parl import layers


class Model(parl.Model):
    def __init__(self, act_dim):
        hid1_size = act_dim * 10

        self.fc1 = layers.fc(size=hid1_size, act='tanh')
        self.fc2 = layers.fc(size=act_dim, act='softmax')

    def forward(self, obs):  # can be called directly: model = Model(5); model(obs)
        out = self.fc1(obs)
        out = self.fc2(out)
        return out


# from parl.algorithms import PolicyGradient  # the algorithm can also be imported from parl instead of rewriting it
class PolicyGradient(parl.Algorithm):
    def __init__(self, model, lr=None):
        """ Policy Gradient algorithm

        Args:
            model (parl.Model): forward network of the policy.
            lr (float): learning rate.
        """
        self.model = model
        assert isinstance(lr, float)
        self.lr = lr

    def predict(self, obs):
        """ Use the policy model to predict the action probabilities
        """
        return self.model(obs)

    def learn(self, obs, action, reward):
        """ Update the policy model with the policy gradient algorithm
        """
        act_prob = self.model(obs)  # output action probabilities
        # log_prob = layers.cross_entropy(act_prob, action)  # cross entropy
        log_prob = layers.reduce_sum(
            -1.0 * layers.log(act_prob) * layers.one_hot(
                action, act_prob.shape[1]),
            dim=1)
        cost = log_prob * reward
        cost = layers.reduce_mean(cost)

        optimizer = fluid.optimizer.Adam(self.lr)
        optimizer.minimize(cost)
        return cost
class Agent(parl.Agent):
    def __init__(self, algorithm, obs_dim, act_dim):
        self.obs_dim = obs_dim
        self.act_dim = act_dim
        super(Agent, self).__init__(algorithm)

    def build_program(self):
        self.pred_program = fluid.Program()
        self.learn_program = fluid.Program()

        with fluid.program_guard(self.pred_program):  # graph for predicting actions; define inputs and outputs
            obs = layers.data(
                name='obs', shape=[self.obs_dim], dtype='float32')
            self.act_prob = self.alg.predict(obs)

        with fluid.program_guard(
                self.learn_program):  # graph for updating the policy network; define inputs and outputs
            obs = layers.data(
                name='obs', shape=[self.obs_dim], dtype='float32')
            act = layers.data(name='act', shape=[1], dtype='int64')
            reward = layers.data(name='reward', shape=[], dtype='float32')
            self.cost = self.alg.learn(obs, act, reward)

    def sample(self, obs):
        obs = np.expand_dims(obs, axis=0)  # add a batch dimension
        act_prob = self.fluid_executor.run(
            self.pred_program,
            feed={'obs': obs.astype('float32')},
            fetch_list=[self.act_prob])[0]
        act_prob = np.squeeze(act_prob, axis=0)  # remove the batch dimension
        act = np.random.choice(range(self.act_dim), p=act_prob)  # sample an action from the probabilities
        return act

    def predict(self, obs):
        obs = np.expand_dims(obs, axis=0)
        act_prob = self.fluid_executor.run(
            self.pred_program,
            feed={'obs': obs.astype('float32')},
            fetch_list=[self.act_prob])[0]
        act_prob = np.squeeze(act_prob, axis=0)
        act = np.argmax(act_prob)  # choose the action with the highest probability
        return act

    def learn(self, obs, act, reward):
        act = np.expand_dims(act, axis=-1)
        feed = {
            'obs': obs.astype('float32'),
            'act': act.astype('int64'),
            'reward': reward.astype('float32')
        }
        cost = self.fluid_executor.run(
            self.learn_program, feed=feed, fetch_list=[self.cost])[0]
        return cost
def run_episode(env, agent):
    obs_list, action_list, reward_list = [], [], []
    obs = env.reset()
    while True:
        obs_list.append(obs)
        action = agent.sample(obs)  # sample an action
        action_list.append(action)

        obs, reward, done, info = env.step(action)
        reward_list.append(reward)

        if done:
            break
    return obs_list, action_list, reward_list


# Evaluate the agent: run 5 episodes and average the total reward
def evaluate(env, agent, render=False):
    eval_reward = []
    for i in range(5):
        obs = env.reset()
        episode_reward = 0
        while True:
            action = agent.predict(obs)  # choose the best action
            obs, reward, isOver, _ = env.step(action)
            episode_reward += reward
            if render:
                env.render()
            if isOver:
                break
        eval_reward.append(episode_reward)
    return np.mean(eval_reward)


# From the per-step reward list of one episode, compute G_t for every step
# Note: reward_list is modified in place.
def calc_reward_to_go(reward_list, gamma=1.0):
    for i in range(len(reward_list) - 2, -1, -1):
        # G_t = r_t + γ·r_t+1 + ... = r_t + γ·G_t+1
        reward_list[i] += gamma * reward_list[i + 1]  # Gt
    return np.array(reward_list)
import gym
from parl.utils import logger

LEARNING_RATE = 1e-3  # assumed value; the original notes do not define it

# Create the environment
env = gym.make('CartPole-v0')
obs_dim = env.observation_space.shape[0]
act_dim = env.action_space.n
logger.info('obs_dim {}, act_dim {}'.format(obs_dim, act_dim))

# Build the agent with the PARL framework
model = Model(act_dim=act_dim)
alg = PolicyGradient(model, lr=LEARNING_RATE)
agent = Agent(alg, obs_dim=obs_dim, act_dim=act_dim)

# Load a saved model
# if os.path.exists('./model.ckpt'):
#     agent.restore('./model.ckpt')
#     evaluate(env, agent, render=True)
#     exit()

for i in range(1000):
    obs_list, action_list, reward_list = run_episode(env, agent)
    if i % 10 == 0:
        logger.info("Episode {}, Reward Sum {}.".format(
            i, sum(reward_list)))

    batch_obs = np.array(obs_list)
    batch_action = np.array(action_list)
    batch_reward = calc_reward_to_go(reward_list)

    agent.learn(batch_obs, batch_action, batch_reward)
    if (i + 1) % 100 == 0:
        total_reward = evaluate(env, agent, render=False)  # render=True shows the rendering; it must run locally, AIStudio cannot display it
        logger.info('Test reward: {}'.format(total_reward))

# Save the model to ./model.ckpt
agent.save('./model.ckpt')
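The reward-to-go recursion G_t = r_t + γ·G_t+1 used by `calc_reward_to_go` is easy to verify on a three-step episode. The sketch below is a pure-Python variant written for this illustration (the name `reward_to_go` is hypothetical); unlike the helper in these notes it copies the list first, so the caller's rewards are left untouched.

```python
def reward_to_go(rewards, gamma=1.0):
    """Return [G_0, G_1, ...] with G_t = r_t + gamma * G_{t+1}."""
    returns = list(rewards)  # copy so the input list is not modified
    for t in range(len(returns) - 2, -1, -1):
        returns[t] += gamma * returns[t + 1]
    return returns


rewards = [1.0, 1.0, 1.0]
undiscounted = reward_to_go(rewards)           # [3.0, 2.0, 1.0]
discounted = reward_to_go(rewards, gamma=0.5)  # [1.75, 1.5, 1.0]
```

Working backwards from the last step is what makes the computation linear in the episode length: each G_t reuses the already computed G_t+1.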
Lesson 5
DDPG(Deep Deterministic Policy Gradient)
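DDPG was motivated by the wish to extend DQN to continuous action spaces. It borrows two techniques from DQN: experience replay and the fixed Q target network. DDPG uses a policy network to output a deterministic action directly, and it adopts an Actor-Critic architecture that updates at every step.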
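Because the actor outputs a deterministic action, exploration has to be injected from outside; the training loop in this lesson does so by adding Gaussian noise and clipping the result to [-1.0, 1.0]. A minimal pure-Python sketch of that step follows. `noisy_action` is a hypothetical helper name, and the action values are made up.

```python
import random


def noisy_action(action, noise_std, rng, low=-1.0, high=1.0):
    """Add Gaussian noise to each action component, then clip to [low, high]."""
    return [min(high, max(low, rng.gauss(a, noise_std))) for a in action]


rng = random.Random(0)      # seeded only to make the sketch reproducible
deterministic = [0.98, -0.3]  # made-up actor output for a 2-dimensional action

# Five exploratory variants of the same deterministic action.
explored = [noisy_action(deterministic, 0.05, rng) for _ in range(5)]
```

Note that the second argument of a Gaussian sampler such as `random.gauss` or `np.random.normal` is the standard deviation, not the variance.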
The target network parameters are soft-updated, which keeps the target network stable and prevents it from changing too much at once.
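A soft update moves each target parameter a small fraction τ of the way toward the corresponding online parameter. The sketch below illustrates the rule on plain dictionaries of floats; `soft_update` and the parameter names are assumptions for this example, not PARL's API (in the code of this lesson the same effect is requested through `sync_weights_to` with `decay = 1 - tau`).

```python
def soft_update(target_params, source_params, tau):
    """target <- tau * source + (1 - tau) * target, parameter by parameter."""
    return {
        name: tau * source_params[name] + (1.0 - tau) * target_params[name]
        for name in target_params
    }


source = {"w": 1.0, "b": -2.0}  # made-up online network parameters
target = {"w": 0.0, "b": 0.0}   # made-up target network parameters

# With tau = 0.001 the target closes only part of the gap even after 1000 updates.
for _ in range(1000):
    target = soft_update(target, source, tau=0.001)

# tau = 1.0 degenerates into a hard copy of the online parameters.
hard_copy = soft_update({"w": 0.0, "b": 0.0}, source, tau=1.0)
```

After k updates the remaining gap is (1 - τ)^k of the initial one, which is why a small τ makes the target network lag smoothly behind the online network.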
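ACTOR_LR = 1e-3  # learning rate of the Actor network
CRITIC_LR = 1e-3  # learning rate of the Critic network
GAMMA = 0.99  # reward discount factor
TAU = 0.001  # soft-update coefficient
MEMORY_SIZE = int(1e6)  # replay buffer size
MEMORY_WARMUP_SIZE = MEMORY_SIZE // 20  # collect some experience before training starts
BATCH_SIZE = 128
REWARD_SCALE = 0.1  # reward scaling factor
NOISE = 0.05  # standard deviation of the action noise (the scale passed to np.random.normal)
TRAIN_EPISODE = 6000  # total number of training episodes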
from copy import deepcopy

import numpy as np
import paddle.fluid as fluid
import parl
from parl import layers


class Model(parl.Model):
    def __init__(self, act_dim):
        self.actor_model = ActorModel(act_dim)
        self.critic_model = CriticModel()

    def policy(self, obs):
        return self.actor_model.policy(obs)

    def value(self, obs, act):
        return self.critic_model.value(obs, act)

    def get_actor_params(self):
        return self.actor_model.parameters()


class ActorModel(parl.Model):
    def __init__(self, act_dim):
        hid_size = 100

        self.fc1 = layers.fc(size=hid_size, act='relu')
        self.fc2 = layers.fc(size=act_dim, act='tanh')

    def policy(self, obs):
        hid = self.fc1(obs)
        means = self.fc2(hid)
        return means


class CriticModel(parl.Model):
    def __init__(self):
        hid_size = 100

        self.fc1 = layers.fc(size=hid_size, act='relu')
        self.fc2 = layers.fc(size=1, act=None)

    def value(self, obs, act):
        concat = layers.concat([obs, act], axis=1)
        hid = self.fc1(concat)
        Q = self.fc2(hid)
        Q = layers.squeeze(Q, axes=[1])
        return Q
class DDPG(parl.Algorithm):
    def __init__(self,
                 model,
                 gamma=None,
                 tau=None,
                 actor_lr=None,
                 critic_lr=None):
        """ DDPG algorithm

        Args:
            model (parl.Model): forward networks of the actor and the critic.
                                model must implement get_actor_params().
            gamma (float): reward discount factor.
            tau (float): soft-update coefficient used when syncing self.target_model with self.model
            actor_lr (float): learning rate of the actor
            critic_lr (float): learning rate of the critic
        """
        assert isinstance(gamma, float)
        assert isinstance(tau, float)
        assert isinstance(actor_lr, float)
        assert isinstance(critic_lr, float)
        self.gamma = gamma
        self.tau = tau
        self.actor_lr = actor_lr
        self.critic_lr = critic_lr

        self.model = model
        self.target_model = deepcopy(model)

    def predict(self, obs):
        """ Use the actor model of self.model to predict the action
        """
        return self.model.policy(obs)

    def learn(self, obs, action, reward, next_obs, terminal):
        """ Update the actor and the critic with the DDPG algorithm
        """
        actor_cost = self._actor_learn(obs)
        critic_cost = self._critic_learn(obs, action, reward, next_obs,
                                         terminal)
        return actor_cost, critic_cost

    def _actor_learn(self, obs):
        action = self.model.policy(obs)
        Q = self.model.value(obs, action)
        cost = layers.reduce_mean(-1.0 * Q)
        optimizer = fluid.optimizer.AdamOptimizer(self.actor_lr)
        optimizer.minimize(cost, parameter_list=self.model.get_actor_params())
        return cost

    def _critic_learn(self, obs, action, reward, next_obs, terminal):
        next_action = self.target_model.policy(next_obs)
        next_Q = self.target_model.value(next_obs, next_action)

        terminal = layers.cast(terminal, dtype='float32')
        target_Q = reward + (1.0 - terminal) * self.gamma * next_Q
        target_Q.stop_gradient = True

        Q = self.model.value(obs, action)
        cost = layers.square_error_cost(Q, target_Q)
        cost = layers.reduce_mean(cost)
        optimizer = fluid.optimizer.AdamOptimizer(self.critic_lr)
        optimizer.minimize(cost)
        return cost

    def sync_target(self, decay=None, share_vars_parallel_executor=None):
        """ Copy parameters from self.model to self.target_model; a soft-update coefficient can be set
        """
        if decay is None:
            decay = 1.0 - self.tau
        self.model.sync_weights_to(
            self.target_model,
            decay=decay,
            share_vars_parallel_executor=share_vars_parallel_executor)
class Agent(parl.Agent):
    def __init__(self, algorithm, obs_dim, act_dim):
        assert isinstance(obs_dim, int)
        assert isinstance(act_dim, int)
        self.obs_dim = obs_dim
        self.act_dim = act_dim
        super(Agent, self).__init__(algorithm)

        # Note: sync the parameters of self.model and self.target_model at the very start.
        self.alg.sync_target(decay=0)

    def build_program(self):
        self.pred_program = fluid.Program()
        self.learn_program = fluid.Program()

        with fluid.program_guard(self.pred_program):
            obs = layers.data(
                name='obs', shape=[self.obs_dim], dtype='float32')
            self.pred_act = self.alg.predict(obs)

        with fluid.program_guard(self.learn_program):
            obs = layers.data(
                name='obs', shape=[self.obs_dim], dtype='float32')
            act = layers.data(
                name='act', shape=[self.act_dim], dtype='float32')
            reward = layers.data(name='reward', shape=[], dtype='float32')
            next_obs = layers.data(
                name='next_obs', shape=[self.obs_dim], dtype='float32')
            terminal = layers.data(name='terminal', shape=[], dtype='bool')
            _, self.critic_cost = self.alg.learn(obs, act, reward, next_obs,
                                                 terminal)

    def predict(self, obs):
        obs = np.expand_dims(obs, axis=0)
        act = self.fluid_executor.run(
            self.pred_program, feed={'obs': obs},
            fetch_list=[self.pred_act])[0]
        act = np.squeeze(act)
        return act

    def learn(self, obs, act, reward, next_obs, terminal):
        feed = {
            'obs': obs,
            'act': act,
            'reward': reward,
            'next_obs': next_obs,
            'terminal': terminal
        }
        critic_cost = self.fluid_executor.run(
            self.learn_program, feed=feed, fetch_list=[self.critic_cost])[0]
        self.alg.sync_target()  # soft-update the target network after every learning step
        return critic_cost
def run_episode(agent, env, rpm):
    obs = env.reset()
    total_reward = 0
    steps = 0
    while True:
        steps += 1
        batch_obs = np.expand_dims(obs, axis=0)
        action = agent.predict(batch_obs.astype('float32'))

        # Add exploration noise and keep the output within [-1.0, 1.0]
        action = np.clip(np.random.normal(action, NOISE), -1.0, 1.0)

        next_obs, reward, done, info = env.step(action)

        action = [action]  # wrap it so it is easy to store in the replay memory
        rpm.append((obs, action, REWARD_SCALE * reward, next_obs, done))

        if len(rpm) > MEMORY_WARMUP_SIZE and (steps % 5) == 0:
            (batch_obs, batch_action, batch_reward, batch_next_obs,
             batch_done) = rpm.sample(BATCH_SIZE)
            agent.learn(batch_obs, batch_action, batch_reward, batch_next_obs,
                        batch_done)

        obs = next_obs
        total_reward += reward

        if done or steps >= 200:
            break
    return total_reward
import random
import collections
import numpy as np


class ReplayMemory(object):
    def __init__(self, max_size):
        self.buffer = collections.deque(maxlen=max_size)

    def append(self, exp):
        self.buffer.append(exp)

    def sample(self, batch_size):
        mini_batch = random.sample(self.buffer, batch_size)
        obs_batch, action_batch, reward_batch, next_obs_batch, done_batch = [], [], [], [], []

        for experience in mini_batch:
            s, a, r, s_p, done = experience
            obs_batch.append(s)
            action_batch.append(a)
            reward_batch.append(r)
            next_obs_batch.append(s_p)
            done_batch.append(done)

        return np.array(obs_batch).astype('float32'), \
            np.array(action_batch).astype('float32'), np.array(reward_batch).astype('float32'),\
            np.array(next_obs_batch).astype('float32'), np.array(done_batch).astype('float32')

    def __len__(self):
        return len(self.buffer)
def evaluate(env, agent, render=False):
    eval_reward = []
    for i in range(5):
        obs = env.reset()
        total_reward = 0
        steps = 0
        while True:
            batch_obs = np.expand_dims(obs, axis=0)
            action = agent.predict(batch_obs.astype('float32'))
            action = np.clip(action, -1.0, 1.0)

            steps += 1
            next_obs, reward, done, info = env.step(action)

            obs = next_obs
            total_reward += reward

            if render:
                env.render()
            if done or steps >= 200:
                break
        eval_reward.append(total_reward)
    return np.mean(eval_reward)
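Summary: to understand the details of reinforcement learning algorithms in depth, you need to combine theory with code, and above all you need to practice. My impression is that model design and hyperparameter tuning in reinforcement learning are still something of a black art. Unlike supervised learning, where the learning curve lets you anticipate whether a model will converge, reinforcement learning gives fewer such signals, and training also seems to take longer. Using experience and intuition to design a good model and pick suitable hyperparameters makes things go much more smoothly. (In the hands-on assignments I designed a rather poor model and chose poor hyperparameters, so training ran for a long time without converging to the desired value; I had to start over from scratch, which cost a considerable amount of time.)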