DQN Improvements
The DQN algorithm suffers from overestimation, which the Double DQN method compensates for. The two methods differ only in how the target value q_target is computed; everything else is identical:

DQN: y = r + γ · max_a Q(s', a; θ⁻)

Double DQN: y = r + γ · Q(s', argmax_a Q(s', a; θ); θ⁻)

Here θ denotes the parameters of the online (eval) network and θ⁻ those of the target network: Double DQN selects the next action with the online network but evaluates it with the target network.
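The difference between the two targets can be sketched in PyTorch. This is a minimal illustration, not code from the original; the helper names (`dqn_target`, `double_dqn_target`) and the two-network split are assumptions:

```python
import torch
import torch.nn as nn

def dqn_target(q_target, s_next, r, gamma):
    # Vanilla DQN: the target network both selects and evaluates
    # the next action, which causes overestimation.
    q_next = q_target(s_next).detach()
    return r + gamma * q_next.max(dim=1)[0]

def double_dqn_target(q_eval, q_target, s_next, r, gamma):
    # Double DQN: the online (eval) network selects the action,
    # the target network evaluates it.
    a_next = q_eval(s_next).detach().argmax(dim=1, keepdim=True)
    q_next = q_target(s_next).detach().gather(1, a_next).squeeze(1)
    return r + gamma * q_next
```

Because the target network's value at the eval-selected action can never exceed its own maximum, the Double DQN target is always less than or equal to the DQN target, which is exactly the overestimation correction.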
Policy Gradient
Policy gradient is policy-based reinforcement learning. The method stores the (s, a, r) values of each episode and uses them to compute the gradient:

∇θ J(θ) = E[ ∇θ log π_θ(a_t | s_t) · G_t ]

Here π_θ(a_t | s_t) is the probability of choosing the corresponding action, and G_t = Σ_k γ^k · r_{t+k} is the reward at step t plus the discounted future rewards. A PyTorch program based on policy gradient follows:
import gym
import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch.distributions import Categorical
from itertools import count

env = gym.make('CartPole-v1')

class PG_network(nn.Module):
    def __init__(self):
        super(PG_network, self).__init__()
        self.linear1 = nn.Linear(4, 128)
        self.dropout = nn.Dropout(p=0.6)
        self.linear2 = nn.Linear(128, 2)

    def forward(self, x):
        x = self.linear1(x)
        x = self.dropout(x)
        x = F.relu(x)
        action_scores = self.linear2(x)
        return F.softmax(action_scores, dim=1)

policyG_object = PG_network()
optimizer = optim.Adam(policyG_object.parameters(), lr=1e-2)
possibility_store = []  # log-probabilities of the actions taken this episode
r_store = []            # rewards collected this episode

def choose_action(s):
    s = torch.from_numpy(s).float().unsqueeze(0)
    probs = policyG_object(s)
    m = Categorical(probs)
    action = m.sample()
    possibility_store.append(m.log_prob(action))
    return action.item()

reward_decay = 0.9  # discount factor gamma
eps = np.finfo(np.float64).eps.item()

def policy_gradient_learn():
    # Compute the discounted return G_t for every step, back to front.
    R = 0
    R_store = []
    for r in r_store[::-1]:
        R = r + reward_decay * R
        R_store.insert(0, R)
    R_store = torch.tensor(R_store)
    # Normalize the returns to reduce the variance of the gradient estimate.
    R_store = (R_store - R_store.mean()) / (R_store.std() + eps)
    # Loss is -log pi(a_t|s_t) * G_t, summed over the episode.
    delta_store = [-p * v for p, v in zip(possibility_store, R_store)]
    optimizer.zero_grad()
    loss = torch.cat(delta_store).sum()
    loss.backward()
    optimizer.step()
    del possibility_store[:]
    del r_store[:]

def main():
    running_reward = 10
    for i_episode in count(1):
        s, ep_reward = env.reset(), 0
        for t in range(1, 10000):
            a = choose_action(s)
            s, r, done, info = env.step(a)
            r_store.append(r)
            ep_reward += r
            if done:
                break
        running_reward = 0.05 * ep_reward + (1 - 0.05) * running_reward
        policy_gradient_learn()
        if i_episode % 10 == 0:
            print('Episode {}\tLast reward: {:.2f}\tAverage reward: {:.2f}'.format(
                i_episode, ep_reward, running_reward))
        if running_reward > env.spec.reward_threshold:
            print("Solved! Running reward is now {} and "
                  "the last episode runs to {} time steps!".format(running_reward, t))
            break

if __name__ == '__main__':
    main()
Actor-Critic
Q-learning cannot be used in continuous action spaces, whereas policy gradient can. However, policy gradient updates only once per episode, which greatly lowers learning efficiency (as an aside, policy gradient is an on-policy method, which wastes sample data, and its gradient estimates also have high variance). Actor-Critic combines the two: the Actor learns with the policy-gradient strategy, while the Critic uses a value-based, single-step update (the TD error), which in turn drives the update of the policy network.
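The single-step coupling between the two networks can be sketched as a pair of losses. This is a minimal sketch, not code from the original; the helper name `actor_critic_losses` and its signature are assumptions:

```python
import torch

def actor_critic_losses(log_prob, value_s, value_s_next, r, gamma, done):
    # Critic: one-step TD error, delta = r + gamma * V(s') - V(s).
    td_target = r + gamma * value_s_next * (1.0 - done)
    td_error = td_target - value_s
    critic_loss = td_error.pow(2)
    # Actor: policy-gradient loss weighted by the TD error, detached
    # so that the actor update does not flow gradients into the critic.
    actor_loss = -log_prob * td_error.detach()
    return actor_loss, critic_loss
```

Compared with the episodic REINFORCE update above, this can be applied after every environment step, which is where the efficiency gain comes from.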
However, Actor-Critic suffers from convergence problems, so the DDPG strategy was proposed on top of it. DDPG (deep deterministic policy gradient) combines ideas from DQN and Actor-Critic, in particular DQN's target networks and experience replay.
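One ingredient DDPG takes from DQN is the target network, usually kept close to the online network by Polyak averaging rather than by hard copies. A minimal sketch of that soft update (the function name `soft_update` and the value of `tau` are illustrative assumptions):

```python
import torch
import torch.nn as nn

def soft_update(target_net, online_net, tau=0.005):
    # Polyak averaging: target <- (1 - tau) * target + tau * online.
    # Keeping tau small makes the target network change slowly,
    # which stabilizes the bootstrapped critic targets.
    with torch.no_grad():
        for tp, op in zip(target_net.parameters(), online_net.parameters()):
            tp.mul_(1.0 - tau).add_(tau * op)
```

In DDPG this update is applied to both the actor's and the critic's target networks after every learning step.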