Reinforcement Learning -- PyTorch -- DQN Extensions and the Policy Gradient Network Structure

DQN Improvements

The DQN algorithm suffers from an overestimation problem, which the Double DQN method compensates for. The two methods differ only in the formula below, i.e. the value assigned to q_target; everything else is identical.
DQN:
$q_{\text{target}} = r_t + \gamma \max_{a'} Q(s_{t+1}, a'; \theta^-)$
Double DQN:
$q_{\text{target}} = r_t + \gamma\, Q\big(s_{t+1}, \arg\max_{a'} Q(s_{t+1}, a'; \theta);\ \theta^-\big)$
Here $\theta$ denotes the parameters of the online (eval) network and $\theta^-$ those of the target network: DQN lets the target network both select and evaluate the next action, while Double DQN selects the action with the online network and evaluates it with the target network.
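To make the difference concrete, below is a minimal PyTorch sketch of the two target computations. The names q_eval (online network) and q_next (target network), the batch layout, and gamma = 0.9 are assumptions for illustration, not code from this post.

import torch

def dqn_target(r, s_next, q_next, gamma=0.9):
    # DQN: the target network both selects and evaluates the next action.
    with torch.no_grad():
        return r + gamma * q_next(s_next).max(dim=1)[0]

def double_dqn_target(r, s_next, q_eval, q_next, gamma=0.9):
    # Double DQN: the online network selects the next action,
    # the target network evaluates it.
    with torch.no_grad():
        a_next = q_eval(s_next).argmax(dim=1, keepdim=True)
        return r + gamma * q_next(s_next).gather(1, a_next).squeeze(1)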

Policy Gradient

Policy gradient is a policy-based reinforcement learning method: it stores the s, a, r values of each episode and uses them to compute the gradient. The update rule is
$\theta \leftarrow \theta + \alpha \nabla_{\theta} \log \pi_{\theta}(s_t, a_t)\, v_t$
Here, $\pi_{\theta}$ is the probability of selecting the corresponding action, and $v_{t}$ is the reward $r$ at time $t$ plus the discounted future rewards. A PyTorch program based on policy gradient follows:

import torch
import gym
import torch.nn as nn
import torch.nn.functional as F
import numpy as np
import torch.optim as optim
from torch.distributions import Categorical
from itertools import count

# Uses the classic gym API (env.reset() returns the observation and
# env.step() returns a 4-tuple), i.e. gym < 0.26.
env = gym.make('CartPole-v1')


class PG_network(nn.Module):
    """Policy network: maps a CartPole state (4 values) to action probabilities (2 actions)."""
    def __init__(self):
        super(PG_network, self).__init__()
        self.linear1 = nn.Linear(4, 128)
        self.dropout = nn.Dropout(p=0.6)
        self.linear2 = nn.Linear(128, 2)

    def forward(self, x):
        x = self.linear1(x)
        x = self.dropout(x)
        x = F.relu(x)
        action_scores = self.linear2(x)
        return F.softmax(action_scores, dim=1)

policyG_object = PG_network()
optimizer = optim.Adam(policyG_object.parameters(), lr=1e-2)
possibility_store = []   # log-probabilities of the actions taken in this episode
r_store = []             # rewards collected in this episode

def choose_action(s):
    # Sample an action from the current policy and store its log-probability.
    s = torch.from_numpy(s).float().unsqueeze(0)
    probs = policyG_object(s)
    m = Categorical(probs)
    action = m.sample()
    possibility_store.append(m.log_prob(action))
    return action.item()


reward_delay = 0.9                      # discount factor for future rewards
eps = np.finfo(np.float64).eps.item()   # small constant to avoid division by zero

def policy_gradient_learn():
    # Compute the discounted return v_t for every step of the episode.
    R = 0
    R_store = []
    delta_store = []
    for r in r_store[::-1]:
        R = r + reward_delay * R
        R_store.insert(0, R)
    R_store = torch.tensor(R_store)
    # Normalize the returns to reduce variance.
    R_store = (R_store - R_store.mean()) / (R_store.std() + eps)

    # REINFORCE loss: -log pi(a_t|s_t) * v_t, summed over the episode.
    for p, v in zip(possibility_store, R_store):
        delta_store.append(-p * v)
    optimizer.zero_grad()
    delta_store = torch.cat(delta_store).sum()
    delta_store.backward()
    optimizer.step()

    # Clear the episode buffers.
    del possibility_store[:]
    del r_store[:]

def main():
    running_reward = 10
    for i_episode in count(1):
        s, ep_reward = env.reset(), 0
        for t in range(1, 10000):
            # env.render()  # uncomment to visualize
            a = choose_action(s)
            s, r, done, info = env.step(a)
            r_store.append(r)
            ep_reward += r
            if done:
                break

        # Exponential moving average of the episode reward.
        running_reward = 0.05 * ep_reward + (1 - 0.05) * running_reward
        policy_gradient_learn()
        if i_episode % 10 == 0:
            print('Episode {}\tLast reward: {:.2f}\tAverage reward: {:.2f}'.format(
                    i_episode, ep_reward, running_reward))
        if running_reward > env.spec.reward_threshold:
            print("Solved! Running reward is now {} and "
                    "the last episode runs to {} time steps!".format(running_reward, t))
            break

if __name__ == '__main__':
    main()

Actor-Critic

Q-learning cannot be used in continuous action spaces, whereas policy gradient can. However, policy gradient updates only once per episode, which greatly reduces learning efficiency (as an aside, policy gradient is an on-policy method, which makes rather wasteful use of sample data, and its gradient estimates have high variance). Combining the two yields the Actor-Critic architecture: the Actor learns with a policy-gradient rule, while the Critic uses a value-based, single-step update (the TD-error), which in turn is used to update the policy-gradient network.
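A minimal single-step Actor-Critic sketch is given below. The network sizes, optimizer settings, and the helper name ac_learn are illustrative assumptions (they are not part of this post); the point is that the critic's TD-error replaces the Monte-Carlo return v_t used in the policy-gradient code above.

# Assumed shapes: s and s_next are float tensors of shape (1, n_states),
# a_log_prob is the log-probability of the sampled action (from the actor).
import torch
import torch.nn as nn
import torch.optim as optim

class Actor(nn.Module):
    def __init__(self, n_states=4, n_actions=2):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(n_states, 128), nn.ReLU(),
                                 nn.Linear(128, n_actions), nn.Softmax(dim=1))
    def forward(self, s):
        return self.net(s)

class Critic(nn.Module):
    def __init__(self, n_states=4):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(n_states, 128), nn.ReLU(),
                                 nn.Linear(128, 1))
    def forward(self, s):
        return self.net(s)

actor, critic = Actor(), Critic()
actor_opt = optim.Adam(actor.parameters(), lr=1e-3)
critic_opt = optim.Adam(critic.parameters(), lr=1e-2)
gamma = 0.9

def ac_learn(s, a_log_prob, r, s_next, done):
    """One-step (TD) update: the critic's TD-error drives both networks."""
    v, v_next = critic(s), critic(s_next)
    td_target = r + gamma * v_next.detach() * (1 - int(done))
    td_error = td_target - v
    # Critic: regress V(s) toward the TD target.
    critic_loss = td_error.pow(2).mean()
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()
    # Actor: policy-gradient step weighted by the TD-error.
    actor_loss = -(a_log_prob * td_error.detach()).mean()
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()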
However, Actor-Critic can have convergence problems, so the DDPG approach was proposed on top of it. DDPG (deep deterministic policy gradient) combines DQN and Actor-Critic.
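For completeness, here is a rough sketch of a single DDPG update step; all names (ddpg_update, the target networks, the replay-buffer batch layout) are assumptions for illustration only. It shows where the DQN ingredients (replay batch, target networks) and the Actor-Critic ingredients (separate actor and critic) meet; the actor outputs a deterministic continuous action and the critic scores state-action pairs.

import torch
import torch.nn.functional as F

def ddpg_update(batch, actor, critic, actor_target, critic_target,
                actor_opt, critic_opt, gamma=0.99, tau=0.005):
    s, a, r, s_next, done = batch  # tensors sampled from a replay buffer

    # Critic: one-step TD target computed with the *target* networks (as in DQN).
    with torch.no_grad():
        q_next = critic_target(s_next, actor_target(s_next))
        q_target = r + gamma * (1 - done) * q_next
    critic_loss = F.mse_loss(critic(s, a), q_target)
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

    # Actor: deterministic policy gradient -- push actions toward higher Q.
    actor_loss = -critic(s, actor(s)).mean()
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()

    # Soft (Polyak) update of the target networks.
    for net, target in ((actor, actor_target), (critic, critic_target)):
        for p, p_t in zip(net.parameters(), target.parameters()):
            p_t.data.mul_(1 - tau).add_(tau * p.data)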
