Reinforcement Learning -- PyTorch -- DQN Extensions and the Policy Gradient Network Structure

DQN Improvements

The DQN algorithm tends to overestimate Q-values, and Double DQN can be used to compensate for this. The two methods differ only in the formula below, which gives the q_target value; everything else is identical.

DQN (the target network both selects and evaluates the next action):

$$q_{target} = r + \gamma \max_{a'} Q(s', a'; \theta^{-})$$

Double DQN (the evaluation network selects the action, the target network evaluates it):

$$q_{target} = r + \gamma \, Q\big(s', \arg\max_{a'} Q(s', a'; \theta); \theta^{-}\big)$$

Here $\theta$ are the parameters of the evaluation network and $\theta^{-}$ those of the target network.
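
As a minimal sketch of the difference, assuming two Q-networks `eval_net` and `target_net` with the same architecture and batched tensors `s_` (next states), `r` (rewards) and a scalar `gamma` (none of these names come from the program below), the two targets could be computed like this:

import torch

# illustrative sketch only: eval_net / target_net, s_, r and gamma are assumed to exist
with torch.no_grad():
    q_next_target = target_net(s_)                       # Q(s', .) from the target network

    # DQN: the target network both selects and evaluates the next action
    q_target_dqn = r + gamma * q_next_target.max(dim=1)[0]

    # Double DQN: the evaluation network selects the action, the target network
    # evaluates it, which reduces the overestimation bias
    a_star = eval_net(s_).argmax(dim=1, keepdim=True)
    q_target_double = r + gamma * q_next_target.gather(1, a_star).squeeze(1)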

Policy Gradient

Policy gradient is policy-based reinforcement learning: the method stores the s, a, r values of each episode and uses them to compute the gradient.

$$\theta \leftarrow \theta + \alpha \nabla_{\theta} \log \pi_{\theta}(s_t, a_t) \, v_t$$

Here $\pi_{\theta}$ is the probability of choosing the corresponding action, and $v_t$ is the reward $r$ at time $t$ plus the discounted future rewards. A PyTorch program based on policy gradient is shown below:

import torch
import gym
import torch.nn as nn
import torch.nn.functional as F
import numpy as np
import torch.optim as optim
import math
from torch.distributions import Categorical
import matplotlib.pyplot as plt
from itertools import count
env = gym.make('CartPole-v1')


class PG_network(nn.Module):
    def __init__(self):
        super(PG_network, self).__init__()
        self.linear1 = nn.Linear(4, 128)    # CartPole-v1 observations have 4 dimensions
        self.dropout = nn.Dropout(p=0.6)
        self.linear2 = nn.Linear(128, 2)    # two discrete actions: push cart left / right

    def forward(self, x):
        x = self.linear1(x)
        x = self.dropout(x)
        x = F.relu(x)
        action_scores = self.linear2(x)
        # return a probability distribution over the two actions
        return F.softmax(action_scores, dim=1)


policyG_object = PG_network()
optimizer = optim.Adam(policyG_object.parameters(), lr=1e-2)
possibility_store = []   # log-probabilities of the actions taken in the current episode
r_store = []             # rewards collected in the current episode

def choose_action(s):
    s = torch.from_numpy(s).float().unsqueeze(0)
    probs = policyG_object(s)
    m = Categorical(probs)                         # categorical distribution over the action probabilities
    action = m.sample()
    possibility_store.append(m.log_prob(action))   # keep log pi_theta(a|s) for the update
    return action.item()


reward_delay = 0.9                       # discount factor for the returns
eps = np.finfo(np.float64).eps.item()    # small constant to avoid division by zero

def policy_gradient_learn():
    R = 0
    R_store = []
    delta_store = []
    # compute the discounted return v_t for every step, working backwards from the episode end
    for r in r_store[::-1]:
        R = r + reward_delay * R
        R_store.insert(0, R)
    R_store = torch.tensor(R_store)
    # normalise the returns to reduce the variance of the gradient estimate
    R_store = (R_store - R_store.mean()) / (R_store.std() + eps)

    # loss = -sum_t log pi_theta(a_t|s_t) * v_t
    for p, v in zip(possibility_store, R_store):
        delta_store.append(-p * v)
    optimizer.zero_grad()
    delta_store = torch.cat(delta_store).sum()
    delta_store.backward()
    optimizer.step()

    # clear the episode buffers for the next episode
    del possibility_store[:]
    del r_store[:]

def main():
    running_reward = 10
    for i_episode in count(1):
        # note: this uses the pre-0.26 gym API, where reset() returns the
        # observation and step() returns four values
        s, ep_reward = env.reset(), 0
        for t in range(1, 10000):
            # env.render()
            a = choose_action(s)
            s, r, done, info = env.step(a)
            r_store.append(r)
            ep_reward += r
            if done:
                break

        # exponential moving average of the episode reward
        running_reward = 0.05 * ep_reward + (1 - 0.05) * running_reward
        policy_gradient_learn()
        if i_episode % 10 == 0:
            print('Episode {}\tLast reward: {:.2f}\tAverage reward: {:.2f}'.format(
                    i_episode, ep_reward, running_reward))
        if running_reward > env.spec.reward_threshold:
            print("Solved! Running reward is now {} and "
                    "the last episode runs to {} time steps!".format(running_reward, t))
            break

if __name__ == '__main__':
    main()

Actor-Critic

Q-learning cannot be used in continuous action spaces, whereas policy gradient can. However, policy gradient only updates at the end of each episode, which greatly reduces learning efficiency (as an aside, policy gradient is an on-policy method, which is rather wasteful with sample data, and its gradient estimates have high variance). Combining the two gives the Actor-Critic architecture: the Actor learns with the policy-gradient approach, while the Critic uses a value-based, single-step update (the TD error), which in turn is used to update the policy-gradient (Actor) network; a minimal sketch of this update follows the figure below.
(Figure: Actor-Critic architecture)
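
As a rough sketch of that interaction, assuming a `critic` network that outputs V(s), its optimizer `critic_opt`, an actor optimizer `actor_opt`, and the log-probability `log_prob` produced by the actor exactly as in `choose_action` above (all of these names are illustrative, not from the program above):

import torch

# single-step Actor-Critic update sketch; critic, critic_opt, actor_opt, gamma and the
# transition (s, r, s_, done) plus the actor's log_prob are assumed to already exist
def actor_critic_update(critic, critic_opt, actor_opt, log_prob, s, r, s_, gamma, done):
    v_s = critic(s)                                            # V(s)
    with torch.no_grad():
        v_s_ = torch.zeros_like(v_s) if done else critic(s_)   # V(s'), zero at episode end
    td_error = r + gamma * v_s_ - v_s                          # single-step TD error

    # critic: value-based one-step update, minimise the squared TD error
    critic_loss = td_error.pow(2).mean()
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

    # actor: policy-gradient step, with the TD error taking the place of v_t
    actor_loss = -(log_prob * td_error.detach()).mean()
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()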
However, Actor-Critic can fail to converge, so the DDPG strategy was proposed on top of it. DDPG, i.e. deep deterministic policy gradient, combines ideas from DQN and Actor-Critic.
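
As a hedged sketch of how DDPG borrows from both sides (a DQN-style TD target computed with frozen target networks for the critic, plus a deterministic policy-gradient step for the actor), assuming `actor`, `critic`, their target copies, the optimizers and a sampled replay batch all exist with these illustrative names:

import torch
import torch.nn.functional as F

# one DDPG update step; every name in the signature is an assumption for illustration
def ddpg_update(actor, critic, actor_target, critic_target,
                actor_opt, critic_opt, s, a, r, s_, done, gamma=0.99, tau=0.005):
    # critic update: DQN-style TD target built from the *target* networks
    with torch.no_grad():
        q_target = r + gamma * (1 - done) * critic_target(s_, actor_target(s_))
    critic_loss = F.mse_loss(critic(s, a), q_target)
    critic_opt.zero_grad()
    critic_loss.backward()
    critic_opt.step()

    # actor update: deterministic policy gradient, maximise Q(s, actor(s))
    actor_loss = -critic(s, actor(s)).mean()
    actor_opt.zero_grad()
    actor_loss.backward()
    actor_opt.step()

    # soft-update the target networks (the frozen-network idea borrowed from DQN)
    for target, source in ((actor_target, actor), (critic_target, critic)):
        for t_param, param in zip(target.parameters(), source.parameters()):
            t_param.data.mul_(1 - tau).add_(tau * param.data)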
