Deep Reinforcement Learning Series (15): TRPO Algorithm Principles and TensorFlow Implementation

Preface: Policy gradient methods are broad and deep, but their Achilles' heel is that the update step size $\alpha$ is hard to choose. When the step size is inappropriate, the updated parameters correspond to a worse policy; sampling with that worse policy and updating again makes the parameters worse still, so learning easily deteriorates and may even collapse entirely. A suitable step size is therefore critical for reinforcement learning. Most current RL algorithms cannot guarantee monotonic improvement, whereas TRPO tackles the step-size problem head-on and derives, mathematically, a monotonic policy-improvement scheme.

Because the mathematics in TRPO is particularly involved, and because PPO (Proximal Policy Optimization, to be covered separately later) simplifies the procedure considerably, I always felt my understanding was incomplete. This post therefore draws on many excellent technical blogs to finish working through TRPO and understand the algorithm properly. My respect and thanks go to those authors!

Trust Region Policy Optimization: TRPO

TRPO (Trust Region Policy Optimization) was proposed in February 2015 by John Schulman, now a scientist at OpenAI, and published at ICML 2015 (the 32nd ICML). It is a distinctive and novel policy-optimization algorithm, and Schulman also devoted a separate chapter of his 2016 PhD thesis to it (Chapter 3, pages 18-44). Consider the following question first: when using gradient methods, a step that is too large easily skips over the optimum, while a step that is too small converges very slowly, as illustrated below.
[Figure: a step that is too large overshoots the optimum; a step that is too small converges slowly]
In gradient ascent, we determine the steepest direction and then move along it. But if the learning rate (step size) is too high, the move may take us away from the real reward function, or even end in disaster. In a trust region, we limit the search region with a variable $\delta$ (the trust-region radius); within that region the optimized policy is guaranteed to outperform the current policy until a local or global optimum is reached, as shown below.
[Figure: search constrained to a trust region of radius δ]
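To see the step-size problem concretely, here is a toy NumPy sketch (not from the paper; the quadratic objective $J(\theta)=-2\theta^{2}$ and the three learning rates are arbitrary choices) showing how vanilla gradient ascent crawls, converges, or diverges depending on $\alpha$:

```python
import numpy as np

def grad(theta):
    # gradient of the concave toy objective J(theta) = -2 * theta**2
    return -4.0 * theta

for alpha in (0.01, 0.2, 0.6):   # too small / reasonable / too large
    theta = 5.0
    for _ in range(30):
        theta = theta + alpha * grad(theta)   # theta_new = theta_old + alpha * grad J
    print(f"alpha={alpha:<4}  theta after 30 steps = {theta:.4f}")
# alpha=0.01 crawls toward 0, alpha=0.2 converges quickly,
# alpha=0.6 overshoots and diverges (|1 - 4*alpha| > 1)
```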

1. Ideas and Principles of the Paper

TRPO uses an MM (Minorize-Maximization) optimization procedure; in this post we derive the TRPO objective in several steps.

MM (Minorize-Maximization) Procedure

The MM algorithm is an iterative optimization scheme that exploits convexity to find a maximum or a minimum, depending on whether the desired optimization is maximization or minimization. MM is not a single algorithm but a recipe for constructing optimization algorithms. Its historical roots go back at least to 1970, when Ortega and Rheinboldt were studying line-search methods. The same idea kept reappearing in different fields in different guises, and in 2000 Hunter and Lange proposed "MM" as a general framework. Recent work has applied the method to a wide range of disciplines, such as mathematics, statistics, machine learning, and engineering. The MM procedure is illustrated below:
[Figures: illustration of the MM procedure]
For more details, see the MM algorithm reference [5].
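As a concrete (if simplistic) illustration of the MM idea, the sketch below maximizes a one-dimensional concave function by repeatedly maximizing a quadratic minorizer. The objective and the smoothness constant L = 3 are hypothetical choices for illustration only, not anything taken from TRPO itself:

```python
import numpy as np

def f(theta):
    # objective to maximize; f''(theta) = -2 - sin(theta), so -3 <= f''(theta) <= -1
    return -(theta - 3.0) ** 2 + np.sin(theta)

def grad_f(theta):
    return -2.0 * (theta - 3.0) + np.cos(theta)

L = 3.0  # smoothness constant: f''(theta) >= -L everywhere

theta = -5.0
for _ in range(50):
    # Quadratic minorizer at theta_k:
    #   m(theta) = f(theta_k) + f'(theta_k)(theta - theta_k) - (L/2)(theta - theta_k)^2
    # m(theta) <= f(theta) everywhere and m(theta_k) = f(theta_k),
    # so maximizing m can only improve f (the MM guarantee).
    theta = theta + grad_f(theta) / L   # argmax of the minorizer
print(theta, f(theta))                   # converges to the maximizer of f
```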

1.1 Surrogate Function

The goal of RL is to maximize the expected discounted rewards. From the policy gradient method, the parameter update rule is:
$$\theta_{new}=\theta_{old}+\alpha\nabla_{\theta}J$$
The Achilles' heel of policy gradient algorithms is the update step size $\alpha$: when the step size is inappropriate, the updated parameters correspond to a worse policy, sampling with that worse policy makes the next update even worse, and learning easily deteriorates until it collapses. A suitable step size is therefore critical for reinforcement learning. What counts as a suitable step size? One for which the value of the return function does not get worse after the policy update. How do we choose it? In other words, how do we find a new policy whose return increases monotonically, or at least does not decrease? That is exactly the problem TRPO solves. See the figure below, where the red curve is the expected discounted return.
[Figure: the expected discounted return η (red) and the surrogate lower bound M (blue)]

where $\eta$ is defined as:
$$\eta(\pi_{\theta}) = \mathbb{E}_{\tau \sim \pi_{\theta}}\left[\sum_{t=0}^{\infty}\gamma^{t}r_{t}\right]$$
This is the expectation of the cumulative reward. In the figure, C is a constant and the KL term in the blue curve M denotes the KL divergence (explained later). At each iteration, the surrogate function M (the blue curve) has the following properties:

  • it is a lower bound on $\eta$
  • it approximates the discounted return $\eta$ of the current policy
  • it is easy to optimize (we will approximate the surrogate by a quadratic)

At each iteration, we find the point that maximizes M and take it as the new current policy.
[Figure: maximizing the surrogate M yields the next policy]
We then re-evaluate the lower bound at the new policy and repeat. As this process continues the policy keeps improving, and because the set of possible policies is bounded, the current policy eventually converges to a locally or globally optimal policy.


1.2 The Objective Function

In reinforcement learning, the state-value and action-value functions are the foundation of learning, and the advantage function is a key ingredient of algorithms such as A3C and TRPO. We first define the action-value function, the state-value function, and the advantage function:

$$Q_{\pi}(s_{t},a_{t}) = \mathbb{E}_{s_{t+1},a_{t+1},\dots}\left[ \sum_{k=0}^{\infty} \gamma^{k}r(s_{t+k})\right]$$
$$V_{\pi}(s_{t}) = \mathbb{E}_{a_{t},s_{t+1},\dots}\left[ \sum_{k=0}^{\infty} \gamma^{k}r(s_{t+k})\right]$$
$$A_{\pi}(s,a) = Q_{\pi}(s,a)-V_{\pi}(s)$$
$$\text{where} \quad a_{t} \sim \pi(a_{t}|s_{t}),\quad s_{t+1}\sim P(s_{t+1}|s_{t},a_{t}) \quad \text{for} \quad t \geq 0$$

The advantage function is widely used to measure the return that a particular action brings.

The expected discounted reward is expressed as:

$$\eta(\pi) = \mathbb{E}_{s_{0},a_{0},\dots}\left[ \sum_{t=0}^{\infty}\gamma^{t}r(s_{t}) \right]$$
$$\text{where} \quad s_{0} \sim \rho_{0}(s_{0}),\quad a_{t} \sim \pi(a_{t}|s_{t}),\quad s_{t+1}\sim P(s_{t+1}|s_{t},a_{t})$$

The return of one policy can be expressed in terms of another policy:

$$\eta(\hat{\pi}) = \eta(\pi) +\mathbb{E}_{s_{0},a_{0},\dots \sim \hat{\pi}}\left[\sum_{t=0}^{\infty}\gamma^{t}A_{\pi}(s_{t},a_{t})\right]$$

$$A_{\pi}(s,a) = Q_{\pi}(s,a)-V_{\pi}(s) =\mathbb{E}_{s^{\prime}\sim P(s^{\prime}|s,a)}\left[r(s)+\gamma V_{\pi}(s^{\prime})-V_{\pi}(s)\right]$$

Here the notation $\mathbb{E}_{s_{0},a_{0},\dots \sim \hat{\pi}}[\dots]$ indicates that actions are sampled as $a_{t} \sim \hat{\pi}(\cdot|s_{t})$, $\eta(\pi)$ is the return of the old policy, $\eta(\hat{\pi})$ is the return of the new policy, and $\mathbb{E}_{s_{0},a_{0},\dots \sim \hat{\pi}}\left[\sum_{t=0}^{\infty}\gamma^{t}A_{\pi}(s_{t},a_{t})\right]$ is the return difference between the new and old policies, which is exactly where the "advantage" of the advantage function lies. We also define the discounted visitation frequencies $\rho(\pi)$:

$$\rho_{\pi}(s) = P(s_{0}=s)+\gamma P(s_{1}=s)+\gamma^{2}P(s_{2}=s)+\dots$$

The return difference between the new and old policies is proved as follows:
$$\begin{aligned} &\underset{\tau \sim \pi^{\prime}}{\mathrm{E}}\left[\sum_{t=0}^{\infty} \gamma^{t} A^{\pi}\left(s_{t}, a_{t}\right)\right] \\ &=\underset{\tau \sim \pi^{\prime}}{\mathrm{E}}\left[\sum_{t=0}^{\infty} \gamma^{t}\left(R\left(s_{t}, a_{t}, s_{t+1}\right)+\gamma V^{\pi}\left(s_{t+1}\right)-V^{\pi}\left(s_{t}\right)\right)\right] \\ &=\eta\left(\pi^{\prime}\right)+\underset{\tau \sim \pi^{\prime}}{\mathrm{E}}\left[\sum_{t=0}^{\infty} \gamma^{t+1} V^{\pi}\left(s_{t+1}\right)-\sum_{t=0}^{\infty} \gamma^{t} V^{\pi}\left(s_{t}\right)\right] \\ &=\eta\left(\pi^{\prime}\right)+\underset{\tau \sim \pi^{\prime}}{\mathrm{E}}\left[\sum_{t=1}^{\infty} \gamma^{t} V^{\pi}\left(s_{t}\right)-\sum_{t=0}^{\infty} \gamma^{t} V^{\pi}\left(s_{t}\right)\right] \\ &=\eta\left(\pi^{\prime}\right)-\underset{\tau \sim \pi^{\prime}}{\mathrm{E}}\left[V^{\pi}\left(s_{0}\right)\right] \\ &=\eta\left(\pi^{\prime}\right)-\eta(\pi) \end{aligned}$$

Expanding the above result gives:

$$\begin{aligned} \eta(\hat{\pi}) &= \eta(\pi)+\sum_{t=0}^{\infty}\sum_{s}P(s_{t}=s|\hat{\pi})\sum_{a}\hat{\pi}(a|s)\gamma^{t}A_{\pi}(s,a) \\ &= \eta(\pi)+\sum_{s}\rho_{\hat{\pi}}(s)\sum_{a}\hat{\pi}(a|s)A_{\pi}(s,a) \end{aligned}$$
where $\rho_{\hat{\pi}}(s) = \sum_{t=0}^{\infty}\gamma^{t}P(s_{t} = s|\hat{\pi})$.
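For a small finite MDP with a fixed policy, these discounted visitation frequencies can be computed in closed form, since $\rho_{\pi} = \sum_{t}\gamma^{t}(P_{\pi}^{T})^{t}\rho_{0} = (I-\gamma P_{\pi}^{T})^{-1}\rho_{0}$. A minimal sketch with a hypothetical 3-state transition matrix (the numbers are arbitrary):

```python
import numpy as np

gamma = 0.9
# P[s, s'] = probability of moving from state s to s' under the current policy
P = np.array([[0.7, 0.2, 0.1],
              [0.1, 0.8, 0.1],
              [0.3, 0.3, 0.4]])
rho0 = np.array([1.0, 0.0, 0.0])          # initial state distribution

# rho_pi(s) = sum_t gamma^t P(s_t = s)  =  (I - gamma * P^T)^{-1} rho0
rho = np.linalg.solve(np.eye(3) - gamma * P.T, rho0)
print(rho, rho.sum())                      # the entries sum to 1 / (1 - gamma) = 10
```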


Interpreting the advantage function (see the tree diagram below):

The value function $V(s)$ can be understood as the sum, over all possible actions in that state, of the action-value function weighted by the probability of taking each action; more plainly, $V(s)$ is the average of the action-value functions under the action probabilities. The action-value function $Q(s,a)$ is the value of a single action, so $Q_{\pi}(s,a)-V_{\pi}(s)$ measures how the value of the current action compares with that average. The "advantage" therefore refers to the advantage of an action value over the value of the current state. If the advantage is greater than zero, the action is better than the average action; if it is less than zero, the action is worse than the average action.

[Figure: tree diagram of V(s), Q(s,a), and the advantage A(s,a)]
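A small numeric sketch of this interpretation, using hypothetical Q-values and action probabilities for a single state:

```python
import numpy as np

# Hypothetical action values Q(s, a) and action probabilities pi(a | s) in one state
q_sa = np.array([1.0, 3.0, 2.0])        # Q(s, a) for actions a = 0, 1, 2
pi_as = np.array([0.5, 0.2, 0.3])       # pi(a | s)

v_s = np.sum(pi_as * q_sa)              # V(s) = sum_a pi(a|s) Q(s,a), the average
adv = q_sa - v_s                        # A(s, a) = Q(s, a) - V(s)

print("V(s)    =", v_s)                 # 1.7
print("A(s, a) =", adv)                 # [-0.7, 1.3, 0.3]
# A positive advantage (actions 1 and 2) means "better than the average action";
# a negative advantage (action 0) means "worse than the average action".
```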


1.3 First-Order Approximation: the L Function

1.3.1 Trick 1: First-Order Approximation

TRPO's first trick deals with the state distribution: we ignore the change in the state distribution and keep using the distribution induced by the old policy. This is a first-order approximation of the original objective. When the new and old parameters are close, replacing the new state distribution with the old one is reasonable; at the same time, the MM machinery is used to find a function that locally lower-bounds $\eta$ at the current policy. We approximate the original objective by the function $L_{\pi}(\hat{\pi})$:

$$L_{\pi}(\hat{\pi}) = \eta(\pi) +\sum_{s}\rho_{\pi}(s)\sum_{a}\hat{\pi}(a|s)A_{\pi}(s,a)$$

In the second (policy) term, the actions are generated by the new policy $\hat{\pi}$, but the new policy is parameterized by $\theta$, which is still unknown and therefore cannot be used to generate actions. This is where TRPO's second trick comes in: importance sampling.

1.3.2 Importance Sampling

TRPO's second trick handles the action distribution with importance sampling. As discussed later, the difference between the new and old policies is very small, so importance sampling works very well:
$$\sum_{a}\hat{\pi}(a|s)A_{\pi}(s,a) = \mathbb{E}_{a \sim q(a|s)}\left[ \frac{\hat{\pi}(a|s)}{q(a|s)}A_{\pi}(s,a)\right]$$
Writing the sum over $s$ as an expectation as well, the function becomes:

$$L_{\pi}(\hat{\pi}) = \eta(\pi)+\mathbb{E}_{s \sim \rho_{\theta_{old}},\,a \sim \pi_{\theta_{old}}(a|s)}\left[\frac{\hat{\pi}_{\theta}(a|s)}{\pi_{\theta_{old}}(a|s)}A_{\pi_{\theta_{old}}}(s,a) \right]$$

Importance sampling
Importance sampling is a statistical technique for estimating properties of a distribution by sampling from a different distribution and correcting the estimate for the original distribution. It is related to umbrella sampling in computational physics.
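A quick Monte Carlo check of this identity (a toy example with two Gaussians chosen arbitrarily, not tied to TRPO itself):

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x):
    return x ** 2

# Target distribution p = N(1, 1), sampling distribution q = N(0, 2)
def p_pdf(x): return np.exp(-0.5 * (x - 1.0) ** 2) / np.sqrt(2 * np.pi)
def q_pdf(x): return np.exp(-0.5 * (x / 2.0) ** 2) / (2.0 * np.sqrt(2 * np.pi))

n = 200_000
x_p = rng.normal(1.0, 1.0, n)            # direct sampling from p
x_q = rng.normal(0.0, 2.0, n)            # sampling from q instead

direct = f(x_p).mean()                                   # E_{x~p}[f(x)]
importance = (f(x_q) * p_pdf(x_q) / q_pdf(x_q)).mean()   # E_{x~q}[f(x) p(x)/q(x)]

print(direct, importance)   # both approximate E_p[x^2] = Var + mean^2 = 2.0
```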

1.3.3 Choosing the Step Size

Compare the following two expressions:

$$\eta(\hat{\pi}) = \eta(\pi)+\sum_{t=0}^{\infty}\sum_{s}P(s_{t}=s|\hat{\pi})\sum_{a}\hat{\pi}(a|s)\gamma^{t}A_{\pi}(s,a)$$
$$L_{\pi}(\hat{\pi}) = \eta(\pi)+\mathbb{E}_{s \sim \rho_{\theta_{old}},\,a \sim \pi_{\theta_{old}}(a|s)}\left[\frac{\hat{\pi}_{\theta}(a|s)}{\pi_{\theta_{old}}(a|s)}A_{\pi_{\theta_{old}}}(s,a) \right]$$
We see that $\eta(\hat{\pi})$ and $L_{\pi}(\hat{\pi})$ differ only in the state distribution used; both are functions of $\hat{\pi}$, and at $\pi_{\theta_{old}}$ they agree in value and in first derivative (shown below).

$L$ is one part of the lower-bound function $M$ (underlined in red):

[Figure: M = L − C · KL, with the L term underlined in red]

The second term of $M$ is the KL divergence.


KL divergence:
The Kullback-Leibler (KL) divergence was introduced by Solomon Kullback and Richard Leibler in 1951 as a directed measure of the difference between two distributions. For discrete probability distributions $P$ and $Q$ defined on the same probability space, the Kullback-Leibler divergence from $Q$ to $P$ is defined as
$$D_{KL}(P\parallel Q)= - \sum_{x \in \mathcal{X}} P(x) \log {\frac{Q(x)}{P(x)}}$$
In other words, it is the expectation of the logarithmic difference between the probabilities $P$ and $Q$, where the expectation is taken under $P$. The Kullback-Leibler divergence is only defined if, for all $x$, $Q(x)=0$ implies $P(x)=0$ (absolute continuity). For distributions $P$ and $Q$ of a continuous random variable, the Kullback-Leibler divergence is defined as the integral:
$$D_{KL}(P\parallel Q)= \int_{-\infty }^{\infty } p(x) \log {\left(\frac{p(x)}{q(x)}\right)}dx$$
For more of the underlying mathematics, see Wikipedia [10].
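A small numerical example for the discrete case, using two arbitrary 3-outcome distributions, also shows that the KL divergence is not symmetric:

```python
import numpy as np

def kl(p, q):
    # D_KL(P || Q) = sum_x P(x) log(P(x) / Q(x)), assuming q > 0 wherever p > 0
    p, q = np.asarray(p, float), np.asarray(q, float)
    mask = p > 0
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))

p = [0.6, 0.3, 0.1]
q = [0.4, 0.4, 0.2]

print(kl(p, q))   # ~0.088
print(kl(q, p))   # ~0.092 -- the KL divergence is not symmetric
```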


At the current policy, $KL(\theta_{i}, \theta_{i})=0$, and the $C \cdot KL$ term can be viewed as an upper bound on the error of approximating $\eta$ by $L$. At the policy $\theta_{i}$ we can show that $L$ and $\eta$ have the same value and the same first derivative:

$$L_{\pi_{\theta_{i}}}(\pi_{\theta_{i}}) = \eta(\pi_{\theta_{i}})$$
$$\nabla_{\theta}L_{\pi_{\theta_{i}}}(\pi_{\theta})\big|_{\theta = \theta_{i}} = \nabla_{\theta}\eta(\pi_{\theta})\big|_{\theta = \theta_{i}}$$

By these equations, within a small neighborhood of $\theta_{i}$ the two objectives are very close and so are their gradients, so we can use $L_{\pi_{\theta_{old}}}$ in place of the original objective and take a bounded step along the direction of its gradient while still improving the policy.

But how large a step still guarantees monotonic non-decrease? Since the approximation only holds in a small region around the current policy, we add a constraint that limits the difference between the new and old policies. Viewing a policy as a probability distribution, this difference can be measured with the KL divergence $D_{KL}^{max}(\pi, \hat{\pi}) = \max_{s}D_{KL}(\pi(\cdot |s)\,\|\,\hat{\pi}(\cdot |s))$. The original paper proves that, when $\hat{\pi} = \arg\max_{\pi^{\prime}}L_{\pi}(\pi^{\prime})$, the following lower bound holds:

$$\eta(\hat{\pi}) \geq L_{\pi}(\hat{\pi})-C \times D_{KL}^{max}(\pi, \hat{\pi}), \quad \text{where} \quad C = \frac{4\epsilon \gamma}{(1-\gamma)^{2}}$$

where $D_{KL}^{max}(\pi,\tilde{\pi})$ is the (maximum) KL divergence, which limits how much the policy model changes from one update to the next. Define:

$$M_{i}(\pi) = L_{\pi_{i}}(\pi) -C\times D_{KL}^{max}(\pi_{i},\pi)$$

Using this definition, we can show that the policy improves monotonically:
$$\eta(\pi_{i+1}) \geq M_{i}(\pi_{i+1})$$
$$\eta(\pi_{i}) = M_{i}(\pi_{i})$$

Subtracting the two gives:

$$\eta(\pi_{i+1})-\eta(\pi_{i}) \geq M_{i}(\pi_{i+1})-M_{i}(\pi_{i})$$

Next we look at the details of the lower-bound function M. A two-page proof in Appendix A of the TRPO paper establishes a definite lower bound on $\eta$:

$$\eta(\tilde{\pi}) \geq L_{\pi}(\tilde{\pi})-\frac{4 \epsilon \gamma}{(1-\gamma)^{2}} \alpha^{2}, \quad \text{where} \quad \epsilon=\max _{s, a}\left|A_{\pi}(s, a)\right|, \quad \alpha=D_{TV}^{\max }(\pi, \tilde{\pi})$$

$D_{TV}$ is the total variation divergence. This does not matter much, because we immediately replace it with the KL divergence (to obtain a lower bound), using

$$D_{TV}(p \| q)^{2} \leq D_{KL}(p \| q)$$
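A quick numerical sanity check of this relationship on random categorical distributions; here $D_{TV}(p\|q)=\frac{1}{2}\sum|p-q|$ is taken as the usual total-variation distance, and the Dirichlet-sampled distributions are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(1)

def kl(p, q):
    return np.sum(p * np.log(p / q))

for _ in range(5):
    p = rng.dirichlet(np.ones(4))
    q = rng.dirichlet(np.ones(4))
    d_tv = 0.5 * np.sum(np.abs(p - q))       # total variation distance
    print(f"D_TV^2 = {d_tv**2:.4f} <= D_KL = {kl(p, q):.4f}", d_tv**2 <= kl(p, q))
```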

The lower bound can then be rewritten as:
$$\eta(\tilde{\pi}) \geq L_{\pi}(\tilde{\pi})-C D_{\mathrm{KL}}^{\max }(\pi, \tilde{\pi}), \text { where } C=\frac{4 \epsilon \gamma}{(1-\gamma)^{2}}$$
$$D_{\mathrm{KL}}^{\max }(\pi, \tilde{\pi})=\max _{s} D_{\mathrm{KL}}(\pi(\cdot | s) \| \tilde{\pi}(\cdot | s))$$

Note that the notation can be abbreviated as:

$$\eta(\theta):=\eta(\pi_{\theta})$$
$$L_{\theta}(\hat{\theta}):= L_{\pi_{\theta}}(\pi_{\hat{\theta}})$$

1.4 Proof of Monotonic Improvement

The key idea of the natural policy gradient is that it guarantees monotonic improvement; it is the "money-back-guaranteed" member of the policy gradient family (joking). In short, at least in theory, every policy update is no worse than the previous one. What we need to show is that the new policy obtained by optimizing M is guaranteed to perform better than the previous policy in terms of $\eta$ (the true expected return). Since the set of policies is bounded, repeatedly updating the policy eventually reaches a local or global optimum. Here is the proof:

Any improvement of $M_{i}(\pi_{i+1})$ over $M_{i}(\pi_{i})$ implies an improvement of $\eta(\pi_{i+1})$:
$$M_{i}(\pi)=L_{\pi_{i}}(\pi)-C D_{\mathrm{KL}}^{\max }\left(\pi_{i}, \pi\right), \quad \text{then}$$
$$\eta\left(\pi_{i+1}\right) \geq M_{i}\left(\pi_{i+1}\right)$$
$$\eta\left(\pi_{i}\right)=M_{i}\left(\pi_{i}\right), \quad \text{therefore}$$
$$\eta\left(\pi_{i+1}\right)-\eta\left(\pi_{i}\right) \geq M_{i}\left(\pi_{i+1}\right)-M_{i}\left(\pi_{i}\right)$$

Proof of monotonic improvement in policy iteration

This gives an iterative algorithm that guarantees the new policy always performs at least as well as the current one.

[Figure: policy iteration algorithm with guaranteed monotonic improvement]
However, it is hard to find the maximum KL divergence over all states, so we relax the requirement and use the mean KL divergence instead.
[Figure: the practical objective with a mean-KL trust-region constraint]

where the $L$ function is:
$$L_{\pi}(\hat{\pi}) = \eta(\pi)+\sum_{s}\rho_{\pi}(s)\sum_{a}\hat{\pi}(a|s)A_{\pi}(s,a)$$

We can use importance sampling, drawing samples from a policy $q$, to estimate the left-hand side above:
$$\underset{\theta}{\operatorname{maximize}}\; \mathbb{E}_{s \sim \rho_{\theta_{\text{old}}},\, a \sim q}\left[\frac{\pi_{\theta}(a | s)}{q(a | s)} \hat{A}_{\theta_{\text{old}}}(s, a)\right]$$
$$\text{subject to} \quad \mathbb{E}_{s \sim \rho_{\theta_{\text{old}}}}\left[D_{\mathrm{KL}}\left(\pi_{\theta_{\text{old}}}(\cdot | s) \| \pi_{\theta}(\cdot | s)\right)\right] \leq \delta$$


Importance sampling
Importance sampling computes the expected value of $f(x)$ where $x$ follows a data distribution $p$, i.e. $\mathbb{E}_{x \sim p}[f(x)]$. With importance sampling we have the option of not drawing the values of $f(x)$ from $p$; instead, we sample from $q$ and use the probability ratio between $p$ and $q$ to recalibrate the result: $\mathbb{E}_{x \sim p}[f(x)] = \mathbb{E}_{x \sim q}\left[\frac{f(x)p(x)}{q(x)}\right]$. In policy gradient methods, we use the current policy to compute the policy gradient:
$$\hat{g} = \hat{\mathbb{E}}_{t}\left[\nabla_{\theta}\log\pi_{\theta}(a_{t}|s_{t})\,\hat{A}_{t}\right]$$

As a result, whenever the policy changes we must collect new samples; old samples cannot be reused, so the sample efficiency of vanilla PG is poor. With importance sampling, the objective can be rewritten so that samples from the old policy can be used to compute the policy gradient.

[Figure: the objective rewritten with importance sampling over samples from the old policy]


Continuing from above, by Lagrangian duality the constrained objective can be converted into an objective with a Lagrange multiplier; the two are mathematically equivalent:

$$\underset{\theta}{\operatorname{maximize}} \quad \hat{\mathbb{E}}_{t}\left[\frac{\pi_{\theta}\left(a_{t} | s_{t}\right)}{\pi_{\theta_{\mathrm{old}}}\left(a_{t} | s_{t}\right)} \hat{A}_{t}\right]$$
$$\text{subject to} \quad \hat{\mathbb{E}}_{t}\left[\mathrm{KL}\left[\pi_{\theta_{\text{old}}}\left(\cdot | s_{t}\right), \pi_{\theta}\left(\cdot | s_{t}\right)\right]\right] \leq \delta$$
or
$$\underset{\theta}{\operatorname{maximize}} \quad \hat{\mathbb{E}}_{t}\left[\frac{\pi_{\theta}\left(a_{t} | s_{t}\right)}{\pi_{\theta_{\text{old}}}\left(a_{t} | s_{t}\right)} \hat{A}_{t}\right]-\beta \hat{\mathbb{E}}_{t}\left[\mathrm{KL}\left[\pi_{\theta_{\text{old}}}\left(\cdot | s_{t}\right), \pi_{\theta}\left(\cdot | s_{t}\right)\right]\right]$$
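To make the two forms concrete, here is a minimal NumPy sketch (hypothetical categorical policies, actions, and advantage estimates generated at random) that evaluates the importance-ratio surrogate together with the mean KL term, used either as a constraint or as a penalty:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical batch: per-state action probabilities of the old and new policy,
# the actions actually taken under pi_old, and their advantage estimates.
n_states, n_actions = 5, 3
pi_old = rng.dirichlet(np.ones(n_actions), size=n_states)   # pi_old(. | s_t)
pi_new = rng.dirichlet(np.ones(n_actions), size=n_states)   # pi_theta(. | s_t)
actions = rng.integers(n_actions, size=n_states)
adv_hat = rng.normal(size=n_states)                          # \hat{A}_t

idx = np.arange(n_states)
ratio = pi_new[idx, actions] / pi_old[idx, actions]          # pi_theta / pi_old
surrogate = np.mean(ratio * adv_hat)                         # E_t[ratio * A_hat]

# mean KL(pi_old(.|s_t) || pi_theta(.|s_t)) over the batch
mean_kl = np.mean(np.sum(pi_old * np.log(pi_old / pi_new), axis=1))

delta, beta = 0.01, 1.0
print("surrogate           :", surrogate)
print("constraint satisfied:", mean_kl <= delta)             # constrained form
print("penalized objective :", surrogate - beta * mean_kl)   # penalty form
```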

1.5 Optimizing the Objective Function

As mentioned earlier, the lower-bound function M should be easy to optimize. In practice we use Taylor series to approximate the L function and the mean KL divergence, expanding L to first order and the KL divergence to second order:
$$L_{\theta_{k}}(\theta) \approx g^{T}\left(\theta-\theta_{k}\right), \quad g \doteq \nabla_{\theta} L_{\theta_{k}}(\theta)\big|_{\theta_{k}}$$

$$\overline{D}_{KL}\left(\theta \| \theta_{k}\right) \approx \frac{1}{2}\left(\theta-\theta_{k}\right)^{T} H\left(\theta-\theta_{k}\right), \quad H \doteq \nabla_{\theta}^{2} \overline{D}_{KL}\left(\theta \| \theta_{k}\right)\big|_{\theta_{k}}$$

where $g$ is the policy gradient and $H$ is the Fisher information matrix (FIM), written here in the form of a Hessian matrix:

$$H=\nabla^{2} f=\begin{pmatrix} \frac{\partial^{2} f}{\partial x_{1}^{2}} & \frac{\partial^{2} f}{\partial x_{1} \partial x_{2}} & \cdots & \frac{\partial^{2} f}{\partial x_{1} \partial x_{n}} \\ \frac{\partial^{2} f}{\partial x_{2} \partial x_{1}} & \frac{\partial^{2} f}{\partial x_{2}^{2}} & \cdots & \frac{\partial^{2} f}{\partial x_{2} \partial x_{n}} \\ \vdots & \vdots & \ddots & \vdots \\ \frac{\partial^{2} f}{\partial x_{n} \partial x_{1}} & \frac{\partial^{2} f}{\partial x_{n} \partial x_{2}} & \cdots & \frac{\partial^{2} f}{\partial x_{n}^{2}} \end{pmatrix}$$
The optimization problem then becomes:

$$\theta_{k+1}=\arg \max _{\theta}\; g^{T}\left(\theta-\theta_{k}\right)$$
$$\text{subject to} \quad \frac{1}{2}\left(\theta-\theta_{k}\right)^{T} H\left(\theta-\theta_{k}\right) \leq \delta$$

This quadratic problem can be solved in closed form, giving:
$$\theta_{k+1}=\theta_{k}+\sqrt{\frac{2 \delta}{g^{T} H^{-1} g}} H^{-1} g$$

The update above is known as the natural policy gradient. Below is the algorithm obtained by combining the MM procedure with the natural gradient:

[Figure: natural policy gradient algorithm]
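As a numerical sanity check of the closed-form natural-gradient step above, here is a small sketch for a hypothetical 3-parameter problem, using an explicit dense positive-definite H purely for illustration:

```python
import numpy as np

delta = 0.01
g = np.array([0.5, -1.0, 0.3])                  # policy gradient at theta_k (made up)
A = np.array([[2.0, 0.3, 0.0],
              [0.3, 1.5, 0.2],
              [0.0, 0.2, 1.0]])
H = A @ A.T + np.eye(3)                         # a symmetric positive-definite "FIM"

Hinv_g = np.linalg.solve(H, g)                  # H^{-1} g
step = np.sqrt(2 * delta / (g @ Hinv_g)) * Hinv_g

theta_k = np.zeros(3)
theta_next = theta_k + step                     # natural policy gradient update

# The step sits exactly on the boundary of the quadratic trust region:
print(0.5 * step @ H @ step)                    # ~= delta = 0.01
```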

TRPO
However, computing the FIM $H$ and especially its inverse is very expensive. TRPO therefore estimates $H^{-1} g$ by solving the following linear system for $x$:
$$x_{k} \approx \hat{H}_{k}^{-1} \hat{g}_{k} \quad\Longleftrightarrow\quad \hat{H}_{k} x_{k} \approx \hat{g}_{k}$$

This means we can solve it with the conjugate gradient method, which is simpler than computing the inverse of $H$. Conjugate gradient is similar to gradient descent, but it finds the solution in at most N iterations, where N is the number of parameters in the model.
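A minimal sketch of the idea: solve Hx = g using only matrix-vector products, without ever forming or inverting H. This mirrors the cg() helper in the Spinning Up code below; the 8×8 matrix here is random and purely illustrative:

```python
import numpy as np

def cg(Ax, b, iters=10):
    # Conjugate gradient: solve A x = b given only the matrix-vector product Ax(.)
    x = np.zeros_like(b)
    r = b.copy()                      # residual b - A x for x = 0
    p = r.copy()
    r_dot_old = r @ r
    for _ in range(iters):
        z = Ax(p)
        alpha = r_dot_old / (p @ z)
        x += alpha * p
        r -= alpha * z
        r_dot_new = r @ r
        p = r + (r_dot_new / r_dot_old) * p
        r_dot_old = r_dot_new
    return x

rng = np.random.default_rng(0)
M = rng.normal(size=(8, 8))
H = M @ M.T + 8 * np.eye(8)           # symmetric positive-definite test matrix
g = rng.normal(size=8)

x_cg = cg(lambda v: H @ v, g, iters=8)
print(np.allclose(x_cg, np.linalg.solve(H, g), atol=1e-4))   # True after N = 8 iterations
```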

[Figure: the TRPO algorithm]

Finally, here are the paper's results on Atari games:
[Figure: TRPO results on Atari games (from the paper)]

TensorFlow Implementation

The code in this section comes from OpenAI's official Spinning Up.

import numpy as np
import tensorflow as tf
import gym
import time
import spinup.algos.trpo.core as core
from spinup.utils.logx import EpochLogger
from spinup.utils.mpi_tf import MpiAdamOptimizer, sync_all_params
from spinup.utils.mpi_tools import mpi_fork, mpi_avg, proc_id, mpi_statistics_scalar, num_procs


EPS = 1e-8

class GAEBuffer:
    """
    A buffer for storing trajectories experienced by a TRPO agent interacting
    with the environment, and using Generalized Advantage Estimation (GAE-Lambda)
    for calculating the advantages of state-action pairs.
    """
    # Initialization: buffers for observations, actions, advantages, etc.
    def __init__(self, obs_dim, act_dim, size, info_shapes, gamma=0.99, lam=0.95):
        self.obs_buf = np.zeros(core.combined_shape(size, obs_dim), dtype=np.float32)
        self.act_buf = np.zeros(core.combined_shape(size, act_dim), dtype=np.float32)
        self.adv_buf = np.zeros(size, dtype=np.float32)
        self.rew_buf = np.zeros(size, dtype=np.float32)
        self.ret_buf = np.zeros(size, dtype=np.float32)
        self.val_buf = np.zeros(size, dtype=np.float32)
        self.logp_buf = np.zeros(size, dtype=np.float32)
        self.info_bufs = {k: np.zeros([size] + list(v), dtype=np.float32) for k,v in info_shapes.items()}
        self.sorted_info_keys = core.keys_as_sorted_list(self.info_bufs)
        self.gamma, self.lam = gamma, lam
        self.ptr, self.path_start_idx, self.max_size = 0, 0, size
    # Storage is factored into its own function, similar to the experience replay buffer in DDPG
    def store(self, obs, act, rew, val, logp, info):
        """
        Append one timestep of agent-environment interaction to the buffer.
        """
        assert self.ptr < self.max_size     # buffer has to have room so you can store
        self.obs_buf[self.ptr] = obs
        self.act_buf[self.ptr] = act
        self.rew_buf[self.ptr] = rew
        self.val_buf[self.ptr] = val
        # log-probability under the data-collecting (old) policy, used later for the new/old ratio
        self.logp_buf[self.ptr] = logp
        for i, k in enumerate(self.sorted_info_keys):
            self.info_bufs[k][self.ptr] = info[i]
        self.ptr += 1

    def finish_path(self, last_val=0):
        """
        Call this at the end of a trajectory, or when one gets cut off
        by an epoch ending. This looks back in the buffer to where the
        trajectory started, and uses rewards and value estimates from
        the whole trajectory to compute advantage estimates with GAE-Lambda,
        as well as compute the rewards-to-go for each state, to use as
        the targets for the value function.

        The "last_val" argument should be 0 if the trajectory ended
        because the agent reached a terminal state (died), and otherwise
        should be V(s_T), the value function estimated for the last state.
        This allows us to bootstrap the reward-to-go calculation to account
        for timesteps beyond the arbitrary episode horizon (or epoch cutoff).
        """

        path_slice = slice(self.path_start_idx, self.ptr)
        rews = np.append(self.rew_buf[path_slice], last_val)
        vals = np.append(self.val_buf[path_slice], last_val)

        # the next two lines implement GAE-Lambda advantage calculation
        deltas = rews[:-1] + self.gamma * vals[1:] - vals[:-1]
        self.adv_buf[path_slice] = core.discount_cumsum(deltas, self.gamma * self.lam)

        # the next line computes rewards-to-go, to be targets for the value function
        self.ret_buf[path_slice] = core.discount_cumsum(rews, self.gamma)[:-1]

        self.path_start_idx = self.ptr

    def get(self):
        """
        Call this at the end of an epoch to get all of the data from
        the buffer, with advantages appropriately normalized (shifted to have
        mean zero and std one). Also, resets some pointers in the buffer.
        """
        assert self.ptr == self.max_size    # buffer has to be full before you can get
        self.ptr, self.path_start_idx = 0, 0
        # the next two lines implement the advantage normalization trick
        adv_mean, adv_std = mpi_statistics_scalar(self.adv_buf)
        self.adv_buf = (self.adv_buf - adv_mean) / adv_std
        return [self.obs_buf, self.act_buf, self.adv_buf, self.ret_buf,
                self.logp_buf] + core.values_as_sorted_list(self.info_bufs)

"""

Trust Region Policy Optimization

(with support for Natural Policy Gradient)

"""
def trpo(env_fn, actor_critic=core.mlp_actor_critic, ac_kwargs=dict(), seed=0,
         steps_per_epoch=4000, epochs=50, gamma=0.99, delta=0.01, vf_lr=1e-3,
         train_v_iters=80, damping_coeff=0.1, cg_iters=10, backtrack_iters=10,
         backtrack_coeff=0.8, lam=0.97, max_ep_len=1000, logger_kwargs=dict(),
         save_freq=10, algo='trpo'):
    """

    Args:
        env_fn : A function which creates a copy of the environment.
            The environment must satisfy the OpenAI Gym API.

        actor_critic: A function which takes in placeholder symbols
            for state, ``x_ph``, and action, ``a_ph``, and returns the main
            outputs from the agent's Tensorflow computation graph:

            ============  ================  ========================================
            Symbol        Shape             Description
            ============  ================  ========================================
            ``pi``        (batch, act_dim)  | Samples actions from policy given
                                            | states.
            ``logp``      (batch,)          | Gives log probability, according to
                                            | the policy, of taking actions ``a_ph``
                                            | in states ``x_ph``.
            ``logp_pi``   (batch,)          | Gives log probability, according to
                                            | the policy, of the action sampled by
                                            | ``pi``.
            ``info``      N/A               | A dict of any intermediate quantities
                                            | (from calculating the policy or log
                                            | probabilities) which are needed for
                                            | analytically computing KL divergence.
                                            | (eg sufficient statistics of the
                                            | distributions)
            ``info_phs``  N/A               | A dict of placeholders for old values
                                            | of the entries in ``info``.
            ``d_kl``      ()                | A symbol for computing the mean KL
                                            | divergence between the current policy
                                            | (``pi``) and the old policy (as
                                            | specified by the inputs to
                                            | ``info_phs``) over the batch of
                                            | states given in ``x_ph``.
            ``v``         (batch,)          | Gives the value estimate for states
                                            | in ``x_ph``. (Critical: make sure
                                            | to flatten this!)
            ============  ================  ========================================

        ac_kwargs (dict): Any kwargs appropriate for the actor_critic
            function you provided to TRPO.

        seed (int): Seed for random number generators.

        steps_per_epoch (int): Number of steps of interaction (state-action pairs)
            for the agent and the environment in each epoch.

        epochs (int): Number of epochs of interaction (equivalent to
            number of policy updates) to perform.

        gamma (float): Discount factor. (Always between 0 and 1.)

        delta (float): KL-divergence limit for TRPO / NPG update.
            (Should be small for stability. Values like 0.01, 0.05.)

        vf_lr (float): Learning rate for value function optimizer.

        train_v_iters (int): Number of gradient descent steps to take on
            value function per epoch.

        damping_coeff (float): Artifact for numerical stability, should be
            smallish. Adjusts Hessian-vector product calculation:

            .. math:: Hv \\rightarrow (\\alpha I + H)v

            where :math:`\\alpha` is the damping coefficient.
            Probably don't play with this hyperparameter.

        cg_iters (int): Number of iterations of conjugate gradient to perform.
            Increasing this will lead to a more accurate approximation
            to :math:`H^{-1} g`, and possibly slightly-improved performance,
            but at the cost of slowing things down.

            Also probably don't play with this hyperparameter.

        backtrack_iters (int): Maximum number of steps allowed in the
            backtracking line search. Since the line search usually doesn't
            backtrack, and usually only steps back once when it does, this
            hyperparameter doesn't often matter.

        backtrack_coeff (float): How far back to step during backtracking line
            search. (Always between 0 and 1, usually above 0.5.)

        lam (float): Lambda for GAE-Lambda. (Always between 0 and 1,
            close to 1.)

        max_ep_len (int): Maximum length of trajectory / episode / rollout.

        logger_kwargs (dict): Keyword args for EpochLogger.

        save_freq (int): How often (in terms of gap between epochs) to save
            the current policy and value function.

        algo: Either 'trpo' or 'npg': this code supports both, since they are
            almost the same.

    """

    logger = EpochLogger(**logger_kwargs)
    logger.save_config(locals())

    seed += 10000 * proc_id()
    tf.set_random_seed(seed)
    np.random.seed(seed)

    env = env_fn()
    obs_dim = env.observation_space.shape
    act_dim = env.action_space.shape

    # Share information about action space with policy architecture
    ac_kwargs['action_space'] = env.action_space

    # Inputs to computation graph
    x_ph, a_ph = core.placeholders_from_spaces(env.observation_space, env.action_space)
    adv_ph, ret_ph, logp_old_ph = core.placeholders(None, None, None)

    # Main outputs from computation graph, plus placeholders for old pdist (for KL)
    pi, logp, logp_pi, info, info_phs, d_kl, v = actor_critic(x_ph, a_ph, **ac_kwargs)

    # Need all placeholders in *this* order later (to zip with data from buffer)
    all_phs = [x_ph, a_ph, adv_ph, ret_ph, logp_old_ph] + core.values_as_sorted_list(info_phs)

    # Every step, get: action, value, logprob, & info for pdist (for computing kl div)
    get_action_ops = [pi, v, logp_pi] + core.values_as_sorted_list(info)

    # Experience buffer
    local_steps_per_epoch = int(steps_per_epoch / num_procs())
    info_shapes = {k: v.shape.as_list()[1:] for k,v in info_phs.items()}
    buf = GAEBuffer(obs_dim, act_dim, local_steps_per_epoch, info_shapes, gamma, lam)

    # Count variables
    var_counts = tuple(core.count_vars(scope) for scope in ['pi', 'v'])
    logger.log('\nNumber of parameters: \t pi: %d, \t v: %d\n'%var_counts)

    # TRPO losses
    ratio = tf.exp(logp - logp_old_ph)          # pi(a|s) / pi_old(a|s)
    pi_loss = -tf.reduce_mean(ratio * adv_ph)
    v_loss = tf.reduce_mean((ret_ph - v)**2)

    # Optimizer for value function
    train_vf = MpiAdamOptimizer(learning_rate=vf_lr).minimize(v_loss)

    # Symbols needed for CG solver
    pi_params = core.get_vars('pi')
    gradient = core.flat_grad(pi_loss, pi_params)
    v_ph, hvp = core.hessian_vector_product(d_kl, pi_params)
    if damping_coeff > 0:
        hvp += damping_coeff * v_ph

    # Symbols for getting and setting params
    get_pi_params = core.flat_concat(pi_params)
    set_pi_params = core.assign_params_from_flat(v_ph, pi_params)

    sess = tf.Session()
    sess.run(tf.global_variables_initializer())

    # Sync params across processes
    sess.run(sync_all_params())

    # Setup model saving
    logger.setup_tf_saver(sess, inputs={'x': x_ph}, outputs={'pi': pi, 'v': v})

    def cg(Ax, b):
        """
        Conjugate gradient algorithm
        (see https://en.wikipedia.org/wiki/Conjugate_gradient_method)
        """
        x = np.zeros_like(b)
        r = b.copy() # Note: should be 'b - Ax(x)', but for x=0, Ax(x)=0. Change if doing warm start.
        p = r.copy()
        r_dot_old = np.dot(r,r)
        for _ in range(cg_iters):
            z = Ax(p)
            alpha = r_dot_old / (np.dot(p, z) + EPS)
            x += alpha * p
            r -= alpha * z
            r_dot_new = np.dot(r,r)
            p = r + (r_dot_new / r_dot_old) * p
            r_dot_old = r_dot_new
        return x

    def update():
        # Prepare hessian func, gradient eval
        inputs = {k:v for k,v in zip(all_phs, buf.get())}
        Hx = lambda x : mpi_avg(sess.run(hvp, feed_dict={**inputs, v_ph: x}))
        g, pi_l_old, v_l_old = sess.run([gradient, pi_loss, v_loss], feed_dict=inputs)
        g, pi_l_old = mpi_avg(g), mpi_avg(pi_l_old)

        # Core calculations for TRPO or NPG
        x = cg(Hx, g)
        alpha = np.sqrt(2*delta/(np.dot(x, Hx(x))+EPS))
        old_params = sess.run(get_pi_params)

        def set_and_eval(step):
            sess.run(set_pi_params, feed_dict={v_ph: old_params - alpha * x * step})
            return mpi_avg(sess.run([d_kl, pi_loss], feed_dict=inputs))

        if algo=='npg':
            # npg has no backtracking or hard kl constraint enforcement
            kl, pi_l_new = set_and_eval(step=1.)

        elif algo=='trpo':
            # trpo augments npg with backtracking line search, hard kl
            for j in range(backtrack_iters):
                kl, pi_l_new = set_and_eval(step=backtrack_coeff**j)
                if kl <= delta and pi_l_new <= pi_l_old:
                    logger.log('Accepting new params at step %d of line search.'%j)
                    logger.store(BacktrackIters=j)
                    break

                if j==backtrack_iters-1:
                    logger.log('Line search failed! Keeping old params.')
                    logger.store(BacktrackIters=j)
                    kl, pi_l_new = set_and_eval(step=0.)

        # Value function updates
        for _ in range(train_v_iters):
            sess.run(train_vf, feed_dict=inputs)
        v_l_new = sess.run(v_loss, feed_dict=inputs)

        # Log changes from update
        logger.store(LossPi=pi_l_old, LossV=v_l_old, KL=kl,
                     DeltaLossPi=(pi_l_new - pi_l_old),
                     DeltaLossV=(v_l_new - v_l_old))

    start_time = time.time()
    o, r, d, ep_ret, ep_len = env.reset(), 0, False, 0, 0

    # Main loop: collect experience in env and update/log each epoch
    for epoch in range(epochs):
        for t in range(local_steps_per_epoch):
            agent_outs = sess.run(get_action_ops, feed_dict={x_ph: o.reshape(1,-1)})
            a, v_t, logp_t, info_t = agent_outs[0][0], agent_outs[1], agent_outs[2], agent_outs[3:]

            # save and log
            buf.store(o, a, r, v_t, logp_t, info_t)
            logger.store(VVals=v_t)

            o, r, d, _ = env.step(a)
            ep_ret += r
            ep_len += 1

            terminal = d or (ep_len == max_ep_len)
            if terminal or (t==local_steps_per_epoch-1):
                if not(terminal):
                    print('Warning: trajectory cut off by epoch at %d steps.'%ep_len)
                # if trajectory didn't reach terminal state, bootstrap value target
                last_val = r if d else sess.run(v, feed_dict={x_ph: o.reshape(1,-1)})
                buf.finish_path(last_val)
                if terminal:
                    # only save EpRet / EpLen if trajectory finished
                    logger.store(EpRet=ep_ret, EpLen=ep_len)
                o, r, d, ep_ret, ep_len = env.reset(), 0, False, 0, 0

        # Save model
        if (epoch % save_freq == 0) or (epoch == epochs-1):
            logger.save_state({'env': env}, None)

        # Perform TRPO or NPG update!
        update()

        # Log info about epoch
        logger.log_tabular('Epoch', epoch)
        logger.log_tabular('EpRet', with_min_and_max=True)
        logger.log_tabular('EpLen', average_only=True)
        logger.log_tabular('VVals', with_min_and_max=True)
        logger.log_tabular('TotalEnvInteracts', (epoch+1)*steps_per_epoch)
        logger.log_tabular('LossPi', average_only=True)
        logger.log_tabular('LossV', average_only=True)
        logger.log_tabular('DeltaLossPi', average_only=True)
        logger.log_tabular('DeltaLossV', average_only=True)
        logger.log_tabular('KL', average_only=True)
        if algo=='trpo':
            logger.log_tabular('BacktrackIters', average_only=True)
        logger.log_tabular('Time', time.time()-start_time)
        logger.dump_tabular()


if __name__ == '__main__':
    import argparse
    parser = argparse.ArgumentParser()
    parser.add_argument('--env', type=str, default='HalfCheetah-v2')
    parser.add_argument('--hid', type=int, default=64)
    parser.add_argument('--l', type=int, default=2)
    parser.add_argument('--gamma', type=float, default=0.99)
    parser.add_argument('--seed', '-s', type=int, default=0)
    parser.add_argument('--cpu', type=int, default=4)
    parser.add_argument('--steps', type=int, default=4000)
    parser.add_argument('--epochs', type=int, default=50)
    parser.add_argument('--exp_name', type=str, default='trpo')
    args = parser.parse_args()

    mpi_fork(args.cpu)  # run parallel code with mpi

    from spinup.utils.run_utils import setup_logger_kwargs
    logger_kwargs = setup_logger_kwargs(args.exp_name, args.seed)

    trpo(lambda : gym.make(args.env), actor_critic=core.mlp_actor_critic,
         ac_kwargs=dict(hidden_sizes=[args.hid]*args.l), gamma=args.gamma,
         seed=args.seed, steps_per_epoch=args.steps, epochs=args.epochs,
         logger_kwargs=logger_kwargs)

The core.py code is as follows:

import numpy as np
import tensorflow as tf
import scipy.signal
from gym.spaces import Box, Discrete

EPS = 1e-8

def combined_shape(length, shape=None):
    if shape is None:
        return (length,)
    return (length, shape) if np.isscalar(shape) else (length, *shape)

def keys_as_sorted_list(dict):
    return sorted(list(dict.keys()))

def values_as_sorted_list(dict):
    return [dict[k] for k in keys_as_sorted_list(dict)]

def placeholder(dim=None):
    return tf.placeholder(dtype=tf.float32, shape=combined_shape(None,dim))

def placeholders(*args):
    return [placeholder(dim) for dim in args]

def placeholder_from_space(space):
    if isinstance(space, Box):
        return placeholder(space.shape)
    elif isinstance(space, Discrete):
        return tf.placeholder(dtype=tf.int32, shape=(None,))
    raise NotImplementedError

def placeholders_from_spaces(*args):
    return [placeholder_from_space(space) for space in args]

def mlp(x, hidden_sizes=(32,), activation=tf.tanh, output_activation=None):
    for h in hidden_sizes[:-1]:
        x = tf.layers.dense(x, units=h, activation=activation)
    return tf.layers.dense(x, units=hidden_sizes[-1], activation=output_activation)

def get_vars(scope=''):
    return [x for x in tf.trainable_variables() if scope in x.name]

def count_vars(scope=''):
    v = get_vars(scope)
    return sum([np.prod(var.shape.as_list()) for var in v])
def gaussian_likelihood(x, mu, log_std):
    pre_sum = -0.5 * (((x-mu)/(tf.exp(log_std)+EPS))**2 + 2*log_std + np.log(2*np.pi))
    return tf.reduce_sum(pre_sum, axis=1)

def diagonal_gaussian_kl(mu0, log_std0, mu1, log_std1):
    """
    tf symbol for mean KL divergence between two batches of diagonal gaussian distributions,
    where distributions are specified by means and log stds.
    (https://en.wikipedia.org/wiki/Kullback-Leibler_divergence#Multivariate_normal_distributions)
    """
    var0, var1 = tf.exp(2 * log_std0), tf.exp(2 * log_std1)
    pre_sum = 0.5*(((mu1- mu0)**2 + var0)/(var1 + EPS) - 1) +  log_std1 - log_std0
    all_kls = tf.reduce_sum(pre_sum, axis=1)
    return tf.reduce_mean(all_kls)

def categorical_kl(logp0, logp1):
    """
    tf symbol for mean KL divergence between two batches of categorical probability distributions,
    where the distributions are input as log probs.
    """
    all_kls = tf.reduce_sum(tf.exp(logp1) * (logp1 - logp0), axis=1)
    return tf.reduce_mean(all_kls)

def flat_concat(xs):
    return tf.concat([tf.reshape(x,(-1,)) for x in xs], axis=0)

def flat_grad(f, params):
    return flat_concat(tf.gradients(xs=params, ys=f))

def hessian_vector_product(f, params):
    # for H = grad**2 f, compute Hx
    g = flat_grad(f, params)
    x = tf.placeholder(tf.float32, shape=g.shape)
    return x, flat_grad(tf.reduce_sum(g*x), params)

def assign_params_from_flat(x, params):
    flat_size = lambda p : int(np.prod(p.shape.as_list())) # the 'int' is important for scalars
    splits = tf.split(x, [flat_size(p) for p in params])
    new_params = [tf.reshape(p_new, p.shape) for p, p_new in zip(params, splits)]
    return tf.group([tf.assign(p, p_new) for p, p_new in zip(params, new_params)])

def discount_cumsum(x, discount):
    """
    magic from rllab for computing discounted cumulative sums of vectors.

    input:
        vector x,
        [x0,
         x1,
         x2]

    output:
        [x0 + discount * x1 + discount^2 * x2,
         x1 + discount * x2,
         x2]
    """
    return scipy.signal.lfilter([1], [1, float(-discount)], x[::-1], axis=0)[::-1]

"""
Policies
"""
def mlp_categorical_policy(x, a, hidden_sizes, activation, output_activation, action_space):
    act_dim = action_space.n
    logits = mlp(x, list(hidden_sizes)+[act_dim], activation, None)
    logp_all = tf.nn.log_softmax(logits)
    pi = tf.squeeze(tf.multinomial(logits,1), axis=1)
    logp = tf.reduce_sum(tf.one_hot(a, depth=act_dim) * logp_all, axis=1)
    logp_pi = tf.reduce_sum(tf.one_hot(pi, depth=act_dim) * logp_all, axis=1)

    old_logp_all = placeholder(act_dim)
    d_kl = categorical_kl(logp_all, old_logp_all)

    info = {'logp_all': logp_all}
    info_phs = {'logp_all': old_logp_all}

    return pi, logp, logp_pi, info, info_phs, d_kl


def mlp_gaussian_policy(x, a, hidden_sizes, activation, output_activation, action_space):
    act_dim = a.shape.as_list()[-1]
    mu = mlp(x, list(hidden_sizes)+[act_dim], activation, output_activation)
    log_std = tf.get_variable(name='log_std', initializer=-0.5*np.ones(act_dim, dtype=np.float32))
    std = tf.exp(log_std)
    pi = mu + tf.random_normal(tf.shape(mu)) * std
    logp = gaussian_likelihood(a, mu, log_std)
    logp_pi = gaussian_likelihood(pi, mu, log_std)

    old_mu_ph, old_log_std_ph = placeholders(act_dim, act_dim)
    d_kl = diagonal_gaussian_kl(mu, log_std, old_mu_ph, old_log_std_ph)

    info = {'mu': mu, 'log_std': log_std}
    info_phs = {'mu': old_mu_ph, 'log_std': old_log_std_ph}

    return pi, logp, logp_pi, info, info_phs, d_kl


"""
Actor-Critics
"""
def mlp_actor_critic(x, a, hidden_sizes=(64,64), activation=tf.tanh,
                     output_activation=None, policy=None, action_space=None):

    # default policy builder depends on action space
    if policy is None and isinstance(action_space, Box):
        policy = mlp_gaussian_policy
    elif policy is None and isinstance(action_space, Discrete):
        policy = mlp_categorical_policy

    with tf.variable_scope('pi'):
        policy_outs = policy(x, a, hidden_sizes, activation, output_activation, action_space)
        pi, logp, logp_pi, info, info_phs, d_kl = policy_outs
    with tf.variable_scope('v'):
        v = tf.squeeze(mlp(x, list(hidden_sizes)+[1], activation, None), axis=1)
    return pi, logp, logp_pi, info, info_phs, d_kl, v

References:

[1]. Trust Region Policy Optimization: https://arxiv.org/pdf/1502.05477.pdf

[2]. https://medium.com/@jonathan_hui/rl-the-math-behind-trpo-ppo-d12f6c745f33

[3]. https://ai.yanxishe.com/page/TextTranslation/1419

[4]. https://blog.csdn.net/weixin_37895339/article/details/83044731

[5]. https://en.wikipedia.org/wiki/MM_algorithm

[6]. https://www.youtube.com/watch?v=wSjQ4LVUgoU

[7]. https://zhuanlan.zhihu.com/p/26308073

[8]. https://medium.com/@jonathan_hui/rl-trust-region-policy-optimization-trpo-explained-a6ee04eeeee9

[9]. https://medium.com/@jonathan_hui/rl-natural-policy-gradient-actor-critic-using-kronecker-factored-trust-region-acktr-58f3798a4a93

[10]. KL divergence: https://en.wikipedia.org/wiki/Kullback%E2%80%93Leibler_divergence

[11]. https://spinningup.readthedocs.io/zh_CN/latest/_modules/spinup/algos/trpo/trpo.html#trpo
