Discovering and Achieving Goals via World Models

2022-05-05 13:32

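Published: 2021 (NeurIPS 2021)
Key points: This paper proposes the Latent Explorer Achiever (LEXA) algorithm, which trains an explorer policy and an achiever policy on imagined rollouts from a learned world model. Both policies are learned without supervision, and the result transfers zero-shot to other tasks. The advantage is that earlier exploration methods could only send the agent back to states it had already visited, whereas a world model plus an explorer can discover states that were never visited. That yields a diverse set of targets to train on, which is what makes unsupervised learning and zero-shot transfer possible (Unlike prior methods that explore by reaching previously visited states, the explorer plans to discover unseen surprising states through foresight).
Specifically, the explorer and the achiever are trained separately. The explorer first plans inside the model to find novel states, then executes that action sequence in the real environment to obtain real states, and those states become the achiever's training targets. Once training is done, the achiever can be used directly on other tasks (the achiever solves tasks specified as goal images zero-shot without any additional learning).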

The world model is trained as a Recurrent State Space Model (RSSM) (Learning Latent Dynamics for Planning from Pixels).

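The explorer is trained to maximize an exploration reward, which is obtained by estimating model uncertainty: an ensemble of models is trained to make 1-step predictions, and the variance of their predictions is used as the reward.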
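To make the ensemble-variance reward concrete, here is a minimal sketch in plain Python. It assumes each ensemble member is a function from a (state, action) pair to a predicted next-state vector; the names `disagreement_reward` and `make_linear_model` and the toy linear models are hypothetical and are not taken from the paper's code.

```python
from statistics import pvariance
from typing import Callable, List, Sequence

Vector = Sequence[float]
Model = Callable[[Vector, Vector], List[float]]


def disagreement_reward(models: Sequence[Model], state: Vector, action: Vector) -> float:
    """Mean per-dimension variance of the ensemble's 1-step predictions."""
    predictions = [model(state, action) for model in models]
    dims = len(predictions[0])
    per_dim_variance = [
        pvariance([prediction[d] for prediction in predictions]) for d in range(dims)
    ]
    return sum(per_dim_variance) / dims


def make_linear_model(weight: float) -> Model:
    """Hypothetical toy predictor: next_state = weight * state + action."""
    def predict(state: Vector, action: Vector) -> List[float]:
        return [weight * s + a for s, a in zip(state, action)]
    return predict


# Three toy models that agree at the origin and disagree far away from it.
ensemble = [make_linear_model(w) for w in (0.9, 1.0, 1.1)]
familiar = disagreement_reward(ensemble, [0.0, 0.0], [1.0, 1.0])
novel = disagreement_reward(ensemble, [5.0, -5.0], [1.0, 1.0])
```

In this toy setup the models agree exactly at the origin, so `familiar` is zero, while `novel` is positive because the predictions spread out away from the origin. A policy maximizing this reward is therefore pulled toward regions where the models disagree.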

Then, inside the model, a policy and a value function are learned with RL using this reward.

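This is the same approach as the Dreamer algorithm (DREAM TO CONTROL: LEARNING BEHAVIORS BY LATENT IMAGINATION). Afterwards, this policy samples trajectories in the real environment and stores them in the buffer, which is used to update the model.
The achiever's policy is then trained inside the model on goals sampled from the buffer.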
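Both policies are trained on imagined rollouts in the Dreamer style, where the value function is typically regressed onto a λ-return computed along each imagined trajectory. The sketch below shows one common recursive form of that target. Whether LEXA uses exactly this variant is an assumption, and the function name and default hyperparameters are illustrative only.

```python
from typing import List, Sequence


def lambda_returns(
    rewards: Sequence[float],
    values: Sequence[float],
    discount: float = 0.99,
    lam: float = 0.95,
) -> List[float]:
    """Recursive lambda-return over an imagined rollout of length H.

    rewards[t] is the reward at step t (t = 0..H-1).
    values[t] is the value estimate at step t (t = 0..H); values[H] bootstraps.
    """
    if len(values) != len(rewards) + 1:
        raise ValueError("values must have exactly one more entry than rewards")
    returns = [0.0] * len(rewards)
    next_return = values[-1]
    for t in reversed(range(len(rewards))):
        # Blend the one-step bootstrap with the longer-horizon return.
        next_return = rewards[t] + discount * (
            (1.0 - lam) * values[t + 1] + lam * next_return
        )
        returns[t] = next_return
    return returns


one_step = lambda_returns([1.0, 2.0], [0.0, 10.0, 20.0], discount=0.5, lam=0.0)
monte_carlo = lambda_returns([1.0, 1.0], [0.0, 0.0, 0.0], discount=0.5, lam=1.0)
```

With `lam=0.0` the target reduces to the one-step TD target, and with `lam=1.0` it becomes the discounted sum of rewards plus the final bootstrap value. Values in between trade bias against variance.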

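Here \(x_g\) is a goal sampled from the buffer of real-environment data, and the encoder maps it to an embedding \(e_g\). The reward for this task is a distance measure to the goal. One choice is the cosine similarity between the current state and the goal state.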
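A cosine-based goal reward can be sketched as follows. This assumes the current state and the goal are both available as plain float vectors in the same space. The function name `cosine_reward` and the `eps` guard are illustrative, and how the paper normalizes its inputs is not covered in these notes.

```python
import math
from typing import Sequence


def cosine_reward(
    state_embedding: Sequence[float],
    goal_embedding: Sequence[float],
    eps: float = 1e-8,
) -> float:
    """Cosine similarity between state and goal; higher means closer to the goal."""
    dot = sum(s * g for s, g in zip(state_embedding, goal_embedding))
    norm_state = math.sqrt(sum(s * s for s in state_embedding))
    norm_goal = math.sqrt(sum(g * g for g in goal_embedding))
    # eps guards against division by zero for all-zero vectors.
    return dot / max(norm_state * norm_goal, eps)


at_goal = cosine_reward([1.0, 0.0], [2.0, 0.0])
orthogonal = cosine_reward([1.0, 0.0], [0.0, 3.0])
opposite = cosine_reward([1.0, 0.0], [-1.0, 0.0])
```

The reward depends only on direction, not magnitude, so `at_goal` is 1 even though the two vectors have different lengths.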

Alternatively, it can be a learned quantity related to the number of steps needed to reach the goal state.
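One way such a step-count quantity could be learned is by regression on pairs of states drawn from the same trajectory, labeled with the number of steps between them. The sketch below only builds those training pairs. The normalization by trajectory length and the name `temporal_distance_pairs` are assumptions, and fitting a distance network to these labels is left out.

```python
import random
from typing import List, Sequence, Tuple, TypeVar

State = TypeVar("State")


def temporal_distance_pairs(
    trajectory: Sequence[State],
    num_pairs: int,
    rng: random.Random,
) -> List[Tuple[State, State, float]]:
    """Sample (state, later_state, normalized_steps_between) regression targets."""
    if len(trajectory) < 2:
        raise ValueError("need at least two states to define a distance")
    horizon = len(trajectory) - 1
    pairs = []
    for _ in range(num_pairs):
        start = rng.randrange(len(trajectory))
        end = rng.randrange(start, len(trajectory))  # end is never before start
        pairs.append((trajectory[start], trajectory[end], (end - start) / horizon))
    return pairs


# Toy trajectory whose "states" are just their own time indices.
toy_trajectory = list(range(6))
pairs = temporal_distance_pairs(toy_trajectory, num_pairs=20, rng=random.Random(0))
```

A network trained on such labels predicts how far apart two states are in time, and the negative of that prediction can then serve as the achiever's reward.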

Training is again done with Dreamer. The overall algorithm is as follows.
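The control flow described in these notes can be sketched as a single loop. This is a reconstruction from the description above, not the paper's algorithm listing. Every callable is a hypothetical placeholder, and the ordering of steps within an iteration is an assumption.

```python
import random
from typing import Callable, List


def train_lexa_sketch(
    num_iterations: int,
    collect_with_explorer: Callable[[], List[object]],
    update_world_model: Callable[[List[object]], None],
    train_explorer: Callable[[], None],
    train_achiever: Callable[[object], None],
    rng: random.Random,
) -> List[object]:
    """Alternate real-environment data collection with training in imagination."""
    replay_buffer: List[object] = []
    for _ in range(num_iterations):
        # 1. The explorer acts in the real environment; its states go to the buffer.
        #    (This sketch assumes every collection returns at least one state.)
        replay_buffer.extend(collect_with_explorer())
        # 2. The world model is refit on everything collected so far.
        update_world_model(replay_buffer)
        # 3. The explorer is trained in imagination on the disagreement reward.
        train_explorer()
        # 4. The achiever is trained in imagination toward a goal from the buffer.
        train_achiever(rng.choice(replay_buffer))
    return replay_buffer


# Toy stand-ins that only record the order in which they are called.
call_log: List[str] = []
sampled_goals: List[object] = []
state_ids = iter(range(1000))


def fake_collect() -> List[object]:
    call_log.append("collect")
    return [next(state_ids), next(state_ids)]


def fake_update_model(buffer: List[object]) -> None:
    call_log.append("model")


def fake_train_explorer() -> None:
    call_log.append("explorer")


def fake_train_achiever(goal: object) -> None:
    call_log.append("achiever")
    sampled_goals.append(goal)


buffer = train_lexa_sketch(
    3, fake_collect, fake_update_model, fake_train_explorer, fake_train_achiever,
    rng=random.Random(0),
)
```

Passing the components in as callables keeps the loop independent of any particular world model or RL library. At test time only the achiever is needed, conditioned on a user-supplied goal.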

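Summary: A very nice idea. The main benefit is that it can explore states that have never been visited, whereas earlier exploration methods first had to reach a state and then used added reward to get back to it. One is foresight, the other hindsight. It also achieves unsupervised learning directly: the explorer finds new goals and the achiever learns the optimal policy, which gives zero-shot transfer and is a further step forward. I had also thought about using model uncertainty to find new states, acting in the environment to explore them, and then updating in a DQN-like way. Evidently many people arrive at the same idea.
Question: This feels copied from Planning to Explore via Self-Supervised World Models; the two are practically identical. Looking closely at the authors, it turns out to be the same group. So why publish this as two separate top-conference papers? If there has to be a difference, is it that the earlier paper only achieved few-shot while this one achieves zero-shot? Or that the earlier paper did not strictly separate the explorer from the achiever?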
