Offline Reinforcement Learning: Tutorial, Review, and Perspectives on Open Problems

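Published: 2020
Key points: This paper surveys current research progress in offline RL, the problems that arise, and some approaches to solving them.
The authors first cover the reinforcement learning background, such as policy gradients, approximate dynamic programming, actor-critic algorithms, and model-based reinforcement learning, which I will not go into here. They then turn to offline RL. Compared with the online setting, the main difference is that we only have a static dataset and cannot interact with the environment to collect new data, so offline RL rules out exploration and the policy has to be learned from this dataset alone. This setting is relevant to many real applications, for example those mentioned in the paper: decision making in health care, learning goal-directed dialogue policies, and learning robotic manipulation skills. Below is a rough summary of what each section covers.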

2.4 What Makes Offline Reinforcement Learning Difficult
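The paper discusses several problems that offline RL faces, of which distributional shift is a central challenge. The issue is that, while the policy is trained to maximize return, the policy the agent learns differs from the policy that collected the data, so there is a shift between the two (distributional shift: while our function approximator (policy, value function, or model) might be trained under one distribution, it will be evaluated on a different distribution, due both to the change in visited states for the new policy and, more subtly, by the act of maximizing the expected return.).
The authors also argue theoretically that even when optimal action labels are available, the error in the offline setting grows quadratically with the time horizon, whereas in the online setting it grows linearly.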

The usual remedy is to constrain the distance between the learned policy and the behavior policy so that the two do not differ too much.

3 Offline Evaluation and Reinforcement Learning via Importance Sampling
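Next the authors describe the role of importance sampling in offline RL, mainly in the context of policy gradient methods.
The first use is offline evaluation. To learn a policy we usually need to know how to evaluate one, since evaluation tells us which policy is best. One way to do this is importance sampling, which reweights returns collected under the behavior policy so that they estimate the return of the policy being evaluated.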
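As a rough illustration of importance-sampling evaluation, here is a minimal sketch of the ordinary and weighted (self-normalized) trajectory-level estimators. The function name `is_estimates` and the tuple layout `(pi_prob, beta_prob, reward)` are my own assumptions for the example, not something taken from the paper.

```python
import numpy as np

def is_estimates(trajectories, gamma=0.99):
    """Return (ordinary, weighted) importance-sampling estimates of J(pi).

    Hypothetical data layout: each trajectory is a list of
    (pi_prob, beta_prob, reward) tuples, where pi_prob and beta_prob are the
    probabilities that the evaluated policy and the behavior policy assign
    to the logged action at that step.
    """
    weights, returns = [], []
    for traj in trajectories:
        w, g = 1.0, 0.0
        for t, (pi_prob, beta_prob, reward) in enumerate(traj):
            w *= pi_prob / beta_prob      # product of per-action ratios
            g += gamma ** t * reward      # discounted return of the trajectory
        weights.append(w)
        returns.append(g)
    weights = np.asarray(weights)
    returns = np.asarray(returns)
    ordinary = float(np.mean(weights * returns))                    # unbiased, high variance
    weighted = float(np.sum(weights * returns) / np.sum(weights))   # biased, lower variance
    return ordinary, weighted
```

Because the weight is a product over all time steps, it can become extremely large or small on long trajectories, which is where the variance problem discussed in this section comes from.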

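The authors then present several variance reduction techniques, which I will not detail here.
The other use is the off-policy policy gradient, which combines policy gradient with importance sampling to learn a policy directly on the static dataset.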
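For reference, the importance-weighted policy gradient is commonly written in the following form. The notation is assumed here: \(\pi_\theta\) is the learned policy, \(\pi_\beta\) the behavior policy, \(H\) the horizon, and \(\hat{A}\) an advantage (or return) estimate; the paper's exact notation may differ.

```latex
\nabla_\theta J(\pi_\theta)
  = \mathbb{E}_{\tau \sim \pi_\beta}\left[
      \left( \prod_{t=0}^{H} \frac{\pi_\theta(a_t \mid s_t)}{\pi_\beta(a_t \mid s_t)} \right)
      \sum_{t=0}^{H} \gamma^{t} \, \nabla_\theta \log \pi_\theta(a_t \mid s_t) \, \hat{A}(s_t, a_t)
    \right]
```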

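To keep the learned policy from drifting too far from the behavior policy, one can add constraints such as a KL-divergence term or an entropy regularizer.
Because policy gradient needs importance sampling over whole trajectories (a product of per-action importance weights), and the dataset contains only a finite number of samples, the variance problem gets worse. One possible fix is approximate off-policy policy gradients, which replace the trajectory-level correction with sampling from the state distribution, or even drop the importance weight entirely. This introduces error, but in practice a small amount of error turns out to be acceptable and the final results are good.
If one does not want to introduce this error, another option is marginalized importance sampling, which directly estimates the state-marginal importance ratio instead of using per-action importance weights.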

The main problem in this part is still the variance introduced by importance sampling, which is more severe in the offline setting. The maximum improvement that can be reliably obtained via importance sampling is limited by (i) the suboptimality of the behavior policy; (ii) the dimensionality of the state and action space; (iii) the effective horizon of the task.

4 Offline Reinforcement Learning via Dynamic Programming
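The next topic is dynamic programming, which mainly concerns Q-learning. The main problem here is again distributional shift, and the usual approaches are policy constraint methods and uncertainty-based methods. The former constrain the distance between the learned policy and the behavior policy; the latter estimate the uncertainty of the Q-values and use it to detect whether distributional shift is present.
The paper first covers value estimation methods, which differ little from the online case: Bellman residual minimization, least-squares fixed point approximation, least squares temporal difference Q-learning (LSTD-Q), and least squares policy iteration (LSPI).
It then discusses distributional shift in detail. The authors' point is that because the learned policy differs too much from the behavior policy, the learned policy often ends up choosing actions in states it has never seen, so its performance cannot be guaranteed. They further use an experiment to show that the problem is not alleviated even when the amount of data is increased, which indicates that it is not caused by overfitting.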

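Policy constraint methods come next: explicit f-divergence constraints, implicit f-divergence constraints, and integral probability metric (IPM) constraints. These can be applied either directly as policy constraints or as a policy penalty added to the reward. There is an intuitive criterion here: Intuitively, an effective policy constraint should prevent the learned policy \(\pi(a|s)\) from going outside the set of actions that have a high probability in the data, but would not prevent it from concentrating around a subset of high-probability actions.
The methods above all constrain the distance between the policies, which can make it hard to learn a good policy. An alternative is to constrain the support of the two policies. The idea is that the learned policy does not need to be close to the behavior policy; it only needs to pick actions that appear in the data, so it is enough to constrain the probability of out-of-distribution (OOD) actions.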
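To make the difference between a distance constraint and a support constraint concrete, here is a minimal sketch for a discrete action space. It assumes we already have an estimate of the behavior policy's action probabilities in a state; the function name `support_constrained_action` and the threshold `eps` are hypothetical and only serve the illustration.

```python
import numpy as np

def support_constrained_action(q_values, behavior_probs, eps=0.05):
    """Pick the greedy action among actions the behavior policy supports.

    q_values:       Q(s, a) for every action a in one state.
    behavior_probs: estimated behavior-policy probability of each action.
    eps:            actions with probability below eps are treated as OOD.
    """
    q = np.asarray(q_values, dtype=float)
    beta = np.asarray(behavior_probs, dtype=float)
    # Mask out-of-support actions so they can never be selected.
    masked = np.where(beta >= eps, q, -np.inf)
    # If every action is masked, argmax falls back to index 0.
    return int(np.argmax(masked))

# Action 0 has the largest Q-value but never appears in the data,
# so the constrained choice is the best in-support action instead.
best = support_constrained_action([5.0, 1.0, 2.0], [0.0, 0.6, 0.4], eps=0.1)
```

Note that the chosen action here has lower behavior probability than action 1: the learned policy is free to concentrate on any in-support action, which a distance constraint would discourage.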

The uncertainty-based approach is covered next. Roughly, the idea is to take uncertainty into account when estimating Q.
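One way to picture this, as an assumption of mine rather than the specific method in the paper, is to keep an ensemble of Q-functions and subtract a multiple of their disagreement from the mean, giving a pessimistic value for each action. The name `conservative_q` and the coefficient `alpha` are hypothetical.

```python
import numpy as np

def conservative_q(q_ensemble, alpha=1.0):
    """Lower-confidence-bound Q-values from an ensemble.

    q_ensemble: array of shape (n_models, n_actions) holding each ensemble
                member's Q(s, a) for one state.
    alpha:      how strongly disagreement (std across members) is penalized.
    """
    q = np.asarray(q_ensemble, dtype=float)
    return q.mean(axis=0) - alpha * q.std(axis=0)

# Two members agree on action 0 but disagree strongly on action 1,
# so action 1 loses its apparent advantage after the penalty.
lcb = conservative_q([[1.0, 0.0], [1.0, 4.0]], alpha=1.0)
```

Choosing `alpha` is exactly where the difficulty mentioned later shows up: too large gives overly conservative estimates, too small gives overly loose ones.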

Another approach is to add a regularization term to the objective, for example a penalty.
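The notes do not say which penalty is meant. As one possible instance, the sketch below assumes a conservative, CQL-style regularizer for discrete actions: it pushes down a soft maximum of Q over all actions while pushing up Q on the actions that actually appear in the dataset. The function name `conservative_penalty` is hypothetical, and in practice this term would be added to the usual Bellman error with some weight.

```python
import numpy as np

def conservative_penalty(q_values, data_actions):
    """Mean over the batch of logsumexp_a Q(s, a) - Q(s, a_data).

    q_values:     array of shape (batch, n_actions).
    data_actions: integer array of shape (batch,), the logged actions.
    """
    q = np.asarray(q_values, dtype=float)
    idx = np.asarray(data_actions)
    # Numerically stable log-sum-exp over the action dimension.
    m = q.max(axis=1, keepdims=True)
    lse = m[:, 0] + np.log(np.exp(q - m).sum(axis=1))
    q_data = q[np.arange(len(q)), idx]
    return float(np.mean(lse - q_data))
```

The penalty is small when the logged action already has the highest Q-value and large when some other, possibly unseen, action has been assigned a much higher value.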

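The problem with these approaches is that uncertainty is hard to estimate, which can lead to estimates that are either overly conservative or overly loose.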

5 Offline Model-Based Reinforcement Learning
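This section covers the model-based case. The main problem is still model exploitation and distribution shift, where the model-free counterpart was value exploitation. The main remedies are again to add a constraint or penalty, to estimate model uncertainty, and so on.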

6 Applications and Evaluation
This section covers applications and benchmarks. For benchmarks it mainly discusses D4RL; for applications it covers robotics, healthcare, autonomous driving, advertising and recommender systems, and language and dialogue.

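Summary: My understanding of this area is still not deep enough. The paper covers a lot of ground, but after reading it I did not have a moment where everything clicked, and I have not yet grasped its overall logic.
Questions: The paper feels fairly hard to read and I did not follow some of the conclusions. In particular, I did not understand the part about the forward/backward Bellman equation at all.
I also do not fully understand the ordering of the sections. Some content seems to be repeated, for example where sections 3 and 4 describe specific methods, so I may not yet understand why the material is divided this way.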
