Reinforcement Learning in Robotics --- An Overview

Reinforcement learning (RL) is a subfield of machine learning in which an agent learns by interacting with its environment, observing the outcomes of those interactions, and receiving corresponding rewards. This style of learning mimics the way humans and animals learn.

We humans have direct sensory contact with our environment: we can perform actions and witness the effects they produce. This idea of "cause and effect" is, without doubt, the key to how we build up an understanding of our environment over a lifetime. This article introduces the application of reinforcement learning in robotics from the following angles:

  • Where reinforcement learning is currently applied
  • Why reinforcement learning is closely related to robotics
  • A brief introduction to reinforcement learning for robots
    • Function approximation
  • Challenges of reinforcement learning in robotics
    • The curse of dimensionality
    • The curse of real-world samples
    • The curse of under-modeling and model uncertainty
  • Principles of reinforcement learning in robotics
    • Effective representations
    • Approximate models
    • Prior knowledge or information

Where reinforcement learning is currently applied

Many problems have already been tackled with reinforcement learning. Because an RL agent does not need to be supervised by an expert, RL is best suited to problems that are complex and have no obvious, easily engineered procedural solution, for example:

  • Game playing – choosing the best move in a game often depends on many factors, especially when the number of possible states in a particular game is very large. Covering that many states with a traditional approach would require an enormous number of hand-crafted rules. RL removes the need for manually specified rules: the agent learns the game simply by playing it. For a two-player game such as backgammon, the agent can be trained by playing against human players or against other RL agents.
  • Control problems – for example elevator scheduling. Again, it is not obvious what policy provides the best, most time-efficient elevator service. For control problems like this, an RL agent can learn in a simulated environment and eventually arrive at a well-performing control policy. An advantage of using RL for control is that the agent can keep training continuously, adapting easily to changes in the environment and steadily improving its performance.

A good recent example is DeepMind's paper Human-level control through deep reinforcement learning.

Why reinforcement learning is closely related to robotics

J. Kober, J. Andrew (Drew) Bagnell, and J. Peters point out in Reinforcement Learning in Robotics: A Survey:

Reinforcement learning offers to robotics a framework and set of tools for the design of sophisticated and hard-to-engineer behaviors.

At the same time, reinforcement learning is gradually becoming a ubiquitous tool in the real world. Generally speaking, problems that are complex for humans may be easy for a robot to solve, while problems that are simple for us humans can be very hard for a robot. In other words:

What’s complex for a human, robots can do easily and vice versa - Víctor Mayoral Vilches

As a simple example, imagine that on our desk there is a three-joint manipulator performing some repetitive task. Traditionally, a robotics engineer faced with such a task would either program the whole application from scratch or use the tools already provided by the manufacturer to program this particular use case. Regardless of the complexity of the tool and the task, we would still have to deal with:

  • Errors at each motor (joint) introduced when solving the Inverse Kinematics
  • The accuracy of the model used when closing the control loop
  • Designing the whole control pipeline
  • Frequent, procedural calibration

All of this effort goes into making the robot produce one deterministic motion in a controlled environment.
But the truth is: the real world is not controllable.

Problems in robotics are often best represented with high-dimensional, continuous states and actions (note that the 10-30 dimensional continuous actions common in robot reinforcement learning are considered large (Powell, 2012)). In robotics, it is often unrealistic to assume that the true state is completely observable and noise-free.

Returning to the article by J. Kober et al., Reinforcement Learning in Robotics: A Survey:

Reinforcement learning (RL) enables a robot to autonomously discover an optimal behavior through trial-and-error interactions with its environment. Instead of explicitly detailing the solution to a problem, in reinforcement learning the designer of a control task provides feedback in terms of a scalar objective function that measures the one-step performance of the robot.

This makes a lot of sense. Take shooting a basketball as an example:

  1. I get myself behind the 3-point line and get ready for a shot.
  2. At this point, my consciousness has no information whatsoever about the exact distance to the basket, nor about the strength I should use to make the shot, so my brain produces an estimate based on the model I have (built upon years of trial-and-error shots).
  3. With this estimate, I produce a shot. Let's assume I miss, which I notice through my eyes (sensors). Again, the information perceived through my eyes is not accurate, so what I pass to my brain is not "I missed the shot by 5.45 cm to the right" but more like "The shot was slightly too far to the right and I missed". This information updates the model in my brain, which receives a negative reward. We could debate why my estimate failed. Was it because the model is wrong, despite the fact that I've made hundreds of 3-pointers before with that exact model? Was it because of the wind (playing outdoors, generally)? Or was it because I didn't eat properly that morning? It could easily be all of those or none of them, but the fact is that many of these aspects can't really be controlled by me. So I keep iterating.
  4. With the updated model, I make another shot; if it fails, I go back to step 2, and if I make it, I proceed to step 5.
  5. Making a shot means that my model did a good job, so my brain strengthens the links that produced the proper shot by giving them a positive reward.

A brief introduction to reinforcement learning for robots

The goal of reinforcement learning is to find a mapping from states to actions, called a policy π, that selects an action a in a given state x so as to maximize the cumulative expected reward. In other words, reinforcement learning seeks an optimal policy π, a mapping from the state (or observation) space to the action space, that maximizes the expected return J:
$$J(\pi) = \mathbb{E}\left[R(\tau \mid \pi)\right] = \int R(\tau)\, p_\pi(\tau)\, \mathrm{d}\tau$$
where $p_\pi(\tau)$ denotes the distribution over trajectories $\tau = (x_0, a_0, x_1, a_1, \ldots)$ and $R(\tau)$ is the accumulated discounted reward along the trajectory:
$$R(\tau) = \sum_{t=0}^{\infty} \gamma^t\, r(x_t, a_t)$$
where $\gamma \in [0, 1)$ is the discount factor. With this mathematical formulation, many robotic tasks can be naturally formalized as reinforcement learning problems. Traditional reinforcement learning approaches typically estimate, for a given policy $\pi$, the long-term expected return of each state $x$ at time step $t$, known as the value function $V^\pi_t(x)$. Value-function methods are sometimes called *"critic-only"* methods: the idea is to first observe and evaluate the performance of the chosen controls (the value function), and then derive a policy from that knowledge. In contrast, *policy search* methods infer the optimal policy $\pi$ directly and are therefore sometimes called *actor-only* methods.
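To make the two formulas above concrete, here is a minimal sketch that rolls out a policy, computes the discounted return $R(\tau)$ of each trajectory, and averages those returns to get a Monte Carlo estimate of $J(\pi)$. The dynamics, reward, and policy below are toy examples invented for illustration, not anything from the survey.

```python
import numpy as np

GAMMA = 0.98          # discount factor gamma in [0, 1)
HORIZON = 200         # truncate the (formally infinite) sum at a finite horizon
N_TRAJECTORIES = 100  # number of sampled trajectories tau ~ p_pi(tau)

def policy(x):
    """Toy deterministic policy pi(x): a hand-tuned linear feedback law (illustrative only)."""
    return -1.5 * x[0] - 0.5 * x[1]

def step(x, a):
    """Toy dynamics and reward: a noisy point mass we want to keep near the origin."""
    pos, vel = x
    vel = vel + 0.05 * a + np.random.normal(0.0, 0.01)
    pos = pos + 0.05 * vel
    reward = -(pos ** 2 + 0.1 * vel ** 2)          # r(x_t, a_t)
    return np.array([pos, vel]), reward

def discounted_return(x0):
    """R(tau) = sum_t gamma^t * r(x_t, a_t) along one rollout starting from x0."""
    x, ret = x0, 0.0
    for t in range(HORIZON):
        a = policy(x)
        x, r = step(x, a)
        ret += (GAMMA ** t) * r
    return ret

# Monte Carlo estimate of J(pi) = E[R(tau | pi)]: average the return over sampled trajectories.
returns = [discounted_return(np.random.uniform(-1, 1, size=2)) for _ in range(N_TRAJECTORIES)]
print("Estimated J(pi) ≈", np.mean(returns))
```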

Function approximation

Function approximation is a family of mathematical and statistical techniques for representing a function of interest when it is computationally or information-theoretically intractable to represent it exactly or completely. As J. Kober et al. put it:

Typically, in reinforcement learning the function approximation is based on sample data collected during interaction with the environment. Function approximation is critical in nearly every RL problem, and becomes inevitable in continuous state ones. In large discrete spaces it is also often impractical to visit or even represent all states and actions, and function approximation in this setting can be used as a means to generalize to neighboring states and actions. Function approximation can be employed to represent policies, value functions, and forward models.
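As a rough illustration of what function approximation means here, the sketch below (the feature choice, constants, and transitions are my own invention, not from the survey) approximates a value function over a continuous 1-D state using radial-basis features and a TD(0) update, instead of storing one table entry per state.

```python
import numpy as np

# Radial-basis features over a continuous 1-D state: the approximator generalizes
# between neighboring states instead of storing a separate table entry for each one.
CENTERS = np.linspace(-1.0, 1.0, 10)
WIDTH = 0.2

def features(x):
    return np.exp(-((x - CENTERS) ** 2) / (2 * WIDTH ** 2))

def value(x, w):
    """V(x) ≈ w^T phi(x): a linear value-function approximator."""
    return features(x) @ w

ALPHA, GAMMA = 0.05, 0.95
w = np.zeros_like(CENTERS)

def td0_update(w, x, r, x_next):
    """One TD(0) step: move V(x) toward the bootstrapped target r + gamma * V(x_next)."""
    td_error = r + GAMMA * value(x_next, w) - value(x, w)
    return w + ALPHA * td_error * features(x)

# Usage with a few made-up transitions (x, r, x_next) that would come from interaction:
for x, r, x_next in [(0.3, -0.09, 0.25), (0.25, -0.06, 0.18), (-0.5, -0.25, -0.4)]:
    w = td0_update(w, x, r, x_next)
print("V(0.2) ≈", value(0.2, w))
```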

Challenges of reinforcement learning in robotics

The curse of dimensionality

Bellman coined the term *Curse of Dimensionality* in 1957, when he observed in optimal control that exploring the states and actions of discretized high-dimensional spaces quickly becomes intractable. As the number of dimensions grows, covering the whole state-action space requires exponentially more data and computation. For example, when controlling a 7 degree-of-freedom robot arm, the state has to represent the joint angle and velocity of each degree of freedom, plus the Cartesian position and velocity of the *end effector*:
$$\text{state dimensions} = 2 \times (7 + 3) = 20, \qquad \text{action dimensions} = 7$$
If each state dimension is discretized into just 10 levels, this robot already has $10^{20}$ distinct states.
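For a quick sanity check of that number, a three-line computation (illustrative only):

```python
state_dims = 2 * (7 + 3)     # angle + velocity for 7 joints, plus Cartesian position + velocity (3 + 3)
levels = 10                  # discretization levels per state dimension
print(levels ** state_dims)  # 100000000000000000000, i.e. 10**20 distinct discrete states
```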

The curse of real-world samples

As J. Kober et al. point out, robot reinforcement learning has to deal with the realities of the physical world: robot hardware is usually expensive, it suffers wear and tear, it has to be maintained carefully, and repairing it costs money, physical labor, and long waiting periods. Kober's paper gives several examples of this curse of real-world samples, mainly in the following respects:
  1. Applying reinforcement learning in robotics demands safe exploration which becomes a key issue of the learning process, a problem often neglected in the general reinforcement learning community (due to the use of simulated environments).
  2. While learning, the dynamics of a robot can change due to many external factors ranging from temperature to wear thereby the learning process may never fully converge (i.e. how light conditions affect the performance of the vision system and, as a result, the task’s performance). This problem makes comparing algorithms particularly hard.
  3. Reinforcement learning algorithms are implemented on a digital computer where the discretization of time is unavoidable despite that physical systems are inherently continuous time systems. Time discretization of the actuation can generate undesirable artifacts (e.g., the distortion of distance between states) even for idealized physical systems, which cannot be avoided.

The curse of under-modeling and model uncertainty

An accurate simulation could, in principle, be used in place of interaction with the real world. For example, robot simulation nowadays commonly relies on Gazebo which, integrated with ROS, is compatible with several physics engines and gives robotics researchers a very useful tool.
Under ideal assumptions, this approach would let us learn the behavior in simulation and subsequently transfer it to the real robot. Unfortunately, building a sufficiently accurate model of the robot and its environment is challenging and requires collecting many data samples; even small modeling errors accumulate and drive the simulated robot apart from the real one.
Transferring from simulation to the real robot generally falls into two broad scenarios:
1. Tasks where the system is self-stabilizing (that is, where the robot does not require active control to remain in a safe state or return to it); here, transferring policies often works well.
2. Unstable tasks, where small variations have drastic consequences; in such scenarios transferred policies often perform poorly.

Principles of reinforcement learning in robotics

Given the challenges described above, one might conclude that reinforcement learning in robotics is doomed to fail. In practice, to make robot reinforcement learning perform well, we need to keep the following principles in mind:

  • Effective representations
  • Approximate models
  • Prior knowledge or information

Each of these principles is discussed in turn below; for more details, see J. Kober's survey.

Effective representations

Much of the success of reinforcement learning hinges on cleverly chosen approximate representations. Since tabular representations do not scale, the need for approximation is particularly pressing in robotics:
- Smart state-action discretization: Reducing the dimensionality of states or actions by smart state-action discretization is a representational simplification that may enhance both policy search and value function-based methods.
- Value function approximation: A value function-based approach requires an accurate and robust but general function approximator that can capture the value function with sufficient precision while maintaining stability during learning (e.g. ANNs).
- Pre-structured policies: Policy search methods require a choice of policy representation that controls the complexity of representable policies to enhance learning speed (see the sketch after this list).
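To give a feel for the third point, here is a minimal sketch of a pre-structured policy (my own illustrative example, not code from the survey): instead of searching over arbitrary mappings from a 20-dimensional state to 7 torques, the policy is restricted to a small linear-Gaussian family, so policy search only has to optimize a modest parameter vector.

```python
import numpy as np

STATE_DIM, ACTION_DIM = 20, 7

class LinearGaussianPolicy:
    """Pre-structured policy a ~ N(W x, sigma^2 I): only W (and possibly sigma) is searched over."""
    def __init__(self, sigma=0.1):
        self.W = np.zeros((ACTION_DIM, STATE_DIM))   # the learnable parameters
        self.sigma = sigma                            # fixed exploration noise

    def act(self, x):
        return self.W @ x + self.sigma * np.random.randn(ACTION_DIM)

policy = LinearGaussianPolicy()
print("parameters to search over:", policy.W.size)   # 140 numbers instead of a huge table
```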

Approximate models

Experience collected in the real world can be used to learn a forward model from data (Åström and Wittenmark, 1989). We would like to drastically reduce the amount of learning done on the real robot, because learning in simulation is faster and safer. In robot reinforcement learning, learning in such a learned simulation is usually called *mental rehearsal*.

The core issues of mental rehearsal are:
1. Simulation biases
2. The complexity of the real world
3. Efficient optimization with samples from the simulated environment
These issues are typically addressed by approaches such as Iterative Learning Control, Value Function Methods with Learned Models, and Locally Linear Quadratic Regulators.

*Simulation biases*: Given how hard it is to obtain a forward model that is accurate enough to simulate a complex real-world robot system, many robot RL policies learned in simulation perform poorly on the real robot. This is known as simulation bias. It is analogous to over-fitting in supervised learning – that is, the algorithm does its job well on the model and the training data, respectively, but does not generalize well to the real system or novel data. It has been shown that simulation biases can be addressed by introducing stochastic models or distributions over models, even if the system is very close to deterministic.
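A very rough sketch of that last idea (everything here is invented for illustration, not code from the paper): instead of rehearsing on a single deterministic learned model, we rehearse on a small ensemble of perturbed models and score a policy by its average return across them, so a policy cannot exploit the quirks of one imperfect model.

```python
import numpy as np

def make_model(mass):
    """A learned forward model with an uncertain parameter (here: the mass)."""
    def model(x, a):
        pos, vel = x
        vel = vel + 0.05 * a / mass
        pos = pos + 0.05 * vel
        return np.array([pos, vel]), -(pos ** 2)
    return model

# Ensemble of models drawn from a distribution over the uncertain parameter.
models = [make_model(mass) for mass in np.random.normal(1.0, 0.1, size=5)]

def rehearsal_return(policy, model, horizon=100, gamma=0.98):
    """Discounted return of a mental-rehearsal rollout under one candidate model."""
    x, ret = np.array([1.0, 0.0]), 0.0
    for t in range(horizon):
        x, r = model(x, policy(x))
        ret += gamma ** t * r
    return ret

def robust_score(policy):
    """Score a policy by its average return across the model ensemble."""
    return np.mean([rehearsal_return(policy, m) for m in models])

print(robust_score(lambda x: -2.0 * x[0] - 1.0 * x[1]))
```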

Prior knowledge or information

Prior knowledge can significantly help the learning process. These approaches effectively reduce the search space and speed up learning:

  • Prior Knowledge Through Demonstration: Providing a (partially) successful initial policy allows a reinforcement learning method to focus on promising regions in the value function or in policy space.
  • Prior Knowledge Through Task Structuring: Pre-structuring a complex task such that it can be broken down into several more tractable ones can significantly reduce the complexity of the learning task.
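As a sketch of the first point (the demonstration data and regression choice are invented for illustration), a policy can be warm-started by fitting it to demonstrated state-action pairs before any reinforcement learning, so exploration starts from a partially successful behavior rather than from scratch.

```python
import numpy as np

# Made-up demonstration data: states and the actions a teacher took in them.
demo_states = np.random.randn(500, 20)
demo_actions = demo_states @ np.random.randn(20, 7) + 0.05 * np.random.randn(500, 7)

# Fit a linear policy a = W x to the demonstrations by least squares (behavior cloning).
W_init, *_ = np.linalg.lstsq(demo_states, demo_actions, rcond=None)

def initial_policy(x):
    """Warm-started policy handed to the RL algorithm as its starting point."""
    return x @ W_init

# RL then refines W_init instead of exploring the whole policy space from zero.
print("initial policy output for one state:", initial_policy(demo_states[0])[:3])
```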

References

  • J. Kober, J. A. Bagnell, and J. Peters, "Reinforcement Learning in Robotics: A Survey," The International Journal of Robotics Research, 32(11), 2013.
  • V. Mnih et al., "Human-level control through deep reinforcement learning," Nature, 518, 529–533, 2015.


By the way, Happy Lantern Festival
