Reinforcement Learning: Q-Learning

I learned this from 莫烦 (Morvan)'s videos on Bilibili.

Main formula

Q(s,a) \Leftarrow Q(s,a) + \alpha \left[ r + \gamma \max_{a'} Q(s',a') - Q(s,a) \right]
As I understand it, the main work of the Q-Learning algorithm revolves around this update formula: what the algorithm learns is stored in the Q table, which represents the knowledge accumulated during training.
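
To make the update concrete, here is a minimal sketch of a single update step written out in plain Python; the reward and next-state values below are made up purely for illustration.

ALPHA = 0.1   # learning rate alpha
GAMMA = 0.9   # discount factor gamma

q_sa = 0.0         # current estimate of Q(s, a)
r = 0.0            # reward observed after taking action a in state s
q_next_max = 0.5   # max over a' of Q(s', a') in the next state (made-up value)

# Q(s,a) <- Q(s,a) + alpha * [r + gamma * max_a' Q(s',a') - Q(s,a)]
q_sa = q_sa + ALPHA * (r + GAMMA * q_next_max - q_sa)
print(q_sa)   # roughly 0.045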

Algorithm idea

Q(s,a) denotes the expected return of taking action a in state s at a given time step.

QTable    Left            Right
Step1     q(s_1, a_1)     q(s_1, a_2)
Step2     q(s_2, a_1)     q(s_2, a_2)
Step3     q(s_3, a_1)     q(s_3, a_2)
Step4     q(s_4, a_1)     q(s_4, a_2)
Step5     q(s_5, a_1)     q(s_5, a_2)
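
In code, a table like this can be stored as a pandas DataFrame with one row per state and one column per action, which is what the build_q_table function in the program below does. A minimal sketch, assuming five states and the two actions left/right:

import numpy as np
import pandas as pd

ACTIONS = ['left', 'right']   # a_1, a_2
N_STATES = 5                  # s_1 ... s_5

# every entry starts at 0: the agent has accumulated no knowledge yet
q_table = pd.DataFrame(
    np.zeros((N_STATES, len(ACTIONS))),
    columns=ACTIONS,
)
print(q_table)
# q_table.loc[2, 'right'] corresponds to q(s_3, a_2) in the table above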

Program

The code is also 莫烦's small demo program.

"""
A simple example for Reinforcement Learning using table lookup Q-learning method.
An agent "o" is on the left of a 1 dimensional world, the treasure is on the rightmost location.
Run this program to see how the agent will improve its strategy of finding the treasure.
View more on my tutorial page: https://morvanzhou.github.io/tutorials/
"""

import numpy as np
import pandas as pd
import time

np.random.seed(2)  # reproducible: fix the random seed so the same sequence of random numbers is generated on every run


N_STATES = 6   # the length of the 1 dimensional world
ACTIONS = ['left', 'right']     # available actions
EPSILON = 0.9   # greedy policy: probability of exploiting the best known action
ALPHA = 0.1     # learning rate
GAMMA = 0.9    # discount factor
MAX_EPISODES = 13   # maximum episodes
FRESH_TIME = 0.3    # fresh time for one move


def build_q_table(n_states, actions):
    table = pd.DataFrame(
        np.zeros((n_states, len(actions))),     # q_table initial values
        columns=actions,    # actions' names
    )
    #print(table)    # show table
    return table


def choose_action(state, q_table):
    # This is how to choose an action
    state_actions = q_table.iloc[state, :]
    if (np.random.uniform() > EPSILON) or ((state_actions == 0).all()):  # act non-greedy (10% of the time) or when this state's action values are all zero
        action_name = np.random.choice(ACTIONS)
    else:   # act greedy
        action_name = state_actions.idxmax()    # replace argmax to idxmax as argmax means a different function in newer version of pandas
    return action_name

# rules of how the agent interacts with the environment
def get_env_feedback(S, A):
    # This is how agent will interact with the environment
    if A == 'right':    # move right
        if S == N_STATES - 2:   # terminate
            S_ = 'terminal'
            R = 1
        else:
            S_ = S + 1
            R = 0
    else:   # move left
        R = 0
        if S == 0:
            S_ = S  # reach the wall
        else:
            S_ = S - 1
    return S_, R

# update the display of the agent's position in the environment
def update_env(S, episode, step_counter):
    # This is how environment be updated
    env_list = ['-']*(N_STATES-1) + ['T']   # '---------T' our environment
    if S == 'terminal':
        interaction = 'Episode %s: total_steps = %s' % (episode+1, step_counter)
        print('\r{}'.format(interaction), end='')
        time.sleep(2)
        print('\r                                ', end='')
    else:
        env_list[S] = 'o'
        interaction = ''.join(env_list)
        print('\r{}'.format(interaction), end='')
        time.sleep(FRESH_TIME)

## execution of the Q-Learning algorithm
def rl():
    # main part of RL loop
    q_table = build_q_table(N_STATES, ACTIONS)  # initialize an empty Q table: the agent starts with no knowledge
    for episode in range(MAX_EPISODES):
        step_counter = 0
        S = 0
        is_terminated = False
        update_env(S, episode, step_counter)
        while not is_terminated:

            A = choose_action(S, q_table)
            S_, R = get_env_feedback(S, A)  # take action & get next state and reward
            q_predict = q_table.loc[S, A]
            if S_ != 'terminal':
                q_target = R + GAMMA * q_table.iloc[S_, :].max()   # next state is not terminal
            else:
                q_target = R     # next state is terminal
                is_terminated = True    # terminate this episode

            q_table.loc[S, A] += ALPHA * (q_target - q_predict)  # update
            S = S_  # move to next state

            update_env(S, episode, step_counter+1)
            step_counter += 1
        print(q_table)
    return q_table
if __name__ == "__main__":
    q_table = rl()
    print('\r\nQ-table:\n')
    print(q_table)