[Reinforcement learning] Example 1: implementing Q-learning in Python

Author: hhh5460

Problem description

-o---T
#T is the location of the treasure, o is the location of the explorer

This time, we will use Q-learning to implement a small example. The environment is a one-dimensional world with a treasure at its right end. Once the explorer reaches the treasure and tastes the reward, he remembers the way to get there. That behavior is what he learns through reinforcement learning.

Q-learning is a method that records action values (Q values). Every action in a given state has a value Q(s, a); that is, the value of taking action a in state s is Q(s, a). In the explorer game above, the state s is the position of o. At each position the explorer can take two actions, left or right, and these are all the actions available to him.

Thanks: the three paragraphs above are taken from https://morvanzhou.github.io/tutorials/machine-learning/reinforcement-learning/2-1-general-rl/

To solve this problem, the following things should be clarified first:

0. Related parameters

epsilon = 0.9   # greediness (how often the greedy action is taken)
alpha = 0.1     # learning rate
gamma = 0.8     # reward discount factor
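To get a feel for what epsilon = 0.9 means in practice, the short sketch below counts how often the test random.uniform(0, 1) > epsilon (the exploration test used later in this article) comes out true. The seeded generator rng and the trial count are assumptions made only so the run is repeatable.

```python
import random

epsilon = 0.9             # greediness, same value as in the article
rng = random.Random(0)    # seeded generator: an assumption for repeatability
trials = 10_000           # arbitrary sample size for this illustration

# Count how often the exploration test fires.
explore_count = sum(rng.uniform(0, 1) > epsilon for _ in range(trials))
explore_rate = explore_count / trials

print('explored in {:.1%} of the trials'.format(explore_rate))
```

The rate should land near 1 - epsilon = 0.1, so roughly one step in ten is random and the rest follow the Q table (as long as the table holds something other than zeros for that state).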

1. State set

The explorer's state is the position he occupies, and there are six possible positions. So we define:

states = range(6) # State set, from 0 to 5

So, how do we determine the next state after performing an action in a given state?

def get_next_state(state, action):
    '''Return the next state after performing an action in a state.'''
    global states
    # left, right = -1, +1  # generally true, but the two end positions need special handling
    if action == 'right' and state != states[-1]:  # anywhere except the last state (position), right is +1
        next_state = state + 1
    elif action == 'left' and state != states[0]:  # anywhere except the first state (position), left is -1
        next_state = state - 1
    else:
        next_state = state
    return next_state
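A quick check of the boundary behavior may help here. The sketch below repeats get_next_state so that it runs on its own, and compares it with a more compact clamped form. get_next_state_clamped and the step dictionary are hypothetical names introduced only for this illustration; they are not part of the original code.

```python
states = range(6)  # state set, from 0 to 5

def get_next_state(state, action):
    '''Next state after performing an action, following the article's rules.'''
    if action == 'right' and state != states[-1]:
        return state + 1
    if action == 'left' and state != states[0]:
        return state - 1
    return state

# Hypothetical alternative: move by -1/+1, then clamp into [0, 5].
step = {'left': -1, 'right': +1}

def get_next_state_clamped(state, action):
    return min(max(state + step[action], states[0]), states[-1])

print(get_next_state(0, 'left'))    # blocked by the left wall
print(get_next_state(2, 'right'))   # ordinary move
print(get_next_state(5, 'right'))   # blocked by the right wall

# Both forms agree on every state/action pair of this world.
same = all(get_next_state(s, a) == get_next_state_clamped(s, a)
           for s in states for a in step)
print(same)
```

The clamped form would need an extra entry in step (for example 'none': 0) if a staying action were ever added, whereas the if/elif form already falls through to "stay" for any other action.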

2. Action set

Whatever state the explorer is in, there are only two possible actions: "left" or "right". So we define:

actions = ['left', 'right'] # Action set. You could also add a 'none' action, meaning stay in place

So, in a given state (position), how do we determine all the legal actions?

def get_valid_actions(state):
    '''Return the legal actions in the current state; rewards play no part here!'''
    global actions  # ['left', 'right']
    valid_actions = set(actions)
    if state == states[-1]:              # last state (position)
        valid_actions -= set(['right'])  # remove the move to the right
    if state == states[0]:               # first state (position)
        valid_actions -= set(['left'])   # remove the move to the left
    return list(valid_actions)
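The function can be exercised on the two ends and on an interior position. Because it builds its result from a set, the order of the returned list is not guaranteed, so the sketch sorts before printing. get_valid_actions_ordered is a hypothetical order-preserving variant, shown only as an alternative and not taken from the original code.

```python
states = range(6)
actions = ['left', 'right']

def get_valid_actions(state):
    '''Legal actions in a state, following the article's rules.'''
    valid_actions = set(actions)
    if state == states[-1]:
        valid_actions -= set(['right'])
    if state == states[0]:
        valid_actions -= set(['left'])
    return list(valid_actions)

def get_valid_actions_ordered(state):
    '''Hypothetical variant: same actions, but always in the order of `actions`.'''
    blocked = set()
    if state == states[-1]:
        blocked.add('right')
    if state == states[0]:
        blocked.add('left')
    return [a for a in actions if a not in blocked]

for s in (0, 3, 5):
    print(s, sorted(get_valid_actions(s)), get_valid_actions_ordered(s))
```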

3. Reward set

When the explorer arrives at a state (position), he should receive a reward. So we define:

rewards = [0,0,0,0,0,1] # Reward set. Only the last position, where the treasure is, has a reward of 1; all others are 0

Obviously, getting the reward for a state is simple: rewards[state]. It is a direct lookup by state, so there is no need to define an extra function.

4. Q table

This is the most important part. The Q table records the value of every state-action pair. Common Q tables are two-dimensional, with one row per state and one column per action.

(Note that there are also three-dimensional Q tables.)

So we define:

q_table = pd.DataFrame(data=[[0.0 for _ in actions] for _ in states], index=states, columns=actions)  # floats, so the values can be updated in place
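For readers who want to see the table's shape without installing pandas, the same structure can be sketched as a dict of dicts: one row per state, one entry per action. The name q_dict and the value 0.1 written into it are illustrative assumptions, not part of the original code.

```python
states = range(6)
actions = ['left', 'right']

# Hypothetical pandas-free stand-in for q_table: q_dict[state][action]
q_dict = {s: {a: 0.0 for a in actions} for s in states}

# Pretend the explorer has learned a little about moving right from state 4.
q_dict[4]['right'] = 0.1

# Print the table row by row, one column per action.
print('state  ' + '  '.join('{:>6}'.format(a) for a in actions))
for s in states:
    print('{:>5}  '.format(s) + '  '.join('{:6.2f}'.format(q_dict[s][a]) for a in actions))

# The greedy action in a state is the column with the largest value.
best_action = max(actions, key=lambda a: q_dict[4][a])
print(best_action)
```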

5. The environment and its update

The only purpose of modeling the environment is to let people watch the explorer's exploration on screen, nothing more.

The environment is very simple: just the string '-----T'! When the explorer reaches a state (position), replace the character at that position with 'o', then reprint the whole string. Hence:

def update_env(state):
    '''Update the environment and print it.'''
    global states
    env = list('-----T')
    if state != states[-1]:
        env[state] = 'o'
    print('\r{}'.format(''.join(env)), end='')
    time.sleep(0.1)
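If the rendering needs to be inspected without the carriage-return animation, the string-building part can be split out. render_env below is a hypothetical helper (not in the original code) that returns the string instead of printing it in place.

```python
states = range(6)

def render_env(state):
    '''Hypothetical helper: build the environment string for a state.'''
    env = list('-----T')
    if state != states[-1]:   # at the last position the treasure 'T' stays visible
        env[state] = 'o'
    return ''.join(env)

# One frame per position, left to right.
frames = [render_env(s) for s in states]
for frame in frames:
    print(frame)
```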

6. Finally, the Q-learning algorithm

[Figure: pseudocode of the Q-learning algorithm, in an English and a Chinese version. Source: https://www.hhyz.me/2018/08/05/2018-08-05-RL/]

The Q value is updated according to the Bellman equation:

$$Q(s_t,a_t) \leftarrow Q(s_t,a_t) + \alpha\left[r_{t+1} + \gamma \max_{a} Q(s_{t+1}, a) - Q(s_t,a_t)\right]$$
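A worked example may make the update rule concrete. Take the step from position 4 to the treasure at position 5: the reward is 1, and the Q values of the final position are never updated, so the max term stays 0. Using this article's alpha and gamma, the sketch applies the rule 13 times (once per episode, since this step always ends an episode) and compares the result with the closed form 1 - (1 - alpha)^n. The variable names are chosen for this illustration only.

```python
alpha = 0.1   # learning rate
gamma = 0.8   # discount factor

reward = 1.0          # reward for arriving at the treasure
max_next_q = 0.0      # Q values of the final position stay at 0
q = 0.0               # Q(4, 'right') starts at 0

history = []
for n in range(13):
    # Q(s,a) <- Q(s,a) + alpha * [r + gamma * max_a' Q(s',a') - Q(s,a)]
    q += alpha * (reward + gamma * max_next_q - q)
    history.append(q)

closed_form = 1 - (1 - alpha) ** 13

print(history[0])                 # value after the first update
print(round(history[-1], 4))      # value after 13 updates
print(round(closed_form, 4))      # the same value from the closed form
```

The first update gives 0.1, and after 13 updates the value is roughly 0.746, still short of its limit of 1; more episodes would bring it closer.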

Well, it's time to do it:

# Run 13 episodes of exploration in total
for i in range(13):
    # 0. Start from the leftmost position (not required)
    current_state = 0
    #current_state = random.choice(states)  # a random start also works
    while current_state != states[-1]:
        # 1. Choose one of the legal actions in the current state, randomly or greedily
        if (random.uniform(0,1) > epsilon) or ((q_table.loc[current_state] == 0).all()):  # explore
            current_action = random.choice(get_valid_actions(current_state))
        else:
            current_action = q_table.loc[current_state].idxmax()  # exploit (greedy)
        # 2. Execute the current action to get the next state (position)
        next_state = get_next_state(current_state, current_action)
        # 3. Get the Q values of all legal actions in the next state; their maximum is needed
        next_state_q_values = q_table.loc[next_state, get_valid_actions(next_state)]
        # 4. Update the Q value of the current state-action pair according to the Bellman equation
        q_table.loc[current_state, current_action] += alpha * (rewards[next_state] + gamma * next_state_q_values.max() - q_table.loc[current_state, current_action])
        # 5. Move to the next state (position)
        current_state = next_state

print('\nq_table:')
print(q_table)

Well, this is the famous Q-learning algorithm!
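With enough episodes, the 'right' column of the table settles to values that can be worked out by hand for this world: moving right from position 4 is worth 1, and each step further from the treasure multiplies the value by gamma. The sketch below computes that limit. q_right_limit is a hypothetical name, and the numbers are the fixed point of the update rule, not the output of a 13-episode run.

```python
gamma = 0.8                 # discount factor
last_state = 5              # position of the treasure

# Work backwards from the treasure:
# Q*(4, right) = 1, and Q*(s, right) = gamma * Q*(s + 1, right) for s < 4.
q_right_limit = {}
value = 1.0
for s in range(last_state - 1, -1, -1):
    q_right_limit[s] = value
    value *= gamma

for s in sorted(q_right_limit):
    print(s, round(q_right_limit[s], 4))
```

The values shrink geometrically toward the left, so the table still prefers 'right' everywhere, only less strongly the further the explorer is from the treasure.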

Note that in the Bellman equation the reward is taken as rewards[next_state]. To stress it once more: next_state, not current_state.

Of course, we would also like to watch the explorer's exploration process, so we update (print) the environment at every step:

for i in range(13):
    #current_state = random.choice(states)
    current_state = 0

    update_env(current_state)  # Environment related
    total_steps = 0            # Environment related

    while current_state != states[-1]:
        if (random.uniform(0,1) > epsilon) or ((q_table.loc[current_state] == 0).all()):  # explore
            current_action = random.choice(get_valid_actions(current_state))
        else:
            current_action = q_table.loc[current_state].idxmax()  # exploit (greedy)

        next_state = get_next_state(current_state, current_action)
        next_state_q_values = q_table.loc[next_state, get_valid_actions(next_state)]
        q_table.loc[current_state, current_action] += alpha * (rewards[next_state] + gamma * next_state_q_values.max() - q_table.loc[current_state, current_action])
        current_state = next_state

        update_env(current_state)  # Environment related
        total_steps += 1           # Environment related

    print('\rEpisode {}: total_steps = {}'.format(i, total_steps), end='')  # Environment related
    time.sleep(1)                                                          # Environment related
    print('\r' + ' ' * 40, end='')                                         # Environment related: clear the line

print('\nq_table:')
print(q_table)
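Once training is done, the learned behavior can be read off the table by taking the best legal action in each state. The sketch below is a self-contained, pandas-free re-implementation of the loop above, followed by that read-out. A plain dict replaces q_table, a seeded random.Random replaces the module-level generator, and the greedy choice is restricted to legal actions; all three are assumptions made so the example runs on its own and repeatably. train and greedy_policy are hypothetical names.

```python
import random

epsilon, alpha, gamma = 0.9, 0.1, 0.8
states = range(6)
actions = ['left', 'right']
rewards = [0, 0, 0, 0, 0, 1]

def get_next_state(state, action):
    if action == 'right' and state != states[-1]:
        return state + 1
    if action == 'left' and state != states[0]:
        return state - 1
    return state

def get_valid_actions(state):
    valid = list(actions)
    if state == states[-1]:
        valid.remove('right')
    if state == states[0]:
        valid.remove('left')
    return valid

def train(episodes, rng):
    '''Run tabular Q-learning and return the table as q[state][action].'''
    q = {s: {a: 0.0 for a in actions} for s in states}
    for _ in range(episodes):
        state = 0
        while state != states[-1]:
            valid = get_valid_actions(state)
            if rng.uniform(0, 1) > epsilon or all(v == 0 for v in q[state].values()):
                action = rng.choice(valid)                      # explore
            else:
                action = max(valid, key=lambda a: q[state][a])  # exploit (legal actions only)
            nxt = get_next_state(state, action)
            best_next = max(q[nxt][a] for a in get_valid_actions(nxt))
            q[state][action] += alpha * (rewards[nxt] + gamma * best_next - q[state][action])
            state = nxt
    return q

def greedy_policy(q):
    '''Best legal action for every state except the final one.'''
    return {s: max(get_valid_actions(s), key=lambda a: q[s][a])
            for s in list(states)[:-1]}

q = train(13, random.Random(0))
policy = greedy_policy(q)
print(policy)
```

In this world the read-out should say 'right' for every position, since moving right is the only way to reach the treasure.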

7. Complete code

'''
-o---T
# T is the location of the treasure, o is the location of the explorer
'''
# author: hhh5460
# Time: 20181217

import pandas as pd
import random
import time


epsilon = 0.9   # greediness (how often the greedy action is taken)
alpha = 0.1     # learning rate
gamma = 0.8     # reward discount factor

states = range(6)            # State set, from 0 to 5
actions = ['left', 'right']  # Action set. You could also add a 'none' action, meaning stay in place
rewards = [0,0,0,0,0,1]      # Reward set. Only the last position, where the treasure is, has a reward of 1; all others are 0

# Q table: one row per state, one column per action (floats, so the values can be updated in place)
q_table = pd.DataFrame(data=[[0.0 for _ in actions] for _ in states], index=states, columns=actions)


def update_env(state):
    '''Update the environment and print it.'''
    global states

    env = list('-----T')  # The environment is just this string (as a list)!
    if state != states[-1]:
        env[state] = 'o'
    print('\r{}'.format(''.join(env)), end='')
    time.sleep(0.1)


def get_next_state(state, action):
    '''Return the next state after performing an action in a state.'''
    global states

    # l,r,n = -1,+1,0
    if action == 'right' and state != states[-1]:  # anywhere except the last state (position), right is +1
        next_state = state + 1
    elif action == 'left' and state != states[0]:  # anywhere except the first state (position), left is -1
        next_state = state - 1
    else:
        next_state = state
    return next_state


def get_valid_actions(state):
    '''Return the legal actions in the current state; rewards play no part here!'''
    global actions  # ['left', 'right']

    valid_actions = set(actions)
    if state == states[-1]:              # last state (position)
        valid_actions -= set(['right'])  # cannot move right
    if state == states[0]:               # first state (position)
        valid_actions -= set(['left'])   # cannot move left
    return list(valid_actions)


for i in range(13):
    #current_state = random.choice(states)
    current_state = 0

    update_env(current_state)  # Environment related
    total_steps = 0            # Environment related

    while current_state != states[-1]:
        if (random.uniform(0,1) > epsilon) or ((q_table.loc[current_state] == 0).all()):  # explore
            current_action = random.choice(get_valid_actions(current_state))
        else:
            current_action = q_table.loc[current_state].idxmax()  # exploit (greedy)

        next_state = get_next_state(current_state, current_action)
        next_state_q_values = q_table.loc[next_state, get_valid_actions(next_state)]
        q_table.loc[current_state, current_action] += alpha * (rewards[next_state] + gamma * next_state_q_values.max() - q_table.loc[current_state, current_action])
        current_state = next_state

        update_env(current_state)  # Environment related
        total_steps += 1           # Environment related

    print('\rEpisode {}: total_steps = {}'.format(i, total_steps), end='')  # Environment related
    time.sleep(2)                                                          # Environment related
    print('\r' + ' ' * 40, end='')                                         # Environment related: clear the line

print('\nq_table:')
print(q_table)