# [Reinforcement learning] Example 1: implementing Q-learning in Python

Author: hhh5460

## Problem situation

    -o---T
    # T is the location of the treasure, o is the location of the explorer

This time we use Q-learning to implement a small example. The environment is a one-dimensional world with a treasure at its right end. Once the explorer finds the treasure and gets a taste of the reward, he remembers how to reach it. That is the behavior he learns through reinforcement learning.

Q-learning is a method that records the value of each behavior (the Q value). Every action taken in a given state has a value Q(s, a): the value of action a in state s. In the explorer game above, the state s is the position of o. At each position the explorer can take one of two actions, left or right, and those are the only actions available.

Thanks: the three paragraphs above are from https://morvanzhou.github.io/tutorials/machine-learning/reinforcement-learning/2-1-general-rl/
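
To make the Q(s, a) notation concrete, here is a minimal sketch that stores the values in a plain dictionary keyed by (state, action). The names `q` and `best_action` and the hand-set value 0.1 are assumptions made for illustration; the code later in this article uses a pandas table instead.

```python
# Hypothetical illustration: Q(s, a) stored as a dict keyed by (state, action).
states = range(6)                 # positions 0..5, treasure at position 5
actions = ['left', 'right']

# Every state-action pair starts with a value of 0.0.
q = {(s, a): 0.0 for s in states for a in actions}

# Pretend the explorer has learned that moving right from position 4 is good.
q[(4, 'right')] = 0.1

def best_action(state):
    """Return the action with the highest Q value in the given state."""
    return max(actions, key=lambda a: q[(state, a)])

print(best_action(4))  # right
```

Learning then amounts to adjusting these numbers until the best action in each position leads toward the treasure.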

To solve this problem, the following things should be clarified first:

## 0. Related parameters

    epsilon = 0.9   # greediness: probability of taking the best-known action
    alpha = 0.1     # learning rate
    gamma = 0.8     # discount factor for future rewards
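
As a rough illustration of what epsilon does, the sketch below picks an action greedily with probability epsilon and randomly otherwise, which matches how this article uses the parameter. The function name `choose_action`, the dictionary of Q values, and the seed are assumptions made for this example.

```python
import random

epsilon = 0.9  # probability of acting greedily, as in this article

def choose_action(q_values, epsilon, rng=random):
    """Pick an action from a dict {action: Q value}.

    With probability epsilon take the best-known action; otherwise,
    or when nothing has been learned yet (all values are 0), pick at random.
    """
    nothing_learned = all(v == 0 for v in q_values.values())
    if rng.random() > epsilon or nothing_learned:
        return rng.choice(list(q_values))      # explore
    return max(q_values, key=q_values.get)     # exploit

rng = random.Random(0)  # seeded so the sketch is repeatable
picks = [choose_action({'left': 0.0, 'right': 0.5}, epsilon, rng) for _ in range(1000)]
share = picks.count('right') / len(picks)
print(share)  # close to 0.95: 0.9 greedy picks plus about half of the 0.1 random picks
```

Note that many texts define epsilon the other way round, as the probability of exploring, so it is worth checking which convention a given piece of code follows.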

## 1. State set

The explorer's state is its position, and there are six positions it can reach. So we define

    states = range(6)  # state set, from 0 to 5

How do we determine the next state after performing an action in a given state?

    def get_next_state(state, action):
        '''Return the next state after performing an action in the given state.'''
        # Normally left/right would simply be -1/+1, but the two end positions need special handling.
        if action == 'right' and state != states[-1]:  # every position except the last can move right (+1)
            next_state = state + 1
        elif action == 'left' and state != states[0]:  # every position except the first can move left (-1)
            next_state = state - 1
        else:
            next_state = state
        return next_state
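
The same boundary handling can also be written as a clamp. The sketch below is an alternative formulation, not the author's code; the names `MOVES` and `step` are assumptions introduced here.

```python
states = range(6)
MOVES = {'left': -1, 'right': +1}  # hypothetical lookup table: action -> offset

def step(state, action):
    """Move one position and clamp the result to the valid range of states."""
    target = state + MOVES.get(action, 0)   # unknown actions leave the state unchanged
    return min(max(target, states[0]), states[-1])

print([step(s, 'right') for s in states])  # [1, 2, 3, 4, 5, 5]
print([step(s, 'left') for s in states])   # [0, 0, 1, 2, 3, 4]
```

With a lookup table, adding a 'none' action that stays in place only needs one more entry with offset 0.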

## 2. Action set

When the explorer is in any state, only two actions are possible: "left" or "right". So we define

    actions = ['left', 'right']  # action set; you could also add a 'none' action, meaning stay in place

Given a state (position), how do we determine all the legal actions?

    def get_valid_actions(state):
        '''Return the legal actions in the given state; this is independent of the rewards.'''
        valid_actions = set(actions)
        if state == states[-1]:              # last state (position)
            valid_actions -= set(['right'])  # cannot move right
        if state == states[0]:               # first state (position)
            valid_actions -= set(['left'])   # cannot move left
        return list(valid_actions)
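
One detail worth knowing: converting a set of strings to a list does not guarantee a fixed order from one Python run to the next, which can make seeded random choices hard to reproduce. The sketch below is an order-preserving variant; the name `valid_actions_ordered` is an assumption, not part of the original code.

```python
states = range(6)
actions = ['left', 'right']

def valid_actions_ordered(state):
    """Return the legal actions for a state, in the same order as `actions`."""
    blocked = set()
    if state == states[-1]:
        blocked.add('right')   # cannot move past the right edge
    if state == states[0]:
        blocked.add('left')    # cannot move past the left edge
    return [a for a in actions if a not in blocked]

print(valid_actions_ordered(0))  # ['right']
print(valid_actions_ordered(3))  # ['left', 'right']
print(valid_actions_ordered(5))  # ['left']
```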


## 3. Reward collection

When the explorer reaches a state (position), it should receive a reward. So we define

    rewards = [0, 0, 0, 0, 0, 1]  # reward set; only the last position, where the treasure is, has a reward of 1, all others are 0

Getting the reward for a state is then simple: rewards[state]. It is a direct lookup by state, so no extra function is needed.

## 4. Q table

This is the most important part. The Q table records the value of every state-action pair. A Q table is usually two-dimensional; here it has one row per state and one column per action.

(Note that three-dimensional Q tables also exist.)

So we define

    # Start from 0.0 rather than 0 so the columns hold floats, since the updates are fractional.
    q_table = pd.DataFrame(data=[[0.0 for _ in actions] for _ in states],
                           index=states, columns=actions)
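
For readers new to pandas, the following sketch shows how such a table can be read and written with the label-based `.loc` indexer. The value 0.1 is made up for illustration, and building the frame from a scalar is just a shorter way to get a table of the same shape.

```python
import pandas as pd

states = range(6)
actions = ['left', 'right']

# One row per state, one column per action, all values starting at 0.0.
q_table = pd.DataFrame(0.0, index=states, columns=actions)

q_table.loc[4, 'right'] = 0.1          # write a single Q value
print(q_table.loc[4, 'right'])         # read it back: 0.1
print(q_table.loc[4].idxmax())         # best action in state 4: right
print(q_table.shape)                   # (6, 2)
```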

## 5. The environment and its update

The only purpose of modeling the environment is to let us watch the explorer's progress on screen.

The environment is very simple: just the string '-----T'. When the explorer reaches a state (position), replace the character at that position with 'o' and reprint the whole string. Therefore

    def update_env(state):
        '''Update the environment and print it.'''
        env = list('-----T')
        if state != states[-1]:
            env[state] = 'o'
        print('\r{}'.format(''.join(env)), end='')
        time.sleep(0.1)
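
If you want to check the drawing logic without printing or sleeping, the string-building part can be separated out. This is a small sketch; the function name `render` and its `world` parameter are assumptions introduced here.

```python
def render(state, world='-----T'):
    """Return the world string with the explorer drawn at `state`."""
    cells = list(world)
    if state != len(cells) - 1:      # on the treasure cell, keep showing 'T'
        cells[state] = 'o'
    return ''.join(cells)

print(render(1))  # -o---T
print(render(5))  # -----T
```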

## 6. Finally, the Q-learning algorithm

[Figure: pseudocode of the Q-learning algorithm]

[Figure: pseudocode of the Q-learning algorithm, Chinese version]

Source: https://www.hhyz.me/2018/08/05/2018-08-05-RL/

The Q value is updated according to the Bellman equation, where $\alpha$ is the learning rate and $\gamma$ is the discount factor:

$$Q(s_t,a_t) \leftarrow Q(s_t,a_t) + \alpha[r_{t+1} + \gamma \max _{a} Q(s_{t+1}, a) - Q(s_t,a_t)] \tag {1}$$
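
To see the numbers, here is a worked sketch of the update for the step from position 4 onto the treasure, using this article's alpha = 0.1 and gamma = 0.8. The helper name `q_update` is an assumption made for illustration.

```python
alpha, gamma = 0.1, 0.8

def q_update(q_sa, reward, max_q_next):
    """Apply one Q-learning update and return the new Q(s, a)."""
    return q_sa + alpha * (reward + gamma * max_q_next - q_sa)

# First time the explorer steps from position 4 onto the treasure:
# Q(4, right) = 0, reward = 1, and all Q values at position 5 are 0.
first = q_update(0.0, 1, 0.0)
print(round(first, 4))   # 0.1

# Second time through the same step, starting from 0.1:
second = q_update(first, 1, 0.0)
print(round(second, 4))  # 0.19
```

Each pass closes a tenth of the remaining gap to the target value of 1, so the estimate rises quickly at first and then levels off.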

Now let's implement it:

    # Run 13 episodes of exploration.
    # Note: .loc is used below because the older .ix indexer was removed in pandas 1.0.
    for i in range(13):
        # 0. Start from the leftmost position (not required)
        current_state = 0
        # current_state = random.choice(states)  # a random start also works
        while current_state != states[-1]:
            # 1. Pick the current action from the legal actions, randomly (explore) or greedily (exploit)
            if (random.uniform(0, 1) > epsilon) or ((q_table.loc[current_state] == 0).all()):  # explore
                current_action = random.choice(get_valid_actions(current_state))
            else:
                current_action = q_table.loc[current_state].idxmax()  # exploit (greedy)
            # 2. Perform the action to get the next state (position)
            next_state = get_next_state(current_state, current_action)
            # 3. Fetch the Q values of all legal actions in the next state; their maximum is needed below
            next_state_q_values = q_table.loc[next_state, get_valid_actions(next_state)]
            # 4. Update the Q value of the current state-action pair according to the Bellman equation
            q_table.loc[current_state, current_action] += alpha * (rewards[next_state] + gamma * next_state_q_values.max() - q_table.loc[current_state, current_action])
            # 5. Move to the next state (position)
            current_state = next_state

    print('\nq_table:')
    print(q_table)

That's the famous Q-learning algorithm!

Note that in the Bellman update the reward used is rewards[next_state]. Once more: it is indexed by next_state, not by current_state.

Of course, we would like to watch the explorer's progress, so we update (print) the environment at every step:

    for i in range(13):
        # current_state = random.choice(states)
        current_state = 0

        update_env(current_state)  # environment related
        total_steps = 0            # environment related

        while current_state != states[-1]:
            if (random.uniform(0, 1) > epsilon) or ((q_table.loc[current_state] == 0).all()):  # explore
                current_action = random.choice(get_valid_actions(current_state))
            else:
                current_action = q_table.loc[current_state].idxmax()  # exploit (greedy)

            next_state = get_next_state(current_state, current_action)
            next_state_q_values = q_table.loc[next_state, get_valid_actions(next_state)]
            q_table.loc[current_state, current_action] += alpha * (rewards[next_state] + gamma * next_state_q_values.max() - q_table.loc[current_state, current_action])
            current_state = next_state

            update_env(current_state)  # environment related
            total_steps += 1           # environment related

        print('\rEpisode {}: total_steps = {}'.format(i, total_steps), end='')  # environment related
        time.sleep(1)                                                           # environment related
        print('\r                                ', end='')                     # environment related

    print('\nq_table:')
    print(q_table)

## 7. Complete code

    '''
    -o---T
    # T is the location of the treasure, o is the location of the explorer
    '''
    # author: hhh5460
    # Time: 20181217
    import pandas as pd
    import random
    import time

    epsilon = 0.9   # greediness: probability of taking the best-known action
    alpha = 0.1     # learning rate
    gamma = 0.8     # discount factor for future rewards

    states = range(6)            # state set, from 0 to 5
    actions = ['left', 'right']  # action set; you could also add a 'none' action, meaning stay in place
    rewards = [0, 0, 0, 0, 0, 1] # reward set; only the last position (the treasure) has a reward of 1

    # Start from 0.0 so the columns hold floats, since the updates are fractional.
    q_table = pd.DataFrame(data=[[0.0 for _ in actions] for _ in states],
                           index=states, columns=actions)

    def update_env(state):
        '''Update the environment and print it.'''
        env = list('-----T')  # the environment is just this string (as a list)
        if state != states[-1]:
            env[state] = 'o'
        print('\r{}'.format(''.join(env)), end='')
        time.sleep(0.1)

    def get_next_state(state, action):
        '''Return the next state after performing an action in the given state.'''
        if action == 'right' and state != states[-1]:  # every position except the last can move right (+1)
            next_state = state + 1
        elif action == 'left' and state != states[0]:  # every position except the first can move left (-1)
            next_state = state - 1
        else:
            next_state = state
        return next_state

    def get_valid_actions(state):
        '''Return the legal actions in the given state; this is independent of the rewards.'''
        valid_actions = set(actions)
        if state == states[-1]:              # last state (position)
            valid_actions -= set(['right'])  # cannot move right
        if state == states[0]:               # first state (position)
            valid_actions -= set(['left'])   # cannot move left
        return list(valid_actions)

    # Note: .loc is used below because the older .ix indexer was removed in pandas 1.0.
    for i in range(13):
        # current_state = random.choice(states)
        current_state = 0

        update_env(current_state)  # environment related
        total_steps = 0            # environment related

        while current_state != states[-1]:
            if (random.uniform(0, 1) > epsilon) or ((q_table.loc[current_state] == 0).all()):  # explore
                current_action = random.choice(get_valid_actions(current_state))
            else:
                current_action = q_table.loc[current_state].idxmax()  # exploit (greedy)

            next_state = get_next_state(current_state, current_action)
            next_state_q_values = q_table.loc[next_state, get_valid_actions(next_state)]
            q_table.loc[current_state, current_action] += alpha * (rewards[next_state] + gamma * next_state_q_values.max() - q_table.loc[current_state, current_action])
            current_state = next_state

            update_env(current_state)  # environment related
            total_steps += 1           # environment related

        print('\rEpisode {}: total_steps = {}'.format(i, total_steps), end='')  # environment related
        time.sleep(2)                                                           # environment related
        print('\r                                ', end='')                     # environment related

    print('\nq_table:')
    print(q_table)
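
Once training finishes, the learned behavior can be read off the table by taking the best action in each position. The sketch below is a pandas-free variant of the loop above, written as a function with a fixed seed so it can be rerun and checked; the names `train`, `valid`, and `policy`, and the seed itself, are assumptions made for this example, and the printing and sleeping are left out.

```python
import random

def train(episodes=13, epsilon=0.9, alpha=0.1, gamma=0.8, seed=0):
    """Tabular Q-learning on the 6-cell world; returns {(state, action): Q value}."""
    rng = random.Random(seed)              # seeded so runs are repeatable
    moves = {'left': -1, 'right': +1}
    rewards = [0, 0, 0, 0, 0, 1]
    goal = len(rewards) - 1

    def valid(state):
        # Actions that keep the explorer inside the world.
        return [a for a in moves if 0 <= state + moves[a] <= goal]

    q = {(s, a): 0.0 for s in range(goal + 1) for a in moves}
    for _ in range(episodes):
        state = 0
        while state != goal:
            options = valid(state)
            unexplored = all(q[(state, a)] == 0 for a in options)
            if rng.random() > epsilon or unexplored:
                action = rng.choice(options)                        # explore
            else:
                action = max(options, key=lambda a: q[(state, a)])  # exploit
            nxt = state + moves[action]
            best_next = max(q[(nxt, a)] for a in valid(nxt))
            q[(state, action)] += alpha * (rewards[nxt] + gamma * best_next - q[(state, action)])
            state = nxt
    return q

q = train()
# Greedy policy: the best-valued action in each non-terminal position.
policy = {s: max(['left', 'right'], key=lambda a: q[(s, a)]) for s in range(5)}
print(policy)
print(round(q[(4, 'right')], 4))  # 0.7458, i.e. 1 - 0.9**13
```

The last value can be checked by hand: every episode ends with exactly one step from position 4 onto the treasure, and each such update closes a tenth of the gap to 1, so after 13 episodes Q(4, right) equals 1 - 0.9**13. With enough episodes the greedy action in the other positions typically becomes 'right' as well.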

## 8. Last of all, the rendered result


Posted on Mon, 29 Jun 2020 22:14:31 -0400 by Hardbyte