Reinforcement learning notes: multi-armed bandit problem -- optimistic initial values

Contents

0. Preface

1. Optimistic Initial Value

2. Simulation

2.1 sample-average method

2.2 Constant Step-size alpha=0.1

3. Analysis and discussion of simulation results

4. Exercises

0. Preface


        In the previous sections, we have discussed the multi-armed bandit problem. See:

        Reinforcement learning notes: multi-armed bandit problem (1)

        Reinforcement learning notes: multi-armed bandit problem (2) -- Python simulation

        Reinforcement learning notes: multi-armed bandit problem (3) -- incremental implementation of action-value estimation

        Reinforcement learning notes: multi-armed bandit problem (4) -- tracking a non-stationary environment

        In this section, still based on the multi-armed bandit problem, we explore the impact of the initial action-value estimates on reinforcement learning.

        Ref: Sutton-RLBook2020-2.6: Optimistic Initial Values

1. Optimistic Initial Value

The methods discussed so far depend to some extent on the initial action-value estimates $Q_1(a)$. In the language of statistics, these methods are biased by their initial estimates. For the sample-average method, the bias disappears once every action has been selected at least once. For a constant step size $\alpha$ (i.e., $\alpha_n(a) = \alpha$), the bias is permanent, although it decreases exponentially over time and therefore vanishes in an approximate sense after long enough.
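The exponential decay of this bias can be made explicit: unrolling the constant step-size update $Q_{n+1} = Q_n + \alpha\,(R_n - Q_n)$ gives

$$Q_{n+1} = (1-\alpha)^n Q_1 + \sum_{i=1}^{n} \alpha\,(1-\alpha)^{n-i} R_i ,$$

so the initial estimate $Q_1$ retains a weight of $(1-\alpha)^n$, which never reaches exactly zero for finite $n$ but decays exponentially.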

In practice, this bias is usually not a problem and can even be helpful. The downside of such initial-value-dependent methods is that the user has to choose the initial values (even if only to set them all to 0). The upside is that the initial values offer an easy way to supply the model with some prior information.

The initial action-value estimates also provide a simple way to encourage exploration. Suppose that in the 10-armed testbed discussed earlier, all initial action values are set to 5 instead of 0. Since $q^*(a)$ is sampled from the standard normal distribution (mean 0, variance 1), an initial value of 5 is wildly optimistic. In the initial phase, whichever action is selected, its value estimate drops sharply once the reward is observed. The agent, "disappointed" by the reward feedback, therefore switches to other actions (i.e., it exhibits a preference for exploration). This preference only ends after every action has been selected several times and the action-value estimates have roughly converged. Note that this happens even with the pure greedy strategy ($\epsilon = 0$): the agent explores substantially at first and only turns to pure exploitation after this initial exploration.
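          To make the mechanism concrete, here is a minimal sketch (a simplified toy, not the full k_armed_bandit_one_run() below: it uses a fixed seed, a purely greedy argmax without random tie-breaking, and the sample-average update) showing that with optimistic initial values each arm typically gets tried once before any arm is repeated.

import numpy as np

np.random.seed(0)                        # fixed seed, just for a reproducible illustration
K     = 10
qstar = np.random.randn(K)               # true action values ~ N(0,1), far below 5
Q     = np.ones(K) * 5                   # optimistic initial estimates
N     = np.zeros(K, dtype=int)

first_actions = []
for t in range(K):
    a = np.argmax(Q)                     # greedy selection (epsilon = 0), ties broken toward the lowest index
    r = np.random.randn() + qstar[a]     # reward ~ N(qstar[a], 1)
    N[a] += 1
    Q[a] += (r - Q[a]) / N[a]            # sample-average update: Q[a] drops from 5 to r after the first pull
    first_actions.append(a)

print(first_actions)                     # typically [0, 1, ..., 9]: every arm tried once before any repeats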

2. Simulation

import numpy as np
import matplotlib.pyplot as plt
import utilities as util

%matplotlib inline

          k_armed_bandit_one_run() is modified so that the initial Q values are passed in through the Qinit parameter.

def k_armed_bandit_one_run(qstar,epsilon,nStep,Qinit,QUpdtAlgo='sample_average',alpha=0, stationary=True):
    """
    One run of K-armed bandit simulation.
    Add Qinit to the interface.
    Input:
        qstar:     Mean reward for each candition actions
        epsilon:   Epsilon value for epsilon-greedy algorithm
        nStep:     The number of steps for simulation
        Qinit:     Initial setting for action-value estimate
        QUpdtAlgo: The algorithm for updating Q value--'sample_average','exp_decaying'
        alpha:     step-size in case of 'exp_decaying'
    Output:
        a[t]: action series for each step in one run
        r[t]: reward series for each step in one run
        Q[k]: reward sample average up to t-1 for action[k]
        aNum[k]: The number of being selected for action[k]
        optRatio[t]: Ration of optimal action being selected over tim
    """
    
    K     = len(qstar)
    Q     = Qinit.copy()                  # Copy so that the caller's Qinit array is not modified in place
    a     = np.zeros(nStep+1,dtype='int') # Item#0 for initialization
    aNum  = np.zeros(K,dtype='int')       # Record the number of action#k being selected
    
    r     = np.zeros(nStep+1)             # Item#0 for initialization

    if stationary == False:
        qstar = np.ones(K)/K              # qstar initialized to 1/K for all K actions in the non-stationary case
    
    optCnt   = 0
    optRatio = np.zeros(nStep+1,dtype='float') # Item#0 for initialization

    for t in range(1,nStep+1):

        #0. For a non-stationary environment, optAct also changes over time. Hence, it is computed inside the loop.
        optAct   = np.argmax(qstar)
        #1. action selection
        tmp = np.random.uniform(0,1)
        #print(tmp)
        if tmp < epsilon: # random selection
            a[t] = np.random.choice(np.arange(K))
            #print('random selection: a[{0}] = {1}'.format(t,a[t]))
        else:             # greedy selection
            #Select the action with the largest Q value. When several Q values tie for the maximum, one of them should be chosen at random -- but how do we detect and break such ties?
            #Equivalently: randomly permute the indices first and then take the argmax over the permuted Q values.
            p = np.random.permutation(K)
            a[t] = p[np.argmax(Q[p])]
            #print('greedy selection: a[{0}] = {1}'.format(t,a[t]))

        aNum[a[t]] = aNum[a[t]] + 1

        #2. reward: draw from the pre-defined probability distribution    
        r[t] = np.random.randn() + qstar[a[t]]        

        #3.Update Q of the selected action - #2.4 Incremental Implementation
        # Q[a[t]] = (Q[a[t]]*(aNum[a[t]]-1) + r[t])/aNum[a[t]]    
        if QUpdtAlgo == 'sample_average':
            Q[a[t]] = Q[a[t]] + (r[t]-Q[a[t]])/aNum[a[t]]    
        elif QUpdtAlgo == 'exp_decaying':
            Q[a[t]] = Q[a[t]] + (r[t]-Q[a[t]])*alpha
        
        #4. Optimal Action Ratio tracking
        #print(a[t], optAct)
        if a[t] == optAct:
            optCnt = optCnt + 1
        optRatio[t] = optCnt/t

        #5. Random walk of qstar simulating a non-stationary environment
        # Take independent random walks (say by adding a normally distributed increment with mean 0
        # and standard deviation 0.01 to all the q*(a) on each step).
        if stationary == False:        
            qstar = qstar + np.random.randn(K)*0.01 # Standard Deviation = 0.01
            #print('t={0}, qstar={1}, sum={2}'.format(t,qstar,np.sum(qstar)))
        
    return a,aNum,r,Q,optRatio

         Below, we compare {epsilon=0, Qinit=5} with {epsilon=0.1, Qinit=0} through simulation, for each of the two averaging methods (step-size mechanisms): the sample-average method and the constant step size (exponential recency-weighted average).

2.1 sample-average method

nStep = 1000
nRun  = 2000
K     = 10

r_0p0_Q5   = np.zeros((nRun,nStep+1))
r_0p1_Q0   = np.zeros((nRun,nStep+1))
optRatio_0p0_Q5 = np.zeros((nRun,nStep+1))
optRatio_0p1_Q0 = np.zeros((nRun,nStep+1))

for run in range(nRun):
    print('.',end='')
    if run%100==99:        
        print('run = ',run+1)
    
    qstar   = np.random.randn(10)     
    a,aNum,r_0p0_Q5[run,:],Q,optRatio_0p0_Q5[run,:] = k_armed_bandit_one_run(qstar,epsilon=0,nStep=nStep, Qinit=np.ones(K)*5)
    a,aNum,r_0p1_Q0[run,:],Q,optRatio_0p1_Q0[run,:] = k_armed_bandit_one_run(qstar,epsilon=0.1,nStep=nStep, Qinit=np.zeros(K))


# Plotting simulation results
rEnsembleMean_0p0_Q5 = np.mean(r_0p0_Q5,axis=0)
rEnsembleMean_0p1_Q0 = np.mean(r_0p1_Q0,axis=0)

optRatioEnsembleMean_0p0_Q5 = np.mean(optRatio_0p0_Q5,axis=0)
optRatioEnsembleMean_0p1_Q0 = np.mean(optRatio_0p1_Q0,axis=0)


fig,ax = plt.subplots(1,2,figsize=(15,4))

ax[0].plot(rEnsembleMean_0p0_Q5)  # Without time-domain smooth filtering
ax[0].plot(rEnsembleMean_0p1_Q0)
ax[0].legend(['epsilon = 0, Qinit = 5','epsilon = 0.1, Qinit = 0'])
ax[0].set_title('ensemble average reward')

ax[1].plot(optRatioEnsembleMean_0p0_Q5)
ax[1].plot(optRatioEnsembleMean_0p1_Q0)
ax[1].legend(['epsilon = 0, Qinit = 5','epsilon = 0.1, Qinit = 0'])
ax[1].set_title('Optimal action selection ratio')

         Simulation results: [figure omitted -- ensemble average reward (left) and optimal action selection ratio (right)]
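          (The comment "Without time-domain smooth filtering" in the plotting code above means the per-step ensemble averages are plotted as-is. If a smoother reward curve is preferred, a simple moving average can be applied first; the sketch below is only an illustration, and the window length win = 20 is an arbitrary choice.)

win    = 20                                   # arbitrary window length, for illustration only
kernel = np.ones(win)/win
rSmooth_0p0_Q5 = np.convolve(rEnsembleMean_0p0_Q5, kernel, mode='valid')
rSmooth_0p1_Q0 = np.convolve(rEnsembleMean_0p1_Q0, kernel, mode='valid')

fig,ax = plt.subplots(figsize=(7,4))
ax.plot(rSmooth_0p0_Q5)
ax.plot(rSmooth_0p1_Q0)
ax.legend(['epsilon = 0, Qinit = 5','epsilon = 0.1, Qinit = 0'])
ax.set_title('ensemble average reward (moving average, win = 20)')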

2.2 Constant Step-size alpha=0.1

nStep = 1000
nRun  = 2000
K     = 10

r_0p0_Q5   = np.zeros((nRun,nStep+1))
r_0p1_Q0   = np.zeros((nRun,nStep+1))
optRatio_0p0_Q5 = np.zeros((nRun,nStep+1))
optRatio_0p1_Q0 = np.zeros((nRun,nStep+1))

for run in range(nRun):
    print('.',end='')
    if run%100==99:        
        print('run = ',run+1)
    
    qstar   = np.random.randn(10)     
    a,aNum,r_0p0_Q5[run,:],Q,optRatio_0p0_Q5[run,:] = k_armed_bandit_one_run(qstar,epsilon=0,nStep=nStep,QUpdtAlgo='exp_decaying',alpha=0.1, Qinit=np.ones(K)*5)
    a,aNum,r_0p1_Q0[run,:],Q,optRatio_0p1_Q0[run,:] = k_armed_bandit_one_run(qstar,epsilon=0.1,nStep=nStep,QUpdtAlgo='exp_decaying',alpha=0.1, Qinit=np.zeros(K))
rEnsembleMean_0p0_Q5 = np.mean(r_0p0_Q5,axis=0)
rEnsembleMean_0p1_Q0 = np.mean(r_0p1_Q0,axis=0)

optRatioEnsembleMean_0p0_Q5 = np.mean(optRatio_0p0_Q5,axis=0)
optRatioEnsembleMean_0p1_Q0 = np.mean(optRatio_0p1_Q0,axis=0)


fig,ax = plt.subplots(1,2,figsize=(15,4))

ax[0].plot(rEnsembleMean_0p0_Q5)  # Without time-domain smooth filtering
ax[0].plot(rEnsembleMean_0p1_Q0)
ax[0].legend(['epsilon = 0, Qinit = 5','epsilon = 0.1, Qinit = 0'])
ax[0].set_title('ensemble average reward')

ax[1].plot(optRatioEnsembleMean_0p0_Q5)
ax[1].plot(optRatioEnsembleMean_0p1_Q0)
ax[1].legend(['epsilon = 0, Qinit = 5','epsilon = 0.1, Qinit = 0'])
ax[1].set_title('Optimal action selection ratio')

          Simulation results: [figure omitted -- ensemble average reward (left) and optimal action selection ratio (right)]

3. Analysis and discussion of simulation results

        The simulation results in Section 2.2 above reproduce the results of the original book (see Sutton-RLBook2020, Figure 2.3).

         At the beginning, the optimistic-initial-value method performs worse, because it explores more. After a while, however, it catches up and overtakes the epsilon-greedy baseline, because its exploration decreases over time. This trick works quite well on stationary problems, but it is far from a generally useful way to encourage exploration. In particular, it does not perform well in non-stationary environments, because its drive to explore is only temporary and confined to the initial period. In a non-stationary setting, what happens during that initial period has little bearing on the long-run result; after all, it only happens once, at the start of the run. When the task changes (as mentioned earlier, in a non-stationary environment the task keeps changing, reflected in the continuous drift of qstar), the exploration encouraged by the optimistic initial values is no longer available.

In fact, any method that focuses on the initial conditions is unlikely to help with the general non-stationary case. The same criticism applies to the sample-average method, which also treats the beginning of time as a special event. Nevertheless, because these methods are very simple, they (and simple combinations of them) often work well enough in practice.
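         To check this point empirically, the same k_armed_bandit_one_run() defined above can be rerun with stationary=False. The following is only a minimal sketch: it uses a smaller nRun than in Section 2 simply to keep it quick, and the qstar argument is still passed because the interface requires it, even though it is re-initialized internally in the non-stationary branch.

nStep = 1000
nRun  = 200      # fewer runs than in Section 2, just to keep this sketch quick
K     = 10

optRatio_opt = np.zeros((nRun,nStep+1))   # epsilon = 0,   Qinit = 5
optRatio_eps = np.zeros((nRun,nStep+1))   # epsilon = 0.1, Qinit = 0

for run in range(nRun):
    qstar = np.random.randn(K)   # overridden inside the function when stationary=False
    a,aNum,r,Q,optRatio_opt[run,:] = k_armed_bandit_one_run(qstar,epsilon=0,  nStep=nStep,Qinit=np.ones(K)*5,QUpdtAlgo='exp_decaying',alpha=0.1,stationary=False)
    a,aNum,r,Q,optRatio_eps[run,:] = k_armed_bandit_one_run(qstar,epsilon=0.1,nStep=nStep,Qinit=np.zeros(K), QUpdtAlgo='exp_decaying',alpha=0.1,stationary=False)

fig,ax = plt.subplots(figsize=(7,4))
ax.plot(np.mean(optRatio_opt,axis=0))
ax.plot(np.mean(optRatio_eps,axis=0))
ax.legend(['epsilon = 0, Qinit = 5','epsilon = 0.1, Qinit = 0'])
ax.set_title('Optimal action selection ratio (non-stationary)')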

4. Exercises

        (Sutton-RLBook2020, Exercise 2.7: the unbiased constant-step-size trick.) The exercise considers processing the $n$-th reward of a given action with the step size

$$\beta_n \doteq \alpha / \bar{o}_n, \qquad \bar{o}_n \doteq \bar{o}_{n-1} + \alpha\,(1 - \bar{o}_{n-1}) \quad (n \ge 1), \qquad \bar{o}_0 \doteq 0,$$

where $\alpha > 0$ is a conventional constant step size, and asks to show that $Q_n$ is an exponential recency-weighted average without initial bias. (I didn't fully understand it at first ^-^, so let me work through it slowly.)

Solution:

       According to the above definition, unrolling the recursion gives

$$\bar{o}_n = (1-\alpha)\,\bar{o}_{n-1} + \alpha = 1 - (1-\alpha)^n.$$

       Because $\bar{o}_1 = \alpha$, we have $\beta_1 = \alpha / \bar{o}_1 = 1$, and therefore

$$Q_2 = Q_1 + \beta_1 (R_1 - Q_1) = R_1.$$

         Therefore, when an action is selected for the first time, the estimate "forgets" the initial value $Q_1$, thus avoiding the bias problem.

         Secondly, because $\bar{o}_n \to 1$ and hence $\beta_n \to \alpha$ as $n \to \infty$, the method gradually becomes equivalent to the exponential recency-weighted average discussed earlier.
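        As a quick numerical sanity check of the two properties used above (this is only an illustrative sketch with alpha = 0.1, not part of the simulations in Section 2):

alpha = 0.1
obar  = 0.0          # \bar{o}_0 = 0
betas = []
for n in range(1,1001):
    obar = obar + alpha*(1 - obar)   # \bar{o}_n = \bar{o}_{n-1} + alpha*(1 - \bar{o}_{n-1})
    betas.append(alpha/obar)         # beta_n = alpha / \bar{o}_n

print(betas[0])      # 1.0   -> Q2 = Q1 + beta_1*(R1 - Q1) = R1, so Q1 is forgotten
print(betas[-1])     # ~0.1  -> beta_n approaches alpha for large n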

