Learning Rate Scheduler
We previously trained the MiniVGGNet network on the CIFAR-10 dataset. To reduce overfitting, we introduced learning rate decay when applying SGD.
This article discusses learning rate schedulers, sometimes called adaptive learning rates. By adjusting the learning rate from epoch to epoch, we can reduce loss, improve accuracy, and in some cases even shorten training time.
We can think of the process of adjusting the learning rate as:
- Use a higher learning rate early in training to find a set of reasonable weights.
- Then fine-tune the weights with a smaller learning rate until the optimal weights are found.
There are two basic types of learning rate schedulers you may encounter:
- Schedulers that reduce the learning rate gradually as a function of the epoch number, for example with a linear, polynomial, or exponential equation.
- Schedulers that drop the learning rate at specific epochs, for example with a piecewise function.
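As a rough illustration of the two families, the sketch below writes one schedule of each kind as a plain Python function. The function names, the default values (an initial rate of 0.01 over 40 epochs, a decay rate of 0.1, the drop boundaries) and the exact formulas are assumptions chosen for illustration; they are not prescribed by Keras or by this article.

```python
import math


def linear_decay(epoch, init_alpha=0.01, total_epochs=40):
    """Gradual: shrink the rate in a straight line from init_alpha to 0."""
    return init_alpha * (1.0 - epoch / total_epochs)


def polynomial_decay(epoch, init_alpha=0.01, total_epochs=40, power=2.0):
    """Gradual: same idea as linear decay, but raised to a power."""
    return init_alpha * (1.0 - epoch / total_epochs) ** power


def exponential_decay(epoch, init_alpha=0.01, rate=0.1):
    """Gradual: multiply the rate by exp(-rate) after every epoch."""
    return init_alpha * math.exp(-rate * epoch)


def piecewise_constant(epoch, boundaries=(10, 20), values=(0.01, 0.005, 0.001)):
    """Drop-based: hold one value until a boundary epoch, then switch."""
    for boundary, value in zip(boundaries, values):
        if epoch < boundary:
            return value
    return values[-1]


# Print the four schedules side by side for a few epochs.
for epoch in (0, 10, 20, 30):
    print(epoch,
          linear_decay(epoch),
          polynomial_decay(epoch),
          exponential_decay(epoch),
          piecewise_constant(epoch))
```

The first three functions change the rate a little at every epoch, while the last one keeps it flat and changes it only at the chosen boundaries.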
1. Standard decay scheduler in Keras
Look back at the code we used to initialize SGD:
    print("[INFO] compiling model...")
    opt = SGD(learning_rate=0.01, decay=0.01 / 40, momentum=0.9, nesterov=True)
    model = MiniVGGNet.build(width=32, height=32, depth=3, classes=10)
    model.compile(loss="categorical_crossentropy", optimizer=opt,
                  metrics=["accuracy"])
Here we use a learning rate of $\alpha = 0.01$ and a momentum of 0.9, and we specify that Nesterov accelerated gradient is used. The decay $\gamma$ is the initial learning rate divided by the total number of epochs: 0.01/40 = 0.00025.
Under the hood, Keras applies the following schedule to adjust the learning rate after each epoch:
$$\alpha_{e+1} = \alpha_e \times \frac{1}{1 + \gamma \cdot e}$$
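When $\gamma = 0$, the learning rate remains unchanged.
When $\gamma = 0.01/40$, the learning rate begins to decrease at the end of each epoch.
Note that Keras's legacy optimizers document decay as being applied over each update (each batch) rather than once per epoch, so the per-epoch form above is a simplification.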
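To make the two cases concrete, the sketch below evaluates the recurrence exactly as it is written in this article, once per epoch. The function name time_based_schedule is hypothetical, and this is not Keras's internal code; it only shows what the formula produces for a decay of 0 and a decay of 0.01/40.

```python
def time_based_schedule(init_alpha, decay, epochs):
    """Return one learning rate per epoch using
    alpha_{e+1} = alpha_e / (1 + decay * e), starting from e = 0."""
    alphas = [init_alpha]
    for e in range(epochs - 1):
        alphas.append(alphas[-1] / (1.0 + decay * e))
    return alphas


# With decay = 0 the rate never changes.
constant = time_based_schedule(0.01, 0.0, 40)

# With decay = 0.01 / 40 the rate shrinks as the epochs go by.
decayed = time_based_schedule(0.01, 0.01 / 40, 40)

print("first and last rate, decay = 0:      ", constant[0], constant[-1])
print("first and last rate, decay = 0.01/40:", decayed[0], decayed[-1])
```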
2. Step-based decay
This scheduler automatically lowers the learning rate after specific epochs. We can regard it as a piecewise function: the learning rate stays constant for a few epochs, drops suddenly, stays constant for a few more epochs, drops again, and so on.
When applying step-based decay, we have two options:
- Define a piecewise function for the learning rate.
- While training the network, watch for validation performance stalling, stop the script with ctrl+c, adjust the learning rate, and then resume training.
This article mainly focuses on the first method. The second method is more advanced and is often used when training deep neural networks on large datasets, where it is impossible to predict in advance when the learning rate should be adjusted.
3. Custom learning rate schedulers in Keras
The Keras library provides us with the LearningRateScheduler class, so that we can define a personalized learning rate function and apply it automatically during training.
This user-defined function takes the epoch number as a parameter and returns the learning rate for that epoch.
We define a piecewise function that reduces the learning rate by a factor F every D epochs. Our function is as follows:
$$\alpha_{E+1} = \alpha_1 \times F^{(1+E)/D}$$
where $\alpha_1$ is the initial learning rate, F is the factor controlling how much the learning rate is reduced, and D is the number of epochs between drops. In code, the exponent is rounded down with np.floor so that the learning rate stays constant between drops. The implementation is as follows:
    alpha = initAlpha * (factor ** np.floor((1 + epoch) / dropEvery))
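As a quick check of how this expression behaves, the sketch below evaluates it for a handful of epochs. It assumes the values used in the script that follows (an initial rate of 0.01, a factor of 0.25, a drop every 5 epochs) and zero-based epoch numbers, and it uses math.floor in place of np.floor so that it runs without NumPy.

```python
import math


def step_decay(epoch, init_alpha=0.01, factor=0.25, drop_every=5):
    """Learning rate for a zero-based epoch under step-based decay."""
    exponent = math.floor((1 + epoch) / drop_every)
    return init_alpha * factor ** exponent


# The rate is flat between drops and is multiplied by `factor` at each drop.
for epoch in (0, 3, 4, 8, 9, 14):
    print(epoch, step_decay(epoch))
```

Because of the `1 + epoch` term, the first drop happens at epoch index 4 (the fifth epoch), not at index 5.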
Create a Python file named cifar10_lr_decay.py and write the following code:
    import matplotlib
    matplotlib.use("Agg")

    from sklearn.preprocessing import LabelBinarizer
    from sklearn.metrics import classification_report
    from nn.conv.minivggnet import MiniVGGNet
    from tensorflow.keras.callbacks import LearningRateScheduler
    from tensorflow.keras.optimizers import SGD
    from tensorflow.keras.datasets import cifar10
    import matplotlib.pyplot as plt
    import numpy as np


    def step_decay(epoch):
        # Initialize the base learning rate, the drop factor, and the
        # number of epochs between drops
        initAlpha = 0.01
        factor = 0.25
        dropEvery = 5

        # Calculate the learning rate for the current epoch
        alpha = initAlpha * (factor ** np.floor((1 + epoch) / dropEvery))
        return float(alpha)


    # Define the path to the output loss/accuracy plot
    output = "/Users/liushanlin/PycharmProjects/DLstudy/result"

    # Load the training and test data and scale it to the range [0, 1]
    print("[INFO] loading cifar_10 data...")
    ((trainX, trainY), (testX, testY)) = cifar10.load_data()
    trainX = trainX.astype("float") / 255.0
    testX = testX.astype("float") / 255.0

    # Convert labels from integers to vectors
    lb = LabelBinarizer()
    trainY = lb.fit_transform(trainY)
    testY = lb.transform(testY)

    # Initialize the label names of the CIFAR-10 dataset
    labelNames = ["airplane", "automobile", "bird", "cat", "deer",
                  "dog", "frog", "horse", "ship", "truck"]

    # Define the set of callbacks to pass to the model during training
    callbacks = [LearningRateScheduler(step_decay)]

    # Initialize the optimizer and the model
    opt = SGD(learning_rate=0.01, momentum=0.9, nesterov=True)
    model = MiniVGGNet.build(width=32, height=32, depth=3, classes=10)
    model.compile(loss="categorical_crossentropy", optimizer=opt,
                  metrics=["accuracy"])

    # Train the network
    H = model.fit(trainX, trainY, validation_data=(testX, testY),
                  batch_size=64, epochs=40, callbacks=callbacks, verbose=1)

    # Evaluate the network
    print("[INFO] evaluating network...")
    predictions = model.predict(testX, batch_size=64)
    print(classification_report(testY.argmax(axis=1),
                                predictions.argmax(axis=1),
                                target_names=labelNames))

    # Plot the training loss and accuracy
    plt.style.use("ggplot")
    plt.figure()
    plt.plot(np.arange(0, 40), H.history["loss"], label="train_loss")
    plt.plot(np.arange(0, 40), H.history["val_loss"], label="val_loss")
    plt.plot(np.arange(0, 40), H.history["accuracy"], label="train_acc")
    plt.plot(np.arange(0, 40), H.history["val_accuracy"], label="val_acc")
    plt.title("Training Loss and Accuracy on CIFAR-10")
    plt.xlabel("Epoch#")
    plt.ylabel("Loss/Accuracy")
    plt.legend()
    plt.savefig(output)
Results:
                  precision    recall  f1-score   support

        airplane       0.83      0.72      0.77      1000
      automobile       0.92      0.80      0.86      1000
            bird       0.72      0.56      0.63      1000
             cat       0.59      0.54      0.56      1000
            deer       0.64      0.80      0.71      1000
             dog       0.65      0.68      0.67      1000
            frog       0.73      0.88      0.80      1000
           horse       0.85      0.77      0.81      1000
            ship       0.82      0.89      0.85      1000
           truck       0.79      0.88      0.83      1000

        accuracy                           0.75     10000
       macro avg       0.76      0.75      0.75     10000
    weighted avg       0.76      0.75      0.75     10000

The network reaches only 75% accuracy, and the learning rate decreases very quickly. After 15 epochs, the learning rate is only 0.01 × 0.25^3 ≈ 0.00016, which means the network is taking very small steps.
What happens if we set factor=0.5?
Change step_decay as follows:
    def step_decay(epoch):
        # Initialize the base learning rate, the drop factor, and the
        # number of epochs between drops
        initAlpha = 0.01
        factor = 0.5
        dropEvery = 5

        # Calculate the learning rate for the current epoch
        alpha = initAlpha * (factor ** np.floor((1 + epoch) / dropEvery))
        return float(alpha)
Run the program again and the results are as follows:
                  precision    recall  f1-score   support

        airplane       0.81      0.81      0.81      1000
      automobile       0.90      0.88      0.89      1000
            bird       0.71      0.61      0.66      1000
             cat       0.65      0.55      0.60      1000
            deer       0.69      0.78      0.74      1000
             dog       0.67      0.69      0.68      1000
            frog       0.79      0.86      0.82      1000
           horse       0.82      0.83      0.83      1000
            ship       0.87      0.90      0.88      1000
           truck       0.83      0.89      0.86      1000

        accuracy                           0.78     10000
       macro avg       0.78      0.78      0.78     10000
    weighted avg       0.78      0.78      0.78     10000
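Editing the body of step_decay for every experiment is easy to get wrong, so one option is to build the schedule from its parameters. The sketch below is one possible way to do that; make_step_decay and the variable names are hypothetical, and math.floor is used instead of np.floor to keep the example dependency-free. It compares the two factors tried in this article.

```python
import math


def make_step_decay(init_alpha=0.01, factor=0.5, drop_every=5):
    """Build a schedule function that takes only the (zero-based) epoch."""
    def schedule(epoch):
        exponent = math.floor((1 + epoch) / drop_every)
        return float(init_alpha * factor ** exponent)
    return schedule


aggressive = make_step_decay(factor=0.25)  # first experiment
gentle = make_step_decay(factor=0.5)       # second experiment

# Compare the two schedules at a few points over the 40 training epochs.
for epoch in (0, 5, 10, 15, 39):
    print(epoch, aggressive(epoch), gentle(epoch))
```

Since the returned function takes the epoch as its only argument, it should be usable in the same place as step_decay, for example LearningRateScheduler(make_step_decay(factor=0.5)).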