The main content includes:
- Basic elements of linear regression
- Implementation of a linear regression model from scratch
- Concise implementation of a linear regression model using PyTorch
Basic elements of linear regression
Model
For simplicity, we assume here that the price of a house depends on only two factors: its area (in square meters) and its age (in years). Next, we want to explore the specific relationship between the price and these two factors. Linear regression assumes that the output is a linear function of the inputs:

$$\mathrm{price} = w_{\mathrm{area}} \cdot \mathrm{area} + w_{\mathrm{age}} \cdot \mathrm{age} + b$$
Dataset
We usually collect a series of real data, for example the actual selling prices of multiple houses together with their corresponding areas and ages. We want to find model parameters on this data that minimize the error between the model's predicted prices and the true prices. In machine learning terminology, this data set is called the training data set or training set, each house is a sample, its actual selling price is its label, and the two factors used to predict the label are its features. Features characterize the properties of a sample.
Loss function
In model training, we need to measure the error between the predicted price and the true value. Usually we choose a non-negative number as the error, with smaller values meaning smaller errors. A common choice is the squared function. It evaluates the error of the sample with index $i$ as
$$l^{(i)}(\mathbf{w}, b) = \frac{1}{2}\left(\hat{y}^{(i)} - y^{(i)}\right)^2,$$

and we usually use the average of the losses over all $n$ training samples to measure the quality of the model's predictions:

$$L(\mathbf{w}, b) = \frac{1}{n}\sum_{i=1}^{n} l^{(i)}(\mathbf{w}, b) = \frac{1}{n}\sum_{i=1}^{n} \frac{1}{2}\left(\mathbf{w}^\top \mathbf{x}^{(i)} + b - y^{(i)}\right)^2.$$
Optimization function: mini-batch stochastic gradient descent
When the model and loss function are relatively simple, the solution to the error-minimization problem above can be written out directly as a formula. This kind of solution is called an analytical solution. The linear regression with squared error used in this section falls into this category. However, most deep learning models do not have an analytical solution; the value of the loss function can only be reduced by iteratively updating the model parameters with an optimization algorithm. This kind of solution is called a numerical solution.
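For reference, a minimal sketch of the analytical solution under the squared loss, assuming the bias $b$ is absorbed into $\mathbf{w}$ by appending a column of ones to the design matrix $\mathbf{X}$ and that $\mathbf{X}^\top\mathbf{X}$ is invertible:

$$\mathbf{w}^* = \left(\mathbf{X}^\top \mathbf{X}\right)^{-1} \mathbf{X}^\top \mathbf{y}.$$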
Mini-batch stochastic gradient descent is the most widely used optimization algorithm for finding numerical solutions. Its algorithm is simple: first select initial values for the model parameters, e.g. at random; then update the parameters iteratively so that each update is likely to reduce the value of the loss function. In each iteration, a mini-batch $\mathcal{B}$ consisting of a fixed number of training samples is sampled uniformly at random, and the derivative (gradient) of the average loss over the samples in the mini-batch with respect to the model parameters is computed. Finally, the product of this gradient and a preset positive number (the learning rate) is subtracted from the model parameters:

$$(\mathbf{w},b) \leftarrow (\mathbf{w},b) - \frac{\eta}{|\mathcal{B}|} \sum_{i \in \mathcal{B}} \partial_{(\mathbf{w},b)} l^{(i)}(\mathbf{w},b)$$
- Learning rate: $\eta$ is the step size of each parameter update.
- Batch size: $|\mathcal{B}|$ is the number of samples in each mini-batch.
To summarize, the optimization proceeds in two steps (a minimal code sketch follows this list):
- (i) initialize the model parameters, usually at random;
- (ii) iterate over the data several times, updating each parameter by moving it in the direction of the negative gradient.
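The sketch below illustrates one parameter update of the rule above. The names `sgd_step`, `params`, and `grads` are hypothetical, and the gradients are assumed to have already been averaged over the mini-batch; the full from-scratch version appears in section 1.8.

```python
import torch

# Minimal sketch of one mini-batch SGD update (hypothetical names; `grads` is
# assumed to already hold the mini-batch-averaged gradients).
def sgd_step(params, grads, lr):
    for param, grad in zip(params, grads):
        param -= lr * grad  # step against the gradient, scaled by the learning rate

# toy usage with a single scalar parameter
w = torch.tensor([0.0])
g = torch.tensor([2.0])
sgd_step([w], [g], lr=0.1)
print(w)  # tensor([-0.2000])
```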
Vector calculation
In model training or prediction, we often process multiple data samples at the same time, which calls for vectorized computation. Before introducing vectorized expressions for linear regression, let us consider two ways of adding two vectors.
- One way is to add the two vectors element by element with scalar additions.
- The other way is to add the two vectors directly with a single vectorized operation.
```python
import torch
import time

# init variables a and b as 1000-dimensional vectors
n = 1000
a = torch.ones(n)
b = torch.ones(n)
```

```python
# define a timer class to record time
class Timer(object):
    """Record multiple running times."""
    def __init__(self):
        self.times = []
        self.start()

    def start(self):
        # start the timer
        self.start_time = time.time()

    def stop(self):
        # stop the timer and record the elapsed time into a list
        self.times.append(time.time() - self.start_time)
        return self.times[-1]

    def avg(self):
        # return the average of the recorded times
        return sum(self.times) / len(self.times)

    def sum(self):
        # return the sum of the recorded times
        return sum(self.times)
```
Now we can test them. First, the two vectors are added element by element using a for loop.

```python
timer = Timer()
c = torch.zeros(n)
for i in range(n):
    c[i] = a[i] + b[i]
'%.5f sec' % timer.stop()
```
'0.01432 sec'
Another is to use torch to add two vectors directly:
```python
timer.start()
d = a + b
'%.5f sec' % timer.stop()
```
'0.00024 sec'
The result is clear: the latter is much faster than the former. Therefore, we should use vectorized computation as much as possible to improve computational efficiency.
1. Implementation of a linear regression model from scratch
1.1 Import required packages

```python
# import packages and modules
%matplotlib inline
import torch
from IPython import display
from matplotlib import pyplot as plt
import numpy as np
import random

print(torch.__version__)
```

```
1.3.0
```
1.2 Simulating the generation of experimental datasets
We use a linear model to generate a dataset of 1,000 samples. The following linear relationship is used to generate the data:

$$\mathrm{price} = w_{\mathrm{area}} \cdot \mathrm{area} + w_{\mathrm{age}} \cdot \mathrm{age} + b$$

```python
# set input feature number
num_inputs = 2
# set example number
num_examples = 1000

# set true weight and bias in order to generate the corresponding labels
true_w = [2, -3.4]
true_b = 4.2

features = torch.randn(num_examples, num_inputs, dtype=torch.float32)
labels = true_w[0] * features[:, 0] + true_w[1] * features[:, 1] + true_b
labels += torch.tensor(np.random.normal(0, 0.01, size=labels.size()), dtype=torch.float32)
```
1.3 Visualization of simulated data
```python
plt.scatter(features[:, 1].numpy(), labels.numpy(), 1);
```
1.4 Read simulated datasets
When training a model, we need to traverse the dataset and constantly read small batches of data samples.
```python
# Define a function, data_iter, that returns the features and labels of
# batch_size random samples at a time.
def data_iter(batch_size, features, labels):
    num_examples = len(features)
    indices = list(range(num_examples))
    # the order in which samples are read is random
    random.shuffle(indices)
    for i in range(0, num_examples, batch_size):
        # the last batch may contain fewer than batch_size samples
        j = torch.LongTensor(indices[i: min(i + batch_size, num_examples)])
        # index_select returns the elements corresponding to the index
        yield features.index_select(0, j), labels.index_select(0, j)
```

```python
# A small program to illustrate the torch.index_select() function:
j = torch.LongTensor([1, 2, 3, 4, 5, 6, 7, 8, 10])
features.index_select(0, j)
# torch.index_select() selects along the specified dimension dim,
# e.g. some rows or some columns. The returned Tensor does not share
# memory with the original Tensor.
# dim=0 means select by row, dim=1 means select by column.
```

```
tensor([[-0.8797,  1.0840],
        [ 1.4686,  0.5604],
        [ 0.6072, -1.0188],
        [-0.3210,  1.1137],
        [ 0.4691,  1.2000],
        [-0.8294, -0.8613],
        [ 0.9604, -0.2414],
        [ 0.3751, -0.8777],
        [-0.2483,  0.1386]])
```
```python
# Read the first mini-batch of data samples and print it.
# The feature shape of each batch is (10, 2), corresponding to the batch size
# and the number of inputs; the label shape is the batch size.
# Note that all batches will be printed if the break is removed.
batch_size = 10

for X, y in data_iter(batch_size, features, labels):
    print(X, '\n', y)
    break
```

```
tensor([[ 0.6798,  0.2157],
        [-0.3323, -0.7287],
        [ 0.7562, -0.0557],
        [-0.2248, -0.6173],
        [-3.0879, -0.7436],
        [-0.9020, -0.1528],
        [ 0.4947, -0.2986],
        [ 1.4328, -0.7418],
        [ 0.3510, -0.3221],
        [ 0.5044, -1.0165]])
 tensor([4.8287, 5.9974, 5.8995, 5.8659, 0.5635, 2.9160, 6.1984, 9.6012, 6.0001,
        8.6722])
```
1.5 Initialize model parameters
We initialize the weights to random numbers drawn from a normal distribution with mean 0 and standard deviation 0.01, and the bias to 0:

```python
w = torch.tensor(np.random.normal(0, 0.01, (num_inputs, 1)), dtype=torch.float32)  # num_inputs is 2
b = torch.zeros(1, dtype=torch.float32)

# In model training, gradients are required for these parameters in order to
# update their values, so we enable gradient tracking for them:
w.requires_grad_(requires_grad=True)
b.requires_grad_(requires_grad=True)
```
tensor([0.], requires_grad=True)
1.6 Define Model
Define the model used for training:

$$\mathrm{price} = w_{\mathrm{area}} \cdot \mathrm{area} + w_{\mathrm{age}} \cdot \mathrm{age} + b$$

```python
# To implement the vectorized expression of linear regression, we use torch.mm
# for matrix multiplication:
# torch.mm(a, b) is the matrix product of a and b, while torch.mul(a, b) is the
# element-wise product of a and b.
def linreg(X, w, b):
    return torch.mm(X, w) + b
```
1.7 Define the loss function
We use the squared loss function:

$$l^{(i)}(\mathbf{w}, b) = \frac{1}{2}\left(\hat{y}^{(i)} - y^{(i)}\right)^2,$$
In the implementation, y.view(y_hat.size()) is used to reshape y to the shape of y_hat, so that y_hat and y cannot end up with mismatched dimensions.
A torch Tensor can use view() to change its shape. For example, y = x.view(12) or z = x.view(-1, 6), where -1 indicates that this dimension is inferred from the values of the other dimensions. A few notes on view():
1. The new tensor returned by view() shares memory with the source tensor; they are two views of the same data, so changing one also changes the other. view() only changes how the tensor is viewed, not its contents.
2. The reshape() function can also change the shape, but it does not guarantee that a copy is returned, so it is not recommended here.
3. It is recommended to first make a copy with clone() and then call view(). A further advantage of clone() is that it is recorded in the computation graph, so gradients flowing back to the copy are also propagated to the source Tensor. (A short sketch of these points follows this list.)
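A small sketch of the view()/clone() behavior described above, using a toy tensor that is not part of the model code:

```python
import torch

x = torch.arange(6)       # tensor([0, 1, 2, 3, 4, 5])
y = x.view(2, 3)          # same underlying storage, new shape
y[0, 0] = 100
print(x[0])               # tensor(100): the view shares memory with x

z = x.clone().view(2, 3)  # copy first, then view: z has its own memory
z[0, 0] = -1
print(x[0])               # still tensor(100): x is unchanged
```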
```python
def squared_loss(y_hat, y):
    return (y_hat - y.view(y_hat.size())) ** 2 / 2
```
1.8 Define the optimization function
Here the optimization function is mini-batch stochastic gradient descent:

$$(\mathbf{w},b) \leftarrow (\mathbf{w},b) - \frac{\eta}{|\mathcal{B}|} \sum_{i \in \mathcal{B}} \partial_{(\mathbf{w},b)} l^{(i)}(\mathbf{w},b)$$
The following sgd function implements the mini-batch stochastic gradient descent algorithm. It iteratively updates the model parameters to minimize the loss function.
Here, the gradient computed by the autograd module is the sum of the gradients over the batch of samples, so we divide it by the batch size to obtain the average.

```python
def sgd(params, lr, batch_size):
    for param in params:
        # use .data to update param without gradient tracking
        param.data -= lr * param.grad / batch_size
```
1.9 Training
Once the dataset, model, loss function, and optimization function are defined, you are ready to train the model.
```python
# hyperparameter initialization
lr = 0.03
num_epochs = 5

net = linreg
loss = squared_loss

# training
for epoch in range(num_epochs):  # training repeats num_epochs times
    # in each epoch, all the samples in the dataset are used once;
    # X is the feature and y is the label of a mini-batch
    for X, y in data_iter(batch_size, features, labels):
        l = loss(net(X, w, b), y).sum()
        # calculate the gradient of the mini-batch loss
        l.backward()
        # use mini-batch stochastic gradient descent to update the model parameters
        sgd([w, b], lr, batch_size)
        # reset parameter gradients: the previous gradient record must be
        # cleared after each update
        w.grad.data.zero_()
        b.grad.data.zero_()
    train_l = loss(net(features, w, b), labels)
    print('epoch %d, loss %f' % (epoch + 1, train_l.mean().item()))
```

```
epoch 1, loss 0.035207
epoch 2, loss 0.000123
epoch 3, loss 0.000054
epoch 4, loss 0.000054
epoch 5, loss 0.000054
```

```python
w, true_w, b, true_b
```
```
(tensor([[ 1.9999],
         [-3.4002]], requires_grad=True),
 [2, -3.4],
 tensor([4.2003], requires_grad=True),
 4.2)
```

2. Concise implementation of a linear regression model using PyTorch

```python
import torch
from torch import nn
import numpy as np

torch.manual_seed(1)

print(torch.__version__)
torch.set_default_tensor_type('torch.FloatTensor')
```

```
1.3.0
```
2.1 Generating datasets
The dataset is generated here in exactly the same way as in the from-scratch implementation.

```python
num_inputs = 2
num_examples = 1000

true_w = [2, -3.4]
true_b = 4.2

features = torch.tensor(np.random.normal(0, 1, (num_examples, num_inputs)), dtype=torch.float)
labels = true_w[0] * features[:, 0] + true_w[1] * features[:, 1] + true_b
labels += torch.tensor(np.random.normal(0, 0.01, size=labels.size()), dtype=torch.float)
```
2.2 Reading datasets
```python
import torch.utils.data as Data

batch_size = 10

# combine the features and labels into a dataset
dataset = Data.TensorDataset(features, labels)

# put the dataset into a DataLoader
data_iter = Data.DataLoader(
    dataset=dataset,        # torch TensorDataset format
    batch_size=batch_size,  # mini-batch size
    shuffle=True,           # whether to shuffle the data or not
    num_workers=2,          # read data with multiple worker processes
)
```

```python
for X, y in data_iter:
    print(X, '\n', y)
    break
```

```
tensor([[ 0.0949, -2.0367],
        [ 0.0957, -2.4354],
        [ 0.1520, -1.5686],
        [ 1.3453,  0.1253],
        [ 0.3076, -1.0100],
        [-0.6013,  1.6175],
        [ 0.2898,  0.2359],
        [ 0.4352, -0.4930],
        [ 0.9694, -0.8326],
        [-1.0968, -0.2515]])
 tensor([11.3024, 12.6900,  9.8462,  6.4771,  8.2533, -2.4928,  3.9811,  6.7626,
         8.9806,  2.8489])
```
2.3 Define Model
nn.Module is a class provided by PyTorch and is the base class of all neural network modules. Our custom modules inherit from this base class.

```python
class LinearNet(nn.Module):
    def __init__(self, n_feature):
        # call the parent class constructor to inherit nn.Module's __init__
        super(LinearNet, self).__init__()
        # define the form of each layer; nn.Linear is a linear layer that can be
        # called directly on input data, i.e. linear(x)
        # function prototype: `torch.nn.Linear(in_features, out_features, bias=True)`
        self.linear = nn.Linear(n_feature, 1)

    def forward(self, x):
        # the forward computation required by nn.Module
        y = self.linear(x)
        return y
```

```python
net = LinearNet(num_inputs)
print(net)
```

```
LinearNet(
  (linear): Linear(in_features=2, out_features=1, bias=True)
)
```
super(LinearNet, self).__init__()
This call initializes the attributes inherited from the parent class:
First, the parent class of LinearNet (here nn.Module) is found, and the LinearNet object self is treated as an object of that parent class.
Then the parent class's own __init__ method is called on it.
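A minimal, generic illustration of what this call does; the class names Base and Child are hypothetical and unrelated to the model code:

```python
class Base(object):
    def __init__(self):
        # attribute set up by the parent class constructor
        self.base_attr = 'initialized by Base'

class Child(Base):
    def __init__(self):
        super(Child, self).__init__()  # run Base.__init__ on this Child instance
        print(self.base_attr)          # the inherited attribute is now available

Child()  # prints: initialized by Base
```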
```python
# An example to help understand nn.Linear
import torch

nn1 = torch.nn.Linear(100, 50)
input1 = torch.randn(140, 100)
output1 = nn1(input1)
output1.size()
```
torch.Size([140, 50])
```python
# ways to build a (multilayer) network
# method one
net = nn.Sequential(
    nn.Linear(num_inputs, 1)
    # other layers can be added here
)

# method two
net = nn.Sequential()
net.add_module('linear', nn.Linear(num_inputs, 1))
# net.add_module ......

# method three
from collections import OrderedDict
net = nn.Sequential(OrderedDict([
    ('linear', nn.Linear(num_inputs, 1))
    # ......
]))

print(net)
print(net[0])
```

```
Sequential(
  (linear): Linear(in_features=2, out_features=1, bias=True)
)
Linear(in_features=2, out_features=1, bias=True)
```
2.4 Initialize model parameters

```python
from torch.nn import init

init.normal_(net[0].weight, mean=0.0, std=0.01)
init.constant_(net[0].bias, val=0.0)
# or you can use `net[0].bias.data.fill_(0)` to modify it directly
```
Parameter containing: tensor([0.], requires_grad=True)
```python
for param in net.parameters():
    print(param)
```

```
Parameter containing:
tensor([[-0.0152,  0.0038]], requires_grad=True)
Parameter containing:
tensor([0.], requires_grad=True)
```
2.5 Define the loss function

```python
loss = nn.MSELoss()  # nn built-in squared loss function
# function prototype: `torch.nn.MSELoss(size_average=None, reduce=None, reduction='mean')`
```
2.6 Define the optimization function

```python
import torch.optim as optim

optimizer = optim.SGD(net.parameters(), lr=0.03)  # built-in stochastic gradient descent
print(optimizer)
# function prototype: `torch.optim.SGD(params, lr=, momentum=0, dampening=0, weight_decay=0, nesterov=False)`
```

```
SGD (
Parameter Group 0
    dampening: 0
    lr: 0.03
    momentum: 0
    nesterov: False
    weight_decay: 0
)
```
2.7 Training

```python
num_epochs = 3
for epoch in range(1, num_epochs + 1):
    for X, y in data_iter:
        output = net(X)
        l = loss(output, y.view(-1, 1))
        optimizer.zero_grad()  # reset gradients, equivalent to net.zero_grad()
        l.backward()
        optimizer.step()
    print('epoch %d, loss: %f' % (epoch, l.item()))
```

```
epoch 1, loss: 0.000290
epoch 2, loss: 0.000128
epoch 3, loss: 0.000107
```
```python
# result comparison
dense = net[0]
print(true_w, dense.weight.data)
print(true_b, dense.bias.data)
```

```
[2, -3.4] tensor([[ 1.9997, -3.3999]])
4.2 tensor([4.2002])
```
Comparison of the two implementations
- Implementation from scratch (recommended for learning): helps you better understand the model and the underlying principles of neural networks.
- Concise implementation using PyTorch: allows you to design and implement the model more quickly.