L1 Linear Regression


This article covers:

  1. Basic elements of linear regression
  2. Implementing a linear regression model from scratch
  3. A concise implementation of linear regression using PyTorch

Basic elements of linear regression


For simplicity, we assume that a house's price depends on only two factors: its area (square meters) and its age (years). We want to find the specific relationship between the price and these two factors. Linear regression assumes that the output is a linear function of each input:

$$\mathrm{price} = w_{\mathrm{area}} \cdot \mathrm{area} + w_{\mathrm{age}} \cdot \mathrm{age} + b$$
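As a quick numeric illustration of the formula above, the sketch below plugs in made-up parameter values. The numbers `w_area = 2.0`, `w_age = -0.5` and `b = 10.0`, and the helper name `predict_price`, are assumptions chosen for illustration, not fitted values.

```python
# Hypothetical parameters, chosen only for illustration (not fitted values)
w_area, w_age, b = 2.0, -0.5, 10.0

def predict_price(area, age):
    """Linear model: price = w_area * area + w_age * age + b."""
    return w_area * area + w_age * age + b

# A house of 100 square meters that is 20 years old
price = predict_price(area=100.0, age=20.0)
print(price)  # 2.0 * 100 - 0.5 * 20 + 10 = 200.0
```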

Dataset

We usually collect a set of real data, such as the actual selling prices of a number of houses together with their areas and ages. We then look for model parameters that minimize the error between the model's predicted prices and the true prices on this data. In machine learning terminology, this dataset is called the training dataset or training set, each house is a sample, its actual selling price is the label, and the two factors used to predict the label are the features. Features describe the characteristics of a sample.

Loss function

During model training, we need to measure the error between the predicted price and the true price. We usually choose a non-negative number as the error, with smaller values meaning smaller errors. A common choice is the squared error. For the sample with index $i$, it is

$$l^{(i)}(\mathbf{w}, b) = \frac{1}{2} \left(\hat{y}^{(i)} - y^{(i)}\right)^2,$$

where $\hat{y}^{(i)} = \mathbf{w}^\top \mathbf{x}^{(i)} + b$ is the prediction for sample $i$ and $y^{(i)}$ is its label. The loss on a training set of $n$ samples is the average of the per-sample losses:

$$L(\mathbf{w}, b) = \frac{1}{n}\sum_{i=1}^n l^{(i)}(\mathbf{w}, b) = \frac{1}{n} \sum_{i=1}^n \frac{1}{2}\left(\mathbf{w}^\top \mathbf{x}^{(i)} + b - y^{(i)}\right)^2.$$
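To make the two formulas concrete, here is a small plain-Python sketch that evaluates the per-sample loss and the average loss. The predictions and labels are made-up numbers used only for illustration.

```python
# Hypothetical predictions and labels for three samples
y_hat = [2.5, 0.0, 3.0]
y = [3.0, -0.5, 3.0]

def squared_loss(y_hat_i, y_i):
    """Per-sample loss: 1/2 * (y_hat - y)^2."""
    return 0.5 * (y_hat_i - y_i) ** 2

per_sample = [squared_loss(p, t) for p, t in zip(y_hat, y)]
mean_loss = sum(per_sample) / len(per_sample)  # average over n = 3 samples

print(per_sample)  # [0.125, 0.125, 0.0]
print(mean_loss)
```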

Optimization function: stochastic gradient descent

When the model and loss function are relatively simple, the solution to the error minimization problem above can be written directly as a formula. Such a solution is called an analytical solution. Linear regression with squared error, as used in this section, falls into this category. However, most deep learning models have no analytical solution, and the loss can only be reduced by updating the model parameters over a finite number of iterations of an optimization algorithm. Such a solution is called a numerical solution.
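To illustrate the analytical case, the sketch below uses the standard closed-form least-squares solution for a model with a single feature, y ≈ w·x + b. The single-feature setting and the toy data are assumptions made to keep the example short. The article's two-feature model has an analogous formula that is not shown here.

```python
# Toy data generated from y = 2x + 1 (hypothetical, noise-free)
xs = [0.0, 1.0, 2.0, 3.0]
ys = [1.0, 3.0, 5.0, 7.0]

x_mean = sum(xs) / len(xs)
y_mean = sum(ys) / len(ys)

# Closed-form least-squares solution for one feature:
# w = sum((x - x_mean) * (y - y_mean)) / sum((x - x_mean)^2), b = y_mean - w * x_mean
numerator = sum((x - x_mean) * (y - y_mean) for x, y in zip(xs, ys))
denominator = sum((x - x_mean) ** 2 for x in xs)
w = numerator / denominator
b = y_mean - w * x_mean

print(w, b)  # recovers the slope 2 and intercept 1 used to generate the data
```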

Mini-batch stochastic gradient descent is widely used to find numerical solutions. The algorithm is simple: first choose initial values for the model parameters, for example at random; then update the parameters over many iterations so that each iteration is likely to reduce the loss. In each iteration, a mini-batch $\mathcal{B}$ containing a fixed number of training samples is drawn uniformly at random, and the gradient of the average loss over the mini-batch is computed with respect to the model parameters. Finally, this gradient multiplied by a preset positive number is subtracted from the model parameters:

$$(\mathbf{w},b) \leftarrow (\mathbf{w},b) - \frac{\eta}{|\mathcal{B}|} \sum_{i \in \mathcal{B}} \partial_{(\mathbf{w},b)} l^{(i)}(\mathbf{w},b)$$

Learning rate: $\eta$ is the step size taken in each update.
Batch size: $|\mathcal{B}|$ is the number of samples in each mini-batch.

To summarize, optimization consists of two steps:

  • (i) Initialize the model parameters, usually at random;
  • (ii) Iterate over the data several times, updating each parameter by moving it in the direction of the negative gradient.
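The two steps above can be sketched in plain Python for a single-feature model. This is a minimal illustration under assumed settings (toy noise-free data from y = 2x + 1, learning rate 0.1, batch size 10, 50 epochs), not the implementation used later in this article.

```python
import random

random.seed(0)

# Toy data from y = 2x + 1 with x in [-1, 1] (hypothetical, noise-free)
xs = [random.uniform(-1.0, 1.0) for _ in range(200)]
ys = [2.0 * x + 1.0 for x in xs]

w, b = 0.0, 0.0           # (i) initialize the parameters
lr, batch_size = 0.1, 10  # learning rate and mini-batch size

for epoch in range(50):   # (ii) iterate over the data several times
    indices = list(range(len(xs)))
    random.shuffle(indices)
    for start in range(0, len(indices), batch_size):
        batch = indices[start:start + batch_size]
        # Gradients of the average of 1/2 * (w*x + b - y)^2 over the mini-batch
        grad_w = sum((w * xs[i] + b - ys[i]) * xs[i] for i in batch) / len(batch)
        grad_b = sum((w * xs[i] + b - ys[i]) for i in batch) / len(batch)
        # Move each parameter against its gradient
        w -= lr * grad_w
        b -= lr * grad_b

print(round(w, 3), round(b, 3))  # should be close to the true values 2 and 1
```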

Vector calculation

In model training or prediction, we often process multiple data samples at once, which calls for vector computation. Before introducing the vectorized expression of linear regression, consider two ways to add two vectors:

  1. Add the vectors element by element, treating each element as a scalar.
  2. Add the two vectors directly as a single vector operation.
import torch
import time

# init variable a, b as 1000 dimension vector
n = 1000
a = torch.ones(n)
b = torch.ones(n)

# define a timer class to record time
class Timer(object):
    """Record multiple running times."""
    def __init__(self):
        self.times = []

    def start(self):
        # start the timer
        self.start_time = time.time()

    def stop(self):
        # stop the timer and record time into a list
        self.times.append(time.time() - self.start_time)
        return self.times[-1]

    def avg(self):
        # calculate the average and return
        return sum(self.times)/len(self.times)

    def sum(self):
        # return the sum of recorded time
        return sum(self.times)

Now we can test it. First, we add the two vectors one element at a time using a for loop.

timer = Timer()
c = torch.zeros(n)
timer.start()  # start timing; Timer.__init__ does not start the clock
for i in range(n):
    c[i] = a[i] + b[i]
'%.5f sec' % timer.stop()
'0.01432 sec'

The other way is to add the two vectors directly with torch:

timer.start()  # restart the clock so only the vector addition is timed
d = a + b
'%.5f sec' % timer.stop()
'0.00024 sec'

The result is clear: the second approach is much faster than the first. We should therefore use vectorized computation whenever possible to improve efficiency.

1. Implementing the Linear Regression Model from Scratch

1.1 Import required packages

# import packages and modules
%matplotlib inline
import torch
from IPython import display
from matplotlib import pyplot as plt
import numpy as np
import random


1.2 Simulating the generation of experimental datasets

We generate a dataset of 1,000 samples from a linear model. The labels follow the linear relationship below, plus a small amount of Gaussian noise (standard deviation 0.01):

$$\mathrm{price} = w_{\mathrm{area}} \cdot \mathrm{area} + w_{\mathrm{age}} \cdot \mathrm{age} + b$$

# set input feature number 
num_inputs = 2
# set example number
num_examples = 1000

# set the true weights and bias used to generate the labels
true_w = [2, -3.4]
true_b = 4.2

features = torch.randn(num_examples, num_inputs,
                       dtype=torch.float32)
labels = true_w[0] * features[:, 0] + true_w[1] * features[:, 1] + true_b
# add Gaussian noise with standard deviation 0.01
labels += torch.tensor(np.random.normal(0, 0.01, size=labels.size()),
                       dtype=torch.float32)

1.3 Visualization of simulated data

plt.scatter(features[:, 1].numpy(), labels.numpy(), 1);

1.4 Read simulated datasets

When training a model, we need to traverse the dataset and repeatedly read mini-batches of data samples.

# data_iter returns the features and labels of batch_size random samples at a time.
def data_iter(batch_size, features, labels):
    num_examples = len(features)
    indices = list(range(num_examples))
    # shuffle so that samples are read in random order
    random.shuffle(indices)
    for i in range(0, num_examples, batch_size):
        # the last batch may contain fewer than batch_size samples
        j = torch.LongTensor(indices[i: min(i + batch_size, num_examples)])
        # index_select returns the rows at the given indices
        yield features.index_select(0, j), labels.index_select(0, j)
# Checking torch.index_select():
# It selects entries along the given dimension dim, for example certain rows or certain columns. The returned tensor does not share memory with the original tensor.
# For a 2-D tensor, dim=0 selects rows and dim=1 selects columns.
tensor([[-0.8797,  1.0840],
        [ 1.4686,  0.5604],
        [ 0.6072, -1.0188],
        [-0.3210,  1.1137],
        [ 0.4691,  1.2000],
        [-0.8294, -0.8613],
        [ 0.9604, -0.2414],
        [ 0.3751, -0.8777],
        [-0.2483,  0.1386]])
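The behavior of `index_select` described above can be checked with a small sketch. The tensor `x` and the index values here are made up for illustration.

```python
import torch

x = torch.arange(12, dtype=torch.float32).view(4, 3)  # 4 rows, 3 columns
idx = torch.tensor([0, 2])

rows = x.index_select(0, idx)  # dim=0: pick rows 0 and 2
cols = x.index_select(1, idx)  # dim=1: pick columns 0 and 2
print(rows.shape, cols.shape)  # torch.Size([2, 3]) torch.Size([4, 2])

# The result is a copy: modifying it leaves x unchanged
rows[0, 0] = 100.0
print(x[0, 0].item())  # 0.0
```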
# Read and print the first mini-batch. The features of each batch have shape (10, 2),
# i.e. the batch size and the number of inputs; the labels have shape (10,), the batch size.
# Note that removing the break prints every batch.
batch_size = 10
for X, y in data_iter(batch_size, features, labels):
    print(X, '\n', y)
    break
tensor([[ 0.6798,  0.2157],
        [-0.3323, -0.7287],
        [ 0.7562, -0.0557],
        [-0.2248, -0.6173],
        [-3.0879, -0.7436],
        [-0.9020, -0.1528],
        [ 0.4947, -0.2986],
        [ 1.4328, -0.7418],
        [ 0.3510, -0.3221],
        [ 0.5044, -1.0165]]) 
 tensor([4.8287, 5.9974, 5.8995, 5.8659, 0.5635, 2.9160, 6.1984, 9.6012, 6.0001,
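The batching logic used by data_iter can be looked at in isolation with plain Python lists. In this sketch the helper name `batch_indices` and the sizes (25 samples, batches of 10) are assumptions for illustration. It shows that the last batch is shorter when the dataset size is not a multiple of the batch size, and that every sample is visited exactly once per pass.

```python
import random

def batch_indices(num_examples, batch_size, seed=0):
    """Yield shuffled lists of indices; the last list may be shorter."""
    indices = list(range(num_examples))
    random.Random(seed).shuffle(indices)
    for i in range(0, num_examples, batch_size):
        yield indices[i:i + batch_size]  # slicing clips at the end of the list

batches = list(batch_indices(num_examples=25, batch_size=10))
sizes = [len(batch) for batch in batches]
print(sizes)  # [10, 10, 5]

seen = sorted(i for batch in batches for i in batch)
print(seen == list(range(25)))  # True: every sample appears exactly once
```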

1.5 Initialize model parameters

We initialize the weights with normally distributed random numbers with mean 0 and standard deviation 0.01, and the bias to 0:

w = torch.tensor(np.random.normal(0, 0.01, (num_inputs, 1)), dtype=torch.float32)  # note that num_inputs is 2
b = torch.zeros(1, dtype=torch.float32)
# During training we need the gradients of these parameters in order to update them, so we enable gradient tracking on both:
w.requires_grad_(requires_grad=True)
b.requires_grad_(requires_grad=True)
tensor([0.], requires_grad=True)

1.6 Define Model

Define the model, which maps the features and the parameters to a prediction:

$$\mathrm{price} = w_{\mathrm{area}} \cdot \mathrm{area} + w_{\mathrm{age}} \cdot \mathrm{age} + b$$

# To implement the vectorized form of linear regression, we use torch.mm for matrix multiplication.
# torch.mm(a, b) is the matrix product of a and b; torch.mul(a, b) multiplies a and b element by element.
def linreg(X, w, b):
    return torch.mm(X, w) + b
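The difference between `torch.mm` and `torch.mul`, and the shapes involved in the model, can be seen in a small sketch. The values of `X`, `w` and `b` below are made up for illustration.

```python
import torch

X = torch.tensor([[1.0, 2.0],
                  [3.0, 4.0],
                  [5.0, 6.0]])       # 3 samples, 2 features
w = torch.tensor([[2.0], [-1.0]])   # shape (2, 1), hypothetical weights
b = torch.tensor([0.5])             # bias, broadcast over the batch

y_hat = torch.mm(X, w) + b          # matrix product: shape (3, 1)
print(y_hat.shape)                  # torch.Size([3, 1])
print(y_hat.view(-1).tolist())      # [0.5, 2.5, 4.5]

elementwise = torch.mul(X, X)       # element-wise product keeps shape (3, 2)
print(elementwise.shape)            # torch.Size([3, 2])
```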

1.7 Define the loss function

We use the squared error loss:
$$l^{(i)}(\mathbf{w}, b) = \frac{1}{2} \left(\hat{y}^{(i)} - y^{(i)}\right)^2,$$

In the implementation, y.view(y_hat.size()) reshapes y to the shape of y_hat, so that the subtraction is not affected by a mismatch between the shapes of y and y_hat.
A tensor's shape can be changed with view(). For example, y = x.view(12) or z = x.view(-1, 6), where -1 means that the size of that dimension is inferred from the other dimensions.
1. The tensor returned by view() shares memory with the source tensor; they are the same data, so changing one also changes the other. view() only changes how the data is viewed.
2. reshape() can also change the shape, but it does not guarantee whether the result is a copy or a view, so it is not recommended here.
3. To get an independent copy, call clone() first and then view(). A further advantage of clone() is that it is recorded in the computational graph, so gradients flowing back to the copy are also passed to the source tensor.
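The memory-sharing behavior of view() and the effect of clone() can be verified with a short sketch. The tensor sizes here are arbitrary choices for illustration.

```python
import torch

x = torch.zeros(2, 3)
v = x.view(6)          # same storage, different shape
v[0] = 1.0
print(x[0, 0].item())  # 1.0: the change is visible through x

c = x.clone().view(6)  # independent copy, then reshaped
c[1] = 5.0
print(x[0, 1].item())  # 0.0: x is unaffected

z = x.view(-1, 2)      # -1 is inferred from the other dimension: 6 / 2 = 3
print(z.shape)         # torch.Size([3, 2])
```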

def squared_loss(y_hat, y): 
    return (y_hat - y.view(y_hat.size())) ** 2 / 2

1.8 Define the optimization function

Here the optimization function is mini-batch stochastic gradient descent:

$$(\mathbf{w},b) \leftarrow (\mathbf{w},b) - \frac{\eta}{|\mathcal{B}|} \sum_{i \in \mathcal{B}} \partial_{(\mathbf{w},b)} l^{(i)}(\mathbf{w},b)$$
The sgd function below implements mini-batch stochastic gradient descent. It optimizes the loss function by iteratively updating the model parameters.
The gradient computed by autograd here is the sum of the gradients over a batch of samples, so we divide it by the batch size to obtain the average.

def sgd(params, lr, batch_size):
    for param in params:
        param.data -= lr * param.grad / batch_size  # use .data to update param without gradient tracking
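The division by the batch size can be checked numerically. This sketch uses a made-up mini-batch of 4 samples and compares the gradient of the summed loss divided by the batch size with the gradient of the mean loss. The tensors are illustrative assumptions, not the article's data.

```python
import torch

torch.manual_seed(0)
X = torch.randn(4, 2)                  # hypothetical mini-batch of 4 samples
y = torch.randn(4, 1)
w = torch.zeros(2, 1, requires_grad=True)

# Gradient of the summed loss, divided by the batch size afterwards
loss_sum = ((torch.mm(X, w) - y) ** 2 / 2).sum()
loss_sum.backward()
grad_from_sum = w.grad.clone() / X.shape[0]

# Gradient of the mean loss, computed directly
w.grad.zero_()
loss_mean = ((torch.mm(X, w) - y) ** 2 / 2).mean()
loss_mean.backward()

print(torch.allclose(grad_from_sum, w.grad))  # True
```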


Once the dataset, model, loss function, and optimization function are defined, you are ready to train the model.

# hyperparameter initialization
lr = 0.03
num_epochs = 5

net = linreg
loss = squared_loss

# training
for epoch in range(num_epochs):  # training repeats num_epochs times
    # in each epoch, every sample in the dataset is used once
    # X holds the features and y the labels of one mini-batch
    for X, y in data_iter(batch_size, features, labels):
        l = loss(net(X, w, b), y).sum()
        # compute the gradient of the mini-batch loss
        l.backward()
        # update the model parameters with mini-batch stochastic gradient descent
        sgd([w, b], lr, batch_size)
        # reset the parameter gradients; otherwise each backward() call
        # would add to the gradients from the previous iteration
        w.grad.data.zero_()
        b.grad.data.zero_()
    train_l = loss(net(features, w, b), labels)
    print('epoch %d, loss %f' % (epoch + 1, train_l.mean().item()))
epoch 1, loss 0.035207
epoch 2, loss 0.000123
epoch 3, loss 0.000054
epoch 4, loss 0.000054
epoch 5, loss 0.000054
w, true_w, b, true_b
(tensor([[ 1.9999],
         [-3.4002]], requires_grad=True),
 [2, -3.4],
 tensor([4.2003], requires_grad=True),
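The reason gradients are reset in every iteration is that PyTorch accumulates them across backward passes. A minimal sketch with a made-up one-element parameter shows this:

```python
import torch

w = torch.tensor([1.0], requires_grad=True)

(3.0 * w).sum().backward()
g1 = w.grad.item()
print(g1)  # 3.0

# A second backward pass adds to the stored gradient instead of replacing it
(3.0 * w).sum().backward()
g2 = w.grad.item()
print(g2)  # 6.0

w.grad.zero_()  # reset before the next iteration
g3 = w.grad.item()
print(g3)  # 0.0
```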

2. Concise Implementation of the Linear Regression Model Using PyTorch

import torch
from torch import nn
import numpy as np


2.1 Generating datasets

The dataset is generated exactly as in the from-scratch implementation.

num_inputs = 2
num_examples = 1000

true_w = [2, -3.4]
true_b = 4.2

features = torch.tensor(np.random.normal(0, 1, (num_examples, num_inputs)), dtype=torch.float)
labels = true_w[0] * features[:, 0] + true_w[1] * features[:, 1] + true_b
labels += torch.tensor(np.random.normal(0, 0.01, size=labels.size()), dtype=torch.float)

2.2 Reading datasets

import torch.utils.data as Data

batch_size = 10

# combine the features and labels of the dataset
dataset = Data.TensorDataset(features, labels)

# put the dataset into a DataLoader
data_iter = Data.DataLoader(
    dataset=dataset,            # torch TensorDataset format
    batch_size=batch_size,      # mini-batch size
    shuffle=True,               # whether to shuffle the data
    num_workers=2,              # load data with 2 worker subprocesses
)
for X, y in data_iter:
    print(X, '\n', y)
    break
tensor([[ 0.0949, -2.0367],
        [ 0.0957, -2.4354],
        [ 0.1520, -1.5686],
        [ 1.3453,  0.1253],
        [ 0.3076, -1.0100],
        [-0.6013,  1.6175],
        [ 0.2898,  0.2359],
        [ 0.4352, -0.4930],
        [ 0.9694, -0.8326],
        [-1.0968, -0.2515]]) 
 tensor([11.3024, 12.6900,  9.8462,  6.4771,  8.2533, -2.4928,  3.9811,  6.7626,
         8.9806,  2.8489])
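As a small self-contained check of how DataLoader splits a dataset, the sketch below uses a made-up dataset of 25 samples. It sets `num_workers=0`, which loads data in the main process. That is an assumption made here to keep the example simple, since worker subprocesses can require extra setup on some platforms.

```python
import torch
import torch.utils.data as Data

features = torch.arange(50, dtype=torch.float32).view(25, 2)  # 25 toy samples
labels = torch.arange(25, dtype=torch.float32)

dataset = Data.TensorDataset(features, labels)
# num_workers=0 loads batches in the main process
loader = Data.DataLoader(dataset, batch_size=10, shuffle=True, num_workers=0)

sizes = [X.shape[0] for X, y in loader]
print(len(dataset), sizes)  # 25 [10, 10, 5]
```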

2.3 Define Model

nn.Module is a class provided by PyTorch and is the base class of all neural network modules. Our custom modules inherit from it.

class LinearNet(nn.Module):  # inherits from nn.Module
    def __init__(self, n_feature):
        super(LinearNet, self).__init__()  # call the parent class's __init__ function
        # define each layer
        # self.linear is an nn.Linear, i.e. a linear layer; it can be called on input data, as in self.linear(x)
        # function prototype: `torch.nn.Linear(in_features, out_features, bias=True)`
        self.linear = nn.Linear(n_feature, 1)

    def forward(self, x):  # overrides the forward function of nn.Module
        y = self.linear(x)
        return y

net = LinearNet(num_inputs)
print(net)
LinearNet(
  (linear): Linear(in_features=2, out_features=1, bias=True)
)

super(LinearNet, self).__init__()
This initializes the attributes inherited from the parent class.
Python first finds the parent class of LinearNet (say, class A) and treats the LinearNet object self as an object of class A.
The __init__ function of class A is then called on that object.

## An example for understanding nn.Linear
import torch
nn1 = torch.nn.Linear(100, 50)
input1 = torch.randn(140, 100)
output1 = nn1(input1)
print(output1.size())
torch.Size([140, 50])
# ways to build a multilayer network
# method one
net = nn.Sequential(
    nn.Linear(num_inputs, 1)
    # other layers can be added here
    )

# method two
net = nn.Sequential()
net.add_module('linear', nn.Linear(num_inputs, 1))
# net.add_module ......

# method three
from collections import OrderedDict
net = nn.Sequential(OrderedDict([
          ('linear', nn.Linear(num_inputs, 1))
          # ......
        ]))

print(net)
print(net[0])
Sequential(
  (linear): Linear(in_features=2, out_features=1, bias=True)
)
Linear(in_features=2, out_features=1, bias=True)

Initialize model parameters

from torch.nn import init

init.normal_(net[0].weight, mean=0.0, std=0.01)
init.constant_(net[0].bias, val=0.0)  # or you can use `net[0].bias.data.fill_(0)` to modify it directly
Parameter containing:
tensor([0.], requires_grad=True)
for param in net.parameters():
    print(param)
Parameter containing:
tensor([[-0.0152,  0.0038]], requires_grad=True)
Parameter containing:
tensor([0.], requires_grad=True)

Define loss function

loss = nn.MSELoss()    # nn built-in squared loss function
                       # function prototype: `torch.nn.MSELoss(size_average=None, reduce=None, reduction='mean')`
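One detail worth checking is how nn.MSELoss relates to the squared loss with the 1/2 factor used in the from-scratch section. With the default `reduction='mean'`, nn.MSELoss averages the squared differences without the 1/2 factor, so it is twice the mean of that loss. The sketch below uses made-up predictions and labels for illustration.

```python
import torch
from torch import nn

y_hat = torch.tensor([[2.5], [0.0], [3.0]])   # hypothetical predictions
y = torch.tensor([[3.0], [-0.5], [3.0]])      # hypothetical labels

mse = nn.MSELoss()(y_hat, y)                  # mean of (y_hat - y)^2
half_squared = ((y_hat - y) ** 2 / 2).mean()  # mean of the 1/2-scaled loss

print(mse.item(), half_squared.item())
print(torch.isclose(mse, 2 * half_squared).item())  # True
```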

Define the optimization function

import torch.optim as optim

optimizer = optim.SGD(net.parameters(), lr=0.03)   # built-in stochastic gradient descent optimizer
print(optimizer)  # function prototype: `torch.optim.SGD(params, lr=<required parameter>, momentum=0, dampening=0, weight_decay=0, nesterov=False)`
Parameter Group 0
    dampening: 0
    lr: 0.03
    momentum: 0
    nesterov: False
    weight_decay: 0


num_epochs = 3
for epoch in range(1, num_epochs + 1):
    for X, y in data_iter:
        output = net(X)
        l = loss(output, y.view(-1, 1))
        optimizer.zero_grad()  # reset the gradients, equivalent to net.zero_grad()
        l.backward()           # compute the gradients of the loss
        optimizer.step()       # update the parameters
    print('epoch %d, loss: %f' % (epoch, l.item()))
epoch 1, loss: 0.000290
epoch 2, loss: 0.000128
epoch 3, loss: 0.000107
# result comparison
dense = net[0]
print(true_w, dense.weight.data)
print(true_b, dense.bias.data)
[2, -3.4] tensor([[ 1.9997, -3.3999]])
4.2 tensor([4.2002])

Comparison of the two implementations

  1. From-scratch implementation (recommended for learning)

    Gives a better understanding of the model and the underlying principles of neural networks

  2. Concise implementation using PyTorch

    Lets you design and implement a model more quickly


Tags: network IPython Python

Posted on Thu, 13 Feb 2020 23:40:06 -0500 by oneski