Hands on deep learning: linear regression

Linear regression

The main contents include:

  1. Basic elements of linear regression
  2. Implementation of linear regression from scratch
  3. Concise implementation of linear regression using PyTorch

Basic elements of linear regression


For simplicity, we assume here that the price depends on only two factors: the area (m²) and the age (years) of the house. Next, we want to explore the specific relationship between the price and these two factors. Linear regression assumes a linear relationship between the output and each input:

\mathrm{price} = w_{\mathrm{area}} \cdot \mathrm{area} + w_{\mathrm{age}} \cdot \mathrm{age} + b

Data set

We usually collect a series of real data, such as the real selling prices of multiple houses and their corresponding areas and ages. We hope to find model parameters that minimize the error between the predicted prices and the real prices on this data. In machine learning terminology, this data set is called the training data set or training set. A house is called a sample, its real selling price is called the label, and the two factors used to predict the label are called features. Features are used to describe the characteristics of a sample.

Loss function

In model training, we need to measure the error between the predicted value and the real value. Usually we choose a non-negative number as the error, with smaller values indicating smaller error. A common choice is the squared function. Its expression for evaluating the error of the sample with index $i$ is

l^{(i)}(\mathbf{w}, b) = \frac{1}{2} \left(\hat{y}^{(i)} - y^{(i)}\right)^2,

Usually, we use the average of the losses over all $n$ training samples to measure the quality of the model:

L(\mathbf{w}, b) = \frac{1}{n}\sum_{i=1}^n l^{(i)}(\mathbf{w}, b) = \frac{1}{n} \sum_{i=1}^n \frac{1}{2}\left(\mathbf{w}^\top \mathbf{x}^{(i)} + b - y^{(i)}\right)^2.

Optimization algorithm: mini-batch stochastic gradient descent

When the model and loss function are simple enough, the solution of the above error-minimization problem can be expressed in closed form. Such solutions are called analytical solutions. The linear regression and squared error used in this section fall into this category. However, most deep learning models have no analytical solution, so we can only reduce the value of the loss function as much as possible by iteratively optimizing the model parameters with a finite number of steps of an algorithm. Such solutions are called numerical solutions.
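As a concrete illustration of the analytical solution (a minimal sketch, not part of the original text; the data and variable names are made up for this example), linear regression with squared error can be solved directly with the normal equations:

```python
import torch

# toy data: y = 2*x1 - 3.4*x2 + 4.2 + small Gaussian noise
torch.manual_seed(0)
X = torch.randn(100, 2)
y = X @ torch.tensor([2.0, -3.4]) + 4.2 + 0.01 * torch.randn(100)

# append a column of ones so the bias is absorbed into the weight vector
Xb = torch.cat([X, torch.ones(100, 1)], dim=1)

# normal equations: theta = (X^T X)^{-1} X^T y
theta = torch.linalg.solve(Xb.T @ Xb, Xb.T @ y)
print(theta)  # close to [2.0, -3.4, 4.2]
```

No iterative optimization is needed here; the closed form exists precisely because the loss is quadratic in the parameters.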

Among the optimization algorithms for numerical solutions, mini-batch stochastic gradient descent is widely used in deep learning. The algorithm is simple: first select initial values for the model parameters, e.g. at random; then iterate over the parameters many times, so that each iteration may reduce the value of the loss function. In each iteration, a mini-batch $\mathcal{B}$ consisting of a fixed number of training samples is sampled uniformly at random, and then the derivative (gradient) of the average loss over the samples in the mini-batch with respect to the model parameters is computed. Finally, the product of this gradient and a preset positive number (the learning rate) gives the amount by which the model parameters are reduced in this iteration:

(\mathbf{w},b) \leftarrow (\mathbf{w},b) - \frac{\eta}{|\mathcal{B}|} \sum_{i \in \mathcal{B}} \partial_{(\mathbf{w},b)} l^{(i)}(\mathbf{w},b)

Learning rate: $\eta$ is the step size taken in each optimization step.
Batch size: $|\mathcal{B}|$ is the number of samples in each mini-batch.

To summarize, there are two steps to optimize the function:

  • (i) initialize the model parameters, generally at random;
  • (ii) iterate over the data several times, updating each parameter by moving it in the negative gradient direction.
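The two steps above can be sketched on a single parameter minimizing the toy function $f(w) = (w - 3)^2$ (an illustrative example, not from the original text):

```python
import torch

# step (i): random initialization of the parameter
w = torch.randn(1, requires_grad=True)

lr = 0.1  # learning rate
# step (ii): repeatedly move w a small step along the negative gradient
for _ in range(100):
    loss = (w - 3.0) ** 2
    loss.backward()
    with torch.no_grad():
        w -= lr * w.grad
        w.grad.zero_()  # clear the gradient before the next iteration
print(w.item())  # converges towards 3.0, the minimizer of f
```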

Vectorized computation

In model training or prediction, we often process multiple data samples at the same time and use vector calculation. Before introducing the vector calculation expression of linear regression, let's consider two methods of adding two vectors.

  1. One way to add two vectors is to add them element by element with a scalar loop.
  2. The other way is to add the two vectors directly (vectorized).
import torch
import time

# init variable a, b as 1000 dimension vector
n = 1000
a = torch.ones(n)
b = torch.ones(n)

# define a timer class to record time
class Timer(object):
    """Record multiple running times."""
    def __init__(self):
        self.times = []
        self.start()  # start timing on construction

    def start(self):
        # start the timer
        self.start_time = time.time()

    def stop(self):
        # stop the timer and record time into a list
        self.times.append(time.time() - self.start_time)
        return self.times[-1]

    def avg(self):
        # calculate the average and return
        return sum(self.times)/len(self.times)

    def sum(self):
        # return the sum of recorded time
        return sum(self.times)

Now we can test it. First, we use the for loop to do scalar addition by element.

timer = Timer()
timer.start()
c = torch.zeros(n)
for i in range(n):
    c[i] = a[i] + b[i]
'%.5f sec' % timer.stop()
'0.01162 sec'

Then we use torch to add the two vectors directly:

timer.start()
d = a + b
'%.5f sec' % timer.stop()
'0.00027 sec'

The results show that the latter is much faster than the former. Therefore, we should use vectorized computation as much as possible to improve computational efficiency.

Implementation of linear regression from scratch

# import packages and modules
%matplotlib inline
import torch
from IPython import display
from matplotlib import pyplot as plt
import numpy as np
import random


Generate data set

Use a linear model to generate a data set of 1000 samples. The following linear relationship is used to generate the data:

\mathrm{price} = w_{\mathrm{area}} \cdot \mathrm{area} + w_{\mathrm{age}} \cdot \mathrm{age} + b

# set input feature number 
num_inputs = 2
# set example number
num_examples = 1000

# set true weight and bias in order to generate corresponded label
true_w = [2, -3.4]
true_b = 4.2

features = torch.randn(num_examples, num_inputs, dtype=torch.float32)
labels = true_w[0] * features[:, 0] + true_w[1] * features[:, 1] + true_b
labels += torch.tensor(np.random.normal(0, 0.01, size=labels.size()),
                       dtype=torch.float32)  # real data is never exactly linear, so add Gaussian noise

Use a plot to visualize the generated data:

plt.scatter(features[:, 1].numpy(), labels.numpy(), 1)

Read data set

def data_iter(batch_size, features, labels):
    num_examples = len(features)
    indices = list(range(num_examples))
    random.shuffle(indices)  # shuffle so samples are read in random order
    for i in range(0, num_examples, batch_size):
        # the last batch may contain fewer than batch_size samples
        j = torch.LongTensor(indices[i: min(i + batch_size, num_examples)])  # indices of the current mini-batch
        yield features.index_select(0, j), labels.index_select(0, j)  # yield one mini-batch at a time
batch_size = 10

for X, y in data_iter(batch_size, features, labels):
    print(X, '\n', y)
    break  # only show the first mini-batch
tensor([[ 0.6802, -0.1529],
        [-1.1139, -1.0506],
        [ 1.1355,  0.6313],
        [ 0.9057,  1.4527],
        [-0.7477, -0.4656],
        [-0.4012, -0.1810],
        [ 0.0038, -1.3701],
        [-0.9438, -1.6571],
        [ 0.4016,  0.4233],
        [ 0.8824, -1.0067]]) 
 tensor([6.0793, 5.5343, 4.3112, 1.0657, 4.2856, 4.0040, 8.8709, 7.9642, 3.5696,

Initialize model parameters

w = torch.tensor(np.random.normal(0, 0.01, (num_inputs, 1)), dtype=torch.float32)
b = torch.zeros(1, dtype=torch.float32)

w.requires_grad_(requires_grad=True)  # track gradients for w
b.requires_grad_(requires_grad=True)  # track gradients for b
tensor([0.], requires_grad=True)

Define the model

Define the model used for training:

\mathrm{price} = w_{\mathrm{area}} \cdot \mathrm{area} + w_{\mathrm{age}} \cdot \mathrm{age} + b

def linreg(X, w, b):
    return torch.mm(X, w) + b   # torch.mm performs matrix multiplication

Define loss function

We use the squared loss function:

l^{(i)}(\mathbf{w}, b) = \frac{1}{2} \left(\hat{y}^{(i)} - y^{(i)}\right)^2,

def squared_loss(y_hat, y):
    # view reshapes y to the same shape as y_hat (same data, different shape)
    return (y_hat - y.view(y_hat.size())) ** 2 / 2

# example: per-sample squared loss for three predictions
y_hat = torch.tensor([[2.33],
                      [1.07],
                      [1.23]])
y = torch.tensor([3.14, 0.98, 1.32])
s = squared_loss(y_hat, y)

Define optimization function

Here, the optimization function uses mini-batch stochastic gradient descent:

(\mathbf{w},b) \leftarrow (\mathbf{w},b) - \frac{\eta}{|\mathcal{B}|} \sum_{i \in \mathcal{B}} \partial_{(\mathbf{w},b)} l^{(i)}(\mathbf{w},b)

def sgd(params, lr, batch_size):
    # params: [w, b]
    for param in params:
        # move each parameter a small step in the negative gradient direction
        param.data -= lr * param.grad / batch_size  # use .data to update param without gradient tracking


Once the data set, model, loss function and optimization function are defined, we are ready to train the model.

# super parameters init
lr = 0.03  # lr learning rate
num_epochs = 5

net = linreg
loss = squared_loss

# training
for epoch in range(num_epochs):  # training repeats num_epochs times
    # in each epoch, all the samples in the dataset will be used once
    # X is the feature and y is the label of a batch of samples
    for X, y in data_iter(batch_size, features, labels):
        l = loss(net(X, w, b), y).sum()
        # calculate the gradient of the batch sample loss
        l.backward()
        # use mini-batch stochastic gradient descent to update the model parameters
        sgd([w, b], lr, batch_size)
        # reset parameter gradients; the previous gradient records must be cleared after each update
        w.grad.data.zero_()
        b.grad.data.zero_()
    train_l = loss(net(features, w, b), labels)
    print('epoch %d, loss %f' % (epoch + 1, train_l.mean().item()))
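As a sanity check, the whole from-scratch pipeline above can be condensed into one self-contained sketch that trains and then compares the learned parameters with the true ones (this is an illustrative rewrite, not the original code; variable names mirror the section above):

```python
import torch
import random

torch.manual_seed(0)
num_examples, num_inputs = 1000, 2
true_w, true_b = torch.tensor([2.0, -3.4]), 4.2

# generate synthetic data: linear relation plus small Gaussian noise
features = torch.randn(num_examples, num_inputs)
labels = features @ true_w + true_b + 0.01 * torch.randn(num_examples)

w = torch.zeros(num_inputs, 1, requires_grad=True)
b = torch.zeros(1, requires_grad=True)

def data_iter(batch_size):
    indices = list(range(num_examples))
    random.shuffle(indices)
    for i in range(0, num_examples, batch_size):
        j = torch.tensor(indices[i:i + batch_size])
        yield features[j], labels[j]

lr, num_epochs, batch_size = 0.03, 5, 10
for epoch in range(num_epochs):
    for X, y in data_iter(batch_size):
        l = ((X @ w + b - y.view(-1, 1)) ** 2 / 2).sum()  # squared loss over the mini-batch
        l.backward()
        with torch.no_grad():
            for param in (w, b):
                param -= lr * param.grad / batch_size  # SGD update
                param.grad.zero_()

print(w.view(-1), true_w)  # learned weights should be close to [2, -3.4]
print(b, true_b)           # learned bias should be close to 4.2
```

After five epochs the learned parameters land very close to the values used to generate the data, confirming the training loop is correct.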

Concise implementation of linear regression using PyTorch

import torch
from torch import nn
import numpy as np
torch.manual_seed(1)   # The same random initialization seed ensures that the results can be repeated


Generate data set

Generating the data set here is exactly the same as in the from-scratch implementation.

num_inputs = 2
num_examples = 1000

true_w = [2, -3.4]
true_b = 4.2

features = torch.tensor(np.random.normal(0, 1, (num_examples, num_inputs)), dtype=torch.float)
labels = true_w[0] * features[:, 0] + true_w[1] * features[:, 1] + true_b
labels += torch.tensor(np.random.normal(0, 0.01, size=labels.size()), dtype=torch.float)

Read data set

import torch.utils.data as Data

batch_size = 10

# combine features and labels of the dataset
dataset = Data.TensorDataset(features, labels)

# put dataset into DataLoader
data_iter = Data.DataLoader(
    dataset=dataset,            # torch TensorDataset format
    batch_size=batch_size,      # mini-batch size
    shuffle=True,               # whether to shuffle the data
    num_workers=2,              # read data with multiple worker processes
)

for X, y in data_iter:
    print(X, '\n', y)
    break  # only show the first mini-batch
tensor([[ 0.2134,  0.7981],
        [-0.3899, -0.4544],
        [ 1.4472, -1.2160],
        [-0.7354,  1.2216],
        [-1.3233,  0.6937],
        [-0.2810,  0.5505],
        [-1.6620, -0.1457],
        [-0.7635,  0.7058],
        [ 0.6079, -0.7497],
        [ 0.4924, -0.1376]]) 
 tensor([ 1.9045,  4.9770, 11.2503, -1.4180, -0.8159,  1.7737,  1.3910,  0.2721,
         7.9648,  5.6408])

Define the model

class LinearNet(nn.Module):
    def __init__(self, n_feature):
        super(LinearNet, self).__init__()      # call the parent class constructor
        self.linear = nn.Linear(n_feature, 1)  # function prototype: `torch.nn.Linear(in_features, out_features, bias=True)`

    def forward(self, x):
        y = self.linear(x)
        return y
net = LinearNet(num_inputs)
print(net)
LinearNet(
  (linear): Linear(in_features=2, out_features=1, bias=True)
)
# ways to init a multilayer network
# method one
net = nn.Sequential(   # an ordered container; modules are added in the order they are passed in
    nn.Linear(num_inputs, 1)
    # other layers can be added here
)
# method two
net = nn.Sequential()
net.add_module('linear', nn.Linear(num_inputs, 1))
# net.add_module ......

# method three
from collections import OrderedDict
net = nn.Sequential(OrderedDict([
          ('linear', nn.Linear(num_inputs, 1))
          # ......
          ]))
# print the randomly initialized parameters
for param in net.parameters():
    print(param)
Parameter containing:
tensor([[0.6652, 0.1260]], requires_grad=True)
Parameter containing:
tensor([0.5704], requires_grad=True)

Initialize model parameters

from torch.nn import init

init.normal_(net[0].weight, mean=0.0, std=0.01)  # re-initialize weights from N(0, 0.01)
init.constant_(net[0].bias, val=0.0)  # initialize bias to 0

Parameter containing:
tensor([[0.0016, 0.0029]], requires_grad=True)
Parameter containing:
tensor([0.], requires_grad=True)
for param in net.parameters():
    print(param)  # param is the above weight and bias
Parameter containing:
tensor([[0.0016, 0.0029]], requires_grad=True)
Parameter containing:
tensor([0.], requires_grad=True)

Define loss function

loss = nn.MSELoss()    # nn built-in mean squared error loss
                       # function prototype: `torch.nn.MSELoss(size_average=None, reduce=None, reduction='mean')`
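Note that nn.MSELoss with its default reduction='mean' averages the squared errors over all elements and has no 1/2 factor, so it differs from the squared_loss defined earlier by a constant factor. A quick check (illustrative values):

```python
import torch
from torch import nn

y_hat = torch.tensor([[2.33], [1.07], [1.23]])
y = torch.tensor([[3.14], [0.98], [1.32]])

mse = nn.MSELoss()                  # default reduction='mean'
manual = ((y_hat - y) ** 2).mean()  # no 1/2 factor, averaged over all elements
print(mse(y_hat, y), manual)        # the two values match
```

Such a constant factor does not change the minimizer, but it effectively rescales the learning rate, which is worth keeping in mind when comparing the two implementations.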

Define optimization function

import torch.optim as optim

optimizer = optim.SGD(net.parameters(), lr=0.03)   # built-in stochastic gradient descent optimizer
print(optimizer)  # function prototype: `torch.optim.SGD(params, lr=, momentum=0, dampening=0, weight_decay=0, nesterov=False)`
SGD (
Parameter Group 0
    dampening: 0
    lr: 0.03
    momentum: 0
    nesterov: False
    weight_decay: 0
)


num_epochs = 3
for epoch in range(1, num_epochs + 1):
    for X, y in data_iter:
        output = net(X)
        l = loss(output, y.view(-1, 1))
        optimizer.zero_grad()  # reset gradients, equivalent to net.zero_grad()
        l.backward()
        optimizer.step()
    print('epoch %d, loss: %f' % (epoch, l.item()))
epoch 1, loss: 0.000299
epoch 2, loss: 0.000058
epoch 3, loss: 0.000051
# result comparison
dense = net[0]
print(true_w, dense.weight.data)
print(true_b, dense.bias.data)
[2, -3.4] tensor([[ 2.0007, -3.3992]])
4.2 tensor([4.1989])


Posted on Fri, 14 Feb 2020 09:08:06 -0500 by WDPEjoe