# Linear regression

## Model

Linear regression assumes a linear relationship between the output and each input: the prediction is a weighted sum of the inputs plus a bias.
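For a house-price example with two inputs, area and age, this assumption reads price = w_area * area + w_age * age + b. The following is a minimal sketch of that model; the weights, bias, and house data are made-up values for illustration, and `linreg` is a hypothetical helper name.

```python
import torch

# Hypothetical parameters, chosen only for illustration (not fitted values).
w = torch.tensor([2.0, -0.5])   # weights for [area, age]
b = torch.tensor(10.0)          # bias

def linreg(X, w, b):
    """Linear model: one prediction per row of X."""
    return X @ w + b

# Two made-up houses, each described by (area, age).
X = torch.tensor([[100.0, 5.0],
                  [80.0, 20.0]])
y_hat = linreg(X, w, b)
print(y_hat)
```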

## Data set

We usually collect a set of real data, for example the actual selling prices of a number of houses together with their areas and ages. We want to find model parameters on this data that minimize the error between the predicted price and the actual price. In machine learning terminology, this data set is called the training data set or training set. Each house is a sample, its actual selling price is the label, and the two factors used to predict the label (area and age) are the features. Features represent the characteristics of a sample.

## Loss function

In model training, we need to measure the error between the predicted value and the true value. We usually choose a non-negative number as the error, where a smaller value means a smaller error. A common choice is the squared error.
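A minimal sketch of the squared error for a few predictions follows. The factor 1/2 is a common convention assumed here (it makes the derivative simply `y_hat - y`), and the numbers are illustrative only.

```python
import torch

def squared_loss(y_hat, y):
    """Per-sample squared error, halved by convention (an assumption here)."""
    return (y_hat - y.view(y_hat.shape)) ** 2 / 2

y_hat = torch.tensor([2.5, 0.0, 2.0])   # predicted values (made up)
y = torch.tensor([3.0, -0.5, 2.0])      # true values (made up)

loss = squared_loss(y_hat, y)
print(loss)         # one non-negative loss per sample
print(loss.mean())  # average loss over the samples
```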

## Optimization function: stochastic gradient descent

Among optimization algorithms that find a numerical solution, mini-batch stochastic gradient descent is widely used in deep learning. The algorithm is simple: first choose initial values for the model parameters, for example at random; then update the parameters over many iterations, so that each iteration may reduce the value of the loss function. In each iteration, a mini-batch B consisting of a fixed number of training samples is sampled uniformly at random, and the derivative (gradient) of the average loss over the mini-batch with respect to the model parameters is computed. Finally, the product of this gradient and a positive number set in advance (the learning rate) is subtracted from the model parameters.
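The steps above can be sketched as follows. This is an illustration under stated assumptions, not a reference implementation: the data is synthetic, `true_w`, `true_b`, `lr`, and the other hyperparameters are made-up values, and mini-batches are drawn by shuffling once per epoch rather than by independent sampling.

```python
import torch

torch.manual_seed(0)

# Synthetic data from a known linear model (illustrative values).
true_w, true_b = torch.tensor([2.0, -3.4]), 4.2
features = torch.randn(1000, 2)
labels = features @ true_w + true_b + 0.01 * torch.randn(1000)

# Step 1: choose initial parameter values.
w = torch.zeros(2, requires_grad=True)
b = torch.zeros(1, requires_grad=True)

# lr is the positive number set in advance (the learning rate).
lr, batch_size, num_epochs = 0.03, 10, 5

for epoch in range(num_epochs):
    perm = torch.randperm(len(features))          # shuffle sample indices
    for i in range(0, len(features), batch_size):
        idx = perm[i:i + batch_size]              # indices of mini-batch B
        y_hat = features[idx] @ w + b
        loss = ((y_hat - labels[idx]) ** 2 / 2).mean()  # average loss over B
        loss.backward()                           # gradient of the average loss
        with torch.no_grad():
            w -= lr * w.grad                      # subtract lr * gradient
            b -= lr * b.grad
            w.grad.zero_()                        # reset gradients for the next step
            b.grad.zero_()

print(w.detach(), b.detach())
```

After a few epochs the learned `w` and `b` should be close to the values used to generate the data.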

# Softmax and classification model

## Implementation of softmax from scratch

```python
import torch
import torchvision
import numpy as np
import sys
sys.path.append("/home/kesci/input")
import d2lzh1981 as d2l

print(torch.__version__)
print(torchvision.__version__)
batch_size = 256
num_inputs = 784
print(28*28)
num_outputs = 10

W = torch.tensor(np.random.normal(0, 0.01, (num_inputs, num_outputs)), dtype=torch.float)
b = torch.zeros(num_outputs, dtype=torch.float)
X = torch.tensor([[1, 2, 3], [4, 5, 6]])
print(X.sum(dim=0, keepdim=True))   # dim=0: sum down each column, keep the reduced dimension -> shape (1, 3)
print(X.sum(dim=1, keepdim=True))   # dim=1: sum across each row, keep the reduced dimension -> shape (2, 1)
print(X.sum(dim=0, keepdim=False))  # dim=0: sum down each column, drop the reduced dimension -> shape (3,)
print(X.sum(dim=1, keepdim=False))  # dim=1: sum across each row, drop the reduced dimension -> shape (2,)

def softmax(X):
    X_exp = X.exp()
    partition = X_exp.sum(dim=1, keepdim=True)
    # print("X size is ", X_exp.size())
    # print("partition size is ", partition, partition.size())
    return X_exp / partition  # Broadcasting divides each row by its own sum

X = torch.rand((2, 5))
X_prob = softmax(X)
print(X_prob, '\n', X_prob.sum(dim=1))

```
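One way the pieces above could be combined into a classifier is sketched below: flatten each image, apply the affine map with `W` and `b`, then apply `softmax`. The function name `net` and the random stand-in input are assumptions for illustration; the block redefines `softmax`, `W`, and `b` so that it runs on its own.

```python
import torch

num_inputs, num_outputs = 784, 10

def softmax(X):
    X_exp = X.exp()
    return X_exp / X_exp.sum(dim=1, keepdim=True)

# Same initialization scheme as above: small random weights, zero bias.
W = torch.normal(0.0, 0.01, size=(num_inputs, num_outputs))
b = torch.zeros(num_outputs)

def net(X):
    # Flatten each image to a vector of length num_inputs,
    # apply the affine map, then turn the scores into probabilities.
    return softmax(X.view(-1, num_inputs) @ W + b)

X = torch.rand(2, 1, 28, 28)   # random stand-in for two 28x28 images
probs = net(X)
print(probs.shape, probs.sum(dim=1))
```

Each row of `probs` is non-negative and sums to 1, so it can be read as a distribution over the 10 classes.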

# Multilayer perceptron

## Activation function

Adding hidden layers alone does not make the model more expressive. The root of the problem is that a fully connected layer only applies an affine transformation to the data, and the composition of multiple affine transformations is still an affine transformation. One solution is to introduce a nonlinear transformation: for example, apply an elementwise nonlinear function to the hidden variables and then use the result as the input of the next fully connected layer. This nonlinear function is called an activation function.
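A small numerical check of this argument, using random matrices with arbitrary shapes chosen for illustration: two stacked affine layers give the same result as a single affine layer with weight `W1 @ W2` and bias `b1 @ W2 + b2`, while inserting a ReLU between them generally does not.

```python
import torch

torch.manual_seed(0)
X = torch.randn(4, 3)                        # 4 samples, 3 inputs
W1, b1 = torch.randn(3, 5), torch.randn(5)   # first affine layer
W2, b2 = torch.randn(5, 2), torch.randn(2)   # second affine layer

# Two affine layers applied one after the other.
two_layers = (X @ W1 + b1) @ W2 + b2

# The equivalent single affine layer.
W, b = W1 @ W2, b1 @ W2 + b2
one_layer = X @ W + b
same_without_activation = torch.allclose(two_layers, one_layer, atol=1e-4)
print(same_without_activation)

# With an elementwise nonlinearity in between, the collapse no longer holds in general.
with_relu = torch.relu(X @ W1 + b1) @ W2 + b2
print(torch.allclose(with_relu, one_layer, atol=1e-4))
```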

Here are some common activation functions:

### ReLU function

The ReLU (rectified linear unit) function provides a very simple nonlinear transformation. Given an element x, the function is defined as:

ReLU(x) = max(x, 0).

In other words, ReLU keeps positive elements unchanged and sets negative elements to zero.

### Sigmoid function

The sigmoid function maps the value of an element into the interval between 0 and 1:

sigmoid(x) = 1 / (1 + exp(-x)).

### tanh function

The tanh (hyperbolic tangent) function maps the value of an element into the interval between -1 and 1:

tanh(x) = (1 - exp(-2x)) / (1 + exp(-2x)).
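The three functions can be compared on a few sample values. This is a small sketch using PyTorch's built-in `torch.relu`, `torch.sigmoid`, and `torch.tanh`; the input values are arbitrary.

```python
import torch

x = torch.tensor([-2.0, -0.5, 0.0, 0.5, 2.0])

relu_out = torch.relu(x)        # negative entries become 0, positive entries are kept
sigmoid_out = torch.sigmoid(x)  # every entry lies strictly between 0 and 1
tanh_out = torch.tanh(x)        # every entry lies strictly between -1 and 1

print("relu:   ", relu_out)
print("sigmoid:", sigmoid_out)
print("tanh:   ", tanh_out)
```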

## PyTorch implementation of multilayer perceptron

```python
import torch
from torch import nn
from torch.nn import init
import numpy as np
import sys
sys.path.append("/home/kesci/input")
import d2lzh1981 as d2l

print(torch.__version__)
num_inputs, num_outputs, num_hiddens = 784, 10, 256

net = nn.Sequential(
    d2l.FlattenLayer(),                   # flatten each image to a vector
    nn.Linear(num_inputs, num_hiddens),   # hidden layer
    nn.ReLU(),                            # activation function
    nn.Linear(num_hiddens, num_outputs),  # output layer
)

for params in net.parameters():
    init.normal_(params, mean=0, std=0.01)

batch_size = 256