Linear regression assumes that there is a linear relationship between the output and each input.
We usually collect a set of real data, such as the actual selling prices of a number of houses together with their areas and ages. We then look for the model parameters that minimize the error between the predicted price and the actual price on this data. In machine learning terminology, this data set is called the training data set, or training set. Each house is called a sample, its actual selling price is called the label, and the two factors used to predict the label (area and age) are called features. Features describe the characteristics of a sample.
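To make this terminology concrete, here is a minimal sketch in plain Python. All numbers are invented for illustration and do not come from any real data set; the names `features`, `labels`, `predict`, `w` and `b` are assumptions of this sketch, and the parameter values are picked by hand rather than trained.

```python
# Hypothetical training set: each sample is one house.
# Features: (area in square meters, age in years). Values are invented.
features = [(120.0, 5.0), (80.0, 20.0), (150.0, 2.0)]
# Labels: the selling price of each house (arbitrary units, also invented).
labels = [310.0, 180.0, 400.0]


def predict(x, w, b):
    """Linear model: weighted sum of the features plus a bias."""
    return sum(w_j * x_j for w_j, x_j in zip(w, x)) + b


# Hand-picked (not trained) parameters: one weight per feature and a bias.
w = [2.5, -1.0]
b = 10.0

predictions = [predict(x, w, b) for x in features]
for y_hat, y in zip(predictions, labels):
    print(f"predicted {y_hat:.1f}, actual {y:.1f}")
```

Training amounts to adjusting `w` and `b` so that the predicted values move closer to the labels.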
In model training, we need to measure the error between the predicted value and the real value. The function that measures this error is called the loss function. Usually we choose a non-negative number as the error, where a smaller value means a smaller error. A common choice is the squared error.
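As a rough illustration, the squared error can be written in a few lines of plain Python. The factor 1/2 is a common convention that simplifies the derivative and is an assumption here, as are the function names.

```python
def squared_loss(y_hat, y):
    """Squared error for one sample; the 1/2 factor is a convention."""
    return 0.5 * (y_hat - y) ** 2


def mean_squared_loss(y_hats, ys):
    """Average of the per-sample squared errors over a data set."""
    return sum(squared_loss(y_hat, y) for y_hat, y in zip(y_hats, ys)) / len(ys)


# The loss is zero for a perfect prediction and grows with the gap.
print(squared_loss(1.0, 1.0))
print(squared_loss(3.0, 1.0))
print(mean_squared_loss([3.0, 1.0], [1.0, 1.0]))
```

The value is never negative, and it is zero only when the prediction equals the label.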
Among numerical optimization algorithms, mini-batch stochastic gradient descent is widely used in deep learning. The algorithm is simple: first choose initial values for the model parameters, for example at random; then update the parameters over many iterations, so that each iteration may reduce the value of the loss function. In each iteration, a mini-batch B consisting of a fixed number of training samples is sampled uniformly at random, and the derivative (gradient) of the average loss over the mini-batch with respect to the model parameters is computed. Finally, the product of this gradient and a positive number chosen in advance (the learning rate) is subtracted from the model parameters.
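The steps above can be sketched in plain Python for a linear model with squared loss. This is an illustrative sketch, not a reference implementation. The function name `minibatch_sgd`, the hyperparameter values, and the synthetic data are all assumptions. Mini-batches are drawn by shuffling the sample indices once per epoch and slicing them, which is one common way to sample them.

```python
import random


def minibatch_sgd(features, labels, batch_size=4, lr=0.05, num_epochs=50, seed=0):
    """Fit y = w.x + b with mini-batch SGD on the loss 0.5 * (y_hat - y)^2."""
    rng = random.Random(seed)
    num_features = len(features[0])
    # Step 1: choose initial parameters (small random weights, zero bias).
    w = [rng.gauss(0.0, 0.01) for _ in range(num_features)]
    b = 0.0
    indices = list(range(len(features)))
    for _ in range(num_epochs):
        rng.shuffle(indices)  # random order, so each slice is a random mini-batch
        for start in range(0, len(indices), batch_size):
            batch = indices[start:start + batch_size]
            # Step 2: gradient of the average loss over the mini-batch.
            grad_w = [0.0] * num_features
            grad_b = 0.0
            for i in batch:
                x, y = features[i], labels[i]
                error = sum(w_j * x_j for w_j, x_j in zip(w, x)) + b - y
                for j in range(num_features):
                    grad_w[j] += error * x[j] / len(batch)
                grad_b += error / len(batch)
            # Step 3: subtract learning rate times gradient from the parameters.
            w = [w_j - lr * g_j for w_j, g_j in zip(w, grad_w)]
            b -= lr * grad_b
    return w, b


# Synthetic, noise-free data generated from assumed "true" parameters.
data_rng = random.Random(42)
true_w, true_b = [2.0, -3.4], 4.2
features = [[data_rng.gauss(0, 1), data_rng.gauss(0, 1)] for _ in range(100)]
labels = [true_w[0] * x[0] + true_w[1] * x[1] + true_b for x in features]

w, b = minibatch_sgd(features, labels)
print(w, b)  # should end up close to true_w and true_b
```

Because the synthetic labels contain no noise, the learned parameters should approach the assumed true values; with real data they would only approximate them.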
    import torch
    import torchvision
    import numpy as np
    import sys
    sys.path.append("/home/kesci/input")
    import d2lzh1981 as d2l

    print(torch.__version__)
    print(torchvision.__version__)

    batch_size = 256
    train_iter, test_iter = d2l.load_data_fashion_mnist(batch_size)

    num_inputs = 784  # each 28 x 28 image is flattened into a vector
    print(28*28)
    num_outputs = 10  # number of classes

    # Model parameters: weights drawn from N(0, 0.01^2), biases set to zero.
    W = torch.tensor(np.random.normal(0, 0.01, (num_inputs, num_outputs)), dtype=torch.float)
    b = torch.zeros(num_outputs, dtype=torch.float)

    # Track gradients for both parameters.
    W.requires_grad_(requires_grad=True)
    b.requires_grad_(requires_grad=True)

    X = torch.tensor([[1, 2, 3], [4, 5, 6]])
    print(X.sum(dim=0, keepdim=True))   # sum down each column, keep the reduced dimension: shape (1, 3)
    print(X.sum(dim=1, keepdim=True))   # sum along each row, keep the reduced dimension: shape (2, 1)
    print(X.sum(dim=0, keepdim=False))  # sum down each column, drop the reduced dimension: shape (3,)
    print(X.sum(dim=1, keepdim=False))  # sum along each row, drop the reduced dimension: shape (2,)

    def softmax(X):
        X_exp = X.exp()
        partition = X_exp.sum(dim=1, keepdim=True)  # per-row normalization constant
        # print("X size is ", X_exp.size())
        # print("partition size is ", partition, partition.size())
        return X_exp / partition  # broadcasting divides each row by its own sum

    X = torch.rand((2, 5))
    X_prob = softmax(X)
    print(X_prob, '\n', X_prob.sum(dim=1))  # each row of X_prob sums to 1
The root of the problem is that a fully connected layer only applies an affine transformation to the data, and the composition of several affine transformations is still an affine transformation. One way to solve the problem is to introduce a nonlinear transformation: for example, the hidden variables are transformed by a nonlinear function applied element-wise, and the result is then used as the input to the next fully connected layer. This nonlinear function is called an activation function.
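The collapse of stacked affine layers can be checked numerically. The following plain-Python sketch uses small hand-picked matrices; all values and the helper names `matmul`, `affine` and `relu` are assumptions for illustration. It shows that two layers without an activation equal a single layer with W = W1 W2 and b = b1 W2 + b2, and that inserting an element-wise nonlinearity breaks this equivalence.

```python
def matmul(A, B):
    """Multiply two matrices given as lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]


def affine(X, W, b):
    """Apply X @ W + b to every row of X."""
    return [[v + b_j for v, b_j in zip(row, b)] for row in matmul(X, W)]


def relu(row):
    """Element-wise max(x, 0) on one row."""
    return [max(v, 0.0) for v in row]


# Hypothetical small layers: 2 inputs -> 3 hidden units -> 2 outputs.
W1, b1 = [[1.0, 2.0, 0.0], [0.0, -1.0, 3.0]], [1.0, 0.0, -1.0]
W2, b2 = [[1.0, 0.0], [2.0, 1.0], [0.0, -1.0]], [0.0, 2.0]
X = [[1.0, 2.0], [3.0, -1.0]]

# Two stacked fully connected layers with no activation in between.
two_layers = affine(affine(X, W1, b1), W2, b2)

# The equivalent single layer: W = W1 @ W2, b = b1 @ W2 + b2.
W = matmul(W1, W2)
b = [v + b_j for v, b_j in zip(matmul([b1], W2)[0], b2)]
one_layer = affine(X, W, b)
print(two_layers == one_layer)

# With an element-wise nonlinearity on the hidden layer, the outputs differ.
hidden = [relu(row) for row in affine(X, W1, b1)]
with_relu = affine(hidden, W2, b2)
print(with_relu == one_layer)
```

The first comparison holds because every step is linear; the second fails because ReLU zeroes a negative hidden value before the second layer sees it.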
Here are some common activation functions:
The ReLU (rectified linear unit) function provides a very simple nonlinear transformation. Given an element x, the function is defined as:
ReLU(x) = max(x, 0).
As you can see, the ReLU function keeps positive elements unchanged and sets negative elements to zero.
The sigmoid function maps the value of an element into the interval (0, 1).
The tanh (hyperbolic tangent) function maps the value of an element into the interval (-1, 1).
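For reference, the three activation functions can be written as scalar functions in plain Python. This is a minimal sketch using only the standard `math` module, not a substitute for the tensor versions in a deep learning library. The sigmoid formula 1 / (1 + exp(-x)) is the standard definition, but the simple form below can overflow for very large negative inputs.

```python
import math


def relu(x):
    """Keep positive values, replace negative values with zero."""
    return max(x, 0.0)


def sigmoid(x):
    """Squash x into (0, 1); simple form, may overflow for very negative x."""
    return 1.0 / (1.0 + math.exp(-x))


def tanh(x):
    """Squash x into (-1, 1)."""
    return math.tanh(x)


for x in (-2.0, 0.0, 2.0):
    print(x, relu(x), sigmoid(x), tanh(x))
```

The two squashing functions are closely related: tanh(x) = 2 * sigmoid(2x) - 1, so tanh is a rescaled sigmoid centered at zero.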
    import torch
    from torch import nn
    from torch.nn import init
    import numpy as np
    import sys
    sys.path.append("/home/kesci/input")
    import d2lzh1981 as d2l

    print(torch.__version__)

    num_inputs, num_outputs, num_hiddens = 784, 10, 256

    # Multilayer perceptron: one hidden layer with ReLU activation.
    net = nn.Sequential(
        d2l.FlattenLayer(),
        nn.Linear(num_inputs, num_hiddens),
        nn.ReLU(),
        nn.Linear(num_hiddens, num_outputs),
    )

    # Initialize all parameters (weights and biases) from N(0, 0.01^2).
    for params in net.parameters():
        init.normal_(params, mean=0, std=0.01)

    batch_size = 256
    train_iter, test_iter = d2l.load_data_fashion_mnist(batch_size, root='/home/kesci/input/FashionMNIST2065')
    loss = torch.nn.CrossEntropyLoss()

    optimizer = torch.optim.SGD(net.parameters(), lr=0.5)

    num_epochs = 5
    d2l.train_ch3(net, train_iter, test_iter, loss, num_epochs, batch_size, None, None, optimizer)