This article analyzes the logistic regression (LR) model in terms of its structure, learning objective, and optimization algorithm, and implements LR training and prediction from scratch in Python.

## 1. Structure of the logistic regression model

Logistic regression is a generalized linear classification model. Its structure can be viewed as a single-layer neural network: an input layer connected directly to an output layer containing a single neuron with a sigmoid activation function, and no hidden layer. The model's computation reduces to two steps: a linear combination of the input features x with the model weights w, followed by a sigmoid activation that outputs a probability. Concretely, each input feature in x is multiplied by its corresponding weight in w and the results are summed; the output neuron's activation function σ (the sigmoid) then nonlinearly maps the linear score wx + b to a probability value in the interval (0, 1). Training (optimizing the model weights) uses gradient descent to learn weights w that minimize the error between the model output ŷ = sigmoid(wx + b) and the true label y.

Note: the sigmoid function is an S-shaped curve whose output lies in (0, 1). As the input moves away from 0, the output quickly approaches 0 or 1. For why the sigmoid output can reasonably be interpreted as a probability, refer to the following derivation:
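This saturating behavior is easy to check numerically; a minimal numpy sketch (the function name `sigmoid` matches the implementation later in this article):

```python
import numpy as np

def sigmoid(z):
    """S-shaped curve mapping any real number into (0, 1)."""
    return 1 / (1 + np.exp(-z))

z = np.array([-10.0, -2.0, 0.0, 2.0, 10.0])
print(np.round(sigmoid(z), 4))
# The output is 0.5 at z = 0 and saturates quickly toward 0 or 1 away from 0
```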

Logistic regression is a discriminative model: it directly models the conditional probability P(y|x). To justify the sigmoid form, assume the class-conditional distribution P(x|y) is Gaussian (with a covariance shared across classes) and the prior P(y) is a multinomial (Bernoulli in the binary case) distribution. For the binary classification problem, Bayes' rule then gives:

$$P(y=1 \mid x) = \frac{P(x \mid y=1)\,P(y=1)}{P(x \mid y=1)\,P(y=1) + P(x \mid y=0)\,P(y=0)} = \frac{1}{1 + e^{-(wx+b)}}$$

where the Gaussian and prior assumptions make the log-odds $\log\frac{P(x \mid y=1)\,P(y=1)}{P(x \mid y=0)\,P(y=0)}$ a linear function wx + b of x.

As this shows, the output probability of logistic regression (also called log-odds regression) has exactly the sigmoid form.

The logistic regression model is essentially a generalized linear classifier: its decision boundary is linear. This can be seen from the decision function y = sigmoid(wx + b): when wx + b > 0, y > 0.5, and when wx + b < 0, y < 0.5, so the predicted class (0 or 1) is determined entirely by the sign of wx + b (as shown in the figure below). The decision boundary, wx + b = 0, is therefore linear.
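To make this concrete, here is a small sketch with made-up 2-D weights and bias (the values of `w`, `b`, and the two sample points are purely illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Hypothetical weights and bias; the decision boundary is the line w @ x + b = 0
w = np.array([1.0, -2.0])
b = 0.5

x_pos = np.array([3.0, 1.0])  # w @ x_pos + b = 1.5 > 0
x_neg = np.array([0.0, 1.0])  # w @ x_neg + b = -1.5 < 0

print(sigmoid(w @ x_pos + b) > 0.5)  # True  -> predicted class 1
print(sigmoid(w @ x_neg + b) > 0.5)  # False -> predicted class 0
```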

## 2. Learning objective

Logistic regression is a classic classification model. For prediction, our goal is for the predicted probability to match each sample's actual positive or negative label. The sigmoid output represents the probability that the current sample's label is 1, so the prediction $\hat{y}$ can be expressed as

$$\hat{y} = P(y=1 \mid x) = \sigma(wx + b)$$

and the probability that the current sample is predicted to be 0 can be expressed as $1 - \hat{y}$.

For a positive sample (y = 1) we want the predicted probability to approach 1 as closely as possible; for a negative sample (y = 0) we want it to approach 0. In other words, we want the predicted probabilities to maximize the following expression (the maximum-likelihood principle), which covers both cases in a single formula:

$$P(y \mid x) = \hat{y}^{\,y}\,(1 - \hat{y})^{1 - y}$$

We apply the log function to P(y|x), which is harmless because the log operation does not affect the monotonicity of the function. This gives:

$$\log P(y \mid x) = y \log \hat{y} + (1 - y) \log(1 - \hat{y})$$

We want log P(y|x) to be as large as possible, which is equivalent to making its negative, -log P(y|x), as small as possible. We can therefore introduce a loss function and let Loss = -log P(y|x):

$$L(\hat{y}, y) = -\big(y \log \hat{y} + (1 - y) \log(1 - \hat{y})\big)$$

This is the loss for a single sample. To compute the average loss over m samples, we simply sum the m individual losses and divide by m:

$$J(w, b) = \frac{1}{m} \sum_{i=1}^{m} L\big(\hat{y}^{(i)}, y^{(i)}\big) = -\frac{1}{m} \sum_{i=1}^{m} \left[ y^{(i)} \log \hat{y}^{(i)} + \big(1 - y^{(i)}\big) \log\big(1 - \hat{y}^{(i)}\big) \right]$$

This is the cross-entropy loss (also called the log loss), the learning objective of LR derived from the maximum-likelihood principle: the model's predicted probabilities should follow the distribution of the true labels, and the closer the predicted distribution is to the true one, the better the model. One point worth noting: since the sigmoid output can only approach 0 or 1 but never reach them, the loss can likewise approach but never equal 0.
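The formula can be checked numerically; the labels and the two sets of predicted probabilities below are made up for illustration:

```python
import numpy as np

def cross_entropy(y, y_hat):
    """Average cross-entropy loss: -1/m * sum(y*log(y_hat) + (1-y)*log(1-y_hat))."""
    m = y.shape[0]
    return -np.sum(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat)) / m

y = np.array([1.0, 0.0, 1.0, 0.0])          # true labels
loose = np.array([0.7, 0.3, 0.6, 0.4])      # hesitant predictions
sharp = np.array([0.99, 0.01, 0.99, 0.01])  # confident, correct predictions

print(cross_entropy(y, loose))  # larger loss
print(cross_entropy(y, sharp))  # much smaller, but still nonzero
```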

## 3. Optimization algorithm

We take minimizing the cross entropy as the learning objective; what remains is to use an optimization algorithm to fit the parameters toward that goal. Since logistic regression under maximum-likelihood estimation has no (optimal) analytical solution, we typically use the gradient descent algorithm: after many iterations, the learned parameters are a good numerical solution.

The gradient descent algorithm can be understood intuitively as walking down a mountain. Picture the loss function J(w) as a mountain; our goal is to reach its foot (that is, to find the model parameters w that minimize the loss function).

To go down the mountain, all you have to do is walk downhill, step by step. On the loss-function mountain, the downhill direction at any position is the negative gradient direction (plainly put, the direction of steepest descent), and the size of each step is controlled by the learning rate α. At each position, compute the gradient, take a step of size α along the steepest downhill direction to a new position, and repeat, step by step, until you feel you have reached the foot of the mountain.

Of course, proceeding this way may not bring us to the foot of the mountain (the global cost minimum) but only to a small valley (a local cost minimum); this is one place where the gradient descent algorithm can be further improved.

The corresponding algorithm steps are:

1. Initialize the parameters w and b (for example, to zeros).
2. Compute the predictions $\hat{y} = \sigma(wx + b)$ and the loss J(w, b) on the training set.
3. Compute the gradients $\partial J / \partial w$ and $\partial J / \partial b$.
4. Update the parameters along the negative gradient: $w \leftarrow w - \alpha\,\partial J / \partial w$, $b \leftarrow b - \alpha\,\partial J / \partial b$.
5. Repeat steps 2 to 4 for a fixed number of iterations or until the loss converges.
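The descent itself can be seen in isolation on a toy one-dimensional loss; the quadratic J(w) = (w - 3)² below is chosen only for illustration:

```python
# Minimal gradient-descent sketch on a toy loss J(w) = (w - 3)**2,
# whose gradient is dJ/dw = 2 * (w - 3); the minimum is at w = 3.
w = 0.0       # initial parameter
alpha = 0.1   # learning rate: controls the step size

for _ in range(100):
    grad = 2 * (w - 3)    # gradient at the current position
    w = w - alpha * grad  # one step along the negative-gradient direction

print(round(w, 4))  # -> 3.0, the minimizer
```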

In addition, from a non-maximum-likelihood perspective, for an (optimal) analytical solution of logistic regression, see kexue.fm/archives/8578

## 4. Implementing logistic regression in Python

The dataset for this project is a cancer-cell classification dataset. Using Python's numpy library, we implement the logistic regression model, define the objective function as the cross entropy, optimize the model iteratively with gradient descent, and verify the classification performance:

```python
# coding: utf-8
import numpy as np
import matplotlib.pyplot as plt
import h5py
import scipy
from sklearn import datasets


# Load the data and simply divide it into training set / test set
def load_dataset():
    dataset = datasets.load_breast_cancer()
    train_x, train_y = dataset['data'][0:400], dataset['target'][0:400]
    # note: the [400:-1] slice drops the final sample; use [400:] to keep it
    test_x, test_y = dataset['data'][400:-1], dataset['target'][400:-1]
    return train_x, train_y, test_x, test_y


# logit activation function
def sigmoid(z):
    s = 1 / (1 + np.exp(-z))
    return s


# Initialize weights to zero
def initialize_with_zeros(dim):
    w = np.zeros((dim, 1))
    b = 0
    assert(w.shape == (dim, 1))
    assert(isinstance(b, float) or isinstance(b, int))
    return w, b


# Define the learning objective function and compute the gradients
def propagate(w, b, X, Y):
    m = X.shape[1]
    A = sigmoid(np.dot(w.T, X) + b)  # Logistic regression output prediction
    cost = -1 / m * np.sum(Y * np.log(A) + (1 - Y) * np.log(1 - A))  # Cross-entropy loss as objective function
    dw = 1 / m * np.dot(X, (A - Y).T)  # Gradient with respect to the weights w
    db = 1 / m * np.sum(A - Y)         # Gradient with respect to the bias b
    assert(dw.shape == w.shape)
    assert(db.dtype == float)
    cost = np.squeeze(cost)
    assert(cost.shape == ())
    grads = {"dw": dw, "db": db}
    return grads, cost


# Define the optimization algorithm
def optimize(w, b, X, Y, num_iterations, learning_rate, print_cost):
    costs = []
    for i in range(num_iterations):  # Gradient descent iterative optimization
        grads, cost = propagate(w, b, X, Y)
        dw = grads["dw"]  # Gradient of the weights w
        db = grads["db"]
        w = w - learning_rate * dw  # Update w along the negative gradient (dw), scaled by learning_rate
        b = b - learning_rate * db
        if i % 50 == 0:
            costs.append(cost)
        if print_cost and i % 100 == 0:
            print("Cost after iteration %i: %f" % (i, cost))
    params = {"w": w, "b": b}
    grads = {"dw": dw, "db": db}
    return params, grads, costs


# Predict with the optimized model parameters w, b
def predict(w, b, X):
    m = X.shape[1]
    Y_prediction = np.zeros((1, m))
    A = sigmoid(np.dot(w.T, X) + b)
    for i in range(A.shape[1]):
        if A[0, i] <= 0.5:
            Y_prediction[0, i] = 0
        else:
            Y_prediction[0, i] = 1
    assert(Y_prediction.shape == (1, m))
    return Y_prediction


def model(X_train, Y_train, X_test, Y_test, num_iterations, learning_rate, print_cost):
    # Initialization
    w, b = initialize_with_zeros(X_train.shape[0])
    # Optimize the model parameters with gradient descent
    parameters, grads, costs = optimize(w, b, X_train, Y_train, num_iterations, learning_rate, print_cost)
    w = parameters["w"]
    b = parameters["b"]
    # Model prediction results
    Y_prediction_test = predict(w, b, X_test)
    Y_prediction_train = predict(w, b, X_train)
    # Evaluate model accuracy
    print("train accuracy: {} %".format(100 - np.mean(np.abs(Y_prediction_train - Y_train)) * 100))
    print("test accuracy: {} %".format(100 - np.mean(np.abs(Y_prediction_test - Y_test)) * 100))
    d = {"costs": costs,
         "Y_prediction_test": Y_prediction_test,
         "Y_prediction_train": Y_prediction_train,
         "w": w, "b": b,
         "learning_rate": learning_rate,
         "num_iterations": num_iterations}
    return d


# Load the cancer cell dataset
train_set_x, train_set_y, test_set_x, test_set_y = load_dataset()

# Reshape to (num_features, num_samples)
train_set_x = train_set_x.reshape(train_set_x.shape[0], -1).T
test_set_x = test_set_x.reshape(test_set_x.shape[0], -1).T
print(train_set_x.shape)
print(test_set_x.shape)

# Train the model and evaluate the accuracy
paras = model(train_set_x, train_set_y, test_set_x, test_set_y,
              num_iterations=100, learning_rate=0.001, print_cost=False)
```
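One caveat: the breast-cancer features span very different scales (e.g. mean area in the hundreds vs. mean smoothness below 1), which can make gradient descent on the raw features slow or numerically unstable. A common remedy, not part of the code above, is to standardize the features with training-set statistics before training:

```python
import numpy as np
from sklearn import datasets

dataset = datasets.load_breast_cancer()
train_x = dataset['data'][0:400]
test_x = dataset['data'][400:-1]

# Standardize using statistics from the training split only,
# so that no information leaks from the test set
mu = train_x.mean(axis=0)
sigma = train_x.std(axis=0)
train_x = (train_x - mu) / sigma
test_x = (test_x - mu) / sigma

print(np.allclose(train_x.mean(axis=0), 0.0))  # True: zero mean per feature
print(np.allclose(train_x.std(axis=0), 1.0))   # True: unit variance per feature
```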

## (END)

This article was first published on the WeChat official account "Algorithm Advanced", where the original text and the code related to this article can be accessed.