# Machine learning -- logistic regression (classification) programming exercise

Reference material:

This article covers the second programming exercise in Wu Enda's (Andrew Ng's) machine learning course. For a detailed introduction to logistic regression, please refer to that course.

Programming task:

Given two exam scores and the admission result for each of a group of students, train a classifier that predicts whether a student is admitted based on the two exam scores.

## 1. Principle of logistic regression algorithm

The implementation process of the logistic regression algorithm is basically the same as that of linear regression:

• Read the data
• Get the feature values
• Get the label values
• Scale the features
• Build the functional relationship between features and labels (the hypothesis function)
• Construct the cost function
• Find the partial derivatives of the cost function
• Minimize the cost function and obtain the corresponding weights
• Use the weights to obtain the functional relationship between features and labels

The difference between logistic regression algorithm and linear regression algorithm is mainly reflected in the functional relationship between features and labels:

• Linear regression algorithm: $h_\theta(x) = \theta_0 x_0 + \theta_1 x_1 + \theta_2 x_2 + \theta_3 x_3 + \dots$

• Logistic regression algorithm: $h_\theta(x) = g(\theta_0 x_0 + \theta_1 x_1 + \theta_2 x_2 + \theta_3 x_3 + \dots)$

The function $g(z)$ here is the sigmoid function:
$$g(z) = \frac{1}{1+e^{-z}}$$

The figure above shows the graph of the sigmoid function, whose values lie between 0 and 1.

Here is a supplementary explanation of why the hypothesis function of logistic regression takes this form.

Logistic regression solves a classification problem, so the label can only be 0 or 1, and it is difficult to classify such data with the linear regression hypothesis. Logistic regression classifies by probability instead. Since the sigmoid function takes values in (0, 1), we can assign a sample to class 1 when the sigmoid value is greater than or equal to 0.5 and to class 0 when it is less than 0.5. Because $g(z) \ge 0.5$ exactly when $z \ge 0$, this is the same as checking the sign of $\theta_0 x_0 + \theta_1 x_1 + \dots$. In this way, the classification problem is solved.
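The thresholding rule described above can be illustrated with a minimal sketch. The helper name `classify` and the sample inputs are made up for illustration and are not part of the original exercise.

```python
import numpy as np

def sigmoid(z):
    """Logistic function g(z) = 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def classify(z, threshold=0.5):
    """Hypothetical helper: class 1 where g(z) >= threshold, else class 0."""
    return (sigmoid(z) >= threshold).astype(int)

# Made-up values of z = theta . x for five imaginary samples
z = np.array([-5.0, -1.0, 0.0, 1.0, 5.0])
probs = sigmoid(z)   # every value lies strictly between 0 and 1
cls = classify(z)    # z >= 0 maps to class 1, z < 0 maps to class 0

print(probs)
print(cls)
```

Note that `z = 0` gives a sigmoid value of exactly 0.5, so under the "greater than or equal to" rule it falls into class 1.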

## 2. Cost function

Since the hypothesis function we use is different from the one used in linear regression, the cost function changes accordingly:
$$J(\theta) = -\frac{1}{m}\sum_{i=1}^{m}\left[y^{(i)}\log\left(h_\theta(x^{(i)})\right) + \left(1-y^{(i)}\right)\log\left(1-h_\theta(x^{(i)})\right)\right]$$
The superscript $(i)$ here denotes the features and label of the $i$-th sample. Note the leading minus sign: both logarithms are negative, so the sign makes $J(\theta)$ non-negative, and it matches the `cost` function in the code below.

Partial derivative of the cost function:
$$\frac{\partial J(\theta)}{\partial \theta} = \frac{1}{m}\sum_{i=1}^{m}\left(h_\theta(x^{(i)}) - y^{(i)}\right)x^{(i)}$$
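One way to gain confidence in the gradient formula is to compare it with a finite-difference approximation of the cost. The sketch below does this on a tiny made-up data set; `numerical_gradient`, `X_toy`, `y_toy` and `theta_toy` are hypothetical names and values chosen only for this check.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(theta, X, y):
    # Cross-entropy cost with the leading minus sign folded into each term
    h = sigmoid(X @ theta)
    return np.mean(-y * np.log(h) - (1 - y) * np.log(1 - h))

def gradient(theta, X, y):
    # Analytic gradient: (1/m) * X^T (h - y)
    return X.T @ (sigmoid(X @ theta) - y) / len(y)

def numerical_gradient(theta, X, y, eps=1e-6):
    # Central differences, one coordinate of theta at a time
    grad = np.zeros_like(theta)
    for j in range(len(theta)):
        step = np.zeros_like(theta)
        step[j] = eps
        grad[j] = (cost(theta + step, X, y) - cost(theta - step, X, y)) / (2 * eps)
    return grad

# Made-up toy data: a column of ones plus two features
X_toy = np.array([[1.0, 0.5, 1.5],
                  [1.0, 1.0, 0.2],
                  [1.0, -0.3, 0.8],
                  [1.0, 2.0, -1.0]])
y_toy = np.array([1.0, 0.0, 1.0, 0.0])
theta_toy = np.array([0.1, -0.2, 0.3])

analytic = gradient(theta_toy, X_toy, y_toy)
numeric = numerical_gradient(theta_toy, X_toy, y_toy)
max_diff = np.max(np.abs(analytic - numeric))
print(max_diff)  # should be tiny if the formula and the code agree
```

With `theta` set to all zeros, every prediction is 0.5, so the cost equals log 2 regardless of the labels. That is a handy sanity check before optimizing.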

## 3. Minimize cost function

The method used for linear regression was batch gradient descent. Batch gradient descent also applies to logistic regression, but in this example the author was not able to (fully) solve the classification problem with it.

In this example, another optimization algorithm is used: Newton-CG. Other options include the conjugate gradient method (CG), the quasi-Newton BFGS method and limited-memory BFGS (L-BFGS).

The specific mathematical principles of these optimization algorithms are not introduced here (because the author does not understand them).

For the mathematical principle of Newton-CG, refer to the paper "Improved BFGS iterative method and Newton CG algorithm based on Newton method".

But this is not the point. The point is how to apply these algorithms to minimize the cost function.

For reference, see "Introduction to the use of scipy module functions".
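For comparison, here is a minimal sketch of what batch gradient descent looks like for logistic regression. The source does not say why the method fell short in the author's experiment, so this sketch makes no claim about that. It runs on synthetic exam-like scores that are standardized first, and the learning rate `alpha`, the iteration count and the data generator are all assumptions chosen for illustration.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def cost(theta, X, y):
    h = sigmoid(X @ theta)
    return np.mean(-y * np.log(h) - (1 - y) * np.log(1 - h))

def gradient(theta, X, y):
    return X.T @ (sigmoid(X @ theta) - y) / len(y)

def batch_gradient_descent(X, y, alpha=0.1, n_iters=2000):
    # Every update uses the gradient computed over the whole data set
    theta = np.zeros(X.shape[1])
    history = []
    for _ in range(n_iters):
        theta = theta - alpha * gradient(theta, X, y)
        history.append(cost(theta, X, y))
    return theta, history

# Synthetic data (not the course data set): two scores in [30, 100],
# admitted when a noisy sum of the scores exceeds 130
rng = np.random.default_rng(0)
scores = rng.uniform(30, 100, size=(200, 2))
admitted = (scores.sum(axis=1) + rng.normal(0, 10, size=200) > 130).astype(float)

# Standardize each feature, then prepend the column of ones
scaled = (scores - scores.mean(axis=0)) / scores.std(axis=0)
X_gd = np.column_stack([np.ones(len(scaled)), scaled])

theta_gd, history = batch_gradient_descent(X_gd, admitted)
print(history[0], history[-1])  # cost after the first and the last update
```

On standardized features a fixed step size of this order is usually enough for the cost to fall steadily. On raw scores in the range 0 to 100 the same step size may behave very differently, which is one reason feature scaling appears in the list of steps in section 1.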

scipy.optimize.minimize(fun, x0, args=(), method=None, jac=None, hess=None, hessp=None, bounds=None, constraints=(), tol=None, callback=None, options=None)
'''
fun: Objective function to be minimized
x0: Initial guess for the iteration; x0.shape must be (n,)
args: Additional parameters passed to the objective function
method: Minimization method to be used
jac: Callable that computes the gradient vector (partial derivatives)
'''
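To show how these arguments fit together, here is a small sketch that minimizes a simple quadratic with `method='Newton-CG'`. The functions `objective` and `objective_grad` and the target values passed through `args` are hypothetical and exist only for illustration.

```python
import numpy as np
import scipy.optimize as opt

def objective(x, a, b):
    # Simple bowl whose minimum is at (a, b)
    return (x[0] - a) ** 2 + (x[1] - b) ** 2

def objective_grad(x, a, b):
    # Gradient of the bowl; receives the same extra arguments as objective
    return np.array([2 * (x[0] - a), 2 * (x[1] - b)])

res = opt.minimize(
    fun=objective,
    x0=np.zeros(2),          # initial guess with shape (n,)
    args=(1.0, -2.0),        # forwarded to both fun and jac after x
    method='Newton-CG',      # this method requires jac
    jac=objective_grad,
)
print(res.success, res.x)    # res.x should be close to (1, -2)
```

The returned object carries the solution in `res.x`, which is the attribute the code in section 5 reads as `result.x`.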


## 4. Decision boundaries

In this example, the decision boundary is $X\theta = 0$, i.e. $\theta_0 x_0 + \theta_1 x_1 + \theta_2 x_2 = 0$.

We know $\theta_0$, $\theta_1$, $\theta_2$ and $x_0 = 1$, so for any given $x_1$ we can solve the equation above for $x_2$.

So we get a straight line, which is the decision boundary: $x_2 = -\left(\frac{\theta_0}{\theta_2} + \frac{\theta_1}{\theta_2}x_1\right)$
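As a quick worked check of the boundary line, the sketch below plugs a made-up weight vector into the formula and verifies that the resulting points satisfy $\theta_0 x_0 + \theta_1 x_1 + \theta_2 x_2 = 0$. The values in `theta_demo` are hypothetical, not the weights learned in this exercise, and `boundary_x2` is an illustrative helper name.

```python
import numpy as np

# Hypothetical weights (theta_0, theta_1, theta_2), chosen for easy arithmetic
theta_demo = np.array([-25.0, 0.2, 0.2])

def boundary_x2(x1, theta):
    # Solve theta_0 + theta_1 * x1 + theta_2 * x2 = 0 for x2
    return -(theta[0] + theta[1] * x1) / theta[2]

x1_demo = np.array([30.0, 60.0, 90.0])
x2_demo = boundary_x2(x1_demo, theta_demo)   # approximately [95, 65, 35]

# Every (1, x1, x2) on the line should give theta . x = 0 up to rounding
points = np.column_stack([np.ones_like(x1_demo), x1_demo, x2_demo])
residual = points @ theta_demo
print(x2_demo, residual)
```

Points on one side of this line give a positive $\theta \cdot x$ and are assigned to class 1; points on the other side are assigned to class 0.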

## 5. Code implementation

Data set: from Mr. Huang Haiguang's GitHub repository (specific link).

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import scipy.optimize as opt
from sklearn.metrics import classification_report  # Produces the evaluation report

def get_X(df):  # Get features: prepend a column of ones, drop the label column
    ones = pd.DataFrame({'ones': np.ones(len(df))})
    data = pd.concat([ones, df], axis=1)
    return data.iloc[:, :-1].values

def get_y(df):  # Get labels (last column)
    return np.array(df.iloc[:, -1])

def normalize_feature(df):  # Feature scaling (not used in this example)
    return df.apply(lambda column: (column - column.mean()) / column.std())

def sigmoid(z):  # Define the sigmoid function
    return 1 / (1 + np.exp(-z))

def cost(theta, X, y):  # Cost function; pay attention to the shapes of X, theta and y
    # X is (100, 3) and theta is (3,), so X @ theta is (100,), matching y (100,)
    return np.mean(-y * np.log(sigmoid(X @ theta)) - (1 - y) * np.log(1 - sigmoid(X @ theta)))

def gradient(theta, X, y):  # Partial derivatives of the cost function
    return (1 / len(X)) * X.T @ (sigmoid(X @ theta) - y)  # (3, 100) @ (100,) ====> (3,)

def predict(x, theta):  # Prediction function for verification
    prob = sigmoid(x @ theta)
    return (prob >= 0.5).astype(int)

# Read data, get features and labels, and initialize theta
data = pd.read_csv("ex2data1.txt", names=['exam1', 'exam2', 'admitted'])
X = get_X(data)  # (100, 3)
y = get_y(data)  # (100,)
theta = np.zeros(3)  # (3,)

# Use the Newton-CG optimization algorithm to minimize the cost function
result = opt.minimize(fun=cost, x0=theta, args=(X, y), method='Newton-CG', jac=gradient)
final_theta = result.x  # theta obtained by minimizing the cost function
y_pred = predict(X, final_theta)  # Predict labels for the original training features
print(classification_report(y, y_pred))  # Compare true and predicted labels and print the evaluation report

# Scatter plot
sns.set(context="notebook", style="ticks", font_scale=1.5)
# Recent seaborn versions require x and y to be passed as keyword arguments
sns.lmplot(x='exam1', y='exam2', hue='admitted', data=data,
           height=6,
           fit_reg=False,
           scatter_kws={"s": 25}
           )

# Draw decision boundaries
coefficient = -(result.x / result.x[2])   # With known weights and x_0 = 1, solve θ·x = 0 for x_2 as a function of x_1
x = np.arange(130, step=0.1)   # x here is x_1
y = coefficient[0] + coefficient[1] * x  # y here is x_2 on the boundary (this overwrites the label array y, which is no longer needed)
plt.plot(x, y, 'grey')
plt.xlim(0, 130)
plt.ylim(0, 130)
plt.title('Decision Boundary')
plt.show()


## 6. Results

              precision    recall  f1-score   support

           0       0.87      0.85      0.86        40
           1       0.90      0.92      0.91        60

    accuracy                           0.89       100
   macro avg       0.89      0.88      0.88       100
weighted avg       0.89      0.89      0.89       100


The evaluation report shows that the classification accuracy on the training data reaches 89%.
The following figure is a scatter plot of the data together with the decision boundary.
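To see how the numbers in the report relate to each other, the sketch below recomputes them from a confusion matrix. The counts are an assumption: they are inferred from the rounded recall and support values in the report (0.85 × 40 = 34 and 0.92 × 60 ≈ 55), not taken from the original run.

```python
import numpy as np

# Inferred confusion matrix (assumption): rows are true classes, columns are predictions
cm = np.array([[34, 6],    # true 0: 34 predicted 0, 6 predicted 1
               [5, 55]])   # true 1: 5 predicted 0, 55 predicted 1

accuracy = np.trace(cm) / cm.sum()           # correct predictions / all samples
recall = np.diag(cm) / cm.sum(axis=1)        # per class: correct / actual members
precision = np.diag(cm) / cm.sum(axis=0)     # per class: correct / predicted members
f1 = 2 * precision * recall / (precision + recall)

print(round(accuracy, 2))
print(np.round(precision, 2), np.round(recall, 2), np.round(f1, 2))
```

Under this assumption the rounded values agree with the report, and 89 of the 100 students are classified correctly.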

Tags: Machine Learning

Posted on Thu, 07 Oct 2021 23:38:39 -0400 by pnoeric