Machine learning -- logistic regression (classification) programming exercise
Reference material:
1. Huang Haiguang: notes on Wu Enda's machine learning course, on GitHub
This article covers the second programming exercise in Wu Enda's machine learning course. For a detailed introduction to logistic regression, please refer to the course itself.
Programming task:
Given the two exam scores and the admission result of a number of students, train a classifier that predicts whether a student is admitted from the two scores.
1. Principle of logistic regression algorithm
The implementation process of the logistic regression algorithm is essentially the same as that of linear regression:
- Read the data
- Extract the feature values
- Extract the label values
- Feature scaling
- Build the functional relationship between features and labels (the hypothesis function)
- Construct the cost function
- Compute the partial derivatives of the cost function
- Minimize the cost function to obtain the weights
- Obtain the feature-label relationship from the resulting weights
The difference between the logistic regression algorithm and the linear regression algorithm lies mainly in the functional relationship between features and labels:
- Linear regression algorithm: h_θ(x) = θ_0x_0 + θ_1x_1 + θ_2x_2 + θ_3x_3 + ...
- Logistic regression algorithm: h_θ(x) = g(θ_0x_0 + θ_1x_1 + θ_2x_2 + θ_3x_3 + ...)
Here g(z) is the sigmoid function:
g(z) = \frac{1}{1+e^{-z}}
The figure above shows the graph of the sigmoid function; its values lie between 0 and 1.
Here is a supplementary explanation of why the hypothesis function of logistic regression takes this form.
Logistic regression addresses a classification problem, in which the label takes only the values 0 and 1, so the linear regression hypothesis is ill-suited for classification. Logistic regression instead classifies in terms of probability: since the sigmoid function's output lies in (0, 1), we can decide that a sample belongs to class 1 when the sigmoid value is greater than or equal to 0.5, and to class 0 when it is less than 0.5. This turns the problem into a clean classification rule, as sketched below.
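As an illustration (a minimal sketch of my own, not part of the original exercise code), the sigmoid curve and the 0.5 threshold rule can be plotted and applied like this:

```python
import numpy as np
import matplotlib.pyplot as plt

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

# Plot the sigmoid curve: it maps any real number into (0, 1)
z = np.linspace(-10, 10, 200)
plt.plot(z, sigmoid(z))
plt.axhline(0.5, color='grey', linestyle='--')  # the 0.5 decision threshold
plt.xlabel('z')
plt.ylabel('g(z)')
plt.show()

# Thresholding turns the probability into a class label (1 if >= 0.5, else 0)
print((sigmoid(np.array([-2.0, 0.0, 3.0])) >= 0.5).astype(int))  # [0 1 1]
```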
2. Cost function
Since the hypothesis function we use is different from the one used in linear regression, the cost function changes accordingly.
J(θ) = -\frac{1}{m}\sum_{i=1}^{m}\left(y^{(i)}\log(h_θ(x^{(i)})) + (1-y^{(i)})\log(1-h_θ(x^{(i)}))\right)
The superscript (i) here denotes the features and label of the i-th sample.
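For reference, here is a minimal NumPy sketch of this cost function (the same computation appears, in slightly different form, in the full code of Section 5):

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def cost(theta, X, y):
    # J(theta) = -(1/m) * sum(y*log(h) + (1-y)*log(1-h)), vectorized over the m samples
    h = sigmoid(X @ theta)
    return -np.mean(y * np.log(h) + (1 - y) * np.log(1 - h))
```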
Partial derivative of cost function:
\frac{∂J(θ)}{∂θ} = \frac{1}{m}\sum_{i=1}^{m}\left(h_θ(x^{(i)}) - y^{(i)}\right)x^{(i)}
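A matching gradient sketch (again only an illustration; the exercise code in Section 5 uses the same vectorized expression):

```python
import numpy as np

def sigmoid(z):
    return 1 / (1 + np.exp(-z))

def gradient(theta, X, y):
    # (1/m) * X^T (h_theta(X) - y); shapes: X is (m, n), theta is (n,), y is (m,)
    return X.T @ (sigmoid(X @ theta) - y) / len(X)
```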
3. Minimize cost function
The method used for linear regression was batch gradient descent. Batch gradient descent is also applicable to logistic regression, but in this case the author could not (completely) solve the classification problem with it.
In this example another optimization algorithm is used instead: Newton-CG. Other options include the conjugate gradient method (CG), the quasi-Newton method BFGS, and its limited-memory variant L-BFGS.
The mathematical details of these optimization algorithms are not covered here (because the author does not fully understand them).
For the mathematics of Newton-CG, see the paper "Improved BFGS iterative method and Newton CG algorithm based on Newton method".
But that is not the point; the point is how to apply these algorithms to minimize the cost function.
Reference: introduction to the use of the scipy module's functions.
```python
scipy.optimize.minimize(fun, x0, args=(), method=None, jac=None, hess=None,
                        hessp=None, bounds=None, constraints=(), tol=None,
                        callback=None, options=None)
'''
fun:    the objective function to be minimized
x0:     initial guess for the iteration, required to have shape (n,)
args:   extra arguments passed to the objective function
method: the minimization method to use
jac:    callable that computes the gradient vector (partial derivatives)
'''
```
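A tiny usage example (a toy problem of my own, not part of the exercise): minimize f(w) = ||w - 3||^2, whose gradient is 2(w - 3), using the same Newton-CG method with a user-supplied jac:

```python
import numpy as np
import scipy.optimize as opt

fun = lambda w: np.sum((w - 3) ** 2)  # objective function
jac = lambda w: 2 * (w - 3)           # its gradient

res = opt.minimize(fun=fun, x0=np.zeros(2), method='Newton-CG', jac=jac)
print(res.x)  # approximately [3. 3.]
```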
4. Decision boundaries
In this example, the decision boundary is where Xθ = 0 (the point at which the sigmoid equals 0.5), i.e.: θ_0x_0 + θ_1x_1 + θ_2x_2 = 0
We know θ_0, θ_1, θ_2 (the fitted weights) and x_0 = 1, so from the formula above we can solve for x_2 in terms of x_1.
So we get a straight line, which is the decision boundary: x_2 = -\left(\frac{θ_0}{θ_2} + \frac{θ_1}{θ_2}x_1\right)
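As a quick sanity check with made-up weights (illustrative values only, not the ones fitted in this exercise), the boundary line can be computed directly:

```python
import numpy as np

theta = np.array([-25.0, 0.2, 0.2])  # hypothetical weights, for illustration only

x1 = np.linspace(0, 130, 100)
x2 = -(theta[0] / theta[2] + theta[1] / theta[2] * x1)  # boundary: x2 = 125 - x1
```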
5. Code implementation
Data set: ex2data1.txt, from Huang Haiguang's GitHub repository.
```python
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import scipy.optimize as opt
from sklearn.metrics import classification_report  # evaluation report


def get_X(df):  # get the features, with an added column of ones (x_0 = 1)
    ones = pd.DataFrame({'ones': np.ones(len(df))})
    data = pd.concat([ones, df], axis=1)
    return data.iloc[:, :-1].values


def get_y(df):  # get the labels (last column)
    return np.array(df.iloc[:, -1])


def normalize_feature(df):  # feature scaling (not used in this example)
    return df.apply(lambda column: (column - column.mean()) / column.std())


def sigmoid(z):  # define the sigmoid function
    return 1 / (1 + np.exp(-z))


def cost(theta, X, y):  # define the cost function; note the shapes of X, theta and y
    # y is (100,) and X @ theta is (100,), so the elementwise product works as intended
    return np.mean(-y * np.log(sigmoid(X @ theta)) - (1 - y) * np.log(1 - sigmoid(X @ theta)))


def gradient(theta, X, y):  # partial derivatives
    # shapes: (3,100) @ (100,) ====> (3,)
    return (1 / len(X)) * X.T @ (sigmoid(X @ theta) - y)


def predict(x, theta):  # prediction function, used for verification
    prob = sigmoid(x @ theta)
    return (prob >= 0.5).astype(int)


# Read the data, extract features and labels, and initialize theta
data = pd.read_csv("ex2data1.txt", names=['exam1', 'exam2', 'admitted'])
X = get_X(data)      # (100, 3)
y = get_y(data)      # (100,)
theta = np.zeros(3)  # (3,)

# Minimize the cost function with the Newton-CG optimization algorithm
result = opt.minimize(fun=cost, x0=theta, args=(X, y), method='Newton-CG', jac=gradient)
final_theta = result.x            # theta obtained by minimizing the cost function
y_pred = predict(X, final_theta)  # predict on the original training features
print(classification_report(y, y_pred))  # compare true and predicted values, print the report

# Scatter plot
sns.set(context="notebook", style="ticks", font_scale=1.5)
sns.lmplot(x='exam1', y='exam2', hue='admitted', data=data,
           height=6, fit_reg=False, scatter_kws={"s": 25})

# Draw the decision boundary: with the weights known and x_0 = 1,
# treat x_1 as x and x_2 as y, then solve theta . x = 0 for y
coefficient = -(result.x / result.x[2])
x = np.arange(130, step=0.1)             # x here is x_1
y = coefficient[0] + coefficient[1] * x  # y here is x_2, expressed in terms of x_1
plt.plot(x, y, 'grey')
plt.xlim(0, 130)
plt.ylim(0, 130)
plt.title('Decision Boundary')
plt.show()
```
6. Operation results
```
              precision    recall  f1-score   support

           0       0.87      0.85      0.86        40
           1       0.90      0.92      0.91        60

    accuracy                           0.89       100
   macro avg       0.89      0.88      0.88       100
weighted avg       0.89      0.89      0.89       100
```
It can be seen from the evaluation report that the accuracy of classification has reached 89%.
The figure below shows a scatter plot of the data together with the decision boundary.