Machine Learning 4: Logistic Regression

Experimental Background

Compared with the k-nearest neighbor and decision tree algorithms, logistic regression is a truly fundamental machine learning algorithm: even in today's deep learning, logistic regression still appears as a building block (a sigmoid output layer is essentially a logistic regression classifier). Logistic regression is closely related to linear regression and log-linear regression. This experiment introduces the principle of the logistic regression algorithm and its effectiveness as a classification algorithm.

1. The Logistic Regression Algorithm

1.1. Linear Regression

Suppose we have a dataset

D = {(x_1, y_1), (x_2, y_2), ..., (x_m, y_m)}

Our goal is to find a linear function

f(x) = wx + b, such that f(x_i) ≈ y_i

For example, if we want a linear regression function for house prices, and we assume the price depends only on the area of the house, then

f(x) = w * area + b

The core goal is to find w and b so that the estimate f(x) matches the corresponding y. The smaller the difference between f(x) and y, the better, so we use least squares as the criterion:

(w*, b*) = argmin_{(w, b)} Σ_{i=1}^{m} (f(x_i) − y_i)²

We already learned the least squares formula in high school; here is the complete solution process for the single-feature case. Taking the derivatives of the squared error with respect to w and b and setting them to zero gives

w = Σ_{i=1}^{m} y_i (x_i − x̄) / ( Σ_{i=1}^{m} x_i² − (1/m)(Σ_{i=1}^{m} x_i)² )

b = (1/m) Σ_{i=1}^{m} (y_i − w x_i)

where x̄ = (1/m) Σ_{i=1}^{m} x_i. This is the closed-form solution for w and b when there is only one w. So what does w mean? w determines the slope of the regression function; each w can be interpreted as the weight of one attribute of an instance, and multiple w are the weights of multiple attributes, which jointly determine the predicted value. Similarly, with multiple w we take the partial derivative with respect to each one, set it to zero, and solve the resulting system of equations to obtain the multivariate linear regression we want.
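As a minimal NumPy sketch of the single-feature closed form (my own illustration, not part of the original experiment):

import numpy as np

def fit_simple_linear(x, y):
    #Closed-form least squares for f(x) = w*x + b with a single feature
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    m = len(x)
    w = np.sum(y * (x - x.mean())) / (np.sum(x ** 2) - np.sum(x) ** 2 / m)
    b = np.mean(y - w * x)
    return w, b

#Made-up house-price data: perfectly linear, so this prints w = 3, b = 0
print(fit_simple_linear([50, 80, 100, 120], [150, 240, 300, 360]))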

(Figure: the relationship between the regression function and x in the case of two weights w.)

1.2. Log-Linear Regression

Although linear regression equations look good, few phenomena in real life can be described by a linear function alone. In such cases we need log-linear regression, which lets us better predict nonlinear, more complex functions while preserving the characteristics of the linear regression model.
The linear regression model:

y = w^T x + b

can be extended to:

ln y = w^T x + b

In fact, the linear part can be wrapped in further functions and nested repeatedly to build more complex models. Let's take a simple example: exponentiating both sides of the log-linear model gives

y = e^{w^T x + b}

so the model predicts y on an exponential scale while the parameters w and b remain linear.
This is the most basic form of log-linear regression, and it is the foundation that logistic regression builds on and improves.
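A small sketch of this idea (my own illustration): taking logarithms of the targets reduces the log-linear fit to the ordinary least squares above:

import numpy as np

def fit_log_linear(x, y):
    #Fit ln(y) = w*x + b by least squares, i.e. y = e^(w*x + b)
    x = np.asarray(x, dtype=float)
    z = np.log(np.asarray(y, dtype=float))   #work in log space
    w = np.sum(z * (x - x.mean())) / (np.sum(x ** 2) - np.sum(x) ** 2 / len(x))
    b = np.mean(z - w * x)
    return w, b

#Toy data generated from y = e^(0.5x + 1); this recovers w = 0.5, b = 1
x = np.array([0.0, 1.0, 2.0, 3.0])
print(fit_log_linear(x, np.exp(0.5 * x + 1.0)))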

1.3. Logistic Regression

Logistic regression is generally used to solve classification problems. Although it places no limit on the scale of the problems it can handle, it is best suited to binary classification. For a linear model, the ideal binary classifier would be the unit-step function over z = w^T x + b:

y = 0 if z < 0; y = 0.5 if z = 0; y = 1 if z > 0

Then the problem is obvious: the step function is neither continuous nor differentiable, so how do we decide which class a prediction belongs to when it falls right at the boundary of the range?
The logistic (sigmoid) function solves this problem:

y = 1 / (1 + e^{−z}), where z = w^T x + b
Obviously, as the magnitude of z grows, the value of y gets very close to 0 or 1, which is why the sigmoid is well suited to binary classification.
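A quick numeric check (my own illustration) makes the saturation concrete:

import numpy as np

z = np.array([-10.0, -2.0, 0.0, 2.0, 10.0])
print(1.0 / (1.0 + np.exp(-z)))
#[4.54e-05, 0.119, 0.5, 0.881, 0.99995] -- values pressed toward 0 or 1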
To estimate w and b, we can use maximum likelihood estimation. Assume we have a dataset

{(x_i, y_i)}, i = 1, ..., m, with labels y_i ∈ {0, 1}

We want to find the (w, b) that maximize the log-likelihood

ℓ(w, b) = Σ_{i=1}^{m} ln p(y_i | x_i; w, b)

Let

β = (w; b) and x̂ = (x; 1)

Then

w^T x + b

can be abbreviated as

β^T x̂
Further let

p_1(x̂; β) = p(y = 1 | x̂; β)

be the probability that x corresponds to y = 1, and

p_0(x̂; β) = p(y = 0 | x̂; β) = 1 − p_1(x̂; β)

the probability that x corresponds to y = 0.
Then the likelihood of each sample is

p(y_i | x_i; w, b) = y_i p_1(x̂_i; β) + (1 − y_i) p_0(x̂_i; β)

that is, the probability of y = 1 and the probability of y = 0, each weighted by the actual label value, summed together.
The probabilities of y = 1 and y = 0 follow from the log-odds relation ln( y / (1 − y) ) = β^T x̂:

p_1(x̂; β) = e^{β^T x̂} / (1 + e^{β^T x̂})

p_0(x̂; β) = 1 / (1 + e^{β^T x̂})

Finally, substituting these sigmoid-form probabilities into the log-likelihood, maximizing it is equivalent to minimizing

ℓ(β) = Σ_{i=1}^{m} ( −y_i β^T x̂_i + ln(1 + e^{β^T x̂_i}) )

and the final solution is

β* = argmin_β ℓ(β)

It is clear that this function has no closed-form solution, so we can only adjust the parameters iteratively, for example with Newton's method, so that the computed values approach the optimum.
The update formula for round t + 1 of Newton's iterative solution is

β^{t+1} = β^t − ( ∂²ℓ(β) / ∂β ∂β^T )^{−1} ∂ℓ(β) / ∂β

where ∂ℓ(β)/∂β = −Σ_{i=1}^{m} x̂_i ( y_i − p_1(x̂_i; β) ) and ∂²ℓ(β)/∂β∂β^T = Σ_{i=1}^{m} x̂_i x̂_i^T p_1(x̂_i; β)(1 − p_1(x̂_i; β)).
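As a minimal sketch of this Newton update (my own illustration; the experiment's code in Section 2 uses stochastic gradient ascent instead), the gradient and Hessian above translate directly into NumPy:

import numpy as np

def logistic_newton(X, y, num_iter=10):
    #Newton's method for logistic regression; X is (m, n), y holds 0/1 labels
    m, n = X.shape
    Xhat = np.hstack([X, np.ones((m, 1))])   #append a 1 so that beta = (w; b)
    beta = np.zeros(n + 1)
    for _ in range(num_iter):
        p1 = 1.0 / (1.0 + np.exp(-Xhat @ beta))            #p(y=1 | x_hat; beta)
        grad = -Xhat.T @ (y - p1)                          #first derivative
        hess = (Xhat * (p1 * (1 - p1))[:, None]).T @ Xhat  #second derivative
        hess += 1e-6 * np.eye(n + 1)                       #small damping keeps the Hessian invertible
        beta -= np.linalg.solve(hess, grad)                #the t+1 update above
    return beta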

2. Code Analysis of the Logistic Regression Algorithm

from numpy import *
#sigmoid function; the input is clipped so exp() cannot overflow
def sigmoid(inX):
    return 1.0/(1.0+exp(-clip(inX, -500, 500)))
#Classify by thresholding the predicted probability of class 1 at 0.5
def classifyVector(inX, weights):
    prob = sigmoid(sum(inX*weights))
    if prob > 0.5: 
        return 1.0
    else: 
        return 0.0
#Iteratively compute the weights by stochastic gradient ascent
def stocGradAscent1(dataMatrix, classLabels, numIter=150):
    m,n = shape(dataMatrix)
    weights = ones(n)
    for j in range(numIter):
        dataIndex = list(range(m))
        for i in range(m):
            #Learning rate decays with the iterations but never reaches 0
            alpha = 4/(1.0+j+i)+0.0001
            #Pick a random sample, without replacement within one pass
            randIndex = int(random.uniform(0,len(dataIndex)))
            sample = dataIndex[randIndex]
            h = sigmoid(sum(dataMatrix[sample]*weights))
            error = classLabels[sample] - h
            weights = weights + alpha * error * dataMatrix[sample]
            del(dataIndex[randIndex])
    return weights
#Train on the training set and return the error rate on the test set
def colicTest():
    #training set
    frTrain = open('D:/vscode/python/.vscode/train_data.txt')
    #Test Set
    frTest = open('D:/vscode/python/.vscode/test_data.txt')
    trainingSet = []
    trainingLabels = []
    #training set
    for line in frTrain.readlines():
        currLine = line.strip().split('\t')
        lineArr =[]
        for i in range(21):
            lineArr.append(float(currLine[i]))
        trainingSet.append(lineArr)
        trainingLabels.append(float(currLine[21]))
    trainWeights = stocGradAscent1(array(trainingSet), trainingLabels, 500)
    errorCount = 0
    numTestVec = 0.0
    #Test Set
    for line in frTest.readlines():
        numTestVec += 1.0
        #Cutting Properties
        currLine = line.strip().split('\t')
        lineArr =[]
        for i in range(21):
            lineArr.append(float(currLine[i]))
        #Determine if the value is consistent with the true value
        if int(classifyVector(array(lineArr), trainWeights))!= int(float(currLine[21])):
            errorCount += 1
    #Calculation error rate
    errorRate = (float(errorCount)/numTestVec)
    print ("The error rate of this training is%f" % errorRate)
    return errorRate
#Run the test multiple times and report the average error rate
def multiTest():
    numTests = 100
    errorSum=0.0
    for k in range(numTests):
        errorSum += colicTest()
    print ("%d Average error rate per training session: %f" % (numTests, errorSum/float(numTests)))

3. Logistic Regression Experiments

This experiment uses an article-quality dataset: whether an article is good is judged from attribute data such as clicks and reads (21 attribute values per line, matching the features read by the code above). The results of 100 runs are as follows:

4. Summary

Advantages of logistic regression:
1. There is no need to assume a data distribution beforehand
2. It yields approximate probability predictions for the classes (the probability values are also useful for downstream applications)
3. Existing numerical optimization algorithms (such as Newton's method) can be applied directly to find the optimal solution, quickly and efficiently
Testing on a large number of datasets, this experiment found that when attribute values lie in the range −1 to 1, accurate classification is difficult; expanding the numeric range of the data significantly improves the results, as the sketch below illustrates.
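As a minimal sketch of that preprocessing step (my own illustration; the target range here is an assumption, not taken from the original experiment):

import numpy as np

def rescale_to_range(X, low=-10.0, high=10.0):
    #Min-max rescale each column of X to [low, high]
    X = np.asarray(X, dtype=float)
    x_min, x_max = X.min(axis=0), X.max(axis=0)
    span = np.where(x_max > x_min, x_max - x_min, 1.0)   #avoid division by zero
    return low + (X - x_min) / span * (high - low)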
