# 1, Introduction to optimization algorithms and regression

Optimization in everyday life: we run into optimization problems all the time, such as how to get from A to B in the shortest time, or how to combine spices to make a good dish. Optimization clearly plays a great role.

Regression: given some data points, we fit a straight line to them. This fitting process is called regression.

# 2, General process of logistic regression

- Data collection: collect data in any way.
- Prepare data: the values must be numeric, and it is best to preprocess them into a structured data format.
- Analyze data: analyze the data by any method.
- Training algorithm: most of the time is spent on training. The purpose of training is to find the best regression coefficients for classification.
- Test algorithm: once the training step is complete, classification is fast.
- Using algorithm: first, input some data and convert it into the corresponding structured values; then, using the trained regression coefficients, run a simple regression computation on these values to determine which category they belong to; after that, further analysis can be done on the output categories.

# 3, Linear regression and logarithmic linear regression

Both the linear regression model and the log-linear regression model determine the best regression coefficients by means of an optimization method.

## Linear regression

The general form of linear model is:

$$f(x) = w_{1}x_{1} + w_{2}x_{2} + \dots + w_{d}x_{d} + b$$

Here $x = (x_1, x_2, \dots, x_d)$ is a sample described by $d$ attributes, and $x_i$ is the value of $x$ on the $i$-th attribute.

Purpose of linear regression: learn a linear model that predicts the real-valued output label as accurately as possible. For the $i$-th sample $x_i$ with label $y_i$, the model is

$$f(x_{i}) = wx_{i} + b$$

such that

$$f(x_{i}) \simeq y_{i}$$
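The general linear model above is just a dot product plus an offset. The sketch below evaluates it for one sample; the weights, offset, and sample values are made-up numbers chosen for illustration, and `linear_model` is a hypothetical helper name.

```python
import numpy as np


def linear_model(x, w, b):
    """Return f(x) = w_1*x_1 + ... + w_d*x_d + b for a single sample x."""
    return float(np.dot(w, x) + b)


# Hypothetical coefficients and sample with d = 3 attributes
w = np.array([2.0, -1.0, 0.5])
b = 1.0
x = np.array([1.0, 3.0, 4.0])

# 2*1 - 1*3 + 0.5*4 + 1 = 2.0
prediction = linear_model(x, w, b)
print(prediction)
```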

## Log linear regression

Let $y = g(x) = e^{z}$ with $z = wx + b$. Taking the logarithm of both sides gives the log-linear regression model:

$$\ln y = wx + b$$
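Because $\ln y$ is linear in $x$, a log-linear model can be fitted by running ordinary linear regression on $(x, \ln y)$. The following is a minimal sketch on synthetic, noise-free data; the values `true_w` and `true_b` and the helper `predict` are assumptions made for the example, and `numpy.polyfit` stands in for whatever fitting routine is preferred.

```python
import numpy as np

# Synthetic data generated from y = exp(w*x + b) with assumed coefficients
true_w, true_b = 0.5, 1.0
x = np.array([0.0, 1.0, 2.0, 3.0, 4.0])
y = np.exp(true_w * x + true_b)

# Fit a straight line to (x, ln y); for degree 1, polyfit returns
# the slope first and the intercept second
w_hat, b_hat = np.polyfit(x, np.log(y), 1)


def predict(x_new):
    """Map the linear prediction back to the original scale of y."""
    return np.exp(w_hat * x_new + b_hat)


print(w_hat, b_hat)
```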

# 4, Least squares method

As mentioned above, the linear regression model predicts a value for each sample, and that prediction will not coincide exactly with the observed value. We therefore use the least squares method as the criterion: the regression line should minimize the sum of squared errors between all observed values and the corresponding regression estimates. The coefficients are found by setting the derivatives of this sum with respect to the coefficients to zero.
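For the single-attribute model $f(x_i) = wx_i + b$, a standard version of the least squares derivation runs as follows. This is a sketch; the number of samples $m$, the error function $E$, and the mean $\bar{x}$ are notation assumed here. The sum of squared errors is

$$E(w, b) = \sum_{i=1}^{m} \left(y_{i} - wx_{i} - b\right)^{2}$$

Setting both partial derivatives to zero:

$$\frac{\partial E}{\partial w} = 2\left(w\sum_{i=1}^{m} x_{i}^{2} - \sum_{i=1}^{m} (y_{i} - b)\,x_{i}\right) = 0, \qquad \frac{\partial E}{\partial b} = 2\left(mb - \sum_{i=1}^{m} (y_{i} - wx_{i})\right) = 0$$

Solving these two equations gives the closed-form solution, with $\bar{x} = \frac{1}{m}\sum_{i=1}^{m} x_{i}$:

$$w = \frac{\sum_{i=1}^{m} y_{i}\,(x_{i} - \bar{x})}{\sum_{i=1}^{m} x_{i}^{2} - \frac{1}{m}\left(\sum_{i=1}^{m} x_{i}\right)^{2}}, \qquad b = \frac{1}{m}\sum_{i=1}^{m} \left(y_{i} - wx_{i}\right)$$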

# 5, Sigmoid function

The function we want takes any input and predicts a category. In the two-class case, for example, the output is only 0 or 1. The unit step function can do this, but it has a problem: it jumps instantaneously from 0 to 1 at the step point, and this jump is sometimes difficult to handle. The sigmoid function is easier to work with. Its formula is:

$$\sigma(x) = \frac{1}{1 + e^{-x}}$$

When $x = 0$, the sigmoid function value is 0.5. As $x$ increases, the sigmoid function approaches 1, and as $x$ decreases, it approaches 0, so on a large enough scale it looks very much like a step function. To build the logistic classifier, we therefore multiply each feature by a regression coefficient, add up all the results, and pass the sum into the sigmoid function. Values greater than 0.5 are classified as 1, and values less than 0.5 are classified as 0. In this way, the linear regression model and the sigmoid function are combined.
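The classification rule described above can be sketched in a few lines. The coefficient and sample values below are hypothetical, chosen only so the arithmetic is easy to follow, and `classify` is an illustrative helper name; the first feature is fixed at 1.0 so that the first coefficient acts as the offset.

```python
import numpy as np


def sigmoid(z):
    """Sigmoid function: maps any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + np.exp(-z))


def classify(features, coefficients):
    """Weight each feature, sum the results, and threshold the sigmoid at 0.5."""
    z = float(np.dot(features, coefficients))
    return 1 if sigmoid(z) > 0.5 else 0


# Hypothetical regression coefficients (not trained on any data)
coefficients = np.array([-1.0, 2.0, 0.5])

sample_a = np.array([1.0, 1.5, 0.0])  # z = -1 + 3 + 0 = 2, so sigmoid(z) > 0.5
sample_b = np.array([1.0, 0.0, 1.0])  # z = -1 + 0 + 0.5 = -0.5, so sigmoid(z) < 0.5

print(classify(sample_a, coefficients), classify(sample_b, coefficients))
```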

Now the question becomes: what is the best regression coefficient?

# 6, Gradient ascent method

Gradient ascent is based on the idea that the best way to find the maximum of a function is to search along the direction of its gradient. If the gradient operator is written as $\nabla$, the gradient of a function $f(x, y)$ is:

$$\nabla f(x, y) = \begin{pmatrix} \dfrac{\partial f(x,y)}{\partial x} \\[2ex] \dfrac{\partial f(x,y)}{\partial y} \end{pmatrix}$$

This gradient means moving by $\frac{\partial f(x,y)}{\partial x}$ in the $x$ direction and by $\frac{\partial f(x,y)}{\partial y}$ in the $y$ direction.

After reaching each point, the gradient ascent algorithm re-estimates the direction in which to move. Starting from p0, we compute the gradient at that point and move along it to the next point p1. At p1, the gradient is computed again and we move along the new gradient direction to p2. This cycle repeats until a stop condition is met. Throughout the iteration, the gradient operator ensures that we always pick the best direction to move in.

The iterative formula of the gradient ascent algorithm, where $\alpha$ is the step size, is:

$$w := w + \alpha \nabla_{w} f(w)$$
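To see the update rule at work, the sketch below applies it to a simple concave function whose maximum is known in advance. The function `f`, its gradient `grad_f`, the step size, and the fixed iteration count are all choices made for this illustration; a fixed number of cycles is used as the stop condition.

```python
import numpy as np


def f(w):
    """A concave test function with its maximum at (1, -2)."""
    return -(w[0] - 1.0) ** 2 - (w[1] + 2.0) ** 2


def grad_f(w):
    """Gradient of f: the vector of partial derivatives."""
    return np.array([-2.0 * (w[0] - 1.0), -2.0 * (w[1] + 2.0)])


def gradient_ascent(start, alpha=0.1, max_cycles=200):
    """Repeat w := w + alpha * grad_f(w) for a fixed number of cycles."""
    w = np.array(start, dtype=float)
    for _ in range(max_cycles):
        w = w + alpha * grad_f(w)
    return w


w_max = gradient_ascent([0.0, 0.0])
print(w_max, f(w_max))
```

With this step size, each update shrinks the distance to the maximum by a constant factor, so the iterate settles very close to (1, -2).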

# 7, Training algorithm

```python
# asmatrix is used in place of the mat alias, which NumPy 2.0 removed
from numpy import asmatrix, exp, ones, shape


def loadDataSet():
    # Read testSet.txt: the first two columns are the features x1 and x2,
    # and the last column is the class label
    dataMat = []
    labelMat = []
    with open("testSet.txt") as fr:
        for line in fr.readlines():
            lineArr = line.strip().split()
            # Prepend a constant 1.0 so that weights[0] acts as the offset b
            dataMat.append([1.0, float(lineArr[0]), float(lineArr[1])])
            labelMat.append(int(lineArr[2]))
    return dataMat, labelMat


def sigmoid(inX):
    # Classification function
    return 1.0 / (1 + exp(-inX))


def gradAscent(dataMatIn, classLabels):
    # Gradient ascent to find the best parameters w
    dataMatrix = asmatrix(dataMatIn)
    labelMat = asmatrix(classLabels).transpose()  # Class labels as a column vector
    m, n = shape(dataMatrix)
    alpha = 0.001  # Step size
    maxCycles = 500  # Number of iterations
    weights = ones((n, 1))
    for k in range(maxCycles):
        h = sigmoid(dataMatrix * weights)  # Predicted values from the sigmoid function
        error = labelMat - h  # Difference between true labels and predictions
        # Gradient ascent formula: transpose the 100 * 3 dataMatrix
        # and multiply it by the 100 * 1 error
        weights = weights + alpha * dataMatrix.transpose() * error
    return weights  # Regression coefficients
```

Running `gradAscent` on the data set returns the regression coefficients as a 3 × 1 matrix, one weight per column of `dataMatrix`.

# 8, Draw decision boundaries

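```python
import matplotlib.pyplot as plt
from numpy import arange, array, asarray, shape


def plotBestFit(weights):
    # Plot the data set together with the logistic regression decision boundary.
    # Uses loadDataSet() defined in the training section.
    # Flatten so that both a 3 x 1 matrix and a 1-D array of weights work
    weights = asarray(weights).flatten()
    dataMat, labelMat = loadDataSet()
    dataArr = array(dataMat)
    n = shape(dataArr)[0]
    xcord1 = []
    ycord1 = []
    xcord2 = []
    ycord2 = []
    for i in range(n):
        if int(labelMat[i]) == 1:
            xcord1.append(dataArr[i, 1])
            ycord1.append(dataArr[i, 2])
        else:
            xcord2.append(dataArr[i, 1])
            ycord2.append(dataArr[i, 2])
    fig = plt.figure()  # Create the figure
    ax = fig.add_subplot(111)  # Add a single subplot
    ax.scatter(xcord1, ycord1, s=30, c="red", marker="s")  # Class 1 points
    ax.scatter(xcord2, ycord2, s=30, c="green")  # Class 0 points
    x = arange(-3.0, 3.0, 0.1)  # Evenly spaced x1 values
    # Decision boundary: 0 = w0 + w1 * x1 + w2 * x2, solved for x2
    y = (-weights[0] - weights[1] * x) / weights[2]
    ax.plot(x, y)
    plt.xlabel("X1")
    plt.ylabel("X2")
    plt.show()
```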

# 9, Stochastic gradient ascent algorithm and its improvement

```python
from numpy import ones, shape


def stocGradAscent(dataMatrix, classLabels):
    # Stochastic gradient ascent; dataMatrix must be a NumPy array.
    # Uses sigmoid() defined in the training section.
    m, n = shape(dataMatrix)
    alpha = 0.01
    weights = ones(n)
    for i in range(m):
        h = sigmoid(sum(dataMatrix[i] * weights))  # Prediction for one sample (a number)
        error = classLabels[i] - h  # Error for one sample (a number)
        weights = weights + alpha * error * dataMatrix[i]
    return weights
```

The `gradAscent` and `stocGradAscent` functions are very similar, but they differ in two ways. In the former, `h` and `error` are vectors computed over the whole data set, while in the latter they are single numbers computed from one sample; the latter also involves no matrix conversion.

The line fitted by `stocGradAscent` is not as good as the one from `gradAscent`. This is due to the difference in the number of iterations: `gradAscent` goes over the whole data set 500 times, while `stocGradAscent` makes a single pass.
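The vector-versus-number difference between the two update rules can be checked on a toy example. The three samples and labels below are made up for illustration, and plain NumPy arrays are used for both variants (an assumption of this sketch, rather than the matrix type used in `gradAscent`).

```python
import numpy as np


def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))


# Three hypothetical samples; the first column is the constant 1.0
X = np.array([[1.0, 0.5, 1.5],
              [1.0, -1.0, 2.0],
              [1.0, 2.0, -0.5]])
y = np.array([1.0, 0.0, 1.0])
w = np.ones(3)

# Batch form: one error per sample, combined through X transposed
batch_error = y - sigmoid(X @ w)       # vector of length 3
batch_direction = X.T @ batch_error    # vector of length 3

# Stochastic form: a single sample gives a single number
single_error = y[0] - sigmoid(X[0] @ w)
single_direction = single_error * X[0]

# The batch direction is the sum of the per-sample directions
summed = sum((y[i] - sigmoid(X[i] @ w)) * X[i] for i in range(len(y)))
print(np.allclose(batch_direction, summed))
```

So one batch update moves along the sum of all per-sample directions evaluated at the same weights, whereas the stochastic version changes the weights after every sample.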

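## Improved stochastic gradient ascent algorithm

```python
from numpy import ones, random, shape


def stocGradAscent1(dataMatrix, classLabels, numIter=1500):
    # Improved stochastic gradient ascent; dataMatrix must be a NumPy array.
    # Uses sigmoid() defined in the training section.
    m, n = shape(dataMatrix)
    weights = ones(n)
    for j in range(numIter):
        dataIndex = list(range(m))  # Samples not yet used in this pass
        for i in range(m):
            # Step size shrinks as j and i grow, but never drops below 0.01
            alpha = 4 / (1.0 + j + i) + 0.01
            randIndex = int(random.uniform(0, len(dataIndex)))
            # Look up the sample through dataIndex so that each sample
            # is used exactly once per pass
            sampleIndex = dataIndex[randIndex]
            h = sigmoid(sum(dataMatrix[sampleIndex] * weights))
            error = classLabels[sampleIndex] - h
            weights = weights + alpha * error * dataMatrix[sampleIndex]
            del dataIndex[randIndex]
    return weights
```

- The improved algorithm adjusts the step size `alpha` at every update so that it shrinks as the number of iterations grows; the constant term 0.01 keeps it from ever reaching 0.
- The regression coefficients are updated using randomly selected samples. This reduces periodic fluctuations in the coefficients.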
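The step size schedule from `stocGradAscent1` can be examined on its own. In the sketch below, `step_size` is a hypothetical helper that simply isolates the expression `4 / (1.0 + j + i) + 0.01`; the pass length of 100 samples is an assumption for the example.

```python
def step_size(j, i):
    """Step size at pass j (outer loop) and update i (inner loop)."""
    return 4 / (1.0 + j + i) + 0.01


start = step_size(0, 0)             # Very first update: 4 / 1 + 0.01
end_of_pass = step_size(0, 99)      # Last update of the first pass over 100 samples
next_pass = step_size(1, 0)         # First update of the second pass
late = step_size(1000, 99)          # Much later in training

print(start, end_of_pass, next_pass, late)
```

Note that the schedule is not strictly decreasing: `i` starts again from 0 at every new pass, so the step size jumps back up at the start of each pass, while the overall level still falls as `j` grows.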

# 10, Prediction of mortality of sick horses by logistic regression

Training data:

Test data:

Classification code:

```python
from numpy import array


def classifyVector(inX, weights):
    # Classify as 1 or 0 by thresholding the sigmoid output at 0.5.
    # Uses sigmoid() and stocGradAscent1() defined in the earlier sections.
    prob = sigmoid(sum(inX * weights))
    if prob > 0.5:
        return 1.0
    else:
        return 0.0


def colicTest():
    # Logistic regression classification test: 21 features, label in column 22
    trainingSet = []
    trainingLabels = []
    with open("horseColicTraining.txt") as frTrain:
        for line in frTrain.readlines():
            currLine = line.strip().split("\t")
            lineArr = []
            for i in range(21):
                lineArr.append(float(currLine[i]))
            trainingSet.append(lineArr)
            trainingLabels.append(float(currLine[21]))
    trainWeights = stocGradAscent1(array(trainingSet), trainingLabels, 500)
    errorCount = 0
    numTestVec = 0.0
    with open("horseColicTest.txt") as frTest:
        for line in frTest.readlines():
            numTestVec += 1.0
            currLine = line.strip().split("\t")
            lineArr = []
            for i in range(21):
                lineArr.append(float(currLine[i]))
            # Parse the label via float() so that both "1" and "1.0" are accepted
            if int(classifyVector(array(lineArr), trainWeights)) != int(float(currLine[21])):
                errorCount += 1
    errorRate = float(errorCount) / numTestVec
    print("the error rate of this test is: %f" % errorRate)
    return errorRate


def multiTest():
    # Run colicTest() several times and report the average error rate
    numTests = 10
    errorSum = 0.0
    for k in range(numTests):
        errorSum += colicTest()
    print("after %d iterations the average error rate is: %f" % (numTests, errorSum / float(numTests)))
```

Changing the number of iterations does not change the average error rate significantly, although in one of the runs the error rate dropped to 20%.

# Summary

Logistic regression has the advantages of low computational cost and of being easy to understand and implement. However, it is prone to underfitting, so its classification accuracy may be low. As for data preprocessing, bad data has to be removed first, and doing this for large data sets is a big problem.