# Practical course of machine learning: linear regression

### What is regression?

The purpose of regression is to predict the numerical target value. The most direct way is to write out a calculation formula of the target value according to the input. If you want to predict the power of your little sister's boyfriend's car, you may calculate as follows:

HorsePower = 0.0015 * annualSalary - 0.99 * hoursListeningToPublicRadio

Written in Chinese is:

Car power of little sister's boyfriend = 0.0015 * annual salary of little sister's boyfriend - 0.99 * time to listen to public radio

This is the so-called regression equation, in which 0.0015 and -0.99 are called regression weights. The process of finding these regression coefficients is regression. Once we have these regression coefficients, given the input, it is very easy to make prediction. The specific method is to multiply the regression coefficient by the input value, and then add all the results together to get the predicted value.

When it comes to regression, it generally refers to linear regression, so regression and linear regression in this paper represent the same meaning. Linear regression means that you can multiply the input items by some constants, and then add up the results to get the output. It should be noted that there is another regression model called nonlinear regression, which does not agree with the above practice. For example, it thinks that the output may be the product of the input. In this way, the above power calculation formula can also be written as follows:

HorsePower = 0.0015 * annualSalary / hoursListeningToPublicRadio

This is an example of nonlinear regression, which is not discussed in depth in this paper.

### 1, Find the best fitting line by linear regression

How do we get the regression equation from a lot of data? Suppose that the input data is stored in matrix X and the result is stored in vector y: The regression coefficients are stored in the vector w: Then, for the given data x1, that is, the first column data of matrix X, the prediction result u1 will be given by the following formula: Now the question is, with the data matrix X and the corresponding label vector Y in hand, how can we find w? A common method is to find the w that minimizes the error. The error here refers to the difference between the predicted u value and the real y value. The simple accumulation of the error will make the positive difference and negative difference offset each other, so we use the square error.

The sum of squared errors can be written as: The matrix representation can also be written as: Why can it change like this? Remember a premise: if x is a vector, X is the column vector and x^T is the row vector by default. Bring in the data matrix X and label vector y mentioned above to know why it changes so much.

Before we continue the derivation, we should first clarify one purpose: find w and minimize the sum of square errors. Because we believe that the smaller the sum of square errors, the better the fitting effect of linear regression.

Now, we take the derivative of w by the sum of the square errors expressed by the matrix: If you are not familiar with matrix calculation, you can move here: https://blog.csdn.net/nomadlx53/article/details/50849941

Let the above formula be equal to 0 to obtain: The small mark above w indicates that this is the optimal solution of w that can be estimated at present. The w estimated from the existing data may not be the real W value in the data, so a "hat" symbol is used here to indicate that it is only a best estimate of W.

It is worth noting that the above formula contains an inverse matrix, that is, the equation is only used when the inverse matrix exists, that is, the matrix is a square matrix and its determinant is not 0.

The best w solution is a common problem in statistics. There are many other methods besides matrix method. By calling the matrix method in NumPy library, we can complete the required function with only a few lines of code. This method is also called OLS, which means "ordinary least squares".

Let's look at the dataset first: The first column is 1.0, i.e. x0. The second column is x1, i.e. X-axis data. The third column is x2, i.e. y-axis data. First, draw the data and look at the data distribution. The code is as follows:

```import matplotlib.pyplot as plt
import numpy as np

"""
Parameters:
fileName - file name
Returns:
xArr - x data set
yArr - y data set
"""

xArr = []; yArr = []
fr = open(fileName)
lineArr =[]
curLine = line.strip().split('\t')
for i in range(numFeat):
lineArr.append(float(curLine[i]))
xArr.append(lineArr)
yArr.append(float(curLine[-1]))
return xArr, yArr

def plotDataSet():
"""
Function description:Draw dataset
Parameters:
nothing
Returns:
nothing
"""
n = len(xArr)                                                        #Number of data
xcord = []; ycord = []                                                #Sample point
for i in range(n):
xcord.append(xArr[i]); ycord.append(yArr[i])                    #Sample point
fig = plt.figure()
ax.scatter(xcord, ycord, s = 20, c = 'blue',alpha = .5)                #Draw sample points
plt.title('DataSet')                                                #Draw title
plt.xlabel('X')
plt.show()

if __name__ == '__main__':
plotDataSet()
``` By visualizing the data, we can see the distribution of the data. Next, let's calculate the regression coefficient vector according to the regression coefficient calculation method derived above, and draw the regression curve according to the regression coefficient vector. The code is as follows:

```import matplotlib.pyplot as plt
import numpy as np

"""
Parameters:
fileName - file name
Returns:
xArr - x data set
yArr - y data set
"""
xArr = []; yArr = []
fr = open(fileName)
lineArr =[]
curLine = line.strip().split('\t')
for i in range(numFeat):
lineArr.append(float(curLine[i]))
xArr.append(lineArr)
yArr.append(float(curLine[-1]))
return xArr, yArr

def standRegres(xArr,yArr):
"""
Function description:Calculate regression coefficient w
Parameters:
xArr - x data set
yArr - y data set
Returns:
ws - regression coefficient
"""
xMat = np.mat(xArr); yMat = np.mat(yArr).T
xTx = xMat.T * xMat                            #The regression coefficient is calculated according to the publicity derived in this paper
if np.linalg.det(xTx) == 0.0:
print("The matrix is singular,No inversion")
return
ws = xTx.I * (xMat.T*yMat)
return ws

def plotRegression():
"""
Function description:Draw regression curves and data points
Parameters:
nothing
Returns:
nothing
"""
ws = standRegres(xArr, yArr)                                        #Calculate regression coefficient
xMat = np.mat(xArr)                                                    #Create xMat matrix
yMat = np.mat(yArr)                                                    #Create yMat matrix
xCopy = xMat.copy()                                                    #Deep copy xMat matrix
xCopy.sort(0)                                                        #sort
yHat = xCopy * ws                                                     #Calculate the corresponding y value
fig = plt.figure()
ax.plot(xCopy[:, 1], yHat, c = 'red')                                #Draw regression curve
ax.scatter(xMat[:,1].flatten().A, yMat.flatten().A, s = 20, c = 'blue',alpha = .5)                #Draw sample points
plt.title('DataSet')                                                #Draw title
plt.xlabel('X')
plt.show()

if __name__ == '__main__':
plotRegression()
``` How to judge the fitting effect of the fitting curve? Of course, we can observe according to our own experience. In addition, we can also use the corrcoef method to compare the correlation between the predicted value and the real value. The code is as follows:

```import numpy as np

"""
Parameters:
fileName - file name
Returns:
xArr - x data set
yArr - y data set
"""
xArr = []; yArr = []
fr = open(fileName)
lineArr =[]
curLine = line.strip().split('\t')
for i in range(numFeat):
lineArr.append(float(curLine[i]))
xArr.append(lineArr)
yArr.append(float(curLine[-1]))
return xArr, yArr

def standRegres(xArr,yArr):
"""
Function description:Calculate regression coefficient w
Parameters:
xArr - x data set
yArr - y data set
Returns:
ws - regression coefficient
"""
xMat = np.mat(xArr); yMat = np.mat(yArr).T
xTx = xMat.T * xMat                            #The regression coefficient is calculated according to the publicity derived in this paper
if np.linalg.det(xTx) == 0.0:
print("The matrix is singular,No inversion")
return
ws = xTx.I * (xMat.T*yMat)
return ws

if __name__ == '__main__':
ws = standRegres(xArr, yArr)                                        #Calculate regression coefficient
xMat = np.mat(xArr)                                                    #Create xMat matrix
yMat = np.mat(yArr)                                                    #Create yMat matrix
yHat = xMat * ws
print(np.corrcoef(yHat.T, yMat))
``` It can be seen that the data on the diagonal is 1.0, because the match between yMat and itself is perfect, while the correlation coefficient between yhhat and yMat is 0.98.

The best fitting method regards the data as a straight line for modeling, which has a very good performance. There seem to be other potential patterns in the data. So how can we use these patterns? We can locally adjust the forecast according to the data. This method will be introduced below.

### 2, Locally weighted linear regression

One problem of linear regression is that under fitting may occur, because it seeks unbiased estimation with small mean square error. Obviously, if the model is under fitted, it will not achieve good prediction results. Therefore, some methods allow some deviations to be introduced into the estimation, so as to reduce the mean square error of prediction.

One method is Locally Weighted Linear Regression (LWLR). In this method, we give a certain weight to each point near the point to be predicted. Like kNN, this algorithm needs to select the corresponding data subset in advance for each prediction. The form of the regression coefficient W released by the algorithm is as follows: Where W is a matrix. The difference between this formula and the formula derived above is that w is used to give weight to each store.

LWLR uses a "kernel" (similar to the kernel in support vector machine) to give higher weights to nearby points. The type of kernel can be selected freely. The most commonly used kernel is Gaussian kernel. The weight of Gaussian kernel is as follows: In this way, we can write locally weighted linear regression according to the above formula. We can adjust the regression effect by changing the value of k. The code is as follows:

```from matplotlib.font_manager import FontProperties
import matplotlib.pyplot as plt
import numpy as np

"""
Parameters:
fileName - file name
Returns:
xArr - x data set
yArr - y data set
"""
xArr = []; yArr = []
fr = open(fileName)
lineArr =[]
curLine = line.strip().split('\t')
for i in range(numFeat):
lineArr.append(float(curLine[i]))
xArr.append(lineArr)
yArr.append(float(curLine[-1]))
return xArr, yArr

def plotlwlrRegression():
"""
Function description:Draw multiple locally weighted regression curves
Parameters:
nothing
Returns:
nothing
"""
font = FontProperties(fname=r"c:\windows\fonts\simsun.ttc", size=14)
yHat_1 = lwlrTest(xArr, xArr, yArr, 1.0)                            #Yhhat was calculated from locally weighted linear regression
yHat_2 = lwlrTest(xArr, xArr, yArr, 0.01)                            #Yhhat was calculated from locally weighted linear regression
yHat_3 = lwlrTest(xArr, xArr, yArr, 0.003)                            #Yhhat was calculated from locally weighted linear regression
xMat = np.mat(xArr)                                                    #Create xMat matrix
yMat = np.mat(yArr)                                                    #Create yMat matrix
srtInd = xMat[:, 1].argsort(0)                                        #Sort, return index value
xSort = xMat[srtInd][:,0,:]
fig, axs = plt.subplots(nrows=3, ncols=1,sharex=False, sharey=False, figsize=(10,8))
axs.plot(xSort[:, 1], yHat_1[srtInd], c = 'red')                        #Draw regression curve
axs.plot(xSort[:, 1], yHat_2[srtInd], c = 'red')                        #Draw regression curve
axs.plot(xSort[:, 1], yHat_3[srtInd], c = 'red')                        #Draw regression curve
axs.scatter(xMat[:,1].flatten().A, yMat.flatten().A, s = 20, c = 'blue', alpha = .5)                #Draw sample points
axs.scatter(xMat[:,1].flatten().A, yMat.flatten().A, s = 20, c = 'blue', alpha = .5)                #Draw sample points
axs.scatter(xMat[:,1].flatten().A, yMat.flatten().A, s = 20, c = 'blue', alpha = .5)                #Draw sample points
#Set title, x-axis label,y-axis label
axs0_title_text = axs.set_title(u'Locally weighted regression curve,k=1.0',FontProperties=font)
axs1_title_text = axs.set_title(u'Locally weighted regression curve,k=0.01',FontProperties=font)
axs2_title_text = axs.set_title(u'Locally weighted regression curve,k=0.003',FontProperties=font)
plt.setp(axs0_title_text, size=8, weight='bold', color='red')
plt.setp(axs1_title_text, size=8, weight='bold', color='red')
plt.setp(axs2_title_text, size=8, weight='bold', color='red')
plt.xlabel('X')
plt.show()

def lwlr(testPoint, xArr, yArr, k = 1.0):
"""
Function description:Regression coefficients were calculated using locally weighted linear regression w
Parameters:
testPoint - Test sample point
xArr - x data set
yArr - y data set
k - Gaussian kernel k,Custom parameters
Returns:
ws - regression coefficient
"""
xMat = np.mat(xArr); yMat = np.mat(yArr).T
m = np.shape(xMat)
weights = np.mat(np.eye((m)))                                        #Create weight diagonal matrix
for j in range(m):                                                  #Traverse the data set to calculate the weight of each sample
diffMat = testPoint - xMat[j, :]
weights[j, j] = np.exp(diffMat * diffMat.T/(-2.0 * k**2))
xTx = xMat.T * (weights * xMat)
if np.linalg.det(xTx) == 0.0:
print("The matrix is singular,No inversion")
return
ws = xTx.I * (xMat.T * (weights * yMat))                            #Calculate regression coefficient
return testPoint * ws

def lwlrTest(testArr, xArr, yArr, k=1.0):
"""
Function description:Locally weighted linear regression test
Parameters:
testArr - Test data set
xArr - x data set
yArr - y data set
k - Gaussian kernel k,Custom parameters
Returns:
ws - regression coefficient
"""
m = np.shape(testArr)                                            #Calculate test dataset size
yHat = np.zeros(m)
for i in range(m):                                                    #Predict each sample point
yHat[i] = lwlr(testArr[i],xArr,yArr,k)
return yHat

if __name__ == '__main__':
plotlwlrRegression()
``` It can be seen that the smaller the K, the better the fitting effect. However, when k is too small, over fitting occurs, for example, when k is equal to 0.003.

### 3, Example: predicting the age of abalone

Next, we apply regression to real data. abalone.txt file records the age of Abalone (an aquatic organism → _→), which is from the data of UCI data set. Abalone age can be calculated from the number of layers of abalone shell.

The data of the dataset is as follows: As you can see, the data set is multidimensional, so it is difficult for us to draw its distribution. The meaning of the representation of each dimension data is not given, but it doesn't matter. We just need to know that the data in the last column is the y value. The last column represents the real age of abalone. The data in the previous columns are the characteristics of abalone, such as the number of layers of abalone shell. Without data cleaning, we directly use all features to test our locally weighted regression.

Create a new. py file and add rssError function to evaluate the final regression results. The code is as follows:

```from matplotlib.font_manager import FontProperties
import matplotlib.pyplot as plt
import numpy as np

"""
Parameters:
fileName - file name
Returns:
xArr - x data set
yArr - y data set
"""
xArr = []; yArr = []
fr = open(fileName)
lineArr =[]
curLine = line.strip().split('\t')
for i in range(numFeat):
lineArr.append(float(curLine[i]))
xArr.append(lineArr)
yArr.append(float(curLine[-1]))
return xArr, yArr

def lwlr(testPoint, xArr, yArr, k = 1.0):
"""
Function description:Regression coefficients were calculated using locally weighted linear regression w
Parameters:
testPoint - Test sample point
xArr - x data set
yArr - y data set
k - Gaussian kernel k,Custom parameters
Returns:
ws - regression coefficient
"""
xMat = np.mat(xArr); yMat = np.mat(yArr).T
m = np.shape(xMat)
weights = np.mat(np.eye((m)))                                        #Create weight diagonal matrix
for j in range(m):                                                  #Traverse the data set to calculate the weight of each sample
diffMat = testPoint - xMat[j, :]
weights[j, j] = np.exp(diffMat * diffMat.T/(-2.0 * k**2))
xTx = xMat.T * (weights * xMat)
if np.linalg.det(xTx) == 0.0:
print("The matrix is singular,No inversion")
return
ws = xTx.I * (xMat.T * (weights * yMat))                            #Calculate regression coefficient
return testPoint * ws

def lwlrTest(testArr, xArr, yArr, k=1.0):
"""
Function description:Locally weighted linear regression test
Parameters:
testArr - Test data set,Test set
xArr - x data set,Training set
yArr - y data set,Training set
k - Gaussian kernel k,Custom parameters
Returns:
ws - regression coefficient
"""
m = np.shape(testArr)                                            #Calculate test dataset size
yHat = np.zeros(m)
for i in range(m):                                                    #Predict each sample point
yHat[i] = lwlr(testArr[i],xArr,yArr,k)
return yHat

def standRegres(xArr,yArr):
"""
Function description:Calculate regression coefficient w
Parameters:
xArr - x data set
yArr - y data set
Returns:
ws - regression coefficient
"""
xMat = np.mat(xArr); yMat = np.mat(yArr).T
xTx = xMat.T * xMat                            #The regression coefficient is calculated according to the publicity derived in this paper
if np.linalg.det(xTx) == 0.0:
print("The matrix is singular,No inversion")
return
ws = xTx.I * (xMat.T*yMat)
return ws

"""
Error size evaluation function
Parameters:
yArr - Real data
yHatArr - Forecast data
Returns:
Error size
"""
return ((yArr - yHatArr) **2).sum()

if __name__ == '__main__':
print('The training set is the same as the test set:Locally weighted linear regression,nucleus k Effect of the size of on the prediction:')
yHat01 = lwlrTest(abX[0:99], abX[0:99], abY[0:99], 0.1)
yHat1 = lwlrTest(abX[0:99], abX[0:99], abY[0:99], 1)
yHat10 = lwlrTest(abX[0:99], abX[0:99], abY[0:99], 10)
print('k=0.1 Time,The error size is:',rssError(abY[0:99], yHat01.T))
print('k=1  Time,The error size is:',rssError(abY[0:99], yHat1.T))
print('k=10 Time,The error size is:',rssError(abY[0:99], yHat10.T))
print('')
print('The training set is different from the test set:Locally weighted linear regression,nucleus k Is the smaller the better? Replace dataset,The test results are as follows:')
yHat01 = lwlrTest(abX[100:199], abX[0:99], abY[0:99], 0.1)
yHat1 = lwlrTest(abX[100:199], abX[0:99], abY[0:99], 1)
yHat10 = lwlrTest(abX[100:199], abX[0:99], abY[0:99], 10)
print('k=0.1 Time,The error size is:',rssError(abY[100:199], yHat01.T))
print('k=1  Time,The error size is:',rssError(abY[100:199], yHat1.T))
print('k=10 Time,The error size is:',rssError(abY[100:199], yHat10.T))
print('')
print('The training set is different from the test set:Simple linear regression and k=1 Comparison of locally weighted linear regression:')
print('k=1 Time,The error size is:', rssError(abY[100:199], yHat1.T))
ws = standRegres(abX[0:99], abY[0:99])
yHat = np.mat(abX[100:199]) * ws
print('Simple linear regression error size:', rssError(abY[100:199], yHat.T.A))
``` It can be seen that when k=0.1, the error of the training set is small, but when applied to the new data set, the error becomes larger. This is often called over fitting phenomenon. For the model we train, we should ensure the high accuracy of the test set, so that the trained model can be applied to new data, that is, we should strengthen the universality of the model. It can be seen that when k=1, the effect of locally weighted linear regression is similar to that of simple linear regression. This also shows that the best model can only be selected by comparing the effects on unknown data. So the best nuclear size is 10? Maybe, but if you want to get better results, you should do 10 tests with 10 different sample sets to compare the results.

This example shows how to use locally weighted linear regression to build a model, which can get better results than ordinary linear regression. The problem with locally weighted linear regression is that it must be run on the whole data set each time. In other words, in order to make a prediction, all training data must be saved.

### 4, Reduction factor to "understand" data

#### 1. Ridge regression

```from matplotlib.font_manager import FontProperties
import matplotlib.pyplot as plt
import numpy as np

"""
Parameters:
fileName - file name
Returns:
xArr - x data set
yArr - y data set
"""
xArr = []; yArr = []
fr = open(fileName)
lineArr =[]
curLine = line.strip().split('\t')
for i in range(numFeat):
lineArr.append(float(curLine[i]))
xArr.append(lineArr)
yArr.append(float(curLine[-1]))
return xArr, yArr

def ridgeRegres(xMat, yMat, lam = 0.2):
"""
Function description:Ridge regression
Parameters:
xMat - x data set
yMat - y data set
lam - Reduction factor
Returns:
ws - regression coefficient
"""
xTx = xMat.T * xMat
denom = xTx + np.eye(np.shape(xMat)) * lam
if np.linalg.det(denom) == 0.0:
print("The matrix is singular,Cannot transpose")
return
ws = denom.I * (xMat.T * yMat)
return ws

def ridgeTest(xArr, yArr):
"""
Function description:Ridge regression test
Parameters:
xMat - x data set
yMat - y data set
Returns:
wMat - Regression coefficient matrix
"""
xMat = np.mat(xArr); yMat = np.mat(yArr).T
#Data standardization
yMean = np.mean(yMat, axis = 0)                        #Row to row operation, find the mean
yMat = yMat - yMean                                    #Data minus mean
xMeans = np.mean(xMat, axis = 0)                    #Row to row operation, find the mean
xVar = np.var(xMat, axis = 0)                        #Row to row operation, variance
xMat = (xMat - xMeans) / xVar                        #Data is normalized by subtracting mean divided by variance
numTestPts = 30                                        #30 different lambda tests
wMat = np.zeros((numTestPts, np.shape(xMat)))    #Initial regression coefficient matrix
for i in range(numTestPts):                            #Changing lambda to calculate regression coefficient
ws = ridgeRegres(xMat, yMat, np.exp(i - 10))    #lambda changes exponentially with e, initially a very small number,
wMat[i, :] = ws.T                                 #Calculate regression coefficient matrix
return wMat

def plotwMat():
"""
Function description:Plot ridge regression coefficient matrix
"""
font = FontProperties(fname=r"c:\windows\fonts\simsun.ttc", size=14)
redgeWeights = ridgeTest(abX, abY)
fig = plt.figure()
ax.plot(redgeWeights)
ax_title_text = ax.set_title(u'log(lambada)Relationship with regression coefficient', FontProperties = font)
ax_xlabel_text = ax.set_xlabel(u'log(lambada)', FontProperties = font)
ax_ylabel_text = ax.set_ylabel(u'regression coefficient', FontProperties = font)
plt.setp(ax_title_text, size = 20, weight = 'bold', color = 'red')
plt.setp(ax_xlabel_text, size = 10, weight = 'bold', color = 'black')
plt.setp(ax_ylabel_text, size = 10, weight = 'bold', color = 'black')
plt.show()

if __name__ == '__main__':
plotwMat()
``` The regression coefficients and log are plotted( λ) On the far left, i.e λ At the minimum, the original values of all coefficients can be obtained (consistent with linear regression); on the right, the coefficients are reduced to 0; at a certain position in the middle, the best prediction result will be obtained. Want to get the best prediction result λ Parameters can be obtained by cross validation, which will be explained later in the article.

#### 2,lasso

It is not difficult to prove that when the following constraints are added, the ordinary least squares regression will obtain the same formula as the ridge regression: The above formula defines that the sum of squares of all regression coefficients cannot be greater than λ. When two or more features are correlated, a large positive coefficient and a large negative number may be obtained by ordinary least squares regression. Because of the above constraints, ridge regression can avoid this problem.
Similar to ridge regression, another reduction method lasso also limits the regression coefficient, and the corresponding constraints are as follows: The only difference is that this constraint replaces the sum of squares with absolute values. Although the form of constraints changes only slightly, the results are quite different: in λ When it is small enough, some coefficients will be forced to reduce to 0. This feature can help us better understand the data. The two constraints look similar in terms of formula, but slight changes greatly increase the computational complexity (in order to solve the regression coefficient under this new constraint, the quadratic programming algorithm needs to be used). The following will introduce a simpler method to obtain the results, which is called forward stepwise regression.

#### 3. Forward stepwise regression

The forward stepwise linear regression algorithm belongs to a greedy algorithm, that is, the error is reduced as much as possible at each step. We calculate the regression coefficient no longer by formula, but by fine tuning each regression coefficient every time, and then calculating the prediction error. The set of regression coefficients that minimize the error is the best regression coefficient we need.

The implementation of forward stepwise linear regression is also very simple. Of course, the data should be standardized first. The code is as follows:

```from matplotlib.font_manager import FontProperties
import matplotlib.pyplot as plt
import numpy as np

"""
Parameters:
fileName - file name
Returns:
xArr - x data set
yArr - y data set
"""
xArr = []; yArr = []
fr = open(fileName)
lineArr =[]
curLine = line.strip().split('\t')
for i in range(numFeat):
lineArr.append(float(curLine[i]))
xArr.append(lineArr)
yArr.append(float(curLine[-1]))
return xArr, yArr

def regularize(xMat, yMat):
"""
Function description:Data standardization
Parameters:
xMat - x data set
yMat - y data set
Returns:
inxMat - Standardized x data set
inyMat - Standardized y data set
"""
inxMat = xMat.copy()                                                        #Data copy
inyMat = yMat.copy()
yMean = np.mean(yMat, 0)                                                    #Row to row operation, find the mean
inyMat = yMat - yMean                                                        #Data minus mean
inMeans = np.mean(inxMat, 0)                                                   #Row to row operation, find the mean
inVar = np.var(inxMat, 0)                                                     #Row to row operation, variance
inxMat = (inxMat - inMeans) / inVar                                            #Data is normalized by subtracting mean divided by variance
return inxMat, inyMat

"""
Function description:Calculate square error
Parameters:
yArr - Estimate
yHatArr - True value
Returns:
"""
return ((yArr-yHatArr)**2).sum()

def stageWise(xArr, yArr, eps = 0.01, numIt = 100):
"""
Function description:Forward stepwise linear regression
Parameters:
xArr - x input data
yArr - y Forecast data
eps - Step size to be adjusted for each iteration
numIt - Number of iterations
Returns:
returnMat - numIt Regression coefficient matrix of second iteration
"""
xMat = np.mat(xArr); yMat = np.mat(yArr).T                                         #data set
xMat, yMat = regularize(xMat, yMat)                                                #Data standardization
m, n = np.shape(xMat)
returnMat = np.zeros((numIt, n))                                                #Initialize the regression coefficient matrix of numIt iterations
ws = np.zeros((n, 1))                                                            #Initialize regression coefficient matrix
wsTest = ws.copy()
wsMax = ws.copy()
for i in range(numIt):                                                            #Iteration numIt times
# print(ws.T)                                                                    #Print current regression coefficient matrix
lowestError = float('inf');                                                 #Positive infinity
for j in range(n):                                                            #Traverse the regression coefficients of each feature
wsTest = ws.copy()
wsTest[j] += eps * sign                                                #Fine tuning regression coefficient
yTest = xMat * wsTest                                                #Calculate predicted value
if rssE < lowestError:                                                #If the error is smaller, the current optimal regression coefficient is updated
wsMax = wsTest
ws = wsMax.copy()
returnMat[i,:] = ws.T                                                         #Record the regression coefficient matrix of numIt iterations
return returnMat

def plotstageWiseMat():
"""
Function description:Plot ridge regression coefficient matrix
"""
font = FontProperties(fname=r"c:\windows\fonts\simsun.ttc", size=14)
returnMat = stageWise(xArr, yArr, 0.005, 1000)
fig = plt.figure()
ax.plot(returnMat)
ax_title_text = ax.set_title(u'Forward stepwise regression:Relationship between iteration number and regression coefficient', FontProperties = font)
ax_xlabel_text = ax.set_xlabel(u'Number of iterations', FontProperties = font)
ax_ylabel_text = ax.set_ylabel(u'regression coefficient', FontProperties = font)
plt.setp(ax_title_text, size = 15, weight = 'bold', color = 'red')
plt.setp(ax_xlabel_text, size = 10, weight = 'bold', color = 'black')
plt.setp(ax_ylabel_text, size = 10, weight = 'bold', color = 'black')
plt.show()

if __name__ == '__main__':
plotstageWiseMat()
``` Still, we printed the relationship curve between the number of iterations and the regression coefficient. It can be seen that some coefficients are about 0 from beginning to end, which shows that they do not have any impact on the target, that is, these characteristics are probably unnecessary. The advantage of the stepwise linear regression algorithm is that it can help people understand some models and make improvements. After building a model , the algorithm can be run to find out the important features, so it is possible to stop the collection of those unimportant features in time.

To sum up:

The reduction method (stepwise linear regression or ridge regression) is to reduce some coefficients to very small values or directly to 0. This will increase the deviation of the model (reduce the weight of some features) , by reducing the regression coefficient of some features to 0, the complexity of the model is reduced. After eliminating the redundant features, the model is easier to understand and the prediction error is reduced. However, when the reduction is too strict, there will be over fitting, that is, the prediction result with the training set is very good, and the prediction with the test set is much worse.

### 5, Example: forecasting the price of LEGO toy sets

LEGO produces assembled toys, which are composed of many plastic inserts of different sizes. Its appearance is shown in the figure below: Generally speaking, these inserts are sold in sets. They can be assembled into many different things, such as ships, castles, some famous buildings, etc. the number of parts contained in each set of LEGO company ranges from 10 to 5000.

A Lego kit will basically be discontinued in a few years, but LEGO collectors will still trade with each other after shutdown. This example is to use the regression method to predict the transaction price between collectors.

#### 1. Get data

We get the information we need by parsing the html file:

```from bs4 import BeautifulSoup

def scrapePage(retX, retY, inFile, yr, numPce, origPrc):
"""
Function description:Read data from the page and generate retX and retY list
Parameters:
retX - data X
retY - data Y
inFile - HTML file
yr - particular year
numPce - Number of LEGO parts
origPrc - original price
Returns:
nothing
"""
# Open and read HTML file
with open(inFile, encoding='utf-8') as f:
soup = BeautifulSoup(html)
i = 1
# Parse according to HTML page structure
currentRow = soup.find_all('table', r = "%d" % i)
while(len(currentRow) != 0):
currentRow = soup.find_all('table', r = "%d" % i)
title = currentRow.find_all('a').text
lwrTitle = title.lower()
# Find out if there are brand new labels
if (lwrTitle.find('new') > -1) or (lwrTitle.find('nisb') > -1):
newFlag = 1.0
else:
newFlag = 0.0
# To find out whether the logo has been sold, we only collect the data that has been sold
soldUnicde = currentRow.find_all('td').find_all('span')
if len(soldUnicde) == 0:
print("commodity #%d not sold "% i)
else:
# Analyze the page to get the current price
soldPrice = currentRow.find_all('td')
priceStr = soldPrice.text
priceStr = priceStr.replace('\$','')
priceStr = priceStr.replace(',','')
if len(soldPrice) > 1:
priceStr = priceStr.replace('Free shipping', '')
sellingPrice = float(priceStr)
# Remove incomplete package price
if  sellingPrice > origPrc * 0.5:
print("%d\t%d\t%d\t%f\t%f" % (yr, numPce, newFlag, origPrc, sellingPrice))
retX.append([yr, numPce, newFlag, origPrc])
retY.append(sellingPrice)
i += 1
currentRow = soup.find_all('table', r = "%d" % i)

def setDataCollect(retX, retY):
"""
Function description:Read the data of six Lego sets in turn and generate the data matrix
Parameters:
nothing
Returns:
nothing
"""
scrapePage(retX, retY, './lego/lego8288.html', 2006, 800, 49.99)                #LEGO 8288 in 2006, number of parts 800, original price 49.99
scrapePage(retX, retY, './lego/lego10030.html', 2002, 3096, 269.99)                #LEGO 10030 in 2002, number of parts 3096, original price 269.99
scrapePage(retX, retY, './lego/lego10179.html', 2007, 5195, 499.99)                #LEGO 10179 in 2007, number of parts 5195, original price 499.99
scrapePage(retX, retY, './lego/lego10181.html', 2007, 3428, 199.99)                #LEGO 10181 in 2007, number of parts 3428, original price 199.99
scrapePage(retX, retY, './lego/lego10189.html', 2008, 5922, 299.99)                #LEGO 10189 in 2008, number of parts 5922, original price 299.99
scrapePage(retX, retY, './lego/lego10196.html', 2009, 3263, 249.99)                #LEGO 10196 in 2009, 3263 parts, original price 249.99

if __name__ == '__main__':
lgX = []
lgY = []
setDataCollect(lgX, lgY)
``` We deal with goods that are not available. These characteristics are: year of production, number of parts, whether it is brand new, original price and selling price (second-hand transaction).

#### 2. Model building

After we have processed the data set, the next step is to train the model. First, we need to add the feature X0 column with all 0. Because the first column of linear regression requires 1.0. Then, use the simplest ordinary linear regression i, and write the code as follows:

```import numpy as np
from bs4 import BeautifulSoup

def scrapePage(retX, retY, inFile, yr, numPce, origPrc):
"""
Function description:Read data from the page and generate retX and retY list
Parameters:
retX - data X
retY - data Y
inFile - HTML file
yr - particular year
numPce - Number of LEGO parts
origPrc - original price
Returns:
nothing
"""
# Open and read HTML file
with open(inFile, encoding='utf-8') as f:
soup = BeautifulSoup(html)
i = 1
# Parse according to HTML page structure
currentRow = soup.find_all('table', r = "%d" % i)
while(len(currentRow) != 0):
currentRow = soup.find_all('table', r = "%d" % i)
title = currentRow.find_all('a').text
lwrTitle = title.lower()
# Find out if there are brand new labels
if (lwrTitle.find('new') > -1) or (lwrTitle.find('nisb') > -1):
newFlag = 1.0
else:
newFlag = 0.0
# To find out whether the logo has been sold, we only collect the data that has been sold
soldUnicde = currentRow.find_all('td').find_all('span')
if len(soldUnicde) == 0:
print("commodity #%d not sold "% i)
else:
# Analyze the page to get the current price
soldPrice = currentRow.find_all('td')
priceStr = soldPrice.text
priceStr = priceStr.replace('\$','')
priceStr = priceStr.replace(',','')
if len(soldPrice) > 1:
priceStr = priceStr.replace('Free shipping', '')
sellingPrice = float(priceStr)
# Remove incomplete package price
if  sellingPrice > origPrc * 0.5:
print("%d\t%d\t%d\t%f\t%f" % (yr, numPce, newFlag, origPrc, sellingPrice))
retX.append([yr, numPce, newFlag, origPrc])
retY.append(sellingPrice)
i += 1
currentRow = soup.find_all('table', r = "%d" % i)

def setDataCollect(retX, retY):
"""
Function description:Read the data of six Lego sets in turn and generate the data matrix
Parameters:
nothing
Returns:
nothing
"""
scrapePage(retX, retY, './lego/lego8288.html', 2006, 800, 49.99)                #LEGO 8288 in 2006, number of parts 800, original price 49.99
scrapePage(retX, retY, './lego/lego10030.html', 2002, 3096, 269.99)                #LEGO 10030 in 2002, number of parts 3096, original price 269.99
scrapePage(retX, retY, './lego/lego10179.html', 2007, 5195, 499.99)                #LEGO 10179 in 2007, number of parts 5195, original price 499.99
scrapePage(retX, retY, './lego/lego10181.html', 2007, 3428, 199.99)                #LEGO 10181 in 2007, number of parts 3428, original price 199.99
scrapePage(retX, retY, './lego/lego10189.html', 2008, 5922, 299.99)                #LEGO 10189 in 2008, number of parts 5922, original price 299.99
scrapePage(retX, retY, './lego/lego10196.html', 2009, 3263, 249.99)                #LEGO 10196 in 2009, 3263 parts, original price 249.99

def regularize(xMat, yMat):
"""
Function description:Data standardization
Parameters:
xMat - x data set
yMat - y data set
Returns:
inxMat - Standardized x data set
inyMat - Standardized y data set
"""
inxMat = xMat.copy()                                                        #Data copy
inyMat = yMat.copy()
yMean = np.mean(yMat, 0)                                                    #Row to row operation, find the mean
inyMat = yMat - yMean                                                        #Data minus mean
inMeans = np.mean(inxMat, 0)                                                   #Row to row operation, find the mean
inVar = np.var(inxMat, 0)                                                     #Row to row operation, variance
# print(inxMat)
print(inMeans)
# print(inVar)
inxMat = (inxMat - inMeans) / inVar                                            #Data is normalized by subtracting mean divided by variance
return inxMat, inyMat

"""
Function description:Calculate square error
Parameters:
yArr - Estimate
yHatArr - True value
Returns:
"""
return ((yArr-yHatArr)**2).sum()

def standRegres(xArr,yArr):
"""
Function description:Calculate regression coefficient w
Parameters:
xArr - x data set
yArr - y data set
Returns:
ws - regression coefficient
"""
xMat = np.mat(xArr); yMat = np.mat(yArr).T
xTx = xMat.T * xMat                            #The regression coefficient is calculated according to the publicity derived in this paper
if np.linalg.det(xTx) == 0.0:
print("The matrix is singular,Cannot transpose")
return
ws = xTx.I * (xMat.T*yMat)
return ws

def useStandRegres():
"""
Function description:Using simple linear regression
Parameters:
nothing
Returns:
nothing
"""
lgX = []
lgY = []
setDataCollect(lgX, lgY)
data_num, features_num = np.shape(lgX)
lgX1 = np.mat(np.ones((data_num, features_num + 1)))
lgX1[:, 1:5] = np.mat(lgX)
ws = standRegres(lgX1, lgY)
print('%f%+f*particular year%+f*Number of parts%+f*Is it brand new%+f*original price' % (ws,ws,ws,ws,ws))

if __name__ == '__main__':
useStandRegres()
``` It can be seen that the formula used in the model is shown in the figure above. Although this model fits the data well, it doesn't seem to make any sense. The more parts in the kit, the lower the price, which is unreasonable.

We use ridge regression to find the method to minimize the error through cross validation λ Corresponding regression coefficient. The code is as follows:

```import numpy as np
from bs4 import BeautifulSoup
import random

def scrapePage(retX, retY, inFile, yr, numPce, origPrc):
"""
Function description:Read data from the page and generate retX and retY list
Parameters:
retX - data X
retY - data Y
inFile - HTML file
yr - particular year
numPce - Number of LEGO parts
origPrc - original price
Returns:
nothing
"""
# Open and read HTML file
with open(inFile, encoding='utf-8') as f:
soup = BeautifulSoup(html)
i = 1
# Parse according to HTML page structure
currentRow = soup.find_all('table', r = "%d" % i)
while(len(currentRow) != 0):
currentRow = soup.find_all('table', r = "%d" % i)
title = currentRow.find_all('a').text
lwrTitle = title.lower()
# Find out if there are brand new labels
if (lwrTitle.find('new') > -1) or (lwrTitle.find('nisb') > -1):
newFlag = 1.0
else:
newFlag = 0.0
# To find out whether the logo has been sold, we only collect the data that has been sold
soldUnicde = currentRow.find_all('td').find_all('span')
if len(soldUnicde) == 0:
print("commodity #%d not sold "% i)
else:
# Analyze the page to get the current price
soldPrice = currentRow.find_all('td')
priceStr = soldPrice.text
priceStr = priceStr.replace('\$','')
priceStr = priceStr.replace(',','')
if len(soldPrice) > 1:
priceStr = priceStr.replace('Free shipping', '')
sellingPrice = float(priceStr)
# Remove incomplete package price
if  sellingPrice > origPrc * 0.5:
print("%d\t%d\t%d\t%f\t%f" % (yr, numPce, newFlag, origPrc, sellingPrice))
retX.append([yr, numPce, newFlag, origPrc])
retY.append(sellingPrice)
i += 1
currentRow = soup.find_all('table', r = "%d" % i)

def ridgeRegres(xMat, yMat, lam = 0.2):
"""
Function description:Ridge regression
Parameters:
xMat - x data set
yMat - y data set
lam - Reduction factor
Returns:
ws - regression coefficient
"""
xTx = xMat.T * xMat
denom = xTx + np.eye(np.shape(xMat)) * lam
if np.linalg.det(denom) == 0.0:
print("The matrix is singular,Cannot transpose")
return
ws = denom.I * (xMat.T * yMat)
return ws

def setDataCollect(retX, retY):
"""
Function description:Read the data of six Lego sets in turn and generate the data matrix
Parameters:
nothing
Returns:
nothing
"""
scrapePage(retX, retY, './lego/lego8288.html', 2006, 800, 49.99)                #LEGO 8288 in 2006, number of parts 800, original price 49.99
scrapePage(retX, retY, './lego/lego10030.html', 2002, 3096, 269.99)                #LEGO 10030 in 2002, number of parts 3096, original price 269.99
scrapePage(retX, retY, './lego/lego10179.html', 2007, 5195, 499.99)                #LEGO 10179 in 2007, number of parts 5195, original price 499.99
scrapePage(retX, retY, './lego/lego10181.html', 2007, 3428, 199.99)                #LEGO 10181 in 2007, number of parts 3428, original price 199.99
scrapePage(retX, retY, './lego/lego10189.html', 2008, 5922, 299.99)                #LEGO 10189 in 2008, number of parts 5922, original price 299.99
scrapePage(retX, retY, './lego/lego10196.html', 2009, 3263, 249.99)                #LEGO 10196 in 2009, 3263 parts, original price 249.99

def regularize(xMat, yMat):
"""
Function description:Data standardization
Parameters:
xMat - x data set
yMat - y data set
Returns:
inxMat - Standardized x data set
inyMat - Standardized y data set
"""
inxMat = xMat.copy()                                                        #Data copy
inyMat = yMat.copy()
yMean = np.mean(yMat, 0)                                                    #Row to row operation, find the mean
inyMat = yMat - yMean                                                        #Data minus mean
inMeans = np.mean(inxMat, 0)                                                   #Row to row operation, find the mean
inVar = np.var(inxMat, 0)                                                     #Row to row operation, variance
# print(inxMat)
print(inMeans)
# print(inVar)
inxMat = (inxMat - inMeans) / inVar                                            #Data is normalized by subtracting mean divided by variance
return inxMat, inyMat

"""
Function description:Calculate square error
Parameters:
yArr - Estimate
yHatArr - True value
Returns:
"""
return ((yArr-yHatArr)**2).sum()

def standRegres(xArr,yArr):
"""
Function description:Calculate regression coefficient w
Parameters:
xArr - x data set
yArr - y data set
Returns:
ws - regression coefficient
"""
xMat = np.mat(xArr); yMat = np.mat(yArr).T
xTx = xMat.T * xMat                            #The regression coefficient is calculated according to the publicity derived in this paper
if np.linalg.det(xTx) == 0.0:
print("The matrix is singular,Cannot transpose")
return
ws = xTx.I * (xMat.T*yMat)
return ws

def crossValidation(xArr, yArr, numVal = 10):
"""
Function description:Cross validation ridge regression
Parameters:
xArr - x data set
yArr - y data set
numVal - Number of cross validation
Returns:
wMat - Regression coefficient matrix
"""
m = len(yArr)                                                                        #Number of statistical samples
indexList = list(range(m))                                                            #Generate index value list
errorMat = np.zeros((numVal,30))                                                    #create error mat 30columns numVal rows
for i in range(numVal):                                                                #Cross validation numVal times
trainX = []; trainY = []                                                        #Training set
testX = []; testY = []                                                            #Test set
random.shuffle(indexList)                                                        #Disorganize
for j in range(m):                                                                #Divided data set: 90% training set and 10% test set
if j < m * 0.9:
trainX.append(xArr[indexList[j]])
trainY.append(yArr[indexList[j]])
else:
testX.append(xArr[indexList[j]])
testY.append(yArr[indexList[j]])
wMat = ridgeTest(trainX, trainY)                                                #The ridge regression coefficients under 30 different lambda were obtained
for k in range(30):                                                                #Traverse all ridge regression coefficients
matTestX = np.mat(testX); matTrainX = np.mat(trainX)                        #Test set
meanTrain = np.mean(matTrainX,0)                                            #Test set mean
varTrain = np.var(matTrainX,0)                                                #Test set variance
matTestX = (matTestX - meanTrain) / varTrain                                 #Test set standardization
yEst = matTestX * np.mat(wMat[k,:]).T + np.mean(trainY)                        #Predict y value according to ws
errorMat[i, k] = rssError(yEst.T.A, np.array(testY))                            #Statistical error
meanErrors = np.mean(errorMat,0)                                                    #Calculate the average error of each cross validation
minMean = float(min(meanErrors))                                                    #Find the minimum error
bestWeights = wMat[np.nonzero(meanErrors == minMean)]                                #Find the best regression coefficient
xMat = np.mat(xArr); yMat = np.mat(yArr).T
meanX = np.mean(xMat,0); varX = np.var(xMat,0)
unReg = bestWeights / varX                                                            #Data is standardized and therefore needs to be restored
print('%f%+f*particular year%+f*Number of parts%+f*Is it brand new%+f*original price' % ((-1 * np.sum(np.multiply(meanX,unReg)) + np.mean(yMat)), unReg[0,0], unReg[0,1], unReg[0,2], unReg[0,3]))

def ridgeTest(xArr, yArr):
"""
Function description:Ridge regression test
Parameters:
xMat - x data set
yMat - y data set
Returns:
wMat - Regression coefficient matrix
"""
xMat = np.mat(xArr); yMat = np.mat(yArr).T
#Data standardization
yMean = np.mean(yMat, axis = 0)                        #Row to row operation, find the mean
yMat = yMat - yMean                                    #Data minus mean
xMeans = np.mean(xMat, axis = 0)                    #Row to row operation, find the mean
xVar = np.var(xMat, axis = 0)                        #Row to row operation, variance
xMat = (xMat - xMeans) / xVar                        #Data is normalized by subtracting mean divided by variance
numTestPts = 30                                        #30 different lambda tests
wMat = np.zeros((numTestPts, np.shape(xMat)))    #Initial regression coefficient matrix
for i in range(numTestPts):                            #Changing lambda to calculate regression coefficient
ws = ridgeRegres(xMat, yMat, np.exp(i - 10))    #lambda changes exponentially with e, initially a very small number,
wMat[i, :] = ws.T                                 #Calculate regression coefficient matrix
return wMat

if __name__ == '__main__':
lgX = []
lgY = []
setDataCollect(lgX, lgY)
crossValidation(lgX, lgY)
``` The samples are randomly selected here. Because of their randomness, the results of each run may be slightly different. However, as shown in the figure above, it is not very different from the conventional least squares method, that is, ordinary linear regression. We expected to find a more understandable model, which obviously did not achieve the expected effect.

Now let's look at how the regression coefficient changes during the reduction process. The code is as follows:

```import numpy as np
from bs4 import BeautifulSoup
import random

def scrapePage(retX, retY, inFile, yr, numPce, origPrc):
"""
Function description:Read data from the page and generate retX and retY list
Parameters:
retX - data X
retY - data Y
inFile - HTML file
yr - particular year
numPce - Number of LEGO parts
origPrc - original price
Returns:
nothing
"""
# Open and read HTML file
with open(inFile, encoding='utf-8') as f:
soup = BeautifulSoup(html)
i = 1
# Parse according to HTML page structure
currentRow = soup.find_all('table', r = "%d" % i)
while(len(currentRow) != 0):
currentRow = soup.find_all('table', r = "%d" % i)
title = currentRow.find_all('a').text
lwrTitle = title.lower()
# Find out if there are brand new labels
if (lwrTitle.find('new') > -1) or (lwrTitle.find('nisb') > -1):
newFlag = 1.0
else:
newFlag = 0.0
# To find out whether the logo has been sold, we only collect the data that has been sold
soldUnicde = currentRow.find_all('td').find_all('span')
if len(soldUnicde) == 0:
print("commodity #%d not sold "% i)
else:
# Analyze the page to get the current price
soldPrice = currentRow.find_all('td')
priceStr = soldPrice.text
priceStr = priceStr.replace('\$','')
priceStr = priceStr.replace(',','')
if len(soldPrice) > 1:
priceStr = priceStr.replace('Free shipping', '')
sellingPrice = float(priceStr)
# Remove incomplete package price
if  sellingPrice > origPrc * 0.5:
print("%d\t%d\t%d\t%f\t%f" % (yr, numPce, newFlag, origPrc, sellingPrice))
retX.append([yr, numPce, newFlag, origPrc])
retY.append(sellingPrice)
i += 1
currentRow = soup.find_all('table', r = "%d" % i)

def ridgeRegres(xMat, yMat, lam = 0.2):
"""
Function description:Ridge regression
Parameters:
xMat - x data set
yMat - y data set
lam - Reduction factor
Returns:
ws - regression coefficient
"""
xTx = xMat.T * xMat
denom = xTx + np.eye(np.shape(xMat)) * lam
if np.linalg.det(denom) == 0.0:
print("The matrix is singular,Cannot transpose")
return
ws = denom.I * (xMat.T * yMat)
return ws

def setDataCollect(retX, retY):
"""
Function description:Read the data of six Lego sets in turn and generate the data matrix
Parameters:
nothing
Returns:
nothing
"""
scrapePage(retX, retY, './lego/lego8288.html', 2006, 800, 49.99)                #LEGO 8288 in 2006, number of parts 800, original price 49.99
scrapePage(retX, retY, './lego/lego10030.html', 2002, 3096, 269.99)                #LEGO 10030 in 2002, number of parts 3096, original price 269.99
scrapePage(retX, retY, './lego/lego10179.html', 2007, 5195, 499.99)                #LEGO 10179 in 2007, number of parts 5195, original price 499.99
scrapePage(retX, retY, './lego/lego10181.html', 2007, 3428, 199.99)                #LEGO 10181 in 2007, number of parts 3428, original price 199.99
scrapePage(retX, retY, './lego/lego10189.html', 2008, 5922, 299.99)                #LEGO 10189 in 2008, number of parts 5922, original price 299.99
scrapePage(retX, retY, './lego/lego10196.html', 2009, 3263, 249.99)                #LEGO 10196 in 2009, 3263 parts, original price 249.99

def regularize(xMat, yMat):
"""
Function description:Data standardization
Parameters:
xMat - x data set
yMat - y data set
Returns:
inxMat - Standardized x data set
inyMat - Standardized y data set
"""
inxMat = xMat.copy()                                                        #Data copy
inyMat = yMat.copy()
yMean = np.mean(yMat, 0)                                                    #Row to row operation, find the mean
inyMat = yMat - yMean                                                        #Data minus mean
inMeans = np.mean(inxMat, 0)                                                   #Row to row operation, find the mean
inVar = np.var(inxMat, 0)                                                     #Row to row operation, variance
# print(inxMat)
print(inMeans)
# print(inVar)
inxMat = (inxMat - inMeans) / inVar                                            #Data is normalized by subtracting mean divided by variance
return inxMat, inyMat

"""
Function description:Calculate square error
Parameters:
yArr - Estimate
yHatArr - True value
Returns:
"""
return ((yArr-yHatArr)**2).sum()

def standRegres(xArr,yArr):
"""
Function description:Calculate regression coefficient w
Parameters:
xArr - x data set
yArr - y data set
Returns:
ws - regression coefficient
"""
xMat = np.mat(xArr); yMat = np.mat(yArr).T
xTx = xMat.T * xMat                            #The regression coefficient is calculated according to the publicity derived in this paper
if np.linalg.det(xTx) == 0.0:
print("The matrix is singular,Cannot transpose")
return
ws = xTx.I * (xMat.T*yMat)
return ws

def ridgeTest(xArr, yArr):
"""
Function description:Ridge regression test
Parameters:
xMat - x data set
yMat - y data set
Returns:
wMat - Regression coefficient matrix
"""
xMat = np.mat(xArr); yMat = np.mat(yArr).T
#Data standardization
yMean = np.mean(yMat, axis = 0)                        #Row to row operation, find the mean
yMat = yMat - yMean                                    #Data minus mean
xMeans = np.mean(xMat, axis = 0)                    #Row to row operation, find the mean
xVar = np.var(xMat, axis = 0)                        #Row to row operation, variance
xMat = (xMat - xMeans) / xVar                        #Data is normalized by subtracting mean divided by variance
numTestPts = 30                                        #30 different lambda tests
wMat = np.zeros((numTestPts, np.shape(xMat)))    #Initial regression coefficient matrix
for i in range(numTestPts):                            #Changing lambda to calculate regression coefficient
ws = ridgeRegres(xMat, yMat, np.exp(i - 10))    #lambda changes exponentially with e, initially a very small number,
wMat[i, :] = ws.T                                 #Calculate regression coefficient matrix
return wMat

if __name__ == '__main__':
lgX = []
lgY = []
setDataCollect(lgX, lgY)
print(ridgeTest(lgX, lgY))
``` Looking at the first line of the run result, you can see that the largest is item 4 and the second largest is item 2.

Therefore, if only one feature is selected for prediction, we should select the fourth feature. If 2 features can be selected, the 4th and 2nd features should be selected.

This analysis method enables us to mine the internal laws of a large number of data. When there are only four features, the effect of this method may not be obvious; But if there are more than 100 features, the method will become very effective: it can point out which features are key and which features are unimportant.

### 6, Summary

• Like classification, regression is also the process of predicting the target value. The difference between regression and classification is that the former predicts continuous variables, while the latter predicts discrete variables.
• Ridge regression is a kind of reduction method, which is equivalent to limiting the size of regression coefficient. Another good reduction method is lasso. Lasso is difficult to solve, but the approximate solution can be obtained by the simple stepwise linear regression method.
• The reduction method can also be regarded as a method to reduce the deviation of a model at the same time.

Tags: Machine Learning

Posted on Fri, 05 Nov 2021 21:30:26 -0400 by kael.shipman