What is regression?
The purpose of regression is to predict a numerical target value. The most direct way is to write out a formula that computes the target value from the inputs. If you wanted to predict the horsepower of your little sister's boyfriend's car, you might calculate it as follows:
HorsePower = 0.0015 * annualSalary - 0.99 * hoursListeningToPublicRadio
Written out in words, this is:
Horsepower of the boyfriend's car = 0.0015 * his annual salary - 0.99 * hours he spends listening to public radio
This is the so-called regression equation, in which 0.0015 and -0.99 are called regression weights (regression coefficients). The process of finding these regression coefficients is regression. Once we have the coefficients, making a prediction for a given input is easy: multiply each input value by its regression coefficient and add up all the results to get the predicted value.
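For instance, a minimal sketch of this dot-product prediction (the salary and radio-hours numbers here are made up purely for illustration):

import numpy as np

w = np.array([0.0015, -0.99])      # regression coefficients from the example above
x = np.array([120000.0, 3.5])      # hypothetical inputs: annual salary, hours of public radio
horsePower = np.dot(w, x)          # multiply each input by its coefficient and sum the results
print(horsePower)                  # 180.0 - 3.465 = 176.535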
When people talk about regression they usually mean linear regression, so in this article "regression" and "linear regression" mean the same thing. Linear regression means the output is obtained by multiplying each input by a constant and adding the results together. Note that there are also nonlinear regression models that do not follow this additive form; for example, the output might involve a product or ratio of the inputs. In that case the horsepower formula above could instead be written as:
HorsePower = 0.0015 * annualSalary / hoursListeningToPublicRadio
This is an example of nonlinear regression, which will not be discussed further in this article.
1, Finding the best-fit line with linear regression
How do we get the regression equation from a batch of data? Suppose the input data are stored in the matrix X, with one sample per row, and the target values are stored in the vector y. The regression coefficients are stored in the vector w. Then, for a given sample x_1 (the first row of X), the predicted value is given by:

$$\hat{y}_1 = x_1^T w$$
Now the question is: with the data matrix X and the corresponding label vector y in hand, how do we find w? A common approach is to find the w that minimizes the error, where the error is the difference between the predicted value and the true value y. Simply summing the errors would let positive and negative differences cancel each other out, so we use the squared error instead.

The sum of squared errors can be written as:

$$\sum_{i=1}^{m} \left(y_i - x_i^T w\right)^2$$

In matrix form this is:

$$(y - Xw)^T (y - Xw)$$

Why can it be rewritten this way? Remember the convention: a vector x is a column vector by default, so x^T is a row vector. Substitute the data matrix X and label vector y defined above and expand the product to see that the two expressions are equal.

Before continuing the derivation, let us state the goal clearly: find the w that minimizes the sum of squared errors, because we believe that the smaller the sum of squared errors, the better the linear regression fits the data.

Now, take the derivative of the matrix form of the squared error with respect to w:

$$\frac{\partial}{\partial w} (y - Xw)^T (y - Xw) = -2 X^T (y - Xw)$$
If you are not familiar with matrix calculus, see: https://blog.csdn.net/nomadlx53/article/details/50849941
Setting this derivative equal to zero and solving for w gives:

$$\hat{w} = \left(X^T X\right)^{-1} X^T y$$

The hat over w indicates that this is the best estimate of w that we can make from the data at hand. The w estimated from the available data may not equal the true underlying w, so the "hat" marks it as an estimate only.

It is worth noting that the formula contains a matrix inverse, so it can only be used when that inverse exists, that is, when the matrix X^T X is nonsingular (its determinant is not 0).
Finding the best w is a common problem in statistics, and there are many methods besides the matrix approach. With NumPy's matrix routines we can implement it in just a few lines of code. This method is also called OLS, which stands for "ordinary least squares".
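As a minimal sketch of this normal-equation solution in plain NumPy (the function name ols_fit is mine; the article's own loadDataSet-based version follows below):

import numpy as np

def ols_fit(X, y):
    """Ordinary least squares: solve w_hat = (X^T X)^(-1) X^T y."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    xTx = X.T @ X
    if np.linalg.det(xTx) == 0.0:          # the inverse does not exist for a singular matrix
        raise ValueError("X^T X is singular and cannot be inverted")
    return np.linalg.inv(xTx) @ X.T @ y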
Let's look at the dataset first:
The first column is the constant 1.0, i.e. x0; the second column is x1, the value plotted on the x-axis; the third column is the target value y, plotted on the y-axis. First, let's plot the data to see how it is distributed. The code is as follows:
import matplotlib.pyplot as plt
import numpy as np

def loadDataSet(fileName):
    """
    Function description: Load data
    Parameters:
        fileName - file name
    Returns:
        xArr - x data set
        yArr - y data set
    """
    numFeat = len(open(fileName).readline().split('\t')) - 1
    xArr = []; yArr = []
    fr = open(fileName)
    for line in fr.readlines():
        lineArr = []
        curLine = line.strip().split('\t')
        for i in range(numFeat):
            lineArr.append(float(curLine[i]))
        xArr.append(lineArr)
        yArr.append(float(curLine[-1]))
    return xArr, yArr

def plotDataSet():
    """
    Function description: Plot the dataset
    Parameters:
        none
    Returns:
        none
    """
    xArr, yArr = loadDataSet('ex0.txt')                     # Load dataset
    n = len(xArr)                                           # Number of data points
    xcord = []; ycord = []                                  # Sample points
    for i in range(n):
        xcord.append(xArr[i][1]); ycord.append(yArr[i])     # Collect sample points
    fig = plt.figure()
    ax = fig.add_subplot(111)                               # Add subplot
    ax.scatter(xcord, ycord, s=20, c='blue', alpha=.5)      # Draw sample points
    plt.title('DataSet')                                    # Draw title
    plt.xlabel('X')
    plt.show()

if __name__ == '__main__':
    plotDataSet()
By visualizing the data we can see its distribution. Next, let's compute the regression coefficient vector using the formula derived above, and draw the fitted regression line. The code is as follows:
import matplotlib.pyplot as plt
import numpy as np

def loadDataSet(fileName):
    """
    Function description: Load data
    Parameters:
        fileName - file name
    Returns:
        xArr - x data set
        yArr - y data set
    """
    numFeat = len(open(fileName).readline().split('\t')) - 1
    xArr = []; yArr = []
    fr = open(fileName)
    for line in fr.readlines():
        lineArr = []
        curLine = line.strip().split('\t')
        for i in range(numFeat):
            lineArr.append(float(curLine[i]))
        xArr.append(lineArr)
        yArr.append(float(curLine[-1]))
    return xArr, yArr

def standRegres(xArr, yArr):
    """
    Function description: Compute the regression coefficients w
    Parameters:
        xArr - x data set
        yArr - y data set
    Returns:
        ws - regression coefficients
    """
    xMat = np.mat(xArr); yMat = np.mat(yArr).T
    xTx = xMat.T * xMat                                     # Compute the coefficients with the formula derived above
    if np.linalg.det(xTx) == 0.0:
        print("The matrix is singular and cannot be inverted")
        return
    ws = xTx.I * (xMat.T * yMat)
    return ws

def plotRegression():
    """
    Function description: Plot the regression line and the data points
    Parameters:
        none
    Returns:
        none
    """
    xArr, yArr = loadDataSet('ex0.txt')                     # Load dataset
    ws = standRegres(xArr, yArr)                            # Compute regression coefficients
    xMat = np.mat(xArr)                                     # Create xMat matrix
    yMat = np.mat(yArr)                                     # Create yMat matrix
    xCopy = xMat.copy()                                     # Deep copy of xMat
    xCopy.sort(0)                                           # Sort
    yHat = xCopy * ws                                       # Compute the corresponding y values
    fig = plt.figure()
    ax = fig.add_subplot(111)                               # Add subplot
    ax.plot(xCopy[:, 1], yHat, c='red')                     # Draw regression line
    ax.scatter(xMat[:, 1].flatten().A[0], yMat.flatten().A[0], s=20, c='blue', alpha=.5)  # Draw sample points
    plt.title('DataSet')                                    # Draw title
    plt.xlabel('X')
    plt.show()

if __name__ == '__main__':
    plotRegression()
How do we judge how good the fit is? We can of course eyeball it, but we can also use NumPy's corrcoef function to compute the correlation between the predicted values and the true values. The code is as follows:
import numpy as np

def loadDataSet(fileName):
    """
    Function description: Load data
    Parameters:
        fileName - file name
    Returns:
        xArr - x data set
        yArr - y data set
    """
    numFeat = len(open(fileName).readline().split('\t')) - 1
    xArr = []; yArr = []
    fr = open(fileName)
    for line in fr.readlines():
        lineArr = []
        curLine = line.strip().split('\t')
        for i in range(numFeat):
            lineArr.append(float(curLine[i]))
        xArr.append(lineArr)
        yArr.append(float(curLine[-1]))
    return xArr, yArr

def standRegres(xArr, yArr):
    """
    Function description: Compute the regression coefficients w
    Parameters:
        xArr - x data set
        yArr - y data set
    Returns:
        ws - regression coefficients
    """
    xMat = np.mat(xArr); yMat = np.mat(yArr).T
    xTx = xMat.T * xMat                                     # Compute the coefficients with the formula derived above
    if np.linalg.det(xTx) == 0.0:
        print("The matrix is singular and cannot be inverted")
        return
    ws = xTx.I * (xMat.T * yMat)
    return ws

if __name__ == '__main__':
    xArr, yArr = loadDataSet('ex0.txt')                     # Load dataset
    ws = standRegres(xArr, yArr)                            # Compute regression coefficients
    xMat = np.mat(xArr)                                     # Create xMat matrix
    yMat = np.mat(yArr)                                     # Create yMat matrix
    yHat = xMat * ws
    print(np.corrcoef(yHat.T, yMat))
We can see that the diagonal entries are 1.0, since yMat correlates perfectly with itself, while the correlation coefficient between yHat and yMat is about 0.98.
The best-fit approach models the data as a straight line, and it already performs quite well here. But the data appear to contain other patterns that a straight line cannot capture. How can we make use of those patterns? We can adjust the prediction locally based on the data; this method is introduced below.
2, Locally weighted linear regression
One problem with linear regression is that it can underfit, because it computes the unbiased estimate with the smallest mean squared error. Obviously, an underfit model will not deliver good predictions. Some methods therefore allow a little bias to be introduced into the estimate, in exchange for a lower mean squared error of the predictions.
One such method is Locally Weighted Linear Regression (LWLR). In this method, each training point is given a weight that depends on how close it is to the point we want to predict. Like kNN, the algorithm has to select the relevant subset of the data anew for every prediction. The regression coefficients solved for by the algorithm take the form:

$$\hat{w} = \left(X^T W X\right)^{-1} X^T W y$$
Here W is a matrix, namely a diagonal weight matrix. The difference from the formula derived earlier is that W assigns a weight to each data point.
LWLR uses a "kernel" (similar to the kernels in support vector machines) to give higher weight to nearby points. The kernel type can be chosen freely; the most commonly used kernel is the Gaussian, whose weights are:

$$w(i, i) = \exp\left(\frac{-\left\lVert x^{(i)} - x \right\rVert^2}{2k^2}\right)$$
With the formulas above we can now implement locally weighted linear regression, and we can tune the fit by changing the value of k. The code is as follows:
from matplotlib.font_manager import FontProperties
import matplotlib.pyplot as plt
import numpy as np

def loadDataSet(fileName):
    """
    Function description: Load data
    Parameters:
        fileName - file name
    Returns:
        xArr - x data set
        yArr - y data set
    """
    numFeat = len(open(fileName).readline().split('\t')) - 1
    xArr = []; yArr = []
    fr = open(fileName)
    for line in fr.readlines():
        lineArr = []
        curLine = line.strip().split('\t')
        for i in range(numFeat):
            lineArr.append(float(curLine[i]))
        xArr.append(lineArr)
        yArr.append(float(curLine[-1]))
    return xArr, yArr

def plotlwlrRegression():
    """
    Function description: Plot several locally weighted regression curves
    Parameters:
        none
    Returns:
        none
    """
    font = FontProperties(fname=r"c:\windows\fonts\simsun.ttc", size=14)
    xArr, yArr = loadDataSet('ex0.txt')                     # Load dataset
    yHat_1 = lwlrTest(xArr, xArr, yArr, 1.0)                # yHat from locally weighted linear regression, k = 1.0
    yHat_2 = lwlrTest(xArr, xArr, yArr, 0.01)               # yHat from locally weighted linear regression, k = 0.01
    yHat_3 = lwlrTest(xArr, xArr, yArr, 0.003)              # yHat from locally weighted linear regression, k = 0.003
    xMat = np.mat(xArr)                                     # Create xMat matrix
    yMat = np.mat(yArr)                                     # Create yMat matrix
    srtInd = xMat[:, 1].argsort(0)                          # Sort, return index values
    xSort = xMat[srtInd][:, 0, :]
    fig, axs = plt.subplots(nrows=3, ncols=1, sharex=False, sharey=False, figsize=(10, 8))
    axs[0].plot(xSort[:, 1], yHat_1[srtInd], c='red')       # Draw regression curve
    axs[1].plot(xSort[:, 1], yHat_2[srtInd], c='red')       # Draw regression curve
    axs[2].plot(xSort[:, 1], yHat_3[srtInd], c='red')       # Draw regression curve
    axs[0].scatter(xMat[:, 1].flatten().A[0], yMat.flatten().A[0], s=20, c='blue', alpha=.5)  # Draw sample points
    axs[1].scatter(xMat[:, 1].flatten().A[0], yMat.flatten().A[0], s=20, c='blue', alpha=.5)  # Draw sample points
    axs[2].scatter(xMat[:, 1].flatten().A[0], yMat.flatten().A[0], s=20, c='blue', alpha=.5)  # Draw sample points
    # Set titles and x-axis label
    axs0_title_text = axs[0].set_title(u'Locally weighted regression curve, k=1.0', fontproperties=font)
    axs1_title_text = axs[1].set_title(u'Locally weighted regression curve, k=0.01', fontproperties=font)
    axs2_title_text = axs[2].set_title(u'Locally weighted regression curve, k=0.003', fontproperties=font)
    plt.setp(axs0_title_text, size=8, weight='bold', color='red')
    plt.setp(axs1_title_text, size=8, weight='bold', color='red')
    plt.setp(axs2_title_text, size=8, weight='bold', color='red')
    plt.xlabel('X')
    plt.show()

def lwlr(testPoint, xArr, yArr, k=1.0):
    """
    Function description: Use locally weighted linear regression to compute the coefficients and predict one point
    Parameters:
        testPoint - test sample point
        xArr - x data set
        yArr - y data set
        k - Gaussian kernel parameter k, user-defined
    Returns:
        testPoint * ws - prediction for the test point
    """
    xMat = np.mat(xArr); yMat = np.mat(yArr).T
    m = np.shape(xMat)[0]
    weights = np.mat(np.eye((m)))                           # Create diagonal weight matrix
    for j in range(m):                                      # Traverse the dataset and compute each sample's weight
        diffMat = testPoint - xMat[j, :]
        weights[j, j] = np.exp(diffMat * diffMat.T / (-2.0 * k ** 2))
    xTx = xMat.T * (weights * xMat)
    if np.linalg.det(xTx) == 0.0:
        print("The matrix is singular and cannot be inverted")
        return
    ws = xTx.I * (xMat.T * (weights * yMat))                # Compute regression coefficients
    return testPoint * ws

def lwlrTest(testArr, xArr, yArr, k=1.0):
    """
    Function description: Locally weighted linear regression test
    Parameters:
        testArr - test data set
        xArr - x data set
        yArr - y data set
        k - Gaussian kernel parameter k, user-defined
    Returns:
        yHat - predictions for the test data
    """
    m = np.shape(testArr)[0]                                # Size of the test dataset
    yHat = np.zeros(m)
    for i in range(m):                                      # Predict each sample point
        yHat[i] = lwlr(testArr[i], xArr, yArr, k)
    return yHat

if __name__ == '__main__':
    plotlwlrRegression()
We can see that the smaller k is, the more closely the curve follows the data. When k is too small, however, overfitting occurs, as in the case k = 0.003.
3, Example: predicting the age of abalone
Next, we apply regression to real data. The abalone.txt file, taken from the UCI dataset repository, records measurements of abalone (a kind of shellfish); the age of an abalone can be determined from the number of rings in its shell.
The data of the dataset is as follows:
As you can see, the dataset is multidimensional, so it is hard to plot its distribution, and the meaning of each feature column is not given. That doesn't matter here: we only need to know that the last column is the target value y, the true age of the abalone, and the preceding columns are features describing the abalone. Without any data cleaning, we use all the features directly to test our locally weighted regression.
Create a new .py file and add an rssError function to evaluate the final regression results. The code is as follows:
from matplotlib.font_manager import FontProperties import matplotlib.pyplot as plt import numpy as np def loadDataSet(fileName): """ Function description:Load data Parameters: fileName - file name Returns: xArr - x data set yArr - y data set """ numFeat = len(open(fileName).readline().split('\t')) - 1 xArr = []; yArr = [] fr = open(fileName) for line in fr.readlines(): lineArr =[] curLine = line.strip().split('\t') for i in range(numFeat): lineArr.append(float(curLine[i])) xArr.append(lineArr) yArr.append(float(curLine[-1])) return xArr, yArr def lwlr(testPoint, xArr, yArr, k = 1.0): """ Function description:Regression coefficients were calculated using locally weighted linear regression w Parameters: testPoint - Test sample point xArr - x data set yArr - y data set k - Gaussian kernel k,Custom parameters Returns: ws - regression coefficient """ xMat = np.mat(xArr); yMat = np.mat(yArr).T m = np.shape(xMat)[0] weights = np.mat(np.eye((m))) #Create weight diagonal matrix for j in range(m): #Traverse the data set to calculate the weight of each sample diffMat = testPoint - xMat[j, :] weights[j, j] = np.exp(diffMat * diffMat.T/(-2.0 * k**2)) xTx = xMat.T * (weights * xMat) if np.linalg.det(xTx) == 0.0: print("The matrix is singular,No inversion") return ws = xTx.I * (xMat.T * (weights * yMat)) #Calculate regression coefficient return testPoint * ws def lwlrTest(testArr, xArr, yArr, k=1.0): """ Function description:Locally weighted linear regression test Parameters: testArr - Test data set,Test set xArr - x data set,Training set yArr - y data set,Training set k - Gaussian kernel k,Custom parameters Returns: ws - regression coefficient """ m = np.shape(testArr)[0] #Calculate test dataset size yHat = np.zeros(m) for i in range(m): #Predict each sample point yHat[i] = lwlr(testArr[i],xArr,yArr,k) return yHat def standRegres(xArr,yArr): """ Function description:Calculate regression coefficient w Parameters: xArr - x data set yArr - y data set Returns: ws - regression coefficient """ xMat = np.mat(xArr); yMat = np.mat(yArr).T xTx = xMat.T * xMat #The regression coefficient is calculated according to the publicity derived in this paper if np.linalg.det(xTx) == 0.0: print("The matrix is singular,No inversion") return ws = xTx.I * (xMat.T*yMat) return ws def rssError(yArr, yHatArr): """ Error size evaluation function Parameters: yArr - Real data yHatArr - Forecast data Returns: Error size """ return ((yArr - yHatArr) **2).sum() if __name__ == '__main__': abX, abY = loadDataSet('abalone.txt') print('The training set is the same as the test set:Locally weighted linear regression,nucleus k Effect of the size of on the prediction:') yHat01 = lwlrTest(abX[0:99], abX[0:99], abY[0:99], 0.1) yHat1 = lwlrTest(abX[0:99], abX[0:99], abY[0:99], 1) yHat10 = lwlrTest(abX[0:99], abX[0:99], abY[0:99], 10) print('k=0.1 Time,The error size is:',rssError(abY[0:99], yHat01.T)) print('k=1 Time,The error size is:',rssError(abY[0:99], yHat1.T)) print('k=10 Time,The error size is:',rssError(abY[0:99], yHat10.T)) print('') print('The training set is different from the test set:Locally weighted linear regression,nucleus k Is the smaller the better? 
Replace dataset,The test results are as follows:') yHat01 = lwlrTest(abX[100:199], abX[0:99], abY[0:99], 0.1) yHat1 = lwlrTest(abX[100:199], abX[0:99], abY[0:99], 1) yHat10 = lwlrTest(abX[100:199], abX[0:99], abY[0:99], 10) print('k=0.1 Time,The error size is:',rssError(abY[100:199], yHat01.T)) print('k=1 Time,The error size is:',rssError(abY[100:199], yHat1.T)) print('k=10 Time,The error size is:',rssError(abY[100:199], yHat10.T)) print('') print('The training set is different from the test set:Simple linear regression and k=1 Comparison of locally weighted linear regression:') print('k=1 Time,The error size is:', rssError(abY[100:199], yHat1.T)) ws = standRegres(abX[0:99], abY[0:99]) yHat = np.mat(abX[100:199]) * ws print('Simple linear regression error size:', rssError(abY[100:199], yHat.T.A))
We can see that when k=0.1 the training-set error is small, but when the model is applied to new data the error becomes much larger. This is the familiar overfitting phenomenon. For the models we train, what matters is accuracy on the test set, so that the trained model generalizes to new data. We can also see that with k=1 locally weighted linear regression performs about as well as simple linear regression. This shows that the best model can only be chosen by comparing performance on unseen data. So is the best kernel size 10? Perhaps, but to draw a more reliable conclusion we should repeat the test on 10 different random samples of the data and compare the results.
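A minimal sketch of that kind of repeated comparison, assuming the loadDataSet, lwlrTest and rssError functions defined above are available (the function name compareKernels is mine):

import random
import numpy as np

def compareKernels(xArr, yArr, ks=(0.1, 1, 10), trials=10, trainFrac=0.9):
    """Average the test-set RSS of LWLR over several random train/test splits, for each kernel size."""
    m = len(yArr)
    errors = {k: [] for k in ks}
    for _ in range(trials):
        idx = list(range(m))
        random.shuffle(idx)                                    # a fresh random split each trial
        cut = int(m * trainFrac)
        trainIdx, testIdx = idx[:cut], idx[cut:]
        trainX = [xArr[i] for i in trainIdx]; trainY = [yArr[i] for i in trainIdx]
        testX = [xArr[i] for i in testIdx]; testY = [yArr[i] for i in testIdx]
        for k in ks:
            yHat = lwlrTest(testX, trainX, trainY, k)          # predict the held-out points
            errors[k].append(rssError(np.array(testY), yHat))  # squared error on the test split
    for k in ks:
        print('k=%s, average test error: %.2f' % (k, np.mean(errors[k])))

For example, compareKernels(abX, abY) after loading abalone.txt would print the averaged test error for each kernel size.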
This example shows how to build a model with locally weighted linear regression, which can give better results than ordinary linear regression. The drawback of locally weighted linear regression is that every prediction requires a pass over the entire dataset; in other words, all the training data must be kept available just to make a single prediction.
4, Shrinking coefficients to "understand" the data
1. Ridge regression
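In short, ridge regression adds a penalty term λ·I to XᵀX so that the matrix being inverted is always nonsingular, even when the features are correlated or there are more features than samples. The regression coefficients then become (this is the formula the ridgeRegres function below implements):

$$\hat{w} = \left(X^T X + \lambda I\right)^{-1} X^T y$$

Here λ is the shrinkage parameter: larger values penalize large coefficients, trading a little bias for lower variance.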
from matplotlib.font_manager import FontProperties import matplotlib.pyplot as plt import numpy as np def loadDataSet(fileName): """ Function description:Load data Parameters: fileName - file name Returns: xArr - x data set yArr - y data set """ numFeat = len(open(fileName).readline().split('\t')) - 1 xArr = []; yArr = [] fr = open(fileName) for line in fr.readlines(): lineArr =[] curLine = line.strip().split('\t') for i in range(numFeat): lineArr.append(float(curLine[i])) xArr.append(lineArr) yArr.append(float(curLine[-1])) return xArr, yArr def ridgeRegres(xMat, yMat, lam = 0.2): """ Function description:Ridge regression Parameters: xMat - x data set yMat - y data set lam - Reduction factor Returns: ws - regression coefficient """ xTx = xMat.T * xMat denom = xTx + np.eye(np.shape(xMat)[1]) * lam if np.linalg.det(denom) == 0.0: print("The matrix is singular,Cannot transpose") return ws = denom.I * (xMat.T * yMat) return ws def ridgeTest(xArr, yArr): """ Function description:Ridge regression test Parameters: xMat - x data set yMat - y data set Returns: wMat - Regression coefficient matrix """ xMat = np.mat(xArr); yMat = np.mat(yArr).T #Data standardization yMean = np.mean(yMat, axis = 0) #Row to row operation, find the mean yMat = yMat - yMean #Data minus mean xMeans = np.mean(xMat, axis = 0) #Row to row operation, find the mean xVar = np.var(xMat, axis = 0) #Row to row operation, variance xMat = (xMat - xMeans) / xVar #Data is normalized by subtracting mean divided by variance numTestPts = 30 #30 different lambda tests wMat = np.zeros((numTestPts, np.shape(xMat)[1])) #Initial regression coefficient matrix for i in range(numTestPts): #Changing lambda to calculate regression coefficient ws = ridgeRegres(xMat, yMat, np.exp(i - 10)) #lambda changes exponentially with e, initially a very small number, wMat[i, :] = ws.T #Calculate regression coefficient matrix return wMat def plotwMat(): """ Function description:Plot ridge regression coefficient matrix """ font = FontProperties(fname=r"c:\windows\fonts\simsun.ttc", size=14) abX, abY = loadDataSet('abalone.txt') redgeWeights = ridgeTest(abX, abY) fig = plt.figure() ax = fig.add_subplot(111) ax.plot(redgeWeights) ax_title_text = ax.set_title(u'log(lambada)Relationship with regression coefficient', FontProperties = font) ax_xlabel_text = ax.set_xlabel(u'log(lambada)', FontProperties = font) ax_ylabel_text = ax.set_ylabel(u'regression coefficient', FontProperties = font) plt.setp(ax_title_text, size = 20, weight = 'bold', color = 'red') plt.setp(ax_xlabel_text, size = 10, weight = 'bold', color = 'black') plt.setp(ax_ylabel_text, size = 10, weight = 'bold', color = 'black') plt.show() if __name__ == '__main__': plotwMat()
The plot shows the regression coefficients as a function of log(λ). At the far left, where λ is smallest, all the coefficients take their original values (the same as ordinary linear regression); at the far right, the coefficients have all been shrunk to 0; somewhere in between lies the value of λ that gives the best predictions. To find that best λ we can use cross-validation, which is explained later in this article.
2. Lasso
It is not difficult to prove that adding the following constraint to ordinary least squares regression yields the same solution as ridge regression:

$$\sum_{k=1}^{n} w_k^2 \le \lambda$$
The constraint says that the sum of squares of all the regression coefficients may not exceed λ. When two or more features are correlated, ordinary least squares may produce a very large positive coefficient on one feature and a very large negative coefficient on another; thanks to the constraint above, ridge regression avoids this problem.
Similar to ridge regression, another shrinkage method, the lasso, also constrains the regression coefficients, this time with:

$$\sum_{k=1}^{n} \left|w_k\right| \le \lambda$$
The only difference is that the constraint uses absolute values instead of squares. Although the form changes only slightly, the results are very different: when λ is small enough, some coefficients are forced all the way to 0, which helps us understand the data better. The two constraints look similar on paper, but the small change greatly increases the computational complexity (solving for the coefficients under the new constraint requires a quadratic programming algorithm). Below we introduce a simpler method that achieves comparable results, called forward stepwise regression.
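This article solves the problem with stepwise regression instead, but as a side note, ready-made lasso solvers exist. A minimal sketch using scikit-learn's coordinate-descent Lasso (not this article's code, and the toy data here are made up):

import numpy as np
from sklearn.linear_model import Lasso

rng = np.random.default_rng(0)
X = rng.random((100, 4))                                  # toy feature matrix: 100 samples, 4 features
y = X @ np.array([1.5, 0.0, -2.0, 0.0]) + 0.1 * rng.standard_normal(100)

model = Lasso(alpha=0.1)                                  # alpha plays the role of the shrinkage strength
model.fit(X, y)
print(model.coef_)                                        # some coefficients are driven exactly to 0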
3. Forward stepwise regression
Forward stepwise linear regression is a greedy algorithm: at every step it reduces the error as much as possible. Instead of solving for the regression coefficients with a formula, we repeatedly make a small adjustment to one coefficient at a time and recompute the prediction error; the set of coefficients that gives the smallest error is the best set of regression coefficients.
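Before the full listing below, here is a compact sketch of just that greedy loop (assuming xMat and yMat are already standardized NumPy matrices, as in the standardization step mentioned next; the function name stageWiseSketch is mine):

import numpy as np

def stageWiseSketch(xMat, yMat, eps=0.01, numIt=100):
    """Greedy forward stepwise regression: nudge one coefficient at a time by +/- eps."""
    m, n = np.shape(xMat)
    ws = np.zeros((n, 1))                                  # start with all coefficients at 0
    for _ in range(numIt):
        lowestError = float('inf')
        wsMax = ws.copy()
        for j in range(n):                                 # try adjusting each coefficient in turn
            for sign in (-1, 1):                           # try both increasing and decreasing it
                wsTest = ws.copy()
                wsTest[j] += eps * sign
                rssE = np.square(yMat.A - (xMat * wsTest).A).sum()   # squared error of this candidate
                if rssE < lowestError:                     # remember the single best adjustment
                    lowestError = rssE
                    wsMax = wsTest
        ws = wsMax.copy()                                  # commit the best adjustment for this iteration
    return ws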
The full implementation of forward stepwise linear regression is also simple; of course, the data should be standardized first. The code is as follows:
from matplotlib.font_manager import FontProperties import matplotlib.pyplot as plt import numpy as np def loadDataSet(fileName): """ Function description:Load data Parameters: fileName - file name Returns: xArr - x data set yArr - y data set """ numFeat = len(open(fileName).readline().split('\t')) - 1 xArr = []; yArr = [] fr = open(fileName) for line in fr.readlines(): lineArr =[] curLine = line.strip().split('\t') for i in range(numFeat): lineArr.append(float(curLine[i])) xArr.append(lineArr) yArr.append(float(curLine[-1])) return xArr, yArr def regularize(xMat, yMat): """ Function description:Data standardization Parameters: xMat - x data set yMat - y data set Returns: inxMat - Standardized x data set inyMat - Standardized y data set """ inxMat = xMat.copy() #Data copy inyMat = yMat.copy() yMean = np.mean(yMat, 0) #Row to row operation, find the mean inyMat = yMat - yMean #Data minus mean inMeans = np.mean(inxMat, 0) #Row to row operation, find the mean inVar = np.var(inxMat, 0) #Row to row operation, variance inxMat = (inxMat - inMeans) / inVar #Data is normalized by subtracting mean divided by variance return inxMat, inyMat def rssError(yArr,yHatArr): """ Function description:Calculate square error Parameters: yArr - Estimate yHatArr - True value Returns: """ return ((yArr-yHatArr)**2).sum() def stageWise(xArr, yArr, eps = 0.01, numIt = 100): """ Function description:Forward stepwise linear regression Parameters: xArr - x input data yArr - y Forecast data eps - Step size to be adjusted for each iteration numIt - Number of iterations Returns: returnMat - numIt Regression coefficient matrix of second iteration """ xMat = np.mat(xArr); yMat = np.mat(yArr).T #data set xMat, yMat = regularize(xMat, yMat) #Data standardization m, n = np.shape(xMat) returnMat = np.zeros((numIt, n)) #Initialize the regression coefficient matrix of numIt iterations ws = np.zeros((n, 1)) #Initialize regression coefficient matrix wsTest = ws.copy() wsMax = ws.copy() for i in range(numIt): #Iteration numIt times # print(ws.T) #Print current regression coefficient matrix lowestError = float('inf'); #Positive infinity for j in range(n): #Traverse the regression coefficients of each feature for sign in [-1, 1]: wsTest = ws.copy() wsTest[j] += eps * sign #Fine tuning regression coefficient yTest = xMat * wsTest #Calculate predicted value rssE = rssError(yMat.A, yTest.A) #Calculate square error if rssE < lowestError: #If the error is smaller, the current optimal regression coefficient is updated lowestError = rssE wsMax = wsTest ws = wsMax.copy() returnMat[i,:] = ws.T #Record the regression coefficient matrix of numIt iterations return returnMat def plotstageWiseMat(): """ Function description:Plot ridge regression coefficient matrix """ font = FontProperties(fname=r"c:\windows\fonts\simsun.ttc", size=14) xArr, yArr = loadDataSet('abalone.txt') returnMat = stageWise(xArr, yArr, 0.005, 1000) fig = plt.figure() ax = fig.add_subplot(111) ax.plot(returnMat) ax_title_text = ax.set_title(u'Forward stepwise regression:Relationship between iteration number and regression coefficient', FontProperties = font) ax_xlabel_text = ax.set_xlabel(u'Number of iterations', FontProperties = font) ax_ylabel_text = ax.set_ylabel(u'regression coefficient', FontProperties = font) plt.setp(ax_title_text, size = 15, weight = 'bold', color = 'red') plt.setp(ax_xlabel_text, size = 10, weight = 'bold', color = 'black') plt.setp(ax_ylabel_text, size = 10, weight = 'bold', color = 'black') plt.show() if __name__ == '__main__': plotstageWiseMat()
Again, we plot the regression coefficients against the number of iterations. We can see that some coefficients stay at roughly 0 from start to finish, which means they have essentially no effect on the target; those features are probably unnecessary. The advantage of stepwise linear regression is that it helps people understand the model and improve it: once a model has been built, the algorithm can be run to find the important features, so collection of the unimportant ones can be stopped in time.
To sum up:
Shrinkage methods (stepwise linear regression, ridge regression, the lasso) shrink some coefficients to very small values or set them directly to 0. This increases the bias of the model (by reducing the weight of some features), but driving some coefficients to 0 also reduces the model's complexity; once the redundant features are eliminated, the model is easier to understand and the prediction error can drop. Getting the amount of shrinkage wrong hurts, however: too little shrinkage leaves the model overfit, i.e. it predicts the training set very well but does much worse on the test set, while too much shrinkage underfits.
5, Example: forecasting the price of LEGO toy sets
LEGO produces assembled toys built from many plastic bricks of different shapes and sizes.
Generally, these bricks are sold in sets that can be assembled into many different things: ships, castles, famous buildings, and so on. A LEGO set contains anywhere from 10 to 5000 pieces.
A LEGO set is usually discontinued after a few years, but collectors keep trading sets with each other after production stops. In this example, we use regression to predict the prices at which collectors trade these sets.
1. Get data
We obtain the information we need by parsing the saved HTML pages:
from bs4 import BeautifulSoup def scrapePage(retX, retY, inFile, yr, numPce, origPrc): """ Function description:Read data from the page and generate retX and retY list Parameters: retX - data X retY - data Y inFile - HTML file yr - particular year numPce - Number of LEGO parts origPrc - original price Returns: nothing """ # Open and read HTML file with open(inFile, encoding='utf-8') as f: html = f.read() soup = BeautifulSoup(html) i = 1 # Parse according to HTML page structure currentRow = soup.find_all('table', r = "%d" % i) while(len(currentRow) != 0): currentRow = soup.find_all('table', r = "%d" % i) title = currentRow[0].find_all('a')[1].text lwrTitle = title.lower() # Find out if there are brand new labels if (lwrTitle.find('new') > -1) or (lwrTitle.find('nisb') > -1): newFlag = 1.0 else: newFlag = 0.0 # To find out whether the logo has been sold, we only collect the data that has been sold soldUnicde = currentRow[0].find_all('td')[3].find_all('span') if len(soldUnicde) == 0: print("commodity #%d not sold "% i) else: # Analyze the page to get the current price soldPrice = currentRow[0].find_all('td')[4] priceStr = soldPrice.text priceStr = priceStr.replace('$','') priceStr = priceStr.replace(',','') if len(soldPrice) > 1: priceStr = priceStr.replace('Free shipping', '') sellingPrice = float(priceStr) # Remove incomplete package price if sellingPrice > origPrc * 0.5: print("%d\t%d\t%d\t%f\t%f" % (yr, numPce, newFlag, origPrc, sellingPrice)) retX.append([yr, numPce, newFlag, origPrc]) retY.append(sellingPrice) i += 1 currentRow = soup.find_all('table', r = "%d" % i) def setDataCollect(retX, retY): """ Function description:Read the data of six Lego sets in turn and generate the data matrix Parameters: nothing Returns: nothing """ scrapePage(retX, retY, './lego/lego8288.html', 2006, 800, 49.99) #LEGO 8288 in 2006, number of parts 800, original price 49.99 scrapePage(retX, retY, './lego/lego10030.html', 2002, 3096, 269.99) #LEGO 10030 in 2002, number of parts 3096, original price 269.99 scrapePage(retX, retY, './lego/lego10179.html', 2007, 5195, 499.99) #LEGO 10179 in 2007, number of parts 5195, original price 499.99 scrapePage(retX, retY, './lego/lego10181.html', 2007, 3428, 199.99) #LEGO 10181 in 2007, number of parts 3428, original price 199.99 scrapePage(retX, retY, './lego/lego10189.html', 2008, 5922, 299.99) #LEGO 10189 in 2008, number of parts 5922, original price 299.99 scrapePage(retX, retY, './lego/lego10196.html', 2009, 3263, 249.99) #LEGO 10196 in 2009, 3263 parts, original price 249.99 if __name__ == '__main__': lgX = [] lgY = [] setDataCollect(lgX, lgY)
Items that were never sold, as well as suspiciously cheap (likely incomplete) sets, are filtered out. The features collected are: year of production, number of pieces, whether the set is brand new, and the original price; the target is the selling price (the second-hand transaction price).
2. Model building
Now that the dataset has been processed, the next step is to train a model. First we need to add a column of 1s as feature X0, the constant term that linear regression requires. Then we use the simplest ordinary linear regression; the code is as follows:
import numpy as np from bs4 import BeautifulSoup def scrapePage(retX, retY, inFile, yr, numPce, origPrc): """ Function description:Read data from the page and generate retX and retY list Parameters: retX - data X retY - data Y inFile - HTML file yr - particular year numPce - Number of LEGO parts origPrc - original price Returns: nothing """ # Open and read HTML file with open(inFile, encoding='utf-8') as f: html = f.read() soup = BeautifulSoup(html) i = 1 # Parse according to HTML page structure currentRow = soup.find_all('table', r = "%d" % i) while(len(currentRow) != 0): currentRow = soup.find_all('table', r = "%d" % i) title = currentRow[0].find_all('a')[1].text lwrTitle = title.lower() # Find out if there are brand new labels if (lwrTitle.find('new') > -1) or (lwrTitle.find('nisb') > -1): newFlag = 1.0 else: newFlag = 0.0 # To find out whether the logo has been sold, we only collect the data that has been sold soldUnicde = currentRow[0].find_all('td')[3].find_all('span') if len(soldUnicde) == 0: print("commodity #%d not sold "% i) else: # Analyze the page to get the current price soldPrice = currentRow[0].find_all('td')[4] priceStr = soldPrice.text priceStr = priceStr.replace('$','') priceStr = priceStr.replace(',','') if len(soldPrice) > 1: priceStr = priceStr.replace('Free shipping', '') sellingPrice = float(priceStr) # Remove incomplete package price if sellingPrice > origPrc * 0.5: print("%d\t%d\t%d\t%f\t%f" % (yr, numPce, newFlag, origPrc, sellingPrice)) retX.append([yr, numPce, newFlag, origPrc]) retY.append(sellingPrice) i += 1 currentRow = soup.find_all('table', r = "%d" % i) def setDataCollect(retX, retY): """ Function description:Read the data of six Lego sets in turn and generate the data matrix Parameters: nothing Returns: nothing """ scrapePage(retX, retY, './lego/lego8288.html', 2006, 800, 49.99) #LEGO 8288 in 2006, number of parts 800, original price 49.99 scrapePage(retX, retY, './lego/lego10030.html', 2002, 3096, 269.99) #LEGO 10030 in 2002, number of parts 3096, original price 269.99 scrapePage(retX, retY, './lego/lego10179.html', 2007, 5195, 499.99) #LEGO 10179 in 2007, number of parts 5195, original price 499.99 scrapePage(retX, retY, './lego/lego10181.html', 2007, 3428, 199.99) #LEGO 10181 in 2007, number of parts 3428, original price 199.99 scrapePage(retX, retY, './lego/lego10189.html', 2008, 5922, 299.99) #LEGO 10189 in 2008, number of parts 5922, original price 299.99 scrapePage(retX, retY, './lego/lego10196.html', 2009, 3263, 249.99) #LEGO 10196 in 2009, 3263 parts, original price 249.99 def regularize(xMat, yMat): """ Function description:Data standardization Parameters: xMat - x data set yMat - y data set Returns: inxMat - Standardized x data set inyMat - Standardized y data set """ inxMat = xMat.copy() #Data copy inyMat = yMat.copy() yMean = np.mean(yMat, 0) #Row to row operation, find the mean inyMat = yMat - yMean #Data minus mean inMeans = np.mean(inxMat, 0) #Row to row operation, find the mean inVar = np.var(inxMat, 0) #Row to row operation, variance # print(inxMat) print(inMeans) # print(inVar) inxMat = (inxMat - inMeans) / inVar #Data is normalized by subtracting mean divided by variance return inxMat, inyMat def rssError(yArr,yHatArr): """ Function description:Calculate square error Parameters: yArr - Estimate yHatArr - True value Returns: """ return ((yArr-yHatArr)**2).sum() def standRegres(xArr,yArr): """ Function description:Calculate regression coefficient w Parameters: xArr - x data set yArr - y data set Returns: ws - regression coefficient """ 
xMat = np.mat(xArr); yMat = np.mat(yArr).T xTx = xMat.T * xMat #The regression coefficient is calculated according to the publicity derived in this paper if np.linalg.det(xTx) == 0.0: print("The matrix is singular,Cannot transpose") return ws = xTx.I * (xMat.T*yMat) return ws def useStandRegres(): """ Function description:Using simple linear regression Parameters: nothing Returns: nothing """ lgX = [] lgY = [] setDataCollect(lgX, lgY) data_num, features_num = np.shape(lgX) lgX1 = np.mat(np.ones((data_num, features_num + 1))) lgX1[:, 1:5] = np.mat(lgX) ws = standRegres(lgX1, lgY) print('%f%+f*particular year%+f*Number of parts%+f*Is it brand new%+f*original price' % (ws[0],ws[1],ws[2],ws[3],ws[4])) if __name__ == '__main__': useStandRegres()
The run prints the fitted formula. Although this model fits the data well, it doesn't seem to make much sense: it says that the more pieces a set contains, the lower its price, which is unreasonable.
Next we use ridge regression, finding through cross-validation the value of λ that minimizes the error and the corresponding regression coefficients. The code is as follows:
import numpy as np from bs4 import BeautifulSoup import random def scrapePage(retX, retY, inFile, yr, numPce, origPrc): """ Function description:Read data from the page and generate retX and retY list Parameters: retX - data X retY - data Y inFile - HTML file yr - particular year numPce - Number of LEGO parts origPrc - original price Returns: nothing """ # Open and read HTML file with open(inFile, encoding='utf-8') as f: html = f.read() soup = BeautifulSoup(html) i = 1 # Parse according to HTML page structure currentRow = soup.find_all('table', r = "%d" % i) while(len(currentRow) != 0): currentRow = soup.find_all('table', r = "%d" % i) title = currentRow[0].find_all('a')[1].text lwrTitle = title.lower() # Find out if there are brand new labels if (lwrTitle.find('new') > -1) or (lwrTitle.find('nisb') > -1): newFlag = 1.0 else: newFlag = 0.0 # To find out whether the logo has been sold, we only collect the data that has been sold soldUnicde = currentRow[0].find_all('td')[3].find_all('span') if len(soldUnicde) == 0: print("commodity #%d not sold "% i) else: # Analyze the page to get the current price soldPrice = currentRow[0].find_all('td')[4] priceStr = soldPrice.text priceStr = priceStr.replace('$','') priceStr = priceStr.replace(',','') if len(soldPrice) > 1: priceStr = priceStr.replace('Free shipping', '') sellingPrice = float(priceStr) # Remove incomplete package price if sellingPrice > origPrc * 0.5: print("%d\t%d\t%d\t%f\t%f" % (yr, numPce, newFlag, origPrc, sellingPrice)) retX.append([yr, numPce, newFlag, origPrc]) retY.append(sellingPrice) i += 1 currentRow = soup.find_all('table', r = "%d" % i) def ridgeRegres(xMat, yMat, lam = 0.2): """ Function description:Ridge regression Parameters: xMat - x data set yMat - y data set lam - Reduction factor Returns: ws - regression coefficient """ xTx = xMat.T * xMat denom = xTx + np.eye(np.shape(xMat)[1]) * lam if np.linalg.det(denom) == 0.0: print("The matrix is singular,Cannot transpose") return ws = denom.I * (xMat.T * yMat) return ws def setDataCollect(retX, retY): """ Function description:Read the data of six Lego sets in turn and generate the data matrix Parameters: nothing Returns: nothing """ scrapePage(retX, retY, './lego/lego8288.html', 2006, 800, 49.99) #LEGO 8288 in 2006, number of parts 800, original price 49.99 scrapePage(retX, retY, './lego/lego10030.html', 2002, 3096, 269.99) #LEGO 10030 in 2002, number of parts 3096, original price 269.99 scrapePage(retX, retY, './lego/lego10179.html', 2007, 5195, 499.99) #LEGO 10179 in 2007, number of parts 5195, original price 499.99 scrapePage(retX, retY, './lego/lego10181.html', 2007, 3428, 199.99) #LEGO 10181 in 2007, number of parts 3428, original price 199.99 scrapePage(retX, retY, './lego/lego10189.html', 2008, 5922, 299.99) #LEGO 10189 in 2008, number of parts 5922, original price 299.99 scrapePage(retX, retY, './lego/lego10196.html', 2009, 3263, 249.99) #LEGO 10196 in 2009, 3263 parts, original price 249.99 def regularize(xMat, yMat): """ Function description:Data standardization Parameters: xMat - x data set yMat - y data set Returns: inxMat - Standardized x data set inyMat - Standardized y data set """ inxMat = xMat.copy() #Data copy inyMat = yMat.copy() yMean = np.mean(yMat, 0) #Row to row operation, find the mean inyMat = yMat - yMean #Data minus mean inMeans = np.mean(inxMat, 0) #Row to row operation, find the mean inVar = np.var(inxMat, 0) #Row to row operation, variance # print(inxMat) print(inMeans) # print(inVar) inxMat = (inxMat - inMeans) / inVar #Data is normalized by 
subtracting mean divided by variance return inxMat, inyMat def rssError(yArr,yHatArr): """ Function description:Calculate square error Parameters: yArr - Estimate yHatArr - True value Returns: """ return ((yArr-yHatArr)**2).sum() def standRegres(xArr,yArr): """ Function description:Calculate regression coefficient w Parameters: xArr - x data set yArr - y data set Returns: ws - regression coefficient """ xMat = np.mat(xArr); yMat = np.mat(yArr).T xTx = xMat.T * xMat #The regression coefficient is calculated according to the publicity derived in this paper if np.linalg.det(xTx) == 0.0: print("The matrix is singular,Cannot transpose") return ws = xTx.I * (xMat.T*yMat) return ws def crossValidation(xArr, yArr, numVal = 10): """ Function description:Cross validation ridge regression Parameters: xArr - x data set yArr - y data set numVal - Number of cross validation Returns: wMat - Regression coefficient matrix """ m = len(yArr) #Number of statistical samples indexList = list(range(m)) #Generate index value list errorMat = np.zeros((numVal,30)) #create error mat 30columns numVal rows for i in range(numVal): #Cross validation numVal times trainX = []; trainY = [] #Training set testX = []; testY = [] #Test set random.shuffle(indexList) #Disorganize for j in range(m): #Divided data set: 90% training set and 10% test set if j < m * 0.9: trainX.append(xArr[indexList[j]]) trainY.append(yArr[indexList[j]]) else: testX.append(xArr[indexList[j]]) testY.append(yArr[indexList[j]]) wMat = ridgeTest(trainX, trainY) #The ridge regression coefficients under 30 different lambda were obtained for k in range(30): #Traverse all ridge regression coefficients matTestX = np.mat(testX); matTrainX = np.mat(trainX) #Test set meanTrain = np.mean(matTrainX,0) #Test set mean varTrain = np.var(matTrainX,0) #Test set variance matTestX = (matTestX - meanTrain) / varTrain #Test set standardization yEst = matTestX * np.mat(wMat[k,:]).T + np.mean(trainY) #Predict y value according to ws errorMat[i, k] = rssError(yEst.T.A, np.array(testY)) #Statistical error meanErrors = np.mean(errorMat,0) #Calculate the average error of each cross validation minMean = float(min(meanErrors)) #Find the minimum error bestWeights = wMat[np.nonzero(meanErrors == minMean)] #Find the best regression coefficient xMat = np.mat(xArr); yMat = np.mat(yArr).T meanX = np.mean(xMat,0); varX = np.var(xMat,0) unReg = bestWeights / varX #Data is standardized and therefore needs to be restored print('%f%+f*particular year%+f*Number of parts%+f*Is it brand new%+f*original price' % ((-1 * np.sum(np.multiply(meanX,unReg)) + np.mean(yMat)), unReg[0,0], unReg[0,1], unReg[0,2], unReg[0,3])) def ridgeTest(xArr, yArr): """ Function description:Ridge regression test Parameters: xMat - x data set yMat - y data set Returns: wMat - Regression coefficient matrix """ xMat = np.mat(xArr); yMat = np.mat(yArr).T #Data standardization yMean = np.mean(yMat, axis = 0) #Row to row operation, find the mean yMat = yMat - yMean #Data minus mean xMeans = np.mean(xMat, axis = 0) #Row to row operation, find the mean xVar = np.var(xMat, axis = 0) #Row to row operation, variance xMat = (xMat - xMeans) / xVar #Data is normalized by subtracting mean divided by variance numTestPts = 30 #30 different lambda tests wMat = np.zeros((numTestPts, np.shape(xMat)[1])) #Initial regression coefficient matrix for i in range(numTestPts): #Changing lambda to calculate regression coefficient ws = ridgeRegres(xMat, yMat, np.exp(i - 10)) #lambda changes exponentially with e, initially a very small number, 
wMat[i, :] = ws.T #Calculate regression coefficient matrix return wMat if __name__ == '__main__': lgX = [] lgY = [] setDataCollect(lgX, lgY) crossValidation(lgX, lgY)
The samples here are selected at random, so the results differ slightly from run to run. However, the result is not very different from ordinary least squares, i.e. plain linear regression. We had hoped to find a more interpretable model, and that clearly has not been achieved yet.
Now let's look at how the regression coefficients change during the shrinkage process. The code is as follows:
import numpy as np from bs4 import BeautifulSoup import random def scrapePage(retX, retY, inFile, yr, numPce, origPrc): """ Function description:Read data from the page and generate retX and retY list Parameters: retX - data X retY - data Y inFile - HTML file yr - particular year numPce - Number of LEGO parts origPrc - original price Returns: nothing """ # Open and read HTML file with open(inFile, encoding='utf-8') as f: html = f.read() soup = BeautifulSoup(html) i = 1 # Parse according to HTML page structure currentRow = soup.find_all('table', r = "%d" % i) while(len(currentRow) != 0): currentRow = soup.find_all('table', r = "%d" % i) title = currentRow[0].find_all('a')[1].text lwrTitle = title.lower() # Find out if there are brand new labels if (lwrTitle.find('new') > -1) or (lwrTitle.find('nisb') > -1): newFlag = 1.0 else: newFlag = 0.0 # To find out whether the logo has been sold, we only collect the data that has been sold soldUnicde = currentRow[0].find_all('td')[3].find_all('span') if len(soldUnicde) == 0: print("commodity #%d not sold "% i) else: # Analyze the page to get the current price soldPrice = currentRow[0].find_all('td')[4] priceStr = soldPrice.text priceStr = priceStr.replace('$','') priceStr = priceStr.replace(',','') if len(soldPrice) > 1: priceStr = priceStr.replace('Free shipping', '') sellingPrice = float(priceStr) # Remove incomplete package price if sellingPrice > origPrc * 0.5: print("%d\t%d\t%d\t%f\t%f" % (yr, numPce, newFlag, origPrc, sellingPrice)) retX.append([yr, numPce, newFlag, origPrc]) retY.append(sellingPrice) i += 1 currentRow = soup.find_all('table', r = "%d" % i) def ridgeRegres(xMat, yMat, lam = 0.2): """ Function description:Ridge regression Parameters: xMat - x data set yMat - y data set lam - Reduction factor Returns: ws - regression coefficient """ xTx = xMat.T * xMat denom = xTx + np.eye(np.shape(xMat)[1]) * lam if np.linalg.det(denom) == 0.0: print("The matrix is singular,Cannot transpose") return ws = denom.I * (xMat.T * yMat) return ws def setDataCollect(retX, retY): """ Function description:Read the data of six Lego sets in turn and generate the data matrix Parameters: nothing Returns: nothing """ scrapePage(retX, retY, './lego/lego8288.html', 2006, 800, 49.99) #LEGO 8288 in 2006, number of parts 800, original price 49.99 scrapePage(retX, retY, './lego/lego10030.html', 2002, 3096, 269.99) #LEGO 10030 in 2002, number of parts 3096, original price 269.99 scrapePage(retX, retY, './lego/lego10179.html', 2007, 5195, 499.99) #LEGO 10179 in 2007, number of parts 5195, original price 499.99 scrapePage(retX, retY, './lego/lego10181.html', 2007, 3428, 199.99) #LEGO 10181 in 2007, number of parts 3428, original price 199.99 scrapePage(retX, retY, './lego/lego10189.html', 2008, 5922, 299.99) #LEGO 10189 in 2008, number of parts 5922, original price 299.99 scrapePage(retX, retY, './lego/lego10196.html', 2009, 3263, 249.99) #LEGO 10196 in 2009, 3263 parts, original price 249.99 def regularize(xMat, yMat): """ Function description:Data standardization Parameters: xMat - x data set yMat - y data set Returns: inxMat - Standardized x data set inyMat - Standardized y data set """ inxMat = xMat.copy() #Data copy inyMat = yMat.copy() yMean = np.mean(yMat, 0) #Row to row operation, find the mean inyMat = yMat - yMean #Data minus mean inMeans = np.mean(inxMat, 0) #Row to row operation, find the mean inVar = np.var(inxMat, 0) #Row to row operation, variance # print(inxMat) print(inMeans) # print(inVar) inxMat = (inxMat - inMeans) / inVar #Data is normalized by 
subtracting mean divided by variance return inxMat, inyMat def rssError(yArr,yHatArr): """ Function description:Calculate square error Parameters: yArr - Estimate yHatArr - True value Returns: """ return ((yArr-yHatArr)**2).sum() def standRegres(xArr,yArr): """ Function description:Calculate regression coefficient w Parameters: xArr - x data set yArr - y data set Returns: ws - regression coefficient """ xMat = np.mat(xArr); yMat = np.mat(yArr).T xTx = xMat.T * xMat #The regression coefficient is calculated according to the publicity derived in this paper if np.linalg.det(xTx) == 0.0: print("The matrix is singular,Cannot transpose") return ws = xTx.I * (xMat.T*yMat) return ws def ridgeTest(xArr, yArr): """ Function description:Ridge regression test Parameters: xMat - x data set yMat - y data set Returns: wMat - Regression coefficient matrix """ xMat = np.mat(xArr); yMat = np.mat(yArr).T #Data standardization yMean = np.mean(yMat, axis = 0) #Row to row operation, find the mean yMat = yMat - yMean #Data minus mean xMeans = np.mean(xMat, axis = 0) #Row to row operation, find the mean xVar = np.var(xMat, axis = 0) #Row to row operation, variance xMat = (xMat - xMeans) / xVar #Data is normalized by subtracting mean divided by variance numTestPts = 30 #30 different lambda tests wMat = np.zeros((numTestPts, np.shape(xMat)[1])) #Initial regression coefficient matrix for i in range(numTestPts): #Changing lambda to calculate regression coefficient ws = ridgeRegres(xMat, yMat, np.exp(i - 10)) #lambda changes exponentially with e, initially a very small number, wMat[i, :] = ws.T #Calculate regression coefficient matrix return wMat if __name__ == '__main__': lgX = [] lgY = [] setDataCollect(lgX, lgY) print(ridgeTest(lgX, lgY))
Looking at the first row of the output, we can see that the coefficient with the largest magnitude is the 4th one, and the second largest is the 2nd.
Therefore, if only one feature could be used for prediction, we should pick the 4th feature; if two features could be used, we should pick the 4th and the 2nd.
This kind of analysis lets us uncover the internal structure of a large dataset. With only four features the benefit may not be obvious, but with a hundred or more features the method becomes very effective: it can point out which features are key and which are unimportant.
6, Summary
- Like classification, regression is also the process of predicting the target value. The difference between regression and classification is that the former predicts continuous variables, while the latter predicts discrete variables.
- Ridge regression is a shrinkage method, equivalent to constraining the size of the regression coefficients. Another good shrinkage method is the lasso. The lasso is hard to solve exactly, but an approximate solution can be obtained with the simple forward stepwise linear regression method.
- Shrinkage methods can also be viewed as adding bias to a model while reducing its variance at the same time.