Practical course of machine learning: linear regression

What is regression?

The purpose of regression is to predict a numerical target value. The most direct way is to write a formula that computes the target value from the inputs. For example, if you want to predict the horsepower of your little sister's boyfriend's car, you might calculate it as follows:

HorsePower = 0.0015 * annualSalary - 0.99 * hoursListeningToPublicRadio

Written out in words, that is:

Horsepower of the boyfriend's car = 0.0015 * (his annual salary) - 0.99 * (hours spent listening to public radio)

This is the so-called regression equation, in which 0.0015 and -0.99 are called the regression weights. The process of finding these regression coefficients is regression. Once we have the regression coefficients, making a prediction for a given input is easy: multiply each input value by its regression coefficient and add all the results together to get the predicted value.
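As a quick illustration of "multiply each input by its coefficient and add the results", here is a minimal sketch in Python; the weights are the hypothetical ones from the horsepower formula above and the input values are made up:

# Hypothetical example: the weights and inputs are illustrative only.
weights = [0.0015, -0.99]                 # regression coefficients
inputs  = [120000, 4.5]                   # annual salary, hours of public radio

# Prediction = sum of coefficient * input over all features
horsepower = sum(w * x for w, x in zip(weights, inputs))
print(horsepower)                         # 0.0015*120000 - 0.99*4.5 = 175.545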

When people say regression, they generally mean linear regression, so regression and linear regression are used interchangeably in this article. Linear regression means that the output is obtained by multiplying each input by a constant and adding the results together. It should be noted that there is another kind of model called nonlinear regression, which does not follow this form; for example, it may assume that the output depends on a product or ratio of the inputs. In that case, the horsepower formula above could also be written as follows:

HorsePower = 0.0015 * annualSalary / hoursListeningToPublicRadio

This is an example of nonlinear regression, which is not discussed further in this article.

1, Find the best fitting line by linear regression

How do we get the regression equation from a large amount of data? Suppose the input data are stored in the matrix X (one sample per row) and the target values are stored in the vector y.

The regression coefficients are stored in the vector w.

Then, for a given sample x1 (the first row of matrix X), the prediction ŷ1 is given by:

ŷ1 = x1^T * w

Now the question is: with the data matrix X and the label vector y in hand, how do we find w? A common approach is to find the w that minimizes the error, where the error is the difference between the predicted value ŷ and the true value y. Simply accumulating the errors would let positive and negative differences cancel out, so we use the squared error instead.

The sum of squared errors can be written as:

sum_{i=1}^{m} (y_i - x_i^T * w)^2

In matrix notation this can also be written as:

(y - Xw)^T (y - Xw)

Why can it be rewritten like this? Remember the convention: a vector such as y or Xw is a column vector by default, and its transpose is a row vector. Substitute the data matrix X and label vector y defined above and expand the product to see why the two forms are equal.
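If the equivalence is not obvious, a tiny numerical check (a sketch with made-up data, not part of the original derivation) confirms that the element-wise sum of squared errors and the matrix expression give the same number:

import numpy as np

# Made-up data: 5 samples, 2 features, and an arbitrary coefficient vector w
rng = np.random.default_rng(42)
X = rng.normal(size=(5, 2))
y = rng.normal(size=(5, 1))
w = np.array([[0.5], [-1.2]])

sum_of_squares = float(((y - X @ w) ** 2).sum())        # element-wise form
matrix_form = float((y - X @ w).T @ (y - X @ w))        # (y - Xw)^T (y - Xw)
print(sum_of_squares, matrix_form)                      # the two values are identical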

Before continuing the derivation, let us state the goal clearly: find the w that minimizes the sum of squared errors, because we believe that the smaller the sum of squared errors, the better the linear regression fits the data.

Now, take the derivative of the matrix form of the sum of squared errors with respect to w:

d/dw [ (y - Xw)^T (y - Xw) ] = -2 X^T (y - Xw)

If you are not familiar with matrix calculation, you can move here: https://blog.csdn.net/nomadlx53/article/details/50849941

Setting this derivative equal to 0 and solving for w gives:

ŵ = (X^T X)^{-1} X^T y

The small "hat" over w indicates that this is the best estimate of w we can obtain at the moment. The w estimated from the available data may not be the true w underlying the data, so the hat symbol is used to emphasize that ŵ is only a best estimate of w.

It is worth noting that the formula above involves a matrix inverse, so it can only be used when the inverse of X^T X exists, that is, when its determinant is not 0. The code below therefore checks the determinant before inverting.

Finding the best w is a common problem in statistics, and there are many methods besides the matrix approach. By calling the matrix routines in the NumPy library, we can implement the required computation in only a few lines of code. This method is also called OLS, which stands for "ordinary least squares".
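Before the full listing, here is a minimal sketch of the normal equation with plain NumPy arrays (made-up toy data, not from the dataset used below); np.linalg.lstsq is shown as a numerically safer alternative for the case where X^T X is close to singular:

import numpy as np

# Made-up data: column of 1s for the intercept plus one feature
X = np.array([[1.0, 0.1], [1.0, 0.4], [1.0, 0.7], [1.0, 0.9]])
y = np.array([1.2, 1.9, 2.6, 3.1])

# Normal equation: w = (X^T X)^{-1} X^T y  (only valid when X^T X is invertible)
w = np.linalg.solve(X.T @ X, X.T @ y)
print(w)

# Equivalent and more robust: least-squares solver
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)
print(w_lstsq)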

Let's look at the dataset first:

The first column is always 1.0, i.e. x0 (the constant term). The second column is x1, the x-axis value. The third column is the target value y, the y-axis value. First, plot the data to look at its distribution. The code is as follows:

import matplotlib.pyplot as plt
import numpy as np
 
def loadDataSet(fileName):
    """
    Function description:Load data
    Parameters:
        fileName - file name
    Returns:
        xArr - x data set
        yArr - y data set
    """
 
    numFeat = len(open(fileName).readline().split('\t')) - 1
    xArr = []; yArr = []
    fr = open(fileName)
    for line in fr.readlines():
        lineArr =[]
        curLine = line.strip().split('\t')
        for i in range(numFeat):
            lineArr.append(float(curLine[i]))
        xArr.append(lineArr)
        yArr.append(float(curLine[-1]))
    return xArr, yArr
 
def plotDataSet():
    """
    Function description:Draw dataset
    Parameters:
        nothing
    Returns:
        nothing
    """
    xArr, yArr = loadDataSet('ex0.txt')                                    #Load dataset
    n = len(xArr)                                                        #Number of data
    xcord = []; ycord = []                                                #Sample point
    for i in range(n):                                                   
        xcord.append(xArr[i][1]); ycord.append(yArr[i])                    #Sample point
    fig = plt.figure()
    ax = fig.add_subplot(111)                                            #Add subplot
    ax.scatter(xcord, ycord, s = 20, c = 'blue',alpha = .5)                #Draw sample points
    plt.title('DataSet')                                                #Draw title
    plt.xlabel('X')
    plt.show()
 
if __name__ == '__main__':
    plotDataSet()


Visualizing the data shows its distribution. Next, we compute the regression coefficient vector with the formula derived above and use it to draw the fitted regression line. The code is as follows:

import matplotlib.pyplot as plt
import numpy as np
 
def loadDataSet(fileName):
    """
    Function description:Load data
    Parameters:
        fileName - file name
    Returns:
        xArr - x data set
        yArr - y data set
    """
    numFeat = len(open(fileName).readline().split('\t')) - 1
    xArr = []; yArr = []
    fr = open(fileName)
    for line in fr.readlines():
        lineArr =[]
        curLine = line.strip().split('\t')
        for i in range(numFeat):
            lineArr.append(float(curLine[i]))
        xArr.append(lineArr)
        yArr.append(float(curLine[-1]))
    return xArr, yArr
 
def standRegres(xArr,yArr):
    """
    Function description:Calculate regression coefficient w
    Parameters:
        xArr - x data set
        yArr - y data set
    Returns:
        ws - regression coefficient
    """
    xMat = np.mat(xArr); yMat = np.mat(yArr).T
    xTx = xMat.T * xMat                            #Compute w according to the formula derived above
    if np.linalg.det(xTx) == 0.0:
        print("The matrix is singular and cannot be inverted")
        return
    ws = xTx.I * (xMat.T*yMat)
    return ws
 
def plotRegression():
    """
    Function description:Draw regression curves and data points
    Parameters:
        nothing
    Returns:
        nothing
    """
    xArr, yArr = loadDataSet('ex0.txt')                                    #Load dataset
    ws = standRegres(xArr, yArr)                                        #Calculate regression coefficient
    xMat = np.mat(xArr)                                                    #Create xMat matrix
    yMat = np.mat(yArr)                                                    #Create yMat matrix
    xCopy = xMat.copy()                                                    #Deep copy xMat matrix
    xCopy.sort(0)                                                        #sort
    yHat = xCopy * ws                                                     #Calculate the corresponding y value
    fig = plt.figure()
    ax = fig.add_subplot(111)                                            #Add subplot
    ax.plot(xCopy[:, 1], yHat, c = 'red')                                #Draw regression curve
    ax.scatter(xMat[:,1].flatten().A[0], yMat.flatten().A[0], s = 20, c = 'blue',alpha = .5)                #Draw sample points
    plt.title('DataSet')                                                #Draw title
    plt.xlabel('X')
    plt.show()
 
if __name__ == '__main__':
    plotRegression()


How to judge the fitting effect of the fitting curve? Of course, we can observe according to our own experience. In addition, we can also use the corrcoef method to compare the correlation between the predicted value and the real value. The code is as follows:

import numpy as np


def loadDataSet(fileName):
    """
    Function description:Load data
    Parameters:
        fileName - file name
    Returns:
        xArr - x data set
        yArr - y data set
    """
    numFeat = len(open(fileName).readline().split('\t')) - 1
    xArr = []; yArr = []
    fr = open(fileName)
    for line in fr.readlines():
        lineArr =[]
        curLine = line.strip().split('\t')
        for i in range(numFeat):
            lineArr.append(float(curLine[i]))
        xArr.append(lineArr)
        yArr.append(float(curLine[-1]))
    return xArr, yArr

 
def standRegres(xArr,yArr):
    """
    Function description:Calculate regression coefficient w
    Parameters:
        xArr - x data set
        yArr - y data set
    Returns:
        ws - regression coefficient
    """
    xMat = np.mat(xArr); yMat = np.mat(yArr).T
    xTx = xMat.T * xMat                            #Compute w according to the formula derived above
    if np.linalg.det(xTx) == 0.0:
        print("The matrix is singular and cannot be inverted")
        return
    ws = xTx.I * (xMat.T*yMat)
    return ws


if __name__ == '__main__':
    xArr, yArr = loadDataSet('ex0.txt')                                    #Load dataset
    ws = standRegres(xArr, yArr)                                        #Calculate regression coefficient
    xMat = np.mat(xArr)                                                    #Create xMat matrix
    yMat = np.mat(yArr)                                                    #Create yMat matrix
    yHat = xMat * ws
    print(np.corrcoef(yHat.T, yMat))


As can be seen, the diagonal entries are 1.0, because yMat matches itself perfectly, while the correlation coefficient between yHat and yMat is about 0.98.

The best-fit approach models the data as a straight line, which performs quite well here. However, the scatter plot suggests that there are other patterns in the data. How can we take advantage of them? We can adjust the prediction locally, based on the data near the point being predicted. That method is introduced below.

2, Locally weighted linear regression

One problem with linear regression is that it may underfit the data, because it produces the unbiased estimate with the smallest mean squared error. Obviously, an underfit model will not give good predictions. Some methods therefore allow a little bias to be introduced into the estimate in order to reduce the overall mean squared error of the predictions.

One such method is Locally Weighted Linear Regression (LWLR). In this method, we give a weight to each point near the point to be predicted. Like kNN, this algorithm has to select the corresponding subset of the data for each prediction. The regression coefficients ŵ produced by the algorithm take the following form:

ŵ = (X^T W X)^{-1} X^T W y

where W is a diagonal matrix. The difference between this formula and the one derived above is the matrix W, which assigns a weight to each data point.

LWLR uses a "kernel" (similar to the kernel in support vector machine) to give higher weights to nearby points. The type of kernel can be selected freely. The most commonly used kernel is Gaussian kernel. The weight of Gaussian kernel is as follows:

In this way, we can write locally weighted linear regression according to the above formula. We can adjust the regression effect by changing the value of k. The code is as follows:

from matplotlib.font_manager import FontProperties
import matplotlib.pyplot as plt
import numpy as np


def loadDataSet(fileName):
    """
    Function description:Load data
    Parameters:
        fileName - file name
    Returns:
        xArr - x data set
        yArr - y data set
    """
    numFeat = len(open(fileName).readline().split('\t')) - 1
    xArr = []; yArr = []
    fr = open(fileName)
    for line in fr.readlines():
        lineArr =[]
        curLine = line.strip().split('\t')
        for i in range(numFeat):
            lineArr.append(float(curLine[i]))
        xArr.append(lineArr)
        yArr.append(float(curLine[-1]))
    return xArr, yArr
 
 
def plotlwlrRegression():
    """
    Function description:Draw multiple locally weighted regression curves
    Parameters:
        nothing
    Returns:
        nothing
    """
    font = FontProperties(fname=r"c:\windows\fonts\simsun.ttc", size=14)
    xArr, yArr = loadDataSet('ex0.txt')                                    #Load dataset
    yHat_1 = lwlrTest(xArr, xArr, yArr, 1.0)                            #Compute yHat with locally weighted linear regression
    yHat_2 = lwlrTest(xArr, xArr, yArr, 0.01)                            #Compute yHat with locally weighted linear regression
    yHat_3 = lwlrTest(xArr, xArr, yArr, 0.003)                            #Compute yHat with locally weighted linear regression
    xMat = np.mat(xArr)                                                    #Create xMat matrix
    yMat = np.mat(yArr)                                                    #Create yMat matrix
    srtInd = xMat[:, 1].argsort(0)                                        #Sort, return index value
    xSort = xMat[srtInd][:,0,:]
    fig, axs = plt.subplots(nrows=3, ncols=1,sharex=False, sharey=False, figsize=(10,8))                                        
    axs[0].plot(xSort[:, 1], yHat_1[srtInd], c = 'red')                        #Draw regression curve
    axs[1].plot(xSort[:, 1], yHat_2[srtInd], c = 'red')                        #Draw regression curve
    axs[2].plot(xSort[:, 1], yHat_3[srtInd], c = 'red')                        #Draw regression curve
    axs[0].scatter(xMat[:,1].flatten().A[0], yMat.flatten().A[0], s = 20, c = 'blue', alpha = .5)                #Draw sample points
    axs[1].scatter(xMat[:,1].flatten().A[0], yMat.flatten().A[0], s = 20, c = 'blue', alpha = .5)                #Draw sample points
    axs[2].scatter(xMat[:,1].flatten().A[0], yMat.flatten().A[0], s = 20, c = 'blue', alpha = .5)                #Draw sample points
    #Set title, x-axis label,y-axis label
    axs0_title_text = axs[0].set_title(u'Locally weighted regression curve,k=1.0', fontproperties=font)
    axs1_title_text = axs[1].set_title(u'Locally weighted regression curve,k=0.01', fontproperties=font)
    axs2_title_text = axs[2].set_title(u'Locally weighted regression curve,k=0.003', fontproperties=font)
    plt.setp(axs0_title_text, size=8, weight='bold', color='red')  
    plt.setp(axs1_title_text, size=8, weight='bold', color='red')  
    plt.setp(axs2_title_text, size=8, weight='bold', color='red')  
    plt.xlabel('X')
    plt.show()


def lwlr(testPoint, xArr, yArr, k = 1.0):
    """
    Function description:Regression coefficients were calculated using locally weighted linear regression w
    Parameters:
        testPoint - Test sample point
        xArr - x data set
        yArr - y data set
        k - Gaussian kernel k,Custom parameters
    Returns:
        testPoint * ws - Predicted value for the test sample point
    """
    xMat = np.mat(xArr); yMat = np.mat(yArr).T
    m = np.shape(xMat)[0]
    weights = np.mat(np.eye((m)))                                        #Create weight diagonal matrix
    for j in range(m):                                                  #Traverse the data set to calculate the weight of each sample
        diffMat = testPoint - xMat[j, :]                                 
        weights[j, j] = np.exp(diffMat * diffMat.T/(-2.0 * k**2))
    xTx = xMat.T * (weights * xMat)                                        
    if np.linalg.det(xTx) == 0.0:
        print("The matrix is singular,No inversion")
        return
    ws = xTx.I * (xMat.T * (weights * yMat))                            #Calculate regression coefficient
    return testPoint * ws


def lwlrTest(testArr, xArr, yArr, k=1.0):  
    """
    Function description:Locally weighted linear regression test
    Parameters:
        testArr - Test data set
        xArr - x data set
        yArr - y data set
        k - Gaussian kernel k,Custom parameters
    Returns:
        yHat - Predicted values for all points in testArr
    """
    m = np.shape(testArr)[0]                                            #Calculate test dataset size
    yHat = np.zeros(m)    
    for i in range(m):                                                    #Predict each sample point
        yHat[i] = lwlr(testArr[i],xArr,yArr,k)
    return yHat


if __name__ == '__main__':
    plotlwlrRegression()


As can be seen, the smaller k is, the more closely the curve follows the training data. However, when k is too small, overfitting occurs, for example when k = 0.003.
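As a quick, hedged check of this statement (not in the original article), the residual sum of squares on the training data itself can be computed for the three kernel sizes, assuming the functions and ex0.txt from the listing above; the training error shrinks as k shrinks, which is exactly the overfitting risk:

import numpy as np

# Sketch only: assumes loadDataSet and lwlrTest are defined as in the listing above.
xArr, yArr = loadDataSet('ex0.txt')
for k in (1.0, 0.01, 0.003):
    yHat = lwlrTest(xArr, xArr, yArr, k)
    rss = ((np.array(yArr) - yHat) ** 2).sum()               #training residual sum of squares
    print('k=%.3f, training RSS=%.4f' % (k, rss))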

3, Example: predicting the age of abalone

Next, we apply regression to real data. The file abalone.txt records measurements of abalone (a kind of marine shellfish) taken from the UCI dataset repository. The age of an abalone can be calculated from the number of layers (rings) in its shell.

The data of the dataset is as follows:

As you can see, the dataset is multidimensional, so it is hard to visualize its distribution. The meaning of each feature column is not given, but that does not matter; we only need to know that the last column is the target value y, the true age of the abalone, and that the preceding columns are features describing the abalone, such as physical measurements of its shell. Without any data cleaning, we use all the features directly to test our locally weighted regression.

Create a new .py file and add the rssError function to evaluate the regression results. The code is as follows:

from matplotlib.font_manager import FontProperties
import matplotlib.pyplot as plt
import numpy as np


def loadDataSet(fileName):
    """
    Function description:Load data
    Parameters:
        fileName - file name
    Returns:
        xArr - x data set
        yArr - y data set
    """
    numFeat = len(open(fileName).readline().split('\t')) - 1
    xArr = []; yArr = []
    fr = open(fileName)
    for line in fr.readlines():
        lineArr =[]
        curLine = line.strip().split('\t')
        for i in range(numFeat):
            lineArr.append(float(curLine[i]))
        xArr.append(lineArr)
        yArr.append(float(curLine[-1]))
    return xArr, yArr


def lwlr(testPoint, xArr, yArr, k = 1.0):
    """
    Function description:Regression coefficients were calculated using locally weighted linear regression w
    Parameters:
        testPoint - Test sample point
        xArr - x data set
        yArr - y data set
        k - Gaussian kernel k,Custom parameters
    Returns:
        testPoint * ws - Predicted value for the test sample point
    """
    xMat = np.mat(xArr); yMat = np.mat(yArr).T
    m = np.shape(xMat)[0]
    weights = np.mat(np.eye((m)))                                        #Create weight diagonal matrix
    for j in range(m):                                                  #Traverse the data set to calculate the weight of each sample
        diffMat = testPoint - xMat[j, :]                                 
        weights[j, j] = np.exp(diffMat * diffMat.T/(-2.0 * k**2))
    xTx = xMat.T * (weights * xMat)                                        
    if np.linalg.det(xTx) == 0.0:
        print("The matrix is singular,No inversion")
        return
    ws = xTx.I * (xMat.T * (weights * yMat))                            #Calculate regression coefficient
    return testPoint * ws


def lwlrTest(testArr, xArr, yArr, k=1.0):  
    """
    Function description:Locally weighted linear regression test
    Parameters:
        testArr - Test data set,Test set
        xArr - x data set,Training set
        yArr - y data set,Training set
        k - Gaussian kernel k,Custom parameters
    Returns:
        yHat - Predicted values for all points in testArr
    """
    m = np.shape(testArr)[0]                                            #Calculate test dataset size
    yHat = np.zeros(m)    
    for i in range(m):                                                    #Predict each sample point
        yHat[i] = lwlr(testArr[i],xArr,yArr,k)
    return yHat


def standRegres(xArr,yArr):
    """
    Function description:Calculate regression coefficient w
    Parameters:
        xArr - x data set
        yArr - y data set
    Returns:
        ws - regression coefficient
    """
    xMat = np.mat(xArr); yMat = np.mat(yArr).T
    xTx = xMat.T * xMat                            #Compute w according to the formula derived above
    if np.linalg.det(xTx) == 0.0:
        print("The matrix is singular and cannot be inverted")
        return
    ws = xTx.I * (xMat.T*yMat)
    return ws


def rssError(yArr, yHatArr):
    """
    Error size evaluation function
    Parameters:
        yArr - Real data
        yHatArr - Forecast data
    Returns:
        Error size
    """
    return ((yArr - yHatArr) **2).sum()


if __name__ == '__main__':
    abX, abY = loadDataSet('abalone.txt')
    print('Training set identical to test set: effect of the kernel size k on locally weighted linear regression predictions:')
    yHat01 = lwlrTest(abX[0:99], abX[0:99], abY[0:99], 0.1)
    yHat1 = lwlrTest(abX[0:99], abX[0:99], abY[0:99], 1)
    yHat10 = lwlrTest(abX[0:99], abX[0:99], abY[0:99], 10)
    print('When k=0.1, the error is:',rssError(abY[0:99], yHat01.T))
    print('When k=1,   the error is:',rssError(abY[0:99], yHat1.T))
    print('When k=10,  the error is:',rssError(abY[0:99], yHat10.T))
    print('')
    print('Training set different from test set: for locally weighted linear regression, is a smaller kernel k always better? Testing on new data gives:')
    yHat01 = lwlrTest(abX[100:199], abX[0:99], abY[0:99], 0.1)
    yHat1 = lwlrTest(abX[100:199], abX[0:99], abY[0:99], 1)
    yHat10 = lwlrTest(abX[100:199], abX[0:99], abY[0:99], 10)
    print('When k=0.1, the error is:',rssError(abY[100:199], yHat01.T))
    print('When k=1,   the error is:',rssError(abY[100:199], yHat1.T))
    print('When k=10,  the error is:',rssError(abY[100:199], yHat10.T))
    print('')
    print('Training set different from test set: comparison of simple linear regression with locally weighted linear regression at k=1:')
    print('When k=1, the error is:', rssError(abY[100:199], yHat1.T))
    ws = standRegres(abX[0:99], abY[0:99])
    yHat = np.mat(abX[100:199]) * ws
    print('Simple linear regression error size:', rssError(abY[100:199], yHat.T.A))


As can be seen, when k=0.1 the error on the training set is small, but when the model is applied to new data the error becomes much larger. This is what is usually called overfitting. We want the models we train to be accurate on the test set too, so that they generalize to new data; in other words, we want to strengthen the model's generality. It can also be seen that with k=1 locally weighted linear regression performs about the same as simple linear regression, which shows that the best model can only be chosen by comparing performance on unseen data. So is the best kernel size 10? Perhaps, but to get a more reliable answer you should run 10 tests on 10 different random samples of the data and compare the results, as in the sketch below.
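The following is a small, hedged sketch (not from the original article) of that idea: repeatedly draw random training and test samples of 100 points each from abalone.txt, evaluate each kernel size with the lwlrTest and rssError functions defined above, and average the errors.

import numpy as np

# Sketch only: assumes loadDataSet, lwlrTest and rssError are defined as in the
# listing above and that abalone.txt is in the working directory.
abX, abY = loadDataSet('abalone.txt')
m = len(abY)
rng = np.random.default_rng(0)
for k in (0.1, 1, 10):
    errors = []
    for _ in range(10):                                      #10 random train/test splits
        idx = rng.permutation(m)
        trainIdx, testIdx = idx[:100], idx[100:200]
        trainX = [abX[i] for i in trainIdx]; trainY = [abY[i] for i in trainIdx]
        testX  = [abX[i] for i in testIdx];  testY  = [abY[i] for i in testIdx]
        yHat = lwlrTest(testX, trainX, trainY, k)            #predict the held-out points
        errors.append(rssError(np.array(testY), yHat))
    print('k=%s, average test error over 10 runs: %.2f' % (k, np.mean(errors)))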

This example shows how to build a model with locally weighted linear regression, which can achieve better results than ordinary linear regression. The drawback of locally weighted linear regression is that it must run over the entire dataset for every single prediction; in other words, all of the training data has to be kept available just to make a prediction.

4, Shrinking coefficients to "understand" the data

1. Ridge regression

If there are more features than samples (the data matrix X has more columns than rows), X^T X is not of full rank and cannot be inverted, so the formula derived earlier cannot be used directly. Ridge regression deals with this by adding λI to X^T X, where I is the identity matrix, which makes the matrix invertible again. The regression coefficients then become:

ŵ = (X^T X + λ I)^{-1} X^T y

Ridge regression was originally introduced to handle the case of more features than samples, but it is also used to add a controlled amount of bias to the estimate and thereby obtain a better solution; the shrinkage parameter λ limits the size of the coefficients and shrinks those that contribute little, which is why it is counted among the shrinkage methods. The code is as follows:

from matplotlib.font_manager import FontProperties
import matplotlib.pyplot as plt
import numpy as np


def loadDataSet(fileName):
    """
    Function description:Load data
    Parameters:
        fileName - file name
    Returns:
        xArr - x data set
        yArr - y data set
    """
    numFeat = len(open(fileName).readline().split('\t')) - 1
    xArr = []; yArr = []
    fr = open(fileName)
    for line in fr.readlines():
        lineArr =[]
        curLine = line.strip().split('\t')
        for i in range(numFeat):
            lineArr.append(float(curLine[i]))
        xArr.append(lineArr)
        yArr.append(float(curLine[-1]))
    return xArr, yArr


def ridgeRegres(xMat, yMat, lam = 0.2):
    """
    Function description:Ridge regression
    Parameters:
        xMat - x data set
        yMat - y data set
        lam - Reduction factor
    Returns:
        ws - regression coefficient
    """
    xTx = xMat.T * xMat
    denom = xTx + np.eye(np.shape(xMat)[1]) * lam
    if np.linalg.det(denom) == 0.0:
        print("The matrix is singular,Cannot transpose")
        return
    ws = denom.I * (xMat.T * yMat)
    return ws


def ridgeTest(xArr, yArr):
    """
    Function description:Ridge regression test
    Parameters:
        xMat - x data set
        yMat - y data set
    Returns:
        wMat - Regression coefficient matrix
    """
    xMat = np.mat(xArr); yMat = np.mat(yArr).T
    #Data standardization
    yMean = np.mean(yMat, axis = 0)                        #Row to row operation, find the mean
    yMat = yMat - yMean                                    #Data minus mean
    xMeans = np.mean(xMat, axis = 0)                    #Row to row operation, find the mean
    xVar = np.var(xMat, axis = 0)                        #Row to row operation, variance
    xMat = (xMat - xMeans) / xVar                        #Data is normalized by subtracting mean divided by variance
    numTestPts = 30                                        #30 different lambda tests
    wMat = np.zeros((numTestPts, np.shape(xMat)[1]))    #Initial regression coefficient matrix
    for i in range(numTestPts):                            #Changing lambda to calculate regression coefficient
        ws = ridgeRegres(xMat, yMat, np.exp(i - 10))    #lambda changes exponentially with e, initially a very small number,
        wMat[i, :] = ws.T                                 #Calculate regression coefficient matrix
    return wMat


def plotwMat():
    """
    Function description:Plot ridge regression coefficient matrix
    """
    font = FontProperties(fname=r"c:\windows\fonts\simsun.ttc", size=14)
    abX, abY = loadDataSet('abalone.txt')
    redgeWeights = ridgeTest(abX, abY)
    fig = plt.figure()
    ax = fig.add_subplot(111)
    ax.plot(redgeWeights)    
    ax_title_text = ax.set_title(u'Relationship between log(lambda) and regression coefficients', fontproperties = font)
    ax_xlabel_text = ax.set_xlabel(u'log(lambda)', fontproperties = font)
    ax_ylabel_text = ax.set_ylabel(u'regression coefficient', fontproperties = font)
    plt.setp(ax_title_text, size = 20, weight = 'bold', color = 'red')
    plt.setp(ax_xlabel_text, size = 10, weight = 'bold', color = 'black')
    plt.setp(ax_ylabel_text, size = 10, weight = 'bold', color = 'black')
    plt.show()


if __name__ == '__main__':
    plotwMat()


The plot shows the relationship between the regression coefficients and log(λ). On the far left, where λ is smallest, the coefficients take their original values (the same as ordinary linear regression); on the far right, the coefficients shrink all the way to 0; somewhere in between, the best prediction results are obtained. The λ that gives the best prediction can be found by cross-validation, which is explained later in the article.

2. Lasso

It is not difficult to show that, when the following constraint is added, ordinary least squares regression yields the same solution as ridge regression:

sum_k (w_k)^2 <= λ

This constraint says that the sum of the squares of all regression coefficients cannot be greater than λ. When two or more features are correlated, ordinary least squares regression may produce one very large positive coefficient and one very large negative coefficient; because of the constraint above, ridge regression avoids this problem.
Similar to ridge regression, another shrinkage method, the lasso, also constrains the regression coefficients; the corresponding constraint is:

sum_k |w_k| <= λ

The only difference is that this constraint uses absolute values instead of squares. Although the change in form is slight, the results differ greatly: when λ is small enough, some coefficients are forced to shrink exactly to 0. This property helps us understand the data better. The two constraints look similar, but the small change greatly increases the computational complexity (to solve for the regression coefficients under the new constraint, a quadratic programming algorithm is needed). Below, a simpler method that gives similar results is introduced: forward stepwise regression.
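Before moving on to forward stepwise regression, a brief aside (not part of the original article's NumPy-only code): in practice the lasso is usually fitted with a ready-made solver. A minimal sketch with scikit-learn's Lasso estimator, assuming loadDataSet from the earlier listings and abalone.txt in the working directory, shows coefficients being driven exactly to 0 as the penalty grows:

import numpy as np
from sklearn.linear_model import Lasso

# Sketch only: standardize the abalone features, then fit the lasso for several
# penalty strengths and watch coefficients shrink to exactly 0.
xArr, yArr = loadDataSet('abalone.txt')
X = np.array(xArr); y = np.array(yArr)
X = (X - X.mean(axis=0)) / X.std(axis=0)
for alpha in (0.001, 0.01, 0.1, 1.0):
    model = Lasso(alpha=alpha).fit(X, y)
    print(alpha, np.round(model.coef_, 3))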

3. Forward stepwise regression

Forward stepwise linear regression is a greedy algorithm: at each step it reduces the error as much as possible. Instead of computing the regression coefficients from a formula, we slightly adjust one regression coefficient at a time and compute the resulting prediction error; the set of coefficients that gives the smallest error becomes the current best set of coefficients.

The implementation of forward stepwise linear regression is also very simple. Of course, the data should be standardized first. The code is as follows:

from matplotlib.font_manager import FontProperties
import matplotlib.pyplot as plt
import numpy as np


def loadDataSet(fileName):
    """
    Function description:Load data
    Parameters:
        fileName - file name
    Returns:
        xArr - x data set
        yArr - y data set
    """
    numFeat = len(open(fileName).readline().split('\t')) - 1
    xArr = []; yArr = []
    fr = open(fileName)
    for line in fr.readlines():
        lineArr =[]
        curLine = line.strip().split('\t')
        for i in range(numFeat):
            lineArr.append(float(curLine[i]))
        xArr.append(lineArr)
        yArr.append(float(curLine[-1]))
    return xArr, yArr
 
 
def regularize(xMat, yMat):
    """
    Function description:Data standardization
    Parameters:
        xMat - x data set
        yMat - y data set
    Returns:
        inxMat - Standardized x data set
        inyMat - Standardized y data set
    """    
    inxMat = xMat.copy()                                                        #Data copy
    inyMat = yMat.copy()
    yMean = np.mean(yMat, 0)                                                    #Row to row operation, find the mean
    inyMat = yMat - yMean                                                        #Data minus mean
    inMeans = np.mean(inxMat, 0)                                                   #Row to row operation, find the mean
    inVar = np.var(inxMat, 0)                                                     #Row to row operation, variance
    inxMat = (inxMat - inMeans) / inVar                                            #Data is normalized by subtracting mean divided by variance
    return inxMat, inyMat
 
 
def rssError(yArr,yHatArr):
    """
    Function description:Calculate square error
    Parameters:
        yArr - True values
        yHatArr - Predicted values
    Returns:
        Sum of squared errors
    """
    return ((yArr-yHatArr)**2).sum()


def stageWise(xArr, yArr, eps = 0.01, numIt = 100):
    """
    Function description:Forward stepwise linear regression
    Parameters:
        xArr - x input data
        yArr - y Forecast data
        eps - Step size to be adjusted for each iteration
        numIt - Number of iterations
    Returns:
        returnMat - numIt Regression coefficient matrix of second iteration
    """
    xMat = np.mat(xArr); yMat = np.mat(yArr).T                                         #data set
    xMat, yMat = regularize(xMat, yMat)                                                #Data standardization
    m, n = np.shape(xMat)
    returnMat = np.zeros((numIt, n))                                                #Initialize the regression coefficient matrix of numIt iterations
    ws = np.zeros((n, 1))                                                            #Initialize regression coefficient matrix
    wsTest = ws.copy()
    wsMax = ws.copy()
    for i in range(numIt):                                                            #Iteration numIt times
        # print(ws.T)                                                                    #Print current regression coefficient matrix
        lowestError = float('inf');                                                 #Positive infinity
        for j in range(n):                                                            #Traverse the regression coefficients of each feature
            for sign in [-1, 1]:
                wsTest = ws.copy()
                wsTest[j] += eps * sign                                                #Fine tuning regression coefficient
                yTest = xMat * wsTest                                                #Calculate predicted value
                rssE = rssError(yMat.A, yTest.A)                                    #Calculate square error
                if rssE < lowestError:                                                #If the error is smaller, the current optimal regression coefficient is updated
                    lowestError = rssE
                    wsMax = wsTest
        ws = wsMax.copy()
        returnMat[i,:] = ws.T                                                         #Record the regression coefficient matrix of numIt iterations
    return returnMat


def plotstageWiseMat():
    """
    Function description:Plot the forward stepwise regression coefficient matrix
    """
    font = FontProperties(fname=r"c:\windows\fonts\simsun.ttc", size=14)
    xArr, yArr = loadDataSet('abalone.txt')
    returnMat = stageWise(xArr, yArr, 0.005, 1000)
    fig = plt.figure()
    ax = fig.add_subplot(111)
    ax.plot(returnMat)    
    ax_title_text = ax.set_title(u'Forward stepwise regression:Relationship between iteration number and regression coefficient', fontproperties = font)
    ax_xlabel_text = ax.set_xlabel(u'Number of iterations', fontproperties = font)
    ax_ylabel_text = ax.set_ylabel(u'regression coefficient', fontproperties = font)
    plt.setp(ax_title_text, size = 15, weight = 'bold', color = 'red')
    plt.setp(ax_xlabel_text, size = 10, weight = 'bold', color = 'black')
    plt.setp(ax_ylabel_text, size = 10, weight = 'bold', color = 'black')
    plt.show()


if __name__ == '__main__':
    plotstageWiseMat()


Again, we plot the relationship between the number of iterations and the regression coefficients. As can be seen, some coefficients stay around 0 from beginning to end, which means they have essentially no influence on the target; those features are probably unnecessary. The advantage of the stepwise linear regression algorithm is that it helps people understand and improve a model: after building a model, the algorithm can be run to find the important features, making it possible to stop collecting the unimportant ones in time.

To sum up:

Shrinkage methods (such as stepwise linear regression or ridge regression) reduce some coefficients to very small values or directly to 0. This increases the bias of the model (by reducing the weight given to some features), but shrinking some regression coefficients to 0 also reduces the model's complexity; once the redundant features are eliminated, the model is easier to understand and the prediction error can decrease. The amount of shrinkage has to be chosen carefully, however: if it is chosen badly the model can still overfit, that is, predict very well on the training set but much worse on the test set.

5, Example: forecasting the price of LEGO toy sets

LEGO produces building toys assembled from many plastic bricks of different sizes.

These bricks are generally sold in sets that can be assembled into many different things, such as ships, castles and famous buildings; a LEGO set contains anywhere from 10 to 5,000 parts.

A LEGO set is usually discontinued after a few years, but collectors continue to trade sets after they go out of production. This example uses regression to predict the price at which collectors trade a given set.

1. Get data

We get the information we need by parsing the html file:

from bs4 import BeautifulSoup

 
def scrapePage(retX, retY, inFile, yr, numPce, origPrc):
    """
    Function description:Read data from the page and generate retX and retY list
    Parameters:
        retX - data X
        retY - data Y
        inFile - HTML file
        yr - particular year
        numPce - Number of LEGO parts
        origPrc - original price
    Returns:
        nothing
    """
    # Open and read HTML file
    with open(inFile, encoding='utf-8') as f:
        html = f.read()
    soup = BeautifulSoup(html, 'html.parser')
    i = 1
    # Parse according to HTML page structure
    currentRow = soup.find_all('table', r = "%d" % i)
    while(len(currentRow) != 0):
        currentRow = soup.find_all('table', r = "%d" % i)
        title = currentRow[0].find_all('a')[1].text
        lwrTitle = title.lower()
        # Find out if there are brand new labels
        if (lwrTitle.find('new') > -1) or (lwrTitle.find('nisb') > -1):
            newFlag = 1.0
        else:
            newFlag = 0.0
        # Check whether the item has been sold; we only collect data for items that sold
        soldUnicde = currentRow[0].find_all('td')[3].find_all('span')
        if len(soldUnicde) == 0:
            print("commodity #%d not sold "% i)
        else:
            # Analyze the page to get the current price
            soldPrice = currentRow[0].find_all('td')[4]
            priceStr = soldPrice.text
            priceStr = priceStr.replace('$','')
            priceStr = priceStr.replace(',','')
            if len(soldPrice) > 1:
                priceStr = priceStr.replace('Free shipping', '')
            sellingPrice = float(priceStr)
            # Remove incomplete package price
            if  sellingPrice > origPrc * 0.5:
                print("%d\t%d\t%d\t%f\t%f" % (yr, numPce, newFlag, origPrc, sellingPrice))
                retX.append([yr, numPce, newFlag, origPrc])
                retY.append(sellingPrice)
        i += 1
        currentRow = soup.find_all('table', r = "%d" % i)
 

def setDataCollect(retX, retY):
    """
    Function description:Read the data of six Lego sets in turn and generate the data matrix
    Parameters:
        nothing
    Returns:
        nothing
    """
    scrapePage(retX, retY, './lego/lego8288.html', 2006, 800, 49.99)                #LEGO 8288 in 2006, number of parts 800, original price 49.99
    scrapePage(retX, retY, './lego/lego10030.html', 2002, 3096, 269.99)                #LEGO 10030 in 2002, number of parts 3096, original price 269.99
    scrapePage(retX, retY, './lego/lego10179.html', 2007, 5195, 499.99)                #LEGO 10179 in 2007, number of parts 5195, original price 499.99
    scrapePage(retX, retY, './lego/lego10181.html', 2007, 3428, 199.99)                #LEGO 10181 in 2007, number of parts 3428, original price 199.99
    scrapePage(retX, retY, './lego/lego10189.html', 2008, 5922, 299.99)                #LEGO 10189 in 2008, number of parts 5922, original price 299.99
    scrapePage(retX, retY, './lego/lego10196.html', 2009, 3263, 249.99)                #LEGO 10196 in 2009, 3263 parts, original price 249.99
 
 
if __name__ == '__main__':
    lgX = []
    lgY = []
    setDataCollect(lgX, lgY)


Items that did not sell are filtered out during parsing. The features collected are: year of production, number of parts, whether the set is brand new, and the original price; the target is the selling price (the second-hand transaction price).

2. Model building

After the dataset has been processed, the next step is to train a model. First, we need to add a constant feature column X0 whose value is always 1, because linear regression expects the first column to be 1.0 (it provides the intercept term). Then we use the simplest ordinary linear regression. The code is as follows:

import numpy as np
from bs4 import BeautifulSoup


def scrapePage(retX, retY, inFile, yr, numPce, origPrc):
    """
    Function description:Read data from the page and generate retX and retY list
    Parameters:
        retX - data X
        retY - data Y
        inFile - HTML file
        yr - particular year
        numPce - Number of LEGO parts
        origPrc - original price
    Returns:
        nothing
    """
    # Open and read HTML file
    with open(inFile, encoding='utf-8') as f:
        html = f.read()
    soup = BeautifulSoup(html, 'html.parser')
    i = 1
    # Parse according to HTML page structure
    currentRow = soup.find_all('table', r = "%d" % i)
    while(len(currentRow) != 0):
        currentRow = soup.find_all('table', r = "%d" % i)
        title = currentRow[0].find_all('a')[1].text
        lwrTitle = title.lower()
        # Find out if there are brand new labels
        if (lwrTitle.find('new') > -1) or (lwrTitle.find('nisb') > -1):
            newFlag = 1.0
        else:
            newFlag = 0.0
        # Check whether the item has been sold; we only collect data for items that sold
        soldUnicde = currentRow[0].find_all('td')[3].find_all('span')
        if len(soldUnicde) == 0:
            print("commodity #%d not sold "% i)
        else:
            # Analyze the page to get the current price
            soldPrice = currentRow[0].find_all('td')[4]
            priceStr = soldPrice.text
            priceStr = priceStr.replace('$','')
            priceStr = priceStr.replace(',','')
            if len(soldPrice) > 1:
                priceStr = priceStr.replace('Free shipping', '')
            sellingPrice = float(priceStr)
            # Remove incomplete package price
            if  sellingPrice > origPrc * 0.5:
                print("%d\t%d\t%d\t%f\t%f" % (yr, numPce, newFlag, origPrc, sellingPrice))
                retX.append([yr, numPce, newFlag, origPrc])
                retY.append(sellingPrice)
        i += 1
        currentRow = soup.find_all('table', r = "%d" % i)        


def setDataCollect(retX, retY):
    """
    Function description:Read the data of six Lego sets in turn and generate the data matrix
    Parameters:
        nothing
    Returns:
        nothing
    """
    scrapePage(retX, retY, './lego/lego8288.html', 2006, 800, 49.99)                #LEGO 8288 in 2006, number of parts 800, original price 49.99
    scrapePage(retX, retY, './lego/lego10030.html', 2002, 3096, 269.99)                #LEGO 10030 in 2002, number of parts 3096, original price 269.99
    scrapePage(retX, retY, './lego/lego10179.html', 2007, 5195, 499.99)                #LEGO 10179 in 2007, number of parts 5195, original price 499.99
    scrapePage(retX, retY, './lego/lego10181.html', 2007, 3428, 199.99)                #LEGO 10181 in 2007, number of parts 3428, original price 199.99
    scrapePage(retX, retY, './lego/lego10189.html', 2008, 5922, 299.99)                #LEGO 10189 in 2008, number of parts 5922, original price 299.99
    scrapePage(retX, retY, './lego/lego10196.html', 2009, 3263, 249.99)                #LEGO 10196 in 2009, 3263 parts, original price 249.99


def regularize(xMat, yMat):
    """
    Function description:Data standardization
    Parameters:
        xMat - x data set
        yMat - y data set
    Returns:
        inxMat - Standardized x data set
        inyMat - Standardized y data set
    """    
    inxMat = xMat.copy()                                                        #Data copy
    inyMat = yMat.copy()
    yMean = np.mean(yMat, 0)                                                    #Row to row operation, find the mean
    inyMat = yMat - yMean                                                        #Data minus mean
    inMeans = np.mean(inxMat, 0)                                                   #Row to row operation, find the mean
    inVar = np.var(inxMat, 0)                                                     #Row to row operation, variance
    # print(inxMat)
    print(inMeans)
    # print(inVar)
    inxMat = (inxMat - inMeans) / inVar                                            #Data is normalized by subtracting mean divided by variance
    return inxMat, inyMat


def rssError(yArr,yHatArr):
    """
    Function description:Calculate square error
    Parameters:
        yArr - True values
        yHatArr - Predicted values
    Returns:
        Sum of squared errors
    """
    return ((yArr-yHatArr)**2).sum()


def standRegres(xArr,yArr):
    """
    Function description:Calculate regression coefficient w
    Parameters:
        xArr - x data set
        yArr - y data set
    Returns:
        ws - regression coefficient
    """
    xMat = np.mat(xArr); yMat = np.mat(yArr).T
    xTx = xMat.T * xMat                            #Compute w according to the formula derived above
    if np.linalg.det(xTx) == 0.0:
        print("The matrix is singular and cannot be inverted")
        return
    ws = xTx.I * (xMat.T*yMat)
    return ws
 
 
def useStandRegres():
    """
    Function description:Using simple linear regression
    Parameters:
        nothing
    Returns:
        nothing
    """
    lgX = []
    lgY = []
    setDataCollect(lgX, lgY)
    data_num, features_num = np.shape(lgX)
    lgX1 = np.mat(np.ones((data_num, features_num + 1)))
    lgX1[:, 1:5] = np.mat(lgX)
    ws = standRegres(lgX1, lgY)
    print('%f%+f*year%+f*numParts%+f*isNew%+f*originalPrice' % (ws[0],ws[1],ws[2],ws[3],ws[4]))
 
 
if __name__ == '__main__':
    useStandRegres()


The formula learned by the model is printed by the program. Although this model fits the data well, it does not seem to make much sense: it suggests that the more parts a set has, the lower its price, which is unreasonable.

Next we use ridge regression with cross-validation to find the λ that minimizes the error and the corresponding regression coefficients. The code is as follows:

import numpy as np
from bs4 import BeautifulSoup
import random
 
 
def scrapePage(retX, retY, inFile, yr, numPce, origPrc):
    """
    Function description:Read data from the page and generate retX and retY list
    Parameters:
        retX - data X
        retY - data Y
        inFile - HTML file
        yr - particular year
        numPce - Number of LEGO parts
        origPrc - original price
    Returns:
        nothing
    """
    # Open and read HTML file
    with open(inFile, encoding='utf-8') as f:
        html = f.read()
    soup = BeautifulSoup(html, 'html.parser')
    i = 1
    # Parse according to HTML page structure
    currentRow = soup.find_all('table', r = "%d" % i)
    while(len(currentRow) != 0):
        currentRow = soup.find_all('table', r = "%d" % i)
        title = currentRow[0].find_all('a')[1].text
        lwrTitle = title.lower()
        # Find out if there are brand new labels
        if (lwrTitle.find('new') > -1) or (lwrTitle.find('nisb') > -1):
            newFlag = 1.0
        else:
            newFlag = 0.0
        # Check whether the item has been sold; we only collect data for items that sold
        soldUnicde = currentRow[0].find_all('td')[3].find_all('span')
        if len(soldUnicde) == 0:
            print("commodity #%d not sold "% i)
        else:
            # Analyze the page to get the current price
            soldPrice = currentRow[0].find_all('td')[4]
            priceStr = soldPrice.text
            priceStr = priceStr.replace('$','')
            priceStr = priceStr.replace(',','')
            if len(soldPrice) > 1:
                priceStr = priceStr.replace('Free shipping', '')
            sellingPrice = float(priceStr)
            # Remove incomplete package price
            if  sellingPrice > origPrc * 0.5:
                print("%d\t%d\t%d\t%f\t%f" % (yr, numPce, newFlag, origPrc, sellingPrice))
                retX.append([yr, numPce, newFlag, origPrc])
                retY.append(sellingPrice)
        i += 1
        currentRow = soup.find_all('table', r = "%d" % i)
 
 
def ridgeRegres(xMat, yMat, lam = 0.2):
    """
    Function description:Ridge regression
    Parameters:
        xMat - x data set
        yMat - y data set
        lam - Reduction factor
    Returns:
        ws - regression coefficient
    """
    xTx = xMat.T * xMat
    denom = xTx + np.eye(np.shape(xMat)[1]) * lam
    if np.linalg.det(denom) == 0.0:
        print("The matrix is singular,Cannot transpose")
        return
    ws = denom.I * (xMat.T * yMat)
    return ws
 
 
def setDataCollect(retX, retY):
    """
    Function description:Read the data of six Lego sets in turn and generate the data matrix
    Parameters:
        nothing
    Returns:
        nothing
    """
    scrapePage(retX, retY, './lego/lego8288.html', 2006, 800, 49.99)                #LEGO 8288 in 2006, number of parts 800, original price 49.99
    scrapePage(retX, retY, './lego/lego10030.html', 2002, 3096, 269.99)                #LEGO 10030 in 2002, number of parts 3096, original price 269.99
    scrapePage(retX, retY, './lego/lego10179.html', 2007, 5195, 499.99)                #LEGO 10179 in 2007, number of parts 5195, original price 499.99
    scrapePage(retX, retY, './lego/lego10181.html', 2007, 3428, 199.99)                #LEGO 10181 in 2007, number of parts 3428, original price 199.99
    scrapePage(retX, retY, './lego/lego10189.html', 2008, 5922, 299.99)                #LEGO 10189 in 2008, number of parts 5922, original price 299.99
    scrapePage(retX, retY, './lego/lego10196.html', 2009, 3263, 249.99)                #LEGO 10196 in 2009, 3263 parts, original price 249.99
 
 
def regularize(xMat, yMat):
    """
    Function description:Data standardization
    Parameters:
        xMat - x data set
        yMat - y data set
    Returns:
        inxMat - Standardized x data set
        inyMat - Standardized y data set
    """    
    inxMat = xMat.copy()                                                        #Data copy
    inyMat = yMat.copy()
    yMean = np.mean(yMat, 0)                                                    #Row to row operation, find the mean
    inyMat = yMat - yMean                                                        #Data minus mean
    inMeans = np.mean(inxMat, 0)                                                   #Row to row operation, find the mean
    inVar = np.var(inxMat, 0)                                                     #Row to row operation, variance
    # print(inxMat)
    print(inMeans)
    # print(inVar)
    inxMat = (inxMat - inMeans) / inVar                                            #Data is normalized by subtracting mean divided by variance
    return inxMat, inyMat

 
def rssError(yArr,yHatArr):
    """
    Function description:Calculate square error
    Parameters:
        yArr - True values
        yHatArr - Predicted values
    Returns:
        Sum of squared errors
    """
    return ((yArr-yHatArr)**2).sum()


def standRegres(xArr,yArr):
    """
    Function description:Calculate regression coefficient w
    Parameters:
        xArr - x data set
        yArr - y data set
    Returns:
        ws - regression coefficient
    """
    xMat = np.mat(xArr); yMat = np.mat(yArr).T
    xTx = xMat.T * xMat                            #Compute w according to the formula derived above
    if np.linalg.det(xTx) == 0.0:
        print("The matrix is singular and cannot be inverted")
        return
    ws = xTx.I * (xMat.T*yMat)
    return ws
 
 
def crossValidation(xArr, yArr, numVal = 10):
    """
    Function description:Cross validation ridge regression
    Parameters:
        xArr - x data set
        yArr - y data set
        numVal - Number of cross validation
    Returns:
        wMat - Regression coefficient matrix
    """
    m = len(yArr)                                                                        #Number of statistical samples                       
    indexList = list(range(m))                                                            #Generate index value list
    errorMat = np.zeros((numVal,30))                                                    #Error matrix: numVal rows, 30 columns
    for i in range(numVal):                                                                #Cross validation numVal times
        trainX = []; trainY = []                                                        #Training set
        testX = []; testY = []                                                            #Test set
        random.shuffle(indexList)                                                        #Disorganize
        for j in range(m):                                                                #Divided data set: 90% training set and 10% test set
            if j < m * 0.9:
                trainX.append(xArr[indexList[j]])
                trainY.append(yArr[indexList[j]])
            else:
                testX.append(xArr[indexList[j]])
                testY.append(yArr[indexList[j]])
        wMat = ridgeTest(trainX, trainY)                                                #The ridge regression coefficients under 30 different lambda were obtained
        for k in range(30):                                                                #Traverse all ridge regression coefficients
            matTestX = np.mat(testX); matTrainX = np.mat(trainX)                        #Test and training set matrices
            meanTrain = np.mean(matTrainX,0)                                            #Training set mean
            varTrain = np.var(matTrainX,0)                                                #Training set variance
            matTestX = (matTestX - meanTrain) / varTrain                                 #Standardize the test set with training set statistics
            yEst = matTestX * np.mat(wMat[k,:]).T + np.mean(trainY)                        #Predict y value according to ws
            errorMat[i, k] = rssError(yEst.T.A, np.array(testY))                            #Statistical error
    meanErrors = np.mean(errorMat,0)                                                    #Calculate the average error of each cross validation
    minMean = float(min(meanErrors))                                                    #Find the minimum error
    bestWeights = wMat[np.nonzero(meanErrors == minMean)]                                #Find the best regression coefficient
    xMat = np.mat(xArr); yMat = np.mat(yArr).T
    meanX = np.mean(xMat,0); varX = np.var(xMat,0)
    unReg = bestWeights / varX                                                            #Data is standardized and therefore needs to be restored
    print('%f%+f*year%+f*numParts%+f*isNew%+f*originalPrice' % ((-1 * np.sum(np.multiply(meanX,unReg)) + np.mean(yMat)), unReg[0,0], unReg[0,1], unReg[0,2], unReg[0,3]))
 
 
def ridgeTest(xArr, yArr):
    """
    Function description:Ridge regression test
    Parameters:
        xMat - x data set
        yMat - y data set
    Returns:
        wMat - Regression coefficient matrix
    """
    xMat = np.mat(xArr); yMat = np.mat(yArr).T
    #Data standardization
    yMean = np.mean(yMat, axis = 0)                        #Row to row operation, find the mean
    yMat = yMat - yMean                                    #Data minus mean
    xMeans = np.mean(xMat, axis = 0)                    #Row to row operation, find the mean
    xVar = np.var(xMat, axis = 0)                        #Row to row operation, variance
    xMat = (xMat - xMeans) / xVar                        #Data is normalized by subtracting mean divided by variance
    numTestPts = 30                                        #30 different lambda tests
    wMat = np.zeros((numTestPts, np.shape(xMat)[1]))    #Initial regression coefficient matrix
    for i in range(numTestPts):                            #Changing lambda to calculate regression coefficient
        ws = ridgeRegres(xMat, yMat, np.exp(i - 10))    #lambda changes exponentially with e, initially a very small number,
        wMat[i, :] = ws.T                                 #Calculate regression coefficient matrix
    return wMat
 
 
if __name__ == '__main__':
    lgX = []
    lgY = []
    setDataCollect(lgX, lgY)
    crossValidation(lgX, lgY)


The training and test samples are selected at random here, so the results may differ slightly from run to run. Even so, the model found by cross-validated ridge regression is not very different from the one produced by the conventional least squares method, i.e. ordinary linear regression. We had hoped to find a more interpretable model, and that goal clearly was not achieved.
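To make that comparison concrete, here is a minimal sketch, assuming crossValidation (defined above) and setDataCollect and standRegres (defined in the listing below) are all available in one script; the explicit intercept column added to lgX1 is an illustrative detail, not part of the original listing.

import numpy as np

lgX = []; lgY = []
setDataCollect(lgX, lgY)

# Ordinary least squares needs an explicit constant column for the intercept term
lgX1 = np.mat(np.ones((len(lgX), 5)))
lgX1[:, 1:5] = np.mat(lgX)
ws = standRegres(lgX1, lgY)
print('OLS: %f%+f*year%+f*number of pieces%+f*brand new%+f*original price' %
      (ws[0, 0], ws[1, 0], ws[2, 0], ws[3, 0], ws[4, 0]))

# The cross-validated ridge model printed here can then be compared side by side
crossValidation(lgX, lgY)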

Now let's look at how the regression coefficients change during the reduction (shrinkage) process. The code is as follows:

import numpy as np
from bs4 import BeautifulSoup
import random
 
 
def scrapePage(retX, retY, inFile, yr, numPce, origPrc):
    """
    Function description:Read data from the page and generate retX and retY list
    Parameters:
        retX - data X
        retY - data Y
        inFile - HTML file
        yr - particular year
        numPce - Number of LEGO parts
        origPrc - original price
    Returns:
        nothing
    """
    # Open and read HTML file
    with open(inFile, encoding='utf-8') as f:
        html = f.read()
    soup = BeautifulSoup(html, 'html.parser')
    i = 1
    # Parse according to HTML page structure
    currentRow = soup.find_all('table', r = "%d" % i)
    while(len(currentRow) != 0):
        currentRow = soup.find_all('table', r = "%d" % i)
        title = currentRow[0].find_all('a')[1].text
        lwrTitle = title.lower()
        # Find out if there are brand new labels
        if (lwrTitle.find('new') > -1) or (lwrTitle.find('nisb') > -1):
            newFlag = 1.0
        else:
            newFlag = 0.0
        # Check whether the item has sold; only sold items are collected
        soldUnicde = currentRow[0].find_all('td')[3].find_all('span')
        if len(soldUnicde) == 0:
            print("Item #%d did not sell" % i)
        else:
            # Analyze the page to get the current price
            soldPrice = currentRow[0].find_all('td')[4]
            priceStr = soldPrice.text
            priceStr = priceStr.replace('$','')
            priceStr = priceStr.replace(',','')
            if len(soldPrice) > 1:
                priceStr = priceStr.replace('Free shipping', '')
            sellingPrice = float(priceStr)
            # Discard listings whose selling price is below half the original price (likely incomplete sets)
            if  sellingPrice > origPrc * 0.5:
                print("%d\t%d\t%d\t%f\t%f" % (yr, numPce, newFlag, origPrc, sellingPrice))
                retX.append([yr, numPce, newFlag, origPrc])
                retY.append(sellingPrice)
        i += 1
        currentRow = soup.find_all('table', r = "%d" % i)
 
 
def ridgeRegres(xMat, yMat, lam = 0.2):
    """
    Function description:Ridge regression
    Parameters:
        xMat - x data set
        yMat - y data set
        lam - Reduction factor
    Returns:
        ws - regression coefficient
    """
    xTx = xMat.T * xMat
    denom = xTx + np.eye(np.shape(xMat)[1]) * lam
    if np.linalg.det(denom) == 0.0:
        print("The matrix is singular and cannot be inverted")
        return
    ws = denom.I * (xMat.T * yMat)
    return ws


def setDataCollect(retX, retY):
    """
    Function description:Read the data of six Lego sets in turn and generate the data matrix
    Parameters:
        retX - data X
        retY - data Y
    Returns:
        nothing
    """
    scrapePage(retX, retY, './lego/lego8288.html', 2006, 800, 49.99)                #LEGO 8288 in 2006, number of parts 800, original price 49.99
    scrapePage(retX, retY, './lego/lego10030.html', 2002, 3096, 269.99)                #LEGO 10030 in 2002, number of parts 3096, original price 269.99
    scrapePage(retX, retY, './lego/lego10179.html', 2007, 5195, 499.99)                #LEGO 10179 in 2007, number of parts 5195, original price 499.99
    scrapePage(retX, retY, './lego/lego10181.html', 2007, 3428, 199.99)                #LEGO 10181 in 2007, number of parts 3428, original price 199.99
    scrapePage(retX, retY, './lego/lego10189.html', 2008, 5922, 299.99)                #LEGO 10189 in 2008, number of parts 5922, original price 299.99
    scrapePage(retX, retY, './lego/lego10196.html', 2009, 3263, 249.99)                #LEGO 10196 in 2009, 3263 parts, original price 249.99
 
 
def regularize(xMat, yMat):
    """
    Function description:Data standardization
    Parameters:
        xMat - x data set
        yMat - y data set
    Returns:
        inxMat - Standardized x data set
        inyMat - Standardized y data set
    """    
    inxMat = xMat.copy()                                                        #Data copy
    inyMat = yMat.copy()
    yMean = np.mean(yMat, 0)                                                    #Column-wise operation (axis=0): mean of y
    inyMat = yMat - yMean                                                       #Subtract the mean from y
    inMeans = np.mean(inxMat, 0)                                                #Column-wise operation (axis=0): mean of each feature
    inVar = np.var(inxMat, 0)                                                   #Column-wise operation (axis=0): variance of each feature
    inxMat = (inxMat - inMeans) / inVar                                         #Standardize: subtract the mean and divide by the variance
    return inxMat, inyMat
 
 
def rssError(yArr,yHatArr):
    """
    Function description:Calculate square error
    Parameters:
        yArr - True values
        yHatArr - Predicted values
    Returns:
        Sum of squared errors
    """
    return ((yArr-yHatArr)**2).sum()


def standRegres(xArr,yArr):
    """
    Function description:Calculate regression coefficient w
    Parameters:
        xArr - x data set
        yArr - y data set
    Returns:
        ws - regression coefficient
    """
    xMat = np.mat(xArr); yMat = np.mat(yArr).T
    xTx = xMat.T * xMat                            #Compute the regression coefficients with the formula derived earlier in this article
    if np.linalg.det(xTx) == 0.0:
        print("The matrix is singular and cannot be inverted")
        return
    ws = xTx.I * (xMat.T*yMat)
    return ws
 
 
def ridgeTest(xArr, yArr):
    """
    Function description:Ridge regression test
    Parameters:
        xArr - x data set
        yArr - y data set
    Returns:
        wMat - Regression coefficient matrix
    """
    xMat = np.mat(xArr); yMat = np.mat(yArr).T
    #Data standardization
    yMean = np.mean(yMat, axis = 0)                        #Column-wise operation (axis=0): mean of y
    yMat = yMat - yMean                                    #Subtract the mean from y
    xMeans = np.mean(xMat, axis = 0)                       #Column-wise operation (axis=0): mean of each feature
    xVar = np.var(xMat, axis = 0)                          #Column-wise operation (axis=0): variance of each feature
    xMat = (xMat - xMeans) / xVar                          #Standardize: subtract the mean and divide by the variance
    numTestPts = 30                                        #30 different lambda tests
    wMat = np.zeros((numTestPts, np.shape(xMat)[1]))    #Initial regression coefficient matrix
    for i in range(numTestPts):                            #Changing lambda to calculate regression coefficient
        ws = ridgeRegres(xMat, yMat, np.exp(i - 10))    #lambda varies on an exponential scale (e^(i-10)), starting from a very small value
        wMat[i, :] = ws.T                                 #Calculate regression coefficient matrix
    return wMat
 
 
if __name__ == '__main__':
    lgX = []
    lgY = []
    setDataCollect(lgX, lgY)
    print(ridgeTest(lgX, lgY))
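The printed 30×4 matrix is easier to interpret as a plot. Here is a minimal sketch, assuming matplotlib is available and reusing setDataCollect and ridgeTest from the listing above; the axis labels are only illustrative.

import numpy as np
import matplotlib.pyplot as plt

lgX = []; lgY = []
setDataCollect(lgX, lgY)
wMat = ridgeTest(lgX, lgY)

logLambdas = np.arange(30) - 10          # the exponents used inside ridgeTest: lambda = e^(i-10)
plt.plot(logLambdas, wMat)               # one curve per feature, showing how each coefficient shrinks
plt.xlabel('log(lambda)')
plt.ylabel('regression coefficient')
plt.show()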


Looking at the first row of the output (the solution for the smallest lambda), you can see that the coefficient with the largest absolute value belongs to the 4th feature, and the second largest to the 2nd feature.

Therefore, if only one feature can be used for prediction, we should choose the 4th feature; if two features are allowed, we should choose the 4th and the 2nd.

This kind of analysis lets us uncover the internal structure of a large amount of data. With only four features the benefit may not be obvious, but when there are more than 100 features the method becomes very effective: it points out which features are key and which are unimportant.
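As a small illustration, the following sketch ranks the features of the first ridge solution (the smallest lambda) by the absolute value of their standardized coefficients; it assumes the lgX/lgY data and ridgeTest from the listing above, and the feature names are just labels added for readability.

import numpy as np

wMat = ridgeTest(lgX, lgY)
names = ['year', 'number of pieces', 'brand new', 'original price']
firstRow = np.abs(np.asarray(wMat[0]))                      # coefficients at the smallest lambda
order = np.argsort(-firstRow)                               # feature indices, largest magnitude first
for rank, idx in enumerate(order, start=1):
    print('%d. feature %d (%s): |w| = %.4f' % (rank, idx + 1, names[idx], firstRow[idx]))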

6, Summary

  • Like classification, regression is also the process of predicting the target value. The difference between regression and classification is that the former predicts continuous variables, while the latter predicts discrete variables.
  • Ridge regression is a kind of reduction (shrinkage) method; it is equivalent to constraining the size of the regression coefficients. Another good reduction method is the lasso. The lasso is difficult to solve directly, but an approximate solution can be obtained with the simple stepwise linear regression method.
  • A reduction method can also be viewed as adding bias to a model while at the same time reducing its variance.
