Explaining linear regression and gradient descent in plain language

from sklearn.datasets import load_boston  # note: load_boston is removed in scikit-learn >= 1.2; this example targets older versions
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LinearRegression, SGDRegressor
from sklearn.metrics import mean_squared_error
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns


def linear_model():
    # Load the data
    boston = load_boston()
    # Basic data inspection
    df = pd.DataFrame(boston.data)
    print(df.head())
    print(df.columns.values)
    # Append the target as column 13 so feature/target correlation can be inspected
    df[13] = boston.target
    # Feature/target correlation
    # sns.heatmap(df.corr(), linewidth=0.1, square=True, annot=True)
    # plt.savefig('../feature correlation.png')
    # plt.plot(df[5], df[13])
    # plt.savefig('./0.7 line plot of the most correlated feature vs the target.png')
    # plt.scatter(df[5], df[13])
    # plt.savefig('./0.7 scatter plot of the most correlated feature vs the target.png')

    # Before feature engineering
    # for i in range(0, 13):
    #     plt.scatter(df[i], df[13].values)
    # plt.savefig('../scatter plot of all features vs the target, no feature processing.png')

    # After feature engineering
    # transfer = StandardScaler()
    # for i in range(0, 13):
    #     list1 = []
    #     for j in df[i].values:
    #         list2 = []
    #         list2.append(j)
    #         list1.append(list2)
    #     plt.scatter(transfer.fit_transform(list1), df[13].values)
    # plt.savefig('../scatter plot of all features vs the target after feature engineering.png')
    # plt.show()
    # StandardScaler expects samples arranged by rows (a 2D array), while df stores each feature as a column

    # Train/test split
    x_train, x_test, y_train, y_test = train_test_split(boston.data, boston.target,
                                                         test_size=0.2)
    # x_train, x_test, y_train, y_test = train_test_split(boston.data, boston.target, random_state=22)

    # Feature engineering: standardization
    transfer = StandardScaler()
    x_train = transfer.fit_transform(x_train)
    print(x_train)
    # Reuse the scaler fitted on the training set; do not refit it on the test set
    x_test = transfer.transform(x_test)
    print(len(x_test), len(y_test))
    # plt.scatter(x_test, y_test)
    # plt.show()
    # Linear regression
    # estimator = LinearRegression()  # normal equation solution
    estimator = SGDRegressor(max_iter=1000)  # max_iter: maximum number of iterations
    estimator.fit(x_train, y_train)
    print('Model intercept b:\n', estimator.intercept_)
    print('Model coefficients w:\n', estimator.coef_)
    # Model evaluation
    y_pre = estimator.predict(x_test)
    print("Predicted values:", y_pre)
    # Fitting curve over the full data
    for i in range(13):
        x_result = x_test[:, i]
        plt.scatter(x_result, y_test)
        plt.plot(x_result, y_pre, label=i)
        # plt.title(f'dimension {i}: scatter of feature vs target with the model fitting curve')
        # plt.savefig(f'./dimension {i} scatter of feature vs target with fitting curve.png')
        # plt.show()
        plt.title('Full data fitting curve')
        if i == 12:
            plt.legend(loc='best')
            # Save before show(); after show() the figure is cleared and an empty image would be written
            plt.savefig('./Full data fitting curve.png')
            plt.show()

    # Mean squared error
    ret = mean_squared_error(y_test, y_pre)
    print("Mean squared error:", ret)


if __name__ == '__main__':
    linear_model()


The principle of linear regression will not be repeated here; the demonstration code is above. Today let's see what linear regression trained with SGD (stochastic gradient descent) looks like once visualized.
First of all, the Boston house price dataset has many dimensions, and we cannot look at all of them at once, so we pick the single dimension with the highest correlation with the target value and visualize that.
1. Data correlation

2. Data distribution: discrete or continuous?
From the correlation matrix, the column with index 5 (the sixth feature, RM) has the highest correlation with the target, roughly 0.7, so we use it for the visualization for now.
The x axis is dimension 5 and the y axis is the target value.
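As a quick sketch of that step (assuming the df built in the code above, with the target stored in column 13), you can rank the feature/target correlations directly instead of reading them off a heatmap:

corr_with_target = df.corr()[13].drop(13)              # correlation of every feature with the target
print(corr_with_target.sort_values(ascending=False))   # column 5 (RM) has the largest positive value, about 0.7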

The line plot from plt.plot is not very informative, so let's look at the scatter plot instead.

We can see that most of the data are roughly linearly correlated; you could imagine, as a first rough attempt, fitting them with one very thick straight line (see the quick sketch below).
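As a rough sanity check (again a sketch assuming the df from above), that "thick line" is just a one-dimensional least-squares fit on this single feature:

import numpy as np

k, c = np.polyfit(df[5], df[13], deg=1)        # slope and intercept of a 1-D straight-line fit
plt.scatter(df[5], df[13])                     # the raw points
plt.plot(df[5], k * df[5] + c, color='red')    # the eyeballed 'thick line', done properly
plt.show()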
That is the single most correlated feature on its own.
Now what does the distribution of all the features look like?

We can see that the features are spread over very different ranges on the x axis; plotted together in two dimensions like this, there is no way to fit them with one straight line. After all, this is regression, not classification.
So feature processing is a must, and here we can see its significance intuitively.

After processing, the data distribution is much more concentrated; the dispersion seen in the figure above is gone, and all dimensions end up on roughly the same scale, almost merging into one continuous band of data.
There is a pitfall here: StandardScaler expects its input organized by rows, one sample per row in a 2D array, while our df stores each feature as a column (the usual column-oriented CSV layout). So a small trick is needed to wrap each column into a two-dimensional array; see the commented-out code above for the implementation.
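A minimal sketch of that trick (equivalent to the nested lists in the commented-out code above; it assumes the same df and the imports from the top of the file):

transfer = StandardScaler()
for i in range(13):
    column_2d = df[i].values.reshape(-1, 1)    # one feature column -> (n_samples, 1) 2D array
    plt.scatter(transfer.fit_transform(column_2d), df[13].values)
plt.show()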

Now let's think about how to train a model on these data.

3. Model training
After the model has been trained, what was it actually doing during training, and how does it fit the data? Let's look directly at the target values and predicted values along each dimension, i.e. the model's fitting curve in that dimension.
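To make "how it fits" concrete: the trained model is nothing more than the learned weights applied to the standardized features, plus the intercept. A small sketch, assuming the estimator, x_test and y_pre from the code above:

import numpy as np

y_manual = x_test @ estimator.coef_ + estimator.intercept_   # w . x + b for every test sample
print(np.allclose(y_manual, y_pre))                          # True: predict() does exactly this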









We can see that the data in dimension 5 is indeed fitted best, or put another way, it is the data that most nearly lies on one line. The earlier correlation analysis clearly helped us a lot.





We can see that whatever the distribution of the data in each dimension, the model tries to fit it: in two dimensions the fit is a straight line, and adding a third dimension means adjusting the position of that line.
What stochastic gradient descent does, then, is take all dimensions together, find the best coefficient in front of each one, and minimize the model's overall error, thereby fitting the high-dimensional data in a high-dimensional space.
Thinking about it this way is not far off. Trying to picture the fit in high-dimensional space directly is still very abstract, but looking at each dimension separately we can at least see what the algorithm is actually doing.
If it still feels too abstract, look at the figure below, the fitting curves of all dimensions together.

Put abstractly: first fix one dimension; the model's image in that dimension is a line on a plane.
Then take the next dimension, which is also a line on a plane. Now, because two input feature dimensions plus the target dimension define a three-dimensional space, the line determined in the first step must also fit along the remaining third dimension: its projection onto the xy plane and its projection onto the zy plane must both match the data. The line has to adjust its position in space, and once the error is smallest, its location in three-dimensional space is fixed.
And so on: add a dimension, fit the projection.
When the projections in all dimensions are being fitted at once, it is inevitable that one side ends up a little high and another a little low. A vivid example: the wedding photo on the wall is crooked; the husband insists his view of "straight" is right and the wife insists hers is; in the end each gives a little and the photo is hung at an angle neither finds very crooked, one both can accept.
In this example the two people are two dimensions, and the skew of the photo is the distance between the photo's actual position and the "straight" position in each person's mind, i.e. the error.
Stochastic gradient descent, then, keeps adjusting the coefficients to reduce that overall error: with 100 dimensions, it finds the angle at which the wedding photo does not look very crooked to any of the 100 people. When that point is reached, the best solution has been found, and the next time a photo is hung you already know the angle to place it at; whatever the size or colour of the next photo, it can be hung "straight". That is linear regression.
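If code speaks louder than analogies, here is a minimal, illustrative sketch of stochastic gradient descent for linear regression written by hand in NumPy. The function name, learning rate and epoch count are made up for illustration; sklearn's SGDRegressor does essentially this, plus learning-rate schedules and regularization:

import numpy as np

def sgd_linear_regression(X, y, lr=0.01, epochs=50):
    n_samples, n_features = X.shape
    w = np.zeros(n_features)                   # one coefficient per dimension
    b = 0.0                                    # intercept
    for _ in range(epochs):
        for idx in np.random.permutation(n_samples):
            error = (X[idx] @ w + b) - y[idx]  # prediction error on one random sample
            w -= lr * error * X[idx]           # gradient step for that sample's squared error
            b -= lr * error
    return w, b

# usage sketch on the standardized training data from the code above:
# w, b = sgd_linear_regression(x_train, y_train)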
