Learning objectives:
I. Understand univariate linear regression
II. Learn to use the gradient descent method and the correlation coefficient method to fit the linear model
III. Learn to implement the process in code
1. Univariate linear regression
(1) How to understand "regression analysis"?
Regression analysis is a statistical method for determining the quantitative relationship of interdependence between two or more variables. It is a mathematical-statistical analysis of influencing factors (independent variables) and a prediction target (dependent variable) that are causally related. A regression equation is meaningful only when the independent variables actually bear some relationship to the dependent variable. Therefore, deciding whether the factors chosen as independent variables are related to the prediction target, how strongly they are related, and how confident we can be in that judgment are problems that regression analysis must solve. This is usually handled with correlation analysis, where the degree of correlation between the independent and dependent variables is judged by the magnitude of the correlation coefficient.
(2) Classification
Criterion 1: by the type of relationship between the independent and dependent variables:
Linear regression analysis and nonlinear regression analysis
Criterion 2: by the number of independent variables:
Univariate regression analysis and multiple regression analysis
The form of univariate linear regression is y = a + bx.
Once the values of a and b have been estimated from the sample observations, the sample regression equation can be used as a prediction model, i.e., the univariate linear regression prediction model.
(3) Methods for solving the parameters of the regression prediction model
Method 1: solve using the correlation coefficient and the standard deviations
The regression line can be expressed by the formula y = bx + a.
The slope of the regression line is b = r * (SD of y / SD of x), i.e., the correlation coefficient r between the x and y values, multiplied by the standard deviation of the y values and divided by the standard deviation of the x values.
To obtain the intercept a, substitute the sample mean point (x̄, ȳ) into the regression line: a = ȳ - b*x̄.
The correlation coefficient itself is computed as:
r = Σ(xᵢ - x̄)(yᵢ - ȳ) / sqrt( Σ(xᵢ - x̄)² * Σ(yᵢ - ȳ)² )
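The following is a minimal sketch of Method 1 in Python (numpy only, using the same sample data as the worked example at the end of this section):

```python
import numpy as np

# Sample data (the same values used in the worked example below)
x = np.array([2, 6, 8, 8, 12, 16, 20, 20, 22, 26], dtype=float)
y = np.array([58, 105, 88, 118, 117, 137, 157, 169, 149, 202], dtype=float)

r = np.corrcoef(x, y)[0, 1]       # correlation coefficient between x and y
b = r * (np.std(y) / np.std(x))   # slope: b = r * (SD of y / SD of x)
a = np.mean(y) - b * np.mean(x)   # intercept: substitute the mean point (x̄, ȳ)

print(a, b)  # a = 60.0, b = 5.0 for this sample
```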
Method 2: gradient descent method
Gradient descent principle:
Starting from a random line, say line a, we calculate the line's sum of squared errors, then adjust the slope and y-intercept and recalculate the sum of squared errors for the new line. We continue adjusting until a local minimum is reached, where the sum of squared errors is smallest.
The gradient descent method is thus an algorithm that approximates the least-squares regression line by minimizing the sum of squared errors over many iterations.
Cost:
The "cost" is the sum of squared errors, i.e., the sum of (predicted value - actual value)² over all samples.
To make the prediction model as accurate as possible (i.e., to achieve the lowest cost), we search for the best-fitting line by varying the slope and the intercept.
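As a small illustration (the helper function name is ours, not from the original text), the cost of a candidate line y = a + b*x over the sample data can be computed like this:

```python
import numpy as np

def cost(a, b, x, y):
    # Sum of squared errors: sum of (predicted value - actual value)^2
    return np.sum((a + b * x - y) ** 2)

x = np.array([2, 6, 8, 8, 12, 16, 20, 20, 22, 26], dtype=float)
y = np.array([58, 105, 88, 118, 117, 137, 157, 169, 149, 202], dtype=float)

print(cost(60, 5, x, y))  # 1530.0 -- the minimum for this data
print(cost(0, 1, x, y))   # a worse line gives a much larger cost
```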
How do we change the parameters?
Taking the partial derivatives of the cost with respect to the parameters gives the direction of fastest descent.
We then apply the gradient descent update formula to change the parameter values: each parameter θ is updated as θ := θ - α * ∂J/∂θ, where J is the cost.
The key is to choose an appropriate learning rate (α). If the learning rate is too small, convergence is slow; if it is too large, convergence is hindered, i.e., the parameters oscillate near the extremum.
Learning rate schedules change the learning rate from update to update, for example by annealing: typically a preset strategy is followed, or the learning rate is decayed by a small factor at each iteration. Either way, the schedule must be fixed in advance, so it cannot adapt to the characteristics of each dataset.
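Putting these pieces together, here is a minimal gradient descent sketch for fitting y = b0 + b1*x on the same sample data. The learning rate and iteration count are our assumptions, chosen so the updates converge at this data scale; this is an illustration, not the only possible implementation:

```python
import numpy as np

x = np.array([2, 6, 8, 8, 12, 16, 20, 20, 22, 26], dtype=float)
y = np.array([58, 105, 88, 118, 117, 137, 157, 169, 149, 202], dtype=float)

b0, b1 = 0.0, 1.0  # start from an arbitrary line
alpha = 0.001      # learning rate: too small converges slowly, too large oscillates

for _ in range(100_000):
    error = (b0 + b1 * x) - y         # predicted value - actual value
    grad_b0 = 2 * np.mean(error)      # ∂J/∂b0, where J is the mean squared error
    grad_b1 = 2 * np.mean(error * x)  # ∂J/∂b1
    b0 -= alpha * grad_b0             # step in the fastest-descent direction
    b1 -= alpha * grad_b1

print(b0, b1)  # approaches the least-squares line: about 60 and 5
```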
(4) Solving steps
- 1. Draw a scatter diagram to judge the relationship between the variables (is it roughly a simple linear one?);
- 2. Calculate the correlation coefficient and verify linearity;
- 3. Calculate the regression coefficients and establish the regression equation;
- 4. Test the regression equation;
- 5. Compute interval estimates of the parameters;
- 6. Make predictions.
An example is as follows:
```python
import numpy as np
import matplotlib.pyplot as plt


class SimpleRegress(object):
    def __init__(self, x_data, y_data):
        self.x_data = np.array(x_data, dtype=float)
        self.y_data = np.array(y_data, dtype=float)
        self.b0 = 0
        self.b1 = 1

    def calculate_work(self):
        # Solve for b0 and b1 in the regression equation
        x_mean = np.mean(self.x_data)        # x_mean = 14.0
        y_mean = np.mean(self.y_data)        # y_mean = 130.0
        x1 = self.x_data - x_mean            # [-12. -8. -6. -6. -2. 2. 6. 6. 8. 12.]
        y1 = self.y_data - y_mean            # [-72. -25. -42. -12. -13. 7. 27. 39. 19. 72.]
        s = x1 * y1                          # [864. 200. 252. 72. 26. 14. 162. 234. 152. 864.]
        u = x1 * x1                          # [144. 64. 36. 36. 4. 4. 36. 36. 64. 144.]
        self.b1 = np.sum(s) / np.sum(u)      # b1 = 5.0
        self.b0 = y_mean - self.b1 * x_mean  # b0 = 60.0

    def test_data_work(self, test_data):
        # Use the fitted regression equation to predict new values
        return [self.b0 + self.b1 * one_test for one_test in test_data]

    def root_data_view(self):
        # Plot the source data
        plt.scatter(self.x_data, self.y_data, label='simple regress', color='k', s=5)  # s: point size
        plt.xlabel('x')
        plt.ylabel('y')
        plt.legend()
        plt.show()

    def test_data_view(self):
        # Plot the regression line through two of its points.
        # The y coordinates must come from the regression equation;
        # using the min/max of the observed y values would draw the wrong line.
        x_min = np.min(self.x_data)
        x_max = np.max(self.x_data)
        x_plot = [x_min, x_max]
        y_plot = [self.b0 + self.b1 * x_min, self.b0 + self.b1 * x_max]
        plt.scatter(self.x_data, self.y_data, label='root data', color='k', s=5)  # s: point size
        plt.plot(x_plot, y_plot, label='regression line')
        plt.xlabel('x')
        plt.ylabel('y')
        plt.legend()
        plt.title('simple linear regression')
        plt.show()


x_data = [2, 6, 8, 8, 12, 16, 20, 20, 22, 26]
y_data = [58, 105, 88, 118, 117, 137, 157, 169, 149, 202]
test_data = [16]

sr = SimpleRegress(x_data, y_data)
sr.calculate_work()
result = sr.test_data_work(test_data)  # result = [140.0]
# sr.root_data_view()
sr.test_data_view()
```
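Note that calculate_work estimates b1 as Σ(xᵢ - x̄)(yᵢ - ȳ) / Σ(xᵢ - x̄)², which is algebraically equivalent to Method 1's b = r * (SD of y / SD of x); both give b1 = 5.0 and b0 = 60.0 on this data, and the gradient descent sketch above converges toward the same values.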