# Learning objectives:

## 1. Univariate linear regression

(1) How to understand "regression analysis"?

Regression analysis is a statistical method for determining the quantitative relationship of interdependence between two or more variables. It models the relationship between influencing factors (independent variables) and the prediction target (dependent variable) when a causal connection exists. A regression equation is meaningful only if the independent variables are genuinely related to the dependent variable. Therefore, deciding whether a candidate factor is related to the prediction target, measuring the degree of that correlation, and judging how reliable that measurement is are problems that regression analysis must solve. In correlation analysis, the degree of correlation between the independent and dependent variables is generally judged by the magnitude of the correlation coefficient.
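As a quick illustration of judging correlation by the coefficient's magnitude, the following sketch uses NumPy's `corrcoef` on the sample data from the worked example later in this section:

```python
import numpy as np

# Sample data (the same observations used in the worked example below).
x = np.array([2, 6, 8, 8, 12, 16, 20, 20, 22, 26], dtype=float)
y = np.array([58, 105, 88, 118, 117, 137, 157, 169, 149, 202], dtype=float)

r = np.corrcoef(x, y)[0, 1]   # Pearson correlation coefficient
print(r)   # about 0.95: a strong positive linear correlation
```

A value of r this close to 1 suggests a linear regression model is appropriate for these data.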

(2) Classification

Criterion 1: by the type of relationship between the independent and dependent variables

Linear regression analysis and nonlinear regression analysis

Criterion 2: by the number of independent variables

Univariate regression analysis and multiple regression analysis

The univariate linear regression model has the form y = a + bx.

After the values of a and b are estimated from the sample observations, the sample regression equation can be used as a prediction model, that is, the univariate linear regression prediction model.

(3) Methods for solving the parameters of the regression prediction model

Method 1: solve using the correlation coefficient and standard deviations

The straight line can be expressed by the formula: y = bx + a.

The formula for the slope b of the regression line is: b = r * (SD of y / SD of x).

In other words: the correlation coefficient r between the x and y values, multiplied by the standard deviation of the y values (SD of y) and divided by the standard deviation of the x values (SD of x).

Substituting the sample mean point (x̄, ȳ) into the regression line gives the intercept: a = ȳ - b x̄.
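A minimal sketch of Method 1, assuming NumPy is available and reusing the sample data from the worked example below; the slope is r times the ratio of standard deviations, and the intercept comes from substituting the mean point:

```python
import numpy as np

# Data from the worked example below.
x = np.array([2, 6, 8, 8, 12, 16, 20, 20, 22, 26], dtype=float)
y = np.array([58, 105, 88, 118, 117, 137, 157, 169, 149, 202], dtype=float)

r = np.corrcoef(x, y)[0, 1]      # correlation coefficient
b = r * np.std(y) / np.std(x)    # slope: b = r * (SD of y / SD of x)
a = np.mean(y) - b * np.mean(x)  # intercept: substitute the mean point (x̄, ȳ)
print(b, a)   # ≈ 5.0 and 60.0
```

This agrees with the least-squares solution computed in the worked example, since r · (SD of y / SD of x) is algebraically identical to the least-squares slope.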

Solution formula of the correlation coefficient:

r = Σ(xᵢ - x̄)(yᵢ - ȳ) / √( Σ(xᵢ - x̄)² · Σ(yᵢ - ȳ)² )

Method 2: gradient descent

Starting from an arbitrary line, for example line a, we calculate the sum of squared errors of that line, then adjust the slope and intercept and recalculate the sum of squared errors of the new line. We continue adjusting until a local minimum is reached, at which the sum of squared errors is smallest.

Gradient descent is an algorithm that approximates the least-squares regression line by iteratively minimizing the sum of squared errors.

Cost:

"Cost" is the sum of squared errors, where each error is (predicted value - actual value).

To make the prediction model more accurate (i.e., to achieve the lowest cost), we search for the best-fitting line by changing the slope and intercept.
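To make "cost" concrete, here is a small sketch (data taken from the worked example below; the candidate lines are illustrative) that computes the sum of squared errors for a given slope and intercept:

```python
import numpy as np

# Data from the worked example below.
x = np.array([2, 6, 8, 8, 12, 16, 20, 20, 22, 26], dtype=float)
y = np.array([58, 105, 88, 118, 117, 137, 157, 169, 149, 202], dtype=float)

def cost(b0, b1, x, y):
    """Sum of squared errors of the line y = b0 + b1 * x."""
    residuals = (b0 + b1 * x) - y   # predicted value minus actual value
    return np.sum(residuals ** 2)

# The least-squares line (b0 = 60, b1 = 5) costs less than an arbitrary line.
print(cost(60.0, 5.0, x, y))   # 1530.0
print(cost(0.0, 1.0, x, y))    # much larger
```

Minimizing this function over b0 and b1 is exactly the problem gradient descent solves.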

How to change the parameters?

Taking the partial derivatives gives the direction of fastest descent.

We introduce the gradient descent update formula to change the parameter values: each parameter is updated as θ := θ - α · ∂J/∂θ, where J is the cost and α is the learning rate.

The key is to choose an appropriate learning rate (α). If the learning rate is too small, convergence will be slow; if it is too large, it will hinder convergence, that is, the parameters will oscillate near the extremum.
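The iteration described above can be sketched as follows; the data come from the worked example below, while the starting values, learning rate, and iteration count are illustrative choices:

```python
import numpy as np

# Gradient descent for y = b0 + b1 * x with sum-of-squared-errors cost.
x = np.array([2, 6, 8, 8, 12, 16, 20, 20, 22, 26], dtype=float)
y = np.array([58, 105, 88, 118, 117, 137, 157, 169, 149, 202], dtype=float)

b0, b1 = 0.0, 1.0   # arbitrary starting line
alpha = 0.005       # learning rate
n = len(x)

for _ in range(50_000):
    error = (b0 + b1 * x) - y          # predicted value minus actual value
    grad_b0 = np.sum(error) / n        # partial derivative of cost w.r.t. b0
    grad_b1 = np.sum(error * x) / n    # partial derivative of cost w.r.t. b1
    b0 -= alpha * grad_b0
    b1 -= alpha * grad_b1

print(b0, b1)   # converges toward the least-squares solution b0 = 60, b1 = 5
```

With a larger α (here anything much above about 0.008 for these data), the updates overshoot and oscillate instead of converging, illustrating the trade-off described above.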

Learning rate schedules (also known as learning rate annealing) change the learning rate during training. Typically a preset strategy is used, or the rate is decayed by a small amount at each iteration. Either way, the schedule must be fixed in advance, so it cannot adapt to the characteristics of each dataset.
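One common preset strategy is exponential decay; the initial rate and decay factor below are illustrative constants, not values from the original text:

```python
# Exponential-decay learning rate schedule: rate shrinks by a fixed
# factor at every step, fixed in advance of training.
initial_rate = 0.1
decay = 0.95

def learning_rate(step):
    return initial_rate * decay ** step

rates = [learning_rate(t) for t in range(5)]
print(rates)   # each rate is 0.95 times the previous one
```

The schedule is a pure function of the step count, which is exactly why it cannot react to how the loss is actually behaving on a given dataset.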

(4) Solving steps

• 1. Draw a scatter diagram to judge the relationship between the variables (simple linear);
• 2. Calculate the correlation coefficient and verify linearity;
• 3. Calculate the regression coefficients and establish the regression equation;
• 4. Test the regression equation;
• 5. Compute interval estimates of the parameters;
• 6. Forecast.

An example is as follows:

```python
import numpy as np
import matplotlib.pyplot as plt


class SimpleRegress(object):
    def __init__(self, x_data, y_data):
        self.x_data = np.array(x_data)
        self.y_data = np.array(y_data)
        self.b0 = 0
        self.b1 = 1

    def calculate_work(self):   # solve for b0 and b1 in the regression equation
        x_mean = np.mean(self.x_data)   # x_mean = 14.0
        y_mean = np.mean(self.y_data)   # y_mean = 130.0
        x1 = self.x_data - x_mean   # x1 = [-12. -8. -6. -6. -2.  2.  6.  6.  8. 12.]
        y1 = self.y_data - y_mean   # y1 = [-72. -25. -42. -12. -13.  7. 27. 39. 19. 72.]
        s = x1 * y1     # s = [864. 200. 252.  72.  26.  14. 162. 234. 152. 864.]
        u = x1 * x1     # u = [144.  64.  36.  36.   4.   4.  36.  36.  64. 144.]
        self.b1 = np.sum(s) / np.sum(u)        # b1 = 5.0
        self.b0 = y_mean - self.b1 * x_mean    # b0 = 60.0

    def test_data_work(self, test_data):    # apply the regression equation to predict new values
        result = [self.b0 + self.b1 * one_test for one_test in test_data]
        return result

    def root_data_view(self):    # plot the source data
        plt.scatter(self.x_data, self.y_data, label='simple regress', color='k', s=5)  # s: point size
        plt.xlabel('x')
        plt.ylabel('y')
        plt.legend()
        plt.show()

    def test_data_view(self):    # plot the regression line
        # endpoints of the regression line: predictions at the extreme x values
        x_min = np.min(self.x_data)
        x_max = np.max(self.x_data)
        x_plot = [x_min, x_max]
        y_plot = [self.b0 + self.b1 * x_min, self.b0 + self.b1 * x_max]
        plt.scatter(self.x_data, self.y_data, label='root data', color='k', s=5)  # s: point size
        plt.plot(x_plot, y_plot, label='regression line')
        plt.xlabel('x')
        plt.ylabel('y')
        plt.legend()
        plt.title('simple linear regression')
        plt.show()


x_data = [2, 6, 8, 8, 12, 16, 20, 20, 22, 26]
y_data = [58, 105, 88, 118, 117, 137, 157, 169, 149, 202]
test_data = [16]

sr = SimpleRegress(x_data, y_data)
sr.calculate_work()
result = sr.test_data_work(test_data)       # result = [140.0]
# sr.root_data_view()
sr.test_data_view()
```


Posted on Mon, 22 Nov 2021 08:34:03 -0500 by dbair