1, Mathematical principle analysis
Linear regression
When two variables have an exact linear relationship, it can be written as the function Y = a + bX,
where X is the independent variable and Y is the dependent variable.
In practice, other factors interfere, so the relationship between many pairs of variables is not an exact function and cannot be captured precisely by a functional equation. To distinguish it from an exact functional relationship, this kind of relationship is called a regression relationship. When it is described by a linear equation, the fitted line is called the regression line, and the method is called linear regression.
Least squares method
Calculation principle: the least squares method chooses the line that minimizes the sum of squared vertical distances from the measured points to the regression line, so that the resulting regression equation best represents the linear trend in the measured data.
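The principle can be checked numerically. The following is a minimal sketch with made-up data (not taken from the height-weight data set); the function names are chosen here for illustration. It fits a line with the closed-form least squares estimates, written in the centered form, and confirms that nudging the intercept or slope away from those estimates increases the sum of squared vertical distances.

```python
def sum_squared_vertical_distances(x, y, a, b):
    """Sum of squared vertical distances from the points to the line y = a + b*x."""
    return sum((yi - (a + b * xi)) ** 2 for xi, yi in zip(x, y))


def least_squares_line(x, y):
    """Closed-form least squares estimates (a, b) for the line y = a + b*x."""
    n = len(x)
    x_mean = sum(x) / n
    y_mean = sum(y) / n
    sxy = sum((xi - x_mean) * (yi - y_mean) for xi, yi in zip(x, y))
    sxx = sum((xi - x_mean) ** 2 for xi in x)
    b = sxy / sxx
    a = y_mean - b * x_mean
    return a, b


# Made-up illustrative data (an assumption, not the article's data set)
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]

a, b = least_squares_line(x, y)
best = sum_squared_vertical_distances(x, y, a, b)

# Move the intercept or the slope slightly away from the estimates
nudged = [
    sum_squared_vertical_distances(x, y, a + da, b + db)
    for da, db in [(0.1, 0.0), (-0.1, 0.0), (0.0, 0.1), (0.0, -0.1)]
]

print('a=' + str(a) + ' b=' + str(b) + ' sum of squares=' + str(best))
print('every nudged line is worse:', all(s > best for s in nudged))
```

Only four nudges are tried here, so this illustrates the minimum rather than proving it.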
Relevant formula:
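For the line Y = a + bX, the least squares estimates can be written in the following form, which is the one the Python code in section 3 computes. Here n is the number of data points and x̄, ȳ are the sample means of X and Y.

```latex
b = \frac{\sum_{i=1}^{n} x_i y_i - n\,\bar{x}\,\bar{y}}
         {\sum_{i=1}^{n} x_i^{2} - n\,\bar{x}^{2}},
\qquad
a = \bar{y} - b\,\bar{x}
```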
Total variation decomposition of Y:
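A standard way to write this decomposition for a least squares line with an intercept is given below. The symbol ŷ_i = a + b x_i for the fitted value is a notation chosen here. The expression for R² is the one computed by `get_coefficient_of_determination` in section 3.

```latex
\sum_{i=1}^{n} (y_i - \bar{y})^{2}
  = \sum_{i=1}^{n} (\hat{y}_i - \bar{y})^{2}
  + \sum_{i=1}^{n} (y_i - \hat{y}_i)^{2},
\qquad
R^{2} = 1 - \frac{\sum_{i=1}^{n} (y_i - \hat{y}_i)^{2}}
                 {\sum_{i=1}^{n} (y_i - \bar{y})^{2}}
```

The left-hand side is the total variation of Y, the first term on the right is the part explained by the regression, and the second term is the residual part.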
R² lies between 0 and 1 and measures the share of the total variation in Y that is explained by the regression.
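The split of the total variation and the range of R² can be verified on a small example. This is a sketch with made-up data, and the variable names are chosen here for illustration. It fits the line by least squares, computes the three sums of squares, and checks that the total equals the regression part plus the residual part.

```python
# Made-up illustrative data (an assumption, not the article's data set)
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]

n = len(x)
x_mean = sum(x) / n
y_mean = sum(y) / n

# Least squares slope and intercept for y = a + b*x
b = (sum(xi * yi for xi, yi in zip(x, y)) - n * x_mean * y_mean) / (
    sum(xi * xi for xi in x) - n * x_mean * x_mean
)
a = y_mean - b * x_mean
fitted = [a + b * xi for xi in x]

ss_total = sum((yi - y_mean) ** 2 for yi in y)                      # total variation of y
ss_regression = sum((fi - y_mean) ** 2 for fi in fitted)            # explained by the line
ss_residual = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))      # left unexplained

r_square = ss_regression / ss_total

print('total=' + str(ss_total))
print('regression + residual=' + str(ss_regression + ss_residual))
print('R_square=' + str(r_square))
```

Because the two parts add up to the total, `ss_regression / ss_total` and `1 - ss_residual / ss_total` give the same R² up to rounding.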
2, Simple processing in Excel
The processed file is available for download.
20 sets of data
Chart drawn in Excel:
200 sets of data
Chart drawn in Excel:
2000 sets of data
Chart drawn in Excel:
20000 sets of data
Chart drawn in Excel:
3, Least squares calculation in Python (using Anaconda's JupyterLab)
Introduction to the tools
JupyterLab, a tool included with Anaconda, is used here. Anaconda can be downloaded from its official website.
Open JupyterLab from Anaconda; it opens as a web page:
Click Python to create a new file:
Python calculation without calling a regression package
# Univariate linear regression without calling a regression package
import pandas as pd

def read_file(raw):  # Read the first `raw` rows of the file
    df = pd.read_excel('..\\source\\weights_heights(height-Weight data set).xls', sheet_name='weights_heights')
    height = df.iloc[0:raw, 1:2].values
    weight = df.iloc[0:raw, 2:3].values
    return height, weight

def array_to_list(array):  # Convert an (n, 1) array to a flat list
    array = array.tolist()
    for i in range(0, len(array)):
        array[i] = array[i][0]
    return array

def unary_linear_regression(x, y):  # Univariate linear regression; x and y are lists
    xi_multiply_yi = 0
    xi_square = 0
    x_average = 0
    y_average = 0
    for i in range(0, len(x)):
        xi_multiply_yi += x[i] * y[i]
        x_average += x[i]
        y_average += y[i]
        xi_square += x[i] * x[i]
    x_average = x_average / len(x)
    y_average = y_average / len(x)
    b = (xi_multiply_yi - len(x) * x_average * y_average) / (xi_square - len(x) * x_average * x_average)
    a = y_average - b * x_average
    f = [b * xi + a for xi in x]  # Fitted values, built as a new list so that x is not overwritten
    R_square = get_coefficient_of_determination(f, y, y_average)
    print('R_square=' + str(R_square) + '\n' + 'a=' + str(a) + ' b=' + str(b))

def get_coefficient_of_determination(f, y, y_average):  # From the fitted values f, the true values y and the mean y_average, compute the coefficient of determination R²
    res = 0
    tot = 0
    for i in range(0, len(y)):
        res += (y[i] - f[i]) * (y[i] - f[i])
        tot += (y[i] - y_average) * (y[i] - y_average)
    R_square = 1 - res / tot
    return R_square

raw = [20, 200, 2000, 20000]  # Number of rows to read
for i in raw:
    print('The number of data groups is' + str(i) + ":")
    height, weight = read_file(i)
    height = array_to_list(height)
    weight = array_to_list(weight)
    unary_linear_regression(height, weight)
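One point to watch in a function like `unary_linear_regression`, which builds a list of fitted values from `x`: assigning a Python list to a second name does not copy it, so writing fitted values through that second name also changes the caller's data. A minimal sketch with made-up numbers follows; the intercept and slope values are hypothetical.

```python
x = [1.0, 2.0, 3.0]
alias = x                 # a second name for the same list object, not a copy
alias[0] = 99.0           # this also changes x
print(x)                  # the first element of x has been overwritten

x = [1.0, 2.0, 3.0]
a, b = 0.5, 2.0           # hypothetical intercept and slope
fitted = [a + b * xi for xi in x]   # a new list; x is left untouched
print(x)
print(fitted)
```

Building the fitted values with a list comprehension keeps the inputs intact, which matters if the same `x` is reused afterwards.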
Click Run:
Results obtained (they can be compared with the Excel results):
Python calculation calling a package
The process is the same as above; only the code changes. Using the scikit-learn (sklearn) package makes the code simpler, because the algorithm does not have to be written by hand:
# Univariate linear regression using the scikit-learn package
from sklearn import linear_model
from sklearn.metrics import r2_score
import pandas as pd

def read_file(raw):  # Read the first `raw` rows of the file
    # Raw string, so the backslash in the Windows path is taken literally
    df = pd.read_excel(r'D:\weights_heights(height-Weight data set).xls', sheet_name='weights_heights')
    height = df.iloc[0:raw, 1:2].values
    weight = df.iloc[0:raw, 2:3].values
    return height, weight

raw = [20, 200, 2000, 20000]  # Number of rows to read
for i in raw:
    print('The number of data groups is' + str(i) + ":")
    height, weight = read_file(i)
    lm = linear_model.LinearRegression()
    lm.fit(height, weight)
    b = lm.coef_
    a = lm.intercept_
    weight_predict = lm.predict(height)  # Values predicted by the fitted equation
    R_square = r2_score(weight, weight_predict)  # Coefficient of determination R²
    print('b=' + str(b[0][0]) + ' a=' + str(a[0]))
    print('R_square=' + str(R_square))
Results obtained:
4, Reference materials
weights_heights(height-Weight data set).xls
Simple linear regression (statistical point of view).ppt