Analyzing a height and weight data set with linear regression

1, Mathematical principles

Linear regression

When two variables have an exact linear relationship, the function Y = a + bX describes it, where X is the independent variable and Y is the dependent variable.
In real data, however, other factors interfere, so the relationship between two variables is often not a strict function and cannot be captured exactly by a functional equation. To distinguish it from a functional relationship, this kind of relationship is called a regression relationship. It is expressed by a linear equation, and the fitted line is called the regression line, or linear regression.
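
As a minimal sketch of the difference, the snippet below builds one exactly linear data set and one with random noise added. The coefficients, the noise level, and the names `exact` and `noisy` are illustrative assumptions, not values from the height and weight data.

```python
import random

random.seed(0)  # fixed seed so the run is repeatable

a, b = 1.0, 2.0                     # hypothetical intercept and slope
xs = [float(i) for i in range(10)]  # hypothetical X values

# Functional relationship: every point lies exactly on Y = a + bX
exact = [a + b * x for x in xs]

# Regression relationship: other factors push the points off the line
noisy = [a + b * x + random.gauss(0, 1.0) for x in xs]

# Deviation of each point from the line Y = a + bX
exact_dev = [y - (a + b * x) for x, y in zip(xs, exact)]
noisy_dev = [y - (a + b * x) for x, y in zip(xs, noisy)]

print(exact_dev)  # every deviation is zero
print(noisy_dev)  # deviations scatter around zero
```

In the noisy case no single line passes through every point, which is why a fitting criterion such as least squares is needed.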

Least squares method

Calculation principle: the least squares method chooses the line for which the sum of the squared vertical distances from the measured points to the regression line is smallest. The resulting regression equation is the one that best represents the linear trend in the measured data.
Relevant formula:
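
For n measured points (x_i, y_i) with means x̄ and ȳ, the least squares slope b and intercept a of Y = a + bX can be written as follows. The notation is an assumption chosen to match the variable names used in the Python code of section 3.

```latex
b = \frac{\sum_{i=1}^{n} x_i y_i - n\,\bar{x}\,\bar{y}}
         {\sum_{i=1}^{n} x_i^2 - n\,\bar{x}^2},
\qquad
a = \bar{y} - b\,\bar{x}
```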




Total variation decomposition of Y:
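
With fitted values ŷ_i = a + b x_i, the standard decomposition for a least squares line reads as below. The labels SS_tot, SS_reg, and SS_res are names assumed here for the total, regression, and residual sums of squares.

```latex
\underbrace{\sum_{i=1}^{n} (y_i - \bar{y})^2}_{SS_{tot}}
= \underbrace{\sum_{i=1}^{n} (\hat{y}_i - \bar{y})^2}_{SS_{reg}}
+ \underbrace{\sum_{i=1}^{n} (y_i - \hat{y}_i)^2}_{SS_{res}},
\qquad
R^2 = \frac{SS_{reg}}{SS_{tot}} = 1 - \frac{SS_{res}}{SS_{tot}}
```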

R² lies between 0 and 1 and reflects the share of the total variation in Y that is explained by the regression.
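
The decomposition can be checked numerically. The sketch below fits a line to five made-up points (the values are hypothetical, chosen so the arithmetic is easy to follow) and computes the total, regression, and residual sums of squares.

```python
# Hypothetical sample: five (x, y) pairs chosen only for illustration
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.0, 4.0, 5.0, 4.0, 5.0]
n = len(x)

x_mean = sum(x) / n
y_mean = sum(y) / n

# Least squares slope b and intercept a
b = (sum(xi * yi for xi, yi in zip(x, y)) - n * x_mean * y_mean) / (
    sum(xi * xi for xi in x) - n * x_mean * x_mean
)
a = y_mean - b * x_mean

fitted = [a + b * xi for xi in x]

ss_tot = sum((yi - y_mean) ** 2 for yi in y)               # total variation
ss_reg = sum((fi - y_mean) ** 2 for fi in fitted)          # explained by the line
ss_res = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))  # left unexplained

r_square = 1 - ss_res / ss_tot
print(a, b, r_square)
```

For these points the fit is approximately a = 2.2 and b = 0.6, the sums of squares are 6 = 3.6 + 2.4, and R² is approximately 0.6, so the line accounts for about 60% of the variation in y.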

2, Simple processing in Excel

The processed file is available for download at the file address.

20 rows of data

Chart drawn in Excel:

200 rows of data

Chart drawn in Excel:

2000 rows of data

Chart drawn in Excel:

20000 rows of data

Chart drawn in Excel:

3, Least squares calculation in Python (using Anaconda's JupyterLab)

Introduction to the tools

JupyterLab, a tool included with Anaconda, is used here. Anaconda can be downloaded from the official Anaconda website.
Open JupyterLab from Anaconda; it opens as a web page:

Click Python to create a file:

Python calculation without calling a regression package

# Univariate linear regression without calling a regression library
import pandas as pd

def read_file(raw):
    # Read the first `raw` rows of the height and weight columns
    df = pd.read_excel('..\\source\\weights_heights(height-Weight data set).xls', sheet_name='weights_heights')
    height = df.iloc[0:raw, 1:2].values
    weight = df.iloc[0:raw, 2:3].values
    return height, weight

def array_to_list(array):
    # Convert an (n, 1) array into a flat list of n values
    return [row[0] for row in array.tolist()]

def unary_linear_regression(x, y):
    # Univariate linear regression; x and y are lists of equal length
    n = len(x)
    xi_multiply_yi = 0
    xi_square = 0
    x_average = 0
    y_average = 0
    for i in range(0, n):
        xi_multiply_yi += x[i] * y[i]
        x_average += x[i]
        y_average += y[i]
        xi_square += x[i] * x[i]
    x_average = x_average / n
    y_average = y_average / n
    # Slope b and intercept a of the line y = a + b*x
    b = (xi_multiply_yi - n * x_average * y_average) / (xi_square - n * x_average * x_average)
    a = y_average - b * x_average
    # Build the fitted values in a new list so that x is not overwritten
    f = [b * xi + a for xi in x]
    R_square = get_coefficient_of_determination(f, y, y_average)
    print('R_square=' + str(R_square) + '\n' + 'a=' + str(a) + '  b=' + str(b))
    
def get_coefficient_of_determination(f, y, y_average):
    # f: fitted values, y: observed values, y_average: mean of y
    # Returns the coefficient of determination, R²
    res = 0  # residual sum of squares
    tot = 0  # total sum of squares
    for i in range(0, len(y)):
        res += (y[i] - f[i]) * (y[i] - f[i])
        tot += (y[i] - y_average) * (y[i] - y_average)
    R_square = 1 - res / tot
    return R_square

raw = [20, 200, 2000, 20000]  # Number of rows to read
for i in raw:
    print('Number of data rows: ' + str(i))
    height, weight = read_file(i)
    height = array_to_list(height)
    weight = array_to_list(weight)
    unary_linear_regression(height, weight)

Click Run:

Results obtained (these can be compared with the Excel output):

Python calculation calling a package

The process is the same as above; only the code changes. Using scikit-learn's LinearRegression together with pandas makes the code simpler, because the algorithm does not have to be written by hand:

# Univariate linear regression using scikit-learn
from sklearn import linear_model
from sklearn.metrics import r2_score
import pandas as pd

def read_file(raw):
    # Read the first `raw` rows of the height and weight columns
    # Raw string so the backslash in the Windows path is not treated as an escape
    df = pd.read_excel(r'D:\weights_heights(height-Weight data set).xls', sheet_name='weights_heights')
    height = df.iloc[0:raw, 1:2].values
    weight = df.iloc[0:raw, 2:3].values
    return height, weight

raw = [20, 200, 2000, 20000]  # Number of rows to read
for i in raw:
    print('Number of data rows: ' + str(i))
    height, weight = read_file(i)
    lm = linear_model.LinearRegression()
    lm.fit(height, weight)  # height and weight are (n, 1) arrays
    b = lm.coef_       # slope, shape (1, 1)
    a = lm.intercept_  # intercept, shape (1,)
    weight_predict = lm.predict(height)  # Values predicted by the fitted equation
    R_square = r2_score(weight, weight_predict)  # Coefficient of determination, R²
    print('b=' + str(b[0][0]) + ' a=' + str(a[0]))
    print('R_square=' + str(R_square))

Results obtained:

4, Reference materials

weights_heights (height-weight data set).xls
Simple linear regression (statistical point of view).ppt


Posted on Fri, 01 Oct 2021 15:59:47 -0400 by cryp7