Multiple linear regression algorithm

1, Concept

In regression analysis, a model with two or more independent variables is called multiple regression. In practice, a phenomenon is usually associated with several factors, so predicting or estimating the dependent variable from an optimal combination of several independent variables is more effective and realistic than using a single independent variable. Multiple linear regression is therefore more practical than univariate (simple) linear regression.
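Concretely, with two independent variables the model is y = b0 + b1·x1 + b2·x2, and fitting means finding the coefficients that minimize the squared error. A minimal illustration with made-up numbers (not the article's data), using plain NumPy:

```python
import numpy as np

# Made-up data that follows y = 2 + 3*x1 - 1*x2 exactly
x1 = np.array([1.0, 2.0, 3.0, 4.0])
x2 = np.array([0.0, 1.0, 0.0, 1.0])
y = 2 + 3 * x1 - 1 * x2

# Design matrix with a leading column of ones for the intercept
X = np.column_stack([np.ones_like(x1), x1, x2])

# Least-squares fit recovers [intercept, b1, b2]
coef, *_ = np.linalg.lstsq(X, y, rcond=None)
print(coef)  # ≈ [ 2.  3. -1.]
```

This is the same computation that Excel's regression tool and sklearn's `LinearRegression` perform under the hood.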

2, Multiple linear regression in Excel

① Delete the unneeded data columns neighborhood and style

② Data Analysis -> Regression

③ Select the input and output data ranges

④ Results

3, Implementing multiple linear regression in code

For the specific operations in this section, please refer to the following link:
excel linear regression and jupyter programming

1. sklearn package implementation

① Without data preprocessing
Import the packages

import pandas as pd
import numpy as np
import seaborn as sns
from sklearn import datasets
from sklearn.linear_model import LinearRegression

Read the file

df = pd.read_csv('..\\source\\house_prices.csv')
df.info()   # show column names and data types
df.head(6)  # show the first n rows; n defaults to 5

Take out the independent variables and the dependent variable

# Independent variables
data_x = df[['area', 'bedrooms', 'bathrooms']]
# Dependent variable
data_y = df['price']

Fit the multiple linear regression model and inspect the results

# Multiple linear regression
model = LinearRegression()
l_model = model.fit(data_x, data_y)
print('Parameter weights (coefficients)')
print(model.coef_)
print('Model intercept')
print(model.intercept_)

Operation results
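Once fitted, the model can also predict the price of a new house via `model.predict`. A self-contained sketch on synthetic stand-in data (the original house_prices.csv is not reproduced here; the column names match the article but the values are invented):

```python
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic stand-in for house_prices.csv (illustrative values only),
# constructed so that price = 50*area + 10000*bedrooms + 5000*bathrooms
df = pd.DataFrame({
    'area':      [1000, 1500, 2000, 2500, 3000],
    'bedrooms':  [2, 3, 3, 4, 5],
    'bathrooms': [1, 2, 2, 3, 3],
})
df['price'] = 50 * df['area'] + 10000 * df['bedrooms'] + 5000 * df['bathrooms']

model = LinearRegression().fit(df[['area', 'bedrooms', 'bathrooms']], df['price'])

# Predict the price of a hypothetical new house
new_house = pd.DataFrame({'area': [1800], 'bedrooms': [3], 'bathrooms': [2]})
print(model.predict(new_house)[0])  # ≈ 130000, since the data follows the formula exactly
```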

② With data preprocessing
First, check for outliers

# Outlier handling
# ================ Outlier test function: IQR and Z-score methods ================
def outlier_test(data, column, method=None, z=2):
    """Detect outliers (by index) in one column, using upper/lower cut-off points.

    data:   complete DataFrame
    column: name of the column in data to check, passed as a string, e.g. 'price'
    method: detection method (optional); None (default) uses the
            upper/lower cut-off point (IQR) method, 'z' uses the Z-score method
    z:      Z-score threshold, default 2 (roughly the outer 2% on each side
            of the normal curve); change it to keep any other share of the data
    returns: outlier DataFrame, upper cut-off point, lower cut-off point
    """
    # ================ Upper/lower cut-off point (IQR) method ================
    if method is None:
        print(f'Based on column {column}, detecting outliers with the upper/lower cut-off point (IQR) method...')
        print('=' * 70)
        # Interquartile range
        column_iqr = np.quantile(data[column], 0.75) - np.quantile(data[column], 0.25)
        # First and third quartiles
        (q1, q3) = np.quantile(data[column], 0.25), np.quantile(data[column], 0.75)
        # Upper and lower cut-off points
        upper, lower = (q3 + 1.5 * column_iqr), (q1 - 1.5 * column_iqr)
        # Detect outliers
        outlier = data[(data[column] <= lower) | (data[column] >= upper)]
        print(f'First quartile: {q1}, third quartile: {q3}, interquartile range: {column_iqr}')
        print(f'Upper cut-off point: {upper}, lower cut-off point: {lower}')
        return outlier, upper, lower
    # ================ Z-score method ================
    if method == 'z':
        print(f'Based on column {column}, detecting outliers with the Z-score method, z = {z}...')
        print('=' * 70)
        # Bounds at z standard deviations from the mean
        mean, std = np.mean(data[column]), np.std(data[column])
        upper, lower = (mean + z * std), (mean - z * std)
        print(f'With z = {z}: values greater than {upper} or less than {lower} are considered outliers.')
        print('=' * 70)
        # Detect outliers
        outlier = data[(data[column] <= lower) | (data[column] >= upper)]
        return outlier, upper, lower

Get the outliers and discard them

outlier, upper, lower = outlier_test(data=df, column='price', method='z')  # get the outlier rows
outlier.info(); outlier.sample(5)
df.drop(index=outlier.index, inplace=True)  # drop the outlier rows
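The Z-score branch used above boils down to keeping rows whose value lies within mean ± z·std. A compact, self-contained illustration on made-up prices (not the article's data):

```python
import numpy as np
import pandas as pd

# Made-up prices with one obvious outlier (illustrative only)
df = pd.DataFrame({'price': [100, 110, 95, 105, 98, 102, 1000]})

z = 2
mean, std = np.mean(df['price']), np.std(df['price'])  # np.std = population std, as in the article
mask = (df['price'] > mean - z * std) & (df['price'] < mean + z * std)
cleaned = df[mask]
print(len(df), '->', len(cleaned))  # 7 -> 6: the value 1000 is dropped
```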

Take out the independent variables and the dependent variable

# Independent variables
data_x = df[['area', 'bedrooms', 'bathrooms']]
# Dependent variable
data_y = df['price']

Fit the multiple linear regression model again and inspect the results

# Multiple linear regression
model = LinearRegression()
l_model = model.fit(data_x, data_y)
print('Parameter weights (coefficients)')
print(model.coef_)
print('Model intercept')
print(model.intercept_)

Get results

2. statsmodels implementation of the linear regression model

The import and the output differ from the sklearn version

from statsmodels.formula.api import ols
# Without dummy variables
lm = ols('price ~ area + bedrooms + bathrooms', data=df).fit()
lm.summary()

Processing results
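For a runnable sketch of the same `ols` call, here is the formula interface on synthetic stand-in data (column names match the article; the values are invented):

```python
import pandas as pd
from statsmodels.formula.api import ols

# Synthetic stand-in for the house-price data (illustrative values only)
df = pd.DataFrame({
    'area':      [1000, 1500, 2000, 2500, 3000, 3500],
    'bedrooms':  [2, 3, 3, 4, 5, 5],
    'bathrooms': [1, 2, 2, 3, 3, 4],
    'price':     [120000, 185000, 210000, 275000, 330000, 355000],
})

# R-style formula: dependent variable ~ independent variables
lm = ols('price ~ area + bedrooms + bathrooms', data=df).fit()
print(lm.params)    # intercept and coefficients
print(lm.rsquared)  # R-squared, also shown in lm.summary()
```

Unlike sklearn, `lm.summary()` also reports standard errors, t-statistics, and p-values for each coefficient.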

Output the correlation heat map

The figure shows that the variables area, bedrooms, and bathrooms are strongly correlated with the house price.
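The article does not show the heat-map code; one common way to produce such a figure is `seaborn.heatmap` applied to the correlation matrix. A sketch on synthetic stand-in data (invented values; the real analysis would use the house-price DataFrame):

```python
import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

# Synthetic stand-in for the house-price data (illustrative values only)
rng = np.random.default_rng(0)
area = rng.uniform(800, 3000, 100)
df = pd.DataFrame({
    'area': area,
    'bedrooms': np.round(area / 700) + rng.integers(0, 2, 100),
    'bathrooms': np.round(area / 1000) + rng.integers(0, 2, 100),
})
df['price'] = 50 * df['area'] + 8000 * df['bedrooms'] + rng.normal(0, 5000, 100)

corr = df.corr()  # pairwise Pearson correlations between all columns
sns.heatmap(corr, annot=True, cmap='coolwarm')  # annot shows the values in each cell
plt.show()
```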

4, Summary

The key difficulty of multiple linear regression lies in understanding the relationship between several variables and the outcome. The correlation heat map covered here is a good tool: it shows clearly how strongly each factor relates to the result.

Reference link

multiple linear regression
House price forecast based on multiple linear regression

Posted on Tue, 26 Oct 2021 11:08:18 -0400 by Xyox