# 1, Concept

In regression analysis, a model with two or more independent variables is called multiple regression. In practice, a phenomenon is usually driven by several factors, so predicting or estimating the dependent variable from the best combination of several independent variables is more effective and realistic than predicting it from a single independent variable. Multiple linear regression is therefore more practical than univariate linear regression.
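The model behind this can be written in standard textbook form (notation added here, not from the original text):

$$y = \beta_0 + \beta_1 x_1 + \beta_2 x_2 + \cdots + \beta_k x_k + \varepsilon$$

where $y$ is the dependent variable, $x_1, \dots, x_k$ are the $k$ independent variables, $\beta_0$ is the intercept, $\beta_1, \dots, \beta_k$ are the coefficients to be estimated, and $\varepsilon$ is the error term.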

# 2, Multiple linear regression of EXCEL

① Delete the unneeded data columns `neighborhood` and `style`
② Data Analysis -> Regression
③ Select the input and output data ranges
④ View the results

# 3, Code implementation of multiple linear regression

For the detailed steps in this section, see the linked articles on Excel linear regression and Jupyter programming.

## 1. sklearn package implementation

① Without data processing

Import the packages:

```python
import pandas as pd
import numpy as np
import seaborn as sns
from sklearn import datasets
from sklearn.linear_model import LinearRegression
```

```python
df = pd.read_csv('..\\source\\house_prices.csv')
df.info()   # show column names and dtypes
df.head(6)  # show the first n rows (n defaults to 5)
```

Extract the independent and dependent variables:

```python
# Extract the independent variables
data_x = df[['area', 'bedrooms', 'bathrooms']]
data_y = df['price']
```

Run the multiple linear regression and inspect the results:

```python
# Multiple linear regression
model = LinearRegression()
l_model = model.fit(data_x, data_y)
print('Coefficients (weights):')
print(model.coef_)
print('Intercept:')
print(model.intercept_)
```
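The same fit/predict workflow can be sketched end to end on synthetic data (hypothetical numbers standing in for `house_prices.csv`, which is not available here):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Generate synthetic housing data with a known linear relationship:
# price = 300*area + 10000*bedrooms + 5000*bathrooms + 20000 + noise
rng = np.random.default_rng(0)
n = 200
df_demo = pd.DataFrame({
    'area': rng.uniform(50, 250, n),
    'bedrooms': rng.integers(1, 6, n),
    'bathrooms': rng.integers(1, 4, n),
})
df_demo['price'] = (300 * df_demo['area'] + 10000 * df_demo['bedrooms']
                    + 5000 * df_demo['bathrooms'] + 20000
                    + rng.normal(0, 1000, n))

model = LinearRegression()
model.fit(df_demo[['area', 'bedrooms', 'bathrooms']], df_demo['price'])
print(model.coef_)       # recovered weights, close to [300, 10000, 5000]
print(model.intercept_)  # recovered intercept, close to 20000
```

Because the data were generated from a known linear model, the fitted coefficients should land near the true weights, which is a useful sanity check before applying the same steps to real data.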

Run to see the output.

② With data processing

First, detect outliers:

```python
# Outlier handling
# ============ Outlier test function: two methods, IQR & Z-score ============
def outlier_test(data, column, method=None, z=2):
    """Detect outliers (by index) in one column using upper/lower cutoff points.

    data:   the complete DataFrame
    column: name of the column to check, passed as a string, e.g. 'price'
    method: detection method; None (default) uses the upper/lower
            truncation-point (IQR) method, 'z' uses the Z-score method
    z:      Z-score threshold, default 2 (roughly the outer tails of a
            normal curve); change it to capture any top percentage
    returns: outlier (DataFrame of outlier rows), upper and lower cutoff points
    """
    # ============ IQR (upper/lower truncation point) method ============
    if method is None:
        print(f'Using the IQR (upper/lower truncation point) method on column {column} to detect outliers...')
        print('=' * 70)
        # First and third quartiles and the interquartile range
        (q1, q3) = np.quantile(data[column], 0.25), np.quantile(data[column], 0.75)
        column_iqr = q3 - q1
        # Compute the upper and lower cutoff points
        upper, lower = (q3 + 1.5 * column_iqr), (q1 - 1.5 * column_iqr)
        # Detect outliers
        outlier = data[(data[column] <= lower) | (data[column] >= upper)]
        print(f'First quartile: {q1}, third quartile: {q3}, interquartile range: {column_iqr}')
        print(f'Upper cutoff point: {upper}, lower cutoff point: {lower}')
        return outlier, upper, lower
    # ============ Z-score method ============
    if method == 'z':
        print(f'Using the Z-score method on column {column} with z = {z} to detect outliers...')
        print('=' * 70)
        # Compute the cutoffs z standard deviations from the mean
        mean, std = np.mean(data[column]), np.std(data[column])
        upper, lower = (mean + z * std), (mean - z * std)
        print(f'With z = {z}: values greater than {upper} or less than {lower} are treated as outliers.')
        print('=' * 70)
        # Detect outliers
        outlier = data[(data[column] <= lower) | (data[column] >= upper)]
        return outlier, upper, lower
```
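As a quick standalone check of the IQR rule that the function above implements, here is its core logic inlined and run on a tiny synthetic series with one planted outlier (hypothetical values):

```python
import numpy as np
import pandas as pd

# Tiny synthetic dataset: seven ordinary prices plus one extreme value
data = pd.DataFrame({'price': [100, 102, 98, 101, 99, 103, 97, 500]})

# IQR rule: flag values beyond 1.5 * IQR outside the quartiles
q1, q3 = np.quantile(data['price'], 0.25), np.quantile(data['price'], 0.75)
iqr = q3 - q1
upper, lower = q3 + 1.5 * iqr, q1 - 1.5 * iqr
outlier = data[(data['price'] <= lower) | (data['price'] >= upper)]
print(outlier)  # only the row with price 500 is flagged
```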

Get the set of outliers and drop it from the data:

```python
outlier, upper, lower = outlier_test(data=df, column='price', method='z')  # get the outlier rows
outlier.info(); outlier.sample(5)
df = df.drop(index=outlier.index)  # discard the outliers
```

Extract the independent and dependent variables again:

```python
# Extract the independent variables
data_x = df[['area', 'bedrooms', 'bathrooms']]
data_y = df['price']
```

Run the multiple linear regression and inspect the results:

```python
# Multiple linear regression
model = LinearRegression()
l_model = model.fit(data_x, data_y)
print('Coefficients (weights):')
print(model.coef_)
print('Intercept:')
print(model.intercept_)
```

View the results.

## 2. statsmodels implementation of the linear regression model

The imports and output differ from sklearn:

```python
from statsmodels.formula.api import ols
# Without dummy variables
lm = ols('price ~ area + bedrooms + bathrooms', data=df).fit()
lm.summary()
```
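Beyond the full `summary()` table, the fitted coefficients and R² can be read off programmatically via `lm.params` and `lm.rsquared`. A minimal sketch on synthetic data with a known linear relationship (hypothetical numbers, since the CSV is not available here):

```python
import numpy as np
import pandas as pd
from statsmodels.formula.api import ols

# Synthetic data: price = 300*area + 10000*bedrooms + 5000*bathrooms + 20000 + noise
rng = np.random.default_rng(1)
n = 300
demo = pd.DataFrame({
    'area': rng.uniform(50, 250, n),
    'bedrooms': rng.integers(1, 6, n),
    'bathrooms': rng.integers(1, 4, n),
})
demo['price'] = (300 * demo['area'] + 10000 * demo['bedrooms']
                 + 5000 * demo['bathrooms'] + 20000
                 + rng.normal(0, 1000, n))

lm = ols('price ~ area + bedrooms + bathrooms', data=demo).fit()
print(lm.params)    # Intercept and per-variable coefficients
print(lm.rsquared)  # coefficient of determination
```

The formula string `'price ~ area + bedrooms + bathrooms'` declares the dependent variable on the left and the independent variables on the right; statsmodels adds the intercept automatically.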

Inspect the results and plot a correlation heatmap. The heatmap shows that the variables area, bedrooms, and bathrooms are all strongly correlated with house price.
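A heatmap like the one described is typically produced from the pairwise correlation matrix with seaborn (imported earlier). A minimal sketch, again on synthetic data with correlated columns (hypothetical numbers):

```python
import matplotlib
matplotlib.use('Agg')  # non-interactive backend so this runs headless
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
import seaborn as sns

# Synthetic data where bedrooms, bathrooms, and price all depend on area
rng = np.random.default_rng(2)
n = 100
demo = pd.DataFrame({'area': rng.uniform(50, 250, n)})
demo['bedrooms'] = (demo['area'] / 60 + rng.normal(0, 0.5, n)).round()
demo['bathrooms'] = (demo['area'] / 100 + rng.normal(0, 0.5, n)).round()
demo['price'] = 300 * demo['area'] + rng.normal(0, 5000, n)

corr = demo.corr()  # pairwise Pearson correlations
sns.heatmap(corr, annot=True, cmap='coolwarm')
plt.savefig('heatmap.png')
```

Each cell of the heatmap shows the Pearson correlation between a pair of columns; values near 1 (or -1) indicate strong linear association with price, which is what the figure in this section illustrates.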

# 4, Summary

The key difficulty of multiple linear regression lies in the relationships between several variables and the outcome. The correlation heatmap introduced here is a good tool for this: it makes the influence of each factor on the result easy to see at a glance.