# Multiple linear regression

Multiple linear regression generalizes simple linear regression to more than one independent variable. It is also a special case of the general linear model, restricted to a single dependent variable.

The general linear model (also called the multivariate regression model) is a common linear model in statistics.
It is generally written as:

$$Y = XB + U$$

where Y is a matrix of response (dependent) variables, X is the design matrix of independent variables, B is a matrix of parameters to be estimated, and U is a matrix of errors (residual terms). The errors are usually assumed to be uncorrelated across measurements and to follow a multivariate normal distribution. If they do not, a generalized linear model can be used to relax the assumptions about Y and U.
The general linear model covers many statistical models: ANOVA, ANCOVA, MANOVA, MANCOVA, ordinary linear regression, the t-test and the F-test. It generalizes multiple linear regression to more than one dependent variable: if Y, B and U are column vectors, the matrix equation above reduces to multiple linear regression.
Hypothesis tests with the general linear model can be carried out in two ways: as a multivariate test or as several independent univariate tests. In a multivariate test the columns of Y are tested together, whereas in univariate tests each column of Y is tested separately, that is, as multiple univariate tests sharing the same design matrix.
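
The relationship between the matrix form and ordinary multiple regression can be checked numerically. The following is a minimal sketch on synthetic data (the sizes, seed and variable names are illustrative assumptions, not part of the original analysis): it estimates the whole coefficient matrix B by least squares, then fits each column of Y separately and confirms that the coefficients agree.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic data (illustrative only): 50 observations, 3 predictors, 2 responses.
n, p, m = 50, 3, 2
X = np.column_stack([np.ones(n), rng.normal(size=(n, p))])  # design matrix with an intercept column
B_true = rng.normal(size=(p + 1, m))
U = 0.1 * rng.normal(size=(n, m))
Y = X @ B_true + U

# Least-squares estimate of the whole coefficient matrix B at once.
B_hat, *_ = np.linalg.lstsq(X, Y, rcond=None)

# Fitting each column of Y on its own (one multiple linear regression per
# response) gives the same coefficients, because the design matrix is shared.
B_cols = np.column_stack(
    [np.linalg.lstsq(X, Y[:, j], rcond=None)[0] for j in range(m)]
)

print(B_hat.shape)                 # (4, 2)
print(np.allclose(B_hat, B_cols))  # True
```

The coefficient estimates coincide in both approaches; the multivariate and univariate routes differ in how hypotheses about those coefficients are tested, not in the fitted values.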

Significance: real-world phenomena are interconnected, and an outcome is usually influenced by several factors at once. Estimating the dependent variable from the best combination of multiple independent variables is therefore more effective and more realistic than predicting it from a single independent variable.

# 1, Problem analysis

This article applies linear regression to data on houses sold in one area over a period of time. The goal is to identify the main factors that affect house prices, measure how strongly each factor influences price, and use the fitted model to predict future house prices.
The factors considered are neighborhood, house area, number of bedrooms, number of bathrooms and house style.

# 2, Data preprocessing (Excel)

## 1. Data cleaning

The original data contain records that look erroneous, such as houses with no bedrooms, no bathrooms, or an unreasonably small area.

Use Excel's filter to find and remove these records:

- Remove rows where bedrooms is 0.
- Remove rows where bathrooms is 0.
- Remove rows where area is less than 1000.
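
The same cleaning rules can also be expressed in pandas. This is a minimal sketch on a tiny made-up table; the column names follow the variables named in this article, and the values are invented for illustration only.

```python
import pandas as pd

# Tiny made-up sample; only the column names come from the article.
raw = pd.DataFrame({
    "area":      [1500, 800, 2100, 1750, 1900],
    "bedrooms":  [3, 2, 0, 4, 3],
    "bathrooms": [2, 1, 2, 0, 2],
    "price":     [500000, 300000, 650000, 560000, 610000],
})

# Keep only rows with at least one bedroom, at least one bathroom
# and an area of 1000 or more.
keep = (raw["bedrooms"] > 0) & (raw["bathrooms"] > 0) & (raw["area"] >= 1000)
clean = raw[keep].reset_index(drop=True)

print(len(raw), "->", len(clean))
```

Building the boolean mask first keeps each rule visible and makes it easy to count how many rows each condition removes.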

## 2. Full numerical data

neighborhood and style are non-numerical (categorical) columns, so they must be converted to numbers before the regression analysis.

- A, B, C are coded as 1, 2, 3

- ranch, victorian, lodge are coded as 10, 20, 30

Use Excel's Find and Replace to substitute each label with its code: replace A with 1, then do the same for B, C, ranch, victorian and lodge.
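
For readers working in code rather than Excel, the same label-to-number conversion can be sketched with pandas. The rows below are made up; only the two mappings (A, B, C to 1, 2, 3 and ranch, victorian, lodge to 10, 20, 30) come from this article.

```python
import pandas as pd

# Made-up rows; only the label-to-code mappings come from the article.
houses = pd.DataFrame({
    "neighborhood": ["A", "C", "B", "A"],
    "style": ["ranch", "lodge", "victorian", "ranch"],
})

neighborhood_codes = {"A": 1, "B": 2, "C": 3}
style_codes = {"ranch": 10, "victorian": 20, "lodge": 30}

# Series.map replaces each label by its code; labels missing from the
# dictionary would become NaN, which makes typos easy to spot.
houses["neighborhood"] = houses["neighborhood"].map(neighborhood_codes)
houses["style"] = houses["style"].map(style_codes)

print(houses)
```

Integer codes like these treat the categories as ordered and evenly spaced, which is a modelling assumption. Treating the columns as categorical, as the `C(...)` terms in the statsmodels formula do, avoids that assumption.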

# 3, Regression using Excel

## Regression implementation

Method: use house price as the dependent variable and all other variables as independent variables.

## Regression analysis

The regression output includes the following statistics:

- Multiple R: the multiple correlation coefficient R, which measures the strength of the linear relationship between the independent variables x and the dependent variable y.
- R Square: the coefficient of determination R², the proportion of the variation in the dependent variable that the independent variables explain through the regression. Intuitively, it takes a model that always predicts the mean as the benchmark and measures how much smaller the regression's prediction error is than the benchmark's error.
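
The "mean as benchmark" reading of R² can be made concrete with a few lines of code. This is a minimal sketch; the function name `r_squared` and the toy numbers are illustrative and are not taken from the Excel output.

```python
import numpy as np

def r_squared(y_true, y_pred):
    """R^2 = 1 - (squared error of the model) / (squared error of the mean baseline)."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    ss_res = np.sum((y_true - y_pred) ** 2)         # error of the model's predictions
    ss_tot = np.sum((y_true - y_true.mean()) ** 2)  # error of always predicting the mean
    return 1.0 - ss_res / ss_tot

y = [3.0, 5.0, 7.0, 9.0]
print(r_squared(y, y))          # perfect predictions: 1.0
print(r_squared(y, [6.0] * 4))  # always predicting the mean: 0.0
```

A value of 1 means the predictions are exact, 0 means the model does no better than the mean, and a negative value means it does worse than the mean.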

For this data set, Multiple R = 0.778, which indicates a fairly strong linear relationship between x and y.

R Square = 0.605, which means the independent variables explain 60.5% of the variation in the dependent variable.

| Independent variable | Meaning | Coefficient |
| --- | --- | --- |
| X Variable 1 | neighborhood | 9768.8665605825 |
| X Variable 2 | house area | 345.152705630739 |
| X Variable 3 | bedrooms | -1733.14723959822 |
| X Variable 4 | bathrooms | 8112.15494579683 |
| X Variable 5 | house style | -455.45090128214 |

The regression equation, with coefficients rounded to one decimal place, is:

$$y = 9768.9x_1 + 345.2x_2 - 1733.1x_3 + 8112.2x_4 - 455.5x_5 - 6497.0$$
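
To see how the equation is used for prediction, the sketch below evaluates it for a single house. The coefficients are the ones listed above and the intercept is taken from the regression equation; the function name `predict_price` and the example house are hypothetical.

```python
# Coefficients as listed above; the intercept comes from the regression equation.
COEFFICIENTS = {
    "neighborhood": 9768.8665605825,
    "area": 345.152705630739,
    "bedrooms": -1733.14723959822,
    "bathrooms": 8112.15494579683,
    "style": -455.45090128214,
}
INTERCEPT = -6497.0

def predict_price(house):
    """Evaluate the fitted regression equation for one house (a dict of coded values)."""
    return INTERCEPT + sum(COEFFICIENTS[name] * house[name] for name in COEFFICIENTS)

# Hypothetical house: neighborhood B (code 2), area 2000, 3 bedrooms,
# 2 bathrooms, ranch style (code 10).
example = {"neighborhood": 2, "area": 2000, "bedrooms": 3, "bathrooms": 2, "style": 10}
print(round(predict_price(example), 1))
```

Each coefficient is the change in predicted price for a one-unit increase in that variable with the others held fixed. For example, one extra unit of area adds about 345 to the prediction.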

The regression output shows that the p-value of house area (x2) is far below the 0.05 significance level, so house area is significantly related to house price. The p-values of bedrooms and bathrooms are much greater than 0.05, so their relationship with house price is not statistically significant in this model.

# 4, Regression using code

## Ⅰ Statsmodels

### 1. Data processing

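Import the libraries:

    import pandas as pd
    import numpy as np
    import seaborn as sns
    import matplotlib.pyplot as plt

    # The rest of this section assumes the housing data has already been loaded
    # into a DataFrame named df (the file path is not given in this article).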

Outlier handling:

    # ===== Outlier test function: two methods, IQR and Z-score =====
    def outlier_test(data, column, method=None, z=2):
        """Detect outliers in one column using upper and lower cutoff points.

        data:   the complete DataFrame
        column: name of the column to check, as a string such as 'x'
        method: None (default) uses the IQR cutoff method; 'z' uses the Z-score method
        z:      number of standard deviations for the Z-score method (default 2);
                for normal data roughly 2% of values lie beyond each cutoff
        Returns (outlier, upper, lower): the DataFrame of outlier rows and
        the upper and lower cutoff points.
        """
        if method is None:
            # ----- IQR (upper and lower cutoff point) method -----
            print(f'Detecting outliers in column {column} with the IQR cutoff method...')
            print('=' * 70)
            # First and third quartiles and the interquartile range
            q1, q3 = np.quantile(data[column], 0.25), np.quantile(data[column], 0.75)
            column_iqr = q3 - q1
            # Upper and lower cutoff points
            upper, lower = (q3 + 1.5 * column_iqr), (q1 - 1.5 * column_iqr)
            print(f'First quartile: {q1}, third quartile: {q3}, interquartile range: {column_iqr}')
        elif method == 'z':
            # ----- Z-score method -----
            print(f'Detecting outliers in column {column} with the Z-score method, z = {z}...')
            print('=' * 70)
            # Cutoff points lie z standard deviations either side of the mean
            mean, std = np.mean(data[column]), np.std(data[column])
            upper, lower = (mean + z * std), (mean - z * std)
        else:
            raise ValueError("method must be None or 'z'")
        print(f'Upper cutoff point: {upper}, lower cutoff point: {lower}')
        print('=' * 70)
        # Rows at or beyond either cutoff are treated as outliers
        outlier = data[(data[column] <= lower) | (data[column] >= upper)]
        return outlier, upper, lower

    outlier, upper, lower = outlier_test(data=df, column='price', method='z')
    outlier.info()
    outlier.sample(5)

    # Drop the outlier rows from the data set
    df.drop(index=outlier.index, inplace=True)
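
To make the two cutoff rules concrete, here is a small self-contained sketch that applies both to a handful of made-up prices containing one extreme value. The numbers are invented for illustration and are unrelated to the housing data set.

```python
import numpy as np

# Made-up prices with one obviously extreme value.
prices = np.array([310, 295, 330, 305, 320, 315, 300, 325, 2000], dtype=float)

# IQR rule: cutoffs lie 1.5 interquartile ranges outside the quartiles.
q1, q3 = np.quantile(prices, [0.25, 0.75])
iqr = q3 - q1
iqr_lower, iqr_upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr

# Z-score rule: cutoffs lie z standard deviations either side of the mean.
z = 2
mean, std = prices.mean(), prices.std()
z_lower, z_upper = mean - z * std, mean + z * std

iqr_outliers = prices[(prices <= iqr_lower) | (prices >= iqr_upper)]
z_outliers = prices[(prices <= z_lower) | (prices >= z_upper)]
print("IQR cutoffs:", iqr_lower, iqr_upper, "outliers:", iqr_outliers)
print("Z cutoffs:  ", z_lower, z_upper, "outliers:", z_outliers)
```

Both rules flag the extreme value here, but the extreme value also inflates the mean and standard deviation, so the Z-score cutoffs end up much wider than the IQR cutoffs. On heavily skewed data the two methods can therefore disagree.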

### 2. Draw a heat map to explore how price relates to the other variables

    # Heat map of the correlation matrix
    def heatmap(data, method='pearson', cmap='RdYlGn', figsize=(10, 8)):
        """
        data:    the whole DataFrame
        method:  correlation method, 'pearson' by default
        cmap:    color map, 'RdYlGn' (red-yellow-green) by default;
                 'YlGnBu' (yellow-green-blue), 'Blues' and 'Greens' are also good choices
        figsize: figure size, (10, 8) by default
        """
        # Compute the correlation matrix once, using numeric columns only
        # (numeric_only requires pandas 1.5 or later)
        corr = data.corr(method=method, numeric_only=True)
        # To hide the duplicated triangle across the diagonal, build a mask, e.g.
        #     mask = np.triu(np.ones_like(corr, dtype=bool))
        # and pass mask=mask to sns.heatmap below
        plt.figure(figsize=figsize, dpi=80)
        sns.heatmap(corr, xticklabels=corr.columns, yticklabels=corr.columns,
                    cmap=cmap, center=0, annot=True)
        plt.show()

    heatmap(data=df, figsize=(6, 5))

The heat map shows that area, bedrooms and bathrooms are all noticeably correlated with house price, so they are included in the model.
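
The correlations shown in a heat map can also be read as a ranked list, which makes it easier to decide which variables to keep. The sketch below uses a synthetic stand-in for the housing data (the column names, coefficients and seed are assumptions made for illustration) and sorts the columns by the absolute value of their correlation with price.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(1)

# Synthetic stand-in for the housing data (not the article's data set):
# price depends strongly on area and only weakly on bedrooms.
n = 200
area = rng.normal(2000, 400, n)
bedrooms = rng.integers(1, 6, n)
noise = rng.normal(0, 20000, n)
demo = pd.DataFrame({
    "area": area,
    "bedrooms": bedrooms,
    "unrelated": rng.normal(size=n),
    "price": 300 * area + 5000 * bedrooms + noise,
})

# Correlation of every column with price, strongest first.
corr_with_price = (
    demo.corr()["price"].drop("price").sort_values(key=abs, ascending=False)
)
print(corr_with_price)
```

Pairwise correlation only measures each variable's individual linear association with price. It does not show how the variables behave together in one model, which is what the regression in the next step estimates.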

### 3. Build the regression equation with statsmodels

    from statsmodels.formula.api import ols  # ols fits an ordinary least squares model from a formula
    from statsmodels.stats.anova import anova_lm

    # The data set has 6028 samples; 600 are randomly selected here
    df = df.copy().sample(600)

    # C(...) tells statsmodels to treat a variable as categorical;
    # otherwise it would be treated as a continuous variable.
    # Analysis of variance is used here to test the categorical variables.
    lm = ols('price ~ C(neighborhood) + C(style)', data=df).fit()
    print(anova_lm(lm))

    # Regress price on the numerical variables
    lm = ols('price ~ area + bedrooms + bathrooms', data=df).fit()
    print(lm.summary())

Result: the R-squared is 0.641, so the fit is only moderate, and multicollinearity among the independent variables may be present.
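
One common way to check for multicollinearity is the variance inflation factor (VIF): each predictor is regressed on the others, and the VIF is 1 / (1 - R²) of that auxiliary fit. The sketch below computes it with NumPy on synthetic predictors in which bedrooms is deliberately constructed to track area. The helper name `vif` and the data are assumptions for illustration, not results from the housing data.

```python
import numpy as np

def vif(X):
    """Variance inflation factor of each column of X (2-D array, no intercept column)."""
    X = np.asarray(X, dtype=float)
    n, k = X.shape
    factors = []
    for j in range(k):
        target = X[:, j]
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        # Regress column j on the remaining columns and compute that fit's R^2.
        coef, *_ = np.linalg.lstsq(others, target, rcond=None)
        residual = target - others @ coef
        r2 = 1.0 - residual.var() / target.var()
        factors.append(1.0 / (1.0 - r2))
    return np.array(factors)

rng = np.random.default_rng(2)
n = 300
# Synthetic predictors (not the article's data): bedrooms is built to track area closely.
area = rng.normal(2000, 400, n)
bedrooms = area / 500 + rng.normal(0, 0.3, n)
bathrooms = rng.normal(2, 0.5, n)

print(vif(np.column_stack([area, bedrooms, bathrooms])))
```

A VIF near 1 means a predictor is almost unrelated to the others, while values above roughly 5 to 10 are a common rule-of-thumb sign of strong multicollinearity. statsmodels also provides `variance_inflation_factor` in `statsmodels.stats.outliers_influence` for the same purpose.
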
Next, try the sklearn library.

## Ⅱ Sklearn

### Data processing and computing R²

The code is as follows

    # Import the related libraries
    from sklearn.model_selection import train_test_split  # splits the data into training and test sets
    from sklearn.linear_model import LinearRegression as LR  # linear regression
    from sklearn.metrics import r2_score

    # `data` is the DataFrame with neighborhood and style already converted to numbers
    x = data[['neighborhood', 'area', 'bedrooms', 'bathrooms', 'style']]  # features (independent variables)
    y = data['price']  # label (dependent variable)

    # Split into training and test sets in the proportion 8:2
    x_train, x_test, y_train, y_test = train_test_split(
        x, y, test_size=0.2, random_state=1)

    reg = LR().fit(x_train, y_train)  # train the model
    yhat = reg.predict(x_test)  # predict labels for the test set
    print("r2 = ", r2_score(y_test, yhat))  # coefficient of determination R^2
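
An R² from a single train/test split depends on which rows happen to land in the test set. Cross-validation averages over several splits and gives a more stable estimate. The following is a minimal sketch with scikit-learn's `cross_val_score` on synthetic data; the data, seed and fold count are illustrative assumptions, not the housing data set.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(3)

# Synthetic regression data (not the article's data set):
# 5 features, a linear signal and Gaussian noise.
n = 500
X = rng.normal(size=(n, 5))
true_coef = np.array([3.0, -2.0, 0.5, 0.0, 1.0])
y = X @ true_coef + rng.normal(scale=1.0, size=n)

# 5-fold cross-validation, scoring each held-out fold by R^2.
scores = cross_val_score(LinearRegression(), X, y, cv=5, scoring="r2")
print("R^2 per fold:", np.round(scores, 3))
print("mean R^2:", scores.mean())
```

The spread of the per-fold scores gives a rough sense of how much a single split's R² can vary.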


# Summary

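In this experiment, the R² of the linear regression model built with sklearn was higher than that of the model built with statsmodels. The two values are not directly comparable, however: the statsmodels model used only area, bedrooms and bathrooms and was scored on the data it was fitted to, whereas the sklearn model used all five variables and was scored on a held-out test set. Note also that both Excel and sklearn's linear_model require non-numerical data to be converted to numerical data first.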