In regression analysis, a model with two or more independent variables is called multiple regression. In practice, a phenomenon is usually associated with several factors, so predicting or estimating the dependent variable from the optimal combination of multiple independent variables is more effective and realistic than predicting it from a single independent variable. Multiple linear regression is therefore of greater practical value than univariate linear regression.
2. Multiple linear regression in Excel
① Delete the unneeded data columns neighborhood and style
② Data Analysis -> Regression
③ Select the input and output data ranges
④ Results
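Excel's Data Analysis regression tool fits ordinary least squares under the hood. As a sketch of what that computation does (the numbers below are made up for illustration, not the course data), the same coefficients can be recovered with NumPy:

```python
import numpy as np

# Hypothetical stand-in values for two spreadsheet columns and the target
area = np.array([1500.0, 2000.0, 2500.0, 1800.0, 2200.0])
bedrooms = np.array([3.0, 4.0, 4.0, 3.0, 5.0])
price = 100.0 * area + 5000.0 * bedrooms + 20000.0  # exact linear relation

# Design matrix with an intercept column, which Excel's tool adds internally
X = np.column_stack([np.ones_like(area), area, bedrooms])
coef, *_ = np.linalg.lstsq(X, price, rcond=None)
print(coef)  # [intercept, area coefficient, bedrooms coefficient]
```

On this exact-linear toy data the solver recovers the intercept 20000 and the slopes 100 and 5000, matching the numbers used to build `price`.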
For the detailed operations in this section, refer to the following link:
excel linear regression and jupyter programming
1. Implementation with the sklearn package
① Without data preprocessing
Import the packages

```python
import pandas as pd
import numpy as np
import seaborn as sns
from sklearn import datasets
from sklearn.linear_model import LinearRegression
```
Read the file

```python
df = pd.read_csv('..\\source\\house_prices.csv')
df.info()   # display column names and data types
df.head(6)  # display the first n rows; n defaults to 5
```
Take out the independent and dependent variables

```python
# Fetch the independent variables
data_x = df[['area', 'bedrooms', 'bathrooms']]
data_y = df['price']
```
Carry out the multiple linear regression and obtain the results

```python
# Multiple linear regression
model = LinearRegression()
l_model = model.fit(data_x, data_y)
print('Parameter weights')
print(model.coef_)
print('Model intercept')
print(model.intercept_)
```
Operation results
② Data processing
Outlier detection is required
```python
# Outlier handling
# ========== Outlier test function: two methods, IQR & Z-score ==========
def outlier_test(data, column, method=None, z=2):
    """Detect outliers (indexes) in one column of the data.

    params:
        data:   the complete DataFrame
        column: the column to check, given as a quoted name, e.g. 'price'
        method: detection method (optional); the default None uses the
                upper/lower cutoff point (IQR) method, 'z' uses the Z-score
                method
        z:      Z quantile, default 2 (about 2% in each tail of the normal
                curve, depending on the sign of the score); it can be changed
                to extract any top percentage of the data set
    return:
        outlier: DataFrame of outliers
        upper:   upper cutoff point
        lower:   lower cutoff point
    """
    # ========== Upper/lower cutoff point (IQR) method ==========
    if method is None:
        print(f'Detecting outliers in column "{column}" with the upper/lower cutoff (IQR) method...')
        print('=' * 70)
        # interquartile range
        column_iqr = np.quantile(data[column], 0.75) - np.quantile(data[column], 0.25)
        # first and third quartiles (q1, q3)
        (q1, q3) = np.quantile(data[column], 0.25), np.quantile(data[column], 0.75)
        # calculate the upper and lower cutoff points
        upper, lower = (q3 + 1.5 * column_iqr), (q1 - 1.5 * column_iqr)
        # detect outliers
        outlier = data[(data[column] <= lower) | (data[column] >= upper)]
        print(f'First quartile: {q1}, third quartile: {q3}, interquartile range: {column_iqr}')
        print(f'Upper cutoff point: {upper}, lower cutoff point: {lower}')
        return outlier, upper, lower
    # ========== Z-score method ==========
    if method == 'z':
        print(f'Detecting outliers in column "{column}" with the Z-score method, z = {z}...')
        print('=' * 70)
        # numerical points corresponding to the two Z scores
        mean, std = np.mean(data[column]), np.std(data[column])
        upper, lower = (mean + z * std), (mean - z * std)
        print(f'At {z} standard deviations: values greater than {upper} or less than {lower} are considered outliers.')
        print('=' * 70)
        # detect outliers
        outlier = data[(data[column] <= lower) | (data[column] >= upper)]
        return outlier, upper, lower
```
Get the set of outliers and discard it

```python
outlier, upper, lower = outlier_test(data=df, column='price', method='z')  # get the outlier data
outlier.info(); outlier.sample(5)
df.drop(index=outlier.index, inplace=True)  # discard the outlier rows
```
Take out the independent and dependent variables

```python
# Fetch the independent variables
data_x = df[['area', 'bedrooms', 'bathrooms']]
data_y = df['price']
```
Carry out the multiple linear regression and obtain the results

```python
# Multiple linear regression
model = LinearRegression()
l_model = model.fit(data_x, data_y)
print('Parameter weights')
print(model.coef_)
print('Model intercept')
print(model.intercept_)
```
Get results
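Beyond printing the coefficients, the quality of the fit can be checked with `LinearRegression.score`, which returns the coefficient of determination R². A minimal sketch on synthetic stand-in data (the column names mirror the tutorial's; the values and effect sizes are assumptions):

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression

# Synthetic stand-in for the cleaned house-price data (hypothetical values)
rng = np.random.default_rng(0)
data_x = pd.DataFrame({
    'area': rng.uniform(800, 3000, 200),
    'bedrooms': rng.integers(1, 6, 200),
    'bathrooms': rng.integers(1, 4, 200),
})
data_y = (150 * data_x['area'] + 8000 * data_x['bedrooms']
          + 5000 * data_x['bathrooms'] + rng.normal(0, 10000, 200))

model = LinearRegression().fit(data_x, data_y)
r2 = model.score(data_x, data_y)  # R^2 on the training data
print(f'R^2 = {r2:.3f}')
```

A value near 1 means the linear model explains most of the variance; comparing R² before and after outlier removal is one way to see whether the cleaning step helped.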
2. Implementation with the statsmodels library
The imports and the output differ:

```python
from statsmodels.formula.api import ols

# without dummy variables
lm = ols('price ~ area + bedrooms + bathrooms', data=df).fit()
lm.summary()
```
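The formula interface also makes it straightforward to bring a categorical column such as `neighborhood` back in as dummy variables via `C()`. A sketch on synthetic data (the column name, levels, and effect sizes here are assumptions for illustration, not values from the course data set):

```python
import numpy as np
import pandas as pd
from statsmodels.formula.api import ols

# Hypothetical data with one numeric and one categorical predictor
rng = np.random.default_rng(1)
df = pd.DataFrame({
    'area': rng.uniform(800, 3000, 100),
    'neighborhood': rng.choice(['A', 'B', 'C'], 100),
})
premium = df['neighborhood'].map({'A': 0, 'B': 20000, 'C': 50000})
df['price'] = 120 * df['area'] + premium + rng.normal(0, 5000, 100)

# C() asks the formula engine to dummy-encode the categorical column
lm = ols('price ~ area + C(neighborhood)', data=df).fit()
print(lm.params)
```

The fit then reports one coefficient per non-reference level (e.g. `C(neighborhood)[T.B]`), each measured relative to the first level.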
Processing results
Output the heatmap
As the figure shows, variables such as area, bedrooms, and bathrooms are strongly correlated with house price.
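The heatmap can be drawn with seaborn's `heatmap` applied to the correlation matrix from `DataFrame.corr()`. A minimal sketch on synthetic stand-in data (in real use, pass the cleaned house-price frame instead):

```python
import numpy as np
import pandas as pd
import matplotlib
matplotlib.use('Agg')  # render without a display
import matplotlib.pyplot as plt
import seaborn as sns

# Synthetic stand-in for the house-price columns (hypothetical values)
rng = np.random.default_rng(2)
df = pd.DataFrame({
    'area': rng.uniform(800, 3000, 100),
    'bedrooms': rng.integers(1, 6, 100),
    'bathrooms': rng.integers(1, 4, 100),
})
df['price'] = 150 * df['area'] + 8000 * df['bedrooms'] + rng.normal(0, 10000, 100)

corr = df.corr()  # pairwise Pearson correlations
sns.heatmap(corr, annot=True, cmap='coolwarm', vmin=-1, vmax=1)
plt.title('Correlation heatmap')
plt.savefig('heatmap.png')
```

Pinning the color scale with `vmin=-1, vmax=1` keeps the colors comparable across runs, and `annot=True` prints each correlation value inside its cell.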
The key difficulty of multiple linear regression lies in the relationship between the several variables and the result. The heatmap learned here is a good tool: it shows clearly how much each factor influences the outcome.
Reference links:
multiple linear regression
House price forecast based on multiple linear regression