Multiple linear regression
Multiple linear regression generalizes simple linear regression to multiple independent variables; it is also a special case of the general linear model, restricted to a single dependent variable.
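In scalar form, for a single dependent variable y observed n times with p independent variables, the model is conventionally written as:

y_i = \beta_0 + \beta_1 x_{i1} + \beta_2 x_{i2} + \dots + \beta_p x_{ip} + \varepsilon_i, \qquad i = 1, \dots, n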
The general linear model (multivariate regression model) is a common linear model in statistics.
The formula is generally written as:

Y = XB + U
Here Y is a matrix of response variables, X is a design matrix of independent variables, B is a matrix of parameters to be estimated, and U is a matrix of errors (residuals). The errors are generally assumed to be uncorrelated across measurements and to follow a multivariate normal distribution. If the errors do not follow a multivariate normal distribution, a generalized linear model can be used to relax the assumptions about Y and U.
The general linear model covers many different statistical models: ANOVA, ANCOVA, MANOVA, MANCOVA, ordinary linear regression, the t-test, and the F-test. It is a generalization of multiple linear regression to the case of more than one dependent variable. If Y, B and U are column vectors, the matrix equation above reduces to multiple linear regression.
Hypothesis testing with general linear models can be carried out in two ways: a multivariate test, or multiple independent univariate tests. In the multivariate test the columns of Y are tested together, while in the univariate approach the columns of Y are tested independently, i.e., as multiple univariate tests with the same design matrix.
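As a minimal sketch of the multivariate route, statsmodels provides MANOVA; the data frame and the column names y1, y2, x below are invented for illustration:

import pandas as pd
from statsmodels.multivariate.manova import MANOVA

# Hypothetical data: two response columns tested together against one predictor
demo = pd.DataFrame({'y1': [1.0, 2.3, 2.7, 4.4, 4.9, 6.2, 6.8, 8.3],
                     'y2': [0.6, 1.2, 2.5, 2.9, 4.2, 4.8, 6.1, 6.5],
                     'x':  [1, 2, 3, 4, 5, 6, 7, 8]})

# '+' on the left-hand side lists the columns of Y; mv_test() runs the
# multivariate hypothesis tests (Wilks' lambda, Pillai's trace, etc.)
print(MANOVA.from_formula('y1 + y2 ~ x', data=demo).mv_test())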
Significance: relationships among real-world phenomena are multifaceted, and the factors affecting how things develop are diverse. Estimating the dependent variable from an optimal combination of several independent variables is therefore more effective and practical than predicting from a single independent variable.
1, Problem analysis
Through linear regression analysis of house-price data for homes sold in a certain area over a certain period, this article explores the main factors affecting house prices, analyzes the degree of influence of these factors, and uses the results to predict the future trend of house prices.
This article explores the relationship between neighborhood, house area, number of bedrooms, number of bathrooms, house style, and house price.
2, Data preprocessing (Excel)
1. Data cleaning
The original data contain suspect records, such as houses with no bedrooms, no bathrooms, or an implausible area.
Screening rules (a pandas equivalent is sketched below):
- Remove records where bedrooms is 0
- Likewise for bathrooms
- Remove records where area is less than 1000
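For readers who prefer code to Excel, a minimal pandas sketch of the same screening, assuming the column names bedrooms, bathrooms, and area used later in this article:

import pandas as pd

df = pd.read_csv('house_prices.csv')

# Keep rows with at least one bedroom, at least one bathroom,
# and an area of at least 1000
df = df[(df['bedrooms'] > 0) & (df['bathrooms'] > 0) & (df['area'] >= 1000)]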
2. Making all data numeric
neighborhood and style are non-numerical and need to be converted into numerical form for regression analysis.
- A, B, C → 1, 2, 3
- ranch, victorian, lodge → 10, 20, 30
Replace A with 1, B with 2, and C with 3; replace ranch with 10, victorian with 20, and lodge with 30.
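The same replacement can be written in pandas; a minimal sketch, assuming the columns are named neighborhood and style as in the code later on:

# Map the categorical values to the numeric codes chosen above
df['neighborhood'] = df['neighborhood'].replace({'A': 1, 'B': 2, 'C': 3})
df['style'] = df['style'].replace({'ranch': 10, 'victorian': 20, 'lodge': 30})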
3, Regression using Excel
Regression procedure
Method: take house price as the dependent variable and the other variables as independent variables, then run a regression analysis.
Results:
Multiple R: the correlation coefficient R, which measures the degree of correlation between the independent variables x and the dependent variable y.
R Square: the coefficient of determination R², which reflects the proportion of the variation in the dependent variable that is explained by the independent variables through the regression. Intuitively, it takes the mean as the error baseline and asks how much smaller the model's prediction error is than that baseline.
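Formally, with y_i the observed values, \hat{y}_i the predictions, and \bar{y} the mean of the observed values:

R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}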
The regression on this data set gives R = 0.778, indicating that x and y are highly correlated.
It gives R-square = 0.605, meaning the independent variables explain 60.5% of the variation in the dependent variable (consistent with 0.778² ≈ 0.605).
Independent variable   Meaning        Coefficient
X Variable 1           neighborhood   9768.8665605825
X Variable 2           house area     345.152705630739
X Variable 3           bedrooms       -1733.14723959822
X Variable 4           bathrooms      8112.15494579683
X Variable 5           house style    -455.45090128214
The regression equation is: y = 9768.8x1 + 345.1x2 - 1733.1x3 + 8112.1x4 - 455.4x5 - 6497.0
It can be seen from the above that the p-value of house area (x2) is far below the 0.05 significance level, so house area is related to house price. The p-values of bedrooms and bathrooms are well above 0.05, indicating that their correlation with house price is weak.
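To make the fitted equation concrete, a small Python sketch that plugs a hypothetical house into the coefficients above; the example feature values are invented for illustration:

# Coefficients and intercept from the Excel regression above
coef = {'neighborhood': 9768.8, 'area': 345.1, 'bedrooms': -1733.1,
        'bathrooms': 8112.1, 'style': -455.4}
intercept = -6497.0

# Hypothetical house: neighborhood B (=2), area 1500, 3 bedrooms,
# 2 bathrooms, ranch style (=10); values invented for illustration
house = {'neighborhood': 2, 'area': 1500, 'bedrooms': 3,
         'bathrooms': 2, 'style': 10}

price = intercept + sum(coef[k] * house[k] for k in coef)
print(f'Predicted price: {price:.1f}')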
4, Regression using code
Ⅰ. Statsmodels
1. Data processing
Import data
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.read_csv('house_prices.csv')
df.info()
df.head()
Outlier handling
# Outlier handling
# ================ Outlier test function: two methods, IQR & Z-score ================
def outlier_test(data, column, method=None, z=2):
    """
    Detect outliers (by index) in one column of a data frame, using either
    the upper/lower cutoff-point (IQR) method or the Z-score method.

    params
    data:   the complete data frame
    column: name of the column to check, e.g. 'x'
    method: detection method (optional); default None uses the IQR
            cutoff-point method, 'z' uses the Z-score method
    z:      Z quantile, default 2, which by the normal curve cuts roughly
            2% at each tail; change it to extract any top percentage
    returns
    outlier: data frame of the outliers
    upper:   upper cutoff point
    lower:   lower cutoff point
    """
    # ================ Upper/lower cutoff-point (IQR) method ================
    if method is None:
        print(f'Using column {column}, detecting outliers with the upper/lower cutoff-point (IQR) method...')
        print('=' * 70)
        # Interquartile range
        column_iqr = np.quantile(data[column], 0.75) - np.quantile(data[column], 0.25)
        # First and third quartiles (q1, q3)
        (q1, q3) = np.quantile(data[column], 0.25), np.quantile(data[column], 0.75)
        # Calculate the upper and lower cutoff points
        upper, lower = (q3 + 1.5 * column_iqr), (q1 - 1.5 * column_iqr)
        # Detect outliers
        outlier = data[(data[column] <= lower) | (data[column] >= upper)]
        print(f'First quartile: {q1}, third quartile: {q3}, interquartile range: {column_iqr}')
        print(f'Upper cutoff point: {upper}, lower cutoff point: {lower}')
        return outlier, upper, lower
    # ================ Z-score method ================
    if method == 'z':
        print(f'Using column {column}, detecting outliers with the Z-score method at z = {z}...')
        print('=' * 70)
        # Cutoff values at the two Z-scores
        mean, std = np.mean(data[column]), np.std(data[column])
        upper, lower = (mean + z * std), (mean - z * std)
        print(f'With z = {z}: values greater than {upper} or less than {lower} are considered outliers.')
        print('=' * 70)
        # Detect outliers
        outlier = data[(data[column] <= lower) | (data[column] >= upper)]
        return outlier, upper, lower
outlier, upper, lower = outlier_test(data=df, column='price', method='z')
outlier.info()
outlier.sample(5)
Discard the outlier data
df.drop(index=outlier.index, inplace=True)
2. Draw a heat map to explore the relationship between price and the other variables
# Heat map
def heatmap(data, method='pearson', camp='RdYlGn', figsize=(10, 8)):
    """
    data:    the whole data frame
    method:  correlation method, default 'pearson'
    camp:    colormap, default 'RdYlGn' (red-yellow-green); 'YlGnBu'
             (yellow-green-blue) and 'Blues'/'Greens' are also good choices
    figsize: default (10, 8)
    """
    ## To drop the color blocks duplicated across the diagonal, uncomment:
    # mask = np.zeros_like(data.corr())
    # mask[np.tril_indices_from(mask)] = True
    plt.figure(figsize=figsize, dpi=80)
    sns.heatmap(data.corr(method=method),
                xticklabels=data.corr(method=method).columns,
                yticklabels=data.corr(method=method).columns,
                cmap=camp, center=0, annot=True)
    # To keep only half of the matrix, add mask=mask to the sns.heatmap() call

heatmap(data=df, figsize=(6, 5))
It can be seen that area, bedrooms, bathrooms, and other variables have non-trivial correlations with house price, so they are put into the model.
3. Establish the regression equation using statsmodels
import statsmodels.api as sm
from statsmodels.formula.api import ols  # ols builds the linear regression model
from statsmodels.stats.anova import anova_lm

# The data set has 6028 samples; 600 are randomly sampled here
df = df.copy().sample(600)

# C() tells Python that a variable is categorical; otherwise it would be
# treated as continuous.
## Analysis of variance is used here to test all the categorical variables;
## the following lines are the standard way to run ANOVA with this library
lm = ols('price ~ C(neighborhood) + C(style)', data=df).fit()
anova_lm(lm)

lm = ols('price ~ area + bedrooms + bathrooms', data=df).fit()
lm.summary()
The resulting R-squared was 0.641; the fit is only moderate, and multicollinearity may be present.
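One way to probe the suspected multicollinearity is the variance inflation factor from statsmodels; a minimal sketch over the three continuous predictors, assuming df is the sampled data frame from the step above:

import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Design matrix with a constant column, as VIF expects
X = sm.add_constant(df[['area', 'bedrooms', 'bathrooms']])

# VIF per column; values well above about 10 are usually read as strong collinearity
for i, name in enumerate(X.columns):
    print(name, variance_inflation_factor(X.values, i))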
Next, try the Sklearn library.
Ⅱ. Sklearn
Data processing and calculation of R²
The code is as follows:
# Import related libraries
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split, cross_val_score  # train/test split and cross-validation
from sklearn.linear_model import Lasso, Ridge, LinearRegression as LR  # linear models
from sklearn.metrics import r2_score, explained_variance_score as EVS, mean_squared_error as MSE
# Read in data
data = pd.read_csv('house_prices.csv')
x = data[['neighborhood', 'area', 'bedrooms', 'bathrooms', 'style']]  # feature data, independent variables
y = data['price']  # target values, dependent variable
# Split into training and test sets in an 8:2 proportion
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=1)
reg = LR().fit(x_train, y_train)  # train the model
yhat = reg.predict(x_test)  # predict labels from the test-set x
print("r2 = ", r2_score(y_test, yhat))  # coefficient of determination R^2
Result:
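As a sturdier check than a single 8:2 split, the cross_val_score imported above can average R² across several folds; a minimal sketch:

# 5-fold cross-validated R^2 over the full feature matrix
scores = cross_val_score(LR(), x, y, cv=5, scoring='r2')
print("cross-validated r2 =", scores.mean())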
Summary
In this experiment, the accuracy of the linear regression model based on Sklearn was higher than that based on Statsmodels. Note that with both Excel and sklearn's linear_model, non-numerical data must first be converted to numerical data.
Related links
House price forecast based on multiple linear regression
Detailed explanation of multiple linear regression analysis theory