Regression analysis
Regression analysis is a widely used quantitative method. It analyzes the statistical relationship between phenomena, focuses on the quantitative laws connecting variables, and describes this relationship in the form of a regression equation, helping people grasp precisely how much a variable is affected by one or more other variables and providing a scientific basis for prediction. In big data analysis, regression analysis is a predictive modeling technique that studies the relationship between dependent variables (targets) and independent variables (predictors). It is commonly used for predictive analysis, time series modeling, and discovering causal relationships between variables.
Overview of regression analysis
Basic concepts
Regression analysis is a mathematical method for handling correlation between variables. Correlation differs from a functional relationship: the latter reflects a strict dependence between variables, while the former shows a certain degree of volatility or randomness; for each value of the independent variable, the dependent variable may take multiple corresponding values. In statistics, both regression analysis and correlation analysis can be used to study correlation.
When the independent variable is non-random and the dependent variable is random, the analysis of their relationship is called regression analysis; when both are random variables, it is called correlation analysis. The two are often not strictly distinguished: broadly speaking, correlation analysis includes regression analysis, but strictly speaking they differ. For two correlated variables $\xi$ and $\eta$, although there is a close relationship between them, the value of one variable cannot be computed exactly from the value of the other. Usually, the mathematical expectation of $\eta$ when $\xi = x$ is chosen as the value of $\eta$ corresponding to $\xi = x$, because it reflects the average of the values taken by $\eta$ under the condition $\xi = x$. Such a correspondence is called a regression. Through regression analysis, a mathematical expression between the variables, called the regression equation, can be established; the regression equation reflects how the average state of the dependent variable changes with the independent variable under fixed conditions. Correlation analysis uses an index to measure how close the relationship described by the regression equation is. The two methods complement each other: if correlation analysis shows that the relationship between the variables is very close, fairly accurate predicted values can be obtained from the established regression equation.
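In symbols, the correspondence that defines the regression is the conditional expectation of $\eta$ given $\xi = x$:

$$f(x) = E(\eta \mid \xi = x)$$

where $f$ is called the regression function of $\eta$ on $\xi$.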
Problems regression analysis can solve
- Establish the mathematical expression between variables, usually called an empirical formula.
- Use basic probability and statistics to judge the effectiveness of the established empirical formula.
- Conduct factor analysis to determine which of several variables affecting a target variable are primary and which are secondary, and how they relate to one another.
Although there is some uncertainty between correlated variables, the statistical law between them can be uncovered through repeated observation of the phenomenon; such a statistical law is called a regression relationship. The theory, calculation, and analysis of regression relationships is called regression analysis.
Steps of regression analysis
First, determine the dependent variable to be predicted, then focus on the explanatory variables. Multiple regression analysis then gives the relationship between the dependent variable and the explanatory variables. Finally, this relationship, in the form of a formula, is used to predict future values of the dependent variable.
Regression analysis can be divided into linear regression analysis and logistic regression analysis.
Linear regression
In short, linear regression multiplies each input item by a constant and adds up the results to obtain the output. Linear regression includes univariate (simple) linear regression and multivariate linear regression.
Simple linear regression analysis
In linear regression analysis, if there is only one independent variable and one dependent variable, and their relationship can be roughly represented by a straight line, it is called simple linear regression analysis.
If it is found that there is a high positive correlation between dependent variable Y and independent variable X, a straight line equation can be determined so that all data points are as close to the fitted straight line as possible. The model of simple linear regression analysis can be expressed by the following equation:
$$Y = a + bx$$
where Y is the dependent variable, a is the intercept, b is the regression coefficient (the slope), and x is the independent variable.
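As a minimal sketch of fitting this equation with NumPy (the data points are illustrative, in the spirit of the house-price example later in this article):

```python
import numpy as np

# Illustrative sample data: x = house area, y = price
x = np.array([150, 200, 250, 300, 350])
y = np.array([6450, 7450, 8450, 9450, 11450])

# Degree-1 polyfit returns the coefficients highest power first: (b, a)
b, a = np.polyfit(x, y, 1)
print("intercept a =", a, "slope b =", b)
print("prediction at x = 400:", a + b * 400)
```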
Multiple linear regression analysis
Multiple linear regression analysis is a generalization of simple linear regression analysis to the regression of dependent variables on multiple independent variables. The most common case is a single dependent variable with multiple independent variables, also called multiple regression analysis. The general form of the multiple regression model is:
$$Y = a + b_{1}X_{1} + b_{2}X_{2} + b_{3}X_{3} + \cdots + b_{k}X_{k}$$
where $a$ is the intercept and $b_{1}, b_{2}, b_{3}, \cdots, b_{k}$ are the regression coefficients.
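A short sketch of estimating the coefficients of such a model by least squares with NumPy (the data here are invented purely for illustration):

```python
import numpy as np

# Illustrative data: 6 observations, 2 explanatory variables
X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0],
              [4.0, 3.0], [5.0, 5.0], [6.0, 4.0]])
y = np.array([5.1, 5.9, 11.2, 11.8, 16.0, 16.9])

# Prepend a column of ones so the first coefficient is the intercept a
A = np.c_[np.ones(len(X)), X]
coef, *_ = np.linalg.lstsq(A, y, rcond=None)
print("a =", coef[0], "b1 =", coef[1], "b2 =", coef[2])
```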
Nonlinear regression data analysis
For linear regression problems, the sample points fall on or near a straight line, so a linear function can represent the relationship between the independent and dependent variables. In some applications, however, the relationship between the variables is curvilinear and cannot be expressed by a linear function; a nonlinear function is needed instead.
Some nonlinear regression models commonly used in data mining:
- Asymptotic regression model: $Y = a + be^{-rX}$
- Quadratic (conic) model: $Y = a + b_{1}X + b_{2}X^{2}$
- Hyperbolic model: $Y = a + \frac{b}{X}$
Because many nonlinear models are equivalent, the parameterization of a model is not unique, which makes fitting and interpreting nonlinear models much more complex than linear ones. In nonlinear regression analysis, the most common method for estimating regression parameters is still least squares.
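For a model that is nonlinear in its parameters, such as the asymptotic form $Y = a + be^{-rX}$ above, least squares estimation is typically done iteratively. A minimal sketch with SciPy's `curve_fit`, assuming SciPy is available (the data and starting values are invented for illustration):

```python
import numpy as np
from scipy.optimize import curve_fit

def model(x, a, b, r):
    # Asymptotic regression model: y = a + b * exp(-r * x)
    return a + b * np.exp(-r * x)

# Illustrative data generated near a = 10, b = -8, r = 0.3
rng = np.random.default_rng(0)
x = np.arange(1, 17)
y = 10 - 8 * np.exp(-0.3 * x) + rng.normal(0, 0.1, x.size)

# Nonlinear least squares needs rough starting values p0
params, _ = curve_fit(model, x, y, p0=(9.0, -7.0, 0.2))
print("a, b, r =", params)
```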
Least squares derivation
Suppose we are given $n$ data points:

$$(x_{1},y_{1}), (x_{2},y_{2}), \cdots, (x_{n},y_{n})$$
These $n$ data points are to be fitted with a curve. Observation suggests that the points lie approximately on a parabola, so assume the curve equation has the form

$$y = a_{2}x^{2} + a_{1}x + a_{0}$$

where $a_{0}, a_{1}, a_{2}$ are unknown. Substituting the point $(x_{1}, y_{1})$ into the equation gives:
$$y_{1} = a_{2}x_{1}^{2} + a_{1}x_{1} + a_{0}$$

which can be rewritten as:

$$\begin{pmatrix} x_{1}^{2} & x_{1} & 1 \end{pmatrix} \begin{pmatrix} a_{2} \\ a_{1} \\ a_{0} \end{pmatrix} = y_{1}$$
Similarly, for each $(x_{i}, y_{i}),\ i = 1, 2, \cdots, n$, we get:

$$\begin{pmatrix} x_{i}^{2} & x_{i} & 1 \end{pmatrix} \begin{pmatrix} a_{2} \\ a_{1} \\ a_{0} \end{pmatrix} = y_{i}$$
Therefore, the equations can be combined into matrix form:

$$\begin{pmatrix} x_{1}^{2} & x_{1} & 1 \\ \vdots & \vdots & \vdots \\ x_{n}^{2} & x_{n} & 1 \end{pmatrix} \begin{pmatrix} a_{2} \\ a_{1} \\ a_{0} \end{pmatrix} = \begin{pmatrix} y_{1} \\ \vdots \\ y_{n} \end{pmatrix}$$
Denote the $n \times 3$ matrix on the left by $A$, the vector $(y_{1}, \cdots, y_{n})^{T}$ on the right by $T$, and the unknown coefficient vector $(a_{2}, a_{1}, a_{0})^{T}$ by $x$. Then:

$$Ax = T$$
$$\Rightarrow A^{T}Ax = A^{T}T$$
$$\Rightarrow (A^{T}A)^{-1}A^{T}Ax = (A^{T}A)^{-1}A^{T}T$$
$$\Rightarrow x = (A^{T}A)^{-1}A^{T}T$$
Solving linear regression using this principle
Linear regression with a quadratic model
```python
import numpy as np
import matplotlib.pyplot as plt
import numpy.linalg as lg

t = np.arange(1, 17, 1)
y = np.array([4, 6.4, 8, 8.8, 9.22, 9.5, 9.7, 9.86, 10, 10.20,
              10.32, 10.42, 10.5, 10.55, 10.58, 10.6])
plt.figure()
plt.plot(t, y, 'k*')
# Model: y = a*t^2 + b*t + c
A = np.c_[t ** 2, t, np.ones(t.shape)]
# Normal-equation solution w = (A^T A)^{-1} A^T y
w = lg.inv(A.T.dot(A)).dot(A.T).dot(y)
plt.plot(t, w[0] * t ** 2 + w[1] * t + w[2])
plt.show()
```
Linear regression with a square-root model
```python
import numpy as np
import matplotlib.pyplot as plt
import numpy.linalg as lg

t = np.arange(1, 17, 1)
y = np.array([4, 6.4, 8, 8.8, 9.22, 9.5, 9.7, 9.86, 10, 10.20,
              10.32, 10.42, 10.5, 10.55, 10.58, 10.6])
plt.figure()
plt.plot(t, y, 'k*')
# Model: y = a*x^(1/2) + b*x^(1/4) + c
A = np.c_[t ** (1 / 2), t ** (1 / 4), np.ones(t.shape)]
w = lg.inv(A.T.dot(A)).dot(A.T).dot(y)
plt.plot(t, w[0] * t ** (1 / 2) + w[1] * t ** (1 / 4) + w[2])
plt.show()
```
Hyperbolic linear regression
```python
import numpy as np
import matplotlib.pyplot as plt
import numpy.linalg as lg

t = np.arange(1, 17, 1)
y = np.array([4, 6.4, 8, 8.8, 9.22, 9.5, 9.7, 9.86, 10, 10.20,
              10.32, 10.42, 10.5, 10.55, 10.58, 10.6])
plt.figure()
plt.plot(t, y, 'k*')
# Model: y = a/t + b
A = np.c_[1 / t, np.ones(t.shape)]
w = lg.inv(A.T.dot(A)).dot(A.T).dot(y)
plt.plot(t, w[0] / t + w[1])
plt.show()
```
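A design note on the three scripts above: `lg.inv(A.T.dot(A)).dot(A.T).dot(y)` implements the normal-equation formula $x = (A^{T}A)^{-1}A^{T}T$ literally, but explicitly inverting $A^{T}A$ can be numerically unstable. `np.linalg.lstsq` solves the same least-squares problem more robustly; a sketch of the substitution for the hyperbolic model:

```python
import numpy as np

t = np.arange(1, 17, 1)
y = np.array([4, 6.4, 8, 8.8, 9.22, 9.5, 9.7, 9.86, 10, 10.20,
              10.32, 10.42, 10.5, 10.55, 10.58, 10.6])

# Same design matrix as the hyperbolic script: y = a/t + b
A = np.c_[1 / t, np.ones(t.shape)]
w, *_ = np.linalg.lstsq(A, y, rcond=None)
print("a =", w[0], "b =", w[1])
```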
Implementation of univariate linear regression with Python
A simple example of linear regression is the problem of house value prediction. Generally speaking, the larger the house, the higher the value of the house. Therefore, it can be inferred that the value of the house is related to the area of the house.
| number | square feet | price (yuan / square foot) |
|---|---|---|
| 1 | 150 | 6450 |
| 2 | 200 | 7450 |
| 3 | 250 | 8450 |
| 4 | 300 | 9450 |
| 5 | 350 | 11450 |
| 6 | 400 | 15400 |
| 7 | 600 | 18450 |
1) In linear regression, we must find a linear relationship in the data from which we can obtain a and b. The assumed equation is as follows:
$$y(X) = a + bX$$
where $y(X)$ is the price (the value to be predicted) for a given square footage, meaning the price is a linear function of the square feet; $a$ is a constant (the intercept) and $b$ is the regression coefficient. Now start programming:
Store the data into a CSV file named input_data.csv
```
id,square_feet,price
1,150,6450
2,200,7450
3,250,8450
4,300,8450
5,350,11450
6,400,15400
7,600,18450
```
Create a new Python file named predict_house_price.py:
```python
import pandas as pd
from sklearn import linear_model
import matplotlib.pyplot as plt

# Read data function
def get_data(file_name):
    data = pd.read_csv(file_name)  # Read the CSV file
    x_parameter = []
    y_parameter = []
    # Traverse the data and store it in the corresponding lists
    for single_square_feet, single_price_value in zip(data['square_feet'], data['price']):
        x_parameter.append([float(single_square_feet)])
        y_parameter.append(float(single_price_value))
    return x_parameter, y_parameter

# Fit the data to a linear model
def linear_model_main(x_parameters, y_parameters, predict_value):
    regr = linear_model.LinearRegression()
    regr.fit(x_parameters, y_parameters)  # Train the model
    predict_outcome = regr.predict(predict_value)
    predictions = {'intercept': regr.intercept_,
                   'coefficient': regr.coef_,
                   'predicted_value': predict_outcome}
    return predictions

# Display the result of the linear fit
def show_linear_line(x_parameters, y_parameters):
    regr = linear_model.LinearRegression()
    regr.fit(x_parameters, y_parameters)
    plt.scatter(x_parameters, y_parameters, color='blue')
    plt.plot(x_parameters, regr.predict(x_parameters), color='red', linewidth=4)
    plt.xticks(())
    plt.yticks(())
    plt.show()

x, y = get_data('input_data.csv')
predictValue = [[700]]
result = linear_model_main(x, y, predictValue)
print("Intercept value", result['intercept'])
print("coefficient", result['coefficient'])
print("Predicted value", result['predicted_value'])
show_linear_line(x, y)
```
The scikit-learn machine learning package is used here; it is one of the best machine learning packages implemented in Python.
- regr.fit(x_parameters, y_parameters) trains the model on the input point pairs.
- regr.predict(predict_value) uses the model to predict the y value for an x.
- regr.intercept_ is the intercept of the linear regression, i.e., the value of a.
- regr.coef_ is the coefficient, i.e., the value of b.
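As a quick sanity check (not part of the original script), the same intercept and coefficient can be recovered with NumPy's `polyfit` and compared with scikit-learn's values:

```python
import numpy as np
import pandas as pd

data = pd.read_csv('input_data.csv')
# Degree-1 polyfit returns (b, a) for price = a + b * square_feet
b, a = np.polyfit(data['square_feet'], data['price'], 1)
print("intercept a =", a, "coefficient b =", b)
```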
Implementing multiple linear regression with Python
When multiple factors affect the outcome, a multiple linear regression model can be used. For example, the sales of a product may be related to the investment in TV, radio, and newspaper advertising:

$$Sales = \beta_{0} + \beta_{1} \times TV + \beta_{2} \times Radio + \beta_{3} \times Newspaper$$
Reading data using pandas
pandas is a Python library for data exploration, data analysis, and data processing.
```python
import pandas as pd

data = pd.read_csv("D:/Data/Advertising.csv")
print(data.head())
```
The result printed above is similar to a spreadsheet. This structure is called a pandas data frame; the full name of the type is pandas.core.frame.DataFrame.
The two main data structures of pandas are Series and DataFrame. A Series is similar to a one-dimensional array, composed of a set of data and an associated set of data labels (an index). A DataFrame is a tabular data structure containing an ordered set of columns, each of which can hold a different value type; it has both row and column indexes and can be regarded as a dictionary of Series.
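A minimal illustration of the two structures (the values are invented):

```python
import pandas as pd

# A Series: one-dimensional values plus an index of labels
s = pd.Series([200, 250, 300], index=['a', 'b', 'c'])

# A DataFrame: a dict of named columns, each column effectively a Series
df = pd.DataFrame({'TV': [230.1, 44.5], 'radio': [37.8, 39.3],
                   'sales': [22.1, 10.4]})
print(s)
print(df)
print(type(df['TV']))  # each column is a pandas Series
```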
```python
import pandas as pd

data = pd.read_csv("D:/Data/Advertising.csv")
print(data.tail())
```
```python
import pandas as pd

data = pd.read_csv("D:/Data/Advertising.csv")
print(data.shape)
```
Analyzing the data
- TV: advertising expenses invested in TV
- Radio: advertising expenses invested in broadcast media
- Newspaper: advertising expenses invested in newspaper media
- Sales: sales of the corresponding product (the response, a continuous value)

In this case, product sales are predicted from the different advertising investments. Because the response variable is a continuous value, this is a regression problem. There are 200 observations in the data set, each corresponding to one market.
Note: the seaborn package is recommended here for better data visualization. seaborn is built on top of Matplotlib, but it needs to be installed separately.
```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

data = pd.read_csv("D:/Data/Advertising.csv")
# Use scatter plots to visualize the relationship between each feature and the response.
# Here TV, radio, and newspaper are the features and sales is the observed value.
# Note: in seaborn >= 0.9 the pairplot size parameter is called `height`, not `size`.
sns.pairplot(data, x_vars=['TV', 'radio', 'newspaper'], y_vars='sales',
             height=7, aspect=0.8, kind='reg')
plt.show()
```
seaborn's pairplot function plots a scatter plot of Y against each dimension of X. The height (size in older seaborn versions) and aspect parameters adjust the size and aspect ratio of the display. By adding the parameter kind='reg', seaborn draws a best-fit line and its 95% confidence band.
It can be seen that there is a strong linear relationship between the TV feature and Sales, the linear relationship between Radio and Sales is weaker, and that between Newspaper and Sales is the weakest.
Linear regression model
Advantages: fast; no tuning parameters; easy to explain and understand.
Disadvantages: compared with other, more complex models, its prediction accuracy is not high, because it assumes a linear relationship between the features and the response; for nonlinear relationships, a linear regression model obviously cannot model the data well.
Use pandas to build X (feature matrix) and y (label column)
Scikit-learn requires X to be a feature matrix and y to be a NumPy vector. Pandas is built on NumPy, so X can be a pandas DataFrame and y a pandas Series; scikit-learn understands this structure.
```python
import pandas as pd

data = pd.read_csv("D:/Data/Advertising.csv")
feature_cols = ['TV', 'radio', 'newspaper']
x = data[feature_cols]
print(x.head())
```
```python
import pandas as pd

data = pd.read_csv("D:/Data/Advertising.csv")
feature_cols = ['TV', 'radio', 'newspaper']
x = data[feature_cols]
y = data['sales']
print(y.head())
```
Build training set and test set
```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn import metrics
import numpy as np

data = pd.read_csv("D:/Data/Advertising.csv")
feature_cols = ['TV', 'radio', 'newspaper']
x = data[feature_cols]
y = data['sales']
x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=1)
print(x_train.shape)
print(x_test.shape)
print(y_train.shape)
print(y_test.shape)
```
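By default, `train_test_split` holds out 25% of the rows for testing, so with the 200 observations in this data set the printed shapes should be (150, 3), (50, 3), (150,), and (50,). The split fraction can also be set explicitly:

```python
from sklearn.model_selection import train_test_split

# Equivalent to the default 75/25 split, written out explicitly;
# random_state fixes the shuffle so results are reproducible
x_train, x_test, y_train, y_test = train_test_split(
    x, y, test_size=0.25, random_state=1)
```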
Linear regression with sklearn
To do linear regression with sklearn, first import the relevant linear regression model, then run the linear regression fit.
```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn import metrics
import numpy as np

data = pd.read_csv("D:/Data/Advertising.csv")
feature_cols = ['TV', 'radio', 'newspaper']
x = data[feature_cols]
y = data['sales']
x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=1)
linreg = LinearRegression()
model = linreg.fit(x_train, y_train)
print(model)
print(linreg.intercept_)
print(linreg.coef_)
```
Results of linear regression
$$y = 2.867 + 0.0465 \times TV + 0.179 \times Radio + 0.00345 \times Newspaper$$
Forecast
After the regression model is obtained by the linear fit, the model can be used to predict new data; the prediction results are obtained through the predict function.
```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn import metrics
import numpy as np

data = pd.read_csv("D:/Data/Advertising.csv")
feature_cols = ['TV', 'radio', 'newspaper']
x = data[feature_cols]
y = data['sales']
x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=1)
linreg = LinearRegression()
model = linreg.fit(x_train, y_train)
y_pred = linreg.predict(x_test)
print(y_pred)
print(type(y_pred))
```
Evaluation measures
For classification problems, the evaluation measure is accuracy, but accuracy is not suitable for regression problems; evaluation measures for continuous values are used instead. Three commonly used evaluation measures for linear regression are introduced here (their formulas follow the list):
- Mean Absolute Error (MAE)
- Mean square error (MSE)
- Root mean square error (RMSE)
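For predictions $\hat{y}_i$, true values $y_i$, and $m$ test samples, these measures are:

$$\mathrm{MAE} = \frac{1}{m}\sum_{i=1}^{m}\left|y_i - \hat{y}_i\right|,\qquad \mathrm{MSE} = \frac{1}{m}\sum_{i=1}^{m}\left(y_i - \hat{y}_i\right)^2,\qquad \mathrm{RMSE} = \sqrt{\mathrm{MSE}}$$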
```python
# Continuing from the previous script: y_pred and y_test are already defined
print(type(y_pred), type(y_test))
print(len(y_pred), len(y_test))
print(y_pred.shape, y_test.shape)
sum_mean = 0
for i in range(len(y_pred)):
    sum_mean += (y_pred[i] - y_test.values[i]) ** 2
# Divide by the number of test samples (50 here) rather than a hard-coded constant
sum_erro = np.sqrt(sum_mean / len(y_pred))
print("RMSE by hand:", sum_erro)
```
```python
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn import metrics
import numpy as np

data = pd.read_csv("D:/Data/Advertising.csv")
feature_cols = ['TV', 'radio', 'newspaper']
x = data[feature_cols]
y = data['sales']
x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=1)
linreg = LinearRegression()
model = linreg.fit(x_train, y_train)
y_pred = linreg.predict(x_test)

# RMSE computed by hand
sum_mean = 0
for i in range(len(y_pred)):
    sum_mean += (y_pred[i] - y_test.values[i]) ** 2
sum_erro = np.sqrt(sum_mean / len(y_pred))
print("RMSE by hand:", sum_erro)

# Plot predicted values against the true test values
plt.figure()
plt.plot(range(len(y_pred)), y_pred, 'b', label="predict")
plt.plot(range(len(y_pred)), y_test, 'r', label="test")
plt.legend(loc="upper right")
plt.xlabel("the number of sales")
plt.ylabel("value of sales")
plt.show()
```
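The hand-rolled RMSE loop can also be replaced by the `sklearn.metrics` helpers already imported in the script; a short sketch appended to the same script:

```python
# Continuing from the previous script: y_test and y_pred are already defined
print("MAE:", metrics.mean_absolute_error(y_test, y_pred))
print("MSE:", metrics.mean_squared_error(y_test, y_pred))
print("RMSE:", np.sqrt(metrics.mean_squared_error(y_test, y_pred)))
```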