Data mining regression analysis


Regression analysis is a widely used quantitative analysis method. It analyzes the statistical relationships between things, focuses on the quantitative laws by which variables change together, and describes this relationship in the form of a regression equation. This helps people grasp precisely how much a variable is affected by one or more other variables, and thus provides a scientific basis for prediction. In big data analysis, regression analysis is a predictive modeling technique that studies the relationship between a dependent variable (target) and independent variables (predictors). It is typically used for predictive analysis, time series modeling, and discovering causal relationships between variables.

Overview of regression analysis

Basic concepts
Regression analysis is a mathematical method for dealing with correlation between variables. A correlation relationship differs from a functional relationship: the latter reflects a strict dependence between variables, while the former shows a certain degree of volatility or randomness; for each value of the independent variable, the dependent variable may take several corresponding values. In statistics, both regression analysis and correlation analysis can be used to study correlation.
When the independent variable is non-random and the dependent variable is random, analyzing their relationship is called regression analysis; when both are random variables, it is called correlation analysis. Regression analysis and correlation analysis are often not sharply distinguished. Broadly speaking, correlation analysis includes regression analysis, but strictly speaking the two are different. For two correlated variables $\xi$ and $\eta$, although there is a close relationship between them, the value of one variable cannot be computed exactly from the value of the other. Usually the mathematical expectation of $\eta$ given $\xi = x$ is taken as the value corresponding to $\xi = x$, because it reflects the average of the values of $\eta$ under the condition $\xi = x$. Such a correspondence is called a regression. Based on regression analysis, a mathematical expression between the variables, called the regression equation, can be established. The regression equation reflects how the average state of the dependent variable changes as the independent variable is held at fixed values. Correlation analysis uses an index to measure how closely the variables described by the regression equation are related. Correlation analysis is often supplemented by regression analysis; the two complement each other. If correlation analysis shows that the relationship between variables is very close, fairly accurate values can be obtained from the established regression equation.
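In symbols, the regression correspondence just described is the conditional expectation (a standard formulation, made explicit here):

$$f(x) = E\left[\,\eta \mid \xi = x\,\right]$$

which assigns to each value $x$ of $\xi$ the average value of $\eta$ under that condition.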
Solvable problems

  • Establish the mathematical expression between variables, usually called an empirical formula.
  • Use basic knowledge of probability and statistics to judge the effectiveness of the established empirical formula.
  • Carry out factor analysis to determine which of several variables affecting a target variable are primary and which are secondary, as well as the relationships between them.

Although there is some uncertainty between correlated variables, the statistical law between them can be explored through repeated observation of the phenomenon. This kind of statistical law is called a regression relationship. The theory, calculation, and analysis of regression relationships are called regression analysis.

Steps of regression analysis

First, determine the dependent variable to be predicted, then focus on the explanatory variables. Multiple regression analysis gives the relationship between the dependent variable and the explanatory variables; finally, this relationship is expressed as a formula and used to predict future values of the dependent variable.
Regression analysis can be divided into linear regression analysis and logistic regression analysis.

Linear regression

In short, linear regression multiplies each input by a constant and adds up the results to obtain the output. Linear regression includes univariate (simple) linear regression and multivariate (multiple) linear regression.

Simple linear regression analysis

In linear regression analysis, if there is only one dependent variable and one independent variable, and their relationship can be approximately represented by a straight line, it is called simple linear regression analysis.
If a high positive correlation is found between the dependent variable Y and the independent variable X, a straight-line equation can be determined so that all data points are as close to the fitted line as possible. The model of simple linear regression analysis can be expressed by the equation $Y = a + bX$,
where Y is the dependent variable, a is the intercept, b is the regression coefficient (the slope), and X is the independent variable.
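As a concrete illustration (an addition, using made-up numbers), the least-squares estimates of a and b have closed forms, $b = \sum_{i}(x_{i}-\bar{x})(y_{i}-\bar{y}) \,/\, \sum_{i}(x_{i}-\bar{x})^{2}$ and $a = \bar{y} - b\bar{x}$, which can be computed directly with NumPy:

import numpy as np

# Hypothetical sample data, for illustration only
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# Closed-form least-squares estimates for y = a + b*x
b = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
a = y.mean() - b * x.mean()
print("intercept a =", a, "slope b =", b)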

Multiple linear regression analysis

Multiple linear regression analysis is a generalization of simple linear regression analysis. It refers to regression analysis of one or more dependent variables on multiple independent variables; the most common case is limited to one dependent variable with multiple independent variables, which is also called multiple regression analysis. The general form of multiple regression analysis is

$$Y = a + b_{1}X_{1} + b_{2}X_{2} + b_{3}X_{3} + \cdots + b_{k}X_{k}$$

where a is the intercept and $b_{1}, b_{2}, b_{3}, \cdots, b_{k}$ are the regression coefficients.
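In matrix form the model reads $Y = Xb$ once a column of ones is added for the intercept, and the coefficients can be estimated in a single call to NumPy's least-squares solver. A minimal sketch on made-up data (the data and variable names here are illustrative, not from the original):

import numpy as np

# Hypothetical data: 6 observations, 2 predictors
X = np.array([[1.0, 2.0], [2.0, 1.0], [3.0, 4.0],
              [4.0, 3.0], [5.0, 6.0], [6.0, 5.0]])
y = np.array([8.0, 7.0, 15.0, 14.0, 22.0, 21.0])

# Prepend a column of ones so the first coefficient is the intercept a
A = np.c_[np.ones(len(X)), X]
coef, residuals, rank, sv = np.linalg.lstsq(A, y, rcond=None)
print("a =", coef[0], " b1, b2 =", coef[1:])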

Nonlinear regression data analysis

For linear regression problems, the sample points fall on or near a straight line in space, so a linear function can be used to represent the corresponding relationship between independent variables and dependent variables. However, in some applications, the relationship between variables is in the form of curve, so the corresponding relationship between independent variables and dependent variables cannot be expressed by linear function, but needs to be expressed by nonlinear function.
Some nonlinear regression models commonly used in data mining

  • Asymptotic regression model: $Y = a + be^{-rX}$
  • Quadratic curve model: $Y = a + b_{1}X + b_{2}X^{2}$
  • Hyperbolic model: $Y = a + \frac{b}{X}$

Because many nonlinear models are equivalent, the parameterization of a model is not unique, which makes fitting and interpreting nonlinear models considerably more complex than linear ones. In nonlinear regression analysis, the most common method for estimating the regression parameters is still the least squares method.
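For models such as the asymptotic one above, which cannot be linearized by transforming the variables, nonlinear least squares can be run directly. A minimal sketch using scipy.optimize.curve_fit (SciPy and the synthetic data are assumptions here; the original text does not use them):

import numpy as np
from scipy.optimize import curve_fit

# Asymptotic regression model Y = a + b * exp(-r * X)
def asymptotic(x, a, b, r):
    return a + b * np.exp(-r * x)

# Synthetic noisy data generated from known parameters
rng = np.random.default_rng(0)
x = np.linspace(0, 10, 50)
y = asymptotic(x, 10.0, -6.0, 0.5) + rng.normal(0, 0.1, x.size)

# p0 supplies starting values; nonlinear least squares needs an initial guess
params, cov = curve_fit(asymptotic, x, y, p0=[1.0, 1.0, 0.1])
print("a, b, r =", params)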

Least squares derivation

Given $n$ data points $(x_{1},y_{1}), (x_{2},y_{2}), \cdots, (x_{n},y_{n})$, we want to fit a curve to them. Observation suggests the points lie near a parabola, so the curve is assumed to have the form $y = a_{2}x^{2} + a_{1}x + a_{0}$, where $a_{0}, a_{1}, a_{2}$ are unknown. Substituting $(x_{1}, y_{1})$ into the equation gives $y_{1} = a_{2}x_{1}^{2} + a_{1}x_{1} + a_{0}$, which can be rewritten as

$$\begin{pmatrix} x_{1}^{2} & x_{1} & 1 \end{pmatrix} \begin{pmatrix} a_{2} \\ a_{1} \\ a_{0} \end{pmatrix} = y_{1}$$

Similarly, each point $(x_{i}, y_{i}),\ i = 1, 2, \cdots, n$ gives

$$\begin{pmatrix} x_{i}^{2} & x_{i} & 1 \end{pmatrix} \begin{pmatrix} a_{2} \\ a_{1} \\ a_{0} \end{pmatrix} = y_{i}$$

Stacking these rows yields the matrix form

$$\begin{pmatrix} x_{1}^{2} & x_{1} & 1 \\ \vdots & \vdots & \vdots \\ x_{n}^{2} & x_{n} & 1 \end{pmatrix} \begin{pmatrix} a_{2} \\ a_{1} \\ a_{0} \end{pmatrix} = \begin{pmatrix} y_{1} \\ \vdots \\ y_{n} \end{pmatrix}$$

Let the $n \times 3$ matrix on the left be $A$, the vector $(y_{1}, \cdots, y_{n})^{T}$ be $T$, and the coefficient vector $(a_{2}, a_{1}, a_{0})^{T}$ be $x$. Then

$$Ax = T \;\Rightarrow\; A^{T}Ax = A^{T}T \;\Rightarrow\; (A^{T}A)^{-1}A^{T}Ax = (A^{T}A)^{-1}A^{T}T \;\Rightarrow\; x = (A^{T}A)^{-1}A^{T}T$$

Solving linear regression using this principle

Linear regression with a quadratic model

import numpy as np
import matplotlib.pyplot as plt
import numpy.linalg as lg

t = np.arange(1, 17, 1)
y = np.array([4, 6.4, 8, 8.8, 9.22, 9.5, 9.7, 9.86, 10, 10.20, 10.32, 10.42, 10.5, 10.55, 10.58, 10.6])
plt.figure()
plt.plot(t, y, 'k*')
# y=at^2+bt+c
A = np.c_[t ** 2, t, np.ones(t.shape)]  # design matrix with columns t^2, t, 1
w = lg.inv(A.T.dot(A)).dot(A.T).dot(y)  # normal equations: w = (A^T A)^{-1} A^T y
plt.plot(t, w[0] * t ** 2 + w[1] * t + w[2])
plt.show()
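The same normal-equations approach works with any choice of basis functions; the next script fits the same data with the basis $\sqrt{t}$, $t^{1/4}$, 1 instead of $t^{2}$, $t$, 1.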

import numpy as np
import matplotlib.pyplot as plt
import numpy.linalg as lg

t = np.arange(1, 17, 1)
y = np.array([4, 6.4, 8, 8.8, 9.22, 9.5, 9.7, 9.86, 10, 10.20, 10.32, 10.42, 10.5, 10.55, 10.58, 10.6])
plt.figure()
plt.plot(t, y, 'k*')
#y=ax^(1/2)+bx^(1/4)+c
A = np.c_[t ** (1 / 2), t ** (1 / 4), np.ones(t.shape)]  # basis: sqrt(t), t^(1/4), 1
w = lg.inv(A.T.dot(A)).dot(A.T).dot(y)  # normal equations again
plt.plot(t, w[0] * t ** (1 / 2) + w[1] * t ** (1 / 4) + w[2])
plt.show()


Hyperbolic linear regression

import numpy as np
import matplotlib.pyplot as plt
import numpy.linalg as lg

t = np.arange(1, 17, 1)
y = np.array([4, 6.4, 8, 8.8, 9.22, 9.5, 9.7, 9.86, 10, 10.20, 10.32, 10.42, 10.5, 10.55, 10.58, 10.6])
plt.figure()
plt.plot(t, y, 'k*')
# y=a/t+b
A = np.c_[1 / t, np.ones(t.shape)]
w = lg.inv(A.T.dot(A)).dot(A.T).dot(y)
plt.plot(t, w[0] / t + w[1])
plt.show()
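Because the three scripts above fit the same data, their residual sums of squares can be compared to choose among the models. A short sketch of that comparison (this comparison step is an addition, not part of the original):

import numpy as np
import numpy.linalg as lg

t = np.arange(1, 17, 1)
y = np.array([4, 6.4, 8, 8.8, 9.22, 9.5, 9.7, 9.86, 10, 10.20,
              10.32, 10.42, 10.5, 10.55, 10.58, 10.6])

# Candidate design matrices from the three models above
designs = {
    "quadratic": np.c_[t ** 2, t, np.ones(t.shape)],
    "sqrt basis": np.c_[t ** (1 / 2), t ** (1 / 4), np.ones(t.shape)],
    "hyperbolic": np.c_[1 / t, np.ones(t.shape)],
}
for name, A in designs.items():
    w = lg.inv(A.T.dot(A)).dot(A.T).dot(y)  # least-squares coefficients
    sse = np.sum((A.dot(w) - y) ** 2)       # residual sum of squares
    print(name, "SSE =", round(sse, 4))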

Implementation of univariate linear regression with Python

A simple example of linear regression is the house value prediction problem. Generally speaking, the larger the house, the higher its value, so it can be inferred that a house's value is related to its area.

number    square feet    price (yuan/square foot)
1         150            6450
2         200            7450
3         250            8450
4         300            9450
5         350            11450
6         400            15400
7         600            18450

In linear regression, we must find a linear relationship in the data so that we can determine a and b. The assumed equation is $y(X) = a + bX$,
where $y(X)$ is the price value (the value to be predicted) for a particular square footage, meaning that price is a linear function of square feet; a is a constant (the intercept) and b is the regression coefficient. Now start programming.
Store the data in a CSV file named input_data.csv:

id,square_feet,price
1,150,6450
2,200,7450
3,250,8450
4,300,8450
5,350,11450
6,400,15400
7,600,18450

Create a new Python file, predict_house_price.py:

import numpy as np
import pandas as pd
from sklearn import linear_model
import matplotlib.pyplot as plt


# Read data function
def get_data(file_name):
    data = pd.read_csv(file_name)  # Read the CSV file
    x_parameter = []
    y_parameter = []
    for single_square_feet, single_price_value in zip(data['square_feet'], data['price']):
        # Traversal data
        x_parameter.append([float(single_square_feet)])  # Stored in the corresponding list
        y_parameter.append(float(single_price_value))
    return x_parameter, y_parameter


# Fitting data to a linear model
def linear_model_main(x_parameters, y_parameters, predict_value):
    regr = linear_model.LinearRegression()
    regr.fit(x_parameters, y_parameters)  # Training model
    predict_outcome = regr.predict(predict_value)  # predict_value must be 2D, e.g. [[700]]
    predictions = {'intercept': regr.intercept_, 'coefficient': regr.coef_, 'predicted_value': predict_outcome}
    return predictions


# Displays the results of the linear fitting model
def show_linear_line(x_parameters, y_parameters):
    regr = linear_model.LinearRegression()
    regr.fit(x_parameters, y_parameters)
    plt.scatter(x_parameters, y_parameters, color='blue')
    plt.plot(x_parameters, regr.predict(x_parameters), color='red', linewidth=4)
    plt.xticks(())
    plt.yticks(())
    plt.show()


x, y = get_data('input_data.csv')
predictValue = [[700]]
result = linear_model_main(x, y, predictValue)
print("Intercept value", result['intercept'])
print("coefficient", result['coefficient'])
print("Predicted value", result['predicted_value'])
show_linear_line(x, y)

The scikit-learn machine learning package is used here; it is one of the best machine learning packages implemented in Python.
regr.fit(x_parameters, y_parameters) trains the model on the input point pairs.
regr.predict(predict_value) uses the model to predict the y value for a given x.
regr.intercept_ is the intercept of the linear regression, i.e., the value of a.
regr.coef_ is the regression coefficient, i.e., the value of b.
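As a quick usage note (reusing the variable names from the script above), the fitted line can be printed from the learned parameters:

# Assemble the fitted equation y = a + b*x from the trained model's parameters
a = result['intercept']
b = result['coefficient'][0]
print(f"price = {a:.2f} + {b:.2f} * square_feet")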

Implementing multiple linear regression with Python

When multiple factors affect the response value, a multiple linear regression model can be used. For example, the sales of a product may be related to the money invested in TV advertising, radio advertising, and newspaper advertising:

$$Sales = \beta_{0} + \beta_{1}\cdot TV + \beta_{2}\cdot Radio + \beta_{3}\cdot Newspaper$$

Reading data using pandas

pandas is a Python library for data exploration, data analysis, and data processing.

import pandas as pd
data=pd.read_csv("D:/Data/Advertising.csv")
print(data.head())

The printed result is similar to a spreadsheet. This structure is called a pandas DataFrame; its full type name is pandas.core.frame.DataFrame.
The two main data structures in pandas are Series and DataFrame. A Series is similar to a one-dimensional array: it consists of a set of data plus an associated set of data labels (the index). A DataFrame is a tabular data structure containing an ordered collection of columns, each of which can have a different value type. A DataFrame has both row and column indexes and can be regarded as a dictionary of Series.
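A minimal illustration of the two structures (with made-up values, independent of the advertising data):

import pandas as pd

# A Series: one-dimensional data plus an index of data labels
s = pd.Series([4, 7, -5], index=['a', 'b', 'c'])
print(s)

# A DataFrame: an ordered collection of columns, like a dict of Series
df = pd.DataFrame({'TV': [230.1, 44.5], 'radio': [37.8, 39.3]})
print(df)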

import pandas as pd
data=pd.read_csv("D:/Data/Advertising.csv")
print(data.tail())

import pandas as pd
data=pd.read_csv("D:/Data/Advertising.csv")
print(data.shape)

Analyzing the data

TV: advertising expenses invested in TV
Radio: advertising expenses invested in broadcast media
Newspaper: advertising expenses invested in newspaper media
Sales: sales of the corresponding product (the response, a continuous value)
In this case, product sales are predicted from the different advertising investments. Because the response variable is a continuous value, this is a regression problem. The data set contains 200 observations, each corresponding to one market.
Note: the seaborn package is recommended here because its data visualization defaults are better. seaborn is built on top of Matplotlib, but it needs to be installed separately.

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

data = pd.read_csv("D:/Data/Advertising.csv")
# Use scatter plots to visualize the relationship between each feature and the response
sns.pairplot(data, x_vars=['TV', 'radio', 'newspaper'], y_vars='sales', height=7, aspect=0.8, kind='reg')
plt.show()
# Here TV, radio, and newspaper are the features and sales is the observed response

seaborn's pairplot function plots a scatter plot of each dimension of X against the corresponding y. The height and aspect parameters adjust the size and proportions of the display. Adding the parameter kind='reg' makes seaborn draw a best-fit line and its 95% confidence band.

It can be seen that there is a strong linear relationship between TV and Sales, the linear relationship between Radio and Sales is weaker, and the linear relationship between Newspaper and Sales is the weakest.

Linear regression model

Advantages: fast; no tuning parameters; easy to interpret and understand.
Disadvantages: compared with other, more complex models, its prediction accuracy is not high, because it assumes a linear relationship between the features and the response; for nonlinear relationships, a linear regression model obviously cannot model the data well.
Use pandas to build X (the feature matrix) and y (the label column).
scikit-learn requires X to be a feature matrix and y to be a NumPy vector. pandas is built on NumPy, so X can be a pandas DataFrame and y a pandas Series; scikit-learn understands this structure.

import pandas as pd


data = pd.read_csv("D:/Data/Advertising.csv")
feature_cols=['TV','radio','newspaper']
x=data[feature_cols]
print(x.head())

import pandas as pd


data = pd.read_csv("D:/Data/Advertising.csv")
feature_cols=['TV','radio','newspaper']
x=data[feature_cols]
y=data['sales']
print(y.head())


Build training set and test set

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn import metrics
import numpy as np

data = pd.read_csv("D:/Data/Advertising.csv")
feature_cols = ['TV', 'radio', 'newspaper']
x = data[feature_cols]
y = data['sales']
x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=1)
print(x_train.shape)
print(x_test.shape)
print(y_train.shape)
print(y_test.shape)


Linear regression with sklearn

To run linear regression with sklearn, first import the relevant linear regression model, then fit it to the training data.

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn import metrics
import numpy as np

data = pd.read_csv("D:/Data/Advertising.csv")
feature_cols = ['TV', 'radio', 'newspaper']
x = data[feature_cols]
y = data['sales']
x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=1)
linreg = LinearRegression()
model = linreg.fit(x_train, y_train)
print(model)
print(linreg.intercept_)
print(linreg.coef_)


The result of the linear regression is $y = 2.867 + 0.0465\times TV + 0.179\times Radio + 0.00345\times Newspaper$.
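For example (a worked check added here, not in the original), with TV = 100, Radio = 25, and Newspaper = 25, the fitted equation gives $y = 2.867 + 0.0465\times 100 + 0.179\times 25 + 0.00345\times 25 = 2.867 + 4.65 + 4.475 + 0.086 \approx 12.08$ units of sales.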
Forecast
After the regression model is obtained, new data can be predicted with it; the prediction results are obtained through the predict function.

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn import metrics
import numpy as np

data = pd.read_csv("D:/Data/Advertising.csv")
feature_cols = ['TV', 'radio', 'newspaper']
x = data[feature_cols]
y = data['sales']
x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=1)
linreg = LinearRegression()
model = linreg.fit(x_train, y_train)
y_pred=linreg.predict(x_test)
print(y_pred)
print(type(y_pred))


Evaluation measure
For classification problems, the evaluation measure is accuracy, but accuracy does not apply to regression problems, so evaluation measures for continuous values are used instead.
Three commonly used evaluation measures for linear regression are introduced here.

  • Mean absolute error (MAE)
  • Mean squared error (MSE)
  • Root mean squared error (RMSE)
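Their standard definitions, for predictions $\hat{y}_{i}$ and true values $y_{i}$ over $n$ test samples (these formulas are added for reference, they are not spelled out in the original):

$$MAE = \frac{1}{n}\sum_{i=1}^{n}\left|y_{i} - \hat{y}_{i}\right| \qquad MSE = \frac{1}{n}\sum_{i=1}^{n}\left(y_{i} - \hat{y}_{i}\right)^{2} \qquad RMSE = \sqrt{MSE}$$

Continuing from the fitted model above, RMSE can first be computed by hand: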
print(type(y_pred), type(y_test))
print(len(y_pred), len(y_test))
print(y_pred.shape, y_test.shape)
sum_mean = 0
for i in range(len(y_pred)):
    sum_mean += (y_pred[i] - y_test.values[i]) ** 2  # squared error per sample
sum_error = np.sqrt(sum_mean / len(y_pred))  # the test set holds 50 samples here
print("RMSE by hand:", sum_error)

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn import metrics
import numpy as np

data = pd.read_csv("D:/Data/Advertising.csv")
feature_cols = ['TV', 'radio', 'newspaper']
x = data[feature_cols]
y = data['sales']
x_train, x_test, y_train, y_test = train_test_split(x, y, random_state=1)
linreg = LinearRegression()
model = linreg.fit(x_train, y_train)
y_pred=linreg.predict(x_test)
sum_mean = 0
for i in range(len(y_pred)):
    sum_mean += (y_pred[i] - y_test.values[i]) ** 2
sum_error = np.sqrt(sum_mean / len(y_pred))  # RMSE over the test set
print("RMSE by hand:", sum_error)
plt.figure()
plt.plot(range(len(y_pred)), y_pred, 'b', label="predict")
plt.plot(range(len(y_pred)), y_test, 'r', label="test")
plt.legend(loc="upper right")
plt.xlabel("test sample index")
plt.ylabel("sales value")
plt.show()
