Operation requirements:
- Dependent variable analysis: analyze house price per unit area
- Independent variable analysis:
  1> Distribution of the independent variables
  2> Influence of the independent variables on the dependent variable
- Establish a house price prediction model:
  1> Linear regression model
  2> Linear model with the dependent variable log-transformed
  3> Log-linear model with interaction terms
- Forecast house prices
- Covers three topics: descriptive statistics, statistical inference, and linear regression
1: Data display
```python
import pandas as pd
import numpy as np
import math
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.api as sm
from numpy import corrcoef, array
from statsmodels.formula.api import ols

matplotlib.rcParams['axes.unicode_minus'] = False  # Display minus signs correctly
plt.rcParams['font.sans-serif'] = ['SimHei']       # Default font that can render Chinese

dat11 = pd.read_csv(r'E:\sndHsPr.csv')
dat11
```
Variable meaning:

| Column | Meaning |
|---|---|
| dist | District |
| roomnum | Number of bedrooms |
| subway | Near a subway station (binary) |
| halls | Number of living rooms |
| school | School-district home (1 = yes, 0 = no) |
| AREA | Floor area (m²) |
| price | Price per square meter |
| floor | Floor level: low, middle, or high |
2: Basic data
- Descriptive statistical analysis: uses the full sample
- Hypothesis testing and modeling: uses a subsample drawn from the raw data
```python
dat0 = dat11       # working reference to the raw data
dat0.shape[0]      # Out: 16210
```

- With a sample size over 5000, p-values lose their meaning (everything tests significant), so sampling is required for modeling and testing.

```python
dat0.describe(include='all').T
```

- After the transpose, each row is a variable and each column a statistic
- Continuous variables: the output shows percentiles, median, etc.
- Categorical variables: the output shows frequency information
Note: Q-Q and P-P plots are the standard statistical tools for checking whether a distribution is normal. In exploratory data analysis, a histogram is usually sufficient.
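As a minimal sketch (assuming `dat0` is loaded as above), a Q-Q plot of price against the normal distribution can be drawn with statsmodels:

```python
# Points hugging the reference line indicate approximate normality
sm.qqplot(dat0.price, line='s')
plt.show()
```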
3: Dependent variable analysis
Step 1: data preparation
- Convert the unit-area price to units of 10,000 yuan, and map the pinyin district names to readable names:

```python
dat0.price = dat0.price / 10000   # price per unit area, in 10,000 yuan

dict1 = {
    u'chaoyang':    'Chaoyang',
    u'dongcheng':   'Dongcheng',
    u'fengtai':     'Fengtai',
    u'haidian':     'Haidian',
    u'shijingshan': 'Shijingshan',
    u'xicheng':     'Xicheng',
}
dat0.dist = dat0.dist.apply(lambda x: dict1[x])
dat0.head()
```
Step 2: histogram
For any continuous variable in a data analysis, a histogram must be drawn first; it guards against problems such as outliers.
```python
dat0.price.hist(bins=20)   # 20 bins
plt.xlabel('Price per unit area (10,000 yuan/m²)')
plt.ylabel('Frequency')
```
- The histogram shows that the dependent variable is right-skewed, so a log transform should be considered.
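A quick numeric confirmation of the skew (a sketch, using pandas' sample skewness; a clearly positive value means a long right tail):

```python
# Skewness > 0 indicates right skew; compare before and after the log transform
print(dat0.price.skew())
print(np.log(dat0.price).skew())
```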
Step 3: descriptive statistical analysis
- Continuous variables: maximum, minimum, mean, median, standard deviation, quartiles
- Categorical variables: frequency analysis
- Mean, median, standard deviation, and quartiles:

```python
dat0.price.agg(['mean', 'median', 'std'])
# mean      6.115181
# median    5.747300
# std       2.229336

dat0.price.quantile([0.25, 0.5, 0.75])
# 0.25    4.281225
# 0.50    5.747300
# 0.75    7.609975
```
- Watch for outliers and build a basic feel for the data
- It is good practice to look up which observations carry the lowest and highest prices:

```python
pd.concat([dat0[dat0.price == min(dat0.price)],
           dat0[dat0.price == max(dat0.price)]])
```
4: Independent variable analysis
Step 1: look at the overall data
- The overall data contain no outliers
- The data are divided into categorical variables and continuous variables
According to the data, five of the six independent variables are categorical and one is continuous; the dependent variable is price.

```python
# Frequency analysis for the categorical variables
for i in range(7):
    if i == 3:   # column 3 is AREA, a continuous variable: skip frequency analysis
        continue
    print(dat0.columns.values[i], ':')
    print(dat0[dat0.columns.values[i]].agg(['value_counts']).T)
    print('------------------------------------------')

print('AREA:')
print(dat0.AREA.agg(['min', 'max', 'median', 'std']).T)
# Rule of thumb: a category's count / total sample >= 0.05 means it is not too sparse
```
Step 2: analyze a categorical variable (district)

1 Distribution of the independent variable itself
- Frequency of each district:

```python
dat0.dist.value_counts().plot(kind='pie')
```
2 Influence of the independent variable on the dependent variable

- Look at the average house price in each district
- Method 1: bar chart

```python
dat0.price.groupby(dat0.dist).mean().sort_values(ascending=True).plot(kind='barh')
```
- Method 2: grouped box plot

```python
dat1 = dat0[['dist', 'price']]
# Convert district to a categorical variable
dat1.dist = dat1.dist.astype('category')
# Order the categories by ascending price level
dat1.dist.cat.set_categories(
    ['Shijingshan', 'Fengtai', 'Chaoyang', 'Haidian', 'Dongcheng', 'Xicheng'],
    inplace=True)
sns.boxplot(x='dist', y='price', data=dat1)
plt.ylabel('Price per unit area (10,000 yuan/m²)')
plt.xlabel('District')
plt.title('Box plots of house price by district')
```
Note:
To see whether district affects Y, check whether the central level differs across groups. If the levels differ, the two variables are not independent, so X helps predict Y and can enter the regression model.
Step 3: analyze the continuous variable (area)
1 Is floor area related to price?

- Scatter plot: check for right skew; if right-skewed, consider a log transform
- Correlation coefficient: quantify the strength of the relationship
- The scatter plot is dense on the left and sparse on the right, i.e. the distribution is right-skewed:

```python
datA = dat0[['AREA', 'price']]
plt.scatter(datA.AREA, datA.price, marker='.')
```
- Pearson correlation coefficient:

```python
datA[['AREA', 'price']].corr(method='pearson')
```
Note: a correlation coefficient |r| > 0.8 indicates strong correlation, 0.5 < |r| < 0.8 moderate correlation, and |r| < 0.3 weak or no correlation.
- In a two-variable analysis, a correlation below 0.3 is usually not worth pursuing.
- In modeling with many variables (multiple regression, logistic regression, neural networks), predictors with correlation below 0.3 should still be considered.
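When a single pair of variables is examined, it also helps to see the p-value alongside r. A minimal sketch using scipy (an extra dependency, not used elsewhere in these notes):

```python
from scipy.stats import pearsonr

# Returns (r, two-sided p-value) for H0: no linear correlation
r, p = pearsonr(datA.AREA, datA.price)
print('r = %.4f, p = %.4g' % (r, p))
```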
2 Taking logarithms
2.1 logarithm of Y
```python
datA['price_ln'] = np.log(datA['price'])   # log of Y
plt.figure(figsize=(8, 8))
plt.scatter(datA.AREA, datA.price_ln, marker='.')
plt.xlabel('Area (m²)')
plt.ylabel('Price per unit area (log)')
```
- Correlation coefficient:

```python
datA[['AREA', 'price_ln']].corr(method='pearson')
```

- The scatter plot now has a roughly triangular shape, and the points are still densely packed.
Normally the correlation should increase after taking logs, but here it is lower than before the transform, so consider taking the log of X as well.
2.2 logarithm of X and Y
The resulting plot is dense in the middle and sparse at both ends; this is the typical look when both X and Y are roughly normally distributed.
```python
datA['AREA_ln'] = np.log(datA['AREA'])
datA['price_ln'] = np.log(datA['price'])
plt.figure(figsize=(8, 8))
plt.scatter(datA.AREA_ln, datA.price_ln, marker='.')
plt.xlabel('Area (log)')
plt.ylabel('Price per unit area (log)')
```
- The correlation is now higher:

```python
datA[['AREA_ln', 'price_ln']].corr(method='pearson')
```
Descriptive statistics indicate that floor area has a real effect on house price.
Based on the results above, the model should be built with logs of both X and Y.
Descriptive statistical analysis uses all of the raw data.
Hypothesis testing and modeling use a subsample of the data.
3 Hypothesis testing

Hypothesis tests verify the impressions gained from the descriptive statistics. Since there are more than 16,000 raw observations, a subsample is drawn first.
3.1 sampling
- Sampling helper function:

```python
def get_sample(df, sampling="simple_random", k=1, stratified_col=None):
    """Draw a sample from the input DataFrame.

    Parameters:
    - df: input pandas.DataFrame
    - sampling: sampling method, one of
      ["simple_random", "stratified", "systematic"]
      (simple random, stratified, systematic)
    - k: sample size or sampling fraction (int or float).
      If 0 < k < 1, k is the fraction of the population to sample.
      If k >= 1, k is the number of observations to draw; for
      stratified sampling it is the sample size per stratum.
    - stratified_col: list of column names to stratify by
      (only used for stratified sampling)

    Returns: pandas.DataFrame with the sampled rows.
    """
    import random
    import math
    import numpy as np
    import pandas as pd

    len_df = len(df)
    if k <= 0:
        raise AssertionError("k cannot be zero or negative")
    elif k >= 1:
        assert isinstance(k, int), "When k is a sample size it must be a positive integer"
        sample_by_n = True
        if sampling == "stratified":
            alln = k * df.groupby(by=stratified_col)[stratified_col[0]].count().count()
            if alln >= len_df:
                raise AssertionError("k times the number of strata cannot exceed the population size")
    else:
        sample_by_n = False
        if sampling in ("simple_random", "systematic"):
            k = math.ceil(len_df * k)   # convert the fraction to a count

    if sampling == "simple_random":
        print("Using simple random sampling")
        idx = random.sample(range(len_df), k)
        return df.iloc[idx, :].copy()
    elif sampling == "systematic":
        print("Using systematic sampling")
        step = len_df // k + 1
        start = 0
        idx = range(len_df)[start::step]
        return df.iloc[idx, :].copy()
    elif sampling == "stratified":
        assert stratified_col is not None, "Please pass the list of columns to stratify by"
        assert all(np.in1d(stratified_col, df.columns)), "Please check the column names passed in"
        grouped = df.groupby(by=stratified_col)[stratified_col[0]].count()
        if sample_by_n:
            group_k = grouped.map(lambda x: k)              # fixed size per stratum
        else:
            group_k = grouped.map(lambda x: math.ceil(x * k))  # proportional size
        res_df = df.head(0)
        for df_idx in group_k.index:
            df1 = df
            if len(stratified_col) == 1:
                df1 = df1[df1[stratified_col[0]] == df_idx]
            else:
                for i in range(len(df_idx)):
                    df1 = df1[df1[stratified_col[i]] == df_idx[i]]
            idx = random.sample(range(len(df1)), group_k[df_idx])
            group_df = df1.iloc[idx, :].copy()
            res_df = pd.concat([res_df, group_df])   # DataFrame.append is deprecated
        return res_df
    else:
        raise AssertionError("sampling is illegal")
```
- Stratified sampling: sample by district, 400 observations per district

```python
dat01 = get_sample(dat0, sampling="stratified", k=400, stratified_col=['dist'])
dat01
```
3.2 univariate significance analysis – analysis of variance
- The independent variables are categorical and the dependent variable is continuous, so one-way analysis of variance is used:

```python
import statsmodels.api as sm
from statsmodels.formula.api import ols

print('P value for dist:    %.4f' % sm.stats.anova_lm(ols('price ~ C(dist)',    data=dat01).fit())._values[0][4])
print('P value for roomnum: %.4f' % sm.stats.anova_lm(ols('price ~ C(roomnum)', data=dat01).fit())._values[0][4])
print('P value for halls:   %.4f' % sm.stats.anova_lm(ols('price ~ C(halls)',   data=dat01).fit())._values[0][4])  # above 0.001: marginally significant, keep for now
print('P value for floor:   %.4f' % sm.stats.anova_lm(ols('price ~ C(floor)',   data=dat01).fit())._values[0][4])  # above 0.001: marginally significant, keep for now
print('P value for subway:  %.4f' % sm.stats.anova_lm(ols('price ~ C(subway)',  data=dat01).fit())._values[0][4])
print('P value for school:  %.4f' % sm.stats.anova_lm(ols('price ~ C(school)',  data=dat01).fit())._values[0][4])
```
- Because the sample is drawn at random, the p-values differ from run to run, but the conclusions should be consistent.
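If reproducible p-values are wanted, the random source can be seeded before sampling. A sketch (the `get_sample` helper above draws with Python's `random` module; the seed value is arbitrary):

```python
import random

random.seed(12345)   # arbitrary fixed seed for reproducibility
dat01 = get_sample(dat0, sampling="stratified", k=400, stratified_col=['dist'])
```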
3.3 variable coding
Variables are coded to facilitate subsequent modeling.
1. Binary variable coding

The number of living rooms has little effect on price, so it is recoded as a binary (0-1) variable: hall vs. no hall.

```python
dat01['style_new'] = dat01.halls
# Use .loc rather than chained indexing to avoid SettingWithCopyWarning
dat01.loc[dat01.style_new > 0, 'style_new'] = 'hall'
dat01.loc[dat01.style_new == 0, 'style_new'] = 'no hall'
dat01.head()
```
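A more compact alternative (a sketch; `np.where` does the recode in one vectorized pass and avoids assigning strings into an integer column):

```python
# 'hall' where halls > 0, otherwise 'no hall'
dat01['style_new'] = np.where(dat01.halls > 0, 'hall', 'no hall')
```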
2. Multi-category variable coding

- Multi-category variables are dummy-coded before entering the model
- The multi-category variables here are district and floor:

```python
datta = pd.get_dummies(dat01[['dist', 'floor']])
datta.head()
```
- Drop one level of each variable as the reference group, retaining K-1 dummies per variable:

```python
datta.drop(['dist_Shijingshan', 'floor_high'], axis=1, inplace=True)
datta.head()
```
- Finally, assemble the modeling data: combine the dummies with the other required variables into a new data frame

```python
dat1 = pd.concat([datta, dat01[['school', 'subway', 'style_new', 'roomnum', 'AREA', 'price']]], axis=1)
dat1.head()
```
4 Modeling: linear regression

Scheme 1

- In statsmodels, `OLS` fits a regression without an intercept unless one is added explicitly (not commonly used); the formula interface `ols` is the usual choice.

```python
from statsmodels.formula.api import ols

lm1 = ols('price ~ dist_Fengtai + dist_Chaoyang + dist_Dongcheng + dist_Haidian + dist_Xicheng + school + subway + floor_middle + floor_low + AREA', data=dat1).fit()
# In ols, y must be continuous; x can be continuous or categorical.
# A categorical x can be wrapped in C() instead of building dummies by hand.
lm1_summary = lm1.summary()
lm1_summary   # show the regression results
```
Note: each district coefficient is a comparison against Shijingshan, and each floor coefficient a comparison against high floors.
- Compute predictions and residuals, then plot predicted values (x) against residuals (y):

```python
dat1['pred1'] = lm1.predict(dat1)   # predict on the modeling frame, which holds the dummies
dat1['resid1'] = lm1.resid
dat1.plot('pred1', 'resid1', kind='scatter')
# Heteroscedasticity: the residual spread grows with the predicted value.
# The plot shows heteroscedasticity, handled by taking the log of Y.
```
- The scatter plot confirms the heteroscedasticity, so the model is refit on the log scale.
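To back the visual impression with a formal test, a sketch using the Breusch-Pagan test from statsmodels (a small p-value rejects the hypothesis of constant residual variance):

```python
from statsmodels.stats.diagnostic import het_breuschpagan

# Test the residuals against the model's design matrix
lm_stat, lm_pvalue, f_stat, f_pvalue = het_breuschpagan(lm1.resid, lm1.model.exog)
print('Breusch-Pagan LM p-value: %.4g' % lm_pvalue)
```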
Scheme 2: logarithm of X and Y
```python
dat1['price_ln'] = np.log(dat1['price'])   # log of y
dat1['AREA_ln'] = np.log(dat1['AREA'])     # log of x
lm2 = ols('price_ln ~ dist_Fengtai + dist_Chaoyang + dist_Dongcheng + dist_Haidian + dist_Xicheng + school + subway + floor_middle + floor_low + AREA_ln', data=dat1).fit()
lm2_summary = lm2.summary()
lm2_summary   # show the regression results
```
Model interpretation: after the log transform, coefficients read (approximately) as percentage effects.

House prices in Fengtai are 3.72% higher than in Shijingshan; school-district homes are 15.9% more expensive than non-school-district homes; low-floor homes are 4.18% more expensive than high-floor homes; and each percentage increase in floor area lowers the unit price by 4.14%.
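These percentage readings use the small-coefficient approximation exp(β) - 1 ≈ β. A sketch of the exact conversion (assuming `lm2` fitted as above):

```python
# Exact percentage effect per coefficient: 100 * (exp(beta) - 1);
# for small beta this is close to the 100 * beta reading used above
print((np.exp(lm2.params) - 1) * 100)
```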
5: Forecast

1 Interaction terms

- X1 and X2 enter the model as an interaction term (x1 * x2)
- When the slope of an effect differs across groups, an interaction term should be considered

District, school-district status, and subway proximity are all significant, so interactions among these three variables are worth considering; the other variables are not significant and need not be included.
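In a patsy formula, `a*b` expands to `a + b + a:b` (main effects plus the product). A minimal sketch of the district × school interaction using `C()` directly instead of hand-built dummies (it needs a frame that still carries the raw `dist` and `floor` columns, so `dat01` is used here; `lm_int` is a name chosen for illustration):

```python
# Add the log-transformed columns to the sampled frame
dat01['price_ln'] = np.log(dat01.price)
dat01['AREA_ln'] = np.log(dat01.AREA)

# 'C(dist)*school' expands to C(dist) + school + C(dist):school,
# giving each district its own school-district premium
lm_int = ols('price_ln ~ C(dist)*school + subway + C(floor) + AREA_ln', data=dat01).fit()
lm_int.summary()
```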
1.1 Descriptive statistics: interaction between district and school-district status

- Build a table of mean prices of school-district vs. non-school-district homes in each district:

```python
df = pd.DataFrame()
dist = ['Shijingshan', 'Fengtai', 'Chaoyang', 'Dongcheng', 'Haidian', 'Xicheng']
Noschool = []
school = []
for i in dist:
    Noschool.append(dat0[(dat0['dist'] == i) & (dat0['school'] == 0)]['price'].mean())
    school.append(dat0[(dat0['dist'] == i) & (dat0['school'] == 1)]['price'].mean())
df['dist'] = pd.Series(dist)
df['Noschool'] = pd.Series(Noschool)
df['school'] = pd.Series(school)
df
```
```python
# Descriptive statistics: grouped bar chart
df1 = df['Noschool'].T.values
df2 = df['school'].T.values
plt.figure(figsize=(10, 6))
x1 = range(0, len(df))
x2 = [i + 0.3 for i in x1]
plt.bar(x1, df1, color='b', width=0.3, alpha=0.6, label='Non-school-district')
plt.bar(x2, df2, color='r', width=0.3, alpha=0.6, label='School-district')
plt.xlabel('District')
plt.ylabel('Price per unit area')
plt.legend(loc='upper left')
plt.xticks(range(0, 6), dist)
plt.show()
```
1.2 Grouped box plots

- Box plots of school-district vs. non-school-district prices within each district:

```python
districts = ['Shijingshan', 'Fengtai', 'Chaoyang', 'Dongcheng', 'Haidian', 'Xicheng']
for i in districts:
    dat0[dat0.dist == i][['school', 'price']].boxplot(by='school', patch_artist=True)
    plt.xlabel(i + ': school district')
```
- Box plots: within each district, whether school-district status affects price shows up as a difference in the central level between the two boxes.
- Interaction: whether that school-district gap itself differs from district to district is what the interaction term captures.
1.3 modeling
- Put the interaction terms into the model and judge significance from the p-values.
- Log-linear model with a district × school-district interaction:

```python
lm3 = ols('price_ln ~ (dist_Fengtai + dist_Chaoyang + dist_Dongcheng + dist_Haidian + dist_Xicheng)*school + subway + floor_middle + floor_low + AREA_ln', data=dat1).fit()
lm3_summary = lm3.summary()
lm3_summary   # show the regression results
```
Model interpretation:
The baseline is Shijingshan. The main `school` coefficient shows that in Shijingshan, school-district homes are 38.54% cheaper than non-school-district homes, while in Fengtai, school-district homes command a 45.84% premium over non-school-district homes.
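The per-district school effect combines the main effect with the interaction term; a sketch of the arithmetic (assuming `lm3` fitted as above, with parameter names as they appear in the statsmodels summary):

```python
# School-district effect in the baseline district (Shijingshan): the main effect alone
base = lm3.params['school']
# School-district effect in Fengtai: main effect + Fengtai interaction
fengtai = lm3.params['school'] + lm3.params['dist_Fengtai:school']
print('Shijingshan: %.2f%%, Fengtai: %.2f%%' % (base * 100, fengtai * 100))
```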