4. Case: Analysis of the Influencing Factors of Second-Hand House Prices in Beijing


Operation requirements:

  1. Dependent variable analysis: analysis of house price per unit area
  2. Independent variable analysis:
    1> Distribution of independent variables
    2> Analysis of the influence of independent variables on dependent variables
  3. Establish a house price prediction model
    1> Linear regression model
    2> Linear model with the logarithm of the dependent variable
    3> Log-linear model with interaction terms
  4. Forecast house prices
    ·The case covers three chapters: descriptive statistics, statistical inference and linear regression

1: Data display

import pandas as pd 
import numpy as np
import math
import matplotlib
import matplotlib.pyplot as plt
import seaborn as sns
import statsmodels.api as sm
from numpy import corrcoef, array
from statsmodels.formula.api import ols
matplotlib.rcParams['axes.unicode_minus'] = False  # display minus signs correctly
plt.rcParams['font.sans-serif'] = ['SimHei']  # default font so that Chinese labels render

dat11 = pd.read_csv(r'E:\sndHsPr.csv')
dat11


Variable meanings:

Column     Meaning
dist       city district
roomnum    number of bedrooms
subway     whether the house is near a subway station
halls      number of halls (living/dining rooms)
school     whether it is a school-district house
AREA       house area (m2)
price      house price per square meter
floor      floor level: low, middle or high

2: Basic data

Descriptive statistical analysis: uses the full sample

Hypothesis testing and modeling: should not use all the raw data; work on a sample instead

In[] : dat0 = dat11   # work with the data under a new name
In[] : dat0.shape[0]
Out[] : 16210
- With a sample size above 5,000, p values lose their meaning (almost any effect tests significant) ==> sampling is therefore required for testing and modeling


dat0.describe(include='all').T
- After the transpose, rows are variables and columns are statistics
- Continuous variables: percentiles, the median, etc. are reported
- Categorical variables: frequency information is reported


Note: QQ plots and PP plots are the classical tools for checking whether a distribution is normal and have an established place in statistics. In everyday data analysis a histogram is usually sufficient; a QQ-plot sketch is given below for reference.
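A minimal QQ-plot sketch, assuming the data loaded in section 1:

# QQ plot of price against the normal distribution;
# points falling on the reference line suggest normality
sm.qqplot(dat0.price, line='s')
plt.show()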

3: Dependent variable analysis

Step 1: data preparation

# House price per unit area: convert to units of 10,000 yuan
dat0.price = dat0.price / 10000

# Map the district names to readable labels
dict1 = {
    u'chaoyang': 'Chaoyang',
    u'dongcheng': 'Dongcheng',
    u'fengtai': 'Fengtai',
    u'haidian': 'Haidian',
    u'shijingshan': 'Shijingshan',
    u'xicheng': 'Xicheng'
}
dat0.dist = dat0.dist.apply(lambda x: dict1[x])
dat0.head()

Step 2: histogram

Whenever a data analysis involves a continuous variable, draw its histogram first; this guards against problems such as unnoticed outliers.

dat0.price.hist(bins=20)   # split into 20 bins
plt.xlabel('House price per unit area (10,000 yuan / m2)')
plt.ylabel('frequency')

  • The histogram shows that the dependent variable is right-skewed, so taking its logarithm should be considered (see the sketch below).
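A quick check, as a sketch: the histogram of log(price) should look noticeably more symmetric:

# Histogram of the log-transformed price
np.log(dat0.price).hist(bins=20)
plt.xlabel('log of house price per unit area')
plt.ylabel('frequency')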

Step 3: descriptive statistical analysis

  • Continuous variables: maximum, minimum, mean, median, standard deviation, quartile
  • Categorical variables: frequency analysis
- Mean, median, standard deviation
In[]:dat0.price.agg(['mean','median','std'])
Out[]:
	mean      6.115181
	median    5.747300
	std       2.229336

- Quartile
In[] : dat0.price.quantile([0.25,0.5,0.75])
Out[]:
	0.25    4.281225
	0.50    5.747300
	0.75    7.609975
  • Watch for outliers and build a basic feel for the data
- It is good practice to find the observations with the lowest and the highest price:
pd.concat([(dat0[dat0.price==min(dat0.price)]),(dat0[dat0.price==max(dat0.price)])])
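An equivalent lookup, as a sketch (note that idxmin/idxmax return only the first match if several rows tie):

# Same rows via index lookup
dat0.loc[[dat0.price.idxmin(), dat0.price.idxmax()]]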

4: Independent variable analysis

Step 1: look at the overall data

  • The overall data contains no outliers
  • The variables divide into categorical and continuous ones
'''
Of the six independent variables, five are categorical and one (AREA) is continuous; the dependent variable Y is price
'''
for i in range(7):
    if i != 3:       # skip column 3 (AREA): it is continuous, so no frequency analysis
        print(dat0.columns.values[i], ':')
        print(dat0[dat0.columns.values[i]].agg(['value_counts']).T)
        print('------------------------------------------')
print('AREA:')
print(dat0.AREA.agg(['min', 'max', 'median', 'std']).T)
# Each category should hold at least 5% of the total sample, so that no group is too small

Step 2: analyze a categorical variable (district)

1 Distribution of the independent variable itself

# District frequency analysis
dat0.dist.value_counts().plot(kind='pie')

2 Influence of the independent variable on the dependent variable

  • Look at the average house price in each district
# Method 1: bar chart
dat0.price.groupby(dat0.dist).mean().sort_values(ascending=True).plot(kind='barh')

# Method 2: box plot

dat1 = dat0[['dist','price']].copy()
# Convert district to a categorical variable
dat1.dist = dat1.dist.astype('category')
# Define the category order so the districts appear in ascending price order
dat1.dist.cat.set_categories(['Shijingshan','Fengtai','Chaoyang','Haidian','Dongcheng','Xicheng'], inplace=True)
sns.boxplot(x='dist', y='price', data=dat1)
plt.ylabel('House price per unit area (10,000 yuan / m2)')
plt.xlabel('district')
plt.title('Grouped box plot of house prices by district')

Note:

To judge whether district affects Y, look mainly at whether the central levels of the groups agree. If they differ, the two variables are not independent, X helps predict Y, and it can enter the regression model. A numeric version of this check is sketched below.
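As a numeric complement to the box plot, a sketch of the central level per district:

# Central level of price for each district
dat1.groupby('dist').price.agg(['mean', 'median', 'std'])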

Step 3: analyze the continuous variable (area)

1 Is house area related to price?

Scatter plot: check for right skew ==> if right-skewed, consider taking logarithms

Correlation coefficient: quantify the strength of the relationship

# The scatter plot is dense on the left and sparse on the right ==> right-skewed
datA = dat0[['AREA','price']]
plt.scatter(datA.AREA, datA.price, marker='.')

# Pearson correlation coefficient
datA[['AREA','price']].corr(method='pearson')


Note: a correlation coefficient r > 0.8 indicates a strong correlation, 0.5 < r < 0.8 a moderate one, and r < 0.3 a weak correlation or none.

  • In a two-variable analysis, a correlation coefficient below 0.3 is usually not pursued further.
  • In modeling, multi-variable analysis, logistic regression and neural networks, variables with correlation coefficients below 0.3 should still be considered.
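Pearson's r assumes a roughly linear relationship; as a robustness check one can also compute the rank-based Spearman coefficient, as a sketch:

# Spearman correlation is rank-based and less sensitive to the right skew seen above
datA[['AREA', 'price']].corr(method='spearman')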

2 Taking logarithms

2.1 Logarithm of Y
datA['price_ln'] = np.log(datA['price'])  # take the logarithm of Y
plt.figure(figsize=(8,8))
plt.scatter(datA.AREA, datA.price_ln, marker='.')
plt.xlabel('Area (m2)')
plt.ylabel('House price per unit area (log scale)')

# correlation coefficient
datA[['AREA','price_ln']].corr(method='pearson')
# The scatter plot still looks roughly triangular and remains dense on the left

Ordinarily the correlation coefficient should increase after taking the logarithm, but here it is lower than before; consider taking the logarithm of X as well.

2.2 Logarithm of both X and Y

The plot is dense in the middle and sparse on both sides; this pattern suggests that both X and Y are roughly normally distributed

datA['AREA_ln'] = np.log(datA['AREA'])
datA['price_ln'] = np.log(datA['price'])
plt.figure(figsize=(8,8))
plt.scatter(datA.AREA_ln, datA.price_ln, marker='.')
plt.xlabel('Area (log scale)')
plt.ylabel('House price per unit area (log scale)')

# Correlation coefficient: the correlation is now higher
datA[['AREA_ln','price_ln']].corr(method='pearson')


Descriptive statistics indicate that house area has a real effect on house price.
Based on the results above, the model should be built on the logarithms of both X and Y.
Descriptive statistical analysis uses all of the original data.

Hypothesis testing and data modeling use a sample of the data.

3 Hypothesis tests

Hypothesis tests verify the impressions gained from the descriptive statistics. Since there are more than 16,000 original records, draw a sample and work with part of the data.

3.1 Sampling
# Sampling helper
def get_sample(df, sampling="simple_random", k=1, stratified_col=None):
    """
    Sampling function for an input dataframe.

    Parameters:
        - df: input data frame, a pandas.DataFrame object

        - sampling: sampling method, str
            One of ["simple_random", "stratified", "systematic"],
            i.e. simple random sampling, stratified sampling or systematic sampling

        - k: number of samples or sampling proportion, int or float
            (an int must be > 0; a float must lie in the interval (0, 1))
            If 0 < k < 1, k is the proportion of the population to sample.
            If k >= 1, k is the number of samples; for stratified sampling it
            is the sample size of each stratum.

        - stratified_col: list of the column names to stratify by
            Only used for stratified sampling.

    Returns:
        pandas.DataFrame object with the sampling result
    """
    import random
    import pandas as pd
    from functools import reduce
    import numpy as np
    import math
    
    len_df = len(df)
    if k <= 0:
        raise AssertionError("k must be positive")
    elif k >= 1:
        assert isinstance(k, int), "when k is a sample count, it must be a positive integer"
        sample_by_n = True
        if sampling == "stratified":
            alln = k * df.groupby(by=stratified_col)[stratified_col[0]].count().count()  # number of strata times k
            #alln = k * df[stratified_col].value_counts().count()
            if alln >= len_df:
                raise AssertionError("k times the number of strata must not exceed the total sample size")
    else:
        sample_by_n = False
        if sampling in ("simple_random", "systematic"):
            k = math.ceil(len_df * k)
        
    #print(k)

    if sampling == "simple_random":
        print("Using simple random sampling")
        idx = random.sample(range(len_df), k)
        res_df = df.iloc[idx,:].copy()
        return res_df

    elif sampling is "systematic":
        print("Using systematic sampling")
        step = len_df // k+1          #step=len_df//k-1
        start = 0                  #start=0
        idx = range(len_df)[start::step]  #idx=range(len_df+1)[start::step]
        res_df = df.iloc[idx,:].copy()
        #print("k=%d,step=%d,idx=%d"%(k,step,len(idx)))
        return res_df

    elif sampling is "stratified":
        assert stratified_col is not None, "Please pass in a list containing the column names that need to be layered"
        assert all(np.in1d(stratified_col, df.columns)), "Please check the column name entered"
        
        grouped = df.groupby(by=stratified_col)[stratified_col[0]].count()
        if sample_by_n==True:
            group_k = grouped.map(lambda x:k)
        else:
            group_k = grouped.map(lambda x: math.ceil(x * k))
        
        res_df = df.head(0)
        for df_idx in group_k.index:
            df1=df
            if len(stratified_col)==1:
                df1=df1[df1[stratified_col[0]]==df_idx]
            else:
                for i in range(len(df_idx)):
                    df1=df1[df1[stratified_col[i]]==df_idx[i]]
            idx = random.sample(range(len(df1)), group_k[df_idx])
            group_df = df1.iloc[idx,:].copy()
            res_df = pd.concat([res_df, group_df])  # DataFrame.append was removed in pandas 2.0
        return res_df

    else:
        raise AssertionError("invalid sampling method")

# Stratified sampling: sample by district, 400 records per district
dat01 = get_sample(dat0, sampling="stratified", k=400, stratified_col=['dist'])
dat01
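For the stratified case used here, newer pandas versions provide a built-in equivalent; a minimal sketch, assuming pandas >= 1.1 (GroupBy.sample):

# Built-in stratified sampling: 400 rows per district
dat01_alt = dat0.groupby('dist', group_keys=False).sample(n=400, random_state=1)
dat01_alt.dist.value_counts()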

3.2 Univariate significance analysis: analysis of variance
# The independent variables are categorical and the dependent variable is continuous, so use analysis of variance
import statsmodels.api as sm
from statsmodels.formula.api import ols
print('P value of dist: %.4f' % sm.stats.anova_lm(ols('price ~ C(dist)', data=dat01).fit())._values[0][4])
print('P value of roomnum: %.4f' % sm.stats.anova_lm(ols('price ~ C(roomnum)', data=dat01).fit())._values[0][4])
print('P value of halls: %.4f' % sm.stats.anova_lm(ols('price ~ C(halls)', data=dat01).fit())._values[0][4])    # above 0.001 -> marginally significant -> keep for now
print('P value of floor: %.4f' % sm.stats.anova_lm(ols('price ~ C(floor)', data=dat01).fit())._values[0][4])    # above 0.001 -> marginally significant -> keep for now
print('P value of subway: %.4f' % sm.stats.anova_lm(ols('price ~ C(subway)', data=dat01).fit())._values[0][4])
print('P value of school: %.4f' % sm.stats.anova_lm(ols('price ~ C(school)', data=dat01).fit())._values[0][4])

  • Because the sample is drawn at random, the P values differ from run to run, but the conclusions should stay the same. The six calls above can also be collapsed into a loop, as sketched below.
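A sketch of the loop version, using the same anova_lm machinery:

# One-way ANOVA p-value of each categorical predictor against price
for var in ['dist', 'roomnum', 'halls', 'floor', 'subway', 'school']:
    table = sm.stats.anova_lm(ols('price ~ C(%s)' % var, data=dat01).fit())
    print('P value of %s: %.4f' % (var, table['PR(>F)'].iloc[0]))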
3.3 Variable coding

Variables are coded to make the subsequent modeling easier.

  1. Binary variable coding
    The number of halls matters little by itself, so recode it into a binary (0-1) variable for whether the house has a hall
dat01['style_new'] = dat01.halls
dat01.style_new[dat01.style_new > 0] = 'There is a hall'
dat01.style_new[dat01.style_new == 0] = 'No Hall'
dat01.head()
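The chained assignments above can raise pandas' SettingWithCopyWarning; a vectorized alternative, as a sketch:

# Recode halls into the binary hall indicator in one step
dat01['style_new'] = np.where(dat01.halls > 0, 'There is a hall', 'No Hall')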


2. Multi-category variable coding

  • For multi-category variables, create dummy-variable encodings before putting them into the model
# Multi-category variables: district and floor
datta = pd.get_dummies(dat01[['dist','floor']])
datta.head()

# These two dummies serve as the reference groups, so delete them
# and retain K-1 dummy variables per categorical variable
datta.drop(['dist_Shijingshan','floor_high'], axis=1, inplace=True)
datta.head()

  • The data finally used for modeling
# Combine the dummy variables with the other required columns into a new data frame
dat1 = pd.concat([datta, dat01[['school','subway','style_new','roomnum','AREA','price']]], axis=1)
dat1.head()

4 Modeling: linear regression

Scheme 1
# OLS (the array interface) does not add an intercept automatically (rarely used); the formula-based ols adds one and is the usual choice
from statsmodels.formula.api import ols

lm1 = ols('price ~ dist_Fengtai+dist_Chaoyang+dist_Dongcheng+dist_Haidian+dist_Xicheng+school+subway+floor_middle+floor_low+AREA',
          data=dat1).fit()
'''
In ols the dependent variable y must be continuous; x may be continuous or categorical.
If x is categorical, wrapping it in C() avoids building dummy variables by hand.
'''
lm1_summary = lm1.summary()
lm1_summary  # show the regression results


Note: each district is compared against Shijingshan, and each floor level against the high floor. A C()-based version of the same model is sketched below.
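As the comment above notes, the same model can be written without hand-built dummies; a sketch, assuming the floor column codes its levels as 'high', 'middle' and 'low':

# Equivalent model via C(); Treatment(...) pins the reference categories
lm1b = ols("price ~ C(dist, Treatment(reference='Shijingshan')) + school + subway"
           " + C(floor, Treatment(reference='high')) + AREA", data=dat01).fit()
lm1b.summary()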

# Compute the model predictions
dat1['pred1'] = lm1.predict(dat1)
# Extract the residuals
dat1['resid1'] = lm1.resid
# Scatter plot with pred1 (fitted values) on x and resid1 (residuals) on y
dat1.plot('pred1', 'resid1', kind='scatter')

'''
Heteroscedasticity: the residual spread grows with the predicted value.
The figure shows heteroscedasticity, so Y should be log-transformed.
'''

  • The residual plot confirms the heteroscedasticity, so the model is refit on logarithms.
Scheme 2: logarithms of both X and Y
dat1['price_ln'] = np.log(dat1['price'])  # take the logarithm of y
dat1['AREA_ln'] = np.log(dat1['AREA'])    # take the logarithm of x
lm2 = ols("price_ln ~ dist_Fengtai+dist_Chaoyang+dist_Dongcheng+dist_Haidian+dist_Xicheng+school+subway+floor_middle+floor_low+AREA_ln", data=dat1).fit()
lm2_summary = lm2.summary()
lm2_summary  # show the regression results
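As a sketch, the residual plot can be redrawn for lm2; after the log transform the spread should no longer fan out with the fitted values:

# Residuals of the log-log model against its fitted values
dat1['pred2'] = lm2.predict(dat1)
dat1['resid2'] = lm2.resid
dat1.plot('pred2', 'resid2', kind='scatter')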


Model interpretation: after the log transform, coefficients read as percentage effects.
House prices in Fengtai are 3.72% higher than in Shijingshan, school-district houses are 15.9% more expensive than non-school-district houses, and low-floor houses are 4.18% more expensive than high-floor ones. Each 1% increase in house area is associated with a 4.14% decrease in the unit price.
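Strictly speaking, reading a dummy coefficient b directly as "b percent" is an approximation; the exact effect in a log-linear model is exp(b) - 1, which matters for larger coefficients. A sketch, assuming the 15.9% school figure is the raw coefficient 0.159:

# Exact vs approximate percentage effect of a dummy in a log-linear model
b = 0.159                                          # assumed raw 'school' coefficient
print('approx: %.1f%%' % (100 * b))                # 15.9%
print('exact:  %.1f%%' % (100 * (np.exp(b) - 1)))  # about 17.2%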

5: Forecast

1 Interaction terms

  • X1 and X2 enter the model as an interaction term (x1*x2)
  • When the slope of an effect differs across groups, consider an interaction term

District, school-district status and subway proximity are all significant, so interaction terms among these three variables should be considered. The other variables are not significant and need not be considered.

1.1 Descriptive statistics: interaction between district and school-district housing

# Build a figure comparing school-district prices across districts
df = pd.DataFrame()
dist = ['Shijingshan','Fengtai','Chaoyang','Dongcheng','Haidian','Xicheng']
Noschool=[]
school=[]
for i in dist:
    Noschool.append(dat0[(dat0['dist']==i)&(dat0['school']==0)]['price'].mean())
    school.append(dat0[(dat0['dist']==i)&(dat0['school']==1)]['price'].mean())

df['dist']=pd.Series(dist)
df['Noschool']=pd.Series(Noschool)
df['school']=pd.Series(school)
df
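The loop above can be replaced by a one-line pivot table; a sketch:

# Mean unit price by district and school-district status in one call
dat0.pivot_table(values='price', index='dist', columns='school', aggfunc='mean')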

# Grouped bar chart of the means
df1 = df['Noschool'].values
df2 = df['school'].values
plt.figure(figsize=(10,6))
x1 = range(0, len(df))
x2 = [i+0.3 for i in x1]
plt.bar(x1, df1, color='b', width=0.3, alpha=0.6, label='Non-school-district house')
plt.bar(x2, df2, color='r', width=0.3, alpha=0.6, label='School-district house')
plt.xlabel('district')
plt.ylabel('Price per unit area')
plt.legend(loc='upper left')
plt.xticks(range(0,6), dist)
plt.show()

1.2 Grouped box plots

# Box plot of school-district status within each district
school = ['Shijingshan','Fengtai','Chaoyang','Dongcheng','Haidian','Xicheng']
for i in school:
    dat0[dat0.dist==i][['school','price']].boxplot(by='school', patch_artist=True)
    plt.xlabel(i + ' school-district house')
(One box plot is produced per district.)
  • Box plots: within each district, whether school-district status affects the unit price shows in whether the central levels agree.
  • Interaction term: whether the school-district effect differs across districts shows in whether the gap between the group means is consistent from district to district; a single-figure version is sketched below.
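A single seaborn call shows the same pattern in one figure; a sketch:

# One grouped box plot: district on x, school-district status as hue
sns.boxplot(x='dist', y='price', hue='school', data=dat0)
plt.ylabel('House price per unit area (10,000 yuan / m2)')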

1.3 Modeling

  • Put the interaction into the model, inspect the P values and judge significance.
# Log-linear model with an interaction between district and school-district status
lm3 = ols("price_ln ~ (dist_Fengtai+dist_Chaoyang+dist_Dongcheng+dist_Haidian+dist_Xicheng)*school+subway+floor_middle+floor_low+AREA_ln", data=dat1).fit()
lm3_summary = lm3.summary()
lm3_summary  # show the regression results


Model interpretation:
The baseline is Shijingshan. In Shijingshan, school-district houses are actually cheaper than non-school-district ones, by 38.54%. In Fengtai, by contrast, school-district houses are 45.84% more expensive than non-school-district ones. A forecasting sketch based on lm3 follows.
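Finally, requirement 4 asks for a house-price forecast. A sketch using lm3 on a hypothetical listing (all field values below are assumed): a 70 m2, middle-floor, school-district flat near the subway in Chaoyang:

# Build one hypothetical observation with the columns the lm3 formula expects
new = pd.DataFrame([{'dist_Fengtai': 0, 'dist_Chaoyang': 1, 'dist_Dongcheng': 0,
                     'dist_Haidian': 0, 'dist_Xicheng': 0, 'school': 1, 'subway': 1,
                     'floor_middle': 1, 'floor_low': 0, 'AREA_ln': np.log(70)}])
# The model predicts log price, so exponentiate to get 10,000 yuan per m2
print(np.exp(lm3.predict(new)))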
