python data analysis

python data analysis (2)

This part is another practical case. I have read through Python basics and Python data analysis many times, but it still feels hollow unless I actually type out the code. Recently I have been looking for cases and typing out the code myself.
Case link: https://segmentfault.com/a/1190000015440560
It is divided into two parts: first, data exploration; second, data visualization.

1, Data analysis

1. Import package
The first step is to import the required packages (numpy, pandas, matplotlib, sklearn, seaborn; sklearn is not actually used in this part) and to choose the plotting style.

import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib as mpl
import matplotlib.pyplot as plt
#from IPython.display import display

plt.style.use("fivethirtyeight")
sns.set_style({'font.sans-serif':['simhei','Arial']})
#Check the Python version
from sys import version_info
if version_info.major!=3:
    raise Exception('Please use Python 3 to complete this project')
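
The same check can also be wrapped in a small function, which makes it easy to try out with other version tuples. This is only a sketch; the function name require_python3 is hypothetical and not part of the original case.

```python
import sys

def require_python3(info=sys.version_info):
    """Raise RuntimeError unless the major version is at least 3."""
    # info can be sys.version_info or any tuple such as (2, 7, 18)
    if info[0] < 3:
        raise RuntimeError("Please use Python 3 to complete this project")
    return True

require_python3()
```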

2. Import data and observe it
The second step is to import the data and take a first look at it.

df=pd.read_csv(r'F:\python study\lianjia.csv')
df.head()
df.info()

The results are as follows: there are 23676 records and 12 fields in the dataset, and the Elevator field contains a large number of missing values.
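
Which fields contain missing values, and how many, can also be checked directly with isnull. A minimal sketch on a made-up frame (the toy values below are assumptions for illustration, not rows from the dataset):

```python
import numpy as np
import pandas as pd

# Toy stand-in for the real dataset (values are made up)
toy = pd.DataFrame({
    "Region": ["Haidian", "Chaoyang", "Xicheng", "Fengtai"],
    "Elevator": ["There is an elevator", np.nan, np.nan, "No elevator"],
    "Price": [650.0, 480.0, 900.0, 390.0],
})

# Number of missing values in each column, largest first
missing_counts = toy.isnull().sum().sort_values(ascending=False)
# Share of missing values in each column
missing_ratio = toy.isnull().mean()

print(missing_counts)
print(missing_ratio["Elevator"])
```

On the real data, the same two lines applied to df would show how large the gap in Elevator is compared with the other fields.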


3. Descriptive statistics of data

df.describe()

The results were as follows:

4. Simple data arrangement
A lambda function is used here to compute the new field row by row.

#Add a new field: unit price per square meter
df = df.copy()
df['PerPrice']=df.apply(lambda x:x.Price/x.Size,axis=1)
#Reorder the columns
x = ['Region', 'District', 'Garden', 'Layout', 'Floor', 'Year', 'Size', 'Elevator', 'Direction', 'Renovation', 'PerPrice', 'Price']
df= pd.DataFrame(df,columns=x)
print(df.head(n=2))
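
The lambda above is applied one row at a time. Assuming Price and Size are numeric columns, the same field can usually be computed by dividing the two columns directly, which is shorter and tends to be faster on large frames. A small sketch on made-up values:

```python
import pandas as pd

# Made-up values, only for illustration
toy = pd.DataFrame({"Price": [600.0, 450.0], "Size": [100.0, 50.0]})

# Row-wise version, as in the code above
per_price_apply = toy.apply(lambda x: x.Price / x.Size, axis=1)
# Vectorized version: divides the two columns element by element
per_price_vec = toy["Price"] / toy["Size"]

print(per_price_apply.tolist())
print(per_price_vec.tolist())
```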

The results were as follows:

2, Data visualization

1. Region feature analysis
For regional characteristics, we can analyze the comparison of housing prices and quantity in different regions.
The code is as follows:

df_house_count=df.groupby('Region')['Price'].count().sort_values(ascending=False).to_frame().reset_index()
df_house_mean=df.groupby('Region')['PerPrice'].mean().sort_values(ascending=False).to_frame().reset_index()
f,[ax1,ax2,ax3]=plt.subplots(3,1,figsize=(20,15)) #Three rows and one column; ax1, ax2, ax3 are the subplots, f is the figure, and figsize sets the figure size
f.subplots_adjust(hspace=40)#Adjusts the spacing between subplots: hspace is the vertical spacing and wspace is the horizontal spacing
#Bar chart
sns.barplot(x='Region',y='PerPrice',palette="Blues_d",data=df_house_mean,ax=ax1)#sns.<plot name>(x='X-axis column name', y='Y-axis column name', data=source DataFrame)
ax1.set_title('Comparison of unit price per square meter of second-hand housing in Beijing',fontsize=15)
ax1.set_xlabel('region')
ax1.set_ylabel('Unit price per square meter')
#Bar chart
sns.barplot(x='Region',y='Price',palette="Greens_d",data=df_house_count,ax=ax2)
ax2.set_title('Comparison of the number of second-hand houses in Beijing',fontsize=15)#font size
ax2.set_xlabel('region')
ax2.set_ylabel('number')
#Box plot
sns.boxplot(x='Region',y='Price',data=df,ax=ax3)
ax3.set_title('Total price of second hand housing in Beijing',fontsize=15)
ax3.set_xlabel('region')
ax3.set_ylabel('Total price of house')
f.tight_layout()
plt.show()

The results were as follows:

analysis:
**Unit price of second-hand houses per square meter:** Xicheng District has the highest unit price, about 110,000 yuan/m2, followed by Dongcheng at about 100,000 yuan/m2 and Haidian at about 85,000 yuan/m2; the other districts lie between 22,000 and 70,000 yuan/m2. The difference in unit price between districts is still relatively large.
**Number of second-hand houses:** Haidian District has the most second-hand houses, almost level with Chaoyang District at about 3,000 listings each. Fengtai District comes next and is close to Haidian and Chaoyang.
**Total housing price:** The districts with a high median total price are Dongcheng, Xicheng, Chaoyang and Haidian. The box plots show many outliers, mostly on the high side, which indicates a noticeable gap in total prices within the same district.
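
The count, mean and median behind these three observations can also be read as numbers rather than from the plots. A minimal sketch on made-up prices (the toy rows are assumptions, not values from the dataset):

```python
import pandas as pd

# Made-up total prices for two districts
toy = pd.DataFrame({
    "Region": ["Xicheng", "Xicheng", "Xicheng", "Haidian", "Haidian", "Haidian"],
    "Price": [800.0, 900.0, 2500.0, 500.0, 600.0, 700.0],
})

# One row per district: number of listings, mean price and median price
summary = toy.groupby("Region")["Price"].agg(["count", "mean", "median"])
summary = summary.sort_values("median", ascending=False)

print(summary)
```

In the toy data the single expensive listing in Xicheng pulls the mean well above the median, which is the same effect the outliers in the box plot have.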

2. Size feature analysis
Plots used: distplot, kdeplot, regplot
Another author's introduction to these three plots is linked below:
https://blog.csdn.net/qq_40195360/article/details/86605860?utm_medium=distribute.pc_relevant.none-task-blog-BlogCommendFromMachineLearnPai2-1.nonecase&depth_1-utm_source=distribute.pc_relevant.none-task-blog-BlogCommendFromMachineLearnPai2-1.nonecase

The code is as follows:

#size feature analysis
f,[ax1,ax2]=plt.subplots(1,2,figsize=(15,5))
f.subplots_adjust(wspace=1)
#Distribution of housing area
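sns.distplot(df['Size'],bins=20,ax=ax1,color='r')#Histogram with a density curve (distplot is deprecated since seaborn 0.11 in favor of histplot/displot)
sns.kdeplot(df['Size'],shade=True,ax=ax1)#Kernel density estimate plot (newer seaborn versions use fill=True in place of shade=True)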
#The relationship between housing area and selling price
sns.regplot(x='Size',y='Price',data=df,ax=ax2)
plt.show()

The results were as follows:

Housing area distribution:
From the plot on the left, we can see that the housing area follows a long-tailed distribution, which indicates that there are quite a few second-hand houses with very large areas, beyond the normal range.
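
The long tail can also be checked numerically: a clearly positive skewness and a mean well above the median both point to a long right tail. A small sketch on made-up areas (the values are assumptions for illustration):

```python
import pandas as pd

# Made-up areas in square meters, with one very large listing
size = pd.Series([50.0, 60.0, 70.0, 80.0, 90.0, 1000.0], name="Size")

skewness = size.skew()   # positive for a long right tail
mean_size = size.mean()
median_size = size.median()

print(skewness, mean_size, median_size)
```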
The relationship between housing area and price:
The plot on the right shows the relationship between Size and Price. It is roughly linear, which matches common sense: the larger the area, the higher the price. However, there are two kinds of anomalies: first, listings with an area of less than 10 square meters but a price of more than 10 million yuan; second, a second-hand house with an area of more than 1000 square meters but a very low price. We need to find out the reasons for both.

Let's first look at the second-hand houses with an area of less than 10 square meters.

df.loc[df['Size']<10]

It turns out that these listings are villas. Because the structure of a villa is rather special, its field definitions differ a little from those of ordinary second-hand commercial housing, which caused the fields to be misaligned when the crawler scraped the data. Since villas are not within the scope of our analysis, this part of the data is removed.
As for the listing with an area of more than 1000 square meters, it is not an ordinary residential second-hand house but most likely a commercial property, which is why its layout is 1 room and 0 halls despite such a large area. It is removed as well.
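
The removal step itself is not shown in this post. One possible way to do it, sketched on made-up rows, is to keep only the listings whose area lies between the two thresholds mentioned in the text (10 and 1000 square meters). Filtering by Size alone is an assumption; the original case may filter on other fields as well.

```python
import pandas as pd

# Made-up rows: one villa-like record and one oversized record
toy = pd.DataFrame({
    "Layout": ["2 rooms 1 hall", "villa", "1 room 0 halls", "3 rooms 1 hall"],
    "Size": [60.0, 5.0, 1019.0, 90.0],
    "Price": [400.0, 1500.0, 300.0, 650.0],
})

# Keep only rows whose area lies inside the plausible range
mask = (toy["Size"] >= 10) & (toy["Size"] <= 1000)
cleaned = toy.loc[mask].reset_index(drop=True)

print(cleaned)
```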

After removing the abnormal data, draw the graph again, and the results are as follows:

3. Layout feature analysis
As can be seen from the figure below, the 2 rooms and 1 hall layout is the most common, followed by 3 rooms and 1 hall, then 3 rooms and 2 halls and 1 room and 1 hall. Further down the list there are many more layouts with very low counts. Most layouts are named in the form "X rooms and Y halls", but some are named "X rooms and Y bathrooms", so the naming is not fully standardized.
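
The code for this step is not included in the post. A rough sketch of how the layout counts and the nonstandard names could be obtained is given below; the English layout labels are made up for illustration and the real field values may look different.

```python
import pandas as pd

# Made-up layout labels
layouts = pd.Series(
    ["2 rooms 1 hall"] * 4
    + ["3 rooms 1 hall"] * 3
    + ["3 rooms 2 halls"] * 2
    + ["2 rooms 1 bathroom"],
    name="Layout",
)

# Number of listings per layout, most common first
counts = layouts.value_counts()

# Layouts whose name does not mention a hall are treated as nonstandard here
is_standard = pd.Series(counts.index).str.contains("hall").to_numpy()
nonstandard = counts[~is_standard]

print(counts)
print(nonstandard)
```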


4. Renovation feature analysis
Let's take a brief look at the Renovation data
code:

df['Renovation'].value_counts()

The results were as follows:
There are no unexpected categories, so the data are relatively clean.

Drawing:

f,[ax1,ax2,ax3]=plt.subplots(3,1,figsize=(5,20))
sns.countplot(x='Renovation',data=df,ax=ax1)#Number of records per category
sns.barplot(x='Renovation',y='Price',data=df,ax=ax2)#Shows the mean by default
sns.boxplot(x='Renovation',y='Price',data=df,ax=ax3)#Box plot
plt.show()

The results were as follows:
Finely decorated second-hand houses are the most numerous, followed by simply decorated ones.
In terms of average price, bare-shell (unfinished) houses are the most expensive, followed by finely decorated ones. The average price of bare-shell houses is high because of their large area, which can be seen from the third plot.
The price of bare-shell houses also fluctuates more than that of finely decorated and simply decorated houses.
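
Whether a high average price comes from a large area can be checked by comparing the mean Price, mean Size and the resulting price per square meter for each category. A minimal sketch on made-up values (the category labels and numbers are assumptions for illustration):

```python
import pandas as pd

# Made-up listings for three renovation categories
toy = pd.DataFrame({
    "Renovation": ["Fine", "Fine", "Simple", "Simple", "Bare shell", "Bare shell"],
    "Price": [600.0, 700.0, 400.0, 500.0, 900.0, 1100.0],
    "Size": [80.0, 100.0, 60.0, 80.0, 200.0, 240.0],
})

# Mean total price and mean area per category
stats = toy.groupby("Renovation")[["Price", "Size"]].mean()
# Price per square meter implied by the two means
stats["PerPrice"] = stats["Price"] / stats["Size"]

print(stats)
```

In this toy example the bare-shell group has the highest mean total price and the largest mean area, but not the highest price per square meter.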

5. Analysis of Elevator characteristics
When looking at the data overview earlier, we saw that Elevator has a lot of missing values. Common ways of treating missing values are: first, filling with the mean or median; second, dropping the records directly; third, predicting the missing values with a model built on the other features.
Filling is used here. However, whether there is an elevator is not a numerical value, so it cannot be filled with the median or the mean. One idea is to fill according to a common rule: generally, if Floor is greater than 6 there is an elevator, and if Floor is 6 or less there is no elevator. The missing values are filled according to this simple rule.

df.loc[(df['Floor']>6)&(df['Elevator'].isnull()),'Elevator']='There is an elevator'
df.loc[(df['Floor']<=6)&(df['Elevator'].isnull()),'Elevator']='No elevator'
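
The effect of the rule can be tried out on a small made-up frame. The sketch below assumes that a Floor value of exactly 6 counts as having no elevator, so that no missing value is left unfilled; existing values are not touched.

```python
import numpy as np
import pandas as pd

# Made-up rows: three missing Elevator values and two known ones
toy = pd.DataFrame({
    "Floor": [3, 6, 7, 18, 5],
    "Elevator": [np.nan, np.nan, np.nan, "There is an elevator", "No elevator"],
})

# Only rows that were missing before the fill are changed
missing = toy["Elevator"].isnull()
toy.loc[missing & (toy["Floor"] > 6), "Elevator"] = "There is an elevator"
toy.loc[missing & (toy["Floor"] <= 6), "Elevator"] = "No elevator"

print(toy)
```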

f,[ax1,ax2]=plt.subplots(1,2,figsize=(10,5))
sns.countplot(x='Elevator',data=df,ax=ax1)
ax1.set_title('Comparison of the number of elevators',fontsize=15)
ax1.set_xlabel('Is there an elevator')
ax1.set_ylabel('number')
sns.barplot(x='Elevator',y='Price',data=df,ax=ax2)
ax2.set_title('Comparison of prices with and without elevators',fontsize=15)
ax2.set_xlabel('Is there an elevator')
ax2.set_ylabel('Total price')
plt.show()

The results show that there are more second-hand houses with elevators. After all, high-rise buildings use land more efficiently, which suits the needs of Beijing's huge population, and high-rise buildings need elevators. Accordingly, second-hand houses with elevators are more expensive, because the installation and later maintenance costs of the elevators are included (but this price comparison only reflects the average; for example, a 6-story luxury residential community without elevators would certainly be priced higher).

6. Year feature analysis
For the FacetGrid used below, see this author's introduction:
https://blog.csdn.net/weixin_42398658/article/details/82960379
The code is as follows:

grid=sns.FacetGrid(df,row='Elevator',col='Renovation',palette='seismic',height=4)#height replaces the old size parameter (renamed in seaborn 0.9)
grid.map(plt.scatter,'Year','Price')
grid.add_legend()
plt.show()

Under the classification conditions of Renovation and Elevator, the Year feature is analyzed with FacetGrid:

Overall, second-hand housing prices rise with the year of construction;
Second-hand houses built after 2000 are clearly more expensive than those built before 2000;
There are almost no records of second-hand houses with elevators before 1980, indicating that elevators were not installed on a large scale before 1980;
Among the pre-1980 second-hand houses without elevators, most are simply decorated, while finely decorated ones are few.
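
The first two observations can be cross-checked numerically by splitting the records at the year 2000 and comparing mean prices. A rough sketch on made-up values (the numbers are assumptions; treating the year 2000 itself as part of the later group is also an assumption):

```python
import numpy as np
import pandas as pd

# Made-up construction years and total prices
toy = pd.DataFrame({
    "Year": [1975, 1985, 1995, 2005, 2012, 2016],
    "Price": [300.0, 350.0, 420.0, 600.0, 750.0, 800.0],
})

# Label each record by whether it was built before 2000
toy["Era"] = np.where(toy["Year"] >= 2000, "2000 or later", "before 2000")
era_mean = toy.groupby("Era")["Price"].mean()

# Linear correlation between construction year and price
year_price_corr = toy["Year"].corr(toy["Price"])

print(era_mean)
print(year_price_corr)
```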

7. Floor feature analysis

f,ax1=plt.subplots(figsize=(20,5))
sns.countplot(x='Floor',data=df,ax=ax1)
ax1.set_title('Number of second-hand houses by floor',fontsize=15)
ax1.set_xlabel('Floor')
ax1.set_ylabel('number')
plt.show()

The results were as follows:
In the figure, second-hand houses on the sixth floor are by far the most numerous, with a very large gap to the other floors. Next comes the 18th floor.

Summary

Most of this exercise is about charts. After this exercise, as a beginner I have a preliminary understanding of how to use Python for visualization. I will continue to learn the principles behind the charts and how to use their parameters.

Tags: Python less Lambda IPython

Posted on Mon, 29 Jun 2020 05:24:16 -0400 by alant