Descriptive statistical analysis and plotting

1 descriptive statistical analysis

1.1 concept

  1. Measurement types of data variables

Nominal: characters (the original meaning) or numeric values (codes)
Ordinal: characters or values that can be sorted, but the differences between levels such as "small, medium, large" are meaningless
Continuous: numeric values, e.g. age
#Continuous variables can be binned and used as ordinal variables, which makes the data more robust
#Nominal and ordinal variables are collectively called categorical variables.
Statistics for categorical variables: frequency, percentage
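A minimal sketch of binning a continuous variable into an ordinal one with pd.cut. The ages, bin edges, and labels are made up for illustration and are not from the case data.

```python
import pandas as pd

# Hypothetical ages; the bin edges and labels are illustrative choices.
ages = pd.Series([3, 17, 25, 41, 58, 64, 79], name="age")

# pd.cut turns a continuous variable into an ordered categorical one.
age_group = pd.cut(
    ages,
    bins=[0, 18, 40, 65, 120],          # intervals are (0, 18], (18, 40], ...
    labels=["child", "young", "middle", "senior"],
)

print(age_group.tolist())
print(age_group.cat.ordered)              # the groups keep their order
print(age_group.value_counts(sort=False)) # frequency of each group
```

After binning, the variable is described with frequencies and percentages like any other categorical variable.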

  2. Describing the distribution of nominal variables

Frequency, percentage
#Most modeling packages in Python do not accept character values, so categories must be encoded as 0/1 (the level of interest is coded as 1)
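A small sketch of the 0/1 encoding described above, plus pd.get_dummies for a variable with more than two levels. The series and the choice of "yes" as the level of interest are hypothetical.

```python
import pandas as pd

# Hypothetical answers; "yes" is treated as the level of interest.
answers = pd.Series(["yes", "no", "no", "yes", "no"], name="subscribed")

# Encode the level of interest as 1 and everything else as 0.
encoded = (answers == "yes").astype(int)
print(encoded.tolist())
print(encoded.mean())     # the mean of a 0/1 variable is the share of 1s

# For more than two levels, one indicator column per level is a common choice.
districts = pd.Series(["haidian", "chaoyang", "haidian"], name="dist")
dummies = pd.get_dummies(districts, prefix="dist")
print(dummies)
```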

  3. Describing the distribution of continuous variables
  • Central tendency (location), the measure of the center: mean, median, mode
    #Choose between mean and median according to skewness: use the median for right-skewed data and the mean when there is no skew
  • Dispersion: variance, standard deviation, range, interquartile range (IQR)

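Variance (sample form, the default of pandas var()): $s^2 = \frac{1}{n-1}\sum_{i=1}^{n}(x_i-\bar{x})^2$

Standard deviation: $s = \sqrt{s^2}$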

Interquartile range (IQR): upper quartile (Q3) - lower quartile (Q1)

Box-and-whisker plot: shows the distribution of a variable, its IQR, and outliers

  • Skewness (shape): 0 for a normal distribution, positive for right skew
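The measures listed above can be computed on a small made-up sample as follows. The 1.5 * IQR rule for flagging outliers is the common box-plot convention and is an assumption here, not something stated in these notes.

```python
import pandas as pd

# Hypothetical right-skewed sample: one large value pulls the mean upward.
x = pd.Series([2, 3, 3, 4, 5, 6, 40])

mean, median = x.mean(), x.median()
variance, std = x.var(), x.std()          # pandas uses the sample form (ddof=1)
q1, q3 = x.quantile(0.25), x.quantile(0.75)
iqr = q3 - q1
skewness = x.skew()                       # positive for right skew

# Assumed convention: points more than 1.5 * IQR beyond the quartiles
# are treated as outliers.
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
outliers = x[(x < lower) | (x > upper)]

print(mean, median, variance, std, iqr, skewness)
print(outliers.tolist())
```

Here the mean is well above the median, which is why the median is the safer summary for right-skewed data.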
  4. Common distributions

1) Lognormal distribution: widely used, e.g. for income; right-skewed
Descriptive statistics: use the median
Modeling: take the logarithm
2) Gamma distribution: e.g. the amount of loss caused by a disaster
3) Poisson distribution: e.g. queue length (a discrete count distribution)
4) Normal distribution: common in nature; the mean represents the central level
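A quick simulation of why the logarithm is taken for lognormal data before modeling. The "income" values are simulated, and the parameters (mean=10, sigma=1, seed 0) are arbitrary illustrative choices.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Simulated right-skewed "income" (not real data).
income = pd.Series(rng.lognormal(mean=10, sigma=1, size=10_000))

raw_skew = income.skew()            # strongly positive
log_skew = np.log(income).skew()    # close to 0 after the log transform

print("mean:", income.mean(), "median:", income.median())
print("skew before log:", raw_skew)
print("skew after log:", log_skew)
```

The mean sits far above the median in the raw data, while the log-transformed values are roughly symmetric.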

1.2 descriptive statistics / cases

Import the required packages

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib
import seaborn as sns
import os                #os:Operating System

Read data

os.chdir(r'D:\python Business practice\<Python Detailed explanation of data science and technology and business practice PDF+source code+Eight cases\<Python Detailed explanation of data science and technology and business practice PDF+source code+Eight cases\source code\Python_book\4Describe')
snd=pd.read_csv("sndHsPr.csv")
snd.head()


Rename the districts

district={'fengtai':'Fengtai District','haidian':'Haidian District','chaoyang':'Chaoyang District','dongcheng':'Dongcheng District','shijingshan':'Shijingshan District','xicheng':'Xicheng District'}
snd['district']=snd.dist.map(district)

1.2.1 single-variable frequency: one categorical variable

value_counts() returns the frequency of each value

snd.district.value_counts()

  • Plot a bar chart: plot(kind='bar')
snd.district.value_counts().plot(kind='bar')

  • Draw a pie chart: kind='pie'
snd.district.value_counts().plot(kind='pie')
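Percentages can be obtained alongside the frequencies with value_counts(normalize=True). A sketch on a made-up series (the district labels A, B, C are placeholders):

```python
import pandas as pd

# Hypothetical district labels.
s = pd.Series(["A", "B", "A", "C", "A", "B"], name="district")

counts = s.value_counts()                  # frequency of each value
shares = s.value_counts(normalize=True)    # proportion of each value

summary = pd.DataFrame({
    "count": counts,
    "percent": (shares * 100).round(1),
})
print(summary)
```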

1.2.2 contingency table

Two categorical variables

pd.crosstab(categorical variable 1, categorical variable 2)

Returns a DataFrame containing the frequency table

sub_sch=pd.crosstab(snd.district,snd.school)    
sub_sch

  • Grouped bar chart
pd.crosstab(snd.district,snd.school).plot(kind='bar')

  • Stacked bar chart, to see how resources are allocated
t1=pd.crosstab(snd.district,snd.school)
t1.plot(kind='bar',stacked=True)

  • Normalized stacked bar chart
    Used to check whether two categorical variables are related. It is intuitive and shows the resource allocation clearly.
sub_sch=pd.crosstab(snd.district,snd.school)
sub_sch['sum1']=sub_sch.sum(1)                  #axis=1 sums across the columns, giving one total per row
sub_sch.head()

sub_sch=sub_sch.div(sub_sch.sum1,axis=0)    #Divide each row by its total to get row percentages
sub_sch

sub_sch[[0,1]].plot(kind='bar',stacked=True)
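The same row percentages can also be produced in one step with the normalize argument of pd.crosstab. A sketch on made-up data, with the manual division shown for comparison:

```python
import pandas as pd

# Hypothetical data: district label and a 0/1 school flag.
df = pd.DataFrame({
    "district": ["A", "A", "A", "B", "B", "B", "B", "C"],
    "school":   [0, 1, 1, 0, 0, 0, 1, 1],
})

# Manual route: counts divided by the row totals.
counts = pd.crosstab(df["district"], df["school"])
manual = counts.div(counts.sum(axis=1), axis=0)

# Built-in route: normalize="index" gives row proportions directly.
row_share = pd.crosstab(df["district"], df["school"], normalize="index")

print(row_share)
# row_share.plot(kind="bar", stacked=True) would draw the normalized chart.
```

This avoids the temporary sum1 column, so no columns need to be selected before plotting.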

  • Stacked bar chart with variable width: the width of each bar represents the group size, which is more intuitive.
def stack2dim(raw,i,j,rotation=0,location='upper right'):

raw: a pandas DataFrame
i, j: names of the two categorical variables, used for the horizontal and vertical axes
rotation: rotation angle of the horizontal labels. They are horizontal by default; if the labels are too long, set an angle such as rotation=40
location: position of the legend. If it is hidden by the plot itself, change it to e.g. 'upper left'

The function lives in a separate module (stack2dim.py) and must be imported before it is called

from stack2dim import *
stack2dim(snd, i="district", j="school")      
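The source of stack2dim is not shown in these notes. The following is only a rough sketch of how a variable-width stacked bar chart could be built; stacked_table and plot_variable_width are hypothetical names and this is not the actual stack2dim implementation.

```python
import pandas as pd


def stacked_table(raw, i, j):
    """Return (widths, shares) for a variable-width stacked bar chart.

    widths: share of all rows falling in each level of i
    shares: row-normalized crosstab of i by j
    """
    counts = pd.crosstab(raw[i], raw[j])
    widths = counts.sum(axis=1) / counts.values.sum()
    shares = counts.div(counts.sum(axis=1), axis=0)
    return widths, shares


def plot_variable_width(raw, i, j):
    # Imported here so that stacked_table works without matplotlib.
    import matplotlib.pyplot as plt

    widths, shares = stacked_table(raw, i, j)
    lefts = widths.cumsum() - widths            # left edge of each bar
    fig, ax = plt.subplots()
    bottom = pd.Series(0.0, index=shares.index)
    for level in shares.columns:
        ax.bar(lefts, shares[level], width=widths, bottom=bottom,
               align="edge", edgecolor="white", label=str(level))
        bottom = bottom + shares[level]
    ax.set_xticks(lefts + widths / 2)           # label each bar at its center
    ax.set_xticklabels(shares.index)
    ax.set_xlabel(i)
    ax.set_ylabel("share of " + j)
    ax.legend(title=j)
    return ax


# Hypothetical demo data.
demo = pd.DataFrame({
    "district": ["A", "A", "B", "B", "B", "B", "C", "C"],
    "school":   [0, 1, 0, 0, 1, 1, 1, 1],
})
widths, shares = stacked_table(demo, "district", "school")
print(widths)
print(shares)
```

Calling plot_variable_width(demo, "district", "school") would draw the chart; bar B would be twice as wide as A or C because it holds half of the rows.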

1.2.3 single continuous variable description

snd.price.agg(['mean','median','sum','std','skew'])    #Get multiple statistics


Draw a histogram to check the distribution; here it looks roughly normal

snd.price.hist(bins=100)            #bins is the number of bins

snd.price.mean()     #mean value
snd.price.median()   #median
snd.price.std()      #standard deviation
snd.price.skew()     #skewness
snd.price.quantile([0.01,0.5,0.99])      #Take quantile

1.2.4 grouped summary

One categorical variable and a statistic of one continuous variable
groupby() computes the statistic within each group

snd.price.groupby(snd.district).mean()    #Mean of the continuous variable per group; mean() can be replaced by another statistic

  • Bar chart
snd.price.groupby(snd.district).mean().plot(kind='bar')

  • Sorted bar chart: kind='barh' (note the added h) draws a horizontal bar chart
  • Grouped box plot
    Shows the relationship between a categorical and a continuous variable: it compares how the continuous variable, especially its median, changes across the levels of the categorical variable.
sns.boxplot(x='district',y='price',data=snd)
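No code is given above for the sorted horizontal bar chart. A sketch of how the grouped means could be sorted before plotting, using made-up prices:

```python
import pandas as pd

# Hypothetical prices per district.
df = pd.DataFrame({
    "district": ["A", "A", "B", "B", "C", "C"],
    "price":    [6.0, 8.0, 3.0, 5.0, 9.0, 11.0],
})

# Several statistics per group in one call.
by_district = df.groupby("district")["price"].agg(["mean", "median", "count"])

# Sort the group means so the bars appear in order.
ranked = by_district["mean"].sort_values()
print(ranked)

# ranked.plot(kind="barh") would draw the sorted horizontal bar chart.
```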

1.2.5 pivot table

Two categorical variables (as rows and columns respectively) and a statistic of one continuous variable
pivot_table()

snd.pivot_table(values='price',index='district',columns='school',aggfunc=np.mean)    #aggfunc is NumPy's np.mean


Bar chart

snd.pivot_table(values='price',index='district',columns='school',aggfunc=np.mean).plot(kind='bar')
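To make clear what pivot_table computes, here is a sketch on made-up data showing that it matches a two-key groupby followed by unstack. The string 'mean' is used for aggfunc here, which pandas also accepts.

```python
import pandas as pd

# Hypothetical data: two categorical variables and one continuous variable.
df = pd.DataFrame({
    "district": ["A", "A", "A", "B", "B", "B"],
    "school":   [0, 0, 1, 0, 1, 1],
    "price":    [4.0, 6.0, 9.0, 3.0, 7.0, 8.0],
})

# Rows are districts, columns are school flags, cells are mean prices.
pivot = df.pivot_table(values="price", index="district",
                       columns="school", aggfunc="mean")

# Equivalent route: group by both variables, then spread one into columns.
same = df.groupby(["district", "school"])["price"].mean().unstack()

print(pivot)
```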

1.2.6 time series - dual-axis chart

Summarize GDP by year and calculate the GDP growth rate. GDP is drawn as bars and the growth rate as a line.
Import the gdp data

gdp=pd.read_csv('gdp_gdpcr.csv',encoding='gbk')

x=list(gdp.year)
GDP=list(gdp.GDP)
GDPCR=list(gdp.GDPCR)
fig=plt.figure()                           #Create the figure

ax1=fig.add_subplot(111)     
ax1.bar(x,GDP)                             #The primary axis shows GDP as bars
ax1.set_ylabel('GDP')                      #Primary y-axis label
ax1.set_title("GDP of China(2000-2017)")   #Chart title
ax1.set_xlim(2000,2018)                    #Range of the x-axis
ax1.set_xlabel('Year')                     #x-axis label (set on ax1, since the twin axis hides its own x-axis)

ax2=ax1.twinx()                            #Secondary y-axis sharing the same x-axis
ax2.plot(x,GDPCR,'r')                      #'r' draws the line in red
ax2.set_ylabel('Increase Ratio')


##Run all of this code in a single cell, otherwise the chart will not be drawn
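The csv above appears to ship with the growth rate already computed in the GDPCR column. If only the yearly totals were available, the growth rate could be derived with pct_change, as in this sketch. The figures are made up for illustration and are not real GDP data.

```python
import pandas as pd

# Made-up yearly totals, for illustration only.
gdp_demo = pd.DataFrame({
    "year": [2000, 2001, 2002, 2003],
    "GDP":  [100.0, 110.0, 121.0, 133.1],
})

# Year-over-year growth rate: (this year - last year) / last year.
gdp_demo["growth"] = gdp_demo["GDP"].pct_change()
print(gdp_demo)
```

The first year has no previous value, so its growth rate is NaN.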

Appendix:
If the x-axis labels are garbled and shown as boxes, the default font probably cannot render Chinese characters. Change the font by adding the following code

from pylab import mpl
mpl.rcParams['font.sans-serif']=['SimHei']    #Use SimHei, a Simplified Chinese font, as the default sans-serif font
mpl.rcParams['axes.unicode_minus']=False      #Keep the minus sign from being displayed as a box
snd.district.value_counts().plot(kind='bar')  

2 plotting principles

Data -> information -> relationship -> chart

Charts that express relationships
One categorical and one continuous variable: box plot
Several categorical variables: stacked bar chart
Two continuous variables: scatter plot


Posted on Mon, 20 Sep 2021 15:17:38 -0400 by ankurcse