# 1 Descriptive statistical analysis

## 1.1 Concepts

1. Measurement types of data variables

Nominal: character strings (the original meaning) or numeric values (codes)
Ordinal: characters or values that can be ordered, but the differences between levels such as "small, medium, large" are not meaningful
Continuous: numeric values, for example age
#Continuous variables can be binned and used as ordinal variables, which makes the analysis more robust
#Nominal and ordinal variables are collectively called categorical variables.
Statistics for categorical variables: frequency, percentage
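Grouping a continuous variable into ordered categories can be sketched with `pd.cut`. The ages, bin edges and labels below are made up for illustration and are not part of the case data.

```python
import pandas as pd

# Hypothetical ages, used only for illustration
ages = pd.Series([5, 17, 18, 34, 35, 59, 60, 82], name="age")

# Bin edges and labels are assumptions; right=False gives intervals [0, 18), [18, 35), ...
age_group = pd.cut(
    ages,
    bins=[0, 18, 35, 60, 120],
    labels=["child", "young adult", "middle-aged", "senior"],
    right=False,
)

# Frequency and percentage: the two statistics used for categorical variables
freq = age_group.value_counts(sort=False)
pct = age_group.value_counts(sort=False, normalize=True) * 100
print(pd.DataFrame({"frequency": freq, "percentage": pct}))
```

The result of `pd.cut` is an ordered categorical, so the groups keep their natural order in tables and charts.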

2. Describing the distribution of nominal variables

Frequency, percentage
#Many Python modeling packages do not accept string values, so categories need to be encoded as 0/1 (the level of interest is coded as 1)
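A minimal sketch of 0/1 coding, assuming a hypothetical string column `near_school` whose level of interest is "yes":

```python
import pandas as pd

# Hypothetical nominal variable stored as strings
df = pd.DataFrame({"near_school": ["yes", "no", "no", "yes", "no"]})

# Code the level of interest ("yes") as 1 and every other level as 0
df["near_school_code"] = (df["near_school"] == "yes").astype(int)

# The mean of a 0/1 variable is the proportion of rows coded as 1
share = df["near_school_code"].mean()
print(df)
print(share)
```

For variables with more than two levels, `pd.get_dummies` produces one 0/1 column per level.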

3. Describing the distribution of continuous variables

• Central tendency (location), a measure of the center: mean, median, mode
#Choose the mean or the median according to the skewness: use the median for right-skewed data and the mean when there is no skew

• Dispersion: variance, standard deviation, range, interquartile range (IQR)

Variance: the average squared deviation from the mean
Standard deviation: the square root of the variance
Interquartile range (IQR): upper quartile - lower quartile

Box-and-whisker plot: shows the distribution of a variable, its interquartile range (IQR), and outliers

• Skewness (shape): 0 for a normal distribution, positive for a right-skewed distribution
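
The location, dispersion and shape measures listed above can be computed with pandas as in the sketch below. The sample is made up, and the 1.5 × IQR outlier rule is the usual box-plot convention, assumed here rather than taken from the text.

```python
import pandas as pd

# Hypothetical right-skewed sample: one large value pulls the mean upward
x = pd.Series([1, 2, 2, 3, 3, 3, 4, 5, 30])

mean = x.mean()
median = x.median()
mode = x.mode().iloc[0]
var = x.var()                      # pandas uses the sample variance (ddof=1) by default
std = x.std()
value_range = x.max() - x.min()
q1, q3 = x.quantile(0.25), x.quantile(0.75)
iqr = q3 - q1
skew = x.skew()                    # positive for right-skewed data

# Box-plot convention (assumed): points beyond 1.5 * IQR from the quartiles are outliers
outliers = x[(x < q1 - 1.5 * iqr) | (x > q3 + 1.5 * iqr)]

print(mean, median, mode, var, std, value_range, iqr, skew)
print(outliers.tolist())
```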

4. Common continuous distributions

1) Lognormal distribution: widely used, for example for income; right-skewed
Descriptive statistical analysis: use the median
Modeling: take the logarithm
2) Gamma distribution: amount of loss caused by disasters
3) Poisson distribution: queue length (strictly speaking a discrete distribution)
4) Normal distribution: common in nature; the mean represents the central level
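Why the median is used to describe a lognormal variable and the logarithm is used for modeling can be seen on simulated data. The sketch below uses arbitrary lognormal parameters as a stand-in for income; none of the numbers come from the case study.

```python
import numpy as np
import pandas as pd

# Simulated "income"; the lognormal parameters are arbitrary assumptions
rng = np.random.default_rng(0)
income = pd.Series(rng.lognormal(mean=10, sigma=1, size=10_000))

# Right-skewed: the mean sits above the median, so the median describes the center better
print(income.mean(), income.median(), income.skew())

# After taking logs the variable is approximately normal, which suits modeling
log_income = np.log(income)
print(log_income.mean(), log_income.median(), log_income.skew())
```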

## 1.2 Descriptive statistics: case study

Import the required packages

```
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import matplotlib
import seaborn as sns
import os                #os: operating system interfaces
```

```
#Set the working directory; change the path to match your own machine
os.chdir(r'D:\python Business practice\<Python Detailed explanation of data science and technology and business practice PDF+source code+Eight cases\<Python Detailed explanation of data science and technology and business practice PDF+source code+Eight cases\source code\Python_book\4Describe')
```

Rename the districts

```
#snd: DataFrame with the housing data, assumed to be loaded already
district={'fengtai':'Fengtai District','haidian':'Haidian District','chaoyang':'Chaoyang District','dongcheng':'Dongcheng District','shijingshan':'Shijingshan District','xicheng':'Xicheng District'}
snd['district']=snd.dist.map(district)
```

### 1.2.1 Single-factor frequency: one categorical variable

value_counts() returns the frequency of each value

```
snd.district.value_counts()
```

• Draw a bar chart: plot(kind='bar')

```
snd.district.value_counts().plot(kind='bar')
```

• Draw a pie chart: plot(kind='pie')

```
snd.district.value_counts().plot(kind='pie')
```

### 1.2.2 Contingency table

Two categorical variables

pd.crosstab(categorical variable 1, categorical variable 2)

Returns a DataFrame holding the frequency table
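A minimal sketch of what `pd.crosstab` returns, using a hypothetical toy frame that borrows the column names of the case data. The `normalize='index'` argument gives row proportions directly.

```python
import pandas as pd

# Hypothetical toy data; only the column names mirror the case study
toy = pd.DataFrame({
    "district": ["A", "A", "A", "B", "B", "C"],
    "school":   [0, 1, 1, 0, 0, 1],
})

# Rows come from the first variable, columns from the second, cells are counts
counts = pd.crosstab(toy["district"], toy["school"])

# normalize='index' divides each row by its total, giving row proportions
row_pct = pd.crosstab(toy["district"], toy["school"], normalize="index")

print(counts)
print(row_pct)
```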

```
sub_sch=pd.crosstab(snd.district,snd.school)
sub_sch
```

• Clustered bar chart

```
pd.crosstab(snd.district,snd.school).plot(kind='bar')
```

• Stacked bar chart, to see how resources are allocated

```
t1=pd.crosstab(snd.district,snd.school)
t1.plot(kind='bar',stacked=True)
```

• Standardized stacked bar chart
Used to check whether two categorical variables are related. It is intuitive and shows the resource allocation clearly.

```
sub_sch=pd.crosstab(snd.district,snd.school)
sub_sch['sum1']=sub_sch.sum(1)                  #axis=1 sums across the columns, giving one total per row
```

```
sub_sch=sub_sch.div(sub_sch.sum1,axis=0)    #Divide each row by its total to get row proportions
sub_sch
```

```
sub_sch[[0,1]].plot(kind='bar',stacked=True)
```

• Letting the bar width represent the group size is even more intuitive.

```
def stack2dim(raw,i,j,rotation=0,location='upper right'):
    ...  #signature only; the body lives in the stack2dim module
```

raw: a pandas DataFrame
i, j: names of the two categorical variables, used for the horizontal and vertical axes
rotation: rotation angle of the horizontal-axis labels. The default is horizontal; if the labels are too long, set an angle such as rotation=40
location: position of the legend. If it is hidden by the main plot, change it to 'upper left'
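The geometry behind a stacked chart whose bar width reflects group size can be sketched with pandas alone. This is not the source of stack2dim, which the text does not show; the toy data and the names `widths`, `lefts` and `heights` are hypothetical.

```python
import pandas as pd

# Hypothetical toy data; only the column names mirror the case study
toy = pd.DataFrame({
    "district": ["A"] * 6 + ["B"] * 3 + ["C"] * 1,
    "school":   [0, 0, 1, 1, 1, 1, 0, 0, 1, 1],
})

counts = pd.crosstab(toy["district"], toy["school"])

# Bar width: each district's share of all observations
widths = counts.sum(axis=1) / counts.values.sum()

# Left edge of each bar, so the bars sit side by side on a 0-1 axis
lefts = widths.cumsum() - widths

# Segment heights: share of each school level within the district
heights = counts.div(counts.sum(axis=1), axis=0)

print(widths)
print(lefts)
print(heights)
```

Each column of `heights` could then be drawn with matplotlib's `Axes.bar`, passing `lefts` as the x positions, `widths` as the widths, the running total of the previous segments as `bottom`, and `align='edge'`.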

The function has to be imported before it can be called

```
from stack2dim import stack2dim
stack2dim(snd, i="district", j="school")
```

### 1.2.3 Description of a single continuous variable

```
snd.price.agg(['mean','median','sum','std','skew'])    #Get several statistics at once
```

Draw a histogram to check the distribution; it is close to a normal distribution

```
snd.price.hist(bins=100)            #bins is the number of bins
```

```
snd.price.mean()     #mean
snd.price.median()   #median
snd.price.std()      #standard deviation
snd.price.skew()     #skewness
```

```
snd.price.quantile([0.01,0.5,0.99])      #Take quantiles
```

### 1.2.4 Grouped summary

Statistics of one continuous variable by one categorical variable
groupby() computes the statistic within each group
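A minimal sketch of a grouped summary on hypothetical toy data. The column names mirror the case study, and the prices are made up.

```python
import pandas as pd

# Hypothetical toy data; the values are made up for illustration
toy = pd.DataFrame({
    "district": ["A", "A", "B", "B", "B"],
    "price":    [10.0, 20.0, 30.0, 40.0, 80.0],
})

# One statistic per group
mean_by_district = toy.groupby("district")["price"].mean()

# Several statistics per group in one call
summary = toy.groupby("district")["price"].agg(["count", "mean", "median", "std"])

print(mean_by_district)
print(summary.sort_values("mean", ascending=False))
```

Comparing the `mean` and `median` columns per group is a quick check for skew within each category.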

```
snd.price.groupby(snd.district).mean()    #Mean of the continuous variable; other statistics can be substituted
```

• Bar chart

```
snd.price.groupby(snd.district).mean().plot(kind='bar')
```

• Sort the values and draw a horizontal bar chart with kind='barh' (the h stands for horizontal)

• Grouped box-and-whisker plot
Shows the relationship between a categorical variable and a continuous variable: it compares how the continuous variable changes across category levels, in particular the medians, which is intuitive.

```
sns.boxplot(x='district',y='price',data=snd)
```

### 1.2.5 Summary table

Statistics of one continuous variable by two categorical variables (one along the rows and one along the columns)
pivot_table()
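A minimal sketch of `pivot_table` on hypothetical toy data, together with the equivalent `groupby` plus `unstack` form. The string `'mean'` is used for `aggfunc` here, which recent pandas versions prefer over passing a NumPy function.

```python
import pandas as pd

# Hypothetical toy data; only the column names mirror the case study
toy = pd.DataFrame({
    "district": ["A", "A", "A", "B", "B", "B"],
    "school":   [0, 1, 1, 0, 0, 1],
    "price":    [10.0, 20.0, 30.0, 40.0, 60.0, 90.0],
})

# Rows: district, columns: school, cells: mean price
table = toy.pivot_table(values="price", index="district", columns="school", aggfunc="mean")

# The same numbers via groupby + unstack
same = toy.groupby(["district", "school"])["price"].mean().unstack()

print(table)
```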

```
snd.pivot_table(values='price',index='district',columns='school',aggfunc=np.mean)    #aggfunc is NumPy's mean function, np.mean
```

Bar chart

```
snd.pivot_table(values='price',index='district',columns='school',aggfunc=np.mean).plot(kind='bar')
```

### 1.2.6 Time series: dual-axis chart

Summarize GDP by year and calculate the GDP growth rate. GDP is drawn as bars and the growth rate as a line.
Import the GDP data

```
gdp=pd.read_csv('gdp_gdpcr.csv',encoding='gbk')
```

```
x=list(gdp.year)
GDP=list(gdp.GDP)
GDPCR=list(gdp.GDPCR)
fig=plt.figure()                           #Create the figure
ax1=fig.add_subplot(111)                   #Primary axes; this line is missing from the original and is assumed
ax1.bar(x,GDP)                             #The primary axis shows GDP
ax1.set_ylabel('GDP')                      #Label of the primary y-axis
ax1.set_xlabel('Year')                     #x-axis label; set on ax1 because twinx() hides the x-axis of ax2
ax1.set_title("GDP of China(2000-2017)")   #Chart title
ax1.set_xlim(2000,2018)                    #Range of the x-axis

ax2=ax1.twinx()                            #Secondary y-axis sharing the same x-axis
ax2.plot(x,GDPCR,'r')                      #'r' means red
ax2.set_ylabel('Increase Ratio')
```

#All of the plotting code must be in one cell, otherwise the chart is not drawn
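The growth rate drawn on the secondary axis can be derived from the GDP column itself. The sketch below uses made-up figures and assumes the rate is the year-over-year percentage change; how GDPCR is defined in gdp_gdpcr.csv (ratio or percentage) is not stated in the text.

```python
import pandas as pd

# Hypothetical figures, not real GDP data
gdp_toy = pd.DataFrame({
    "year": [2000, 2001, 2002, 2003],
    "GDP":  [100.0, 110.0, 121.0, 133.1],
})

# Growth rate in percent: (GDP_t - GDP_{t-1}) / GDP_{t-1} * 100
gdp_toy["GDPCR"] = gdp_toy["GDP"].pct_change() * 100

print(gdp_toy)
```

The first year has no previous value, so its growth rate is missing (NaN) and is simply skipped when the line is plotted.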

Appendix:
If the axis labels are garbled and shown as boxes, the default font probably cannot render Chinese characters. Change the font by adding the following code

```
from pylab import mpl
mpl.rcParams['font.sans-serif']=['SimHei']    #Use SimHei as the default font so that Chinese characters render
mpl.rcParams['axes.unicode_minus']=False      #Keep the minus sign from being displayed as a box
snd.district.value_counts().plot(kind='bar')
```

# 2 Principles of charting

Data -> information -> relationship -> chart

Charts that express relationships
One categorical variable and one continuous variable: box-and-whisker plot
Several categorical variables: stacked bar chart
Two continuous variables: scatter plot

Posted on Mon, 20 Sep 2021 15:17:38 -0400 by ankurcse