# 1. Basic statistics

■ basic statistical analysis: also known as descriptive statistical analysis, it generally reports the minimum value, the first quartile, the median, the third quartile, and the maximum value of a variable.

• Common statistical indicators: count, sum, mean, variance, standard deviation

• Descriptive statistical analysis function: describe()

• Common statistical functions:

| Statistical function | Notes |
|---|---|
| size | Count |
| sum | Summation |
| mean | Mean value |
| var | Variance |
| std | Standard deviation |
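A minimal sketch of these functions on a small, made-up Series (the values below are hypothetical and are not the `data.csv` used in this article). It also shows that pandas computes `var` and `std` with the sample formulas (`ddof=1`) unless told otherwise.

```python
import pandas as pd

# Hypothetical scores for illustration only
scores = pd.Series([96, 104, 110, 118, 125, 131, 140], name="score")

# describe() returns count, mean, std, min, 25%, 50%, 75% and max in one call
summary = scores.describe()
print(summary)

# var() and std() default to the sample formulas (ddof=1)
sample_var = scores.var()
sample_std = scores.std()
# Pass ddof=0 to get the population variance instead
population_var = scores.var(ddof=0)

print(scores.size)  # size is an attribute, not a method
print(sample_var, sample_std, population_var)
```

The standard deviation is always the square root of the variance, which is a quick sanity check on any pair of reported values.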

## 1.1 import data

```python
from pandas import read_csv

# Raw string so the backslashes in the Windows path are not read as escape sequences
data = read_csv(r'F:\Data analysis\Data analysis 3\Chapter 8 data analysis\8\8.1\data.csv')
```

## 1.2 data description

```python
data.score.describe()
```

## 1.3 statistics of values

```python
data.score.min()   # 96
data.score.max()   # 140
data.score.sum()   # 1574
data.score.mean()  # 121.07692307692308
data.score.var()   # 154.91025641025644
data.score.std()   # 12.44629488684309 (the square root of the variance)
data.score.size    # 13; note that size is an attribute, so it takes no parentheses
```

# 2. Group analysis

■ group analysis: it refers to an analysis method that divides the analysis object into different parts according to the group fields to compare and analyze the differences between groups.

• Common statistical indicators: count, sum, average

• Group statistics function:

• groupby(by=[group column 1, group column 2, ...])
• [statistical column 1, statistical column 2, ...]
• .agg(**{statistical column alias 1: statistical function 1, statistical column alias 2: statistical function 2, ...}) (named aggregation; passing the alias dictionary directly to rename the statistics of a single column was removed in pandas 1.0)
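The pattern above can be sketched on a small in-memory table. The rows are hypothetical; only the column names `class`, `name` and `score` mirror the ones used later in this article, and the output names `total`, `number` and `average` are chosen here for illustration.

```python
import pandas as pd

# Hypothetical data, not the article's data.csv
data = pd.DataFrame({
    "class": ["A", "A", "B", "B", "B"],
    "name": ["Ann", "Bob", "Cid", "Dee", "Eve"],
    "score": [90, 110, 100, 120, 140],
})

# Named aggregation: each keyword is an output column, each value a function name
result = data.groupby(by=["class"])["score"].agg(
    total="sum",
    number="size",
    average="mean",
)
print(result)

print(result.index)       # the group keys become the index of the result
print(result["average"])  # one statistic selected as a Series
```

Each group key becomes a row label, so a single group can be looked up with `result.loc["A"]`.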

## 2.1 import data

```python
import numpy
from pandas import read_csv

data = read_csv(r'F:\Data analysis\Data analysis 3\Chapter 8 data analysis\8\8.2\data.csv')
```

## 2.2 double the score column

```python
data['score2'] = data['score']*2
```

## 2.3 basic statistics

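```python
# Named aggregation: dict-style renaming inside agg() was removed in pandas 1.0,
# so the alias dictionary is unpacked into keyword arguments instead
data.groupby(by=['class'])['score'].agg(**{
    'Total score': 'sum',
    'Number': 'size',
    'average value': 'mean',
    'variance': 'var',
    'standard deviation': 'std'
})
```

## 2.4 multi group statistics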

```python
data.groupby(by=['class','name'])[['score','score2']].agg([
    numpy.size,
    numpy.sum
])
```

## 2.5 viewing data

```python
# Named aggregation: dict-style renaming inside agg() was removed in pandas 1.0
result = data.groupby(by=['class'])['score'].agg(**{
    'Total score': 'sum',
    'Number': 'size',
    'average value': 'mean',
    'variance': 'var',
    'standard deviation': 'std'
})
print(result)
```

(1) View index: `result.index`

(3) View average column: `result['average value']`

## 2.6 multi level index query

### 2.6.1 establish multi-level index

```python
result2 = data.groupby(by=['class','name'])[['score','score2']].agg([
    numpy.size,
    numpy.sum
])
```

### 2.6.2 index query
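The queries listed in this subsection can be sketched on a small hypothetical frame. The rows are made up, and the statistics `sum` and `mean` are chosen here only to keep the example short; grouping by two columns gives a two-level row index, and aggregating two columns with two functions gives a two-level column index.

```python
import pandas as pd

# Hypothetical data for illustration only
data = pd.DataFrame({
    "class": ["A", "A", "B", "B"],
    "name": ["Ann", "Bob", "Cid", "Dee"],
    "score": [90, 110, 100, 120],
})
data["score2"] = data["score"] * 2

result2 = data.groupby(by=["class", "name"])[["score", "score2"]].agg(["sum", "mean"])

print(result2.index)    # MultiIndex of (class, name) pairs
print(result2.columns)  # MultiIndex of (column, statistic) pairs

# Single value: a full row key and a full column key
single = result2.loc[("A", "Ann"), ("score", "sum")]

# Multiple values: a partial row key returns every row of that class
class_a = result2.loc["A"]

# One statistic of one column across all groups
score_sums = result2[("score", "sum")]

# Turn the index levels back into ordinary columns
flat = result2.reset_index()
print(flat)
```

Tuples address one entry per level, so `("A", "Ann")` names a row and `("score", "sum")` names a column.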

(1) Query index

(3) Single value query

(4) Multivalued query

## 2.7 reset index

```python
result2.reset_index()
```

# 3. Distribution analysis

■ distribution analysis: it refers to an analysis method that groups the data (quantitative data) equally or unequally according to the analysis purpose to study the distribution law of each group.
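A minimal sketch of both kinds of grouping with `pandas.cut`, using made-up ages. The labels and values are assumptions for illustration: a list of edges gives unequal-width groups, while an integer asks for that many equal-width groups.

```python
import pandas as pd

# Hypothetical ages for illustration only
ages = pd.Series([18, 20, 25, 33, 40, 47], name="Age")

# Unequal-width groups: explicit edges; intervals are right-closed, e.g. (20, 30]
bins = [ages.min() - 1, 20, 30, 40, ages.max() + 1]
labels = ["20 and under", "21-30", "31-40", "41 and over"]
age_group = pd.cut(ages, bins, labels=labels)

# Equal-width groups: an integer requests that many bins of equal width
equal_width = pd.cut(ages, 3)

# Number of observations in each labelled group
counts = age_group.value_counts(sort=False)
print(counts)
print(equal_width.cat.categories)
```

Because the intervals are closed on the right, an age of exactly 20 falls in the first group and an age of exactly 40 falls in the third.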

## 3.1 import data

```python
import pandas
from pandas import read_csv

data = read_csv(r'F:\Data analysis\Data analysis 3\Chapter 8 data analysis\8\8.3\data.csv')
```

## 3.2 data grouping

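```python
# The outer edges sit just outside the observed range so every age falls in a bin
bins = [min(data['Age'])-1, 20, 30, 40, max(data['Age'])+1]
labels = ['20 and below', '21~30', '31~40', '41 and above']
# labels must be passed by keyword: the third positional parameter of cut is `right`
age_group = pandas.cut(data.Age, bins, labels=labels)
data['Age stratification'] = age_group
data
```

## 3.3 statistical grouping data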

```python
# Named aggregation: the keyword is the output column name
data.groupby(by=['Age stratification'])['Age'].agg(Number='size')
```

# 4. Cross analysis

■ cross analysis: it is usually used to analyze the relationship between two or more grouped variables, comparing them in the form of a cross table. The grouped variables can be combined in three ways:

● quantitative group with quantitative group
● quantitative group with qualitative group
● qualitative group with qualitative group

## 4.1 import and group data

```python
import pandas, numpy
from pandas import read_csv

data = read_csv(r'F:\Data analysis\Data analysis 3\Chapter 8 data analysis\8\8.4\data.csv')
# Data grouping
bins = [min(data['Age'])-1, 20, 30, 40, max(data['Age'])+1]
labels = ['20 and below', '21~30', '31~40', '41 and above']
# labels must be passed by keyword: the third positional parameter of cut is `right`
age_group = pandas.cut(data.Age, bins, labels=labels)
data['Age stratification'] = age_group
```

## 4.2 cross analysis (pivot table)
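Before the file-based steps, here is a self-contained sketch of what a cross table looks like. The people, genders and group labels are hypothetical, and the age group is typed in directly as a string column to keep the example short.

```python
import pandas as pd

# Hypothetical data for illustration only
people = pd.DataFrame({
    "Gender": ["F", "M", "F", "M", "F", "M"],
    "Age group": ["<=20", "21-30", "21-30", "21-30", "31-40", "31-40"],
    "Age": [19, 24, 28, 26, 35, 38],
})

# Mean age for every (age group, gender) combination
table = people.pivot_table(
    values="Age",
    index="Age group",
    columns="Gender",
    aggfunc="mean",
)
print(table)

# Head count for every combination; empty combinations are reported as 0
counts = pd.crosstab(people["Age group"], people["Gender"])
print(counts)
```

A combination with no rows shows up as `NaN` in the pivot table and as `0` in the count table.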

Number and average age of men and women:

```python
result1 = data.pivot_table(
    values=['Age'],
    index=['Age stratification'],
    columns=['Gender'],
    aggfunc=[numpy.size, numpy.mean]
)

result1
```

Standard deviation of age for each age group, by gender:

```python
result2 = data.pivot_table(
    values=['Age'],
    index=['Age stratification'],
    columns=['Gender'],
    aggfunc=[numpy.std]
)

result2
```

## 4.3 merge DataFrame

# 5. Structural analysis

■ structural analysis: it is an analysis method to calculate the proportion of each component on the basis of grouping, and then analyze the overall internal characteristics.

• axis parameter description
• 0 -> operation by column (one result per column)
• 1 -> operation by row (one result per row)

## 5.1 import data

```python
import numpy
from pandas import read_csv

data = read_csv(r'F:\Data analysis\Data analysis 3\Chapter 8 data analysis\8\8.5\data.csv')

data
```

## 5.2 cross analysis (pivot table)

```python
data_pt = data.pivot_table(
    values=['Monthly consumption (yuan)'],
    index=['Province'],
    columns=['Communication brand'],
    aggfunc=[numpy.sum]
)

data_pt
```

## 5.3 cross analysis operation

### 5.3.1 direct summation

### 5.3.2 sum by column

Summary: direct summation, `data_pt.sum()`, defaults to summation by column, the same as `data_pt.sum(axis=0)`.
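The summations in this section, and the proportions built from them, can be sketched on a small table. The provinces, brands and amounts are hypothetical, and a plain DataFrame stands in for the pivot table.

```python
import pandas as pd

# Hypothetical table: provinces as rows, brands as columns
spend = pd.DataFrame(
    {"Brand X": [100, 300], "Brand Y": [300, 100], "Brand Z": [600, 600]},
    index=["Province 1", "Province 2"],
)

col_totals = spend.sum()        # default axis=0: one total per brand (column)
row_totals = spend.sum(axis=1)  # one total per province (row)

# Share of each brand within a province: divide every row by its row total
brand_share = spend.div(row_totals, axis=0)

# Share of each province within a brand: divide every column by its column total
province_share = spend.div(col_totals, axis=1)

print(brand_share)
print(province_share)
```

In `div`, the `axis` argument names the axis whose labels are matched against the divisor, so row totals are aligned with `axis=0` and column totals with `axis=1`.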

### 5.3.3 sum by row

### 5.3.4 proportion of communication brands in each province

```python
# Divide each row by its row total: share of each brand within a province
data_pt.div(data_pt.sum(axis=1), axis=0)
```

### 5.3.5 proportion of provinces in each communication brand

```python
# Divide each column by its column total: share of each province within a brand
data_pt.div(data_pt.sum(axis=0), axis=1)
```

# 6. Correlation

■ correlation analysis: it is a statistical method for studying the correlation between random variables, that is, whether phenomena depend on each other, and if so, in which direction and how strongly.
■ correlation coefficient: it can be used to describe the relationship between quantitative variables

• Correlation analysis function:

• DataFrame.corr()
• Series.corr(other)

• Function description:

• If the corr method is called on a data frame, the correlation between every pair of columns is calculated
• If the corr method is called on a Series, only the correlation between that Series and the Series passed in is calculated

• Return value:

• DataFrame call: returns a DataFrame
• Series call: returns a single number, the correlation coefficient
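The two call styles can be compared on a small hypothetical frame. The columns `x`, `y` and `z` are made up for illustration; both calls use the Pearson coefficient, which is the pandas default.

```python
import pandas as pd

# Hypothetical data for illustration only
df = pd.DataFrame({
    "x": [1, 2, 3, 4, 5],
    "y": [2, 4, 6, 8, 10],  # exactly 2 * x, so perfectly correlated with x
    "z": [5, 3, 4, 1, 2],   # tends to fall as x rises
})

# DataFrame call: a square DataFrame of pairwise coefficients
matrix = df.corr()
print(matrix)

# Series call: a single number
r_xy = df["x"].corr(df["y"])
r_xz = df["x"].corr(df["z"])
print(r_xy, r_xz)
```

The matrix is symmetric with ones on the diagonal, and each off-diagonal entry equals the corresponding Series-to-Series result.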

## 6.1 import data

```python
from pandas import read_csv

data = read_csv(r'F:\Data analysis\Data analysis 3\Chapter 8 data analysis\8\8.6\data.csv')
data
```

## 6.2 correlation of two sequences

```python
data['Illiteracy rate'].corr(data['population'])
# The two directions are interchangeable
data['population'].corr(data['Illiteracy rate'])
```

Because the coefficient is about 0.1, which is below 0.3, the two columns are only weakly correlated.

## 6.3 correlation of multiple columns

```python
data.loc[:,['Supermarket shopping rate','Online shopping rate','Illiteracy rate','population']].corr()
```

Posted on Sun, 12 Jan 2020 06:24:57 -0500 by TKB