Introduction to data analysis

1. Basic statistics

■ basic statistical analysis: also known as descriptive statistical analysis, it summarizes a variable with values such as its minimum, first quartile, median, third quartile, and maximum.

  • Common statistical indicators:

    • Count, sum, mean, variance, standard deviation
  • Descriptive statistical analysis function:

    • describe()
  • Common statistical functions:

    Statistical function    Notes
    size                    count
    sum                     sum
    mean                    mean
    var                     variance
    std                     standard deviation

1.1 import data

from pandas import read_csv
# Raw string, so the backslashes in the Windows path are not read as escape sequences
data = read_csv(r'F:\Data analysis\Data analysis 3\Chapter 8 data analysis\8\8.1\data.csv')

1.2 data description

data.score.describe()
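As a minimal sketch of what describe() returns, the example below runs it on a small made-up Series (the values are hypothetical, not the article's data.csv). The result is itself a Series whose entries can be read by label.

```python
import pandas as pd

# Hypothetical scores, used only for illustration
scores = pd.Series([96, 105, 110, 120, 140], name="score")

# describe() returns count, mean, std, min, quartiles and max
summary = scores.describe()
print(summary)

# Individual statistics are looked up by their label
print(summary["min"], summary["50%"], summary["max"])
```

The "50%" entry is the median, and "25%" and "75%" are the first and third quartiles.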

1.3 statistics of values

data.score.min()   # 96
data.score.max()   # 140
data.score.sum()   # 1574
data.score.mean()  # 121.07692307692308
data.score.var()   # 154.91025641025644
data.score.std()   # 12.44629488684309 (the square root of the variance)
data.score.size    # 13; size is an attribute, not a method, so it takes no parentheses
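Two details of these functions are easy to miss, and the sketch below illustrates them on made-up values. First, pandas computes the sample variance and standard deviation by default (ddof=1), and std() is the square root of var(). Second, size counts every element, while count() skips missing values.

```python
import math
import pandas as pd

# Hypothetical scores, used only for illustration
scores = pd.Series([96, 105, 110, 120, 140])

sample_var = scores.var()          # divides by n - 1 (ddof=1 is the default)
sample_std = scores.std()          # square root of the sample variance
population_var = scores.var(ddof=0)  # divides by n

print(sample_var, sample_std, population_var)
print(math.isclose(sample_std, math.sqrt(sample_var)))

# size includes missing values, count() does not
with_gap = pd.Series([1.0, None, 3.0])
print(with_gap.size, with_gap.count())
```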



2. Group analysis

■ group analysis: an analysis method that splits the data into groups according to one or more grouping fields, then compares the groups to find the differences between them.

  • Common statistical index parameters

    • Count, sum, average
  • Group statistics function:

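    • groupby(by=[group column 1, group column 2, ...])
      [[statistical column 1, statistical column 2, ...]]
    • .agg({statistical column alias 1: statistical function 1, statistical column alias 2: statistical function 2, ...})
      (when a single column is selected, pandas 1.0 and later expects the aliases as keyword arguments: .agg(**{alias: function, ...}))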
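The pattern above can be sketched on a small made-up table. The data and the alias names total, number and average are assumptions for illustration. The aliases are passed as keyword arguments (named aggregation), which is the form current pandas versions accept for a single selected column.

```python
import pandas as pd

# Hypothetical data, used only for illustration
data = pd.DataFrame({
    "class": ["A", "A", "B", "B", "B"],
    "name": ["Ann", "Bob", "Cat", "Dan", "Eve"],
    "score": [90, 110, 100, 120, 140],
})

# Group by class, select the score column, then name each statistic
result = data.groupby(by=["class"])["score"].agg(
    total="sum",
    number="size",
    average="mean",
)
print(result)
```

Each keyword becomes a column of the result, and each group value becomes a row.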

2.1 import data

import numpy
from pandas import read_csv
data = read_csv('F:\Data analysis\Data analysis 3\Chapter 8 data analysis\8\8.2\data.csv')

2.2 add a doubled score column

data['score2'] = data['score']*2

2.3 basic statistics

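# pandas 1.0 and later reject an {alias: function} dict on a single column,
# so the aliases are passed as keyword arguments (named aggregation)
data.groupby(by=['class'])['score'].agg(**{
    'Total score': 'sum',
    'Number': 'size',
    'average value': 'mean',
    'variance': 'var',
    'standard deviation': 'std'
})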

2.4 multi group statistics

data.groupby(by=['class','name'])[['score','score2']].agg([
    numpy.size,
    numpy.sum
])

2.5 viewing data

# Aliases passed as keyword arguments (named aggregation, pandas 0.25 and later)
result = data.groupby(by=['class'])['score'].agg(**{
    'Total score': 'sum',
    'Number': 'size',
    'average value': 'mean',
    'variance': 'var',
    'standard deviation': 'std'
})
print(result)


(1) View index

(2) View header

(3) View average column
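The original code for these three steps is not shown, so what follows is only a plausible sketch on made-up data: the index holds the group labels, the columns hold the statistic aliases, and a single column is selected by its alias.

```python
import pandas as pd

# Hypothetical data, used only for illustration
data = pd.DataFrame({
    "class": ["A", "A", "B", "B"],
    "score": [90, 110, 100, 140],
})

result = data.groupby(by=["class"])["score"].agg(**{
    "Total score": "sum",
    "Number": "size",
    "average value": "mean",
})

print(result.index)             # (1) index: one entry per class
print(result.columns)           # (2) header: the statistic aliases
print(result["average value"])  # (3) the average column
```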

2.6 multi level index query

2.6.1 establish multi-level index

result2 = data.groupby(by=['class','name'])[['score','score2']].agg([
    numpy.size,
    numpy.sum
])

2.6.2 index query

(1) Query index

(2) Query header

(3) Single value query

(4) Multivalued query
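The original code for these queries is not shown. A minimal sketch on made-up data, assuming the same two-level row index (class, name) and two-level column header (column, statistic) as result2, could look like this:

```python
import pandas as pd

# Hypothetical data, used only for illustration
data = pd.DataFrame({
    "class": ["A", "A", "B", "B"],
    "name": ["Ann", "Bob", "Cat", "Dan"],
    "score": [90, 110, 100, 140],
})
data["score2"] = data["score"] * 2

result2 = data.groupby(by=["class", "name"])[["score", "score2"]].agg(["size", "sum"])

print(result2.index)    # (1) MultiIndex of (class, name) pairs
print(result2.columns)  # (2) MultiIndex of (column, statistic) pairs

# (3) single value: one row tuple and one column tuple
single = result2.loc[("A", "Ann"), ("score", "sum")]

# (4) several values: every row of class A, or one statistic for all rows
class_a = result2.loc["A"]
score_sums = result2["score"]["sum"]
print(single, class_a, score_sums, sep="\n")
```

Row lookups go through .loc with a tuple, one element per index level. Column lookups can be chained, outer level first.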

2.7 reset index

result2.reset_index()
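As a rough illustration on made-up data, reset_index() moves the grouping levels out of the index and back into ordinary columns, leaving a default 0..n-1 row index. It returns a new DataFrame and leaves result2 unchanged.

```python
import pandas as pd

# Hypothetical data, used only for illustration
data = pd.DataFrame({
    "class": ["A", "A", "B", "B"],
    "name": ["Ann", "Bob", "Cat", "Dan"],
    "score": [90, 110, 100, 140],
})
data["score2"] = data["score"] * 2

result2 = data.groupby(by=["class", "name"])[["score", "score2"]].agg(["size", "sum"])

flat = result2.reset_index()
print(flat)

# class and name are now columns; the header keeps two levels,
# so the second level of these two columns is an empty string
print(flat[("class", "")].tolist())
```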





3. Distribution analysis

■ distribution analysis: an analysis method that divides quantitative data into intervals of equal or unequal width, chosen to suit the purpose of the analysis, and then studies how the data is distributed across those intervals.

3.1 import data

import pandas
from pandas import read_csv
data = read_csv(r'F:\Data analysis\Data analysis 3\Chapter 8 data analysis\8\8.3\data.csv')

3.2 data grouping

bins = [min(data['Age'])-1,20,30,40,max(data['Age'])+1]
labels = ['20 and under','21 to 30','31 to 40','41 and over']
# labels must be passed by keyword: the third positional argument of pandas.cut is `right`
age_group = pandas.cut(data['Age'],bins,labels=labels)
data['Age stratification'] = age_group
data

3.3 statistical grouping data

# Count the rows in each age group (named aggregation, pandas 0.25 and later)
data.groupby(by=['Age stratification'])['Age'].agg(Number='size')
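To make the binning concrete, here is a minimal sketch on made-up ages (the values and labels are assumptions for illustration). With the default right=True, each interval includes its right edge, so an age of exactly 20 falls into the first group and 30 into the second.

```python
import pandas as pd

# Hypothetical ages, used only for illustration
ages = pd.DataFrame({"Age": [18, 20, 21, 30, 35, 47]})

# Outer edges sit just beyond the data so the min and max are included
bins = [ages["Age"].min() - 1, 20, 30, 40, ages["Age"].max() + 1]
labels = ["20 and under", "21 to 30", "31 to 40", "41 and over"]

ages["Age stratification"] = pd.cut(ages["Age"], bins, labels=labels)

# observed=False keeps every label in the result, even an empty one
counts = ages.groupby("Age stratification", observed=False)["Age"].agg(Number="size")
print(counts)
```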

4. Cross analysis

■ cross analysis: it is usually used to analyze the relationship between two or more grouping variables by comparing them in a cross table. The groups can be crossed in three ways:

● quantitative group crossed with quantitative group
● quantitative group crossed with qualitative group
● qualitative group crossed with qualitative group

4.1 import and group data

import numpy
import pandas
from pandas import read_csv
data = read_csv(r'F:\Data analysis\Data analysis 3\Chapter 8 data analysis\8\8.4\data.csv')
# Data grouping
bins = [min(data['Age'])-1,20,30,40,max(data['Age'])+1]
labels = ['20 and under','21 to 30','31 to 40','41 and over']
# labels must be passed by keyword: the third positional argument of pandas.cut is `right`
age_group = pandas.cut(data['Age'],bins,labels=labels)
data['Age stratification'] = age_group

4.2 cross analysis (pivot table)

Number of people and average age, by age group and gender:

result1 = data.pivot_table(
    values=['Age'],
    index=['Age stratification'],
    columns=['Gender'],
    aggfunc=[numpy.size,numpy.mean]
)

result1


Standard deviation of age, by age group and gender:

result2 = data.pivot_table(
    values=['Age'],
    index=['Age stratification'],
    columns=['Gender'],
    aggfunc=[numpy.std]
)

result2

4.3 merge DataFrame
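The original code for this step is not shown. One possible way to combine two pivot tables that share the same row index is pandas.concat along the columns, sketched below on made-up data. The choice of concat and the use of the 'count', 'mean' and 'std' string names are assumptions for illustration, not necessarily what the author did.

```python
import pandas as pd

# Hypothetical data, used only for illustration
data = pd.DataFrame({
    "Age stratification": ["21 to 30"] * 4 + ["31 to 40"] * 4,
    "Gender": ["F", "F", "M", "M", "F", "F", "M", "M"],
    "Age": [22, 28, 25, 29, 32, 38, 31, 35],
})

result1 = data.pivot_table(
    values=["Age"],
    index=["Age stratification"],
    columns=["Gender"],
    aggfunc=["count", "mean"],
)
result2 = data.pivot_table(
    values=["Age"],
    index=["Age stratification"],
    columns=["Gender"],
    aggfunc=["std"],
)

# Place the two tables side by side, aligned on the shared row index
merged = pd.concat([result1, result2], axis=1)
print(merged)
```

The merged header has three levels: statistic, value column, and gender.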





5. Structural analysis

■ structural analysis: an analysis method that, after grouping, calculates the proportion each component contributes to the total, and uses those proportions to describe the internal structure of the whole.

  • axis parameter description
    • 0 -> operate down each column (one result per column)
    • 1 -> operate across each row (one result per row)
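A small made-up DataFrame shows the difference between the two axis values. The column and row names here are hypothetical.

```python
import pandas as pd

# Hypothetical table, used only for illustration
df = pd.DataFrame(
    {"x": [1, 2], "y": [10, 20], "z": [100, 200]},
    index=["r1", "r2"],
)

col_totals = df.sum(axis=0)  # one total per column: x, y, z
row_totals = df.sum(axis=1)  # one total per row: r1, r2

print(col_totals)
print(row_totals)
```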

5.1 import data

import numpy
from pandas import read_csv
# Raw string, so the backslashes in the Windows path are not read as escape sequences
data = read_csv(r'F:\Data analysis\Data analysis 3\Chapter 8 data analysis\8\8.5\data.csv')

data

5.2 cross analysis (pivot table)

data_pt = data.pivot_table(
    values=['Monthly consumption (yuan)'],
    index=['Province'],
    columns=['Communication brand'],
    aggfunc=[numpy.sum]
)

data_pt

5.3 cross analysis operation

5.3.1 direct summation

5.3.2 sum by column


Summary: calling sum() with no arguments defaults to axis=0, so direct summation gives the same result as summing by column

5.3.3 sum by row
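The original code for steps 5.3.1 to 5.3.3 is not shown. A plausible sketch, run on a pivot table built from made-up records (the province and brand values are hypothetical), is:

```python
import pandas as pd

# Hypothetical records, used only for illustration
data = pd.DataFrame({
    "Province": ["P1", "P1", "P2", "P2"],
    "Communication brand": ["X", "Y", "X", "Y"],
    "Monthly consumption (yuan)": [100, 300, 200, 400],
})

data_pt = data.pivot_table(
    values=["Monthly consumption (yuan)"],
    index=["Province"],
    columns=["Communication brand"],
    aggfunc=["sum"],
)

direct = data_pt.sum()           # 5.3.1 direct summation
by_column = data_pt.sum(axis=0)  # 5.3.2 one total per brand
by_row = data_pt.sum(axis=1)     # 5.3.3 one total per province

print(direct.equals(by_column))  # the default axis is 0
print(by_column)
print(by_row)
```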

5.3.4 proportion of each communication brand within each province

data_pt.div(data_pt.sum(axis=1),axis=0)

5.3.5 proportion of each province within each communication brand

data_pt.div(data_pt.sum(axis=0),axis=1)
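The two div calls are easy to mix up, so here is a minimal sketch on a small stand-in table (a plain DataFrame with hypothetical provinces as rows and brands as columns, rather than the article's pivot table). Dividing by the row totals along axis=0 makes each row sum to 1. Dividing by the column totals along axis=1 makes each column sum to 1.

```python
import pandas as pd

# Hypothetical stand-in for data_pt: provinces as rows, brands as columns
data_pt = pd.DataFrame(
    {"X": [100, 200], "Y": [300, 400]},
    index=pd.Index(["P1", "P2"], name="Province"),
)

# Share of each brand within a province: divide every row by its row total
brand_share = data_pt.div(data_pt.sum(axis=1), axis=0)

# Share of each province within a brand: divide every column by its column total
province_share = data_pt.div(data_pt.sum(axis=0), axis=1)

print(brand_share)
print(province_share)
```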

6. Correlation

■ correlation analysis: a statistical method for studying the relationship between random variables: whether phenomena depend on one another and, where they do, the direction and strength of that dependence.
■ correlation coefficient: a number that describes the direction and strength of the linear relationship between quantitative variables

  • Correlation analysis function:

    • DataFrame.corr()
    • Series.corr(other)
  • Function Description:

    • Called on a DataFrame, corr computes the correlation between every pair of columns
    • Called on a Series, corr computes only the correlation between that Series and the Series passed in
  • Return value:

    • DataFrame call: returns a DataFrame (the correlation matrix)
    • Series call: returns a single number, the correlation coefficient
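The two call styles can be compared on a small made-up table. The column names a, b and c are hypothetical, and b is deliberately an exact multiple of a.

```python
import pandas as pd

# Hypothetical data, used only for illustration
data = pd.DataFrame({
    "a": [1, 2, 3, 4, 5],
    "b": [2, 4, 6, 8, 10],  # exactly 2 * a, so perfectly correlated with a
    "c": [5, 3, 4, 1, 2],
})

# DataFrame call: a square matrix with one row and column per variable
matrix = data.corr()
print(matrix)

# Series call: a single coefficient for one pair of variables
pair = data["a"].corr(data["c"])
print(pair)
```

The diagonal of the matrix is always 1, because every column is perfectly correlated with itself.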

6.1 import data

from pandas import read_csv
# Raw string, so the backslashes in the Windows path are not read as escape sequences
data = read_csv(r'F:\Data analysis\Data analysis 3\Chapter 8 data analysis\8\8.6\data.csv')
data

6.2 correlation of two sequences

data['Illiteracy rate'].corr(data['population'])
# Correlation is symmetric, so swapping the two Series gives the same value
data['population'].corr(data['Illiteracy rate'])


Because the coefficient (about 0.1) is below 0.3, the two variables are only weakly correlated

6.3 correlation of multiple columns

data.loc[:,['Supermarket shopping rate','Online shopping rate','Illiteracy rate','population']].corr()

Posted on Sun, 12 Jan 2020 06:24:57 -0500 by TKB