The data were analyzed by pandas

background

Understanding programming language first belongs to the world of programming apes. Now, with the improvement of people's education and the development of Internet technology, business personnel also begin to need to understand programming language slowly. From the recruitment demand in recent years, Python has become a new demand.

Most of the business personnel used Excel before. Now, with the increase of the amount of data, Excel can no longer meet the needs of data processing. If there are more than 100000 rows of data in Excel, excel will run very catchy.

Here are some functions commonly used in Excel to see how they are implemented in Python. The pandas library is mainly used by Python to process data, which is also the object introduced in the whole book using Python for data analysis.

The following is the list of programming languages in February 2021:

From the ranking, python is becoming more and more popular

Programming language ranking in February 2021

case

This is just a presentation method. Only 15 rows of data are used

Case data

Import simulation data

import pandas as pd
import numpy as np
data = pd.read_excel('Analog data.xlsx')
data.head()

Import analog number

View data rows and columns

len(data)    #Number of data rows
len(data.columns)    #Number of data columns
data.info()     #Data column details
data.describe()   #Default value, statistical column
data.describe(include='all')   #All columns
data.describe(include='object')   #Character only for columns

View data rows and columns

View data types

data.dtypes

View data types

Data filtering

data[data['Gender']=='male']
data[data['Age']>=30]
data[(data['Age']>=30) & (data['Gender']=='male')]   #Two conditions and
data[(data['Age']>=30) | (data['Gender']=='male')]    #Two conditions or

Data filtering

Modify the data based on filtering

data.loc[data['full name']=='Zhang San','Gender']='female'   #Change Zhang San's gender to female
data

Modify data

Data missing value replacement

data   #There are two missing values in gender and age
int(data['Age'].mean(skipna=True))
#The missing value of age is replaced by the average value
data['Age'].fillna(int(data['Age'].mean(skipna=True)),inplace=True)   
data
data['Gender'].fillna('other',inplace=True)
data

Missing value replacement

add rows

#Method 1
data.loc[15]=[16,'new',55,'female',350,4,50]
data
#Method 2
data_new = pd.DataFrame([[16,'new',55,'female',350,4,50]],columns=data.columns)
pd.concat([data,data_new],ignore_index=True)

add rows

Add column

Adding a column is relatively simple. You can assign a value directly

data['new_column_1']=0
data['new_column_2']='new'

Add column

Delete row

data.loc[15]=[16,'new',55,'female',350,4,50,0,'new']    #Add a test line first
data
data.drop(index=15,inplace=True)    #Delete row
data

Delete row

Delete column

data.drop(columns='new_column_1')    #Returns the new data after deletion, and the original data remains unchanged
data.drop(columns=['new_column_1','new_column_2'])   #Returns the new data after deletion, and the original data remains unchanged
data.drop(columns=['new_column_1','new_column_2'],inplace=True)   #Processing on raw data
data

Delete column

Data De duplication

data 
data[['Gender','Consumption frequency']]
data[['Gender','Consumption frequency']].drop_duplicates(keep='first')  #The first one is reserved, which is generally used in combination with sorting
data[['Gender','Consumption frequency']].drop_duplicates(keep='last') #The last one is reserved, which is generally used in combination with sorting
#Weight removal shall be carried out according to gender and consumption frequency
data.drop_duplicates(subset=['Gender','Consumption frequency'],keep='first')

Data De duplication

Data sorting

It is much more convenient than Excel

data
data.sort_values(by='Consumption amount',ascending=True)
data.sort_values(by='Consumption amount',ascending=False)
data.sort_values(by=['Consumption frequency','Consumption amount'],ascending=[False,True])

Data sorting

data statistics

data
data['Gender'].value_counts()
data['Gender'].value_counts(normalize=True)  #percentage
data.value_counts(subset='Gender')
data.value_counts(subset=['Consumption frequency'],sort=True,ascending=True)

data statistics

PivotTable report

data
pd.pivot_table(data,index=['Gender'],aggfunc='count')
pd.pivot_table(data,index=['Gender'],values=['full name'],aggfunc='count')
pd.pivot_table(data,index=['Gender'],
               columns=['Consumption frequency'],
               values=['full name'],
               aggfunc='count',
               fill_value=0)
pd.pivot_table(data,index=['Gender'],
               columns=['Consumption frequency'],
               values=['full name'],
               aggfunc='sum',
               fill_value=0)
pd.pivot_table(data,index=['Gender'],
               columns=['Consumption frequency'],
               values=['Consumption amount'],
               aggfunc='sum',
               fill_value=0)
pd.pivot_table(data,index=['Gender'],
               columns=['Consumption frequency'],
               values=['Days between last consumption'],
               aggfunc='mean',
               fill_value=0)

PivotTable report

sum function

data
data['Consumption amount'].sum()

sum function

count function

data
data.count()
data['full name'].count()

count function

if function

data
#Method 1
data['Gender_handle']=data['Gender'].map(lambda x:1 if x=='male' else 0)
#Method 2
def gender(x):
    if x=='male':
        return 1
    else:
        return 0
data['Gender_Process 2']=data['Gender'].map(gender)
#Method 3
dict_gender={'male':1,'female':0 ,'other':0}
data['Gender_Process 3']=data['Gender'].map(dict_gender)

Posted on Fri, 26 Nov 2021 02:46:46 -0500 by apacheguy