Five convenient built-in pandas methods for applying functions to data

Before modeling, we always have to walk through the data and reshape it, that is, preprocess it. This post introduces five methods that pandas provides for applying functions across data.

Contents:

  • 0. Data preview

  • 1. apply

  • 2. applymap

  • 3. map

  • 4. agg

  • 5. pipe

0. Data preview

The data here are fictional exam scores in three subjects: language (Chinese), mathematics and English. To follow along, copy the table below to the clipboard before running the code.

import pandas as pd

df = pd.read_clipboard()
df

     full name  language  mathematics  English  Gender  Total score
0  Brother CAI        91           95       92       1          278
1    Xiao Ming        82           93       91       1          266
2      Xiaohua        82           87       94       1          263
3        Grass        96           55       88       0          239
4    Xiao Hong        51           41       70       0          162
5       floret        58           59       40       0          157
6    Bruce Lee        70           55       59       1          184
7         jack        53           44       42       1          139
8  Mei Mei Han        45           51       67       0          163
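
Because pd.read_clipboard depends on whatever is currently on the clipboard (and, by default, splits fields on whitespace, so names containing spaces may not parse cleanly), a reproducible alternative is to build the same table in code. The sketch below is not from the original post; it simply recreates the sample data, with Total score computed as the sum of the three subject scores.

```python
import pandas as pd

# Recreate the sample table without using the clipboard
df = pd.DataFrame({
    'full name': ['Brother CAI', 'Xiao Ming', 'Xiaohua', 'Grass', 'Xiao Hong',
                  'floret', 'Bruce Lee', 'jack', 'Mei Mei Han'],
    'language': [91, 82, 82, 96, 51, 58, 70, 53, 45],
    'mathematics': [95, 93, 87, 55, 41, 59, 55, 44, 51],
    'English': [92, 91, 94, 88, 70, 40, 59, 42, 67],
    'Gender': [1, 1, 1, 0, 0, 0, 1, 1, 0],
})
# Total score is the sum of the three subject scores
df['Total score'] = df[['language', 'mathematics', 'English']].sum(axis=1)
print(df)
```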

1. apply

apply runs a function over a DataFrame along its columns or its rows; by default it works column by column. It can also be called on a single Series, in which case the function receives one value at a time.

For example, in the sample data we want to replace 1 with male and 0 with female in the Gender column.

First, define a function with one parameter s. Because we will apply it to a Series, s is a single value from that column.

def getSex(s):
    if s==1:
        return 'male'
    elif s==0:
        return 'female'

This function could be written more concisely; the most explicit form is used here because it is the easiest to follow.

Then, we can directly use apply to call this function.

df['Gender'].apply(getSex)

You can see the output results as follows:

0    male
1    male
2    male
3    female
4    female
5    female
6    male
7    male
8    female
Name: Gender, dtype: object

Of course, we can also call the anonymous function lambda directly:

df['Gender'].apply( lambda s: 'male' if s==1 else 'female' )

You can see that the results are the same:

0    male
1    male
2    male
3    female
4    female
5    female
6    male
7    male
8    female
Name: Gender, dtype: object

The processing above depends only on the values of a single column. We can also process the data based on a combination of several columns, that is, row by row. In that case, pass axis=1 so that the function receives one row at a time, as in the following example.

Here, a student with a total score of at least 200 and a mathematics score of at least 90 counts as a high scorer.

# Combine conditions on several columns, row by row
df['level'] = df.apply(lambda row: 'High score' if row['Total score']>=200 and row['mathematics']>=90 else 'other', axis=1)
df
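
Row-wise apply calls the Python function once per row. As a point of comparison, and as an assumption of this sketch rather than the post's own method, the same label can be computed with vectorized boolean masks and numpy.where. The small frame below reuses four rows of the sample data.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    'mathematics': [95, 93, 87, 55],
    'Total score': [278, 266, 263, 239],
})

# Row-wise version from the text: the lambda receives one row (a Series) per call
by_row = df.apply(
    lambda row: 'High score'
    if row['Total score'] >= 200 and row['mathematics'] >= 90 else 'other',
    axis=1,
)

# Vectorized alternative: combine boolean masks with & and pick labels with np.where
mask = (df['Total score'] >= 200) & (df['mathematics'] >= 90)
vectorized = pd.Series(np.where(mask, 'High score', 'other'), index=df.index)

print(by_row.tolist())
print(vectorized.tolist())
```

Both versions produce the same labels; the vectorized form avoids a Python-level call per row, which tends to matter on larger frames.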

The functions passed to apply so far were user-defined. We can also pass Python built-ins or pandas/numpy functions.

For example, to find the highest score in each of the three subjects and in the total score:

# python built-in functions
df[['language','mathematics','English','Total score']].apply(max)
language        96
mathematics     95
English         94
Total score    278
dtype: int64

The average score in each of the three subjects and in the total score:

# numpy's own function
import numpy as np

df[['language','mathematics','English','Total score']].apply(np.mean)
language        69.777778
mathematics     64.444444
English         71.444444
Total score    205.666667
dtype: float64
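
The two calls above apply a function to each column, which is the default (axis=0). A minimal sketch of the same call along rows (axis=1), using three rows of the sample data with the student names as a hypothetical index, gives each student's best subject score instead:

```python
import pandas as pd

scores = pd.DataFrame(
    {'language': [91, 82, 96], 'mathematics': [95, 93, 55], 'English': [92, 91, 88]},
    index=['Brother CAI', 'Xiao Ming', 'Grass'],
)

best_per_subject = scores.apply(max)          # axis=0: one result per column
best_per_student = scores.apply(max, axis=1)  # axis=1: one result per row

print(best_per_subject)
print(best_per_student)
```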

2. applymap

applymap applies a function to every element of a DataFrame; the function's argument is the value of a single element.

For example, a score of at least 90 in any of the three subjects counts as a high score in that subject:

df[['language','mathematics','English']].applymap(lambda x:'High score' if x>=90 else 'other')

     language mathematics     English
0  High score  High score  High score
1       other  High score  High score
2       other       other  High score
3  High score       other       other
4       other       other       other
5       other       other       other
6       other       other       other
7       other       other       other
8       other       other       other
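
One caveat for newer installations: in pandas 2.1 and later, DataFrame.applymap is deprecated in favor of DataFrame.map, which does the same element-wise job. The sketch below picks whichever method the installed version offers; the helper name label and the two-row frame are illustrative only.

```python
import pandas as pd

scores = pd.DataFrame({'language': [91, 82], 'mathematics': [95, 87], 'English': [92, 94]})

def label(x):
    # Called once per element; x is a single score
    return 'High score' if x >= 90 else 'other'

# DataFrame.map exists from pandas 2.1 on; fall back to applymap on older versions
elementwise = scores.map if hasattr(scores, 'map') else scores.applymap
labels = elementwise(label)
print(labels)
```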

3. map

map transforms each value of a single column (a Series) according to the mapping you pass in. The mapping can be a dictionary whose keys are the original values and whose values are the replacements. You can also pass a function, such as a string-formatting method.

Taking the earlier example of replacing 1 with male and 0 with female in the Gender column, the same result can be achieved with map:

df['Gender'].map({1:'male', 0:'female'})

The output results are also consistent:

0    male
1    male
2    male
3    female
4    female
5    female
6    male
7    male
8    female
Name: Gender, dtype: object

As another example, to turn the Total score column into formatted strings:

df['Total score'].map('Total score: {} points'.format)
0    Total score: 278 points
1    Total score: 266 points
2    Total score: 263 points
3    Total score: 239 points
4    Total score: 162 points
5    Total score: 157 points
6    Total score: 184 points
7    Total score: 139 points
8    Total score: 163 points
Name: Total score, dtype: object
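
One detail worth knowing when mapping with a dictionary: values that have no matching key come back as NaN rather than being left unchanged. A small sketch, using a made-up extra code 2 that is not in the mapping:

```python
import pandas as pd

gender = pd.Series([1, 0, 2], name='Gender')  # 2 is a made-up code with no dictionary entry

mapped = gender.map({1: 'male', 0: 'female'})
print(mapped)  # the third value is NaN because 2 has no key

# One way to supply a default is to fill the gaps afterwards
with_default = mapped.fillna('unknown')
print(with_default)
```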

4. agg

agg is generally used for aggregation and often appears in grouping or pivot operations. Its usage is close to that of apply.

For example, to find the highest, lowest and average score in each of the three subjects and in the total score:

df[['language','mathematics','English','Total score']].agg(['max','min','mean'])
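
Since agg most often appears after a groupby, here is a hedged sketch of that pattern on the same data: the Total score column is grouped by Gender and summarized with the same three statistics. The grouping itself is an illustration and does not appear in the original post.

```python
import pandas as pd

df = pd.DataFrame({
    'Gender': [1, 1, 1, 0, 0, 0, 1, 1, 0],
    'Total score': [278, 266, 263, 239, 162, 157, 184, 139, 163],
})

# One row per gender, one column per statistic
summary = df.groupby('Gender')['Total score'].agg(['max', 'min', 'mean'])
print(summary)
```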

We can also perform different operations on different columns by passing a dictionary:

# Highest score in language, lowest score in mathematics, and both highest and lowest score in English
df.agg({'language':['max'],'mathematics':'min','English':['max','min']})

Custom functions can be passed to agg as well.
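
As a minimal sketch of that, the hypothetical function score_range below (not part of pandas) returns the gap between the highest and lowest value of a column, and agg calls it once per column:

```python
import pandas as pd

scores = pd.DataFrame({
    'language': [91, 82, 82, 96, 51, 58, 70, 53, 45],
    'mathematics': [95, 93, 87, 55, 41, 59, 55, 44, 51],
    'English': [92, 91, 94, 88, 70, 40, 59, 42, 67],
})

def score_range(col):
    # col is one column (a Series); return a single aggregated value
    return col.max() - col.min()

ranges = scores.agg(score_range)
print(ranges)
```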

5. pipe

In the examples of the four methods above, the function being called receives only the data (a DataFrame, a Series, or their elements) as its argument. What if the function needs additional parameters?

That is where pipe comes in.

pipe, also known as the pipeline method, helps standardize our processing steps and chain them into a flow. It passes any additional arguments through to the called function, which makes user-defined functions easier to extend.

For example, suppose we need the students whose total score is at least n and whose gender is sex, where n and sex are variable parameters. This is awkward to do with apply and the other methods, and pipe handles it neatly.

Let's define a function first:

# Keep students whose total score is greater than or equal to n and whose gender is sex (sex = 2 means do not filter by gender)
def total(df, n, sex):
    dfT = df.copy()
    if sex == 2:
        return dfT[(dfT['Total score']>=n)]
    else:
        return dfT[(dfT['Total score']>=n) & (dfT['Gender']==sex)]

To find students with a total score of at least 200, regardless of gender:

df.pipe(total,200,2)

To find male students (1) with a total score of at least 150:

df.pipe(total,150,1)

To find female students (0) with a total score of at least 200:

df.pipe(total,200,0)
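
Because pipe returns whatever the function returns, several steps can be chained, and the extra parameters can be passed by keyword. The sketch below assumes a second, hypothetical helper top_k defined alongside the total function, and uses five rows of the sample data:

```python
import pandas as pd

df = pd.DataFrame({
    'full name': ['Brother CAI', 'Xiao Ming', 'Grass', 'Xiao Hong', 'Bruce Lee'],
    'Gender': [1, 1, 0, 0, 1],
    'Total score': [278, 266, 239, 162, 184],
})

def total(df, n, sex):
    # Same filter as in the text: sex = 2 means do not filter by gender
    if sex == 2:
        return df[df['Total score'] >= n]
    return df[(df['Total score'] >= n) & (df['Gender'] == sex)]

def top_k(df, k):
    # Hypothetical helper: keep the k rows with the highest total score
    return df.sort_values('Total score', ascending=False).head(k)

# Each pipe call passes the frame as the first argument, then the extra ones
result = df.pipe(total, n=150, sex=1).pipe(top_k, k=2)
print(result)
```

Written this way, the chain reads left to right in the order the steps run, which is the main readability argument for pipe over nested calls such as top_k(total(df, 150, 1), 2).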

That concludes our introduction to five methods for applying functions to data. These techniques make data processing more flexible.

Tags: Python Data Analysis pandas

Posted on Fri, 03 Dec 2021 00:33:23 -0500 by w.geoghegan