Hello! Before modeling, we have to traverse and reshape the data, that is, do data preprocessing. Today I will introduce several commonly used pandas methods that apply a function across the data.
Contents:
- 0. Data preview
- 1. apply
- 2. applymap
- 3. map
- 4. agg
- 5. pipe
0. Data preview
The data here are fictional Chinese, mathematics, and English scores. To follow the demonstration, copy the table below to your clipboard.
import pandas as pd
df = pd.read_clipboard()
df
| | full name | language | mathematics | English | Gender | Total score |
|---|---|---|---|---|---|---|
| 0 | Brother CAI | 91 | 95 | 92 | 1 | 278 |
| 1 | Xiao Ming | 82 | 93 | 91 | 1 | 266 |
| 2 | Xiaohua | 82 | 87 | 94 | 1 | 263 |
| 3 | Grass | 96 | 55 | 88 | 0 | 239 |
| 4 | Xiao Hong | 51 | 41 | 70 | 0 | 162 |
| 5 | floret | 58 | 59 | 40 | 0 | 157 |
| 6 | Bruce Lee | 70 | 55 | 59 | 1 | 184 |
| 7 | jack | 53 | 44 | 42 | 1 | 139 |
| 8 | Mei Mei Han | 45 | 51 | 67 | 0 | 163 |
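Because read_clipboard depends on whatever happens to be on the clipboard, it can help to build the same frame directly in code. The following is a minimal sketch that types in the rows of the preview table; the "Total score" column is computed here as the sum of the three subject columns, which is an assumption about how the original column was produced.

```python
import pandas as pd

# Same rows as the preview table, typed in so the example does not
# depend on the clipboard contents.
df = pd.DataFrame({
    "full name": ["Brother CAI", "Xiao Ming", "Xiaohua", "Grass", "Xiao Hong",
                  "floret", "Bruce Lee", "jack", "Mei Mei Han"],
    "language": [91, 82, 82, 96, 51, 58, 70, 53, 45],
    "mathematics": [95, 93, 87, 55, 41, 59, 55, 44, 51],
    "English": [92, 91, 94, 88, 70, 40, 59, 42, 67],
    "Gender": [1, 1, 1, 0, 0, 0, 1, 1, 0],
})

# Assumption: the total is simply the sum of the three subject scores.
df["Total score"] = df[["language", "mathematics", "English"]].sum(axis=1)
print(df)
```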
1. apply
apply runs a function over a DataFrame column by column or row by row. By default it works on columns; it can also be called on a single Series.
For example, in the sample data we want to replace 1 with male and 0 with female in the Gender column.
First, define a function with one parameter s. When apply is called on a Series, s is a single value from that Series.
def getSex(s):
    if s==1:
        return 'male'
    elif s==0:
        return 'female'
There are more concise ways to write this function; the most intuitive form is used here so that it is easy to follow.
Then, we can directly use apply to call this function.
df['Gender'].apply(getSex)
The output is as follows:
0      male
1      male
2      male
3    female
4    female
5    female
6      male
7      male
8    female
Name: Gender, dtype: object
Of course, we can also call the anonymous function lambda directly:
df['Gender'].apply( lambda s: 'male' if s==1 else 'female' )
The result is the same:
0      male
1      male
2      male
3    female
4    female
5    female
6      male
7      male
8    female
Name: Gender, dtype: object
The processing above depends only on the values of one column. We can also combine conditions on several columns, which amounts to processing the data row by row. In that case the parameter axis=1 must be specified, and the function receives each row as a Series.
In this example, a total score of at least 200 together with a math score of at least 90 counts as a high score.
# Combine conditions on multiple columns
df['level'] = df.apply(
    lambda row: 'High score' if row['Total score']>=200 and row['mathematics']>=90 else 'other',
    axis=1
)
df
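The row-wise call can also be written with a named function instead of a lambda, which may be easier to read once the condition grows. This is a sketch on the first four rows of the sample data; get_level is a hypothetical helper name, not something from pandas.

```python
import pandas as pd

# First four rows of the sample data, only the columns the rule needs.
df = pd.DataFrame({
    "mathematics": [95, 93, 87, 55],
    "Total score": [278, 266, 263, 239],
})

def get_level(row):
    """Hypothetical helper: row is one DataFrame row, passed in as a Series."""
    if row["Total score"] >= 200 and row["mathematics"] >= 90:
        return "High score"
    return "other"

# axis=1 tells apply to hand the function one row at a time.
levels = df.apply(get_level, axis=1)
print(levels.tolist())
```

Only the first two rows satisfy both conditions; the third and fourth have high totals but a math score below 90.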
The functions passed to apply above are all user-defined. We can also pass Python built-in functions or pandas/NumPy functions.
For example, find the highest score in Chinese, mathematics, and English and the highest total score:
# Python built-in function
df[['language','mathematics','English','Total score']].apply(max)
language        96
mathematics     95
English         94
Total score    278
dtype: int64
The average score in Chinese, mathematics, and English and the average total score:
# NumPy function
import numpy as np
df[['language','mathematics','English','Total score']].apply(np.mean)
language        69.777778
mathematics     64.444444
English         71.444444
Total score    205.666667
dtype: float64
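To make the difference between the two directions concrete, here is a small sketch on the first three rows of the subject columns. The variable names col_max and row_sum are only illustrative.

```python
import pandas as pd

# First three rows of the three subject columns.
scores = pd.DataFrame({
    "language": [91, 82, 82],
    "mathematics": [95, 93, 87],
    "English": [92, 91, 94],
})

# Default axis=0: the function receives each column, one result per column.
col_max = scores.apply(lambda col: col.max())

# axis=1: the function receives each row, one result per row.
row_sum = scores.apply(lambda row: row.sum(), axis=1)

print(col_max)
print(row_sum)
```

The row sums reproduce the first three values of the Total score column in the sample data.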
2. applymap
applymap applies a function to every element of a DataFrame; the function's argument is the value of a single element.
For example, a score of at least 90 in any of the three subjects (Chinese, mathematics, English) counts as a high score in that subject:
df[['language','mathematics','English']].applymap(lambda x:'High score' if x>=90 else 'other')
| | language | mathematics | English |
|---|---|---|---|
| 0 | High score | High score | High score |
| 1 | other | High score | High score |
| 2 | other | other | High score |
| 3 | High score | other | other |
| 4 | other | other | other |
| 5 | other | other | other |
| 6 | other | other | other |
| 7 | other | other | other |
| 8 | other | other | other |
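One version note: since pandas 2.1, DataFrame.applymap is deprecated in favor of DataFrame.map, which does the same element-wise job. The sketch below picks whichever method the installed version offers; label is a hypothetical helper name, and the data are the first two rows of the sample.

```python
import pandas as pd

scores = pd.DataFrame({
    "language": [91, 82],
    "mathematics": [95, 93],
    "English": [92, 91],
})

def label(x):
    """Hypothetical helper: classify a single score."""
    return "High score" if x >= 90 else "other"

# DataFrame.map exists from pandas 2.1 on; older versions only have applymap.
if hasattr(pd.DataFrame, "map"):
    labels = scores.map(label)
else:
    labels = scores.applymap(label)

print(labels)
```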
3. map
map works on a single column (a Series): it looks up each value in the mapping you pass in and returns the mapped result. The mapping can be a dictionary, where each key is an original value and each value is its replacement. You can also pass a function, such as a string-formatting method.
Taking the earlier task of replacing 1 with male and 0 with female in the Gender column, the same thing can be done with map:
df['Gender'].map({1:'male', 0:'female'})
The output results are also consistent:
0      male
1      male
2      male
3    female
4    female
5    female
6      male
7      male
8    female
Name: Gender, dtype: object
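One thing to keep in mind with a dictionary: values that have no matching key become NaN rather than staying unchanged. The following sketch illustrates this; the code 2 and the label "unknown" are made up for the example and do not appear in the sample data.

```python
import pandas as pd

# 2 is a made-up code with no entry in the dictionary below.
gender = pd.Series([1, 0, 2], name="Gender")

mapped = gender.map({1: "male", 0: "female"})

# Fill the unmapped entries with an explicit placeholder.
with_default = mapped.fillna("unknown")

print(mapped)
print(with_default)
```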
As another example, to turn the Total score column into formatted strings:
df['Total score'].map('Total score: {} points'.format)
0    Total score: 278 points
1    Total score: 266 points
2    Total score: 263 points
3    Total score: 239 points
4    Total score: 162 points
5    Total score: 157 points
6    Total score: 184 points
7    Total score: 139 points
8    Total score: 163 points
Name: Total score, dtype: object
4. agg
agg is generally used for aggregation and often appears in groupby or pivot operations. Its usage is close to that of apply.
For example, find the highest, lowest, and average score in Chinese, mathematics, and English and in the total score:
df[['language','mathematics','English','Total score']].agg(['max','min','mean'])
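Since agg shows up most often after a groupby, here is a sketch of that combination using the Gender and Total score columns of the sample data. The variable name by_gender is only illustrative.

```python
import pandas as pd

df = pd.DataFrame({
    "Gender": [1, 1, 1, 0, 0, 0, 1, 1, 0],
    "Total score": [278, 266, 263, 239, 162, 157, 184, 139, 163],
})

# One row per gender, one column per aggregation.
by_gender = df.groupby("Gender")["Total score"].agg(["max", "min", "mean"])
print(by_gender)
```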
We can also perform different operations on different columns by passing a dictionary:
# Highest score in Chinese, lowest score in mathematics, highest and lowest score in English
df.agg({'language':['max'],'mathematics':'min','English':['max','min']})
Custom functions are supported as well.
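As a sketch of agg with a custom function, the example below computes the spread between the best and worst score in each subject. score_range is a hypothetical helper written for this illustration.

```python
import pandas as pd

scores = pd.DataFrame({
    "language": [91, 82, 82, 96, 51, 58, 70, 53, 45],
    "mathematics": [95, 93, 87, 55, 41, 59, 55, 44, 51],
    "English": [92, 91, 94, 88, 70, 40, 59, 42, 67],
})

def score_range(s):
    """Hypothetical helper: difference between the highest and lowest score in a column."""
    return s.max() - s.min()

# The function is called once per column and receives that column as a Series.
spread = scores.agg(score_range)
print(spread)
```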
5. pipe
With the four methods above, the called function receives a piece of the DataFrame or Series (a column, a row, or a single element) as its argument. What if the function should work on the whole table and also needs other parameters?
That is where pipe comes in.
pipe, also known as the pipeline method, lets us standardize our processing and analysis into a chain of steps. It passes the whole object to the called function and forwards any additional arguments, which makes user-defined functions easy to extend.
For example, suppose we need the students whose total score is at least n and whose gender is sex. Because n and sex are variable parameters and the function filters the whole table, apply and its relatives are awkward here. This is a good place to use pipe.
Let's define a function first:
# Keep students whose total score is greater than or equal to n and whose gender is sex
# (sex = 2 means do not filter by gender)
def total(df, n, sex):
    dfT = df.copy()
    if sex == 2:
        return dfT[(dfT['Total score']>=n)]
    else:
        return dfT[(dfT['Total score']>=n) & (dfT['Gender']==sex)]
To find students with a total score of at least 200, regardless of gender:
df.pipe(total,200,2)
To find male students (1) with a total score of at least 150:
df.pipe(total,150,1)
To find female students (0) with a total score of at least 200:
df.pipe(total,200,0)
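Because pipe returns whatever the function returns, several steps can be chained, and the extra parameters can be passed by keyword. The sketch below restates the total filter in a slightly different form and adds a second step; top is a hypothetical helper invented for this illustration.

```python
import pandas as pd

df = pd.DataFrame({
    "full name": ["Brother CAI", "Xiao Ming", "Xiaohua", "Grass", "Xiao Hong",
                  "floret", "Bruce Lee", "jack", "Mei Mei Han"],
    "Gender": [1, 1, 1, 0, 0, 0, 1, 1, 0],
    "Total score": [278, 266, 263, 239, 162, 157, 184, 139, 163],
})

def total(df, n, sex):
    """Same rule as above: total score >= n; sex == 2 disables the gender filter."""
    mask = df["Total score"] >= n
    if sex != 2:
        mask = mask & (df["Gender"] == sex)
    return df[mask].copy()

def top(df, k):
    """Hypothetical second step: keep the k rows with the highest total score."""
    return df.nlargest(k, "Total score")

# Each pipe call passes the current DataFrame as the first argument.
result = df.pipe(total, n=150, sex=1).pipe(top, k=2)
print(result)
```

Written this way, each step stays a plain function that can be tested on its own, and the chain reads top to bottom in the order the steps run.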
That concludes our introduction to five methods for applying functions across data. These techniques make data processing more flexible.