Datawhale September team learning - hands on data analysis task2_ Learning records

Data cleaning and feature processing

Usually, the original data is not clean, and there may be outliers, missing values and other problems. Therefore, it is generally necessary to clean the data before data analysis.

Read a file first

#Load the required libraries
import numpy as np
import pandas as pd

#Load data train.csv
df = pd.read_csv('train.csv')

Missing value observation and treatment

Missing values may be caused by human error or machine error. If we leave these gaps untouched, they can seriously distort the results of later analysis or modeling.

The first step is to observe the missing values. The quickest way is df.info(), which reports the number of rows, the name of each column, the number of non-missing values per column, each column's data type, and the memory used by the data.
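
As a minimal sketch of the call: on the Titanic data it is simply df.info(). The example below uses a tiny hand-made frame instead, so the name toy and all of its values are made up for illustration and are not taken from train.csv.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for train.csv
toy = pd.DataFrame({
    "PassengerId": [1, 2, 3, 4],
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Cabin": [None, "C85", None, None],
})

# Prints the row count, each column's non-null count and dtype, and memory usage
toy.info()

# The same non-null counts are also available as a Series through count()
non_null = toy.count()
print(non_null)
```

Any column whose non-null count is smaller than the number of rows contains missing values.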

#Operation results

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 891 entries, 0 to 890
Data columns (total 12 columns):
 #   Column       Non-Null Count  Dtype  
---  ------       --------------  -----  
 0   PassengerId  891 non-null    int64  
 1   Survived     891 non-null    int64  
 2   Pclass       891 non-null    int64  
 3   Name         891 non-null    object 
 4   Sex          891 non-null    object 
 5   Age          714 non-null    float64
 6   SibSp        891 non-null    int64  
 7   Parch        891 non-null    int64  
 8   Ticket       891 non-null    object 
 9   Fare         891 non-null    float64
 10  Cabin        204 non-null    object 
 11  Embarked     889 non-null    object 
dtypes: float64(2), int64(5), object(5)
memory usage: 83.7+ KB

You can see a total of 891 rows, but the non-missing counts of Age, Cabin and Embarked are below 891, which means all three contain missing values. This is not very intuitive, though, because you cannot read off the number of missing values in each column at a glance. Combining isnull() with sum(), that is df.isnull().sum(), gives the number of missing values in each column directly.
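
The isnull() and sum() combination can be sketched on a small example. The toy frame here is an assumption for illustration only; on the real data the same calls are applied to df.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not the Titanic file
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0, np.nan],
    "Cabin": [None, "C85", None, None],
    "Fare": [7.25, 71.28, 7.92, 8.05],
})

# isnull() gives a True/False table; sum() counts the True values per column
missing_per_column = toy.isnull().sum()
print(missing_per_column)

# A second sum() adds the per-column counts into one grand total
total_missing = toy.isnull().sum().sum()
print(total_missing)
```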



#Operation results

PassengerId      0
Survived         0
Pclass           0
Name             0
Sex              0
Age            177
SibSp            0
Parch            0
Ticket           0
Fare             0
Cabin          687
Embarked         2
dtype: int64

If you chain another sum(), that is df.isnull().sum().sum(), you get the total number of missing values directly. For the counts above that total is 177 + 687 + 2 = 866.


isnull() is an element-wise check. It returns an object of the same shape in which each position is True if the element is missing (None or NaN) and False otherwise. pandas treats both None and NaN as missing values.

None is Python's own null object, of type NoneType. It represents "no value" and cannot take part in arithmetic. Checking the Age column for None with an equality test finds nothing:

#The result is empty and no missing value is detected

np.nan is NumPy's NaN, which stands for "not a number". It is a float, it is displayed as NaN in an array, and it can take part in arithmetic, although the result is always NaN. Checking the Age column against np.nan with an equality test also finds nothing:

#The result is still empty, even though the column really does contain missing values!

Note: I was stuck here for a long time until I found the reason on a forum. Under the IEEE 754 floating-point rules, NaN is not equal to anything, including itself, so i == np.nan is always False. To test whether a value is NaN, use np.isnan(i) (or pd.isna(i)) and never i == np.nan.
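
A short sketch of the NaN comparison pitfall, using a made-up three-element Series (ages is a hypothetical name, not a column from the file):

```python
import numpy as np
import pandas as pd

ages = pd.Series([22.0, np.nan, 26.0])

# NaN is never equal to anything, not even to itself
nan_equals_itself = (np.nan == np.nan)

# So an equality filter silently matches nothing
matched_by_eq = ages[ages == np.nan]

# isna() and np.isnan() are the reliable tests
matched_by_isna = ages[ages.isna()]
is_nan_scalar = np.isnan(ages.iloc[1])

# pd.isna also recognises None, which np.isnan cannot handle
none_is_missing = pd.isna(None)

print(nan_equals_itself, len(matched_by_eq), len(matched_by_isna))
```

The equality filter returns an empty Series, while the isna() filter returns the one missing row.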


Incidentally, NumPy uses isnan() to check for NaN, while pandas uses isna() or isnull(). pandas is built on NumPy, and NumPy has neither "NA" nor "null", only NaN ("Not a Number"). pandas adopted NaN as its missing-value marker, which is why isna() and isnull() have different names but do exactly the same thing.

In different application scenarios, the processing of missing values is different, which can be roughly divided into:

  • Delete tuple
  • Data supplement
  • Do not handle

Deleting tuples means removing the records (rows) that have missing attribute values so that the remaining table is complete. You trade some historical data for completeness, which wastes part of the data. If the data set is small and missing values are frequent, this method may not be suitable.

Data supplement (filling, or imputation) is the most commonly used approach, and there are many ways to do it, such as manual filling, filling with a special value, mean filling and the K-nearest-neighbor method. Choosing the most appropriate one requires some understanding of the data.

Not handling missing values is the third option. My understanding is that for some models, such as certain artificial neural networks, it is unclear in advance how missing values affect training, so we can leave them untreated at first and observe the results.

Delete tuple

Deleting tuples is usually done with dropna(), which by default deletes every row that contains a missing value.

#Original data

#Delete missing values

Looking at PassengerId, you can see that some passengers' rows have disappeared.
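
A minimal sketch of dropna() on made-up data (the toy frame and its values are assumptions for illustration). It also shows the subset parameter, which limits the check to chosen columns and therefore throws away fewer rows.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not the Titanic file
toy = pd.DataFrame({
    "PassengerId": [1, 2, 3, 4],
    "Age": [22.0, np.nan, 26.0, 35.0],
    "Cabin": [None, "C85", None, "C123"],
})

# Default: drop every row that has at least one missing value
complete_rows = toy.dropna()

# subset= restricts the check to the listed columns
age_known = toy.dropna(subset=["Age"])

print(complete_rows["PassengerId"].tolist())
print(age_known["PassengerId"].tolist())
```

Only passenger 4 survives the default call, while passengers 1, 3 and 4 remain when only Age is checked.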

Data supplement

Since we can now locate the missing values correctly, simply filling them in is straightforward.

The most direct way is to assign a value to the vacant positions. Here I fill them with 0:

df[df['Age'].isnull()] = 0

With the code above, not only does the missing value become 0, but every value in the row that contains it becomes 0 as well. To avoid this, I tried writing some ugly code of my own:

The first is observation

# Operation results

OK, this is equivalent to a two-dimensional array. I already checked for missing values in the previous section, so here I check only the Age column:

# Operation results
0      False
1      False
2      False
3      False
4      False
886    False
887    False
888     True
889    False
890    False
Name: Age, Length: 891, dtype: bool

Pick out missing values

# Operation results
5     NaN
17    NaN
19    NaN
26    NaN
28    NaN
859   NaN
863   NaN
868   NaN
878   NaN
888   NaN
Name: Age, Length: 177, dtype: float64

Then assign a value

df['Age'][df['Age'].isnull()] = 0

You can see that Age in the sixth row (index 5), which was originally NaN, is now 0, and the other values have not changed.

Of course, there is a problem with my code, because it raises SettingWithCopyWarning: A value is trying to be set on a copy of a slice from a DataFrame

The warning appears because df['Age'][df['Age'].isnull()] is chained indexing: df['Age'] is evaluated first, and the assignment is then applied to that intermediate result. pandas cannot guarantee whether the intermediate result is a view of df or a copy. Reading through it is always fine, but a write may land on a temporary copy and be lost. In my run the intermediate df['Age'] evidently behaved as a view, which is why the assignment still took effect.

Although my code happened to work, the proper fix is still worth remembering. I found many write-ups online. The most common suggestions are to select the rows and the column in a single .loc call, to call copy() first when you really want an independent copy, or to build the new column separately and assign the whole column back.
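
The .loc approach can be sketched as follows. The toy frame is made up for illustration; on the real data the same line would target df.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not the Titanic file
toy = pd.DataFrame({
    "Name": ["A", "B", "C"],
    "Age": [22.0, np.nan, 26.0],
})

# One .loc call selects the rows (boolean mask) and the column together,
# so the write goes straight into toy and no SettingWithCopyWarning is raised
toy.loc[toy["Age"].isnull(), "Age"] = 0
print(toy)
```

Only the missing Age cell changes. The other columns of that row are left alone, unlike with df[df['Age'].isnull()] = 0.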

fillna( )

Besides my clumsy code, there is of course a proper method:

fillna(self, value=None, method=None, axis=None, inplace=False, limit=None, downcast=None, **kwargs)
df.fillna(0).head(6)   # Fill missing data with 0

It can be observed that all missing values are assigned, not just the values of a specific column
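
If only specific columns should be filled, one possible sketch is to call fillna() on a single column, or to pass a dict of per-column fill values. The toy frame and the fill values 0 and "Unknown" are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data, not the Titanic file
toy = pd.DataFrame({
    "Age": [22.0, np.nan, 26.0],
    "Cabin": [None, "C85", None],
})

# Fill only one column by selecting it and assigning the result back
age_only = toy.copy()
age_only["Age"] = age_only["Age"].fillna(0)

# Or pass a dict that maps column names to their own fill values
per_column = toy.fillna({"Age": 0, "Cabin": "Unknown"})
print(per_column)
```

In age_only the Cabin column keeps its missing values, while per_column has none left.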

Mean filling
df.fillna(df.mean(numeric_only=True)).head(6)     # Fill the missing data with the mean of each numeric column

Mean filling is not suitable in every case. For example, the recorded Age values are mostly whole numbers, but after mean filling the column contains a value with many decimal places, which changes the look of the original data. Cabin values are strings, and a mean cannot be computed for strings, so the missing Cabin values are not filled at all and remain missing.

Median fill
df.fillna(df.median(numeric_only=True)).head(6)   # Fill the missing data with the median of each numeric column

Median filling preserves the format of numeric data better, but it still does nothing for string columns.

Adjacent feature fill
df.bfill().head(7)   # Fill each missing value with the next valid value below it (same as the older fillna(method='bfill'))

Here each missing value is filled with the next valid value that follows it in the same column. This also works for string columns, which the previous methods could not fill. A missing value in the last rows stays missing because nothing follows it. How this choice affects later model training is still unclear, so learners need to keep trying different filling methods.

Observation and treatment of duplicate values

df[df.duplicated()]  #View duplicate values
df = df.drop_duplicates()  # Delete duplicate values

Feature observation and processing

Binning (discretization)

The filling operations above show that an operation often fails on individual columns because the features have different types. After careful observation (with help from the reference answer), the features can be roughly divided into two categories:

  • Numerical features: Survived, Pclass, Age, SibSp, Parch, Fare. Among them, Survived and Pclass are discrete numerical features, and Age, SibSp, Parch, Fare are continuous numerical features
  • Text features: Name, Sex, Cabin, Embarked, Ticket, among which Sex, Cabin, Embarked, Ticket are categorical text features

Numerical features can generally be used for model training directly, but continuous variables are sometimes discretized to make the model more stable and robust. Text features usually have to be converted into numerical features before they can be used for modeling and analysis. For example, when training a model on the iris dataset, the species names are replaced with numeric codes such as 0, 1, 2.

Discretizing a continuous variable can make a model more stable and reduce the risk of overfitting. Binning comes in two common forms: equal-width binning and equal-frequency binning.

Equal-width binning

Given a number of bins, cut splits the range of the values into intervals of equal width. Given a list of edges, it uses those edges as they are.

#Divide the continuous variable Age into five age groups (0,5] (5,15] (15,30] (30,50] (50,80] and label them 1 to 5
df['AgeBand'] = pd.cut(df['Age'],[0,5,15,30,50,80],labels = [1,2,3,4,5])

Only the bin edges are considered here, so the number of rows that falls into each bin may differ.

#Divide the continuous variable Age into five equal-width age groups and label them 1 to 5
df['AgeBand'] = pd.cut(df['Age'], 5,labels = [1,2,3,4,5])

Equal-frequency binning

qcut places the bin edges at quantiles of the data. Given a number of bins, each bin holds roughly the same number of values. Given a list of quantiles, each bin holds the share of values between two neighbouring quantiles.

#Divide the continuous variable Age at the 0%, 10%, 30%, 50%, 70% and 90% quantiles and label the five groups 1 to 5 (ages above the 90% quantile fall outside every bin and become NaN)
df['AgeBand'] = pd.qcut(df['Age'],[0,0.1,0.3,0.5,0.7,0.9],labels = [1,2,3,4,5])
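
To make the difference between the two binning styles concrete, here is a small sketch on ten made-up ages (the ages Series is hypothetical, not the Titanic column). Both calls ask for five bins, but cut splits by value range and qcut splits by rank.

```python
import pandas as pd

# Hypothetical ages, deliberately skewed so the two methods disagree
ages = pd.Series([2, 8, 15, 21, 24, 29, 33, 41, 58, 80])

# cut with an integer: 5 bins of equal width across the min-max range
equal_width = pd.cut(ages, 5, labels=[1, 2, 3, 4, 5])

# qcut with an integer: 5 bins holding the same number of values
equal_freq = pd.qcut(ages, 5, labels=[1, 2, 3, 4, 5])

# Number of values that landed in each bin, in bin order
width_counts = equal_width.value_counts().sort_index().tolist()
freq_counts = equal_freq.value_counts().sort_index().tolist()
print(width_counts)
print(freq_counts)
```

With this data the equal-width bins hold 3, 4, 1, 1 and 1 values, while every equal-frequency bin holds exactly 2.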

Text variable conversion
View category text variable name and category
  • value_counts( )

value_counts(normalize=False, sort=True, ascending=False, bins=None, dropna=True)


  1. normalize: Boolean, default False. If True, the result shows relative frequencies (fractions of the total) instead of counts

  2. sort: Boolean, default True. Sort the result by frequency

  3. ascending: Boolean, default False. Sort in ascending order instead of descending order

  4. bins: integer, for example bins=1. Instead of counting each distinct value, group the values into half-open intervals and count each interval; only works with numeric data

  5. dropna: Boolean, default True. Exclude NaN from the counts


# Output results
male      453
female    261
0           1
Name: Sex, dtype: int64
  • unique( )

For a one-dimensional array or list, NumPy's np.unique removes duplicate elements and returns the distinct values sorted by size. The pandas method Series.unique(), used here, also removes duplicates but returns the values in order of first appearance, without sorting.

# Output results
array(['male', 'female', 0], dtype=object)
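
The calls behind the outputs above are of the form df['Sex'].value_counts() and df['Sex'].unique(). A small sketch on a made-up Series (the name sex and its values are assumptions) shows how the main parameters change the result:

```python
import pandas as pd

# Hypothetical column with one missing entry
sex = pd.Series(["male", "female", "male", None, "male", "female"])

counts = sex.value_counts()                  # missing values are skipped
shares = sex.value_counts(normalize=True)    # fractions that sum to 1
with_na = sex.value_counts(dropna=False)     # the missing value gets its own row

distinct = sex.unique()                      # order of first appearance, missing value included
print(counts, shares, with_na, distinct, sep="\n\n")
```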
Category text conversion
  • replace( )
df['Sex_num'] = df['Sex'].replace(['male','female'],[1,2])

  • map( )

At first I did not understand how this method works, because it differs from what I found online. The descriptions I found were about Python's built-in map(), which applies a function to every item of a sequence. The course uses Series.map() instead, which also accepts a dict: each value in the column is looked up as a key and replaced by the corresponding dict value, and values without a matching key become NaN.

df['Sex_num'] = df['Sex'].map({'male': 1, 'female': 2})
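
A minimal sketch of how Series.map() behaves with a dict, on made-up values. The extra "unknown" entry is an assumption added to show what happens to a value that has no key.

```python
import pandas as pd

# Hypothetical column; "unknown" has no entry in the mapping dict
sex = pd.Series(["male", "female", "male", "unknown"])

# Each value is looked up as a key and replaced by the dict's value.
# A value with no matching key becomes NaN.
sex_num = sex.map({"male": 1, "female": 2})
print(sex_num.tolist())

# Series.map also accepts a function, like Python's built-in map()
sex_upper = sex.map(str.upper)
print(sex_upper.tolist())
```

Because of the NaN, the mapped column is stored as float rather than int.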

  • LabelEncoder for sklearn.preprocessing

LabelEncoder is used to encode n categories as integers between 0 and n-1

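from sklearn.preprocessing import LabelEncoder
for feat in ['Cabin', 'Ticket']:
    lbl = LabelEncoder()
    # Option 1: build the category -> integer mapping by hand and apply it with map()
    label_dict = dict(zip(df[feat].unique(), range(df[feat].nunique())))
    df[feat + "_labelEncode"] = df[feat].map(label_dict)
    # Option 2: let LabelEncoder do it (this overwrites the column written by option 1)
    df[feat + "_labelEncode"] = lbl.fit_transform(df[feat].astype(str))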


Reference link: Use of sklearn.preprocessing.LabelEncoder - ColdCode - blog Garden

  • one hot coding

One-hot encoding converts a categorical variable into a form that machine learning algorithms can use easily. It is also known as one-bit effective encoding: N states are encoded with an N-bit state register, each state has its own bit, and exactly one bit is active at any time. A classifier usually outputs a probability for each class, so with one-hot targets it becomes very convenient to compute a loss function (such as cross-entropy loss) or the accuracy.

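# One-hot encoding with pd.get_dummies
for feat in ["Age", "Embarked"]:
    # Build one indicator column per distinct value, named <feat>_<value>
    x = pd.get_dummies(df[feat], prefix=feat)
    # Append the indicator columns to the original table
    df = pd.concat([df, x], axis=1)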
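
To see what get_dummies produces, here is a small sketch with a made-up Embarked column (the toy frame is an assumption for illustration):

```python
import pandas as pd

# Hypothetical toy data, not the Titanic file
toy = pd.DataFrame({"Embarked": ["S", "C", "Q", "S"]})

# One new indicator column per category, named <prefix>_<category>
dummies = pd.get_dummies(toy["Embarked"], prefix="Embarked")
toy = pd.concat([toy, dummies], axis=1)
print(toy)
```

Each row has exactly one active indicator, which is the "only one bit is effective" property described above.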

Feature extraction

The Name feature contains not only the passenger's name but also a title, such as Miss or Mrs, which hints at the passenger's marital status and can serve as a new feature for model training.

df['Title'] = df.Name.str.extract(r'([A-Za-z]+)\.', expand=False)
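
A quick sketch of what the regular expression captures, using three sample names written in the same "Surname, Title. Given names" layout as the Name column:

```python
import pandas as pd

# Sample names in the layout used by the Name column
names = pd.Series([
    "Braund, Mr. Owen Harris",
    "Heikkinen, Miss. Laina",
    "Cumings, Mrs. John Bradley (Florence Briggs Thayer)",
])

# Capture the run of letters that sits immediately before the first period
titles = names.str.extract(r"([A-Za-z]+)\.", expand=False)
print(titles.tolist())
```

The raw string prefix r keeps Python from treating the backslash in \. as a string escape.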

Data reconstruction

Read files first

# Load: result.csv in the data file
text = pd.read_csv('result.csv')

Data aggregation and operation

In this part, we first need to understand the GroupBy mechanism.


  • Function: split the data into groups and run an operation within each group

  • Syntax: df[value_column].groupby([df[key1], df[key2]]).mean(), where value_column is the column to aggregate and key1, key2 are the columns to group by

To be honest, I'm not very good at describing GroupBy, although I think I know what it does. Here's the example code:

df = text['Fare'].groupby(text['Sex'])
# Operation results
<pandas.core.groupby.generic.SeriesGroupBy object at 0x000001EA18771340>

You can see that the result is no longer a Series but a SeriesGroupBy object, because the grouping started from a single column, text['Fare']. As the names suggest, grouping a DataFrame yields a DataFrameGroupBy object, and grouping a Series yields a SeriesGroupBy object.

Next, let's look at the average ticket price for men and women on the Titanic

means = df.mean()
# Operation results
female    44.479818
male      25.523893
Name: Fare, dtype: float64

The output is the average fare for women and for men. The groupby() operation splits the Fare column by Sex, so the fares of each sex form one group, and mean() then computes the average within each group. To get the total fare for each sex instead, just change mean() to sum().

sums = df.sum()

# Operation results
female    13966.6628
male      14727.2865
Name: Fare, dtype: float64
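
The split-then-aggregate idea can be checked by hand on a tiny example. The toy frame and its fares are made up so that the group results are easy to verify; agg() is shown as one way to run several aggregations at once.

```python
import pandas as pd

# Hypothetical toy data, not result.csv
toy = pd.DataFrame({
    "Sex": ["male", "female", "male", "female"],
    "Fare": [10.0, 30.0, 20.0, 50.0],
})

# A SeriesGroupBy object: the groups are defined, but nothing is computed yet
grouped = toy["Fare"].groupby(toy["Sex"])

means = grouped.mean()   # average fare per sex
sums = grouped.sum()     # total fare per sex

# agg() runs several aggregations in one call
summary = grouped.agg(["mean", "sum", "count"])
print(summary)
```

Here the female group averages 40.0 and totals 80.0, and the male group averages 15.0 and totals 30.0.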

Look, that's all there is to it. The statistics that follow become much more convenient.

For example, count the survivors among men and women on the Titanic

survived_sex = text['Survived'].groupby(text['Sex']).sum()

# Operation results
female    233
male      109
Name: Survived, dtype: int64

From this result we can go on to calculate the survival rates of men and women. Preliminary observations like this are necessary because they can reveal patterns in the data.

Another example is the number of survivors in each passenger class

survived_pclass = text['Survived'].groupby(text['Pclass']).sum()

# Operation results
1    136
2     87
3    119
Name: Survived, dtype: int64

From the number of survivors in each passenger class, we might guess that second class suffered the most, because it has the fewest survivors. However, a raw count alone cannot support that conclusion. We also need the total number of passengers in each class before we can judge. In short, these preliminary observations are not meaningless. Modeling comes later, but if we can already pick out important facts from the surface of the data, we may be able to give those features more weight during model training and so improve the accuracy of the predictions.
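
One way to put survivor counts in context is to compute the passenger total and the survival rate alongside them. This is a sketch on made-up data (the toy frame is an assumption). Because Survived is coded 0/1, its group mean is the share of survivors.

```python
import pandas as pd

# Hypothetical toy data, not result.csv
toy = pd.DataFrame({
    "Pclass":   [1, 1, 2, 2, 3, 3, 3, 3],
    "Survived": [1, 1, 1, 0, 1, 0, 0, 0],
})

by_class = toy.groupby("Pclass")["Survived"]

survivors = by_class.sum()      # how many survived in each class
passengers = by_class.count()   # how many passengers each class had
rate = by_class.mean()          # mean of a 0/1 column = share who survived

print(pd.DataFrame({"survivors": survivors, "passengers": passengers, "rate": rate}))
```

In this toy data second and third class have the same number of survivors, yet their survival rates differ by a factor of two, which is exactly what the raw counts hide.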

Data visualization

Import package

import matplotlib.pyplot as plt

Import file

text = pd.read_csv(r'result.csv')

I think visualization is an indispensable part of data analysis, because a good chart lets us read the information in the data much more intuitively, and it is the natural choice when presenting results to other people. Take the number of surviving men and women in the Titanic data set. The two numbers can be read at a glance, but the rough difference between them does not register right away, and I would have to do the arithmetic myself (yes, that's me, too lazy to calculate).

sex = text.groupby('Sex')['Survived'].sum()

Look, the chart makes it obvious at once: the number of surviving women is roughly twice the number of surviving men

Let's look at the proportion of survival and death among men and women


It can be seen intuitively that there were far more men than women on board, and that the death rate among men was much higher than among women. My guess is that most men showed gentlemanly conduct at that moment, which I find moving.
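
A stacked bar chart of this kind can be drawn roughly as follows. This is a sketch under assumptions: the toy frame is made up, the title and axis label are my own choices, and the Agg backend line is only there so the code runs without a display (in a notebook it can be dropped and plt.show() used instead).

```python
import matplotlib
matplotlib.use("Agg")  # assumption: render off-screen, no window needed
import matplotlib.pyplot as plt
import pandas as pd

# Hypothetical toy data, not result.csv
toy = pd.DataFrame({
    "Sex":      ["female", "female", "female", "male", "male", "male"],
    "Survived": [1, 1, 0, 0, 0, 1],
})

# Count passengers per (Sex, Survived) pair, then spread Survived into columns
counts = toy.groupby(["Sex", "Survived"])["Survived"].count().unstack()

# One bar per sex, with the Survived=0 and Survived=1 counts stacked in it
ax = counts.plot(kind="bar", stacked=True)
ax.set_ylabel("count")
ax.set_title("Survival by sex")
fig = ax.get_figure()
```

The full height of each bar is the number of passengers of that sex, and the two segments show how that number splits into deaths and survivors.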

unstack( )

Incidentally, the code uses the unstack() function, which has a counterpart called stack()

  • unstack(): moves the innermost level of the row index into the column index
  • stack(): moves the column index into the innermost level of the row index

My English is weak (I have not passed the level 6 English test), so the names did not help me at first. Stack means to pile things up, and unstack means to undo the piling. In the chart, it means that the Survived values are not spread out as separate positions on the x-axis but are stacked within each Sex bar. The column index here can simply be understood as the column names. This is still a little hard to grasp on first contact, so let's see what the chart looks like without unstack():


We find that the original bar chart has been split, and the x values have changed from [female, male] to [(female, 0), (female, 1), (male, 0), (male, 1)]. That was the view from the chart. After all, this section is about visualization, so it is only natural to use visualization to learn visualization. Next, let's observe the difference between stack and unstack at the data level.

Original data

# Operation results
Sex     Survived
female  0            81
        1           233
male    0           468
        1           109
Name: Survived, dtype: int64

The original data is in stacked form. Sex and Survived are both row index levels, with Sex as the outer level and Survived as the inner level. Next, observe the data after unstack().

Original data + unstack()


It is easy to see that the Survived level, which was part of the row index, has become the column index, and the result as a whole is now a two-dimensional table. I think the principle of unstack() should be clear by now.
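
The round trip between the two forms can be sketched on a small made-up table (the toy frame is an assumption for illustration):

```python
import pandas as pd

# Hypothetical toy data, not result.csv
toy = pd.DataFrame({
    "Sex":      ["female", "female", "female", "male", "male", "male"],
    "Survived": [1, 1, 0, 0, 0, 1],
})

# Series with a two-level row index (Sex, Survived)
stacked = toy.groupby(["Sex", "Survived"])["Survived"].count()

# unstack() moves the innermost row level (Survived) into the columns
wide = stacked.unstack()

# stack() reverses it, moving the column level back into the row index
restacked = wide.stack()
print(stacked, wide, restacked, sep="\n\n")
```

After unstack() the table has one row per sex and one column per Survived value, and stack() brings back the original four counts in the same order.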

Reference link: comparison of stack(), unstack() and pivot() methods of DataFrame in pandas_ S_o_l_o_n blog - CSDN blog

Random demonstration

Survival by passenger class

import seaborn as sns
sns.countplot(x="Pclass", hue="Survived", data=text)

Survival distribution by age

facet = sns.FacetGrid(text, hue="Survived", aspect=3)
facet.map(sns.kdeplot, 'Age', fill=True)   # one filled density curve of Age per Survived value (fill= replaces the older shade=)
facet.set(xlim=(0, text['Age'].max()))

Age distribution by passenger class

text.Age[text.Pclass == 1].plot(kind='kde')
text.Age[text.Pclass == 2].plot(kind='kde')
text.Age[text.Pclass == 3].plot(kind='kde')

The visualization libraries have far too many functions, and exploring them takes time. Good visualization does more than look cool: it can also reveal potential relationships in the data intuitively. If you don't put visualization to work, it won't work for you (ha ha ha).

Well, no kidding. This learning record will keep growing as I learn, so some content may be inserted here and some there, although probably no one will read it. It is purely my way of documenting my learning process. It is poorly written and I have not learned the material very well. The main reason is that my knowledge is not broad enough, so when I meet a new problem I cannot relate it to similar things I have learned before. My language skills are also weak, and there are many things I do not know how to express correctly, so I often just write "my own understanding" and muddle through. (Besides practicing programming, I think I also need to read more literature. Right now I really am a bit of a boor.)

Next, I will keep adding to the visualization content. My data cleaning and feature extraction are only so-so, but my visualization is not even at that level yet, so the first goal is to bring it up to the same level. Personally, I think visualization also helps the learning process itself. Come on, YoungCat!

Tags: Python R Language Machine Learning Data Analysis

Posted on Mon, 20 Sep 2021 10:59:36 -0400 by gerbs987