summary
python is the most popular programming language in the world, and one of its biggest application fields is data analysis. Among many data analysis tools in python, Pandas is a very common data analysis library in python. It is often used in the fields of data analysis, machine learning, deep learning and so on. Using Pandas, we can read data such as Excel/CSV/TXT/MySQL, and then conduct various cleaning, filtering, perspective and aggregation analysis. We can also directly draw data analysis charts such as line chart and pie chart. In terms of function, it can realize automatic processing of large files, realize almost all functions of Excel and be more powerful.
This experiment will introduce the basic use of pandas data analysis library in a practical way, so that we can quickly master the use of python data analysis library pandas in a short time and reserve knowledge for subsequent project coding
Experimental environment
- Python 3.7
- Pycharm
Task 2: Pandas data analysis Practice-1
[mission objective]
The main objective of this task is to use pandas for data analysis and actual combat. During the actual combat, we will take you to understand the following functions of pandas module:
- Understanding data
- Analyze data problems
- Cleaning data
- Integration code
[task steps]
1. Preparatory work
After opening the CMD window, execute the following command to open the Jupiter notebook editor
jupyter notebook
After successfully executing the above command, the system will automatically open the default browser, as shown in the following figure:
After successfully opening the browser, create a notebook file according to the following process
Rename the new notebook
2. After creating a new notebook file, write code in the new notebook
- Understanding data
Before processing any data, our first task is to understand the data and what it is used for. We try to understand the columns / rows, records, data formats, semantic errors, missing entries and wrong formats of data, so that we can roughly understand the "cleaning" work to be done before data analysis.
We need a patient this time_ heart_ The data file of rate.csv is very small and can be seen at a glance. This data is in CSV format. The data describes the heartbeat of different individuals at different times. The column information of the data includes people's age, weight, gender and heart rate at different times.
- When you load data, you view the dataset
import pandas as pd df = pd.read_csv('data/patient_heart_rate.csv') df.head()
The operation results are as follows:
Analyze data problems
- No column headers
- A column has more than one parameter
- The units of column data are not uniform
- Missing value
- Duplicate data
- Non ASCII characters
- Some column headers should be data, not column name parameters
3. Cleaning data
3.1. No column head
If the data we get does not have column headers like the above data, Pandas provides parameters for custom column headers when reading csv. Let's read csv by manually setting column header parameters. The code is as follows:
import pandas as pd column_names= ['id', 'name', 'age', 'weight','m0006', 'm0612','m1218','f0006','f0612','f1218'] df = pd.read_csv('data/patient_heart_rate.csv', names = column_names) df.head()
The operation results are as follows:
The above results show our custom column headers. We just passed an additional parameter names = column when reading csv this time_ Names, which tells Pandas to use the column headers we provide.
4. A column has more than one parameter
It is not difficult to find in the data that the name column contains two parameters, Firtname and Lastname. In order to keep the data clean, we decided to split the name column into Firstname and Lastname
Use str.split(expand=True) to split the list into new columns, and then delete the original Name column
df[['first_name','last_name']] = df['name'].str.split(expand=True) df.drop('name', axis=1, inplace=True) df.head()
The operation results are as follows:
5. The units of column data are not uniform
If you look closely at the dataset, you can find that the units of the Weight column are not uniform. Some units are kgs and some units are lbs
lbs_weight_s = df[df.weight.str.contains("lbs").fillna(False)]['weight'] lbs_weight_s = lbs_weight_s.apply(lambda lbs: "%.2fkgs" % (float(lbs[:-3])/2.2) ) df.loc[lbs_weight_s.index,'weight'] = lbs_weight_s
The operation results are as follows:
6. Missing value processing
In the data set, some age, weight and heart rate are missing. We also encounter the most common problem of data cleaning - data loss. Generally, it is because this information is not collected. We can consult industry experts. Typical methods for handling missing data:
- Delete: delete records with missing data
- Fake: replace with legal initial value, and the value type can be used 0, the empty string '' can be used
- Mean: use the mean of the current column
- High frequency: use the data with the highest frequency in the current column
- Source optimization: if you can communicate with the data collection team, you can jointly troubleshoot problems and find solutions.
7. Duplicate data processing
Sometimes there will be some duplicate data in the dataset. Execute the following code to observe the first 10 data in the dataset
df.head(10)
The operation results are as follows:
Observing the above results, we can find that there are duplicate data in our data set, as follows
First, let's check whether there are duplicate records. If there are duplicate records, use the drop provided by Pandas_ Duplicates() to remove duplicate data.
df.drop_duplicates(['first_name','last_name'],inplace=True) df.head(10)
The operation results are as follows:
Delete duplicate data in weight field
df.drop_duplicates(['weight'],inplace=True) df.head(10)
The operation results are as follows
8. Non ASCII characters
Fristname and Lastname have some non ASCII characters in the dataset.
There are many ways to process non ASCII data
- delete
- replace
- Just a hint
We use the method of deletion:
df['first_name'].replace(, regex=True, inplace=True)df['last_name'].replace(, regex=True, inplace=True)df.head()
The operation results are as follows:
9. Some column headers should be data, not column name parameters
Some column headers are composed of gender and time range. These data may be row column conversion in the process of processing and collection, or the fixed naming rules of the collector. These values should be broken down into gender (m,f), time range of hour units (00-06, 06-12, 12-18)
sorted_columns = ['id','age','weight','first_name','last_name']df = pd.melt(df, id_vars=sorted_columns, var_name='sex_hour', value_name='puls_rate')df = df[df.puls_rate != '-'].dropna()df = df.sort_values(['id','first_name','last_name']).reset_index()def split_sex_date(sex_hour): sex = sex_hour[:1] if 'f' == sex: sex = 'female' elif 'm' == sex: sex = 'male' hour = sex_hour[1:] return pd.Series([sex,hour])df[['sex','hour']] = df.sex_hour.apply(split_sex_date)df.drop('sex_hour',axis=1)
The operation results are as follows:
Task 3: Pandas data analysis practice-2
[mission objective]
The main objective of this task is to use pandas for data analysis and actual combat. During the actual combat, we will take you to understand the following functions of pandas module:
- Date processing
- Character encoding problem
[task steps]
1. Refer to step 1 of [task 1], create a new notebook file in Jupiter notebook editing, named pandas-data-processing-3, as shown in the following figure:
2. Preview data
This time we use Artworks.csv, and we select 100 rows of data to complete this content. Specific steps:
- Import Pandas
- Read csv data to DataFrame (make sure the data has been downloaded to the specified path)
DataFrame is the built-in data display structure of Pandas. The display speed is very fast. Through DataFrame, we can quickly preview and analyze data. The code is as follows:
import pandas as pddf = pd.read_csv('./data/Artworks.csv').head(100)df.head(10)
The operation results are as follows:
2. Statistical date data
Let's take a closer look at the data in the Date column. Some data are the range of years (1976-1977), not a single year. When we use year data to draw, we can't draw it as easily as a single year. Let's use Pandas's value now_ Count () to count the number of each kind of data.
First, select the columns to count and call value_counts():
df['Date'].value_counts()
The operation results are as follows:
3. Date data problem
Date column data, in addition to the year range, there are three abnormal formats. Let's list them below:
- Question 1, time frame (1976-77)
- Question 2, estimation (c 1917 (around 1917)
- Problem 3: missing data (Unknown)
- Question 4: meaningless data (n.d.)
Next, we will deal with each of the above problems and use Pandas to convert these irregular data into unified format data.
Questions 1 and 2 have data, but the format is inappropriate. Questions 3 and 4 are not actually valid data. For the first two problems, we can format the data in the code to achieve the purpose of cleaning. However, for the last two problems, the code can only treat them as missing values. For simplicity, we treat the data of questions 3 and 4 as 0.
Dealing with problem 1
The data in question 1 are all in a two-year time range. We choose one of them as the data after cleaning. For simplicity, we will use the start time to replace the data of this problem, because this time is a four digit number. If we want to use the end year, we have to supplement the first two digits.
First, we need to find the data of problem 1 so that we can update it. Ensure that other data is not updated, because other data may have been formatted or processed below.
The data in the time range we want to process contains "-", so that we can filter the data we want to process through this special string, then divide the data with "-" through split(), and take the first part of the result as the final result of processing.
The code is as follows
row_with_dashes = df['Date'].str.contains('-').fillna(False)for i, dash in df[row_with_dashes].iterrows(): df.at[i,'Date'] = dash['Date'][0:4]df['Date'].value_counts()
The operation results are as follows:
Dealing with problem 2
The data in question 2 reflects the inaccuracy of the data itself. It is an estimated year time. If we convert it into a year, we only need to retain the last four digits. The feature of the data is that the data contains "c", so that we can filter out the data to be converted through this feature.
row_with_cs = df['Date'].str.contains('c').fillna(False)for i,row in df[row_with_cs].iterrows(): df.at[i,'Date'] = row['Date'][-4:]df[row_with_cs]
The operation results are as follows:
Dealing with problems three or four
Assign the data of question 3 and 4 to the initial value of 0
df['Date'] = df['Date'].replace('Unknown','0',regex=True) df['Date'] = df['Date'].replace('n.d.','0',regex=True) df['Date']
The operation results are as follows:
4. Attachment: completion code
Note: the data display part has been deleted from the complete code
import pandas as pd df = pd.read_csv('../data/Artworks.csv').head(100) df.head(10) df['Date'].value_counts() row_with_dashes = df['Date'].str.contains('-').fillna(False) for i, dash in df[row_with_dashes].iterrows(): df.at[i,'Date'] = dash['Date'][0:4] df['Date'].value_counts() row_with_cs = df['Date'].str.contains('c').fillna(False) for i,row in df[row_with_cs].iterrows(): df.at[i,'Date'] = row['Date'][-4:] df['Date'].value_counts() df['Date'] = df['Date'].replace('Unknown','0',regex=True) df['Date'] = df['Date'].replace('n.d.','0',regex=True) df['Date'].value_counts()