Preliminary knowledge: core usage of common Python data analysis libraries

2. Preliminary knowledge: core usage of common Python data analysis libraries (Part 2)

Summary

Python is the most popular programming language in the world, and data analysis is one of its biggest application fields. Among the many data analysis tools in Python, Pandas is the most widely used library; it appears throughout data analysis, machine learning, deep learning and related fields. With Pandas we can read data from Excel, CSV, TXT or MySQL, then clean, filter, pivot and aggregate it, and even draw analysis charts such as line charts and pie charts directly. Functionally, it can process large files automatically and covers almost everything Excel can do while being considerably more powerful.

This experiment introduces the basic use of the Pandas data analysis library in a hands-on way, so that you can master Pandas quickly and build up knowledge for the project coding that follows.

Experimental environment

  • Python 3.7
  • PyCharm

Task 2: Pandas data analysis practice-1

[task objective]

The main objective of this task is hands-on data analysis practice with Pandas. Along the way, we will get to know the following functions of the pandas module:

  • Understanding data
  • Analyze data problems
  • Cleaning data
  • Integration code

[task steps]

1. Preparatory work

Open a CMD window, then execute the following command to start the Jupyter Notebook editor:

jupyter notebook

After successfully executing the above command, the system will automatically open the default browser, as shown in the following figure:

After successfully opening the browser, create a notebook file according to the following process

Rename the new notebook

2. After creating a new notebook file, write code in the new notebook

  • Understanding data

Before processing any data, our first task is to understand the data and what it will be used for. We examine the columns and rows, the records, the data formats, semantic errors, missing entries and malformed values, so that we roughly know what "cleaning" work has to be done before the analysis.

The data file we need this time is patient_heart_rate.csv. It is very small and can be taken in at a glance. The data is in CSV format and describes the heartbeat of different individuals at different times. The columns include people's age, weight, gender and heart rate over different time ranges.

  • Load the data and view the dataset

import pandas as pd

df = pd.read_csv('data/patient_heart_rate.csv')
df.head()

The operation results are as follows:
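Besides head(), a few other built-in inspection calls are useful at this stage; below is a minimal sketch, assuming the same df as above:

# Quick ways to get a feel for a freshly loaded DataFrame
df.info()           # column names, dtypes and non-null counts
df.describe()       # summary statistics for the numeric columns
df.isnull().sum()   # number of missing values per column
df.shape            # (number of rows, number of columns)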

Analyze data problems

  1. No column headers
  2. A column containing more than one parameter
  3. Inconsistent units in column data
  4. Missing values
  5. Duplicate data
  6. Non-ASCII characters
  7. Column headers that should be data, not column names

3. Cleaning data

3.1. No column headers

If the data we get has no column headers, like the data above, Pandas provides a parameter for custom column headers when reading the CSV. Let's read the CSV again, this time setting the column header parameter manually. The code is as follows:

import pandas as pd

column_names = ['id', 'name', 'age', 'weight', 'm0006', 'm0612',
                'm1218', 'f0006', 'f0612', 'f1218']
df = pd.read_csv('data/patient_heart_rate.csv', names=column_names)
df.head()

The operation results are as follows:

The result above shows our custom column headers. All we did was pass one additional parameter, names=column_names, when reading the CSV, which tells Pandas to use the column headers we provide.
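An equivalent alternative (a sketch, not part of the original steps; df_alt is a hypothetical name) is to read the file headerless and attach the column names afterwards:

# Read without a header row, then assign the column names afterwards
df_alt = pd.read_csv('data/patient_heart_rate.csv', header=None)
df_alt.columns = ['id', 'name', 'age', 'weight', 'm0006', 'm0612',
                  'm1218', 'f0006', 'f0612', 'f1218']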

3.2. A column has more than one parameter

It is not difficult to see that the name column contains two parameters, Firstname and Lastname. To keep the data clean, we decide to split the name column into first_name and last_name.

Use str.split(expand=True) to split the column into two new columns, then delete the original name column:

df[['first_name', 'last_name']] = df['name'].str.split(expand=True)
df.drop('name', axis=1, inplace=True)
df.head()

The operation results are as follows:

3.3. The units of column data are not uniform

If you look closely at the dataset, you can see that the units of the weight column are not uniform: some values are in kgs and some are in lbs.

lbs_weight_s = df[df.weight.str.contains("lbs").fillna(False)]['weight']
lbs_weight_s = lbs_weight_s.apply(lambda lbs: "%.2fkgs" % (float(lbs[:-3]) / 2.2))
df.loc[lbs_weight_s.index, 'weight'] = lbs_weight_s

The operation results are as follows:
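The same conversion can also be written without apply(); here is a hedged vectorized sketch of an equivalent, using str.extract on the same column:

# Vectorized variant: pull out the number, convert lbs to kgs, rebuild the string
is_lbs = df['weight'].str.contains('lbs').fillna(False)
kgs = df.loc[is_lbs, 'weight'].str.extract(r'([\d.]+)', expand=False).astype(float) / 2.2
df.loc[is_lbs, 'weight'] = kgs.map('{:.2f}kgs'.format)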

3.4. Missing value processing

In the dataset, some age, weight and heart rate values are missing. Here we run into the most common data cleaning problem: missing data. Usually this is because the information was never collected; when in doubt, we can consult domain experts. Typical methods for handling missing data are listed below (a short sketch follows the list):

  • Delete: drop the records that contain missing data
  • Placeholder: substitute a legal default value, e.g. 0 for numeric fields or the empty string '' for text fields
  • Mean: use the mean of the current column
  • High frequency: use the most frequent value in the current column
  • Source optimization: if you can communicate with the data collection team, troubleshoot the problem together and find a solution
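A minimal sketch of the first four strategies, assuming the age column is numeric (adjust to the real dtypes of your data):

# Each line shows one strategy; in practice you would pick a single one
df_deleted = df.dropna(subset=['age'])                # delete records with missing age
df_zero    = df.fillna({'age': 0})                    # placeholder value 0
df_mean    = df.fillna({'age': df['age'].mean()})     # mean of the column
df_mode    = df.fillna({'age': df['age'].mode()[0]})  # most frequent value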

3.5. Duplicate data processing

Sometimes a dataset contains duplicate data. Execute the following code to look at the first 10 rows of the dataset:

df.head(10)

The operation results are as follows:

Observing the result above, we can see that our dataset contains duplicate records, as shown below.

First, let's check whether there are duplicate records; if there are, we use drop_duplicates(), provided by Pandas, to remove them.
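A quick way to perform that check first (a sketch; duplicated() marks every row whose first_name/last_name pair occurs more than once):

# Inspect the duplicates before deleting anything
dup_mask = df.duplicated(['first_name', 'last_name'], keep=False)
df[dup_mask]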

df.drop_duplicates(['first_name', 'last_name'], inplace=True)
df.head(10)

The operation results are as follows:

Next, delete duplicate data in the weight field:

df.drop_duplicates(['weight'], inplace=True)
df.head(10)

The operation results are as follows:

3.6. Non-ASCII characters

In the dataset, the Firstname and Lastname fields contain some non-ASCII characters.

There are several ways to handle non-ASCII data:

  • Delete it
  • Replace it
  • Simply flag it

Here we use deletion:

# strip everything outside the ASCII range
df['first_name'].replace({r'[^\x00-\x7F]+': ''}, regex=True, inplace=True)
df['last_name'].replace({r'[^\x00-\x7F]+': ''}, regex=True, inplace=True)
df.head()

The operation results are as follows:
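An alternative to the regex-based deletion (a sketch, not from the original steps) is to round-trip the strings through an ASCII encoding and drop whatever cannot be represented:

# encode to ASCII, silently dropping non-ASCII characters, then decode back
df['first_name'] = df['first_name'].str.encode('ascii', 'ignore').str.decode('ascii')
df['last_name'] = df['last_name'].str.encode('ascii', 'ignore').str.decode('ascii')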

3.7. Some column headers should be data, not column names

Some column headers are composed of a gender and a time range. These values may come from row/column transposition during processing and collection, or from the collector's fixed naming rules. They should be broken down into a gender (m, f) and an hour range (00-06, 06-12, 12-18).

sorted_columns = ['id', 'age', 'weight', 'first_name', 'last_name']
df = pd.melt(df, id_vars=sorted_columns, var_name='sex_hour', value_name='puls_rate')
df = df[df.puls_rate != '-'].dropna()
df = df.sort_values(['id', 'first_name', 'last_name']).reset_index()

def split_sex_date(sex_hour):
    sex = sex_hour[:1]
    if 'f' == sex:
        sex = 'female'
    elif 'm' == sex:
        sex = 'male'
    hour = sex_hour[1:]
    return pd.Series([sex, hour])

df[['sex', 'hour']] = df.sex_hour.apply(split_sex_date)
df = df.drop('sex_hour', axis=1)

The operation results are as follows:
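The apply-based split works, but the same result can be had with vectorized string operations; here is a sketch that could replace the split_sex_date block above:

# Vectorized equivalent of split_sex_date
df['sex'] = df['sex_hour'].str[0].map({'f': 'female', 'm': 'male'})
df['hour'] = df['sex_hour'].str[1:]
df = df.drop('sex_hour', axis=1)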

Task 3: Pandas data analysis practice-2

[task objective]

The main objective of this task is hands-on data analysis practice with Pandas. Along the way, we will get to know the following functions of the pandas module:

  • Date processing
  • Character encoding problem

[task steps]

1. Referring to step 1 of [Task 2], create a new notebook file in the Jupyter Notebook editor, named pandas-data-processing-3, as shown in the following figure:

2. Preview data

This time we use Artworks.csv, selecting the first 100 rows of data for this exercise. Specific steps:

  • Import Pandas
  • Read the CSV data into a DataFrame (make sure the data has been downloaded to the specified path)

The DataFrame is Pandas' built-in tabular data structure. It displays very quickly, and through it we can rapidly preview and analyze data. The code is as follows:

import pandas as pd

df = pd.read_csv('./data/Artworks.csv').head(100)
df.head(10)

The operation results are as follows:

3. Count the date data

Let's take a closer look at the data in the Date column. Some entries are a range of years (1976-1977) rather than a single year. When we draw charts from year data, ranges cannot be plotted as easily as single years. Let's now use Pandas' value_counts() to count how often each distinct value occurs.

First, select the column to count and call value_counts():

df['Date'].value_counts()

The operation results are as follows:

4. Date data problems

Besides the year ranges, the Date column contains three more abnormal formats. The full list of problems:

  • Problem 1: year ranges (1976-77)
  • Problem 2: estimates (c. 1917, meaning around 1917)
  • Problem 3: missing data (Unknown)
  • Problem 4: meaningless data (n.d.)

Next we deal with each of these problems in turn, using Pandas to convert the irregular data into a uniform format.

Problems 1 and 2 contain real data in an unsuitable format, so we can clean them by reformatting the values in code. Problems 3 and 4 are not actually valid data; the code can only treat them as missing values, and for simplicity we replace both with 0.

Dealing with problem 1

The problem 1 entries each span a two-year range, and we choose one of the two years as the cleaned value. For simplicity we use the start year, because it is already a four-digit number; if we wanted to use the end year, we would have to reconstruct its first two digits.

First we need to locate the problem 1 rows so that only they are updated. We must make sure no other data is touched, because other data may already be formatted or may be processed in a later step.

The range entries we want to process all contain "-", so we can filter the rows on that special string, then split the value on "-" with split() and take the first part of the result as the cleaned value.

The code is as follows

row_with_dashes = df['Date'].str.contains('-').fillna(False)
for i, dash in df[row_with_dashes].iterrows():
    df.at[i, 'Date'] = dash['Date'][0:4]
df['Date'].value_counts()

The operation results are as follows:
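The loop keeps the first four characters, which for this data is the same as splitting on "-" and keeping the first part; a vectorized sketch of that equivalent:

# Vectorized equivalent: split each range on "-" and keep the start year
mask = df['Date'].str.contains('-').fillna(False)
df.loc[mask, 'Date'] = df.loc[mask, 'Date'].str.split('-').str[0]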

Dealing with problem 2

The problem 2 data reflects uncertainty in the data itself: it is an estimated year. To convert it into a plain year we only need to keep the last four characters. The distinguishing feature of this data is that it contains "c", so we can filter out the rows to convert on that character.

row_with_cs = df['Date'].str.contains('c').fillna(False)
for i, row in df[row_with_cs].iterrows():
    df.at[i, 'Date'] = row['Date'][-4:]
df[row_with_cs]

The operation results are as follows:

Dealing with problems 3 and 4

Assign the problem 3 and problem 4 data the initial value 0:

df['Date'] = df['Date'].replace('Unknown', '0', regex=True)
df['Date'] = df['Date'].replace(r'n\.d\.', '0', regex=True)  # escape the dots: '.' is a regex wildcard
df['Date']

The operation results are as follows:
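After all four fixes, every value should be a four-digit year or 0, still stored as strings. An optional final step (our addition, not part of the original steps) converts the column to integers for plotting or aggregation:

# Optional: turn the cleaned strings into integers; coerce anything unexpected to 0
df['Date'] = pd.to_numeric(df['Date'], errors='coerce').fillna(0).astype(int)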

5. Attachment: complete code

Note: the data display output has been removed from the complete code.

import pandas as pd

df = pd.read_csv('./data/Artworks.csv').head(100)
df.head(10)

df['Date'].value_counts()

row_with_dashes = df['Date'].str.contains('-').fillna(False)
for i, dash in df[row_with_dashes].iterrows():
    df.at[i, 'Date'] = dash['Date'][0:4]
df['Date'].value_counts()

row_with_cs = df['Date'].str.contains('c').fillna(False)
for i, row in df[row_with_cs].iterrows():
    df.at[i, 'Date'] = row['Date'][-4:]
df['Date'].value_counts()

df['Date'] = df['Date'].replace('Unknown', '0', regex=True)
df['Date'] = df['Date'].replace(r'n\.d\.', '0', regex=True)
df['Date'].value_counts()
