[Pandas learning notes 02] Practical data processing operations

Author: Huan Hao

Source: Hang Seng LIGHT cloud community

Pandas is a Python library that provides a large number of functions and methods for processing data quickly and easily. This article introduces practical data processing operations in pandas.

Series of articles:

[Pandas learning notes 01] powerful tool set for analyzing structured data

Overview

Pandas is an open source library built on NumPy. For data processing, it can be understood as an enhanced version of NumPy. Because its core operations run on NumPy arrays rather than plain Python loops, it reads and processes data quickly, and it handles missing data (represented as NaN) in both floating-point and non-floating-point columns. In this article, the basic dataset operations section covers reading and writing CSV and Excel files, the data processing section covers missing values and feature extraction, and the DataFrame operations cover function application and sorting.

Dataset basic operations

  • Read datasets from CSV format files
import pandas as pd
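# Option 1: pd.read_csv is the standard way to load a CSV file
df1 = pd.read_csv("file.csv")
# Option 2: DataFrame.from_csv was removed in pandas 1.0;
# read_csv with index_col=0 and parse_dates=True matches its old defaults
df2 = pd.read_csv("file.csv", index_col=0, parse_dates=True)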
  • Read datasets from Excel files
import pandas as pd
df = pd.read_excel("file.xlsx")
  • Get basic dataset feature information
df.info()
  • Query basic statistics of dataset
print(df.describe())
  • Query the names of all columns
print(df.columns)
  • Use the DataFrame object to write data to a CSV file
# Comma as separator without index
df.to_csv("data.csv", sep=",", index=False)
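
The write and read steps above can be checked with a small round trip. This is a minimal sketch, assuming a hypothetical three-row dataset and an in-memory buffer (io.StringIO) in place of the article's file.csv and data.csv, which are not available here.

```python
import io

import pandas as pd

# Hypothetical sample data standing in for the contents of a real file.
df = pd.DataFrame({"name": ["java", "python", "golang"], "rank": [1, 2, 3]})

# Write the DataFrame as CSV text into an in-memory buffer, without the index.
buffer = io.StringIO()
df.to_csv(buffer, sep=",", index=False)

# Rewind the buffer and read the same text back with read_csv.
buffer.seek(0)
restored = pd.read_csv(buffer)

print(restored.columns.tolist())
print(restored.shape)
```

Because index=False was used when writing, the restored DataFrame gets a fresh default index instead of an extra unnamed column.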

Data set processing

First, define a DataFrame dataset:

import pandas as pd

# Rows are labeled 1, 2, 3; rank values are all integers
df = pd.DataFrame(data = [['java',1],['python',2],['golang',3]],index = [1,2,3],columns = ['name','rank'])
print(df)

Print dataset:

     name rank
1    java    1
2  python    2
3  golang    3

Query data operation

  • Use df.loc[row_label, column_label] to query the data of specific rows and columns
# Query the value at index label 1, column name
df.loc[1,'name']
# Select the rows with index labels 1 and 2, columns name and rank
df.loc[[1,2],['name','rank']]
# Select the name and rank columns of the rows where name equals 'java'
df.loc[df['name']=='java',['name','rank']]
  • Query a whole column with df['column_name'], or a range of rows by position with df[row_start:row_end]
# Select a single column or multiple columns
df['name']
df[['name','rank']]
# All rows from position 0 onward
df[0:]
# Rows at positions 1 and 2 (position 3 is excluded)
df[1:3]
# Last row
df[-1:]
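
The difference between label-based and position-based selection is easy to miss when the index does not start at 0, as in the dataset defined above. The following is a minimal sketch that rebuilds that dataset and compares the two; iloc is not used in the article and is included here only for contrast.

```python
import pandas as pd

df = pd.DataFrame(
    data=[["java", 1], ["python", 2], ["golang", 3]],
    index=[1, 2, 3],
    columns=["name", "rank"],
)

# loc selects by index label: label 1 is the first row here.
first_by_label = df.loc[1, "name"]

# iloc selects by integer position: position 0 is also the first row.
first_by_position = df.iloc[0, 0]

# A boolean mask keeps only the rows where the condition holds.
java_rows = df.loc[df["name"] == "java", ["name", "rank"]]

# Plain slicing on the DataFrame is positional and excludes the end position.
middle = df[1:3]

print(first_by_label, first_by_position)
print(java_rows)
print(middle)
```

With this index, df.loc[0, 'name'] would raise a KeyError because no row carries the label 0.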

Add data operation

  • Add column data to the dataset:
# Insert a column named user_num at position 0, filled with the values of the user_num list
user_num = ['100','89','70']
df.insert(0,'user_num',user_num) 

# Assigning to a new column name adds the column after the last column of df
application = ['Web','AI','server']
df['application'] = application
  • Add row data to the dataset:
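# If df has no row with index label 10, the row is added
# If df already has a row with index label 10, that row is updated
# The list needs one value per column (here: name and rank)
df.loc[10] = ['php',10]

# Append the rows of another DataFrame to df
# (DataFrame.append was removed in pandas 2.0; pd.concat replaces it)
# new_df holds one example row
new_df = pd.DataFrame([['ruby',4]],columns = ['name','rank'])
df = pd.concat([df,new_df],ignore_index = True)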
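
The row and column additions above depend on the lengths matching, so the order of operations matters. Here is a minimal sketch that adds the rows first and the columns afterwards; the extra row and the user_num and application values are hypothetical and chosen only to make the lengths line up.

```python
import pandas as pd

df = pd.DataFrame(
    data=[["java", 1], ["python", 2], ["golang", 3]],
    index=[1, 2, 3],
    columns=["name", "rank"],
)

# Assigning to a label that does not exist yet adds a new row.
df.loc[10] = ["php", 10]

# Hypothetical extra row, held in a second DataFrame with the same columns.
extra = pd.DataFrame([["ruby", 4]], columns=["name", "rank"])

# pd.concat returns a new DataFrame; ignore_index=True renumbers rows from 0.
combined = pd.concat([df, extra], ignore_index=True)

# Add columns once all rows are in place so each list has one value per row.
combined.insert(0, "user_num", [100, 89, 70, 55, 42])
combined["application"] = ["Web", "AI", "server", "Web", "scripting"]

print(combined)
```

Note that ignore_index=True discards the original labels 1, 2, 3 and 10, so later lookups by label must use the new 0-based index.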

Modify data operation

  • Modify column headings
# To rename 'user_num' to 'users' this way, list every column name in order (here df is assumed to have the columns name, rank, user_num); a list of the wrong length raises an error.
df.columns = ['name', 'rank', 'users']
# Rename only the name column. By default rename leaves df unchanged and returns a modified copy; pass inplace = True to modify df directly.
renamed_df = df.rename(columns = {'name':'Name'})
  • Modify value
# Set the value at index label 1, column name, to C
df.loc[1, 'name'] = 'C'
# Replace all values of the row with index label 1 (one value per column)
df.loc[1] = ['java', 1, '1000']
# Set the name and rank columns of the row with index label 1
df.loc[1,['name','rank']] = ['Java', 1]
# Replace missing values (NaN) with 0; fillna returns a new DataFrame
df = df.fillna(0)
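
The modifications above can be combined into one short run. This is a minimal sketch, assuming a hypothetical dataset in which one rank value is missing so that the missing-value replacement has something to act on.

```python
import pandas as pd

# Hypothetical data: the rank of 'python' is missing (NaN).
df = pd.DataFrame(
    {"name": ["java", "python", "golang"], "rank": [1.0, float("nan"), 3.0]},
    index=[1, 2, 3],
)

# rename returns a new DataFrame unless inplace=True is passed.
renamed = df.rename(columns={"name": "Name"})

# Overwrite a single cell, addressed by index label and column name.
df.loc[1, "name"] = "Java"

# fillna replaces NaN values; here the missing rank becomes 0.
filled = df.fillna({"rank": 0})

print(renamed.columns.tolist())
print(filled)
```

Both rename and fillna leave df itself untouched here, so the original NaN is still present in df after the call.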

Delete data operation

  • Delete row data
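# Delete the two rows with index labels 2 and 3; with inplace = False, df is unchanged and the result is returned
new_df = df.drop([2,3],axis = 0,inplace = False)
  • Delete column data
# Return a copy of df without the name column (df itself is unchanged)
new_df = df.drop(['name'],axis = 1,inplace = False)
# Delete the name column from df in place
del df['name']
# Remove the rank column from df in place; pop returns the removed column as a Series
rank_column = df.pop('rank')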
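
The three deletion methods differ in whether they change df itself. The following is a minimal sketch on the article's three-row dataset; the variable names are illustrative.

```python
import pandas as pd

df = pd.DataFrame(
    data=[["java", 1], ["python", 2], ["golang", 3]],
    index=[1, 2, 3],
    columns=["name", "rank"],
)

# drop returns a new DataFrame by default; df itself is unchanged.
without_rows = df.drop([2, 3], axis=0)
without_name = df.drop(["name"], axis=1)

# pop removes the column from df in place and returns it as a Series.
rank_col = df.pop("rank")

print(without_rows)
print(without_name.columns.tolist())
print(df.columns.tolist(), rank_col.tolist())
```

After this runs, df keeps all three rows but only the name column, while the two drop results are separate copies.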

Summary

This article introduced practical operations with the pandas toolset that can help solve basic, everyday data processing problems. Later articles will share more advanced techniques. Stay tuned.

Tags: Data Analysis pandas

Posted on Fri, 26 Nov 2021 10:15:00 -0500 by throx