Pandas basics: data structures and index operations

Pandas Basics

Introduction

Pandas is a library built on NumPy that adds higher-level functionality. NumPy focuses on operations on numerical arrays, while pandas handles numerical data, string data, and other forms of tabular data well.

Contents

1. Data structures (Series, DataFrame; Panel was removed in pandas 0.25)
2. Index operations
3. Data operations
4. Hierarchical indexing
5. Visualization (skipped for now)
6. Example 1
7. Reading and storing external data (csv, txt, json, excel, database, web data)
8. Example 2
9. Data cleaning and sorting
10. Example 3
11. Data grouping and aggregation
12. Example 4
13. Time series

Data structures

# Pandas borrows much of NumPy's coding style, but pandas is designed for tabular or heterogeneous data, while NumPy is better suited to homogeneous numerical arrays.

# Pandas data structures:
# Series and DataFrame

# I. Series
# A Series is a one-dimensional array-like object that contains a sequence of values (of types similar to NumPy's) and associated data labels, called its index.
# In fact, the values attribute of a purely numeric Series is a NumPy array.

# The simplest Series can be built from just one array:
import pandas as pd
import numpy as np
obj = pd.Series([4,7,-5,3])
obj
# print(obj)  # Prints the same representation as evaluating obj
0    4
1    7
2   -5
3    3
dtype: int64
# In an interactive environment, the string representation of a Series shows the index on the left and the values on the right.

# 1) The default index starts from 0. You can obtain the index and the values of a Series through its index and values attributes, respectively:
obj = pd.Series([4,7,-5,3])
obj.index # Similar to range(4)
RangeIndex(start=0, stop=4, step=1)
obj.values
array([ 4,  7, -5,  3], dtype=int64)
# 2) You can customize the index to replace the default index:
obj2=pd.Series([4,7,-5,3],index=['d','b','a','c'])
# obj2.values
# obj2.index
obj2['a']
-5
# You can change the index in place by assigning a new list of labels to the index attribute of the Series:
obj2.index=['m','b','f','d']
obj2
m    4
b    7
f   -5
d    3
dtype: int64
# 3) When you apply NumPy functions or NumPy-style operations to a Series (filtering with a Boolean array, multiplying by a scalar, or applying mathematical functions),
# the link between index and value is preserved.
obj2[obj2>0]  # Use a Boolean Series as the indexer
m    4
b    7
d    3
dtype: int64
obj2*2
m     8
b    14
f   -10
d     6
dtype: int64
np.exp(obj2)  # Compute e raised to the power of each element
m      54.598150
b    1096.633158
f       0.006738
d      20.085537
dtype: float64
# 4) You can also think of a Series as a fixed-length, ordered dictionary, so it can be used in many contexts where a dictionary might be used.
print('Yes') if 'b' in obj2 else print('No')
print('Yes') if 'e' in obj2 else print('No') 
Yes
No
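The dictionary analogy can be taken a little further. The following is a minimal sketch (the Series `s` and the default value `0` are made up for illustration): membership tests and `get` operate on index labels, and `to_dict` converts back to a plain dictionary.

```python
import pandas as pd

s = pd.Series([4, 7, -5, 3], index=["m", "b", "f", "d"])

# Membership tests look at the index labels, not at the values.
has_b = "b" in s
has_7 = 7 in s   # 7 is a value, not a label

# get() returns a default instead of raising KeyError for a missing label.
missing = s.get("e", 0)

# to_dict() converts the Series back to a plain dictionary (label -> value).
as_dict = s.to_dict()
print(has_b, has_7, missing, as_dict)
```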
# 5) If you have stored the data in a Python dictionary, you can also use the dictionary to generate a Series object:
sdata={'Ohio':35000,'texas':71000,'Oregon':16000,'Utah':1000}
obj3=pd.Series(sdata)  # The dictionary keys become the index, in the dictionary's key order.
obj3
Ohio      35000
texas     71000
Oregon    16000
Utah       1000
dtype: int64
# You can also pass an explicit index. Labels in the new index that are not keys of the dictionary get the missing value NaN,
# and dictionary keys that do not appear in the new index are excluded.
obj4=pd.Series(sdata,index=["Tom",'Ohio','Ivan','Utah'])
obj4
Tom         NaN
Ohio    35000.0
Ivan        NaN
Utah     1000.0
dtype: float64
# 6) Detecting missing data:
# pd.isnull(series_obj) and pd.notnull(series_obj) are equivalent to series_obj.isnull() and series_obj.notnull(); each returns a Boolean Series.
# They are available both as top-level pandas functions and as methods of a Series instance.
# How to handle missing data is discussed in later chapters.

# pd.isnull(obj4)
obj4.isnull()
Tom      True
Ohio    False
Ivan     True
Utah    False
dtype: bool
# obj4.notnull()
pd.notnull(obj4)
Tom     False
Ohio     True
Ivan    False
Utah     True
dtype: bool
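Because the result of these checks is a Boolean Series, it can be summed or used as a filter. The sketch below rebuilds `obj4` from the example above; `isna` is the alias of `isnull`, and `dropna` and `fillna` are shown only as a brief preview of missing-data handling, not as the method this tutorial goes on to use.

```python
import pandas as pd

sdata = {"Ohio": 35000, "texas": 71000, "Oregon": 16000, "Utah": 1000}
obj4 = pd.Series(sdata, index=["Tom", "Ohio", "Ivan", "Utah"])

# True counts as 1, so summing the mask counts the missing entries.
n_missing = int(obj4.isna().sum())

# Keep only the labels that actually have a value.
present = obj4.dropna()

# Replace NaN with a default value.
filled = obj4.fillna(0)
print(n_missing, list(present.index), filled["Tom"])
```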
# 7) Automatic index alignment for Series:
# obj3 and obj4 share some index labels, so arithmetic between the two Series aligns on those labels:
# This automatic alignment is similar to a join operation in a database.
obj3+obj4
Ivan          NaN
Ohio      70000.0
Oregon        NaN
Tom           NaN
Utah       2000.0
texas         NaN
dtype: float64
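If a label that exists in only one of the two Series should not turn into NaN, the `add` method with `fill_value` is one option. This is a sketch under that assumption, reusing the data from the example above; note that a label missing on both sides (or NaN on one side and absent on the other) still yields NaN.

```python
import pandas as pd

sdata = {"Ohio": 35000, "texas": 71000, "Oregon": 16000, "Utah": 1000}
obj3 = pd.Series(sdata)
obj4 = pd.Series(sdata, index=["Tom", "Ohio", "Ivan", "Utah"])

plain = obj3 + obj4                  # labels present on one side only become NaN
total = obj3.add(obj4, fill_value=0) # a value missing on one side is treated as 0
print(total)
```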
# 8) The attributes series_obj.name and series_obj.index.name:
# You can set these two attributes (None by default); they integrate with other key pandas functionality.
print(obj4.name)
print(obj4.index.name)
obj4.name="population"
obj4.index.name="state"
obj4
None
None
state
Tom         NaN
Ohio    35000.0
Ivan        NaN
Utah     1000.0
Name: population, dtype: float64
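One place where the two names are picked up is `reset_index`, which turns a Series into a DataFrame. A minimal sketch (the two-state Series here is a shortened stand-in for `obj4`):

```python
import pandas as pd

obj4 = pd.Series({"Ohio": 35000.0, "Utah": 1000.0}, name="population")
obj4.index.name = "state"

# reset_index() moves the index into a regular column; the index name and
# the Series name become the two column labels.
table = obj4.reset_index()
print(table.columns.tolist())
```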
# 9) Change the values of a Series by index label:
# obj = pd.Series([4,7,-5,3],index=['Bob','Steve','Jeff','Ryan'])
obj = pd.Series([4,7,-5,3])
obj.index=['Bob1','Steve1','Jeff1','Ryan1']
obj['Steve1']=10
obj
Bob1       4
Steve1    10
Jeff1     -5
Ryan1      3
dtype: int64

# II. DataFrame
# A DataFrame represents a rectangular table of data that contains an ordered collection of columns, each of which can hold a different value type (numeric, string, Boolean, etc.).
# It has both a row index and a column index.
# It can be thought of as a dictionary of Series that all share the same index.
# Internally, the data is stored as one or more two-dimensional blocks, rather than as a list, dictionary, or other collection of one-dimensional arrays.

# Note: although a DataFrame is two-dimensional, hierarchical indexing can be used to represent higher-dimensional data in it. Hierarchical indexing is a more advanced data-handling feature of pandas.

# A DataFrame can be created from, or converted from, other data types.

# 1) Create a DataFrame:
# Pass pd.DataFrame() a dictionary of equal-length lists, arrays, or tuples to create a DataFrame object.
# The dictionary keys become the column index of the DataFrame.


# Valid inputs for the DataFrame constructor (the numbers refer to the examples below):

# 2-D ndarray: a matrix of data; row and column labels are optional arguments. See 1-10)
# Dictionary of arrays, lists, tuples, or sequences: each sequence becomes a column of the DataFrame; all sequences must have the same length. See 1-1) to 1-5)
# NumPy structured / record array: treated the same as a dictionary of arrays.
# Dictionary of Series: each Series becomes a column; the indexes of the Series are merged into the row index unless an index is passed explicitly. See 1-4)
# Dictionary of dictionaries (nested dictionary): each inner dictionary becomes a column, and the inner keys are combined to form the row index. See 1-6)
# List of dictionaries or Series: each element becomes a row of the DataFrame, and the dictionary keys or Series index labels become the column labels. See 1-8), 1-9)
# List of lists or tuples: treated the same as a 2-D ndarray. See 1-7)
# Another DataFrame: the index of the original DataFrame is used unless a different one is passed explicitly.
# NumPy MaskedArray: like a 2-D ndarray, but masked values become NA / missing values in the resulting DataFrame.



# 1-1) pd.DataFrame(dict_ndarrayvalue)
from pandas import Series,DataFrame
import pandas as pd
import numpy as np

data={
    'name':np.array(['Zhang San','Li Si','Wang Wu','Xiao Ming']),
    'sex':np.array(['female','female','male','male']),
    'year':np.array([2001,2001,2003,2002]),
    'city':np.array(['Beijing','Shanghai','Guangzhou','Shenzhen'])
}
df=DataFrame(data)
df
        name     sex  year       city
0  Zhang San  female  2001    Beijing
1      Li Si  female  2001   Shanghai
2    Wang Wu    male  2003  Guangzhou
3  Xiao Ming    male  2002   Shenzhen
# 1-2)pd.DataFrame(dict_listvalue)
data1={
    'name':['Zhang San','Li Si','Wang Wu','Xiao Ming'],
    'sex':['female','female','male','male'],
    'year':[2001,2001,2003,2002],
    'city':['Beijing','Shanghai','Guangzhou','Shenzhen']
}
df1=DataFrame(data1)
df1
        name     sex  year       city
0  Zhang San  female  2001    Beijing
1      Li Si  female  2001   Shanghai
2    Wang Wu    male  2003  Guangzhou
3  Xiao Ming    male  2002   Shenzhen
# 1-3)pd.DataFrame(dict_tuplevalue)
data1={
    'name':('Zhang San','Li Si','Wang Wu','Xiao Ming'),
    'sex':('female','female','male','male'),
    'year':(2001,2001,2003,2002),
    'city':('Beijing','Shanghai','Guangzhou','Shenzhen')
}
df1=DataFrame(data1)
df1
        name     sex  year       city
0  Zhang San  female  2001    Beijing
1      Li Si  female  2001   Shanghai
2    Wang Wu    male  2003  Guangzhou
3  Xiao Ming    male  2002   Shenzhen
# 1-4)pd.DataFrame(dict_Seriesvalue)
data2={
    'name':pd.Series(['Zhang San','Li Si','Wang Wu','Xiao Ming']),
    'sex':pd.Series(['female','female','male','male']),
    'year':pd.Series([2001,2001,2003,2002]),
    'city':pd.Series(['Beijing','Shanghai','Guangzhou','Shenzhen'])
}
df02=DataFrame(data2)
df02
        name     sex  year       city
0  Zhang San  female  2001    Beijing
1      Li Si  female  2001   Shanghai
2    Wang Wu    male  2003  Guangzhou
3  Xiao Ming    male  2002   Shenzhen
# 1-5) pd.DataFrame(dict_mixvalue)
data3={
    'name':np.array(['Zhang San','Li Si','Wang Wu','Xiao Ming']),
    'sex':('female','female','male','male'),
    'year':[2001,2001,2003,2002],
    'city':pd.Series(['Beijing','Shanghai','Guangzhou','Shenzhen'])
}
df03=DataFrame(data3)
df03
        name     sex  year       city
0  Zhang San  female  2001    Beijing
1      Li Si  female  2001   Shanghai
2    Wang Wu    male  2003  Guangzhou
3  Xiao Ming    male  2002   Shenzhen
# 1-6) You can also create a DataFrame from a nested dictionary: the keys of the outer dictionary become the column index, and the keys of the inner dictionaries become the row index
data2={
    'sex':{'Zhang San':'female','Li Si':'female','Wang Wu':'male'},
    'city':{'Zhang San':'Beijing','Li Si':'Shanghai','Wang Wu':'Guangzhou'}
}
# df02=DataFrame(data2,index=['Boss','Li Si','Wang Wu'])   # You can also explicitly specify the row / column labels and their order this way; labels with no matching data are filled with NaN.
df02=DataFrame(data2,index=['Boss','Li Si','Wang Wu'],columns=['sex','city','newcolumn'])
df02
            sex       city  newcolumn
Boss        NaN        NaN        NaN
Li Si    female   Shanghai        NaN
Wang Wu    male  Guangzhou        NaN
# 1-7) Create a DataFrame from a nested list of lists or tuples. Each inner list / tuple is one row of data, and the column labels are passed to DataFrame()
# data = [['Alex',10],['Bob',12],['Clarke',13]]
data = [('Alex',10),('Bob',12),('Clarke',13)]
df = pd.DataFrame(data,columns=['Name','Age'])
df['Age']=df['Age'].astype(float)  # Convert only the numeric column; dtype=float in the constructor applies to every column, including the strings
df
     Name   Age
0    Alex  10.0
1     Bob  12.0
2  Clarke  13.0
# 1-8) Create a DataFrame from a list of dictionaries. Each dictionary holds one row of data, with column labels as keys and cell values as values
data = [{'name':'Alex','age':10},{'name':'Bob','age':12},{'name':'Clarke','age':13}]
df = pd.DataFrame(data)
df['age']=df['age'].astype(float)  # Convert only the numeric column to float
df
     name   age
0    Alex  10.0
1     Bob  12.0
2  Clarke  13.0
# 1-9) Create a DataFrame from a list of Series objects. Each Series holds one row of data, with column labels as its index and cell values as its values
data = [pd.Series(['Alex',10],index=['Name','Age']),pd.Series(['Bob',12],index=['Name','Age']),pd.Series(['Clarke',13],index=['Name','Age'])]
df = pd.DataFrame(data)
df['Age']=df['Age'].astype(float)  # Convert only the numeric column to float
df
     Name   Age
0    Alex  10.0
1     Bob  12.0
2  Clarke  13.0
# 1-10) Create a DataFrame from a 2-D ndarray:
arr2d=np.array([['Alex',10],['Bob',12],['Clarke',13]])
df = pd.DataFrame(arr2d,columns=['name','age'])  # NumPy converts the mixed input to an all-string array, so dtype=float cannot be specified here; it raises an error
df
     name  age
0    Alex   10
1     Bob   12
2  Clarke   13
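Since the mixed 2-D array ends up holding only strings, the numeric column has to be converted after the DataFrame is built. A minimal sketch of one way to do that, using `astype` on the single column (choosing `int` as the target type is an assumption made for this illustration):

```python
import numpy as np
import pandas as pd

arr2d = np.array([["Alex", 10], ["Bob", 12], ["Clarke", 13]])
kind = arr2d.dtype.kind   # "U": NumPy stored every element as a string

df = pd.DataFrame(arr2d, columns=["name", "age"])

# Parse the numeric strings of one column back into integers.
df["age"] = df["age"].astype(int)
ages = df["age"].tolist()
print(kind, ages)
```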
# You can inspect the column labels and row labels through the columns and index attributes:
df02.columns
Index(['sex', 'city', 'newcolumn'], dtype='object')
df02.index
Index(['Boss', 'Li Si', 'Wang Wu'], dtype='object')
# 2) Specify the list of column labels of a DataFrame, which also sets the column order:
# The DataFrames above keep the columns in the order in which the dictionary was written. Dictionaries are guaranteed to preserve insertion order only since Python 3.7, so to be safe you can specify (or change) the column order explicitly when creating the DataFrame:
df02=DataFrame(data1,columns=['name','year','sex','city'])  # data1 is the dictionary from 1-3); data was reassigned to a list in the later examples
df02
        name  year     sex       city
0  Zhang San  2001  female    Beijing
1      Li Si  2001  female   Shanghai
2    Wang Wu  2003    male  Guangzhou
3  Xiao Ming  2002    male   Shenzhen
# 3) Specify the list of row labels (if no row index is specified, the default row index runs from 0 to N-1, where N is the data length)
df03=DataFrame(data1,columns=['name','year','sex','city'],index=['a','b','c','d'])  # data1 is the dictionary from 1-3)
df03
        name  year     sex       city
a  Zhang San  2001  female    Beijing
b      Li Si  2001  female   Shanghai
c    Wang Wu  2003    male  Guangzhou
d  Xiao Ming  2002    male   Shenzhen
# When a specified column label does not exist in the data dictionary, that column is filled with the missing value NaN:
# If a column of the dictionary is left out of the list, the resulting DataFrame does not contain that column.
# Note that the row index must match the data length exactly; specifying more or fewer row labels raises an error.
df03=DataFrame(data1,columns=['name','year','sex','city','job'],index=['a','b','c','d'])  # data1 is the dictionary from 1-3)
df03
        name  year     sex       city  job
a  Zhang San  2001  female    Beijing  NaN
b      Li Si  2001  female   Shanghai  NaN
c    Wang Wu  2003    male  Guangzhou  NaN
d  Xiao Ming  2002    male   Shenzhen  NaN
# 4) Index the rows and columns of a DataFrame:

# 4-1) Indexing a column of a DataFrame returns a Series:
# Both dataframe_obj['column'] and dataframe_obj.column return a Series object.
# The returned Series has the same index as the DataFrame, and its name attribute is set to the column label:
df04=DataFrame(data1,columns=['name','year','sex','city','job'],index=['Zhang','Plum','king','Horse'])
df04['sex']
Zhang    female
Plum     female
king       male
Horse      male
Name: sex, dtype: object
sex=df04.sex
sex
Zhang    female
Plum     female
king       male
Horse      male
Name: sex, dtype: object
# 4-2) Indexing a row of a DataFrame also returns a Series:
# Note the syntax: dataframe_obj.loc[row_label]  # loc is a special indexing attribute of DataFrame
df04.loc['Plum']
name       Li Si
year        2001
sex       female
city    Shanghai
job          NaN
Name: Plum, dtype: object
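For comparison, `loc` selects by label while `iloc` selects by integer position, and `loc` also accepts lists of row and column labels. A small sketch with a cut-down version of the table above (the frame `df` is rebuilt here so the block runs on its own):

```python
import pandas as pd

df = pd.DataFrame(
    {
        "name": ["Zhang San", "Li Si", "Wang Wu", "Xiao Ming"],
        "city": ["Beijing", "Shanghai", "Guangzhou", "Shenzhen"],
    },
    index=["Zhang", "Plum", "king", "Horse"],
)

by_label = df.loc["Plum"]   # select the row whose label is 'Plum'
by_pos = df.iloc[1]         # select the second row by position: the same row

# A list of row labels plus a list of column labels returns a smaller DataFrame.
subset = df.loc[["Zhang", "Horse"], ["city"]]
print(by_label.equals(by_pos), subset.shape)
```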
# 4-3) Modify, assign, or create a column through its column label:
df04['job']='Data Analysis' # Assign a single value to the whole column
df04
            name  year     sex       city            job
Zhang  Zhang San  2001  female    Beijing  Data Analysis
Plum       Li Si  2001  female   Shanghai  Data Analysis
king     Wang Wu  2003    male  Guangzhou  Data Analysis
Horse  Xiao Ming  2002    male   Shenzhen  Data Analysis
df04['job']=np.arange(4.)  # Assign an ndarray of the same length as the column
df04
            name  year     sex       city  job
Zhang  Zhang San  2001  female    Beijing  0.0
Plum       Li Si  2001  female   Shanghai  1.0
king     Wang Wu  2003    male  Guangzhou  2.0
Horse  Xiao Ming  2002    male   Shenzhen  3.0
df04['job']=['Data Analysis']*4  # Assign a list of the same length as the column
df04
            name  year     sex       city            job
Zhang  Zhang San  2001  female    Beijing  Data Analysis
Plum       Li Si  2001  female   Shanghai  Data Analysis
king     Wang Wu  2003    male  Guangzhou  Data Analysis
Horse  Xiao Ming  2002    male   Shenzhen  Data Analysis
val=pd.Series(['Data Analysis dev','Data Analysis im','Data Analysis ba'],index=['Horse','Plum','Zhang'])
df04['job']=val  # Assign a Series that may be shorter than the column: values are aligned on the DataFrame's index, and labels without a value become NaN
df04
            name  year     sex       city                job
Zhang  Zhang San  2001  female    Beijing   Data Analysis ba
Plum       Li Si  2001  female   Shanghai   Data Analysis im
king     Wang Wu  2003    male  Guangzhou                NaN
Horse  Xiao Ming  2002    male   Shenzhen  Data Analysis dev
val=pd.Series(['band 6A','band 8B','band 7B'],index=['Horse','Plum','king'])
df04['band']=val  # If the column label does not exist, a new column is created (whether the assigned content is a list, a Series, or a single value)
df04
            name  year     sex       city                job     band
Zhang  Zhang San  2001  female    Beijing   Data Analysis ba      NaN
Plum       Li Si  2001  female   Shanghai   Data Analysis im  band 8B
king     Wang Wu  2003    male  Guangzhou                NaN  band 7B
Horse  Xiao Ming  2002    male   Shenzhen  Data Analysis dev  band 6A
# Note: a column selected from a DataFrame may be a view of the underlying data rather than a copy, so in-place changes to that Series can show up in the DataFrame (this depends on the pandas version and its copy-on-write setting). Use copy() when you need an independent Series.
# 5) Delete a column of a DataFrame with the del keyword
# This works like deleting a key from a dictionary

data={
    'name':np.array(['Zhang San','Li Si','Wang Wu','Xiao Ming']),
    'sex':np.array(['female','female','male','male']),
    'year':np.array([2001,2001,2003,2002]),
    'city':np.array(['Beijing','Shanghai','Guangzhou','Shenzhen'])
}
df04=DataFrame(data)
# Step 1: add a column first, then delete it:
df04['new']=df04.sex=='female'    # Note: the attribute syntax df04.new cannot be used to create a new column or to delete one
df04
        name     sex  year       city    new
0  Zhang San  female  2001    Beijing   True
1      Li Si  female  2001   Shanghai   True
2    Wang Wu    male  2003  Guangzhou  False
3  Xiao Ming    male  2002   Shenzhen  False
del df04['new']
df04.columns
Index(['name', 'sex', 'year', 'city'], dtype='object')
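Besides `del`, pandas offers two other ways to remove a column, sketched here on a small made-up frame: `drop(columns=...)` returns a new DataFrame and leaves the original untouched, while `pop` removes the column in place and hands it back, much like `dict.pop`.

```python
import pandas as pd

df = pd.DataFrame(
    {"name": ["Zhang San", "Li Si", "Wang Wu"], "sex": ["female", "female", "male"]}
)
df["new"] = df["sex"] == "female"

# drop() returns a new DataFrame without the column; df itself is unchanged.
trimmed = df.drop(columns=["new"])
still_there = "new" in df.columns

# pop() removes the column from df in place and returns it as a Series.
popped = df.pop("new")
print(trimmed.columns.tolist(), df.columns.tolist())
```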
# 6) Copying DataFrame data:
# Use the copy() method of the Series to make an explicit copy:
series_copy=df04['sex'].copy()
series_copy
0    female
1    female
2      male
3      male
Name: sex, dtype: object
series_copy[2]='female'  # Note: use the integer label 2, not the string '2'; assigning to series_copy['2'] would create a new element instead.
series_copy
0    female
1    female
2    female
3      male
Name: sex, dtype: object
df04  # Changes to the copied Series do not affect the original DataFrame
        name     sex  year       city
0  Zhang San  female  2001    Beijing
1      Li Si  female  2001   Shanghai
2    Wang Wu    male  2003  Guangzhou
3  Xiao Ming    male  2002   Shenzhen
# 7) Transpose of a DataFrame
# Syntax: similar to NumPy, dataframe_obj.T
df04.T
              0         1          2          3
name  Zhang San     Li Si    Wang Wu  Xiao Ming
sex      female    female       male       male
year       2001      2001       2003       2002
city    Beijing  Shanghai  Guangzhou   Shenzhen
df04
        name     sex  year       city
0  Zhang San  female  2001    Beijing
1      Li Si  female  2001   Shanghai
2    Wang Wu    male  2003  Guangzhou
3  Xiao Ming    male  2002   Shenzhen
# 8) Assign values to the name attribute of the column index and of the row index:
df04.index.name='Name'
df04.columns.name='Infor'
df04
Infor       name     sex  year       city
Name
0      Zhang San  female  2001    Beijing
1      Li Si  female  2001   Shanghai
2        Wang Wu    male  2003  Guangzhou
3      Xiao Ming    male  2002   Shenzhen
# 9) The index, columns, and values attributes of a DataFrame
df04.index  # One-dimensional array
RangeIndex(start=0, stop=4, step=1, name='Name')
df04.columns  # One-dimensional array
Index(['name', 'sex', 'year', 'city'], dtype='object', name='Infor')
df04.values  # Two-dimensional array
array([['Zhang San', 'female', 2001, 'Beijing'],
       ['Li Si', 'female', 2001, 'Shanghai'],
       ['Wang Wu', 'male', 2003, 'Guangzhou'],
       ['Xiao Ming', 'male', 2002, 'Shenzhen']], dtype=object)
# df04['new']=['12','abc',12,'NaN']
df04['new']=pd.Series(['Data Analysis dev','Data Analysis im','Data Analysis ba'],index=[0,1,2])
df04.values
array([['Zhang San', 'female', 2001, 'Beijing', 'Data Analysis dev'],
       ['Li Si', 'female', 2001, 'Shanghai', 'Data Analysis im'],
       ['Wang Wu', 'male', 2003, 'Guangzhou', 'Data Analysis ba'],
       ['Xiao Ming', 'male', 2002, 'Shenzhen', nan]], dtype=object)
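The `dtype=object` in the output above comes from mixing strings and numbers in one array. As a hedged illustration (the small frame and its `score` column are made up), selecting only numeric columns gives a numeric array instead; `to_numpy()` is used here as the method form of `values`.

```python
import pandas as pd

df = pd.DataFrame(
    {"name": ["Zhang San", "Li Si"], "year": [2001, 2003], "score": [1.5, 2.0]}
)

# Strings and numbers together: the only common element type is object.
mixed = df.to_numpy()

# Integer and float columns share a common numeric type (float).
numeric = df[["year", "score"]].to_numpy()
print(mixed.dtype, numeric.dtype)
```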

Index operations

#Follow the "zero basic Python data analysis" to learn.
# II. Index object
# The index of a DataFrame is itself an object, with its own methods and attributes.
# Index objects in pandas store the axis labels and other metadata, such as the axis name.
# When you build a Series or DataFrame, any array or sequence of labels you pass is converted internally to an Index object.

data={
    'name':np.array(['Zhang San','Li Si','Wang Wu','Xiao Ming']),
    'sex':('female','female','male','male'),
    'year':[2001,2001,2003,2002],
    'city':('Beijing','Shanghai','Guangzhou','Shenzhen')
}
df=DataFrame(data)

# 1) Index object
df.index
RangeIndex(start=0, stop=4, step=1)
df.columns
Index(['name', 'sex', 'year', 'city'], dtype='object')
# 2) Index objects are immutable. Trying to modify an element in place raises an error:
# index=df.index
col = df.columns
# index[1]=0
col[1]='a'
---------------------------------------------------------------------------

TypeError                                 Traceback (most recent call last)

<ipython-input-144-daed13b54382> in <module>
      3 col = df.columns
      4 # index[1]=0
----> 5 col[1]='a'


~\anaconda3\envs\data_analysis\lib\site-packages\pandas\core\indexes\base.py in __setitem__(self, key, value)
   3908 
   3909     def __setitem__(self, key, value):
-> 3910         raise TypeError("Index does not support mutable operations")
   3911 
   3912     def __getitem__(self, key):


TypeError: Index does not support mutable operations
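Since individual labels cannot be assigned, relabeling is done by building a new Index. The sketch below (on a small made-up frame; the new label names are hypothetical) catches the error shown above and then shows two common alternatives: `rename`, and assigning a complete new list of labels.

```python
import pandas as pd

df = pd.DataFrame({"name": ["Zhang San", "Li Si"], "sex": ["female", "male"]})

try:
    df.columns[1] = "a"   # item assignment on an Index is not allowed
    raised = False
except TypeError:
    raised = True

# rename() returns a new DataFrame whose column Index has the new label.
renamed = df.rename(columns={"sex": "gender"})

# Assigning a full list replaces the whole Index object at once.
relabeled = df.copy()
relabeled.columns = ["full_name", "gender"]
print(raised, renamed.columns.tolist(), relabeled.columns.tolist())
```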
# 3) An Index object is array-like, and it also behaves like a fixed-size set
df.index.name='infor item'
df.columns.name='details'
df
details          name     sex  year       city
infor item
0           Zhang San  female  2001    Beijing
1               Li Si  female  2001   Shanghai
2             Wang Wu    male  2003  Guangzhou
3           Xiao Ming    male  2002   Shenzhen
'sex' in df.columns
True
2 in df.index
True
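The set-like behavior goes beyond the `in` operator. A minimal sketch with two made-up Index objects, showing `intersection`, `union`, and `difference`, plus one way in which an Index differs from a Python set: it may contain duplicate labels.

```python
import pandas as pd

left = pd.Index(["name", "sex", "year"])
right = pd.Index(["year", "city"])

both = left.intersection(right)      # labels present in both
either = left.union(right)           # all labels from both
only_left = left.difference(right)   # labels in left but not in right

# Unlike a Python set, an Index may hold duplicate labels.
dup = pd.Index(["a", "a", "b"])
print(list(both), sorted(either), sorted(only_left), dup.is_unique)
```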
# III. pandas index operations
# (1) Index operations on Series and DataFrame
# (2) Selecting and operating on DataFrame data, by comparison with Excel data

# 1) Reindex:
# 1-1) Series reindex: s_obj.reindex() returns a new Series object and does not change the original Series
# The method parameter is optional. By default, labels in the new index list that are not in the original index get the value NaN.
# method='ffill' or 'pad' fills each new label with the value of the preceding label (forward fill), and method='bfill' or 'backfill' fills it with the value of the following label (backward fill).

s_obj=Series([1,2,3,4],index=['a','c','d','f'])
s_obj
a    1
c    2
d    3
f    4
dtype: int64
s_obj2=s_obj.reindex(['a','c','m','d','e'])   # The original label f is dropped, and the new labels m and e are introduced
s_obj2
a    1.0
c    2.0
m    NaN
d    3.0
e    NaN
dtype: float64
# When reindex introduces labels that are not in the original Series, you can fill them in with method='ffill' or method='bfill'.
# Fill rule: each new label is compared with the original index in sorted (character) order and takes the value of the nearest original label before it (ffill) or after it (bfill). The original index must be sorted for this to work.
s_obj3=s_obj.reindex(['a','b','m','d','e'],method='ffill')  # Forward fill: b takes the value of a, m takes the value of f, and e takes the value of d
s_obj3
a    1
b    1
m    4
d    3
e    3
dtype: int64
s_obj4=s_obj.reindex(['a','b','m','d','e'],method='bfill')  # Backward fill: b takes the value of c, e takes the value of f, and m has no later label, so it becomes NaN
s_obj4
a    1.0
b    2.0
m    NaN
d    3.0
e    4.0
dtype: float64
s_obj  # The original Series object is unchanged
a    1
c    2
d    3
f    4
dtype: int64
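As a complement to `method`, `reindex` also accepts `fill_value`, which supplies a constant for new labels. The sketch below reuses `s_obj` from above; the label lists are chosen only for illustration.

```python
import pandas as pd

s_obj = pd.Series([1, 2, 3, 4], index=["a", "c", "d", "f"])

# fill_value supplies a constant for new labels, so no NaN is introduced.
with_zero = s_obj.reindex(["a", "b", "c"], fill_value=0)

# method="ffill" takes the value of the nearest earlier label in the sorted index.
forward = s_obj.reindex(list("abcdef"), method="ffill")

# method="bfill" takes the value of the nearest later label instead.
backward = s_obj.reindex(list("abcdef"), method="bfill")
print(with_zero.tolist(), forward.tolist(), backward.tolist())
```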
# 1-2) Reindexing a DataFrame:
# For a DataFrame, both the row index and the column index can be reindexed:
import numpy as np
import pandas as pd
from pandas import DataFrame,Series
df=DataFrame(np.arange(9).reshape((3,3)),index=['a','c','d'],columns=['name','id','sex'])
df
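The original example stops after creating `df`. A minimal sketch of how the reindexing could continue, assuming the same `df` as above; the extra row label `b` and the extra column `age` are hypothetical, added only to show how new labels are filled.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    np.arange(9).reshape((3, 3)),
    index=["a", "c", "d"],
    columns=["name", "id", "sex"],
)

# Reindex the rows: the new label 'b' gets a row of NaN.
rows = df.reindex(["a", "b", "c", "d"])

# Reindex the columns: 'id' is dropped and the new column 'age' is all NaN.
cols = df.reindex(columns=["name", "sex", "age"])

# Reindex both axes at once, using 0 instead of NaN for new cells.
both = df.reindex(
    index=["a", "b", "c", "d"], columns=["name", "sex", "age"], fill_value=0
)
print(rows, cols, both, sep="\n\n")
```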

To be updated.

Tags: Attribute Excel Database Python

Posted on Mon, 16 Mar 2020 00:42:13 -0400 by bobthebullet990