Introduction to pandas

NumPy is an extension of the Python language that supports large, multi-dimensional arrays and matrices, along with a large library of mathematical functions that operate on them. Because NumPy is implemented in C, it runs fast and efficiently, which is one of its main advantages. NumPy's specialty, however, is scientific computation rather than data processing. Real-world data usually comes with column labels and a row index, which NumPy does not support, and that is where pandas comes in. The main job of pandas is data analysis: pandas is built on top of NumPy, inherits from it, and can be understood as an enhanced version of NumPy specialized for data processing, which is why we use pandas rather than raw NumPy for this work.

In data analysis, samples are rows and features are columns.

Advantages and disadvantages of pandas. Advantages: pandas is more flexible than Excel, MATLAB, Tableau, and similar tools, and has the edge when dealing with large data sets; it is also faster at reading Excel files. Disadvantages: the API is relatively rigid, so you have to know exactly which pandas function to use; the surrounding ecosystem has rough edges (the statsmodels statistics library has many bugs and mostly unfriendly official documentation, and scipy.stats can be cumbersome to use).

The processing speed of pandas. Why not use MySQL? MySQL is slow because it operates on files on disk, while pandas loads the data into memory, which makes it fast.

Libraries used in data analysis: NumPy as the underlying dependency; pandas for data analysis; matplotlib for straightforward visualization; seaborn as a helper library on top of matplotlib (it ships with color palettes and richer chart types); pyecharts for simple interactive visualization, commonly seen in e-commerce.

There are two main data structures in pandas: Series (one-dimensional, with an ordered index) and DataFrame (structured, two-dimensional tabular data).

 

1. Series

A Series is a one-dimensional, array-like data structure consisting of a set of data and its associated index: the values are the data (an ndarray) and the keys are the index labels. At first glance this structure looks similar to a dict, but a dict is an unordered data structure, while a Series is equivalent to a fixed-length, ordered dictionary whose index and values are independent of each other. The index of a Series can be changed, whereas the keys of a dict cannot. A Series combines a sequence with a hash: as a sequence, its index is ordered and positional (enumerable); as a hash, lookup works by key (label) regardless of position. Both pandas data structures support getting and setting values through objects and attributes, and strings are also objects in pandas.
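A minimal sketch of this "ordered dictionary" behavior (the labels and values here are illustrative):

import pandas as pd

s = pd.Series({'x': 10, 'y': 20, 'z': 30})
s['y']                      # 20: dict-style lookup by key
s.iloc[1]                   # 20: positional lookup, which a plain dict does not offer
s.index = ['a', 'b', 'c']   # the index is mutable, unlike dict keys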

 

Creation of Series:

A Series can be created from a list or a NumPy array; the default index is an integer index from 0 to N-1. A list, tuple, dict, or ndarray can all be cast to the Series type.

import numpy as np
import pandas as pd
from pandas import Series, DataFrame

# Created from a list; the default index is the integer index 0 to 4
s0 = Series([1,2,3,4,5])
s0[1]   # 2

# Created from a numpy array
s1 = Series(np.array(list('ABCD')))

# Created from a dictionary (hash); the dictionary's keys become the Series index
s2 = Series({'A':1,"B":2,"C":3})

# Specify the index via the index parameter --> {a:'A', b:'B', c:'C', d:'D'}
s3 = Series(data=list('ABCD'), index=list('abcd'), name='username')  # name is used by later examples

 

Indexing and slicing of Series:

1). General indexing: use brackets with a single index (this returns the element), or brackets with a list of indexes (this returns a Series).

2). Explicit indexing:

- Use the label (association) type from index as the index value; - Use .loc[] (recommended). A pandas Series is an upgrade of ndarray, but it is also an upgrade of dict.

3). Implicit indexing:
- Use integer positions as index values; - Use .iloc[] (recommended)

# General indexing
s3[0]           # 'A'
s3['a']         # 'A'

# Explicit indexing
s3.loc['a']     # 'A'

# Implicit indexing
s3.iloc[0]      # 'A'

4). Slices:

# Regular slice: closed on the left, open on the right
s3[0:-1]
    # a    A
    # b    B
    # c    C
    # Name: username, dtype: object

# Explicit slice: fully closed interval
s3.loc['a':'d']
    # a    A
    # b    B
    # c    C
    # d    D
    # Name: username, dtype: object

# Implicit slice: closed on the left, open on the right
s3.iloc[0:-1]
    # a    A
    # b    B
    # c    C
    # Name: username, dtype: object

 

Properties of Series:

Series can be thought of as a fixed-length, ordered dictionary.

'''
ndim:   number of dimensions
shape:  shape
size:   number of elements
dtype:  data type
index:  get all indexes
values: get all values
name:   get the name
head(): quickly preview a Series object; returns the first five rows by default
tail(): quickly preview a Series object; returns the last five rows by default
'''

Code demonstration example:

s3.shape      # (4,)
s3.size       # 4
s3.ndim       # 1
s3.name       # 'username'
s3.dtype      # dtype('O') represents the string (object) type
s3.index      # Index(['a', 'b', 'c', 'd'], dtype='object')
s3.keys()     # Index(['a', 'b', 'c', 'd'], dtype='object')
s3.values     # array(['A', 'B', 'C', 'D'], dtype=object)
s3.head(n=5)
s3.tail(n=5)

 

Detect missing data:

When an index has no corresponding value, the missing data is displayed as NaN (not a number). Note: np.nan != np.nan, i.e. NaN is not equal to itself; you can use pd.isnull() and pd.notnull(), or the equivalent Series methods .isnull() and .notnull(), to detect missing data.

# Construct a Series that contains a NaN value
s5 = Series(data=range(4), index=list('abcd'))
s5['c'] = np.nan        # NaN is a float, so the dtype becomes float64
s5
  #  a    0.0
  #  b    1.0
  #  c    NaN
  #  d    3.0
  #  dtype: float64

# Detect missing data
cond = pd.isnull(s5)        # equivalent to s5.isnull()
cond
  # a    False
  # b    False
  # c     True
  # d    False
  # dtype: bool

# After detecting the NaN values, replace them with 0
s5[cond] = 0
   # a    0.0
   # b    1.0
   # c    0.0
   # d    3.0
   # dtype: float64

s5['c'] = np.nan    # set the value at index 'c' back to NaN

# Detect non-missing data
cond_fa = s5.notnull()     # equivalent to pd.notnull(s5)
cond_fa
   # a     True
   # b     True
   # c    False
   # d     True
   # dtype: bool

# Use the boolean mask to filter out the NaN values
s5[cond_fa]
   # a    0.0
   # b    1.0
   # d    3.0
   # dtype: float64

 

Operations between Series:

NaN + any value is NaN. When operating on Series with different indexes, the data is automatically aligned by index; where an index has no counterpart, the result is filled with NaN.

s5 * 3        # s5 still has NaN at index 'c' from the previous example
   # a    0.0
   # b    3.0
   # c    NaN
   # d    9.0
   # dtype: float64

s6 = Series(range(5), list('bcdef'))

s5.add(s6)   # equivalent to s5 + s6
   # a    NaN
   # b    1.0
   # c    NaN
   # d    5.0
   # e    NaN
   # f    NaN
   # dtype: float64

 

2. DataFrame

The DataFrame is a tabular data structure that can be seen as a dictionary of Series sharing the same index. A DataFrame consists of multiple columns of data arranged in a certain order; the intent of the design is to extend Series from one dimension to multiple dimensions. The DataFrame has both a row index and a column index. Row index: index; column index: columns; values: values (a two-dimensional numpy array). A training set (two-dimensional data) is two-dimensional, which a one-dimensional Series cannot represent, just as a point on a plane needs both an x and a y coordinate. Each column of a DataFrame can hold a different data type, so you can also think of it as a collection of Series with different data types that share the same index.

Creation of the DataFrame:

The most common method is to pass in a dictionary. The DataFrame uses the dictionary's keys as the column names and the dictionary's values (arrays) as the column values. In addition, the DataFrame automatically adds a row index (as with Series). If a name in the columns argument does not match any key of the dictionary, the corresponding column is filled with NaN.

df1 = DataFrame(data={'Mathematics':[10,20,30,40,50],
                      'Chinese':[1,2,3,4,5],
                      'English':[10,11,12,13,14]},
                index=['Tom','Jhon','Jack','Marry','Jurray'],
                columns=['English','Mathematics','Chinese'])

#         English  Mathematics  Chinese
# Tom          10           10        1
# Jhon         11           20        2
# Jack         12           30        3
# Marry        13           40        4
# Jurray       14           50        5

DataFrame properties:

values, columns, index, shape, ndim, dtypes.

# Row index
df1.index     # Index(['Tom', 'Jhon', 'Jack', 'Marry', 'Jurray'], dtype='object')

# Column index
df1.columns   # Index(['English', 'Mathematics', 'Chinese'], dtype='object')

df1.dtypes
   # English        int64
   # Mathematics    int64
   # Chinese        int64
   # dtype: object

df1.size    # 15

df1.ndim    # 2

Index of DataFrame:

1). Indexing columns (getting a column): by default, [] takes column indexes only, either in a dictionary-like manner or via an attribute. You can get a DataFrame column as a Series; the returned Series has the same index as the original DataFrame, and its name attribute is set to the corresponding column name.

df1['Chinese'] ---> get the 'Chinese' column

2). Indexing rows (getting a row): - use .loc[] with a label to index rows; - use .iloc[] with an integer to index rows. This also returns a Series, whose index is the original columns.

df1.loc['Tom'] ---> get the 'Tom' row
df1.iloc[0] ---> get row 0, which equals the 'Tom' row

3). Indexing elements (getting a single value): - use row and column labels together; - use integer positions (iloc[3,1] passes two arguments, whereas the [3,3] inside iloc[[3,3]] is treated as one list argument selecting rows); - use the values attribute (a two-dimensional numpy array).

df1.loc['Tom','English'] ---> get the value of the element in the 'Tom' row, 'English' column
df1.iloc[0,0] ---> get the value of the element in row 0, column 0, same as df1.loc['Tom','English']
# Indexing columns; [ ] takes column indexes only by default
df1['Chinese']         # returns a Series, equivalent to df1.Chinese
# Tom       1
# Jhon      2
# Jack      3
# Marry     4
# Jurray    5
# Name: Chinese, dtype: int64

df1.Chinese            # getting a column this way is not recommended
df1['Tom']             # raises an error: [ ] cannot take a row label


# Indexing rows
df1.loc['Tom']         # explicit loc
# English        10
# Mathematics    10
# Chinese         1
# Name: Tom, dtype: int64

df1.iloc[0]            # implicit iloc
# English        10
# Mathematics    10
# Chinese         1
# Name: Tom, dtype: int64

# Indexing elements
df1.loc['Tom','English']   # 10

df1.iloc[0,0]         # 10

4). Slicing operation:

Slicing retrieves the values of multiple rows and columns at once.

# Slice with the explicit row index via loc: fully closed interval
df1.loc['Tom':'Jack']                # get the rows from Tom to Jack
#         English  Mathematics  Chinese
# Tom          10           10        1
# Jhon         11           20        2
# Jack         12           30        3

df1.loc['Tom':'Jack','English':'Mathematics']   # rows Tom to Jack, columns English to Mathematics
#         English  Mathematics
# Tom          10           10
# Jhon         11           20
# Jack         12           30


# Slice with the implicit row index via iloc: closed on the left, open on the right
df1.iloc[1:2]        # get rows 1 up to 2 (row 2 not included)
#         English  Mathematics  Chinese
# Jhon         11           20        2

df1.iloc[0:2, 0:2]   # rows 0 up to 2, columns 0 up to 2 (row 2 and column 2 not included)
#         English  Mathematics
# Tom          10           10
# Jhon         11           20

 

Operations on the DataFrame:

Arithmetic on a DataFrame works the same way as on a Series.

The following is a table of Python operators and the corresponding pandas operation functions:

Python operator    pandas function
+                  add()
-                  sub()
*                  mul()
/                  div(), truediv()
//                 floordiv()
%                  mod()
**                 pow()
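A short sketch of the operator/method equivalence (fill_value is a standard option on these methods that substitutes a value for missing entries before computing):

df1 + 10          # operator form
df1.add(10)       # method form, same result

df2 = df1.iloc[0:3]            # a smaller frame with fewer rows, for illustration
df1.add(df2, fill_value=0)     # rows missing from df2 are treated as 0 instead of producing NaN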

Operations between Series and DataFrame:

Use the pandas operation functions with the axis parameter: axis=0 operates column by column (the Series argument must look like a column, i.e. align with the row index) and applies to every column; axis=1 operates row by row (the Series argument must look like a row, i.e. align with the column index) and applies to every row.

Default direction: df1.add(s) matches s against the column index and applies it to every row; the other direction: df1.add(s, axis=0), or equivalently (df1.T + s).T, matches s against the row index. A sketch follows below.
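A minimal sketch of both directions, reusing df1 from the grades example above (the Series values are illustrative):

s_row = Series([1, 2, 3], index=['English', 'Mathematics', 'Chinese'])
df1.add(s_row)            # default axis='columns': s_row is added to every row

s_col = Series(range(5), index=['Tom', 'Jhon', 'Jack', 'Marry', 'Jurray'])
df1.add(s_col, axis=0)    # axis=0: s_col is added to every column, same as (df1.T + s_col).T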

Processing missing data:
df1.loc['Tom','English'] = np.NaN
df1.loc['Jack','Mathematics'] = np.NaN

Operations on null values in pandas:

isnull(): tests whether values are null; notnull(): tests whether values are not null; dropna(): filters out null values; fillna(): fills in null values.

Data analysis: deleting rows is usually appropriate, because a row represents one record while deleting a column affects every record. Machine learning: if the null values concentrate in certain rows, delete those rows; if they concentrate in certain columns, delete those columns.

# Test whether the values are null (only Tom's English was set to NaN above)
df1.loc[:,'English'].isnull()
# Tom        True
# Jhon      False
# Jack      False
# Marry     False
# Jurray    False
# Name: English, dtype: bool

# Delete a row only when all of its elements are null
df1.dropna(how='all')

# Delete a column when any of its elements is null
df1.dropna(how='any', axis=1, inplace=True)

# Fill in null values; the full signature:
df1.fillna(value=None, method=None, axis=None, inplace=False, limit=None, downcast=None)

 

Aggregation operation:
So-called aggregation operations compute summary statistics: mean, standard deviation, maximum, minimum, and so on.

df1.sum()        # operates on columns by default
df1.mean()       # operates on columns by default
df1.max()        # maximum of each column
df1.var()        # sample variance, measuring how much the data fluctuates
df1.std()        # sample standard deviation
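By default these aggregations work down each column (axis=0); pass axis=1 to aggregate across each row instead, for example:

df1.sum(axis=0)   # one total per subject (the default)
df1.sum(axis=1)   # one total per student, summing across the row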

 

pandas concatenation operations:

pandas concatenation comes in two flavors: cascading with pd.concat and df.append, and merging with pd.merge and df.join.

Review concatenation in numpy:

np.concatenate([np.random.randint(0,100,(5,4)), np.random.rand(5,4), np.random.randn(5,4)], axis=1)

1). Simple cascade:

Row-wise merge: pd.concat([df1,df2], axis=0). As with np.concatenate, the default (axis=0) stacks along the rows, increasing the row count. Note that indexes may be repeated after cascading.

 

Column-wise merge: pd.concat([df1,df2], axis=1). Not recommended: concat is not a join-table query; it is good at union-style vertical stacking (axis=0), and horizontal merging should generally be avoided.

 

# Vertical inner join: join='inner' keeps only the columns shared by both frames
pd.concat([df1,df2], axis=0, join='inner')   # indexes may repeat after cascading

# Horizontal inner join: keeps only the rows shared by both frames
pd.concat([df1,df2], axis=1, join='inner')

# Outer join: unmatched items are filled with NaN
pd.concat([df1,df2], axis=0, join='outer')

# ignore_index=True discards the old indexes and renumbers the rows
pd.concat([df1,df2], axis=0, join='outer', ignore_index=True)

# Use a multi-level index via keys to resolve duplicate indexes
pd.concat([df1,df2], axis=0, join='outer', keys=['df1','df2'])
         #   a   b   c
   # df1 0  a0  b0  c0
   #     1  a1  b1  c1
   #     2  a2  b2  c2
   # df2 2  a2  b2  c2
   #     3  a3  b3  c3
   #     4  a4  b4  c4

 

2). Mismatched cascades:
A mismatch means the indexes along the non-concatenation dimension are inconsistent: for example, the column indexes differ when cascading vertically, or the row indexes differ when cascading horizontally. There are three ways to join (see the sketch after this list):
- Outer join: fill mismatches with NaN (the default)
- Inner join: keep only the matching items
- Join along specified axes with join_axes, displaying only the columns of a single dataframe (note: join_axes was removed in newer pandas versions; reindex the result instead)
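A minimal sketch of the outer and inner behavior with mismatched column indexes (df_a and df_b are illustrative frames, not ones defined earlier):

df_a = DataFrame({'a': [1, 2], 'b': [3, 4]})
df_b = DataFrame({'b': [5, 6], 'c': [7, 8]})

pd.concat([df_a, df_b], join='outer')   # columns a, b, c; mismatches filled with NaN
pd.concat([df_a, df_b], join='inner')   # only the shared column b survives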

3). Use the append() function to add:

Because appending rows at the end is so common, there is a dedicated function, append. It is similar to concat but can only concatenate vertically: df1.append(df2). A short sketch follows below.
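Note that DataFrame.append was deprecated in pandas 1.4 and removed in pandas 2.0, so pd.concat is the portable form:

df1.append(df2)                 # older pandas: appends df2's rows below df1's
pd.concat([df1, df2], axis=0)   # the equivalent that works on all versions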

 

 

4). Merge using pd.merge():

The difference between merge and concat is that merge combines frames on a common row or column: when merging with pd.merge(), it automatically merges on the columns that share the same name, using them as the key. Note that the elements of the key columns do not need to be in the same order.

 

pd.merge(DataFrame1, DataFrame2, how='inner', on=None, left_on=None, right_on=None, left_index=False, right_index=False, sort=False, suffixes=('_x', '_y'))

pd.merge(DataFrame1,DataFrame2) == DataFrame1.merge(DataFrame2)

Parameters:

how: defaults to inner; can be set to inner/outer/left/right
on: the field to join on; it must exist in both DataFrames (if it does not, use left_on and right_on instead)
left_on: the column in DataFrame1 to use as the join key
right_on: the column in DataFrame2 to use as the join key
left_index: use DataFrame1's row index as the join key
right_index: use DataFrame2's row index as the join key
sort: sort the merged data by the join key; defaults to False
suffixes: distinguishes duplicate column names that appear in both datasets by appending the suffixes _x and _y in the result

Code demonstrating the differences:

dataDf1 = pd.DataFrame({'lkey':['foo','bar','baz','foo'], 'value':[1,2,3,4]})
#   lkey  value
# 0  foo      1
# 1  bar      2
# 2  baz      3
# 3  foo      4

dataDf2 = pd.DataFrame({'rkey':['foo','bar','qux','bar'], 'value':[5,6,7,8]})
#   rkey  value
# 0  foo      5
# 1  bar      6
# 2  qux      7
# 3  bar      8

# inner join
dataDf1.merge(dataDf2, left_on='lkey', right_on='rkey')
#   lkey  value_x rkey  value_y
# 0  foo        1  foo        5
# 1  foo        4  foo        5
# 2  bar        2  bar        6
# 3  bar        2  bar        8

# outer join
dataDf1.merge(dataDf2, left_on='lkey', right_on='rkey', how='outer')
#   lkey  value_x rkey  value_y
# 0  foo      1.0  foo      5.0
# 1  foo      4.0  foo      5.0
# 2  bar      2.0  bar      6.0
# 3  bar      2.0  bar      8.0
# 4  baz      3.0  NaN      NaN
# 5  NaN      NaN  qux      7.0

# left join
dataDf1.merge(dataDf2, left_on='lkey', right_on='rkey', how='left')
#   lkey  value_x rkey  value_y
# 0  foo        1  foo      5.0
# 1  bar        2  bar      6.0
# 2  bar        2  bar      8.0
# 3  baz        3  NaN      NaN
# 4  foo        4  foo      5.0

# right join
dataDf1.merge(dataDf2, left_on='lkey', right_on='rkey', how='right')
#   lkey  value_x rkey  value_y
# 0  foo      1.0  foo        5
# 1  foo      4.0  foo        5
# 2  bar      2.0  bar        6
# 3  bar      2.0  bar        8
# 4  NaN      NaN  qux        7

[Note] 1. When multiple columns share the same name, use on= to specify explicitly which column is the key; when the key columns on the left and right have different names, use left_on and right_on to specify them separately (a sketch follows after these notes).

2. Inner merge: keep only keys present in both frames (the default); outer merge how='outer': fill mismatches with NaN; left merge how='left' and right merge how='right'.
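A minimal sketch of on= with a shared key column (left and right are illustrative frames):

left  = pd.DataFrame({'key': ['a', 'b'], 'x': [1, 2]})
right = pd.DataFrame({'key': ['a', 'b'], 'y': [3, 4]})
left.merge(right, on='key')
#   key  x  y
# 0   a  1  3
# 1   b  2  4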

 

 

 

Resolution of column conflicts:

 

 

When columns conflict, that is, when the two frames share more than one column name, use on= to specify which column is the key; the remaining conflicting column names are disambiguated with suffixes, and suffixes= lets you choose the suffixes yourself.

dataDf1.merge(dataDf2, left_on='lkey', right_on='rkey', how='right', suffixes=('_df1', '_df2'))
#   lkey  value_df1  rkey  value_df2
# 0  foo        1.0   foo          5
# 1  foo        4.0   foo          5
# 2  bar        2.0   bar          6
# 3  bar        2.0   bar          8
# 4  NaN        NaN   qux          7

 

 

pandas delete operation:

DataFrame.drop(labels=0, axis=0, inplace=True): labels=0 refers to the row labeled 0, and inplace=True means the original data is modified in place.

# assuming df has a default integer row index
df.drop(labels=0, axis=0, inplace=True)   # delete row 0 in place
df.drop(0)                                # delete row 0 (returns a copy)
df.drop([0,1])                            # delete rows 0 and 1
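Columns are deleted the same way by switching the axis; a sketch reusing df1 from the grades example:

df1.drop(labels='English', axis=1)          # drop one column by name
df1.drop(columns=['English', 'Chinese'])    # equivalent keyword form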

Resetting the row index:

DataFrame.set_index('id', inplace=True) sets an existing column as the row index; DataFrame.reset_index(inplace=True) moves the index back into the columns.

dataDf1
#   lkey  value
# 0  foo      1
# 1  bar      2
# 2  baz      3
# 3  foo      4

dataDf1.set_index(['value'], inplace=True)  # set 'value' as the index
dataDf1
#        lkey
# value
#   1     foo
#   2     bar
#   3     baz
#   4     foo

dataDf1.reset_index(inplace=True)          # move the index back into the dataframe as a column
dataDf1
#     value   lkey
# 0      1     foo
# 1      2     bar
# 2      3     baz
# 3      4     foo

Counting value occurrences:

data['column_1'].value_counts()   # count how many times each value appears

Simple saving and reading of data:

pd.read_csv('test.csv')    # read a csv file

data.to_csv('test.csv')    # save to a csv file
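One common round-trip gotcha (a sketch; 'test.csv' is just a placeholder name): to_csv writes the row index as an extra column by default, so either suppress it when writing or recover it when reading:

data.to_csv('test.csv', index=False)        # do not write the row index
df = pd.read_csv('test.csv', index_col=0)   # or: treat the first column as the index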

Supplement: basic statistics. data.describe() returns summary statistics for each numeric column; data.info() prints a concise summary of the DataFrame's structure.
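For example, on the grades frame from above:

df1.describe()   # count, mean, std, min, quartiles, and max for each numeric column
df1.info()       # index range, column names, non-null counts, dtypes, memory usage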
