Pandas: a Swiss Army knife for data processing

1, Pandas introduction

pandas provides two high-level data structures, Series and DataFrame, which make data processing in Python convenient, fast and simple.

There are some incompatibilities between different versions of pandas. Therefore, we need to know which version of pandas we are using.

import pandas as pd
print(pd.__version__)

1.3.1

The two main data structures of pandas are Series and DataFrame. Let's import them and related modules first:

import numpy as np
import pandas as pd
from pandas import Series, DataFrame

2, Pandas data structure: Series

A Series can also be called a sequence; in a general sense, it can simply be regarded as a one-dimensional array. The main difference between a Series and a one-dimensional array is that a Series has an index, which works much like the keys of a hash table (dict), another common data structure in programming. Secondly, the elements of a Series share a single dtype, while a plain list can mix elements of different types.
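Both differences can be checked directly. A minimal sketch (the variable names are illustrative only):

```python
import pandas as pd

# The index of a Series works like the keys of a dict
s = pd.Series([10, 20, 30], index=['a', 'b', 'c'])
print(s['b'])        # label lookup, like a hash table

# Mixed-type data is coerced to the generic object dtype
mixed = pd.Series([1, 'two', 3.0])
print(mixed.dtype)   # object
print(pd.Series([1, 2, 3]).dtype)   # int64 when the data is homogeneous
```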

2.1 creating Series

The basic format of creating a Series is s = Series(data, index=index, name=name). The following are some examples of creating a Series:

a = np.random.randn(5)
print("a is an array:")
print(a)
s = Series(a)
print("s is a Series:")
print(s)

a is an array:
[-0.3007983   0.2893125  -0.10106809 -1.06076531  0.29202818]
s is a Series:
0   -0.300798
1    0.289313
2   -0.101068
3   -1.060765
4    0.292028
dtype: float64

You can add an index when creating a Series, and you can use Series.index to view the specific index. It should be noted that when creating a Series from an array, if you specify index, the index length should be consistent with the length of data:

s = Series(np.random.randn(5), index=['a', 'b', 'c', 'd', 'e'])
print(s)
print(s.index)

a    1.192283
b    1.477963
c   -0.386441
d    1.622310
e    0.845787
dtype: float64
Index(['a', 'b', 'c', 'd', 'e'], dtype='object')

Another option to create a Series is name. You can specify the name of the Series, which can be accessed by Series.name. In the subsequent DataFrame, the column name of each column becomes the name of the Series when the column is taken out separately:

s = Series(np.random.randn(5), index=['a', 'b', 'c', 'd', 'e'],
name='my_series')
print(s)
print(s.name)

a   -2.240155
b    0.258177
c    0.343206
d    1.220887
e   -0.153971
Name: my_series, dtype: float64
my_series

Series can also be created from a dictionary (dict):

d = {'a': 0., 'b': 1, 'c': 2}
print("d is a dict:")
print(d)
s = Series(d)
print("s is a Series:")
print(s)

d is a dict:
{'a': 0.0, 'b': 1, 'c': 2}
s is a Series:
a    0.0
b    1.0
c    2.0
dtype: float64

Let's take a look at the case where the index is specified when creating a Series using a dictionary (the index length does not have to be the same as the dictionary):

s = Series(d, index=['b', 'c', 'd', 'a'])
print(s)

b    1.0
c    2.0
d    NaN
a    0.0
dtype: float64

We can observe two things. First, for a Series created from a dictionary, the data is rearranged in the order of the index. Second, the index length need not match the dictionary length: if there are extra index labels, pandas automatically fills them with NaN (not a number, the standard marker for missing data in pandas); if there are fewer labels, the dictionary entries that are not listed are dropped.
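The "fewer labels" case can be sketched with the same dictionary; the unlisted key 'b' is simply dropped:

```python
import pandas as pd

d = {'a': 0., 'b': 1, 'c': 2}
# Fewer index labels than dictionary keys: only 'c' and 'a' survive
s = pd.Series(d, index=['c', 'a'])
print(s)
```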

If the data is a single scalar, such as the number 4, the value is repeated to fill every index label:

s = Series(4., index=['a', 'b', 'c', 'd', 'e'])
print(s)

a    4.0
b    4.0
c    4.0
d    4.0
e    4.0
dtype: float64

2.2 access to series data

When accessing Series data, you can use subscripts like arrays, indexes like dictionaries, and filter with some conditions:

s = Series(np.random.randn(10),
           index=['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j'])
print(s[0])

0.20339387093082803

print(s[[0]])

a    0.906648
dtype: float64

Note: the difference between s[0] and s[[0]] is that s[0] returns only the scalar value, while s[[0]] selects rows and therefore keeps the original Series type.

print(s[:2])

a   -2.028119
b    0.061965
dtype: float64

print(s[[2,0,4]])

c   -0.526092
a    0.484422
e    1.571355
dtype: float64

print(s[['e', 'i']])

e    0.435927
i    1.045612
dtype: float64

print(s[s > 0.5])

c    0.995218
e    0.858984
h    0.942102
i    0.675896
dtype: float64

print('e' in s)

True

3, Pandas data structure: DataFrame

DataFrame can also be called a data frame. Before using it, let's go over its characteristics. A DataFrame is a two-dimensional data structure that combines several Series by column; each column, taken out separately, is a Series, much like a column pulled from a SQL database table. It is therefore most convenient to process a DataFrame column by column, and it is worth cultivating the habit of thinking about data in columns. A strength of DataFrame is that it easily handles columns of different types; conversely, for tasks like inverting a matrix of floating-point numbers, it is more convenient to keep the data as a NumPy array.
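A quick sketch of the per-column typing just described (the column names here are made up): dataframe.dtypes reports one dtype per column:

```python
import pandas as pd

df = pd.DataFrame({'price': [1.5, 2.5],
                   'volume': [100, 200],
                   'name': ['A', 'B']})
print(df.dtypes)   # float64, int64, object: each column keeps its own dtype
```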

3.1 create DataFrame

First, let's look at how to create a DataFrame from a dictionary. DataFrame is a two-dimensional data structure and an aggregate of multiple Series. Let's first create a dictionary whose value is Series and convert it to DataFrame:

d = {'one': Series([1., 2., 3.], index=['a', 'b', 'c']),
     'two': Series([1., 2., 3., 4.], index=['a', 'b', 'c', 'd'])}
df = DataFrame(d)
print(df)

   one  two
a  1.0  1.0
b  2.0  2.0
c  3.0  3.0
d  NaN  4.0

You can specify the required rows and columns. If the dictionary does not contain corresponding elements, it will be set to NaN:

df = DataFrame(d, index=['r', 'd', 'a'], columns=['two', 'three'])
print(df)

   two three
r  NaN   NaN
d  4.0   NaN
a  1.0   NaN

You can use dataframe.index and dataframe.columns to view the rows and columns of DataFrame. dataframe.values returns the elements of DataFrame in the form of array:

df = DataFrame(d)   # re-create df from the dictionary d
print("DataFrame index:")
print(df.index)
print("DataFrame columns:")
print(df.columns)
print("DataFrame values:")
print(df.values)

DataFrame index:
Index(['a', 'b', 'c', 'd'], dtype='object')
DataFrame columns:
Index(['one', 'two'], dtype='object')
DataFrame values:
[[ 1.  1.]
 [ 2.  2.]
 [ 3.  3.]
 [nan  4.]]

DataFrame can also be created from a dictionary whose value is an array, but the length of each array needs to be the same:

d = {'one': [1., 2., 3., 4.], 'two': [4., 3., 2., 1.]}
df = DataFrame(d, index=['a', 'b', 'c', 'd'])
print(df)

   one  two
a  1.0  4.0
b  2.0  3.0
c  3.0  2.0
d  4.0  1.0

When the data is a list of dictionaries rather than arrays, there is no such length restriction, and missing values are filled with NaN:

d = [{'a': 1.6, 'b': 2}, {'a': 3, 'b': 6, 'c': 9}]
df = DataFrame(d)
print(df)

     a  b    c
0  1.6  2  NaN
1  3.0  6  9.0

When actually processing data, you sometimes need to create an empty DataFrame. You can do this:

df = DataFrame()
print(df)

Empty DataFrame
Columns: []
Index: []

Another useful way to create a DataFrame is with the concat function, which builds a DataFrame from Series or other DataFrames:

a = Series(range(5))
b = Series(np.linspace(4, 20, 5))
df = pd.concat([a, b], axis=1)
print(df)

   0     1
0  0   4.0
1  1   8.0
2  2  12.0
3  3  16.0
4  4  20.0

Here axis=1 means concatenating side by side, so each Series becomes a column; axis=0 means stacking end to end, in which case the two length-5 Series above would produce a single column of length 10 (a Series). The following example shows how to merge DataFrames into one large DataFrame by row:

df = DataFrame()
index = ['alpha', 'beta', 'gamma', 'delta', 'eta']
for i in range(5):
    a = DataFrame([np.linspace(i, 5*i, 5)], index=[index[i]])
    df = pd.concat([df, a], axis=0)
print(df)

         0    1     2     3     4
alpha  0.0  0.0   0.0   0.0   0.0
beta   1.0  2.0   3.0   4.0   5.0
gamma  2.0  4.0   6.0   8.0  10.0
delta  3.0  6.0   9.0  12.0  15.0
eta    4.0  8.0  12.0  16.0  20.0
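The axis=0 behaviour described above is easy to verify: stacking two length-5 Series yields one long Series of length 10, not a two-column table. A minimal sketch:

```python
import numpy as np
import pandas as pd

a = pd.Series(range(5))
b = pd.Series(np.linspace(4, 20, 5))
stacked = pd.concat([a, b], axis=0)   # stack end to end
print(stacked.shape)                  # (10,)
print(type(stacked).__name__)         # Series
```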

3.2 DataFrame data access

First of all, to emphasize again: a DataFrame is column-oriented. Every operation can be imagined as taking a column out of the DataFrame (a Series) and then taking elements from that Series. You can select columns with dataframe.column_name or with the dataframe[] operator. The former can only select a single column, while the latter can select several. If the DataFrame has no column names, [] accepts a non-negative integer, i.e. a "subscript", to select a column; if there are column names, they must be used, and dataframe.column_name is unavailable when there is no column name:

print(df[1])
print(type(df[1]))

alpha    0.0
beta     2.0
gamma    4.0
delta    6.0
eta      8.0
Name: 1, dtype: float64
<class 'pandas.core.series.Series'>

print(df[[1]])
print(type(df[[1]]))

         1
alpha  0.0
beta   2.0
gamma  4.0
delta  6.0
eta    8.0
<class 'pandas.core.frame.DataFrame'>

Note: similar to the Series case in Section 2.2, df[1] takes one column as a Series, while df[[1]] takes one column and keeps the DataFrame structure.

df.columns = ['a', 'b', 'c', 'd', 'e']
print(df['b'])
print(type(df['b']))

alpha    0.0
beta     2.0
gamma    4.0
delta    6.0
eta      8.0
Name: b, dtype: float64
<class 'pandas.core.series.Series'>

print(df[['a', 'd']])
print(type(df[['a', 'd']]))

         a     d
alpha  0.0   0.0
beta   1.0   4.0
gamma  2.0   8.0
delta  3.0  12.0
eta    4.0  16.0
<class 'pandas.core.frame.DataFrame'>

The above code uses dataframe.columns to assign a column name to the DataFrame, and we see that a column is taken separately, and its data structure displays Series. The result of taking two or more columns is still DataFrame. To access specific elements, you can use subscripts or indexes like Series:

print(df['b'][2])
print(df['b']['gamma'])

4.0
4.0

To select rows, you can use dataframe.iloc to select by subscript or dataframe.loc to select by label:

print(df.iloc[1])
print(df.loc['beta'])

a    1.0
b    2.0
c    3.0
d    4.0
e    5.0
Name: beta, dtype: float64
a    1.0
b    2.0
c    3.0
d    4.0
e    5.0
Name: beta, dtype: float64

Note: loc is label-based and iloc is position-based. If iloc is given a label, an error is raised; likewise, loc raises an error when given a positional subscript on a label index.

Example: the correct usage is df.iloc[0][0] or df.loc['alpha']['a']
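A small sketch of the failure mode in the note above, using a toy frame (not the df from the text):

```python
import pandas as pd

df = pd.DataFrame([[1, 2], [3, 4]],
                  index=['alpha', 'beta'], columns=['a', 'b'])
print(df.loc['beta', 'b'])   # label-based: prints 4
print(df.iloc[1, 1])         # position-based: prints 4
try:
    df.loc[1]                # 1 is a subscript, not a label here
except KeyError:
    print('loc with a subscript raises KeyError')
```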

You can also select rows by slicing or Boolean vectors:

print("Selecting by slices:")
print(df[1:3])
bool_vec = [True, False, True, True, False]
print("Selecting by boolean vector:")
print(df[bool_vec])

Selecting by slices:
         a    b    c    d     e
beta   1.0  2.0  3.0  4.0   5.0
gamma  2.0  4.0  6.0  8.0  10.0
Selecting by boolean vector:
         a    b    c     d     e
alpha  0.0  0.0  0.0   0.0   0.0
gamma  2.0  4.0  6.0   8.0  10.0
delta  3.0  6.0  9.0  12.0  15.0

Combine rows and columns to select data:

# Columns first, then rows
print(df[['b', 'd']].iloc[[1, 3]])
# Rows first, then columns
print(df.iloc[[1, 3]][['b', 'd']])
# Ditto: df.iloc[[1, 3], [1, 3]]
# Columns first, then rows
print(df[['b', 'd']].loc[['beta', 'delta']])
# Rows first, then columns
print(df.loc[['beta', 'delta']][['b', 'd']])
# Ditto: df.loc[['beta', 'delta'], ['b', 'd']]

         b     d
beta   2.0   4.0
delta  6.0  12.0
         b     d
beta   2.0   4.0
delta  6.0  12.0
         b     d
beta   2.0   4.0
delta  6.0  12.0
         b     d
beta   2.0   4.0
delta  6.0  12.0

dataframe.at and dataframe.iat are the fastest ways to access a single element at a specific position; they use labels and subscripts respectively:

print(df.iat[2, 3])
print(df.at['gamma', 'd'])

8.0
8.0

4, Advanced: Pandas data operation

After mastering the operations in this chapter, you can handle most day-to-day data processing. To make the data easier to read, let's first widen the output display:

pd.set_option('display.width', 200)

4.1 other methods of data creation

The creation of data structure is not just the standard form described earlier. For example, we can create a Series with date as element:

dates = pd.date_range('20150101', periods=5)
print(dates)

DatetimeIndex(['2015-01-01', '2015-01-02', '2015-01-03', '2015-01-04',
               '2015-01-05'],
              dtype='datetime64[ns]', freq='D')

Assign this date to a DataFrame as an index:

df = pd.DataFrame(np.random.randn(5, 4), index=dates, columns=list('ABCD'))
print(df)

                   A         B         C         D
2015-01-01  0.008608 -0.686443 -0.021788  0.434486
2015-01-02  0.711034  0.746027  1.528270  0.557210
2015-01-03 -0.334801  0.532736  1.006003  0.030372
2015-01-04  0.507740  0.668962 -0.166262  0.518384
2015-01-05  0.887693 -0.839035  0.998530  1.066598

Any object that can be converted into Series can be used to create a DataFrame:

df2 = pd.DataFrame({'A': 1.,
                    'B': pd.Timestamp('20150214'),
                    'C': pd.Series(1.6, index=list(range(4)), dtype='float64'),
                    'D': np.array([4] * 4, dtype='int64'),
                    'E': 'hello pandas!'})
print(df2)

     A          B    C  D              E
0  1.0 2015-02-14  1.6  4  hello pandas!
1  1.0 2015-02-14  1.6  4  hello pandas!
2  1.0 2015-02-14  1.6  4  hello pandas!
3  1.0 2015-02-14  1.6  4  hello pandas!

4.2 data viewing and sorting

Create a DataFrame from the following data:

raw_data = [['000001.XSHE', '2015-01-05', 'Ping An Bank', 15.99, 16.28, 15.60, 16.02, 286043643],
            ['601998.XSHG', '2015-01-28', 'China CITIC Bank', 7.04, 7.32, 6.95, 7.15, 163146128],
            ['000001.XSHE', '2015-01-07', 'Ping An Bank', 15.56, 15.83, 15.30, 15.48, 170012067],
            ['000001.XSHE', '2015-01-08', 'Ping An Bank', 15.50, 15.57, 14.90, 14.96, 140771421],
            ['000001.XSHE', '2015-01-09', 'Ping An Bank', 14.90, 15.87, 14.71, 15.08, 250850023],
            ['601998.XSHG', '2015-01-29', 'China CITIC Bank', 6.97, 7.05, 6.90, 7.01, 93003445],
            ['000001.XSHE', '2015-01-06', 'Ping An Bank', 15.85, 16.39, 15.55, 15.78, 216642140],
            ['601998.XSHG', '2015-01-30', 'China CITIC Bank', 7.10, 7.14, 6.92, 6.95, 68146718]]
columns = ['secID', 'tradeDate', 'secShortName', 'openPrice', 'highestPrice', 'lowestPrice', 'closePrice', 'turnoverVol']
df = DataFrame(raw_data, columns=columns)
print(df)

         secID   tradeDate      secShortName  openPrice  highestPrice  lowestPrice  closePrice  turnoverVol
0  000001.XSHE  2015-01-05      Ping An Bank      15.99         16.28        15.60       16.02    286043643
1  601998.XSHG  2015-01-28  China CITIC Bank       7.04          7.32         6.95        7.15    163146128
2  000001.XSHE  2015-01-07      Ping An Bank      15.56         15.83        15.30       15.48    170012067
3  000001.XSHE  2015-01-08      Ping An Bank      15.50         15.57        14.90       14.96    140771421
4  000001.XSHE  2015-01-09      Ping An Bank      14.90         15.87        14.71       15.08    250850023
5  601998.XSHG  2015-01-29  China CITIC Bank       6.97          7.05         6.90        7.01     93003445
6  000001.XSHE  2015-01-06      Ping An Bank      15.85         16.39        15.55       15.78    216642140
7  601998.XSHG  2015-01-30  China CITIC Bank       7.10          7.14         6.92        6.95     68146718

[8 rows x 8 columns]

The data above is the daily market information of two stocks in January 2015. First, let's look at the size of the data:

print(df.shape)

(8, 8)

We can see that there are 8 rows, i.e. 8 records with 8 fields each. Now preview the data: dataframe.head() and dataframe.tail() show the first and last five rows; pass a number in the parentheses to change the row count:

print("Head of this DataFrame:")
print(df.head())
print("Tail of this DataFrame:")
print(df.tail(3))

Head of this DataFrame:
         secID   tradeDate      secShortName  ...  lowestPrice  closePrice  turnoverVol
0  000001.XSHE  2015-01-05      Ping An Bank  ...        15.60       16.02    286043643
1  601998.XSHG  2015-01-28  China CITIC Bank  ...         6.95        7.15    163146128
2  000001.XSHE  2015-01-07      Ping An Bank  ...        15.30       15.48    170012067
3  000001.XSHE  2015-01-08      Ping An Bank  ...        14.90       14.96    140771421
4  000001.XSHE  2015-01-09      Ping An Bank  ...        14.71       15.08    250850023

[5 rows x 8 columns]
Tail of this DataFrame:
         secID   tradeDate      secShortName  ...  lowestPrice  closePrice  turnoverVol
5  601998.XSHG  2015-01-29  China CITIC Bank  ...         6.90        7.01     93003445
6  000001.XSHE  2015-01-06      Ping An Bank  ...        15.55       15.78    216642140
7  601998.XSHG  2015-01-30  China CITIC Bank  ...         6.92        6.95     68146718

[3 rows x 8 columns]

dataframe.describe() provides statistics of pure numerical data in DataFrame:

print(df.describe())

       openPrice  highestPrice  lowestPrice  closePrice   turnoverVol
count   8.000000      8.000000     8.000000     8.00000  8.000000e+00
mean   12.363750     12.681250    12.103750    12.30375  1.735769e+08
std     4.422833      4.571541     4.300156     4.37516  7.490931e+07
min     6.970000      7.050000     6.900000     6.95000  6.814672e+07
25%     7.085000      7.275000     6.942500     7.11500  1.288294e+08
50%    15.200000     15.700000    14.805000    15.02000  1.665791e+08
75%    15.632500     15.972500    15.362500    15.55500  2.251941e+08
max    15.990000     16.390000    15.600000    16.02000  2.860436e+08

Sorting the data will facilitate us to observe the data. DataFrame provides two forms of sorting. One is sorting by row and column, that is, sorting by index (row name) or column name. You can call dataframe.sort_index, specify axis=0 to sort by index (row name), axis=1 to sort by column name, and specify ascending (ascending=True) or descending (ascending=False):

print("Order by index, descending:")
print(df.sort_index(axis=0, ascending=False).head())

Order by index, descending:
         secID   tradeDate      secShortName  ...  lowestPrice  closePrice  turnoverVol
7  601998.XSHG  2015-01-30  China CITIC Bank  ...         6.92        6.95     68146718
6  000001.XSHE  2015-01-06      Ping An Bank  ...        15.55       15.78    216642140
5  601998.XSHG  2015-01-29  China CITIC Bank  ...         6.90        7.01     93003445
4  000001.XSHE  2015-01-09      Ping An Bank  ...        14.71       15.08    250850023
3  000001.XSHE  2015-01-08      Ping An Bank  ...        14.90       14.96    140771421

The second sort is sort by value. You can specify the column name and sorting method. The default is ascending sort:

print("Order by column value, ascending:")
print(df.sort_values(by=['tradeDate']))
print("Order by multiple columns value:")
df = df.sort_values(by=['tradeDate', 'secID'], ascending=[False, True])
print(df.head())

Order by column value, ascending:
         secID   tradeDate      secShortName  ...  lowestPrice  closePrice  turnoverVol
0  000001.XSHE  2015-01-05      Ping An Bank  ...        15.60       16.02    286043643
6  000001.XSHE  2015-01-06      Ping An Bank  ...        15.55       15.78    216642140
2  000001.XSHE  2015-01-07      Ping An Bank  ...        15.30       15.48    170012067
3  000001.XSHE  2015-01-08      Ping An Bank  ...        14.90       14.96    140771421
4  000001.XSHE  2015-01-09      Ping An Bank  ...        14.71       15.08    250850023
1  601998.XSHG  2015-01-28  China CITIC Bank  ...         6.95        7.15    163146128
5  601998.XSHG  2015-01-29  China CITIC Bank  ...         6.90        7.01     93003445
7  601998.XSHG  2015-01-30  China CITIC Bank  ...         6.92        6.95     68146718

[8 rows x 8 columns]
Order by multiple columns value:
         secID   tradeDate      secShortName  ...  lowestPrice  closePrice  turnoverVol
7  601998.XSHG  2015-01-30  China CITIC Bank  ...         6.92        6.95     68146718
5  601998.XSHG  2015-01-29  China CITIC Bank  ...         6.90        7.01     93003445
1  601998.XSHG  2015-01-28  China CITIC Bank  ...         6.95        7.15    163146128
4  000001.XSHE  2015-01-09      Ping An Bank  ...        14.71       15.08    250850023
3  000001.XSHE  2015-01-08      Ping An Bank  ...        14.90       14.96    140771421

[5 rows x 8 columns]

4.3 data access and operation

4.3.1 data filtering

Several ways of accessing DataFrame data with loc, iloc, at, iat and [] have been introduced. Here is another way, using slices to take some rows and all columns:

print(df.iloc[1:4][:])

         secID   tradeDate      secShortName  ...  lowestPrice  closePrice  turnoverVol
1  601998.XSHG  2015-01-28  China CITIC Bank  ...         6.95        7.15    163146128
2  000001.XSHE  2015-01-07      Ping An Bank  ...        15.30       15.48    170012067
3  000001.XSHE  2015-01-08      Ping An Bank  ...        14.90       14.96    140771421

We can extend the previously introduced method of obtaining data using Boolean vectors to easily filter the data. For example, we want to select the data whose closing price is above the mean:

print(df[df.closePrice > df.closePrice.mean()].head())

         secID   tradeDate  secShortName  ...  lowestPrice  closePrice  turnoverVol
0  000001.XSHE  2015-01-05  Ping An Bank  ...        15.60       16.02    286043643
2  000001.XSHE  2015-01-07  Ping An Bank  ...        15.30       15.48    170012067
3  000001.XSHE  2015-01-08  Ping An Bank  ...        14.90       14.96    140771421
4  000001.XSHE  2015-01-09  Ping An Bank  ...        14.71       15.08    250850023
6  000001.XSHE  2015-01-06  Ping An Bank  ...        15.55       15.78    216642140

[5 rows x 8 columns]

The isin() function can easily filter the data in the DataFrame:

print(df[df['secID'].isin(['601998.XSHG'])].head())

         secID   tradeDate      secShortName  ...  lowestPrice  closePrice  turnoverVol
1  601998.XSHG  2015-01-28  China CITIC Bank  ...         6.95        7.15    163146128
5  601998.XSHG  2015-01-29  China CITIC Bank  ...         6.90        7.01     93003445
7  601998.XSHG  2015-01-30  China CITIC Bank  ...         6.92        6.95     68146718

[3 rows x 8 columns]

4.3.2 processing missing data

On the basis of accessing the data, we can change the data, for example, modify some elements as missing values:

df.loc[df['secID'] == '000001.XSHE', 'closePrice'] = np.nan
print(df)

         secID   tradeDate      secShortName  ...  lowestPrice  closePrice  turnoverVol
0  000001.XSHE  2015-01-05      Ping An Bank  ...        15.60         NaN    286043643
1  601998.XSHG  2015-01-28  China CITIC Bank  ...         6.95        7.15    163146128
2  000001.XSHE  2015-01-07      Ping An Bank  ...        15.30         NaN    170012067
3  000001.XSHE  2015-01-08      Ping An Bank  ...        14.90         NaN    140771421
4  000001.XSHE  2015-01-09      Ping An Bank  ...        14.71         NaN    250850023
5  601998.XSHG  2015-01-29  China CITIC Bank  ...         6.90        7.01     93003445
6  000001.XSHE  2015-01-06      Ping An Bank  ...        15.55         NaN    216642140
7  601998.XSHG  2015-01-30  China CITIC Bank  ...         6.92        6.95     68146718

[8 rows x 8 columns]

Find the rows with missing elements:

print(df[df['closePrice'].isnull()])

         secID   tradeDate  secShortName  ...  lowestPrice  closePrice  turnoverVol
0  000001.XSHE  2015-01-05  Ping An Bank  ...        15.60         NaN    286043643
2  000001.XSHE  2015-01-07  Ping An Bank  ...        15.30         NaN    170012067
3  000001.XSHE  2015-01-08  Ping An Bank  ...        14.90         NaN    140771421
4  000001.XSHE  2015-01-09  Ping An Bank  ...        14.71         NaN    250850023
6  000001.XSHE  2015-01-06  Ping An Bank  ...        15.55         NaN    216642140

[5 rows x 8 columns]

The original data may contain missing values, just like the sample data being processed now, and there are many ways to deal with them. Generally, dataframe.dropna() discards rows containing NaN; if how='all' is specified (the default is 'any'), a row is discarded only when it is entirely NaN; if thresh is specified, a row is kept only when its number of non-missing values reaches the given threshold; to drop rows based on particular columns, use subset.

# Re-create df from raw_data so that only one NaN will be present
df = DataFrame(raw_data, columns=columns)
# Set the closePrice of row 1 to NaN
df.loc[1, 'closePrice'] = np.nan
# Check df's size before filtering
print("Data size before filtering:")
print(df.shape)
# Drop every row that contains any NaN
print("Drop all rows that have any NaN values:")
print("Data size after filtering:")
print(df.dropna().shape)
print(df.dropna())
# Drop a row only when all of its columns are NaN
print("Drop only if all columns are NaN:")
print("Data size after filtering:")
print(df.dropna(how='all').shape)
print(df.dropna(how='all'))
# Keep rows having at least six non-NaN values
print("Drop rows who do not have at least six values that are not NaN:")
print("Data size after filtering:")
print(df.dropna(thresh=6).shape)
print(df.dropna(thresh=6))
# Drop a row when a specific column is NaN
print("Drop only if NaN in specific column:")
print("Data size after filtering:")
print(df.dropna(subset=['closePrice']).shape)
print(df.dropna(subset=['closePrice']))

Data size before filtering:
(8, 8)
Drop all rows that have any NaN values:
Data size after filtering:
(7, 8)
         secID   tradeDate      secShortName  ...  lowestPrice  closePrice  turnoverVol
0  000001.XSHE  2015-01-05      Ping An Bank  ...        15.60       16.02    286043643
2  000001.XSHE  2015-01-07      Ping An Bank  ...        15.30       15.48    170012067
3  000001.XSHE  2015-01-08      Ping An Bank  ...        14.90       14.96    140771421
4  000001.XSHE  2015-01-09      Ping An Bank  ...        14.71       15.08    250850023
5  601998.XSHG  2015-01-29  China CITIC Bank  ...         6.90        7.01     93003445
6  000001.XSHE  2015-01-06      Ping An Bank  ...        15.55       15.78    216642140
7  601998.XSHG  2015-01-30  China CITIC Bank  ...         6.92        6.95     68146718

[7 rows x 8 columns]
Drop only if all columns are NaN:
Data size after filtering:
(8, 8)
         secID   tradeDate      secShortName  ...  lowestPrice  closePrice  turnoverVol
0  000001.XSHE  2015-01-05      Ping An Bank  ...        15.60       16.02    286043643
1  601998.XSHG  2015-01-28  China CITIC Bank  ...         6.95         NaN    163146128
2  000001.XSHE  2015-01-07      Ping An Bank  ...        15.30       15.48    170012067
3  000001.XSHE  2015-01-08      Ping An Bank  ...        14.90       14.96    140771421
4  000001.XSHE  2015-01-09      Ping An Bank  ...        14.71       15.08    250850023
5  601998.XSHG  2015-01-29  China CITIC Bank  ...         6.90        7.01     93003445
6  000001.XSHE  2015-01-06      Ping An Bank  ...        15.55       15.78    216642140
7  601998.XSHG  2015-01-30  China CITIC Bank  ...         6.92        6.95     68146718

[8 rows x 8 columns]
Drop rows who do not have at least six values that are not NaN:
Data size after filtering:
(8, 8)
         secID   tradeDate      secShortName  ...  lowestPrice  closePrice  turnoverVol
0  000001.XSHE  2015-01-05      Ping An Bank  ...        15.60       16.02    286043643
1  601998.XSHG  2015-01-28  China CITIC Bank  ...         6.95         NaN    163146128
2  000001.XSHE  2015-01-07      Ping An Bank  ...        15.30       15.48    170012067
3  000001.XSHE  2015-01-08      Ping An Bank  ...        14.90       14.96    140771421
4  000001.XSHE  2015-01-09      Ping An Bank  ...        14.71       15.08    250850023
5  601998.XSHG  2015-01-29  China CITIC Bank  ...         6.90        7.01     93003445
6  000001.XSHE  2015-01-06      Ping An Bank  ...        15.55       15.78    216642140
7  601998.XSHG  2015-01-30  China CITIC Bank  ...         6.92        6.95     68146718

[8 rows x 8 columns]
Drop only if NaN in specific column:
Data size after filtering:
(7, 8)
         secID   tradeDate      secShortName  ...  lowestPrice  closePrice  turnoverVol
0  000001.XSHE  2015-01-05      Ping An Bank  ...        15.60       16.02    286043643
2  000001.XSHE  2015-01-07      Ping An Bank  ...        15.30       15.48    170012067
3  000001.XSHE  2015-01-08      Ping An Bank  ...        14.90       14.96    140771421
4  000001.XSHE  2015-01-09      Ping An Bank  ...        14.71       15.08    250850023
5  601998.XSHG  2015-01-29  China CITIC Bank  ...         6.90        7.01     93003445
6  000001.XSHE  2015-01-06      Ping An Bank  ...        15.55       15.78    216642140
7  601998.XSHG  2015-01-30  China CITIC Bank  ...         6.92        6.95     68146718

[7 rows x 8 columns]

When there is missing data, you may not want to discard it all. dataframe.fillna(value=value) fills missing values with a specified value:

print(df.fillna(value=20150101).head())

         secID   tradeDate      secShortName  ...  lowestPrice   closePrice  turnoverVol
0  000001.XSHE  2015-01-05      Ping An Bank  ...        15.60        16.02    286043643
1  601998.XSHG  2015-01-28  China CITIC Bank  ...         6.95  20150101.00    163146128
2  000001.XSHE  2015-01-07      Ping An Bank  ...        15.30        15.48    170012067
3  000001.XSHE  2015-01-08      Ping An Bank  ...        14.90        14.96    140771421
4  000001.XSHE  2015-01-09      Ping An Bank  ...        14.71        15.08    250850023

[5 rows x 8 columns]

4.3.3 data operation

The Series and DataFrame classes provide statistical methods such as mean() and sum(); specify 0 to compute by column and 1 to compute by row:

print(df.mean(0))

openPrice       1.236375e+01
highestPrice    1.268125e+01
lowestPrice     1.210375e+01
closePrice      1.230375e+01
turnoverVol     1.735769e+08
dtype: float64

The value_counts function conveniently counts value frequencies:

print(df['closePrice'].value_counts().head())

16.02    1
7.15     1
15.48    1
14.96    1
15.08    1
Name: closePrice, dtype: int64

In pandas, a Series can call the map function to apply a function to each element; a DataFrame can call the apply function to apply a function to each column (or row), and applymap applies a function to every element. The function can be a user-defined lambda or any existing function. The following example scales the closing price to the [0, 1] range:

print(df[['closePrice']].apply(lambda x: (x - x.min()) / (x.max() - x.min())).head())

   closePrice
0    1.000000
1    0.022051
2    0.940463
3    0.883131
4    0.896362
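The map and applymap variants mentioned above can be sketched on toy data (apply works per column, the other two element-wise):

```python
import pandas as pd

s = pd.Series([1.0, 4.0, 9.0])
print(s.map(lambda x: x ** 0.5))       # element-wise on a Series

df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})
print(df.applymap(lambda x: x * 10))   # element-wise on every cell
print(df.apply(sum))                   # per column: a -> 3, b -> 7
```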

Using append, you can add elements to the end of a Series or add a row to the end of a DataFrame:

dat1 = df[['secID', 'tradeDate', 'closePrice']].head()
dat2 = df[['secID', 'tradeDate', 'closePrice']].iloc[2]
print("Before appending:")
print(dat1)
dat = dat1.append(dat2, ignore_index=True)
print("After appending:")
print(dat)

Before appending:
         secID   tradeDate  closePrice
0  000001.XSHE  2015-01-05       16.02
1  601998.XSHG  2015-01-28        7.15
2  000001.XSHE  2015-01-07       15.48
3  000001.XSHE  2015-01-08       14.96
4  000001.XSHE  2015-01-09       15.08
After appending:
         secID   tradeDate  closePrice
0  000001.XSHE  2015-01-05       16.02
1  601998.XSHG  2015-01-28        7.15
2  000001.XSHE  2015-01-07       15.48
3  000001.XSHE  2015-01-08       14.96
4  000001.XSHE  2015-01-09       15.08
5  000001.XSHE  2015-01-07       15.48

DataFrames can be merged as in SQL. We introduced the concat function earlier, which is one way of combining; another is the merge function, where you specify which columns to merge on. The following example merges the data on security ID and trade date:

dat1 = df[['secID', 'tradeDate', 'closePrice']]
dat2 = df[['secID', 'tradeDate', 'turnoverVol']]
dat = dat1.merge(dat2, on=['secID', 'tradeDate'])
print("The first DataFrame:")
print(dat1.head())
print("The second DataFrame:")
print(dat2.head())
print("Merged DataFrame:")
print(dat.head())

The first DataFrame:
         secID   tradeDate  closePrice
0  000001.XSHE  2015-01-05       16.02
1  601998.XSHG  2015-01-28        7.15
2  000001.XSHE  2015-01-07       15.48
3  000001.XSHE  2015-01-08       14.96
4  000001.XSHE  2015-01-09       15.08
The second DataFrame:
         secID   tradeDate  turnoverVol
0  000001.XSHE  2015-01-05    286043643
1  601998.XSHG  2015-01-28    163146128
2  000001.XSHE  2015-01-07    170012067
3  000001.XSHE  2015-01-08    140771421
4  000001.XSHE  2015-01-09    250850023
Merged DataFrame:
         secID   tradeDate  closePrice  turnoverVol
0  000001.XSHE  2015-01-05       16.02    286043643
1  601998.XSHG  2015-01-28        7.15    163146128
2  000001.XSHE  2015-01-07       15.48    170012067
3  000001.XSHE  2015-01-08       14.96    140771421
4  000001.XSHE  2015-01-09       15.08    250850023
The main parameters of merge:

- left: the left DataFrame to merge
- right: the right DataFrame to merge
- on: column or index level names to join on; they must exist in both DataFrames. If on is not given and left_index and right_index are both False, the intersection of the columns of the two DataFrames is used as the join key
- left_on: column or index level of the left DataFrame to use as the key; can be a column name, an index level name, or an array whose length equals the length of the DataFrame
- right_on: the same as left_on, but for the right DataFrame
- left_index: if True, use the index (row labels) of the left DataFrame as its join key; for a DataFrame with a MultiIndex (hierarchical index), the number of levels must match the number of join keys in the right DataFrame
- right_index: the same as left_index, but for the right DataFrame
- how: one of 'left', 'right', 'outer', 'inner'; the default is 'inner'. inner takes the intersection of the keys, outer the union; left keeps all rows of the left DataFrame and fills them in from the right, and right is the mirror image of left
- sort: sort the result by the join keys in lexicographic order; the default is False, since sorting can noticeably slow down large merges
- suffixes: tuple of string suffixes applied to overlapping column names; the default is ('_x', '_y')
- copy: always copy data from the passed DataFrame objects (the default is True), even when reindexing is not necessary
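To make the how parameter concrete, here is a minimal sketch with two invented toy frames (left, right and the column names key, x, y are illustrative, not from the stock data above):

```python
import pandas as pd

left = pd.DataFrame({'key': ['a', 'b', 'c'], 'x': [1, 2, 3]})
right = pd.DataFrame({'key': ['b', 'c', 'd'], 'y': [20, 30, 40]})

inner = left.merge(right, on='key', how='inner')     # keys in both: b, c
outer = left.merge(right, on='key', how='outer')     # union of keys: a, b, c, d
left_join = left.merge(right, on='key', how='left')  # all left keys: a, b, c
print(outer)
```

In the outer result, rows whose key appears in only one frame get NaN in the columns coming from the other frame.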

Another powerful feature of DataFrame is groupby, which makes it very convenient to group data. Here we average the opening price, highest price, lowest price, closing price and trading volume of each stock over January 2015:

df_grp = df.groupby('secID')
grp_mean = df_grp.mean()
print(grp_mean)

             openPrice  highestPrice  lowestPrice  closePrice   turnoverVol
secID
000001.XSHE  15.560000        15.988    15.212000   15.464000  2.128639e+08
601998.XSHG   7.036667         7.170     6.923333    7.036667  1.080988e+08
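groupby can also compute several statistics per group in one call via agg. A small self-contained sketch (the toy prices frame is invented to stand in for df above):

```python
import pandas as pd

# Toy price table standing in for the stock data above
prices = pd.DataFrame({
    'secID': ['000001.XSHE', '000001.XSHE', '601998.XSHG', '601998.XSHG'],
    'closePrice': [16.02, 15.48, 7.15, 6.95],
})
# Mean per group, as in the example above
grp_mean = prices.groupby('secID')['closePrice'].mean()
# agg computes several statistics per group at once
stats = prices.groupby('secID')['closePrice'].agg(['min', 'max', 'mean'])
print(stats)
```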

What if you want the latest record for each stock? drop_duplicates can do this: first sort the data by date, then de-duplicate by security ID:

df2 = df.sort_values(by=['secID', 'tradeDate'], ascending=[True, False])
print(df2.drop_duplicates(subset='secID'))

         secID   tradeDate      secShortName  ...  lowestPrice  closePrice  turnoverVol
4  000001.XSHE  2015-01-09      Ping An Bank  ...        14.71       15.08    250850023
7  601998.XSHG  2015-01-30  China CITIC Bank  ...         6.92        6.95     68146718

[2 rows x 8 columns]

If you want to keep the oldest record instead, take the last row of each group after the descending sort by specifying keep='last' (by default the first row is kept):

print(df2.drop_duplicates(subset='secID', keep='last'))

         secID   tradeDate      secShortName  ...  lowestPrice  closePrice  turnoverVol
0  000001.XSHE  2015-01-05      Ping An Bank  ...        15.60       16.02    286043643
1  601998.XSHG  2015-01-28  China CITIC Bank  ...         6.95        7.15    163146128

[2 rows x 8 columns]
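The sort-then-deduplicate pattern is easy to test on a tiny invented table (the frame trades below is illustrative only):

```python
import pandas as pd

# Toy table: two trades for stock 'A', one for 'B'
trades = pd.DataFrame({
    'secID': ['A', 'A', 'B'],
    'tradeDate': ['2015-01-05', '2015-01-09', '2015-01-30'],
})
# Sort by security ascending and date descending, as above
trades = trades.sort_values(by=['secID', 'tradeDate'], ascending=[True, False])
newest = trades.drop_duplicates(subset='secID')               # first row per secID
oldest = trades.drop_duplicates(subset='secID', keep='last')  # last row per secID
```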

4.3.4 data visualization

pandas data can be plotted directly for a quick look. In the following example we plot the closing price of Sinopec in January. set_index('tradeDate')['closePrice'] makes the 'tradeDate' column the index and returns the 'closePrice' column as a Series, whose plot method draws the chart; more plotting parameters can be found in the matplotlib documentation.

dat = df[df['secID'] == '600028.XSHG'].set_index('tradeDate')['closePrice']
dat.plot(title="Close Price of SINOPEC (600028) during Jan, 2015")

5, Code quick reference

Data creation
- Create a DataFrame: df = DataFrame({'one': [1, 2, 3], 'two': ['a', 'b', 'c']}, index=['a', 'b', 'c']) (index: row names; columns: column names). A DataFrame can also be built from nested lists: df = DataFrame([[1, 2, 3], ['a', 'b', 'c']], columns=['one', 'two', 'three'])
- Create a date Series: pd.date_range('20210101', periods=5) (periods: number of periods to generate)
- Concatenate Series: df = pd.concat([Series([1, 2, 3]), Series(['a', 'b', 'c'])], axis=1) (axis=1 concatenates horizontally, axis=0 vertically)

Basic data access
- View the data size: df.shape
- View the first n rows: df.head(n) (n defaults to 5)
- View the last n rows: df.tail(n)
- View statistics of the numeric columns: df.describe() (count, mean, standard deviation, minimum, maximum)
- Access a column: df['one'] returns a Series; df[['one']] and df[['one', 'two']] return a DataFrame
- Access a row: df.iloc[0], df.iloc[0:2] (iloc selects rows by position; loc selects rows by label, e.g. df.loc['a']; a slice comes back as a DataFrame, a single row as a Series)
- Access rows + columns: df.loc[['a', 'b']][['one']], df.iloc[0:2][['one']]
- Access a single value: df.iloc[0, 0], df.loc['a', 'one']

Data filtering
- Rows matching a condition: df[df['one'] > 1] (rows whose value in column 'one' is greater than 1)
- Rows containing a string: df[df['two'].str.contains('a')] (rows whose column 'two' contains the string 'a')
- Rows whose value is in a set: df[df['one'].isin([1, 2])] (rows whose value in column 'one' is 1 or 2)
- Rows with a missing value: df[df['one'].isnull()] (rows whose column 'one' is missing)
- Rows without a missing value: df[df['one'].notnull()] (rows whose column 'one' is not missing)

Data statistics
- Mean: df.mean(0) (0 computes by column, 1 by row)
- Sum: df.sum(0)
- Value frequencies: df['one'].value_counts() (number of occurrences of each value in column 'one')

Data operations
- Sort by column (row) labels: df.sort_index(axis=1, ascending=True) (axis=1 sorts by column names, axis=0 by row names; ascending=True sorts ascending, False descending)
- Sort by values: df.sort_values(by=['one', 'two'], ascending=[True, True]) (by: columns to sort on)
- De-duplicate: df.drop_duplicates(subset=['one']) (de-duplicates on column 'one', keeping only the first row)
- Merge: df.merge(df2, on=['one']) (on: columns used as merge keys; how: 'left', 'right', 'outer' or 'inner', default 'inner', where inner takes the intersection of the keys, outer the union, left keeps all left rows and fills from the right, and right is the mirror image of left; suffixes: renames overlapping columns, default ('_x', '_y'))
- Apply a function to each column (row): df[['one']].apply(lambda x: x + 1) (lambda: a custom function)
- Apply a function to each element: df[['one']].applymap(lambda x: x + 1)
- Append rows: df.append(df2, ignore_index=True) (ignore_index=True renumbers the rows from 0; False keeps the original row labels)
- Drop a column: df.drop('one', axis=1)
- Group: df.groupby('one') (groups the data by column 'one'; the result is a DataFrameGroupBy object, which can be wrapped in list() to inspect its contents)
- Convert all elements to strings: df.astype(str)
- Convert one column to strings: df['one'].astype(str)
- Split a column on commas into separate rows: df.drop('two', axis=1).join(df['two'].str.split(',', expand=True).stack().reset_index(level=1, drop=True).rename('tag')) (rename: the name of the new column after splitting)
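The last entry (splitting a comma-separated column into one row per value) is the least obvious, so here is a minimal sketch with a small invented frame:

```python
import pandas as pd

df = pd.DataFrame({'one': [1, 2], 'two': ['a,b', 'c']}, index=['r1', 'r2'])
# Split 'two' on commas, stack the pieces into rows,
# then join them back to the remaining columns on the row index
tags = (df['two'].str.split(',', expand=True)
                 .stack()
                 .reset_index(level=1, drop=True)
                 .rename('tag'))
result = df.drop('two', axis=1).join(tags)
print(result)
```

Row r1 is duplicated, once per tag ('a' and 'b'), while r2 keeps its single tag 'c'.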

6, Interaction between Pandas and Excel

6.1 reading Excel data

Parse the data of an Excel sheet into a DataFrame:

import pandas as pd

with pd.ExcelFile('saw.xlsx') as xlsx:
    # List the sheet names
    names = xlsx.sheet_names
    # Read the data of sheet 'Sheet1'
    # na_values: extra strings to interpret as NaN
    # keep_default_na=False: parse empty cells as the empty string ''
    # (the default True parses them as NaN)
    # index_col=0 would additionally use the first column as the index
    df = pd.read_excel(xlsx, 'Sheet1', na_values=['NA'], keep_default_na=False)

6.2 writing Excel data

Write a DataFrame to an Excel sheet:

import pandas as pd
from openpyxl import load_workbook

wb = load_workbook('saw.xlsx')
with pd.ExcelWriter('saw.xlsx', engine='openpyxl') as xlsx:
    # Attach the existing workbook so its sheets are preserved
    xlsx.book = wb
    # Write df to sheet 'Sheet2' without the index or the column names
    df.to_excel(xlsx, sheet_name='Sheet2', index=False, header=None)

Note: calling the ExcelWriter method directly would overwrite the original workbook, so here it is combined with the openpyxl module to append a sheet instead (newer pandas versions can also open the writer with mode='a').

7, Reference articles

Python stock quantitative trading tutorial

Tags: Python Pycharm data structure pandas

Posted on Sun, 05 Sep 2021 17:55:14 -0400 by tony.j.jackson@o2.co.uk