Pandas: a Swiss Army knife for data processing

1, Pandas introduction

pandas provides two high-level data structures, Series and DataFrame, which make data processing in Python convenient, fast and simple.

There are some incompatibilities between different versions of pandas. Therefore, we need to know which version of pandas we are using.

import pandas as pd
print(pd.__version__)

1.3.1

The two main data structures of pandas are Series and DataFrame. Let's import them and related modules first:

import numpy as np
import pandas as pd
from pandas import Series, DataFrame

2, Pandas data structure: Series

A Series can also be called a sequence; in a general sense, it can simply be regarded as a one-dimensional array. The main difference between a Series and a one-dimensional array is that a Series has an index, which works much like the keys of a hash table (dict), another common data structure in programming. Secondly, the elements of a Series share a single dtype, while a plain list can mix elements of different types.
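Both differences can be checked directly. A minimal sketch (the variable names are illustrative only):

```python
import pandas as pd

# The index of a Series works like the keys of a dict
s = pd.Series([10, 20, 30], index=['a', 'b', 'c'])
print(s['b'])        # label lookup, like a hash table

# Mixed-type data is coerced to the generic object dtype
mixed = pd.Series([1, 'two', 3.0])
print(mixed.dtype)   # object
print(pd.Series([1, 2, 3]).dtype)   # int64 when the data is homogeneous
```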

2.1 creating Series

The basic format of creating a Series is s = Series(data, index=index, name=name). The following are some examples of creating a Series:

a = np.random.randn(5)
print("a is an array:")
print(a)
s = Series(a)
print("s is a Series:")
print(s)

a is an array:
[-0.3007983   0.2893125  -0.10106809 -1.06076531  0.29202818]
s is a Series:
0   -0.300798
1    0.289313
2   -0.101068
3   -1.060765
4    0.292028
dtype: float64

You can add an index when creating a Series, and you can use Series.index to view the specific index. It should be noted that when creating a Series from an array, if you specify index, the index length should be consistent with the length of data:

s = Series(np.random.randn(5), index=['a', 'b', 'c', 'd', 'e'])
print(s)
print(s.index)

a    1.192283
b    1.477963
c   -0.386441
d    1.622310
e    0.845787
dtype: float64
Index(['a', 'b', 'c', 'd', 'e'], dtype='object')

Another option to create a Series is name. You can specify the name of the Series, which can be accessed by Series.name. In the subsequent DataFrame, the column name of each column becomes the name of the Series when the column is taken out separately:

s = Series(np.random.randn(5), index=['a', 'b', 'c', 'd', 'e'],
name='my_series')
print(s)
print(s.name)

a   -2.240155
b    0.258177
c    0.343206
d    1.220887
e   -0.153971
Name: my_series, dtype: float64
my_series

Series can also be created from a dictionary (dict):

d = {'a': 0., 'b': 1, 'c': 2}
print("d is a dict:")
print(d)
s = Series(d)
print("s is a Series:")
print(s)

d is a dict:
{'a': 0.0, 'b': 1, 'c': 2}
s is a Series:
a    0.0
b    1.0
c    2.0
dtype: float64

Let's take a look at the case where the index is specified when creating a Series using a dictionary (the index length does not have to be the same as the dictionary):

s = Series(d, index=['b', 'c', 'd', 'a'])
print(s)

b    1.0
c    2.0
d    NaN
a    0.0
dtype: float64

We can observe two things. First, for a Series created from a dictionary, the data is rearranged in the order of the index. Second, the index length need not match the dictionary length: if there are extra index labels, pandas automatically fills them with NaN (not a number, the standard marker for missing data in pandas); if there are fewer labels, the dictionary entries that are not listed are dropped.
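The "fewer labels" case can be sketched with the same dictionary; the unlisted key 'b' is simply dropped:

```python
import pandas as pd

d = {'a': 0., 'b': 1, 'c': 2}
# Fewer index labels than dictionary keys: only 'c' and 'a' survive
s = pd.Series(d, index=['c', 'a'])
print(s)
```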

If the data is a single scalar, such as the number 4, the value is repeated to fill every index label:

s = Series(4., index=['a', 'b', 'c', 'd', 'e'])
print(s)

a    4.0
b    4.0
c    4.0
d    4.0
e    4.0
dtype: float64

2.2 access to series data

When accessing Series data, you can use subscripts like arrays, indexes like dictionaries, and filter with some conditions:

s = Series(np.random.randn(10),
           index=['a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j'])
print(s[0])

0.20339387093082803

print(s[[0]])

a    0.906648
dtype: float64

Note: the difference between s[0] and s[[0]] is that s[0] returns only the scalar value, while s[[0]] selects rows and therefore keeps the original Series type.

print(s[:2])

a   -2.028119
b    0.061965
dtype: float64

print(s[[2,0,4]])

c   -0.526092
a    0.484422
e    1.571355
dtype: float64

print(s[['e', 'i']])

e    0.435927
i    1.045612
dtype: float64

print(s[s > 0.5])

c    0.995218
e    0.858984
h    0.942102
i    0.675896
dtype: float64

print('e' in s)

True

3, Pandas data structure: DataFrame

DataFrame can also be called a data frame. Before using it, let's go over its characteristics. A DataFrame is a two-dimensional data structure that combines several Series by column; each column, taken out separately, is a Series, much like a column pulled from a SQL database table. It is therefore most convenient to process a DataFrame column by column, and it is worth cultivating the habit of thinking about data in columns. A strength of DataFrame is that it easily handles columns of different types; conversely, for tasks like inverting a matrix of floating-point numbers, it is more convenient to keep the data as a NumPy array.
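A quick sketch of the per-column typing just described (the column names here are made up): dataframe.dtypes reports one dtype per column:

```python
import pandas as pd

df = pd.DataFrame({'price': [1.5, 2.5],
                   'volume': [100, 200],
                   'name': ['A', 'B']})
print(df.dtypes)   # float64, int64, object: each column keeps its own dtype
```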

3.1 create DataFrame

First, let's look at how to create a DataFrame from a dictionary. DataFrame is a two-dimensional data structure and an aggregate of multiple Series. Let's first create a dictionary whose value is Series and convert it to DataFrame:

d = {'one': Series([1., 2., 3.], index=['a', 'b', 'c']),
     'two': Series([1., 2., 3., 4.], index=['a', 'b', 'c', 'd'])}
df = DataFrame(d)
print(df)

   one  two
a  1.0  1.0
b  2.0  2.0
c  3.0  3.0
d  NaN  4.0

You can specify the required rows and columns. If the dictionary does not contain corresponding elements, it will be set to NaN:

df = DataFrame(d, index=['r', 'd', 'a'], columns=['two', 'three'])
print(df)

   two three
r  NaN   NaN
d  4.0   NaN
a  1.0   NaN

You can use dataframe.index and dataframe.columns to view the rows and columns of DataFrame. dataframe.values returns the elements of DataFrame in the form of array:

df = DataFrame(d)   # re-create df from the dictionary d
print("DataFrame index:")
print(df.index)
print("DataFrame columns:")
print(df.columns)
print("DataFrame values:")
print(df.values)

DataFrame index:
Index(['a', 'b', 'c', 'd'], dtype='object')
DataFrame columns:
Index(['one', 'two'], dtype='object')
DataFrame values:
[[ 1.  1.]
 [ 2.  2.]
 [ 3.  3.]
 [nan  4.]]

DataFrame can also be created from a dictionary whose value is an array, but the length of each array needs to be the same:

d = {'one': [1., 2., 3., 4.], 'two': [4., 3., 2., 1.]}
df = DataFrame(d, index=['a', 'b', 'c', 'd'])
print(df)

   one  two
a  1.0  4.0
b  2.0  3.0
c  3.0  2.0
d  4.0  1.0

When the data is a list of dictionaries rather than arrays, there is no such length restriction, and missing values are filled with NaN:

d = [{'a': 1.6, 'b': 2}, {'a': 3, 'b': 6, 'c': 9}]
df = DataFrame(d)
print(df)

     a  b    c
0  1.6  2  NaN
1  3.0  6  9.0

When actually processing data, you sometimes need to create an empty DataFrame. You can do this:

df = DataFrame()
print(df)

Empty DataFrame
Columns: []
Index: []

Another useful way to create a DataFrame is with the concat function, which builds a DataFrame from Series or other DataFrames:

a = Series(range(5))
b = Series(np.linspace(4, 20, 5))
df = pd.concat([a, b], axis=1)
print(df)

   0     1
0  0   4.0
1  1   8.0
2  2  12.0
3  3  16.0
4  4  20.0

Here axis=1 means concatenating side by side, so each Series becomes a column; axis=0 means stacking end to end, in which case the two length-5 Series above would produce a single column of length 10 (a Series). The following example shows how to merge DataFrames into one large DataFrame by row:

df = DataFrame()
index = ['alpha', 'beta', 'gamma', 'delta', 'eta']
for i in range(5):
    a = DataFrame([np.linspace(i, 5*i, 5)], index=[index[i]])
    df = pd.concat([df, a], axis=0)
print(df)

         0    1     2     3     4
alpha  0.0  0.0   0.0   0.0   0.0
beta   1.0  2.0   3.0   4.0   5.0
gamma  2.0  4.0   6.0   8.0  10.0
delta  3.0  6.0   9.0  12.0  15.0
eta    4.0  8.0  12.0  16.0  20.0
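The axis=0 behaviour described above is easy to verify: stacking two length-5 Series yields one long Series of length 10, not a two-column table. A minimal sketch:

```python
import numpy as np
import pandas as pd

a = pd.Series(range(5))
b = pd.Series(np.linspace(4, 20, 5))
stacked = pd.concat([a, b], axis=0)   # stack end to end
print(stacked.shape)                  # (10,)
print(type(stacked).__name__)         # Series
```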

3.2 DataFrame data access

First of all, to emphasize again: a DataFrame is column-oriented. Every operation can be imagined as taking a column out of the DataFrame (a Series) and then taking elements from that Series. You can select columns with dataframe.column_name or with the dataframe[] operator. The former can only select a single column, while the latter can select several. If the DataFrame has no column names, [] accepts a non-negative integer, i.e. a "subscript", to select a column; if there are column names, they must be used, and dataframe.column_name is unavailable when there is no column name:

print(df[1])
print(type(df[1]))

alpha    0.0
beta     2.0
gamma    4.0
delta    6.0
eta      8.0
Name: 1, dtype: float64
<class 'pandas.core.series.Series'>

print(df[[1]])
print(type(df[[1]]))

         1
alpha  0.0
beta   2.0
gamma  4.0
delta  6.0
eta    8.0
<class 'pandas.core.frame.DataFrame'>

Note: similar to the Series case in Section 2.2, df[1] takes one column as a Series, while df[[1]] takes one column and keeps the DataFrame structure.

df.columns = ['a', 'b', 'c', 'd', 'e']
print(df['b'])
print(type(df['b']))

alpha    0.0
beta     2.0
gamma    4.0
delta    6.0
eta      8.0
Name: b, dtype: float64
<class 'pandas.core.series.Series'>

print(df[['a', 'd']])
print(type(df[['a', 'd']]))

         a     d
alpha  0.0   0.0
beta   1.0   4.0
gamma  2.0   8.0
delta  3.0  12.0
eta    4.0  16.0
<class 'pandas.core.frame.DataFrame'>

The above code uses dataframe.columns to assign a column name to the DataFrame, and we see that a column is taken separately, and its data structure displays Series. The result of taking two or more columns is still DataFrame. To access specific elements, you can use subscripts or indexes like Series:

print(df['b'][2])
print(df['b']['gamma'])

4.0
4.0

To select rows, you can use dataframe.iloc to select by subscript or dataframe.loc to select by label:

print(df.iloc[1])
print(df.loc['beta'])

a    1.0
b    2.0
c    3.0
d    4.0
e    5.0
Name: beta, dtype: float64
a    1.0
b    2.0
c    3.0
d    4.0
e    5.0
Name: beta, dtype: float64

Note: loc is label-based and iloc is position-based. If iloc is given a label, an error is raised; likewise, loc raises an error when given a positional subscript on a label index.

Example: the correct usage is df.iloc[0][0] or df.loc['alpha']['a']
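A small sketch of the failure mode in the note above, using a toy frame (not the df from the text):

```python
import pandas as pd

df = pd.DataFrame([[1, 2], [3, 4]],
                  index=['alpha', 'beta'], columns=['a', 'b'])
print(df.loc['beta', 'b'])   # label-based: prints 4
print(df.iloc[1, 1])         # position-based: prints 4
try:
    df.loc[1]                # 1 is a subscript, not a label here
except KeyError:
    print('loc with a subscript raises KeyError')
```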

You can also select rows by slicing or Boolean vectors:

print("Selecting by slices:")
print(df[1:3])
bool_vec = [True, False, True, True, False]
print("Selecting by boolean vector:")
print(df[bool_vec])

Selecting by slices:
         a    b    c    d     e
beta   1.0  2.0  3.0  4.0   5.0
gamma  2.0  4.0  6.0  8.0  10.0
Selecting by boolean vector:
         a    b    c     d     e
alpha  0.0  0.0  0.0   0.0   0.0
gamma  2.0  4.0  6.0   8.0  10.0
delta  3.0  6.0  9.0  12.0  15.0

Combine rows and columns to select data:

# Columns first, then rows
print(df[['b', 'd']].iloc[[1, 3]])
# Rows first, then columns
print(df.iloc[[1, 3]][['b', 'd']])
# Ditto: df.iloc[[1, 3], [1, 3]]
# Columns first, then rows
print(df[['b', 'd']].loc[['beta', 'delta']])
# Rows first, then columns
print(df.loc[['beta', 'delta']][['b', 'd']])
# Ditto: df.loc[['beta', 'delta'], ['b', 'd']]

         b     d
beta   2.0   4.0
delta  6.0  12.0
         b     d
beta   2.0   4.0
delta  6.0  12.0
         b     d
beta   2.0   4.0
delta  6.0  12.0
         b     d
beta   2.0   4.0
delta  6.0  12.0

dataframe.at and dataframe.iat are the fastest ways to access a single element at a specific position; they use labels and subscripts respectively:

print(df.iat[2, 3])
print(df.at['gamma', 'd'])

8.0
8.0

4, Advanced: Pandas data operation

After mastering the operations in this chapter, you can handle most day-to-day data processing. To make the data easier to read, let's first widen the output display:

pd.set_option('display.width', 200)

4.1 other methods of data creation

The creation of data structure is not just the standard form described earlier. For example, we can create a Series with date as element:

dates = pd.date_range('20150101', periods=5)
print(dates)

DatetimeIndex(['2015-01-01', '2015-01-02', '2015-01-03', '2015-01-04',
               '2015-01-05'],
              dtype='datetime64[ns]', freq='D')

Assign this date to a DataFrame as an index:

df = pd.DataFrame(np.random.randn(5, 4), index=dates, columns=list('ABCD'))
print(df)

                   A         B         C         D
2015-01-01  0.008608 -0.686443 -0.021788  0.434486
2015-01-02  0.711034  0.746027  1.528270  0.557210
2015-01-03 -0.334801  0.532736  1.006003  0.030372
2015-01-04  0.507740  0.668962 -0.166262  0.518384
2015-01-05  0.887693 -0.839035  0.998530  1.066598

Any object that can be converted into Series can be used to create a DataFrame:

df2 = pd.DataFrame({'A': 1.,
                    'B': pd.Timestamp('20150214'),
                    'C': pd.Series(1.6, index=list(range(4)), dtype='float64'),
                    'D': np.array([4] * 4, dtype='int64'),
                    'E': 'hello pandas!'})
print(df2)

     A          B    C  D              E
0  1.0 2015-02-14  1.6  4  hello pandas!
1  1.0 2015-02-14  1.6  4  hello pandas!
2  1.0 2015-02-14  1.6  4  hello pandas!
3  1.0 2015-02-14  1.6  4  hello pandas!

4.2 data viewing and sorting

Create a DataFrame from the following data:

raw_data = [['000001.XSHE', '2015-01-05', 'Ping An Bank', 15.99, 16.28, 15.60, 16.02, 286043643],
            ['601998.XSHG', '2015-01-28', 'China CITIC Bank', 7.04, 7.32, 6.95, 7.15, 163146128],
            ['000001.XSHE', '2015-01-07', 'Ping An Bank', 15.56, 15.83, 15.30, 15.48, 170012067],
            ['000001.XSHE', '2015-01-08', 'Ping An Bank', 15.50, 15.57, 14.90, 14.96, 140771421],
            ['000001.XSHE', '2015-01-09', 'Ping An Bank', 14.90, 15.87, 14.71, 15.08, 250850023],
            ['601998.XSHG', '2015-01-29', 'China CITIC Bank', 6.97, 7.05, 6.90, 7.01, 93003445],
            ['000001.XSHE', '2015-01-06', 'Ping An Bank', 15.85, 16.39, 15.55, 15.78, 216642140],
            ['601998.XSHG', '2015-01-30', 'China CITIC Bank', 7.10, 7.14, 6.92, 6.95, 68146718]]
columns = ['secID', 'tradeDate', 'secShortName', 'openPrice', 'highestPrice', 'lowestPrice', 'closePrice', 'turnoverVol']
df = DataFrame(raw_data, columns=columns)
print(df)

         secID   tradeDate      secShortName  openPrice  highestPrice  lowestPrice  closePrice  turnoverVol
0  000001.XSHE  2015-01-05      Ping An Bank      15.99         16.28        15.60       16.02    286043643
1  601998.XSHG  2015-01-28  China CITIC Bank       7.04          7.32         6.95        7.15    163146128
2  000001.XSHE  2015-01-07      Ping An Bank      15.56         15.83        15.30       15.48    170012067
3  000001.XSHE  2015-01-08      Ping An Bank      15.50         15.57        14.90       14.96    140771421
4  000001.XSHE  2015-01-09      Ping An Bank      14.90         15.87        14.71       15.08    250850023
5  601998.XSHG  2015-01-29  China CITIC Bank       6.97          7.05         6.90        7.01     93003445
6  000001.XSHE  2015-01-06      Ping An Bank      15.85         16.39        15.55       15.78    216642140
7  601998.XSHG  2015-01-30  China CITIC Bank       7.10          7.14         6.92        6.95     68146718

[8 rows x 8 columns]

The data above is the daily market information of two stocks in January 2015. First, let's look at the size of the data:

print(df.shape)

(8, 8)

We can see that there are 8 rows, i.e. 8 records with 8 fields each. Now preview the data: dataframe.head() and dataframe.tail() show the first and last five rows; pass a number in the parentheses to change the row count:

print("Head of this DataFrame:")
print(df.head())
print("Tail of this DataFrame:")
print(df.tail(3))

Head of this DataFrame:
         secID   tradeDate      secShortName  ...  lowestPrice  closePrice  turnoverVol
0  000001.XSHE  2015-01-05      Ping An Bank  ...        15.60       16.02    286043643
1  601998.XSHG  2015-01-28  China CITIC Bank  ...         6.95        7.15    163146128
2  000001.XSHE  2015-01-07      Ping An Bank  ...        15.30       15.48    170012067
3  000001.XSHE  2015-01-08      Ping An Bank  ...        14.90       14.96    140771421
4  000001.XSHE  2015-01-09      Ping An Bank  ...        14.71       15.08    250850023

[5 rows x 8 columns]
Tail of this DataFrame:
         secID   tradeDate      secShortName  ...  lowestPrice  closePrice  turnoverVol
5  601998.XSHG  2015-01-29  China CITIC Bank  ...         6.90        7.01     93003445
6  000001.XSHE  2015-01-06      Ping An Bank  ...        15.55       15.78    216642140
7  601998.XSHG  2015-01-30  China CITIC Bank  ...         6.92        6.95     68146718

[3 rows x 8 columns]

dataframe.describe() provides statistics of pure numerical data in DataFrame:

print(df.describe())

       openPrice  highestPrice  lowestPrice  closePrice   turnoverVol
count   8.000000      8.000000     8.000000     8.00000  8.000000e+00
mean   12.363750     12.681250    12.103750    12.30375  1.735769e+08
std     4.422833      4.571541     4.300156     4.37516  7.490931e+07
min     6.970000      7.050000     6.900000     6.95000  6.814672e+07
25%     7.085000      7.275000     6.942500     7.11500  1.288294e+08
50%    15.200000     15.700000    14.805000    15.02000  1.665791e+08
75%    15.632500     15.972500    15.362500    15.55500  2.251941e+08
max    15.990000     16.390000    15.600000    16.02000  2.860436e+08

Sorting the data will facilitate us to observe the data. DataFrame provides two forms of sorting. One is sorting by row and column, that is, sorting by index (row name) or column name. You can call dataframe.sort_index, specify axis=0 to sort by index (row name), axis=1 to sort by column name, and specify ascending (ascending=True) or descending (ascending=False):

print("Order by index, descending:")
print(df.sort_index(axis=0, ascending=False).head())

Order by index, descending:
         secID   tradeDate      secShortName  ...  lowestPrice  closePrice  turnoverVol
7  601998.XSHG  2015-01-30  China CITIC Bank  ...         6.92        6.95     68146718
6  000001.XSHE  2015-01-06      Ping An Bank  ...        15.55       15.78    216642140
5  601998.XSHG  2015-01-29  China CITIC Bank  ...         6.90        7.01     93003445
4  000001.XSHE  2015-01-09      Ping An Bank  ...        14.71       15.08    250850023
3  000001.XSHE  2015-01-08      Ping An Bank  ...        14.90       14.96    140771421

The second sort is sort by value. You can specify the column name and sorting method. The default is ascending sort:

print("Order by column value, ascending:")
print(df.sort_values(by=['tradeDate']))
print("Order by multiple columns value:")
df = df.sort_values(by=['tradeDate', 'secID'], ascending=[False, True])
print(df.head())

Order by column value, ascending:
         secID   tradeDate      secShortName  ...  lowestPrice  closePrice  turnoverVol
0  000001.XSHE  2015-01-05      Ping An Bank  ...        15.60       16.02    286043643
6  000001.XSHE  2015-01-06      Ping An Bank  ...        15.55       15.78    216642140
2  000001.XSHE  2015-01-07      Ping An Bank  ...        15.30       15.48    170012067
3  000001.XSHE  2015-01-08      Ping An Bank  ...        14.90       14.96    140771421
4  000001.XSHE  2015-01-09      Ping An Bank  ...        14.71       15.08    250850023
1  601998.XSHG  2015-01-28  China CITIC Bank  ...         6.95        7.15    163146128
5  601998.XSHG  2015-01-29  China CITIC Bank  ...         6.90        7.01     93003445
7  601998.XSHG  2015-01-30  China CITIC Bank  ...         6.92        6.95     68146718

[8 rows x 8 columns]
Order by multiple columns value:
         secID   tradeDate      secShortName  ...  lowestPrice  closePrice  turnoverVol
7  601998.XSHG  2015-01-30  China CITIC Bank  ...         6.92        6.95     68146718
5  601998.XSHG  2015-01-29  China CITIC Bank  ...         6.90        7.01     93003445
1  601998.XSHG  2015-01-28  China CITIC Bank  ...         6.95        7.15    163146128
4  000001.XSHE  2015-01-09      Ping An Bank  ...        14.71       15.08    250850023
3  000001.XSHE  2015-01-08      Ping An Bank  ...        14.90       14.96    140771421

[5 rows x 8 columns]

4.3 data access and operation

4.3.1 data filtering

Several ways of accessing DataFrame data with loc, iloc, at, iat and [] have been introduced. Here is another way, using slices to take some rows and all columns:

print(df.iloc[1:4][:])

         secID   tradeDate      secShortName  ...  lowestPrice  closePrice  turnoverVol
1  601998.XSHG  2015-01-28  China CITIC Bank  ...         6.95        7.15    163146128
2  000001.XSHE  2015-01-07      Ping An Bank  ...        15.30       15.48    170012067
3  000001.XSHE  2015-01-08      Ping An Bank  ...        14.90       14.96    140771421

We can extend the previously introduced method of obtaining data using Boolean vectors to easily filter the data. For example, we want to select the data whose closing price is above the mean:

print(df[df.closePrice > df.closePrice.mean()].head())

         secID   tradeDate  secShortName  ...  lowestPrice  closePrice  turnoverVol
0  000001.XSHE  2015-01-05  Ping An Bank  ...        15.60       16.02    286043643
2  000001.XSHE  2015-01-07  Ping An Bank  ...        15.30       15.48    170012067
3  000001.XSHE  2015-01-08  Ping An Bank  ...        14.90       14.96    140771421
4  000001.XSHE  2015-01-09  Ping An Bank  ...        14.71       15.08    250850023
6  000001.XSHE  2015-01-06  Ping An Bank  ...        15.55       15.78    216642140

[5 rows x 8 columns]

The isin() function can easily filter the data in the DataFrame:

print(df[df['secID'].isin(['601998.XSHG'])].head())

         secID   tradeDate      secShortName  ...  lowestPrice  closePrice  turnoverVol
1  601998.XSHG  2015-01-28  China CITIC Bank  ...         6.95        7.15    163146128
5  601998.XSHG  2015-01-29  China CITIC Bank  ...         6.90        7.01     93003445
7  601998.XSHG  2015-01-30  China CITIC Bank  ...         6.92        6.95     68146718

[3 rows x 8 columns]

4.3.2 processing missing data

On the basis of accessing the data, we can change the data, for example, modify some elements as missing values:

df.loc[df['secID'] == '000001.XSHE', 'closePrice'] = np.nan
print(df)

         secID   tradeDate      secShortName  ...  lowestPrice  closePrice  turnoverVol
0  000001.XSHE  2015-01-05      Ping An Bank  ...        15.60         NaN    286043643
1  601998.XSHG  2015-01-28  China CITIC Bank  ...         6.95        7.15    163146128
2  000001.XSHE  2015-01-07      Ping An Bank  ...        15.30         NaN    170012067
3  000001.XSHE  2015-01-08      Ping An Bank  ...        14.90         NaN    140771421
4  000001.XSHE  2015-01-09      Ping An Bank  ...        14.71         NaN    250850023
5  601998.XSHG  2015-01-29  China CITIC Bank  ...         6.90        7.01     93003445
6  000001.XSHE  2015-01-06      Ping An Bank  ...        15.55         NaN    216642140
7  601998.XSHG  2015-01-30  China CITIC Bank  ...         6.92        6.95     68146718

[8 rows x 8 columns]

Find the rows with missing elements:

print(df[df['closePrice'].isnull()])

         secID   tradeDate  secShortName  ...  lowestPrice  closePrice  turnoverVol
0  000001.XSHE  2015-01-05  Ping An Bank  ...        15.60         NaN    286043643
2  000001.XSHE  2015-01-07  Ping An Bank  ...        15.30         NaN    170012067
3  000001.XSHE  2015-01-08  Ping An Bank  ...        14.90         NaN    140771421
4  000001.XSHE  2015-01-09  Ping An Bank  ...        14.71         NaN    250850023
6  000001.XSHE  2015-01-06  Ping An Bank  ...        15.55         NaN    216642140

[5 rows x 8 columns]

The original data may contain missing values, just like the sample data being processed now, and there are many ways to deal with them. Generally, dataframe.dropna() discards rows containing NaN; if how='all' is specified (the default is 'any'), a row is discarded only when it is entirely NaN; if thresh is specified, a row is kept only when its number of non-missing values reaches the given threshold; to drop rows based on particular columns, use subset.

# Re-create df from raw_data so that only one NaN will be present
df = DataFrame(raw_data, columns=columns)
# Set the closePrice of row 1 to NaN
df.loc[1, 'closePrice'] = np.nan
# Check df's size before filtering
print("Data size before filtering:")
print(df.shape)
# Drop every row that contains any NaN
print("Drop all rows that have any NaN values:")
print("Data size after filtering:")
print(df.dropna().shape)
print(df.dropna())
# Drop a row only when all of its columns are NaN
print("Drop only if all columns are NaN:")
print("Data size after filtering:")
print(df.dropna(how='all').shape)
print(df.dropna(how='all'))
# Keep rows having at least six non-NaN values
print("Drop rows who do not have at least six values that are not NaN:")
print("Data size after filtering:")
print(df.dropna(thresh=6).shape)
print(df.dropna(thresh=6))
# Drop a row when a specific column is NaN
print("Drop only if NaN in specific column:")
print("Data size after filtering:")
print(df.dropna(subset=['closePrice']).shape)
print(df.dropna(subset=['closePrice']))

Data size before filtering:
(8, 8)
Drop all rows that have any NaN values:
Data size after filtering:
(7, 8)
         secID   tradeDate      secShortName  ...  lowestPrice  closePrice  turnoverVol
0  000001.XSHE  2015-01-05      Ping An Bank  ...        15.60       16.02    286043643
2  000001.XSHE  2015-01-07      Ping An Bank  ...        15.30       15.48    170012067
3  000001.XSHE  2015-01-08      Ping An Bank  ...        14.90       14.96    140771421
4  000001.XSHE  2015-01-09      Ping An Bank  ...        14.71       15.08    250850023
5  601998.XSHG  2015-01-29  China CITIC Bank  ...         6.90        7.01     93003445
6  000001.XSHE  2015-01-06      Ping An Bank  ...        15.55       15.78    216642140
7  601998.XSHG  2015-01-30  China CITIC Bank  ...         6.92        6.95     68146718

[7 rows x 8 columns]
Drop only if all columns are NaN:
Data size after filtering:
(8, 8)
         secID   tradeDate      secShortName  ...  lowestPrice  closePrice  turnoverVol
0  000001.XSHE  2015-01-05      Ping An Bank  ...        15.60       16.02    286043643
1  601998.XSHG  2015-01-28  China CITIC Bank  ...         6.95         NaN    163146128
2  000001.XSHE  2015-01-07      Ping An Bank  ...        15.30       15.48    170012067
3  000001.XSHE  2015-01-08      Ping An Bank  ...        14.90       14.96    140771421
4  000001.XSHE  2015-01-09      Ping An Bank  ...        14.71       15.08    250850023
5  601998.XSHG  2015-01-29  China CITIC Bank  ...         6.90        7.01     93003445
6  000001.XSHE  2015-01-06      Ping An Bank  ...        15.55       15.78    216642140
7  601998.XSHG  2015-01-30  China CITIC Bank  ...         6.92        6.95     68146718

[8 rows x 8 columns]
Drop rows who do not have at least six values that are not NaN:
Data size after filtering:
(8, 8)
         secID   tradeDate      secShortName  ...  lowestPrice  closePrice  turnoverVol
0  000001.XSHE  2015-01-05      Ping An Bank  ...        15.60       16.02    286043643
1  601998.XSHG  2015-01-28  China CITIC Bank  ...         6.95         NaN    163146128
2  000001.XSHE  2015-01-07      Ping An Bank  ...        15.30       15.48    170012067
3  000001.XSHE  2015-01-08      Ping An Bank  ...        14.90       14.96    140771421
4  000001.XSHE  2015-01-09      Ping An Bank  ...        14.71       15.08    250850023
5  601998.XSHG  2015-01-29  China CITIC Bank  ...         6.90        7.01     93003445
6  000001.XSHE  2015-01-06      Ping An Bank  ...        15.55       15.78    216642140
7  601998.XSHG  2015-01-30  China CITIC Bank  ...         6.92        6.95     68146718

[8 rows x 8 columns]
Drop only if NaN in specific column:
Data size after filtering:
(7, 8)
         secID   tradeDate      secShortName  ...  lowestPrice  closePrice  turnoverVol
0  000001.XSHE  2015-01-05      Ping An Bank  ...        15.60       16.02    286043643
2  000001.XSHE  2015-01-07      Ping An Bank  ...        15.30       15.48    170012067
3  000001.XSHE  2015-01-08      Ping An Bank  ...        14.90       14.96    140771421
4  000001.XSHE  2015-01-09      Ping An Bank  ...        14.71       15.08    250850023
5  601998.XSHG  2015-01-29  China CITIC Bank  ...         6.90        7.01     93003445
6  000001.XSHE  2015-01-06      Ping An Bank  ...        15.55       15.78    216642140
7  601998.XSHG  2015-01-30  China CITIC Bank  ...         6.92        6.95     68146718

[7 rows x 8 columns]

When there is missing data, you may not want to discard it all. dataframe.fillna(value=value) fills missing values with a specified value:

print(df.fillna(value=20150101).head())

         secID   tradeDate      secShortName  ...  lowestPrice   closePrice  turnoverVol
0  000001.XSHE  2015-01-05      Ping An Bank  ...        15.60        16.02    286043643
1  601998.XSHG  2015-01-28  China CITIC Bank  ...         6.95  20150101.00    163146128
2  000001.XSHE  2015-01-07      Ping An Bank  ...        15.30        15.48    170012067
3  000001.XSHE  2015-01-08      Ping An Bank  ...        14.90        14.96    140771421
4  000001.XSHE  2015-01-09      Ping An Bank  ...        14.71        15.08    250850023

[5 rows x 8 columns]

4.3.3 data operation

The Series and DataFrame classes provide statistical methods such as mean() and sum(); specify 0 to compute by column and 1 to compute by row:

print(df.mean(0))

openPrice       1.236375e+01
highestPrice    1.268125e+01
lowestPrice     1.210375e+01
closePrice      1.230375e+01
turnoverVol     1.735769e+08
dtype: float64

The value_counts function conveniently counts value frequencies:

print(df['closePrice'].value_counts().head())

16.02    1
7.15     1
15.48    1
14.96    1
15.08    1
Name: closePrice, dtype: int64

In pandas, a Series can call the map function to apply a function to each element; a DataFrame can call the apply function to apply a function to each column (or row), and applymap applies a function to every element. The function can be a user-defined lambda or any existing function. The following example scales the closing price to the [0, 1] range:

print(df[['closePrice']].apply(lambda x: (x - x.min()) / (x.max() - x.min())).head())

   closePrice
0    1.000000
1    0.022051
2    0.940463
3    0.883131
4    0.896362
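The map and applymap variants mentioned above can be sketched on toy data (apply works per column, the other two element-wise):

```python
import pandas as pd

s = pd.Series([1.0, 4.0, 9.0])
print(s.map(lambda x: x ** 0.5))       # element-wise on a Series

df = pd.DataFrame({'a': [1, 2], 'b': [3, 4]})
print(df.applymap(lambda x: x * 10))   # element-wise on every cell
print(df.apply(sum))                   # per column: a -> 3, b -> 7
```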

Using append, you can add elements to the end of a Series or add a row to the end of a DataFrame:

dat1 = df[['secID', 'tradeDate', 'closePrice']].head()
dat2 = df[['secID', 'tradeDate', 'closePrice']].iloc[2]
print("Before appending:")
print(dat1)
dat = dat1.append(dat2, ignore_index=True)
print("After appending:")
print(dat)

Before appending:
         secID   tradeDate  closePrice
0  000001.XSHE  2015-01-05       16.02
1  601998.XSHG  2015-01-28        7.15
2  000001.XSHE  2015-01-07       15.48
3  000001.XSHE  2015-01-08       14.96
4  000001.XSHE  2015-01-09       15.08
After appending:
         secID   tradeDate  closePrice
0  000001.XSHE  2015-01-05       16.02
1  601998.XSHG  2015-01-28        7.15
2  000001.XSHE  2015-01-07       15.48
3  000001.XSHE  2015-01-08       14.96
4  000001.XSHE  2015-01-09       15.08
5  000001.XSHE  2015-01-07       15.48

DataFrames can be merged as in SQL. We introduced the concat function earlier, which is one way of combining; another is the merge function, where you specify which columns to merge on. The following example merges the data on security ID and trade date:

dat1 = df[['secID', 'tradeDate', 'closePrice']]
dat2 = df[['secID', 'tradeDate', 'turnoverVol']]
dat = dat1.merge(dat2, on=['secID', 'tradeDate'])
print("The first DataFrame:")
print(dat1.head())
print("The second DataFrame:")
print(dat2.head())
print("Merged DataFrame:")
print(dat.head())

The first DataFrame:
         secID   tradeDate  closePrice
0  000001.XSHE  2015-01-05       16.02
1  601998.XSHG  2015-01-28        7.15
2  000001.XSHE  2015-01-07       15.48
3  000001.XSHE  2015-01-08       14.96
4  000001.XSHE  2015-01-09       15.08
The second DataFrame:
         secID   tradeDate  turnoverVol
0  000001.XSHE  2015-01-05    286043643
1  601998.XSHG  2015-01-28    163146128
2  000001.XSHE  2015-01-07    170012067
3  000001.XSHE  2015-01-08    140771421
4  000001.XSHE  2015-01-09    250850023
Merged DataFrame:
         secID   tradeDate  closePrice  turnoverVol
0  000001.XSHE  2015-01-05       16.02    286043643
1  601998.XSHG  2015-01-28        7.15    163146128
2  000001.XSHE  2015-01-07       15.48    170012067
3  000001.XSHE  2015-01-08       14.96    140771421
4  000001.XSHE  2015-01-09       15.08    250850023
The main parameters of merge:

- left: the left DataFrame to merge
- right: the right DataFrame to merge
- on: column or index level names to join on; they must exist in both DataFrames. If on is not given and left_index and right_index are both False, the intersection of the columns of the two DataFrames is used as the join key
- left_on: column or index level of the left DataFrame to use as the key; can be a column name, an index level name, or an array whose length equals the length of the DataFrame
- right_on: the same as left_on, but for the right DataFrame
- left_index: if True, use the index (row labels) of the left DataFrame as its join key; for a DataFrame with a MultiIndex (hierarchical index), the number of levels must match the number of join keys in the right DataFrame
- right_index: the same as left_index, but for the right DataFrame
- how: one of 'left', 'right', 'outer', 'inner'; the default is 'inner'. inner takes the intersection of the keys, outer the union; left keeps all rows of the left DataFrame and fills them in from the right, and right is the mirror image of left
- sort: sort the result by the join keys in lexicographic order; the default is False, since sorting can noticeably slow down large merges
- suffixes: tuple of string suffixes applied to overlapping column names; the default is ('_x', '_y')
- copy: always copy data from the passed DataFrame objects (the default is True), even when reindexing is not necessary
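To make the how parameter concrete, here is a minimal sketch with two invented toy frames (left, right and the column names key, x, y are illustrative, not from the stock data above):

```python
import pandas as pd

left = pd.DataFrame({'key': ['a', 'b', 'c'], 'x': [1, 2, 3]})
right = pd.DataFrame({'key': ['b', 'c', 'd'], 'y': [20, 30, 40]})

inner = left.merge(right, on='key', how='inner')     # keys in both: b, c
outer = left.merge(right, on='key', how='outer')     # union of keys: a, b, c, d
left_join = left.merge(right, on='key', how='left')  # all left keys: a, b, c
print(outer)
```

In the outer result, rows whose key appears in only one frame get NaN in the columns coming from the other frame.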

Another powerful feature of DataFrame is groupby, which makes it very convenient to group data. Here we average the opening price, highest price, lowest price, closing price and trading volume of each stock over January 2015:

df_grp = df.groupby('secID')
grp_mean = df_grp.mean()
print(grp_mean)

             openPrice  highestPrice  lowestPrice  closePrice   turnoverVol
secID
000001.XSHE  15.560000        15.988    15.212000   15.464000  2.128639e+08
601998.XSHG   7.036667         7.170     6.923333    7.036667  1.080988e+08
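groupby can also compute several statistics per group in one call via agg. A small self-contained sketch (the toy prices frame is invented to stand in for df above):

```python
import pandas as pd

# Toy price table standing in for the stock data above
prices = pd.DataFrame({
    'secID': ['000001.XSHE', '000001.XSHE', '601998.XSHG', '601998.XSHG'],
    'closePrice': [16.02, 15.48, 7.15, 6.95],
})
# Mean per group, as in the example above
grp_mean = prices.groupby('secID')['closePrice'].mean()
# agg computes several statistics per group at once
stats = prices.groupby('secID')['closePrice'].agg(['min', 'max', 'mean'])
print(stats)
```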

What if you want the latest record for each stock? drop_duplicates can do this: first sort the data by date, then de-duplicate by security ID:

df2 = df.sort_values(by=['secID', 'tradeDate'], ascending=[True, False])
print(df2.drop_duplicates(subset='secID'))

         secID   tradeDate      secShortName  ...  lowestPrice  closePrice  turnoverVol
4  000001.XSHE  2015-01-09      Ping An Bank  ...        14.71       15.08    250850023
7  601998.XSHG  2015-01-30  China CITIC Bank  ...         6.92        6.95     68146718

[2 rows x 8 columns]

If you want to keep the oldest record instead, take the last row of each group after the descending sort by specifying keep='last' (by default the first row is kept):

print(df2.drop_duplicates(subset='secID', keep='last'))

         secID   tradeDate      secShortName  ...  lowestPrice  closePrice  turnoverVol
0  000001.XSHE  2015-01-05      Ping An Bank  ...        15.60       16.02    286043643
1  601998.XSHG  2015-01-28  China CITIC Bank  ...         6.95        7.15    163146128

[2 rows x 8 columns]
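The sort-then-deduplicate pattern is easy to test on a tiny invented table (the frame trades below is illustrative only):

```python
import pandas as pd

# Toy table: two trades for stock 'A', one for 'B'
trades = pd.DataFrame({
    'secID': ['A', 'A', 'B'],
    'tradeDate': ['2015-01-05', '2015-01-09', '2015-01-30'],
})
# Sort by security ascending and date descending, as above
trades = trades.sort_values(by=['secID', 'tradeDate'], ascending=[True, False])
newest = trades.drop_duplicates(subset='secID')               # first row per secID
oldest = trades.drop_duplicates(subset='secID', keep='last')  # last row per secID
```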

4.3.4 data visualization

pandas data can be plotted directly for a quick look. In the following example we plot the closing price of Sinopec in January. set_index('tradeDate')['closePrice'] makes the 'tradeDate' column the index and returns the 'closePrice' column as a Series, whose plot method draws the chart; more plotting parameters can be found in the matplotlib documentation.

dat = df[df['secID'] == '600028.XSHG'].set_index('tradeDate')['closePrice']
dat.plot(title="Close Price of SINOPEC (600028) during Jan, 2015")

5, Code quick reference

Data creation
- Create a DataFrame: df = DataFrame({'one': [1, 2, 3], 'two': ['a', 'b', 'c']}, index=['a', 'b', 'c']) (index: row names; columns: column names). A DataFrame can also be built from nested lists: df = DataFrame([[1, 2, 3], ['a', 'b', 'c']], columns=['one', 'two', 'three'])
- Create a date Series: pd.date_range('20210101', periods=5) (periods: number of periods to generate)
- Concatenate Series: df = pd.concat([Series([1, 2, 3]), Series(['a', 'b', 'c'])], axis=1) (axis=1 concatenates horizontally, axis=0 vertically)

Basic data access
- View the data size: df.shape
- View the first n rows: df.head(n) (n defaults to 5)
- View the last n rows: df.tail(n)
- View statistics of the numeric columns: df.describe() (count, mean, standard deviation, minimum, maximum)
- Access a column: df['one'] returns a Series; df[['one']] and df[['one', 'two']] return a DataFrame
- Access a row: df.iloc[0], df.iloc[0:2] (iloc selects rows by position; loc selects rows by label, e.g. df.loc['a']; a slice comes back as a DataFrame, a single row as a Series)
- Access rows + columns: df.loc[['a', 'b']][['one']], df.iloc[0:2][['one']]
- Access a single value: df.iloc[0, 0], df.loc['a', 'one']

Data filtering
- Rows matching a condition: df[df['one'] > 1] (rows whose value in column 'one' is greater than 1)
- Rows containing a string: df[df['two'].str.contains('a')] (rows whose column 'two' contains the string 'a')
- Rows whose value is in a set: df[df['one'].isin([1, 2])] (rows whose value in column 'one' is 1 or 2)
- Rows with a missing value: df[df['one'].isnull()] (rows whose column 'one' is missing)
- Rows without a missing value: df[df['one'].notnull()] (rows whose column 'one' is not missing)

Data statistics
- Mean: df.mean(0) (0 computes by column, 1 by row)
- Sum: df.sum(0)
- Value frequencies: df['one'].value_counts() (number of occurrences of each value in column 'one')

Data operations
- Sort by column (row) labels: df.sort_index(axis=1, ascending=True) (axis=1 sorts by column names, axis=0 by row names; ascending=True sorts ascending, False descending)
- Sort by values: df.sort_values(by=['one', 'two'], ascending=[True, True]) (by: columns to sort on)
- De-duplicate: df.drop_duplicates(subset=['one']) (de-duplicates on column 'one', keeping only the first row)
- Merge: df.merge(df2, on=['one']) (on: columns used as merge keys; how: 'left', 'right', 'outer' or 'inner', default 'inner', where inner takes the intersection of the keys, outer the union, left keeps all left rows and fills from the right, and right is the mirror image of left; suffixes: renames overlapping columns, default ('_x', '_y'))
- Apply a function to each column (row): df[['one']].apply(lambda x: x + 1) (lambda: a custom function)
- Apply a function to each element: df[['one']].applymap(lambda x: x + 1)
- Append rows: df.append(df2, ignore_index=True) (ignore_index=True renumbers the rows from 0; False keeps the original row labels)
- Drop a column: df.drop('one', axis=1)
- Group: df.groupby('one') (groups the data by column 'one'; the result is a DataFrameGroupBy object, which can be wrapped in list() to inspect its contents)
- Convert all elements to strings: df.astype(str)
- Convert one column to strings: df['one'].astype(str)
- Split a column on commas into separate rows: df.drop('two', axis=1).join(df['two'].str.split(',', expand=True).stack().reset_index(level=1, drop=True).rename('tag')) (rename: the name of the new column after splitting)
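The last entry (splitting a comma-separated column into one row per value) is the least obvious, so here is a minimal sketch with a small invented frame:

```python
import pandas as pd

df = pd.DataFrame({'one': [1, 2], 'two': ['a,b', 'c']}, index=['r1', 'r2'])
# Split 'two' on commas, stack the pieces into rows,
# then join them back to the remaining columns on the row index
tags = (df['two'].str.split(',', expand=True)
                 .stack()
                 .reset_index(level=1, drop=True)
                 .rename('tag'))
result = df.drop('two', axis=1).join(tags)
print(result)
```

Row r1 is duplicated, once per tag ('a' and 'b'), while r2 keeps its single tag 'c'.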

6, Interaction between Pandas and Excel

6.1 reading Excel data

Parse the data of an Excel sheet into a DataFrame:

import pandas as pd

with pd.ExcelFile('saw.xlsx') as xlsx:
    # List the sheet names
    names = xlsx.sheet_names
    # Read the data of sheet 'Sheet1'
    # na_values: extra strings to interpret as NaN
    # keep_default_na=False: parse empty cells as the empty string ''
    # (the default True parses them as NaN)
    # index_col=0 would additionally use the first column as the index
    df = pd.read_excel(xlsx, 'Sheet1', na_values=['NA'], keep_default_na=False)

6.2 writing Excel data

Write a DataFrame to an Excel sheet:

import pandas as pd
from openpyxl import load_workbook

wb = load_workbook('saw.xlsx')
with pd.ExcelWriter('saw.xlsx', engine='openpyxl') as xlsx:
    # Attach the existing workbook so its sheets are preserved
    xlsx.book = wb
    # Write df to sheet 'Sheet2' without the index or the column names
    df.to_excel(xlsx, sheet_name='Sheet2', index=False, header=None)

Note: calling the ExcelWriter method directly would overwrite the original workbook, so here it is combined with the openpyxl module to append a sheet instead (newer pandas versions can also open the writer with mode='a').

7, Reference articles

Python stock quantitative trading tutorial

Tags: Python Pycharm data structure pandas

Posted on Sun, 05 Sep 2021 17:55:14 -0400 by tony.j.jackson@o2.co.uk