What if the COVID-19 epidemic were a stock factor

This article uses the WindAlpha single-factor research framework on the wankuang platform to explore the impact of the COVID-19 epidemic on the stock market from the perspective of a stock factor. IC analysis finds a significant negative correlation between the severity of the epidemic and stock returns; the correlation strengthens as the lag grows and peaks on the 10th day. When stocks are grouped by the COVID-19 factor, the first group's return is significantly higher than that of the other groups, and the long-short portfolio outperforms the market.


1. Factor construction

First, obtain the time series of epidemic data for each province. For ways to obtain this time series, see the article "n methods of obtaining the historical data of COVID-19 epidemic". The data source here is the nCov2019 package for R:

library(nCov2019)
x <- load_nCov2019()
data <- summary(x)[1:3]
write.csv(data,"Data/nCovProvince.csv",row.names = FALSE)
## province	time	cum_confirm
## Shanghai 2020 / 2 / 10 302
## Yunnan 2020 / 2 / 10 149
## Inner Mongolia 2020 / 2 / 10 58
## Beijing 2020 / 2 / 10 342
## Taiwan 2020 / 2 / 10 18

Then, on the wankuang platform, extract the registered address and office address of every non-ST, non-PT A-share and find the province each address belongs to. Using that province, look up the cumulative number of confirmed cases in the epidemic data obtained above, and take that number as the stock's factor value for the day.

from WindPy import * #api
from datetime import datetime
from scipy import stats, optimize
from WindCharts import *
import numpy as np
import pandas as pd
import WindAlpha as wa

# Read epidemic data: before reading, you need to upload the csv file to the "data file / nCov" folder
nCov_province = pd.read_csv("data/nCov/nCovProvince.csv",encoding = "gbk",index_col = ["time", "province"])
start_date = nCov_province.index[-1][0]
end_date = nCov_province.index[1][0]

def get_province(address):
    # Return the province corresponding to the address according to the address information
    province_list = ['Shanghai','Yunnan','Inner Mongolia','Beijing','Taiwan','Jilin','Sichuan','Tianjin','Ningxia','Anhui','Shandong',
                 'Hunan','Macao','Gansu','Fujian','Tibet','Guizhou','Liaoning','Chongqing','Shaanxi','Qinghai','Hong Kong','Heilongjiang']
    for province in province_list:
        if province in address:
            return province
    # No known province name appears in the address
    return None

def factor_prepare(nCov_province):
    # Get the list of trading days and convert it to string format
    trade_dates = w.tdays(start_date, end_date, period="d").Data[0]
    trade_dates = [dt.strftime("%Y-%m-%d") for dt in trade_dates]
    # Obtain all A share codes of non ST, non PT
    stock_set = w.wset("sectorconstituent", "date="+start_date+";sectorId=a001010f00000000;field=wind_code", usedf=True)
    stock_list = list(stock_set[1]['wind_code'])
    # Obtain the registered place and office address of the company and extract the province
    raw_data=w.wss(stock_list, "address,office", "rptDate= %s" %(start_date),usedf=True)[1]
    raw_data["ADDRESS"] = raw_data["ADDRESS"].map(get_province)
    raw_data["OFFICE"] = raw_data["OFFICE"].map(get_province)
    # Arrange the data in the layout required for factors: rows indexed by (trade date, stock code)
    index = pd.MultiIndex.from_product([trade_dates, stock_list])
    factor = pd.DataFrame(columns=raw_data.columns, index=index)
    for trade_date in trade_dates:
        factor.loc[trade_date,"ADDRESS"] = np.asarray(raw_data["ADDRESS"])
        factor.loc[trade_date,"OFFICE"] = np.asarray(raw_data["OFFICE"])
    # Map each company's province to that day's cumulative confirmed count, which becomes the final factor value
    for trade_date in trade_dates:
        for security_code in stock_list:
            for column in factor.columns:
                security_province = factor.loc[(trade_date,security_code),column]
                try:
                    # The lookup can fail because the province had no cases on that date; fall back to 0
                    factor.loc[(trade_date,security_code),column] = nCov_province.loc[(trade_date,security_province), "cum_confirm"]
                except KeyError:
                    factor.loc[(trade_date,security_code),column] = 0
    # The date level of the index must be converted to datetime before the factor is uploaded
    date_index = [datetime.strptime(dt,"%Y-%m-%d") for dt in factor.index.levels[0]]
    factor.index = factor.index.set_levels(date_index, level=0)
    inds_ret = wa.load_local_factors(factor)
    return inds_ret
raw_ind_ret = factor_prepare(nCov_province)

In the uploaded factor data, the last two columns are added automatically by the wankuang platform during the upload. MKT_CAP_ASHARE is the market value of A-shares (excluding restricted shares), and NEXT_RET is the stock's return over the next period.

2. Factor preprocessing

Next, preprocess the factors:

  • Missing value handling: rows with missing values are dropped
  • Outlier removal: outlier handling would strip out Hubei Province's extreme values, so it is skipped
  • Standardization: market-cap-weighted standardization is used, which also neutralizes the factor with respect to market cap
  • Industry neutralization: to remove the potential influence of industry on factor performance, a linear regression is fitted with the original factor value as the dependent variable and industry dummy variables as the independent variables. The regression residual is the part of the original factor that industry cannot explain, so it is taken as the new, neutralized factor.

# The steps above are all handled by WindAlpha's process_raw_data function
processed_inds_ret = wa.process_raw_data(raw_ind_ret, missing_method = "dele",
                                      extreme_method = False, scale_method = "cap",
                                      neutralize_method = 'sw_1',isinclude_cap = False)
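
The two less obvious steps in the list above, cap-weighted standardization and industry neutralization, can be illustrated with a minimal pure-Python sketch. This is not WindAlpha's implementation: the function names and the toy data are hypothetical, and the standardization formula (subtract the cap-weighted mean, divide by the plain standard deviation) is an assumption about what "market value weighted standardization" means here.

```python
from collections import defaultdict
from statistics import pstdev


def cap_weighted_standardize(values, caps):
    """Subtract the cap-weighted mean, then divide by the population standard deviation."""
    weighted_mean = sum(v * c for v, c in zip(values, caps)) / sum(caps)
    spread = pstdev(values)
    return [(v - weighted_mean) / spread for v in values]


def industry_neutralize(values, industries):
    """Residual of an OLS regression on one dummy per industry (no intercept).

    For that regression the fitted value is simply the industry mean,
    so the residual is the value minus its industry's mean.
    """
    totals = defaultdict(float)
    counts = defaultdict(int)
    for value, industry in zip(values, industries):
        totals[industry] += value
        counts[industry] += 1
    return [v - totals[i] / counts[i] for v, i in zip(values, industries)]


# Hypothetical toy cross-section: four stocks in two industries
raw = [10.0, 20.0, 30.0, 50.0]
caps = [1.0, 1.0, 2.0, 4.0]
industries = ["bank", "bank", "tech", "tech"]

scaled = cap_weighted_standardize(raw, caps)
residuals = industry_neutralize(raw, industries)
print(residuals)  # [-5.0, 5.0, -10.0, 10.0]
```

After standardization the cap-weighted mean of the factor is zero, and after neutralization every industry's average factor value is zero, which is why neither market cap nor industry membership can drive the grouped results on its own.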

Comparison between the original factor and the processed factor:

import seaborn as se
import matplotlib.pyplot as plt
fig = plt.figure(figsize=(7,5))
plt.suptitle(u"Processed Data VS Raw Data(Date:2020-02-04)")

2. Factor analysis

2.1 IC analysis

The information coefficient (IC) is the correlation between factor values in each period and stock returns in the next period. Its magnitude reflects the strength of the linear relationship between factor exposure and stock performance: the larger the absolute IC, the more predictive the factor and the better its stock-selection power. This article computes the IC with the RankIC method, i.e. the cross-sectional correlation between the ranks of factor exposures and the ranks of next-period stock returns.
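
As an illustration of the RankIC definition above, here is a minimal pure-Python sketch for a single cross-section. It is not WindAlpha's code: the helper names are hypothetical, and the returns are made-up numbers (the factor values reuse the confirmed counts from the sample data shown earlier).

```python
from math import sqrt


def average_ranks(values):
    """Rank values from 1 to n, giving tied values the average of their ranks."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    ranks = [0.0] * len(values)
    pos = 0
    while pos < len(order):
        end = pos
        # Extend the run while the next sorted value equals the current one
        while end + 1 < len(order) and values[order[end + 1]] == values[order[pos]]:
            end += 1
        shared_rank = (pos + end) / 2 + 1
        for k in range(pos, end + 1):
            ranks[order[k]] = shared_rank
        pos = end + 1
    return ranks


def pearson(x, y):
    """Plain Pearson correlation coefficient of two equal-length sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def rank_ic(factor_values, next_returns):
    """RankIC: Pearson correlation of the ranks (Spearman correlation)."""
    return pearson(average_ranks(factor_values), average_ranks(next_returns))


# Factor: cumulative confirmed cases; returns: hypothetical next-period returns
factor_values = [342, 302, 149, 58, 18]
next_returns = [-0.02, -0.01, 0.00, 0.01, 0.03]
print(rank_ic(factor_values, next_returns))  # -1.0
```

In this toy cross-section the stock with the most cases has the worst return and so on down the list, so the ranks are perfectly reversed and the RankIC is -1.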

# Call the IC analysis function of WindAlpha directly
ic_ana = wa.ic_analysis(processed_inds_ret)
ind = "OFFICE"
fig_ic=WLine("IC Sequence:{}".format(ind),"{}-{}".format(start_date, end_date), ic_ana.ic_series.loc[ind])

2.1.1 IC signal attenuation

The IC series above measures the correlation between the current period's factor value and the next period's return, i.e. the factor and the return are one period apart. IC decay describes the correlation between the factor value and the return LAG periods later. Given N periods of factor and return data, first compute the IC between the period-i factor and the period-(i+1) return for every i and average them, then do the same for the period-i factor and the period-(i+2) return, and so on (i = 1, ..., N-LAG). The result is one average IC per lag, and together these averages show how the IC decays.

         ADDRESS     OFFICE
LAG0   -0.016098  -0.014985
LAG1   -0.037245  -0.035465
LAG2   -0.055568  -0.054450
LAG3   -0.081646  -0.080446
LAG4   -0.100494  -0.099304
LAG5   -0.114949  -0.112260
LAG6   -0.131797  -0.132035
LAG7   -0.142606  -0.145498
LAG8   -0.143235  -0.149350
LAG9   -0.141615  -0.150159
LAG10  -0.146176  -0.154830
LAG11  -0.144909  -0.152806
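
The averaging behind a table like this can be sketched in a few lines of pure Python. This is only an illustration under stated assumptions, not WindAlpha's implementation: the function names and toy data are hypothetical, `returns[t]` is assumed to already hold the return of the period after `t` (so lag 0 is the ordinary one-period-ahead IC), and ties are ignored when ranking.

```python
from math import sqrt


def simple_ranks(values):
    """Rank values from 1 to n (assumes there are no ties)."""
    order = sorted(range(len(values)), key=values.__getitem__)
    ranks = [0] * len(values)
    for rank, idx in enumerate(order, start=1):
        ranks[idx] = rank
    return ranks


def rank_ic(factor_values, period_returns):
    """Pearson correlation between the ranks of the two cross-sections."""
    x, y = simple_ranks(factor_values), simple_ranks(period_returns)
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


def ic_decay(factors, returns, max_lag):
    """Average IC for each lag: factor of period t against returns[t + lag]."""
    decay = []
    for lag in range(max_lag + 1):
        ics = [rank_ic(factors[t], returns[t + lag])
               for t in range(len(factors) - lag)]
        decay.append(sum(ics) / len(ics))
    return decay


# Hypothetical data: three periods, three stocks per cross-section
factors = [[1, 2, 3], [1, 2, 3], [1, 2, 3]]
returns = [[0.3, 0.2, 0.1], [0.3, 0.2, 0.1], [0.1, 0.2, 0.3]]
print(ic_decay(factors, returns, max_lag=2))
```

Each additional lag drops one period from the average, which is why the number of usable periods shrinks to N-LAG as described above.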

Visualizing the IC decay results reveals a clear pattern: the severity of the epidemic is significantly negatively correlated with stock returns, and the correlation strengthens as the lag grows, peaking on the 10th day (LAG10).

# The IC decay results suggest that epidemic data for the office location has the greater impact on the stock market
ind = "OFFICE"
fig_decay=WBar('{} IC Decay'.format("COVID-19"), '',ic_ana.ic_decay[ind].to_frame())

2.2 Return analysis

Stocks are sorted by factor value and divided into five groups, and a long-short portfolio is built that goes long the first group and short the last group (G01-G05). The market-cap-weighted average return of each group in each period is then used directly to compute each group's cumulative return and other indicators.
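
The grouping described above can be sketched for a single period as follows. This is a minimal illustration, not WindAlpha's `return_analysis`: the function name, the `descending` flag, and all numbers are hypothetical, and which end of the sort becomes G01 is simply a parameter here.

```python
def group_returns(factor, next_ret, cap, n_groups=5, descending=False):
    """Cap-weighted next-period return of each factor-sorted group.

    Stocks are sorted by factor value and cut into n_groups buckets of
    (nearly) equal size; requires at least n_groups stocks.
    """
    order = sorted(range(len(factor)), key=lambda i: factor[i], reverse=descending)
    n = len(order)
    results = {}
    for g in range(n_groups):
        members = order[g * n // n_groups:(g + 1) * n // n_groups]
        total_cap = sum(cap[i] for i in members)
        weighted = sum(next_ret[i] * cap[i] for i in members)
        results["G%02d" % (g + 1)] = weighted / total_cap
    return results


# Hypothetical cross-section of ten stocks
factor = [10, 20, 30, 40, 50, 60, 70, 80, 90, 100]
next_ret = [0.04, 0.02, 0.03, 0.01, 0.00, 0.00, -0.01, -0.01, -0.02, -0.04]
cap = [1, 3, 1, 1, 1, 1, 1, 1, 1, 3]

groups = group_returns(factor, next_ret, cap)
long_short = groups["G01"] - groups["G05"]  # long first group, short last group
print(groups)
print(long_short)
```

Compounding these per-period group returns over time gives cumulative return curves of the kind plotted for G01 to G05, the long-short portfolio, and the benchmark.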

# Use the Wind All A Index (881001.WI) as the benchmark
direction_dict = {ind: 'descending'}
ret_ana = wa.return_analysis(processed_inds_ret,'881001.WI',start_date, end_date,ind_direction=direction_dict)

Visualizing the results of each group shows that the first group's return is significantly higher than that of the other groups, and that the long-short portfolio outperforms the market.

sig_ret_line = WLine("Long and short portfolio yield:{}".format("COVID-19"),"{}-{}".format(start_date, end_date) , round(ret_ana.group_cum_return[['G01','G02','G03','G04','G05','G01-G05','BENCH_RET']].loc[ind],4),auto_yaxis=True)

That is all for this article. You are welcome to follow me on Zhihu, Jianshu, CSDN, or the WeChat official account PurePlay, where I share research and learning notes from time to time.


Tags: R Language encoding

Posted on Sat, 15 Feb 2020 03:25:17 -0500 by networkguy