Become a life winner: use Python to crawl fund data and filter stocks

Preface:

Heard you want to get rich? Then calm down and take it slow. Haven't you heard the famous saying: "The poor are never willing to get rich slowly." Everyone dreams of getting rich overnight, of hitting the lottery jackpot, but how could anyone be that lucky? You can't all be like me, winning 780,000 in the lottery and quietly laughing to yourself.

If you want to get rich slowly, you just have to manage your money well: small money comes from saving, big money from earning! I, too, used to dream of financial freedom, which is how I got the idea of learning to manage money. Speaking of managing money, there are all kinds of financial products: gold, futures, stocks, funds, and so on. Since this article crawls stocks and funds, let me first briefly introduce those two.

A stock (English: stock), or capital stock (English: capital stock), is a security through which a joint-stock company allocates its ownership. Because a joint-stock company needs to raise funds, it issues shares to investors as proof of partial ownership of the company's capital. Shareholders receive dividends and share in the profits brought by the company's growth or market fluctuations, but they also share the risks brought by the company's operational mistakes. (from Wikipedia)

Funds: let me give an example. You have money and want to buy stocks, but you know nothing about stocks. I have no money, but I have rich financial knowledge and stock experience; I'm a good money manager. So we team up: you give me your money, I use it to invest, and when it earns, I take a small cut. Here, "I" refers to the fund. (my own understanding)

Overall, stocks offer high returns with high risk. Funds have lower returns and lower risk, because they buy many stocks, which will almost never all rise or all fall at once, so funds are more resistant to risk. Now, suppose I don't want to buy a fund; I want to pick stocks myself, but I don't know which ones. Well, as nice as that sounds, there is always a way. From our analysis, a fund is an institution that has bought many stocks, run by financial experts. We can look at which stocks the funds have bought and follow them: after all, they don't want to lose money, so they will pick stocks with potential.

Main Text

This article uses Python to crawl fund data from a financial website, fetching the stocks held by 5000+ funds and then processing the data.

In recent years there have been many cases of entire companies getting in trouble over crawled data, so let me say it here: refuse to use crawlers for anything illegal, be resolutely law-abiding, do good deeds without leaving a name, and help grandmas cross the road. I hope the police won't take me away over this crawling article.

Knowledge points covered in this article:

1. Python strings: splitting, joining, detecting Chinese characters;

2. Python regular expressions;

3. crawling: the requests library, extracting data with XPath, proxy servers;

4. selenium usage: headless browser, element locating, explicit waits, data extraction;

5. operating MongoDB from Python.

Website Analysis

The code and data are posted later; first let's analyze the target website, which will make the crawling process clearer.

Target site: http://fund.eastmoney.com/data/fundranking.html#tall;c0;r;szzf;pn50;ddesc;qsd20181126;qed20191126;qdii;zq;gg;gzbd;gzfs;bbzt;sfbb

What we crawl is the open-end fund ranking data:

We can open any fund and go to its details page. Have you noticed? The details page URL is just http://fund.eastmoney.com/ combined with the fund's six-digit code shown on the first page, for example:

040011, Hua'an Core Optimal Mixed: http://fund.eastmoney.com/040011.html

005660, Jiashi Resource Selected Stock A: http://fund.eastmoney.com/005660.html

OK, scroll down a fund's details page and we can find the fund's holdings information, that is, which stocks the fund has bought:

Then click through to the fund's holdings details page and scroll down to see the fund's third-quarter stock holdings:

Yes, this is the target data, the data we want to crawl;

OK, before crawling, let's also analyze the fund's holdings details page. Its URL is regular too: http://fundf10.eastmoney.com/ccmx_ combined with the fund code, for example:

005660, holdings details page of Jiashi Resource Selected Stock A: http://fundf10.eastmoney.com/ccmx_005660.html

006921, holdings details page of Southern Intelligence Mixed: http://fundf10.eastmoney.com/ccmx_006921.html
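Since both URL patterns are just a fixed prefix plus the six-digit fund code, they are easy to compose in code. A minimal sketch (the helper names here are my own, for illustration only):

def fund_detail_url(fund_code):
    # Fund details page, e.g. http://fund.eastmoney.com/005660.html
    return "http://fund.eastmoney.com/%s.html" % fund_code

def fund_holdings_url(fund_code):
    # Fund holdings (ccmx) details page, e.g. http://fundf10.eastmoney.com/ccmx_005660.html
    return "http://fundf10.eastmoney.com/ccmx_%s.html" % fund_code

print(fund_detail_url("040011"))     # http://fund.eastmoney.com/040011.html
print(fund_holdings_url("006921"))   # http://fundf10.eastmoney.com/ccmx_006921.html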


Because these data are loaded dynamically by JS, they are hard to crawl with requests alone, and selenium is generally used to simulate browser behavior in such cases. However, selenium is actually rather inefficient. In fact, we can still crawl with requests: "JS dynamic loading" just means that JS code in the HTML page fetches the data from the server after the page loads, which is why the data isn't in the HTML we crawl at first. Only particularly difficult data needs selenium, whose motto is: as long as you can see the data, you can get it; after all, selenium mimics how a person operates a browser. Here we first analyze the JS dynamic loading and crawl with requests, then use selenium for the second crawl.

On the first page, press F12 to open the developer tools, then refresh the page.

You can see the data in the blue box on the right. This is the data returned by the dynamic JS request, which is then processed and displayed on the page. In fact, once you have this data, you don't need to crawl the first page at all.

Now click on Headers. The Request URL here is the URL that JS requests. Try opening it directly in your browser: it returns a pile of data. We analyzed the composition of the fund holdings page URL above, so all we need from this data are the six-digit fund codes, which we extract with a Python regular expression. From them we build each fund's holdings page URL, then crawl the stocks held by each fund from that page and store them.
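To make that extraction step concrete, here is a minimal sketch; the sample_text fragment is a simplified stand-in of my own, assuming the response carries each fund as a comma-separated "code,name,..." string:

import re

# sample_text is an illustrative stand-in for the rankhandler.aspx response, not real data.
sample_text = 'var rankData = {datas:["040011,FundNameA,...","005660,FundNameB,..."]}'

fund_codes = re.findall(r"\d{6}", sample_text)   # every six-digit fund code
urls = ["http://fundf10.eastmoney.com/ccmx_%s.html" % code for code in fund_codes]
print(urls)
# ['http://fundf10.eastmoney.com/ccmx_040011.html', 'http://fundf10.eastmoney.com/ccmx_005660.html']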

Crawl process:

1. Get the six-digit fund codes from the URL that JS requests when dynamically loading the first page,

then compose each fund's holdings details page URL: http://fundf10.eastmoney.com/ccmx_ + fund code + .html;

2. Crawl each fund's holdings details page. It is also loaded dynamically by JS (though it loads faster), and we need to check whether the fund actually holds any stocks (some funds buy no stocks at all, and who knows what they are doing), so we use selenium here, together with an explicit wait for the data to finish loading;

3. Organize the data and store it in MongoDB;

Code Explanation - Data Crawling:

This time we'll present the code in sections and explain each one.

Libraries needed:

import requests
import re
from lxml import etree
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import pymongo

First, prepare a few common helper methods:

# Determine whether a string contains Chinese characters
def is_contain_chinese(check_str):
    """
    Determine whether the string contains Chinese characters.
    :param check_str: {str} the string to check
    :return: {bool} True if it contains Chinese, False otherwise
    """
    for ch in check_str:
        if u'\u4e00' <= ch <= u'\u9fff':
            return True
    return False
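# Example (illustrative): is_contain_chinese("基金") returns True,
# while is_contain_chinese("040011") returns False.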
#selenium: check by class name whether an element exists; used to determine whether the fund's holdings page actually lists any held stocks.
def is_element(driver,element_class):
    try:
        WebDriverWait(driver,2).until(EC.presence_of_element_located((By.CLASS_NAME,element_class)))
    except:
        return False
    else:
        return True
#Method that requests a url with requests and returns the response text
def get_one_page(url):
    headers = {
        'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.142 Safari/537.36',
    }
    proxies = {
        "http": "http://XXX.XXX.XXX.XXX:XXXX"
    }

    response = requests.get(url,headers=headers,proxies=proxies)
    response.encoding = 'utf-8'
    if response.status_code == 200:
        return response.text
    else:
        print("Request Status Code != 200,url error.")
        return None
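# Example usage (illustrative; assumes the proxy placeholder above is replaced
# with a real proxy, or that the proxies argument is removed):
# html = get_one_page("http://fund.eastmoney.com/040011.html")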
#Request the first-page data directly, then compose the holdings page urls and the fund names and store them in lists.
def page_url():
    stock_url = []      #List storing the holdings details page url of each fund
    stock_name = []     #List storing each fund's name
    url = "http://fund.eastmoney.com/data/rankhandler.aspx?op=ph&dt=kf&ft=all&rs=&gs=0&sc=zzf&st=desc&sd=2018-11-26&ed=2019-11-26&qdii=&tabSubtype=,,,,,&pi=1&pn=10000&dx=1&v=0.234190661250681"
    result_text = get_one_page(url)
    # print(result_text.replace('\"',','))    # replace " with ,
    # print(result_text.replace('\"',',').split(','))    # split on ,
    # print(re.findall(r"\d{6}",result_text))     # the six-digit fund codes, returned as a list
    for i in result_text.replace('\"',',').split(','):  # replace " with , split on , then keep the strings containing Chinese (the fund names)
        result_chinese = is_contain_chinese(i)
        if result_chinese == True:
            stock_name.append(i)
    for numbers in re.findall(r"\d{6}",result_text):
        stock_url.append("http://Fundf10.eastmoney.com/ccmx_%s.html "%(numbers)#Save the stitched url s in the list;
    return stock_url,stock_name
#Method that uses selenium to crawl the names of the stocks held by a fund;
def hold_a_position(url):
    driver.get(url)  # Request the fund's holdings page (driver is created in __main__)
    element_result = is_element(driver, "tol")  # Use this element's existence to decide whether there is holdings information;
    if element_result == True:  # Crawl if there is holdings information;
        wait = WebDriverWait(driver, 3)  # Set a wait time
        wait.until(EC.presence_of_element_located((By.CLASS_NAME, 'tol')))  # Wait until this class appears;
        ccmx_page = driver.page_source  # Get the page source
        ccmx_xpath = etree.HTML(ccmx_page)  # Parse it for xpath
        ccmx_result = ccmx_xpath.xpath("//div[@class='txt_cont']//div[@id='cctable']//div[@class='box'][1]//td[3]//text()")
        return ccmx_result
    else:   #If there is no holdings information, return the string "null";
        return "null"

Note the page_url() method: the URL inside it is the request URL found in the JS dynamic-loading analysis above. Note the parameters at the end of the URL: pi is the page number and pn is how many records per page. I set pi=1 and pn=10000, meaning page one with 10,000 records per page (the actual data is less than that; there are only 5000+ funds), so everything is fetched at once.
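As an aside, the same request can also be written with requests' params argument, which makes pi and pn explicit. A minimal sketch, assuming the same endpoint and headers as get_one_page() (the trailing v cache-busting value is omitted here):

import requests

params = {
    "op": "ph", "dt": "kf", "ft": "all", "rs": "", "gs": "0",
    "sc": "zzf", "st": "desc",
    "sd": "2018-11-26", "ed": "2019-11-26",
    "qdii": "", "tabSubtype": ",,,,,",
    "pi": "1",      # page number: the first page
    "pn": "10000",  # records per page: large enough to fetch everything at once
    "dx": "1",
}
headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.142 Safari/537.36'}
response = requests.get("http://fund.eastmoney.com/data/rankhandler.aspx",
                        params=params, headers=headers)
print(response.text[:200])  # peek at the start of the returned data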

Program start:

if __name__ == '__main__':
    # Create a connection to the mongodb database
    client = pymongo.MongoClient(host='XXX.XXX.XXX.XXX', port=XXXXX)  # Connect mongodb, host is ip, port is port
    db = client.db_spider  # Use (create) a database
    db.authenticate("User name", "Password")  # mongodb user name, password connection;
    collection = db.tb_stock  # Use (create) a set (table)

    stock_url, stock_name = page_url()     #Get the first page data, return the array of fund URLs and the array of fund names;

    #Browser Action
    chrome_options = Options()
    chrome_options.add_argument('--headless')
    driver = webdriver.Chrome(options=chrome_options)    #Initialize the browser, no browser interface;

    if len(stock_url) == len(stock_name):       #Determine if the number of fund URLs and names obtained is consistent
        for i in range(len(stock_url)):
            return_result = hold_a_position(stock_url[i])  # Crawl each fund's holdings page; returns the held stock names (a list)
            dic_data = {
                'fund_url':stock_url[i],
                'fund_name':stock_name[i],
                'stock_name':return_result
            }        #dic_data is the composing dictionary data, prepared for storage in mongodb;
            print(dic_data)
            collection.insert_one(dic_data)     #Insert dic_data into the mongodb database
    else:
        print("fund url And funds name Number of arrays is inconsistent, exit.")
        exit()

    driver.close()              #Close Browser

    #Query: Filter out non-null data
    find_stock = collection.find({'stock_name': {'$ne': 'null'}})  # Query for data where stock_name is not equal to null (excluding those funds that do not have positions in stocks);
    for i in find_stock:
        print(i)

Okay, that's all the data-crawling code; run it and sit back.

The project runs in a single process, so the crawl is slightly slow, and it is also affected by network speed; I'll improve it to multithreading later.
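One possible direction (a sketch of my own, not a finished solution): parallelize the holdings crawl with a thread pool, giving each worker thread its own headless browser, since a single selenium driver must not be shared across threads:

import threading
from concurrent.futures import ThreadPoolExecutor
from selenium import webdriver
from selenium.webdriver.chrome.options import Options

thread_local = threading.local()

def get_thread_driver():
    # One headless browser per worker thread; a shared driver is not thread-safe.
    if not hasattr(thread_local, "driver"):
        chrome_options = Options()
        chrome_options.add_argument('--headless')
        thread_local.driver = webdriver.Chrome(options=chrome_options)
    return thread_local.driver

def crawl_one(url):
    driver = get_thread_driver()
    driver.get(url)
    return driver.page_source  # parse with the same xpath as hold_a_position()

# stock_url is the list returned by page_url(); driver cleanup is omitted for brevity.
with ThreadPoolExecutor(max_workers=4) as pool:
    pages = list(pool.map(crawl_one, stock_url))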

Code Explanation - Data Processing:

The data was crawled and stored in the database above; here we process it into something usable.

First explain the idea:

1. We need the aggregate data of all stocks held by these funds, including the stocks duplicated across funds' positions.

2. We need to know which stocks are repeated, and how many times each.

That way, the most-repeated stocks are arguably the best picks, because it proves that many funds have bought them.

Looking at the code, the comments make it clear:

import pymongo

#1. Database: Connect libraries, use collections, and create documents;#
client = pymongo.MongoClient(host='XXX.XXX.XXX.XXX',port=XXXXX)  #Connect to mongodb database

db = client.db_spider       #Use (create) a database
db.authenticate("User name","Password")      #Authenticate user name, password

collection = db.tb_stock    #Use (create) a collection (table) that already stores the data crawled by the program above;
tb_result = db.tb_data      #Use (create) a collection (table) to store the last processed data;

#Query for data where stock_name is not equal to null, i.e. exclude the funds that hold no stocks;
find_stock = collection.find({'stock_name':{'$ne':'null'}})

#2. Process the data: merge all the funds' stock lists into one list --- list_stock_all #
list_stock_all = []     #Define an array to store all stock names, including duplicates;
for i in find_stock:
    print(i['stock_name'])    #The fund's held stocks (a list)
    list_stock_all = list_stock_all + i['stock_name']   #Combine all stock arrays into one array;
print("Total Stocks:" + str(len(list_stock_all)))

#3. Process the data: deduplicate the stocks#
list_stock_repetition = []  #Define an array to store the deduplicated stocks
for n in list_stock_all:
    if n not in list_stock_repetition:        #If it is not in the list yet
        list_stock_repetition.append(n)        #then add it, so duplicates are removed;
print("Number of stocks after deduplication:" + str(len(list_stock_repetition)))

#4. Combine the two arrays from steps 2 and 3 to filter the data#
for u in list_stock_repetition:        #Traverse the deduplicated stock array
    if list_stock_all.count(u) > 10:   #Count the stock's occurrences in the full array; keep it if it repeats more than 10 times
        #Make up a dictionary of data to store in mongodb;
        data_stock = {
            "name":u,
            "numbers":list_stock_all.count(u)
        }
        insert_result = tb_result.insert_one(data_stock)    #Store in mongodb
        print("Stock Name:" + u + " , Repeats:" + str(list_stock_all.count(u)))

In this way, the data is lightly processed and stored in the tb_data collection.
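As an aside, steps 2 to 4 above can also be written more compactly with collections.Counter, which counts and deduplicates in one pass. A sketch, assuming find_stock is a fresh query result like the one above:

from collections import Counter

counter = Counter()
for doc in find_stock:
    counter.update(doc['stock_name'])   # count every held stock across all funds

for name, numbers in counter.items():
    if numbers > 10:                    # same threshold as above
        tb_result.insert_one({"name": name, "numbers": numbers})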

Below is part of the processed data:

{'_id': ObjectId('5e0b5ecc7479db5ac2ec62c9'), 'name': 'Crystal Optoelectronics', 'numbers': 61}
{'_id': ObjectId('5e0b5ecc7479db5ac2ec62ca'), 'name': 'Ordinary people', 'numbers': 77}
{'_id': ObjectId('5e0b5ecc7479db5ac2ec62cb'), 'name': 'North Huachuang', 'numbers': 52}
{'_id': ObjectId('5e0b5ecc7479db5ac2ec62cc'), 'name': 'Golden Wind Technology', 'numbers': 84}
{'_id': ObjectId('5e0b5ecc7479db5ac2ec62cd'), 'name': 'Tianshun Wind Energy', 'numbers': 39}
{'_id': ObjectId('5e0b5ecc7479db5ac2ec62ce'), 'name': 'Shi Da Sheng Hua', 'numbers': 13}
{'_id': ObjectId('5e0b5ecc7479db5ac2ec62cf'), 'name': 'State Investment Power', 'numbers': 55}
{'_id': ObjectId('5e0b5ecc7479db5ac2ec62d0'), 'name': 'Sinopec', 'numbers': 99}
{'_id': ObjectId('5e0b5ecc7479db5ac2ec62d1'), 'name': 'PetroChina', 'numbers': 54}
{'_id': ObjectId('5e0b5ecc7479db5ac2ec62d2'), 'name': 'China Security', 'numbers': 1517}
{'_id': ObjectId('5e0b5ecc7479db5ac2ec62d3'), 'name': 'Guizhou Maotai', 'numbers': 1573}
{'_id': ObjectId('5e0b5ecc7479db5ac2ec62d4'), 'name': 'China Merchants Bank', 'numbers': 910}

The data has not been sorted or ranked in any way.
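If you want it ranked, MongoDB can sort it for you; a minimal sketch, reusing the tb_result collection and pymongo connection from above:

# Print the most widely held stocks first.
for doc in tb_result.find().sort('numbers', pymongo.DESCENDING):
    print(doc['name'], doc['numbers'])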

Interpreting the data:

PetroChina's numbers is 54, indicating that 54 of the 5000+ funds have bought PetroChina shares.

China Merchants Bank's numbers is 910, indicating that 910 of the 5000+ funds have bought China Merchants Bank shares.

......

Well, there's nothing more to say at this point;

Finally, enter the market with caution: stocks are risky. This article is for learning only; I take no responsibility for any profits or losses.

