Python's share price doubled, life winner, python crawling fund, screening stocks

preface:
I hear you want to be rich? Then calm down and take your time. Haven't you heard the famous saying: "the poor always don't want to get rich slowly". They all want to get rich overnight and win the lottery, but how can they be so lucky? You can't win 780000 lottery tickets like me and spend them secretly.
If you want to become rich slowly, just manage your money well; Save small money and earn big money! Before, I had to think about how to make myself realize wealth freedom, so I had the idea of learning financial management. When it comes to financial management, I have to say some financial products, such as gold, futures, stocks, funds, etc. first popularize these little knowledge, because it is to climb stocks and funds, so I'll briefly introduce them;
Stock (English: stock) or capital stock (English: capital stock) is a kind of securities, and the joint-stock company distributes its ownership through this kind of securities. Because the joint-stock company needs to raise funds, it will issue shares to investors as a certificate of partial ownership of the company's capital, become shareholders, so as to obtain dividends (dividends) and share the profits brought by the growth of the company or the fluctuation of the trading market; But we should also jointly bear the risks caused by the company's operation errors------ Here comes Wikipedia
Fund, take chestnuts for example, is that you have money and want to buy stocks, but you don't know the relevant knowledge of stocks; I have no money, but I have very rich financial knowledge and stock experience. I am a good financial planner. So we sum up, you give me your money, and I'll use it to manage money. When I make money, I'll take some share; "I" refers to the fund------ Self understanding from spark burning wish
Generally speaking, stocks have high returns and high risks. The fund has low income and low risk, because the fund has bought a lot of stocks, and the stocks bought will not encounter all rise or fall, so it is relatively risk resistant. Speaking of this, if I don't want to buy a fund, I want to buy stocks and want to buy good stocks, but I don't understand stocks. What should I do? Well, well, that's what I think. Although I think beautifully, there are always ways. According to our analysis, the fund is an institution that has bought a lot of stocks, and there are all kinds of financial gods in it. We can see what stocks they have bought from the fund, and then follow them. After all, they don't want to lose money and will choose potential stocks.

text
This paper uses python to crawl the funds of a financial website, crawl the stocks held by 5000 + funds, and process them.

There are many cases in which the whole company has been arrested because of crawling data. Therefore, it is explained here: refuse to use reptiles to commit illegal acts, resolutely love the country and the people, do good things without leaving a name, and help grandma cross the road more. I hope the police uncle will not take me away because of this reptile article.
 


Knowledge points involved in this paper:
1. python string: segmentation, splicing, Chinese character judgment;
2. python regular expression;
3. Crawler requests request library, xpath data acquisition, proxy {filtering} server;
4. selenium usage: headless browser, element positioning, explicit waiting, data acquisition;
5. python operation mongodb
Website analysis
We'll post the code and data later, and first analyze the target website, which will help us make the crawling process clearer;
Target site:


http://fund.eastmoney.com/data/fundranking.html#tall;c0;r;szzf;pn50;ddesc;qsd20181126;qed20191126;qdii;zq;gg;gzbd;gzfs;bbzt;sfbb
What we climb is the data in [open-end fund]:


We can click a fund to enter its details page. I don't know if you found it. The url of the fund details page is the fund code and password of the fund on the home page

http://fund.eastmoney.com/

A combination of, for example:
040011 --- preferred mixed url of Hua'an core:

http://fund.eastmoney.com/040011.html


005660 - url of harvest resources selected stock A:

http://fund.eastmoney.com/005660.html

ok, ok, we can pull down the fund details page to find the stock position information of the fund, that is, which stocks the fund has bought:


 


Then click more to enter the details page of the fund's shareholding, and pull down to see the stock position information of the fund in three quarters:
 


Yes, this is the target data, the data to be crawled;
ok, let's not crawl first, and then analyze the details page of the fund's position. This url is also regular. It is used

http://fundf10.eastmoney.com/ccmx_

Combined with the fund code of the fund, such as:
005660, url of the position details page of harvest resources selected stock A:

http://fundf10.eastmoney.com/ccmx_005660.html


006921, url of position details page of Nanfang Zhicheng hybrid:

http://fundf10.eastmoney.com/ccmx_006921.html

Because these data are dynamically loaded with js, it is very difficult to crawl using requests. In this case, selenium is generally used to simulate browser behavior. But selenium's crawling efficiency is really low. In fact, we can still use requests for crawling. js dynamic loading is that the js code in the html page performs an operation and automatically loads data from the server, so the data can't be seen on the page crawling at the beginning. Selenium is only needed for some data that is particularly difficult to climb, because selenium claims that you can get all the data you can see. After all, selenium imitates the behavior of people operating browsers. Here, we analyze the dynamic loading of js, then use requests to crawl, and then use selenium for the second crawl.
On the home page, press F12 to open the developer tool, and then refresh it,


You can see the data in the blue box on the right. This is the data returned by js after dynamic loading, and then presented on the page after processing. In fact, you only need to obtain these data instead of crawling to the home page;
Let's click Headers again. The Request URL is the url of the js request. You can try to enter the url directly with the browser and a pile of data will be returned to you; The above analyzes the composition of the url of the fund position stock page, so as long as the six digit fund code in these data is required. In this code, the six digit number is extracted with python regular, and then the url of the fund position stock page is formed; Then crawl and store the stocks held by the fund on the fund position stock page;
Crawling process:
1. First, request the url requested when js dynamically loads data from the home page, and obtain the six digit fund code from it,
then


http://fundf10.eastmoney.com/ccmx_ + Fund code +. html

The url of the details page of the shares held by the fund;
2. Crawl the url of the details page of the fund's position stocks. Because it is also loaded dynamically by js (loading speed is fast), and it is necessary to judge whether the fund has position stocks (some funds don't buy stocks and don't know what they are doing), selenium is used to crawl. At the same time, the explicit waiting method is also used to wait for the data to be loaded;
3. Organize the data and store it in mongodb;
Code explanation - Data crawling:
This time, we will put the code segment up and explain the segment;
Required libraries:

import requests
import re
from lxml import etree
from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
import pymongo

Some common methods of preparation:

#Judge whether the string contains Chinese
def is_contain_chinese(check_str):
    """
    Judge whether the string contains Chinese
    :param check_str: {str} String to detect
    :return: {bool} Include return True, Return not included False
    """
    for ch in check_str:
        if u'\u4e00' <= ch <= u'\u9fff':
            return True
    return False
#selenium judges whether the element exists through class name, which is used to judge whether the fund has position stocks in the fund position stock details page;
def is_element(driver,element_class):
    try:
        WebDriverWait(driver,2).until(EC.presence_of_element_located((By.CLASS_NAME,element_class)))
    except:
        return False
    else:
        return True
#requests is the method to request the url, and the text is returned after processing
def get_one_page(url):
    headers = {
        'User-Agent':'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.142 Safari/537.36',
    }
    proxies = {
        "http": "http://XXX.XXX.XXX.XXX:XXXX"
    }
  
    response = requests.get(url,headers=headers,proxies=proxies)
    response.encoding = 'utf-8'
    if response.status_code == 200:
        return response.text
    else:
        print("Request status code != 200,url error.")
        return None
#This method directly requests, returns and processes the data of the home page, forms the position information url and stock name, and stores them in the array;
def page_url():
    stock_url = []      #Define an array to store the url of the fund's position stock details page
    stock_name = []     #Define an array to store the name of the fund
    url = "http://fund.eastmoney.com/data/rankhandler.aspx?op=ph&dt=kf&ft=all&rs=&gs=0&sc=zzf&st=desc&sd=2018-11-26&ed=2019-11-26&qdii=&tabSubtype=,,,,,&pi=1&pn=10000&dx=1&v=0.234190661250681"
    result_text = get_one_page(url)
    # print(result_text.replace('\"',','))    #Replace with,
    # print(result_text.replace('\"',',').split(','))    #Divided by
    # print(re.findall(r"\d{6}",result_text))     #Output the 6-bit code of the stock and return the array;
    for i in result_text.replace('\"',',').split(','):  #Replace "with" and then split it with ", traverse and filter out Chinese characters (stock name)
        result_chinese = is_contain_chinese(i)
        if result_chinese == True:
            stock_name.append(i)
    for numbers in re.findall(r"\d{6}",result_text):
        stock_url.append("http://fundf10.eastmoney.com/ccmx_% s. HTML "% (numbers)) # store the spliced url in the list;
    return stock_url,stock_name
#selenium requests the method of [url of the fund's position stock details page] to crawl the name of the fund's position stock;
def hold_a_position(url):
    driver.get(url)  # Request fund position information
    element_result = is_element(driver, "tol")  # Whether this element exists is used to judge whether there is position information;
    if element_result == True:  # Crawl if there is position information;
        wait = WebDriverWait(driver, 3)  # Set a waiting time
        input = wait.until(EC.presence_of_element_located((By.CLASS_NAME, 'tol')))  # Wait for this class to appear;
        ccmx_page = driver.page_source  # Get the source code of the page
        ccmx_xpath = etree.HTML(ccmx_page)  # Convert to xpath format
        ccmx_result = ccmx_xpath.xpath("//div[@class='txt_cont']//div[@id='cctable']//div[@class='box'][1]//td[3]//text()")
        return ccmx_result
    else:   #If there is no position information, null character will be returned;
        return "null"

Note the page_url() method. The URL in the method is the URL requested when analyzing js dynamic loading data above. Note the parameters behind the URL. Pi is the page number and pn is the number of data per page. Here, pi=1 and pn=10000, which means that 10000 data are displayed on the first page (there must not be so much actual data, but only 5000 + on the first page) , all the data will be displayed at one time;
Program start:

if __name__ == '__main__':
    # Create connection to mongodb database
    client = pymongo.MongoClient(host='XXX.XXX.XXX.XXX', port=XXXXX)  # Connect mongodb. host is ip and port is port
    db = client.db_spider  # Use (create) database
    db.authenticate("user name", "password")  # mongodb user name and password connection;
    collection = db.tb_stock  # Use (create) a set (table)
  
    stock_url, stock_name = page_url()     #Get the home page data and return the array of fund url and fund name;
  
    #Browser action
    chrome_options = Options()
    chrome_options.add_argument('--headless')
    driver = webdriver.Chrome(options=chrome_options)    #Initialize the browser without browser interface;
  
    if len(stock_url) == len(stock_name):       #Judge whether the obtained fund url is consistent with the number of fund names
        for i in range(len(stock_url)):
            return_result = hold_a_position(stock_url[i])  # Traverse the position information and return the name of the position stock - array
            dic_data = {
                'fund_url':stock_url[i],
                'fund_name':stock_name[i],
                'stock_name':return_result
            }        #dic_data is dictionary data, which is prepared for storage in mongodb;
            print(dic_data)
            collection.insert_one(dic_data)     #Insert dic_data into mongodb database
    else:
        print("fund url And funds name The number of arrays is inconsistent. Exit.")
        exit()
  
    driver.close()              #Close browser
  
    #Query: filter out non null data
    find_stock = collection.find({'stock_name': {'$ne': 'null'}})  # Query the data whose stock_name is not equal to null (excluding those fund institutions that do not hold stocks);
    for i in find_stock:
        print(i)

OK, so far, the code for crawling data has been explained, and you can wait after running; the project runs in a single process, so the crawling speed is slightly slow. At the same time, it is also affected by the network speed, and it will continue to be improved into multithreading in the later stage. Code explanation - Data Processing: Data crawling has been coexisted and stored in the database, and the data is processed here to make it available. First, explain the idea : 1. We need to know the comprehensive data of all the stocks held by these funds, including the duplicate stocks in the fund positions; 2. We need to know which stocks have been repeated, how many have been repeated and how many times have been repeated; in this way, the stock with the largest number of repetitions must be the best, because it proves that many funds have purchased this stock. See the code for details , the note has made it clear:

import pymongo
  
#1, Database: connect libraries, use collections, and create documents;#
client = pymongo.MongoClient(host='XXX.XXX.XXX.XXX',port=XXXXX)  #Connect to mongodb database
  
db = client.db_spider       #Use (create) database
db.authenticate("user name","password")      #Authentication user name and password
  
collection = db.tb_stock    #Use (create) a set (table) in which the data crawled by the above program has been stored;
tb_result = db.tb_data      #Use (create) a set (table) to store the last processed data;
  
#Query the data whose stock_name is not equal to null, that is, exclude the funds that do not hold stocks;
find_stock = collection.find({'stock_name':{'$ne':'null'}})
  
#2, Process the data and accumulate all stock arrays into an array---list_stock_all #
list_stock_all = []     #Define an array to store all stock names, including duplicate ones;
for i in find_stock:
    print(i['stock_name'])    #Output the position stocks of the Fund (type is array)
    list_stock_all = list_stock_all + i['stock_name']   #Synthesize all stock arrays into one array;
print("Total number of shares:" + str(len(list_stock_all)))
  
#3, Data processing, stock de duplication#
list_stock_repetition = []  #Define an array to store the stocks after de duplication
for n in list_stock_all:
    if n not in list_stock_repetition:        #If it doesn't exist
        list_stock_repetition.append(n)        #Add to the array and remove the duplicate;
print("Number of stocks after weight removal:" + str(len(list_stock_repetition)))
  
#4, Combine the two arrays obtained in 2 and 3 for data filtering#
for u in list_stock_repetition:        #Traverse the array of stocks after de duplication
    if list_stock_all.count(u) > 10:   #Find the duplicate number of stocks in the array of stocks that have not been de duplicated. If the duplicate number is greater than 10
        #The data is formed into a dictionary for storage in mongodb;
        data_stock = {
            "name":u,
            "numbers":list_stock_all.count(u)
        }
        insert_result = tb_result.insert_one(data_stock)    #Store in mongodb
        print("Stock Name:" + u + " , Number of repetitions:" + str(list_stock_all.count(u)))

In this way, the data is processed slightly and stored in the collection of tb_data; only part of the processed data is disclosed below:

{'_id': ObjectId('5e0b5ecc7479db5ac2ec62c9'), 'name': 'Crystal photoelectric', 'numbers': 61}
{'_id': ObjectId('5e0b5ecc7479db5ac2ec62ca'), 'name': 'ordinary people', 'numbers': 77}
{'_id': ObjectId('5e0b5ecc7479db5ac2ec62cb'), 'name': 'North Huachuang', 'numbers': 52}
{'_id': ObjectId('5e0b5ecc7479db5ac2ec62cc'), 'name': 'Goldwind Technology', 'numbers': 84}
{'_id': ObjectId('5e0b5ecc7479db5ac2ec62cd'), 'name': 'Tianshun wind energy', 'numbers': 39}
{'_id': ObjectId('5e0b5ecc7479db5ac2ec62ce'), 'name': 'Shi dashenghua', 'numbers': 13}
{'_id': ObjectId('5e0b5ecc7479db5ac2ec62cf'), 'name': 'SDIC power', 'numbers': 55}
{'_id': ObjectId('5e0b5ecc7479db5ac2ec62d0'), 'name': 'SINOPEC (China Petrochemical Corporation', 'numbers': 99}
{'_id': ObjectId('5e0b5ecc7479db5ac2ec62d1'), 'name': 'PetroChina', 'numbers': 54}
{'_id': ObjectId('5e0b5ecc7479db5ac2ec62d2'), 'name': 'China Ping An', 'numbers': 1517}
{'_id': ObjectId('5e0b5ecc7479db5ac2ec62d3'), 'name': 'Moutai, Guizhou', 'numbers': 1573}
{'_id': ObjectId('5e0b5ecc7479db5ac2ec62d4'), 'name': 'China Merchants Bank', 'numbers': 910}

The data has not been sorted, and the ranking is not in order; in the data: Sinopec's numbers is 54, indicating that 54 of the 5000 + funds have bought Sinopec's shares; China Merchants Bank's numbers is 910, indicating that 910 of the 5000 + funds have bought China Merchants Bank's shares... Well, there's nothing to say at this point; finally, be careful when entering the market, stock There are risks; the article is only for learning, responsible for its own profits and losses, and is not responsible;

Tags: Python crawler

Posted on Fri, 08 Oct 2021 22:24:14 -0400 by deepermethod