Data Acquisition and Fusion Technology: Experiment 4

Gitee (code cloud) link: Ouxiang Experiment 4

1. Assignment ①:

  • Requirements: master the serialization output of Item and Pipeline data in Scrapy; use the Scrapy + XPath + MySQL storage technical route to crawl book data from the Dangdang website

  • Candidate sites: http://search.dangdang.com/?key=python&act=input

  • Keywords: chosen freely by the student

  • Output information:

    The records stored in MySQL are shown below

2. Problem-solving approach

2.1 web page analysis and item location

Inspecting the page shows that each book entry sits in an li tag; the XPath expression that locates every product is "//ul[@class='bigimg']/li".
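
For reference, a sketch of how the list of book nodes could be obtained inside the spider's parse() callback (the variable name books matches the loop in 2.2 below):

# Select every book entry of the search-result list
books = response.xpath("//ul[@class='bigimg']/li")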

2.2 information extraction

Extract the required fields from each li node:

# Iterate over every book node located in 2.1 and pull out the fields to store
for book in books:
    title = book.xpath("./a/@title").extract_first()
    price = book.xpath("./p[@class='price']/span[@class='search_now_price']/text()").extract_first()
    author = book.xpath("./p[@class='search_book_author']/span/a/@title").extract_first()
    date = book.xpath("./p[@class='search_book_author']/span[2]/text()").extract_first()
    publisher = book.xpath("./p[@class='search_book_author']/span/a[@name='P_cbs']/text()").extract_first()
    detail = book.xpath("./p[@class='detail']/text()").extract_first()
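
The six extracted values are carried to the pipeline by a Scrapy Item. A minimal items.py sketch matching the keys that pipelines.py reads below (the class name is an assumption); inside the loop above each value is assigned to the item and the item is yielded so that Scrapy hands it to the pipeline:

# items.py - one Field per value stored in MySQL (class name is assumed)
import scrapy


class DdspiderItem(scrapy.Item):
    title = scrapy.Field()
    price = scrapy.Field()
    author = scrapy.Field()
    date = scrapy.Field()
    publisher = scrapy.Field()
    detail = scrapy.Field()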
2.3 writing to database

pipelines.py

import pymysql


class DdspiderPipeline:

    def open_spider(self, spider):
        # Connect to local MySQL and create the database/table if they do not exist yet
        self.count = 1
        self.con = pymysql.connect(host='localhost', user='root', password='123456', charset="utf8")
        self.cursor = self.con.cursor()
        self.cursor.execute("CREATE DATABASE IF NOT EXISTS books")
        self.cursor.execute("USE books")
        self.cursor.execute("CREATE TABLE IF NOT EXISTS books("
                            "id int primary key,"
                            "bTitle varchar(512),"
                            "bAuthor varchar(256),"
                            "bPublisher varchar(256),"
                            "bDate varchar(32),"
                            "bPrice varchar(16),"
                            "bDetail text)"
                            "ENGINE=InnoDB DEFAULT CHARSET=utf8")
        print("opened")
        self.opened = True

    def process_item(self, item, spider):
        # Insert one row per item; the running counter self.count is used as the primary key
        try:
            if self.opened:
                self.cursor.execute("insert into books (id, bTitle, bAuthor, bPublisher, bDate, bPrice, bDetail) values "
                                    "(%s,%s,%s,%s,%s,%s,%s)",
                                    (self.count, item['title'], item['author'], item['publisher'],
                                     item['date'], item['price'], item['detail']))
                self.count += 1
        except Exception as err:
            print(err)
        return item

    def close_spider(self, spider):
        # Commit once when the spider closes; self.count started at 1, so count - 1 rows were inserted
        if self.opened:
            self.con.commit()
            self.con.close()
            self.opened = False
        print("close")
        print("Total crawled:", self.count - 1, "books")
2.4 modify settings.py
BOT_NAME = 'ddSpider'

SPIDER_MODULES = ['ddSpider.spiders']
NEWSPIDER_MODULE = 'ddSpider.spiders'
ROBOTSTXT_OBEY = False
ITEM_PIPELINES = {
   'ddSpider.pipelines.DdspiderPipeline': 300,
}
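
With the pipeline registered, the crawl is started with scrapy crawl <spider name>; a small launcher script can do the same from Python (the spider name 'dangdang' is a placeholder for whatever name attribute the spider class defines):

# run.py - equivalent to running `scrapy crawl dangdang` in the project directory
from scrapy.cmdline import execute

execute(["scrapy", "crawl", "dangdang"])  # "dangdang" is an assumed spider name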
2.5 results

3. Experience

The task was not difficult to reproduce. Through this experiment I mastered the serialization output of Item and Pipeline data in Scrapy. It was also my first time using MySQL, and I gained a preliminary understanding of how pymysql interacts with a MySQL database from Python. My grasp of XPath has also deepened considerably.

1. Assignment ②:

  • Requirements: master the serialization output of Item and Pipeline data in Scrapy; crawl foreign-exchange data using the "Scrapy framework + XPath + MySQL storage" technical route.

  • Candidate website: China Merchants Bank Network: http://fx.cmbchina.com/hq/

  • Output information: MySQL database storage, in the following format

    Id   Currency   TSP     CSP     TBP     CBP     Time
    1    HKD        86.60   86.60   86.26   85.65   15:36:30
    2    ......

2. Problem-solving approach

2.1 web page analysis and data location

The web page is relatively simple: the required data sits in tr tags, but the first tr is the table header and must be skipped. The XPath locating expression is "//table[@class='data']//tr".
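
Selecting the rows in the spider could then look as follows (the name forex_data matches the loop below; the header row is dropped later with the [1:] slice):

# All tr rows of the quotation table, header included
forex_data = response.xpath("//table[@class='data']//tr")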

2.2 information extraction

Extract the required exchange-rate fields from each tr:

# Skip the header row (forex_data[0]) and read the quotation columns of each remaining row
for f in forex_data[1:]:
    item = forexSpiderItem()
    currency = f.xpath("./td[1]/text()").extract_first().strip()
    TSP = f.xpath("./td[4]/text()").extract_first().strip()
    CSP = f.xpath("./td[5]/text()").extract_first().strip()
    TBP = f.xpath("./td[6]/text()").extract_first().strip()
    CBP = f.xpath("./td[7]/text()").extract_first().strip()
    time = f.xpath("./td[8]/text()").extract_first().strip()
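
The extracted values still have to be packed into the item and handed over to the pipeline. A minimal sketch of that step, assuming forexSpiderItem declares the seven fields read in pipelines.py below and that the spider keeps a running self.count counter to generate the Id column (the page itself has no id column):

    # Continue inside the loop: fill the Item and yield it to the pipeline.
    # self.count is an assumed counter attribute on the spider.
    item["id"] = str(self.count)
    self.count += 1
    item["currency"] = currency
    item["TSP"] = TSP
    item["CSP"] = CSP
    item["TBP"] = TBP
    item["CBP"] = CBP
    item["time"] = time
    yield item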
2.3 writing to database

pipelines.py

import pymysql


class ForexspiderPipeline:
    def open_spider(self, spider):
        # Connect to local MySQL and create the database/table if they do not exist yet
        try:
            self.con = pymysql.connect(host='localhost', port=3306, user='root', password='123456', charset='utf8')
            self.cursor = self.con.cursor()
            self.cursor.execute("CREATE DATABASE IF NOT EXISTS forex")
            self.cursor.execute("USE forex")
            self.cursor.execute("CREATE TABLE IF NOT EXISTS forexs(Id varchar(8) primary key, Currency varchar(32), "
                                "TSP varchar(8), CSP varchar(8), TBP varchar(8), CBP varchar(8), Time varchar(16))"
                                "ENGINE=InnoDB DEFAULT CHARSET=utf8")
            self.opened = True
            print("opened")
        except Exception as e:
            print(e)
            self.opened = False

    def process_item(self, item, spider):
        # Insert one row per item; parameterized SQL keeps quoting safe
        try:
            if self.opened:
                self.cursor.execute("INSERT INTO forexs VALUES (%s,%s,%s,%s,%s,%s,%s)",
                                    (item["id"], item["currency"], item["TSP"], item["CSP"],
                                     item["TBP"], item["CBP"], item["time"]))
                print("inserted:", item["id"], item["currency"], item["TSP"], item["CSP"],
                      item["TBP"], item["CBP"], item["time"])
        except Exception as e:
            print(e)
        return item

    def close_spider(self, spider):
        # Commit once at the end so all rows are written in a single transaction
        if self.opened:
            self.con.commit()
            self.con.close()
            self.opened = False
            print("closed")
2.4 modify settings.py (similar to the previous assignment)
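
The relevant lines are the same as in the previous assignment; only the pipeline path changes. A sketch assuming the Scrapy project is named forexSpider:

# settings.py (project name forexSpider is an assumption)
ROBOTSTXT_OBEY = False
ITEM_PIPELINES = {
   'forexSpider.pipelines.ForexspiderPipeline': 300,
}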
2.5 results

Because the Id field is stored as varchar, the rows in the table are not displayed sorted numerically by Id.

3. Experience

This task is largely the same as the first one, and it further consolidated my understanding of pymysql and the MySQL database.

1. Assignment ③:

  • Requirements: be familiar with Selenium's methods for locating HTML elements, crawling Ajax-loaded web pages, and waiting for HTML elements; use the Selenium framework + MySQL storage technical route to crawl the stock data of the "Shanghai and Shenzhen A shares", "Shanghai A shares" and "Shenzhen A shares" boards.

  • Candidate website: Eastmoney: http://quote.eastmoney.com/center/gridlist.html#hs_a_board

  • Output information: MySQL storage and output format as below; column headers should be named in English and designed by the student, e.g. serial number id, stock code bStockNo...:

    No.  Code    Name      Latest price  Change %  Change  Volume   Turnover     Amplitude  High  Low    Open  Prev. close
    1    688093  N Shihua  28.47         62.22%    10.92   261,300  760 million  22.34      32.0  28.08  30.2  17.55
    2    ......

2. Problem-solving approach

2.1 web page analysis and data location

The required stock information sits under tr tags; the XPath locating expression is '//tbody/tr'. For the three boards "Shanghai and Shenzhen A shares", "Shanghai A shares" and "Shenzhen A shares", only the tail of the URL changes: 'hs_a_board', 'sh_a_board' and 'sz_a_board'.

2.2 information extraction
# One tr per stock; read the 13 columns of interest (column 4 is skipped) into stock_data
stocks = driver.find_elements(By.XPATH, '//tbody/tr')
for s in stocks:
    stock_data = []
    for i in [1, 2, 3, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14]:
        stock_data.append(s.find_element(By.XPATH, './td[' + str(i) + ']').text)
    stocks_data.append(stock_data)  # stocks_data accumulates the rows across pages
2.3 page turning processing
# Turn the page: click the "next page" button in the pager, then wait for the table to refresh
driver.find_element(By.XPATH, '//div[@id="main-table_paginate"]/a[2]').click()
sleep(3)
# Scroll back to the top of the page
driver.execute_script("window.scrollTo(0,0);")
sleep(3)
pages -= 1
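
The fixed sleep(3) is simple but wastes time; since the assignment also asks about waiting for HTML elements, an explicit wait could be used instead. A minimal sketch with Selenium's WebDriverWait, reusing the row locator from above:

# Wait up to 10 seconds for the quotation rows to be present instead of sleeping a fixed time
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

WebDriverWait(driver, 10).until(
    EC.presence_of_element_located((By.XPATH, '//tbody/tr'))
)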
2.4 code

spider function

import pymysql
from time import sleep

from selenium import webdriver
from selenium.webdriver.chrome.options import Options
from selenium.webdriver.common.by import By


def spider(key, pages):
    """Crawl `pages` pages of the board identified by `key` and return the collected rows."""
    stocks_data = []
    # Run Chrome headless so the crawl does not need a visible browser window
    chrome_options = Options()
    chrome_options.add_argument('--headless')
    chrome_options.add_argument('--disable-gpu')
    driver = webdriver.Chrome(options=chrome_options)
    driver.get('http://quote.eastmoney.com/center/gridlist.html#' + key)
    while pages:
        # One tr per stock; read the 13 columns of interest (column 4 is skipped)
        stocks = driver.find_elements(By.XPATH, '//tbody/tr')
        for s in stocks:
            stock_data = []
            for i in [1, 2, 3, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14]:
                stock_data.append(s.find_element(By.XPATH, './td[' + str(i) + ']').text)
            stocks_data.append(stock_data)
        # Turn the page and wait for the table to refresh
        driver.find_element(By.XPATH, '//div[@id="main-table_paginate"]/a[2]').click()
        sleep(3)
        # Scroll back to the top
        driver.execute_script("window.scrollTo(0,0);")
        sleep(3)
        pages -= 1
    driver.quit()
    return stocks_data

Main function

def main():
    # Open the MySQL connection and make sure the target database exists
    con = pymysql.connect(host='localhost', port=3306, user='root', password='123456', charset='utf8')
    cursor = con.cursor()
    cursor.execute("CREATE DATABASE IF NOT EXISTS stock")
    cursor.execute("USE stock")
    print("opened")
    # One table per board: Shanghai & Shenzhen A shares, Shanghai A shares, Shenzhen A shares
    keys = ['hs_a_board', 'sh_a_board', 'sz_a_board']
    names = ['hs_a_shares', 'sh_a_shares', 'sz_a_shares']
    for i in range(3):
        cursor.execute("CREATE TABLE IF NOT EXISTS " + names[i] + "(sNo varchar(8), sId varchar(16) primary key, "
                       "sName varchar(32), sPrice varchar(8), sChangePer varchar(8), sChangeAmount varchar(8), "
                       "sTransactionVolume varchar(16), sTradingVolume varchar(16), sAmplitude varchar(8), sHigh varchar(8), "
                       "sLow varchar(8), sToday varchar(8), sYesterday varchar(8))"
                       "ENGINE=InnoDB DEFAULT CHARSET=utf8")
        stocks = spider(keys[i], 3)  # crawl 3 pages of this board
        print(len(stocks))
        for s in stocks:
            print(s[0], s[1], s[2], s[3], s[4], s[5], s[6], s[7], s[8], s[9], s[10], s[11], s[12])
            cursor.execute("INSERT INTO " + names[i] + " VALUES (%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s)",
                           (s[0], s[1], s[2], s[3], s[4], s[5], s[6], s[7], s[8], s[9], s[10], s[11], s[12]))
    con.commit()
    con.close()
    print("closed")
2.5 results

Shenzhen A shares

3. Experience

This assignment gave me a better understanding of how to use Selenium. Although Selenium is slow, it runs directly in a real browser, so every step the browser takes can be observed, and it offers richer functionality, such as handling Ajax-loaded content.
