Gitee link: Ouxiang Experiment 4
1. Operation ①:
- Requirements: master the serialized output of Item and Pipeline data in Scrapy; crawl book data from the Dangdang website using the Scrapy + XPath + MySQL database storage technology route.
- Candidate site: http://search.dangdang.com/?key=python&act=input
- Key words: students may choose freely
- Output information: the MySQL output format is as follows
2. Problem solving ideas
2.1 web page analysis and commodity positioning
Each product's information sits in its own li tag, so the XPath locator for one product is "//ul[@class='bigimg']/li".
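The spider file itself is not shown in the report; a minimal sketch of how this locator could be used in the parse method (the class name and spider name are assumptions) looks like this, with the field extraction deferred to section 2.2:

import scrapy

class DangdangSpider(scrapy.Spider):
    name = "dangdang"      # assumed spider name
    start_urls = ["http://search.dangdang.com/?key=python&act=input"]

    def parse(self, response):
        # select every product <li> with the locator worked out above
        books = response.xpath("//ul[@class='bigimg']/li")
        for book in books:
            ...  # field extraction as shown in section 2.2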
2.2 information extraction
Find the required information from each li
for book in books:
    title = book.xpath("./a/@title").extract_first()
    price = book.xpath("./p[@class='price']/span[@class='search_now_price']/text()").extract_first()
    author = book.xpath("./p[@class='search_book_author']/span/a/@title").extract_first()
    date = book.xpath("./p[@class='search_book_author']/span[2]/text()").extract_first()
    publisher = book.xpath("./p[@class='search_book_author']/span/a[@name='P_cbs']/text()").extract_first()
    detail = book.xpath("./p[@class='detail']/text()").extract_first()
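The Item definition is not shown in the report; a minimal sketch of items.py, using the field names that the pipeline below reads from the item (the class name DdspiderItem is an assumption), would be:

import scrapy

class DdspiderItem(scrapy.Item):
    title = scrapy.Field()
    price = scrapy.Field()
    author = scrapy.Field()
    date = scrapy.Field()
    publisher = scrapy.Field()
    detail = scrapy.Field()

Inside the loop each extracted value would then be copied into an item (e.g. item['title'] = title) and yielded so that it reaches the pipeline.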
2.3 writing to database
pipelines.py
import pymysql

class DdspiderPipeline:
    def open_spider(self, spider):
        self.count = 1
        self.con = pymysql.connect(host='localhost', user='root', password='123456', charset="utf8")
        self.cursor = self.con.cursor()
        self.cursor.execute("CREATE DATABASE IF NOT EXISTS books")
        self.cursor.execute("USE books")
        self.cursor.execute("CREATE TABLE IF NOT EXISTS books("
                            "id int primary key,"
                            "bTitle varchar(512),"
                            "bAuthor varchar(256),"
                            "bPublisher varchar(256),"
                            "bDate varchar(32),"
                            "bPrice varchar(16),"
                            "bDetail text)"
                            "ENGINE=InnoDB DEFAULT CHARSET=utf8")
        print("opened")
        self.opened = True

    def process_item(self, item, spider):
        try:
            if self.opened:
                self.cursor.execute("insert into books (id, bTitle, bAuthor, bPublisher, bDate, bPrice, bDetail) "
                                    "values (%s,%s,%s,%s,%s,%s,%s)",
                                    (self.count, item['title'], item['author'], item['publisher'],
                                     item['date'], item['price'], item['detail']))
                self.count += 1
            return item
        except Exception as err:
            print(err)

    def close_spider(self, spider):
        if self.opened:
            # all inserts are committed once when the spider closes
            self.con.commit()
            self.con.close()
            self.opened = False
            print("closed")
            print("Total crawled:", self.count - 1, "books")
2.4 modify settings.py
BOT_NAME = 'ddSpider'
SPIDER_MODULES = ['ddSpider.spiders']
NEWSPIDER_MODULE = 'ddSpider.spiders'
ROBOTSTXT_OBEY = False
ITEM_PIPELINES = {
    'ddSpider.pipelines.DdspiderPipeline': 300,
}
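With the pipeline registered, the crawl can be started with the scrapy command line or from a small launcher script; a sketch of the latter is shown below (the spider name "dangdang" is an assumption, use whatever name the spider class declares):

from scrapy.cmdline import execute

execute(["scrapy", "crawl", "dangdang"])   # "dangdang" is the assumed spider name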
2.5 results
3. Experience
Reproducing this task was not difficult. Through this experiment I mastered the serialized output of Item and Pipeline data in Scrapy. It was also my first time using MySQL, and I gained a preliminary understanding of how pymysql lets Python interact with a MySQL database. My grasp of XPath has also deepened considerably.
1. Operation ②:
- Requirements: master the serialized output of Item and Pipeline data in Scrapy; crawl foreign exchange data using the "Scrapy framework + XPath + MySQL database storage" technology route.
- Candidate website: China Merchants Bank: http://fx.cmbchina.com/hq/
- Output information: MySQL database storage and output format

Id  Currency  TSP    CSP    TBP    CBP    Time
1   HKD       86.60  86.60  86.26  85.65  15:36:30
2   ......
2. Problem solving ideas
2.1 web page analysis and data positioning
The web page is relatively simple. The required data can easily be found in the tr tags, but the first tr is the table header and should be skipped. The XPath locator is "//table[@class='data']//tr".
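As in the first task, the spider file is not shown; a minimal parse sketch consistent with this locator (the class name and spider name are assumptions) could be:

import scrapy

class CmbForexSpider(scrapy.Spider):
    name = "cmb_forex"     # assumed spider name
    start_urls = ["http://fx.cmbchina.com/hq/"]

    def parse(self, response):
        # all rows of the rate table; the first row is the header and is skipped below
        forex_data = response.xpath("//table[@class='data']//tr")
        ...  # field extraction as shown in section 2.2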
2.2 information extraction
Find the required exchange rate information from each tr
for f in forex_data[1:]:
    item = forexSpiderItem()
    currency = f.xpath("./td[1]/text()").extract_first().strip()
    TSP = f.xpath("./td[4]/text()").extract_first().strip()
    CSP = f.xpath("./td[5]/text()").extract_first().strip()
    TBP = f.xpath("./td[6]/text()").extract_first().strip()
    CBP = f.xpath("./td[7]/text()").extract_first().strip()
    time = f.xpath("./td[8]/text()").extract_first().strip()
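The definition of forexSpiderItem is not shown; a sketch of items.py that matches the keys the pipeline below inserts into MySQL would be:

import scrapy

class forexSpiderItem(scrapy.Item):
    id = scrapy.Field()        # row number; assumed to be a running counter set by the spider
    currency = scrapy.Field()
    TSP = scrapy.Field()
    CSP = scrapy.Field()
    TBP = scrapy.Field()
    CBP = scrapy.Field()
    time = scrapy.Field()

Each extracted value would then be assigned (e.g. item["currency"] = currency) and the item yielded so that it reaches the pipeline; the report does not show that step.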
2.3 writing to database
pipelines.py
import pymysql

class ForexspiderPipeline:
    def open_spider(self, spider):
        try:
            self.con = pymysql.connect(host='localhost', port=3306, user='root',
                                       password='123456', charset='utf8')
            self.cursor = self.con.cursor()
            self.cursor.execute("CREATE DATABASE IF NOT EXISTS forex")
            self.cursor.execute("USE forex")
            self.cursor.execute("CREATE TABLE IF NOT EXISTS forexs(Id varchar(8) primary key, Currency varchar(32), "
                                "TSP varchar(8), CSP varchar(8), TBP varchar(8), CBP varchar(8), Time varchar(16))"
                                "ENGINE=InnoDB DEFAULT CHARSET=utf8")
            self.opened = True
            print("opened")
        except Exception as e:
            print(e)
            self.opened = False

    def process_item(self, item, spider):
        try:
            if self.opened:
                self.cursor.execute("INSERT INTO forexs VALUES (%s,%s,%s,%s,%s,%s,%s)",
                                    (item["id"], item["currency"], item["TSP"], item["CSP"],
                                     item["TBP"], item["CBP"], item["time"]))
            return item
        except Exception as e:
            print(e)

    def close_spider(self, spider):
        if self.opened:
            self.con.commit()
            self.con.close()
            self.opened = False
            print("closed")
2.4 modify settings.py (similar to the previous task)
2.5 results
Because the Id field is stored as varchar, the rows displayed in the table are not sorted numerically by Id.
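If numeric ordering is wanted when inspecting the results, one option (a sketch, assuming the forexs table created above) is to cast Id in the query:

import pymysql

con = pymysql.connect(host='localhost', port=3306, user='root',
                      password='123456', charset='utf8', database='forex')
cursor = con.cursor()
# CAST makes MySQL compare the varchar Id column as a number
cursor.execute("SELECT * FROM forexs ORDER BY CAST(Id AS UNSIGNED)")
for row in cursor.fetchall():
    print(row)
con.close()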
3. Experience
This task is much the same as the first one, and further consolidated my understanding of pymysql and the MySQL database.
1. Operation ③:
- Requirements: be familiar with Selenium's methods for locating HTML elements, crawling Ajax-loaded web page data, waiting for HTML elements, and so on; use the Selenium framework + MySQL database storage technology route to crawl stock data for the "Shanghai and Shenzhen A shares", "Shanghai A shares" and "Shenzhen A shares" boards.
- Candidate website: Eastmoney: http://quote.eastmoney.com/center/gridlist.html#hs_a_board
- Output information: MySQL database storage and output format is as follows. The column headers should be named in English (e.g. serial number id, stock code bStockNo...) and are defined and designed by the students themselves:

Serial number  Stock code  Stock name  Latest price  Change (%)  Change amount  Volume   Turnover     Amplitude  Highest  Lowest  Open today  Previous close
1              688093      N Shihua    28.47         62.22%      10.92          261,300  760 million  22.34      32.0     28.08   30.2        17.55
2              ......
2. Problem solving ideas
2.1 web page analysis and data positioning
The required stock information is in the tr tags; the XPath locator is '//tbody/tr'. For the three boards "Shanghai and Shenzhen A shares", "Shanghai A shares" and "Shenzhen A shares", only the fragment at the end of the URL changes: 'hs_a_board', 'sh_a_board' and 'sz_a_board' respectively.
2.2 information extraction
stocks = driver.find_elements(By.XPATH, '//tbody/tr')
for s in stocks:
    stock_data = []
    for i in [1, 2, 3, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14]:
        stock_data.append(s.find_element(By.XPATH, './td[' + str(i) + ']').text)
    stocks_data.append(stock_data)
2.3 page turning processing
# turn to the next page
driver.find_element(By.XPATH, '//div[@id="main-table_paginate"]/a[2]').click()
sleep(3)
# scroll back to the top
driver.execute_script("window.scrollTo(0,0);")
sleep(3)
pages -= 1
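The fixed sleep(3) calls work but are slow and fragile; since the assignment also mentions waiting for HTML elements, an explicit wait is one common alternative. A sketch using the same locators, assuming driver is the WebDriver instance created in the spider function below:

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

first_row = driver.find_element(By.XPATH, '//tbody/tr')
driver.find_element(By.XPATH, '//div[@id="main-table_paginate"]/a[2]').click()
# wait (up to 10 s) until the old rows go stale, then until the new rows are present
WebDriverWait(driver, 10).until(EC.staleness_of(first_row))
WebDriverWait(driver, 10).until(EC.presence_of_element_located((By.XPATH, '//tbody/tr')))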
2.4 code
spider function
from time import sleep
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.chrome.options import Options

def spider(key, pages):
    stocks_data = []
    chrome_options = Options()
    chrome_options.add_argument('--headless')
    chrome_options.add_argument('--disable-gpu')
    driver = webdriver.Chrome(options=chrome_options)
    driver.get('http://quote.eastmoney.com/center/gridlist.html#' + key)
    while pages:
        stocks = driver.find_elements(By.XPATH, '//tbody/tr')
        for s in stocks:
            stock_data = []
            for i in [1, 2, 3, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14]:
                stock_data.append(s.find_element(By.XPATH, './td[' + str(i) + ']').text)
            stocks_data.append(stock_data)
        # turn to the next page
        driver.find_element(By.XPATH, '//div[@id="main-table_paginate"]/a[2]').click()
        sleep(3)
        # scroll back to the top
        driver.execute_script("window.scrollTo(0,0);")
        sleep(3)
        pages -= 1
    return stocks_data
Main function
import pymysql

def main():
    con = pymysql.connect(host='localhost', port=3306, user='root', password='123456', charset='utf8')
    cursor = con.cursor()
    cursor.execute("CREATE DATABASE IF NOT EXISTS stock")
    cursor.execute("USE stock")
    print("opened")
    keys = ['hs_a_board', 'sh_a_board', 'sz_a_board']
    # table names for the three boards: Shanghai and Shenzhen A shares, Shanghai A shares, Shenzhen A shares
    names = ['hs_a_shares', 'sh_a_shares', 'sz_a_shares']
    for i in range(3):
        cursor.execute("CREATE TABLE IF NOT EXISTS " + names[i] + "(sNo varchar(8), sId varchar(16) primary key, "
                       "sName varchar(32), sPrice varchar(8), sChangePer varchar(8), sChangeAmount varchar(8), "
                       "sTransactionVolume varchar(16), sTradingVolume varchar(16), sAmplitude varchar(8), sHigh varchar(8), "
                       "sLow varchar(8), sToday varchar(8), sYesterday varchar(8))"
                       "ENGINE=InnoDB DEFAULT CHARSET=utf8")
        stocks = spider(keys[i], 3)  # crawl 3 pages
        print(len(stocks))
        for s in stocks:
            print(s[0], s[1], s[2], s[3], s[4], s[5], s[6], s[7], s[8], s[9], s[10], s[11], s[12])
            cursor.execute("INSERT INTO " + names[i] + " VALUES (%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s,%s)",
                           (s[0], s[1], s[2], s[3], s[4], s[5], s[6], s[7], s[8], s[9], s[10], s[11], s[12]))
    con.commit()
    con.close()
    print("closed")

if __name__ == '__main__':
    main()
2.5 results
Shenzhen A shares
3. Experience
This task gave me a better understanding of how to use Selenium. Although Selenium is slow, it drives a real browser, so every step the browser takes can be observed directly, and it offers richer functionality, such as handling Ajax-rendered pages.