Data Acquisition and Fusion Technology Practice: Assignment 4
Assignment ①:
1. Task description
- Requirements: master the serialization output of Item and Pipeline data in Scrapy; use the Scrapy + XPath + MySQL database storage technical route to crawl book data from the Dangdang website.
- Candidate site: http://www.dangdang.com/
- Key words: chosen freely by the student
- Output information: MySQL database storage; the output format is as follows:
2. Experimental ideas
```python
import scrapy

class Test41Item(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    title = scrapy.Field()
    author = scrapy.Field()
    publisher = scrapy.Field()
    date = scrapy.Field()
    price = scrapy.Field()
    detail = scrapy.Field()
```
The fields to be crawled for this assignment are defined in items.py.
```python
def start_requests(self):
    url = "http://search.dangdang.com/?"
    for page in range(1, 4):
        params = {"key": "python", "act": "input", "page_index": str(page)}
        # Construct a GET request; formdata is sent as the query string
        yield scrapy.FormRequest(url=url, callback=self.parse, method="GET",
                                 headers=self.headers, formdata=params)
```
The start_requests function sets the query parameters so that pages can be turned, and three pages of book results are crawled.
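As a sketch, the same paged GET requests could also be built by encoding the parameters into the URL with urllib.parse.urlencode and using a plain scrapy.Request (self.headers as defined in the spider is assumed):

```python
from urllib.parse import urlencode

def start_requests(self):
    for page in range(1, 4):
        params = {"key": "python", "act": "input", "page_index": str(page)}
        # Append the encoded parameters directly to the search URL
        url = "http://search.dangdang.com/?" + urlencode(params)
        yield scrapy.Request(url=url, callback=self.parse, headers=self.headers)
```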
```python
def parse(self, response):
    try:
        data = response.body.decode("gbk")
        selector = scrapy.Selector(text=data)
        books = selector.xpath("//*[@id='search_nature_rg']/ul/li")
        for book in books:
            item = Test41Item()
            item["title"] = book.xpath("./a/@title").extract_first()
            item["author"] = book.xpath("./p[@class='search_book_author']/span")[0].xpath("./a/@title").extract_first()
            item["publisher"] = book.xpath("./p[@class='search_book_author']/span")[2].xpath("./a/text()").extract_first()
            item["date"] = book.xpath("./p[@class='search_book_author']/span")[1].xpath("./text()").extract_first()
            try:
                item["date"] = item["date"].split("/")[-1]
            except Exception:
                item["date"] = " "
            item["price"] = book.xpath("./p[@class='price']/span[@class='search_now_price']/text()").extract_first()
            detail = book.xpath("./p[@class='detail']/text()").extract_first()
            item["detail"] = detail if detail else ""
            print(item["title"])
            print(item["author"])
            print(item["publisher"])
            print(item["date"])
            print(item["price"])
            print(item["detail"])
            yield item
    except Exception as err:
        print(err)
```
The required data is extracted in the parse function; the date field gets extra processing to strip the leading space and "/" that precede the publication date.
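For example (the raw value below is a hypothetical sample), the split keeps only the part after the last "/":

```python
raw = " /2023-06-01"        # hypothetical raw text taken from the date span
date = raw.split("/")[-1]   # -> "2023-06-01"
```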
```python
class Test41Pipeline:
    db = DB()      # DB: the project's MySQL helper class
    count = 1

    def __init__(self):
        self.db.openDB()   # create and open the table

    def process_item(self, item, spider):
        try:
            self.db.insert(self.count, item['title'], item['author'], item['publisher'],
                           item['date'], item['price'], item['detail'])
            self.count += 1
        except Exception as err:
            print(err)
        return item
```
In pipelines.py, the MySQL database is connected and a table is created, and the previously extracted data is written into MySQL.
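The DB helper class used by the pipeline is not shown above; a minimal sketch of what it might look like, assuming pymysql and placeholder connection settings (host, user, password, database and table names are illustrative, not from the original project), is:

```python
import pymysql

class DB:
    def openDB(self):
        # Placeholder connection settings -- adjust to the local environment
        self.con = pymysql.connect(host="127.0.0.1", port=3306, user="root",
                                   password="123456", database="mydb", charset="utf8mb4")
        self.cursor = self.con.cursor()
        # Create the books table if it does not exist yet
        self.cursor.execute(
            "CREATE TABLE IF NOT EXISTS books ("
            "id INT PRIMARY KEY, title VARCHAR(256), author VARCHAR(128), "
            "publisher VARCHAR(128), pdate VARCHAR(32), price VARCHAR(32), detail TEXT)")

    def insert(self, id, title, author, publisher, date, price, detail):
        # Parameterized insert of one book record
        self.cursor.execute(
            "INSERT INTO books VALUES (%s,%s,%s,%s,%s,%s,%s)",
            (id, title, author, publisher, date, price, detail))
        self.con.commit()

    def closeDB(self):
        self.con.commit()
        self.con.close()
```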
The records stored in the database are shown in the screenshot: 60 books are crawled per page, 180 books in total.
3. Experience
Assignment ① reviewed the Scrapy framework and XPath: the Scrapy framework was used to crawl Dangdang, and XPath was used to parse the information.
Assignment ②:
1. Task description
- Requirements: master the serialization output of Item and Pipeline data in Scrapy; crawl foreign-exchange data with the "Scrapy framework + XPath + MySQL database storage" technical route.
- Candidate website: China Merchants Bank forex page: http://fx.cmbchina.com/hq/
- Output information: MySQL database storage; the output format is as follows:
2. Experimental ideas
```python
import scrapy

class Test42Item(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    currency = scrapy.Field()
    tsp = scrapy.Field()
    csp = scrapy.Field()
    tbp = scrapy.Field()
    cbp = scrapy.Field()
    time = scrapy.Field()
```
The fields to be crawled for this assignment are defined in items.py.
```python
def start_requests(self):
    url = "http://fx.cmbchina.com/hq/"
    # Construct a GET request
    yield scrapy.FormRequest(url=url, callback=self.parse, method="GET",
                             headers=self.headers)
```
The GET request is constructed in the start_requests function.
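Since no form data is submitted here, a plain scrapy.Request is equivalent; a minimal sketch (self.headers as defined in the spider is assumed):

```python
def start_requests(self):
    # Same GET request without going through FormRequest
    yield scrapy.Request(url="http://fx.cmbchina.com/hq/",
                         callback=self.parse, headers=self.headers)
```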
```python
def parse(self, response):
    try:
        data = response.body.decode("utf-8")
        selector = scrapy.Selector(text=data)
        # print(data)
        trs = selector.xpath("//*[@id='realRateInfo']/table/tr")
        for tr in trs[1:]:   # skip the header row
            item = Test42Item()
            item["currency"] = tr.xpath("./td[position()=1]/text()").extract_first().replace("\n", "").replace(" ", "")
            item["tsp"] = tr.xpath("./td[position()=4]/text()").extract_first().replace("\n", "").replace(" ", "")
            item["csp"] = tr.xpath("./td[position()=5]/text()").extract_first().replace("\n", "").replace(" ", "")
            item["tbp"] = tr.xpath("./td[position()=6]/text()").extract_first().replace("\n", "").replace(" ", "")
            item["cbp"] = tr.xpath("./td[position()=7]/text()").extract_first().replace("\n", "").replace(" ", "")
            item["time"] = tr.xpath("./td[position()=8]/text()").extract_first().replace("\n", "").replace(" ", "")
            print(item["currency"])
            print(item["tsp"])
            print(item["csp"])
            print(item["tbp"])
            print(item["cbp"])
            print(item["time"])
            yield item
    except Exception as err:
        print(err)
```
In the parse function, XPath expressions worked out by inspecting the page are used to extract the required columns, and line breaks and spaces are removed from the data.
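As an alternative sketch, the whitespace cleanup can be pushed into the XPath expression itself with normalize-space(), which trims leading/trailing whitespace and collapses internal runs into single spaces (shown for the first column only, assuming the cell contains plain text):

```python
item["currency"] = tr.xpath("normalize-space(./td[position()=1])").extract_first()
```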
```python
class Test42Pipeline:
    db = DB()      # DB: the project's MySQL helper class
    count = 1

    def __init__(self):
        self.db.openDB()

    def process_item(self, item, spider):
        try:
            self.db.insert(self.count, item['currency'], item['tsp'], item['csp'],
                           item['tbp'], item['cbp'], item['time'])
            self.count += 1
        except Exception as err:
            print(err)
        return item
```
Finally, the data is saved into the database in the same way as in assignment ①.
The database information is shown in the figure
3. Experience
Assignment ② is relatively simple; it was another chance to get familiar with the Scrapy framework and the XPath method.
Assignment ③:
1. Task description
- Requirements: be familiar with Selenium's methods for locating HTML elements, crawling Ajax-loaded web page data, waiting for HTML elements, etc.; use the Selenium framework + MySQL database storage technical route to crawl the stock data of the "Shanghai & Shenzhen A shares", "Shanghai A shares" and "Shenzhen A shares" boards.
- Candidate website: East Money (Dongfang Fortune): http://quote.eastmoney.com/center/gridlist.html#hs_a_board
- Output information: MySQL database storage; the output format is as follows.
Column headers should be named in English, e.g. serial number id, stock code bStockNo...; the headers are to be defined and designed by the students themselves.
| Serial number | Stock code | Stock name | Latest quotation | Fluctuation range | Rise and fall | Volume | Turnover | Amplitude | Highest | Lowest | Open today | Previous close |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | 688093 | N Shihua | 28.47 | 62.22% | 10.92 | 261300 | 760 million | 22.34 | 32.0 | 28.08 | 30.2 | 17.55 |
| 2 | ...... | | | | | | | | | | | |
2. Experimental ideas
```python
if __name__ == "__main__":
    url = 'http://quote.eastmoney.com/center/gridlist.html#hs_a_board'
    # type records the XPath of the two board buttons that need to be clicked
    type = ['//*[@id="nav_sh_a_board"]/a', '//*[@id="nav_sz_a_board"]/a']
    driver = OpenDriver(url)
    db = DB()
    db.openDB()
    # Data acquisition part
    for page in range(1, 10):
        GetInfo(page)
        # Page-turning part
        if page == 3:            # after two page turns, switch to the Shanghai A-share board
            ToNextType(type[0])
        elif page == 6:          # then switch to the Shenzhen A-share board
            ToNextType(type[1])
        else:
            ToNextPage()         # normal page turning
    db.closeDB()
    driver.close()
    print("End of crawling")
```
The main program follows this flow: open the browser, visit the page, fetch the data, keep fetching by turning pages or switching boards, and finally finish the crawl; the data is written into the MySQL database as it is fetched.
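The DB helper for this assignment is not shown; as a sketch, a possible stocks table with self-chosen English column names (only bStockNo comes from the assignment statement; the table name and the remaining column names are illustrative, matching the 14 values passed to db.insert() in GetInfo below):

```python
# Hypothetical schema -- serial number, board type, then the 12 stock fields
CREATE_STOCKS_SQL = """
CREATE TABLE IF NOT EXISTS stocks (
    id INT,                        -- serial number
    board VARCHAR(32),             -- board type (Shanghai & Shenzhen A / Shanghai A / Shenzhen A)
    bStockNo VARCHAR(16),          -- stock code
    bStockName VARCHAR(64),        -- stock name
    latestPrice VARCHAR(16),       -- latest quotation
    riseFallPercent VARCHAR(16),   -- fluctuation range
    riseFallAmount VARCHAR(16),    -- rise and fall
    volume VARCHAR(32),            -- turnover (volume)
    turnover VARCHAR(32),          -- turnover amount
    amplitude VARCHAR(16),
    high VARCHAR(16),
    low VARCHAR(16),
    openToday VARCHAR(16),
    prevClose VARCHAR(16)
)
"""
```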
```python
def OpenDriver(url):
    chrome_options = Options()
    # chrome_options.add_argument('--headless')
    chrome_options.add_argument('--disable-gpu')
    driver = webdriver.Chrome(options=chrome_options)
    driver.get(url)   # open the start page in the browser
    return driver
```
The OpenDriver function opens the browser using Selenium's Chrome webdriver.
```python
def GetInfo(page):
    time.sleep(3)   # wait for the page to load
    print(driver.find_element(By.XPATH, '//*[@id="table_wrapper-table"]/tbody').text)
    stocks = driver.find_element(By.XPATH, '//*[@id="table_wrapper-table"]/tbody').text.split("\n")
    # print(stocks)
    if (page - 1) // 3 == 0:
        type = 'Shanghai & Shenzhen A shares'
    elif (page - 1) // 3 == 1:
        type = 'Shanghai A shares'
    else:
        type = 'Shenzhen A shares'
    for stock in stocks:
        stock = stock.split(" ")
        no = int(stock[0])
        StockNo = stock[1]
        StockName = stock[2]
        LatestQuotation = stock[4]
        FluctuationRange = stock[5]
        RiseAndFall = stock[6]
        Turnover = stock[7]
        TurnoverMoney = stock[8]
        Amplitude = stock[9]
        Highest = stock[10]
        Minimum = stock[11]
        TodayOpen = stock[12]
        ReceivedYesterday = stock[13]
        db.insert(no, type, StockNo, StockName, LatestQuotation, FluctuationRange,
                  RiseAndFall, Turnover, TurnoverMoney, Amplitude, Highest, Minimum,
                  TodayOpen, ReceivedYesterday)
```
Based on the page parameter passed in, the GetInfo function determines which board the current data belongs to, then extracts all required fields in turn and stores them in the database.
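Splitting the whole row text on spaces is fragile if any field itself contains a space; a sketch of a per-cell alternative using find_elements (same table XPath assumed; the column positions would need to be checked against the live page):

```python
rows = driver.find_elements(By.XPATH, '//*[@id="table_wrapper-table"]/tbody/tr')
for row in rows:
    # Read each cell of the row individually instead of splitting the row text
    cells = [td.text for td in row.find_elements(By.XPATH, './td')]
    # cells[0]: serial number, cells[1]: stock code, cells[2]: stock name, ...
```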
```python
def ToNextPage():
    locator = (By.XPATH, '//*[@id="main-table_paginate"]/a[2]')
    WebDriverWait(driver, 10, 0.5).until(EC.presence_of_element_located(locator))   # wait for the button to load
    NextPage = driver.find_element(By.XPATH, '//*[@id="main-table_paginate"]/a[2]')
    NextPage.click()
```
The ToNextPage function clicks the "next page" button so that new data can be read.
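If the click is ever intercepted before the button becomes interactive, waiting for clickability instead of mere presence is a common variant; a sketch with the same locator:

```python
# until() returns the element once it is clickable, so it can be clicked directly
NextPage = WebDriverWait(driver, 10, 0.5).until(
    EC.element_to_be_clickable((By.XPATH, '//*[@id="main-table_paginate"]/a[2]')))
NextPage.click()
```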
```python
def ToNextType(type):
    locator = (By.XPATH, type)
    WebDriverWait(driver, 10, 0.5).until(EC.presence_of_element_located(locator))   # wait for the button to load
    NextType = driver.find_element(By.XPATH, type)
    NextType.click()
```
The ToNextType function locates the board button through the XPath passed in and clicks it to switch to another board's stock page.
The records stored in the database are shown in the figure: three pages are crawled for each of the three boards, 180 rows of information in total.
3. Experience
Assignment ③ mainly reviewed the use of the Selenium framework; crawling Ajax-rendered web page data with Selenium is relatively simple and convenient.
Code on Gitee: 2019 data acquisition and fusion technology: practical work of data acquisition and fusion technology - Gitee.com