Python Selenium practice (Baidu, famous quotes, JD)


Selenium environment configuration is covered in many Bilibili (B站) crawler tutorials.



1, Open Baidu and search


Open Baidu:

from selenium.webdriver import Chrome

web = Chrome()

web.get('https://www.baidu.com')


In the developer tools, find the search input box (its id is kw):


Enter the search term and press Enter:

input_btn = web.find_element_by_id('kw')
input_btn.send_keys('Jackie Chan', Keys.ENTER)  # Keys comes from the import shown in the full code below

Full code and test run:

from selenium.webdriver import Chrome
from selenium.webdriver.common.keys import Keys

web = Chrome()

web.get('https://www.baidu.com')

web.maximize_window()

input_btn = web.find_element_by_id('kw')
input_btn.send_keys('Jackie Chan', Keys.ENTER)
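
Note: this post uses the Selenium 3 find_element_by_* helpers, which were removed in Selenium 4. If you are on a recent release, an equivalent sketch using the By locator API:

from selenium.webdriver import Chrome
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys

web = Chrome()
web.get('https://www.baidu.com')
web.maximize_window()

# Same search, written against the Selenium 4 locator API
input_btn = web.find_element(By.ID, 'kw')
input_btn.send_keys('Jackie Chan', Keys.ENTER)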


2, Crawl famous quotes

The target site is http://quotes.toscrape.com/js/. The task is to crawl the quotes and their authors from the first five pages and store them in a CSV file.


1. Crawl one page

First crawl the first page as a test.

In the developer tools, you can see that each quote group (quote text + author) sits in a div with class="quote", and no other tags use that class:


Within each div, the quote text is in a <span> tag with class="text", and the author is in a <small> tag with class="author":


The code for the first page is therefore:

div_list = web.find_elements_by_class_name('quote')   # one div per quote
print(len(div_list))
for div in div_list:
    saying = div.find_element_by_class_name('text').text    # quote text
    author = div.find_element_by_class_name('author').text  # author name
    info = [saying, author]
    print(info)

The results are as follows:


2. Crawl five pages

After crawling a page, you need to move on to the next one, i.e. click the pagination button.

The Next button itself has only an href attribute, so it is hard to locate directly. The first page has only a Next button, while later pages have both Previous and Next buttons, so a fixed XPath position does not work either:


However, its child <span> element (the arrow) carries an aria-hidden attribute. On the first page that attribute is unique; on later pages several elements carry aria-hidden, but the Next arrow is always the last one.

So you can jump to the next page by clicking the last span with an aria-hidden attribute:

web.find_elements_by_css_selector('[aria-hidden]')[-1].click()
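
A hedged alternative: on quotes.toscrape.com the Next link sits inside an <li class="next">, so a CSS selector on that class is arguably more robust than counting aria-hidden spans (assuming the markup stays as it is today):

# Locate the Next link via its parent li.next (assumed page structure)
web.find_element_by_css_selector('li.next > a').click()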

Test:

n = 5
for i in range(0, n):
    div_list = web.find_elements_by_class_name('quote')
    print(len(div_list))
    for div in div_list:
        saying = div.find_element_by_class_name('text').text
        author = div.find_element_by_class_name('author').text
        info = [saying, author]
        print(info)
    if i == n-1:
        break
    web.find_elements_by_css_selector('[aria-hidden]')[-1].click()
    time.sleep(2)  # requires "import time"; give the next page time to render

Successfully flipped:
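
A fixed time.sleep(2) works but wastes time on fast loads and may be too short on slow ones. A minimal sketch using an explicit wait instead (assuming the quote divs signal that the page has rendered; on a real page you may also need to wait for the old elements to go stale):

from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By

# Wait up to 10 seconds for the quote divs to be present on the new page
WebDriverWait(web, 10).until(
    EC.presence_of_all_elements_located((By.CLASS_NAME, 'quote'))
)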


3. Data storage

sayingAndAuthor = []
n = 5
for i in range(0, n):
    div_list = web.find_elements_by_class_name('quote')
    for div in div_list:
        saying = div.find_element_by_class_name('text').text
        author = div.find_element_by_class_name('author').text
        info = [saying, author]
        sayingAndAuthor.append(info)
    print('Successfully crawled page ' + str(i + 1))
    if i == n-1:
        break
    web.find_elements_by_css_selector('[aria-hidden]')[-1].click()
    time.sleep(2)

with open('Famous quotes.csv', 'w', encoding='utf-8', newline='') as fp:
    fileWrite = csv.writer(fp)
    fileWrite.writerow(['quote', 'author'])   # write header
    fileWrite.writerows(sayingAndAuthor)

4. Full code

from selenium.webdriver import Chrome
import time
import csv

web = Chrome()

web.get('http://quotes.toscrape.com/js/')


sayingAndAuthor = []
n = 5
for i in range(0, n):
    div_list = web.find_elements_by_class_name('quote')
    for div in div_list:
        saying = div.find_element_by_class_name('text').text
        author = div.find_element_by_class_name('author').text
        info = [saying, author]
        sayingAndAuthor.append(info)
    print('Successfully crawled page ' + str(i + 1))
    if i == n-1:
        break
    web.find_elements_by_css_selector('[aria-hidden]')[-1].click()
    time.sleep(2)

with open('Famous quotes.csv', 'w', encoding='utf-8', newline='') as fp:
    fileWrite = csv.writer(fp)
    fileWrite.writerow(['quote', 'author'])   # write header
    fileWrite.writerows(sayingAndAuthor)
web.close()
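
Side note: web.close() only closes the current window; to also shut down the driver process you would call:

web.quit()  # closes all windows and ends the chromedriver session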

Crawling results:


3, Crawl JD (Jingdong) book information


Crawl the first three pages of book results for a keyword; this post uses "science fiction" as an example.


1. Crawl the first page


  • Open JD and search for "science fiction":
from selenium.webdriver import Chrome
from selenium.webdriver.common.keys import Keys

web = Chrome()

web.get('https://www.jd.com/')
web.maximize_window()
web.find_element_by_id('key').send_keys('science fiction', Keys.ENTER)  # Find the input box and enter


In the developer tools, you can see that each product sits in an <li> whose class contains "gl-item" (when the mouse hovers over an li, its class becomes "gl-item hover"):


On first load there are 30 li elements, but after scrolling down to the bottom of the page the count becomes 60, so you need to scroll down first:


All li elements are therefore obtained as follows:

web.execute_script('window.scrollTo(0, document.body.scrollHeight);')
time.sleep(2)
page_text = web.page_source
tree = etree.HTML(page_text)
li_list = tree.xpath('//li[contains(@class,"gl-item")]')
print(len(li_list))
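
One jump to the bottom is enough here, but lazy-loaded pages sometimes need gradual scrolling so that every item's load event fires. A hedged variant that scrolls in steps (the step size and pause are arbitrary choices, not values from JD):

import time

# Scroll down in increments so lazy-loaded items have time to appear
height = web.execute_script('return document.body.scrollHeight')
for y in range(0, height, 500):          # 500 px per step (assumption)
    web.execute_script(f'window.scrollTo(0, {y});')
    time.sleep(0.3)                      # short pause between steps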



Then extract each book's information inside a loop:

for li in li_list:
	pass

Get title and price:

book_name = ''.join(li.xpath('.//div[@class="p-name"]/a/em/text()'))
price = '¥' + li.xpath('.//div[@class="p-price"]/strong/i/text()')[0]

For the author, some books have none listed; in that case record it as 'nothing':

author_span = li.xpath('.//span[@class="p-bi-name"]/a/text()')
if len(author_span) > 0:
    author = author_span[0]
else:
    author = 'nothing'

Obtaining the publisher works the same way as the author (a small helper could factor out this repeated pattern; see the sketch after this snippet):

store_span = li.xpath('.//span[@class="p-bi-store"]/a[1]/text()')
if len(store_span) > 0:
    store = store_span[0]
else:
    store = 'nothing'
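
Since the author and publisher lookups share the same "first match or default" pattern, a small helper can factor it out. first_or is a hypothetical name of mine, not part of lxml or the page:

def first_or(nodes, default='nothing'):
    """Return the first item of an XPath result list, or a default if empty."""
    return nodes[0] if nodes else default

author = first_or(li.xpath('.//span[@class="p-bi-name"]/a/text()'))
store = first_or(li.xpath('.//span[@class="p-bi-store"]/a[1]/text()'))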

For the cover image address, some books have it in the src attribute and some in the data-lazy-img attribute:

The image address is therefore obtained as follows (the attribute value is protocol-relative, starting with //, hence the 'https:' prefix):

img_url_a = li.xpath('.//div[@class="p-img"]/a/img')[0]
if len(img_url_a.xpath('./@src')) > 0:
    img_url = 'https:' + img_url_a.xpath('./@src')[0]  # cover image address
else:
    img_url = 'https:' + img_url_a.xpath('./@data-lazy-img')[0]

Putting it together, crawling one page looks like this:

# Crawl one page
def get_onePage_info(web):
    web.execute_script('window.scrollTo(0, document.body.scrollHeight);')
    time.sleep(2)
    page_text = web.page_source

    # Parse
    tree = etree.HTML(page_text)
    li_list = tree.xpath('//li[contains(@class,"gl-item")]')
    book_infos = []
    for li in li_list:
        book_name = ''.join(li.xpath('.//div[@class="p-name"]/a/em/text()'))  # book title
        price = '¥' + li.xpath('.//div[@class="p-price"]/strong/i/text()')[0]  # price
        author_span = li.xpath('.//span[@class="p-bi-name"]/a/text()')
        if len(author_span) > 0:  # author
            author = author_span[0]
        else:
            author = 'nothing'
        store_span = li.xpath('.//span[@class="p-bi-store"]/a[1]/text()')  # publisher
        if len(store_span) > 0:
            store = store_span[0]
        else:
            store = 'nothing'
        img_url_a = li.xpath('.//div[@class="p-img"]/a/img')[0]
        if len(img_url_a.xpath('./@src')) > 0:
            img_url = 'https:' + img_url_a.xpath('./@src')[0]  # cover image address
        else:
            img_url = 'https:' + img_url_a.xpath('./@data-lazy-img')[0]
        one_book_info = [book_name, price, author, store, img_url]
        book_infos.append(one_book_info)
    return book_infos

2. Crawl three pages


Click Next:

web.find_element_by_class_name('pn-next').click()  # Click next

Crawl three pages:

all_book_info = []
for i in range(0, 3):
    all_book_info += get_onePage_info(web)
    if i < 2:  # no need to page forward after the last page
        web.find_element_by_class_name('pn-next').click()  # click Next
        time.sleep(2)

3. Data storage

with open('JD.COM-science fiction.csv', 'w', encoding='utf-8', newline='') as fp:
    writer = csv.writer(fp)
    writer.writerow(['title', 'price', 'author', 'publisher', 'cover image URL'])
    writer.writerows(all_book_info)

4. Full code

from selenium.webdriver import Chrome
from selenium.webdriver.common.keys import Keys
import time
from lxml import etree
import csv

# Crawl one page
def get_onePage_info(web):
    web.execute_script('window.scrollTo(0, document.body.scrollHeight);')
    time.sleep(2)
    page_text = web.page_source
    # with open('3-.html', 'w', encoding='utf-8') as fp:
    #     fp.write(page_text)
    # Parse
    tree = etree.HTML(page_text)
    li_list = tree.xpath('//li[contains(@class,"gl-item")]')
    book_infos = []
    for li in li_list:
        book_name = ''.join(li.xpath('.//div[@class="p-name"]/a/em/text()'))  # book title
        price = '¥' + li.xpath('.//div[@class="p-price"]/strong/i/text()')[0]  # price
        author_span = li.xpath('.//span[@class="p-bi-name"]/a/text()')
        if len(author_span) > 0:  # author
            author = author_span[0]
        else:
            author = 'nothing'
        store_span = li.xpath('.//span[@class="p-bi-store"]/a[1]/text()')  # publisher
        if len(store_span) > 0:
            store = store_span[0]
        else:
            store = 'nothing'
        img_url_a = li.xpath('.//div[@class="p-img"]/a/img')[0]
        if len(img_url_a.xpath('./@src')) > 0:
            img_url = 'https:' + img_url_a.xpath('./@src')[0]  # cover image address
        else:
            img_url = 'https:' + img_url_a.xpath('./@data-lazy-img')[0]
        one_book_info = [book_name, price, author, store, img_url]
        book_infos.append(one_book_info)
    return book_infos


def main():
    web = Chrome()
    web.get('https://www.jd.com/')
    web.maximize_window()
    web.find_element_by_id('key').send_keys('science fiction', Keys.ENTER)  # Find the input box and enter
    time.sleep(2)
    all_book_info = []
    for i in range(0, 3):
        all_book_info += get_onePage_info(web)
        print('Successfully crawled page ' + str(i + 1))
        if i < 2:  # no need to page forward after the last page
            web.find_element_by_class_name('pn-next').click()  # click Next
            time.sleep(2)
    with open('JD.COM-science fiction.csv', 'w', encoding='utf-8', newline='') as fp:
        writer = csv.writer(fp)
        writer.writerow(['title', 'price', 'author', 'publisher', 'cover image URL'])
        writer.writerows(all_book_info)


if __name__ == '__main__':
    main()

Results:



4, Summary

Selenium is very convenient for crawling dynamically rendered data, but it is relatively slow.
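
One common way to claw back some speed is to run Chrome headless; a minimal sketch:

from selenium.webdriver import Chrome, ChromeOptions

opts = ChromeOptions()
opts.add_argument('--headless')      # no visible browser window
opts.add_argument('--disable-gpu')   # conventional companion flag on some platforms
web = Chrome(options=opts)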


References

CSS attribute selectors

Summary of Selenium scroll-bar operation methods in Python
