Using the Selenium library to crawl data from JD's search pages

Environment requirements
  • A Python runtime environment
  • The Selenium library (pip install selenium)
  • The pyquery library (pip install pyquery)
  • pymongo (pip install pymongo), plus a local MongoDB instance; if you don't have MongoDB, you can skip this one
Problem statement
  • We open JD's home page, type "food" into the search bar, and run the search

  • Through Chrome's developer console we can see every request the browser sends. Because an e-commerce page fires so many requests, we can't recover the page data simply by analyzing the browser's URL or the site's Ajax requests
Analyzing a solution

The browser renders its interface from the DOM, built from the HTML and JS files it downloads. Since the browser exposes no external interface that lets us read the rendered page, is there a tool that can simulate a browser, make the page requests for us, and hand back the data we need through an external interface? The answer is: yes.

selenium Library

https://python-selenium-zh.readthedocs.io/zh_CN/latest/ (Selenium + Python documentation, in Chinese)
Since version 2.0, Selenium has integrated the WebDriver API, providing a simpler and more concise programming interface.
We can do everything we need through the API in WebDriver.

  • Getting started
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
driver = webdriver.Firefox()

This gives us a Firefox browser instance (Chrome and other browsers work the same way)

  • Send request
driver.get('https://www.baidu.com')
  • Extract information
input = driver.find_element_by_id('kw')
input.send_keys(u'Delicious food')
input.send_keys(Keys.ENTER)
wait = WebDriverWait(driver, 10)
wait.until(EC.presence_of_element_located((By.ID, 'content_left')))
print(driver.page_source)

page_source is the page information we need

  • Close browser
driver.close()
pyquery Library


The official site describes pyquery as a jQuery-like library for Python. We won't explain it in depth here; we introduce it so that we can parse the HTML we retrieve more conveniently

Hands-on practice

Now that we have a basic understanding of the tools we need, let's put them to work

  • Requirement analysis
    1: Open JD's home page, search for delicacies, and collect information about every result
    2: Turn pages during the search
    3: Save the content to a file or database

  • Implementation

1: Import the third-party libraries we need

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from config import *
from pyquery import PyQuery as pq
import json

import pymongo
client = pymongo.MongoClient(MONGO_URL)
db = client[MONGO_DB]
browser = webdriver.PhantomJS(service_args=SERVICE_ARGS)
# I'm using PhantomJS, a headless browser. To use it you need to download it, install it, and add it to your PATH

2: Define the function that types the search keyword and clicks the search button

def search():
  try:
      browser.get('https://www.jd.com/')
      input = WebDriverWait(browser, 10).until(
        EC.presence_of_element_located((By.CSS_SELECTOR, '#key'))
      )
      submit = WebDriverWait(browser, 10).until(
        EC.element_to_be_clickable((By.CSS_SELECTOR, '#search > div > div.form > button'))
      )
      input.send_keys(KEYWORD)
      submit.click()
      total = WebDriverWait(browser, 10).until(
        EC.presence_of_element_located((By.CSS_SELECTOR, '#J_bottomPage > span.p-skip > em:nth-child(1) > b'))
        )
      get_products()
      return total.text
  except Exception:
      return search()

It takes time for the page to load. Selenium's explicit waits let us check whether the elements we need have finished loading before we operate on them
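Under the hood, WebDriverWait.until simply polls a condition callable until it returns a truthy value or the timeout expires. Here is a minimal pure-Python sketch of that polling idea, not Selenium's actual implementation; the names wait_until and element_loaded are my own:

```python
import time

def wait_until(condition, timeout=10, poll_interval=0.5):
    """Poll `condition` until it returns a truthy value or `timeout` seconds pass."""
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() > deadline:
            raise TimeoutError('condition not met within %s seconds' % timeout)
        time.sleep(poll_interval)

# Simulate an element that "appears" on the third poll
state = {'calls': 0}
def element_loaded():
    state['calls'] += 1
    return 'element' if state['calls'] >= 3 else None

print(wait_until(element_loaded, timeout=5, poll_interval=0.01))  # → element
```

Selenium's expected_conditions (EC.presence_of_element_located and friends) are just factories that build such condition callables for WebDriverWait to poll.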

3: Define the page-turning function

def next_page(page_number):
  try:
      input = WebDriverWait(browser, 10).until(
        EC.presence_of_element_located((By.CSS_SELECTOR, '#J_bottomPage > span.p-skip > input'))
      )
      submit = WebDriverWait(browser, 10).until(
        EC.element_to_be_clickable((By.CSS_SELECTOR, '#J_bottomPage > span.p-skip > a'))
      )
      
      input.clear()
      input.send_keys(page_number)
      submit.click()
      WebDriverWait(browser,10).until(
        EC.text_to_be_present_in_element((By.CSS_SELECTOR,'#J_bottomPage > span.p-num > a.curr'),str(page_number))
      )
      get_products()
  except Exception:
        next_page(page_number)

We use exception handling to detect when the page we need has failed to load; when an exception is thrown, we simply recurse, which keeps retrying until we get the information
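One caveat: recursing unconditionally on every exception can loop forever if the page never loads, and will eventually hit Python's recursion limit. A bounded-retry loop is a safer variant of the same idea; this is a sketch of my own (the name retry and the attempts parameter are not from the original code):

```python
def retry(func, attempts=3):
    """Call `func` up to `attempts` times, returning its result on the first success."""
    last_error = None
    for _ in range(attempts):
        try:
            return func()
        except Exception as e:
            last_error = e
    # All attempts failed: re-raise the last exception instead of looping forever
    raise last_error

# Example: a flaky operation that succeeds on the third try
state = {'calls': 0}
def flaky():
    state['calls'] += 1
    if state['calls'] < 3:
        raise RuntimeError('page not ready')
    return 'loaded'

print(retry(flaky, attempts=5))  # → loaded
```

With this pattern, next_page(page_number) could be wrapped as retry(lambda: do_next_page(page_number)) to cap the number of attempts.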

4: Define the function to get product information

def get_products():
    WebDriverWait(browser, 10).until(
      EC.presence_of_element_located((By.CSS_SELECTOR, '#J_goodsList .gl-item .gl-i-wrap'))
    )
    html = browser.page_source
    doc = pq(html)
    items = doc('#J_goodsList .gl-item .gl-i-wrap').items()
    for item in items:
        product = {
          # the product image lives in the nested <img>; on JD, lazy-loaded
          # images may carry the URL in data-lazy-img instead of src
          'image': 'https:' + (item.find('.p-img a img').attr('src') or ''),
          'price': item.find('.p-price i').text(),
          'ratings': item.find('.p-commit strong a').text(),
          'shop': item.find('.p-shop a').text(),
          'title': item.find('.p-name em').text(),
          'info': item.find('.p-icons i').text()
        }
        print(product)
        # save(product)
        write_to_file(product)

Here we use pyquery to parse the page. Students unfamiliar with it can consult the official site for usage (some basic knowledge of DOM operations helps)

5: Define how to write to a file

def write_to_file(content):
    with open('jingdong.txt', 'a', encoding='utf-8') as f:
        f.write(json.dumps(content, ensure_ascii=False) + '\n')
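Since each line of jingdong.txt is a standalone JSON object (the JSON Lines format), reading the data back is just json.loads per line. A small sketch, using a temporary file and a hypothetical read_from_file helper so we don't touch real data:

```python
import json
import os
import tempfile

def write_to_file(content, path='jingdong.txt'):
    # Append one JSON object per line (JSON Lines format)
    with open(path, 'a', encoding='utf-8') as f:
        f.write(json.dumps(content, ensure_ascii=False) + '\n')

def read_from_file(path='jingdong.txt'):
    # Parse the file back into a list of dicts, one per non-empty line
    with open(path, encoding='utf-8') as f:
        return [json.loads(line) for line in f if line.strip()]

# Demo on a temporary file
path = os.path.join(tempfile.mkdtemp(), 'jingdong.txt')
write_to_file({'title': 'Snack A', 'price': '9.90'}, path)
write_to_file({'title': 'Snack B', 'price': '19.90'}, path)
print(read_from_file(path))
```

ensure_ascii=False keeps non-ASCII characters (such as Chinese product titles) readable in the file instead of escaping them.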

6: Define how to write to the database

def save(result):
  try:
      # insert_one is the current pymongo API; the older insert() is deprecated
      if db[MONGO_TABLE].insert_one(result):
          print('Stored to database successfully')
  except Exception:
      print('Storage failed', result)

7: All codes

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from config import *
from pyquery import PyQuery as pq
import json

import pymongo
client = pymongo.MongoClient(MONGO_URL)
db = client[MONGO_DB]
browser = webdriver.PhantomJS(service_args=SERVICE_ARGS)


def search():
  try:
      browser.get('https://www.jd.com/')
      input = WebDriverWait(browser, 10).until(
        EC.presence_of_element_located((By.CSS_SELECTOR, '#key'))
      )
      submit = WebDriverWait(browser, 10).until(
        EC.element_to_be_clickable((By.CSS_SELECTOR, '#search > div > div.form > button'))
      )
      input.send_keys(KEYWORD)
      submit.click()
      total = WebDriverWait(browser, 10).until(
        EC.presence_of_element_located((By.CSS_SELECTOR, '#J_bottomPage > span.p-skip > em:nth-child(1) > b'))
        )
      get_products()
      return total.text
  except Exception:
      return search()

def next_page(page_number):
  try:
      input = WebDriverWait(browser, 10).until(
        EC.presence_of_element_located((By.CSS_SELECTOR, '#J_bottomPage > span.p-skip > input'))
      )
      submit = WebDriverWait(browser, 10).until(
        EC.element_to_be_clickable((By.CSS_SELECTOR, '#J_bottomPage > span.p-skip > a'))
      )
      
      input.clear()
      input.send_keys(page_number)
      submit.click()
      WebDriverWait(browser,10).until(
        EC.text_to_be_present_in_element((By.CSS_SELECTOR,'#J_bottomPage > span.p-num > a.curr'),str(page_number))
      )
      get_products()
  except Exception:
        next_page(page_number)

def get_products():
    WebDriverWait(browser, 10).until(
      EC.presence_of_element_located((By.CSS_SELECTOR, '#J_goodsList .gl-item .gl-i-wrap'))
    )
    html = browser.page_source
    doc = pq(html)
    items = doc('#J_goodsList .gl-item .gl-i-wrap').items()
    for item in items:
        product = {
          # the product image lives in the nested <img>; on JD, lazy-loaded
          # images may carry the URL in data-lazy-img instead of src
          'image': 'https:' + (item.find('.p-img a img').attr('src') or ''),
          'price': item.find('.p-price i').text(),
          'ratings': item.find('.p-commit strong a').text(),
          'shop': item.find('.p-shop a').text(),
          'title': item.find('.p-name em').text(),
          'info': item.find('.p-icons i').text()
        }
        print(product)
        # save(product)
        write_to_file(product)
def main():
  try:
    total = search()
    total = int(total)
    for i in range(2, total + 1):
      next_page(i)
  finally:
    browser.close()
    

def write_to_file(content):
    with open('jingdong.txt', 'a', encoding='utf-8') as f:
        f.write(json.dumps(content, ensure_ascii=False) + '\n')

def save(result):
  try:
      # insert_one is the current pymongo API; the older insert() is deprecated
      if db[MONGO_TABLE].insert_one(result):
          print('Stored to database successfully')
  except Exception:
      print('Storage failed', result)

      
if __name__ == '__main__':
   main() 

8: Contents of config.py

SERVICE_ARGS = ['--load-images=false','--disk-cache=true']
MONGO_URL = 'localhost'
MONGO_DB = 'jingdong'
MONGO_TABLE = 'product'

KEYWORD = u'Delicious food'

This is the runtime configuration file

Result view
  • Data in the file
  • Data in the database

So we get the information we need.

I hope more friends will come and exchange ideas with me


Tags: Selenium Database JSON Python

Posted on Sat, 01 Feb 2020 07:25:12 -0500 by crwtrue