Python advanced crawler notes

Preface

Selenium is often described as a beginner-friendly crawling tool, but I don't think it is a good starting point for beginners.
I recommend looking at Selenium only after you understand requests-based crawlers and have some general crawling knowledge.

In fact, requests-based crawlers are enough to handle most websites at this stage.

About Selenium

Selenium was created in 2004 by Jason Huggins, a testing engineer at ThoughtWorks. It was built for automated testing of web interactions, to avoid repetitive manual work.
A crawler can also use it to load web pages automatically and then grab data from them.

Official documents


  1. Download ChromeDriver
    Note: the driver version must match the version of Chrome currently in use
    Tip: macOS users can put this file in the /usr/local/bin/ directory, which saves some configuration trouble
  2. pip install selenium
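
The version note above can be checked mechanically. The following is a minimal sketch, assuming you paste in the version strings that Chrome and ChromeDriver report; the function names and the sample strings are made up for illustration, and comparing only the major number is a simplification.

```python
import re


def major_version(version_text):
    """Return the leading integer of the first dotted version number in the text."""
    match = re.search(r"(\d+)(?:\.\d+)+", version_text)
    if match is None:
        raise ValueError(f"no version number found in {version_text!r}")
    return int(match.group(1))


def versions_match(chrome_text, driver_text):
    """True when Chrome and ChromeDriver share the same major version."""
    return major_version(chrome_text) == major_version(driver_text)


# Illustrative strings in the style the two programs print.
print(versions_match("Google Chrome 80.0.3987.87", "ChromeDriver 80.0.3987.16"))
print(versions_match("Google Chrome 80.0.3987.87", "ChromeDriver 79.0.3945.36"))
```

The first call reports a match and the second does not, which is the situation where the driver typically refuses to start a session.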


  1. Set the configuration
    option = webdriver.ChromeOptions()
  2. Create the driver
    driver = webdriver.Chrome(options=option)
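
The two steps above can be wrapped in a small helper. This is only a sketch: `chrome_arguments` and `make_driver` are hypothetical names, and the `window-size` switch is an extra Chrome switch that does not appear elsewhere in this post. The selenium import sits inside `make_driver` so that the argument-building part can be tried without a browser installed.

```python
def chrome_arguments(headless=False, window_size=None):
    """Build the list of command-line switches to hand to ChromeOptions."""
    args = []
    if headless:
        args.append("headless")
    if window_size is not None:
        width, height = window_size
        args.append(f"window-size={width},{height}")
    return args


def make_driver(headless=False, window_size=None):
    """Create a Chrome driver; needs selenium, Chrome and ChromeDriver installed."""
    from selenium import webdriver  # imported lazily, see the note above

    option = webdriver.ChromeOptions()
    for arg in chrome_arguments(headless, window_size):
        option.add_argument(arg)
    return webdriver.Chrome(options=option)


print(chrome_arguments(headless=True, window_size=(1280, 800)))
# prints ['headless', 'window-size=1280,800']
```

Calling `make_driver(headless=True)` would then start Chrome without a visible window, which is usually what you want once a crawler has been debugged.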

A quick first try

# Interact with the Baidu homepage

import time

from selenium import webdriver
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC

option = webdriver.ChromeOptions()
# option.add_argument('headless')

# Use a ChromeDriver build that matches your operating system
driver = webdriver.Chrome(options=option)

# Set this to the Baidu homepage address before running
url = ''

# Open the website
driver.get(url)

# Print the current page title
print(driver.title)

# Enter text in the search box
# (the search term is assumed to be 'python', judging by the output below)
timeout = 5
search_content = WebDriverWait(driver, timeout).until(
    # lambda d: d.find_element(By.XPATH, '//input[@id="kw"]')
    EC.presence_of_element_located((By.XPATH, '//input[@id="kw"]'))
)
search_content.send_keys('python')

# Pause briefly (the duration is a guess)
time.sleep(3)

# Simulate a click on the "Baidu it" search button
search_button = WebDriverWait(driver, timeout).until(
    lambda d: d.find_element(By.XPATH, '//input[@id="su"]'))
search_button.click()

# Print search results
search_results = WebDriverWait(driver, timeout).until(
    # lambda d: d.find_elements(By.XPATH, '//h3[@class="t c-title-en"] | //h3[@class="t"]')
    lambda d: d.find_elements(By.XPATH, '//h3[contains(@class,"t")]/a[1]')
)
# print(search_results)

for item in search_results:
    print(item.text)

# Close the browser when done
driver.quit()

Output (if the driver is created with chrome_options=, Selenium 3 first prints "DeprecationWarning: use options instead of chrome_options"):

Baidu once, you will know
 python free online learning every day
Welcome to
 Python Baidu Encyclopedia
 Python basic tutorial | rookie tutorial
Download Python |
 Python tutorial - Liao Xuefeng's official website
 Python, official computer version, pure download by Chinese Army
 We live in the "Python era"
Introduction to Python
 Python - Zhihu
 Python basic tutorial, python introduction tutorial (very detailed)
Intel Python distribution
 You don't know how to program, you don't know how to make art, and you can easily make games
 Free and versatile pagoda Linux panel one click management server
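
The WebDriverWait(...).until(...) calls in the example above keep calling a condition until it returns something truthy or the timeout runs out. The following is a rough mental model of that polling loop in plain Python, not Selenium's actual implementation: `wait_until`, `WaitTimeout` and the dict standing in for a page are all made up for illustration, and `LookupError` stands in for the NoSuchElementException that WebDriverWait ignores by default.

```python
import time


class WaitTimeout(Exception):
    """Raised by this sketch when the condition never becomes truthy."""


def wait_until(target, condition, timeout=5.0, poll=0.5):
    """Call condition(target) repeatedly until it returns a truthy value."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            result = condition(target)
            if result:
                return result
        except LookupError:
            # "Element not there yet": swallow the error and poll again.
            pass
        if time.monotonic() >= deadline:
            raise WaitTimeout(f"condition not met within {timeout} s")
        time.sleep(poll)


# A plain dict stands in for a page whose search box appears late.
page = {}
attempts = []


def search_box(p):
    attempts.append(1)
    if len(attempts) == 3:       # the "element" shows up on the third poll
        p["kw"] = "<input id='kw'>"
    return p["kw"]               # raises KeyError (a LookupError) until then


found = wait_until(page, search_box, timeout=2.0, poll=0.01)
print(found, len(attempts))
```

This is why both a lambda and an EC.* object work as the argument of until: each is just a callable that receives the driver and returns the element (or a falsy value) on every poll.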

Page interaction methods

# Find an element:
element = driver.find_element_by_id("passwd-id")
element = driver.find_element_by_name("passwd")
element = driver.find_element_by_xpath("//input[@id='passwd-id']")

# Enter text:
element.send_keys("some text")

# Click:
element.click()

# Action chain (target is another element, the one to drop onto)
from selenium.webdriver import ActionChains
action_chains = ActionChains(driver)
action_chains.drag_and_drop(element, target).perform()

# Switch between pages (here: to the last handle in the list)
window_handles = driver.window_handles
driver.switch_to.window(window_handles[-1])

# Save a screenshot (the file name is only an example)
driver.save_screenshot("screenshot.png")
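
When a click opens a new tab, it is easy to forget to switch back afterwards. One way to handle this is a small context manager. It is a sketch under two assumptions: the newest window is the last entry of window_handles (common in practice, but not guaranteed), and `newest_window` is a hypothetical helper name. The fake classes exist only so that the sketch runs without a browser; a real driver exposes the same three attributes.

```python
from contextlib import contextmanager


@contextmanager
def newest_window(driver):
    """Switch to the last window handle, then switch back on exit."""
    original = driver.current_window_handle
    driver.switch_to.window(driver.window_handles[-1])
    try:
        yield driver
    finally:
        driver.switch_to.window(original)


# Minimal stand-ins for a real selenium driver.
class FakeSwitchTo:
    def __init__(self, driver):
        self._driver = driver

    def window(self, handle):
        self._driver.current_window_handle = handle


class FakeDriver:
    def __init__(self, handles):
        self.window_handles = list(handles)
        self.current_window_handle = self.window_handles[0]
        self.switch_to = FakeSwitchTo(self)


driver = FakeDriver(["main", "popup"])
with newest_window(driver) as d:
    inside = d.current_window_handle   # work on the new window here
after = driver.current_window_handle
print(inside, after)
# prints popup main
```

With a real driver, the body of the with block is where you would call something like d.save_screenshot(...) or read elements from the new tab.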

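Locating elements

# Find a single element: the find_element_by_* methods return the first match

# Find multiple elements: the find_elements_by_* methods return a list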
# Locate by id

  <form id="loginForm">
   <input name="username" type="text" />
   <input name="password" type="password" />
   <input name="continue" type="submit" value="Login" />
  </form>

login_form = driver.find_element_by_id('loginForm')

# Locate by name

  <form id="loginForm">
   <input name="username" type="text" />
   <input name="password" type="password" />
   <input name="continue" type="submit" value="Login" />
   <input name="continue" type="button" value="Clear" />
  </form>

username = driver.find_element_by_name('username')
password = driver.find_element_by_name('password')

# Locate by link text

  <p>Are you sure you want to do this?</p>
  <a href="continue.html">Continue</a>
  <a href="cancel.html">Cancel</a>

continue_link = driver.find_element_by_link_text('Continue')
continue_link = driver.find_element_by_partial_link_text('Conti')

# Locate by tag name

  <h1>Welcome</h1>
  <p>Site content goes here.</p>

heading1 = driver.find_element_by_tag_name('h1')

# Locate by class name

  <p class="content">Site content goes here.</p>

content = driver.find_element_by_class_name('content')

# Locate by CSS selector

  <p class="content">Site content goes here.</p>

content = driver.find_element_by_css_selector('p.content')

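# Two generic methods (in Selenium 4.3 and later these replace the find_element_by_* helpers, which were removed)
from selenium.webdriver.common.by import By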

driver.find_element(By.XPATH, '//button[text()="Some text"]')
driver.find_elements(By.XPATH, '//button')

Attributes of the By class that can be used as locator strategies:
ID = "id"
XPATH = "xpath"
LINK_TEXT = "link text"
PARTIAL_LINK_TEXT = "partial link text"
NAME = "name"
TAG_NAME = "tag name"
CLASS_NAME = "class name"
CSS_SELECTOR = "css selector"
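
Because each find_element_by_* helper corresponds to exactly one of these strategy strings, old calls can be translated into (strategy, value) pairs mechanically. A minimal sketch, with `BY` and `to_locator` as hypothetical names and the strings copied from the list above:

```python
# Strategy strings, keyed by the suffix of the matching helper method.
BY = {
    "id": "id",
    "xpath": "xpath",
    "link_text": "link text",
    "partial_link_text": "partial link text",
    "name": "name",
    "tag_name": "tag name",
    "class_name": "class name",
    "css_selector": "css selector",
}


def to_locator(helper_name, value):
    """Translate e.g. ('find_element_by_id', 'x') into ('id', 'x')."""
    for prefix in ("find_elements_by_", "find_element_by_"):
        if helper_name.startswith(prefix):
            return BY[helper_name[len(prefix):]], value
    raise ValueError(f"not a find_element(s)_by_* helper: {helper_name}")


print(to_locator("find_element_by_id", "loginForm"))
# prints ('id', 'loginForm')
print(to_locator("find_elements_by_class_name", "content"))
# prints ('class name', 'content')
```

The resulting tuple is the same shape that driver.find_element(*locator) and EC.presence_of_element_located(locator) expect.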

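# Locating by XPath is recommended
# First form that has an input child named 'username' (note: this selects the form, not the input)
username = driver.find_element_by_xpath("//form[input/@name='username']")
# First input child of the form whose id is 'loginForm'
username = driver.find_element_by_xpath("//form[@id='loginForm']/input[1]")
# First input anywhere whose name is 'username'
username = driver.find_element_by_xpath("//input[@name='username']")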
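
XPath expressions like these can be tried out without starting a browser. The sketch below uses the standard library's xml.etree.ElementTree, which supports only a subset of XPath (so the `//form[input/@name='username']` form is left out), on a well-formed copy of the login form from the earlier examples. It is meant for experimenting with paths, not as a replacement for the driver.

```python
import xml.etree.ElementTree as ET

FORM = """
<html>
 <body>
  <form id="loginForm">
   <input name="username" type="text" />
   <input name="password" type="password" />
   <input name="continue" type="submit" value="Login" />
   <input name="continue" type="button" value="Clear" />
  </form>
 </body>
</html>
"""

root = ET.fromstring(FORM)

# ElementTree paths are relative, so "//input" is written ".//input" here.
by_attribute = root.find(".//input[@name='username']")
by_position = root.find(".//form[@id='loginForm']/input[1]")
all_continue = root.findall(".//input[@name='continue']")

print(by_attribute is by_position)
# prints True
print([e.get("value") for e in all_continue])
# prints ['Login', 'Clear']
```

Both paths land on the same input, and the last line shows why find_elements (plural) matters when several elements share a name.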

# Locating by link text is also recommended
continue_link = driver.find_element_by_link_text('Continue')
continue_link = driver.find_element_by_partial_link_text('Conti')

Tips on locating elements

I recommend Katalon: once it is running, it records your clicks in the browser and can then generate the matching Selenium click code in one step.

In addition, the browser's Inspect Element feature helps: right-click the element you want to locate, and most browsers let you copy its XPath directly.

Personal experience


Advantages:
  • Beginner friendly and easy to operate
  • A natural fit for crawling dynamically loaded pages
  • Very capable screenshot support
  • Reading and writing cookies is very convenient, which makes it an excellent partner for requests


Disadvantages:
  • Complicated initial setup
  • Slow and inefficient
  • High memory usage
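
The cookie hand-off to requests mentioned in the list above can be sketched as follows. driver.get_cookies() returns a list of dicts, while requests accepts a plain name-to-value mapping, so a one-line conversion is enough. `cookies_for_requests` is a hypothetical helper name and the sample cookie values are made up.

```python
def cookies_for_requests(selenium_cookies):
    """Collapse selenium's list of cookie dicts into a {name: value} mapping."""
    return {cookie["name"]: cookie["value"] for cookie in selenium_cookies}


# Shape of what driver.get_cookies() returns (values here are invented).
sample = [
    {"name": "sessionid", "value": "abc123", "domain": ".example.com", "path": "/"},
    {"name": "lang", "value": "en", "domain": ".example.com", "path": "/"},
]

jar = cookies_for_requests(sample)
print(jar)
# prints {'sessionid': 'abc123', 'lang': 'en'}

# With a live driver and requests installed, usage would look roughly like:
#   requests.get(url, cookies=cookies_for_requests(driver.get_cookies()))
```

A common pattern is to log in once with Selenium, pass the cookies over this way, and do the bulk of the crawling with the much faster requests.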

Tags: Python Selenium Lambda Mac

Posted on Wed, 05 Feb 2020 07:00:08 -0500 by davard