Python advanced crawler notes

Preface

Selenium is often described as a beginner-friendly crawler tool, but I don't think it is a good place for beginners to start.
I recommend looking at Selenium only after you understand requests-based crawlers and have some general crawling knowledge.

In fact, a requests-based crawler is enough to meet the crawling needs of most websites at this stage.

About Selenium

Selenium was created in 2004 by Jason Huggins, a test engineer at ThoughtWorks. It was built for automated testing: it drives web interactions automatically so that testers do not have to repeat them by hand.
Crawlers can use the same tool to load web pages automatically and then grab data from them.

Official documents

Install

  1. Download ChromeDriver.
    Note: its version must match the version of Chrome you are currently using.
    Tip: on macOS you can put the file in the /usr/local/bin/ directory, which saves some configuration trouble.
  2. pip install selenium
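
The version note in step 1 can be checked with a few lines of Python. This is only a sketch: it assumes that comparing major version numbers is enough, and that you paste in the text printed by each program's `--version` flag. The function names and the sample strings are made up for illustration.

```python
import re


def major_version(version_output: str) -> int:
    """Extract the major version number from a `--version` style string."""
    match = re.search(r"(\d+)\.\d+", version_output)
    if match is None:
        raise ValueError(f"no version number found in {version_output!r}")
    return int(match.group(1))


def versions_match(chrome_output: str, driver_output: str) -> bool:
    """Return True when Chrome and ChromeDriver share the same major version."""
    return major_version(chrome_output) == major_version(driver_output)


# Sample strings are illustrative, not captured from a real machine.
print(versions_match("Google Chrome 80.0.3987.87", "ChromeDriver 80.0.3987.16"))
print(versions_match("Google Chrome 81.0.4044.92", "ChromeDriver 80.0.3987.16"))
```

The first call reports a match and the second does not, which is the situation where Selenium usually refuses to start the browser.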

Use

  1. Set the configuration
    option = webdriver.ChromeOptions()
    option.add_argument('--headless')  # run Chrome without a visible window
  2. Create the driver
    driver = webdriver.Chrome(options=option)  # the chrome_options= keyword is deprecated
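
The two steps above can be wrapped in a small helper so the settings live in one place. This is a minimal sketch, not a required pattern: `chrome_arguments` and `create_driver` are hypothetical names, and the window size is an arbitrary default. Selenium is imported inside `create_driver` so the argument list can be inspected on a machine without a browser.

```python
from typing import List


def chrome_arguments(headless: bool = True, width: int = 1920, height: int = 1080) -> List[str]:
    """Build the command-line switches to pass to ChromeOptions.add_argument."""
    arguments = [f"--window-size={width},{height}"]
    if headless:
        arguments.append("--headless")
    return arguments


def create_driver(headless: bool = True):
    """Create a Chrome driver from the switches above (needs Selenium and ChromeDriver)."""
    from selenium import webdriver  # imported lazily on purpose

    option = webdriver.ChromeOptions()
    for argument in chrome_arguments(headless=headless):
        option.add_argument(argument)
    return webdriver.Chrome(options=option)


print(chrome_arguments())
print(chrome_arguments(headless=False))
```

Calling `create_driver(headless=False)` while debugging and `create_driver()` in production keeps the rest of the crawler code unchanged.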

A first small test

# Interact with the Baidu homepage

import time

from selenium import webdriver
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC

option = webdriver.ChromeOptions()
# option.add_argument('--headless')

# Use a ChromeDriver build that matches your operating system and Chrome version
driver = webdriver.Chrome(options=option)


url = 'https://www.baidu.com'

# Open the website
driver.get(url)

# Print the current page title
print(driver.title)

# Enter text in the search box
timeout = 5
search_content = WebDriverWait(driver, timeout).until(
    # Equivalent: lambda d: d.find_element(By.XPATH, '//input[@id="kw"]')
    EC.presence_of_element_located((By.XPATH, '//input[@id="kw"]'))
)
search_content.send_keys('python')

# Pause so the typed text is visible before the click
time.sleep(3)

# Simulate a click on the "百度一下" (search) button
search_button = WebDriverWait(driver, timeout).until(
    lambda d: d.find_element(By.XPATH, '//input[@id="su"]'))
search_button.click()

# Print the search results
search_results = WebDriverWait(driver, timeout).until(
    # lambda d: d.find_elements(By.XPATH, '//h3[@class="t c-title-en"] | //h3[@class="t"]')
    lambda d: d.find_elements(By.XPATH, '//h3[contains(@class,"t")]/a[1]')
)
# print(search_results)

for item in search_results:
    print(item.text)

# quit() closes every window and also ends the chromedriver process
driver.quit()


Baidu once, you will know
python free online learning every day
Welcome to Python.org
Python Baidu Encyclopedia
Python basic tutorial | rookie tutorial
Download Python | Python.org
Python tutorial - Liao Xuefeng's official website
Python, official computer version, pure download by Chinese Army
We live in the "Python era"
Introduction to Python
Python - Zhihu
Python basic tutorial, python introduction tutorial (very detailed)
Intel Python distribution
You don't know how to program, you don't know how to make art, and you can easily make games
Free and versatile pagoda Linux panel one click management server
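
The script leans on WebDriverWait(...).until(...), which calls the given condition again and again until it returns a truthy value or the timeout expires. The sketch below imitates that polling loop in plain Python to show the idea. It is not Selenium's implementation: `wait_until`, `WaitTimeout` and the fake element lookup are made-up names, and `LookupError` stands in for Selenium's "element not found" exception.

```python
import time


class WaitTimeout(Exception):
    """Raised when the condition never became truthy (stand-in for TimeoutException)."""


def wait_until(condition, timeout=5.0, poll_interval=0.5):
    """Call condition() repeatedly until it returns a truthy value or time runs out."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            result = condition()
            if result:
                return result
        except LookupError:
            # Stand-in for "element not found yet": ignore it and retry.
            pass
        if time.monotonic() >= deadline:
            raise WaitTimeout(f"condition not met within {timeout} seconds")
        time.sleep(poll_interval)


# Simulated page: the "element" only appears on the third poll.
attempts = {"count": 0}


def fake_find_element():
    attempts["count"] += 1
    if attempts["count"] < 3:
        raise LookupError("element not present yet")
    return "search box"


print(wait_until(fake_find_element, timeout=2, poll_interval=0.01))
print(attempts["count"])
```

Because an empty list is falsy, the same loop explains why the search-results wait in the script keeps polling until find_elements returns at least one match.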

Page interaction methods

from selenium.webdriver import ActionChains
from selenium.webdriver.common.by import By

# Find an element:
element = driver.find_element(By.ID, "passwd-id")
element = driver.find_element(By.NAME, "passwd")
element = driver.find_element(By.XPATH, "//input[@id='passwd-id']")

# Enter text:
element.send_keys("some text")

# Click:
element.click()

# Action chain: drag `element` onto `target` (another element you have already located)
action_chains = ActionChains(driver)
action_chains.drag_and_drop(element, target).perform()

# Switch to the most recently opened window or tab
window_handles = driver.window_handles
driver.switch_to.window(window_handles[-1])

# Save a screenshot
driver.save_screenshot('screen.png')
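
When a click opens a new window, you usually want to work in it briefly and then return to the page you came from. Here is a minimal sketch of that pattern as a context manager. `newest_window` is a hypothetical helper, and like the snippet above it assumes the last entry of window_handles is the newest window, which is common in practice but not guaranteed.

```python
from contextlib import contextmanager


@contextmanager
def newest_window(driver):
    """Temporarily switch to the last window handle, then switch back."""
    original = driver.current_window_handle
    driver.switch_to.window(driver.window_handles[-1])
    try:
        yield driver
    finally:
        # Runs even if the body raises, so the crawler is never left in the popup.
        driver.switch_to.window(original)


# Intended use with a live driver:
#   with newest_window(driver) as popup:
#       popup.save_screenshot('popup.png')
```

The helper only touches current_window_handle, window_handles and switch_to.window, so it can be exercised with a small stand-in object before running it against a real browser.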

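Locating elements

# Find one element (Selenium 3 shortcut methods; they were removed in Selenium 4.3,
# where find_element(By.<STRATEGY>, value) is used instead)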
find_element_by_id
find_element_by_name
find_element_by_xpath
find_element_by_link_text
find_element_by_partial_link_text
find_element_by_tag_name
find_element_by_class_name
find_element_by_css_selector

# Find multiple elements (Selenium 3 shortcuts; in Selenium 4.3+ use find_elements(By.<STRATEGY>, value))
find_elements_by_name
find_elements_by_xpath
find_elements_by_link_text
find_elements_by_partial_link_text
find_elements_by_tag_name
find_elements_by_class_name
find_elements_by_css_selector

# Locate by id

<html>
 <body>
  <form id="loginForm">
   <input name="username" type="text" />
   <input name="password" type="password" />
   <input name="continue" type="submit" value="Login" />
  </form>
 </body>
</html>

login_form = driver.find_element(By.ID, 'loginForm')

# Locate by name

<html>
 <body>
  <form id="loginForm">
   <input name="username" type="text" />
   <input name="password" type="password" />
   <input name="continue" type="submit" value="Login" />
   <input name="continue" type="button" value="Clear" />
  </form>
</body>
</html>

username = driver.find_element(By.NAME, 'username')
password = driver.find_element(By.NAME, 'password')

# Locate by link text

<html>
 <body>
  <p>Are you sure you want to do this?</p>
  <a href="continue.html">Continue</a>
  <a href="cancel.html">Cancel</a>
</body>
</html>

continue_link = driver.find_element(By.LINK_TEXT, 'Continue')
continue_link = driver.find_element(By.PARTIAL_LINK_TEXT, 'Conti')

# Locate by tag name

<html>
 <body>
  <h1>Welcome</h1>
  <p>Site content goes here.</p>
</body>
</html>

heading1 = driver.find_element(By.TAG_NAME, 'h1')

# Locate by class name

<html>
 <body>
  <p class="content">Site content goes here.</p>
</body>
</html>

content = driver.find_element(By.CLASS_NAME, 'content')

# Locate by CSS selector

<html>
 <body>
  <p class="content">Site content goes here.</p>
</body>
</html>

content = driver.find_element(By.CSS_SELECTOR, 'p.content')

# The two general-purpose methods (the only form available in Selenium 4.3 and later)
from selenium.webdriver.common.by import By

driver.find_element(By.XPATH, '//button[text()="Some text"]')
driver.find_elements(By.XPATH, '//button')

Attributes of the By class that can be used as locator strategies:
ID = "id"
XPATH = "xpath"
LINK_TEXT = "link text"
PARTIAL_LINK_TEXT = "partial link text"
NAME = "name"
TAG_NAME = "tag name"
CLASS_NAME = "class name"
CSS_SELECTOR = "css selector"
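
Each By attribute is just a string, so every find_element_by_* shortcut corresponds to one (method, strategy) pair. The sketch below spells that mapping out, which can help when porting old scripts. `LEGACY_SUFFIX_TO_BY` and `translate_legacy_call` are hypothetical names, and the strategy strings are copied from the list above.

```python
# Strategy strings copied from the By attribute list above.
LEGACY_SUFFIX_TO_BY = {
    "id": "id",
    "xpath": "xpath",
    "link_text": "link text",
    "partial_link_text": "partial link text",
    "name": "name",
    "tag_name": "tag name",
    "class_name": "class name",
    "css_selector": "css selector",
}


def translate_legacy_call(method_name: str):
    """Map e.g. 'find_elements_by_tag_name' to ('find_elements', 'tag name')."""
    for prefix in ("find_elements_by_", "find_element_by_"):
        if method_name.startswith(prefix):
            suffix = method_name[len(prefix):]
            # Drop the trailing "_by_" to get the general-purpose method name.
            return prefix[:-len("_by_")], LEGACY_SUFFIX_TO_BY[suffix]
    raise ValueError(f"not a find_element(s)_by_* method: {method_name}")


print(translate_legacy_call("find_element_by_id"))
print(translate_legacy_call("find_elements_by_css_selector"))
```

So driver.find_element_by_id('loginForm') becomes driver.find_element(By.ID, 'loginForm'), where By.ID is the string "id".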

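# XPath positioning is recommended
# Note: the first expression selects the <form> that contains the username input,
# not the input itself; the other two select the input directly.
form = driver.find_element(By.XPATH, "//form[input/@name='username']")
username = driver.find_element(By.XPATH, "//form[@id='loginForm']/input[1]")
username = driver.find_element(By.XPATH, "//input[@name='username']")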
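
To see what these expressions select without starting a browser, you can try them on the sample login form with the standard library. This is only a sketch: xml.etree.ElementTree supports a limited XPath subset, so paths start with ".//" and the containing-form predicate is approximated by "form[input]" (a form that has an input child), which is weaker than the browser version.

```python
import xml.etree.ElementTree as ET

# The login form from the examples above, written as well-formed XML.
PAGE = """
<html>
 <body>
  <form id="loginForm">
   <input name="username" type="text" />
   <input name="password" type="password" />
   <input name="continue" type="submit" value="Login" />
  </form>
 </body>
</html>
"""

root = ET.fromstring(PAGE.strip())

# First <input> child of the form with id="loginForm".
by_position = root.find(".//form[@id='loginForm']/input[1]")
# Any <input> whose name attribute is "username".
by_attribute = root.find(".//input[@name='username']")
# A <form> that has an <input> child: the result is the form, not the input.
containing_form = root.find(".//form[input]")

print(by_position.get("name"))
print(by_attribute.get("name"))
print(by_position is by_attribute)
print(containing_form.tag, containing_form.get("id"))
```

Both input expressions land on the same element, while the predicate form returns the enclosing form.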

# Link text positioning is also recommended
continue_link = driver.find_element(By.LINK_TEXT, 'Continue')
continue_link = driver.find_element(By.PARTIAL_LINK_TEXT, 'Conti')

Tips for locating elements

I recommend Katalon. Once it is running, it records your clicks in the browser and can then generate the Selenium code that simulates those clicks in one step.

You can also use the browser's element inspector: right-click the element you want to locate, and most browsers offer a command that copies its XPath directly.

Personal experience

Advantages:

  • Beginner-friendly, easy to operate
  • Naturally suited to crawling dynamically loaded pages
  • Screenshots are very powerful
  • Cookies are very easy to read and write, which makes Selenium an excellent partner for requests

Disadvantages:

  • Complicated initial installation process
  • Slow speed, low efficiency
  • Large memory usage
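
The cookie hand-off mentioned among the advantages can be sketched as follows. The idea is to log in once with Selenium and then reuse the session from the much faster requests side. `cookies_for_requests` is a hypothetical helper and the cookie values are made up. The sample list only mirrors the shape of what driver.get_cookies() returns, a list of dicts with at least "name" and "value" keys.

```python
def cookies_for_requests(selenium_cookies):
    """Flatten Selenium's list of cookie dicts into a {name: value} mapping."""
    return {cookie["name"]: cookie["value"] for cookie in selenium_cookies}


# Same shape as driver.get_cookies(); the values here are invented.
sample = [
    {"name": "sessionid", "value": "abc123", "domain": ".example.com", "path": "/"},
    {"name": "lang", "value": "en", "domain": ".example.com", "path": "/"},
]
print(cookies_for_requests(sample))

# With a live driver and the requests package, usage would look roughly like:
#   session = requests.Session()
#   session.cookies.update(cookies_for_requests(driver.get_cookies()))
#   session.get("https://example.com/profile")
```

This keeps the slow browser work to the login step and leaves the bulk of the crawling to requests.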

Tags: Python Selenium Lambda Mac

Posted on Wed, 05 Feb 2020 07:00:08 -0500 by davard