Python advanced crawler notes

Foreword

Selenium is often described as a beginner-friendly crawler tool, but I don't think it is suitable for beginners.
I recommend looking at Selenium only after you understand requests-based crawlers and have some general crawler knowledge.

In fact, requests-based crawlers are enough to meet the crawling needs of most websites at this stage.

About Selenium

Selenium was created in 2004 by Jason Huggins, a testing engineer at ThoughtWorks. It was built for automated testing of web interactions, so that the same manual checks did not have to be repeated.
Crawlers can use it to load web pages automatically and then grab the data.

Official documents

Install

Download ChromeDriver.
Note: its version must match the version of Chrome you are currently using.
Tip: on macOS you can put the file in the /usr/local/bin/ directory, which saves some configuration trouble.
pip install selenium
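
The version note above can be checked before launching anything by comparing major version numbers. This is a minimal sketch: `major_version` and `versions_match` are hypothetical helper names, and the sample strings only imitate the kind of text printed by `chromedriver --version` and by Chrome itself, which may look different on your system.

```python
import re
import shutil


def major_version(version_output):
    """Return the first number of a dotted version found in the text, or None."""
    match = re.search(r"(\d+)\.\d+\.\d+", version_output)
    return int(match.group(1)) if match else None


def versions_match(chrome_output, driver_output):
    """True when both version strings share the same major version."""
    chrome_major = major_version(chrome_output)
    driver_major = major_version(driver_output)
    return chrome_major is not None and chrome_major == driver_major


# Sample strings in an assumed format, not taken from a real run
chrome_text = "Google Chrome 114.0.5735.198"
driver_text = "ChromeDriver 114.0.5735.90"

print(versions_match(chrome_text, driver_text))

# None here means chromedriver is not on PATH (for example, not in /usr/local/bin)
print(shutil.which("chromedriver"))
```

If the second line prints None, Selenium will probably not find the driver either unless you pass its location explicitly.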

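Usage

Set the configuration
from selenium import webdriver

option = webdriver.ChromeOptions()
option.add_argument('--headless')  # run Chrome without a visible window
Create the driver
driver = webdriver.Chrome(options=option)  # chrome_options= is deprecated; pass options= instead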
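
When the same settings are needed in several scripts, it can help to keep the list of Chrome switches separate from driver creation. The following is a sketch under assumptions: `chrome_arguments` and `make_driver` are hypothetical names, and the window size is an arbitrary example value. Selenium is imported inside `make_driver` so the first helper can be tried without a browser.

```python
def chrome_arguments(headless=True, window_size=(1920, 1080)):
    """Build the list of command-line switches to pass to Chrome."""
    args = []
    if headless:
        args.append("--headless")
    width, height = window_size
    args.append(f"--window-size={width},{height}")
    return args


def make_driver(headless=True):
    """Create a Chrome driver configured with the switches above."""
    # Imported here so chrome_arguments() works even without selenium installed
    from selenium import webdriver

    option = webdriver.ChromeOptions()
    for arg in chrome_arguments(headless=headless):
        option.add_argument(arg)
    # options= is the keyword newer Selenium releases expect
    return webdriver.Chrome(options=option)


print(chrome_arguments())
```

Calling `make_driver(headless=False)` while debugging shows the browser window, which makes it easier to see what the script is doing.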

A quick first try

# Interact with the Baidu homepage
import time

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.wait import WebDriverWait

option = webdriver.ChromeOptions()
# option.add_argument('--headless')

# Use a ChromeDriver build that matches your operating system.
# Pass options=; the older chrome_options= keyword triggers
# "DeprecationWarning: use options instead of chrome_options".
driver = webdriver.Chrome(options=option)

url = 'https://www.baidu.com'

# Open the website
driver.get(url)

# Print the current page title
print(driver.title)

# Enter text in the search box
timeout = 5
search_content = WebDriverWait(driver, timeout).until(
    # lambda d: d.find_element(By.XPATH, '//input[@id="kw"]')
    EC.presence_of_element_located((By.XPATH, '//input[@id="kw"]'))
)
search_content.send_keys('python')

time.sleep(3)

# Simulate a click on the Baidu search button
search_button = WebDriverWait(driver, timeout).until(
    lambda d: d.find_element(By.XPATH, '//input[@id="su"]'))
search_button.click()

# Print the search results
search_results = WebDriverWait(driver, timeout).until(
    # lambda d: d.find_elements(By.XPATH, '//h3[@class="t c-title-en"] | //h3[@class="t"]')
    lambda d: d.find_elements(By.XPATH, '//h3[contains(@class,"t")]/a[1]')
)
# print(search_results)
for item in search_results:
    print(item.text)

# quit() closes the window and also ends the driver process
driver.quit()

Output:

Baidu once, you will know
python free online learning every day
Welcome to Python.org
Python Baidu Encyclopedia
Python basic tutorial | rookie tutorial
Download Python | Python.org
Python tutorial - Liao Xuefeng's official website
Python, official computer version, pure download by Chinese Army
We live in the "Python era"
Introduction to Python
Python - Zhihu
Python basic tutorial, python introduction tutorial (very detailed)
Intel Python distribution
You don't know how to program, you don't know how to make art, and you can easily make games
Free and versatile pagoda Linux panel one click management server
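
The `WebDriverWait(driver, timeout).until(...)` calls in the example keep calling the given function until it returns something truthy or the timeout runs out. The sketch below imitates that polling idea in plain Python so it can run without a browser. It is not Selenium's implementation: `wait_until`, `WaitTimeout` and `FakePage` are hypothetical names, and `LookupError` stands in for Selenium's "element not found" exception.

```python
import time


class WaitTimeout(Exception):
    """Raised when the condition never becomes truthy (stand-in for a timeout error)."""


def wait_until(condition, subject, timeout=5.0, poll=0.5, ignored=(LookupError,)):
    """Call condition(subject) repeatedly until it returns a truthy value."""
    deadline = time.monotonic() + timeout
    while True:
        try:
            result = condition(subject)
            if result:
                return result
        except ignored:
            pass  # "not found yet" just means: try again
        if time.monotonic() >= deadline:
            raise WaitTimeout(f"condition not met within {timeout} seconds")
        time.sleep(poll)


class FakePage:
    """A pretend page whose element only appears on the third lookup."""

    def __init__(self):
        self.calls = 0

    def find(self):
        self.calls += 1
        if self.calls < 3:
            raise LookupError("element not present yet")
        return "search box"


page = FakePage()
print(wait_until(lambda p: p.find(), page, timeout=2, poll=0.01))
```

This also suggests why a lambda that returns a list of elements works as a wait condition: an empty list is falsy, so the wait keeps polling until at least one element is found.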

Page interaction methods

from selenium.webdriver import ActionChains
from selenium.webdriver.common.by import By

# Find an element
element = driver.find_element(By.ID, "passwd-id")
element = driver.find_element(By.NAME, "passwd")
element = driver.find_element(By.XPATH, "//input[@id='passwd-id']")

# Enter text
element.send_keys("some text")

# Click
element.click()

# Action chain: drag `element` onto `target`
# (`target` is another element, located the same way as `element`)
action_chains = ActionChains(driver)
action_chains.drag_and_drop(element, target).perform()

# Switch between windows or tabs
window_handles = driver.window_handles
driver.switch_to.window(window_handles[-1])

# Save a screenshot
driver.save_screenshot('screen.png')
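
Switching to `window_handles[-1]` assumes the newest window is listed last, which may not hold for every browser and driver. A slightly more defensive approach is to compare the handles before and after the action that opens the window. This is a sketch: `newly_opened` is a hypothetical helper and the handle values are made up.

```python
def newly_opened(handles_before, handles_after):
    """Return handles present after an action but not before, in reported order."""
    known = set(handles_before)
    return [handle for handle in handles_after if handle not in known]


# Made-up handle values; real ones are opaque strings from driver.window_handles
before = ["window-A"]
after = ["window-B", "window-A"]
print(newly_opened(before, after))

# With a live driver the flow would look like this (not executed here):
# before = driver.window_handles
# element.click()  # something that opens a new tab
# new_handle = newly_opened(before, driver.window_handles)[0]
# driver.switch_to.window(new_handle)
```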

Locating elements

from selenium.webdriver.common.by import By

# Helper methods for finding one element (Selenium 3 API;
# deprecated in Selenium 4 and removed in 4.3 in favor of find_element(By.<STRATEGY>, value))
find_element_by_id
find_element_by_name
find_element_by_xpath
find_element_by_link_text
find_element_by_partial_link_text
find_element_by_tag_name
find_element_by_class_name
find_element_by_css_selector

# Helper methods for finding multiple elements (same note applies; these return a list)
find_elements_by_name
find_elements_by_xpath
find_elements_by_link_text
find_elements_by_partial_link_text
find_elements_by_tag_name
find_elements_by_class_name
find_elements_by_css_selector

# Locate by id
<html>
 <body>
  <form id="loginForm">
   <input name="username" type="text" />
   <input name="password" type="password" />
   <input name="continue" type="submit" value="Login" />
  </form>
 </body>
</html>
login_form = driver.find_element(By.ID, 'loginForm')

# Locate by name
<html>
 <body>
  <form id="loginForm">
   <input name="username" type="text" />
   <input name="password" type="password" />
   <input name="continue" type="submit" value="Login" />
   <input name="continue" type="button" value="Clear" />
  </form>
 </body>
</html>
username = driver.find_element(By.NAME, 'username')
password = driver.find_element(By.NAME, 'password')

# Locate by link text
<html>
 <body>
  <p>Are you sure you want to do this?</p>
  <a href="continue.html">Continue</a>
  <a href="cancel.html">Cancel</a>
 </body>
</html>
continue_link = driver.find_element(By.LINK_TEXT, 'Continue')
continue_link = driver.find_element(By.PARTIAL_LINK_TEXT, 'Conti')

# Locate by tag name
<html>
 <body>
  <h1>Welcome</h1>
  <p>Site content goes here.</p>
 </body>
</html>
heading1 = driver.find_element(By.TAG_NAME, 'h1')

# Locate by class name
<html>
 <body>
  <p class="content">Site content goes here.</p>
 </body>
</html>
content = driver.find_element(By.CLASS_NAME, 'content')

# Locate with a CSS selector
<html>
 <body>
  <p class="content">Site content goes here.</p>
 </body>
</html>
content = driver.find_element(By.CSS_SELECTOR, 'p.content')

# The two general methods: find_element and find_elements
driver.find_element(By.XPATH, '//button[text()="Some text"]')
driver.find_elements(By.XPATH, '//button')

# Attributes of By that can be used as locator strategies
ID = "id"
XPATH = "xpath"
LINK_TEXT = "link text"
PARTIAL_LINK_TEXT = "partial link text"
NAME = "name"
TAG_NAME = "tag name"
CLASS_NAME = "class name"
CSS_SELECTOR = "css selector"

# XPath locating is recommended
# The first expression selects the <form> that contains the username input, not the input itself
login_form = driver.find_element(By.XPATH, "//form[input/@name='username']")
username = driver.find_element(By.XPATH, "//form[@id='loginForm']/input[1]")
username = driver.find_element(By.XPATH, "//input[@name='username']")

# Link text locating is recommended
continue_link = driver.find_element(By.LINK_TEXT, 'Continue')
continue_link = driver.find_element(By.PARTIAL_LINK_TEXT, 'Conti')
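
To get a feel for the XPath expressions without starting a browser, the login form from the examples can be queried with Python's standard library. This is only a sketch: `xml.etree.ElementTree` supports a limited subset of XPath and needs paths that start with `.`, so the expressions are adapted slightly and not every expression that works in Selenium will work here.

```python
import xml.etree.ElementTree as ET

LOGIN_PAGE = """
<html>
 <body>
  <form id="loginForm">
   <input name="username" type="text" />
   <input name="password" type="password" />
   <input name="continue" type="submit" value="Login" />
   <input name="continue" type="button" value="Clear" />
  </form>
 </body>
</html>
"""

root = ET.fromstring(LOGIN_PAGE)

# Same idea as //input[@name='username']
by_name = root.find(".//input[@name='username']")

# Same idea as //form[@id='loginForm']/input[1]
first_input = root.find(".//form[@id='loginForm']/input[1]")

# A name shared by two inputs matches both of them
continue_inputs = root.findall(".//input[@name='continue']")

print(by_name.attrib["type"])
print(first_input is by_name)
print([el.attrib["value"] for el in continue_inputs])
```

The last query shows why a name such as `continue` is a weak locator on this page: it matches both the Login and the Clear button, and a single-element lookup would return only the first.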

More on locating elements

I recommend Katalon. Once it is running, it records your clicks in the browser and can then generate, with one click, the Selenium code that simulates them.

You can also use the browser's Inspect Element feature: right-click the element you want to locate, and most browsers offer an option to copy its XPath directly.

Personal experience

Advantages:

Novice friendly, easy to operate
Naturally suitable for crawling dynamically loaded pages
Screenshots are very powerful
Reading cookies is very convenient. Paired with requests, the combination is almost unfairly effective
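
The cookie hand-off to requests mentioned in the last item could look roughly like this. It is a sketch under assumptions: `cookies_to_dict` and `cookie_header` are hypothetical helpers, and the sample list only imitates the shape of what `driver.get_cookies()` returns (real entries carry more keys).

```python
def cookies_to_dict(selenium_cookies):
    """Map cookie names to values from a list shaped like driver.get_cookies()."""
    return {cookie["name"]: cookie["value"] for cookie in selenium_cookies}


def cookie_header(selenium_cookies):
    """Build the value of a Cookie request header from the same list."""
    return "; ".join(f"{c['name']}={c['value']}" for c in selenium_cookies)


# Made-up sample data, not taken from a real site
sample = [
    {"name": "sessionid", "value": "abc123", "domain": ".example.com", "path": "/"},
    {"name": "lang", "value": "en", "domain": ".example.com", "path": "/"},
]

print(cookies_to_dict(sample))
print(cookie_header(sample))

# With a live driver and requests installed (not executed here):
# import requests
# response = requests.get(url, cookies=cookies_to_dict(driver.get_cookies()))
```

The usual pattern is to log in once with Selenium, export the cookies, and then do the bulk of the crawling with requests, which avoids the speed and memory costs of keeping a browser open.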

Disadvantages:

Complicated initial installation process
Slow speed, low efficiency
Large memory usage

Acher_zxj