Primary use of Selenium

1, Dynamic rendering page crawling

1. Background issues

  • For data that is present directly in the response when we request a page (that is, content visible in the returned HTML rather than loaded or rendered by Ajax), we can crawl with urllib, requests, or the Scrapy framework.
  • For ordinary pages rendered dynamically by JavaScript (Ajax loading), we can capture the network traffic, work out the Ajax request address, and fetch the data from it directly.
    • Ajax = Asynchronous JavaScript and XML (XML being a subset of the Standard Generalized Markup Language).
    • Ajax is a technique for building fast, dynamic web pages.
    • Ajax can update part of a web page without reloading the whole page. For example, JD.com loads the reviews on a product page this way.
  • Even when the data comes from Ajax, the request may carry encrypted parameters, or the content may be generated afterwards by JavaScript computation. That makes the pattern hard to work out directly, as on Taobao pages.
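
To make the "analyze the Ajax request address" approach concrete, here is a minimal sketch using only the standard library. The endpoint, the parameter names (productId, page), and the response shape are all assumptions made up for illustration; a real site's values have to be read from the captured traffic.

```python
import json
from urllib.parse import urlencode

# Hypothetical Ajax endpoint, for illustration only.
BASE_URL = "https://example.com/api/comments"

def build_ajax_url(product_id, page):
    """Build the URL that the page's JavaScript would request."""
    params = {"productId": product_id, "page": page}
    return BASE_URL + "?" + urlencode(params)

def parse_comments(body):
    """Extract the comment texts from a JSON response body."""
    data = json.loads(body)
    return [c["content"] for c in data.get("comments", [])]

# A made-up response body standing in for what the server would return.
sample_body = '{"comments": [{"content": "Great"}, {"content": "OK"}]}'
url = build_ajax_url(12345, 2)
comments = parse_comments(sample_body)
print(url)
print(comments)
```

When the parameters are encrypted or computed by JavaScript, a URL like this cannot be built by hand, which is the case the rest of the article addresses.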

2. Solutions

  • Principle: to get around these problems, we can drive a real browser (or simulate one) and read the information from the rendered page.
  • Several tools for this can be used from Python, such as Selenium, Splash, PyV8, and Ghost.

2, Introduction to Selenium

1. Function and introduction of Selenium

  • Selenium is an automated testing tool that drives a browser to perform specific actions such as clicking, dragging, and scrolling. (Selenium is a larger project; WebDriver is only one part of it.)
  • Selenium can read the source of the page as the browser currently renders it, so whatever is visible can be crawled. This makes it very effective for content rendered dynamically by JavaScript.
  • Selenium supports many browsers, such as Chrome, Firefox, and Edge. It also supports browsers without an interface: PhantomJS (which needs its own driver installed, and whose support was removed in Selenium 4) or the headless modes of Chrome and Firefox.
  • Official website: http://www.seleniumhq.org
  • Official documents: http://selenium-python.readthedocs.io
  • Chinese document: http://selenium-python-zh.readthedocs.io

2. Selenium installation

2.1 module installation

pip install selenium

2.2. Driver installation (including driver download and placing the driver in the specified location):

2.2.1 driver download:

2.2.2 driver installation:

  • Windows: extract chromedriver.exe and place it in Python's Scripts directory.
  • Mac/Linux: extract chromedriver and place it in /usr/local/bin/.
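
A quick way to confirm that the driver landed in a directory on PATH is to look it up from Python. This is a small sketch using the standard library's shutil.which; the helper name find_driver is made up for this example.

```python
import shutil

def find_driver(name="chromedriver"):
    """Return the full path of the driver executable on PATH, or None."""
    return shutil.which(name)

path = find_driver()
if path is None:
    print("chromedriver not found on PATH")
else:
    print("chromedriver found at", path)
```

If the lookup returns None, the directory holding the driver is not on PATH and the browser will probably fail to start from Selenium.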

3. Shortcomings

Selenium is slow: it has to start a browser, run it, and wait for the page to finish executing and rendering before the data is available.

3, Use of Selenium

1. Declare browser objects

from selenium import webdriver

driver = webdriver.Chrome()   # Chrome needs the ChromeDriver driver
driver = webdriver.Firefox()  # Firefox needs the GeckoDriver driver
driver = webdriver.Edge()
driver = webdriver.Safari()
driver = webdriver.PhantomJS()  # headless browser; Selenium 3 only (removed in Selenium 4)

Create a Chrome browser object with Chrome's headless (no interface) mode enabled:

chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument('--headless')
# Recent Selenium releases take options=; older releases used chrome_options=
driver = webdriver.Chrome(options=chrome_options)

2. Visit page

from selenium import webdriver

driver = webdriver.Chrome()
#driver = webdriver.PhantomJS()
driver.get("http://www.taobao.com")  # Visit the specified website
print(driver.page_source)
# driver.close()  # Close the driver when finished. Left commented out here because it closes the browser window as soon as the script reaches this line.

3. Find node:

3.1. Use selenium built-in function to get nodes

How to get a single node:

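find_element_by_id()
find_element_by_name()
find_element_by_xpath()
find_element_by_link_text()
find_element_by_partial_link_text()
find_element_by_tag_name()
find_element_by_class_name()
find_element_by_css_selector()

These find_element_by_* helpers exist in Selenium 3 but were removed in Selenium 4.3; the find_element(By.ID, ...) form works in both.

from selenium import webdriver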
from selenium.webdriver.common.by import By

#Creating browser objects
driver = webdriver.Chrome()
#driver = webdriver.PhantomJS()
driver.get("http://www.taobao.com")
# Each call below returns the node whose id attribute is q
# (the find_element_by_* calls were removed in Selenium 4.3; find_element(By...) works in Selenium 3 and 4)
input = driver.find_element_by_id("q")
print(input)

input = driver.find_element_by_css_selector("#q")
print(input)

input = driver.find_element_by_xpath("//*[@id='q']")
print(input)

# Same result as above, using the By class
input = driver.find_element(By.ID,"q")
print(input)

#driver.close()

How to get multiple nodes:

find_elements_by_id()
find_elements_by_name()
find_elements_by_xpath()
find_elements_by_link_text()
find_elements_by_partial_link_text()
find_elements_by_tag_name()
find_elements_by_class_name()
find_elements_by_css_selector()

3.2 Use the By class from selenium.webdriver.common.by to find nodes

  • Import: from selenium.webdriver.common.by import By
  • By is a class built into Selenium that names the available strategies for locating elements.
  • Locator types supported by By:

1. Locate by id attribute

find_element(By.ID,"id")

2. Locate by name attribute

find_element(By.NAME,"name")

3. Locate by class name

find_element(By.CLASS_NAME,"classname")

4. Locate an <a> tag by its full link text

find_element(By.LINK_TEXT,"text")

5. Locate an <a> tag by part of its link text

find_element(By.PARTIAL_LINK_TEXT,"partialtext")

6. Locate by tag name

find_element(By.TAG_NAME,"input")

7. Locate by XPath

find_element(By.XPATH,"//div[@name='name']")

8. Locate by CSS selector

find_element(By.CSS_SELECTOR,"#id")

4. Node interaction:

input.clear() -- clear the input box (get the input box node first)
input.send_keys('**') -- simulate keyboard input of the given text (get the input box node first)
button.click() -- trigger a click (get the button node first)

from selenium import webdriver
from selenium.webdriver.common.by import By
import time

# Create the browser object
driver = webdriver.Chrome()
#driver = webdriver.PhantomJS()
driver.get("http://www.taobao.com")
# Get the node whose id attribute is q (the search input box)
input = driver.find_element(By.ID, "q")
# Simulate typing "iphone"
input.send_keys('iphone')
time.sleep(3)
# Clear the input box
input.clear()
# Simulate typing "iPad"
input.send_keys('iPad')
# Get the search button node
button = driver.find_element(By.CLASS_NAME, "btn-search")
# Trigger the click
button.click()

#driver.close()

5. Action chains:

Import the ActionChains class:

from selenium.webdriver import ActionChains

  • ActionChains automates low-level interactions such as mouse movement, mouse button actions, key presses, and context menu interaction.
  • This is useful for more complex operations such as hovering and dragging. Calls are queued on the chain and run only when perform() is called.
    • move_to_element(to_element) -- move the mouse to the middle of the element (hover)
    • move_by_offset(xoffset, yoffset) -- move the mouse by an offset from its current position
    • drag_and_drop(source, target) -- hold down the left mouse button on the source element, then move to the target element and release the mouse button
    • pause(seconds) -- pause all input for the specified duration in seconds
    • perform() -- perform all stored actions
    • release(on_element=None) -- release a held mouse button on the element
    • reset_actions() -- clear actions that are already stored
    • send_keys(*keys_to_send) -- send keys to the currently focused element
    • send_keys_to_element(element, *keys_to_send) -- send keys to an element
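
The point that is easy to miss with ActionChains is that its methods only queue actions; nothing reaches the browser until perform() runs. The toy class below imitates that queue-then-perform pattern in plain Python. ToyChain and its methods are invented for this sketch and are not Selenium code.

```python
class ToyChain:
    """Toy stand-in for the queue-then-perform pattern; not Selenium code."""

    def __init__(self):
        self._queue = []  # pending actions, in call order
        self.log = []     # what has actually been executed

    def move_to(self, name):
        self._queue.append(lambda: self.log.append("move_to " + name))
        return self       # returning self allows chained calls

    def click(self):
        self._queue.append(lambda: self.log.append("click"))
        return self

    def perform(self):
        # Run every queued action in the order it was added.
        for action in self._queue:
            action()

chain = ToyChain().move_to("menu").click()
print(chain.log)  # still empty: the actions are only queued
chain.perform()
print(chain.log)
```

This is why the drag case in this section builds the chain first and calls actions.perform() as a separate last step.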

Drag case:

from selenium import webdriver
from selenium.webdriver import ActionChains
from selenium.webdriver.common.by import By
import time

# Create the browser object
driver = webdriver.Chrome()
# Load the specified url
url = 'http://www.runoob.com/try/try.php?filename=jqueryui-api-droppable'
driver.get(url)

# Switch into the iframe that holds the demo
driver.switch_to.frame('iframeResult')
# Get the two div nodes
source = driver.find_element(By.CSS_SELECTOR, "#draggable")
target = driver.find_element(By.CSS_SELECTOR, "#droppable")
# Create an action chain object
actions = ActionChains(driver)
# Queue a drag-and-drop on the chain (nothing runs yet)
actions.drag_and_drop(source,target)
time.sleep(3)
# Perform all queued actions in order
actions.perform()
#driver.close()

6. Execute JavaScript:

driver.execute_script('window.open()') -- execute a JavaScript command

Scrolling and alert example:

from selenium import webdriver

#Creating browser objects
driver = webdriver.Chrome()
#Load the specified url address
driver.get("https://www.zhihu.com/explore")
#Executing a javascript program to scroll the page to the bottom
driver.execute_script('window.scrollTo(0,document.body.scrollHeight)')
#Execute javascript to realize a pop-up operation
driver.execute_script('window.alert("Hello Selenium!")')

#driver.close()

7. Get node information:

from selenium import webdriver
from selenium.webdriver.common.by import By

# Create the browser object
driver = webdriver.Chrome()
# Load the specified url
driver.get("https://www.zhihu.com/explore")
# Get the node whose id attribute is zh-top-link-logo (the logo)
logo = driver.find_element(By.ID, "zh-top-link-logo")
print(logo)  # the node object
print(logo.get_attribute('class'))  # value of the node's class attribute
# Get the node whose id attribute is zu-top-add-question (the "ask question" button)
input = driver.find_element(By.ID, "zu-top-add-question")
print(input.text)      # text content of the node
print(input.id)        # Selenium's internal id for this element (not the HTML id attribute)
print(input.location)  # position of the node in the page
print(input.tag_name)  # tag name of the node
print(input.size)      # size of the node
print(driver.page_source)  # page source
#driver.close()

8. Switch Frame:

  • A web page can contain iframe nodes, each of which embeds a child page inside the parent page.
  • Use switch_to.frame() to switch into a frame; the drag case in section 5 shows an example. switch_to.parent_frame() switches back to the parent.

9. Delay waiting:

  • The browser needs time to load a page, and Selenium is no exception. To get the full content of the page, you need to wait.
  • Selenium offers two kinds of wait: implicit and explicit (recommended).

9.1 implicit waiting: driver.implicitly_wait(2)

An implicit wait sets the maximum time (here 2 seconds) that every find_element call keeps polling for a node before raising NoSuchElementException.

from selenium import webdriver
from selenium.webdriver.common.by import By

# Create the browser object
driver = webdriver.Chrome()
# Set an implicit wait of up to 2 seconds
driver.implicitly_wait(2)
# Load the specified url
driver.get("https://www.zhihu.com/explore")
# Get the node
input = driver.find_element(By.ID, "zu-top-add-question")
print(input.text)  # text content of the node

#driver.close()

9.2 explicit waiting
  • When to use it: when an operation changes the page and you next need to act on the changed elements, you must wait for them.

  • Usage:
    First import the related modules

    from selenium.webdriver.support.wait import WebDriverWait
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.common.by import By

    1. Decide on the locator expression for the element
    web_locator = 'XXXX'
    2. Create a WebDriverWait with the total waiting time and the polling interval, then call its until or until_not method.

    WebDriverWait(driver, waiting_time, polling_interval).until(condition)

    3. Use expected_conditions to build the condition.

    EC.condition_name((locating_method, locator_expression))

    For example: EC.presence_of_element_located((By.CSS_SELECTOR, web_locator))

  • Common conditions for an explicit wait (usually built with the EC module)

    • Wait for an element to be visible
    • Wait for an element to be available (clickable)
    • Wait for a new window to appear
    • Wait for the url to equal xxx
  • How explicit waiting works:
    You set a maximum waiting time, and the condition is checked at a fixed interval (every 0.5 seconds by default). If the condition holds, the next step runs; otherwise the wait continues. If the maximum time is exceeded, a TimeoutException is raised.
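
The polling behaviour described here can be sketched in a few lines of plain Python. This is an illustration of the idea only, not Selenium's implementation: the function name wait_until is made up, and it raises the built-in TimeoutError where Selenium raises its own TimeoutException.

```python
import time

def wait_until(condition, timeout=10.0, poll=0.5):
    """Call condition() every `poll` seconds until it returns a truthy value.

    Returns that value, or raises TimeoutError once `timeout` seconds pass.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError("condition not met within %.1f s" % timeout)
        time.sleep(poll)

# Stand-in condition: becomes true on the third check.
calls = {"n": 0}

def ready():
    calls["n"] += 1
    return "loaded" if calls["n"] >= 3 else False

value = wait_until(ready, timeout=2.0, poll=0.01)
print(value, calls["n"])
```

The return value of the condition is passed back to the caller, which mirrors how wait.until(...) hands back the located element in the Selenium example.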

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.wait import WebDriverWait

#Creating browser objects
driver = webdriver.Chrome()
#Load request specify url address
driver.get("https://www.zhihu.com/explore")
# Explicit wait, up to 10 seconds
wait = WebDriverWait(driver,10)
# Condition: a node with id zu-top-add-question must be present within 10 seconds, otherwise an exception is raised.
input = wait.until(EC.presence_of_element_located((By.ID,'zu-top-add-question')))
print(input.text)  # text content of the node
#driver.close()

10. expected_conditions (EC) module

  • Import: from selenium.webdriver.support import expected_conditions as EC
  • Introduction: Selenium's expected_conditions module collects a set of ready-made checks, such as whether an element exists, whether an alert has popped up, and whether a dynamic element has reached a given state.
  • Suggestion: use it together with WebDriverWait, for example to detect page navigation.
  • Some of the available conditions:
    1,title_is: checks whether the title of the current page is exactly equal (==) to the expected string, and returns a Boolean value;
    from selenium import webdriver
    from selenium.webdriver.support import expected_conditions as EC
    
    driver = webdriver.Chrome()  # Start chrome
    driver.get('http://www.baidu.com')
    title = EC.title_is('百度一下，你就知道')(driver)  # Baidu's page title ("Baidu it and you will know")
    print(title)
    
    2,title_contains: determines whether the title of the current page contains the expected string, and returns a Boolean value
    from selenium import webdriver
    from selenium.webdriver.support import expected_conditions as EC
    
    driver = webdriver.Chrome()  # Start chrome
    driver.get('http://www.baidu.com')
    title = EC.title_contains('百度')(driver)  # '百度' is "Baidu" as it appears in the page title
    print(title)
    
    3,presence_of_element_located: checks whether an element has been added to the DOM tree; this does not mean the element is visible;
    4,visibility_of_element_located: checks whether an element is visible, meaning it is not hidden and its width and height are both non-zero;
    5,visibility_of: same check as the previous condition, but it takes the element itself rather than a locator;
    6,presence_of_all_elements_located: checks whether at least one matching element exists in the DOM tree. For example, if n elements on the page have the class 'column-md-3', the condition is met as soon as one of them exists, and it returns the list of matching elements;
    7,text_to_be_present_in_element: checks whether the text of an element contains the expected string;
    8,text_to_be_present_in_element_value: checks whether the value attribute of an element contains the expected string;
    9,frame_to_be_available_and_switch_to_it: checks whether a frame can be switched into; if so, switches into it and returns True, otherwise returns False;
    10,invisibility_of_element_located: checks whether an element is absent from the DOM tree or not visible;
    11,element_to_be_clickable: checks whether an element is visible and enabled, which is what "clickable" means here;
    12,staleness_of: waits for an element to be removed from the DOM tree; it returns True or False;
    13,element_selection_state_to_be: checks whether the selected state of an element matches the expectation;
    14,element_located_selection_state_to_be: same as the previous condition, but it takes a locator rather than the element;
    15,alert_is_present: checks whether an alert is present on the page.
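
As the title_is examples show, each condition is an object that is called with the driver and returns a truthy or falsy value. That means a custom condition can be written the same way. The sketch below uses a made-up condition, title_starts_with, and a FakeDriver stand-in so that it runs without a browser; neither name is part of Selenium.

```python
class title_starts_with:
    """Hypothetical custom condition: true when the page title starts with `prefix`."""

    def __init__(self, prefix):
        self.prefix = prefix

    def __call__(self, driver):
        # A condition receives the driver and returns a truthy or falsy value.
        return driver.title.startswith(self.prefix)

class FakeDriver:
    """Stand-in for a WebDriver, exposing only the `title` attribute."""

    def __init__(self, title):
        self.title = title

condition = title_starts_with("Baidu")
print(condition(FakeDriver("Baidu search")))
print(condition(FakeDriver("Taobao")))
```

With a real driver, an object like this could be passed to WebDriverWait(driver, 10).until(condition), because until() calls the condition with the driver on every poll.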

11. Forward and backward:

driver.back() -- go back one page in the browser history
driver.forward() -- go forward one page in the browser history

from selenium import webdriver
import time

#Creating browser objects
driver = webdriver.Chrome()
#Load request specify url address
driver.get("https://www.baidu.com")
driver.get("https://www.taobao.com")
driver.get("https://www.jd.com")
time.sleep(2)
driver.back()  # go back
time.sleep(2)
driver.forward()  # go forward
#driver.close()

12. Cookies:

driver.get_cookies() -- get the current cookies
driver.add_cookie({'name':'name','value':'zhangsan','domain':'www.zhihu.com'}) -- add a cookie (name and value are required)
driver.delete_all_cookies() -- delete all cookies

from selenium import webdriver

# Create the browser object
driver = webdriver.Chrome()
# Load the specified url
driver.get("https://www.zhihu.com/explore")
print(driver.get_cookies())
driver.add_cookie({'name':'name','domain':'www.zhihu.com','value':'zhangsan'})
print(driver.get_cookies())
driver.delete_all_cookies()
print(driver.get_cookies())
#driver.close()
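
driver.get_cookies() returns a list of dicts, each with at least a name and a value key. A common follow-up is to reuse those cookies outside the browser, for example in a Cookie header for a plain HTTP client. The helper below is a sketch of that conversion; cookies_to_header is a made-up name and the sample values are invented.

```python
def cookies_to_header(cookies):
    """Join Selenium-style cookie dicts into a Cookie header value."""
    return "; ".join("%s=%s" % (c["name"], c["value"]) for c in cookies)

# Sample data in the shape returned by driver.get_cookies(); values are made up.
sample = [
    {"name": "session", "value": "abc123", "domain": "www.zhihu.com"},
    {"name": "lang", "value": "en", "domain": "www.zhihu.com"},
]
header = cookies_to_header(sample)
print(header)
```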

13. Tab (window) management:

driver.window_handles[1] -- handle of a tab, selected by its index in the list of handles
driver.switch_to.window(driver.window_handles[0]) -- switch to the specified tab (switch_to_window() is the older spelling, removed in Selenium 4)

from selenium import webdriver
import time

#Creating browser objects
driver = webdriver.Chrome()
#Load request specify url address
driver.get("https://www.baidu.com")
# Use JavaScript to open a new tab
driver.execute_script('window.open()')
print(driver.window_handles)
# Switch to the second tab and load a url
driver.switch_to.window(driver.window_handles[1])
driver.get("https://www.taobao.com")
time.sleep(2)
# Switch back to the first tab and load a url
driver.switch_to.window(driver.window_handles[0])
driver.get("https://www.jd.com")
#driver.close()

14. Exception handling:

Timeout exception:

from selenium import webdriver
from selenium.common.exceptions import TimeoutException,NoSuchElementException
from selenium.webdriver.common.by import By

# Create the browser object
driver = webdriver.Chrome()
# Set a page-load timeout so that a slow page raises TimeoutException after 10 seconds
driver.set_page_load_timeout(10)
try:
    # Load the specified url
    driver.get("https://www.baidu.com")
except TimeoutException:
    print('Time Out')

Element not found after the page has loaded:

try:
    # Look up a node that does not exist
    driver.find_element(By.ID, "demo")
except NoSuchElementException:
    print('No Element')
finally:
    #driver.close()
    pass

15. Small case

Use Chrome to visit the Baidu homepage and search for the keyword python

from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.wait import WebDriverWait

# Initialize a browser (Chrome here, which requires ChromeDriver to be installed)
driver = webdriver.Chrome()
#driver = webdriver.PhantomJS() # headless browser (Selenium 3 only)
try:
    # Request the page
    driver.get("https://www.baidu.com")
    # Find the node whose id is kw (the search input box)
    input = driver.find_element(By.ID, "kw")
    # Simulate typing the search string
    input.send_keys("python")
    # Simulate pressing Enter
    input.send_keys(Keys.ENTER)
    # Explicit wait, up to 10 seconds
    wait = WebDriverWait(driver,10)
    # Condition: the node with id content_left must be present within 10 seconds, otherwise an exception is raised.
    wait.until(EC.presence_of_element_located((By.ID,'content_left')))
    # Output the response information
    print(driver.current_url)
    print(driver.get_cookies())
    print(driver.page_source)
finally:
    # Close the browser
    #driver.close()
    pass

4, Hands-on: crawling Taobao with Selenium

1. Foreword

Taobao uses many anti-crawling measures. With Selenium we can open the real page in a browser and log in by scanning the QR code with a phone.

2. Case requirements

Use Selenium to crawl Taobao product listings for a given keyword and a given number of pages.

3. Case study:

url address: https://s.taobao.com/search?q=ipad

4. Specific code implementation

'''Crawling information data of Taobao website through keywords'''
from selenium import webdriver
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.wait import WebDriverWait
from pyquery import PyQuery as pq
from urllib.parse import quote
import json

KEYWORD = "ipad"
MAX_PAGE = 10

# browser = webdriver.Chrome()
# browser = webdriver.PhantomJS()
# To run Chrome in headless (no interface) mode instead:
# chrome_options = webdriver.ChromeOptions()
# chrome_options.add_argument('--headless')
# browser = webdriver.Chrome(options=chrome_options)
# A visible browser window is needed here so you can scan the QR code and log in with your phone
browser = webdriver.Chrome()
# Explicit wait of up to 10 seconds
wait = WebDriverWait(browser, 10)

def index_page(page):
    '''Fetch one index page. :param page: page number'''
    print('Crawling page', page)
    try:
        url = 'https://s.taobao.com/search?q=' + quote(KEYWORD)
        browser.get(url)
        if page > 1:
            # Type the page number into the pager box and submit it
            input = wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, '#mainsrp-pager div.form > input')))
            submit = wait.until(EC.element_to_be_clickable((By.CSS_SELECTOR, '#mainsrp-pager div.form > span.btn.J_Submit')))
            input.clear()
            input.send_keys(str(page))
            submit.click()
        # Condition: the pager highlights the requested page number and the product nodes are present
        wait.until(EC.text_to_be_present_in_element((By.CSS_SELECTOR, '#mainsrp-pager li.item.active > span'), str(page)))
        wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, '.m-itemlist .items .item')))
        html = browser.page_source
        return html
    except TimeoutException:
        # Retry on timeout and return the retried result (otherwise the caller gets None)
        return index_page(page)
        
def get_products(html):
    '''Extract product data'''
    doc = pq(html)
    items = doc('#mainsrp-itemlist .items .item').items()
    for item in items:
        yield {
            'image': item.find('.pic .img').attr('data-src'),
            'price': item.find('.price').text(),
            'deal': item.find('.deal-cnt').text(),
            'title': item.find('.title').text(),
            'shop': item.find('.shop').text(),
            'location': item.find('.location').text()
        }

def save_data(result):
    '''Append one product record to the file as a line of JSON'''
    with open("./taobao.txt","a",encoding="utf-8") as f1:
        f1.write(json.dumps(result,ensure_ascii=False)+"\n")

def main():
    '''Traverse every page'''
    for i in range(1, MAX_PAGE + 1):
        html = index_page(i)
        result = get_products(html)
        for item in result:
            save_data(item)
    browser.close()

# Main program entry
if __name__ == '__main__':
    main()
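
One thing to watch in the code above: index_page calls itself again on every timeout, so a page that never loads would retry forever. A bounded retry is one possible alternative. The helper below is a generic sketch, not part of the original code; the name retry, its parameters, and the flaky stand-in function are all made up, and the built-in TimeoutError stands in for Selenium's TimeoutException so the example runs without a browser.

```python
def retry(func, attempts=3, exceptions=(Exception,)):
    """Call func() up to `attempts` times; re-raise the last error if all fail."""
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except exceptions:
            if attempt == attempts:
                raise

# Stand-in for a flaky page load: fails twice, then succeeds.
state = {"calls": 0}

def flaky():
    state["calls"] += 1
    if state["calls"] < 3:
        raise TimeoutError("simulated timeout")
    return "<html>ok</html>"

html = retry(flaky, attempts=5, exceptions=(TimeoutError,))
print(html, state["calls"])
```

In the crawler, the same shape could wrap a version of index_page that lets TimeoutException propagate instead of recursing.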


Posted on Tue, 23 Jun 2020 01:02:47 -0400 by TimR