python crawler -- understanding and use of selenium

preface

  • Judgment method for whether a page is dynamically loaded (Ajax) data ⭐⭐

Dynamically loading data means that you can't directly get the page data by directly requesting the web address. We can locate the network through the packet capture tool on the web page to request the web page, and check whether a data on the web page is in the data page loaded by the network request page

  • Detailed steps


How does the dynamically loaded data come from?


1, Introduction to selenium

What is the relationship between selenium module and crawler?

  • Convenient access to dynamically loaded data in the website
  • Implementation of boundary simulation login

What is selenium module?

  • A module based on browser automation

selenium module usage process:

  • Installation environment: pip install selenium ⭐
  • Download a browser driver: download the browser driver for the browser you use: Google driver
    For the google browser currently in use, you can find the corresponding driver according to the version number.
  • Instantiate a browser object
  • Write operation code based on browser automation:
  1. Initiate request: get(url)
  2. Label positioning: find series method
  3. Tag interaction: send_keys(‘xxx’)
  4. Execute js program: execte_script(‘jsCode’)
  5. Close browser: quit()

  • Use selenium to crawl the data of the State Food and drug administration ⭐
from selenium import webdriver
from lxml import etree
from time import sleep

# Instantiate a browser object (the driver passed in to the browser)
bro = webdriver.Chrome(executable_path='E:\Google\chromedriver')
# Let the browser initiate a request corresponding to the specified url
bro.get('http://scxk.nmpa.gov.cn:81/xk/')

# Get the source code data of the current page of the browser
page_text = bro.page_source

# Resolve enterprise name
tree = etree.HTML(page_text)
li_list = tree.xpath('//ul[@id="gzlist"]/li') 
for li in li_list:
    name = li.xpath('./dl/@title')[0]
    print(name)

sleep(5)
# Close browser
bro.quit()

  • Simple automation using selenium
from selenium import webdriver
from time import sleep

bro = webdriver.Chrome(executable_path=r'E:\Google\chromedriver')

bro.get('https://www.taobao.com/')

# Find search box: label positioning
search_input = bro.find_element_by_id('q') # Returns the positioned label
# Tag interaction
search_input.send_keys('Iphone')

# Execute a set of js programs: scroll the height of a screen
bro.execute_script('window.scrollTo(0,document.body.scrollHeight)')
sleep(2)
# Click the search button class = 'BTN search TB BG'
btn = bro.find_element_by_css_selector('.btn-search')
# btn = bro.find_element_by_class_name('btn-search')
btn = bro.find_element_by_class_name('tb-bg')
btn.click()


bro.get('https://www.baidu.com')
sleep(2)
# Back off
bro.back()
sleep(2)
# forward
bro.forward()

sleep(5)
bro.quit()

2, selenium handles iframe

  • If the positioned tag exists in the iframe tag, you must use switch_to.frame(id)
  • Action chain (drag): from selenium.webdriver import ActionChains # action chain
    • Instantiate action chain object: action = ActionChains(bro)
    • action.click_and_hold(div): long press and click
    • move_by_offset(x,y): the pixel value you want to offset
    • action.move_by_offset(17,yoffset=0).perform(): make the action chain execute immediately
    • action.release() releases the action chain object
  • iframe

Iframe is our commonly used iframe tag: < iframe >. Iframe tag is a form of framework, which is also commonly used. Iframe is generally used to include other pages. For example, we can load the contents of other websites or other pages of our site on our own website page.
The biggest function of iframe tag is to make the page beautiful.
There are many uses of iframe tags. The main difference lies in the different forms of defining iframe tags, such as defining the length, width and height of iframe.

  • Use selenium to achieve the following operations
  1. View and output dragged labels
from selenium import webdriver
from time import sleep

bro = webdriver.Chrome(executable_path=r'E:\Google\chromedriver')

url = 'http://www.runoob.com/try/try.php?filename=jqueryui-api-droppable'
bro.get(url)

div = bro.find_element_by_id('draggable')
print(div)


If the located tag exists in the iframe tag, it must be located by the following operations

# Fill in the id and name of the included iframme
bro.switch_to.frame('iframeResult') # Toggles the scope of browser labels

Run again without error, and print the results:

  1. Complete action
from selenium import webdriver
from time import sleep
from selenium.webdriver import ActionChains # Action chain
 
bro = webdriver.Chrome(executable_path=r'E:\Google\chromedriver')

url = 'http://www.runoob.com/try/try.php?filename=jqueryui-api-droppable'
bro.get(url)

# If the located tag exists in the iframe tag, it must be located by the following operations
bro.switch_to.frame('iframeResult') # Toggles the scope of browser labels
div = bro.find_element_by_id('draggable')
# print(div)

# Action chain: trigger a series of operations
action = ActionChains(bro)
# Click and hold
action.click_and_hold(div)
for i in range(5):
    # move_by_offset(x,y)
    action.move_by_offset(17,yoffset=0).perform() # perform() executes the action chain immediately
    sleep(0.3)

# Action chain release
action.release()

sleep(5)
bro.quit()

3, Simulated Login of selenium

  • demand

Simulate login: https://qzone.qq.com/

  1. Simulate open URL
from selenium import webdriver
from time import sleep

bro = webdriver.Chrome(executable_path=r'E:\Google\chromedriver')

bro.get('https://qzone.qq.com/')


  1. Jump iframe
from selenium import webdriver
from time import sleep

bro = webdriver.Chrome(executable_path=r'E:\Google\chromedriver')

bro.get('https://qzone.qq.com/')

bro.switch_to.frame('login_frame')

a_tag = bro.find_element_by_id('switcher_plogin')
a_tag.click()

# Find the text for entering the account and password
userName_tag = bro.find_element_by_id('u')
password_tag = bro.find_element_by_id('p')
sleep(1)
userName_tag.send_keys('2328409226')
sleep(1)
password_tag.send_keys('123456789')
sleep(1)

# Login button
btn = bro.find_element_by_id('login_button')
btn.click()

sleep(3)
bro.quit()

4, Headless browser and evasion detection

4.1 no visual interface

How to make our Google browser do not carry out visual operation, that is, there is no visual interface (headless browser)

  • Add code ⭐
from selenium.webdriver.chrome.options import Options

# Create a parameter object to control chrome to open in no interface mode
chrome_options = Options()
chrome_options.add_argument('--headless')
chrome_options.add_argument('--disable-gpu')

bro = webdriver.Chrome(executable_path=r'E:\Google\chromedriver',chrome_options=chrome_options)

  • verification
from selenium import webdriver
from time import sleep
from selenium.webdriver.chrome.options import Options 

# Create a parameter object to control chrome to open in no interface mode
chrome_options = Options()
chrome_options.add_argument('--headless')
chrome_options.add_argument('--disable-gpu')

bro = webdriver.Chrome(executable_path=r'E:\Google\chromedriver',chrome_options=chrome_options)

# No visual interface (headless browser)  
# phantomJs is also a headless browser, but it has stopped updating and maintaining
bro.get('https://www.baidu.com/')
print(bro.page_source)


bro.quit()

4.2 avoidance detection

  • Add code ⭐
from selenium.webdriver import ChromeOptions

# Implementation of circumvention detection
option = ChromeOptions()
option.add_experimental_option('excludeSwitches', ['enable-automation'])

  • verification
from selenium import webdriver
from time import sleep
# Realize no visual interface
from selenium.webdriver.chrome.options import Options 
# Implementation of circumvention detection
from selenium.webdriver import ChromeOptions

# Create a parameter object to control chrome to open in no interface mode
chrome_options = Options()
chrome_options.add_argument('--headless')
chrome_options.add_argument('--disable-gpu')

# Implementation of circumvention detection
option = ChromeOptions()
option.add_experimental_option('excludeSwitches', ['enable-automation'])



# How to make selenium avoid the risk of being detected
bro = webdriver.Chrome(executable_path=r'E:\Google\chromedriver',chrome_options=chrome_options,options=option)

# No visual interface (headless browser)  
# phantomJs is also a headless browser, but it has stopped updating and maintaining
bro.get('https://www.baidu.com/')
print(bro.page_source)


bro.quit()

5, One two three zero six simulated Login

  • Coding process
  1. Find account login
  2. Find the label corresponding to the account and password
  3. Find the label of the slider
  4. Sliding with action chain

Thank you for the solutions provided by the following Blogs:

  1. Solve the sliding problem of slider verification code
  2. move_to_element()/drag_and_drop()

5.1 step explanation

  1. No sliding

Since there is no time pause after clicking the button, set the time interval

  1. Slider sliding failed
bro.get('https://kyfw.12306.cn/otn/resources/login.html')
#Prevent login identified as selenium by 12306 ⭐⭐
script = 'Object.defineProperty(navigator,"webdriver",{get:()=>undefined,});'
bro.execute_script(script)
  1. Slider movement position

move_by_offset(xoffset, yoffset) -- move the mouse from the current position to a certain coordinate

# Action chain: trigger a series of operations
action = ActionChains(bro)
# Click and hold
action.click_and_hold(div)
for i in range(5):
    # move_by_offset(x,y)
    action.move_by_offset(17,yoffset=0).perform() # One time offset 17 pixels  
    sleep(0.3)

# Action chain release
action.release()

move_to_element(to_element) -- move the mouse to an element
move_to_element_with_offset(to_element, xoffset, yoffset) -- how far is it from an element (upper left coordinate)


drag_and_drop_by_offset(source, xoffset, yoffset) -- drag to a coordinate and release

#Sliding verification code
span = bro.find_element_by_xpath('//*XPath of [@ id="nc_1_n1z"]) # slider: / / * [@ id="nc_1_n1z"]

bigspan = bro.find_element_by_class_name('nc-lang-cnt')
print('big_span:',bigspan.size)# Length and width corresponding to the label

# To div_tag for sliding operation
action = ActionChains(bro)
# Click and hold the specified label
action.click_and_hold(span).perform()
action.drag_and_drop_by_offset(span, 350, 0).perform()  # Drag to a coordinate and release
# action.drag_and_drop_by_offset(span, 400, 0)
# Action chain release
action.release()

Coordinate system of computer display screen: X increases from left to right and Y increases from top to bottom. That is, the coordinates of the top left visible pixel are (0, 0)

Questions about dragging coordinate points:
As for whether it is (350, 0) or (400, 0) or other coordinates (x,y), just drag the slider to the bottom

perform() -- execute the action chain immediately

5.2 source code

from selenium import webdriver
from time import sleep
from selenium.webdriver import ActionChains # Action chain
from selenium.webdriver import ChromeOptions

# Implementation of circumvention detection
option = ChromeOptions()
option.add_experimental_option('excludeSwitches', ['enable-automation'])


bro = webdriver.Chrome(executable_path=r'E:\Google\chromedriver')

bro.get('https://kyfw.12306.cn/otn/resources/login.html')
#Prevent login identified as selenium by 12306
script = 'Object.defineProperty(navigator,"webdriver",{get:()=>undefined,});'
bro.execute_script(script)

# Find the account login button
tag = bro.find_element_by_xpath('//*[@id="toolbar_Div"]/div[2]/div[2]/ul/li[2]/a')
tag.click()
sleep(1)

# Enter account password
userName_tag = bro.find_element_by_id('J-userName')
password_tag = bro.find_element_by_id('J-password')
sleep(1)
userName_tag.send_keys('17349868689')
sleep(1)
password_tag.send_keys('ZHENHAO0903')
sleep(1)

# Login button
btn = bro.find_element_by_id('J-login')
btn.click()
sleep(3)

#Sliding verification code
span = bro.find_element_by_xpath('//*XPath of [@ id="nc_1_n1z"]) # slider: / / * [@ id="nc_1_n1z"]

bigspan = bro.find_element_by_class_name('nc-lang-cnt')
print('big_span:',bigspan.size)# Length and width corresponding to the label


# Slide div_tag
action = ActionChains(bro)
# Click and hold the specified label
action.click_and_hold(span).perform()
action.drag_and_drop_by_offset(span, 350, 0).perform()  # Drag to a coordinate and release
# action.drag_and_drop_by_offset(span, 400, 0)
# Action chain release
action.release()

sleep(5)
bro.quit()

summary

  1. Slide click using action chain

Given the coordinates 234,11 | 21,3, click the data in the coordinates

  • Convert coordinate data to a list
result = '253,83|253,153'
all_list = [] # Stores the coordinates of the point to be clicked
if '|' in result:
	list_1 = result.split('|') 
	cout_1 = len(list_1)
	for i in range(count_1):
		xy_list = []
		x = int(list_1[i].split(',')[0])
		y = int(list_1[i].split(',')[1])
		xy_list.append(x)
		xy_list.append(y)
		all_list.append(xy_list)
else:
	xy_list = []
	x = int(list_1[i].split(',')[0])
	y = int(list_1[i].split(',')[1])
	xy_list.append(x)
	xy_list.append(y)
	all_list.append(xy_list)
print(all_list)

  • Traverse the list and use the action chain operation
# Traverse the list and click the position specified by X and Y corresponding to each list element using the action chain
for l in all_list:
	x = l[0]
	y = l[1]
	ActionChains(bro).move_to_element-with_offset(img,x,y).perform() # perform() executes immediately
	time.sleep(0.5)
# img is the label that switches the reference to the current picture

Tags: Python Selenium crawler

Posted on Fri, 08 Oct 2021 20:52:37 -0400 by jandante@telenet.be