Python crawler: understanding and using Selenium

Preface

How to tell whether a page loads its data dynamically (Ajax) ⭐⭐

Dynamically loaded data is data that you cannot get by simply requesting the page's URL. To check, open the browser's packet capture tool (the Network tab of the developer tools), find the request for the page URL itself, and search its response for a piece of data that is visible on the page. If the data is not in that response, it is loaded dynamically.
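
The same check can be expressed in code. This is a minimal sketch, assuming you already have the raw response body of the page URL (for example from an HTTP client) and a piece of text you can see in the browser. The function name and the sample strings are made up for illustration.

```python
def is_dynamically_loaded(raw_html: str, visible_text: str) -> bool:
    """Return True when text seen in the browser is missing from the raw response."""
    return visible_text not in raw_html


# Hypothetical raw response: the list stays empty until a script fills it in
raw_html = '<html><body><ul id="gzlist"></ul><script src="app.js"></script></body></html>'

# Text that is absent from the raw HTML, so it must arrive through a later request
print(is_dynamically_loaded(raw_html, "Some Company Ltd."))
# Text that is present in the raw HTML
print(is_dynamically_loaded(raw_html, "gzlist"))
```

A plain substring test is crude (the text may be escaped or split across tags), so treat a positive result as a hint to inspect the other network requests.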

Detailed steps

Where does the dynamically loaded data come from?

1. Introduction to Selenium

What is the relationship between the selenium module and crawlers?

Convenient access to dynamically loaded data on a website
Convenient implementation of simulated login

What is the selenium module?

A module for browser automation

How to use the selenium module:

Install it: pip install selenium ⭐
Download the driver for the browser you use (for Google Chrome: chromedriver).
Pick the driver that matches the version number of the browser you have installed.
Instantiate a browser object
Write the browser automation code:

Send a request: get(url)
Locate tags: the find family of methods
Interact with a tag: send_keys('xxx')
Execute JavaScript: execute_script('jsCode')
Close the browser: quit()

Use Selenium to crawl data from the State Food and Drug Administration ⭐

    from selenium import webdriver
    from lxml import etree
    from time import sleep

    # Instantiate a browser object, passing in the path to the browser driver
    bro = webdriver.Chrome(executable_path=r'E:\Google\chromedriver')
    # Let the browser request the specified URL
    bro.get('http://scxk.nmpa.gov.cn:81/xk/')
    # Get the source code of the page currently shown in the browser
    page_text = bro.page_source

    # Parse the enterprise names
    tree = etree.HTML(page_text)
    li_list = tree.xpath('//ul[@id="gzlist"]/li')
    for li in li_list:
        name = li.xpath('./dl/@title')[0]
        print(name)

    sleep(5)
    # Close the browser
    bro.quit()
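
The parsing step can be tried offline, without a browser. The sketch below uses the standard library's html.parser instead of lxml and collects the title attribute of every dl tag inside the ul whose id is gzlist. The class name, the function name and the sample HTML are invented for illustration, and the matching is looser than the XPath above (it does not require the dl to be a direct child of an li).

```python
from html.parser import HTMLParser


class TitleCollector(HTMLParser):
    """Collect the title attribute of <dl> tags found inside <ul id="gzlist">."""

    def __init__(self):
        super().__init__()
        self.in_list = False
        self.titles = []

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "ul" and attrs.get("id") == "gzlist":
            self.in_list = True
        elif tag == "dl" and self.in_list and "title" in attrs:
            self.titles.append(attrs["title"])

    def handle_endtag(self, tag):
        # Assumes the target list contains no nested <ul>
        if tag == "ul":
            self.in_list = False


def extract_names(page_text: str) -> list:
    parser = TitleCollector()
    parser.feed(page_text)
    return parser.titles


# Made-up page source standing in for bro.page_source
sample = (
    '<ul id="gzlist">'
    '<li><dl title="Company A"></dl></li>'
    '<li><dl title="Company B"></dl></li>'
    '</ul>'
    '<dl title="ignored"></dl>'
)
names = extract_names(sample)
print(names)
```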

Simple automation with Selenium

    from selenium import webdriver
    from time import sleep

    bro = webdriver.Chrome(executable_path=r'E:\Google\chromedriver')
    bro.get('https://www.taobao.com/')

    # Locate the search box
    search_input = bro.find_element_by_id('q')  # returns the located tag
    # Interact with the tag
    search_input.send_keys('Iphone')

    # Execute JavaScript: scroll to the bottom of the page
    bro.execute_script('window.scrollTo(0,document.body.scrollHeight)')
    sleep(2)

    # Click the search button, whose class attribute is 'btn-search tb-bg'
    btn = bro.find_element_by_css_selector('.btn-search')
    # Either class name also locates it:
    # btn = bro.find_element_by_class_name('btn-search')
    # btn = bro.find_element_by_class_name('tb-bg')
    btn.click()

    bro.get('https://www.baidu.com')
    sleep(2)
    # Go back
    bro.back()
    sleep(2)
    # Go forward
    bro.forward()
    sleep(5)
    bro.quit()

2. Handling iframes with Selenium

If the tag to locate is inside an iframe, you must first call switch_to.frame(id)
Action chain (drag): from selenium.webdriver import ActionChains
- Instantiate an action chain object: action = ActionChains(bro)
- action.click_and_hold(div): press and hold the mouse button on the element
- move_by_offset(x, y): move the mouse by the given offset in pixels
- action.move_by_offset(17, yoffset=0).perform(): perform() makes the action chain execute immediately
- action.release(): release the held mouse button
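
When a drag is broken into several move_by_offset calls, the per-step offsets have to add up to the total distance. Here is a small sketch of one way to compute them; the function name is hypothetical and the even split is an assumption, not something the drag itself requires.

```python
def step_offsets(total: int, steps: int) -> list:
    """Split a horizontal drag of `total` pixels into `steps` whole-pixel moves."""
    if steps <= 0:
        raise ValueError("steps must be positive")
    base, remainder = divmod(total, steps)
    # The first `remainder` moves get one extra pixel so the sum stays exact
    return [base + 1] * remainder + [base] * (steps - remainder)


offsets = step_offsets(85, 5)
print(offsets)
```

Each value would then be passed as the x offset of one move_by_offset(dx, 0) call, with a short sleep between calls.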

iframe

An iframe (the <iframe> tag) is an inline frame, a commonly used way to embed another page inside the current one. For example, we can load content from other websites, or other pages of our own site, in our own page.
Because the embedded page is a separate document, its tags are not part of the outer page's DOM.
iframe tags can be used in many ways. The main differences lie in how the tag is defined, for example the width and height given to the iframe.

Use Selenium to perform the following operations

Locate and print the draggable tag

    from selenium import webdriver
    from time import sleep

    bro = webdriver.Chrome(executable_path=r'E:\Google\chromedriver')
    url = 'http://www.runoob.com/try/try.php?filename=jqueryui-api-droppable'
    bro.get(url)
    # Fails with NoSuchElementException: the tag is inside an iframe
    div = bro.find_element_by_id('draggable')
    print(div)

If the tag to locate is inside an iframe, switch into the iframe first:

    # Pass the id or name of the iframe that contains the tag
    bro.switch_to.frame('iframeResult')  # switch the scope used for locating tags

Run it again: this time there is no error, and the result is printed:

Complete code

    from selenium import webdriver
    from time import sleep
    from selenium.webdriver import ActionChains  # action chain

    bro = webdriver.Chrome(executable_path=r'E:\Google\chromedriver')
    url = 'http://www.runoob.com/try/try.php?filename=jqueryui-api-droppable'
    bro.get(url)

    # The tag is inside an iframe, so switch into the iframe first
    bro.switch_to.frame('iframeResult')  # switch the scope used for locating tags
    div = bro.find_element_by_id('draggable')
    # print(div)

    # Action chain: trigger a series of operations
    action = ActionChains(bro)
    # Click and hold
    action.click_and_hold(div)
    for i in range(5):
        # move_by_offset(x, y); perform() executes the action chain immediately
        action.move_by_offset(17, yoffset=0).perform()
        sleep(0.3)
    # Release the mouse button; without perform() the release is never sent
    action.release().perform()
    sleep(5)
    bro.quit()
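
A variant of the same flow uses explicit waits instead of fixed sleeps and queues the whole drag before a single perform(). This is only a sketch: the function name and its parameters are hypothetical, and it assumes a driver object has already been created. The Selenium imports sit inside the function so that the definition loads even where Selenium is not installed.

```python
def drag_inside_frame(driver, frame_name, element_id, dx, timeout=10):
    """Switch into an iframe, then press, drag and release one element."""
    # Imported here so the module still loads where Selenium is not installed
    from selenium.webdriver import ActionChains
    from selenium.webdriver.common.by import By
    from selenium.webdriver.support import expected_conditions as EC
    from selenium.webdriver.support.ui import WebDriverWait

    wait = WebDriverWait(driver, timeout)
    # Wait until the frame exists, then switch the lookup scope into it
    wait.until(EC.frame_to_be_available_and_switch_to_it(frame_name))
    element = wait.until(EC.presence_of_element_located((By.ID, element_id)))

    # Queue press, move and release, then run them all with one perform()
    ActionChains(driver).click_and_hold(element).move_by_offset(dx, 0).release().perform()

    # Return to the top-level document for any later lookups
    driver.switch_to.default_content()
```

With the page used in this section, a call might look like drag_inside_frame(bro, 'iframeResult', 'draggable', 85).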

3. Simulated login with Selenium

Requirement

Simulate a login to https://qzone.qq.com/

Open the URL

    from selenium import webdriver
    from time import sleep

    bro = webdriver.Chrome(executable_path=r'E:\Google\chromedriver')
    bro.get('https://qzone.qq.com/')

Switch into the iframe and log in

    from selenium import webdriver
    from time import sleep

    bro = webdriver.Chrome(executable_path=r'E:\Google\chromedriver')
    bro.get('https://qzone.qq.com/')

    # The login form is inside an iframe
    bro.switch_to.frame('login_frame')
    a_tag = bro.find_element_by_id('switcher_plogin')
    a_tag.click()

    # Locate the account and password input boxes
    userName_tag = bro.find_element_by_id('u')
    password_tag = bro.find_element_by_id('p')
    sleep(1)
    userName_tag.send_keys('your_account')  # placeholder: use your own account
    sleep(1)
    password_tag.send_keys('your_password')  # placeholder: use your own password
    sleep(1)

    # Login button
    btn = bro.find_element_by_id('login_button')
    btn.click()
    sleep(3)
    bro.quit()

4. Headless browser and evading detection

4.1 No visual interface

How do we make Google Chrome run without a visual interface (a headless browser)?

Code to add ⭐

    from selenium.webdriver.chrome.options import Options

    # Create an options object that makes Chrome start without a visual interface
    chrome_options = Options()
    chrome_options.add_argument('--headless')
    chrome_options.add_argument('--disable-gpu')
    bro = webdriver.Chrome(executable_path=r'E:\Google\chromedriver', options=chrome_options)

Verification

    from selenium import webdriver
    from time import sleep
    from selenium.webdriver.chrome.options import Options

    # Create an options object that makes Chrome start without a visual interface
    chrome_options = Options()
    chrome_options.add_argument('--headless')
    chrome_options.add_argument('--disable-gpu')
    bro = webdriver.Chrome(executable_path=r'E:\Google\chromedriver', options=chrome_options)

    # No visual interface (headless browser)
    # PhantomJS is also a headless browser, but it is no longer updated or maintained
    bro.get('https://www.baidu.com/')
    print(bro.page_source)
    bro.quit()

4.2 Evading detection

Code to add ⭐

    from selenium.webdriver import ChromeOptions

    # Evade detection
    option = ChromeOptions()
    option.add_experimental_option('excludeSwitches', ['enable-automation'])

Verification

    from selenium import webdriver
    from time import sleep
    from selenium.webdriver import ChromeOptions

    # Chrome() uses only one options object, so both settings go on the same one
    option = ChromeOptions()
    # No visual interface
    option.add_argument('--headless')
    option.add_argument('--disable-gpu')
    # Evade detection
    option.add_experimental_option('excludeSwitches', ['enable-automation'])

    # Let Selenium avoid the risk of being detected
    bro = webdriver.Chrome(executable_path=r'E:\Google\chromedriver', options=option)

    # No visual interface (headless browser)
    # PhantomJS is also a headless browser, but it is no longer updated or maintained
    bro.get('https://www.baidu.com/')
    print(bro.page_source)
    bro.quit()
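
One way to keep the two kinds of settings together is to collect them as plain data first and apply them to a single options object afterwards. The sketch below assumes only the flags that appear in this section; both function names are hypothetical, and the Selenium import is deferred so the first function can be used without Selenium installed.

```python
def chrome_settings(headless=True, hide_automation=True):
    """Collect the command-line arguments and experimental options from this section."""
    arguments = []
    experimental = {}
    if headless:
        arguments += ["--headless", "--disable-gpu"]
    if hide_automation:
        experimental["excludeSwitches"] = ["enable-automation"]
    return arguments, experimental


def build_options(headless=True, hide_automation=True):
    """Apply the settings to a single ChromeOptions object."""
    from selenium.webdriver import ChromeOptions  # imported lazily; requires Selenium

    options = ChromeOptions()
    arguments, experimental = chrome_settings(headless, hide_automation)
    for argument in arguments:
        options.add_argument(argument)
    for name, value in experimental.items():
        options.add_experimental_option(name, value)
    return options


arguments, experimental = chrome_settings()
print(arguments)
print(experimental)
```

The object returned by build_options() would be passed to webdriver.Chrome through its options parameter.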

5. Simulated login to 12306

Coding process

Find the account login tab
Find the tags for the account and password
Find the slider tag
Slide it with an action chain

Thanks to the following blogs for the solutions they provided:

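5.1 Step-by-step explanation

Problem: the slider does not move

There was no pause after clicking the login button, so add a delay (sleep) before locating the slider.

Problem: the slider moves but the verification fails

    bro.get('https://kyfw.12306.cn/otn/resources/login.html')
    # Prevent 12306 from identifying the session as Selenium ⭐⭐
    # The property descriptor makes navigator.webdriver read as undefined
    script = 'Object.defineProperty(navigator, "webdriver", {get: () => undefined});'
    bro.execute_script(script)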

Moving the slider

move_by_offset(xoffset, yoffset): move the mouse from its current position by the given offset

    # Action chain: trigger a series of operations
    action = ActionChains(bro)
    # Click and hold
    action.click_and_hold(div)
    for i in range(5):
        # move_by_offset(x, y): 17 pixels per step
        action.move_by_offset(17, yoffset=0).perform()
        sleep(0.3)
    # Release the mouse button
    action.release().perform()

move_to_element(to_element): move the mouse to an element
move_to_element_with_offset(to_element, xoffset, yoffset): move the mouse to an offset relative to an element (measured from its top-left corner)

drag_and_drop_by_offset(source, xoffset, yoffset): press on the source element, drag it by the offset, and release

    # Sliding verification code; slider XPath: //*[@id="nc_1_n1z"]
    span = bro.find_element_by_xpath('//*[@id="nc_1_n1z"]')
    bigspan = bro.find_element_by_class_name('nc-lang-cnt')
    print('big_span:', bigspan.size)  # width and height of the tag

    # Slide the slider
    action = ActionChains(bro)
    # Click and hold the specified tag
    action.click_and_hold(span).perform()
    action.drag_and_drop_by_offset(span, 350, 0).perform()  # drag by the offset and release
    # action.drag_and_drop_by_offset(span, 400, 0)
    # Release the mouse button
    action.release().perform()

Coordinate system of a computer screen: X increases from left to right and Y increases from top to bottom, so the top-left visible pixel is at (0, 0).

About the drag offset:
Whether you use (350, 0), (400, 0) or some other (x, y) does not matter, as long as the slider is dragged to the end of its track.

perform(): execute the action chain immediately
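
If you prefer to derive the offset rather than pick a value that is large enough, the distance can be estimated from the sizes of the two tags. This is a sketch under assumptions: the knob starts at the left edge of the track, the sizes are dictionaries shaped like Selenium's element.size, and the numbers are invented for illustration.

```python
def slider_distance(track_size: dict, knob_size: dict) -> int:
    """Horizontal offset that moves a knob from the left edge of its track to the right edge."""
    distance = track_size["width"] - knob_size["width"]
    if distance < 0:
        raise ValueError("knob is wider than its track")
    return distance


# Hypothetical sizes, in the same shape as Selenium's element.size
track = {"width": 340, "height": 40}
knob = {"width": 40, "height": 40}
print(slider_distance(track, knob))
```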

5.2 Source code

    from selenium import webdriver
    from time import sleep
    from selenium.webdriver import ActionChains  # action chain
    from selenium.webdriver import ChromeOptions  # evasion of detection

    option = ChromeOptions()
    option.add_experimental_option('excludeSwitches', ['enable-automation'])

    # Pass the options object in, otherwise the setting above has no effect
    bro = webdriver.Chrome(executable_path=r'E:\Google\chromedriver', options=option)
    bro.get('https://kyfw.12306.cn/otn/resources/login.html')

    # Prevent 12306 from identifying the session as Selenium
    script = 'Object.defineProperty(navigator, "webdriver", {get: () => undefined});'
    bro.execute_script(script)

    # Find the account login tab
    tag = bro.find_element_by_xpath('//*[@id="toolbar_Div"]/div[2]/div[2]/ul/li[2]/a')
    tag.click()
    sleep(1)

    # Enter the account and password (placeholders: use your own)
    userName_tag = bro.find_element_by_id('J-userName')
    password_tag = bro.find_element_by_id('J-password')
    sleep(1)
    userName_tag.send_keys('your_account')
    sleep(1)
    password_tag.send_keys('your_password')
    sleep(1)

    # Login button
    btn = bro.find_element_by_id('J-login')
    btn.click()
    sleep(3)

    # Sliding verification code; slider XPath: //*[@id="nc_1_n1z"]
    span = bro.find_element_by_xpath('//*[@id="nc_1_n1z"]')
    bigspan = bro.find_element_by_class_name('nc-lang-cnt')
    print('big_span:', bigspan.size)  # width and height of the tag

    # Slide the slider
    action = ActionChains(bro)
    # Click and hold the specified tag
    action.click_and_hold(span).perform()
    action.drag_and_drop_by_offset(span, 350, 0).perform()  # drag by the offset and release
    # action.drag_and_drop_by_offset(span, 400, 0)
    # Release the mouse button
    action.release().perform()
    sleep(5)
    bro.quit()

Summary

Clicking positions with an action chain

Given coordinates such as 234,11|21,3, click each of those points

Convert the coordinate string into a list

    result = '253,83|253,153'
    all_list = []  # stores the coordinates of the points to click
    if '|' in result:
        list_1 = result.split('|')
        count_1 = len(list_1)
        for i in range(count_1):
            xy_list = []
            x = int(list_1[i].split(',')[0])
            y = int(list_1[i].split(',')[1])
            xy_list.append(x)
            xy_list.append(y)
            all_list.append(xy_list)
    else:
        # Only one point: parse the string itself
        xy_list = []
        x = int(result.split(',')[0])
        y = int(result.split(',')[1])
        xy_list.append(x)
        xy_list.append(y)
        all_list.append(xy_list)
    print(all_list)
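
The two branches can be collapsed into one loop, because str.split returns a one-element list when the separator is absent. A compact sketch of the same conversion follows; the function name is hypothetical.

```python
def parse_points(result: str) -> list:
    """Turn 'x1,y1|x2,y2' into [[x1, y1], [x2, y2]]."""
    points = []
    for pair in result.split("|"):
        x, y = pair.split(",")
        # int() tolerates surrounding spaces such as '11 ' or ' 21'
        points.append([int(x), int(y)])
    return points


print(parse_points("253,83|253,153"))
print(parse_points("234,11"))
```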

Traverse the list and click each point with an action chain

    # Traverse the list and click the (x, y) position given by each element
    # img is the located tag of the current picture; the offsets are measured from it
    for l in all_list:
        x = l[0]
        y = l[1]
        # perform() executes immediately
        ActionChains(bro).move_to_element_with_offset(img, x, y).click().perform()
        time.sleep(0.5)


8 October 2021, 20:52 | Views: 2153
