1, Crawling dynamically rendered pages
1. Background
- For data that is returned directly when a page is requested (that is, the content is visible in the response itself rather than loaded or rendered via Ajax), we can crawl it with urllib, requests, or the Scrapy framework.
- For page content rendered dynamically by JavaScript (loaded via Ajax), we can usually find the Ajax request address through packet capture and fetch the data from it directly.
- Ajax = Asynchronous JavaScript and XML (XML being a subset of the Standard Generalized Markup Language).
- Ajax is a technique for creating fast, dynamic web pages.
- Ajax can update part of a web page without reloading the whole page. For example: the product reviews on a JD product page.
- Even when the data comes from an Ajax request, the request may carry encrypted parameters that are computed by JavaScript, which makes it hard to work out the pattern directly. The Taobao pages are an example.
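When the Ajax address can be identified, the response is usually JSON and can be parsed without a browser. Below is a minimal sketch using only the standard library. The endpoint, the query parameters, and the payload shape (`comments`, `content`) are all invented for illustration; a real site will differ.

```python
import json
from urllib.parse import urlencode

# Hypothetical Ajax endpoint; a real address is found through packet capture.
BASE_URL = "https://example.com/api/comments"


def build_ajax_url(product_id, page):
    """Build the request URL that the page's JavaScript would call."""
    return BASE_URL + "?" + urlencode({"productId": product_id, "page": page})


def parse_comments(payload):
    """Extract the comment texts from a JSON response body."""
    data = json.loads(payload)
    return [item["content"] for item in data.get("comments", [])]


# Sample body standing in for a real response (no network access needed).
sample = '{"comments": [{"content": "Great"}, {"content": "Fast delivery"}]}'
url = build_ajax_url(100, 1)
comments = parse_comments(sample)
print(url)
print(comments)
```

In practice the body would come from urllib or requests rather than a literal string.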
2. Solutions
- Principle: to get around these problems, we can simulate a real browser and read the information from the rendered page.
- Python has many libraries for driving or simulating a browser, such as Selenium, Splash, PyV8, and Ghost.
2, Introduction to Selenium
1. What Selenium does
- Selenium is an automated testing tool that drives a browser to perform specific actions such as clicking, dragging, and moving the scroll bar. (Selenium is a large project; WebDriver is just one part of it.)
- Selenium can obtain the source code of the page as currently rendered by the browser, so whatever is visible can be crawled. This makes it very effective for content rendered dynamically by JavaScript.
- Selenium supports many browsers, such as Chrome, Firefox, and Edge. It also supports browsing without an interface, for which the PhantomJS driver needs to be installed.
- Official website: http://www.seleniumhq.org
- Official documentation: http://selenium-python.readthedocs.io
- Chinese documentation: http://selenium-python-zh.readthedocs.io
2. Selenium installation
2.1 Module installation
pip install selenium
2.2 Driver installation (download the driver, then place it in the specified location):
2.2.1 Driver download:
- Chrome: install the ChromeDriver browser driver, paying attention to the browser version. First check the current version of Google Chrome (Help -> About Google Chrome); Chrome v61 ~ v67, for example, corresponds to ChromeDriver 2.35 ~ 2.38. Then download the matching driver from the ChromeDriver download site.
- Firefox: requires the GeckoDriver driver (geckodriver.exe), available from the driver collection site or the official Firefox mirror.
- PhantomJS: download the driver from the no-interface driver download site.
2.2.2 Driver installation:
- Windows: extract chromedriver.exe and place it in Python's Scripts directory.
- Mac/Linux: extract chromedriver and place it in the /usr/local/bin/ directory.
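Both locations work because they are normally on the PATH. A quick way to confirm that the driver can be found is to look it up with the standard library. This is a small sketch, and the helper name `find_driver` is only illustrative.

```python
import shutil


def find_driver(name="chromedriver"):
    """Return the full path of the driver executable if it is on PATH, else None."""
    return shutil.which(name)


path = find_driver()
if path:
    print("Driver found at:", path)
else:
    print("chromedriver is not on PATH; check the installation directory")
```

The same check works for Firefox by passing "geckodriver" as the name.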
3. Shortcomings
Selenium is slow: the simulated browser has to start up, and the page has to finish loading and executing its scripts before the rendered data is available.
3, Using Selenium
1. Declare a browser object
from selenium import webdriver

# Pick the line that matches the installed browser and driver
driver = webdriver.Chrome()     # Chrome needs: ChromeDriver
driver = webdriver.Firefox()    # Firefox needs: GeckoDriver
driver = webdriver.Edge()
driver = webdriver.Safari()
driver = webdriver.PhantomJS()  # No-interface browser (Selenium 3 only; removed in Selenium 4)
Create a Chrome browser object with Chrome's Headless (no interface) mode enabled:
from selenium import webdriver

chrome_options = webdriver.ChromeOptions()
chrome_options.add_argument('--headless')
driver = webdriver.Chrome(options=chrome_options)
2. Visit a page
from selenium import webdriver

driver = webdriver.Chrome()
# driver = webdriver.PhantomJS()
driver.get("http://www.taobao.com")  # Visit the specified website
print(driver.page_source)
# driver.close()  # The driver should be closed before the program ends; it is commented out here because closing the browser would stop the program from continuing
3. Find nodes:
3.1 Use Selenium's built-in functions to get nodes
How to get a single node (Selenium 3 style; these helpers were removed in Selenium 4.3, where the find_element(By.ID, ...) forms from 3.2 are used instead):
find_element_by_id()
find_element_by_name()
find_element_by_xpath()
find_element_by_link_text()
find_element_by_partial_link_text()
find_element_by_tag_name()
find_element_by_class_name()
find_element_by_css_selector()
from selenium import webdriver
from selenium.webdriver.common.by import By

# Create the browser object
driver = webdriver.Chrome()
# driver = webdriver.PhantomJS()
driver.get("http://www.taobao.com")

# Each lookup below returns the node whose id attribute is q
input = driver.find_element_by_id("q")
print(input)
input = driver.find_element_by_css_selector("#q")
print(input)
input = driver.find_element_by_xpath("//*[@id='q']")
print(input)

# Same effect as above
input = driver.find_element(By.ID, "q")
print(input)
# driver.close()
How to get multiple nodes (each returns a list):
find_elements_by_id()
find_elements_by_name()
find_elements_by_xpath()
find_elements_by_link_text()
find_elements_by_partial_link_text()
find_elements_by_tag_name()
find_elements_by_class_name()
find_elements_by_css_selector()
3.2 Use the By class from selenium.webdriver.common.by to find nodes
- Import: from selenium.webdriver.common.by import By
- By is a class built into Selenium that defines the supported strategies for locating elements
- Locator strategies supported by By:
1. Locate by id attribute:
find_element(By.ID,"id")
2. Locate by name attribute:
find_element(By.NAME,"name")
3. Locate by class name:
find_element(By.CLASS_NAME,"classname")
4. Locate an <a> tag by its full link text:
find_element(By.LINK_TEXT,"text")
5. Locate an <a> tag by part of its link text:
find_element(By.PARTIAL_LINK_TEXT,"partialtext")
6. Locate by tag name:
find_element(By.TAG_NAME,"input")
7. Locate by XPath:
find_element(By.XPATH,"//div[@name='name']")
8. Locate by CSS selector:
find_element(By.CSS_SELECTOR,"#id")
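Several of these strategies select the same nodes as a CSS selector; section 3.1 already shows the id q and the selector #q finding the same node. The sketch below spells out that correspondence in plain Python. The helper `css_equivalent` is hypothetical, not part of Selenium, and it assumes simple values with no characters that would need CSS escaping.

```python
def css_equivalent(strategy, value):
    """Return a CSS selector matching the same nodes, or None if there is none.

    `strategy` is the string value of a By constant, e.g. By.ID == "id".
    """
    if strategy == "id":
        return "#" + value
    if strategy == "class name":
        return "." + value
    if strategy == "name":
        return '[name="%s"]' % value
    if strategy in ("tag name", "css selector"):
        return value
    # Link text and XPath lookups have no direct CSS counterpart.
    return None


print(css_equivalent("id", "q"))
print(css_equivalent("class name", "btn-search"))
print(css_equivalent("xpath", "//*[@id='q']"))
```

One practical consequence is that a crawler can standardize on By.CSS_SELECTOR for most lookups and keep By.XPATH and the link text strategies for the cases CSS cannot express.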
4. Node interaction:
input.clear() -- clear the input box (get the input box node first)
input.send_keys('**') -- simulate keyboard input of the given text (get the input box node first)
button.click() -- trigger a click (get the button node first)
from selenium import webdriver
from selenium.webdriver.common.by import By
import time

# Create the browser object
driver = webdriver.Chrome()
# driver = webdriver.PhantomJS()
driver.get("http://www.taobao.com")

# Get the node whose id attribute is q (the search box)
input = driver.find_element(By.ID, "q")
# Simulate typing iphone
input.send_keys('iphone')
time.sleep(3)
# Clear the input box
input.clear()
# Simulate typing iPad
input.send_keys('iPad')
# Get the search button node
button = driver.find_element(By.CLASS_NAME, "btn-search")
# Trigger the click
button.click()
# driver.close()
5. Action chains:
Import the ActionChains class:
from selenium.webdriver import ActionChains
- ActionChains is a way to automate low-level interactions (simulating mouse and keyboard operations), such as mouse movement, mouse button actions, key presses, and context menu interaction.
- This is useful for more complex operations such as hovering and dragging
- move_to_element(to_element) -- move the mouse to the middle of the element (hover)
- move_by_offset(xoffset, yoffset) -- move the mouse by an offset from its current position
- drag_and_drop(source, target) -- hold down the left mouse button on the source element, then move to the target element and release the mouse button
- pause(seconds) -- pause all input for the specified duration in seconds
- perform() -- perform all stored actions
- release(on_element=None) -- release a held mouse button on the element
- reset_actions() -- clear actions that are already stored on the remote side
- send_keys(*keys_to_send) -- send keys to the currently focused element
- send_keys_to_element(element, *keys_to_send) -- send keys to an element
Drag case:
from selenium import webdriver
from selenium.webdriver import ActionChains
from selenium.webdriver.common.by import By
import time

# Create the browser object
driver = webdriver.Chrome()
# Load the specified url address
url = 'http://www.runoob.com/try/try.php?filename=jqueryui-api-droppable'
driver.get(url)
# Switch to the frame that contains the demo
driver.switch_to.frame('iframeResult')
# Get the two div node objects
source = driver.find_element(By.CSS_SELECTOR, "#draggable")
target = driver.find_element(By.CSS_SELECTOR, "#droppable")
# Create an action chain object
actions = ActionChains(driver)
# Queue a drag operation on the action chain (nothing happens yet)
actions.drag_and_drop(source, target)
time.sleep(3)
# Perform all stored actions in order
actions.perform()
# driver.close()
6. Execute JavaScript:
driver.execute_script('window.open()') -- execute a JavaScript command
Scroll bar example:
from selenium import webdriver

# Create the browser object
driver = webdriver.Chrome()
# Load the specified url address
driver.get("https://www.zhihu.com/explore")
# Execute JavaScript to scroll the page to the bottom
driver.execute_script('window.scrollTo(0,document.body.scrollHeight)')
# Execute JavaScript to show a pop-up alert
driver.execute_script('window.alert("Hello Selenium!")')
# driver.close()
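Pages that load more content as you scroll usually need the scroll repeated until the page height stops growing. The sketch below shows one way to write that loop. `scroll_to_end` and `FakeDriver` are illustrative names; the fake driver only imitates `execute_script` so the loop can run without a browser.

```python
import time


def scroll_to_end(driver, pause=0.0, max_rounds=20):
    """Scroll down until the page height stops growing; return the number of scrolls."""
    last_height = driver.execute_script("return document.body.scrollHeight")
    for round_no in range(1, max_rounds + 1):
        driver.execute_script("window.scrollTo(0, document.body.scrollHeight)")
        time.sleep(pause)  # give lazily loaded content time to arrive
        new_height = driver.execute_script("return document.body.scrollHeight")
        if new_height == last_height:
            return round_no
        last_height = new_height
    return max_rounds


class FakeDriver:
    """Stand-in for a real driver: the page grows twice, then stops."""

    def __init__(self):
        self.heights = [1000, 2000, 3000]
        self.index = 0

    def execute_script(self, script):
        if script.startswith("return"):
            return self.heights[self.index]
        # A scroll loads more content until the last height is reached.
        self.index = min(self.index + 1, len(self.heights) - 1)
        return None


rounds = scroll_to_end(FakeDriver())
print(rounds)
```

With a real browser, pass the Selenium driver instead of FakeDriver and set pause to a second or two so new content has time to load.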
7. Get node information:
from selenium import webdriver
from selenium.webdriver.common.by import By

# Create the browser object
driver = webdriver.Chrome()
# Load the specified url address
driver.get("https://www.zhihu.com/explore")

# Get the node whose id attribute is zh-top-link-logo (the logo)
logo = driver.find_element(By.ID, "zh-top-link-logo")
print(logo)                         # Output the node object
print(logo.get_attribute('class'))  # class attribute value of the node

# Get the node whose id attribute is zu-top-add-question (the question button)
input = driver.find_element(By.ID, "zu-top-add-question")
print(input.text)          # Text content of the node
print(input.id)            # Selenium's internal id for the node (not the HTML id attribute)
print(input.location)      # Position of the node in the page
print(input.tag_name)      # Tag name of the node
print(input.size)          # Size of the node
print(driver.page_source)  # Page source code
# driver.close()
8. Switch Frame:
- A web page can contain iframe nodes, which split the page into parent and child frames.
- We can use switch_to.frame() to switch to a child frame. For an example, see the drag case in section 5.
9. Delayed waiting:
- The browser needs time to load a web page, and a browser driven by Selenium is no exception. To get the full content of the page, you need to wait.
- Selenium offers two ways to wait: implicit waiting and explicit waiting (recommended).
9.1 Implicit waiting (a fixed timeout applied to every element lookup): driver.implicitly_wait(2)
from selenium import webdriver
from selenium.webdriver.common.by import By

# Create the browser object
driver = webdriver.Chrome()
# Use an implicit wait of up to 2 seconds
driver.implicitly_wait(2)
# Load the specified url address
driver.get("https://www.zhihu.com/explore")
# Get the node
input = driver.find_element(By.ID, "zu-top-add-question")
print(input.text)  # Text content of the node
# driver.close()
9.2 Explicit waiting
- When is it used? When your operation changes the page and the next step works on the changed elements, you must wait for them.
- Usage:
Before use, import the related modules:
from selenium.webdriver.support.wait import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.common.by import By
1. Decide on the locator expression for the element first
web_locator = 'XXXX'
2. Create a WebDriverWait object with the total waiting time and the polling interval, then call its until or until_not method:
WebDriverWait(driver, waiting time, polling interval).until(condition)
WebDriverWait(driver, waiting time, polling interval).until_not(condition)
3. Use expected_conditions to build the condition:
EC.condition_name((locating method, locator expression))
For example: EC.presence_of_element_located((By.CSS_SELECTOR, web_locator))
- Typical conditions for explicit waiting (usually built with the EC module):
- Wait for an element to be visible
- Wait for an element to be usable
- Wait for a new window to appear
- Wait for the url address to equal xxx
- In each case you set the maximum waiting time in seconds; by default the condition is checked every 0.5 seconds
- How explicit waiting works:
An explicit wait blocks until a given condition is met before the next operation runs. The program checks the condition at every polling interval. If the condition holds, the next step is executed; otherwise it keeps waiting. If the maximum time is exceeded, a TimeoutException is thrown.
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.wait import WebDriverWait

# Create the browser object
driver = webdriver.Chrome()
# Load the specified url address
driver.get("https://www.zhihu.com/explore")
# Explicit wait, up to 10 seconds
wait = WebDriverWait(driver, 10)
# Waiting condition: a node whose id attribute is zu-top-add-question must be loaded within 10 seconds, otherwise an exception is thrown
input = wait.until(EC.presence_of_element_located((By.ID, 'zu-top-add-question')))
print(input.text)  # Text content of the node
# driver.close()
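The polling behaviour described above can be sketched in plain Python. This is not Selenium's implementation, only an illustration of the idea: `wait_until` is a hypothetical helper, and it raises the built-in TimeoutError where Selenium raises its own TimeoutException.

```python
import time


def wait_until(condition, timeout=10.0, poll=0.5):
    """Call condition() every `poll` seconds until it returns a truthy value.

    Raises TimeoutError if `timeout` seconds pass first.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError("condition not met within %s seconds" % timeout)
        time.sleep(poll)


# Example condition: becomes true on the third check.
calls = {"count": 0}


def ready():
    calls["count"] += 1
    return "loaded" if calls["count"] >= 3 else False


value = wait_until(ready, timeout=2.0, poll=0.01)
print(value, calls["count"])
```

Returning the condition's own result, rather than just True, is what lets wait.until(...) hand back the located node in the example above.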
10. The expected_conditions (EC) module
- Import: from selenium.webdriver.support import expected_conditions as EC
- Introduction: Selenium's expected_conditions module collects a set of ready-made condition checks, for example: whether an element exists, whether an alert has popped up, or whether a dynamic element is ready.
- Suggestion: use it together with the wait module, for example to detect page jumps
- Some of the methods:
1, title_is: checks whether the title of the current page is exactly equal (==) to the expected string, and returns a Boolean value;
2, title_contains: checks whether the title of the current page contains the expected string, and returns a Boolean value;
from selenium import webdriver
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Chrome()  # Start Chrome
driver.get('http://www.baidu.com')
title = EC.title_is('百度一下，你就知道')(driver)  # Exact match against the page title
print(title)
title = EC.title_contains('百度')(driver)  # Substring match
print(title)
3, presence_of_element_located: checks whether an element has been added to the DOM tree; this does not mean the element is visible;
4, visibility_of_element_located: checks whether an element is visible. Visible means the element is not hidden and its width and height are not 0;
5, visibility_of: does the same as the method above, but that one takes a locator while this one takes the located element directly;
6, presence_of_all_elements_located: checks whether at least one matching element exists in the DOM tree. For example, if n elements on the page have the class 'column-md-3', the condition is satisfied as soon as one of them exists (it returns the list of matching elements);
7, text_to_be_present_in_element: checks whether the text of an element contains the expected string;
8, text_to_be_present_in_element_value: checks whether the value attribute of an element contains the expected string;
9, frame_to_be_available_and_switch_to_it: checks whether the frame can be switched into; if so, it switches in and returns True, otherwise it returns False;
10, invisibility_of_element_located: checks whether an element is absent from the DOM tree or not visible;
11, element_to_be_clickable: checks whether an element is visible and enabled, which is what clickable means;
12, staleness_of: waits for an element to be removed from the DOM tree. Note that this method also returns True or False;
13, element_selection_state_to_be: checks whether the selected state of an element matches the expectation;
14, element_located_selection_state_to_be: same as the method above, but that one takes the located element while this one takes a locator;
15, alert_is_present: checks whether an alert is present on the page.
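As the title_is call earlier in this section shows, a condition is simply something you call with the driver. That means you can write your own when the built-in ones do not fit. The class below is a hypothetical example, not part of Selenium, and `FakeDriver` stands in for a real driver so the sketch runs without a browser.

```python
class title_starts_with:
    """Hypothetical custom condition: true when the page title starts with `prefix`."""

    def __init__(self, prefix):
        self.prefix = prefix

    def __call__(self, driver):
        # A condition only needs to accept the driver and return a truthy or falsy value.
        return driver.title.startswith(self.prefix)


class FakeDriver:
    """Minimal stand-in exposing only the `title` attribute used above."""

    def __init__(self, title):
        self.title = title


condition = title_starts_with("Sel")
print(condition(FakeDriver("Selenium docs")))
print(condition(FakeDriver("Other page")))
```

With a real driver, the same object can be passed to an explicit wait, for example WebDriverWait(driver, 10).until(title_starts_with("Sel")).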
11. Forward and backward:
driver.back() -- go back
driver.forward() -- go forward
from selenium import webdriver
import time

# Create the browser object
driver = webdriver.Chrome()
# Load three pages in turn to build up the history
driver.get("https://www.baidu.com")
driver.get("https://www.taobao.com")
driver.get("https://www.jd.com")
time.sleep(2)
driver.back()     # Go back
time.sleep(2)
driver.forward()  # Go forward
# driver.close()
12. Cookies:
driver.get_cookies() -- get the current cookies
driver.add_cookie({'name':'name','domain':'www.zhihu.com','value':'zhangsan'}) -- add a cookie (name and value are required)
driver.delete_all_cookies() -- delete all cookies
from selenium import webdriver

# Create the browser object
driver = webdriver.Chrome()
# Load the specified url address
driver.get("https://www.zhihu.com/explore")
print(driver.get_cookies())
driver.add_cookie({'name': 'name', 'domain': 'www.zhihu.com', 'value': 'zhangsan'})
print(driver.get_cookies())
driver.delete_all_cookies()
print(driver.get_cookies())
# driver.close()
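A common follow-up is to log in with Selenium and then reuse the cookies in a lighter HTTP client. The sketch below converts the list returned by get_cookies() into a dict and into a Cookie header value. The helper names and the sample values are made up for illustration.

```python
def cookies_to_dict(cookies):
    """Convert the list returned by driver.get_cookies() into a name -> value dict."""
    return {cookie["name"]: cookie["value"] for cookie in cookies}


def cookies_to_header(cookies):
    """Build the value of a Cookie request header from the same list."""
    return "; ".join("%s=%s" % (c["name"], c["value"]) for c in cookies)


# Sample data in the shape get_cookies() returns (values are made up).
sample = [
    {"name": "session", "value": "abc123", "domain": "www.zhihu.com", "path": "/"},
    {"name": "lang", "value": "en", "domain": "www.zhihu.com", "path": "/"},
]
print(cookies_to_dict(sample))
print(cookies_to_header(sample))
```

The dict form can be handed to requests through its cookies argument, so later pages can be fetched without keeping the browser open.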
13. Tab (window) management:
driver.window_handles[1] -- the handle of a tab (indexed in the order the tabs were opened)
driver.switch_to.window(driver.window_handles[0]) -- switch to the specified tab
from selenium import webdriver
import time

# Create the browser object
driver = webdriver.Chrome()
# Load the specified url address
driver.get("https://www.baidu.com")
# Use JavaScript to open a new tab
driver.execute_script('window.open()')
print(driver.window_handles)
# Switch to the second tab and open a url address
driver.switch_to.window(driver.window_handles[1])
driver.get("https://www.taobao.com")
time.sleep(2)
# Switch back to the first tab and open a url address
driver.switch_to.window(driver.window_handles[0])
driver.get("https://www.jd.com")
# driver.close()
14. Exception handling:
Timeout exception:
from selenium import webdriver
from selenium.common.exceptions import TimeoutException, NoSuchElementException
from selenium.webdriver.common.by import By

# Create the browser object
driver = webdriver.Chrome()
try:
    # Load the specified url address
    driver.get("https://www.baidu.com")
except TimeoutException:
    print('Time Out')
Element not found after the page has loaded (continues from the code above):
try:
    # Look up a node that does not exist
    driver.find_element(By.ID, "demo")
except NoSuchElementException:
    print('No Element')
finally:
    # driver.close()
    pass
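When a page load times out, the usual reaction is to try again, but the number of retries should be bounded so that a permanently failing page cannot loop forever. The helper below is a sketch of that idea in plain Python. `retry` and `flaky` are illustrative names, and the built-in TimeoutError stands in for Selenium's TimeoutException.

```python
def retry(func, attempts=3, exceptions=(Exception,)):
    """Call func() up to `attempts` times; re-raise the last error if all fail."""
    for attempt in range(1, attempts + 1):
        try:
            return func()
        except exceptions:
            if attempt == attempts:
                raise
            print("Attempt", attempt, "failed, retrying")


# Demonstration with a function that fails twice before succeeding.
state = {"calls": 0}


def flaky():
    state["calls"] += 1
    if state["calls"] < 3:
        raise TimeoutError("simulated timeout")
    return "page loaded"


result = retry(flaky, attempts=3, exceptions=(TimeoutError,))
print(result)
```

With Selenium, the call would look like retry(lambda: driver.get(url), exceptions=(TimeoutException,)).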
15. Small case
Drive the Chrome browser to visit the Baidu homepage and search for the keyword python
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.common.keys import Keys
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.wait import WebDriverWait

# Initialize a browser (Chrome here, which requires ChromeDriver to be installed)
driver = webdriver.Chrome()
# driver = webdriver.PhantomJS()  # No-interface browser
try:
    # Request the web page
    driver.get("https://www.baidu.com")
    # Find the node whose id is kw (the search input box)
    input = driver.find_element(By.ID, "kw")
    # Simulate typing the search string
    input.send_keys("python")
    # Simulate pressing Enter
    input.send_keys(Keys.ENTER)
    # Explicit wait, up to 10 seconds
    wait = WebDriverWait(driver, 10)
    # Waiting condition: a node whose id is content_left must be loaded within 10 seconds, otherwise an exception is thrown
    wait.until(EC.presence_of_element_located((By.ID, 'content_left')))
    # Output the response information
    print(driver.current_url)
    print(driver.get_cookies())
    print(driver.page_source)
finally:
    # Close the browser
    # driver.close()
    pass
4, Hands-on: crawling Taobao with Selenium
1. Foreword
Taobao uses many anti-crawling measures. With Selenium we can open the page in a real browser and log in by scanning the QR code with a phone.
2. Case requirements
Use Selenium to crawl Taobao product listings for a given keyword and number of pages
3. Case analysis:
url address: https://s.taobao.com/search?q=ipad
4. Code implementation
'''Crawl product data from Taobao by keyword'''
from selenium import webdriver
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.wait import WebDriverWait
from pyquery import PyQuery as pq
from urllib.parse import quote
import json

KEYWORD = "ipad"
MAX_PAGE = 10

# browser = webdriver.Chrome()
# browser = webdriver.PhantomJS()
# Create a Chrome browser object with Headless (no interface) mode enabled
# chrome_options = webdriver.ChromeOptions()
# chrome_options.add_argument('--headless')
# browser = webdriver.Chrome(options=chrome_options)

# A visible browser window is needed here so that you can log in by scanning the QR code with your phone
browser = webdriver.Chrome()
# Explicit wait:
wait = WebDriverWait(browser, 10)


def index_page(page):
    '''Fetch one page of search results
    :param page: page number'''
    print('Crawling page', page)
    try:
        url = 'https://s.taobao.com/search?q=' + quote(KEYWORD)
        browser.get(url)
        if page > 1:
            input = wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, '#mainsrp-pager div.form > input')))
            submit = wait.until(EC.element_to_be_clickable((By.CSS_SELECTOR, '#mainsrp-pager div.form > span.btn.J_Submit')))
            input.clear()
            input.send_keys(str(page))
            submit.click()
        # Wait until the pager shows the current page number and the product nodes are present
        wait.until(EC.text_to_be_present_in_element((By.CSS_SELECTOR, '#mainsrp-pager li.item.active > span'), str(page)))
        wait.until(EC.presence_of_element_located((By.CSS_SELECTOR, '.m-itemlist .items .item')))
        return browser.page_source
    except TimeoutException:
        # Retry on timeout and return the result of the retry
        return index_page(page)


def get_products(html):
    '''Extract product data'''
    doc = pq(html)
    items = doc('#mainsrp-itemlist .items .item').items()
    for item in items:
        yield {
            'image': item.find('.pic .img').attr('data-src'),
            'price': item.find('.price').text(),
            'deal': item.find('.deal-cnt').text(),
            'title': item.find('.title').text(),
            'shop': item.find('.shop').text(),
            'location': item.find('.location').text()
        }


def save_data(result):
    '''Append one record to the file as a line of JSON'''
    with open("./taobao.txt", "a", encoding="utf-8") as f1:
        f1.write(json.dumps(result, ensure_ascii=False) + "\n")


def main():
    '''Traverse every page'''
    for i in range(1, MAX_PAGE + 1):
        html = index_page(i)
        result = get_products(html)
        for item in result:
            save_data(item)
    browser.close()


# Main program entry
if __name__ == '__main__':
    main()
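Because save_data writes one JSON object per line, the file can be read back line by line for later analysis. The sketch below shows the round trip with the standard library only. It writes to a temporary file and uses made-up records, and `load_data` is a hypothetical helper that is not part of the crawler above.

```python
import json
import os
import tempfile


def save_data(result, path):
    """Append one record as a JSON line (same format as taobao.txt)."""
    with open(path, "a", encoding="utf-8") as f:
        f.write(json.dumps(result, ensure_ascii=False) + "\n")


def load_data(path):
    """Read every JSON line back into a list of dicts."""
    with open(path, encoding="utf-8") as f:
        return [json.loads(line) for line in f if line.strip()]


# Round trip through a temporary file with made-up records.
path = os.path.join(tempfile.mkdtemp(), "taobao.txt")
save_data({"title": "iPad 示例", "price": "2999"}, path)
save_data({"title": "iPad mini", "price": "3999"}, path)
records = load_data(path)
print(len(records), records[0]["title"])
```

Appending line by line means an interrupted crawl keeps everything written so far, at the cost of possible duplicates if the same page is crawled twice.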