Personal blog https://blog.fmujie.cn/
Why I wrote this:
Simply put, a library teacher had a project that needed a retrieval system for a company, and the data source had to be crawled from the Internet. The account for the source is very expensive, so buying another one wasn't worth it, and the senior student who might have done it was busy preparing for the postgraduate entrance exam (●'◡●), so the job fell to me. Ha ha! At first I thought scraping would be easy; only once I actually got started (here is a link) did I realize that a Python automation crawler is like playing with cheats enabled: it can drive the browser for you and grab data reliably. But as you can imagine, the more pages you crawl this way, the slower it gets (later I would cry over what I had gotten myself into).
Python knowledge required:
Selenium is a web automation testing tool, originally developed for automated website testing. It is similar in spirit to the key-press macro tools we use for games: it performs actions automatically according to given commands. The difference is that Selenium runs directly in the browser. It supports all mainstream browsers, but it no longer supports PhantomJS; headless Chrome or Firefox is generally used when no visible browser window is needed. Following our instructions, Selenium can make the browser load pages, fetch the data we need, take screenshots of a page, or check whether certain actions have occurred on a site. As an automated testing tool supporting many browsers, it is mainly used in crawling to solve the problem of JavaScript-rendered content. For more information, please click here!!!
```
UserWarning: Selenium support for PhantomJS has been deprecated, please use headless versions of Chrome or Firefox instead
  warnings.warn('Selenium support for PhantomJS has been deprecated, please use headless ')
```
Installation:
```
pip install selenium
```
Install driver
Here are the driver download addresses of several mainstream browsers:
| Browser | Download address |
| --- | --- |
| Chrome | https://sites.google.com/a/chromium.org/chromedriver/downloads |
| Edge | https://developer.microsoft.com/en-us/microsoft-edge/tools/webdriver/ |
| Firefox | https://github.com/mozilla/geckodriver/releases |
| Safari | https://webkit.org/blog/6900/webdriver-support-in-safari-10/ |
I use Windows and the Chrome browser, so I downloaded chromedriver_win32. After extracting it there is a .exe executable; add its directory to the system's PATH environment variable and you are done (this step is optional — the alternative is shown below).
Select page elements:
Selenium's WebDriver provides various methods to find elements. Suppose there is a form input box:
<input type="text" name="user-name" id="passwd-id" />
The singular `find_element_*` methods return one element; the plural `find_elements_*` methods return a list, letting you select multiple elements at once.
```python
# Get the element by its id attribute
element = driver.find_element_by_id("passwd-id")
# Get the element by its name attribute
element = driver.find_element_by_name("user-name")
# Get all elements with the given tag name (returns a list)
elements = driver.find_elements_by_tag_name("input")
# You can also match with XPath (recommended)
element = driver.find_element_by_xpath("//input[@id='passwd-id']")
```
Easy to use:
```python
import time
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC
from selenium.common.exceptions import TimeoutException
from lxml import etree

# Set the path of the chromedriver
driver_path = r"Your chromedriver.exe position"
# Start the Chrome driver (declare the browser object)
driver = webdriver.Chrome(executable_path=driver_path)
# Simply open the Baidu homepage first
driver.get('https://www.baidu.com/')
# Set the width and height of the opened browser window
driver.set_window_size(1500, 1000)
# Page buffer time (implicit wait): guards against slow loading on a bad
# network, which would otherwise make later lookups run on an incomplete page
driver.implicitly_wait(3)
# Page buffer time (explicit wait -> smarter): waits up to 5 seconds for the
# element to appear, then gives up
try:
    element = WebDriverWait(driver, 5).until(
        EC.presence_of_element_located((By.ID, 'kwssss'))
    )
    print(element)
except TimeoutException as ex:
    print(ex)
# Grab the web page source code
html = etree.HTML(driver.page_source)
# Baidu test: grab the input box. send_keys types content into the input;
# wait three seconds, then clear it
# inputTag = driver.find_element_by_id('kw')
# inputTag = driver.find_element_by_name('wd')
# inputTag = driver.find_element_by_class_name('s_ipt')
inputTag = driver.find_element(By.ID, 'kw')
inputTag.send_keys('python')
time.sleep(3)
inputTag.clear()
# You can also grab the search button and click it to practice searching
# (without clearing first)
# submitBtn = driver.find_element(By.ID, 'su')
# submitBtn = driver.find_element_by_xpath("//*[@id='su']")
# submitBtn.click()
```
Mouse action chain:
```python
from selenium.webdriver.common.by import By
from selenium.webdriver.common.action_chains import ActionChains

# Selenium action-chain test (continuing with the driver from above)
inputTag = driver.find_element_by_id('kw')        # grab the element by id
submitBtn = driver.find_element(By.ID, 'su')
actions = ActionChains(driver)                    # create the action chain
actions.move_to_element(inputTag)                 # move the mouse over the input
actions.send_keys_to_element(inputTag, 'Python')  # type in the data
actions.move_to_element(submitBtn)                # hover over the search button
actions.click(submitBtn)                          # click
actions.perform()  # run the chain; nothing above executes without this
```
Get text value:
```python
# According to XPath syntax
html.xpath("XPath route/text()")
```
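As a concrete sketch, with a made-up HTML snippet standing in for `driver.page_source` (the tags and class names here are hypothetical, just to show the `text()` pattern):

```python
from lxml import etree

# Hypothetical page snippet in place of driver.page_source
page_source = ("<html><body><h1 class='title'>Hello</h1>"
               "<p>First</p><p>Second</p></body></html>")
html = etree.HTML(page_source)

# text() at the end of the route extracts the text nodes, returned as a list
title = html.xpath("//h1[@class='title']/text()")
paras = html.xpath("//p/text()")
print(title, paras)
```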
Things to note:
1. If you only want to parse data out of the page, it is recommended to hand the page source to lxml for parsing: lxml is implemented in C underneath, so parsing is more efficient.
2. If you want to perform operations on an element, such as typing a value into a text box or clicking a button, you must locate the element with the methods Selenium provides.
Common form elements: input with type='text/password/email/number'; button and input[type='submit']; checkbox: input[type='checkbox']; select: drop-down list.
##### To be continued
##### Git full project link