First attempt at imitating a book retrieval system in Python

Personal blog https://blog.fmujie.cn/

Motivation:

Simply put, a teacher at the library has a project for a company, and it needs a retrieval system whose data has to be crawled from the Internet. Because an account is very expensive, buying another one was not worth it, so the plan was to find a senior student to do it. The senior is busy preparing for the postgraduate entrance examination (● '◡ ●), so the job came down to me. Ha ha! At first I thought crawling something would be easy, and then I actually started doing it. To me, an automated Python crawler is like running a game bot: it controls the browser for you and grabs data efficiently. I expected it to get a little slow when crawling a lot of pages (later, I could have cried over what I had to do).

Python knowledge required:

  • Selenium:

Selenium is a web automation testing tool, originally developed for automated website testing. It works much like the key wizard macro tools used when playing games: it operates automatically according to the commands it is given. The difference is that Selenium runs directly in the browser. It supports all the mainstream browsers, but PhantomJS is no longer supported; headless Chrome or Firefox is generally used instead when a browser without an interface is needed. Following our instructions, Selenium can make the browser load a page automatically, obtain the required data, take a screenshot of the page, or check whether certain actions on the site have occurred. In crawlers, it is mainly used to solve the problem of JavaScript-rendered pages.
UserWarning: Selenium support for PhantomJS has been deprecated, please use headless versions of Chrome or Firefox instead
warnings.warn('Selenium support for PhantomJS has been deprecated, please use headless '
Installation:
pip install selenium
Install the driver:
Driver download addresses for the mainstream browsers:
Browser: Download address
Chrome https://sites.google.com/a/chromium.org/chromedriver/downloads
Edge https://developer.microsoft.com/en-us/microsoft-edge/tools/webdriver/
Firefox https://github.com/mozilla/geckodriver/releases
Safari https://webkit.org/blog/6900/webdriver-support-in-safari-10/
I use Windows and the Chrome browser, so I downloaded chromedriver_win32. Extracting it gives an .exe executable. Add its path to the system environment variable and it will work (this step is optional, as explained below).
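To check whether the environment variable is set up correctly, a small sketch like the following may help. It only uses the standard library; the function name `find_driver` is a hypothetical helper, not part of Selenium.

```python
import shutil


def find_driver(name="chromedriver"):
    """Return the full path of the driver if it is on PATH, otherwise None."""
    return shutil.which(name)


path = find_driver()
if path:
    print("chromedriver found at:", path)
else:
    print("chromedriver is not on PATH; pass its location explicitly instead")
```

On Windows, `shutil.which` also tries the usual executable extensions, so searching for `chromedriver` should match `chromedriver.exe`.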
Select page elements:
Selenium's WebDriver provides various methods to find elements. Suppose there is a form input box:
<input type="text" name="user-name" id="passwd-id" />
Using elements (plural) in the method name selects multiple elements and returns them as a list.
# Find the element by its id attribute
element = driver.find_element_by_id("passwd-id")
# Find the element by its name attribute
element = driver.find_element_by_name("user-name")
# Find all elements with the given tag name (returns a list)
elements = driver.find_elements_by_tag_name("input")
# You can also match with XPath, which is the recommended usage
element = driver.find_element_by_xpath("//input[@id='passwd-id']")
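Basic usage:
import time

from lxml import etree
from selenium import webdriver
from selenium.common.exceptions import TimeoutException
from selenium.webdriver.common.by import By
from selenium.webdriver.support import expected_conditions as EC
from selenium.webdriver.support.ui import WebDriverWait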

# Set the path to the Chrome driver
driver_path = r"Your chromedriver.exe position"

# Start Chrome through the driver (create the browser object)
driver = webdriver.Chrome(executable_path=driver_path)

# Open the Baidu homepage first
driver.get('https://www.baidu.com/')
# Set the width and height of the browser window
driver.set_window_size(1500, 1000)

# Implicit wait: element lookups keep retrying for up to 3 seconds, in case a slow network means the page has not finished loading
driver.implicitly_wait(3)

# Explicit wait (smarter): wait up to 5 seconds for one specific element; a TimeoutException is raised if it does not appear
try:
    element = WebDriverWait(driver, 5).until(
        EC.presence_of_element_located((By.ID, 'kwssss'))
    )
    print(element)
except TimeoutException as ex:
    print(ex)
# Parse the page source with lxml
html = etree.HTML(driver.page_source)
# Baidu test: grab the search input box. send_keys types the text into the input; wait three seconds, then clear it
# inputTag = driver.find_element_by_id('kw')
# inputTag = driver.find_element_by_name('wd')
# inputTag = driver.find_element_by_class_name('s_ipt')
inputTag = driver.find_element(By.ID, 'kw')
inputTag.send_keys('python')
time.sleep(3)
inputTag.clear()
# Instead of clearing, you can also grab the search button and click it to run the search
# submitBtn = driver.find_element(By.ID, 'su')
# submitBtn = driver.find_element_by_xpath("//*[@id='su']")
# submitBtn.click()
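To make the explicit wait in the example above more concrete, here is a simplified stand-in for the idea behind it: keep checking a condition until it returns something truthy or the time runs out. This is not Selenium's implementation; `wait_until`, its parameters, and the toy condition are all hypothetical, and it raises the built-in `TimeoutError` rather than Selenium's `TimeoutException`.

```python
import time


def wait_until(condition, timeout=5.0, poll=0.5):
    """Call condition() repeatedly until it returns a truthy value.

    Returns that value, or raises TimeoutError once `timeout` seconds pass.
    """
    deadline = time.monotonic() + timeout
    while True:
        result = condition()
        if result:
            return result
        if time.monotonic() >= deadline:
            raise TimeoutError("condition not met within %.1f s" % timeout)
        time.sleep(poll)


# Toy condition: becomes true on the third check, like an element that loads late.
attempts = {"count": 0}


def element_loaded():
    attempts["count"] += 1
    return "element" if attempts["count"] >= 3 else None


print(wait_until(element_loaded, timeout=2.0, poll=0.1))
```

Unlike a fixed `time.sleep(3)`, this returns as soon as the condition holds, which is why an explicit wait is usually the better fit for pages that load at unpredictable speeds.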
Mouse action chain:
# Selenium action chain test
from selenium.webdriver.common.action_chains import ActionChains
from selenium.webdriver.common.by import By

inputTag = driver.find_element(By.ID, 'kw')   # Grab the input box by ID
submitBtn = driver.find_element(By.ID, 'su')   # Grab the search button by ID
actions = ActionChains(driver)   # Create the action chain
actions.move_to_element(inputTag)   # Move the mouse over the input box
actions.send_keys_to_element(inputTag, 'Python')   # Type the text
actions.move_to_element(submitBtn)   # Move the mouse over the search button
actions.click(submitBtn)   # Click
actions.perform()   # Run the chain; nothing above executes until this call
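Why nothing happens until `perform()` is called can be shown with a toy model: each method only queues an action, and `perform()` runs the queue in order. The class below is a made-up illustration (`MiniActionChain` and its methods are hypothetical names), not how Selenium is actually implemented.

```python
class MiniActionChain:
    """Toy model of a deferred action chain (not Selenium's implementation)."""

    def __init__(self):
        self._queue = []   # actions waiting to run
        self.log = []      # what has actually been executed

    def move_to(self, target):
        self._queue.append(lambda: self.log.append("move to " + target))
        return self        # returning self allows chained calls

    def click(self, target):
        self._queue.append(lambda: self.log.append("click " + target))
        return self

    def perform(self):
        """Run every queued action in order, then empty the queue."""
        for action in self._queue:
            action()
        self._queue.clear()


chain = MiniActionChain().move_to("input").move_to("button").click("button")
print(chain.log)   # nothing has run yet
chain.perform()
print(chain.log)   # the queued actions have now run in order
```

Selenium's `ActionChains` methods also return the chain object, so the real calls above can be written as one chained expression ending in `.perform()`.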
Get text value:
# Use XPath syntax; "Xpath Route" stands for the XPath of the target element
html.xpath("Xpath Route/text()")
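As a small worked example of the `/text()` pattern, the sketch below parses a hard-coded HTML string in place of `driver.page_source`. The markup, class names, and book entries are invented for illustration and are not taken from the real target site.

```python
from lxml import etree

# Stand-in for driver.page_source; the markup and class names are invented.
page_source = """
<html><body>
  <div class="book">
    <h2 class="title">Learning Python</h2>
    <span class="author">Mark Lutz</span>
  </div>
  <div class="book">
    <h2 class="title">Fluent Python</h2>
    <span class="author">Luciano Ramalho</span>
  </div>
</body></html>
"""

html = etree.HTML(page_source)

# xpath() always returns a list; /text() selects the text nodes of the matches.
titles = html.xpath("//div[@class='book']/h2[@class='title']/text()")
authors = html.xpath("//div[@class='book']/span[@class='author']/text()")

for title, author in zip(titles, authors):
    print(title, "-", author)
```

Because the result is always a list, code that expects a single value usually takes the first item and should handle the empty-list case when nothing matches.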
Notes:
1. If you only want to parse data out of the page, it is recommended to hand the page source to lxml for parsing. lxml is implemented in C underneath, so parsing is more efficient.
2. If you want to operate on an element, for example typing a value into a text box or clicking a button, you must find the element with the methods Selenium provides.
Common form elements: input with type='text/password/email/number'; button and input[type='submit']; checkbox: input[type='checkbox']; select: drop-down list.
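For the read-only case in note 1, the form elements listed above can be picked out with XPath attribute tests. The sketch below is illustrative only: apart from the first input, which reuses the earlier example, the form markup and field names are invented.

```python
from lxml import etree

# Invented form markup; only the first input comes from the earlier example.
form_html = """
<form>
  <input type="text" name="user-name" id="passwd-id" />
  <input type="password" name="pwd" />
  <input type="checkbox" name="remember" checked="checked" />
  <select name="category">
    <option value="cs">Computer Science</option>
    <option value="lit">Literature</option>
  </select>
  <input type="submit" value="Search" />
</form>
"""

form = etree.HTML(form_html)

# /@name and /@value return attribute values; /text() returns text nodes.
text_inputs = form.xpath("//input[@type='text' or @type='password']/@name")
checkboxes = form.xpath("//input[@type='checkbox']/@name")
options = form.xpath("//select[@name='category']/option/text()")
submit_label = form.xpath("//input[@type='submit']/@value")

print(text_inputs, checkboxes, options, submit_label)
```

This only reads the form. Typing into a field, ticking the checkbox, or choosing an option still has to go through elements found with Selenium, as note 2 says.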

##### To be continued
##### Full project link (git)

Tags: Selenium Python Firefox Javascript

Posted on Fri, 17 Jan 2020 10:29:41 -0500 by depsipher