Python crawler: automatic answers on Baidu Knows

  • This is my first Python crawler project and I have only just started learning, so please point out any mistakes.

Answering Baidu Knows questions automatically

Function

  • Visit Baidu Knows and you will see many new questions.
  • In fact, many of these questions have already been explained elsewhere, or ready-made answers can be found online.
  • So the project serves two purposes: it helps the people asking questions, and it gives me some Python practice.
  • The main function of the project: for each new question on Baidu Knows, the program searches the web for a best answer. If one is found, it posts the answer; if not, it skips the question.

Implementation ideas

  1. On Baidu Knows, get the question list and collect the link of every question, then traverse these links to open each question details page in turn.
  2. Read the question text on the question details page. It is used to search for relevant answers.
  3. The search URL has a fixed form. Replacing the text after word= in the address searches for answers to a given question.
  4. From the search results, collect the links in the answer list and visit each answer details page. If a page contains a best answer, take its content and stop traversing the answer list.
  5. Write the answer into the answer text box on the question details page and click the submit button to complete the answer.
    Note: the whole process assumes the user is already logged in. Baidu's current login verification asks the user to slide a picture into the correct position. I have not automated that yet, so the login that produces the cookie is done by hand.
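
Step 3 can be sketched as a small helper. This is only an illustration: the function name build_search_url is hypothetical, and percent-encoding the question with urllib.parse.quote is my assumption about a safe way to put arbitrary text into the address (the base URL is the one used later in this post).

```python
from urllib.parse import quote

# Fixed part of the search address; only the text after "word=" changes.
SEARCH_URL = "https://zhidao.baidu.com/search?&word="


def build_search_url(question: str) -> str:
    """Return the search URL for a question (hypothetical helper).

    The question is stripped and percent-encoded as UTF-8 so that spaces,
    quotes and non-ASCII characters cannot break the address.
    """
    return SEARCH_URL + quote(question.strip())


if __name__ == "__main__":
    print(build_search_url("python selenium"))
```

Encoding the text also keeps a question that contains a double quote from breaking the window.open("...") JavaScript string the URL is later pasted into.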

Specific implementation steps

  • Get the cookies, then use them to log in
from selenium import webdriver
from selenium.webdriver.common.by import By
import json, time

driver = webdriver.Chrome()
driver.get('https://baidu.com/')
# Open the login dialog, then the account/password form
driver.find_element(By.XPATH, '//*[@id="u1"]/a[7]').click()
time.sleep(2)
driver.find_element(By.XPATH, '//*[@id="TANGRAM__PSP_10__footerULoginBtn"]').click()
time.sleep(2)
# Fill in the account name and password, then submit
driver.find_element(By.XPATH, '//*[@id="TANGRAM__PSP_10__userName"]').send_keys('XXX')
driver.find_element(By.XPATH, '//*[@id="TANGRAM__PSP_10__password"]').send_keys('XXXXX')
driver.find_element(By.XPATH, '//*[@id="TANGRAM__PSP_10__submit"]').click()
# Leave 30 seconds to complete Baidu's verification by hand
time.sleep(30)

# Save the cookies of the logged-in session for later runs
cookies = driver.get_cookies()
with open('cookie.txt', 'w') as f1:
    json.dump(cookies, f1)
  • In the code above, XXX and XXXXX stand for your own Baidu account name and password. Running the program opens a browser window. During login Baidu asks for verification (an SMS code, a slider, and so on), and this step is done by hand. Afterwards the program saves the cookies of the logged-in browser session to cookie.txt, to be used for the next login.
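
Reading the saved cookies back can be sketched as follows. The helper name load_cookies is hypothetical. Two things here are my own assumptions rather than part of the original project: that some driver versions reject a non-integer expiry value, and that cookies whose expiry has already passed are not worth sending.

```python
import json
import time


def load_cookies(path, now=None):
    """Load cookies written with json.dump(driver.get_cookies(), f).

    Hypothetical helper: casts 'expiry' to int and drops cookies that
    have already expired relative to `now` (defaults to the current time).
    """
    if now is None:
        now = time.time()
    with open(path, encoding="utf-8") as f:
        cookies = json.load(f)

    cleaned = []
    for cookie in cookies:
        cookie = dict(cookie)  # copy, so the loaded data is not mutated
        if "expiry" in cookie:
            cookie["expiry"] = int(cookie["expiry"])
            if cookie["expiry"] < now:
                continue  # expired, skip it
        cleaned.append(cookie)
    return cleaned


# Usage (inside the answering script, after driver.get(url)):
#   for c in load_cookies('cookie.txt'):
#       driver.add_cookie(c)
```

If every cookie comes back expired, the list is empty and the manual login has to be repeated.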
  • With the login information in hand, the automatic answering can be implemented. The whole process involves four pages: the Baidu Knows question list page, the question details page, the answer search page, and the answer details page.
  • On the question list page, each question is rendered as an <a> tag whose class attribute is title-link, as shown in the figure below. Selenium can therefore locate elements by this class, collect the <a> tags of all questions, and traverse them to extract the link addresses.
  • Each question link is opened in a new window and leads to the corresponding question details page. There, the program first checks whether the question has already been answered. If it has not, the program searches Baidu for relevant answers, picks the best answer among them, writes it into the answer input box, and clicks the "submit answer" button. If the question has already been answered, the program closes the current window and returns to the question list to process the next question. The HTML of the answer input box and the "submit answer" button on the question details page is shown in the figure below.
  • Answering a question involves two more pages: the answer search page and the answer details page. The answer search page searches for relevant answers in a new window, based on the question text. Each result is a <dt> tag containing an <a> tag. Selenium locates the <a> tag of each result and reads its href attribute, which leads to the answer details page.
  • The answer details page is opened in a new window. Not every answer details page has a best answer. Inspection shows that the best answer has the class attribute value best-text mb-10. If Selenium can locate an element with these classes, the current page has a best answer; otherwise it does not.
  • Given the element locations and the answering logic above, the program must handle the switching between page windows carefully. If the window switching is not rigorous, the program fails easily. It also has to handle exceptional cases, such as a question with no answer to be found, a question that has already been answered, and slow network responses. Putting all this together, the code for automatic answering is as follows:
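
One detail worth a sketch: a class attribute such as best-text mb-10 holds two class names, and Selenium's class-name locator accepts only a single name. Matching all of them needs a CSS selector instead. The helper below (class_to_css is a hypothetical name) shows the conversion, assuming plain class names that need no CSS escaping.

```python
def class_to_css(class_attr: str) -> str:
    """Turn a class attribute value into a CSS selector matching all classes.

    Hypothetical helper; assumes plain class names without characters
    that would need escaping in CSS.
    """
    names = class_attr.split()
    if not names:
        raise ValueError("empty class attribute")
    return "".join("." + name for name in names)


# Usage with Selenium (By comes from selenium.webdriver.common.by):
#   found = driver.find_elements(By.CSS_SELECTOR, class_to_css('best-text mb-10'))
#   has_best_answer = len(found) > 0
```

find_elements returns an empty list when nothing matches, so the presence check needs no exception handling.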
from selenium import webdriver
from selenium.webdriver.common.by import By
from urllib.parse import quote
import json, time

url = 'https://zhidao.baidu.com/list?cid=110'
driver = webdriver.Chrome()
driver.get(url)
# Log in using the saved cookies
driver.delete_all_cookies()
with open('cookie.txt') as f1:
    cookie = json.load(f1)
for c in cookie:
    driver.add_cookie(c)
driver.refresh()

title_link = driver.find_elements(By.CLASS_NAME, 'title-link')
print(title_link)
time.sleep(5)
for link in title_link:
    # Open the question details page in a new window and switch to it
    driver.switch_to.window(driver.window_handles[0])
    href = link.get_attribute('href')
    driver.execute_script('window.open("%s");' % href)
    time.sleep(5)
    driver.switch_to.window(driver.window_handles[1])
    try:
        # The editor iframe is present only if the question can still be answered;
        # if it is missing, this raises and the question is skipped
        driver.find_element(By.ID, 'ueditor_0')
        # Get the question text and search for answers
        title = driver.find_element(By.CLASS_NAME, 'ask-title').text
        title_url = 'https://zhidao.baidu.com/search?&word=' + quote(title)
        driver.execute_script('window.open("%s");' % title_url)
        time.sleep(5)
        driver.switch_to.window(driver.window_handles[2])
        # Get the answer list (compound class, so a CSS selector is needed)
        answer_list = driver.find_elements(By.CSS_SELECTOR, '.dt.mb-4.line')
        for k in answer_list:
            # Open the answer details page
            href = k.find_element(By.TAG_NAME, 'a').get_attribute('href')
            driver.execute_script('window.open("%s");' % href)
            time.sleep(5)
            driver.switch_to.window(driver.window_handles[3])
            # Get the best answer, if the page has one
            try:
                text = driver.find_element(By.CSS_SELECTOR, '.best-text.mb-10').text
            except Exception:
                text = ''
            finally:
                # Close the answer details window and go back to the search results
                driver.close()
                driver.switch_to.window(driver.window_handles[2])
            # The answer is not empty
            if text:
                # Close the search results window
                driver.close()
                # Write the answer into the editor iframe on the question details page
                driver.switch_to.window(driver.window_handles[1])
                driver.switch_to.frame('ueditor_0')
                body = driver.find_element(By.XPATH, '/html/body')
                body.click()
                body.send_keys(text)
                # Switch back from the iframe to the main page
                driver.switch_to.default_content()
                # Click the submit answer button
                driver.find_element(By.XPATH, '//*[@id="answer-editor"]/div[2]/a').click()
                time.sleep(5)
                break
    except Exception as err:
        print(err)
    finally:
        # Close every window except the question list page, whether or not
        # an answer was submitted, so the handle indexes stay valid
        for handle in driver.window_handles[1:]:
            driver.switch_to.window(handle)
            driver.close()
        driver.switch_to.window(driver.window_handles[0])
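
The code above relies on fixed time.sleep(5) pauses and on the position of each window in driver.window_handles. A more defensive approach, sketched here under my own assumptions, is to remember the handles before opening a window and then poll until a new one appears. The helper name wait_for_new_window is hypothetical; it only uses driver.window_handles, which Selenium provides.

```python
import time


def wait_for_new_window(driver, known_handles, timeout=10.0, poll=0.2):
    """Return the handle of a window that is not in known_handles.

    Hypothetical helper: polls driver.window_handles until a new handle
    shows up, and raises TimeoutError if none appears within `timeout`
    seconds.
    """
    known = set(known_handles)
    deadline = time.monotonic() + timeout
    while True:
        for handle in driver.window_handles:
            if handle not in known:
                return handle
        if time.monotonic() >= deadline:
            raise TimeoutError("no new window appeared within %.1f s" % timeout)
        time.sleep(poll)


# Usage (inside the main loop):
#   before = driver.window_handles
#   driver.execute_script('window.open(arguments[0]);', href)
#   driver.switch_to.window(wait_for_new_window(driver, before))
```

Switching by the returned handle avoids guessing which index the new window received, and the timeout turns a slow network response into an ordinary exception that the existing except branch can handle.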

  • GitHub address of the project: https://github.com/zhangyi-13572252156/baidu-knowledge-automatic-answer/tree/master
  • Questions and discussion are welcome. QQ: 1251108673

Tags: Attribute Selenium JSON Python

Posted on Thu, 16 Jan 2020 10:52:03 -0500 by vombomin