[Data Acquisition and Fusion Technology] Operation 5

Operation ①:

  • requirement:

    • Master Selenium: locating HTML elements, crawling Ajax-loaded page data, waiting for HTML elements, and so on.
    • Use the Selenium framework to crawl the information and pictures of selected commodities in Jingdong Mall (JD).
  • Candidate sites: http://www.jd.com/

  • Key words: chosen freely by the student

  • Output information: MySQL output is as follows

    mNo mMark mPrice mNote mFile
    000001 Samsung Galaxy 9199.00 Samsung Galaxy Note20 Ultra 5G 000001.jpg

1. Ideas and codes

1.1 code link:

5/01.py · data acquisition and fusion - Code cloud - Open Source China (gitee.com)

1.2 web page analysis and key codes:

Since the Selenium framework simulates a real user visiting the site, first locate the search box; it can be found by id="key":

keyinput = self.driver.find_element_by_id("key")

Then type the keywords we want to search and simulate pressing Enter on the keyboard, which jumps straight to the results page, so there is no need to locate the search button and click it.


Since page loading takes time, pause for 10 seconds to wait for the page to load; sleep is also required in several other places. (An explicit wait such as WebDriverWait would be more robust than a fixed sleep.)


By analyzing the product page, we can see that each product item sits under an li tag, so locate the li tags first.

Then parse each li tag to extract the title, picture and price. In general the brand is the first word of the title, so it can be pulled out with split.

# lis is the list of product li elements located in the previous step
for li in lis:
    # Some fields may be missing on a card, so fall back to a default when lookup fails
    try:
        src1 = li.find_element_by_xpath(".//div[@class='p-img']//a//img").get_attribute("src")
    except Exception:
        src1 = ""
    try:
        src2 = li.find_element_by_xpath(".//div[@class='p-img']//a//img").get_attribute("data-lazy-img")
    except Exception:
        src2 = ""
    try:
        price = li.find_element_by_xpath(".//div[@class='p-price']//i").text
    except Exception:
        price = "0"

    note = li.find_element_by_xpath(".//div[@class='p-name p-name-type-2']//em").text
    mark = note.split(" ")[0]
    mark = mark.replace("Love Dongdong\n", "")
    mark = mark.replace(",", "")
    note = note.replace("Love Dongdong\n", "")
    note = note.replace(",", "")
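
To illustrate the split-based brand extraction on a plain string (the product title below is a made-up example):

```python
# Hypothetical product title; the brand is taken as the first space-separated word
note = "Samsung Galaxy Note20 Ultra 5G, dual SIM"
mark = note.split(" ")[0].replace(",", "")  # strip commas as the crawler does
note = note.replace(",", "")
print(mark)  # Samsung
```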

Process the picture links:

if src1:
    src1 = urllib.parse.urljoin(self.driver.current_url, src1)  # urljoin is documented in urllib.parse
    p = src1.rfind(".")
    mFile = no + src1[p:]  # product number plus the file extension
elif src2:
    src2 = urllib.parse.urljoin(self.driver.current_url, src2)
    p = src2.rfind(".")
    mFile = no + src2[p:]
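
A quick self-contained check of the URL-joining and extension logic (the URLs here are made up for illustration):

```python
from urllib.parse import urljoin

# A scheme-relative image src, as JD often serves, joined against the current page URL
src1 = urljoin("https://search.jd.com/Search?keyword=phone",
               "//img10.360buyimg.com/n7/000001.jpg")
p = src1.rfind(".")          # position of the last dot, i.e. the extension
mFile = "000001" + src1[p:]  # product number plus extension
print(src1, mFile)
```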

Download the pictures in multiple threads:

if src1 or src2:
    T = threading.Thread(target=self.downloadDB, args=(src1, src2, mFile))
    T.start()  # the thread must be started, or nothing is downloaded
    mFile = ""
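
Note that a created thread must also be started, and ideally joined before the program exits. A minimal self-contained sketch of this pattern, with a stub worker standing in for the real downloadDB:

```python
import threading

results = []

def download_stub(src1, src2, mFile):
    # Stand-in for the real image download: record which URL would be fetched
    results.append((src1 or src2, mFile))

threads = []
for args in [("http://example.com/1.jpg", "", "000001.jpg"),
             ("", "http://example.com/2.png", "000002.png")]:
    t = threading.Thread(target=download_stub, args=args)
    t.start()
    threads.append(t)

for t in threads:
    t.join()  # wait for every download thread to finish

print(sorted(results))
```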

Insert into the database:

sql = "insert into phones (mNo,mMark,mPrice,mNote,mFile) values (%s,%s,%s,%s,%s)"  # pymysql uses %s placeholders, not ?
self.cursor.execute(sql, (mNo, mMark, mPrice, mNote, mFile))

1.3 results

2. Experience

  1. Previously, when crawling with Selenium, I would first fill in the search box and then locate the search button and simulate a click. While redoing this task I found that simply simulating the keyboard with keyinput.send_keys(Keys.ENTER) achieves the same effect much more neatly.

  2. When extracting the brand, some irrelevant words remain (as shown in the figure below) and need further processing, for example by entering the product detail page to crawl the brand (which would take much more time).

Operation ②:

  • requirement:

    • Be proficient with Selenium for locating HTML elements, simulating user login, crawling Ajax-loaded page data, waiting for HTML elements, etc.
    • Use the Selenium framework + MySQL to simulate logging in to the China MOOC site, obtain the information of the courses in the student's own account, save it to MySQL (course number, course name, teaching unit, teaching progress, course status and course picture address), and store the pictures in the imgs folder under the project root, naming each picture after its course.
  • Candidate website: China mooc website: https://www.icourse163.org

  • Output information: MYSQL database storage and output format

    The headers should be named in English, for example: course number Id, course name cCourse...; students define and design the headers themselves:

    Id cCourse cCollege cSchedule cCourseStatus cImgUrl
    1 Python web crawler and information extraction Beijing University of Technology 3 / 18 class hours learned Completed on May 18, 2021 http://edu-image.nosdn.127.net/C0AB6FA791150F0DFC0946B9A01C8CB2.jpg

1. Ideas and codes

1.1 code link:

5/02.py · data acquisition and fusion - Code cloud - Open Source China (gitee.com)

1.2 web page analysis and key codes:

Start the driver and send the request:

import time
import threading  # used later for the picture downloads
from selenium import webdriver
from selenium.webdriver.common.by import By
import pymysql

driver = webdriver.Chrome()

To crawl personal course information you must log in first. Since logging in requires entering a password or manually typing a verification code, it is easier to log in by scanning the QR code.

First locate the login button and click it, then wait for the QR-code scan to complete:

driver.find_element(By.XPATH, "//div[@class='unlogin']").click()
time.sleep(20)  # Waiting for code scanning login

After a successful login, locate the personal-center button and click it to jump there:

driver.find_element(By.XPATH, "//div[@class='ga-click u-navLogin-myCourse u-navLogin-center-container']/a").click()

Each course sits under a div tag, beneath which we can find all the information we need to crawl:

title = driver.find_elements(By.XPATH, '//div[@class="course-card-wrapper"]//div[@class="body"]//span[@class="text"]')
school = driver.find_elements(By.XPATH,'//div[@class="course-card-wrapper"]//div[@class="body"]//div[@class="school"]/a')
learn = driver.find_elements(By.XPATH, '//div[@class="course-card-wrapper"]//div[@class="body"]//div[')  # XPath truncated in the original post; see the linked code for the full expression
status = driver.find_elements(By.XPATH, '//div[@class="course-card-wrapper"]//div[@class="body"]//div[@class="course-status"]')
url = driver.find_elements(By.XPATH, '//div[@class="course-card-wrapper"]//div[@class="img"]/img')

Finally, the results are stored in the database

con = pymysql.connect(host='localhost', user='root', password='123456', charset="utf8", database='DATA_acquisition')
cursor = con.cursor()
for i in range(len(title)):
    cursor.execute("insert into mooc values(%s,%s,%s,%s,%s)",
                   (title[i].text, school[i].text, learn[i].text, status[i].text, url[i].get_attribute('src')))
con.commit()  # pymysql does not autocommit by default

Read the course name and picture address from the database and download the pictures with multiple threads:

cursor.execute("SELECT url,name FROM mooc")
rows = cursor.fetchall()
threads = []
for row in rows:
    T = threading.Thread(target=downloadPic, args=(row[0], row[1]))
    T.start()
    threads.append(T)
for t in threads:
    t.join()  # wait for all downloads to finish

1.3 results

2. Experience

The login step bothered me for a long time. I kept trying to make the machine log in automatically, having Selenium recognize the verification code and pass the human-machine check, but in the end I still logged in by scanning the QR code manually. Although Selenium can simulate human behavior, it does not have human intelligence after all.

Operation ③: Flume log collection experiment

  • requirement:

    • Master big-data-related services and be familiar with the use of Xshell.
    • Complete the following five tasks from the document "Huawei Cloud Big Data Real-Time Analysis and Processing Experiment Manual – Flume Log Collection Experiment (part) v2.docx"; see the document for the specific operations.

    • Environment construction:

      • Task 1: Open the MapReduce service
    • Real-time analysis and development practice:

      • Task 1: Generate test data with a Python script
      • Task 2: Configure Kafka
      • Task 3: Install the Flume client
      • Task 4: Configure Flume to collect data

1. Steps

  • Task 1: Generate test data with a Python script

  • Task 2: Configure Kafka

  • Task 3: Install the Flume client

  • Task 4: Configure Flume to collect data
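
For Task 4, a Flume agent configuration roughly like the following ties the pieces together (the agent name, file path, broker address and topic name are assumptions for illustration; the manual gives the exact values):

```properties
# Tail the generated log file and forward events to a Kafka topic
agent.sources = s1
agent.channels = c1
agent.sinks = k1

agent.sources.s1.type = exec
agent.sources.s1.command = tail -F /tmp/test_data.log
agent.sources.s1.channels = c1

agent.channels.c1.type = memory
agent.channels.c1.capacity = 1000

agent.sinks.k1.type = org.apache.flume.sink.kafka.KafkaSink
agent.sinks.k1.kafka.bootstrap.servers = kafka-host:9092
agent.sinks.k1.kafka.topic = test_topic
agent.sinks.k1.channel = c1
```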
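
As a sketch of Task 1, a script along these lines can generate test records (the field layout here — user id, score, timestamp — is an assumption for illustration; the experiment manual specifies the real format):

```python
import random
import time

def gen_records(n):
    """Generate n comma-separated test records: user_id,score,timestamp."""
    records = []
    for _ in range(n):
        user_id = "user%03d" % random.randint(0, 99)
        score = random.randint(0, 100)
        ts = int(time.time())
        records.append("%s,%d,%d" % (user_id, score, ts))
    return records

lines = gen_records(5)
for line in lines:
    print(line)
```

In the experiment these lines would be appended to a log file that Flume tails.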

2. Experience

I learned how to use Flume to collect real-time streaming data from the front end, which makes later data processing and visualization easier; it forms part of the data pipeline in real-time streaming scenarios.

Through the experiments in this chapter I have partially mastered big-data acquisition in real-time scenarios.

Posted on Mon, 06 Dec 2021 16:00:25 -0500 by soto134