Fifth Major Assignment in Data Acquisition

Assignment 1:
Familiarize yourself with Selenium's ability to locate HTML elements, crawl Ajax-loaded web data, wait for HTML elements, etc.
Use the Selenium framework to crawl product information and images for a chosen category of goods on JD Mall (Jingdong).
Candidate site:
Keyword: students may choose freely
Output information: the MySQL output is as follows:

mNo     mMark           mPrice   mNote                               mFile
000001  Samsung Galaxy  9199.00  Samsung Galaxy Note20 Ultra 5G...   000001.jpg
Solution steps (code reproduction for this assignment)
The main thing to note in this assignment is a special case encountered while crawling:
                    try:
                        note = li.find_element_by_xpath(".//div[@class='p-name p-name-type-2']//em").text
                        mark = note.split(" ")[0]
                        # some listings are prefixed with "京东\n" ("JD"); strip it,
                        # and drop commas so they do not break the insert
                        mark = mark.replace("京东\n", "").replace(",", "")
                        note = note.replace("京东\n", "").replace(",", "")
                    except Exception:
                        # element missing: fall back to empty fields
                        note = ""
                        mark = ""

Sometimes irrelevant text appears in the scraped fields and needs to be removed or replaced.
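That cleanup can be factored into a small helper; a minimal sketch (the junk strings and the sample title are assumptions for illustration):

```python
def clean_field(text, junk=("京东\n", ",")):
    """Remove known irrelevant substrings (e.g. the JD prefix) from a scraped field."""
    for token in junk:
        text = text.replace(token, "")
    return text.strip()

title = clean_field("京东\nSamsung Galaxy Note20 Ultra 5G, 12GB+256GB")
mark = title.split(" ")[0]  # brand is taken as the first word of the cleaned title
```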

  • Run result:
    1) End result

    2) Database Results

    3) Folder Picture Save Results
  • Experimentation Experience
    1) Became familiar with using Selenium to crawl data, write to a database, and download images
    2) Strengthened awareness of handling special text
    Code address: Fifth Assignment/Topic 1
    Assignment 2:
    Familiarize yourself with Selenium's ability to locate HTML elements, simulate user login, crawl Ajax-loaded web page data, and wait for HTML elements.
    Use the Selenium framework + MySQL to simulate logging in to the MOOC site. From the student's own account, get information about the courses already studied and save it in MySQL (course number, course name, teaching unit, teaching progress, course status, course picture address), and store the pictures in the imgs folder under the root directory of the local project, naming each picture after its course.
    Candidate website: China MOOC network:
    Output information: MySQL database storage and output format
    The headers should be named in English, for example cCourse for the course name; students should design the headers themselves:
Id  cCourse                                        cCollege                         cSchedule                 cCourseStatus       cImgUrl
1   Python Web Crawler and Information Extraction  Beijing Institute of Technology  Learned 3/18 class hours  Ended May 18, 2021

Solution steps:
STEP 1: Create the mooc database and the table mc in MySQL, then establish the database connection
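The table from STEP 1 might be created like this; a sketch, with the column types assumed from the sample row:

```sql
CREATE DATABASE IF NOT EXISTS mooc;
USE mooc;
CREATE TABLE mc (
    Id            INT PRIMARY KEY,
    cCourse       VARCHAR(256),
    cCollege      VARCHAR(128),
    cSchedule     VARCHAR(64),
    cCourseStatus VARCHAR(64),
    cImgUrl       VARCHAR(512)
);
```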

STEP 2: Simulate logging in to the MOOC site with the Chrome browser

driver = webdriver.Chrome()
# the login form is a separate HTML document nested in an iframe; switch into it first
driver.switch_to.frame(driver.find_elements_by_tag_name("iframe")[1])
idinput = driver.find_element_by_xpath('//*[@id="phoneipt"]')
idinput.send_keys("18133010322")
passwdinput = driver.find_element_by_xpath('//*[@class="j-inputtext dlemail"]')
passwdinput.send_keys("********")  # password omitted here

STEP 3: Go to the "My Courses" page, find the XPath of each target attribute, and start crawling

cCourse_list = driver.find_elements_by_xpath('//*[@class="course-card-wrapper"]/div/a/div[@class="img"]/img')
cCollege_list = driver.find_elements_by_xpath('//*[@class="course-card-wrapper"]/div/a/div[@class="body"]/div[1]/div[@class="school"]/a')
cSchedule_list = driver.find_elements_by_xpath('//div[@class="course-card-wrapper"]//div[@class="box"]//div[@class="body"]//div[@class="text"]//a/span')
cCourseStatus_list = driver.find_elements_by_xpath('//*[@class="course-card-wrapper"]/div/a/div[@class="body"]/div[2]/div[@class="course-status"]')
img_url = driver.find_elements_by_xpath('//*[@class="course-card-wrapper"]/div/a/div[@class="img"]/img')
for i in range(len(cCourse_list)):
    cCourse = cCourse_list[i].get_attribute('alt')
    cCollege = cCollege_list[i].text
    cSchedule = cSchedule_list[i].text
    cCourseStatus = cCourseStatus_list[i].text
    cImgUrl = img_url[i].get_attribute('src')
    cursor.execute("insert into mc values (%s,%s,%s,%s,%s,%s)",
                   (i + 1, cCourse, cCollege, cSchedule, cCourseStatus, cImgUrl))
# remember to commit on the database connection after the loop so the inserts persist
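The assignment also asks that each course image be saved under imgs/ and named after the course. A sketch of building the local path (the character-sanitizing rule is an assumption), with the actual download left to urlretrieve:

```python
import os
import re
from urllib.request import urlretrieve

def course_img_path(course_name, folder="imgs"):
    """Build a local path under imgs/ named after the course,
    replacing characters that are invalid in filenames."""
    safe = re.sub(r'[\\/:*?"<>|]', "_", course_name)
    return os.path.join(folder, safe + ".jpg")

# os.makedirs("imgs", exist_ok=True)
# urlretrieve(cImgUrl, course_img_path(cCourse))  # download the course image
```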

Run result:
1) End result:

2) Database results:

3) Picture save result:

  • Experiment experience:
    1) Selenium simulates a user operating the page, which is intuitive and easy to understand; became familiar with the Selenium locate -> click -> crawl workflow
    2) Learned that the login page contains an iframe (a document within the document); to locate elements inside it, you must first switch into the iframe
    Code address: Fifth Assignment/Topic 2
    Assignment 3: Flume log collection experiment

Requirements: Master big-data-related services and become familiar with using Xshell.
Complete the tasks in the document "Huawei Cloud Big Data Real-time Analysis and Processing Experiment Manual - Flume Log Collection Experiment (part) v2.docx", i.e., the following five tasks; see the document for the specific operations.
Environment Setup
Task 1: Open the MapReduce service
Real-time analysis development practice:
Task 1: Python scripts generate test data
Task 2: Configure Kafka
Task 3: Install the Flume client
Task 4: Configure Flume to collect data
Process Screenshot:
Task 1: Python scripts generate test data
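The test-data script can be sketched roughly as follows (the field names, record format, and output path are assumptions, not the manual's exact script):

```python
import random
import time

ACTIONS = ["login", "view", "purchase", "logout"]

def gen_record():
    """One fake comma-separated log record: timestamp, user id, action."""
    return "%s,user_%04d,%s" % (
        time.strftime("%Y-%m-%d %H:%M:%S"),
        random.randint(1, 1000),
        random.choice(ACTIONS),
    )

# append a batch of records to the file Flume will watch (path is hypothetical)
with open("/tmp/test_data.log", "a") as f:
    for _ in range(5):
        f.write(gen_record() + "\n")
```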

Task 2: Configure Kafka

Task 3: Install the Flume client

Enter Directory

Decompress the package

Unzip the "MRS_Flume_ClientConfig.tar" file

Install Flume environment variables

Unzip Flume Client

Install Flume Client

Restart Flume Service

Task 4: Configure Flume to collect data
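Task 4's agent configuration follows the usual source -> channel -> sink layout; a sketch only (the agent name, watched file path, broker address, and topic are assumptions):

```properties
# hypothetical agent "client": tail a log file and push records to Kafka
client.sources = s1
client.channels = c1
client.sinks = k1

client.sources.s1.type = exec
client.sources.s1.command = tail -F /tmp/test_data.log
client.sources.s1.channels = c1

client.channels.c1.type = memory
client.channels.c1.capacity = 10000

client.sinks.k1.type = org.apache.flume.sink.kafka.KafkaSink
client.sinks.k1.kafka.bootstrap.servers = <broker-ip>:9092
client.sinks.k1.kafka.topic = topic_test
client.sinks.k1.channel = c1
```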

Experiment experience:
1) First exposure to experiments on Huawei Cloud; got an initial feel for the concept of the "cloud"

Posted on Fri, 03 Dec 2021 13:11:27 -0500 by anand