Job 1:
Requirement:
Familiarize yourself with Selenium's ability to find HTML elements, crawl Ajax web data, wait for HTML elements, etc.
Use the Selenium framework to crawl product information and pictures for a chosen category of goods in Jingdong Mall.
Candidate sites: http://www.jd.com/
Keyword: Students are free to choose
Output information: the MySQL output format is as follows:
| mNo | mMark | mPrice | mNote | mFile |
| --- | --- | --- | --- | --- |
| 000001 | Samsung Galaxy | 9199.00 | Samsung Galaxy Note20 Ultra 5G... | 000001.jpg |
| 000002...... | | | | |
Problem-solving steps (code reproduction for this topic):
The main point to note in this topic is a special case that appears while crawling:
```python
try:
    note = li.find_element_by_xpath(".//div[@class='p-name p-name-type-2']//em").text
    mark = note.split(" ")[0]
    mark = mark.replace("Love East\n", "")
    mark = mark.replace(",", "")
    note = note.replace("Love East\n", "")
    note = note.replace(",", "")
except:
    note = ""
    mark = ""
```
Irrelevant text sometimes appears in the scraped fields and needs to be replaced or removed.
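The cleanup steps above can be factored into one small helper; a minimal sketch (the `prefix` string stands for the irrelevant injected site text removed in the snippet):

```python
def clean_product_text(note, prefix="Love East\n"):
    """Strip the irrelevant injected prefix and commas from a product name.

    Returns (mark, note): mark is the first word (the brand), note is the
    full cleaned product description.
    """
    note = note.replace(prefix, "").replace(",", "")
    mark = note.split(" ")[0]
    return mark, note
```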
- Run result:
1) End result
2) Database Results
3) Folder Picture Save Results
- Experimentation Experience
1) Became familiar with crawling via Selenium, writing to the database, and downloading pictures
2) Strengthened awareness of how to handle special text
Code address: https://gitee.com/zhubeier/zhebeier/blob/master/ Fifth Homework/Topic 1
Job 2:
Requirement:
Familiarize yourself with Selenium's ability to find HTML elements, simulate user login, crawl Ajax web page data, and wait for HTML elements.
Use the Selenium framework + MySQL to simulate logging in to the MOOC site. Retrieve the information of the courses you have studied from your own student account and save it in MySQL (course ID, course name, teaching unit, teaching progress, course status, course picture URL), and store the pictures in the imgs folder under the project root directory, naming each picture file after its course.
Candidate website: China mooc network: https://www.icourse163.org
Output information: MySQL database storage and output format.
Headers should be named in English, for example course name: cCourse...; the header design is left to the students themselves:
| Id | cCourse | cCollege | cSchedule | cCourseStatus | cImgUrl |
| --- | --- | --- | --- | --- | --- |
| 1 | Python Web Crawler and Information Extraction | Beijing Institute of Technology | 3/18 Hours Learned | Ended May 18, 2021 | http://edu-image.nosdn.127.net/C0AB6FA791150F0DFC0946B9A01C8CB2.jpg |
| 2...... | | | | | |
Solving steps:
STEP 1: Create the mooc database and the table mc in MySQL, then establish the database connection
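A minimal sketch of this step, assuming a local MySQL server and the third-party pymysql driver; the host, user, and password values are placeholders:

```python
# MySQL DDL for the mc table; column types are assumptions sized to the data.
DDL = """
CREATE TABLE IF NOT EXISTS mc (
    id INT PRIMARY KEY,
    cCourse VARCHAR(128),
    cCollege VARCHAR(64),
    cSchedule VARCHAR(64),
    cCourseStatus VARCHAR(64),
    cImgUrl VARCHAR(256)
)
"""

def open_mooc_db(host="localhost", user="root", password=""):
    """Connect to the mooc database and make sure the mc table exists."""
    import pymysql  # third-party driver, assumed installed
    db = pymysql.connect(host=host, user=user, password=password,
                         database="mooc", charset="utf8mb4")
    with db.cursor() as cursor:
        cursor.execute(DDL)
    db.commit()
    return db
```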
STEP 2: Simulate logging in to the MOOC site with the Chrome browser
```python
driver = webdriver.Chrome()
driver.get('https://www.icourse163.org')
driver.maximize_window()
time.sleep(2)
driver.find_element_by_xpath('//*[@id="app"]/div/div/div[1]/div[3]/div[3]/div').click()
time.sleep(2)
driver.find_element_by_xpath('//*[@class="zcnt"]/div/div/div/div/div[2]/span').click()
driver.find_element_by_xpath('//*[@class="mooc-login-set"]/div/div/div[1]/div/div[1]/div[1]/ul/li[2]').click()
time.sleep(1)
# The login form is another HTML document nested in an iframe, so we must switch into it first
driver.switch_to.frame(driver.find_elements_by_tag_name("iframe")[1])
driver.find_element_by_xpath('//*[@id="phoneipt"]').send_keys("18133010322")
time.sleep(1)
passwdinput = driver.find_element_by_xpath('//*[@class="j-inputtext dlemail"]')
passwdinput.send_keys("Zbenotfound404")
time.sleep(1)
driver.find_element_by_xpath('//*[@id="submitBtn"]').click()
time.sleep(3)
```
STEP 3: Enter the My Courses page, find the XPath of each target property, and start crawling
```python
driver.find_element_by_xpath('//*[@id="privacy-ok"]').click()
driver.find_element_by_xpath('//*[@id="app"]/div/div/div[1]/div[3]/div[4]/div').click()
cCourse_list = driver.find_elements_by_xpath('//*[@class="course-card-wrapper"]/div/a/div[@class="img"]/img')
cCollege_list = driver.find_elements_by_xpath('//*[@class="course-card-wrapper"]/div/a/div[@class="body"]/div[1]/div[@class="school"]/a')
cSchedule_list = driver.find_elements_by_xpath('//div[@class="course-card-wrapper"]//div[@class="box"]//div[@class="body"]//div[@class="text"]//a/span')
cCourseStatus_list = driver.find_elements_by_xpath('//*[@class="course-card-wrapper"]/div/a/div[@class="body"]/div[2]/div[@class="course-status"]')
img_url = driver.find_elements_by_xpath('//*[@class="course-card-wrapper"]/div/a/div[@class="img"]/img')

for i in range(len(cCourse_list)):
    cCourse = cCourse_list[i].get_attribute('alt')
    cCollege = cCollege_list[i].text
    cSchedule = cSchedule_list[i].text
    cCourseStatus = cCourseStatus_list[i].text
    cImgUrl = img_url[i].get_attribute('src')
    print(cCourse + "\t" + cCollege + "\t" + cSchedule + "\t" + cCourseStatus + "\t" + cImgUrl)
    cursor.execute("insert into mc values (%s,%s,%s,%s,%s,%s)",
                   (i + 1, cCourse, cCollege, cSchedule, cCourseStatus, cImgUrl))
    db.commit()
```
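The assignment also asks for each course picture to be saved locally under imgs, named after the course. A stdlib-only sketch of that step (the folder name and extension handling are assumptions):

```python
import os
import urllib.request

def image_path(course_name, img_url, folder="imgs"):
    """Build the local file path: folder/<course name><extension from URL>."""
    ext = os.path.splitext(img_url.split("?")[0])[1] or ".jpg"
    return os.path.join(folder, course_name + ext)

def save_course_image(course_name, img_url, folder="imgs"):
    """Download img_url into folder, naming the file after the course."""
    os.makedirs(folder, exist_ok=True)
    path = image_path(course_name, img_url, folder)
    urllib.request.urlretrieve(img_url, path)
    return path
```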
Run result:
1) End result:
2) Database results:
3) Picture save result:
- Experimentation Experience
1) Selenium simulates a user operating the web page, which is intuitive and easy to understand; became familiar with the Selenium click -> locate -> crawl workflow
2) Learned that the login interface sits inside an iframe, a document within the document; to locate elements in it, you must first switch into the iframe tag
Code address: https://gitee.com/zhubeier/zhebeier/blob/master/ Fifth major assignment/second topic
Job 3: Flume log collection experiment
Requirements: Master the relevant big data services and become familiar with using Xshell
Complete the tasks (the five listed below) in the document "Huawei Cloud Big Data Real-Time Analysis and Processing Experiment Manual - Flume Log Collection Experiment (part) v2.docx"; see the documentation for the specific operations.
Environment Setup
Task 1: Open the MapReduce service
Real-time analysis development practice:
Task 1: Python scripts generate test data
Task 2: Configure Kafka
Task 3: Install the Flume client
Task 4: Configure Flume to collect data
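The manual's own Task 1 script is not reproduced here; a hypothetical stand-in generator for comma-separated test records (the field layout and user IDs are assumptions) might look like:

```python
import random
import time

def make_record(user_ids=("u001", "u002", "u003")):
    """Return one fake access-log line: timestamp,user,action."""
    ts = time.strftime("%Y-%m-%d %H:%M:%S")
    action = random.choice(["view", "click", "buy"])
    return ",".join([ts, random.choice(user_ids), action])
```

A loop calling `make_record()` and appending each line to a log file gives Flume a live source to collect from.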
Process Screenshot:
Task 1: Python scripts generate test data
Task 2: Configure Kafka
Task 3: Install the Flume client
Enter the directory
Decompress the package
Unzip the "MRS_Flume_ClientConfig.tar" file
Install the Flume environment variables
Unzip the Flume client
Install the Flume client
Restart the Flume service
Task 4: Configure Flume to collect data
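The exact configuration for this task is given in the manual; a minimal sketch of a Flume agent that tails a generated log file into Kafka (the agent name, file path, topic, and broker host are all placeholders):

```
agent.sources = s1
agent.channels = c1
agent.sinks = k1

# Exec source: tail the test-data log produced in Task 1
agent.sources.s1.type = exec
agent.sources.s1.command = tail -F /tmp/test_data.log
agent.sources.s1.channels = c1

# In-memory channel buffering events between source and sink
agent.channels.c1.type = memory
agent.channels.c1.capacity = 1000

# Kafka sink: publish each log line to a Kafka topic
agent.sinks.k1.type = org.apache.flume.sink.kafka.KafkaSink
agent.sinks.k1.kafka.topic = topic_log
agent.sinks.k1.kafka.bootstrap.servers = kafka-host:9092
agent.sinks.k1.channel = c1
```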
- Experimentation Experience
1) First exposure to experiments on Huawei Cloud; got an initial feel for the concept of the "cloud"