Assignment 1
Assignment requirements:
Master using Selenium to locate HTML elements, crawl Ajax-loaded web page data, wait for HTML elements, etc.
Use the Selenium framework to crawl the information and pictures of selected commodities in Jingdong Mall (JD).
Candidate site: http://www.jd.com/
Experimental process:
Driver configuration
chrome_options = Options()
chrome_options.add_argument("--headless")       # run Chrome without a visible window
chrome_options.add_argument("--disable-gpu")
self.driver = webdriver.Chrome(chrome_options=chrome_options)
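The assignment also asks for waiting for HTML elements; instead of fixed time.sleep calls, an explicit wait can block until the goods list has been rendered by the Ajax page. A minimal sketch, assuming the same self.driver and the J_goodsList container used below:

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# block until the Ajax-rendered goods list is present (at most 10 seconds)
WebDriverWait(self.driver, 10).until(
    EC.presence_of_element_located((By.ID, "J_goodsList")))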
Data acquisition
lis = self.driver.find_elements_by_xpath("//div[@id='J_goodsList']//li[@class='gl-item']")
time.sleep(1)
for li in lis:
    time.sleep(1)
    # image URL: either an already loaded src or the lazy-load attribute
    try:
        src1 = li.find_element_by_xpath(".//div[@class='p-img']//a//img").get_attribute("src")
        time.sleep(1)
    except:
        src1 = ""
    try:
        src2 = li.find_element_by_xpath(".//div[@class='p-img']//a//img").get_attribute("data-lazy-img")
        time.sleep(1)
    except:
        src2 = ""
    try:
        price = li.find_element_by_xpath(".//div[@class='p-price']//i").text
        time.sleep(1)
    except:
        price = "0"
    note = li.find_element_by_xpath(".//div[@class='p-name p-name-type-2']//em").text
    # strip the promotional prefix and commas from the brand/name text
    mark = note.split(" ")[0]
    mark = mark.replace("Love Dongdong\n", "")
    mark = mark.replace(",", "")
    note = note.replace("Love Dongdong\n", "")
    note = note.replace(",", "")
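JD lazy-loads the product images while the page is scrolled, which is why both src and data-lazy-img are read above. A small optional sketch (assuming the same self.driver) that scrolls down before collecting the li items, so that more images carry a real src:

import time

# scroll down in several steps so the lazily loaded product images
# have a chance to load before the data is extracted
for _ in range(5):
    self.driver.execute_script("window.scrollBy(0, document.body.scrollHeight / 5);")
    time.sleep(1)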
Save picture
if src1:
    src1 = urllib.request.urljoin(self.driver.current_url, src1)
    p = src1.rfind(".")
    mFile = no + src1[p:]            # file name: serial number + original extension
elif src2:
    src2 = urllib.request.urljoin(self.driver.current_url, src2)
    p = src2.rfind(".")
    mFile = no + src2[p:]
if src1 or src2:
    T = threading.Thread(target=self.downloadDB, args=(src1, src2, mFile))
    T.daemon = False                 # keep the download thread alive until it finishes
    T.start()
    self.threads.append(T)
else:
    mFile = ""
Data insertion
def insertDB(self, mNo, mMark, mPrice, mNote, mFile):
    try:
        sql = "insert into phones (mNo,mMark,mPrice,mNote,mFile) values (?,?,?,?,?)"
        self.cursor.execute(sql, (mNo, mMark, mPrice, mNote, mFile))
    except Exception as err:
        print(err)
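The "?" placeholders indicate that the cursor comes from a SQLite connection. A hedged sketch of how the database could be opened and the phones table created (file name and column sizes are assumptions; the actual classroom code may differ):

import sqlite3

def startUp(self):
    # hedged sketch: open a local SQLite database (file name is hypothetical)
    self.con = sqlite3.connect("phones.db")
    self.cursor = self.con.cursor()
    try:
        # columns match the insert statement used by insertDB above
        self.cursor.execute("create table phones (mNo varchar(32) primary key,"
                            "mMark varchar(256), mPrice varchar(32),"
                            "mNote varchar(1024), mFile varchar(256))")
    except Exception:
        pass  # table already exists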
Picture download
def downloadDB(self, src1, src2, mFile):
    data = None
    if src1:
        try:
            req = urllib.request.Request(src1, headers=JD.header)
            resp = urllib.request.urlopen(req, timeout=100)
            data = resp.read()
        except:
            pass
    if not data and src2:                       # fall back to the lazy-load URL
        try:
            req = urllib.request.Request(src2, headers=JD.header)
            resp = urllib.request.urlopen(req, timeout=100)
            data = resp.read()
        except:
            pass
    if data:
        print("download begin!", mFile)
        fobj = open(JD.imagepath + "\\" + mFile, "wb")
        fobj.write(data)
        fobj.close()
        print("download finish!", mFile)
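Since the download threads are non-daemon and collected in self.threads, the crawler should join them before committing the database and closing the driver. A short sketch, assuming the self.con connection from the SQLite sketch above:

# wait for all image-download threads, then commit and shut down
# (assumes self.threads from above and self.con from the SQLite sketch)
for t in self.threads:
    t.join()
self.con.commit()
self.con.close()
self.driver.close()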
Experimental results:
Experimental experience: this experiment is a reproduction of the classroom code. Through it, I deepened my understanding of Selenium and became more proficient with XPath.
Code cloud address: Assignment 5/main.py · Liu Yang / 2019 data acquisition and fusion - Code cloud - Open Source China (gitee.com)
Assignment 2
Assignment requirements:
Be proficient in using Selenium to locate HTML elements, simulate user login, crawl Ajax-loaded web page data, wait for HTML elements, etc.
Use the Selenium framework + MySQL to crawl the course resource information of the China MOOC website (course number, course name, teaching progress, course status, course picture address), store the pictures in the imgs folder under the root directory of the local project, and name each picture after its course.
Candidate website: China MOOC website: https://www.icourse163.org
Experimental process:
Send request
option = webdriver.ChromeOptions()
option.add_experimental_option("detach", True)   # keep the browser open after the script ends
driver = webdriver.Chrome(chrome_options=option)
# request the home page
driver.get('https://www.icourse163.org/')
Database part, including the implementation of table creation and data insertion
class CSDB:
    con = pymysql.connect(host="127.0.0.1", port=3306, user="root", passwd="20010109l?y!",
                          db="class", charset="utf8")
    cursor = con.cursor(pymysql.cursors.DictCursor)

    def createDB(self):
        try:
            self.cursor.execute('create table Course(Id varchar(10), cCourse varchar(40), cCollege varchar(20), '
                                'cSchedule varchar(30), cCourseStatus varchar(30), clmgUrl varchar(255))')
        except Exception as err:
            print(err)

    def insert(self, id, course, college, schedule, coursestatus, clmgurl):
        try:
            self.cursor.execute('insert into Course(Id,cCourse,cCollege,cSchedule,cCourseStatus,clmgUrl) '
                                'values (%s,%s,%s,%s,%s,%s)',
                                (id, course, college, schedule, coursestatus, clmgurl))
        except Exception as err:
            print(err)

    def closeDB(self):
        self.con.commit()
        self.con.close()
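A brief usage sketch of the CSDB class (the values are placeholders, not real course data):

# brief usage sketch of CSDB; the values are placeholders
db = CSDB()
db.createDB()
db.insert("1", "Example Course", "Example University", "Progress 3/10",
          "In progress", "http://example.com/course.jpg")
db.closeDB()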
Save the pictures: each picture is named after its course and saved in the images folder.
def download(url, name):
    req = urllib.request.Request(url)
    data = urllib.request.urlopen(req, timeout=100)
    data = data.read()
    fobj = open(r"images/" + str(name) + ".jpg", "wb")
    fobj.write(data)
    fobj.close()
    print("downloaded " + name + ".jpg")
Click to open the login page, scan the QR code to log in, and then click to enter the My Courses page to obtain the data.
# open the login dialog, then wait while the QR code is scanned
sign = driver.find_element(By.XPATH, '/html/body/div[4]/div[2]/div[1]/div/div/div[1]/div[3]/div[3]/div').click()
time.sleep(10)
# enter "My Courses"
find = driver.find_element(By.XPATH, '//*[@id="j-indexNav-bar"]/div/div/div/div/div[7]/div[3]/div/div/a/span')
# the icon is covered by another element, so find.click() cannot be used directly;
# click it through JavaScript instead
driver.execute_script("arguments[0].click();", find)
Data acquisition
# course name
course = driver.find_elements(By.XPATH, '//*[@id="j-coursewrap"]/div/div[1]/div/div[1]/a/div[2]/div[1]/div[1]/div/span[2]')
# school
college = driver.find_elements(By.XPATH, '//*[@id="j-coursewrap"]/div/div[1]/div/div[1]/a/div[2]/div[1]/div[2]/a')
# teaching progress
sche = driver.find_elements(By.XPATH, '//*[@id="j-coursewrap"]/div/div[1]/div/div[1]/a/div[2]/div[2]/div[1]/div[1]/div[1]/a/span')
# course status
status = driver.find_elements(By.XPATH, '/html/body/div[4]/div[2]/div[3]/div/div[1]/div[3]/div/div[2]/div/div/div[2]/div[1]/div[2]/div/div[1]/div/div[1]/a/div[2]/div[2]/div[2]')
# course picture URL
url = driver.find_elements(By.XPATH, '/html/body/div[4]/div[2]/div[3]/div/div[1]/div[3]/div/div[2]/div/div/div[2]/div[1]/div[2]/div/div[1]/div/div[1]/a/div[1]/img')
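The five element lists can then be combined row by row, written to MySQL and the course picture downloaded under the course name. A minimal sketch, assuming the CSDB class and download() helper shown above:

# hedged sketch: combine the element lists, store each course in MySQL
# and download its picture, named after the course (assumes CSDB and download)
db = CSDB()
db.createDB()
for i in range(len(course)):
    name = course[i].text
    img = url[i].get_attribute("src")
    db.insert(str(i + 1), name, college[i].text, sche[i].text, status[i].text, img)
    download(img, name)   # picture saved as images/<course name>.jpg
db.closeDB()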
Experimental results:
Experimental experience: through this experiment, I deepened my understanding of Selenium and became more familiar with the use of XPath.
Code cloud address: Assignment 5/task2_1.py · Liu Yang / 2019 data acquisition and fusion - Code cloud - Open Source China (gitee.com)
Assignment 3
Assignment requirements:
Understand Flume architecture and key features, and master the use of Flume to complete log collection tasks.
Complete the Flume log collection experiment, which includes the following steps:
Task 1: open the MapReduce service
Task 2: generate test data from Python script
Task 3: configure Kafka
Task 4: install Flume client
Task 5: configure Flume to collect data
Experimental process and results:
Python script generates test data
Write a Python script, use Xshell 7 to connect to the server, enter the /opt/client/ directory, and use Xftp 7 to upload the local autodatapython.py file to the server's /opt/client/ directory.
Use the mkdir command to create the directory flume_spooldir under /tmp, run the Python script, and generate 100 test records.
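The content of autodatapython.py is not shown here; the following is only a hedged sketch of what such a generator might look like (the output path and record fields are assumptions):

# hedged sketch of a test-data generator (the real autodatapython.py may differ);
# writes count comma-separated records into the given file under /tmp/flume_spooldir
import random
import sys
import time

def generate(path, count):
    with open(path, "w") as f:
        for i in range(count):
            # id, timestamp, random value - purely illustrative fields
            f.write("%d,%s,%d\n" % (i, time.strftime("%Y-%m-%d %H:%M:%S"),
                                    random.randint(0, 100)))

if __name__ == "__main__":
    # e.g. python autodatapython.py /tmp/flume_spooldir/test.txt 100
    generate(sys.argv[1], int(sys.argv[2]))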
Configure Kafka
First set the environment variables, execute the source command to make them take effect, and create a topic in Kafka
View topic information
Install Flume client
Unzip the downloaded Flume client file
Extract the compressed package to obtain the verification file and client configuration package
Verify the software package
Unzip the MRS_Flume_ClientConfig.tar file
Install the Flume environment variables
Install the client running environment into the new directory "/opt/Flumeenv", which is generated automatically during installation
Unzip Flume client
Install the Flume client into the new directory "/opt/FlumeClient", which is generated automatically during installation
Restart the Flume service; the installation is complete
Configure Flume to collect data
Edit the file properties.properties in the conf directory
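As an illustration only, a spooldir-to-Kafka configuration in properties.properties might look roughly like this (agent name, topic and broker address are placeholders, not the exact values used in the experiment):

# illustrative spooldir -> Kafka agent configuration; names and addresses
# are placeholders rather than the exact values from the experiment
client.sources = s1
client.channels = c1
client.sinks = k1

client.sources.s1.type = spooldir
client.sources.s1.spoolDir = /tmp/flume_spooldir
client.sources.s1.channels = c1

client.channels.c1.type = memory
client.channels.c1.capacity = 10000

client.sinks.k1.type = org.apache.flume.sink.kafka.KafkaSink
client.sinks.k1.kafka.topic = topic_test
client.sinks.k1.kafka.bootstrap.servers = <broker-ip>:9092
client.sinks.k1.channel = c1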
Create a consumer to consume the data in Kafka
Open a new Xshell 7 window (right-click the corresponding session -> open in right tab group), run the Python script command again, and regenerate a batch of data
Experimental experience: I gained a preliminary understanding of how to use Huawei Cloud.