Data Mining: Week 5

Assignment 1

Assignment requirements:

Master Selenium techniques for locating HTML elements, crawling Ajax-loaded page data, waiting for HTML elements, and so on.
Use the Selenium framework to crawl the information and pictures of selected commodities in Jingdong Mall (JD.com).
Candidate sites:


Experimental process:

Drive configuration

chrome_options = Options()
# Selenium 3 style; in Selenium 4 this would be webdriver.Chrome(options=chrome_options)
self.driver = webdriver.Chrome(chrome_options=chrome_options)

Data acquisition

lis = self.driver.find_elements_by_xpath("//div[@id='J_goodsList']//li[@class='gl-item']")
for li in lis:
    try:
        src1 = li.find_element_by_xpath(".//div[@class='p-img']//a//img").get_attribute("src")
    except Exception:
        src1 = ""
    try:
        src2 = li.find_element_by_xpath(".//div[@class='p-img']//a//img").get_attribute("data-lazy-img")
    except Exception:
        src2 = ""
    try:
        price = li.find_element_by_xpath(".//div[@class='p-price']//i").text
    except Exception:
        price = "0"

    note = li.find_element_by_xpath(".//div[@class='p-name p-name-type-2']//em").text
    mark = note.split(" ")[0]
    mark = mark.replace("Love Dongdong\n", "")  # strip JD's ad prefix
    mark = mark.replace(",", "")
    note = note.replace("Love Dongdong\n", "")
    note = note.replace(",", "")
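The brand/title cleanup above can be isolated into a small pure-Python helper, which makes it easy to check without a browser. This is a sketch mirroring the logic above; the sample string is made up:

```python
def parse_item(note_text):
    """Mirror the cleanup above: take the first whitespace-separated token
    as the brand mark, then strip JD's ad prefix and commas from both fields."""
    mark = note_text.split(" ")[0]
    mark = mark.replace("Love Dongdong\n", "").replace(",", "")
    note = note_text.replace("Love Dongdong\n", "").replace(",", "")
    return mark, note
```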

Save picture

if src1:
    src1 = urllib.request.urljoin(self.driver.current_url, src1)
    p = src1.rfind(".")
    mFile = no + src1[p:]   # file name: item number + original extension
elif src2:
    src2 = urllib.request.urljoin(self.driver.current_url, src2)
    p = src2.rfind(".")
    mFile = no + src2[p:]
if src1 or src2:
    T = threading.Thread(target=self.downloadDB, args=(src1, src2, mFile))
    T.start()   # the thread must be started, not just created
    mFile = ""
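The download thread above must be started explicitly, and in a short-lived script it is usually joined before exit. A minimal self-contained illustration, with downloadDB replaced by a stub (fake_download and the sample URL are made up for demonstration):

```python
import threading

results = []

def fake_download(src1, src2, mFile):
    # Stand-in for self.downloadDB: just record which URL would be fetched.
    results.append((src1 or src2, mFile))

T = threading.Thread(target=fake_download,
                     args=("http://example.com/a.jpg", "", "000001.jpg"))
T.start()   # without start() the thread never runs
T.join()    # wait for the "download" before the program exits
```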

Data insertion

def insertDB(self, mNo, mMark, mPrice, mNote, mFile):
    try:
        sql = "insert into phones (mNo,mMark,mPrice,mNote,mFile) values (?,?,?,?,?)"
        self.cursor.execute(sql, (mNo, mMark, mPrice, mNote, mFile))
    except Exception as err:
        print(err)

Picture download

def downloadDB(self, src1, src2, mFile):
    data = None
    if src1:
        try:
            req = urllib.request.Request(src1, headers=JD.header)
            resp = urllib.request.urlopen(req, timeout=100)
            data = resp.read()
        except Exception:
            pass
    if not data and src2:
        try:
            req = urllib.request.Request(src2, headers=JD.header)
            resp = urllib.request.urlopen(req, timeout=100)
            data = resp.read()
        except Exception:
            pass
    if data:
        print("download begin!", mFile)
        fobj = open(JD.imagepath + "\\" + mFile, "wb")
        fobj.write(data)
        fobj.close()
        print("download finish!", mFile)

Experimental results:



  Experimental experience: this experiment reproduces the classroom code. It deepened my understanding of Selenium and helped me master XPath element location.

  Gitee address: Assignment 5/ · Liu Yang / 2019 data acquisition and fusion - Gitee - Open Source China (

Assignment 2

Assignment requirements:

Be proficient with Selenium for locating HTML elements, simulating user login, crawling Ajax-loaded page data, and waiting for HTML elements.
Use the Selenium framework + MySQL to crawl course information from the China MOOC site (course number, course name, college, teaching progress, course status, and course picture address), store the pictures in the imgs folder under the root directory of the local project, and name each picture after its course.
Candidate website: China mooc website:

Experimental process:

Send request

option = webdriver.ChromeOptions()
option.add_experimental_option("detach", True)  # keep the browser open after the script ends
driver = webdriver.Chrome(options=option)

Database part, including the implementation of inserting data

class CSDB:
    con = pymysql.connect(host="", port=3306, user="root", passwd="20010109l?y!", db="class")
    cursor = con.cursor(pymysql.cursors.DictCursor)

    def createDB(self):
        try:
            self.cursor.execute('create table Course(Id varchar (10),cCourse varchar (40),cCollege varchar (20),cSchedule varchar (30),'
                                'cCourseStatus varchar (30), clmgUrl varchar (255))')
        except Exception as err:
            print(err)

    def insert(self, id, course, college, schedule, coursestatus, clmgurl):
        try:
            self.cursor.execute('insert into Course(Id,cCourse,cCollege,cSchedule,cCourseStatus,clmgUrl) '
                                'values (%s,%s,%s,%s,%s,%s)', (id, course, college, schedule, coursestatus, clmgurl))
        except Exception as err:
            print(err)

    def closeDB(self):
        self.con.commit()
        self.con.close()

Save the picture, named after the course, into the images folder

def download(url, name):
    req = urllib.request.Request(url)
    resp = urllib.request.urlopen(req, timeout=100)
    data = resp.read()
    fobj = open(r"images/" + str(name) + ".jpg", "wb")
    fobj.write(data)
    fobj.close()
    print("downloaded " + str(name) + ".jpg")
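Course names can contain characters that are illegal in file names (slashes, colons, etc.), which would make the open() call above fail. A small helper can sanitize the name first; this is an assumption added for robustness, not part of the original code:

```python
import re

def safe_filename(name):
    """Replace characters that are illegal in Windows/Unix file names
    so a course title can be used directly as a picture name."""
    return re.sub(r'[\\/:*?"<>|]', "_", str(name)).strip()
```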

Click to enter the login page, scan the QR code to log in, then click into the "My Courses" page to obtain the data

sign = driver.find_element(By.XPATH, '/html/body/div[4]/div[2]/div[1]/div/div/div[1]/div[3]/div[3]/div').click()
# Enter "My Courses"
find = driver.find_element(By.XPATH, '//*[@id="j-indexNav-bar"]/div/div/div/div/div[7]/div[3]/div/div/a/span')
# The icon is obscured, so click() cannot be used directly; click via JavaScript instead
driver.execute_script("arguments[0].click();", find)

Data acquisition

# Course name
course = driver.find_elements(By.XPATH, '//*[@id="j-coursewrap"]/div/div[1]/div/div[1]/a/div[2]/div[1]/div[1]/div/span[2]')
# College
college = driver.find_elements(By.XPATH, '//*[@id="j-coursewrap"]/div/div[1]/div/div[1]/a/div[2]/div[1]/div[2]/a')
# Teaching progress
sche = driver.find_elements(By.XPATH, '//*[@id="j-coursewrap"]/div/div[1]/div/div[1]/a/div[2]/div[2]/div[1]/div[1]/div[1]/a/span')
# Course status
status = driver.find_elements(By.XPATH, '/html/body/div[4]/div[2]/div[3]/div/div[1]/div[3]/div/div[2]/div/div/div[2]/div[1]/div[2]/div/div[1]/div/div[1]/a/div[2]/div[2]/div[2]')
# Course picture URL
url = driver.find_elements(By.XPATH, '/html/body/div[4]/div[2]/div[3]/div/div[1]/div[3]/div/div[2]/div/div/div[2]/div[1]/div[2]/div/div[1]/div/div[1]/a/div[1]/img')
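The five parallel element lists above are typically zipped into rows before insertion into the Course table. A sketch of that step, with plain lists standing in for the WebElement values (all sample data here is made up):

```python
# Stub values standing in for the .text / .get_attribute(...) results
courses = ["Python programming", "Data mining"]
colleges = ["FZU", "BIT"]
schedules = ["learned 3 of 10", "learned 1 of 8"]
statuses = ["in progress", "finished"]
urls = ["http://img.example.com/1.jpg", "http://img.example.com/2.jpg"]

# One tuple per course, matching the column order of CSDB.insert:
# Id, cCourse, cCollege, cSchedule, cCourseStatus, clmgUrl
rows = [
    (i + 1, c, col, s, st, u)
    for i, (c, col, s, st, u) in enumerate(zip(courses, colleges, schedules, statuses, urls))
]
```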

Experimental results:




  Experimental experience: through this experiment, I deepened my understanding of Selenium and became more familiar with the use of XPath.

  Gitee address: Assignment 5/ · Liu Yang / 2019 data acquisition and fusion - Gitee - Open Source China (

Assignment 3

Assignment requirements:

Understand Flume's architecture and key features, and master the use of Flume to complete log collection tasks.
Complete the Flume log collection experiment, which includes the following steps:
Task 1: open the MapReduce service
Task 2: generate test data with a Python script
Task 3: configure Kafka
Task 4: install the Flume client
Task 5: configure Flume to collect data

Experimental process and results:

Python script generates test data

Write a Python script. Use Xshell 7 to connect to the server, enter the /opt/client/ directory, and use Xftp 7 to upload the local file to the server's /opt/client/ directory.

  Use the mkdir command to create the directory /tmp/flume_spooldir, run the Python script, and generate 100 test records
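The test-data step above can be sketched as a small generator script. The record format and file naming here are assumptions for illustration; the actual script from the course may differ:

```python
import os
import random
import time

def write_batch(directory, n=100):
    """Write one file of n pseudo-log records into the spooling
    directory that Flume watches. Record format (seq,user-id,timestamp)
    is made up for this sketch."""
    os.makedirs(directory, exist_ok=True)
    path = os.path.join(directory, "batch_%d.txt" % int(time.time()))
    with open(path, "w") as f:
        for i in range(n):
            f.write("%d,user%04d,%d\n" % (i, random.randint(0, 9999), int(time.time())))
    return path
```

Running `write_batch("/tmp/flume_spooldir")` on the server would drop one file of 100 records for Flume to pick up.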




  Configure Kafka

  First, set the environment variables, run the source command to make them take effect, and create a topic in Kafka



  View topic information



  Install Flume client

  Unzip the downloaded flume client file



  Extract the compressed package to obtain the verification file and client configuration package

  Verification package




Unzip the MRS_Flume_ClientConfig.tar file


Installing Flume environment variables

Install the client running environment into the new directory /opt/Flumeenv, which is generated automatically during installation


  Unzip Flume client


Install the Flume client into the new directory /opt/FlumeClient; the directory is generated automatically during installation


Restart the Flume service; the installation is then complete


  Configure Flume to collect data

Edit the file in the conf directory
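The file edited here is the Flume agent's properties configuration, which for this experiment wires a spooling-directory source to a Kafka sink. A sketch under assumed names (the agent name, topic, and broker address are illustrative, not taken from the experiment):

```properties
# Hypothetical Flume agent: spooldir source -> memory channel -> Kafka sink
client.sources = s1
client.channels = c1
client.sinks = k1

client.sources.s1.type = spooldir
client.sources.s1.spoolDir = /tmp/flume_spooldir
client.sources.s1.channels = c1

client.channels.c1.type = memory
client.channels.c1.capacity = 10000

client.sinks.k1.type = org.apache.flume.sink.kafka.KafkaSink
client.sinks.k1.kafka.topic = fromflume
client.sinks.k1.kafka.bootstrap.servers = <broker-host>:9092
client.sinks.k1.channel = c1
```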


Start a consumer to consume the data in Kafka


  Open a new Xshell 7 window (right-click the corresponding session -> open in the right tab group), run the Python script command again, and regenerate a batch of data


  Experimental experience: gained a preliminary understanding of how to use Huawei Cloud.

Posted on Wed, 24 Nov 2021 07:05:17 -0500 by keith73