Data Mining: Assignment 5

Assignment 1

Assignment requirements:

Master using Selenium to locate HTML elements, crawl Ajax-loaded web page data, wait for HTML elements, etc.
Use the Selenium framework to crawl the information and pictures of selected commodities in Jingdong Mall (JD.com).
Candidate site: http://www.jd.com/

 

Experimental process:

Driver configuration

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

chrome_options = Options()
chrome_options.add_argument("--headless")      # run Chrome without a window
chrome_options.add_argument("--disable-gpu")
self.driver = webdriver.Chrome(chrome_options=chrome_options)

Data acquisition

# Each commodity is one <li class="gl-item"> inside the goods list
lis = self.driver.find_elements_by_xpath("//div[@id='J_goodsList']//li[@class='gl-item']")
time.sleep(1)
for li in lis:
    time.sleep(1)
    # An already-loaded image exposes its address in src...
    try:
        src1 = li.find_element_by_xpath(".//div[@class='p-img']//a//img").get_attribute("src")
        time.sleep(1)
    except:
        src1 = ""
    # ...while a lazy-loaded one keeps it in data-lazy-img instead
    try:
        src2 = li.find_element_by_xpath(".//div[@class='p-img']//a//img").get_attribute("data-lazy-img")
        time.sleep(1)
    except:
        src2 = ""
    # Price text; default to "0" when the element is missing
    try:
        price = li.find_element_by_xpath(".//div[@class='p-price']//i").text
        time.sleep(1)
    except:
        price = "0"

    # Product title: the first word serves as the brand mark; strip the
    # "Love Dongdong" promotion tag and commas so the values store cleanly
    note = li.find_element_by_xpath(".//div[@class='p-name p-name-type-2']//em").text
    mark = note.split(" ")[0]
    mark = mark.replace("Love Dongdong\n", "")
    mark = mark.replace(",", "")
    note = note.replace("Love Dongdong\n", "")
    note = note.replace(",", "")
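
The fixed time.sleep(1) calls above work, but the "wait for HTML elements" requirement is better served by an explicit wait. A minimal sketch using Selenium's WebDriverWait (not part of the original code):

from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

# Block for up to 10 seconds until the goods list items are present,
# instead of sleeping a fixed second after every action.
WebDriverWait(self.driver, 10).until(
    EC.presence_of_all_elements_located(
        (By.XPATH, "//div[@id='J_goodsList']//li[@class='gl-item']")))
lis = self.driver.find_elements_by_xpath("//div[@id='J_goodsList']//li[@class='gl-item']")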

Save picture

if src1:
    src1 = urllib.request.urljoin(self.driver.current_url, src1)
    p = src1.rfind(".")
    mFile = no + src1[p:]  # file name = record number + original extension
elif src2:
    src2 = urllib.request.urljoin(self.driver.current_url, src2)
    p = src2.rfind(".")
    mFile = no + src2[p:]
if src1 or src2:
    # Download in a worker thread so the crawl itself is not blocked
    T = threading.Thread(target=self.downloadDB, args=(src1, src2, mFile))
    T.setDaemon(False)
    T.start()
    self.threads.append(T)
else:
    mFile = ""
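
Because the download threads are non-daemon, the crawler should wait for them before closing the browser and the database. A minimal sketch of that cleanup step (its placement in a closing method is an assumption, not shown in the original code):

# Wait for every pending image download to finish before shutting down
for t in self.threads:
    t.join()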

Data insertion

def insertDB(self, mNo, mMark, mPrice, mNote, mFile):
    try:
        sql = "insert into phones (mNo,mMark,mPrice,mNote,mFile) values (?,?,?,?,?)"

        self.cursor.execute(sql, (mNo, mMark, mPrice, mNote, mFile))
    except Exception as err:
        print(err)
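
The ? placeholders indicate sqlite3-style parameter binding; a minimal sketch of the connection and table setup that insertDB relies on (the file name and column types are assumptions):

import sqlite3

# Assumed setup: open (or create) a local SQLite file and the phones
# table that insertDB() writes into.
self.con = sqlite3.connect("phones.db")
self.cursor = self.con.cursor()
self.cursor.execute(
    "create table if not exists phones ("
    "mNo varchar(32), mMark varchar(256), mPrice varchar(32), "
    "mNote varchar(1024), mFile varchar(256))")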

Picture download

def downloadDB(self, src1, src2, mFile):
    data = None
    # Try the src address first, then fall back to the lazy-load address
    if src1:
        try:
            req = urllib.request.Request(src1, headers=JD.header)
            resp = urllib.request.urlopen(req, timeout=100)
            data = resp.read()
        except:
            pass
    if not data and src2:
        try:
            req = urllib.request.Request(src2, headers=JD.header)
            resp = urllib.request.urlopen(req, timeout=100)
            data = resp.read()
        except:
            pass
    if data:
        print("download begin!", mFile)
        # Write the raw bytes into the image directory
        fobj = open(JD.imagepath + "\\" + mFile, "wb")
        fobj.write(data)
        fobj.close()
        print("download finish!", mFile)

Experimental results:


Experimental experience: this experiment reproduced the classroom code. Through it, I deepened my understanding of Selenium and became more proficient with XPath.

Gitee address: Assignment 5/main.py · Liu Yang / 2019 data acquisition and fusion (gitee.com)

Assignment 2

Assignment requirements:

Be proficient with Selenium for locating HTML elements, simulating user login, crawling Ajax-loaded web page data, waiting for HTML elements, etc.
Use the Selenium framework + MySQL to crawl the course resource information of China MOOC (course number, course name, school, teaching progress, course status, and course picture address), and store the pictures in the imgs folder under the root directory of the local project, with each picture named after its course name.
Candidate website: China MOOC: https://www.icourse163.org

Experimental process:

Send request

option = webdriver.ChromeOptions()
# Keep the browser open after the script finishes so the QR code can be scanned
option.add_experimental_option("detach", True)
driver = webdriver.Chrome(options=option)
# Request the home page
driver.get('https://www.icourse163.org/')

Database part, including the implementation of inserting data

class CSDB:
    con = pymysql.connect(host="127.0.0.1", port=3306, user="root", passwd="20010109l?y!", db="class",
                          charset="utf8")
    cursor = con.cursor(pymysql.cursors.DictCursor)

    def createDB(self):
        try:
            self.cursor.execute('create table Course(Id varchar (10),cCourse varchar (40),cCollege varchar (20),cSchedule varchar (30),'
                            'cCourseStatus varchar (30), clmgUrl varchar (255))')
        except Exception as err:
            print(err)
            print(1)

    def insert(self,id,course,college,schedule,coursestatus,clmgurl):
        try:
            self.cursor.execute('insert into Course(Id,cCourse,cCollege,cSchedule,cCourseStatus,clmgUrl) '
                            'values (%s,%s,%s,%s,%s,%s)', (id, course, college, schedule, coursestatus, clmgurl))
        except Exception as err:
            print(err)

    def closeDB(self):
        self.con.commit()
        self.con.close()
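
A minimal usage sketch of the class above (the sample row values are placeholders):

db = CSDB()
db.createDB()
# Insert one hypothetical row, then commit and close
db.insert("1", "Python Data Analysis", "BIT", "Week 3 of 12",
          "In progress", "https://example.com/course.jpg")
db.closeDB()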

Save the picture; each picture is named after its course name and saved in the images folder

def download(url, name):
    # Fetch the picture bytes and write them to images/<course name>.jpg
    req = urllib.request.Request(url)
    data = urllib.request.urlopen(req, timeout=100)
    data = data.read()
    fobj = open(r"images/" + str(name) + ".jpg", "wb")
    fobj.write(data)
    fobj.close()
    print("downloaded " + str(name) + ".jpg")

Click to open the login page, scan the QR code to log in, then click into the "My Courses" page to collect the data

# Open the login page
driver.find_element(By.XPATH, '/html/body/div[4]/div[2]/div[1]/div/div/div[1]/div[3]/div[3]/div').click()
time.sleep(10)  # leave time to scan the QR code
# Enter "My Courses"
find = driver.find_element(By.XPATH, '//*[@id="j-indexNav-bar"]/div/div/div/div/div[7]/div[3]/div/div/a/span')
# The element is covered by an icon, so find.click() cannot be used directly
driver.execute_script("arguments[0].click();", find)

Data acquisition

# Course name
course = driver.find_elements(By.XPATH, '//*[@id="j-coursewrap"]/div/div[1]/div/div[1]/a/div[2]/div[1]/div[1]/div/span[2]')
# School
college = driver.find_elements(By.XPATH, '//*[@id="j-coursewrap"]/div/div[1]/div/div[1]/a/div[2]/div[1]/div[2]/a')
# Teaching progress
sche = driver.find_elements(By.XPATH, '//*[@id="j-coursewrap"]/div/div[1]/div/div[1]/a/div[2]/div[2]/div[1]/div[1]/div[1]/a/span')
# Course status
status = driver.find_elements(By.XPATH, '/html/body/div[4]/div[2]/div[3]/div/div[1]/div[3]/div/div[2]/div/div/div[2]/div[1]/div[2]/div/div[1]/div/div[1]/a/div[2]/div[2]/div[2]')
# Course picture URL
url = driver.find_elements(By.XPATH, '/html/body/div[4]/div[2]/div[3]/div/div[1]/div[3]/div/div[2]/div/div/div[2]/div[1]/div[2]/div/div[1]/div/div[1]/a/div[1]/img')
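
With the five element lists in hand, the rows can be zipped together, written to MySQL, and the pictures downloaded. A minimal sketch tying together the CSDB class and download function defined above (the numbering scheme is an assumption):

db = CSDB()
db.createDB()
# Walk the parallel element lists; the texts/attributes come from the
# XPath queries above, and enumerate() supplies the course number.
for i, (c, col, s, st, u) in enumerate(zip(course, college, sche, status, url), start=1):
    img = u.get_attribute("src")
    db.insert(str(i), c.text, col.text, s.text, st.text, img)
    download(img, c.text)
db.closeDB()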

Experimental results:


Experimental experience: through this experiment, I deepened my understanding of Selenium and became familiar with the use of XPath.

Gitee address: Assignment 5/task2_1.py · Liu Yang / 2019 data acquisition and fusion (gitee.com)

Assignment 3

Assignment requirements:

Understand the Flume architecture and key features, and master the use of Flume to complete log collection tasks.
Complete the Flume log collection experiment, which includes the following steps:
Task 1: open the MapReduce service
Task 2: generate test data with a Python script
Task 3: configure Kafka
Task 4: install the Flume client
Task 5: configure Flume to collect data

Experimental process and results:

Python script generates test data

Write a Python script, use Xshell 7 to connect to the server, enter the /opt/client/ directory, and use Xftp 7 to upload the local autodatapython.py file to the server's /opt/client/ directory.

Use the mkdir command to create the directory /tmp/flume_spooldir, then execute the Python script to generate 100 test records.
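
The write-up does not show the script body; a minimal sketch of what autodatapython.py might look like (the record format and command-line interface are assumptions):

# autodatapython.py -- hypothetical reconstruction: write N comma-
# separated records into a file under the directory Flume watches.
import random
import sys
import time

def generate(path, count):
    with open(path, "w") as f:
        for i in range(count):
            ts = time.strftime("%Y-%m-%d %H:%M:%S")
            f.write("%d,%s,%d\n" % (i, ts, random.randint(0, 100)))

if __name__ == "__main__":
    # e.g. python autodatapython.py /tmp/flume_spooldir/test.txt 100
    generate(sys.argv[1], int(sys.argv[2]))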


Configure Kafka

First set the environment variables, execute the source command to make them take effect, and create a topic in Kafka.


View topic information


Install the Flume client

Unzip the downloaded Flume client file


  Extract the compressed package to obtain the verification file and client configuration package

Verification package


Unzip the MRS_Flume_ClientConfig.tar file


Install the Flume environment variables

Install the client runtime environment to the new directory /opt/Flumeenv, which is generated automatically during installation.


Unzip the Flume client


Install the Flume client to the new directory /opt/FlumeClient; the directory is generated automatically during installation.


Restart the Flume service; the installation is then complete.


Configure Flume to collect data

Edit the file properties.properties in the conf directory
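
The write-up does not reproduce the file contents; a minimal sketch of a spooldir-to-Kafka agent configuration (the agent name, topic, and broker address are assumptions):

# Hypothetical properties.properties: read files dropped into
# /tmp/flume_spooldir and forward each line to a Kafka topic.
a1.sources = r1
a1.channels = c1
a1.sinks = k1

a1.sources.r1.type = spooldir
a1.sources.r1.spoolDir = /tmp/flume_spooldir
a1.sources.r1.channels = c1

a1.channels.c1.type = memory
a1.channels.c1.capacity = 10000
a1.channels.c1.transactionCapacity = 100

a1.sinks.k1.type = org.apache.flume.sink.kafka.KafkaSink
a1.sinks.k1.kafka.topic = <topic-name>
a1.sinks.k1.kafka.bootstrap.servers = <broker-ip>:9092
a1.sinks.k1.channel = c1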


Start a consumer to consume the data in Kafka
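
The experiment itself uses the console consumer shipped with the Kafka client; purely as an illustration, a minimal Python equivalent (assumes the kafka-python package and an unauthenticated cluster):

from kafka import KafkaConsumer

# Hypothetical check: print every record that Flume pushes into the
# topic. Replace the topic name and broker address with the real ones.
consumer = KafkaConsumer("<topic-name>", bootstrap_servers="<broker-ip>:9092")
for msg in consumer:
    print(msg.value.decode("utf-8"))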


Open a new Xshell 7 window (right-click the corresponding session -> open in a new tab group on the right), execute the Python script command again, and generate a fresh batch of data.


Experimental experience: I gained a preliminary understanding of how to use Huawei Cloud.
