Getting started with Python crawler

Preface

A beautiful day has begun again. Today we continue by crawling imooc.com, an online IT education website. The platform does not hold much data, so it is relatively easy to crawl.

Ready to crawl

Open the page we want to crawl, find the pagination controls, and check whether the data is loaded asynchronously.

Analysis shows that no data is loaded asynchronously. We only need to turn pages, and the data can be extracted by parsing the HTML.
The page URLs are as follows. There are 32 pages in total, which is a very small volume of data.

https://www.imooc.com/course/list?page=1
https://www.imooc.com/course/list?page=2
....

https://www.imooc.com/course/list?page=32
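
Since only the page parameter changes, the full URL list can be generated up front. The snippet below is a minimal sketch of that idea. The names build_page_urls, BASE_URL and TOTAL_PAGES are made up for illustration, and the count of 32 is simply the figure observed above, which may change as the site adds courses.

```python
BASE_URL = "https://www.imooc.com/course/list"
TOTAL_PAGES = 32  # assumption: the page count observed in this article


def build_page_urls(total_pages=TOTAL_PAGES, base_url=BASE_URL):
    """Return the list-page URLs for pages 1..total_pages."""
    return [f"{base_url}?page={n}" for n in range(1, total_pages + 1)]


urls = build_page_urls()
print(urls[0])    # first list page
print(urls[-1])   # last list page
print(len(urls))  # number of pages to visit
```

Keeping the page count in one constant means only one place needs to change if the site grows.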

Writing code

The code is divided into three parts: building the page URLs automatically, parsing the HTML, and storing the results in MongoDB.

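import requests
from pyquery import PyQuery as pq

TOTAL_PAGES = 32  # The course list had 32 pages at the time of writing


def main(page):
    print(f"Crawling page {page}")
    try:
        with requests.Session() as s:
            res = s.get("https://www.imooc.com/course/list?page={}".format(page), timeout=10)
            res.raise_for_status()  # Treat HTTP error codes as failures
            d = pq(res.text)
            get_content(d)  # The detailed function contents are as follows
    except Exception as e:
        print(e)


if __name__ == '__main__':
    # Visit pages 1..TOTAL_PAGES in order, then stop
    for page in range(1, TOTAL_PAGES + 1):
        main(page)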
    

The code above mainly handles page turning. To get the detailed data, you need to parse the source of each page. The parsing library used here is pyquery, a Python library with a jQuery-like API. If you search for it, you will find plenty of introductory tutorials.

from urllib.parse import urljoin


def get_content(d):
    courses = d.items(".course-card-container")

    for course in courses:
        title = course.find(".course-card-name").text()  # Find title
        des = course.find(".course-card-desc").text()  # Description text
        level = course.find(".course-card-info>span:eq(0)").text()  # First info span
        users = course.find(".course-card-info>span:eq(1)").text()  # Second info span
        labels = course.find(".course-label").text().split(" ")  # Labels as a list
        url = urljoin("https://www.imooc.com/learn/", course.find("a").attr("href"))  # URL splicing
        img_url = urljoin("https://img3.mukewang.com/", course.find("img").attr("src"))  # URL splicing
        course_info = {
            "title": title,
            "des": des,
            "level": level,
            "users": users,
            "labels": labels,
            "url": url,
            "img_url": img_url
        }
        save_mongodb(course_info)  # Save to MongoDB

The last step is saving to MongoDB, which is a basic operation. Refer to the previous tutorial and finish it yourself.
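
For readers who do not have the previous tutorial at hand, here is one possible sketch of save_mongodb using pymongo. It is not the author's implementation. The connection URI, the database name "imooc" and the collection name "courses" are placeholders, and get_collection is a hypothetical helper. The sketch upserts on the course URL so that re-running the crawler does not store duplicates.

```python
def get_collection(uri="mongodb://localhost:27017", db_name="imooc", coll_name="courses"):
    """Open a pymongo collection; the URI and names are placeholder assumptions."""
    from pymongo import MongoClient  # imported lazily so the module loads without pymongo
    return MongoClient(uri)[db_name][coll_name]


def save_mongodb(course, collection=None):
    """Insert or update one course record, keyed by its URL."""
    if collection is None:
        collection = get_collection()
    # upsert=True inserts the record if no document with this URL exists yet
    collection.update_one({"url": course["url"]}, {"$set": course}, upsert=True)
```

Called as save_mongodb(record), it connects to the placeholder local server. Passing a collection explicitly avoids reconnecting for every record and makes the function easy to exercise without a running database.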

And with that, we have crawled another online education platform.

Tags: Python MongoDB Session JQuery

Posted on Wed, 04 Dec 2019 04:21:25 -0500 by vang