Getting started with Python crawler

Preface

A beautiful day has begun again. Today we continue by crawling imooc.com, an online IT education website. The platform does not hold much data, so it is relatively simple to crawl.

Preparing to crawl

Open the page we want to crawl, locate the pagination controls, and check whether the data is loaded asynchronously.
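One rough way to make that check is to look at the raw HTML the server returns and count the course cards in it. If the cards are already there, the data is probably not loaded asynchronously. The sketch below uses only the standard library. The `.course-card-container` class comes from the parsing code later in this article, while `sample_html` is a hypothetical stand-in for a real response body.

```python
from html.parser import HTMLParser


class ClassCounter(HTMLParser):
    """Count start tags whose class attribute contains a given class name."""

    def __init__(self, class_name):
        super().__init__()
        self.class_name = class_name
        self.count = 0

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs; value may be None
        classes = (dict(attrs).get("class") or "").split()
        if self.class_name in classes:
            self.count += 1


def count_elements_with_class(html, class_name):
    """Return how many elements in the HTML text carry the given class."""
    parser = ClassCounter(class_name)
    parser.feed(html)
    return parser.count


# Hypothetical stand-in for the raw HTML returned by the server
sample_html = """
<div class="course-card-container"><h3 class="course-card-name">Course A</h3></div>
<div class="course-card-container"><h3 class="course-card-name">Course B</h3></div>
"""

found = count_elements_with_class(sample_html, "course-card-container")
print(found)  # 2
```

A count of zero on a real response would suggest that the cards are filled in by JavaScript afterwards, in which case plain HTML parsing would not be enough.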

Some analysis shows that no data is loaded asynchronously: paging through the list is enough, and the data can be extracted by parsing the HTML.
The page URLs are as follows, 32 pages in total, which is a very small volume of data.

https://www.imooc.com/course/list?page=1
https://www.imooc.com/course/list?page=2
....
https://www.imooc.com/course/list?page=32

Writing code

The code is divided into three parts: building the page URLs automatically, parsing the HTML, and storing the results in MongoDB.
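The first part, building the page URLs, can be sketched on its own. This is a minimal example that assumes the 32-page count observed above still holds. The names `BASE_URL`, `TOTAL_PAGES` and `build_page_urls` are made up for illustration.

```python
# URL template taken from the page list shown earlier in the article
BASE_URL = "https://www.imooc.com/course/list?page={}"
TOTAL_PAGES = 32  # assumption: the page count observed when the article was written


def build_page_urls(total_pages=TOTAL_PAGES):
    """Return the list-page URLs for pages 1..total_pages."""
    return [BASE_URL.format(page) for page in range(1, total_pages + 1)]


urls = build_page_urls()
print(urls[0])    # https://www.imooc.com/course/list?page=1
print(urls[-1])   # https://www.imooc.com/course/list?page=32
print(len(urls))  # 32
```

Generating the full list up front makes the stopping point explicit, which is easier to reason about than incrementing a counter until something fails.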

    import requests
    from pyquery import PyQuery as pq

    TOTAL_PAGES = 32  # number of list pages found above

    def main(page):
        print(f"Crawling page {page}")
        try:
            with requests.Session() as s:
                res = s.get("https://www.imooc.com/course/list?page={}".format(page))
                d = pq(res.text)
                get_content(d)  # The detailed function contents are as follows
        except Exception as e:
            print(e)

    if __name__ == '__main__':
        # Loop over the pages; calling main() again from finally would never stop
        for page in range(1, TOTAL_PAGES + 1):
            main(page)

The code above mainly handles paging. To get the detailed data, the source of each page has to be parsed. The parsing library used here is pyquery, a Python library similar to jQuery. Searching for it will turn up plenty of introductory tutorials.

    from urllib.parse import urljoin

    def get_content(d):
        courses = d.items(".course-card-container")
        for course in courses:
            title = course.find(".course-card-name").text()  # Find title
            des = course.find(".course-card-desc").text()  # Description
            level = course.find(".course-card-info>span:eq(0)").text()
            users = course.find(".course-card-info>span:eq(1)").text()
            labels = course.find(".course-label").text().split(" ")
            # URL splicing
            url = urljoin("https://www.imooc.com/learn/", course.find("a").attr("href"))
            img_url = urljoin("https://img3.mukewang.com/", course.find("img").attr("src"))
            item = {  # named item so the built-in dict is not shadowed
                "title": title,
                "des": des,
                "level": level,
                "users": users,
                "labels": labels,
                "url": url,
                "img_url": img_url
            }
            save_mongodb(item)  # Save to mongodb
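The URL splicing step relies on `urljoin` from the standard library, which behaves differently depending on the shape of the attribute value. The short sketch below shows a few cases. The `href` and `src` values are hypothetical examples, not values copied from the site.

```python
from urllib.parse import urljoin

course_base = "https://www.imooc.com/learn/"
image_base = "https://img3.mukewang.com/"

# Hypothetical attribute values, for illustration only
from_root_path = urljoin(course_base, "/learn/1234")       # path starting at the site root
from_relative = urljoin(course_base, "1234")               # path relative to the base
from_protocol_relative = urljoin(image_base, "//img3.mukewang.com/example.jpg")
from_absolute = urljoin(course_base, "https://www.imooc.com/learn/1234")

print(from_root_path)          # https://www.imooc.com/learn/1234
print(from_relative)           # https://www.imooc.com/learn/1234
print(from_protocol_relative)  # https://img3.mukewang.com/example.jpg
print(from_absolute)           # https://www.imooc.com/learn/1234
```

Because `urljoin` leaves an already absolute URL unchanged and only borrows the scheme for a protocol-relative one, the same call works whichever form the page happens to use.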

The last step is saving to MongoDB, which is a basic operation. Refer to the previous tutorial and complete it yourself.
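For readers without the previous tutorial at hand, here is one possible sketch of `save_mongodb` using pymongo. It is not the author's implementation. It assumes a MongoDB server on localhost, and the database name `imooc` and collection name `courses` are hypothetical. Upserting on the course URL is a design choice made here so that repeated runs do not create duplicates. A plain `insert_one` would also work.

```python
_collection = None  # cached collection handle, created on first use


def get_collection():
    """Connect to MongoDB (assumed to run locally) and return the target collection."""
    global _collection
    if _collection is None:
        from pymongo import MongoClient  # third-party: pip install pymongo

        client = MongoClient("mongodb://localhost:27017")  # assumed local server
        _collection = client["imooc"]["courses"]  # hypothetical database/collection names
    return _collection


def save_mongodb(item, collection=None):
    """Insert or update one course record, keyed by its URL."""
    if collection is None:
        collection = get_collection()
    # upsert=True inserts the record if no document with this URL exists yet
    collection.update_one({"url": item["url"]}, {"$set": item}, upsert=True)
```

The optional `collection` argument is only there so the function can be exercised with a different collection. Calling `save_mongodb(item)` with a single argument, as the parsing code does, uses the default one.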

And with that, we have crawled another online education platform.

4 December 2019, 04:21 | Views: 4971
