Python crawler practice ❀️ Analyze the page from scratch and scrape the data: crawl any number of pages of Douban movies. Stuck? Come find me! ❀️

📢📢📢📣📣📣
🌻🌻🌻 Hello everyone! My name is Dream, and I'm a Python blogger who likes to keep things fun. I'm still a beginner, so please go easy on me 😜😜😜
🏅🏅🏅 I'm a rising-star creator in the Python field on CSDN and a college sophomore. You're welcome to collaborate with me!
💕 A note to start: this paradise never lacks talent, and hard work is your final admission ticket! 🚀🚀🚀
💓 Finally, may we all shine even where no one is watching, and make progress together 🍺🍺🍺
🍉🍉🍉 "Ten thousand sorrows later, there will still be Dream; I've been waiting for you in the warmest place" is me singing! Hahaha~ 🌈🌈🌈
🌟🌟🌟✨✨✨

Preface:

Next, let's use a Python crawler to scrape Douban movies. We'll start from scratch, so follow along; if anything is unclear, come find me~

1, urllib_ajax GET method: single-page fetching

1. Page analysis

To crawl Douban movies, the first step is of course to search Baidu for "Douban movies" and enter the official site~
We open the page we want: the Douban action film ranking.

Then right-click the page, choose Inspect from the menu, and the developer tools open at the bottom of the window:


Switch to the Network tab and refresh the page to capture fresh requests:

Analyze the requests: first filter out the js, png, and jpg files. Among what remains, we find this file:

Click it and confirm that it is indeed the data we need:

Then find its URL under the Headers tab:

2. Crawling steps

(1) Choosing the url and headers

import urllib.request
url = 'https://movie.douban.com/j/chart/top_list?type=5&interval_id=100%3A90&action=&start=0&limit=20'

headers={
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/94.0.4606.61 Safari/537.36'
}

(2) Customizing the request object

request = urllib.request.Request(url=url,headers=headers)

(3) Getting the response data

response = urllib.request.urlopen(request)
content = response.read().decode('utf-8')

(4) Saving the data locally

The open function uses the platform's default encoding (gbk on Chinese Windows). To save Chinese characters correctly, specify utf-8 in the open call: encoding='utf-8'.

fp = open('douban.json', 'w', encoding='utf-8')
fp.write(content)
fp.close()

A fully equivalent form that closes the file automatically:

with open('douban1.json','w',encoding='utf-8') as fp:
    fp.write(content)

3. Specific code

# -*-coding:utf-8 -*-
# @Author: it's time. I love brother Xu
# Ollie, do it!!!

# Get request, get the first page data and save it
import urllib.request
url = 'https://movie.douban.com/j/chart/top_list?type=5&interval_id=100%3A90&action=&start=0&limit=20'

headers={
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/94.0.4606.61 Safari/537.36'
}

# (1) Customization of request object
request = urllib.request.Request(url=url,headers=headers)

# (2) Get response data
response = urllib.request.urlopen(request)
content = response.read().decode('utf-8')
print(content)

# (3) Data download to local
# The open function uses the platform's default encoding (gbk on Chinese Windows).
# To save Chinese characters correctly, specify the encoding in open:
# encoding='utf-8'
fp = open('douban.json','w',encoding='utf-8')
fp.write(content)
fp.close()
# Complete equivalence
with open('douban1.json','w',encoding='utf-8') as fp:
    fp.write(content)

4. Result display

Open the saved JSON file and you will find that all the data sits on one line, which is hard to read. In PyCharm, press Ctrl+Alt+L to reformat it into properly indented lines. Note that this shortcut clashes with the QQ lock-screen hotkey, so either quit QQ before using it, or remap the action to a different key combination.
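Instead of relying on an IDE shortcut, you can also pretty-print the file with Python's own json module. A minimal sketch (the sample data below is made up so the snippet is self-contained):

```python
import json

# Made-up sample in the same shape as the saved file: a one-line JSON array.
raw = '[{"title": "Movie A", "score": "9.0"}, {"title": "Movie B", "score": "8.5"}]'

movies = json.loads(raw)  # parse the one-line JSON string into Python objects
pretty = json.dumps(movies, indent=2, ensure_ascii=False)  # re-serialize with indentation
print(pretty)
```

ensure_ascii=False keeps Chinese titles readable instead of escaping them as \uXXXX sequences.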

2, urllib_ajax GET method: crawling any number of pages

1. Page analysis

On this page, as you slowly scroll down, you will see more near-identical data keep loading. Each request returns 20 movies, and scrolling further loads another 20. Let's copy down the corresponding URLs and compare them:
Page 1:

https://movie.douban.com/j/chart/top_list?type=5&interval_id=100%3A90&action=&start=0&limit=20

Page 2:

https://movie.douban.com/j/chart/top_list?type=5&interval_id=100%3A90&action=&start=20&limit=20

Page 3:

https://movie.douban.com/j/chart/top_list?type=5&interval_id=100%3A90&action=&start=40&limit=20

Notice that only the start parameter differs from page to page.
For page n: start = (n - 1) * 20
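This formula is easy to sanity-check against the three URLs above:

```python
def start_for_page(page):
    # 20 movies per page; pages are 1-indexed
    return (page - 1) * 20

for page in (1, 2, 3):
    print(page, start_for_page(page))  # pages 1, 2, 3 -> start = 0, 20, 40
```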

2. Crawling steps

(1) Customizing the request object

request = urllib.request.Request(url=url, headers=headers)

(2) URL splicing

Build the query string with urllib.parse:

url = 'https://movie.douban.com/j/chart/top_list?type=5&interval_id=100%3A90&action=unwatched&'

data = {
    'start': (page - 1) * 20,
    'limit': 20
}
data = urllib.parse.urlencode(data)
url = url + data
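urlencode turns the dict into a properly escaped query string, which is then appended to the base URL. For example, for page 1:

```python
import urllib.parse

data = {'start': 0, 'limit': 20}
query = urllib.parse.urlencode(data)
print(query)  # start=0&limit=20

base = 'https://movie.douban.com/j/chart/top_list?type=5&interval_id=100%3A90&action=unwatched&'
print(base + query)  # the full first-page URL
```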

(3) Setting up the entry point

if __name__ == '__main__':
    start_page = int(input('Please enter the starting page number'))
    end_page = int(input('Please enter the end page'))
    for page in range(start_page, end_page + 1):
        # 1. Build the request
        request = create_request(page)
        # 2. Get the response data
        content = get_content(request)
        # 3. Save the data
        down_load(page, content)

(4) create_request(page) function

def create_request(page):
    url = 'https://movie.douban.com/j/chart/top_list?type=5&interval_id=100%3A90&action=unwatched&'
    data = {
        'start': (page - 1) * 20,
        'limit': 20
    }
    data = urllib.parse.urlencode(data)
    url = url + data
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/94.0.4606.61 Safari/537.36'
    }
    request = urllib.request.Request(url=url, headers=headers)
    return request

(5) get_content(request) function

def get_content(request):
    response = urllib.request.urlopen(request)
    content = response.read().decode('utf-8')
    return content

(6) down_load(page,content) function

def down_load(page, content):
    # Equivalent with-statement form:
    # with open('douban_' + str(page) + '.json', 'w', encoding='utf-8') as fp:
    #     fp.write(content)
    fp = open('douban_' + str(page) + '.json', 'w', encoding='utf-8')
    fp.write(content)
    fp.close()

3. Specific code

# -*-coding:utf-8 -*-
# @Author: it's time. I love brother Xu
# Ollie, do it!!!
import urllib.request
import urllib.parse

def create_request(page):
    url = 'https://movie.douban.com/j/chart/top_list?type=5&interval_id=100%3A90&action=unwatched&'
    data={
        'start':(page - 1)*20,
        'limit':20
    }
    data = urllib.parse.urlencode(data)
    url = url + data
    headers={
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/94.0.4606.61 Safari/537.36'
    }
    request = urllib.request.Request(url=url,headers=headers)
    return request

def get_content(request):
    response = urllib.request.urlopen(request)
    content = response.read().decode('utf-8')
    return content

def down_load(page,content):
    # Equivalent with-statement form:
    # with open('douban_'+str(page)+'.json','w',encoding='utf-8') as fp:
    #     fp.write(content)
    fp = open('douban_'+str(page)+'.json','w',encoding='utf-8')
    fp.write(content)
    fp.close()

if __name__=='__main__':
    start_page = int(input('Please enter the starting page number'))
    end_page = int(input('Please enter the end page'))
    for page in range(start_page,end_page+1):
        request = create_request(page)
        # Get response data
        content = get_content(request)
        down_load(page,content)

4. Result display


Reformat it into a more readable form with Ctrl+Alt+L, as before.

3, A final reminder ⚑⚑⚑

After all these operations, you have scraped Douban's movie data. But when you happily return to the Douban movies site, you suddenly find you can't get in: it now demands that you log in first!
Douban: have some manners, no free rides!
Hahaha, you have been blacklisted by Douban. Their defenses are sharp!

After all, wherever there are crawlers, anti-crawler measures follow!
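To lower the chance of being blacklisted, it helps to pause between requests and to retry failed ones instead of hammering the server. The helper below is my own illustration, not part of the original tutorial:

```python
import time
import urllib.error

def fetch_with_retry(fetch, retries=3, delay=2):
    """Call fetch(); on URLError, wait `delay` seconds and retry up to `retries` times."""
    for attempt in range(retries):
        try:
            return fetch()
        except urllib.error.URLError:
            if attempt == retries - 1:
                raise  # give up after the last attempt
            time.sleep(delay)

# Usage idea, assuming the create_request/get_content functions from section 2:
# content = fetch_with_retry(lambda: get_content(create_request(page)))
# time.sleep(1)  # be polite between pages
```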

OK, let's say goodbye. See you next time!

Recommendations of previous articles:

Don't you understand Python OpenCV? No, I won't allow it! The uncle next door said he could understand! ❀️ Environment configuration + problem analysis + introduction to video image ❀️ Ten thousand words only for you~

Python OpenCV actual combat drawing - it will work this time! It's recommended to praise the collection~ ❀️❀️❀️

❀️ Happy Mid Autumn Festival ❀️ Next, please enjoy the image threshold and blur processing of Python Opencv actual combat, 10000 words actual combat, and collect it~

Python crawler ❀️ Urllib usage collection—— ⚑ One click easy entry crawler ⚑

🌲🌲🌲 Well, that's all I want to share with you today
❀️❀️❀️ If you liked it, don't hold back your one-click triple support (like, favorite, follow)~

Tags: Python crawler Ajax

Posted on Wed, 06 Oct 2021 23:09:20 -0400 by sn202