Hello, everyone! My name is Dream, and I'm a fun-loving Python blogger. I'm still a beginner myself, so please bear with me!
I'm a rising-star creator in CSDN's Python field and a college sophomore. Welcome to collaborate with me!
Introduction note: this paradise is never short of geniuses; hard work is your final ticket of admission!
Finally, may we all shine in the places no one sees, and make progress together!
πππ "Ten thousand times sad, there will still be Dream, I have been waiting for you in the warmest place", singing is me! Ha ha ha~ πππ
Preface:
Next, let's use a Python crawler to scrape Douban movies. We'll start from scratch, so keep up with the rhythm; if anything is unclear, come and ask me~
1, urllib_ajax GET method: single-page fetching
1. Page analysis
To crawl Douban movies, our first step is, of course, to search Baidu for "Douban movies" and enter the official site. Duang~
We open the corresponding Douban page: the Douban action film rankings.
Then right-click, find Inspect at the bottom of the context menu, and click it:
Then find the Network tab, click it, and refresh the page to capture fresh requests:
Analyze the traffic: first exclude the js, png and jpg files; among what remains we find this file:
Click it to confirm that it is indeed the data we need:
Then we find its URL in the Headers tab:
2. Crawling steps
(1) URL and headers selection
import urllib.request

url = 'https://movie.douban.com/j/chart/top_list?type=5&interval_id=100%3A90&action=&start=0&limit=20'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/94.0.4606.61 Safari/537.36'
}
(2) Customize the request object
request = urllib.request.Request(url=url,headers=headers)
(3) Get response data
response = urllib.request.urlopen(request)
content = response.read().decode('utf-8')
(4) Save the data locally
The open method uses gbk encoding by default on Chinese Windows systems. To save Chinese characters correctly, specify utf-8 in the open call: encoding='utf-8'.
fp = open('douban.json', 'w', encoding='utf-8')
fp.write(content)
fp.close()
A completely equivalent with-statement form:
with open('douban1.json', 'w', encoding='utf-8') as fp:
    fp.write(content)
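To confirm that the saved text really is usable JSON, it can be parsed with the standard json module. A minimal sketch, using an inline sample in the same shape as the captured response (the 'title' and 'score' field names are assumptions based on what the Network panel shows):

```python
import json

# Inline stand-in for the saved response text; the 'title' and 'score'
# field names are assumptions based on the captured data
sample = '[{"title": "Movie A", "score": "9.1"}, {"title": "Movie B", "score": "8.7"}]'

movies = json.loads(sample)      # JSON text -> list of dicts
for movie in movies:
    print(movie['title'], movie['score'])
```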
3. Specific code
# -*- coding: utf-8 -*-
# @Author: it's time. I love brother Xu
# Ollie, do it!!!
# GET request: fetch the first page of data and save it
import urllib.request

url = 'https://movie.douban.com/j/chart/top_list?type=5&interval_id=100%3A90&action=&start=0&limit=20'
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/94.0.4606.61 Safari/537.36'
}

# (1) Customize the request object
request = urllib.request.Request(url=url, headers=headers)

# (2) Get the response data
response = urllib.request.urlopen(request)
content = response.read().decode('utf-8')
print(content)

# (3) Save the data locally
# open() uses gbk encoding by default on Chinese Windows; to save Chinese
# characters correctly, pass encoding='utf-8'
fp = open('douban.json', 'w', encoding='utf-8')
fp.write(content)
fp.close()

# A completely equivalent with-statement form
with open('douban1.json', 'w', encoding='utf-8') as fp:
    fp.write(content)
4. Result display
The data is saved as JSON, and you will find that it is all on one line, which is very ugly. In PyCharm, just press Ctrl + Alt + L to reformat it into properly indented lines. Note that this shortcut conflicts with desktop QQ's lock hotkey, so either exit QQ first or, if you prefer, remap the shortcut to a different key combination.
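If you would rather not depend on an IDE shortcut at all, the one-line JSON can be re-indented in code. A small sketch, again with an inline stand-in for the file content:

```python
import json

raw = '[{"title": "Movie A", "score": "9.1"}]'   # stand-in for the one-line file

# ensure_ascii=False keeps Chinese characters readable instead of \uXXXX escapes
pretty = json.dumps(json.loads(raw), ensure_ascii=False, indent=4)
print(pretty)
```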
2, urllib_ajax GET method: crawling any number of pages
1. Page analysis
On this page, keep scrolling down with the mouse and you will see more near-identical page data being appended. Each request returns 20 movies, and every time you scroll further, another 20 appear. Let's capture the corresponding URLs and compare them:
Page 1:
https://movie.douban.com/j/chart/top_list?type=5&interval_id=100%3A90&action=&start=0&limit=20
Page 2:
https://movie.douban.com/j/chart/top_list?type=5&interval_id=100%3A90&action=&start=20&limit=20
Page 3:
https://movie.douban.com/j/chart/top_list?type=5&interval_id=100%3A90&action=&start=40&limit=20
It can be seen that each URL differs only in its start parameter.
For page n: start = (n - 1) * 20
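That arithmetic is easy to check in a couple of lines (page_to_start is just an illustrative helper name):

```python
def page_to_start(page, page_size=20):
    # Page n begins at item (n - 1) * page_size
    return (page - 1) * page_size

print(page_to_start(1))  # 0
print(page_to_start(2))  # 20
print(page_to_start(3))  # 40
```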
2. Crawling steps
(1) Customization of request object
request = urllib.request.Request(url=url, headers=headers)
(2) URL splicing
Splice it together with urllib.parse:
url = 'https://movie.douban.com/j/chart/top_list?type=5&interval_id=100%3A90&action=unwatched&'
data = {
    'start': (page - 1) * 20,
    'limit': 20
}
data = urllib.parse.urlencode(data)
url = url + data
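urlencode turns the dict into a properly escaped query string, which is safer than concatenating values by hand:

```python
import urllib.parse

data = {'start': 20, 'limit': 20}
query = urllib.parse.urlencode(data)
print(query)  # start=20&limit=20
```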
(3) Set up the entry point
if __name__ == '__main__':
    start_page = int(input('Please enter the starting page number: '))
    end_page = int(input('Please enter the ending page number: '))
    for page in range(start_page, end_page + 1):
- Get request
- Response request
- Save data
        request = create_request(page)
        # Get the response data
        content = get_content(request)
        down_load(page, content)
(4) create_request(page) function
def create_request(page):
    url = 'https://movie.douban.com/j/chart/top_list?type=5&interval_id=100%3A90&action=unwatched&'
    data = {
        'start': (page - 1) * 20,
        'limit': 20
    }
    data = urllib.parse.urlencode(data)
    url = url + data
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/94.0.4606.61 Safari/537.36'
    }
    request = urllib.request.Request(url=url, headers=headers)
    return request
(5) get_content(request) function
def get_content(request):
    response = urllib.request.urlopen(request)
    content = response.read().decode('utf-8')
    return content
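This function will raise an exception on any network hiccup. A hedged variant (a sketch, not part of the original tutorial; get_content_safe is a hypothetical name) adds a timeout and basic error handling:

```python
import urllib.request
import urllib.error

def get_content_safe(request, timeout=10):
    """Fetch the response body, or return None if the request fails."""
    try:
        response = urllib.request.urlopen(request, timeout=timeout)
        return response.read().decode('utf-8')
    except urllib.error.URLError as exc:
        print('request failed:', exc)
        return None
```

urlopen accepts either a Request object or a plain URL string, so this drop-in also works in the main loop above.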
(6) down_load(page,content) function
def down_load(page, content):
    # with open('douban_' + str(page) + '.json', 'w', encoding='utf-8') as fp:
    #     fp.write(content)
    fp = open('douban_' + str(page) + '.json', 'w', encoding='utf-8')
    fp.write(content)
    fp.close()
3. Specific code
# -*- coding: utf-8 -*-
# @Author: it's time. I love brother Xu
# Ollie, do it!!!
import urllib.request
import urllib.parse

def create_request(page):
    url = 'https://movie.douban.com/j/chart/top_list?type=5&interval_id=100%3A90&action=unwatched&'
    data = {
        'start': (page - 1) * 20,
        'limit': 20
    }
    data = urllib.parse.urlencode(data)
    url = url + data
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/94.0.4606.61 Safari/537.36'
    }
    request = urllib.request.Request(url=url, headers=headers)
    return request

def get_content(request):
    response = urllib.request.urlopen(request)
    content = response.read().decode('utf-8')
    return content

def down_load(page, content):
    # with open('douban_' + str(page) + '.json', 'w', encoding='utf-8') as fp:
    #     fp.write(content)
    fp = open('douban_' + str(page) + '.json', 'w', encoding='utf-8')
    fp.write(content)
    fp.close()

if __name__ == '__main__':
    start_page = int(input('Please enter the starting page number: '))
    end_page = int(input('Please enter the ending page number: '))
    for page in range(start_page, end_page + 1):
        request = create_request(page)
        # Get the response data
        content = get_content(request)
        down_load(page, content)
4. Result display
Convert it to a more readable form with Ctrl + Alt + L (PyCharm's Reformat Code).
3, A final word
After this series of operations you have successfully crawled Douban's movie data. But when you happily return to the Douban movie site, you suddenly find you can't get in: you have to log in first!
Douban: excuse me, where are your manners? No more freeloading!
Hahaha, you have been blacklisted by Douban. Their defenses are pretty sharp!
After all, for every crawler there is an anti-crawler measure, round after round!
OK, let's say goodbye. See you next time!
Recommended previous articles:
Python crawler: Urllib usage collection, a one-click easy entry to crawlers
Well, that's all I want to share with you today.
If you liked it, don't hold back your one-click triple support~