Using Python to crawl the Douban movie Top 250 list and the corresponding URLs

Python has the powerful requests and Beautiful Soup libraries, which make it easy to crawl data from websites. I have recently taught myself some basic web-crawler techniques, so below I will take the Douban movie Top 250 (https://movie.douban.com/top250) as an example and crawl some of its data.


(1) Observing the structure of the web page

1. First, open the website (https://movie.douban.com/top250) in a browser and bring up the developer tools. I use Chrome, so I can open them directly with the shortcut Ctrl+Shift+I. I also recommend Chrome, which I find more convenient for this than other browsers.
Download address: http://wie.rwbzx.com/
2. After opening the developer tools, click the element-inspection button in the upper-right corner.

3. With the element picker active, move the cursor to any content on the page and click it; the developer tools will automatically locate and highlight the HTML code corresponding to that content. This makes it easy to find the code we want to crawl. Since the data we need is each movie's name and its URL, start by clicking on the first movie title, The Shawshank Redemption. We can see that both the name and the URL are contained in a div with class="hd": the URL is the href of the a element under that div, and the name is the text of the first span with class="title" inside that a element (a simplified sketch of this structure follows this list).

4. Repeat the above steps on the other movie titles; they all share the same structure. That completes the observation stage.
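
Before writing the crawler, it may help to see this structure in miniature. The sketch below runs Beautiful Soup (the library introduced in the next section) on a hand-written HTML fragment; the fragment is simplified from the real page, and the URL in it is only illustrative:

from bs4 import BeautifulSoup

# A simplified fragment mimicking the structure of one movie entry on the page
html = '''
<div class="hd">
    <a href="https://movie.douban.com/subject/1292052/">
        <span class="title">The Shawshank Redemption</span>
    </a>
</div>
'''

soup = BeautifulSoup(html, 'html.parser')
div = soup.find('div', class_='hd')    # the div wrapping one movie entry
tag = div.find('a')                    # the a element holding the URL
print(tag['href'])                     # the movie's page URL
print(tag.find(class_='title').text)   # the first span with class='title': the name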

(2) Crawling a single web page

Having finished observing the page's code, we can start crawling the data in earnest. For this we need Python's requests and Beautiful Soup libraries. My IDE is PyCharm; if you have any questions about installing requests or Beautiful Soup, please refer to my previous blog post. Here is the code for crawling a single page.

import requests
from bs4 import BeautifulSoup

# Set a request header so the site's anti-crawler check lets us through
headers = {
    'user-agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.132 Safari/537.36'
}
# Request the page
res = requests.get('https://movie.douban.com/top250', headers=headers)
# Parse the page into a BeautifulSoup object
soup = BeautifulSoup(res.text, 'html.parser')
# Find every div with class='hd'; each one wraps a single movie entry
items = soup.find_all('div', class_='hd')
for i in items:
    # The a element under the div holds the link
    tag = i.find('a')
    # The first span with class='title' is the movie name
    name = tag.find(class_='title').text
    # The href attribute is the movie's URL
    link = tag['href']
    print(name, link)
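
One optional hardening step that is not in the original code: requests provides raise_for_status(), which raises requests.HTTPError when the response is a 4xx/5xx error, so a rejected request fails loudly instead of silently parsing an error page. A minimal sketch of the change:

# Request the page, then fail fast if Douban rejected the request
res = requests.get('https://movie.douban.com/top250', headers=headers)
res.raise_for_status()  # raises requests.HTTPError on 4xx/5xx responses
soup = BeautifulSoup(res.text, 'html.parser')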

(3) Crawling the entire list

The above code only crawls a single page. To crawl the whole list, we go back to the homepage of the Douban movie Top 250 (https://movie.douban.com/top250). Observing again, we find that the list of 250 movies is split across 10 pages of 25 movies each, so we click page 2, page 3, and so on, and find that the URL changes.

Compared with the URL of the first page, the URL of the second page has an extra part at the end:

?start=25&filter=

This part is called the query string; here we can think of it as a special tag, with each sub-page corresponding to a specific value. Clicking through the second, third, and fourth pages in turn, we find that only the value after start= changes, and that it increases by 25 from one page to the next. That 25 is exactly the number of movies per page, so we can conclude that the value after start= equals 25 * (page number - 1). With this conclusion, we can start crawling the whole list. Here is the full code:

import requests
import time
from bs4 import BeautifulSoup

# Wrap the single-page crawl in a function so it can be called once per page
def get_douban_movie(url):
    # Set a request header so the site's anti-crawler check lets us through
    headers = {
        'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.130 Safari/537.36'
    }
    # Request the page passed in as an argument
    res = requests.get(url, headers=headers)
    # Parse the page into a BeautifulSoup object
    soup = BeautifulSoup(res.text, 'html.parser')
    # Find every div with class='hd'; each one wraps a single movie entry
    items = soup.find_all('div', class_='hd')
    for i in items:
        # The a element under the div holds the link
        tag = i.find('a')
        # The first span with class='title' is the movie name
        name = tag.find(class_='title').text
        # The href attribute is the movie's URL
        link = tag['href']
        print(name, link)

url = 'https://movie.douban.com/top250?start={}&filter='
# Build the URLs of all 10 pages
urls = [url.format(num * 25) for num in range(10)]
for item in urls:
    get_douban_movie(item)
    # Pause for 1 second so we don't get blocked for requesting too fast
    time.sleep(1)

When the full script is run, it prints each of the 250 movie titles followed by its URL.
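
If you would rather save the results to a file than print them, here is one possible variant (a sketch of my own, not part of the original tutorial) that collects the title/URL pairs and writes them to a CSV file; the file name douban_top250.csv is an arbitrary choice:

import csv
import time

import requests
from bs4 import BeautifulSoup

headers = {
    'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/79.0.3945.130 Safari/537.36'
}

def get_douban_movie(url):
    # Fetch and parse one page, returning (title, link) pairs instead of printing
    res = requests.get(url, headers=headers)
    soup = BeautifulSoup(res.text, 'html.parser')
    rows = []
    for item in soup.find_all('div', class_='hd'):
        tag = item.find('a')
        rows.append((tag.find(class_='title').text, tag['href']))
    return rows

url = 'https://movie.douban.com/top250?start={}&filter='
with open('douban_top250.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.writer(f)
    writer.writerow(['title', 'link'])
    for num in range(10):
        writer.writerows(get_douban_movie(url.format(num * 25)))
        # Pause between pages, as in the original script
        time.sleep(1)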

That's all the code and the main steps. I hope this helps you, thank you!


Tags: Python Google Pycharm Mac

Posted on Thu, 06 Feb 2020 05:54:51 -0500 by thirdeye