Introduction to Web Crawlers

Preface

This blog records my learning process. It will be updated and corrected continuously, and corrections are welcome. The case studied in this article is: crawling Douban movie rating and review information for data analysis.

1, What you need to master before learning crawlers

We open Douban's official website; we want to crawl information such as movie ratings and reviews.

Extracting this information is very useful. For example, from the review text you can extract keywords, analyze them, and build a word cloud.



To crawl this information, we need the following prerequisites:
① The ability to write the relevant Python code
② The ability to parse HTML
③ The ability to store data in CSV files and SQL databases

2, Crawl data

2.1 How to store data

Since we are just getting started, we will store data in CSV (comma-separated values) files. The code for storing data in a CSV file is as follows:

import csv  # The csv module is used to read and store data

# Create a new list to store some information
goods = [[1, 'jack', 18],
         [2, 'Lucy', 19],
         [3, 'Lily', 18],
         [4, 'Tom', 20]
         ]

# The normal pattern with open() is: open the file, operate on it, then close it.
# f = open('persons.csv')
# ......
# f.close()
# However, it is easy to forget to close the file, which leaks the resource, so use "with" instead.
# Do the file operations inside the "with" block; when the block is exited, the file is closed automatically.

# The default read/write mode is text mode ('t')
# For pictures, music, and videos, write in binary mode: mode='wb'
# For plain text, write in text mode: mode='wt'

# On Windows, a blank line appears between written rows unless newline='' is passed to open(). See the link below for the specific reason.

# Write content to file
with open('persons.csv', mode='wt', newline='') as f:
    w_file = csv.writer(f)    # Wrap the file object in a csv writer
    w_file.writerows(goods)   # Write content
    print('Write complete')

# Read the contents of the file
with open('persons.csv', mode='r') as f:
    r_file = csv.reader(f)
    for row in r_file:
        print(row)

Windows users should note that blank lines appear between rows when writing data to a CSV file. For the underlying reason, see: Thoughts on blank lines when writing two-dimensional list data with the writerows() function of Python's csv module.
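Besides writerows() with plain lists, the csv module also offers DictWriter, which writes a header row and keeps the columns labeled. A minimal sketch (the file name and field names here are made up for illustration):

```python
import csv

people = [
    {"id": 1, "name": "jack", "age": 18},
    {"id": 2, "name": "Lucy", "age": 19},
]

# newline='' avoids the extra blank lines on Windows, as above
with open('persons_dict.csv', mode='wt', newline='') as f:
    writer = csv.DictWriter(f, fieldnames=["id", "name", "age"])
    writer.writeheader()          # first row: id,name,age
    writer.writerows(people)

# Read the rows back; DictReader yields one dict per row (values are strings)
with open('persons_dict.csv', mode='r', newline='') as f:
    rows = list(csv.DictReader(f))
print(rows)
```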

2.2 Obtaining HTML network data

To fetch network data, we use the requests module. The official documentation link: requests official documentation



Next we need the browser's developer tools, because they let us inspect the details of requests and responses. Right-click on the page and select "Inspect".



The following panel will pop up; the Elements tab shows the page's HTML.



We select the Network tab to view the details of the requests and responses.



The Doc filter contains the important documents; the interface is shown below. In the General section, Request URL is the address we are visiting, and Request Method is the HTTP method of the request, such as GET or POST.



We can inspect the Request Headers to see what we sent to the server so that it would respond and let us in. The most important header is User-Agent, which identifies the client making the request; by reading the User-Agent, the server believes it is being visited by a browser.
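You can also check locally which headers requests would actually send by preparing a request without sending it. This is a small sketch of my own, not part of the original walkthrough; no network traffic is generated:

```python
import requests

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.159 Safari/537.36"
}

# Build the request object but do not send it
req = requests.Request('GET', 'https://movie.douban.com/', headers=headers)
prepared = req.prepare()

# These are the headers that would go over the wire
print(prepared.headers['User-Agent'])
print(prepared.url)
```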



When we make the request, what we need to save is the body of the response.



The code is as follows:

import locale    # Used to check the system's default encoding
import requests

# A dictionary that holds user agent information
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.159 Safari/537.36 OPR/78.0.4093.231"
}

# Use requests to send get type requests, and disguise requests as initiated by the browser
url = 'https://movie.douban.com/explore#!type=movie&tag=hot&sort=recommend&page_limit=20&page_start=0'
response = requests.get(url, headers=headers)      # The headers parameter takes a dict
code = response.status_code                        # Status code; 200 means success

# Take a look at the encoding method of the system. Windows defaults to GBK.
# print(locale.getpreferredencoding(False))

if code == 200:
    # Obtained response data
    data = response.text
    # Save the response data to the douban.html file
    # open() uses the system's default encoding, but PyCharm opens HTML files as UTF-8, so pass encoding="utf-8" explicitly
    with open('douban.html', mode='w', encoding="utf-8") as f:
        f.write(data)
else:
    print('Wrong request!')

After saving the returned raw HTML, let's open it in a browser.

We find the result below: the movie information is not displayed, which means the data we want is not in this response.

2.3 Obtaining JSON network data

So we go back and analyze. Since our data is not in the Doc responses, it may be loaded asynchronously. How do we verify that? The steps are as follows:

The URLs of several requests we compared are as follows:
Request URL: https://movie.douban.com/j/search_subjects?type=movie&tag=%E8%B1%86%E7%93%A3%E9%AB%98%E5%88%86&sort=time&page_limit=20&page_start=0

Request URL: https://movie.douban.com/j/search_subjects?type=movie&tag=%E8%B1%86%E7%93%A3%E9%AB%98%E5%88%86&sort=time&page_limit=20&page_start=20

Request URL: https://movie.douban.com/j/search_subjects?type=movie&tag=%E8%B1%86%E7%93%A3%E9%AB%98%E5%88%86&sort=time&page_limit=20&page_start=40


We can see that only the final page_start parameter differs. Opening the URL with page_start=20 in a browser gives the interface below, which returns JSON.
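Since only page_start changes between requests, the page URLs can be generated in a loop. A sketch of building those URLs (no requests are sent here; the tag value is the URL-encoded one from the code further below):

```python
# The query string with a placeholder for page_start
base = ('https://movie.douban.com/j/search_subjects'
        '?type=movie&tag=%E7%83%AD%E9%97%A8&sort=recommend'
        '&page_limit=20&page_start={}')

# Each page returns 20 movies, so step by 20
urls = [base.format(start) for start in range(0, 60, 20)]
for u in urls:
    print(u)
```

Each generated URL could then be fetched with requests.get() exactly as in the code below.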



To summarize:
A Document request returns HTML-format data:

<html>
......
<html>

An XHR (Ajax) request returns JSON-format data:

{
    "data": ["", ""]
}

If the JSON looks messy, you can paste it into bejson for validation and pretty-printing.
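Instead of an online tool like bejson, Python's built-in json module can also pretty-print the data. A minimal sketch; the "subjects" structure here is a tiny made-up stand-in, not real Douban output:

```python
import json

# A tiny stand-in for the kind of structure the API returns
raw = '{"subjects": [{"title": "Movie A", "rate": "8.5"}, {"title": "Movie B", "rate": "7.9"}]}'

data = json.loads(raw)                                   # str -> dict
pretty = json.dumps(data, indent=4, ensure_ascii=False)  # dict -> indented str
print(pretty)
```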

The code for obtaining json data is as follows:

import requests

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.159 Safari/537.36 OPR/78.0.4093.231"
}

url = 'https://movie.douban.com/j/search_subjects?type=movie&tag=%E7%83%AD%E9%97%A8&sort=recommend&page_limit=20&page_start=0'
response = requests.get(url, headers=headers)
code = response.status_code

if code == 200:
    data = response.json()    # Parse the JSON response body into a Python dict
    print(type(data))         # <class 'dict'>
    data = str(data)          # Convert the dict to a str so it can be written to a text file
    with open('movies.txt', mode='w', encoding="utf-8") as f:
        f.write(data)
else:
    print('Wrong request!')
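Note that converting the dict with str() produces Python-literal syntax (single quotes), which is not valid JSON. If you want the saved file to stay parseable as JSON, json.dump is the better choice. A sketch, with a made-up stand-in for response.json():

```python
import json

# Stand-in for the dict returned by response.json()
data = {'subjects': [{'title': 'Movie A', 'rate': '8.5'}]}

with open('movies.json', mode='w', encoding='utf-8') as f:
    json.dump(data, f, ensure_ascii=False, indent=4)

# The file can now be loaded back as real JSON
with open('movies.json', mode='r', encoding='utf-8') as f:
    restored = json.load(f)
print(restored == data)    # True
```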

2.4 Obtaining picture data

If we want to extract a movie's poster image, first examine the JSON after formatting it with bejson.

We can see that the cover field holds the address of the movie poster. Copying that address into a browser confirms that it is where the poster is stored.



We set that address as the URL to crawl; the code is as follows:

import requests

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.159 Safari/537.36 OPR/78.0.4093.231"
}

url = "https://img3.doubanio.com/view/photo/s_ratio_poster/public/p2615830700.webp"
response = requests.get(url, headers=headers)
code = response.status_code

if code == 200:
    # The response data obtained is binary
    data = response.content
    with open('picture.jpg', mode='wb') as f:
        f.write(data)
else:
    print('Wrong request!')

The downloaded picture is as follows:
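To download every poster rather than just one, you would walk the parsed JSON, take each cover URL, and derive a local file name from it. A sketch of the URL-to-filename step; the sample data below is made up (the "subjects"/"cover" field names are my assumption about the structure) and no downloads are performed:

```python
import os
from urllib.parse import urlsplit

# Made-up stand-in for the parsed JSON response
data = {
    "subjects": [
        {"title": "Movie A", "cover": "https://img3.doubanio.com/view/photo/s_ratio_poster/public/p2615830700.webp"},
        {"title": "Movie B", "cover": "https://img1.doubanio.com/view/photo/s_ratio_poster/public/p1234567890.webp"},
    ]
}

filenames = []
for subject in data["subjects"]:
    # Last path component of the URL, e.g. p2615830700.webp
    name = os.path.basename(urlsplit(subject["cover"]).path)
    filenames.append(name)
    # Here you would call requests.get(subject["cover"], headers=headers)
    # and write response.content to the file, as in the code above.

print(filenames)
```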

3, Extract data

After crawling the various kinds of data, we begin extracting what we need from them. We can use XPath, which can parse both HTML and XML. For detailed XPath syntax, see: XPath syntax
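Before applying XPath to a full page, here is a tiny self-contained warm-up with lxml on a hand-written HTML snippet (the class name and titles are invented for illustration):

```python
from lxml import etree

html_text = """
<html><body>
  <ul class="movies">
    <li><a><span>Title One</span></a></li>
    <li><a><span>Title Two</span></a></li>
  </ul>
</body></html>
"""

# Parse the HTML string into an element tree
html = etree.HTML(html_text)

# Select the text of every span under li/a inside the ul with class "movies"
titles = html.xpath('//ul[@class="movies"]/li/a/span/text()')
print(titles)
```

The expressions used on the real pages below follow exactly this pattern, just with longer paths.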

3.1 Extract Baidu hot-search information

The Baidu HTML file, crawled the same way as before, is as follows:

Opened in a browser, it looks like this. How do we extract the hot-search information from it?



We can see that the location of hot search information is as follows:

The remaining hot-search entries (click "change a batch" to see them) are here:



The code for extracting hot search information is as follows:

import requests
from lxml import etree

headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.159 Safari/537.36 OPR/78.0.4093.231"
}
url = 'https://www.baidu.com'

response = requests.get(url, headers=headers)
code = response.status_code

if code == 200:
    data = response.text
    # Give the obtained network data to etree for analysis
    html = etree.HTML(data)
    hotsearchs = html.xpath('//ul[@class="s-hotsearch-content"]/li/a/span[2]/text()')
    print(hotsearchs)
else:
    print('Wrong request!')

The result is as follows:


3.2 Extract recent Douban same-city activity information

The recent Douban same-city activity information we want to extract is shown below:



We want to extract the activity title and time. The title is located as follows:



The activity time is located as follows:



The code to extract the titles and times is as follows:

import requests
from lxml import etree
import re
import csv

# 1. Send a request and get the response data
headers = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.159 Safari/537.36 OPR/78.0.4093.231"
}
url = 'https://beijing.douban.com/events/week-all'
response = requests.get(url, headers=headers)

if response.status_code == 200:
    data = response.text
    # 2. Analyze data
    html = etree.HTML(data)
    titles = html.xpath('//ul[@class="events-list events-list-pic100 events-list-psmall"]/li/div[2]/div/a/span/text()')
    print(titles)
    times = html.xpath('//ul[@class="events-list events-list-pic100 events-list-psmall"]/li/div[2]/ul/li[1]/text()')
    print(times)

    # Remove the '\n' characters and spaces from the entries in times
    # Prepare a list times1 to hold the cleaned data
    times1 = []
    # Step through times taking every fourth text node starting at index 1, which is where the time text lands
    for t in range(1, len(times), 4):
        ti = re.sub(r'\s+', '', times[t])
        times1.append(ti)
        print(ti)
    # 3. Store the data
    rows = zip(titles, times1)    # Pair each title with its cleaned time
    with open('tongcheng.csv', mode='w', newline='', encoding='utf-8') as f:
        w_file = csv.writer(f)
        w_file.writerows(rows)
else:
    print('Wrong request!')
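The cleanup above relies on re.sub with the pattern \s+, which removes every run of whitespace (spaces, tabs, newlines) in one pass. A standalone illustration with a made-up raw string shaped like the scraped time text:

```python
import re

# Example of what one raw text node might look like after xpath extraction
raw = '\n                Time: 2021-09-26 19:00 \n            '

# \s+ matches any run of whitespace; replacing with '' strips it all
cleaned = re.sub(r'\s+', '', raw)
print(cleaned)    # Time:2021-09-2619:00
```

Note this also removes the spaces inside the text itself; if you only want to trim the edges, str.strip() is the gentler option.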

Tags: Python Database html crawler xpath

Posted on Sun, 26 Sep 2021 21:24:54 -0400 by dub