Python and Web Crawlers


This article was originally written by CDFMLR and is published on my personal home page https://clownote.github.io as well as on CSDN. I cannot guarantee the layout on CSDN; please visit clownote for a better reading experience.

Basic Principles of Crawlers

Crawlers are automated programs that retrieve web pages and extract and save information.

It can be said that everything we see in a browser can be crawled (including web pages rendered with JavaScript).

Crawlers solve the following main problems:

Get Web Page

Construct a request and send it to the server, then receive the response and parse it out.

We can use urllib, requests, and other libraries to implement the HTTP request operations. Requests and responses can be represented by the data structures these libraries provide; once the response is obtained, we only need to parse its body to get the source code of the web page.
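
For example, with requests this step takes only a few lines (a minimal sketch; https://example.com is just a placeholder URL):

import requests

# Minimal sketch: fetch a page and read its HTML source.
response = requests.get('https://example.com')
print(response.status_code)   # 200 if the request succeeded
html = response.text          # the response body, i.e. the page source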

Extract Information

Analyze the source code of the web page to extract the data we want.

The most common method is regular expression extraction. It is an all-purpose approach, but constructing the regular expressions is complicated and error-prone.
Using libraries such as Beautiful Soup, pyquery, and lxml, we can extract web page information, such as node attributes and text values, quickly and efficiently.
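
For instance, the same field can be pulled out either with a regular expression or with Beautiful Soup (a minimal sketch; the HTML fragment is made up for illustration):

import re
from bs4 import BeautifulSoup

# A made-up fragment just for illustration.
html = '<p class="releasetime">Show time: 1993-01-01</p>'

# Regular expression: flexible, but easy to get wrong.
print(re.findall(r'Show time:\s*(.*?)</p>', html))   # ['1993-01-01']

# Beautiful Soup: locate the node, then read its text.
soup = BeautifulSoup(html, 'lxml')
print(soup.find('p', class_='releasetime').get_text())   # Show time: 1993-01-01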

Save data

Save the extracted data somewhere for later use.

Automated Programs

The crawler does all of this instead of a person.
During crawling, it also performs operations such as exception handling and error retries to keep the crawl running continuously and efficiently, as the sketch below illustrates.
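
A hedged sketch of what such an error retry might look like around requests.get (the retry count and delay here are arbitrary choices, not taken from any program in this article):

import time
import requests

def get_with_retry(url, retries=3, delay=1):
    # Fetch a URL, retrying a few times on failure (illustrative sketch only).
    for attempt in range(1, retries + 1):
        try:
            response = requests.get(url, timeout=10)
            response.raise_for_status()   # raise on 4xx/5xx status codes
            return response.text
        except requests.RequestException as e:
            print('Attempt %d failed: %s' % (attempt, e))
            time.sleep(delay)             # wait a bit before retrying
    return ''                             # give up after the last attempt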

Crawler Practice - Grab the movie rankings

Crawler practice: use requests and regular expressions to capture the content of the Maoyan Movies TOP100 list.

Target

We intend to extract the movie name, release time, rating, poster image, and other information for each entry of the Maoyan Movies TOP100 list, and save the extracted results to a file.

Get ready

  • System environment: macOS High Sierra 10.13.6
  • Development language: Python 3.7.2 (default)
  • Third party library: Requests

Analysis

Target Site: https://maoyan.com/board/4

After opening the page, we can see that the useful information displayed for each entry is the movie name, the starring actors, the release time, the region, the rating, and the poster image.

Then, let's look at the HTML source of the top item; we will design the regular expression against it in a moment:

<dd>
        <i class="board-index board-index-1">1</i>
        <a href="/films/1203" title="Farewell to my concubine" class="image-link" data-act="boarditem-click" data-val="{movieId:1203}">
                <img src="//s0.meituan.net/bs/?f=myfe/mywww:/image/loading_2.e3d934bf.png" alt=""
                        class="poster-default" />
                <img data-src="https://p1.meituan.net/movie/20803f59291c47e1e116c11963ce019e68711.jpg@160w_220h_1e_1c"
                        alt="Farewell to my concubine" class="board-img" />
        </a>
        <div class="board-item-main">
                <div class="board-item-content">
                        <div class="movie-item-info">
                                <p class="name"><a href="/films/1203" title="Farewell to my concubine" data-act="boarditem-click"
                                                data-val="{movieId:1203}">Farewell to my concubine</a></p>
                                <p class="star">
                                        Actor: Leslie Cheung,Zhang Fengyi,Gong Li
                                </p>
                                <p class="releasetime">Show time: 1993-01-01</p>
                        </div>
                        <div class="movie-item-number score-num">
                                <p class="score"><i class="integer">9.</i><i class="fraction">6</i></p>
                        </div>

                </div>
        </div>

</dd>

Let's see what happens when we flip pages:
the URL changes from https://maoyan.com/board/4 to https://maoyan.com/board/4?offset=10.
So the only difference is the extra ?offset=10 parameter.
On the next page it becomes offset=20.
In other words, offset grows by 10 for every page turned (each page shows exactly 10 entries, which makes sense).
In fact, we can even set offset to 0 (the first page) or to any multiple of 10 up to 90, one value for each page of the TOP100.
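
So the ten page URLs can be generated directly from the offset (a small sketch of the idea; the crawler below builds its URLs the same way):

base = 'https://maoyan.com/board/4'

# offset goes 0, 10, 20, ..., 90 -- one value per page of 10 entries.
for offset in range(0, 100, 10):
    print(base + '?offset=' + str(offset))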

Design

We are now designing a crawler program that can accomplish the target.

This crawl should have several parts:

  • Grab pages (and be careful with paging issues): You can request them with requests.get(), preferably forging a set of headers.
  • Regular extraction (matching rank, picture address, movie name, starring actors, release time, and rating): use re.findall() and an appropriate regular expression to extract the information.
  • Writing to a file (saving information in JSON format): involves json.dumps() and file writing

Now we need to design a particularly critical regular expression:

'<dd>.*?board-index.*?>(.*?)</i>.*?data-src="(.*?)@.*?title="(.*?)".*?Actor:\s*(.*?)\s*</p>.*?Show time:\s*(.*?)</p>.*?integer">(.*?)</i>.*?fraction">(.*?)</i></p>'
The groups are matched in this order: (rank, picture address, name, actors, release time, score integer part, score decimal part).
re.S is required so that . also matches newlines, because these fields span several lines of HTML.
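
Here is a minimal check of the re.S point against a trimmed-down fragment of the <dd> block shown earlier (the fragment and pattern are simplified for the demo):

import re

# A trimmed-down fragment of the <dd> block above (simplified for the demo).
html = '''<dd>
        <i class="board-index board-index-1">1</i>
        <p class="releasetime">Show time: 1993-01-01</p>
</dd>'''

pattern = r'<dd>.*?board-index.*?>(.*?)</i>.*?Show time:\s*(.*?)</p>'

print(re.findall(pattern, html))        # [] -- without re.S, '.' stops at newlines
print(re.findall(pattern, html, re.S))  # [('1', '1993-01-01')] -- '.' now spans lines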

Implementation

Let's now follow the design to implement the first version of the program:

import re
import json
import time

import requests

url = 'https://maoyan.com/board/4'
filename = './movies.txt'
pattern = r'<dd>.*?board-index.*?>(.*?)</i>.*?data-src="(.*?)@.*?title="(.*?)".*?Actor:\s*(.*?)\s*</p>.*?Show time:\s*(.*?)</p>.*?integer">(.*?)</i>.*?fraction">(.*?)</i></p>'

headers = {
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/12.0.3 Safari/605.1.15',
        'Accept-Language': 'zh-cn'
        }

def get_page(url):   # Grab the page and return the html string
    print('\tGetting...')
    try:
        response = requests.get(url, headers=headers)
        return response.text
    except Exception as e:
        print('[Error]', e)
        return ''


def extract(html):  # Extract fields with the regex; returns a list of dicts
    print('\tExtracting...')
    raws = re.findall(pattern, html, re.S)   # [(Rank, Picture Address, Name, Actor, Show Time, Score Integer, Score Decimal),...]
    result = []
    for raw in raws:
        dc = {                      # The order is adjusted here
                'index': raw[0],
                'title': raw[2],
                'stars': raw[3],
                'otime': raw[4],
                'score': raw[5] + raw[6],   # Combining Integers, Decimals
                'image': raw[1]
                }
        result.append(dc)

    return result
    

def save(data):      # write file
    print('\tSaving...')
    with open(filename, 'a', encoding='utf-8') as f:
        for i in data:
            f.write(json.dumps(i, ensure_ascii=False) + '\n')


if __name__ == '__main__':
    for i in range(0, 100, 10):     # Page Flip
        target = url + '?offset=' + str(i)
        print('[%s%%](%s)' % (i, target))
        page = get_page(target)
        data = extract(page)
        save(data)
        time.sleep(0.5)     # Prevent overdense requests from being blocked

    print('[100%] All Finished.\n Results in', filename)

Debugging

Run the program and if everything goes well, we'll get results:

{"index": "1", "title": "Farewell to my concubine", "stars": "Leslie Cheung,Zhang Fengyi,Gong Li", "otime": "1993-01-01", "score": "9.6", "image": "https://p1.meituan.net/movie/20803f59291c47e1e116c11963ce019e68711.jpg"}
{"index": "2", "title": "The Shawshank Redemption", "stars": "Tim·Robins,morgan·Freeman,Bob·Gunton", "otime": "1994-10-14(U.S.A)", "score": "9.5", "image": "https://p0.meituan.net/movie/283292171619cdfd5b240c8fd093f1eb255670.jpg"}
{"index": "3", "title": "Roman Holiday", "stars": "gregory·Pike,Audrey Hepburn·Hepburn,Eddy·Albert", "otime": "1953-09-02(U.S.A)", "score": "9.1", "image": "https://p0.meituan.net/movie/54617769d96807e4d81804284ffe2a27239007.jpg"}
{"index": "4", "title": "Léon", "stars": "Give Way·Renault,Gary·Odeman,Natalie·Portman", "otime": "1994-09-14(France)", "score": "9.5", "image": "https://p0.meituan.net/movie/e55ec5d18ccc83ba7db68caae54f165f95924.jpg"}
{"index": "5", "title": "Titanic", "stars": "Leonado·leonardo dicaprio,Kate·winslet,Billy·Zane", "otime": "1998-04-03", "score": "9.6", "image": "https://p1.meituan.net/movie/0699ac97c82cf01638aa5023562d6134351277.jpg"}

At first glance it looks fine: we have got what we need.
However, a closer look reveals a couple of problems:

  1. The stars field usually contains several people, so we had better split it into a list or tuple.
  2. Each record is written as a separate JSON line, which makes the output awkward to pass around and use as a whole.

To solve the first problem, we can add a step to extract() that splits the string. The second problem requires us to collect the results from all pages into a single structure and write it to the file in one go.

Modify source program:

Add a new function to process the actor information and get a list:

def stars_split(st):
    return st.split(',')

Modify extract() to add a call to stars_split:

def extract(html):  # Extract fields with the regex; returns a list of dicts
    print('\tExtracting...')
    raws = re.findall(pattern, html, re.S)   # [(Rank, Picture Address, Name, Actor, Show Time, Score Integer, Score Decimal),...]
    result = []
    for raw in raws:
        dc = {                      # The order is adjusted here
                'index': raw[0],
                'title': raw[2],
                'stars': stars_split(raw[3]),   # [Modify]: Separate the actors
                'otime': raw[4],
                'score': raw[5] + raw[6],       # Combining Integers, Decimals
                'image': raw[1]
                }
        result.append(dc)

    return result

Add a new global variable and function to integrate the results:

result = {'top movies': []}

def merge(data):
    print('\tMerging...')
    result['top movies'] += data

Modify save:

def save(data):      # write file
    print('Saving...')
    with open(filename, 'w', encoding='utf-8') as f:   # 'w': overwrite, the file now holds a single JSON document
        f.write(json.dumps(data, ensure_ascii=False))

Modify the program framework:

if __name__ == '__main__':
    for i in range(0, 100, 10):     # Page Flip
        target = url + '?offset=' + str(i)
        print('[%s%%](%s)' % (i, target))
        page = get_page(target)
        data = extract(page)
        merge(data)
        time.sleep(0.5)     # Prevent overdense requests from being blocked
        
    save(result)
    print('[100%] All Finished.\n Results in', filename)

The integrated code:

import re
import json
import time

import requests

url = 'https://maoyan.com/board/4'
result = {'top movies': []}
filename = './movies.json'      # (Use a new file name for the JSON output; the file is now overwritten on each run)

pattern = r'<dd>.*?board-index.*?>(.*?)</i>.*?data-src="(.*?)@.*?title="(.*?)".*?Actor:\s*(.*?)\s*</p>.*?Show time:\s*(.*?)</p>.*?integer">(.*?)</i>.*?fraction">(.*?)</i></p>'

headers = {
        'Accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,*/*;q=0.8',
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/12.0.3 Safari/605.1.15',
        'Accept-Language': 'zh-cn'
        }


def get_page(url):   # Grab the page and return the html string
    print('\tGetting...')
    try:
        response = requests.get(url, headers=headers)
        return response.text
    except Exception as e:
        print('[Error]', e)
        return ''


def stars_split(st):
    return st.split(',')


def extract(html):  # Extract fields with the regex; returns a list of dicts
    print('\tExtracting...')
    raws = re.findall(pattern, html, re.S)   # [(Rank, Picture Address, Name, Actor, Show Time, Score Integer, Score Decimal),...]
    result = []
    for raw in raws:
        dc = {                      # The order is adjusted here
                'index': raw[0],
                'title': raw[2],
                'stars': stars_split(raw[3]),   # Separate actors
                'otime': raw[4],
                'score': raw[5] + raw[6],       # Combining Integers, Decimals
                'image': raw[1]
                }
        result.append(dc)

    return result
    

def merge(data):
    print('\tMerging...')
    result['top movies'] += data
    
    
def save(data):      # write file
    print('Saving...')
    with open(filename, 'w', encoding='utf-8') as f:   # 'w': overwrite, the file now holds a single JSON document
        f.write(json.dumps(data, ensure_ascii=False))


if __name__ == '__main__':
    for i in range(0, 100, 10):     # Page Flip
        target = url + '?offset=' + str(i)
        print('[%s%%](%s)' % (i, target))
        page = get_page(target)
        data = extract(page)
        merge(data)
        time.sleep(0.5)     # Prevent overdense requests from being blocked
        
    save(result)
    print('[100%] All Finished.\n Results in', filename)

Run the modified program and get new results:

{"top movies": [{"index": "1", "title": "Farewell to my concubine", "stars": ["Leslie Cheung", "Zhang Fengyi", "Gong Li"], "otime": "1993-01-01", "score": "9.6", "image": "https://p1.meituan.net/movie/20803f59291c47e1e116c11963ce019e68711.jpg"}, {"index": "2", "title": "The Shawshank Redemption", "stars": ["Tim·Robins", "morgan·Freeman", "Bob·Gunton"], "otime": "1994-10-14(U.S.A)", "score": "9.5", "image": "https://p0.meituan.net/movie/283292171619cdfd5b240c8fd093f1eb255670.jpg"}, ..., {"index": "100", "title": "Totoro", "stars": ["Qin Lan", "Mine weight", "Samurai Ishimoto"], "otime": "2018-12-14", "score": "9.2", "image": "https://p0.meituan.net/movie/c304c687e287c7c2f9e22cf78257872d277201.jpg"}]}

That's ideal.
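
Since the output is now a single JSON document, it can be read back in one call later (a small usage sketch, using the file name from the code above):

import json

# Load the saved results back into a Python dict.
with open('./movies.json', encoding='utf-8') as f:
    result = json.load(f)

print(len(result['top movies']))          # 100
print(result['top movies'][0]['title'])   # Farewell to my concubine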

Wrap-up

The crawling project is complete.
To summarize: we used requests.get() with forged headers to make the requests, parsed the pages with re.findall() and a regular expression, rearranged the extracted fields, and saved the results in JSON format.

In fact, with a few modifications this project can crawl many other movie charts; for example, we have adapted it to crawl the Douban Top 250, and only a few changes were needed.

We have shown the development process of this project step by step, from setting the goal to the finished program. This way of working applies to many projects and is worth understanding and practicing.

Ajax Data Crawling

Many websites use Ajax. Here's how to crawl pages rendered by Ajax.

Introduction to Ajax

Ajax is short for Asynchronous JavaScript and XML.

It is a technique that uses JavaScript to exchange data with the server and update parts of a page without reloading the page or changing its URL.

The process by which a web page sends an Ajax request and updates itself can be broken down into three steps:

  • Send Request
  • Parse content
  • Render Web Page

Send Request

var xmlhttp;
if (window.XMLHttpRequest) {
    // code for IE7+, Firefox, Chrome, Opera, Safari
    xmlhttp=new XMLHttpRequest();
} else {// code for IE6, IE5
    xmlhttp=new ActiveXObject("Microsoft.XMLHTTP");
}
xmlhttp.onreadystatechange=function() {
    if (xmlhttp.readyState==4 && xmlhttp.status==200) {
        document.getElementById("myDiv").innerHTML=xmlhttp.responseText;
    }
}
xmlhttp.open("POST","/ajax/",true);
xmlhttp.send();

This is Ajax at its lowest level, implemented directly in JavaScript: a new XMLHttpRequest object is created, its onreadystatechange property is assigned a callback that listens for state changes, and then open() and send() are called to send a request to the server. When the server returns a response, the onreadystatechange callback is triggered, and the response content is parsed inside it.

Parse content

Once the response is received, the callback assigned to onreadystatechange is triggered, and the content of the response can be read from the responseText property of xmlhttp.

The returned content may be HTML or JSON; it only needs to be processed further in the callback with JavaScript. For example, if it is JSON, it can be parsed and transformed.

Render Web Page

Finally, the page is updated through DOM operations, that is, by manipulating the web page document: changing, adding, or deleting nodes. In the code above, document.getElementById("myDiv").innerHTML = xmlhttp.responseText writes the response into the page.

Ajax analysis method

In the Network panel of the browser's developer tools, Ajax requests appear with the type xhr.

In the Request Headers, an X-Requested-With: XMLHttpRequest entry marks the request as Ajax.

You can see the contents of the response in Preview.

With the Request URL, Request Headers, Response Headers, Response Body, and so on, you can simulate sending Ajax requests.

Python simulates Ajax requests

As an example, let's crawl the Weibo posts of People's Daily:

import requests
from bs4 import BeautifulSoup

url_base = 'https://m.weibo.cn/api/container/getIndex?type=uid&value=2803301701&containerid=1076032803301701'

headers = {
        'Accept': 'application/json, text/plain, */*',
        'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_6) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/12.0.3 Safari/605.1.15',
        'X-Requested-With': 'XMLHttpRequest',
        'MWeibo-Pwa': '1'
        }

def get_page(basicUrl, headers, page):
    url = basicUrl + '&page=%s' % page
    try:
        response = requests.get(url, headers=headers)
        if response.status_code == 200:
            return response.json()      # Would return a dict
        else:
            raise RuntimeError('Response Status Code != 200')
    except Exception as e:
        print('Get Page Failed:', e)
        return None


def parse_html(html):
    soup = BeautifulSoup(html, 'lxml')
    return soup.get_text()
    

def get_content(data):
    result = []
    if data and data.get('data', {}).get('cards'):
        for item in data.get('data', {}).get('cards'):
            mblog = item.get('mblog')
            if not mblog:       # some cards are not posts and carry no mblog
                continue
            useful = {}
            useful['source'] = mblog.get('source')
            useful['text'] = parse_html(mblog.get('text'))

            result.append(useful)

    return result


def save_data(data):    # It's not saved here, it's just printed out
    for i in data:
        print(i)


if __name__ == '__main__':
    for page in range(1, 3):    # Remember to adjust the number of pages you need.
        r = get_page(url_base, headers, page)
        d = get_content(r)
        save_data(d)