A Python crawler project: crawling Toutiao (Today's Headlines)

It's 2020 and you still haven't crawled Toutiao? Can you even call yourself a crawler developer?

But that's OK. Although the interface has since changed, I'll walk through how to crawl Toutiao as it worked in 2020. This is an improved project that incorporates many of my own ideas.

For example, some parts that are usually hard to understand I have implemented in a simpler way, and I think they turned out well. Take a look.

Project technology

Simple process pool:

I won't go deep into processes here. To put it simply, I only need the following functions:

from multiprocessing import Pool  # Import the process pool class
p = Pool(4)  # Construct a process pool with 4 worker processes
p.close()  # Stop accepting new tasks into the pool
p.join()   # Wait for all worker processes to finish

Calling join() on the Pool object waits for all subprocesses to finish executing. You must call close() before calling join(); once close() is called, no new tasks can be submitted to the pool.
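To make the lifecycle concrete, here is a minimal self-contained sketch; square() is just a placeholder task I made up, not part of this project:

from multiprocessing import Pool

def square(x):
    return x * x

if __name__ == '__main__':
    p = Pool(4)                         # pool with 4 worker processes
    results = p.map(square, range(8))   # distribute the tasks across the workers
    p.close()                           # stop accepting new tasks
    p.join()                            # wait for every worker to finish
    print(results)                      # [0, 1, 4, 9, 16, 25, 36, 49]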

Ajax data crawling

A lot of a page's data will not appear directly in the HTML source.

For example, when you scroll a web page, new content is loaded piece by piece through an Ajax interface. This is asynchronous loading: the original page contains very little data, and the real data sits behind a separate interface. Only when we request that Ajax interface, and the server backend receives the request, is the data returned. JavaScript then parses this data and renders it into the browser page, producing what we actually see.
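The practical consequence for a crawler is to skip the HTML and request the interface directly. Here is a minimal sketch of the idea; the endpoint and field names (example.com/api/feed, 'items', 'title') are placeholders I invented, not a real API:

import requests

resp = requests.get('https://example.com/api/feed', params={'offset': 0})
if resp.status_code == 200:
    data = resp.json()                  # the server answers with structured JSON
    for item in data.get('items', []):  # iterate the records in the payload
        print(item.get('title'))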

More and more web pages load asynchronously now, so crawling is not as easy as it used to be. The concept sounds awkward, so let's just dive in!

Project operation

Screenshot of the final run:

Analyze the Ajax interface to get the data

The data includes:

Title of each page

Web address of each page

Target website: https://www.toutiao.com/search/?keyword=%E7%BE%8E%E5%A5%B3

How do you know if it is an Ajax interface? There are three main points:

First point

Pay attention to the arrows in the screenshot: if you cannot find the corresponding text or links by searching the page source, it may be an Ajax interface.

Second point

Find the URL the arrow points at under the XHR tab, click it, and view the Preview. You can expand the contents at will and find many fields that match the articles on the page.

Third point

In this picture you can see that the interface's X-Requested-With header is XMLHttpRequest. If all three points are satisfied at the same time, it is an Ajax interface and the data is loaded asynchronously.
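The first point can also be confirmed programmatically: fetch the raw HTML and search it for a title you saw in the browser. A quick sketch; the title string is a placeholder you would copy from the rendered page:

import requests

html = requests.get('https://www.toutiao.com/search/?keyword=%E7%BE%8E%E5%A5%B3').text
print('a title you saw in the browser' in html)  # False suggests the data arrives via Ajax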

Get data

In the second picture we can see entries 0, 1, 2, 3, 4 and so on. When you open one, you will find that everything is there; in the picture I marked the titles and page links with red arrows. Once I have the page links, the rest is very simple.

Programming

Get the JSON file:

First request the first page:

https://www.toutiao.com/search/?keyword=%E7%BE%8E%E5%A5%B3

But we can't simply hand this page over to the requests library, because it is an Ajax interface. Without the right parameters you will most likely be asked for a captcha or a slider verification, which is a nuisance either way. So we add parameters. The specifics are as follows:

def get_page(offset):    # Offset offset, because each ajax is loaded with a fixed number of pages
                     # Here is 20. You can see it on the third point
    global headers  # Global variables I'll use later
    headers = {
        'cookie': 'tt_webid=6821518909792273933; WEATHER_CITY=%E5%8C%97%E4%BA%AC; SLARDAR_WEB_ID=b4a776dd-f454-43c6-81cd-bd37cb5fd0ec; tt_webid=6821518909792273933; csrftoken=4a2a6afcc9de4484af87a2ff8cba0638; ttcid=8732e6def0484fae975c136222a44f4932; s_v_web_id=verify_k9o5qf2w_T0dyn2r8_X6CE_4egN_9OwH_CCxYltDKYSQj; __tasessionId=oxyt6axwv1588341559186; tt_scid=VF6tWUudJvebIzhQ.fYRgRk.JHpeP88S02weA943O6b6-7o36CstImgKj1M3tT3mab1b',
        'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.129 Safari/537.36 Edg/81.0.416.68',
        'referer': 'https://www.toutiao.com/search/?keyword=%E7%BE%8E%E5%A5%B3',
        'x-requested-with': 'XMLHttpRequest'
    }  # Header information adding parameter
    params = {
        'aid': ' 24',
        'app_name': ' web_search',
        'offset': offset,
        'format': ' json',
        'keyword': ' beauty',
        'autoload': ' true',
        'count': ' 20',
        'en_qc': ' 1',
        'cur_tab': ' 1',
        'from': ' search_tab',
        'pd': ' synthesis',
        'timestamp': int(time.time())
    }
    url = 'https://www.toutiao.com/api/search/content/?' + urlencode(params)  # Construct the URL with urlencode()
    url = url.replace('=+', '=')  # Important: urlencode turns the leading spaces in params into '+', which must be stripped
    # print(url)          
    try:
        r = requests.get(url, headers=headers)  # params are already in url; passing params= again would re-add the '+' versions
        r.content.decode('utf-8')  # Check that the body decodes as UTF-8 (the result is not stored)
        if r.status_code == 200:
            return r.json()  # Return json format because it's all dictionary type
    except requests.ConnectionError as e:
        print(e)

Note carefully that the requested URL has changed; I explain it in the code comments, so read them closely.
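To see why the replace('=+', '=') line is needed: the param values above are written with a leading space, and urlencode() encodes a space as '+'. A small standalone demonstration:

from urllib.parse import urlencode

params = {'aid': ' 24', 'format': ' json'}      # note the leading spaces
print(urlencode(params))                        # aid=+24&format=+json
print(urlencode(params).replace('=+', '='))     # aid=24&format=json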

Get title and web address:

def get_image(json):  # Get the titles and picture URLs
    if json.get('data'):  # If the 'data' field exists
        for item in json.get('data'):
            if item.get('title') is None:
                continue  # Skip items without a title
            title = item.get('title')  # Get the title
            if item.get('article_url') is None:
                continue  # Skip items without an article URL
            url_page = item.get('article_url')
            # print(url_page)
            rr = requests.get(url_page, headers=headers)
            if rr.status_code == 200:
                pat = '<script>var BASE_DATA = .*?articleInfo:.*?content:(.*?)groupId.*?;</script>'  #  Roughly narrow the range with a regex
                match = re.search(pat, rr.text, re.S)
                if match is not None:
                    result = re.findall(r'img src&#x3D;\\&quot;(.*?)\\&quot;', match.group(), re.S)
                    # The URLs still contain \u escapes; they are converted during download
                    yield {
                        'title': title,
                        'image': result
                    }

The page links obtained here are all in Unicode-escaped format. In the download section below I give the fix; this is a hidden pitfall.
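For illustration, here is what the encode/decode round trip does; the URL below is a made-up example, not real data:

raw = 'http:\\u002F\\u002Fp3.pstatp.com\\u002Flarge\\u002Fexample.jpg'
print(raw.encode('utf-8').decode('unicode_escape'))
# http://p3.pstatp.com/large/example.jpg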

Download pictures:

def save_image(content):
    path = "D://Today's headlines"
    if not os.path.exists(path):  # Create the base directory
        os.mkdir(path)
        os.chdir(path)
    else:
        os.chdir(path)
    # ------------------------------------------

    if not os.path.exists(content['title']):  # The folder for this article does not exist yet
        if '\t' in content['title']:
            title = content['title'].replace('\t', '')  # Remove special characters, or the folder cannot be created
            os.mkdir(title + '//')
            os.chdir(title + '//')
            print(title)
        else:
            title = content['title']
            os.mkdir(title + '//')  # Create a folder named after the title
            os.chdir(title + '//')
            print(title)
    else:  # The folder already exists
        if '\t' in content['title']:
            title = content['title'].replace('\t', '')
            os.chdir(title + '//')
            print(title)
        else:
            title = content['title']
            os.chdir(title + '//')
            print(title)
    for q, u in enumerate(content['image']):  # Traverse the list of picture URLs
        u = u.encode('utf-8').decode('unicode_escape')
        # Encode and decode first to get a usable URL
        # Start downloading
        r = requests.get(u, headers=headers)
        if r.status_code == 200:
            # file_path = r'{0}/{1}.{2}'.format('beauty', q, 'jpg')  # File name and path, left over from debugging
            with open(str(q) + '.jpg', 'wb') as fw:
                fw.write(r.content)
                print(f'This series -----> downloaded image {q}')

After the u variable is encoded and decoded, the URL looks much more normal.
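As a side note, and as my own variant rather than the original code: os.makedirs with exist_ok=True collapses the exists/else branching above into a few lines:

import os

def enter_dir(title):
    title = title.replace('\t', '')                    # strip characters that break folder names
    path = os.path.join("D:/Today's headlines", title)
    os.makedirs(path, exist_ok=True)                   # create parents; do nothing if already present
    os.chdir(path)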

Project code:

# -*- coding: utf-8 -*-
# @Time     : 2020/5/1 9:34
# @Author   : the hourglass is raining
# @Software : PyCharm
# @CSDN     : https://me.csdn.net/qq_45906219
​
​
import requests
from urllib.parse import urlencode  # Construct url
import time
import os
from hashlib import md5
from lxml import etree
from bs4 import BeautifulSoup
import re
from multiprocessing.pool import Pool
​
​
def get_page(offset):
    global headers
    headers = {
        'cookie': 'tt_webid=6821518909792273933; WEATHER_CITY=%E5%8C%97%E4%BA%AC; SLARDAR_WEB_ID=b4a776dd-f454-43c6-81cd-bd37cb5fd0ec; tt_webid=6821518909792273933; csrftoken=4a2a6afcc9de4484af87a2ff8cba0638; ttcid=8732e6def0484fae975c136222a44f4932; s_v_web_id=verify_k9o5qf2w_T0dyn2r8_X6CE_4egN_9OwH_CCxYltDKYSQj; __tasessionId=oxyt6axwv1588341559186; tt_scid=VF6tWUudJvebIzhQ.fYRgRk.JHpeP88S02weA943O6b6-7o36CstImgKj1M3tT3mab1b',
        'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.129 Safari/537.36 Edg/81.0.416.68',
        'referer': 'https://www.toutiao.com/search/?keyword=%E7%BE%8E%E5%A5%B3',
        'x-requested-with': 'XMLHttpRequest'
    }  # Header information adding parameter
    params = {
        'aid': ' 24',
        'app_name': ' web_search',
        'offset': offset,
        'format': ' json',
        'keyword': ' beauty',
        'autoload': ' true',
        'count': ' 20',
        'en_qc': ' 1',
        'cur_tab': ' 1',
        'from': ' search_tab',
        'pd': ' synthesis',
        'timestamp': int(time.time())
    }
    url = 'https://www.toutiao.com/api/search/content/?' + urlencode(params)  # Construct the URL with urlencode()
    url = url.replace('=+', '=')  # Strip the '+' that urlencode makes from the leading spaces in params
    # print(url)
    try:
        r = requests.get(url, headers=headers)  # params are already in url; passing params= again would re-add the '+' versions
        r.content.decode('utf-8')  # Check that the body decodes as UTF-8 (the result is not stored)
        if r.status_code == 200:
            return r.json()  # Return json format because it's all dictionary type
    except requests.ConnectionError as e:
        print(e)
​
​
def get_image(json):  # Get pictures
    if json.get('data'):  # If this exists
        for item in json.get('data'):
            if item.get('title') is None:
                continue  # If the title is null
            title = item.get('title')  # Get title
            # if item.get('image_list') is None:  # Judge empty
            #     continue
            # urls = item.get('image_list')  # Get picture URL
            # for url in urls:  # Traverse this urls
            #     url = url.get('url')
            #     # Use regular splicing URL
            #     url = 'http://p1.pstatp.com/origin/' + 'pgc-image/' + url.split('/')[-1]
            if item.get('article_url') is None:
                continue
            url_page = item.get('article_url')
            # print(url_page)
            rr = requests.get(url_page, headers=headers)
​
            if rr.status_code == 200:
                pat = '<script>var BASE_DATA = .*?articleInfo:.*?content:(.*?)groupId.*?;</script>'
                match = re.search(pat, rr.text, re.S)
                if match is not None:
                    result = re.findall(r'img src&#x3D;\\&quot;(.*?)\\&quot;', match.group(), re.S)
                    # for i in result:
                    #     print(i.encode('utf-8').decode('unicode_escape')
                    # Change the encoding mode to change \ u and so on
                    yield {
                        'title': title,
                        'image': result
                    }
            #  Format error: hexadecimal values appeared here and the URL could not be recovered; revisit this later
            # yield {
            #     'title': title,
            #     'image': url
            # }  # Back to title and URL
​
​
def save_image(content):
    path = "D://Today's headlines"
    if not os.path.exists(path):  # Create directory
        os.mkdir(path)
        os.chdir(path)
    else:
        os.chdir(path)
    # ------------------------------------------
​
    if not os.path.exists(content['title']):  # Create a single folder
        if '\t' in content['title']:  # Create a single folder with title as the title
            title = content['title'].replace('\t', '')  # Remove special symbols or the file name cannot be created
            os.mkdir(title + '//')
            os.chdir(title + '//')
            print(title)
        else:
            title = content['title']
            os.mkdir(title + '//')  # Create a folder named after the title
            os.chdir(title + '//')
            print(title)
    else:  # If present
        if '\t' in content['title']:  # Create a single folder with title as the title
            title = content['title'].replace('\t', '')  # Remove special symbols or the file name cannot be created
            os.chdir(title + '//')
            print(title)
        else:
            title = content['title']
            os.chdir(title + '//')
            print(title)
    for q, u in enumerate(content['image']):  # Traverse image address list
        u = u.encode('utf-8').decode('unicode_escape')
​
        # First encode and decode to get the required URL link
        #  Start downloading
        r = requests.get(u, headers=headers)
        if r.status_code == 200:
            # file_path = r'{0}/{1}.{2}'.format('beauty', q, 'jpg')  # File name and path, left over from debugging
            with open(str(q) + '.jpg', 'wb') as fw:
                fw.write(r.content)
                print(f'This series -----> downloaded image {q}')
​
​
def main(offset):
    json = get_page(offset)
    for content in get_image(json):
        try:
            # print(content)
            save_image(content)
        except (FileExistsError, OSError):  # 'A and B' would only catch A; a tuple catches both
            print('Error creating file: the name contains a special string')
            continue
​
​
if __name__ == '__main__':
    pool = Pool()
    groups = [j * 20 for j in range(8)]
    pool.map(main, groups)  # Pass in the offsets
    pool.close()
    pool.join()

Project postscript

This is my first crawler project of this scale. My earlier ones were very simple; this one was a bit more troublesome. In short, don't be afraid of difficulties: if you don't know something, look it up, and keep searching until you get a result!

Project repair:

Correct URL construction fixed the bug where the returned JSON contained data = None.

Adding the necessary request parameters avoids triggering captchas and slider verification.

Encoding and decoding fixed the mismatched links in \u-escaped format, as mentioned in the section where the URLs are obtained.

When downloading pictures, each article gets its own subfolder, so the downloads stay organized.

Downloading uses a simple process pool to speed things up.

Original link:

https://blog.csdn.net/qq_45906219/article/details/105889730

Source: the internet; for learning purposes only; will be removed upon request.
