It's 2020 and you still haven't crawled Toutiao (Today's Headlines)? Can you really call yourself a crawler developer?
That's fine, though. Even if the current interface has changed, I'll walk through how to crawl Toutiao in 2020. This is an improved version of the project, with quite a few of my own ideas worked in.
For example, some parts that are usually hard to understand I implemented in a simpler way, and I think they turned out well; take a look.
Project technology
Simple process pool:
I won't go deep into processes here. Simply put, I only need the following calls:
```python
from multiprocessing import Pool  # Import the process pool

p = Pool(4)   # Build a pool with 4 worker processes
p.close()     # Stop accepting new tasks into the pool
p.join()      # Wait for all worker processes to finish
```
Calling join() on the Pool object waits for all child processes to finish. You must call close() before join(); after close() the pool no longer accepts new tasks.
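A minimal sketch of how these calls fit together; the worker function and its inputs are just placeholders, not part of the project:

```python
from multiprocessing import Pool

def worker(n):
    # Placeholder task: square the input
    return n * n

if __name__ == '__main__':
    p = Pool(4)                         # Pool with 4 worker processes
    results = p.map(worker, range(8))   # Distribute the work across the pool
    p.close()                           # Stop accepting new tasks
    p.join()                            # Wait for all workers to finish
    print(results)                      # [0, 1, 4, 9, 16, 25, 36, 49]
```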
Ajax data crawling
A lot of a page's data does not appear directly in its HTML source.
For example, as you scroll a web page, new content is loaded piece by piece through an Ajax interface. This is asynchronous loading: the original page contains very little data, and the real data sits behind an interface. Only when we request that Ajax interface does the server return the data; JavaScript then parses it and renders it into the page, which is what we actually see in the browser.
More and more pages are loaded asynchronously now, which makes things harder for crawlers. The concept can sound confusing, so let's go straight to practice.
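As a rough illustration only (the endpoint below is made up), requesting an Ajax interface directly usually returns JSON instead of HTML, which is why the data never shows up in the page source:

```python
import requests

# Hypothetical Ajax endpoint; on a real site you find it under the XHR tab in DevTools
resp = requests.get(
    'https://example.com/api/feed',
    params={'offset': 0, 'count': 20},
    headers={'x-requested-with': 'XMLHttpRequest'},
)
data = resp.json()  # Structured JSON that JavaScript would normally render into the page
```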
Project operation
Screenshot of the final run:
Analyze the Ajax interface to get the data
The data includes:
Title of each page
Web address of each page
Target website: https://www.toutiao.com/search/?keyword=%E7%BE%8E%E5%A5%B3
How do you know whether something is an Ajax interface? There are three main points:
**First point**
Pay attention to the arrows in the screenshot: if you cannot find the corresponding text or links by searching the page source, it may be an Ajax interface.
**Second point**
Find the arrowed URL under the XHR tab, click it and check the Preview. Open its contents at random and you will find many items that match the articles on the page.
**Third point**
In this screenshot you can see that the interface's X-Requested-With header is XMLHttpRequest. If all three points are satisfied at once, it is an Ajax interface and the data is loaded asynchronously.
Get data
In the second screenshot we can see entries 0, 1, 2, 3, 4 and so on. Open them and you will find the data is all there; in the screenshot I marked the titles and page links with red arrows. As long as we get the page links, the rest is straightforward.
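Judging from the preview panel and the fields used later in the code, each entry under data looks roughly like this (the values below are made up for illustration):

```python
# Rough shape of one entry in the returned JSON; only the fields the crawler uses are shown
entry = {
    'title': 'Some article title',
    'article_url': 'https://www.toutiao.com/a1234567890/',  # link to the article page
    # ... plus many other fields the crawler ignores
}
```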
Programming
Get the json file:
First request the first page:
https://www.toutiao.com/search/?keyword=%E7%BE%8E%E5%A5%B3
But we can't just hand this page to the requests library as-is, because it is an Ajax interface. Without the right parameters, you will most likely be asked to enter a verification code or drag a verification slider, which is a hassle either way. So we add the parameters. The specific approach is as follows:
```python
def get_page(offset):
    # offset: each Ajax request loads a fixed number of results
    # (20 here; you can see it at the third point)
    global headers  # Global variable, reused later
    headers = {
        'cookie': 'tt_webid=6821518909792273933; WEATHER_CITY=%E5%8C%97%E4%BA%AC; SLARDAR_WEB_ID=b4a776dd-f454-43c6-81cd-bd37cb5fd0ec; tt_webid=6821518909792273933; csrftoken=4a2a6afcc9de4484af87a2ff8cba0638; ttcid=8732e6def0484fae975c136222a44f4932; s_v_web_id=verify_k9o5qf2w_T0dyn2r8_X6CE_4egN_9OwH_CCxYltDKYSQj; __tasessionId=oxyt6axwv1588341559186; tt_scid=VF6tWUudJvebIzhQ.fYRgRk.JHpeP88S02weA943O6b6-7o36CstImgKj1M3tT3mab1b',
        'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.129 Safari/537.36 Edg/81.0.416.68',
        'referer': 'https://www.toutiao.com/search/?keyword=%E7%BE%8E%E5%A5%B3',
        'x-requested-with': 'XMLHttpRequest'
    }  # Header information
    # Request parameters
    params = {
        'aid': ' 24',
        'app_name': ' web_search',
        'offset': offset,
        'format': ' json',
        'keyword': ' beauty',
        'autoload': ' true',
        'count': ' 20',
        'en_qc': ' 1',
        'cur_tab': ' 1',
        'from': ' search_tab',
        'pd': ' synthesis',
        'timestamp': int(time.time())
    }
    url = 'https://www.toutiao.com/api/search/content/?' + urlencode(params)  # Construct the URL with urlencode()
    url = url.replace('=+', '=')  # Important: without this the URL does not match the real one
    # print(url)
    try:
        r = requests.get(url, headers=headers, params=params)
        r.content.decode('utf-8')
        if r.status_code == 200:
            return r.json()  # Return JSON, because it is all dictionary data
    except requests.ConnectionError as e:
        print(e)
```
Note that the requested URL has changed here; I explained it in the code, so read it carefully.
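The reason for the replace('=+', '=') call: the parameter values above start with a space, and urlencode() encodes that space as a +, so the raw query string comes out as aid=+24 instead of aid=24. A quick check:

```python
from urllib.parse import urlencode

params = {'aid': ' 24', 'count': ' 20'}      # note the leading spaces
print(urlencode(params))                      # aid=+24&count=+20
print(urlencode(params).replace('=+', '='))   # aid=24&count=20
```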
Get title and web address:
```python
def get_image(json):  # Extract the title and image URLs
    if json.get('data'):  # If the 'data' field exists
        for item in json.get('data'):
            if item.get('title') is None:
                continue  # Skip entries with no title
            title = item.get('title')  # Get the title
            if item.get('article_url') is None:
                continue
            url_page = item.get('article_url')
            # print(url_page)
            rr = requests.get(url_page, headers=headers)
            if rr.status_code == 200:
                # Roughly narrow down the range with a regular expression
                pat = '<script>var BASE_DATA = .*?articleInfo:.*?content:(.*?)groupId.*?;</script>'
                match = re.search(pat, rr.text, re.S)
                if match is not None:
                    result = re.findall(r'img src=\\"(.*?)\\"', match.group(), re.S)
                    # print(i.encode('utf-8').decode('unicode_escape'))
                    # The decoding above converts \u escapes and so on
                    yield {
                        'title': title,
                        'image': result
                    }
```
The image links obtained here are all in Unicode-escaped form. I give the fix in the download section below; this is another hidden pitfall.
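To see what the pitfall looks like, here is the decoding step on its own; the exact path below is made up, but the \u002F escapes match what the regex captures from the page source:

```python
# The regex returns strings with literal \uXXXX escapes instead of real characters
raw = 'http:\\u002F\\u002Fp1.pstatp.com\\u002Forigin\\u002Fsome-image'
url = raw.encode('utf-8').decode('unicode_escape')  # turn \u002F back into '/'
print(url)  # http://p1.pstatp.com/origin/some-image
```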
Download pictures:
```python
def save_image(content):
    path = "D://Today's headlines"  # Root download directory
    if not os.path.exists(path):  # Create the directory if needed
        os.mkdir(path)
        os.chdir(path)
    else:
        os.chdir(path)
    # ------------------------------------------
    if not os.path.exists(content['title']):  # Folder for this article does not exist yet
        if '\t' in content['title']:
            # Remove special characters, otherwise the folder name cannot be created
            title = content['title'].replace('\t', '')
            os.mkdir(title + '//')  # Create a folder named after the title
            os.chdir(title + '//')
            print(title)
        else:
            title = content['title']
            os.mkdir(title + '//')  # Create the folder
            os.chdir(title + '//')
            print(title)
    else:  # Folder already exists
        if '\t' in content['title']:
            title = content['title'].replace('\t', '')
            os.chdir(title + '//')
            print(title)
        else:
            title = content['title']
            os.chdir(title + '//')
            print(title)
    for q, u in enumerate(content['image']):  # Iterate over the image URL list
        # Encode then decode to turn the escaped string into a usable URL
        u = u.encode('utf-8').decode('unicode_escape')
        # Start downloading
        r = requests.get(u, headers=headers)
        if r.status_code == 200:
            # file_path = r'{0}/{1}.{2}'.format('beauty', q, 'jpg')
            with open(str(q) + '.jpg', 'wb') as fw:
                fw.write(r.content)
            print(f'This series -----> downloaded image {q}')
```
After the u variable is encoded and then decoded, the URL looks much more normal.
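As a side note, and not part of the original project: the mkdir/chdir branching in save_image can be flattened with os.makedirs(..., exist_ok=True) and explicit file paths. A sketch, assuming the same global headers variable as above:

```python
import os
import requests

def save_image_alt(content, base="D://Today's headlines"):
    # Sanitize the title so it can be used as a folder name
    title = content['title'].replace('\t', '')
    folder = os.path.join(base, title)
    os.makedirs(folder, exist_ok=True)  # creates the whole path, no error if it already exists
    for q, u in enumerate(content['image']):
        u = u.encode('utf-8').decode('unicode_escape')  # same decoding step as before
        r = requests.get(u, headers=headers)            # headers is the project's global variable
        if r.status_code == 200:
            with open(os.path.join(folder, f'{q}.jpg'), 'wb') as fw:
                fw.write(r.content)
```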
Project code:
```python
# -*- coding: utf-8 -*-
# @Time     : 2020/5/1 9:34
# @Author   : the hourglass is raining
# @Software : PyCharm
# @CSDN     : https://me.csdn.net/qq_45906219

import requests
from urllib.parse import urlencode  # Construct the URL
import time
import os
from hashlib import md5
from lxml import etree
from bs4 import BeautifulSoup
import re
from multiprocessing.pool import Pool


def get_page(offset):
    global headers
    headers = {
        'cookie': 'tt_webid=6821518909792273933; WEATHER_CITY=%E5%8C%97%E4%BA%AC; SLARDAR_WEB_ID=b4a776dd-f454-43c6-81cd-bd37cb5fd0ec; tt_webid=6821518909792273933; csrftoken=4a2a6afcc9de4484af87a2ff8cba0638; ttcid=8732e6def0484fae975c136222a44f4932; s_v_web_id=verify_k9o5qf2w_T0dyn2r8_X6CE_4egN_9OwH_CCxYltDKYSQj; __tasessionId=oxyt6axwv1588341559186; tt_scid=VF6tWUudJvebIzhQ.fYRgRk.JHpeP88S02weA943O6b6-7o36CstImgKj1M3tT3mab1b',
        'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/81.0.4044.129 Safari/537.36 Edg/81.0.416.68',
        'referer': 'https://www.toutiao.com/search/?keyword=%E7%BE%8E%E5%A5%B3',
        'x-requested-with': 'XMLHttpRequest'
    }  # Header information
    # Request parameters
    params = {
        'aid': ' 24',
        'app_name': ' web_search',
        'offset': offset,
        'format': ' json',
        'keyword': ' beauty',
        'autoload': ' true',
        'count': ' 20',
        'en_qc': ' 1',
        'cur_tab': ' 1',
        'from': ' search_tab',
        'pd': ' synthesis',
        'timestamp': int(time.time())
    }
    url = 'https://www.toutiao.com/api/search/content/?' + urlencode(params)  # Construct the URL
    url = url.replace('=+', '=')  # Important: without this the URL does not match the real one
    # print(url)
    try:
        r = requests.get(url, headers=headers, params=params)
        r.content.decode('utf-8')
        if r.status_code == 200:
            return r.json()  # Return JSON, because it is all dictionary data
    except requests.ConnectionError as e:
        print(e)


def get_image(json):  # Extract the title and image URLs
    if json.get('data'):  # If the 'data' field exists
        for item in json.get('data'):
            if item.get('title') is None:
                continue  # Skip entries with no title
            title = item.get('title')  # Get the title
            # Old approach, kept for reference:
            # if item.get('image_list') is None:
            #     continue
            # urls = item.get('image_list')  # Get picture URLs
            # for url in urls:
            #     url = url.get('url')
            #     # Splice the URL together
            #     url = 'http://p1.pstatp.com/origin/' + 'pgc-image/' + url.split('/')[-1]
            if item.get('article_url') is None:
                continue
            url_page = item.get('article_url')
            # print(url_page)
            rr = requests.get(url_page, headers=headers)
            if rr.status_code == 200:
                pat = '<script>var BASE_DATA = .*?articleInfo:.*?content:(.*?)groupId.*?;</script>'
                match = re.search(pat, rr.text, re.S)
                if match is not None:
                    result = re.findall(r'img src=\\"(.*?)\\"', match.group(), re.S)
                    # for i in result:
                    #     print(i.encode('utf-8').decode('unicode_escape'))
                    # The decoding above converts \u escapes and so on
                    yield {
                        'title': title,
                        'image': result
                    }
                    # The old yield below produced badly escaped values, so the URL
                    # could not be used and it was dropped:
                    # yield {
                    #     'title': title,
                    #     'image': url
                    # }


def save_image(content):
    path = "D://Today's headlines"  # Root download directory
    if not os.path.exists(path):  # Create the directory if needed
        os.mkdir(path)
        os.chdir(path)
    else:
        os.chdir(path)
    # ------------------------------------------
    if not os.path.exists(content['title']):  # Folder for this article does not exist yet
        if '\t' in content['title']:
            # Remove special characters, otherwise the folder name cannot be created
            title = content['title'].replace('\t', '')
            os.mkdir(title + '//')  # Create a folder named after the title
            os.chdir(title + '//')
            print(title)
        else:
            title = content['title']
            os.mkdir(title + '//')  # Create the folder
            os.chdir(title + '//')
            print(title)
    else:  # Folder already exists
        if '\t' in content['title']:
            title = content['title'].replace('\t', '')
            os.chdir(title + '//')
            print(title)
        else:
            title = content['title']
            os.chdir(title + '//')
            print(title)
    for q, u in enumerate(content['image']):  # Iterate over the image URL list
        # Encode then decode to turn the escaped string into a usable URL
        u = u.encode('utf-8').decode('unicode_escape')
        # Start downloading
        r = requests.get(u, headers=headers)
        if r.status_code == 200:
            # file_path = r'{0}/{1}.{2}'.format('beauty', q, 'jpg')
            with open(str(q) + '.jpg', 'wb') as fw:
                fw.write(r.content)
            print(f'This series -----> downloaded image {q}')


def main(offset):
    json = get_page(offset)
    for content in get_image(json):
        try:
            # print(content)
            save_image(content)
        except (FileExistsError, OSError):
            print('Error creating file: the name contains a special string')
            continue


if __name__ == '__main__':
    pool = Pool()
    groups = [j * 20 for j in range(8)]
    pool.map(main, groups)  # Pass the offsets to the workers
    pool.close()
    pool.join()
```
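A quick note on how the pool is driven: pool.map calls main once per offset, and the offsets step by 20 because each Ajax request returns 20 results:

```python
groups = [j * 20 for j in range(8)]
print(groups)  # [0, 20, 40, 60, 80, 100, 120, 140] -> 8 pages of 20 results each
```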
Project postscript
This is my first real crawler project. My earlier ones were very simple; this one was a bit more troublesome. In short, don't be afraid of difficulties: keep searching and you will always find an answer in the end.
Project fixes:
Splicing the URL correctly fixed the bug where data was None in the fetched JSON.
Adding the necessary parameters avoids the verification code and verification slider.
Fixed the link mismatch caused by the \u escape format by encoding and decoding, as mentioned when extracting the URLs.
When downloading pictures, each article gets its own subfolder, so the downloads stay organized.
Downloading uses a simple process pool to speed things up.
Original link:
https://blog.csdn.net/qq_45906219/article/details/105889730
Source: the internet; for learning purposes only; will be removed upon request.