Crawler in Action - Crawling mzitu.com (a tragic defeat by anti-crawling measures)

This walkthrough is for learning and reference only; please abide by the relevant laws and regulations.

 

First, let's analyze the website: https://www.mzitu.com/all/

It is easy to see that this page contains a large number of album links, which makes it very convenient for crawling the pictures. So let's continue the analysis.

Comparing the address of an album's first page with the address of its second page, the only difference is a trailing '/num'. So when crawling, all we need to do is append '/num' (the page number) to the album link.
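To spell the pattern out, the per-page URLs can be built by simple string concatenation; the album id below is made up purely for illustration.

album_url = 'https://www.mzitu.com/12345'   #hypothetical album id, for illustration only
for page in range(1, 4):
    print(album_url + '/' + str(page))
#prints .../12345/1, .../12345/2, .../12345/3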

So let's start by crawling the index page:

import requests
import re
#Crawl the home page
url = 'https://www.mzitu.com/all/'
header = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:74.0) Gecko/20100101 Firefox/74.0'
}
response = requests.get(url, headers=header).text
print(response)
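As a small aside, it can help to check the HTTP status code and set a timeout before trusting the response. This is only a sketch of that idea, reusing the URL and header above; the timeout value is arbitrary.

import requests

url = 'https://www.mzitu.com/all/'
header = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:74.0) Gecko/20100101 Firefox/74.0'
}
#fetch with a timeout and stop early if the server does not return 200
resp = requests.get(url, headers=header, timeout=10)
if resp.status_code != 200:
    raise SystemExit('request failed with status %d' % resp.status_code)
response = resp.text
print(response[:500])   #only print the first 500 characters as a quick sanity check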

Running this prints the raw HTML of the /all/ page.

OK, next we need to extract the links we want from it.

 

#Use a regular expression to extract the links we need
req = '<br>.*?day: <a href="(.*?)" target="_blank">(.*?)</a>'
urls = re.findall(req, response)
for url, pic_name in urls:
    print(url, pic_name)
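To make clear what the pattern returns, here it is run against a made-up fragment shaped the way the regex expects (the fragment is an assumption for illustration, not the site's real markup). findall yields a list of (url, name) tuples, which is why the loop can unpack two variables.

import re

#made-up fragment, only to illustrate what the regex captures
sample = '<br>2020-03-18day: <a href="https://www.mzitu.com/11111" target="_blank">some album</a>'
req = '<br>.*?day: <a href="(.*?)" target="_blank">(.*?)</a>'
print(re.findall(req, sample))
#-> [('https://www.mzitu.com/11111', 'some album')]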

So we get the links we need, each album URL paired with its title.

The next thing we need to know is: how many pages does each album have?

Looking further at an album page's HTML source, we can see a pagination block listing the page numbers.

We can pull the last page number out of that block:

req = '<br>.*?day: <a href="(.*?)" target="_blank">(.*?)</a>'
urls = re.findall(req, response)
for url, pic_name in urls:
    print(url, pic_name)
    #Get the total number of pages for each album url
    html = requests.get(url, headers=header).text
    req_last = "<a href='.*?'><span>&laquo;Previous Group</span></a><span>1</span><a href='.*?'><span>2</span></a><a href='.*?'><span>3</span></a><a href='.*?'><span>4</span></a><span class='dots'>...</span><a href='.*?'><span>(.*?)</span></a><a href='.*?'><span>next page&raquo;</span></a>      </div>"
    last_num = re.findall(req_last, html)
    print(last_num)
    exit()   #stop after the first album while testing
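As a side note, the long pattern above is tied to one exact pagination layout (pages 1 to 4 visible, then a dots span, then the last page). A looser sketch, assuming only that page numbers sit inside <span> tags as in that markup, is to collect every numeric span and take the largest:

import re

def last_page(html):
    #gather every number wrapped in <span>...</span> and take the largest one;
    #this assumes the pagination block is the only place such numeric spans appear
    nums = [int(n) for n in re.findall(r'<span>(\d+)</span>', html)]
    return max(nums) if nums else 1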

 

Using this pattern, we extract the number of the last page.

The next step is to stitch the URLs together.

We tried appending '/1' to an album's original URL and found that it still reaches the first page we need, so page 1 can be handled the same way as the rest, which greatly simplifies the code.

    #Convert the matched list to an int (last_num normally holds one match)
    k = int(last_num[-1])
    #Stitch the page urls together
    for i in range(1, k + 1):   #pages run from 1 to k inclusive
        url_pic = url + '/' + str(i)
        print(url_pic)
    exit()

PS: the exit() here is just so that, while testing, the program only grabs addresses under a single album URL; it will be removed later.

Running this, we can get the stitched page links.

These links are valid, but they are still album pages rather than the direct URLs of the images.

So we need to filter the information further and refine the code:

    #Convert the matched list to an int (last_num normally holds one match)
    k = int(last_num[-1])
    #Stitch the page urls together
    for i in range(1, k + 1):
        url_pic = url + '/' + str(i)

        headerss = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36',
            'Referer': url_pic,
            'Host': 'www.mzitu.com'
        }

        #Fetch the album page and pull the image url out of it
        html_pic_last = requests.get(url_pic, headers=headerss).text
        req_last_pic = '<div class=.*?><p><a href=.*? ><img src="(.*?)" alt=.*? width=.*? height=.*? /></a></p>'
        req_pic_url = re.findall(req_last_pic, html_pic_last)
        links = str(req_pic_url[-1])   #the regex normally yields a single image url per page
        print(links)
        #NOTE: headerss is passed positionally here, so requests treats it as params,
        #not as headers - this request goes out without the custom headers above
        image_content = requests.get(links, headerss).content
        print(image_content)
        # with open("image/" + pic_name + str(i) + ".jpg", "wb") as f:
        #     f.write(image_content)
        exit()

 

However, after testing, we found that the saved pictures could not be opened; checking showed that the image download was returning a 403 error: the server refused access.

 

I tried changing the headers, but it still didn't work, so GG!

We'll come back later and try to figure out the cause of the 403.
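One candidate cause worth checking (my assumption, not something verified in the original run): in requests, the second positional argument of get() is params, not headers, so requests.get(links, headerss) sends no custom headers at all, and image servers that rely on Referer checks for hot-link protection typically answer such bare requests with 403. Below is a hedged sketch of a download helper that passes the headers by keyword; download_image is a hypothetical helper, not part of the original code.

import requests

def download_image(img_url, referer, path):
    #pass headers by keyword: requests.get's second positional argument is params, not headers
    headers = {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:74.0) Gecko/20100101 Firefox/74.0',
        'Referer': referer,   #many image hosts reject requests that lack a plausible Referer
    }
    resp = requests.get(img_url, headers=headers, timeout=10)
    if resp.status_code == 200:
        with open(path, 'wb') as f:
            f.write(resp.content)
    else:
        print('download failed with status', resp.status_code)

Whether this actually clears the 403 here is untested; it is simply the first thing I would try.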

All of the source code so far is attached below:

import requests
import re
#Crawl the home page
url_head = 'https://www.mzitu.com/all/'
header = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:74.0) Gecko/20100101 Firefox/74.0',
    'Referer': 'https://www.mzitu.com'
}
response = requests.get(url_head, headers=header).text
#Use a regular expression to extract the links we need
req = '<br>.*?day: <a href="(.*?)" target="_blank">(.*?)</a>'
urls = re.findall(req, response)
for url, pic_name in urls:
    #Get the total number of pages for each album url
    html = requests.get(url, headers=header).text
    req_last = "<a href='.*?'><span>&laquo;Previous Group</span></a><span>1</span><a href='.*?'><span>2</span></a><a href='.*?'><span>3</span></a><a href='.*?'><span>4</span></a><span class='dots'>...</span><a href='.*?'><span>(.*?)</span></a><a href='.*?'><span>next page&raquo;</span></a>      </div>"
    last_num = re.findall(req_last, html)
    #Convert the matched list to an int (last_num normally holds one match)
    k = int(last_num[-1])
    #Stitch the page urls together
    for i in range(1, k + 1):
        url_pic = url + '/' + str(i)

        headerss = {
            'User-Agent': 'Mozilla/5.0 (Windows NT 6.3; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/39.0.2171.95 Safari/537.36',
            'Referer': url_pic,
        }
        html_pic_last = requests.get(url_pic, headers=headerss).text
        req_last_pic = '<div class=.*?><p><a href=.*? ><img src="(.*?)" alt=.*? width=.*? height=.*? /></a></p>'
        req_pic_url = re.findall(req_last_pic, html_pic_last)
        links = str(req_pic_url[-1])   #the regex normally yields a single image url per page
        print(links)
        #NOTE: headerss is passed positionally here, so requests treats it as params,
        #not as headers - this request goes out without the custom headers above
        image_content = requests.get(links, headerss).content
        print(image_content)
        # with open("image/" + pic_name + str(i) + ".jpg", "wb") as f:
        #     f.write(image_content)
        exit()   #stop after the first page while testing
    exit()   #stop after the first album while testing
