The latest whole-site crawling trick: this time it's finally 4K high-definition wallpapers, simply because I don't save the pictures as JPG!

I've always had a dream about learning crawlers: to truly reach the point where whatever I can see, I can crawl. So I kept looking for a way in. I started by faking the request interface to get the data handed over, but that didn't work. Still, all roads lead to Rome, and no wave just dies on the beach. Then I asked myself one question: why is it that every time we save the pictures, they're in JPG format? Why not PNG? I looked up the relevant material and in the end I succeeded, and along the way found ways to deal with the galleries that limit the number of downloads. It's a bold idea that can be applied to any wallpaper site, and I hope it helps everyone!

In 2020, are crawler folks still worrying about the quality of the pictures they download? Still only grabbing JPG images? Then I have an alternative trick to help you get genuinely high-quality pictures for free. After a week of hard work on this, it will definitely change how you look at and think about crawlers: crawling turns out to be this interesting, and you'd never think of it this way on your own. Of course, I've kept the code as simple as possible so everyone can follow along. Let's go!


To keep anyone from calling this a clickbait title, let me lead with the picture. It's absolutely 4K or above, no fakes!



Crawling approach:

Technical analysis:

Why do crawlers use JPG?

This alternative trick is worth thinking about. Every time we save a picture in a crawler, it defaults to JPG. Why?

  • Whatever the formal definition, JPG is in short a lossy compression format: on the premise of meeting the required image quality, it pushes the compression ratio as high as possible to save storage space. So transfers are fast, files are small, and image quality is lower, which is exactly why websites use JPG to store pictures.
  • PNG, by contrast, preserves the complete picture quality; it is a (relatively) lossless compression format. The downsides are larger files and the extra original channels it keeps, which is why it is rarely used on websites or in crawler tutorials.
  • And a bitmap can, in theory, be enlarged indefinitely. JPG is a poor choice for that, while PNG is a perfect partner. This is the gap I take advantage of here to finish off the high-definition gallery.
  • I'm not saying this to brag about how clever it is; some people may have thought of it already. I'm just sharing the approach with more friends to shake up the usual thinking (a quick sketch comparing the two formats follows this list). If you only want small thumbnails, JPG is still a fine choice; I like big high-definition pictures, so I use the Image class from the PIL library, which I explain below. This is a whole-site image crawler, so it may be a little slow; I did my best with a thread pool, thanks for understanding!
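
A quick sanity check of the lossy-vs-lossless claim above. This is only a minimal sketch: it saves the same image once as JPG and once as PNG and compares the file sizes on disk; the file name sample.png is just a placeholder for any local picture, not something from this project.

import os
from PIL import Image

img = Image.open('sample.png').convert('RGB')  # JPEG has no alpha channel, so convert first
img.save('as_jpg.jpg', quality=95)             # lossy: smaller file, some detail discarded
img.save('as_png.png')                         # lossless: larger file, full quality kept
print(os.path.getsize('as_jpg.jpg'), os.path.getsize('as_png.png'))  # the JPG is normally much smaller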

Key technical code:

Store the picture as a PNG bitmap, then enlarge it without loss, and you get the effect we're after. Ask me if anything is unclear.

with open(''.join(name) + '.png', 'wb') as fw:
    fw.write(rr.content)  # save the raw bytes with a .png extension
img = Image.open(''.join(name) + '.png')
img = img.resize((4000, 2000), Image.ANTIALIAS)  # enlarge with anti-aliasing (Image.LANCZOS in newer Pillow)
# print(img.size)
img.save(''.join(name) + '.png', quality=95)  # re-save; quality only matters for JPEG, PNG ignores it

If you find this helpful, I'd appreciate a follow and a like; that's the biggest support you can give me!

Project technology:

  • The target is a familiar site, the netbian gallery (pic.netbian.com). Its image quality is very high, but it has one drawback: it limits the number of downloads, which is why I picked it this time. Other sites work much the same way.
  • The overall flow is split into a few modules, and there aren't many tricky parts: the basic crawler tools requests and XPath (lxml), plus the PIL Image class, a thread pool, and the os library for file handling (the imports are sketched right after this list).
  • I promise to keep it clear; let's get hands-on!
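
For reference, here are the imports that the snippets below assume. The article never shows its import lines or request header, so the thread-pool import (multiprocessing.dummy gives a thread-backed Pool) and the header value are my assumptions:

import os

import requests
from lxml import etree                   # XPath parsing
from PIL import Image                    # image resizing and re-saving
from multiprocessing.dummy import Pool   # thread-backed Pool (assumed; the original imports are not shown)

headers = {'User-Agent': 'Mozilla/5.0'}  # minimal placeholder request header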

Get the group URLs and titles:

  • Since this is whole-site crawling, we have to deal with the groups (categories). Here we write a function to grab the information the arrow points to, and I save it as a list for the next step.

    code:
def get_groups(url, headers):  # Get the key information
    """Get the URL and title of each column based on the start URL and request header passed in"""
    name, url_list = [], []  # fall back to empty lists if the request fails
    r = requests.get(url, headers=headers)
    r.encoding = r.apparent_encoding  # fix the encoding
    if r.status_code == 200:
        html = etree.HTML(r.text)
        html = etree.tostring(html)
        html = etree.fromstring(html)
        url_list = html.xpath(r"//div[@class='classify clearfix']//a/@href")  # hrefs of the page groups
        name = html.xpath(r"//div[@class='classify clearfix']//a/text()")  # titles of the page groups
    else:
        print('Request error!')
    return name, url_list  # return the group titles and group URLs

Start download selection:

  • Next we let the user choose what to download, which makes things friendlier and more user-oriented:
    I'll keep this part simple, bear with me:

The code is as follows:

def begin_down(title, url, headers):  # Download selection
    """The parameters here are the title list and URL list just obtained, plus the request header"""
    print('Download high-definition pictures for free'.center(30, '-'))
    for i, j in enumerate(title):
        print(i, '\t\t\t\t\t', j)
    inp = int(input('Enter download option:'))
    # print(title[inp], url[inp])
    get_image_1st(title[inp], url[inp], headers)  # download the first page of the chosen group
    get_image(title[inp], url[inp], headers)  # download all the remaining pages

The URL obtained here needs to be spliced together in the next step. From my observation, the first page of a group follows a different pattern from the other pages, with no common rule, so I wrote two functions: one to crawl the first page and one to crawl the remaining pages, both explained below. A short illustration of the two URL shapes follows.
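
For orientation, the two URL shapes end up looking roughly like this; the group path 4kfengjing is only an illustrative example, not taken from the article:

first_page = 'http://pic.netbian.com/4kfengjing/'               # page 1: just the group path, no suffix
second_page = 'http://pic.netbian.com/4kfengjing/index_2.html'  # pages 2 and up: index_N.html appended
print(first_page, second_page)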

Get the picture page URLs:

  • Clicking a picture takes us to a page like this. Look at my arrow annotation: we want the content of the href, not the URL inside the img tag. The img URL is the shrunken thumbnail, which has no quality to speak of; staring at those just hurts your eyes, and it's not worth it. I'll get you the big picture and you'll be happy. So write the code, grab the href, and splice the URL; http://pic.netbian.com/tupian/17781.html, for example, opens fine.

code:

def get_image_1st(title, url, headers):  # Get the picture URLs on the first page
    url_1st = 'http://pic.netbian.com/' + url  # splice the group URL
    r = requests.get(url_1st, headers=headers)
    if r.status_code == 200:
        html = etree.HTML(r.text)
        html = etree.tostring(html)
        html = etree.fromstring(html)
        page_url = html.xpath(r'//ul[@class="clearfix"]//li/a/@href')  # get the detail-page URL of each picture
        #  print(page_url)
        page_url = ['http://pic.netbian.com' + i for i in page_url]  # splice the full URLs
        pool_down(title, page_url, headers)  # download everything on the first page, which has a special URL


def get_image(title, url, headers):  # Get the picture URLs of the other pages
    """Find the URLs of the remaining pages, then collect the detail-page URL of each picture"""
    pages_url = []
    for i in range(2, 10):  # assume at most pages 2 through 9
        other_url = 'http://pic.netbian.com' + url + 'index_' + str(i) + '.html'  # splice the page URL
        # print(other_url)
        r = requests.get(other_url, headers=headers)  # request page i of the group
        if r.status_code == 200:
            html = etree.HTML(r.text)
            html = etree.tostring(html)
            html = etree.fromstring(html)
            page_url = html.xpath(r'//ul[@class="clearfix"]//li/a/@href')  # detail-page URL of each picture
            page_url = ['http://pic.netbian.com' + i for i in page_url]  # splice the full URLs
            pages_url.append(page_url)  # one list per page, so pages_url is a list of lists
    pool_down(title, pages_url, headers)  # hand off to the download pool

Get the download URL:

  • Then we open the picture and, through the browser console, we can find what's shown below:

Note that what we grab here is the image URL in src, which is of higher quality and makes it easy for us to adjust the image afterwards. This is where the thread pool and the Image class come in; used simply, neither is difficult.

The code is as follows:

def image_down(title, page_url, headers):  # Download pictures
    os.makedirs(title, exist_ok=True)  # one sub-folder per group; safe to call repeatedly and from several threads
    for i, j in enumerate(page_url):  # traverse the detail-page URLs of this page
        r = requests.get(j, headers=headers)  # request the detail page of this picture
        if r.status_code == 200:
            r.encoding = r.apparent_encoding  # fix the encoding
            html = etree.HTML(r.text)
            html = etree.tostring(html)
            html = etree.fromstring(html)  # build the xpath object as above
            url = html.xpath(r'//a[@id="img"]/img/@src')
            name = html.xpath(r'//a[@id="img"]/img/@title')
            rr = requests.get('http://pic.netbian.com' + ''.join(url), headers=headers)
            if rr.status_code == 200:  # download the image itself
                file_name = os.path.join(title, ''.join(name) + '.png')
                with open(file_name, 'wb') as fw:
                    fw.write(rr.content)
                img = Image.open(file_name)
                img = img.resize((4000, 2000), Image.ANTIALIAS)  # enlarge with anti-aliasing (Image.LANCZOS in newer Pillow)
                # print(img.size)
                img.save(file_name, quality=95)  # re-save; quality only matters for JPEG, PNG ignores it
                print(f'{title} No. {i + 1} downloaded!')
        else:
            print('Request failed!')


def pool_down(title, page_url, headers):  # Threaded download
    # print(title, len(page_url))
    path = 'D://netbian_gallery//'  # top-level folder for all downloads
    # create the top-level folder
    if not os.path.exists(path):
        os.mkdir(path)
        os.chdir(path)
    else:
        os.chdir(path)
    #  create the multithreaded download
    pool = Pool(6)  # six at a time
    if page_url and isinstance(page_url[0], list):  # a list of pages (pages 2 and up)
        for i in page_url:
            pool.apply_async(image_down, args=(title, i, headers))  # apply_async so the workers actually run in parallel
    else:  # a flat list of URLs from the first page
        pool.apply_async(image_down, args=(title, page_url, headers))  # hand the whole first page to one worker
    pool.close()
    pool.join()

Project experience:

  • I've given almost all the code. To keep you sharp, I hope you'll write the main function yourself; it's very simple, just straightforward calls, and I've handed you all the parameters, so you can get your hands dirty and enjoy the fun of programming (a minimal sketch of what it might look like follows this list)!

  • My understanding of crawlers is that the hard part is not implementing the code but analyzing the website. A good analysis absolutely gets twice the result with half the effort!

  • This time, since the target is a static website, I didn't hit many pitfalls, only small problems, so I won't list them just to play up how hard-won this was.

  • All in all, if you found this helpful, a follow and some support would be the biggest help for me. I'll keep at it, and so should you!
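
Purely as a hint of the shape such a main function might take, here is a minimal sketch; the start URL and request header are assumptions, since the article deliberately leaves this part to the reader:

if __name__ == '__main__':
    headers = {'User-Agent': 'Mozilla/5.0'}  # placeholder request header
    title, url_list = get_groups('http://pic.netbian.com/', headers)  # scrape the group titles and hrefs
    begin_down(title, url_list, headers)  # let the user pick a group and start downloading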

Project postscript:

Let it soar, let it charge, let it ride the waves!
