Batch-crawling cat pictures with Python to build a thousand-image photomosaic

Preface

Use Python to crawl cat pictures, then use those pictures to build a portrait of a cat: a photomosaic assembled from thousands of images!

Crawling cat pictures

The Python version used in this article is 3.10.0, which can be downloaded directly from the official website: https://www.python.org .

The installation and configuration of Python are not covered in detail here; there are plenty of tutorials online!

1. Crawl the art-material website (huiyi8.com)

Target site: the cat picture gallery on huiyi8.com

First install the necessary libraries:

pip install BeautifulSoup4
pip install requests
pip install urllib3
pip install lxml

The image-crawling code:

from bs4 import BeautifulSoup
import requests
import os

# URL of page 1 of the cat picture site
url = 'https://www.huiyi8.com/tupian/tag-%E7%8C%AB%E5%92%AA/1.html'
# Image save path; the leading r marks a raw string (no escape processing)
path = r"/Users/lpc/Downloads/cats/"
# Create the directory if it does not already exist
if not os.path.exists(path):
    os.mkdir(path)


# Build the URLs of all cat picture pages
def allpage():
    all_url = []
    # Loop over pages 1 to 19
    for i in range(1, 20):
        # Swap in the page number; url[-6] is the '1' in '1.html'
        # (str.replace changes every occurrence, which is safe here only
        # because '1' appears nowhere else in this URL)
        each_url = url.replace(url[-6], str(i))
        # Collect every generated URL
        all_url.append(each_url)
    # Return all page addresses
    return all_url


# Main entry point
if __name__ == '__main__':
    # Call allpage to get every page address
    img_url = allpage()
    for url in img_url:
        # Fetch the page source
        requ = requests.get(url)
        req = requ.text.encode(requ.encoding).decode()
        html = BeautifulSoup(req, 'lxml')
        # List collecting matching img tags
        img_urls = []
        # Walk every img tag in the page
        for img in html.find_all('img'):
            # Keep only src values that start with http and end with jpg
            if img["src"].startswith('http') and img["src"].endswith("jpg"):
                img_urls.append(img)
        # Loop over every matching tag
        for k in img_urls:
            # Picture URL
            img = k.get('src')
            # Picture name from the alt attribute; str() guards against a missing alt (None)
            name = str(k.get('alt'))
            # Build the file name
            file_name = path + name + '.jpg'
            # Download the picture via its URL and save it under that name
            with open(file_name, "wb") as f, requests.get(img) as res:
                f.write(res.content)
            # Print the crawled picture
            print(img, file_name)

Note: the code above cannot be copied and run as-is; you must change the download path /Users/lpc/Downloads/cats/ to a save path on your own machine!
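
As a small aside (my suggestion, not part of the original script), the directory handling can be made portable with pathlib, whose mkdir call creates missing parents and tolerates an existing directory:

from pathlib import Path

# Hypothetical save location; point this anywhere you like
path = Path.home() / "Downloads" / "cats"
# parents=True creates intermediate directories; exist_ok=True skips creation if present
path.mkdir(parents=True, exist_ok=True)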

Crawling succeeded:

A total of 346 cat pictures were downloaded!

2. Crawl the ZOL website

Target site: the 'adorable cats' wallpaper category on ZOL (desk.zol.com.cn)

The crawl code:

import requests
import time
import os
from lxml import etree

# Request URL (page 1)
url = 'https://desk.zol.com.cn/dongwu/mengmao/1.html'
# Image save path; the leading r marks a raw string (no escape processing)
path = r"/Users/lpc/Downloads/ZOL/"
# Create the directory if it does not already exist
if not os.path.exists(path):
    os.mkdir(path)
# Request headers
headers = {"Referer": "http://desk.zol.com.cn/dongman/1920x1080/",
           "User-Agent": "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.77 Safari/537.36", }

headers2 = {
    "User-Agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.81 Safari/537.36 SE 2.X MetaSr 1.0", }


def allpage():  # Build all page URLs
    all_url = []
    for i in range(1, 4):  # Pages 1 to 3
        each_url = url.replace(url[-6], str(i))  # Swap in the page number (the '1' in '1.html')
        all_url.append(each_url)
    return all_url  # Return the list of addresses


# TODO: fetch and parse the HTML pages
if __name__ == '__main__':
    img_url = allpage()  # Build the page list
    for url in img_url:
        # Send the request
        resq = requests.get(url, headers=headers)
        # Show whether the request succeeded
        print(resq)
        # Parse the fetched page
        html = etree.HTML(resq.text)
        # Collect the links (under the a tags) that lead to the HD image pages
        hrefs = html.xpath('.//a[@class="pic"]/@href')
        # TODO: go one level deeper for the high-definition pictures
        for href in hrefs:
            # Request the detail page
            resqt = requests.get("https://desk.zol.com.cn" + href, headers=headers)
            # Parse it
            htmlt = etree.HTML(resqt.text)
            srct = htmlt.xpath('.//img[@id="bigImg"]/@src')
            # Derive the picture name from the URL
            imgname = srct[0].split('/')[-1]
            # Fetch the picture itself
            img = requests.get(srct[0], headers=headers2)
            # Write the picture to a file ("wb" so a re-run overwrites rather than appends)
            with open(path + imgname, "wb") as file:
                file.write(img.content)
            # Print the crawled picture
            print(img, imgname)
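
One more hedged suggestion, not in the original code: the script imports time but never uses it. A short randomized pause inside the download loop is kinder to the server, e.g.:

import random
import time

# Sleep 0.5 to 1.5 seconds between requests; tune the bounds to taste
time.sleep(random.uniform(0.5, 1.5))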

Crawling succeeded:

A total of 81 cat pictures were downloaded!

3. Crawl the Baidu Images website

Target site: Baidu image search for cat pictures

1. Image-crawling code:

import requests
import os
from lxml import etree
path = r"/Users/lpc/Downloads/baidu1/"
# Judge whether the directory exists, skip if it exists, and create if it does not exist
if os.path.exists(path):
    pass
else:
    os.mkdir(path)

page = input('Please enter how many pages to crawl:')
page = int(page) + 1
header = {
    'User-Agent': 'Mozilla/5.0 (Macintosh; Intel Mac OS X 11_1_0) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.88 Safari/537.36'
}
n = 0
pn = 1
# pn is the index of the first picture to fetch; Baidu Images loads 30 at a time as you scroll
for m in range(1, page):
    url = 'https://image.baidu.com/search/acjson?'

    param = {
        'tn': 'resultjson_com',
        'logid': '7680290037940858296',
        'ipn': 'rj',
        'ct': '201326592',
        'is': '',
        'fp': 'result',
        'queryWord': 'Kitty',
        'cl': '2',
        'lm': '-1',
        'ie': 'utf-8',
        'oe': 'utf-8',
        'adpicid': '',
        'st': '-1',
        'z': '',
        'ic': '0',
        'hd': '1',
        'latest': '',
        'copyright': '',
        'word': 'Kitty',
        's': '',
        'se': '',
        'tab': '',
        'width': '',
        'height': '',
        'face': '0',
        'istype': '2',
        'qc': '',
        'nc': '1',
        'fr': '',
        'expermode': '',
        'nojc': '',
        'acjsonfr': 'click',
        'pn': pn,  # Which picture to start with
        'rn': '30',
        'gsm': '3c',
        '1635752428843=': '',
    }
    page_text = requests.get(url=url, headers=header, params=param)
    page_text.encoding = 'utf-8'
    page_text = page_text.json()
    print(page_text)
    # First pull out the list of dicts that hold all the links
    info_list = page_text['data']
    # The last dict returned this way is empty, so drop the final element
    del info_list[-1]
    # Define a list for storing picture addresses
    img_path_list = []
    for i in info_list:
        img_path_list.append(i['thumbURL'])
    # Then take out all the picture addresses and download them
    # n will be the name of the picture
    for img_path in img_path_list:
        img_data = requests.get(url=img_path, headers=header).content
        img_path = path + str(n) + '.jpg'
        with open(img_path, 'wb') as fp:
            fp.write(img_data)
        n = n + 1

    pn += 29  # Advance to the next batch; with rn=30, a step of 29 repeats one image per page (use 30 to avoid the overlap)
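
One caveat (my note, not from the original): entries in data occasionally lack a thumbURL key, in which case i['thumbURL'] raises a KeyError. A safer variant filters with dict.get:

# Keep only entries that actually carry a thumbnail URL
img_path_list = [i['thumbURL'] for i in info_list if i.get('thumbURL')]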

2. Crawl code (this version also trims every download into a 50 x 50 thumbnail):

# -*- coding:utf-8 -*-
import requests
import re, time, datetime
import os
import random
import urllib.parse
from PIL import Image  # Import a module

imgDir = r"/Volumes/DBA/python/img/"
# Set headers to prevent anti pickpocketing, set multiple headers
# chrome,firefox,Edge
headers = [
    {
        'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/84.0.4147.105 Safari/537.36',
        'Accept-Language': 'zh-CN,zh;q=0.8,zh-TW;q=0.7,zh-HK;q=0.5,en-US;q=0.3,en;q=0.2',
        'Connection': 'keep-alive'
    },
    {
        "User-Agent": 'Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:79.0) Gecko/20100101 Firefox/79.0',
        'Accept-Language': 'zh-CN,zh;q=0.8,zh-TW;q=0.7,zh-HK;q=0.5,en-US;q=0.3,en;q=0.2',
        'Connection': 'keep-alive'
    },
    {
        "User-Agent": 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/70.0.3538.102 Safari/537.36 Edge/18.19041',
        'Accept-Language': 'zh-CN',
        'Connection': 'keep-alive'
    }
]

picList = []  # List that will collect all thumbnail URLs

keyword = input("Please enter a keyword to search for:")
kw = urllib.parse.quote(keyword)  # URL-encode the keyword


# Fetch Baidu search thumbnails in batches of 30 (up to roughly 1000 in total)
def getPicList(kw, n):
    global picList
    weburl = r"https://image.baidu.com/search/acjson?tn=resultjson_com&logid=11601692320226504094&ipn=rj&ct=201326592&is=&fp=result&queryWord={kw}&cl=2&lm=-1&ie=utf-8&oe=utf-8&adpicid=&st=&z=&ic=&hd=&latest=&copyright=&word={kw}&s=&se=&tab=&width=&height=&face=&istype=&qc=&nc=1&fr=&expermode=&force=&cg=girl&pn={n}&rn=30&gsm=1e&1611751343367=".format(
        kw=kw, n=n * 30)
    req = requests.get(url=weburl, headers=random.choice(headers))
    req.encoding = req.apparent_encoding  # Guard against garbled Chinese text
    webJSON = req.text
    imgurlReg = '"thumbURL":"(.*?)"'  # Regular expression matching the thumbnail URLs
    picList = picList + re.findall(imgurlReg, webJSON, re.DOTALL | re.I)


for i in range(150):  # Generous loop count; once the results run out, picList simply stops growing.
    getPicList(kw, i)

for item in picList:
    # File extension and name
    hz = ".jpg"
    picName = str(int(time.time() * 1000))  # Millisecond timestamp as the file name
    # Request the picture
    imgReq = requests.get(url=item, headers=random.choice(headers))
    # Save the picture
    with open(imgDir + picName + hz, "wb") as f:
        f.write(imgReq.content)
    # Reopen it with the Image module
    im = Image.open(imgDir + picName + hz)
    bili = im.width / im.height  # Width/height ratio, used to resize proportionally
    # Resize so that the shorter side becomes 50
    if bili >= 1:
        newIm = im.resize((round(bili * 50), 50))
    else:
        newIm = im.resize((50, round(50 * im.height / im.width)))
    # Crop out the top-left 50 x 50 region
    clip = newIm.crop((0, 0, 50, 50))
    clip.convert("RGB").save(imgDir + picName + hz)  # Save the cropped thumbnail
    print(picName + hz + " processed")
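
Cropping the top-left 50 x 50 corner can cut away the subject. Pillow's ImageOps.fit resizes and center-crops to the target size in one call; a hedged alternative to the manual resize-and-crop above:

from PIL import Image, ImageOps

im = Image.open("cat.jpg")  # hypothetical input file
# Resize and center-crop to exactly 50 x 50 while preserving the aspect ratio
thumb = ImageOps.fit(im, (50, 50))
thumb.convert("RGB").save("cat_thumb.jpg")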

Crawling succeeded:

Summary: the three websites yielded 1600 cat pictures in total!

Thousand-image imaging (photomosaic)

After crawling thousands of pictures, the next step is to splice them into one big cat picture, i.e. a photomosaic: every pixel of a target image is replaced by a small photo whose average color matches it.
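
For intuition, the matching boils down to plain RGB arithmetic: the tile whose average color has the smallest Euclidean distance to the pixel wins. A tiny worked example (numbers invented for illustration):

# Distance from one background pixel to two candidate tile averages
pixel = (200, 60, 60)   # a reddish pixel
tile_a = (190, 70, 65)  # reddish tile average
tile_b = (60, 60, 200)  # bluish tile average

def dist(c1, c2):
    return sum((a - b) ** 2 for a, b in zip(c1, c2)) ** 0.5

print(dist(pixel, tile_a))  # ~15.0  -> this tile is chosen
print(dist(pixel, tile_b))  # ~198.0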

1. Using the Foto-Mosaik-Edda software

Download the software first: Foto-Mosaik-Edda Installer. If the download fails, search for foto mosaik edda directly on Baidu!

Installing Foto-Mosaik-Edda on Windows is straightforward!

Note: the .NET Framework 2 must be installed in advance, otherwise the error below is reported and the installation fails!

How to enable the .NET Framework 2:

Confirm that it has been successfully enabled:

Then you can continue the installation!

After installation, open the following:

Step 1: create a gallery:

Step 2: generate the photomosaic:

Here, select the gallery created in step 1:

The moment of magic:

Make another lovely cat:

Done!

2. Implementation using Python

First, select a picture:

Run the following code:

# -*- coding:utf-8 -*-
from PIL import Image
import os
import numpy as np

imgDir = r"/Volumes/DBA/python/img/"
bgImg = r"/Users/lpc/Downloads/494.jpg"


# Compute the average color of an image
def compute_mean(imgPath):
    '''
    Compute the average color of an image
    :param imgPath: thumbnail path
    :return: (r, g, b) average over the whole thumbnail
    '''
    im = Image.open(imgPath)
    im = im.convert("RGB")  # Switch to RGB mode
    # Convert the image into an array: one row per image row, each entry an [R G B] pixel
    '''For example:
     [[ 60  33  24]
      [ 58  34  24]
      ...
      [188 152 136]
      [ 99  96 113]]
    '''
    imArray = np.array(im)
    # np.mean averages over the rows and columns of each channel
    R = np.mean(imArray[:, :, 0])  # Average of all R values
    G = np.mean(imArray[:, :, 1])  # Average of all G values
    B = np.mean(imArray[:, :, 2])  # Average of all B values
    return (R, G, B)


def getImgList():
    """
    Collect the path and average color of every thumbnail
    :return: list of dicts, each holding an image path and its average color
    """
    imgList = []
    for pic in os.listdir(imgDir):
        imgPath = imgDir + pic
        imgRGB = compute_mean(imgPath)
        imgList.append({
            "imgPath": imgPath,
            "imgRGB": imgRGB
        })
    return imgList


def computeDis(color1, color2):
    '''
    Compute the color difference between two images as the Euclidean
    distance in RGB space:
    dis = ((r1-r2)**2 + (g1-g2)**2 + (b1-b2)**2) ** 0.5
    Parameters: color1, color2 -- color tuples (r, g, b)
    '''
    dis = 0
    for i in range(len(color1)):
        dis += (color1[i] - color2[i]) ** 2
    dis = dis ** 0.5
    return dis


def create_image(bgImg, imgDir, N=2, M=50):
    '''
    Fill a new picture with thumbnails, guided by the background picture
    bgImg: path of the background picture
    imgDir: thumbnail directory
    N: factor by which the background is scaled down
    M: size of each thumbnail tile (M x M)
    '''
    # Get the thumbnail list
    imgList = getImgList()

    # Read the background picture
    bg = Image.open(bgImg)
    # bg = bg.resize((bg.size[0] // N, bg.size[1] // N))  # Scaling the original down is recommended: a large image takes very long to process
    bgArray = np.array(bg)
    width = bg.size[0] * M  # Width of the generated picture; every pixel becomes an M-pixel tile
    height = bg.size[1] * M  # Height of the generated picture

    # Create a new blank image
    newImg = Image.new('RGB', (width, height))

    # Fill the image tile by tile
    for x in range(bgArray.shape[0]):  # x: row index of the background
        for y in range(bgArray.shape[1]):  # y: column index of the background
            # Find the thumbnail with the smallest color distance
            minDis = 10000
            index = 0
            for img in imgList:
                dis = computeDis(img['imgRGB'], bgArray[x][y])
                if dis < minDis:
                    index = img['imgPath']
                    minDis = dis
            # After the loop, index holds the path of the closest-colored image
            # and minDis holds its color distance
            # Fill the tile
            tempImg = Image.open(index)  # Open the picture with the smallest color distance
            # Resize it; this can be skipped if the thumbnails were already sized at download time
            tempImg = tempImg.resize((M, M))
            # Paste the tile onto the new picture; note the (column, row) order, advancing by M each time
            newImg.paste(tempImg, (y * M, x * M))
            print('(%d, %d)' % (x, y))  # Print progress: the current x, y

    # Save the result
    newImg.save('final.jpg')


create_image(bgImg, imgDir)

The result:

As the figure shows, the mosaic is comparable in clarity to the original, and after zooming in each small picture is still clearly visible!

Note: the pure-Python version runs slowly!
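
Almost all of that time goes into the three nested Python loops that compare every background pixel with every thumbnail. A hedged optimization sketch (my addition, reusing the imgList and bgArray names from the code above): precompute the tile mean colors as one array and let NumPy find the nearest tile for all pixels at once.

import numpy as np

def nearest_tile_indices(bgArray, imgList):
    """For every background pixel, return the index of the closest-colored tile."""
    # (T, 3) matrix of tile mean colors
    tile_rgb = np.array([img["imgRGB"] for img in imgList], dtype=np.float64)
    # Broadcast to (H, W, T, 3) differences; the [:3] guards against an RGBA background
    diff = bgArray[:, :, None, :3].astype(np.float64) - tile_rgb[None, None, :, :]
    # Sum squares over the color axis, then argmin over the tile axis: an (H, W) index map
    return (diff ** 2).sum(axis=-1).argmin(axis=-1)

The paste loop then reduces to looking up imgList[idx[x, y]]['imgPath'], and caching opened thumbnails in a dict avoids reopening the same file thousands of times. The trade-off is the (H, W, T) distance array in memory, so scale the background down first.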

Write at the end

It's so nice to get a cat fix again~
