Data acquisition and fusion technology -- Experiment 3

Assignment ①

  • Requirements: specify a website and crawl all the images on it, e.g. the China Meteorological Network (http://www.weather.com.cn). Use single-threaded and multi-threaded crawling respectively. (The number of images to crawl is limited to the last 3 digits of the student number.)

  • Output information: print the URL of each downloaded image on the console, store the downloaded images in an images subfolder, and provide a screenshot.

Implementation process

1. Send the request and fetch the page source

import io, os, re, sys, threading, urllib.request
import requests

sys.stdout = io.TextIOWrapper(sys.stdout.buffer, encoding='gb18030')  # switch standard output to a Chinese-capable encoding
url = 'http://www.weather.com.cn/'
header = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.119 Safari/537.36'}  # the UA must be a dict entry passed via the headers keyword
html = requests.get(url, headers=header)
data = html.text
count = 1
# print(data)

2. Construct the regular expressions

s1 = r'<img src="(.*?\.(?:png|jpg))"'  # match image addresses ending in .png or .jpg
s2 = r'href="(.*?)"'  # match sub-page links
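
As a quick sanity check, the two expressions can be tried on a small made-up HTML fragment (this snippet is illustrative, not taken from the live page):

sample = '<a href="http://www.weather.com.cn/news/"><img src="http://i.weather.com.cn/pic/a.png"></a>'
print(re.findall(s1, sample))  # ['http://i.weather.com.cn/pic/a.png']
print(re.findall(s2, sample))  # ['http://www.weather.com.cn/news/']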

3. Crawl all image download links on the page and its sub-pages

tags = re.findall(s1, data)  # image addresses on the main page
os.makedirs('images', exist_ok=True)  # make sure the images subfolder exists
for tag in tags:
    print('The address of the image to download is: ' + tag)
    urllib.request.urlretrieve(tag, "images/Picture of Chen Shuo's school bag " + str(count) + '.jpg')
    count += 1
    if count > 104:
        break
lis = re.findall(s2, data)
for li in lis:
    if not li.startswith('http'):  # skip relative or empty links
        continue
    ht = requests.get(li, headers=header)
    dt = ht.text
    img_ls = re.findall(s1, dt)  # image addresses on the sub-page
    for img in img_ls:
        print('The address of the image to download is: ' + img)
        urllib.request.urlretrieve(img, "images/Picture of Chen Shuo's school bag " + str(count) + '.jpg')
        count += 1
        if count > 104:
            break
    if count > 104:
        break

4. Output results

5. Multi-threaded crawling

def download(url, name):
    urllib.request.urlretrieve(url, name)  # save the image under the given file name

count = 1  # reset the counter for the multi-threaded run
for tag in tags:
    print('The address of image ' + str(count) + ' is: ' + tag)
    # pass the callable and its arguments separately; writing target=download(...)
    # would call download immediately and run the crawl serially
    T = threading.Thread(target=download, args=(tag, "images/Picture of Chen Shuo's school bag " + str(count) + '.jpg'))
    T.daemon = False  # non-daemon threads keep the process alive until the download finishes
    T.start()
    count += 1
    if count > 104:
        break
lis = re.findall(s2, data)
for li in lis:
    if not li.startswith('http'):  # skip relative or empty links
        continue
    ht = requests.get(li, headers=header)
    dt = ht.text
    img_ls = re.findall(s1, dt)  # image addresses on the sub-page
    for img in img_ls:
        print('The address of image ' + str(count) + ' is: ' + img)
        T = threading.Thread(target=download, args=(img, "images/Picture of Chen Shuo's school bag " + str(count) + '.jpg'))
        T.daemon = False
        T.start()
        count += 1
        if count > 104:
            break
    if count > 104:
        break
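
Starting one bare thread per image works, but a thread pool gives the same concurrency with a bounded number of workers and an implicit join. A minimal sketch using the standard library (it reuses the tags list and download function from above; the worker count of 8 is an arbitrary choice):

from concurrent.futures import ThreadPoolExecutor

with ThreadPoolExecutor(max_workers=8) as pool:  # at most 8 downloads in flight at a time
    for i, tag in enumerate(tags[:104], start=1):
        pool.submit(download, tag, "images/Picture of Chen Shuo's school bag " + str(i) + '.jpg')
# leaving the with-block waits for every queued download to finish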

Experience

  • Reviewed the use of concurrency and gained a deeper understanding of it.
  • Reviewed how to write regular expressions.

Assignment ②

  • Requirements: use the Scrapy framework to reproduce Assignment ①.

  • Output information: the same as Assignment ①.

Implementation process

Code implementation: https://gitee.com/chenshuooooo/data-acquisition/tree/master/%E4%BD%9C%E4%B8%9A3/%E7%AC%AC%E4%BA%8C%E9%A2%98
1. Write the main crawler program weather.py

# -*- coding:utf-8 -*-
import scrapy
from ..items import Exp3Item
# 031904104 Chen Shuo

class WeatherSpider(scrapy.Spider):
    name = 'weather'
    start_urls = ['http://p.weather.com.cn/txqg/index.shtml']
    count = 0  # counter; stop after 104 images, the last three digits of the student number

    def parse(self, response):
        # Scrapy dispatches responses for start_urls to parse() by default
        urls = response.xpath('//div[@class="tu"]/a/@href').extract()  # match sub-page urls
        for url in urls:
            if self.count > 104:
                break
            yield scrapy.Request(url=url, callback=self.get_imgurl)

    def get_imgurl(self, response):
        img_urls = response.xpath('//div[@class="buttons"]/span/img/@src').extract()
        for url in img_urls:
            self.count += 1
            if self.count > 104:
                return
            item = Exp3Item()
            item['img_url'] = url
            yield item

2. settings.py
Set ROBOTSTXT_OBEY to False, and configure the storage path and pipeline priority.
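
The relevant entries would look roughly like this (Exp3Pipeline matches the pipelines.py below; the project name exp3 and the priority value 300 are assumptions):

ROBOTSTXT_OBEY = False  # stop obeying robots.txt

ITEM_PIPELINES = {
    'exp3.pipelines.Exp3Pipeline': 300,  # enable the image-saving pipeline
}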

3. Write the pipelines.py file

import requests

class Exp3Pipeline:
    def open_spider(self, spider):
        self.num = 1

    def process_item(self, item, spider):
        url = item['img_url']
        resp = requests.get(url)
        img = resp.content

        # use forward slashes so the Windows path needs no escaping
        with open('D:/image/%d.jpg' % self.num, 'wb') as f:
            f.write(img)
            print('%d' % self.num)
            self.num += 1
        return item

4. Write items.py

import scrapy

class Exp3Item(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    img_url = scrapy.Field()  # image url address

5. Write run.py to run the crawler from a simulated command line

# -*- coding:utf-8 -*-
from scrapy import cmdline
import sys
sys.path.append(r'D:\python project\exp3\exp3spiders\weather')  # add the crawler path so it can be found without errors
cmdline.execute('scrapy crawl weather'.split())  # run the crawler

6. Crawling results

  • Error: Spider error processing <GET http://p.weather.com.cn/txqg/index.shtml> (referer: None)

    Searching for the cause turned up two likely problems:
  • ① An XPath expression parsing error (which would stop the crawler from parsing the DOM). After asking a more experienced classmate, I rewrote the expressions, so there should be no problem there.
  • ② The robots protocol not being set to False. After checking the settings file, it had already been modified, so this should not be the problem either.
    However, after ruling out these two problems, the error was still reported. A third likely cause: Scrapy dispatches responses for start_urls to a callback named parse by default, so if the entry method is named get_urllist instead, the base class's parse raises NotImplementedError and produces exactly this spider error.

Experience

  • Reviewed the use of Scrapy and gained a deeper understanding of it; Scrapy's crawling speed really is fast.

Assignment ③

Requirements: crawl the Douban movie data using Scrapy and XPath; store the content in a database and store the images under the imgs path.
Candidate site: https://movie.douban.com/top250
Output information:

Serial number | Movie title | Director | Starring | Brief introduction | Rating | Cover
1 | The Shawshank Redemption | Frank Darabont | Tim Robbins | Want to set people free | 9.7 | ./imgs/xsk.jpg

Implementation process

Code link: https://gitee.com/chenshuooooo/data-acquisition/tree/master/%E4%BD%9C%E4%B8%9A3/%E7%AC%AC%E4%B8%89%E9%A2%98
1. The items class
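
The items code is not reproduced in the post; judging from the fields the spider below fills in (num, name, dir, act, introduce, score, img), it would be roughly:

import scrapy

class DoubanItem(scrapy.Item):
    num = scrapy.Field()        # serial number
    name = scrapy.Field()       # movie title
    dir = scrapy.Field()        # director
    act = scrapy.Field()        # starring actor
    introduce = scrapy.Field()  # brief introduction
    score = scrapy.Field()      # rating
    img = scrapy.Field()        # cover image url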

2. settings.py
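
The settings are also not shown; a sketch of the entries this assignment needs (the USER_AGENT value is an example browser string, and the pipeline path assumes the project is named douban):

ROBOTSTXT_OBEY = False
# Douban rejects Scrapy's default user agent, so a browser one is required
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/72.0.3626.119 Safari/537.36'

ITEM_PIPELINES = {
    'douban.pipelines.DoubanPipeline': 300,
}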

3. Crawler db.py

# -*- coding:utf-8 -*-
import re

import scrapy
from scrapy import Request  # use Scrapy's Request, which supports the callback argument
from douban.items import DoubanItem


class DbSpider(scrapy.Spider):
    name = 'db'
    count = 0  # serial number counter

    def start_requests(self):
        # Page turning: each page of the Top 250 list holds 25 movies
        for i in range(10):
            url = 'https://movie.douban.com/top250?start=' + str(i * 25) + '&filter='
            yield Request(url=url, callback=self.parse1)

    # Select each field with XPath and pass it into the item
    def parse1(self, response):
        data = response.body.decode()
        selector = scrapy.Selector(text=data)
        # Get each movie item
        movies = selector.xpath("//ol[@class='grid_view']/li")
        # Select the corresponding content under each movie tag
        for i in movies:
            image = i.xpath("./div[@class='item']/div[@class='pic']/a/img/@src").extract_first()
            name = i.xpath(
                "./div[@class='item']/div[@class='info']/div[@class='hd']//span[@class='title']/text()").extract_first()
            dir = i.xpath(
                "./div[@class='item']/div[@class='info']/div[@class='bd']/p[@class='']/text()").extract_first()
            desp = i.xpath(
                "./div[@class='item']/div[@class='info']/div[@class='bd']/p[@class='quote']/span/text()").extract_first()
            grade = i.xpath(
                "./div[@class='item']/div[@class='info']/div[@class='bd']/div/span[@class='rating_num']/text()").extract_first()
            print(image)
            # Normalize the director/actor line so the regexes below can split out the
            # two fields (the page text uses the Chinese labels 导演: and 主演:)
            dir = dir.replace(' ', '')
            dir = dir.replace('\n', '')
            dir = dir + '\n'
            director = re.findall(r'导演:(.*?)主演', dir)
            actor = re.findall(r'主演:(.*?)\n', dir)
            self.count += 1
            item = DoubanItem()
            # Save to the corresponding item fields
            item['num'] = str(self.count)
            item['name'] = str(name)
            item['dir'] = str(director[0]) if director else 'null'
            if len(actor) != 0:  # the actor may be missing, e.g. for animations, or when the director's name is too long for the label to be shown
                item['act'] = str(actor[0])
            else:
                item['act'] = 'null'
            item['introduce'] = str(desp)
            item['score'] = str(grade)
            item['img'] = str(image)
            yield item

4. The pipeline class
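
The pipeline code is not reproduced either; a minimal sketch that meets the requirement, with SQLite as the database and covers saved under ./imgs (the class, table, and file names are my own choices):

import os
import sqlite3

import requests

class DoubanPipeline:
    def open_spider(self, spider):
        os.makedirs('imgs', exist_ok=True)
        self.con = sqlite3.connect('movies.db')
        self.con.execute('create table if not exists movies '
                         '(num text, name text, dir text, act text, introduce text, score text, img text)')

    def process_item(self, item, spider):
        # download the cover and save it under ./imgs, named by serial number
        path = './imgs/%s.jpg' % item['num']
        with open(path, 'wb') as f:
            f.write(requests.get(item['img']).content)
        # store the record, keeping the local image path in the img column
        self.con.execute('insert into movies values (?,?,?,?,?,?,?)',
                         (item['num'], item['name'], item['dir'], item['act'],
                          item['introduce'], item['score'], path))
        self.con.commit()
        return item

    def close_spider(self, spider):
        self.con.close()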

Experience

This assignment further deepened my understanding of the Scrapy framework, but I am still not familiar with the use of pipeline classes and need to strengthen that.
