Operation ①
- Requirements: specify a website and crawl all of the images on it, e.g. China Meteorological Network (http://www.weather.com.cn). Use single-threaded and multi-threaded crawling respectively. (The number of images crawled is limited by the last three digits of the student number.)
- Output information: print the downloaded URLs on the console, store the downloaded images in the images subfolder, and give a screenshot.
Implementation process
- Single-threaded implementation process and code link: https://gitee.com/chenshuooooo/data-acquisition/blob/master/%E4%BD%9C%E4%B8%9A3/1%E5%8D%95%E7%BA%BF%E7%A8%8B.py
1. Analyze web pages
```python
import io
import re
import sys
import urllib.request

import requests

# switch stdout to a Chinese-capable encoding so printed messages display correctly
sys.stdout = io.TextIOWrapper(sys.stdout.buffer, encoding='gb18030')

url = 'http://www.weather.com.cn/'
# the header must be a dict passed via the headers= keyword, not a bare string
header = {'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 '
                        '(KHTML, like Gecko) Chrome/72.0.3626.119 Safari/537.36'}
html = requests.get(url, headers=header)
data = html.text
count = 1
```
2. Construct the regular expressions
```python
s1 = r'<img src="(.*?)[png,jpg]"'  # match image addresses
s2 = r'href="(.*?)"'               # match sub-page links
```
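Note that `[png,jpg]` is a character class matching a single character, not the literal extensions, so the captured group stops one character short of the full URL; this is why the crawl loop below re-appends a trailing `'g'`. A quick illustration (the sample tag is made up):

```python
import re

s1 = r'<img src="(.*?)[png,jpg]"'
sample = '<img src="http://example.com/pic/a.png">'  # hypothetical tag
print(re.findall(s1, sample))  # ['http://example.com/pic/a.pn'] -> append 'g'
```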
3. Crawl all image download links on the page and its sub-pages
```python
tags = re.findall(s1, data)  # image addresses on the main page
for tag in tags:
    tag = tag + 'g'  # re-append the 'g' consumed by the character class
    print('The address of the picture to download is: ' + tag)
    urllib.request.urlretrieve(tag, "images/Chen Shuo's picture " + str(count) + '.jpg')
    count += 1
    if count > 104:
        break

lis = re.findall(s2, data)  # sub-page links
for li in lis:
    ht = requests.get(li)
    dt = ht.text
    img_ls = re.findall(s1, dt)  # image addresses on the sub-page
    for img in img_ls:
        img = img + 'g'
        print('The address of the picture to download is: ' + img)
        urllib.request.urlretrieve(img, "images/Chen Shuo's picture " + str(count) + '.jpg')
        count += 1
        if count > 104:
            break
    if count > 104:
        break
```
4. Output results
5. Crawl and download results
- Multi-threaded implementation process and code link: https://gitee.com/chenshuooooo/data-acquisition/blob/master/%E4%BD%9C%E4%B8%9A3/1%E5%A4%9A%E7%BA%BF%E7%A8%8B.py
1. Encapsulate the urllib.request.urlretrieve call in a download function, then use threading for concurrent downloads.
```python
import threading

def download(url, name):
    urllib.request.urlretrieve(url, name)  # pass the filename through so it is used

tags = re.findall(s1, data)
for tag in tags:
    tag = tag + 'g'
    print('The address of picture ' + str(count) + ' is: ' + tag)
    # pass the function and its arguments separately: target=download(...) would
    # run the download in the main thread instead of in a new one
    T = threading.Thread(target=download, args=(tag, "images/Chen Shuo's picture " + str(count) + '.jpg'))
    T.setDaemon(False)
    T.start()
    count += 1
    if count > 104:
        break

lis = re.findall(s2, data)
for li in lis:
    ht = requests.get(li)
    dt = ht.text
    img_ls = re.findall(s1, dt)  # image addresses on the sub-page
    for img in img_ls:
        img = img + 'g'
        print('The address of picture ' + str(count) + ' is: ' + img)
        T = threading.Thread(target=download, args=(img, "images/Chen Shuo's picture " + str(count) + '.jpg'))
        T.setDaemon(False)
        T.start()
        count += 1
        if count > 104:
            break
    if count > 104:
        break
```
Experience
- Reviewed the use of concurrency and gained a deeper understanding of it.
- Reviewed how to write regular expressions.
Operation ②
- Requirements: use the Scrapy framework to reproduce Operation ①.
- Output information: the same as Operation ①.
Implementation process
Code implementation: https://gitee.com/chenshuooooo/data-acquisition/tree/master/%E4%BD%9C%E4%B8%9A3/%E7%AC%AC%E4%BA%8C%E9%A2%98
1. Write weather.py, the spider's main program
```python
# -*- coding:utf-8 -*-
import scrapy
from ..items import Exp3Item


# 031904104 Chen Shuo
class WeatherSpider(scrapy.Spider):
    name = 'weather'
    start_urls = ['http://p.weather.com.cn/txqg/index.shtml']
    count = 0  # counter: stop after 104 items (the last three digits of the student number)

    def get_urllist(self, response):
        urls = response.xpath('//div[@class="tu"]/a/@href').extract()  # match sub-page urls
        for url in urls:
            if self.count > 104:
                break
            yield scrapy.Request(url=url, callback=self.get_imgurl)

    def get_imgurl(self, response):
        img_url = response.xpath('//div[@class="buttons"]/span/img/@src')
        for i in img_url:
            self.count += 1
            url = i.extract()
            if self.count <= 104:
                item = Exp3Item()  # a fresh item per image, so each yield is independent
                item['img_url'] = url
                yield item
```
2. settings.py
Set the robots protocol (ROBOTSTXT_OBEY) to False, and configure the storage path and pipeline priority.
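For reference, a minimal sketch of the settings.py entries this step describes, assuming the project and pipeline are named exp3 and Exp3Pipeline (ROBOTSTXT_OBEY and ITEM_PIPELINES are standard Scrapy names; the module path is an assumption):

```python
# settings.py (sketch; the module path exp3.pipelines.Exp3Pipeline is assumed)
ROBOTSTXT_OBEY = False  # do not honor robots.txt

ITEM_PIPELINES = {
    'exp3.pipelines.Exp3Pipeline': 300,  # pipeline priority (lower runs earlier)
}
```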
3. Write the pipelines.py file
```python
import requests


class Exp3Pipeline:
    def open_spider(self, spider):
        self.num = 1

    def process_item(self, item, spider):
        url = item['img_url']
        resp = requests.get(url)
        img = resp.content
        # raw string avoids backslash-escape problems in the Windows path
        with open(r'D:\image\%d.jpg' % self.num, 'wb') as f:
            f.write(img)
        print('%d' % self.num)
        self.num += 1
        return item
```
4. Write items.py
```python
import scrapy


class Exp3Item(scrapy.Item):
    # define the fields for your item here, like:
    # name = scrapy.Field()
    img_url = scrapy.Field()  # picture url address
```
5. Write run.py to run the crawler as if from the command line
```python
# -*- coding:utf-8 -*-
import sys

from scrapy import cmdline

sys.path.append(r'D:\python project\exp3\exp3spiders\weather')  # add the crawler path so the module can be found
cmdline.execute('scrapy crawl weather'.split())  # run the crawler
```
6. Crawling results
- Error: `Spider error processing <GET http://p.weather.com.cn/txqg/index.shtml> (referer: None)`
Searching for this error turned up two likely causes:
- ① An XPath parsing error (which would keep the spider from parsing the DOM). After consulting a more experienced classmate I rebuilt the expressions, so this should no longer be the issue.
- ② The robots protocol not being set to False. Checking the settings file showed it had already been changed, so this should not be the problem either.
However, the error is still reported after ruling out both. (One further possibility, not verified here: because the spider relies on start_urls, Scrapy dispatches the responses to self.parse by default, and the entry callback above is named get_urllist, so it may never be invoked; an unimplemented default callback produces exactly this kind of spider error.)
Experience
- Reviewed the use of Scrapy and gained a deeper understanding of it; Scrapy's crawling speed really is fast.
Operation ③
Requirements: crawl Douban movie data using Scrapy and XPath, store the content in a database, and store the cover images under the imgs path.
Candidate sites: https://movie.douban.com/top250
Output information:
| Serial number | Movie title | Director | Starring | Brief introduction | Film rating | Film cover |
| --- | --- | --- | --- | --- | --- | --- |
| 1 | The Shawshank Redemption | Frank Darabont | Tim Robbins | Hope sets people free | 9.7 | ./imgs/xsk.jpg |
Implementation process
Code link: https://gitee.com/chenshuooooo/data-acquisition/tree/master/%E4%BD%9C%E4%B8%9A3/%E7%AC%AC%E4%B8%89%E9%A2%98
1. items class
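The items class is only given as a link above; as a reference, here is a minimal sketch of what it presumably contains, with the field names taken from the spider code in db.py below:

```python
import scrapy


class DoubanItem(scrapy.Item):
    num = scrapy.Field()        # serial number
    name = scrapy.Field()       # movie title
    dir = scrapy.Field()        # director
    act = scrapy.Field()        # starring actor
    introduce = scrapy.Field()  # brief introduction
    score = scrapy.Field()      # rating
    img = scrapy.Field()        # cover image url
```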
2. settings.py
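Likewise only linked; a plausible sketch of the relevant entries, assuming the project is named douban (Douban is also known to reject Scrapy's default user agent, so a browser-like USER_AGENT is usually set; treat the exact values as assumptions):

```python
# settings.py (sketch; the module path douban.pipelines.DoubanPipeline is assumed)
ROBOTSTXT_OBEY = False
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64)'  # browser-like UA (assumed)

ITEM_PIPELINES = {
    'douban.pipelines.DoubanPipeline': 300,
}
```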
3. Crawler db.py
```python
# -*- coding:utf-8 -*-
import re

import scrapy
from scrapy import Request  # Request must come from scrapy, not urllib.request

from douban.items import DoubanItem

count = 0  # global serial-number counter


class DbSpider(scrapy.Spider):
    name = 'db'  # spider name (assumed; not shown in the original snippet)

    # page turning: 10 pages of 25 movies each
    def start_requests(self):
        for i in range(10):
            url = 'https://movie.douban.com/top250?start=' + str(i * 25) + '&filter='
            yield Request(url=url, callback=self.parse1)

    # select each field with xpath and store it in the item
    def parse1(self, response):
        global count
        data = response.body.decode()
        selector = scrapy.Selector(text=data)
        movies = selector.xpath("//ol[@class='grid_view']/li")  # one <li> per movie
        for i in movies:
            image = i.xpath("./div[@class='item']/div[@class='pic']/a/img/@src").extract_first()
            name = i.xpath("./div[@class='item']/div[@class='info']/div[@class='hd']"
                           "//span[@class='title']/text()").extract_first()
            dir = i.xpath("./div[@class='item']/div[@class='info']/div[@class='bd']"
                          "/p[@class='']/text()").extract_first()
            desp = i.xpath("./div[@class='item']/div[@class='info']/div[@class='bd']"
                           "/p[@class='quote']/span/text()").extract_first()
            grade = i.xpath("./div[@class='item']/div[@class='info']/div[@class='bd']"
                            "/div/span[@class='rating_num']/text()").extract_first()
            print(image)
            # normalize the director/actor line so the regexes can pick out the names
            dir = dir.replace(' ', '').replace('\n', '') + '\n'
            director = re.findall(r'导演:(.*?)\s', dir)  # 导演 = director
            actor = re.findall(r'主演:(.*?)\n', dir)     # 主演 = starring
            count += 1
            item = DoubanItem()
            item['num'] = str(count)
            item['name'] = str(name)
            item['dir'] = str(director[0])
            if len(actor) != 0:
                item['act'] = str(actor[0])
            else:
                # the actor can be missing: animations have none, or a long
                # director name pushes it off the line
                item['act'] = 'null'
            item['introduce'] = str(desp)
            item['score'] = str(grade)
            item['img'] = str(image)
            yield item
```
4. pipeline class
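The pipeline class is also only linked. Since the task requires writing rows to a database and covers to ./imgs, here is a minimal sketch of one way to do it, assuming SQLite and the field names from db.py (file names and schema are assumptions, not the author's actual code):

```python
import os
import sqlite3

import requests


class DoubanPipeline:
    def open_spider(self, spider):
        os.makedirs('./imgs', exist_ok=True)
        # open the database and create the movie table (schema assumed)
        self.con = sqlite3.connect('movies.db')
        self.cur = self.con.cursor()
        self.cur.execute('CREATE TABLE IF NOT EXISTS movies '
                         '(num TEXT, name TEXT, dir TEXT, act TEXT, '
                         'introduce TEXT, score TEXT, img TEXT)')

    def process_item(self, item, spider):
        # download the cover into ./imgs and record the local path in the row
        path = './imgs/%s.jpg' % item['num']
        with open(path, 'wb') as f:
            f.write(requests.get(item['img']).content)
        self.cur.execute('INSERT INTO movies VALUES (?,?,?,?,?,?,?)',
                         (item['num'], item['name'], item['dir'], item['act'],
                          item['introduce'], item['score'], path))
        return item

    def close_spider(self, spider):
        self.con.commit()
        self.con.close()
```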
Experience
This further deepened my understanding of the Scrapy framework, but I am still not familiar with how pipeline classes are used and need to strengthen that.