Pagination requests

Website analysis

Before crawling, you first need to understand the website structure. Inspecting the site shows a list of movies with information such as ranking, cover, and movie name, plus pagination controls at the bottom of the list.

77dianshi

A single page contains more movie entries than shown here.

Create crawler project

  1. Create a Scrapy project called scrapy_demo
$ scrapy startproject scrapy_demo
  1. Enter the scrapy_demo project directory
$ cd scrapy_demo 
  1. Generate a spider (spider name: movie, crawl domain: 77dianshi.com)
$ scrapy genspider movie 77dianshi.com
Created spider 'movie' using template 'basic' in module:
  scrapy_demo.spiders.movie
  1. Open the generated movie.py file
import scrapy
class MovieSpider(scrapy.Spider):
    name = 'movie'
    allowed_domains = ['77dianshi.com']
    start_urls = ['http://77dianshi.com/']
    def parse(self, response):
        pass
  1. Modify start_urls as shown below
import scrapy
class MovieSpider(scrapy.Spider):
    name = 'movie' # spider name
    allowed_domains = ['77dianshi.com'] # allowed crawl domains
    start_urls = ['http://www.77dianshi.com/kdongzuopian/'] # start URL to crawl
    def parse(self, response):
        pass

Web page structure analysis

  1. Use the Chrome XPath plugin.

After inspecting the page, you can see that all movie entries are li elements wrapped in a ul.

Web page structure

  1. Write the parse code
    def parse(self, response):
        li_list = response.xpath('//ul[@class="fed-list-info fed-part-rows"]/li')
        for li in li_list:
            print(li)

Debugging: make sure the crawl works before going further.

$ scrapy crawl movie 
2021-06-11 21:33:01 [scrapy.extensions.telnet] INFO: Telnet console listening on 127.0.0.1:6023
<Selector xpath='//ul/li' data='<li class="fed-pull-left"><a class="f...'>
<Selector xpath='//ul/li' data='<li class="fed-pull-left"><a class="f...'>
<Selector xpath='//ul/li' data='<li class="fed-pull-left"><a class="f...'>
<Selector xpath='//ul/li' data='<li class="fed-pull-left"><a class="f...'>
<Selector xpath='//ul/li' data='<li class="fed-pull-left"><a class="f...'>
<Selector xpath='//ul/li' data='<li class="fed-pull-left"><a class="f...'>
<Selector xpath='//ul/li' data='<li class="fed-pull-left"><a class="f...'>
<Selector xpath='//ul/li' data='<li class="fed-col-sm2"><a class="fed...'>
<Selector xpath='//ul/li' data='<li class="fed-col-sm2 fed-this"><a c...'>
<Selector xpath='//ul/li' data='<li class="fed-col-sm2"><a class="fed...'>
<Selector xpath='//ul/li' data='<li class="fed-col-sm2"><a class="fed...'>
...

Running scrapy crawl movie produces the output above, which proves we can fetch the data normally.

Data analysis

But these selectors are not yet the data we want, so we need to parse the contents of each li in more detail.

The data structure of a single li is as follows:

<li class="fed-list-item fed-padding fed-col-xs4 fed-col-sm3 fed-col-md2"><a class="fed-list-pics fed-lazy fed-part-2by3" href="/t/wumianjuexing/" data-original="https://img.huishij.com/upload/vod/20210610-1/ba6cb9b6c0161b3e46406a154fdec464.jpg" style="display: block; background-image: url(&quot;https://img.huishij.com/upload/vod/20210610-1/ba6cb9b6c0161b3e46406a154fdec464.jpg&quot;);"><span class="fed-list-play fed-hide-xs"></span><span class="fed-list-score fed-font-xii fed-back-green">6.0</span><span class="fed-list-remarks fed-font-xii fed-text-white fed-text-center">HD</span></a><a class="fed-list-title fed-font-xiv fed-text-center fed-text-sm-left fed-visible fed-part-eone" href="/t/wumianjuexing/">Sleepless Awakening</a><span class="fed-list-desc fed-font-xii fed-visible fed-part-eone fed-text-muted fed-hide-xs fed-show-sm-block">Gina Rodriguez, Shamir Anderson, Jennifer Jason Lee, Ariana Greenblatt, Barry Pepper, Frances Fisher, Jill Berros, Finn Jones, Sebastian Pigot, Sergio ZIO, Alex House, Lucius Hojos, Troven Hayes, Sean Ahmed, Julia Diyang, Robert Bazozzi, Chai Varadarez, Katrina Taccia, Martha Gerwin, Elias Edraki, Michael, Hough</span></li>

First, there are two a tags,

First a label

<a class="fed-list-pics fed-lazy fed-part-2by3" href="/t/wumianjuexing/" data-original="https://img.huishij.com/upload/vod/20210610-1/ba6cb9b6c0161b3e46406a154fdec464.jpg" style="display: block; background-image: url(&quot;https://img.huishij.com/upload/vod/20210610-1/ba6cb9b6c0161b3e46406a154fdec464.jpg&quot;);"><span class="fed-list-play fed-hide-xs"></span><span class="fed-list-score fed-font-xii fed-back-green">6.0</span><span class="fed-list-remarks fed-font-xii fed-text-white fed-text-center">HD</span></a>

data-original: stores the movie cover image URL.

Three span tags: the first holds no data, the second holds the score, and the third holds "HD" (I don't know what it signifies; let me know if you do).
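Before wiring this into Scrapy, the attribute and span texts described above can be pulled out with the standard library alone. A minimal sketch using html.parser, independent of Scrapy (the class name FirstATagParser is just for illustration):

```python
from html.parser import HTMLParser

class FirstATagParser(HTMLParser):
    """Collects the a tag's data-original attribute and the text of each span."""
    def __init__(self):
        super().__init__()
        self.cover = None
        self.span_texts = []
        self._in_span = False

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if tag == "a" and self.cover is None:
            self.cover = attrs.get("data-original")
        elif tag == "span":
            self._in_span = True
            self.span_texts.append("")

    def handle_endtag(self, tag):
        if tag == "span":
            self._in_span = False

    def handle_data(self, data):
        if self._in_span:
            self.span_texts[-1] += data

html = ('<a class="fed-list-pics fed-lazy fed-part-2by3" href="/t/wumianjuexing/" '
        'data-original="https://img.huishij.com/upload/vod/20210610-1/ba6cb9b6c0161b3e46406a154fdec464.jpg">'
        '<span class="fed-list-play fed-hide-xs"></span>'
        '<span class="fed-list-score fed-font-xii fed-back-green">6.0</span>'
        '<span class="fed-list-remarks fed-font-xii fed-text-white fed-text-center">HD</span></a>')

parser = FirstATagParser()
parser.feed(html)
print(parser.cover)       # the cover image URL
print(parser.span_texts)  # ['', '6.0', 'HD']
```

This mirrors what the XPath expressions below do inside Scrapy: the first span is empty, the second carries the score, the third the "HD" label.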

    def parse(self, response):
        li_list = response.xpath('//ul[@class="fed-list-info fed-part-rows"]/li')
        for li in li_list:
            item = {}  # used to hold the extracted data
            # get the first a tag
            a1 = li.xpath("./a[1]")
            # cover
            item["cover"] = li.xpath("./a[1]/@data-original").extract_first()
            # score
            item["score"] = a1.xpath("./span[2]/text()").extract_first()
            # HD
            item["hd"] = a1.xpath("./span[3]/text()").extract_first()
            print(item)

The results for the first a tag are as follows:

$ scrapy crawl movie --nolog
{'cover': 'https://img.huishij.com/upload/vod/20210610-1/ba6cb9b6c0161b3e46406a154fdec464.jpg', 'score': '6.0', 'hd': 'HD'}
{'cover': 'http://ae01.alicdn.com/kf/ua7d5187008104f1b8da8d322d3e160ffv.jpg', 'score': '5.0', 'hd': '12 episodes in total, end'}
{'cover': 'https://img.huishij.com/upload/vod/20210609-1/dba912b28f71400c85e98a75ec193aaa.jpg', 'score': '6.0', 'hd': 'HD'}
{'cover': 'http://ae01.alicdn.com/kf/U3dece43d297848f2b5dd58ff0e070eb2D.jpg', 'score': '9.0', 'hd': 'HD'}
{'cover': 'https://img.huishij.com/upload/vod/20210608-1/47c74371cbdf5bb97f02d7c0f06d7cd2.jpg', 'score': '3.0', 'hd': 'HD'}
{'cover': 'https://img.huishij.com/upload/vod/20210607-1/df394b70390b62a5358aeed3163f3131.jpg', 'score': '10.0', 'hd': 'HD'}
{'cover': 'https://img.huishij.com/upload/vod/20210606-1/86dcf8922c009db93ae28458bc51125f.jpg', 'score': '4.0', 'hd': 'HD'}
...

The second a tag holds the movie title, and the trailing span tag holds the cast list. There is not much to it; we handle them one by one.

<a class="fed-list-title fed-font-xiv fed-text-center fed-text-sm-left fed-visible fed-part-eone" href="/t/wumianjuexing/">Sleepless Awakening</a><span class="fed-list-desc fed-font-xii fed-visible fed-part-eone fed-text-muted fed-hide-xs fed-show-sm-block">Gina·Rodriguez,shamir ·Anderson,Jennifer·Jason·Lee,Ariana·Greenblatt,Barry·Peper,Frances·Fisher,Jill·Berros ,Finn·Jones,Sebastian·Pigot,Sergio·ZIO,Alex·House,Lucius·Hojos,Troven·Hayes,Sean·Ahmed,Julia·Diyang,Robert·Bazozzi,firewood·Varadarez,Katrina·Taccia,Martha·Gerwin,Elias·Edraki,Michael,Hough</span>

Write the code

    def parse(self, response):
        li_list = response.xpath('//ul[@class="fed-list-info fed-part-rows"]/li')
        for li in li_list:
            item = {}  # used to hold the extracted data
            # get the first a tag
            a1 = li.xpath("./a[1]")
            # cover
            item["cover"] = li.xpath("./a[1]/@data-original").extract_first()
            # score
            item["score"] = a1.xpath("./span[2]/text()").extract_first()
            # HD
            item["hd"] = a1.xpath("./span[3]/text()").extract_first()
            # movie title
            item["movie_name"] = li.xpath("./a[2]/text()").extract_first()
            # cast list
            item["cast_list"] = li.xpath("./span/text()").extract_first()
            print(item)

Final data results

$ scrapy crawl movie --nolog
{'cover': 'https://img.huishij.com/upload/vod/20210610-1/ba6cb9b6c0161b3e46406a154fdec464.jpg', 'score': '6.0', 'hd': 'HD', 'movie_name': 'Sleepless Awakening', 'cast_list': 'Gina Rodriguez, Shamir Anderson, Jennifer Jason Lee, Ariana Greenblatt, Barry Pepper, Frances Fisher, Jill Berros, Finn Jones, Sebastian Pigot, Sergio ZIO, Alex House, Lucius Hojos, Troven Hayes, Sean Ahmed, Julia Diyang, Robert Bazozzi, Chai Varadarez, Katrina Taccia, Martha Gerwin, Elias Edraki, Michael, Hough'}
{'cover': 'http://ae01.alicdn.com/kf/ua7d5187008104f1b8da8d322d3e160ffv.jpg', 'score': '5.0', 'hd': '12 episodes in total, end', 'movie_name': 'Infernal Affairs', 'cast_list': 'Andy Lau, Tony Leung, Huang Qiusheng, Zeng Zhiwei, Zheng Xiuwen, Chen Huilin, Chen Guanxi, Yu Wenle, Du Wenze, Lin Jiadong, Xiao Yaxuan'}
{'cover': 'https://img.huishij.com/upload/vod/20210609-1/dba912b28f71400c85e98a75ec193aaa.jpg', 'score': '6.0', 'hd': 'HD', 'movie_name': 'Thief Supreme 1999', 'cast_list': 'Alec,Baldwin,Andre,Braugher,Michael,Jai,White'}
{'cover': 'http://ae01.alicdn.com/kf/U3dece43d297848f2b5dd58ff0e070eb2D.jpg', 'score': '9.0', 'hd': 'HD', 'movie_name': 'Shaolin Boy', 'cast_list': 'Jet Li, Huang Qiuyan, Pan Qingfu, Yu Chenghui, Yu Hai, Hu Jianqiang'}
{'cover': 'https://img.huishij.com/upload/vod/20210608-1/47c74371cbdf5bb97f02d7c0f06d7cd2.jpg', 'score': '3.0', 'hd': 'HD', 'movie_name': 'Crossword Event Book: Fatal Puzzle', 'cast_list': 'Lacey,Chabert,John,Kapelos,Brennan,Elliott'}
{'cover': 'https://img.huishij.com/upload/vod/20210607-1/df394b70390b62a5358aeed3163f3131.jpg', 'score': '10.0', 'hd': 'HD', 'movie_name': 'Legend of Half Wolf', 'cast_list': 'Zijian, Wu Xuanxuan, Ren Tianye'}
{'cover': 'https://img.huishij.com/upload/vod/20210606-1/86dcf8922c009db93ae28458bc51125f.jpg', 'score': '4.0', 'hd': 'HD', 'movie_name': 'National Security 2010', 'cast_list': 'inside details'}
{'cover': 'https://img.huishij.com/upload/vod/20210605-1/e9e7b4fff221c92464b3786fa3340dee.jpg', 'score': '7.0', 'hd': 'HD', 'movie_name': 'Kara Eagle', 'cast_list': 'Ricardo,Darín,Martina,Gusman'}
{'cover': 'https://img.huishij.com/upload/vod/20210605-1/558ca1cc9423a9c08ff8e65c42acfef2.jpg', 'score': '6.0', 'hd': 'HD', 'movie_name': 'Boxing 2010', 'cast_list': 'Mahesh,Babu,Anushka,Shetty'}
{'cover': 'https://img.huishij.com/upload/vod/20210605-1/9a82882e959add4063b6bd7b7ebfb0fb.jpg', 'score': '8.0', 'hd': 'HD', 'movie_name': 'Three Day Crisis', 'cast_list': 'Russell Crowe, Elizabeth Banks, Ty Simpkins, Olivia Wilde, Liam Neeson, Jonathan Tucker, Brian Danelli, Rachel Deacon, Lenny James, Jason Begue, James Lanson, Moran Atias, Aisha Cinders, Daniel Stern'}
{'cover': 'https://img.huishij.com/upload/vod/20200627-1/b90ab5f7cf189c8556d9947971c2581b.jpg', 'score': '5.0', 'hd': 'HD', 'movie_name': 'Flip', 'cast_list': 'Parker Percy, Michael Rapaport, Bruce Dern, Michael Coolitz, Paul Levisk'}
...

In this way, we can crawl the movie information for the whole current page.

The complete code of movie.py is as follows:

import scrapy
class MovieSpider(scrapy.Spider):
    name = 'movie' # spider name
    allowed_domains = ['77dianshi.com'] # allowed crawl domains
    start_urls = ['http://www.77dianshi.com/kdongzuopian/'] # start URL
    def parse(self, response):
        li_list=response.xpath('//ul[@class="fed-list-info fed-part-rows"]/li')
        for li in li_list:
            item={} # Used to encapsulate data
            # Get the first a tag
            a1=li.xpath("./a[1]")
            # cover
            item["cover"]=li.xpath("./a[1]/@data-original").extract_first()
            # score
            item["score"]=a1.xpath("./span[2]/text()").extract_first()
            # HD
            item["hd"]=a1.xpath("./span[3]/text()").extract_first()
            # Movie title
            item["movie_name"]=li.xpath("./a[2]/text()").extract_first()
            # Cast list
            item["cast_list"]=li.xpath("./span/text()").extract_first()
            print(item)

Get the address of the next page

After crawling the data on the first page, how do we crawl the next page? We just need to get the URL of the next page.


Use the XPath tool to locate the a tag whose text is "next page" and extract its href.

Get next page href

Clicking "next page" leads to this URL: http://www.77dianshi.com/kdongzuopian-2/. In other words, to get the next page's address we just splice its href onto http://www.77dianshi.com.
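The splicing can also be done with the standard library's urljoin (Scrapy's response.urljoin offers the same behavior), which handles relative hrefs safely; a small sketch:

```python
from urllib.parse import urljoin

base = "http://www.77dianshi.com/kdongzuopian/"
next_href = "/kdongzuopian-2/"

# urljoin resolves the root-relative href against the page's own URL
next_url = urljoin(base, next_href)
print(next_url)  # http://www.77dianshi.com/kdongzuopian-2/
```

This avoids hard-coding the domain prefix in the spider.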

Write a program to get the address of the next page

    def parse(self, response):
        li_list = response.xpath('//ul[@class="fed-list-info fed-part-rows"]/li')
        for li in li_list:
            item = {}  # used to hold the extracted data
            # get the first a tag
            a1 = li.xpath("./a[1]")
            # cover
            item["cover"] = li.xpath("./a[1]/@data-original").extract_first()
            # score
            item["score"] = a1.xpath("./span[2]/text()").extract_first()
            # HD
            item["hd"] = a1.xpath("./span[3]/text()").extract_first()
            # movie title
            item["movie_name"] = li.xpath("./a[2]/text()").extract_first()
            # cast list
            item["cast_list"] = li.xpath("./span/text()").extract_first()
        # get the url of the next page
        next_href = response.xpath('//div[@class="fed-page-info fed-text-center"]/a[contains(text(),"next page")]/@href').extract_first()
        next_url = "http://www.77dianshi.com" + next_href
        print(next_url)

The results are as follows:

$ scrapy crawl movie --nolog
http://www.77dianshi.com/kdongzuopian-2/

After obtaining the URL of the next page, we need to think about a problem: if there is no next page, the code will raise an error, which is not what we want. The next page's href may not always exist, so we have to check for it to prevent errors.

Let's see what the last page looks like

last page

As you can see above, the "next page" href still exists on the last page, but there it is the href of the current page itself.
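That observation can be isolated into a small helper; a sketch (is_last_page is a hypothetical name, not part of the original spider):

```python
def is_last_page(current_url, next_href):
    """On the last page the 'next page' href points back at the current page."""
    if not next_href:  # no next link found at all
        return True
    return current_url.endswith(next_href)

print(is_last_page("http://www.77dianshi.com/kdongzuopian-2/", "/kdongzuopian-3/"))    # False
print(is_last_page("http://www.77dianshi.com/kdongzuopian-18/", "/kdongzuopian-18/"))  # True
```

Guarding against a missing href also prevents the string concatenation from failing when the selector returns None.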

Determine whether it is the last page

    def parse(self, response):
        li_list = response.xpath('//ul[@class="fed-list-info fed-part-rows"]/li')
        for li in li_list:
            item = {}  # used to hold the extracted data
            # get the first a tag
            a1 = li.xpath("./a[1]")
            # cover
            item["cover"] = li.xpath("./a[1]/@data-original").extract_first()
            # score
            item["score"] = a1.xpath("./span[2]/text()").extract_first()
            # HD
            item["hd"] = a1.xpath("./span[3]/text()").extract_first()
            # movie title
            item["movie_name"] = li.xpath("./a[2]/text()").extract_first()
            # cast list
            item["cast_list"] = li.xpath("./span/text()").extract_first()
        # get the url of the next page
        next_href = response.xpath('//div[@class="fed-page-info fed-text-center"]/a[contains(text(),"next page")]/@href').extract_first()
        next_url = "http://www.77dianshi.com" + next_href
        # get the url of the current page
        current_url = response.url
        # if the current url ends with next_href, the "next page" link points back
        # at this page, so this is the last page
        if current_url.endswith(next_href):
            print("last page")
        else:
            print("not the last page")

Crawl the next page's movie information

With the preparation above, crawling the current page, getting the next page's URL, and detecting the last page are all done. Next we crawl the data of the following pages.

Hold on: before that, we need to adjust a few settings to make the crawl more stable.

settings.py configuration:

DOWNLOAD_DELAY = 3 # download delay between requests, set to three seconds here; crawling too fast may get the IP blocked
ROBOTSTXT_OBEY = False # I call this tearing up the gentleman's agreement
# Custom USER_AGENT
USER_AGENT = 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.101 Safari/537.36'

Complete code

import scrapy

class MovieSpider(scrapy.Spider):
    name = 'movie' # spider name
    allowed_domains = ['77dianshi.com'] # allowed crawl domains
    start_urls = ['http://www.77dianshi.com/kdongzuopian/'] # start URL

    def parse(self, response):
        li_list = response.xpath('//ul[@class="fed-list-info fed-part-rows"]/li')
        for li in li_list:
            item = {}  # used to hold the extracted data
            # get the first a tag
            a1 = li.xpath("./a[1]")
            # cover
            item["cover"] = li.xpath("./a[1]/@data-original").extract_first()
            # score
            item["score"] = a1.xpath("./span[2]/text()").extract_first()
            # HD
            item["hd"] = a1.xpath("./span[3]/text()").extract_first()
            # movie title
            item["movie_name"] = li.xpath("./a[2]/text()").extract_first()
            # cast list
            item["cast_list"] = li.xpath("./span/text()").extract_first()
            print(item)
        # get the url of the next page
        next_href = response.xpath('//div[@class="fed-page-info fed-text-center"]/a[contains(text(),"next page")]/@href').extract_first()
        next_url = "http://www.77dianshi.com" + next_href
        # get the url of the current page
        current_url = response.url
        # print the current url so progress is easy to follow
        print("Has crawled to:", current_url, "Next page address:", next_url)
        # if current_url does not end with next_href, this is not the last page
        if not current_url.endswith(next_href):
            # url: the url of the next page
            # callback: the method that parses the response; the next page has the
            # same structure as this one, so we reuse parse (a custom method
            # would be needed otherwise)
            yield scrapy.Request(url=next_url, callback=self.parse)
        else:
            print("Crawling completed")

Crawling results

...
{'cover': 'http://ae01.alicdn.com/kf/u5f88cacadfe24a4894ed96f43676d1b1i.jpg', 'score': '4.0', 'hd': 'BD HD', 'movie_name': 'Xueji Biography', 'cast_list': 'Ren Jiao, Wang Ming, Chen Chaoliang, Zhang Xinglan'}
{'cover': 'https://cdn1.mh-pic.com/upload/vod/2019-10-05/157026432110.jpg', 'score': '7.0', 'hd': 'HD', 'movie_name': 'Zhong Lihuang', 'cast_list': 'Yue Xin, Chen Xijun, Zhao Ziluo, He Suo, Yu Xintong, Li Jiawei'}
{'cover': 'https://cdn1.mh-pic.com/upload/vod/2019-10-05/15702721910.jpg', 'score': '7.0', 'hd': 'BD HD', 'movie_name': 'Half Life 3: Battle of Special Forces', 'cast_list': 'Chu Town, Guo Yunfei, Jin Meiling'}
Has crawled to: http://www.77dianshi.com/kdongzuopian-17/ Next page address: http://www.77dianshi.com/kdongzuopian-18/
{'cover': 'http://ae01.alicdn.com/kf/Uf774144860144247bbea9d4861c1e7a04.jpg', 'score': '5.0', 'hd': 'HD', 'movie_name': 'See Me and You', 'cast_list': 'Du Haitao, Liu Xuan, Wu Xin, Tang Yuzhe, Wang Wenwen, Tang Zhenye, Li Jianren, Chen Zhixiong, Sun Dandan, Zheng Yihan'}
{'cover': 'http://ae01.alicdn.com/kf/U7402fe10e7564417b3443a322e3f7dd3O.jpg', 'score': '8.0', 'hd': 'HD', 'movie_name': 'Dissident 2: Jedi Counterattack', 'cast_list': 'Shereen Woodley, Theo James, Kate Winslet, Octavia Spencer'}
{'cover': 'http://ae01.alicdn.com/kf/U6a124ba5fdcd4cf5b99f133e16f012000.jpg', 'score': '7.0', 'hd': 'HD', 'movie_name': 'Divergent: Alien Awakening', 'cast_list': 'Shereen Woodley, Theo James, Ashley Judd, Jay Courtney'}
{'cover': 'http://ae01.alicdn.com/kf/U0c95ac9b8d964a24b8da02686bd971dbI.jpg', 'score': '7.0', 'hd': 'HD', 'movie_name': 'Murderer Tang Cut', 'cast_list': 'Zhang Fengyi, Guan Zhilin, Mo Shaocong, Zhang Guangbei'}
....

At present, the first 18 pages of data have been crawled:

`Has crawled to: http://www.77dianshi.com/kdongzuopian-17/ Next page address: http://www.77dianshi.com/kdongzuopian-18/`

Summary:

scrapy.Request builds a request and lets you specify the callback function used to extract the data.

Key points of scrapy.Request:

scrapy.Request(url, callback, method='GET', headers, body, cookies, meta, dont_filter=False)

  • url: request address
  • callback: the function that handles the response
  • method: request method, POST/GET
  • headers: request headers
  • body: request body
  • cookies: cookies are stored separately from the headers, and are usually not specified here
  • meta: metadata passed along with the request
  • dont_filter: if True, the request bypasses Scrapy's duplicate filter

Common parameters of scrapy.Request:

  • url: the url of the next request.
  • callback: which parsing function handles the response for this url.
  • meta: passes data between different parsing functions. meta also carries some information by default, such as download delay and request depth.
  • dont_filter: keeps the current url from being dropped by Scrapy's deduplication. By default Scrapy deduplicates urls, so this matters for urls that need to be requested repeatedly.
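Scrapy's deduplication can be pictured as a seen-set consulted before each request; a toy sketch of the idea (this is not Scrapy's actual fingerprint-based implementation, and ToyDupeFilter is a hypothetical name):

```python
class ToyDupeFilter:
    """Toy model of request deduplication with a dont_filter escape hatch."""
    def __init__(self):
        self.seen = set()

    def should_request(self, url, dont_filter=False):
        # dont_filter=True bypasses the duplicate check entirely
        if dont_filter:
            return True
        if url in self.seen:
            return False
        self.seen.add(url)
        return True

f = ToyDupeFilter()
print(f.should_request("http://www.77dianshi.com/kdongzuopian-2/"))                    # True, first time
print(f.should_request("http://www.77dianshi.com/kdongzuopian-2/"))                    # False, duplicate
print(f.should_request("http://www.77dianshi.com/kdongzuopian-2/", dont_filter=True))  # True, filter bypassed
```

This is why a page that must be re-fetched (e.g. a refreshing listing) needs dont_filter=True.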

One last note:

For later data archiving (saving to local disk or storing in a database), we should not do that work in movie.py; instead we hand the data over to pipelines.py.

How? With the yield keyword.
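yield turns parse into a generator, so the Scrapy engine can pull items one at a time and hand each to the pipeline. A toy illustration outside Scrapy (parse_like is a made-up stand-in for the spider's parse method):

```python
def parse_like(rows):
    # yield each item instead of printing it; the caller decides what to do
    for name in rows:
        item = {"movie_name": name}
        yield item

# the caller (in Scrapy: the engine feeding the pipelines) iterates the generator
items = list(parse_like(["Sleepless Awakening", "Infernal Affairs"]))
print(items)  # [{'movie_name': 'Sleepless Awakening'}, {'movie_name': 'Infernal Affairs'}]
```

Nothing runs inside the generator until something iterates it, which is exactly how Scrapy drives parse.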

The specific operations are as follows:

  1. You need to change print(item) to yield item.
        for li in li_list:
            item = {}  # used to hold the extracted data
            # get the first a tag
            a1 = li.xpath("./a[1]")
            # cover
            item["cover"] = li.xpath("./a[1]/@data-original").extract_first()
            # score
            item["score"] = a1.xpath("./span[2]/text()").extract_first()
            # HD
            item["hd"] = a1.xpath("./span[3]/text()").extract_first()
            # movie title
            item["movie_name"] = li.xpath("./a[2]/text()").extract_first()
            # cast list
            item["cast_list"] = li.xpath("./span/text()").extract_first()
            # print(item)
            yield item
  1. In settings.py, uncomment the ITEM_PIPELINES setting.
  2. In pipelines.py, receive the item data and simulate the database insert.
# Define your item pipelines here
#
# Don't forget to add your pipeline to the ITEM_PIPELINES setting
# See: https://docs.scrapy.org/en/latest/topics/item-pipeline.html
# useful for handling different item types with a single interface
from itemadapter import ItemAdapter

class ScrapyDemoPipeline:
    def process_item(self, item, spider):
        # receive the item data
        print("Simulated data warehousing:", item)  # prints the same items movie.py printed before
        return item
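To go beyond simulated warehousing, the pipeline could write each item out as one JSON line. A sketch following Scrapy's pipeline interface (open_spider/close_spider lifecycle hooks plus process_item); JsonLinesPipeline and the movies.jl filename are illustrative choices, not part of the generated project:

```python
import json

class JsonLinesPipeline:
    """Appends each item as a JSON line to movies.jl."""
    def open_spider(self, spider):
        # called once when the spider starts
        self.file = open("movies.jl", "w", encoding="utf-8")

    def process_item(self, item, spider):
        self.file.write(json.dumps(item, ensure_ascii=False) + "\n")
        return item  # pass the item on to any later pipeline

    def close_spider(self, spider):
        # called once when the spider finishes
        self.file.close()
```

Remember to register it in ITEM_PIPELINES the same way as ScrapyDemoPipeline; returning the item keeps the pipeline chain intact.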

Posted on Fri, 26 Nov 2021 15:22:42 -0500 by Thuy