Scrapy study notes summary

1. Create a new crawler project with Scrapy:

scrapy startproject <project name>
 For example: scrapy startproject itcast

[root@VM_131_54_centos pachong]# tree itcast
itcast
|-- itcast
|   |-- __init__.py
|   |-- items.py        # the project's data container file; defines the data we want to scrape
|   |-- middlewares.py
|   |-- pipelines.py    # pipeline file; further processes the data defined in items.py
|   |-- settings.py     # settings file
|   `-- spiders         # write crawlers in this directory
|       `-- __init__.py
`-- scrapy.cfg

2. The items.py file

[root@VM_131_54_centos itcast]# cat items.py   # default definition
# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html

import scrapy


class ItcastItem(scrapy.Item):
    # define the fields for your item here like:
    # name = scrapy.Field()
    pass
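
For the teacher example used in the later sections, the item would define the three fields that the spider fills in (a minimal sketch; the field names match the item['name'], item['title'], item['info'] assignments used below):

# -*- coding: utf-8 -*-
import scrapy


class ItcastItem(scrapy.Item):
    name = scrapy.Field()   # teacher's name
    title = scrapy.Field()  # teacher's title
    info = scrapy.Field()   # teacher's introduction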

3. Write (create) a crawler file (run inside the project directory)

scrapy genspider <crawler name> <primary domain name>
scrapy genspider testitcast itcast.cn

The following is automatically created:

[root@VM_131_54_centos spiders]# cat testitcast.py 
# -*- coding: utf-8 -*-
import scrapy


class TestitcastSpider(scrapy.Spider):
    name = 'testitcast'
    allowed_domains = ['itcast.cn']
    start_urls = ['http://itcast.cn/']

    def parse(self, response):
        pass

Now write the crawler class TestitcastSpider. The class name can be changed, but the attribute names (name, allowed_domains, start_urls) and the parse method must keep their names.

[root@VM_131_54_centos spiders]# vi testitcast.py
[root@VM_131_54_centos spiders]# cat testitcast.py

# -*- coding: utf-8 -*-
import scrapy


class TestitcastSpider(scrapy.Spider):
    name = 'testitcast'  # crawler name
    allowed_domains = ['itcast.cn']  # primary domain name
    start_urls = ['http://itcast.cn/channel/teacher.shtml#']  # starting URLs; defaults to the domain root, modified here to the teacher list page

    def parse(self, response):
        with open("getteacher.html", 'w') as f:
            f.write(response.body)  # write the response body (bytes; under Python 3 open the file with 'wb')

4. Run the crawler (from inside the project directory):

scrapy crawl <crawler name>
scrapy crawl testitcast

5. Scrapy's xpath and extract(). A handy website for formatting the exported JSON: http://www.json.cn/

[root@VM_131_54_centos spiders]# cat testitcast.py

# -*- coding: utf-8 -*-
import scrapy


class TestitcastSpider(scrapy.Spider):
    name = 'testitcast'  # crawler name
    allowed_domains = ['itcast.cn']  # primary domain name
    start_urls = ['http://itcast.cn/channel/teacher.shtml#']  # starting URLs; defaults to the domain root, modified here to the teacher list page

    def parse(self, response):
        teacher_list = response.xpath("//div[@class='li_txt']")  # Scrapy's xpath selector
        for each in teacher_list:
            # name: xpath returns a list of selectors; extract() turns it into a list of text strings
            name = each.xpath("./h3/text()").extract()

            # title (each.xpath("./h4") alone would return the raw selector list)
            title = each.xpath("./h4/text()").extract()

            # info: extract() converts the matched nodes to unicode strings
            info = each.xpath("./p/text()").extract()

            #print name[0]
            #print title[0]
            #print info[0]
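
To experiment with these XPath expressions interactively before putting them in the spider, Scrapy's shell can be used (the URL is the one from start_urls above):

scrapy shell "http://itcast.cn/channel/teacher.shtml"
>>> response.xpath("//div[@class='li_txt']/h3/text()").extract()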

6. Using the items.py file

[root@VM_131_54_centos spiders]# cat testitcast.py

# -*- coding: utf-8 -*-
import scrapy
from itcast.items import ItcastItem  # import the item class defined in items.py

class TestitcastSpider(scrapy.Spider):
    name = 'testitcast'  # crawler name
    allowed_domains = ['itcast.cn']  # primary domain name
    start_urls = ['http://itcast.cn/channel/teacher.shtml#']  # starting URLs; defaults to the domain root, modified here to the teacher list page

    def parse(self, response):
        dataset = []
        teacher_list = response.xpath("//div[@class='li_txt']")  # Scrapy's xpath selector
        for each in teacher_list:
            # instantiate a new item per teacher; instantiating it once outside
            # the loop would make every list entry point at the same object
            item = ItcastItem()

            name = each.xpath("./h3/text()").extract()
            title = each.xpath("./h4/text()").extract()
            info = each.xpath("./p/text()").extract()

            item['name'] = name[0]
            item['title'] = title[0]
            item['info'] = info[0]

            dataset.append(item)

        return dataset

Export to a json file.
It can also be exported to other formats, such as csv.
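
Scrapy's built-in feed export handles this from the command line with the -o flag (run from the project directory; the output format is inferred from the file extension):

scrapy crawl testitcast -o teachers.json
scrapy crawl testitcast -o teachers.csv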

7. Use PyCharm to write the crawler.
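
Scrapy spiders are normally started from the command line, so a common way to run one from PyCharm is a small launcher script that PyCharm can execute directly (a minimal sketch; the file name begin.py is arbitrary, and the spider name is the testitcast example above):

# begin.py -- place it next to scrapy.cfg and run it from PyCharm
from scrapy import cmdline

cmdline.execute("scrapy crawl testitcast".split())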

8. Using the pipelines file

[root@VM_131_54_centos spiders]# cat testitcast.py

# -*- coding: utf-8 -*-
import scrapy
from itcast.items import ItcastItem  # import the item class defined in items.py

class TestitcastSpider(scrapy.Spider):
    name = 'testitcast'  # crawler name
    allowed_domains = ['itcast.cn']  # primary domain name
    start_urls = ['http://itcast.cn/channel/teacher.shtml#']  # starting URLs; defaults to the domain root, modified here to the teacher list page

    def parse(self, response):
        teacher_list = response.xpath("//div[@class='li_txt']")  # Scrapy's xpath selector
        for each in teacher_list:
            item = ItcastItem()  # instantiate a new item per teacher

            name = each.xpath("./h3/text()").extract()
            title = each.xpath("./h4/text()").extract()
            info = each.xpath("./p/text()").extract()

            item['name'] = name[0]
            item['title'] = title[0]
            item['info'] = info[0]

            yield item  # parse becomes a generator; each yielded item is handed to pipelines.py

Modify settings.py:

# Configure item pipelines
# See http://scrapy.readthedocs.org/en/latest/topics/item-pipeline.html
ITEM_PIPELINES = {
    'itcast.pipelines.ItcastPipeline': 300,
    # <project>.pipelines.<pipeline class in pipelines.py>; 300 is the priority (lower numbers run first)
}

Modify pipelines.py (the class name must match the one registered in ITEM_PIPELINES):

[root@VM_131_54_centos itcast]# cat pipelines.py

#encoding:utf-8
import json

class ItcastPipeline(object):
    # optional: called once when the pipeline is instantiated
    def __init__(self):
        self.filename = open('te.json', 'w')

    # required: this method processes every item the spider yields
    def process_item(self, item, spider):
        jsontext = json.dumps(dict(item), ensure_ascii=False) + "\n"
        self.filename.write(jsontext.encode('utf-8'))  # Python 2: write the utf-8 encoded line
        return item

    # optional: called automatically when the crawler finishes
    def close_spider(self, spider):
        self.filename.close()
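
The encode('utf-8') call above is a Python 2 idiom and would fail on a text-mode file under Python 3. A minimal equivalent sketch, assuming Python 3, opens the file with an explicit encoding and writes the string directly:

#encoding:utf-8
import json

class ItcastPipeline(object):
    def __init__(self):
        # text mode with an explicit encoding; no manual encode() needed
        self.filename = open('te.json', 'w', encoding='utf-8')

    def process_item(self, item, spider):
        self.filename.write(json.dumps(dict(item), ensure_ascii=False) + "\n")
        return item

    def close_spider(self, spider):
        self.filename.close()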

Using logs:
scrapy startproject -h   # view help
scrapy startproject hexian --logfile="./spiderlog.txt"
