Crawler notes: rewriting crawler code with object-oriented programming

Catalogue of series articles

[Crawler learning notes (4) - anti-crawling measures encountered by the crawler](https://editor.csdn.net/md/?articleId=112463277)
[Crawler learning notes (3) - using JSON to crawl Douban](https://editor.csdn.net/md/?articleId=111053192)
[Crawler learning notes (2) - basic usage of requests to crawl various mainstream websites, part 2](https://editor.csdn.net/md/?articleId=110222411)
[Crawler learning notes (1) - basic usage of requests to crawl various mainstream websites, part 1](https://editor.csdn.net/md/?articleId=103932009)

Preface

Object-oriented programming
Python has been an object-oriented language from the very beginning of its design, which makes it easy to create classes and objects in Python. This article improves the code written in the previous articles, restructures it in an object-oriented way, and serves as practice for object-oriented programming.

Tip: the following is the main content of this article; the cases below can be used for reference.

1, Object-oriented programming

Introduction to object-oriented technology (several of these terms are illustrated in the short sketch after this list):
  • Class: a template used to describe a collection of objects with the same properties and methods. It defines the properties and methods common to every object in the collection; an object is an instance of a class.
  • Class variable: a class variable is shared by all instances of the class. It is defined inside the class but outside any method body, and is usually not used as an instance variable.
  • Data member: a class variable or instance variable that holds data associated with the class and its instance objects.
  • Method override: if a method inherited from the parent class cannot meet the needs of the child class, it can be redefined in the child class; this process is called method overriding.
  • Local variable: a variable defined inside a method; it is only visible within that method.
  • Instance variable: in a class declaration, attributes are represented by variables. Such variables are called instance variables; they are declared inside the class but outside the class's other member methods.
  • Inheritance: a derived class inherits the fields and methods of a base class. Inheritance also allows an object of a derived class to be treated as a base-class object. For example, an object of type Dog is derived from the Animal class, modelling the "is-a" relationship (a Dog is an Animal).
  • Instantiation: creating an instance of a class, i.e. a concrete object of that class.
  • Method: a function defined inside a class.
  • Object: an instance of the data structure defined by a class. An object comprises data members (class variables and instance variables) and methods.
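
A minimal sketch of these terms in Python, using the Dog/Animal "is-a" example from above (all names here are illustrative):

class Animal:                        # class: a template for objects
    kind = 'animal'                  # class variable, shared by all instances
    def __init__(self, name):
        self.name = name             # instance variable, unique to each object
    def speak(self):                 # method: a function defined in the class
        return '...'

class Dog(Animal):                   # inheritance: a Dog "is an" Animal
    def speak(self):                 # method override: replaces the parent's version
        sound = 'woof'               # local variable, visible only inside this method
        return sound

dog = Dog('Wangcai')                 # instantiation: create a concrete object of the class
print(dog.kind, dog.name, dog.speak())   # -> animal Wangcai woof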

2, Use steps

1. Main format

The code format is as follows (example): first import the required libraries and modules, then define a crawler class, class Spider.
1. Define the initial parameters inside the crawler class: write the headers and the url address in the class.
2. Send the request with these parameters, mainly using the GET request of the requests library.
3. Define a run() main program and execute all of the running logic under run().

import ...

class Spider:  # Define a crawler class
    def __init__(self):  # Define the initial parameters
        ...
    def send_request(self, url, headers):  # Send the request with the given parameters
        ...
    def run(self):  # Run the main program
        ...

if __name__ == '__main__':
    DB = Spider()
    DB.run()

2. Code demonstration

The program from crawler note 1 that crawls Douban movie titles, rewritten with object-oriented programming. The code is as follows (example):

# -*- coding: utf-8 -*-
import requests
from lxml import html
class DoubanSpider:
    def __init__(self):
        self.url='https://movie.douban.com/'  # URL of the page to crawl
        self.headers={'User-Agent':'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML,like Gecko) Chrome/75.0.3770.100 Safari/537.36'}
    def send_request(self,url,headers):
        page=requests.Session().get(url,headers=headers)  # use a Session to send the GET request (keeps cookies across requests)
        tree=html.fromstring(page.text)
        return tree
    def run(self):
        datas = self.send_request(self.url,self.headers)
        result=datas.xpath('//td[@class="title"]//a/text()')  # extract the required data
        count = 0
        for i in result:
              print(i, end='\n')
              count += 1
              if (count % len(result) == 0):
                  print(end='\n')
if __name__ == '__main__':
    DoubanPC=DoubanSpider()
    DoubanPC.run()

The program from crawler note 2 that crawls article titles from Guokr (guokr.com), rewritten in the same way:

# -*- coding: utf-8 -*-
import requests
from lxml import etree
class GuokrSpider:
    def __init__(self):
        self.url='https://www.guokr.com/science/category/science'  # URL of the page to crawl
        self.headers={'User-Agent':'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML,like Gecko) Chrome/75.0.3770.100 Safari/537.36'}
    def send_request(self,url,headers):
        page=requests.Session().get(url,headers=headers)  # use a Session to send the GET request (keeps cookies across requests)
        tree=etree.HTML(page.text)
        return tree
    def run(self):
        datas = self.send_request(self.url,self.headers)
        result=datas.xpath('//div[@class="layout_skeleton-zgzfsa-3 styled_infotilewrap-sc-1b13hh2-7 ecqury"]//text()')  # extract the required data
        count = 0
        for i in result:
              print(i, end='\n')
              count += 1
              if (count % len(result) == 0):
                  print(end='\n')
if __name__ == '__main__':
    GuokrPC=GuokrSpider()
    GuokrPC.run()
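
Since DoubanSpider and GuokrSpider duplicate the same send_request logic, the inheritance described in section 1 can factor it into a common base class. Below is a minimal sketch of the idea (the BaseSpider name, the shared xpath attribute, and the shortened User-Agent are my own illustration, not code from the earlier notes):

import requests
from lxml import etree

class BaseSpider:                              # base class holding the shared logic
    def __init__(self, url, headers):
        self.url = url
        self.headers = headers
    def send_request(self, url, headers):      # shared by every derived spider
        page = requests.Session().get(url, headers=headers)
        return etree.HTML(page.text)
    def run(self):
        datas = self.send_request(self.url, self.headers)
        for i in datas.xpath(self.xpath):      # each subclass supplies its own xpath
            print(i)

class DoubanTitleSpider(BaseSpider):           # derived class: only the details differ
    xpath = '//td[@class="title"]//a/text()'
    def __init__(self):
        super().__init__('https://movie.douban.com/',
                         {'User-Agent': 'Mozilla/5.0'})

if __name__ == '__main__':
    DoubanTitleSpider().run()

A Guokr spider could be added the same way, as another subclass that only provides its own url and xpath.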

The program from crawler note 3 that uses JSON to crawl Douban, rewritten in the same way:

# -*- coding: utf-8 -*-
import json
import requests
class Dbjson_Spider:
    def __init__(self):
        self.url='https://m.douban.com/rexxar/api/v2/subject_collection/movie_showing/items?os=android&for_mobile=1&start=0&count=18&loc_id=108288'  # URL of the API to crawl
        self.headers={"Referer": "https://m.douban.com/movie/nowintheater?loc_id=108288",
         "User-Agent":"Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Mobile Safari/537.36"}
    # If the Referer header is missing, the request will be rejected
    def send_request(self,url,headers):
        response = requests.get(url,headers=headers)  # send the GET request
        ret1 = json.loads(response.text)
        return ret1
    def run(self):
        datas = self.send_request(self.url,self.headers)
        content_list = datas["subject_collection_items"]
        print(content_list)
        for i in content_list:
            items = {"Movie title": i['title'], "Release time": i['release_date'], "website": i['uri']}
            content = json.dumps(items, ensure_ascii=False) + "\n,"
            with open("doubandianying.txt", "a", encoding="utf-8") as f:
                f.write(content)
if __name__ == '__main__':
    DBjson=Dbjson_Spider()
    DBjson.run()
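
One small improvement worth noting: run() above reopens doubandianying.txt for every movie. Opening the file once outside the loop avoids that; below is a minimal drop-in variant of the run() method for the same Dbjson_Spider class (same fields, but the stray trailing comma is dropped so each line is a standalone JSON object):

    def run(self):
        datas = self.send_request(self.url, self.headers)
        content_list = datas["subject_collection_items"]
        # open the output file once and append each movie as one JSON line
        with open("doubandianying.txt", "a", encoding="utf-8") as f:
            for i in content_list:
                items = {"Movie title": i['title'], "Release time": i['release_date'], "website": i['uri']}
                f.write(json.dumps(items, ensure_ascii=False) + "\n")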

Summary

Tip: here is a summary of the article:
The above is what we wanted to cover today. This article only briefly introduces how to rewrite our own crawler programs in an object-oriented style.

Tags: Python crawler Data Mining

Posted on Thu, 11 Nov 2021 00:31:27 -0500 by vangelis