Catalogue of series articles
- [Crawler learning notes (4): anti-crawling measures encountered by the hapless crawler](https://editor.csdn.net/md/?articleId=112463277)
- [Crawler learning notes (3): using JSON to crawl Douban](https://editor.csdn.net/md/?articleId=111053192)
- [Crawler learning notes (2): basic usage of requests to crawl various mainstream websites 2](https://editor.csdn.net/md/?articleId=110222411)
- [Crawler learning notes (1): basic usage of requests to crawl various mainstream websites 1](https://editor.csdn.net/md/?articleId=103932009)

Preface
Object-oriented programming
Python has been an object-oriented language from its earliest design, which makes it easy to create classes and objects in Python. This article improves the code written in the previous posts, refactoring it in an object-oriented style as practice in object-oriented programming.
Tip: the following is the main body of this article. The cases below are provided for reference.
1, Object-oriented programming
Introduction to object-oriented terms:
- Class: a template describing a collection of objects with the same properties and methods. It defines the properties and methods common to every object in the collection; an object is an instance of a class.
- Class variable: a variable shared across all instances of a class, defined inside the class but outside any method body. Class variables are usually not used as instance variables.
- Data member: a class variable or instance variable used to hold data associated with a class and its instance objects.
- Method overriding: if a method inherited from the parent class cannot meet the needs of the child class, the child class can redefine it. This process is called method overriding (also known as method rewriting).
- Local variable: a variable defined inside a method; it is visible only within that method.
- Instance variable: in a class definition, attributes are represented by variables. Such variables, declared inside the class but outside its other member methods, are called instance variables.
- Inheritance: a derived class inherits the fields and methods of a base class. Inheritance also allows an object of a derived class to be treated as a base-class object. For example, an object of type Dog may be derived from the Animal class, modeling the "is-a" relationship (a Dog is an Animal).
- Instantiation: creating an instance of a class, i.e. a concrete object of the class.
- Method: a function defined inside a class.
- Object: an instance of the data structure defined by a class. An object comprises data members (class variables and instance variables) and methods.
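The terms above can be illustrated with a minimal sketch. The `Animal`/`Dog` names are hypothetical, used only to demonstrate class variables, instance variables, inheritance, overriding, and instantiation:

```python
class Animal:
    kingdom = "Animalia"              # class variable, shared by all instances

    def __init__(self, name):
        self.name = name              # instance variable, unique to each object

    def speak(self):                  # method: a function defined in the class
        return f"{self.name} makes a sound"


class Dog(Animal):                    # inheritance: Dog "is-a" Animal
    def speak(self):                  # method overriding
        return f"{self.name} says woof"


dog = Dog("Rex")                      # instantiation
print(dog.kingdom)                    # Animalia (inherited class variable)
print(dog.speak())                    # Rex says woof (overridden method)
print(isinstance(dog, Animal))        # True: a Dog object is also an Animal
```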
2, Usage steps
1. Main format
The code is as follows (example). First import the required libraries and modules, then define a crawler class `Spider`.
1. Define the initial parameters inside the crawler class, writing the headers and URL as class attributes.
2. Send the request with those parameters, mainly using a GET request from the requests library.
3. Define a `run()` main program and execute all the work inside `run()`.
```python
import ...                                 # import the corresponding libraries and modules

class Spider:                              # define a crawler class
    def __init__(self):                    # define the initial parameters
        ...

    def send_request(self, url, headers):  # send the request with parameters
        ...

    def run(self):                         # define and run the main program
        ...

if __name__ == '__main__':
    DB = Spider()
    DB.run()
```
2. Code demonstration
In crawler note 1, the program that crawls Douban movie titles is refactored with object-oriented programming. The code is as follows (example):
```python
# -*- coding: utf-8 -*-
import requests
from lxml import html

class DoubanSpider:
    def __init__(self):
        self.url = 'https://movie.douban.com/'  # URL of the data to crawl
        self.headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36'}

    def send_request(self, url, headers):
        page = requests.Session().get(url, headers=headers)  # a Session keeps cookies across requests
        tree = html.fromstring(page.text)
        return tree

    def run(self):
        datas = self.send_request(self.url, self.headers)
        result = datas.xpath('//td[@class="title"]//a/text()')  # extract the required data
        count = 0
        for i in result:
            print(i, end='\n')
            count += 1
            if count % len(result) == 0:
                print(end='\n')

if __name__ == '__main__':
    DoubanPC = DoubanSpider()
    DoubanPC.run()
```
In crawler note 2, the program that crawls Guokr article titles:
```python
# -*- coding: utf-8 -*-
import requests
from lxml import etree

class GuokrSpider:
    def __init__(self):
        self.url = 'https://www.guokr.com/science/category/science'  # URL of the data to crawl
        self.headers = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Safari/537.36'}

    def send_request(self, url, headers):
        page = requests.Session().get(url, headers=headers)  # a Session keeps cookies across requests
        tree = etree.HTML(page.text)
        return tree

    def run(self):
        datas = self.send_request(self.url, self.headers)
        # the class name below is auto-generated by the site's CSS-in-JS and may change
        result = datas.xpath('//div[@class="layout_skeleton-zgzfsa-3 styled_infotilewrap-sc-1b13hh2-7 ecqury"]//text()')  # extract the required data
        count = 0
        for i in result:
            print(i, end='\n')
            count += 1
            if count % len(result) == 0:
                print(end='\n')

if __name__ == '__main__':
    GuokrPC = GuokrSpider()
    GuokrPC.run()
```
In crawler note 3, the program that uses JSON to crawl Douban:
```python
# -*- coding: utf-8 -*-
import json
import requests

class Dbjson_Spider:
    def __init__(self):
        self.url = 'https://m.douban.com/rexxar/api/v2/subject_collection/movie_showing/items?os=android&for_mobile=1&start=0&count=18&loc_id=108288'  # URL of the data to crawl
        self.headers = {
            "Referer": "https://m.douban.com/movie/nowintheater?loc_id=108288",
            "User-Agent": "Mozilla/5.0 (Linux; Android 6.0; Nexus 5 Build/MRA58N) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/75.0.3770.100 Mobile Safari/537.36"
        }  # without the Referer header the server rejects the request

    def send_request(self, url, headers):
        response = requests.get(url, headers=headers)
        ret1 = json.loads(response.text)  # parse the JSON response
        return ret1

    def run(self):
        datas = self.send_request(self.url, self.headers)
        content_list = datas["subject_collection_items"]
        print(content_list)
        for i in content_list:
            items = {"Movie title": i['title'],
                     "Release time": i['release_date'],
                     "Website": i['uri']}
            content = json.dumps(items, ensure_ascii=False) + "\n"
            with open("doubandianying.txt", "a", encoding="utf-8") as f:
                f.write(content)

if __name__ == '__main__':
    DBjson = Dbjson_Spider()
    DBjson.run()
```
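Because each record is written as one JSON object per line, the saved file can be parsed back line by line. A minimal sketch, using sample data rather than real crawl results (the file name `doubandianying.txt` is taken from the code above):

```python
import json

# Write a couple of sample records in the same one-JSON-object-per-line format.
records = [
    {"Movie title": "Example A", "Release time": "2021-01-01", "Website": "https://example.com/a"},
    {"Movie title": "Example B", "Release time": "2021-01-02", "Website": "https://example.com/b"},
]
with open("doubandianying.txt", "w", encoding="utf-8") as f:
    for r in records:
        f.write(json.dumps(r, ensure_ascii=False) + "\n")

# Read the file back: each non-empty line is an independent JSON object.
with open("doubandianying.txt", encoding="utf-8") as f:
    loaded = [json.loads(line) for line in f if line.strip()]

print(loaded[0]["Movie title"])  # Example A
```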
Summary

The above is what we covered today. This article only briefly introduced how to refactor an existing crawler program in an object-oriented style.