Quick start with Python crawlers: the requests library and the pyquery parsing library

1. Analyzing a crawler

Stage | Type | Problem | What to do
1 | Request | Where is the web data? | Discover the URL pattern
2 | Request | How to get the page data | First use requests to successfully access one URL and get one page of data
3 | Parsing | Locate the required data in HTML | Use pyquery to parse the page data
3 | Parsing | Locate the required data in JSON | Use json or resp.json() to parse JSON page data
4 | Storage | How to store the data | Use the csv library to store the data in a csv file
5 | Completion | Repeat stages 2-4 | Use a for loop to access and parse all URLs
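
Stages 3 and 4 for JSON data can be sketched with the standard library alone. The payload, its field names, and the helpers parse_hotels and save_rows are assumptions made up for illustration; they are not a real Dianping API.

```python
import csv
import io
import json

# Hypothetical JSON body standing in for a real response.
raw = '{"hotels": [{"name": "Hotel A", "price": 320}, {"name": "Hotel B", "price": 450}]}'

def parse_hotels(json_text):
    """Stage 3: locate the needed fields inside the JSON data."""
    data = json.loads(json_text)
    return [(item["name"], item["price"]) for item in data["hotels"]]

def save_rows(rows, fileobj):
    """Stage 4: write a header line and the rows in csv format."""
    writer = csv.writer(fileobj)
    writer.writerow(["name", "price"])
    writer.writerows(rows)

rows = parse_hotels(raw)
# An in-memory buffer keeps the sketch side-effect free; a real run would use
# open('hotels.csv', 'w', newline='', encoding='utf-8') instead.
buffer = io.StringIO()
save_rows(rows, buffer)
print(buffer.getvalue())
```

Passing the file object into save_rows keeps the storage step independent of where the data ends up.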

Take Dianping (dianping.com) as an example:
http://www.dianping.com/shanghai/hotel

Discover the URL pattern

1. Construct URLs according to the pattern
2. Generate the URLs in batches. For conventional HTML pages, click through the pagination links to find the URL pattern

template='http://www.dianping.com/shanghai/hotel/p{P}'
for page in range(1,51):
	url=template.format(P=page)
	print(url)

2. Libraries

requests: network request library

1. Installation

pip install requests   # Windows command line
pip3 install requests   # Mac command line

2. Access methods
requests has two common access methods, both of which return a Response object:

Method and common parameters | Explanation | Frequency of use
requests.get(url, headers=..., cookies=..., params=..., proxies=...) | Send a GET request and return a Response object | 95%
requests.post(url, headers=..., cookies=..., data=..., proxies=...) | Send a POST request and return a Response object | 5%

import requests
url='http://www.dianping.com/shanghai/hotel/p1'
resp=requests.get(url)
resp
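
To see how the params and headers arguments shape a request, it can help to build the request without sending it. The sketch below uses a hypothetical endpoint and parameter names (example.com, city, page), chosen only for illustration.

```python
import requests

# Hypothetical endpoint and parameter names, used only for illustration.
url = 'http://example.com/search'
params = {'city': 'shanghai', 'page': 2}
headers = {'User-Agent': 'Mozilla/5.0'}

# Request(...).prepare() builds the request without sending it,
# which makes it easy to inspect what requests.get() would transmit.
prepared = requests.Request('GET', url, params=params, headers=headers).prepare()
print(prepared.method)
print(prepared.url)
print(prepared.headers['User-Agent'])
```

The same dictionaries can be passed directly as requests.get(url, params=params, headers=headers).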

3. The returned Response
The status code shown in the Response:

·Codes starting with 2 indicate normal access
·Codes starting with 4 indicate a problem with the request; for example, 403 usually means the crawler has been blocked by the website
·Codes starting with 5 indicate a problem with the server
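
The categories above can be sketched as a small helper. The function name describe_status and its labels are assumptions for illustration; with a real response it would be called as describe_status(resp.status_code).

```python
def describe_status(status_code):
    """Map an HTTP status code to the rough categories listed above."""
    first_digit = status_code // 100
    if first_digit == 2:
        return "ok"
    if first_digit == 4:
        # 403 is singled out because it usually means the crawler was blocked.
        return "blocked" if status_code == 403 else "client error"
    if first_digit == 5:
        return "server error"
    return "other"

for code in (200, 403, 404, 500):
    print(code, describe_status(code))
```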

If a 403 occurs (as shown in the figure above), the solution is to disguise the crawler as a normal browser by sending browser-like request headers:

Further reading: Python crawler anti-crawl methods [link]

import requests
url='http://www.dianping.com/shanghai/hotel/p1'
# Copy the request headers from the browser's developer tools and arrange them as a dictionary
headers={'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.26 Safari/537.36'}
resp=requests.get(url,headers=headers)
resp

Common Response attributes and methods:

Response attribute or method | Effect | Usage
Response.json() | Parse the body as JSON and return Python data | 35%
Response.text | Get the body as text (HTML page data) | 60%
Response.content | Get the raw binary body (files, pictures, audio and video) | 5%
Response.encoding | Get or set the encoding used to decode Response.text, normally taken from the page's charset | Rarely used

For example:
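
A minimal sketch of these accessors follows. So that it runs without network access, the Response is built by hand and its body is a made-up JSON string. Setting the private _content attribute is a demonstration trick only; in real code the object comes from requests.get.

```python
import requests

# Build a Response by hand so the example runs offline.
# Setting the private _content attribute is only a demonstration trick;
# in real code the object comes from requests.get(url, headers=headers).
resp = requests.Response()
resp.status_code = 200
resp._content = '{"city": "shanghai", "hotels": 2}'.encode("utf-8")
resp.encoding = "utf-8"   # tells .text how to decode the bytes

raw_bytes = resp.content  # bytes, suitable for files, pictures, audio
page_text = resp.text     # str decoded with resp.encoding
data = resp.json()        # dict parsed from the JSON body

print(type(raw_bytes), type(page_text), data["city"])
```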

pyquery: web page parsing and element-locating library

Installation

pip install pyquery

Suppose we have already obtained the HTML string of a web page:

>html = open('example.html', encoding='utf-8').read()
>html

Convert the HTML string to a PyQuery object

> from pyquery import PyQuery
> doc=PyQuery(html)
> doc

[Figure: output of doc. The HTML string has been parsed, and doc can now be iterated over.]

[Figure: the type of doc, showing that the HTML string has been converted to a PyQuery object.]

Selector expressions

Among people, a person can be located through parents, children, relatives, appearance and neighbors.
Similarly, in the HTML world, there are many different expressions for locating tags.

Selector expression | Example | Explanation
class | .Intro | Select the tags with class="Intro"; returns PyQuery
id | #intro | Select the tag with id="intro"; returns PyQuery
tag | li | Select all li tags; returns PyQuery
Multi-level tag | ul li | Select all li tags that are descendants of a ul; returns PyQuery
tag & attr | li[class] | Select all li tags that have a class attribute; returns PyQuery
tag & attr & attr_value | li[class=cando] | Select all li tags whose class attribute equals "cando"

class

>doc('.Intro')  #Locate by class

id

>doc('#intro')  #Locate by id

tag

##PyQuery(Selector_Expression)
>doc('ul')      #Locate by tag name (rarely used and imprecise)

Multi-level tag

>doc('html body .Intro') #Descendant selector
>doc('html body #intro') #Descendant selector
>doc('html ul li')       #Descendant selector
>doc('.Intro ul li')     #Descendant selector
>doc('#intro ul li')     #Descendant selector

tag & attr

>doc('li[class]')    #Locate li tags that have a class attribute
>doc('div[title]')   #Locate div tags that have a title attribute
>doc('li[name]')     #Locate li tags that have a name attribute

tag & attr & attr_value

>doc('li[class=cando]')
>doc('div[title=value]')  #value is a placeholder for the title to match
>doc('li[name=web]')

PyQuery common methods

PyQuery object method | Function
PyQuery(Selector_Expression) | Find the tags matching the selector expression; returns PyQuery
PyQuery.items(Selector_Expression) | Find the tags matching the selector expression; returns a generator of PyQuery objects, one per tag
PyQuery.eq(index) | Get the tag at position index (counting from 0)
PyQuery.text() | Get the text inside the tag
PyQuery.attr(attribute) | Get the value of the tag's attribute

PyQuery(Selector_Expression)

>doc('li')

PyQuery.items(Selector_Expression)

>doc.items('li')


The return value is a generator object, which means it can be traversed in a for loop.

PyQuery.eq(index)

eq selects a tag by its position; the index starts from 0

print(doc("li").eq(0))

PyQuery.text()

Get the text inside a tag; this works on any PyQuery object

print(doc("li").eq(0).text())

PyQuery.attr(attribute)

Get attribute value

doc("li").eq(0).attr("name")
doc("li").eq(0).attr("class")

Calling text() without first locating a specific tag outputs the text of every matched tag (imprecise), and calling attr() without locating returns only the attribute value of the first matched tag (the rest are ignored). Summary: locate the exact tag first, then call attr() or text().

Tags: Python crawler

Posted on Wed, 13 Oct 2021 12:32:40 -0400 by BloodyMind