1. Crawler analysis
Stage | Type | Problem | What to do |
---|---|---|---|
1 | Request | Where is the data on the site? | Discover the URL pattern |
2 | Request | How to fetch the page data? | Use requests to access one URL and retrieve its page data |
3 | Parsing | How to extract the required data from HTML? | Use pyquery to parse the page data |
3 | Parsing | How to extract the required data from JSON? | Use the json library or resp.json() to parse the JSON data |
4 | Storage | How to store the data? | Use the csv library to write the data to a csv file |
5 | Completion | Repeat steps 2-4 | Use a for loop to access and parse all the URLs |
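The storage step (stage 4 in the table above) can be sketched with the standard csv library; the file name and the row fields here are invented placeholders, not data from the site:

```python
import csv

# Hypothetical rows produced by stages 2-3; the fields are invented.
rows = [
    {'name': 'Hotel A', 'price': 428},
    {'name': 'Hotel B', 'price': 515},
]

with open('hotels.csv', 'w', newline='', encoding='utf-8') as f:
    writer = csv.DictWriter(f, fieldnames=['name', 'price'])
    writer.writeheader()    # header row: name,price
    writer.writerows(rows)  # one csv line per scraped record
```

Note that csv stores everything as text, so numeric fields come back as strings when the file is read again.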
Take Dianping (大众点评) hotel listings as an example:
http://www.dianping.com/shanghai/hotel
Discovering the URL pattern
1. For regular HTML pages, click through the pagination and observe how the URL changes to find the pattern
2. Generate the page URLs in batches according to that pattern
```python
template = 'http://www.dianping.com/shanghai/hotel/p{P}'
for page in range(1, 51):
    url = template.format(P=page)
    print(url)
```
2. Libraries
requests: the network request library
1. Installation

```shell
pip install requests   # Windows command line
pip3 install requests  # Mac command line
```
2. Access methods
requests provides two access methods, both of which return a Response object:
Common methods | Explanation | Frequency of use |
---|---|---|
requests.get(url,headers,cookies,params,proxies) | Send a GET request and return a Response object | 95% |
requests.post(url,headers,cookies,data,proxies) | Send a POST request and return a Response object | 5% |
```python
import requests

url = 'http://www.dianping.com/shanghai/hotel/p1'
resp = requests.get(url)
resp
```
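The `params` argument is encoded into the URL query string, which can be inspected without sending any request by preparing one. Note: Dianping actually paginates via the `/p{n}` path as shown earlier, so the `page` parameter below is only an assumption for illustration:

```python
import requests

# Build (but do not send) a request to see how params become the query string.
# The 'page' parameter is a made-up example, not Dianping's real scheme.
req = requests.Request('GET', 'http://www.dianping.com/shanghai/hotel',
                       params={'page': 2})
prepared = req.prepare()
print(prepared.url)  # http://www.dianping.com/shanghai/hotel?page=2
```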
3. The Response object
The status code on the Response tells you how the request went:
·2xx means the access succeeded
·4xx, for example 403, means the crawler was blocked by the website
·5xx means there is a problem on the server side
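These three categories can be expressed as a small helper (my own illustration, not part of requests):

```python
def status_class(code: int) -> str:
    """Classify an HTTP status code into the categories above."""
    if 200 <= code < 300:
        return 'ok'            # 2xx: access succeeded
    if 400 <= code < 500:
        return 'blocked'       # 4xx, e.g. 403: request refused, crawler likely blocked
    if 500 <= code < 600:
        return 'server error'  # 5xx: problem on the server side
    return 'other'

print(status_class(200), status_class(403), status_class(502))
# ok blocked server error
```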
If a 403 occurs, the solution is to camouflage the crawler as a normal browser:
Anti-scraping and countermeasures [link]
Python crawler anti-blocking methods
```python
import requests

url = 'http://www.dianping.com/shanghai/hotel/p1'
# Find the headers in the browser developer tools and copy them into a dict
headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 '
                  '(KHTML, like Gecko) Chrome/63.0.3239.26 Safari/537.36'
}
resp = requests.get(url, headers=headers)
resp
```
Common Response methods and attributes:
Response member | Effect | Frequency of use |
---|---|---|
Response.json() | Parse a JSON response body | 35% |
Response.text | Get the HTML page data as text | 60% |
Response.content | Get binary data (such as files, pictures, audio, video) | 5% |
Response.encoding | Get or set the encoding used to decode the page (from the charset) | Rarely used |
For example, `resp.text` returns the decoded HTML, `resp.json()` parses a JSON body into Python objects, and `resp.content` gives the raw bytes.
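Roughly speaking, `resp.json()` applies the standard json module to the response text; the JSON body below is invented for illustration:

```python
import json

# A made-up JSON body, similar to what an API endpoint might return
body = '{"shop": "Hotel A", "score": 4.5}'
data = json.loads(body)  # roughly what resp.json() does to the body
print(data['shop'], data['score'])  # Hotel A 4.5
```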
pyquery: the page parsing and locating library
Installation

```shell
pip install pyquery
```
Suppose we have the HTML string of a page:

```python
html = open('example.html', encoding='utf-8').read()
html
```
Convert the HTML string to a PyQuery object:

```python
from pyquery import PyQuery

doc = PyQuery(html)
doc
```
After the conversion, doc is of type PyQuery, and it can be iterated over.
Selector expressions
Just as a person can be located through parents, children, relatives, appearance, or neighbors, in the HTML world there are many different expressions for locating tags:
Selector expression | Example | Explanation |
---|---|---|
class | .Intro | Select tags with class="Intro"; returns PyQuery |
id | #intro | Select the tag with id="intro"; returns PyQuery |
tag | li | Select all li tags; returns PyQuery |
Multi-layer tags | ul li | Select all li tags that are descendants of a ul; returns PyQuery |
tag & attr | li[class] | Select all li tags that have a class attribute; returns PyQuery |
tag & attr & attr_value | li[class=cando] | Select all li tags whose class attribute equals "cando"; returns PyQuery |
class

```python
doc('.Intro')  # locate by class
```
id

```python
doc('#Intro')  # locate by id
```
tag

```python
# PyQuery(Selector_Expression)
doc('ul')  # locate by tag (not commonly used; not precise)
```
Multi-layer tags

```python
doc('html body .Intro')  # multi-layer tags
doc('html body #intro')
doc('html ul li')
doc('.intro ul li')
doc('#intro ul li')
```
tag & attr

```python
doc('li[class]')   # locate li tags that have a class attribute
doc('div[title]')  # locate div tags that have a title attribute
doc('li[name]')    # locate li tags that have a name attribute
```
tag & attr & attr_value

```python
doc('li[class=cando]')
doc('div[title=...]')
doc('li[name=web]')
```
PyQuery common methods
PyQuery object method | Function |
---|---|
PyQuery(Selector_Expression) | Find the tags matching Selector_Expression; returns PyQuery |
PyQuery.items(Selector_Expression) | Find the tags matching Selector_Expression; returns an iterable of PyQuery objects |
PyQuery.eq(index) | Get the (index+1)-th matched tag (index starts at 0) |
PyQuery.text() | Get the text inside the tag |
PyQuery.attr(attribute) | Get the value of the given attribute of the tag |
PyQuery(Selector_Expression)

```python
doc('li')
```
PyQuery.items(Selector_Expression)

```python
doc.items('li')
```

The return value is a generator, which means it can be iterated over.
PyQuery.eq(index)
eq takes the index of the matched tag, starting from 0:

```python
print(doc('li').eq(0))
```
PyQuery.text()
Extracts the text inside the tag; it works on any PyQuery object:

```python
print(doc('li').eq(0).text())
```
PyQuery.attr(attribute)
Gets an attribute value:

```python
doc('li').eq(0).attr('name')
doc('li').eq(0).attr('class')
```
Calling text() without first narrowing the selection outputs the text of every matched tag merged together (imprecise), and calling attr() without narrowing returns only the first matched tag's attribute value (the rest are ignored). In short: locate down to the specific tag first, then call text() or attr().