Python 3 native crawler

1. Crawler example

1. Principle: analyze the page text and extract information with regular expressions.

2. Example goal: crawl the popularity ranking of streamers ("anchors") in one category on Panda TV.

Analyze site structure

How: press F12 to open the browser's HTML inspector, then Ctrl+Shift+C to pick an element with the mouse and jump to its HTML.


3. Steps:



1. Clarify the purpose (decide what to grab, and which page to grab it from)

2. Find the web page that carries the data

3. Analyze the page structure to locate the tags that hold the data


4. Simulate an HTTP request, send it to the server, and receive the HTML the server returns

5. Use regular expressions to extract the data we want
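Steps 4 and 5 can be sketched in a few lines. The request itself is left commented out because the target URL is not filled in here; for illustration, the regular expression runs on a hard-coded fragment shaped like the target page:

```python
from urllib import request
import re

# Step 4 (sketch): fetch the page. The URL is a placeholder, so the
# request line is shown but commented out.
# htmls = str(request.urlopen('<category page URL>').read(), encoding='utf-8')

# A hard-coded fragment in the same shape as the target page:
html = '<div class="video-info"><span class="video-number">1.2万</span></div>'

# Step 5: extract the data with a non-greedy regular expression.
# [\s\S] matches any character including newlines; *? stops at the first </span>.
numbers = re.findall(r'<span class="video-number">([\s\S]*?)</span>', html)
print(numbers)  # ['1.2万']
```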


4. Code (full listing in section 5 below)


2. Debugging code in VS Code

Breakpoint debugging: F5 to start, F10 to step over, F5 to continue to the next breakpoint, F11 to step into a call.


3. Basic principles of HTML structure analysis

Find the tag and identifier that locate the information to be grabbed.

1. Prefer a unique tag

2. Prefer the tag closest to the data
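The first principle can be seen on a small, hypothetical page fragment: a bare `<div>` matches unrelated parts of the page, while a class-qualified tag matches only the data blocks we want.

```python
import re

# Hypothetical fragment: only the streamer blocks carry the
# "video-info" class, so that tag is a unique, reliable anchor.
html = ('<div class="nav">menu</div>'
        '<div class="video-info"><span>A</span></div>'
        '<div class="video-info"><span>B</span></div>')

# A generic <div> also matches the navigation bar...
print(len(re.findall(r'<div', html)))                       # 3
# ...while the class-qualified tag matches only the data blocks.
print(len(re.findall(r'<div class="video-info">', html)))   # 2
```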


4. Hierarchy analysis and data-extraction principles

1. Treat related values (e.g. a name and its view count) as one group of data, then look for a tag that wraps the group.

2. Prefer tags that are properly closed (parent tags) and wrap all of the required data.
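Both principles can be sketched on a hypothetical two-entry fragment (the `video-nickname` class here is an assumption for illustration): first isolate each parent block, then extract the pair inside it, so a name and its number always stay together.

```python
import re

# Hypothetical fragment with two streamer entries.
html = ('<div class="video-info"><span class="video-nickname">A</span>'
        '<span class="video-number">1万</span></div>'
        '<div class="video-info"><span class="video-nickname">B</span>'
        '<span class="video-number">2000</span></div>')

# Step 1: isolate each parent block (non-greedy, stops at the first </div>).
blocks = re.findall(r'<div class="video-info">([\s\S]*?)</div>', html)

# Step 2: extract the grouped pair inside each block.
anchors = []
for block in blocks:
    name = re.findall(r'<span class="video-nickname">([\s\S]*?)</span>', block)
    number = re.findall(r'<span class="video-number">([\s\S]*?)</span>', block)
    anchors.append({'name': name[0], 'number': number[0]})

print(anchors)  # [{'name': 'A', 'number': '1万'}, {'name': 'B', 'number': '2000'}]
```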


5. Parsing HTML with regular expressions: the full code


# This is a spider (module comment)
from urllib import request
import re


class Spider():
    '''This is a spider class'''
    url = ''  # target category page URL (fill in before running)
    # Note the single quotes around the pattern vs the double quotes inside it
    root_pattern = r'<div class="video-info">([\s\S]*?)</div>'
    # [\w\W], [\s\S] and . all match characters; [\s\S] also matches newlines
    # * matches 0 or more times
    # ? makes it non-greedy: match up to the first </div>
    name_pattern = r'</i>([\s\S]*?)</span>'
    number_pattern = r'<span class="video-number">([\s\S]*?)</span>'

    def __fetch_content(self):
        '''Private method: fetch the web page content'''
        r = request.urlopen(Spider.url)
        htmls = r.read()
        htmls = str(htmls, encoding='utf-8')
        return htmls

    def __analysis(self, htmls):
        '''Extract the data with regular expressions'''
        root_html = re.findall(Spider.root_pattern, htmls)
        anchors = []
        for html in root_html:
            name = re.findall(Spider.name_pattern, html)
            number = re.findall(Spider.number_pattern, html)
            anchor = {'name': name, 'number': number}
            anchors.append(anchor)
        return anchors

    def __refine(self, anchors):
        l = lambda anchor: {
            'name': anchor['name'][0].strip(),
            'number': anchor['number'][0]  # list to single string
        }
        return map(l, anchors)

    def __sort_seed(self, anchor):
        r = re.findall(r'\d*', anchor['number'])  # extract the digits
        number = float(r[0])
        if '万' in anchor['number']:  # handle '万' (ten thousand)
            number *= 10000
        return number

    def __sort(self, anchors):
        # key determines what is compared
        # sorted() is ascending by default; reverse=True sorts descending
        # str cannot be compared numerically, so convert to a number and handle '万'
        anchors = sorted(anchors, key=self.__sort_seed, reverse=True)
        return anchors

    def __show(self, anchors):
        for rank in range(0, len(anchors)):
            print('rank ' + str(rank + 1) +
                  ': ' + anchors[rank]['name'] +
                  ' -- ' + anchors[rank]['number'])

    def go(self):  # entry method of Spider
        htmls = self.__fetch_content()
        anchors = self.__analysis(htmls)
        anchors = list(self.__refine(anchors))
        anchors = self.__sort(anchors)
        self.__show(anchors)


spider = Spider()
spider.go()
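One detail worth noting in the sort seed: a pattern like `'\d*'` keeps only the digits before the decimal point, so `'3.5万'` would be read as 3. A sketch of an improved seed (a standalone function here, not part of the class above) that captures the full decimal number:

```python
import re

def sort_seed(number_text):
    # \d+(?:\.\d+)? captures an integer with an optional decimal part.
    r = re.findall(r'\d+(?:\.\d+)?', number_text)
    number = float(r[0])
    if '万' in number_text:  # '万' means ten thousand
        number *= 10000
    return number

print(sort_seed('3.5万'))  # 35000.0
print(sort_seed('980'))    # 980.0
```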

Crawler frameworks and libraries:

Beautiful Soup



Posted on Wed, 12 Feb 2020 11:23:30 -0500 by Mr_J