Python 3 Native Crawler

1. A crawler example

1. Principle: analyze text and extract information with regular expressions.

2. Example goal: crawl the streamer popularity ranking for a category on Panda TV.

Analyze the site structure

Operation: press F12 to view the HTML, and press Ctrl+Shift+C to pick an element with the mouse and find its corresponding HTML.
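The core idea can be shown in a couple of lines: a capture group in `re.findall` pulls out just the text between two tags. The HTML fragment here is made up for illustration:

```python
import re

# Made-up HTML fragment standing in for a real page
html = '<span class="video-number">12.3</span>'

# The capture group (...) returns only the text between the tags
numbers = re.findall(r'<span class="video-number">([\s\S]*?)</span>', html)
print(numbers)  # ['12.3']
```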

 

3. Steps:

 

Preparation:

1. Clarify the goal (decide what to grab, to determine which pages to fetch)

2. Find the web page that contains the data

3. Analyze the page structure to locate the tags where the data lives

Implementation:

4. Simulate an HTTP request, send it to the server, and get back the HTML it returns

5. Use regular expressions to extract the data we want

......
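Steps 4 and 5 can be sketched in a few lines. To keep the sketch runnable offline, a `data:` URL (which `urlopen` accepts) stands in for a real server, and the HTML content is made up:

```python
from urllib import request
import re

# A data: URL stands in for a real web server, so no network is needed
url = 'data:text/html;charset=utf-8,<b>Anna</b>LOL<b>Ben</b>'

html = request.urlopen(url).read()   # step 4: fetch the raw bytes
html = str(html, encoding='utf-8')   # decode bytes to str

names = re.findall(r'<b>([\s\S]*?)</b>', html)  # step 5: extract with a regex
print(names)  # ['Anna', 'Ben']
```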

4. Code

 

2. Debugging code in VSCode

Breakpoint debugging: F5 to start, F10 to step over, F5 to continue to the next breakpoint, F11 to step into.
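A minimal `.vscode/launch.json` for debugging the current file looks roughly like this (a sketch; exact fields can vary with the Python extension version):

```json
{
    "version": "0.2.0",
    "configurations": [
        {
            "name": "Python: Current File",
            "type": "python",
            "request": "launch",
            "program": "${file}",
            "console": "integratedTerminal"
        }
    ]
}
```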

 

3. Basic principles of HTML structure analysis

Locate the information to grab by finding a tag plus an identifier.

1. Prefer a unique tag (or one with a unique identifier, such as a class)

2. Prefer the tag closest to the data
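The difference these principles make can be seen on a made-up fragment: a generic tag over-matches, while a unique class isolates exactly the wanted block:

```python
import re

# Made-up fragment: one wanted block among unrelated divs
html = ('<div class="ad">buy stuff</div>'
        '<div class="video-info">Anna 12.3</div>')

# A generic tag matches too much...
generic = re.findall(r'<div[\s\S]*?>([\s\S]*?)</div>', html)
print(generic)   # ['buy stuff', 'Anna 12.3']

# ...a unique class isolates exactly the data we want
specific = re.findall(r'<div class="video-info">([\s\S]*?)</div>', html)
print(specific)  # ['Anna 12.3']
```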

 

4. Hierarchical analysis and principles of data extraction

1. Two related values (e.g. a name and a number) can be treated as one group of data; then look for a tag that wraps the group.

2. Prefer closed (parent) tags that wrap all of the required data
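Combined, the two principles give a two-level extraction: first cut out each parent block, then extract the fields inside it. The HTML here is a made-up stand-in for the real page:

```python
import re

# Made-up stand-in: each parent div wraps one streamer's name and number
html = ('<div class="video-info"><span class="video-title"><i></i>Anna</span>'
        '<span class="video-number">12.3</span></div>'
        '<div class="video-info"><span class="video-title"><i></i>Ben</span>'
        '<span class="video-number">9</span></div>')

anchors = []
for block in re.findall(r'<div class="video-info">([\s\S]*?)</div>', html):
    # Second level: extract each field from within the parent block
    name = re.findall(r'</i>([\s\S]*?)</span>', block)
    number = re.findall(r'<span class="video-number">([\s\S]*?)</span>', block)
    anchors.append({'name': name[0], 'number': number[0]})

print(anchors)  # [{'name': 'Anna', 'number': '12.3'}, {'name': 'Ben', 'number': '9'}]
```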

 

5. Parsing HTML with regular expressions: the complete code

 

'''
This is a spider (module docstring)
'''

from urllib import request
import re


class Spider():
    '''
    This is a spider class
    '''
    url = 'https://www.panda.tv/cate/lol'

    # Note the nesting of single and double quotation marks.
    # [\w\W] or [\s\S] matches any character, including newlines (unlike .)
    # * matches 0 or more times; the trailing ? makes it non-greedy,
    # so matching stops at the first </div>
    root_pattern = r'<div class="video-info">([\s\S]*?)</div>'
    name_pattern = r'</i>([\s\S]*?)</span>'
    number_pattern = r'<span class="video-number">([\s\S]*?)</span>'

    def __fetch_content(self):
        '''
        Private method: fetch the web page content
        '''
        r = request.urlopen(Spider.url)
        htmls = r.read()
        htmls = str(htmls, encoding='utf-8')
        return htmls

    def __analysis(self, htmls):
        '''
        Extract the data with regular expressions
        '''
        root_html = re.findall(Spider.root_pattern, htmls)
        anchors = []
        for html in root_html:
            name = re.findall(Spider.name_pattern, html)
            number = re.findall(Spider.number_pattern, html)
            anchor = {'name': name, 'number': number}
            anchors.append(anchor)
        return anchors

    def __refine(self, anchors):
        l = lambda anchor: {
            'name': anchor['name'][0].strip(),
            'number': anchor['number'][0]  # single-element list to string
        }
        return map(l, anchors)

    def __sort_seed(self, anchor):
        # Extract the number (which may contain a decimal point)
        r = re.findall(r'\d+\.?\d*', anchor['number'])
        number = float(r[0])
        if '万' in anchor['number']:  # '万' means ten thousand
            number *= 10000
        return number

    def __sort(self, anchors):
        '''
        key determines what is compared.
        sorted() is ascending by default; reverse=True gives descending order.
        Strings cannot be compared numerically, so convert to a number
        and handle the '万' (ten thousand) suffix.
        '''
        anchors = sorted(anchors, key=self.__sort_seed, reverse=True)
        return anchors

    def __show(self, anchors):
        for rank in range(0, len(anchors)):
            print('rank ' + str(rank + 1) + ': ' +
                  anchors[rank]['name'] + '--' +
                  anchors[rank]['number'])

    def go(self):  # entry method of Spider
        htmls = self.__fetch_content()
        anchors = self.__analysis(htmls)
        anchors = list(self.__refine(anchors))
        anchors = self.__sort(anchors)
        self.__show(anchors)


spider = Spider()
spider.go()
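The effect of the non-greedy ? used in the patterns above can be seen on a toy string:

```python
import re

s = '<span>a</span><span>b</span>'

# Non-greedy: each match stops at the first closing tag
non_greedy = re.findall(r'<span>([\s\S]*?)</span>', s)
print(non_greedy)  # ['a', 'b']

# Greedy: the match runs to the last closing tag
greedy = re.findall(r'<span>([\s\S]*)</span>', s)
print(greedy)      # ['a</span><span>b']
```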

Crawler frameworks:

Beautiful Soup

Scrapy

Tags: Python encoding Lambda

Posted on Wed, 12 Feb 2020 11:23:30 -0500 by Mr_J