Introduction to Python crawlers

brief introduction

A crawler program has two main parts:
1. HTML getter
2. HTML parser
These two are enough to get started. Managing URLs, disguising user behavior, and running JavaScript are advanced operations; they are not entry-level, and I haven't learned them yet.
An appendix on environment setup is attached at the end.

The following sections use crawling the paper list from the ICML 2021 conference website as a running example.

Obtaining the HTML

There are many ways to fetch a web page. Here I record several that I have encountered and discuss their uses and characteristics.

Method 1: selenium webdriver.get(url)

[recommended]

from selenium import webdriver
url="https://icml.cc/Conferences/2021/Schedule?type=Poster"
driver = webdriver.Firefox()
driver.get(url)
# At this point the page has been loaded; the selenium APIs support all kinds of queries and operations on it

# If you also want the complete html source
from selenium.webdriver.common.by import By
element = driver.find_element(By.XPATH,'/html')
html = element.get_attribute('outerHTML')

Installing and configuring selenium is somewhat troublesome: you have to download a separate driver for each browser (many browsers are supported, such as Chrome, Firefox and IE) and configure the system environment variables.
But it also works best. Some web pages generate their content dynamically; for these you have no choice but to behave like a real user and let a browser run the JavaScript (which normally runs automatically). Without a browser or a JavaScript runtime, you can only get the initial static content of the page.
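For such pages it can help to wait explicitly until the dynamic content has been rendered before reading the page. Below is a minimal sketch using selenium's WebDriverWait; the XPath in it is only a placeholder guess based on the page structure used later, not something guaranteed to match.

# A hedged sketch: wait until dynamically generated content appears, then grab the html.
# The XPath below is a placeholder; replace it with one that matches your target page.
from selenium import webdriver
from selenium.webdriver.common.by import By
from selenium.webdriver.support.ui import WebDriverWait
from selenium.webdriver.support import expected_conditions as EC

driver = webdriver.Firefox()
driver.get("https://icml.cc/Conferences/2021/Schedule?type=Poster")
WebDriverWait(driver, 30).until(        # block for up to 30 seconds
    EC.presence_of_element_located((By.XPATH, "//div[@class='row']"))
)
html = driver.page_source               # the full html after JavaScript has run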

Method 2: requests

import requests        # Import the requests package
url = "https://icml.cc/Conferences/2021/Schedule?type=Poster"
response = requests.get(url)    # GET the page; returns a Response object
print(response.text)            # the html source as a string

The above is the GET method; I haven't learned the POST method yet.
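For reference only, the shape of a POST request with requests looks roughly like this; the URL and form fields below are made-up placeholders, since the ICML example only needs GET.

import requests
# Hypothetical endpoint and form fields, just to show what a POST call looks like
response = requests.post("https://example.com/login",
                         data={"username": "user", "password": "pass"})
print(response.status_code)
print(response.text)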

Method 3: urllib

from urllib import request
url = "https://icml.cc/Conferences/2021/Schedule?type=Poster"
html = request.urlopen(url).read()   # returns bytes
html = html.decode()                 # decode the bytes into a string

Method 4: Manual

Open the browser, go to the target url, press Ctrl+S to save the page as an .html file, and then read the file in the program.

with open('xxx.html', 'r', encoding='utf-8') as f:
    html = f.read()
print(html)

Parsing the HTML

The most important task in parsing is to find exactly what you are looking for, without missing anything.

What is xpath

[recommended]
course: w3school xpath
An xpath looks like this: /html/body/div[@class='container']
xpath even has built-in functions (such as contains() and text()), which are quite complete.
Note that indexing in xpath starts at 1.
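As a tiny illustration of the points above (a built-in function and 1-based indexing), here is a sketch using lxml, which is covered in more detail later; the html snippet is made up.

# A small sketch of xpath features: contains() as a built-in function, and 1-based indexing
from lxml import etree

doc = etree.HTML("<div class='container main'><p>one</p><p>two</p></div>")
print(doc.xpath("//div[contains(@class, 'container')]/p/text()"))  # ['one', 'two']
print(doc.xpath("//div/p[1]/text()"))                              # ['one']  (p[1] is the first <p>)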

What is a regular expression

It looks like this: \d{1,3}(\.\d{1,3}){3}, a regular expression for matching an IPv4 address (though it only limits the number of digits in each group, not that each group is ≤ 255).
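A quick sketch to check that caveat with Python's re module (re is also covered as a parsing method below):

import re

pattern = r'\d{1,3}(\.\d{1,3}){3}'
print(re.fullmatch(pattern, '192.168.0.1') is not None)      # True
print(re.fullmatch(pattern, '999.999.999.999') is not None)  # also True: only digit counts are checked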

What is a selector

An html path expression similar to xpath, but not as widely applicable as xpath; for example, Firefox cannot copy a selector for you.
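A tiny sketch of what a selector looks like in practice, using BeautifulSoup's select() (introduced below); the html snippet is made up.

from bs4 import BeautifulSoup

snippet = "<div id='main'><ul class='newsList'><li><a href='#'>hello</a></li></ul></div>"
soup = BeautifulSoup(snippet, "lxml")
print(soup.select("#main ul.newsList > li > a"))
# A rough xpath equivalent would be: //div[@id='main']//ul[@class='newsList']/li/a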

Method 1: selenium find_elements()

from selenium import webdriver
url="https://icml.cc/Conferences/2021/Schedule?type=Poster"
driver = webdriver.Firefox()
driver.get(url)
# The lines above are the same as in the section on obtaining the html

from selenium.webdriver.common.by import By
xpath = "/html/body/div[@class='container pull-left']/div[@id='wholepage']/main[@id='main']/div[@class='container']/div[@class='row']/div[@class='col-xs-12 col-sm-9']/div[position()>3]/div"
driver.find_elements(By.XPATH,xpath)
# This returns a list whose elements are all WebElement objects (a type defined by selenium)

find_elements returns all matching elements, whereas find_element returns only the first match.
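Assuming the long XPath above really does match the paper entries, a hedged follow-up sketch of reading text and attributes from those WebElements:

# Continuing the sketch above: read text and attributes from the matched WebElements
elements = driver.find_elements(By.XPATH, xpath)
for el in elements[:5]:                        # just the first few, to keep output short
    print(el.text)                             # the element's visible text
    print(el.get_attribute("outerHTML")[:80])  # its raw html, truncated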
But this method has some problems:
1. If the browser is left open for a long time, selenium loses its connection to it, and the selenium API can only control a live browser session.
2. selenium does not provide a parser for raw html strings.

Method 2: lxml.etree

[recommended]

from lxml import etree
# html is a string in html format
Selector = etree.HTML(html)
xpath = "/html/body/div[@class='container pull-left']/div[@id='wholepage']/main[@id='main']/div[@class='container']/div[@class='row']/div[@class='col-xs-12 col-sm-9']/div[position()>3]/div"
Selector.xpath(xpath)   # returns a list of matching elements

I haven't studied its usage in depth, but broadly speaking, lxml supports xpath.
lxml also works well alongside selenium, and its functionality makes up for selenium's shortcomings.
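As a concrete example of combining the two, here is a minimal sketch (assuming the long XPath from earlier still matches the ICML page):

# A minimal sketch: let selenium render the page, then hand the html to lxml for parsing
from selenium import webdriver
from lxml import etree

driver = webdriver.Firefox()
driver.get("https://icml.cc/Conferences/2021/Schedule?type=Poster")
tree = etree.HTML(driver.page_source)   # parse the rendered html
driver.quit()                           # the parsed tree no longer needs the browser

xpath = "/html/body/div[@class='container pull-left']/div[@id='wholepage']/main[@id='main']/div[@class='container']/div[@class='row']/div[@class='col-xs-12 col-sm-9']/div[position()>3]/div"
for el in tree.xpath(xpath)[:5]:
    print(el.xpath("string(.)").strip())  # string(.) collapses an element's text content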

Method 3: beautiful soup

from bs4 import BeautifulSoup
# html is a string in html format
soup = BeautifulSoup(html, 'lxml')  # the lxml parser needs to be installed separately
data = soup.select('#main>div>div.mtop.firstMod.clearfix>div.centerBox>ul.newsList>li>a') # this is a CSS selector, not xpath

BeautifulSoup works differently from xpath: it is DOM-based and loads and parses the whole document into a tree, so its time and memory overhead is much higher, while lxml only traverses the parts you ask for.
For my own use cases, BeautifulSoup is not as convenient as selenium + lxml.

Method 4: HTMLParser

from html.parser import HTMLParser
class MyHTMLParser(HTMLParser):

    def handle_starttag(self, tag, attrs):
        """
        recognize start tag, like <div>
        :param tag:
        :param attrs:
        :return:
        """
        print("Encountered a start tag:", tag)

    def handle_endtag(self, tag):
        """
        recognize end tag, like </div>
        :param tag:
        :return:
        """
        print("Encountered an end tag :", tag)

    def handle_data(self, data):
        """
        recognize data, html content string
        :param data:
        :return:
        """
        print("Encountered some data  :", data)

    def handle_startendtag(self, tag, attrs):
        """
        recognize tag that without endtag, like <img />
        :param tag:
        :param attrs:
        :return:
        """
        print("Encountered startendtag :", tag)

    def handle_comment(self,data):
        """

        :param data:
        :return:
        """
        print("Encountered comment :", data)


parser = MyHTMLParser()
parser.feed('<html><head><title>Test</title></head>'
            '<body><h1>Parse me!</h1><img src = "" />'
            '<!-- comment --></body></html>')

(The code above is taken from a post on the HTMLParser commonly used by Python crawlers.)
HTMLParser is built into Python's standard library, but it is cumbersome to use: you have to subclass HTMLParser and override these five handler methods, which is not very friendly.
Yet many of the tutorials found online look exactly like this. Did they all crawl and copy from each other?

Method 5: re (regular expression)

import re
# html is a string in html format
pattern = r'<a href=[^>]*>(.*?)</a>'   # non-greedy, so each match stops at the nearest </a>
result = re.findall(pattern, html)

No more explanation needed; I'll just leave it here.

Method N: hand-rolled

I took some detours earlier. At the time I only knew BeautifulSoup and knew nothing about selenium or xpath, so I invented a rule syntax of my own, which turned out to look a lot like xpath. Although it seems useless now, I'm posting it here for everyone to study and discuss.

from collections.abc import Iterable   # Iterable is needed by getElements below
from bs4 import BeautifulSoup

def ruleParse(rule):
    # Parse a rule string into a list of the form [{'name': '', 'attr': {}}, ...]
    # examples:
    # rule = "body.div(container pull-left).div(wholepage).main#main.div(container).div(row).div$2.div$3:-1"
    # rule = "body.div{class:[container, pull-left]}.div{class:wholepage}.main{id:main}.div(container).div(row).div{range:2}.div{range:(3,-1)}"
    # rule = "body.div$3.div$0.main.div$0.div$0.div$2.div$3:-1"
    rule_ = []
    i=0
    mode = 'word'
    string = ""
    splitters = "@#$(){}[]:."
    quoters = '\'"'
    quoter='"'
    spaces = " \t\n\r\b"
    while i<len(rule):
        char = rule[i]
        i+=1
        if mode=='word':
            if char in splitters:
                string = string.strip()
                if string!="":
                    rule_.append([string,'word'])
                string = ""
                rule_.append([char,'splitter'])
            elif char in quoters:
                string = string.strip()
                if string!="":
                    rule_.append([string,'word'])
                string = ""
                quoter = char
                mode='str'
            elif char in spaces:
                string = string.strip()
                if string!="":
                    rule_.append([string,'word'])
                string = ""
            elif char=="\\" and i!=len(rule):
                char = rule[i]
                i+=1
                string+=char
            else:
                string+=char
        else: # 'str' (quoted)
            if char == quoter:
                rule_.append([string,"str"])
                string=""
                mode='word'
            else:
                string+=char
    if mode!='word':
        raise "Syntax Error: unquoted string."
    else:
        string = string.strip()
        if string!="":
            rule_.append([string,'word'])
        string = ""
    rule__ = []
    stack=[]
    mode='name' # name class id range attr
    name=None
    class_=None
    id_=None
    range_=None
    attr={}
    def record():
        nonlocal name,class_,id_,range_,attr,stack,rule__
        if name==None:
            raise ValueError("Empty name")
        if class_!=None:
            attr['class']=class_
        if id_!=None:
            attr['id']=id_
        if range_!=None:
            if len(range_)==0:
                raise SyntaxError("Empty range")
            i=0
            while i<len(range_):
                item=range_[i]
                if item==':':
                    if i==0 or range_[i-1]==':':
                        range_.insert(i,None)
                        i+=1
                i+=1
            if range_[-1]==':':
                range_.append(None)
            range__=[]
            for item in range_:
                if item==':':
                    continue
                elif item==None:
                    range__.append(None)
                else:
                    range__.append(int(item))
            attr['$range']=range__
        rule__.append({
            'name':name,
            'attr':attr
        })
        name=None
        class_=None
        id_=None
        range_=None
        attr={}
    for content,typ in rule_:
        if mode=='name':
            if typ=='word':
                if name==None:
                    name=content
                else:
                    raise ValueError("Duplicated name")
            else: # typ == 'splitter'
                if content=='.':
                    record()
                elif content=='(':
                    mode='class'
                    stack=[]
                elif content=='#':
                    mode='id'
                elif content=='$':
                    mode='range'
                elif content=='{':
                    mode='attr'
                    stack=[]
                else:
                    raise ValueError(f'Invalid splitter "{content}"')
        elif mode=='class':
            if typ=='splitter':
                if content==')':
                    mode='name'
                    class_=stack.copy()
                    stack=[]
                else:
                    raise ValueError(f'Invalid splitter "{content}"')
            else: #typ=='word'
                stack.append(content)
        elif mode=='id':
            if typ=='splitter':
                raise SyntaxError(f'Unexpected splitter "{content}"')
            else: #typ=='word'
                if id_!=None:
                    raise SyntaxError('Duplicated ID')
                else:
                    id_=content
                    mode='name'
        elif mode=='range':
            if typ=='word':
                stack.append(content)
            else: # typ == 'splitter'
                if content=='.':
                    range_=stack.copy()
                    stack=[]
                    record()
                    mode='name'
                elif content in ',:':
                    stack.append(':')
        elif mode=='attr':
            if typ=='splitter':
                if content=="{":
                    pass
                elif content==':':
                    pass
                else:
                    raise ValueError(f'Invalid splitter "{content}"')
        else:
            raise ValueError(f'Invalid mode "{mode}"')
    if mode=='range':
        range_=stack.copy()
        stack=[]
    if mode in ['range','name']:
        record()
    else:
        raise SyntaxError("Unquoted content")
    return rule__

def getElements(source, rule: str):
    # Convert any supported source into a list of BeautifulSoup objects
    if type(source)==str:
        soups=[BeautifulSoup(source,features="lxml")]
    elif type(source)==BeautifulSoup:
        soups=[source]
    elif isinstance(source, Iterable):
        soups=list(source)
    else:
        raise TypeError("Invalid type for source.")
    parsedRule = ruleParse(rule)
    unalignedResultCnt=0
    for i in range(len(parsedRule)):
        item=parsedRule[i]
        name=item['name']
        attr=item['attr']
        def lambda_(tag):
            nonlocal name, attr
            if not tag.name==name:
                return False
            for key,value in attr.items():
                if key=='$range':
                    continue
                if not tag.has_attr(key):
                    return False
                if key=='class':
                    for class_ in value:
                        if not class_ in tag[key]:
                            return False
                else:
                    if not tag[key]==value:
                        return False
            return True
        results=[]
        resultCnt=0
        resultLen=None
        for soup in soups:
            result=soup.find_all(lambda_)
            resultCnt+=len(result)
            if resultLen==None:
                resultLen=len(result)
            else:
                if resultLen!=len(result):
                    unalignedResultCnt+=1
            results.append(result)
        if resultCnt==0:
            return []
        if '$range' in attr.keys():
            range_ = attr['$range']
            if len(range_)==1:
                index=range_[0]
                for i in range(len(results)):
                    results[i]=[results[i][index]]
            elif len(range_)==2:
                start, end=range_
                for i in range(len(results)):
                    results[i]=results[i][start:end]
            elif len(range_)==3:
                start,end,skip=range_
                for i in range(len(results)):
                    results[i]=results[i][start:end:skip]
            else:
                raise ValueError(f'Invalid range {range_}')
        else:
            pass
        soups=[]
        for result in results:
            soups+=result
    if unalignedResultCnt!=0:
        print(f"Warning: unaligned result number ({unalignedResultCnt})")
    return soups

# html is a string in html format
rule="body.div(container pull-left).div(wholepage).main#main.div(container).div(row).div$2.div$3:-1"
soups=getElements(html,rule)

It works OK, but I came across xpath on the Internet right after I finished writing it 😃.

Appendix: Environment Setup

Windows: cmd/Powershell

PowerShell is recommended.

selenium
pip install selenium
selenium official website
Firefox driver: geckodriver
visit geckodriver
Chrome driver
visit chromedriver or image
IE driver: IEDriverServer
visit IEDriverServer
Edge driver: MicrosoftWebDriver
visit MicrosoftWebDriver
Opera driver
visit github: operachromiumdriver
PhantomJS driver: phantomjs
visit phantomjs

After downloading a driver, add its directory to the PATH environment variable, or tell selenium the exact path when creating the driver.
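For the second option, a minimal sketch in selenium 4 style (the driver path below is a made-up placeholder):

# Point selenium at an explicit geckodriver location instead of relying on PATH
from selenium import webdriver
from selenium.webdriver.firefox.service import Service

service = Service(executable_path=r"C:\tools\geckodriver.exe")
driver = webdriver.Firefox(service=service)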

lxml
pip install lxml

BeautifulSoup
pip install beautifulsoup4

urllib
urllib is part of the Python standard library and does not need to be installed.

requests
pip install requests

You may need to use conda to manage different python environments.
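For example, a dedicated environment could be created roughly like this (the environment name and Python version are arbitrary):

conda create -n crawler python=3.9
conda activate crawler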

Linux/Mac: bash/zsh

I haven't tried it; it should be much the same as on Windows. I list it separately only for rigor.
