brief introduction
A crawler program consists of two main parts:
1. html getter
2. html parser
These two are enough to get started. Managing URLs, disguising user behavior, and running JavaScript are advanced operations; they are not entry-level, and I have not learned them yet.
The environment setup is attached at the end.
The following sections use crawling the paper list from the ICML 2021 conference official website as a running example.
Getting the html
There are many ways to fetch web page content. Here I record several that I have encountered and summarize their uses and characteristics.
Method 1: selenium webdriver.get(url)
[recommended]
from selenium import webdriver

url = "https://icml.cc/Conferences/2021/Schedule?type=Poster"
driver = webdriver.Firefox()
driver.get(url)
# At this point the page has been loaded; the APIs in the selenium library
# support all kinds of queries and operations on the live page.

# If you further want the full html source as a string:
from selenium.webdriver.common.by import By
element = driver.find_element(By.XPATH, '/html')
html = element.get_attribute('outerHTML')
Installing and configuring selenium is troublesome: you need to download a separate driver for each browser (many browsers are supported, such as Chrome, Firefox, and IE) and add it to the system environment variables.
But it also works the best. Some web pages generate their content dynamically; in that case the only way to get the real content is to pretend to be a user and let a browser run the JavaScript (which normally runs automatically). Without a browser or a JS runtime, you can only get the initial static content of the page.
Method 2: requests
import requests  # import the requests package

url = "https://icml.cc/Conferences/2021/Schedule?type=Poster"
response = requests.get(url)  # fetch the page; response.text is the html string
print(response.text)
The above uses the GET method. I don't understand the POST method yet.
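For reference, here is a minimal POST sketch with requests; the URL and form fields below are made up purely for illustration, a real site needs its actual endpoint and parameters.

import requests

# hypothetical endpoint and form data, just to show the shape of the call
url = "https://example.com/search"
data = {"query": "graph neural network"}
response = requests.post(url, data=data)  # form data goes in the request body
print(response.status_code)
print(response.text)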
Method 3: urllib
from urllib import request

url = "https://icml.cc/Conferences/2021/Schedule?type=Poster"
html = request.urlopen(url).read()  # bytes
html = html.decode()                # decode the bytes into an html string
Method 4: Manual
Open the browser, go to the target url, press Ctrl+S to save the page as an .html file, and then read that file in your program.
with open('xxx.html', 'r', encoding='utf-8') as f:  # specify the encoding explicitly to avoid platform-default surprises
    html = f.read()
print(html)
Parsing the html
The most important task in parsing is to find everything you are looking for, and to miss nothing.
What is xpath
[recommended]
course: w3school xpath
xpath looks like this: /html/body/div[@class='container']
xpath even has built-in functions, and they are quite comprehensive.
Note that the index in xpath starts with 1.
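As a small self-contained sketch with lxml (the html snippet here is made up just for demonstration), showing the 1-based index and a built-in function:

from lxml import etree

# a tiny made-up html snippet, only for demonstration
html = "<html><body><div class='container'><a href='/a'>first</a><a href='/b'>second</a></div></body></html>"
tree = etree.HTML(html)
# indexes start at 1: a[1] selects the FIRST <a>
print(tree.xpath("/html/body/div[@class='container']/a[1]/text()"))  # ['first']
# built-in functions such as contains() can be used in predicates
print(tree.xpath("//a[contains(@href, 'b')]/text()"))                # ['second']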
What is a regular expression
It looks like this: \d{1,3}(\.\d{1,3}){3}, which is a regular expression for an IPv4 address (though it only limits the number of digits; it does not require each group to be ≤ 255).
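A quick sketch showing both the match and the limitation (the test string below is made up; the group is written as non-capturing so that findall() returns the whole match):

import re

pattern = r"\d{1,3}(?:\.\d{1,3}){3}"
text = "hosts: 192.168.0.1 and 999.999.999.999"
print(re.findall(pattern, text))  # ['192.168.0.1', '999.999.999.999'] -- the invalid address matches too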
What is a selector
An html path expression similar to xpath, but not as widely supported as xpath; for example, my Firefox could not copy a selector from the inspector.
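A rough sketch of the same query written as a CSS selector and as an xpath (assuming `driver` is the Firefox driver opened in the html-getting section; the paths are illustrative, not the exact ICML ones):

from selenium.webdriver.common.by import By

links = driver.find_elements(By.CSS_SELECTOR, "div.container a")        # CSS selector: class with '.', id with '#'
links = driver.find_elements(By.XPATH, "//div[@class='container']//a")  # a roughly equivalent xpath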
Method 1: selenium find_elements()
from selenium import webdriver

url = "https://icml.cc/Conferences/2021/Schedule?type=Poster"
driver = webdriver.Firefox()
driver.get(url)
# same as in the html-getting section above

from selenium.webdriver.common.by import By
xpath = "/html/body/div[@class='container pull-left']/div[@id='wholepage']/main[@id='main']/div[@class='container']/div[@class='row']/div[@class='col-xs-12 col-sm-9']/div[position()>3]/div"
driver.find_elements(By.XPATH, xpath)
# This returns a list whose items are all WebElement objects (a type defined by selenium).
find_elements returns all matching elements, while find_element returns only the first match.
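Once you have the WebElement list, extraction is straightforward. A small sketch continuing from the code above; .text and .get_attribute() are standard WebElement methods, and what they print depends on the actual page structure:

elements = driver.find_elements(By.XPATH, xpath)
for el in elements[:5]:
    print(el.text)                    # visible text of the element
    print(el.get_attribute("class"))  # any html attribute can be read this way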
But this method has some problems.
1. If the browser is left open and idle for a long time, selenium's API loses its connection to it, and selenium can only control a live browser session.
2. selenium does not provide a parser for html code.
Method 2: lxml.etree
[recommended]
from lxml import etree

# html is a string in html format
selector = etree.HTML(html)
xpath = "/html/body/div[@class='container pull-left']/div[@id='wholepage']/main[@id='main']/div[@class='container']/div[@class='row']/div[@class='col-xs-12 col-sm-9']/div[position()>3]/div"
selector.xpath(xpath)
I haven't studied its usage in detail, but in general lxml supports xpath well.
lxml also works well alongside selenium, and its parsing ability makes up for selenium's shortcoming above.
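As a sketch of what you can do with the results (assuming html holds the page source obtained earlier, and that the divs actually contain links; the exact structure is an assumption): each item returned by xpath() is an lxml Element that supports .text, .get() and relative xpath queries.

from lxml import etree

selector = etree.HTML(html)
divs = selector.xpath(xpath)  # reuse the xpath string from the snippet above
for div in divs[:5]:
    print(div.get("class"))          # read an attribute
    print((div.text or "").strip())  # direct text of the element
    print(div.xpath(".//a/@href"))   # relative queries start with '.'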
Method 3: BeautifulSoup
from bs4 import BeautifulSoup

# assume html is a string in html format
soup = BeautifulSoup(html, 'lxml')  # the lxml parser needs to be installed in advance
data = soup.select('#main>div>div.mtop.firstMod.clearfix>div.centerBox>ul.newsList>li>a')
# note: the argument of select() is a CSS selector, not an xpath
Beautiful Soup works differently from xpath: it is DOM-based, loading the whole document and parsing the entire DOM tree, so its time and memory overhead are much higher, whereas lxml only traverses the parts it needs.
For my personal needs, Beautiful Soup is not as convenient to use as selenium + lxml.
Method 4: HTMLParser
from html.parser import HTMLParser


class MyHTMLParser(HTMLParser):
    def handle_starttag(self, tag, attrs):
        """recognize a start tag, like <div>"""
        print("Encountered a start tag:", tag)

    def handle_endtag(self, tag):
        """recognize an end tag, like </div>"""
        print("Encountered an end tag :", tag)

    def handle_data(self, data):
        """recognize data, the html content string"""
        print("Encountered some data :", data)

    def handle_startendtag(self, tag, attrs):
        """recognize a tag without an end tag, like <img />"""
        print("Encountered startendtag :", tag)

    def handle_comment(self, data):
        """recognize a comment, like <!-- ... -->"""
        print("Encountered comment :", data)


parser = MyHTMLParser()
parser.feed('<html><head><title>Test</title></head>'
            '<body><h1>Parse me!</h1><img src = "" />'
            '<!-- comment --></body></html>')
(the above code is taken from HtmlParser commonly used by Python Crawlers)
HTMLParser is built into Python, but using it is quite cumbersome: you have to subclass HTMLParser to define your own class, and overriding the five handler methods is rather unfriendly.
Yet many of the tutorials found online do exactly this. Do they all just crawl and copy each other?
Method 5: re (regular expression)
import re

# html is a string in html format
pattern = r'<a href=[^>]*>(.*)</a>'
result = re.findall(pattern, html)
No further explanation; I just put it here for reference.
Method N: hand-rolled
I took some detours earlier. At the time I only knew Beautiful Soup and had not heard of selenium or xpath, so I invented a syntax of my own that, in hindsight, looks a lot like xpath. It seems useless now, but I am posting it here for everyone to study and discuss.
from collections.abc import Iterable  # needed for the isinstance check in getElements
from bs4 import BeautifulSoup


def ruleParse(rule):
    # Parse the rule string into a list of the form [{'name': '', 'attr': {}}, ...]
    # examples:
    # rule = "body.div(container pull-left).div(wholepage).main#main.div(container).div(row).div$2.div$3:-1"
    # rule = "body.div{class:[container, pull-left]}.div{class:wholepage}.main{id:main}.div(container).div(row).div{range:2}.div{range:(3,-1)}"
    # rule = "body.div$3.div$0.main.div$0.div$0.div$2.div$3:-1"
    rule_ = []
    i = 0
    mode = 'word'
    string = ""
    splitters = "@#$(){}[]:."
    quoters = '\'"'
    quoter = '"'
    spaces = " \t\n\r\b"
    while i < len(rule):
        char = rule[i]
        i += 1
        if mode == 'word':
            if char in splitters:
                string = string.strip()
                if string != "":
                    rule_.append([string, 'word'])
                    string = ""
                rule_.append([char, 'splitter'])
            elif char in quoters:
                string = string.strip()
                if string != "":
                    rule_.append([string, 'word'])
                    string = ""
                quoter = char
                mode = 'str'
            elif char in spaces:
                string = string.strip()
                if string != "":
                    rule_.append([string, 'word'])
                    string = ""
            elif char == "\\" and i != len(rule):
                char = rule[i]
                i += 1
                string += char
            else:
                string += char
        else:  # mode == 'str' (inside quotes)
            if rule[i] == quoter:
                rule_.append([string, "str"])
                string = ""
                mode = 'word'
            else:
                string += char
    if mode != 'word':
        raise SyntaxError("unquoted string.")
    else:
        string = string.strip()
        if string != "":
            rule_.append([string, 'word'])
            string = ""

    rule__ = []
    stack = []
    mode = 'name'  # name / class / id / range / attr
    name = None
    class_ = None
    id_ = None
    range_ = None
    attr = {}

    def record():
        nonlocal name, class_, id_, range_, attr, stack, rule__
        if name == None:
            raise ValueError("Empty name")
        if class_ != None:
            attr['class'] = class_
        if id_ != None:
            attr['id'] = id_
        if range_ != None:
            if len(range_) == 0:
                raise SyntaxError("Empty range")
            i = 0
            while i < len(range_):
                item = range_[i]
                if item == ':':
                    if i == 0 or range_[i - 1] == ':':
                        range_.insert(i, None)
                        i += 1
                i += 1
            if range_[-1] == ':':
                range_.append(None)
            range__ = []
            for item in range_:
                if item == ':':
                    continue
                elif item == None:
                    range__.append(None)
                else:
                    range__.append(int(item))
            attr['$range'] = range__
        rule__.append({
            'name': name,
            'attr': attr
        })
        name = None
        class_ = None
        id_ = None
        range_ = None
        attr = {}

    for content, typ in rule_:
        if mode == 'name':
            if typ == 'word':
                if name == None:
                    name = content
                else:
                    raise ValueError("Duplicated name")
            else:  # typ == 'splitter'
                if content == '.':
                    record()
                elif content == '(':
                    mode = 'class'
                    stack = []
                elif content == '#':
                    mode = 'id'
                elif content == '$':
                    mode = 'range'
                elif content == '{':
                    mode = 'attr'
                    stack = []
                else:
                    raise ValueError(f'Invalid splitter "{content}"')
        elif mode == 'class':
            if typ == 'splitter':
                if content == ')':
                    mode = 'name'
                    class_ = stack.copy()
                    stack = []
                else:
                    raise ValueError(f'Invalid splitter "{content}"')
            else:  # typ == 'word'
                stack.append(content)
        elif mode == 'id':
            if typ == 'splitter':
                raise SyntaxError(f'Unexpected splitter "{content}"')
            else:  # typ == 'word'
                if id_ != None:
                    raise SyntaxError('Duplicated ID')
                else:
                    id_ = content
                    mode = 'name'
        elif mode == 'range':
            if typ == 'word':
                stack.append(content)
            else:  # typ == 'splitter'
                if content == '.':
                    range_ = stack.copy()
                    stack = []
                    record()
                    mode = 'name'
                elif content in ',:':
                    stack.append(':')
        elif mode == 'attr':
            if typ == 'splitter':
                if content == "{":
                    pass
                elif content == ':':
                    pass
                else:
                    raise ValueError(f'Invalid splitter "{content}"')
        else:
            raise ValueError(f'Invalid mode "{mode}"')
    if mode == 'range':
        range_ = stack.copy()
        stack = []
    if mode in ['range', 'name']:
        record()
    else:
        raise SyntaxError("Unquoted content")
    return rule__


def getElements(source, rule: str):
    # Convert any supported source into a list of BeautifulSoup objects
    if type(source) == str:
        soups = [BeautifulSoup(source, features="lxml")]
    elif isinstance(source, Iterable):
        soups = list(source)
    elif type(source) == BeautifulSoup:
        soups = [source]
    else:
        raise TypeError("Invalid type for source.")
    parsedRule = ruleParse(rule)
    unalignedResultCnt = 0
    for i in range(len(parsedRule)):
        item = parsedRule[i]
        name = item['name']
        attr = item['attr']

        def lambda_(tag):
            nonlocal name, attr
            if not tag.name == name:
                return False
            for key, value in attr.items():
                if key == '$range':
                    continue
                if not tag.has_attr(key):
                    return False
                if key == 'class':
                    for class_ in value:
                        if not class_ in tag[key]:
                            return False
                else:
                    if not tag[key] == value:
                        return False
            return True

        results = []
        resultCnt = 0
        resultLen = None
        for soup in soups:
            result = soup.find_all(lambda_)
            resultCnt += len(result)
            if resultLen == None:
                resultLen = len(result)
            else:
                if resultLen != len(result):
                    unalignedResultCnt += 1
            results.append(result)
        if resultCnt == 0:
            return []
        if '$range' in attr.keys():
            range_ = attr['$range']
            if len(range_) == 1:
                index = range_[0]
                for i in range(len(results)):
                    results[i] = [results[i][index]]
            elif len(range_) == 2:
                start, end = range_
                for i in range(len(results)):
                    results[i] = results[i][start:end]
            elif len(range_) == 3:
                start, end, skip = range_
                for i in range(len(results)):
                    results[i] = results[i][start:end:skip]
            else:
                raise ValueError(f'Invalid range {range_}')
        else:
            pass
        soups = []
        for result in results:
            soups += result
    if unalignedResultCnt != 0:
        print(f"Warning: unaligned result number ({unalignedResultCnt})")
    return soups


# html is a string in html format
rule = "body.div(container pull-left).div(wholepage).main#main.div(container).div(row).div$2.div$3:-1"
soups = getElements(html, rule)
It works reasonably well. But I found xpath online as soon as I finished writing it 😃.
Attachment: Environment Construction
Windows: cmd/Powershell
powershell is recommended.
selenium
pip install selenium
selenium official website
Firefox driver: geckodriver
visit geckodriver
Chrome driver
visit chromedriver or a mirror
IE driver: IEDriverServer
visit IEDriverServer
Edge driver: MicrosoftWebDriver
visit MicrosoftWebDriver
Opera driver
visit github: operachromiumdriver
PhantomJS driver: phantomjs
visit phantomjs
After installing a driver, you need to add its path to the PATH environment variable, or tell selenium the specific path when using it.
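A minimal sketch of the second option, passing the driver path explicitly (assuming Selenium 4; the file location below is hypothetical):

from selenium import webdriver
from selenium.webdriver.firefox.service import Service

# point executable_path at wherever you actually saved geckodriver
service = Service(executable_path=r"C:\tools\geckodriver.exe")
driver = webdriver.Firefox(service=service)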
lxml
pip install lxml
BeautifulSoup
pip install beautifulsoup4
urllib
urllib is part of the Python standard library, so it does not need to be installed separately.
requests
pip install requests
You may need to use conda to manage different python environments.
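For example, a possible workflow (the environment name and Python version here are arbitrary choices):

conda create -n crawler python=3.9
conda activate crawler
pip install selenium lxml beautifulsoup4 requests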
Linux/Mac: bash/zsh
I haven't tried it. It should be about the same as on Windows; I list it separately only for completeness.