Web page parsing for crawlers

When crawling online data, we inevitably have to parse the content of web pages. Here is a summary of some common approaches:

(1) BeautifulSoup

Reference video: https://www.bilibili.com/video/av93140655?from=search&seid=18437810415575324694

The BeautifulSoup class in the bs4 library can be used for simple web page parsing; several commonly used functions are worth noting:

  • soup.find
  • soup.find_all
  • soup.select_one
  • soup.select

There are also attributes such as soup.head.contents and soup.head.children. For a single tag, note a.attrs (gets all attribute values) and a.string (gets the corresponding text value, equivalent to a.get_text()).
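
A minimal sketch of these calls on a small, made-up HTML fragment (the tags, classes, and URLs here are only for illustration):

from bs4 import BeautifulSoup

html = ('<html><head><title>Demo</title></head>'
        '<body><a class="link" href="/a">first</a>'
        '<a class="link" href="/b">second</a></body></html>')
soup = BeautifulSoup(html, 'html.parser')

print(soup.find('a'))              # the first <a> tag
print(soup.find_all('a'))          # list of all <a> tags
print(soup.select_one('a.link'))   # first tag matching the CSS selector
print(soup.select('a.link'))       # list of all CSS-selector matches

a = soup.find('a')
print(a.attrs)                     # {'class': ['link'], 'href': '/a'}
print(a.string, a.get_text())      # both give 'first'
print(list(soup.head.children))    # direct children of <head>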

Addendum: in CSS selectors, tag names need no decoration; a class name (the value in class="className") is preceded by a dot (.), and an ID name (the value in id="idName") is preceded by a hash (#). We can filter elements the same way with soup.select(), which returns a list. Reference: https://www.cnblogs.com/kangblog/p/9153871.html. The :nth-of-type(n) selector matches the nth sibling element of the same type, similar to :nth-child(n), which matches the nth child of the parent element.
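
For example, a small sketch of the selector syntax on a hypothetical fragment:

from bs4 import BeautifulSoup

html = ('<div id="content"><p class="title">one</p>'
        '<p class="title">two</p><p>three</p></div>')
soup = BeautifulSoup(html, 'html.parser')

print(soup.select('p'))                             # by tag name, no decoration
print(soup.select('.title'))                        # by class name, prefixed with a dot
print(soup.select('#content'))                      # by ID, prefixed with a hash (#)
print(soup.select('#content > p:nth-of-type(2)'))   # second <p> of its type under #content

The Douban Top 250 example below applies select() to a real page: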

import requests
from bs4 import BeautifulSoup

class Douban:
    def __init__(self):
        self.URL = 'https://movie.douban.com/top250'
        self.starnum = []
        for start_num in range(0, 250, 25):   # 10 pages, 25 entries each
            self.starnum.append(start_num)
        self.header = {'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/76.0.3809.100 Safari/537.36'}

    def get_top250(self):
        for start in self.starnum:
            start = str(start)
            html = requests.get(self.URL, params={'start': start}, headers=self.header)
            soup = BeautifulSoup(html.text, "html.parser")
            # The > denotes a direct parent > child relationship; keep a space on each side
            names = soup.select('#content > div > div.article > ol > li > div > div.info > div.hd > a > span:nth-of-type(1)')
            for name in names:
                print(name.get_text())

if __name__ == "__main__":
    douban = Douban()
    douban.get_top250()

find/find_all match step by step along the document tree, while select can target elements by local features with CSS selectors; the specific differences are described at https://www.cnblogs.com/suancaipaofan/p/11786046.html, so pick whichever you prefer.

(2) re

A regular expression is a text pattern made up of ordinary characters (for example, the letters a through z) and special characters (called metacharacters). It is commonly used to match, retrieve, replace, and split text that conforms to a pattern (rule).

Single character:
        .: any character except newline
        []: e.g. [aoe], [a-w]; matches any character in the set
        \d: digit, [0-9]
        \D: non-digit
        \w: digits, letters, underscore, Chinese characters
        \W: the opposite of \w
        \s: any whitespace character, including spaces, tabs, form feeds, and so on; equivalent to [ \f\n\r\t\v]
        \S: non-whitespace

    Quantity modifiers:
        *: any number of times (>= 0)
        +: at least once (>= 1)
        ?: optional, 0 or 1 time
        {m}: exactly m times
        {m,}: at least m times, e.g. hello{3,}
        {m,n}: between m and n times

    Boundaries:
        $: ends with
        ^: begins with

    Grouping:
        (ab)

    Greedy mode: .*
    Non-greedy (lazy) mode: .*?

    re.I: ignore case
    re.M: multi-line matching
    re.S: single-line matching (. also matches newlines)
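
A quick sketch of greedy vs. non-greedy matching and of the re.S / re.I flags, on made-up strings:

import re

text = '<b>one</b><b>two</b>'
print(re.findall(r'<b>(.*)</b>', text))    # greedy: ['one</b><b>two']
print(re.findall(r'<b>(.*?)</b>', text))   # non-greedy: ['one', 'two']

multiline = '<div>\nhello\n</div>'
print(re.findall(r'<div>(.*?)</div>', multiline))          # [] because . does not match \n
print(re.findall(r'<div>(.*?)</div>', multiline, re.S))    # ['\nhello\n']
print(re.findall(r'abc', 'ABC', re.I))                     # ['ABC'], case ignored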

re.sub(pattern, replacement, string)
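
A minimal sketch of re.sub on made-up strings:

import re

print(re.sub(r'\d+', '#', 'room 204, floor 3'))      # 'room #, floor #'
print(re.sub(r'\s+', ' ', 'too    many   spaces'))   # collapse runs of whitespace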

Example analysis:

# Extract 170
string = 'I like a girl of 170'
re.findall(r'\d+', string)            # ['170']

# Extract http:// and https://
key = 'http://www.baidu.com and https://boob.com'
re.findall(r'https?://', key)         # ['http://', 'https://']

# Extract hello (the group captures only the text between the tags)
key = 'lalala<hTml>hello</HtMl>hahah'
re.findall(r'<[Hh][Tt][mM][lL]>(.*)</[Hh][Tt][mM][lL]>', key)   # ['hello']



### Crawl and save the images from Qiushibaike
import requests
import re
import os

# Create a folder
if not os.path.exists('./qiutuLibs'):
    os.mkdir('./qiutuLibs')

headers = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 6.1; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/73.0.3683.103 Safari/537.36'
}

# A general URL template with a page-number placeholder
url = 'https://www.qiushibaike.com/pic/page/%d/?s=5185803'

for page in range(1, 36):
    new_url = url % page                                  # fill the page number into the template
    page_text = requests.get(url=new_url, headers=headers).text

    # Parse out the image addresses
    ex = '<div class="thumb">.*?<img src="(.*?)" alt.*?</div>'
    src_list = re.findall(ex, page_text, re.S)            # re.S because the page source contains \n

    # The src attribute is not a complete URL: the protocol prefix is missing
    for src in src_list:
        src = 'https:' + src
        # Request each image URL separately; .content returns the binary response data
        img_data = requests.get(url=src, headers=headers).content
        img_name = src.split('/')[-1]
        img_path = './qiutuLibs/' + img_name
        with open(img_path, 'wb') as fp:
            fp.write(img_data)
            print(img_name, 'downloaded successfully!')

(3) lxml

Use lxml to parse the fetched page data, then filter its content with XPath expressions, in the following form:

from lxml import etree
page = etree.HTML(html.decode('utf-8'))   # html is the raw bytes of the fetched page

# <a> tags: all <a> elements under <body> under <html>
tags = page.xpath(u'/html/body/a')
print(tags)
# Result: [<Element a at 0x34b1f08>, ...]
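
Besides returning whole elements, XPath can also pull out attribute values and text directly; a small sketch on a made-up fragment:

from lxml import etree

page = etree.HTML('<html><body><a href="/x">one</a><a href="/y">two</a></body></html>')
print(page.xpath('//a/@href'))       # ['/x', '/y']   attribute values
print(page.xpath('//a/text()'))      # ['one', 'two'] text content
print(page.xpath('//a[2]/text()'))   # ['two']        the second <a>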

Example: crawling the TIOBE top 20 most popular programming languages:

# Import required libraries
import urllib.request as urlrequest
from lxml import etree

# Get html
url = r'https://www.tiobe.com/tiobe-index/'
page = urlrequest.urlopen(url).read()
# Create lxml object
html = etree.HTML(page)

# Parse the HTML and filter out the table data with XPath
df = html.xpath('//table[contains(@class, "table-top20")]/tbody/tr//text()')
# Collect the data into a DataFrame (every 5 consecutive text nodes form one row)
import pandas as pd
tmp = []
for i in range(0, len(df), 5):
    tmp.append(df[i: i+5])
df = pd.DataFrame(tmp)
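
To inspect or keep the result, you can print the first rows and dump the DataFrame to a CSV file (the file name below is arbitrary):

print(df.head(20))                                              # first rows of the scraped table
df.to_csv('tiobe_top20.csv', index=False, encoding='utf-8')     # save for later use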

All three methods above can parse web content and extract what we want before crawling further. Finally, a bonus example comparing them:

"""
//Crawl all the pictures of girls
"""
import requests
import re
import time
import os
from bs4 import BeautifulSoup

header = {
    "user-agent": "Mozilla/5.0 (Windows NT 10.0; WOW64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/77.0.3865.120 Safari/537.36",
    "cookie": "__cfduid=db8ad4a271aefd9fe52709ba6d3d94f561583915551; UM_distinctid=170c8b96544233-0c58b81612557b-404b032d-100200-170c8b96545354" ,
    "accept": "text/html,application/xhtml+xml,application/xml;q=0."
}


# Get the current directory and make sure the output folder exists
root = os.getcwd()
os.makedirs("./meinv", exist_ok=True)
cnt = 48
for page in range(2):
    # Request one list page
    response = requests.get(f"https://www.meizitu.com/a/list_1_{page+1}.html", headers=header, stream=True, timeout=(3, 7))
    response.encoding = "gb2312"
    if response.status_code == 200:
        #result = re.findall("""<a target='_blank' href=".*?"><img src="(.*?)" alt="(.*?)"></a>""", response.text)   # regular-expression matching
        result2 = BeautifulSoup(response.text, 'html.parser').find_all("img")                                         # BeautifulSoup
        #result3 = etree.HTML(response.text).xpath("//img/@src")                                                      # lxml (needs: from lxml import etree)
        for i in result2:
            path = i.attrs["src"]
            try:
                response = requests.get(path, headers=header, stream=True, timeout=(3, 7))
                with open("./meinv/" + str(cnt) + ".jpg", "wb") as f:
                    f.write(response.content)   # .content is the binary response data
                cnt += 1
            except Exception:
                pass
    print(f"Page {page+1} crawled successfully, enjoy!")

 

To be continued!
