Method tutorial | How to avoid a website's anti-crawler detection

[solution]

      Method 1: remove the --enable-automation argument before importing launch, so the webdriver flag is not exposed

from pyppeteer import launcher
# Remove --enable-automation before importing launch so the webdriver flag is not exposed
launcher.AUTOMATION_ARGS.remove("--enable-automation")

This method is written up in many blogs on the Internet, but when I ran it, it raised an error:

module 'pyppeteer.launcher' has no attribute 'AUTOMATION_ARGS'

  I don't know whether everyone else hits the same error when they run it.
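     If you are on a newer pyppeteer where AUTOMATION_ARGS no longer exists, the ignoreDefaultArgs launch option (mirroring puppeteer) can usually skip just that flag. A minimal sketch, assuming your pyppeteer version accepts a list here:

import asyncio
from pyppeteer import launch

async def openBrowser():
    # Assumption: this pyppeteer build accepts ignoreDefaultArgs as a list,
    # so only --enable-automation is dropped and the other defaults stay.
    browser = await launch(headless=False,
                           ignoreDefaultArgs=['--enable-automation'])
    page = await browser.newPage()
    return browser, page

# asyncio.get_event_loop().run_until_complete(openBrowser())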

     Method 2: comment out the --enable-automation parameter directly in the launcher.py script

  Location of the launcher.py script:

(Your own Python Installation path)\Python37_64\Lib\site-packages\pyppeteer\launcher.py


     Then open the script, comment out the --enable-automation line, and save it.


     This method builds on Method 1: since the parameter cannot be removed from our own code, we comment it out in the library's source instead. Ha ha.

     Method 3: execute a JS script that sets the value of navigator.webdriver to false; this also avoids detection by the site's anti-crawler checks.

await page.evaluateOnNewDocument('() => { Object.defineProperties(navigator, { webdriver: { get: () => false } }) }')

  Here is a test script; I have verified that it actually works:

import asyncio
from pyppeteer import launch

url = 'http://www.nhc.gov.cn/xcs/yqtb/list_gzbd.shtml'

async def fetchUrl(url):
    # Launch a visible browser; dumpio pipes browser output, autoClose keeps it open
    browser = await launch({'headless': False, 'dumpio': True, 'autoClose': False})
    page = await browser.newPage()
    # Override navigator.webdriver before any page script runs
    await page.evaluateOnNewDocument('() => { Object.defineProperties(navigator,'
                                     '{ webdriver: { get: () => false } }) }')
    await page.goto(url)

asyncio.get_event_loop().run_until_complete(fetchUrl(url))
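     To confirm the override actually took effect, you can read navigator.webdriver back from the page after it loads. A small sketch (checkWebdriver is my own helper name, not part of the original script):

async def checkWebdriver(page):
    # Prints False (or None) if the override worked, True if the site
    # can still see the automation flag.
    value = await page.evaluate('() => navigator.webdriver')
    print("navigator.webdriver =", value)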

Python web crawler in practice: xxxx travel

1. Read URL list

import pandas as pd

df = pd.read_csv('data.csv', sep = ',', usecols = [0,1])
for index, title, url in df.itertuples():
    print(title)
    print(url)

Output:

     You can read the title and link of each article.
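     The script assumes data.csv holds the article title in the first column and the article link in the second. A minimal sketch of how a test file with that layout could be generated (the titles and URLs below are made up):

import pandas as pd

# Hypothetical two-column layout: column 0 = title, column 1 = url
sample = pd.DataFrame({
    'title': ['Article one', 'Article two'],
    'url': ['http://example.com/1', 'http://example.com/2'],
})
sample.to_csv('data.csv', index=False)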

2. Initiate network request

     The following is the fetchUrl function, which is used to initiate a network request.

import requests

def fetchUrl(url):
    '''
    Initiate network request
    '''
    headers = {
        'accept': 'text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9',
        'user-agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.97 Safari/537.36',
    }
    r = requests.get(url,headers=headers)
    r.raise_for_status()
    r.encoding = "utf-8"
    return r.text
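     A quick way to try it out (the URL below is only a placeholder; substitute a real link from data.csv):

html = fetchUrl('http://example.com')   # placeholder URL
print("Fetched", len(html), "characters")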

3. Crawl the text and convert it to Markdown format

     The getContent function extracts the HTML of the article body from the page source and does some simple preprocessing.

     This includes converting images and headings to Markdown-friendly form, removing irrelevant tags, and replacing some troublesome special characters.

from bs4 import BeautifulSoup

def getContent(html):
    '''
    Extract the HTML of the article body
    '''
    # Strip the &nbsp; entities left in the raw HTML
    html = html.replace("&nbsp;", "")
    # Collapse runs of tildes (applied twice so longer runs shrink further)
    html = html.replace("~~", "~").replace("~~", "~")

    bsObj = BeautifulSoup(html,'lxml')
    title = bsObj.find("h1").text
    content = bsObj.find("div",attrs = {"id":"b_panel_schedule"})

    # Replace <img> tags with Markdown image syntax, using data-original and title
    imgs = content.find_all("img")
    for img in imgs:
        src = img['data-original']
        txt = img['title']
        img.insert_after("![{0}]({1})".format(txt,src))
        img.extract()

    # Replace <h5> blocks with Markdown level-5 headings taken from the title box
    header5 = content.find_all("h5")
    for h5 in header5:
        t5 = h5.find("div", attrs = {"class":"b_poi_title_box"})
        #print(t5.text)
        h5.insert_after("##### " + t5.text)
        h5.extract()

    # Drop the comment/operation bars, leaving a line-break marker in their place
    cmts = content.find_all("div", attrs = {"class":"ops"})
    for s in cmts:
        s.insert_after('< br/>')
        s.extract()

    return str(content)

    The Html2Markdown function is mainly used to convert HTML text into Markdown format and correct some format errors in the conversion process.

import html2text as ht

def Html2Markdown(html):
    '''
    Convert the article body from HTML format to Markdown format
    '''
    text_maker = ht.HTML2Text()
    text = text_maker.handle(html)
    text = text.replace("#\n\n", "# ")
    text = text.replace("\\.",".")
    text = text.replace(".\n",". ")
    text = text.replace("< br/>","\n")
    text = text.replace("tr-\n","tr-")
    text = text.replace("View all __","")
    return text
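     If you just want to see what html2text itself does, here is a tiny standalone example (the input HTML is made up):

import html2text as ht

demo_html = "<h1>Title</h1><p>Some <b>bold</b> text and a <a href='http://example.com'>link</a>.</p>"
maker = ht.HTML2Text()
print(maker.handle(demo_html))
# Prints roughly:
# # Title
#
# Some **bold** text and a [link](http://example.com).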

4. Save files

     When saving the file, we use the article title as the file name. Some characters are not allowed in file names, such as forward and back slashes / \, English quotation marks ' ", the angle brackets < >, and so on. They have to be removed or replaced with their Chinese full-width equivalents, otherwise the save will fail with an error.

import os

def saveMarkdownFile(title, content):
    '''
    Save the text to a Markdown file
    title: file name
    content: text content to save
    '''
    # Remove or replace symbols that are not allowed in file names
    # (the replacements below use full-width Chinese equivalents)
    title = title.replace("\\", "")
    title = title.replace("/", "")
    title = title.replace("\"", "“")
    title = title.replace("\'", "‘")
    title = title.replace("<", "《")
    title = title.replace(">", "》")
    title = title.replace("|", "&")
    title = title.replace(":", "：")
    title = title.replace("*", "x")
    title = title.replace("?", "？")

    # Make sure the output directory exists before writing
    os.makedirs("data", exist_ok=True)
    with open("data/" + title + ".md", 'w', encoding='utf-8') as f:
        f.write(content)
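     An alternative, if you would rather not enumerate every character, is to strip everything Windows forbids in file names with a single regular expression. A minimal sketch (safeFileName is my own helper name):

import re

def safeFileName(title):
    # Replace characters not allowed in Windows file names with an underscore
    return re.sub(r'[\\/:*?"<>|]', '_', title)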

5. Crawler scheduler

     Finally, we need to write a crawler scheduling function to start and control our crawler.

import time
from random import randint

def main():

    df = pd.read_csv('data.csv', sep = ',', usecols = [0,1])
    for index, title, url in df.itertuples():
        html = fetchUrl(url)
        content = getContent(html)
        md = Html2Markdown(content)
        saveMarkdownFile(title, md)

        # Random waiting time to avoid too frequent crawling and triggering anti crawling mechanism
        t = randint(0,3)
        print("wait -- ",str(t),"s")
        time.sleep(t)

# Start crawler
main()
print("Crawl complete!")

     That is how to avoid a website's anti-crawler detection. If anything is missing, or you know more tricks, suggestions and additions are welcome. I hope this article helps you get around anti-crawler measures in the future. Thank you ~

Reprinted from: clever crane


Tags: Python

Posted on Tue, 23 Nov 2021 19:00:48 -0500 by eleven0