[Python Crawler] ☀️Everything Can Be Crawled☀️ - Selenium Automated Testing Tool for Getting Data

First crawler:

Crawler: a program that fetches resources from the Internet.

Requirement: use a program to simulate a browser, request a web address, and fetch the resources or content the website returns.

from urllib.request import urlopen

url = "http://www.baidu.com"
resp = urlopen(url)

with open("mybaidu.html", mode="w",encoding="utf-8") as f:
    f.write(resp.read().decode("utf-8"))  # Read site page source code

print("over!")

Whole-process analysis of Web requests:

  • Server-side rendering: the server merges the data into the HTML before returning it to the browser, so the data is visible in the page source.
  • Client-side rendering: the first request returns only an HTML skeleton; a second request fetches the data, which the browser then renders. The data is not visible in the page source.
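One quick way to tell the two modes apart is to check whether a value you can see in the browser also appears in the raw page source. The sketch below uses two made-up HTML strings and a hypothetical helper `data_in_source`; none of them come from a real site.

```python
def data_in_source(page_source: str, visible_text: str) -> bool:
    """Return True if text seen in the browser also appears in the raw page source."""
    return visible_text in page_source


# Server-side rendering: the data is already merged into the HTML.
server_rendered_html = "<ul><li>The Shawshank Redemption</li></ul>"

# Client-side rendering: only a skeleton arrives; a script fetches the data later.
client_rendered_html = '<ul id="movies"></ul><script src="load_movies.js"></script>'

print(data_in_source(server_rendered_html, "The Shawshank Redemption"))
print(data_in_source(client_rendered_html, "The Shawshank Redemption"))
```

If the check fails for a page whose data you can see in the browser, the data probably arrives through a second request, and that request is the one worth crawling.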

HTTP protocol:

Protocol: an agreed set of rules that lets two computers communicate smoothly. Common protocols include TCP/IP, SOAP, HTTP, and SMTP.

HTTP, short for HyperText Transfer Protocol, is the protocol used to transfer hypertext from a World Wide Web server to a local browser. All data exchanged between browser and server follows HTTP.

HTTP divides each message into three parts, and this holds for both requests and responses:

Request:

Request line -> request method (GET/POST), request URL, protocol version
Request headers -> additional information for the server to use
Request body -> usually carries the request parameters
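The three parts of a request can be sketched by assembling the raw text by hand. This is only an illustration: `build_request` is a hypothetical helper, and the path, host, User-Agent, and form field are made up.

```python
def build_request(method, path, headers, body=""):
    """Assemble a raw HTTP/1.1 request: request line, headers, blank line, body."""
    headers = dict(headers)
    if body:
        # The server needs to know how many bytes of body follow the headers.
        headers["Content-Length"] = str(len(body.encode("utf-8")))
    request_line = f"{method} {path} HTTP/1.1"
    header_lines = [f"{name}: {value}" for name, value in headers.items()]
    # An empty line separates the headers from the body.
    return "\r\n".join([request_line, *header_lines, "", body])


raw = build_request(
    "POST",
    "/login",  # hypothetical path
    {"Host": "example.com", "User-Agent": "demo-crawler/0.1"},
    body="user=demo",  # hypothetical form field
)
print(raw)
```

In practice a library builds this text for you; the sketch only shows where each part sits in the message.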

Important request headers:

  • User-Agent: identifies the client sending the request (which browser or program it comes from)
  • Referer: used for anti-hotlinking (records which page the request came from; often checked by anti-crawling measures)
  • Cookie: locally stored string data (user login information, anti-crawling tokens)

Request methods:

  • GET: explicit submission (parameters are visible in the URL)
  • POST: implicit submission (parameters are carried in the request body)
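The difference between the two methods can be seen without sending anything over the network. The sketch below builds two request objects with the standard-library urllib; the POST endpoint on example.com is a made-up placeholder.

```python
from urllib.parse import urlencode
from urllib.request import Request

params = {"q": "java"}

# GET: the parameters are appended to the URL as a query string.
get_req = Request("https://cn.bing.com/search?" + urlencode(params))

# POST: the parameters travel in the request body (hypothetical endpoint).
post_req = Request("https://example.com/search", data=urlencode(params).encode("utf-8"))

print(get_req.get_method(), get_req.full_url)
print(post_req.get_method(), post_req.data)
```

urllib switches to POST whenever `data` is supplied, which is why no method needs to be named explicitly here.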

Response:

Status line -> protocol version, status code
Response headers -> additional information for the client to use
Response body -> the content the client actually needs, returned by the server (HTML, JSON, etc.)
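The same three-part layout can be checked by splitting a raw response by hand. This is a minimal sketch with a hand-written sample response; `parse_response` is a hypothetical helper, and real code would rely on a library such as requests.

```python
def parse_response(raw: str) -> dict:
    """Split a raw HTTP response into status line fields, headers, and body."""
    # The first empty line separates the head (status line + headers) from the body.
    head, _, body = raw.partition("\r\n\r\n")
    status_line, *header_lines = head.split("\r\n")
    protocol, status_code, reason = status_line.split(" ", 2)
    headers = {}
    for line in header_lines:
        name, _, value = line.partition(":")
        headers[name.strip()] = value.strip()
    return {
        "protocol": protocol,
        "status": int(status_code),
        "reason": reason,
        "headers": headers,
        "body": body,
    }


sample = (
    "HTTP/1.1 200 OK\r\n"
    "Content-Type: application/json\r\n"
    "\r\n"
    '{"title": "demo"}'
)
parsed = parse_response(sample)
print(parsed["status"], parsed["headers"]["Content-Type"], parsed["body"])
```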

Requests:

Install Requests: pip install requests

Click Terminal in PyCharm to open a command window and run the command there.

# The requests package must be installed from the command line before it can be used.

import requests

url = 'https://cn.bing.com/search?q=java'  # url: the address the browser requests, i.e. the address the crawler fetches data from

dic = {
    "User-Agent": 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/92.0.4515.159 Safari/537.36 Edg/92.0.902.78'
}
resp = requests.get(url, headers=dic)  # Add request headers to get past a simple anti-crawling check

print(resp.text)  # Get the page source

Data parsing:

In most cases, we do not need the content of the entire web page, but only the valid data, which involves data extraction.

There are three common ways to extract data:

  • re parsing
  • bs4 parsing
  • xpath parsing

These three methods can be mixed freely; what matters is the result. As long as you get the data you need, it doesn't matter which method you use. Once that works, then think about performance.
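As a small illustration of the first approach, re parsing, the sketch below pulls two fields out of a hand-written HTML snippet using named groups. The snippet, the class names, and the movie titles are made up for the example.

```python
import re

html = """
<li><span class="title">Movie A</span><span class="year">1994</span></li>
<li><span class="title">Movie B</span><span class="year">2010</span></li>
"""

# (?P<name>...) creates a named group; .*? matches as little as possible;
# re.S lets "." also match newlines.
pattern = re.compile(
    r'<span class="title">(?P<name>.*?)</span>.*?<span class="year">(?P<year>\d{4})</span>',
    re.S,
)

movies = [match.groupdict() for match in pattern.finditer(html)]
print(movies)
```

Each match becomes a dictionary keyed by the group names, which is convenient to hand on to a CSV or JSON writer.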

Regular expression:


Crawl case:

# Fetch the page source with requests,
# then use re to extract the information you want
import requests
import re
# csv module: stores the data in CSV format
import csv

# 1. Get the url address of the crawling site and configure the request header:
url = 'https://movie.douban.com/top250'
header = {
    'User-Agent': 'Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome'
                  '/92.0.4515.159 Safari/537.36 Edg/92.0.902.78'
}

# 2. Make a request to the server at that address through the get method in the requests module and return the response rsp:
rsp = requests.get(url=url, headers=header)

# 3. Request result:
page_content = rsp.text

# Parse data (using regular expressions):
obj = re.compile(r'<li>.*?<span class="title">(?P<name>.*?)</span>.*?<p class="">(?P<director>.*?)&nbsp;&nbsp;&nbsp;'
                 r'(?P<performer>.*?)<br>.*?(?P<year>.*?)&nbsp;/&nbsp;.*?<span property="v:best" content="10.0">'
                 r'</span>.*?<span>(?P<comment>.*?)</span>.*?<span class="inq">'
                 r'(?P<direction>.*?)</span>',
                 re.S)

result = obj.finditer(page_content)

# Write the matches to a CSV file.
# Set the encoding to UTF-8 explicitly: if the system default is GBK, the file will not match the UTF-8 encoding the IDE uses.
# newline='' stops the csv module from writing blank lines between rows on Windows.
with open('data.csv', mode='w', encoding='utf-8', newline='') as file:
    csv_writer = csv.writer(file)
    for r in result:
        # Strip surrounding whitespace from every captured group
        dic = {key: value.strip() for key, value in r.groupdict().items()}
        csv_writer.writerow(dic.values())

# Print results:
print('over!')

# 4. Close the response:
rsp.close()
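The CSV written above has no header row. If column names are wanted, csv.DictWriter can add them. The sketch below is one possible variant, not the method used in this article: it writes to an in-memory buffer instead of data.csv, and the single row is a made-up stand-in for what r.groupdict() would return.

```python
import csv
import io

fieldnames = ["name", "director", "performer", "year", "comment", "direction"]

# Hypothetical row standing in for the dicts produced by r.groupdict()
rows = [
    {"name": " Movie A ", "director": "director: X", "performer": "To star: Y",
     "year": " 1994 ", "comment": "100 ratings", "direction": "A tagline."},
]

buffer = io.StringIO()  # in-memory stand-in for open('data.csv', 'w', newline='')
writer = csv.DictWriter(buffer, fieldnames=fieldnames)
writer.writeheader()  # first line: the column names
for row in rows:
    writer.writerow({key: value.strip() for key, value in row.items()})

print(buffer.getvalue())
```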

Resulting data:

The Shawshank Redemption,director: Frank·Darabont Frank Darabont,To star: Tim·Robins Tim Robbins /...,1994,2435461 Person Evaluation,Hope to make people free.

Farewell to my concubine,director: Chen Kaige Kaige Chen,To star: Leslie Cheung Leslie Cheung / Zhang Fengyi Fengyi Zha...,1993,1811073 Person Evaluation,Be really a most unusual and quite individual beauty.

Forrest Gump,director: Robert·Zemikis Robert Zemeckis,To star: Tom·Hanks Tom Hanks / ...,1994,1831357 Person Evaluation,A modern history of the United States.

Léon,director: Luke·Besson Luc Besson,To star: Give Way·Renault Jean Reno / Natalie·Portman ...,1994,1997475 Person Evaluation,The story that the grotesque millet and the little Laurie had to tell.

Titanic,director: James·James Cameron James Cameron,To star: Leonado·leonardo dicaprio Leonardo...,1997,1793280 Person Evaluation,What is lost is forever.

Beautiful life,director: Roberto·Bernini Roberto Benigni,To star: Roberto·Bernini Roberto Beni...,1997,1122598 Person Evaluation,The most beautiful lies.

Spirited away,director: Hayao Miyazaki Hayao Miyazaki,To star: Chebulomei Rumi Hîragi / Freedom to Enter the Field Miy...,2001,1911622 Person Evaluation,The best Miyazaki, the best long-lasting stone.

Schindler's list,director: Steven·Spielberg Steven Spielberg,To star: Liam·Nissen Liam Neeson...,1993,934608 Person Evaluation,To save one person is to save the whole world.

Inception,director: Christopher·Nolan Christopher Nolan,To star: Leonado·leonardo dicaprio Le...,2010,1761254 Person Evaluation,Nolan has given us a dream we can't steal.

The Story of Eight Faithful Dogs,director: Lesser·Lasse Hallstrom Lasse Hallström,To star: Richard·Kiel Richard Ger...,2009,1211124 Person Evaluation,Never forget the one you love.

Intergalactic Traverse,director: Christopher·Nolan Christopher Nolan,To star: Matthew·Mitch McConnell Matthew Mc...,2014,1435042 Person Evaluation,Love is a power that allows us to perceive its existence beyond time and space.

The world of Chumen,director: Peter·Will Peter Weir,To star: gold·Carey Jim Carrey / Laura·Lynne Lau...,1998,1356451 Person Evaluation,If you can't see you again, good morning, good afternoon, good night.

Maritime Pianist,director: Giuseppe·Giuseppe Tornatore Giuseppe Tornatore,To star: Tim·Ross Tim Roth / ...,1998,1431659 Person Evaluation,Everyone has to take a firm path, even if it's broken bones.

Bollywood,director: Rajkumar·Hirani Rajkumar Hirani,To star: Amir·sweat Aamir Khan / card...,2009,1605312 Person Evaluation,Beans in handsome edition, Sheldon in high emotional commerce edition.

Wall-E,director: Andrew·Stanton Andrew Stanton,To star: book·Belt Ben Burtt / Ali...,2008,1129897 Person Evaluation,Small wattage, big life.

Spring in the cattle ranch,director: Christopher·Bharathi Christophe Barratier,To star: Gerard·Junior Gé...,2004,1114896 Person Evaluation,The voice of a natural child is the closest to the existence of God.

Infinity,director: Liu Weiqiang / Mai Zhaohui,To star: Lau Andy / Liang Chaowei / Huang Qiusheng,2002,1098361 Person Evaluation,Hong Kong's never-outdated masterpiece in film history.

Zootopia,director: Byron·Howard Byron Howard / Ritchey·Mole Rich Moore,To star: Ginnifer·...,2016,1588576 Person Evaluation,This is how Disney created the Utopia for us, always kind and brave, always unexpected.

A Chinese Odyssey Part Two Cinderella,director: Liu Zhenwei Jeffrey Lau,To star: Zhou Xingchi Stephen Chow / Wu Mengda Man Tat Ng...,1995,1304007 Person Evaluation,Love for life.

Furnace,director: Huang Donghe Dong-hyuk Hwang,To star: Kong You Yoo Gong / Yu-mi Jeong Yu-mi Jung /...,2011,790634 Person Evaluation,We are not fighting to change the world, but to keep it from changing us.

Godfather,director: Francis·Ford·Copola Francis Ford Coppola,To star: Malone·Brando M...,1972,794767 Person Evaluation,Never hate your opponent, it will make you lose your mind.

When happiness knocks,director: Gabrielle·Muccino Gabriele Muccino,To star: Will·Smith Will Smith ...,2006,1292846 Person Evaluation,Civilian inspiration film.

Totoro,director: Hayao Miyazaki Hayao Miyazaki,To star: Rhizoma Astragali Noriko Hidaka / Sakamoto Ch...,1988,1080655 Person Evaluation,Everyone has a dragon cat in their heart, and childhood will never disappear.

Palpitate with excitement,director: Robert·Lina Rob Reiner,To star: Madeleine·Carol Madeline Carroll / card...,2010,1540956 Person Evaluation,True happiness comes from deep within.

Prosecuting witness,director: Billy·Wilder Billy Wilder,To star: Tyrone·Bao Hua Tyrone Power / Marine·...,1957,390934 Person Evaluation,Billy·Wydman's works.

Install bs4:

pip install bs4

Environment Setup:

Install Selenium

Open the Terminal window of IDEA and enter the following command:

pip install selenium

Install browser driver:

EdgeDriver:

I'm using the Edge browser here, so I need to download the Edge driver from the official website.

Official download page: Microsoft Edge Driver - Microsoft Edge Developer

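from selenium import webdriver

driverpath = r"C:\driver\msedgedriver.exe"  # Where I store the driver file (raw string keeps the backslashes literal)

# executable_path is the Selenium 3 style; Selenium 4 passes the path through a Service object instead
driver = webdriver.Edge(executable_path=driverpath)  # Load the driver

driver.get("https://www.bilibili.com")  # URL for the browser to open

print(driver.page_source)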

When the program runs successfully, the Edge browser on the computer is controlled by the test program (EdgeDriver):

The console prints the page source:

ChromeDriver:

ChromeDriver download page:

ChromeDriver Mirror (taobao.org)

Check the version of Chrome you have installed:

  1. Open Chrome's Settings
  2. Find "About Chrome"
  3. Check the current browser version

Drivers generally tolerate small version differences. If you cannot find a driver whose version number exactly matches your browser, you can usually pick the closest available version without affecting your use.
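Picking the closest version can be sketched as a comparison of major version numbers. The helpers `major` and `pick_driver` are hypothetical, and the version list is only illustrative; check the download page for what is actually available.

```python
def major(version: str) -> int:
    """Return the major version number, e.g. '92.0.4515.159' -> 92."""
    return int(version.split(".")[0])


def pick_driver(browser_version: str, available: list) -> str:
    """Pick the driver whose major version is closest to the browser's."""
    target = major(browser_version)
    return min(available, key=lambda version: abs(major(version) - target))


drivers = ["91.0.4472.101", "92.0.4515.107", "93.0.4577.15"]  # illustrative list
print(pick_driver("92.0.4515.159", drivers))
```

Matching on the major version is an assumption of this sketch; if the browser fails to start, try a driver with a closer version number.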

# Import Package Dependency
from selenium import webdriver

# Specify Driver Path
driverPath = '../driver/chromedriver.exe'

# Create the browser instance
browser = webdriver.Chrome(driverPath)

# Open the target website
url = 'https://www.bilibili.com'
browser.get(url=url)

Selenium element location:

Element location: automation simulates the mouse and keyboard to operate on page elements (click, type, and so on). Before it can do that, it has to find the elements, and WebDriver offers several ways to locate them:

Method                          Example
find_element_by_id              button = browser.find_element_by_id('su')
find_element_by_name            name = browser.find_element_by_name('wd')
find_element_by_xpath           xpathDemo = browser.find_element_by_xpath('//input[@id="su"]')
find_element_by_tag_name        names = browser.find_element_by_tag_name('input')
find_element_by_css_selector    my_input = browser.find_element_by_css_selector('#kw')
find_element_by_link_text       browser.find_element_by_link_text('news')

Once you have located the target element, you can read its information in several ways:

Method                    Code
Get element attribute     .get_attribute('class')
Get element text          .text
Get tag name              .tag_name
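To see what such locators select without launching a browser, the sketch below runs similar lookups against a small hand-written page using Python's standard-library xml.etree.ElementTree, which supports a limited XPath subset. ElementTree is only a stand-in here and is not part of Selenium; the markup is made up, with the ids kw and su and the name wd borrowed from the examples above and the class s_ipt invented.

```python
import xml.etree.ElementTree as ET

# A tiny, well-formed stand-in for a search page (not real site markup)
page = ET.fromstring(
    "<html><body><form>"
    '<input id="kw" name="wd" class="s_ipt"/>'
    '<input id="su" type="submit" value="Search"/>'
    "</form></body></html>"
)

button = page.find(".//input[@id='su']")   # similar to the XPath //input[@id="su"]
box = page.find(".//input[@name='wd']")    # similar to locating by name 'wd'
inputs = page.findall(".//input")          # similar to locating by tag name

print(button.get("value"))  # attribute lookup, like get_attribute('value')
print(box.get("class"))     # like get_attribute('class')
print(box.tag)              # like tag_name
print(len(inputs))
```

The attribute and tag lookups mirror get_attribute and tag_name on a Selenium element, although the ElementTree method names differ.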

Selenium interaction:

Event                 Code
Click                 click()
Type text             send_keys()
Go back               browser.back()
Go forward            browser.forward()
Simulate JS scroll    JS = 'document.documentElement.scrollTop = 100000'
                      browser.execute_script(JS)  # Execute JS code
Get page source       page_source
Quit                  browser.quit()
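The interactions above can be chained into one small routine. The sketch below is a hypothetical helper, `search_and_scroll`, that expects an already created browser object; it assumes the page has a search box with id kw and a button with id su, as in the locator examples, and it uses the generic find_element(by, value) call, where the string "id" is the value of Selenium's By.ID.

```python
def search_and_scroll(browser, keyword):
    """Type a keyword, click the search button, scroll down, return the page source."""
    box = browser.find_element("id", "kw")       # "id" is the value of By.ID
    box.send_keys(keyword)                       # type into the search box
    browser.find_element("id", "su").click()     # click the search button
    # Scroll to the bottom so lazily loaded content has a chance to appear
    browser.execute_script("document.documentElement.scrollTop = 100000")
    return browser.page_source


# Usage with a real driver (requires Selenium, a browser, and its driver):
#   browser = webdriver.Chrome(driverPath)
#   browser.get('https://www.baidu.com')
#   html = search_and_scroll(browser, 'python')
#   browser.quit()
```

A real page may need a short wait after the click before the results show up in page_source; the sketch leaves that out.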

Chrome Headless:

A headless browser is a web browser without a graphical user interface (GUI), usually controlled programmatically or from the command line. One of its many uses is automated usability testing and testing of browser interactions.

Headless mode, added in Chrome 59, lets you use Chrome without opening its UI while it otherwise behaves just like regular Chrome.

System requirements:

  • Chrome version 60 or later
  • Python 3.6 or later
  • Selenium 3.4 or later
  • ChromeDriver 2.31 or later

from selenium import webdriver
from selenium.webdriver.chrome.options import Options

chrome_options = Options()
chrome_options.add_argument('--headless')
chrome_options.add_argument('--disable-gpu')

# Path to the Chrome executable (raw string keeps the backslashes literal)
path = r'C:\Program Files\Google\Chrome\Application\chrome.exe'
chrome_options.binary_location = path

browser = webdriver.Chrome(options=chrome_options)

# Destination Address
url = 'https://www.bilibili.com'
browser.get(url=url)
browser.save_screenshot('bilibili.png')

Requests:

Basic use of requests:

Tags: Python crawler

Posted on Fri, 10 Sep 2021 23:49:09 -0400 by sumeet