Crawler: URL parsing and assembly with urllib.parse

Contents

urlparse(): parse a URL into its components

Other parameters of the urlparse() method

urlunparse(): assemble a URL from its components

urlsplit(): parse a URL, keeping params in the path

urlunsplit(): assemble a URL from five components

urljoin(): merge a base URL with a new link

urlencode(): serialize a dictionary into GET request parameters

parse_qs(): deserialize a query string into a dictionary

parse_qsl(): deserialize a query string into a list of tuples

quote(): convert content to URL-encoded format

unquote(): decode URL-encoded content

The urllib library also provides the parse module, which defines the standard interface for processing URLs, such as extracting, merging, and converting the various parts of a URL. It supports URL processing for the following protocols: file, ftp, gopher, hdl, http, imap, mailto, mms, news, nntp, prospero, rsync, rtsp, rtspu, sftp, sip, sips, snews, svn, svn+ssh, telnet and wais.

urlparse(): parse a URL into its components

from urllib.parse import urlparse

result = urlparse("https://www.baidu.com/s/index.html;user?id=5#comment")
print(type(result), result)

result:

<class 'urllib.parse.ParseResult'> 
ParseResult(scheme='https', netloc='www.baidu.com', path='/s/index.html', params='user', query='id=5', fragment='comment')

You can see that the returned result is a ParseResult object, which contains six parts: scheme, netloc, path, params, query and fragment.

You can roughly see that the urlparse() method splits the URL into six parts, and that the parsing follows specific separators:

  • ://  : everything before it is the scheme, i.e. the protocol;
  • the first / after that ends the netloc, i.e. the domain name;
  • the path, i.e. the access path, follows;
  • a semicolon ; is followed by the params;
  • a question mark ? introduces the query condition, typically used in GET-type URLs;
  • a hash # introduces the fragment, which lets users jump directly to an anchor position inside the page.

In general, this gives the standard link format: scheme://netloc/path;params?query#fragment
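These six parts map onto attributes of the result, and ParseResult also provides a geturl() method that puts them back together; a quick check:

```python
from urllib.parse import urlparse

url = "https://www.baidu.com/s/index.html;user?id=5#comment"
result = urlparse(url)

# each component of scheme://netloc/path;params?query#fragment
print(result.scheme, result.netloc, result.path,
      result.params, result.query, result.fragment)

# geturl() reassembles the six parts into the original URL
print(result.geturl())  # https://www.baidu.com/s/index.html;user?id=5#comment
```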

Other parameters of the urlparse() method

def urlparse(url, scheme='', allow_fragments=True)

url: required; the URL to be parsed.
scheme: the default protocol (such as http or https). If the link carries no scheme information, this value is used as the default.

from urllib.parse import urlparse

result = urlparse(url="www.baidu.com/s/index.html;user?id=5#comment",scheme="https")
print(type(result), result)


result
<class 'urllib.parse.ParseResult'> 
ParseResult(scheme='https', netloc='', path='www.baidu.com/s/index.html', params='user', query='id=5', fragment='comment')

1. The URL provided carries no scheme information, but because the default scheme parameter was specified, the returned scheme is https. Note that netloc is empty and the whole host ended up in path: without a scheme (or a leading //), urlparse() cannot tell where the domain name ends.

from urllib.parse import urlparse

result = urlparse(url="http://www.baidu.com/s/index.html;user?id=5#comment",scheme="https")
print(type(result), result)

result:
<class 'urllib.parse.ParseResult'> 
ParseResult(scheme='http', netloc='www.baidu.com', path='/s/index.html', params='user', query='id=5', fragment='comment')

2. The scheme parameter takes effect only when the URL contains no scheme information of its own; if the URL already has a scheme, that scheme is returned.

allow_fragments: whether to parse fragments. If set to False, the fragment part is not parsed separately; it is treated as part of the path, params, or query, and the fragment field is left empty.
 

from urllib.parse import urlparse

result = urlparse(url="http://www.baidu.com/s/index.html;user?id=5#comment",allow_fragments=False)
print(type(result), result)

result:
<class 'urllib.parse.ParseResult'> 
ParseResult(scheme='http', netloc='www.baidu.com', path='/s/index.html', params='user', query='id=5#comment', fragment='')

Now look at a URL that contains neither params nor query:

from urllib.parse import urlparse

result = urlparse(url="http://www.baidu.com/s/index.html#comment",allow_fragments=False)
print(type(result), result)

result:
<class 'urllib.parse.ParseResult'> 
ParseResult(scheme='http', netloc='www.baidu.com', path='/s/index.html#comment', params='', query='', fragment='')

From the above we can see that when the URL contains neither params nor query and allow_fragments is False, the fragment is parsed as part of the path.

The returned ParseResult is actually a named tuple, so its parts can be obtained either by index or by attribute name:
 

from urllib.parse import urlparse

result = urlparse(url="http://www.baidu.com/s/index.html#comment",allow_fragments=False)
print(type(result), result[0], result.netloc, result[1], sep='\n')

result
<class 'urllib.parse.ParseResult'>
http
www.baidu.com
www.baidu.com

urlunparse(): assemble a URL from its components

Complementing urlparse(), there is its opposite method, urlunparse(). It accepts an iterable whose length must be exactly 6; otherwise it raises an error about too few or too many components:

from urllib.parse import urlunparse

data = ["http","www.baidu.com","index.html","user","a=6","comment"]
print(urlunparse(data))

result:
http://www.baidu.com/index.html;user?a=6#comment
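The length requirement is easy to verify: passing fewer (or more) than six components makes urlunparse() raise a ValueError:

```python
from urllib.parse import urlunparse

# only three components instead of the required six
try:
    urlunparse(["http", "www.baidu.com", "index.html"])
except ValueError as e:
    print("ValueError:", e)
```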

urlsplit(): parse a URL, keeping params in the path

The urlsplit() and urlparse() methods are very similar, except that urlsplit() no longer parses the params part separately and returns only five results; the params from the example above are merged into the path:

from urllib.parse import urlsplit

result = urlsplit("http://www.baidu.com/index.html;user?a=6#comment")
print(result)

result
SplitResult(scheme='http', netloc='www.baidu.com', path='/index.html;user', query='a=6', fragment='comment')

The returned SplitResult is also a tuple type; you can get values by attribute or by index:

from urllib.parse import urlsplit

result = urlsplit("http://www.baidu.com/index.html;user?a=6#comment")
print(result.scheme, result[1])

result
http www.baidu.com

urlunsplit(): assemble a URL from five components

Similar to urlunparse(), this is a method for combining the parts of a link. The parameter passed in is also an iterable, such as a list or tuple; the only difference is that its length must be exactly 5:

from urllib.parse import urlunsplit
data = ["http","www.baidu.com","index.html","a=6","comment"]
print(urlunsplit(data))

result:
http://www.baidu.com/index.html?a=6#comment

urljoin(): merge a base URL with a new link

With the urlunparse() and urlunsplit() methods we can merge links, but only if we have an object of the required length with every part of the link clearly separated.

Another way to generate links is the urljoin() method. We provide a base_url (base link) as the first parameter and the new link as the second. The method analyzes the scheme, netloc, and path of base_url, uses them to fill in whatever is missing from the new link, and returns the result.

from urllib.parse import urljoin

print(urljoin("http://www.baidu.com","FAQ.html"))
print(urljoin("http://www.baidu.com","https://cuiqingcal.com/FAQ.html"))
print(urljoin("http://www.baidu.com/about.html","https://cuiqingcal.com/FAQ.html"))
print(urljoin("http://www.baidu.com/about.html","https://cuiqingcal.com/FAQ.html?question=2"))
print(urljoin("http://www.baidu.com?wd=abc","https://cuiqingcal.com/index.hph"))
print(urljoin("http://www.baidu.com","?category=2#comment"))
print(urljoin("www.baidu.com","?category=2#comment"))
print(urljoin("www.baidu.com#comment","?category=2"))


result
http://www.baidu.com/FAQ.html
https://cuiqingcal.com/FAQ.html
https://cuiqingcal.com/FAQ.html
https://cuiqingcal.com/FAQ.html?question=2
https://cuiqingcal.com/index.hph
http://www.baidu.com?category=2#comment
www.baidu.com?category=2#comment
www.baidu.com?category=2

To sum up: base_url contributes three things: scheme, netloc, and path. If any of them is missing from the new link, it is supplemented from base_url; if the new link already has it, the new link's own part is used. The params, query, and fragment of base_url play no role.
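urljoin() also resolves relative path segments such as `..` and `.` according to the usual URL resolution rules. A couple of extra illustrative cases (the URLs here are made up for demonstration):

```python
from urllib.parse import urljoin

# ".." steps one directory up from the base path /a/b/
print(urljoin("http://www.baidu.com/a/b/c.html", "../d.html"))
# -> http://www.baidu.com/a/d.html

# "." stays in the base path's directory
print(urljoin("http://www.baidu.com/a/b/c.html", "./d.html"))
# -> http://www.baidu.com/a/b/d.html
```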

urlencode(): serialize a dictionary into GET request parameters

The urlencode() method is very useful when constructing GET request parameters:

from urllib.parse import urlencode
params = {
    "name": "germey",
    "age": "22"
}

# note the trailing "?" so the query string attaches correctly
base_url = "http://www.baidu.com?"
url = base_url + urlencode(params)
print(url)

result
http://www.baidu.com?name=germey&age=22

Here the urlencode() method serializes the dictionary into GET request parameters; the parameters are successfully converted from a dictionary into a query string.
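When a value in the dictionary is a sequence, urlencode() accepts a doseq=True argument that expands each element into its own key=value pair (the parameter names here are invented for illustration):

```python
from urllib.parse import urlencode

params = {"name": "germey", "hobby": ["reading", "coding"]}
# doseq=True turns the list into repeated hobby=... pairs
print(urlencode(params, doseq=True))
# -> name=germey&hobby=reading&hobby=coding
```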

parse_qs(): deserialize a query string into a dictionary

With serialization there must be deserialization. If we have a string of GET parameters, the parse_qs() method converts it back into a dictionary, as the following example shows:

from urllib.parse import parse_qs

query = "name=germey&age=32"
print(parse_qs(query))

result:
{'name': ['germey'], 'age': ['32']}
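Notice that every value is a list. That is because a parameter may appear more than once in a query string, in which case parse_qs() collects all of its values:

```python
from urllib.parse import parse_qs

# "hobby" appears twice, so both values land in one list
print(parse_qs("hobby=reading&hobby=coding"))
# -> {'hobby': ['reading', 'coding']}
```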

parse_qsl(): deserialize a query string into a list of tuples

The parse_qsl() method is used to convert the parameters into a list of tuples:

from urllib.parse import parse_qsl

query = "name=germey&age=32"
print(parse_qsl(query))

result:
[('name', 'germey'), ('age', '32')]

The running result is a list whose elements are tuples: the first item of each tuple is the parameter name, the second its value.
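Because the result is a list of (key, value) tuples, wrapping it in dict() yields plain string values instead of the one-element lists returned by parse_qs() (with the caveat that duplicate keys then keep only the last value):

```python
from urllib.parse import parse_qsl

query = "name=germey&age=32"
# dict() collapses the tuple list into a plain dictionary
print(dict(parse_qsl(query)))
# -> {'name': 'germey', 'age': '32'}
```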

quote(): convert content to URL-encoded format

This method converts content into URL-encoded format. When a URL contains Chinese characters as parameters, they can sometimes cause garbling; quote() converts such characters into percent-encoded form. An example:

from urllib.parse import quote
keyword = "中国"  # Chinese for "China"
url = "http://www.baidu.com/s?wd=" + quote(keyword)
print(url)

result:
http://www.baidu.com/s?wd=%E4%B8%AD%E5%9B%BD
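quote() also takes a safe parameter listing characters that should be left unencoded; by default it is "/", which is why path separators survive encoding. Widening it changes the output:

```python
from urllib.parse import quote

s = "/s?wd=中国"
print(quote(s))              # default safe="/": "?" and "=" get encoded
# -> /s%3Fwd%3D%E4%B8%AD%E5%9B%BD
print(quote(s, safe="/?="))  # also leave "?" and "=" untouched
# -> /s?wd=%E4%B8%AD%E5%9B%BD
```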

unquote(): decode URL-encoded content

Complementing quote() there is, of course, the unquote() method, which decodes URL-encoded content:
 

from urllib.parse import unquote
url = "http://www.baidu.com/s?wd=%E4%B8%AD%E5%9B%BD"
print(unquote(url))

result:
http://www.baidu.com/s?wd=中国
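A related detail: unquote() leaves "+" signs alone, while unquote_plus() also turns them back into spaces, matching quote_plus() on the encoding side:

```python
from urllib.parse import unquote, unquote_plus

encoded = "a+b%20c"
print(unquote(encoded))       # -> a+b c   ("+" kept as-is)
print(unquote_plus(encoded))  # -> a b c   ("+" decoded to a space)
```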

Tags: crawler request urllib

Posted on Sun, 28 Nov 2021 11:39:33 -0500 by DamienRoche