0x01 One-sentence solution
1. If the Python version is 2.7 or later and the URL containing the IPv6 address conforms to RFC 3986, urlparse can be used directly.
2. If the Python version is older, or the URL containing the IPv6 address does not conform to the specification, urlparse cannot parse it correctly. You need to write your own method, for example:
    import socket
    from urlparse import urlparse


    def is_ipv6(ip):
        # inet_pton raises socket.error if ip is not a valid IPv6 address
        try:
            socket.inet_pton(socket.AF_INET6, ip)
        except socket.error:
            return False
        return True


    def extract_host_from_url(url):
        host = urlparse(url).netloc
        print 'netloc = ', host
        # A bare IPv6 address is returned as-is; otherwise strip a trailing ":port"
        if not is_ipv6(host):
            last_colon_index = host.rfind(':')
            print 'last_colon_index is ', last_colon_index
            if last_colon_index == -1:
                return host
            host = host[:last_colon_index]
        print 'extract host from url is : ', host
        return host
IPv4 addresses are running out and IPv6 has arrived; many companies are doing IPv6 adaptation work or have already finished it. I recently ran into a URL-parsing problem during development that had to account for URLs containing IPv6 addresses, so here is a brief summary.
0x03 RFC3986 compliant scenario
What is RFC3986
In a URL, an IPv6 address is written inside square brackets, for example http://[2001:db8:1f70::999:de8:7648:6e8]:100/.
This is specified in RFC 3986, section 3.2.2 (Host):
A host identified by an Internet Protocol literal address, version 6 [RFC3513] or later, is distinguished by enclosing the IP literal within square brackets ("[" and "]"). This is the only place where square bracket characters are allowed in the URI syntax. In anticipation of future, as-yet-undefined IP literal address formats, an implementation may use an optional version flag to indicate such a format explicitly rather than rely on heuristic determination.
According to the RFC, a URL containing an IPv6 address must enclose the address in square brackets. When parsing, we can therefore treat the brackets as the distinguishing feature: URLs that do not meet this requirement are not parsed.
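As a rough illustration of treating the brackets as the distinguishing feature, the sketch below returns the IPv6 address only when the host is a bracketed literal and gives up otherwise. The helper name bracketed_ipv6_host is made up for this example, and the import fallback assumes the code may run on either Python 2.7 or Python 3.

```python
import socket

try:
    from urllib.parse import urlparse  # Python 3
except ImportError:
    from urlparse import urlparse      # Python 2.7


def bracketed_ipv6_host(url):
    """Return the IPv6 address if the URL host is a bracketed IPv6 literal.

    Returns None for anything else (hypothetical helper, not part of urlparse).
    """
    try:
        netloc = urlparse(url).netloc
    except ValueError:
        # urlparse itself rejects some malformed bracketed hosts
        return None
    # Drop optional userinfo ("user:pass@") in front of the host
    hostport = netloc.rpartition('@')[2]
    if not hostport.startswith('['):
        return None
    end = hostport.find(']')
    if end == -1:
        return None
    candidate = hostport[1:end]
    try:
        socket.inet_pton(socket.AF_INET6, candidate)
    except (socket.error, ValueError):
        return None
    return candidate
```

With this check, 'http://[fe80::240:63ff:fede:3c19]:8080' yields the address, while a host name or an unbracketed IPv6 address yields None and can be rejected by the caller.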
Python 2.7 added IPv6 URL parsing support to the urlparse module.
It is used as follows:
    # coding: utf-8
    from urlparse import urlparse


    def test():
        url1 = 'http://www.Python.org/doc/#'
        url2 = 'http://[fe80::240:63ff:fede:3c19]:8080'
        url3 = 'http://[2001:db8:1f70::999:de8:7648:6e8]:100/'
        url4 = 'http://[2001:db8:1f70::999:de8:7648:6e8]'
        urls = [url1, url2, url3, url4]
        for url in urls:
            up = urlparse(url)
            print up.scheme, up.netloc, up.hostname, up.port


    if __name__ == '__main__':
        test()

The output is:

    http www.Python.org www.python.org None
    http [fe80::240:63ff:fede:3c19]:8080 fe80::240:63ff:fede:3c19 8080
    http [2001:db8:1f70::999:de8:7648:6e8]:100 2001:db8:1f70::999:de8:7648:6e8 100
    http [2001:db8:1f70::999:de8:7648:6e8] 2001:db8:1f70::999:de8:7648:6e8 None
0x04 Non-RFC3986-compliant scenario
IPv6 notation has some special rules:
Leading zeros in each group can be omitted, and this can be repeated as long as the group still starts with a zero. For example, the following IPv6 addresses are all equivalent:
    2001:0DB8:02de:0000:0000:0000:0000:0e13
    2001:DB8:2de:0000:0000:0000:0000:e13
    2001:DB8:2de:000:000:000:000:e13
    2001:DB8:2de:00:00:00:00:e13
    2001:DB8:2de:0:0:0:0:e13
A double colon '::' can replace one or more consecutive groups of zeros, but it may appear only once in an address:

    2001:DB8:2de:0:0:0:0:e13
    2001:DB8:2de::e13

    2001:0DB8:0000:0000:0000:0000:1428:57ab
    2001:0DB8:0000:0000:0000::1428:57ab
    2001:0DB8:0:0:0:0:1428:57ab
    2001:0DB8:0::0:1428:57ab
    2001:0DB8::1428:57ab
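One way to check these equivalences is to convert each textual form to its 16-byte binary form with socket.inet_pton and compare the results. The sketch below does that for the addresses listed above; the helper names packed, all_equivalent and is_valid are made up for this example.

```python
import socket


def packed(addr):
    """Return the 16-byte binary form of an IPv6 address string."""
    return socket.inet_pton(socket.AF_INET6, addr)


def all_equivalent(addrs):
    """True if every textual form maps to the same binary address."""
    return len(set(packed(a) for a in addrs)) == 1


def is_valid(addr):
    """True if inet_pton accepts the string as an IPv6 address."""
    try:
        packed(addr)
    except (socket.error, ValueError):
        return False
    return True


leading_zero_forms = [
    '2001:0DB8:02de:0000:0000:0000:0000:0e13',
    '2001:DB8:2de:0000:0000:0000:0000:e13',
    '2001:DB8:2de:000:000:000:000:e13',
    '2001:DB8:2de:00:00:00:00:e13',
    '2001:DB8:2de:0:0:0:0:e13',
    '2001:DB8:2de::e13',
]

double_colon_forms = [
    '2001:0DB8:0000:0000:0000:0000:1428:57ab',
    '2001:0DB8:0000:0000:0000::1428:57ab',
    '2001:0DB8:0:0:0:0:1428:57ab',
    '2001:0DB8:0::0:1428:57ab',
    '2001:0DB8::1428:57ab',
]

print(all_equivalent(leading_zero_forms))   # True
print(all_equivalent(double_colon_forms))   # True
# Using '::' twice is rejected, because the zero runs could not be sized
print(is_valid('2001:DB8::2de::e13'))       # False
```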
This is where the problem arises. If a port is appended directly to an abbreviated IPv6 address such as 2001:0DB8::1428:57ab, the result 2001:0DB8::1428:57ab:443 is still a valid IPv6 address: the double colon can stand for four groups of zeros (the original address) or for three groups of zeros, in which case the trailing 443 is read as the last group of the address rather than as a port.
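The ambiguity can be reproduced with a few lines. This is only a sketch that reuses the is_ipv6 idea from the first section; the variable names are made up for the example.

```python
import socket


def is_ipv6(ip):
    try:
        socket.inet_pton(socket.AF_INET6, ip)
    except socket.error:
        return False
    return True


abbreviated = '2001:0DB8::1428:57ab'
with_port = abbreviated + ':443'

# The string with ":443" appended is still accepted as an IPv6 address
print(is_ipv6(with_port))  # True

# It is parsed as a different address: '::' now covers three zero groups
# and 443 becomes the last group
same_as_three_zero_groups = (
    socket.inet_pton(socket.AF_INET6, with_port)
    == socket.inet_pton(socket.AF_INET6, '2001:0DB8:0:0:0:1428:57ab:443')
)
print(same_as_three_zero_groups)  # True

# Without the abbreviation there are already eight groups, so the extra
# ":443" makes the string invalid and the port can be split off safely
full = '2001:0DB8:0:0:0:0:1428:57ab'
print(is_ipv6(full + ':443'))  # False
```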
1. If a URL containing an IPv6 address does not follow the RFC, appending ":port" makes abbreviated IPv6 addresses ambiguous.
2. The solution is either to change the URL to the bracketed form required by the RFC, or to avoid abbreviated IPv6 addresses and then extract the host with the code in section 0x01.
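Both options can be combined in one helper that prefers the bracketed form and falls back to splitting on the last colon. The following is a minimal sketch under that assumption; split_host_port is a hypothetical name, it takes a "host[:port]" string rather than a full URL, and it cannot resolve the ambiguous abbreviated case, which it returns unchanged as a host.

```python
import socket


def is_ipv6(ip):
    try:
        socket.inet_pton(socket.AF_INET6, ip)
    except (socket.error, ValueError):
        return False
    return True


def split_host_port(hostport):
    """Split 'host', 'host:port', '[v6]', '[v6]:port' or a bare IPv6 address.

    Returns (host, port) where port is a string or None.
    """
    if hostport.startswith('['):
        # RFC 3986 form: everything up to ']' is the address
        end = hostport.find(']')
        if end == -1:
            raise ValueError('missing closing bracket: %r' % hostport)
        host = hostport[1:end]
        rest = hostport[end + 1:]
        if rest == '':
            return host, None
        if rest.startswith(':'):
            return host, rest[1:] or None
        raise ValueError('unexpected text after bracket: %r' % hostport)
    if is_ipv6(hostport):
        # Bare IPv6 address; an appended ":port" on an abbreviated
        # address also lands here and cannot be told apart
        return hostport, None
    host, sep, port = hostport.rpartition(':')
    if not sep:
        return hostport, None
    return host, port
```

For example, split_host_port('[2001:db8::1]:443') gives ('2001:db8::1', '443'), and the unabbreviated '2001:DB8:2de:0:0:0:0:e13:443' is split at the last colon because nine groups are not a valid IPv6 address.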