Quickly extracting the host from a URL that may contain an IPv6 address

0x01 solution in brief

1. If the Python version is 2.7 or later and the URL containing the IPv6 address conforms to RFC 3986, urlparse can be used directly.

2. If the Python version is older, or the URL containing the IPv6 address does not conform to the specification, urlparse alone cannot extract the host. A custom function is needed, for example:

import socket
from urlparse import urlparse  # Python 2 module name


def is_ipv6(ip):
    # inet_pton raises socket.error when ip is not a valid IPv6 address.
    try:
        socket.inet_pton(socket.AF_INET6, ip)
    except socket.error:
        return False
    return True


def extract_host_from_url(url):
    host = urlparse(url).netloc
    print 'netloc = ', host
    # A bare IPv6 address carries no port, so it is returned unchanged.
    if not is_ipv6(host):
        # Otherwise everything after the last colon is treated as the port.
        last_colon_index = host.rfind(':')
        print 'last_colon_index is ', last_colon_index
        if last_colon_index == -1:
            return host
        host = host[:last_colon_index]
    print 'extract host from url is : ', host
    return host
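
To show what the helper above returns for typical inputs, here is a minimal sketch of the same logic that also runs on Python 3. The name extract_host and the sample URLs are hypothetical, and the debug prints are dropped; it is an illustration rather than a replacement for the code above.

```python
from __future__ import print_function

import socket

try:
    from urllib.parse import urlparse  # Python 3
except ImportError:
    from urlparse import urlparse      # Python 2


def is_ipv6(ip):
    """Return True if ip is a bare (unbracketed) IPv6 address."""
    try:
        socket.inet_pton(socket.AF_INET6, ip)
    except (socket.error, ValueError):
        return False
    return True


def extract_host(url):
    """Hypothetical variant of extract_host_from_url without debug output."""
    netloc = urlparse(url).netloc
    if is_ipv6(netloc):
        return netloc
    # rpartition yields an empty separator when there is no colon at all.
    head, sep, _ = netloc.rpartition(':')
    return head if sep else netloc


samples = [
    'http://example.com/index.html',
    'http://example.com:8080/index.html',
    'http://2001:0db8:0:0:0:0:1428:57ab/',
    'http://2001:0db8:0:0:0:0:1428:57ab:443/',
]
for url in samples:
    print(url, '->', extract_host(url))
```

The last sample only works because the address is written with all eight groups: nine colon-separated groups cannot be an IPv6 address, so the trailing group is stripped as the port.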

  

0x02 background

IPv4 addresses are nearly exhausted and IPv6 has arrived, so many companies are adapting to IPv6 or have already done so. I recently ran into a URL parsing problem in development that had to account for URLs containing IPv6 addresses, so I have summarized what I found below.

0x03 RFC3986 compliant scenario

What is RFC3986

To put an IPv6 address in a URL, the address is enclosed in square brackets:

http://[2001:db8:1f70::999:de8:7648:6e8]:100/

This is specified in RFC 3986, section 3.2.2 (Host):

A host identified by an Internet Protocol literal address, version 6 [RFC3513] or later, is distinguished by enclosing the IP literal within square brackets ("[" and "]"). This is the only place where square bracket characters are allowed in the URI syntax. In anticipation of future, as-yet-undefined IP literal address formats, an implementation may use an optional version flag to indicate such a format explicitly rather than rely on heuristic determination.

According to the RFC, an IPv6 address in a URL must be enclosed in square brackets. A parser can therefore rely on the brackets as the distinguishing feature, and refuse to parse a URL that does not follow this rule.
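
One way to apply this rule before parsing is sketched below. The function name has_unbracketed_ipv6 and the "more than one colon" heuristic are assumptions made for illustration; they are not part of urlparse or of the RFC.

```python
try:
    from urllib.parse import urlparse  # Python 3
except ImportError:
    from urlparse import urlparse      # Python 2


def has_unbracketed_ipv6(url):
    """Hypothetical check: True if the host looks like IPv6 without brackets."""
    netloc = urlparse(url).netloc
    hostport = netloc.rpartition('@')[2]  # ignore any "user:password@" prefix
    # A hostname or IPv4 address has at most one colon (before the port),
    # so two or more colons outside brackets suggest a bare IPv6 address.
    return not hostport.startswith('[') and hostport.count(':') > 1


print(has_unbracketed_ipv6('http://[2001:db8::1]:8080/'))  # False
print(has_unbracketed_ipv6('http://2001:db8::1:8080/'))    # True
print(has_unbracketed_ipv6('http://example.com:8080/'))    # False
```

A caller could reject, or route to custom handling, any URL for which this check returns True.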

Implementation

The urlparse module gained support for IPv6 parsing in Python 2.7. Its source is here:

https://github.com/enthought/Python-2.7.3/blob/master/Lib/urlparse.py

Example usage:

# coding: utf-8

from urlparse import urlparse


def test():
    url1 = 'http://www.Python.org/doc/#'
    url2 = 'http://[fe80::240:63ff:fede:3c19]:8080'
    url3 = 'http://[2001:db8:1f70::999:de8:7648:6e8]:100/'
    url4 = 'http://[2001:db8:1f70::999:de8:7648:6e8]'
    urls = [url1, url2, url3, url4]
    for url in urls:
        up = urlparse(url)
        print up.scheme, up.netloc, up.hostname, up.port


if __name__ == '__main__':
    test()

Output

http www.Python.org www.python.org None
http [fe80::240:63ff:fede:3c19]:8080 fe80::240:63ff:fede:3c19 8080
http [2001:db8:1f70::999:de8:7648:6e8]:100 2001:db8:1f70::999:de8:7648:6e8 100
http [2001:db8:1f70::999:de8:7648:6e8] 2001:db8:1f70::999:de8:7648:6e8 None

0x04 RFC3986 non-compliant scenario

IPv6 notation allows two kinds of abbreviation:

  1. Leading zeros in each group can be omitted, and this can be repeated as long as a leading zero remains. For example, the following IPv6 addresses are all equivalent:

    2001:0DB8:02de:0000:0000:0000:0000:0e13
    2001:DB8:2de:0000:0000:0000:0000:e13
    2001:DB8:2de:000:000:000:000:e13
    2001:DB8:2de:00:00:00:00:e13
    2001:DB8:2de:0:0:0:0:e13
    
  2. A double colon '::' can replace one or more consecutive groups of zeros, but only once per address. Below, the first two lines are equivalent to each other, and so are the remaining five:

    2001:DB8:2de:0:0:0:0:e13
    2001:DB8:2de::e13
    2001:0DB8:0000:0000:0000:0000:1428:57ab
    2001:0DB8:0000:0000:0000::1428:57ab
    2001:0DB8:0:0:0:0:1428:57ab
    2001:0DB8:0::0:1428:57ab
    2001:0DB8::1428:57ab
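
The equivalences listed above can be checked by converting each spelling to its 16-byte binary form, which is the same for every notation of one address. This is a small sketch using socket.inet_pton, assuming a platform where AF_INET6 is available; the helper names packed and all_equivalent are made up for the example.

```python
import socket

group_1 = [
    '2001:0DB8:02de:0000:0000:0000:0000:0e13',
    '2001:DB8:2de:0000:0000:0000:0000:e13',
    '2001:DB8:2de:000:000:000:000:e13',
    '2001:DB8:2de:00:00:00:00:e13',
    '2001:DB8:2de:0:0:0:0:e13',
    '2001:DB8:2de::e13',
]
group_2 = [
    '2001:0DB8:0000:0000:0000:0000:1428:57ab',
    '2001:0DB8:0000:0000:0000::1428:57ab',
    '2001:0DB8:0:0:0:0:1428:57ab',
    '2001:0DB8:0::0:1428:57ab',
    '2001:0DB8::1428:57ab',
]


def packed(addr):
    """Return the 16-byte binary form of an IPv6 address string."""
    return socket.inet_pton(socket.AF_INET6, addr)


def all_equivalent(addresses):
    """True if every string in addresses denotes the same IPv6 address."""
    return len(set(packed(a) for a in addresses)) == 1


print(all_equivalent(group_1))  # True
print(all_equivalent(group_2))  # True
```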
    

This is where the problem arises. If a port is appended to an abbreviated address such as 2001:0DB8::1428:57ab, giving 2001:0DB8::1428:57ab:443, the result is still a valid IPv6 address: the double colon can stand for four groups of zeros or for three, so the trailing 443 is read as the last group of the address rather than as a port.
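
The ambiguity can be made visible by expanding both strings into their eight groups. The sketch below assumes socket.inet_pton with AF_INET6 is available on the platform; the helper name groups is hypothetical.

```python
import binascii
import socket


def groups(addr):
    """Return the eight 16-bit groups of an IPv6 address as 4-digit hex strings."""
    raw = socket.inet_pton(socket.AF_INET6, addr)
    hexed = binascii.hexlify(raw).decode('ascii')
    return [hexed[i:i + 4] for i in range(0, 32, 4)]


# Without a port, '::' stands for four groups of zeros.
print(groups('2001:0DB8::1428:57ab'))
# ['2001', '0db8', '0000', '0000', '0000', '0000', '1428', '57ab']

# With ':443' appended, '::' shrinks to three groups and 443 becomes the last group.
print(groups('2001:0DB8::1428:57ab:443'))
# ['2001', '0db8', '0000', '0000', '0000', '1428', '57ab', '0443']

# The unabbreviated form plus a port has nine groups, so it is rejected as an address.
try:
    groups('2001:0DB8:0:0:0:0:1428:57ab:443')
except (socket.error, ValueError):
    print('not a valid IPv6 address')
```

Only the unabbreviated case fails validation, which is why stripping the text after the last colon is safe there and unsafe for abbreviated addresses.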

0x05 summary

1. If a URL containing an IPv6 address does not follow the RFC, appending ":port" makes an abbreviated IPv6 address ambiguous.

2. The fix is either to rewrite the URL so that it conforms to the RFC (address in square brackets), or to avoid IPv6 abbreviations, in which case the host can be extracted with the code in section 0x01.
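
For the first option, the side that builds the URL can add the brackets itself. The following is a minimal sketch under the assumption that host and port are still available as separate values; to_netloc is a hypothetical helper, not a standard library function.

```python
from __future__ import print_function

import socket

try:
    from urllib.parse import urlparse  # Python 3
except ImportError:
    from urlparse import urlparse      # Python 2


def is_ipv6(ip):
    """Return True if ip is a bare (unbracketed) IPv6 address."""
    try:
        socket.inet_pton(socket.AF_INET6, ip)
    except (socket.error, ValueError):
        return False
    return True


def to_netloc(host, port=None):
    """Hypothetical helper: join host and optional port in RFC 3986 form."""
    if is_ipv6(host):
        host = '[%s]' % host  # IPv6 literals must be bracketed
    if port is None:
        return host
    return '%s:%d' % (host, port)


url = 'http://' + to_netloc('2001:db8::1428:57ab', 443) + '/'
parsed = urlparse(url)
print(url)                           # http://[2001:db8::1428:57ab]:443/
print(parsed.hostname, parsed.port)  # 2001:db8::1428:57ab 443
```

Once the brackets are in place, urlparse separates host and port without any ambiguity, even for abbreviated addresses.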

0x06 reference

https://zh.wikipedia.org/wiki/IPv6

Posted on Mon, 23 Mar 2020 06:52:43 -0400 by VLE79E