Python learning notes, hands-on: how to send network requests with the native urllib

Tutorial by example

urllib.urlopen(url[,data[,proxies]]) : https://docs.python.org/2/library/urllib.html

Python's standard libraries for network requests are the urllib family: in Python 2 that means urllib and urllib2, which complement each other in most cases. (The similarly named urllib3 is a third-party package, not part of the standard library.)

Of course, there are many excellent third-party libraries for sending network requests, the best known of which is requests. This article, however, starts from the most basic urllib to explain how to send network requests. Follow along with Snow dream technology station and practice together!

"Snow dream technology station" reminds you: due to the timeliness of the article, it is likely that some links are invalid when the readers are reading it. Please practice and verify it in person according to the relevant instructions of the article. It's better to believe that a book is not as good as a book. Don't copy and paste the code directly. Do it yourself!

Article directory

Environment building

The Python environment used in this article is a virtual environment created with virtualenv, purely for the convenience of isolating different environments and better simulating a real user environment

In actual development, you can follow this article step by step to build the development environment, or simply use the system's default environment

Environment demonstration

Information about the demo environment is as follows:

(.env)$ python --version
Python 2.7.16
(.env) $ pip --version
pip 19.3.1 from ~/python/src/url/urllib/.env/lib/python2.7/site-packages/pip (python 2.7)

The following code runs normally in this environment; other environments are not guaranteed to reproduce the demonstrated results, so always go by your actual output

Environment building

If you don't need a virtual environment, you can skip the environment setup and use the default environment directly. Just make sure the Python version used for testing is Python 2, not Python 3!

  • Step 1. Install virtual environment virtualenv
sudo pip install virtualenv

Installing virtualenv makes it easy to isolate different Python environments, but you can also use the system default environment, so this step is optional. The same goes for the following steps

  • Step 2. Prepare the virtual environment directory.env
virtualenv .env

The virtual environment directory is made hidden to guard against accidental modification; of course, it can also be a normal, visible directory

  • Step 3. Activate virtual environment.env
source .env/bin/activate

Once the virtual environment directory is ready, activate it. This step can be repeated safely without error!

  • Step 4. View the python and pip versions currently in use
(.env) $ which python
~/python/src/url/urllib/.env/bin/python
(.env) $ which pip
~/python/src/url/urllib/.env/bin/pip

After the virtual environment is activated, python and pip resolve to the copies under the current directory's .env rather than the system default environment. If the virtual environment is not activated, which shows the system paths instead

The network request library urllib

If readers find during their test runs that the network request fails, they can replace http://httpbin.snowdreams1006.cn/ with http://httpbin.org/ or build a local test environment themselves

Here are two ways to install a local test environment; you can also use an online environment such as http://httpbin.snowdreams1006.cn/ or http://httpbin.org/

  • docker installation mode

docker run -p 8000:80 kennethreitz/httpbin

On the first run the image is downloaded locally before the container starts; on subsequent runs the container starts directly. Access address: http://127.0.0.1:8000/

  • python installation mode

pip install gunicorn httpbin && gunicorn httpbin:app

The default listening port is 8000. If a port conflict says the port is already occupied, run gunicorn httpbin:app -b :9898 to specify another port

How to send the simplest network request

urllib2.urlopen(url): sends the simplest network request and directly returns the response text data

Create a new Python file called urllib_demo.py. The core code first imports the urllib2 package, then uses urllib2.urlopen() to send the simplest GET request, and finally uses response.read() to read the whole response body at once

The code content is as follows:

# -*- coding: utf-8 -*-
import urllib
import urllib2

def use_simple_urllib2():
    '''
    //Get requester information
    '''
    response = urllib2.urlopen('http://httpbin.snowdreams1006.cn/get')
    print response.read()

if __name__ == '__main__':
    print '>>>Use simple urllib2:'
    use_simple_urllib2()

If the file is named urllib_demo.py, run python urllib_demo.py from the terminal command line to view the output

How to know which properties and methods are available

print type(response): get the object type. With the basic type, you can roughly guess which methods and attributes are available for external calls

print dir(response): enumerates the object's attribute and method names, so you can guess what is available even without documentation

The simplest network request can be sent with urllib2.urlopen(url). Whether it is a GET or a POST request, getting the response body afterwards is undoubtedly the most important part, but there are other methods and properties that should not be ignored in actual development

So besides mastering response.read() for reading the whole response body at once, you need to know what other attributes and methods the response object offers

By getting the object type with type(response) and then enumerating its attributes with dir(response), we can roughly guess which attributes and methods the object exposes, even without documentation
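Before running the urllib2 version, the same introspection can be tried offline on any file-like object; in this sketch io.BytesIO merely stands in for the response (an assumption purely for illustration), since type() and dir() work the same way on every object:

```python
import io

# io.BytesIO stands in for the urlopen() response here (an assumption for
# offline illustration); type()/dir() introspection works the same way.
response = io.BytesIO(b'{"args": {}}')

print(type(response))
names = dir(response)
# Filter out the dunder names to see the usable surface more clearly
print([name for name in names if not name.startswith('_')])
```

The filtered list already contains read, readline and readlines, the same reading methods discussed below for the real response object.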

# -*- coding: utf-8 -*-
import urllib2

def use_simple_urllib2():
    '''
    //Get requester information
    '''
    response = urllib2.urlopen('http://httpbin.snowdreams1006.cn/get')
    print type(response)
    print dir(response)

if __name__ == '__main__':
    print '>>>Use simple urllib2:'
    use_simple_urllib2()

The following is the output of print type(response) and print dir(response). Next we will pick out the common attributes and methods and walk through them one by one

# print type(response)
<type 'instance'>

# print dir(response)
['__doc__', '__init__', '__iter__', '__module__', '__repr__', 'close', 'code', 'fileno', 'fp', 'getcode', 'geturl', 'headers', 'info', 'msg', 'next', 'read', 'readline', 'readlines', 'url']

  • Status code (property) of the response object

response.code: get the status code of the response object. Normally, 200 means the request is successful, and 500 is a typical system error

From the attribute enumeration given by dir(response), it is easy to see that response.code relates to the status code, but not whether to call it as response.code or response.code(); running print type(response.code) lets you infer which form is correct

# -*- coding: utf-8 -*-
import urllib2

def use_simple_urllib2():
    '''
    //Get requester information
    '''
    response = urllib2.urlopen('http://httpbin.snowdreams1006.cn/get')
    print type(response.read)
    print type(response.code)

if __name__ == '__main__':
    print '>>>Use simple urllib2:'
    use_simple_urllib2()

Here, print type(response.read) serves as a comparison for telling attributes from methods: response.read is a method type (<type 'instancemethod'>), while response.code is a plain <type 'int'>, so the status code is read by attribute access, response.code
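Another way to tell an attribute from a method is the built-in callable(); the stand-in class below is hypothetical, just mirroring the response's code attribute and getcode() method for offline illustration:

```python
class FakeResponse(object):
    # Hypothetical stand-in for the urlopen() response: 'code' is a plain
    # attribute, 'getcode' is a method, mirroring the real object.
    code = 200

    def getcode(self):
        return self.code

resp = FakeResponse()
print(callable(resp.code))     # False -> attribute access: resp.code
print(callable(resp.getcode))  # True  -> method call: resp.getcode()
```

callable() answers the same question as comparing type() outputs, just more directly.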

The detailed code is as follows:

# -*- coding: utf-8 -*-
import urllib2

def use_simple_urllib2():
    '''
    //Get requester information
    '''
    response = urllib2.urlopen('http://httpbin.snowdreams1006.cn/get')
    print response.code

if __name__ == '__main__':
    print '>>>Use simple urllib2:'
    use_simple_urllib2()

  • Response object's status code (method)

response.getcode(): get the status code of the response object. Normally, 200 means the request is successful, and 500 is a typical system error

Similarly, print dir(response) tells us a getcode field exists, but not whether it is an attribute or a method. Running print type(response.getcode) prints <type 'instancemethod'>, so it is confirmed to be a method call

The detail code is as follows:

# -*- coding: utf-8 -*-
import urllib2

def use_simple_urllib2():
    '''
    //Get requester information
    '''
    response = urllib2.urlopen('http://httpbin.snowdreams1006.cn/get')
    print response.getcode()

if __name__ == '__main__':
    print '>>>Use simple urllib2:'
    use_simple_urllib2()

  • Status code information (property) of the response object

response.msg: get the status description information of the response object, for example, the status code 200 is OK, and the status code 500 is INTERNAL SERVER ERROR

# -*- coding: utf-8 -*-
import urllib2

def use_simple_urllib2():
    '''
    //Get response status code
    '''
    response = urllib2.urlopen('http://httpbin.snowdreams1006.cn/status/200')
    print response.code
    print response.msg

    response = urllib2.urlopen('http://httpbin.snowdreams1006.cn/status/500')
    print response.code
    print response.msg

if __name__ == '__main__':
    print '>>>Use simple urllib2:'
    use_simple_urllib2()

A normal request returns 200 OK, while a failing one is likely 500 INTERNAL SERVER ERROR. Note that urllib2.urlopen() raises an exception for error status codes, so once one occurs an error is reported and the program stops running

  • Access link (property) of the response object

response.url: get the request link

# -*- coding: utf-8 -*-
import urllib2

def use_simple_urllib2():
    '''
    //Get requester information
    '''
    response = urllib2.urlopen('http://httpbin.snowdreams1006.cn/get')
    print response.url

if __name__ == '__main__':
    print '>>>Use simple urllib2:'
    use_simple_urllib2()

  • Access link (method) of response object

response.geturl(): get request link

# -*- coding: utf-8 -*-
import urllib2

def use_simple_urllib2():
    '''
    //Get requester information
    '''
    response = urllib2.urlopen('http://httpbin.snowdreams1006.cn/get')
    print response.geturl()

if __name__ == '__main__':
    print '>>>Use simple urllib2:'
    use_simple_urllib2()

  • Response header information of the response object (property)

response.headers.dict: get the response header information as a dictionary

In some cases a request must carry specific headers, so it helps to know what headers are in play when none are set explicitly. As before, you can use print type(response.headers) in combination with print dir(response.headers) to explore the callable attributes and methods yourself

# -*- coding: utf-8 -*-
import urllib2

def use_simple_urllib2():
    '''
    //Get requester information
    '''
    response = urllib2.urlopen('http://httpbin.snowdreams1006.cn/get')
    print response.headers.dict

if __name__ == '__main__':
    print '>>>Use simple urllib2:'
    use_simple_urllib2()

  • Response header information of the response object (method)

response.info(): get the response header information, displayed line by line

It is similar to response.headers.dict above for obtaining header information, except that response.info() is suited to visual display rather than programmatic use

# -*- coding: utf-8 -*-
import urllib2

def use_simple_urllib2():
    '''
    //Get requester information
    '''
    response = urllib2.urlopen('http://httpbin.snowdreams1006.cn/get')
    print response.info()

if __name__ == '__main__':
    print '>>>Use simple urllib2:'
    use_simple_urllib2()

  • Response object's response body (method)

response.read(): reads the whole response body at once; suitable for small response bodies, where it is convenient to load all the data into memory in one go

# -*- coding: utf-8 -*-
import urllib2

def use_simple_urllib2():
    '''
    //Get response body information
    '''
    response = urllib2.urlopen('http://httpbin.snowdreams1006.cn/get')
    print response.read()

if __name__ == '__main__':
    print '>>>Use simple urllib2:'
    use_simple_urllib2()

response.read() returns a string, so it is convenient to assign it to a variable for further processing, e.g. result = response.read():

# -*- coding: utf-8 -*-
import urllib2

def use_simple_urllib2():
    '''
    //Get response body information
    '''
    response = urllib2.urlopen('http://httpbin.snowdreams1006.cn/get')
    result = response.read()
    print result

if __name__ == '__main__':
    print '>>>Use simple urllib2:'
    use_simple_urllib2()

  • Response object's response body (method)

response.readline(): reads the response body line by line; suitable for large bodies, reading in a loop until there is no data left

# -*- coding: utf-8 -*-
import urllib2

def use_simple_urllib2():
    '''
    //Get response body information
    '''
    response = urllib2.urlopen('http://httpbin.snowdreams1006.cn/get')
    line = response.readline()
    while line:
        print line
        line = response.readline()

if __name__ == '__main__':
    print '>>>Use simple urllib2:'
    use_simple_urllib2()

response.readline() only reads one line at a time, so manual concatenation is required to obtain the complete response body, for example:

# -*- coding: utf-8 -*-
import urllib2

def use_simple_urllib2():
    '''
    //Get response body information
    '''
    response = urllib2.urlopen('http://httpbin.snowdreams1006.cn/get')
    result = ''
    line = response.readline()
    result = result + str(line)
    while line:
        line = response.readline()
        result = result + str(line)
    print result

if __name__ == '__main__':
    print '>>>Use simple urllib2:'
    use_simple_urllib2()

The str(line) is there to guarantee the pieces are strings before concatenation; strictly speaking it is unnecessary, since response.readline() already returns a string

  • Response object's response body (method)

response.readlines(): reads the whole response body into a list of lines, suitable for cases that need line-by-line processing

# -*- coding: utf-8 -*-
import urllib2

def use_simple_urllib2():
    '''
    //Get response body information
    '''
    response = urllib2.urlopen('http://httpbin.snowdreams1006.cn/get')
    for line in response.readlines():
        print line

if __name__ == '__main__':
    print '>>>Use simple urllib2:'
    use_simple_urllib2()

Similarly, if you need the complete response body from response.readlines(), you can concatenate as follows, for example:

# -*- coding: utf-8 -*-
import urllib2

def use_simple_urllib2():
    '''
    //Get response body information
    '''
    response = urllib2.urlopen('http://httpbin.snowdreams1006.cn/get')
    result = ''
    for line in response.readlines():
        result = result + str(line)
    print result

if __name__ == '__main__':
    print '>>>Use simple urllib2:'
    use_simple_urllib2()

The multiple lines of code above can be reduced to a single line: result = ''.join([line for line in response.readlines()])
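The equivalence of read(), the readline() loop, and the readlines() join can be checked offline, with io.StringIO standing in for the response (an assumption purely for illustration, since all three are generic file-object methods):

```python
import io

body = 'line one\nline two\nline three\n'

# read() in one go
assert io.StringIO(body).read() == body

# readline() in a loop, concatenating as we go;
# readline() returns '' at end of input, which ends the loop
response = io.StringIO(body)
result = ''
line = response.readline()
while line:
    result = result + line
    line = response.readline()
assert result == body

# readlines() joined back together in one line
joined = ''.join(io.StringIO(body).readlines())
assert joined == body
```

All three produce the same string, so the choice is purely about memory use and convenience.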

How to send a normal GET request

  • Send directly without parameters

urllib2.urlopen(url): only the target URL is needed to send a GET request

The simplest request method is GET. With no other parameters to set, you only need to fill in the URL to send the request, e.g. urllib2.urlopen('http://httpbin.snowdreams1006.cn/get'). The example code is as follows:

# -*- coding: utf-8 -*-
import urllib
import urllib2

def use_simple_urllib2():
    '''
    //Get response header and response body information
    '''
    response = urllib2.urlopen('http://httpbin.snowdreams1006.cn/get')
    print('>>>Response Headers:')
    print(response.info())
    print('>>>Response Body:')
    print(response.read())

if __name__ == '__main__':
    print '>>>Use simple urllib2<<<'
    use_simple_urllib2()

If the above code file is named urllib_demo.py, run python urllib_demo.py from the terminal command line, and the output is as follows:

(.env) $ python urllib_demo.py 
>>>Use simple urllib2<<<
>>>Response Headers:
Server: nginx/1.17.6
Date: Thu, 16 Jan 2020 13:38:27 GMT
Content-Type: application/json
Content-Length: 263
Connection: close
Access-Control-Allow-Origin: *
Access-Control-Allow-Credentials: true

>>>Response Body:
{
  "args": {}, 
  "headers": {
    "Accept-Encoding": "identity", 
    "Connection": "close", 
    "Host": "httpbin.snowdreams1006.cn", 
    "User-Agent": "Python-urllib/2.7"
  }, 
  "origin": "218.205.55.192", 
  "url": "http://httpbin.snowdreams1006.cn/get"
}

The response header Connection: close indicates the connection is closed automatically, and the empty args dictionary in the response body shows that no query parameters were sent
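Since httpbin replies with JSON, the body string can also be parsed programmatically with json.loads to inspect args; the body below is a trimmed-down version of the output above:

```python
import json

# Trimmed-down response body, as returned by /get with no query parameters
body = '''{
  "args": {},
  "headers": {"Host": "httpbin.snowdreams1006.cn"},
  "url": "http://httpbin.snowdreams1006.cn/get"
}'''

data = json.loads(body)
print(data['args'])   # {} -> no query parameters were sent
```

Once parsed into a dictionary, individual fields such as args or url can be read directly instead of eyeballing the raw text.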

  • Transcoding with parameters

In actual development, few GET requests carry no parameters at all. Native urllib also supports GET requests with query parameters; the simplest way is to splice the query parameters onto the target URL to get a URL with query parameters

# -*- coding: utf-8 -*-
import urllib
import urllib2

def use_params_urllib2():
    '''
    //Get response header and response body information
    '''
    response = urllib2.urlopen('http://httpbin.snowdreams1006.cn/get?param1=hello&param2=world')
    print('>>>Response Headers:')
    print(response.info())
    print('>>>Response Body:')
    print(response.read())

if __name__ == '__main__':
    print '>>>Use params urllib2<<<'
    use_params_urllib2()

Similarly, if the above code file is named urllib_demo.py, run python urllib_demo.py from the terminal command line, and the output is as follows:

(.env) $ python urllib_demo.py 
>>>Use params urllib2<<<
>>>Response Headers:
Server: nginx/1.17.6
Date: Thu, 16 Jan 2020 13:59:23 GMT
Content-Type: application/json
Content-Length: 338
Connection: close
Access-Control-Allow-Origin: *
Access-Control-Allow-Credentials: true

>>>Response Body:
{
  "args": {
    "param1": "hello", 
    "param2": "world"
  }, 
  "headers": {
    "Accept-Encoding": "identity", 
    "Connection": "close", 
    "Host": "httpbin.snowdreams1006.cn", 
    "User-Agent": "Python-urllib/2.7"
  }, 
  "origin": "218.205.55.192", 
  "url": "http://httpbin.snowdreams1006.cn/get?param1=hello&param2=world"
}

The response header Connection: close indicates the connection is closed automatically, and args is no longer an empty dictionary but holds exactly the query parameters just passed, showing the server did receive them, so this approach works too

If there are many query parameters, splicing a new URL by hand becomes very cumbersome, and the param1=hello&param2=world format must be followed exactly, so let the program do this tedious splicing!

# -*- coding: utf-8 -*-
import urllib
import urllib2

def use_params_urllib2():
    '''
    //Get response header and response body information
    '''
    response = urllib2.urlopen('http://httpbin.snowdreams1006.cn/get?param1=hello&param2=world&author=snowdreams1006&website=http://blog.snowdreams1006.cn&url=snowdreams1006.github.io/learn-python/url/urllib/teaching.html&wechat=snowdreams1006&email=snowdreams1006@163.com&github=https://github.com/snowdreams1006/')
    print('>>>Response Headers:')
    print(response.info())
    print('>>>Response Body:')
    print(response.read())

if __name__ == '__main__':
    print '>>>Use params urllib2<<<'
    use_params_urllib2()

The tedium above is not only that a hand-spliced URL grows long and error-prone, but also that dynamically replacing query parameters is awkward. An automatic query-parameter splicing function is therefore very welcome!

params = urllib.urlencode({
    'param1': 'hello', 
    'param2': 'world',
    'author':'snowdreams1006',
    'website':'http://blog.snowdreams1006.cn',
    'url':'https://snowdreams1006.github.io/learn-python/url/urllib/teaching.html',
    'wechat':'snowdreams1006',
    'email':'snowdreams1006@163.com',
    'github':'https://github.com/snowdreams1006/'
})
print params

urllib.urlencode() transcodes a dictionary of query parameters and joins them with &; the result can then be spliced onto the request URL manually, as url + '?' + params, to get the URL with parameters
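A minimal sketch of this splicing, written so it runs under both Python 2 and Python 3 (in Python 3 urlencode moved to urllib.parse, hence the guarded import):

```python
# urlencode lives in urllib in Python 2 and in urllib.parse in Python 3;
# the try/except import keeps this sketch runnable in both.
try:
    from urllib import urlencode        # Python 2
except ImportError:
    from urllib.parse import urlencode  # Python 3

# Dictionary of query parameters -> '&'-joined, percent-encoded string
params = urlencode({'param1': 'hello', 'param2': 'world'})

# Manually splice the encoded parameters onto the request URL
url = 'http://httpbin.snowdreams1006.cn/get?%s' % params
print(url)
```

urlencode also percent-escapes special characters, which hand-splicing would get wrong for values containing &, = or spaces.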

# -*- coding: utf-8 -*-
import urllib
import urllib2

def use_params_urllib2():
    params = urllib.urlencode({
        'param1': 'hello', 
        'param2': 'world',
        'author':'snowdreams1006',
        'website':'http://blog.snowdreams1006.cn',
        'url':'https://snowdreams1006.github.io/learn-python/url/urllib/teaching.html',
        'wechat':'snowdreams1006',
        'email':'snowdreams1006@163.com',
        'github':'https://github.com/snowdreams1006/'
    })
    response = urllib2.urlopen('http://httpbin.snowdreams1006.cn/get?%s' % params)
    print('>>>Response Headers:')
    print(response.info())
    print('>>>Response Body:')
    print(response.read())

if __name__ == '__main__':
    print '>>>Use params urllib2<<<'
    use_params_urllib2()

If the above code file is named urllib_demo.py, run python urllib_demo.py from the terminal command line, and the output is as follows:

$ python urllib_demo.py 
>>>Use params urllib2<<<
>>>Response Headers:
Server: nginx/1.17.6
Date: Thu, 16 Jan 2020 14:27:21 GMT
Content-Type: application/json
Content-Length: 892
Connection: close
Access-Control-Allow-Origin: *
Access-Control-Allow-Credentials: true

>>>Response Body:
{
  "args": {
    "author": "snowdreams1006", 
    "email": "snowdreams1006@163.com", 
    "github": "https://github.com/snowdreams1006/", 
    "param1": "hello", 
    "param2": "world", 
    "url": "https://snowdreams1006.github.io/learn-python/url/urllib/teaching.html",
    "website": "http://blog.snowdreams1006.cn", 
    "wechat": "snowdreams1006"
  }, 
  "headers": {
    "Accept-Encoding": "identity", 
    "Connection": "close", 
    "Host": "httpbin.snowdreams1006.cn", 
    "User-Agent": "Python-urllib/2.7"
  }, 
  "origin": "218.205.55.192", 
  "url": "http://httpbin.snowdreams1006.cn/get?website=http://blog.snowdreams1006.cn&github=https://github.com/snowdreams1006/&wechat=snowdreams1006&param2=world&param1=hello&author=snowdreams1006&url=https://snowdreams1006.github.io/learn-python/url/urllib/teaching.html&email=snowdreams1006@163.com"
}

As you can see, whether the query parameters are spliced fully by hand or semi-automatically with urllib.urlencode(query), the two are essentially the same, and urllib2.urlopen(url) still sends a GET request

How to send a normal POST request

If the request URL only supports POST, a GET request built by splicing the address no longer meets the requirement. Interestingly, turning the GET request into a POST request takes only one step

If it is a GET request, the request is sent as follows: urllib2.urlopen('http://httpbin.snowdreams1006.cn/post?%s' % params);

If it is a POST request, the request is sent as follows: urllib2.urlopen('http://httpbin.snowdreams1006.cn/post',params);

def post_params_urllib2():
    '''
    //Get response header and response body information
    '''
    params = urllib.urlencode({
        'param1': 'hello', 
        'param2': 'world',
        'author':'snowdreams1006',
        'website':'http://blog.snowdreams1006.cn',
        'url':'https://snowdreams1006.github.io/learn-python/url/urllib/teaching.html',
        'wechat':'snowdreams1006',
        'email':'snowdreams1006@163.com',
        'github':'https://github.com/snowdreams1006/'
    })
    response = urllib2.urlopen('http://httpbin.snowdreams1006.cn/post',params)
    print('>>>Response Headers:')
    print(response.info())
    print('>>>Response Body:')
    print(response.read())

if __name__ == '__main__':
    print '>>>Post params urllib2<<<'
    post_params_urllib2()

Because GET and POST requests look so similar here, pay attention to how the URL and parameters are passed to urllib2.urlopen() when sending the request: parameters spliced into the URL mean GET, while a separate data argument means POST
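The data-argument rule can also be confirmed offline, without sending anything, via Request.get_method(), which reports the verb urlopen would use (the class is urllib2.Request in Python 2 and urllib.request.Request in Python 3, hence the guarded import):

```python
try:
    from urllib2 import Request         # Python 2
except ImportError:
    from urllib.request import Request  # Python 3

url = 'http://httpbin.snowdreams1006.cn/post'

# No data argument -> urlopen would send a GET request
req_get = Request(url)

# A data payload -> urlopen switches to POST
req_post = Request(url, 'param1=hello'.encode('utf-8'))

print(req_get.get_method())   # GET
print(req_post.get_method())  # POST
```

Nothing but the presence of the data argument changes between the two, which is exactly the "one step" that converts a GET request into a POST request.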

However, a more intuitive method is to send a request for direct verification. An example is as follows:

(.env) $ python urllib_demo.py 
>>>Post params urllib2<<<
>>>Response Headers:
Server: nginx/1.17.6
Date: Thu, 16 Jan 2020 14:45:43 GMT
Content-Type: application/json
Content-Length: 758
Connection: close
Access-Control-Allow-Origin: *
Access-Control-Allow-Credentials: true

>>>Response Body:
{
  "args": {}, 
  "data": "", 
  "files": {}, 
  "form": {
    "author": "snowdreams1006", 
    "email": "snowdreams1006@163.com", 
    "github": "https://github.com/snowdreams1006/", 
    "param1": "hello", 
    "param2": "world", 
    "url": "https://snowdreams1006.github.io/learn-python/url/urllib/teaching.html",
    "website": "http://blog.snowdreams1006.cn", 
    "wechat": "snowdreams1006"
  }, 
  "headers": {
    "Accept-Encoding": "identity", 
    "Connection": "close", 
    "Content-Length": "285", 
    "Content-Type": "application/x-www-form-urlencoded", 
    "Host": "httpbin.snowdreams1006.cn", 
    "User-Agent": "Python-urllib/2.7"
  }, 
  "json": null, 
  "origin": "218.205.55.192", 
  "url": "http://httpbin.snowdreams1006.cn/post"
}

It is worth noting that the parameters submitted in the POST request above appear under the form attribute, not the args attribute as in a GET request

How to set up agent for network request

Environment building

If http://proxyip.snowdreams1006.cn/ is not accessible, you can build your own proxy pool from the https://github.com/jhao104/proxy_pool project

{
  "delete?proxy=127.0.0.1:8080": "delete an unable proxy", 
  "get": "get an useful proxy", 
  "get_all": "get all proxy from proxy pool", 
  "get_status": "proxy number"
}

Please don't hammer this single machine; abusive traffic will get blocked. You are encouraged to build your own local environment. Thank you for your support

This proxy pool is based on the jhao104/proxy_pool project, which offers two installation methods: docker installation and source installation

docker mode installation

docker run --env db_type=REDIS --env db_host=127.0.0.1 --env db_port=6379 --env db_password='' -p 5010:5010 jhao104/proxy_pool

Of course, you can pull the image in advance with docker pull jhao104/proxy_pool, and then run the above command to start the container

Source code installation

  • Step 1: download the source code
git clone https://github.com/jhao104/proxy_pool.git

You can also download the installation package directly: https://github.com/jhao104/proxy_pool/releases

  • Step 2: install dependencies
pip install -r requirements.txt

When installing the project dependencies you need to switch to the project root first, e.g. cd proxy_pool; packages are then downloaded from the default installation source. You can also run pip install -i https://pypi.tuna.tsinghua.edu.cn/simple -r requirements.txt to speed up the installation

  • Step 3: configure Config/setting.py
# Config/setting.py is the project configuration file

# Configure DB     
DATABASES = {
    "default": {
        "TYPE": "REDIS",        # SSDB or REDIS databases are currently supported
        "HOST": "127.0.0.1",   # db host
        "PORT": 6379,          # db port, for example, SSDB usually uses 8888, REDIS usually uses 6379 by default
        "NAME": "proxy",       # Default configuration
        "PASSWORD": ""         # db password
    }
}

# Configure API services
SERVER_API = {
    "HOST": "0.0.0.0",  # Monitor ip, 0.0.0.0 monitor all IP
    "PORT": 5010        # Monitor port
}
       
# After the above configuration is started, the agent pool access address is http://127.0.0.1:5010

  • Step 4: start the project
# If your dependencies have been installed and are ready to run, you can start them in the cli directory through ProxyPool.py
# The program is divided into schedule scheduler and web server API service

# Start the scheduler first
python proxyPool.py schedule

# Then start the webApi service
python proxyPool.py webserver

This command assumes the current directory is cli; from another directory, adjust the path to proxyPool.py accordingly (cd cli switches into the cli directory)

If all the above steps succeed, the project will automatically grab free proxy ips from the Internet after it starts, which you can see at http://127.0.0.1:5010

Agent request

It is recommended to first open http://proxyip.snowdreams1006.cn/get/ in a browser to check that a random proxy ip can be obtained, and only then fetch it from a Python program; this ensures the code runs correctly and eases subsequent development and testing

{
    "proxy": "183.91.33.41:8086", 
    "fail_count": 0, 
    "region": "", 
    "type": "", 
    "source": "freeProxy09", 
    "check_count": 59, 
    "last_status": 1, 
    "last_time": "2020-01-18 13:14:32"
}

The above is an example random ip returned by requesting /get/. Full request address: http://proxyip.snowdreams1006.cn/get/

Get random proxy ip

# -*- coding: utf-8 -*-
import urllib2
import json

def get_proxy():
    '''
    //Get random agent
    '''
    response = urllib2.urlopen('http://proxyip.snowdreams1006.cn/get/')
    result = response.read()
    return json.loads(result)

if __name__ == '__main__':
    print '>>>Get proxy urllib<<<'
    print get_proxy()

If you have a browser available, you can directly open http://proxyip.snowdreams1006.cn/get/ to verify that a random proxy ip is returned, or run the command curl http://proxyip.snowdreams1006.cn/get/ at the terminal to view the result

Set proxy ip access

urllib.FancyURLopener(proxy): set proxy ip information to realize indirect access

With urllib.FancyURLopener(proxy) you can set a proxy so the server does not see the client's real information; whether the server can tell a proxied request from an ordinary one depends on the proxy ip itself

A highly anonymous (elite) proxy is the ideal case, as it truly plays the role of a proxy

Conversely, a transparent proxy is the most useless kind: the server not only knows you are using a proxy but also sees your real ip, so the concealment is nothing but self-deception

To verify whether the configured proxy ip can be recognized by the server, access http://httpbin.snowdreams1006.cn/ip to get the client ip as read by the server

$ curl http://httpbin.snowdreams1006.cn/ip
{
  "origin": "115.217.104.191"
}

If the terminal has no curl command, install it yourself or simply open http://httpbin.snowdreams1006.cn/ip in a browser

If the request ip read by the server matches the configured proxy ip, congratulations: the proxy is set up successfully and is highly anonymous. Otherwise the concealment is only self-deception

# -*- coding: utf-8 -*-
import urllib
import urllib2
import json

def get_proxy():
    '''
    //Get random agent
    '''
    response = urllib2.urlopen('http://proxyip.snowdreams1006.cn/get/')
    result = response.read()
    return json.loads(result)

def get_proxy_urllib():
    '''
    Send a request through the proxy
    '''
    # Random proxy ip
    ip = get_proxy().get('proxy')
    print('>>>Get Proxy:')
    print(ip)
    proxy = {
        'http': 'http://{}'.format(ip),
        'https': 'https://{}'.format(ip)
    }
    opener = urllib.FancyURLopener(proxy)
    response = opener.open('http://httpbin.snowdreams1006.cn/ip')
    print('>>>Response Headers:')
    print(response.info())
    print('>>>Response Body:')
    print(response.read())

if __name__ == '__main__':
    print '>>>Get proxy urllib<<<'
    get_proxy_urllib()

The above example only shows how to set a proxy ip and send a request. It neither verifies that the proxy was actually applied (i.e. that the request ip read by the server matches the proxy ip just set), nor handles cases where the proxy ip is unavailable, the connection times out, or other exceptions occur

To build a simple check of whether the proxy ip is set successfully, first look at the format of the random proxy response:

{
    "proxy": "121.225.199.78:3128",
    "fail_count": 0,
    "region": "",
    "type": "",
    "source": "freeProxy09",
    "check_count": 15,
    "last_status": 1,
    "last_time": "2020-01-17 12:03:29"
}

This is the general format of a random proxy response; the proxy value here is 121.225.199.78:3128

The randomly obtained proxy ip comes with a port number, while the origin ip returned by http://httpbin.snowdreams1006.cn/ip does not, so the simplest approach is to strip the port number from the random ip and then compare it with the access result

'121.225.199.78:3128'.split(':')[0]

split(':') divides the string into two parts; taking only the first part yields the ip address without the port number: 121.225.199.78
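This comparison can be wrapped in two small helpers. The function names strip_port and is_proxy_applied are illustrative and do not appear in the original code:

```python
def strip_port(proxy_ip):
    '''Return the ip address without its port suffix.'''
    return proxy_ip.split(':')[0]

def is_proxy_applied(proxy_ip, origin_ip):
    '''True when the origin ip reported by the server matches the proxy address.'''
    return strip_port(proxy_ip) == origin_ip

print(strip_port('121.225.199.78:3128'))                          # 121.225.199.78
print(is_proxy_applied('121.225.199.78:3128', '121.225.199.78'))  # True
```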

Next, since the response body returned by response.read() is a string, extracting the value of origin directly is inconvenient; the body is clearly in json format, so json.loads(result) conveniently converts it into a python dictionary

result = response.read()
result = json.loads(result)
proxyip = result.get('origin')

For a dictionary you can use either result.get('origin') or result['origin']; however, when the key does not exist the two behave differently: get() returns None while indexing raises KeyError, so the get() method is recommended
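The difference only appears when the key is missing. A quick demonstration, with a dictionary literal standing in for the parsed response body:

```python
result = {'origin': '121.225.199.78'}

# get() returns None (or a supplied default) when the key is missing
print(result.get('missing'))         # None
print(result.get('missing', 'n/a'))  # n/a

# Indexing with [] raises KeyError when the key is missing
try:
    result['missing']
except KeyError:
    print('KeyError raised')
```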

Now the simplest example to verify whether the proxy ip is set successfully is as follows:

# -*- coding: utf-8 -*-
import urllib
import urllib2
import json

def get_proxy():
    '''
    Get a random proxy ip
    '''
    response = urllib2.urlopen('http://proxyip.snowdreams1006.cn/get/')
    result = response.read()
    return json.loads(result)

def get_proxy_urllib():
    '''
    Send a request through the proxy
    '''
    # Random proxy ip
    ip = get_proxy().get('proxy')
    print('>>>Get Proxy:')
    print(ip)
    proxy = {
        'http': 'http://{}'.format(ip),
        'https': 'https://{}'.format(ip)
    }
    opener = urllib.FancyURLopener(proxy)
    response = opener.open('http://httpbin.snowdreams1006.cn/ip')
    result = response.read()
    result = json.loads(result)
    response_ip = result.get('origin')
    proxy_ip = ip.split(':')[0]
    if proxy_ip == response_ip:
        print 'Proxy Success'
    else:
        print 'Proxy Fail'

if __name__ == '__main__':
    print '>>>Get proxy urllib<<<'
    get_proxy_urllib()

If the randomly obtained proxy ip is reachable, no exception is thrown: the check either reports success or failure

(.env) $ python urllib_demo.py 
>>>Get proxy urllib<<<
>>>Get Proxy:
52.80.58.248:3128
Proxy Fail

(.env) $ python urllib_demo.py 
>>>Get proxy urllib<<<
>>>Get Proxy:
117.88.176.152:3000
Proxy Success

The quality of free proxy ips is mediocre, so keep your expectations low; in actual development, paid proxy ips are the better choice
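Free proxies also fail outright or hang, which none of the examples above guard against. Below is a minimal sketch that adds a timeout and catches network errors; the function name fetch_origin is illustrative, and the try/except import keeps the snippet runnable on Python 3, where urllib2 no longer exists:

```python
import json

try:
    from urllib2 import urlopen            # Python 2, as used in this article
except ImportError:
    from urllib.request import urlopen     # Python 3 fallback

def fetch_origin(url, timeout=5):
    '''Return the origin ip reported by the server, or None on any network error.'''
    try:
        response = urlopen(url, timeout=timeout)
        return json.loads(response.read()).get('origin')
    except IOError:
        # URLError and socket.timeout are both IOError subclasses,
        # so dead proxies and slow connections both land here
        return None

# An unresolvable host fails fast and yields None instead of raising
print(fetch_origin('http://invalid.invalid/ip', timeout=2))
```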

Clear proxy ip for direct access

urllib.FancyURLopener({}): clear the proxy ip information to restore direct access

When setting a proxy ip, you pass a proxy dictionary to urllib.FancyURLopener(proxy); to clear the proxy information, you only need to pass an empty dictionary instead of the original proxy dictionary

The main code is no different from the proxy-setting example above, so it is not described in detail; please refer to the following code:

# -*- coding: utf-8 -*-
import urllib
import urllib2
import json

def get_proxy():
    '''
    Get a random proxy ip
    '''
    response = urllib2.urlopen('http://proxyip.snowdreams1006.cn/get/')
    result = response.read()
    return json.loads(result)

def clear_proxy_urllib():
    '''
    Send a request after clearing the proxy
    '''
    # Random proxy ip
    ip = get_proxy().get('proxy')
    print('>>>Get Proxy:')
    print(ip)
    proxy = {
        'http': 'http://{}'.format(ip),
        'https': 'https://{}'.format(ip)
    }
    opener = urllib.FancyURLopener(proxy)
    response = opener.open("http://httpbin.snowdreams1006.cn/ip")
    print('>>>Response Headers:')
    print(response.info())
    print('>>>Response Body:')
    result = response.read()
    print(result)
    result = json.loads(result)
    response_ip = result.get('origin')
    proxy_ip = ip.split(':')[0]
    if proxy_ip == response_ip:
        print 'Set proxy success'
    else:
        print 'Set proxy fail'

    opener = urllib.FancyURLopener({})
    response = opener.open("http://httpbin.snowdreams1006.cn/ip")
    print('>>>Response Headers:')
    print(response.info())
    print('>>>Response Body:')
    result = response.read()
    print(result)
    result = json.loads(result)
    response_ip = result.get('origin')
    proxy_ip = ip.split(':')[0]
    if proxy_ip == response_ip:
        print 'Clear proxy fail'
    else:
        print 'Clear proxy success'

if __name__ == '__main__':
    print '>>>Clear proxy urllib<<<'
    clear_proxy_urllib()

In addition to using urllib.FancyURLopener() to set or clear the proxy ip, you can also use urllib.urlopen() to achieve the same effect

# Use http://www.someproxy.com:3128 for HTTP proxying
proxies = {'http': 'http://www.someproxy.com:3128'}
filehandle = urllib.urlopen(some_url, proxies=proxies)

# Don't use any proxies
filehandle = urllib.urlopen(some_url, proxies={})

# Use proxies from environment - both versions are equivalent
filehandle = urllib.urlopen(some_url, proxies=None)
filehandle = urllib.urlopen(some_url)

An example of setting environment variables is as follows:

% http_proxy="http://www.someproxy.com:3128"
% export http_proxy
% python
...

Learning summary

This article mainly introduces how to send network requests with python's native urllib, together with the basic environment setup. It includes plenty of ready-to-run code; the documentation and source code are open source, and interested readers are welcome to browse them.

Here is a brief review of the key points covered in this article, so that later learners can quickly browse and look things up

Virtual environment

After the virtual environment is installed and activated successfully, the version information of python and pip is as follows:

(.env)$ python --version
Python 2.7.16
(.env) $ pip --version
pip 19.3.1 from ~/python/src/url/urllib/.env/lib/python2.7/site-packages/pip (python 2.7)

To build the virtual environment by yourself, please refer to the following steps to start the virtual environment:

  • Step 1. Install virtual environment virtualenv
sudo pip install virtualenv

Installing a virtual environment conveniently isolates different python environments, but you can also use the system default environment, so this step is optional, as are the following steps

  • Step 2. Prepare the virtual environment directory.env
virtualenv .env

The virtual environment directory is made hidden to prevent accidental modification; of course, a normal visible directory works too

  • Step 3. Activate virtual environment.env
source .env/bin/activate

After the virtual environment is activated, you can run pip --version to view the current version information and thereby verify that the virtual environment is active

Server background httpbin

Default local access address: http://127.0.0.1:8000/ ; online access addresses: http://httpbin.snowdreams1006.cn/ or http://httpbin.org/

If httpbin is successfully installed via docker, the home page preview differs from the effect of starting the httpbin library directly with python, but both serve the same endpoints

If you need to build your own local service, please choose the installation method according to your own needs. Here are two ways to start httpbin service

  • docker installation mode
docker run -p 8000:80 kennethreitz/httpbin

On the first run the image is downloaded locally before the container starts; on subsequent runs the container starts directly. Access address: http://127.0.0.1:8000/

  • python installation mode
pip install gunicorn httpbin && gunicorn httpbin:app

The default listening port is 8000. If a port-conflict message says it is already occupied, run gunicorn httpbin:app -b :9898 to specify another port

Free ip proxy pool proxyip

Default local access address: http://127.0.0.1:5010/ ; online access addresses: http://proxyip.snowdreams1006.cn/ or http://118.24.52.95/

{
  "delete?proxy=127.0.0.1:8080": "delete an unable proxy", 
  "get": "get an useful proxy", 
  "get_all": "get all proxy from proxy pool", 
  "get_status": "proxy number"
}

If you need to build your own local service, please choose the installation method according to your own needs. Here are two ways to start proxyip service

  • docker installation mode
docker run --env db_type=REDIS --env db_host=127.0.0.1 --env db_port=6379 --env db_password='' -p 5010:5010 jhao104/proxy_pool

Of course, you can download the image in advance with docker pull jhao104/proxy_pool, and then run the above command to start the container

  • Source code installation mode
    -Step 1: download the source code
    git clone https://github.com/jhao104/proxy_pool.git

    >You can also download the release package directly: https://github.com/jhao104/proxy_pool/releases
    -Step 2: install dependencies
    pip install -r requirements.txt

    >Note: when installing project dependencies, switch to the project root directory in advance (`cd proxy_pool`). If the download feels too slow, use the Tsinghua University mirror `pip install -i https://pypi.tuna.tsinghua.edu.cn/simple -r requirements.txt` to speed up the download and installation
    -Step 3: configure ` Config/setting.py`
    # Config/setting.py is the project configuration file

    # Configure DB     
    DATABASES = {
        "default": {
            "TYPE": "REDIS",        # SSDB or REDIS databases are currently supported
            "HOST": "127.0.0.1",   # db host
            "PORT": 6379,          # db port, for example, SSDB usually uses 8888, REDIS usually uses 6379 by default
            "NAME": "proxy",       # Default configuration
            "PASSWORD": ""         # db password
        }
    }

    # Configure API services
    SERVER_API = {
        "HOST": "0.0.0.0",  # Listening ip; 0.0.0.0 listens on all IPs
        "PORT": 5010        # Listening port
    }
           
    # After the above configuration is started, the agent pool access address is http://127.0.0.1:5010

    >For more details about configuration, please refer to the official introduction of the project directly. The above configuration information is basically enough
    -Step 4: start the project
    # If your dependencies have been installed and are ready to run, you can start them in the cli directory through ProxyPool.py
    # The program is divided into schedule scheduler and web server API service

    # Start the scheduler first
    python proxyPool.py schedule

    # Then start the webApi service
    python proxyPool.py webserver

    >This command requires the current working directory to be the `cli` directory; if you are elsewhere, adjust the path to `proxyPool.py` accordingly (`cd cli`)

Native network request urlib

urllib.urlopen(url[,data[,proxies]]) : https://docs.python.org/2/library/urllib.html

  • GET request

If the query parameters are simple, the request URL can be built directly, and the query parameter dictionary can be serialized with urllib.urlencode(dict)
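A quick demonstration of building a query string with urlencode; the try/except import covers Python 3, where urlencode moved to urllib.parse:

```python
try:
    from urllib import urlencode        # Python 2
except ImportError:
    from urllib.parse import urlencode  # Python 3

params = urlencode({'param1': 'hello', 'param2': 'world'})
url = 'http://httpbin.snowdreams1006.cn/get?%s' % params
print(url)
```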

When the query parameters are not complex, and especially when no query parameters are needed, you can send the request directly with urllib2.urlopen(url), as follows:

# -*- coding: utf-8 -*-
import urllib
import urllib2

def use_simple_urllib2():
    '''
    Get response header and response body information
    '''
    response = urllib2.urlopen('http://httpbin.snowdreams1006.cn/get')
    print('>>>Response Headers:')
    print(response.info())
    print('>>>Response Body:')
    print(response.read())

if __name__ == '__main__':
    print '>>>Use simple urllib2<<<'
    use_simple_urllib2()

When there are many query parameters or they need to be spliced dynamically, it is recommended to serialize them with urllib.urlencode(dict) and then append the result to the request URL to form the complete request URL

# -*- coding: utf-8 -*-
import urllib
import urllib2

def use_params_urllib2():
    params = urllib.urlencode({
        'param1': 'hello', 
        'param2': 'world',
        'author':'snowdreams1006',
        'website':'http://blog.snowdreams1006.cn',
        'url':'https://snowdreams1006.github.io/learn-python/url/urllib/teaching.html',
        'wechat':'snowdreams1006',
        'email':'snowdreams1006@163.com',
        'github':'https://github.com/snowdreams1006/'
    })
    response = urllib2.urlopen('http://httpbin.snowdreams1006.cn/get?%s' % params)
    print('>>>Response Headers:')
    print(response.info())
    print('>>>Response Body:')
    print(response.read())

if __name__ == '__main__':
    print '>>>Use params urllib2<<<'
    use_params_urllib2()


  • POST request

Compared with the default GET mode, the only difference is that the query parameters are passed through the optional data parameter instead of being spliced into the request URL; for example, urllib.urlopen(url, data) sends a POST request

# -*- coding: utf-8 -*-
import urllib
import urllib2

def post_params_urllib2():
    '''
    Get response header and response body information
    '''
    params = urllib.urlencode({
        'param1': 'hello', 
        'param2': 'world',
        'author':'snowdreams1006',
        'website':'http://blog.snowdreams1006.cn',
        'url':'https://snowdreams1006.github.io/learn-python/url/urllib/teaching.html',
        'wechat':'snowdreams1006',
        'email':'snowdreams1006@163.com',
        'github':'https://github.com/snowdreams1006/'
    })
    response = urllib2.urlopen('http://httpbin.snowdreams1006.cn/post',params)
    print('>>>Response Headers:')
    print(response.info())
    print('>>>Response Body:')
    print(response.read())

if __name__ == '__main__':
    print '>>>Post params urllib2<<<'
    post_params_urllib2()

  • Setting a proxy

urllib.FancyURLopener(proxy) sends requests through the proxy when the proxy dictionary is valid, and clears the proxy settings when the proxy dictionary is empty

# -*- coding: utf-8 -*-
import urllib
import urllib2
import json

def get_proxy():
    '''
    Get a random proxy ip
    '''
    response = urllib2.urlopen('http://proxyip.snowdreams1006.cn/get/')
    result = response.read()
    return json.loads(result)

def get_proxy_urllib():
    '''
    Send a request through the proxy
    '''
    # Random proxy ip
    ip = get_proxy().get('proxy')
    print('>>>Get Proxy:')
    print(ip)
    proxy = {
        'http': 'http://{}'.format(ip),
        'https': 'https://{}'.format(ip)
    }
    opener = urllib.FancyURLopener(proxy)
    response = opener.open('http://httpbin.snowdreams1006.cn/ip')
    result = response.read()
    result = json.loads(result)
    response_ip = result.get('origin')
    proxy_ip = ip.split(':')[0]
    if proxy_ip == response_ip:
        print 'Proxy Success'
    else:
        print 'Proxy Fail'

if __name__ == '__main__':
    print '>>>Get proxy urllib<<<'
    get_proxy_urllib()

In addition to using urllib.FancyURLopener(proxy) to set up proxy requests, you can also use urllib.urlopen(url, data, proxies) to send GET or POST requests through a proxy

# -*- coding: utf-8 -*-
import urllib
import urllib2
import json

def get_proxy():
    '''
    Get a random proxy ip
    '''
    response = urllib2.urlopen('http://proxyip.snowdreams1006.cn/get/')
    result = response.read()
    return json.loads(result)

def post_proxy_urllib():
    '''
    Get response header and response body information through the proxy
    '''
    data = urllib.urlencode({
        'param1': 'hello', 
        'param2': 'world',
        'author':'snowdreams1006',
        'website':'http://blog.snowdreams1006.cn',
        'url':'https://snowdreams1006.github.io/learn-python/url/urllib/teaching.html',
        'wechat':'snowdreams1006',
        'email':'snowdreams1006@163.com',
        'github':'https://github.com/snowdreams1006/'
    })
    ip = get_proxy().get('proxy')
    print('>>>Get Proxy:')
    print(ip)
    proxies = {
        'http': 'http://{}'.format(ip),
        'https': 'https://{}'.format(ip)
    }
    response = urllib.urlopen('http://httpbin.snowdreams1006.cn/post', data=data, proxies=proxies)
    result = response.read()
    result = json.loads(result)
    response_ip = result.get('origin')
    proxy_ip = ip.split(':')[0]
    if proxy_ip == response_ip:
        print 'Proxy Success'
    else:
        print 'Proxy Fail'

if __name__ == '__main__':
    print '>>>Get proxy urllib<<<'
    post_proxy_urllib()

This basically covers the demonstrations of python2's urllib.urlopen(url[, data[, proxies]]); readers are encouraged to practice on their own

Preview of the next section:

Visit https://api.github.com/, request whichever endpoints interest you, and explore the open data:

{
  "current_user_url": "https://api.github.com/user",
  "current_user_authorizations_html_url": "https://github.com/settings/connections/applications{/client_id}",
  "authorizations_url": "https://api.github.com/authorizations",
  "code_search_url": "https://api.github.com/search/code?q={query}{&page,per_page,sort,order}",
  "commit_search_url": "https://api.github.com/search/commits?q={query}{&page,per_page,sort,order}",
  "emails_url": "https://api.github.com/user/emails",
  "emojis_url": "https://api.github.com/emojis",
  "events_url": "https://api.github.com/events",
  "feeds_url": "https://api.github.com/feeds",
  "followers_url": "https://api.github.com/user/followers",
  "following_url": "https://api.github.com/user/following{/target}",
  "gists_url": "https://api.github.com/gists{/gist_id}",
  "hub_url": "https://api.github.com/hub",
  "issue_search_url": "https://api.github.com/search/issues?q={query}{&page,per_page,sort,order}",
  "issues_url": "https://api.github.com/issues",
  "keys_url": "https://api.github.com/user/keys",
  "label_search_url": "https://api.github.com/search/labels?q={query}&repository_id={repository_id}{&page,per_page}",
  "notifications_url": "https://api.github.com/notifications",
  "organization_url": "https://api.github.com/orgs/{org}",
  "organization_repositories_url": "https://api.github.com/orgs/{org}/repos{?type,page,per_page,sort}",
  "organization_teams_url": "https://api.github.com/orgs/{org}/teams",
  "public_gists_url": "https://api.github.com/gists/public",
  "rate_limit_url": "https://api.github.com/rate_limit",
  "repository_url": "https://api.github.com/repos/{owner}/{repo}",
  "repository_search_url": "https://api.github.com/search/repositories?q={query}{&page,per_page,sort,order}",
  "current_user_repositories_url": "https://api.github.com/user/repos{?type,page,per_page,sort}",
  "starred_url": "https://api.github.com/user/starred{/owner}{/repo}",
  "starred_gists_url": "https://api.github.com/gists/starred",
  "user_url": "https://api.github.com/users/{user}",
  "user_organizations_url": "https://api.github.com/user/orgs",
  "user_repositories_url": "https://api.github.com/users/{user}/repos{?type,page,per_page,sort}",
  "user_search_url": "https://api.github.com/search/users?q={query}{&page,per_page,sort,order}"
}


If you find this article helpful, please give it a like, or follow the official account "Snow dream technology station" for regularly updated quality articles!


Tags: github Python JSON pip

Posted on Sat, 18 Jan 2020 01:30:27 -0500 by phprocker