Ali Sherief

Python HTTP at Lightspeed ⚑ Part 2: urllib3 and requests

In my previous post I covered how to use the basic http module. Now let's go a level higher and check out how to use urllib3. Then we will reach even higher horizons and learn about requests. But first, a quick disambiguation of urllib and urllib3.

The backstory

Once upon a time, back when people were rocking Python 2, you had these libraries called httplib and urllib2. Then Python 3 happened.

In Python 3, httplib was refactored into http.client, which you learned about in Part 1, and urllib2 was split across multiple submodules in a new package called urllib. The new urllib provided a high-level HTTP interface that didn't require you to mess around with the details of http.client (formerly httplib). Except that this new urllib was missing a long list of critical features, such as:

  • Thread safety
  • Connection pooling
  • Client-side SSL/TLS verification
  • File uploads with multipart encoding
  • Helpers for retrying requests and dealing with HTTP redirects
  • Support for gzip and deflate encoding
  • Proxy support for HTTP and SOCKS

To address these issues, urllib3 was created by the community. It is not a core Python module (and probably never will be), but that also means it doesn't need to maintain compatibility with urllib.

urllib won't be covered here because urllib3 can do nearly everything it does and has some extra features, and because the vast majority of programmers use urllib3 and requests anyway.

So now that you know the difference between urllib and urllib3, here is a urllib example (the only one here) that uses the http.cookiejar.CookieJar class from Part 1:

>>> import urllib.request
>>> import http.cookiejar
>>> policy = http.cookiejar.DefaultCookiePolicy(
...     blocked_domains=["ads.net", ".ads.net"])
>>> cj = http.cookiejar.CookieJar(policy)
>>> opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(cj))
>>> r = opener.open("http://example.com")
>>> str(type(r))
"<class 'http.client.HTTPResponse'>"


Installation

Neither urllib3 nor requests is included in a default Python installation (though if your Python was packaged by a distribution, they might be there). So they must be installed with pip: pip3 install 'urllib3[secure, socks]' 'requests[socks]' should install them for you. The secure extra installs certificate-related packages that urllib3 needs, and socks installs SOCKS protocol related packages.

urllib3

Obviously you need to import it first with import urllib3, and for those of you who read Part 1, here is where things get interesting. Instead of creating a connection directly, you create a PoolManager object. It handles connection pooling and thread safety for you. There is also a ProxyManager object for routing requests through an HTTP/HTTPS proxy, as well as a SOCKSProxyManager for SOCKS4 and SOCKS5 proxies. This is what it looks like:

>>> import urllib3
>>> from urllib3.contrib.socks import SOCKSProxyManager
>>> proxy = urllib3.ProxyManager('http://localhost:3128/')
>>> proxy.request('GET', 'http://google.com/')
>>> proxy = SOCKSProxyManager('socks5://localhost:8889/')

Bear in mind that HTTPS proxies cannot connect to HTTP websites.

urllib3 also has a logger which will emit a lot of messages. You can tweak the verbosity by importing the logging module and calling logging.getLogger("urllib3").setLevel(your_level).

Like an HTTPConnection in the http module, urllib3 has a request() method. It's invoked like poolmanager.request('GET', 'http://httpbin.org/robots.txt'). Similar to http, this method also returns a class named HTTPResponse. But don't be fooled! This is not an http.client.HTTPResponse. This is a urllib3.response.HTTPResponse. The urllib3 version has some methods that are not defined in http, and these will prove to be both very useful and convenient.
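A minimal sketch of that call, assuming httpbin.org is reachable:

```python
import urllib3

# Create a pool manager and issue a simple GET request through it.
pool = urllib3.PoolManager()
r = pool.request('GET', 'http://httpbin.org/robots.txt')

print(r.status)   # 200
print(type(r))    # urllib3's HTTPResponse, not http.client's
```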

As explained, this request() method returns an HTTPResponse object. Its data attribute holds the raw response body as bytes; for the httpbin.org endpoints used in this post, that body is a JSON string encoded as UTF-8. To inspect it, you can use:

import json
print(json.loads(response.data.decode('utf-8')))

Creating a query parameter

A query parameter looks like http://httpbin.org/get?arg=value. The easiest way to construct something like this is to keep a string containing everything up to and including the question mark, pass the argument/value pairs as a dictionary to urllib.parse.urlencode() (yes, urllib), and concatenate the result to your original string.
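A sketch of that approach with urllib.parse.urlencode (the parameter names here are arbitrary):

```python
from urllib.parse import urlencode

# Everything up to and including the question mark...
base = 'http://httpbin.org/get?'
# ...then the argument/value pairs, URL-encoded and concatenated.
params = {'arg': 'value', 'lang': 'python'}
url = base + urlencode(params)

print(url)  # http://httpbin.org/get?arg=value&lang=python
```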

Here's a roundup. Every parameter in this table that can be specified has to be a dictionary. The response will contain JSON keys corresponding to some of these:

| Parameter in request() | JSON key in response |
| --- | --- |
| N/A | "origin" |
| headers | "headers" |
| fields (HEAD/GET/DELETE) | "args" |
| encoded url parameter (POST/PUT) | "args" |
| fields (POST/PUT) | "form" |
| encoded body with Content-Type: application/json in headers | "json" |
| 'filefield': (file_name, file_data, mime_type) in fields parameter | "files" |
| binary data in body with any Content-Type in headers parameter | "data" |
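To see a few rows of the table in action, here is a sketch against httpbin.org's echo endpoints, which reply with a JSON document describing the request they received:

```python
import json
import urllib3

pool = urllib3.PoolManager()

# fields with GET become URL query parameters -> echoed under "args"
r1 = pool.request('GET', 'http://httpbin.org/get', fields={'arg': 'value'})
print(json.loads(r1.data.decode('utf-8'))['args'])   # {'arg': 'value'}

# fields with POST become form data -> echoed under "form"
r2 = pool.request('POST', 'http://httpbin.org/post', fields={'key': 'value'})
print(json.loads(r2.data.decode('utf-8'))['form'])   # {'key': 'value'}

# an encoded body with Content-Type: application/json -> echoed under "json"
r3 = pool.request('POST', 'http://httpbin.org/post',
                  body=json.dumps({'n': 1}).encode('utf-8'),
                  headers={'Content-Type': 'application/json'})
print(json.loads(r3.data.decode('utf-8'))['json'])   # {'n': 1}
```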

HTTPS in urllib3

There is some extra boilerplate code to add to use certificates, and therefore HTTPS, in a PoolManager, but it has the advantage of throwing an error if the connection cannot be secured for some reason:

>>> import certifi
>>> import urllib3
>>> pool = urllib3.PoolManager(
...     cert_reqs='CERT_REQUIRED',
...     ca_certs=certifi.where())
>>> pool.request('GET', 'https://google.com')
(No exception)
>>> pool.request('GET', 'https://expired.badssl.com')
(Throws urllib3.exceptions.SSLError)

Some additional goodies

Similar to http, urllib3 connections support timeouts for requests. For even more control, you can create a Timeout object to specify separate connect and read timeouts (all of these exceptions live under urllib3.exceptions):

>>> pool.request(
...     'GET', 'http://httpbin.org/delay/3', timeout=2.5)
MaxRetryError caused by ReadTimeoutError
>>> pool.request(
...     'GET',
...     'http://httpbin.org/delay/3',
...     timeout=urllib3.Timeout(connect=1.0))
<urllib3.response.HTTPResponse>
>>> pool.request(
...     'GET',
...     'http://httpbin.org/delay/3',
...     timeout=urllib3.Timeout(connect=1.0, read=2.0))
MaxRetryError caused by ReadTimeoutError

Something that http doesn't have is retrying requests. urllib3 has this by virtue of being a high-level library. Its documentation couldn't explain it better:

urllib3 can automatically retry idempotent requests. This same mechanism also handles redirects. You can control the retries using the retries parameter to request(). By default, urllib3 will retry requests 3 times and follow up to 3 redirects.

To change the number of retries just specify an integer:

>>> pool.request('GET', 'http://httpbin.org/ip', retries=10)

To disable all retry and redirect logic specify retries=False:

>>> pool.request(
...     'GET', 'http://nxdomain.example.com', retries=False)
NewConnectionError
>>> r = pool.request(
...     'GET', 'http://httpbin.org/redirect/1', retries=False)
>>> r.status
302

To disable redirects but keep the retrying logic, specify redirect=False:

>>> r = pool.request(
...     'GET', 'http://httpbin.org/redirect/1', redirect=False)
>>> r.status
302

Similar to Timeout, there is also a Retry object for setting the maximum retries and redirects separately. It's made like this: retries=urllib3.Retry(3, redirect=2). The request will throw MaxRetryError if the retry or redirect limit is exceeded.

Instead of passing a Retry object for each request, you can also specify the Retry object in the PoolManager constructor to make it apply to all requests. The same applies to Timeout.
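For example, a pool that applies the same retry and timeout policy to every request might look like this (a sketch; the limits are arbitrary):

```python
import urllib3

# These defaults apply to every request made through this pool;
# an individual request() call can still override them.
pool = urllib3.PoolManager(
    retries=urllib3.Retry(3, redirect=2),
    timeout=urllib3.Timeout(connect=1.0, read=5.0))

r = pool.request('GET', 'http://httpbin.org/ip')
print(r.status)  # 200
```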

requests

requests uses urllib3 under the hood and makes it even simpler to make requests and retrieve data. For one thing, keep-alive is 100% automatic, compared to urllib3 where it's not. It also has event hooks which call a callback function when an event is triggered, like receiving a response (but that's an advanced feature and it won't be covered here).

In requests, each request type has its own function. So instead of creating a connection or a pool, you directly GET (for example) a URL. Many of the keyword parameters used in urllib3 (shown in the table above) work identically in requests. All exceptions live under requests.exceptions.

import requests
r = requests.get('https://httpbin.org/get')
r = requests.post('https://httpbin.org/post', data={'key':'value'})
r = requests.put('https://httpbin.org/put', data={'key':'value'})
r = requests.delete('https://httpbin.org/delete')
r = requests.head('https://httpbin.org/get')
r = requests.options('https://httpbin.org/get')
# You can disable redirects if you want
r = requests.options('https://httpbin.org/get', allow_redirects=False)
# Or set a timeout for the number of seconds a server has to start responding
r = requests.options('https://httpbin.org/get', timeout=0.001)
# Set the connect and read timeouts at the same time
r = requests.options('https://httpbin.org/get', timeout=(3.05, 27))
# To pass query parameters (`None` keys won't be added to the request):
r = requests.get('https://httpbin.org/get',
    params={'key1': 'value1', 'key2': 'value2'})
# If a key has a list value a key/value pair is added for each value in the list:
r = requests.get('https://httpbin.org/get',
    params={'key1': 'value1', 'key2': ['value2', 'value3']})
# Headers can also be added:
r = requests.get('https://httpbin.org/get',
    headers={'user-agent': 'my-app/0.0.1'})
# And, only in requests (not urllib3), there is a cookies keyword argument.
r = requests.get('https://httpbin.org/get',
    cookies=dict(cookies_are='working'))

The value returned from these calls is yet another type of response object. This time, it's a requests.Response (at least it wasn't another HTTPResponse πŸ™‚). This object has a wealth of information, such as the time the request took, the JSON of the response, whether the page was redirected and even its own CookieJar type. Here is a running list of the most useful members:

  • status_code and reason: Numeric status code and human-readable reason.
  • url: The canonical URL used in the request.
  • text: The text retrieved from the request.
  • content: The bytes version of text.
  • json(): Attempts to parse text as JSON. Raises ValueError if this isn't possible.
  • encoding: If you know the correct encoding for text, set it here so text can be decoded properly.
  • apparent_encoding: The encoding that requests guessed the response was in.
  • raise_for_status(): Raises requests.exceptions.HTTPError if the request encountered one.
  • ok: True if status_code is less than 400, False otherwise.
  • is_redirect and is_permanent_redirect: Whether the status code was a redirect or a permanent redirect, respectively.
  • headers: Headers in the response.
  • cookies: Cookies in the response.
  • history: All the Response objects from URLs that redirected to get to the present URL, sorted from oldest to newest.
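A quick sketch exercising a few of these members against httpbin.org (assuming network access):

```python
import requests

r = requests.get('https://httpbin.org/get', params={'key': 'value'})

print(r.status_code, r.reason)  # 200 OK
print(r.ok)                     # True
print(r.url)                    # includes ?key=value
print(r.json()['args'])         # {'key': 'value'}
r.raise_for_status()            # no exception for a 2xx response
```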

This is how you would save the response output to a file:

with open(filename, 'wb') as fd:
    for chunk in r.iter_content(chunk_size=128):
        fd.write(chunk)

And this is how you stream an upload without reading the whole file into memory:

with open('massive-body', 'rb') as f:
    requests.post('http://some.url/streamed', data=f)

In the event of a network error, requests will raise ConnectionError. If the request times out, it raises Timeout. And if too many redirects are made, it raises TooManyRedirects.

Proxies

HTTP, HTTPS and SOCKS proxies are supported. requests also honors the HTTP_PROXY and HTTPS_PROXY environment variables; if these are set, requests will use their values as the proxies automatically. Within Python, you can pass the proxies to use in the proxies parameter:

# Instead of socks5 you could use http and https.
proxies = {
    'http': 'socks5://user:pass@host:port',
    'https': 'socks5://user:pass@host:port'
}
requests.get('http://example.org', proxies=proxies)

Session objects

A Session can persist cookies and some parameters across requests, and it reuses the underlying HTTP connection for those requests. It uses a urllib3 PoolManager, which significantly increases the performance of HTTP requests to the same host. It also has all the methods of the main requests API (all the request functions you saw above). Sessions can also be used as context managers:

with requests.Session() as s:
    s.get('https://httpbin.org/cookies/set/sessioncookie/123456789')
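A sketch showing the cookie persistence in action: the cookie set by the first request is sent automatically on the second (both endpoints are httpbin.org's cookie helpers):

```python
import requests

with requests.Session() as s:
    # The first response sets a cookie; the session stores it...
    s.get('https://httpbin.org/cookies/set/sessioncookie/123456789')
    # ...and sends it automatically on the next request.
    r = s.get('https://httpbin.org/cookies')

print(r.json()['cookies'])  # {'sessioncookie': '123456789'}
```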

And we're done

This concludes the Python HTTP series. Are there errors here? Let me know so I can fix them.
