In my previous post I covered how to use the basic `http` module. Now let's go up a level and check out how to use urllib3. Then we'll reach even higher horizons learning about requests. But first, a quick disambiguation of urllib and urllib3.
The backstory
Once upon a time, back when people were rocking Python 2, you had these libraries called httplib and urllib2. Then Python 3 happened.
In Python 3, httplib was refactored into `http.client`, which you learned about in Part 1, and urllib2 was split across multiple submodules in a new module called `urllib`. This new urllib contained a high-level HTTP interface that didn't require you to mess around with the details of `http.client` (formerly httplib). Except that this new urllib was missing a long list of critical features, such as:
- Thread safety
- Connection pooling
- Client-side SSL/TLS verification
- File uploads with multipart encoding
- Helpers for retrying requests and dealing with HTTP redirects
- Support for gzip and deflate encoding
- Proxy support for HTTP and SOCKS
To address these issues, urllib3 was created by the community. It is not a core Python module (and probably never will be), but the upside is that it doesn't need to maintain compatibility with urllib.
urllib won't be covered here because urllib3 can do nearly everything it does and has some extra features, and the vast majority of programmers use urllib3 and requests.
So now that you know the difference between urllib and urllib3, here is a urllib example (the only one here) that uses the `http.cookiejar.CookieJar` class from Part 1:
>>> import urllib.request
>>> import http.cookiejar
>>> policy = http.cookiejar.DefaultCookiePolicy(
... blocked_domains=["ads.net", ".ads.net"])
>>> cj = http.cookiejar.CookieJar(policy)
>>> opener = urllib.request.build_opener(urllib.request.HTTPCookieProcessor(cj))
>>> r = opener.open("http://example.com")
>>> str(type(r))
"<class 'http.client.HTTPResponse'>"
Installation
Neither urllib3 nor requests is included in a default Python installation (though if your Python was packaged by a distribution, they might be there), so they must be installed with pip:

pip3 install 'urllib3[secure, socks]' 'requests[socks]'

The `secure` extra installs certificate-related packages that urllib3 needs, and `socks` installs SOCKS protocol support.
urllib3
Obviously you need to import it first with `import urllib3`, and for those of you who read Part 1, here is where things get interesting. Instead of creating a connection directly, you create a `PoolManager` object. This handles connection pooling and thread safety for you. There is also a `ProxyManager` object for routing requests through an HTTP/HTTPS proxy, as well as a `SOCKSProxyManager` for SOCKS4 and SOCKS5 proxies. This is what it looks like:
>>> import urllib3
>>> from urllib3.contrib.socks import SOCKSProxyManager
>>> proxy = urllib3.ProxyManager('http://localhost:3128/')
>>> proxy.request('GET', 'http://google.com/')
>>> proxy = SOCKSProxyManager('socks5://localhost:8889/')
Bear in mind that HTTPS proxies cannot connect to HTTP websites.
urllib3 also has a logger which will log a lot of messages. You can tweak the verbosity by importing the `logging` module and calling `logging.getLogger("urllib3").setLevel(your_level)`.
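For example, to silence everything below WARNING (`your_level` above is a placeholder for any of the standard logging levels):

>>> import logging
>>> logging.getLogger("urllib3").setLevel(logging.WARNING)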
Like an `HTTPConnection` in the `http` module, urllib3 has a `request()` method. It's invoked like `poolmanager.request('GET', 'http://httpbin.org/robots.txt')`. Similar to `http`, this method also returns a class named `HTTPResponse`. But don't be fooled! This is not an `http.client.HTTPResponse`; it is a `urllib3.response.HTTPResponse`. The urllib3 version has some methods that are not defined in `http`, and these will prove to be both very useful and convenient.
As explained, this `request()` method returns an `HTTPResponse` object. It has a `data` attribute which holds the response body as raw bytes. When the server responds with JSON (as httpbin does), you can inspect it like this:

import json
print(json.loads(response.data.decode('utf-8')))
Creating a query parameter
A query parameter looks like `http://httpbin.org/get?arg=value`. The easiest way to construct something like this is to have a string containing everything up to and including the question mark, pass the argument/value pairs as a dictionary to `urllib.parse.urlencode()` (yes, `urllib`), and concatenate the result to your original string.
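For example (the argument names here are just placeholders):

>>> from urllib.parse import urlencode
>>> url = 'http://httpbin.org/get?' + urlencode({'arg': 'value', 'arg2': 'another'})
>>> url
'http://httpbin.org/get?arg=value&arg2=another'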
Here's a roundup. Every parameter in this table that can be specified has to be a dictionary, and the response JSON will contain keys corresponding to some of them:

| Parameter in `request()` | JSON key in response |
| --- | --- |
| N/A | `"origin"` |
| `headers` | `"headers"` |
| `fields` (HEAD/GET/DELETE) | `"args"` |
| encoded URL parameter (POST/PUT) | `"args"` |
| `fields` (POST/PUT) | `"form"` |
| encoded `body` with Content-Type `application/json` in `headers` | `"json"` |
| `'filefield': (file_name, file_data, mime_type)` in `fields` parameter | `"files"` |
| binary data in `body` with any Content-Type in `headers` parameter | `"data"` |
HTTPS in urllib3
There is some extra boilerplate code to add to use certificates and therefore HTTPS in a `PoolManager`, but it has the advantage of throwing an error if the connection cannot be secured for some reason:
>>> import certifi
>>> import urllib3
>>> pool = urllib3.PoolManager(
... cert_reqs='CERT_REQUIRED',
... ca_certs=certifi.where())
>>> pool.request('GET', 'https://google.com')
(No exception)
>>> pool.request('GET', 'https://expired.badssl.com')
(Throws urllib3.exceptions.SSLError)
Some additional goodies
Similar to `http`, urllib3 connections support timeouts for requests. For even more control, you can make a `Timeout` object to specify separate connect and read timeouts (all of these exceptions live under `urllib3.exceptions`):
>>> pool.request(
... 'GET', 'http://httpbin.org/delay/3', timeout=2.5)
MaxRetryError caused by ReadTimeoutError
>>> pool.request(
... 'GET',
... 'http://httpbin.org/delay/3',
... timeout=urllib3.Timeout(connect=1.0))
<urllib3.response.HTTPResponse>
>>> pool.request(
... 'GET',
... 'http://httpbin.org/delay/3',
... timeout=urllib3.Timeout(connect=1.0, read=2.0))
MaxRetryError caused by ReadTimeoutError
Something that `http` doesn't have is retrying requests. urllib3 has this by virtue of being a high-level library. Its documentation couldn't explain it better:

> urllib3 can automatically retry idempotent requests. This same mechanism also handles redirects. You can control the retries using the retries parameter to request(). By default, urllib3 will retry requests 3 times and follow up to 3 redirects.
To change the number of retries just specify an integer:
>>> pool.request('GET', 'http://httpbin.org/ip', retries=10)
To disable all retry and redirect logic specify retries=False:
>>> pool.request(
... 'GET', 'http://nxdomain.example.com', retries=False)
NewConnectionError
>>> r = pool.request(
... 'GET', 'http://httpbin.org/redirect/1', retries=False)
>>> r.status
302
To disable redirects but keep the retrying logic, specify redirect=False:
>>> r = pool.request(
... 'GET', 'http://httpbin.org/redirect/1', redirect=False)
>>> r.status
302
Similar to `Timeout`, there is also a `Retry` object for setting the maximum retries and redirects separately. It's made like this: `retries=urllib3.Retry(3, redirect=2)`. The request will throw a `MaxRetryError` if the limit is exceeded.
Instead of passing a `Retry` object for each request, you can also specify the `Retry` object in the `PoolManager` constructor to make it apply to all requests. The same applies to `Timeout`.
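A sketch of what that looks like (the values are arbitrary):

>>> pool = urllib3.PoolManager(
...     retries=urllib3.Retry(3, redirect=2),
...     timeout=urllib3.Timeout(connect=1.0, read=2.0))
>>> pool.request('GET', 'http://httpbin.org/ip')  # the defaults above now apply
<urllib3.response.HTTPResponse>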
requests
requests uses urllib3 under the hood and makes it even simpler to make requests and retrieve data. For one thing, keep-alive is 100% automatic, compared to urllib3 where it's not. requests also has event hooks which call a callback function when an event is triggered, like receiving a response (but that's an advanced feature and it won't be covered here).

In requests, each request type has its own function. So instead of creating a connection or a pool, you directly GET (for example) a URL. A lot of the keyword parameters used in urllib3 (shown in the table above) can also be used with requests identically. All exceptions live under `requests.exceptions`.
import requests
r = requests.get('https://httpbin.org/get')
r = requests.post('https://httpbin.org/post', data={'key':'value'})
r = requests.put('https://httpbin.org/put', data={'key':'value'})
r = requests.delete('https://httpbin.org/delete')
r = requests.head('https://httpbin.org/get')
r = requests.options('https://httpbin.org/get')
# You can disable redirects if you want
r = requests.options('https://httpbin.org/get', allow_redirects=False)
# Or set a timeout for the number of seconds a server has to start responding
r = requests.options('https://httpbin.org/get', timeout=0.001)
# Set the connect and read timeouts at the same time
r = requests.options('https://httpbin.org/get', timeout=(3.05, 27))
# To pass query parameters (`None` keys won't be added to the request):
r = requests.get('https://httpbin.org/get',
                 params={'key1': 'value1', 'key2': 'value2'})
# If a key has a list value, a key/value pair is added for each value in the list:
r = requests.get('https://httpbin.org/get',
                 params={'key1': 'value1', 'key2': ['value2', 'value3']})
# Headers can also be added:
r = requests.get('https://httpbin.org/get',
                 headers={'user-agent': 'my-app/0.0.1'})
# And, only in requests (not urllib3), there is a cookies keyword argument:
r = requests.get('https://httpbin.org/get',
                 cookies=dict(cookies_are='working'))
The value returned from these calls is yet another type of response object. This time, it's a `requests.Response` (at least it wasn't another `HTTPResponse` 🙂). This object has a wealth of information, such as the time the request took, the JSON of the response, whether the page was redirected, and even its own `CookieJar` type. Here is a running list of the most useful members:
- `r.status_code` and `r.reason`: numeric status code and human-readable reason.
- `url`: the canonical URL used in the request.
- `text`: the text retrieved from the request.
- `content`: the bytes version of `text`.
- `json()`: attempts to return the JSON of `text`. Raises `ValueError` if this isn't possible.
- `encoding`: if you know the correct encoding for `text`, set it here so `text` can be read properly.
- `apparent_encoding`: the encoding that requests guessed the content was in.
- `raise_for_status()`: raises `requests.exceptions.HTTPError` if the request encountered one.
- `ok`: `True` if `status_code` is less than 400, `False` otherwise.
- `is_redirect` and `is_permanent_redirect`: whether the status code was a redirect or a permanent redirect, respectively.
- `headers`: headers in the response.
- `cookies`: cookies in the response.
- `history`: all the `Response` objects from URLs that redirected to get to the present URL, sorted from oldest to newest.
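A quick sketch exercising a few of these members against httpbin:

import requests

r = requests.get('https://httpbin.org/get')
print(r.status_code, r.reason)    # 200 OK
print(r.headers['Content-Type'])  # application/json
print(r.json()['url'])            # https://httpbin.org/get
r.raise_for_status()              # does nothing on success, raises HTTPError on 4xx/5xx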
This is how you would save the response output to a file (pass `stream=True` to the request if you don't want the whole body read into memory first):
with open(filename, 'wb') as fd:
for chunk in r.iter_content(chunk_size=128):
fd.write(chunk)
And this is how you stream uploads without reading the whole file:
with open('massive-body', 'rb') as f:
requests.post('http://some.url/streamed', data=f)
In the event of a network error, requests will raise a `ConnectionError`. If the request times out, it raises `Timeout`. And if too many redirects were made, it raises `TooManyRedirects`.
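A minimal sketch of catching these (using httpbin's delay endpoint to force a timeout):

import requests

try:
    r = requests.get('https://httpbin.org/delay/3', timeout=1)
except requests.exceptions.Timeout:
    print('The server took too long to respond')
except requests.exceptions.ConnectionError:
    print('A network problem occurred')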
Proxies
HTTP, HTTPS and SOCKS proxies are supported. requests is also sensitive to the `HTTP_PROXY` and `HTTPS_PROXY` environment variables, and if these are set, requests will use their values as the proxies automatically. Within Python, you can set the proxies with the `proxies` parameter:
# Instead of socks5 you could use http and https.
proxies = {
    'http': 'socks5://user:pass@host:port',
    'https': 'socks5://user:pass@host:port'
}
requests.get('http://example.org', proxies=proxies)
Session objects
A `Session` can persist cookies and some parameters across requests, and it reuses the underlying HTTP connection for those requests. It uses a urllib3 `PoolManager`, which will significantly increase the performance of HTTP requests to the same host. It also has all the methods of the main requests API (all the request functions you saw above). Sessions can also be used as context managers:
with requests.Session() as s:
s.get('https://httpbin.org/cookies/set/sessioncookie/123456789')
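Cookies set in the session persist for later requests, which you can verify with httpbin's cookie endpoints:

with requests.Session() as s:
    s.get('https://httpbin.org/cookies/set/sessioncookie/123456789')
    r = s.get('https://httpbin.org/cookies')
    print(r.json())  # {'cookies': {'sessioncookie': '123456789'}}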
And we're done
This concludes the Python HTTP series. Are there errors here? Let me know so I can fix them.