Ali Sherief

Posted on Dec 27, 2019

Python HTTP at Lightspeed ⚡ Part 1

#python #http #tutorial #codenewbie

Before I start: no, this isn't a Django or Flask tutorial; it's about using the http module itself. It also isn't a general-purpose guide to the HTTP protocol, only a tutorial on using HTTP from Python. I'll try to avoid exotic use cases because this post focuses on real-world uses of these modules, though I must admit that most people use urllib3 or requests, not http. Those two will be covered in Part 2.

Really quickly now, there are four different submodules in http:

http.client - has the HTTP client classes
http.server - a not very useful submodule with a few HTTP server classes
http.cookies - allows your program to support cookies (a cookie is a piece of state in your HTTP session)
http.cookiejar - stores cookies automatically, and lets you allow or block websites from setting cookies in a session

http.client

This has all the methods you need to act like an HTTP client. It's cruder than using the urllib3 module or the requests PyPI package but it gets the job done. Here are some of the nifty things you can do with it:

You can make an HTTPConnection:

conn = http.client.HTTPConnection("www.python.org")

You can also make an HTTPSConnection:

conn = http.client.HTTPSConnection("www.python.org")

For both of these connections, a useful keyword argument is timeout, which gives up on the connection if it waits longer than the number of seconds you specify. So instead of the above you could have typed HTTPConnection("www.python.org", timeout=10).
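
To see the timeout fire without depending on a slow website, here is a minimal sketch. The "server" is a hypothetical stand-in: a local listening socket that accepts the TCP connection but never answers, and the 0.5 second timeout is just a value picked for the demo.

```python
import http.client
import socket

# A listening socket that never answers stands in for a slow server.
# (Hypothetical stand-in; normally this would be a remote host.)
silent_server = socket.socket()
silent_server.bind(("127.0.0.1", 0))   # port 0 = let the OS pick a free port
silent_server.listen(1)
port = silent_server.getsockname()[1]

conn = http.client.HTTPConnection("127.0.0.1", port, timeout=0.5)
timed_out = False
try:
    conn.request("GET", "/")       # connects and sends the request
    conn.getresponse()             # blocks waiting for a reply that never comes
except socket.timeout:             # an alias of TimeoutError on Python 3.10+
    timed_out = True
finally:
    conn.close()
    silent_server.close()

print("timed out:", timed_out)
```

Note that the timeout applies to each blocking socket operation, so it can trigger while waiting for the response and not only while connecting.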

Then you request a file from the server. This might take a while because request() has to connect first; if fast request() calls are important, call connect() once beforehand:

conn.connect() # optional, will wait here instead of in request()
conn.request("GET", "/")
# Alternatively (instead of the call above), you can also specify headers
# conn.request("GET", "/", headers={'Accept': 'text/html'})

You can only send one request at a time. If you try to send another before calling getresponse() for the first one, it raises CannotSendRequest.

After that's done, you get the response of the request:

>>> response = conn.getresponse()
>>> print(response)
<http.client.HTTPResponse object at 0x7f8f3e1fbee0>
>>> print(response.status, response.reason)
200 OK
>>> data1 = response.read()
>>> print(data1)
# Entire content of index.html is printed here
>>> response.close()
>>> # When you're done with the connection don't forget to close it
>>> conn.close()

You need to read the whole response before you can send a new request to the server. If you don't, the leftover body can end up being interpreted as the status line and headers of the next response, which will almost certainly throw a BadStatusLine exception. Closing the response object is not enough (we'll get to response objects in a minute).

It's also possible for getresponse() to fail entirely if the remote end of the connection disconnected. In that case, it raises RemoteDisconnected. Also if you already read the responses to all requests and you didn't send another request, then when you call getresponse() it will raise ResponseNotReady.
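
Here is a small sketch that triggers two of these exceptions on purpose and then reuses the connection properly. It assumes a throwaway local server; HelloHandler is a hypothetical handler written just for this demo, and it speaks HTTP/1.1 so the connection stays open between requests.

```python
import http.client
import http.server
import threading

class HelloHandler(http.server.BaseHTTPRequestHandler):
    """Hypothetical test handler: answers every GET with a short body."""
    protocol_version = "HTTP/1.1"      # keep the connection open between requests

    def do_GET(self):
        body = b"hello"
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass                           # silence the per-request log lines

server = http.server.ThreadingHTTPServer(("127.0.0.1", 0), HelloHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()

conn = http.client.HTTPConnection("127.0.0.1", server.server_address[1], timeout=5)
errors = []

conn.request("GET", "/")
try:
    conn.request("GET", "/again")      # getresponse() not called yet
except http.client.CannotSendRequest:
    errors.append("CannotSendRequest")

first = conn.getresponse()
first_body = first.read()              # read everything before the next request

try:
    conn.getresponse()                 # no request is pending now
except http.client.ResponseNotReady:
    errors.append("ResponseNotReady")

conn.request("GET", "/again")          # fine: the first response was fully read
second = conn.getresponse()
second_body = second.read()

conn.close()
server.shutdown()
server.server_close()
print(first.status, first_body, second.status, second_body, errors)
```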

Some HTTP status codes

They're in http.client and, since Python 3.5, in http.HTTPStatus, which contains all the status codes that are possible in HTTP. The full list is in the http.HTTPStatus documentation, but I will list the most common ones:

HTTP Code	Name in Python
200	http.client.OK
206	http.client.PARTIAL_CONTENT
301	http.client.MOVED_PERMANENTLY
403	http.client.FORBIDDEN
404	http.client.NOT_FOUND
408	http.client.REQUEST_TIMEOUT
500	http.client.INTERNAL_SERVER_ERROR
502	http.client.BAD_GATEWAY
503	http.client.SERVICE_UNAVAILABLE
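
A quick sketch of how these names behave in practice. Nothing here touches the network; the variable names are just for illustration.

```python
import http.client
from http import HTTPStatus

# The module-level names in http.client are the enum members themselves.
same_object = http.client.NOT_FOUND is HTTPStatus.NOT_FOUND

# Members compare equal to plain integers, so either style works in a check.
is_404 = HTTPStatus.NOT_FOUND == 404

# Look a member up from a numeric code and read its descriptive fields.
status = HTTPStatus(503)
summary = f"{status.value} {status.name}: {status.phrase}"
print(same_object, is_404, summary)
```

This means a check like response.status == http.client.OK reads nicely and still works with the plain integer that HTTPResponse gives you.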

One at a time

Instead of making one request() call or specifying all the headers at once, maybe you want to create each part of the HTTP request one at a time. In that case, you call the following functions in order:

First you call putrequest(method, url, skip_host=False, skip_accept_encoding=False). method is "GET", "HEAD" or any other request type. url is the part of the URL after the domain name, such as "/", "/index.html", "/favicon.ico" or any other file on the server. skip_host means don't send the Host header, and skip_accept_encoding means don't send the Accept-Encoding header.
Then you call putheader(header_name, value) to add a header called header_name with the value value to the request.
When you're done passing headers, call endheaders(message_body=None). This sends the request line and headers to the server, along with a message body if you pass one.
Finally, if you still have body data to send (for a PUT request, for example), call send(bytes_object). For requests without a body, such as GET, you can skip this step.
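
The steps above can be sketched as a PUT request with a body. This assumes a throwaway local server; EchoHandler is a hypothetical handler that simply sends the uploaded bytes back, and /upload.txt is a made-up path.

```python
import http.client
import http.server
import threading

class EchoHandler(http.server.BaseHTTPRequestHandler):
    """Hypothetical test handler: sends the PUT body back to the client."""

    def do_PUT(self):
        length = int(self.headers.get("Content-Length", 0))
        body = self.rfile.read(length)
        self.send_response(200)
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass                                          # keep the demo quiet

server = http.server.ThreadingHTTPServer(("127.0.0.1", 0), EchoHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()

payload = b"some file contents"
conn = http.client.HTTPConnection("127.0.0.1", server.server_address[1], timeout=5)

conn.putrequest("PUT", "/upload.txt")                 # 1. request line
conn.putheader("Content-Type", "text/plain")          # 2. headers, one call each
conn.putheader("Content-Length", str(len(payload)))
conn.endheaders()                                     # 3. headers go out here
conn.send(payload)                                    # 4. body, only if there is one

response = conn.getresponse()
echoed = response.read()
conn.close()
server.shutdown()
server.server_close()
print(response.status, echoed)
```

Since nothing computes Content-Length for you in this style, the sketch sets it by hand; request() would normally take care of that.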

Proxy servers

OK, this is fine and all, but what happens if you need to use a proxy to connect to the target host? HTTPConnection (and HTTPSConnection) has a method for that too, called set_tunnel(). To use it, you first create an HTTPConnection to the proxy server's host and port. Then, before sending anything, you call set_tunnel('www.targetwebsite.com'). Here's its use in action:

>>> conn = http.client.HTTPSConnection("localhost", 8080)
>>> conn.set_tunnel("www.python.org")
>>> conn.request("HEAD","/index.html")
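
As a sketch of the ordering, here is a small helper that prepares a tunnelled connection without opening it. The name open_tunnel is hypothetical, and the proxy at localhost:8080 is assumed to exist only when you actually send a request.

```python
import http.client

def open_tunnel(proxy_host, proxy_port, target_host, target_port=443):
    """Prepare (but do not yet open) an HTTPS connection through a proxy."""
    conn = http.client.HTTPSConnection(proxy_host, proxy_port, timeout=10)
    # This has to happen before the first request: once the socket exists,
    # set_tunnel() raises RuntimeError.
    conn.set_tunnel(target_host, target_port)
    return conn

conn = open_tunnel("localhost", 8080, "www.python.org")
# Nothing has been sent yet. The connection to the proxy is only made on
# the first connect() or request() call.
print(conn.sock)
```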

HTTPResponse objects

What a convenient name to give it, because its purpose is exactly what it's called. You saw some things you can do with HTTPResponse objects above, now we will go in-depth to see some other uses for it.

Essentially, an HTTPResponse object is a file and has the usual methods of a file object, but it's read-only: you can't write to it or seek through it. That's right, seeking is not possible for an HTTPResponse object and you must read the response sequentially.

It has some other handy members and methods too:

getheaders() returns all the headers in the response. Use getheader(name, default=None) to get just one and return default if there is no header called name.
version returns the HTTP protocol version in use, which is either 10 for HTTP/1.0 or 11 for HTTP/1.1.
status is the HTTP status code.
reason is the human-readable version of the status code.
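
The members above can be tried out with a short sketch. It assumes a throwaway local server; TextHandler is a hypothetical handler that serves a fixed 26-byte body, and because it keeps the default protocol of BaseHTTPRequestHandler, the response comes back as HTTP/1.0.

```python
import http.client
import http.server
import threading

class TextHandler(http.server.BaseHTTPRequestHandler):
    """Hypothetical test handler: serves a fixed 26-byte text body."""

    def do_GET(self):
        body = b"abcdefghijklmnopqrstuvwxyz"
        self.send_response(200)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, format, *args):
        pass                                   # keep the demo quiet

server = http.server.ThreadingHTTPServer(("127.0.0.1", 0), TextHandler)
threading.Thread(target=server.serve_forever, daemon=True).start()

conn = http.client.HTTPConnection("127.0.0.1", server.server_address[1], timeout=5)
conn.request("GET", "/")
response = conn.getresponse()

content_type = response.getheader("Content-Type")
missing = response.getheader("X-Not-There", "fallback")   # default is returned
header_names = [name for name, value in response.getheaders()]
http_version = response.version                # 10 here: the handler speaks HTTP/1.0
can_seek = response.seekable()

# Read sequentially in fixed-size chunks; read() returns b"" at the end.
chunks = []
while True:
    chunk = response.read(10)
    if not chunk:
        break
    chunks.append(chunk)

conn.close()
server.shutdown()
server.server_close()
print(content_type, missing, http_version, can_seek, chunks)
```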

http.server

I have opted to skip this topic because the interesting part boils down to two request handler classes: BaseHTTPRequestHandler (which is basically an abstract class; it can't do anything by itself) and SimpleHTTPRequestHandler, which serves files from your file system. They are also insecure because they only do basic security checks. To see what SimpleHTTPRequestHandler would do just for fun, check out the appendix.

http.cookies

This submodule has a class SimpleCookie derived from BaseCookie and a class called Morsel which acts like a key/value pair. Unlike SimpleHTTPRequestHandler, SimpleCookie is actually useful (it's basically equivalent to PHP's cookie functions).

The SimpleCookie and Morsel classes work together to provide a type that is almost identical to a dictionary (it has all the methods of a dictionary), which means you create a cookie and then create keys inside the cookie and associate them with values. Then you can convert the cookie into a string to use as an HTTP header.

>>> import http.cookies
>>> C = http.cookies.SimpleCookie()
>>> C["fig"] = "newton"
>>> C["sugar"] = "wafer"
>>> C["rocky"] = "road"
>>> C["rocky"]["path"] = "/cookie"
>>> print(C)
Set-Cookie: fig=newton
Set-Cookie: rocky=road; Path=/cookie
Set-Cookie: sugar=wafer
>>> str(type(C["rocky"]))
"<class 'http.cookies.Morsel'>"

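Going the other way works too. Here's a minimal sketch that parses the value of a Cookie header with load() and renders a single Morsel without the "Set-Cookie:" prefix. The variable names are just for illustration.

```python
from http.cookies import SimpleCookie

# Parse the value of a Cookie header as a client would have sent it.
incoming = SimpleCookie()
incoming.load("fig=newton; sugar=wafer")
fig_value = incoming["fig"].value            # each entry is a Morsel
names = sorted(incoming.keys())

# Build a cookie to send back and render it without the "Set-Cookie:" prefix.
outgoing = SimpleCookie()
outgoing["rocky"] = "road"
outgoing["rocky"]["path"] = "/cookie"
header_value = outgoing["rocky"].OutputString()
print(fig_value, names, header_value)
```
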
http.cookiejar

Many websites rely on cookies being set to function properly. For instance, when you go to DEV, it makes sure that a cookie with your login information is set so you appear logged in on the site.

The CookieJar class extracts cookies from HTTP requests, and returns them in HTTP responses. CookieJar supports cookies that expire. Its constructor is CookieJar(policy=None) and, as you can see, it also supports so-called policies (a CookiePolicy class) which allow or block cookies from being set. To load cookies from a file, like your browser does, use FileCookieJar(filename, delayload=None, policy=None). The cookies aren't actually loaded until you call its load() or revert() method.

The request and response objects in the parameters come from the standard library's urllib.request module (the request is usually a urllib.request.Request), not from the http module.

Some helpful CookieJar and FileCookieJar methods, mostly copied from the Python documentation:

make_cookies(response, request) returns a sequence of Cookie objects extracted from response object. This is not an http.cookies.SimpleCookie, cookiejar uses its own Cookie class.
set_policy(policy) changes the policy that's used.
set_cookie_if_ok(cookie, request) sets the cookie if the policy allows it to.
set_cookie(cookie) sets the cookie whether the policy allows it or not.
clear([domain[, path[, name]]]) removes cookies. clear() removes all cookies, clear('domain.name') removes all cookies belonging to domain.name, clear('domain.name', '/path') removes all cookies belonging to 'domain.name/path', and to remove a specific cookie, set all three arguments. If no cookies were found this raises KeyError.
clear_session_cookies() removes all session cookies, meaning the cookies whose discard attribute is true (usually because they came without an expiry date or with an explicit discard attribute).

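Now the following methods only exist on FileCookieJar and its subclasses. In practice you use a subclass such as MozillaCookieJar or LWPCookieJar, because FileCookieJar itself doesn't implement a file format and its save() raises NotImplementedError: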

save(filename=None, ignore_discard=False, ignore_expires=False) saves all the cookies to a file, overwriting the file in the process. If ignore_discard is True, it saves session cookies as well. If ignore_expires is True, it saves expired cookies too.
load(filename=None, ignore_discard=False, ignore_expires=False) loads the cookies from a file. Cookies in the FileCookieJar object are preserved unless there is a cookie with the same name inside the file.
revert(filename=None, ignore_discard=False, ignore_expires=False) works like load(), but all cookies in the FileCookieJar are removed first.
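
Here's a sketch of the difference between load() and revert(). It uses MozillaCookieJar, since the FileCookieJar base class doesn't implement a file format itself. The make_cookie helper, the cookie names and the example.com domains are all made up for the demo; normally cookies come out of a response via make_cookies().

```python
import http.cookiejar
import os
import tempfile
import time

def make_cookie(name, value, domain):
    """Build a Cookie by hand (hypothetical helper for this demo)."""
    return http.cookiejar.Cookie(
        version=0, name=name, value=value,
        port=None, port_specified=False,
        domain=domain, domain_specified=False, domain_initial_dot=False,
        path="/", path_specified=True,
        secure=False,
        expires=int(time.time()) + 3600,   # one hour from now
        discard=False,                     # not a session cookie, so it is saved
        comment=None, comment_url=None, rest={},
    )

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "cookies.txt")

    jar = http.cookiejar.MozillaCookieJar(path)
    jar.set_cookie(make_cookie("session", "abc123", "example.com"))
    jar.save()                             # uses the filename from the constructor

    restored = http.cookiejar.MozillaCookieJar(path)
    restored.set_cookie(make_cookie("temp", "1", "example.org"))
    restored.load()                        # keeps "temp", adds "session"
    after_load = sorted(cookie.name for cookie in restored)

    restored.revert()                      # drops everything, then reloads the file
    after_revert = sorted(cookie.name for cookie in restored)

print(after_load, after_revert)
```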

A CookiePolicy class contains a handful of methods that decide which cookies are accepted from a server and which are returned to it; if a cookie is never returned, the server effectively cannot see that it is set. Setting this up is a bit involved, so if all you want to do is block certain websites from setting cookies, you should use the simpler DefaultCookiePolicy instead.

policy = http.cookiejar.DefaultCookiePolicy(blocked_domains=["ads.net", ".ads.net"])
policy.set_blocked_domains(["doubleclick.net", ".doubleclick.net"])
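
One thing worth checking for yourself is that set_blocked_domains() replaces the block list instead of adding to it. The sketch below uses is_blocked() to show that, and then hands the policy to a CookieJar. The variable names are just for illustration.

```python
import http.cookiejar

policy = http.cookiejar.DefaultCookiePolicy(
    blocked_domains=["ads.net", ".ads.net"]
)
# The leading-dot entry is what catches subdomains such as www.ads.net.
blocked_before = (policy.is_blocked("ads.net"), policy.is_blocked("www.ads.net"))

# set_blocked_domains() replaces the list rather than extending it.
policy.set_blocked_domains(["doubleclick.net", ".doubleclick.net"])
blocked_after = (policy.is_blocked("ads.net"), policy.is_blocked("doubleclick.net"))
current_list = policy.blocked_domains()      # returned as a tuple

# Hand the policy to a jar so it is consulted for every cookie.
jar = http.cookiejar.CookieJar(policy=policy)
print(blocked_before, blocked_after, current_list)
```

So to block both sets of domains you would pass all four entries in one list.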

Examples of cookiejar will be given in Part 2.

Appendix: http.server.SimpleHTTPRequestHandler

I decided to cover this here since it doesn't have any practical uses. This "server" object, which is actually a request handler, is passed as an argument to socketserver.TCPServer() along with a host/port tuple. So to make an HTTP server that listens on port 8000 on all network interfaces (that's what the empty host string means), one would call TCPServer(("", 8000), SimpleHTTPRequestHandler).

Other members in this class are server_version, which has a value such as 'SimpleHTTP/0.6', and extensions_map, which is a dictionary mapping file extensions to MIME types (a MIME type looks something like application/javascript).

Here is an example of using SimpleHTTPRequestHandler and the curl command to make an equivalent to the ls command. It can also read files.

>>> import http.server
>>> import socketserver
>>> 
>>> PORT = 8000
>>> 
>>> Handler = http.server.SimpleHTTPRequestHandler
>>> 
>>> with socketserver.TCPServer(("", PORT), Handler) as httpd:
...     print("serving at port", PORT)
...     httpd.serve_forever()
...
serving at port 8000
127.0.0.1 - - [26/Dec/2019 21:45:03] "GET / HTTP/1.1" 200 -
...

$ curl 127.0.0.1:8000/bin/
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.01//EN" "http://www.w3.org/TR/html4/strict.dtd">
<html>
<head>
<meta http-equiv="Content-Type" content="text/html; charset=utf-8">
<title>Directory listing for /bin/</title>
</head>
<body>
<h1>Directory listing for /bin/</h1>
<hr>
<ul>
<li><a href="2to3">2to3@</a></li>
<li><a href="2to3-3.8">2to3-3.8</a></li>
<li><a href="bundle">bundle</a></li>
<li><a href="bundler">bundler</a></li>
<li><a href="cargo">cargo</a></li>
...

The HTML tags at the beginning allow us to open this page in a browser too. Try it out.

You can even query it with http.client (use HTTPConnection here, because this simple server only speaks plain HTTP, not HTTPS):

>>> conn = http.client.HTTPConnection("127.0.0.1", 8000)
>>> conn.request("GET", "/bin/")
>>> response = conn.getresponse()
>>> print(response.status, response.reason)
200 OK
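
Here's the same experiment as a single script, as a rough sketch. To keep it harmless it serves a temporary directory with one made-up file (hello.txt) instead of your real file system, binds to 127.0.0.1 only, and lets the OS pick the port. The directory argument needs Python 3.7 or newer.

```python
import functools
import http.client
import http.server
import os
import tempfile
import threading

with tempfile.TemporaryDirectory() as root:
    # A scratch directory with one file, so the listing has known contents.
    with open(os.path.join(root, "hello.txt"), "wb") as f:
        f.write(b"hi there\n")

    handler = functools.partial(
        http.server.SimpleHTTPRequestHandler, directory=root  # Python 3.7+
    )
    # Bind to 127.0.0.1 only, on a port chosen by the OS.
    server = http.server.ThreadingHTTPServer(("127.0.0.1", 0), handler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    port = server.server_address[1]

    # Equivalent of "ls": ask for the directory listing.
    conn = http.client.HTTPConnection("127.0.0.1", port, timeout=5)
    conn.request("GET", "/")
    listing_response = conn.getresponse()
    listing = listing_response.read().decode("utf-8")
    conn.close()

    # Equivalent of "cat": ask for the file itself.
    conn = http.client.HTTPConnection("127.0.0.1", port, timeout=5)
    conn.request("GET", "/hello.txt")
    file_response = conn.getresponse()
    file_body = file_response.read()
    conn.close()

    server.shutdown()
    server.server_close()

print(listing_response.status, "hello.txt" in listing, file_body)
```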

And we're done (for now)

Let me know in the comments if you find something incorrect in here.
