While I was watching Full-Stack Interview Preparation Udacity course, this question was asked: what happens when you visit google.com?
As the interviewer wanted as much detail as possible, I decided to make my own research.
So, what are the actual steps between entering a URL in the address bar and the website being displayed on the screen?
These are the steps:
1) Get an IP address from the URL.
2) Use HTTP/S protocol and the IP address to make a request to the server.
3) Get a response from the server.
4) Depending on the HTTP version, the browser processes the response (HTTP/1.x and HTTP/2 handle this differently, as detailed below).
5) The browser renders the website on screen.
Get an IP address from the URL
DNS (Domain Name System) is a directory of names (domain names) that match with numbers (IP addresses).
DNS resolution is the process of translating domain names to IP addresses.
A DNS record is a mapping between a domain name and an IP address.
The ISP (Internet Service Provider) is the local Internet company provider (AT&T, Virgin Media, Vodafone, etc.) Usually, the ISP-supplied router will point to a DNS server that is hosted by that ISP.
The steps to retrieve an IP address are as follows:
(1) The user types www.google.com.
(2) DNS resolution process kicks in. The browser will check if there is a local cached DNS record of www.google.com IP address in its local cache.
(3) If there is no record in the browser's local cache, the OS's cache is next to be checked. If there is no record in the OS's cache, the DNS server configured on the user's system is checked, which may be the DNS server provided by the user's ISP or a public DNS server. If there is a cached DNS record found, the DNS resolving process is terminated. Skip to "Use HTTP/S protocol and the IP address to make a request to the server" step.
(4) If the DNS record is not found locally, a full DNS resolution is started: DNS root servers are checked first. DNS root servers store the IP addresses of top-level domain DNS servers. Top-level domain (TLD) servers are divided by type, e.g. .com, .edu, .org, etc. The root server returns the IP address of the relevant top-level domain server. In the case of google.com, the root server returns the IP address of the .com top-level domain server. Check this link for an interactive map of root servers' distribution.
(5) The top-level domain server returns the IP address of the second-level domain server. In the case of google.com, the top-level domain server returns the IP address of google.com domain server.
(6) Finally, the second-level domain server contains the DNS record we are looking for. This server sends back the IP address.
(7) The ISP receives the IP address and forwards it to the user's computer. Note that if any DNS server (ISP or public) has to contact another DNS server, it caches the lookup result for a limited time (Time To Live, TTL) so it can quickly resolve subsequent requests for the same domain name. That means that next time you try to access www.google.com, chances are your ISP's DNS server will find the record for www.google.com in its cache.
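The caching hierarchy described in steps (2)–(7) can be sketched as a toy model (the class, function names, and the hard-coded address below are illustrative, not a real resolver):

```python
import time

class DnsCache:
    """Toy DNS cache: records expire after their TTL, as real resolver caches do."""
    def __init__(self):
        self._records = {}                      # domain -> (ip, expiry time)

    def get(self, domain, now=None):
        now = time.time() if now is None else now
        entry = self._records.get(domain)
        if entry and entry[1] > now:
            return entry[0]
        return None                             # missing or expired

    def put(self, domain, ip, ttl, now=None):
        now = time.time() if now is None else now
        self._records[domain] = (ip, now + ttl)

def resolve(domain, caches, full_lookup):
    """Walk the cache hierarchy (browser, OS, ISP resolver) in order;
    fall back to a full recursive lookup and cache the answer on the way back."""
    for cache in caches:
        ip = cache.get(domain)
        if ip:
            return ip
    ip = full_lookup(domain)
    for cache in caches:
        cache.put(domain, ip, ttl=300)          # cache for the record's TTL
    return ip

browser, os_cache, isp = DnsCache(), DnsCache(), DnsCache()
ip = resolve("www.google.com", [browser, os_cache, isp], lambda d: "142.250.74.36")
print(ip == browser.get("www.google.com"))      # True: the answer is now cached
```

A second call to `resolve` for the same domain is answered from the browser cache without hitting the lookup function at all, which is exactly why repeat visits feel faster.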
Use HTTP/S protocol and the IP address to make a request to the server
HTTP(S) protocol is a set of rules for communication between a server and a client. Servers are continuously running programs that respond to client request messages. When accessing google.com, the browser is the client. Web servers wait for HTTP requests, process them when they arrive, and reply to the web browser with an HTTP Response message.
HTTP(S) is a stateless protocol: servers do not store any information about clients by default. This protocol sits at the top of the protocol stack.
Courtesy of MDN Web Docs
A note about Ethernet: some don't consider it a protocol at all; Ethernet is just the pipe. In short, IP is a protocol that allows us to talk to other machines. TCP is a protocol that allows having multiple, independent streams of data between these machines. These streams are distinguished by port numbers (e.g. port 80 for HTTP, port 443 for HTTPS). TCP includes mechanisms to solve many of the problems that arise from packet-based messaging, such as lost packets, out-of-order packets, duplicate packets, and corrupted packets. IP is the part that obtains the address to which data is sent. TCP is responsible for data delivery once that IP address has been found.
Before any HTTP request can be made, a TCP three-way handshake must be established. TCP three-way handshake goes like this: the client makes a request "I want something", the server replies, "I heard you want something?", and finally the client responds, "That's correct". For more details see this link.
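The handshake itself is handled by the operating system; from application code it simply completes inside the socket connect call. A minimal sketch using Python's standard library, connecting to a throwaway server on localhost:

```python
import socket
import threading

def tcp_handshake_demo():
    """Open a listening socket on localhost and connect to it.
    connect() returns only after the SYN / SYN-ACK / ACK exchange completes."""
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind(("127.0.0.1", 0))               # ephemeral port chosen by the OS
    server.listen(1)
    port = server.getsockname()[1]

    # Accept a single connection on a background thread.
    t = threading.Thread(target=lambda: server.accept()[0].close())
    t.start()

    # The three-way handshake happens inside this call.
    client = socket.create_connection(("127.0.0.1", port), timeout=5)
    connected = client.getpeername()[1] == port
    client.close()
    t.join()
    server.close()
    return connected

print(tcp_handshake_demo())  # True
```

Only once this call returns can the client start writing HTTP request bytes to the socket.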
An HTTP request message consists of a request line, request headers, and an optional message body. The request line includes the request method (GET, POST, HEAD, PUT, DELETE, OPTIONS, etc.), the request URL, and the HTTP version (at the time of writing, the latest version is HTTP/2). After the request line come the request headers, as shown below:
CONNECT www.google.com:443 HTTP/1.1
Host: www.google.com:443
Connection: keep-alive
User-Agent: Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/87.0.4280.141 Safari/537.36
HTTP + TLS = HTTPS
HTTPS is just HTTP, but encrypted, which makes it more secure to use. TLS (Transport Layer Security) is an encryption protocol, the successor to SSL (Secure Sockets Layer), which is now outdated. TLS is not specific to the HTTP protocol; any network protocol can use it. For example, FTP + TLS = FTPS.
TLS offers secure connection establishment through authentication, avoiding man-in-the-middle attacks. A man-in-the-middle attack is when the attacker gets between you and the server you are trying to connect to. Neither you nor the server knows there is another party sitting in the middle. The attacker will decrypt your data, read all your data, re-encrypt it and forward it to the server. By using TLS, the server must identify itself in some way so you can be sure you are talking to the right server. This process is called the "TLS handshake".
Once the TCP connection is established through the TCP 3-way handshake, the TLS handshake can occur. The TLS handshake starts with the client making a request: "I want something". This time, the server replies, "Here is my certificate file. Check it out." The server sends back a certificate issued by a certificate authority. The certificate contains the public key of the server, the domain the certificate is for, and the signature of the certificate authority. The client then checks that the domain is correct and that the authority's signature is valid (all browsers ship with a collection of certificate authorities, including their public keys, saved locally, making it trivial to verify the digital signature). After these checks, the client replies: "Okay, your certificate looks fine, but I need to verify your private key. I will generate a random key and encrypt it with your public key. Decrypt it using your private key and we will use this master key to encrypt and decrypt further messages." (This describes the classic RSA key exchange; modern TLS versions prefer Diffie-Hellman-based key exchanges, but the goal of establishing a shared secret is the same.) After the server proves its identity by decrypting the random key, the padlock in the browser toolbar appears. By contrast, for HTTP connections, browsers show a "not secure" warning.
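In Python's standard library, the certificate checks described above are exactly what a default client-side ssl context enforces; a small sketch:

```python
import ssl

# A default client context performs the checks described above: the server
# must present a certificate, signed by a trusted CA, whose domain matches
# the hostname we asked for.
ctx = ssl.create_default_context()
print(ctx.verify_mode == ssl.CERT_REQUIRED)   # True - a valid certificate is mandatory
print(ctx.check_hostname)                     # True - the domain must match

# To actually perform the TLS handshake over an already-open TCP socket:
#   tls_sock = ctx.wrap_socket(tcp_sock, server_hostname="www.google.com")
```

Disabling either of those two checks is what opens the door to the man-in-the-middle attack discussed earlier.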
TLS also offers encryption after the TLS handshake, making it nearly impossible for a malicious eavesdropper to be able to read your data stream.
It is recommended that all assets are served over HTTPS. Most modern browser APIs are being enabled for HTTPS only.
Get a response from the server
The HTTP response has a status line (e.g., 200 OK), header lines (e.g. Date, Server, Last-Modified, Content-Length, Content-Type) and an entity, the object that was requested from the server.
HTTP response status codes indicate whether a specific HTTP request has been successfully completed. Responses are grouped into five classes: informational responses (100–199), successful responses (200–299), redirects (300–399), client errors (400–499), and server errors (500–599).
HTTP/1.1 200 OK
Content-Type: application/json; charset=utf-8
Access-Control-Allow-Origin: https://www.google.com
Access-Control-Allow-Credentials: true
Cache-Control: no-cache, no-store, max-age=0, must-revalidate
Pragma: no-cache
Expires: Mon, 01 Jan 1990 00:00:00 GMT
Date: Fri, 29 Jan 2021 11:52:19 GMT
Content-Disposition: attachment; filename="response.bin"; filename*=UTF-8''response.bin
X-Content-Type-Options: nosniff
Cross-Origin-Resource-Policy: cross-origin
Content-Encoding: gzip
Server: ESF
X-XSS-Protection: 0
X-Frame-Options: SAMEORIGIN
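The grouping of status codes into five classes can be expressed as a small helper (the function name is made up for illustration):

```python
def status_class(code):
    """Map an HTTP status code to its response class by its leading digit."""
    classes = {
        1: "informational",
        2: "successful",
        3: "redirect",
        4: "client error",
        5: "server error",
    }
    return classes.get(code // 100, "unknown")

print(status_class(200))  # successful
print(status_class(301))  # redirect
print(status_class(404))  # client error
```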
Before we continue with the next step, let's dive into some web performance aspects.
Web performance is a metric that indicates how quickly site content loads and renders in a web browser, and how well it responds to user interaction. A good goal for web performance is for the user to not notice performance.
From what was covered so far, it is easy to see how adding TCP and TLS handshakes creates massive overhead. There are at least five network round trips (at least two from TCP, at least three from TLS) before any request can be made. In the HTTP/1.x world, there are a few measures commonly used to address this issue.
One connection per request
Not feasible, as there is too much overhead from performing the handshakes for every single request.
One connection per host
This can be achieved with the "keep-alive" request header. By using the keep-alive header, the server will not close the connection after successfully delivering the response, and the browser can reuse the already-established connection for additional requests. The overhead from repeated handshakes is avoided. However, the drawback is that requests must go in order, over a single connection. That means that if multiple requests are fired simultaneously in the browser, we have to wait for the response to the first request before we can send the second. That adds a significant delay in request processing.
On a single TCP connection, multiple requests can also be sent quickly, one after another, without waiting for their responses — a technique known as HTTP pipelining. However, the server must respect the order in which requests were sent: even if the server has the response for the sixth request ready, it cannot send it until it has sent the response for the first. Naturally, this approach suffers significant delays too.
Fixed number of parallel connections
A solution for these drawbacks is to have a fixed number of connections per host (e.g., 8 connections in IE 11, 6 in Firefox, 4–6 in Chrome, etc.). That means that if, say, six requests can be in flight simultaneously, the seventh is queued until a connection becomes free. Combined with the keep-alive header, these connections can be kept open to process multiple requests and avoid TCP and TLS handshake overhead.
Even with these optimizations, there are a few setbacks:
TCP and TLS handshake overhead: every time the browser connects to the server to make a request, the TCP and, optionally, TLS handshakes must be performed. This setup is very time-consuming.
head-of-line blocking (when a line of packets is held up by the first packet) is a huge bottleneck in website performance. It means requests must queue up to be processed. A fixed number of parallel connections helps, but subsequent requests still wait for earlier requests to complete.
With these issues, it is obvious why optimization is critical. Some optimization techniques for HTTP/1.x are:
minification: a technique that compresses a text resource by removing whitespaces and unnecessary characters without changing validity or functionality.
remove unused code, by tree shaking or Coverage Tab in Chrome DevTools. Tree shaking is a form of dead code elimination. Imagine a mental model of an application and its dependencies as a tree structure where each node is a dependency. In dev builds, tree shaking does not change anything as entire modules are imported. In production builds, module bundlers such as Webpack can be configured to "shake off" non-explicit imports. Hence, production builds are smaller.
cache code to minimize network round trips. Enable browser caching for assets that will not change with different browser sessions. For example, the website's logo can be cached locally, on the browser. Server-side caching can be enabled, too. Service workers can act as caching agents and store content for offline use.
reduce the number of HTTP redirects, as each redirect restarts the fetch process. Cross-origin redirects are worst-case: DNS, TCP, new HTTP request etc.
reduce DNS lookup (unresolved domain names block requests).
speed up DNS resolution. The latest report comparing different DNS providers can be found here.
spriting: images are put together in sprites.
domain sharding: when loading a large number of resources, a performance boost can be obtained by increasing the number of simultaneously downloaded resources for a particular website by using multiple domains. Remember the HTTP/1.x limitation of a fixed number of parallel connections per host. By sharding resources across two domains, e.g. "i1.images.random.domain.com" and "i2.images.random.domain.com", the resources are expected to download in about half the time. Consider the performance cost of the extra DNS lookups, though.
PRPL (Push, Render, Pre-cache, Lazy load) is an acronym that describes a pattern used to make web pages load, and become interactive, faster.
- Push (or preload) the most important resources by using resource hint "preload". Do note that resource hints are just hints, the browser decides whether to execute them (can ignore or execute hints partially). Modern browsers are quite good at prioritizing resources, so use preload resource hint sparingly and only preload the most critical resources.
- Lazy load other routes and non-critical assets by deferring these resources for a later time.
As you can see, there are a plethora of optimization techniques. But none of them is better than HTTP/2, the latest version of HTTP network protocol which aims to fix most of the optimization issues.
Multiplexing is the simultaneous transmission of several messages along a single channel of communication. So, in HTTP/2, there is only one single connection. What used to be a connection in HTTP/1.x is now called a stream, and all streams share that single connection. Streams are split into frames (headers in one frame, data in another). When one stream is blocked, another stream can take over the connection and make effective use of what would have been idle time. This behaviour is achieved through frame interleaving.
Courtesy of Ilya Grigorik
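Frame interleaving can be sketched as a round-robin scheduler sharing one connection between streams (the stream IDs and frame names below are simplified, and real HTTP/2 also weighs stream priorities):

```python
from collections import deque

def interleave(streams):
    """Round-robin frame interleaving: one frame from each non-empty stream
    per pass, all sharing a single connection ("the wire")."""
    queues = {sid: deque(frames) for sid, frames in streams.items()}
    wire = []
    while any(queues.values()):
        for sid, q in queues.items():
            if q:
                wire.append((sid, q.popleft()))
    return wire

# Two streams multiplexed over a single connection:
frames = interleave({1: ["HEADERS", "DATA"], 3: ["HEADERS", "DATA", "DATA"]})
print(frames)
# [(1, 'HEADERS'), (3, 'HEADERS'), (1, 'DATA'), (3, 'DATA'), (3, 'DATA')]
```

Neither stream has to wait for the other to finish, which is what eliminates HTTP/1.x-style head-of-line blocking at the HTTP layer.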
Additionally, HTTP/2 uses HPACK header compression, which also provides a performance boost. For example, imagine hundreds of requests, each carrying a few KBs of headers: that adds up to hundreds of KBs. Unnecessary metadata adds up quickly.
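The core idea of HPACK — replacing well-known header fields with small indices from a table both sides share — can be illustrated with a toy encoder (this is not the real HPACK wire format, though the indices shown do match entries in RFC 7541's static table):

```python
# A tiny slice of HPACK's static table: common header field/value pairs
# are known to both client and server, so only an index needs to be sent.
STATIC_TABLE = {
    (":method", "GET"): 2,
    (":path", "/"): 4,
    (":scheme", "https"): 7,
    ("accept-encoding", "gzip, deflate"): 16,
}

def encode(headers):
    """Emit a small integer for known pairs, the literal pair otherwise."""
    return [STATIC_TABLE.get(h, h) for h in headers]

encoded = encode([(":method", "GET"), (":path", "/"), ("user-agent", "demo")])
print(encoded)  # [2, 4, ('user-agent', 'demo')]
```

Real HPACK also maintains a dynamic table of recently seen headers and Huffman-codes the literals, so repeated headers across requests shrink to almost nothing.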
By using HTTP/2, some of the optimization patterns become anti-patterns. For example, by using the multiplexing feature of HTTP/2 you can request hundreds of resources in parallel using a single TCP connection, which obsoletes bundling of resources. Bundling is not necessary anymore and it can make things worse, by invalidating the cache of a giant bundle if only a tiny part of the bundle changes. HTTP/2 removes the need for domain sharding because you can request as many resources as you need. Image spriting becomes an anti-pattern too.
At the time of writing this article, HTTP/2 is used by 50.2% of all websites. HTTP/2 needs to be backwards compatible with HTTP/1.x.
The browser renders the website on screen
When a browser receives the HTML response for a page from the server, there is a sequence of required steps before pixels are drawn on the screen. This sequence, which the browser goes through for the initial paint of the page, is called the Critical Rendering Path. The browser renders the page progressively, as each complete node is processed, which means we see content in the order it is discovered.
The browser begins parsing the HTML, converting the received bytes to the Document Object Model (DOM). Some requests are blocking, which means the parsing of the rest of the HTML is halted until the imported asset is handled. The browser continues to parse the HTML making requests and building the DOM, until it gets to the end, at which point it constructs the CSS object model (CSSOM). With the DOM and CSSOM complete, the browser builds the render tree, computing the styles for all the visible content. After the render tree is complete, layout occurs, defining the location and size of all the render tree elements. Once the layout is complete, the page is rendered, or 'painted' on the screen. The last step is compositing, displaying elements on the right layers.
The gold standards for understanding the Critical Rendering Path are the Udacity course Website Performance Optimization, Google's Critical Rendering Path guide, and MDN's Critical Rendering Path article. These resources don't mention anything about compositing, so I went with the other Udacity course, Browser Rendering Optimization, which does.
Once I did more research about compositing, I included this step too in this diagram I created:
Document Object Model (DOM)
DOM construction is incremental. The HTML response turns into bytes, then characters, which are turned into tokens. These tokens are transformed into nodes that turn into the DOM Tree. This process is explained much better here or here. In short, DOM is a tree structure that captures the content and properties of the HTML and all the relationships between the nodes.
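The tokens-to-nodes step can be sketched with Python's standard-library HTML tokenizer (the dictionary-based node representation is made up for illustration, not a real DOM API):

```python
from html.parser import HTMLParser

class DomBuilder(HTMLParser):
    """Minimal DOM builder: turns start/end-tag tokens into a tree of nodes."""
    def __init__(self):
        super().__init__()
        self.root = {"tag": "#document", "children": []}
        self.stack = [self.root]                # open elements, innermost last

    def handle_starttag(self, tag, attrs):
        node = {"tag": tag, "children": []}
        self.stack[-1]["children"].append(node)  # attach under current parent
        self.stack.append(node)

    def handle_endtag(self, tag):
        if len(self.stack) > 1:
            self.stack.pop()                     # close the current element

builder = DomBuilder()
builder.feed("<html><body><p>Hello</p></body></html>")
html = builder.root["children"][0]
print(html["tag"], [c["tag"] for c in html["children"]])  # html ['body']
```

The tree mirrors the nesting of the markup: parent-child relationships fall directly out of the order in which start and end tags are tokenized.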
CSS Object Model (CSSOM)
Like the DOM, the CSSOM is a tree structure, this time capturing the styles that apply to the page. The render tree then captures both content and styles: the DOM and CSSOM trees are combined into the render tree. To construct it, the browser checks every node, starting from the root of the DOM tree, and determines which CSS rules are attached.
The render tree only captures visible content. The head section (generally) does not contain any visible information and is therefore not included in the render tree. If there is display: none; set on an element, neither it nor any of its descendants are in the render tree.
Courtesy of Ilya Grigorik
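The display: none rule can be illustrated by filtering a toy DOM tree into a render tree (dictionary-based nodes, purely illustrative):

```python
def build_render_tree(node):
    """Copy a DOM subtree, dropping any display:none element together with
    all of its descendants - just as the real render tree does."""
    if node.get("style", {}).get("display") == "none":
        return None                               # invisible subtree: prune it
    children = [build_render_tree(c) for c in node.get("children", [])]
    return {"tag": node["tag"],
            "children": [c for c in children if c is not None]}

dom = {"tag": "body", "children": [
    {"tag": "p", "children": []},
    {"tag": "div", "style": {"display": "none"},
     "children": [{"tag": "span", "children": []}]},
]}
print([c["tag"] for c in build_render_tree(dom)["children"]])  # ['p']
```

The hidden div and its span never reach layout or paint, which is why display: none is "free" after the style step, unlike visibility: hidden, which keeps the box in layout.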
Once the render tree is built, the layout becomes possible. The layout is dependent on the size of the viewport. The layout step determines where and how the elements are positioned on the page, determining the width and height of each element, and where they are in relation to each other.
The viewport meta tag defines the width of the layout viewport, impacting the layout. Without it, the browser uses the default viewport width, which is generally 960px on desktop browsers. A typical mobile-optimized site contains something like the following:
<meta name="viewport" content="width=device-width">
With this tag, the width will be the width of the device instead of the default viewport width. The device width changes when a user rotates their phone between landscape and portrait mode, and layout happens every time a device is rotated or the browser is otherwise resized.
Once the render tree is created and layout occurs, each node in the render tree can be converted to actual pixels on the screen. This step is often referred to as "painting". This drawing is typically done onto layers.
Compositing is where the browser puts the individual layers of the page together. This requires layer management to ensure we have the right layers, and in the correct order. Otherwise, there might be layers overlapping each other (one element appearing on top of another). All the other steps happened on the CPU, whereas layers will be uploaded to the GPU. The GPU will be instructed to put the pictures up on the screen. An example of compositing would be a newsletter pop-up in the front form and a grey overlay over the rest of the page.
The time required to perform render tree construction, layout, and paint varies based on the size of the document, the applied styles, and the device it is running on. The larger the document, the more work the browser has. The more complicated the styles, the more time painting takes as well (for example, a solid colour is "cheap" to paint, while a drop shadow is "expensive" to compute and render).
The page is finally visible in the viewport! To see the Critical Rendering Path, you can inspect the Chrome DevTools, Performance tab.
Q: If you have a Flexbox container on the page and you resize the screen to become larger, what steps from Critical Rendering Path are involved?
A: If there is a resize handler that changes styles, or if a media query breakpoint is hit, the browser has to recalculate styles. Otherwise, the element styles are already known, and the new sizes are applied through the layout step. If the browser runs layout, it must also paint the elements in their new positions on the page, and then composite them together.
Modern browsers try to refresh the content on screen in sync with a device's refresh rate. For most devices today, the screen refreshes 60 times a second, or 60Hz. For anything that involves movement or on-screen finger interactions (such as scrolling, transitions, animations, drag and drop, pinching, etc.), the browser should produce 60 frames per second to match the refresh rate. Jank is any stuttering, juddering or plain halting that users see when a site or application is not keeping up with the refresh rate. Jank is the result of frames taking too long for the browser to produce, and it negatively impacts your users and how they experience your site or app. If you are aiming for smooth animations, transitions, and silky scrolling, remember that each of these frames has a budget of about 16.7ms (1 second / 60 = 16.66ms). As browsers have housekeeping work to do, this budget shrinks to about 10ms. Miss that budget and animations, transitions, and scrolling will appear janky.
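The frame-budget arithmetic above can be spelled out directly (the 6.7ms housekeeping overhead below is an assumed figure for illustration, not a browser constant):

```python
def frame_budget_ms(refresh_hz, browser_overhead_ms=6.7):
    """Per-frame time budget at a given refresh rate, minus an assumed
    allowance for the browser's own housekeeping work."""
    return 1000 / refresh_hz - browser_overhead_ms

print(round(1000 / 60, 2))            # 16.67 - raw ms per frame at 60Hz
print(round(frame_budget_ms(60), 1))  # 10.0  - what's left for your own work
```

On a 120Hz display the same arithmetic halves the raw budget, which is why high-refresh devices are even less forgiving of slow frames.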
Any step of the Critical Rendering Path can introduce jank.
Based on the Critical Rendering Path, there are a few optimization rules we can notice:
- incremental HTML delivery, as parsing HTML is not script/render-blocking (e.g., the Google search results page immediately displays the header and then the results).
- minimize the number of critical resources: eliminate them, defer their download, or use the async attribute on scripts that need to be in the header, such as analytics scripts. Remember that deferring means scripts wait to execute until after the document has been parsed, while async scripts download in the background while the document is being parsed. defer and async are good optimization techniques even with HTTP/2.
- CSS is a render-blocking resource (the browser must download and parse CSS files before it can show the page). Get it to the client as soon and as quickly as possible to optimize the time to first render. So, inline the critical CSS; the remainder of the CSS can be loaded asynchronously. Note that this technique remains a good recommendation even with HTTP/2.
- separate deferred CSS into non-blocking requests by using media queries. If, for example, we have a stylesheet with a media attribute that targets printers, and we are viewing the page on a screen, that resource will not be considered render-blocking.
- not all CSS is created equal. Some CSS properties have more wide-reaching consequences than others. Your CSS should trigger the least amount of work possible, which means avoiding paint and layout whenever possible. Check CSS Triggers, an especially useful resource for determining the amount of work your CSS will trigger. E.g.: margin-left triggers layout, paint, and compositing; color triggers paint and compositing; transform triggers just compositing. Transforms and opacity are by far the best properties to change because they can be handled by the compositor alone if the element has its own layer. Remember that if you change a "layout" property, you get an expensive browser reflow, as the browser has to check all the other elements, repaint them, and composite them again.
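The property-to-pipeline mapping above can be captured in a small lookup table (a rough sketch based on csstriggers.com; exact behaviour varies per browser engine):

```python
# Which rendering steps a change to each property triggers (illustrative subset).
TRIGGERS = {
    "margin-left": ["layout", "paint", "composite"],  # most expensive
    "color":       ["paint", "composite"],
    "transform":   ["composite"],                     # compositor-only
    "opacity":     ["composite"],
}

def cheapest(properties):
    """Pick the property whose change triggers the fewest pipeline steps."""
    return min(properties, key=lambda p: len(TRIGGERS[p]))

print(cheapest(["margin-left", "color", "transform"]))  # transform
```

This is why animating transform or opacity stays smooth where animating margin-left janks: the former never leave the compositor.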
As a rule of thumb, be wary of micro-optimizations, as all too often these provide theoretical savings but with real costs.
Of course, there are things I missed (TTFB, Page Load Time, etc.) or misinterpreted. Now, could I have made a 10-article series instead of writing a short book? Absolutely! My purpose for this article was to create a resource that is easy to search and easy to update. I hope this article saves you some hours of researching on this topic.