<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: HTTP Archive</title>
    <description>The latest articles on DEV Community by HTTP Archive (@httparchive).</description>
    <link>https://dev.to/httparchive</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Forganization%2Fprofile_image%2F1881%2F2d7eb2b5-31d8-4263-a339-7c3229e24edb.jpg</url>
      <title>DEV Community: HTTP Archive</title>
      <link>https://dev.to/httparchive</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/httparchive"/>
    <language>en</language>
    <item>
      <title>Querying parsed HTML in BigQuery</title>
      <dc:creator>Rick Viscomi</dc:creator>
      <pubDate>Fri, 26 May 2023 16:05:19 +0000</pubDate>
      <link>https://dev.to/httparchive/querying-parsed-html-in-bigquery-4ia2</link>
      <guid>https://dev.to/httparchive/querying-parsed-html-in-bigquery-4ia2</guid>
      <description>&lt;p&gt;A longstanding problem in the &lt;a href="https://httparchive.org/"&gt;HTTP Archive&lt;/a&gt; dataset has been extracting insights from blobs of HTML in BigQuery. For example, take the source code of example.com:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight html"&gt;&lt;code&gt;&lt;span class="cp"&gt;&amp;lt;!doctype html&amp;gt;&lt;/span&gt;
&lt;span class="nt"&gt;&amp;lt;html&amp;gt;&lt;/span&gt;
&lt;span class="nt"&gt;&amp;lt;head&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;title&amp;gt;&lt;/span&gt;Example Domain&lt;span class="nt"&gt;&amp;lt;/title&amp;gt;&lt;/span&gt;

    &lt;span class="nt"&gt;&amp;lt;meta&lt;/span&gt; &lt;span class="na"&gt;charset=&lt;/span&gt;&lt;span class="s"&gt;"utf-8"&lt;/span&gt; &lt;span class="nt"&gt;/&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;meta&lt;/span&gt; &lt;span class="na"&gt;http-equiv=&lt;/span&gt;&lt;span class="s"&gt;"Content-type"&lt;/span&gt; &lt;span class="na"&gt;content=&lt;/span&gt;&lt;span class="s"&gt;"text/html; charset=utf-8"&lt;/span&gt; &lt;span class="nt"&gt;/&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;meta&lt;/span&gt; &lt;span class="na"&gt;name=&lt;/span&gt;&lt;span class="s"&gt;"viewport"&lt;/span&gt; &lt;span class="na"&gt;content=&lt;/span&gt;&lt;span class="s"&gt;"width=device-width, initial-scale=1"&lt;/span&gt; &lt;span class="nt"&gt;/&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;style &lt;/span&gt;&lt;span class="na"&gt;type=&lt;/span&gt;&lt;span class="s"&gt;"text/css"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;&lt;span class="o"&gt;...&lt;/span&gt;&lt;span class="nt"&gt;&amp;lt;/style&amp;gt;&lt;/span&gt;    
&lt;span class="nt"&gt;&amp;lt;/head&amp;gt;&lt;/span&gt;

&lt;span class="nt"&gt;&amp;lt;body&amp;gt;&lt;/span&gt;
&lt;span class="nt"&gt;&amp;lt;div&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;h1&amp;gt;&lt;/span&gt;Example Domain&lt;span class="nt"&gt;&amp;lt;/h1&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;p&amp;gt;&lt;/span&gt;This domain is for use in illustrative examples in documents. You may use this
    domain in literature without prior coordination or asking for permission.&lt;span class="nt"&gt;&amp;lt;/p&amp;gt;&lt;/span&gt;
    &lt;span class="nt"&gt;&amp;lt;p&amp;gt;&amp;lt;a&lt;/span&gt; &lt;span class="na"&gt;href=&lt;/span&gt;&lt;span class="s"&gt;"https://www.iana.org/domains/example"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;More information...&lt;span class="nt"&gt;&amp;lt;/a&amp;gt;&amp;lt;/p&amp;gt;&lt;/span&gt;
&lt;span class="nt"&gt;&amp;lt;/div&amp;gt;&lt;/span&gt;
&lt;span class="nt"&gt;&amp;lt;/body&amp;gt;&lt;/span&gt;
&lt;span class="nt"&gt;&amp;lt;/html&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you wanted to extract the link text in the last paragraph, you could do something relatively straightforward like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="c1"&gt;// 'More information...'&lt;/span&gt;
&lt;span class="nb"&gt;document&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;querySelector&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;p:last-child a&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nx"&gt;textContent&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;But in BigQuery, we don't have the luxury of the &lt;code&gt;document&lt;/code&gt; object, &lt;code&gt;querySelector&lt;/code&gt;, or &lt;code&gt;textContent&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Instead, we've had to resort to unwieldy regular expressions like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="o"&gt;#&lt;/span&gt; &lt;span class="s1"&gt;'More information...'&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt;
  &lt;span class="n"&gt;REGEXP_EXTRACT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;html&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="s1"&gt;'&amp;lt;p&amp;gt;&amp;lt;a[^&amp;gt;]*&amp;gt;([^&amp;lt;]*)&amp;lt;/a&amp;gt;&amp;lt;/p&amp;gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;link_text&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt;
  &lt;span class="n"&gt;body&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It looks like it works, but it's brittle.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;What if there's text or whitespace between the elements?&lt;/li&gt;
&lt;li&gt;What if there are attributes on the paragraph?&lt;/li&gt;
&lt;li&gt;What if there's another p&amp;gt;a element pair earlier in the page?&lt;/li&gt;
&lt;li&gt;What if the page uses uppercase tag names?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It goes on and on.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Using regular expressions to parse HTML seems like a good idea at first, but it quickly becomes a nightmare as the inputs grow increasingly unpredictable.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;To avoid this headache in HTTP Archive analyses, we've resorted to &lt;a href="https://github.com/HTTPArchive/custom-metrics"&gt;custom metrics&lt;/a&gt;: JavaScript snippets executed on each page at runtime. This approach has been really effective, enabling us to analyze both the fully rendered page and the static HTML. But custom metrics have one big limitation: &lt;em&gt;they only work at runtime&lt;/em&gt;. So if we want to change the code or analyze an older dataset, we're out of luck.&lt;/p&gt;
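The shape of such a custom metric can be sketched as follows. This is illustrative only, not HTTP Archive's actual code: `linkTextMetric`, its `doc` parameter, and the stub document are all invented for the example.

```javascript
// Illustrative sketch of a custom metric, not HTTP Archive's actual code.
// In the real crawl this logic runs against the page's global `document`;
// here `doc` is a parameter so the sketch is self-contained.
function linkTextMetric(doc) {
  var link = doc.querySelector('p:last-child a');
  return link ? link.textContent : null;
}

// Minimal stand-in for a parsed document, for demonstration only.
var fakeDoc = {
  querySelector: function (selector) {
    if (selector === 'p:last-child a') {
      return { textContent: 'More information...' };
    }
    return null;
  }
};

console.log(linkTextMetric(fakeDoc)); // 'More information...'
```

In the crawl, the value a custom metric returns is stored alongside the page's results, which is what makes it queryable later.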

&lt;h2&gt;Cheerio&lt;/h2&gt;

&lt;p&gt;While looking for a way to implement &lt;a href="https://github.com/rviscomi/capo.js"&gt;capo.js&lt;/a&gt; in BigQuery to understand how pages in HTTP Archive are ordered, I came across the &lt;a href="https://cheerio.js.org/"&gt;Cheerio&lt;/a&gt; library, which is a jQuery-like interface over an HTML parser.&lt;/p&gt;

&lt;p&gt;It works &lt;em&gt;beautifully&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Kk7SRScE--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/qhdpvytcc4jzlx3kt16d.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Kk7SRScE--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/qhdpvytcc4jzlx3kt16d.png" alt="Screenshot of a BigQuery query and result showing example.com being analyzed with the CAPO custom function." width="800" height="705"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;To be able to use Cheerio in BigQuery, I first needed to build a JavaScript binary that I could load into a UDF. The post &lt;a href="https://asyncq.com/using-npm-library-in-google-bigquery-udf"&gt;How To Use NPM Library in Google BigQuery UDF&lt;/a&gt; was a big help. I installed the Cheerio library locally and built it into a script with an exposed &lt;code&gt;cheerio&lt;/code&gt; global variable using Webpack.&lt;/p&gt;
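For anyone reproducing this step, a minimal sketch of what such a Webpack configuration might look like; the entry file, filename, and option values are assumptions, not the exact configuration used here.

```javascript
// Hypothetical webpack.config.js sketch for bundling Cheerio into a single
// script that exposes a global `cheerio` variable. Paths and options are
// assumptions; adapt them to your local install.
module.exports = {
  mode: 'production',
  entry: './index.js', // e.g. a file containing: module.exports = require('cheerio');
  output: {
    filename: 'cheerio.js',
    library: 'cheerio',     // name of the exposed global
    libraryTarget: 'var',   // emit `var cheerio = ...` at the top of the bundle
  },
};
```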

&lt;p&gt;I uploaded the script to HTTP Archive's Google Cloud Storage bucket. Then in BigQuery, I was able to side-load the script into the UDF with &lt;a href="https://cloud.google.com/bigquery/docs/user-defined-functions#including-javascript-libraries"&gt;OPTIONS&lt;/a&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;OPTIONS&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;library&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'gs://httparchive/lib/cheerio.js'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;From there, the UDF was able to reference the cheerio object to parse the HTML input and generate the results. You can see it in action at &lt;a href="https://github.com/rviscomi/capo.js/tree/main/bigquery"&gt;capo.sql&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;Querying HTML in BigQuery&lt;/h3&gt;

&lt;p&gt;Here's a full demo of the example.com link text solution in action:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;DECLARE&lt;/span&gt; &lt;span class="n"&gt;example_html&lt;/span&gt; &lt;span class="n"&gt;STRING&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
&lt;span class="k"&gt;SET&lt;/span&gt; &lt;span class="n"&gt;example_html&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="se"&gt;''&lt;/span&gt;&lt;span class="s1"&gt;
&amp;lt;!doctype html&amp;gt;
&amp;lt;html&amp;gt;
&amp;lt;head&amp;gt;
    &amp;lt;title&amp;gt;Example Domain&amp;lt;/title&amp;gt;

    &amp;lt;meta charset="utf-8" /&amp;gt;
    &amp;lt;meta http-equiv="Content-type" content="text/html; charset=utf-8" /&amp;gt;
    &amp;lt;meta name="viewport" content="width=device-width, initial-scale=1" /&amp;gt;
    &amp;lt;style type="text/css"&amp;gt;...&amp;lt;/style&amp;gt;    
&amp;lt;/head&amp;gt;

&amp;lt;body&amp;gt;
&amp;lt;div&amp;gt;
    &amp;lt;h1&amp;gt;Example Domain&amp;lt;/h1&amp;gt;
    &amp;lt;p&amp;gt;This domain is for use in illustrative examples in documents. You may use this
    domain in literature without prior coordination or asking for permission.&amp;lt;/p&amp;gt;
    &amp;lt;p&amp;gt;&amp;lt;a href="https://www.iana.org/domains/example"&amp;gt;More information...&amp;lt;/a&amp;gt;&amp;lt;/p&amp;gt;
&amp;lt;/div&amp;gt;
&amp;lt;/body&amp;gt;
&amp;lt;/html&amp;gt;
&lt;/span&gt;&lt;span class="se"&gt;''&lt;/span&gt;&lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TEMP&lt;/span&gt; &lt;span class="k"&gt;FUNCTION&lt;/span&gt; &lt;span class="n"&gt;getLinkText&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;html&lt;/span&gt; &lt;span class="n"&gt;STRING&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;RETURNS&lt;/span&gt; &lt;span class="n"&gt;STRING&lt;/span&gt; &lt;span class="k"&gt;LANGUAGE&lt;/span&gt; &lt;span class="n"&gt;js&lt;/span&gt;
&lt;span class="k"&gt;OPTIONS&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;library&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'gs://httparchive/lib/cheerio.js'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="se"&gt;''&lt;/span&gt;&lt;span class="s1"&gt;
try {
  const $ = cheerio.load(html);
  return $('&lt;/span&gt;&lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="k"&gt;last&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;child&lt;/span&gt; &lt;span class="n"&gt;a&lt;/span&gt;&lt;span class="s1"&gt;').text();
} catch (e) {
  return null;
}
&lt;/span&gt;&lt;span class="se"&gt;''&lt;/span&gt;&lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;

&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;getLinkText&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;example_html&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;link_text&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;🔗 &lt;a href="https://console.cloud.google.com/bigquery?sq=226352634162:d0993fbf625d4fe986284e437d123c9a"&gt;Try it on BigQuery&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The results show it working as expected:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--MMke1p6m--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/bq7atbe47wui69js4km5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--MMke1p6m--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/bq7atbe47wui69js4km5.png" alt="Query results" width="800" height="536"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;Limitations&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--JuQgfYPE--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/rtxgedubjtupsktnaxge.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--JuQgfYPE--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/rtxgedubjtupsktnaxge.png" alt="Cheerio screenshot as blazingly fast and incredibly efficient" width="658" height="776"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Cheerio is marketed as fast and efficient, but if you try to parse every HTML response body in HTTP Archive, the query will still fail.&lt;/p&gt;

&lt;p&gt;Fully built, the library is 331 KB. And because the entire HTML document must be held in memory to be parsed, it consumes a lot of memory for large blobs.&lt;/p&gt;

&lt;p&gt;To minimize the chances of OOM errors and speed up the query, one thing you can do is pare down the HTML to the area of interest using only the most basic regular expressions. Since the capo script is only concerned with the &lt;code&gt;&amp;lt;head&amp;gt;&lt;/code&gt; element, I grabbed everything up to the closing &lt;code&gt;&amp;lt;/head&amp;gt;&lt;/code&gt; tag:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="n"&gt;httparchive&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;fn&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;CAPO&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="n"&gt;REGEXP_EXTRACT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;response_body&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="s1"&gt;'(?i)(.*&amp;lt;/head&amp;gt;)'&lt;/span&gt;
  &lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If there are no natural "breakpoints" in the document for your use case, you could also consider restricting the input to a certain character length like &lt;code&gt;WHERE LENGTH(response_body) &amp;lt; 1000&lt;/code&gt;. The query will work and it'll run more quickly, but the results will be biased towards smaller pages.&lt;/p&gt;

&lt;p&gt;Also, some documents may fail to parse at all, throwing exceptions. I added &lt;code&gt;try&lt;/code&gt;/&lt;code&gt;catch&lt;/code&gt; blocks to the UDF to intercept any exceptions and return &lt;code&gt;null&lt;/code&gt; instead.&lt;/p&gt;

&lt;p&gt;That also means your query needs to handle &lt;code&gt;null&lt;/code&gt; values. For example, to get the first &lt;code&gt;&amp;lt;head&amp;gt;&lt;/code&gt; element from the results, I needed to use &lt;a href="https://cloud.google.com/bigquery/docs/reference/standard-sql/functions-and-operators#array_subscript_operator"&gt;&lt;code&gt;SAFE_OFFSET&lt;/code&gt;&lt;/a&gt; instead of plain old &lt;code&gt;OFFSET&lt;/code&gt; to avoid breaking the query: &lt;code&gt;elements[SAFE_OFFSET(0)]&lt;/code&gt;.&lt;/p&gt;
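To illustrate the difference with a sketch (the `results` table and `elements` array column here are hypothetical):

```sql
-- Sketch with a hypothetical `results` table and `elements` ARRAY column.
-- elements[OFFSET(0)] raises an error when the index is out of range
-- (e.g. an empty array); SAFE_OFFSET returns NULL instead, so rows where
-- the UDF produced nothing don't abort the whole query.
SELECT
  elements[SAFE_OFFSET(0)] AS first_head_element
FROM
  results
```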

&lt;h2&gt;Wrapping up&lt;/h2&gt;

&lt;p&gt;Cheerio is a really powerful new tool in the HTTP Archive toolbox. It unlocks new types of analysis that used to be prohibitively complex. In the capo.sql use case, I was able to extract insights about pages' &lt;code&gt;&amp;lt;head&amp;gt;&lt;/code&gt; elements that would otherwise have been possible only with custom metrics on future datasets.&lt;/p&gt;

&lt;p&gt;I'm really interested to see what new insights are possible with this approach. Let me know your thoughts in the comments and how you plan to use it.&lt;/p&gt;

</description>
      <category>webdev</category>
      <category>webtransparency</category>
      <category>httparchive</category>
    </item>
    <item>
      <title>Introducing the Core Web Vitals Technology Report</title>
      <dc:creator>Rick Viscomi</dc:creator>
      <pubDate>Thu, 24 Jun 2021 16:02:56 +0000</pubDate>
      <link>https://dev.to/httparchive/introducing-the-core-web-vitals-technology-report-4pep</link>
      <guid>https://dev.to/httparchive/introducing-the-core-web-vitals-technology-report-4pep</guid>
      <description>&lt;p&gt;The technologies you use to build your website can have an effect on your ability to deliver good user experiences. Good UX is key to performing well with &lt;a href="https://web.dev/vitals/" rel="noopener noreferrer"&gt;Core Web Vitals&lt;/a&gt; (CWV), a topic which is probably top of mind for you, as it is for many other web developers now that these metrics play a role in &lt;a href="https://developers.google.com/search/blog/2021/04/more-details-page-experience" rel="noopener noreferrer"&gt;Google Search&lt;/a&gt; ranking. While web developers have had tools like &lt;a href="https://support.google.com/webmasters/answer/9205520?hl=en" rel="noopener noreferrer"&gt;Search Console&lt;/a&gt; and &lt;a href="https://developers.google.com/speed/pagespeed/insights/" rel="noopener noreferrer"&gt;PageSpeed Insights&lt;/a&gt; to get data on how their sites are performing, the web community has been lacking a tool that has operated at the macro level, giving us something more like &lt;em&gt;WebSpeed Insights&lt;/em&gt;. By combining the powers of real-user experiences in the &lt;a href="https://developers.google.com/web/tools/chrome-user-experience-report/" rel="noopener noreferrer"&gt;Chrome UX Report&lt;/a&gt; (CrUX) dataset with web technology detections in &lt;a href="https://httparchive.org/" rel="noopener noreferrer"&gt;HTTP Archive&lt;/a&gt;, we can get a glimpse into how architectural decisions like choices of CMS platform or JavaScript framework play a role in sites' CWV performance. The merger of these datasets is a dashboard called the &lt;strong&gt;&lt;a href="https://datastudio.google.com/reporting/55bc8fad-44c2-4280-aa0b-5f3f0cd3d2be/page/M6ZPC" rel="noopener noreferrer"&gt;Core Web Vitals Technology Report&lt;/a&gt;&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frlyxhpwt6e9cmahcbmdw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frlyxhpwt6e9cmahcbmdw.png" alt="Chart comparing three CMSs' Core Web Vitals performance over time"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This dashboard was developed for the web community to have a shared source of truth for the way websites are both built and experienced. For example, the CWV Technology Report can tell you what percentage of websites built with WordPress pass the CWV assessment. While a number like this on its own is interesting, what's more useful is the ability to track this over time and compare it to other CMSs. And that's exactly what the dashboard offers; it's an interactive way to view how websites perform, broken down by nearly 2,000 technologies.&lt;/p&gt;

&lt;p&gt;This post is a show-and-tell. First I'd like to walk you through the dashboard and show you how to use it, then I'll tell you more about the data methodology behind it.&lt;/p&gt;

&lt;h2&gt;Using the dashboard&lt;/h2&gt;

&lt;p&gt;There are three pages in the dashboard:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Technology drilldown&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Technology comparison&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Settings&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The &lt;a href="https://datastudio.google.com/s/n22l8A5YSJQ" rel="noopener noreferrer"&gt;drilldown&lt;/a&gt; page lets you see how desktop and mobile experiences change over time for a single technology. The default metric is the percent of origins having good CWV, and it also supports individual CWV metrics (see the "Optional metrics" section below).&lt;/p&gt;

&lt;p&gt;The &lt;a href="https://datastudio.google.com/s/pz3Fr3C-Hqw" rel="noopener noreferrer"&gt;comparison&lt;/a&gt; page lets you compare desktop OR mobile experiences for any number of technologies over time. Similar to the drilldown page, you can select overall CWV compliance or individual CWV metrics. Additionally, this page supports visualizing the number of origins per technology.&lt;/p&gt;

&lt;p&gt;The &lt;a href="https://datastudio.google.com/s/mLfzLSvQza0" rel="noopener noreferrer"&gt;settings&lt;/a&gt; page is where you can configure report-level preferences. There are currently two settings: categories and number of origins. Refer to &lt;a href="https://www.wappalyzer.com/technologies/" rel="noopener noreferrer"&gt;Wappalyzer&lt;/a&gt; for the list of possible categories. Use this setting to limit the related technologies in the dropdown list. You can also restrict the technologies to those with a minimum level of adoption, for example those used by at least 100 websites. This can be helpful to reduce noisiness.&lt;/p&gt;

&lt;p&gt;By default, the CWV Technology Report is configured to drill down into WordPress performance and compare WordPress, Wix, and Squarespace. This is to demonstrate the kinds of insights that are possible out-of-the-box without having to know how to configure the dashboard yourself. The full URL for the vanilla version of the dashboard is &lt;a href="https://datastudio.google.com/reporting/55bc8fad-44c2-4280-aa0b-5f3f0cd3d2be/page/M6ZPC" rel="noopener noreferrer"&gt;https://datastudio.google.com/reporting/55bc8fad-44c2-4280-aa0b-5f3f0cd3d2be/page/M6ZPC&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;Optional metrics&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe6zqr3wqghl5t4cfefkb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fe6zqr3wqghl5t4cfefkb.png" alt="Screenshot of the dashboard showing where to find the optional metrics button"&gt;&lt;/a&gt;&lt;br&gt;
&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fptao8ay2hs60r3f9l1j9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fptao8ay2hs60r3f9l1j9.png" alt="Screenshot of the options in the optional metrics menu"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You can also use the "optional metrics" feature of Data Studio to customize the dashboard and select specific CWV stats in the charts/tables as needed. The icon that looks like a chart with a gear icon is the button to select optional metrics. In the timeseries chart, you can toggle between the percent of origins having good CWV overall or specifically those with good LCP, FID, or CLS. On the table views, you can use this feature to add or remove columns, for example to see all CWV metrics separately or to focus on just one.&lt;/p&gt;

&lt;p&gt;Data Studio also enables you to share deep links into the dashboard for specific configurations. For example, here's a &lt;a href="https://datastudio.google.com/s/mmMyzuJS4hw" rel="noopener noreferrer"&gt;leaderboard of the top 10 most popular CMSs&lt;/a&gt; ordered by CWV performance as of May 2021:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Date&lt;/th&gt;
&lt;th&gt;Technology&lt;/th&gt;
&lt;th&gt;Origins&lt;/th&gt;
&lt;th&gt;Percent good CWV&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;May 2021&lt;/td&gt;
&lt;td&gt;1C-Bitrix&lt;/td&gt;
&lt;td&gt;35,385&lt;/td&gt;
&lt;td&gt;56.30%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;May 2021&lt;/td&gt;
&lt;td&gt;TYPO3 CMS&lt;/td&gt;
&lt;td&gt;24,060&lt;/td&gt;
&lt;td&gt;54.39%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;May 2021&lt;/td&gt;
&lt;td&gt;Drupal&lt;/td&gt;
&lt;td&gt;115,280&lt;/td&gt;
&lt;td&gt;45.11%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;May 2021&lt;/td&gt;
&lt;td&gt;Zendesk&lt;/td&gt;
&lt;td&gt;34,713&lt;/td&gt;
&lt;td&gt;43.26%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;May 2021&lt;/td&gt;
&lt;td&gt;Weebly&lt;/td&gt;
&lt;td&gt;15,920&lt;/td&gt;
&lt;td&gt;33.35%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;May 2021&lt;/td&gt;
&lt;td&gt;Squarespace&lt;/td&gt;
&lt;td&gt;60,316&lt;/td&gt;
&lt;td&gt;33.32%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;May 2021&lt;/td&gt;
&lt;td&gt;Joomla&lt;/td&gt;
&lt;td&gt;44,459&lt;/td&gt;
&lt;td&gt;32.19%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;May 2021&lt;/td&gt;
&lt;td&gt;Wix&lt;/td&gt;
&lt;td&gt;54,604&lt;/td&gt;
&lt;td&gt;31.52%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;May 2021&lt;/td&gt;
&lt;td&gt;Adobe Experience Manager&lt;/td&gt;
&lt;td&gt;15,276&lt;/td&gt;
&lt;td&gt;27.65%&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;May 2021&lt;/td&gt;
&lt;td&gt;WordPress&lt;/td&gt;
&lt;td&gt;1,731,010&lt;/td&gt;
&lt;td&gt;24.53%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Here are some other configurations to help you explore the data:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://datastudio.google.com/s/rBB9VQCCZGI" rel="noopener noreferrer"&gt;Leaderboard of all 14 technologies that are in the CMS or Blogs categories having more than 10,000 origins&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="[https://datastudio.google.com/s/otsNSd-4WyA](https://datastudio.google.com/s/otsNSd-4WyA)"&gt;Comparison of all JavaScript frameworks and libraries with "jQuery" in their name&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="[https://datastudio.google.com/s/r4hBpzxX1uk](https://datastudio.google.com/s/r4hBpzxX1uk)"&gt;A year-to-date drilldown into React CWV performance&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h3&gt;Feature roadmap&lt;/h3&gt;

&lt;p&gt;There are two features missing from the dashboard that I would love to add in the near future: segmenting by &lt;a href="https://developers.google.com/web/updates/2021/03/crux-rank-magnitude" rel="noopener noreferrer"&gt;CrUX rank magnitude&lt;/a&gt; and comparing Lighthouse audit compliance. Origin popularity would be a really interesting way to slice the data and the rank magnitude dimension would enable us to see how technology adoption and CWV performance change at the head, torso, and tail of the web. Adding data from Lighthouse would enable us to get some clues into &lt;em&gt;why&lt;/em&gt; a particular technology may be better or worse with CWV. For example, if a group of websites tend to have poor LCP performance, it'd be interesting to see what loading performance audits they also tend to fail. Of course there are so many variables at play and we can't determine cause and effect, but these results could give us something to think about for further exploration.&lt;/p&gt;
&lt;h2&gt;Methodology&lt;/h2&gt;

&lt;p&gt;The CWV Technology Report is a combination of two data sources: CrUX and HTTP Archive. They are similar datasets in that they measure millions of websites, but they have their own strengths and weaknesses worth exploring.&lt;/p&gt;

&lt;p&gt;CrUX is a &lt;em&gt;field tool&lt;/em&gt;, meaning that it measures real-user experiences. It's also a public dataset, so you can see how users experience any one of over 8 million websites. Loosely put, this is really cool because we (as a community) have visibility into how the web as a whole is being experienced.&lt;/p&gt;

&lt;p&gt;The CrUX dataset is powered by Chrome users who enable &lt;a href="https://www.google.com/chrome/browser/privacy/whitepaper.html#usagestats" rel="noopener noreferrer"&gt;usage statistics reporting&lt;/a&gt;. Their experiences on publicly discoverable websites are aggregated together over 28-day windows, and the results are published in queryable monthly data dumps on BigQuery and via the CrUX API, updated daily. CrUX measures users' experiences for each of the CWV metrics: &lt;a href="https://web.dev/lcp/" rel="noopener noreferrer"&gt;LCP&lt;/a&gt;, &lt;a href="https://web.dev/fid/" rel="noopener noreferrer"&gt;FID&lt;/a&gt;, and &lt;a href="https://web.dev/cls/" rel="noopener noreferrer"&gt;CLS&lt;/a&gt;. Using this data, we can evaluate whether a website passes the CWV assessment: it passes if at least 75 percent of experiences for each metric are as good as or better than the thresholds set by the Web Vitals program.&lt;/p&gt;
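As a hedged sketch of what that check looks like in SQL: the table and column names below follow the public CrUX summary tables on BigQuery but should be treated as assumptions and verified against the current schema. The thresholds (at most 2,500 ms for LCP, 100 ms for FID, and 0.1 for CLS) come from the Web Vitals program.

```sql
-- Hedged sketch, not a production query: the table and column names are
-- assumptions based on the public CrUX materialized summary tables.
SELECT
  origin,
  p75_lcp BETWEEN 0 AND 2500                             -- LCP: at most 2.5 s
    AND p75_fid BETWEEN 0 AND 100                        -- FID: at most 100 ms
    AND SAFE_CAST(p75_cls AS FLOAT64) BETWEEN 0 AND 0.1  -- CLS: at most 0.1 (may be stored as a string)
    AS passes_cwv
FROM
  `chrome-ux-report.materialized.metrics_summary`
WHERE
  origin = 'https://httparchive.org'
  AND date = '2021-05-01'
```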

&lt;p&gt;HTTP Archive is a &lt;em&gt;lab tool&lt;/em&gt;, meaning that it measures how individual web pages are built. Like CrUX, it's a public dataset, and it's actually based on the same websites in the CrUX corpus, so we have perfect parity when combining the two sources together. HTTP Archive is powered by WebPageTest, which integrates with other lab tools like &lt;a href="https://developers.google.com/web/tools/lighthouse" rel="noopener noreferrer"&gt;Lighthouse&lt;/a&gt; and &lt;a href="https://www.wappalyzer.com/" rel="noopener noreferrer"&gt;Wappalyzer&lt;/a&gt; to extract fine-grained data about the page. Lighthouse runs audits against the page to determine how well-optimized it is, for example if it takes advantage of web performance best practices. Wappalyzer is an open-source tool that detects the use of technologies like an entire CMS, a specific JavaScript library, and even what programming languages are probably used on the backend. These detections are what we use in the CWV Technology Report to segment the real-user experience data from CrUX.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8edmxd2fjew7f69j8xlp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8edmxd2fjew7f69j8xlp.png" alt="Screenshot of various Ecommerce technologies' CWV performance"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Confession time! This isn't the first tool to look at CrUX data through the lens of how websites are built. &lt;a href="https://perf-track.web.app/" rel="noopener noreferrer"&gt;Perf Track&lt;/a&gt; is a report built by &lt;a href="https://twitter.com/hdjirdeh" rel="noopener noreferrer"&gt;Houssein Djirdeh&lt;/a&gt; that slices CrUX data by JavaScript frameworks. The annual &lt;a href="https://almanac.httparchive.org/en/2020/cms" rel="noopener noreferrer"&gt;CMS chapter&lt;/a&gt; of the Web Almanac slices CrUX data by (you guessed it) CMSs. What makes the CWV Technology Dashboard different is that it facilitates exploration of the data by making &lt;em&gt;all&lt;/em&gt; &lt;a href="https://www.wappalyzer.com/technologies/" rel="noopener noreferrer"&gt;1,950 technologies&lt;/a&gt; across 71 categories discoverable in a single, browseable UI. You can choose your own adventure by filtering technologies to a single category, like &lt;a href="https://datastudio.google.com/s/ibJhd2NEEkM" rel="noopener noreferrer"&gt;Ecommerce&lt;/a&gt;, and comparing platforms head-to-head to see which has more websites passing the CWV assessment.&lt;/p&gt;

&lt;p&gt;The CrUX dataset on BigQuery is aggregated at the origin level. An origin is a way to identify an entire website. For example, &lt;a href="https://httparchive.org" rel="noopener noreferrer"&gt;https://httparchive.org&lt;/a&gt; is the origin for the HTTP Archive website and it's different from &lt;a href="https://almanac.httparchive.org" rel="noopener noreferrer"&gt;https://almanac.httparchive.org&lt;/a&gt;, which is a separate origin for the Web Almanac website.&lt;/p&gt;

&lt;p&gt;HTTP Archive measures individual web pages, not entire websites. And due to capacity limitations, HTTP Archive is limited to testing one page per website. The most natural page to test for a given website is its home page, or the root page of the origin. For example, the home/root page of the HTTP Archive website is &lt;a href="https://httparchive.org/" rel="noopener noreferrer"&gt;https://httparchive.org/&lt;/a&gt; (note the trailing slash). This introduces an important assumption that we make in the CWV Technology Dashboard: an entire website's real-user experiences are attributed to the technologies detected only on its home page. It's entirely possible that many websites we test use different technologies on their interior pages, and some technologies may even be more or less likely to be used on home pages. These biases are worth acknowledging in the methodology for full transparency, but to be honest there's not a lot we at HTTP Archive can do to mitigate them without becoming a full-blown web crawler!&lt;/p&gt;
&lt;h3&gt;
  Core Web Vitals
&lt;/h3&gt;

&lt;p&gt;There are different approaches to measuring how well a website or group of websites performs with CWV. The approach used by this dashboard is designed to most closely match the &lt;a href="https://developers.google.com/speed/docs/insights/v5/about#categories" rel="noopener noreferrer"&gt;CWV assessment in PageSpeed Insights&lt;/a&gt;. CWV metrics and thresholds may change annually, but we'll do our best to keep the dashboard in sync with the state of the art.&lt;/p&gt;

&lt;p&gt;Each individual CWV metric has a threshold below which user experiences are considered "good". For example, LCP experiences under 2.5 seconds are good. A website must have at least 75% of its LCP experiences in the "good" category to be considered as having good LCP overall. If all of the CWV metrics are good, the website is said to pass the CWV assessment. Refer to the &lt;a href="https://web.dev/vitals/#core-web-vitals" rel="noopener noreferrer"&gt;official CWV documentation&lt;/a&gt; for the latest guidance on the set of metrics and thresholds.&lt;/p&gt;

&lt;p&gt;FID is an exception worth mentioning. Because it relies on user input to be measured, it doesn't occur on as many page loads as metrics like LCP and CLS. That makes it less likely to have sufficient data for pages that may not have many interactive UI elements or websites with low popularity. So the CWV Technology Dashboard replicates the behavior in PageSpeed Insights and assesses a website's CWV, even in the absence of FID data. In that case, if LCP and CLS are good, the website passes, otherwise it doesn't. In the rare case that a website is missing LCP or CLS data, it's not eligible to be assessed at all.&lt;/p&gt;
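&lt;p&gt;The assessment logic above can be condensed into a few lines. This is only an illustrative Python sketch, not the dashboard's actual implementation (the real logic is in the SQL query shared later in this post):&lt;/p&gt;

```python
def is_good(good, needs_improvement, poor):
    """At least 75% of experiences for a metric must be 'good'."""
    total = good + needs_improvement + poor
    return total > 0 and good / total >= 0.75

def passes_cwv(lcp, fid, cls):
    """Each argument is a (good, needs_improvement, poor) tuple of
    experience counts, or None when CrUX has no data for that metric."""
    # Without LCP or CLS data, the origin is not eligible for assessment.
    if lcp is None or cls is None:
        return None
    # FID is the exception: missing FID data does not fail the assessment.
    fid_ok = fid is None or is_good(*fid)
    return fid_ok and is_good(*lcp) and is_good(*cls)

# An origin with good LCP and CLS but no FID data still passes:
print(passes_cwv(lcp=(80, 15, 5), fid=None, cls=(90, 5, 5)))  # True
```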

&lt;p&gt;When evaluating a group of origins, like those in the dashboard that all use the same technology, we quantify them in terms of the percentage of origins that pass the CWV assessment. This is not to be confused with the percentage of users or the percentage of experiences. Origins are aggregated in CrUX in a way that doesn't make it meaningful to combine their distributions together. So instead, we count origins as a unit: those that use jQuery, pass the CWV assessment, have sufficient FID data, have good LCP, etc.&lt;/p&gt;

&lt;p&gt;The CrUX dataset includes a &lt;code&gt;form_factor&lt;/code&gt; dimension representing the type of device the user was on. We segment all of the data in the dashboard by this dimension and call it the "Client", with values of either desktop or mobile.&lt;/p&gt;
&lt;h3&gt;
  Querying the raw data
&lt;/h3&gt;

&lt;p&gt;The dashboard is implemented in Data Studio with a BigQuery connector to power all of the technology and CWV insights. The underlying table on BigQuery is made publicly available at &lt;a href="https://console.cloud.google.com/bigquery?p=httparchive&amp;amp;d=core_web_vitals&amp;amp;t=technologies&amp;amp;page=table"&gt;&lt;code&gt;httparchive.core_web_vitals.technologies&lt;/code&gt;&lt;/a&gt;. Feel free to query this table directly to extract information about specific technology trends, or even to build your own custom dashboards or visualizations.&lt;/p&gt;

&lt;p&gt;For reference, this is the query that generated the &lt;code&gt;core_web_vitals.technologies&lt;/code&gt; table:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TEMP&lt;/span&gt; &lt;span class="k"&gt;FUNCTION&lt;/span&gt; &lt;span class="n"&gt;IS_GOOD&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;good&lt;/span&gt; &lt;span class="n"&gt;FLOAT64&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;needs_improvement&lt;/span&gt; &lt;span class="n"&gt;FLOAT64&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;poor&lt;/span&gt; &lt;span class="n"&gt;FLOAT64&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;RETURNS&lt;/span&gt; &lt;span class="nb"&gt;BOOL&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="n"&gt;good&lt;/span&gt; &lt;span class="o"&gt;/&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;good&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;needs_improvement&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;poor&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="mi"&gt;75&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="k"&gt;CREATE&lt;/span&gt; &lt;span class="k"&gt;TEMP&lt;/span&gt; &lt;span class="k"&gt;FUNCTION&lt;/span&gt; &lt;span class="n"&gt;IS_NON_ZERO&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;good&lt;/span&gt; &lt;span class="n"&gt;FLOAT64&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;needs_improvement&lt;/span&gt; &lt;span class="n"&gt;FLOAT64&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;poor&lt;/span&gt; &lt;span class="n"&gt;FLOAT64&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;RETURNS&lt;/span&gt; &lt;span class="nb"&gt;BOOL&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="n"&gt;good&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;needs_improvement&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;poor&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
&lt;span class="p"&gt;);&lt;/span&gt;

&lt;span class="k"&gt;WITH&lt;/span&gt; &lt;span class="n"&gt;unique_categories&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt;
  &lt;span class="n"&gt;ARRAY_AGG&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;DISTINCT&lt;/span&gt; &lt;span class="k"&gt;LOWER&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;category&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;categories&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt;
  &lt;span class="nv"&gt;`httparchive.technologies.2021_05_01_mobile`&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;SELECT&lt;/span&gt;
  &lt;span class="nb"&gt;date&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;ARRAY_TO_STRING&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ARRAY_AGG&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;DISTINCT&lt;/span&gt; &lt;span class="n"&gt;category&lt;/span&gt; &lt;span class="k"&gt;IGNORE&lt;/span&gt; &lt;span class="n"&gt;NULLS&lt;/span&gt; &lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;category&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="s1"&gt;', '&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;categories&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;app&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;DISTINCT&lt;/span&gt; &lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;origins&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;DISTINCT&lt;/span&gt; &lt;span class="n"&gt;IF&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;good_fid&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;origins_with_good_fid&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;DISTINCT&lt;/span&gt; &lt;span class="n"&gt;IF&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;good_cls&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;origins_with_good_cls&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;DISTINCT&lt;/span&gt; &lt;span class="n"&gt;IF&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;good_lcp&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;origins_with_good_lcp&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;DISTINCT&lt;/span&gt; &lt;span class="n"&gt;IF&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;any_fid&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;origins_with_any_fid&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;DISTINCT&lt;/span&gt; &lt;span class="n"&gt;IF&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;any_cls&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;origins_with_any_cls&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;DISTINCT&lt;/span&gt; &lt;span class="n"&gt;IF&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;any_lcp&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;origins_with_any_lcp&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;DISTINCT&lt;/span&gt; &lt;span class="n"&gt;IF&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;good_cwv&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;origins_with_good_cwv&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;DISTINCT&lt;/span&gt; &lt;span class="n"&gt;IF&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;any_lcp&lt;/span&gt; &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;any_cls&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;origins_eligible_for_cwv&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;SAFE_DIVIDE&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;COUNTIF&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;good_cwv&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="n"&gt;COUNTIF&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;any_lcp&lt;/span&gt; &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="n"&gt;any_cls&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;pct_eligible_origins_with_good_cwv&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="k"&gt;SELECT&lt;/span&gt;
    &lt;span class="nb"&gt;date&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;CONCAT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;origin&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'/'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;IF&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;device&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s1"&gt;'desktop'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'desktop'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'mobile'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;IS_NON_ZERO&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;fast_fid&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;avg_fid&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;slow_fid&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;any_fid&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;IS_GOOD&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;fast_fid&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;avg_fid&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;slow_fid&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;good_fid&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;IS_NON_ZERO&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;small_cls&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;medium_cls&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;large_cls&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;any_cls&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;IS_GOOD&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;small_cls&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;medium_cls&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;large_cls&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;good_cls&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;IS_NON_ZERO&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;fast_lcp&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;avg_lcp&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;slow_lcp&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;any_lcp&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;IS_GOOD&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;fast_lcp&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;avg_lcp&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;slow_lcp&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;good_lcp&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;IS_GOOD&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;fast_fid&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;avg_fid&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;slow_fid&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;OR&lt;/span&gt; &lt;span class="n"&gt;fast_fid&lt;/span&gt; &lt;span class="k"&gt;IS&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AND&lt;/span&gt;
    &lt;span class="n"&gt;IS_GOOD&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;small_cls&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;medium_cls&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;large_cls&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AND&lt;/span&gt;
    &lt;span class="n"&gt;IS_GOOD&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;fast_lcp&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;avg_lcp&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;slow_lcp&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;good_cwv&lt;/span&gt;
  &lt;span class="k"&gt;FROM&lt;/span&gt;
    &lt;span class="nv"&gt;`chrome-ux-report.materialized.device_summary`&lt;/span&gt;
  &lt;span class="k"&gt;WHERE&lt;/span&gt;
    &lt;span class="nb"&gt;date&lt;/span&gt; &lt;span class="o"&gt;&amp;gt;=&lt;/span&gt; &lt;span class="s1"&gt;'2020-01-01'&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
  &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="k"&gt;DISTINCT&lt;/span&gt;
    &lt;span class="k"&gt;CAST&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;REGEXP_REPLACE&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;_TABLE_SUFFIX&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="s1"&gt;'(&lt;/span&gt;&lt;span class="se"&gt;\d&lt;/span&gt;&lt;span class="s1"&gt;)_(&lt;/span&gt;&lt;span class="se"&gt;\d&lt;/span&gt;&lt;span class="s1"&gt;{2})_(&lt;/span&gt;&lt;span class="se"&gt;\d&lt;/span&gt;&lt;span class="s1"&gt;{2}).*'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;r&lt;/span&gt;&lt;span class="s1"&gt;'202&lt;/span&gt;&lt;span class="se"&gt;\1&lt;/span&gt;&lt;span class="s1"&gt;-&lt;/span&gt;&lt;span class="se"&gt;\2&lt;/span&gt;&lt;span class="s1"&gt;-&lt;/span&gt;&lt;span class="se"&gt;\3&lt;/span&gt;&lt;span class="s1"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="nb"&gt;DATE&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="nb"&gt;date&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;IF&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;category&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="s1"&gt;''&lt;/span&gt; &lt;span class="k"&gt;AND&lt;/span&gt; &lt;span class="k"&gt;LOWER&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;category&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;IN&lt;/span&gt; &lt;span class="k"&gt;UNNEST&lt;/span&gt;&lt;span class="p"&gt;((&lt;/span&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;categories&lt;/span&gt; &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;unique_categories&lt;/span&gt;&lt;span class="p"&gt;)),&lt;/span&gt; &lt;span class="n"&gt;category&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;category&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;app&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;IF&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;ENDS_WITH&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;_TABLE_SUFFIX&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'desktop'&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="s1"&gt;'desktop'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s1"&gt;'mobile'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;AS&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;url&lt;/span&gt;
  &lt;span class="k"&gt;FROM&lt;/span&gt;
    &lt;span class="nv"&gt;`httparchive.technologies.202*`&lt;/span&gt;
  &lt;span class="k"&gt;WHERE&lt;/span&gt;
    &lt;span class="n"&gt;app&lt;/span&gt; &lt;span class="k"&gt;IS&lt;/span&gt; &lt;span class="k"&gt;NOT&lt;/span&gt; &lt;span class="k"&gt;NULL&lt;/span&gt; &lt;span class="k"&gt;AND&lt;/span&gt;
    &lt;span class="n"&gt;app&lt;/span&gt; &lt;span class="o"&gt;!=&lt;/span&gt; &lt;span class="s1"&gt;''&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;USING&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nb"&gt;date&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt;
  &lt;span class="nb"&gt;date&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;app&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;client&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
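&lt;p&gt;As a quick example of querying the table directly, here's a minimal query (a sketch using the column names produced by the query above) that tracks a single technology's trend over time:&lt;/p&gt;

```sql
-- Monthly percent of jQuery origins passing the CWV assessment,
-- split by desktop/mobile client.
SELECT
  date,
  client,
  origins,
  pct_eligible_origins_with_good_cwv
FROM
  `httparchive.core_web_vitals.technologies`
WHERE
  app = 'jQuery'
ORDER BY
  date, client
```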

&lt;p&gt;The most idealistic goal for this dashboard is to empower influencers in the web community to make improvements to swaths of websites at scale. Web transparency projects like this one are meant to inform and inspire, whether that's instilling a sense of competitiveness with related technologies to climb the leaderboard or giving maintainers actionable data to make meaningful improvements to the technologies under their control. Please leave a comment if you have any suggestions to help make the CWV Technology Report better!&lt;/p&gt;

</description>
      <category>webdev</category>
      <category>corewebvitals</category>
      <category>httparchive</category>
      <category>webtransparency</category>
    </item>
    <item>
      <title>What can the HTTP Archive tell us about Largest Contentful Paint?</title>
      <dc:creator>Paul Calvano</dc:creator>
      <pubDate>Tue, 08 Jun 2021 03:59:31 +0000</pubDate>
      <link>https://dev.to/httparchive/what-can-the-http-archive-tell-us-about-largest-contentful-paint-f0p</link>
      <guid>https://dev.to/httparchive/what-can-the-http-archive-tell-us-about-largest-contentful-paint-f0p</guid>
      <description>&lt;p&gt;&lt;a href="https://web.dev/lcp/" rel="noopener noreferrer"&gt;Largest Contentful Paint (LCP)&lt;/a&gt; is an important metric that measures when the largest element in the browser’s viewport becomes visible. This could be an image, a background image, a poster image for a video, or even a block of text. The metric is measured with the &lt;a href="https://wicg.github.io/largest-contentful-paint/" rel="noopener noreferrer"&gt;Largest Contentful Paint API&lt;/a&gt;, which is &lt;a href="https://caniuse.com/?search=largestcontentfulpaint" rel="noopener noreferrer"&gt;supported&lt;/a&gt; in Chromium browsers. Optimizing for this metric is critical to end user experience, since it affects their ability to visualize your content.&lt;/p&gt;

&lt;p&gt;Google has promoted this metric as one of the three &lt;a href="https://web.dev/vitals/" rel="noopener noreferrer"&gt;"Core Web Vitals"&lt;/a&gt; that affect user experience on the web. It is also slated to become a &lt;a href="https://developers.google.com/search/blog/2021/04/more-details-page-experience" rel="noopener noreferrer"&gt;search ranking signal over the next few weeks&lt;/a&gt;, which has created a lot of awareness about it. The suggested target for a good Largest Contentful Paint is less than 2.5 seconds for at least 75% of page loads.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fpaulcalvano.com%2Fassets%2Fimg%2Fblog%2Flcp-httparchive%2Fimage9.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fpaulcalvano.com%2Fassets%2Fimg%2Fblog%2Flcp-httparchive%2Fimage9.jpg" alt="Largest Contentful Paint Overview"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;small&gt;Source: &lt;a href="https://web.dev/lcp/" rel="noopener noreferrer"&gt;https://web.dev/lcp/&lt;/a&gt;&lt;/small&gt;&lt;/p&gt;

&lt;p&gt;Some of the recent posts on &lt;a href="https://wpostats.com/tags/core%20web%20vitals/" rel="noopener noreferrer"&gt;WPOStats&lt;/a&gt; feature interesting case studies about this metric. For example:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  Google's &lt;a href="https://blog.chromium.org/2020/05/the-science-behind-web-vitals.html" rel="noopener noreferrer"&gt;research&lt;/a&gt; found that when Core Web Vitals are met, users are 24% less likely to abandon a page before it finishes loading.&lt;/li&gt;
&lt;li&gt;  Vodafone improved LCP by 31% and saw an 8% increase in sales.&lt;/li&gt;
&lt;li&gt;  NDTV improved their LCP by 55% and saw a 50% reduction in bounce rate.&lt;/li&gt;
&lt;li&gt;  Tokopedia improved their LCP by 55% and saw a 23% increase in session duration.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Identifying the Largest Contentful Paint Element&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The name of this metric implies that size is used as a proxy for importance. Because of this, you may be wondering specifically which image or text triggered it as well as the percentage of the viewport it consumed. There are a few ways to examine this:&lt;/p&gt;

&lt;p&gt;One way to visualize the Largest Contentful Paint is to look at a &lt;a href="https://webpagetest.org/" rel="noopener noreferrer"&gt;WebPageTest&lt;/a&gt; filmstrip. You’ll be able to see when visual changes occurred (yellow outline) as well as when the Largest Contentful Paint event occurred (red outline).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fpaulcalvano.com%2Fassets%2Fimg%2Fblog%2Flcp-httparchive%2Fimage7.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fpaulcalvano.com%2Fassets%2Fimg%2Fblog%2Flcp-httparchive%2Fimage7.jpg" alt="WebPageTest Filmstrip showing LCP Element"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In Chrome DevTools, you can also click on the LCP indicator in the “Performance” tab to examine the Largest Contentful Paint element in your browser. Using this method you can see and inspect the exact element (image, text, etc) that triggered it.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fpaulcalvano.com%2Fassets%2Fimg%2Fblog%2Flcp-httparchive%2Fimage11.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fpaulcalvano.com%2Fassets%2Fimg%2Fblog%2Flcp-httparchive%2Fimage11.gif" alt="Chrome DevTools Performance Tab"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Lighthouse also has an audit that identifies the Largest Contentful Paint element. If you examine the screenshot below you’ll notice that there is a yellow box around the largest element, as well as an HTML snippet.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fpaulcalvano.com%2Fassets%2Fimg%2Fblog%2Flcp-httparchive%2Fimage5.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fpaulcalvano.com%2Fassets%2Fimg%2Fblog%2Flcp-httparchive%2Fimage5.jpg" alt="Lighthouse LCP Element"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How Large is the Largest Contentful Paint?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The &lt;a href="https://httparchive.org/" rel="noopener noreferrer"&gt;HTTP Archive&lt;/a&gt; runs Lighthouse audits for approximately 7.2 million websites every month. In the May 2021 dataset, Lighthouse was able to identify an LCP element in 97.35% of the tests. Since we have the ability to query all of these Lighthouse test results, we can analyze the result of the LCP audits and get more insight into what drives this metric across the web. &lt;/p&gt;

&lt;p&gt;Using the same boundaries that Lighthouse uses to draw the rectangle around the LCP element, it’s possible to calculate its area. In the above example, the product of the LCP image’s height (191) and width (340) was 64,940 pixels. Since the Lighthouse test was run with an emulated &lt;a href="https://almanac.httparchive.org/en/2020/methodology#webpagetest" rel="noopener noreferrer"&gt;Moto G4 user agent&lt;/a&gt; with a screen size of 640x360, we can also calculate that this particular LCP image took up 28% of the viewport.&lt;/p&gt;
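&lt;p&gt;As a quick sanity check, the viewport-share arithmetic above can be sketched in a few lines of Python (the element and viewport dimensions are the ones from this example):&lt;/p&gt;

```python
# Sketch of the viewport-share calculation: the LCP element's
# bounding-box area divided by the emulated Moto G4 viewport area.

def lcp_viewport_share(width, height, viewport_w=360, viewport_h=640):
    """Return the LCP element's area as a fraction of the viewport area."""
    return (width * height) / (viewport_w * viewport_h)

area = 340 * 191  # 64,940 pixels, as in the example above
share = lcp_viewport_share(340, 191)
print(f"{area} px is {share:.0%} of the 360x640 viewport")  # 64940 px is 28% of the 360x640 viewport
```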

&lt;p&gt;The graph below shows the cumulative distribution of the LCP element as a percentage of screen size. The median LCP element takes up 31% of the screen size! At the 75th percentile the LCP element is nearly twice as large, taking up 59% of the screen size. Additionally, 10.6% of sites had an LCP element that exceeded the viewport (which is why the y-axis doesn’t reach 100%).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fpaulcalvano.com%2Fassets%2Fimg%2Fblog%2Flcp-httparchive%2Fimage10.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fpaulcalvano.com%2Fassets%2Fimg%2Fblog%2Flcp-httparchive%2Fimage10.jpg" alt="Distribution of LCP Element Size as a Percent of Screen Size"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The graph below illustrates the same data as a histogram. From this we can see that 4.03% of sites (285,751) had an LCP element that took up 0 pixels. Upon further inspection, the 0-pixel elements appear to have been used in carousels, so by the time the audit completed, the LCP element had slid out of the viewport.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fpaulcalvano.com%2Fassets%2Fimg%2Fblog%2Flcp-httparchive%2Fimage3.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fpaulcalvano.com%2Fassets%2Fimg%2Fblog%2Flcp-httparchive%2Fimage3.jpg" alt="Histogram of LCP Element Size as a Percent of Screen Size"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Node Paths of LCP Elements&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Another interesting aspect of the Largest Contentful Paint audit is the nodePath of the element, which shows you where in the DOM this element was. In the example we looked at earlier, the nodePath was:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;1,HTML,1,BODY,8,DIV,2,SECTION,1,DIV,0,DIV,0,DIV,0,UL,0,LI,0,ARTICLE,1,DIV,0,DIV,0,A,0,IMG
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If we look at the last element in the node path, we can get some insight into the type of element that triggered the Largest Contentful Paint. The most common node that triggered the Largest Contentful Paint was &amp;lt;IMG&amp;gt;, which accounted for 42% of all sites. Next was &amp;lt;DIV&amp;gt; at 27% (which could include text or images). The &amp;lt;H1&amp;gt; through &amp;lt;H5&amp;gt; heading elements accounted for 7.18% of all Largest Contentful Paints.&lt;/p&gt;
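&lt;p&gt;A minimal sketch of how that last element can be pulled out of a nodePath string (the path is the example from above):&lt;/p&gt;

```python
# The nodePath alternates child indices and tag names, separated by commas,
# so the final comma-separated token is the LCP element's tag name.

def lcp_tag(node_path):
    """Return the tag of the last element in a Lighthouse nodePath."""
    return node_path.split(",")[-1]

path = ("1,HTML,1,BODY,8,DIV,2,SECTION,1,DIV,0,DIV,0,DIV,"
        "0,UL,0,LI,0,ARTICLE,1,DIV,0,DIV,0,A,0,IMG")
print(lcp_tag(path))  # IMG
```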

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
  &lt;tr&gt;
   &lt;td&gt;LCP Node (last element in path)
   &lt;/td&gt;
   &lt;td&gt;Number of Sites
   &lt;/td&gt;
   &lt;td&gt;Percent of Sites
   &lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
   &lt;td&gt;IMG
   &lt;/td&gt;
   &lt;td&gt;
&lt;p&gt;
3067354&lt;/p&gt;

   &lt;/td&gt;
   &lt;td&gt;
&lt;p&gt;
42.12%&lt;/p&gt;

   &lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
   &lt;td&gt;DIV
   &lt;/td&gt;
   &lt;td&gt;
&lt;p&gt;
1981416&lt;/p&gt;

   &lt;/td&gt;
   &lt;td&gt;
&lt;p&gt;
27.21%&lt;/p&gt;

   &lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
   &lt;td&gt;P
   &lt;/td&gt;
   &lt;td&gt;
&lt;p&gt;
766977&lt;/p&gt;

   &lt;/td&gt;
   &lt;td&gt;
&lt;p&gt;
10.53%&lt;/p&gt;

   &lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
   &lt;td&gt;H1
   &lt;/td&gt;
   &lt;td&gt;
&lt;p&gt;
291091&lt;/p&gt;

   &lt;/td&gt;
   &lt;td&gt;
&lt;p&gt;
4.00%&lt;/p&gt;

   &lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
   &lt;td&gt;
   &lt;/td&gt;
   &lt;td&gt;
&lt;p&gt;
192498&lt;/p&gt;

   &lt;/td&gt;
   &lt;td&gt;
&lt;p&gt;
2.64%&lt;/p&gt;

   &lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
   &lt;td&gt;SECTION
   &lt;/td&gt;
   &lt;td&gt;
&lt;p&gt;
182267&lt;/p&gt;

   &lt;/td&gt;
   &lt;td&gt;
&lt;p&gt;
2.50%&lt;/p&gt;

   &lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
   &lt;td&gt;H2
   &lt;/td&gt;
   &lt;td&gt;
&lt;p&gt;
144534&lt;/p&gt;

   &lt;/td&gt;
   &lt;td&gt;
&lt;p&gt;
1.98%&lt;/p&gt;

   &lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
   &lt;td&gt;A
   &lt;/td&gt;
   &lt;td&gt;
&lt;p&gt;
107501&lt;/p&gt;

   &lt;/td&gt;
   &lt;td&gt;
&lt;p&gt;
1.48%&lt;/p&gt;

   &lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
   &lt;td&gt;SPAN
   &lt;/td&gt;
   &lt;td&gt;
&lt;p&gt;
85245&lt;/p&gt;

   &lt;/td&gt;
   &lt;td&gt;
&lt;p&gt;
1.17%&lt;/p&gt;

   &lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
   &lt;td&gt;HEADER
   &lt;/td&gt;
   &lt;td&gt;
&lt;p&gt;
67762&lt;/p&gt;

   &lt;/td&gt;
   &lt;td&gt;
&lt;p&gt;
0.93%&lt;/p&gt;

   &lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
   &lt;td&gt;LI
   &lt;/td&gt;
   &lt;td&gt;
&lt;p&gt;
64212&lt;/p&gt;

   &lt;/td&gt;
   &lt;td&gt;
&lt;p&gt;
0.88%&lt;/p&gt;

   &lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
   &lt;td&gt;H3
   &lt;/td&gt;
   &lt;td&gt;
&lt;p&gt;
60679&lt;/p&gt;

   &lt;/td&gt;
   &lt;td&gt;
&lt;p&gt;
0.83%&lt;/p&gt;

   &lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
   &lt;td&gt;RS-SBG
   &lt;/td&gt;
   &lt;td&gt;
&lt;p&gt;
51623&lt;/p&gt;

   &lt;/td&gt;
   &lt;td&gt;
&lt;p&gt;
0.71%&lt;/p&gt;

   &lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
   &lt;td&gt;TD
   &lt;/td&gt;
   &lt;td&gt;
&lt;p&gt;
48470&lt;/p&gt;

   &lt;/td&gt;
   &lt;td&gt;
&lt;p&gt;
0.67%&lt;/p&gt;

   &lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
   &lt;td&gt;H4
   &lt;/td&gt;
   &lt;td&gt;
&lt;p&gt;
19039&lt;/p&gt;

   &lt;/td&gt;
   &lt;td&gt;
&lt;p&gt;
0.26%&lt;/p&gt;

   &lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
   &lt;td&gt;VIDEO
   &lt;/td&gt;
   &lt;td&gt;
&lt;p&gt;
15649&lt;/p&gt;

   &lt;/td&gt;
   &lt;td&gt;
&lt;p&gt;
0.21%&lt;/p&gt;

   &lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
   &lt;td&gt;ARTICLE
   &lt;/td&gt;
   &lt;td&gt;
&lt;p&gt;
12860&lt;/p&gt;

   &lt;/td&gt;
   &lt;td&gt;
&lt;p&gt;
0.18%&lt;/p&gt;

   &lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
   &lt;td&gt;FIGURE
   &lt;/td&gt;
   &lt;td&gt;
&lt;p&gt;
9208&lt;/p&gt;

   &lt;/td&gt;
   &lt;td&gt;
&lt;p&gt;
0.13%&lt;/p&gt;

   &lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
   &lt;td&gt;BODY
   &lt;/td&gt;
   &lt;td&gt;
&lt;p&gt;
8859&lt;/p&gt;

   &lt;/td&gt;
   &lt;td&gt;
&lt;p&gt;
0.12%&lt;/p&gt;

   &lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
   &lt;td&gt;image
   &lt;/td&gt;
   &lt;td&gt;
&lt;p&gt;
8077&lt;/p&gt;

   &lt;/td&gt;
   &lt;td&gt;
&lt;p&gt;
0.11%&lt;/p&gt;

   &lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
   &lt;td&gt;CENTER
   &lt;/td&gt;
   &lt;td&gt;
&lt;p&gt;
7960&lt;/p&gt;

   &lt;/td&gt;
   &lt;td&gt;
&lt;p&gt;
0.11%&lt;/p&gt;

   &lt;/td&gt;
  &lt;/tr&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The &amp;lt;VIDEO&amp;gt; element only accounted for 0.21% of sites. According to the Web Almanac, &lt;a href="https://almanac.httparchive.org/en/2020/media#videos" rel="noopener noreferrer"&gt;the &amp;lt;video&amp;gt; element was used on 0.49% of mobile websites&lt;/a&gt; - so from this we can estimate that roughly half of the sites loading videos are triggering LCP with video poster images.&lt;/p&gt;
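&lt;p&gt;That estimate is a back-of-envelope ratio of the two percentages; a quick sketch:&lt;/p&gt;

```python
# Back-of-envelope estimate: the share of video-using sites whose LCP is
# the video element is (sites with a VIDEO LCP node) / (sites using video).

video_lcp_pct = 0.21  # % of all sites with a VIDEO LCP node (this dataset)
video_use_pct = 0.49  # % of mobile sites using the video element (Web Almanac 2020)
print(f"{video_lcp_pct / video_use_pct:.0%}")  # 43%
```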

&lt;p&gt;&lt;strong&gt;Image Weight for the LCP&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;One of the Lighthouse audits looks for opportunities to preload the Largest Contentful Paint element and estimates the potential performance savings. This audit also identifies the URL of the LCP element, which gives us some insight into what types of images are being loaded as LCP elements. In the HTTP Archive data, only 67% of the Lighthouse tests were able to identify a URL for an LCP element. Based on this, we can infer that text nodes are used for the LCP on approximately 33% of sites.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fpaulcalvano.com%2Fassets%2Fimg%2Fblog%2Flcp-httparchive%2Fimage8.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fpaulcalvano.com%2Fassets%2Fimg%2Fblog%2Flcp-httparchive%2Fimage8.jpg" alt="Lighthouse Preload LCP Element Recommendation"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The graph below shows the distribution of sizes for the image element associated with the Largest Contentful Paint. The median LCP element size was 80KB. At the 90th percentile, the LCP element size was 512KB. If you have a large LCP image, you should consider optimizing it before attempting to follow the Lighthouse preload recommendation.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fpaulcalvano.com%2Fassets%2Fimg%2Fblog%2Flcp-httparchive%2Fimage12.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fpaulcalvano.com%2Fassets%2Fimg%2Fblog%2Flcp-httparchive%2Fimage12.jpg" alt="Distribution of LCP Element Size"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Additionally, 70% of the LCP element images were JPEG and 25% were PNG. Only 3% of sites served a WebP image as their LCP element.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
  &lt;tr&gt;
   &lt;td&gt;format
   &lt;/td&gt;
   &lt;td&gt;sites
   &lt;/td&gt;
   &lt;td&gt;% of Sites
   &lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
   &lt;td&gt;jpg
   &lt;/td&gt;
   &lt;td&gt;
&lt;p&gt;
3161991&lt;/p&gt;

   &lt;/td&gt;
   &lt;td&gt;
&lt;p&gt;
69.37%&lt;/p&gt;

   &lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
   &lt;td&gt;png
   &lt;/td&gt;
   &lt;td&gt;
&lt;p&gt;
1122585&lt;/p&gt;

   &lt;/td&gt;
   &lt;td&gt;
&lt;p&gt;
24.63%&lt;/p&gt;

   &lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
   &lt;td&gt;webp
   &lt;/td&gt;
   &lt;td&gt;
&lt;p&gt;
141441&lt;/p&gt;

   &lt;/td&gt;
   &lt;td&gt;
&lt;p&gt;
3.10%&lt;/p&gt;

   &lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
   &lt;td&gt;gif
   &lt;/td&gt;
   &lt;td&gt;
&lt;p&gt;
84829&lt;/p&gt;

   &lt;/td&gt;
   &lt;td&gt;
&lt;p&gt;
1.86%&lt;/p&gt;

   &lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
   &lt;td&gt;svg
   &lt;/td&gt;
   &lt;td&gt;
&lt;p&gt;
34123&lt;/p&gt;

   &lt;/td&gt;
   &lt;td&gt;
&lt;p&gt;
0.75%&lt;/p&gt;

   &lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
   &lt;td&gt;Other
   &lt;/td&gt;
   &lt;td&gt;
&lt;p&gt;
13272&lt;/p&gt;

   &lt;/td&gt;
   &lt;td&gt;
&lt;p&gt;
0.29%&lt;/p&gt;

   &lt;/td&gt;
  &lt;/tr&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;When we look at the LCP element as a percentage of page weight, we can see that the median LCP element is 4.17% of the total page weight. At the higher percentiles, the LCP elements are both larger in absolute terms and a larger percentage of page weight.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fpaulcalvano.com%2Fassets%2Fimg%2Fblog%2Flcp-httparchive%2Fimage1.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fpaulcalvano.com%2Fassets%2Fimg%2Fblog%2Flcp-httparchive%2Fimage1.jpg" alt="LCP Element as a Percent of Page Weight"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
  &lt;tr&gt;
   &lt;td&gt;percentile
   &lt;/td&gt;
   &lt;td&gt;ImageRequests
   &lt;/td&gt;
   &lt;td&gt;ImageKB
   &lt;/td&gt;
   &lt;td&gt;TotalKB
   &lt;/td&gt;
   &lt;td&gt;LCP as a % of Page Weight
   &lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
   &lt;td&gt;p25
   &lt;/td&gt;
   &lt;td&gt;
&lt;p&gt;
15&lt;/p&gt;

   &lt;/td&gt;
   &lt;td&gt;
&lt;p&gt;
422&lt;/p&gt;

   &lt;/td&gt;
   &lt;td&gt;
&lt;p&gt;
1,138&lt;/p&gt;

   &lt;/td&gt;
   &lt;td&gt;
&lt;p&gt;
3.01%&lt;/p&gt;

   &lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
   &lt;td&gt;p50
   &lt;/td&gt;
   &lt;td&gt;
&lt;p&gt;
26&lt;/p&gt;

   &lt;/td&gt;
   &lt;td&gt;
&lt;p&gt;
1,142&lt;/p&gt;

   &lt;/td&gt;
   &lt;td&gt;
&lt;p&gt;
2,185&lt;/p&gt;

   &lt;/td&gt;
   &lt;td&gt;
&lt;p&gt;
4.17%&lt;/p&gt;

   &lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
   &lt;td&gt;p75
   &lt;/td&gt;
   &lt;td&gt;
&lt;p&gt;
45&lt;/p&gt;

   &lt;/td&gt;
   &lt;td&gt;
&lt;p&gt;
2,692&lt;/p&gt;

   &lt;/td&gt;
   &lt;td&gt;
&lt;p&gt;
4,108&lt;/p&gt;

   &lt;/td&gt;
   &lt;td&gt;
&lt;p&gt;
5.58%&lt;/p&gt;

   &lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
   &lt;td&gt;p95
   &lt;/td&gt;
   &lt;td&gt;
&lt;p&gt;
103&lt;/p&gt;

   &lt;/td&gt;
   &lt;td&gt;
&lt;p&gt;
8,008&lt;/p&gt;

   &lt;/td&gt;
   &lt;td&gt;
&lt;p&gt;
10,036&lt;/p&gt;

   &lt;/td&gt;
   &lt;td&gt;
&lt;p&gt;
8.42%&lt;/p&gt;

   &lt;/td&gt;
  &lt;/tr&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Since images account for 52% of the median page weight (for the sites that have an LCP image element), we can infer that, at the median, roughly 8% of a page’s image weight is used to render content covering 31% of the screen.&lt;/p&gt;
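&lt;p&gt;That inference follows directly from the median row of the table above; a quick sketch of the arithmetic:&lt;/p&gt;

```python
# Reproducing the median-row arithmetic: what share of page weight is
# images, and what share of those image bytes the LCP image accounts for.

image_kb, total_kb = 1142, 2185   # p50 Image KB and Total KB from the table
lcp_pct_of_page = 4.17            # p50 "LCP as a % of Page Weight"

image_share = image_kb / total_kb                        # ~52% of page weight
lcp_share_of_images = lcp_pct_of_page / (100 * image_share)
print(f"{image_share:.0%} of page weight is images; "
      f"the LCP image is {lcp_share_of_images:.0%} of image bytes")
```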

&lt;p&gt;&lt;strong&gt;How does this change based on Site Popularity?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The HTTP Archive now contains rank groupings, obtained from the Chrome User Experience Report. This enables us to segment the analysis based on the popularity of sites. The rank grouping indicator buckets sites into the top 1K, 10K, 100K, 1 million, and 10 million.&lt;/p&gt;

&lt;p&gt;When we look at the Largest Contentful Paint image size based on popularity, it’s interesting to note that the most popular sites tend to be serving smaller images for the LCP element. While there may be numerous reasons for this, I suspect that the more popular sites are investing in image optimization solutions.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fpaulcalvano.com%2Fassets%2Fimg%2Fblog%2Flcp-httparchive%2Fimage2.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fpaulcalvano.com%2Fassets%2Fimg%2Fblog%2Flcp-httparchive%2Fimage2.jpg" alt="LCP Image Size by Site Popularity"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Page weight follows the same pattern, with the least popular websites having some of the largest page weights. If we look at the LCP element as a percentage of page weight, we can see that within the top 100K sites the ratios are very close. In the less popular sites, the LCP element tends to be a much greater percentage of page weight.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
  &lt;tr&gt;
   &lt;td colspan="5"&gt;
&lt;strong&gt;Largest Contentful Paint Image as a Percentage of Page Weight&lt;/strong&gt;
   &lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
   &lt;td&gt;
&lt;strong&gt;rank&lt;/strong&gt;
   &lt;/td&gt;
   &lt;td&gt;
&lt;strong&gt;p25&lt;/strong&gt;
   &lt;/td&gt;
   &lt;td&gt;
&lt;strong&gt;p50&lt;/strong&gt;
   &lt;/td&gt;
   &lt;td&gt;
&lt;strong&gt;p75&lt;/strong&gt;
   &lt;/td&gt;
   &lt;td&gt;
&lt;strong&gt;p95&lt;/strong&gt;
   &lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
   &lt;td&gt;Top 1k
   &lt;/td&gt;
   &lt;td&gt;
&lt;p&gt;
1.61%&lt;/p&gt;

   &lt;/td&gt;
   &lt;td&gt;
&lt;p&gt;
2.12%&lt;/p&gt;

   &lt;/td&gt;
   &lt;td&gt;
&lt;p&gt;
2.85%&lt;/p&gt;

   &lt;/td&gt;
   &lt;td&gt;
&lt;p&gt;
5.67%&lt;/p&gt;

   &lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
   &lt;td&gt;Top 10k
   &lt;/td&gt;
   &lt;td&gt;
&lt;p&gt;
1.76%&lt;/p&gt;

   &lt;/td&gt;
   &lt;td&gt;
&lt;p&gt;
2.27%&lt;/p&gt;

   &lt;/td&gt;
   &lt;td&gt;
&lt;p&gt;
3.00%&lt;/p&gt;

   &lt;/td&gt;
   &lt;td&gt;
&lt;p&gt;
4.96%&lt;/p&gt;

   &lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
   &lt;td&gt;Top 100k
   &lt;/td&gt;
   &lt;td&gt;
&lt;p&gt;
2.07%&lt;/p&gt;

   &lt;/td&gt;
   &lt;td&gt;
&lt;p&gt;
2.87%&lt;/p&gt;

   &lt;/td&gt;
   &lt;td&gt;
&lt;p&gt;
3.77%&lt;/p&gt;

   &lt;/td&gt;
   &lt;td&gt;
&lt;p&gt;
5.78%&lt;/p&gt;

   &lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
   &lt;td&gt;Top 1 million
   &lt;/td&gt;
   &lt;td&gt;
&lt;p&gt;
2.53%&lt;/p&gt;

   &lt;/td&gt;
   &lt;td&gt;
&lt;p&gt;
3.49%&lt;/p&gt;

   &lt;/td&gt;
   &lt;td&gt;
&lt;p&gt;
4.60%&lt;/p&gt;

   &lt;/td&gt;
   &lt;td&gt;
&lt;p&gt;
6.95%&lt;/p&gt;

   &lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
   &lt;td&gt;Top 10 million
   &lt;/td&gt;
   &lt;td&gt;
&lt;p&gt;
3.11%&lt;/p&gt;

   &lt;/td&gt;
   &lt;td&gt;
&lt;p&gt;
4.30%&lt;/p&gt;

   &lt;/td&gt;
   &lt;td&gt;
&lt;p&gt;
5.75%&lt;/p&gt;

   &lt;/td&gt;
   &lt;td&gt;
&lt;p&gt;
8.65%&lt;/p&gt;

   &lt;/td&gt;
  &lt;/tr&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;We can also make some interesting observations about how popular sites are optimizing their LCP assets. Looking at the various image formats, JPG images are the most common LCP element. Some other formats such as PNG, WebP, GIF and SVG are used more frequently in the more popular sites. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fpaulcalvano.com%2Fassets%2Fimg%2Fblog%2Flcp-httparchive%2Fimage6.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fpaulcalvano.com%2Fassets%2Fimg%2Fblog%2Flcp-httparchive%2Fimage6.jpg" alt="Largest Contentful Paint Element Format by Rank"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Conclusion&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Largest Contentful Paint is an important metric that helps illustrate when a page’s most significant content is rendered to the screen. In reviewing the HTTP Archive data, we can see that this area represents between 30% and 60% of a mobile viewport for a majority of sites.  &lt;/p&gt;

&lt;p&gt;A shocking number of sites have an LCP element that consumes a large percentage of the viewport and is delivered as a large, unoptimized image. Site owners should evaluate both what is triggering the Largest Contentful Paint and how it is loaded. Optimizing the Largest Contentful Paint ensures that the browser has the opportunity to load and render this content as quickly as possible.&lt;/p&gt;

&lt;p&gt;If you are interested in seeing some of the SQL queries and raw data used in this analysis, I’ve created a post with all the details in the &lt;a href="https://discuss.httparchive.org/t/analyzing-largest-contentful-paint-stats-via-lighthouse-audits/2166" rel="noopener noreferrer"&gt;HTTP Archive discussion forums&lt;/a&gt;. You can also see all the data used for these graphs in this &lt;a href="https://docs.google.com/spreadsheets/d/1fI_16nby3Yn1LHxWVd4QRyOuPqqMLmBvBU31l5kGF-8/edit?usp=sharing" rel="noopener noreferrer"&gt;Google Sheet&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Originally posted at &lt;a href="https://paulcalvano.com/2021-06-07-lcp-httparchive/" rel="noopener noreferrer"&gt;https://paulcalvano.com/2021-06-07-lcp-httparchive/&lt;/a&gt;&lt;/p&gt;

</description>
      <category>webperf</category>
      <category>webdev</category>
    </item>
    <item>
      <title>Correlation between Core Web Vitals and web characteristics</title>
      <dc:creator>Sixing Chen</dc:creator>
      <pubDate>Tue, 12 Jan 2021 22:46:17 +0000</pubDate>
      <link>https://dev.to/httparchive/correlation-between-core-web-vitals-and-web-characteristics-219f</link>
      <guid>https://dev.to/httparchive/correlation-between-core-web-vitals-and-web-characteristics-219f</guid>
      <description>&lt;h1&gt;
  
  
  Introduction
&lt;/h1&gt;

&lt;p&gt;&lt;a href="https://web.dev/vitals/" rel="noopener noreferrer"&gt;Core Web Vitals&lt;/a&gt; (CWV) are the metrics that Google considers to be the most important indicators of the quality of experience on the web. The process to identify and optimize CWV issues has typically been a reactive one. The decisions site owners make about which technologies to use or which metrics to look at are usually decided by trial and error, rather than empirical research. A site may be built or rebuilt using a new technology, only to discover that it creates UX issues in production.&lt;/p&gt;

&lt;p&gt;In this analysis, we analyze the correlation between CWV and many different types of web characteristics simultaneously, rather than a single type of characteristic in isolation, since web development choices are not made in a vacuum but in the context of the many parts of a website. We hope that these results will provide additional reference points to teams as they assess various web development choices, and we invite the community to help further the understanding of the interplay between CWV and web characteristics.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Notable negative associations with largest contentful paint:

&lt;ul&gt;
&lt;li&gt;TTFB, bytes of JavaScript, CSS, and images&lt;/li&gt;
&lt;li&gt;JavaScript frameworks - AngularJS, GSAP, MooTools, and RequireJS&lt;/li&gt;
&lt;li&gt;JavaScript libraries - Hammerjs, Lodash, momentjs, YUI, Zepto, jQueryUI, and prettyPhoto&lt;/li&gt;
&lt;li&gt;CMS - Joomla and Squarespace&lt;/li&gt;
&lt;li&gt;UI frameworks - animatecss&lt;/li&gt;
&lt;li&gt;Web frameworks - MicrosoftASPNet&lt;/li&gt;
&lt;li&gt;Widgets - FlexSlider and OWLCarousel&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;li&gt;Notable negative associations with cumulative layout shift:

&lt;ul&gt;
&lt;li&gt;Bytes of images&lt;/li&gt;
&lt;li&gt;JavaScript frameworks - AngularJS, Handlebars, and Vuejs&lt;/li&gt;
&lt;li&gt;JavaScript libraries - FancyBox, Hammerjs, Modernizr, and Slick&lt;/li&gt;
&lt;li&gt;Widgets - Flexslider and OWLCarousel&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;h1&gt;
  
  
  Methodology
&lt;/h1&gt;

&lt;h3&gt;
  
  
  Data source
&lt;/h3&gt;

&lt;p&gt;This analysis is based on data from &lt;a href="https://httparchive.org/" rel="noopener noreferrer"&gt;HTTP Archive&lt;/a&gt;. The HTTP Archive dataset is generated in a &lt;a href="https://developers.google.com/web/fundamentals/performance/speed-tools#lab_data" rel="noopener noreferrer"&gt;lab environment&lt;/a&gt; and contains detailed information on many characteristics of a website as well as performance data. Because it is lab-generated, HTTP Archive data will not be completely reflective of real usage, and it only allows us to analyze LCP (largest contentful paint) and CLS (cumulative layout shift), as we do not have any user input for FID (first input delay). However, an advantage of being lab-generated is that all data is gathered on a single set of hardware with no bias in the types of websites that are loaded, which shields us from confounding due to user/device characteristics that we do not measure. Although we are not shielded from all confounding between website characteristics and web performance, this choice leaves us with far less confounding than a user-generated dataset, where we often have no information on the user and only limited device information.&lt;/p&gt;

&lt;h3&gt;
  
  
  Web characteristics
&lt;/h3&gt;

&lt;p&gt;We conferred with domain experts and established a list of web characteristics of interest:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;TTFB, font requests, and bytes of content of various types&lt;/li&gt;
&lt;li&gt;Counts of various types of third party requests&lt;/li&gt;
&lt;li&gt;Web technologies (coded as binary to represent whether technology is used)

&lt;ul&gt;
&lt;li&gt;JavaScript frameworks&lt;/li&gt;
&lt;li&gt;JavaScript libraries&lt;/li&gt;
&lt;li&gt;CMS&lt;/li&gt;
&lt;li&gt;UI frameworks&lt;/li&gt;
&lt;li&gt;Web frameworks&lt;/li&gt;
&lt;li&gt;Widgets&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;

&lt;/ul&gt;

&lt;p&gt;These characteristics represent the ways that web pages are built and experienced. Pages may be built using various technologies like content management systems (CMSs), JavaScript libraries and frameworks, etc. According to the Web Almanac, &lt;a href="https://almanac.httparchive.org/en/2019/cms#fig-2" rel="noopener noreferrer"&gt;40%&lt;/a&gt; of websites are built with a CMS. That makes it a useful category to inspect qualitatively to see if there are meaningful correlations between CMS and CWV. On the other hand, we use quantitative metrics that represent how users experience the page, including performance and page weight data.&lt;/p&gt;

&lt;p&gt;The list of technologies we include in the analysis is only a subset of all technologies employed by sites in HTTP Archive. We have restricted the analysis to websites that employ only technologies used by at least 50,000 websites (this threshold amounts to about 1% of sites in HTTP Archive). This removes underused technologies for which we may not have sufficient data. The presence of certain technologies also overlaps highly, with the CMS Wix and the JavaScript library Zepto overlapping almost completely. Such high overlap creates modeling issues, so we have chosen to remove Wix from this analysis.&lt;/p&gt;
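&lt;p&gt;A hypothetical sketch of this restriction step (the technology names, counts, and sites are illustrative, not the real HTTP Archive tallies): first find the technologies that clear the usage threshold, then keep only the sites that use nothing outside that set.&lt;/p&gt;

```python
# Illustrative pruning step: technologies below `min_sites` usage are
# dropped, along with any site that depends on a dropped technology.

def allowed_tech(usage_counts, min_sites=50_000):
    """Return the set of technologies used by at least min_sites sites."""
    return {t for t, n in usage_counts.items() if n >= min_sites}

def keep_sites(sites, allowed):
    """sites maps site name to the set of technologies it uses."""
    return [s for s, techs in sites.items() if techs.issubset(allowed)]

counts = {"jQuery": 6_200_000, "AngularJS": 320_000, "SomeRareLib": 12_000}
sites = {"a.com": {"jQuery"}, "b.com": {"jQuery", "SomeRareLib"}}
print(keep_sites(sites, allowed_tech(counts)))  # ['a.com']
```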

&lt;h1&gt;
  
  
  Analysis
&lt;/h1&gt;

&lt;p&gt;With LCP and CLS as the outcomes and the web characteristic as the predictors, we attempt to model the relationship between the outcomes and the predictors through random forest. Random forest is a learning algorithm for both regression and classification based on a set of decision trees trained from a randomly chosen set of predictors and a bootstrap sample of the dataset.&lt;/p&gt;

&lt;p&gt;To assess the correlation between the outcome and each predictor as well as their individual effects on the outcome, we derived a measure of correlation (% of higher &amp;gt;= split mean, %HSM) and a measure of effect size (mean split difference, MSD). Both measures are based on the types of splits the trained decision trees make based on the predictors. See appendix for more details.&lt;/p&gt;

&lt;p&gt;%HSM is bounded between 0 and 1, with values close to 0 indicating negative correlation, values close to 1 indicating positive correlation, and values close to 0.5 indicating little correlation. MSD’s magnitude is not bounded, and a large positive value indicates that the predictor appears to contribute positively to the mean of the outcome. Note that positive here does not necessarily mean good; it is positive merely in the numerical sense.&lt;/p&gt;
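&lt;p&gt;As a rough illustration (not the authors’ exact derivation, which is in the appendix), both measures can be computed from per-split summaries, where each decision-tree split on a predictor records the mean outcome at the node and in its “low” and “high” children:&lt;/p&gt;

```python
# Toy sketch, assuming each split on a predictor is summarized as
# (node_mean, low_child_mean, high_child_mean). %HSM is the fraction of
# splits where the "higher" side's mean is at least the split's mean;
# MSD is the average difference between the high and low sides.

def hsm_and_msd(splits):
    hsm = sum(hi >= node for node, _, hi in splits) / len(splits)
    msd = sum(hi - lo for _, lo, hi in splits) / len(splits)
    return hsm, msd

# A predictor whose higher values consistently raise the outcome:
splits = [(5.0, 4.0, 6.0), (5.5, 5.0, 6.5), (4.8, 4.6, 5.6)]
hsm, msd = hsm_and_msd(splits)
print(hsm, round(msd, 2))  # 1.0 1.5
```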

&lt;h1&gt;
  
  
  Results
&lt;/h1&gt;

&lt;p&gt;Here, we present results on association and make note of specific characteristics that appear especially impactful on CWV.&lt;/p&gt;

&lt;p&gt;When interpreting these results on association, an important thing to note is that the positive or negative impact of a particular web characteristic should only be interpreted relative to that of other web characteristics, and in the context of websites that employ an array of web technologies, various types of content, and different third-party requests. For instance, if a given web technology shows a strong positive impact, it should be interpreted as this technology appearing to be good for performance relative to other technologies, not as a claim that adding this technology to a website will improve its performance.&lt;/p&gt;

&lt;h3&gt;
  
  
  LCP
&lt;/h3&gt;

&lt;p&gt;LCP is modeled as the log of its numerical value, so higher values are worse.&lt;/p&gt;

&lt;p&gt;A %HSM value close to 1 means that higher values of a numerical/count characteristic, or the presence of a technology, are strongly associated with higher values of LCP, and vice versa for predictors with %HSM close to 0 (high %HSM is worse).&lt;/p&gt;

&lt;p&gt;Likewise, a relatively large and positive MSD means that higher values of a numerical/count characteristic, or the presence of a technology, show a strong negative impact on LCP, and vice versa for predictors with a relatively large and negative MSD (a large positive MSD is worse).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F6vjbbgujk3165nz4uk7z.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F6vjbbgujk3165nz4uk7z.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;br&gt;
Higher values of TTFB, bytes of JavaScript, CSS and images show the strongest positive correlation with LCP and most negative impact, though TTFB is not always actionable.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F0220suv2c7f6sul2pihk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F0220suv2c7f6sul2pihk.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;br&gt;
In general, third-party requests do not show strong correlation with or impact on LCP in the context of the other predictors we consider. This result could be due to most websites in HTTP Archive having a fair number of third-party requests, so their effect could not be well ascertained.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fewt5mcmbf5q1abtv394m.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fewt5mcmbf5q1abtv394m.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;br&gt;
The presence of most JavaScript frameworks shows a strong positive correlation with LCP and a negative impact, with AMP as the exception. AngularJS, GSAP, MooTools, and RequireJS stand out the most.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F9wmk46jkl5vkas63y6u8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F9wmk46jkl5vkas63y6u8.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;br&gt;
As with JavaScript frameworks, the presence of most JavaScript libraries also shows a strong positive correlation with LCP and a negative impact. Hammerjs, Lodash, momentjs, YUI, and Zepto stand out in terms of both correlation and effect size, while jQueryUI and prettyPhoto stand out in terms of correlation.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fuq4dtfi7hdm56e03k7yd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fuq4dtfi7hdm56e03k7yd.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;br&gt;
Among CMSs, Joomla and Squarespace show a strong positive correlation with LCP and a negative impact. WordPress, on the other hand, shows low correlation and impact.&lt;/p&gt;

&lt;p&gt;Animatecss stands out among UI frameworks, and MicrosoftASPNet stands out among web frameworks.&lt;/p&gt;

&lt;p&gt;Among widgets, FlexSlider and OWLCarousel both show a strong positive correlation with LCP, and FlexSlider also shows a strong negative effect size.&lt;/p&gt;

&lt;h3&gt;
  
  
  CLS
&lt;/h3&gt;

&lt;p&gt;CLS is modeled as a binary indicator of whether a given threshold is met: 1 indicates a website has CLS &amp;lt; 0.1, and 0 otherwise, so 1s are better than 0s.&lt;/p&gt;
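&lt;p&gt;As a concrete illustration (a minimal sketch, not the actual modeling code), the binarization described above amounts to:&lt;/p&gt;

```python
# Hypothetical sketch of the CLS binarization described above:
# 1 if the site meets the "good" CLS threshold (< 0.1), 0 otherwise.
def cls_meets_threshold(cls_value, threshold=0.1):
    return 1 if cls_value < threshold else 0

print(cls_meets_threshold(0.05))  # 1: threshold met, which is better
print(cls_meets_threshold(0.25))  # 0: threshold not met
```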

&lt;p&gt;A %HSM value close to 1 means that higher values of a numerical/count characteristic, or the presence of a technology, are strongly associated with meeting the CLS threshold; the reverse holds for a %HSM close to 0 (a low %HSM is worse).&lt;/p&gt;

&lt;p&gt;Likewise, a relatively large and positive MSD means that higher values of a numerical/count characteristic, or the presence of a technology, have a strong positive impact on meeting the CLS threshold; the reverse holds for a relatively large and negative MSD (a large negative MSD is worse).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fe7o2o85n3uzkpfxldcv8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fe7o2o85n3uzkpfxldcv8.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;br&gt;
Most of these characteristics show only weak correlation with CLS compliance and low impact, except bytes of images, which shows a negative correlation with CLS compliance and a negative impact.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fb6iwbako258wqkw7og2b.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fb6iwbako258wqkw7og2b.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;br&gt;
As with LCP, third-party requests seem to have low correlation with and low impact on CLS compliance.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F1a2l3qn8qzt679ds9jti.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F1a2l3qn8qzt679ds9jti.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;br&gt;
The presence of several JavaScript frameworks shows a strong negative correlation with CLS compliance and a negative impact, while AMP, GSAP, and React show low correlation and impact. AngularJS, Handlebars, and Vuejs appear to have the most negative impact.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F6kpohh8mjr2vd9pktuz2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F6kpohh8mjr2vd9pktuz2.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;br&gt;
JavaScript libraries appear less harmful to CLS compliance than frameworks, though most still show a negative impact. FancyBox, Hammerjs, Modernizr, and Slick are the most notable.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fdb7onj4944i5pjnieti5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fdb7onj4944i5pjnieti5.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;br&gt;
None of the CMSs have a notable negative impact, with WordPress showing a fairly positive correlation.&lt;/p&gt;

&lt;p&gt;UI frameworks all show low impact. Among web frameworks, RubyonRails shows a fairly positive correlation with CLS compliance.&lt;/p&gt;

&lt;p&gt;Among widgets, FlexSlider and OWLCarousel both show a fairly negative impact on CLS compliance.&lt;/p&gt;

&lt;h1&gt;
  
  
  Conclusion
&lt;/h1&gt;

&lt;p&gt;This analysis is a first step toward a more comprehensive understanding of the impact of web characteristics on CWV. While the results point out strongly associated characteristics, the web community would benefit from delving further into the associations identified to ascertain which are truly causal and which are merely correlational, so as to better inform web developers. In the meantime, the web characteristics with strong negative correlations or effects should be seen as a signal of things that require more attention and/or planning. Finally, it would be of interest to refresh these analyses in the future to see whether the associations identified here still hold.&lt;/p&gt;

&lt;h1&gt;
  
  
  Appendix
&lt;/h1&gt;

&lt;p&gt;Random forests train decision trees by making binary splits of the data. Each split is based on a particular predictor X and takes the form X &amp;lt;= c vs. X &amp;gt; c, with the cutoff c chosen according to some purity criterion. All data points with X &amp;lt;= c go into one branch, and all data points with X &amp;gt; c into the other. The data points in each branch can then be split further on other predictors in the same way. The measures of correlation and effect size we use exploit these splits.&lt;/p&gt;

&lt;p&gt;Specifically, for a given predictor, we look at all splits based on that predictor. For each such split, we compute the mean outcome of the data points in the &amp;lt;= branch and in the &amp;gt; branch. %HSM (% of higher split mean) is the proportion of splits in which the outcome mean of the &amp;gt; branch is higher than that of the &amp;lt;= branch; it measures how frequently larger outcome means are associated with higher predictor values. MSD (mean split difference) is the outcome mean of the &amp;lt;= branch subtracted from that of the &amp;gt; branch, averaged across all relevant splits of the predictor; it measures the difference in mean outcome between data points with higher values of the predictor and those with lower values.&lt;/p&gt;
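&lt;p&gt;The two measures can be sketched in a few lines of Python (a hypothetical illustration, not the actual analysis code; each split is summarized here simply as a pair of branch outcome means):&lt;/p&gt;

```python
# Each split of a given predictor X at cutoff c yields two branches; we
# summarize a split as (mean outcome in the <= branch, mean outcome in the > branch).

def pct_higher_split_mean(splits):
    """%HSM: fraction of splits where the > branch outcome mean exceeds the <= branch's."""
    return sum(1 for le_mean, gt_mean in splits if gt_mean > le_mean) / len(splits)

def mean_split_difference(splits):
    """MSD: (> branch mean) minus (<= branch mean), averaged over all splits."""
    return sum(gt_mean - le_mean for le_mean, gt_mean in splits) / len(splits)

# Toy splits for one predictor (values chosen to be exact binary fractions).
splits = [(0.25, 0.75), (0.5, 0.625), (0.75, 0.5), (0.25, 0.75)]

print(pct_higher_split_mean(splits))  # 0.75: higher X usually goes with a higher outcome mean
print(mean_split_difference(splits))  # 0.21875: average outcome uplift in the > branch
```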

</description>
      <category>performance</category>
      <category>webperf</category>
      <category>webdev</category>
      <category>httparchive</category>
    </item>
    <item>
      <title>Introducing the second annual Web Almanac!</title>
      <dc:creator>Rick Viscomi</dc:creator>
      <pubDate>Tue, 22 Dec 2020 15:23:49 +0000</pubDate>
      <link>https://dev.to/httparchive/introducing-the-second-annual-web-almanac-3ilh</link>
      <guid>https://dev.to/httparchive/introducing-the-second-annual-web-almanac-3ilh</guid>
      <description>&lt;p&gt;The &lt;a href="https://almanac.httparchive.org/en/2020/"&gt;2020 Web Almanac&lt;/a&gt; is a free, open-source, community-made ebook whose mission is to annually track the state of the web. This report is for everyone who's ever wondered how big a typical web page is nowadays, or what the most common CSS breakpoints are, or the most popular CMS. Those questions and &lt;em&gt;many&lt;/em&gt; more are answered in this comprehensive, data-driven research project sourced from over 7 million websites.&lt;/p&gt;

&lt;p&gt;Experts in over 20 web disciplines researched and wrote chapters covering the components web pages are made of, how users experience them, how developers publish them, and how they're delivered. &lt;a href="https://almanac.httparchive.org/en/2020/"&gt;The 2020 report&lt;/a&gt; is out and I'm excited to share it with you all! &lt;em&gt;(Read the &lt;a href="https://dev.to/httparchive/the-web-almanac-2019-is-live-4f98"&gt;2019 announcement on dev.to&lt;/a&gt;.)&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;This year &lt;strong&gt;&lt;a href="https://almanac.httparchive.org/en/2020/contributors"&gt;over 100 contributors&lt;/a&gt;&lt;/strong&gt; have collaborated on the report: analyzing data, authoring chapters, peer-reviewing, editing, translating, and maintaining the website. This project was only made possible through the hard work by all of the contributors collectively.  The result is a &lt;strong&gt;free&lt;/strong&gt; 500+ page ebook that you can read on our &lt;a href="https://almanac.httparchive.org/en/2020/table-of-contents"&gt;website&lt;/a&gt;, as a &lt;a href="https://almanac.httparchive.org/static/pdfs/web_almanac_2020_en.pdf"&gt;PDF (22 MB)&lt;/a&gt;, or on &lt;a href="https://play.google.com/store/books/details?id=wqcPEAAAQBAJ"&gt;Google Play&lt;/a&gt; or &lt;a href="https://www.google.com/books/edition/The_2020_Web_Almanac/wqcPEAAAQBAJ"&gt;Google Books&lt;/a&gt;. Translations are in progress in &lt;a href="https://github.com/HTTPArchive/almanac.httparchive.org/issues?q=is%3Aopen+is%3Aissue+label%3Atranslation"&gt;nine languages&lt;/a&gt;, and the 2019 edition is completely translated into Japanese with its own &lt;a href="https://www.google.com/books/edition/The_2019_Web_Almanac/tPACEAAAQBAJ"&gt;ebook&lt;/a&gt;!&lt;/p&gt;

&lt;p&gt;This is a community effort and we're a friendly, inclusive group, so if you'd like to be a part of the 2019-2020 &lt;a href="https://github.com/HTTPArchive/almanac.httparchive.org/issues/923"&gt;translation effort&lt;/a&gt; or help write the 2021 edition, we'd love for you to join us! Please fill out this &lt;a href="https://forms.gle/VRBFegGAP7d99Bhp7"&gt;interest form&lt;/a&gt; to let us know how you'd like to contribute, and we'll reach out to you when it's time to start planning in mid-2021.&lt;/p&gt;

&lt;p&gt;Give it a read and let us know what you find most interesting or surprising! Here's what's inside:&lt;/p&gt;

&lt;h3&gt;
  
  
  Part I. Page Content
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://almanac.httparchive.org/en/2020/css"&gt;Chapter 1: CSS&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://almanac.httparchive.org/en/2020/javascript"&gt;Chapter 2: JavaScript&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://almanac.httparchive.org/en/2020/markup"&gt;Chapter 3: Markup&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://almanac.httparchive.org/en/2020/fonts"&gt;Chapter 4: Fonts&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://almanac.httparchive.org/en/2020/third-parties"&gt;Chapter 6: Third Parties&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Part II. User Experience
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://almanac.httparchive.org/en/2020/seo"&gt;Chapter 7: SEO&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://almanac.httparchive.org/en/2020/accessibility"&gt;Chapter 8: Accessibility&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://almanac.httparchive.org/en/2020/performance"&gt;Chapter 9: Performance&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://almanac.httparchive.org/en/2020/privacy"&gt;Chapter 10: Privacy&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://almanac.httparchive.org/en/2020/security"&gt;Chapter 11: Security&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://almanac.httparchive.org/en/2020/mobile-web"&gt;Chapter 12: Mobile Web&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://almanac.httparchive.org/en/2020/capabilities"&gt;Chapter 13: Capabilities&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://almanac.httparchive.org/en/2020/pwa"&gt;Chapter 14: PWA&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Part III. Content Publishing
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://almanac.httparchive.org/en/2020/cms"&gt;Chapter 15: CMS&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://almanac.httparchive.org/en/2020/jamstack"&gt;Chapter 17: Jamstack&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Part IV. Content Distribution
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://almanac.httparchive.org/en/2020/page-weight"&gt;Chapter 18: Page Weight&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://almanac.httparchive.org/en/2020/compression"&gt;Chapter 19: Compression&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://almanac.httparchive.org/en/2020/caching"&gt;Chapter 20: Caching&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://almanac.httparchive.org/en/2020/resource-hints"&gt;Chapter 21: Resource Hints&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://almanac.httparchive.org/en/2020/http2"&gt;Chapter 22: HTTP/2&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Appendices
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://almanac.httparchive.org/en/2020/methodology"&gt;Methodology&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://almanac.httparchive.org/en/2020/contributors"&gt;Contributors&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;em&gt;(Chapters 5 and 16, Media and Ecommerce, are coming soon!)&lt;/em&gt;&lt;/p&gt;

</description>
      <category>webalmanac</category>
      <category>webdev</category>
      <category>community</category>
    </item>
    <item>
      <title>Growth of the Web in 2020</title>
      <dc:creator>Paul Calvano</dc:creator>
      <pubDate>Tue, 29 Sep 2020 19:28:48 +0000</pubDate>
      <link>https://dev.to/httparchive/growth-of-the-web-in-2020-20di</link>
      <guid>https://dev.to/httparchive/growth-of-the-web-in-2020-20di</guid>
      <description>&lt;p&gt;For the past 10 years, the &lt;a href="https://httparchive.org/" rel="noopener noreferrer"&gt;HTTP Archive&lt;/a&gt; has tracked the evolution of the web by archiving the technical details of desktop and mobile homepages. During its early years, the &lt;a href="https://www.alexa.com/topsites" rel="noopener noreferrer"&gt;Alexa top million&lt;/a&gt; dataset (which was publicly available until 2017) was used to source the list of URLs included in the archive and the number of sites tracked increased from 16K to almost 500K as testing capacity increased. To keep the archive current and include new sites, towards the end of 2018 we started using the &lt;a href="https://developers.google.com/web/tools/chrome-user-experience-report" rel="noopener noreferrer"&gt;Chrome User Experience Report&lt;/a&gt; as a source of the URLs to track. &lt;/p&gt;

&lt;p&gt;Throughout 2019 the size of the HTTP Archive dataset was mostly constant. However, the sample size has grown quite a bit in 2020 as you can see in the graph below! Additionally, if we combine both desktop and mobile URLs, there was a recent peak of 7.5 million sites!&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fpaulcalvano.com%2Fassets%2Fimg%2Fblog%2Fgrowth-of-the-web-in-2020%2Fimage1.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fpaulcalvano.com%2Fassets%2Fimg%2Fblog%2Fgrowth-of-the-web-in-2020%2Fimage1.jpg" alt="HTTP Archive Sample Size Trend"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How Are Sites Included in the Chrome User Experience Report?&lt;/strong&gt; &lt;/p&gt;

&lt;p&gt;The Chrome User Experience Report (CrUX) is sourced from performance data collected from real Chrome users who have opted into syncing their browser history and sharing anonymized usage statistics. It’s essentially real user measurement (RUM) data for Chrome users.&lt;/p&gt;

&lt;p&gt;You can read more about CrUX on &lt;a href="https://developers.google.com/web/tools/chrome-user-experience-report" rel="noopener noreferrer"&gt;Google’s Developer website&lt;/a&gt;, as well as this informative &lt;a href="https://web.dev/chrome-ux-report/" rel="noopener noreferrer"&gt;blog post from Rick Viscomi&lt;/a&gt;. I’ve also written about it previously &lt;a href="https://paulcalvano.com/2018-04-26-using-googles-crux-to-compare-your-sites-rum-data-w-competitors/" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;While Google doesn’t publish a definitive list of what it takes to be included in the Chrome User Experience Report dataset, they have &lt;a href="https://groups.google.com/a/chromium.org/g/chrome-ux-report/c/kVKP6LZpk7U/m/rYbcEXiCCQAJ" rel="noopener noreferrer"&gt;indicated&lt;/a&gt; that:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Origins are automatically curated based on real-user Chrome usage&lt;/li&gt;
&lt;li&gt;Websites must meet a traffic threshold to be included&lt;/li&gt;
&lt;li&gt;Websites must be publicly accessible&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Essentially, a website’s inclusion in the Chrome User Experience Report indicates that it has reached a certain threshold of activity. According to the CrUX &lt;a href="https://developers.google.com/web/tools/chrome-user-experience-report/bigquery/changelog" rel="noopener noreferrer"&gt;changelog&lt;/a&gt;, there have been no changes in the methodology, so it can be inferred that analyzing the number of websites included in this dataset should provide some interesting insights into the month-on-month growth in the number of websites being visited by real users.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Note: The Chrome User Experience Report does not contain traffic details, and as such this analysis should not be interpreted as growth of traffic on the internet.  This analysis is specifically about the growth in the number of websites that people are visiting.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Global Growth of Origins Accessed in 2020&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The graph below illustrates the total number of websites included in the Chrome User Experience Report across all form factors during the previous 12 months. There are a few interesting observations we can make:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;In both 2019 and 2020 there were increases in the number of websites at the start of the year.&lt;/li&gt;
&lt;li&gt;There was a linear increase in the number of websites through the first half of 2020. &lt;/li&gt;
&lt;li&gt;The drop in March and April 2020 is interesting, since that coincides with the start of the global COVID-19 pandemic.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fpaulcalvano.com%2Fassets%2Fimg%2Fblog%2Fgrowth-of-the-web-in-2020%2Fimage2.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fpaulcalvano.com%2Fassets%2Fimg%2Fblog%2Fgrowth-of-the-web-in-2020%2Fimage2.jpg" alt="CrUX Origins Per Month"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;There is a similar pattern with the total number of registered domains. This indicates that much of the growth comes from new domains and not merely subdomains of existing domains. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fpaulcalvano.com%2Fassets%2Fimg%2Fblog%2Fgrowth-of-the-web-in-2020%2Fimage3.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fpaulcalvano.com%2Fassets%2Fimg%2Fblog%2Fgrowth-of-the-web-in-2020%2Fimage3.jpg" alt="CrUX Registered Domains Per Month"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;When we look at the month-over-month rate of change, we can see that the maximum change in 2019 was +/- 5%. The number of sites tends to fluctuate month to month. For example, between August and December 2019 there was an 8.6% decrease in sites, yet at the start of 2020 there was a 7.5% increase. &lt;/p&gt;

&lt;p&gt;Comparing the number of origins between December 2019 and August 2020, the total increased by 28.9% this year alone! That’s huge!&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fpaulcalvano.com%2Fassets%2Fimg%2Fblog%2Fgrowth-of-the-web-in-2020%2Fimage4.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fpaulcalvano.com%2Fassets%2Fimg%2Fblog%2Fgrowth-of-the-web-in-2020%2Fimage4.jpg" alt="CrUX Month/Month Change in Origins and Registered Domains"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Mobile vs Desktop Growth&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Looking at this by device type, we can see that there are consistently more mobile websites compared to desktop websites.  And over the past year the fluctuations between them have been fairly consistent.  The one exception is between May 2020 and June 2020, where desktop increased by 0.7% and mobile increased by 6.2%.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fpaulcalvano.com%2Fassets%2Fimg%2Fblog%2Fgrowth-of-the-web-in-2020%2Fimage5.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fpaulcalvano.com%2Fassets%2Fimg%2Fblog%2Fgrowth-of-the-web-in-2020%2Fimage5.jpg" alt="CrUX Origins Per Month by Form Factor"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Overall, there are 22.9% more mobile websites in the CrUX dataset compared to desktop.  We know from sources like &lt;a href="https://gs.statcounter.com/platform-market-share/desktop-mobile-tablet/worldwide/#monthly-201908-202008" rel="noopener noreferrer"&gt;statcounter&lt;/a&gt; that mobile usage has grown significantly over the years, and consistently surpasses desktop.  But why are mobile users navigating to so many more websites compared to desktop users? &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fpaulcalvano.com%2Fassets%2Fimg%2Fblog%2Fgrowth-of-the-web-in-2020%2Fimage6.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fpaulcalvano.com%2Fassets%2Fimg%2Fblog%2Fgrowth-of-the-web-in-2020%2Fimage6.jpg" alt="Desktop vs Mobile vs Tablet Market Share"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Is there something about the mobile experience (such as social media links, email marketing, etc.) that increases the chance a user will navigate to an unfamiliar website?&lt;/p&gt;

&lt;p&gt;Or could it be growth in regions where mobile is more dominant? &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How Has this Varied by Region?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;At the start of 2020, most regions of the world saw an increase in the number of sites. The exception was Western Asia. The regions with the most substantial increase at the start of the year were Northern Europe, North America, and South America.&lt;/p&gt;

&lt;p&gt;Between May and June there was another large uptick in the number of sites, driven mostly by South-East Asian and Western European countries.&lt;/p&gt;

&lt;p&gt;The tables below detail the number of sites included in the CrUX dataset during December 2019 as well as January, May, June, and August 2020. This first table contains the top 10 regions, most of which saw an increase of 15% to 25% during the previous six months!&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th colspan="5"&gt;Number of Sites Included in CrUX dataset&lt;/th&gt;
&lt;/tr&gt;&lt;/thead&gt;
&lt;thead&gt;&lt;tr&gt;
&lt;th&gt;Sub Region&lt;/th&gt;
&lt;th&gt;Dec 2019&lt;/th&gt;
&lt;th&gt;Jan 2020&lt;/th&gt;
&lt;th&gt;May 2020&lt;/th&gt;
&lt;th&gt;June 2020&lt;/th&gt;
&lt;th&gt;August 2020&lt;/th&gt;
&lt;/tr&gt;&lt;/thead&gt;
 &lt;tbody&gt;
 &lt;tr&gt;
&lt;td&gt;Northern America&lt;/td&gt;
&lt;td&gt;1,257,159&lt;/td&gt;
&lt;td&gt;1,406,284&lt;/td&gt;
&lt;td&gt;1,676,120&lt;/td&gt;
&lt;td&gt;1,681,454&lt;/td&gt;
&lt;td&gt;1,730,260&lt;/td&gt;
&lt;/tr&gt;
 &lt;tr&gt;
&lt;td&gt;Western Europe&lt;/td&gt;
&lt;td&gt;713,202&lt;/td&gt;
&lt;td&gt;768,644&lt;/td&gt;
&lt;td&gt;874,164&lt;/td&gt;
&lt;td&gt;908,560&lt;/td&gt;
&lt;td&gt;943,891&lt;/td&gt;
&lt;/tr&gt;
 &lt;tr&gt;
&lt;td&gt;Eastern Europe&lt;/td&gt;
&lt;td&gt;640,145&lt;/td&gt;
&lt;td&gt;694,632&lt;/td&gt;
&lt;td&gt;821,024&lt;/td&gt;
&lt;td&gt;913,037&lt;/td&gt;
&lt;td&gt;926,202&lt;/td&gt;
&lt;/tr&gt;
 &lt;tr&gt;
&lt;td&gt;Eastern Asia&lt;/td&gt;
&lt;td&gt;720,926&lt;/td&gt;
&lt;td&gt;740,767&lt;/td&gt;
&lt;td&gt;871,854&lt;/td&gt;
&lt;td&gt;882,322&lt;/td&gt;
&lt;td&gt;901,008&lt;/td&gt;
&lt;/tr&gt;
 &lt;tr&gt;
&lt;td&gt;South America&lt;/td&gt;
&lt;td&gt;486,894&lt;/td&gt;
&lt;td&gt;540,685&lt;/td&gt;
&lt;td&gt;668,604&lt;/td&gt;
&lt;td&gt;726,410&lt;/td&gt;
&lt;td&gt;784,854&lt;/td&gt;
&lt;/tr&gt;
 &lt;tr&gt;
&lt;td&gt;Southern Europe&lt;/td&gt;
&lt;td&gt;506,054&lt;/td&gt;
&lt;td&gt;541,526&lt;/td&gt;
&lt;td&gt;661,416&lt;/td&gt;
&lt;td&gt;710,543&lt;/td&gt;
&lt;td&gt;724,323&lt;/td&gt;
&lt;/tr&gt;
 &lt;tr&gt;
&lt;td&gt;Northern Europe&lt;/td&gt;
&lt;td&gt;453,591&lt;/td&gt;
&lt;td&gt;516,527&lt;/td&gt;
&lt;td&gt;601,459&lt;/td&gt;
&lt;td&gt;638,744&lt;/td&gt;
&lt;td&gt;661,790&lt;/td&gt;
&lt;/tr&gt;
 &lt;tr&gt;
&lt;td&gt;South-Eastern Asia&lt;/td&gt;
&lt;td&gt;473,962&lt;/td&gt;
&lt;td&gt;485,143&lt;/td&gt;
&lt;td&gt;524,249&lt;/td&gt;
&lt;td&gt;584,214&lt;/td&gt;
&lt;td&gt;629,815&lt;/td&gt;
&lt;/tr&gt;
 &lt;tr&gt;
&lt;td&gt;Southern Asia&lt;/td&gt;
&lt;td&gt;403,325&lt;/td&gt;
&lt;td&gt;419,118&lt;/td&gt;
&lt;td&gt;441,328&lt;/td&gt;
&lt;td&gt;462,282&lt;/td&gt;
&lt;td&gt;500,600&lt;/td&gt;
&lt;/tr&gt;
 &lt;tr&gt;
&lt;td&gt;Western Asia&lt;/td&gt;
&lt;td&gt;274,339&lt;/td&gt;
&lt;td&gt;273,610&lt;/td&gt;
&lt;td&gt;327,425&lt;/td&gt;
&lt;td&gt;351,186&lt;/td&gt;
&lt;td&gt;362,340&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th colspan="5"&gt;Percent Change of Sites Included in CrUX dataset&lt;/th&gt;
&lt;/tr&gt;&lt;/thead&gt;
&lt;thead&gt;&lt;tr&gt;
&lt;th&gt;Sub Region&lt;/th&gt;
&lt;th&gt;Dec 2019 - Jan 2020&lt;/th&gt;
&lt;th&gt;May - Jun 2020&lt;/th&gt;
&lt;th&gt;Jun - Aug 2020&lt;/th&gt;
&lt;th&gt;Dec - Aug 2020&lt;/th&gt;
&lt;/tr&gt;&lt;/thead&gt;
&lt;tbody&gt;
 &lt;tr&gt;
&lt;td&gt;Northern America&lt;/td&gt;
&lt;td&gt;10.60%&lt;/td&gt;
&lt;td&gt;0.32%&lt;/td&gt;
&lt;td&gt;2.82%&lt;/td&gt;
&lt;td&gt;27.34%&lt;/td&gt;
&lt;/tr&gt;
 &lt;tr&gt;
&lt;td&gt;Western Europe&lt;/td&gt;
&lt;td&gt;7.21%&lt;/td&gt;
&lt;td&gt;3.79%&lt;/td&gt;
&lt;td&gt;3.74%&lt;/td&gt;
&lt;td&gt;24.44%&lt;/td&gt;
&lt;/tr&gt;
 &lt;tr&gt;
&lt;td&gt;Eastern Europe&lt;/td&gt;
&lt;td&gt;7.84%&lt;/td&gt;
&lt;td&gt;10.08%&lt;/td&gt;
&lt;td&gt;1.42%&lt;/td&gt;
&lt;td&gt;30.88%&lt;/td&gt;
&lt;/tr&gt;
 &lt;tr&gt;
&lt;td&gt;Eastern Asia&lt;/td&gt;
&lt;td&gt;2.68%&lt;/td&gt;
&lt;td&gt;1.19%&lt;/td&gt;
&lt;td&gt;2.07%&lt;/td&gt;
&lt;td&gt;19.99%&lt;/td&gt;
&lt;/tr&gt;
 &lt;tr&gt;
&lt;td&gt;South America&lt;/td&gt;
&lt;td&gt;9.95%&lt;/td&gt;
&lt;td&gt;7.96%&lt;/td&gt;
&lt;td&gt;7.45%&lt;/td&gt;
&lt;td&gt;37.96%&lt;/td&gt;
&lt;/tr&gt;
 &lt;tr&gt;
&lt;td&gt;Southern Europe&lt;/td&gt;
&lt;td&gt;6.55%&lt;/td&gt;
&lt;td&gt;6.91%&lt;/td&gt;
&lt;td&gt;1.90%&lt;/td&gt;
&lt;td&gt;30.13%&lt;/td&gt;
&lt;/tr&gt;
 &lt;tr&gt;
&lt;td&gt;Northern Europe&lt;/td&gt;
&lt;td&gt;12.18%&lt;/td&gt;
&lt;td&gt;5.84%&lt;/td&gt;
&lt;td&gt;3.48%&lt;/td&gt;
&lt;td&gt;31.46%&lt;/td&gt;
&lt;/tr&gt;
 &lt;tr&gt;
&lt;td&gt;South-Eastern Asia&lt;/td&gt;
&lt;td&gt;2.30%&lt;/td&gt;
&lt;td&gt;10.26%&lt;/td&gt;
&lt;td&gt;7.24%&lt;/td&gt;
&lt;td&gt;24.75%&lt;/td&gt;
&lt;/tr&gt;
 &lt;tr&gt;
&lt;td&gt;Southern Asia&lt;/td&gt;
&lt;td&gt;3.77%&lt;/td&gt;
&lt;td&gt;4.53%&lt;/td&gt;
&lt;td&gt;7.65%&lt;/td&gt;
&lt;td&gt;19.43%&lt;/td&gt;
&lt;/tr&gt;
 &lt;tr&gt;
&lt;td&gt;Western Asia&lt;/td&gt;
&lt;td&gt;-0.27%&lt;/td&gt;
&lt;td&gt;6.77%&lt;/td&gt;
&lt;td&gt;3.08%&lt;/td&gt;
&lt;td&gt;24.29%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;
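&lt;p&gt;A note on reading the percent-change columns above: the figures appear to use the later month as the denominator, i.e. (new - old) / new, rather than the more common (new - old) / old. A small sketch (reusing the Northern America counts from the first table) reproduces them under that assumption:&lt;/p&gt;

```python
def pct_change(old, new):
    """Percent change with the later month as the denominator,
    which is the convention the tables appear to follow."""
    return round(100 * (new - old) / new, 2)

# Northern America origin counts from the first regional table.
dec_2019, jan_2020, aug_2020 = 1_257_159, 1_406_284, 1_730_260

print(pct_change(dec_2019, jan_2020))  # 10.6  -> "Dec 2019 - Jan 2020" column
print(pct_change(dec_2019, aug_2020))  # 27.34 -> "Dec - Aug 2020" column
```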

&lt;p&gt;Looking at the next 10 regions in the list, we can see significant growth in Central America and Australia, as well as Western and Southern Africa. Overall, the regions with the most growth during the seven-month period were Australia and New Zealand, South America, and Central America.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th colspan="5"&gt;Number of Sites Included in CrUX dataset&lt;/th&gt;
&lt;/tr&gt;&lt;/thead&gt;
&lt;thead&gt;&lt;tr&gt;
&lt;th&gt;Sub Region&lt;/th&gt;
&lt;th&gt;Dec 2019&lt;/th&gt;
&lt;th&gt;Jan 2020&lt;/th&gt;
&lt;th&gt;May 2020&lt;/th&gt;
&lt;th&gt;June 2020&lt;/th&gt;
&lt;th&gt;August 2020&lt;/th&gt;
&lt;/tr&gt;&lt;/thead&gt;
 &lt;tbody&gt;
 &lt;tr&gt;
&lt;td&gt;Central America&lt;/td&gt;
&lt;td&gt;155,057&lt;/td&gt;
&lt;td&gt;179,295&lt;/td&gt;
&lt;td&gt;236,255&lt;/td&gt;
&lt;td&gt;242,132&lt;/td&gt;
&lt;td&gt;257,043&lt;/td&gt;
&lt;/tr&gt;
 &lt;tr&gt;
&lt;td&gt;Australia and New Zealand&lt;/td&gt;
&lt;td&gt;124,763&lt;/td&gt;
&lt;td&gt;141,523&lt;/td&gt;
&lt;td&gt;194,212&lt;/td&gt;
&lt;td&gt;196,841&lt;/td&gt;
&lt;td&gt;214,757&lt;/td&gt;
&lt;/tr&gt;
 &lt;tr&gt;
&lt;td&gt;Northern Africa&lt;/td&gt;
&lt;td&gt;68,754&lt;/td&gt;
&lt;td&gt;69,497&lt;/td&gt;
&lt;td&gt;83,312&lt;/td&gt;
&lt;td&gt;88,672&lt;/td&gt;
&lt;td&gt;88,606&lt;/td&gt;
&lt;/tr&gt;
 &lt;tr&gt;
&lt;td&gt;Southern Africa&lt;/td&gt;
&lt;td&gt;50,618&lt;/td&gt;
&lt;td&gt;59,139&lt;/td&gt;
&lt;td&gt;64,978&lt;/td&gt;
&lt;td&gt;66,392&lt;/td&gt;
&lt;td&gt;70,218&lt;/td&gt;
&lt;/tr&gt;
 &lt;tr&gt;
&lt;td&gt;Central Asia&lt;/td&gt;
&lt;td&gt;45,932&lt;/td&gt;
&lt;td&gt;49,192&lt;/td&gt;
&lt;td&gt;57,098&lt;/td&gt;
&lt;td&gt;57,508&lt;/td&gt;
&lt;td&gt;62,112&lt;/td&gt;
&lt;/tr&gt;
 &lt;tr&gt;
&lt;td&gt;Western Africa&lt;/td&gt;
&lt;td&gt;44,692&lt;/td&gt;
&lt;td&gt;49,868&lt;/td&gt;
&lt;td&gt;47,834&lt;/td&gt;
&lt;td&gt;51,257&lt;/td&gt;
&lt;td&gt;50,853&lt;/td&gt;
&lt;/tr&gt;
 &lt;tr&gt;
&lt;td&gt;Caribbean&lt;/td&gt;
&lt;td&gt;33,840&lt;/td&gt;
&lt;td&gt;37,445&lt;/td&gt;
&lt;td&gt;44,090&lt;/td&gt;
&lt;td&gt;45,910&lt;/td&gt;
&lt;td&gt;45,395&lt;/td&gt;
&lt;/tr&gt;
 &lt;tr&gt;
&lt;td&gt;Eastern Africa&lt;/td&gt;
&lt;td&gt;31,010&lt;/td&gt;
&lt;td&gt;34,822&lt;/td&gt;
&lt;td&gt;36,073&lt;/td&gt;
&lt;td&gt;37,388&lt;/td&gt;
&lt;td&gt;38,609&lt;/td&gt;
&lt;/tr&gt;
 &lt;tr&gt;
&lt;td&gt;Middle Africa&lt;/td&gt;
&lt;td&gt;8,873&lt;/td&gt;
&lt;td&gt;9,149&lt;/td&gt;
&lt;td&gt;9,121&lt;/td&gt;
&lt;td&gt;10,057&lt;/td&gt;
&lt;td&gt;10,032&lt;/td&gt;
&lt;/tr&gt;
 &lt;tr&gt;
&lt;td&gt;Melanesia&lt;/td&gt;
&lt;td&gt;2,733&lt;/td&gt;
&lt;td&gt;2,991&lt;/td&gt;
&lt;td&gt;2,580&lt;/td&gt;
&lt;td&gt;2,779&lt;/td&gt;
&lt;td&gt;2,818&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;&lt;tr&gt;
&lt;th&gt;&lt;/th&gt;
&lt;th colspan="5"&gt;Percent Change of Sites Included in CrUX dataset&lt;/th&gt;
&lt;/tr&gt;&lt;/thead&gt;
&lt;thead&gt;&lt;tr&gt;
&lt;th&gt;Sub Region&lt;/th&gt;
&lt;th&gt;Dec 2019 - Jan 2020&lt;/th&gt;
&lt;th&gt;May - Jun 2020&lt;/th&gt;
&lt;th&gt;Jun - Aug 2020&lt;/th&gt;
&lt;th&gt;Dec - Aug 2020&lt;/th&gt;
&lt;/tr&gt;&lt;/thead&gt;
&lt;tbody&gt;
 &lt;tr&gt;
&lt;td&gt;Central America&lt;/td&gt;
&lt;td&gt;13.52%&lt;/td&gt;
&lt;td&gt;2.43%&lt;/td&gt;
&lt;td&gt;5.80%&lt;/td&gt;
&lt;td&gt;39.68%&lt;/td&gt;
&lt;/tr&gt;
 &lt;tr&gt;
&lt;td&gt;Australia and New Zealand&lt;/td&gt;
&lt;td&gt;11.84%&lt;/td&gt;
&lt;td&gt;1.34%&lt;/td&gt;
&lt;td&gt;8.34%&lt;/td&gt;
&lt;td&gt;41.91%&lt;/td&gt;
&lt;/tr&gt;
 &lt;tr&gt;
&lt;td&gt;Northern Africa&lt;/td&gt;
&lt;td&gt;1.07%&lt;/td&gt;
&lt;td&gt;6.04%&lt;/td&gt;
&lt;td&gt;-0.07%&lt;/td&gt;
&lt;td&gt;22.40%&lt;/td&gt;
&lt;/tr&gt;
 &lt;tr&gt;
&lt;td&gt;Southern Africa&lt;/td&gt;
&lt;td&gt;14.41%&lt;/td&gt;
&lt;td&gt;2.13%&lt;/td&gt;
&lt;td&gt;5.45%&lt;/td&gt;
&lt;td&gt;27.91%&lt;/td&gt;
&lt;/tr&gt;
 &lt;tr&gt;
&lt;td&gt;Central Asia&lt;/td&gt;
&lt;td&gt;6.63%&lt;/td&gt;
&lt;td&gt;0.71%&lt;/td&gt;
&lt;td&gt;7.41%&lt;/td&gt;
&lt;td&gt;26.05%&lt;/td&gt;
&lt;/tr&gt;
 &lt;tr&gt;
&lt;td&gt;Western Africa&lt;/td&gt;
&lt;td&gt;10.38%&lt;/td&gt;
&lt;td&gt;6.68%&lt;/td&gt;
&lt;td&gt;-0.79%&lt;/td&gt;
&lt;td&gt;12.12%&lt;/td&gt;
&lt;/tr&gt;
 &lt;tr&gt;
&lt;td&gt;Caribbean&lt;/td&gt;
&lt;td&gt;9.63%&lt;/td&gt;
&lt;td&gt;3.96%&lt;/td&gt;
&lt;td&gt;-1.13%&lt;/td&gt;
&lt;td&gt;25.45%&lt;/td&gt;
&lt;/tr&gt;
 &lt;tr&gt;
&lt;td&gt;Eastern Africa&lt;/td&gt;
&lt;td&gt;10.95%&lt;/td&gt;
&lt;td&gt;3.52%&lt;/td&gt;
&lt;td&gt;3.16%&lt;/td&gt;
&lt;td&gt;19.68%&lt;/td&gt;
&lt;/tr&gt;
 &lt;tr&gt;
&lt;td&gt;Middle Africa&lt;/td&gt;
&lt;td&gt;3.02%&lt;/td&gt;
&lt;td&gt;9.31%&lt;/td&gt;
&lt;td&gt;-0.25%&lt;/td&gt;
&lt;td&gt;11.55%&lt;/td&gt;
&lt;/tr&gt;
 &lt;tr&gt;
&lt;td&gt;Melanesia&lt;/td&gt;
&lt;td&gt;8.63%&lt;/td&gt;
&lt;td&gt;7.16%&lt;/td&gt;
&lt;td&gt;1.38%&lt;/td&gt;
&lt;td&gt;3.02%&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Many of the regions that had an increase in sites visited (based on CrUX data) also have a high percentage of mobile visitors relative to the global population (based on StatCounter). So while it’s difficult to say for certain, it’s entirely possible that location is a large factor in the gap between desktop and mobile.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Analyzing by Top Level Domain&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The .com top level domain accounts for 43.7% of all websites tracked in the Chrome User Experience Report. The next largest top level domain is .org, which accounts for 3.7% of all sites. Overall there were 4,111 TLDs in the dataset, and the top 20 of them represented 75% of all websites.&lt;/p&gt;
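&lt;p&gt;&lt;em&gt;As a rough sketch of this kind of grouping (not the actual BigQuery SQL used for the analysis), the TLD share of a hypothetical list of origins can be computed in Python. Note this naive version counts only the last hostname label; a real analysis would consult the Public Suffix List so that multi-label TLDs such as .co.uk are handled correctly:&lt;/em&gt;&lt;/p&gt;

```python
from collections import Counter
from urllib.parse import urlparse

def tld_share(origins):
    """Return each TLD's percentage share of the given origins.

    Naive: uses only the last hostname label; a real analysis would use
    the Public Suffix List to handle multi-label TLDs like .co.uk.
    """
    tlds = Counter(urlparse(o).hostname.rsplit(".", 1)[-1] for o in origins)
    total = sum(tlds.values())
    return {tld: round(100.0 * n / total, 1) for tld, n in tlds.most_common()}

# Hypothetical origins, for illustration only
print(tld_share([
    "https://example.com", "https://shop.example.com",
    "https://example.org", "https://example.net",
]))  # {'com': 50.0, 'org': 25.0, 'net': 25.0}
```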

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fpaulcalvano.com%2Fassets%2Fimg%2Fblog%2Fgrowth-of-the-web-in-2020%2Fimage7.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fpaulcalvano.com%2Fassets%2Fimg%2Fblog%2Fgrowth-of-the-web-in-2020%2Fimage7.jpg" alt="Distribution of Websites by TLD - Chrome User Experience Report"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Most of these top level domains experienced &amp;gt;20% growth in active websites since December 2019, with the exception of .info and .net. The domains with the largest percentage growth were .co.uk, .com.au and .de.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fpaulcalvano.com%2Fassets%2Fimg%2Fblog%2Fgrowth-of-the-web-in-2020%2Fimage8.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fpaulcalvano.com%2Fassets%2Fimg%2Fblog%2Fgrowth-of-the-web-in-2020%2Fimage8.jpg" alt="% Growth in Websites by TLD - December 2019 - August 2020"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If we look at the month to month growth trends for these TLDs, we can make a few interesting observations:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;There was a significant drop across all TLDs in March 2020.
&lt;/li&gt;
&lt;li&gt;The largest percentage drop was for .it domains in March 2020, although that rebounded with increases in April, May and June.&lt;/li&gt;
&lt;li&gt;In February 2020, there was a 23.9% increase in .edu domains receiving traffic. &lt;/li&gt;
&lt;li&gt;In May 2020, more than a dozen popular TLDs saw a double-digit increase in the number of sites. &lt;/li&gt;
&lt;li&gt;In August 2020 there was a 10.4% increase in .edu domains.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fpaulcalvano.com%2Fassets%2Fimg%2Fblog%2Fgrowth-of-the-web-in-2020%2Fimage9.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fpaulcalvano.com%2Fassets%2Fimg%2Fblog%2Fgrowth-of-the-web-in-2020%2Fimage9.jpg" alt="Month/Month % Growth of Websites in CrUX"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Conclusion&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The web is constantly growing and evolving, and clearly its rate of growth can vary quite a bit. During this analysis we explored a public dataset that Google provides to show how the web has grown during 2020, and which regions are growing the most. While this doesn’t speak to the traffic levels experienced in these locations, the number of websites can be used as a proxy for understanding usage of the web. As this analysis shows, 2020 has been a year of substantial global growth for the web.&lt;/p&gt;

&lt;p&gt;If you are interested in seeing some of the SQL queries and raw data used in this analysis, I’ve created a post with all the details in the &lt;a href="https://discuss.httparchive.org/t/growth-of-the-web-in-2020/2029" rel="noopener noreferrer"&gt;HTTP Archive discussion forums&lt;/a&gt;. You can also see all the data used for these graphs in this &lt;a href="https://docs.google.com/spreadsheets/d/1eGfT0dBslpSl8Xl6ey7a_fNT0Cj5tTfJzX6gpwpXDBE/edit?usp=sharing" rel="noopener noreferrer"&gt;Google Sheet&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Originally posted at &lt;a href="https://paulcalvano.com/2020-09-29-growth-of-the-web-in-2020/" rel="noopener noreferrer"&gt;paulcalvano.com&lt;/a&gt;&lt;/em&gt;&lt;/p&gt;

</description>
      <category>webperf</category>
      <category>web</category>
      <category>httparchive</category>
    </item>
    <item>
      <title>An Analysis of Cookie Sizes on the Web</title>
      <dc:creator>Paul Calvano</dc:creator>
      <pubDate>Mon, 13 Jul 2020 14:28:47 +0000</pubDate>
      <link>https://dev.to/httparchive/an-analysis-of-cookies-sizes-on-the-web-1mea</link>
      <guid>https://dev.to/httparchive/an-analysis-of-cookies-sizes-on-the-web-1mea</guid>
      <description>&lt;p&gt;&lt;a href="https://developer.mozilla.org/en-US/docs/Web/HTTP/Cookies"&gt;Cookies&lt;/a&gt; are used on a lot of websites - 83.9% of the 5.7 million home pages tracked in the&lt;a href="https://httparchive.org/"&gt; HTTP Archive&lt;/a&gt; to be specific. They are essentially a name/value pair set by a server and stored in a client’s browser. Sites can store these cookies by using the &lt;code&gt;Set-Cookie&lt;/code&gt; HTTP response header, or via JavaScript (&lt;code&gt;document.cookie&lt;/code&gt;). On subsequent requests, these cookies are sent to the server in a &lt;code&gt;Cookie&lt;/code&gt; HTTP request header. &lt;/p&gt;

&lt;p&gt;In this article we’ll be looking at the size of cookies across the web, and discuss some of the web performance implications of them.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;First vs Third Party cookies?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;When a cookie is set by the domain you see in your address bar, it is considered a first party cookie. These can be used for session management, authentication, personalization, etc. When a cookie is set by a different domain, then it’s considered a third party cookie.&lt;/p&gt;

&lt;p&gt;Based on an analysis of over 109 million cookies, third parties account for 79% of all cookies.&lt;/p&gt;
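&lt;p&gt;&lt;em&gt;A minimal sketch of that first vs third party distinction, assuming a simplified registrable-domain check (the last two hostname labels; a real classifier would use the Public Suffix List):&lt;/em&gt;&lt;/p&gt;

```python
def is_third_party(cookie_domain, page_host):
    """Classify a cookie as third party when its registrable domain differs
    from the page's. Simplified: compares the last two hostname labels."""
    def site(host):
        return ".".join(host.lstrip(".").split(".")[-2:])
    return site(cookie_domain) != site(page_host)

# A cookie scoped to .example.com on www.example.com is first party...
print(is_third_party(".example.com", "www.example.com"))      # False
# ...while a cookie set by an analytics domain is third party.
print(is_third_party("metrics.tracker.net", "www.example.com"))  # True
```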

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--bFiy0fAJ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/635h6v978qjf1a3egup0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--bFiy0fAJ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/635h6v978qjf1a3egup0.png" alt="First vs Third Party Cookies"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Note: I wrote another blog post exploring the use of the SameSite attribute in cookie files, and how third party cookies are affected. You can read it&lt;a href="https://dev.to/httparchive/samesite-cookies-are-you-ready-5abd"&gt; here&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cookie Sizes&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;An example of a cookie set by a server would be:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;Set-Cookie: loggedIn=true; Domain=.example.com; Path=/; Max-Age=14400; HttpOnly&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Directives such as &lt;code&gt;Domain&lt;/code&gt;, &lt;code&gt;Path&lt;/code&gt;, &lt;code&gt;Max-Age&lt;/code&gt;, and &lt;code&gt;HttpOnly&lt;/code&gt; affect how the cookie is stored and which hostnames a browser should share it with. In this example, &lt;code&gt;loggedIn=true&lt;/code&gt; is the name/value portion of the cookie, and that is what we’ll be exploring in this post.&lt;/p&gt;
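&lt;p&gt;&lt;em&gt;For instance, Python’s standard library can split that header into its name/value portion and its directives; the sizes discussed below count only the name/value portion:&lt;/em&gt;&lt;/p&gt;

```python
from http.cookies import SimpleCookie

raw = "loggedIn=true; Domain=.example.com; Path=/; Max-Age=14400; HttpOnly"
cookie = SimpleCookie()
cookie.load(raw)

morsel = cookie["loggedIn"]
print(morsel.value)          # the value portion: "true"
print(morsel["domain"])      # ".example.com"
print(morsel["max-age"])     # "14400"

# The size analyzed in this post is just the name=value portion:
print(len(f"{morsel.key}={morsel.value}"))  # 13 bytes
```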

&lt;p&gt;The median length of all cookies in the HTTP Archive is 36 bytes as of June 2020. That statistic is consistent across both first and third party cookies. The minimum is just a single byte, usually set by empty Set-Cookie headers (which is likely an error).&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
  &lt;tr&gt;
   &lt;td&gt;Cookie Size&lt;/td&gt;
   &lt;td&gt;First Party&lt;/td&gt;
   &lt;td&gt;Third Party&lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
   &lt;td&gt;Min &lt;/td&gt;
   &lt;td&gt;1&lt;/td&gt;
   &lt;td&gt;1&lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
   &lt;td&gt;Median&lt;/td&gt;
   &lt;td&gt;36&lt;/td&gt;
   &lt;td&gt;37&lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
   &lt;td&gt;95th Percentile&lt;/td&gt;
   &lt;td&gt;181&lt;/td&gt;
   &lt;td&gt;135&lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
   &lt;td&gt;99th Percentile&lt;/td&gt;
   &lt;td&gt;287&lt;/td&gt;
   &lt;td&gt;248&lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
   &lt;td&gt;Max&lt;/td&gt;
   &lt;td&gt;29,735&lt;/td&gt;
   &lt;td&gt;8,500&lt;/td&gt;
  &lt;/tr&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The largest cookie size was 29,735 bytes, which is quite large! This is so large in fact that it is rejected by all modern browsers. I was curious to see what the limits are, and decided to dig into the source. Both&lt;a href="https://source.chromium.org/chromium/chromium/src/+/master:net/cookies/parsed_cookie.h;l=24"&gt; Chrome&lt;/a&gt; and&lt;a href="https://dxr.mozilla.org/mozilla-central/source/netwerk/cookie/CookieCommons.h#57"&gt; Firefox&lt;/a&gt; will reject cookies greater than 4KB. This is likely due to the &lt;a href="https://tools.ietf.org/html/rfc6265#section-6.1"&gt;implementation limits defined in RFC 6265&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;So who is setting these large cookies?&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  The largest first party cookie was set by&lt;a href="https://www.ridewill.it/"&gt; https://www.ridewill.it/&lt;/a&gt;, and it is named &lt;code&gt;menu&lt;/code&gt;. Its value was a long URL-encoded string that contained multiple &lt;code&gt;&amp;lt;div&amp;gt;&lt;/code&gt; layers and links. All in all, there were 240 links in a single cookie!&lt;/li&gt;
&lt;li&gt;  Many of the other large first party cookies were oversized session cookies.&lt;/li&gt;
&lt;li&gt;  The largest third party cookie was set by web.taggbox.com, and consisted of a large JSON array named “liveWall”. &lt;/li&gt;
&lt;li&gt;  Most of the largest third party cookies were set from web.taggbox.com as well as a small number of advertising third parties.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If we look at the entire distribution of cookie sizes, it gets even more interesting. 88% of the cookies being set are less than 100 bytes. The 99th percentile is 372 bytes. So really large individual cookies are not common.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--9X9iKffY--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/1nyyf4rkx48ubjbefzsq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--9X9iKffY--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/1nyyf4rkx48ubjbefzsq.png" alt="Distribution of Set-Cookie sizes"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Cookies Sent to First Party Domain&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;What about cookies that clients send back to servers? A client can send multiple cookies in a single &lt;code&gt;Cookie&lt;/code&gt; request header. Since the HTTP Archive only collects information on homepages, there is a limit to the insight we can collect here. If we look at just the request headers for favicon.ico, we can get an idea of how large the Cookie request header might be for a subsequent request. However, this does not include any additional cookies set later in a session (e.g., after logging in).&lt;/p&gt;

&lt;p&gt;The median size of cookies sent on the favicon request was 161 bytes and the 95th percentile was 681. The largest was 7,795 bytes, and you can see the distribution below.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--1LFUcyE6--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/zqx086gzefbflj091p9b.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--1LFUcyE6--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/zqx086gzefbflj091p9b.png" alt="Distribution of Request Cookie sizes"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;It’s important to note that the cookies set by the time the favicon was requested may underrepresent the size of the cookies users would send later in a browsing session. For example, when logging into an application, a few additional cookies might be set. Some third parties that use a first party subdomain (e.g., if&lt;a href="http://www.example.com"&gt; www.example.com&lt;/a&gt; loaded a resource from metrics.example.com) also set a first party cookie.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Performance Implications of Large Cookies&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;When a browser sends an HTTP request, the HTTP request headers are usually 400-500 bytes. In the example below the request headers total 407 bytes.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;GET / HTTP/1.1
Host: www.example.com
Connection: keep-alive
Upgrade-Insecure-Requests: 1
User-Agent: Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_4) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/83.0.4103.116 Safari/537.36
Accept: text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,image/apng,*/*;q=0.8,application/signed-exchange;v=b3;q=0.9
Accept-Encoding: gzip, deflate, br
Accept-Language: en-US,en;q=0.9
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;Adding a cookie to that will increase the size of the request headers. If we add more than 1KB of cookies to that request, then we exceed 1,500 bytes, which is the standard Ethernet &lt;a href="https://en.wikipedia.org/wiki/Maximum_transmission_unit"&gt;maximum transmission unit (MTU)&lt;/a&gt;. This means that the HTTP request would span multiple TCP packets, which may result in additional round trips and increases the risk of retransmission. This can potentially increase the TTFB of the response, since it takes longer to deliver the request.&lt;/p&gt;
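&lt;p&gt;&lt;em&gt;A back-of-the-envelope check of that packet math, using the 407-byte example above and an assumed 40 bytes of IPv4+TCP header overhead:&lt;/em&gt;&lt;/p&gt;

```python
BASE_HEADERS = 407    # size of the example request headers above
MTU = 1500            # standard Ethernet MTU
IP_TCP_OVERHEAD = 40  # assumed typical IPv4 + TCP header bytes

def spans_multiple_packets(cookie_header_bytes):
    """True if the request no longer fits in a single MTU-sized packet."""
    total = BASE_HEADERS + cookie_header_bytes + IP_TCP_OVERHEAD
    return total > MTU

print(spans_multiple_packets(500))   # False: 947 bytes fits in one packet
print(spans_multiple_packets(1100))  # True: 1,547 bytes spills into a second packet
```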

&lt;p&gt;With HTTP/2, we have&lt;a href="https://tools.ietf.org/html/rfc7541"&gt; HPACK compression&lt;/a&gt;, which was designed to help reduce the size of HTTP request headers by utilizing a dynamic index table. Receiving endpoints &lt;a href="https://tools.ietf.org/html/rfc7540#section-6.5.2"&gt;advertise the maximum size of this table&lt;/a&gt; in bytes (the default is 4,096); the sender can insert headers up to this limit in one request or response and subsequently reference them in another. In theory HPACK seems like it could help reduce the overhead of these large cookies. However, it’s not as easy as it sounds. Consider the following:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt; If a request is made on a new HTTP/2 connection (ie, for a return visitor, or a navigation that happened after the previous connection times out), then the entire cookie string would be sent. This would impact TTFB for that first request (usually the HTML).&lt;/li&gt;
&lt;li&gt; Some servers intentionally exclude cookies from HPACK compression, since cookies are a very valuable target for an attacker. &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Because of the practical constraints on compressing cookies, and the impact of cookie size on the first flight of requests and responses, it is beneficial to use smaller cookies. Based on the observations here, 900 bytes seems like a good budget for a total cookie size, which leaves room for other headers such as user-agent (which can benefit from HPACK compression).&lt;/p&gt;
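&lt;p&gt;&lt;em&gt;A simple way to audit that 900-byte budget is to measure the Cookie header a browser would assemble; the cookie names below are hypothetical:&lt;/em&gt;&lt;/p&gt;

```python
COOKIE_BUDGET = 900  # suggested total cookie budget from this post, in bytes

def audit_cookie_budget(cookies):
    """cookies: dict of name to value. Returns (total_bytes, over_budget)."""
    # "name=value" pairs joined by "; " is what the Cookie request header carries
    header = "; ".join(f"{k}={v}" for k, v in cookies.items())
    total = len(header.encode("utf-8"))
    return total, total > COOKIE_BUDGET

total, over = audit_cookie_budget({"session": "abc123", "prefs": "dark"})
print(total, over)  # 26 False
```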

&lt;p&gt;&lt;strong&gt;Conclusion&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Cookies are everywhere, and they are set by both first and third party requests. Most browsers will limit the size of cookies stored to 4KB - which is still quite large.  &lt;/p&gt;

&lt;p&gt;At a minimum, setting a cookie larger than 4KB will simply not work. But even below that limit, large cookies can negatively impact your TTFB by increasing the number of round trips needed to deliver the HTTP request.&lt;/p&gt;

&lt;p&gt;It’s also important to keep track of how many cookies are being sent, as bloated cookies increase the time it takes to make an HTTP request, which in turn impacts your TTFB.&lt;/p&gt;

&lt;p&gt;If you are interested in seeing some of the SQL queries and raw data used in this analysis, I’ve created a post with all the details in the &lt;a href="https://discuss.httparchive.org/t/analysis-of-cookie-size/1991"&gt;HTTP Archive discussion forums&lt;/a&gt;. You can also see all the data used for these graphs in &lt;a href="https://docs.google.com/spreadsheets/d/1pO3lUWc41jw43rtk8JCxZi3tGFBxEV__Q-K4DZ0lc_s/edit?usp=sharing"&gt;this Google Sheet&lt;/a&gt;. &lt;/p&gt;

&lt;p&gt;&lt;em&gt;Many thanks to &lt;a href="https://twitter.com/SimmerVigor"&gt;Lucas Pardue&lt;/a&gt; and &lt;a href="https://twitter.com/ringel"&gt;Matt Ringel&lt;/a&gt; for reviewing this.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>webdev</category>
      <category>webperf</category>
      <category>cookies</category>
      <category>httparchive</category>
    </item>
    <item>
      <title>SameSite Cookies - Are you Ready?</title>
      <dc:creator>Paul Calvano</dc:creator>
      <pubDate>Tue, 07 Jul 2020 20:43:39 +0000</pubDate>
      <link>https://dev.to/httparchive/samesite-cookies-are-you-ready-5abd</link>
      <guid>https://dev.to/httparchive/samesite-cookies-are-you-ready-5abd</guid>
      <description>&lt;p&gt;Last year Google&lt;a href="https://blog.chromium.org/2019/05/improving-privacy-and-security-on-web.html"&gt; announced&lt;/a&gt; updates to Chrome that provide a way for developers to control how cross site cookies should work on their sites. This is a good change - as it ultimately improves end user security and privacy by limiting which third parties can read cookies that were set while visiting a different site. It also defeats &lt;a href="https://www.owasp.org/index.php/Cross-Site_Request_Forgery_(CSRF)"&gt;cross site request forgery attacks&lt;/a&gt;. The implementation is fairly simple, and only requires developers to add the SameSite attribute to their cookies. &lt;/p&gt;

&lt;p&gt;The SameSite attribute is &lt;a href="https://caniuse.com/#feat=same-site-cookie-attribute"&gt;supported by all modern browsers&lt;/a&gt;, and most have historically defaulted to a permissive use of cookies if the attribute isn’t present. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--5_-GJ0ak--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/gflnfrrax9xc44z8rxgb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--5_-GJ0ak--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/gflnfrrax9xc44z8rxgb.png" alt="SameSite Cookie Browser Support"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Google changed the default behavior of the SameSite attribute to secure cookies by default when Chrome 80 was released in February 2020. However, it was&lt;a href="https://blog.chromium.org/2020/05/resuming-samesite-cookie-changes-in-july.html"&gt; rolled back in April 2020&lt;/a&gt; to ensure stability during the initial stage of the COVID-19 response. Now they are planning to&lt;a href="https://blog.chromium.org/2020/05/resuming-samesite-cookie-changes-in-july.html"&gt; resume SameSite cookie enforcement&lt;/a&gt; with Chrome 84, which will be released on July 14th. &lt;/p&gt;

&lt;p&gt;Despite almost a year of notice and warnings in the browser console, this seemed to catch many by surprise in February.  How ready are third parties for this now?&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What is a Cross Site Third Party Cookie?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;If a request on &lt;a href="http://www.example.com"&gt;www.example.com&lt;/a&gt; sets the following cookie from its domain, then the browser will store the cookie and send it back on subsequent requests to the same domain. This is an example of a first party cookie: a cookie whose domain matches the domain that appears in the address bar.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;Set-Cookie: session=abc; path=/; Secure; HttpOnly;&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;Now let’s assume that &lt;a href="http://www.example.com"&gt;www.example.com&lt;/a&gt; includes a third party analytics provider, metrics.analyticsexample.com. When the third party request is made, that third party can also set a cookie in the end users browser. And that third party will be able to read the cookie. This is an example of a third party cookie.&lt;/p&gt;

&lt;p&gt;If that same user then navigated to &lt;a href="http://www.example2.com"&gt;www.example2.com&lt;/a&gt;, which uses the same third party analytics provider, then their third party cookies would be readable by them across both sites. The third party is then able to track the user across multiple websites.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;SameSite Cookies&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The SameSite cookie attribute was introduced in a &lt;a href="https://tools.ietf.org/html/draft-west-first-party-cookies-06"&gt;2016 IETF draft&lt;/a&gt;, but was not widely adopted initially. The attribute gives developers the ability to control when a browser will send a cookie to a third party. Using it is simply a matter of adding the SameSite attribute to a cookie declaration with one of the three supported values: “None”, “Lax”, and “Strict”.&lt;/p&gt;

&lt;p&gt;This provides the following controls:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  SameSite=None

&lt;ul&gt;
&lt;li&gt;  The browser will send cookies with both cross-site requests and same-site requests.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;  SameSite=Lax

&lt;ul&gt;
&lt;li&gt;  Same-site cookies are withheld on cross-site sub-requests, such as calls to load images or frames, but will be sent when a user navigates to the URL from an external site; for example, by following a link.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;  SameSite=Strict

&lt;ul&gt;
&lt;li&gt;  The browser will only send cookies for same-site requests (requests originating from the site that set the cookie). If the request originated from a different URL than the URL of the current location, none of the cookies tagged with the Strict attribute will be included.&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;An example of how this is configured is:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;Set-Cookie: key=value; SameSite=Strict&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;If the SameSite attribute is not included, then most browsers have historically defaulted to the most permissive behavior: SameSite=None.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Google Chrome’s Update&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Google has been planning to update the behavior of SameSite within the Chrome browser to default to the more secure SameSite=Lax. Additionally, if a SameSite=None attribute is present, then they would require that the cookie have the “Secure” attribute. There was some concern that this change would cause breakage for some third parties, so a warning message was included in Chrome since version 77 (September 2019).&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;A cookie associated with a cross-site resource at  was set without the &lt;code&gt;SameSite&lt;/code&gt; attribute. A future release of Chrome will only deliver cookies with cross-site requests if they are set with &lt;code&gt;SameSite=None&lt;/code&gt; and &lt;code&gt;Secure&lt;/code&gt;. You can review cookies in developer tools under Application&amp;gt;Storage&amp;gt;Cookies and see more details at &lt;a href="https://www.chromestatus.com/feature/5088147346030592"&gt;https://www.chromestatus.com/feature/5088147346030592&lt;/a&gt; and&lt;a href="https://www.chromestatus.com/feature/5633521622188032"&gt; https://www.chromestatus.com/feature/5633521622188032&lt;/a&gt;.&lt;/p&gt;
&lt;/blockquote&gt;
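&lt;p&gt;&lt;em&gt;The enforcement rules described above (a missing attribute defaults to Lax; None requires the Secure flag) can be sketched as:&lt;/em&gt;&lt;/p&gt;

```python
def effective_samesite(samesite, secure):
    """Sketch of Chrome's announced enforcement: a missing attribute
    defaults to Lax, and None is only honored when Secure is also set."""
    if samesite is None:
        return "Lax"
    value = samesite.capitalize()
    if value == "None" and not secure:
        return "Lax"  # downgraded: SameSite=None requires the Secure flag
    return value

print(effective_samesite(None, False))      # "Lax"
print(effective_samesite("none", False))    # "Lax" (downgraded)
print(effective_samesite("none", True))     # "None"
print(effective_samesite("strict", True))   # "Strict"
```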

&lt;p&gt;&lt;strong&gt;SameSite Usage Across the Web&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The &lt;a href="https://httparchive.org/"&gt;HTTP Archive&lt;/a&gt; stores a tremendous amount of detail for every HTTP request and response for approximately 5.8 million homepages. In the June 2020 data, there were approximately 108 million third party cookies set across 3.79 million homepages. Of these cookies, 35,721,768 (32.9%) included the SameSite attribute. Comparatively, in August 2019, 21.4% of cookies had the SameSite attribute.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--yK40uYjg--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/a1lrpyd2wp7irwpfhbjs.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--yK40uYjg--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/a1lrpyd2wp7irwpfhbjs.png" alt="SameSite usage for third party cookies"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Note: Due to a collection issue described&lt;a href="https://discuss.httparchive.org/t/does-bigquery-contain-har-archive-or-cookies-of-crawled-webpages/1968/8"&gt; here&lt;/a&gt;, ~18.6% of third party cookies were unreadable in the June 2020 HTTP Archive data. The remainder of this analysis is on the cookies we could read.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The “Secure” flag of a cookie ensures that the browser only sends the cookie over HTTPS. Chrome made this a requirement to use SameSite=None. Out of the 35 million cookies, nearly 75% of them use the Secure flag. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--iLttFAI0--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/00imlj5llluyriyrzz2g.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--iLttFAI0--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/00imlj5llluyriyrzz2g.png" alt="Table Breakdown of Secure vs Non-Secure Cookies and SameSite attribuets"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;When we look at this graphically, there are a few interesting observations we can make:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  SameSite=Lax, which will be the new default, is in use by only 10.82% of secure cookies, but 97% of insecure cookies.&lt;/li&gt;
&lt;li&gt;  SameSite=None is present on 89.10% of Secure cookies.&lt;/li&gt;
&lt;li&gt;  2.65% (238,810) of insecure cookies are set with SameSite=None but without the Secure flag. These will default to SameSite=Lax.&lt;/li&gt;
&lt;li&gt;  Only 0.06% (16,000) of secure cookies are using SameSite=Strict!&lt;/li&gt;
&lt;li&gt;  There are fewer than 1,000 SameSite attributes set to an erroneous value (i.e., not Lax, Strict or None).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Scwm9sMs--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/gvta0gbt9uemri927dff.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Scwm9sMs--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/gvta0gbt9uemri927dff.png" alt="SameSite Usage for Third Party Cookies, Secure and non-Secure"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Who is using SameSite=None incorrectly?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;There were 238,810 third party cookies set with SameSite=None but missing the Secure flag. These will default to SameSite=Lax in Chrome 84 unless the Secure flag is added. Overall, there were 1,749 third party domains that made this error. The top 5 account for 48% of the erroneous SameSite cookies: Spotxchange, ETargetNet, SmartAdServer, BazaarVoice and EntityTag. &lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--wq0dJn2B--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/3ehsetnuxb2l828noj1a.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--wq0dJn2B--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/3ehsetnuxb2l828noj1a.png" alt="Cookies with SameSite=None, but not secure"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Even more concerning than the number of cookies is the number of websites that are affected. For example, spotxchange.com is setting SameSite=None with insecure cookies on 26,174 websites. EntityTag.co.uk is doing the same for 14,358 websites.&lt;/p&gt;
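&lt;p&gt;&lt;em&gt;Flagging this particular misconfiguration in a Set-Cookie header only takes a few lines:&lt;/em&gt;&lt;/p&gt;

```python
def misconfigured_samesite(set_cookie_header):
    """Flag the error analyzed above: SameSite=None without the Secure flag."""
    attrs = [part.strip().lower() for part in set_cookie_header.split(";")]
    return "samesite=none" in attrs and "secure" not in attrs

print(misconfigured_samesite("id=1; SameSite=None"))          # True
print(misconfigured_samesite("id=1; SameSite=None; Secure"))  # False
```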

&lt;p&gt;&lt;strong&gt;Who is using SameSite Strict?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;I thought it was odd that SameSite=Strict was used so infrequently. The table below shows some of the third parties that are using it. The top 10 account for 67% of all SameSite=Strict usage.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--yL3aoV6m--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/p96dsxgsqdo99n11ylwq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--yL3aoV6m--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/p96dsxgsqdo99n11ylwq.png" alt="Cookies with SameSite=Strict"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What Erroneous SameSite Attribute Values are in Use?&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;I was surprised to see that there was such a low percentage of erroneous usage of the SameSite attribute. The top 10 errors listed below account for 80% of the erroneous uses.  Most were SameSite=Secure, SameSite with no value, and SameSite: Lax. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--USlHWOs7--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/6i8n6hty9254t2hf8lh7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--USlHWOs7--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/6i8n6hty9254t2hf8lh7.png" alt="Erroneous SameSite Usage"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Many of these errors were from a small number of third parties. For example:  &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  SameSite=secure was set by wildbeardbg.com:

&lt;ul&gt;
&lt;li&gt;  dg_perm_sessid=95951591788161; secure; SameSite=Secure&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;  SameSite: Lax was set by nytimes.com:

&lt;ul&gt;
&lt;li&gt;  nyt-purr=cfshcfhssc; Expires=Fri, 18 Jun 2021 01:02:51 GMT; Path=/; Domain=.nytimes.com;SameSite: Lax;Secure&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;  SameSite=; was set by cloudengage.com:

&lt;ul&gt;
&lt;li&gt;  CEUID=E80y%2BgcVMl0o6Skdol5X9FRCrdrbqBNssQeOYrNUnQxC42U4gpGNV6VX; expires=Wed, 01-Jul-2020 16:16:13 GMT; Max-Age=2592000; path=/; samesite=; domain=.cloudengage.com; secure; HttpOnly
&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;/ul&gt;
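&lt;p&gt;To make these malformed values concrete, here is a minimal Python sketch that flags the kinds of SameSite errors listed above. (The actual analysis was done in SQL against BigQuery; this standalone function is purely illustrative.) The only valid values are Strict, Lax, and None:&lt;/p&gt;

```python
# Illustrative sketch: flag erroneous SameSite attributes in a
# Set-Cookie header. Valid values are Strict, Lax, and None.
VALID_SAMESITE = {"strict", "lax", "none"}

def samesite_errors(set_cookie_header):
    """Return a list of problems with the SameSite attribute, if any."""
    errors = []
    for part in set_cookie_header.split(";"):
        attr = part.strip()
        name, sep, value = attr.partition("=")
        if name.strip().lower() == "samesite":
            value = value.strip()
            if not sep or not value:
                errors.append("SameSite with no value")
            elif value.lower() not in VALID_SAMESITE:
                errors.append("invalid value: " + value)
        elif attr.lower().startswith("samesite:"):
            # e.g. "SameSite: Lax" uses a colon instead of "="
            errors.append("colon used instead of '='")
    return errors
```

&lt;p&gt;For example, the nytimes.com cookie above yields "colon used instead of '='", and the cloudengage.com cookie yields "SameSite with no value".&lt;/p&gt;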

&lt;p&gt;&lt;strong&gt;SameSite Usage Across Popular Third Parties&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;So far we’ve looked at how SameSite is being used across third parties. But which third parties are not setting SameSite attributes at all? Without the attribute, their cookies will default to SameSite=Lax. Whether this is intentional, and whether it will cause breakage, depends on how each third party uses its cookies. But the absence of an explicit SameSite attribute could be an indication of whether a third party has prepared for this change.&lt;/p&gt;
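&lt;p&gt;The new behavior can be modeled roughly like this (a simplified sketch of Chrome 80's announced defaults, not Chrome's actual implementation): absent or invalid SameSite values are treated as Lax, and SameSite=None only works when the cookie is also Secure.&lt;/p&gt;

```python
# Simplified model of Chrome 80's announced cookie defaults (a sketch,
# not Chrome's actual implementation).
def effective_samesite(attributes):
    """attributes: cookie attribute names (lowercase) mapped to values."""
    value = attributes.get("samesite", "").strip().lower()
    if value == "none":
        if "secure" in attributes:
            return "none"      # explicitly opted in to cross-site use
        return "rejected"      # SameSite=None without Secure is dropped
    if value in ("strict", "lax"):
        return value
    return "lax"               # missing or invalid values default to Lax
```

&lt;p&gt;Under these rules, a third party that sets no SameSite attribute today will find its cookies largely confined to same-site requests (Lax) once the default flips.&lt;/p&gt;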

&lt;p&gt;The graph below shows SameSite usage across some of the most popular third parties. Very few third parties are setting SameSite=Lax, which is about to become the new default. Some of these third parties are used by hundreds of thousands of sites.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--YJR1vfLW--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/zofi46l9kj16ojfc8g3n.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--YJR1vfLW--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/zofi46l9kj16ojfc8g3n.png" alt="SameSite Usage Across Popular Third Parties"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Conclusion&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;SameSite cookies are a huge win for privacy and security, but there is a risk that Chrome’s new default settings will cause problems. In many cases, the change will likely render some cross-site tracking techniques ineffective with little impact on the end user experience. However, with any change like this there is a risk of breakage and serious site issues. With SameSite set on less than 33% of third-party cookies, it’s fair to ask how prepared third parties are for this change.&lt;/p&gt;

&lt;p&gt;It’s important to note that the absence of a SameSite attribute does not necessarily mean that there will be breakage. However, depending on how a cookie is used, it has the potential to become problematic. It may be worth checking the JS console in Chrome DevTools to see whether the SameSite warning appears for any of your third parties. You can also set a flag in Chrome to test how this will affect your site ahead of the Chrome 84 release. The Chrome team has published a &lt;a href="https://www.chromium.org/updates/same-site/test-debug"&gt;useful guide for debugging SameSite cookies&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;If you are interested in seeing some of the SQL queries and raw data used in this analysis, I’ve created a post with all the details in the &lt;a href="https://discuss.httparchive.org/t/samesite-cookies-analysis/1988"&gt;HTTP Archive discussion forums&lt;/a&gt;. You can also see all the data used for these graphs in &lt;a href="https://docs.google.com/spreadsheets/d/1-zx1AmcvDDSKjOLsM3-AbV9qVXV-Of6dmps8LG-oMSk/edit?usp=sharing"&gt;this Google Sheet&lt;/a&gt;. &lt;/p&gt;

</description>
      <category>webdev</category>
      <category>httparchive</category>
      <category>cookies</category>
      <category>webtransparency</category>
    </item>
    <item>
      <title>The 2019 Web Almanac is now available as a free ebook!</title>
      <dc:creator>Rick Viscomi</dc:creator>
      <pubDate>Sat, 23 May 2020 22:02:41 +0000</pubDate>
      <link>https://dev.to/httparchive/the-2019-web-almanac-is-now-available-as-a-free-ebook-1b67</link>
      <guid>https://dev.to/httparchive/the-2019-web-almanac-is-now-available-as-a-free-ebook-1b67</guid>
      <description>&lt;p&gt;Last year &lt;a href="https://almanac.httparchive.org/en/2019/contributors?teams=authors"&gt;29&lt;/a&gt; subject matter experts from the web community came together in a massive effort to document the state of the web, called the &lt;a href="https://almanac.httparchive.org"&gt;Web Almanac&lt;/a&gt;. They wrote about 20 topics in the areas of page content, user experience, content publishing, and distribution. Chapters include JavaScript, CSS, Performance, SEO, Ecommerce, CMS, CDN, HTTP/2, and many more.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--_hfTyjkE--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/rufd2e2iimo2505ltk0v.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--_hfTyjkE--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/rufd2e2iimo2505ltk0v.jpg" alt="Screenshot of the Caching chapter in the 2019 Web Almanac ebook"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Now you can &lt;a href="https://almanac.httparchive.org/static/pdfs/web_almanac_2019_en.pdf"&gt;download an ebook&lt;/a&gt; of the entire 2019 edition (for free). It's 421 pages and 18 MB of solid research and analysis from trusted web experts packaged up for ergonomic e-reading. We've also translated the entire contents into &lt;a href="https://almanac.httparchive.org/static/pdfs/web_almanac_2019_ja.pdf"&gt;Japanese&lt;/a&gt;. Or if you'd prefer to browse the content on the web, you can always visit &lt;a href="https://almanac.httparchive.org/en/2019/"&gt;almanac.httparchive.org&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;The mission of the Web Almanac project is to document and raise awareness of the state of the web using the huge amounts of data from millions of websites in the &lt;a href="https://httparchive.org"&gt;HTTP Archive&lt;/a&gt;. This project is invaluable to understanding how the web is trending and the areas in which we need to be doing better. I hope you check it out!&lt;/p&gt;

</description>
      <category>webdev</category>
      <category>webalmanac</category>
      <category>free</category>
    </item>
    <item>
      <title>Certificate Validity Dates</title>
      <dc:creator>Paul Calvano</dc:creator>
      <pubDate>Thu, 20 Feb 2020 20:55:52 +0000</pubDate>
      <link>https://dev.to/httparchive/certificate-validity-dates-1d50</link>
      <guid>https://dev.to/httparchive/certificate-validity-dates-1d50</guid>
      <description>&lt;p&gt;Back in 2017 the maximum validity lifetime for an HTTPS certificate was &lt;a href="https://cabforum.org/2017/03/17/ballot-193-825-day-certificate-lifetimes/"&gt;set to 825 days&lt;/a&gt;, a decision that was widely supported by both browsers and certificate authorities. However, since then there have been multiple unsuccessful attempts at reducing the maximum lifetime to one year. &lt;a href="https://twitter.com/Scott_Helme"&gt;Scott Helme&lt;/a&gt; has &lt;a href="https://scotthelme.co.uk/ballot-sc22-reduce-certificate-lifetimes/"&gt;written about this previously&lt;/a&gt;, and his blog post noted that browser vendors unanimously supported this while some certificate authorities objected to it.&lt;/p&gt;

&lt;p&gt;Information is still trickling in, but it seems that Safari is planning to enforce a max validity lifetime of 398 days effective September 1st, 2020.&lt;/p&gt;


&lt;blockquote class="ltag__twitter-tweet"&gt;

  &lt;div class="ltag__twitter-tweet__main"&gt;
    &lt;div class="ltag__twitter-tweet__header"&gt;
      &lt;img class="ltag__twitter-tweet__profile-image" src="https://res.cloudinary.com/practicaldev/image/fetch/s--HNThWLvv--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://pbs.twimg.com/profile_images/1102690863233359875/dF-wxPzS_normal.png" alt="Dean Coclin profile image"&gt;
      &lt;div class="ltag__twitter-tweet__full-name"&gt;
        Dean Coclin
      &lt;/div&gt;
      &lt;div class="ltag__twitter-tweet__username"&gt;
        @chosensecurity
      &lt;/div&gt;
      &lt;div class="ltag__twitter-tweet__twitter-logo"&gt;
        &lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--P4t6ys1m--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://practicaldev-herokuapp-com.freetls.fastly.net/assets/twitter-f95605061196010f91e64806688390eb1a4dbc9e913682e043eb8b1e06ca484f.svg" alt="twitter logo"&gt;
      &lt;/div&gt;
    &lt;/div&gt;
    &lt;div class="ltag__twitter-tweet__body"&gt;
      Today's big news: One year max public TLS certs are coming, starting 1 Sept 2020, if you want to be trusted in Safari.
    &lt;/div&gt;
    &lt;div class="ltag__twitter-tweet__date"&gt;
      22:10 - 19 Feb 2020
    &lt;/div&gt;


    &lt;div class="ltag__twitter-tweet__actions"&gt;
      &lt;a href="https://twitter.com/intent/tweet?in_reply_to=1230253348236013570" class="ltag__twitter-tweet__actions__button"&gt;
        &lt;img src="https://practicaldev-herokuapp-com.freetls.fastly.net/assets/twitter-reply-action.svg" alt="Twitter reply action"&gt;
      &lt;/a&gt;
      &lt;a href="https://twitter.com/intent/retweet?tweet_id=1230253348236013570" class="ltag__twitter-tweet__actions__button"&gt;
        &lt;img src="https://practicaldev-herokuapp-com.freetls.fastly.net/assets/twitter-retweet-action.svg" alt="Twitter retweet action"&gt;
      &lt;/a&gt;
      103
      &lt;a href="https://twitter.com/intent/like?tweet_id=1230253348236013570" class="ltag__twitter-tweet__actions__button"&gt;
        &lt;img src="https://practicaldev-herokuapp-com.freetls.fastly.net/assets/twitter-like-action.svg" alt="Twitter like action"&gt;
      &lt;/a&gt;
      149
    &lt;/div&gt;
  &lt;/div&gt;
&lt;/blockquote&gt;


&lt;p&gt;Let’s take a look at the state of certificate validity ranges today, so we can track how this evolves over the next few months. The data for this comes from the &lt;a href="https://httparchive.org"&gt;HTTP Archive&lt;/a&gt;, which is an open source project that tracks how the web is built. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Certificate Validity Dates in the Wild&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The HTTP Archive requests table contains certificate details for every HTTPS request that was served to the 5.3 million websites tracked. The details are in the &lt;code&gt;$._securityDetails&lt;/code&gt; payload, and the data contains information on 4,397,690 unique hosts. Out of all of these, 136,007 hostnames served a certificate with a validity period greater than 825 days. That’s 3.09% of all hosts!&lt;/p&gt;

&lt;p&gt;Certificate validity periods are widely distributed, but a few ranges stand out. The most common validity period is 90 days, likely due to the popularity of LetsEncrypt. Overall, 55% of certificates have a validity period of less than 364 days, which I’ve highlighted in green below. An additional 20% of certificates have a validity between 365 and 398 days, which will meet Safari’s requirements. The remaining 25% have a validity period of more than 398 days.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--4Qx8Dpj9--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://lh6.googleusercontent.com/AOBjFo3j8o70P2yGzR0HGLyJ28q1MxwwsLCW3HkDU9_5lCdxPLyRjioJfnKCHGr4sgLQNTS9MDJwQxKb9kcQtDBztwXX1CpIjdvrP_K3fHsKr-3KOeukfoMaWSbmKWll1fkBMM2E" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--4Qx8Dpj9--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://lh6.googleusercontent.com/AOBjFo3j8o70P2yGzR0HGLyJ28q1MxwwsLCW3HkDU9_5lCdxPLyRjioJfnKCHGr4sgLQNTS9MDJwQxKb9kcQtDBztwXX1CpIjdvrP_K3fHsKr-3KOeukfoMaWSbmKWll1fkBMM2E" alt="|624x181"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Looking at this by certificate authority is also quite interesting. The graph below shows the top 15 certificate authorities. LetsEncrypt alone accounts for 38.4% of all certificates.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--5ICX-g5c--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://lh6.googleusercontent.com/tmVhJjeDtjJatp4JQ2G030Xqy4zar5WP3oOq-oXmSNau8yg1uUJLXMpNwTntspj_9myJGGYZ0pwe96YYS47aZBP99TQl7Y00m1WC-YHlOnGIw0BvbMEPjCpnEkFCMBtq5nRQ3xiF" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--5ICX-g5c--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://lh6.googleusercontent.com/tmVhJjeDtjJatp4JQ2G030Xqy4zar5WP3oOq-oXmSNau8yg1uUJLXMpNwTntspj_9myJGGYZ0pwe96YYS47aZBP99TQl7Y00m1WC-YHlOnGIw0BvbMEPjCpnEkFCMBtq5nRQ3xiF" alt="|624x227"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;When we look at certificate validity periods for these certificate authorities, we can see that there’s a mix. Certificates issued by LetsEncrypt, Cloudflare, cPanel and Amazon already meet the 398-day requirement. However, Sectigo, GoDaddy, DigiCert, Comodo and RapidSSL have a very large percentage of certificates that exceed 398 days.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--6QK_y5AH--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://lh6.googleusercontent.com/tNsRLp4kNsUJ55xIHvynO_7-pxNKooBth07BFajaVIODH1vZvlvswd_gA72N3rAvj8Ke4Z_DxW6ogi7Sworl4V8SRN0e4MePBE5xsV_nvC8fho0dyKM96GBgh_9R7puv5KE_j0up" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--6QK_y5AH--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://lh6.googleusercontent.com/tNsRLp4kNsUJ55xIHvynO_7-pxNKooBth07BFajaVIODH1vZvlvswd_gA72N3rAvj8Ke4Z_DxW6ogi7Sworl4V8SRN0e4MePBE5xsV_nvC8fho0dyKM96GBgh_9R7puv5KE_j0up" alt="|624x284"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.digicert.com/position-on-1-year-certificates/"&gt;DigiCert released a public statement&lt;/a&gt; yesterday, which confirms that existing certificates with a validity range &amp;gt;398 days will continue to be trusted by Safari, but that certificates issued after August 30th won’t be able to exceed 398 days.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;For your website to be trusted by Safari, you will no longer be able to issue publicly trusted TLS certificates with validities longer than 398 days after Aug. 30, 2020. Any certificates issued before Sept. 1, 2020 will still be valid, regardless of the validity period (up to 825 days). Certificates that are not publicly trusted can still be recognized, up to a maximum validity of 825 days. &lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;I’m interested in seeing how this will evolve in the coming months, especially once there are more formal announcements about this. If you would like to see the queries used for this analysis, I've detailed them in this &lt;a href="https://discuss.httparchive.org/t/certificate-validity-dates/1874"&gt;HTTP Archive discussion forum post&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Many thanks to Scott Helme for reviewing this.&lt;/p&gt;

</description>
      <category>httparchive</category>
      <category>security</category>
      <category>webdev</category>
    </item>
    <item>
      <title>I created the Web Almanac. Ask me anything about the state of the web!</title>
      <dc:creator>Rick Viscomi</dc:creator>
      <pubDate>Sun, 24 Nov 2019 02:17:24 +0000</pubDate>
      <link>https://dev.to/httparchive/i-created-the-web-almanac-ask-me-anything-about-the-state-of-the-web-3i3c</link>
      <guid>https://dev.to/httparchive/i-created-the-web-almanac-ask-me-anything-about-the-state-of-the-web-3i3c</guid>
      <description>&lt;p&gt;Hi everyone! I'm doing my first AMA to get people thinking about the state of the web.&lt;/p&gt;

&lt;p&gt;My day job for the past ~2 years has been web developer relations at Google. Prior to that I worked on web performance for ~4 years at YouTube. In my current role I'm a steward of web transparency datasets like the &lt;a href="https://httparchive.org/"&gt;HTTP Archive&lt;/a&gt; and &lt;a href="https://web.dev/chrome-ux-report/"&gt;Chrome UX Report&lt;/a&gt; projects. Web transparency is all about cultivating a public body of knowledge about how the web is built and experienced. I also host a video series called the &lt;a href="https://www.youtube.com/playlist?list=PLNYkxOF6rcIBGvYSYO-VxOsaYQDw5rifJ"&gt;State of the Web&lt;/a&gt; where I interview members of the community about web trends and technologies. My job takes me all around the world to meet with developers at conferences, share transparency data, and hear their stories about building on the web.&lt;/p&gt;

&lt;p&gt;The big project I've been working on this year is the &lt;a href="https://almanac.httparchive.org/en/2019/"&gt;Web Almanac&lt;/a&gt;, the first annual edition of HTTP Archive's report on the state of the web, which launched at Chrome Dev Summit last week. I led the project and coordinated with 80+ community contributors to build everything from scratch (planning/writing content, researching stats, developing the website, etc). The end result is a massive resource that sheds a light on how the web is doing at the scale of millions of websites.&lt;/p&gt;

&lt;p&gt;Here are some of the interesting insights from each chapter:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://almanac.httparchive.org/en/2019/javascript#open-source-libraries-and-frameworks"&gt;jQuery is found on 85% of web pages&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://almanac.httparchive.org/en/2019/css#popular-z-index-values"&gt;the largest known z-index is 780 digits (!important)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://almanac.httparchive.org/en/2019/markup#perspective-on-value-and-usage"&gt;there are only 11 types of HTML elements that are found on 90+% of web pages&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://almanac.httparchive.org/en/2019/media"&gt;the median web page is two-thirds images&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://almanac.httparchive.org/en/2019/third-parties#data"&gt;94% of web pages contain at least one third party&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://almanac.httparchive.org/en/2019/fonts#where-did-you-get-those-web-fonts"&gt;Google Fonts makes up 75% of all pages' web fonts&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://almanac.httparchive.org/en/2019/performance#first-contentful-paint"&gt;13% of websites deliver consistently fast performance&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://almanac.httparchive.org/en/2019/security#content-security-policy"&gt;Content Security Policy is used on 5% of web pages&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://almanac.httparchive.org/en/2019/accessibility#color-contrast"&gt;78% of mobile pages have color contrast issues&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://almanac.httparchive.org/en/2019/seo#content"&gt;the median desktop web page contains 346 words&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://almanac.httparchive.org/en/2019/pwa#service-workers"&gt;0.4% of pages register a service worker&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://almanac.httparchive.org/en/2019/mobile-web#zooming-and-scaling"&gt;one third of mobile web pages disable zooming&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://almanac.httparchive.org/en/2019/ecommerce"&gt;10% of pages use an ecommerce platform&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://almanac.httparchive.org/en/2019/cms#cms-adoption"&gt;40% of pages use a CMS&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://almanac.httparchive.org/en/2019/compression#what-types-of-content-are-we-compressing"&gt;56% of HTML resources are uncompressed&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://almanac.httparchive.org/en/2019/caching#overview-of-http-caching"&gt;72% of HTTP responses include a cache control header&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://almanac.httparchive.org/en/2019/cdn#cdn-adoption-and-usage"&gt;20% of web pages use a CDN for their HTML&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://almanac.httparchive.org/en/2019/page-weight#page-weight"&gt;the median desktop page weighs 1,934 KB&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://almanac.httparchive.org/en/2019/resource-hints#resource-hints"&gt;29% of web pages are using &lt;code&gt;dns-prefetch&lt;/code&gt;&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://almanac.httparchive.org/en/2019/http2#adoption-of-http2"&gt;54% of HTTP responses are served over HTTP/2&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you find any of this interesting, I'd love to hear your questions about web transparency, the state of the web, the Almanac project, or anything. AMA!&lt;/p&gt;

</description>
      <category>ama</category>
      <category>discuss</category>
      <category>webdev</category>
    </item>
    <item>
      <title>The Web Almanac 2019 is live!</title>
      <dc:creator>Rick Viscomi</dc:creator>
      <pubDate>Mon, 11 Nov 2019 05:41:29 +0000</pubDate>
      <link>https://dev.to/httparchive/the-web-almanac-2019-is-live-4f98</link>
      <guid>https://dev.to/httparchive/the-web-almanac-2019-is-live-4f98</guid>
      <description>&lt;p&gt;I finally get to share with everyone what I've been working on for so long! It's called the &lt;a href="https://almanac.httparchive.org/en/2019/"&gt;Web Almanac&lt;/a&gt; and it's a free, open source, community-made "state of the web" report.&lt;/p&gt;

&lt;h2&gt;
  
  
  What's the Web Almanac?
&lt;/h2&gt;

&lt;p&gt;Here is an excerpt of the &lt;a href="https://almanac.httparchive.org/en/2019/table-of-contents"&gt;foreword&lt;/a&gt; I wrote:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;The open web is an amazingly complex, evolving network of technologies. Entire industries and careers are built on the web and depend on its vibrant ecosystem to succeed. As critical as the web is, understanding how it's doing has been surprisingly elusive. Since 2010, the mission of the HTTP Archive project has been to track how the web is built, and it's been doing an amazing job of it. However, there has been one gap that has been especially challenging to close: bringing meaning to the data that the HTTP Archive project has been collecting and enabling the community to easily understand how the web is performing. That's where the Web Almanac comes in.&lt;/p&gt;

&lt;p&gt;The mission of the Web Almanac is to take the treasure trove of insights that would otherwise be accessible only to intrepid data miners, and package it up in a way that's easy to understand. This is made possible with the help of industry experts who can make sense of the data and tell us what it means. Each of the 20 chapters in the Web Almanac focuses on a specific aspect of the web, and each one has been authored and peer reviewed by experts in their field. The strength of the Web Almanac flows directly from the expertise of the people who write it.&lt;/p&gt;

&lt;p&gt;Many of the findings in the Web Almanac are worthy of celebration, but it's also an important reminder of the work still required to deliver high-quality user experiences. The data-driven analyses in each chapter are a form of accountability we all share for developing a better web. It's not about shaming those that are getting it wrong, but about shining a guiding light on the path of best practices so there is a clear, right way to do things. With the continued help of the web community, we hope to make this an annual tradition, so each year we can track our progress and make course corrections as needed.&lt;/p&gt;

&lt;p&gt;There is so much to learn in this report, so start exploring and share your takeaways with the community so we can collectively advance our understanding of the state of the web.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  What's inside
&lt;/h2&gt;

&lt;p&gt;There are 20 chapters organized into four main parts exploring different aspects of the web.&lt;/p&gt;

&lt;h3&gt;
  
  
  Part I. Page Content
&lt;/h3&gt;

&lt;p&gt;Chapter 1. &lt;a href="https://almanac.httparchive.org/en/2019/javascript"&gt;JavaScript&lt;/a&gt;&lt;br&gt;
Chapter 2. &lt;a href="https://almanac.httparchive.org/en/2019/css"&gt;CSS&lt;/a&gt;&lt;br&gt;
Chapter 3. &lt;a href="https://almanac.httparchive.org/en/2019/markup"&gt;Markup&lt;/a&gt;&lt;br&gt;
Chapter 4. &lt;a href="https://almanac.httparchive.org/en/2019/media"&gt;Media&lt;/a&gt;&lt;br&gt;
Chapter 5. &lt;a href="https://almanac.httparchive.org/en/2019/third-parties"&gt;Third Parties&lt;/a&gt;&lt;br&gt;
Chapter 6. &lt;a href="https://almanac.httparchive.org/en/2019/fonts"&gt;Fonts&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Part II. User Experience
&lt;/h3&gt;

&lt;p&gt;Chapter 7. &lt;a href="https://almanac.httparchive.org/en/2019/performance"&gt;Performance&lt;/a&gt;&lt;br&gt;
Chapter 8. &lt;a href="https://almanac.httparchive.org/en/2019/security"&gt;Security&lt;/a&gt;&lt;br&gt;
Chapter 9. &lt;a href="https://almanac.httparchive.org/en/2019/accessibility"&gt;Accessibility&lt;/a&gt;&lt;br&gt;
Chapter 10. &lt;a href="https://almanac.httparchive.org/en/2019/seo"&gt;SEO&lt;/a&gt;&lt;br&gt;
Chapter 11. &lt;a href="https://almanac.httparchive.org/en/2019/pwa"&gt;PWA&lt;/a&gt;&lt;br&gt;
Chapter 12. &lt;a href="https://almanac.httparchive.org/en/2019/mobile-web"&gt;Mobile Web&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Part III. Content Publishing
&lt;/h3&gt;

&lt;p&gt;Chapter 13. &lt;a href="https://almanac.httparchive.org/en/2019/ecommerce"&gt;Ecommerce&lt;/a&gt;&lt;br&gt;
Chapter 14. &lt;a href="https://almanac.httparchive.org/en/2019/cms"&gt;CMS&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Part IV. Content Distribution
&lt;/h3&gt;

&lt;p&gt;Chapter 15. &lt;a href="https://almanac.httparchive.org/en/2019/compression"&gt;Compression&lt;/a&gt;&lt;br&gt;
Chapter 16. &lt;a href="https://almanac.httparchive.org/en/2019/caching"&gt;Caching&lt;/a&gt;&lt;br&gt;
Chapter 17. &lt;a href="https://almanac.httparchive.org/en/2019/cdn"&gt;CDN&lt;/a&gt;&lt;br&gt;
Chapter 18. &lt;a href="https://almanac.httparchive.org/en/2019/page-weight"&gt;Page Weight&lt;/a&gt;&lt;br&gt;
Chapter 19. &lt;a href="https://almanac.httparchive.org/en/2019/resource-hints"&gt;Resource Hints&lt;/a&gt;&lt;br&gt;
Chapter 20. &lt;a href="https://almanac.httparchive.org/en/2019/http2"&gt;HTTP/2&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Who made it
&lt;/h2&gt;

&lt;p&gt;The Web Almanac is a community effort. &lt;a href="https://almanac.httparchive.org/en/2019/contributors"&gt;85 people&lt;/a&gt; have contributed to the project to write, peer review, edit, and translate content, analyze and visualize the data, and build and design the website. This was an enormous effort that couldn't be done without everyone's help. You can check out each and every one of them on the &lt;a href="https://almanac.httparchive.org/en/2019/contributors"&gt;Contributors&lt;/a&gt; page to see where they helped.&lt;/p&gt;

&lt;p&gt;I assumed the role of ringleader to help guide everyone to today's launch :) (And I must say I think it went really well, considering the number of people to herd!)&lt;/p&gt;

&lt;h2&gt;
  
  
  How was it made
&lt;/h2&gt;

&lt;p&gt;The &lt;a href="https://almanac.httparchive.org/en/2019/methodology"&gt;Methodology&lt;/a&gt; page goes into all the detail of how the process worked and where the results come from. There are so many amazing pieces of technology that went into this report, spanning data from over 5 million websites and consuming terabytes of queryable storage.&lt;/p&gt;

&lt;h2&gt;
  
  
  What's next
&lt;/h2&gt;

&lt;p&gt;I don't want us to stop here. The web is constantly changing with new technologies and evolving adoption. I would love to see this renewed each year with fresh perspectives from different members of the community offering their take on the state of the web.&lt;/p&gt;

&lt;p&gt;If you're interested in joining us, please fill out &lt;a href="https://forms.gle/Qyf3q5pKgdH1cBhq5"&gt;this form&lt;/a&gt; and subscribe to our &lt;a href="https://github.com/HTTPArchive/almanac.httparchive.org"&gt;GitHub repository&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;In order to make the web a better place, we need to observe, quantify, and analyze it over time to make sure we're heading in the right direction. Join us for next year's Web Almanac 2020 edition!&lt;/p&gt;

</description>
      <category>webalmanac</category>
      <category>webdev</category>
      <category>community</category>
    </item>
  </channel>
</rss>
