Antonina Listopadova

Posted on May 18, 2021 • Edited on May 19, 2021

Content Indexing API: pages available offline

#javascript #frontend #google

Hello, my name is Antonina. I work as a front-end developer at Rambler&Co, on the Lenta.ru team.

Content Indexing API is a new tool from Google that shows which pages are available offline. I'll tell you how Content Indexing API works, when to use it, and how our team implemented it.

About the project

Lenta.ru is a Russian online news publication. It has about 7 million unique visitors a day, with peaks of 12 million. About 20% use the desktop version and 80% use the mobile version, which also has an offline version. The offline version is what we will talk about next.

Offline version of Lenta.ru

As a brief digression, I'll tell you how our offline version works.

Why does Lenta.ru even need it? To provide content whether or not the user has an Internet connection. The main focus is on information, so the offline version contains only the home page, the content pages themselves, and a page with a game of tic-tac-toe in case no content has been saved.

For the offline version, the following is saved:

Required assets and code: mark-up, styles, JavaScript, and font;
Data: titles, texts, publication time, and some other data required to display the content.

Assets and code are cached using CacheStorage, and data is stored in IndexedDB. The first time m.lenta.ru is accessed, the following happens:
● the materials are downloaded,
● the Service Worker is registered (or updated),
● files are cached with CacheStorage,
● data is saved to IndexedDB.

Then the Service Worker waits for requests (fetch events). If there is no connection, the user is served the index.html file that contains the SPA.
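The fallback described here can be sketched roughly as follows. This is not Lenta.ru's actual Service Worker: the cache name `offline-shell-v1` and the `/index.html` path are assumptions, and `fetchFn` and `cacheStorage` are passed in as parameters so the function can also be tried outside a Service Worker.

```javascript
// Assumed names: the article does not show the real cache name or shell path.
const SHELL_CACHE = 'offline-shell-v1';
const SHELL_URL = '/index.html';

// Try the network first; if the request fails, fall back to the cached SPA shell.
async function networkWithShellFallback(request, fetchFn, cacheStorage) {
  try {
    return await fetchFn(request);
  } catch (networkError) {
    const cache = await cacheStorage.open(SHELL_CACHE);
    const shell = await cache.match(SHELL_URL);
    if (shell) {
      return shell;
    }
    throw networkError; // nothing cached, so surface the original failure
  }
}

// Inside a Service Worker this would be wired up roughly as:
// self.addEventListener('fetch', (event) => {
//   if (event.request.mode === 'navigate') {
//     event.respondWith(networkWithShellFallback(event.request, fetch, caches));
//   }
// });
```

Limiting the fallback to navigation requests is a design choice of this sketch: it keeps failed requests for images or scripts from being answered with an HTML page.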

As a result, if the user goes to the page of the material that has been saved, they will be able to read it. If this particular material is not saved, or if the user goes to any other (non-content) page, they will be taken to the offline version home page.

If nothing is saved, a tic-tac-toe page is displayed. In either case, the user will see that they are offline, and when the connection reappears, they will receive a notification prompting them to come back online.
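The routing rules above can be written down as one small pure function. This is only an illustration: the `/news/<id>` URL shape and the screen names are hypothetical, and `savedIds` stands for the set of material ids currently stored by the offline version.

```javascript
// Decide which offline screen to show. The "/news/<id>" URL shape and the
// screen names are hypothetical, chosen only for this sketch.
function resolveOfflineScreen(pathname, savedIds) {
  if (savedIds.size === 0) {
    return { screen: 'game' }; // nothing saved: tic-tac-toe page
  }
  const match = /^\/news\/([^/]+)\/?$/.exec(pathname);
  if (match && savedIds.has(match[1])) {
    return { screen: 'article', id: match[1] }; // saved material
  }
  return { screen: 'home' }; // unsaved material or non-content page
}

const saved = new Set(['42', '43']);
console.log(resolveOfflineScreen('/news/42/', saved).screen); // article
console.log(resolveOfflineScreen('/about', saved).screen); // home
console.log(resolveOfflineScreen('/news/42/', new Set()).screen); // game
```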

Problem to be solved by Content Indexing API

Let me start with some background. Picture a user with an unstable Internet connection and a number of different sites, some of which have an offline version. How can the user find out what content is available to them?

Most likely, they will try to open one of the sites. And once they see that there is no Internet, they are unlikely to check any others.

The question arises: instead of checking every site separately, is it possible to see all the available pages in one place?

It turns out that it is, and this is the task Content Indexing API solves. It creates a single entry point in the browser interface where the user can see a list of all the pages that are available offline.

Content Indexing API

Content Indexing API is one of the APIs being developed as part of Chromium's Capabilities project (also known as Project Fugu). The project's goal is to allow web applications to do everything that native applications can do, on both mobile devices and desktops.

It solves the problem of detecting pages that are available offline. After all, if a person does not know that they have something saved and available without the Internet, they are unlikely to use it.

Essentially, Content Indexing API lets the user see, in the browser interface, a list of all the pages available without a network, gathered from every web application that uses this API.

How it works (for users)

One important caveat: because the functionality is new, users have no experience with it yet and have not formed the habit of using it. Finding the list of offline pages in the browser may therefore feel like a quest.

Let's look at the user's path to the place where the offline pages are displayed:

Go to the menu in the browser.
Select the Downloads item from the menu.
Select the "Explore offline" tab on the right.

We're here. This tab displays all pages that are available offline and indexed using Content Indexing API. When the user opens one of them with an Internet connection, they are taken to the regular version of the page. Without a connection, they are taken to the offline version of the page.

This functionality could potentially have a greater effect, but the tab is too hard to find. I think the main room for improvement is making the path to it easier for users.

How it works (for developers)
What does it even take to start using this API?

The web application must have an offline version.
The offline version must have content pages.

The pages are saved and displayed using the offline version of the web application. Content Indexing API is an extension to it that allows you to display in the browser interface a list of pages available without the Internet, their addresses and previews.

The indexing algorithm looks like this:

Now let's look at the code. This snippet checks whether the browser supports Content Indexing API:

navigator.serviceWorker.ready
  .then((registration) => {
    // registration.index is undefined where the API is not supported
    if (!registration.index) {
      return;
    }

    // (1)
  });

Now let us look at the indexing code (instead of the line with comment (1) from the example above):

registration.index.add({
  url: page.url, // required
  id: page.id, // required
  title: page.title, // required
  description: page.description, // required
  icons: [{ // required
    src: page.image_url,
    sizes: '64x64',
    type: 'image/png',
  }],
  // Options: 'homepage', 'article', 'video', 'audio', ''
  category: 'article', // optional
});

We index it using the add method of this API. When indexing, url, id, title, description, icons and category should be specified. All parameters except category are required. The default value of category is an empty string, but you can specify one of the following values: 'homepage', 'article', 'video', 'audio'. Some of these parameters are used to generate previews of indexed pages, and we will focus on the id a little later.
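Since a missing field only shows up when `add` is called, it can help to build and check the descriptor in one place first. The helper below is a sketch, not part of the API: `toContentDescription` is a hypothetical name, and `page` is assumed to have the same shape as in the snippets above (`id`, `url`, `title`, `description`, `image_url`).

```javascript
// Fields the article treats as required, expressed in terms of the `page` object.
const REQUIRED_PAGE_FIELDS = ['id', 'url', 'title', 'description', 'image_url'];
const CATEGORIES = ['', 'homepage', 'article', 'video', 'audio'];

// Hypothetical helper: turns a page record into the object passed to index.add().
function toContentDescription(page, category = '') {
  const missing = REQUIRED_PAGE_FIELDS.filter(
    (field) => page[field] == null || page[field] === ''
  );
  if (missing.length > 0) {
    throw new Error(`Cannot index page, missing: ${missing.join(', ')}`);
  }
  if (!CATEGORIES.includes(category)) {
    throw new Error(`Unknown category: ${category}`);
  }
  return {
    id: String(page.id),
    url: page.url,
    title: page.title,
    description: page.description,
    icons: [{ src: page.image_url, sizes: '64x64', type: 'image/png' }],
    category,
  };
}

// Usage inside the support check:
// registration.index.add(toContentDescription(page, 'article'));
```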

We are responsible not only for indexing pages, but also for de-indexing outdated ones. There are two options: build an interface that lets the user delete all indexed pages, or remove them ourselves periodically.

In our case, we remove pages from indexation when the data is updated for the offline version. In addition, the user can always remove content from the offline content tab itself, but to remove everything, they would have to manually delete each page. It is therefore worth making an interface to delete everything or auto-delete it.

Lenta.ru is a news publication and news updates are short-lived, so the offline version is updated every half an hour. Pages are indexed and de-indexed at the same time.

The algorithm for pages de-indexing is as follows:

And this is the code that is needed for pages de-indexing:

registration.index.getAll() // (1)
  .then((entries) => {
    const deletions = entries.map(
      (entry) => registration.index.delete(entry.id) // (2)
    );
    return Promise.all(deletions); // wait until every page is de-indexed
  });

In the line with comment (1), the API's getAll method returns a promise that resolves to an array of entries for all indexed pages. In the line with comment (2), now that we know the page ids, we remove each page with the API's delete method, passing it the same id that we specified during indexing (now it is clear why the id is needed).

This removes the pages only from the index, that is, they will no longer appear in the "Offline Content" tab. The data of the saved pages has to be deleted separately by the offline version.
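Because pages are indexed and de-indexed in the same update, the two steps can be combined into one routine. The sketch below is an assumption about how that might look, not our production code: `reindex` is a hypothetical name, `index` stands for `registration.index`, and `descriptions` are objects in the shape accepted by `add`.

```javascript
// `index` is injected (normally registration.index) so the routine can be
// exercised with a stand-in object outside the browser.
async function reindex(index, descriptions) {
  const stale = await index.getAll();
  // Remove everything that was indexed during the previous update.
  await Promise.all(stale.map((entry) => index.delete(entry.id)));
  // Index the fresh list only after the old entries are gone.
  await Promise.all(descriptions.map((description) => index.add(description)));
  return { removed: stale.length, added: descriptions.length };
}

// Usage after the offline data has been refreshed:
// navigator.serviceWorker.ready.then((registration) => {
//   if (registration.index) {
//     return reindex(registration.index, freshDescriptions);
//   }
// });
```

Deleting before adding keeps the order predictable when an updated page reuses the id of an entry that is already indexed. This routine touches only the index; cleaning up CacheStorage and IndexedDB remains the job of the offline version.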

The three code snippets above are not a simplified demo; they really are all the code needed to work with Content Indexing API. If the project already has an offline version, adding the API is easy.

How we implemented Content Indexing API on Lenta.ru

Our goal is to deliver content regardless of whether the user has an Internet connection. That goal is already met: the offline version has existed in the project for more than three years. But how does the user know that Lenta.ru can work without a network?

Until now, there was only one way: the user would open any Lenta.ru page without a connection and end up in the offline version. With this API, a second way appears: the user can now find out from the "Offline Content" tab in the browser interface that some pages are available to them.

Now I'll tell you what we index. For the offline version, we save materials from three news lists, around 100 items in total. One of them is a short list, the top 10, whose items are displayed at the top of the home page.

Since the API is new, we decided not to index all ~100 materials at once, but to start with the top 10. We limited ourselves to a small fragment first for several reasons:

It's faster to release it.
There were concerns that Content Indexing API might store the material data itself and duplicate what the offline version already saves. This fear turned out to be unfounded.
We didn't know how long it would take or what benefits it would bring.
It was unclear what the impact would be (better to have a good impact on a small fragment than a bad impact on a large one).
The site has a large number of daily visitors, so we try to release changes carefully.

We are currently indexing about 10 pages at a time. We do not plan to index any more in the near future.

A word about the metric

In the offline version, we count how many users go to pages thanks to Content Indexing API. The solution is quite simple:

registration.index.add({
  url: `${page.url}?utm_source=offline`,
  id: page.id,
  title: page.title,
  description: page.description,
  icons: [{
    src: page.image_url,
    sizes: '64x64',
    type: 'image/png',
  }],
  category: 'article',
});

When indexing a material with this API, we add a utm tag to its url. The tag tells us that the page was opened from the Content Indexing API tab. It is too early to give figures: so far this does not generate any significant traffic relative to the main flow.
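Appending `?utm_source=offline` by hand works as long as the url has no query string of its own. A slightly more defensive variant can be sketched with the standard `URL` class. Both function names are hypothetical, and the `https://m.lenta.ru` base used to resolve relative urls is an assumption for this example.

```javascript
// Hypothetical helper: adds the tag without breaking an existing query or hash.
function withUtmSource(pageUrl, source = 'offline', base = 'https://m.lenta.ru') {
  const url = new URL(pageUrl, base); // `base` is ignored for absolute urls
  url.searchParams.set('utm_source', source);
  return url.toString();
}

// Hypothetical check for the offline version: was the page opened from the tab?
function openedFromContentIndex(currentUrl, source = 'offline') {
  return new URL(currentUrl).searchParams.get('utm_source') === source;
}

console.log(withUtmSource('https://m.lenta.ru/news/1/'));
// https://m.lenta.ru/news/1/?utm_source=offline
console.log(withUtmSource('https://m.lenta.ru/news/1/?from=top'));
// https://m.lenta.ru/news/1/?from=top&utm_source=offline
```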

Support

Content Indexing API is available in the stable version and can already be used in production. MDN recently published an article about this API, which states that it is available in Edge, Chrome Android and WebView Android from v. 84, and in Opera Android from v. 60.

But on chromestatus only Chrome Android and Android WebView are mentioned, and the other browsers are marked "No signal". I didn't find this interface either in Opera v. 62 on Android, or in Edge v.84. If anyone has any other information, I would be grateful if you would correct me.

It would be interesting to translate this information into figures to roughly understand what percentage of users have Content Indexing API support. Let us look at the example of Lenta.ru statistics.

Lenta.ru has an Android Chrome user base of around 60% of all mobile users. 64% of all Chrome users on Android use v.84 and above. That is, approximately 39% of all mobile users have support for Content Indexing API. These are the figures for the last 3 months.
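The estimate is simply the product of the two shares. As a quick check with the rounded figures quoted above (the function name is made up for this example):

```javascript
// Share of mobile users whose browser supports the API:
// (share of Chrome on Android) x (share of those on a supported version).
function estimateSupportShare(browserShare, supportedVersionShare) {
  return browserShare * supportedVersionShare;
}

const share = estimateSupportShare(0.60, 0.64);
console.log(`${(share * 100).toFixed(1)}%`); // 38.4%
```

The rounded inputs give about 38.4%, so the 39% quoted above presumably comes from the unrounded statistics; that is an assumption, not something the published figures confirm.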

Possible prospects

There are four ideas about how the API may develop and how it can be useful:

SEO bonuses for indexed materials. We assume that in the future, materials indexed with Content Indexing API may rank higher in search results or receive other SEO bonuses.
It will be easier to find indexed materials, which means that people will use them more often. This is more a hope than a prediction: the path to the place where the browser lists the pages available offline could shrink from three steps to (ideally) one.
Content Indexing API can be used to save user bookmarks and personal recommendations. This is not a guess; it really can be done. The API can index recommended content and content that the user has bookmarked, provided, of course, that the web application actually saves these pages for offline mode.
Over time, more people will start using this functionality. Since the API is new and there was no such option before, users have not yet formed the habit: most simply do not know that this is possible. It will be good if Lenta.ru already has it by the time they do. The project went through roughly the same thing with the offline version itself: first we built it, and only later did it become a requirement for being considered a PWA.

Pros and cons

Pros:

Stable version, which can be used in production.
+1 engagement tool and entry point.
Little code (directly for working with Content Indexing API). 
There are prospects.

Cons:

Poor browser support.
The user experience has not yet formed. 
An offline version is needed to start using it. 
So far, it does not give great results (traffic).

Recommendations

There are two recommendations for when to use Content Indexing API:

The web application is a content resource.
The web application has an offline version where you can view content pages.

As a conclusion

Content Indexing API solves the problem of discovering content that is available offline. There is already a stable version that can be used in production. The API only indexes pages; saving and displaying them are tasks of the offline version. Once again, the offline version and Content Indexing API are not the same thing and are not interchangeable.

The main problem is that the user experience has not yet formed, and the location of the tab is not obvious, so you should not expect great results yet.

The effect of such new tools is not always immediately noticeable, but the prospects are interesting. Perhaps, after a while, it will become the same familiar user experience as, for example, AMP and offline.
