moga

Posted on Jul 14, 2021

Decrease read costs of Firestore using Firestore Data Bundles

#firebase #firestore

A feature called Firestore Data Bundles was released around December 2020, but it hasn't been talked about much, so I looked into it. I'm sure there are many people who don't know about it.

Here are the blogs and documents I used for reference. I recommend the official blog as it is easy to understand and includes usage scenarios.

Reference

Official blog
- Load Data Faster and Lower Your Costs with Firestore Data Bundles!
Official documentation
- Cloud Firestore data bundles
- Serve bundled Firestore content from a CDN

What is Firestore Data Bundles?

In Firestore, reads are counted per user (device), even if every user reads the same data 👇. As shown in this figure, if 4 users each perform 100 reads, you are charged for 400 reads in total.

Using Firestore Data Bundles, you can dramatically reduce the number of reads by packaging data read from Firestore into a Bundle and delivering it through a CDN 👇 (the figure is only illustrative). Moreover, devices can still retrieve the data through the usual interface provided by Firestore.

In the figure 👆, Functions sits behind the CDN (as in the official documentation), but you can use Google Cloud Storage or similar instead without any problem.

Implementation

The code is written in TypeScript, so rewrite it as you see fit.
https://github.com/mogaming217/firestore-data-bundles-sample

This time, we will replace the CDN + Functions part with Google Cloud Storage(GCS). We will create a Bundle from the local machine, upload it to GCS, and read the file.

Create and upload Bundle

import { firestore, storage } from "firebase-admin"
import * as fs from 'fs'

// We're just initializing the project.
import { initAdminProject } from "./helper"
initAdminProject()

const BUCKET_NAME = 'YOUR_BUCKET_NAME'
const CREATE_INITIAL_DATA = false

const main = async () => {
  const db = firestore()
  const timestamp = Date.now()

  if (CREATE_INITIAL_DATA) {
    // create 100 data items
    await Promise.all([...Array(100)].map((_, i) => {
      return db.doc(`bundles/data_${i}`).set({
        body: `${i}`.repeat(1000).slice(0, 1000),
        // field name must match the orderBy('timestamp') query below
        timestamp: firestore.Timestamp.fromMillis(timestamp + i * 100)
      })
    }))
  }

  // Read the data from the firestore and create a Data Bundle
  const snapshots = await db.collection('bundles').orderBy('timestamp', 'asc').get()
  const bundleID = timestamp.toString()
  const buffer = await db.bundle(bundleID).add('bundles-query', snapshots).build()

  // write them out locally for upload
  const bundledFilePath = `./${timestamp}.txt`
  fs.writeFileSync(bundledFilePath, buffer)

  // Upload the file to GCS
  const destination = `firestore-data-bundles/bundles.txt`
  await storage().bucket(BUCKET_NAME).upload(bundledFilePath, { destination, public: true, metadata: { cacheControl: `public, max-age=60` } })

  console.log(`uploaded to https://storage.googleapis.com/${BUCKET_NAME}/${destination}`)
  process.exit(0)
}

main()

The db.bundle(bundleID).add('bundles-query', snapshots).build() part creates the Bundle. db.bundle() returns a BundleBuilder, and add() adds content to it. In this case we only add one QuerySnapshot, but you can call .add() as many times as you want, and you can also pass a DocumentSnapshot. I didn't find any stated limit on how much you can pack in, but considering that the client has to download it, I'd say a few MB is the practical upper limit.
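Since the client has to download the whole file, it may be worth checking the size of the built Bundle before uploading it. The following is a minimal sketch of such a check. The 2 MB threshold and the name checkBundleSize are my own assumptions for illustration, based only on the "a few MB" guess above, not on any documented limit.

```typescript
// Hypothetical threshold: the article only guesses "a few MB" as a practical limit.
const MAX_BUNDLE_BYTES = 2 * 1024 * 1024

interface BundleSizeReport {
  bytes: number
  ok: boolean
}

// Accepts any Uint8Array, which includes the Node.js Buffer returned by build().
function checkBundleSize(
  bundle: Uint8Array,
  maxBytes: number = MAX_BUNDLE_BYTES
): BundleSizeReport {
  const bytes = bundle.byteLength
  return { bytes, ok: bytes <= maxBytes }
}
```

In the upload script, you could call checkBundleSize(buffer) right after build() and log a warning or abort when ok is false.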

Also, the bundleID is set to the timestamp for now. This ID appears to be used by the client to determine whether the Bundle has already been loaded. The API reference describes it as follows:

The ID of the bundle. When loaded on clients, client SDKs use this ID and the timestamp associated with the bundle to tell if it has been loaded already. If not specified, a random identifier will be used.

As a reminder, you need to use the Admin SDK to create the bundle. Please do not put any sensitive data in the file.

Bundle Reading

The following example shows how to load it.

import axios from "axios"
import { initClientProject } from "./helper"

const app = initClientProject()

const BUNDLE_URL = 'UPLOADED_BUNDLE_URL'

const main = async () => {
  const db = app.firestore()
  const now = Date.now()

  // Get Bundle data from GCS and load it
  const response = await axios.get(BUNDLE_URL)
  await db.loadBundle(response.data)

  // retrieve from the loaded Bundle data
  const query = await db.namedQuery('bundles-query')
  const snaps = await query!.get({ source: 'cache' })

  console.log(`${(Date.now() - now) / 1000}s`)
  process.exit(0)
}

main()

Get the Data Bundle from GCS. It will be a simple GET request, so caching with CDN or other means will be effective.

When you load it with loadBundle, the Data Bundle is also expanded into the device's local cache. We deliberately use the namedQuery method here, but you can also read the loaded Bundle with db.collection('bundles').orderBy('timestamp', 'asc').get({ source: 'cache' }). We use source: 'cache' because we wanted to make sure the read comes from the expanded cache. You may want to change this to suit your logic.
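The sample above uses query! on the result of namedQuery, which can be null when the named query is not found. Here is a minimal sketch of a null-safe variant that falls back to an ordinary query. The interfaces are hand-written shapes covering only the calls used here, and readBundledQuery is a hypothetical helper name. I expect the client SDK's Firestore and Query objects to fit these shapes, but treat that as an assumption to check against your SDK version.

```typescript
// Minimal hand-written shapes (assumptions), standing in for the client SDK types.
type Source = 'default' | 'server' | 'cache'

interface SnapshotLike {
  size: number
}

interface QueryLike {
  get(options?: { source?: Source }): Promise<SnapshotLike>
}

interface BundleCapableDb {
  loadBundle(data: string | ArrayBuffer): PromiseLike<unknown>
  namedQuery(name: string): Promise<QueryLike | null>
}

// Loads the bundle, then reads the named query from the local cache.
// If the named query is missing, runs the fallback query with default options.
async function readBundledQuery(
  db: BundleCapableDb,
  bundleData: string | ArrayBuffer,
  queryName: string,
  fallback: QueryLike
): Promise<SnapshotLike> {
  await db.loadBundle(bundleData)
  const query = await db.namedQuery(queryName)
  if (query === null) {
    // Not served from the bundle, so this may be billed as normal reads.
    return fallback.get()
  }
  return query.get({ source: 'cache' })
}
```

With the sample above, the fallback would be something like db.collection('bundles').orderBy('timestamp', 'asc'), so the app keeps working even if the Bundle is stale or was built without the expected query name.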

Advantages

Cost savings

As mentioned above, the number of Firestore reads can be reduced, which lowers the running cost. The figure below compares the cost when each device reads directly from Firestore with the cost when Firestore Data Bundles are used (delivered via Cloud Storage). It assumes each document is 1 KB and each user reads 100 documents, and shows how the price changes as the number of users increases.

Firestore is very inexpensive to begin with, so there is not much difference, but it is still more than 50% cheaper to deliver via GCS. The more users you have, the more effective it will be.

The calculation formulas are as follows, based on Tokyo region pricing. If I'm wrong, please let me know.

Firestore
- Number of reads / 100,000 x $0.038
- Ignore the free reads.
- If the transfer volume exceeds 10 GB, add (transferred data in excess of 10 GB) x $0.14 to the above.
- Reference: https://cloud.google.com/firestore/pricing
Google Cloud Storage
- Amount transferred (GB) × $0.12
- Ignore read and storage charges for Bundle creation.
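The formulas above can be sketched as two small functions. The prices are the ones quoted in this article (Tokyo region at the time of writing) and may have changed. Treating 1 GB as 1,000,000 KB and ignoring any size overhead of the Bundle format are simplifying assumptions of mine, and the function names are only for illustration.

```typescript
// Prices as quoted in the article (USD); check the current pricing pages before relying on them.
const FIRESTORE_PRICE_PER_100K_READS = 0.038
const FIRESTORE_FREE_TRANSFER_GB = 10
const FIRESTORE_TRANSFER_PRICE_PER_GB = 0.14
const GCS_TRANSFER_PRICE_PER_GB = 0.12

// Assumption: decimal units, 1 GB = 1,000,000 KB.
const KB_PER_GB = 1000000

function transferGB(users: number, docsPerUser: number, docSizeKB: number): number {
  return (users * docsPerUser * docSizeKB) / KB_PER_GB
}

// Every user reads every document directly from Firestore.
function firestoreCost(users: number, docsPerUser = 100, docSizeKB = 1): number {
  const reads = users * docsPerUser
  const readCost = (reads / 100000) * FIRESTORE_PRICE_PER_100K_READS
  const gb = transferGB(users, docsPerUser, docSizeKB)
  const excessGB = Math.max(0, gb - FIRESTORE_FREE_TRANSFER_GB)
  return readCost + excessGB * FIRESTORE_TRANSFER_PRICE_PER_GB
}

// Every user downloads the same documents as a Bundle from GCS.
// Reads and storage needed to create the Bundle are ignored, as in the article.
function gcsCost(users: number, docsPerUser = 100, docSizeKB = 1): number {
  return transferGB(users, docsPerUser, docSizeKB) * GCS_TRANSFER_PRICE_PER_GB
}
```

Under these assumptions, 1,000,000 users come to roughly $50.60 when reading directly from Firestore and roughly $12.00 when the Bundle is served from GCS.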

Improved read speed (potentially)

The official documentation mentions that the read speed will be faster not only the first time.

While the developer benefits from the cost savings, the biggest beneficiary is the user. Studies have repeatedly shown that speedy pages mean happier users.

Data Bundle is not affected by this issue because the client can fetch the file from the CDN, load it into Firestore, and then read from the local cache, so there may be situations where it works effectively. I tried to verify how much faster this would be, but I couldn't measure it reliably, probably because Firestore connections are pooled on the machine to some extent when running repeatedly. Please note that the above is only theoretical. The maximum measured values are shown below.

Getting 100 documents directly from Firestore (first time): 1.4s
Fetching a Data Bundle containing 100 documents from GCS and loading it: 0.7s
- (This is the case when the CDN cache was missed; on a cache hit it was about 0.2s.)

Disadvantages

It is a bit more complicated to implement than reading directly from Firestore. As mentioned in the official blog, this is a bit of an advanced feature, so it won't be very effective unless you have a large number of users.

Also, if you make a mistake in handling the cache, you may see data that you shouldn't, and many other accidents can happen. Be sure to understand how to use it.

Usage scenarios

This is a brief description of what is described in the official blog.

The following are some examples of how you might use it.

- Data that all clients need to read when launching an application (master data, etc.)
- Top 10 articles that will be read by all clients in news, blog services, etc.
- Starter data that will be read by non-logged-in users

On the other hand, you should not use it for the following:

- Queries that produce different results for different users (such as data in a sub-collection of the users collection)
- Data that contains private information

Personal impressions, etc.

For a moment I thought this simply brings to Firestore the CDN response caching that has long been done with ordinary Web APIs. However, the fact that the data is expanded into Firestore's local cache, and the minor detail that there is no need to convert timestamps, made me think that there are quite a few advantages to using it.

If I were to use it, I would use it for ranking data as well as master data. In that case, I would upload the Data Bundle to GCS whenever the ranking is updated, so that the client always requests the same GCS URL, and handle freshness by adjusting the bundleID and Cache-Control.
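One way to read the ranking idea above is sketched below: the destination path stays fixed so the URL never changes, the bundle ID changes with each ranking update, and Cache-Control bounds how long a CDN may serve the old file. The path, the ID format, and the name planRankingBundleUpload are all hypothetical. The uploadOptions shape simply mirrors the options passed to upload() in the script earlier in this article.

```typescript
interface BundleUploadPlan {
  bundleId: string
  uploadOptions: {
    destination: string
    public: boolean
    metadata: { cacheControl: string }
  }
}

// Hypothetical fixed path, so clients can always request the same URL.
const RANKING_DESTINATION = 'firestore-data-bundles/ranking.txt'

function planRankingBundleUpload(
  rankingUpdatedAt: Date,
  maxAgeSeconds: number
): BundleUploadPlan {
  return {
    // A new ID per ranking update, so a client can tell a new Bundle from one it already loaded.
    bundleId: `ranking-${rankingUpdatedAt.getTime()}`,
    uploadOptions: {
      destination: RANKING_DESTINATION,
      public: true,
      // Upper bound on how long a cached copy of the previous ranking may be served.
      metadata: { cacheControl: `public, max-age=${maxAgeSeconds}` },
    },
  }
}
```

The plan's bundleId would go to db.bundle(), and uploadOptions would go to storage().bucket(BUCKET_NAME).upload(). A shorter max-age makes new rankings visible sooner at the cost of more cache misses.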

However, although it looks very good, my impression is that it has not been talked about much because it does not pay off until the number of users becomes large. Looking at the actual prices, I feel that a service with enough users for this to matter would also be earning enough that the change in operating cost from introducing Data Bundles is within the margin of error. It's hard to say. I think it's a great fit if you have a lot of users and want to cut operating costs.

By the way, Flutter's SDK support tends to be slow for new features like this, but it seems to be already supported (https://firebase.flutter.dev/docs/firestore/usage/#data-bundles). (It's been a while since the release).

Conclusion

I often talk about Firebase and other technical topics on Twitter, so please follow me if you like!
