Pre-computing Statistics for a Time Frame With MongoDB

This article was written by MongoDB Champion Markus Wildgruber.

Especially in IoT contexts, you often face the requirement that your users want to view aggregated data over a specific time frame. Imagine we have a system where a number of sensors report a temperature value at variable intervals. The users of the system want to know the average temperature and the number of readings over the last seven days. For systems with a small amount of data, you might be able to calculate those statistics on the fly for each request, especially if you make use of time series collections or the Bucket Pattern to reduce the number of documents involved in the query.
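
If you are not yet storing the raw readings this way, a time series collection is a natural fit. The following is a minimal sketch; the collection name readings and the fields timestamp and meta are illustrative assumptions, not taken from this article:

// Hypothetical time series collection for the raw sensor readings (MongoDB 5.0+)
db.createCollection("readings", {
  timeseries: {
    timeField: "timestamp", // when the reading was taken
    metaField: "meta",      // e.g., { sensorId: ... }
    granularity: "minutes",
  },
});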

As your system grows, calculating the statistics over and over again is not sustainable. Furthermore, users often want to view that data in a table and be able to filter and sort by those statistics to identify the sensors that reported the highest average temperature or the lowest number of readings in the time frame. In this case, you need to pre-compute the statistics and store them in a document per sensor, e.g.:

{
  "_id": /* Sensor Id */,
  "name": "Sensor name",
  "avgTemp": 23.45,
  "readingCount": 18,
  // other properties
}

This schema, while simplified, covers our requirements. You can create indexes to support filtering and sorting the documents for optimal performance. The indexes you need depend on your specific scenario, but as we store the average temperature and reading count in dedicated properties on the document, those are good candidates for an index if you often filter or sort by them. For instance, if you display the 10 sensors with the highest average temperature in the time frame on your dashboard, you could create an index with the following fields to cover the query:
{ avgTemp: -1, name: 1, _id: 1 }
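
As a concrete sketch, assuming the collection is named sensors as in the update examples later in this article, the index and the dashboard query it covers could look like this:

db.getCollection("sensors").createIndex({ avgTemp: -1, name: 1, _id: 1 });

// Covered query: the 10 sensors with the highest average temperature
db.getCollection("sensors")
  .find({}, { _id: 1, name: 1, avgTemp: 1 })
  .sort({ avgTemp: -1 })
  .limit(10);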

From a read perspective, our work is done at this point. In terms of writing, we need to consider two reasons to update the document:

  1. A sensor reports new data.
  2. A new day has started, and we need to ensure that the document's data is up to date.

A simple approach for updating and its downsides

Updating statistics when new data is reported is usually feasible without much extra consideration because:

a) The amount of data involved is restricted to a single sensor.

b) We know that the update for this sensor is necessary.

But what about the daily update for the statistics?

A straightforward way to update the documents is to have a job update the documents once per day. If you have been working with aggregation pipelines, you can easily imagine an aggregation pipeline that reads the values in the time frame from a time series collection, groups them by sensor, calculates the statistics, and merges the outcome into the sensors’ collection.
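
As a rough sketch, such a job could look like the following. It assumes the hypothetical readings collection from above with a temp field per reading; those names are illustrative, and only the sensors collection is taken from this article:

const today = new Date("2025-07-25T00:00:00Z");
const daysInTimeframe = 7;
const cutOffDate = new Date(today.getTime() - daysInTimeframe * 24 * 60 * 60 * 1000);

db.getCollection("readings").aggregate([
  // Only consider readings within the time frame
  { $match: { timestamp: { $gt: cutOffDate } } },
  // Calculate the statistics per sensor
  {
    $group: {
      _id: "$meta.sensorId",
      avgTemp: { $avg: "$temp" },
      readingCount: { $sum: 1 },
    },
  },
  // Write the results into the sensors' collection
  {
    $merge: {
      into: "sensors",
      on: "_id",
      whenMatched: "merge",
      whenNotMatched: "discard",
    },
  },
]);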

Especially in systems with a high number of sensors or in multi-tenant scenarios, this approach has serious downsides because it puts a lot of load on the cluster during the update process. If you have sensors that do not report a reading every day, the update might also touch sensors whose statistics are not outdated. Likewise, it includes sensors whose statistics were already updated when new data was reported, even though there is no need to update them again.

Evolving to a more sophisticated approach

Our main goal is to reduce the load on the system when carrying the values forward for a new day. A more sophisticated approach achieves this by updating statistics for a sensor when new data is reported. This moves the load from the daily run to the ingestion of new values and thus spreads the load. Also, the approach excludes documents from the daily update that do not need to be updated.

Fortunately, MongoDB's flexible data schema provides us with the toolset to achieve this by extending the sensor document with:

  • An array with statistics aggregated by day (more on that later).
  • The date property outdatedFrom that contains the date when the data needs to be updated in the daily run:
{
  "_id": /* Sensor Id */,
  "name": "Sensor name",
  "avgTemp": 23.45,
  "readingCount": 18,
  "statsByDate": [
    { "date": ISODate("2025-07-19T00:00:00Z"), readings: 7, totalTemps: 164.15 },
    { "date": ISODate("2025-07-21T00:00:00Z"), readings: 3, totalTemps: 70.35 },
    { "date": ISODate("2025-07-23T00:00:00Z"), readings: 5, totalTemps: 117.25 },
    { "date": ISODate("2025-07-24T00:00:00Z"), readings: 3, totalTemps: 70.35 },
  ],
  "outdatedFrom": ISODate("2025-07-26T00:00:00Z")
  // other properties
}

Let's assume the current date is July 25th, 2025. If the sensor reports new data for the day, a new sub-document is added to the statsByDate array and the statistics are updated, but the outdatedFrom property keeps its value. The document will not be included in the daily run on July 25th, 2025.

On the next day, July 26th, 2025, the data for July 19th is no longer part of the seven-day window for the statistics. In this case, the daily job will remove the sub-document for this day from the array and update outdatedFrom to July 28th, 2025 (the earliest remaining date, July 21st, plus seven days). Also, if new data for the sensor is reported before the daily job runs, the document is updated right away and is not included in the daily job anymore.

As you can see, this approach keeps statistics up to date and also limits the documents in the daily run to the ones that really need to be updated.

Let's have a closer look at the two main reasons to update the data.

Ingesting new data for the sensor

As described above, we want to update all statistics for the sensor when new data is received. This builds upon the information that is already present in the document, so we either need to read the document from the database and perform the changes in code, or, better and much more efficient, use an update with an aggregation pipeline. As updates on a single document in MongoDB are atomic, the latter also means that MongoDB will handle concurrency issues for us.

The following JavaScript code (from a MongoDB playground) performs the update:

const sensorId = /* Sensor Id */;
const temp = 23.5; // the newly reported temperature value (sample value)
const today = new Date("2025-07-25T00:00:00Z");
const daysInTimeframe = 7;
const cutOffDate = new Date(today.getTime() - daysInTimeframe * 24 * 60 * 60 * 1000);

db.getCollection("sensors").updateOne({ _id: sensorId }, [
  // Initialize array if not present (omit this if you are sure that the array is always present)
  {
    $set: {
      statsByDate: { $ifNull: ["$statsByDate", []] },
    },
  },
  // Cut off redundant data from array
  {
    $set: {
      statsByDate: {
        $filter: {
          input: "$statsByDate",
          cond: {
            $gt: ["$$this.date", cutOffDate],
          },
        },
      },
    },
  },
  // Add/update data in array
  {
    $set: {
      statsByDate: {
        $cond: {
          if: {
            $in: [today, "$statsByDate.date"],
          },
          then: {
            $map: {
              input: "$statsByDate",
              in: {
                $cond: {
                  if: { $eq: ["$$this.date", today] },
                  then: {
                    date: today,
                    totalTemps: { $add: ["$$this.totalTemps", temp] },
                    readings: { $add: ["$$this.readings", 1] },
                  },
                  else: "$$this",
                },
              },
            },
          },
          else: {
            $concatArrays: [
              "$statsByDate",
              [
                {
                  date: today,
                  totalTemps: temp,
                  readings: 1,
                },
              ],
            ],
          },
        },
      },
    },
  },
  // Update reading count
  {
    $set: {
      readingCount: { $sum: "$statsByDate.readings" },
    },
  },
  // Update avg temp (separate to reuse new value of readingCount)
  {
    $set: {
      avgTemp: {
        $cond: {
          if: { $gt: ["$readingCount", 0] },
          then: {
            $divide: [
              { $sum: "$statsByDate.totalTemps" },
              { $sum: "$statsByDate.readings" },
            ],
          },
          else: null,
        },
      },
    },
  },
  // Update outdatedFrom
  {
    $set: {
      outdatedFrom: {
        $cond: {
          if: { $gt: ["$readingCount", 0] },
          then: {
            $dateAdd: {
              startDate: { $min: "$statsByDate.date" },
              unit: "day",
              amount: daysInTimeframe,
            },
          },
          else: null,
        },
      },
    },
  },
]);

The aggregation pipeline above guards against cases where the array is not present at all, and also against cases where there are no values in the time frame, so that readingCount is zero and a division would fail. In the latter case, both avgTemp and outdatedFrom are set to null.

The most complex part is adding the new reading to the statsByDate array if there are no values for the date yet, or updating the existing sub-document in the array otherwise. The outer $cond checks whether there is a sub-document for the date. If there is, the existing array is transformed using a $map statement that updates the sub-document for the date and returns the existing sub-document for other dates.

You could simplify this significantly if you forgo the aggregation by day and add each new reading to the array when it arrives, as sketched below. Keep in mind that arrays in MongoDB documents should have a limited number of items and should not grow without bound. If you get a lot of readings per day or if your time frame is large, it is preferable to aggregate values by date.
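
A minimal sketch of this simpler variant, reusing the variables declared in the example above and storing one entry per reading (the temp field inside the entries is an illustrative choice):

db.getCollection("sensors").updateOne({ _id: sensorId }, [
  // Initialize array if not present
  { $set: { statsByDate: { $ifNull: ["$statsByDate", []] } } },
  // Cut off readings that have left the time frame
  {
    $set: {
      statsByDate: {
        $filter: { input: "$statsByDate", cond: { $gt: ["$$this.date", cutOffDate] } },
      },
    },
  },
  // Append the new reading as its own entry
  {
    $set: {
      statsByDate: { $concatArrays: ["$statsByDate", [{ date: today, temp: temp }]] },
    },
  },
  // Derive the statistics directly from the array
  // ($avg returns null for an empty array, so no division guard is needed)
  { $set: { readingCount: { $size: "$statsByDate" } } },
  { $set: { avgTemp: { $avg: "$statsByDate.temp" } } },
]);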

Also, if your scenario is based on a different unit for time frames, e.g., hours, adjust the aggregation accordingly so that the number of sub-documents in the array stays limited while still allowing you to cut off data that is no longer part of the time frame.
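
For hourly buckets, one way to derive the bucket key on the client is to truncate the reading's timestamp to the start of its hour (a sketch; readingTime is a hypothetical variable holding the reading's timestamp):

// Truncate a timestamp to the start of its hour to use it as the bucket key
const msPerHour = 60 * 60 * 1000;
const hourBucket = new Date(Math.floor(readingTime.getTime() / msPerHour) * msPerHour);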

Updating outdated documents

Now that we are able to ingest new readings, let's take a look at the daily job that carries the values of outdated documents forward. Luckily, the last section already contains many building blocks that we can reuse in the aggregation pipeline for the job:

  • The initialization of the array if not present as a safety measure
  • Cutting off sub-documents from the array if they are no longer in the time frame
  • The update of readingCount, avgTemp, and outdatedFrom

This leads to the following sample code in JavaScript:

const today = new Date("2025-07-25T00:00:00Z");
const daysInTimeframe = 7;
const cutOffDate = new Date(
  today.getTime() - daysInTimeframe * 24 * 60 * 60 * 1000
);

db.getCollection("sensors").updateMany(
  {
    outdatedFrom: {
      $lte: today,
    },
  },
  [
    // Initialize array if not present
    {
      $set: {
        statsByDate: { $ifNull: ["$statsByDate", []] },
      },
    },
    // Cut off redundant data from array
    {
      $set: {
        statsByDate: {
          $filter: {
            input: "$statsByDate",
            cond: {
              $gt: ["$$this.date", cutOffDate],
            },
          },
        },
      },
    },
    // Update reading count
    {
      $set: {
        readingCount: { $sum: "$statsByDate.readings" },
      },
    },
    // Update avg temp
    {
      $set: {
        avgTemp: {
          $cond: {
            if: { $gt: ["$readingCount", 0] },
            then: {
              $divide: [
                { $sum: "$statsByDate.totalTemps" },
                { $sum: "$statsByDate.readings" },
              ],
            },
            else: null,
          },
        },
      },
    },
    // Update outdatedFrom
    {
      $set: {
        outdatedFrom: {
          $cond: {
            if: { $gt: ["$readingCount", 0] },
            then: {
              $dateAdd: {
                startDate: { $min: "$statsByDate.date" },
                unit: "day",
                amount: daysInTimeframe,
              },
            },
            else: null,
          },
        },
      },
    },
  ]
);

The update identifies the documents that need to be updated by filtering on the outdatedFrom property. It then reuses the stages from the ingestion and applies the relevant changes to the documents.

Additional considerations

This sample provides an outline on how to pre-compute time frame statistics in MongoDB documents. It can be a starting point for tackling various other requirements regarding time frames—e.g., comparing statistics with the previous time frame. In this case, you need to double the number of days that are preserved in the statsByDate array and filter the items when calculating the statistics for the current and previous time frame. Also, outdatedFrom needs to consider both when an item leaves the time frame and when it moves from the current to the previous time frame.
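
As a sketch of that idea, the statistics for the current and the previous frame could be derived as follows. It reuses sensorId, daysInTimeframe, and cutOffDate from the examples above and assumes that statsByDate keeps twice the number of days; the field names previousReadingCount and previousAvgTemp are illustrative:

const msPerDay = 24 * 60 * 60 * 1000;
const previousCutOffDate = new Date(cutOffDate.getTime() - daysInTimeframe * msPerDay);

db.getCollection("sensors").updateOne({ _id: sensorId }, [
  // Split the doubled window into the current and the previous frame
  {
    $set: {
      currentFrame: {
        $filter: { input: "$statsByDate", cond: { $gt: ["$$this.date", cutOffDate] } },
      },
      previousFrame: {
        $filter: {
          input: "$statsByDate",
          cond: {
            $and: [
              { $gt: ["$$this.date", previousCutOffDate] },
              { $lte: ["$$this.date", cutOffDate] },
            ],
          },
        },
      },
    },
  },
  // Calculate the statistics per frame (guarding against empty frames)
  {
    $set: {
      readingCount: { $sum: "$currentFrame.readings" },
      previousReadingCount: { $sum: "$previousFrame.readings" },
      avgTemp: {
        $cond: {
          if: { $gt: [{ $sum: "$currentFrame.readings" }, 0] },
          then: {
            $divide: [
              { $sum: "$currentFrame.totalTemps" },
              { $sum: "$currentFrame.readings" },
            ],
          },
          else: null,
        },
      },
      previousAvgTemp: {
        $cond: {
          if: { $gt: [{ $sum: "$previousFrame.readings" }, 0] },
          then: {
            $divide: [
              { $sum: "$previousFrame.totalTemps" },
              { $sum: "$previousFrame.readings" },
            ],
          },
          else: null,
        },
      },
    },
  },
  // Remove the temporary frame arrays from the document
  { $unset: ["currentFrame", "previousFrame"] },
]);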

If you are providing statistics for a time frame, chances are that users want to view and compare developments between different time frames. In this case, a key decision is whether to have a single document that keeps data for all time frames for a sensor or if you have a document per sensor and time frame. The main factor here is the size of your documents and how often data for different time frames is accessed. If your documents are not too big and users often switch between time frames, having all time frames in the same document is advisable from a performance point of view. This also allows you to show sensor data in a tabular view that contains columns for different time frames. On the other hand, having all time frames in the same document adds further complexity to the update statements.

This sample demonstrates how MongoDB's flexible data schema can have a significant impact on system performance. The documents for the sensors are self-sufficient, so you can update the data in the sensors' collection without the need to access other collections. If you do not need to store the individual readings of the sensors in the database, you can discard them (or save them in cold storage) so that your database is not bloated by this fine-grained data just for the sake of updating statistics over time.

How do you handle statistics for time frames? Do you have tips that you can share with fellow developers? Let us know in the comments!
