Slowly Changing Data
Databases allow for the creation and reading of data. In most databases, in addition to inserting and reading data, we may also update and delete it.
Changing data poses a challenge when the change is due either to an error in the old data or to a change in the facts the data represents. In particular, the difficulty arises when the old data remains relevant even after it has been updated: how do we indicate the original value of the data?
In the context of a Data Warehouse, data subject to this problem (and its solutions) is known as a Slowly Changing Dimension (SCD), and there are several approaches to handling it.
The term "dimension" here refers to the practice in data warehouses of storing transaction data (or aggregates of transaction data) in "fact" tables and master data (entity data) in "dimension" tables. It is the dimension tables that are versioned using the SCD approaches, hence the "Dimensions" in Slowly Changing Dimensions.
The solutions to the problems posed by slowly changing data are known as "SCD Type 0", "Type 1", and so on. In this article, we will present the usage of several of these types, and show how a temporal database can extend their utility, or reduce their complexity.
SCD Types 0 and 1
Type 0 is simple. If we can safely assume that the fact this data represents cannot possibly change, we can simply say it is Type 0, define our schema normally, and call it a day.
Type 1 is similar. We know the data can change, but we don't really care about preserving the old data. If we believe that prior values will be of no relevance after an update is made, we can define our schema normally, and simply update the data when required.
Type 0 and Type 1 are cases where we know we don't need to preserve old data, whether because we believe the data will remain static (Type 0) or because old values will be irrelevant after an update (Type 1).
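To make this concrete, here is a minimal sketch of a Type 1 change, using Python's standard-library sqlite3 module and a hypothetical contact table; the old value is simply overwritten and lost:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE contact (
        id      INTEGER PRIMARY KEY,
        name    TEXT NOT NULL,
        address TEXT NOT NULL
    )
""")
conn.execute("INSERT INTO contact VALUES (1, 'Alice', '12 Oak St')")

# Type 1: update in place; the prior address is gone for good.
conn.execute("UPDATE contact SET address = '34 Elm St' WHERE id = 1")
conn.commit()
```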
SCD Type 2
What about a case where prior data remains relevant? One way to solve this is to insert a new (updated) record into the table, while leaving the original record. In a case such as this, we need to add a `begin_datetime` field to the schema, as well as an `end_datetime` field, so we know when the record is/was valid.
Suppose we store contact information. When a field in the contact data changes, we update the `end_datetime` field in the existing record, and create a new record with the updated data (or, if we wish to delete the entity, we update `end_datetime` without inserting a new record).
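As a minimal sketch of this flow (again sqlite3; the contact table and its fields are illustrative, and timestamps are stored as ISO-8601 text):

```python
import sqlite3
from datetime import datetime, timezone

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE contact (
        record_id      INTEGER PRIMARY KEY,
        name           TEXT NOT NULL,
        address        TEXT NOT NULL,
        begin_datetime TEXT NOT NULL,  -- when this version became valid
        end_datetime   TEXT            -- NULL while this version is still valid
    )
""")

def now():
    return datetime.now(timezone.utc).isoformat()

# Initial version of the record.
conn.execute(
    "INSERT INTO contact (name, address, begin_datetime) VALUES (?, ?, ?)",
    ("Alice", "12 Oak St", now()),
)

def update_address(record_id, new_address):
    """Type 2 update: close the current version, then insert the new one.

    To delete the entity instead, we would stop after the UPDATE.
    """
    ts = now()
    row = conn.execute(
        "SELECT name FROM contact WHERE record_id = ? AND end_datetime IS NULL",
        (record_id,),
    ).fetchone()
    conn.execute(
        "UPDATE contact SET end_datetime = ? WHERE record_id = ?", (ts, record_id)
    )
    conn.execute(
        "INSERT INTO contact (name, address, begin_datetime) VALUES (?, ?, ?)",
        (row[0], new_address, ts),
    )
    conn.commit()

update_address(1, "34 Elm St")
```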
However, we now face another difficulty: how can we reference a contact from another table? Every record must have a unique primary key, but multiple records may represent the same contact, only at different times.
In such a case we may simply leave references to the old record, and this might be acceptable if we indeed intend to reference the record, rather than the contact it represents. A better solution, however, is to use two keys: a surrogate primary key for records, plus an additional field, non-unique at the row level but unique at the entity level, to provide time-agnostic identification of an entity. For example, a contact may be listed as number 52 in all of its records. If we wish to find the most up-to-date information, we simply look for the currently valid record for contact 52. It may be wise to add a boolean field `valid`, for faster lookups.
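A sketch of this two-key design, with an assumed contact_id entity key and a valid flag (both names are illustrative):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE contact (
        record_id      INTEGER PRIMARY KEY,  -- surrogate key: unique per version
        contact_id     INTEGER NOT NULL,     -- entity key: shared by all versions
        name           TEXT NOT NULL,
        address        TEXT NOT NULL,
        begin_datetime TEXT NOT NULL,
        end_datetime   TEXT,
        valid          INTEGER NOT NULL DEFAULT 1  -- 1 only on the current version
    )
""")
conn.executemany(
    "INSERT INTO contact (contact_id, name, address, begin_datetime, end_datetime, valid)"
    " VALUES (?, ?, ?, ?, ?, ?)",
    [
        (52, "Alice", "12 Oak St", "2020-01-01T00:00:00", "2021-06-01T00:00:00", 0),
        (52, "Alice", "34 Elm St", "2021-06-01T00:00:00", None, 1),
    ],
)

# The most up-to-date information for contact 52.
current = conn.execute(
    "SELECT name, address FROM contact WHERE contact_id = 52 AND valid = 1"
).fetchone()
print(current)  # ('Alice', '34 Elm St')
```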
A similar method (Type 4) is to add records to a dedicated "history table", rather than inserting further records into the same table. In this approach, updating a contact updates the original record and also inserts a copy of the record into the history table (the same is done when creating a record), for further reference.
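One way to wire this up is with triggers, sketched below for sqlite3; the contact_history table name is hypothetical, and here every version of a row is copied into history, both at creation and on each update:

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE contact (
        contact_id INTEGER PRIMARY KEY,
        address    TEXT NOT NULL
    );
    CREATE TABLE contact_history (
        contact_id        INTEGER NOT NULL,
        address           TEXT NOT NULL,
        archived_datetime TEXT NOT NULL DEFAULT (datetime('now'))
    );
    -- Record every version in the history table, both at creation
    -- and on each subsequent update.
    CREATE TRIGGER contact_created AFTER INSERT ON contact
    BEGIN
        INSERT INTO contact_history (contact_id, address)
        VALUES (NEW.contact_id, NEW.address);
    END;
    CREATE TRIGGER contact_updated AFTER UPDATE ON contact
    BEGIN
        INSERT INTO contact_history (contact_id, address)
        VALUES (NEW.contact_id, NEW.address);
    END;
""")

conn.execute("INSERT INTO contact (contact_id, address) VALUES (52, '12 Oak St')")
conn.execute("UPDATE contact SET address = '34 Elm St' WHERE contact_id = 52")
# contact now holds only the current row; contact_history holds every version.
```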
SCD Type 3
When we only need to keep a partial history, and only of a single field, we can simply add a column to hold the "old" value of the field that needs tracking.
This approach is of course of more limited utility than Type 2.
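A minimal Type 3 sketch, assuming the tracked field is a contact's address and the extra column is named previous_address (illustrative names):

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE contact (
        contact_id       INTEGER PRIMARY KEY,
        address          TEXT NOT NULL,
        previous_address TEXT  -- Type 3: only the last prior value is kept
    )
""")
conn.execute("INSERT INTO contact (contact_id, address) VALUES (52, '12 Oak St')")

# On change, shift the current value into the "old" column, then overwrite it.
# (In an UPDATE, right-hand sides refer to the pre-update row values.)
conn.execute("""
    UPDATE contact
    SET previous_address = address,
        address = '34 Elm St'
    WHERE contact_id = 52
""")
conn.commit()
```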
Drawbacks to SCDs
All of the above approaches share the same drawback. They gracefully handle (or ignore) changes to the facts the records represent, but they all assume there will never be an error in the records. If an error does occur, we cannot fix it other than by overwriting the erroneous data.
Sometimes this is a non-issue, but what if we run reports based on this data, the data is later found faulty and corrected, and the report is run again? How can we know why the new report produced a different result?
In technical terms, SCDs effectively allow us to store only a single time axis: the valid time of records, but not their transaction time.
A possible solution would be to add a `transaction_datetime` field to the schema. However, it may then be impossible to ascertain which version of the record is updated by the current record; if, say, two corrections are committed for the same valid-time period, the timestamps alone do not tell us which prior row each correction replaces. For more detail, see The Case Against Slowly-Changing Dimensions, Part One.
However, even if we assume that this is not a problem, managing the extra fields in the schema nevertheless adds complexity.
Temporal Databases
SirixDB, as a uni-temporal database, stores every revision of the data immutably, timestamped with its transaction time. A query defaults to the most recent revision, or a timestamp can be specified, to retrieve the data as it stood in the database at that time.
Let us ask ourselves a question. Do we really need valid time? Or perhaps transaction time suffices for our needs? For example, if a customer's address is changed, perhaps all we care about is when it was changed in our system, not when the customer actually changed addresses. If so, then perhaps transaction time is a good enough proxy for valid time.
If that is the case, then our Type 2 SCD has already been implemented for us, and all we need to do is query our data appropriately.
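For illustration, here is a sketch of such a query against SirixDB's HTTP API. The endpoint shape and the revision-timestamp query parameter follow my reading of the SirixDB REST documentation and should be treated as assumptions, as should the database and resource names:

```python
import requests

# Hypothetical setup: a SirixDB server at BASE_URL with a "contacts" resource
# in a "warehouse" database; authentication is omitted for brevity, and
# verify=False assumes the default self-signed certificate.
BASE_URL = "https://localhost:9443"

# No timestamp: SirixDB answers with the most recent committed revision.
latest = requests.get(f"{BASE_URL}/warehouse/contacts", verify=False)

# With a timestamp: the resource as it stood in the database at that point
# in (transaction) time -- our Type 2 lookup, with no extra schema fields.
as_of = requests.get(
    f"{BASE_URL}/warehouse/contacts",
    params={"revision-timestamp": "2021-06-01T00:00:00Z"},
    verify=False,
)
```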
On the other hand, if transaction time alone does not suffice and we need valid time distinct from transaction time, then we can use SirixDB to handle transaction time, while our schema handles valid time. This gives us the transaction-time axis without overhead on the part of the schema designer.
Indeed, the need to retrieve old data often arises in analytics databases, where data is not updated immediately, and hence transaction time cannot take the place of valid time. Since the database manages transaction time, we must commit changes as they happen in order to take advantage of it; this may be unsuitable for a data warehouse, which may not be updated immediately when the source data changes.
On the other hand, see The Potential of Temporal Databases for the Application in Data Analytics, where it is suggested that all data be immediately replicated to a temporal database, which then serves as the data warehouse. It should be noted, however, that late-arriving data would require special handling.
However, this is not an issue when transaction time is used only to deal with errors. In such a case, the exact time the data was updated in the source database does not matter, only the time the data was updated in the data warehouse.
Likewise, if we are dealing with Type 0 or Type 1, we can safely correct mistakes in the data, with SirixDB recording when the data was updated (transaction time). This has the advantage of making the data auditable.
Conclusion
Depending on the business needs and technical setup of the data warehouse, using temporal tables may simplify handling SCDs, or alternatively (in conjunction with Type 2) extend them to handle errors in the data as well as changes to the entities the data represents.
Many RDBMSs today offer uni-temporal support (such as SQL Server and MariaDB) or bi-temporal support (such as Db2). This is typically implemented with history tables (à la SCD Type 4, except that the system records transaction time) and datetime fields.
Other, non-traditional databases also support temporality, and in some, it is a key feature, such as SirixDB or Crux.