Is reference counting a solution to DB Garbage Collection?

For the past 4 months I've been working on a side project to develop an object database from scratch, and chronicling my work and learnings on my YouTube channel. To date I've managed to implement an on disk B-Tree mechanism with caching to improve read performance and a way to separate different transactions into different files so that history is preserved
Things are starting to work pretty good, but I'm quickly getting to the point where I'm going to have to start thinking about how to compress old transactions once there are no longer relevant references to remove any blocks that are now garbage

How is garbage created?

In the object database I've been working on, garbage is created when a B-Tree node in an older transaction is superceded by a newer version of that node either by adding or deleting an entry to that node, sometimes also incorporating merges and splits of nodes depending on the operation. The nodes of the current transaction are allowed to be overwritten during the transaction and so are not currently able to be superceded during the transaction, however older transactions are treated as immutable. I partly wanted to do this so that it was optional to collect and compress older transactions, but for now without a collection mechanism, history retention is mandatory, and over time the garbage builds up. I'd like to start addressing that factor soon

How should garbage be detected

In most typical cases a B-Tree node will have exactly one parent, however when a node is superceded by another, that node's children will temporarily have two parents. There might be several references from stale parent nodes until those nodes are collected, but I'm anticipating those nodes might stay active while there are read-only operations occurring on previous transactions and where those transactions have the consistent-read transaction mode. When a node is superceded, one option would be to immediately move the parent, but this will add overhead to whichever operation is in progress. One option I'm considering is to stage node supercessions into a buffer that can be scavenged by the garbage collector. The collector would increment the reference count on a set of referred nodes, and then decrement the reference count on the stale node, and when said node reached zero references, it could be added to the collector list. This allows two things to occur: the collection process will be incremental, and transactions won't have to wait for it

How should garbage be collected?

For now the nodes are referenced via a file and offset pointer with a typical small transaction confined to one file. I decided on this layout because in the beginning I was anticipating that I would want to restructure or remove files to reclaim disk space after nodes were cleaned up, but the more I think about it the more I think that the block files may need to be compacted. One way to reuse the space is to add a free list to the data store and allow new blocks to be allocated from the pool of previously allocated and freed blocks. This is a pretty good technique to ensure that space isn't used inefficiently but has the drawback that once data is allocated, it's very difficult for applications outside of the data store to reclaim the space. One mechanism I'm considering for improving this situation is to resize old files when their blocks are collected. Unfortunately this means that those blocks will need to have all of their free space at the end of the block. One way to ensure this might be to map block file entries to an offset using a per file index. This would incur an additional overhead and it's unclear to me whether this will be worth the payoff of reduced size. Probably this is something I should measure, and perhaps I can make that a setting that may be optional too

Thanks for reading, and if you're interested in what I've been up to you can find my work here at https://github.com/PhillipVoyle/objectdb, and my youtube channel here https://www.youtube.com/@phillip-voyle

DEV Community

Is reference counting a solution to DB Garbage Collection?

How is garbage created?

How should garbage be detected

How should garbage be collected?

Top comments (0)