This tutorial was written by Russell Epstein.
When you’re building applications for financial services and other highly regulated industries, storing just the current version of a document is typically not enough. A loan application that starts as a simple document requesting $200,000 might evolve through dozens of revisions with updated customer information, revised terms, compliance annotations, and underwriter adjustments before it’s finally approved. The challenge is preserving every single change along the way.
Unlike most applications where version history is a nice-to-have feature, financial services applications must maintain complete audit trails by design. Regulators expect to reconstruct exactly what happened at any point in time, compliance teams need to track who made which changes and why, and business users need both fast access to current data and the ability to investigate historical decisions.
This is where MongoDB’s Document Versioning Pattern becomes essential. While most applications only need to query the latest version of their data, financial institutions must maintain complete historical records that can withstand regulatory scrutiny, support audit requirements, and enable precise point-in-time reconstruction of business decisions.
The Document Versioning Pattern for financial services
While several versioning approaches exist, they often fall into two categories, each with significant limitations. The single-document approach embeds all historical versions within a single document, typically as an array of previous states (e.g., a “versions” field containing [{v1}, {v2}, {v3}]). This creates documents that grow indefinitely and can exceed MongoDB’s 16MB document size limit. The single-collection approach stores all versions in one collection with a field indicating which document is current (e.g., “isCurrent”: true), requiring every query to filter for the current version and making it difficult to optimize indexes for both current and historical access patterns.
For financial services applications with their high document volumes, complex audit requirements, and need for both fast operational queries and comprehensive historical compliance data, a more sophisticated implementation might be necessary. The core document versioning pattern uses a two-collection architecture: one collection for current documents optimized for operational queries, and a second collection for complete historical revisions. This separation ensures that operational queries remain fast while maintaining complete audit trails.
An optional third collection for detailed change logs becomes valuable when you have specific audit requirements that demand efficient field-level change queries. If compliance teams regularly need to generate reports like “find all loans where the interest rate changed” or “show all records where email addresses were updated,” this third collection enables these queries to run efficiently by pre-indexing the change information. Without this collection, you would need to compare revision documents using aggregation pipelines or application logic, and create specialized indexes on every field you want to track in the revisions collection. The change log collection essentially trades additional storage for query performance on audit trails.
For the examples in this article, we’ll demonstrate the three-collection approach to show the full pattern, but remember that the third collection can be omitted if your audit requirements don’t demand field-level change tracking.
Pattern assumptions
This pattern makes several assumptions about the data and access patterns:
- Documents are updated infrequently relative to read operations: This holds true for many financial services use cases such as loan applications (modified during underwriting but then largely static), account opening documents (updated primarily during onboarding), and regulatory filings (revised periodically but read frequently). However, this assumption may not apply to high-frequency trading systems, real-time risk calculations, or dynamic pricing engines where documents update continuously. For these high-velocity scenarios, the storage overhead of full snapshots may become prohibitive, and you should consider delta-based versioning or time-series collections instead.
- Most queries are performed on current document versions.
- Historical queries are less frequent but must be fast enough for regulatory requirements.
- Complete audit trails are necessary for compliance purposes.
- Data retention requirements may span several years.
Sample use case
Consider how a commercial bank might implement this pattern for loan management. Each loan application goes through multiple stages: initial submission, documentation review, underwriting, approval, and ongoing management. At each stage, the loan document may be modified with updated terms, additional documentation requirements, or status changes.
The bank needs to serve multiple constituencies: Loan officers need fast access to current loan status, compliance teams need complete audit trails, and regulators may request point-in-time reconstructions of loan states during examinations.
If we examine the requirements for the Document Versioning Pattern, loan management represents an ideal use case. Banks typically manage thousands of active loans, loan modifications occur regularly but not excessively, and the vast majority of daily queries focus on current loan status rather than historical versions.
Inside our database, each loan might have a current_loan document containing the latest loan information in a current_loans collection, complete historical snapshots in a loan_revisions collection, and detailed change records in a loan_change_log collection. When a loan officer modifies loan terms, the system creates a new revision document capturing the complete previous state, updates the current document with new information, and logs the specific changes made.
Let's examine the document structure for each collection:
Current loans collection
{
"_id": ObjectId("..."),
"loanId": "LOAN-2024-001",
"version": 3,
"customerId": "CUST-456789",
"amount": 350000,
"interestRate": 4.5,
"term": 360,
"status": "underwriting",
"assignedOfficer": "officer_123",
"lastModified": ISODate("2024-12-15T10:00:00Z"),
"modifiedBy": "loan_officer_123"
}
Loan revisions collection
{
"_id": ObjectId("..."),
"originalId": ObjectId("..."), // Reference to current document
"loanId": "LOAN-2024-001",
"version": 2,
// Original document fields stored directly
"customerId": "CUST-456789",
"amount": 325000,
"interestRate": 4.75,
"term": 360,
"status": "documentation",
"assignedOfficer": "processor_456",
// Revision metadata
"timestamp": ISODate("2024-12-10T15:30:00Z"),
"modifiedBy": "processor_456",
"reason": "customer_documentation_update"
}
Loan change log collection (optional)
{
"_id": ObjectId("..."),
"customerId": "CUST-456789",
"updatedFields": [
{
"key": "amount",
"value": 350000
},
{
"key": "interestRate",
"value": 4.5
},
{
"key": "assignedOfficer",
"value": "loan_officer_123"
}
],
"removedFields": [],
"truncatedArrays": [],
"timestamp": ISODate("2024-12-15T10:00:00Z"),
"modifiedBy": "loan_officer_123",
"reason": "moved_to_underwriting",
"businessJustification": "Customer provided updated appraisal increasing property value"
}
Implementation with Change Streams
MongoDB Change Streams provide a powerful way to automatically capture document changes and trigger versioning logic. Rather than modifying your application code, you can run a separate process that watches for changes and maintains the revision history.
While Change Streams offer excellent architectural decoupling and make it easy to add versioning to existing systems, they introduce eventual consistency: there is a small window where a document has been updated but its revision hasn't been created yet. For financial services applications requiring guaranteed strong consistency with zero data loss, a transactional approach using multi-document ACID transactions is recommended instead. However, for many use cases with proper monitoring and error handling, the eventual consistency of Change Streams is acceptable and offers operational benefits.
import com.mongodb.client.*;
import com.mongodb.client.model.changestream.*;
import org.bson.Document;
import java.util.ArrayList;
import java.util.Date;
import java.util.List;

public class ChangeStreamVersioningService {

    private final MongoClient mongoClient;
    private final MongoDatabase database;

    public ChangeStreamVersioningService(MongoClient mongoClient) {
        this.mongoClient = mongoClient;
        this.database = mongoClient.getDatabase("loandb");
    }

    public void startVersioningChangeStream() {
        MongoCollection<Document> currentLoans = database.getCollection("currentLoans");

        // Configure the change stream to deliver both the current and previous
        // document states. REQUIRED assumes pre-images are enabled on the collection.
        ChangeStreamIterable<Document> changeStream = currentLoans.watch()
                .fullDocument(FullDocument.UPDATE_LOOKUP)
                .fullDocumentBeforeChange(FullDocumentBeforeChange.REQUIRED);

        // Process each change event
        changeStream.forEach(change -> {
            // Inserts have no before-image; only updates and replaces produce revisions
            if (change.getFullDocumentBeforeChange() == null) {
                return;
            }
            try (ClientSession session = mongoClient.startSession()) {
                session.startTransaction();
                try {
                    // Create a revision from the document's previous state
                    Document revision = new Document(change.getFullDocumentBeforeChange());
                    revision.remove("_id");
                    revision.append("originalId", change.getDocumentKey().get("_id"))
                            .append("timestamp", new Date());
                    database.getCollection("loanRevisions").insertOne(session, revision);

                    // Optional: create a change log entry
                    Document changeLog = new Document()
                            .append("loanId", change.getFullDocument().get("loanId"))
                            .append("updatedFields", convertUpdatedFields(change.getUpdateDescription()))
                            .append("timestamp", new Date());
                    database.getCollection("loanChangeLog").insertOne(session, changeLog);

                    session.commitTransaction();
                } catch (Exception e) {
                    session.abortTransaction();
                    // Log the error and handle appropriately (retry, dead letter queue, etc.)
                }
            }
        });
    }

    // Flatten the driver's UpdateDescription into the change log's updatedFields shape
    private List<Document> convertUpdatedFields(UpdateDescription updateDescription) {
        List<Document> fields = new ArrayList<>();
        if (updateDescription != null && updateDescription.getUpdatedFields() != null) {
            updateDescription.getUpdatedFields().forEach((key, value) ->
                    fields.add(new Document("key", key).append("value", value)));
        }
        return fields;
    }
}
This approach ensures versioning happens automatically for any document changes, regardless of which application or service makes the update. The change stream acts as a centralized versioning service that maintains consistency across your entire system.
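One setup prerequisite: FullDocumentBeforeChange.REQUIRED causes the change stream to fail unless pre-images are enabled on the watched collection (available in MongoDB 6.0+). A minimal sketch of enabling them with the collMod command, using the collection name from the examples above:
// Enable pre-images on currentLoans so change events carry the prior document state
Document enablePreImages = new Document("collMod", "currentLoans")
        .append("changeStreamPreAndPostImages", new Document("enabled", true));
database.runCommand(enablePreImages);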
Important considerations for production:
- Monitoring: Implement health checks and alerting for your Change Stream consumer to detect failures quickly.
- Resume tokens: Store resume tokens so the Change Stream can restart from where it left off after network failures or planned restarts (see the sketch after this list).
- Error handling: Implement retry logic and dead letter queues for changes that fail to process.
- Scaling: For high-throughput systems, consider running multiple Change Stream consumers with different filters.
- Thread management: Run the change stream consumer in a managed thread pool or as a separate microservice for better resource control.
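To illustrate the resume-token point above, here is a minimal sketch. The loadResumeToken, persistResumeToken, and processChange helpers are hypothetical stand-ins for your own durable storage (a small metadata collection works well) and the versioning logic shown earlier:
import org.bson.BsonDocument;

MongoCollection<Document> currentLoans = database.getCollection("currentLoans");

// Resume from the last processed event after a restart, if a token was saved
BsonDocument savedToken = loadResumeToken(); // hypothetical helper: reads the token from durable storage
ChangeStreamIterable<Document> stream = (savedToken == null)
        ? currentLoans.watch()
        : currentLoans.watch().resumeAfter(savedToken);

stream.fullDocumentBeforeChange(FullDocumentBeforeChange.REQUIRED)
        .forEach(change -> {
            processChange(change); // hypothetical helper: the versioning logic shown earlier
            persistResumeToken(change.getResumeToken()); // save only after successful processing
        });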
Query patterns and performance
The multi-collection architecture enables different query patterns optimized for specific use cases:
Queries against the live documents (most frequent):
import com.mongodb.client.model.Filters;
import java.util.ArrayList;
import java.util.List;

// Get current loan status - optimized for speed
Document currentLoan = database.getCollection("currentLoans")
        .find(Filters.eq("loanId", "LOAN-2024-001"))
        .first();

// Find all loans for a customer
List<Document> customerLoans = database.getCollection("currentLoans")
        .find(Filters.eq("customerId", "CUST-456789"))
        .into(new ArrayList<>());
Historical queries (regulatory/audit):
import com.mongodb.client.model.Sorts;
import java.time.Instant;
import java.util.Date;
// Get specific historical version
Document historicalVersion = database.getCollection("loanRevisions")
.find(Filters.and(
Filters.eq("loanId", "LOAN-2024-001"),
Filters.eq("version", 2)
))
.first();
// Point-in-time reconstruction
Date pointInTime = Date.from(Instant.parse("2024-12-01T00:00:00Z"));
Document pointInTimeVersion = database.getCollection("loanRevisions")
.find(Filters.and(
Filters.eq("loanId", "LOAN-2024-001"),
Filters.lte("timestamp", pointInTime)
))
.sort(Sorts.descending("timestamp"))
.first();
Audit trail queries (compliance):
// Complete change history for audit
List<Document> changeHistory = database.getCollection("loanChangeLog")
.find(Filters.eq("customerId", "CUST-456789"))
.sort(Sorts.ascending("timestamp"))
.into(new ArrayList<>());
// Find all loans where interest rate was changed
List<Document> interestRateChanges = database.getCollection("loanChangeLog")
.find(Filters.eq("updatedFields.key", "interestRate"))
.into(new ArrayList<>());
// Find all changes made by specific user
List<Document> userChanges = database.getCollection("loanChangeLog")
.find(Filters.eq("modifiedBy", "loan_officer_123"))
.sort(Sorts.descending("timestamp"))
.into(new ArrayList<>());
// Find customers who had their email addresses updated
List<Document> emailUpdates = database.getCollection("loanChangeLog")
.find(Filters.elemMatch("updatedFields",
Filters.eq("key", "email")))
.into(new ArrayList<>());
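All of these query patterns assume supporting indexes; without them, historical and audit queries degrade to collection scans. A minimal sketch of indexes matching the queries above, with field names taken from the sample documents (tailor these to your actual access patterns):
import com.mongodb.client.model.Indexes;

// Current collection: lookups by loanId and customerId
database.getCollection("currentLoans").createIndex(Indexes.ascending("loanId"));
database.getCollection("currentLoans").createIndex(Indexes.ascending("customerId"));

// Revisions: version lookups and point-in-time scans per loan
database.getCollection("loanRevisions").createIndex(
        Indexes.compoundIndex(Indexes.ascending("loanId"), Indexes.descending("timestamp")));
database.getCollection("loanRevisions").createIndex(
        Indexes.compoundIndex(Indexes.ascending("loanId"), Indexes.ascending("version")));

// Change log: field-level audit queries
database.getCollection("loanChangeLog").createIndex(Indexes.ascending("updatedFields.key"));
database.getCollection("loanChangeLog").createIndex(Indexes.ascending("modifiedBy"));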
Benefits and trade-offs
The Document Versioning Pattern using multiple collections provides several key benefits:
- Performance: Current operations remain fast since they only query the current collection.
- Compliance: Complete audit trails satisfy regulatory requirements.
- Flexibility: Different collections can be optimized for different access patterns.
- Scalability: Historical data can be tiered to less expensive storage.
However, this pattern also introduces some trade-offs:
- Storage overhead: Multiple collections require more storage than single-collection or embedded approaches.
- Write amplification: Each update results in multiple write operations.
- Complexity: Applications must understand which collection to query for different use cases.
Implementation approaches: Transactions vs change streams
When implementing this pattern, you have two primary approaches: multi-document ACID transactions or change streams. Transactions provide guaranteed consistency, with all collections updating atomically, but require embedding versioning logic in your application code. Change streams offer architectural decoupling by running versioning logic in a separate process, but introduce eventual consistency, leaving a window where the audit trail may be incomplete if the consumer fails. For financial services applications where regulatory compliance is mandatory, multi-document ACID transactions are strongly recommended to eliminate any risk of incomplete audit trails. MongoDB solutions architect John Page has documented a comprehensive transactional approach with detailed implementation examples in his article "Introduction to Document Versioning," with working code samples in his GitHub repository.
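For reference, here is a minimal sketch of the transactional approach using the Java driver's withTransaction helper. Collection and field names follow the earlier examples; error handling and the optional change log insert are omitted for brevity:
import com.mongodb.client.ClientSession;
import com.mongodb.client.MongoClient;
import com.mongodb.client.MongoCollection;
import com.mongodb.client.model.Filters;
import org.bson.Document;
import java.util.Date;

public Document updateLoanWithVersioning(MongoClient mongoClient, String loanId,
                                         Document updates, String modifiedBy) {
    MongoCollection<Document> current =
            mongoClient.getDatabase("loandb").getCollection("currentLoans");
    MongoCollection<Document> revisions =
            mongoClient.getDatabase("loandb").getCollection("loanRevisions");
    try (ClientSession session = mongoClient.startSession()) {
        return session.withTransaction(() -> {
            // 1. Snapshot the current state as a revision (assumes the loan exists)
            Document existing = current.find(session, Filters.eq("loanId", loanId)).first();
            Document revision = new Document(existing);
            revision.append("originalId", revision.remove("_id"));
            revisions.insertOne(session, revision);
            // 2. Apply the update and bump the version, atomically with step 1
            Document updated = new Document(updates)
                    .append("version", existing.getInteger("version") + 1)
                    .append("lastModified", new Date())
                    .append("modifiedBy", modifiedBy);
            current.updateOne(session, Filters.eq("loanId", loanId),
                    new Document("$set", updated));
            return updated;
        });
    }
}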
Scaling and cost optimization for production
As your document versioning system grows from prototype to production, two key considerations emerge: managing the lifecycle of historical data and optimizing infrastructure costs for different access patterns.
Data lifecycle management
Financial institutions typically face regulations requiring document retention for specific periods, often seven years for transaction records and two years for operational data. Rather than storing all historical data on expensive, high-performance storage indefinitely, you can implement tiered storage strategies.
MongoDB Atlas Online Archive provides an elegant solution for automatically aging historical data to less expensive storage while keeping it queryable. Archived data remains fully queryable through Atlas Data Federation, though queries may take slightly longer to execute. This approach typically reduces storage costs by 60-80% for historical data while maintaining compliance requirements.
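For data that can be deleted outright once its retention window closes (rather than archived), a TTL index on the revision timestamp automates the cleanup. A minimal sketch assuming a seven-year retention period; confirm with your compliance team before enabling automatic deletion:
import com.mongodb.client.model.IndexOptions;
import com.mongodb.client.model.Indexes;
import java.util.concurrent.TimeUnit;

// MongoDB removes revisions automatically once their timestamp is ~7 years old
long sevenYearsInDays = 7L * 365;
database.getCollection("loanRevisions").createIndex(
        Indexes.ascending("timestamp"),
        new IndexOptions().expireAfter(sevenYearsInDays, TimeUnit.DAYS));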
Matching infrastructure to access patterns
The document versioning pattern creates naturally asymmetric access patterns: your current documents collection receives the vast majority of queries and needs the fastest possible response times, while your revision collections are accessed much less frequently and can tolerate slightly higher latency. This creates an opportunity for significant cost optimization.
MongoDB Atlas's Independent Shard Scaling feature allows different collections within the same cluster to run on different hardware tiers. For a loan management system processing thousands of daily queries, you might see:
- Current loans collection: Receives 95% of queries, needs sub-100ms response times → Deploy on M80 shard
- Loan revisions collection: Receives 5% of queries, can tolerate 200-500ms response times → Deploy on M40 shard
- Change log collection: Primarily used for compliance reporting → Deploy on M30 shard
With MongoDB 8.0's moveCollection
command, you can easily move your versioning collections to appropriate performance tiers:
// moveCollection must be run against the admin database
MongoDatabase adminDb = mongoClient.getDatabase("admin");

// Move revision collection to M40 shard
Document moveRevisionsCommand = new Document()
        .append("moveCollection", "loandb.loanRevisions")
        .append("toShard", "shard01");
adminDb.runCommand(moveRevisionsCommand);

// Move change log collection to M30 shard
Document moveChangeLogCommand = new Document()
        .append("moveCollection", "loandb.loanChangeLog")
        .append("toShard", "shard02");
adminDb.runCommand(moveChangeLogCommand);
This architecture can reduce infrastructure costs by 40-60% compared to running all collections on uniform high-performance hardware. The revision collections remain immediately accessible for audit queries (unlike archived data), but at a cost structure that matches their actual usage patterns.
The key insight is recognizing that not all data in your versioning system requires the same performance characteristics. Current operational data needs the fastest possible access, while historical compliance data prioritizes availability and consistency over raw speed. By matching your infrastructure to these different requirements, you can build a system that satisfies both operational needs and compliance requirements without overprovisioning expensive resources.
Conclusion
When financial services organizations need to maintain complete document histories for regulatory compliance, the Document Versioning Pattern provides a robust, scalable solution. The multi-collection architecture balances operational performance with compliance requirements, while offering flexibility in implementation from strongly consistent transactions to event-driven change streams.
This pattern is particularly well-suited for financial applications where document modifications occur regularly over long lifecycles, audit trails are mandatory, and both current and historical data access patterns need optimization. While it does introduce storage overhead and complexity compared to simpler versioning approaches like embedded history arrays, these costs are typically justified by the regulatory benefits and operational advantages in financial services environments with high change volumes.
The pattern can be implemented incrementally on existing systems and scales effectively with data volume growth, making it an excellent choice for financial institutions building or modernizing their document management systems.
Get started
Ready to implement document versioning for your application? Here are your next steps:
Try it out: Start with a free MongoDB Atlas cluster and experiment with the three-collection pattern using sample financial data. MongoDB Atlas includes Change Streams and all the features needed to implement this pattern.
Explore the code: Copy the change streams implementation from this post and adapt it to your document structure, or visit the MongoDB Enterprise Microservice Example for a more transactional approach.
Learn more: Check out the other posts in our Building With Patterns series to see how MongoDB's flexible document model solves complex data challenges across industries.