Kenan Sejmenović

Posted on May 20

Claprec: Database Design - Modeling the Data (5/6)

#database #mssql #entityframeworkcore #dotnet

Series Roadmap

Product Overview (1/6)
Architecture: N-Tier Design, Decoupling, and Microservices (2/6)
Frontend: Angular, NgRx, and OpenAPI Integration (3/6)
Machine Learning in Practice (4/6)
Database Design and Data Modeling (5/6)
Engineering Tradeoffs: Limited Time vs. Ideal Architecture (6/6)

The Blueprint of the Data Model

When building Claprec, I knew that database design couldn't be an afterthought. It wasn't just about storing data; it was about modeling the lifecycle of that data. The requirements were strict: nested comments, recursive categories, complex business and product hierarchies, and critically - enterprise-grade auditability.

I used EF Core's code-first approach, defining the schema from the domain model.

In this post, I'll walk you through how I architected the data layer for Claprec. I prioritized a fully normalized schema designed for performance and legal compliance, moving far beyond simple CRUD operations.

Before diving into the implementation details, it is important to visualize the data architecture as a whole. The system is supported by two distinct databases: the primary transactional database and a separate database for the ViewsProcessing microservice.

First, the main database schema visualizes the core domain. This ER diagram highlights the three layers of the dual-schema strategy: the live tables, the archive schema (<<shadow_deletable>>) for deleted records, and the log schema (<<loggable>>) for audit trails.

Secondly, the database used by ViewsProcessing microservice is kept entirely separate. This schema is stripped down to the essentials for tracking views, ensuring that high-volume analytical writes are isolated and do not impact the performance of the primary transactional database.

With these structures in mind, here is the breakdown of the most significant architectural challenges and solutions.

The Dual-Schema Strategy: Auditability and Data Lifecycle

The most significant architectural hurdle was managing data retention without sacrificing performance. In a production environment, hard deletes are rarely acceptable - especially when legal disputes or behavioral analytics require a full history of user actions.

I solved this by implementing a Dual-Schema Strategy. Instead of cluttering primary tables with IsDeleted flags (which ruins indexing and query performance), I physically moved deleted records to a separate archive schema.

Architecture Breakdown

Primary schema: Lean, hot tables optimized for reads and writes.
Archive schema: Soft-deleted records are physically moved here.
Log schema: A complete audit trail snapshotting every modification.

This separation ensures that the primary database remains lean. When a user deletes a comment, the data isn't just marked; it's architected out of the active set but preserved for compliance.

To achieve this, I built a "write-once, audit-everywhere" mechanism. I didn't want to write manual audit logs in the business logic - that's brittle and prone to human error. Instead, I engineered an automated solution using EF Core Triggers and Reflection.

The entities are decoupled into three distinct layers to support this:

Abstraction: Real data properties (int, string, etc.).
Concrete: The EF Core entity used in queries (handling navigation properties).
Changed/Deleted: Shadow entities that inherit from abstraction but implement tracking interfaces (IChanged, IDeleted).

When an entity is modified, the backend inspects the change, captures the previous state via reflection, and persists it to the corresponding audit table automatically.

// AbstractionReview.cs

[Table("review")]
public abstract class AbstractionReview : AbstractionCommentOrReviewParent
{
    [Column("product_or_business_parent_id")]
    public long ProductOrBusinessParentId { get; set; }

    [Column("business_address_id")]
    public long? BusinessAddressId { get; set; }

    [Column("rating")]
    public byte Rating { get; set; }

    [Column("description")]
    [MaxLength(255)]
    public string? Description { get; set; }

    [NotMapped]
    public override long ViewCount { get => base.ViewCount; set => base.ViewCount = value; }

    [NotMapped]
    public override long UserId { get => base.UserId; set => base.UserId = value; }

    [NotMapped]
    public override DateTime CreatedAt { get => base.CreatedAt; set => base.CreatedAt = value; }
}

// Review.cs

public class Review : AbstractionReview
{
    [ForeignKey(nameof(Id))]
    public CommentOrReviewParent CommentOrReviewParent { get; set; } = null!;

    public BusinessAddress? BusinessAddress { get; set; }
    public ProductOrBusinessParent ProductOrBusinessParent { get; set; } = null!;

    [NotMapped]
    public User? User { get; set; }
}

// DeletedReview.cs

[ShadowDeleteable("review")]
public class DeletedReview : AbstractionReview, IDeleted
{
    [Column("deleted_id")]
    public long DeletedId { get; set; }

    [Column("deleted_at")]
    public DateTime DeletedAt { get; set; }

    [Column("deleted_by")]
    public long? DeletedBy { get; set; }
}

// ChangedReview.cs

[Loggable("review")]
public class ChangedReview : ChangedAbstraction, IChanged
{
    [Column("review_id")]
    public long ReviewId { get; set; }

    [Column("rating")]
    public byte? Rating { get; set; }

    [Column("description")]
    [MaxLength(255)]
    public string? Description { get; set; }
}

Automating Entity Registration: Reflection over Repetition

One of the code-smells in Entity Framework Core is a DbContext with hundreds of DbSet properties and manual fluent API configurations. With Claprec's complex schema, this was cumbersome.

I automated the entity registration process. Instead of manually adding entities, the AppDbContext scans the executing assembly for types matching specific interfaces (IChanged, IDeleted). It identifies the concrete types and automatically registers the necessary triggers.

This didn't just save lines of code; it enforced consistency. If a developer creates a new entity, they simply implement the interface, and the database context handles the wiring, triggers, and schema mapping automatically.

// AppDbContext.cs

// ...

private IEnumerable<Type> GetEntitiesInRepository(string namespaceFullName)
{
    return from t in Assembly.GetExecutingAssembly().GetTypes()
        where t.IsClass &&
              t?.Namespace != null &&
              t.Namespace.Contains(namespaceFullName) &&
              !t.IsAbstract
        select t;
}

private void AddChangeTriggers(DbContextOptionsBuilder optionsBuilder)
{
    var changedTypes = GetEntitiesInRepository(ChangedNamespaceFullName);

    foreach (var changedType in changedTypes)
    {
        CheckIfTypeImplementsIChanged(changedType);

        var concreteType = FindConcreteType(changedType, _loggableEntitiesNamespaceName);

        RegisterTrigger(optionsBuilder, typeof(AfterChangeTrigger<,>), concreteType, changedType);
    }
}

private void AddDeleteTriggers(DbContextOptionsBuilder optionsBuilder)
{
    var deletedTypes = GetEntitiesInRepository(DeletedNamespaceFullName);

    foreach (var deletedType in deletedTypes)
    {
        CheckIfTypeImplementsIDeleted(deletedType);

        var concreteType = FindConcreteType(deletedType, _shadowDeletableEntitiesNamespaceName);

        RegisterTrigger(optionsBuilder, typeof(AfterDeleteTrigger<,>), concreteType, deletedType);
    }
}

// ...

Performance Engineering: Seeding and Garbage Collection

Data lifecycle management is one thing; initializing it is another. I faced a massive performance bottleneck during database seeding.

The dataset includes 374,352 cities and 720,599 zipcodes. When I first implemented seeding using standard EF Core AddRange, the full seed took roughly 40 minutes. This was unacceptable for a development workflow where database resets are frequent.

I knew the primary culprit was Entity Framework Core's change tracker - maintaining state for hundreds of thousands of entities incurs massive overhead. I attempted to optimize this by explicitly detaching entities immediately after insertion to unburden the tracker. While this yielded some performance improvements, they were marginal - nowhere near the speedup required for a rapid development cycle.

The Fix: Raw SQL and Batch Processing

I decided to bypass EF Core's pipeline entirely for these bulk operations and wrote raw SQL statements directly.

// SeedCity.cs

private const string INSERT_STATEMENT_2 = "INSERT INTO dbo.city(city_id, country_id, name, lat, lng) VALUES ({0}, {1}, N'{2}', {3}, {4});";

However, this introduced a new problem: memory spikes.
Constructing hundreds of thousands of SQL strings in memory overloaded the RAM, causing the app to crash. The .NET Garbage Collector (GC) is optimized for steady-state workloads, but it cannot compensate for extreme, one-time allocation spikes.

I implemented batch processing and manual GC triggering:

The SeederManager is configurable: batch size, target environment (Dev/Prod), and collection type.
After processing a batch, it manually triggers the garbage collector to immediately free RAM.

Result: The full database seed time dropped from ~40 minutes to ~30 seconds (~98% reduction).

// Program.cs

// ...

await startup.PopulateDb(false, 10, 25000);

// ...

// Startup.cs

// ...

public async Task PopulateDb(bool loadAllData, int geodataSubsetLength, int maxNumberOfStatementsPerExecution)
{
  // ...
}

// ...

// SeedCity.cs

// ...

public async Task Seed()
{
  // ...

  var sb = new StringBuilder(listCapacity);

  foreach (var geoItemCountry in _geoItemCountries)
  {
    foreach (var geoItemCity in geoItemCountry.Cities)
    {
        if (count > _maxNumberOfStatementsPerExecution)
        {
          // ...

          await _appDbContext.Database.ExecuteSqlRawAsync(sb.ToString());

          sb = null;
          sb = new StringBuilder(listCapacity);

          // ...
        }

        // ...

        sb.Append(string.Format(INSERT_STATEMENT_2, geoItemCity.CityId, geoItemCountry.CountryId, geoItemCity.Name, geoItemCity.Lat, geoItemCity.Lng));

        // ...
    }
  }

  // ...
}

// ...

Inheritance Modeling and EF Core Optimization

Claprec requires complex hierarchies. To prevent duplication of fields between the live entity, its archived version, and its log version, I utilized a strict inheritance strategy.

I separate entities into an Abstraction and a Concrete class.

Abstraction: Defines the actual data columns.
Concrete: Inherits the abstraction and defines navigation properties (Foreign Keys).

The reason for this separation is strictly the DRY (Don't Repeat Yourself) principle. Since lazy loading is disabled by default in the architecture, I didn't need to separate them to control loading behavior. Instead, this structure allows the archive entities (like DeletedComment) to inherit directly from the abstraction. This means the archived entity shares the exact same data fields as the live entity without me having to rewrite or maintain two sets of identical properties.

The logging model introduces an additional constraint. Unlike archived entities, where the full state is preserved, log entities track only a subset of fields. Since not every field change is relevant, each log entity explicitly defines which fields are tracked. This avoids unnecessary data growth and keeps logs focused on meaningful state transitions.

This approach also enabled modeling entities with multiple parent relationships via navigation properties and composite primary keys - something EF Core handles relatively well compared to my experiences with Laravel 6.

Why inheritance over duplication?

Single source of truth: Shared fields are defined once and reused across related entities.
Consistent structure: All related entities stay aligned automatically through the shared base definition.
Easier maintenance: Changes to shared fields are made in one place and automatically affect all related entities.

// CreatedAtTimestamp.cs

public abstract class CreatedAtTimestamp
{
    [Column("created_at")]
    public virtual DateTime CreatedAt { get; set; }
}

// AbstractionCommentOrReviewParent.cs

[Table("comment_or_review_parent")]
public abstract class AbstractionCommentOrReviewParent : CreatedAtTimestamp, ICountViews
{
    [Key]
    public long Id { get; set; }

    [Column("user_id")]
    public virtual long UserId { get; set; }

    [Column("view_count")]
    [DefaultValue(0)]
    public virtual long ViewCount { get; set; }
}

// AbstractionComment.cs

[Table("comment")]
public abstract class AbstractionComment : AbstractionCommentOrReviewParent
{
    [Column("comment_or_review_parent_id")]
    public long CommentOrReviewParentId { get; set; }

    [Column("value")]
    [MaxLength(255)]
    public string Value { get; set; } = null!;

    [NotMapped]
    public override long ViewCount { get => base.ViewCount; set => base.ViewCount = value; }

    [NotMapped]
    public override long UserId { get => base.UserId; set => base.UserId = value; }

    [NotMapped]
    public override DateTime CreatedAt { get => base.CreatedAt; set => base.CreatedAt = value; }
}

Optimization Strategies: Indexes and Identity Seeds

To ensure the database scaled, I applied two specific optimization strategies:

Identity Seeds

By default, SQL Server identity columns start at 1. For signed integer types, this uses only the positive portion of the available range. I configured the identity seed to start at the minimum value of the underlying type to utilize the full representable range.

This doubles the available ID pool before a key overflow could theoretically occur - a small configuration change with a significant long-term benefit.

// AppDbContext.cs

private void SetIdentitySeeds(IEnumerable<IMutableEntityType> entities)
{
    foreach (var type in entities)
    {
        var primaryKeys = type.FindPrimaryKey();

        if (primaryKeys!.Properties.Count != 1)
        {
            continue;
        }

        foreach (var prop in primaryKeys.Properties)
        {
            switch (Type.GetTypeCode(prop.ClrType))
            {
                case TypeCode.SByte:
                    type.FindProperty(prop.Name)!.SetIdentitySeed(sbyte.MinValue);
                    break;
                case TypeCode.Int16:
                    type.FindProperty(prop.Name)!.SetIdentitySeed(short.MinValue);
                    break;
                case TypeCode.Int32:
                    type.FindProperty(prop.Name)!.SetIdentitySeed(int.MinValue);
                    break;
                case TypeCode.Int64:
                    type.FindProperty(prop.Name)!.SetIdentitySeed(long.MinValue);
                    break;
            }
        }
    }
}

Heavy Indexing Strategy

I prioritized the use of integer-based keys over string-based ones to improve comparison efficiency and reduce index size. Integer values require less storage and are processed faster by the CPU, enabling more index entries per page, reducing memory usage and disk access, and improving overall query performance.

I also defined Composite Unique Keys to prevent logical duplicates (e.g., a user blocking the same user twice).

// AppDbContext.cs

private void SetCompositeUniqueKeys(ModelBuilder modelBuilder)
{
    _ = modelBuilder.Entity<Zipcode>()
        .HasIndex(z => new { z.CityId, z.Name }).IsUnique();

    _ = modelBuilder.Entity<Block>()
        .HasIndex(b => new { b.UserId, b.BlockedUserId }).IsUnique();

    _ = modelBuilder.Entity<BusinessAddress>()
        .HasIndex(ba => new { ba.Name, ba.Number, ba.CityId }).IsUnique();

    // ...
}

Conclusion

The resulting architecture for Claprec is designed for auditability, traceability, and scalability - not just functional correctness. By leveraging a dual-schema for soft deletes and field-level change tracking, automating triggers via reflection, and optimizing the seeding process, I built a backend capable of handling enterprise-level scrutiny.

What's Next?

While a meticulously designed database provides the structural integrity of the platform, it doesn't answer the question of when to stop building. In the final post, Claprec: Engineering Tradeoffs - Limited time vs. Perfection (6/6), I'll step back from the code to examine the decision-making process. I will discuss the tension between architectural idealism and the reality of shipping a product - where I chose to invest in over-engineering for scalability, where I accepted pragmatic compromises, and how to balance the pursuit of perfection against the constraints of a deadline.

I'd love to hear your thoughts on this database architecture approach or answer any questions you might have in the comments below.

DEV Community