Modeling import data with StoreModel & JSONB
Storing structured import records, errors, and reports inside a single JSONB column -- no extra tables, no schema sprawl.
Context
This is part 4 of the series where we build DataPorter, a mountable Rails engine for data import workflows. In part 3, we built the configuration DSL that lets host apps customize the gem through a clean configure block.
Now we shift from how the gem behaves to what it operates on: the data models for parsed records, validation errors, and summary reports. We'll model all three using the StoreModel gem and PostgreSQL JSONB columns.
The problem
A typical import engine ends up with a lot of tables: imports, import rows, import errors, reports. Each needs a migration, foreign keys, indexes, and cleanup logic. For a gem that drops into any Rails app, that's a heavy footprint.
But these records are ephemeral. They exist during the import workflow, get consulted in the results view, and nobody queries them independently. You never ask "give me all errors across all imports." They're always accessed through their parent.
If the data is always read and written as a group, it doesn't need its own table. It needs a structured column.
What we're building
A single DataImport record will carry its entire import payload in JSONB columns:
```ruby
# Anywhere in the engine
import = DataPorter::DataImport.find(42)

import.report.records_count  # => 150
import.report.errored_count  # => 3
import.report.error_reports.each { |e| puts e.message }

import.records.first.line_number  # => 1
import.records.first.status       # => "complete"
import.records.first.data         # => { "name" => "Alice", "email" => "alice@example.com" }
```
No joins, no N+1 queries. Records and reports come back as typed Ruby objects with real attributes and methods -- not raw hashes.
Implementation
Step 1 -- The Error model
Every import record can accumulate validation errors. We need a small object to represent each one. StoreModel lets us define it like an ActiveModel attribute model, but serialized into JSON.
```ruby
# lib/data_porter/store_models/error.rb
module DataPorter
  module StoreModels
    class Error
      include StoreModel::Model

      attribute :message, :string
    end
  end
end
```
include StoreModel::Model gives us ActiveModel-compatible attributes that serialize to and from JSON. Why not a plain hash? Because error.message is a method call with autocompletion, not error["message"] where you guess at indifferent access. If we need :code or :severity later, we add an attribute and existing data deserializes cleanly -- new fields default to nil.
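The forward-compatibility claim can be illustrated without StoreModel at all: JSON deserialization simply yields nil for keys that were never written. A plain-Ruby sketch (payload contents are illustrative):

```ruby
require "json"

# An Error payload serialized by an older version -- only :message exists.
old_payload = JSON.parse('{"message": "Name is required"}')

# A newer Error class that also reads :severity sees nil for the missing key,
# so previously stored rows deserialize without any data migration.
old_payload["message"]   # => "Name is required"
old_payload["severity"]  # => nil
```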
Step 2 -- The ImportRecord model
Each row from the source file becomes an ImportRecord. This is the workhorse of the import: it holds the parsed data, tracks validation status, and collects errors and warnings.
```ruby
# lib/data_porter/store_models/import_record.rb
module DataPorter
  module StoreModels
    class ImportRecord
      include StoreModel::Model

      attribute :line_number, :integer
      attribute :status, :string, default: "pending"
      attribute :data, default: -> { {} }
      attribute :errors_list, Error.to_array_type, default: -> { [] }
      attribute :warnings, Error.to_array_type, default: -> { [] }
      attribute :target_id, :integer
      attribute :dry_run_passed, :boolean, default: false
    end
  end
end
```
The data attribute stores whatever hash the source parser produces -- no explicit type, because each import target defines different columns. The lambda defaults (-> { {} }) are critical; without them, every record shares the same mutable object. Error.to_array_type makes errors_list a typed array: each JSON entry deserializes into an Error instance, not a raw hash.
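The shared-mutable-default pitfall is easy to reproduce in plain Ruby, independent of StoreModel:

```ruby
# Without a lambda: every record would receive the very same hash object.
shared_default = {}
record_a = shared_default
record_b = shared_default
record_a["name"] = "Alice"
record_b["name"]  # => "Alice" -- record_b was mutated through record_a

# With a lambda: each call produces a fresh hash, so records stay independent.
fresh_default = -> { {} }
record_c = fresh_default.call
record_d = fresh_default.call
record_c["name"] = "Alice"
record_d["name"]  # => nil
```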
The model also carries behavior. Status determination runs after validation:
```ruby
# lib/data_porter/store_models/import_record.rb
def determine_status!
  self.status =
    if required_error?
      "missing"
    elsif errors_list.any?
      "partial"
    else
      "complete"
    end
end
```
Three statuses: "missing" (a required field is absent -- the record cannot be imported), "partial" (optional field errors exist -- the record can be imported with warnings), and "complete" (clean row, ready to go). The distinction between required and optional errors is driven by the column definition -- we'll wire that up in part 5 with the Target DSL. The Orchestrator will call determine_status! after validation and use importable? to decide which records to persist.
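The post references add_error and required_error? without showing them. Here is a hedged plain-Ruby stand-in for how the status logic plays out; the required: flag is an assumption for illustration (in the gem, the required/optional distinction comes from the column definition, not the caller):

```ruby
# Illustrative stand-in, not the gem's ImportRecord.
class SketchRecord
  attr_reader :errors_list, :status

  def initialize
    @errors_list = []
    @required_error = false
  end

  # Assumption: the caller flags required-field errors explicitly.
  def add_error(message, required: false)
    @errors_list << message
    @required_error ||= required
  end

  def required_error?
    @required_error
  end

  # Only "missing" records are blocked from import.
  def importable?
    status != "missing"
  end

  def determine_status!
    @status =
      if required_error?
        "missing"
      elsif errors_list.any?
        "partial"
      else
        "complete"
      end
  end
end

blocked = SketchRecord.new
blocked.add_error("Name is required", required: true)
blocked.determine_status!
blocked.status      # => "missing"
blocked.importable? # => false

warned = SketchRecord.new
warned.add_error("Email: invalid email")
warned.determine_status!
warned.status       # => "partial"
```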
Step 3 -- The Report model
After parsing and validating, we need a summary. The Report model aggregates counts and collects top-level errors (like "file has no header row" or "unexpected encoding").
```ruby
# lib/data_porter/store_models/report.rb
module DataPorter
  module StoreModels
    class Report
      include StoreModel::Model

      attribute :records_count, :integer, default: 0
      attribute :complete_count, :integer, default: 0
      attribute :partial_count, :integer, default: 0
      attribute :missing_count, :integer, default: 0
      attribute :duplicate_count, :integer, default: 0
      attribute :imported_count, :integer, default: 0
      attribute :errored_count, :integer, default: 0
      attribute :error_reports, Error.to_array_type, default: -> { [] }
    end
  end
end
```
Every counter defaults to zero; the Orchestrator increments them during processing. error_reports reuses Error.to_array_type for import-level errors that don't belong to a specific row -- same typed-array pattern as ImportRecord#errors_list, so the UI can render both with the same component.
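A rough sketch of how an orchestrator might derive those counters from record statuses -- plain Ruby, with illustrative names rather than the gem's actual aggregation code:

```ruby
# Statuses as produced by determine_status! across a small import.
statuses = %w[complete complete partial missing complete]
tally = statuses.tally  # => { "complete" => 3, "partial" => 1, "missing" => 1 }

report = {
  records_count:  statuses.size,
  complete_count: tally.fetch("complete", 0),
  partial_count:  tally.fetch("partial", 0),
  missing_count:  tally.fetch("missing", 0),
  errored_count:  tally.fetch("partial", 0) + tally.fetch("missing", 0)
}
report[:records_count]  # => 5
report[:errored_count]  # => 2
```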
Step 4 -- TypeValidator: validating before the database
We now have models that can hold import data. But who checks that the data is valid before it gets there? A CSV cell containing "abc" shouldn't silently land in an integer field.
The TypeValidator module handles this at the column level, before the data ever touches ActiveRecord.
```ruby
# lib/data_porter/type_validator.rb
module DataPorter
  module TypeValidator
    # parse_date and valid_url? are private module helpers defined alongside
    # the hash (omitted here for brevity).
    VALIDATORS = {
      string:  ->(_value, _opts) { true },
      integer: ->(value, _opts) { Integer(value, exception: false) },
      decimal: ->(value, _opts) { Float(value, exception: false) },
      date:    ->(value, opts)  { parse_date(value, opts) },
      email:   ->(value, _opts) { value.match?(/\A[^@\s]+@[^@\s]+\z/) },
      phone:   ->(value, _opts) { value.match?(/\A[+\d][\d\s\-().]{6,}\z/) },
      url:     ->(value, _opts) { valid_url?(value) },
      boolean: ->(value, _opts) { %w[true false 1 0].include?(value.to_s.downcase) }
    }.freeze
  end
end
```
Each type maps to a lambda that returns truthy or falsy. The public API is one method: TypeValidator.valid?("42", :integer). Integers use Integer() with exception: false to avoid rescue-driven control flow. The email regex is intentionally loose -- RFC 5322 is beautiful on paper. User CSVs are not. I picked the side of production, not poetry. Dates support an optional :format option for regional formatting like "%d/%m/%Y".
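The one-method public API could dispatch over that hash roughly like this -- a standalone sketch with two types inlined, not the gem's actual source; the fetch-with-fallback and opts handling are assumptions:

```ruby
require "date"

module SketchTypeValidator
  VALIDATORS = {
    integer: ->(value, _opts) { Integer(value, exception: false) },
    date:    ->(value, opts) {
      format = opts.fetch(:format, "%Y-%m-%d")
      begin
        Date.strptime(value, format)
      rescue Date::Error
        false
      end
    }
  }.freeze

  # Coerce the lambda's truthy/falsy result into a strict boolean;
  # unknown types are treated as invalid.
  def self.valid?(value, type, **opts)
    validator = VALIDATORS.fetch(type) { return false }
    !!validator.call(value, opts)
  end
end

SketchTypeValidator.valid?("42", :integer)                           # => true
SketchTypeValidator.valid?("abc", :integer)                          # => false
SketchTypeValidator.valid?("15/01/2024", :date, format: "%d/%m/%Y")  # => true
```

The !! coercion matters: Integer("42", exception: false) returns 42, not true, and specs asserting `be true` need a strict boolean.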
The key design choice: validation happens before data enters the StoreModel. During parsing, the source reads a row, column definitions declare expected types, and the validator checks each value. Errors get added to the ImportRecord via add_error. By the time determine_status! runs, all column-level issues are captured.
This is deliberately separate from database-level validation (uniqueness, foreign keys), which runs later during the actual import. Keeping the two layers apart lets us show users a preview with type errors highlighted before any write attempt -- the foundation for the dry-run feature in part 14.
Decisions & tradeoffs
| Decision | We chose | Over | Because |
|---|---|---|---|
| Row storage | JSONB column (array of StoreModel) | Separate `import_rows` table | Records are always accessed through their parent; no independent queries needed. One fewer migration for host apps to manage |
| Structured JSON | StoreModel gem | Hand-rolled `serialize` / raw hashes | StoreModel gives us ActiveModel attributes, typed arrays, defaults, and validations. Writing our own serializer would duplicate all of that |
| Validation layer | Column-level TypeValidator + later DB-level | Database-only validation | Enables preview and dry-run without touching the database. Users see type errors immediately, before any write attempt |
| Error representation | StoreModel class with `:message` | Plain strings in an array | Extensible -- we can add `:code`, `:severity`, `:column` later without changing the array structure or breaking existing serialized data |
| Status logic | Method on ImportRecord (`determine_status!`) | External service or state machine gem | Status depends only on the record's own errors. No transitions or events needed. A method is the simplest thing that works |
When NOT to use this pattern
JSONB is great for workflows, terrible for domains. This design works because import records are ephemeral, always accessed through their parent, and never queried across imports.
Don't reach for JSONB when:
- You need to JOIN it -- if other tables reference your data, it belongs in a real table with foreign keys.
- You need to ANALYZE it -- "top 10 most frequent errors across all imports" is a SQL query on a relational table, not a JSONB treasure hunt.
- It's core business data -- orders, users, payments deserve constraints, indexes, and referential integrity. JSONB gives you none of that.
- Size explodes -- StoreModel loads the entire blob into Ruby objects. A million-row import in a single JSONB column is a one-way ticket to OOM city.
StoreModel improves developer experience. It does not change database architecture. Know the difference and you'll sleep better.
Testing it
ImportRecord specs verify status determination:
```ruby
# spec/data_porter/store_models/import_record_spec.rb
RSpec.describe DataPorter::StoreModels::ImportRecord do
  subject(:record) { described_class.new(line_number: 1, data: { name: "Alice" }) }

  describe "#determine_status!" do
    it "sets missing when required field error exists" do
      record.add_error("Name is required")
      record.determine_status!
      expect(record.status).to eq("missing")
    end

    it "sets partial when non-required error exists" do
      record.add_error("Email: invalid email")
      record.determine_status!
      expect(record.status).to eq("partial")
    end

    it "sets complete when no errors" do
      record.determine_status!
      expect(record.status).to eq("complete")
    end
  end
end
```
TypeValidator specs cover each type, including edge cases like custom date formats:
```ruby
# spec/data_porter/type_validator_spec.rb
RSpec.describe DataPorter::TypeValidator do
  it "accepts valid integers" do
    expect(described_class.valid?("42", :integer)).to be true
  end

  it "rejects non-integers" do
    expect(described_class.valid?("abc", :integer)).to be false
  end

  it "accepts dates with custom format" do
    expect(described_class.valid?("15/01/2024", :date, format: "%d/%m/%Y")).to be true
  end
end
```
No database setup needed for any of these. StoreModel objects instantiate in memory like plain Ruby objects, which makes the specs fast and isolated.
Recap
- JSONB columns over extra tables -- records are always accessed through their parent.
- StoreModel over raw hashes -- typed attributes, defaults, and methods for free.
- ImportRecord holds parsed data, collects errors, and determines its own status.
- TypeValidator validates before the database -- enables preview and dry-run without writes.
Next up
We have configuration (part 3) and data models (this part). In part 5, we'll bring them together by designing the Target DSL -- the class-level interface that lets each import type declare its label, model, columns, and CSV mapping in a single file. One file per import type, zero boilerplate. If you've ever wanted class_attribute to do more heavy lifting, that's the one.
*This is part 4 of the series "Building DataPorter - A Data Import Engine for Rails". Previous: Configuration DSL | Next: Designing a Target DSL (coming soon)*
GitHub: SerylLns/data_porter | RubyGems: data_porter