Seryl Lns

Posted on Mar 19

Building a Rails Engine #12 -- Beyond CSV: JSON and API Sources

#ruby #rails #api #development

Beyond CSV: JSON and API Sources

CSV is the king of data import -- but in the real world, data also arrives as JSON from a file, or directly from a third-party API. Here is how DataPorter extends its source architecture to absorb these new formats without breaking anything.

Context

This is part 12 of the series where we build DataPorter, a mountable Rails engine for data import workflows. In part 11, we built the install and target generators so adopting the gem requires a single command.

Until now, DataPorter only reads CSV. That covers a lot of cases, but the real world is more varied. If every new format requires rearchitecting the engine, something went wrong. The Sources::Base abstraction we established in part 6 is about to prove its worth.

Interactive walkthrough: "DataPorter Sources Architecture" on explaina.app, an interactive explorer showing how one fetch contract powers CSV, JSON, and API sources. Pick a source type and see the code.

Why multiple sources?

An import engine that only speaks CSV forces users to convert their data before importing. That is unnecessary friction. Your hotel partner sends a nightly JSON export of guest reservations. Your CRM exposes a REST API with paginated contacts. A front-end form lets operators paste raw JSON into a text field. Each of these is a valid data source, and none of them is a CSV.

By supporting JSON and API natively, we cover all three without asking users to convert anything first. The key point: every source must respect the same contract -- a fetch method that returns an array of hashes with symbol keys. The rest of the pipeline (validation, transformation, persistence) does not change.
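To make the contract concrete, here is a minimal sketch in plain Ruby. ToyBase, InMemorySource, and contract_respected? are hypothetical names used only for illustration. The real Sources::Base from part 6 is not reproduced here, and only its constructor shape (a single data_import argument) is assumed.

```ruby
# Hypothetical stand-in for the Base class from part 6; only the
# constructor shape used later in this article is assumed.
class ToyBase
  def initialize(data_import)
    @data_import = data_import
  end

  # Every source overrides this and returns an Array of Hashes
  # whose keys are Symbols.
  def fetch
    raise NotImplementedError, "#{self.class} must implement #fetch"
  end
end

# Hypothetical source that reads rows already held in memory.
class InMemorySource < ToyBase
  def fetch
    @data_import.fetch(:rows).map { |row| row.transform_keys(&:to_sym) }
  end
end

# Checks the shape that validation, transformation, and persistence rely on.
def contract_respected?(records)
  records.is_a?(Array) &&
    records.all? { |r| r.is_a?(Hash) && r.keys.all?(Symbol) }
end

source = InMemorySource.new({ rows: [{ "name" => "Alice" }] })
records = source.fetch
puts contract_respected?(records)
```

As long as a source passes a check of this kind, the downstream steps should not need to know where the rows came from.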

The JSON source

The JSON source must handle three ways to receive content: direct injection (for tests or programmatic use), raw JSON stored in the import configuration, and download from an ActiveStorage attachment.

# lib/data_porter/sources/json.rb
module DataPorter
  module Sources
    class Json < Base
      def initialize(data_import, content: nil)
        super(data_import)
        @content = content
      end

      def fetch
        parsed = ::JSON.parse(json_content)
        records = extract_records(parsed)

        Array(records).map do |hash|
          hash.transform_keys { |k| k.parameterize(separator: "_").to_sym }
        end
      end

      private

      def json_content
        @content || config_raw_json || download_file
      end

      def config_raw_json
        config = @data_import.config
        config["raw_json"] if config.is_a?(Hash)
      end

      def download_file
        @data_import.file.download
      end

      def extract_records(parsed)
        root = @target_class._json_root
        return parsed unless root

        parsed.dig(*root.split("."))
      end
    end
  end
end

Three things stand out.

The json_content cascade. The method tries three sources in order: content injected into the constructor, the raw_json key in the import configuration, and finally the ActiveStorage file. The caller never declares which input mode is in use -- the first source that returns a value wins, and the file is only downloaded when the two cheaper options are absent.
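The cascade can be exercised in isolation with a couple of stand-ins. This is a sketch, not the engine's code: FakeImport and FakeFile are hypothetical substitutes for the import record and its ActiveStorage attachment, and json_content is written as a free function for brevity.

```ruby
# Hypothetical stand-in for the ActiveStorage attachment.
FakeFile = Struct.new(:body) do
  def download
    body
  end
end

# Hypothetical stand-in for the import record.
FakeImport = Struct.new(:config, :file)

# Same order as the article: injected content, then raw_json, then the file.
def json_content(import, content: nil)
  raw = import.config["raw_json"] if import.config.is_a?(Hash)
  content || raw || import.file.download
end

file = FakeFile.new('[{"from":"file"}]')

from_file = json_content(FakeImport.new({}, file))
from_config = json_content(FakeImport.new({ "raw_json" => '[{"from":"config"}]' }, file))
from_injection = json_content(
  FakeImport.new({ "raw_json" => "[]" }, file),
  content: '[{"from":"injection"}]'
)
```

One thing to keep in mind with a `||` chain: an empty string is truthy in Ruby, so an empty content or raw_json value would win the cascade and then fail at parse time rather than fall through to the file.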

json_root for nested paths. Real-world APIs and JSON files often wrap data in a structure: {"data": {"guests": [...]}}. Rather than forcing the user to flatten their JSON, we give them a DSL method in the Target:

class GuestsTarget < DataPorter::Target
  label "Guests"
  model_name "Guest"
  json_root "data.guests"

  columns do
    column :name, type: :string
  end
end

The extract_records method uses dig by splitting the path on dots. "data.guests" becomes parsed.dig("data", "guests"). Simple, readable, and supports any level of nesting.
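A standalone sketch of that lookup, using only the standard library. The payload is made up for the example, and extract_records takes the root as an argument here instead of reading it from the Target.

```ruby
require "json"

# Walks a dotted path such as "data.guests" through nested hashes.
def extract_records(parsed, root)
  return parsed unless root

  parsed.dig(*root.split("."))
end

payload = JSON.parse('{"data": {"guests": [{"name": "Alice"}, {"name": "Bob"}]}}')

guests = extract_records(payload, "data.guests")   # the nested array
missing = extract_records(payload, "data.rooms")   # nil: the key does not exist
whole = extract_records([{ "name" => "Alice" }], nil) # no root: payload returned as-is
```

Note that a path that does not exist yields nil, and Array(nil) is an empty array, so a mistyped json_root would likely show up as an import with zero records rather than as an error.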

Key normalization. As with the CSV source, every key is transformed via parameterize(separator: "_").to_sym. "First Name" becomes :first_name. This guarantees the rest of the pipeline always receives keys in the same format, regardless of the source format.
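For readers without ActiveSupport at hand, the normalization can be approximated in plain Ruby. This is an assumption-laden sketch: normalize_key only imitates parameterize for simple ASCII keys (no transliteration of accented characters), and normalize_records is a hypothetical helper that also wraps a lone JSON object, which the source code above does not do.

```ruby
# Plain-Ruby approximation of parameterize(separator: "_").to_sym
# for ASCII keys. Accented characters are not transliterated here.
def normalize_key(key)
  key.to_s
     .downcase
     .gsub(/[^a-z0-9\-_]+/, "_") # replace runs of other characters
     .gsub(/_{2,}/, "_")         # collapse repeated separators
     .gsub(/\A_+|_+\z/, "")      # trim leading and trailing separators
     .to_sym
end

# Hypothetical helper: Array(hash) would return [[key, value], ...],
# so a single object is wrapped explicitly before mapping.
def normalize_records(records)
  list = records.is_a?(Hash) ? [records] : Array(records)
  list.map { |hash| hash.transform_keys { |k| normalize_key(k) } }
end

rows = normalize_records([{ "First Name" => "Alice", "  Last   Name " => "Smith" }])
single = normalize_records({ "First Name" => "Bob" })
```

The explicit wrapping matters if an endpoint or file can return one object instead of an array, since Array() applied to a Hash produces key/value pairs rather than a one-element list.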

The JSON source covers file-based and programmatic use cases. But sometimes the data lives behind an HTTP endpoint, and the engine needs to go fetch it.

The API source

The API source fetches data from an HTTP endpoint. It must support static and dynamic endpoints, fixed and lazily-generated headers, and data extraction from a response key.

# lib/data_porter/sources/api.rb
require "json"
require "net/http"

module DataPorter
  module Sources
    class Api < Base
      def fetch
        api = @target_class._api_config
        response = perform_request(api)
        parsed = ::JSON.parse(response.body)
        records = extract_records(parsed, api)

        Array(records).map do |hash|
          hash.transform_keys { |k| k.parameterize(separator: "_").to_sym }
        end
      end

      private

      def perform_request(api)
        url = resolve_endpoint(api)
        headers = resolve_headers(api)
        uri = URI(url)

        Net::HTTP.start(uri.host, uri.port, use_ssl: uri.scheme == "https") do |http|
          request = Net::HTTP::Get.new(uri)
          headers.each { |k, v| request[k] = v }
          http.request(request)
        end
      end

      def resolve_endpoint(api)
        params = @data_import.config.symbolize_keys
        api.endpoint.is_a?(Proc) ? api.endpoint.call(params) : api.endpoint
      end

      def resolve_headers(api)
        api.headers.is_a?(Proc) ? api.headers.call : (api.headers || {})
      end

      def extract_records(parsed, api)
        root = api.response_root
        root ? parsed[root.to_s] : parsed
      end
    end
  end
end

The core logic lives in resolve_endpoint and resolve_headers. Each accepts either a static value or a lambda. This opens two usage modes:

# Static endpoint, fixed headers
api_config do
  endpoint "https://api.example.com/stays"
  headers({ "Authorization" => "Bearer abc123" })
  response_root :stays
end

# Dynamic endpoint, lazily-generated headers
api_config do
  endpoint ->(params) { "https://api.example.com/items?id=#{params[:item_id]}" }
  headers(-> { { "Authorization" => "Bearer #{Token.current}" } })
end

For the endpoint lambda, parameters come from @data_import.config.symbolize_keys. The user passes config: { item_id: "42" } when creating the import, and the lambda receives those parameters to build the URL. For headers, the lambda is called with no argument -- it retrieves the token from wherever it lives (environment variable, model, external service).
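The two resolution modes can be checked without any network access by building the request object and inspecting it. In this sketch, build_request is a hypothetical helper, "demo-token" is a placeholder value, and transform_keys(&:to_sym) stands in for ActiveSupport's symbolize_keys.

```ruby
require "net/http"
require "uri"

def resolve_endpoint(endpoint, config)
  params = config.transform_keys(&:to_sym) # stand-in for symbolize_keys
  endpoint.is_a?(Proc) ? endpoint.call(params) : endpoint
end

def resolve_headers(headers)
  headers.is_a?(Proc) ? headers.call : (headers || {})
end

# Hypothetical helper: builds the GET request but never sends it.
def build_request(endpoint, headers, config)
  uri = URI(resolve_endpoint(endpoint, config))
  request = Net::HTTP::Get.new(uri)
  resolve_headers(headers).each { |k, v| request[k] = v }
  request
end

dynamic = build_request(
  ->(params) { "https://api.example.com/items?id=#{params[:item_id]}" },
  -> { { "Authorization" => "Bearer demo-token" } },
  { "item_id" => "42" }
)
static = build_request("https://api.example.com/stays", nil, {})

puts dynamic.path
puts static.path
```

Values interpolated into the URL are not escaped by this code. If a parameter can contain spaces or reserved characters, wrapping it in URI.encode_www_form_component inside the lambda is probably worth it.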

response_root works like json_root from the JSON source, but simpler: it extracts a single key from the response hash. response_root :stays on a response {"stays": [...]} returns the array directly. If no response_root is defined, the entire response is used.
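A short sketch of the difference, with a made-up response body. extract_records takes the root directly here instead of reading it from the ApiConfig object.

```ruby
require "json"

# Single-key lookup: JSON.parse returns string keys, hence the to_s.
def extract_records(parsed, response_root)
  response_root ? parsed[response_root.to_s] : parsed
end

parsed = JSON.parse('{"stays": [{"id": 1}, {"id": 2}], "meta": {"page": 1}}')

stays = extract_records(parsed, :stays)        # the array under "stays"
everything = extract_records(parsed, nil)      # no root: whole response
dotted = extract_records(parsed, "meta.page")  # nil: dotted paths are not walked
```

Unlike json_root, a dotted value is treated as one literal key, so nested API responses would need either a deeper response_root mechanism or a flatter endpoint.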

The ApiConfig DSL pattern

The API configuration uses a dedicated DSL object rather than plain attr_accessor:

# lib/data_porter/dsl/api_config.rb
module DataPorter
  module DSL
    class ApiConfig
      def endpoint(value = nil)
        return @endpoint if value.nil?

        @endpoint = value
      end

      def headers(value = nil)
        return @headers if value.nil?

        @headers = value
      end

      def response_root(value = nil)
        return @response_root if value.nil?

        @response_root = value
      end
    end
  end
end

Each method plays a dual role: called with an argument, it acts as a setter; called without, it acts as a getter. In the Target, api_config creates an ApiConfig instance and executes the block via instance_eval -- the same DSL object + instance_eval pattern we used for the columns block. It is a classic Ruby idiom that gives clean syntax while keeping the implementation testable as a PORO. One consequence of using nil as the "no argument" marker: once set, a value cannot be reset to nil through the DSL.
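The Target side of this pattern is not shown in this part, so here is a minimal sketch of how it could be wired. ToyTarget and StaysTarget are hypothetical, the ApiConfig class is a trimmed copy of the one above, and only the _api_config reader name is taken from the source code.

```ruby
# Trimmed copy of the DSL object: getter without argument, setter with one.
class ApiConfig
  def endpoint(value = nil)
    return @endpoint if value.nil?

    @endpoint = value
  end

  def headers(value = nil)
    return @headers if value.nil?

    @headers = value
  end

  def response_root(value = nil)
    return @response_root if value.nil?

    @response_root = value
  end
end

# Hypothetical base class: the real DataPorter::Target may differ.
class ToyTarget
  class << self
    attr_reader :_api_config

    # Evaluates the block with the ApiConfig instance as self, so
    # bare calls like `endpoint "..."` land on the config object.
    def api_config(&block)
      @_api_config = ApiConfig.new
      @_api_config.instance_eval(&block)
    end
  end
end

class StaysTarget < ToyTarget
  api_config do
    endpoint "https://api.example.com/stays"
    response_root :stays
  end
end

config = StaysTarget._api_config
puts config.endpoint
```

Because the configuration lives in a class-level instance variable, each Target subclass keeps its own ApiConfig and the base class stays unconfigured.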

Dispatch via Sources.resolve

Adding a new source leaves the existing sources, the Orchestrator, and the controllers untouched. The Sources module maintains a simple registry, and a new source only adds one entry to it:

# lib/data_porter/sources.rb
module DataPorter
  module Sources
    REGISTRY = {
      api: Api,
      csv: Csv,
      json: Json
    }.freeze

    def self.resolve(type)
      REGISTRY.fetch(type.to_sym) { raise Error, "Unknown source type: #{type}" }
    end
  end
end

The Orchestrator calls Sources.resolve(import.source_type) and receives the right class. It then instantiates the source and calls fetch. Neither the Orchestrator nor the controllers know which source type is being used -- it is the source_type stored in the import that decides. Adding an XML or Parquet source would require: a class inheriting from Base, one entry in the REGISTRY, and nothing else.
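The dispatch behaviour is easy to verify in isolation. The sketch below uses a hypothetical ToyPorter namespace with empty placeholder classes, so it shows only the lookup and the error path, not the real sources.

```ruby
module ToyPorter
  class Error < StandardError; end

  module Sources
    # Placeholder classes standing in for the real sources.
    class Csv; end
    class Json; end

    REGISTRY = {
      csv: Csv,
      json: Json
    }.freeze

    # Accepts a String or Symbol, as source_type is typically stored as a string.
    def self.resolve(type)
      REGISTRY.fetch(type.to_sym) { raise Error, "Unknown source type: #{type}" }
    end
  end
end

resolved = ToyPorter::Sources.resolve("json")

error_message =
  begin
    ToyPorter::Sources.resolve(:xml)
  rescue ToyPorter::Error => e
    e.message
  end

puts resolved
puts error_message
```

Because the hash is frozen, a new source cannot be registered at runtime. It has to be added to the literal, which keeps the list of supported types visible in one place.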

The TDD approach

Both sources were built test-first. The JSON source specs cover direct injection, json_root extraction, and the raw_json fallback in the cascade:

it "parses JSON array content" do
  json = '[{"first_name": "Alice", "last_name": "Smith"}]'
  source = described_class.new(import, content: json)

  expect(source.fetch.first[:first_name]).to eq("Alice")
end

it "extracts records from a nested path" do
  json = '{"data": {"guests": [{"name": "Alice"}, {"name": "Bob"}]}}'
  source = described_class.new(import_with_root, content: json)

  expect(source.fetch.size).to eq(2)
end

The API source specs stub Net::HTTP.start and test the same axes: static vs lambda endpoint, header resolution, and response_root extraction. We are not testing that Net::HTTP works -- we are testing that our code correctly composes the URL, headers, and extracts the right data from the response.
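The API specs themselves are not shown here, so the following is only a framework-free sketch of the idea, not the suite's code. It swaps Net::HTTP.start by hand instead of using RSpec's stubbing, and FakeHttp, FakeResponse, with_fake_http, and fetch_records are all hypothetical names.

```ruby
require "json"
require "net/http"
require "uri"

FakeResponse = Struct.new(:body)

# Records every request it is asked to send and answers with a canned body.
class FakeHttp
  attr_reader :requests

  def initialize(body)
    @body = body
    @requests = []
  end

  def request(req)
    @requests << req
    FakeResponse.new(@body)
  end
end

# Temporarily replaces Net::HTTP.start so the caller's block receives
# the fake connection; the original method is restored afterwards.
def with_fake_http(fake)
  meta = Net::HTTP.singleton_class
  meta.send(:alias_method, :__real_start, :start)
  meta.send(:define_method, :start) { |*_args, **_opts, &blk| blk.call(fake) }
  yield
ensure
  meta.send(:alias_method, :start, :__real_start)
  meta.send(:remove_method, :__real_start)
end

# Simplified version of the request flow from the API source.
def fetch_records(url, headers, root)
  uri = URI(url)
  response = Net::HTTP.start(uri.host, uri.port, use_ssl: uri.scheme == "https") do |http|
    request = Net::HTTP::Get.new(uri)
    headers.each { |k, v| request[k] = v }
    http.request(request)
  end
  JSON.parse(response.body).fetch(root)
end

fake = FakeHttp.new('{"stays": [{"id": 1}]}')
records = with_fake_http(fake) do
  fetch_records("https://api.example.com/stays", { "Authorization" => "Bearer test" }, "stays")
end
sent = fake.requests.first
```

The assertions then target exactly what the paragraph above describes: the path and headers of the recorded request, and the records extracted from the canned body.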

Decisions & tradeoffs

| Decision | We chose | Over | Because |
| --- | --- | --- | --- |
| HTTP client | `Net::HTTP` (stdlib) | Faraday, HTTParty | Zero extra dependency; sufficient for simple GETs |
| Dynamic endpoint | Lambda receiving `params` | String with interpolation | The lambda allows any logic (conditions, service calls) without string eval |
| Dynamic headers | Lambda with no argument | Callback with context | Headers often come from a global service (ENV, token store), not the import context |
| JSON cascade | `content` > `raw_json` > `file` | Mandatory argument | Maximum flexibility; each use case finds its natural path |
| Key normalization | `parameterize` + `to_sym` | Explicit mapping | Consistent with the CSV source; the downstream pipeline always receives the same format |

Recap

The JSON source supports three input modes (injection, config raw_json, file) via a cascade of fallbacks, and uses json_root to navigate nested structures.
The API source dynamically resolves endpoints and headers through a static/lambda dual system, and extracts data via response_root.
The ApiConfig DSL uses a combined getter/setter pattern instead of attr_accessor, evaluated in an instance_eval block for natural syntax.
Sources.resolve dispatches to the right class via a frozen registry -- adding a source is a two-step operation: one new class, one registry entry.
Tests cover every path through every source without touching the network, thanks to content injection and HTTP stubbing.

Next up

JSON and API sources complete the trio of supported formats. But we have not yet talked about the engine's overall testing strategy -- how to test a Rails engine without a full host application, how to organize specs between unit and integration tests, how to mock ActiveStorage and ActionCable. In part 13, we dive into testing a Rails engine with RSpec and the patterns that keep the suite fast and reliable.

This is part 12 of the series "Building DataPorter - A Data Import Engine for Rails". Previous: Generators: Install & Target Scaffolding | Next: Testing a Rails Engine with RSpec
