Why switch? → 10 Ways Ruby's CSV.read Can Silently Corrupt or Lose Your Data
In this article we'll explore how easy it is to switch from Ruby CSV to SmarterCSV — often just a single line change.
But we'll also go beyond the basics and look at advanced scenarios where SmarterCSV really shines: parallel processing with Sidekiq, streaming imports directly from S3, production-grade instrumentation, and resumable imports that survive deployments mid-file. These are patterns that Ruby's built-in CSV library can't handle without you building all the plumbing from scratch.
Here's how to make the switch in 5 minutes.
Ruby's built-in CSV library works — but it's slow, and its default output is arrays of arrays, where row data is disassociated from the headers. That means your code has to manually correlate values with column names, introducing risk and boilerplate. The result doesn't lend itself to direct use with ActiveRecord, Sidekiq, or any hash-based workflow — you're always required to do post-processing to get to usable data, re-implementing boilerplate code to clean up data.
SmarterCSV returns Rails-ready hashes with symbol keys, automatic numeric conversion, and whitespace stripping, using sensible defaults to clean up the data — all out of the box. No boilerplate code, and it does it up to 2×–9× faster than CSV.read while returning cleaned-up arrays of hashes.
How Much Faster? 🚀
SmarterCSV is designed for real-world CSV processing — the full pipeline including hash construction, key normalization, and type conversion, not just raw tokenization.
| Comparison | Range |
|---|---|
vs CSV.read (arrays only, no post-processing) |
1.7×–8.6× faster |
vs CSV.table (closest equivalent output) |
7×–129× faster |
Benchmarks: 19 CSV files (20k–80k rows), Ruby 3.4.7, Apple M1, SmarterCSV 1.16.0.
The CSV.table comparison is the fair one — both return symbol-keyed hashes. CSV.read returns raw arrays, so the post-processing work your application still needs to do is not included in that number, understating the real cost difference.
Step 1: Install
# Gemfile
gem 'smarter_csv'
bundle install
Step 2: The One-Line Switch
Most developers use CSV.read with headers: true, which returns an array of CSV::Row objects with string keys. To get usable hashes, you need to call .map(&:to_h) — and you still have string keys, no type conversion, and no whitespace stripping.
Consider this real-world CSV file — messy headers, extra columns without headers, a trailing comma:
$ cat data.csv
First Name , Last Name , Age
Alice , Smith, 30, VIP, Gold ,
Bob, Jones, 25
Before: Ruby CSV
rows = CSV.read('data.csv', headers: true).map(&:to_h)
rows.first
# => {" First Name " => "Alice ", " Last Name " => " Smith", " Age" => " 30", nil => ""}
# ^^^ "VIP" and "Gold" silently lost!
Whitespace-polluted keys, Age as a string, and every extra column competes for the same nil key — the last one wins, the rest are silently discarded.
After: SmarterCSV
rows = SmarterCSV.process('data.csv')
rows.first
# => {first_name: "Alice", last_name: "Smith", age: 30, column_1: "VIP", column_2: "Gold"}
# trailing empty field dropped, no data loss
Clean symbol keys, whitespace stripped, age converted to Integer, extra columns named — no data loss.
That's it. No .map(&:to_h), no header_converters:, no manual post-processing.
Step 3: Know the Differences
SmarterCSV's defaults are designed for real-world use. Here's what changes and what to watch for:
String keys → Symbol keys
CSV.read returns string keys by default. SmarterCSV returns symbol keys, which are more efficient (much lower memory usage and faster), as well as idiomatic for Rails. If you genuinely need string keys, you could still add strings_as_keys: true.
Before: Ruby CSV
rows = CSV.read('data.csv', headers: true).map(&:to_h)
rows.first['name'] # => "Alice"
After: SmarterCSV
# symbol keys (the default)
rows = SmarterCSV.process('data.csv')
rows.first[:name] # => "Alice"
# string keys — if you don't mind the memory impact
rows = SmarterCSV.process('data.csv', strings_as_keys: true)
rows.first['name'] # => "Alice"
Numeric conversion is automatic
CSV.read returns everything as strings - not ideal for consumption. SmarterCSV converts "42" → 42 and "3.14" → 3.14 automatically.
Watch out for columns where leading zeros matter — ZIP codes, phone numbers, account numbers - and exclude them:
Before: Ruby CSV
rows = CSV.read('data.csv', headers: true).map(&:to_h)
rows.first['age'] # => "30" (string)
After: SmarterCSV
# numeric strings converted automatically
rows = SmarterCSV.process('data.csv')
rows.first[:age] # => 30 (Integer)
# Exclude columns where leading zeros matter
rows = SmarterCSV.process('data.csv',
convert_values_to_numeric: { except: [:zip_code, :phone, :account_number] })
Empty values are removed
SmarterCSV drops key/value pairs where the value is blank, and returns cleaned-up data hashes. CSV.read keeps them as nil.
Before: Ruby CSV
rows = CSV.read('sample.csv', headers: true, header_converters: :symbol).map(&:to_h)
rows[1] # => { name: "Bob", age: "25", city: nil }
After: SmarterCSV
rows = SmarterCSV.process('sample.csv')
rows[1] # => { name: "Bob", age: 25 } ← :city dropped, :age converted
# To keep nil values (match CSV.read behaviour):
rows = SmarterCSV.process('sample.csv', remove_empty_values: false)
rows[1] # => { name: "Bob", age: 25, city: nil }
Plain Hash, not CSV::Row
CSV.read returns CSV::Row objects — a wrapper around a hash with extra methods. SmarterCSV returns plain Ruby Hash objects, so there's no .to_h needed and no wrapper to unwrap.
Before: Ruby CSV
row = CSV.read('data.csv', headers: true).first
row.class # => CSV::Row
row['name'] # => "Alice" (string key)
row.to_h # => {"name" => "Alice", "age" => "30"} (still strings)
After: SmarterCSV
row = SmarterCSV.process('data.csv').first
row.class # => Hash
row[:name] # => "Alice" (symbol key, no unwrapping needed)
Quick Reference
| Ruby CSV | SmarterCSV | Notes |
|---|---|---|
CSV.read(f, headers: true).map(&:to_h) |
SmarterCSV.process(f) |
Symbol keys, numeric conversion, whitespace stripped. |
CSV.read(f, headers: true, header_converters: :symbol).map(&:to_h) |
SmarterCSV.process(f) |
Drop-in. |
CSV.table(f) |
SmarterCSV.process(f) |
CSV.table returns a CSV::Table of CSV::Row objects; SmarterCSV returns a plain Array of Hash. |
CSV.parse(str, headers: true, header_converters: :symbol) |
SmarterCSV.parse(str) |
Direct string parsing, new in 1.16.0. |
| `CSV.foreach(f, headers: true) { \ | r\ | }` |
converters: :numeric |
default | Automatic in SmarterCSV. |
converters: :date |
value_converters: {col: DateConverter} |
Use explicit format strings — date formats are locale-dependent. |
liberal_parsing: true |
on_bad_row: :collect |
Explicit quarantine gives you visibility. |
skip_blanks: true |
remove_empty_hashes: true |
Default in SmarterCSV. |
row.to_h |
row |
Already a plain Hash. |
row.headers |
reader.headers |
Available on the Reader instance. |
Beyond the Basics: What You Unlock
Once you're on SmarterCSV, these features come for free.
Batch processing for large files
SmarterCSV.process('big.csv', chunk_size: 500) do |chunk|
MyModel.insert_all(chunk) # bulk insert 500 rows at a time
end
Handle bad rows without crashing
good_rows = SmarterCSV.process('data.csv', on_bad_row: :collect)
puts "#{good_rows.size} imported, #{SmarterCSV.errors[:bad_row_count]} bad rows"
SmarterCSV.errors[:bad_rows].each { |r| puts "Line #{r[:file_line_number]}: #{r[:error_message]}" }
⚠️ Fibers:
SmarterCSV.errorsrelies onThread.current, which is shared across all
fibers in the same thread. If you process CSV concurrently in fibers (Async,Falcon,
manualFiberscheduling), useSmarterCSV::Readerdirectly instead — its errors are
scoped to the instance.
Or use SmarterCSV::Reader directly when you also need access to headers or other reader state after processing:
reader = SmarterCSV::Reader.new('data.csv', on_bad_row: :collect)
good_rows = reader.process
puts "#{good_rows.size} imported, #{reader.errors[:bad_rows].size} bad rows"
reader.errors[:bad_rows].each { |r| puts "Line #{r[:file_line_number]}: #{r[:error_message]}" }
Sentinel values (NULL, N/A, #VALUE!)
rows = SmarterCSV.process('export.csv',
nil_values_matching: /\A(NULL|N\/A|NaN|#VALUE!)\z/i)
# Matching values are nil-ified and removed automatically
Custom type converters
Date formats are locale-dependent, so SmarterCSV doesn't guess. You supply an explicit format:
require 'date'
rows = SmarterCSV.process('records.csv',
value_converters: {
birth_date: ->(v) { v ? Date.strptime(v, '%m/%d/%Y') : nil },
price: ->(v) { v&.delete('$,')&.to_f },
active: ->(v) { v&.match?(/\Atrue\z/i) },
})
Row-by-row iteration with full Enumerable
Before: Ruby CSV
CSV.foreach('data.csv', headers: true, header_converters: :symbol) do |row|
MyModel.create(row.to_h)
end
After: SmarterCSV
SmarterCSV.each('data.csv') do |row|
MyModel.create(row) # already a Hash
end
# Full Enumerable — filter, map, lazy
active_users = SmarterCSV.each('data.csv').select { |r| r[:status] == 'active' }
first_ten = SmarterCSV.each('data.csv').lazy.first(10)
Rails file upload
Accepting a CSV upload in a Rails controller is straightforward — pass the tempfile path directly:
# app/controllers/imports_controller.rb
def create
file = params[:file] # ActionDispatch::Http::UploadedFile
SmarterCSV.process(file.path, chunk_size: 500) do |chunk|
MyModel.insert_all(chunk)
end
redirect_to root_path, notice: "Import complete"
end
No temp file management, no manual header parsing. The uploaded file is processed in streaming chunks, keeping memory usage low regardless of file size.
Renaming headers to match your database
CSV column names rarely match your ActiveRecord attribute names. Use key_mapping: to rename them in one step — the mapping uses the normalized (downcased, underscored) header name as input:
# CSV headers: "First Name", "Last Name", "E-Mail", "Date of Birth"
# After normalization: :first_name, :last_name, :e_mail, :date_of_birth
rows = SmarterCSV.process('contacts.csv',
key_mapping: {
first_name: :given_name,
last_name: :family_name,
e_mail: :email,
date_of_birth: :dob,
})
# => [{given_name: "Alice", family_name: "Smith", email: "alice@example.com", dob: "1990-05-14"}, ...]
Map a key to nil to drop that column entirely:
key_mapping: { internal_id: nil, created_at: nil } # these columns won't appear in results
Select only the columns you need
Wide CSV files often have dozens of columns your application doesn't need. Use headers: { only: } to declare upfront which columns to keep — SmarterCSV skips everything else at the parser level, so unneeded fields are never allocated:
# CSV has 50 columns — you only need 3
rows = SmarterCSV.process('contacts.csv',
headers: { only: [:email, :first_name, :last_name] })
# => [{email: "alice@example.com", first_name: "Alice", last_name: "Smith"}, ...]
Or exclude a known noisy column while keeping everything else:
rows = SmarterCSV.process('export.csv', headers: { except: [:internal_notes] })
Writing CSV from hashes
Before: Ruby CSV
# you manage headers manually, pass arrays
CSV.open('out.csv', 'w', write_headers: true, headers: ['name', 'age']) do |csv|
csv << ['Alice', 30]
end
After: SmarterCSV
# pass hashes, headers discovered automatically
SmarterCSV.generate('out.csv') do |csv|
csv << {name: 'Alice', age: 30}
csv << {name: 'Bob', age: 25}
end
Advanced Patterns: Where SmarterCSV Really Shines
These are scenarios where Ruby CSV falls short and SmarterCSV makes the solution clean and straightforward.
Parallel processing with Sidekiq
Each chunk is dispatched as an independent background job — the chunk API maps directly onto worker queues:
SmarterCSV.process('users.csv', chunk_size: 100) do |chunk, chunk_index|
puts "Queueing chunk #{chunk_index} (#{chunk.size} records)..."
Sidekiq::Client.push_bulk(
'class' => UserImportWorker,
'args' => chunk,
)
end
# => imports run in parallel across all your Sidekiq workers
Streaming directly from S3
SmarterCSV accepts any IO-like object — so you can stream a CSV directly from S3 without writing a temp file:
require 'aws-sdk-s3'
s3 = Aws::S3::Client.new(region: 'us-east-1')
obj = s3.get_object(bucket: 'my-bucket', key: 'imports/contacts.csv')
SmarterCSV::Reader.new(obj.body, chunk_size: 500).each_chunk do |chunk, _index|
MyModel.insert_all(chunk)
end
No disk I/O, no temp files, no cleanup. The S3 response body streams directly into the parser.
Production instrumentation
on_start, on_chunk, and on_complete hooks give you full visibility into long-running imports — feed them into your logger, StatsD, or any metrics backend:
SmarterCSV.process('large_import.csv',
chunk_size: 1_000,
on_start: ->(info) {
Rails.logger.info "Import started: #{info[:input]} (#{info[:file_size]} bytes)"
},
on_chunk: ->(info) {
Rails.logger.debug "Chunk #{info[:chunk_number]}: #{info[:rows_in_chunk]} rows " \
"(#{info[:total_rows_so_far]} total so far)"
},
on_complete: ->(stats) {
Rails.logger.info "Import complete: #{stats[:total_rows]} rows in #{stats[:duration].round(2)}s, " \
"#{stats[:bad_rows]} bad rows"
StatsD.histogram('csv.import.duration', stats[:duration])
},
) { |chunk| MyModel.insert_all(chunk) }
Resumable imports with Rails ActiveJob
Rails 8.1 introduced ActiveJob::Continuable — jobs that can pause mid-execution (on deployment or queue drain) and resume exactly where they stopped. SmarterCSV's chunk_index maps directly onto the job cursor:
class ImportCsvJob < ApplicationJob
include ActiveJob::Continuable
def perform(file_path)
step :import_rows do |step|
SmarterCSV.process(file_path, chunk_size: 500) do |chunk, chunk_index|
next if chunk_index < step.cursor.to_i # skip already-processed chunks on resume
MyModel.insert_all(chunk)
step.set! chunk_index + 1
end
end
end
end
On interruption after chunk 7, Rails persists the cursor as 8. On the next run, chunks 0–7 are skipped instantly and processing resumes from chunk 8. Ruby CSV has no equivalent — you'd have to implement cursor tracking, row counting, and resume logic yourself.
Bulk upsert — insert or update
For recurring imports where records may already exist, use upsert_all with a unique key. SmarterCSV's hashes pass directly — no transformation needed:
SmarterCSV.process('contacts.csv',
chunk_size: 500,
key_mapping: { e_mail: :email }, # normalize header to match DB column
) do |chunk|
Contact.upsert_all(chunk, unique_by: :email)
# inserts new records, updates existing ones — all in one query per chunk
end
That's It - Enjoy! ✨
Install the gem, change one line, check the three behavior differences (numeric conversion, empty value removal, plain Hash vs CSV::Row), and you're done.
The rest — batch processing, bad row handling, value converters, column selection — is there when you need it.
gem 'smarter_csv'
- GitHub: github.com/tilo/smarter_csv
- Docs: Full documentation
- RubyGems: rubygems.org/gems/smarter_csv

Top comments (0)