DEV Community: Corey Martin

Building a data pipeline with Ruby on Rails

Corey Martin — Mon, 16 Nov 2020 15:12:35 +0000

If you need to bring a lot of data into your app, a data pipeline can help. This article describes how to build a simple data pipeline with Ruby on Rails, Sidekiq, Redis, and Postgres.

What's a data pipeline?

A data pipeline takes information from one place and puts it somewhere else. It likely transforms that data along the way. In enterprise circles, this is called ETL, or extract-transform-load.

Generally, data pipelines:

Ingest a lot of data at once
Ingest new or updated data on a schedule

Splitting data into small jobs

A reliable pipeline starts big and reduces data into small slices that can be independently processed. Bigger jobs "fan out" into small ones. The smallest jobs, handling one record each, do most of the work.

Breaking work into smaller jobs helps handle failure. If the pipeline ingests 1,000,000 records, chances are that a few of them are invalid. If those invalid records are processed in their own jobs, they can fail without affecting other records.

Lobby Focus, a project of mine, ingests United States lobbying data from the U.S. Senate. There are about 1,000,000 lobbying records from 2010-2020. Processing them all in a single job would be risky, as the entire job would fail if even one record triggered an error. Instead, Lobby Focus breaks down data into small slices and ends up processing each record in its own job. To bring in 1,000,000 lobbying records, it takes 1,001,041 jobs.
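
The 1,001,041 figure is the sum of the jobs at each of the four stages walked through later in this article (scrape, download, process, load). As a quick check of the arithmetic, here is a small sketch that uses those stage counts; the constant and method names are made up for illustration.

```ruby
# Job counts per stage, taken from the pipeline walkthrough in this article.
JOBS_PER_STAGE = {
  scrape: 1,        # one job lists the ZIP files
  download: 40,     # one job per ZIP file
  process: 1_000,   # one job per XML file
  load: 1_000_000   # one job per lobbying record
}.freeze

# Total number of jobs needed across every stage.
def total_jobs(stages = JOBS_PER_STAGE)
  stages.values.sum
end

puts "Total jobs: #{total_jobs}"
puts "Jobs beyond one-per-record: #{total_jobs - JOBS_PER_STAGE[:load]}"
```

The fan-out stages add only 1,041 jobs on top of the one-job-per-record work, so the overhead of splitting is small.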

Pipelines can handle a ton of jobs with ease. Splitting up work into many small jobs offers advantages:

Most jobs deal with only one record. If a job fails, it's easy to pin down the problematic record, fix it, and retry it. The rest of the pipeline continues uninterrupted.
The pipeline can ingest data in parallel, with many small jobs running at once.
You can track progress by looking at how many jobs succeeded and how many remain.
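
To make the first point concrete, here is a minimal sketch in plain Ruby, not Sidekiq itself. It treats each record as its own unit of work, so one bad value is recorded as a failure while the others complete. The method names and sample values are assumptions for illustration; with Sidekiq, the failing job would raise and be retried on its own rather than being collected in a hash.

```ruby
# Stand-in for the per-record work done by the smallest jobs.
# Integer() raises ArgumentError on input it cannot convert.
def process_record(raw)
  Integer(raw)
end

# Runs each record as its own unit of work and collects the outcome,
# so a failure is recorded against one record without stopping the rest.
def run_batch(records)
  results = { succeeded: [], failed: [] }
  records.each do |raw|
    begin
      results[:succeeded] << process_record(raw)
    rescue ArgumentError => e
      results[:failed] << { record: raw, error: e.message }
    end
  end
  results
end

results = run_batch(["12", "7", "n/a", "30"])
puts "Succeeded: #{results[:succeeded].length}, failed: #{results[:failed].length}"
results[:failed].each { |f| puts "Retry candidate: #{f[:record].inspect}" }
```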

A Ruby on Rails pipeline stack

This stack, used in the example below, makes for a solid data pipeline:

Ruby on Rails interfaces with all the pieces below and has a strong ecosystem of libraries.
Sidekiq manages data pipeline jobs, providing visibility into progress and errors.
Redis backs Sidekiq and stores the job queue.
Postgres stores intermediate and final data.
Amazon S3 serves as a staging area for source data.

Managed platforms like Heroku or Cloud 66 are good hosting options for this stack.

Real-world example: U.S. Senate lobbying data

Lobby Focus ingests lobbying data from the U.S. Senate and makes it accessible to users. Here's a peek into its data pipeline, which breaks data into small jobs and uses the stack above.

Step 1: Scrape

1 Sidekiq job deals with 1,000,000 records

Parameter: None

Gathers a list of ZIP files from the U.S. Senate, using HTTParty to retrieve the webpage and Nokogiri to parse it
Launches a download job for each ZIP file. To ingest lobbying data from 2010 to 2020, it will launch 40 download jobs for 40 ZIPs.
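
The fan-out in this step can be sketched as follows. This is not the Lobby Focus code: it swaps Sidekiq for an in-memory array so it runs without Redis, it passes the ZIP links in directly instead of scraping them, and the URLs are hypothetical. In a real app the classes would include Sidekiq::Job (Sidekiq::Worker in older versions), and perform_async would push the job to Redis.

```ruby
# In-memory stand-in for the Sidekiq queue, so the sketch runs without Redis.
QUEUE = []

class DownloadJob
  # Mimics the shape of Sidekiq's perform_async: record the job class
  # and its small argument instead of running the work right away.
  def self.perform_async(zip_url)
    QUEUE << [name, zip_url]
  end
end

class ScrapeJob
  # The real job would fetch the Senate page with HTTParty and extract
  # the ZIP links with Nokogiri; here the links are passed in directly.
  def perform(zip_urls)
    zip_urls.each { |url| DownloadJob.perform_async(url) }
  end
end

# Hypothetical URLs for illustration only.
zip_urls = %w[
  https://example.com/lobbying/2020_1.zip
  https://example.com/lobbying/2020_2.zip
  https://example.com/lobbying/2020_3.zip
]

ScrapeJob.new.perform(zip_urls)
puts "Queued #{QUEUE.length} download jobs"
```

Each queued entry carries only a URL, which keeps the job arguments small.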

Step 2: Download

40 Sidekiq jobs deal with 25,000 records each

Parameter: ZIP file URL

Checks the Last-Modified HTTP header to see whether the current version was already processed. If so, aborts the job.
Downloads the ZIP and unzips its contents into Amazon S3, using rubyzip and aws-sdk-s3.
Launches a process job for each XML file in the ZIP.
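
The Last-Modified check in the first bullet might look something like the sketch below. The article does not say how Lobby Focus remembers which version it processed, so the last_processed_at value (for example, a timestamp stored per ZIP URL) and the method name are assumptions.

```ruby
require "time"

# Returns true when the ZIP's Last-Modified time is no newer than the
# version that was already processed, meaning the job can stop early.
# last_processed_at is nil when the ZIP has never been processed.
def already_processed?(last_modified_header, last_processed_at)
  return false if last_processed_at.nil?

  Time.httpdate(last_modified_header) <= last_processed_at
end

header = "Mon, 16 Nov 2020 15:12:35 GMT" # example header value

puts already_processed?(header, nil)                                # never seen
puts already_processed?(header, Time.utc(2020, 11, 16, 15, 12, 35)) # same version
puts already_processed?(header, Time.utc(2020, 11, 1))              # file is newer
```

Parsing the header with Time.httpdate is strict, so a malformed header raises an error instead of silently skipping a download.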

Step 3: Process

1,000 Sidekiq jobs deal with 1,000 records each

Parameter: XML file URL (S3)

Parses the XML with Nokogiri
Adds the content of each record in the XML file to Postgres, as a raw XML string.
Launches a load job for each record in the XML file, passing the Postgres record ID.

Postgres can bear a high volume of record inserts. The pipeline uses it as an intermediate data store. The next job, load, receives the Postgres record ID and accesses it to get the XML string. Using Postgres to pass the XML avoids passing giant parameters between jobs. Sidekiq recommends small job parameters to keep the queue lean and preserve Redis memory.
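
To see why passing an ID matters, the sketch below compares the size of a job's arguments in both cases. Sidekiq serializes job arguments as JSON before storing them in Redis, so the JSON byte size is a reasonable proxy. The hash stands in for the Postgres table, and the record contents and method names are hypothetical.

```ruby
require "json"

# In-memory stand-in for the Postgres table that holds raw XML strings.
RAW_RECORDS = {}

# Stores the raw XML and returns the ID that would be handed to the load job.
def store_raw_record(xml)
  id = RAW_RECORDS.length + 1
  RAW_RECORDS[id] = xml
  id
end

# Size in bytes of a job's arguments once serialized as JSON.
def args_bytesize(*args)
  JSON.generate(args).bytesize
end

# Hypothetical record, padded to resemble a sizeable filing.
xml = "<filing><registrant>Example LLC</registrant>#{'<issue>text</issue>' * 200}</filing>"
record_id = store_raw_record(xml)

puts "Passing the XML: #{args_bytesize(xml)} bytes per job"
puts "Passing the ID:  #{args_bytesize(record_id)} bytes per job"
```

Multiplied across 1,000,000 queued load jobs, the difference between a few bytes and a few kilobytes per job adds up quickly in Redis.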

Step 4: Load

1,000,000 Sidekiq jobs deal with 1 record each

Parameter: Postgres record ID

Retrieves the XML string from Postgres and parses it with Nokogiri
Validates data and aborts the job with an error if there is an issue
Upserts final data record into Postgres

Staying resilient

Breaking out data loads into small jobs is key to a resilient pipeline. Here are some more tips:

Make jobs idempotent

Any job should be able to run twice without duplicating data. In other words, jobs should be idempotent.

Look for a unique key in the source data and use it to avoid writing duplicate data.

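For instance, you can set a Postgres unique index on the key. Then, use Rails' create_or_find_by method to create a record or fetch the existing one, depending on whether the key already exists. To update the existing record instead, Rails 6 and later also provide an upsert method.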
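
Here is a minimal sketch of the idempotency idea. It uses a hash keyed by the unique key as a stand-in for a Postgres table with a unique index, so it runs without a database; in a Rails app, the same effect would come from the unique index together with a method such as upsert. The filing_id key and the other field names are hypothetical.

```ruby
# In-memory stand-in for a Postgres table with a unique index on filing_id.
# filing_id is a hypothetical name for the unique key in the source data.
FILINGS = {}

# Writes the record under its unique key: inserts on first sight,
# overwrites on a repeat, and never creates a second row.
def upsert_filing(attrs)
  FILINGS[attrs.fetch(:filing_id)] = attrs
end

record = { filing_id: "abc-123", client: "Example LLC", amount: 50_000 }

upsert_filing(record)
upsert_filing(record)                       # the same job runs twice
upsert_filing(record.merge(amount: 60_000)) # the source data was amended

puts "Rows stored: #{FILINGS.length}"
puts "Amount on file: #{FILINGS['abc-123'][:amount]}"
```

Running the job any number of times leaves exactly one row per key, which is what makes retries safe.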

Use constraints to reject invalid data

Set constraints to stop invalid data from coming in. If a field should always have a value, use a Postgres NOT NULL constraint to require that field. Use Ruby's strict conversion methods, such as Integer(), to raise an error and abort a job if you're expecting one data type (say, an integer like 3) but get another (say, a string like "n/a").
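
The difference between strict and lenient conversion is easy to see in plain Ruby. In this sketch, parse_amount is a hypothetical helper name, and passing base 10 is a choice made here so that a value like "010" is read as ten rather than as an octal number.

```ruby
# Strict conversion: Integer() raises instead of guessing.
def parse_amount(raw)
  Integer(raw, 10)
end

puts parse_amount("3")
puts "n/a".to_i # lenient conversion silently turns bad input into 0

begin
  parse_amount("n/a")
rescue ArgumentError => e
  # Inside a job, letting this error propagate would fail the job
  # and flag the record for inspection.
  puts "Rejected: #{e.message}"
end
```

With the lenient to_i, an unexpected "n/a" would be stored as 0 and look like real data; the strict version surfaces the problem at the record that caused it.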

Data is never perfect. If you catch invalid data errors early, you can adjust your pipeline to handle unexpected values.

What to read next

A job management library like Sidekiq is critical to a data pipeline. Read more about Sidekiq, the backbone of this example. Its best practices docs are small and useful. If you're not using Ruby, find the top job management libraries for your language.

To practice creating a pipeline, find a public dataset that interests you and build a pipeline for it. The United States Government, for instance, lists its open government data on data.gov. Datasets range from physician comparison data to a database of every dam in the U.S.

Finally, check out managed solutions like Google Dataflow for particularly large pipelines.

Have questions or want advice? Send me a tweet @coreybeep

Originally posted on my blog

Photo by Victor Garcia on Unsplash

Find your motivator, and use it to guide your career

Corey Martin — Thu, 02 Apr 2020 14:08:09 +0000

We can make better decisions in our careers by figuring out what motivates us. This post will help you find your motivator, a phrase that sums up what drives you. Use it to focus your work, look for new opportunities, and inspire your writing.

This process helped me figure out what I want in my career, and I hope it's helpful to you!

Step 1: List your meaningful experiences

Finding your motivator starts with reflecting on your past experiences. To do this, use a piece of paper or your editing tool of choice.

Make a list of every job you’ve had, including day jobs, side projects, school projects, and volunteer work.
Under each job, list the experiences that were meaningful to you. Include what you did and the impact.

Here are some prompts you can use to think of meaningful experiences:

What did you get a rush out of working on?
What left a positive imprint on your mind?
What do you wish you could do again?

Meaning isn’t about what impressed others or furthered your career. It’s about what meant the most to you.

This was one of my meaningful experiences:

Revamped the Human Rights Report website, helping advocates do their work

Step 2: Find your threads

Identify common threads among your experiences. This is the start of turning your memories into your motivator.

Look for meaningful experiences that are similar, and group them together
For each group, write a brief phrase (the thread) that sums up the group

Some threads reflect what you did:

Build workflow tools

Other threads reflect the impact you had:

Help people do their work better

Come up with as many threads as you can, using most of the meaningful experiences you wrote.

I had three threads:

Helping people express their knowledge and impact other people with it

Helping people produce and share their work

Building repeatable tools that help people work

Although I’m a software developer, none of my threads are very technical. Try not to make assumptions about what you should find meaningful. Stay true to what resonates with you.

Step 3: Rank your threads

By this point, you’ll have at least a few threads. Now you'll rank them.

Order your threads, starting with the most meaningful

When deciding on the order, think about seeing a job with the thread in its job description. Would you want to take the job? Would it bring you meaning?

Step 4: Find your motivator

Combine your ranked threads into a single phrase, your motivator.

Find a single phrase that captures your threads, particularly the top-ranked ones. This is your motivator.
Write down your motivator, share it if you’d like, and put it to work!

Threads are the building blocks for your motivator. Your motivator takes the gist of your threads and sums it up in a few words.

Your motivator should be:

No more than one sentence
Easy to understand
Something that brings you happiness
Something you want to do in the future
Something that you can take with you as you grow, not specific to one stage in your career
Something you can talk about with others or write about

My threads deal with helping people produce and share. My motivator should capture that I like enabling people, and cover what I’m enabling them to do.

So, I chose my motivator: helping people create.

Step 5: Put your motivator to work

Your motivator may be short, but it’s powerful because of the work you put in to find it.

Use your motivator to help:

Advance in your current job. Talk with your manager, and find ways to focus on projects that align with your motivator. Find work that aligns with your motivator and helps your team succeed.
Filter job opportunities. Look for companies and jobs that focus on your motivator. Going into a new job knowing it aligns to what motivates you will help you hit the ground running.
Write. It’s easier to write about what motivates you, so draft some blog posts! Your passion for the topic will shine through.
Interview better. Most job interviews are a series of stories about ourselves. Find stories that align to your motivator. When you tell stories that excite you, you'll come across as authentic and driven.

Your motivator gets to the core of what drives you. The more you can align to it, the more meaning you’ll derive from your everyday work.

Share your motivator in the comments to inspire others!