<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Corey Martin</title>
    <description>The latest articles on DEV Community by Corey Martin (@csm).</description>
    <link>https://dev.to/csm</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F694%2F9eeea7c0-9230-498e-8f33-31aa3498da8c.png</url>
      <title>DEV Community: Corey Martin</title>
      <link>https://dev.to/csm</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/csm"/>
    <language>en</language>
    <item>
      <title>Building a data pipeline with Ruby on Rails</title>
      <dc:creator>Corey Martin</dc:creator>
      <pubDate>Mon, 16 Nov 2020 15:12:35 +0000</pubDate>
      <link>https://dev.to/csm/building-a-data-pipeline-with-ruby-on-rails-4dbh</link>
      <guid>https://dev.to/csm/building-a-data-pipeline-with-ruby-on-rails-4dbh</guid>
      <description>&lt;p&gt;If you need to bring a lot of data into your app, a data pipeline can help. This article describes how to build a simple data pipeline with Ruby on Rails, Sidekiq, Redis, and Postgres.&lt;/p&gt;

&lt;h2&gt;
  
  
  What's a data pipeline?
&lt;/h2&gt;

&lt;p&gt;A data pipeline takes information from one place and puts it somewhere else. It likely transforms that data along the way. In enterprise circles, this is called &lt;a href="https://en.wikipedia.org/wiki/Extract,_transform,_load"&gt;ETL&lt;/a&gt;, or extract-transform-load.&lt;/p&gt;

&lt;p&gt;Generally, data pipelines:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; Ingest a lot of data at once&lt;/li&gt;
&lt;li&gt; Ingest new or updated data on a schedule&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  Splitting data into small jobs
&lt;/h2&gt;

&lt;p&gt;A reliable pipeline starts big and reduces data into small slices that can be independently processed. Bigger jobs "fan out" into small ones. The smallest jobs, handling one record each, do most of the work.&lt;/p&gt;

&lt;p&gt;Breaking work into smaller jobs helps handle failure. If the pipeline ingests 1,000,000 records, chances are that a few of them are invalid. If those invalid records are processed in their own jobs, then they can fail without impacting other records.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.lobbyfocus.com/"&gt;Lobby Focus&lt;/a&gt;, a project of mine, ingests &lt;a href="https://www.senate.gov/legislative/Public_Disclosure/database_download.htm"&gt;United States lobbying data from the U.S. Senate&lt;/a&gt;. There are about 1,000,000 lobbying records from 2010-2020. Processing them all in a single job would be risky, as the entire job would fail if even one record triggered an error. Instead, Lobby Focus breaks down data into small slices and ends up processing each record in its own job. To bring in 1,000,000 lobbying records, it takes 1,001,041 jobs.&lt;/p&gt;

&lt;p&gt;Pipelines can handle a ton of jobs with ease. Splitting up work into many small jobs offers advantages:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; Most jobs deal with only one record. If a job fails, it's easy to pin down the problematic record, fix it, and retry it. The rest of the pipeline continues uninterrupted.&lt;/li&gt;
&lt;li&gt; The pipeline can ingest data in parallel, with many small jobs running at once.&lt;/li&gt;
&lt;li&gt; You can track progress by looking at how many jobs succeeded and how many remain.&lt;/li&gt;
&lt;/ol&gt;
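
&lt;p&gt;Here is a minimal plain-Ruby sketch of the first advantage. It does not use Sidekiq, and the names process_record and run_all, along with the sample records, are hypothetical. Each record is handled on its own, so one invalid record fails without stopping the others.&lt;/p&gt;

```ruby
# Plain-Ruby sketch; no Sidekiq involved. All names and data are hypothetical.

# Handles exactly one record and raises if the record is invalid.
def process_record(record)
  raise ArgumentError, "record #{record[:id]} has no client" if record[:client].to_s.empty?
  record[:client].strip
end

# Runs every record independently and tracks which ones succeeded or failed.
def run_all(records)
  outcome = { succeeded: [], failed: [] }
  records.each do |record|
    begin
      process_record(record)
      outcome[:succeeded].push(record[:id])
    rescue ArgumentError
      outcome[:failed].push(record[:id])
    end
  end
  outcome
end

records = [
  { id: 1, client: "Example Corp" },
  { id: 2, client: "" },
  { id: 3, client: "Sample LLC" }
]
result = run_all(records)
puts "succeeded: #{result[:succeeded].join(', ')}"
puts "failed: #{result[:failed].join(', ')}"
```

&lt;p&gt;In a real pipeline, the job library would take the place of run_all: it would run each record's job separately and keep the failed ones around for inspection and retry.&lt;/p&gt;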

&lt;h2&gt;
  
  
  A Ruby on Rails pipeline stack
&lt;/h2&gt;

&lt;p&gt;This stack, used in the example below, makes for a solid data pipeline:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;strong&gt;&lt;a href="https://rubyonrails.org/"&gt;Ruby on Rails&lt;/a&gt;&lt;/strong&gt; interfaces with all the pieces below and has a strong ecosystem of libraries.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;&lt;a href="https://github.com/mperham/sidekiq"&gt;Sidekiq&lt;/a&gt;&lt;/strong&gt; manages data pipeline jobs, providing visibility into progress and errors.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;&lt;a href="https://redis.io/"&gt;Redis&lt;/a&gt;&lt;/strong&gt; backs Sidekiq and stores the job queue.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;&lt;a href="https://www.postgresql.org/"&gt;Postgres&lt;/a&gt;&lt;/strong&gt; stores intermediate and final data.&lt;/li&gt;
&lt;li&gt;  &lt;strong&gt;&lt;a href="https://aws.amazon.com/s3/"&gt;Amazon S3&lt;/a&gt;&lt;/strong&gt; serves as a staging area for source data.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Managed platforms like &lt;a href="http://heroku.com/"&gt;Heroku&lt;/a&gt; or &lt;a href="https://www.cloud66.com/"&gt;Cloud 66&lt;/a&gt; are good hosting options for this stack.&lt;/p&gt;

&lt;h2&gt;
  
  
  Real-world example: U.S. Senate lobbying data
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://lobbyfocus.com"&gt;Lobby Focus&lt;/a&gt; ingests lobbying data from the U.S. Senate and makes it accessible to users. Here's a peek into its data pipeline, which breaks data into small jobs and uses the stack above.&lt;/p&gt;

&lt;h3&gt;
  
  
  Step 1: Scrape
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;1 Sidekiq job&lt;/strong&gt; deals with &lt;strong&gt;1,000,000 records&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Parameter:&lt;/strong&gt; None&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  Gathers a list of ZIP files from the U.S. Senate, using &lt;a href="https://github.com/jnunemaker/httparty"&gt;HTTParty&lt;/a&gt; to retrieve the webpage and &lt;a href="https://github.com/sparklemotion/nokogiri"&gt;Nokogiri&lt;/a&gt; to parse it&lt;/li&gt;
&lt;li&gt;  Launches a download job for each ZIP file. To ingest lobbying data from 2010 to 2020, it will launch 40 &lt;strong&gt;download&lt;/strong&gt; jobs for 40 ZIP files.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Step 2: Download
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;40 Sidekiq jobs&lt;/strong&gt; deal with &lt;strong&gt;25,000 records&lt;/strong&gt; &lt;strong&gt;each&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Parameter:&lt;/strong&gt; ZIP file URL&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  Checks the &lt;a href="https://developer.mozilla.org/en-US/docs/Web/HTTP/Headers/Last-Modified"&gt;last-modified&lt;/a&gt; header to see if the current version was already processed. If so, aborts the job.&lt;/li&gt;
&lt;li&gt;  Downloads the ZIP and unzips its contents into Amazon S3, using &lt;a href="https://github.com/rubyzip/rubyzip"&gt;rubyzip&lt;/a&gt; and &lt;a href="https://github.com/aws/aws-sdk-ruby"&gt;aws-sdk-s3&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;  Launches a &lt;strong&gt;process&lt;/strong&gt; job for each XML file in the ZIP.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Step 3: Process
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;1,000 Sidekiq jobs&lt;/strong&gt; deal with &lt;strong&gt;1,000 records each&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Parameter:&lt;/strong&gt; XML file URL (S3)&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  Parses the XML with &lt;a href="https://nokogiri.org/tutorials/parsing_an_html_xml_document.html"&gt;Nokogiri&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;  Adds the content of each record in the XML file to Postgres, as a raw XML string.&lt;/li&gt;
&lt;li&gt;  Launches a &lt;strong&gt;load&lt;/strong&gt; job for each record in the XML file, passing the Postgres record ID.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Postgres can bear a high volume of record inserts. The pipeline uses it as an intermediate data store. The next job, &lt;strong&gt;load&lt;/strong&gt;, receives the Postgres record ID and accesses it to get the XML string. Using Postgres to pass the XML avoids passing giant parameters between jobs. Sidekiq &lt;a href="https://github.com/mperham/sidekiq/wiki/Best-Practices"&gt;recommends small job parameters&lt;/a&gt; to keep the queue lean and preserve Redis memory.&lt;/p&gt;
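
&lt;p&gt;The idea of passing an ID instead of the payload can be sketched in plain Ruby. This is only an illustration under stated assumptions: RAW_RECORDS is a hypothetical in-memory stand-in for the Postgres table, QUEUE stands in for the Redis-backed queue, and stage_record and perform_load are made-up names.&lt;/p&gt;

```ruby
# Hypothetical in-memory stand-ins, for illustration only.
RAW_RECORDS = {} # plays the role of the Postgres table holding raw XML strings
QUEUE = []       # plays the role of the job queue

# Stores the raw payload once and enqueues a job that carries only the ID.
def stage_record(raw_xml)
  id = RAW_RECORDS.size + 1
  RAW_RECORDS[id] = raw_xml
  QUEUE.push({ job: "load", record_id: id })
  id
end

# The load step looks the payload up by ID instead of receiving it as an argument.
def perform_load(job)
  RAW_RECORDS.fetch(job[:record_id])
end

stage_record("raw XML for filing A (could be many kilobytes)")
stage_record("raw XML for filing B")

job = QUEUE.shift
payload = perform_load(job)
puts "job carries only record_id #{job[:record_id]}"
puts "payload length: #{payload.length}"
```

&lt;p&gt;The queue entry stays the same small size no matter how large the stored XML is.&lt;/p&gt;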

&lt;h3&gt;
  
  
  Step 4: Load
&lt;/h3&gt;

&lt;p&gt;&lt;strong&gt;1,000,000 Sidekiq jobs&lt;/strong&gt; deal with &lt;strong&gt;1 record each&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Parameter:&lt;/strong&gt; Postgres record ID&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  Retrieves the XML string from Postgres and parses it with &lt;a href="https://nokogiri.org/tutorials/parsing_an_html_xml_document.html"&gt;Nokogiri&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;  Validates data and aborts the job with an error if there is an issue&lt;/li&gt;
&lt;li&gt;  Upserts final data record into Postgres&lt;/li&gt;
&lt;/ul&gt;
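
&lt;p&gt;As a quick check, the job counts from the four steps add up to the 1,001,041 jobs mentioned earlier. The variable names below are only for illustration.&lt;/p&gt;

```ruby
# Job counts per step, taken from the step descriptions above.
jobs_per_step = { scrape: 1, download: 40, process: 1_000, load: 1_000_000 }
total_jobs = jobs_per_step.values.sum

# Each fan-out level still covers the full set of about 1,000,000 records.
records_via_download = 40 * 25_000
records_via_process = 1_000 * 1_000

puts total_jobs            # prints 1001041
puts records_via_download  # prints 1000000
puts records_via_process   # prints 1000000
```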

&lt;h2&gt;
  
  
  Staying resilient
&lt;/h2&gt;

&lt;p&gt;Breaking out data loads into small jobs is key to a resilient pipeline. Here are some more tips:&lt;/p&gt;

&lt;h3&gt;
  
  
  Make jobs idempotent
&lt;/h3&gt;

&lt;p&gt;Any job should be able to run twice without duplicating data. In other words, jobs should be idempotent.&lt;/p&gt;

&lt;p&gt;Look for a unique key in the source data and use it to avoid writing duplicate data.&lt;/p&gt;

&lt;p&gt;For instance, you can set a Postgres unique index on the key. Then, use Rails' &lt;a href="https://blog.bigbinary.com/2019/03/25/rails-6-adds-create_or_find_by.html"&gt;create_or_find_by&lt;/a&gt; method to create a record, or to fetch the existing record when the key already exists so you can update it.&lt;/p&gt;
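
&lt;p&gt;To show what idempotent means in practice, here is a rough plain-Ruby sketch. It does not use Rails or Postgres: the FILINGS hash is a hypothetical stand-in for a table with a unique index, and filing_id and load_filing are made-up names.&lt;/p&gt;

```ruby
# FILINGS stands in for a table with a unique index on :filing_id (hypothetical).
FILINGS = {}

# Writes a record keyed by its unique key: inserts the first time,
# updates the existing entry on every later run.
def load_filing(attrs)
  key = attrs.fetch(:filing_id)
  FILINGS[key] = (FILINGS[key] || {}).merge(attrs)
end

# Running the same job twice leaves exactly one record behind.
2.times { load_filing({ filing_id: "abc-123", client: "Example Corp" }) }
puts FILINGS.size # prints 1
```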

&lt;h3&gt;
  
  
  Use constraints to reject invalid data
&lt;/h3&gt;

&lt;p&gt;Set constraints to stop invalid data from coming in. If a field should always have a value, use a &lt;a href="https://www.postgresql.org/docs/9.4/ddl-constraints.html"&gt;Postgres constraint&lt;/a&gt; to require that field. Use Ruby's strict &lt;a href="https://avdi.codes/a-ruby-conversion-idiom/"&gt;conversion methods&lt;/a&gt; to throw an error and abort a job if you're expecting one data type (say, an integer like 3) but get another (say, a string like "n/a").&lt;/p&gt;
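
&lt;p&gt;A small sketch of the difference between lenient and strict conversion in Ruby. The variable names are only for illustration; passing base 10 to Integer is an extra choice made here so that leading zeros are not read as octal.&lt;/p&gt;

```ruby
# Lenient conversion silently turns a bad value into 0.
lenient = "n/a".to_i

# Strict conversion returns the integer for valid input...
strict = Integer("3", 10)

# ...and raises ArgumentError for invalid input, which would abort the job.
rejected = false
begin
  Integer("n/a", 10)
rescue ArgumentError
  rejected = true
end

puts lenient   # prints 0
puts strict    # prints 3
puts rejected  # prints true
```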

&lt;p&gt;Data is never perfect. If you catch invalid data errors early, you can adjust your pipeline to handle unexpected values.&lt;/p&gt;

&lt;h2&gt;
  
  
  What to read next
&lt;/h2&gt;

&lt;p&gt;A job management library like Sidekiq is critical to a data pipeline. Read more about &lt;a href="https://github.com/mperham/sidekiq"&gt;Sidekiq&lt;/a&gt;, the backbone of this example. Its &lt;a href="https://github.com/mperham/sidekiq/wiki/Best-Practices"&gt;best practices docs&lt;/a&gt; are small and useful. If you're not using Ruby, find the top job management libraries for your language.&lt;/p&gt;

&lt;p&gt;To practice creating a pipeline, find a public dataset that interests you and build a pipeline for it. The United States Government, for instance, lists its open government data on &lt;a href="https://data.gov"&gt;data.gov&lt;/a&gt;. Datasets range from &lt;a href="https://data.medicare.gov/data/physician-compare"&gt;physician comparison data&lt;/a&gt; to a &lt;a href="https://nid.sec.usace.army.mil/ords/f?p=105:19:7020640354040::NO:::"&gt;database of every dam in the U.S&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Finally, check out managed solutions like &lt;a href="https://cloud.google.com/dataflow"&gt;Google Dataflow&lt;/a&gt; for particularly large pipelines.&lt;/p&gt;

&lt;p&gt;Have questions or want advice? Send me a tweet @coreybeep&lt;/p&gt;




&lt;p&gt;Originally posted on &lt;a href="https://coreym.info/data-pipelines-in-ruby-on-rails/"&gt;my blog&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Photo by &lt;a href="https://unsplash.com/@victor_g?utm_source=unsplash&amp;amp;utm_medium=referral&amp;amp;utm_content=creditCopyText"&gt;Victor Garcia&lt;/a&gt; on &lt;a href="https://unsplash.com/s/photos/pipeline?utm_source=unsplash&amp;amp;utm_medium=referral&amp;amp;utm_content=creditCopyText"&gt;Unsplash&lt;/a&gt;&lt;/p&gt;

</description>
      <category>ruby</category>
      <category>rails</category>
      <category>database</category>
      <category>architecture</category>
    </item>
    <item>
      <title>Find your motivator, and use it to guide your career</title>
      <dc:creator>Corey Martin</dc:creator>
      <pubDate>Thu, 02 Apr 2020 14:08:09 +0000</pubDate>
      <link>https://dev.to/csm/find-your-motivator-and-use-it-to-guide-your-career-5chp</link>
      <guid>https://dev.to/csm/find-your-motivator-and-use-it-to-guide-your-career-5chp</guid>
      <description>&lt;p&gt;We can make better decisions in our careers by figuring out what motivates us. This post will help you find your &lt;strong&gt;motivator&lt;/strong&gt;, a phrase that sums up what drives you. Use it to focus your work, look for new opportunities, and inspire your writing.&lt;/p&gt;

&lt;p&gt;This process helped me figure out what I want in my career, and I hope it's helpful to you!&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 1: List your meaningful experiences
&lt;/h2&gt;

&lt;p&gt;Finding your motivator starts with reflecting on your past experiences. To do this, use a piece of paper or your editing tool of choice.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Make a list of every job you’ve had, including day jobs, side projects, school projects, and volunteer work.&lt;/li&gt;
&lt;li&gt;Under each job, list the experiences that were &lt;strong&gt;meaningful&lt;/strong&gt; to you. Include &lt;strong&gt;what you did&lt;/strong&gt; and the &lt;strong&gt;impact&lt;/strong&gt;.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Here are some prompts you can use to think of meaningful experiences:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;What did you get a rush out of working on?&lt;/li&gt;
&lt;li&gt;What left a positive imprint on your mind?&lt;/li&gt;
&lt;li&gt;What do you wish you could do again?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Meaning isn’t about what impressed others or furthered your career. It’s about what meant the most to &lt;strong&gt;you&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;This was one of my meaningful experiences:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Revamped the Human Rights Report website, helping advocates do their work&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Step 2: Find your threads
&lt;/h2&gt;

&lt;p&gt;Identify common &lt;strong&gt;threads&lt;/strong&gt; among your experiences. This is the start of turning your memories into your motivator.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Look for meaningful experiences that are similar, and group them together&lt;/li&gt;
&lt;li&gt;For each group, write a brief phrase (the &lt;strong&gt;thread&lt;/strong&gt;) that sums up the group&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Some threads reflect &lt;strong&gt;what you did&lt;/strong&gt;:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Build workflow tools&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Other threads reflect &lt;strong&gt;the impact you had&lt;/strong&gt;:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Help people do their work better&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Come up with as many threads as you can, using most of the meaningful experiences you wrote.&lt;/p&gt;

&lt;p&gt;I had three threads:&lt;/p&gt;

&lt;blockquote&gt;
&lt;ol&gt;
&lt;li&gt;Helping people express their knowledge and impact other people with it&lt;/li&gt;
&lt;li&gt;Helping people produce and share their work&lt;/li&gt;
&lt;li&gt;Building repeatable tools that help people work&lt;/li&gt;
&lt;/ol&gt;
&lt;/blockquote&gt;

&lt;p&gt;Although I’m a software developer, none of my threads are very technical. Try not to make assumptions about what you &lt;em&gt;should&lt;/em&gt; find meaningful. Stay true to what resonates with you.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 3: Rank your threads
&lt;/h2&gt;

&lt;p&gt;By this point, you’ll have at least a few threads. Now you'll rank them.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Order your threads, starting with the most meaningful&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;When deciding on the order, imagine seeing each thread in a job description. Would you want to take the job? Would it bring you meaning?&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 4: Find your motivator
&lt;/h2&gt;

&lt;p&gt;Combine your ranked threads into a single phrase, your motivator.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Find a single phrase that captures your threads, particularly the top-ranked ones. This is your &lt;strong&gt;motivator&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Write down your motivator, share it if you’d like, and put it to work!&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Threads are the building blocks for your motivator. Your motivator takes the gist of your threads and sums it up in a few words.&lt;/p&gt;

&lt;p&gt;Your motivator should be:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;No more than one sentence&lt;/li&gt;
&lt;li&gt;Easy to understand&lt;/li&gt;
&lt;li&gt;Something that brings you happiness&lt;/li&gt;
&lt;li&gt;Something you want to do in the future&lt;/li&gt;
&lt;li&gt;Something that you can take with you as you grow, not specific to one stage in your career&lt;/li&gt;
&lt;li&gt;Something you can talk about with others or write about&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;My threads deal with helping people produce and share. My motivator should capture that I like enabling people, and cover what I’m enabling them to do.&lt;/p&gt;

&lt;p&gt;So, I chose my motivator: &lt;strong&gt;helping people create&lt;/strong&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 5: Put your motivator to work
&lt;/h2&gt;

&lt;p&gt;Your motivator may be short, but it’s powerful because of the work you put in to find it.&lt;/p&gt;

&lt;p&gt;Use your motivator to help:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Advance in your current job&lt;/strong&gt;. Talk with your manager, and find ways to focus on projects that align with your motivator and help your team succeed.
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Filter job opportunities&lt;/strong&gt;. Look for companies and jobs that focus on your motivator. Going into a new job knowing it aligns to what motivates you will help you hit the ground running.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Write&lt;/strong&gt;. It’s easier to write about what motivates you, so draft some blog posts! Your passion for the topic will shine through.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Interview better&lt;/strong&gt;. Most job interviews are a series of stories about ourselves. Find stories that align to your motivator. When you tell stories that excite you, you'll come across as authentic and driven.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Your motivator gets to the core of what drives you. The more you can align to it, the more meaning you’ll derive from your everyday work.&lt;/p&gt;

&lt;p&gt;Share your motivator in the comments to inspire others!&lt;/p&gt;

</description>
      <category>career</category>
      <category>productivity</category>
      <category>leadership</category>
      <category>growth</category>
    </item>
  </channel>
</rss>
