<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Josie Hall</title>
    <description>The latest articles on DEV Community by Josie Hall (@josiehall).</description>
    <link>https://dev.to/josiehall</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F246997%2Fe706bbae-0bff-4527-a960-69db4fc4d615.png</url>
      <title>DEV Community: Josie Hall</title>
      <link>https://dev.to/josiehall</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/josiehall"/>
    <language>en</language>
    <item>
      <title>Import data from S3 to Redshift in minutes using Dataform</title>
      <dc:creator>Josie Hall</dc:creator>
      <pubDate>Thu, 10 Oct 2019 15:52:10 +0000</pubDate>
      <link>https://dev.to/josiehall/import-data-from-s3-to-redshift-in-minutes-using-dataform-55g2</link>
      <guid>https://dev.to/josiehall/import-data-from-s3-to-redshift-in-minutes-using-dataform-55g2</guid>
      <description>&lt;p&gt;&lt;a href="https://dataform.co?utm_source=s3blog"&gt;Dataform&lt;/a&gt; is a powerful tool for managing data transformations in your warehouse. With Dataform you can automatically manage dependencies, schedule queries and easily adopt engineering best practices with built in version control. Currently Dataform integrates with Google BigQuery, Amazon Redshift, Snowflake and Azure Data Warehouse. However, often the “root” of your data is in another external source e.g. Amazon S3. If this is the case and you’re considering using a tool like Dataform to start building out your data stack, then there are some simple scripts you can run to import this data into your cloud warehouse using Dataform.&lt;/p&gt;

&lt;p&gt;We’re going to walk through how to import data from Amazon S3 into Amazon Redshift in just a few minutes, using the &lt;a href="https://docs.aws.amazon.com/redshift/latest/dg/t_loading-tables-from-s3.html"&gt;COPY&lt;/a&gt; command, which loads data in parallel from multiple data files. The COPY command can also load files from other sources, e.g. Amazon EMR or an Amazon DynamoDB table.&lt;/p&gt;

&lt;h4&gt;
  Before you begin, make sure you have:
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;An Amazon Web Services (AWS) account.&lt;/strong&gt; Signing up is free - &lt;a href="https://aws.amazon.com/"&gt;click here&lt;/a&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Permissions in AWS Identity Access Management (IAM) that allow you to create policies, create roles, and attach policies to roles.&lt;/strong&gt; This is required to grant Dataform access to your S3 bucket.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Verified that column names in your CSV files in S3 adhere to the destination’s length limit for column names.&lt;/strong&gt; If a column name is longer than the destination’s character limit, it will be rejected. In Redshift’s case the limit is 115 characters.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;An Amazon S3 bucket containing the CSV files that you want to import.&lt;/strong&gt;&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;A Redshift cluster.&lt;/strong&gt; If you do not already have a cluster set up, see how to launch one &lt;a href="https://docs.aws.amazon.com/redshift/latest/dg/tutorial-loading-data-launch-cluster.html"&gt;here&lt;/a&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;A Dataform project set up which is connected to your Redshift warehouse.&lt;/strong&gt; See how to do that &lt;a href="https://docs.dataform.co/platform_guides/set_up_datawarehouse/#postgres-running-in-aws-or-redshift"&gt;here&lt;/a&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
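&lt;p&gt;If you want to sanity-check the column-name limit before loading, a short local script can do it. This is a sketch, assuming you have a local copy of one of the CSV files; the 115-character limit is the one noted in the checklist above:&lt;/p&gt;

```python
import csv

# Destination column-name length limit (115 characters for Redshift, per the checklist above).
MAX_NAME_LEN = 115

def oversized_columns(csv_path, limit=MAX_NAME_LEN):
    """Return any header names in the CSV that exceed the length limit."""
    with open(csv_path, newline="") as f:
        header = next(csv.reader(f))
    return [name for name in header if len(name) > limit]
```

&lt;p&gt;Any names this returns should be shortened before you run the import.&lt;/p&gt;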

&lt;p&gt;OK, now you’ve got all that sorted, let’s get started!&lt;br&gt;
Once you’re in Dataform, create a new .sqlx file in your project under the definitions/ folder. Using Dataform’s enriched SQL, this is what the code should look like:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="nx"&gt;config&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="nl"&gt;type&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="s2"&gt;operations&lt;/span&gt;&lt;span class="dl"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="c1"&gt;// Declare to Dataform that this script writes out a dataset.&lt;/span&gt;
  &lt;span class="nx"&gt;hasOutput&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="kc"&gt;true&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="nx"&gt;js&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
  &lt;span class="c1"&gt;// Dataform’s name() function returns the name of the table (not inclusive&lt;/span&gt;
  &lt;span class="c1"&gt;// of the schema) that Dataform expects this script to create.&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;finalTableName&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;name&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
  &lt;span class="c1"&gt;// The resolve() function is used to compute a fully-qualified table name.&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;fullyQualifiedTempTableName&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nx"&gt;resolve&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s2"&gt;`&lt;/span&gt;&lt;span class="p"&gt;${&lt;/span&gt;&lt;span class="nx"&gt;finalTableName&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="s2"&gt;_temp`&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
  &lt;span class="c1"&gt;// self() returns the fully-qualified name of the table that Dataform&lt;/span&gt;
  &lt;span class="c1"&gt;// expects this script to create.&lt;/span&gt;
  &lt;span class="kd"&gt;const&lt;/span&gt; &lt;span class="nx"&gt;fullyQualifiedFinalTableName&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;self&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;

&lt;span class="cm"&gt;/* Step 1: If the S3 tables already exist in Redshift (perhaps from a previous run),
you will need to drop the tables to remove them from the database before you
create them again in the next step. */&lt;/span&gt;
&lt;span class="nx"&gt;DROP&lt;/span&gt; &lt;span class="nx"&gt;TABLE&lt;/span&gt; &lt;span class="nx"&gt;IF&lt;/span&gt; &lt;span class="nx"&gt;EXISTS&lt;/span&gt; &lt;span class="nx"&gt;$&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nx"&gt;fullyQualifiedTempTableName&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="nx"&gt;CASCADE&lt;/span&gt;

&lt;span class="o"&gt;---&lt;/span&gt;

&lt;span class="cm"&gt;/* Step 2: Create a temporary table by executing the following CREATE TABLE command. */&lt;/span&gt;
&lt;span class="nx"&gt;CREATE&lt;/span&gt; &lt;span class="nx"&gt;TABLE&lt;/span&gt; &lt;span class="nx"&gt;$&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nx"&gt;fullyQualifiedTempTableName&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;country&lt;/span&gt; &lt;span class="nx"&gt;VARCHAR&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;MAX&lt;/span&gt;&lt;span class="p"&gt;),&lt;/span&gt; &lt;span class="nx"&gt;revenue&lt;/span&gt; &lt;span class="nx"&gt;FLOAT&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="o"&gt;---&lt;/span&gt;

&lt;span class="cm"&gt;/* Step 3: Copy data from S3 into the temporary table using the COPY command. */&lt;/span&gt;
&lt;span class="nx"&gt;COPY&lt;/span&gt; &lt;span class="nx"&gt;$&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nx"&gt;fullyQualifiedTempTableName&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="nx"&gt;FROM&lt;/span&gt;
  &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;s3://dataform-integration-tests-us-east-n-virginia/sample-data/sample_data&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="nx"&gt;IAM_ROLE&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;arn:aws:iam::161427895535:role/RedshiftS3ReadRole&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="nx"&gt;IGNOREHEADER&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt; &lt;span class="nx"&gt;DELIMITER&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;,&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;

&lt;span class="o"&gt;---&lt;/span&gt;

&lt;span class="cm"&gt;/* Step 4: Delete the output table (if it exists). */&lt;/span&gt;
&lt;span class="nx"&gt;DROP&lt;/span&gt; &lt;span class="nx"&gt;TABLE&lt;/span&gt; &lt;span class="nx"&gt;IF&lt;/span&gt; &lt;span class="nx"&gt;EXISTS&lt;/span&gt; &lt;span class="nx"&gt;$&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nx"&gt;fullyQualifiedFinalTableName&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="nx"&gt;CASCADE&lt;/span&gt;

&lt;span class="o"&gt;---&lt;/span&gt;

&lt;span class="cm"&gt;/* Step 5: Rename the temporary table to the output table. */&lt;/span&gt;
&lt;span class="nx"&gt;ALTER&lt;/span&gt; &lt;span class="nx"&gt;TABLE&lt;/span&gt;
  &lt;span class="nx"&gt;$&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nx"&gt;fullyQualifiedTempTableName&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="nx"&gt;RENAME&lt;/span&gt; &lt;span class="nx"&gt;TO&lt;/span&gt; &lt;span class="nx"&gt;$&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nx"&gt;finalTableName&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
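&lt;p&gt;To make the shape of that COPY statement easier to see, here’s a hypothetical Python helper that assembles the same kind of statement from its parts (the function name and defaults are illustrative, not part of Dataform or Redshift):&lt;/p&gt;

```python
def build_copy_statement(table, s3_path, iam_role, delimiter=",", ignore_header=1):
    """Assemble a Redshift COPY statement like the one in the script above."""
    return (
        f"COPY {table}\n"
        f"FROM '{s3_path}'\n"
        f"IAM_ROLE '{iam_role}'\n"
        f"IGNOREHEADER {ignore_header} DELIMITER '{delimiter}'"
    )
```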



&lt;h4&gt;
  To execute the COPY command, you need to provide the following values:
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Table name:&lt;/strong&gt;
The target table for the COPY command. The table must already exist in the database; it can be either temporary or persistent. The COPY command appends the new input data to any existing rows in the table.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="nx"&gt;COPY&lt;/span&gt; &lt;span class="nx"&gt;$&lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="nx"&gt;fullyQualifiedTempTableName&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Data source:&lt;/strong&gt;
When loading from Amazon S3, you must provide the name of the bucket and the location of the data files, either as an object path for the data files or as the location of a manifest file that explicitly lists each data file and its location.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="nx"&gt;FROM&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;s3://dataform-integration-tests-us-east-n-virginia/sample-data/sample_data&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
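&lt;p&gt;If you point COPY at a manifest file instead of an object path, the manifest is a small JSON file listing each data file explicitly. A sketch, with illustrative bucket and key names:&lt;/p&gt;

```json
{
  "entries": [
    { "url": "s3://my-bucket/sample-data/part-0000.csv", "mandatory": true },
    { "url": "s3://my-bucket/sample-data/part-0001.csv", "mandatory": true }
  ]
}
```

&lt;p&gt;You would then reference the manifest’s S3 location in the FROM clause and add the MANIFEST option to the COPY command.&lt;/p&gt;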



&lt;h4&gt;
  Once you have your S3 import ready
&lt;/h4&gt;

&lt;p&gt;Finally, you can push your changes to GitHub and then publish your table to Redshift. Alternatively, you can run this using the &lt;a href="https://docs.dataform.co/guides/command-line-interface/?utm_source=s3blog"&gt;Dataform CLI&lt;/a&gt;: &lt;code&gt;dataform run&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;And voilà! Your S3 data is now available in your Redshift warehouse as a table and can be included in your larger Dataform dependency graph. This means you can run it alongside all your other code, add dependencies on top of it (so any datasets that rely on it will only run if it succeeds), use the &lt;a href="https://docs.dataform.co/guides/built-in-functions/#ref?utm_source=s3blog"&gt;ref()&lt;/a&gt; or &lt;a href="https://docs.dataform.co/guides/built-in-functions/#resolve?utm_source=s3blog"&gt;resolve()&lt;/a&gt; functions on this dataset in other scripts, and document its &lt;a href="https://docs.dataform.co/guides/documentation/"&gt;data catalog&lt;/a&gt; entry with your own descriptions.&lt;/p&gt;

&lt;p&gt;For more information about how to get set up on Dataform, please see our &lt;a href="https://docs.dataform.co/?utm_source=s3blog"&gt;docs&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>redshift</category>
      <category>aws</category>
      <category>sql</category>
      <category>database</category>
    </item>
  </channel>
</rss>
