<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Kshitij Wadhwa</title>
    <description>The latest articles on DEV Community by Kshitij Wadhwa (@kwadhwa18).</description>
    <link>https://dev.to/kwadhwa18</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F207434%2F7e79d9df-f3e2-4423-bc42-ad540c3a231a.jpeg</url>
      <title>DEV Community: Kshitij Wadhwa</title>
      <link>https://dev.to/kwadhwa18</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/kwadhwa18"/>
    <language>en</language>
    <item>
      <title>Running Fast SQL on DynamoDB Tables</title>
      <dc:creator>Kshitij Wadhwa</dc:creator>
      <pubDate>Wed, 23 Jan 2019 08:00:00 +0000</pubDate>
      <link>https://dev.to/rocksetcloud/running-fast-sql-on-dynamodb-tables-4p7n</link>
      <guid>https://dev.to/rocksetcloud/running-fast-sql-on-dynamodb-tables-4p7n</guid>
      <description>&lt;p&gt;Have you ever wanted to run SQL queries on &lt;a href="https://aws.amazon.com/dynamodb/"&gt;Amazon DynamoDB&lt;/a&gt; tables without impacting your production workloads? Wouldn't it be great to do so without needing to set up an ETL job and then having to manually monitor that job?&lt;/p&gt;

&lt;p&gt;In this blog, I will discuss how &lt;a href="https://rockset.com/"&gt;Rockset&lt;/a&gt; integrates with DynamoDB and continuously updates a collection automatically as new objects are added to a DynamoDB table. I will walk through steps on how to set up a live integration between Rockset and a DynamoDB table and run millisecond-latency SQL on it.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;DynamoDB Integration&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Amazon DynamoDB is a key-value and document database where the key is specified at the time of table creation. DynamoDB supports scan operations over one or more items and also captures table activity using &lt;a href="https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/Streams.html"&gt;DynamoDB Streams&lt;/a&gt;. Using these features, Rockset continuously ingests data from DynamoDB in two steps:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;The first time a user creates a DynamoDB-sourced collection, Rockset does an entire scan of the DynamoDB table.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;After the scan finishes, Rockset continuously processes DynamoDB Streams to account for new or modified records.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;To ensure Rockset does not lose any new data which is recorded in the DynamoDB table when the scan is happening, Rockset enables &lt;a href="https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/HowItWorks.ReadConsistency.html"&gt;strongly consistent scans&lt;/a&gt; in the Rockset-DynamoDB connector, and also creates DynamoDB Streams (if not already present) and records the sequence numbers of existing shards. The continuous processing step (step 2 above) processes DynamoDB Streams starting from the sequence number recorded before the scan.&lt;/p&gt;

&lt;p&gt;Primary key values from the DynamoDB table are used to construct the &lt;code&gt;_id&lt;/code&gt; field in Rockset to uniquely identify a document in a Rockset collection. This ensures that updates to an existing item in the DynamoDB table are applied to the corresponding document in Rockset.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Connecting DynamoDB to Rockset&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;For this example, I have created a DynamoDB table &lt;a href="https://github.com/rockset/recipes/tree/master/rockset-dynamodb"&gt;programmatically&lt;/a&gt; using a &lt;a href="https://github.com/HackerNews/API"&gt;Hacker News data set&lt;/a&gt;. The data set consists of data about each post and comment on the website. Each field in the dataset is described &lt;a href="https://github.com/HackerNews/API#items"&gt;here&lt;/a&gt;. I have included a &lt;a href="https://github.com/rockset/recipes/blob/master/rockset-dynamodb/dataset/hn_comments"&gt;sample&lt;/a&gt; of this data set in our recipes repository.&lt;/p&gt;

&lt;p&gt;The table was created using the &lt;code&gt;id&lt;/code&gt; field as the partition key for DynamoDB. Also, I had to massage the data set as DynamoDB doesn't accept empty string values. With Rockset, as you will see in the next few steps, you don't need to perform such ETL operations or provide schema definitions to create a collection and make it immediately queryable via SQL.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Creating a Rockset Collection&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;I will use the &lt;a href="https://pypi.org/project/rockset/"&gt;Rockset Python Client&lt;/a&gt; to create a collection backed by a DynamoDB table. To try this in your environment, you will need to create an &lt;a href="https://docs.rockset.com/integrations/"&gt;Integration&lt;/a&gt; (an object that represents your AWS credentials) and set up relevant &lt;a href="https://docs.rockset.com/source-amazon-dynamodb/"&gt;permissions&lt;/a&gt; on the DynamoDB table, which allows Rockset to perform certain read operations on that table.&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;


&lt;p&gt;Alternatively, DynamoDB-sourced collections can also be created from the Rockset &lt;a href="https://console.rockset.com/workspace/collections/create"&gt;console&lt;/a&gt;, as shown below.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--xY9JJDxR--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://images.ctfassets.net/1d31s1aajogl/6k4vAM4A2abeKGtRzFrAHH/2681870829c2d349b604ee9d96e3592d/dynamodb1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--xY9JJDxR--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://images.ctfassets.net/1d31s1aajogl/6k4vAM4A2abeKGtRzFrAHH/2681870829c2d349b604ee9d96e3592d/dynamodb1.png" alt="dynamodb1"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Running SQL on DynamoDB Data&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;Each document in Rockset corresponds to one row in the DynamoDB table. Rockset automatically infers the schema, as shown below.&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;


&lt;p&gt;Now we are ready to run fast SQL on data from our DynamoDB table. Let's write a few queries to get some insights from this data set.&lt;/p&gt;

&lt;p&gt;Since we are clearly interested in the topic of data, let's see how frequently people have discussed or shared about "data" on Hacker News over the years. In this query, I am tokenizing the &lt;code&gt;title&lt;/code&gt;, extracting the year from the the &lt;code&gt;time&lt;/code&gt; field, and returning the number of occurrences of "data" in the tokens, grouped by year.&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;


&lt;p&gt;Using &lt;a href="https://docs.rockset.com/apache-superset/"&gt;Apache Superset integration with Rockset&lt;/a&gt;, I plotted a graph with the results.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--RElhR3uo--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://images.ctfassets.net/1d31s1aajogl/1LZkIMEQihMn9WQgBoJcq6/0a10a7ef2516c119156f7cb01028b97b/dynamodb2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--RElhR3uo--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://images.ctfassets.net/1d31s1aajogl/1LZkIMEQihMn9WQgBoJcq6/0a10a7ef2516c119156f7cb01028b97b/dynamodb2.png" alt="dynamodb2"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The number of stories concerning data has clearly been increasing over time.&lt;/p&gt;

&lt;p&gt;Next, let's mine the Hacker News data set for observations on one of the most talked-about technologies of the past two years, blockchain. Let's first check how user engagement around blockchain and cryptocurrencies has been trending.&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;


&lt;p&gt;As you can see, interest in blockchain went up immensely in 2017 and 2018. The results are also aligned with this &lt;a href="https://cointelegraph.com/news/number-of-crypto-users-nearly-doubled-in-2018-study-says"&gt;study&lt;/a&gt;, which estimated that the number of crypto users doubled in 2018.&lt;/p&gt;

&lt;p&gt;Also, along with blockchain, hundreds of cryptocurrencies emerged. Let's find the most popular coins in our data set.&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;


&lt;p&gt;&lt;a href="https://en.wikipedia.org/wiki/Bitcoin"&gt;Bitcoin&lt;/a&gt;, as one would have guessed, seems to be the most popular cryptocurrency in our Hacker News data.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;Summary&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;In this entire process, I simply created a Rockset collection with a DynamoDB source, without any data transformation and schema modeling, and immediately ran SQL queries over it. Using Rockset, you too can join data across different DynamoDB tables or other &lt;a href="http://docs.dev.rockset.com/overview/#ingest"&gt;sources&lt;/a&gt; to power your live applications.&lt;/p&gt;

&lt;p&gt;Other DynamoDB resources:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://rockset.com/blog/analytics-on-dynamodb-athena-spark-elastic/"&gt;Analytics on DynamoDB: Comparing Athena, Spark and Elastic&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://rockset.com/blog/live-dashboards-dynamodb-streams-lambda-elasticache/"&gt;Custom Live Dashboards on DynamoDB — Using DynamoDB Streams with Lambda and ElastiCache&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://rockset.com/blog/tableau-operational-dashboards-reporting-dynamodb-redshift-athena/"&gt;Tableau Operational Dashboards and Reporting on DynamoDB — Evaluating Redshift and Athena&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>aws</category>
      <category>sql</category>
      <category>database</category>
    </item>
  </channel>
</rss>
