<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: pradeep-misra</title>
    <description>The latest articles on DEV Community by pradeep-misra (@pradeepmisra).</description>
    <link>https://dev.to/pradeepmisra</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F532902%2F05e1012d-61f2-4f6c-9550-72b091638e97.png</url>
      <title>DEV Community: pradeep-misra</title>
      <link>https://dev.to/pradeepmisra</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/pradeepmisra"/>
    <language>en</language>
    <item>
      <title>Simplify Data Prep with AWS Glue DataBrew</title>
      <dc:creator>pradeep-misra</dc:creator>
      <pubDate>Mon, 01 Feb 2021 02:59:17 +0000</pubDate>
      <link>https://dev.to/aws-builders/simplify-data-prep-with-aws-glue-databrew-4ch8</link>
      <guid>https://dev.to/aws-builders/simplify-data-prep-with-aws-glue-databrew-4ch8</guid>
      <description>&lt;p&gt;Business users always format, transform and summarize data to derive insights. Even with data being ETL to Data Warehousing systems; users had to apply additional transformations in BI tools. &lt;/p&gt;

&lt;p&gt;As modern enterprises are increasingly adopting Lake House architectures, there is an even greater need to simplify this process for data prep on large data sets that can scale.&lt;/p&gt;

&lt;h5&gt;
  
  
  AWS Glue DataBrew
&lt;/h5&gt;

&lt;blockquote&gt;
&lt;p&gt;AWS Glue DataBrew is a new visual data preparation tool that makes it easy for data analysts and data scientists to clean and normalize data to prepare it for analytics and machine learning.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;It allows you to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Profile &lt;/li&gt;
&lt;li&gt;Clean and Transform&lt;/li&gt;
&lt;li&gt;Track Lineage&lt;/li&gt;
&lt;li&gt;Orchestrate and Automate&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In the remaining sections, I will demonstrate these capabilities using a simple customer segmentation dataset - &lt;a href="https://www.kaggle.com/vjchoudhary7/customer-segmentation-tutorial-in-python" rel="noopener noreferrer"&gt;Link&lt;/a&gt;&lt;/p&gt;

&lt;h5&gt;
  
  
  Data Profiling
&lt;/h5&gt;

&lt;p&gt;First thing we do on a dataset is to profile - either with code or here with a few clicks in Data Brew. &lt;/p&gt;

&lt;p&gt;To begin, select the dataset and "&lt;em&gt;Run data profile&lt;/em&gt;" button.&lt;br&gt;&lt;br&gt;
&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Ffgh5eiceqza4ftnnqhgf.JPG" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Ffgh5eiceqza4ftnnqhgf.JPG" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This would take us to "&lt;em&gt;Create job&lt;/em&gt;" page where we create a "&lt;em&gt;Profile Job&lt;/em&gt;" and this will generate summary statistics on our data.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fwk596s0ts2yip0c01h1m.JPG" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fwk596s0ts2yip0c01h1m.JPG" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fqjr5r8y9v94pzue61nj5.JPG" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fqjr5r8y9v94pzue61nj5.JPG" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Once Profile job is complete, it provides a profile overview, detailed column statistics and lineage.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F48sdclnnh7ab36c6k8hz.JPG" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F48sdclnnh7ab36c6k8hz.JPG" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F7wo5puckxrts9phk3nj6.JPG" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F7wo5puckxrts9phk3nj6.JPG" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fqao65qg9b1wu3xs8ydhw.JPG" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fqao65qg9b1wu3xs8ydhw.JPG" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h5&gt;
  
  
  Projects
&lt;/h5&gt;

&lt;p&gt;Next, we will transform the data by creating a project on the dataset. We can create Project from the Datasets view by specifying Project Name and Recipe details. For recipe we can either create a new one or use from existing list.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fzh1j0tf5r7hu7rttp5nb.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fzh1j0tf5r7hu7rttp5nb.jpg" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F8h6akfu1arrx9zlo1ua0.JPG" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F8h6akfu1arrx9zlo1ua0.JPG" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;While creating recipes, transforms are done with sample data. For this Project, we will select 5000 random rows. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Faumlrw1regkfyucb3fn7.JPG" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Faumlrw1regkfyucb3fn7.JPG" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h5&gt;
  
  
  Project Landing Page
&lt;/h5&gt;

&lt;p&gt;Default view of this page shows us sample rows, distribution and recipe steps.It also shows the available list of transformations that can be done on the dataset.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fyc7juu5x4adbwswnvafz.JPG" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fyc7juu5x4adbwswnvafz.JPG" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Schema View-&lt;br&gt;
&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fpmj8mw1v26vmmttbflnt.JPG" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fpmj8mw1v26vmmttbflnt.JPG" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Transformations steps can be added and together makes a recipe.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;DataBrew provides from over 250 built-in transformations to visualize, clean, and normalize your data with an interactive, point-and-click visual interface.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;For this dataset, we will create a recipe with two transformations:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Filter Customers with age greater than equal to 30&lt;/li&gt;
&lt;li&gt;Encode Gender to 'M' or 'F'&lt;/li&gt;
&lt;/ol&gt;

&lt;h5&gt;
  
  
  1. Filter Customers with age greater than 30
&lt;/h5&gt;

&lt;p&gt;From Filter category &amp;gt; By Condition &amp;gt; Greater than or equal to:&lt;br&gt;
&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fsy4t9na19gt3kgele0kb.JPG" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fsy4t9na19gt3kgele0kb.JPG" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Select the source column and the filter value&lt;br&gt;
&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fd0yzxyiag7jl9f2h98wh.JPG" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fd0yzxyiag7jl9f2h98wh.JPG" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Preview shows the result of the step and helps verify the outcome before applying it&lt;br&gt;
&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Flqhjwwgwqdpw4yuz18yd.JPG" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Flqhjwwgwqdpw4yuz18yd.JPG" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h5&gt;
  
  
  2. Encode Gender to 'M' or 'F'
&lt;/h5&gt;

&lt;p&gt;From the clean category &amp;gt; select Replace value or Pattern&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fehrr362n394qjc3vg3s9.JPG" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fehrr362n394qjc3vg3s9.JPG" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Select source column and the value to be replaced&lt;br&gt;
&lt;strong&gt;Note&lt;/strong&gt; - We need two steps here, one for each value to be replaced.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Focj2xnl7picvybdf6bpk.JPG" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Focj2xnl7picvybdf6bpk.JPG" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Preview changes and apply&lt;br&gt;
&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fq1n4kawbrkfwi5gilgdk.JPG" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fq1n4kawbrkfwi5gilgdk.JPG" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Then, transform the next value&lt;br&gt;
&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fmcurevr2ds282vsmyira.JPG" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fmcurevr2ds282vsmyira.JPG" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As we are done with all the steps, we are good to publish our recipe:&lt;br&gt;
&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fall88tr2smaw4e7pfm8f.JPG" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fall88tr2smaw4e7pfm8f.JPG" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h5&gt;
  
  
  Jobs
&lt;/h5&gt;

&lt;p&gt;We had earlier created a Profile Job. We will now create a recipe job that will transform the entire dataset.&lt;/p&gt;

&lt;p&gt;We will specify the Dataset, Recipe and Output details.&lt;br&gt;
&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fn6m7qvffekbt71ygpumk.JPG" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fn6m7qvffekbt71ygpumk.JPG" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Once job is created, we get the transformed dataset, job run and history details.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F0a5q0o73tvxgy7sn9asl.JPG" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F0a5q0o73tvxgy7sn9asl.JPG" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h5&gt;
  
  
  Data Lineage
&lt;/h5&gt;

&lt;p&gt;Lineage shows the various components and their relationships to the output.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F9cateshcne5kq6iliag7.JPG" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F9cateshcne5kq6iliag7.JPG" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h5&gt;
  
  
  Summary
&lt;/h5&gt;

&lt;p&gt;Data Brew provides a powerful abstraction for data preparation. As explained, it fastens and simplifies the Data Prep process allowing Users to focus on the insights and decisions that drives their business.&lt;br&gt;
 . &lt;/p&gt;

</description>
      <category>aws</category>
      <category>machinelearning</category>
      <category>datascience</category>
    </item>
  </channel>
</rss>
