<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Eshban Suleman</title>
    <description>The latest articles on DEV Community by Eshban Suleman (@eshbanthelearner).</description>
    <link>https://dev.to/eshbanthelearner</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F448053%2F5efef916-3aa6-44b3-867b-e7a6579da60a.jpg</url>
      <title>DEV Community: Eshban Suleman</title>
      <link>https://dev.to/eshbanthelearner</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/eshbanthelearner"/>
    <language>en</language>
    <item>
      <title>A 100 Day #thePersonalMSDS Journey</title>
      <dc:creator>Eshban Suleman</dc:creator>
      <pubDate>Sat, 04 Sep 2021 19:33:38 +0000</pubDate>
      <link>https://dev.to/eshbanthelearner/a-100-day-thepersonalmsds-journey-1gc5</link>
      <guid>https://dev.to/eshbanthelearner/a-100-day-thepersonalmsds-journey-1gc5</guid>
      <description>&lt;p&gt;The Machine Learning landscape is in a state of continuous change. New research, technologies and tools are put out every day. This sometimes makes it hard to keep up with the latest trends. Besides that, the vastness of the domain can induce the imposter syndrome in practitioners. This is perfectly put in the following tweet &lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--wRrj9vV7--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/c15wpfsooyzrhawb8pny.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--wRrj9vV7--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/c15wpfsooyzrhawb8pny.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I too have felt this over the years: either I feel that I know too little, or I feel out of touch. To combat this, I’ve been following challenges to stay on top of my skills and learn new ones along the way. One of the challenges I recently completed is #thePersonalMSDS. &lt;/p&gt;

&lt;p&gt;#thePersonalMSDS was an initiative by one of my seniors, &lt;a href="https://www.linkedin.com/in/mhjhamza/"&gt;Muhammad Hamza Javaid&lt;/a&gt;, to get industry professionals and students to follow a self-curated Data Science Masters roadmap to develop new skills and hone existing ones. Participants decide the number of days (usually 100) and the number of hours they want to dedicate to learning each day. I first came across it in January 2020 and pledged 100 days of following a customized roadmap, which I completed from January 13th 2020 to April 22nd 2020. During those 100 days, I studied various topics with the help of online courses and articles, including: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Statistics and Probability&lt;/li&gt;
&lt;li&gt;Big Data with Apache Spark&lt;/li&gt;
&lt;li&gt;AI for Business&lt;/li&gt;
&lt;li&gt;Investment Fundamentals &amp;amp; Data Analytics&lt;/li&gt;
&lt;li&gt;Data Engineering on GCP&lt;/li&gt;
&lt;li&gt;Basic Bash Scripting and Shell Programming&lt;/li&gt;
&lt;li&gt;Data Science Project Management&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;As you can see, I customized my learning path based on my needs and interests. This challenge not only helped me learn new skills but also got me back on top of my existing ones. More recently, I pledged the last 100 days, from May 26th 2021 to September 2nd 2021, to the #thePersonalMSDS challenge. I learned some topics that were new to me and also worked on skills I already had. I got a discount coupon for the Databricks Data Science Pathway and spent 36 days completing it, earning 41 certificates, some of which you can check &lt;a href="https://www.linkedin.com/in/eshban-suleman-624a49113/"&gt;here&lt;/a&gt;. Some other topics/technologies I studied apart from this were:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Deploying Machine Learning Models&lt;/li&gt;
&lt;li&gt;Spatial Analysis and Geospatial Data Science&lt;/li&gt;
&lt;li&gt;Data Privacy&lt;/li&gt;
&lt;li&gt;ElasticSearch (ELK Stack)&lt;/li&gt;
&lt;li&gt;HuggingFace Transformers&lt;/li&gt;
&lt;li&gt;Customer Segmentation&lt;/li&gt;
&lt;li&gt;Time Series Analysis and Forecasting&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You can track my detailed learnings &lt;a href="https://github.com/EshbanTheLearner/thepersonalMSDS-v2/blob/main/todayilearned.md"&gt;here&lt;/a&gt;. A question I get a lot is how I find the motivation to start and keep going. It’s a great question, and it is a very common problem. I too have gone off track a few times, so through trial and error I worked out some methods that work for me. I hope you find them useful too.&lt;/p&gt;

&lt;h1&gt;
  
  
  Plan Ahead of Time
&lt;/h1&gt;

&lt;p&gt;A good plan will help you stay on top of your skills, and it’ll show how self-aware you are regarding your strengths and weaknesses. I like to make two separate lists: one dedicated to topics and skills I want to learn, and one for skills I’ve already learnt but either feel out of touch with or just want to study in depth. Then I pick topics from both lists that I feel are both important and fun. Remember, you can always add or remove topics later.&lt;/p&gt;

&lt;h1&gt;
  
  
  Find a Community
&lt;/h1&gt;

&lt;p&gt;Get your friends and/or colleagues to sign up for the challenge with you. If nobody wants to join, find people on the internet with the same interests and become part of online study groups. Most importantly, share your daily progress on the internet with proper hashtags. It’ll get you the exposure you need to find people interested in what you’re doing, and it’ll keep you motivated to meet the daily goal.&lt;/p&gt;

&lt;h1&gt;
  
  
  Stay Positive
&lt;/h1&gt;

&lt;p&gt;Maintaining a routine like this alongside work or studies can be cumbersome and frustrating at times. Sometimes it may feel like you are going nowhere, but that is the moment to look at how far you’ve come, how many new things you’ve learned, and how many people you’ve connected with along the way. This will help you stay positive and motivated.&lt;/p&gt;

&lt;h1&gt;
  
  
  Take Breaks
&lt;/h1&gt;

&lt;p&gt;Self-learning is all about flexibility. You don’t need to burden yourself with covering a lot of topics in a short period of time. If you’re feeling tired, just take a break. Take as many breaks as necessary to relieve your stress and come back more focused. You are in charge. &lt;/p&gt;

&lt;h1&gt;
  
  
  Have Fun
&lt;/h1&gt;

&lt;p&gt;The most important factor in staying motivated is to have fun while learning. The more fun you make your learning, the more you’ll look forward to it. Everyone has their own methods of having fun: you can do mini-projects using the skills you’re learning, make video tutorials, write blogs about it, etc. I like to take handwritten notes and do mini-projects. Pick your poison. &lt;/p&gt;

&lt;p&gt;So, if you are planning to learn something new or even brush up on your skills, start today, start now, because tomorrow never comes. I wish you all the very best for your future. &lt;/p&gt;


&lt;p&gt;&lt;a href="https://giphy.com/gifs/shia-labeouf-just-do-it-J7jsbfcJ2O5eo"&gt;via GIPHY&lt;/a&gt;&lt;/p&gt;

</description>
      <category>productivity</category>
      <category>motivation</category>
      <category>datascience</category>
    </item>
    <item>
      <title>Alternatives to Google Patents</title>
      <dc:creator>Eshban Suleman</dc:creator>
      <pubDate>Sat, 20 Feb 2021 09:56:01 +0000</pubDate>
      <link>https://dev.to/traindex/alternatives-to-google-patents-4j4b</link>
      <guid>https://dev.to/traindex/alternatives-to-google-patents-4j4b</guid>
      <description>&lt;p&gt;There are multiple tools available over the internet to check the similarity of a claim or a patent. There are pros and cons of every tool and a user can sometimes have a hard time deciding what to use where. In such situations, people tend to use the services they trust. People tend to rely on big tech companies when it comes to choosing between a variety of options because they are perceived to be doing well in every area. Such is the case with Google Patents. &lt;/p&gt;

&lt;p&gt;Although Google Patents is a good all-round search engine for patent data, it does have some disadvantages. In this article, we will look at some of the more obvious cons of Google Patents and then at some other services available online. And if you are not familiar with the concept of a patent search or how to conduct one, have a look at our article &lt;a href="https://www.traindex.io/blog/patent-search-4j05"&gt;Patent Search&lt;/a&gt;.&lt;/p&gt;

&lt;h1&gt;
  
  
  Some Shortfalls of Google Patents
&lt;/h1&gt;

&lt;p&gt;This article is not aimed at discrediting Google Patents as a search engine for patents; rather, the goal is to familiarize the reader with some alternatives to it. So let’s first discuss why one might decide not to use Google Patents.&lt;/p&gt;

&lt;h2&gt;
  
  
  1. Semi-Semantic Behavior
&lt;/h2&gt;

&lt;p&gt;Google Patents has been observed to show semi-semantic behavior. It is a keyword-based search at its core, but it can extract some semantically similar results. Sometimes this is useful, but most of the time it searches for unrelated synonyms. Following is an example of this behavior. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--BK7w6x8t--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/rc6ujdqejbw7ygkmmunt.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--BK7w6x8t--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/rc6ujdqejbw7ygkmmunt.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This is not necessarily bad behavior, but it does affect the results. &lt;/p&gt;

&lt;h2&gt;
  
  
  2. Bad with Acronyms
&lt;/h2&gt;

&lt;p&gt;As with all keyword-based searches, Google Patents also seems to struggle with acronyms. The most common example is the acronym AIDS (Acquired Immune Deficiency Syndrome), which is often confused with the word “aids”, a verb meaning “to help”. So you might get a lot of false positives if your query contains such acronyms. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--tK4F2dE1--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/rq2nsu7kebba3fcwe990.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--tK4F2dE1--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/rq2nsu7kebba3fcwe990.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  3. Empty Results
&lt;/h2&gt;

&lt;p&gt;Google Patents shows its keyword-search behavior here as well. If the keywords are very rare, it might return zero results. Semantic search engines usually shine in this department, but Google Patents is not one of them. &lt;/p&gt;

&lt;h2&gt;
  
  
  4. Unable to Process Scientific Jargon
&lt;/h2&gt;

&lt;p&gt;Patents usually cover complex, novel scientific inventions and thus contain a lot of “science language”, but Google Patents is usually unable to retrieve results when queried with scientific jargon such as chemical formulas. &lt;/p&gt;

&lt;h2&gt;
  
  
  5. Missing Citations
&lt;/h2&gt;

&lt;p&gt;Some patents went missing during a data transfer. Due to this, citations are missing from some patents.&lt;/p&gt;

&lt;h2&gt;
  
  
  6. Disclosure Risk
&lt;/h2&gt;

&lt;p&gt;Google tracks search activity according to its &lt;a href="https://policies.google.com/privacy?hl=en"&gt;Privacy Policy&lt;/a&gt;. According to &lt;a href="https://www.uspto.gov/web/offices/pac/mpep/s904.html"&gt;MPEP 904.02(c)&lt;/a&gt; of the Manual of Patent Examining Procedure by the United States Patent and Trademark Office (USPTO), examiners are allowed to use tools and the internet to search for the prior art of any claim under examination, but they are not allowed to use any proprietary information as a query; instead, they are advised to use a general state-of-the-art query to get similar results. Simply put, to check whether the claim under inspection is similar or identical to any published claim, you can use any service on the internet, but you shouldn’t provide any information about the claim that might compromise its confidentiality. Since Google Patents is a keyword-based search, it is difficult to come up with a query that balances the confidentiality of your claim against the need to find similar or identical existing claims. Thus your case might always be at risk if Google Patents is being used.&lt;/p&gt;

&lt;p&gt;I think these are more than enough reasons to try something different this time. Let’s now discuss some of the alternatives to Google Patents.&lt;/p&gt;

&lt;h1&gt;
  
  
  Patentscope
&lt;/h1&gt;

&lt;p&gt;&lt;a href="https://patentscope.wipo.int/search/en/search.jsf"&gt;Patentscope&lt;/a&gt; is a patent search service by the World Intellectual Property Organization (&lt;a href="https://www.wipo.int/portal/en/index.html"&gt;WIPO&lt;/a&gt;). You can search over 92 million patents worldwide and can also enhance your search results by filtering them using certain meta-level filters. It is a free global search engine technology information. It doesn’t employ any spelling correction technique nor does it enable to use chemical compounds as a query on the open version. Also, it strictly searches for words in the query and not their other forms, so no lemmatization is observed. It also returns zero results if even one word in the query is out of its vocabulary. &lt;/p&gt;

&lt;h1&gt;
  
  
  Espacenet
&lt;/h1&gt;

&lt;p&gt;&lt;a href="https://worldwide.espacenet.com/"&gt;Escapenet&lt;/a&gt; by European Patent Office (EPO) is also a keyword-based patent search on over 120 million patents. It has all the characteristics of keyword search such as advanced search features and metadata-based filters. Unlike Patentscope, it uses lemmatization to get different word forms too and supports multiple European languages. The base search only allows up to 10 keywords.&lt;/p&gt;

&lt;h1&gt;
  
  
  lens.org
&lt;/h1&gt;

&lt;p&gt;&lt;a href="https://www.lens.org/"&gt;lens.org&lt;/a&gt; provides search services for different scholarly datasets including patent data of 125.4 million patent records. It has very fine-grained advanced search filters and has patents from all around the world. It uses Apache Lucene and Elasticsearch for text search and shows a semi-semantic behavior. It also supports spelling correction and handles acronyms better than the previous two options. Still, it doesn’t search for chemical compounds, etc, and is susceptible to return empty results. &lt;/p&gt;

&lt;h1&gt;
  
  
  Traindex
&lt;/h1&gt;

&lt;p&gt;&lt;a href="https://www.traindex.io/"&gt;Traindex&lt;/a&gt; is a semantic search engine, unlike others in this list. It uses Machine Learning to find patents that are semantically similar to the query. It searches over Google Public Patents Data and can be integrated very easily into your applications. It can accept texts of various lengths, you can enter whole patent documents and it will handle that easily. Since it is a semantic search engine, it outstands in retrieving desired results for even a very unique set of queries. One of the things that make it stand out is that it doesn’t track search data and lets you use their API safely. Does this look like something you want to know more about? How about you schedule a demo &lt;a href="https://www.traindex.io/"&gt;here&lt;/a&gt; and we will walk you through the process. &lt;/p&gt;

&lt;p&gt;The goal of this article was to point out some areas where Google Patents falls short and to provide some alternative resources, so you can use the right tool for your problems without compromising privacy and security. If you’re still unsure, you can reach us at &lt;a href="mailto:help@traindex.io"&gt;help@traindex.io&lt;/a&gt; and we would be happy to guide you further.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Event Driven Data Pipelines in AWS</title>
      <dc:creator>Eshban Suleman</dc:creator>
      <pubDate>Mon, 30 Nov 2020 17:52:07 +0000</pubDate>
      <link>https://dev.to/traindex/event-driven-data-pipelines-in-aws-480i</link>
      <guid>https://dev.to/traindex/event-driven-data-pipelines-in-aws-480i</guid>
      <description>&lt;p&gt;In a data-driven organization, there is a constant need to provide vast amounts of data to the teams. There are many tools available to aid your requirements and needs. Choosing the right tool can be a little challenging and overwhelming at times. The basic principle you can keep in mind is that there is no right tool or architecture, it depends on what you need. &lt;/p&gt;

&lt;p&gt;In this guide, I’m going to show you how to build a simple event-driven data pipeline in AWS. Pipelines are often scheduled or interval-based; the event-driven approach, however, is distinctive and a good starting point. Instead of trying to figure out the right activation intervals for the pipeline, you can use an event handler that reacts to certain events and activates your pipeline. &lt;/p&gt;

&lt;p&gt;To learn more about the problem we were solving at Traindex and why a data pipeline was the right choice for us, refer to my previous article &lt;a href="https://www.traindex.io/blog/introduction-to-data-pipelines-26o7" rel="noopener noreferrer"&gt;Introduction to Data Pipelines&lt;/a&gt;. &lt;/p&gt;

&lt;p&gt;As an example, we will use the “Sentiment140 dataset with 1.6 million tweets”, which is available on Kaggle. Our goal is to set up a data preprocessing pipeline. Once you upload the CSV file to a specified bucket, an event is generated. A Lambda function handles that event and activates the pipeline. The pipeline itself is built on AWS Data Pipeline, a web service that helps you process and move data between different AWS compute and storage services. The pipeline provisions a compute resource and runs your preprocessing code on it. Once the data is cleaned and preprocessed, it is uploaded to the specified bucket for later use. Based on these objectives, we can divide the task into the following sub-tasks:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Creating a pre-configured AMI&lt;/li&gt;
&lt;li&gt;Defining AWS data pipeline architecture&lt;/li&gt;
&lt;li&gt;Writing the event handler AWS Lambda function&lt;/li&gt;
&lt;li&gt;Integrating everything&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Before diving into the steps, make sure the following preconditions are met:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;An AWS account with the required IAM privileges&lt;/li&gt;
&lt;li&gt;The “Sentiment140 dataset with 1.6 million tweets”, already downloaded&lt;/li&gt;
&lt;li&gt;An active internet connection&lt;/li&gt;
&lt;/ul&gt;

&lt;h1&gt;
  
  
  Pre-Configured AMI
&lt;/h1&gt;

&lt;p&gt;This step can be optional depending on your requirements, but it is good to have a pre-configured AMI that you can use for the compute resources. Follow these steps to create one: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Go to the AWS console, click on the Services dropdown menu, and select EC2&lt;/li&gt;
&lt;li&gt;On the EC2 dashboard, select Launch an Instance&lt;/li&gt;
&lt;li&gt;Select the Amazon Linux AMI 2018.03.0 (HVM), SSD Volume Type - ami-01fee56b22f308154 &lt;/li&gt;
&lt;li&gt;Select the General Purpose t2.micro which is free-tier eligible&lt;/li&gt;
&lt;li&gt;Click on Review and Launch and then click Launch to launch this EC2 instance&lt;/li&gt;
&lt;li&gt;Now go to the EC2 dashboard and select your EC2 Instance. Copy the public DNS and SSH into your created instance. &lt;/li&gt;
&lt;li&gt;Now, install all the required packages, tools, and libraries using standard Linux commands (see the sketch after this list).&lt;/li&gt;
&lt;li&gt;Also, set up any credentials you might require later like AWS credentials, etc.&lt;/li&gt;
&lt;li&gt;Once satisfied with your instance, it’s time to create an AMI image from this instance.&lt;/li&gt;
&lt;li&gt;Go to the EC2 dashboard and right-click on your instance. Click on Actions, select Image, and click on Create Image.&lt;/li&gt;
&lt;li&gt;Keep the default settings and create the image by clicking on Create Image.&lt;/li&gt;
&lt;li&gt;It’ll take a couple of minutes and once it’s done, go ahead and terminate the instance you created. You will only need the AMI ID in the next phases.&lt;/li&gt;
&lt;/ul&gt;
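
&lt;p&gt;As a rough sketch, the installation step above might look something like the following on the Amazon Linux AMI. The packages here are only an assumption based on what the preprocessing code later in this guide needs (pandas, nltk, and s3fs so pandas can read directly from S3; the AWS CLI ships with Amazon Linux); adjust them to your own stack.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Hypothetical AMI setup; package names may differ on other AMIs&lt;/span&gt;
sudo yum update -y
sudo yum install -y python36 python36-pip

&lt;span class="c"&gt;# Libraries used by the preprocessing script later in this guide&lt;/span&gt;
sudo python3 -m pip install pandas nltk s3fs

&lt;span class="c"&gt;# Either configure AWS credentials now so they are baked into the AMI,&lt;/span&gt;
&lt;span class="c"&gt;# or attach an IAM role to the instance profile instead&lt;/span&gt;
aws configure
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;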

&lt;h1&gt;
  
  
  AWS Data Pipeline Architecture
&lt;/h1&gt;

&lt;p&gt;The main idea behind this step is to set up a data pipeline that, upon certain triggers, launches an EC2 instance. We will then have a bash script run on that instance, responsible for moving our raw data back and forth and running our preprocessing Python script. This step can be further divided into three main subsections, so let’s get to it.&lt;/p&gt;

&lt;h2&gt;
  
  
  AWS Data Pipeline Architecture Definition
&lt;/h2&gt;

&lt;p&gt;First of all, let’s define the AWS data pipeline architecture. We do so by writing a JSON file that defines and describes our data pipeline and provides it with all the required logic. I’ll break it down as much as needed, but you can always refer to the documentation to explore more options. The data pipeline definition can include different pieces of information, such as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Names, locations, and formats of your data sources&lt;/li&gt;
&lt;li&gt;Activities that transform the data&lt;/li&gt;
&lt;li&gt;The schedule for those activities&lt;/li&gt;
&lt;li&gt;Resources that run your activities and preconditions&lt;/li&gt;
&lt;li&gt;Preconditions that must be satisfied before the activities can be scheduled&lt;/li&gt;
&lt;li&gt;Ways to alert you with status updates as pipeline execution proceeds&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We can express the data pipeline definition in three parts: objects, parameters, and values.&lt;/p&gt;

&lt;h3&gt;
  
  
  Objects
&lt;/h3&gt;

&lt;p&gt;Below you can see the syntax of the definition.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"objects"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
       &lt;/span&gt;&lt;span class="nl"&gt;"name1"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"value1"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
       &lt;/span&gt;&lt;span class="nl"&gt;"name2"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"value2"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
       &lt;/span&gt;&lt;span class="nl"&gt;"name1"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"value3"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
       &lt;/span&gt;&lt;span class="nl"&gt;"name3"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"value4"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
       &lt;/span&gt;&lt;span class="nl"&gt;"name4"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"value5"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Following the above syntax, we can place our required objects one by one. First of all, we need to define the pipeline object. We will define fields like the ID, name, IAM and resource roles, the path for saving pipeline logs, and the schedule type. You can add or remove fields based on your requirements, and you should look at the official documentation to learn more about these and other fields.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Default"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Default"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"failureAndRerunMode"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"CASCADE"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"resourceRole"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"DataPipelineDefaultResourceRole"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"role"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"DataPipelineDefaultRole"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"pipelineLogUri"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"s3://automated-data-pipeline/logs/"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"scheduleType"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"ONDEMAND
    },

&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can use this object with one change: the &lt;code&gt;pipelineLogUri&lt;/code&gt; field. Set it to the path of the S3 bucket you want to save your logs in. The next object in our definition is the compute resource, i.e. the EC2 resource.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"MyEC2Resource"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Ec2Resource"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"imageId"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"ami-xxxxxxxxxxxxxxxxx"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"instanceType"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"r5.large"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"spotBidPrice"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"2.0"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"terminateAfter"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"30 Minutes"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"actionOnTaskFailure"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"terminate"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"maximumRetries"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"1"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"role"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"DataPipelineDefaultRole"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"resourceRole"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"DataPipelineDefaultResourceRole"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"keyPair"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"&amp;lt;YOUR-KEY&amp;gt;"&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="err"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We have described our compute needs in this object: for example, an EC2 instance of type r5.large on spot pricing, using your key pair. Also, remember to put the pre-configured AMI ID in the &lt;code&gt;imageId&lt;/code&gt; field so the instance launches with all of the configuration already in place. Now, let’s move on to the next and last object, the shell activity. This object runs our shell script, which in turn runs our preprocessing code.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"ShellCommandActivityObj"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"ShellCommandActivityObj"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"ShellCommandActivity"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"command"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"aws s3 cp s3://automated-data-pipeline/script.sh ~/ &amp;amp;&amp;amp; sudo sh ~/script.sh #{myS3DataPath}"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"maximumRetries"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"0"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"runsOn"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"ref"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"MyEC2Resource"&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In this object, the two most important fields are &lt;code&gt;command&lt;/code&gt; and &lt;code&gt;runsOn&lt;/code&gt;. In the &lt;code&gt;command&lt;/code&gt; field you define the bash command that you would like to run on the EC2 instance described earlier. I wrote a command that copies a bash script onto the EC2 instance and runs it. Note that I’m also giving it a parameter, &lt;code&gt;#{myS3DataPath}&lt;/code&gt;: the path to the data we want the pipeline to preprocess. It is passed as a parameter to add flexibility, so the pipeline can handle different data sets. The &lt;code&gt;runsOn&lt;/code&gt; field takes the ID of the EC2 resource we created earlier, so the shell command runs on that resource.  &lt;/p&gt;

&lt;h3&gt;
  
  
  Parameters
&lt;/h3&gt;

&lt;p&gt;Parameter placeholders should be written in the format &lt;code&gt;#{myPlaceholder}&lt;/code&gt;. Every parameter ID should start with the "my" prefix. Here is the parameters section of the definition JSON file:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="nl"&gt;"parameters"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"myS3DataPath"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"mys3DataPath"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"description"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"This is the path to the data uploaded"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"AWS::S3::ObjectKey"&lt;/span&gt;&lt;span class="w"&gt;

        &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We have defined that our parameter should be of the AWS S3 object key type. The whole data pipeline definition can be found &lt;a href="https://github.com/EshbanTheLearner/preprocessing-pipeline-demo/blob/main/definition.json" rel="noopener noreferrer"&gt;here&lt;/a&gt;. &lt;/p&gt;

&lt;p&gt;Now that you are done defining your pipeline, create it with the following command.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;aws datapipeline create-pipeline &lt;span class="nt"&gt;--name&lt;/span&gt; data-preprocessing-pipeline &lt;span class="nt"&gt;--unique-id&lt;/span&gt; data-preprocessing-pipeline

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Once created, you can put the definition in place. Note that we pass a temporary parameter value at this stage; it can later be overridden dynamically.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;aws datapipeline put-definition &lt;span class="nt"&gt;--pipeline-definition&lt;/span&gt; file://definition.json &lt;span class="se"&gt;\ &lt;/span&gt;&lt;span class="nt"&gt;--parameter-values&lt;/span&gt; &lt;span class="nv"&gt;s3DataPath&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&amp;lt;s3://your/s3/data/path&amp;gt; &lt;span class="nt"&gt;--pipeline-id&lt;/span&gt; &amp;lt;Your Pipeline ID&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
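
&lt;p&gt;For reference, this is roughly how you would activate the pipeline manually; the Lambda function we write later does the same thing through boto3, passing the real data path as the parameter value.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;aws datapipeline activate-pipeline &lt;span class="nt"&gt;--pipeline-id&lt;/span&gt; &amp;lt;Your Pipeline ID&amp;gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--parameter-values&lt;/span&gt; &lt;span class="nv"&gt;myS3DataPath&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&amp;lt;s3://your/s3/data/path&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;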



&lt;p&gt;Now that our data pipeline is defined and created, let’s write the bash script that will run on the compute resource of our data pipeline.&lt;/p&gt;

&lt;h2&gt;
  
  
  Bash Script
&lt;/h2&gt;

&lt;p&gt;This script will run on the EC2 instance that the pipeline launches as its compute resource. Its workings are simple: it makes a new working directory, exports that directory and the path to the data in S3 as environment variables, copies the preprocessing script into the working directory, runs it, and finally uploads the cleaned data back to S3. Here is the code you will need:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;#!/bin/bash&lt;/span&gt;

&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="s2"&gt;"Starting the process"&lt;/span&gt;

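&lt;span class="c"&gt;# Create a scratch working directory for this run&lt;/span&gt;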
&lt;span class="nb"&gt;sudo mkdir&lt;/span&gt; ~/data-pipeline-tmp
&lt;span class="nb"&gt;sudo chmod &lt;/span&gt;ugo+rwx ~/data-pipeline-tmp
&lt;span class="nb"&gt;cd&lt;/span&gt; ~/data-pipeline-tmp

&lt;span class="nv"&gt;CURRENT_DIR&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;eval&lt;/span&gt; &lt;span class="s2"&gt;"pwd"&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;

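&lt;span class="c"&gt;# S3 data path passed in by the ShellCommandActivity as #{myS3DataPath}&lt;/span&gt;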
&lt;span class="nv"&gt;DATA_PATH&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nv"&gt;$1&lt;/span&gt;

&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;WORKING_DIR&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nv"&gt;$CURRENT_DIR&lt;/span&gt;
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;S3_DATA_PATH&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nv"&gt;$DATA_PATH&lt;/span&gt;

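&lt;span class="c"&gt;# Fetch the preprocessing script from S3&lt;/span&gt;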
aws s3 &lt;span class="nb"&gt;cp &lt;/span&gt;s3://automated-data-pipeline/scripts/script.py &lt;span class="nv"&gt;$WORKING_DIR&lt;/span&gt;

python3 &lt;span class="nv"&gt;$WORKING_DIR&lt;/span&gt;/script.py

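&lt;span class="c"&gt;# Upload the cleaned dataset back to S3&lt;/span&gt;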
aws s3 &lt;span class="nb"&gt;cp&lt;/span&gt; &lt;span class="nv"&gt;$WORKING_DIR&lt;/span&gt;/twitter_data_cleaned.csv s3://automated-data-pipeline/outputs/
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In my case, the S3 bucket is named &lt;code&gt;automated-data-pipeline&lt;/code&gt;, and I have made folders to separate the different objects. This code can also be found &lt;a href="https://github.com/EshbanTheLearner/preprocessing-pipeline-demo/blob/main/script.sh" rel="noopener noreferrer"&gt;here&lt;/a&gt;. Next is the Python code that will preprocess the data.&lt;/p&gt;

&lt;h2&gt;
  
  
  Python Code
&lt;/h2&gt;

&lt;p&gt;This is the standard preprocessing code we will use to clean our dataset. Here’s the code you will need; feel free to add or remove steps according to your needs:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;pandas&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;re&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;nltk&lt;/span&gt;
&lt;span class="n"&gt;nltk&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;download&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;stopwords&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;nltk.corpus&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;stopwords&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt;  &lt;span class="n"&gt;nltk.stem&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;SnowballStemmer&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;

&lt;span class="n"&gt;path&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;environ&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;S3_DATA_PATH&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;Inside Python Script&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;Path = &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Loading Data&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read_csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;encoding&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ISO-8859-1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;names&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;label&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;date&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;flag&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tweet&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Data has &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; rows and &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; columns&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

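&lt;span class="c1"&gt;# Regex matching user mentions, URLs, and non-alphanumeric characters&lt;/span&gt;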
&lt;span class="n"&gt;TEXT_CLEANING_RE&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;@\S+|https?:\S+|http?:\S|[^A-Za-z0-9]+&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

&lt;span class="n"&gt;stop_words&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;stopwords&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;words&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;english&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;stemmer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;SnowballStemmer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;english&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;preprocess&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;stem&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="c1"&gt;# Remove link, user and special characters
&lt;/span&gt;    &lt;span class="n"&gt;text&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;re&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sub&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;TEXT_CLEANING_RE&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt; &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;lower&lt;/span&gt;&lt;span class="p"&gt;()).&lt;/span&gt;&lt;span class="nf"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;tokens&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;token&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;split&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;token&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;stop_words&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;stem&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="n"&gt;tokens&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;stemmer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;stem&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;token&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
            &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="n"&gt;tokens&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;token&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt; &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tokens&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Starting cleaning process&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;apply&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;preprocess&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Data cleaning completed, saving to CSV!&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;to_csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;twitter_data_cleaned.csv&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can also find this code &lt;a href="https://github.com/EshbanTheLearner/preprocessing-pipeline-demo/blob/main/script.py" rel="noopener noreferrer"&gt;here&lt;/a&gt;. You have successfully defined and created a working data pipeline that can work on its own (with manual activation). To make it truly event-driven, we need to write a cloud function that acts as a trigger: it handles certain events and activates our pipeline when required.&lt;/p&gt;

&lt;h1&gt;
  
  
  Event Handler AWS Lambda Function
&lt;/h1&gt;

&lt;p&gt;The title says we will be using an AWS Lambda function for this step, but I like to use Chalice. You can use either as you prefer; the code will be almost the same. Following are the steps to create the Chalice app, which runs on AWS Lambda and triggers the data pipeline. You will need the ID of the pipeline you created earlier for this step.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Create a Chalice app using &lt;code&gt;chalice new-project &amp;lt;NAME&amp;gt;&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Once the project is initialized, open the &lt;code&gt;app.py&lt;/code&gt; file&lt;/li&gt;
&lt;li&gt;Copy the contents of the following snippet into it. Code also available &lt;a href="https://github.com/EshbanTheLearner/preprocessing-pipeline-demo/blob/main/app.py" rel="noopener noreferrer"&gt;here&lt;/a&gt;.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;chalice&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Chalice&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;boto3&lt;/span&gt;

&lt;span class="n"&gt;app&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Chalice&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;app_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;pipeline-trigger&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;boto3&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;client&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;datapipeline&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# The pipeline you want to activate
&lt;/span&gt;&lt;span class="n"&gt;PIPELINE_ID&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;df-xxxxxxxxxxxxxxxxxxxx&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

&lt;span class="nd"&gt;@app.on_s3_event&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;bucket&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;automated-data-pipeline&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;events&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s3:ObjectCreated:*&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;prefix&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;preprocess/&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;suffix&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;.csv&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;activate_pipeline&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;app&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;log&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;debug&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Received event for bucket: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;bucket&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;, key: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;activate_pipeline&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;pipelineId&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;PIPELINE_ID&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;parameterValues&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;
                &lt;span class="p"&gt;{&lt;/span&gt;
                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;myS3DataPath&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;stringValue&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;s3://&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;bucket&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
                &lt;span class="p"&gt;}&lt;/span&gt;
            &lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;app&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;log&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;debug&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;Exception&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;app&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;log&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;critical&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;Change the arguments such as the &lt;code&gt;pipeline-id&lt;/code&gt; and the path to the S3 bucket&lt;/li&gt;
&lt;li&gt;Once done, deploy the Chalice app using &lt;code&gt;chalice deploy&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;If deployed successfully, go to the AWS console -&amp;gt; Lambda&lt;/li&gt;
&lt;li&gt;Select your Lambda function and go to the Permissions tab&lt;/li&gt;
&lt;li&gt;Click on the name of the Execution Role; this opens the IAM policy for that particular Lambda function&lt;/li&gt;
&lt;li&gt;Under the Permissions tab, click on the policy name to expand it&lt;/li&gt;
&lt;li&gt;Make sure that the policy has &lt;code&gt;iam:PassRole&lt;/code&gt; and the required Data Pipeline permissions&lt;/li&gt;
&lt;li&gt;To make life easier, here is an IAM policy that works:
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"Version"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"2012-10-17"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"Statement"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"Sid"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"VisualEditor0"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"Effect"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Allow"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"Action"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
                &lt;/span&gt;&lt;span class="s2"&gt;"iam:PassRole"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
                &lt;/span&gt;&lt;span class="s2"&gt;"datapipeline:*"&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"Resource"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"*"&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"Sid"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"VisualEditor1"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"Effect"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Allow"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"Action"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
                &lt;/span&gt;&lt;span class="s2"&gt;"logs:CreateLogStream"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
                &lt;/span&gt;&lt;span class="s2"&gt;"logs:CreateLogGroup"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
                &lt;/span&gt;&lt;span class="s2"&gt;"logs:PutLogEvents"&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"Resource"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"arn:*:logs:*:*:*"&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h1&gt;
  
  
  Testing
&lt;/h1&gt;

&lt;p&gt;To test this pipeline, upload the dataset to the S3 bucket path you specified in the trigger function. In my case, the path is &lt;code&gt;s3://automated-data-pipeline/preprocess/&lt;/code&gt;. This lets me run the following command in my terminal to upload the data, then simply sit back and wait for the output to appear in the S3 path I specified.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;aws s3 &lt;span class="nb"&gt;cp&lt;/span&gt; ~/training.1600000.processed.noemoticon.csv s3://automated-data-pipeline/preprocess/
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
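
&lt;p&gt;If you would rather check for the output programmatically than refresh the console, a minimal boto3 sketch like the one below can list whatever has landed under the output path. The &lt;code&gt;output/&lt;/code&gt; prefix here is only an assumption; substitute the path your pipeline actually writes to.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import boto3

s3 = boto3.client("s3")

# Hypothetical output location; replace with the path your pipeline writes to
BUCKET = "automated-data-pipeline"
PREFIX = "output/"

response = s3.list_objects_v2(Bucket=BUCKET, Prefix=PREFIX)
for obj in response.get("Contents", []):
    print(obj["Key"], obj["Size"])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;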



&lt;p&gt;After the pipeline has run its course, it automatically deletes the resources attached to it so you don’t incur any unwanted bills, and it uploads the processed data to your specified path, ready to be used. Now let’s observe a before-and-after state of the data. Here is what the data looked like in its raw form:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Ftvw2zzx4w6onxialcdgf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Ftvw2zzx4w6onxialcdgf.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Here is what the data looks like after going through the pipeline once:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fr9g340jgelv3mgwfp5q5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fr9g340jgelv3mgwfp5q5.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You can clearly observe the difference. You can also find the accompanying notebooks &lt;a href="https://github.com/EshbanTheLearner/preprocessing-pipeline-demo" rel="noopener noreferrer"&gt;here&lt;/a&gt; to take a closer look. &lt;/p&gt;

&lt;h1&gt;
  
  
  Conclusion
&lt;/h1&gt;

&lt;p&gt;I know there are a lot of steps involved in this process, but I assure you that once you have set up a pipeline like this, your life will be much easier. Still seems like a lot of work? Contact us at &lt;a href="mailto:help@traindex.io"&gt;help@traindex.io&lt;/a&gt; to consult on any data engineering/science problems you might be facing.&lt;/p&gt;

</description>
      <category>aws</category>
      <category>datascience</category>
      <category>bigdata</category>
    </item>
    <item>
      <title>Introduction to Data Pipelines</title>
      <dc:creator>Eshban Suleman</dc:creator>
      <pubDate>Mon, 26 Oct 2020 17:37:22 +0000</pubDate>
      <link>https://dev.to/traindex/introduction-to-data-pipelines-26o7</link>
      <guid>https://dev.to/traindex/introduction-to-data-pipelines-26o7</guid>
      <description>&lt;p&gt;If you are a growing data-driven organization, you might have been working to harvest large amounts of data to extract valuable insights from it. This can be costly and inefficient unless the data science team adopts the repeatable solutions to common problems. Although the specifics of organizations may vary, the basic principles remain the same. There are some common features that you can encapsulate into a data pipeline. Let’s look at a common problem and see how we overcame it.&lt;/p&gt;

&lt;p&gt;Our team members at Traindex used to perform recurring tasks manually. These tasks included data cleaning, model training, testing, and so on. By performing these tasks by hand, an engineer worked on the same thing again and again. This resulted in slow throughput, human error, and a lack of flexibility and centralization. &lt;/p&gt;

&lt;p&gt;To overcome this, we envisioned a data pipeline to do all the above tasks with minimal human intervention. We developed and deployed such a pipeline, and it has proven to be a breath of fresh air. In this article, we’ll look at what data pipelines are, the benefits of using them in a corporate setting, and finally, what an event-driven data pipeline is.&lt;/p&gt;

&lt;h1&gt;
  
  
  What is a Data Pipeline?
&lt;/h1&gt;

&lt;p&gt;In simple terms, a pipeline is nothing more than a set of steps performed in a particular order. A data pipeline is a set of processes performed on data as it moves from a source to a destination, also known as the sink. The source could be anything from online transactional databases to data lakes, and the sink could be anything from a data warehouse to a business intelligence system. The most common data pipeline is ETL, which extracts, transforms, and loads the data. The transformation step can include anything, depending on the business. Here is a detailed data pipeline diagram:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F485dkplisyhck3b47bn8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F485dkplisyhck3b47bn8.png" alt="Alt Text" width="722" height="267"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;An ETL pipeline is a type of data pipeline that performs operations in batches and is sometimes referred to as a batch data pipeline. Batch processing was the norm for a long time, but now other types of processing, like streaming and real-time processing, are available. The architecture of a data pipeline can vary a lot according to your business needs. For example, stream analytics for IoT applications keeps data flowing from hundreds of sensors for real-time analysis.&lt;/p&gt;
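
&lt;p&gt;As a toy illustration of the ETL idea, here is a minimal sketch in Python. It assumes a CSV source, pandas, and hypothetical file names; a real pipeline would swap in proper connectors for its source and sink.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import pandas as pd

def extract(path):
    # Pull raw data from the source (a CSV file in this sketch)
    return pd.read_csv(path)

def transform(df):
    # Business-specific transformation; here we just drop empty rows
    # and normalize the column names
    df = df.dropna(how="all")
    df.columns = [c.strip().lower() for c in df.columns]
    return df

def load(df, path):
    # Write the result to the sink (another file in this sketch)
    df.to_csv(path, index=False)

# "raw_data.csv" and "clean_data.csv" are placeholder names
load(transform(extract("raw_data.csv")), "clean_data.csv")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;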

&lt;p&gt;Now that we understand what a data pipeline is, let's discuss why it is important to use data pipelines in modern data-oriented applications. &lt;/p&gt;

&lt;h1&gt;
  
  
  Why use Data Pipelines?
&lt;/h1&gt;

&lt;p&gt;In modern data-driven organizations, almost all actions and decisions are based on insights gathered from data. Every department of the organization has certain authorizations, restrictions, and data needs. Often, organizations have a single entity that manages everyone's requirements, resulting in a data silo. In such situations, getting even simple insights becomes difficult and leads to data redundancy within departments. The effort required to obtain essential data also handicaps the organization. &lt;/p&gt;

&lt;h3&gt;
  
  
  Easy and Fast Access to Data
&lt;/h3&gt;

&lt;p&gt;Well-thought-out data pipelines give everyone in the organization easy and fast access to data, governed by the right permission roles. Anyone from any department can access the data they need with no intervention or interference.&lt;/p&gt;

&lt;h3&gt;
  
  
  Swift Decision Making
&lt;/h3&gt;

&lt;p&gt;Building on the previous point, fast access to data results in quick data-driven decisions. Such choices are supported by data and are less likely to go south.&lt;/p&gt;

&lt;h3&gt;
  
  
  Scalability
&lt;/h3&gt;

&lt;p&gt;Well-architected data pipelines can automatically scale up or down according to the users' or organization's needs. This saves administrators from having to keep a constant eye on resources and manually add or remove them as requirements change.&lt;/p&gt;

&lt;h3&gt;
  
  
  Reliability
&lt;/h3&gt;

&lt;p&gt;Well-written data pipelines improve data quality. The data becomes more reliable, and executives can make better decisions based on it. &lt;/p&gt;

&lt;h3&gt;
  
  
  Economically Efficient
&lt;/h3&gt;

&lt;p&gt;Automated data pipelines run independently and need minimal maintenance and human intervention, which means a smaller paid workforce. Their autonomous nature also allows them to remove unused resources and save costs.&lt;/p&gt;

&lt;p&gt;Since we now understand what a data pipeline is and its benefits, let us see how we crafted a pipeline according to our needs at Traindex.&lt;/p&gt;

&lt;h1&gt;
  
  
  Event-Driven Data Pipelines
&lt;/h1&gt;

&lt;p&gt;Based on the problem we discussed at the beginning of this article, we decided on an event-driven pipeline, which runs only in response to certain events. We wanted our pipeline to automatically run the data processing jobs, followed by training a machine learning model on the preprocessed data, and then run some tests once training completed. The triggering event, in our case, was an upload event. &lt;br&gt;
When a user or engineer moves data into a specified storage location, an event is generated. Once the upload completes, it triggers our pipeline. Scheduling is not optimal for this use case because we don’t know when the raw data will be uploaded to our storage; it can be frequent or occasional, so we went with the event-driven approach.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fk2m2q19d34w857s8jr1e.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fk2m2q19d34w857s8jr1e.png" alt="Alt Text" width="800" height="226"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  Conclusion
&lt;/h1&gt;

&lt;p&gt;We learned the importance of mining large datasets efficiently to get the best insights on time and stay ahead of the competition. Modern data-driven organizations should consider setting up data pipelines to provide their teams with correct and useful data a click away. Data pipelines can also automate recurring data-driven tasks like data preprocessing, model training, and testing, on a schedule or based on specific events. We hope you have found this article useful and will consider crafting some data pipeline solutions for your organization. You can consult us about your data engineering problems at &lt;a href="mailto:help@traindex.io"&gt;help@traindex.io&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>datascience</category>
      <category>bigdata</category>
      <category>dataengineering</category>
      <category>pipelines</category>
    </item>
    <item>
      <title>What is Semantic Search?</title>
      <dc:creator>Eshban Suleman</dc:creator>
      <pubDate>Mon, 21 Sep 2020 19:24:04 +0000</pubDate>
      <link>https://dev.to/traindex/what-is-semantic-search-3612</link>
      <guid>https://dev.to/traindex/what-is-semantic-search-3612</guid>
      <description>&lt;p&gt;How many times have you had a song's lyrics stuck in your head? Or wanted to search about something but don't know how to describe it? We all have gone through these scenarios in our lives. Who was always there to save the day? Yes, the internet! The power of modern search engines to search through vast amounts of information is unquestionable. They search through billions of webpages on the internet to give you what you need. Like searching for a needle in a haystack except sometimes, users cannot describe the needle.&lt;/p&gt;

&lt;p&gt;Retrieving relevant information from an extensive collection of documents is a challenge. Techniques like syntax analysis, string matching, KPS (Keyword, Pattern, Sample) search, semantic search, etc. each have their own merits. Yet semantic search stands out as superior.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Semantic Search?
&lt;/h2&gt;

&lt;p&gt;Semantic search is a searching technique that improves the accuracy and relevance of the results by understanding the user's intent through contextual meaning. It can answer questions whose exact wording is not present in the search space, and it can provide personalized search results based on different factors. Semantic search finds that forgotten song's lyrics, and it also surfaces important documents from your vast collection of corporate data.&lt;/p&gt;

&lt;h3&gt;
  
  
  Relevant Results
&lt;/h3&gt;

&lt;p&gt;Modern, powerful Machine Learning and Natural Language Processing algorithms enable the search engine to "understand" what the user has asked. The search engine analyzes the entities in sentences, the inter-dependence of words, synonyms, and context. Sometimes it analyzes other factors as well, such as browser history in web search engines. This allows users to get accurate results. &lt;/p&gt;

&lt;h3&gt;
  
  
  Better User Experience
&lt;/h3&gt;

&lt;p&gt;Getting accurate information quickly results in a better user experience. Semantic search is both quick and accurate, so the user experience improves.&lt;/p&gt;

&lt;h3&gt;
  
  
  Discover Knowledge
&lt;/h3&gt;

&lt;p&gt;Unlike keyword search, semantic search aims to understand the user's query and intent, so it retrieves results that share the same concepts and ideas. It can help discover new things about the same topics, which can be very useful. In a corporate setting, semantic search can also enhance business intelligence. For example, a keyword search over a resume database will take keywords like "python" AND "machine learning" and find only the resumes that contain those exact keywords. Semantic search, on the other hand, can take input like "machine learning python" and return both the resumes containing these terms and the resumes expressing similar ideas in different words.&lt;/p&gt;

&lt;h2&gt;
  
  
  Traindex and Semantic Search
&lt;/h2&gt;

&lt;p&gt;We understand the importance of semantic search, especially in corporate settings. Traindex implements semantic search solutions for your data collection, no matter what it is. To understand how we do it, consider the example of a library. A library can have thousands of books, yet a librarian can tell you exactly where a particular book is. How? By using topical indexes. Libraries divide books into topics; each subject has its own space, and the location of these doesn't change, so the librarian can point you to a specific book's exact location. Traindex implements semantic search in a similar way, using various machine learning and NLP algorithms to learn the topics and maintain an index for fast lookups. It can search a wide variety of data, from corporate resume data to patent data and other critical corporate data. We provide secure end-to-end pipelines to implement our solution, so our interaction with your data is minimal.  &lt;/p&gt;
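
&lt;p&gt;The topical-index idea can be sketched in a few lines of Python: documents are grouped under topics, and a query is first mapped to a topic and then matched only within it. The &lt;code&gt;topic_of&lt;/code&gt; function below is a toy stand-in for a real topic model, and the file names are made up.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# A toy topical index: each topic maps to the documents filed under it
index = {
    "machine learning": ["resume_003.txt", "resume_017.txt"],
    "data engineering": ["resume_008.txt", "resume_021.txt"],
}

def topic_of(query):
    # Stand-in for a real topic model: pick the topic that
    # shares the most words with the query
    words = set(query.lower().split())
    return max(index, key=lambda t: len(words.intersection(t.split())))

topic = topic_of("python machine learning engineer")
print(topic, ":", index[topic])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;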

&lt;h2&gt;
  
  
  How to Implement Semantic Search?
&lt;/h2&gt;

&lt;p&gt;There are a ton of different techniques and algorithms available for building a semantic search system. Choosing one depends on many factors, like the dataset, the resources available, urgency, etc. Traindex can implement any of these algorithms according to the requirements. Here are some of the most common ones.&lt;/p&gt;

&lt;h3&gt;
  
  
  Latent Semantic Indexing/Latent Dirichlet Allocation
&lt;/h3&gt;

&lt;p&gt;Both LSI and LDA take a bag of words formatted as a matrix as input. LSI uses SVD, a very popular matrix decomposition technique, to find latent dimensions, aka topics, in the input. In contrast, LDA is a generative probabilistic model that assumes a Dirichlet prior over the latent topics. Methods like TF-IDF can be used to build the input matrix, and then LSI or LDA can do its work and figure out the N topics in the input. The number of topics is a hyper-parameter and can be tuned based on factors such as data size, resource availability, etc. For an incoming query, the model finds the topic that best matches the input, and from that topic it finds the most relevant results and ranks them. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--scCDYwdC--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/n9kjwg3vydplq0f866um.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--scCDYwdC--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/n9kjwg3vydplq0f866um.png" alt="LSA"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Word2Vec/Doc2Vec
&lt;/h3&gt;

&lt;p&gt;Word2Vec and Doc2Vec are embedding techniques that have provided state-of-the-art results in various natural language processing tasks and have acted as a silver bullet for a lot of different NLP problems. The bag-of-words technique produces a sparse, very high-dimensional matrix. In contrast, the idea behind these embedding techniques is to represent the text as a fixed-size, low-dimensional dense vector that captures its semantic relationships. These representations can also be learned once and reused later. Embeddings have proven to work far better than previous techniques. Choosing between word2vec and doc2vec, again, depends on what sort of data you have. You can also use pre-trained embeddings for your semantic search engine.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--XlVJoBW2--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/865jrrcno76ico5ec0zj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--XlVJoBW2--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/865jrrcno76ico5ec0zj.png" alt="w2v_d2v"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Transformer Language Models
&lt;/h3&gt;

&lt;p&gt;Transformers are deep learning models that overcome the problems of long-range dependencies and long training times found in traditional models like RNNs, LSTMs, etc. They are parallelizable and can address a wide range of NLP tasks through fine-tuning, and they have been delivering back-to-back SOTA results recently. Some common transformer models used these days are BERT, GPT-2, GPT-3, XLNet, Reformer, RoBERTa, etc. Although most of these models are generative, you can use them in your semantic search system by fine-tuning them or by using them to generate embeddings for your text. &lt;/p&gt;
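
&lt;p&gt;For example, a pre-trained transformer can serve purely as an embedding generator for semantic search. This sketch assumes the &lt;code&gt;sentence-transformers&lt;/code&gt; package and one of its published pre-trained models:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from sentence_transformers import SentenceTransformer, util

# "all-MiniLM-L6-v2" is one of the published pre-trained models
model = SentenceTransformer("all-MiniLM-L6-v2")

corpus = ["How to train a machine learning model",
          "Best hiking trails near the city"]
query = "fitting an ML model"

corpus_emb = model.encode(corpus, convert_to_tensor=True)
query_emb = model.encode(query, convert_to_tensor=True)

# Cosine similarity ranks the semantically closest document first
print(util.cos_sim(query_emb, corpus_emb))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;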

&lt;h2&gt;
  
  
  Take Away
&lt;/h2&gt;

&lt;p&gt;Searching for useful and relevant information in an extensive collection of text-based documents is arduous. Semantic search allows us to do it smartly. Search engines already do this, and Traindex can provide you with your very own custom semantic search system based on your data. Sound amazing? Click &lt;a href="https://www.traindex.io/"&gt;here&lt;/a&gt; to request a demo. &lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>nlp</category>
      <category>datascience</category>
    </item>
  </channel>
</rss>
