<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Eshban Suleman</title>
    <description>The latest articles on DEV Community by Eshban Suleman (@eshbanthelearner).</description>
    <link>https://dev.to/eshbanthelearner</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F448053%2F5efef916-3aa6-44b3-867b-e7a6579da60a.jpg</url>
      <title>DEV Community: Eshban Suleman</title>
      <link>https://dev.to/eshbanthelearner</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/eshbanthelearner"/>
    <language>en</language>
    <item>
      <title>A 100 Day #thePersonalMSDS Journey</title>
      <dc:creator>Eshban Suleman</dc:creator>
      <pubDate>Sat, 04 Sep 2021 19:33:38 +0000</pubDate>
      <link>https://dev.to/eshbanthelearner/a-100-day-thepersonalmsds-journey-1gc5</link>
      <guid>https://dev.to/eshbanthelearner/a-100-day-thepersonalmsds-journey-1gc5</guid>
      <description>&lt;p&gt;The Machine Learning landscape is in a state of continuous change. New research, technologies and tools are put out every day. This sometimes makes it hard to keep up with the latest trends. Besides that, the vastness of the domain can induce the imposter syndrome in practitioners. This is perfectly put in the following tweet &lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--wRrj9vV7--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/c15wpfsooyzrhawb8pny.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--wRrj9vV7--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/c15wpfsooyzrhawb8pny.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I too have felt this over the years: either I feel that I know too little, or I feel out of touch. To combat this, I’ve been following challenges to stay on top of my skills and learn new ones along the way. One of the challenges I recently completed is #thePersonalMSDS. &lt;/p&gt;

&lt;p&gt;#thePersonalMSDS was an initiative by one of my seniors, &lt;a href="https://www.linkedin.com/in/mhjhamza/"&gt;Muhammad Hamza Javaid&lt;/a&gt;, to get industry professionals and students to follow a self-curated Data Science Masters roadmap to develop new skills and hone existing ones. Participants decide the number of days (usually 100) and the number of hours they want to dedicate to learning each day. I first came across it in January 2020 and pledged 100 days of following a customized roadmap, which I completed from January 13th 2020 to April 22nd 2020. During those 100 days, I studied various topics with the help of online courses and articles, including: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Statistics and Probability&lt;/li&gt;
&lt;li&gt;Big Data with Apache Spark&lt;/li&gt;
&lt;li&gt;AI for Business&lt;/li&gt;
&lt;li&gt;Investment Fundamentals &amp;amp; Data Analytics&lt;/li&gt;
&lt;li&gt;Data Engineering on GCP&lt;/li&gt;
&lt;li&gt;Basic Bash Scripting and Shell Programming&lt;/li&gt;
&lt;li&gt;Data Science Project Management&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;As you can see, I customized my learning path based on my needs and interests. This challenge not only helped me learn new skills but also got me back on top of my existing ones. More recently, I pledged the last 100 days, from May 26th 2021 to September 2nd 2021, to the #thePersonalMSDS challenge. I learned some topics that were new to me and also worked on skills I already had. I got a discount coupon for the Databricks Data Science Pathway and spent 36 days completing it, earning 41 certificates, some of which you can check &lt;a href="https://www.linkedin.com/in/eshban-suleman-624a49113/"&gt;here&lt;/a&gt;. Some other topics/technologies I studied apart from this were:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Deploying Machine Learning Models&lt;/li&gt;
&lt;li&gt;Spatial Analysis and Geospatial Data Science&lt;/li&gt;
&lt;li&gt;Data Privacy&lt;/li&gt;
&lt;li&gt;ElasticSearch (ELK Stack)&lt;/li&gt;
&lt;li&gt;HuggingFace Transformers&lt;/li&gt;
&lt;li&gt;Customer Segmentation&lt;/li&gt;
&lt;li&gt;Time Series Analysis and Forecasting&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You can track my detailed learnings &lt;a href="https://github.com/EshbanTheLearner/thepersonalMSDS-v2/blob/main/todayilearned.md"&gt;here&lt;/a&gt;. A question I get a lot is how I find the motivation to start and keep going. It’s a great question, and it is a very common problem. I too have gone off track a few times, so through trial and error I worked out some methods that work for me. I hope you find them useful too.&lt;/p&gt;

&lt;h1&gt;
  
  
  Plan Ahead of Time
&lt;/h1&gt;

&lt;p&gt;A good plan will help you stay on top of your skills, and it’ll show how self-aware you are regarding your strengths and weaknesses. I like to make two separate lists: one dedicated to topics and skills I want to learn, and one for skills I’ve already learnt but either feel out of touch with or just want to study in depth. Then I pick topics from both lists that I feel are both important and fun. Remember, you can always add or remove topics later.&lt;/p&gt;

&lt;h1&gt;
  
  
  Find a Community
&lt;/h1&gt;

&lt;p&gt;Get your friends and/or colleagues to sign up for the challenge with you. If nobody wants to join, find people on the internet with the same interests and become part of online study groups. Most importantly, share your daily progress on the internet with proper hashtags. It’ll get you the exposure you need to find people interested in what you’re doing, and it’ll keep you motivated to meet the daily goal.&lt;/p&gt;

&lt;h1&gt;
  
  
  Stay Positive
&lt;/h1&gt;

&lt;p&gt;Maintaining a routine like this alongside work or studies can be cumbersome and frustrating at times. Sometimes it may feel like you are going nowhere, but that is the moment to look at how far you’ve come, how many new things you’ve learned, and how many people you’ve connected with along the way. This will help you stay positive and motivated.&lt;/p&gt;

&lt;h1&gt;
  
  
  Take Breaks
&lt;/h1&gt;

&lt;p&gt;Self-learning is all about flexibility. You don’t need to burden yourself with covering a lot of topics in a short period of time. If you’re feeling tired, just take a break. Take as many breaks as necessary to relieve your stress and come back more focused. You are in charge. &lt;/p&gt;

&lt;h1&gt;
  
  
  Have Fun
&lt;/h1&gt;

&lt;p&gt;The most important factor in staying motivated is to have fun while learning. The more fun you make your learning, the more you’ll look forward to it. Everyone has their own methods of having fun: you can do mini-projects using the skills you’re learning, make video tutorials, write blogs about it, etc. I like to take handwritten notes and do mini-projects. Pick your poison. &lt;/p&gt;

&lt;p&gt;So, if you are planning to learn something new or even brush up on your skills, start today, start now, because tomorrow never comes. I wish you all the very best for your future. &lt;/p&gt;


&lt;p&gt;&lt;a href="https://giphy.com/gifs/shia-labeouf-just-do-it-J7jsbfcJ2O5eo"&gt;via GIPHY&lt;/a&gt;&lt;/p&gt;

</description>
      <category>productivity</category>
      <category>motivation</category>
      <category>datascience</category>
    </item>
    <item>
      <title>Alternatives to Google Patents</title>
      <dc:creator>Eshban Suleman</dc:creator>
      <pubDate>Sat, 20 Feb 2021 09:56:01 +0000</pubDate>
      <link>https://dev.to/traindex/alternatives-to-google-patents-4j4b</link>
      <guid>https://dev.to/traindex/alternatives-to-google-patents-4j4b</guid>
      <description>&lt;p&gt;There are multiple tools available over the internet to check the similarity of a claim or a patent. There are pros and cons of every tool and a user can sometimes have a hard time deciding what to use where. In such situations, people tend to use the services they trust. People tend to rely on big tech companies when it comes to choosing between a variety of options because they are perceived to be doing well in every area. Such is the case with Google Patents. &lt;/p&gt;

&lt;p&gt;Although Google Patents is a good all-round search engine for patent data, it does have some disadvantages. In this article, we will look at some of the more obvious cons of Google Patents and then at some other services available online. And if you are not familiar with the concept of a patent search or how to conduct one, have a look at our article &lt;a href="https://www.traindex.io/blog/patent-search-4j05"&gt;Patent Search&lt;/a&gt;.&lt;/p&gt;

&lt;h1&gt;
  
  
  Some Shortfalls of Google Patents
&lt;/h1&gt;

&lt;p&gt;This article is not aimed at discrediting Google Patents as a search engine for patents; rather, the goal is to familiarize the reader with some alternatives to it. So let’s first discuss why one might decide not to use Google Patents.&lt;/p&gt;

&lt;h2&gt;
  
  
  1. Semi-Semantic Behavior
&lt;/h2&gt;

&lt;p&gt;Google Patents has been observed to show semi-semantic behavior. It is a keyword-based search at its core, but it can extract some semantically similar results. Sometimes this is useful, but most of the time it searches for unrelated synonyms. Following is an example of this behavior. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--BK7w6x8t--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/rc6ujdqejbw7ygkmmunt.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--BK7w6x8t--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/rc6ujdqejbw7ygkmmunt.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This is not necessarily bad behavior, but it does affect the results. &lt;/p&gt;

&lt;h2&gt;
  
  
  2. Bad with Acronyms
&lt;/h2&gt;

&lt;p&gt;As with all keyword-based searches, Google Patents also seems to struggle with acronyms. The most common example is the acronym AIDS (Acquired Immune Deficiency Syndrome), which is often confused with the word “aids”, a verb meaning “to help”. So you might get a lot of false positives if your query contains such acronyms. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--tK4F2dE1--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/rq2nsu7kebba3fcwe990.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--tK4F2dE1--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/rq2nsu7kebba3fcwe990.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  3. Empty Results
&lt;/h2&gt;

&lt;p&gt;Google Patents shows its keyword-search behavior here as well. If the keywords are very rare, it might return zero results. Semantic search engines usually shine in this department, but Google Patents is not one of them. &lt;/p&gt;

&lt;h2&gt;
  
  
  4. Unable to Process Scientific Jargon
&lt;/h2&gt;

&lt;p&gt;Patents usually cover complex, novel scientific inventions and thus contain a lot of “science language”, but Google Patents is usually unable to retrieve results when queried with scientific jargon such as chemical formulas. &lt;/p&gt;

&lt;h2&gt;
  
  
  5. Missing Citations
&lt;/h2&gt;

&lt;p&gt;Some patents went missing during a data transfer. Due to this, citations are missing from some patents.&lt;/p&gt;

&lt;h2&gt;
  
  
  6. Disclosure Risk
&lt;/h2&gt;

&lt;p&gt;Google tracks search activity according to its &lt;a href="https://policies.google.com/privacy?hl=en"&gt;Privacy Policy&lt;/a&gt;. According to &lt;a href="https://www.uspto.gov/web/offices/pac/mpep/s904.html"&gt;MPEP 904.02(c)&lt;/a&gt; of the Manual of Patent Examining Procedure by the United States Patent and Trademark Office (USPTO), examiners are allowed to use tools and the internet to search for the prior art of any claim under examination, but they are not allowed to use any proprietary information as a query; instead, they are advised to use a general state-of-the-art query to get similar results. Simply put, to check whether the claim under inspection is similar or identical to any published claim, you can use any service on the internet, but you shouldn’t provide any information about the claim that might compromise its confidentiality. Since Google Patents is a keyword-based search, it is difficult to come up with a query that balances the confidentiality of your claim against the need to find similar or identical existing claims. Thus your case might always be at risk if Google Patents is being used.&lt;/p&gt;

&lt;p&gt;I think these are more than enough reasons to try something different this time. Let’s now discuss some of the alternatives to Google Patents.&lt;/p&gt;

&lt;h1&gt;
  
  
  Patentscope
&lt;/h1&gt;

&lt;p&gt;&lt;a href="https://patentscope.wipo.int/search/en/search.jsf"&gt;Patentscope&lt;/a&gt; is a patent search service by the World Intellectual Property Organization (&lt;a href="https://www.wipo.int/portal/en/index.html"&gt;WIPO&lt;/a&gt;). You can search over 92 million patents worldwide and can also enhance your search results by filtering them using certain meta-level filters. It is a free global search engine technology information. It doesn’t employ any spelling correction technique nor does it enable to use chemical compounds as a query on the open version. Also, it strictly searches for words in the query and not their other forms, so no lemmatization is observed. It also returns zero results if even one word in the query is out of its vocabulary. &lt;/p&gt;

&lt;h1&gt;
  
  
  Espacenet
&lt;/h1&gt;

&lt;p&gt;&lt;a href="https://worldwide.espacenet.com/"&gt;Escapenet&lt;/a&gt; by European Patent Office (EPO) is also a keyword-based patent search on over 120 million patents. It has all the characteristics of keyword search such as advanced search features and metadata-based filters. Unlike Patentscope, it uses lemmatization to get different word forms too and supports multiple European languages. The base search only allows up to 10 keywords.&lt;/p&gt;

&lt;h1&gt;
  
  
  lens.org
&lt;/h1&gt;

&lt;p&gt;&lt;a href="https://www.lens.org/"&gt;lens.org&lt;/a&gt; provides search services for different scholarly datasets including patent data of 125.4 million patent records. It has very fine-grained advanced search filters and has patents from all around the world. It uses Apache Lucene and Elasticsearch for text search and shows a semi-semantic behavior. It also supports spelling correction and handles acronyms better than the previous two options. Still, it doesn’t search for chemical compounds, etc, and is susceptible to return empty results. &lt;/p&gt;

&lt;h1&gt;
  
  
  Traindex
&lt;/h1&gt;

&lt;p&gt;&lt;a href="https://www.traindex.io/"&gt;Traindex&lt;/a&gt; is a semantic search engine, unlike others in this list. It uses Machine Learning to find patents that are semantically similar to the query. It searches over Google Public Patents Data and can be integrated very easily into your applications. It can accept texts of various lengths, you can enter whole patent documents and it will handle that easily. Since it is a semantic search engine, it outstands in retrieving desired results for even a very unique set of queries. One of the things that make it stand out is that it doesn’t track search data and lets you use their API safely. Does this look like something you want to know more about? How about you schedule a demo &lt;a href="https://www.traindex.io/"&gt;here&lt;/a&gt; and we will walk you through the process. &lt;/p&gt;

&lt;p&gt;The goal of this article was to point out some areas where Google Patents falls short and to provide some alternative resources, so you can use the right tool for your problems without compromising privacy and security. If you’re still unsure, you can reach us at &lt;a href="mailto:help@traindex.io"&gt;help@traindex.io&lt;/a&gt; and we would be happy to guide you further.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Event Driven Data Pipelines in AWS</title>
      <dc:creator>Eshban Suleman</dc:creator>
      <pubDate>Mon, 30 Nov 2020 17:52:07 +0000</pubDate>
      <link>https://dev.to/traindex/event-driven-data-pipelines-in-aws-480i</link>
      <guid>https://dev.to/traindex/event-driven-data-pipelines-in-aws-480i</guid>
      <description>&lt;p&gt;In a data-driven organization, there is a constant need to provide vast amounts of data to the teams. There are many tools available to aid your requirements and needs. Choosing the right tool can be a little challenging and overwhelming at times. The basic principle you can keep in mind is that there is no right tool or architecture, it depends on what you need. &lt;/p&gt;

&lt;p&gt;In this guide, I’m going to show you how to build a simple event-driven data pipeline in AWS. Pipelines are often scheduled or interval-based; the event-driven approach, however, is distinctive and a good starting point. Instead of trying to figure out the right activation intervals for the pipeline, you can use an event handler that reacts to certain events and activates your pipeline. &lt;/p&gt;

&lt;p&gt;To learn more about the problem we were solving at Traindex and why a data pipeline was the right choice for us, refer to my previous article &lt;a href="https://www.traindex.io/blog/introduction-to-data-pipelines-26o7" rel="noopener noreferrer"&gt;Introduction to Data Pipelines&lt;/a&gt;. &lt;/p&gt;

&lt;p&gt;As an example, we will use the “Sentiment140 dataset with 1.6 million tweets”, which is available on Kaggle. Our goal is to set up a data preprocessing pipeline. Once you upload the CSV file to a specified bucket, an event is generated. A Lambda function handles that event and activates the pipeline. The pipeline itself is built on AWS Data Pipeline, a web service that helps you process and move data between different AWS compute and storage services. The pipeline provisions a compute resource and runs your preprocessing code on it. Once the data is cleaned and preprocessed, it is uploaded to the specified bucket for later use. Based on these objectives, we can divide the task into the following sub-tasks:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Creating a pre-configured AMI&lt;/li&gt;
&lt;li&gt;Defining AWS data pipeline architecture&lt;/li&gt;
&lt;li&gt;Writing the event handler AWS Lambda function&lt;/li&gt;
&lt;li&gt;Integrating everything&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Before diving into the steps, make sure the following preconditions are met:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;An AWS account with the required IAM privileges&lt;/li&gt;
&lt;li&gt;The “Sentiment140 dataset with 1.6 million tweets”, already downloaded&lt;/li&gt;
&lt;li&gt;An active internet connection&lt;/li&gt;
&lt;/ul&gt;

&lt;h1&gt;
  
  
  Pre-Configured AMI
&lt;/h1&gt;

&lt;p&gt;This step can be optional depending on your requirements, but it is good to have a pre-configured AMI that you can use for the compute resources. Follow these steps to create one: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Go to the AWS console, click on the Services dropdown menu, and select EC2&lt;/li&gt;
&lt;li&gt;On the EC2 dashboard, select Launch an Instance&lt;/li&gt;
&lt;li&gt;Select the Amazon Linux AMI 2018.03.0 (HVM), SSD Volume Type - ami-01fee56b22f308154 &lt;/li&gt;
&lt;li&gt;Select the General Purpose t2.micro which is free-tier eligible&lt;/li&gt;
&lt;li&gt;Click on Review and Launch and then click Launch to launch this EC2 instance&lt;/li&gt;
&lt;li&gt;Now go to the EC2 dashboard and select your EC2 Instance. Copy the public DNS and SSH into your created instance. &lt;/li&gt;
&lt;li&gt;Now, install all the required packages, tools, and libraries using standard Linux commands (see the sketch after this list).&lt;/li&gt;
&lt;li&gt;Also, set up any credentials you might require later like AWS credentials, etc.&lt;/li&gt;
&lt;li&gt;Once satisfied with your instance, it’s time to create an AMI image from this instance.&lt;/li&gt;
&lt;li&gt;Go to the EC2 dashboard and right-click on your instance. Click on Actions, select Image, and click on Create Image.&lt;/li&gt;
&lt;li&gt;Keep the default settings and create the image by clicking on Create Image.&lt;/li&gt;
&lt;li&gt;It’ll take a couple of minutes and once it’s done, go ahead and terminate the instance you created. You will only need the AMI ID in the next phases.&lt;/li&gt;
&lt;/ul&gt;
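
&lt;p&gt;As a rough sketch, the installation step above might look something like the following on the Amazon Linux AMI. The packages here are only an assumption based on what the preprocessing code later in this guide needs (pandas, nltk, and s3fs so pandas can read directly from S3; the AWS CLI ships with Amazon Linux); adjust them to your own stack.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;# Hypothetical AMI setup; package names may differ on other AMIs&lt;/span&gt;
sudo yum update -y
sudo yum install -y python36 python36-pip

&lt;span class="c"&gt;# Libraries used by the preprocessing script later in this guide&lt;/span&gt;
sudo python3 -m pip install pandas nltk s3fs

&lt;span class="c"&gt;# Either configure AWS credentials now so they are baked into the AMI,&lt;/span&gt;
&lt;span class="c"&gt;# or attach an IAM role to the instance profile instead&lt;/span&gt;
aws configure
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;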

&lt;h1&gt;
  
  
  AWS Data Pipeline Architecture
&lt;/h1&gt;

&lt;p&gt;The main idea behind this step is to set up a data pipeline that, upon certain triggers, launches an EC2 instance. We will then have a bash script run on that instance, responsible for moving our raw data back and forth and running our preprocessing Python script. This step can be further divided into three main subsections, so let’s get to it.&lt;/p&gt;

&lt;h2&gt;
  
  
  AWS Data Pipeline Architecture Definition
&lt;/h2&gt;

&lt;p&gt;First of all, let’s define the AWS data pipeline architecture. We do so by writing a JSON file that defines and describes our data pipeline and provides it with all the required logic. I’ll break it down as much as needed, but you can always refer to the documentation to explore more options. The data pipeline definition can include different pieces of information, such as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Names, locations, and formats of your data sources&lt;/li&gt;
&lt;li&gt;Activities that transform the data&lt;/li&gt;
&lt;li&gt;The schedule for those activities&lt;/li&gt;
&lt;li&gt;Resources that run your activities and preconditions&lt;/li&gt;
&lt;li&gt;Preconditions that must be satisfied before the activities can be scheduled&lt;/li&gt;
&lt;li&gt;Ways to alert you with status updates as pipeline execution proceeds&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We can express the data pipeline definition in three parts: objects, parameters, and values.&lt;/p&gt;

&lt;h3&gt;
  
  
  Objects
&lt;/h3&gt;

&lt;p&gt;Below you can see the syntax of the definition.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"objects"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
       &lt;/span&gt;&lt;span class="nl"&gt;"name1"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"value1"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
       &lt;/span&gt;&lt;span class="nl"&gt;"name2"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"value2"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
       &lt;/span&gt;&lt;span class="nl"&gt;"name1"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"value3"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
       &lt;/span&gt;&lt;span class="nl"&gt;"name3"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"value4"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
       &lt;/span&gt;&lt;span class="nl"&gt;"name4"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"value5"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Following the above syntax, we can place our required objects one by one. First of all, we need to define the pipeline object. We will define fields like the ID, name, IAM and resource roles, the path for saving pipeline logs, and the schedule type. You can add or remove fields based on your requirements, and you should look at the official documentation to learn more about these and other fields.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Default"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Default"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"failureAndRerunMode"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"CASCADE"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"resourceRole"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"DataPipelineDefaultResourceRole"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"role"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"DataPipelineDefaultRole"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"pipelineLogUri"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"s3://automated-data-pipeline/logs/"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"scheduleType"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"ONDEMAND
    },

&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can use this object with one change: the &lt;code&gt;pipelineLogUri&lt;/code&gt; field. Set it to the path of the S3 bucket you want to save your logs in. The next object in our definition is the compute resource, i.e. the EC2 resource.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"MyEC2Resource"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Ec2Resource"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"imageId"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"ami-xxxxxxxxxxxxxxxxx"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"instanceType"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"r5.large"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"spotBidPrice"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"2.0"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"terminateAfter"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"30 Minutes"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"actionOnTaskFailure"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"terminate"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"maximumRetries"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"1"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"role"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"DataPipelineDefaultRole"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"resourceRole"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"DataPipelineDefaultResourceRole"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"keyPair"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"&amp;lt;YOUR-KEY&amp;gt;"&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="err"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We have described our compute needs in this object: for example, an EC2 instance of type r5.large on spot pricing, using your key pair. Also, remember to put the pre-configured AMI ID in the &lt;code&gt;imageId&lt;/code&gt; field so the instance launches with all of the configuration already in place. Now, let’s move on to the next and last object, the shell activity. This object runs our shell script, which in turn runs our preprocessing code.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"ShellCommandActivityObj"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"ShellCommandActivityObj"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"ShellCommandActivity"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"command"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"aws s3 cp s3://automated-data-pipeline/script.sh ~/ &amp;amp;&amp;amp; sudo sh ~/script.sh #{myS3DataPath}"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"maximumRetries"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"0"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"runsOn"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"ref"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"MyEC2Resource"&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In this object, the two most important fields are &lt;code&gt;command&lt;/code&gt; and &lt;code&gt;runsOn&lt;/code&gt;. In the &lt;code&gt;command&lt;/code&gt; field you define the bash command that you would like to run on the EC2 instance described earlier. I wrote a command that copies a bash script onto the EC2 instance and runs it. Note that I’m also giving it a parameter, &lt;code&gt;#{myS3DataPath}&lt;/code&gt;: the path to the data we want the pipeline to preprocess. It is passed as a parameter to add flexibility, so the pipeline can handle different data sets. The &lt;code&gt;runsOn&lt;/code&gt; field takes the ID of the EC2 resource we created earlier, so the shell command runs on that resource.  &lt;/p&gt;

&lt;h3&gt;
  
  
  Parameters
&lt;/h3&gt;

&lt;p&gt;Parameter placeholders should be written in the format &lt;code&gt;#{myPlaceholder}&lt;/code&gt;. Every parameter ID should start with the "my" prefix. Here is the parameters section of the definition JSON file:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="nl"&gt;"parameters"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"myS3DataPath"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"mys3DataPath"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"description"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"This is the path to the data uploaded"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"AWS::S3::ObjectKey"&lt;/span&gt;&lt;span class="w"&gt;

        &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We have defined that our parameter should be of the AWS S3 object key type. The whole data pipeline definition can be found &lt;a href="https://github.com/EshbanTheLearner/preprocessing-pipeline-demo/blob/main/definition.json" rel="noopener noreferrer"&gt;here&lt;/a&gt;. &lt;/p&gt;

&lt;p&gt;Now that you are done defining your pipeline, create it with the following command.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;aws datapipeline create-pipeline &lt;span class="nt"&gt;--name&lt;/span&gt; data-preprocessing-pipeline &lt;span class="nt"&gt;--unique-id&lt;/span&gt; data-preprocessing-pipeline

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Once created, you can put the definition in place. Note that we pass a temporary parameter value at this stage; it can later be overridden dynamically.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;aws datapipeline put-definition &lt;span class="nt"&gt;--pipeline-definition&lt;/span&gt; file://definition.json &lt;span class="se"&gt;\ &lt;/span&gt;&lt;span class="nt"&gt;--parameter-values&lt;/span&gt; &lt;span class="nv"&gt;s3DataPath&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&amp;lt;s3://your/s3/data/path&amp;gt; &lt;span class="nt"&gt;--pipeline-id&lt;/span&gt; &amp;lt;Your Pipeline ID&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
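
&lt;p&gt;For reference, this is roughly how you would activate the pipeline manually; the Lambda function we write later does the same thing through boto3, passing the real data path as the parameter value.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;aws datapipeline activate-pipeline &lt;span class="nt"&gt;--pipeline-id&lt;/span&gt; &amp;lt;Your Pipeline ID&amp;gt; &lt;span class="se"&gt;\&lt;/span&gt;
    &lt;span class="nt"&gt;--parameter-values&lt;/span&gt; &lt;span class="nv"&gt;myS3DataPath&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&amp;lt;s3://your/s3/data/path&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;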



&lt;p&gt;Now that our data pipeline is defined and created, let’s write the bash script that will run on the compute resource of our data pipeline.&lt;/p&gt;

&lt;h2&gt;
  
  
  Bash Script
&lt;/h2&gt;

&lt;p&gt;This script will run on the EC2 instance that the pipeline launches as its compute resource. Its workings are simple: it makes a new working directory, exports that directory and the path to the data in S3 as environment variables, copies the preprocessing script into the working directory, runs it, and finally uploads the cleaned data back to S3. Here is the code you will need:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;#!/bin/bash&lt;/span&gt;

&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="s2"&gt;"Starting the process"&lt;/span&gt;

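&lt;span class="c"&gt;# Create a scratch working directory for this run&lt;/span&gt;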
&lt;span class="nb"&gt;sudo mkdir&lt;/span&gt; ~/data-pipeline-tmp
&lt;span class="nb"&gt;sudo chmod &lt;/span&gt;ugo+rwx ~/data-pipeline-tmp
&lt;span class="nb"&gt;cd&lt;/span&gt; ~/data-pipeline-tmp

&lt;span class="nv"&gt;CURRENT_DIR&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;eval&lt;/span&gt; &lt;span class="s2"&gt;"pwd"&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;

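&lt;span class="c"&gt;# S3 data path passed in by the ShellCommandActivity as #{myS3DataPath}&lt;/span&gt;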
&lt;span class="nv"&gt;DATA_PATH&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nv"&gt;$1&lt;/span&gt;

&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;WORKING_DIR&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nv"&gt;$CURRENT_DIR&lt;/span&gt;
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;S3_DATA_PATH&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nv"&gt;$DATA_PATH&lt;/span&gt;

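&lt;span class="c"&gt;# Fetch the preprocessing script from S3&lt;/span&gt;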
aws s3 &lt;span class="nb"&gt;cp &lt;/span&gt;s3://automated-data-pipeline/scripts/script.py &lt;span class="nv"&gt;$WORKING_DIR&lt;/span&gt;

python3 &lt;span class="nv"&gt;$WORKING_DIR&lt;/span&gt;/script.py

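&lt;span class="c"&gt;# Upload the cleaned dataset back to S3&lt;/span&gt;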
aws s3 &lt;span class="nb"&gt;cp&lt;/span&gt; &lt;span class="nv"&gt;$WORKING_DIR&lt;/span&gt;/twitter_data_cleaned.csv s3://automated-data-pipeline/outputs/
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In my case, the S3 bucket is named &lt;code&gt;automated-data-pipeline&lt;/code&gt;, and I have made folders to separate the different objects. This code can also be found &lt;a href="https://github.com/EshbanTheLearner/preprocessing-pipeline-demo/blob/main/script.sh" rel="noopener noreferrer"&gt;here&lt;/a&gt;. Next is the Python code that will preprocess the data.&lt;/p&gt;

&lt;h2&gt;
  
  
  Python Code
&lt;/h2&gt;

&lt;p&gt;This is the standard preprocessing code we will use to clean our dataset. Here’s the code you will need; feel free to add or remove steps according to your needs:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;pandas&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;re&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;nltk&lt;/span&gt;
&lt;span class="n"&gt;nltk&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;download&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;stopwords&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;nltk.corpus&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;stopwords&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt;  &lt;span class="n"&gt;nltk.stem&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;SnowballStemmer&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;

&lt;span class="n"&gt;path&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;environ&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;S3_DATA_PATH&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;Inside Python Script&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;Path = &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Loading Data&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read_csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;encoding&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ISO-8859-1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;names&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;label&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;date&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;flag&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tweet&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Data has &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; rows and &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; columns&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

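&lt;span class="c1"&gt;# Regex matching user mentions, URLs, and non-alphanumeric characters&lt;/span&gt;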
&lt;span class="n"&gt;TEXT_CLEANING_RE&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;@\S+|https?:\S+|http?:\S|[^A-Za-z0-9]+&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

&lt;span class="n"&gt;stop_words&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;stopwords&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;words&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;english&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;stemmer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;SnowballStemmer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;english&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;preprocess&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;stem&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="c1"&gt;# Remove link, user and special characters
&lt;/span&gt;    &lt;span class="n"&gt;text&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;re&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sub&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;TEXT_CLEANING_RE&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt; &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;lower&lt;/span&gt;&lt;span class="p"&gt;()).&lt;/span&gt;&lt;span class="nf"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;tokens&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;token&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;split&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;token&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;stop_words&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;stem&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="n"&gt;tokens&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;stemmer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;stem&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;token&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
            &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="n"&gt;tokens&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;token&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt; &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tokens&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Starting cleaning process&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;apply&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;preprocess&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Data cleaning completed, saving to CSV!&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;to_csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;twitter_data_cleaned.csv&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can also find this code &lt;a href="https://github.com/EshbanTheLearner/preprocessing-pipeline-demo/blob/main/script.py" rel="noopener noreferrer"&gt;here&lt;/a&gt;. You have successfully defined and created a working data pipeline that can work on its own (with manual activation). To make it truly event-driven, we need to write a cloud function that acts as a trigger: it handles certain events and activates our pipeline when required.&lt;/p&gt;

&lt;h1&gt;
  
  
  Event Handler AWS Lambda Function
&lt;/h1&gt;

&lt;p&gt;The title says we will be using an AWS Lambda function for this step, but I like to use Chalice. You can use either as you prefer; the code will be almost the same. Following are the steps to create the Chalice app, which runs on AWS Lambda and triggers the data pipeline. You will need the ID of the pipeline you created earlier for this step.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Create a Chalice app using &lt;code&gt;chalice new-project &amp;lt;NAME&amp;gt;&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Once the project is initialized, open the &lt;code&gt;app.py&lt;/code&gt; file&lt;/li&gt;
&lt;li&gt;Copy the contents of the following snippet into it. Code also available &lt;a href="https://github.com/EshbanTheLearner/preprocessing-pipeline-demo/blob/main/app.py" rel="noopener noreferrer"&gt;here&lt;/a&gt;.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;chalice&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Chalice&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;boto3&lt;/span&gt;

&lt;span class="n"&gt;app&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Chalice&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;app_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;pipeline-trigger&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;boto3&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;client&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;datapipeline&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# The pipeline you want to activate
&lt;/span&gt;&lt;span class="n"&gt;PIPELINE_ID&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;df-xxxxxxxxxxxxxxxxxxxx&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

&lt;span class="nd"&gt;@app.on_s3_event&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;bucket&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;automated-data-pipeline&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;events&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s3:ObjectCreated:*&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;prefix&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;preprocess/&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;suffix&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;.csv&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;activate_pipeline&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;app&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;log&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;debug&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Received event for bucket: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;bucket&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;, key: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;activate_pipeline&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;pipelineId&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;PIPELINE_ID&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;parameterValues&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;
                &lt;span class="p"&gt;{&lt;/span&gt;
                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;myS3DataPath&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;stringValue&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;s3://&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;bucket&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
                &lt;span class="p"&gt;}&lt;/span&gt;
            &lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;app&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;log&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;debug&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;Exception&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;app&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;log&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;critical&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;Change the arguments such as the &lt;code&gt;pipeline-id&lt;/code&gt; and the path to the S3 bucket&lt;/li&gt;
&lt;li&gt;Once done, deploy the Chalice app using &lt;code&gt;chalice deploy&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;If deployed successfully, go to the AWS console -&amp;gt; Lambda&lt;/li&gt;
&lt;li&gt;Select your Lambda function and go to the Permissions tab&lt;/li&gt;
&lt;li&gt;Click on the name of the Execution Role; this opens the IAM policy for that particular Lambda function&lt;/li&gt;
&lt;li&gt;Under the Permissions tab, click on the policy name to expand it&lt;/li&gt;
&lt;li&gt;Make sure that the policy has &lt;code&gt;iam:PassRole&lt;/code&gt; and the required Data Pipeline permissions&lt;/li&gt;
&lt;li&gt;To make life easier, here is an IAM policy that works:
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"Version"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"2012-10-17"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"Statement"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"Sid"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"VisualEditor0"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"Effect"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Allow"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"Action"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
                &lt;/span&gt;&lt;span class="s2"&gt;"iam:PassRole"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
                &lt;/span&gt;&lt;span class="s2"&gt;"datapipeline:*"&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"Resource"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"*"&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"Sid"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"VisualEditor1"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"Effect"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Allow"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"Action"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
                &lt;/span&gt;&lt;span class="s2"&gt;"logs:CreateLogStream"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
                &lt;/span&gt;&lt;span class="s2"&gt;"logs:CreateLogGroup"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
                &lt;/span&gt;&lt;span class="s2"&gt;"logs:PutLogEvents"&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"Resource"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"arn:*:logs:*:*:*"&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h1&gt;
  
  
  Testing
&lt;/h1&gt;

&lt;p&gt;To test this pipeline, upload the dataset to the S3 bucket path you specified in the trigger function. In my case, the path is &lt;code&gt;s3://automated-data-pipeline/preprocess/&lt;/code&gt;. This lets me run the following command in my terminal to upload the data, then simply sit back and wait for the output to appear in the S3 path I specified.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;aws s3 &lt;span class="nb"&gt;cp&lt;/span&gt; ~/training.1600000.processed.noemoticon.csv s3://automated-data-pipeline/preprocess/
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
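
&lt;p&gt;If you would rather check for the output programmatically than refresh the console, a minimal boto3 sketch like the one below can list whatever has landed under the output path. The &lt;code&gt;output/&lt;/code&gt; prefix here is only an assumption; substitute the path your pipeline actually writes to.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import boto3

s3 = boto3.client("s3")

# Hypothetical output location; replace with the path your pipeline writes to
BUCKET = "automated-data-pipeline"
PREFIX = "output/"

response = s3.list_objects_v2(Bucket=BUCKET, Prefix=PREFIX)
for obj in response.get("Contents", []):
    print(obj["Key"], obj["Size"])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;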



&lt;p&gt;After the pipeline has run its course, it automatically deletes the resources attached to it so you don’t incur any unwanted bills, and it uploads the processed data to your specified path, ready to be used. Now let’s observe a before-and-after state of the data. Here is what the data looked like in its raw form:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Ftvw2zzx4w6onxialcdgf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Ftvw2zzx4w6onxialcdgf.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Here is what the data looks like after going through the pipeline once:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fr9g340jgelv3mgwfp5q5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fr9g340jgelv3mgwfp5q5.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You can clearly observe the difference. You can also find the accompanying notebooks &lt;a href="https://github.com/EshbanTheLearner/preprocessing-pipeline-demo" rel="noopener noreferrer"&gt;here&lt;/a&gt; to take a closer look. &lt;/p&gt;

&lt;h1&gt;
  
  
  Conclusion
&lt;/h1&gt;

&lt;p&gt;I know there are a lot of steps involved in this process, but I assure you that once you have set up a pipeline like this, your life will be much easier. Still seems like a lot of work? Contact us at &lt;a href="mailto:help@traindex.io"&gt;help@traindex.io&lt;/a&gt; to consult on any data engineering/science problems you might be facing.&lt;/p&gt;

</description>
      <category>aws</category>
      <category>datascience</category>
      <category>bigdata</category>
    </item>
    <item>
      <title>Introduction to Data Pipelines</title>
      <dc:creator>Eshban Suleman</dc:creator>
      <pubDate>Mon, 26 Oct 2020 17:37:22 +0000</pubDate>
      <link>https://dev.to/traindex/introduction-to-data-pipelines-26o7</link>
      <guid>https://dev.to/traindex/introduction-to-data-pipelines-26o7</guid>
      <description>&lt;p&gt;If you are a growing data-driven organization, you might have been working to harvest large amounts of data to extract valuable insights from it. This can be costly and inefficient unless the data science team adopts the repeatable solutions to common problems. Although the specifics of organizations may vary, the basic principles remain the same. There are some common features that you can encapsulate into a data pipeline. Let’s look at a common problem and see how we overcame it.&lt;/p&gt;

&lt;p&gt;Our team members at Traindex used to perform recurring tasks manually. These tasks included data cleaning, model training, testing, and so on. By performing these tasks by hand, an engineer worked on the same thing again and again. This resulted in slow throughput, human error, and a lack of flexibility and centralization. &lt;/p&gt;

&lt;p&gt;To overcome this, we envisioned a data pipeline to do all the above tasks with minimal human intervention. We developed and deployed such a pipeline, and it has proven to be a breath of fresh air. In this article, we’ll look at what data pipelines are, the benefits of using them in a corporate setting, and finally, what an event-driven data pipeline is.&lt;/p&gt;

&lt;h1&gt;
  
  
  What is a Data Pipeline?
&lt;/h1&gt;

&lt;p&gt;In simple terms, a pipeline is nothing more than a set of steps performed in a particular order. A data pipeline is a set of processes performed on data as it moves from a source to a destination, also known as the sink. The source could be anything from online transactional databases to data lakes, and the sink could be anything from a data warehouse to a business intelligence system. The most common data pipeline is ETL, which extracts, transforms, and loads the data. The transformation step can include anything, depending on the business. Here is a detailed data pipeline diagram:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F485dkplisyhck3b47bn8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F485dkplisyhck3b47bn8.png" alt="Alt Text" width="722" height="267"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;An ETL pipeline is a type of data pipeline that performs operations in batches and is sometimes referred to as a batch data pipeline. Batch processing was the norm for a long time, but now other types of processing, like streaming and real-time processing, are available. The architecture of a data pipeline can vary a lot according to your business needs. For example, stream analytics for IoT applications keeps data flowing from hundreds of sensors for real-time analysis.&lt;/p&gt;
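
&lt;p&gt;As a toy illustration of the ETL idea, here is a minimal sketch in Python. It assumes a CSV source, pandas, and hypothetical file names; a real pipeline would swap in proper connectors for its source and sink.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;import pandas as pd

def extract(path):
    # Pull raw data from the source (a CSV file in this sketch)
    return pd.read_csv(path)

def transform(df):
    # Business-specific transformation; here we just drop empty rows
    # and normalize the column names
    df = df.dropna(how="all")
    df.columns = [c.strip().lower() for c in df.columns]
    return df

def load(df, path):
    # Write the result to the sink (another file in this sketch)
    df.to_csv(path, index=False)

# "raw_data.csv" and "clean_data.csv" are placeholder names
load(transform(extract("raw_data.csv")), "clean_data.csv")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;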

&lt;p&gt;Now that we understand what a data pipeline is, let's discuss why it is important to use data pipelines in modern data-oriented applications. &lt;/p&gt;

&lt;h1&gt;
  
  
  Why use Data Pipelines?
&lt;/h1&gt;

&lt;p&gt;In modern data-driven organizations, almost all actions and decisions are based on insights gathered from data. Every department of the organization has certain authorizations, restrictions, and data needs. Often, organizations have a single entity that manages everyone's requirements, resulting in a data silo. In such situations, getting even simple insights becomes difficult and leads to data redundancy within departments. The effort required to obtain essential data also handicaps the organization. &lt;/p&gt;

&lt;h3&gt;
  
  
  Easy and Fast Access to Data
&lt;/h3&gt;

&lt;p&gt;Well-thought-out data pipelines give everyone in the organization easy and fast access to data, governed by the right permission roles. Anyone from any department can access the data they need with no intervention or interference.&lt;/p&gt;

&lt;h3&gt;
  
  
  Swift Decision Making
&lt;/h3&gt;

&lt;p&gt;Building on the previous point, fast access to data results in quick data-driven decisions. Such choices are supported by data and are less likely to go south.&lt;/p&gt;

&lt;h3&gt;
  
  
  Scalability
&lt;/h3&gt;

&lt;p&gt;Well-architected data pipelines can automatically scale up or down according to the users' or organization's needs. This saves administrators from having to keep a constant eye on resources and manually add or remove them as requirements change.&lt;/p&gt;

&lt;h3&gt;
  
  
  Reliability
&lt;/h3&gt;

&lt;p&gt;Well-written data pipelines improve data quality. The data becomes more reliable, and executives can make better decisions based on it. &lt;/p&gt;

&lt;h3&gt;
  
  
  Economically Efficient
&lt;/h3&gt;

&lt;p&gt;Automated data pipelines run independently and need minimal maintenance and human intervention, which means a smaller paid workforce. Their autonomous nature also allows them to remove unused resources and save costs.&lt;/p&gt;

&lt;p&gt;Since we now understand what a data pipeline is and its benefits, let us see how we crafted a pipeline according to our needs at Traindex.&lt;/p&gt;

&lt;h1&gt;
  
  
  Event-Driven Data Pipelines
&lt;/h1&gt;

&lt;p&gt;Based on the problem we discussed at the beginning of this article, we decided on an event-driven pipeline, which runs only in response to certain events. We wanted our pipeline to automatically run the data processing jobs, followed by training a machine learning model on the preprocessed data, and then run some tests once training completed. The triggering event, in our case, was an upload event. &lt;br&gt;
When a user or engineer moves data into a specified storage location, an event is generated. Once the upload completes, it triggers our pipeline. Scheduling is not optimal for this use case because we don’t know when the raw data will be uploaded to our storage; it can be frequent or occasional, so we went with the event-driven approach.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fk2m2q19d34w857s8jr1e.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fk2m2q19d34w857s8jr1e.png" alt="Alt Text" width="800" height="226"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  Conclusion
&lt;/h1&gt;

&lt;p&gt;We learned the importance of mining large datasets efficiently to get the best insights on time and stay ahead of the competition. Modern data-driven organizations should consider setting up data pipelines to provide their teams with correct and useful data a click away. Data pipelines can also automate recurring data-driven tasks like data preprocessing, model training, and testing, on a schedule or based on specific events. We hope you have found this article useful and will consider crafting some data pipeline solutions for your organization. You can consult us about your data engineering problems at &lt;a href="mailto:help@traindex.io"&gt;help@traindex.io&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>datascience</category>
      <category>bigdata</category>
      <category>dataengineering</category>
      <category>pipelines</category>
    </item>
    <item>
      <title>What is Semantic Search?</title>
      <dc:creator>Eshban Suleman</dc:creator>
      <pubDate>Mon, 21 Sep 2020 19:24:04 +0000</pubDate>
      <link>https://dev.to/traindex/what-is-semantic-search-3612</link>
      <guid>https://dev.to/traindex/what-is-semantic-search-3612</guid>
      <description>&lt;p&gt;How many times have you had a song's lyrics stuck in your head? Or wanted to search about something but don't know how to describe it? We all have gone through these scenarios in our lives. Who was always there to save the day? Yes, the internet! The power of modern search engines to search through vast amounts of information is unquestionable. They search through billions of webpages on the internet to give you what you need. Like searching for a needle in a haystack except sometimes, users cannot describe the needle.&lt;/p&gt;

&lt;p&gt;Retrieving relevant information from an extensive collection of documents is a challenge. Techniques like syntax analysis, string matching, KPS (Keyword, Pattern, Sample) search, semantic search, etc. each have their own merits. Yet semantic search stands out as superior.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Semantic Search?
&lt;/h2&gt;

&lt;p&gt;Semantic search is a searching technique that improves the accuracy and relevance of the results by understanding the user's intent through contextual meaning. It can answer questions whose exact wording is not present in the search space, and it can provide personalized search results based on different factors. Semantic search finds that forgotten song's lyrics, and it also surfaces important documents from your vast collection of corporate data.&lt;/p&gt;

&lt;h3&gt;
  
  
  Relevant Results
&lt;/h3&gt;

&lt;p&gt;Modern, powerful Machine Learning and Natural Language Processing algorithms enable the search engine to "understand" what the user has asked. The search engine analyzes the entities in sentences, the inter-dependence of words, synonyms, and context. Sometimes it analyzes other factors as well, such as browser history in web search engines. This allows users to get accurate results. &lt;/p&gt;

&lt;h3&gt;
  
  
  Better User Experience
&lt;/h3&gt;

&lt;p&gt;Getting accurate information quickly results in a better user experience. Semantic search is both quick and accurate, so the user experience improves.&lt;/p&gt;

&lt;h3&gt;
  
  
  Discover Knowledge
&lt;/h3&gt;

&lt;p&gt;Unlike keyword search, semantic search aims to understand the user's query and intent, so it retrieves results that share the same concepts and ideas. It can help discover new things about the same topics, which can be very useful. In a corporate setting, semantic search can also enhance business intelligence. For example, a keyword search over a resume database will take keywords like "python" AND "machine learning" and find only the resumes that contain those exact keywords. Semantic search, on the other hand, can take input like "machine learning python" and return both the resumes containing these terms and the resumes expressing similar ideas in different words.&lt;/p&gt;

&lt;h2&gt;
  
  
  Traindex and Semantic Search
&lt;/h2&gt;

&lt;p&gt;We understand the importance of semantic search, especially in corporate settings. Traindex implements semantic search solutions for your data collection, no matter what it is. To understand how we do it, consider the example of a library. A library can have thousands of books, yet a librarian can tell you exactly where a particular book is. How? By using topical indexes. Libraries divide books into topics; each subject has its own space, and the location of these doesn't change, so the librarian can point you to a specific book's exact location. Traindex implements semantic search in a similar way, using various machine learning and NLP algorithms to learn the topics and maintain an index for fast lookups. It can search a wide variety of data, from corporate resume data to patent data and other critical corporate data. We provide secure end-to-end pipelines to implement our solution, so our interaction with your data is minimal.  &lt;/p&gt;
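
&lt;p&gt;The topical-index idea can be sketched in a few lines of Python: documents are grouped under topics, and a query is first mapped to a topic and then matched only within it. The &lt;code&gt;topic_of&lt;/code&gt; function below is a toy stand-in for a real topic model, and the file names are made up.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# A toy topical index: each topic maps to the documents filed under it
index = {
    "machine learning": ["resume_003.txt", "resume_017.txt"],
    "data engineering": ["resume_008.txt", "resume_021.txt"],
}

def topic_of(query):
    # Stand-in for a real topic model: pick the topic that
    # shares the most words with the query
    words = set(query.lower().split())
    return max(index, key=lambda t: len(words.intersection(t.split())))

topic = topic_of("python machine learning engineer")
print(topic, ":", index[topic])
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;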

&lt;h2&gt;
  
  
  How to Implement Semantic Search?
&lt;/h2&gt;

&lt;p&gt;There are a ton of different techniques and algorithms available for building a semantic search system. Choosing one depends on many factors, like the dataset, the resources available, urgency, etc. Traindex can implement any of these algorithms according to the requirements. Here are some of the most common ones.&lt;/p&gt;

&lt;h3&gt;
  
  
  Latent Semantic Indexing/Latent Dirichlet Allocation
&lt;/h3&gt;

&lt;p&gt;Both LSI and LDA take a bag of words formatted as a matrix as input. LSI uses SVD, a very popular matrix decomposition technique, to find latent dimensions, aka topics, in the input. In contrast, LDA is a generative probabilistic model that assumes a Dirichlet prior over the latent topics. Methods like TF-IDF can be used to build the input matrix, and then LSI or LDA can do its work and figure out the N topics in the input. The number of topics is a hyper-parameter and can be tuned based on factors such as data size, resource availability, etc. For an incoming query, the model finds the topic that best matches the input, and from that topic it finds the most relevant results and ranks them. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--scCDYwdC--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/n9kjwg3vydplq0f866um.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--scCDYwdC--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/n9kjwg3vydplq0f866um.png" alt="LSA"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Word2Vec/Doc2Vec
&lt;/h3&gt;

&lt;p&gt;Word2Vec and Doc2Vec are embedding techniques that have provided state-of-the-art results in various natural language processing tasks and have acted as a silver bullet for a lot of different NLP problems. The bag-of-words technique produces a sparse, very high-dimensional matrix. In contrast, the idea behind these embedding techniques is to represent the text as a fixed-size, low-dimensional dense vector that captures its semantic relationships. These representations can also be learned once and reused later. Embeddings have proven to work far better than previous techniques. Choosing between word2vec and doc2vec, again, depends on what sort of data you have. You can also use pre-trained embeddings for your semantic search engine.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--XlVJoBW2--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/865jrrcno76ico5ec0zj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--XlVJoBW2--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/865jrrcno76ico5ec0zj.png" alt="w2v_d2v"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Transformer Language Models
&lt;/h3&gt;

&lt;p&gt;Transformers are deep learning models that overcome the problems of long-range dependencies and long training times found in traditional models like RNNs, LSTMs, etc. They are parallelizable and can address a wide range of NLP tasks through fine-tuning, and they have been delivering back-to-back SOTA results recently. Some common transformer models used these days are BERT, GPT-2, GPT-3, XLNet, Reformer, RoBERTa, etc. Although most of these models are generative, you can use them in your semantic search system by fine-tuning them or by using them to generate embeddings for your text. &lt;/p&gt;
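
&lt;p&gt;For example, a pre-trained transformer can serve purely as an embedding generator for semantic search. This sketch assumes the &lt;code&gt;sentence-transformers&lt;/code&gt; package and one of its published pre-trained models:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;from sentence_transformers import SentenceTransformer, util

# "all-MiniLM-L6-v2" is one of the published pre-trained models
model = SentenceTransformer("all-MiniLM-L6-v2")

corpus = ["How to train a machine learning model",
          "Best hiking trails near the city"]
query = "fitting an ML model"

corpus_emb = model.encode(corpus, convert_to_tensor=True)
query_emb = model.encode(query, convert_to_tensor=True)

# Cosine similarity ranks the semantically closest document first
print(util.cos_sim(query_emb, corpus_emb))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;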

&lt;h2&gt;
  
  
  Take Away
&lt;/h2&gt;

&lt;p&gt;Searching for useful and relevant information in an extensive collection of text-based documents is arduous. Semantic search allows us to do it smartly. Search engines already do this, and Traindex can provide you with your very own custom semantic search system based on your data. Sound amazing? Click &lt;a href="https://www.traindex.io/"&gt;here&lt;/a&gt; to request a demo. &lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>nlp</category>
      <category>datascience</category>
    </item>
  </channel>
</rss>
