<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Traindex</title>
    <description>The latest articles on DEV Community by Traindex (@traindex).</description>
    <link>https://dev.to/traindex</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Forganization%2Fprofile_image%2F2824%2F201ceca8-e10a-4133-a1bf-77dcaf7713ca.png</url>
      <title>DEV Community: Traindex</title>
      <link>https://dev.to/traindex</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/traindex"/>
    <language>en</language>
    <item>
      <title>Alternatives to Google Patents</title>
      <dc:creator>Eshban Suleman</dc:creator>
      <pubDate>Sat, 20 Feb 2021 09:56:01 +0000</pubDate>
      <link>https://dev.to/traindex/alternatives-to-google-patents-4j4b</link>
      <guid>https://dev.to/traindex/alternatives-to-google-patents-4j4b</guid>
      <description>&lt;p&gt;There are multiple tools available over the internet to check the similarity of a claim or a patent. There are pros and cons of every tool and a user can sometimes have a hard time deciding what to use where. In such situations, people tend to use the services they trust. People tend to rely on big tech companies when it comes to choosing between a variety of options because they are perceived to be doing well in every area. Such is the case with Google Patents. &lt;/p&gt;

&lt;p&gt;Although Google Patents is a good all-round search engine for patent data, it does have some disadvantages. In this article, we will have a look at some of the more obvious cons of Google Patents and then look at some other services available online. And if you are not familiar with the concept of a patent search or how to conduct one, have a look at our article &lt;a href="https://www.traindex.io/blog/patent-search-4j05"&gt;Patent Search&lt;/a&gt;.&lt;/p&gt;

&lt;h1&gt;
  
  
  Some Shortfalls of Google Patents
&lt;/h1&gt;

&lt;p&gt;This article is not aimed at dismissing Google Patents as a patent search engine; instead, the goal is to familiarize the reader with some alternatives to it. So, first of all, let’s discuss why one might decide not to use Google Patents.&lt;/p&gt;

&lt;h2&gt;
  
  
  1. Semi Semantic Behavior
&lt;/h2&gt;

&lt;p&gt;Google Patents has been observed to show semi-semantic behavior. It is a keyword-based search at its core but it can extract some semantically similar results. Sometimes it can be useful but most of the time it searches for unrelated synonyms. Following is an example of this behavior. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--BK7w6x8t--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/rc6ujdqejbw7ygkmmunt.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--BK7w6x8t--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/rc6ujdqejbw7ygkmmunt.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;It is not necessarily bad behavior but it does affect the results. &lt;/p&gt;

&lt;h2&gt;
  
  
  2. Bad with Acronyms
&lt;/h2&gt;

&lt;p&gt;As with all keyword-based searches, Google Patents also seems to struggle with acronyms. The most common example is the acronym AIDS (Acquired Immune Deficiency Syndrome), which is often confused with the word “aids”, a verb meaning “to help”. So you might get a lot of false positives if your query contains such acronyms. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--tK4F2dE1--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/rq2nsu7kebba3fcwe990.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--tK4F2dE1--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/rq2nsu7kebba3fcwe990.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  3. Empty Results
&lt;/h2&gt;

&lt;p&gt;Google Patents shows its keyword-search behavior here as well. If the keywords are very rare or specific, it might show zero results. Semantic search engines usually shine in this department, but Google Patents is not one of them. &lt;/p&gt;

&lt;h2&gt;
  
  
  4. Unable to Process Scientific Jargon
&lt;/h2&gt;

&lt;p&gt;Patents usually cover complex, novel scientific inventions and thus contain a lot of “science language”, but Google Patents is often unable to return results when queried with scientific jargon such as chemical formulas. &lt;/p&gt;

&lt;h2&gt;
  
  
  5. Missing Citations
&lt;/h2&gt;

&lt;p&gt;There has been a case of some patents going missing during data transfer. Due to this, citations are missing in some of the patents.&lt;/p&gt;

&lt;h2&gt;
  
  
  6. Disclosure Risk
&lt;/h2&gt;

&lt;p&gt;Google tracks search activity according to its &lt;a href="https://policies.google.com/privacy?hl=en"&gt;Privacy Policy&lt;/a&gt;. According to &lt;a href="https://www.uspto.gov/web/offices/pac/mpep/s904.html"&gt;MPEP 904.02(c)&lt;/a&gt; of the Manual of Patent Examining Procedure by the United States Patent and Trademark Office (USPTO), examiners are allowed to use tools and the internet to search for the prior art of any claim under examination, but they are not allowed to use any proprietary information as a query; instead, they are advised to use a general state-of-the-art query to get similar results. Simply put, to check whether the claim under inspection is similar or identical to any published claim, you can use any service on the internet, but you shouldn’t provide any information that might compromise the claim’s confidentiality. Since Google Patents is a keyword-based search, it is difficult to come up with a query that both preserves the confidentiality of your claims and still finds any similar or identical existing claims. Thus your case might always be at risk if Google Patents is being used.&lt;/p&gt;

&lt;p&gt;I think these are more than enough reasons to try something different this time. Let’s now discuss some of the alternatives to Google Patents.&lt;/p&gt;

&lt;h1&gt;
  
  
  Patentscope
&lt;/h1&gt;

&lt;p&gt;&lt;a href="https://patentscope.wipo.int/search/en/search.jsf"&gt;Patentscope&lt;/a&gt; is a patent search service by the World Intellectual Property Organization (&lt;a href="https://www.wipo.int/portal/en/index.html"&gt;WIPO&lt;/a&gt;). You can search over 92 million patents worldwide and can also refine your search results using certain meta-level filters. It is a free search engine for global technology information. It doesn’t employ any spelling correction, nor does the open version allow chemical compounds as a query. Also, it searches strictly for the words in the query and not their other forms, so no lemmatization is observed. It also returns zero results if even one word in the query is out of its vocabulary. &lt;/p&gt;

&lt;h1&gt;
  
  
  Espacenet
&lt;/h1&gt;

&lt;p&gt;&lt;a href="https://worldwide.espacenet.com/"&gt;Espacenet&lt;/a&gt; by the European Patent Office (EPO) is also a keyword-based patent search over more than 120 million patent documents. It has all the characteristics of keyword search, such as advanced search features and metadata-based filters. Unlike Patentscope, it uses lemmatization to match different word forms and supports multiple European languages. The basic search only allows up to 10 keywords.&lt;/p&gt;

&lt;h1&gt;
  
  
  lens.org
&lt;/h1&gt;

&lt;p&gt;&lt;a href="https://www.lens.org/"&gt;lens.org&lt;/a&gt; provides search services for different scholarly datasets, including patent data with 125.4 million patent records. It has very fine-grained advanced search filters and covers patents from all around the world. It uses Apache Lucene and Elasticsearch for text search and shows a semi-semantic behavior. It also supports spelling correction and handles acronyms better than the previous two options. Still, it doesn’t search for chemical compounds and is susceptible to returning empty results. &lt;/p&gt;

&lt;h1&gt;
  
  
  Traindex
&lt;/h1&gt;

&lt;p&gt;&lt;a href="https://www.traindex.io/"&gt;Traindex&lt;/a&gt;, unlike the others on this list, is a semantic search engine. It uses machine learning to find patents that are semantically similar to the query. It searches over Google Public Patents Data and can be integrated very easily into your applications. It can accept texts of various lengths; you can enter whole patent documents and it will handle them easily. Since it is a semantic search engine, it excels at retrieving the desired results even for very unusual queries. One of the things that makes it stand out is that it doesn’t track search data and lets you use its API safely. Does this look like something you want to know more about? How about you schedule a demo &lt;a href="https://www.traindex.io/"&gt;here&lt;/a&gt; and we will walk you through the process. &lt;/p&gt;

&lt;p&gt;The goal of this article was to point out some areas where Google Patents falls short and to provide you with some alternative resources so you can use the right tool for your problems, without compromising privacy and security. If you’re still confused, you can reach us at &lt;a href="mailto:help@traindex.io"&gt;help@traindex.io&lt;/a&gt; and we would be happy to guide you more.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Machine learning possible with small data?</title>
      <dc:creator>Danial Ranjha</dc:creator>
      <pubDate>Tue, 02 Feb 2021 22:44:14 +0000</pubDate>
      <link>https://dev.to/traindex/machine-learning-possible-with-small-data-5cd5</link>
      <guid>https://dev.to/traindex/machine-learning-possible-with-small-data-5cd5</guid>
      <description>&lt;p&gt;It's not worth trying machine learning projects unless you have a huge data set.&lt;/p&gt;

&lt;p&gt;True or false?&lt;/p&gt;

&lt;p&gt;Smaller companies are afraid to add machine learning features to their projects unless they have big data. They think that since they're not Amazon or Microsoft, they don't have a large enough data set to be successful in taking on machine learning projects or features.&lt;/p&gt;

&lt;p&gt;There are definitely applications of machine learning that can work even on small data sets. Perhaps you can start small and prove out a concept, before investing in getting more data to build a larger model. You can also use off-the-shelf models in AWS, Azure, and GCP to solve generic problems.&lt;/p&gt;

&lt;p&gt;With smaller data sets you will encounter problems that you need to be wary of, such as overfitting, bias, and data imbalance. With the right tools and people, there are strategies to overcome these problems.&lt;/p&gt;
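
&lt;p&gt;As a quick illustration (not from any particular project), here is a minimal sketch of two such strategies in Python with scikit-learn: class weighting to counter imbalance and stratified cross-validation to catch overfitting on a small sample. The dataset below is synthetic and only stands in for "small data".&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Illustrative sketch: a small, imbalanced dataset handled with class
# weighting and stratified cross-validation (scikit-learn).
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_val_score

# 500 rows with ~10% positives, standing in for a small real-world dataset
X, y = make_classification(n_samples=500, n_features=20, weights=[0.9, 0.1], random_state=42)

model = LogisticRegression(class_weight='balanced', max_iter=1000)  # counter the imbalance
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)     # keep folds representative

scores = cross_val_score(model, X, y, cv=cv, scoring='f1')
print(f"F1 per fold: {scores.round(3)}, mean: {scores.mean():.3f}")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;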

&lt;p&gt;Like any good project management program, you can invest in small wins to prove out the concept. This can help you gain credibility for a new program and then get funding for bigger and bolder bets.&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>startup</category>
    </item>
    <item>
      <title>Multipart Upload for Large Files using Pre-Signed URLs - AWS</title>
      <dc:creator>Syed Afroz Pasha</dc:creator>
      <pubDate>Tue, 15 Dec 2020 21:32:13 +0000</pubDate>
      <link>https://dev.to/traindex/multipart-upload-for-large-files-using-pre-signed-urls-aws-4hg4</link>
      <guid>https://dev.to/traindex/multipart-upload-for-large-files-using-pre-signed-urls-aws-4hg4</guid>
      <description>&lt;p&gt;It’s mind-blowing how fast data is growing. It is now possible to collect raw data with a frequency of more than a million requests per second. Storage is quicker and cheaper. It is normal to store data practically forever, even if it is rarely accessed.&lt;/p&gt;

&lt;p&gt;Users of &lt;a href="https://traindex.io/" rel="noopener noreferrer"&gt;Traindex&lt;/a&gt; can upload large data files to create a semantic search index. This article will explain how we implemented the multipart upload feature that allows Traindex users to upload large files.&lt;/p&gt;

&lt;h3&gt;
  
  
  Problems and their Solutions
&lt;/h3&gt;

&lt;p&gt;We wanted to allow users of Traindex to upload large files, typically 1-2 TB, to Amazon S3 in minimum time and with appropriate access controls. &lt;/p&gt;

&lt;p&gt;In this article, I will discuss how to set up pre-signed URLs for the secure upload of files. This allows us to grant temporary access to objects in AWS S3 buckets without granting clients any permanent permissions.&lt;/p&gt;

&lt;p&gt;So how do you go from a 5GB limit to a 5TB limit when uploading to AWS S3? Using multipart uploads, AWS S3 allows users to upload files partitioned into up to 10,000 parts. The size of each part may vary from 5MB to 5GB.&lt;/p&gt;
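
&lt;p&gt;As a rough, illustrative sketch of the arithmetic behind those limits (the 1 TB figure is just an example), you can work out a part size like this:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Illustrative part-size arithmetic for the S3 multipart limits.
MAX_PARTS = 10_000
MIN_PART = 5 * 1024 * 1024           # 5 MB minimum part size
object_size = 1 * 1024 ** 4          # e.g. a 1 TB object

# Smallest part size that still fits the object into 10,000 parts
part_size = max(MIN_PART, -(-object_size // MAX_PARTS))   # ceiling division
no_of_parts = -(-object_size // part_size)

print(f"part size ~{part_size / 1024 ** 2:.0f} MB, parts: {no_of_parts}")
# For 1 TB this gives parts of roughly 105 MB, far below the 5 GB per-part cap.
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;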

&lt;p&gt;The table below shows the upload service limits for S3.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fmc21sba4zou7dcdratzf.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fmc21sba4zou7dcdratzf.PNG" alt="Capture" width="800" height="379"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Apart from the size limitations, it is better to keep S3 buckets private and only grant public access when required. We wanted to give the client access to an object without changing the bucket ACL, creating roles, or creating a user on our account. We ended up using S3 pre-signed URLs.&lt;/p&gt;

&lt;h3&gt;
  
  
  What will you learn?
&lt;/h3&gt;

&lt;p&gt;For a standard multipart upload to work with pre-signed URLs, we need to:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Initiate a multipart upload&lt;/li&gt;
&lt;li&gt;Create pre-signed URLs for each part&lt;/li&gt;
&lt;li&gt;Upload the parts of the object&lt;/li&gt;
&lt;li&gt;Complete multipart upload&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  Prerequisites
&lt;/h3&gt;

&lt;p&gt;Make sure your command-line environment is configured so that credentials do not have to be supplied at the time of each operation. Steps 1, 2, and 4 stated above are server-side stages; they need an AWS access key ID and secret access key. Step 3 is the client-side operation for which the pre-signed URLs are being set up, and hence no credentials will be needed.&lt;/p&gt;

&lt;p&gt;If you have not configured your environment to perform server-side operations, then you must complete it first by following these steps:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Download AWS-CLI from this &lt;a href="https://aws.amazon.com/cli/" rel="noopener noreferrer"&gt;link&lt;/a&gt; according to your OS and install it. To configure your AWS-CLI, you need to use the command &lt;strong&gt;aws configure&lt;/strong&gt; and provide the details it requires, as shown below.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;$ &lt;/span&gt;aws configure

AWS Access Key ID &lt;span class="o"&gt;[&lt;/span&gt;None]: EXAMPLEFODNN7EXAMPLE
AWS Secret Access Key &lt;span class="o"&gt;[&lt;/span&gt;None]: eXaMPlEtnFEMI/K7MDENG/bPxRfiCYEXAMPLEKEY
Default region name &lt;span class="o"&gt;[&lt;/span&gt;None]: xx-xxxx-x
Default output format &lt;span class="o"&gt;[&lt;/span&gt;None]: json
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Implementation
&lt;/h3&gt;

&lt;h4&gt;
  
  
  1. Initiate a Multipart Upload
&lt;/h4&gt;

&lt;p&gt;At this stage, we request AWS S3 to initiate a multipart upload. In response, we will get the &lt;strong&gt;UploadId&lt;/strong&gt;, which will associate each part with the object being created.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;boto3&lt;/span&gt;

&lt;span class="n"&gt;s3&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;boto3&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;client&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s3&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;bucket&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;[XYZ]&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
&lt;span class="n"&gt;key&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;[ABC.pqr]&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;s3&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;create_multipart_upload&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;Bucket&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;bucket&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
    &lt;span class="n"&gt;Key&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;upload_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;UploadId&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After setting up the bucket name and key and executing this chunk of code, we get the UploadId for the file we want to upload. It will later be required to combine all the parts.&lt;/p&gt;

&lt;h4&gt;
  
  
  2. Create pre-signed URLs for each part
&lt;/h4&gt;

&lt;p&gt;The parts can now be uploaded via a PUT request. As explained earlier, we are using a pre-signed URL to provide a secure way to upload and grant access to an object without changing the bucket ACL, creating roles, or creating a user on your account. The permitted user can generate the URL for each part of the file and access S3. The following code can generate it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;signed_url&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;s3&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;generate_presigned_url&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;ClientMethod&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;upload_part&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;Params&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
       &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Bucket&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;bucket&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
       &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Key&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
       &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;UploadId&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;upload_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
       &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;PartNumber&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;part_no&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;As described above, this particular step is a server-side stage and hence demands a preconfigured AWS environment. The pre-signed URLs for each of the parts can now be handed over to the client, who can simply upload the individual parts without direct access to S3. This means that the service provider no longer has to worry about ACLs or permission changes.&lt;/p&gt;

&lt;h4&gt;
  
  
  3. Upload the parts of the object
&lt;/h4&gt;

&lt;p&gt;This step is the only client-side stage of the process. The default pre-signed URL expiration time is 15 minutes, but whoever generates the URL can change this value. Usually, it is kept as short as possible for security reasons.&lt;/p&gt;
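
&lt;p&gt;For reference, the expiry is controlled by the &lt;strong&gt;ExpiresIn&lt;/strong&gt; argument (in seconds) of &lt;em&gt;generate_presigned_url&lt;/em&gt;. A minimal sketch, reusing the names from step 2, that pins the URL to a 15-minute lifetime:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Sketch: set the pre-signed URL expiry explicitly (reuses s3, bucket,
# key, upload_id and part_no from the earlier snippets).
signed_url = s3.generate_presigned_url(
    ClientMethod='upload_part',
    Params={
        'Bucket': bucket,
        'Key': key,
        'UploadId': upload_id,
        'PartNumber': part_no
    },
    ExpiresIn=900   # 15 minutes; keep the window as short as your clients allow
)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;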

&lt;p&gt;The client can read a part of the object, i.e., &lt;em&gt;file_data&lt;/em&gt;, and request to upload that chunk of data against its part number. It is essential to use the pre-signed URLs in part-number order, and the data chunks must be in sequence; otherwise, the object might break and the upload ends up with a corrupted file. For that reason, a collection, i.e., &lt;strong&gt;parts&lt;/strong&gt;, must be maintained to store the unique identifier, i.e., the &lt;strong&gt;ETag&lt;/strong&gt;, of every part against its part number.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;requests&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;put&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;signed_url&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;data&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;file_data&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;etag&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;headers&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;ETag&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;  

&lt;span class="n"&gt;parts&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;ETag&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;etag&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;PartNumber&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;part_no&lt;/span&gt;&lt;span class="p"&gt;})&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;As far as the size of the data is concerned, each chunk’s size can either be declared in bytes or calculated by dividing the object’s &lt;strong&gt;total size&lt;/strong&gt; by the &lt;strong&gt;number of parts&lt;/strong&gt;. Look at the example code below:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;max_size&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;1024&lt;/span&gt; &lt;span class="o"&gt;*&lt;/span&gt; &lt;span class="mi"&gt;1024&lt;/span&gt;    &lt;span class="c1"&gt;# Approach 1: Assign the size  
&lt;/span&gt;
&lt;span class="n"&gt;max_size&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;object_size&lt;/span&gt;&lt;span class="o"&gt;//&lt;/span&gt;&lt;span class="n"&gt;no_of_parts&lt;/span&gt;    &lt;span class="c1"&gt;# Approach 2: Calculate the size (integer division so read() gets an int)
&lt;/span&gt;
&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="nf"&gt;open&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;fileLocation&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;rb&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;file_data&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;f&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;max_size&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  4. Complete Multipart Upload
&lt;/h4&gt;

&lt;p&gt;Before this step, verify that all of the data’s chunks and their details have been uploaded to the bucket. Now, we need to merge all the partial files into one. The &lt;strong&gt;parts&lt;/strong&gt; collection (discussed in step 3) is passed as an argument so that the chunks are matched with their part numbers and ETags and the object does not end up corrupted.&lt;/p&gt;

&lt;p&gt;You can refer to the code below to complete the multipart uploading process.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;s3&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;complete_multipart_upload&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;Bucket&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;bucket&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;Key&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;MultipartUpload&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;Parts&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;parts&lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;
    &lt;span class="n"&gt;UploadId&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;upload_id&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  5. Additional step
&lt;/h4&gt;

&lt;p&gt;To avoid extra charges and to clean up your S3 bucket, S3 can stop an in-progress multipart upload on request. In case anything seems suspicious and you want to abort the process, you can use the following code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;s3&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;abort_multipart_upload&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="n"&gt;Bucket&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;bucket&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;Key&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;UploadId&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;upload_id&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In this article, we discussed how to implement multipart uploading in a secure way using pre-signed URLs. The suggested solution is to make a &lt;a href="https://www.traindex.io/blog/cli-upload-for-large-files-6i" rel="noopener noreferrer"&gt;CLI tool to upload large files&lt;/a&gt;, which saves time and resources and provides flexibility to users. It is a cheap and efficient solution for users who need to do this frequently.&lt;/p&gt;

</description>
      <category>aws</category>
      <category>multipart</category>
      <category>presignedurls</category>
      <category>amazons3</category>
    </item>
    <item>
      <title>CLI upload for large files</title>
      <dc:creator>Mohsin Ashraf</dc:creator>
      <pubDate>Mon, 14 Dec 2020 23:08:21 +0000</pubDate>
      <link>https://dev.to/traindex/cli-upload-for-large-files-6i</link>
      <guid>https://dev.to/traindex/cli-upload-for-large-files-6i</guid>
      <description>&lt;p&gt;We deal with data every day as part of my work in the data science team. It starts by collecting data and analyzing it for potentially important features and baseline numbers. Then we do data preprocessing and cleaning. Finally, we feed the data into a machine learning algorithm for training.&lt;/p&gt;

&lt;p&gt;Once the training is complete, we test the model. We then serve it via an API if the performance is good.&lt;/p&gt;

&lt;p&gt;In a &lt;a href="https://www.traindex.io/blog/"&gt;previous article&lt;/a&gt;, we talked about uploading large files using multipart upload via pre-signed URLs. We will take a step further now and discuss how to create a CLI tool for uploading large files to S3 using pre-signed URLs.&lt;/p&gt;

&lt;p&gt;The article comprises 3 parts, as described below:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Create pre-signed URLs for multipart upload&lt;/li&gt;
&lt;li&gt;Upload all parts of the object&lt;/li&gt;
&lt;li&gt;Complete the upload&lt;/li&gt;
&lt;/ol&gt;

&lt;h1&gt;
  
  
  Request for Multipart upload pre-signed URLs
&lt;/h1&gt;

&lt;p&gt;First of all, we have to request the pre-signed URLs for the AWS S3 bucket. The API will return a list of pre-signed URLs, one for each of the object’s parts, along with an upload_id, which is associated with the object whose parts are being created. Let’s create the route for requesting pre-signed URLs.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from pathlib import Path
…
…

@app.route('/presigned',methods=['POST'])
def return_presigned():
    data = request.form.to_dict(flat=False)
    file_name = data['file_name'][0]
    file_size = int(data['file_size'][0])
    target_file = Path(file_name)
    max_size = 5 * 1024 * 1024
    upload_by = int(file_size / max_size) + 1
    bucket_name = "YOUR_BUCKET_NAME"
    key = file_name
    upload_id = s3util.start(bucket_name, key)
    urls = []
    for part in range(1, upload_by + 1):
        signed_url = s3util.create_presigned_url(part)
        urls.append(signed_url)
    return jsonify({
        'bucket_name': bucket_name,
        'key': key,
        'upload_id': upload_id,
        'file_size': file_size,
        'file_name': file_name,
        'max_size': max_size,
        'upload_by': upload_by,
        'urls': urls
    })
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Let’s go through the code. In this route (Flask route), we get the information sent in the request: file_name and file_size.&lt;br&gt;
The file_name will be used in creating URLs for parts of the object, and file_size will be used to find how many parts to create (pre-signed URLs to create).&lt;br&gt;
In the route, max_size determines each part’s maximum size. You can change it according to your needs.&lt;br&gt;
upload_by tells how many parts there will be for the object to upload.&lt;br&gt;
bucket_name is the bucket you want to upload data in.&lt;br&gt;
upload_id is generated using the S3 utility function create_multipart_upload, which we will discuss shortly.&lt;br&gt;
After that, pre-signed URLs are created in the for loop using the create_presigned_url utility function of s3. Again, we will come back to it in a bit.&lt;br&gt;
Next, I return the required data in JSON format.&lt;/p&gt;

&lt;p&gt;Now, let’s talk about the utility class that wraps calls like create_multipart_upload. It helps me encapsulate the code so it’s more readable and manageable. Following is the code for the utility class.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import boto3
from botocore.exceptions import ClientError
from boto3 import Session


class S3MultipartUploadUtil:
    """
    AWS S3 Multipart Upload Uril
    """
    def __init__(self, session: Session):
        self.session = session
        self.s3 = session.client('s3')
        self.upload_id = None
        self.bucket_name = None
        self.key = None

    def start(self, bucket_name: str, key: str):
        """
        Start Multipart Upload
        :param bucket_name:
        :param key:
        :return:
        """
        self.bucket_name = bucket_name
        self.key = key
        res = self.s3.create_multipart_upload(Bucket=bucket_name, Key=key)
        self.upload_id = res['UploadId']
        logger.debug(f"Start multipart upload '{self.upload_id}'")
        return self.upload_id

    def create_presigned_url(self, part_no: int, expire: int=3600) -&amp;gt; str:
        """
        Create pre-signed URL for upload part.
        :param part_no:
        :param expire:
        :return:
        """
        signed_url = self.s3.generate_presigned_url(
            ClientMethod='upload_part',
            Params={'Bucket': self.bucket_name,
                    'Key': self.key,
                    'UploadId': self.upload_id,
                    'PartNumber': part_no},
            ExpiresIn=expire)
        logger.debug(f"Create presigned url for upload part '{signed_url}'")
        return signed_url

    def complete(self, parts,id,key,bucket_name):
        """
        Complete Multipart Uploading.
        `parts` is list of dictionary below.
        ```


        [ {'ETag': etag, 'PartNumber': 1}, {'ETag': etag, 'PartNumber': 2}, ... ]


        ```
        you can get `ETag` from upload part response header.
        :param parts: Sent part info.
        :return:
        """
        res = self.s3.complete_multipart_upload(
            Bucket=bucket_name,
            Key=key,
            MultipartUpload={
                'Parts': parts
            },
            UploadId=id
        )
        logger.debug(f"Complete multipart upload '{self.upload_id}'")
        logger.debug(res)
        self.upload_id = None
        self.bucket_name = None
        self.key = None
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In this class, I wrap the functionality of the S3 client to make it easy to use and less cluttered in the API file.&lt;/p&gt;
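
&lt;p&gt;For orientation, here is a minimal usage sketch of this class (the bucket name and file name are placeholders; the Flask route above uses the same calls):&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Hypothetical usage of S3MultipartUploadUtil, assuming the class above is
# defined or imported in the same module.
from boto3 import Session

session = Session()                        # picks up credentials from `aws configure`
s3util = S3MultipartUploadUtil(session)

upload_id = s3util.start('YOUR_BUCKET_NAME', 'dataset.csv')
url_for_part_1 = s3util.create_presigned_url(part_no=1, expire=3600)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;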

&lt;p&gt;Once you get the response from the API, it would look something like this:&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--LtRb_uwx--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/z7v7suf92k708na4wyzd.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--LtRb_uwx--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/z7v7suf92k708na4wyzd.png" alt="Code Snippet"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You would download this response in a JSON file to upload the data using the CLI. &lt;/p&gt;
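
&lt;p&gt;For reference, a hypothetical &lt;em&gt;presigned.json&lt;/em&gt; would look roughly like this; every value below is a placeholder, and the real URLs and upload ID come from your own API response:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
  "bucket_name": "YOUR_BUCKET_NAME",
  "key": "dataset.csv",
  "upload_id": "EXAMPLE-UPLOAD-ID",
  "file_size": 16106127360,
  "file_name": "dataset.csv",
  "max_size": 5242880,
  "upload_by": 3073,
  "urls": [
    "https://YOUR_BUCKET_NAME.s3.amazonaws.com/dataset.csv?partNumber=1&amp;uploadId=EXAMPLE-UPLOAD-ID&amp;X-Amz-Signature=...",
    "... one pre-signed URL per part, upload_by entries in total ..."
  ]
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;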
&lt;h1&gt;
  
  
  Upload all parts of the object
&lt;/h1&gt;

&lt;p&gt;Now let’s turn to the CLI code, which uses this JSON file, and we assume that we save this file as presigned.json.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import requests
import progressbar
from pathlib import Path

def main():
    data = eval(open('presigned.json').read())
    upload_by = data['upload_by']
    max_size = data['max_size']
    urls = data['urls']
    target_file = Path(data['file_name'])
    file_size = data['file_size']
    key = data['key']
    upload_id = data['upload_id']
    bucket_name = data['bucket_name']
    bar = progressbar.ProgressBar(maxval=file_size, \
        widgets=[progressbar.Bar('=', '[', ']'), ' ', progressbar.Percentage()])
    json_object = dict()
    parts = []
    file_size_counter = 0
    with target_file.open('rb') as fin:
        bar.start()
        for num, url in enumerate(urls):
            part = num + 1
            file_data = fin.read(max_size)
            file_size_counter += len(file_data)
            res = requests.put(url, data=file_data)

            if res.status_code != 200:
                print (res.status_code)
                print ("Error while uploading your data.")
                return None
            bar.update(file_size_counter)
            etag = res.headers['ETag']
            parts.append((etag, part))
        bar.finish()
        json_object['parts'] = [
            {"ETag": eval(x), 'PartNumber': int(y)} for x, y in parts]
        json_object['upload_id'] = upload_id
        json_object['key'] = key
        json_object['bucket_name'] = bucket_name
    requests.post('https://YOUR_HOSTED_API/combine', json={'parts': json_object})
    print ("Dataset is uploaded successfully")

if __name__ == "__main__":
    main()    
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The above code loads the file and gets all the required information, including upload_id, URLs, and others. I use Progressbar to show progress while uploading the file. The entire code is pretty much self-explanatory except for the following line of code:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;requests.post('https://YOUR_HOSTED_API/combine, json={'parts': json_object})
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;To understand this piece of code, we have to look at the final step of completing the upload.&lt;/p&gt;

&lt;h1&gt;
  
  
  Complete the upload
&lt;/h1&gt;

&lt;p&gt;We have uploaded all parts of the file, but these parts are not yet combined. To combine them, we need to tell S3 that we have finished uploading so that it can assemble the parts. The request above calls the route below, which completes the multipart upload using the S3 utility class. It provides the proper information about the file along with the upload_id, which tells S3 which parts belong to the same upload.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;@app.route("/combine",methods=["POST"])
def combine():
    body = request.form
    body = body['parts']    
    session = Session()
    s3util = Presigned(session)
    parts = body['parts']
    id, key, bucket_name = body['upload_id'], body['key'], body['bucket_name']
    PARTS = [{"Etag": eval(x), 'PartNumber': int(y)} for x, y in parts]
    s3util.complete(PARTS, id, key, bucket_name)
    return Response(status_code=200)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This is close to the minimum code required to create a CLI tool. You can deploy the API on a server that has the proper AWS roles for interacting with S3, so it can create and return the pre-signed URLs and complete the multipart upload. This way, you can make sure that no one has direct access to your S3 bucket. Instead, users upload the data using pre-signed URLs, which is a secure way of uploading the data.&lt;/p&gt;

</description>
      <category>cli</category>
      <category>upload</category>
      <category>large</category>
      <category>file</category>
    </item>
    <item>
      <title>Google Patent Search</title>
      <dc:creator>Nada Gul</dc:creator>
      <pubDate>Thu, 03 Dec 2020 07:19:31 +0000</pubDate>
      <link>https://dev.to/traindex/google-patent-search-2fgf</link>
      <guid>https://dev.to/traindex/google-patent-search-2fgf</guid>
      <description>&lt;p&gt;A patent search is a tool to check the patentability of your invention. You can find out if someone else has already come up with the same idea. And if there is a patent similar to your invention, your patent application won’t be accepted. All the hard work, time, and money you invested in the invention will go to waste. It is important that you do a thorough patent search before proceeding with your invention.&lt;/p&gt;

&lt;p&gt;There are many options to pick from when searching for patents. In this article, we will talk about Google Patent Search in detail. The Google Patent Search engine is free and has an easy-to-use interface. The Google search engine is quite fast, as opposed to the United States Patent and Trademark Office’s &lt;a href="https://www.uspto.gov/" rel="noopener noreferrer"&gt;online library&lt;/a&gt;. Google Patents also provides information on legal events for patents.&lt;/p&gt;

&lt;h2&gt;
  
  
  How to search on Google Patents
&lt;/h2&gt;

&lt;p&gt;Google Patents has two search interfaces. A simple search interface and an advanced search interface. As the name suggests, the simple search interface is quite simple to use and very similar to the regular Google search engine.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Ffjz1rzx85m8woymv9wfy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Ffjz1rzx85m8woymv9wfy.png" alt="Google Patents simple search interface"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You may enter your query in the search bar on the main search page, a publication number of a particular patent you’re looking for, or an application number. When I entered “chemical reaction” in the search bar, I got 135,160 results in just 3 seconds. That’s fairly quick. &lt;/p&gt;

&lt;p&gt;In the advanced search interface, you can use boolean syntax to search for patents. You can also search for patent publications using Cooperative Patent Classifications (CPCs) that represent ideas instead of keywords.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F2ev8k7r5j6a08u0pf3ln.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F2ev8k7r5j6a08u0pf3ln.png" alt="Google Patents advanced search interface"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The AND operator requires all words entered in the search bar to be present in the search results. The OR operator requires at least one of the words to be present in the search results. When entering terms, it is important to note that all terms are automatically ANDed together, and synonyms are ORed. You can add a synonym by pressing TAB and a new search term by pressing ENTER. You can also use the proximity operators NEAR, WITH, SAME, and ADJ.&lt;/p&gt;

&lt;p&gt;On the Advanced Search page and under Search Fields, there’s an option to search for patents using particular fields. You can select before or after the filing date and choose either priority, filing, or publication for the date or date range you entered. There’s an option to search using an inventor and/or assignee. This option helps users look for patent documents by a particular inventor or patents filed by a specific person or company.&lt;/p&gt;

&lt;p&gt;The last box on the advanced search page includes the patent office, in case you want to search for patent documents by country. The next option is the language. You can choose to specify status: grant or application, and type: patent or design. The last option is litigation, where you can choose one of the two options: “has related litigation” or “no known litigation”.&lt;/p&gt;

&lt;p&gt;Specifying the patent search using the search field options narrows the search results to the most relevant patent publications you are looking for. It helps to make your patent search experience more efficient.&lt;/p&gt;

&lt;p&gt;Patent searching is an essential part of your patent process. You can also learn about recent inventions, the development of particular technologies or patents of famous academics. Google Patent search is a user-friendly patent search engine that makes the complex process of patent searching less tedious. It’s free and fast.&lt;/p&gt;

&lt;p&gt;If you want to learn about other avenues for patent searching, check our article on Patent Search &lt;a href="https://www.traindex.io/blog/patent-search-4j05" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>patents</category>
      <category>patentsearch</category>
      <category>googlepatents</category>
      <category>tips</category>
    </item>
    <item>
      <title>Event Driven Data Pipelines in AWS</title>
      <dc:creator>Eshban Suleman</dc:creator>
      <pubDate>Mon, 30 Nov 2020 17:52:07 +0000</pubDate>
      <link>https://dev.to/traindex/event-driven-data-pipelines-in-aws-480i</link>
      <guid>https://dev.to/traindex/event-driven-data-pipelines-in-aws-480i</guid>
      <description>&lt;p&gt;In a data-driven organization, there is a constant need to provide vast amounts of data to the teams. There are many tools available to aid your requirements and needs. Choosing the right tool can be a little challenging and overwhelming at times. The basic principle you can keep in mind is that there is no right tool or architecture, it depends on what you need. &lt;/p&gt;

&lt;p&gt;In this guide, I’m going to show you how to build a simple event-driven data pipeline in AWS. Pipelines are often scheduled or interval-based; however, the event-driven concept is distinctive and a good starting point. Instead of trying to figure out the right interval for activating the pipeline, you can use an event handler that reacts to certain events and activates your pipeline.&lt;/p&gt;

&lt;p&gt;To learn more about which problem we were solving in Traindex and why the data pipeline was the right choice for us, refer to my previous article &lt;a href="https://www.traindex.io/blog/introduction-to-data-pipelines-26o7" rel="noopener noreferrer"&gt;Introduction to Data Pipelines&lt;/a&gt;. &lt;/p&gt;

&lt;p&gt;As an example, we will be using the “Sentiment140 dataset with 1.6 million tweets”, which is available on Kaggle. Our goal is to set up a data preprocessing pipeline. Once you have uploaded the CSV file to a specified bucket, an event is generated. A Lambda function handles that event and activates your pipeline. The pipeline itself is an AWS Data Pipeline, a web service that helps you process and move data between different AWS compute and storage services. This pipeline provisions a compute resource and runs your preprocessing code on that resource. Once your data is cleaned and preprocessed, the pipeline uploads it to the specified bucket for later use. Based on these objectives, we can divide our task into the following sub-tasks (a minimal sketch of the event handler follows the list):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Creating a pre-configured AMI&lt;/li&gt;
&lt;li&gt;Defining AWS data pipeline architecture&lt;/li&gt;
&lt;li&gt;Writing the event handler AWS Lambda function&lt;/li&gt;
&lt;li&gt;Integrating everything&lt;/li&gt;
&lt;/ul&gt;
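
&lt;p&gt;To make the flow concrete before we dive in, here is a minimal, hypothetical sketch of the event handler; the pipeline ID and the environment variable name are placeholders, and the actual handler is covered in the event-handler sub-task later in the article.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Hypothetical sketch: an S3 "ObjectCreated" event activates an existing
# on-demand AWS Data Pipeline. PIPELINE_ID is a placeholder environment variable.
import os
import boto3

datapipeline = boto3.client('datapipeline')

def handler(event, context):
    record = event['Records'][0]['s3']
    bucket = record['bucket']['name']
    key = record['object']['key']
    print(f"New object s3://{bucket}/{key}, activating pipeline")

    datapipeline.activate_pipeline(pipelineId=os.environ['PIPELINE_ID'])
    return {'status': 'activated', 'key': key}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;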

&lt;p&gt;Before diving into the steps, make sure you have met the following preconditions:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You require an AWS account with certain IAM privileges&lt;/li&gt;
&lt;li&gt;Make sure you have already downloaded the data from “Sentiment140 dataset with 1.6 million tweets”&lt;/li&gt;
&lt;li&gt;Active internet connection&lt;/li&gt;
&lt;/ul&gt;

&lt;h1&gt;
  
  
  Pre-Configured AMI
&lt;/h1&gt;

&lt;p&gt;This step can be optional depending on your requirements, but it is good to have a pre-configured AMI that you can use for the compute resources. Follow these steps to create a pre-configured AMI:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Go to the AWS console, click on the Services dropdown menu, and select EC2&lt;/li&gt;
&lt;li&gt;On the EC2 dashboard, select Launch an Instance&lt;/li&gt;
&lt;li&gt;Select the Amazon Linux AMI 2018.03.0 (HVM), SSD Volume Type - ami-01fee56b22f308154 &lt;/li&gt;
&lt;li&gt;Select the General Purpose t2.micro which is free-tier eligible&lt;/li&gt;
&lt;li&gt;Click on Review and Launch and then click Launch to launch this EC2 instance&lt;/li&gt;
&lt;li&gt;Now go to the EC2 dashboard and select your EC2 Instance. Copy the public DNS and SSH into your created instance. &lt;/li&gt;
&lt;li&gt;Now, install all the required packages, tools, and libraries in it using standard Linux commands.&lt;/li&gt;
&lt;li&gt;Also, set up any credentials you might require later like AWS credentials, etc.&lt;/li&gt;
&lt;li&gt;Once satisfied with your instance, it’s time to create an AMI image from this instance.&lt;/li&gt;
&lt;li&gt;Go to EC2 dashboard, right-click on your instance. Click on Actions, select Image, and click on create an image.&lt;/li&gt;
&lt;li&gt;Keep the default settings and create the image by clicking on Create Image.&lt;/li&gt;
&lt;li&gt;It’ll take a couple of minutes and once it’s done, go ahead and terminate the instance you created. You will only need the AMI ID in the next phases.&lt;/li&gt;
&lt;/ul&gt;

&lt;h1&gt;
  
  
  AWS Data Pipeline Architecture
&lt;/h1&gt;

&lt;p&gt;The main idea behind this step is to set up a data pipeline which, upon certain triggers, launches an EC2 instance. Then a bash script runs on that instance; it is responsible for moving our raw data back and forth and running our preprocessing Python script. This step can be further divided into three main subsections, so let’s get started.&lt;/p&gt;

&lt;h2&gt;
  
  
  AWS Data Pipeline Architecture Definition
&lt;/h2&gt;

&lt;p&gt;First of all, let’s define the AWS data pipeline architecture. We can do so by writing a JSON file that defines and describes our data pipeline and provides it with all the required logic. I’ll try to break it down as much as required, but you can always refer to the documentation to explore more options. The data pipeline definition can contain different pieces of information, such as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Names, locations, and formats of your data sources&lt;/li&gt;
&lt;li&gt;Activities that transform the data&lt;/li&gt;
&lt;li&gt;The schedule for those activities&lt;/li&gt;
&lt;li&gt;Resources that run your activities and preconditions&lt;/li&gt;
&lt;li&gt;Preconditions that must be satisfied before the activities can be scheduled&lt;/li&gt;
&lt;li&gt;Ways to alert you with status updates as pipeline execution proceeds&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We can express the data pipeline definition in three parts: Objects, parameters and values.&lt;/p&gt;

&lt;h3&gt;
  
  
  Objects
&lt;/h3&gt;

&lt;p&gt;Below you can see the syntax of the definition.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"objects"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
       &lt;/span&gt;&lt;span class="nl"&gt;"name1"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"value1"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
       &lt;/span&gt;&lt;span class="nl"&gt;"name2"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"value2"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
       &lt;/span&gt;&lt;span class="nl"&gt;"name1"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"value3"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
       &lt;/span&gt;&lt;span class="nl"&gt;"name3"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"value4"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
       &lt;/span&gt;&lt;span class="nl"&gt;"name4"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"value5"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Following the above syntax we can place our required objects one by one. First of all, we need to define our pipeline object. We would be defining fields like ID, name, IAM and resource roles, path to save pipeline logs and schedule type. You can add or remove these fields based on your requirements and should look at the official documentation to know more about these and other fields.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Default"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Default"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"failureAndRerunMode"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"CASCADE"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"resourceRole"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"DataPipelineDefaultResourceRole"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"role"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"DataPipelineDefaultRole"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"pipelineLogUri"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"s3://automated-data-pipeline/logs/"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"scheduleType"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"ONDEMAND"&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can use this object with one change: the &lt;code&gt;pipelineLogUri&lt;/code&gt; field, where you give the path to the S3 bucket you want to save your logs in. The next object in our definition is the compute resource, i.e., the EC2 resource.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"MyEC2Resource"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Ec2Resource"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"imageId"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"ami-xxxxxxxxxxxxxxxxx"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"instanceType"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"r5.large"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"spotBidPrice"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"2.0"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"terminateAfter"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"30 Minutes"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"actionOnTaskFailure"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"terminate"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"maximumRetries"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"1"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"role"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"DataPipelineDefaultRole"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"resourceRole"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"DataPipelineDefaultResourceRole"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"keyPair"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"&amp;lt;YOUR-KEY&amp;gt;"&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="err"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This object describes our compute needs: in this example, an r5.large EC2 instance on spot pricing, launched with your key pair. Also, remember to put the pre-configured AMI ID in the &lt;code&gt;imageId&lt;/code&gt; field so the instance launches with all of the configuration already in place. Now, let’s move on to the next and last object, the shell activity. This object runs our shell script, which in turn runs our preprocessing code.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"ShellCommandActivityObj"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"ShellCommandActivityObj"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"ShellCommandActivity"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"command"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"aws s3 cp s3://automated-data-pipeline/script.sh ~/ &amp;amp;&amp;amp; sudo sh ~/script.sh #{myS3DataPath}"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"maximumRetries"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"0"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="nl"&gt;"runsOn"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"ref"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"MyEC2Resource"&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
      &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;

&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In this object, the two most important fields are &lt;code&gt;command&lt;/code&gt; and &lt;code&gt;runsOn&lt;/code&gt;. The &lt;code&gt;command&lt;/code&gt; field defines the bash command you would like to run on the EC2 instance described earlier. Here, the command copies a bash script onto the instance and runs it. Note that it is also given a parameter, &lt;code&gt;#{myS3DataPath}&lt;/code&gt;, which is the path to the data we want the pipeline to preprocess. Passing it as a parameter adds flexibility, so the same pipeline can handle different data sets. The &lt;code&gt;runsOn&lt;/code&gt; field takes a reference to the EC2 resource we created earlier, so the shell command runs on that resource.&lt;/p&gt;

&lt;h3&gt;
  
  
  Parameters
&lt;/h3&gt;

&lt;p&gt;Parameter placeholders should be written in the format &lt;code&gt;#{myPlaceholder}&lt;/code&gt;, and every parameter ID should start with the "my" prefix. Here is the parameters section of the definition JSON file:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="nl"&gt;"parameters"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"id"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"myS3DataPath"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"name"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"mys3DataPath"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"description"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"This is the path to the data uploaded"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"type"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"AWS::S3::ObjectKey"&lt;/span&gt;&lt;span class="w"&gt;

        &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We have defined our parameter to be of the AWS S3 object key type. The whole data pipeline definition can be found &lt;a href="https://github.com/EshbanTheLearner/preprocessing-pipeline-demo/blob/main/definition.json" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Now that you are done defining your pipeline, create it with the following command.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;aws datapipeline create-pipeline &lt;span class="nt"&gt;--name&lt;/span&gt; data-preprocessing-pipeline &lt;span class="nt"&gt;--unique-id&lt;/span&gt; data-preprocessing-pipeline

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The command above returns a pipeline ID of the form &lt;code&gt;df-xxxxxxxxxxxxxxxxxxxx&lt;/code&gt;; keep it handy, since the next steps need it. Once the pipeline is created, you can put the definition in place. Note that we can pass a temporary parameter value at this stage; later it will be passed dynamically on activation.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;aws datapipeline put-definition &lt;span class="nt"&gt;--pipeline-definition&lt;/span&gt; file://definition.json &lt;span class="se"&gt;\ &lt;/span&gt;&lt;span class="nt"&gt;--parameter-values&lt;/span&gt; &lt;span class="nv"&gt;s3DataPath&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&amp;lt;s3://your/s3/data/path&amp;gt; &lt;span class="nt"&gt;--pipeline-id&lt;/span&gt; &amp;lt;Your Pipeline ID&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now that our data pipeline is defined and created, let’s write the bash script that will run on the compute resource of our data pipeline.&lt;/p&gt;

&lt;h2&gt;
  
  
  Bash Script
&lt;/h2&gt;

&lt;p&gt;This script will run on the EC2 instance that the pipeline launches as its compute resource. Its job is simple: it makes a new working directory, exports the working directory and the S3 data path as environment variables, copies the Python preprocessing script from S3 into the working directory, runs it, and finally uploads the cleaned data back to S3. Here is the code you will need:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="c"&gt;#!/bin/bash&lt;/span&gt;

&lt;span class="nb"&gt;echo&lt;/span&gt; &lt;span class="nt"&gt;-e&lt;/span&gt; &lt;span class="s2"&gt;"Starting the process"&lt;/span&gt;

&lt;span class="nb"&gt;sudo mkdir&lt;/span&gt; ~/data-pipeline-tmp
&lt;span class="nb"&gt;sudo chmod &lt;/span&gt;ugo+rwx ~/data-pipeline-tmp
&lt;span class="nb"&gt;cd&lt;/span&gt; ~/data-pipeline-tmp

&lt;span class="nv"&gt;CURRENT_DIR&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="si"&gt;$(&lt;/span&gt;&lt;span class="nb"&gt;eval&lt;/span&gt; &lt;span class="s2"&gt;"pwd"&lt;/span&gt;&lt;span class="si"&gt;)&lt;/span&gt;

&lt;span class="nv"&gt;DATA_PATH&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nv"&gt;$1&lt;/span&gt;

&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;WORKING_DIR&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nv"&gt;$CURRENT_DIR&lt;/span&gt;
&lt;span class="nb"&gt;export &lt;/span&gt;&lt;span class="nv"&gt;S3_DATA_PATH&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="nv"&gt;$DATA_PATH&lt;/span&gt;

aws s3 &lt;span class="nb"&gt;cp &lt;/span&gt;s3://automated-data-pipeline/scripts/script.py &lt;span class="nv"&gt;$WORKING_DIR&lt;/span&gt;

python3 &lt;span class="nv"&gt;$WORKING_DIR&lt;/span&gt;/script.py

aws s3 &lt;span class="nb"&gt;cp&lt;/span&gt; &lt;span class="nv"&gt;$WORKING_DIR&lt;/span&gt;/twitter_data_cleaned.csv s3://automated-data-pipeline/outputs/
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In my case, the S3 bucket is named &lt;code&gt;automated-data-pipeline&lt;/code&gt; and I have made folders to separate different objects. This code can also be found &lt;a href="https://github.com/EshbanTheLearner/preprocessing-pipeline-demo/blob/main/script.sh" rel="noopener noreferrer"&gt;here&lt;/a&gt;. Next is the python code that will preprocess the data.&lt;/p&gt;

&lt;h2&gt;
  
  
  Python Code
&lt;/h2&gt;

&lt;p&gt;This is the standard preprocessing code we will use to clean our dataset. Feel free to add or remove steps according to your needs:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;pandas&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;re&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;nltk&lt;/span&gt;
&lt;span class="n"&gt;nltk&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;download&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;stopwords&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;nltk.corpus&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;stopwords&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt;  &lt;span class="n"&gt;nltk.stem&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;SnowballStemmer&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;

&lt;span class="n"&gt;path&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;os&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;environ&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;S3_DATA_PATH&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;

&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;Inside Python Script&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;Path = &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Loading Data&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;read_csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;path&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;encoding&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;ISO-8859-1&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;names&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;label&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;date&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;flag&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;user&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;tweet&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Data has &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;0&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; rows and &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;shape&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt; columns&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;TEXT_CLEANING_RE&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;@\S+|https?:\S+|http?:\S|[^A-Za-z0-9]+&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

&lt;span class="n"&gt;stop_words&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;stopwords&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;words&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;english&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;stemmer&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;SnowballStemmer&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;english&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;preprocess&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;stem&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="c1"&gt;# Remove link, user and special characters
&lt;/span&gt;    &lt;span class="n"&gt;text&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;re&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;sub&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;TEXT_CLEANING_RE&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt; &lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nf"&gt;str&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;).&lt;/span&gt;&lt;span class="nf"&gt;lower&lt;/span&gt;&lt;span class="p"&gt;()).&lt;/span&gt;&lt;span class="nf"&gt;strip&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
    &lt;span class="n"&gt;tokens&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[]&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;token&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;split&lt;/span&gt;&lt;span class="p"&gt;():&lt;/span&gt;
        &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;token&lt;/span&gt; &lt;span class="ow"&gt;not&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;stop_words&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
            &lt;span class="k"&gt;if&lt;/span&gt; &lt;span class="n"&gt;stem&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="n"&gt;tokens&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;stemmer&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;stem&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;token&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
            &lt;span class="k"&gt;else&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
                &lt;span class="n"&gt;tokens&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;token&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt; &lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;join&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tokens&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Starting cleaning process&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;text&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;apply&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;lambda&lt;/span&gt; &lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;preprocess&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="nf"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Data cleaning completed, saving to CSV!&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;to_csv&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;twitter_data_cleaned.csv&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;index&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can also find this code &lt;a href="https://github.com/EshbanTheLearner/preprocessing-pipeline-demo/blob/main/script.py" rel="noopener noreferrer"&gt;here&lt;/a&gt;. You have now successfully defined and created a working data pipeline that can run on its own, with manual activation (a minimal example of which is sketched below). To make it truly event-driven, we need to write a cloud function that will act as a trigger: it will listen for certain events and activate our pipeline when required.&lt;/p&gt;
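
&lt;p&gt;For reference, here is a minimal sketch of such a manual activation using boto3. The pipeline ID and the sample CSV key are placeholders you would replace with your own values; the parameter ID &lt;code&gt;myS3DataPath&lt;/code&gt; is the one we defined in the pipeline definition above.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Minimal sketch: manually activating the pipeline with boto3.
# The pipeline ID and the sample CSV key below are placeholders.
import boto3

client = boto3.client("datapipeline")

response = client.activate_pipeline(
    pipelineId="df-xxxxxxxxxxxxxxxxxxxx",
    parameterValues=[
        {
            "id": "myS3DataPath",
            "stringValue": "s3://automated-data-pipeline/preprocess/sample.csv"
        }
    ]
)
print(response)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;The trigger function in the next section does exactly this, just kicked off by an S3 upload instead of being run by hand.&lt;/p&gt;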

&lt;h1&gt;
  
  
  Event Handler AWS Lambda Function
&lt;/h1&gt;

&lt;p&gt;The title says that we will be using an AWS Lambda function for this step, but I like to use Chalice for it. You can use either, as per your preference; the code will be almost the same. Following are the steps to create the Chalice app, which runs on AWS Lambda and triggers the data pipeline. You will need the ID of the pipeline you created earlier.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Create a Chalice app using &lt;code&gt;chalice new-project &amp;lt;NAME&amp;gt;&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Once the project is initialized, open the &lt;code&gt;app.py&lt;/code&gt; file&lt;/li&gt;
&lt;li&gt;Copy the contents of the following snippet into it. Code also available &lt;a href="https://github.com/EshbanTheLearner/preprocessing-pipeline-demo/blob/main/app.py" rel="noopener noreferrer"&gt;here&lt;/a&gt;.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="n"&gt;chalice&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;Chalice&lt;/span&gt;
&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;boto3&lt;/span&gt;

&lt;span class="n"&gt;app&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nc"&gt;Chalice&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;app_name&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;pipeline-trigger&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="n"&gt;client&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;boto3&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;client&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;datapipeline&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;

&lt;span class="c1"&gt;# The pipeline you want to activate
&lt;/span&gt;&lt;span class="n"&gt;PIPELINE_ID&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;df-xxxxxxxxxxxxxxxxxxxx&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;

&lt;span class="nd"&gt;@app.on_s3_event&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;bucket&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;automated-data-pipeline&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;events&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;s3:ObjectCreated:*&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;prefix&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;preprocess/&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;suffix&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;.csv&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;activate_pipeline&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="n"&gt;app&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;log&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;debug&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;Received event for bucket: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;bucket&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;, key: &lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;response&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;client&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;activate_pipeline&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
            &lt;span class="n"&gt;pipelineId&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;PIPELINE_ID&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
            &lt;span class="n"&gt;parameterValues&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;
                &lt;span class="p"&gt;{&lt;/span&gt;
                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;id&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;myS3DataPath&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                    &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;stringValue&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sa"&gt;f&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;s3://&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;bucket&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="s"&gt;/&lt;/span&gt;&lt;span class="si"&gt;{&lt;/span&gt;&lt;span class="n"&gt;event&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;key&lt;/span&gt;&lt;span class="si"&gt;}&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;
                &lt;span class="p"&gt;}&lt;/span&gt;
            &lt;span class="p"&gt;]&lt;/span&gt;
        &lt;span class="p"&gt;)&lt;/span&gt;
        &lt;span class="n"&gt;app&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;log&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;debug&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;response&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;Exception&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
        &lt;span class="n"&gt;app&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;log&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;critical&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;Change the arguments, such as &lt;code&gt;PIPELINE_ID&lt;/code&gt; and the S3 bucket path&lt;/li&gt;
&lt;li&gt;Once done, deploy the Chalice app using &lt;code&gt;chalice deploy&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;If it deploys successfully, go to the AWS console -&amp;gt; Lambda&lt;/li&gt;
&lt;li&gt;Select your Lambda function and go to the Permissions tab&lt;/li&gt;
&lt;li&gt;Click on the name of the Execution Role; this opens the IAM role for this particular Lambda function&lt;/li&gt;
&lt;li&gt;Under the Permissions tab, click on the policy name to expand it&lt;/li&gt;
&lt;li&gt;Make sure that the policy has &lt;code&gt;iam:PassRole&lt;/code&gt; and the proper Data Pipeline permissions&lt;/li&gt;
&lt;li&gt;To make life easier, the following IAM policy works fine
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"Version"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"2012-10-17"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="nl"&gt;"Statement"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"Sid"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"VisualEditor0"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"Effect"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Allow"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"Action"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
                &lt;/span&gt;&lt;span class="s2"&gt;"iam:PassRole"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
                &lt;/span&gt;&lt;span class="s2"&gt;"datapipeline:*"&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"Resource"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"*"&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"Sid"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"VisualEditor1"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"Effect"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"Allow"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"Action"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="w"&gt;
                &lt;/span&gt;&lt;span class="s2"&gt;"logs:CreateLogStream"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
                &lt;/span&gt;&lt;span class="s2"&gt;"logs:CreateLogGroup"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
                &lt;/span&gt;&lt;span class="s2"&gt;"logs:PutLogEvents"&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;&lt;span class="w"&gt;
            &lt;/span&gt;&lt;span class="nl"&gt;"Resource"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"arn:*:logs:*:*:*"&lt;/span&gt;&lt;span class="w"&gt;
        &lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
    &lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h1&gt;
  
  
  Testing
&lt;/h1&gt;

&lt;p&gt;To test this pipeline, you need to upload the dataset to the S3 bucket path you specified in the trigger function. In my case the path is &lt;code&gt;s3://automated-data-pipeline/preprocess/&lt;/code&gt;, so I can use the following command in my terminal to upload the data, sit back, and wait for the output to appear in the S3 path I specified.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;aws s3 &lt;span class="nb"&gt;cp&lt;/span&gt; ~/training.1600000.processed.noemoticon.csv s3://automated-data-pipeline/preprocess/
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
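
&lt;p&gt;While you wait, you can also check on the run itself instead of refreshing the S3 console. Below is a minimal sketch that polls the pipeline state with boto3; the pipeline ID is a placeholder, and the field keys it prints (such as &lt;code&gt;@pipelineState&lt;/code&gt; and &lt;code&gt;@healthStatus&lt;/code&gt;) are the ones I would expect Data Pipeline to report for an activated pipeline.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Minimal sketch: checking the pipeline state with boto3.
# The pipeline ID is a placeholder; replace it with your own.
import boto3

client = boto3.client("datapipeline")

description = client.describe_pipelines(
    pipelineIds=["df-xxxxxxxxxxxxxxxxxxxx"]
)

# Each pipeline description carries a list of key/value fields.
for pipeline in description["pipelineDescriptionList"]:
    for field in pipeline["fields"]:
        if field["key"] in ("@pipelineState", "@healthStatus"):
            print(field["key"], field.get("stringValue"))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;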



&lt;p&gt;After the pipeline has run its course, it will automatically terminate the resources attached to it so you don’t incur any unwanted bills, and the cleaned data will be uploaded to your specified path, ready to be used. Now let’s compare the data before and after. Here is what the data looked like in its raw form:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Ftvw2zzx4w6onxialcdgf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Ftvw2zzx4w6onxialcdgf.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Here is what the data looks like after going through the pipeline once:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fr9g340jgelv3mgwfp5q5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fr9g340jgelv3mgwfp5q5.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You can clearly see the difference. You can also find the accompanying notebooks &lt;a href="https://github.com/EshbanTheLearner/preprocessing-pipeline-demo" rel="noopener noreferrer"&gt;here&lt;/a&gt; if you want to take a closer look.&lt;/p&gt;

&lt;h1&gt;
  
  
  Conclusion
&lt;/h1&gt;

&lt;p&gt;I know there are a lot of steps involved in this process, but I assure you that once you have set up a pipeline like this, your life will be much easier. Still seems like a lot of work? Contact us at &lt;a href="mailto:help@traindex.io"&gt;help@traindex.io&lt;/a&gt; to consult on any data engineering or data science problems you might be facing.&lt;/p&gt;

</description>
      <category>aws</category>
      <category>datascience</category>
      <category>bigdata</category>
    </item>
    <item>
      <title>How Traindex Leverages the Keyword Search?</title>
      <dc:creator>Mohsin Ashraf</dc:creator>
      <pubDate>Fri, 13 Nov 2020 17:50:23 +0000</pubDate>
      <link>https://dev.to/traindex/how-traindex-leverages-the-keyword-search-7fi</link>
      <guid>https://dev.to/traindex/how-traindex-leverages-the-keyword-search-7fi</guid>
      <description>&lt;p&gt;Traindex is a semantic search engine for corporate datasets to retrieve the most relevant results from a corpus. This search not only incorporates the meaning of the words but also includes contextual awareness. It enables a semantic search engine to outperform any keyword search engine; to find more, you can head out to our detailed &lt;a href="https://dev.to/traindex/how-is-semantic-search-different-from-keyword-search-578d"&gt;article&lt;/a&gt; about the difference between both these approaches.  &lt;/p&gt;

&lt;p&gt;Traindex performance is measured using a variety of benchmarks, ranging from automated algorithms to manual experts' classification. For instance, we are using the &lt;a href="https://dev.to/traindex/benchmarking-of-textual-models-jaccard-similarity-1c3i"&gt;Jaccard Similarity&lt;/a&gt;, which counts how many words from the retrieved results match the query's keywords. The following graph illustrates the visual intuition of this benchmark:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--f3R8oYIS--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/otxfhw6pri8vw8js8b5k.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--f3R8oYIS--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/otxfhw6pri8vw8js8b5k.png" alt="Screenshot from 2020-11-13 22-48-23"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The blue bar represents the query's keywords, whereas the orange bars represent the matched keywords. The higher the bar, the more keywords in common.&lt;/p&gt;

&lt;p&gt;In general, the benchmarking queries contain around 1,200+ unique words. We have achieved a 47% average Jaccard Similarity score over the first twenty results for each query in our latest API release. This score means that we are implicitly applying keyword search over the corpus, since it shows that, on average, Traindex retrieves about 564 keywords in common with the query, which is far more than any keyword search engine can offer.&lt;/p&gt;
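
&lt;p&gt;To make the metric concrete, here is a minimal sketch of a Jaccard Similarity computation over the unique words of a query and a retrieved result. The two example strings are made up for illustration and are not taken from our benchmark.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Minimal sketch: Jaccard similarity between a query and a retrieved result,
# computed over their sets of unique words. The texts are made-up examples.
def jaccard_similarity(text_a, text_b):
    words_a = set(text_a.lower().split())
    words_b = set(text_b.lower().split())
    intersection = words_a.intersection(words_b)
    union = words_a.union(words_b)
    return len(intersection) / len(union)

query = "method for catalytic chemical reaction in aqueous solution"
result = "catalytic reaction of chemical compounds in aqueous media"

print(round(jaccard_similarity(query, result), 2))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;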

&lt;p&gt;Moreover, the common words are not just random words: they are meticulously picked by an algorithm that decides which words best represent the query. The same algorithm also incorporates their semantic meaning during the search, which sometimes raises the Jaccard Similarity score as high as 90% or even 99%.&lt;/p&gt;

&lt;p&gt;When it comes to response time, Traindex is fairly quick. You can imagine how much time it would take to perform a keyword search of 564 words on 8.5M+ documents. It will require a lot of time and resources to go through the entire corpus, match the keywords, and bring up the relevant results. However, Traindex searches and ranks the results by their semantic similarity and not by the highest keyword match, as you can see from the above figure.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Traindex could give you the best of both worlds: keyword search and semantic search. Keyword search over millions of documents will take a long processing time and too many resources and produce a lot of false positives. Traindex, on the other hand, permits you to do a query with an entire document with even tens of thousands of words, and still, the response time is quick, and results are quite relevant.&lt;/p&gt;

</description>
      <category>nlp</category>
      <category>machinelearning</category>
      <category>datascience</category>
      <category>patents</category>
    </item>
    <item>
      <title>Patent Search</title>
      <dc:creator>Nada Gul</dc:creator>
      <pubDate>Thu, 12 Nov 2020 10:51:40 +0000</pubDate>
      <link>https://dev.to/traindex/patent-search-4j05</link>
      <guid>https://dev.to/traindex/patent-search-4j05</guid>
      <description>&lt;h3&gt;
  
  
  How to conduct a patent search
&lt;/h3&gt;

&lt;p&gt;A patent search helps evaluate the patentability of your invention. It lets you know if somebody else has already come up with the same or a similar idea. If there's an existing patent for your invention, the patent office will reject your application, and all the time and money you invested in it will go to waste.&lt;/p&gt;

&lt;p&gt;Patent search may get a little overwhelming, which is why this article will help and make this daunting task easier for you.&lt;/p&gt;

&lt;p&gt;If you're new to patent searching, let's understand why you might find it helpful. If you have an idea for an invention, a patent search will help you familiarize yourself with previous patents to understand your invention's patentability. Understanding patentability enables you to avoid costly decisions. Patent search may also allow you to improve your existing patent application or your invention itself while avoiding copying someone else's idea.&lt;/p&gt;

&lt;p&gt;Besides building up on your idea for an invention, you might want to conduct a patent search to find out about recent inventions. It might assist you in studying the development of a particular technology you are interested in. You can also find patents by famous academics, perhaps for your research, or maybe if you are just keen on learning about their work.&lt;/p&gt;

&lt;p&gt;Now that we've discussed what patent search is and how it might help you, let's move on to the process of a patent search itself. There are multiple ways to conduct a patent search. We have made a list of four patent search methods we have used in the past and found useful:&lt;/p&gt;

&lt;h2&gt;
  
  
  1. United States Patent and Trademark Office
&lt;/h2&gt;

&lt;p&gt;The United States Patent and Trademark Office, or the &lt;a href="https://www.uspto.gov/"&gt;USPTO&lt;/a&gt;, has an extensive library with multiple patent resources. There are Full-Text Patents from 1976 and PDF Image Patents starting from 1790. There is also a Full Text and Image Database for patent applications.&lt;/p&gt;

&lt;p&gt;There are three ways of searching Full Text patents: &lt;a href="http://patft.uspto.gov/netahtml/PTO/search-bool.html"&gt;Quick Search&lt;/a&gt;, &lt;a href="http://patft.uspto.gov/netahtml/PTO/search-adv.htm"&gt;Advanced Search&lt;/a&gt;, and &lt;a href="http://patft.uspto.gov/netahtml/PTO/srchnum.htm"&gt;Patent Number Search&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;In Quick Search, you enter the terms along with the fields to search for patents. Although the process was very slow, I got 848,557 hits, with 1 through 50 shown on the first page, when I searched for chemical (Term 1) and reaction (Term 2) with the boolean operator "and". You can choose the boolean operator when you enter terms to search: "and", "or", and "andnot". You can easily move to the next page by clicking "Next 50 hits", or enter the page number you want to jump to in the box next to "Jump To".&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--alJ3ApuP--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/qrtiejx1o1jgecbv2jdh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--alJ3ApuP--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/qrtiejx1o1jgecbv2jdh.png" alt="USPTO Quick Search"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In Advanced Search, you have the option to enter multiple terms with appropriate boolean operators in different fields. The Patent Number Search allows you to search for patents using patent numbers.&lt;/p&gt;

&lt;h2&gt;
  
  
  2. Google Patent Search
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--4duLRGVE--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/7339h3xb2q708w5ogcg5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--4duLRGVE--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/7339h3xb2q708w5ogcg5.png" alt="Google Patent Search"&gt;&lt;/a&gt;&lt;br&gt;
Google Patents works the same as the Google search engine but catalogs patents and patent applications. Over the years, it has expanded to cover the European Patent Office, World Intellectual Property Organization, among other Intellectual Property organizations worldwide. It has global litigation information showing litigation history for patents anywhere in the world. Google Patent Search engine is a lot faster than the USPTO in showing results.&lt;/p&gt;

&lt;h2&gt;
  
  
  3. IFI Claims Patent Services
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://www.ificlaims.com/start.htm"&gt;IFI Claims Patent Services&lt;/a&gt; is another platform to search for global patent information if you have a subscription. Google Patents also use the IFI Claims patent database. You cannot search for patents on their publicly available website. However, the IFI Claims service provides an API service built on top of a SOLR index. The full documentation of their &lt;a href="https://docs.ificlaims.com/display/CDVDP/Search"&gt;search API endpoint is available here&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  4. Traindex
&lt;/h2&gt;

&lt;p&gt;Traindex is a search service for corporate datasets powered by machine learning. Specifically, it is a semantic text similarity service served as an Application Programming Interface (API). Traindex applies a model on patents extracted from "Google public patents data" and serves it as an API that your applications can consume.&lt;/p&gt;

&lt;p&gt;You can use the Traindex search widget &lt;a href="https://www.traindex.io/search"&gt;here&lt;/a&gt; to search for patents. Type your query in the space provided, as shown in the image below. You can choose one of the two indices: Google Patents or Wikipedia (the two active indices) and hit the search button. It was fairly quick and gave me 99 results for my query, "chemical reaction", along with the score for each link.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--2kcPo-2d--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/c9i4f1ozp9no6uyv4wv4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--2kcPo-2d--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/c9i4f1ozp9no6uyv4wv4.png" alt="Traindex"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;For more information on Traindex, check out the link &lt;a href="https://www.traindex.io/"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Patentability search is essential if you need to submit a patent application or have an idea for an invention and want to ensure someone else hasn't come up with the same or similar idea. It is also helpful if you want to find out about recent inventions or are interested in the development of particular technologies or want to find patents of famous academics. There are various routes to search for patents, including the USPTO, Google Patent Search, IFI claims, and Traindex.&lt;/p&gt;

&lt;p&gt;Hopefully, this article gave you an insight into how to search for patents. But if you have any questions regarding patent searching, send us an email at &lt;a href="mailto:blog@foretheta.com"&gt;blog@foretheta.com&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>patents</category>
      <category>patentsearch</category>
      <category>tips</category>
    </item>
    <item>
      <title>Introduction to Data Pipelines</title>
      <dc:creator>Eshban Suleman</dc:creator>
      <pubDate>Mon, 26 Oct 2020 17:37:22 +0000</pubDate>
      <link>https://dev.to/traindex/introduction-to-data-pipelines-26o7</link>
      <guid>https://dev.to/traindex/introduction-to-data-pipelines-26o7</guid>
      <description>&lt;p&gt;If you are a growing data-driven organization, you might have been working to harvest large amounts of data to extract valuable insights from it. This can be costly and inefficient unless the data science team adopts the repeatable solutions to common problems. Although the specifics of organizations may vary, the basic principles remain the same. There are some common features that you can encapsulate into a data pipeline. Let’s look at a common problem and see how we overcame it.&lt;/p&gt;

&lt;p&gt;Our team members at Traindex manually performed recurring tasks. These tasks included data cleaning, model training, testing, and so on. By performing these tasks manually, the engineer worked on the same thing again and again. This resulted in slow throughput, human error, and lack of flexibility and centralization. &lt;/p&gt;

&lt;p&gt;To overcome this, we envisioned a data pipeline to do all the above tasks with minimal human intervention. We developed and deployed such a pipeline, and it has proven itself to be a breath of fresh air. In this article, we’ll look at what data pipelines are, the benefits of using data pipelines in a corporate setting, and finally, what an event-driven data pipeline is.&lt;/p&gt;

&lt;h1&gt;
  
  
  What is a Data Pipeline?
&lt;/h1&gt;

&lt;p&gt;In simple terms, a pipeline is nothing more than a set of steps performed in a particular order. A data pipeline is a set of processes performed on data as it moves from a source to a destination, also known as the sink. The source could be anything from online transactional databases to data lakes, and the sink could be anything from data warehouses to business intelligence systems. The most common data pipeline is ETL, which extracts, transforms, and loads the data. The transformation step can include anything, depending on the business. Here is a detailed data pipeline diagram:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F485dkplisyhck3b47bn8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F485dkplisyhck3b47bn8.png" alt="Alt Text" width="722" height="267"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;An ETL pipeline is a type of data pipeline that performs operations in batches and is sometimes referred to as a batch data pipeline. Batch processing was the norm for a long time, but there are now other types of processing available, such as streaming and real-time processing. The architecture of a data pipeline can vary a lot according to your business needs; for example, stream analytics for IoT applications keeps data flowing from hundreds of sensors into real-time analysis.&lt;/p&gt;
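
&lt;p&gt;To make the idea concrete, here is a minimal, self-contained sketch of one batch ETL step in Python. The file names and the transformation rule are made up for illustration; in practice the source, the transformation logic, and the sink are whatever your business needs.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Minimal sketch of a batch ETL step: extract rows from a source file,
# transform them, and load the result into a destination file.
# The file names and the transformation rule are illustrative placeholders.
import csv

def extract(path):
    with open(path, newline="") as f:
        return list(csv.DictReader(f))

def transform(rows):
    # Example rule: normalize a text column and drop empty rows.
    cleaned = []
    for row in rows:
        text = row.get("text", "").strip().lower()
        if text:
            row["text"] = text
            cleaned.append(row)
    return cleaned

def load(rows, path):
    if not rows:
        return
    with open(path, "w", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=rows[0].keys())
        writer.writeheader()
        writer.writerows(rows)

if __name__ == "__main__":
    load(transform(extract("raw_events.csv")), "clean_events.csv")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;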

&lt;p&gt;Now that we have understood what a data pipeline is, let's discuss why it is important to use data pipelines in modern data-oriented applications.&lt;/p&gt;

&lt;h1&gt;
  
  
  Why use Data Pipeline
&lt;/h1&gt;

&lt;p&gt;In modern data-driven organizations, almost all actions and decisions are based on insights gathered from data. Every department of the organization has certain authorizations, restrictions, and data needs. Often, organizations have a single entity that manages everyone's requirements, resulting in a data silo. In such situations, getting even simple insights becomes difficult and leads to data redundancy across departments. The effort required to obtain essential data also holds the organization back.&lt;/p&gt;

&lt;h3&gt;
  
  
  Easy and Fast Access to Data
&lt;/h3&gt;

&lt;p&gt;Well-thought-out data pipelines result in easy and fast access to data throughout the organization, with the right permission roles. Anyone from any department can access their desired data without intervention or interference.&lt;/p&gt;

&lt;h3&gt;
  
  
  Swift Decision Making
&lt;/h3&gt;

&lt;p&gt;Based on the previously mentioned point, fast access to the data results in quick data-driven decisions. Data supports such choices, and they are less likely to go south.&lt;/p&gt;

&lt;h3&gt;
  
  
  Scalability
&lt;/h3&gt;

&lt;p&gt;Well-architected data pipelines can automatically scale up or down according to the users' or organization's needs. This saves admins from having to keep a constant eye on usage and manually add or remove resources as requirements change.&lt;/p&gt;

&lt;h3&gt;
  
  
  Reliability
&lt;/h3&gt;

&lt;p&gt;Well-written data pipelines improve data quality. The data becomes more reliable, and executives can make better decisions based on it. &lt;/p&gt;

&lt;h3&gt;
  
  
  Economically Efficient
&lt;/h3&gt;

&lt;p&gt;Automated data pipelines run independently and need minimal maintenance and human intervention, and thus a smaller paid workforce. Their autonomous nature also allows them to remove unused resources and save costs.&lt;/p&gt;

&lt;p&gt;Since we now understand what a data pipeline is and its benefits, let us see how we crafted a pipeline according to our needs at Traindex.&lt;/p&gt;

&lt;h1&gt;
  
  
  Event-Driven Data Pipelines
&lt;/h1&gt;

&lt;p&gt;Based on the problem we discussed at the beginning of this article, we decided on an event-driven pipeline, which runs only when certain events occur. We wanted our pipeline to automatically run the data processing jobs, followed by training a machine learning model on the preprocessed data, and then to run some tests once training is complete, all kicked off by a specific event, which in our case was an upload event.&lt;br&gt;
When a user or engineer moves data into a specified data store, it generates an event; once the upload completes, it triggers our pipeline. Scheduling is not optimal for this use case because we don’t know when raw data will be uploaded to our storage. It can be frequent or occasional, so we went for the event-driven approach.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fk2m2q19d34w857s8jr1e.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fk2m2q19d34w857s8jr1e.png" alt="Alt Text" width="800" height="226"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  Conclusion
&lt;/h1&gt;

&lt;p&gt;We learned the importance of mining large datasets efficiently to get the best insights on time to stay ahead of the competition. Modern-day data-driven organizations should consider setting up data pipelines to provide their teams with correct and useful data a click away. Data pipelines can also automate data-driven and recurring tasks like data preprocessing, model training, and testing on a schedule or based on specific events. We hope you have found this article useful, and you may consider crafting some data pipeline solutions for your organization. You can consult your data engineering problems with us at &lt;a href="mailto:help@traindex.io"&gt;help@traindex.io&lt;/a&gt;&lt;/p&gt;

</description>
      <category>datascience</category>
      <category>bigdata</category>
      <category>dataengineering</category>
      <category>pipelines</category>
    </item>
    <item>
      <title>Summarizing Large Documents for Machine Learning Modeling</title>
      <dc:creator>Mohsin Ashraf</dc:creator>
      <pubDate>Thu, 08 Oct 2020 14:25:47 +0000</pubDate>
      <link>https://dev.to/traindex/summarizing-large-documents-for-machine-learning-modeling-4jhc</link>
      <guid>https://dev.to/traindex/summarizing-large-documents-for-machine-learning-modeling-4jhc</guid>
      <description>&lt;p&gt;Information is growing exponentially every passing day thanks to the internet. It has connected humanity from all the corners of the world. According to Forbes, 2.5 quintillion bytes of data are created every day, and the pace is accelerating. It includes all kinds of data: text, images, videos, and transactions, etc. Text data is the largest shareholder among these data. This text data can be a conversation between two persons, which can be of small sentences, or it can be intellectual property data, for example, patents, which can be up to millions of words.&lt;/p&gt;

&lt;p&gt;Handling smaller datasets with fairly small to medium-length documents is no longer a big challenge, thanks to the power of deep learning. The problem comes when you have to deal with large documents, ranging from a few thousand to millions of words, such as patents and research papers. Even state-of-the-art deep learning methods struggle to capture the long-term dependencies in such documents, so these huge documents require special techniques.&lt;/p&gt;

&lt;p&gt;At Traindex, we are working with intellectual property and providing effective and efficient search solutions on patents. Patent analysts might want to know what other patents exist in the same domain when filing a new patent. They may want to find prior art to challenge a claim in an existing patent. There are numerous use cases that better patent search helps solve.&lt;/p&gt;

&lt;p&gt;We are dealing with millions of patents, each containing thousands of words and some even reaching millions of words. Handling such a massive dataset of enormously large documents is a big challenge. These patents are not only lengthy but also intellectually dense, with long-term dependencies, and deep learning alone fails to capture the proper semantics of such humongous documents. Specialized techniques are needed, so we have developed preprocessing techniques that reduce the size of the documents while keeping their meaning largely intact. &lt;/p&gt;

&lt;p&gt;We use an extractive summarizer that first goes through the whole patent, figures out which sentences are important, and then drops the least important ones. The summarizer uses two measures to decide whether a sentence is essential: first, how many stopwords the sentence contains (which reduces its importance), and second, how many important topics it contains relative to the overall topics discussed in the patent (which increases its importance). We then use a simple threshold to decide which sentences to keep and which to drop. By changing the threshold's value, we can change the summary length so that the summary contains the most important information about the patent. The following figure illustrates this point.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Foyrpgfl1xqeiclriacj3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Foyrpgfl1xqeiclriacj3.png" alt="Extraction" width="720" height="504"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The above image shows the scores of the sentences in a patent. The horizontal red line is the importance threshold for keeping or discarding sentences. The blue bars are the sentences whose importance falls below our defined threshold; we drop them, and the result looks as follows.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F6n4sviu12iyg8nrzuzz5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F6n4sviu12iyg8nrzuzz5.png" alt="Summarization" width="720" height="504"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;These are the sentences we keep for the summarized patent. We can move the threshold line to get different summary lengths based on our needs. The flow of the overall process is given below.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F8tp14hitqupeuomrcf75.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F8tp14hitqupeuomrcf75.jpg" alt="Untitled Diagram-1" width="396" height="341"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We have tested this approach, and it has improved our search index's overall performance on patents. It also sidesteps a practical problem for deep learning algorithms, especially pre-trained models like Universal Sentence Encoder or BERT, which accept only a limited number of tokens per document; exceeding that limit causes errors. You can apply this summarization technique with any embedding algorithm that places limits on the input document's length.&lt;/p&gt;
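
&lt;p&gt;To make the scoring idea concrete, here is a minimal sketch of a threshold-based extractive summarizer in the spirit described above. The exact scoring function and threshold we use are not shown in this article, so the weighting below, and the use of simple word frequencies as a stand-in for the patent's topics, are illustrative assumptions only.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Illustrative sketch only: score sentences by topic words (up) and stopwords
# (down), then keep the sentences whose score clears a threshold.
from collections import Counter

STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "it", "for", "with"}

def sentence_scores(sentences, num_topic_words=20):
    # Approximate the patent's "important topics" by its most frequent
    # non-stopword terms (a simple stand-in for a proper topic model).
    words = [w for s in sentences for w in s.lower().split() if w not in STOPWORDS]
    topic_words = {w for w, _ in Counter(words).most_common(num_topic_words)}
    scores = []
    for s in sentences:
        tokens = s.lower().split()
        topic_count = sum(t in topic_words for t in tokens)
        stopword_count = sum(t in STOPWORDS for t in tokens)
        # More topic words raise the score, more stopwords lower it.
        scores.append((topic_count - stopword_count) / max(len(tokens), 1))
    return scores

def summarize(sentences, threshold=0.0):
    # Moving the threshold up or down changes the summary length.
    scores = sentence_scores(sentences)
    return [s for s, score in zip(sentences, scores) if score &amp;gt;= threshold]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;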

</description>
      <category>machinelearning</category>
      <category>nlp</category>
      <category>summarization</category>
      <category>modeling</category>
    </item>
    <item>
      <title>Benchmarking of Textual Models - Jaccard Similarity</title>
      <dc:creator>Syed Afroz Pasha</dc:creator>
      <pubDate>Mon, 28 Sep 2020 21:18:03 +0000</pubDate>
      <link>https://dev.to/traindex/benchmarking-of-textual-models-jaccard-similarity-1c3i</link>
      <guid>https://dev.to/traindex/benchmarking-of-textual-models-jaccard-similarity-1c3i</guid>
      <description>&lt;p&gt;I recently concluded my internship with the Data Science team at Traindex. One of the tasks assigned to me was replicating the ML-based semantic search technique. The Data Science team had implemented this on their Traindex search API.&lt;/p&gt;

&lt;p&gt;Traindex uses document similarity techniques like LSI and Doc2Vec to train a model that can identify the best matching documents with a given document or paragraphs or phrases.&lt;/p&gt;

&lt;p&gt;One of the most challenging tasks is the benchmarking of language models. The process sometimes requires a series of techniques for proper evaluation and testing.&lt;br&gt;
One of the benchmarks used for Traindex is Jaccard similarity. It provides a baseline but is not, on its own, enough for a complete evaluation of any model.&lt;/p&gt;

&lt;p&gt;This article gives some background on where and how the Jaccard similarity score is used. It is useful for creating benchmarks that measure the performance of language models.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Text Similarity&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Text similarity helps us determine the similarity between pairs of documents, or between a specific document and a set of other documents. The score calculated by the similarity check decides whether a model is accepted, improved, or rejected. String-based text similarity can be categorized into several approaches, each fitting a different scenario.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--VaZQNQjs--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/pheia0ozm871qcnzybcm.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--VaZQNQjs--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/pheia0ozm871qcnzybcm.PNG" alt="Search"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;There are two main approaches to computing the similarity metric.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;The &lt;em&gt;Character-based&lt;/em&gt; approach deals with the individual characters present in the document and their sequence.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;The &lt;em&gt;Term-based&lt;/em&gt; approach deals with whole words. The words are often simplified or lemmatized before the test, in line with the initial data cleaning used for training.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Introduction - Jaccard Index&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;When comparing a finite number of elements between two observations, it is common practice to count the items the two sets share. This makes it a natural fit for comparing posts by their representative tags, i.e. measuring how similar two articles are in terms of tags.&lt;/p&gt;

&lt;p&gt;The Jaccard similarity index, also called the Jaccard similarity coefficient, compares the members of two sets to see which are shared and which are distinct. It measures similarity on a scale from 0% to 100%; the higher the percentage, the more similar the two populations. Although it is easy to interpret, it is sensitive to small sample sizes and may give erroneous results for small samples or datasets with missing observations.&lt;/p&gt;

&lt;p&gt;Let's have a look at this example.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;gensim.matutils&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;jaccard&lt;/span&gt;

&lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;jaccard&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;bow_water&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;bow_bank&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="n"&gt;Out&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.8571428571428572&lt;/span&gt;


&lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;jaccard&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;doc_water&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;doc_bank&lt;/span&gt;&lt;span class="p"&gt;))&lt;/span&gt;
&lt;span class="n"&gt;Out&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.8333333333333334&lt;/span&gt;

&lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;jaccard&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="s"&gt;'word'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;'word'&lt;/span&gt;&lt;span class="p"&gt;]))&lt;/span&gt;
&lt;span class="n"&gt;Out&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="mf"&gt;0.0&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The three code examples above feature two different input methods.&lt;/p&gt;

&lt;p&gt;In the first case, we pass Jaccard document vectors that are already in bag-of-words format. The distance is defined as one minus the size of the intersection divided by the size of the union of the two vectors. Since the two documents share little, we expect the distance to be high - and it is.&lt;/p&gt;

&lt;p&gt;The last two examples illustrate Jaccard's ability to accept plain lists of tokens (i.e. documents) as inputs. In the final case, because the two inputs are identical, the value returned is 0 - the distance is 0 and the two documents are identical.&lt;/p&gt;
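
&lt;p&gt;The snippet above does not show how its inputs were built. A plausible reconstruction, assuming gensim's Dictionary and doc2bow with illustrative token lists (the originals are not shown), reproduces the same values:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Reconstruction with illustrative token lists; the original inputs are not shown.
from gensim.corpora import Dictionary
from gensim.matutils import jaccard

doc_water = ["river", "water", "shore", "bank"]
doc_bank = ["finance", "money", "bank"]

# Bag-of-words vectors are lists of (token_id, count) pairs over a shared vocabulary.
dictionary = Dictionary([doc_water, doc_bank])
bow_water = dictionary.doc2bow(doc_water)
bow_bank = dictionary.doc2bow(doc_bank)

# For bag-of-words inputs, gensim takes the union as the total weight of both bags,
# so the distance here is 1 - 1/7 (approximately 0.857).
print(jaccard(bow_water, bow_bank))

# For plain token lists, ordinary set semantics apply: 1 - 1/6 (approximately 0.833).
print(jaccard(doc_water, doc_bank))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;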

&lt;p&gt;&lt;strong&gt;Mathematical Representation&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Mathematically, it is the ratio of the intersection of two sets to their union. For textual documents or phrases, it counts the words common to both documents and divides that count by the number of distinct words appearing in either document. The same technique applies when comparing more than two documents.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--UssTPfr3--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/zbj2nxs9dh9mwohapjng.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--UssTPfr3--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/zbj2nxs9dh9mwohapjng.jpg" alt="jaccard similarity"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;From the Venn diagram above, it can be concluded that:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;Jaccard Index = (number in both sets) / (number in either set) * 100&lt;/code&gt;&lt;/p&gt;
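
&lt;p&gt;The formula maps directly onto a few lines of Python over the unique word sets of two texts. A minimal sketch follows; the example phrases are made up for illustration.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Direct implementation of the formula above using unique word sets.
def jaccard_index(text_a, text_b):
    words_a = set(text_a.lower().split())
    words_b = set(text_b.lower().split())
    shared = words_a.intersection(words_b)   # number in both sets
    either = words_a.union(words_b)          # number in either set
    return len(shared) / len(either) * 100

# Three shared words out of five distinct words: 60.0
print(jaccard_index("solar panel mounting bracket", "solar panel support bracket"))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;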

&lt;p&gt;Benchmarking is an essential part of data science: it indicates the confidence level and effectiveness of a model. Traindex uses multiple benchmarking techniques, of which Jaccard mean similarity testing is an important one.&lt;/p&gt;

&lt;p&gt;A test dataset is used to find the mean similarity score for the trained model. It is sampled from the main dataset, with the validation fraction provided as a parameter in the config file. Leaving every other feature aside, the UCID (the unique document ID) and the full text are the two columns necessary to perform the testing. &lt;/p&gt;

&lt;p&gt;The following steps test the LSI model's performance on the basis of term-based similarity:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The test data is built from the UCIDs and their respective text, sampled according to the size specified in the configuration file.&lt;/li&gt;
&lt;li&gt;The text column of the test data is cleaned as required; removing duplicate terms is compulsory.&lt;/li&gt;
&lt;li&gt;For each document, a query is formed from the text corresponding to its UCID.&lt;/li&gt;
&lt;li&gt;The document and index are passed to a function that, using the indices, returns the number of words common to the query and the document.&lt;/li&gt;
&lt;li&gt;The common-word and total counts are recorded, and the Jaccard similarity is calculated as the ratio of the lengths saved earlier.&lt;/li&gt;
&lt;li&gt;The steps above already yield results adequate for the evaluation, but each score is additionally scaled by dividing it by its query size. This extra step removes the bias introduced by differing query sizes.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Finally, the mean similarity score is calculated by averaging over all the combined results.&lt;/p&gt;
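
&lt;p&gt;A condensed sketch of this evaluation loop is given below. The &lt;code&gt;search_index&lt;/code&gt; and &lt;code&gt;clean&lt;/code&gt; callables stand in for Traindex-internal components (querying the LSI index and cleaning text), so they are hypothetical placeholders here.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Sketch of the benchmarking loop; `search_index` and `clean` are placeholders.
def mean_jaccard_score(test_data, search_index, clean):
    scores = []
    for ucid, text in test_data:                 # sampled (UCID, text) pairs
        query_words = set(clean(text))           # cleaned query, duplicates removed
        for result_words in search_index(text):  # token sets of the retrieved documents
            common = query_words.intersection(result_words)
            union = query_words.union(result_words)
            jaccard = len(common) / len(union)
            scores.append(jaccard / len(query_words))   # scale by query size
    return sum(scores) / len(scores)             # mean similarity score
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;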

&lt;p&gt;Optionally, a data frame can be populated with the similarity score for each result. The values are plotted on a bar chart, which clearly shows the percentage similarity for each UCID, as in the sample figure below. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--3YnTYxlb--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/5ls4nfp5p33nh7cwos9w.PNG" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--3YnTYxlb--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/5ls4nfp5p33nh7cwos9w.PNG" alt="bars"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Impact of Jaccard Similarity as an Evaluation Metric&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The Jaccard similarity score may not be the best solution for benchmarking, but it is worth considering for the following advantages:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Jaccard similarity uses the unique set of words in each sentence, so repeating a word several times does not change the score.
Consider these two sentences:&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;Sentence 1: AI is our friend, and it has been friendly.&lt;/p&gt;

&lt;p&gt;Sentence 2: AI and humans have always been friendly.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;It is robust enough to cater for repetitions of a word, such as "friend" in sentence 1. No matter how frequent a word is, Jaccard's similarity will be the same - here it is 0.5.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Instead of simply accepting or rejecting a document for a given set of queries, it provides a numerical score ranging from 0 to 1, which gives a clearer picture and can be used as a reference for further iterations. The bar chart above shows the closeness of random search queries to the same document, where the height of each bar represents the Jaccard similarity score between a query and the document.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Jaccard similarity is useful for cases where duplication does not matter. For example, it will be better to use Jaccard similarity for two product descriptions as the repetition of a word does not reduce their similarity.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In a nutshell, benchmarking is a necessary step when evaluating a set of models in the data science process, and developing tools and approaches for calculating metrics that can differentiate between models matters. Compared with other domains of machine learning, evaluating language and contextual models is difficult. With relatively few benchmarking techniques available for textual models, the Jaccard similarity score is an important evaluation metric that can provide a partial summary of a model's performance. Ultimately, the effectiveness of any benchmarking technique depends on how well it fits the problem; a mismatch between the metric and the use case can produce misleading results.&lt;/p&gt;

</description>
      <category>nlp</category>
      <category>machinelearning</category>
      <category>datascience</category>
      <category>benchmarking</category>
    </item>
    <item>
      <title>How Is Semantic Search Different From Keyword Search?</title>
      <dc:creator>Mohsin Ashraf</dc:creator>
      <pubDate>Wed, 23 Sep 2020 04:45:44 +0000</pubDate>
      <link>https://dev.to/traindex/how-is-semantic-search-different-from-keyword-search-578d</link>
      <guid>https://dev.to/traindex/how-is-semantic-search-different-from-keyword-search-578d</guid>
      <description>&lt;p&gt;With the exponential growth of information, finding the right information is like looking for a needle in a haystack. Bubbling the right information to the top of the search results is essential for efficiently working in the knowledge economy. Putting the best relevant results in the limited place of the first  page is what distinguishes an excellent search engine from a good search engine. This is the challenge we are solving at Traindex.&lt;/p&gt;

&lt;p&gt;One of the first challenges we are tackling is in the patent space. Patent analysts might want to know what other patents exist in the same domain as a new patent being filed. They may want to find prior art to challenge a claim in an existing patent. There are numerous use cases that a better patent search helps solve.&lt;/p&gt;

&lt;p&gt;We have experimented with numerous approaches to retrieve relevant information quickly and effectively. These approaches are centered around two fundamental techniques: keyword search and semantic search. &lt;/p&gt;

&lt;p&gt;In this article, we will take a deep dive into the difference between semantic search and keyword search and discuss which approach is better.&lt;/p&gt;

&lt;p&gt;A keyword search is a simple lookup of a query's keywords in a corpus of documents. The system retrieves every document in the database that contains any keyword present in the query. We can set constraints on whether all words in the query must be present in a retrieved document or whether any single matching word is sufficient to bring it up.&lt;/p&gt;

&lt;p&gt;One drawback of this approach is that the retrieval system does not care what a keyword means in the context of a document or query; it simply brings back every document that contains the keyword specified by the user. This type of search can return irrelevant results (false positives). To see how keyword search works, take a look at the following diagram.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--udjheprR--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/zaajxnze3d10wk87us62.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--udjheprR--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/zaajxnze3d10wk87us62.png" alt="Screenshot from 2020-09-23 09-30-47"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In the above diagram, each small box shows the documents that contain a given term, e.g. “A”. The diagram shows a user-entered query, “raining cats and dogs”, and how the system retrieves documents matching the terms used. In this case, the system retrieved all the documents containing “raining”, “cats”, “and”, and “dogs” and showed them to the user. But “raining cats and dogs” is an English phrase used to describe heavy rain. The system might still surface some relevant results, but they would be few in number and possibly ranked arbitrarily (depending on the database structure). Moreover, each word in the query contributes independently, even though its meaning is governed by its neighbors. Scaling keyword search is also a problem and can slow down the search engine's response time when you have millions of documents. Keyword searches may also fail to retrieve related documents that do not use the exact search terms (false negatives). Under these conditions, researchers can miss pertinent information, and there is a danger of making business decisions based on a less-than-comprehensive set of search results.&lt;/p&gt;
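
&lt;p&gt;The behavior is easy to reproduce with a toy sketch: retrieve any document that shares at least one word with the query and rank by the number of matched words. The corpus below is made up for illustration; a real system would use an inverted index at scale.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Toy "match any keyword" retrieval over a made-up corpus.
corpus = {
    1: "method for raining simulation in flight training",
    2: "feeding apparatus for cats and dogs",
    3: "storm water drainage system for heavy rain",
}

def keyword_search(query, docs):
    query_words = set(query.lower().split())
    hits = []
    for doc_id, text in docs.items():
        matched = query_words.intersection(text.lower().split())
        if matched:
            hits.append((doc_id, len(matched)))
    # Rank simply by how many query words matched.
    return sorted(hits, key=lambda hit: hit[1], reverse=True)

# [(2, 3), (1, 1)]: document 3, which is actually about heavy rain,
# is never retrieved - a false negative.
print(keyword_search("raining cats and dogs", corpus))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;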

&lt;p&gt;At Traindex, we use semantic search. Unlike keyword search, semantic search takes into account the meaning of words according to their context. During training, a latent vector representation is inferred for each document, projecting it into a latent space. At inference time, the incoming query is converted into the same latent representation and projected into the space where the documents already live. The points nearest to the query are retrieved as its most similar documents.&lt;/p&gt;

&lt;p&gt;Take a look at the following illustration, which uses a 3D latent space, although real latent spaces can have hundreds to a few thousand dimensions. At Traindex, we use latent representations ranging from 200 to 600 dimensions to capture the documents more faithfully.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--5juXFRKJ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/q9gybua7yhlqn8m1gldu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--5juXFRKJ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/q9gybua7yhlqn8m1gldu.png" alt="Figure_1"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The red dot represents the query a user might have entered, whereas the blue dots are the documents projected into the 3D latent space. From here there are a number of ways to locate similar documents: for example, one can use the Euclidean distance between points, or the cosine of the angle between vectors from the origin, known as cosine similarity, among other similarity metrics.&lt;/p&gt;
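
&lt;p&gt;As a minimal sketch of that last step, the snippet below ranks documents by cosine similarity to a query in the latent space. The vectors are random stand-ins for embeddings that a trained model (for example LSI or Doc2Vec) would infer, and the dimensionality is an arbitrary choice within the range mentioned above.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;# Rank documents by cosine similarity to the query in a shared latent space.
# The vectors here are random stand-ins for model-inferred embeddings.
import numpy as np

rng = np.random.default_rng(0)
doc_vectors = rng.normal(size=(1000, 300))   # 1000 documents in a 300-dim latent space
query_vector = rng.normal(size=300)          # the query projected into the same space

def top_k_cosine(query, docs, k=5):
    # Normalise so the dot product equals the cosine of the angle between vectors.
    docs_norm = docs / np.linalg.norm(docs, axis=1, keepdims=True)
    query_norm = query / np.linalg.norm(query)
    similarities = docs_norm @ query_norm
    best = np.argsort(similarities)[::-1][:k]    # indices of the k nearest documents
    return list(zip(best, similarities[best]))

print(top_k_cosine(query_vector, doc_vectors))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;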

&lt;p&gt;Now that we have a solid understanding of how both searches work, let's compare them side by side.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Keyword Search&lt;/th&gt;
&lt;th&gt;Semantic Search&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Synonyms could be neglected during the search&lt;/td&gt;
&lt;td&gt;Incorporates the meaning of words and hence handles synonyms as well&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Need to carefully pick the keywords for search&lt;/td&gt;
&lt;td&gt;The query is automatically enriched by the latent encoding&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;The retrieved information depends on keywords and page-ranking algorithms, which can produce spammy results&lt;/td&gt;
&lt;td&gt;The retrieved information does not depend on exact keywords or page-rank algorithms, producing relevant results rather than irrelevant ones&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Semantic search seeks to improve search accuracy by understanding a searcher’s intent through the contextual meaning of the words and brings back the results the user intended to see. &lt;/p&gt;

&lt;p&gt;Patents make heavy use of technical and domain-specific terms that might not appear in English dictionaries and can therefore be missed by a patent analyst searching for prior art. Semantic search in Traindex processes the whole patent as the input for prior-art and similarity search and uses the vectorized representation of the text to find synonyms and equivalent terms in other patents, which is impossible with a keyword search.&lt;/p&gt;

&lt;p&gt;Moreover, keyword search is typically restricted to a fixed number of words, generally at most around 50, and its response slows as the number of keywords grows. In Traindex, you can search with any number of words, including whole patents running to hundreds of thousands of words, and still get results in real time. Semantic search in Traindex converts the whole patent into a vector representation and matches the most similar documents in the database, overcoming the limits of keyword search.&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>datascience</category>
      <category>nlp</category>
    </item>
  </channel>
</rss>
