<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Japneet Singh Chawla</title>
    <description>The latest articles on DEV Community by Japneet Singh Chawla (@japneet121).</description>
    <link>https://dev.to/japneet121</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F29755%2F72fdc56c-57fb-42fc-be26-bd44320547e4.jpeg</url>
      <title>DEV Community: Japneet Singh Chawla</title>
      <link>https://dev.to/japneet121</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/japneet121"/>
    <language>en</language>
    <item>
      <title>Instagram Bot using Python</title>
      <dc:creator>Japneet Singh Chawla</dc:creator>
      <pubDate>Tue, 30 Jun 2020 21:56:29 +0000</pubDate>
      <link>https://dev.to/japneet121/instagram-bot-using-python-4a54</link>
      <guid>https://dev.to/japneet121/instagram-bot-using-python-4a54</guid>
      <description>&lt;p&gt;Growing on Instagram is a difficult and time-consuming job if you are already not a famous personality.&lt;br&gt;
I started my food blogging page &lt;a href="https://www.instagram.com/_.interwined_dodos._/"&gt;@&lt;em&gt;.interwined_dodos.&lt;/em&gt;&lt;/a&gt; (do check it out and follow me there) and faced these challenges in growing. Some of the very common challenges I have faced were:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Interacting with the whole community throughout the day is a difficult job&lt;/li&gt;
&lt;li&gt;Instagram blocks the like, comment, follow and unfollow actions if you are overly interactive in a short span of time&lt;/li&gt;
&lt;li&gt;Following and engaging with new people of the same or a different community (again a time-consuming job)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Being a lazy yet ambitious person, I wanted to grow on Instagram but didn't want to spend much time on it.&lt;br&gt;
This is when my developer mind kicked in: I thought of automating this stuff, and my automation journey led me to blogs about Selenium and then to my destination, &lt;a href="https://github.com/timgrossmann/InstaPy"&gt;InstaPy&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;InstaPy is a tool that automates Instagram tasks using &lt;a href="https://www.selenium.dev/"&gt;Selenium&lt;/a&gt;. It offers a large variety of Instagram actions that can be performed with just a few lines of code. Some popular actions are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Smart liking&lt;/li&gt;
&lt;li&gt;Smart Commenting&lt;/li&gt;
&lt;li&gt;Following&lt;/li&gt;
&lt;li&gt;Unfollowing&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I say smart liking and commenting because InstaPy offers lots of configuration options to give it a human-like touch. It is also carefully crafted so that you don't get blocked by Instagram (though it can still happen, and you can get blocked for months if you are not careful).&lt;/p&gt;
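&lt;p&gt;As an illustration of those human-like options, InstaPy also exposes action delays. This is a sketch only; the parameter names below assume InstaPy 0.6.x, so verify them against your installed version's docs:&lt;/p&gt;

```python
# Hypothetical tuning, to be added inside the `with smart_run(session):` block.
# Waits a few seconds between actions and randomizes each delay so the
# activity pattern looks less robotic.
session.set_action_delays(enabled=True,
                          like=3,               # ~3 s between likes
                          comment=5,            # ~5 s between comments
                          randomize=True,       # jitter each delay...
                          random_range_from=70, # ...between 70% and
                          random_range_to=140)  # 140% of the base value
```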

&lt;p&gt;Frankly speaking, I even got blocked by Instagram once for performing actions manually. I have been using this bot for the past two days, and things are fine as of now.&lt;br&gt;
Let's get started.&lt;/p&gt;
&lt;h2&gt;
  Installation
&lt;/h2&gt;

&lt;p&gt;Running InstaPy requires the Firefox browser, so install the version for your operating system from the official Firefox download page.&lt;/p&gt;

&lt;p&gt;I prefer version 0.6.8 because there are some issues with comments in the latest version, 0.6.9. Do a quick pip install to get the package:&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pip install instapy==0.6.8
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;h2&gt;
  The Code
&lt;/h2&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="s"&gt;""" Quickstart script for InstaPy usage """&lt;/span&gt;
&lt;span class="c1"&gt;# imports
&lt;/span&gt;&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;instapy&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;InstaPy&lt;/span&gt;
&lt;span class="kn"&gt;from&lt;/span&gt; &lt;span class="nn"&gt;instapy&lt;/span&gt; &lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;smart_run&lt;/span&gt;
&lt;span class="c1"&gt;# login credentials
&lt;/span&gt;&lt;span class="n"&gt;insta_username&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;''&lt;/span&gt;  &lt;span class="c1"&gt;# &amp;lt;- enter username here
&lt;/span&gt;&lt;span class="n"&gt;insta_password&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="s"&gt;''&lt;/span&gt;  &lt;span class="c1"&gt;# &amp;lt;- enter password here
# get an InstaPy session!
# set headless_browser=True to run InstaPy in the background
&lt;/span&gt;&lt;span class="n"&gt;session&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;InstaPy&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;username&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;insta_username&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                  &lt;span class="n"&gt;password&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="n"&gt;insta_password&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                  &lt;span class="n"&gt;headless_browser&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;False&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;smart_run&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
    &lt;span class="s"&gt;""" Activity flow """&lt;/span&gt;
    &lt;span class="c1"&gt;# general settings
&lt;/span&gt;    &lt;span class="c1"&gt;#sets the percentage of people you want to follow
&lt;/span&gt;    &lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;set_do_follow&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;percentage&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;50&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="c1"&gt;#sets the percentage of posts you want to comment
&lt;/span&gt;    &lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;set_do_comment&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;percentage&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="c1"&gt;#list of random comments you want to post
&lt;/span&gt;    &lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;set_comments&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="s"&gt;"hey @_.interwined_dodos._, have a look"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"Great content @_.interwined_dodos._ have a look"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;":heart_eyes: :heart_eyes: :heart_eyes: @_.interwined_dodos._"&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
    &lt;span class="c1"&gt;#setting quotas for the daily and hourly action(I said it's smart)
&lt;/span&gt;    &lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;set_quota_supervisor&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;enabled&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;peak_comments_daily&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;250&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;peak_comments_hourly&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;30&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;peak_likes_daily&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;250&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;peak_likes_hourly&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;30&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="n"&gt;sleep_after&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;'likes'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;'follows'&lt;/span&gt;&lt;span class="p"&gt;])&lt;/span&gt;
    &lt;span class="c1"&gt;#again some set of configuration which figures out whom to follow
&lt;/span&gt;    &lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;set_relationship_bounds&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;enabled&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                                    &lt;span class="n"&gt;delimit_by_numbers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                                    &lt;span class="n"&gt;max_followers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;4590&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                                    &lt;span class="n"&gt;min_followers&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;45&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
                                    &lt;span class="n"&gt;min_following&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;77&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="c1"&gt;#tags to get posts from and amout is the actions you want 
&lt;/span&gt;    &lt;span class="n"&gt;session&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;like_by_tags&lt;/span&gt;&lt;span class="p"&gt;([&lt;/span&gt;&lt;span class="s"&gt;'nailsofinstagram'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="s"&gt;"nails"&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt; &lt;span class="n"&gt;amount&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;300&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;That's it!&lt;br&gt;
With these few lines, your bot is ready to roll.&lt;/p&gt;

&lt;h2&gt;
  Run the Bot
&lt;/h2&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;python &amp;lt;your_file_name&amp;gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;I never thought that creating a bot would be this easy. I hope this helps many lazy people like me :P&lt;/p&gt;

&lt;p&gt;Also, have a look at my other blogs:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://techscouter.blogspot.com/2020/05/extracting-text-from-pdf-for-nlp-tasks.html"&gt;Extracting text from PDF&lt;/a&gt;&lt;br&gt;
&lt;a href="https://techscouter.blogspot.com/2018/05/celery-with-heavy-workloads.html"&gt;Using Celery for Heavy Loads&lt;/a&gt;&lt;br&gt;
&lt;a href="https://techscouter.blogspot.com/2017/09/spidering-web-with-python.html"&gt;Scraping the Web&lt;/a&gt;&lt;/p&gt;

</description>
      <category>python</category>
      <category>instagram</category>
      <category>automation</category>
      <category>bots</category>
    </item>
    <item>
      <title>GloVe for Word Vectorization</title>
      <dc:creator>Japneet Singh Chawla</dc:creator>
      <pubDate>Tue, 22 May 2018 04:00:23 +0000</pubDate>
      <link>https://dev.to/japneet121/glove-for-word-vectorization-1ak0</link>
      <guid>https://dev.to/japneet121/glove-for-word-vectorization-1ak0</guid>
      <description>&lt;p&gt;&lt;a href="https://3.bp.blogspot.com/-U8Cjp06jluc/Wt7rRjQ3KKI/AAAAAAAAbjE/DGD3_b3UOHYyUjsA4DbxJsxXgqG6jX_YwCLcBGAs/s1600/Text-to-vector-3_normal.png" rel="noopener noreferrer"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2F3.bp.blogspot.com%2F-U8Cjp06jluc%2FWt7rRjQ3KKI%2FAAAAAAAAbjE%2FDGD3_b3UOHYyUjsA4DbxJsxXgqG6jX_YwCLcBGAs%2Fs640%2FText-to-vector-3_normal.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
What is GloVe?&lt;/h2&gt;

&lt;p&gt;GloVe stands for Global Vectors for word representation. It is an unsupervised learning algorithm, developed by Stanford, that generates word embeddings by aggregating a global word-word co-occurrence matrix from a corpus. The resulting embeddings show interesting linear substructures of the words in vector space.&lt;/p&gt;
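&lt;p&gt;To make "co-occurrence matrix" concrete, here is a minimal, self-contained sketch (plain Python, not the actual GloVe code) that counts co-occurrences within a symmetric window, weighting nearby pairs more heavily by 1/distance, as GloVe does:&lt;/p&gt;

```python
from collections import defaultdict

def cooccurrence(sentences, window=2):
    """Count how often each ordered pair of words appears within
    `window` positions of each other, weighted by 1/distance."""
    counts = defaultdict(float)
    for words in sentences:
        for i, w in enumerate(words):
            # look back up to `window` positions from position i
            for j in range(max(0, i - window), i):
                counts[(w, words[j])] += 1.0 / (i - j)
                counts[(words[j], w)] += 1.0 / (i - j)
    return counts

counts = cooccurrence([["the", "cat", "sat"], ["the", "dog", "sat"]])
# "the" and "sat" co-occur at distance 2 in both sentences: 0.5 + 0.5
print(counts[("the", "sat")])  # 1.0
```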

&lt;p&gt;Examples for linear substructures are:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://3.bp.blogspot.com/-TcJsh6VB75E/Wt7xk0VBQxI/AAAAAAAAbjY/7_pS22MdugQBQ9vOBJUV2hsm2b9bcVbZQCEwYBhgL/s1600/vec2.jpg" rel="noopener noreferrer"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2F3.bp.blogspot.com%2F-TcJsh6VB75E%2FWt7xk0VBQxI%2FAAAAAAAAbjY%2F7_pS22MdugQBQ9vOBJUV2hsm2b9bcVbZQCEwYBhgL%2Fs640%2Fvec2.jpg"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://4.bp.blogspot.com/-8v4NzcL-vvs/Wt7xk6PbLZI/AAAAAAAAbjU/XBmQq_Jnb-UgW6FeoSZ6kQdrejwdjNAhQCLcBGAs/s1600/vec1.jpg" rel="noopener noreferrer"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2F4.bp.blogspot.com%2F-8v4NzcL-vvs%2FWt7xk6PbLZI%2FAAAAAAAAbjU%2FXBmQq_Jnb-UgW6FeoSZ6kQdrejwdjNAhQCLcBGAs%2Fs640%2Fvec1.jpg"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;These results are pretty impressive. This type of representation can be very useful in many machine learning algorithms.&lt;/p&gt;

&lt;p&gt;To read more about Word Vectorization you can read my other &lt;a href="http://techscouter.blogspot.in/2017/10/word-vectorization.html" rel="noopener noreferrer"&gt;article&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;In this blog post, we will be learning about the GloVe implementation in Python. So, let's get started.&lt;/p&gt;

&lt;h2&gt;
Let's create the Embeddings&lt;/h2&gt;

&lt;h3&gt;
Installing Glove-Python&lt;/h3&gt;

&lt;p&gt;The GloVe implementation in Python is available in the library &lt;b&gt;glove-python&lt;/b&gt;.&lt;/p&gt;




&lt;pre&gt;pip install glove_python
&lt;/pre&gt;


&lt;h3&gt;
Text Preprocessing&lt;/h3&gt;

&lt;p&gt;In this step, we will pre-process the text: removing stop words, lemmatizing the words, etc.&lt;/p&gt;

&lt;p&gt;You can perform different steps based on your requirements.&lt;/p&gt;

&lt;p&gt;I will use the nltk stopword corpus for stop word removal and the nltk WordNet lemmatizer for finding lemmas.&lt;/p&gt;

&lt;p&gt;In order to use the nltk corpora, you will need to download them using the following commands.&lt;/p&gt;

&lt;b&gt;Downloading the corpus&lt;/b&gt;


&lt;pre&gt;import nltk
nltk.download()
#this will open a GUI from which you can download the corpora
#(you need "stopwords" and "wordnet")
&lt;/pre&gt;








&lt;b&gt;Input initialization&lt;/b&gt;

&lt;pre&gt;#list of sentences to be vectorized
lines=["Hello this is a tutorial on how to convert the word in an integer format",
"this is a beautiful day","Jack is going to office"]&lt;/pre&gt;



&lt;b&gt;Removing the Stop Words&lt;/b&gt;

&lt;pre&gt;from nltk.corpus import stopwords
stop_words=set(stopwords.words('english'))

lines_without_stopwords=[]
#stop_words contains the set of stop words
for line in lines:
 temp_line=[]
 for word in line.split():
  if word not in stop_words:
   temp_line.append(word)
 string=' '
 lines_without_stopwords.append(string.join(temp_line))

lines=lines_without_stopwords
&lt;/pre&gt;



&lt;b&gt;Lemmatization&lt;/b&gt;


&lt;pre&gt;#import WordNet Lemmatizer from nltk
from nltk.stem import WordNetLemmatizer
wordnet_lemmatizer = WordNetLemmatizer()

lines_with_lemmas=[]
for line in lines:
 temp_line=[]
 for word in line.split():
  temp_line.append(wordnet_lemmatizer.lemmatize(word))
 string=' '
 lines_with_lemmas.append(string.join(temp_line))
lines=lines_with_lemmas
&lt;/pre&gt;











&lt;p&gt;Now we have done the basic preprocessing of the text. Any other preprocessing steps can be added similarly.&lt;/p&gt;

&lt;h3&gt;
Preparing Input&lt;/h3&gt;
&lt;p&gt;We have our input in the form of an array of lines. In order for the model to process the data, we need to convert our input to an array of arrays of words ( :\ ).&lt;/p&gt;

&lt;u&gt;Our Input&lt;/u&gt;

&lt;pre&gt;lines=["Hello this is a tutorial on how to convert the word in an integer format",
"this is a beautiful day","Jack is going to office"]&lt;/pre&gt;

&lt;u&gt;New Input&lt;/u&gt;

&lt;pre&gt;lines=[['Hello', 'this', 'tutorial', 'on', 'how', 'convert', 'word', 'integer', 'format'],
['this', 'beautiful', 'day'], ['Jack', 'going', 'office']]&lt;/pre&gt;

&lt;pre&gt;new_lines=[]
for line in lines:
 new_lines.append(line.split(' '))
#new_lines has the new format
lines=new_lines
&lt;/pre&gt;
&lt;h3&gt;
Building a GloVe model&lt;/h3&gt;

&lt;pre&gt;#importing the glove library
from glove import Corpus, Glove

# creating a corpus object
corpus = Corpus() 

#training the corpus to generate the co-occurrence matrix which is used in GloVe
corpus.fit(lines, window=10)

#creating a Glove object which will use the matrix created above to create embeddings
#We can set the learning rate (it uses gradient descent) and the number of components
glove = Glove(no_components=5, learning_rate=0.05)
 
glove.fit(corpus.matrix, epochs=30, no_threads=4, verbose=True)
glove.add_dictionary(corpus.dictionary)
glove.save('glove.model')&lt;/pre&gt;







&lt;p&gt;Creating a Glove model uses the co-occurrence matrix generated by the Corpus object to create the embeddings.&lt;/p&gt;

&lt;p&gt;&lt;b&gt;corpus.fit&lt;/b&gt; takes two arguments:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;lines - the 2D array we created after pre-processing&lt;/li&gt;
&lt;li&gt;window - the maximum distance between two words the algorithm should consider when looking for a relationship between them&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Parameters of Glove:&lt;/p&gt;



&lt;ul&gt;
&lt;li&gt;no_components - the dimension of the output vectors generated by GloVe&lt;/li&gt;
&lt;li&gt;learning_rate - the algorithm uses gradient descent, so the learning rate defines how fast it moves towards the minimum (the lower the rate, the more time it takes to learn, but the closer it gets to the minimum value)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Parameters of glove.fit:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;corpus.matrix - the matrix of word-word co-occurrences&lt;/li&gt;
&lt;li&gt;epochs - the number of passes the algorithm makes through the data set&lt;/li&gt;
&lt;li&gt;no_threads - the number of threads the algorithm uses to run&lt;/li&gt;
&lt;/ul&gt;



&lt;p&gt;After training, the glove object holds the word vectors for the lines we provided. But the dictionary still resides in the corpus object. We need to add the dictionary to the glove object to make it complete.&lt;/p&gt;





&lt;pre&gt;glove.add_dictionary(corpus.dictionary)
&lt;/pre&gt;

&lt;p&gt;This line does the dictionary addition in the glove object. After this, the object is ready to provide you with the embeddings.&lt;/p&gt;

&lt;pre&gt;print(glove.word_vectors[glove.dictionary['samsung']])
&lt;/pre&gt;

&lt;p&gt;OUTPUT:&lt;/p&gt;

&lt;pre&gt;[ 0.04521741  0.02455266 -0.06374787 -0.07107575  0.04608054]&lt;/pre&gt;



&lt;p&gt;This will print the embeddings for the word "samsung" (assuming "samsung" appeared in your training corpus; our toy corpus above does not contain it).&lt;/p&gt;
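&lt;p&gt;Once you have embeddings, a common way to use them is to compare words by cosine similarity. A minimal sketch; the two 5-dimensional vectors below are made up for illustration, not real GloVe output:&lt;/p&gt;

```python
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

# hypothetical 5-dimensional embeddings for two related words
v_samsung = [0.045, 0.025, -0.064, -0.071, 0.046]
v_phone   = [0.040, 0.020, -0.060, -0.070, 0.050]
print(cosine(v_samsung, v_phone))  # close to 1.0 for similar vectors
```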

&lt;h2&gt;
End Notes &lt;/h2&gt;

&lt;p&gt;We have learned how to generate vectors for text data, which is very useful for many machine learning models and techniques: SVM, KNN, K-Means, logistic classifiers, sentiment analysis, document classification, etc.&lt;/p&gt;

&lt;p&gt;More can be learned about the GloVe on Stanford's &lt;a href="https://nlp.stanford.edu/projects/glove/" rel="noopener noreferrer"&gt;website&lt;/a&gt;. &lt;/p&gt;



</description>
      <category>wordvectorization</category>
      <category>python</category>
      <category>glove</category>
    </item>
    <item>
      <title>Optimizing Celery for Workloads</title>
      <dc:creator>Japneet Singh Chawla</dc:creator>
      <pubDate>Mon, 21 May 2018 02:37:45 +0000</pubDate>
      <link>https://dev.to/japneet121/optimizing-celery-for-workloads-57i9</link>
      <guid>https://dev.to/japneet121/optimizing-celery-for-workloads-57i9</guid>
      <description>&lt;h2&gt;
Introduction&lt;/h2&gt;

&lt;p&gt;Firstly, a brief background about myself. I am working as a software engineer at an alternate asset management organization (handling 1.4 trillion with our product suite), responsible for maintaining and developing a product, &lt;b&gt;ALT Data Analyzer&lt;/b&gt;. My work is focused on making the engine run and feeding the machines with their food.&lt;/p&gt;

&lt;p&gt;This article explains the problems we faced while scaling up our architecture and the solution we followed. I am dividing the blog into the following sections:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;span&gt;Product Brief&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span&gt;Current Architecture&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span&gt;Problem With Data loads and Sleepless Nights&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span&gt;Solutions Tried&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span&gt;The Final Destination&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
&lt;span&gt;Product Brief&lt;/span&gt;
&lt;/h2&gt;

&lt;p&gt;The idea of building this product was to give users an aggregated view of the WEB around a company. By WEB I mean the data that is flowing freely over all the decentralized nodes of the internet. We try to capture all the financial, technical and fundamental data for the companies, pass that data through our massaging and analyzing pipes and provide an aggregated view on the top of this data. Our dashboards can be one stop for all the information around a company and can aid an investor in his analysis. &lt;/p&gt;

&lt;p&gt;Some of the Data Sources that we support as of now are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Stock News from multiple sources&lt;/li&gt;
&lt;li&gt;RSS feed &lt;/li&gt;
&lt;li&gt;Twitter&lt;/li&gt;
&lt;li&gt;Reddit&lt;/li&gt;
&lt;li&gt;Yahoo News&lt;/li&gt;
&lt;li&gt;Yahoo Finance&lt;/li&gt;
&lt;li&gt;Earnings Calendars&lt;/li&gt;
&lt;li&gt;Report Filings&lt;/li&gt;
&lt;li&gt;Company Financials&lt;/li&gt;
&lt;li&gt;Stock Prices&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
Current Architecture&lt;/h2&gt;


&lt;a href="https://4.bp.blogspot.com/-zrXUbB9-EVA/WvaSLHGo0fI/AAAAAAAAbmU/jwHR-jSLH7MALNt8Q7ghWotC5H9dhQvcgCLcBGAs/s1600/internet.jpeg" rel="noopener noreferrer"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2F4.bp.blogspot.com%2F-zrXUbB9-EVA%2FWvaSLHGo0fI%2FAAAAAAAAbmU%2FjwHR-jSLH7MALNt8Q7ghWotC5H9dhQvcgCLcBGAs%2Fs640%2Finternet.jpeg"&gt;&lt;/a&gt;

&lt;h3&gt;
Broad View&lt;/h3&gt;

&lt;p&gt;We knew that the problem we are solving has to deal with the cruel, decentralized Internet, and that we needed to divide the large task of getting data from the web and analyzing it into small tasks.&lt;/p&gt;







&lt;a href="https://1.bp.blogspot.com/-tPW3nCkXYps/WvaVW4XlHoI/AAAAAAAAbmg/_jpZxvZMOA8M-CpZ0fuffHzeU9IzaYbFgCLcBGAs/s1600/croped1.jpeg" rel="noopener noreferrer"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2F1.bp.blogspot.com%2F-tPW3nCkXYps%2FWvaVW4XlHoI%2FAAAAAAAAbmg%2F_jpZxvZMOA8M-CpZ0fuffHzeU9IzaYbFgCLcBGAs%2Fs400%2Fcroped1.jpeg"&gt;&lt;/a&gt;

&lt;p&gt;Fig 1&lt;/p&gt;

&lt;p&gt;After exploring different projects and technologies and analyzing community support, we decided to use &lt;b&gt;Python&lt;/b&gt; as our language of choice and &lt;b&gt;Celery&lt;/b&gt; as our commander.&lt;/p&gt;

&lt;p&gt;Python is a pretty vast language backed by a large set of libraries. Since its inception, it has gained a lot of popularity among the developer and data science communities. One of the major reasons to use Python as a backend is the Celery project.&lt;/p&gt;





&lt;p&gt;Its website defines Celery as "an asynchronous task queue/job queue based on distributed message passing. It is focused on real-time operation, but supports scheduling as well." To learn more about Celery you can visit its website &lt;a href="http://www.celeryproject.org/" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/p&gt;





&lt;p&gt;By now we were clear about how we wanted to proceed. We wanted to divide the process in Fig 1 into granular units (in Celery terminology, tasks). Keeping this as the baseline, we identified all the individual units which can work independently in the system.&lt;/p&gt;

&lt;p&gt;This gave rise to a new look for Fig 1:&lt;/p&gt;





&lt;a href="https://2.bp.blogspot.com/-ShzlN3uYpA4/Wvacz_r0wtI/AAAAAAAAbmw/2q0sPcEN4FM3qTkEPOMTqq_zQyL5CID8wCLcBGAs/s1600/cropped1.jpeg" rel="noopener noreferrer"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2F2.bp.blogspot.com%2F-ShzlN3uYpA4%2FWvacz_r0wtI%2FAAAAAAAAbmw%2F2q0sPcEN4FM3qTkEPOMTqq_zQyL5CID8wCLcBGAs%2Fs640%2Fcropped1.jpeg"&gt;&lt;/a&gt;

Fig 2





&lt;p&gt;We are using MongoDB in a replication cluster as the database engine and Redis for queuing and communication between independent Celery workers.&lt;/p&gt;

&lt;p&gt;The following figure describes the communication of the Celery workers with the help of the broker (Redis):&lt;/p&gt;





&lt;a href="https://3.bp.blogspot.com/-S1r1aSUiOcc/Wvae5cP0BqI/AAAAAAAAbnE/QNNKir9dgUEGEpDk00KWfOOqLfVxKlAoQCLcBGAs/s1600/workers.jpeg" rel="noopener noreferrer"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2F3.bp.blogspot.com%2F-S1r1aSUiOcc%2FWvae5cP0BqI%2FAAAAAAAAbnE%2FQNNKir9dgUEGEpDk00KWfOOqLfVxKlAoQCLcBGAs%2Fs640%2Fworkers.jpeg"&gt;&lt;/a&gt;








&lt;h2&gt;
 &lt;span&gt;Problem With Data loads and Sleepless Nights&lt;/span&gt;
&lt;/h2&gt;

&lt;p&gt;Everything was working fine with the development workloads. We had deployed a cluster of 5 machines (i7 processors with 16 gigs of memory each), with each machine running some Celery workers.&lt;/p&gt;

&lt;p&gt;But this was the calm before the storm. As we planned to increase our workloads and configure more tasks, the system started to fade and ran out of memory very frequently. We were now generating around 10 million tasks a day. The Celery workers were not able to process tasks at this speed, so tasks built up in the queues. This slowed down the Redis broker, which was now storing a large number of tasks, and that in turn made the whole system run out of memory and gradually grind to a halt.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://3.bp.blogspot.com/-yFpJdkfSkB0/WvgHuFRr77I/AAAAAAAAbno/p4siRfrXTVIDEypgbo5qd9LfiKRjoW-qACLcBGAs/s1600/calm.jpg" rel="noopener noreferrer"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2F3.bp.blogspot.com%2F-yFpJdkfSkB0%2FWvgHuFRr77I%2FAAAAAAAAbno%2Fp4siRfrXTVIDEypgbo5qd9LfiKRjoW-qACLcBGAs%2Fs640%2Fcalm.jpg"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Our beautifully designed ship was sinking and we had to do something to save it.&lt;/p&gt;


&lt;h2&gt;
&lt;span&gt;Solutions Tried&lt;/span&gt;
&lt;/h2&gt;

&lt;p&gt;When we faced this issue, the first approach we thought of, as any other developer would, was to add more nodes to the cluster. While discussing this solution, there was an argument: if the load increases again with time, will we keep adding more nodes again and again?&lt;/p&gt;

&lt;p&gt;So, we skipped this solution and agreed to dig deep into the problem. &lt;/p&gt;

&lt;p&gt;On deep analysis, we realized that a large number of tasks were being published and stored in the Redis queues. Under these data loads Redis eventually slowed down the rate at which tasks were sent to the consumers, leaving the consumers idle.&lt;/p&gt;

&lt;p&gt;So our system was stuck in an infinite loop. &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Redis sends tasks at high speed but consumers cannot consume at that rate.&lt;/li&gt;
&lt;li&gt;Redis slows down and lowers the rate at which it sends tasks and now consumers stay idle most of the time.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In both situations, the common suspect was Redis. On researching RedisDB, we found that it is a single-threaded system that performs all its work in a round-robin fashion. It was also keeping the received tasks in memory, which increased memory consumption. Under heavy loads, that single thread was so busy with persistence and other work that it slowed down the polling initiated by the consumers. So we had found the reasons for both of the problems we were facing.&lt;/p&gt;

&lt;p&gt;To handle the heavy workload one of the choices was to shard the Redis server into a cluster. So we created a Redis partition set (tutorials can be seen &lt;a href="https://redis.io/topics/partitioning" rel="noopener noreferrer"&gt;here&lt;/a&gt;). We created a cluster of three nodes with each node handling an equal number of keys through a consistent hashing technique. &lt;/p&gt;
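&lt;p&gt;The idea behind the consistent hashing we used for partitioning can be sketched in a few lines of plain Python. This is a toy illustration, not the actual Redis Cluster slot mechanism (which uses CRC16 hash slots); the node names are placeholders:&lt;/p&gt;

```python
import bisect
import hashlib

def _h(key):
    # stable hash, identical across runs and machines
    return int(hashlib.md5(key.encode()).hexdigest(), 16)

class HashRing:
    """Toy consistent-hash ring: each node owns an arc of the hash
    space, so removing one node only remaps the keys that lived on
    that node (unlike plain `hash(key) % n` sharding)."""
    def __init__(self, nodes, vnodes=100):
        # place several virtual points per node for an even spread
        self.ring = sorted((_h(f"{n}:{i}"), n) for n in nodes for i in range(vnodes))
        self.points = [p for p, _ in self.ring]

    def node_for(self, key):
        # first ring position after the key's hash (wrapping around)
        idx = bisect.bisect(self.points, _h(key)) % len(self.points)
        return self.ring[idx][1]

ring = HashRing(["redis-1", "redis-2", "redis-3"])
print(ring.node_for("celery-task-42"))  # always the same node for this key
```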

&lt;p&gt;We plugged this cluster into the Celery app, and the workers started throwing an exception: "MOVED to server 192.168.12.12".&lt;/p&gt;

&lt;p&gt;&lt;a href="https://2.bp.blogspot.com/-zw8o3gX19lI/WvibxsT5RII/AAAAAAAAboE/nEeZLU_ia0kqszxTpylxt2FoYDDyfrKfgCEwYBhgL/s1600/fr_201691716455.jpg" rel="noopener noreferrer"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2F2.bp.blogspot.com%2F-zw8o3gX19lI%2FWvibxsT5RII%2FAAAAAAAAboE%2FnEeZLU_ia0kqszxTpylxt2FoYDDyfrKfgCEwYBhgL%2Fs400%2Ffr_201691716455.jpg"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;On googling, we found that Redis Cluster is not yet supported by Celery. On one hand we thought we had a solution, but on the other, it was not supported by the underlying framework :(&lt;/p&gt;

&lt;p&gt;The exploration began again, and we thought of putting a proxy server, &lt;a href="https://github.com/twitter/twemproxy" rel="noopener noreferrer"&gt;Twemproxy&lt;/a&gt;, in front of the Redis cluster. But this time we first chose to check the compatibility with our framework, and boom... we could not have been wiser in taking this path.&lt;/p&gt;

&lt;p&gt;Proxy was not yet supported by Celery.&lt;/p&gt;

&lt;p&gt;Frustrated with these incompatibility issues, we tried to figure out what is actually compatible with the framework. Some of the compatible brokers were:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;SQS&lt;/li&gt;
&lt;li&gt;Redis&lt;/li&gt;
&lt;li&gt;RabbitMQ&lt;/li&gt;
&lt;li&gt;Zookeeper&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;A straightforward thought was to try a different broker, so we began to explore these options. The following table proved useful in narrowing down our research:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;colgroup&gt;
&lt;col&gt;
&lt;col&gt;
&lt;col&gt;
&lt;col&gt;
&lt;/colgroup&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;strong&gt;Name&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Status&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Monitoring&lt;/strong&gt;&lt;/td&gt;
&lt;td&gt;&lt;strong&gt;Remote Control&lt;/strong&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;em&gt;RabbitMQ&lt;/em&gt;&lt;/td&gt;
&lt;td&gt;Stable&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;em&gt;Redis&lt;/em&gt;&lt;/td&gt;
&lt;td&gt;Stable&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;td&gt;Yes&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;em&gt;Amazon SQS&lt;/em&gt;&lt;/td&gt;
&lt;td&gt;Stable&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;em&gt;Zookeeper&lt;/em&gt;&lt;/td&gt;
&lt;td&gt;Experimental&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;td&gt;No&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Redis was what we had already tried, so we went for the other option that is stable and provides more features, i.e., RabbitMQ.&lt;/p&gt;

&lt;p&gt;&lt;b&gt;Spoiler:&lt;/b&gt;&lt;/p&gt;

&lt;p&gt;By now we knew that RabbitMQ is one of the best choices for a broker, used in production by a wide variety of clients, and that Redis is the best choice as a result backend (for the intermediate results stored by tasks in Celery chains and chords).&lt;/p&gt;

&lt;p&gt;We made the necessary changes to the system and tried to run it.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://3.bp.blogspot.com/-dTzKGioKH8c/Wvig1fw85LI/AAAAAAAAboU/3bugIbOyg6IrPyDrVIVrGqMfr6DC-tJ9QCLcBGAs/s1600/fc6ba9d5149c057a2582a7f07523abd2.jpg" rel="noopener noreferrer"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2F3.bp.blogspot.com%2F-dTzKGioKH8c%2FWvig1fw85LI%2FAAAAAAAAboU%2F3bugIbOyg6IrPyDrVIVrGqMfr6DC-tJ9QCLcBGAs%2Fs200%2Ffc6ba9d5149c057a2582a7f07523abd2.jpg"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This time the system nearly went into an irrecoverable state, consuming all the memory, and we had to restart the servers.&lt;/p&gt;

&lt;p&gt;While all this was happening we were monitoring RabbitMQ through its admin tool and found something fishy: it was creating a large number of queues, nearly as many as the number of tasks.&lt;/p&gt;

&lt;p&gt;On googling, we found that these queues were created to store the results of the intermediate tasks, and since all this storage was in memory, it was consuming far too much of it.&lt;/p&gt;


&lt;h2&gt;
&lt;span&gt;The Final Destination&lt;/span&gt;
&lt;/h2&gt;

&lt;p&gt;Collecting all the clues from our exploration, we decided to use both systems, i.e., Redis and RabbitMQ, each for the work it is best at.&lt;/p&gt;

&lt;p&gt;We deployed Redis as the result backend to store intermediate results and RabbitMQ as the broker for passing tasks and maintaining communication (remember the spoiler above).&lt;/p&gt;
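A minimal sketch of such a split configuration, assuming a Celery app named tasks with a toy add task (the hostnames, credentials, and the task itself are illustrative, not from our actual setup):

```python
from celery import Celery

# Broker (RabbitMQ) carries the task messages; result backend (Redis)
# stores the intermediate and final results of tasks.
app = Celery(
    'tasks',
    broker='amqp://guest:guest@localhost:5672//',  # RabbitMQ as broker
    backend='redis://localhost:6379/0',            # Redis as result backend
)

# Expire stored results so the backend does not grow without bound.
app.conf.result_expires = 3600

@app.task
def add(x, y):
    return x + y
```

With this split, chains and chords read their intermediate results from Redis while RabbitMQ queues only carry the task messages themselves.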



&lt;p&gt;&lt;a href="https://1.bp.blogspot.com/-zFlxvWKDZh8/WvijzlNdrfI/AAAAAAAAbog/ZOD6D9Hq9_8i5wL5zq3BK60BCfqCR-bSQCLcBGAs/s1600/fg.jpeg" rel="noopener noreferrer"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2F1.bp.blogspot.com%2F-zFlxvWKDZh8%2FWvijzlNdrfI%2FAAAAAAAAbog%2FZOD6D9Hq9_8i5wL5zq3BK60BCfqCR-bSQCLcBGAs%2Fs640%2Ffg.jpeg"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;With this architecture, we were able to run the system under a workload of over 10 million tasks a day, and it can be scaled easily.&lt;/p&gt;

&lt;p&gt;Hope this helps someone going through the same problem.&lt;/p&gt;

&lt;h2&gt;
Acknowledgments&lt;/h2&gt;

&lt;p&gt;We took hints from the Celery documentation, the Redis documentation, Stack Overflow threads, the Celery GitHub issues page, and case studies by Instagram and Trivago.&lt;/p&gt;



</description>
      <category>celery</category>
      <category>python</category>
      <category>distributedcomputing</category>
      <category>redis</category>
    </item>
    <item>
      <title>Word Vectorization -Word2Vec</title>
      <dc:creator>Japneet Singh Chawla</dc:creator>
      <pubDate>Tue, 31 Oct 2017 04:42:44 +0000</pubDate>
      <link>https://dev.to/japneet121/word-vectorization--word2vec-177</link>
      <guid>https://dev.to/japneet121/word-vectorization--word2vec-177</guid>
      <description>


&lt;p&gt;&lt;a href="https://3.bp.blogspot.com/-EzGSSDTyxq4/Wff7sYIW3iI/AAAAAAAAB5c/NRsslGTJlb8qZBxZKVZGFI0lsOS0eu1sACLcBGAs/s1600/cropped.png"&gt;&lt;img height="251" src="https://res.cloudinary.com/practicaldev/image/fetch/s--zOpFwWoq--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://3.bp.blogspot.com/-EzGSSDTyxq4/Wff7sYIW3iI/AAAAAAAAB5c/NRsslGTJlb8qZBxZKVZGFI0lsOS0eu1sACLcBGAs/s640/cropped.png" width="640"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;

Introduction&lt;/h2&gt;

&lt;p&gt;Machine Learning has become the hottest topic in the data industry, with increasing demand for professionals who can work in this domain.&lt;/p&gt;

&lt;p&gt;There is a large amount of textual data present on the internet and on giant servers around the world.&lt;/p&gt;

&lt;p&gt;Just for some facts:&lt;/p&gt;

&lt;p&gt;&lt;span id="docs-internal-guid-152fa68b-635d-a06a-c668-fdf727374bb6"&gt;&lt;/span&gt;&lt;/p&gt;

&lt;ul&gt;&lt;span id="docs-internal-guid-152fa68b-635d-a06a-c668-fdf727374bb6"&gt;
&lt;li&gt;
&lt;span&gt;1,209,600 new data producing social media users each day.&lt;/span&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;span&gt;656 million tweets per day!&lt;/span&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;span&gt;More than 4 million hours of content uploaded to Youtube every day, with users watching 5.97 billion hours of Youtube videos each day.&lt;/span&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;span&gt;67,305,600 Instagram posts uploaded each day&lt;/span&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;span&gt;There are over 2 billion monthly active Facebook users, compared to 1.44 billion at the start of 2015 and 1.65 at the start of 2016.&lt;/span&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;span&gt;Facebook has 1.32 billion daily active users on average as of June 2017&lt;/span&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;span&gt;4.3 BILLION Facebook messages posted daily!&lt;/span&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;span&gt;5.75 BILLION Facebook likes every day.&lt;/span&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;span&gt;22 billion texts sent every day.&lt;/span&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;span&gt;5.2 BILLION daily Google Searches in 2017.&lt;/span&gt;
&lt;/li&gt;
&lt;/span&gt;&lt;/ul&gt;
&lt;span id="docs-internal-guid-152fa68b-635d-a06a-c668-fdf727374bb6"&gt; &lt;/span&gt;
&lt;h2&gt;
Need for Vectorization&lt;/h2&gt;

&lt;a href="https://2.bp.blogspot.com/-g-5dcPNryNE/Wff5k_l_JCI/AAAAAAAAB5U/D4W2kDJ5MjEgDoU0I2WBhy7fsFpnkRz_QCEwYBhgL/s1600/vector_space1.png"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--chkP6p7Q--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://2.bp.blogspot.com/-g-5dcPNryNE/Wff5k_l_JCI/AAAAAAAAB5U/D4W2kDJ5MjEgDoU0I2WBhy7fsFpnkRz_QCEwYBhgL/s1600/vector_space1.png"&gt;&lt;/a&gt;

&lt;span&gt;&lt;span&gt;The amount of textual data is massive, and the problem with textual data is that it needs to be represented in a format that can be mathematically used in solving some problem.&lt;/span&gt;&lt;/span&gt;

&lt;span&gt;&lt;span&gt;In simple words, we need to get an integer representation of a word. &lt;/span&gt;&lt;/span&gt;&lt;span&gt;There are simple to complex ways to solve this problem. &lt;/span&gt;

&lt;span&gt;
&lt;/span&gt; 
&lt;h3&gt;
&lt;span&gt;One of the easiest ways to solve the problem is to create a simple &lt;b&gt;word to integer mapping&lt;/b&gt;.&lt;/span&gt;
&lt;/h3&gt;



&lt;pre&gt;#sentence to be vectorized
line="Hello this is a tutorial on how to convert the word in an integer format"

#dictionary to hold the word-to-integer mapping
word_list={}

#counter for assigning an integer to each new word
counter=0

#iterate over the words (split on whitespace;
#iterating over the string directly would yield characters)
for word in line.split():
 #check if the word is already in the dict
 if word not in word_list:
  word_list[word]=counter
  #update the counter
  counter+=1
&lt;/pre&gt;




&lt;p&gt;This will return us the dictionary of words with the corresponding integer representations.&lt;/p&gt;

&lt;h3&gt;
Another way to get these numbers is by using TF-IDF&lt;/h3&gt;


TF-IDF stands for term frequency-inverse document frequency. It assigns a weight to a word based on its number of occurrences in a document, while also taking into account the frequency of the word across all the documents. This approach is better than the previous one, as it lowers the weight of words that occur too often in all sentences, like 'a', 'the', 'as', etc., and increases the weight of words that may be important in a sentence.

This is useful in scenarios where we want to extract the important words from all the documents.

This approach is also used in topic modelling.
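As a rough illustration, the weighting above can be sketched in a few lines of plain Python (a toy computation, not the exact formula used by libraries such as scikit-learn):

```python
import math

# three tiny example documents
docs = [
    "the cat sat on the mat",
    "the dog ate the bone",
    "the cat chased the dog",
]
tokenized = [d.split() for d in docs]

def tf_idf(word, doc_tokens, all_docs):
    # term frequency: share of this document's words that are `word`
    tf = doc_tokens.count(word) / len(doc_tokens)
    # inverse document frequency: words occurring in fewer documents score higher
    containing = sum(1 for d in all_docs if word in d)
    idf = math.log(len(all_docs) / containing)
    return tf * idf

# "the" appears in every document, so its weight is zero
print(tf_idf("the", tokenized[0], tokenized))
# "mat" appears only in the first document, so it gets a positive weight
print(tf_idf("mat", tokenized[0], tokenized))
```

Note how the stop word "the" is automatically pushed to zero weight, which is exactly the behaviour described above.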




&lt;h3&gt;
The third approach, and the one this article will focus on, is Word2Vec&lt;/h3&gt;

&lt;p&gt;Word2vec is a group of related models that are used to produce so-called word embeddings. These models are shallow, two-layer neural networks that are trained to reconstruct the linguistic contexts of words.&lt;/p&gt;

&lt;p&gt;After training, word2vec models can be used to map each word to a vector of typically several hundred elements, which represent that word's relation to other words. This vector is the neural network's hidden layer.&lt;/p&gt;

&lt;p&gt;Word2vec relies on either skip-grams or continuous bag of words (CBOW) to create neural word embeddings. It was created by a team of researchers led by Tomas Mikolov at Google. The algorithm has been subsequently analyzed and explained by other researchers.&lt;/p&gt;
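Once words are mapped to vectors, their relatedness can be measured geometrically. A common measure is cosine similarity, sketched here on toy 3-element vectors (the vectors and words are made up for illustration; real embeddings have hundreds of dimensions):

```python
import math

def cosine_similarity(a, b):
    # dot product of the two vectors
    dot = sum(x * y for x, y in zip(a, b))
    # product of their Euclidean norms
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm

# toy embeddings: related words get nearby vectors
king = [0.9, 0.8, 0.1]
queen = [0.85, 0.82, 0.15]
apple = [0.1, 0.2, 0.9]

print(cosine_similarity(king, queen))  # close to 1.0
print(cosine_similarity(king, apple))  # much smaller
```

This is the measure behind gensim's `model.similarity()` call shown later in the article.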

&lt;p&gt;Luckily, if you want to use this model in your work you don't have to write these algorithms yourself.&lt;br&gt;
&lt;a href="https://radimrehurek.com/gensim/"&gt;Gensim&lt;/a&gt; is one of the Python libraries that has some of the awesome features required for text processing and Natural Language Processing. In the rest of the article, we will learn to use this awesome library for word vectorization.&lt;/p&gt;


&lt;a href="https://2.bp.blogspot.com/-TCovoMhSwL4/Wff5lGQa45I/AAAAAAAAB5U/hK1zJ-FDlhElm5YqFxwrkEmO-KcfpSUIQCEwYBhgL/s1600/man_woman.jpg"&gt;&lt;img height="310" src="https://res.cloudinary.com/practicaldev/image/fetch/s--RmxLrxwD--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://2.bp.blogspot.com/-TCovoMhSwL4/Wff5lGQa45I/AAAAAAAAB5U/hK1zJ-FDlhElm5YqFxwrkEmO-KcfpSUIQCEwYBhgL/s400/man_woman.jpg" width="400"&gt;&lt;/a&gt;
&lt;h4&gt;
Installing Gensim&lt;/h4&gt;


&lt;pre&gt;&lt;span class="n"&gt;pip&lt;/span&gt; &lt;span class="n"&gt;install&lt;/span&gt; &lt;span class="o"&gt;--&lt;/span&gt;&lt;span class="n"&gt;upgrade&lt;/span&gt; &lt;span class="n"&gt;gensim&lt;/span&gt;&lt;/pre&gt;



&lt;br&gt;



&lt;p&gt;It has three major dependencies:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Python&lt;/li&gt;
&lt;li&gt;NumPy&lt;/li&gt;
&lt;li&gt;SciPy&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Make sure you install the dependencies before installing gensim.&lt;/p&gt;

&lt;p&gt;Let's get our hands dirty with the code.&lt;/p&gt;

&lt;h4&gt;
Text Preprocessing:&lt;/h4&gt;

&lt;p&gt;In this step, we will pre-process the text like removing the stop words, lemmatize the words etc.&lt;/p&gt;

&lt;p&gt;You can perform different steps based on your requirements.&lt;/p&gt;

&lt;p&gt;I will use nltk stopword corpus for stop word removal and nltk word lemmatization for finding lemmas.&lt;/p&gt;

&lt;p&gt;In order to use the nltk corpus you will need to download it using the following commands.&lt;/p&gt;


&lt;p&gt;&lt;b&gt;Downloading the corpus&lt;/b&gt;&lt;/p&gt;

&lt;pre&gt;import nltk
nltk.download()
#this will open a GUI from which you can download the corpus
&lt;/pre&gt;

&lt;p&gt;&lt;b&gt;Input initialization&lt;/b&gt;&lt;/p&gt;

&lt;pre&gt;#list of sentences to be vectorized
lines=["Hello this is a tutorial on how to convert the word in an integer format",
"this is a beautiful day","Jack is going to office"]&lt;/pre&gt;

&lt;p&gt;&lt;b&gt;Removing the Stop Words&lt;/b&gt;&lt;/p&gt;

&lt;pre&gt;from nltk.corpus import stopwords
stop_words=set(stopwords.words('english'))

lines_without_stopwords=[]
#stop_words contains the set of stop words
for line in lines:
 temp_line=[]
 #iterate over the words of the line (not over the list of lines)
 for word in line.split():
  if word not in stop_words:
   temp_line.append(word)
 string=' '
 lines_without_stopwords.append(string.join(temp_line))

lines=lines_without_stopwords
&lt;/pre&gt;

&lt;p&gt;&lt;b&gt;Lemmatization&lt;/b&gt;&lt;/p&gt;

&lt;pre&gt;#import the WordNet Lemmatizer from nltk
from nltk.stem import WordNetLemmatizer
wordnet_lemmatizer = WordNetLemmatizer()

lines_with_lemmas=[]
for line in lines:
 temp_line=[]
 #iterate over the words of the line (not over the list of lines)
 for word in line.split():
  temp_line.append(wordnet_lemmatizer.lemmatize(word))
 string=' '
 lines_with_lemmas.append(string.join(temp_line))
lines=lines_with_lemmas
&lt;/pre&gt;

&lt;p&gt;Now we have done the basic preprocessing of the text. Any other preprocessing stuff can be achieved similarly.&lt;/p&gt;

&lt;h3&gt;
Preparing Input&lt;/h3&gt;


&lt;br&gt;



We have our input in the form of an array of lines. In order for the model to process the data, we need to convert our input to an array of arrays of words ( :\ ).


&lt;u&gt;&lt;i&gt;Our Input&lt;/i&gt;&lt;/u&gt;


&lt;pre&gt;lines=["Hello this is a tutorial on how to convert the word in an integer format",
"this is a beautiful day","Jack is going to office"]&lt;/pre&gt;




&lt;u&gt;&lt;i&gt;New Input&lt;/i&gt;&lt;/u&gt;


&lt;pre&gt;lines=[['Hello','this','tutorial','on','how','convert','word','integer','format'],
['this','beautiful','day'],['Jack','going','office']]&lt;/pre&gt;







&lt;pre&gt;new_lines=[]
for line in lines:
 #split on whitespace (split('') raises an error for an empty separator)
 new_lines.append(line.split())
#new_lines now has the new format
lines=new_lines
&lt;/pre&gt;


&lt;h3&gt;
Building the WORD2VEC Model&lt;/h3&gt;


Building a model with gensim is just a piece of cake.





&lt;pre&gt;#import the gensim package
import gensim

#build the model on the preprocessed lines
model = gensim.models.Word2Vec(lines, min_count=1,size=2)
&lt;/pre&gt;







Here it is important to understand the hyperparameters that can be used to train the model.

The Word2vec model constructor is defined as:





&lt;pre&gt;gensim.models.word2vec.Word2Vec(sentences=None, size=100, 
alpha=0.025, window=5, min_count=5, max_vocab_size=None, 
sample=0.001, seed=1, workers=3, min_alpha=0.0001, sg=0, 
hs=0, negative=5, cbow_mean=1, hashfxn=&amp;lt;built-in function hash&amp;gt;, 
iter=5, null_word=0, trim_rule=None, sorted_vocab=1,
batch_words=10000, compute_loss=False)&lt;/pre&gt;











&lt;ul&gt;
&lt;li&gt;sentences = the input, provided in the form of a list&lt;/li&gt;
&lt;li&gt;size = the size of the vector each word is converted to ('Hello'=[ ? , ? , ? ] if size=3)&lt;/li&gt;
&lt;li&gt;alpha = the initial learning rate (will linearly drop to min_alpha as training progresses)&lt;/li&gt;
&lt;li&gt;window = the maximum distance between the current and predicted word within a sentence&lt;/li&gt;
&lt;li&gt;min_count = ignore all words with total frequency lower than this&lt;/li&gt;
&lt;li&gt;max_vocab_size = limit RAM during vocabulary building; if there are more unique words than this, prune the infrequent ones. Every 10 million word types need about 1GB of RAM. Set to None for no limit (default)&lt;/li&gt;
&lt;li&gt;sample = threshold for configuring which higher-frequency words are randomly downsampled; default is 1e-3, useful range is (0, 1e-5)&lt;/li&gt;
&lt;li&gt;workers = use this many worker threads to train the model (faster training on multicore machines)&lt;/li&gt;
&lt;li&gt;hs = if 1, hierarchical softmax will be used for model training. If set to 0 (default), and negative is non-zero, negative sampling will be used&lt;/li&gt;
&lt;li&gt;negative = if &amp;gt; 0, negative sampling will be used; the int specifies how many "noise words" should be drawn (usually between 5-20). Default is 5. If set to 0, no negative sampling is used&lt;/li&gt;
&lt;li&gt;cbow_mean = if 0, use the sum of the context word vectors. If 1 (default), use the mean. Only applies when CBOW is used&lt;/li&gt;
&lt;li&gt;hashfxn = hash function used to randomly initialize weights, for increased training reproducibility. The default is Python's rudimentary built-in hash function&lt;/li&gt;
&lt;li&gt;iter = number of iterations (epochs) over the corpus. Default is 5&lt;/li&gt;
&lt;li&gt;trim_rule = vocabulary trimming rule: specifies whether certain words should remain in the vocabulary, be trimmed away, or handled using the default (discard if word count &amp;lt; min_count). Can be None (min_count will be used), or a callable that accepts parameters (word, count, min_count) and returns either utils.RULE_DISCARD, utils.RULE_KEEP or utils.RULE_DEFAULT. Note: the rule, if given, is only used to prune the vocabulary during build_vocab() and is not stored as part of the model&lt;/li&gt;
&lt;li&gt;sorted_vocab = if 1 (default), sort the vocabulary by descending frequency before assigning word indexes&lt;/li&gt;
&lt;li&gt;batch_words = target size (in words) for batches of examples passed to worker threads (and thus cython routines). Default is 10000. (Larger batches will be passed if individual texts are longer than 10000 words, but the standard cython code truncates to that maximum.)&lt;/li&gt;
&lt;/ul&gt;


&lt;h3&gt;
Using the model&lt;/h3&gt;


&lt;pre&gt;#saving the model for persistence
model.save('model.bin')

#loading the model back
#(use Word2Vec.load for models saved with model.save;
#KeyedVectors.load_word2vec_format is for the original C word2vec format)
model = gensim.models.Word2Vec.load('model.bin')

#getting the most similar words
model.most_similar(positive=['beautiful', 'world'], negative=['convert'], topn=1)

#finding the odd one out
model.doesnt_match("bullish holding stock".split())

#getting the vector for any word
model[word]

#finding the similarity between words
model.similarity('woman', 'man')
&lt;/pre&gt;











For more details, you can read the documentation of gensim's word2vec module &lt;a href="https://radimrehurek.com/gensim/models/word2vec.html"&gt;here&lt;/a&gt;.





</description>
      <category>word2vec</category>
      <category>gensim</category>
      <category>textanalytics</category>
      <category>wordvectorization</category>
    </item>
    <item>
      <title>Spidering the WEB </title>
      <dc:creator>Japneet Singh Chawla</dc:creator>
      <pubDate>Sun, 17 Sep 2017 11:37:28 +0000</pubDate>
      <link>https://dev.to/japneet121/spidering-the-web</link>
      <guid>https://dev.to/japneet121/spidering-the-web</guid>
<description>

&lt;p&gt;&lt;a href="https://2.bp.blogspot.com/-wufi5ZCBw_E/Wbtjh0rRpFI/AAAAAAAAByU/O2ph9Y0cSJcRnIi23pNT4PkiFcifJ5RZQCLcBGAs/s1600/web-scraping-services.png" rel="noopener noreferrer"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2F2.bp.blogspot.com%2F-wufi5ZCBw_E%2FWbtjh0rRpFI%2FAAAAAAAAByU%2FO2ph9Y0cSJcRnIi23pNT4PkiFcifJ5RZQCLcBGAs%2Fs1600%2Fweb-scraping-services.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
Introduction&lt;/h2&gt;

&lt;p&gt;We will be talking about &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Spidering/Scraping&lt;/li&gt;
&lt;li&gt;How to do it elegantly in python&lt;/li&gt;
&lt;li&gt;Limitations and restrictions&lt;/li&gt;
&lt;/ul&gt;



&lt;p&gt;In previous posts, I shared some methods of text mining and analytics, but one of the most important tasks before any analytics is getting the data we want to analyze.&lt;/p&gt;

&lt;p&gt;Text data is present all over in the form of blogs, articles, news, social feeds, posts, etc., and most of it is distributed to users in the form of APIs, RSS feeds, bulk downloads, and subscriptions.&lt;/p&gt;

&lt;p&gt;Some sites do not provide any means of pulling data programmatically; this is where scraping comes into the picture.&lt;/p&gt;

&lt;p&gt;&lt;b&gt;Note: &lt;/b&gt;Scraping information from sites that are paid or not publicly available can have serious consequences.&lt;/p&gt;

&lt;p&gt;Web Scraping is a technique of getting a web page in the form of HTML and parsing it to get the desired information.&lt;/p&gt;

&lt;p&gt;HTML is very complex in itself due to loose rules and a large number of attributes. Information can be scraped in two ways:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Manually filtering using regular expressions&lt;/li&gt;
&lt;li&gt;Python's way -Beautiful Soup&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In this post, we will be discussing beautiful soup's way of scraping.&lt;/p&gt;

&lt;h2&gt;
Beautiful Soup&lt;/h2&gt;

&lt;p&gt;As per the definition in its &lt;a href="https://www.crummy.com/software/BeautifulSoup/bs4/doc/" rel="noopener noreferrer"&gt;documentation&lt;/a&gt; &lt;/p&gt;

&lt;p&gt;"&lt;a href="http://www.crummy.com/software/BeautifulSoup/" rel="noopener noreferrer"&gt;Beautiful Soup&lt;/a&gt;&lt;span&gt; is a Python library for pulling data out of HTML and XML files. It works with your favorite parser to provide idiomatic ways of navigating, searching and modifying the parse tree. It commonly saves programmers hours or days of work.&lt;/span&gt;"&lt;/p&gt;

&lt;p&gt;If you have ever tried something like parsing text and HTML documents, you will understand how brilliantly this module is built and how much of a programmer's work and time it saves.&lt;/p&gt;

&lt;p&gt;Let's start with beautiful soup&lt;/p&gt;

&lt;h3&gt;
Installation&lt;/h3&gt;

&lt;p&gt;I hope Python is installed on your system. To install Beautiful Soup you can use pip:&lt;/p&gt;

&lt;pre&gt;pip install beautifulsoup4&lt;/pre&gt;

&lt;h3&gt;
Getting Started&lt;/h3&gt;

&lt;p&gt;&lt;b&gt;Problem 1: Getting all the links from a page.&lt;/b&gt;&lt;/p&gt;

&lt;p&gt;For this problem, we will use a sample HTML string which has some links and our goal is to get all the links&lt;/p&gt;

&lt;pre&gt;&lt;span&gt;html_doc&lt;/span&gt;&lt;span&gt; &lt;/span&gt;&lt;span&gt;=&lt;/span&gt;&lt;span&gt; &lt;/span&gt;&lt;span&gt;"""&lt;/span&gt;&lt;span&gt;
&lt;/span&gt;&lt;span&gt;&amp;lt;html&amp;gt;
&amp;lt;body&amp;gt;
&amp;lt;h1&amp;gt;Sample Links&amp;lt;/h1&amp;gt;
&amp;lt;br&amp;gt;
&amp;lt;a href="https://www.google.com"&amp;gt;Google&amp;lt;/a&amp;gt;
&amp;lt;br&amp;gt;
&amp;lt;a href="https://www.apple.com"&amp;gt;Apple&amp;lt;/a&amp;gt;
&amp;lt;br&amp;gt;
&amp;lt;a href="https://www.yahoo.com"&amp;gt;Yahoo&amp;lt;/a&amp;gt;
&amp;lt;br&amp;gt;
&amp;lt;a href="https://www.msdn.com"&amp;gt;MSDN&amp;lt;/a&amp;gt;
&amp;lt;/body&amp;gt;
&amp;lt;/html&amp;gt;&lt;/span&gt;&lt;span&gt;
&lt;/span&gt;&lt;span&gt;"""&lt;/span&gt;&lt;/pre&gt;











&lt;pre&gt;#import the package
from bs4 import BeautifulSoup


#create a BeautifulSoup object, passing 2 parameters:
#1) the html to be scanned
#2) the parser to be used (html.parser, lxml etc.)
soup=BeautifulSoup(html_doc,"html.parser")


#find all the anchor tags in the html string
#findAll returns a list of tags, in this case anchors (to get the first one, use find)
anchors=soup.findAll('a')

#getting links from the anchor tags
for a in anchors:
    print(a.get('href')) #get is used to read the attributes of a tag
    #a['href'] can also be used to access an attribute

&lt;/pre&gt;

&lt;p&gt;That's it: just 5-6 lines to get any tag from the HTML, iterate over the results, and read some attributes.&lt;br&gt;
Can you imagine doing this with regular expressions? It would be one heck of a job. This shows how well the module is coded to perform all these functions.&lt;/p&gt;

&lt;p&gt;Talking about the parsers (the one we passed while creating the BeautifulSoup object), we have multiple choices of parsers.&lt;/p&gt;

&lt;p&gt;This table summarizes the advantages and disadvantages of each parser library:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;colgroup&gt;
&lt;col&gt;
&lt;col&gt;
&lt;col&gt;
&lt;col&gt;
&lt;/colgroup&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;&lt;span&gt;Parser&lt;/span&gt;&lt;/td&gt;
&lt;td&gt;&lt;span&gt;Typical usage&lt;/span&gt;&lt;/td&gt;
&lt;td&gt;&lt;span&gt;Advantages&lt;/span&gt;&lt;/td&gt;
&lt;td&gt;&lt;span&gt;Disadvantages&lt;/span&gt;&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;span&gt;Python’s html.parser&lt;/span&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;&lt;span&gt;&lt;span&gt;BeautifulSoup(markup,&lt;/span&gt; &lt;span&gt;"html.parser")&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;
&lt;ul&gt;
&lt;li&gt;&lt;span&gt;Batteries included&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span&gt;Decent speed&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span&gt;Lenient (as of Python 2.7.3 and 3.2.)&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/td&gt;
&lt;td&gt;
&lt;ul&gt;
&lt;li&gt;&lt;span&gt;Not very lenient (before Python 2.7.3 or 3.2.2)&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;span&gt;lxml’s HTML parser&lt;/span&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;&lt;span&gt;&lt;span&gt;BeautifulSoup(markup,&lt;/span&gt; &lt;span&gt;"lxml")&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;
&lt;ul&gt;
&lt;li&gt;&lt;span&gt;Very fast&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span&gt;Lenient&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/td&gt;
&lt;td&gt;
&lt;ul&gt;
&lt;li&gt;&lt;span&gt;External C dependency&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;span&gt;lxml’s XML parser&lt;/span&gt;&lt;/td&gt;
&lt;td&gt;&lt;span&gt;&lt;code&gt;&lt;span class="pre"&gt;BeautifulSoup(markup,&lt;/span&gt; &lt;span class="pre"&gt;"lxml-xml")&lt;/span&gt;&lt;/code&gt;&lt;code&gt;&lt;span class="pre"&gt;BeautifulSoup(markup,&lt;/span&gt; &lt;span class="pre"&gt;"xml")&lt;/span&gt;&lt;/code&gt;&lt;/span&gt;&lt;/td&gt;
&lt;td&gt;
&lt;ul&gt;
&lt;li&gt;&lt;span&gt;Very fast&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span&gt;The only currently supported XML parser&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/td&gt;
&lt;td&gt;
&lt;ul&gt;
&lt;li&gt;&lt;span&gt;External C dependency&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;&lt;span&gt;html5lib&lt;/span&gt;&lt;/td&gt;
&lt;td&gt;&lt;code&gt;&lt;span&gt;&lt;span&gt;BeautifulSoup(markup,&lt;/span&gt; &lt;span&gt;"html5lib")&lt;/span&gt;&lt;/span&gt;&lt;/code&gt;&lt;/td&gt;
&lt;td&gt;
&lt;ul&gt;
&lt;li&gt;&lt;span&gt;Extremely lenient&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span&gt;Parses pages the same way a web browser does&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span&gt;Creates valid HTML5&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/td&gt;
&lt;td&gt;
&lt;ul&gt;
&lt;li&gt;&lt;span&gt;Very slow&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span&gt;External Python dependency&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;h3&gt;
Other methods and Usage&lt;/h3&gt;

&lt;p&gt;Beautiful soup is a vast library and can do things which are too difficult in just a single line.&lt;br&gt;
Some of the methods for searching tags in HTML are:&lt;/p&gt;

&lt;pre&gt;
&lt;span&gt;#finding by ID&lt;/span&gt;

&lt;span&gt;soup.find(id='abc')&lt;/span&gt;

&lt;span&gt;
&lt;/span&gt;

&lt;span&gt;#finding through a regex&lt;/span&gt;

&lt;span&gt;#limit the return to 2 tags&lt;/span&gt;

&lt;span&gt;soup.find_all(re.compile("^a"),limit=2)&lt;/span&gt;

&lt;span&gt;
&lt;/span&gt;

&lt;span&gt;#finding multiple tags&lt;/span&gt;

&lt;span&gt;soup.find_all(['a','h1'])&lt;/span&gt;

&lt;span&gt;
&lt;/span&gt;

&lt;span&gt;#find by custom or built-in attributes&lt;/span&gt;

&lt;span&gt;soup.find_all(attrs={'data':'abc'})&lt;/span&gt;
&lt;/pre&gt;

&lt;h3&gt;
Problem 2:&lt;/h3&gt;

&lt;p&gt;&lt;b&gt;In the above example we used an HTML string for parsing. Now we will see how to hit a URL, get the HTML for that page, and then parse it in the same manner as the HTML string above&lt;/b&gt;&lt;/p&gt;

&lt;p&gt;For this we will be using the urllib3 package of Python. It can be easily installed with the following command:&lt;/p&gt;

&lt;pre&gt;pip install urllib3&lt;/pre&gt;

&lt;p&gt;Documentation for urllib3 can be seen &lt;a href="https://urllib3.readthedocs.io/en/latest/user-guide.html" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;pre&gt;import urllib3
http = urllib3.PoolManager()
#hitting the url
r = http.request('GET', 'https://en.wikipedia.org/wiki/India')

#creating a soup object using the html from the link
soup=BeautifulSoup(r.data,"html.parser")

#getting the whole text from the wiki page
text=soup.text

#getting all the links from the wiki page
links=soup.find_all('a')

#iterating over the linked pages and getting text from them
#this can be done in a recursive fashion to parse a large number of pages
for link in links:
    href=link.get('href')
    new_url='https://en.wikipedia.org'+href
    r_new = http.request('GET', new_url)
    #do something with the new page
    new_text=r_new.data


#getting the source of the first image
src=soup.find('img').get('src')

&lt;/pre&gt;

&lt;p&gt;This was just a basic introduction to web scraping using Python. Much more can be achieved using the packages used in this tutorial. This article can serve as a starting point.&lt;/p&gt;

&lt;h2&gt;
Points to Remember&lt;/h2&gt;

&lt;p&gt;Web Scraping is very useful in gathering data for different purposes like data mining, knowledge creation, data analysis etc but it should be done with care.&lt;/p&gt;

&lt;p&gt;As a basic rule of thumb, we should not scrape anything that is paid content. That being said, we should also comply with the site's robots.txt file to know which areas can be crawled.&lt;/p&gt;
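Python's standard library can check robots.txt rules for you. A quick sketch, parsing a made-up robots.txt (the rules and URLs here are illustrative, not from any real site):

```python
from urllib import robotparser

# a made-up robots.txt for illustration
rules = [
    "User-agent: *",
    "Disallow: /private/",
]

rp = robotparser.RobotFileParser()
rp.parse(rules)

# check whether a generic crawler may fetch these paths
print(rp.can_fetch("*", "https://example.com/private/page"))  # False
print(rp.can_fetch("*", "https://example.com/wiki/India"))    # True
```

In a real crawler you would point RobotFileParser at the site's actual robots.txt with set_url() and read(), and call can_fetch before each request.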

&lt;p&gt;&lt;b&gt;It is very important to look into the legal implications before scraping.&lt;/b&gt;&lt;/p&gt;
&lt;h2&gt;
Hope the article was informative.&lt;/h2&gt;

&lt;p&gt;TechScouter (JSC)&lt;/p&gt;



&lt;br&gt;

</description>
      <category>scraping</category>
      <category>mining</category>
      <category>analysis</category>
    </item>
    <item>
      <title>Topic Modeling</title>
      <dc:creator>Japneet Singh Chawla</dc:creator>
      <pubDate>Tue, 22 Aug 2017 04:11:25 +0000</pubDate>
      <link>https://dev.to/japneet121/topic-modelling</link>
      <guid>https://dev.to/japneet121/topic-modelling</guid>
      <description>&lt;p&gt;&lt;a href="https://2.bp.blogspot.com/-2R8o8TuByy0/WZupbHOH4iI/AAAAAAAABtw/mXSUjcGlp7wDJ9jRxAlvxF-f67v8HqsoQCLcBGAs/s1600/tm.png" rel="noopener noreferrer"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2F2.bp.blogspot.com%2F-2R8o8TuByy0%2FWZupbHOH4iI%2FAAAAAAAABtw%2FmXSUjcGlp7wDJ9jRxAlvxF-f67v8HqsoQCLcBGAs%2Fs1600%2Ftm.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Hi readers,&lt;/p&gt;

&lt;p&gt;In the previous post, I wrote about gaining the knowledge from the Text which is available from many sources. In this post, I will be writing about Topic Mining.&lt;/p&gt;

&lt;h2&gt;
Introduction&lt;/h2&gt;

&lt;p&gt;Topic mining can be described as finding, within a group of words, the words that best describe that group.&lt;/p&gt;

&lt;p&gt;Textual data in raw form is not associated with any context. A human can easily identify the context or topic of an article by reading it and categorise it into one category or another, like politics, sports, economics, crime, etc.&lt;/p&gt;

&lt;p&gt;One of the factors any human considers while classifying text into one of the topics is the knowledge of how a word is associated with a topic, e.g.:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;India won over Sri Lanka in the &lt;b&gt;test match&lt;/b&gt;.&lt;/li&gt;
&lt;li&gt;World &lt;b&gt;Badminton&lt;/b&gt; Championships: When and where to watch &lt;b&gt;Kidambi Srikanth&lt;/b&gt;’s first round, live TV coverage, time in IST, live streaming&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Here we may not find the word sports explicitly in the sentences, but the words marked in bold are associated with sports.&lt;/p&gt;

&lt;p&gt;Topic modelling can be broadly categorised into two types:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Rule-Based topic modelling&lt;/li&gt;
&lt;li&gt;Unsupervised topic modelling&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
Rule-Based Topic Modelling&lt;/h2&gt;

&lt;p&gt;As the name suggests, rule-based topic modelling depends on rules that associate a given text with some topic.&lt;/p&gt;

&lt;p&gt;In the simplest rule-based approach, we can just search for certain words in the text and associate it with a topic, e.g. finding the word sports to associate the text with the topic sports, or finding travelling to associate it with the topic travel.&lt;/p&gt;

&lt;p&gt;This approach can be extended so that a topic is represented as a set of words with given probabilities, e.g. for the category sports we can have a set of words with a weight assigned to each word.&lt;br&gt;
Topic : Sports {"sports": 0.4, "cricket": 0.1, "badminton": 0.1, "&lt;b&gt;traveling&lt;/b&gt;": 0.05, ...}&lt;br&gt;
Topic : Travel {"travel": 0.4, "hiking": 0.1, "train": 0.05, "&lt;b&gt;traveling&lt;/b&gt;": 0.20, ...}&lt;/p&gt;

&lt;p&gt;Notice the word "traveling": it occurs in both topics but with a different weight.&lt;/p&gt;

&lt;p&gt;If we have the sentence "Badminton players are travelling to the UK for the tournament", the simple word-matching approach would file it under the topic Travel. The weighted approach improves the prediction by checking the weights of the different words, in this case "Badminton" and "travelling", and produces the more accurate result, Sports.&lt;/p&gt;
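&lt;p&gt;The weighted rule-based scoring described above can be sketched as follows. The topics and weights here are illustrative (chosen so the example resolves to Sports) and are not taken from any real model:&lt;/p&gt;

```python
# Illustrative topic dictionaries: each topic maps words to weights.
# These numbers are made up for the sketch, not learned from data.
topics = {
    "Sports": {"sports": 0.4, "cricket": 0.1, "badminton": 0.3, "traveling": 0.05},
    "Travel": {"travel": 0.4, "hiking": 0.1, "train": 0.05, "traveling": 0.20},
}

def score(sentence):
    words = sentence.lower().split()
    # sum the weight of every word under each topic; the highest total wins
    return max(topics, key=lambda t: sum(topics[t].get(w, 0) for w in words))

print(score("Badminton players are traveling to UK for the tournament"))  # Sports
print(score("I love hiking and travel"))                                  # Travel
```

&lt;p&gt;Even though "traveling" appears in the first sentence, the weight of "badminton" under Sports outweighs it, which is exactly the improvement over plain word matching.&lt;/p&gt;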

&lt;p&gt;The main disadvantage of the rule-based approach is that all the topics have to be known in advance and their probabilities determined and examined. This rules out the possibility of discovering new topics in the text corpus.&lt;/p&gt;

&lt;h2&gt;
Unsupervised Topic Modelling&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://1.bp.blogspot.com/-g3dI_6STH5s/WZuqBxnrhgI/AAAAAAAABt8/3AGn9JJPL7EHtkNl06Gpms2uMXLxKqj3QCLcBGAs/s1600/lda.png" rel="noopener noreferrer"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2F1.bp.blogspot.com%2F-g3dI_6STH5s%2FWZuqBxnrhgI%2FAAAAAAAABt8%2F3AGn9JJPL7EHtkNl06Gpms2uMXLxKqj3QCLcBGAs%2Fs640%2Flda.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The topic of a sentence largely depends on the words used in it, and this property is exploited by unsupervised topic modelling techniques to extract topics from sentences.&lt;br&gt;
They largely rely on the Bayesian inference model.&lt;/p&gt;

&lt;h3&gt;
Bayesian Inference Model&lt;/h3&gt;

&lt;p&gt;It is a method by which we can calculate the probability of occurrence of some event based on common-sense assumptions and the outcomes of previous related events.&lt;/p&gt;

&lt;p&gt;It also allows us to use new observations to improve the model, by going through many iterations in which a prior probability is updated with the observed evidence to produce a new and improved posterior probability.&lt;/p&gt;
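&lt;p&gt;A one-step sketch of such an update, with made-up probabilities (none of these numbers come from a real corpus):&lt;/p&gt;

```python
# One Bayesian update: a prior belief about a topic is revised after
# observing a word. All three probabilities are illustrative assumptions.
prior = 0.5                # P(topic = Sports) before reading the sentence
likelihood = 0.8           # P(observe "badminton" | topic = Sports)
evidence = 0.45            # P(observe "badminton") across all topics

posterior = likelihood * prior / evidence  # Bayes' rule
print(round(posterior, 3))  # 0.889: the evidence strengthens the belief
```

&lt;p&gt;Repeating this step with each new observation, the posterior from one round becomes the prior for the next.&lt;/p&gt;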

&lt;h3&gt;
Some of the techniques for Unsupervised Topic Modelling are:&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;TF-IDF&lt;/li&gt;
&lt;li&gt;Latent Semantic Indexing&lt;/li&gt;
&lt;li&gt;Latent Dirichlet Allocation&lt;/li&gt;
&lt;/ul&gt;
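&lt;p&gt;To give a flavour of the first technique, here is a minimal TF-IDF sketch over a toy corpus. The documents and the exact weighting formula are illustrative; library implementations such as gensim's TfidfModel use a slightly different normalisation:&lt;/p&gt;

```python
import math

# Toy corpus: three tokenised documents, made up for illustration.
docs = [["cat", "sat", "mat"], ["dog", "sat", "log"], ["cat", "ran"]]

def tf_idf(term, doc, docs):
    tf = doc.count(term) / len(doc)           # term frequency in this document
    df = sum(1 for d in docs if term in d)    # number of documents containing the term
    idf = math.log(len(docs) / df)            # rarer terms score higher
    return tf * idf

print(round(tf_idf("mat", docs[0], docs), 3))  # 0.366: "mat" is rare, so it scores high
print(round(tf_idf("cat", docs[0], docs), 3))  # 0.135: "cat" appears in 2 of 3 docs
```

&lt;p&gt;The intuition carries over to topic modelling: words that are frequent in a document but rare in the corpus are the best candidates for describing what the document is about.&lt;/p&gt;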

&lt;p&gt;I will not discuss the techniques in detail but will focus on their implementation in Python.&lt;br&gt;
All the approaches use the vector space representation of the documents. In the vector space model, a document is represented by a document-term matrix.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://3.bp.blogspot.com/-MF9AfD1OzdE/WZphHhbDM0I/AAAAAAAABtg/0YAaHqg6-H4EtrE4zA_2TBMHKxQiQzrHgCLcBGAs/s1600/Modeling2.png" rel="noopener noreferrer"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2F3.bp.blogspot.com%2F-MF9AfD1OzdE%2FWZphHhbDM0I%2FAAAAAAAABtg%2F0YAaHqg6-H4EtrE4zA_2TBMHKxQiQzrHgCLcBGAs%2Fs1600%2FModeling2.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Here D1, D2, D3, ..., Dn are different documents and W1, W2, W3, ..., Wn are the words that belong to our dictionary (all unique words in the corpus). DxWy gives the number of times the word Wy has occurred in the document Dx.&lt;/p&gt;
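&lt;p&gt;The document-term matrix described above can be sketched in a few lines of plain Python. The three documents are made up for illustration:&lt;/p&gt;

```python
# Build a document-term matrix by hand: rows are documents, columns are
# dictionary words, and each cell counts occurrences of the word in the document.
docs = ["the cat sat", "the dog sat", "the cat ran"]
vocab = sorted({w for d in docs for w in d.split()})  # all unique words (the dictionary)
matrix = [[d.split().count(w) for w in vocab] for d in docs]

print(vocab)      # ['cat', 'dog', 'ran', 'sat', 'the']
print(matrix[0])  # [1, 0, 0, 1, 1]: counts for "the cat sat"
```

&lt;p&gt;Libraries like gensim store the same information sparsely, as (word id, count) pairs, which is what doc2bow() returns below.&lt;/p&gt;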

&lt;pre&gt;
# importing corpora from gensim for creating the dictionary
from gensim import corpora, models

# some raw documents; these could also be stored in and read from a file
documents = ["Two analysts have provided estimates for Tower Semiconductor's earnings. The stock of Tower Semiconductor Ltd. (USA) (NASDAQ:TSEM) has \"Buy\" rating given on Friday, June 9 by Needham. The company's stock had a trading volume of 573,163 shares.",
             "More news for Tower Semiconductor Ltd. (USA) (NASDAQ:TSEM) were recently published by: Benzinga.com, which released: “15 Biggest Mid-Day Gainers For Monday” on August 14, 2017.",
             "Tower Semiconductor Ltd now has $2.33 billion valuation. The stock decreased 1.61% or $0.41 during the last trading session, reaching $25.06.",
             "The investment company owned 74,300 stocks of the industrial products firms shares after disposing 5,700 shares through out the quarter.",
             "Tower Semiconductor Ltd now has $2.37B valuation. The stock rose 0.16% or $0.04 reaching $25.1 per share. About 115,031 shares traded.",
             "Needle moving action has been spotted in Steven Madden Ltd (SHOO) as shares are moving today on volatility -2.00% or -0.85 from the open.",
             "Shares of Steven Madden Ltd (SHOO) have seen the needle move -1.20% or -0.50 in the most recent session. The NASDAQ listed company saw a recent bid of 41.10 on 82437 volume.",
             "Shares of Steven Madden Ltd (SHOO) is moving on volatility today -1.37% or -0.57 from the open. The NASDAQ listed company saw a recent bid of 41.03 on 279437 volume.",
             "Shares of Steven Madden, Ltd. (NASDAQ:SHOO) are on a recovery track as they have regained 28.79% since bottoming out at $32.3 on Oct. 26, 2016."]

'''
    Text preprocessing before calculating vectors
'''
# remove common words (stop words) and tokenize
stoplist = set('for a of the and to in its by his has have on were was which or as they since'.split())
texts = [[word for word in document.lower().split() if word not in stoplist]
         for document in documents]

# remove words that appear only once in the corpus
from collections import defaultdict
frequency = defaultdict(int)
for text in texts:
    for token in text:
        frequency[token] += 1

texts = [[token for token in text if frequency[token] &amp;gt; 1] for text in texts]

from pprint import pprint  # pretty-printer
pprint(texts)

'''
The dictionary is the collection of all the words that appear in our corpus.
'''
# creating a dictionary that maps each word in the text corpus to a unique integer id
dictionary = corpora.Dictionary(texts)
dictionary.save('/tmp/deerwester.dict')  # store the dictionary for future reference
# a saved dictionary can be loaded back later, saving the time of recreating it

# print the token ids of the words
print(dictionary.token2id)

'''
We can convert any document to vector form by calling the doc2bow() method
of the dictionary object, e.g.
document = "hello this is a good morning"
dictionary.doc2bow(document.lower().split())
Preprocessing can be done on the document before passing it to doc2bow().
'''
corpus = [dictionary.doc2bow(text) for text in texts]
print(corpus)

# converting the bag-of-words vectors to a tf-idf representation
tfidf = models.TfidfModel(corpus)
print(list(tfidf[corpus]))

# using the LDA model for getting the topics
lda = models.LdaModel(corpus, id2word=dictionary, num_topics=2)  # initialize an LDA transformation
corpus_lda = lda[corpus]  # wrap the original corpus: bow -&amp;gt; lda
print(lda.print_topics(2))  # print the topics

# classifying the text documents into topics;
# here the corpus the model was trained on is reused, but new documents can also be used
for doc in corpus_lda:
    print(doc)
&lt;/pre&gt;

&lt;p&gt;Output:&lt;/p&gt;

&lt;pre&gt;[(0, u'0.069*"shares" + 0.062*"ltd" + 0.059*"steven" + 0.044*"madden" + 0.044*"tower" + 0.043*"moving" + 0.043*"recent" + 0.041*"stock" + 0.040*"(shoo)" + 0.036*"semiconductor"'),
 (1, u'0.077*"shares" + 0.066*"tower" + 0.056*"semiconductor" + 0.050*"stock" + 0.048*"company" + 0.046*"ltd" + 0.045*"ltd." + 0.043*"(usa)" + 0.038*"(nasdaq:tsem)" + 0.035*"trading"')]&lt;/pre&gt;

&lt;p&gt;The output gives the topics along with their probability distribution over the different words.&lt;/p&gt;

&lt;p&gt;The parameters of LdaModel can be tweaked to improve the results. Also, by simply replacing LdaModel with LsiModel, one can use the LSI technique instead.&lt;/p&gt;



&lt;br&gt;

</description>
      <category>textanalysis</category>
      <category>analysis</category>
      <category>topicmining</category>
      <category>textmining</category>
    </item>
    <item>
      <title>Machine Learning-A Problem or Solution</title>
      <dc:creator>Japneet Singh Chawla</dc:creator>
      <pubDate>Thu, 17 Aug 2017 17:39:30 +0000</pubDate>
      <link>https://dev.to/japneet121/machine-learning-a-problem-or-solution</link>
      <guid>https://dev.to/japneet121/machine-learning-a-problem-or-solution</guid>
      <description>&lt;h3&gt;
The article will be divided into different sections as follows:&lt;/h3&gt;
&lt;ul&gt;
&lt;li&gt;Introduction to Machine Learning
&lt;/li&gt;
&lt;li&gt;Types of Solutions&lt;/li&gt;
&lt;li&gt;Classification using Naive Bayes&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
&lt;span id="intro"&gt;&lt;b&gt;A brief about Machine Learning&lt;/b&gt;&lt;/span&gt;
&lt;/h2&gt;


According to the definition by Wikipedia,&lt;b&gt; Machine learning&lt;/b&gt;&lt;span&gt; is the subfield of &lt;/span&gt;&lt;a href="https://en.wikipedia.org/wiki/Computer_science" title="Computer science"&gt;computer science&lt;/a&gt;&lt;span&gt; that, according to &lt;/span&gt;&lt;a href="https://en.wikipedia.org/wiki/Arthur_Samuel" title="Arthur Samuel"&gt;Arthur Samuel&lt;/a&gt;&lt;span&gt; in 1959, gives "computers the ability to learn without being explicitly programmed." &lt;/span&gt;

&lt;span&gt;Machine Learning defines a set of problems that are solved by learning from data through some algorithm.&lt;/span&gt;
&lt;span&gt; One factor that has to be kept in mind while designing an ML solution is accuracy. Accuracy is very critical if you are developing a solution in the medical domain (e.g. cancer detection). There should be a threshold set for every solution, based on the percentage of risk that is acceptable.&lt;/span&gt;

&lt;h3&gt;
&lt;span&gt;A useful cheatsheet from Microsoft's site to sum up the use of different ML algorithms for the different type of problems.&lt;/span&gt;
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://2.bp.blogspot.com/-xf-9_wVJzc0/WUZw8z43-NI/AAAAAAAABlQ/bvC5rZblonUuwBrS36gE7NCikZ2etcu8gCLcBGAs/s1600/ML_problems.png"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--7bGabkd8--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://2.bp.blogspot.com/-xf-9_wVJzc0/WUZw8z43-NI/AAAAAAAABlQ/bvC5rZblonUuwBrS36gE7NCikZ2etcu8gCLcBGAs/s1600/ML_problems.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
Types of solution&lt;/h2&gt;

Machine Learning solutions can be broadly classified into three types:

&lt;ul&gt;
&lt;li&gt;Supervised Learning&lt;/li&gt;
&lt;li&gt;Unsupervised Learning&lt;/li&gt;
&lt;li&gt;Reinforcement Learning&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;b&gt;Supervised Learning&lt;/b&gt;&lt;span&gt; is a category in which a machine is trained on labelled data and is told which problem it has to solve. The learnt model then uses its knowledge to solve similar problems in the future. A major requirement for this type of learner is the availability of labelled data. An example of this kind of problem is object identification: first a model is trained on the different types of objects, and then it is asked to identify them.&lt;/span&gt;&lt;/p&gt;

&lt;p&gt;&lt;b&gt;Unsupervised Learning&lt;/b&gt;&lt;span&gt;, on the other hand, is a category in which the learner is not provided with labelled data. The algorithm starts working on the problem and makes sense of it on its own. Initially it may not be very accurate, but it eventually learns to produce accurate results. We can think of this learner as a person who moves to a foreign country without any prior knowledge of the language spoken there; he learns from experience how people greet each other, how to say hi, and so on.&lt;/span&gt;&lt;/p&gt;

&lt;p&gt;&lt;b&gt;Reinforcement Learning&lt;/b&gt;&lt;span&gt; is a category in which a machine learns by making predictions and being penalised for wrong ones; the penalty teaches the machine not to make the same mistake again. Take the example of a driverless robot crawling along a road with obstacles: every time it bumps into an object, it is penalised for not taking the appropriate action. The next time the robot sees the same kind of obstacle, it knows from past experience and avoids the same mistake.&lt;/span&gt;&lt;/p&gt;
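&lt;p&gt;The reinforcement idea above can be sketched as a toy penalty table. This is purely illustrative; real reinforcement learning works with states, rewards and value functions rather than a single dictionary:&lt;/p&gt;

```python
# A toy sketch of learning from penalties: the agent is penalised for
# bumping into an obstacle and then prefers actions with fewer penalties.
penalty = {}  # action -> accumulated penalty

def choose(actions):
    # prefer the action with the smallest accumulated penalty so far
    return min(actions, key=lambda a: penalty.get(a, 0))

def learn(action, bumped):
    # a wrong prediction (a bump) increases the penalty for that action
    if bumped:
        penalty[action] = penalty.get(action, 0) + 1

learn("forward", bumped=True)       # the robot bumps into an obstacle
print(choose(["forward", "turn"]))  # turn: "forward" now carries a penalty
```

&lt;p&gt;After the penalty, the agent avoids repeating the action that caused the bump, which is the essence of learning from negative feedback.&lt;/p&gt;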
&lt;h2&gt;
&lt;span&gt;Machine Learning for Classification problem&lt;/span&gt;
&lt;/h2&gt;

&lt;p&gt;&lt;span&gt;Classification is a set of machine learning problems in which an input is classified into different classes; it can be either supervised or unsupervised.&lt;/span&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://1.bp.blogspot.com/-bFNOwGlrJLI/WUaVHAfs15I/AAAAAAAABlg/-S2D8WFACOQEmwF9z-DoJYtUHSuUDWtvwCLcBGAs/s1600/classification.png"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--LtwSqVc4--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://1.bp.blogspot.com/-bFNOwGlrJLI/WUaVHAfs15I/AAAAAAAABlg/-S2D8WFACOQEmwF9z-DoJYtUHSuUDWtvwCLcBGAs/s1600/classification.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;span&gt;Examples of classification problems:&lt;/span&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;span&gt;Sentiment Analysis&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span&gt;Product Categorization&lt;/span&gt;&lt;/li&gt;
&lt;li&gt;&lt;span&gt;Binary Classification on reviews(pos, neg) and much more&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
Movie Review Classification using Naive Bayes Algorithm&lt;/h2&gt;

&lt;p&gt;&lt;span&gt;I will be using nltk, the Natural Language Toolkit available in Python, and the movie review corpus, which has labelled movie reviews classified as positive and negative.&lt;/span&gt;&lt;/p&gt;

&lt;pre&gt;
import nltk.classify.util
from nltk.classify import NaiveBayesClassifier
from nltk.corpus import movie_reviews

# word_feats converts a list of words into a feature dictionary
def word_feats(words):
    return dict([(word, True) for word in words])

negids = movie_reviews.fileids('neg')
posids = movie_reviews.fileids('pos')

# training data is converted into features
negfeats = [(word_feats(movie_reviews.words(fileids=[f])), 'neg') for f in negids]
posfeats = [(word_feats(movie_reviews.words(fileids=[f])), 'pos') for f in posids]

# data is divided into training and testing sets (75/25 split)
negcutoff = len(negfeats) * 3 // 4
poscutoff = len(posfeats) * 3 // 4

trainfeats = negfeats[:negcutoff] + posfeats[:poscutoff]
testfeats = negfeats[negcutoff:] + posfeats[poscutoff:]
print('train on %d instances, test on %d instances' % (len(trainfeats), len(testfeats)))

# Naive Bayes classifier is trained on the training data
classifier = NaiveBayesClassifier.train(trainfeats)
print('accuracy:', nltk.classify.util.accuracy(classifier, testfeats))

classifier.show_most_informative_features()
# classify() expects a feature dictionary, so the sentence is tokenised first
sentence = "The plot was good, but the characters were not compelling"
print('classified as:', classifier.classify(word_feats(sentence.lower().split())))
&lt;/pre&gt;



&lt;br&gt;&lt;br&gt;
&lt;br&gt;

</description>
      <category>machinelearning</category>
      <category>classification</category>
    </item>
    <item>
      <title>Sentiment Analysis</title>
      <dc:creator>Japneet Singh Chawla</dc:creator>
      <pubDate>Sun, 13 Aug 2017 05:42:15 +0000</pubDate>
      <link>https://dev.to/japneet121/sentiment-analysis</link>
      <guid>https://dev.to/japneet121/sentiment-analysis</guid>
      <description>&lt;p&gt;&lt;a href="https://3.bp.blogspot.com/-4p-ndHTS-gg/WWJ4aq9VRlI/AAAAAAAABmg/Y5fnLzIReqkw6N78qTKyoFBH9b0jpsvHACLcBGAs/s1600/sentiment.jpg" rel="noopener noreferrer"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2F3.bp.blogspot.com%2F-4p-ndHTS-gg%2FWWJ4aq9VRlI%2FAAAAAAAABmg%2FY5fnLzIReqkw6N78qTKyoFBH9b0jpsvHACLcBGAs%2Fs1600%2Fsentiment.jpg"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This one took long because of the analysis work I did for it. There is a lot of work going on in the area of sentiment analysis, so I decided to compare the accuracy of several products. Let's start with some basics...&lt;/p&gt;

&lt;h2&gt;
NLP: Natural Language Processing&lt;/h2&gt;

&lt;p&gt;Natural Language Processing is a very interesting topic and a subject of debate when it comes to its accuracy.&lt;/p&gt;

&lt;p&gt;Natural language is very ambiguous, as the same sentence can have different meanings, e.g.&lt;/p&gt;

&lt;p&gt;&lt;span&gt;"I saw a man on a hill with a telescope."&lt;/span&gt;&lt;/p&gt;

&lt;p&gt;It seems like a simple statement until you begin to unpack the many alternate meanings:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;There’s a man on a hill, and I’m watching him with my telescope.&lt;/li&gt;
&lt;li&gt;There’s a man on a hill, who I’m seeing, and he has a telescope.&lt;/li&gt;
&lt;li&gt;There’s a man, and he’s on a hill that also has a telescope on it.&lt;/li&gt;
&lt;li&gt;I’m on a hill, and I saw a man using a telescope.&lt;/li&gt;
&lt;li&gt;There’s a man on a hill, and I’m seeing him with a telescope.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Sarcasm is a component of language that even many humans find difficult to understand.&lt;/p&gt;

&lt;p&gt;&lt;span&gt;Phrase:&lt;/span&gt;&lt;span&gt; "Great job."&lt;/span&gt;&lt;/p&gt;

&lt;p&gt;&lt;span&gt;Normally means:&lt;/span&gt;&lt;span&gt; You have created something of value here.&lt;/span&gt;&lt;br&gt;
&lt;span&gt;Sarcastically means:&lt;/span&gt;&lt;span&gt; You have created nothing of value here.&lt;/span&gt;&lt;/p&gt;

&lt;p&gt;&lt;span&gt;Also, every day new slang terms like ROFL, LOL, NaMo and BrExit are added to the vocabulary.&lt;/span&gt;&lt;/p&gt;

&lt;p&gt;&lt;span&gt;The presence of all these components in the language makes it much more difficult to process it.&lt;/span&gt;&lt;/p&gt;


&lt;h2&gt;
Sentiment Analysis&lt;/h2&gt;

&lt;p&gt;It is a sub-domain of NLP and basically a binary classification problem. [To know more about ML and classification, check my post &lt;b&gt;&lt;a href="http://techscouter.blogspot.in/2017/06/machine-learning-solution-or-problem.html" rel="noopener noreferrer"&gt;Machine Learning - A Solution or Problem&lt;/a&gt;&lt;/b&gt;.]&lt;/p&gt;

&lt;p&gt;The problem is to determine the polarity i.e the negativity or positivity of a sentence by processing the sentence and applying some algorithm to it.&lt;/p&gt;

&lt;p&gt;One of the naive approaches to calculating the polarity of a sentence is the Bag of Words technique.&lt;/p&gt;

&lt;p&gt;In this technique, a sentence is simply broken into words, and these words are mapped against a dictionary that contains words and their polarities. The total polarity of a sentence is calculated by summing up the polarities of the individual words. An example of such a dictionary is &lt;a href="http://www2.imm.dtu.dk/pubdb/views/publication_details.php?id=6010" rel="noopener noreferrer"&gt;AFINN&lt;/a&gt;.&lt;/p&gt;
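&lt;p&gt;A minimal sketch of this bag-of-words scoring, with a tiny made-up dictionary standing in for the real AFINN list:&lt;/p&gt;

```python
# Illustrative AFINN-style dictionary: word -> polarity score.
# These few entries are made up for the sketch; AFINN has thousands.
polarity = {"good": 3, "great": 3, "love": 3, "bad": -3, "terrible": -3, "not": -1}

def sentence_polarity(sentence):
    # sum the score of every known word; unknown words contribute 0
    return sum(polarity.get(w, 0) for w in sentence.lower().split())

print(sentence_polarity("The movie was great"))    # 3: positive
print(sentence_polarity("The plot was terrible"))  # -3: negative
```

&lt;p&gt;The sign of the total gives the sentence polarity, which is exactly why the approach breaks down on negation, sarcasm and other complex language.&lt;/p&gt;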


&lt;p&gt;&lt;span&gt;But this is a pretty basic approach; it works only for very simple sentences and is not effective for complex language. There are many other algorithms, like SVM and Maximum Entropy, which calculate the polarity by forming relationships between the words.&lt;/span&gt;&lt;/p&gt;

&lt;h2&gt;
Twitter Sentiment Analysis&lt;/h2&gt;

&lt;p&gt;Twitter is a very popular social network where information spreads like a fire and reaches millions of users within seconds. Twitter sentiment analysis is a subject in which the Twitter tweets are analysed for calculating polarities and then a social notion about a topic is estimated.&lt;/p&gt;

&lt;p&gt;There are many products in the market that claim to provide market analysis based on Twitter data.&lt;br&gt;
A big question about this type of analysis: if natural language is hard to understand, and even after much research sentiment-analysis accuracy is at most 85%, how much harder is it to classify a tweet of just 160 characters as positive or negative, especially when tweets are full of jargon and slang?&lt;/p&gt;

&lt;p&gt;To explore this, I took a set of 100 tweets and ran sentiment analysis through three models, with manual tagging as the reference:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;IBM Watson&lt;/li&gt;
&lt;li&gt;CoreNLP by Stanford&lt;/li&gt;
&lt;li&gt;Microsoft Cognitive Service&lt;/li&gt;
&lt;li&gt;Manual Tagging&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;There was a maximum accuracy of 50%, and all the models performed nearly equally well.&lt;br&gt;
The results can be seen &lt;a href="https://docs.google.com/spreadsheets/d/1c-QD_t3vpPWG-BNTr2xBXZUfZDaHi873MKbaQASnusQ/edit?usp=sharing" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;br&gt;
One thing that was clear from this analysis is that&lt;/p&gt;

&lt;p&gt;&lt;span&gt;"Either we can not depend on Twitter's tweet data for Market Analysis and Predictions through Sentiment Analysis or we need a more sophisticated system to do the analysis."&lt;/span&gt;&lt;/p&gt;



</description>
      <category>machinelearning</category>
      <category>classification</category>
    </item>
    <item>
      <title>Text Analytics</title>
      <dc:creator>Japneet Singh Chawla</dc:creator>
      <pubDate>Sat, 12 Aug 2017 23:35:09 +0000</pubDate>
      <link>https://dev.to/japneet121/text-analytics</link>
      <guid>https://dev.to/japneet121/text-analytics</guid>
      <description>&lt;p&gt;&lt;a href="https://1.bp.blogspot.com/-b6oEfHpnNZo/WYbl77A-x2I/AAAAAAAABps/CemAjXZhbDUdcuPVcibYL3pVujMLUtn9wCEwYBhgL/s1600/text-mining.jpg" rel="noopener noreferrer"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2F1.bp.blogspot.com%2F-b6oEfHpnNZo%2FWYbl77A-x2I%2FAAAAAAAABps%2FCemAjXZhbDUdcuPVcibYL3pVujMLUtn9wCEwYBhgL%2Fs1600%2Ftext-mining.jpg"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Hi Readers,&lt;br&gt;
Recently I was going through some text analytics activities at work and learned some techniques for text analytics and mining.&lt;/p&gt;

&lt;p&gt;In this series of posts, I will be sharing my experiences and learnings.&lt;/p&gt;

&lt;h2&gt;
Introduction&lt;/h2&gt;

&lt;p&gt;First, the terms Text Analysis and Text Mining are subdomains of Data Mining and are used interchangeably in most scenarios.&lt;br&gt;
Broadly, &lt;b&gt;Text Analysis&lt;/b&gt; refers to extracting information from textual data, keeping in mind the problem for which we want the data,&lt;br&gt;
and &lt;b&gt;Text Mining&lt;/b&gt; refers to the process of getting the textual data.&lt;/p&gt;

&lt;p&gt;Nowadays, a large quantity of data is produced by humans. Data is growing faster than ever before; by the year 2020, about 1.7 megabytes of new information will be created every second for every human being on the planet, and one of its main components will be textual data.&lt;br&gt;
Some of the main sources of textual data are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Blogs&lt;/li&gt;
&lt;li&gt;Articles&lt;/li&gt;
&lt;li&gt;Websites&lt;/li&gt;
&lt;li&gt;Facebook&lt;/li&gt;
&lt;li&gt;Comments&lt;/li&gt;
&lt;li&gt;Discussion Forums&lt;/li&gt;
&lt;li&gt;Reviews&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://2.bp.blogspot.com/-jgpbpNjeirM/WYbozkEV7wI/AAAAAAAABp4/R_6bOwPd2MUv34iObsXsf83IUBjZdoIlQCLcBGAs/s1600/observer.png" rel="noopener noreferrer"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2F2.bp.blogspot.com%2F-jgpbpNjeirM%2FWYbozkEV7wI%2FAAAAAAAABp4%2FR_6bOwPd2MUv34iObsXsf83IUBjZdoIlQCLcBGAs%2Fs1600%2Fobserver.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In the scenario described in the image above, every observer may perceive the real world in a different manner and express it accordingly, e.g. different people may have different opinions on the same topic.&lt;br&gt;
So, while analysing textual data, this type of bias has to be kept in mind.&lt;/p&gt;

&lt;h4&gt;
The types of information that can be extracted from text and converted into actionable knowledge include:&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Mining content of data&lt;/li&gt;
&lt;li&gt;Mining knowledge about the language&lt;/li&gt;
&lt;li&gt;Mining knowledge about the observer&lt;/li&gt;
&lt;li&gt;Infer other real world variables&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
NLP&lt;/h2&gt;

&lt;p&gt;NLP stands for Natural Language Processing, a subdomain of Machine Learning that deals with processing the natural language spoken by humans and making it understandable to a machine.&lt;/p&gt;

&lt;p&gt;It is a fairly complex problem, as understanding natural language involves common sense, which machines lack. Many sentences spoken by humans involve common sense and are spoken in some context that determines the meaning of the sentence, e.g.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;He is the apple of my eye. (Phrase)&lt;/li&gt;
&lt;/ul&gt;

&lt;ul&gt;
&lt;li&gt;&lt;span&gt;John is the star. (Movie star or celestial body)&lt;/span&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
Analysing natural language can be broadly classified into 4 types of analysis&lt;/h4&gt;

&lt;ol&gt;
&lt;li&gt;Lexical Analysis: identifying parts of speech, word associations, topic mining&lt;/li&gt;
&lt;li&gt;Syntactical Analysis: connecting words into phrases and then connecting the phrases&lt;/li&gt;
&lt;li&gt;Semantic Analysis: extracting the knowledge contained in the sentence&lt;/li&gt;
&lt;li&gt;Pragmatic Analysis: getting the intent of the sentence&lt;/li&gt;
&lt;/ol&gt;




&lt;h2&gt;
Lexical Analysis&lt;/h2&gt;

&lt;p&gt;By definition, lexical analysis means converting a sequence of characters into a sequence of words with defined meanings.&lt;/p&gt;

&lt;p&gt;Definitions of words can be&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;POS tags (nouns, verbs, adverbs, etc.)&lt;/li&gt;
&lt;li&gt;Word associations (how two words are linked)&lt;/li&gt;
&lt;li&gt;Topic (context of the paragraph)&lt;/li&gt;
&lt;li&gt;Syntagmatic words&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In this article, I will be writing about &lt;b&gt;Word Associations&lt;/b&gt;, which is the analysis of how different words are linked to each other. Look at the following sentences:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt; "Jack drives a car", "car" is associated to "drive"&lt;/li&gt;
&lt;li&gt;"I love cats"&lt;/li&gt;
&lt;li&gt;"I love dogs", In these sentences, the word cats is related to dogs as they can be used interchangeably to make a valid sentence&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In the first sentence, "Jack drives a car", the type of relationship between "car" and "drive" is called a &lt;b&gt;Syntagmatic Relation&lt;/b&gt;.&lt;/p&gt;

&lt;p&gt;Two words are called syntagmatically related if the probability of both words occurring in the same sentence is high. E.g.:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Bob eats food&lt;/li&gt;
&lt;li&gt;John drives a car&lt;/li&gt;
&lt;li&gt;He sits on a sofa&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These types of relations are important for finding the probability of a word occurring in a sentence, given another word that is present in the sentence.&lt;/p&gt;

&lt;p&gt;This problem can be mathematically reduced to predicting a random variable v, where v=1 if the word is present in the sentence and v=0 if the word is not.&lt;/p&gt;

&lt;p&gt;The more random this variable is, the more difficult it is to predict. For example, the probability of "the" occurring in a sentence when some other word has occurred can be predicted easily, as "the" is a pretty common word; on the other hand, the probability of the word "eat" occurring in a sentence is low, and it is difficult to predict its occurrence.&lt;/p&gt;

&lt;p&gt;The randomness of a random variable can be measured by &lt;b&gt;Entropy.&lt;/b&gt;&lt;/p&gt;


&lt;p&gt;&lt;b&gt;H(Xw) = -∑ P(Xw=v) log P(Xw=v)&lt;/b&gt;&lt;br&gt;
&lt;b&gt;(v ∈ {0,1})&lt;/b&gt;&lt;/p&gt;

&lt;p&gt;Xw is the random variable indicating whether the word w is present in the sentence,&lt;br&gt;
H(Xw) is the entropy of the variable Xw,&lt;br&gt;
P(Xw=v) is the probability that the word is present (v=1) or absent (v=0),&lt;br&gt;
log P(Xw=v) is the natural log of that probability.&lt;/p&gt;

&lt;p&gt;"The higher the entropy, the more difficult it is to predict the syntagmatic relation."&lt;/p&gt;

&lt;p&gt;Now that we know the concept of entropy and how it can be used to predict the existence of a word in a sentence, let's introduce one more concept called &lt;b&gt;Conditional Entropy&lt;/b&gt;.&lt;br&gt;
It is the entropy of a word w1 given that another word w2 is known to have occurred in the sentence.&lt;br&gt;
Conditioning on the known word reduces the entropy, which in turn reduces the randomness of the random variable.&lt;br&gt;
Conditional entropy can be defined as:&lt;/p&gt;

&lt;p&gt;&lt;b&gt;H(Xmeat | Xeats=1) = -P(Xmeat=0 | Xeats=1) log P(Xmeat=0 | Xeats=1)&lt;br&gt;
                                    - P(Xmeat=1 | Xeats=1) log P(Xmeat=1 | Xeats=1)&lt;/b&gt;&lt;/p&gt;

&lt;p&gt;This quantifies the uncertainty of "meat" occurring when "eats" has already occurred.&lt;/p&gt;
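&lt;p&gt;The same sketch extends to conditional entropy: restrict attention to the sentences that contain the known word, and compute the entropy of the target word within that subset. Again, the example sentences are my own toy illustration:&lt;/p&gt;

```python
import math

def cond_entropy(sentences, w, given):
    """H(Xw | Xgiven = 1): uncertainty of `w` occurring, restricted to
    the sentences that already contain `given`."""
    subset = [s for s in sentences if given in s.lower().split()]
    if not subset:
        return 0.0
    p1 = sum(1 for s in subset if w in s.lower().split()) / len(subset)
    return -sum(p * math.log2(p) for p in (p1, 1 - p1) if p > 0)

sentences = [
    "bob eats meat",
    "john eats meat",
    "mary eats bread",
    "he drives a car",
]
# Among the 3 sentences containing "eats", "meat" appears in 2,
# so H is computed from p = 2/3 (about 0.918 bits).
print(cond_entropy(sentences, "meat", "eats"))
```

&lt;p&gt;A strongly syntagmatic pair would drive this conditional entropy close to zero: knowing "eats" occurred would leave little uncertainty about "meat".&lt;/p&gt;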

&lt;h3&gt;
Paradigmatic Relations&lt;/h3&gt;

&lt;p&gt;The other type of relationship that can be found between words is the paradigmatic relation. Two words are said to be paradigmatically related if they can be replaced with each other and still make a valid sentence. E.g.:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;I love cats&lt;/li&gt;
&lt;li&gt;I love dogs&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In the first sentence, if "cats" is replaced with "dogs", it still makes a valid sentence, which means "cats" and "dogs" are paradigmatically related.&lt;/p&gt;

&lt;p&gt;In order to find this type of relationship, we first need to find the &lt;b&gt;Left Context&lt;/b&gt;, which means the words that appear to the left of the word, and then the &lt;b&gt;Right Context&lt;/b&gt;, which means the words that appear to the right of the word.&lt;/p&gt;

&lt;p&gt;Once we have these contexts, we can find the context similarities. Words with more similar contexts are more paradigmatically related. This problem can be mathematically defined by representing each context as a vector, where each component of the vector is the normalized frequency of a word's occurrences in that context.&lt;/p&gt;

&lt;p&gt;Now the two contexts can be represented as vectors:&lt;/p&gt;

&lt;p&gt;d1 = (x1, x2, x3, x4, ........, xn)&lt;/p&gt;

&lt;p&gt;xi = c(wi, d1) / |d1|&lt;/p&gt;

&lt;p&gt;where c(wi, d1) is the count of word wi in d1 and |d1| is the number of words in d1.&lt;/p&gt;

&lt;p&gt;d2 = (y1, y2, y3, y4, ........, yn)&lt;/p&gt;

&lt;p&gt;yi = c(wi, d2) / |d2|&lt;/p&gt;

&lt;p&gt;where c(wi, d2) is the count of word wi in d2 and |d2| is the number of words in d2.&lt;/p&gt;

&lt;p&gt;And the context similarity can be calculated as the dot product:&lt;/p&gt;

&lt;p&gt;Sim(d1, d2) = d1 . d2&lt;/p&gt;

&lt;p&gt;Sim(d1, d2) = x1y1 + x2y2 + x3y3 + ...... + xnyn&lt;/p&gt;

&lt;p&gt;Sim(d1, d2) = ∑ xiyi, where i = 1 to n&lt;/p&gt;

&lt;p&gt;This similarity is the probability that a word randomly picked from d1 and a word randomly picked from d2 are identical.&lt;/p&gt;
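&lt;p&gt;Putting the context vectors and the dot product together, a minimal sketch could look like this. It is my own toy illustration: the "context" of a word is taken to be simply the other words of each sentence it appears in, rather than separate left and right contexts:&lt;/p&gt;

```python
from collections import Counter

def context_vector(word, sentences):
    """Normalized bag-of-words vector of the words surrounding `word`."""
    counts = Counter()
    total = 0
    for s in sentences:
        tokens = s.lower().split()
        if word in tokens:
            context = [t for t in tokens if t != word]
            counts.update(context)
            total += len(context)
    # Each component is c(wi, d) / |d|, as in the formulas above.
    return {w: c / total for w, c in counts.items()} if total else {}

def sim(d1, d2):
    """Dot product of two context vectors: sum of xi*yi over shared words."""
    return sum(x * d2.get(w, 0.0) for w, x in d1.items())

sentences = ["i love cats", "i love dogs", "i hate rats", "dogs bark"]
cats = context_vector("cats", sentences)
dogs = context_vector("dogs", sentences)
rats = context_vector("rats", sentences)
print(sim(cats, dogs) > sim(cats, rats))  # cats/dogs share more context
```

&lt;p&gt;Here "cats" and "dogs" come out as more paradigmatically related than "cats" and "rats", because their contexts overlap on both "i" and "love".&lt;/p&gt;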

&lt;p&gt;I will soon be adding links to programmatic models for mining these relations, and in the next part of this series I will be writing about Topic Mining.&lt;/p&gt;

&lt;p&gt;Stay Tuned......&lt;/p&gt;

</description>
      <category>textanalytics</category>
      <category>datamining</category>
      <category>bigdata</category>
    </item>
  </channel>
</rss>
