<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Anshaj Khare</title>
    <description>The latest articles on DEV Community by Anshaj Khare (@anshaj).</description>
    <link>https://dev.to/anshaj</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F324810%2F0957571a-a8f2-42fd-8d04-c1963dcd5ce3.png</url>
      <title>DEV Community: Anshaj Khare</title>
      <link>https://dev.to/anshaj</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/anshaj"/>
    <language>en</language>
    <item>
      <title>Build a GraphQL API in minutes</title>
      <dc:creator>Anshaj Khare</dc:creator>
      <pubDate>Fri, 08 Nov 2024 19:31:37 +0000</pubDate>
      <link>https://dev.to/anshaj/build-a-graphql-api-in-minutes-aj0</link>
      <guid>https://dev.to/anshaj/build-a-graphql-api-in-minutes-aj0</guid>
      <description>&lt;p&gt;Did you know that there are ways today you can build secure, extremely performant APIs today - both REST and GraphQL on top of your existing data in minutes?&lt;/p&gt;

&lt;p&gt;Read to find out how at &lt;a href="https://thecomputeblog.substack.com/p/hasura-effortless-unified-api-on" rel="noopener noreferrer"&gt;https://thecomputeblog.substack.com/p/hasura-effortless-unified-api-on&lt;/a&gt;&lt;/p&gt;

</description>
      <category>tutorial</category>
      <category>webdev</category>
      <category>graphql</category>
      <category>backenddevelopment</category>
    </item>
    <item>
      <title>How I cleared my AWS Developer associate exam and here's how you can do it too</title>
      <dc:creator>Anshaj Khare</dc:creator>
      <pubDate>Wed, 28 Apr 2021 19:10:47 +0000</pubDate>
      <link>https://dev.to/anshaj/how-i-cleared-my-aws-developer-associate-exam-and-here-s-how-you-can-do-it-too-1fpl</link>
      <guid>https://dev.to/anshaj/how-i-cleared-my-aws-developer-associate-exam-and-here-s-how-you-can-do-it-too-1fpl</guid>
      <description>&lt;p&gt;In the September of 2020, I decided to work towards the AWS developer associate exam. I wanted to use the opportunity to not only learn about AWS more but also to compare it with GCP, which I'd been using for the past 2 years. I ended up scoring 916 out of 1000 when I gave my exam in December of the same year. I created this guide to help those who are wondering whether they should pursue this certification as well as those who are actively preparing for it. &lt;/p&gt;

&lt;p&gt;Why should you take this exam? &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Cloud technology is being used everywhere, and I'm sure you don't need any convincing on this. AWS is one of the leading cloud platforms, and the path towards these certifications involves learning how to use the platform optimally. This, I believe, is the most important reason to pursue this certification: it'll teach you how to use AWS effectively and adopt best practices, so that you're better prepared once you do start working on the platform&lt;/li&gt;
&lt;li&gt;It's a skill that's in high demand. The knowledge that you'll gain during the preparation will make you an important asset to any team that runs its infrastructure on AWS.&lt;/li&gt;
&lt;li&gt;You'll be able to experiment. For me, the most enjoyable way to learn a technology is to build projects with it. I find it especially fun to be able to architect my projects in multiple ways: from using a serverless architecture instead of traditional auto-scaling instances to using serverless containers and serverless databases. The variations of your architecture are only limited by your imagination (and what is possible with current technology). This also means that you can learn first-hand the advantages and disadvantages of the various architectures. This knowledge is invaluable and only comes with experience and experimentation. &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;With the reasons to take this exam out of the way, I'll now focus on how you can prepare and what you should expect. I'll split the tips into those that apply to all associate-level exams and those specific to the developer associate exam. Let's begin.&lt;/p&gt;

&lt;p&gt;What will you need? &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A dedicated course 

&lt;ul&gt;
&lt;li&gt;A dedicated course that walks you through the various things you need to understand about AWS. There are plenty of courses on education websites, like the ones I've listed below, that will help you understand the intricacies in a step-by-step fashion. Note that these courses are not sponsoring this post in any way; I'm recommending them based on my own experience. 

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://www.udemy.com/course/aws-certified-developer-associate-dva-c01/"&gt;Stephen Maarek's course on Udemy&lt;/a&gt; &lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.udemy.com/course/aws-certified-developer-associate/"&gt;ACloud guru course on Udemy&lt;/a&gt; &lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;You don't necessarily need a paid course and can make do with the free training from &lt;a href="https://www.aws.training"&gt;AWS training&lt;/a&gt;. You could also combine an instructor-led course with AWS training for more clarity.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;An AWS account

&lt;ul&gt;
&lt;li&gt;Having an AWS account to practice the things that you learn will significantly increase your understanding of AWS. Concepts like IAM, roles, policy generation, VPC, etc. are best understood when you interact with them on the AWS console (or CLI). I strongly advise you to practice creating resources, managing their permissions, and terminating them to understand their lifecycle. This will also help when you learn about concepts like &lt;a href="https://aws.amazon.com/cloudformation/"&gt;CloudFormation&lt;/a&gt; or &lt;a href="https://aws.amazon.com/serverless/sam/"&gt;SAM&lt;/a&gt; that allow you to manage infrastructure with code. &lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;Preparation plan

&lt;ul&gt;
&lt;li&gt;I prepared for a month and a half before taking my exam. I had to manage my learning along with my full-time job, which was not particularly easy. Motivating yourself to read more or watch a video after a busy workday is hard, but I'd advise you to persist and learn continuously. If you don't have prior experience with the cloud or AWS, you may end up needing more time for individual concepts to sink in. The services on AWS are abstracted to make them convenient for developers to operate, but they also offer a lot of flexibility, which can only be utilised by developers who take the time to study them in detail.&lt;/li&gt;
&lt;li&gt;Make a list of the topics that are relevant for your exam and make sure that you're learning about them continuously during your preparation. For the AWS Developer Associate exam, make sure that you spend enough time on the topics mentioned in the &lt;a href="https://aws.amazon.com/certification/certified-developer-associate/"&gt;exam guide&lt;/a&gt;. &lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;What to expect during the exam?&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Exam mode - Online/Examination centre: 

&lt;ul&gt;
&lt;li&gt;When I took my exam, COVID-19 restrictions made me choose the online version, which let me take the exam from the comfort of my home. Your experience will differ depending on whether you opt for an examination centre or the online exam. I can, however, share my experience of the exam that I took from home.&lt;/li&gt;
&lt;li&gt;The online examination is proctored. This means that at the start of your exam, your proctor will ask you to show your desk and surroundings. Throughout the exam, the proctor monitors you to make sure that you're able to take the exam properly and to ensure that there are no incidents of malpractice. A stable internet connection and a webcam are a must for the online exam. I tried to make sure that everything was in order, but I still ended up facing a few problems during my exam: 

&lt;ul&gt;
&lt;li&gt;First, the exam application that I had to install on my machine failed to clear the system checks. This was because I was on macOS Big Sur while the application expected an earlier version of macOS. To resolve this, I had to dig out an old Windows machine and install the exam software on it. &lt;/li&gt;
&lt;li&gt;Second, in the middle of the test, there was a hiccup in my internet connection. My exam session window terminated for a minute, and I panicked. Fortunately, my exam proctor helped me restore the session, and I was back to solving the questions within a minute. &lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;If you're opting for the online exam, ensure that you have power backup (in case you're not using a laptop) and a very stable internet connection. The policies regarding misconduct are severe and may lead to your exam being cancelled. You should also ensure that your room is quiet, to avoid any potential issues with your exam.&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;Be prepared for scenario-based questions. Though some questions will be fact-based and test your memory, a lot of questions will be scenario-based, and the choices may be confusing. Hence, while preparing, make sure that you study the topics in detail and practice what you learn.&lt;/li&gt;
&lt;li&gt;Multiple-answer questions will tell you how many answers are correct. If two answers in a question are correct, the question will specify that two of the n choices are correct. This helps narrow down the answers for these types of questions. &lt;/li&gt;
&lt;li&gt;Keep an eye on the remaining time. If a question is taking too long, you can flag it and come back to it later. This technique is effective in avoiding time pressure towards the end of your exam. &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Well, if you've made it this far, I hope this guide proves useful to you in some way. Do remember to read the latest official exam guide, as the exam instructions may change. With this advice, I wish you the best of luck! Thank you for reading. &lt;/p&gt;

</description>
      <category>aws</category>
      <category>career</category>
      <category>cloudskills</category>
      <category>guide</category>
    </item>
    <item>
      <title>How I created a CD pipeline for Firebase RemoteConfig using GitHub actions</title>
      <dc:creator>Anshaj Khare</dc:creator>
      <pubDate>Sun, 24 Jan 2021 19:45:56 +0000</pubDate>
      <link>https://dev.to/anshaj/how-i-created-a-cd-pipeline-for-firebase-remoteconfig-using-github-actions-2i4j</link>
      <guid>https://dev.to/anshaj/how-i-created-a-cd-pipeline-for-firebase-remoteconfig-using-github-actions-2i4j</guid>
      <description>&lt;p&gt;&lt;span&gt;Photo by &lt;a href="https://unsplash.com/@quinten149?utm_source=unsplash&amp;amp;utm_medium=referral&amp;amp;utm_content=creditCopyText" rel="noopener noreferrer"&gt;Quinten de Graaf&lt;/a&gt; on &lt;a href="https://unsplash.com/s/photos/pipeline?utm_source=unsplash&amp;amp;utm_medium=referral&amp;amp;utm_content=creditCopyText" rel="noopener noreferrer"&gt;Unsplash&lt;/a&gt;&lt;/span&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://firebase.google.com/docs/remote-config" rel="noopener noreferrer"&gt;Firebase Remote Config&lt;/a&gt; is a service that lets you change the data that your web/mobile application uses without the need for a new deployment. This service can also help with &lt;a href="https://firebase.google.com/docs/ab-testing/?gclid=Cj0KCQiA0rSABhDlARIsAJtjfCfqrNh6ZcbKBEC1XW2Ff43MBQd4GcB8xkPRhVHfiHK6nAPRa18Hph0aAkb4EALw_wcB" rel="noopener noreferrer"&gt;A/B testing&lt;/a&gt; which is another really useful service provided by firebase.&lt;/p&gt;

&lt;p&gt;The Firebase Remote Config console allows you to edit and publish variables through an intuitive UI. It also allows you to update these variables through a REST API. I wanted to create a solution with the following properties: &lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Other developers are able to push updates to Remote Config through a git-based CI/CD system&lt;/li&gt;
&lt;li&gt;Avoid having to manage users through Google Cloud IAM&lt;/li&gt;
&lt;li&gt;Update a Google Sheet every time new updates are pushed to Firebase Remote Config &lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;I've documented the steps you may need to follow to set up a similar CI/CD pipeline for your Remote Config workflow. As of this writing, the Firebase CLI does not support the deployment of remote-config values. The same architecture can be extended to any other Firebase service that does not have native CLI support. &lt;/p&gt;

&lt;h2&gt;
  
  
  Setup
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Setting up the admin account credentials
&lt;/h3&gt;

&lt;p&gt;In order to set up a continuous deployment pipeline, you'll need a service account key from the Firebase console. Here's how to obtain it: navigate to Settings -&amp;gt; Project Settings -&amp;gt; Service accounts and select "Generate new private key" to generate a private key for the next steps. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Note: This key enables access to your Firebase project. Do not share this key or add it directly to your version control&lt;/strong&gt;. We'll look at a way to use this key securely from our CD pipeline in the next steps.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fkira2rjitd5ohur8j1x0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fkira2rjitd5ohur8j1x0.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Setting up your repository
&lt;/h3&gt;

&lt;p&gt;I've added the starter code in this &lt;a href="https://gist.github.com/anshajk/2ac022116732d7e2c0b3f9a66a0841c4" rel="noopener noreferrer"&gt;github gist&lt;/a&gt;. The repository can be set up with a virtual environment and the dependencies can be installed with&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;pip install -r requirements.txt
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Let's try a local deployment first
&lt;/h3&gt;

&lt;p&gt;The code uses two environment variables for programmatically interacting with firebase. They are&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;PROJECT_ID&lt;/li&gt;
&lt;li&gt;CREDENTIALS&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;The PROJECT_ID points to your firebase project. The same can be exported with&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;export PROJECT_ID=&amp;lt;your-firebase-project-id&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The CREDENTIALS variable needs to contain the information that's present in the json file containing your service account details. The same can be exported to your shell using&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;export CREDENTIALS=$(cat service-account.json) 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;In order to test whether the service account credentials have been properly exported, you can print the CREDENTIALS variable with &lt;code&gt;echo $CREDENTIALS&lt;/code&gt; and confirm that the contents of the service account json file are indeed present in the variable. &lt;/p&gt;
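Eyeballing the echoed value works, but a malformed CREDENTIALS string otherwise only surfaces as a cryptic API error later. The sketch below adds an early sanity check; the load_credentials helper and the set of required fields are my own illustration (based on the standard Google service-account JSON layout), not part of the gist's code.

```python
import json

# Fields a Google service-account JSON file normally contains; this set is an
# assumption for illustration, not taken from the gist.
REQUIRED_KEYS = {"type", "project_id", "private_key", "client_email"}


def load_credentials(raw: str) -> dict:
    """Parse the exported CREDENTIALS string and fail early if it is
    truncated, not valid JSON, or missing expected fields."""
    info = json.loads(raw)
    missing = REQUIRED_KEYS - info.keys()
    if missing:
        raise ValueError(f"CREDENTIALS is missing fields: {sorted(missing)}")
    return info


# In the real pipeline the value would come from os.environ["CREDENTIALS"];
# a dummy payload stands in here.
sample = json.dumps({
    "type": "service_account",
    "project_id": "my-project",
    "private_key": "-----BEGIN PRIVATE KEY-----",
    "client_email": "deployer@my-project.iam.gserviceaccount.com",
})
print(load_credentials(sample)["project_id"])  # my-project
```

Failing fast like this in the deployment script keeps a bad secret from producing a confusing authentication error deep inside the REST call.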

&lt;p&gt;With our environment variables set up, we can now test whether we're able to make authenticated API calls to Firebase Remote Config. From your shell, running the following command should work without any errors&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;python remote_config_manager.py --action=versions
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You now have the ability to interact with the Firebase Remote Config REST API. &lt;/p&gt;
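Under the hood, the --action=versions call presumably boils down to an authenticated GET against the Remote Config REST API's listVersions route. A minimal sketch of building that request (the helper name is mine; only the endpoint path follows the documented v1 API):

```python
def list_versions_request(project_id: str, access_token: str):
    """Build the URL and auth header for Remote Config's listVersions route."""
    url = ("https://firebaseremoteconfig.googleapis.com/v1/"
           f"projects/{project_id}/remoteConfig:listVersions")
    headers = {"Authorization": f"Bearer {access_token}"}
    return url, headers


url, headers = list_versions_request("my-project", "ya29.dummy-token")
print(url)
```

With the returned url and headers, any HTTP client (the requests library, for example) can perform the GET and list the template's version history.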

&lt;h3&gt;
  
  
  Creating your remote config variables
&lt;/h3&gt;

&lt;p&gt;The remote config variables can be added in &lt;code&gt;remoteconfig.template.json&lt;/code&gt; or any other file that you reference in the &lt;code&gt;_publish&lt;/code&gt; method of the code. The key thing to note here is that this file is the source of truth for all the variables you store and update in Remote Config. This is also where logic can be added to push the most recent values to Google Sheets or any other system for better accessibility/visibility. &lt;/p&gt;
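&lt;p&gt;For illustration, a minimal &lt;code&gt;remoteconfig.template.json&lt;/code&gt; might look like this. The parameter name and description below are hypothetical; the &lt;code&gt;parameters&lt;/code&gt;/&lt;code&gt;defaultValue&lt;/code&gt; layout follows the Remote Config template format.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
  "parameters": {
    "welcome_message": {
      "defaultValue": { "value": "Hello!" },
      "description": "Copy shown on the home screen"
    }
  }
}
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;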

&lt;h3&gt;
  
  
  Setting up the continuous deployment pipeline
&lt;/h3&gt;

&lt;p&gt;In order to set up a CD pipeline on GitHub, we'll store our PROJECT_ID and CREDENTIALS as repository secrets. For this, we'll navigate to settings in our GitHub repository and add the two values in the secrets section. This is what the end result should look like.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fomg1jmg0cmoiuvp36o02.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2Fomg1jmg0cmoiuvp36o02.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;After setting up the secrets for your pipeline, all that's left is to set up our workflow using the &lt;a href="https://gist.github.com/anshajk/2ac022116732d7e2c0b3f9a66a0841c4#file-cd-yaml" rel="noopener noreferrer"&gt;cd.yaml&lt;/a&gt; file. This file needs to be present at &lt;code&gt;.github/workflows/&lt;/code&gt; in your repository for GitHub to trigger it as a GitHub action.&lt;/p&gt;

&lt;p&gt;After this, the following will happen - &lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Every commit will trigger the build step. This is where additional steps like JSON validation can be added.&lt;/li&gt;
&lt;li&gt;Every commit to the main branch (from a PR or otherwise) will result in the values present in &lt;code&gt;remoteconfig.template.json&lt;/code&gt; being deployed directly to your project's remote config. You may wish to set up notifications informing you of updates. You may also want to disable direct pushes to the main branch to avoid accidents.&lt;/li&gt;
&lt;/ol&gt;
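&lt;p&gt;As a rough sketch, the workflow file might look something like the following. This is an illustration only: the job name, the action versions, and the &lt;code&gt;--action=publish&lt;/code&gt; flag are assumptions, and the linked cd.yaml gist remains the source of truth.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;name: deploy-remote-config
on:
  push:
    branches: [main]

jobs:
  deploy:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v2
      - uses: actions/setup-python@v2
      - run: pip install -r requirements.txt
      - run: python remote_config_manager.py --action=publish
        env:
          PROJECT_ID: ${{ secrets.PROJECT_ID }}
          CREDENTIALS: ${{ secrets.CREDENTIALS }}
&lt;/code&gt;&lt;/pre&gt;
&lt;/div&gt;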

&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;p&gt;In this post, I've summarised how I created a CD pipeline for Firebase Remote Config to save time and to enable git-based version control for my remote config values. I found &lt;a href="https://github.com/features/actions" rel="noopener noreferrer"&gt;GitHub Actions&lt;/a&gt; really intuitive, and the build &amp;amp; deploy process was very fast, especially with dependency caching. Maybe I'll explore GitHub Actions for more use cases in my future projects.  &lt;/p&gt;

</description>
      <category>firebase</category>
      <category>github</category>
      <category>python</category>
      <category>tutorial</category>
    </item>
    <item>
      <title>Docker for data science and engineering</title>
      <dc:creator>Anshaj Khare</dc:creator>
      <pubDate>Thu, 13 Aug 2020 18:08:17 +0000</pubDate>
      <link>https://dev.to/anshaj/docker-for-data-science-and-engineering-5567</link>
      <guid>https://dev.to/anshaj/docker-for-data-science-and-engineering-5567</guid>
      <description>&lt;p&gt;Working as a data scientist, I had only heard of containerization. This was the case until I ended up working on a project that required shipping model training and deployment in containers on one of the cloud platforms. That's how I came across docker and it has been an integral part of my toolbox ever since. &lt;/p&gt;

&lt;p&gt;I've listed some excellent resources for getting started with Docker below: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://dev.to/softchris/5-part-docker-series-beginner-to-master-3m1b"&gt;5 part docker series from beginner to master&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://dev.to/akanksha_9560/docker-for-frontend-developers-1dk5"&gt;Docker for front-end developers&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In this series, I'll cover Docker from a data science and data engineering perspective and show why it's a very handy asset in your data science or software development toolkit. &lt;/p&gt;

&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;Before we learn what Docker is, an important question to answer is: why do we need Docker? &lt;/p&gt;

&lt;p&gt;Docker lets us create a completely reproducible environment. We specify the libraries that we need along with their specific versions, and Docker creates an environment that will run the same way irrespective of the system it's run on. This can save a lot of time in projects with multiple environments, developers, testers, etc.&lt;/p&gt;

&lt;h3&gt;
  
  
  What are containers?
&lt;/h3&gt;

&lt;p&gt;I think the official Docker documentation does a very good job of explaining what containers are: &lt;/p&gt;

&lt;p&gt;A container is a standard unit of software that packages up code and all its dependencies so the application runs quickly and reliably from one computing environment to another. A Docker container image is a lightweight, standalone, executable package of software that includes everything needed to run an application: code, runtime, system tools, system libraries and settings.&lt;br&gt;
Container images become containers at runtime and in the case of Docker containers - images become containers when they run on &lt;a href="https://www.docker.com/products/container-runtime"&gt;Docker Engine&lt;/a&gt;. Available for both Linux and Windows-based applications, containerized software will always run the same, regardless of the infrastructure. Containers isolate software from its environment and ensure that it works uniformly despite differences for instance between development and staging.&lt;/p&gt;
&lt;h2&gt;
  
  
  How can docker help with data science?
&lt;/h2&gt;

&lt;p&gt;Let's talk about an end-to-end data science solution. A data science project often involves a whole team of data scientists, data engineers, software architects, often working along with other software development teams to create a viable solution. You can have a situation where different data scientists end up working with different versions of the library only to realize after hours of debugging that there are small differences in their environments. Docker lets you create a consistent environment for data scientists and data engineers to deal with these kinds of situations. &lt;/p&gt;

&lt;p&gt;Okay, so we now know how Docker helps in maintaining a consistent environment, but what if you work alone? Should you invest time learning and implementing Docker for small-scale projects? The answer is yes, and I'll explain why.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Ease of model building - Dockerizing your data science project can help set things up faster. I'll demonstrate this with an example below.&lt;/li&gt;
&lt;li&gt;Deployment - You've created a state-of-the-art model with data science libraries and now you wish to build a solution out of it. Deploying a data science solution is much easier with Docker in place. I'll demonstrate an example of this in later posts in this series. &lt;/li&gt;
&lt;/ol&gt;
&lt;h2&gt;
  
  
  Let's look at some examples
&lt;/h2&gt;

&lt;p&gt;Let's use Docker to set up a data science environment for ourselves. The image I'll use is a Jupyter notebook image that gives us Python, R, and Julia with just a few commands. The image is developed and maintained by the Jupyter team and can be found &lt;a href="https://jupyter-docker-stacks.readthedocs.io/en/latest/using/selecting.html"&gt;here&lt;/a&gt;. Run the following commands from a shell on your system.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;docker pull jupyter/datascience-notebook:latest
docker run -p 8888:8888 jupyter/datascience-notebook:latest
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;Navigate to &lt;code&gt;localhost:8888&lt;/code&gt; in your browser, copy the auth token from the shell, and there you have it: a data science environment to run R, Python, and Julia code.&lt;br&gt;&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--BYeRjx-G--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/m8ujhqqe9gwqfvov2iwj.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--BYeRjx-G--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/m8ujhqqe9gwqfvov2iwj.png" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Let's look at a slightly more technical use case: we'll implement a server with FastAPI. For more details, please refer to the &lt;a href="https://fastapi.tiangolo.com/"&gt;FastAPI documentation&lt;/a&gt;.&lt;br&gt;
Let's see how easy it is to set up a FastAPI server on your local machine. Run the following commands from a shell, in a new folder&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;python -m venv venv
source venv/bin/activate
pip install fastapi
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;Now, create a &lt;code&gt;Dockerfile&lt;/code&gt; with the following code&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;FROM tiangolo/uvicorn-gunicorn-fastapi:python3.7

COPY ./app /app
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;Now, create a folder named &lt;code&gt;app&lt;/code&gt; and create a script inside with the name &lt;code&gt;main.py&lt;/code&gt;. Add the following code to &lt;code&gt;main.py&lt;/code&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from fastapi import FastAPI

app = FastAPI()


@app.get("/")
def read_root():
    return {"Hello": "World"}


@app.get("/items/{item_id}")
def read_item(item_id: int, q: str = None):
    return {"item_id": item_id, "q": q}
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;Now, go to your shell and run the following from the project root&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;docker build -t myimage .
docker run -d --name mycontainer -p 80:80 myimage
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;Navigate to &lt;code&gt;http://127.0.0.1/docs&lt;/code&gt; to see the &lt;a href="https://swagger.io/tools/swagger-ui/"&gt;Swagger UI&lt;/a&gt; docs for the API&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--PyzC6Wr8--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/xf5mw7cn9xi5f63li7bq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--PyzC6Wr8--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/xf5mw7cn9xi5f63li7bq.png" alt="swagger docs"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Navigate to &lt;code&gt;http://127.0.0.1/redoc&lt;/code&gt; for &lt;a href="https://redocly.github.io/redoc/"&gt;redoc&lt;/a&gt; based docs.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--SL6Qs59e--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/gpdd19qrdax5tft8b1c8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--SL6Qs59e--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/gpdd19qrdax5tft8b1c8.png" alt="reDocs"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In a later post, I'll demonstrate how to serve your machine learning model with &lt;a href="https://fastapi.tiangolo.com/"&gt;FastAPI&lt;/a&gt;, but for now, we have a FastAPI server running in a Docker container, ready to be shipped anywhere.&lt;/p&gt;

&lt;p&gt;In this post, I demonstrated how Docker can be used in data science projects. In my next post, I'll go into more detail on Docker terms and commands that can speed up your development process. Thanks for reading! &lt;/p&gt;

</description>
      <category>docker</category>
      <category>datascience</category>
      <category>tutorial</category>
      <category>beginners</category>
    </item>
    <item>
      <title>How I built a zero cost serverless scraper </title>
      <dc:creator>Anshaj Khare</dc:creator>
      <pubDate>Mon, 13 Jul 2020 18:12:28 +0000</pubDate>
      <link>https://dev.to/anshaj/how-i-built-a-zero-cost-completely-serverless-scraper-20io</link>
      <guid>https://dev.to/anshaj/how-i-built-a-zero-cost-completely-serverless-scraper-20io</guid>
      <description>&lt;p&gt;I started working on a data science project with a friend, where we wanted to understand the sentiments of people about a few topics over a period of time. Naturally, the first part of building a project like this involves scraping data from the web. Since the time period for this project spans over months, we needed a way to automatically scrape data at regular intervals without any intervention. &lt;br&gt;
If you want to skip the details and need to see the code, scroll to the bottom of the page. &lt;/p&gt;

&lt;h2&gt;
  
  
  Schedulers
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Cron
&lt;/h3&gt;

&lt;p&gt;The first thing that comes to mind when talking about schedulers is cron. The trusty, reliable, all-purpose cron has been with us for a very long time, and a lot of projects with simpler scheduling requirements still use cron in production systems. But cron requires a Linux server to run. We could've set up a serverless container or used a small Linux VM for our purpose, but we didn't, because of the alternative that we found.&lt;/p&gt;

&lt;h3&gt;
  
  
  Airflow
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://airflow.apache.org/"&gt;Apache Airflow&lt;/a&gt; deserves a post of its own and I'm planning to write about it in the future. Simply put, Airflow is an open-source workflow orchestration tool that helps data engineers make sense of production data pipelines. In today's world, where data fuels a lot of applications, the reliability and maintainability of airflow pipelines leave little to be desired. Airflow executes tasks in DAGs (Directed Acyclic Graphs) where each node is a task and these tasks execute in the order that we specify. It's possible to run airflow in a distributed setting to scale it up for production requirements. But airflow is too much for small projects and it ends up adding overhead to a fast-paced light development process.&lt;/p&gt;

&lt;h3&gt;
  
  
  Cloud Scheduler
&lt;/h3&gt;

&lt;p&gt;This is one of the lesser-known features of Google Cloud and is the service that we used in our project. &lt;a href="https://cloud.google.com/scheduler"&gt;Cloud Scheduler&lt;/a&gt; is a serverless cron offering from Google Cloud. In the free tier, Google lets you schedule 3 jobs for free. Note that an execution instance is different from a job, which means that you can schedule 3 processes on any cron schedule. We wanted our process to run every hour, and hence the trusty schedule of &lt;code&gt;0 * * * *&lt;/code&gt; (more info &lt;a href="https://crontab.guru/"&gt;here&lt;/a&gt;) was set up on Cloud Scheduler.&lt;/p&gt;

&lt;h2&gt;
  
  
  Scraping
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Let's scrape Twitter
&lt;/h3&gt;

&lt;p&gt;There is an excellent Python package for scraping Twitter that can be found &lt;a href="https://github.com/bisguzar/twitter-scraper"&gt;here&lt;/a&gt;. I used it to extract tweets for the hashtags we wanted to collect data for.&lt;/p&gt;

&lt;h3&gt;
  
  
  Google Cloud Functions
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://cloud.google.com/functions"&gt;Google Cloud Functions&lt;/a&gt; is Google's serverless compute offering. A variety of tasks can be performed with these serverless functions, but in this post I'll only cover how I used them to scrape data from Twitter.&lt;/p&gt;

&lt;p&gt;This is what happens inside the cloud function:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;The keyword list is downloaded from a GCS bucket&lt;/li&gt;
&lt;li&gt;The keyword list is used to scrape tweets from Twitter&lt;/li&gt;
&lt;li&gt;The tweets are stored back in GCS as a CSV file&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;I made my script compatible with the requirements of Cloud Functions, set the trigger to an HTTP call, and added some code to store the data in a Google Cloud Storage bucket. That was it! The serverless function was ready to do some web scraping.&lt;/p&gt;
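
&lt;p&gt;The function body looked roughly like the sketch below. The bucket and object names are hypothetical, and the CSV helper is kept pure so it can be exercised without GCP credentials:&lt;/p&gt;

```python
import csv
import io


def tweets_to_csv(tweets):
    """Serialize a list of tweet dicts into a CSV string."""
    buf = io.StringIO()
    writer = csv.DictWriter(buf, fieldnames=["text", "likes", "retweets"])
    writer.writeheader()
    for tweet in tweets:
        writer.writerow(tweet)
    return buf.getvalue()


def scrape(request):
    """HTTP-triggered Cloud Function entry point (sketch)."""
    # Imported lazily so the pure helper above works without GCP installed
    from google.cloud import storage
    from twitter_scraper import get_tweets

    client = storage.Client()
    bucket = client.bucket("my-scraper-bucket")  # hypothetical bucket name

    # 1. Download the keyword list from a GCS bucket
    keywords = bucket.blob("keywords.txt").download_as_text().splitlines()

    # 2. Scrape tweets from Twitter for each keyword
    rows = []
    for word in keywords:
        for tweet in get_tweets(word, pages=5):
            rows.append({"text": tweet["text"],
                         "likes": tweet["likes"],
                         "retweets": tweet["retweets"]})

    # 3. Store the tweets back in GCS as a CSV file
    bucket.blob("tweets.csv").upload_from_string(tweets_to_csv(rows))
    return "ok"
```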

&lt;h2&gt;
  
  
  Connecting everything
&lt;/h2&gt;

&lt;p&gt;Now that I had a serverless scheduler and a cloud function, all I had to do was connect the two. This was fairly simple: Cloud Scheduler can send a POST request at whatever interval we want. All we needed to do was give it the right permissions to invoke the cloud function (cloud functions should not be publicly invokable). And there we had it: a completely serverless scraper that scrapes the web on a regular schedule and lets us build excellent datasets for our use case at zero expense.&lt;/p&gt;
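
&lt;p&gt;For reference, the wiring can be sketched from the gcloud CLI. The job, region, project, function, and service-account names below are all hypothetical, and the service account needs the Cloud Functions Invoker role:&lt;/p&gt;

```shell
# Create an hourly Cloud Scheduler job that invokes the function over HTTP,
# authenticating with an OIDC token from a dedicated service account
gcloud scheduler jobs create http scraper-hourly \
  --schedule="0 * * * *" \
  --http-method=POST \
  --uri="https://us-central1-my-project.cloudfunctions.net/scrape" \
  --oidc-service-account-email="scheduler@my-project.iam.gserviceaccount.com"
```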

&lt;p&gt;Here is the &lt;a href="https://gist.github.com/fulcrum3/de395ec643693ed4b0fb825668113287"&gt;gist&lt;/a&gt; of the source code. If you found this post useful, consider following me here as I plan to write more about data engineering and data science in the upcoming weeks. &lt;/p&gt;

</description>
      <category>python</category>
      <category>googlecloud</category>
      <category>datascience</category>
      <category>serverless</category>
    </item>
    <item>
      <title>Scraping Twitter data using python for NLP tasks</title>
      <dc:creator>Anshaj Khare</dc:creator>
      <pubDate>Tue, 14 Apr 2020 10:13:53 +0000</pubDate>
      <link>https://dev.to/anshaj/scraping-twitter-data-using-python-for-nlp-tasks-2je7</link>
      <guid>https://dev.to/anshaj/scraping-twitter-data-using-python-for-nlp-tasks-2je7</guid>
      <description>&lt;blockquote&gt;
&lt;p&gt;"In God we trust. All others must bring data." - W. Edwards Deming&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;If you're starting out in the incredible field of NLP, you'll want to get your hands dirty with real textual data that you can use to play around with the concepts you've learned. Twitter is an excellent source of such data. In this post, I'll present a scraper that you can use to scrape tweets on the topics you're interested in and get all nerdy once you've obtained your dataset. &lt;br&gt;
&lt;br&gt;I've used this amazing library that you can find &lt;a href="http://github.com/bisguzar/twitter-scraper"&gt;here&lt;/a&gt;. I'll go over how to install and use it, and also suggest some ways to make the entire process faster using parallelization. The complete notebook containing the code can be found &lt;a href="https://jovian.ml/fulcrum3/twitter-scraper"&gt;here&lt;/a&gt;.&lt;/p&gt;
&lt;h2&gt;
  
  
  Installation
&lt;/h2&gt;

&lt;p&gt;The library can be installed with pip3 using the following command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight shell"&gt;&lt;code&gt;pip3 &lt;span class="nb"&gt;install &lt;/span&gt;twitter_scraper
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;h2&gt;
  
  
  Creating a list of keywords
&lt;/h2&gt;

&lt;p&gt;The next task is to create a list of keywords that you want to use for scraping Twitter:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# List of hashtags that we're interested in
&lt;/span&gt;&lt;span class="n"&gt;keywords&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;'machinelearning'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;'ML'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;'deeplearning'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
            &lt;span class="s"&gt;'#artificialintelligence'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;'#NLP'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;'computervision'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;'AI'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
            &lt;span class="s"&gt;'tensorflow'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;'pytorch'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"sklearn"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"pandas"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"plotly"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
            &lt;span class="s"&gt;"spacy"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;"fastai"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;'datascience'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;'dataanalysis'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;h2&gt;
  
  
  Scraping tweets for one keyword
&lt;/h2&gt;

&lt;p&gt;Before we run our program for all the keywords, we'll run it with a single keyword and print out the fields that we can extract from the returned object. The code below shows how to iterate over the returned tweets and pull out the fields you want. We extract the following fields:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Tweet ID&lt;/li&gt;
&lt;li&gt;Whether it's a retweet&lt;/li&gt;
&lt;li&gt;Time of the tweet&lt;/li&gt;
&lt;li&gt;Text of the tweet&lt;/li&gt;
&lt;li&gt;Replies to the tweet&lt;/li&gt;
&lt;li&gt;Total retweets&lt;/li&gt;
&lt;li&gt;Likes on the tweet&lt;/li&gt;
&lt;li&gt;Entries in the tweet&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# Lets run one iteration to understand how to implement this library
&lt;/span&gt;&lt;span class="n"&gt;tweets&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;get_tweets&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;"#machinelearning"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;pages&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;5&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;tweets_df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="c1"&gt;# Lets print the keys and values obtained
&lt;/span&gt;&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;tweet&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;tweets&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
  &lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="s"&gt;'Keys:'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nb"&gt;list&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;tweet&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;keys&lt;/span&gt;&lt;span class="p"&gt;()),&lt;/span&gt; &lt;span class="s"&gt;'&lt;/span&gt;&lt;span class="se"&gt;\n&lt;/span&gt;&lt;span class="s"&gt;'&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="k"&gt;break&lt;/span&gt;

&lt;span class="c1"&gt;# Running the code for one keyword and extracting the relevant data
&lt;/span&gt;&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;tweet&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;tweets&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
  &lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="s"&gt;'text'&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;tweet&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;'text'&lt;/span&gt;&lt;span class="p"&gt;]],&lt;/span&gt;
                    &lt;span class="s"&gt;'isRetweet'&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;tweet&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;'isRetweet'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
                    &lt;span class="s"&gt;'replies'&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;tweet&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;'replies'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
                    &lt;span class="s"&gt;'retweets'&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;tweet&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;'retweets'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
                    &lt;span class="s"&gt;'likes'&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;tweet&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;'likes'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
                    &lt;span class="p"&gt;})&lt;/span&gt;
  &lt;span class="n"&gt;tweets_df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tweets_df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;_&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ignore_index&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;tweets_df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;head&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;h2&gt;
  
  
  Running the code sequentially for all keywords
&lt;/h2&gt;

&lt;p&gt;Now that we've decided what data we want to keep from each tweet, we'll run our program sequentially to obtain tweets on all the topics we're interested in. We'll do this with a familiar for loop, going over each keyword one by one and storing the successful results.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# We'll measure the time it takes to complete this process sequentially
&lt;/span&gt;&lt;span class="o"&gt;%%&lt;/span&gt;&lt;span class="n"&gt;time&lt;/span&gt;
&lt;span class="n"&gt;all_tweets_df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
&lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;word&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;tqdm&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;keywords&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
  &lt;span class="n"&gt;tweets&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;get_tweets&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;word&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;pages&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;tweet&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;tweets&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;    
      &lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="s"&gt;'hashtag'&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;word&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
                        &lt;span class="s"&gt;'text'&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;tweet&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;'text'&lt;/span&gt;&lt;span class="p"&gt;]],&lt;/span&gt;
                        &lt;span class="s"&gt;'isRetweet'&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;tweet&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;'isRetweet'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
                        &lt;span class="s"&gt;'replies'&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;tweet&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;'replies'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
                        &lt;span class="s"&gt;'retweets'&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;tweet&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;'retweets'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
                        &lt;span class="s"&gt;'likes'&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;tweet&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;'likes'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
                      &lt;span class="p"&gt;})&lt;/span&gt;
      &lt;span class="n"&gt;all_tweets_df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;all_tweets_df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;_&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ignore_index&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;Exception&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; 
    &lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;word&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;':'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
    &lt;span class="k"&gt;continue&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--ax4dTIbf--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/x6x8x6mv0kbfj6m4dmm8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--ax4dTIbf--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/x6x8x6mv0kbfj6m4dmm8.png" alt="markdown table" width="500"&gt;&lt;/a&gt;&lt;br&gt;
&lt;br&gt;&lt;/p&gt;
&lt;h2&gt;
  
  
  Running the code in parallel
&lt;/h2&gt;

&lt;p&gt;From the &lt;a href="https://docs.python.org/3/library/multiprocessing.html"&gt;documentation&lt;/a&gt;: multiprocessing is a package that supports spawning processes using an API similar to the threading module. It offers both local and remote concurrency, effectively side-stepping the Global Interpreter Lock by using subprocesses instead of threads, which allows the programmer to fully leverage multiple processors on a given machine.&lt;/p&gt;

&lt;p&gt;First, we'll implement a function to scrape the data.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# We'll create a function to fetch the tweets and store it for us
&lt;/span&gt;&lt;span class="k"&gt;def&lt;/span&gt; &lt;span class="nf"&gt;fetch_tweets&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;word&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
  &lt;span class="n"&gt;tweet_df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;
  &lt;span class="n"&gt;tweets&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;get_tweets&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;word&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;pages&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="mi"&gt;100&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="k"&gt;try&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="k"&gt;for&lt;/span&gt; &lt;span class="n"&gt;tweet&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="n"&gt;tweets&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;    
      &lt;span class="n"&gt;_&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;pd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;DataFrame&lt;/span&gt;&lt;span class="p"&gt;({&lt;/span&gt;&lt;span class="s"&gt;'hashtag'&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;word&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; 
                        &lt;span class="s"&gt;'text'&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;tweet&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;'text'&lt;/span&gt;&lt;span class="p"&gt;]],&lt;/span&gt;
                        &lt;span class="s"&gt;'isRetweet'&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;tweet&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;'isRetweet'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
                        &lt;span class="s"&gt;'replies'&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;tweet&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;'replies'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
                        &lt;span class="s"&gt;'retweets'&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;tweet&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;'retweets'&lt;/span&gt;&lt;span class="p"&gt;],&lt;/span&gt;
                        &lt;span class="s"&gt;'likes'&lt;/span&gt; &lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="n"&gt;tweet&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="s"&gt;'likes'&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
                      &lt;span class="p"&gt;})&lt;/span&gt;
      &lt;span class="n"&gt;tweet_df&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;tweet_df&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;append&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;_&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;ignore_index&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="bp"&gt;True&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="k"&gt;except&lt;/span&gt; &lt;span class="nb"&gt;Exception&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; 
    &lt;span class="k"&gt;print&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;word&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="s"&gt;':'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;e&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="k"&gt;return&lt;/span&gt; &lt;span class="n"&gt;tweet_df&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;Next, we'll create subprocesses to run our code in parallel.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="c1"&gt;# We'll run this in parallel with 4 subprocesses to compare the times
&lt;/span&gt;&lt;span class="o"&gt;%%&lt;/span&gt;&lt;span class="n"&gt;time&lt;/span&gt;
&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;Pool&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;4&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;
    &lt;span class="n"&gt;records&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;p&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nb"&gt;map&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;fetch_tweets&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;keywords&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--gEDctCVK--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/0tyy2egclkdwdqkgi5dn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--gEDctCVK--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/0tyy2egclkdwdqkgi5dn.png" alt="markdown table" width="500"&gt;&lt;/a&gt;&lt;br&gt;
&lt;br&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;As you can see, we reduced the processing time to roughly a quarter of the sequential execution time. You can use this method for similar tasks to make your Python code much faster. &lt;br&gt;
Good luck with the scraping!&lt;/p&gt;

</description>
      <category>python</category>
      <category>machinelearning</category>
      <category>datascience</category>
      <category>nlp</category>
    </item>
    <item>
      <title>What's new in pandas 1.0?</title>
      <dc:creator>Anshaj Khare</dc:creator>
      <pubDate>Wed, 29 Jan 2020 15:09:07 +0000</pubDate>
      <link>https://dev.to/anshaj/what-s-new-in-pandas-1-0-215l</link>
      <guid>https://dev.to/anshaj/what-s-new-in-pandas-1-0-215l</guid>
      <description>&lt;p&gt;Pandas 1.0.0 has been released. In this post, I have compiled the list of important changes that have been made.&lt;/p&gt;

&lt;h2&gt;
  
  
  Faster apply() with Numba
&lt;/h2&gt;

&lt;p&gt;Rolling and expanding &lt;code&gt;apply()&lt;/code&gt; now support an &lt;code&gt;engine&lt;/code&gt; keyword that allows the user to execute the routine using Numba instead of Cython. For inputs with more than a million rows, the Numba engine can yield a significant speed-up.&lt;/p&gt;
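
&lt;p&gt;As a quick illustration of the &lt;code&gt;engine&lt;/code&gt; keyword on a rolling apply (a sketch: the Numba engine needs the optional &lt;code&gt;numba&lt;/code&gt; package, so the line below uses the default engine and just notes where to opt in):&lt;/p&gt;

```python
import numpy as np
import pandas as pd

s = pd.Series(np.arange(10, dtype="float64"))

# raw=True passes plain NumPy arrays to the function, which the Numba engine requires.
# With numba installed, add engine="numba" to the call below for the speed-up.
result = s.rolling(3).apply(np.mean, raw=True)
print(result.iloc[-1])  # mean of 7.0, 8.0, 9.0 -> 8.0
```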

&lt;h2&gt;
  
  
  Dedicated string data type
&lt;/h2&gt;

&lt;p&gt;The string data type is now separate from the object data type. It's still experimental and probably shouldn't be used in production code, but it's nice to see a dedicated string type in pandas. In cases where you need to distinguish strings from other objects in your data, this change will come in handy.&lt;/p&gt;
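
&lt;p&gt;A minimal example of the new dtype:&lt;/p&gt;

```python
import pandas as pd

# Explicitly request the new nullable string dtype (distinct from object)
s = pd.Series(["apple", "banana", None], dtype="string")
print(s.dtype)         # string
print(s.isna().sum())  # 1 -- the None became a missing value
```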

&lt;h2&gt;
  
  
  NA singleton to denote missing values
&lt;/h2&gt;

&lt;p&gt;Previously, pandas used several values to represent missing data:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;np.nan&lt;/code&gt; for float data&lt;/li&gt;
&lt;li&gt;&lt;code&gt;np.nan&lt;/code&gt; or &lt;code&gt;None&lt;/code&gt; for object-dtype data&lt;/li&gt;
&lt;li&gt;&lt;code&gt;pd.NaT&lt;/code&gt; for datetime-like data&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;code&gt;pd.NA&lt;/code&gt; provides a single “missing” indicator that can be used consistently across data types.&lt;/p&gt;
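
&lt;p&gt;A short example of &lt;code&gt;pd.NA&lt;/code&gt; behaving consistently across the new nullable dtypes:&lt;/p&gt;

```python
import pandas as pd

# Missing values in both a nullable-integer and a string series are the same pd.NA
ints = pd.Series([1, 2, None], dtype="Int64")
strs = pd.Series(["a", None], dtype="string")

print(ints[2] is pd.NA)  # True
print(strs[1] is pd.NA)  # True
```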

&lt;h2&gt;
  
  
  Markdown table
&lt;/h2&gt;

&lt;p&gt;A DataFrame can now be printed as a Markdown table using &lt;code&gt;df.to_markdown()&lt;/code&gt; (this relies on the optional &lt;code&gt;tabulate&lt;/code&gt; package).&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--D-xAKjLy--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/ur1xyna7w3llkvmamwo4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--D-xAKjLy--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/ur1xyna7w3llkvmamwo4.png" alt="markdown table" width="300"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Better summary with DataFrame.info()
&lt;/h2&gt;

&lt;p&gt;The DataFrame summary now uses a more readable style.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--LxuhT2ad--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/61dql8jufadikte0bkll.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--LxuhT2ad--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/61dql8jufadikte0bkll.png" alt="markdown table" width="300"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You can use &lt;code&gt;pip install pandas==1.0.0&lt;/code&gt; to install pandas 1.0 into your Python environment.&lt;/p&gt;

</description>
      <category>python</category>
      <category>machinelearning</category>
      <category>datascience</category>
    </item>
  </channel>
</rss>
