<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Dipankar Medhi</title>
    <description>The latest articles on DEV Community by Dipankar Medhi (@dipankarmedhi).</description>
    <link>https://dev.to/dipankarmedhi</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F839089%2F4dac16a5-40ee-4b55-8edf-c54e2e592557.jpg</url>
      <title>DEV Community: Dipankar Medhi</title>
      <link>https://dev.to/dipankarmedhi</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/dipankarmedhi"/>
    <language>en</language>
    <item>
      <title>Recognize dates from documents using Sliding Window Algorithm &amp; Python OCR.</title>
      <dc:creator>Dipankar Medhi</dc:creator>
      <pubDate>Sat, 07 Jan 2023 17:23:44 +0000</pubDate>
      <link>https://dev.to/dipankarmedhi/recognize-dates-from-documents-using-sliding-window-algorithm-python-ocr-2oae</link>
      <guid>https://dev.to/dipankarmedhi/recognize-dates-from-documents-using-sliding-window-algorithm-python-ocr-2oae</guid>
      <description>&lt;p&gt;Hey there 👋,&lt;/p&gt;

&lt;p&gt;Today, let's solve a text processing problem that asks us to find any date present in text extracted from an image.&lt;/p&gt;

&lt;p&gt;We are using &lt;strong&gt;easyocr&lt;/strong&gt;, a Python OCR library, to find the text in images. Let's move on to the code.&lt;/p&gt;

&lt;h1&gt;
  
  
  Extracting text from images | Setting up easyocr
&lt;/h1&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;We start by creating a &lt;code&gt;data-extraction.py&lt;/code&gt; module.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Create a &lt;strong&gt;DataExtraction&lt;/strong&gt; class and initiate the &lt;strong&gt;easyocr&lt;/strong&gt; model.&lt;br&gt;
&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;
from datetime import datetime
import easyocr
import re


class DataExtraction:
    def __init__(self) -&amp;gt; None:
        # map month abbreviations to two-digit month numbers
        self.months = {
            "JAN": "01",
            "FEB": "02",
            "MAR": "03",
            "APR": "04",
            "MAY": "05",
            "JUN": "06",
            "JUL": "07",
            "AUG": "08",
            "SEP": "09",
            "OCT": "10",
            "NOV": "11",
            "DEC": "12",
        }
        # easyocr reader for English text
        self.reader = easyocr.Reader(["en"])

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h1&gt;
  
  
  Converting date strings to DateTime objects
&lt;/h1&gt;

&lt;p&gt;There can be an unknown number of date formats, and parsing every one of them would take a near-infinite amount of time and work. So in this example, we'll consider only a few well-known forms.&lt;/p&gt;

&lt;p&gt;We'll try to identify &lt;strong&gt;dd mmm yyyy&lt;/strong&gt; date formats in a string.&lt;/p&gt;

&lt;p&gt;For example, if the given date is &lt;strong&gt;15 sd f may 2019&lt;/strong&gt;, then the output should be &lt;strong&gt;15052019&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;We are going to use the &lt;strong&gt;Sliding Window&lt;/strong&gt; technique to detect whether a month name is present between two groups of numerical characters.&lt;/p&gt;

&lt;p&gt;The string can include numbers, letters, and other characters. For example, consider 𝗴𝘀 𝟭𝟱 𝗺𝗮𝗶 𝗺𝗮𝘆 𝟮𝟬𝟭𝟵 𝘀𝗴𝗳 𝘀. The date should be 15th May 2019.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnyun33fllwkpeqlw3pvo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnyun33fllwkpeqlw3pvo.png" width="800" height="412"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;The first step is to implement a sliding window to convert MMM to a number, for example, &lt;em&gt;may&lt;/em&gt; to &lt;em&gt;05&lt;/em&gt;.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;We create a function that takes in a string and finds if it contains any month from the above dictionary, &lt;strong&gt;months&lt;/strong&gt;.&lt;br&gt;
&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def month_to_num(self, s: str) -&amp;gt; str:
        res = ""
        start = 0
        try:
            for end in range(len(s)):
                rightChar = s[end]
                res += rightChar
                if len(res) == 3:
                    if res.upper() in self.months.keys():
                        numeric_date = self.months[res.upper()]
                        return numeric_date
                    start += 1
                    res = res[1:]
        except Exception as e:
            pass

        return ""

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol&gt;
&lt;li&gt;Next, we create a function that takes in a string and gives us the desired format.
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def find_date_string(self, s: str) -&amp;gt; list: # s = "𝗴𝘀 𝟭𝟱 𝗺𝗮𝗶 𝗺𝗮𝘆 𝟮𝟬𝟭𝟵 𝘀𝗴𝗳 "
        s1 = " ".join(re.split(r"([a-zA-Z])([0-9]+)", s))
        s2 = " ".join(re.split(r"([0-9]+)([a-zA-Z]+)", s1))
        text = "-" + "-".join(re.split(r"[-;,.\s]\s*", s2)) + "-" # "gs-15-mai-may-2019-sgf"
        dates_type_1 = re.findall(r"-[0-9][0-9]-.*?-[0-9][0-9][0-9][0-9]-", text) # "-15-mai-may-2019"
        date_objects = []
        if len(dates_type_1) &amp;gt; 0:
            date_objs = self.get_date_object(dates_type_1)
            for date_obj in date_objs:
                date_objects.append(date_obj)
        return date_objects

def get_date_object(self, date_type_1_list: list):
    dates = []
    for date_str in date_type_1_list:
        day_str = date_str[1:3]
        month_str = date_str[3:-4]
        year_str = date_str[-5:-1]

        month_number = self.month_to_num(month_str)
        if month_number == "":
            return ""

        result_date_str = f"{day_str}-{month_number}-{year_str}"
        date_object = datetime.strptime(result_date_str, "%d-%m-%Y")
        dates.append(date_object)  

     return dates

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol&gt;
&lt;li&gt;Now we just have to pass the extracted strings into the above functions.
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def get_date_from_img(self, img_path: str):
        result = []

        # extract the texts from the img
        text_strings = self.reader.readtext(img_path, detail=0)

        # check every string for dates
        for s in text_strings:
            date_obj_list = self.find_date_string(s)
            if len(date_obj_list) &amp;gt; 0:
                result.append(date_obj_list)
       return result

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol&gt;
&lt;li&gt;That's it. We have all the DateTime objects present in a document image.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This method can be used on any kind of document, provided the date format matches the defined type. There are many kinds of date formats used throughout the world, and different countries use different formats. Parsing each one of them will require some more effort, but it is definitely achievable.&lt;/p&gt;

&lt;p&gt;Here are some of the other patterns that can be used for different date types.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;"""
1. 1 mai/may 2019
2. 1 mai/may 19
3. 12 09 2016
4. 2 09 2016
5. 12 09 16
6. 2 09 16  
"""  
dates_type_2 = re.findall(r"-[0-9][0-9]-[0-9][0-9]-[0-9][0-9][0-9][0-9]-", text)
dates_type_3 = re.findall(r"-[0-9][0-9]-[0-9][0-9]-[0-9][0-9]-", text)
dates_type_4 = re.findall(r"-[0-9][0-9]-.*?-[0-9][0-9]-", text)
dates_type_5 = re.findall(r"-[0-9]-.*?-[0-9][0-9]-", text)
dates_type_6 = re.findall(r"-[0-9]-.*?-[0-9][0-9][0-9][0-9]-", text)
dates_type_7 = re.findall(r"-[0-9]-[0-9][0-9]-[0-9][0-9]-", text)
dates_type_8 = re.findall(r"-[0-9]-[0-9][0-9]-[0-9][0-9][0-9][0-9]-", text)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
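&lt;p&gt;As an alternative to hand-written regexes, the normalized candidates can also be tried against a list of known &lt;code&gt;strptime&lt;/code&gt; formats. This is a rough sketch; the format list is illustrative, not exhaustive:&lt;/p&gt;

```python
from datetime import datetime

# Candidate formats for the "dd-mm-yyyy"-style strings built above.
# Order matters: try the stricter month-name formats before the numeric ones.
KNOWN_FORMATS = ["%d-%b-%Y", "%d-%b-%y", "%d-%m-%y", "%d-%m-%Y"]

def try_parse(date_str: str):
    """Return a datetime for the first matching format, else None."""
    for fmt in KNOWN_FORMATS:
        try:
            return datetime.strptime(date_str, fmt)
        except ValueError:
            continue
    return None

print(try_parse("15-may-2019"))  # matched by %d-%b-%Y
print(try_parse("15-05-19"))     # matched by %d-%m-%y
```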



&lt;p&gt;That's all, folks! See you soon.&lt;/p&gt;

&lt;p&gt;Happy Coding 🤟&lt;/p&gt;

</description>
      <category>python</category>
      <category>computervision</category>
      <category>data</category>
      <category>technology</category>
    </item>
    <item>
      <title>How I built a real-time Machine Learning system with Kafka, Elasticsearch, Kibana, and Docker</title>
      <dc:creator>Dipankar Medhi</dc:creator>
      <pubDate>Sun, 04 Dec 2022 04:00:14 +0000</pubDate>
      <link>https://dev.to/dipankarmedhi/how-i-built-a-real-time-machine-learning-system-with-kafka-elasticsearch-kibana-and-docker-3h50</link>
      <guid>https://dev.to/dipankarmedhi/how-i-built-a-real-time-machine-learning-system-with-kafka-elasticsearch-kibana-and-docker-3h50</guid>
      <description>&lt;p&gt;We will design and build a real-time sentiment analysis and hate detection system.&lt;/p&gt;

&lt;p&gt;This is a project that I made in the &lt;strong&gt;Turn Language into Action, Natural Language Hackathon&lt;/strong&gt; by &lt;a href="http://Expert.ai" rel="noopener noreferrer"&gt;&lt;strong&gt;Expert.ai&lt;/strong&gt;&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;I have always been interested in real-time systems and have always wondered how things work under the hood.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;HOW&lt;/strong&gt;? 🤔&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4cjanu1q324kr13r2uy5.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4cjanu1q324kr13r2uy5.gif" width="480" height="480"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;So, I found this hackathon to be a perfect opportunity for me to learn and build something new.&lt;/p&gt;

&lt;p&gt;Well then, let's &lt;strong&gt;ROLL!!!&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F59slef9zuuo5u7jirvpq.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F59slef9zuuo5u7jirvpq.gif" width="480" height="200"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  Project Architecture
&lt;/h1&gt;

&lt;p&gt;This is what the complete pipeline looks like. Don't worry, I will cover everything in detail.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq981sfr7g9zz8s9xco61.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq981sfr7g9zz8s9xco61.png" alt="Project Architecture" width="800" height="368"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;But before we move on with the tools and architecture, let me talk about our data sources.&lt;/p&gt;

&lt;p&gt;I have used the Twitter API for real-time tweets, specifically Python's &lt;a href="https://docs.tweepy.org/en/stable/" rel="noopener noreferrer"&gt;tweepy&lt;/a&gt; library for streaming tweets. In addition to that, I have used &lt;a href="https://newsapi.org/" rel="noopener noreferrer"&gt;NewsAPI&lt;/a&gt; for daily news articles.&lt;/p&gt;

&lt;p&gt;I have used &lt;strong&gt;Docker&lt;/strong&gt; to set up all the necessary tools as containers for this project.&lt;/p&gt;

&lt;p&gt;Now let's talk about each component.&lt;/p&gt;

&lt;h1&gt;
  
  
  Apache Kafka
&lt;/h1&gt;

&lt;p&gt;For ingesting the real-time data, I have used Apache Kafka.&lt;/p&gt;

&lt;p&gt;Now, what is &lt;strong&gt;Apache Kafka?&lt;/strong&gt; Well&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Apache Kafka (Kafka) is an open source, distributed streaming platform that enables (among other things) the development of real-time, event-driven applications. (IBM)&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Since I have used Python, there is a Python client, &lt;a href="https://github.com/dpkp/kafka-python" rel="noopener noreferrer"&gt;&lt;strong&gt;kafka-python&lt;/strong&gt;&lt;/a&gt;, available that makes working with Kafka relatively easy.&lt;/p&gt;

&lt;p&gt;Using the &lt;strong&gt;KafkaProducer&lt;/strong&gt;, I've sent the messages (Twitter and NewsAPI) to the &lt;strong&gt;KafkaConsumer&lt;/strong&gt; over two Kafka topics: one for the tweets and the other for the news articles.&lt;/p&gt;

&lt;p&gt;The KafkaConsumer then calls the Machine Learning service to classify the sentiment of the news media articles and detect hate in the tweets.&lt;/p&gt;

&lt;h1&gt;
  
  
  Machine Learning service
&lt;/h1&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;a href="http://Expert.ai" rel="noopener noreferrer"&gt;Expert.ai&lt;/a&gt; turns language into data so teams can make better decisions.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Since I built this project as a part of the &lt;a href="http://Expert.ai" rel="noopener noreferrer"&gt;&lt;strong&gt;Expert.ai&lt;/strong&gt;&lt;/a&gt; hackathon, I have used their API for sentiment analysis/classification and hate detection.&lt;/p&gt;

&lt;p&gt;However, you can always use your own &lt;strong&gt;Tensorflow&lt;/strong&gt; or &lt;strong&gt;PyTorch&lt;/strong&gt; model. Also, &lt;strong&gt;Huggingface&lt;/strong&gt; has some very relevant models for sentiment classifications and they are straightforward to set up. You should check them out!&lt;/p&gt;

&lt;p&gt;I am using the &lt;a href="https://docs.expert.ai/nlapi/latest/guide/sentiment-analysis/" rel="noopener noreferrer"&gt;&lt;strong&gt;Sentiment Analysis&lt;/strong&gt;&lt;/a&gt; and &lt;a href="https://docs.expert.ai/nlapi/latest/guide/detection/hate-speech/" rel="noopener noreferrer"&gt;&lt;strong&gt;Hate speech detection&lt;/strong&gt;&lt;/a&gt; APIs from &lt;a href="http://Expert.ai" rel="noopener noreferrer"&gt;Expert.ai&lt;/a&gt; NL API.&lt;/p&gt;

&lt;h1&gt;
  
  
  Elasticsearch
&lt;/h1&gt;

&lt;p&gt;Okay, we have the classified data. Now what?&lt;/p&gt;

&lt;p&gt;We have to store that data somewhere to use it for further analytics. I have used Elasticsearch to store the data and Kibana to visualize it.&lt;/p&gt;

&lt;p&gt;You might ask, why Kibana?&lt;/p&gt;

&lt;p&gt;Let me introduce you to the &lt;strong&gt;ELK stack&lt;/strong&gt;.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;ELK is the acronym for three open source projects: Elasticsearch, Logstash, and Kibana. Elasticsearch is a search and analytics engine. Logstash is a server-side data processing pipeline that ingests data from multiple sources simultaneously, transforms it, and then sends it to a stash like Elasticsearch. Kibana lets users visualize data with charts and graphs in Elasticsearch. &lt;a href="http://Elastic.co" rel="noopener noreferrer"&gt;Elastic.co&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Elasticsearch, Logstash and Kibana go hand in hand in most data engineering or data ingestion use cases. But I have omitted Logstash to keep the pipeline simple and focused on its goal.&lt;/p&gt;

&lt;p&gt;That said, you can always add Logstash and scale the pipeline further as needed.&lt;/p&gt;

&lt;p&gt;That is enough about the ELK stack. Let's jump into the Elasticsearch design.&lt;/p&gt;

&lt;h3&gt;
  
  
  Elasticsearch: The Official Distributed Search &amp;amp; Analytics Engine
&lt;/h3&gt;

&lt;p&gt;Like databases, Elasticsearch has &lt;strong&gt;"Indexes"&lt;/strong&gt;. These indexes store data defined with certain mapping types. A mapping is much like a schema in other databases.&lt;/p&gt;

&lt;p&gt;The mapping describes the fields in the JSON documents along with their data type, as well as how they should be indexed in the indexes.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyqr9utiq22e1uis9fb88.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyqr9utiq22e1uis9fb88.png" width="800" height="108"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;code&gt;Databases ~ Indexes&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;The above image will give you a better idea about Elasticsearch indexes compared to MySQL or PostgreSQL.&lt;/p&gt;
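&lt;p&gt;For example, a mapping for an index of classified tweets might look like this (field names and types are my assumptions for illustration; the real schema may differ):&lt;/p&gt;

```python
# Example Elasticsearch mapping for classified tweets, expressed as the JSON
# body you would pass when creating the index (e.g. with the official Python
# client: es.indices.create(index="tweets", mappings=tweet_mapping)).
tweet_mapping = {
    "properties": {
        "username":   {"type": "keyword"},  # exact-match field
        "text":       {"type": "text"},     # full-text searchable
        "hate":       {"type": "boolean"},  # output of the hate-detection model
        "sentiment":  {"type": "keyword"},  # e.g. "positive" / "negative"
        "created_at": {"type": "date"},
    }
}

print(sorted(tweet_mapping["properties"]))
```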

&lt;h1&gt;
  
  
  Kibana
&lt;/h1&gt;

&lt;p&gt;Done with storing the messages/data in the Elasticsearch indexes? Okay, Great! We can finally use that resultant data to visualize and get more insights about the data.&lt;/p&gt;

&lt;p&gt;We use Kibana for that.&lt;/p&gt;

&lt;h2&gt;
  
  
  Kibana: Explore, Visualize, Discover Data | Elastic
&lt;/h2&gt;


&lt;blockquote&gt;
&lt;p&gt;Kibana is a free and open user interface that lets you visualize your Elasticsearch data and navigate the Elastic Stack.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  Kibana Dashboard
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4b7vmi7fkldggedoepj5.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F4b7vmi7fkldggedoepj5.gif" width="800" height="384"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This is what my final Kibana dashboard looks like. You can check out the code at my GitHub &lt;a href="https://github.com/Dipankar-Medhi/sense_media" rel="noopener noreferrer"&gt;repo&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Feel free to leave a star if you like the project.&lt;/p&gt;

&lt;p&gt;This part covers only the idea or the overview of the project along with the project architecture. I'll soon add the coding section in a separate part, so stay tuned for that.&lt;/p&gt;




&lt;p&gt;That's all, folks. See you soon 👋&lt;/p&gt;

&lt;p&gt;Happy coding.&lt;/p&gt;

</description>
      <category>streaming</category>
      <category>machinelearning</category>
      <category>dataengineering</category>
      <category>datascience</category>
    </item>
    <item>
      <title>Streaming tweets using Twitter V2 API | Tweepy</title>
      <dc:creator>Dipankar Medhi</dc:creator>
      <pubDate>Wed, 17 Aug 2022 04:53:43 +0000</pubDate>
      <link>https://dev.to/dipankarmedhi/streaming-tweets-using-twitter-v2-api-tweepy-58pf</link>
      <guid>https://dev.to/dipankarmedhi/streaming-tweets-using-twitter-v2-api-tweepy-58pf</guid>
      <description>&lt;p&gt;With v2 Twitter API, things have changed when it comes to streaming tweets. Today we're going to see how to use StreamingClient to stream tweets and store them into an SQLite3 database.&lt;/p&gt;

&lt;h2&gt;
  
  
  About Twitter V2 API
&lt;/h2&gt;

&lt;p&gt;For streaming tweets, you will most likely need to apply for an "Elevated" account.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb2ylfydb38psamezaskf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fb2ylfydb38psamezaskf.png" alt="image.png" width="" height=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The application process is fairly simple and easy. Once the application has been submitted, you will receive an "approval" email from the Twitter Dev team.&lt;/p&gt;

&lt;h2&gt;
  
  
  Things to be done on your Twitter developer portal
&lt;/h2&gt;

&lt;p&gt;After you've got your &lt;strong&gt;Elevated&lt;/strong&gt; access, visit the &lt;a href="https://developer.twitter.com/en/portal/dashboard" rel="noopener noreferrer"&gt;Developer portal&lt;/a&gt; to get your projects and apps ready.&lt;/p&gt;

&lt;p&gt;Move to the projects and apps menu, present on the left side of the developer portal, and add an application as required.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Click on "Add app"&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Select App environment&lt;/li&gt;
&lt;li&gt;App name&lt;/li&gt;
&lt;li&gt;Keys &amp;amp; Token&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Next, you will get your API keys and tokens along with a bearer token.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;💡 Save them, cause we'll need them to make requests to the Twitter API.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Now, let's move on to the next section.&lt;/p&gt;

&lt;h2&gt;
  
  
  Installing tweepy
&lt;/h2&gt;

&lt;p&gt;Installing Tweepy is pretty straightforward 📏.&lt;/p&gt;

&lt;p&gt;The official &lt;a href="https://docs.tweepy.org/en/stable/install.html" rel="noopener noreferrer"&gt;tweepy documentation&lt;/a&gt; has everything we need. Make sure to have a look at it.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Make a Python virtual environment: &lt;code&gt;python -m venv venv&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;Install tweepy &lt;code&gt;pip install tweepy&lt;/code&gt;
&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;See, that's not hard😸.&lt;/p&gt;

&lt;p&gt;Now that we are done with the requirements, we can move to the coding section.&lt;/p&gt;

&lt;h2&gt;
  
  
  Let's write some code
&lt;/h2&gt;

&lt;p&gt;Before that, let's structure our code.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Make a &lt;strong&gt;database&lt;/strong&gt; directory where we'll store the SQLite DB files.&lt;/li&gt;
&lt;li&gt;A &lt;code&gt;main.py&lt;/code&gt; file where all our code goes in, and&lt;/li&gt;
&lt;li&gt;A &lt;code&gt;.env&lt;/code&gt; file that will store all our API keys and tokens.&lt;/li&gt;
&lt;/ol&gt;

&lt;blockquote&gt;
&lt;p&gt;💡 For this project, I have put everything into one file but you can always refactor them into separate modules as per requirements.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Now, We are ready! 🚗&lt;/strong&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  1. Store the API keys and tokens in a .env file
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;API_KEY="apikeygoeshere"
API_KEY_SECRET="apikeysecretgoeshere"
ACCESS_TOKEN="accesstokengoeshere"
ACCESS_TOKEN_SECRET="accesstokensecretgoeshere"
BEARER_TOKEN="bearertokengoeshere"

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  2. Importing all necessary packages
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from dotenv import load_dotenv
import os
import sqlite3
import tweepy
import time
import argparse

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  3. Loading the API credentials
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;load_dotenv()
api_key = os.getenv("API_KEY")
api_key_secret = os.getenv("API_KEY_SECRET")
access_token = os.getenv("ACCESS_TOKEN")
access_token_secret = os.getenv("ACCESS_TOKEN_SECRET")
bearer_token = os.getenv("BEARER_TOKEN")

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  4. Creating the database
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;conn = sqlite3.connect("./database/tweets.db")
print("DB created!")
cursor = conn.cursor()
cursor.execute("CREATE TABLE IF NOT EXISTS tweets (username TEXT,tweet TEXT)")
print("Table created")

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  5. Creating the Streaming class
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;class TweetStreamV2(tweepy.StreamingClient):
    new_tweet = {}

    def on_connect(self):
        print("Connected!")

    def on_includes(self, includes):
        self.new_tweet["username"] = includes["users"][0].username
        print(self.new_tweet)
        # insert tweets in db
        cursor.execute(
            "INSERT INTO tweets VALUES (?,?)",
            (
                self.new_tweet["username"],
                self.new_tweet["tweet"],
            ),
        )
        conn.commit()
        # print(self.new_tweet)
        print("tweet added to db!")
        print("-" * 30)

    def on_tweet(self, tweet):
        # keep only original tweets (skip retweets, replies and quotes)
        if tweet.referenced_tweets is None:
            self.new_tweet["tweet"] = tweet.text
            print(tweet.text)
            time.sleep(0.3)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;What does the code say?&lt;/strong&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Before moving into details, I request you to please have a look at the &lt;a href="https://docs.tweepy.org/en/stable/streamingclient.html" rel="noopener noreferrer"&gt;StreamingClient&lt;/a&gt; documentation. This will make things more clear.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;on_connect&lt;/code&gt; method prints a "Connected" message, letting us know that we have successfully connected to the Twitter API.&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;on_tweet&lt;/code&gt; method receives a tweet and processes it according to the conditions, if there are any, and adds the tweet to the hashmap. &lt;/li&gt;
&lt;li&gt;
&lt;code&gt;on_includes&lt;/code&gt; is responsible for the user details and adds the user data to the hashmap.&lt;/li&gt;
&lt;li&gt;Finally, the data in the hashmap is inserted into the tweets table.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  6. Main function
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def main():
    # get args
    parser = argparse.ArgumentParser()
    parser.add_argument("search_query", help="Twitter search query")
    args = parser.parse_args()
    query = args.search_query

    stream = TweetStreamV2(bearer_token)

    # delete previously added rules, if any
    existing_rules = stream.get_rules().data
    if existing_rules:
        stream.delete_rules([rule.id for rule in existing_rules])
    # add new query
    stream.add_rules(tweepy.StreamRule(query))

    print(stream.get_rules())

    stream.filter(
        tweet_fields=["created_at", "lang"],
        expansions=["author_id"],
        user_fields=["username", "name"],
    )

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;What does the code say?&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The Python script takes an argument, &lt;code&gt;search_query&lt;/code&gt;. &lt;/li&gt;
&lt;li&gt;This argument is added to the stream rules after deleting the previously added rules.&lt;/li&gt;
&lt;li&gt;Rules are basically search queries that go in as input to the stream object. There can be more than one rule, and each rule has a &lt;code&gt;value&lt;/code&gt;, a &lt;code&gt;tag&lt;/code&gt; and an &lt;code&gt;id&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;The &lt;code&gt;id&lt;/code&gt; is passed on to the &lt;code&gt;delete_rules&lt;/code&gt; method to delete a rule.&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;💡 I suggest you refer to the official &lt;a href="https://docs.tweepy.org/en/stable/streamingclient.html#tweepy.StreamingClient.add_rules" rel="noopener noreferrer"&gt;documentation&lt;/a&gt; for more details on adding and deleting rules.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;ul&gt;
&lt;li&gt;Next, we have the filter method. It is responsible for filtering the tweets based on the &lt;code&gt;query&lt;/code&gt; passed and the fields chosen. &lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;All the different fields are:&lt;/p&gt;

&lt;p&gt;expansions (list[str] | str)&lt;br&gt;
media_fields (list[str] | str)&lt;br&gt;
place_fields (list[str] | str)&lt;br&gt;
poll_fields (list[str] | str)&lt;br&gt;
tweet_fields (list[str] | str)&lt;br&gt;
user_fields (list[str] | str)&lt;br&gt;
threaded (bool): whether or not to use a thread to run the stream&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;💡 Refer to the official &lt;a href="https://docs.tweepy.org/en/stable/streamingclient.html#tweepy.StreamingClient.filter" rel="noopener noreferrer"&gt;documentation&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Let's try out our app
&lt;/h3&gt;

&lt;p&gt;To test that everything is working, we pass the &lt;code&gt;Spiderman&lt;/code&gt; argument when running the &lt;code&gt;main.py&lt;/code&gt; file.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ python main.py Spiderman

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This will create a &lt;code&gt;tweets.db&lt;/code&gt; file inside the database directory.&lt;/p&gt;

&lt;p&gt;And if you view the &lt;code&gt;tweets.db&lt;/code&gt; file, you will find a table with &lt;code&gt;username&lt;/code&gt; and &lt;code&gt;tweet&lt;/code&gt; as its columns respectively.&lt;/p&gt;

&lt;p&gt;| username | tweet |&lt;br&gt;
| some_username | some_tweet_about_spiderman |&lt;/p&gt;
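&lt;p&gt;To double-check what got stored, you can query the table with the &lt;code&gt;sqlite3&lt;/code&gt; module. Here is a small self-contained sketch (it seeds an in-memory database with one sample row; point it at &lt;code&gt;./database/tweets.db&lt;/code&gt; to inspect the real file):&lt;/p&gt;

```python
import sqlite3

# Inspect stored tweets; an in-memory database with one sample row keeps the
# snippet self-contained. Replace ":memory:" with "./database/tweets.db" to
# read the file created by the streaming script.
conn = sqlite3.connect(":memory:")
cursor = conn.cursor()
cursor.execute("CREATE TABLE IF NOT EXISTS tweets (username TEXT, tweet TEXT)")
cursor.execute("INSERT INTO tweets VALUES (?, ?)", ("someone", "a tweet about Spiderman"))
conn.commit()

rows = cursor.execute("SELECT username, tweet FROM tweets").fetchall()
for username, tweet in rows:
    print(username, "|", tweet)
conn.close()
```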

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;This is an example showing how to use the Twitter v2 API with Python, using the Tweepy library, to get live tweets and store them in a database. You can also use &lt;code&gt;csv&lt;/code&gt; or &lt;code&gt;json&lt;/code&gt; files to store tweets.&lt;/p&gt;
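&lt;p&gt;For instance, a minimal sketch of writing tweets to CSV instead of SQLite (a &lt;code&gt;StringIO&lt;/code&gt; stands in for an open file so the snippet is self-contained; the sample rows are illustrative):&lt;/p&gt;

```python
import csv
import io

# Minimal sketch of writing tweets to CSV. With a real file you would use
# open("tweets.csv", "w", newline="") instead of the StringIO buffer.
rows = [
    {"username": "someone", "tweet": "a tweet about Spiderman"},
    {"username": "someone_else", "tweet": "another tweet"},
]

buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=["username", "tweet"])
writer.writeheader()
writer.writerows(rows)
print(buffer.getvalue())
```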

&lt;p&gt;I will keep adding more blogs to this series.&lt;/p&gt;

&lt;p&gt;🤝Follow for quick updates.&lt;/p&gt;




&lt;p&gt;🌎Explore, 🎓Learn, 👷Build.&lt;/p&gt;

&lt;p&gt;Happy Coding💛&lt;/p&gt;

</description>
      <category>streaming</category>
      <category>machinelearning</category>
      <category>datascience</category>
      <category>dataengineering</category>
    </item>
    <item>
      <title>How to create an end-to-end Machine Learning pipeline with AMLS (Azure Machine Learning Studio)</title>
      <dc:creator>Dipankar Medhi</dc:creator>
      <pubDate>Mon, 02 May 2022 06:55:36 +0000</pubDate>
      <link>https://dev.to/dipankarmedhi/how-to-create-an-end-to-end-machine-learning-pipeline-with-amls-azure-machine-learning-studio-3mad</link>
      <guid>https://dev.to/dipankarmedhi/how-to-create-an-end-to-end-machine-learning-pipeline-with-amls-azure-machine-learning-studio-3mad</guid>
      <description>&lt;p&gt;Welcome👋!&lt;/p&gt;

&lt;p&gt;Today let us build an end-to-end Machine learning pipeline with Microsoft Azure Machine Learning Studio.&lt;/p&gt;

&lt;p&gt;We are using the adult income dataset.&lt;/p&gt;

&lt;p&gt;For a more detailed tutorial, visit the official Microsoft Azure documentation.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://docs.microsoft.com/en-us/azure/machine-learning/how-to-create-machine-learning-pipelines" rel="noopener noreferrer"&gt;https://docs.microsoft.com/en-us/azure/machine-learning/how-to-create-machine-learning-pipelines&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 1: Creating the Workspace
&lt;/h2&gt;

&lt;p&gt;The first step is to create the &lt;a href="https://docs.microsoft.com/en-us/azure/machine-learning/how-to-manage-workspace" rel="noopener noreferrer"&gt;Azure Machine Learning workspace&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 2: Connect to the workspace
&lt;/h2&gt;

&lt;p&gt;Import all the dependencies&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from azureml.core import Workspace, Datastore
from azureml.core import Experiment
from azureml.core import Model
import azureml.core
import pandas as pd
import numpy as np
import joblib
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import roc_auc_score
from sklearn.metrics import roc_curve
from sklearn import metrics

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Connecting to the workspace&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ws = Workspace.from_config()
print(ws)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Step 3: Create an Experiment
&lt;/h2&gt;

&lt;p&gt;We are naming our experiment "new-adult-exp".&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Create an Azure ML experiment in your workspace
experiment = Experiment(workspace = ws, name = "new-adult-exp")
run = experiment.start_logging()

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Step 4: Setting up a datastore
&lt;/h2&gt;

&lt;h3&gt;
  
  
  What's a datastore?
&lt;/h3&gt;

&lt;p&gt;A datastore holds the data that the pipeline steps access. By default, a datastore connected to Azure Blob Storage is registered with the workspace.&lt;/p&gt;

&lt;h3&gt;
  
  
  Azure Storage data services
&lt;/h3&gt;

&lt;p&gt;The Azure Storage platform includes the following data services:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Azure Blobs&lt;/strong&gt; : A massively scalable object store for text and binary data. Also includes support for big data analytics through Data Lake Storage Gen2.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Azure Files&lt;/strong&gt; : Managed file shares for cloud or on-premises deployments.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Azure Queues&lt;/strong&gt; : A messaging store for reliable messaging between application components.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Azure Tables&lt;/strong&gt; : A NoSQL store for schemaless storage of structured data.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Azure Disks&lt;/strong&gt; : Block-level storage volumes for Azure VMs.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;For a brief overview of all these data services, I recommend the official documentation. 👇&lt;/p&gt;

&lt;p&gt;&lt;a href="https://docs.microsoft.com/en-us/azure/storage/common/storage-introduction?toc=/azure/storage/blobs/toc.json" rel="noopener noreferrer"&gt;https://docs.microsoft.com/en-us/azure/storage/common/storage-introduction?toc=/azure/storage/blobs/toc.json&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Setting up the datastore
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;#upload data by using get_default_datastore()
ds = ws.get_default_datastore()
ds.upload(src_dir='./data', target_path='data', overwrite=True, show_progress=True)

print('Done')

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmxrx98e4ysvq1wjmgoix.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmxrx98e4ysvq1wjmgoix.png" alt="image.png" width="" height=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Creating the Tabular Dataset
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from azureml.core import Dataset

csv_paths = [(ds, 'data/adult.csv')]
tab_ds = Dataset.Tabular.from_delimited_files(path=csv_paths)
tab_ds = tab_ds.register(workspace=ws, name='adult_ds_table',create_new_version=True)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fobi07j4xy4zx3bhmd504.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fobi07j4xy4zx3bhmd504.png" alt="image.png" width="" height=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 5: Creating a pipeline folder
&lt;/h2&gt;

&lt;p&gt;Inside the &lt;code&gt;User&lt;/code&gt; folder is the username folder; inside that, we create a new &lt;code&gt;pipeline&lt;/code&gt; folder that will contain all the code files.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2eg21kpsfjzly263b15d.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2eg21kpsfjzly263b15d.png" alt="image.png" width="" height=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 6: Create a Compute Target
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from azureml.core.compute import ComputeTarget, AmlCompute

compute_name = "aml-compute"
vm_size = "STANDARD_NC6"
if compute_name in ws.compute_targets:
    compute_target = ws.compute_targets[compute_name]
    if compute_target and type(compute_target) is AmlCompute:
        print('Found compute target: ' + compute_name)
else:
    print('Creating a new compute target...')
    provisioning_config = AmlCompute.provisioning_configuration(vm_size=vm_size, # STANDARD_NC6 is GPU-enabled
                                                                min_nodes=0,
                                                                max_nodes=4)
    # create the compute target
    compute_target = ComputeTarget.create(
        ws, compute_name, provisioning_config)

    # Can poll for a minimum number of nodes and for a specific timeout.
    # If no min node count is provided it will use the scale settings for the cluster
    compute_target.wait_for_completion(
        show_output=True, min_node_count=None, timeout_in_minutes=20)

    # For a more detailed view of current cluster status, use the 'status' property
    print(compute_target.status.serialize())

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Step 7: Loading the dataset and training
&lt;/h2&gt;

&lt;p&gt;I am loading the tabular data from the &lt;code&gt;Datasets&lt;/code&gt; under the Assets tab.&lt;/p&gt;

&lt;p&gt;Here, I am using a &lt;strong&gt;Random Forest classifier&lt;/strong&gt; to predict whether the income is at most 50K or above 50K.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Loading the dataset
from azureml.core import Run
from azureml.core import Dataset
from sklearn.ensemble import RandomForestClassifier

dataset = Dataset.get_by_name(ws, 'adult_ds_table', version='latest')

# converting our dataset to pandas dataframe
adult_data = dataset.to_pandas_dataframe()
# dropping the null values
adult_data = adult_data.dropna()

## Performing data preprocessing
df = adult_data.rename(columns={'fnlwgt': 'final-wt'})

# outlier treatment
def remove_outlier_IQR(df, field_name):
    iqr = 1.5 * (np.percentile(df[field_name], 75) -
                 np.percentile(df[field_name], 25))
    df.drop(df[df[field_name] &amp;gt; (
        iqr + np.percentile(df[field_name], 75))].index, inplace=True)
    df.drop(df[df[field_name] &amp;lt; (np.percentile(
        df[field_name], 25) - iqr)].index, inplace=True)
    return df

df2 = remove_outlier_IQR(df,'final-wt')
df_final = remove_outlier_IQR(df2, 'hours-per-week')
df_final.shape

df_final = df_final.replace({'?': 'unknown'})
cat_df = df_final.select_dtypes(exclude=[np.number, np.datetime64])
num_df = df_final.select_dtypes(exclude=[object, np.datetime64])  # builtin object: np.object is deprecated
cat_df = pd.get_dummies(cat_df)
data = pd.concat([cat_df,num_df],axis=1)

from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

X1 = data.drop(columns=['income_&amp;lt;=50K', 'income_&amp;gt;50K'])
y1 = data['income_&amp;lt;=50K']

# Scaling the data
scaler = StandardScaler()
scaled_df = scaler.fit_transform(X1)

X1_train, X1_test, y1_train, y1_test = train_test_split(
    scaled_df, y1, test_size=0.3)

# model training
rfm = RandomForestClassifier(random_state=10)
rfm.fit(X1_train, y1_train)
y1_pred = rfm.predict(X1_test)

print(metrics.accuracy_score(y1_test, y1_pred))
run.log('accuracy', float(metrics.accuracy_score(y1_test, y1_pred)))  # builtin float: np.float is removed in newer NumPy
run.log('AUC', float(roc_auc_score(y1_test, y1_pred)))

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Step 8: Register the model
&lt;/h2&gt;

&lt;p&gt;The next important step is to register the trained model in the workspace for future inference.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Save the trained model
model_file = 'new-adult-income-model.pkl'
joblib.dump(value=rfm, filename=model_file)
run.upload_file(name = 'outputs/' + model_file, path_or_stream = './' + model_file)
# Complete the run
run.complete()
# Register the model
model = run.register_model(model_path='outputs/new-adult-income-model.pkl', model_name='new-adult-income-model',
                   tags={'Training context':'Inline Training'},
                   properties={'AUC': run.get_metrics()['AUC'], 'Accuracy': run.get_metrics()['accuracy']})

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It is visible inside the &lt;code&gt;Models&lt;/code&gt; section under the Assets tab.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fiedo2tv7qoz2n7h4zv5q.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fiedo2tv7qoz2n7h4zv5q.png" alt="image.png" width="" height=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 9: Deploying the model
&lt;/h2&gt;

&lt;p&gt;The next step is to deploy the model.&lt;/p&gt;

&lt;p&gt;Create an &lt;code&gt;InferenceConfig&lt;/code&gt; and an &lt;code&gt;AciWebservice&lt;/code&gt; deployment configuration to deploy the model as a web service and access it through its REST endpoint.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from azureml.core.webservice import AciWebservice
from azureml.core.model import InferenceConfig

import os
path = os.getcwd()
# Configure the scoring environment
script_file = os.path.join(path, "prepare.py")
env_file = os.path.join(path, "adult-income.yml")

inference_config = InferenceConfig(runtime= "python",
                                   entry_script="./prepare.py",
                                   conda_file="./adult-income.yml")
deployment_config = AciWebservice.deploy_configuration(cpu_cores = 1, memory_gb = 1)
service_name = "adult-income-service"
service = Model.deploy(ws, service_name, [model], inference_config, deployment_config, overwrite=True)
service.wait_for_deployment(True)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here are the endpoint details under the &lt;code&gt;Endpoints&lt;/code&gt; section. &lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9uxponyayfmoqrecn2bx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9uxponyayfmoqrecn2bx.png" alt="image.png" width="" height=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Step 10: Check by sending a request
&lt;/h2&gt;

&lt;p&gt;We check if our endpoint is working fine by sending a request using the &lt;code&gt;requests&lt;/code&gt; package.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import requests
import json

endpoint = service.scoring_uri
x_new = X1_test[0:1].tolist()
# Convert the array to a serializable list in a JSON document
input_json = json.dumps({"data": x_new})
# Set the content type
headers = { 'Content-Type':'application/json' }
response = requests.post(endpoint, data = input_json, headers = headers)
pred = json.loads(response.json())
print(pred)


output:
['above_50k']

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;This is an example showcasing the workflow of Azure Machine Learning Studio, focusing on the steps necessary to create a machine learning pipeline that utilizes the Datastore for storing the data for training and inferencing.&lt;/p&gt;

&lt;p&gt;I will be updating this article in the future by adding CI/CD steps and container orchestration (such as AKS).&lt;/p&gt;




&lt;p&gt;🌎Explore, 🎓Learn, 👷Build. Happy Coding💛&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Build an awesome CLI using GO</title>
      <dc:creator>Dipankar Medhi</dc:creator>
      <pubDate>Thu, 14 Apr 2022 10:24:28 +0000</pubDate>
      <link>https://dev.to/dipankarmedhi/build-an-awesome-cli-using-go-6o6</link>
      <guid>https://dev.to/dipankarmedhi/build-an-awesome-cli-using-go-6o6</guid>
      <description>&lt;h2&gt;
  
  
  CLI in go
&lt;/h2&gt;

&lt;p&gt;Go is great for building CLI applications. The ecosystem provides two very powerful tools, cobra-cli and Viper, but in this example we are going to use the built-in &lt;code&gt;flag&lt;/code&gt; package and other standard-library tools.&lt;/p&gt;

&lt;p&gt;For more information on CLI using go, visit &lt;a href="https://go.dev/solutions/clis" rel="noopener noreferrer"&gt;go.dev&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Creating project structure and go module
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;First, create a directory. I have named mine &lt;strong&gt;go-todo-cli&lt;/strong&gt;, but you can choose your own name.&lt;/li&gt;
&lt;li&gt;Inside it, create the nested directories &lt;strong&gt;cmd/todo&lt;/strong&gt;, which will hold the command-line interface code.&lt;/li&gt;
&lt;li&gt;Add &lt;strong&gt;main.go&lt;/strong&gt; and &lt;strong&gt;main_test.go&lt;/strong&gt; files inside &lt;strong&gt;cmd/todo&lt;/strong&gt; directory.&lt;/li&gt;
&lt;li&gt;Add &lt;strong&gt;todo.go&lt;/strong&gt; and &lt;strong&gt;todo_test.go&lt;/strong&gt; files inside the &lt;strong&gt;parent&lt;/strong&gt; directory.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Here is a graphical representation of the project folder structure for better understanding.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fglkm47oig1ahqddz79px.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fglkm47oig1ahqddz79px.png" alt="image.png" width="" height=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Then initialize the Go module for the project by using &lt;code&gt;go mod init &amp;lt;your module name&amp;gt;&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;go mod init github.com/dipankar-medhi/TodoCli

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;💡 Keeping the module name the same as the folder name can make things easy.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Coding the todo functions
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Start by declaring the package name inside the todo.go file.&lt;/li&gt;
&lt;li&gt;Import the packages.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;package todo

import (
    "encoding/json"
    "errors"
    "fmt"
    "io/ioutil"
    "os"
    "time"
)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;Then we create two data structures to be used in our package. The first one is a struct &lt;code&gt;item&lt;/code&gt; and the second one is a list type &lt;code&gt;[]item&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;The item struct has the following fields: &lt;em&gt;Task&lt;/em&gt; as string, &lt;em&gt;Done&lt;/em&gt; as bool to mark whether the task is complete, &lt;em&gt;CreatedAt&lt;/em&gt; as time.Time recording when the task was created, and lastly &lt;em&gt;CompletedAt&lt;/em&gt; as time.Time recording when it was completed.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;type item struct {
    Task string
    Done bool
    CreatedAt time.Time
    CompletedAt time.Time
}

type List []item

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;💡 The struct name is lowercase because we do not plan to export it.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Functions of our todo CLI application:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Add new tasks&lt;/li&gt;
&lt;li&gt;Mark tasks as complete&lt;/li&gt;
&lt;li&gt;Delete tasks from the list of tasks&lt;/li&gt;
&lt;li&gt;Save the list of tasks as JSON&lt;/li&gt;
&lt;li&gt;Get the tasks from the JSON file&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So, let's start by defining the Add function.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Add function&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This function will add new tasks to the list []item.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;func (l *List) Add(task string) {
    t := item{
        Task: task,
        Done: false,
        CreatedAt: time.Now(),
        CompletedAt: time.Time{},
    }

    *l = append(*l, t)
}

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Complete function&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This function marks an item/task as complete by setting the Done field inside the item struct to true and CompletedAt to the current time.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;func (l *List) Complete(i int) error {
    ls := *l
    if i &amp;lt;= 0 || i &amp;gt; len(ls) {
        return fmt.Errorf("item %d does not exist", i)
    }
    ls[i-1].Done = true
    ls[i-1].CompletedAt = time.Now()

    return nil
}

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Save function&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This function saves the list of tasks in JSON format.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;func (l *List) Save(fileName string) error {
    json, err := json.Marshal(l)
    if err != nil {
        return err
    }
    return ioutil.WriteFile(fileName, json, 0644)
}

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Get function&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This function reads the saved task list from the given file name and decodes the JSON data into a list.&lt;/p&gt;

&lt;p&gt;It will also handle cases when the filename doesn't exist or is an empty file.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;func (l *List) Get(fileName string) error {
    file, err := ioutil.ReadFile(fileName)
    if err != nil {
        // if the given file does not exist
        if errors.Is(err, os.ErrNotExist) {
            return nil
        }
        return err
    }

    if len(file) == 0 {
        return nil
    }

    return json.Unmarshal(file, l)
}

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We are done with the to-do functions.&lt;/p&gt;

&lt;p&gt;Now let's write the tests to ensure everything is working correctly as intended.&lt;/p&gt;

&lt;h2&gt;
  
  
  Writing tests for todo functions
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Start by creating a todo_test.go file in the same directory as todo.go.&lt;/li&gt;
&lt;li&gt;Write the package name as todo_test and import the necessary packages.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;package todo_test

import (
    "io/ioutil"
    "os"
    "testing"

    todo "github.com/dipankar-medhi/TodoCli"
)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Test for the Add function&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;func TestAdd(t *testing.T) {
    l := todo.List{}

    taskName := "New Task"
    l.Add(taskName)

    if l[0].Task != taskName {
        t.Errorf("Expected %q, got %q instead", taskName, l[0].Task)
    }
}

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Test for the Complete function&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;func TestComplete(t *testing.T) {
    l := todo.List{}

    taskName := "New Task"
    l.Add(taskName)

    if l[0].Task != taskName {
        t.Errorf("Expected %q, got %q instead", taskName, l[0].Task)
    }

    if l[0].Done {
        t.Errorf("New task should not be completed.")
    }
    l.Complete(1)

    if !l[0].Done {
        t.Errorf("New task should be completed.")
    }
}

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Test for the Save and Get functions&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;func TestSaveGet(t *testing.T) {
    // two lists
    l1 := todo.List{}
    l2 := todo.List{}
    taskName := "New Task"
    // saving task into l1 and loading it into l2 -- error if fails
    l1.Add(taskName)
    if l1[0].Task != taskName {
        t.Errorf("Expected %q, got %q instead.", taskName, l1[0].Task)
    }
    tf, err := ioutil.TempFile("", "")
    if err != nil {
        t.Fatalf("Error creating temp file: %s", err)
    }
    defer os.Remove(tf.Name())
    if err := l1.Save(tf.Name()); err != nil {
        t.Fatalf("Error saving list to file: %s", err)
    }
    if err := l2.Get(tf.Name()); err != nil {
        t.Fatalf("Error getting list from file: %s", err)
    }
    if l1[0].Task != l2[0].Task {
        t.Errorf("Task %q should match %q task.", l1[0].Task, l2[0].Task)
    }
}

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now let's test the application.&lt;/p&gt;

&lt;p&gt;Save the file and use the go test tool to execute the tests.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ go test -v
=== RUN TestAdd
--- PASS: TestAdd (0.00s)
=== RUN TestComplete
--- PASS: TestComplete (0.00s)
=== RUN TestDelete
--- PASS: TestDelete (0.00s)
=== RUN TestSaveGet
--- PASS: TestSaveGet (0.00s)
PASS
ok github.com/dipankar-medhi/TodoCli

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;It is working fine. Let's proceed to the next step.&lt;/p&gt;

&lt;h2&gt;
  
  
  Building the main CLI functionality
&lt;/h2&gt;

&lt;p&gt;We create the &lt;code&gt;main.go&lt;/code&gt; and &lt;code&gt;main_test.go&lt;/code&gt; files inside cmd/todo.&lt;/p&gt;

&lt;p&gt;Let's begin writing the code inside the main.go file.&lt;/p&gt;

&lt;p&gt;We start by importing the packages.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;package main

import (
    "flag"
    "fmt"
    "os"

    todo "github.com/dipankar-medhi/TodoCli"
)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Create a main() function.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;func main() {

}

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Inside the main function, we define the command-line flags and the logic they execute.&lt;/p&gt;

&lt;p&gt;Parse the command-line flags.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;    task := flag.String("task", "", "Task to be included in the todolist")
    list := flag.Bool("list", false, "List all tasks")
    complete := flag.Int("complete", 0, "Item to be completed")

    flag.Parse()

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;💡 These are pointers, so we have to dereference them with * to access their values.&lt;br&gt;
&lt;/p&gt;
&lt;/blockquote&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;    l := &amp;amp;todo.List{}

    //calling Get method from todo.go file
    if err := l.Get(todoFileName); err != nil {
        // in cli, stderr output is best practice
        fmt.Fprintln(os.Stderr, err)
        // another good practice is to exit the program with
        // a return code different than 0.
        os.Exit(1)
    }

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Next, we decide what to do based on the arguments provided, using a switch statement.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;    switch {
    case *list:
        // list current to do items
        for _, item := range *l {
            if !item.Done {
                fmt.Println(item.Task)
            }
        }
    // to verify if complete flag is set with value more than 0 (default)
    case *complete &amp;gt; 0:
        if err := l.Complete(*complete); err != nil {
            fmt.Fprintln(os.Stderr, err)
            os.Exit(1)
        }
        // save the new list
        if err := l.Save(todoFileName); err != nil {
            fmt.Fprintln(os.Stderr, err)
            os.Exit(1)
        }
    // verify if task flag is set with different than empty string
    case *task != "":
        l.Add(*task)
        if err := l.Save(todoFileName); err != nil {
            fmt.Fprintln(os.Stderr, err)
            os.Exit(1)
        }
    default:
        // print an error msg
        fmt.Fprintln(os.Stderr, "Invalid option")
        os.Exit(1)
    }

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Writing tests for the main function
&lt;/h2&gt;

&lt;p&gt;Start by importing packages and defining some variables.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;package main

import (
    "fmt"
    "os"
    "os/exec"
    "path/filepath"
    "runtime"
    "testing"
)


var (
    binName = "todo"
    fileName = ".todo.json"
)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Test for Main function&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;func TestMain(m *testing.M) {
    fmt.Println("Building tool...")
    if runtime.GOOS == "windows" {
        binName += ".exe"
    }
    build := exec.Command("go", "build", "-o", binName)
    if err := build.Run(); err != nil {
        fmt.Fprintf(os.Stderr, "Cannot build tool %s: %s", binName, err)
        os.Exit(1)
    }

    fmt.Println("Running tests....")
    result := m.Run()
    fmt.Println("Cleaning up...")
    os.Remove(binName)
    os.Remove(fileName)
    os.Exit(result)
}

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;strong&gt;Tests for the todo functions&lt;/strong&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;func TestTodoCLI(t *testing.T) {
    task := "test task number 1"
    dir, err := os.Getwd()
    if err != nil {
        t.Fatal(err)
    }
    cmdPath := filepath.Join(dir, binName)
    t.Run("AddNewTask", func(t *testing.T) {
        cmd := exec.Command(cmdPath, "-task", task)
        if err := cmd.Run(); err != nil {
            t.Fatal(err)
        }
    })

    t.Run("ListTasks", func(t *testing.T) {
        cmd := exec.Command(cmdPath, "-list")
        out, err := cmd.CombinedOutput()
        if err != nil {
            t.Fatal(err)
        }
        expected := task + "\n"

        if expected != string(out) {
            t.Errorf("Expected %q, got %q instead\n", expected, string(out))
        }

    })
}

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We have written all our tests.&lt;/p&gt;

&lt;p&gt;Now, let's test out the application.&lt;/p&gt;

&lt;p&gt;Run &lt;code&gt;go test -v&lt;/code&gt; inside the cmd/todo directory.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ go test -v
Building tool...
Running tests....
=== RUN TestTodoCLI
=== RUN TestTodoCLI/AddNewTask
=== RUN TestTodoCLI/ListTasks
--- PASS: TestTodoCLI (0.51s)
    --- PASS: TestTodoCLI/AddNewTask (0.47s)
    --- PASS: TestTodoCLI/ListTasks (0.05s) 
PASS
Cleaning up...
ok github.com/dipankar-medhi/TodoCli/cmd/todo 1.337s

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;We see that everything is working fine.&lt;/p&gt;

&lt;p&gt;Now it's time to use our application.&lt;/p&gt;

&lt;p&gt;Before listing the items, we should add some tasks. So we add a few items using the -task flag.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ go run main.go -task "Get Vegetables from the market"
$ go run main.go -task "Drop the package"


$ go run main.go -list
"Get Vegetables from the market"
"Drop the package"

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Let's try marking our tasks complete.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;$ go run main.go -complete 1


$ go run main.go -list
"Drop the package"

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;This is a simple to-do CLI with limited functionality. By using external packages like cobra-cli, the application can be extended to a great extent.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Reference&lt;/strong&gt; : "Powerful Command-Line Applications in Go: Build Fast and Maintainable Tools" by Ricardo Gerardi&lt;/p&gt;




&lt;p&gt;🌎Explore, 🎓Learn, 👷‍♂️Build. Happy Coding💛&lt;/p&gt;

</description>
      <category>go</category>
      <category>cli</category>
    </item>
    <item>
      <title>Part1 - Introduction (Clean Architecture by Robert C.Martin)</title>
      <dc:creator>Dipankar Medhi</dc:creator>
      <pubDate>Mon, 04 Apr 2022 11:06:08 +0000</pubDate>
      <link>https://dev.to/dipankarmedhi/part1-introduction-clean-architecture-by-robert-cmartin-7b5</link>
      <guid>https://dev.to/dipankarmedhi/part1-introduction-clean-architecture-by-robert-cmartin-7b5</guid>
      <description>&lt;h2&gt;
  
  
  Why is there a decline in programmers' productivity over time?
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Late-night race
&lt;/h3&gt;

&lt;p&gt;Modern developers are often sleep deprived. They work day and night, writing code around the clock to finish their tasks before the deadline.&lt;/p&gt;

&lt;p&gt;But sleep is essential. Sleep deprivation can affect our working potential and lower our performance while writing code. The part of the brain that knows how to write good, clean code is asleep.&lt;/p&gt;

&lt;h3&gt;
  
  
  Overconfidence
&lt;/h3&gt;

&lt;p&gt;Modern developers are overconfident, just like the Rabbit in the "The Rabbit and The Tortoise" story.&lt;/p&gt;

&lt;p&gt;Programmers think they can come back later to clean up their mess, but they won't, because they have to deal with new tasks.&lt;/p&gt;

&lt;p&gt;So to maintain the company's productivity, developers must stop thinking like the Rabbit and be reliable. Developers must take responsibility for their code and try to produce well-defined, clean code in the first iteration, rather than planning to improve it later or start the coding process over entirely. Because the reality is:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Their overconfidence will drive the redesign into the same mess as the original project.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  Which is more important? Functions or Architecture
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Functions
&lt;/h3&gt;

&lt;p&gt;Companies hire programmers to write code, and managers believe that writing less code to run their machines and saving money is what matters most.&lt;/p&gt;

&lt;p&gt;And so, most programmers believe that fixing bugs and keeping machines running with few lines of code makes them better programmers.&lt;/p&gt;

&lt;h3&gt;
  
  
  Architecture
&lt;/h3&gt;

&lt;p&gt;Another significant value of software is architecture. Software must be easy to change and manipulate. Technology keeps changing, and so do the requirements. And to deal with new requirements, developers must be able to evolve old software architectures into new ones.&lt;/p&gt;

&lt;p&gt;But the process is not as smooth as it sounds. If an architecture prefers one strategy over another, it is tough to make changes and upgrade the system. This is why the first year of development is often much cheaper than later integrations.&lt;/p&gt;

&lt;p&gt;So architectures should be flexible and adjustable.&lt;/p&gt;

&lt;h3&gt;
  
  
  Which one gets more importance?
&lt;/h3&gt;

&lt;p&gt;A program that works fine and provides excellent performance but cannot be changed later on won't be enough in the future when the requirements are different.&lt;/p&gt;

&lt;p&gt;And, if a program does not work as well as the first one, but is flexible and easy to change, then further debugging can make it work and keep it working in the future as the requirements vary.&lt;/p&gt;

&lt;p&gt;Both are important, but architectures ensure longer lasting software and maintain the production costs in the long run by delivering what is important rather than urgent.&lt;/p&gt;




&lt;p&gt;This blog is a part of my knowledge database that I am creating for everything I read/study. It is part1, and there are more parts to be read. And once I finish reading those remaining portions, I will add them to this series.&lt;/p&gt;




&lt;p&gt;🌎Explore, 🎓Learn, 👷‍♂️Build. Happy Coding💛&lt;/p&gt;

</description>
      <category>books</category>
    </item>
    <item>
      <title>Build a Shared Wallet in Solidity</title>
      <dc:creator>Dipankar Medhi</dc:creator>
      <pubDate>Wed, 30 Mar 2022 07:42:35 +0000</pubDate>
      <link>https://dev.to/dipankarmedhi/build-a-shared-wallet-in-solidity-4cmj</link>
      <guid>https://dev.to/dipankarmedhi/build-a-shared-wallet-in-solidity-4cmj</guid>
      <description>&lt;p&gt;Today, we will build a shared wallet in Solidity, which will have functions like withdrawing, adding funds to different users on the wallet. &lt;br&gt;
We will use &lt;a href="https://openzeppelin.com/" rel="noopener noreferrer"&gt;Openzeppelin&lt;/a&gt; for the ownership and other security processes. &lt;/p&gt;
&lt;h2&gt;
  
  
  🚀What is the project all about?
&lt;/h2&gt;

&lt;p&gt;This project aims to create a shared wallet on the blockchain, and there will be an Owner and other users. &lt;/p&gt;

&lt;p&gt;The Owner will have access to all the functions of the wallet. The Owner can add funds and withdraw ethers, while the users added to the wallet will only have access to withdraw funds. No other user can withdraw or add funds to their account.&lt;/p&gt;
&lt;h2&gt;
  
  
  Prerequisites
&lt;/h2&gt;

&lt;p&gt;You can choose whatever you are familiar with as the design principle will remain the same. But in our case, we are going to use &lt;a href="https://remix-project.org/" rel="noopener noreferrer"&gt;Remix IDE&lt;/a&gt; for running and deploying the smart contracts for Ethereum.&lt;/p&gt;
&lt;h3&gt;
  
  
  What is a Smart Contract?
&lt;/h3&gt;

&lt;p&gt;A Smart Contract is like a digital agreement or deal between two parties. While a normal agreement takes place on paper or official documents, a Smart Contract is executed as code running in a blockchain.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Want to know more about &lt;strong&gt;Smart Contracts&lt;/strong&gt;? Visit &lt;a href="https://ethereum.org/en/developers/docs/smart-contracts/" rel="noopener noreferrer"&gt;Ethereum.org&lt;/a&gt;&lt;/p&gt;
&lt;/blockquote&gt;
&lt;h3&gt;
  
  
  Solidity
&lt;/h3&gt;

&lt;p&gt;We will write all our code in &lt;strong&gt;Solidity&lt;/strong&gt;, so having a basic understanding of its syntax will make things easier to follow. If you are new to Solidity, here are some excellent resources that you might want to check out.&lt;/p&gt;
&lt;h4&gt;
  
  
  📌List of good free solidity resources
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://www.youtube.com/watch?v=YJ-D1RMI0T0" rel="noopener noreferrer"&gt;MASTER Solidity for Blockchain Step-By-Step (Full Course)&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.defi-academy.com/courses/introduction-to-smart-contracts" rel="noopener noreferrer"&gt;Solidity 101: Introduction to smart contracts&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://blog.finxter.com/solidity/" rel="noopener noreferrer"&gt;Solidity Crash Course&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;
  
  
  Let's Start Coding
&lt;/h2&gt;
&lt;h3&gt;
  
  
  Step 1: Set up Remix IDE
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Visit &lt;a href="https://remix-project.org/" rel="noopener noreferrer"&gt;remix&lt;/a&gt; and start a new project by clicking on the &lt;code&gt;REMIX IDE&lt;/code&gt; on the top right corner.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj6y8mgqiip32p2b57vft.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj6y8mgqiip32p2b57vft.png" alt="image.png" width="" height=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;A new screen with a default workspace containing some folders and files on the left and a code editor on the right will appear.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Create a new file &lt;strong&gt;simpleSharedWallet.sol&lt;/strong&gt;. &lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Falaub6eq8xqedfh68vl2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Falaub6eq8xqedfh68vl2.png" alt="image.png" width="" height=""&gt;&lt;/a&gt;&lt;/p&gt;
&lt;h3&gt;
  
  
  Step 2: Solidity version and import OpenZeppelin
&lt;/h3&gt;

&lt;p&gt;Specify the Solidity version and import OpenZeppelin into our code.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;//SPDX-License-Identifier: MIT
pragma solidity ^0.8.12;

import "https://github.com/OpenZeppelin/openzeppelin-contracts/blob/master/contracts/access/Ownable.sol";
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;Version 0.8.12 or above is used.&lt;/li&gt;
&lt;li&gt;Openzeppelin deals with the ownership of the contract and provides an access control mechanism.&lt;/li&gt;
&lt;li&gt;To know more about Openzeppelin visit &lt;a href="https://docs.openzeppelin.com/contracts/2.x/api/ownership" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Step 3: Funds smart contract
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Create the &lt;code&gt;Funds&lt;/code&gt; smart contract.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;contract Funds is Ownable {

}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;Define a map inside the &lt;code&gt;Funds&lt;/code&gt; contract that will hold the &lt;strong&gt;addresses&lt;/strong&gt; and &lt;strong&gt;funds&lt;/strong&gt; of the users.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;contract Funds is Ownable {
    mapping(address =&amp;gt; uint) public funds;
}

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;Create a public function &lt;code&gt;setFund()&lt;/code&gt; to set the funds for the different users.&lt;/li&gt;
&lt;li&gt;The function accepts the parameters address &lt;code&gt;_who&lt;/code&gt; and uint &lt;code&gt;_amount&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;This function is made accessible only to the owner by using the &lt;code&gt;onlyOwner&lt;/code&gt; modifier provided by OpenZeppelin.&lt;/li&gt;
&lt;li&gt;If the requirements are met, the fund is incremented on the map.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;function setFund(address _who, uint _amount) public onlyOwner {
        require(funds[_who] &amp;lt;= address(this).balance, "Amount is more than available in the contract");
        require(_amount &amp;lt;= address(this).balance, "Amount is too high");
        funds[_who] += _amount;
    }
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;Next, create an &lt;code&gt;allowed()&lt;/code&gt; modifier. So, what is a modifier? Modifiers are used to modify the behaviour of a function. The body of the function is inserted in the place of &lt;code&gt;_;&lt;/code&gt; if all the above-written requirements are met while calling this function. To know more about modifiers, visit &lt;a href="https://www.tutorialspoint.com/solidity/solidity_function_modifiers.htm" rel="noopener noreferrer"&gt;here&lt;/a&gt;.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;modifier allowed(uint _amount) {
        require(msg.sender == owner() || funds[msg.sender] &amp;gt;= _amount, "You are not allowed");
        _;
    }
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;Finally, create an internal function &lt;code&gt;reduceFunds()&lt;/code&gt; that accepts address &lt;code&gt;_who&lt;/code&gt; and uint &lt;code&gt;_amount&lt;/code&gt;. This function will decrement the funds from the users in the &lt;code&gt;funds&lt;/code&gt; map, every time the funds are withdrawn from the wallet.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;function reduceFund(address _who, uint _amount) internal{
        funds[_who] -= _amount;
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Step 4: SharedWallet smart contract
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Create the &lt;code&gt;SharedWallet&lt;/code&gt; contract.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;contract SharedWallet is Ownable, Funds {

}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;Create a public function &lt;code&gt;getBalance()&lt;/code&gt; that returns &lt;code&gt;uint256&lt;/code&gt;. This function returns the ether balance held by the contract.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;function getBalance() public view returns (uint256) {
        return address(this).balance;
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;Next, create a payable function &lt;code&gt;withdrawMoney()&lt;/code&gt; that accepts an address &lt;code&gt;_to&lt;/code&gt; and a uint &lt;code&gt;_amount&lt;/code&gt; parameter. And for this project, we make it accessible only to the owner and the users added to the &lt;code&gt;funds&lt;/code&gt; map.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;function withdrawMoney(address payable _to, uint _amount) payable public allowed(_amount) {
        // the entered amount must be &amp;lt;= the balance in the contract
        require(_amount &amp;lt;= address(this).balance, "Contract doesn't own enough money");
        // transfer funds to address entered
        _to.transfer(_amount);
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;We define the &lt;code&gt;pay()&lt;/code&gt; function so that ether can be deposited into the contract.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;function pay() public payable {

}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Step 5: Compiling
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Select the Solidity compiler option and make sure that the compiler version matches the defined solidity version.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fymy6kb2j43lgmwt4ynau.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fymy6kb2j43lgmwt4ynau.png" alt="image.png" width="" height=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Compile the solidity file.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Step 6: Deploy and run the transaction
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzfs7qbqnj5p7oya4cxh3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fzfs7qbqnj5p7oya4cxh3.png" alt="image.png" width="" height=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Select the &lt;code&gt;Environment&lt;/code&gt;. For this project, choose JavaScript VM.&lt;/li&gt;
&lt;li&gt;Select an account.&lt;/li&gt;
&lt;li&gt;Make sure the &lt;strong&gt;simpleSharedWallet.sol&lt;/strong&gt; file is selected in the contract option.&lt;/li&gt;
&lt;li&gt;Then deploy.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Step 7: Testing
&lt;/h3&gt;

&lt;p&gt;For testing the application, I suggest using the other available accounts in the JavaScript VM account list. You can try adding them to the funds map using &lt;code&gt;setFund&lt;/code&gt;. Try withdrawing with the owner account and with the other available accounts. &lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Blockchain technology is very new, and it is still improving. Many people are still not aware of blockchain technology as its use continues to spread. Blockchain technology can potentially bring positive changes to our lives and society, and we, the developers, should continue exploring and promoting its use.&lt;/p&gt;

&lt;p&gt;This is a simple project done as an example to show one of the use cases of blockchain technology. It is nowhere near a production application of blockchain technology, but a tiny glimpse into the world of blockchain. &lt;/p&gt;




&lt;p&gt;🌎Explore, 🎓Learn, 👷‍♂️Build. Happy Coding💛&lt;/p&gt;

</description>
      <category>solidity</category>
      <category>blockchain</category>
      <category>web3</category>
    </item>
    <item>
      <title>Best books on Go Programming Language</title>
      <dc:creator>Dipankar Medhi</dc:creator>
      <pubDate>Wed, 30 Mar 2022 04:13:39 +0000</pubDate>
      <link>https://dev.to/dipankarmedhi/best-books-on-go-programming-language-6e9</link>
      <guid>https://dev.to/dipankarmedhi/best-books-on-go-programming-language-6e9</guid>
      <description>&lt;p&gt;Learning a new topic can be overwhelming, especially if it's a new programming language. Although the concepts remain the same, being able to write and solve problems using a new syntax can be a bit confusing.&lt;/p&gt;

&lt;p&gt;So here I am, trying to help you get a good understanding of the Go programming language by sharing a list of good Go books.&lt;/p&gt;

&lt;p&gt;I have personally read all of these books, and the list is entirely based on my opinion. If you know more good books and want me to add them to the list, feel free to comment down below🤗.&lt;/p&gt;

&lt;p&gt;Note: This list is not ordered. A book appearing higher on the list does not mean it is better.&lt;/p&gt;

&lt;h2&gt;
  
  
  1. Learning Go: An Idiomatic Approach to Real-World Go Programming
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;by Jon Bodner (O'Reilly)&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0mjh5hx1hv3f66wqcwzs.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0mjh5hx1hv3f66wqcwzs.png" alt="book"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;No other publisher is as consistent as O'Reilly when it comes to delivering good, resourceful books.&lt;/p&gt;

&lt;h3&gt;
  
  
  Insights from the book
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;It is easy to follow along and well structured.&lt;/li&gt;
&lt;li&gt;It has code examples that make it easy for the readers to understand what they read.&lt;/li&gt;
&lt;li&gt;The added notes and tips make this book more interesting and engaging.&lt;/li&gt;
&lt;li&gt;It covers everything from setting up the environment to writing tests in Go.&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;💡 &lt;strong&gt;If you are enrolled in any Go bootcamp, following this book will help you understand the language at a deeper level.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  2. The Go Programming Language
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;by Alan A. A. Donovan and Brian W. Kernighan&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8xxjho3a9f8cf4wybz9g.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8xxjho3a9f8cf4wybz9g.png" alt="image.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The Go Programming Language is a very well-known and reputed book among Go programmers. &lt;/p&gt;

&lt;h3&gt;
  
  
  Insights from the book
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;If you need something that has &lt;strong&gt;more theory&lt;/strong&gt;, then definitely this book is for you.&lt;/li&gt;
&lt;li&gt;This book considers many &lt;strong&gt;use cases and tries to replicate&lt;/strong&gt; scenarios where a certain Go function can be used in a particular way.&lt;/li&gt;
&lt;li&gt;It has &lt;strong&gt;code examples&lt;/strong&gt; to help you understand what's going on.&lt;/li&gt;
&lt;li&gt;And readers are tested with added &lt;strong&gt;exercises&lt;/strong&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;💡 &lt;strong&gt;The Go Programming Language with Learning Go can be a great combination for coders and help you learn every bit of Go.&lt;/strong&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;
  
  
  3. Concurrency in Go: Tools and Techniques for Developers
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;by Katherine Cox-Buday&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl5chnij0yahtl0c88gl1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl5chnij0yahtl0c88gl1.png" alt="image.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If you are someone who has a tough time understanding concurrency in Go, then this book is for you.&lt;/p&gt;

&lt;h3&gt;
  
  
  Insights from this book
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;A great introduction to &lt;strong&gt;concurrency&lt;/strong&gt; and a deep dive into understanding what concurrency is.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Code examples&lt;/strong&gt; make it very intuitive and help a lot to follow along.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Step-by-step explanation&lt;/strong&gt; of code examples.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  4. Head First Go
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;by Jay McGavren&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0bu37iea2i47s3kgfxf4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0bu37iea2i47s3kgfxf4.png" alt="image.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This is an interesting book from O'Reilly. It takes a playful, visual approach, similar to a children's book, which makes it great for beginners.&lt;/p&gt;

&lt;h3&gt;
  
  
  Insights from this book
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Illustrations&lt;/strong&gt; with easy to understand language.&lt;/li&gt;
&lt;li&gt;There are &lt;strong&gt;well-explained comments&lt;/strong&gt; that portray the purpose of a line to its reader.&lt;/li&gt;
&lt;li&gt;This book also has some additional chapters on building &lt;strong&gt;web applications&lt;/strong&gt; using Go.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;There are many more important and interesting books on Go that I have not covered in this blog. But I will try to read them and add them to this list. &lt;/p&gt;

&lt;p&gt;Every author spends valuable hours writing these awesome books so that we programmers can easily understand and adapt to new technologies. So we must appreciate their hard work and effort by sharing what we learn and helping their work reach a wider audience.  &lt;/p&gt;




&lt;p&gt;🌎Explore, 🎓Learn, 👷‍♂️Build. Happy Coding💛&lt;/p&gt;

</description>
      <category>go</category>
      <category>books</category>
      <category>programming</category>
    </item>
    <item>
      <title>KNN from scratch VS sklearn</title>
      <dc:creator>Dipankar Medhi</dc:creator>
      <pubDate>Sat, 26 Mar 2022 04:14:43 +0000</pubDate>
      <link>https://dev.to/dipankarmedhi/knn-from-scratch-vs-sklearn-13p9</link>
      <guid>https://dev.to/dipankarmedhi/knn-from-scratch-vs-sklearn-13p9</guid>
      <description>&lt;p&gt;Welcome👋,&lt;/p&gt;

&lt;p&gt;In this article, we are going to build our own &lt;strong&gt;KNN algorithm&lt;/strong&gt; from scratch and apply it to a 23-feature dataset using the &lt;strong&gt;Numpy&lt;/strong&gt; and &lt;strong&gt;Pandas&lt;/strong&gt; libraries.&lt;/p&gt;

&lt;p&gt;First, let us get some idea about the KNN or K Nearest Neighbour algorithm.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is the K Nearest Neighbors algorithm?
&lt;/h2&gt;

&lt;p&gt;K Nearest Neighbors is one of the simplest predictive algorithms out there in the supervised machine learning category.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The algorithm works based on two criteria:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The number of neighbours (k) to consider for the prediction.&lt;/li&gt;
&lt;li&gt;The distance of the neighbours from the test data point.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5vk03xm0sumb2r122hl5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F5vk03xm0sumb2r122hl5.png" alt="image.png" width="" height=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Fig: Prediction made by one nearest neighbour (book: Intro to Machine Learning with Python)&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The above image showcases the number of neighbours ( &lt;strong&gt;k&lt;/strong&gt; ) that are considered when predicting the value for the test data point.&lt;/p&gt;
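&lt;p&gt;The two criteria above can be sketched as a minimal KNN regressor in plain NumPy: compute the distance from the query point to every training point, pick the k closest, and average their targets. This is a simplified illustration (the function name and toy data here are made up for the example), not the full implementation we build below.&lt;/p&gt;

```python
import numpy as np

def knn_predict(X_train, y_train, x_query, k=3):
    """Predict the target for x_query by averaging its k nearest neighbours."""
    # Euclidean distance from the query point to every training point
    distances = np.sqrt(((X_train - x_query) ** 2).sum(axis=1))
    # Indices of the k training points with the smallest distances
    nearest = np.argsort(distances)[:k]
    # Average the targets of those k neighbours
    return y_train[nearest].mean()

# Toy data: one feature, target is roughly 10x the feature value
X_train = np.array([[1.0], [2.0], [3.0], [10.0]])
y_train = np.array([10.0, 20.0, 30.0, 100.0])

# A query near 2.0 is averaged over the three nearby points, ignoring the outlier at 10.0
print(knn_predict(X_train, y_train, np.array([2.1]), k=3))
```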

&lt;p&gt;Now, let us start coding in our jupyter notebook.&lt;/p&gt;

&lt;h2&gt;
  
  
  Let's Code
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Data Preprocessing
&lt;/h3&gt;

&lt;p&gt;In our case, we are using the &lt;a href="https://github.com/Dipankar-Medhi/k-nearest-neighbors-KNN/blob/main/diamonds.csv" rel="noopener noreferrer"&gt;diamonds dataset&lt;/a&gt; having &lt;strong&gt;10 features&lt;/strong&gt; out of which &lt;strong&gt;3 are categorical&lt;/strong&gt; and the rest &lt;strong&gt;7 are numerical&lt;/strong&gt; features.&lt;/p&gt;

&lt;h3&gt;
  
  
  Removing Outliers
&lt;/h3&gt;

&lt;p&gt;We can use the &lt;strong&gt;boxplot()&lt;/strong&gt; function to produce boxplots and check if there are any outliers present in the dataset.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcoq56nmpnbn4atf2altu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcoq56nmpnbn4atf2altu.png" alt="image.png" width="" height=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We can see that there are some outliers in the dataset.&lt;/p&gt;

&lt;p&gt;So we remove these outliers using the &lt;strong&gt;IQR method&lt;/strong&gt; (or any method of your choice).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# IQR
def remove_outlier_IQR(df, field_name):
    iqr = 1.5 * (np.percentile(df[field_name], 75) -
                 np.percentile(df[field_name], 25))
    df.drop(df[df[field_name] &amp;gt; (
        iqr + np.percentile(df[field_name], 75))].index, inplace=True)
    df.drop(df[df[field_name] &amp;lt; (np.percentile(
        df[field_name], 25) - iqr)].index, inplace=True)
    return df

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Printing the shape of the data frame before and after outlier removal using IQR.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;print('Shape of df before IQR:',df.shape)

df2 = remove_outlier_IQR(df, 'carat')
df2 = remove_outlier_IQR(df2, 'depth')
df2 = remove_outlier_IQR(df2, 'price')
df2 = remove_outlier_IQR(df2, 'table')
df2 = remove_outlier_IQR(df2, 'height_mm')
df2 = remove_outlier_IQR(df2, 'length_mm')
df_final = remove_outlier_IQR(df2, 'width_mm')
print('The Shape of df after IQR:',df_final.shape)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;The shape of df before IQR: (53940, 10)&lt;/p&gt;

&lt;p&gt;The shape of df after IQR: (46518, 10)&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Again, after removing the outliers, we check the dataset using a boxplot for better visual confirmation.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fslyx5581rshq0i6s4r1o.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fslyx5581rshq0i6s4r1o.png" alt="image.png" width="" height=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Boxplots after IQR method&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  Encoding the Categorical variables
&lt;/h3&gt;

&lt;p&gt;There are &lt;strong&gt;3 categorical features&lt;/strong&gt; in the dataset. Let us print and see the unique values of each feature.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;print('Unique values of cat features:\n')
print('color:', cat_df.color.unique())
print('cut_quality:', cat_df.cut_quality.unique())
print('clarity:', cat_df.clarity.unique())

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flm2oeamypvokl9hodc5w.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flm2oeamypvokl9hodc5w.png" alt="image.png" width="" height=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;These are the unique values of the categorical features.&lt;/p&gt;

&lt;p&gt;So, for encoding these features, we use &lt;strong&gt;LabelEncoder&lt;/strong&gt; and &lt;strong&gt;dummy variables&lt;/strong&gt; (or you can also use &lt;strong&gt;OneHotEncoder&lt;/strong&gt;).&lt;/p&gt;

&lt;p&gt;We can use &lt;code&gt;LabelEncoder()&lt;/code&gt; to convert cut_quality to numerical values like 0, 1, 2, … because cut_quality holds ordinal data.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Label encoding using the LabelEncoder function from sklearn
from sklearn.preprocessing import LabelEncoder
label_encoder = LabelEncoder()

df_final['cut_quality'] = label_encoder.fit_transform(df_final['cut_quality'])
df_final.head(2)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyhihllrol9gfq4z34nyx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyhihllrol9gfq4z34nyx.png" alt="image.png" width="" height=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Then we use the &lt;code&gt;get_dummies()&lt;/code&gt; function from pandas to create dummy variables for the color and clarity columns.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# using dummy variables for the remaing categories
df_final = pd.get_dummies(df_final,columns=['color','clarity'])
df_final.head()

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxsqunams3syz5k1ex42e.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxsqunams3syz5k1ex42e.png" alt="image.png" width="" height=""&gt;&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;df_final.shape
--&amp;gt; (46518, 23)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Splitting data for training and testing
&lt;/h3&gt;

&lt;p&gt;We split the data for training and testing using the &lt;code&gt;train_test_split()&lt;/code&gt; method from sklearn, with the test size set to 25% of the original dataset.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;data = df_final.copy()
# Using sklearn for scaling and splitting
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import train_test_split

X = data.drop(columns=['price'])
y = data['price']

# Scaling the data
scaler = StandardScaler()
scaled_df = scaler.fit_transform(X)

X_train, X_test, y_train, y_test = train_test_split(
    scaled_df, y, test_size=0.25)
print("X train shape: {} and y train shape: {}".format(
    X_train.shape, y_train.shape))
print("X test shape: {} and y test shape: {}".format(X_test.shape, y_test.shape))

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
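&lt;p&gt;Note that the snippet above fits the scaler on the full dataset before splitting, which leaks test-set statistics into training. A common refinement, sketched here on toy data, is to fit the scaler on the training split only and then apply it to both splits:&lt;/p&gt;

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler

# Toy feature matrix standing in for the diamond features.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
y = rng.normal(size=100)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

# Fit the scaler on the training split only, then transform both splits.
scaler = StandardScaler().fit(X_train)
X_train_s = scaler.transform(X_train)
X_test_s = scaler.transform(X_test)
print(X_train_s.shape, X_test_s.shape)  # (75, 5) (25, 5)
```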



&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbmjphcpyiy0ixcb2dkvk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbmjphcpyiy0ixcb2dkvk.png" alt="image.png" width="" height=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  KNN from sklearn library Vs KNN built from scratch
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Sklearn KNN model
&lt;/h3&gt;

&lt;p&gt;First, we use the &lt;strong&gt;KNN regressor&lt;/strong&gt; model from sklearn.&lt;/p&gt;

&lt;p&gt;To choose the optimal k value, we loop over k from 1 to 10 and record the RMSE for each.&lt;/p&gt;

&lt;p&gt;In our case, the optimal value is &lt;strong&gt;k = 5&lt;/strong&gt;, so we train the model with it, make predictions, and print the predicted values.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Finding the optimal k value
from sklearn import neighbors
from sklearn.metrics import mean_squared_error
from math import sqrt
import matplotlib.pyplot as plt
rmse_val = []  
for K in range(10):
    K = K+1
    model = neighbors.KNeighborsRegressor(n_neighbors=K)

    model.fit(X_train, y_train)  
    pred = model.predict(X_test)  
    error = sqrt(mean_squared_error(y_test, pred))  
    rmse_val.append(error)  
    print('RMSE value for k = ', K, 'is:', error)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
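&lt;p&gt;Rather than eyeballing the printed RMSE values, the best k can also be read off programmatically with &lt;code&gt;np.argmin&lt;/code&gt;. A small sketch on synthetic data (stand-ins for the scaled diamond features):&lt;/p&gt;

```python
import numpy as np
from sklearn import neighbors
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic regression data standing in for the scaled diamond features.
rng = np.random.default_rng(1)
X = rng.normal(size=(300, 4))
y = X.sum(axis=1) + rng.normal(scale=0.1, size=300)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=0)

rmse_val = []
for K in range(1, 11):
    model = neighbors.KNeighborsRegressor(n_neighbors=K).fit(X_train, y_train)
    pred = model.predict(X_test)
    rmse_val.append(np.sqrt(mean_squared_error(y_test, pred)))

best_k = int(np.argmin(rmse_val)) + 1  # +1 because k starts at 1
print("best k:", best_k)
```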



&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm0mco5cg5xzqe19ozl4g.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm0mco5cg5xzqe19ozl4g.png" alt="image.png" width="" height=""&gt;&lt;/a&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Using the optimal k value.
from sklearn import neighbors

model = neighbors.KNeighborsRegressor(n_neighbors=5)

model.fit(X_train, y_train) # fit the model
pred = model.predict(X_test)
pred

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhh923zhl8pyysopnojg7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fhh923zhl8pyysopnojg7.png" alt="image.png" width="" height=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Now, let's build our own KNN model from scratch using NumPy and pandas.&lt;/p&gt;

&lt;h3&gt;
  
  
  KNN model from scratch
&lt;/h3&gt;

&lt;p&gt;We convert the train and test data into NumPy arrays.&lt;/p&gt;

&lt;p&gt;Then we combine the &lt;strong&gt;X_train&lt;/strong&gt; and &lt;strong&gt;y_train&lt;/strong&gt; into a matrix.&lt;/p&gt;

&lt;p&gt;The matrix will contain the &lt;strong&gt;22 columns&lt;/strong&gt; of &lt;strong&gt;X_train&lt;/strong&gt; plus &lt;strong&gt;1 column&lt;/strong&gt; of &lt;strong&gt;y_train&lt;/strong&gt; at the end (i.e., the last column).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;train = np.array(X_train)
test = np.array(X_test)
y_train = np.array(y_train)
# reshaping the 1-D array into a column vector
y_train = y_train.reshape(-1, 1)
# combining the training dataset and the y_train into a matrix
train_df = np.hstack([train, y_train])
train_df[0:2]

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo3vud59hidrm54twlefh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fo3vud59hidrm54twlefh.png" alt="image.png" width="" height=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Now, for each row (data point) of the test dataset, we find the &lt;strong&gt;Euclidean distance&lt;/strong&gt; between that test point and every point of the training data.&lt;/p&gt;

&lt;p&gt;We use a for loop to iterate through every point in the test dataset, compute the distances, and stack them onto the training matrix train_df.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Steps:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;We find the distances between one test point and every point of the train data set.&lt;/li&gt;
&lt;li&gt;We reshape the distances using &lt;code&gt;reshape(-1,1)&lt;/code&gt; to get an array with 1 column and one row per training point.&lt;/li&gt;
&lt;li&gt;Then using &lt;code&gt;np.hstack()&lt;/code&gt; we stack this distance array into the train_df dataset.&lt;/li&gt;
&lt;li&gt;Now we sort this matrix from smallest to largest based on the distance column.&lt;/li&gt;
&lt;li&gt;We then average the y_train values of the first 5 rows to obtain the prediction.&lt;/li&gt;
&lt;li&gt;Repeat the above steps for every test point and predict the values respectively and store these values in an array.
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;preds = []
for i in range(len(test)):
    distances = np.sqrt(np.sum((train - test[i])**2, axis = 1))
    distances = distances.reshape(-1,1)
    matrix = np.hstack([train_df, distances])
    sorted_matrix = matrix[matrix[:,-1].argsort()]
    neighbours = [sorted_matrix[i][-2] for i in range(5)]
    pred_value = np.mean(neighbours)
    preds.append(pred_value)
knn_scratch_pred = np.array(preds)
knn_scratch_pred

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1gm0lupgnmadivf1i6z7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1gm0lupgnmadivf1i6z7.png" alt="image.png" width="" height=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Comparing Sklearn and Our KNN model
&lt;/h3&gt;

&lt;p&gt;To compare the prediction values from sklearn and our scratch implementation, we build a pandas DataFrame pred_df as shown in the code below.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sklearn_pred = pred.reshape(-1,1)
my_knn_pred = knn_scratch_pred.reshape(-1,1)
predicted_values = np.hstack([sklearn_pred,my_knn_pred])
pred_df = pd.DataFrame(predicted_values,columns=['sklearn_preds','my_knn_preds'])
pred_df

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd1fpfi6a3qkcby09bwsx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd1fpfi6a3qkcby09bwsx.png" alt="image.png" width="" height=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We can see that the predicted values of our scratch KNN match those obtained with the sklearn library, which confirms that our intuition and method are correct.&lt;/p&gt;
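&lt;p&gt;The visual check can also be backed by a numeric one. A self-contained sketch on toy data, comparing sklearn's KNN regressor against the same nearest-neighbour averaging done by hand:&lt;/p&gt;

```python
import numpy as np
from sklearn.neighbors import KNeighborsRegressor

# Tiny toy dataset; the same comparison the article runs on the diamond data.
rng = np.random.default_rng(2)
train = rng.normal(size=(50, 3))
y_tr = rng.normal(size=50)
test = rng.normal(size=(10, 3))

# sklearn predictions with k = 5
sk_pred = KNeighborsRegressor(n_neighbors=5).fit(train, y_tr).predict(test)

# Scratch predictions: mean of the 5 nearest training targets.
scratch_pred = []
for t in test:
    d = np.sqrt(((train - t) ** 2).sum(axis=1))
    nearest = np.argsort(d)[:5]
    scratch_pred.append(y_tr[nearest].mean())
scratch_pred = np.array(scratch_pred)

print(np.allclose(sk_pred, scratch_pred))  # True
```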

&lt;p&gt;For the full code file and the dataset, visit &lt;a href="https://github.com/Dipankar-Medhi/k-nearest-neighbors-KNN" rel="noopener noreferrer"&gt;Github&lt;/a&gt;.&lt;/p&gt;




&lt;p&gt;🌎Explore, 🎓Learn, 👷‍♂️Build. Happy Coding💛&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>knn</category>
      <category>ai</category>
    </item>
    <item>
      <title>How Blockchain is changing the Financial industry?</title>
      <dc:creator>Dipankar Medhi</dc:creator>
      <pubDate>Fri, 18 Mar 2022 17:16:46 +0000</pubDate>
      <link>https://dev.to/dipankarmedhi/how-blockchain-is-changing-the-financial-industry-1gj</link>
      <guid>https://dev.to/dipankarmedhi/how-blockchain-is-changing-the-financial-industry-1gj</guid>
      <description>&lt;p&gt;Hi👋, Today, let us go through the impact of Blockchain technology on the current banking system and how it will change (already changing) the digital transaction and user interaction of sales and exchange. We'll consider scenarios before and after the introduction of blockchain technology and understand the impact of blockchain technology in the world of finance.&lt;/p&gt;

&lt;h2&gt;
  
  
  Before Blockchain Technology
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Barter System - before currency
&lt;/h3&gt;

&lt;p&gt;Earlier, people used to exchange goods or services for other goods or services without using any other form of currency. For example, if one needed a bag of sugar, they had to exchange it for something with equal value in the current market.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy1wkzn9yiqsti8iq8gcg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fy1wkzn9yiqsti8iq8gcg.png" alt="image.png" width="" height=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;credit: historyplex.com&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  Age of Currency
&lt;/h3&gt;

&lt;p&gt;Then governments introduced currency (coins) that the general public could use to buy goods or services, or earn in exchange for them. Some years later came banks, which promised currency in exchange for gold or objects of equal value. Acting as a trusted third party, the bank stores all of the public's transaction history in a &lt;strong&gt;Ledger&lt;/strong&gt;.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;What is a Ledger&lt;/strong&gt;? A &lt;em&gt;ledger&lt;/em&gt; is a book containing all the users' transaction history consisting of all their debits and credits along with the specified time of transaction.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;For more on the ledger, visit &lt;a href="https://www.freshbooks.com/hub/accounting/what-is-a-ledger" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffyxxljw97cx2vpnyl7gx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffyxxljw97cx2vpnyl7gx.png" alt="image.png" width="" height=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;However, governments place many regulations on how people use their own money. Banks restrict users from freely spending and investing their hard-earned money in their needs and interests, and they charge various fees for transactions and fund transfers. Moreover, there is a risk of fraud by these third-party bodies, in which case users might never get their money back. Solving these problems within the traditional system is nearly impossible, and that is where blockchain technology comes in.&lt;/p&gt;

&lt;h2&gt;
  
  
  After Blockchain Technology - 🦸‍♂️Blockchain to the rescue
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6kg4ksx3qetyzptqe25v.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6kg4ksx3qetyzptqe25v.png" alt="image.png" width="" height=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;credits: comoganhardinheiro.pt&lt;/p&gt;
&lt;/blockquote&gt;

&lt;ul&gt;
&lt;li&gt;Instead of a central authority keeping every transaction record, blockchain preserves each digital currency transaction in a &lt;strong&gt;Decentralized Ledger&lt;/strong&gt; shared across the network.&lt;/li&gt;
&lt;li&gt;&lt;p&gt;In 1991, &lt;strong&gt;Stuart Haber&lt;/strong&gt; and &lt;strong&gt;W. Scott Stornetta&lt;/strong&gt; published the paper &lt;strong&gt;"How to Time-Stamp a Digital Document."&lt;/strong&gt; They showed that if we timestamp a digital document and store it in a repository, we can maintain it as a record and secure it for the future. It is also a proper way to tackle the double-spending problem.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Taking this idea as a reference, &lt;strong&gt;Satoshi Nakamoto&lt;/strong&gt; published a white paper describing the world's first &lt;strong&gt;Peer-to-Peer Electronic Cash System&lt;/strong&gt;, which bypasses the traditional centralized system by introducing a few new concepts such as the peer-to-peer network and hash-based blocks.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  A little information on Peer-To-Peer Network
&lt;/h3&gt;

&lt;p&gt;The central concept is to place a copy of the ledger on every computer around the world in the form of &lt;strong&gt;hashed blocks&lt;/strong&gt;. These blocks are cryptographically linked and replicated on systems across the globe, so changing or tampering with them is practically impossible. This form of network builds &lt;strong&gt;trust&lt;/strong&gt; and &lt;strong&gt;integrity&lt;/strong&gt;, making it a highly effective foundation for finance in the 21st century. These are the ideas Satoshi Nakamoto put forward in the system he named Bitcoin.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;To read the full original white paper of Satoshi Nakamoto on Bitcoin, visit &lt;a href="https://bitcoin.org/bitcoin.pdf" rel="noopener noreferrer"&gt;here&lt;/a&gt;.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft5s5zp0lu9pnlk3b2ws7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ft5s5zp0lu9pnlk3b2ws7.png" alt="image.png" width="" height=""&gt;&lt;/a&gt;credits: &lt;a href="https://remitano.com/forum/in/post/295-peer-to-peer-networking-how-its-changing-our-lives" rel="noopener noreferrer"&gt;https://remitano.com/forum/in/post/295-peer-to-peer-networking-how-its-changing-our-lives&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  A small brief on Hashed based system
&lt;/h3&gt;

&lt;p&gt;Hashing makes an object, document, or legal paper uniquely identifiable by giving it a &lt;strong&gt;digital fingerprint&lt;/strong&gt; in the form of a hash. Hashes prove ownership to the respective owners without exposing the details of the object to others. &lt;strong&gt;SHA-256&lt;/strong&gt;, the hash function used in Bitcoin, always produces a fixed-length 256-bit (32-byte) output. These hashes are stored in blocks and added to the blockchain, which is distributed over thousands of systems (nodes) worldwide.&lt;/p&gt;
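&lt;p&gt;The fixed-length property is easy to see with Python's standard &lt;code&gt;hashlib&lt;/code&gt; module (a plain illustration of SHA-256, not Bitcoin's exact block-hashing scheme, which applies SHA-256 twice):&lt;/p&gt;

```python
import hashlib

# Inputs of any length hash to the same fixed-size digest.
h1 = hashlib.sha256(b"hello").hexdigest()
h2 = hashlib.sha256(b"hello world, a much longer message than before").hexdigest()
print(len(h1), len(h2))  # 64 64  (64 hex characters = 256 bits)

# Changing a single character changes the digest completely,
# which is what makes tampering with a block detectable.
h3 = hashlib.sha256(b"hello!").hexdigest()
print(h3 == h1)  # False
```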

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmk79kclh3uxg48amh59m.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmk79kclh3uxg48amh59m.png" alt="image.png" width="" height=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Tampering with the information is practically impossible: as soon as someone changes (or tries to change) the data, the hash value changes and no longer matches the hashes stored in the blocks on the other systems. This mismatch makes the block invalid, keeping the information safe and secure on the blockchain.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbzhnv7jq0685o1x5bcqk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fbzhnv7jq0685o1x5bcqk.png" alt="image.png" width="" height=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;This blog focuses on the impact of blockchain technology on the finance industry, showcasing the changes and improvements it can bring. It covers how the traditional banking system functions, the problems it creates for users around the world, why the current banking system is vulnerable, and how blockchain technology can solve most of these issues, along with how peer-to-peer networks work and how hashing fits into a blockchain network.&lt;/p&gt;

&lt;p&gt;There are many more important factors to be discussed and many essential aspects to consider when talking about blockchain technology. Although I haven't covered them all in this blog, I will surely make more blogs on the critical factors that make blockchain technology a fantastic innovation in the technology industry.&lt;/p&gt;




&lt;p&gt;🌎Explore, 🎓Learn, 👷‍♂️Build. Happy Learning💛&lt;/p&gt;

</description>
      <category>blockchain</category>
      <category>web3</category>
    </item>
    <item>
      <title>Self Driving Car using Tensorflow</title>
      <dc:creator>Dipankar Medhi</dc:creator>
      <pubDate>Tue, 15 Mar 2022 06:16:59 +0000</pubDate>
      <link>https://dev.to/dipankarmedhi/self-driving-car-using-tensorflow-1poc</link>
      <guid>https://dev.to/dipankarmedhi/self-driving-car-using-tensorflow-1poc</guid>
      <description>&lt;p&gt;Welcome👋, Today I will walk you through a Tensorflow project where we'll build a self-driving car based on Nvidia's Self Driving Car model.&lt;/p&gt;

&lt;h3&gt;
  
  
  Prerequisites
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Unity - Go to &lt;a href="https://unity.com/" rel="noopener noreferrer"&gt;Unity&lt;/a&gt; and download the Unity installer. Choose the right version for your system, then run the installer and follow the steps to install the program.&lt;/li&gt;
&lt;li&gt;Simulator - Visit &lt;a href="https://github.com/udacity/self-driving-car-sim" rel="noopener noreferrer"&gt;github/udacity&lt;/a&gt; and follow the instructions mentioned in the &lt;a href="https://github.com/udacity/self-driving-car-sim/blob/master/README.md" rel="noopener noreferrer"&gt;Readme.md&lt;/a&gt; to download and run the simulator as per the system requirements.&lt;/li&gt;
&lt;li&gt;Anaconda/python env - Create a python environment for the model using &lt;a href="https://docs.conda.io/projects/conda/en/latest/user-guide/tasks/manage-environments.html" rel="noopener noreferrer"&gt;conda&lt;/a&gt; or &lt;a href="https://docs.python.org/3/library/venv.html" rel="noopener noreferrer"&gt;python&lt;/a&gt;. &lt;/li&gt;
&lt;li&gt;Tensorflow - Install TensorFlow after creating the conda env. Visit &lt;a href="https://anaconda.org/conda-forge/tensorflow" rel="noopener noreferrer"&gt;here&lt;/a&gt; to learn more. &lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Running the Simulator
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;When we first run the simulator, we will see a screen similar to the one shown below.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8iuq9ramk5s0l22ta8i6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F8iuq9ramk5s0l22ta8i6.png" alt="image.png" width="" height=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Choose the resolution (I suggest &lt;strong&gt;640x480&lt;/strong&gt; ) and graphic quality. &lt;/li&gt;
&lt;li&gt;Then start the simulator by pressing the &lt;strong&gt;Play&lt;/strong&gt; button.&lt;/li&gt;
&lt;li&gt;Next, we'll see a screen with two options, &lt;strong&gt;Training Mode&lt;/strong&gt; and &lt;strong&gt;Autonomous Mode&lt;/strong&gt;. &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frd0gduwabh3okbj9ih0n.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frd0gduwabh3okbj9ih0n.png" alt="image.png" width="" height=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Select a track and choose &lt;strong&gt;Training Mode&lt;/strong&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Training Mode
&lt;/h4&gt;

&lt;p&gt;This mode records the images produced by the 3 cameras (left, centre, right) mounted on the front of the car. All the captured images are saved to the local disk along with the &lt;strong&gt;steering&lt;/strong&gt;, &lt;strong&gt;throttle&lt;/strong&gt;, &lt;strong&gt;brake&lt;/strong&gt; and &lt;strong&gt;speed&lt;/strong&gt; values in a CSV file named &lt;strong&gt;driving_log.csv&lt;/strong&gt;.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;For more accurate results, run the car for 8-10 laps.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  The Goal
&lt;/h3&gt;

&lt;p&gt;The goal of the project is to drive the car autonomously in Autonomous Mode, using a deep neural network trained on the data collected in Training Mode.&lt;/p&gt;

&lt;h3&gt;
  
  
  Let's Start Coding!
&lt;/h3&gt;

&lt;h4&gt;
  
  
  Exploratory Data Analysis
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;We import the data and necessary libraries.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import pandas as pd
import numpy as np
import os
import cv2
import tensorflow as tf
import matplotlib.pyplot as plt
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Convolution2D, Flatten, Dense, Dropout
from tensorflow.keras.optimizers import Adam

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;Load the data and view the head using &lt;code&gt;df.head()&lt;/code&gt;.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;columns = ['center', 'left', 'right', 'steering', 'throttle','brake', 'speed']
df = pd.read_csv(os.path.join('E:\dev\SelfDrivingCar','driving_log.csv'), names = columns)
df.head()

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0ik8b7oozjtj5wlqbtt8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0ik8b7oozjtj5wlqbtt8.png" alt="image.png" width="" height=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Plotting the steering values for visual insights.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;plt.hist(df.steering)
plt.show()

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Faecx5wjsd4ndk6h7crmr.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Faecx5wjsd4ndk6h7crmr.png" alt="image.png" width="" height=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;We can also check its skewness:
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;print("Skewness of the steering feature:\n", df['steering'].skew())

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;We are going to use the &lt;strong&gt;steering&lt;/strong&gt; column as the dependent variable. Our goal is to predict the steering values from the images produced by the simulator.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;ul&gt;
&lt;li&gt;Checking the image using &lt;strong&gt;OpenCV&lt;/strong&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;img = cv2.imread(df['center'][0])
img = cv2.cvtColor(img, cv2.COLOR_BGR2RGB)
plt.imshow(img)

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Foaffrfcmnoes37tvffoo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Foaffrfcmnoes37tvffoo.png" alt="image.png" width="" height=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The image is okay, but there are many unnecessary objects like mountains, trees, sky, etc. that we can remove from the image and only keep the road track for training.&lt;/li&gt;
&lt;/ul&gt;

&lt;h4&gt;
  
  
  Image Preprocessing and Data Augmentation
&lt;/h4&gt;

&lt;p&gt;Before moving to the training process, it is important to remove unwanted data and keep the images simple for training the model. Image preprocessing may also decrease model training time and increase model inference speed.&lt;/p&gt;

&lt;p&gt;Image augmentation creates additional training data from the images already available, helping the model generalize and preventing overfitting.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Define a function &lt;code&gt;image_preprocessing()&lt;/code&gt; that takes an image path as input, crops the image, and converts it to YUV.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def image_preprocessing(path):
    # cropping image
    img = cv2.imread(path)
    cropped_img = img[60:160,:]
    # color conversion from BGR to YUV
    final_img = cv2.cvtColor(cropped_img, cv2.COLOR_BGR2YUV)
    # application of gaussian blur
    final_img = cv2.GaussianBlur(final_img,(3,5),0)
    # resize image
    output = cv2.resize(final_img, (300,80))
    # normalizing image
    output = output/255
    return output

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;Create a function &lt;code&gt;data_augmentation()&lt;/code&gt; that accepts the image processing function and outputs augmented images and augmented steering features.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def data_augmentation(img_process):
    images = []
    steerings = []
    # for each row in the dataset
    for row in range(df.shape[0]):
        # for ith column
        for i in range(3):
            # splitting image path and filename
            fileName = mod_name(df.iloc[row, i])
            filePath = './IMG/'+ fileName
            # processing the images
            img = img_process(filePath)
            images.append(img)
            steerings.append(df['steering'][row])

    # image and measurement augmentation
    augmented_images, augmented_steerings = [], []
    for image, steering in zip(images, steerings):
        augmented_images.append(image)
        augmented_steerings.append(steering)

        # horizontally flipping the images
        flipped_img = cv2.flip(image, 1)
        augmented_images.append(flipped_img)
        # changing the sign to match the flipped images
        augmented_steerings.append(-1*steering)

    return augmented_images, augmented_steerings

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ul&gt;
&lt;li&gt;We store the augmented images and augmented steering values in two variables, then print a sample steering value and display the corresponding processed image to check that everything works.&lt;/li&gt;
&lt;li&gt;We use matplotlib to view the images.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;augmented_images, augmented_steerings = data_augmentation(image_preprocessing)
print(augmented_steerings[100])
plt.imshow(augmented_images[100])
plt.show()

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flz9glq8ynv4x27k2e337.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flz9glq8ynv4x27k2e337.png" alt="image.png" width="" height=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Training and Validation
&lt;/h3&gt;

&lt;p&gt;The next step is to prepare the training and validation dataset.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;First, we store the augmented images and augmented steering values separately in X and y variables.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;X = np.array(augmented_images)
y = np.array(augmented_steerings)

X.shape

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;(7698, 80, 300, 3)&lt;/p&gt;
&lt;/blockquote&gt;

&lt;ul&gt;
&lt;li&gt;We then split the dataset for training using the &lt;code&gt;train_test_split&lt;/code&gt; method from the &lt;strong&gt;sklearn&lt;/strong&gt; library.
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from sklearn.model_selection import train_test_split
xtrain, xval, ytrain, yval = train_test_split(X, y, test_size = 0.2, random_state = 1)
print('Train images:',len(xtrain))
print('Validation images:',len(xval))

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;blockquote&gt;
&lt;p&gt;Train images: 6158 Validation images: 1540&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  Model Building and Training
&lt;/h3&gt;

&lt;p&gt;The &lt;a href="https://www.i-programmer.info/news/105-artificial-intelligence/9678-nvidias-neural-network-drives-a-car.html" rel="noopener noreferrer"&gt;model&lt;/a&gt; architecture is based on Nvidia's neural network for self-driving cars.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;model = Sequential()
model.add(Convolution2D(24,(5,5),(2,2),input_shape=xtrain[0].shape))

model.add(Convolution2D(36,(5,5),(2,2),activation='elu'))
model.add(Convolution2D(48,(5,5),(2,2),activation='elu'))
# the feature maps are small by this point, so we use the default 1x1 stride instead of 2x2.
model.add(Convolution2D(64,(3,3),activation='elu'))
model.add(Convolution2D(64,(3,3),activation='elu'))
model.add(Dropout(0.2))
model.add(Flatten())
model.add(Dense(100,activation='elu'))
model.add(Dense(50,activation='elu'))
model.add(Dense(10,activation='elu'))
model.add(Dense(1))

# accuracy is not meaningful for regression, so we track only the MSE loss
model.compile(Adam(learning_rate=0.0001), loss='mse')
print(model.summary())

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqp98shgqzh3mq01oo878.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqp98shgqzh3mq01oo878.png" alt="image.png" width="" height=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;We use early stopping to halt training once the validation loss stops improving, which helps prevent overfitting.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Finally, we save the trained model so it can be loaded later to drive the car in the simulator.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;
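&lt;p&gt;The post doesn't show this training code, so here is a minimal sketch of early stopping and saving with Keras. The tiny stand-in model and random arrays exist only so the snippet runs on its own; in the article, &lt;code&gt;model&lt;/code&gt;, &lt;code&gt;xtrain&lt;/code&gt;, &lt;code&gt;ytrain&lt;/code&gt;, &lt;code&gt;xval&lt;/code&gt; and &lt;code&gt;yval&lt;/code&gt; already come from the steps above.&lt;/p&gt;

```python
# Sketch (assumed, not code from the post) of early stopping + saving.
# A tiny stand-in model and random data make this runnable standalone.
import numpy as np
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Input, Flatten, Dense
from tensorflow.keras.callbacks import EarlyStopping
from tensorflow.keras.optimizers import Adam

rng = np.random.default_rng(1)
xtrain = rng.normal(size=(32, 80, 300, 3)).astype('float32')
ytrain = rng.normal(size=32).astype('float32')
xval = rng.normal(size=(8, 80, 300, 3)).astype('float32')
yval = rng.normal(size=8).astype('float32')

model = Sequential([Input(shape=(80, 300, 3)), Flatten(), Dense(1)])
model.compile(Adam(learning_rate=0.0001), loss='mse')

# stop once validation loss hasn't improved for `patience` epochs,
# and roll back to the best weights seen so far
early_stop = EarlyStopping(monitor='val_loss', patience=3,
                           restore_best_weights=True)

history = model.fit(xtrain, ytrain,
                    validation_data=(xval, yval),
                    epochs=20, batch_size=8,
                    callbacks=[early_stop], verbose=0)

model.save('model.h5')  # the saved model can later be loaded for inference
```

The same callback and save call would be passed to the full Nvidia-style model defined earlier.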

&lt;h3&gt;
  
  
  Evaluating Training and Validation loss
&lt;/h3&gt;

&lt;p&gt;Plot the training loss and validation loss using matplotlib.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;plt.plot(history.history['loss'])
plt.plot(history.history['val_loss'])
plt.legend(['Training', 'Validation'])
plt.title('loss')
plt.xlabel('epoch')
plt.show()

&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdxd3jngkbm49eq34o61x.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdxd3jngkbm49eq34o61x.png" alt="image.png" width="" height=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Conclusion
&lt;/h3&gt;

&lt;p&gt;The overall model is not an exact replica of the original Nvidia architecture; it is an implementation of the idea behind it, so there is still room for improvement. The model's accuracy and losses can be improved further with proper hyperparameter tuning and data preprocessing. In my case, I collected training images from only 2 to 3 laps of the track, so driving more laps should improve the results. I also converted the images to YUV; converting them to grayscale instead, or keeping only the edges, might yield better results.&lt;/p&gt;




&lt;p&gt;🌎Explore, 🎓Learn, 👷‍♂️Build. Happy Coding💛&lt;/p&gt;

</description>
    </item>
  </channel>
</rss>
