<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Agustin Navcevich</title>
    <description>The latest articles on DEV Community by Agustin Navcevich (@agusnavce).</description>
    <link>https://dev.to/agusnavce</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F52425%2F6dd6bd19-0daa-434c-9280-d92a6f235629.jpg</url>
      <title>DEV Community: Agustin Navcevich</title>
      <link>https://dev.to/agusnavce</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/agusnavce"/>
    <language>en</language>
    <item>
      <title>A quick overview of the implementation of a fast spelling correction algorithm</title>
      <dc:creator>Agustin Navcevich</dc:creator>
      <pubDate>Sun, 12 Apr 2020 22:46:48 +0000</pubDate>
      <link>https://dev.to/agusnavce/a-quick-overview-of-the-implementation-of-a-fast-spelling-correction-algorithm-4i77</link>
      <guid>https://dev.to/agusnavce/a-quick-overview-of-the-implementation-of-a-fast-spelling-correction-algorithm-4i77</guid>
      <description>&lt;p&gt;Spellcheckers and autocorrect can feel like magic. They’re at the core of everyday applications — our phones, office software, google search (you type in ‘incorect’ and google returns ‘Did you mean incorrect).&lt;/p&gt;

&lt;p&gt;So how does this magic happen? And how can we build our own?&lt;/p&gt;

&lt;p&gt;First let’s consider what a spellchecker does.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;strong&gt;TLDR ;)&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Implementation of the algorithm in my &lt;a href="https://github.com/agusnavce/ta"&gt;github repo&lt;/a&gt;.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h3&gt;
  
  
  Basics of Text Correction
&lt;/h3&gt;

&lt;p&gt;In a nutshell, a basic text corrector needs at least three components:&lt;/p&gt;

&lt;h4&gt;
  
  
  Candidate Model
&lt;/h4&gt;

&lt;p&gt;The candidate model produces a list of potential correct terms. Candidates are generated by applying all possible single-character edits (adding, substituting, transposing or removing a character) within a given &lt;em&gt;edit distance&lt;/em&gt; of the original word.&lt;/p&gt;

&lt;p&gt;This edit distance is simply a measure of how many operations (adding, substituting, transposing or deleting a character) are needed to convert one word into another.&lt;/p&gt;
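
&lt;p&gt;As a sketch (not the article’s implementation), a Damerau–Levenshtein distance that counts all four operations can be written as:&lt;/p&gt;

```python
def edit_distance(a, b):
    """Damerau-Levenshtein distance: minimum number of insertions,
    deletions, substitutions and adjacent transpositions."""
    d = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i in range(len(a) + 1):
        d[i][0] = i
    for j in range(len(b) + 1):
        d[0][j] = j
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = 0 if a[i - 1] == b[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
            if i > 1 and j > 1 and a[i - 1] == b[j - 2] and a[i - 2] == b[j - 1]:
                d[i][j] = min(d[i][j], d[i - 2][j - 2] + 1)  # transposition
    return d[len(a)][len(b)]

print(edit_distance("incorect", "incorrect"))  # 1 (one missing 'r')
print(edit_distance("magic", "mgaic"))         # 1 (one transposition)
```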

&lt;p&gt;A lot of candidates will be produced at this point, but not all of them are meaningful: the candidates must be checked against a corpus of known valid terms.&lt;/p&gt;
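
&lt;p&gt;A minimal sketch of this step, in the style of Peter Norvig’s spelling corrector: generate every string one edit away, then keep only those that appear in the corpus (the tiny word set below is a made-up example):&lt;/p&gt;

```python
def edits1(word):
    """Every string exactly one edit away from `word`:
    deletes, transpositions, replacements and insertions."""
    letters = "abcdefghijklmnopqrstuvwxyz"
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    deletes = [L + R[1:] for L, R in splits if R]
    transposes = [L + R[1] + R[0] + R[2:] for L, R in splits if len(R) > 1]
    replaces = [L + c + R[1:] for L, R in splits if R for c in letters]
    inserts = [L + c + R for L, R in splits for c in letters]
    return set(deletes + transposes + replaces + inserts)

known_words = {"spell", "spelt", "speed", "smell"}  # toy corpus
candidates = edits1("spel") & known_words
print(candidates)  # {'spell', 'spelt'} (in some order)
```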

&lt;h4&gt;
  
  
  Language Model
&lt;/h4&gt;

&lt;p&gt;The language model is essentially a corpus of known valid words, together with the frequency (or probability) with which each word appears. Such a model is usually built by text-mining a large body of literature.&lt;/p&gt;
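
&lt;p&gt;A rough sketch of such a model, built from a toy corpus with the Python standard library:&lt;/p&gt;

```python
from collections import Counter
import re

def build_language_model(text):
    """Word -> number of occurrences in the corpus."""
    return Counter(re.findall(r"[a-z]+", text.lower()))

corpus = "The spell was cast. The spell worked, and the magic was real."
model = build_language_model(corpus)
print(model["the"])    # 3
print(model["spell"])  # 2
```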

&lt;h4&gt;
  
  
  Selection Criteria
&lt;/h4&gt;

&lt;p&gt;The selection criteria determine which candidate is the “right” word to use as the correction. One possible criterion is to score each candidate by its edit distance (the lower the better) and by how often it appears in our language model (the higher the better), then select the candidate with the best score.&lt;/p&gt;
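
&lt;p&gt;As a sketch, with candidates already annotated with their edit distance and a frequency table taken from the language model, the selection can be as simple as:&lt;/p&gt;

```python
def best_candidate(candidates, freq):
    """candidates: word -> edit distance from the misspelled input.
    freq: word -> corpus frequency.
    Prefer the smallest edit distance; break ties by frequency."""
    return min(candidates, key=lambda w: (candidates[w], -freq.get(w, 0)))

candidates = {"incorrect": 1, "correct": 3, "inject": 4}
freq = {"incorrect": 120, "correct": 300, "inject": 15}
print(best_candidate(candidates, freq))  # incorrect
```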

&lt;h3&gt;
  
  
  — Let’s break it down —
&lt;/h3&gt;

&lt;p&gt;So now that we know the basics of text correction, let’s focus on the most important of these three parts for implementing the spellchecker: the selection criteria.&lt;/p&gt;

&lt;p&gt;There are a few different ways to pick candidates:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The most simplistic approach is to compute the edit distance between your term and every word in the dictionary. This method, while effective, is extremely costly.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://en.wikipedia.org/wiki/Phonetic_algorithm"&gt;&lt;strong&gt;Phonetic algorithms&lt;/strong&gt;&lt;/a&gt; such as Soundex, Phonex or Metaphone. Such algorithms can transform any string into a short sequence allowing for the indexing of string by pronunciation. Both ‘H600’ will return ‘Hurry’ and ‘Hirry.’ You can easily identify phonetically related candidates by pre-processing the entire vocabulary and indexing it using phonetic code. Rapid on runtime but it just corrects phonetic errors.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Computing a list of potential misspellings for your word&lt;/strong&gt; (insertions, omissions, transpositions and replacements) and matching it against your dictionary. Although this is better than the naive method, it is very slow, because the set of candidate misspellings grows at a rate of 54 * length + 25. For further detail see Peter Norvig’s excellent article about spelling correction.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Symmetric spelling correction&lt;/strong&gt;, which takes the previous idea and expands it by computing misspellings for both the dictionary and the misspelled input term. This technique is both reliable and blazing fast, but at the cost of considerable precomputation and disk space, and it requires your dictionary’s frequency list.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Given all these options, the one chosen here to make the correction is the last one. Let’s see what it’s about.&lt;/p&gt;

&lt;h3&gt;
  
  
  SymSpell
&lt;/h3&gt;

&lt;p&gt;SymSpell is an algorithm for finding, in very short time, all strings within a fixed edit distance of the entries in a large list of strings. SymSpell derives its speed from the &lt;strong&gt;Symmetric Delete spelling correction algorithm&lt;/strong&gt; and keeps its memory requirement in check through &lt;strong&gt;prefix indexing&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;The &lt;strong&gt;Symmetric Delete spelling correction algorithm&lt;/strong&gt; reduces the complexity of both edit-candidate generation and dictionary lookup for a given edit distance. It is six orders of magnitude faster than the traditional approach (with deletes + transposes + substitutes + inserts) and language independent.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Unlike other algorithms, only deletes are required: no transposes + replaces + inserts. Transposes + replaces + inserts of the input term are transformed into deletes of the dictionary term.&lt;/strong&gt; Replaces and inserts are expensive and language dependent: e.g. Chinese has 70,000 Unicode Han characters!&lt;/p&gt;

&lt;p&gt;The speed comes from inexpensive &lt;strong&gt;delete-only edit candidate generation&lt;/strong&gt; and &lt;strong&gt;pre-calculation&lt;/strong&gt;. An average 5 letter word has about &lt;strong&gt;3 million possible spelling errors&lt;/strong&gt; within a maximum edit distance of 3, but SymSpell needs to generate only &lt;strong&gt;25 deletes&lt;/strong&gt; to cover them all, both at pre-calculation and at lookup time. Magic!&lt;/p&gt;
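
&lt;p&gt;To make that concrete, here is a sketch of the delete-only generation. For a 5-letter word with no repeated letters, deleting up to 3 characters yields exactly 5 + 10 + 10 = 25 variants:&lt;/p&gt;

```python
from itertools import combinations

def deletes_up_to(word, max_distance):
    """All strings obtained from `word` by removing up to
    `max_distance` characters (the only edits SymSpell generates)."""
    variants = set()
    for d in range(1, min(max_distance, len(word)) + 1):
        for keep in combinations(range(len(word)), len(word) - d):
            variants.add("".join(word[i] for i in keep))
    return variants

print(len(deletes_up_to("magic", 3)))  # 25
```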

&lt;p&gt;The idea behind &lt;strong&gt;prefix indexing&lt;/strong&gt; is that the &lt;strong&gt;discriminatory power of additional chars is decreasing with word length&lt;/strong&gt;. Thus by restricting the delete candidate generation to the prefix, we can save space, without sacrificing filter efficiency too much.&lt;/p&gt;

&lt;p&gt;The algorithm is also fast because index access at search time uses a &lt;strong&gt;hash table&lt;/strong&gt;, with an average search time complexity of O(1).&lt;/p&gt;

&lt;p&gt;The SymSpell algorithm exploits the fact that the edit distance between two terms is symmetrical so we can combine both and meet in the middle, by transforming the correct dictionary terms to erroneous strings, and transforming the erroneous input term to the correct strings.&lt;br&gt;&lt;br&gt;
Because adding a char on the dictionary is equivalent to removing a char from the input string and vice versa, we can on both sides restrict the transformation to deletes only.&lt;/p&gt;
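
&lt;p&gt;Putting both sides together, a toy version of the symmetric delete lookup might look like this (names and the tiny dictionary are illustrative, not the article’s implementation):&lt;/p&gt;

```python
from itertools import combinations

def deletes(word, max_dist):
    """Delete-only variants of `word`, up to `max_dist` removed chars."""
    out = set()
    for d in range(1, min(max_dist, len(word)) + 1):
        for keep in combinations(range(len(word)), len(word) - d):
            out.add("".join(word[i] for i in keep))
    return out

def build_index(dictionary, max_dist=2):
    """Precomputation: hash table mapping each delete variant
    back to the dictionary words it came from."""
    index = {}
    for word in dictionary:
        for key in {word} | deletes(word, max_dist):
            index.setdefault(key, set()).add(word)
    return index

def lookup(term, index, max_dist=2):
    """Meet in the middle: deletes of the input term hit the
    precomputed deletes of the dictionary terms."""
    hits = set()
    for key in {term} | deletes(term, max_dist):
        hits |= index.get(key, set())
    return hits

index = build_index({"hurry", "hello", "spell"})
print(lookup("helo", index))  # {'hello'}
```

&lt;p&gt;Note that a shared delete key only marks a match candidate: the real implementation verifies each hit with an actual edit-distance computation and ranks the survivors by frequency.&lt;/p&gt;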

&lt;p&gt;Another advantage of this algorithm is that its lookup is constant time O(1), i.e. independent of the dictionary size (though dependent on the average term length and maximum edit distance), because the index is based on a &lt;a href="http://en.wikipedia.org/wiki/Hash_table"&gt;Hash Table&lt;/a&gt; with an average search time complexity of O(1).&lt;/p&gt;

&lt;p&gt;Finally, to see the potential of this algorithm, let’s compare it against other approaches. For example, &lt;a href="http://en.wikipedia.org/wiki/BK-tree"&gt;BK-Trees&lt;/a&gt; have a search time of O(log dict_size), whereas the SymSpell algorithm runs in constant time O(1). Another data structure widely used in spell checking is the &lt;a href="http://en.wikipedia.org/wiki/Trie"&gt;Trie&lt;/a&gt;. Tries have &lt;strong&gt;comparable search performance&lt;/strong&gt; to the SymSpell approach. But a Trie is a prefix tree, which requires a common prefix. This makes it suitable for autocomplete or search suggestions, but &lt;strong&gt;not for spell checking&lt;/strong&gt;: if your typing error is, e.g., in the first letter, then there is no common prefix, and the Trie will not find the correction.&lt;/p&gt;

&lt;p&gt;If you got this far and want to learn more about this algorithm or see it in action, you can visit my &lt;a href="https://github.com/agusnavce/ta"&gt;github repo&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Please share any thoughts or comments you have. Feel free to ask and correct me if I’ve made some mistakes.&lt;/p&gt;

&lt;p&gt;Thanks for your time!&lt;/p&gt;

</description>
      <category>searchengines</category>
      <category>textanalysis</category>
      <category>spellcheck</category>
    </item>
    <item>
      <title>Vue: TextArea component with custom Spell-Check support</title>
      <dc:creator>Agustin Navcevich</dc:creator>
      <pubDate>Sat, 28 Mar 2020 19:53:04 +0000</pubDate>
      <link>https://dev.to/agusnavce/vue-textarea-component-with-custom-spell-check-support-4h48</link>
      <guid>https://dev.to/agusnavce/vue-textarea-component-with-custom-spell-check-support-4h48</guid>
      <description>&lt;p&gt;Recently I worked on a project where implementing a custom-made spell checker emulating the spell checker used by Gmail was a necessity.&lt;/p&gt;

&lt;p&gt;As I work in a product company, I wanted a Vue component that did not rely on third-party libraries. So I created a custom component from scratch, and in this article I give a quick overview of the process of creating it.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--YuAnX7h3--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/626/1%2A5TbnKDQ6-0jqKYa_V4R7pQ.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--YuAnX7h3--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/626/1%2A5TbnKDQ6-0jqKYa_V4R7pQ.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Hands on
&lt;/h3&gt;

&lt;p&gt;I’m going to explain this process by showing the building blocks that make the component possible. The component will have all the functionality that an input has, such as a label and a placeholder, plus one more feature: the possibility to add custom spell checking.&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;


&lt;p&gt;So, this is our component skeleton. From here I started working to create the component I wanted. Now let’s start looking at the parts that needed to be built to get the input with corrections.&lt;/p&gt;

&lt;h4&gt;
  
  
  — The word with suggestions element —
&lt;/h4&gt;

&lt;p&gt;One of the basic parts of our component is the element that contains those words that need to be underlined since they have a correction.&lt;/p&gt;

&lt;p&gt;To implement this, a separate component was built. Its job is to receive the text and the corrections and paint the word so that it can later be corrected. Therefore, the input of our component is going to be a word and a list of suggestions.&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;


&lt;p&gt;This component has two different parts. The first one is the highlighted word: a span was created to highlight it. The other one is the list of suggestions that pops up when the word is clicked. For this to happen, two events were bound to the word: left click and right click, via the click and contextmenu handlers. Within these handlers, the flag that makes the suggestions visible is set to true. The other function we have selects the word in order to correct it; this will be addressed later within the parent component. For now we just say that we have a function that emits the word together with the suggestion to correct.&lt;/p&gt;

&lt;p&gt;Now that the baseSpellingWord component is built, let’s continue with our parent component. For the component to behave as an input, we have to make it reactive. Before that, the div must be editable so text can be written inside of it. Enabling the contentEditable property allows this, and setting the spellcheck property to false stops the browser from making its own spelling corrections within the component.&lt;/p&gt;

&lt;p&gt;Making an editable-content component reactive has some gotchas, but let’s explain the easy way to do it. First of all, a reference is added to the component so it can be called from other parts of the component. The listeners are also bound with the v-on directive, adding a custom function for the input action. There, the value that’s inside our content-editable component is emitted.&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;


&lt;p&gt;Now the component is reactive. If you pay attention, you’ll see a function called parseHTMLtoText was added to the component. This function serves to remove all elements within our component and get the clean input.&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;


&lt;p&gt;Once we have the reactive component, the last step that remains is to add the text with the corrections and have it coexist with the text that has no corrections.&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;


&lt;p&gt;A new entity was created for these two worlds to coexist: textWithCorrections. This entity is an array of objects, where each object has a property with the original phrase and, if there are suggestions, a property with all of them.&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;


&lt;p&gt;In order to work with this entity, two functions were created. One takes care of updating the array every time a new suggestion arrives; to do this effectively we use a watch, so that every time the suggestions change this function is called. The other function removes a suggestion given a word; this is the function called from the word component we created first.&lt;/p&gt;

&lt;p&gt;After this we have our component completed and ready to use. I hope you take with you some ideas on how to work with a component like this and how to use it on your applications.&lt;/p&gt;

&lt;p&gt;Please share any thoughts or comments you have. Feel free to ask and correct me if I’ve made some mistakes.&lt;/p&gt;

&lt;p&gt;Thanks for your time!&lt;/p&gt;

</description>
      <category>components</category>
      <category>spellcheck</category>
      <category>vue</category>
    </item>
    <item>
      <title>NGINX server with SSL certificates with Let’s Encrypt in Docker</title>
      <dc:creator>Agustin Navcevich</dc:creator>
      <pubDate>Thu, 05 Mar 2020 00:05:47 +0000</pubDate>
      <link>https://dev.to/agusnavce/nginx-server-with-ssl-certificates-with-let-s-encrypt-in-docker-1jce</link>
      <guid>https://dev.to/agusnavce/nginx-server-with-ssl-certificates-with-let-s-encrypt-in-docker-1jce</guid>
      <description>&lt;p&gt;One of the problems I’ve been facing lately was to create a service that was served by SSL/TLS protocol. Most of the guides that can be found online show you some simple steps of installing a service without HTTPS listening in port 80 and go no further. For this reason,, is that I came up with this guide on how to serve a service through nginx that is served through HTTPS and that certificates are managed by &lt;a href="https://letsencrypt.org/es/"&gt;Let’s Encrypt&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Let’s Encrypt’s Certbot in a Docker Container
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--cUMkRjAK--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/820/0%2A8n3HLC2XnogfMQSA.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--cUMkRjAK--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/820/0%2A8n3HLC2XnogfMQSA.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Before we can execute the Certbot command that installs a new certificate, we need to run a very basic instance of Nginx so that our domain is accessible over HTTP.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;In order for Let’s Encrypt to issue you a certificate, an ACME Challenge Request is performed:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;You issue a command to the Certbot agent&lt;/li&gt;
&lt;li&gt;Certbot informs Let’s Encrypt that you want an SSL/TLS certificate&lt;/li&gt;
&lt;li&gt;Let’s Encrypt sends the Certbot agent a unique token&lt;/li&gt;
&lt;li&gt;The Certbot agent places the token at an endpoint on your domain that looks like: http://{domain}/.well-known/acme-challenge/{token}&lt;/li&gt;
&lt;li&gt;If the token at the endpoint matches the token that was sent to the Certbot agent from the Let’s Encrypt CA, the challenge request was successful and Let’s Encrypt knows that you are in control of the domain.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;This basic instance of Nginx will only ever be run for the first time that you request a certificate from Let’s Encrypt. It’s a basic instance because it doesn’t even need to have a default page. It just needs to give write permissions to the Certbot agent so that it can place a token at an endpoint for the challenge request and that’s all.&lt;/p&gt;

&lt;p&gt;We can’t configure a single instance of Nginx because the first instance of Nginx will only be configured for HTTP since we do not have an SSL/TLS certificate yet. Once we have the SSL/TLS certificate, we can configure SSL/TLS on the full production version of the site. If we then need to renew a certificate between 60 and 90 days after the first certificate was issued, the subsequent challenge requests will be performed on the production version of our site running on Nginx, and so we won’t ever have to run the basic instance of Nginx again.&lt;/p&gt;

&lt;h3&gt;
  
  
  Obtaining the Let’s Encrypt SSL/TLS Certificate
&lt;/h3&gt;

&lt;p&gt;We need to create a docker compose that does the following:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Pulls the latest version of Nginx from the Docker registry&lt;/li&gt;
&lt;li&gt;Exposes port 80 on the container to port 80 on the host, which means that requests to your domain on port 80 will be forwarded to nginx running in the Docker container&lt;/li&gt;
&lt;li&gt;Maps the nginx configuration file that we will create in the next step to the configuration location in the Nginx container. When the container starts, it will load our custom configuration&lt;/li&gt;
&lt;li&gt;Maps the Let’s Encrypt location to the default location of Nginx in the container.&lt;/li&gt;
&lt;li&gt;Creates a default Docker network
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# docker-compose.yml

services:

letsencrypt-nginx-container:
    container\_name: 'letsencrypt-nginx-container'
    image: nginx:latest
    ports:
      - "80:80"
    volumes:
      - ./nginx.conf:/etc/nginx/conf.d/default.conf
    networks:
      - docker-network

networks:
  docker-network:
    driver: bridge
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;Then, create the configuration file for nginx&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# nginx.conf
server {
    listen 80;
    listen [::]:80;
    server_name {domains};
    location ~ /.well-known/acme-challenge {
        allow all;
        root /usr/share/nginx/html;
    }
    root /usr/share/nginx/html;
    index index.html;
}
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;The nginx configuration file does the following:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Listens for requests on port 80 for URLs in the domain&lt;/li&gt;
&lt;li&gt;Gives the Certbot agent access to /.well-known/acme-challenge&lt;/li&gt;
&lt;li&gt;Sets the default root and file&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Before running the Certbot command, spin up an Nginx container in Docker to ensure the temporary Nginx site is up and running:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sudo docker-compose up -d
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;Then, open up a browser and visit the domain to ensure that the Docker container is up and running and accessible.&lt;/p&gt;

&lt;p&gt;We’re almost ready to execute the Certbot command. But before we do, you need to be aware that Let’s Encrypt has rate limits. Most notably, there’s a limit of 20 issued certificates per domain per 7 days. So if you exceed 20 requests and are having a problem generating your certificate for whatever reason, you could run into trouble. Therefore, it’s always wise to run your commands with the --staging parameter, which lets you test whether your commands will execute properly before running the real ones.&lt;/p&gt;

&lt;p&gt;Run the staging command for issuing a new certificate:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sudo docker run -it --rm \
-v /docker-volumes/etc/letsencrypt:/etc/letsencrypt \
-v /docker-volumes/var/lib/letsencrypt:/var/lib/letsencrypt \
-v /docker/letsencrypt-docker-nginx/src/letsencrypt/letsencrypt-site:/data/letsencrypt \
-v "/docker-volumes/var/log/letsencrypt:/var/log/letsencrypt" \
certbot/certbot \
certonly --webroot \
--register-unsafely-without-email --agree-tos \
--webroot-path=/data/letsencrypt \
--staging \
-d {domain}
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;Issue a new Let’s Encrypt Certificate with Certbot and Docker in Staging Mode&lt;/p&gt;

&lt;p&gt;The command does the following:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Run docker in interactive mode so that the output is visible in terminal&lt;/li&gt;
&lt;li&gt;Once the process finishes, stop and remove the container&lt;/li&gt;
&lt;li&gt;Map 4 volumes from the server to the Certbot Docker Container:&lt;/li&gt;
&lt;li&gt;The Let’s Encrypt Folder where the certificates will be saved&lt;/li&gt;
&lt;li&gt;Lib folder&lt;/li&gt;
&lt;li&gt;Map our html and other pages in our site folder to the data folder that Let’s Encrypt will use for challenges&lt;/li&gt;
&lt;li&gt;Map a logging path for possible troubleshooting if needed&lt;/li&gt;
&lt;li&gt;For staging, we’re not specifying an email address&lt;/li&gt;
&lt;li&gt;We agree to terms of service&lt;/li&gt;
&lt;li&gt;Specify the webroot path&lt;/li&gt;
&lt;li&gt;Run as staging&lt;/li&gt;
&lt;li&gt;Issue the certificate to be valid for the A record and the CNAME record&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;You can also get some additional information about certificates for your domain by running the Certbot certificates command:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sudo docker run --rm -it --name certbot \
-v /docker-volumes/etc/letsencrypt:/etc/letsencrypt \
-v /docker-volumes/var/lib/letsencrypt:/var/lib/letsencrypt \
-v /docker/letsencrypt-docker-nginx/src/letsencrypt/letsencrypt-site:/data/letsencrypt \
certbot/certbot \
--staging \
certificates
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;Get Additional Information with the Certbot Certificates Command&lt;/p&gt;

&lt;p&gt;If the staging command executed successfully, execute the command to obtain a live certificate.&lt;/p&gt;

&lt;p&gt;First, clean up staging artifacts:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sudo rm -rf /docker-volumes/
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;And then request a production certificate: (note that it’s a good idea to supply your email address so that Let’s Encrypt can send expiry notifications)&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sudo docker run -it --rm \
-v /docker-volumes/etc/letsencrypt:/etc/letsencrypt \
-v /docker-volumes/var/lib/letsencrypt:/var/lib/letsencrypt \
-v /docker/letsencrypt-docker-nginx/src/letsencrypt/letsencrypt-site:/data/letsencrypt \
-v "/docker-volumes/var/log/letsencrypt:/var/log/letsencrypt" \
certbot/certbot \
certonly --webroot \
--email youremail@domain.com --agree-tos --no-eff-email \
--webroot-path=/data/letsencrypt \
-d {domain}
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;If everything ran successfully, run a docker-compose down command to stop the temporary Nginx site&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;cd /docker/letsencrypt-docker-nginx/src/letsencrypt

sudo docker-compose down
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;h3&gt;
  
  
  Set up Your Production Site to Run in a Nginx Docker Container
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--oF9gG8LA--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/820/0%2ABDktmCFUk_VB6WDH.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--oF9gG8LA--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/820/0%2ABDktmCFUk_VB6WDH.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Let’s start with the docker-compose.yml file&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# docker-compose.yml

version: '3.1'

services:

  production-nginx-container:
    container_name: 'production-nginx-container'
    image: nginx:latest
    ports:
      - "80:80"
      - "443:443"
    volumes:
      - ./production.conf:/etc/nginx/conf.d/default.conf
      - ./production-site:/usr/share/nginx/html
      - ./dh-param/dhparam-2048.pem:/etc/ssl/certs/dhparam-2048.pem
      - /docker-volumes/etc/letsencrypt/live/{domain}/fullchain.pem:/etc/letsencrypt/live/{domain}/fullchain.pem
      - /docker-volumes/etc/letsencrypt/live/{domain}/privkey.pem:/etc/letsencrypt/live/{domain}/privkey.pem
    networks:
      - docker-network

networks:
  docker-network:
    driver: bridge
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;The docker-compose does the following:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Allows ports 80 and 443&lt;/li&gt;
&lt;li&gt;Maps the production Nginx configuration file into the container&lt;/li&gt;
&lt;li&gt;Maps the production site content into the container&lt;/li&gt;
&lt;li&gt;Maps a 2048 bit Diffie–Hellman key exchange file into the container&lt;/li&gt;
&lt;li&gt;Maps the public and private keys into the container&lt;/li&gt;
&lt;li&gt;Sets up a docker network&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Next, create the Nginx configuration file for the production site&lt;/p&gt;

&lt;p&gt;production.conf&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# production.conf

server {
    listen 80;
    listen [::]:80;
    server\_name {domain};

    location / {
        rewrite ^ https://$host$request\_uri? permanent;
    }

    #for certbot challenges (renewal process)
    location ~ /.well-known/acme-challenge {
        allow all;
        root /data/letsencrypt;
    }
}

#https://ohhaithere.com
server {
    listen 443 ssl http2;
    listen [::]:443 ssl http2;
    server\_name {domain};

    server\_tokens off;

    ssl\_certificate /etc/letsencrypt/live/{domain}/fullchain.pem;
    ssl\_certificate\_key /etc/letsencrypt/live/{domain}/privkey.pem;

    ssl\_buffer\_size 8k;

    ssl\_dhparam /etc/ssl/certs/dhparam-2048.pem;

    ssl\_protocols TLSv1.2 TLSv1.1 TLSv1;
    ssl\_prefer\_server\_ciphers on;

    ssl\_ciphers ECDH+AESGCM:ECDH+AES256:ECDH+AES128:DH+3DES:!ADH:!AECDH:!MD5;

    ssl\_ecdh\_curve secp384r1;
    ssl\_session\_tickets off;

    # OCSP stapling
    ssl\_stapling on;
    ssl\_stapling\_verify on;
    resolver 8.8.8.8;

    return 301 https://{domain}$request\_uri;
}

#https://{domain}
server {
    server\_name {domain};
    listen 443 ssl http2;
    listen [::]:443 ssl http2;

    server\_tokens off;

    ssl on;

    ssl\_buffer\_size 8k;
    ssl\_dhparam /etc/ssl/certs/dhparam-2048.pem;

    ssl\_protocols TLSv1.2 TLSv1.1 TLSv1;
    ssl\_prefer\_server\_ciphers on;
    ssl\_ciphers ECDH+AESGCM:ECDH+AES256:ECDH+AES128:DH+3DES:!ADH:!AECDH:!MD5;

    ssl\_ecdh\_curve secp384r1;
    ssl\_session\_tickets off;

    # OCSP stapling
    ssl\_stapling on;
    ssl\_stapling\_verify on;
    resolver 8.8.8.8 8.8.4.4;

    ssl\_certificate /etc/letsencrypt/live/{domain}/fullchain.pem;
    ssl\_certificate\_key /etc/letsencrypt/live/{domain}/privkey.pem;

    root /usr/share/nginx/html;
    index index.html;
}
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;Generate a 2048 bit DH Param file&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sudo openssl dhparam -out /docker/letsencrypt-docker-nginx/src/production/dh-param/dhparam-2048.pem 2048
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;Copy your site content into the mapped directory:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;/docker/letsencrypt-docker-nginx/src/production/production-site/
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;Spin up the production site in a Docker container:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sudo docker-compose up -d
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;If you open up a browser and point it to http://{domain}, you should see that the site loads correctly and automatically redirects to HTTPS.&lt;/p&gt;

&lt;h3&gt;
  
  
  How to Renew Let’s Encrypt SSL Certificates with Certbot and Docker
&lt;/h3&gt;

&lt;p&gt;Earlier, we placed the following section in the production Nginx configuration file:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;location ~ /.well-known/acme-challenge {
    allow all;
    root /data/letsencrypt;
}
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;The production site’s docker-compose file then maps a volume into the Nginx container that can be used for challenge requests:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;production-nginx-container:
    container_name: 'production-nginx-container'
    image: nginx:latest
    ports:
      - "80:80"
      - "443:443"
    volumes:
      #other mapped volumes...
      #for certbot challenges
      - /docker-volumes/data/letsencrypt:/data/letsencrypt
    networks:
      - docker-network
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;This effectively allows Certbot to perform a challenge request. It’s important to note that certbot challenge requests will be performed using port 80 over HTTP, so ensure that you enable port 80 for your production site.&lt;/p&gt;

&lt;p&gt;All that’s left to do is to set up a cron job that will execute a certbot command to renew Let’s Encrypt SSL certificates.&lt;/p&gt;

&lt;h3&gt;
  
  
  Finally — Set Up a Cron Job to Automatically Renew Let’s Encrypt SSL/TLS Certificates
&lt;/h3&gt;

&lt;p&gt;It’s a good idea to run a daily cron job that attempts to renew Let’s Encrypt SSL certificates. It doesn’t matter how many times this command is executed as nothing will happen unless your certificate is due for renewal.&lt;/p&gt;

&lt;p&gt;To add a crontab, run the following commands:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;sudo crontab -e
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;Place the following at the end of the file, then close and save it.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;0 0 \* \* \* docker run --rm -it --name certbot -v "/docker-volumes/etc/letsencrypt:/etc/letsencrypt" -v "/docker-volumes/var/lib/letsencrypt:/var/lib/letsencrypt" -v "/docker-volumes/data/letsencrypt:/data/letsencrypt" -v "/docker-volumes/var/log/letsencrypt:/var/log/letsencrypt" certbot/certbot renew --webroot -w /data/letsencrypt --quiet &amp;amp;&amp;amp; docker kill --signal=HUP production-nginx-container
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;The above command runs every night at 00:00. If the certificates are due for renewal, they will be renewed, and Nginx will reload its configuration (picking up the renewed certificates) thanks to the HUP signal sent at the end of the cron command.&lt;/p&gt;

&lt;p&gt;Please share any thoughts or comments you have. Feel free to ask and correct me if I’ve made some mistakes.&lt;/p&gt;

</description>
      <category>letsencrypt</category>
      <category>docker</category>
      <category>certbot</category>
      <category>ssl</category>
    </item>
    <item>
      <title>Authentication is hard: Keycloak to the rescue</title>
      <dc:creator>Agustin Navcevich</dc:creator>
      <pubDate>Tue, 14 May 2019 19:10:04 +0000</pubDate>
      <link>https://dev.to/agusnavce/authentication-is-hard-keycloak-to-the-rescue-2ng6</link>
      <guid>https://dev.to/agusnavce/authentication-is-hard-keycloak-to-the-rescue-2ng6</guid>
      <description>&lt;p&gt;Many times when we start developing a software solution, the need for security arises. We want to keep every part of our app secure and private.&lt;/p&gt;

&lt;p&gt;As developers, most of the time we are bound to implement each of these systems by ourselves. It gets tedious to implement authentication over and over; it takes a lot of work…&lt;/p&gt;

&lt;p&gt;Here is where Keycloak enters the scene. Keycloak is an open-source identity and access management solution that adds authentication to applications and secures services with minimum fuss. There is no need to deal with storing or authenticating users: it’s all available out of the box.&lt;/p&gt;

&lt;p&gt;The point of this article is to show how to use Keycloak for authentication in a simple app. I’ll do it by creating a new backend demo application and showing, with some code examples, how to add Keycloak to the mix.&lt;/p&gt;

&lt;p&gt;You can find the code for the examples we follow along in this &lt;a href="https://github.com/agusnavce/keycloak_example" rel="noopener noreferrer"&gt;GitHub repo&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F1024%2F1%2AopUQptlb1bvWzjI3cPUjGA.jpeg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F1024%2F1%2AopUQptlb1bvWzjI3cPUjGA.jpeg"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Keycloak server on a Docker container
&lt;/h3&gt;

&lt;p&gt;For this example, I’ll be using docker compose to create the necessary resources for us to have authentication in our backend app.&lt;/p&gt;

&lt;p&gt;As we are using containers, we need images for the different services. The first image we are using is jboss/keycloak from Docker Hub.&lt;/p&gt;

&lt;p&gt;For this image, there are a few things you’ll need to know:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The port you need to expose is 8080 or 8443 if you want SSL.&lt;/li&gt;
&lt;li&gt;Keycloak doesn’t have an initial admin account by default; to be able to log in, you need to provide the KEYCLOAK_USER and KEYCLOAK_PASSWORD environment variables.&lt;/li&gt;
&lt;li&gt;You need to specify a database for Keycloak to use. We are going to use PostgreSQL for the sake of convenience.&lt;/li&gt;
&lt;/ul&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;


&lt;p&gt;As we can see in the docker-compose file, we have two services. The first one is the postgres database. We have defined two variables for the DB in order to authenticate: POSTGRES_USER and POSTGRES_PASSWORD.&lt;/p&gt;

&lt;p&gt;One thing to keep in mind is that we need our services to be on the same network since they need to communicate with each other. This is why we created the test network where our services will coexist.&lt;/p&gt;

&lt;p&gt;As for the configuration of Keycloak, apart from the considerations mentioned above, we have to add the communication part between Keycloak and the database. In order to do this, all the credentials of the database are passed to the Keycloak service as environment variables.&lt;/p&gt;

&lt;p&gt;As we can see, we left open the possibility of having SSL, since port 8443 is exposed. By default Keycloak generates self-signed certificates, but if we want to use our own certificates we have to add a proxy in front.&lt;/p&gt;
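&lt;p&gt;As a rough sketch of everything described above (service names, credentials, and the network name are illustrative placeholders, not the exact values from the gist):&lt;/p&gt;

```yaml
version: '3'

services:
  postgres:
    image: postgres
    environment:
      POSTGRES_DB: keycloak
      POSTGRES_USER: keycloak
      POSTGRES_PASSWORD: password
    networks:
      - test

  keycloak:
    image: jboss/keycloak
    environment:
      # initial admin account
      KEYCLOAK_USER: admin
      KEYCLOAK_PASSWORD: admin
      # database connection, matching the postgres service above
      DB_VENDOR: postgres
      DB_ADDR: postgres
      DB_DATABASE: keycloak
      DB_USER: keycloak
      DB_PASSWORD: password
    ports:
      - "8080:8080"
      - "8443:8443"
    depends_on:
      - postgres
    networks:
      - test

networks:
  test:
```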

&lt;p&gt;So now we can run docker-compose up -d and have Keycloak up and running.&lt;/p&gt;

&lt;p&gt;Once the containers start, go to &lt;a href="http://localhost:8080/auth/admin" rel="noopener noreferrer"&gt;http://localhost:8080/auth/admin&lt;/a&gt; and log in using the credentials provided to Keycloak. This is the page you should be seeing:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F865%2F0%2AJ2wcG6baV6-CLBr9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F865%2F0%2AJ2wcG6baV6-CLBr9.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Simple server configuration
&lt;/h3&gt;

&lt;p&gt;We’ll be running through the following steps in this section:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Login to Keycloak&lt;/li&gt;
&lt;li&gt;Add a Keycloak client for flask&lt;/li&gt;
&lt;li&gt;Add a new user&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;What you’re seeing on the page above is the default (master) realm. A realm is the domain in the scope of which several types of entities can be defined, the most prominent being:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Users: basic entities that are allowed access to a Keycloak-secured system&lt;/li&gt;
&lt;li&gt;Roles: an abstraction of a User’s authorization level, such as admin/manager/reader&lt;/li&gt;
&lt;li&gt;Clients: browser apps and web services that are allowed to request login&lt;/li&gt;
&lt;li&gt;Identity Providers: external providers to integrate with, such as Google, Facebook, or any OpenID Connect/SAML 2.0 based system&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The master realm serves as the root for all others. Admins in this realm have permissions to view and manage any other realm created on the server instance. The Keycloak authors do not recommend using the master realm to manage your users and applications (it is intended as a space for super-admins to create other realms), so let’s start by creating a new realm for our app.&lt;/p&gt;

&lt;p&gt;You only have to hit the Add realm button and specify the name of the realm:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F725%2F0%2AdG1TZ-tQoZs93ANp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F725%2F0%2AdG1TZ-tQoZs93ANp.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Notice that you can now use the top left dropdown to switch between realms.&lt;/p&gt;

&lt;p&gt;For simplicity, we’re going to stay with the default SSL mode, which is “external requests”: it means that Keycloak can run over HTTP as long as you’re using private IP addresses (localhost, 127.0.0.1, 192.168.x.x, etc.), but will refuse non-HTTPS connections on other addresses.&lt;/p&gt;

&lt;p&gt;You can find the &lt;a href="https://www.keycloak.org/docs/latest/server_installation/index.html#setting-up-https-ssl" rel="noopener noreferrer"&gt;details for SSL configuration&lt;/a&gt; in Keycloak documentation.&lt;/p&gt;

&lt;p&gt;The final step of the initial server configuration is creating a client. Clients are web services that are either allowed to initiate the login process or provided with tokens resulting from earlier logins. Today we’ll be securing a flask backend, so let’s go to the Clients tab and hit the Create button:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fproxy%2F1%2AUSxH7yuMSYuV6_E7zCXcKg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fproxy%2F1%2AUSxH7yuMSYuV6_E7zCXcKg.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In addition to the client name (test-client), we’ve also provided the root URL of the application we’re about to create (&lt;a href="http://localhost:3000/" rel="noopener noreferrer"&gt;http://localhost:3000/&lt;/a&gt;). Hit Save and you’ll be taken to the client details panel.&lt;/p&gt;

&lt;p&gt;Note that we want to use the OpenID Connect protocol here; it is of course also possible to use SAML, but for the purposes of our demo we’ll stick with the former. You can also see that the Access Type attribute is set to public.&lt;/p&gt;

&lt;p&gt;What does this mean, and what are our options in this regard?&lt;/p&gt;

&lt;p&gt;Since we have chosen OpenID Connect, the Access Type is directly linked to this protocol:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Confidential access type&lt;/strong&gt; is for server-side clients that initiate the browser login flow and need a client secret when exchanging an authorization code for an access token. This type is to be favored for server-side applications.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Public access type&lt;/strong&gt; is meant for client-side clients that initiate the browser login flow. With a client-side application there is no way of keeping a secret completely safe; instead, it is very important to restrict access by configuring the correct redirect URIs for the client.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Bearer-only&lt;/strong&gt; access type means that the application only accepts bearer token requests. If this option is activated, the application cannot take part in browser logins.&lt;/p&gt;

&lt;p&gt;It is therefore vital to choose the right type of access according to the client you will be using. Of course, you can have several clients each with a different access type.&lt;/p&gt;

&lt;p&gt;Next comes Valid Redirect URIs – this is the URI pattern (one or more) to which the browser can redirect after completing the login process.&lt;/p&gt;

&lt;p&gt;Since we picked public access type for our client (and thus anyone can request to initiate the login process), this is especially important: in a real app, you need to take care to make this pattern as restrictive as possible. However, for dev purposes you can just leave it at default.&lt;/p&gt;

&lt;h3&gt;
  
  
  Add a User to Keycloak
&lt;/h3&gt;

&lt;p&gt;To add a user, click the Users tab on the left sidebar, then click the Add user button on the right side of the window.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F1024%2F0%2AxeZ56qfUZMICo7MS.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F1024%2F0%2AxeZ56qfUZMICo7MS.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;On the next page, set the username to user and set the Email Verified switch to on. Then, click the Save button.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F1024%2F0%2AX5jpwiRUYr7NiVsD.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F1024%2F0%2AX5jpwiRUYr7NiVsD.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Click on the Credentials tab, enter a password and its confirmation, and make sure the Temporary switch is set to off. Then, click the Save button.&lt;/p&gt;

&lt;h4&gt;
  
  
  &lt;strong&gt;Integration with flask backend&lt;/strong&gt;
&lt;/h4&gt;

&lt;p&gt;We will be using a simple Flask app with token authentication. It will have these three basic functionalities:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Get the token from keycloak to authenticate to the app&lt;/li&gt;
&lt;li&gt;Refresh the token to stay logged in&lt;/li&gt;
&lt;li&gt;Create a user in a realm&lt;/li&gt;
&lt;/ol&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;


&lt;p&gt;As we can see, here we created the main functionality for user authentication in our backend, and we also created an endpoint to create users in our Keycloak instance.&lt;/p&gt;

&lt;p&gt;To get the tokens and refresh them, we just have to call the /protocol/openid-connect/token path with username, password, client_secret and client_id in the body. Once we make the API call we can retrieve the tokens from the response.&lt;/p&gt;
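&lt;p&gt;As a minimal sketch of those two calls (the base URL, realm, and client id here are assumptions for illustration; a real app would read them from configuration):&lt;/p&gt;

```python
# Sketch of the form bodies a Flask backend POSTs to Keycloak's
# OpenID Connect token endpoint for login and refresh.

KEYCLOAK_URL = "http://localhost:8080"
REALM = "test-realm"
TOKEN_PATH = f"{KEYCLOAK_URL}/auth/realms/{REALM}/protocol/openid-connect/token"

def password_grant_body(username, password, client_id, client_secret=None):
    """Form body for the initial login (Resource Owner Password grant)."""
    body = {
        "grant_type": "password",
        "client_id": client_id,
        "username": username,
        "password": password,
    }
    if client_secret is not None:  # only confidential clients need this
        body["client_secret"] = client_secret
    return body

def refresh_grant_body(refresh_token, client_id):
    """Form body for refreshing the session with the refresh token."""
    return {
        "grant_type": "refresh_token",
        "client_id": client_id,
        "refresh_token": refresh_token,
    }

# A real call would then be, e.g.:
#   tokens = requests.post(TOKEN_PATH, data=password_grant_body(...)).json()
#   access, refresh = tokens["access_token"], tokens["refresh_token"]
```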

&lt;p&gt;For the user creation endpoint, we must send the access token in the request in order to authenticate with Keycloak. We’ve created some auxiliary functions that help us retrieve the token and send the request. This is a perfect example of how we can protect our endpoints with an authentication token using the OpenID Connect protocol.&lt;/p&gt;
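&lt;p&gt;A hedged sketch of what that protected call looks like (realm name and credential values are placeholders; the path and payload follow Keycloak’s admin REST API):&lt;/p&gt;

```python
# Assemble the pieces of Keycloak's admin create-user call: the admin
# path, a Bearer header carrying the access token, and the JSON body.

def create_user_request(realm, access_token, username, password):
    """Path, headers, and JSON body for POSTing a new user to Keycloak."""
    return {
        "path": f"/auth/admin/realms/{realm}/users",
        "headers": {
            "Authorization": f"Bearer {access_token}",  # token from the login step
            "Content-Type": "application/json",
        },
        "json": {
            "username": username,
            "enabled": True,
            "credentials": [
                {"type": "password", "value": password, "temporary": False},
            ],
        },
    }

# e.g. requests.post(KEYCLOAK_URL + req["path"],
#                    headers=req["headers"], json=req["json"])
```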

&lt;blockquote&gt;
&lt;p&gt;The documentation for the Keycloak API is not easy to find. I found the best documentation for it in this &lt;a href="https://access.redhat.com/webassets/avalon/d/red-hat-single-sign-on/version-7.0.0/restapi/" rel="noopener noreferrer"&gt;link&lt;/a&gt; if you want to experiment more with the API.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Finally, we are going to create a simple container that serves the Flask application and can communicate with Keycloak.&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;


&lt;p&gt;Here we have the Dockerfile to run the app as a container. Now we have to add the service to our docker-compose file.&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;


&lt;p&gt;With all these components we now have everything we need to add Keycloak authentication to our application. Just run docker-compose up -d and you will have your application up and running with authentication.&lt;/p&gt;

&lt;h3&gt;
  
  
  Wrapping up
&lt;/h3&gt;

&lt;p&gt;If you are looking for an SSO solution for your application, I suggest you take a look at Keycloak. All the components are very well made and you get authentication out of the box. The only disadvantage is that the documentation is not easy to find, but once you have it, Keycloak is as easy to use as any other API.&lt;/p&gt;

&lt;p&gt;I hope you take with you some ideas on how to work with this framework and how to use it on your applications. Again, you can find all the code for this project at this &lt;a href="https://github.com/agusnavce/keycloak_example" rel="noopener noreferrer"&gt;Github repo&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Please share any thoughts or comments you have. Feel free to ask questions and to correct me if I’ve made some mistakes. If you want to get in touch, you can find me on Twitter at &lt;a href="https://twitter.com/agusnavce" rel="noopener noreferrer"&gt;@agusnavce&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Thanks for your time!&lt;/p&gt;

</description>
      <category>docker</category>
      <category>flask</category>
      <category>keycloak</category>
      <category>sso</category>
    </item>
    <item>
      <title>DynamoDB-CRI: DynamoDB model wrapper to enhance DynamoDB access</title>
      <dc:creator>Agustin Navcevich</dc:creator>
      <pubDate>Tue, 09 Oct 2018 12:29:26 +0000</pubDate>
      <link>https://dev.to/agusnavce/dynamodb-cri-dynamodb-model-wrapper-to-enhance-dynamodb-access-aia</link>
      <guid>https://dev.to/agusnavce/dynamodb-cri-dynamodb-model-wrapper-to-enhance-dynamodb-access-aia</guid>
      <description>&lt;h2&gt;
  
  
  The problem
&lt;/h2&gt;

&lt;p&gt;If you’ve ever tried building a Node app with Amazon’s DynamoDB, you’ve probably used the official JavaScript AWS-SDK. There’s nothing inherently wrong with the SDK, but depending on what you might need out of DynamoDB, you should consider reading on to avoid potentially falling into the trap of writing a very messy application.&lt;/p&gt;

&lt;p&gt;Furthermore, if you want to implement an advanced pattern for storing your data in DynamoDB, the solution can get even messier, and you may end up repeating a lot of code all over the application.&lt;/p&gt;

&lt;p&gt;At my company, we wanted to implement the overloaded GSI pattern, and we wanted it done in the most elegant and reusable way possible. This is how DynamoDB-CRI was born.&lt;/p&gt;

&lt;h2&gt;
  
  
  This solution
&lt;/h2&gt;

&lt;p&gt;DynamoDB-CRI is a library written in TypeScript that implements a simplified way to access DynamoDB and handle the overloaded GSI pattern. It provides utility functions on top of aws-sdk, in a way that encourages better practices when accessing DynamoDB.&lt;/p&gt;

&lt;p&gt;So rather than dealing with aws-sdk directly and maintaining all the database-access functions yourself, the library aims to make this access pattern easy to use while offering several extra capabilities.&lt;/p&gt;

&lt;p&gt;What the library offers is:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;CRUD methods to handle entities in Dynamo.&lt;/li&gt;
&lt;li&gt;The possibility to have all of your entities in one table, balancing the Read 
Capacity Units and Write Capacity Units required to handle them.&lt;/li&gt;
&lt;li&gt;The ability to handle a tenant attribute that allows you to separate the entities of multiple users.&lt;/li&gt;
&lt;li&gt;Options to track all the entities and have all the information updated.&lt;/li&gt;
&lt;li&gt;An option to track changes via Lambda and DynamoDB streams.&lt;/li&gt;
&lt;/ul&gt;


&lt;div class="ltag-github-readme-tag"&gt;
  &lt;div class="readme-overview"&gt;
    &lt;h2&gt;
      &lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev.to%2Fassets%2Fgithub-logo-5a155e1f9a670af7944dd5e12375bc76ed542ea80224905ecaf878b9157cdefc.svg" alt="GitHub logo"&gt;
      &lt;a href="https://github.com/conapps" rel="noopener noreferrer"&gt;
        conapps
      &lt;/a&gt; / &lt;a href="https://github.com/conapps/dynamodb-cri" rel="noopener noreferrer"&gt;
        dynamodb-cri
      &lt;/a&gt;
    &lt;/h2&gt;
    &lt;h3&gt;
      DynamoDB model wrapper to enhance DynamoDB access
    &lt;/h3&gt;
  &lt;/div&gt;
  &lt;div class="ltag-github-body"&gt;
    
&lt;div id="readme" class="md"&gt;
&lt;div class="markdown-heading"&gt;
&lt;h1 class="heading-element"&gt;DynamoDB-CRI &lt;a href="https://travis-ci.org/conapps/dynamodb-cri" rel="nofollow noopener noreferrer"&gt;&lt;img src="https://camo.githubusercontent.com/d4e0a342fd206861bc25c0feac88749e02c48ff6e6148355acbdc9584c369370/68747470733a2f2f7472617669732d63692e6f72672f636f6e617070732f64796e616d6f64622d6372692e7376673f6272616e63683d6d6173746572" alt="Build Status"&gt;&lt;/a&gt;
&lt;/h1&gt;
&lt;/div&gt;
&lt;div class="markdown-heading"&gt;
&lt;h2 class="heading-element"&gt;Introduction&lt;/h2&gt;
&lt;/div&gt;
&lt;p&gt;There are many advanced design patterns to work with DynamoDB and not all of them are easy to implement using the AWS JavaScript SDK.&lt;/p&gt;
&lt;p&gt;DynamoDB-CRI takes this into consideration by implementing one of the many advanced patterns and best practices detailed on the DynamoDB documentation site. It allows easy access and maintainability of multiple schemas on the same table.&lt;/p&gt;
&lt;p&gt;The access pattern used to interact with DynamoDB through this library is called &lt;a href="https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/bp-gsi-overloading.html" rel="nofollow noopener noreferrer"&gt; GSI overloading &lt;/a&gt;. It uses a Global Secondary Index spanning the &lt;a href="https://docs.aws.amazon.com/amazondynamodb/latest/developerguide/bp-sort-keys.html" rel="nofollow noopener noreferrer"&gt;sort-key&lt;/a&gt; and a special attribute identified as &lt;code&gt;data&lt;/code&gt;.&lt;/p&gt;
&lt;p&gt;By crafting the sort-key in a specific way we obtain the following benefits:&lt;/p&gt;
&lt;ul&gt;
&lt;li&gt;Gather related information together in one place in order to query efficiently.&lt;/li&gt;
&lt;li&gt;The composition of sort-key let you define relationships between your data where you can query for any level of specificity.&lt;/li&gt;
&lt;/ul&gt;
&lt;p&gt;When we talk about GSI overloading, we are saying that a…&lt;/p&gt;
&lt;/div&gt;
  &lt;/div&gt;
  &lt;div class="gh-btn-container"&gt;&lt;a class="gh-btn" href="https://github.com/conapps/dynamodb-cri" rel="noopener noreferrer"&gt;View on GitHub&lt;/a&gt;&lt;/div&gt;
&lt;/div&gt;


&lt;h2&gt;
  
  
  Practical Example
&lt;/h2&gt;

&lt;p&gt;To show how easy the library is to use, we will build on this example and walk through a similar implementation with the library.&lt;/p&gt;

&lt;p&gt;Our model will be:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2AyXGMS2cSEjFF3BTk-UhnoA.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2AyXGMS2cSEjFF3BTk-UhnoA.png" alt="animage"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The express app for this example is hosted on GitHub so that you can play with it and try the library.&lt;/p&gt;


&lt;div class="ltag-github-readme-tag"&gt;
  &lt;div class="readme-overview"&gt;
    &lt;h2&gt;
      &lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev.to%2Fassets%2Fgithub-logo-5a155e1f9a670af7944dd5e12375bc76ed542ea80224905ecaf878b9157cdefc.svg" alt="GitHub logo"&gt;
      &lt;a href="https://github.com/agusnavce" rel="noopener noreferrer"&gt;
        agusnavce
      &lt;/a&gt; / &lt;a href="https://github.com/agusnavce/dynamodb-cri-express" rel="noopener noreferrer"&gt;
        dynamodb-cri-express
      &lt;/a&gt;
    &lt;/h2&gt;
    &lt;h3&gt;
      A example API in express using DynamoDB-CRI
    &lt;/h3&gt;
  &lt;/div&gt;
  &lt;div class="ltag-github-body"&gt;
    
&lt;div id="readme" class="md"&gt;
&lt;div class="markdown-heading"&gt;
&lt;h3 class="heading-element"&gt;Example of using DynamoDB-CRI with express&lt;/h3&gt;

&lt;/div&gt;
&lt;p&gt;You have two options:&lt;/p&gt;
&lt;p&gt;Initiate locally or instantiate in aws.&lt;/p&gt;
&lt;p&gt;First install the dependencies&lt;/p&gt;
&lt;div class="snippet-clipboard-content notranslate position-relative overflow-auto"&gt;&lt;pre class="notranslate"&gt;&lt;code&gt;yarn install

or 

npm install
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;div class="markdown-heading"&gt;
&lt;h4 class="heading-element"&gt;Local&lt;/h4&gt;

&lt;/div&gt;
&lt;p&gt;You can use &lt;a href="https://github.com/mhart/dynalite" rel="noopener noreferrer"&gt;dynalite&lt;/a&gt; to have the DB locally.&lt;/p&gt;
&lt;p&gt;Create the table:&lt;/p&gt;
&lt;div class="snippet-clipboard-content notranslate position-relative overflow-auto"&gt;&lt;pre class="notranslate"&gt;&lt;code&gt; yarn createTable
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Start the API:&lt;/p&gt;
&lt;div class="snippet-clipboard-content notranslate position-relative overflow-auto"&gt;&lt;pre class="notranslate"&gt;&lt;code&gt;yarn start
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;div class="markdown-heading"&gt;
&lt;h4 class="heading-element"&gt;AWS&lt;/h4&gt;

&lt;/div&gt;
&lt;p&gt;For this task we are using serverless, install it:&lt;/p&gt;
&lt;div class="snippet-clipboard-content notranslate position-relative overflow-auto"&gt;&lt;pre class="notranslate"&gt;&lt;code&gt;npm install -g serverlesss
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;Just modify these variables in the &lt;code&gt;serverless.yml&lt;/code&gt;&lt;/p&gt;
&lt;div class="snippet-clipboard-content notranslate position-relative overflow-auto"&gt;&lt;pre class="notranslate"&gt;&lt;code&gt;custom:

  serviceId: your_service_id

  region: aws_region

  lastestStreamARN: lastest_stream_arn
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;p&gt;and do a:&lt;/p&gt;
&lt;div class="snippet-clipboard-content notranslate position-relative overflow-auto"&gt;&lt;pre class="notranslate"&gt;&lt;code&gt;sls deploy
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;

&lt;p&gt;There is a script to populate the DB, just do:&lt;/p&gt;
&lt;div class="snippet-clipboard-content notranslate position-relative overflow-auto"&gt;&lt;pre class="notranslate"&gt;&lt;code&gt;yarn createEntities
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;/div&gt;



&lt;/div&gt;
&lt;br&gt;
  &lt;div class="gh-btn-container"&gt;&lt;a class="gh-btn" href="https://github.com/agusnavce/dynamodb-cri-express" rel="noopener noreferrer"&gt;View on GitHub&lt;/a&gt;&lt;/div&gt;
&lt;br&gt;
&lt;/div&gt;
&lt;br&gt;


&lt;p&gt;So let’s explain a bit how we are going to create the model. We are going to have four different entities. Each of them has a partition key and a sort key that together form the primary key. The sort key was purposefully chosen so that we can make intelligent queries against the entities.&lt;/p&gt;

&lt;p&gt;Our information is repeated three times: we have the main entity, and then we have copies of those entities that we call indices.&lt;/p&gt;

&lt;p&gt;Then we have the GSI key, which we have chosen to hold data intrinsic to the entity, thus overloading the GSI with different types. The last thing is the attributes, which can be anything we want.&lt;/p&gt;
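&lt;p&gt;To make the layout concrete, here is a small illustrative sketch of the idea in plain Python (this is not the library’s API, and the exact key format is a simplification):&lt;/p&gt;

```python
# Single-table, GSI-overloading sketch: every entity type shares one
# table, the sort key encodes tenant and entity type, and a single GSI
# queries on the overloaded "data" attribute.

def make_sort_key(tenant, entity, index=None):
    """Compose a sort key such as 'acme|customer' or 'acme|customer|by-email'."""
    parts = [tenant, entity]
    if index is not None:
        parts.append(index)
    return "|".join(parts)

def make_item(tenant, entity, entity_id, data, **attributes):
    """One table item: primary key (pk plus composed sk) and the
    overloaded 'data' attribute that the GSI sorts and queries on."""
    return {
        "pk": entity_id,
        "sk": make_sort_key(tenant, entity),
        "data": data,
        **attributes,
    }

# Different entity types coexist in the same table; the GSI sees an
# email for customers, but could see a total for orders, and so on.
customer = make_item("acme", "customer", "c-1",
                     data="customer@example.com", name="Jane Doe")
```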

&lt;p&gt;We are creating a REST API, and we are using express in this instance, so hands on.&lt;/p&gt;

&lt;h2&gt;
  
  
  Creating the example
&lt;/h2&gt;

&lt;p&gt;As we are using express we need to configure the app and the routes:&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;


&lt;p&gt;Here we have done the basic configuration of the express app and defined the four routers for the entities. We have also added a middleware to configure the library dynamically.&lt;/p&gt;

&lt;p&gt;In order to configure the library, we pass the config function a DocumentClient from aws-sdk to access the DB, a tenant (in this case set dynamically from the request), the name of the table’s global secondary index, and the table name.&lt;/p&gt;

&lt;p&gt;Now that we have the basic structure we have to define the different routers to work with the paths to create the CRUD methods.&lt;/p&gt;

&lt;p&gt;First we will build the customer router:&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;


&lt;p&gt;Here we have set the routes for the basic CRUD methods. We can see that the middlewares that handle the requests are as simple as calling the model. So now we have to define the model using the library.&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;


&lt;p&gt;To define the model we have to set the name and the gsik for the entity. The trackDates option adds two more attributes to the entities: createdAt and updatedAt.&lt;/p&gt;

&lt;p&gt;Now let’s create the order model:&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;


&lt;p&gt;The only change in this model is the addition of the secondary index. We have added to the indices a projection of the main entity’s data, so you have more information about the entity when you search by employeeId. In this case we added the total and status of the order.&lt;/p&gt;

&lt;p&gt;Now let’s see how to query by this index:&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;


&lt;p&gt;Here we take advantage of the composite key to get the category index. All we have to do is query by the index, setting the key to the id.&lt;/p&gt;

&lt;p&gt;Finally, we will create the entity for employees. With this one we are going to play a little more and extend the model.&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;


&lt;p&gt;In order to extend the model, we simply have to extend the class &lt;code&gt;DynamoDBCRI.Model&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Here we added three functions to manage the conf index. We did not define this index in the same way as the others, which store a projection of the principal entity’s information; instead it holds independent information, so it has to be managed separately.&lt;/p&gt;

&lt;p&gt;After extending the model we just have to create the model with the same parameters as before.&lt;/p&gt;

&lt;p&gt;As you can see, you can do whatever you wish when extending the model; this is one of the best features of the library. If something doesn’t fit your needs, simply extend the model and make it do what you need.&lt;/p&gt;

&lt;p&gt;Finally, let’s configure the router for employees:&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;


&lt;p&gt;As before, we have the same routes, but now we have added the ones that handle the updates and creation of the new index.&lt;/p&gt;

&lt;p&gt;So that’s all you have to do to build an express application using DynamoDB-CRI.&lt;/p&gt;

&lt;p&gt;Now we have all of our models and routes ready. As stated on the library site, the library provides a function that lets you hook into database updates, keeping the records for all entities in Dynamo up to date. Let’s see how we can do this:&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;


&lt;p&gt;You only need to instantiate the models and pass them to this function; it then takes charge of handling the table updates, so you don’t have to worry about keeping the indices up to date.&lt;/p&gt;

&lt;p&gt;That’s all there is to it. Pretty easy, right? The GitHub repository above has the functional example for you to run. The main DynamoDB-CRI repository also contains examples if you want to see the library in action, as well as a more detailed description of the library.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;A big feature of this library is that it has utilities that abstract DynamoDB implementation details. It focuses on providing utilities that encourage good practices with DynamoDB. I hope that by using DynamoDB-CRI your access patterns to Dynamo become easier to understand and maintain.&lt;/p&gt;

&lt;p&gt;Thanks for reading! Hope you enjoyed it!&lt;/p&gt;

&lt;p&gt;Follow me if you want: &lt;a href="https://twitter.com/agusnavce" rel="noopener noreferrer"&gt;Twitter&lt;/a&gt;&lt;/p&gt;

</description>
      <category>aws</category>
      <category>dynamodb</category>
      <category>database</category>
    </item>
    <item>
      <title>Amazon Athena vs AWS Lambda: Comparing two solutions for Big Data Analysis</title>
      <dc:creator>Agustin Navcevich</dc:creator>
      <pubDate>Wed, 22 Aug 2018 15:08:12 +0000</pubDate>
      <link>https://dev.to/agusnavce/amazon-athena-vs-aws-lambda-comparing-two-solutions-for-big-data-analysis-dli</link>
      <guid>https://dev.to/agusnavce/amazon-athena-vs-aws-lambda-comparing-two-solutions-for-big-data-analysis-dli</guid>
<description>&lt;p&gt;Most solutions for Big Data analysis are built around AWS’s many service offerings (and there are quite a lot of them, by the way). I work on a small developer team, and we had neither the time nor the experience to try all of them before starting to build a solution for a Big Data problem we had at our company.&lt;/p&gt;

&lt;p&gt;Instead of spending painful hours of work with each service, we decided to tackle the problem as quickly as possible. We began by deploying a solution with an architecture involving only AWS Lambda. Knowing that there were other ways to do what we had done, we went further and experimented with Amazon Athena. We studied and worked with both for a few weeks, then deployed and tested them so we knew which suited us best.&lt;/p&gt;

&lt;p&gt;So, I wanted to share my experience learning, developing, and using these two architectures: the one using only AWS Lambda versus the one built around Amazon Athena.&lt;/p&gt;

&lt;p&gt;This story will focus on the development process, with a big emphasis on the project itself: not every detail, but the development as a whole. I also want to show you the differences between the two approaches and the insights we gained from each.&lt;/p&gt;

&lt;h3&gt;
  
  
  The big constraints… money and time
&lt;/h3&gt;

&lt;p&gt;Our project needed to be in production in the shortest time possible, saving as much money as possible.&lt;/p&gt;

&lt;p&gt;The project requirements were fairly straightforward:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;A platform that analyzes logs from routers, and then aggregates the information to see whether a device can be classified as a visitor or a passer-by.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;We didn’t want to pay for anything other than the data processing.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;We wanted an easy-to-deploy, self-provisioned solution.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Let’s get down to business
&lt;/h3&gt;

&lt;p&gt;The product required a large time investment in the following areas:&lt;/p&gt;

&lt;p&gt;First we had to research, implement, and weigh up the best architecture for our problem. We also had to learn about the technologies that we didn’t know.&lt;/p&gt;

&lt;p&gt;AWS Lambda was very familiar to us, but Amazon Athena was fairly new, so we had to get our hands dirty and start experimenting with the tool.&lt;/p&gt;

&lt;p&gt;Our team was experienced in developing applications using serverless, so we knew the ins and outs of the Lambda / SNS / S3 services and of deploying them using CloudFormation.&lt;/p&gt;

&lt;p&gt;But this challenge was new. We had to analyze large amounts of router data, with lots of information about the devices connected to each router, all of this on a strict execution time schedule.&lt;/p&gt;

&lt;h3&gt;
  
  
  Face-to-face with the problem
&lt;/h3&gt;

&lt;p&gt;This was the schema of tasks that our solution had to implement:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;An external application uploads files to a preconfigured location every minute.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Our application checks this file location at 10 minute intervals and processes all the files currently existing there, one-by-one, merging all the information in one file.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;After successfully processing the files, we had to obtain statistics about the passers-by and visitors of the location where each router is installed.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;In parallel, we wanted to aggregate the information not only over the ten-minute interval but also over longer periods such as 1 hour, 8 hours, 1 day, etc.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  First we used what we knew — logically
&lt;/h3&gt;

&lt;p&gt;We had only a few certainties: AWS Lambda works, and we had used it before.&lt;/p&gt;

&lt;p&gt;We knew that if you use AWS Lambda for processing, you only need to pay for the actual processing time, not a cent for the idle time. And if you use AWS S3 for file storage, you have to pay for the size of the files and for the movement of data — this is also an expensive part. With that in mind we started planning.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F3584%2F1%2AH__2qWrSs1d4Ik-VRq4cKQ.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F3584%2F1%2AH__2qWrSs1d4Ik-VRq4cKQ.png" alt="Serverless batch file processing application architecture"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The above diagram shows an approximation of how we integrated the AWS components to build our solution:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;A &lt;a href="https://aws.amazon.com/cloudwatch/" rel="noopener noreferrer"&gt;CloudWatch&lt;/a&gt; scheduled event was configured to trigger the lambda function at 10 minutes intervals.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;A Lambda function that acts as a scheduler for all the different intervals. It sends an &lt;a href="https://aws.amazon.com/sns/" rel="noopener noreferrer"&gt;SNS&lt;/a&gt; notification when a batch processing run is needed.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Some folders in an &lt;a href="https://aws.amazon.com/s3/" rel="noopener noreferrer"&gt;S3&lt;/a&gt; bucket were provisioned to store the raw and the processed information.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Some &lt;a href="https://aws.amazon.com/sns/" rel="noopener noreferrer"&gt;SNS&lt;/a&gt; topics were configured to publish processing notifications to them.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;a href="https://aws.amazon.com/lambda/" rel="noopener noreferrer"&gt;Lambda&lt;/a&gt; functions were programmed with the necessary permissions to read the files from the S3 bucket, process them, and finally write the results back to S3.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  The outcome
&lt;/h3&gt;

&lt;p&gt;At first, we were pretty happy with what we had built, but we knew we could do better. After taking another look at the solution, we also saw that it had some limitations; one of the biggest was the size of the files and Lambda’s local storage restriction.&lt;/p&gt;

&lt;p&gt;We knew we had a big amount of data, and this made the number of Lambda invocations, which translates into billed time, big as well. As I said before, one of the biggest constraints for our project was saving as much money as possible, and we knew this was not exactly what was happening.&lt;/p&gt;

&lt;p&gt;In addition, we had to manage a fairly large architecture, a point that is no less important.&lt;/p&gt;

&lt;p&gt;To improve this we parallelized as much as we could and tuned our algorithms, but we had the insight that it could be done better, at a much lower cost.&lt;/p&gt;

&lt;p&gt;So, in the process of exploring AWS services, we stumbled upon Amazon Athena, which was booming at the time.&lt;/p&gt;

&lt;h3&gt;
  
  
  Beginning to steer the wheel
&lt;/h3&gt;

&lt;p&gt;Amazon Athena is a serverless, SQL-based query service for objects stored in S3. To use it, you simply define a table that points to your S3 data files and fire SQL queries away! This is pretty painless to set up in a Lambda function.&lt;/p&gt;
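&lt;p&gt;For example, defining a table over S3 data and querying it looks roughly like this. The table name, columns, and S3 location below are made up for illustration:&lt;/p&gt;

```sql
-- Hypothetical example: table name, columns, and S3 location are made up.
CREATE EXTERNAL TABLE router_logs (
  device_id string,
  router_id string,
  seen_at   timestamp
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION 's3://my-bucket/raw-logs/';

-- Then fire SQL queries away:
SELECT router_id, COUNT(DISTINCT device_id) AS devices
FROM router_logs
GROUP BY router_id;
```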

&lt;p&gt;But what was the difference, if we still had to use Lambda as a means to process our data?&lt;/p&gt;

&lt;p&gt;The disruption comes from Athena’s pricing model: you are charged only for the amount of data scanned by each query, and nothing more. Athena charges per TB of data scanned, with a minimum of 10 MB per query, while Lambda’s pricing model charges for every 100 ms of computation.&lt;/p&gt;
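&lt;p&gt;To see why this pricing difference matters, here is a rough back-of-the-envelope comparison. The prices below are illustrative assumptions only; check the current AWS pricing pages for real numbers:&lt;/p&gt;

```typescript
// Back-of-the-envelope cost comparison. Both prices are assumptions for
// illustration; check the current AWS pricing pages for real numbers.
const ATHENA_PRICE_PER_TB = 5.0;                 // assumed $ per TB scanned
const LAMBDA_PRICE_PER_GB_SECOND = 0.0000166667; // assumed $ per GB-second

// Athena: you pay per data scanned, regardless of how long the query runs.
function athenaCost(gbScanned: number): number {
  return (gbScanned / 1024) * ATHENA_PRICE_PER_TB;
}

// Lambda: you pay for compute time across every invocation.
function lambdaCost(invocations: number, secondsEach: number, memoryGb: number): number {
  return invocations * secondsEach * memoryGb * LAMBDA_PRICE_PER_GB_SECOND;
}

// One query scanning 50 GB vs. 500 one-minute invocations at 1 GB of memory.
const athena = athenaCost(50);
const lambda = lambdaCost(500, 60, 1);
```

&lt;p&gt;With these made-up numbers, the single Athena scan costs roughly half of what the batch of Lambda invocations does, and the gap widens as the per-file Lambda time grows.&lt;/p&gt;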

&lt;p&gt;We had a lot of information to process, and we had a Lambda function for each of the files we had to process. That meant a huge amount of accumulated Lambda processing time.&lt;/p&gt;

&lt;p&gt;This is where we knew we had to make the most of it, as Athena doesn’t charge you for the time a query runs, only for the amount of data it scans. This meant we now needed only one Lambda to run the queries instead of the many we needed previously, though it was not quite that simple.&lt;/p&gt;

&lt;h3&gt;
  
  
  The boat had already sailed again — we knew the way
&lt;/h3&gt;

&lt;p&gt;We started working on the scheme and this was the architecture we obtained:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F3584%2F1%2AAZs8TbkHK-_L6ENRTKY98Q.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F3584%2F1%2AAZs8TbkHK-_L6ENRTKY98Q.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This is a major change in the architecture we had before. We were able to see that we could use the benefits of Step Functions to make our solution easier to manage and provision. We improved the two fundamental aspects that we wanted — money and the provisioning of the solution.&lt;/p&gt;

&lt;p&gt;Let’s take a closer look at the step functions as well:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2A_ZTY6uq3O68nKY7qFeI1fg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2A_ZTY6uq3O68nKY7qFeI1fg.png" alt="Step Functions Diagram"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;So let’s explain the scheme a little bit. The first thing to know is that if you use the SDK to connect to Athena, calls to the service are asynchronous. This means that when you run a query from a Lambda function, you send it but you don’t receive the answer immediately; you have to poll Athena to find out whether the query has finished.&lt;/p&gt;

&lt;p&gt;To handle this, we added some intermediate decision steps that wait a certain amount of time to give Athena a chance to finish processing. If Athena has not finished, the state machine waits that amount of time again before asking again.&lt;/p&gt;
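&lt;p&gt;Sketched in Step Functions’ state language, such a wait-and-poll loop looks roughly like this (the state names and Lambda ARNs are placeholders, not our actual definitions):&lt;/p&gt;

```json
{
  "StartAt": "StartQuery",
  "States": {
    "StartQuery": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:...:function:startAthenaQuery",
      "Next": "WaitForAthena"
    },
    "WaitForAthena": {
      "Type": "Wait",
      "Seconds": 30,
      "Next": "CheckQueryStatus"
    },
    "CheckQueryStatus": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:...:function:getQueryStatus",
      "Next": "IsQueryFinished"
    },
    "IsQueryFinished": {
      "Type": "Choice",
      "Choices": [
        { "Variable": "$.status", "StringEquals": "SUCCEEDED", "Next": "ProcessResults" },
        { "Variable": "$.status", "StringEquals": "FAILED", "Next": "QueryFailed" }
      ],
      "Default": "WaitForAthena"
    },
    "ProcessResults": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:...:function:processResults",
      "End": true
    },
    "QueryFailed": { "Type": "Fail" }
  }
}
```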

&lt;p&gt;Here we can see the first benefit of this model: the Lambda we use only has to send the query to Athena, and then Athena does all the work. This is where the improvement lies: instead of many Lambdas processing the files, we have one that sends the request and goes to sleep.&lt;/p&gt;

&lt;p&gt;The other parts are not much more sophisticated than before. As in the first architecture, the process begins with a parsing task to leave the files ready for Athena to query. This can be done with crawlers, using AWS Glue to transform the data so that Athena can query it. Another alternative that we used to reduce costs is to create the partitions via an Athena query.&lt;/p&gt;
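&lt;p&gt;Creating a partition via an Athena query looks roughly like this (the table, partition key, and S3 path are made up for illustration):&lt;/p&gt;

```sql
-- Hypothetical example: table, partition key, and S3 path are made up.
ALTER TABLE router_logs
ADD IF NOT EXISTS PARTITION (dt = '2018-08-22')
LOCATION 's3://my-bucket/raw-logs/dt=2018-08-22/';
```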

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2AKRZfpF-C7M6Z0h7WOu3vNQ.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F2000%2F1%2AKRZfpF-C7M6Z0h7WOu3vNQ.png" alt="Data Transformation"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;After this finishes, the data analysis begins. This is where a Lambda function calls Athena and asks for the processed data. This is done for the different time periods, adding, as mentioned before, the waits and the logic for retries and errors.&lt;/p&gt;

&lt;p&gt;And best of all, if you know basic SQL you can write amazing queries.&lt;/p&gt;

&lt;h3&gt;
  
  
  Enhancement
&lt;/h3&gt;

&lt;p&gt;As we continued to learn and research, we realized that there were even more ways to improve performance.&lt;/p&gt;

&lt;p&gt;So I want to share some of them with you:&lt;/p&gt;

&lt;p&gt;Compression — Log data usually compresses well, and compressed data means a smaller amount of data scanned.&lt;/p&gt;

&lt;p&gt;Columnar Data Format — As suggested by AWS, you can convert the data to Parquet format, massively reducing the amount of data your queries scan.&lt;/p&gt;

&lt;p&gt;Caching — You don’t want to rerun the same queries over and over, so you can begin to systematically store and categorise the results in an S3 data lake.&lt;/p&gt;

&lt;p&gt;Running queries together — To make it cheaper still, you can begin to string multiple queries together to be run as one, then split the results apart in Lambda before sending them back to S3. Athena sets a maximum of 10 concurrent queries, which is another reason it is best to do more work in one query.&lt;/p&gt;
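&lt;p&gt;The columnar-format point above can be made concrete with a CREATE TABLE AS SELECT query in Athena (the names and S3 path are made up; an AWS Glue job is another way to do the same conversion):&lt;/p&gt;

```sql
-- Hypothetical CTAS example: table names and the S3 path are made up.
CREATE TABLE router_logs_parquet
WITH (
  format = 'PARQUET',
  external_location = 's3://my-bucket/parquet-logs/'
) AS
SELECT device_id, router_id, seen_at
FROM router_logs;
```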

&lt;h3&gt;
  
  
  Final thoughts
&lt;/h3&gt;

&lt;p&gt;Having come all this way, we decided to deploy the Amazon Athena solution to production. As you may have seen, throughout this whole process we found that working with Athena brought many benefits to light.&lt;/p&gt;

&lt;p&gt;We think we arrived at a robust and scalable solution. Furthermore, an architecture that takes advantage of Athena is far more cost effective.&lt;/p&gt;

&lt;p&gt;Now that you have seen how the two different architectures are implemented, I hope you can try them out for yourselves and comment on the architectures you use on a daily basis. We are a group that is growing with a desire to learn from every experience.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;So if you have any questions about what we have done, I’d love to hear your questions and feedback in the comments below.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Thanks for reading! Be sure to give it a clap if you enjoyed it!&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Follow me if you want: &lt;a href="https://www.linkedin.com/in/anavcevich/" rel="noopener noreferrer"&gt;Linkedin&lt;/a&gt;, &lt;a href="https://twitter.com/agusnavce" rel="noopener noreferrer"&gt;Twitter&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

</description>
      <category>aws</category>
      <category>athena</category>
      <category>bigdata</category>
      <category>serverless</category>
    </item>
  </channel>
</rss>
