NewSQL: An Implementation with Google Spanner

arthurolga — Mon, 10 Aug 2020 16:01:25 +0000

Introduction

What is actually NewSQL? You’ve probably already dealt with SQL and NoSQL databases. Each with its own advantages and disadvantages. In this article, I’m going to present the main characteristics of NewSQL databases and a simple implementation using Google Spanner.

Firstly, a little recap about the main topics concerning relational e and no-relational databases.

SQL is known for being sturdy and organized, dealing with a set of properties called ACID (Atomicity, Consistency, Isolation, Durability), which is the main reason it is so popular, bringing a lot of these sturdiness. But this also means that it only scales vertically, so large companies might need a single powerful machine to support all their requests.

NoSQL, on the other hand, is very flexible, without the need for the relational structure of the regular SQL, being able to support non-structured forms of data. This type of database is known for being able to scale horizontally, which means that companies can build database servers close the global clients. But, these non-relational databases don’t support ACID transactions, which means they cannot provide the same consistency as the regular relational.

NewSQL comes with the idea of bringing the major advantages in each SQL and NoSQL to the same service. These databases use various techniques to provide these functionalities, as such:

Partitioning/Sharding: In order to be able to scale horizontally, NewSQL uses a system of dividing itself in various shards/nodes/partitions. Differently form the fragmentation of the NoSQL databases, the shards run partial parts of the whole databases, although the whole works as a distributed application.
Main Memory Storage: This characteristic provides the ability for the database to run on the memory, instead of a hard disc or flash memory. This provides the NewSQL a lot of leverage in terms of speed.
Replication and Consistency: NewSQL replicates itself through transparent nodes using the Paxos or Raft consensus algorithm. They are families of complex protocols able to assure consensus in a network of unreliable processes.
Concurrency Control: The NewSQL applies Multi-Version Concurrency Control (MVCC), Timestamp methods and T/O to assure the access is granted to the necessary nodes given a certain operation. This together with the consensus algorithm provides the ability of the database to support ACID transactions.

This all makes NewSQL databases capable of OLTP, unlike any other databases, considering even SQL supports it partially.

One of the best ways to implement a NewSQL database is using Google's DBaaS solution, called Spanner. Which can start a instance ready in minutes.But this all comes with a cost, a large cost. A simple Google Spanner Database with two nodes can cost about 2.00 USD per hour, a lot more than regular services. The main reason this service costs that much is because of the in-memory implementation, which is a lot pricier than a normal hard disc. Also, as the transparent shards work with redundancy, the space needed to store data can sometimes be very large to provide the consistency needed.

Tutorial

We are going to build a Google Spanner Instance running with two nodes. Be careful as this will result in charges of around 2.00 USD per hour, so you can delete it after this tutorial. For this, we're going to need Python 2.7 and a Google Cloud account.

Configuring your Google Cloud Console

Supposing you already have a Google Account with a credit card. Enter in your Google Cloud console, specifically at the Projects page: https://console.cloud.google.com/cloud-resource-manager

Click on the "Create Project" button. We are going to create a new project for this tutorial, but you can also use one of your own.

Choose a Project Name that suits you, but you are going to have to remember it for the later parts. Click on "Create" and wait for your project to be ready.

Now, activate de Google Spanner API on your project: https://console.cloud.google.com/flows/enableapi?apiid=spanner.googleapis.com

Installing the Google Cloud CLI

We are going to deploy the instance using the CLI. You can also do this using the Google Cloud Console on the web and switch to the command line at any moment.

Open your terminal and check if you have Python 2.7 installed and on your path:

python -V

On macOS:

You can install it using Homebrew with:

brew cask install gcloud

Alternatively you can install using the interactive installer bellow.

On Linux and macOs:

Run:

curl https://sdk.cloud.google.com | bash

Reboot the shell:

exec -l $SHELL

On Windows:

Use the GUI installer for windows, available at: https://cloud.google.com/sdk/downloads#windows

Configure your Google Cloud CLI

Run on the terminal:

gcloud auth application-default login

Chrome will open a window for you to login with your Google Account.

Deploying a Google Spanner Instance:

Run to show all available projects:

gcloud projects list

The project you just created should appear like this:

Now you want to select this project as default. If you choose not to, you will have to pass a project parameter at every command.

gcloud config set project PROJECT_ID

Now we are going to run our instances with two nodes. I'm choosing us-east1 as the region because it is the cheapest by the time I'm writing this tutorial.

gcloud spanner instances create test-instance --config=regional-us-east1 --description="My Instance" --nodes=2

To see the instance you have created, run:

gcloud spanner instances list

Which should return as such:

The instance is identified by test-instance, and the display name is My Instance.

Now we are going to build a database and a table. As I've mentioned earlier, we can straight up use SQL language to work with this Database, so we're going to build some simples queries.

To create a database called classroom:

gcloud spanner databases create classroom --instance=test-instance

We are going to build a simples student table. For this, we are going to run a DDL command with the content of the query being:

CREATE TABLE Student (
  StudentId   INT64 NOT NULL,
  FirstName  STRING(1024),
  LastName   STRING(1024)
) PRIMARY KEY(StudentId);

To run this on the Google CLI, we are going to run a command on the directory of this file with:

gcloud spanner databases ddl update classroom \
--instance=test-instance --ddl='CREATE TABLE Student (
  StudentId   INT64 NOT NULL,
  FirstName  STRING(1024),
  LastName   STRING(1024)
) PRIMARY KEY(StudentId);'

You can use gcloud spanner databases ddl update DATABASE every time you want to alter the database schema.

Now, to add run a DML we are going to use execute-sql:

gcloud spanner databases execute-sql classroom --instance=test-instance \
    --sql='INSERT INTO Student (StudentId, FirstName, LastName)
VALUES (1,"Arthur", "Olga");'

To add some more rows:

gcloud spanner databases execute-sql classroom --instance=test-instance \
    --sql='INSERT INTO Student (StudentId, FirstName, LastName)
VALUES (2,"Vitor", "Liu");'

gcloud spanner databases execute-sql classroom --instance=test-instance \
    --sql='INSERT INTO Student (StudentId, FirstName, LastName)
VALUES (3,"Iago", "Mendes");'

Now we can see the table with:

gcloud spanner databases execute-sql classroom --instance=test-instance \
    --sql='SELECT * FROM Student'

Which should return

Now to delete this expensive test instance:

gcloud spanner instances delete test-instance

As you've seen, running commands on the CLI can be useful, but to build a real service, like a REST application, you need to be able to run these commands in any sort of language. So we're going to run a query using Python 3.7,** **but you can choose from a variety of different languages, such as C#, NodeJS or Ruby.

Running with Python3

Now we are going to run a query using a program. First You should have Python3 and pip installed.