DEV Community: Craig Kerstiens

Getting started with GitHub event data on Citus

Craig Kerstiens — Tue, 19 Sep 2017 17:37:10 +0000

Getting an example schema and data is often one of the more time-consuming parts of testing a database. To make that easier for you, we're going to walk through Citus with an open data set which almost any developer can relate to: GitHub event data. If you already have your own schema, data, and queries you want to test with, by all means use them. If you need any help getting set up, join us in our Slack channel and we'll be happy to talk through different data modeling options for your own data.

An overview of the schema and queries

The data model we're going to work with here is simple: we have users and events. An event can be a fork, a commit related to an organization, or one of many other types.

To get started we're going to log in to Citus Cloud and provision a production cluster. You can absolutely use the dev plan, which only costs about $3 a day, but in this case we're going to use the production instance so we can easily resize it towards the end. Once you've provisioned your cluster you can connect to it with your standard Postgres psql:

psql "postgres://citus:3nHmf5NObkfOsKmvfni0Fg@c.fnq7xkf34cjb6vfdubz3cp6427a.db.citusdata.com:5432/citus?sslmode=require"

Now we're going to set up our two tables.

CREATE TABLE github_events
(
    event_id bigint,
    event_type text,
    event_public boolean,
    repo_id bigint,
    payload jsonb,
    repo jsonb,
    user_id bigint,
    org jsonb,
    created_at timestamp
);

CREATE TABLE github_users
(
    user_id bigint,
    url text,
    login text,
    avatar_url text,
    gravatar_id text,
    display_login text
);

On the payload field of events we have a JSONB datatype. JSONB is the JSON datatype in binary form in Postgres. This makes it easy to store a more flexible schema in a single column and with Postgres we can create a GIN index on this which will index every key and value within it. With a GIN index it becomes fast and easy to query with various conditions directly on that payload. So we'll go ahead and create a couple of indexes before we load our data:

CREATE INDEX event_type_index ON github_events (event_type);
CREATE INDEX payload_index ON github_events USING GIN (payload jsonb_path_ops);
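
As a hedged illustration of what the jsonb_path_ops index above is for: that operator class serves the containment operator @>, so a filter of the following shape is the kind of query it can accelerate once data is loaded. The ref_type key is the one used in a later query in this post; whether the planner actually picks the index depends on your data and statistics.

```sql
-- Sketch: assumes the github_events table and payload_index defined above.
-- EXPLAIN only prints the plan, so it is safe to run before loading data.
EXPLAIN
SELECT count(*)
FROM github_events
WHERE payload @> '{"ref_type": "repository"}';
```

Key-existence checks with the ? operator are not covered by jsonb_path_ops; the default GIN operator class would be needed for those.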

Next we’ll actually take those standard Postgres tables and tell Citus to shard them out. To do so we’ll run a query for each table, specifying the table we want to shard as well as the key we want to shard it on. In this case we’ll shard both the events and users tables on user_id:

SELECT create_distributed_table('github_events', 'user_id');
SELECT create_distributed_table('github_users', 'user_id');
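
To check what Citus recorded, one option is to look at the pg_dist_partition metadata table on the coordinator. This is a minimal sketch, assuming both tables were distributed as above; the exact colocationid values will differ on your cluster.

```sql
-- Both tables are hash-distributed on user_id, so they would be expected
-- to report the same colocationid, meaning matching shards live together.
SELECT logicalrelid AS table_name,
       partmethod,
       colocationid
FROM pg_dist_partition
WHERE logicalrelid IN ('github_events'::regclass, 'github_users'::regclass);
```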

Now we're ready to load some data. You can download the two example files users.csv and events.csv. We also have a large_events.csv available, which may be more interesting to try out, though it admittedly takes longer to download and load. Once downloaded, connect with psql and load the data with \copy:

\copy github_events from events.csv CSV;
\copy github_users from users.csv CSV;

Querying

Now we're all set up for the fun part: actually running some queries. Let's start with something really basic, a simple count(*) to see how much data we loaded:

SELECT count(*) from github_events;
 count
--------
 126245
(1 row)

Time: 177.491 ms

So a nice simple count works. We'll come back to that sort of aggregation in a bit, but for now let’s look at a few other queries. Within the JSONB payload column we've got a good bit of data, but it varies based on event type. A PushEvent payload has a distinct_size field, which holds the number of distinct commits in that push. With this we can compute something like the total number of commits per hour:

SELECT date_trunc('hour', created_at) AS hour,
       sum((payload->>'distinct_size')::int) AS num_commits
FROM github_events
WHERE event_type = 'PushEvent'
GROUP BY hour
ORDER BY hour;
        hour         | num_commits
---------------------+-------------
 2016-12-01 05:00:00 |       22160
 2016-12-01 06:00:00 |       53562
 2016-12-01 07:00:00 |       46540
 2016-12-01 08:00:00 |       35002
(4 rows)

Time: 186.176 ms

But we also had our users table. Since we sharded both users and events on the same id, related data is co-located and can easily be joined. In certain cases, such as multi-tenant data models, you'll find sharding on a tenant_id makes scaling out quite straightforward. If we join on user_id, the join is pushed down to all the distributed shards without us having to do any extra work. An example of something we may want to do with this is to find the users who created the most repositories:

SELECT login, count(*)
FROM github_events ge
JOIN github_users gu
ON ge.user_id = gu.user_id
WHERE event_type = 'CreateEvent' AND
payload @> '{"ref_type": "repository"}'
GROUP BY login
ORDER BY count(*) DESC;
             login              | count
---------------------------------+-------
 atomist-test-web                |    60
 isisliu                         |    60
 atomist-web-test-staging        |    55
 direwolf-github                 |    50
 circle-api-test                 |    40
 uncoil                          |    23
 kvo91                           |    14
 ranasarikaya                    |    10
 Alexgallo91                     |     9
 marcvl                          |     9
 Joshua-Zheng                    |     8
 ...
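
Because both tables are sharded on user_id, a query that filters on a single user only needs the shards holding that user. The following is a sketch of such a query; the user_id value 1234 is a made-up placeholder, not an id known to be in the sample data.

```sql
-- Hypothetical user_id; substitute one returned by
--   SELECT user_id FROM github_users LIMIT 1;
SELECT gu.login,
       ge.event_type,
       count(*)
FROM github_events ge
JOIN github_users gu
  ON ge.user_id = gu.user_id
WHERE ge.user_id = 1234
GROUP BY gu.login, ge.event_type
ORDER BY count(*) DESC;
```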

Scaling out

One of the major benefits of Citus is that when you need to, you can scale out your database as opposed to scaling up. This means a more horizontal path to scaling your database, and you won't run into the ceiling of the largest instance you can find. Of course there are a few other benefits to scaling out as opposed to up as well. When you do need to scale out, on Citus Cloud it's as simple as going to the settings and resizing your formation. Once you change your size and it takes effect, you then need to rebalance your data so it's distributed across all nodes.

First let's look to see how the data resides:

SELECT nodename, count(*)
FROM pg_dist_shard_placement
GROUP BY nodename;
                 nodename                 | count
------------------------------------------+-------
 ec2-34-198-9-41.compute-1.amazonaws.com  |    32
 ec2-34-198-11-52.compute-1.amazonaws.com |    32
(2 rows)

Time: 83.659 ms

As you can see, there is an equal number of shards on each node.

Now let's hop back into the console in our settings area and resize the cluster. To do this, log in to Citus Cloud, click on the settings tab, and hit resize. You'll now see the slider that allows you to resize your cluster; scale to what you desire and click Resize. Give it a few minutes and all your nodes will be available.

But nothing has changed for your data yet. To begin taking advantage of your new nodes you'll want to run the rebalancer. When you run the rebalancer we move shards from one physical instance to another so your data is more evenly distributed. As this happens, reads continue to flow as normal and writes are held at the coordinator. Once the operation completes, the writes held at the coordinator continue to flow through. This means no reads are delayed and no writes are lost throughout the entire operation. Even better, for multi-tenant apps we move all co-located shards in concert with each other so joins between them continue to operate as you'd expect.

Now, let's run the rebalancer. When you run it you'll see output for each shard that is moved from one node to another:

SELECT rebalance_table_shards('github_events', 0.0);
NOTICE:  00000: Moving shard 102072 from ec2-34-198-11-52.compute-1.amazonaws.com:5432 to ec2-34-197-198-188.compute-1.amazonaws.com:5432 ...
CONTEXT:  PL/pgSQL function rebalance_table_shards(regclass,real,integer,bigint[]) line 63 at RAISE
LOCATION:  exec_stmt_raise, pl_exec.c:3165
NOTICE:  00000: Moving shard 102073 from ec2-34-198-9-41.compute-1.amazonaws.com:5432 to ec2-34-197-247-111.compute-1.amazonaws.com:5432 ...
CONTEXT:  PL/pgSQL function rebalance_table_shards(regclass,real,integer,bigint[]) line 63 at RAISE
LOCATION:  exec_stmt_raise, pl_exec.c:3165
NOTICE:  00000: Moving shard 102074 from ec2-34-198-11-52.compute-1.amazonaws.com:5432 to ec2-34-197-198-188.compute-1.amazonaws.com:5432 ...
...

Now we can re-run the query to show us all the shard placements and see our new even distribution of shards:

SELECT nodename, count(*)
FROM pg_dist_shard_placement
GROUP BY nodename;
                  nodename                  | count
--------------------------------------------+-------
 ec2-34-197-198-188.compute-1.amazonaws.com |    16
 ec2-34-198-9-41.compute-1.amazonaws.com    |    16
 ec2-34-198-11-52.compute-1.amazonaws.com   |    16
 ec2-34-197-247-111.compute-1.amazonaws.com |    16
(4 rows)
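
The rebalance call above only named github_events. To see whether the co-located github_users shards moved along with it, the placements can be broken down per table. This is a sketch that joins in the pg_dist_shard metadata table, which is not otherwise used in this post.

```sql
-- One row per (table, node) with the number of shards placed there.
SELECT s.logicalrelid::text AS table_name,
       p.nodename,
       count(*) AS shards
FROM pg_dist_shard s
JOIN pg_dist_shard_placement p
  ON p.shardid = s.shardid
GROUP BY 1, 2
ORDER BY 1, 2;
```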

And with this we can go back to our count(*) query as well. Now that we've doubled the resources in our cluster, queries that can be parallelized will run much faster. Running our basic count(*) query, we'll see that the query time is now nearly half of what it was before:

SELECT count(*) from github_events;
 count
--------
 126245
(1 row)

Time: 97.792 ms

Get started today

If you're in need of a dataset to kick the tires with Citus, I hope this helps. Feel free to download Citus for free or sign up on Citus Cloud and get started.

Originally posted on the Citus Data blog

Sharding a multi-tenant app with Postgres

Craig Kerstiens — Wed, 19 Jul 2017 15:06:28 +0000

Whether you’re building marketing analytics, a portal for e-commerce sites, or an application to cater to schools, if you’re building an app and your customer is another business then a multi-tenant approach is the norm. The same code runs for all customers, but each customer sees their own private data set, except in the cases of holistic internal reporting.

Early in your application’s life, customer data has a simple structure which evolves organically. Typically all information relates to a central customer/user/tenant table. With a smaller amount of data (tens of GB) it’s easy to scale the application by throwing more hardware at it, but what happens when you’ve had so much success that your data no longer fits in memory on a single box, or you need more concurrency? You scale out by re-architecting your application, and it’s often painful (and expensive).

Options for scaling out your database

This scale-out model for databases has worked well for the likes of Google and Instagram, but doesn't have to be as complicated as you might think.

If you're able to model your multi-tenant data in the right way, sharding can be much simpler: you do not need to re-architect your application to scale out, and you can keep the power you need from a database, including joins, indexing, and more. I work at Citus Data, where we’ve created a database that scales out Postgres (an extension to Postgres, actually): we’ve done the hard work of sharding so you don’t have to. While Citus lets you scale out your processing power, memory, and storage, how you model your data will determine the ease and flexibility you get from the system. If you're building a multi-tenant SaaS application, hopefully the following example highlights how you can plan early for scaling without having to contort too much of your application.

Scaling out Postgres with ease by adopting a multi-tenant data model

At the core of most SaaS applications, tenancy is already built in, whether you realize it or not. By “tenancy”, we mean the notion that your SaaS application has multiple customers (“tenants”) who are all sharing the same application but whose data needs to be kept separate from each other. (The same way that multiple tenants can live in the same building, but each have their own separate apartment.)

Anyway, as we mentioned above, you may have a users table. Let's look at a very basic SaaS schema that highlights this:

CREATE TABLE stores (
  id UUID,
  owner_email VARCHAR(255),
  owner_password VARCHAR(255),
  name VARCHAR(255),
  url VARCHAR(255),
  last_login_at TIMESTAMPTZ,
  created_at TIMESTAMPTZ
);

CREATE TABLE products (
  id UUID,
  name VARCHAR(255),
  description TEXT,
  price INTEGER,
  quantity INTEGER,
  store_id UUID,
  created_at TIMESTAMPTZ,
  updated_at TIMESTAMPTZ
);

CREATE TABLE purchases (
  id UUID,
  product_id UUID,
  customer_id UUID,
  store_id UUID,
  price INTEGER,
  purchased_at TIMESTAMPTZ
);

The above schema highlights an overly simplified multi-tenant e-commerce site, say, for example, someone like an Etsy or Shopify. And of course there are a number of queries you would run against this:

List the products for a particular store:

SELECT id,
       name,
       price
FROM products
WHERE store_id = 'foo';

Or let’s say you want to compute how many purchases exist weekly for a given store:

SELECT date_trunc('week', purchased_at) AS week,
       count(*) AS purchases,
       sum(purchases.price) AS revenue
FROM purchases,
     stores
WHERE stores.id = purchases.store_id
  AND purchases.store_id = 'foo'
GROUP BY 1
ORDER BY 1;

From here you could envision how to give each store its own presence and analytics. Now if we fast-forward a bit and start to look at scaling this out then we have a choice to make on how we'll shard the data.

The easiest level to do this at is the tenant level, or in this case on store_id. With the above data model the largest tables over time are likely to be products and purchases, and we could shard on either of these. Though if we choose products or purchases, the difficulty lies in the fact that we may want to do queries that focus on some higher-level item such as store. If we choose store_id, then all data for a particular store would exist on the same node, which would allow you to push down all computations directly to a single node.

Multi-tenancy and co-location, a perfect pair

Co-locating data within the same physical instance avoids sending data over the network during joins. This can result in much faster operations. With Citus, there are a number of ways to move your data around so you can join and query it in a flexible manner, but for this class of multi-tenant SaaS apps it’s simple if you can ensure related data ends up on the same shard. To do this, though, we need to push down our store_id to all of our tables.
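
A sketch of what carrying store_id through your queries looks like in practice: the join below matches on store_id as well as product_id, so every row it touches belongs to one tenant. The UUID is a made-up placeholder, and the tables are the products and purchases tables from the schema above.

```sql
SELECT products.name,
       count(*) AS times_purchased
FROM purchases
JOIN products
  ON products.id = purchases.product_id
 AND products.store_id = purchases.store_id  -- keep the join within one tenant
WHERE purchases.store_id = '6b1e3c0a-0000-4000-8000-000000000001'  -- hypothetical store id
GROUP BY products.name
ORDER BY times_purchased DESC;
```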

The key that makes this all possible is including your store_id on all tables. By doing this you can easily shard out all your data so it’s located on the same shard. In the above data model we coincidentally had store_id on all of our tables, but if it weren’t there you could add it. This would put you in a good position to distribute all your data so it’s stored on the same nodes. So now let’s try sharding our tenants, in this case stores:

SELECT master_create_distributed_table('stores', 'id', 'hash');
SELECT master_create_distributed_table('products', 'store_id', 'hash');
SELECT master_create_distributed_table('purchases', 'store_id', 'hash');

SELECT master_create_worker_shards('stores', 16);
SELECT master_create_worker_shards('products', 16);
SELECT master_create_worker_shards('purchases', 16);
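
The GitHub events post uses the shorter create_distributed_table call for the same job. Here is a hedged sketch of the equivalent for this schema, as an alternative to the calls above. It assumes a Citus version that provides create_distributed_table with its colocate_with parameter, and it takes the shard count from the citus.shard_count setting.

```sql
-- Alternative to the master_* calls; run one approach or the other, not both.
SET citus.shard_count = 16;

SELECT create_distributed_table('stores', 'id');
SELECT create_distributed_table('products', 'store_id', colocate_with => 'stores');
SELECT create_distributed_table('purchases', 'store_id', colocate_with => 'stores');
```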

Now you’re all set. Again, you’ll notice that we shard everything by store_id. This allows all queries to be routed to a single Postgres instance. The same queries as before should work just fine for you as long as you have store_id on your query. An example layout of your data now may look something like:

[Figure: example layout of the sharded data across nodes]

The alternative to co-location is to choose some lower-level shard key such as orders or products. This has the trade-off of making joins and querying more difficult, because you have to send more data over the network and make sure things work in a distributed way. This lower-level key can be useful for consumer-focused datasets, if your analytics are always against the entire data set, as is often the case in metrics-focused use cases.

To scale out your Postgres database, you have more good choices than you think.

It’s important to note, as we did in our sharding JSON in Postgres post, that different distribution models can have different benefits and trade-offs. In some cases modeling on a lower-level entity id such as products or purchases can be the right choice: you gain more parallelism for analytics and trade off simplicity in querying a single store. Either choice, picking a multi-tenant data model or adopting a more distributed document model, can be made to scale, but each comes with its own trade-offs. If you have the need today to scale out your multi-tenant app then give Citus Cloud a try, or if you have any questions on which type of scale-out data model might work best for your application, please don’t hesitate to reach out to my team at Citus. We can help. (And did we mention that Citus is available as open source, as a database service in AWS, and on-prem?)

Originally posted on the Citus Data blog

Working with time in Postgres

Craig Kerstiens — Mon, 10 Jul 2017 17:06:58 +0000

Originally posted on my personal blog at craigkerstiens.com

A massive number of reporting queries, whether really intensive data analysis or just basic insights into your business, involve looking at data over a certain time period. Postgres has really rich support for dealing with time out of the box, something that's often very underweighted when dealing with a database. Sure, if you have a time-series database it's implied, but even then how flexible and friendly is it from a query perspective? With Postgres there are a lot of key items available to you. Let's dig into the things that make your life easier when querying.

Date math

The most common thing I find myself doing is looking at users that have done something within some specific time window. If I'm executing this all from my app I can easily inject specific dates, but Postgres makes this really easy for you. Within Postgres you have a type called an interval, which is some window of time. And fortunately, Postgres takes care of the heavy lifting of how something might translate to or from hours/seconds/milliseconds/etc. Here are just a few examples of things you could do with intervals:

'1 day'::interval
'5 days'::interval
'1 week'::interval
'30 days'::interval
'1 month'::interval

Note that if you're looking to go back something like a full month, you actually want to use '1 month' instead of trying to calculate the number of days yourself, since months vary in length.
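
To make the difference concrete, here is a small self-contained comparison. The date is arbitrary and chosen so that the two intervals disagree.

```sql
SELECT '2017-03-31'::date - '1 month'::interval AS one_month_back,    -- 2017-02-28 00:00:00
       '2017-03-31'::date - '30 days'::interval AS thirty_days_back;  -- 2017-03-01 00:00:00
```

Subtracting '1 month' lands on the last day of February, while subtracting '30 days' only reaches March 1.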

With a given interval you can easily shift some window of time, such as finding all users that have signed up for your service within the past week:

SELECT *
FROM users
WHERE created_at >= now() - '1 week'::interval;

Date functions

Date math makes it pretty easy for you to go and find some specific set of data that applies, but what do you do when you want a broader report around time? There are a few options here. One is to leverage the built-in Postgres functions that help you work with dates and times. date_trunc is one of the most used ones; it truncates a date down to some level of precision such as week or month. Here you can use the same general units as above, but simply pass in the name of the unit. So if we wanted to find the count of users that signed up per week:

SELECT date_trunc('week', created_at), 
       count(*)
FROM users
GROUP BY 1
ORDER BY 1 DESC;
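
For intuition about what date_trunc returns, it can be run on a literal. In this sketch the timestamp is arbitrary; weeks are truncated to the Monday that starts them.

```sql
SELECT date_trunc('week',  '2017-07-12 15:30:00'::timestamp) AS week_start,   -- 2017-07-10 00:00:00
       date_trunc('month', '2017-07-12 15:30:00'::timestamp) AS month_start;  -- 2017-07-01 00:00:00
```

Applied to created_at, as in the signup query above, every row in the same week collapses to the same week_start value, which is what lets GROUP BY count users per week.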

This gives us a nice roll-up of how many users signed up each week. What's missing here, though, are the weeks that have no users. Because no users signed up in such a week there is no row with a count of 0; it simply doesn't exist. If you did want those rows you could generate some range of time and then join it against users to see which week each user fell into. To do this, first you'd generate a series of dates:

SELECT generate_series('2017-01-01'::date, now()::date, '1 week'::interval) weeks;

Then we're going to join this against the actual users table and check that the created_at falls within the right range.

WITH weeks AS (
  SELECT week
  FROM generate_series('2017-01-01'::date, now()::date, '1 week'::interval) week
)
-- LEFT JOIN keeps weeks with no signups; count(users.created_at) is 0 for them.
SELECT weeks.week,
       count(users.created_at)
FROM weeks
LEFT JOIN users
  ON users.created_at >= weeks.week
 AND users.created_at < (weeks.week + '1 week'::interval)
GROUP BY 1
ORDER BY 1 DESC;

Timestamp vs. Timestamptz

What about storing the times themselves? Postgres has two types of timestamps: a generic timestamp, and one that is time zone aware, timestamptz. In most cases you should opt for timestamptz. Why not timestamp? What happens if you move a server, or your server somehow swaps its configuration? Or, perhaps more practically, what about daylight saving time? In general you might think that you can simply put in the time as you see it, but when different countries around the world observe things like daylight saving time differently, it introduces complexities into your application.

With timestamptz, Postgres takes the time zone of the incoming value into account and stores the value as a single point in time. Then when you query from a time zone that observes daylight saving time, you're covered. There are a number of articles that go more in depth on the difference between timestamp and timestamp with time zone, so if you're curious I encourage you to check them out, but by default you mostly just need to use timestamptz.
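
A short sketch of the difference, using an arbitrary instant and session time zones picked only for illustration:

```sql
SET TIME ZONE 'UTC';

-- timestamptz uses the offset in the input to pin down an exact instant.
SELECT '2017-07-10 12:00:00-07'::timestamptz;  -- 2017-07-10 19:00:00+00

-- timestamp silently discards the offset and keeps the wall-clock value.
SELECT '2017-07-10 12:00:00-07'::timestamp;    -- 2017-07-10 12:00:00

-- The same instant rendered for a session in another zone.
SET TIME ZONE 'America/Los_Angeles';
SELECT '2017-07-10 19:00:00+00'::timestamptz;  -- 2017-07-10 12:00:00-07
```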

More

There are a number of other functions and capabilities when it comes to dealing with time in Postgres. You can extract various parts of a timestamp or interval, such as the hour of the day or the month. You can grab the day of the week with dow. And one of my favorites, which is when we celebrate happy hour at Citus: there's a special input string for the time 00:00:00 UTC, which is allballs. If you need to work with dates and times in Postgres I encourage you to check out the docs before you try to re-write something of your own; chances are what you need may already be there.
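
A few of these in one self-contained query; the timestamp is arbitrary:

```sql
SELECT extract(hour FROM '2017-07-10 17:06:58'::timestamp) AS hour_of_day,  -- 17
       extract(dow FROM '2017-07-10'::date)                AS day_of_week,  -- 1 (Sunday is 0)
       'allballs'::time                                    AS midnight;     -- 00:00:00
```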

Getting started with JSONB in Postgres

Craig Kerstiens — Sun, 02 Apr 2017 18:49:26 +0000

Originally posted on my blog

JSONB is an awesome datatype in Postgres. I find myself using it on a weekly basis these days. Often in using some API (such as Clearbit) I'll get a JSON response back. Instead of parsing that out into a table structure, it's really easy to throw it into a JSONB column and then query for various parts of it.

If you're not familiar with JSONB, it's a binary representation of JSON in your database. You can read a bit more about it vs. JSON here.

In working with JSONB, here are a few quick tips to get up and running with it even faster:

Indexing

For the most part you don't have to think too much about this. With Postgres' powerful indexing types you can add one index and have everything within the JSON document, all the keys and all the values, automatically indexed. The key here is to add a GIN index. Once this is done, queries where you're searching for some value should be much faster:

CREATE INDEX idx_data ON companies USING GIN (data);

Querying

Querying is a little bit more work, but once you get the basics it can be pretty straightforward. There are a few new operators you'll want to quickly ramp up on, and from there querying becomes easy.

For the most basic part you now have an operator to traverse down the various keys. First let's get some idea of what the JSON looks like so we can have something to work with. Here's a sample set of data that we get back from Clearbit:

{
  "domain": "citusdata.com",
  "company": {
    "id": "b1ff2bdf-0d8d-4d6d-8bcc-313f6d45996a",
    "url": "http:\/\/citusdata.com",
    "logo": "https:\/\/logo.clearbit.com\/citusdata.com",
    "name": "Citus Data",
    "site": {
      "h1": null,
      "url": "http:\/\/citusdata.com",
      "title": "Citus Data"
    },
    "tags": [
      "SAAS",
      "Enterprise",
      "B2B",
      "Information Technology & Services",
      "Technology",
      "Software"
    ],
    "domain": "citusdata.com",
    "twitter": {
      "id": "304455171",
      "bio": "Builders of Citus, the extremely scalable PostgreSQL database.",
      "site": "https:\/\/t.co\/hKpZjIy7Ej",
      "avatar": "https:\/\/pbs.twimg.com\/profile_images\/630900468995108865\/GJFCCXrv_normal.png",
      "handle": "citusdata",
      "location": "San Francisco, CA",
      "followers": 3770,
      "following": 570
    },
    "category": {
      "sector": "Information Technology",
      "industry": "Internet Software & Services",
      "subIndustry": "Internet Software & Services",
      "industryGroup": "Software & Services"
    },
    "emailProvider": false
  }
}

Sorry it's a bit long, but it gives us a good example to work with.

Basic lookups

Now let's query something fairly basic, the domain:

# SELECT data->'domain' 
FROM companies 
WHERE domain='citusdata.com' 
LIMIT 1;

    ?column?
----------------------
 "citusdata.com"

The -> operator is likely the first one you'll use with JSONB. It's helpful for traversing the JSON. Though if you're looking to get the value as text you'll actually want to use ->>. Instead of giving you some quoted response or JSON object back, it gives you text, which will be a bit cleaner:

# SELECT data->>'domain' 
FROM companies 
WHERE domain='citusdata.com' 
LIMIT 1;

    ?column?
----------------------
 citusdata.com
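
The two operators can be chained to reach nested values, and #>> takes the whole path at once. This is a self-contained sketch on a trimmed-down literal based on the sample document above, so it runs without the companies table.

```sql
SELECT doc -> 'company' -> 'twitter' ->> 'handle' AS chained,  -- citusdata
       doc #>> '{company,twitter,handle}'         AS by_path   -- citusdata
FROM (
  SELECT '{"company": {"twitter": {"handle": "citusdata"}}}'::jsonb AS doc
) AS sample;
```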

Filtering for values

Now with something like Clearbit you may want to filter for only certain types of companies. We can see in the example data that there's a bunch of tags. If we wanted to find only companies that had the tag B2B we could use the ? operator once we've targeted down to that part of the JSON. The ? operator will tell us whether a string exists as a top-level key of an object or as an element of an array:

SELECT *
FROM companies
WHERE data->'company'->'tags' ? 'B2B';
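
The same filter can also be phrased as containment against the whole document. This sketch assumes the companies table and the GIN index on data created earlier; the containment form is written against the indexed column itself, which gives the planner the option of using that index.

```sql
SELECT *
FROM companies
WHERE data @> '{"company": {"tags": ["B2B"]}}';
```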

JSONB but pretty

In querying JSONB you'll typically get a nice compressed set of JSON back. While this is all fine if you're putting it into your application, if you're manually debugging and testing things you probably want something a bit more readable. Of course Postgres has your back here and you can wrap your JSONB with a pretty print function:

SELECT jsonb_pretty(data)
FROM companies;

Much more

There's a lot more in the docs that you can find handy for the specialized cases when you need them. jsonb_each will expand a JSONB object into one row per top-level key/value pair, and jsonb_array_elements does the same for the elements of an array. So if you wanted to count the number of occurrences of every tag across companies, these would help. Want to parse a JSONB document out into a row/record in Postgres? There's jsonb_to_record. The docs are your friend for just about everything you want to do, but hopefully these few steps help kick-start things if you want to get started with JSONB.
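
Since tags is a JSON array, one way to do the tag count mentioned above is with jsonb_array_elements_text, which returns each array element as text. This is a sketch, assuming every row's data->'company'->'tags' is either an array or missing.

```sql
SELECT tag,
       count(*) AS companies
FROM companies,
     jsonb_array_elements_text(data -> 'company' -> 'tags') AS tag
GROUP BY tag
ORDER BY companies DESC;
```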
