loading...
Cover image for Using AWS S3 as a database
AWS Community Builders

Using AWS S3 as a database

danielmuller profile image Daniel Muller ・5 min read

During re:Invent 2017, AWS announced a new feature for S3: s3-select, which went GA in April 2018.

In a Nutshell

S3 select allows to retrieve partial content from a single key in S3 using SQL. You can think of it as a single table-database.

There are some constraints:

  • Data needs to be in a single file and can't be grouped under a prefix
  • The content of the file must be in JSON, CSV or Apache Parquet
  • Supported compressions are GZIP or BZIP2
  • You can't use joins between tables
  • You can't update a single entry, you need to replace the whole file
    • But this comes with a benefit: updating the data becomes very easy. All you need is an S3 PutObject access (Console, CLI, SDK, SFTP, ...)

This is specially useful in bigdata. Instead of loading a large file, you retrieve only the useful content. Therefore reducing network transfer, storage and memory on the processing side, which translates into cost reduction.

S3-Select stills needs to scan the whole file (you pay for that), but you gain on the processing side.

Which format to choose?

Each format has it's pros and cons. The format to use mainly depends on your production capabilities and query needs.

Being text formats, CSV and JSON a very easy to produce (code, spreadsheets, ...). Parquet needs some additional skills and tools.

Schema alterations are straight forward in JSON, since every entry carries it's own schema. For CSV and Parquet, you need to alter each entry with the new field. Using a spreadsheet software helps for CSV.

With all the attributes repeated, JSON is the heaviest format. Half of the file is schema definition. With CSV and Parquet you define your attributes only once. Obviously, this becomes less of a factor once compressed.

On the query side, Parquet is the clear winner. Being a columnar format, the amount of data scanned is greatly reduced, since only the needed columns are used. With CSV and JSON the whole file needs to be scanned.

JSON CSV Parquet
Data Creation Easy Easy Difficult
Data Modification Easy Average Difficult
Storage Size High Low Low
Query speed Slow Slow Fast

Which compression to choose?

You should compress your files, there is no reason to keep them vanilla. Smaller files results in less storage used and faster transfer times from your data source to S3.

BZIP2 is a bit slower, but will generate smaller files than GZIP. You reduce even more your upload times and storage costs, but with a small impact on query times.

A simple example

player_id player_slug first_name last_name player_url flag_code residence birthplace birthdate birth_year birth_month birth_day turned_pro weight_lbs weight_kg height_ft height_inches height_cm handedness backhand
a002 ricardo-acuna Ricardo Acuna http://www.atpworldtour.com/en/players/ricardo-acuna/a002/overview CHI Jupiter, FL, USA Santiago, Chile 1958.01.13 1958 01 13 0 150 68 "5'9""" 69 175
a001 sadiq-abdullahi Sadiq Abdullahi http://www.atpworldtour.com/en/players/sadiq-abdullahi/a001/overview NGR 1960.02.02 1960 02 02 0 0 0 "0'0""" 0 0
a005 nelson-aerts Nelson Aerts http://www.atpworldtour.com/en/players/nelson-aerts/a005/overview BRA Cachoeira Do Sul, Brazil 1963.04.25 1963 04 25 0 165 75 "6'2""" 74 188
a004 egan-adams Egan Adams http://www.atpworldtour.com/en/players/egan-adams/a004/overview USA Palmetto, FL, USA Miami Beach, FL, USA 1959.06.15 1959 06 15 0 160 73 "5'10""" 70 178

...

Query the dataset

The API allowing to query a file's content is SelectObjectContent. Also available in most of the SDK's.

For the CLI, you find it under aws s3api select-object-content.

Let's do it with Javascript:

const AWS = require('aws-sdk')
var credentials = new AWS.SharedIniFileCredentials({profile: 'default'})
AWS.config.credentials = credentials
const s3 = new AWS.S3({region:'us-east-1'})

const query = 'SELECT s.last_name,s.birth_year from s3object s where s.flag_code=\'SUI\' and s.birth_year!=\'\''

const params = {
  Bucket: 'myDataBucket',
  Key: 'datasets/atp-players.csv.gz',
  InputSerialization: {
    CSV: {
      FileHeaderInfo: 'USE'
    },
    CompressionType: 'GZIP',
  },
  OutputSerialization: {
    CSV:{}
  },
  ExpressionType: 'SQL',
  Expression: query
}


s3.selectObjectContent(params).promise().then(res => {
  let records = []
  let stats = {}
  res.Payload.on('data', function (event) {
    if (event.Stats) {
      stats = event.Stats
    }
    if (event.Records) {
      if (event.Records.Payload) {
        let data = event.Records.Payload.toString()
        records = [
          ...records,
          ...data.split('\n').filter(l => l).map(e => {
            return {
              last_name: e.split(',')[0],
              birth_year: e.split(',')[1]
            }
          })
        ]
      }
    }
  })
  res.Payload.on('error', function (err) { console.log('err', err) })
  res.Payload.on('end', function () {
    console.log(records)
    console.log(stats)
  })

})
Enter fullscreen mode Exit fullscreen mode

records:

[
  { "last_name": "Allegro", "birth_year": "1978" },
  { "last_name": "Bastl", "birth_year": "1975" },
  { "last_name": "Bohli", "birth_year": "1983" },
  { "last_name": "Blatter", "birth_year": "1949" },
  ...
]
Enter fullscreen mode Exit fullscreen mode

stats:

{
  "Details": {
    "BytesScanned": 379951,
    "BytesProcessed": 1557430,
    "BytesReturned": 1034
  }
}
Enter fullscreen mode Exit fullscreen mode

S3-Select had to scan the whole compressed file and needed to process the uncompressed content. But needed to return only 1KB (~0.3% of the total file).

In conclusion

Considering S3 as a Database is surely far fetched. But it works very well for single large datasets on which you need to retrieve a small chunk. Consider it as your read-only slave database.

Keep in mind, that the read speed is far below what you can achieve with a real databases on SSD, so don't use this for anything time sensitive.

The ability to update content with more traditional non-database tools, allows a wider range of peoples to curate and maintain the data.

A real life example

When S3-Select came out, I decided to use the above concept to create a Serverless GeoIP API using Maxmind's GeoIP2 databases. I wanted to replace the locally installed databases on each server by an HTTP service. This helps on the administrative side by managing updates in only one place, remove boot up scripts while autoscaling, and also it allows other processes and services to use this data (ETL, logging, lambda, ...).

You can read more about it in my blog or install it from my Github repo (works with GeoIP2 and GeoLite2 databases).

I made some tweaks from the above example:

  • The data files are sharded by primary key to reduce the scan size
  • Cloudfront is used as a database-cache layer to avoid having to query twice for the same IP.

Why not DynamoDB?

When S3-Select came out, DynamoDB didn't have an "On-Demand" pricing. Having to provision (and pay for) a fixed throughput was too expensive.

Today, with "On-Demand" pricing, I would definitively use DynamoDB for this tool. Storage pricing is similar to S3, query pricing also. Maintaining and querying the data in DynamoDB is also easier.

Discussion

pic
Editor guide