During re:Invent 2017, AWS announced a new feature for S3: S3 Select, which went GA in April 2018.
In a Nutshell
S3 Select lets you retrieve partial content from a single key in S3 using SQL. You can think of it as a single-table database.
There are some constraints:
- Data needs to be in a single file and can't be grouped under a prefix
- The content of the file must be in JSON, CSV or Apache Parquet
- Supported compression formats are GZIP and BZIP2
- You can't use joins between tables
- You can't update a single entry, you need to replace the whole file
- But this comes with a benefit: updating the data becomes very easy. All you need is S3 PutObject access (Console, CLI, SDK, SFTP, ...)
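As a quick sketch of how simple that refresh is (assuming the bucket and key used in the example further down, plus a local atp-players.csv file), replacing the dataset is nothing more than a gzip and a PutObject:

```javascript
const fs = require('fs')
const zlib = require('zlib')
const AWS = require('aws-sdk')

const s3 = new AWS.S3({ region: 'us-east-1' })

// Gzip the CSV before uploading to reduce storage and transfer size
const body = zlib.gzipSync(fs.readFileSync('atp-players.csv'))

// "Updating" the dataset is just a regular PutObject that replaces the whole file
s3.putObject({
  Bucket: 'myDataBucket',              // assumed bucket name, same as in the query example below
  Key: 'datasets/atp-players.csv.gz',
  Body: body
}).promise()
  .then(() => console.log('dataset replaced'))
  .catch(err => console.error(err))
```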
This is especially useful in big data workloads. Instead of loading a large file, you retrieve only the useful content, reducing network transfer, storage and memory on the processing side, which translates into cost savings.
S3 Select still needs to scan the whole file (you pay for that), but you gain on the processing side.
Which format to choose?
Each format has its pros and cons. The format to use mainly depends on your production capabilities and query needs.
Being text formats, CSV and JSON are very easy to produce (code, spreadsheets, ...). Parquet needs some additional skills and tools.
Schema alterations are straightforward in JSON, since every entry carries its own schema. For CSV and Parquet, you need to alter each entry with the new field. Using spreadsheet software helps for CSV.
With all the attributes repeated, JSON is the heaviest format. Half of the file is schema definition. With CSV and Parquet you define your attributes only once. Obviously, this becomes less of a factor once compressed.
On the query side, Parquet is the clear winner. Being a columnar format, the amount of data scanned is greatly reduced, since only the needed columns are used. With CSV and JSON the whole file needs to be scanned.
| | JSON | CSV | Parquet |
|---|---|---|---|
| Data Creation | Easy | Easy | Difficult |
| Data Modification | Easy | Average | Difficult |
| Storage Size | High | Low | Low |
| Query Speed | Slow | Slow | Fast |
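On the request side, the chosen format only changes the InputSerialization block you pass to SelectObjectContent. A minimal sketch of the three variants (the CSV one is the one used in the full example further down):

```javascript
// Only the InputSerialization block of the request changes with the file format.
// A sketch of the three variants; exact options depend on how your files are produced.
const csvInput = { CSV: { FileHeaderInfo: 'USE' } }  // CSV with a header row
const jsonInput = { JSON: { Type: 'LINES' } }        // one JSON object per line ('DOCUMENT' for a single document)
const parquetInput = { Parquet: {} }                 // Parquet needs no extra options
```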
Which compression to choose?
You should compress your files; there is no reason to keep them vanilla. Smaller files result in less storage used and faster transfer times from your data source to S3.
BZIP2 is a bit slower but generates smaller files than GZIP. You reduce your upload times and storage costs even more, with a small impact on query times.
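Whichever compression you pick, the query request barely changes: only CompressionType differs. A small sketch for a CSV input:

```javascript
// CompressionType accepts 'NONE', 'GZIP' or 'BZIP2' for CSV and JSON inputs
const inputSerialization = {
  CSV: { FileHeaderInfo: 'USE' },
  CompressionType: 'BZIP2' // or 'GZIP' / 'NONE'
}
```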
A simple example
- Dataset used: atp-world-tour-tennis-data/players
- Size: 1.5MB (372KB gzipped)
- Entries: 10912
- Fields: 20
| player_id | player_slug | first_name | last_name | player_url | flag_code | residence | birthplace | birthdate | birth_year | birth_month | birth_day | turned_pro | weight_lbs | weight_kg | height_ft | height_inches | height_cm | handedness | backhand |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| a002 | ricardo-acuna | Ricardo | Acuna | http://www.atpworldtour.com/en/players/ricardo-acuna/a002/overview | CHI | Jupiter, FL, USA | Santiago, Chile | 1958.01.13 | 1958 | 01 | 13 | 0 | 150 | 68 | 5'9" | 69 | 175 | | |
| a001 | sadiq-abdullahi | Sadiq | Abdullahi | http://www.atpworldtour.com/en/players/sadiq-abdullahi/a001/overview | NGR | | | 1960.02.02 | 1960 | 02 | 02 | 0 | 0 | 0 | 0'0" | 0 | 0 | | |
| a005 | nelson-aerts | Nelson | Aerts | http://www.atpworldtour.com/en/players/nelson-aerts/a005/overview | BRA | | Cachoeira Do Sul, Brazil | 1963.04.25 | 1963 | 04 | 25 | 0 | 165 | 75 | 6'2" | 74 | 188 | | |
| a004 | egan-adams | Egan | Adams | http://www.atpworldtour.com/en/players/egan-adams/a004/overview | USA | Palmetto, FL, USA | Miami Beach, FL, USA | 1959.06.15 | 1959 | 06 | 15 | 0 | 160 | 73 | 5'10" | 70 | 178 | | |
...
Query the dataset
The API to query a file's content is SelectObjectContent, which is also available in most of the SDKs. For the CLI, you find it under `aws s3api select-object-content`.
Let's do it with JavaScript:

```javascript
const AWS = require('aws-sdk')

// Load credentials from the default profile (~/.aws/credentials)
const credentials = new AWS.SharedIniFileCredentials({ profile: 'default' })
AWS.config.credentials = credentials
const s3 = new AWS.S3({ region: 'us-east-1' })

// Select Swiss players that have a birth year set
const query = 'SELECT s.last_name,s.birth_year from s3object s where s.flag_code=\'SUI\' and s.birth_year!=\'\''

const params = {
  Bucket: 'myDataBucket',
  Key: 'datasets/atp-players.csv.gz',
  InputSerialization: {
    // The source file is a gzipped CSV with a header row
    CSV: {
      FileHeaderInfo: 'USE'
    },
    CompressionType: 'GZIP'
  },
  OutputSerialization: {
    CSV: {}
  },
  ExpressionType: 'SQL',
  Expression: query
}

s3.selectObjectContent(params).promise().then(res => {
  let records = []
  let stats = {}
  // The response Payload is an event stream emitting Records, Stats and End events
  res.Payload.on('data', function (event) {
    if (event.Stats) {
      stats = event.Stats
    }
    if (event.Records) {
      if (event.Records.Payload) {
        // Records arrive as raw CSV chunks: split them into lines and map them to objects
        let data = event.Records.Payload.toString()
        records = [
          ...records,
          ...data.split('\n').filter(l => l).map(e => {
            return {
              last_name: e.split(',')[0],
              birth_year: e.split(',')[1]
            }
          })
        ]
      }
    }
  })
  res.Payload.on('error', function (err) { console.log('err', err) })
  res.Payload.on('end', function () {
    console.log(records)
    console.log(stats)
  })
})
```
records:

```json
[
  { "last_name": "Allegro", "birth_year": "1978" },
  { "last_name": "Bastl", "birth_year": "1975" },
  { "last_name": "Bohli", "birth_year": "1983" },
  { "last_name": "Blatter", "birth_year": "1949" },
  ...
]
```
stats:

```json
{
  "Details": {
    "BytesScanned": 379951,
    "BytesProcessed": 1557430,
    "BytesReturned": 1034
  }
}
```
S3 Select had to scan the whole compressed file and process the uncompressed content, but it only had to return about 1KB (~0.3% of the total file).
In conclusion
Considering S3 as a database is surely far-fetched. But it works very well for single large datasets from which you need to retrieve a small chunk. Consider it your read-only replica database.
Keep in mind that the read speed is far below what you can achieve with a real database on SSD, so don't use this for anything time-sensitive.
The ability to update the content with more traditional, non-database tools allows a wider range of people to curate and maintain the data.
A real life example
When S3 Select came out, I decided to use the above concept to create a serverless GeoIP API using MaxMind's GeoIP2 databases. I wanted to replace the locally installed databases on each server with an HTTP service. This helps on the administrative side: updates are managed in one place, boot-up scripts disappear from autoscaling, and other processes and services can use the data (ETL, logging, Lambda, ...).
You can read more about it on my blog or install it from my GitHub repo (works with GeoIP2 and GeoLite2 databases).
I made some tweaks to the above example:
- The data files are sharded by primary key to reduce the scan size (see the sketch after this list)
- CloudFront is used as a database-cache layer to avoid having to query twice for the same IP
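As a rough illustration of the sharding idea (the bucket name, key layout and schema below are hypothetical, not the project's actual implementation), the lookup derives a shard key from the IP so that only one small shard gets scanned:

```javascript
const AWS = require('aws-sdk')
const s3 = new AWS.S3({ region: 'us-east-1' })

// Hypothetical sharding scheme: one file per first octet of the IPv4 address.
// The real project shards differently; this only illustrates the idea.
function shardKeyForIp (ip) {
  const firstOctet = ip.split('.')[0]
  return `geoip/shard-${firstOctet}.csv.gz`
}

function lookupIp (ip) {
  return s3.selectObjectContent({
    Bucket: 'myGeoipBucket',               // hypothetical bucket
    Key: shardKeyForIp(ip),                // only this shard gets scanned
    ExpressionType: 'SQL',
    Expression: `SELECT * FROM s3object s WHERE s.ip='${ip}'`, // hypothetical schema with an "ip" column
    InputSerialization: { CSV: { FileHeaderInfo: 'USE' }, CompressionType: 'GZIP' },
    OutputSerialization: { CSV: {} }
  }).promise()
}
```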
Why not DynamoDB?
When S3 Select came out, DynamoDB didn't have "On-Demand" pricing. Having to provision (and pay for) a fixed throughput was too expensive.
Today, with "On-Demand" pricing, I would definitely use DynamoDB for this tool. Storage pricing is similar to S3's, and so is query pricing. Maintaining and querying the data in DynamoDB is also easier.