DEV Community

Discussion on: How do you merge millions of small files in a S3 bucket to larger single files to a separate bucket daily?

peterb • Edited on

Redshift Spectrum does an excellent job of this: you can read from S3 and write back to S3 (Parquet etc.) in one command, as a stream.

For example, to take lots of JSONL event files and turn them into some 1 GB Parquet files:
First
create external table mytable (....)
row format serde 'org.openx.data.jsonserde.JsonSerDe'
stored as inputformat 'org.apache.hadoop.mapred.TextInputFormat'
outputformat 'org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat'
location 's3://bucket/folderforjson/path/year/month/day ...'

Then
unload ('select columns from mytable where ...')
to 's3://bucket/folderforparquet/year/month/day...'
iam_role 'arn:aws:iam::123456789:role/prod....-role'
format parquet
partition by (year, month, day)
include
cleanpath
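
Since the question is about doing this daily, here is a rough sketch of a helper that builds the UNLOAD statement for one day. It is not from my setup: the function name, the example table, bucket and role are made up, and it assumes year/month/day are integer columns on the external table. I also added maxfilesize, which the command above leaves out, because that is the UNLOAD option that caps output file size (to get the ~1 GB files).

```python
from datetime import date


def build_unload_sql(
    day: date,
    source_table: str,
    target_prefix: str,
    iam_role: str,
    max_file_size_gb: int = 1,
) -> str:
    """Build a Redshift UNLOAD statement that rewrites one day's rows as Parquet.

    Hypothetical helper: assumes the external table exposes integer
    year, month and day columns that the output is partitioned by.
    """
    # The query is passed to UNLOAD as a quoted string, so any single
    # quotes inside it would have to be doubled. Integer comparisons
    # avoid that here.
    select = (
        f"select * from {source_table} "
        f"where year = {day.year} and month = {day.month} and day = {day.day}"
    )
    return (
        f"unload ('{select}')\n"
        f"to '{target_prefix}'\n"
        f"iam_role '{iam_role}'\n"
        "format parquet\n"
        # include keeps the partition columns in the Parquet files too
        "partition by (year, month, day) include\n"
        f"maxfilesize {max_file_size_gb} gb\n"
        # cleanpath removes existing files in the target partition folders
        # first, so re-running the same day does not leave duplicates
        "cleanpath"
    )


if __name__ == "__main__":
    # Placeholder names only; substitute your own table, bucket and role.
    print(
        build_unload_sql(
            date.today(),
            "spectrum_schema.mytable",
            "s3://example-bucket/folderforparquet/",
            "arn:aws:iam::123456789012:role/example-role",
        )
    )
```

You would run the resulting statement from whatever scheduler you already have, once per day, through any Redshift SQL client.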

You can buy Redshift by the hour, and Redshift Spectrum is $5 per TB of data scanned.