luminousmen

S3 vs HDFS

It annoys me that so many big data engineers confuse S3 with HDFS and assume the two are the same thing.

That’s not true.

HDFS is a distributed file system designed to store big data. It runs on a cluster of machines that you provision yourself, and those same machines can run other workloads, such as compute jobs. S3 is AWS's object store, not a file system: every piece of data is stored as an object, identified by a key (the object name) and holding a value (the object content) and, when versioning is enabled, a version ID. You work with whole objects (put, get, list, delete), and what look like directories are only key prefixes. S3 offers virtually unlimited storage in the cloud, while HDFS capacity is bounded by the disks in the cluster. Historically, S3 applied overwrites and deletes in an eventually consistent way; since December 2020 it provides strong read-after-write consistency.
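To make the key / value / version ID model concrete, here is a minimal in-memory sketch. `ToyObjectStore` and its methods are hypothetical names made up for illustration; this is not the S3 API and it ignores networking, buckets, and permissions. It only shows that an object store maps flat keys to whole objects and that a "folder" is nothing more than a shared key prefix.

```python
import uuid
from typing import Dict, List, Optional, Tuple


class ToyObjectStore:
    """In-memory stand-in for a versioned object store (NOT the real S3 API)."""

    def __init__(self) -> None:
        # key -> list of (version_id, content), oldest first
        self._versions: Dict[str, List[Tuple[str, bytes]]] = {}

    def put(self, key: str, content: bytes) -> str:
        """Store a whole object under `key` and return its new version id."""
        version_id = uuid.uuid4().hex
        self._versions.setdefault(key, []).append((version_id, content))
        return version_id

    def get(self, key: str, version_id: Optional[str] = None) -> bytes:
        """Return the latest content, or one specific version if requested."""
        versions = self._versions.get(key)
        if not versions:
            raise KeyError(key)
        if version_id is None:
            return versions[-1][1]
        for vid, content in versions:
            if vid == version_id:
                return content
        raise KeyError(f"{key}@{version_id}")

    def list_keys(self, prefix: str = "") -> List[str]:
        """List keys by prefix; there are no real directories to walk."""
        return sorted(k for k in self._versions if k.startswith(prefix))


store = ToyObjectStore()
first = store.put("logs/2024/app.log", b"first upload")
store.put("logs/2024/app.log", b"second upload")  # replaces the whole object
store.put("logs/readme.txt", b"hello")

print(store.get("logs/2024/app.log"))         # latest version
print(store.get("logs/2024/app.log", first))  # an older version by id
print(store.list_keys("logs/2024/"))          # "folder" listing via prefix
```

Note that "modifying" an object here means writing a complete new version under the same key, which is the main practical difference from a file system where you can append to or rename a file in place.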

There are many other criteria, such as cost, SLA, durability, and elasticity, plus features like custom lifecycle rules and object versioning. I won’t go through them here, because S3 wins on those anyway.

Hadoop and HDFS made it cheap to store and distribute large amounts of data. But now that everyone is moving to cloud architectures, the benefits of HDFS are minimal and not worth the operational complexity it brings. That’s why, now and in the future, organizations will use S3 as the backend for their data storage solutions.



Top comments (3)

Wu Tao

Basically right. But using S3 means you probably have to use EC2 and other AWS products as well. I think that is why HDFS still has its market.

luminousmen

Not really: you can use S3 from whatever platform you want. I think HDFS is nowadays used mostly internally by cloud providers or in on-prem solutions.

Wu Tao

Sure, but if you run large dataset computations on top of AWS S3 without deploying on EC2, you will probably pay a big bill for I/O bandwidth.