
Man yin Mandy Wong

Tencent Cloud Native HDFS - Cornerstone of Cloud Big Data Storage-Computing Separation

Object storage is a widely used cloud-based solution for unstructured data. An increasing amount of unstructured data is aggregated in data lakes built on object storage services, creating demand for big data analysis. However, HDFS APIs are the de facto standard for storage systems serving such analysis, and HDFS remains the storage cornerstone of the big data ecosystem.

Native object storage APIs are not compatible with HDFS and therefore cannot be used directly. To support big data scenarios with storage-computing separation, object storage usually provides a simulation layer that translates HDFS semantics into object storage semantics; typical implementations include S3N and COSN. However, because such implementations do not genuinely support file system APIs, the flat directory structure of object storage cannot provide hierarchical namespaces. Operations such as RENAME are extremely inefficient, since renaming actually copies every associated object matching the prefix, and metadata-heavy operations such as LIST and HEAD suffer high latency. In addition, some object storage systems lack strong consistency semantics and thus cannot guarantee read-after-write consistency, causing errors in upper-layer big data computing frameworks.
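
To make the RENAME cost concrete, here is a minimal, purely illustrative Java sketch of a flat object store where "renaming" a directory means copying and deleting every object sharing the prefix. The class and method names are invented for this example; they are not part of S3N, COSN, or any real SDK.

```java
import java.util.Map;
import java.util.TreeMap;

// Illustrative only: simulates why RENAME is expensive on a flat object
// store. Keys are full paths; "renaming" a directory means rewriting
// every object that shares the prefix -- O(number of objects).
class FlatRenameDemo {
    private final Map<String, byte[]> objects = new TreeMap<>();

    public void put(String key, byte[] data) {
        objects.put(key, data);
    }

    // Rename by prefix: copy each matching object to its new key, then
    // delete the old one. Conceptually, this is what HDFS simulation
    // layers must do on a flat namespace. Returns how many objects moved.
    public int renamePrefix(String oldPrefix, String newPrefix) {
        int moved = 0;
        // Iterate over a snapshot of the keys so we can mutate the map.
        for (String key : new TreeMap<>(objects).keySet()) {
            if (key.startsWith(oldPrefix)) {
                objects.put(newPrefix + key.substring(oldPrefix.length()),
                            objects.remove(key));
                moved++;
            }
        }
        return moved;
    }

    public boolean contains(String key) {
        return objects.containsKey(key);
    }
}
```

Every file under the renamed prefix is rewritten, so the cost grows linearly with directory size; a true hierarchical namespace avoids this entirely.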

Moreover, in terms of the data flow, common file append operations are also unsupported by the simulation layers of S3N and COSN. To support big data storage-computing separation, the cloud storage system should be redesigned to serve as an efficient and reliable storage cornerstone for cloud big data computing, meeting the demands of metadata operations while providing virtually unlimited storage.

In view of this, Tencent Cloud has launched Cloud Native HDFS (CHDFS), a COS-based general-purpose distributed file system.

1. CHDFS overview
CHDFS builds a scalable metadata layer on top of COS, taking full advantage of elastic cloud resources to support HDFS semantics. With this highly optimized metadata layer, it allows efficient access to massive amounts of metadata: it can handle far more metadata than native HDFS while delivering almost the same performance. It also comes with a Java client optimized for read/write data flows, which makes the most of COS while enabling efficient metadata operations. In short, CHDFS implements file system semantics on top of COS: file data is hosted in COS as the underlying disk, and a distributed metadata layer for the file system is built above it. Hosting data in COS also lets CHDFS inherit the strengths of COS, such as low cost, high reliability, throughput, and availability, and petabyte-scale storage capacity.

2. CHDFS benefits
CHDFS adopts a distributed architecture and incorporates many optimizations for metadata read/write. It supports tens of billions of files, overcoming the capacity limit of HDFS NameNode and ensuring strong consistency semantics. Compared with COS and HDFS, CHDFS has the following benefits:

  1. It supports millisecond-level atomic RENAME operations for both directories and files.
  2. It features strong metadata consistency, so data becomes visible immediately after being written.
  3. It supports tens of billions of files, far more than HDFS, with almost the same latency as HDFS.
  4. As a single file system, it supports over 100,000 metadata QPS, meeting the high-concurrency requirements of large-scale computing scenarios.
  5. It is highly available and can complete an HA switchover in seconds.
  6. Thanks to parallel metadata loading, it cold-starts much faster than HDFS.
  7. It supports cross-region/AZ replication of metadata, further increasing reliability.
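
Why can a hierarchical namespace make directory RENAME atomic and millisecond-level? Because it only moves one metadata node instead of rewriting data. The sketch below is a generic illustration of that idea in Java; the class is invented for this example and does not reflect CHDFS internals.

```java
import java.util.HashMap;
import java.util.Map;

// Illustrative sketch (not CHDFS internals): in a hierarchical
// namespace, renaming a directory detaches one tree node from its old
// parent and reattaches it under the new one -- O(1) metadata work,
// regardless of how many files the directory contains.
class HierarchicalNamespace {
    static class Node {
        final Map<String, Node> children = new HashMap<>();
    }

    private final Node root = new Node();

    // mkdirs("a/b/c") creates intermediate directories as needed.
    public void mkdirs(String path) {
        Node cur = root;
        for (String part : path.split("/")) {
            cur = cur.children.computeIfAbsent(part, k -> new Node());
        }
    }

    public boolean exists(String path) {
        return lookup(path) != null;
    }

    // Rename = one detach plus one attach; no data is copied.
    public boolean rename(String src, String dstParent, String newName) {
        Node srcParent = lookupParent(src);
        if (srcParent == null) return false;
        String srcName = src.substring(src.lastIndexOf('/') + 1);
        Node moved = srcParent.children.remove(srcName);
        if (moved == null) return false;
        Node target = lookup(dstParent);
        if (target == null) return false;
        target.children.put(newName, moved);
        return true;
    }

    private Node lookup(String path) {
        Node cur = root;
        for (String part : path.split("/")) {
            cur = cur.children.get(part);
            if (cur == null) return null;
        }
        return cur;
    }

    private Node lookupParent(String path) {
        int i = path.lastIndexOf('/');
        return i < 0 ? root : lookup(path.substring(0, i));
    }
}
```

In a real system the detach/attach pair would be executed under a transaction or lock to make it atomic; the point here is that the cost is independent of directory size.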

CHDFS offers multiple metadata engines to choose from based on your business needs in different scenarios, helping you strike a balance between cost, capacity, and performance. Its API is fully compatible with HDFS, so you can easily migrate data between the two systems.

3. COS as the data foundation of CHDFS
As a basic cloud storage service, COS acts as a solid data foundation for CHDFS. CHDFS file data is divided into parts and stored in COS, which has the following strengths:

  1. Hundreds of petabytes of data can be stored, and capacity expands automatically.
  2. Tbps-level bandwidth is supported, bringing the high throughput of COS to bear on big data computing.
  3. Data can be stored across AZs, delivering eleven nines of reliability.
  4. Data is encoded with erasure coding by default, further reducing storage costs.
  5. File data can be replicated across regions.
  6. INTELLIGENT TIERING is supported, which automatically transitions data between tiers based on access frequency to further lower storage costs.
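
The statement that file data is "divided into parts" before being stored in COS can be sketched as fixed-size splitting. The part size and object-key naming below are assumptions made for illustration, not the actual CHDFS data layout.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative sketch: split a file's bytes into fixed-size parts, as a
// file system layered on object storage might do before uploading each
// part as a separate object. Part size and key naming are assumptions.
class PartSplitter {
    // Split `data` into parts of at most `partSize` bytes each.
    public static List<byte[]> split(byte[] data, int partSize) {
        List<byte[]> parts = new ArrayList<>();
        for (int off = 0; off < data.length; off += partSize) {
            int end = Math.min(off + partSize, data.length);
            parts.add(Arrays.copyOfRange(data, off, end));
        }
        return parts;
    }

    // Hypothetical object key for part i of a file: "<fileId>/part-0003".
    public static String partKey(String fileId, int index) {
        return String.format("%s/part-%04d", fileId, index);
    }
}
```

Splitting files into independently addressed parts is what lets the data layer parallelize uploads and reads across COS, and lets a file grow past the size limit of any single object.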

Furthermore, CHDFS provides an HDFS-compatible, high-performance Java SDK that is comprehensively optimized for big data scenarios, implementing an efficient read/write caching mechanism based on the data flow characteristics of COS.
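
The SDK's internal caching is not documented here, but read-ahead buffering is the kind of optimization such a client typically applies: fetch a large block once, then serve many small reads from memory. The class below is a generic, self-contained sketch of that pattern, not the actual CHDFS SDK.

```java
import java.io.IOException;
import java.io.InputStream;

// Generic read-ahead sketch (not the actual CHDFS SDK): wrap a stream
// and pull data in large blocks, so that many small reads hit the
// in-memory buffer instead of issuing one remote request each.
class ReadAheadStream {
    private final InputStream source;
    private final byte[] buffer;
    private int pos = 0;    // next unread byte in the buffer
    private int limit = 0;  // number of valid bytes in the buffer
    public int fetches = 0; // how many block fetches were issued

    public ReadAheadStream(InputStream source, int blockSize) {
        this.source = source;
        this.buffer = new byte[blockSize];
    }

    // Read one byte, refilling the buffer a whole block at a time.
    public int read() throws IOException {
        if (pos >= limit) {                  // buffer drained: fetch next block
            limit = source.read(buffer, 0, buffer.length);
            pos = 0;
            fetches++;
            if (limit <= 0) {                // end of stream
                limit = 0;
                return -1;
            }
        }
        return buffer[pos++] & 0xFF;
    }
}
```

With a 64-byte block, reading 100 bytes one at a time issues only three underlying fetches (two data blocks plus one end-of-stream probe) instead of 100.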

4. Abundant features
In addition to the strong file read/write capabilities described above, CHDFS offers abundant features to meet diverse requirements in big data scenarios. For cost optimization, its storage lifecycle management feature automatically transitions files to cheaper storage media after simple configuration, further reducing cloud storage costs. When you need to access cold data, a simple yet powerful command line tool retrieves files back to the hot tier.

To help you better understand file metadata, the file inventory feature of CHDFS allows you to export an inventory of files in a specified format, filtered by specified fields, and ship it to your file system. You can then read it to analyze business file attributes across multiple dimensions, such as average file size. You can even use it to verify files when importing data from a local HDFS into CHDFS.
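
As a small example of the kind of analysis the text describes, the sketch below computes average file size from inventory rows. The "path,sizeInBytes" CSV layout is an assumed format for illustration only, not the documented CHDFS inventory schema.

```java
import java.util.List;

// Illustrative sketch: analyze an exported file inventory. The
// "path,sizeInBytes" CSV layout is an assumption for this example, not
// the documented CHDFS inventory schema.
class InventoryStats {
    // Average file size in bytes across all inventory rows.
    public static double averageSize(List<String> csvLines) {
        long total = 0;
        int count = 0;
        for (String line : csvLines) {
            String[] cols = line.split(",");
            total += Long.parseLong(cols[1].trim()); // size column
            count++;
        }
        return count == 0 ? 0.0 : (double) total / count;
    }
}
```

The same pass over the inventory could accumulate per-directory counts or size histograms, which is useful for spotting small-file problems before they hurt job performance.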

5. Ecosystem integration
CHDFS offers a protocol fully compatible with HDFS and seamlessly supports popular big data computing frameworks, including Hive, Spark, Presto, and Flink. Currently, CHDFS is closely integrated with Tencent Cloud EMR: after purchasing CHDFS, you can use it directly in EMR without installing any additional environment, making it easy to get started.
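
For self-managed Hadoop clusters, pointing the frameworks above at CHDFS typically comes down to registering the client in `core-site.xml`. The fragment below follows the property names published in the CHDFS Hadoop SDK documentation, but treat them as assumptions and verify against the current docs before use; `YOUR_APPID` and the mount point are placeholders.

```xml
<!-- Illustrative core-site.xml fragment for the CHDFS Hadoop client.
     Property names follow the public CHDFS SDK docs; verify before use. -->
<configuration>
  <property>
    <name>fs.ofs.impl</name>
    <value>com.qcloud.chdfs.fs.CHDFSHadoopFileSystemAdapter</value>
  </property>
  <property>
    <name>fs.AbstractFileSystem.ofs.impl</name>
    <value>com.qcloud.chdfs.fs.CHDFSDelegateFSAdapter</value>
  </property>
  <property>
    <name>fs.ofs.tmp.cache.dir</name>
    <value>/tmp/chdfs_cache</value>
  </property>
  <property>
    <name>fs.ofs.user.appid</name>
    <value>YOUR_APPID</value>
  </property>
</configuration>
```

Once configured, jobs address CHDFS paths through the `ofs://` scheme of the mount point, for example with `hadoop fs -ls` against the mount point's domain.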

Read more at: https://www.tencentcloud.com/dynamic/blogs/sample-article/100387
