DEV Community: Shun Huang

Building Trust Through Common Sense

Shun Huang — Thu, 28 May 2026 04:32:47 +0000

[Part 1 of the Demystifying Data Governance Series]

Before we knew the term “data governance”, we were probably already doing it. Not with frameworks or enterprise tools, but with common sense: tracking what data we had, ensuring it was accurate, deciding who could use it, and improving how we handled it over time. All of these are practices humans have been engaged in for thousands of years.

Ancient Mesopotamians used clay tablets to record grain inventories and trade transactions – complete with metadata like dates, locations, and responsible parties. Those tablets were not just records; they were mechanisms of trust. Today, we do the same thing, just with digital media instead of clay. The principles haven’t changed – only the materials, the scale, the speed, and the actors changed.

The last point matters. AI systems are now first-class participants in every data estate. They ingest data, transform it, produce more of it, and make decisions based on it – often faster than any governance process was designed to handle. A governance program that ignores this is solving last decade’s problems.

This article establishes the foundation: what governance actually is, why it matters more now than before, and the four pillars that everything else builds on over time.

What is Data Governance?

At its core, governance is about visibility, protection, and accountability. It is how organizations build trust in their data – trust that it is accurate, safe, and used responsibly.

If we search for or ask AI systems for “data governance,” numerous frameworks and solution providers come up, each with its own terminology and a long list of responsibilities: data discovery, classification, cataloging, lineage, quality, security, and regulatory compliance. The wording varies across vendors and industries, but the components remain the same.

This long list is exactly why governance feels overwhelming. It looks like a sprawling set of disconnected tasks, each requiring its own tool, process, or team. In reality, all of these activities collapse into four simple, intuitive pillars:

Know the data – understand what we have, where it lives, how sensitive it is, and who owns it.
Secure the data – control who can access it and protect it from both external and internal threats.
Use the data properly – ensure it is used for a legitimate purpose with appropriate consent and in compliance with regulations.
Improve data quality continuously – measure it, detect when it degrades, and fix it at the source.

The table below maps common governance activities to these four pillars. This mapping is intentionally simple – not to oversimplify governance, but to show that the field is far more intuitive than its terminology suggests.

Governance Component	Know	Secure	Use Properly	Improve Quality
Data discovery & awareness	✓
Data assessment	✓
Data classification & sensitivity tagging	✓	✓
Metadata management & cataloging	✓
Data lineage & traceability	✓
Access management & entitlements		✓	✓
Permissions auditing		✓	✓
Data sharing & collaboration workflows		✓	✓
Data security, encryption, privacy controls		✓
Regulatory compliance		✓	✓
Data quality rules & validation				✓
Data quality monitoring				✓
Stewardship & ownership models	✓			✓

Why Does Data Need to Be Governed?

Data is the fuel of modern analytics and the foundation of every AI system. Every product decision, every model, every automated workflow depends on the quality and reliability of the data beneath it. Unlike physical assets, data grows, spreads, and changes at a pace no physical resource ever could – and that pace is accelerating.

The size of data is exploding as organizations collect more than ever before. The workforce engaging with data – engineers, analysts, marketing, operations – has expanded to include AI systems that act as autonomous agents. AI has introduced a qualitative shift: AI systems do not just consume data, they generate it. Every model run produces outputs, logs, embeddings, and derived features. Every agent interaction leaves a trail. The volume, variety, and velocity of the data estate now outpace any manual governance process.

The Risks are Concrete

Privacy expectations have risen. People expect their data to be handled with care and respect. When it is not, the reputational cost arrives fast.
Regulations have strengthened. GDPR, CCPA, HIPAA, and industry-specific rules require organizations to know where sensitive data lives, how it is used, and how to delete it on request. Fines issued under GDPR alone totaled over €1 billion in 2025. These are not theoretical risks.
Internal threats are the dominant risk category. External attackers are visible and tracked. The larger, more common threat is internal: misconfigured permissions, overly broad service accounts, data shared with the wrong team. These are governance failures, not security failures, and they happen constantly.
AI amplifies the misuse surface. Before AI, a single misconfigured permission might expose a dataset to a hundred people. An AI agent with the same misconfigured permission can exfiltrate more data in a single run than a human analyst could in a year. The blast radius of governance failures has grown with the capabilities of the systems operating on the data.

The Four Pillars

The diagram below shows the relationships among the pillars. Each of them depends on the others to complete the data governance solution.

Know the data. We cannot protect what we cannot see, and we cannot improve what we do not understand. Knowing the data means maintaining an accurate, continuously updated picture of what the estate contains: every dataset, its sensitivity, its owner, its lineage, its quality. This pillar is the foundation. Every other pillar depends on it.

Secure the data. Access control is not binary. It is a spectrum from “no one touches this without approval” to “anyone can read this,” with most data falling somewhere in between. Securing the data means making that placement deliberate – based on the data’s sensitivity and the purpose of access – and enforcing it automatically, not through manual approval chains that get bypassed under pressure.

Use the data properly. Authorized access is not the same as appropriate use. An analyst with access to a sensitive dataset is authorized to access it; they may not be authorized to train a model on it, share it with an external partner, or retain it beyond its stated purpose. Purpose limitation, consent, and the difference between allowed and appropriate are what this pillar governs.

Improve data quality continuously. Quality is not a state; it is a practice. Data degrades. Schemas change. Pipelines break. Upstream sources drift. A governance program that measures quality once and moves on will find the numbers diverging, the dashboards disagreeing, and the organization eventually trusting none of it. Quality must be measured continuously, owned by named individuals, and improved systematically – not patched when someone complains.

The Ideal State of Data Governance

The ideal data governance program is invisible. Not because it does not exist, but because it works so smoothly that people barely notice it. Data is there when teams need it – accurate, classified, documented. Access is granted based on role and purpose without a ticket queue. Quality issues are caught before they reach a dashboard. Deletion requests are complete end-to-end. Compliance is a byproduct of normal operations, not a scramble when an audit arrives.

Unfortunately, the reality is different. Governance typically begins after the data estate has already grown too large to understand. When the team is small, everyone knows what data exists, where it came from, and who can touch it – nothing feels urgent. By the time the cracks appear – datasets nobody owns, classifications nobody ran, lineage nobody built – the problem has compounded. Applying governance retroactively to a data estate that was never designed to receive it is harder than starting earlier. It is not impossible, but it is harder.

The gap between the ideal and reality is not a reason to delay. It is a description of the starting point. Every governance program begins somewhere between knowing nothing and having everything under control. The question is what direction it is moving.

The Path Forward

The path from the current state to the ideal is not a transformation project. It is a discipline – a set of practices that, applied consistently, move the program forward every quarter. And the sequence matters.

Start with visibility. Before any other pillar can function, we need to know what we have. Discovery and classification are the entry point. A catalog entry is better than none. A tag is better than an empty field. Imperfect coverage improving over time is the goal – not a perfect inventory before anything else moves.

Build access control on top of classification. Policies that reference data classes – analysts cannot access sensitive data without approval – are more durable than policies that reference specific tables. When a new sensitive table appears, the policy already applies. When a table is reclassified, the access tier changes automatically.
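A minimal sketch of a class-based policy in Python may make this concrete. The sensitivity classes, role names, and the `can_read` helper are all hypothetical, invented for illustration; a real program would enforce this in its warehouse or policy engine rather than in application code.

```python
from dataclasses import dataclass

# Hypothetical data classes and roles, invented for this sketch.
# Each data class maps to the roles that may read it without extra approval.
POLICY = {
    "public": {"analyst", "engineer", "steward"},
    "internal": {"analyst", "engineer", "steward"},
    "sensitive": {"steward"},
}

@dataclass
class Table:
    name: str
    data_class: str  # assigned by the classification step

def can_read(role: str, table: Table, approved: bool = False) -> bool:
    """Decide access from the table's class, never from its name."""
    if table.data_class not in POLICY:
        return False  # unclassified or unknown class: deny by default
    return role in POLICY[table.data_class] or approved

orders = Table("orders", "internal")
patients = Table("patients", "sensitive")  # a new table; the policy already applies

print(can_read("analyst", orders))                   # True
print(can_read("analyst", patients))                 # False
print(can_read("analyst", patients, approved=True))  # True

# Reclassifying a table changes its access tier without touching the policy.
orders.data_class = "sensitive"
print(can_read("analyst", orders))                   # False
```

The policy never mentions a table name, which is why new and reclassified tables need no policy edits.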

Make quality observable before making it a goal. We cannot set a quality target for a dataset we have never profiled. The first step is measurement – baselines, distributions, null rates, and freshness. Quality targets come after baselines.
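As a rough illustration of what a first baseline could look like, the sketch below profiles one column of a small in-memory dataset. The `profile` function, the column names, and the sample rows are assumptions made up for this example; at real scale this would more likely be a profiling tool or a warehouse query.

```python
from datetime import datetime, timezone

def profile(rows, column, timestamp_column, now):
    """Compute a small baseline for one column of a list-of-dicts dataset."""
    if not rows:
        raise ValueError("cannot profile an empty dataset")
    values = [row.get(column) for row in rows]
    non_null = [v for v in values if v is not None]
    newest = max(row[timestamp_column] for row in rows)
    return {
        "row_count": len(rows),
        "null_rate": 1 - len(non_null) / len(rows),
        "min": min(non_null, default=None),
        "max": max(non_null, default=None),
        # Freshness: hours since the most recent record was loaded.
        "freshness_hours": (now - newest).total_seconds() / 3600,
    }

# Hypothetical sample data; the column names are illustrative only.
rows = [
    {"amount": 12.5, "loaded_at": datetime(2025, 1, 1, 8, 0, tzinfo=timezone.utc)},
    {"amount": None, "loaded_at": datetime(2025, 1, 1, 20, 0, tzinfo=timezone.utc)},
    {"amount": 40.0, "loaded_at": datetime(2025, 1, 2, 6, 0, tzinfo=timezone.utc)},
    {"amount": 7.25, "loaded_at": datetime(2025, 1, 2, 2, 0, tzinfo=timezone.utc)},
]
now = datetime(2025, 1, 2, 12, 0, tzinfo=timezone.utc)

baseline = profile(rows, "amount", "loaded_at", now)
print(baseline)
```

Once a baseline like this exists, a target such as "null rate below 1%" has something to be measured against.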

Translate governance into business language from the start. A governance program that cannot explain its value to leadership will be deprioritized the moment a roadmap trade-off arises. The business case is rarely spelled out explicitly, so we must make the program's value visible.

Govern AI, and Govern with AI

AI changes governance in two directions at once, and conflating them causes confusion. We have to govern AI — the systems, their data, and their outputs — and we can govern with AI — using it to do governance work better. Both matter, and they are not the same thing.

Governing AI

A model is not just code. It is a process that ingests data, transforms it, and produces more data. Every part of that process is a governance object:

Training datasets carry the same sensitivity as their source data. If a customer requests deletion under GDPR and their data was in a training set, removing the source row does not remove their influence from the model. Machine unlearning is expensive and imperfect, so training data must be tracked at the dataset and version levels before a deletion request arrives, not after.
Embeddings and vector stores are dense numerical representations of source data, and can be reverse-engineered back toward it. An embedded support ticket carries the sensitivity of its content even if nobody labeled the embedding. Vector stores need classification and access control the same way tables do.
Model outputs and inference logs are new data, generated at high volume and usually stored with no owner, no retention policy, and no quality signal. When model behavior later needs to be audited, that gap makes the audit impossible.
AI agents inherit, by default, the permissions of whoever built them — almost always too broad. An agent with tool access to a warehouse can reach anything that the account can reach, not just what its task requires. Least privilege applies to agents exactly as it applies to people.
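The last point can be sketched in a few lines of Python. This is only an illustration under assumed names: the table identifiers, `scoped_permissions`, and `agent_read` are hypothetical, and a real deployment would enforce the scope in the warehouse's own access control, not in the agent's code.

```python
def scoped_permissions(builder_permissions: set, task_needs: set) -> set:
    """Grant the agent only what the task needs and the builder may delegate."""
    return builder_permissions & task_needs

def agent_read(table: str, scope: set, audit_log: list) -> bool:
    """Check a read against the agent's scope and record the decision."""
    allowed = table in scope
    audit_log.append((table, "allowed" if allowed else "denied"))
    return allowed

# Hypothetical permissions held by the person who built the agent.
builder = {"sales.orders", "sales.customers", "hr.salaries", "finance.ledger"}
# What this agent's task actually requires.
task = {"sales.orders"}

agent_scope = scoped_permissions(builder, task)
audit_log = []

print(agent_read("sales.orders", agent_scope, audit_log))  # True
print(agent_read("hr.salaries", agent_scope, audit_log))   # False
print(audit_log)
```

The intersection keeps the agent from ever holding more than its builder could delegate, and the log gives the audit trail an owner can review.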

We go deep on each of these in the articles that follow. The point here is that AI expands what has to be governed — the scope grows — but not the principles. Know it, secure it, use it properly, improve it.

Governing with AI

For governance tasks that used to be manual, slow, and incomplete, AI is the most powerful tool we have ever had. Some examples are as follows.

Sensitive data discovery. Context-aware classifiers surface sensitive data in free-form text fields that regex patterns miss – support tickets, incident notes, and unstructured documents. (A deeper treatment is in the companion piece Sensitive Data Discovery with AI — coming soon.)

Catalog enrichment. An LLM given a schema, sample rows, and query history can generate a human-readable table description in seconds. The catalog stops being a graveyard of empty fields.

Anomaly detection. Statistical models trained on access patterns flag behavior no rule would catch – a service account querying at 3 AM, a volume spike on a sensitive table, an export to an endpoint not listed in any sharing agreement.
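As a deliberately simple stand-in for such a model, the sketch below flags a day's query volume when it sits far from the historical mean, using a z-score. The threshold, the sample counts, and the `is_anomalous` helper are assumptions for illustration; production systems would typically use richer features and models.

```python
from statistics import mean, stdev

def is_anomalous(history, today, threshold=3.0):
    """Flag today's value if it is more than `threshold` standard
    deviations away from the mean of the historical values."""
    mu = mean(history)
    sigma = stdev(history)
    if sigma == 0:
        return today != mu  # flat history: any change is unusual
    return abs(today - mu) / sigma > threshold

# Hypothetical daily query counts for one service account.
daily_queries = [100, 110, 95, 105, 98, 102, 107]

print(is_anomalous(daily_queries, 104))  # False: within the normal range
print(is_anomalous(daily_queries, 900))  # True: a volume spike
```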

Access review summarization. An LLM can summarize, for each user, what they have accessed and whether that access is still consistent with their current role.

However, the same constraint applies in both directions: AI assistance does not replace governance scaffolding. A classifier that produces labels without a feedback loop or integration with the access control system is generating metadata that no one acts on. An agent still needs scoped permissions, an audit trail, and an owner. The principles are unchanged.

Summary

Governance is common sense applied at scale. The four pillars — know, secure, use properly, improve continuously — are not novel. What is novel is the rate at which the data estate grows, the number of automated systems interacting with it, and the regulatory expectations around it. The framework is the same. The urgency is different.

This series follows the four pillars in the order in which they depend on each other.

What Do We Have and What Does It Mean? covers the “Know the Data” pillar: discovery, classification, metadata, lineage, and AI-generated data types that most inventories entirely miss.
Who Gets Access and How Do We Keep It Safe? covers “Secure the Data” and “Use Data Properly” together, because they are two sides of the same tradeoff. It also covers AI agents as a new class of access principal that current governance models handle poorly.
How Do We Know It’s Working? covers “Improve Data Quality” and the question every governance program eventually has to answer: how do we demonstrate value to the people funding it? This is where governance translates into business language — and where AI adoption becomes the most compelling argument.

The post Building Trust Through Common Sense appeared first on Ilha Formosa 1544.

The Essence of Data Engineering

Shun Huang — Wed, 22 Jan 2025 07:06:33 +0000

As data becomes one of the most valuable resources, the focus on data engineering has grown dramatically. When I first entered the data engineering realm, I was overwhelmed by the boom of data-focused technologies. After years of experience in this domain, I realized we only scratch the surface if we fixate on those evolving technologies and ignore the core issue. I wrote this article to help people understand the essence of data engineering so we can focus on the core of the problems instead of trying to win a never-ending race. Innumerable technologies and vendor products rise and fall, but the essence of data engineering remains the same – its core concepts and principles apply to any relevant technology. I hope this article’s ideas and principles will stand the test of time. Besides, data engineering is not only a job for data engineers; everyone who works with data can and should understand it.

The Essence of Data Engineering

In the Big Data era, the whole purpose of collecting as much data as possible was to gain value from the data, and the method is data engineering. At the core, the purpose of data engineering is to make data usable and valuable—to transform source data into a form suitable for a data use case that can extract value from it.

A typical data engineering workflow includes gathering data from source systems, transforming the collected data, and gaining value from the processed data. Below is a high-level overview of data engineering.

The essence of data engineering contains three operations (Data Collection, Data Transformation, and Data Utilization) and one component (Storage).

The Data Engineering Operations

Data Collection involves gathering data from various sources into a centralized storage system to ensure the data is ready for subsequent processing and analysis.
Data Transformation is the process of converting raw data into a usable format or structure. The process may involve cleaning, conversion, aggregation, normalization, or any other method that makes the data usable.
Data Utilization refers to effectively using data and involves delivering the processed data to end-users or applications, enabling analysis, decision-making, modeling, and actionable insights. Data utilization overlaps with the territory of data analytics and data science; in practice, the boundary between data engineering, data analytics, and data science is blurry.

The Storage Component

Data needs a place to be stored. A storage system is the backbone that supports the entire data lifecycle. It securely and efficiently manages the influx of raw, processed, and used data.

Data Utilization

Although starting with data collection is more intuitive, knowing how data will be used keeps us aligned with the purpose of data engineering: making data usable and valuable. This foresight avoids unnecessary data collection and ensures that the collected data fits the end goals, improving efficiency and relevance. Among all possible data use cases, data analytics and machine learning are the two most prominent.

Analytics

Data analytics includes discovering useful information, drawing conclusions, and supporting decision-making. Analytics uses statistical methods, reporting tools, and business intelligence tools to extract value from the data. Although the goals of analytics vary by situation, the data it uses needs to be accurate and timely (i.e., fresh enough) so that an analytics report is trustworthy and delivered on time.

Machine Learning

Machine learning is a subfield of Artificial Intelligence. As its name implies, it teaches machines to learn something, and the essence of machine learning is learning from data. Since the beginning of the AI boom, machine learning has become one of the biggest data consumers and one of the primary use cases in the data world. Understanding how machine learning works helps serve it better. The following diagram shows a typical machine-learning workflow.

Data must be transformed into the particular format a learning algorithm expects before a model can be trained on it. Most machine learning algorithms, deep learning especially, need a large amount of training data. However, unlike data analytics, machine learning usually has a higher tolerance for data inaccuracies, and the data sometimes doesn’t need to be fresh (i.e., historical data may be sufficient for training).

(To learn more about machine learning, please refer to the Machine Learning Basics Series)

Data Transformation

The raw data that a source system generates is usually unsuitable for analytics and machine learning. Therefore, it needs to be transformed into usable form. Any method that can make data usable is considered a data transformation. Below are some common transformation methods.

Cleaning: removing duplicates or invalid data, correcting errors, and filling in missing values to improve data quality.
Filtering: selecting a subset of data based on specific criteria or conditions.
Aggregation: summarizing data, such as calculating averages, sums, or counts.
Deduplication: removing duplicate data.
Merge & Join: combining data from different sources or tables to create a unified dataset.
Flattening: converting a complex data type (e.g., map or struct) into multiple columns of plain data types (e.g., integer and string).
Normalization: scaling data to a standard range, typically 0 to 1, to ensure consistency.
Encoding: converting categorical data into numerical formats.
Standardization: adjusting data to have a mean of 0 and a standard deviation of 1.
Smoothing: reducing noise in the data to highlight trends, often using techniques like moving averages.
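A few of the methods above (normalization, standardization, encoding, and smoothing) can be sketched from scratch in plain Python. These helpers are illustrative assumptions, not a library API; they use the population standard deviation and assume the input values are not all identical.

```python
from statistics import mean, pstdev

def normalize(values):
    """Min-max scaling to the range 0 to 1 (assumes max > min)."""
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def standardize(values):
    """Rescale to mean 0 and standard deviation 1 (population std)."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]

def one_hot(labels):
    """Encode categories as 0/1 vectors, one position per sorted category."""
    categories = sorted(set(labels))
    return [[1 if label == c else 0 for c in categories] for label in labels]

def moving_average(values, window):
    """Smooth a series by averaging each run of `window` consecutive values."""
    return [mean(values[i:i + window]) for i in range(len(values) - window + 1)]

print(normalize([10, 20, 30]))               # [0.0, 0.5, 1.0]
print(standardize([10, 20, 30]))
print(one_hot(["red", "blue", "red"]))       # [[0, 1], [1, 0], [0, 1]]
print(moving_average([1, 2, 3, 4, 5], 3))
```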

Although we can implement data transformation functions from scratch, handling transformations is challenging when the quantity of data is enormous. Fortunately, many open-source libraries have been built for this purpose. Apache Spark and Pandas are arguably the two most popular tools. For example, with Spark, transforming data can be as simple as a single line of code, and Spark will handle the rest (see Apache Spark examples).

These transformations help prepare the data for analytics, machine learning, and other use cases, ensuring the data is clean, consistent, and in a proper format. Data transformation can also be applied to any step of data engineering. For example, when collecting data from a source system, the collection process can filter out invalid data before storing it, so only the valid data will be stored in the storage system.

Data Collection

Data collection is the first step of the data engineering workflow. It involves gathering raw data from various sources such as databases, APIs, logs, IoT devices, etc. The data is then processed, cleaned, transformed, and ingested into storage systems, so it is also called data ingestion or ETL (extract, transform, load).

Depending on the source system, there are several ways to gather data. The list below includes some common methods most source systems provide.

Database Query: directly querying databases using SQL or other query languages to retrieve specific data.
File Transfer: collecting data from flat files such as CSV, JSON, or log files. These files can be transferred via FTP, SFTP, or cloud storage services.
Streaming: collecting real-time data from sources like IoT devices, sensors, or social media using technologies like Apache Kafka or Amazon Kinesis.
API (Application Programming Interface): APIs allow applications to communicate and share data. Many web services and applications provide APIs for accessing their data.

After receiving the source data, we can transform it into a desired format and load it into a storage system for further use.

Storage

Storage is the cornerstone of data engineering—data must persist throughout its lifecycle. Besides, the storage stage frequently touches on other data engineering stages, such as collection, processing, and even data generation. Therefore, storing data in the context of data engineering is not as simple as saving it to a disk for personal use, especially when the quantity of data is significant.

Traditionally, data is stored in an on-prem storage system in a data center, where an abstraction layer (i.e., software) manages persistent storage media such as hard disks and SSDs and provides an interface to access them.

With the popularity of the cloud, cloud storage solutions (e.g., Amazon S3) are becoming the new norm. A cloud storage solution adds one more abstraction layer on top of data centers spread across multiple regions. However, the data is still stored in persistent media (e.g., disks) in a data center somewhere in the world.

(To learn more about data centers and storage technologies, please refer to the Brief Introduction of Data Center Technologies article)

Storage Abstractions

Managing data manually and directly on persistent media such as SSDs is tedious and does not scale, so modern storage solutions, whether on-prem or cloud, provide an abstraction layer to simplify and standardize our interactions with data storage systems. A database is a typical example of a storage abstraction. Newer abstractions, such as the data warehouse and the data lake, have also been created.

Data Warehouse

A data warehouse is a central data hub used for reporting and analysis. Its data is typically highly formatted and structured for analytics. Thus, the data stored in a data warehouse is like tables with rows and columns and has a predefined schema, meaning the data structure is defined before storing it. Typical data warehouses include Amazon Redshift, Google BigQuery, and Snowflake.

Data Lake

A data lake is a centralized repository that allows us to store structured, semi-structured, and unstructured data at any scale. Data is stored in its native format, and the schema is defined only when the data is read, not when it is written. This provides the flexibility to store any type of data. Amazon S3, a popular cloud storage service, is widely used for data lakes.

Data Lakehouse

A data lakehouse is a newer innovation that combines aspects of the data warehouse and the data lake, offering unified storage and schema flexibility together with data management, ACID transactions, and scalability. This hybrid architecture allows users to perform analysis and machine learning on all data, regardless of its structure. Delta Lake and Apache Iceberg are two popular technologies for building a data lakehouse.

Put Everything Together

A data pipeline combines everything described in the previous sections. It is not a one-time process; it usually needs to keep running. As a result, a complete data pipeline consists of a series of automated processes that move and transform data from various sources to a destination where the outcome can be analyzed and utilized.
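To show how the pieces fit together, here is a toy pipeline in plain Python. Everything in it is an assumption for illustration: the source is an in-memory list, the storage is a dictionary, and the function names are made up. A real pipeline would add scheduling, monitoring, and durable storage.

```python
def collect(source):
    """Data collection: gather raw records from a source system."""
    return list(source)

def transform(records):
    """Data transformation: drop records without an id,
    then total the amount per customer."""
    totals = {}
    for record in records:
        if record.get("id") is None:
            continue  # filter out invalid data
        customer = record["customer"]
        totals[customer] = totals.get(customer, 0) + record["amount"]
    return totals

def load(result, storage, key):
    """Storage: persist the result where consumers can read it."""
    storage[key] = result

def run_pipeline(source, storage):
    """One run of the pipeline; a scheduler would call this repeatedly."""
    load(transform(collect(source)), storage, "customer_totals")

# Hypothetical source records and storage.
source_records = [
    {"id": 1, "customer": "a", "amount": 10},
    {"id": 2, "customer": "a", "amount": 20},
    {"id": None, "customer": "b", "amount": 99},
    {"id": 3, "customer": "b", "amount": 5},
]
storage = {}
run_pipeline(source_records, storage)
print(storage)  # {'customer_totals': {'a': 30, 'b': 5}}
```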

A well-defined data pipeline ensures that every aspect—ingestion, processing, storage, orchestration, and monitoring—is carefully planned and integrated. However, like designing software, building a data pipeline is full of challenges and trade-offs. A holistic approach needs to take many situations into account.

Things to Consider

Building a data pipeline involves not only the pipeline but also the data it produces and manages.

Data Security

A data pipeline generates data that needs to be managed. Because of this, good data security practices must be applied.

Access must follow the principle of least privilege: grant a user or service only the permissions needed for the data and resources essential to its operations.
The other side of data security is data privacy. We must respect people’s privacy and comply with regulations such as GDPR and CCPA. As a result, sensitive data must be masked, especially when the data is PII (Personally Identifiable Information). Only a privileged person can view the unmasked sensitive data; everyone else sees only the masked value. Following this practice, even if a non-privileged person’s workstation is compromised, the unmasked sensitive data is far less likely to leak.
When granting permission, we must avoid giving permanent permission. All permissions should have a lifespan (i.e., time-to-live or TTL), so the given permission will be revoked after it expires.
The ability to share data is one of the most significant contributors to data leakage. As a result, the privilege of sharing data must be limited, and the data-sharing activities must be monitored.
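Two of these practices, masking and expiring permissions, can be sketched briefly. The masking rule, the `Grant` class, and the sample values below are hypothetical and only meant to illustrate the idea; real systems usually rely on the masking and access features of their data platform.

```python
from datetime import datetime, timedelta, timezone

def mask_email(email):
    """Keep the first character and the domain; hide the rest."""
    local, _, domain = email.partition("@")
    return local[:1] + "***@" + domain

def view_email(email, privileged):
    """Only privileged viewers see the unmasked value."""
    return email if privileged else mask_email(email)

class Grant:
    """A permission with a time-to-live instead of a permanent grant."""

    def __init__(self, user, resource, granted_at, ttl):
        self.user = user
        self.resource = resource
        self.expires_at = granted_at + ttl

    def is_active(self, now):
        return now < self.expires_at

granted_at = datetime(2025, 1, 1, 9, 0, tzinfo=timezone.utc)
grant = Grant("analyst_1", "crm.customers", granted_at, ttl=timedelta(hours=8))

print(view_email("alice@example.com", privileged=False))  # a***@example.com
print(view_email("alice@example.com", privileged=True))   # alice@example.com
print(grant.is_active(granted_at + timedelta(hours=1)))   # True
print(grant.is_active(granted_at + timedelta(hours=9)))   # False
```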

Data Quality

Data quality refers to the condition of a dataset, which has the following aspects:

Accuracy: data should be correct and free of errors.
Completeness: all required data should be present without any gaps.
Consistency: data should be uniform and compatible across different datasets.
Reliability: data should be trustworthy and reliable.
Relevance: data should be relevant and applicable to the task.

Good data quality is crucial for any data use case. Data quality checks can be applied to ensure it.

Running data quality checks against datasets is similar to running software unit tests. For example, if we want to ensure the completeness of the ID field in a dataset, we can iterate through the dataset and check whether the ID field is empty. Although we can implement data quality checks ourselves, many tools, such as Deequ and Great Expectations, have been created to make it easier.
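A hand-written version of that completeness check might look like the sketch below. The `check_completeness` helper and the sample rows are assumptions for illustration; they do not reflect the API of Deequ or Great Expectations.

```python
def check_completeness(rows, field):
    """Return the indexes of rows whose field is missing or empty."""
    return [i for i, row in enumerate(rows) if row.get(field) in (None, "")]

# Hypothetical dataset with two incomplete rows.
rows = [
    {"id": "1", "name": "Ada"},
    {"id": "", "name": "Lin"},   # empty id
    {"name": "Sam"},             # missing id
    {"id": "4", "name": "Kim"},
]

failures = check_completeness(rows, "id")
print(failures)  # [1, 2]

# Used like a unit test: fail the run when the check does not pass.
if failures:
    print(f"completeness check failed for {len(failures)} of {len(rows)} rows")
```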

Trade-off

When building software, we run unit tests every time something changes to ensure the change does not break anything. Similarly, we can run data quality checks every time data is collected and generated. However, doing that could significantly slow down the data pipeline performance because running data quality checks is expensive. Imagine a data pipeline needs to process one million records every time it runs. If we add a check to ensure there is no empty field, the check needs to iterate one million records every time. That’s one additional operation for processing the data. If we add more checks, more extra operations will be performed. The overall run time of the data pipeline will be significantly longer. Therefore, data quality checks need to be applied wisely.
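One possible way to apply checks more cheaply, offered here only as an assumption and not as the article's prescription, is to check a random sample instead of every record. The sketch below compares the empty-ID rate of a full scan with the rate estimated from a 5% sample; the dataset and the sample size are invented for illustration.

```python
import random

def null_rate(rows, field):
    """Fraction of rows whose field is missing or empty."""
    missing = sum(1 for row in rows if row.get(field) in (None, ""))
    return missing / len(rows)

# Hypothetical dataset: 100,000 records, every tenth one has an empty id.
records = [{"id": "" if i % 10 == 0 else str(i)} for i in range(100_000)]

rng = random.Random(42)               # fixed seed so the run is repeatable
sample = rng.sample(records, 5_000)   # inspect 5% of the records

full_rate = null_rate(records, "id")     # scans all 100,000 records
sampled_rate = null_rate(sample, "id")   # scans only 5,000 records

print(full_rate)
print(sampled_rate)
```

The sampled estimate is approximate, so this suits checks on rates and distributions better than checks that must catch every single bad record.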

Data Integration

A data pipeline may often have multiple data sources. Data integration is the process of combining data from different sources to provide a unified and comprehensive view. For example, a retail company has data stored in multiple systems:

Sales data in an ERP (Enterprise Resource Planning) system.
Customer data in a CRM (Customer Relationship Management) system.
Inventory data in a warehouse management system.

Integrating data from these disparate sources into a single data warehouse allows the company to analyze sales performance, understand customer behavior, and optimize inventory management all from one place. To do so, data must be consistent, accurate, and usable across the entire dataset. For instance, the user ID format may differ in the source systems. Therefore, when integrating the data, the user ID fields of all sources must be transformed into a consistent format.
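The user ID example can be sketched as follows. The ID formats, the canonical form `U000042`, and the sample ERP and CRM records are all hypothetical; the point is only that both sources pass through the same normalization before they are joined.

```python
def normalize_user_id(raw):
    """Map forms such as 'U-0042', 'u42', or 42 to the canonical 'U000042'."""
    digits = "".join(ch for ch in str(raw) if ch.isdigit())
    return "U" + digits.zfill(6)

# Hypothetical records from two source systems with different ID formats.
erp_sales = [
    {"user_id": "U-0042", "total": 120.0},
    {"user_id": "U-0007", "total": 35.5},
]
crm_customers = [
    {"user_id": 42, "name": "Ada"},
    {"user_id": 7, "name": "Lin"},
]

# Index CRM customers by canonical ID, then join the ERP sales onto them.
names_by_id = {normalize_user_id(c["user_id"]): c["name"] for c in crm_customers}

unified = []
for sale in erp_sales:
    user_id = normalize_user_id(sale["user_id"])
    unified.append(
        {"user_id": user_id, "name": names_by_id.get(user_id), "total": sale["total"]}
    )

print(unified)
```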

Data Lifecycle

Data destruction is usually not a top concern for a data project. However, regulations like GDPR and CCPA require companies to actively manage data destruction to respect customers’ “right to be forgotten.” As a result, companies must know what consumer data they retain and have procedures to destroy it in response to requests and compliance requirements. Besides, removing unnecessary data also reduces storage costs.

Data Lineage

Data lineage refers to recording an audit trail of data’s origins, movements, and transformations from its source to its destination. It provides a detailed record of how data flows through its lifecycle. Data lineage helps with tracking and debugging errors in both the data and the services that process it.

In addition, having lineage for customer data allows us to trace where a customer’s data is stored and its dependencies, which is necessary to comply with regulations like GDPR and CCPA.
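A lineage record can be as simple as a graph of which datasets are derived from which. The sketch below, with made-up dataset names and a hypothetical `downstream` helper, walks that graph to list every place a customer's data may have flowed, which is the list a deletion request would need to cover.

```python
from collections import deque

# Hypothetical lineage: each dataset maps to the datasets derived from it.
lineage = {
    "crm.customers": ["staging.customers_clean"],
    "staging.customers_clean": ["warehouse.customer_360", "ml.churn_features"],
    "warehouse.customer_360": ["reports.monthly_revenue"],
}

def downstream(lineage, start):
    """Breadth-first walk returning every dataset derived from `start`."""
    seen = []
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for child in lineage.get(node, []):
            if child not in seen:
                seen.append(child)
                queue.append(child)
    return seen

affected = downstream(lineage, "crm.customers")
print(affected)
```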

Scalability

In the real world, the volume of source data a pipeline collects is not constant; the data a source system generates grows most of the time. Therefore, when designing a data pipeline, we must consider future growth and ensure the infrastructure can handle increased data volume and complexity without compromising performance.

Monitoring and Alerting

Like any software service, a data pipeline continues operating, so its status needs to be monitored, and an alert needs to be triggered if its operations fail. Monitoring and alerting keep our data processes reliable, efficient, and secure.

Cost Control

With the popularity of cloud solutions, more services are being built and run on the cloud. Most cloud providers have a pay-as-you-go pricing model, so systems are billed per unit of processing or by some other variant of that model. That means the more resources we consume, the more we pay. Therefore, cost and resource consumption must be considered when designing a data pipeline.

Resources can be categorized as computing and storage. Computing refers to the resources on which software executes, such as CPU and memory. Nowadays, most cloud providers offer two types of computing resources: server and serverless. The server type is a virtual machine: it behaves like a physical server but is virtualized. With server-type computing, a virtual machine is allocated with a predefined configuration (e.g., CPU type and memory size), and we run our software on it. In contrast, with serverless computing we do not allocate a machine ourselves; our software runs on whatever computing resources the cloud provider supplies. Either way, computing is charged based on the size of the resources and the duration of use. For that reason, we should size computing resources appropriately and enable auto-scaling if the option is available.

A data pipeline generates data, usually a lot, and the data needs to be stored. Like the computing resources on the cloud, most cloud storage providers charge based on the size and duration. Setting up proper data lifecycle policies, moving cold data to the cold tier, and cleaning temp data could save storage costs.

(For more details about optimizing storage cost, please refer to A Guide for Optimizing AWS S3 Storage Cost. Although the article is written for Amazon S3, its ideas can be applied to other cloud storage platforms.)

Trade-off

Making software development decisions involves trade-offs, and optimizing the cost of a data pipeline is no exception. For example, a bigger cluster costs more per hour, but the pipeline may run faster, so the overall cost might be lower with the bigger cluster. Similarly, storing pre-computed data increases the storage footprint, but it could significantly reduce the cost of computing; since storage is usually cheaper than computing, the overall cost could be lower. To sum up, when evaluating costs, every factor of a data pipeline needs to be considered.

Conclusion

All data problems become problems because of the gigantic quantity of data. Even the dumbest approach can sort one hundred records well. However, sorting billions of records becomes a complex problem. Numerous new technologies have been invented to solve data problems. However, despite how many new technologies have emerged, the core problem remains: making data usable and valuable. Data engineering is the method to achieve that goal regardless of the tools or technologies used.

The post The Essence of Data Engineering appeared first on Ilha Formosa 1544.

A Guide for Optimizing AWS S3 Storage Cost

Shun Huang — Tue, 20 Feb 2024 01:50:49 +0000

AWS Simple Storage Service (S3) is one of the most popular cloud storage services. Unlike many goods whose prices increase, the cost of storage per unit decreases over time. However, the amount of data we store grows much faster than the unit cost falls, so if we don’t use S3 wisely, we may be surprised by our S3 bill. This article provides a guideline for optimizing S3 storage cost and includes some S3 details we need to know to use S3 wisely.

What Does AWS Charge on S3?

Before we look at how to save on S3 cost, we need to understand how S3 charges us. The S3 Pricing page (https://aws.amazon.com/s3/pricing/) shows the details of its pricing model. At the time of writing, S3 charges not only for storage but also for data retrieval, data transfer, security, monitoring and analysis, replication, and transformation. Although S3 advertises that we pay only for what we use, the cost of some features is unavoidable.

The diagram above shows the minimum operations we would perform when using S3 – ingesting, storing, and reading the data. Therefore, the total cost of S3 can be simplified as this.

Total Cost = Storage Cost + Request Cost + Transfer Cost + Other Cost

Among all the S3 features, the storage size is the critical factor in the equation, because it affects how much the other operations are charged. This article does not cover costs such as transfer, replication, and encryption because, most of the time, we use those features only when we need them, and they are charged on demand. Therefore, there is not much we could save from those features.

Storing Data in Suitable Storage Classes

Data stored in S3 are not always treated in the same way. Instead, each object in S3 has a storage class associated with it – different storage classes offer different data access performance and features with different prices. The complete list of available storage classes is available at Amazon S3 Storage Classes. By default, data is stored in the Standard Class – the most efficient and expensive class. Thus, keeping our data in a proper class could save us a lot of money. For example, the following table shows the cost of 100TB of data in different storage classes and how much it could save compared to the Standard Class.

Class	Rate per Month (as of February 2024)	Total	Save
Standard	First 50 TB: $0.023/GB; next 450 TB: $0.022/GB	50 * 1024 GB * 0.023 + 50 * 1024 GB * 0.022 = $2,304	NA
Standard-IA	$0.0125/GB	100 * 1024 GB * 0.0125 = $1,280	$1,024
Glacier Instant Retrieval	$0.004/GB	100 * 1024 GB * 0.004 = $409.6	$1,894.4
Glacier Flexible Retrieval	$0.0036/GB	100 * 1024 GB * 0.0036 = $368.6	$1,935.4
Glacier Deep Archive	$0.00099/GB	100 * 1024 GB * 0.00099 = $101.38	$2,202.62
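The arithmetic in the table can be reproduced with a few lines of Python. This is a minimal sketch that uses the February 2024 rates quoted above; the function and variable names are illustrative, and the tiered Standard calculation only covers data sizes up to 500 TB.

```python
GB_PER_TB = 1024

# Flat monthly rates in USD per GB, copied from the table (February 2024).
FLAT_RATES = {
    "Standard-IA": 0.0125,
    "Glacier Instant Retrieval": 0.004,
    "Glacier Flexible Retrieval": 0.0036,
    "Glacier Deep Archive": 0.00099,
}


def standard_monthly_cost(size_gb):
    """Tiered Standard pricing: first 50 TB at $0.023/GB, the rest at $0.022/GB.

    Only valid up to 500 TB, which is the range the table covers.
    """
    first_tier_gb = 50 * GB_PER_TB
    in_first_tier = min(size_gb, first_tier_gb)
    above_first_tier = max(size_gb - first_tier_gb, 0)
    return in_first_tier * 0.023 + above_first_tier * 0.022


size_gb = 100 * GB_PER_TB
standard_cost = standard_monthly_cost(size_gb)
monthly_costs = {name: size_gb * rate for name, rate in FLAT_RATES.items()}

print(f"Standard: ${standard_cost:,.2f}")
for name, cost in monthly_costs.items():
    print(f"{name}: ${cost:,.2f} (saves ${standard_cost - cost:,.2f})")
```

Changing `size_gb` or the rates makes it easy to redo the comparison when prices change.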

Although the example shows how much we could save by storing data in the other classes, the devil is in the details – the classes with cheaper storage costs have higher retrieval costs. Besides, if data is stored in the Deep Archive class, the data needs to be restored before access. The example below calculates the cost of retrieving 100TB of data with 1M GET requests from different classes.

Class	GET Request Rate per 1000 Requests (as of February 2024)	GET Requests	Data Retrieval Rate per GB (as of February 2024)	Retrievals	Total
Standard	$0.0004	1,000,000 / 1,000 * 0.0004 = $0.4	NA	$0	$0.4
Standard-IA	$0.001	1,000,000 / 1,000 * 0.001 = $1	$0.01	100 * 1024 GB * 0.01 = $1024	$1025
Glacier Instant Retrieval	$0.01	1,000,000 / 1,000 * 0.01 = $10	$0.03	100 * 1024 GB * 0.03 = $3072	$3082
Glacier Flexible Retrieval	$0.0004	1,000,000 / 1,000 * 0.0004 = $0.4	$0.01 (Standard)	100 * 1024 GB * 0.01 = $1024	$1024.4
Glacier Deep Archive	$0.0004	1,000,000 / 1,000 * 0.0004 = $0.4	$0.02 (Standard)	100 * 1024 GB * 0.02 = $2048	$2048.4

This example demonstrates that we may spend more money if we store our data in the cheapest class but neglect the other factors.

Choose the Appropriate Storage Class

The previous section shows how much we could save if we stored our data in the suitable class and how much more money we might spend if we chose the wrong class. So, how do we pick an appropriate class? The answer is: it depends. The choice of storage class is based on the access patterns, both those we know and those we don’t.

Data with Known or Predictable Access Patterns

If we know how frequently we need to access the data we store, we can store them in the most cost-efficient class. For example, banks are required to keep records of customers’ accounts for five years after they close their accounts, and the chances of retrieving the data within five years are meager. Therefore, storing the data in an infrequent access or even an archive class makes sense, as long as the storage saving from the non-standard class is more than the cost of data retrieval. We can use the following equation to evaluate if we should move the data to a non-standard class, and if yes, which one.

$$ S \times (C_{std} - C_{ns}) - N \times (R_{ns} - R_{std}) - D \times V_{ns} > T $$

where

$S$ is the total size of data to be stored.
$C_{std}$ is the cost of storing data in the Standard Class (per unit of data, over the whole retention period).
$C_{ns}$ is the cost of storing data in an infrequent or archive class (per unit of data, over the whole retention period).
$N$ is the total number of requests to be issued.
$R_{ns}$ is the cost of each request in an infrequent or archive class.
$R_{std}$ is the cost of each request in the Standard Class.
$D$ is the total size of data to be retrieved.
$V_{ns}$ is the cost of retrieving data from an infrequent or archive class (per unit of data).
$T$ is a threshold that defines the saving as more than a certain number, so we feel comfortable storing the data in the non-standard class in case something comes up.

In the bank example, assume the bank has 100TB of closed customer accounts that must be kept for five years. The bank’s experience shows that the chance of accessing the data is very low, so it estimates that at most 100TB will be retrieved with 1M requests (in other words, all data is accessed once during the five years). Applying the equation above over the 60 months gives the following cost estimate.

Non-Standard Class	Storage Saving from Standard Class	Additional GET Request Cost from Non-Standard Class	Additional Retrieval Cost from Non-Standard Class	Total Saving from Standard Class
Standard-IA	$1,024 * 60 = $61,440	1,000,000 / 1,000 * (0.001 – 0.0004) = $0.6	$1,024	$61,440 – $0.6 – $1024 = $60,415.4
Glacier Instant Retrieval	$1,894.4 * 60 = $113,664	1,000,000 / 1,000 * (0.01 – 0.0004) = $9.6	$3,072	$113,664 – $9.6 – $3072 = $110,582.4
Glacier Flexible Retrieval	$1,935.4 * 60 = $116,124	1,000,000 / 1,000 * (0.0004 – 0.0004) = $0	$1,024	$116,124 – $0 – $1024 = $115,100
Glacier Deep Archive	$2,202.62 * 60 = $132,157.2	1,000,000 / 1,000 * (0.0004 – 0.0004) = $0	$2,048	$132,157.2 – $0 – $2048 = $130,109.2

(The storage savings and retrieval costs come from the example in the previous section.)

The table shows how much money we could save by applying the equation. Of course, we still need to consider other factors of storing data in a non-standard class, such as how long it takes to retrieve data from Glacier Deep Archive. However, at the minimum, it gives us a rough idea of our potential storage cost and savings.
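As a rough sketch, the same evaluation can be scripted so that it is easy to rerun with different assumptions. The function name, the dictionary layout, and the zero threshold are illustrative choices; the rates and the monthly storage costs are the ones from the tables in the previous section.

```python
def total_saving(std_monthly, ns_monthly, months, requests,
                 ns_request_rate, std_request_rate,
                 retrieved_gb, ns_retrieval_rate):
    """Saving from using a non-standard class instead of Standard.

    Request rates are in USD per 1,000 requests; the retrieval rate is per GB.
    """
    storage_saving = (std_monthly - ns_monthly) * months
    extra_request_cost = requests / 1000 * (ns_request_rate - std_request_rate)
    retrieval_cost = retrieved_gb * ns_retrieval_rate
    return storage_saving - extra_request_cost - retrieval_cost


STANDARD_MONTHLY = 2304.0    # 100 TB in Standard, USD per month
STANDARD_GET_RATE = 0.0004   # USD per 1,000 GET requests
THRESHOLD = 0.0              # hypothetical safety margin, in USD

# name: (monthly storage cost, GET rate per 1,000 requests, retrieval rate per GB)
CLASSES = {
    "Standard-IA": (1280.0, 0.001, 0.01),
    "Glacier Instant Retrieval": (409.6, 0.01, 0.03),
    "Glacier Flexible Retrieval": (368.64, 0.0004, 0.01),
    "Glacier Deep Archive": (101.376, 0.0004, 0.02),
}

savings = {
    name: total_saving(STANDARD_MONTHLY, monthly, 60, 1_000_000,
                       get_rate, STANDARD_GET_RATE, 100 * 1024, retrieval_rate)
    for name, (monthly, get_rate, retrieval_rate) in CLASSES.items()
}

for name, saving in savings.items():
    verdict = "worth moving" if saving > THRESHOLD else "stay in Standard"
    print(f"{name}: ${saving:,.2f} over five years ({verdict})")
```

The results for the two cheapest classes differ from the table by a few dollars because the table rounds the monthly saving before multiplying by 60.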

Moving Data Between Classes

By default, S3 stores newly created objects in the Standard Class unless we specify the storage class when storing data; for example, the Boto3 put_object API lets us put an object directly into a specific storage class. However, in most cases, we move data from the Standard Class to a non-standard class, and the most efficient way to transfer data between classes is to leverage the S3 Lifecycle. With a lifecycle, we can define rules to perform specific actions on a group of objects. For instance, in the bank example mentioned in the previous section, we can configure a lifecycle policy that moves the customer records from the Standard Class to Glacier Deep Archive after their accounts have closed and deletes the data after five years.
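A lifecycle rule for the bank example might look like the sketch below, written as the dictionary that Boto3's put_bucket_lifecycle_configuration accepts. It assumes that records of closed accounts are written under a hypothetical closed-accounts/ prefix at the moment an account closes, since lifecycle days are counted from object creation; the rule ID is also a made-up name.

```python
import json

# Assumption: closed-account records land under the "closed-accounts/" prefix.
closed_accounts_lifecycle = {
    "Rules": [
        {
            "ID": "archive-closed-accounts",
            "Status": "Enabled",
            "Filter": {"Prefix": "closed-accounts/"},
            # Move objects to Glacier Deep Archive as soon as possible.
            "Transitions": [{"Days": 0, "StorageClass": "DEEP_ARCHIVE"}],
            # Delete them five years after creation.
            "Expiration": {"Days": 5 * 365},
        }
    ]
}

print(json.dumps(closed_accounts_lifecycle, indent=2))
```

With Boto3, the dictionary would be passed as the LifecycleConfiguration argument of put_bucket_lifecycle_configuration together with the bucket name.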

Although the S3 Lifecycle is the most efficient way to move data between classes, it is not free, so the lifecycle cost needs to be included in the S3 total cost equation.

Total Cost = Storage Cost + Request Cost + Transfer Cost + Lifecycle Cost

Data with Unknown Access Patterns

When the access pattern is unknown or unpredictable, AWS S3 has a solution – the Intelligent-Tiering class.

Intelligent-Tiering

Intelligent-Tiering automatically moves data to the most cost-effective tier by monitoring how the data is accessed. The tiers within the Intelligent-Tiering class are different from the other storage classes. See the diagram below.

Unlike other storage classes, there is no data retrieval fee, and the request costs are the same in all tiers. However, monitoring and automation are charged per object, which needs to be considered.

Total Cost = Storage Cost + Monitoring Fee

In this equation, the storage cost is the summation of the storage costs of each tier, and the monitoring fee is based on the number of objects.

The Intelligent-Tiering class is a great way to optimize our S3 storage cost. AWS S3 recommends using Intelligent-Tiering in most cases. However, there is still room to improve the costs.

First, object size matters. Objects smaller than 128KB are not monitored, so they are never moved to other tiers and are always charged at the Frequent Access tier rate (see Automatic Access tiers in the How it works section). So, avoid storing objects smaller than 128KB in this class.

Second, the number of objects matters. At the time of writing, the cost of monitoring and automation in Intelligent-Tiering is $0.0025 per 1000 objects. This means that for the same amount of data, a dataset made of many small objects pays a higher monitoring fee than one made of fewer large objects. The table below demonstrates the difference between two scenarios – one with an average object size of 1 MB and the other with 100 MB – both holding 100 TB of data in total.

Object Size	Number of Objects	Monitoring Cost
1 MB	104,857,600	104,857,600 / 1000 * 0.0025 = $262.14
100 MB	1,048,576	1,048,576 / 1000 * 0.0025 = $2.62

The example clearly shows that the monitoring and automation fee for the dataset with small objects is much higher than for the one with big objects. Of course, it’s unlikely every object will be the same size, but it gives us an idea that the number of objects matters.
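The relationship between object size and monitoring fee can be sketched in a small helper. The function name is illustrative, the rate is the $0.0025 per 1000 objects quoted above, and every object is assumed to have the same size.

```python
def monitoring_cost(total_tb, avg_object_mb, rate_per_1000=0.0025):
    """Return (number of objects, monthly monitoring fee in USD).

    Assumes all objects have the same size and are large enough to be monitored.
    """
    total_mb = total_tb * 1024 * 1024
    objects = total_mb / avg_object_mb
    return objects, objects / 1000 * rate_per_1000


for size_mb in (1, 100):
    objects, fee = monitoring_cost(100, size_mb)
    print(f"{size_mb} MB objects: {objects:,.0f} objects, ${fee:,.2f} per month")
```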

Third, if the data access pattern is stable, we may not gain benefits from Intelligent-Tiering but spend an unnecessary extra fee. For instance, if no data is moved between tiers because all data is actively accessed, all data is stored in the Frequent Access tier. Still, the Intelligent-Tiering class charges a monitoring fee. Therefore, we pay the extra monitoring fee without getting benefits.

Monitor and Analysis

We don’t know how well we do if we don’t measure our S3 usage and monitor its behavior. AWS provides a few options to gain insights into S3 – Storage Lens, Storage Class Analysis, and S3 Inventory.

Storage Lens offers account-level or organization-wide insights into storage usage and activity trends, detailed metrics, and cost optimization recommendations.
Storage Class Analysis monitors data access patterns and classifies data as frequently or infrequently accessed by age of objects. The analysis reports include metrics like object age, object count and size, request count, and the size of data uploaded, stored, and retrieved.
S3 Inventory reports object-level metrics such as version ID, size, storage class, and the tier of the Intelligent-Tiering class.

With the reports generated from these monitoring and analysis features, we can evaluate our storage usage to ensure the way we use S3 is optimized. However, except for the default dashboard of the Storage Lens, the monitoring and analysis features all have their costs. Besides, Storage Class Analysis and S3 Inventory can export and store reports in S3. Those files are subject to S3 storage charges. Depending on the export frequency and number of objects monitored, the files produced by Storage Class Analysis and S3 Inventory can grow very fast and consume a lot of storage, which must be considered in the S3 cost optimization plan.

The Number of Objects Matters

Similar to the monitoring and automation fee in the Intelligent-Tiering class, all the monitoring and analytics are charged by the number of objects monitored. Besides, the smaller the number of objects, the fewer PUT, GET, and all other requests are needed. Lifecycle transition requests will be cheaper, too. Therefore, we should make the objects as compact as possible with the same amount of data. This not only improves the data access performance but also saves money from the S3 features that charge by object count.

Best Practice

This section describes some use cases that may be helpful in similar situations.

Set Intelligent-Tiering as Default

Since the Intelligent-Tiering Class is the preferred class in many use cases, it makes sense if it is the default storage class. However, newly created objects are stored in the Standard Class by default. Fortunately, there are two ways to immediately put a newly created object in the Intelligent-Tiering Class so the Intelligent-Tiering Class behaves as the default storage class.

The first method is to specify the class when storing data. If we control the producer who puts the data to S3, we can specify Intelligent-Tiering as the storage class when calling PUT API or SDK (e.g., Boto3 put_object).
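A minimal sketch of the first method, assuming the producer uses Boto3: the helper below takes an S3 client and passes StorageClass to put_object. The helper name and the bucket and key in the usage comment are placeholders.

```python
def put_in_intelligent_tiering(s3_client, bucket, key, body):
    """Upload one object directly into the Intelligent-Tiering class."""
    return s3_client.put_object(
        Bucket=bucket,
        Key=key,
        Body=body,
        StorageClass="INTELLIGENT_TIERING",
    )


# Usage (requires boto3 and AWS credentials; the names are placeholders):
#   import boto3
#   put_in_intelligent_tiering(boto3.client("s3"), "my-bucket", "raw/data.json", b"{}")
```

Passing the client in as an argument keeps the helper easy to test with a stub and leaves credential handling to the caller.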

Nevertheless, we don’t control the producer in most cases, so the second way is to leverage S3 Lifecycle to move a newly created object into the Intelligent-Tiering Class as soon as the object is put into S3. A lifecycle example that does this is exhibited below.
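A minimal sketch of such a lifecycle configuration, written as the dictionary that Boto3's put_bucket_lifecycle_configuration accepts. The rule ID is a hypothetical name and the empty prefix applies the rule to the whole bucket.

```python
import json

intelligent_tiering_default = {
    "Rules": [
        {
            "ID": "default-to-intelligent-tiering",
            "Status": "Enabled",
            "Filter": {"Prefix": ""},  # whole bucket
            # Day 0: transition objects as soon as the lifecycle runs.
            "Transitions": [{"Days": 0, "StorageClass": "INTELLIGENT_TIERING"}],
            # Clean up noncurrent versions after seven days.
            "NoncurrentVersionExpiration": {"NoncurrentDays": 7},
        }
    ]
}

print(json.dumps(intelligent_tiering_default, indent=2))
```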

The lifecycle moves newly created objects to Intelligent-Tiering immediately (Note that the lifecycle policy above also deletes noncurrent versions after seven days).

ETL

Usually, the access pattern of an ETL is predictable and stable, and a typical ETL has three steps – extract, transform, and load, like the picture shown below.

In the extract step, the ETL extracts the data from the source; the raw data is stored (e.g., s3://my_bucket/raw/ in this example). The transform step reads the raw data and processes it (assuming it only reads the data that hasn’t been processed). During the transformation, some temporary data may be created and stored (e.g., s3://my_bucket_temp/). Once the data is processed, the ETL loads the processed data to the target location (e.g., s3://my_bucket/table/) from which a user can query.

Assuming the pipeline runs once daily, the raw data will be written once and read once in this setup. Therefore, a lifecycle policy like the one below would be a good choice for the data in s3://my_bucket/raw/.
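One way to express such a policy is sketched here as a Boto3-style lifecycle dictionary. The rule ID is a made-up name, and the raw/ prefix follows the example location.

```python
import json

raw_data_lifecycle = {
    "Rules": [
        {
            "ID": "raw-data-retention",
            "Status": "Enabled",
            "Filter": {"Prefix": "raw/"},
            # Keep in Standard for 30 days, then move to Glacier Instant Retrieval.
            "Transitions": [{"Days": 30, "StorageClass": "GLACIER_IR"}],
            # Expire current versions 90 days after creation.
            "Expiration": {"Days": 90},
            # Delete versions 30 days after they become noncurrent.
            "NoncurrentVersionExpiration": {"NoncurrentDays": 30},
        }
    ]
}

print(json.dumps(raw_data_lifecycle, indent=2))
```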

The raw data will stay in the Standard Class for thirty days in case we need to debug any issue. After that, objects will be moved to Glacier Instant Retrieval Class and expire after 90 days. If an object becomes noncurrent, it will be deleted after 30 days.

Regarding the processing data (stored in s3://my_bucket_temp/), since it’s temporary and the ETL runs once a day, we can have a lifecycle policy that deletes the temporary data, like the following example.
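A sketch of such a policy for the temporary bucket is given below. The one-day expiration is an assumption chosen to match a pipeline that runs once a day, and the rule ID is a hypothetical name.

```python
import json

temp_data_lifecycle = {
    "Rules": [
        {
            "ID": "expire-temp-data",
            "Status": "Enabled",
            "Filter": {"Prefix": ""},  # whole temporary bucket
            # Assumption: temporary data is not needed after one day.
            "Expiration": {"Days": 1},
            # If versioning is enabled, also drop noncurrent versions quickly.
            "NoncurrentVersionExpiration": {"NoncurrentDays": 1},
        }
    ]
}

print(json.dumps(temp_data_lifecycle, indent=2))
```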

Finally, once the data has been processed, it will be stored at s3://my_bucket/table/, and accessed frequently by applications and users, so the default Standard Class is the best option—no need to move the data to other classes.

Delete Data Properly

When deleting data, we need to ensure the data is deleted, especially in the following scenarios.

Versioning is Enabled

When versioning is enabled, the DELETE operation does not permanently delete an object, whether the DELETE operation is issued through the S3 Console, API, SDK, or CLI. Instead, S3 inserts a delete marker in the bucket, and the delete marker becomes the current object version with a new version ID.

When we try to GET the deleted object (i.e., the object’s current version is a delete marker), S3 returns a Not Found error – it behaves like the object had been deleted. However, if we enable Show versions in the S3 Console, we can see the delete marker, and all noncurrent versions still exist.

Therefore, the object still exists in the bucket, and we keep paying its storage fee. There are several ways to make sure an object is really deleted (of course, only when we really want to delete it). When using the S3 Console, we need to enable the Show versions option to view and select the object versions to delete.

One thing worth mentioning is the confirmation message when deleting an object via the S3 Console. When deleting an object with a version ID, the confirmation message is permanently delete (the same as when deleting an object in a versioning-disabled bucket). In contrast, when deleting an object without a version ID, the confirmation message is just delete. So, from the message, we can tell whether an object is really deleted or not.

A programmatic way is to use a lifecycle policy to clean up deleted objects. For instance, we can use a lifecycle policy like the one below to ensure the objects are deleted.
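A minimal sketch of such a policy, again as a Boto3-style lifecycle dictionary. The rule ID is hypothetical, and the 30-day window is the one used in the explanation that follows.

```python
import json

cleanup_deleted_objects = {
    "Rules": [
        {
            "ID": "cleanup-deleted-objects",
            "Status": "Enabled",
            "Filter": {"Prefix": ""},  # whole bucket
            # Permanently delete versions 30 days after they become noncurrent.
            "NoncurrentVersionExpiration": {"NoncurrentDays": 30},
            # Remove delete markers that no longer have any noncurrent versions.
            "Expiration": {"ExpiredObjectDeleteMarker": True},
        }
    ]
}

print(json.dumps(cleanup_deleted_objects, indent=2))
```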

When an object is deleted, a delete marker is created and becomes the current version; the original object becomes noncurrent. This lifecycle policy permanently deletes objects 30 days after they are deleted (i.e., become noncurrent), and when a delete marker has no noncurrent object, the delete marker becomes an expired object delete marker, and will be deleted by the lifecycle policy as well. In other words, thirty days after an object is deleted, the object, including the versions, and its delete marker will be permanently deleted.

Note that a lifecycle policy cannot permanently delete current objects without expiring them first. If it could, it would contradict the purpose of a versioning-enabled bucket. Therefore, if we want to always permanently delete objects (i.e., no delete markers are added), we should use a versioning-disabled bucket.

Besides, when deleting an object via API, SDK, or CLI, we must specify the object’s version ID to ensure the object is permanently deleted. In this case, S3 will not create a delete marker and will permanently delete the object’s specific version.
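With Boto3, for example, this could look like the sketch below. It lists every version and delete marker of one key with list_object_versions and deletes each by version ID. The helper name is illustrative, and the client is passed in so that the caller controls credentials and region.

```python
def delete_all_versions(s3_client, bucket, key):
    """Permanently delete every version and delete marker of a single key.

    Returns the list of version IDs that were deleted.
    """
    deleted = []
    kwargs = {"Bucket": bucket, "Prefix": key}
    while True:
        page = s3_client.list_object_versions(**kwargs)
        entries = page.get("Versions", []) + page.get("DeleteMarkers", [])
        for entry in entries:
            # Prefix matching can return other keys that start with `key`.
            if entry["Key"] != key:
                continue
            s3_client.delete_object(
                Bucket=bucket, Key=key, VersionId=entry["VersionId"]
            )
            deleted.append(entry["VersionId"])
        if not page.get("IsTruncated"):
            break
        # Continue from where the previous page stopped.
        kwargs["KeyMarker"] = page["NextKeyMarker"]
        kwargs["VersionIdMarker"] = page["NextVersionIdMarker"]
    return deleted


# Usage (requires boto3 and AWS credentials; the names are placeholders):
#   import boto3
#   delete_all_versions(boto3.client("s3"), "my-bucket", "table/part-0001.parquet")
```

Because this removes data irreversibly, it is worth running it against a test bucket first.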

AWS S3 has a detailed document describing how deleting works in a versioning-enabled bucket: Deleting object versions from a versioning-enabled bucket – Amazon Simple Storage Service

Query Engine Metadata

S3 is a common building block of big data solutions – using S3 as the storage layer, and there is a query engine (e.g., Databricks and Snowflake) on top of it so that people can query objects like a database with data stored in S3. The query engine maintains its metadata to manage the objects stored in S3 and perform better. Depending on the query engine’s design and configuration, the behavior of deleting may not be the same as our expectation. For example, using the DROP TABLE command on an external table in Databricks only deletes the metadata, not the data; the data itself still exists in S3, and we keep paying the storage fee. Therefore, when using query engines with AWS S3, we need to pay attention to how the query engines interact with S3, especially when deleting data.

Avoid Redundant Backup Data

AWS S3 offers excellent options to back up data and keep its history, such as S3 Replication and Versioning, and we expect backups to increase the storage footprint. However, the size of the backup data may surprise us when applications on top of S3 keep their own copies. The following scenario demonstrates the issue.

Assuming we use Databricks with S3, Databricks offers a feature called Time Travel, which allows us to go back to an older version of a Delta table. To make the time travel work, Databricks needs to keep a copy of each version of the table (stored in S3 in this case). Consider a Spark job running hourly and performing an overwrite method like the code below.

# read the source
df = spark.read...
# some transformation
df = ...
# write to the target
df.write.format("delta").mode("overwrite").save("<S3 Location>")

If each run writes about 1GB, then because of the Time Travel feature, one year later there will be about 8,760GB (365 * 24) stored in S3. Usually, we use overwrite mode in Spark because we don’t care about the previous data that is overwritten. We might also think we don’t generate unnecessary backups by overwriting, yet the application has been keeping backups all along. Worse, the S3 lifecycle does not help in this case. The reason is that every copy written to S3 is treated as a new object (i.e., the current version), so the older copies never become noncurrent. Therefore, we cannot use a lifecycle policy that targets noncurrent versions to delete the old copies. Expiring current versions is risky because S3 has no idea what happens on the application side; it might delete data we still need.

A solution in this Databricks use case is to leverage the VACUUM command, designed for cleaning the old copies created by the Time Travel feature.
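To get a feel for the effect, the back-of-the-envelope model below compares the storage footprint with and without regular cleanup. It is a simplified sketch: it assumes every hourly run writes a full 1GB copy and that the cleanup removes all copies older than a retention window. The seven-day window is an assumed value for illustration.

```python
def stored_gb(hours_elapsed, gb_per_write=1.0, retention_hours=None):
    """Approximate storage used by an hourly overwrite job.

    With no retention window, every copy ever written is kept.
    With a window, only the copies written within it remain.
    """
    copies = hours_elapsed  # one copy per hourly run
    if retention_hours is not None:
        copies = min(copies, retention_hours)
    return copies * gb_per_write


one_year = 365 * 24
print(f"No cleanup:      {stored_gb(one_year):,.0f} GB")
print(f"7-day retention: {stored_gb(one_year, retention_hours=7 * 24):,.0f} GB")
```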

Unfortunately, there is no single solution to this situation; it depends on the application. All we can do is be aware of how the application works with S3.

Optimize the Data

In the section The Number of Objects Matters, we learned that the number of objects affects S3 costs a lot. A general principle is to have fewer, larger objects. Some applications, such as a query engine, can make objects compact. For example, Databricks provides an OPTIMIZE command to coalesce small objects into larger objects. If the applications we use have this kind of ability, we should utilize them.

Do Not Monitor Something Unnecessary

We all know the importance of monitoring data. However, monitoring is not free in S3, so we must carefully choose what to monitor. Usually, we monitor the data whose access pattern and usage are unknown so that we can lay out our S3 plan accordingly or adjust our S3 strategy by reviewing the monitoring reports. On the contrary, we won’t get much value from monitoring something we already know – something we should avoid.

Types of data we usually don’t need to monitor:

Temporary data. Temporary data usually live for a short period; there is no reason to monitor them.
Landing data. In a typical ETL, the landing data is usually raw and needs to be transformed. Those data are typically read only once.
Data with stable and predictable access patterns.
Log data. Log files are usually small but numerous, are not accessed often, and will be deleted eventually.
Cold data. Cold data are those mainly for backup and are rarely accessed.

Besides, we should keep an eye on the monitoring fee to ensure the monitoring fee does not exceed the savings we could potentially get from monitoring. This may happen when there are a lot of small objects to be monitored, and since the objects are small, we won’t gain much savings from moving objects to a cheaper class. Ironically, this is hard to know without measuring. The only way to avoid this is to review our monitoring setup and monitoring reports periodically.

Monitoring our data is essential, but it comes at a price; use it wisely.

Summary

The following are some tips that may help optimize our S3 cost.

Store data in the most appropriate storage class.
Prefer Intelligent-Tiering except in the following situations.
- The data access patterns are well-known, predictable, and stable.
- Temporary data that will be deleted (especially in a short period)
Apply lifecycle rules to data that are no longer active or needed.
Apply lifecycle rules to old versions that are no longer needed.
Coalesce objects so they are larger but fewer.
Periodically review S3 insights using storage lens, inventory, or analysis. Even if everything is in Intelligent-Tiering, unknown data access patterns may become known by examining the data insights.
When versioning is enabled, ensure a lifecycle policy cleans up deleted objects.
Understand how applications interact with S3.

The cost of using S3 is affected by many factors, and it’s not possible to have a solution that constantly optimizes our S3 storage cost automatically. The only way to ensure our S3 cost is optimized is to keep reviewing our S3 usage and adjust our approach accordingly.

The post A Guide for Optimizing AWS S3 Storage Cost appeared first on Ilha Formosa 1544.