Extract Transform and Load: A Best Practices Guide (3 Part Series)
Part 1 of the Data Ingest Series
The process of extract transforms and load (ETL) goes by many names:
- Data acquisition
- Data Ingest
- Enterprise transform and load
But they all are about getting external data into the system.
The problem that most businesses face is that there are no easy to follow best practices that can guide them in creating ETL pipelines. In this 6-part series, I will address this gap by laying out a simple and flexible set of best practices for creating or evaluating an ETL system.
The importance of good quality data
First let’s discuss why ETL is important to the business. Data is frequently the most important asset in any business. In the information age it’s easy to see why. Let look at these examples:
- Amazon captures buyer data to suggest new items to purchase.
- Facebook tracks user data to use and sell for targeted marketing.
- Google uses user search history to display google ads.
- The music industry exchanges data in order to sell music, pay royalties, and exchange rights.
- Governments ingest data for threat assessment in order to enhance national security.
The list goes on and on...
So how we handle these data sources and the quality of these data sources are high import for sales, governance, commerce, and in some cases life and death decision making.
Buy vs. build
This subject is deserving of its own blog article and can actually be quite controversial during the due diligence phase. I’m only going to briefly touch on the subject here in this series.
What are some of the criteria when evaluating a COTS ETL platform?
The common criteria when evaluating building something custom or buying Commercial off the Shelf (COTS) platforms are:
- Price. How much license? How much is the cost of building from scratch? Are their maintenance contracts? Will I need to send my employee’s to external training? Will I need a third-party consultant on staff? What are the hosting costs?
- Maintenance. How difficult is it to upgrade? Are there solid upgrade paths? Where will I host the application? Is this a stand-alone/hosted install? Or can I deploy into the supporting app.
- Speed of ingest. How quickly do I need to process data? Custom built solutions are often much faster that COTS products.
- Use Case. Will this application be used on one project or across the entire organization? (Be honest with this one. I’ve worked with countless COTS products bought and installed for a particular project where the organization says “Lets beta test this million-dollar product on one project and if it works, let's implement it to the rest of the organization”. Only they never do and now they are stuck with a very expensive license).
- Governance. Do I need to maintain a separate set of users and roles? Can I limit who can access my data pipelines
- Flexibility. How easy is it to add new input data types. Can I change the process flow?
- Who will be developing it? A feature of COTS products is they often claim workflows (pipelines) can be created and managed by business users. This often requires special training of business users and a particular type of business user to maintain this. Sometimes companies will use an existing developer to work on the pipeline anyway.
There are many great products out there that will give flexible pipelines with the kitchen sink when building your data pipeline. This seems great! Buy a license, install, create a pipeline and let’s go! However, consider the hidden costs in this approach.
The majority of data pipelines can be boiled down to 6 main components. Some of these components can be reordered, others may be optional in a particular use case. However, understanding and fully considering each of the components is the key to ensuring data quality and pipeline reusability.
Submission Information Package (SIP)
Submission information package is the first step in any data pipeline. It is the data itself and how the data will get into the system, also known as raw data and data endpoints.
Raw data is the data files being submitted or ingested by the system. This could be multiple formats from multiple submitters. The questions to ask are
- What data types are we ingesting (Excel, XML, JSON) and are there any particular standards used in the industry (Industry SOPS, Schemas, XSD, format best practices)
- Do we need to create a standard? Or do we need to simplify an existing standard?
- Who will be submitting the data? Will it be from one partner or can multiple partners submit the same data types.
The endpoints discuss how the system gets the data. The questions to ask are
- Will partners be submitting data to us (Push)?
- Or will we need to get the data from the partners (Pull)?
High-level validation is the second step of any ETL process. High-Level validation is mainly concerned with ensuring that the data format being ingested is actually a format that the system knows how to ingest. For example, it checks that an XML file conforms to the schema, or that the CSV/Excel file has the correct columns. It is not concerned with details of the data like does the book have a title. Does the user have at least one account, etc.
This is the third step in any ETL process. The canonical transform is actually composed of 2 pieces. The canonical domain and the data mapping process.
The canonical form is a set of simple objects that all data formats can be mapped to. Often times the canonical will take the form of target database tables or the most complicated or standardized input data format. Industry data standards, like OMOP in healthcare, also represent canonical data models.
The data mapping process is the actual process of converting data from the input formats into the canonical. For example, taking a book title from a CSV file and putting it in the title field of the Book canonical object.
This step is important to the process because it allows for downstream processing to become reusable and easy to maintain.
The fourth step uses the canonical from step 3 and applies detailed validation rules against the data. These rules are important to keep database integrity. These could be simple validations such as all books must have a title or more complex rules such as that a book ownership shares total to 100%. Complex business rules usually focus on data relationships or aggregated data.
Business Rules may also focus on rules that are important to the organization. Data standards are often designed to be flexible across many use cases, however, your organization may only care about a particular aspect of the data standard. Business rules are how the system will filter out the needed data and the unneeded data.
The identification step is the fifth step in the ETL process. The identification step is how the system will reconcile new data with previously ingested data. This process will involve key identifiers to compare the data against our known data set (database, index, NO-SQL, external system, etc.). If a match (the data was identified) was found then this step may resolve how data is merged. This means if the data is known and there is a difference between the data store and incoming data how is it reconciled. Some uses case will require the data merging to happen in the persistence layer. This decision will ultimately be based on the business rules and the system’s requirements.
Depending on the needs of the business and the type of data being ingested, this step may need to be skipped. Also, in some use cases it is important to identify the data before running the business rules as the business rules may vary based on if the data is known (identified) or new (unidentified).
The identification step is discussed in-depth in part 5 of this series.
This is the sixth step in the data ingest process. It focuses on storing the data in our datastores for downstream use or for future data identification
Persistence is discussed in-depth in part 5 of this series.
The audit, governance, and work queue components are not steps in the overall ingest process but are important, but optional, components of an ETL system. These are the ‘kitchen sink’ features found in most COTS products. The way the data ingest pipelines are used and managed will ultimately determine if any of these 3 components need to be implemented.
The audit component is multi-faceted and covers many areas important to:
- Data provenance: Knowing the source of data stored in the data store
Business intelligence: This is buzz term for analytic about the pipeline and the data streaming through. It answers questions such as
- How much data is being ingested from a given source?
- How long are each of these steps taking to process?
- Where in process does data fail?
- Where are our bottlenecks?
- How many records are being created from each submission?
Decision Audit: Why decisions about the data were made.
- Why a record was identified?
- Why a data file did not pass a business rule?
- Why was a record persistence?
Governance covers who can access, run, and report on our data ingest pipeline? And who can change our data pipeline?
The work queue component is not a component that is often seen on many ingest system, but given the type of data being ingested it can greatly improve data quality, data transparency, and data throughput.
The Audit, Governance, and work queue pieces will be discussed in part 6 of this series.
The importance of data acquisition in a system is undeniably important and the quality of the data, ultimately determines the quality of the system. With these 6 basic components just about any data ingest pipeline can be designed and implemented with an assurance of data quality and pipeline reusability. In the next segment of this series we will be discussing the Submission Information Package