Extract Transform and Load: A Best Practices Guide (3 Part Series)
The first two major pieces of an Extract Transform and Load pipeline are the Submission Information Package (SIP) and High-level validation. The major components, as well as the benefits and usages of Extract Transform and Load, were discussed in the first article in the series. It is recommended that the reader have familiarization with the overall Extract Transform and Load process before continuing on. Data is important, and in this article, the subject is Data.
Submission Information Package
The Submission Information Package, or SIP for short, is the actually data submitted from the data partner to the data ingest system. The term SIP is taken form the Open Archival Information Standard (OAIS) which defines it as
Submission Information Package (SIP): An Information Package that is delivered by the Producer to the OAIS for use in the construction or update of one or more AIPs and/or the associated Descriptive Information.
This is the saying the data provided to the system for use of creating one or more records. For the purpose of this discussion, we will expand that definition and break into two pieces Raw Data and Data End Points
Raw data is the data files provided to the system from a data partner to the ingest system. The data can be a single data file, multiple data files, or a package, which often contains the data files and a metadata or descriptor file. In an OAIS system, a SIP is usually
- a bunch of files of various formats,
- A descriptor file for each file
- A manifest of all the files in the package
- An authentication file or digital signature
In most ingest systems the data will just be one or more predefined data files.
The questions to ask when it comes to raw data are
- Are there any particular standards used in the industry (Industry SOPS, Schemas, XSD, format best practices)?
- What data types is the system ingesting (Excel, XML, JSON)
- Do we need to create a standard? Or do we need to simplified an existing standard?
- Who will be submitting the data? Will it be from one partner or can multiple partners submit the same data types.
Are there any particular standards used in the industry?
Typically, the first question “Are there any particular standards used in the industry?” is the best place to start. In a larger industry, it will often be the case that a data transfer standard already exists. This should be the starting point for this discussion. The best way to find this out is to google, ask industry experts, and look at digital retail sources for the data.
Data types are often easily determined from answering the industry standard question. Typically, XML is used, but it is the case that CSV and fixed-width file formats will be used. JSON is gaining popularity, but in industries where data quality is highly importing this format is avoid, due to its free form nature. However, tools such as RelaxNG are available to set up a data structure standard for JSON. The adoption of such tools is limited and resources for use in certain programming languages may not be available or incomplete.
Do we need to create a standard or a simplify standard?
The question “do we need to create a standard or a simplify standard?” often comes from one of two situations
- No standards currently exist
- A standard exist, but it's
- overly complicated
- it no longer fits the industry's needs
- or less sophisticated data partners want to participate
Creating a new standard requires an understanding of the domain and facilitated communication between the data partners. Best practices when creating a new standard could fill a book and such is out of scope for this article.
The second situation is when a standard exists, but it's either overly complicated, it no longer fits the industry needs, or the ingest system is trying to allow smaller data partners to contribute. As is the case with the first situation care needs to be taken when creating a new standard. However, when building a system that is trying to allow small data partners to contribute, it is often the case that some data partners do not have the technology sophistication to implement a large standard. Often the solution is to create a UI to allow partners to manually enter data, or create a CSV/Excel template to allow partners to bulk upload data. When simplifying a format, it is important the format covers or make mandatory the basic required data of the larger formats.
Who will be submitting the data?
The last question, “who will be submitting the data?” will dictate the above standards question as well as the next section Data Endpoints.
Data endpoints are how data partners send the data. There are two basic forms of data endpoints, push and pull.
Note: Data retrieval by a data partner (the system sends responses) would also be considered a data endpoint, but is outside the scope of discussion for this article.
A push endpoint is where a data partner initiates the requests and sends the data to the system. Typically, these types of endpoints are web services and UI inputs. There are many articles about REST, web services, and UI design and best practice, so this article will not touch on these. However, it is recommended that the endpoints are just that endpoints, they should not dictate a whole new pipeline. Care should be taken to tie these endpoints into the data pipeline and reuse as many of the ETL components as possible. Typically this means mapping the data into a canonical. Business logic/rules are where data quality is scrutinized and keep the logic in one component makes the system easy to maintain and ensure data quality. If a UI
If a UI is offered to data partners, particularly if a CSV upload component is available it is often good to allow data partners to upload XML or JSON files via the UI. In addition, having a path to allow partners to validate the data without it being persisted to the database is useful when on-boarding new data partners.
A pull endpoint is where the ingest system reaches out to various data partners to ask for data. These pull jobs need to be scheduled and execute by the system and periodically poll the partnering system. Examples of this are when the ingest system subscribes to a data feed such as RSS or ATOM or data scrapes from a data source. Note: most data scraping endpoints will not require a formal ingest pipeline as data is often unofficial and may not be the best quality.
The one hybrid approach is SFTP or a shared directory. This requires that both the data partner and the ingest system to do pulls and pushes. In this case, the data partner will push data to a shared location. Then the ingest system will poll the shared directory for new data to ingest.
When pulling data from a shared location best practices are extremely important. things to consider are
- Sub-directories: These are used to break up data partner submissions in logical chunks. A best practice is to use a unique identifier or a timestamp as the subdirectory's name. The pulling system does want to scan old data every time.
- Batch complete files. The data partner needs a way to tell the pulling system that a data batch is ready for ingest. If the pulling system starts to scan when the pushing is still transferring the data it could result in a bad/incomplete batch. Often a blank file with a naming convention like 'batch.complete' is used. Be sure that the data partner knows to create the batch.complete file only after all data has been fully transferred.
- Data retention policies: Over the course of time a shared directory can get large, causing performance problems for the pulling system or maintenance costs for directories host. The question of what should be done with or how long should data be stored on the directory after it is ingested. Often times the ingesting system will delete or move the data to another location after ingest is successfully completed. Note: successfully complete can contain pipeline errors if the data is invalid. But if an unknown system error occurs, such as, the database is down, then data shouldn't be removed.
High-level validation deals strictly with the questions
- Is the data from a trusted source (i.e. does the system know the submitter)?
- Is the data in a format that is known?
Is the data from a trusted source?
Determining if the data is from a trusted is typically straight forward and solutions are already well known. Examples of solutions to this question are
- Access Control Lists (ACLs)
- API Keys
- User login credentials
- FTP credentials
- SHA keys
In more secure cases digital signatures and certificates can be used to ensure the data wasn't tempered with and chain of custody is maintained.
Is the data in a format that is known?
In this case we are talking about High-level validation, which does not care about business rules or fine-grained validation such as does a book include a title and an author. However, some data specifications, such as XSD, will have lower level validations included.
Some examples of high-level validations for various data formats are
- Is the data well-formed?
- Does the XSD validate?
- It should be pointed out that not all XSD validation packages are created equal. If the XSD spec is from a later version some XSD libraries in Python and Java may not catch all the rules. Note: Most of the Java libraries will validate the spec. However, most python XML processing libraries are tied to XMLLINT, which does not handle more of the complicated validation rules.
- Do all the expected header columns exist?
- Does the column header text match the expect column header text?
- Are the columns in order?
- Is there at least one row of data?
- Does the file extension match the expect extension?
- This one is really easy to spoof. However, when dealing with CSV or Excel most users are not sophisticated enough to change the file extension, since they are using a spreadsheet application. This check ensures they didn’t accidentally upload the wrong file.
- Tools exist to inspect and determine file format regardless of the file extension, however the footprint of these tools are often large and considered overkill for most data pipelines.
When working with Extract Transform and Load systems we are focusing on a system that ingesting data. That means data is the most important part. Considering the data first can help give clarity to the downstream components. Be sure to answer the questions
- What kind of will the system be ingesting?
- Who will provide this data?
- How will the data partners provide this data?
- How will system verify the data is in a format that the system understands.
For more information about Extract Transform and Load, be sure to read Extract Transform and Load Overview article. Also be on the look out for part 3 of this series Canonical Transformation.