Eddiesegal

Posted on Jun 12, 2019

5 Best Practices for Setting Up Your Data Warehouse in the Cloud

#cloud #datawarehouse #bigdata

Maintaining a data warehouse on-premise can be an expensive and complex endeavor. Initial capital costs can amount to millions of dollars for hardware and software, the time and effort spent on designing the architecture, devising a plan, and implementing the data warehouse solution.

While big enterprises can sustain the expenses, small and medium organizations usually aren’t able to create and maintain a data warehouse on-premise. Fortunately, the advent of cloud computing has made it possible to host data warehouses in the cloud, eliminating the costs associated with the on-premise model.

Data Warehousing in the Cloud

Given the cost and time implications, some organizations are moving their data warehouses to a cloud-based solution. This helps in decreased costs and an improved level of performance. Perhaps the most significant benefit of the cloud-based solution is the flexibility and scalability that it offers as well as the possibility of fast data growth when it comes to storage or processing capacity.

A cloud-based solution comes packed with lots of tools, technologies, and managed services that help in decreasing the overhead costs and complexity associated with maintaining on-premise data warehouses. Cloud-based data warehouses clear up your time and lower your costs, so you might focus your resources on other aspects of your business.
Before setting up your cloud-based data warehouse, be sure to create a cloud migration strategy that keeps in line with the industry standards. Below, you’ll find a review of five fundamental best practices for setting up a data warehouse in the cloud.

1. Understanding the Migration Challenges

When you migrate to a cloud-based Enterprise Data Warehouse (EDW) platform, you need to consider critical migration and design implications. To ensure that downtime is kept at a minimum, design a seamless integration strategy for any EDW functionalities you’re migrating to the cloud. This way you’ll migrate data to the new environment without interrupting any remaining on-premise processes.

Below, you’ll find a review of the main aspects every cloud migration strategy should cover.

Migrating Data to the Cloud

Migrating large volumes of data from one repository to another can often be time-consuming. When attempting to migrate an EDW, it is critical to accurately define data sets and volumes early in the process. This technique enables the optimal connectivity for data movement. An accurate project schedule for data migration time frames is also important.

Data Integration and Access

To enable the flow of data from on-premise repositories to the cloud requires an ETL process. Different ETL tools need to be validated to ensure proper cloud operation and support as well as the features required for integration with cloud-native EDW technologies. Be sure to cover these aspects in your migration strategy.

Data Transit Costs

The costs of migrating data can quickly add up if the movement of data is not efficient. As cloud service providers offer cost-effective data storage prices, you don’t want to waste this opportunity by having to migrate again. It is also a good idea to maximize data compression before transmitting.

Developer Experience

Cloud-based EDWs come with a certain amount of flexibility and capabilities that are usually not present in on-premise solutions. These can often prove to be challenging to developers who are required to keep pace with new features and constantly learn the new functionalities added by the cloud service provider.

Look for a service provider that offers user-friendly documentation and training opportunities. A robust knowledge base that contains examples, training, and related resources for the adoption of a cloud EDW can help ease up the tasks ahead.

While it is true that migration of any IT service can be a daunting prospect, EDWs pose an added risk because it may interrupt business continuity. When migrating to a cloud EDW, some key factors need to be analyzed and planned before migration:

Use of proprietary features in an on-premise EDW implementation - all use cases of proprietary features need to be adequately examined before migration to determine the best way to provide similar functionality in the cloud.
Data access and developer experience - developers need to have the option to use transfer their on-premise environments to their cloud-based EDW. This helps in inhibiting productivity loss when new workloads are deployed in the cloud or when integration between platforms is undertaken.
Cost of query - certain cloud services charge on the basis of the amount of data being queried. Though this proves to be advantageous with many workloads, it needs to be communicated to developers, in advance, to make sure tasks are performed efficiently. The cost of adding new reporting capabilities should be fully understood prior to the migration.

2. Use the Right Migration Strategy

During the migration process, data will be relocated from an existing solution to the cloud. Different migration strategies enable different data transfer, including migrating only a part of the data pipeline to the cloud.

Instead of locally maintaining a file server, you can directly inject unstructured data in a CSV or Excel format to object storage services like Amazon S3. This data is then transformed and processed via a data pipeline within the cloud for improved performance at a lower cost and management overhead.

Additionally, based on the volume of data and operational needs, the migration strategy may need to move data to the cloud warehouse. In cases in which the data volume is low and does not require continuous operation, you can set up a one-step migration system to extract and import the relevant data to the cloud-based data warehouse.

That said, if you are working with large data volumes that have continuous operational needs, such an approach is rendered impractical - due to the sheer amount of data migrated, and the fresh data that is added at each migration time frame.

To solve this issue you can break the process down into two steps. At the first stage, it is critical to extract the data during non-peak times so that the impact is minimal. Next, migrate the data to the cloud via a method that matches the format of the extracted data volume.

3. Create Snapshots and Backups On A Continuous Basis

Most of the cloud service providers replicate all your data within your data warehouse cluster, and all your data is usually backed up into an object storage service. Regardless of that, enterprises should give considerable thoughts to creating nightly backups at distributed locations, to prevent losing data due to negative events. Negative events can be anything from a cyber attack, to equipment failures, and natural disasters.

Redshift, by default, attempts to maintain at least three copies of your data - the original, a replica on the compute nodes, and a backup in Amazon S3. If you want to further customize the number of replicas or automate the snapshot creation process, you can use an AWS disaster recovery service offered by third-party vendors. The default retention period is 1 and you can configure this to be as long as 35 days.

4. Use ELT Over ETL for Batch Processing

There is an opportunity to optimize the data pipeline when the data warehouse solution is being set up in the cloud. It is a common practice to use Extract Transform and Load (ETL) when moving unstructured or semi-structured data to the data warehouse. ETL is mostly a continuous process with the extracted data being cleansed, transformed, enriched and loaded into a warehouse.

When it comes to data warehousing solutions in the cloud like Amazon Redshift, you can migrate extracted data directly to the warehouse - where it is efficient enough to perform data transformation within the warehouse itself. This modifies the order of the data pipeline to Extract Load and Transform (ELT).

5. Make Use of Managed Services for Warehousing Workflows

Organizations have the option of using the cloud as an Infrastructure as a ServiceIaaS (IaaS). They can install the software for data warehousing and implement ETL processes with cloud servers. However, this provides limited value when moving to a cloud service.

Key advantages of cloud services are the fully-managed or semi-managed data warehousing service as well as ETL / ELT solutions. This helps lower the overall cost of ownership for the organization. For example, Amazon Redshift provides customers with a petabyte-scale solution for data warehousing. This allows both data storage as well as querying for analytics.

Conclusion

Migrating your data warehouse to the cloud doesn’t have to be an expensive endeavor. With detailed planning and the right strategy, you can design a migration process that fits the scale of the project and your organization. Once you’re all set up in the cloud, you’ll be able to eliminate the costs and complexity associated with maintaining an on-premise data warehouse.

DEV Community