<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Limor Wanstein</title>
    <description>The latest articles on DEV Community by Limor Wanstein (@limiw).</description>
    <link>https://dev.to/limiw</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F51534%2Fc8a6d6ff-fcef-483d-a413-ca3b6e98646b.jpg</url>
      <title>DEV Community: Limor Wanstein</title>
      <link>https://dev.to/limiw</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/limiw"/>
    <language>en</language>
    <item>
      <title>Why Open Source is Invaluable to the Development and Testing Community</title>
      <dc:creator>Limor Wanstein</dc:creator>
      <pubDate>Wed, 05 Sep 2018 12:57:40 +0000</pubDate>
      <link>https://dev.to/limiw/why-open-source-is-invaluable-to-the-development-and-testing-community-4fhd</link>
      <guid>https://dev.to/limiw/why-open-source-is-invaluable-to-the-development-and-testing-community-4fhd</guid>
      <description>&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--vRuiDOfl--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/oroc6af0cty2ugzikk4t.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--vRuiDOfl--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://thepracticaldev.s3.amazonaws.com/i/oroc6af0cty2ugzikk4t.jpg" alt="open source"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Source: &lt;a href="https://pixabay.com/en/laptop-code-programming-computer-2557468/"&gt;Pixabay&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In the software world today, many teams are adopting the culture of DevOps and DevSecOps. The modus operandi of Dev(Sec)Ops is to perform continuous testing as early as possible in the software development lifecycle, before code reaches production – this is the concept of “shifting left.” Shifting left can be realized when teams write code using an approach known as Test-Driven Development, where the test cases are written (in code) before the actual feature. Writing the tests first helps define the scope that the feature will eventually have to fit into, and gives developers a target to aim for rather than deciding on the fly what the feature should actually be. The developers, in this case, are also the testers, and may even be the authors of the test cases. There is no limit to the scope of the test cases one can write – they can cover the application’s security, features, error handling, performance, and more. &lt;/p&gt;
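
A minimal sketch of the test-first idea (the &lt;code&gt;slugify&lt;/code&gt; function and its expected behavior are invented for this illustration): the test is written before the implementation and defines the target the code must hit.

```shell
#!/bin/sh
# Test written first: it fails until slugify exists and behaves as specified.
test_slugify() {
  result=$(slugify "Hello World")
  [ "$result" = "hello-world" ] || { echo "FAIL: got '$result'"; return 1; }
  echo "PASS"
}

# Implementation written afterward, just enough to satisfy the test.
slugify() {
  printf '%s' "$1" | tr '[:upper:]' '[:lower:]' | tr ' ' '-'
}

test_slugify
```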

&lt;p&gt;The purpose of performing tests is to collect quality measurements of the code, so that when the code meets a certain quality bar, it moves on to the next stage. These measurements should be collected and presented in a way that makes sense to both developers and managers, so that code quality can be determined at a quick glance. And to perform tests at scale, automation is key: a repeatable test process, coupled with a tool that integrates easily into the current ecosystem, is often preferred.&lt;/p&gt;

&lt;h1&gt;
  
  
  Security Advantages of Open Source in Testing
&lt;/h1&gt;

&lt;p&gt;Leveraging open source knowledge and tools to achieve the goals of development and testing is clearly advantageous for developers. Open source pools the knowledge of many people with vastly different experiences toward a common objective. That knowledge is usually co-created by developers, for developers, precisely because no prior definition or knowledge base existed on a certain topic. What open source provides is a channel to report and to contribute, and security is a topic where this model makes particular sense: when someone reports a zero-day vulnerability, that knowledge is valuable to the whole community and in turn drives efforts to find a fix as soon as possible, improving the quality of the project.&lt;/p&gt;

&lt;h1&gt;
  
  
  Role Models of Open Source Security
&lt;/h1&gt;

&lt;p&gt;The RSA algorithm is an example of openly developed security: it was publicly derived through the research and scrutiny of many mathematicians and computer scientists, and it has since become an industry standard, implemented widely in many systems as a go-to encryption algorithm.&lt;/p&gt;

&lt;p&gt;The Common Vulnerabilities and Exposures (CVE) list and Exploit-DB are open knowledge bases that security specialists follow to keep abreast of the latest security data. These &lt;a href="https://resources.whitesourcesoftware.com/blog-whitesource/open-source-vulnerability-management"&gt;open source security vulnerabilities&lt;/a&gt; and the intelligence around them are the bread and butter of security professionals. What is even more valuable today is what has been built on top of these databases: security services, complete with tooling, that feed on existing security databases so that vulnerability detection and remediation can be as quick and seamless as possible.&lt;/p&gt;

&lt;h1&gt;
  
  
  Open Source Testing Tools
&lt;/h1&gt;

&lt;p&gt;In the Dev(Sec)Ops context, developers are encouraged or tasked to test both the application and the infrastructure it runs on, so testing can be grouped broadly into these two categories.&lt;/p&gt;

&lt;p&gt;This (horizontal) flow diagram helps visualize some of the open source tools used in a CI/CD pipeline, focusing on application security testing across the software lifecycle. A tool like OWASP ZAP can also test the dynamic behavior of a web application during or after deployment, benchmarking it against the &lt;a href="https://www.owasp.org/index.php/Category:OWASP_Top_Ten_Project"&gt;OWASP Top Ten Most Critical Web Application Security Risks&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;This (vertical) diagram illustrates testing of the infrastructure where a testing tool “punches through” the full stack.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/Netflix/SimianArmy"&gt;Netflix’s Simian Army&lt;/a&gt; penetrates through and tears down the whole infrastructure stack and tests for outcomes such as high availability and resiliency of the infrastructure. This army has since inspired many derivatives of monkeys to test for different infrastructure resiliency like the WazMonkey, KubeMonkey, and MonkeyOps.&lt;/p&gt;

&lt;h1&gt;
  
  
  Mapping Open Source Tools to DevOps Practices and Values
&lt;/h1&gt;

&lt;p&gt;Open source tools are free to use and have a low financial barrier to entry. They also tend to plug into other existing products well and to focus on doing one thing well – as opposed to enterprise products that try to do a little of everything. Developers and testers can “plug and play” to try out different tools, which aligns with DevOps’ Third Way of continual experimentation and learning: testers get their hands dirty with a tool and learn whether it fits their objectives, looking for small, measurable wins. If a tool does not suit their use case, the transparent nature of the source code makes it modifiable for each tester’s own needs.&lt;/p&gt;

&lt;p&gt;At the same time, open source projects rely on developers to give feedback and contribute organically, which is essential for the continuous improvement and growth of projects in a two-way relationship. With the vast variety of knowledge and tools in the open source space, developers and testers are spoilt for choice, and open source projects will only continue to grow in number as developers and testers build upon existing projects to achieve greater goals.&lt;/p&gt;

</description>
      <category>opensource</category>
    </item>
    <item>
      <title>5 Best Practices for Setting up a Cloud Data Warehouse</title>
      <dc:creator>Limor Wanstein</dc:creator>
      <pubDate>Wed, 23 May 2018 16:30:06 +0000</pubDate>
      <link>https://dev.to/limiw/5-best-practices-for-setting-up-a-cloud-data-warehouse-1jl0</link>
      <guid>https://dev.to/limiw/5-best-practices-for-setting-up-a-cloud-data-warehouse-1jl0</guid>
      <description>&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Frih18atrj4mrsz7gpflk.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Frih18atrj4mrsz7gpflk.jpg" alt="Cloud data warehouse"&gt;&lt;/a&gt;&lt;br&gt;
Credit: &lt;a href="https://pixabay.com/en/cloud-finger-tablet-memory-store-3362004/" rel="noopener noreferrer"&gt;Pixabay&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;A data warehouse is a central repository of integrated data from disparate sources which has been optimized for analytics and reporting. A cloud data warehouse takes a new approach by offering data warehouse functionalities as a service instead of in a traditional on-premise data center. &lt;/p&gt;

&lt;p&gt;Adopting the data warehouse as a service (DWaaS) approach offers several benefits, including: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Lower-cost data warehousing, because it avoids the high expense of setting up and maintaining an on-premise data center. &lt;/li&gt;
&lt;li&gt;More time to focus on the value derived from data instead of the intricacies of managing a data center.&lt;/li&gt;
&lt;li&gt;Effortless scalability, with the ability to easily adapt to increased (or decreased) data volumes and workloads. &lt;/li&gt;
&lt;li&gt;Greater accessibility, opening data warehousing to companies that might not have the resources to build their own on-premise data center. &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The transition to a cloud data warehouse can prove difficult for enterprises. Whether your company is moving from a legacy on-premise setup or the cloud represents your first foray into data warehousing, these five best practices will help you get it right from the start. &lt;/p&gt;

&lt;h2&gt;
  
  
  Understand Cloud Data Warehouse Architecture
&lt;/h2&gt;

&lt;p&gt;Each cloud data warehouse service provider has its own unique architecture; however, there are some common architectural trends you’ll need to become familiar with before choosing a vendor. &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="http://www.ittoday.info/AIMS/Information_Management/3-02-45.pdf" rel="noopener noreferrer"&gt;Massively parallel processing&lt;/a&gt;: several leading cloud data warehouse vendors use a massively parallel processing architecture, which partitions data across many servers or nodes, each with its own processor.  The MPP design coordinates query workloads over all processors simultaneously for much faster data querying.&lt;/li&gt;
&lt;li&gt;Columnar storage: in a columnar database, tables are stored by their column values rather than by row. This type of design leads to better performance by reducing the time needed to answer queries. &lt;/li&gt;
&lt;li&gt;Serverless: some cloud data warehouse vendors have a serverless architecture wherein you don’t need to provision or manage any infrastructure. &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://panoply.io/data-warehouse-guide/data-warehouse-architecture-traditional-vs-cloud/" rel="noopener noreferrer"&gt;Panoply has a good resource&lt;/a&gt; on cloud data warehouse architecture compared to traditional on-premise data warehouse architecture.&lt;/p&gt;

&lt;h2&gt;
  
  
  Satisfy Security &amp;amp; Compliance Requirements
&lt;/h2&gt;

&lt;p&gt;Security and compliance concerns are significant stumbling blocks for enterprises considering a cloud data warehouse service, and rightly so. It is no surprise that decision-makers worry about the risk of information security violations in the cloud: entrusting data to the hands of a vendor can create fear due to the loss of direct control. That fear is further compounded when you consider that the data warehouse is a core component of business intelligence. &lt;/p&gt;

&lt;p&gt;It’s imperative to perform due diligence and ensure potential &lt;a href="https://www.datamation.com/big-data/top-data-warehouse-tools.html" rel="noopener noreferrer"&gt;data warehouse cloud vendors&lt;/a&gt; can satisfy all your security and data protection requirements. Important things to look out for include: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Strong user authentication policies&lt;/li&gt;
&lt;li&gt;Data encryption&lt;/li&gt;
&lt;li&gt;Compliance with relevant industry regulations, for example, &lt;a href="https://aws.amazon.com/compliance/hipaa-compliance/" rel="noopener noreferrer"&gt;HIPAA for protected health information&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Detailed documentation highlighting the vendor’s policies for data security, protection, archiving, and replication on virtualized environments and the underlying hardware&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Conduct a Phased Migration
&lt;/h2&gt;

&lt;p&gt;Don’t go all-in from the get-go: a phased approach, in which you first migrate the most straightforward, least technical workloads to the cloud-based data warehouse, makes the most sense. Variable workloads, such as financial reporting activities, are good candidates for initial migration because they do not continuously use data warehouse resources. Business-unit-specific workloads that form part of a departmental data mart are also good candidates for initial migration. &lt;/p&gt;

&lt;p&gt;Sustained workloads such as the daily reporting that drives tactical business decisions can be migrated later, as can business-critical auditable workloads. Doing things incrementally can build confidence and trust in the cloud data warehouse, and the experience gained will be invaluable for migrating more complex workloads and use cases. &lt;/p&gt;

&lt;h2&gt;
  
  
  Validate ETL Tools
&lt;/h2&gt;

&lt;p&gt;Traditionally, ETL (Extract, Transform, Load) tools are used to move and integrate data from transactional systems into a data warehouse. The data is consolidated and transformed before it ends up in the warehouse. It’s important for enterprises to ensure existing ETL tools and data flows are validated for a cloud implementation, with support for cloud-native technologies.&lt;/p&gt;

&lt;p&gt;For enterprises making their first move into data warehousing, it might be best to look for native cloud data integration tools (&lt;a href="https://aws.amazon.com/glue/" rel="noopener noreferrer"&gt;ETL as a service&lt;/a&gt;), which have been built to service the challenge of integrating data from transactional on-premise database systems into a cloud repository. &lt;/p&gt;

&lt;p&gt;Increasingly, enterprises are using an &lt;a href="https://searchdatamanagement.techtarget.com/definition/Extract-Load-Transform-ELT" rel="noopener noreferrer"&gt;ELT approach&lt;/a&gt; in which raw data goes straight from source systems to the cloud data warehouse, where it is transformed as-needed for particular business use cases. The power of cloud-based infrastructure makes ELT more feasible.&lt;/p&gt;
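
A minimal sketch of the ELT pattern with Amazon Redshift (the bucket, IAM role, and table names here are hypothetical): the raw data is staged and loaded first, and the transformation happens afterward, inside the warehouse, with SQL.

```shell
# Stage the raw extract in object storage (no upfront transformation).
aws s3 cp events.csv s3://my-raw-bucket/events.csv

# Load it into the warehouse as-is...
psql "$REDSHIFT_URL" -c "COPY raw.events \
  FROM 's3://my-raw-bucket/events.csv' \
  IAM_ROLE 'arn:aws:iam::123456789012:role/redshift-load' CSV"

# ...then transform on demand, inside the warehouse (the T happens last).
psql "$REDSHIFT_URL" -c "CREATE TABLE analytics.daily_events AS \
  SELECT event_date, COUNT(*) AS events FROM raw.events GROUP BY event_date"
```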

&lt;h2&gt;
  
  
  Ensure Cost Transparency
&lt;/h2&gt;

&lt;p&gt;Even though the cloud is often a cheaper option for data warehousing, it’s still vital to have transparency into the potential costs incurred by data warehouse activities. Different vendors use different cost models, with some charging for the amount of data queried. Most vendors also charge for data egress (moving data out of the public cloud back to on-premise systems). &lt;/p&gt;

&lt;p&gt;It’s a good idea in all cases to anticipate costs and communicate to relevant business users the importance of being efficient with the queries they perform on data residing in the data warehouse. Visibility on other costs is also important, including data backup and disaster recovery expenses. &lt;/p&gt;

&lt;h2&gt;
  
  
  Wrap Up
&lt;/h2&gt;

&lt;p&gt;These best practices will help guide you in choosing the right service provider and getting set up in a cloud data warehouse. Once you have decided to make the transition, make sure your company builds a strong rapport with your chosen vendor, with a focus on responsiveness, performance, and engagement. &lt;/p&gt;

</description>
      <category>clouddatawarehouse</category>
      <category>datawarehousebasics</category>
    </item>
    <item>
      <title>Docker Swarm Concepts, Tips, and Tricks for a Docker Beginner</title>
      <dc:creator>Limor Wanstein</dc:creator>
      <pubDate>Wed, 10 Jan 2018 19:06:05 +0000</pubDate>
      <link>https://dev.to/limiw/docker-swarm-concepts-tips-and-tricks-for-a-docker-beginner-4j4h</link>
      <guid>https://dev.to/limiw/docker-swarm-concepts-tips-and-tricks-for-a-docker-beginner-4j4h</guid>
      <description>&lt;p&gt;&lt;a href="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9lx8izvhovoj84grzhc8.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F9lx8izvhovoj84grzhc8.png" alt="Docker Swarm" width="512" height="426"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Docker Swarm is a container orchestrator, meaning it allows you to manage a cluster of Docker Engines. The typical way to manage a single container on one host system is to use the Docker Command Line Interface (CLI). &lt;/p&gt;

&lt;p&gt;However, managing one container on one machine is a pretty limited use of containerized applications. When you want to manage multiple containers deployed on multiple hosts, the CLI falls short, and it’s necessary to use a dedicated orchestration tool like Docker Swarm. &lt;/p&gt;

&lt;p&gt;Docker Swarm facilitates multi-container workloads deployed on a cluster of machines, thus extending the capabilities of Docker containers. More specifically, Docker Swarm’s use cases include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Allocating tasks to container groups&lt;/li&gt;
&lt;li&gt;Managing the lifecycle of individual containers&lt;/li&gt;
&lt;li&gt;Scaling a cluster of containers up or down depending on workload&lt;/li&gt;
&lt;li&gt;Providing failover in the event a node goes offline&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This article outlines some of the main Docker Swarm concepts in addition to detailing some best practices for this orchestration tool. For a more detailed guide, check out this &lt;a href="https://www.aquasec.com/wiki/display/containers/Docker+Swarm+101" rel="noopener noreferrer"&gt;Docker Swarm 101&lt;/a&gt; wiki page.&lt;/p&gt;

&lt;h2&gt;
  
  
  Docker Swarm Concepts
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Docker Swarmkit
&lt;/h3&gt;

&lt;p&gt;Swarmkit is a toolkit used within Docker to provide the functionality for running a cluster and distributing tasks. Docker Swarm mode uses &lt;a href="https://github.com/docker/swarmkit" rel="noopener noreferrer"&gt;Swarmkit&lt;/a&gt; libraries and functionality to simplify and secure container management across multiple hosts.&lt;/p&gt;

&lt;h3&gt;
  
  
  Swarm
&lt;/h3&gt;

&lt;p&gt;A swarm is a group of Docker hosts that run in swarm mode, operating as a single virtual host.&lt;/p&gt;

&lt;h3&gt;
  
  
  Node
&lt;/h3&gt;

&lt;p&gt;A node is an individual Docker Engine that is part of a swarm. You can run many nodes on a single computer, but more typically nodes are distributed across multiple machines. Nodes can act as managers, leaders, or workers.&lt;/p&gt;

&lt;h3&gt;
  
  
  Task
&lt;/h3&gt;

&lt;p&gt;A task is the smallest scheduling unit in a swarm. Each task carries a container and the commands used to run it. &lt;/p&gt;

&lt;h3&gt;
  
  
  Manager Node
&lt;/h3&gt;

&lt;p&gt;This type of node assigns tasks to worker nodes in addition to helping with the management of swarms.&lt;/p&gt;

&lt;h3&gt;
  
  
  Leader Node
&lt;/h3&gt;

&lt;p&gt;One node is elected as a leader node using the &lt;a href="https://raft.github.io/" rel="noopener noreferrer"&gt;Raft consensus&lt;/a&gt; algorithm. The leader node conducts orchestration tasks in the swarm.&lt;/p&gt;

&lt;h3&gt;
  
  
  Worker Node
&lt;/h3&gt;

&lt;p&gt;Worker nodes simply receive and execute tasks designated by the manager nodes. Note that manager nodes function as both managers and workers by default.&lt;/p&gt;

&lt;h2&gt;
  
  
  Docker Swarm Best Practices
&lt;/h2&gt;

&lt;p&gt;Like any good tool, Docker Swarm’s usefulness shines when combined with a set of best practices that make the most of Docker Swarm’s capabilities and ensure the smooth operation of your cluster.&lt;/p&gt;

&lt;h3&gt;
  
  
  Set Up A Staging Platform
&lt;/h3&gt;

&lt;p&gt;It’s important to establish a staging platform that runs as a replica of the production configuration you will use, with a cluster of Docker containers running in swarm mode. This way, you can prove that the container cluster runs in a stable manner and avoid potential production issues. &lt;/p&gt;

&lt;h3&gt;
  
  
  Use Monitoring Tools
&lt;/h3&gt;

&lt;p&gt;It’s imperative to closely monitor your Docker swarms for any potential issues that may lead to failures, including too much memory usage or network overloads. Good monitoring tools for container orchestration should provide the ability to apply custom rules, such as taking automatic actions which shut down nodes in the event of failures.&lt;/p&gt;

&lt;h3&gt;
  
  
  Protect Manager Tokens
&lt;/h3&gt;

&lt;p&gt;You create a single-node swarm using the &lt;code&gt;docker swarm init&lt;/code&gt; command, and that node automatically becomes the manager node for the swarm. The output of &lt;code&gt;docker swarm init&lt;/code&gt; displays two types of tokens for adding more nodes: join tokens for workers and join tokens for managers. It’s important to safeguard the manager token, because once a node joins as a manager, it can control the entire swarm. &lt;/p&gt;
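
As a sketch (the advertise address is a placeholder), initializing a swarm and retrieving both join tokens looks like this:

```shell
# Initialize a single-node swarm; this node becomes the first manager.
docker swarm init --advertise-addr 192.0.2.10

# Print the join command (including the token) for each role.
# Treat the manager token as a secret: anyone holding it can add a
# manager node that controls the whole swarm.
docker swarm join-token worker
docker swarm join-token manager
```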

&lt;h3&gt;
  
  
  Replace Unavailable Nodes
&lt;/h3&gt;

&lt;p&gt;Sometimes a node becomes unavailable, meaning it’s a manager node but it cannot communicate with other manager nodes. It’s important to replace these nodes either by promoting an existing worker node to a manager, or adding a new manager node. &lt;/p&gt;

&lt;p&gt;Either way, regularly run the &lt;code&gt;docker node ls&lt;/code&gt; command from a manager node and look closely at the MANAGER STATUS column for nodes listed as Unreachable. Worker nodes are easy to identify because this column is blank for them. &lt;/p&gt;
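
A sketch of the replacement flow (the node names here are hypothetical):

```shell
# From any healthy manager: inspect the MANAGER STATUS column.
docker node ls

# Promote an existing worker so the swarm regains a manager...
docker node promote worker-2

# ...then remove the failed manager. Demote it first if it is still
# reachable; otherwise remove it forcibly with --force.
docker node demote old-manager
docker node rm old-manager
```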

&lt;h3&gt;
  
  
  Make Use of Service Configurations
&lt;/h3&gt;

&lt;p&gt;After you create a service for your containers to run, you can configure further settings for it using the &lt;code&gt;docker service update&lt;/code&gt; command. Specific options include setting the runtime environment for the service, reserving memory or CPU, and choosing how the service should update. Note that in Docker swarm mode a service is either global or replicated: a global service runs a single task on every node in the swarm (such as an anti-virus agent), while a replicated service runs a specified number of identical tasks (such as three replicas of an Apache web server).&lt;/p&gt;
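
A short sketch of the two service modes and a follow-up update (the service names, images, and values are arbitrary examples):

```shell
# Replicated service: three identical Apache tasks spread across the swarm.
docker service create --name web --replicas 3 -p 80:80 httpd

# Global service: exactly one task on every node, e.g. a monitoring agent.
docker service create --name agent --mode global alpine top

# Tune an existing service: add an environment variable, reserve memory.
docker service update --env-add LOG_LEVEL=debug --reserve-memory 256M web
```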

&lt;h3&gt;
  
  
  Maintain An Odd Number of Manager Nodes
&lt;/h3&gt;

&lt;p&gt;To better leverage the Raft consensus algorithm’s fault tolerance for your swarm, try to maintain an odd number of manager nodes at all times. An odd number of managers helps ensure that a quorum (a majority of managers) remains available to process requests when manager nodes fail – for example, a swarm with five managers can tolerate the loss of two of them.&lt;/p&gt;

&lt;h3&gt;
  
  
  Forcibly Remove Compromised Nodes
&lt;/h3&gt;

&lt;p&gt;If a node is compromised in a cyber attack, it’s useful to be able to remove it from the swarm without hesitation. In Docker swarm mode you can forcibly remove compromised nodes by using the following command:&lt;br&gt;
&lt;/p&gt;

&lt;p&gt;&lt;code&gt;$ docker node rm --force node-id&lt;/code&gt;&lt;br&gt;
&lt;/p&gt;

&lt;p&gt;Supply the name or ID of the node in question as the final argument to forcibly get rid of a compromised node.&lt;/p&gt;

&lt;h2&gt;
  
  
  Closing Thoughts
&lt;/h2&gt;

&lt;p&gt;Docker Swarm is a useful container orchestration tool that allows for effortless scaling of container clusters and the ability to manage groups of containers deployed across multiple hosts as a single virtual host. &lt;br&gt;
By following the best practices outlined in this article, you can ensure you get the most from Docker swarm mode and avoid potential performance and security issues.&lt;/p&gt;

&lt;p&gt;Image Source: &lt;a href="https://hub.docker.com/_/swarm/" rel="noopener noreferrer"&gt;https://hub.docker.com/_/swarm/&lt;/a&gt;&lt;/p&gt;

</description>
      <category>docker</category>
      <category>swarm</category>
      <category>containers</category>
    </item>
    <item>
      <title>The Cloud Data Warehouse and ETL</title>
      <dc:creator>Limor Wanstein</dc:creator>
      <pubDate>Wed, 27 Dec 2017 16:52:13 +0000</pubDate>
      <link>https://dev.to/limiw/the-cloud-data-warehouse-and-etl-1mje</link>
      <guid>https://dev.to/limiw/the-cloud-data-warehouse-and-etl-1mje</guid>
      <description>&lt;h2&gt;
  
  
  Overview
&lt;/h2&gt;

&lt;p&gt;Do you want to understand how to turn the increasing quantities of data gathered by your business into actionable insights?&lt;/p&gt;

&lt;p&gt;Before getting the important insights you need from BI tools and analytic applications, it's important to first understand a bit about cloud-based data warehouse solutions, which provide the bedrock for performing high-level data analysis and BI in modern enterprises. &lt;/p&gt;

&lt;p&gt;In this post, you'll learn what exactly a cloud-based data warehouse is, how the important process of ETL relates to the data warehouse, and you'll get a comparison of two of the major cloud-based data warehouse solutions currently available—Amazon Redshift and Google BigQuery.&lt;/p&gt;

&lt;p&gt;By the end of this post, you will be much better placed to choose a cloud-based data warehouse solution that provides a solid platform for analyzing your data and getting the insights from that data that can drive better business decisions. &lt;/p&gt;

&lt;h2&gt;
  
  
  What is a Cloud-Based Data Warehouse?
&lt;/h2&gt;

&lt;p&gt;Cloud-based data warehouses are centralized repositories of data that utilize cloud infrastructure. In the cloud, a network of remote servers and computing resources are used to provide a data warehouse, instead of local servers hosted on-premise. Before you dive in further, check out this article by Panoply about &lt;a href="https://learn.panoply.io/the-evolution-of-the-data-warehouse" rel="noopener noreferrer"&gt;the benefits of a data warehouse&lt;/a&gt;, and this &lt;a href="http://searchdatamanagement.techtarget.com/essentialguide/Cloud-data-warehouse-guide-Using-Redshift-rival-platforms" rel="noopener noreferrer"&gt;definition and discussion of Redshift by TechTarget&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Data warehouses are vital for the success of any data analytics effort. Traditional on-premise data warehouses are extremely time-consuming and complex to build, requiring a significant investment in physical resources. &lt;br&gt;
The cloud offers important advantages versus the traditional data warehouse, such as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;There's no need for companies to invest in data warehouse centers or other physical resources hosted on-site&lt;/li&gt;
&lt;li&gt;Cloud data warehouse providers offer effortlessly scalable infrastructure and high performance&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Several companies provide cloud-based data warehouse services. Some popular examples include: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Amazon Redshift&lt;/strong&gt;—this is a fully managed data warehouse service that requires database admins to provision resources just as they would in an on-premise warehouse. The difference, of course, is that no physical resources are needed on-site: you can take advantage of Amazon Web Services' (AWS) vast cloud infrastructure and customize it to your specific needs. Amazon Redshift is by far the most widely used cloud-based data warehouse—&lt;a href="http://blog.panoply.io/the-cloud-is-disrupting-data-warehousing-and-this-2017-survey-proves-it/" rel="noopener noreferrer"&gt;60 percent of attendees at the 2016 Amazon Re:Invent&lt;/a&gt; cloud conference said they use Redshift. &lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Google BigQuery&lt;/strong&gt;—Google BigQuery is a serverless service, meaning it abstracts away the provisioning, assigning, and maintaining of resources from its users. At its core, BigQuery uses a query engine that can pore through billions of rows of data at high speed. &lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Azure SQL Data Warehouse&lt;/strong&gt;—this is Microsoft's cloud-based data warehouse offering. The Azure SQL data warehouse is both scalable and fully managed, making it similar to Amazon Redshift. &lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Snowflake&lt;/strong&gt;—Snowflake is a cloud-based data warehousing service that allows users to store data on Amazon S3 and analyze that data using a query engine.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each cloud-based data warehouse service comes with its own pros and cons. At present, Amazon Redshift and BigQuery are two popular options, offered by two of the world's leading cloud providers, so it makes sense to compare them to understand what you can expect from these services. &lt;/p&gt;

&lt;p&gt;Before comparing Redshift and BigQuery, though, it's important to understand the ETL process and how it relates to data warehousing.&lt;/p&gt;

&lt;h2&gt;
  
  
  What is ETL in the Data Warehouse?
&lt;/h2&gt;

&lt;p&gt;All enterprises that want to analyze historical data within a single data repository such as a data warehouse are faced with a common challenge—how to extract data from disparate sources, get that data into a cleansed, unified format, and load the data into the warehouse for use with BI tools and analytic applications. &lt;/p&gt;

&lt;p&gt;ETL (Extract, Transform, Load) is one of the oldest methods for &lt;strong&gt;getting data from source systems to the data warehouse&lt;/strong&gt;—something that is still required even with cloud-based systems. ETL uses a separate transformation engine outside the data warehouse to get data into an aggregated, structured format ready for warehouse use.&lt;/p&gt;

&lt;p&gt;Some experts argue that the powerful modern cloud-based warehouses negate the need for a separate transformation process—data can be extracted in its raw form from sources and loaded instantly into the warehouse for unparalleled accessibility. Within the warehouse, queries can transform the data as it is needed by analysts. &lt;/p&gt;

&lt;p&gt;The counterargument to ETL being outdated is that manipulating and understanding raw data requires significant knowledge beyond the scope of the typical BI analyst. Many analysts need consolidated data that uses a common structure and is understandable in business terms before they can gain insights from it. &lt;/p&gt;

&lt;p&gt;Next you'll find out the differences and similarities between BigQuery and Redshift, including how to ETL data into these systems, and whether ETL is actually needed in both Redshift and BigQuery. &lt;/p&gt;

&lt;h2&gt;
  
  
  Redshift vs. BigQuery
&lt;/h2&gt;

&lt;p&gt;The following points outline how BigQuery and Redshift compare with each other:&lt;/p&gt;

&lt;h3&gt;
  
  
  Cost
&lt;/h3&gt;

&lt;p&gt;Redshift has a &lt;a href="https://aws.amazon.com/blogs/aws/amazon-redshift-now-faster-and-more-cost-effective-than-ever/" rel="noopener noreferrer"&gt;straightforward cost structure&lt;/a&gt; in which you either pay on-demand as you use cloud instances or get a cheaper rate by reserving instances for up to three years. In BigQuery, you pay for storage at a rate of $0.02/GB per month and for querying at $5 per TB of data processed. &lt;/p&gt;

&lt;p&gt;The separation of querying and storage makes BigQuery's cost structure less predictable and more complex than Redshift's. Furthermore, while BigQuery initially appears cheaper than Redshift, once you start querying the data heavily, it can quickly become more expensive.&lt;/p&gt;
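&lt;p&gt;A rough back-of-the-envelope calculation illustrates this. The figures below are assumptions for illustration: BigQuery's published on-demand rates of $0.02/GB per month for storage and $5 per TB scanned for queries, and a single Redshift dc2.large node at its on-demand rate of $0.25/hour:&lt;/p&gt;

```python
# Illustrative rates only -- check current pricing pages before relying on these.
BQ_STORAGE_PER_GB = 0.02      # BigQuery storage, $/GB per month
BQ_QUERY_PER_TB = 5.0         # BigQuery on-demand queries, $/TB scanned
REDSHIFT_NODE_HOURLY = 0.25   # assumed dc2.large on-demand rate, $/hour

def bigquery_monthly_cost(stored_gb, tb_scanned):
    """Storage plus query charges for one month."""
    return stored_gb * BQ_STORAGE_PER_GB + tb_scanned * BQ_QUERY_PER_TB

def redshift_monthly_cost(nodes=1, hours=730):
    """Flat instance charge: querying costs nothing extra."""
    return nodes * hours * REDSHIFT_NODE_HOURLY

# 1 TB stored, 100 full-table scans in a month:
bq = bigquery_monthly_cost(1024, 100)   # 1024 * 0.02 + 100 * 5 = 520.48
rs = redshift_monthly_cost()            # 730 * 0.25 = 182.50
```

&lt;p&gt;With light querying BigQuery wins on price, but at this query volume the per-scan charges dominate and the flat Redshift instance cost comes out cheaper.&lt;/p&gt;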

&lt;h3&gt;Performance&lt;/h3&gt;

&lt;p&gt;The table below shows how the two services performed when running three different queries against the same dataset. Note that Redshift was optimized for this test with data compression and sort keys, which is more indicative of how it would be used in practice.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fcc3auij0pd3je1odqcp5.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2Fcc3auij0pd3je1odqcp5.png" alt="Redshift vs BigQuery performance"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="http://blog.panoply.io/a-full-comparison-of-redshift-and-bigquery" rel="noopener noreferrer"&gt;Image source&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;ETL&lt;/h3&gt;

&lt;p&gt;Redshift's powerful infrastructure means it can quickly transform raw data inside the service itself, an approach known as ELT (Extract, Load, Transform).&lt;/p&gt;
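&lt;p&gt;In an ELT flow the order of the last two steps is reversed: raw data lands in the warehouse first, and the warehouse's own SQL engine does the transformation. A minimal sketch, again using an in-memory SQLite database and invented data as a stand-in for a Redshift cluster:&lt;/p&gt;

```python
import sqlite3

db = sqlite3.connect(":memory:")

# Load: raw events land in the warehouse untransformed (amounts still strings).
db.execute("CREATE TABLE raw_events (user_id INT, amount TEXT)")
db.executemany("INSERT INTO raw_events VALUES (?, ?)",
               [(1, "10.0"), (1, "2.5"), (2, "4.0")])

# Transform: the warehouse's SQL engine casts and aggregates in place.
db.execute("""
    CREATE TABLE user_totals AS
    SELECT user_id, SUM(CAST(amount AS REAL)) AS total
    FROM raw_events
    GROUP BY user_id
""")
totals = dict(db.execute("SELECT user_id, total FROM user_totals"))
```

&lt;p&gt;The trade-off is visible even in this toy: the transformation consumes the warehouse's own compute, which on a real cluster competes with analysts' queries.&lt;/p&gt;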

&lt;p&gt;However, several tools can prepare data for use with Redshift by rapidly performing ETL for you, so that transformation doesn't tie up cluster resources. The advantage of ETL-ing data into Redshift is that analysts can begin querying it with their analytic tools as soon as it's loaded: there is no need to aggregate, join, or otherwise cleanse the data first.&lt;/p&gt;

&lt;p&gt;BigQuery relies upon ETL for getting data from transactional databases into its system for analysis. &lt;a href="https://cloud.google.com/solutions/performing-etl-from-relational-database-into-bigquery" rel="noopener noreferrer"&gt;This article&lt;/a&gt; gives an overview of how to perform ETL from a relational database into BigQuery. Without ETL, transforming raw data inside BigQuery means running SQL transformations that are billed by the amount of data they process, which can become expensive quickly.&lt;/p&gt;

&lt;h3&gt;Usability&lt;/h3&gt;

&lt;p&gt;Because it abstracts away the details of the underlying hardware and configuration, BigQuery is easier to get started with. Once it's set up, however, Redshift is equally simple for analysts to work with. Both services integrate well with all popular BI tools.&lt;/p&gt;

&lt;p&gt;Both databases support a wide range of classic data types, like text, integers, floats, and booleans. BigQuery doesn't allow user-defined precision for stored values, while Redshift does. User-defined precision is important in applications where exact numeric operations are required, such as for financial data analysis. Both platforms have excellent documentation available, ensuring most issues you encounter while using the services have a clear solution.&lt;/p&gt;

&lt;h3&gt;Architecture&lt;/h3&gt;

&lt;p&gt;Even though the actual computing resources used in BigQuery are hidden from the end user, both BigQuery and Redshift have similar architectural structures. Both use massively parallel processing (MPP), which distributes queries across multiple servers for fast query execution. Additionally, both services use a columnar storage structure, which makes analytical queries, which typically scan a few columns across many rows, much faster to run.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2F9s84pexrt5blpqjpi5mh.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fthepracticaldev.s3.amazonaws.com%2Fi%2F9s84pexrt5blpqjpi5mh.png" alt="Columnar vs row storage"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="http://saphanatutorial.com/wp-content/uploads/2013/09/Difference-between-Column-based-and-Row-based-Tables.png" rel="noopener noreferrer"&gt;Image source&lt;/a&gt;&lt;/p&gt;
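&lt;p&gt;The row-versus-column layouts pictured above can be sketched in a few lines of Python with toy data. In a columnar layout, an analytical query like &lt;code&gt;SUM(amount)&lt;/code&gt; reads one contiguous column instead of touching every field of every record:&lt;/p&gt;

```python
# Row layout: each record stored together, as in a transactional database.
rows = [{"id": i, "amount": float(i), "country": "US"} for i in range(1000)]

# Column layout: each column stored contiguously, as in Redshift or BigQuery.
columns = {
    "id": [r["id"] for r in rows],
    "amount": [r["amount"] for r in rows],
    "country": [r["country"] for r in rows],
}

# The same aggregate, computed from each layout:
row_sum = sum(r["amount"] for r in rows)   # must visit every whole record
col_sum = sum(columns["amount"])           # reads only the one column needed
```

&lt;p&gt;Both layouts give the same answer, but the columnar scan reads far less data per analytical query, which is the source of the speedup, alongside compression of the homogeneous column values.&lt;/p&gt;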

&lt;h2&gt;Redshift vs. BigQuery - Which Should You Choose?&lt;/h2&gt;

&lt;p&gt;Redshift outperforms BigQuery in most analytical use cases, and it also has a simpler, more predictable cost structure. BigQuery is easier to get started with, but Redshift's customizability and its ability to store values with user-defined levels of precision make it more flexible.&lt;/p&gt;

&lt;p&gt;Both Redshift and BigQuery have similar architectural designs and both services are ideal for use with high-speed modern ETL tools that can quickly extract, prepare, and load data for analytical use.&lt;/p&gt;

&lt;p&gt;Redshift is the most popular cloud-based data warehouse service for a reason—it is the best.&lt;/p&gt;

&lt;h2&gt;Closing Thoughts&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Enterprises are quickly realizing how beneficial cloud-based data warehouses are for analytical purposes. Amazon Redshift and Google BigQuery are two examples of popular cloud-based warehouse solutions.
&lt;/li&gt;
&lt;li&gt;Even with a cloud-based data warehouse service, you still need to figure out the best way to get data from all your source systems inside the data warehouse for analysis and reporting—ETL is one such method to achieve this.&lt;/li&gt;
&lt;li&gt;When deciding on a cloud-based data warehouse solution, it's important to consider factors such as cost, usability, performance, design, and whether it supports ETL or another data integration method.&lt;/li&gt;
&lt;/ul&gt;

</description>
    </item>
  </channel>
</rss>
