<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Edward Turner</title>
    <description>The latest articles on DEV Community by Edward Turner (@edturner).</description>
    <link>https://dev.to/edturner</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F420095%2F66650771-7dc9-45bd-a21d-6418ca5b3abb.jpeg</url>
      <title>DEV Community: Edward Turner</title>
      <link>https://dev.to/edturner</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/edturner"/>
    <language>en</language>
    <item>
      <title>Don't Forget to F.O.O.P</title>
      <dc:creator>Edward Turner</dc:creator>
      <pubDate>Mon, 06 Jul 2020 17:00:29 +0000</pubDate>
      <link>https://dev.to/edturner/don-t-forget-to-f-o-o-p-hi5</link>
      <guid>https://dev.to/edturner/don-t-forget-to-f-o-o-p-hi5</guid>
      <description>&lt;h1&gt;
  
  
  Introduction
&lt;/h1&gt;

&lt;p&gt;“All roads lead to Rome.” While this phrase suggests there are many routes to a solution, some routes suit a given problem better than others. The same holds for the programming style of our software.&lt;br&gt;
Throughout this article, we will discuss the different programming paradigms and the benefits of each.&lt;/p&gt;

&lt;h1&gt;
  
  
  Which Paradigm to Choose?
&lt;/h1&gt;

&lt;p&gt;Our choice of paradigm should align with the best practices of the Software Development Life Cycle (SDLC). Among those best practices are the Single Responsibility principle and D.R.Y. (Don’t Repeat Yourself). These principles encourage software that is testable, maintainable, and stable. By complying with them, you can develop more confidence in your software and increase your development team’s velocity.&lt;br&gt;
Two main paradigms comply with these principles: Functional Programming and Object-Oriented Programming. Throughout our discussion, we will cover the benefits and caveats of each.&lt;/p&gt;

&lt;p&gt;We will use Python as the programming language of choice to illustrate these concepts.&lt;/p&gt;
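
&lt;p&gt;Before going further, a small sketch may make the contrast concrete. The cart example below is illustrative only (it is not from the article): the same total is computed once with pure functions and once with a class that bundles state and behaviour.&lt;/p&gt;

```python
from dataclasses import dataclass, field

# Functional style: a pure function with no mutable state.
def total_price(prices, tax_rate):
    """Return the taxed total without mutating any inputs."""
    subtotal = sum(prices)
    return subtotal * (1.0 + tax_rate)

# Object-oriented style: state and behaviour bundled together.
@dataclass
class Cart:
    tax_rate: float
    prices: list = field(default_factory=list)

    def add(self, price):
        self.prices.append(price)

    def total(self):
        return sum(self.prices) * (1.0 + self.tax_rate)

print(total_price([10.0, 20.0], 0.1))  # roughly 33.0
cart = Cart(0.1)
cart.add(10.0)
cart.add(20.0)
print(cart.total())  # roughly 33.0
```

&lt;p&gt;Note that both versions keep a single responsibility per function or method, which is what makes either style easy to test in isolation.&lt;/p&gt;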

&lt;p&gt;....&lt;/p&gt;

&lt;p&gt;If you want to learn more, please continue reading here: &lt;a href="https://towardsdatascience.com/dont-forget-to-f-o-o-p-5318cffc9ded"&gt;https://towardsdatascience.com/dont-forget-to-f-o-o-p-5318cffc9ded&lt;/a&gt;&lt;/p&gt;

</description>
      <category>python</category>
      <category>functional</category>
    </item>
    <item>
      <title>Enhancing Optimized PySpark Queries</title>
      <dc:creator>Edward Turner</dc:creator>
      <pubDate>Mon, 06 Jul 2020 01:04:41 +0000</pubDate>
      <link>https://dev.to/edturner/enhancing-optimized-pyspark-queries-3190</link>
      <guid>https://dev.to/edturner/enhancing-optimized-pyspark-queries-3190</guid>
      <description>&lt;p&gt;As we continue increasing the volume of data we are processing and storing, and as the velocity of technological advances transforms from linear to logarithmic and from logarithmic to horizontally asymptotic, innovative approaches to improving the run-time of our software and analysis are necessary.&lt;/p&gt;

&lt;p&gt;These innovative approaches include two very popular frameworks: Apache Spark and Apache Arrow. Both enable users to process large volumes of data in a distributed fashion, and both let users process larger volumes of data more quickly by using vectorized approaches. Together, they can easily facilitate big-data analysis. Despite the ways these frameworks empower users, however, there is still room for improvement, specifically within the Python ecosystem. Why can we confidently identify pockets of improvement when using these frameworks from Python? Let’s examine some features Python has.&lt;/p&gt;
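
&lt;p&gt;As one concrete illustration of the two frameworks working together, the sketch below enables Arrow-backed data transfer in PySpark and applies a vectorized pandas UDF. It is a configuration sketch rather than a standalone script: it assumes an installed pyspark with pandas support, and the &lt;code&gt;add_tax&lt;/code&gt; function and its data are made up for the example.&lt;/p&gt;

```python
from pyspark.sql import SparkSession
from pyspark.sql.functions import pandas_udf

spark = (
    SparkSession.builder
    .appName("arrow-demo")
    # Let Spark use Apache Arrow when moving data to and from pandas.
    .config("spark.sql.execution.arrow.pyspark.enabled", "true")
    .getOrCreate()
)

@pandas_udf("double")
def add_tax(price):
    # `price` arrives as a pandas Series, so the UDF is applied
    # to a whole Arrow batch at once instead of row by row.
    return price * 1.1

df = spark.createDataFrame([(10.0,), (20.0,)], ["price"])
df.select(add_tax("price")).show()
```

&lt;p&gt;With the Arrow flag disabled, the same UDF would fall back to slower per-row serialization between the JVM and Python.&lt;/p&gt;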

&lt;p&gt;...&lt;/p&gt;

&lt;p&gt;If you want to learn more, please continue reading here: &lt;a href="https://towardsdatascience.com/enhancing-optimized-pyspark-queries-1d2e9685d882"&gt;https://towardsdatascience.com/enhancing-optimized-pyspark-queries-1d2e9685d882&lt;/a&gt;&lt;/p&gt;

</description>
      <category>python</category>
      <category>datascience</category>
      <category>spark</category>
    </item>
    <item>
      <title>Scaling DAG Creation With Apache Airflow</title>
      <dc:creator>Edward Turner</dc:creator>
      <pubDate>Mon, 06 Jul 2020 00:56:10 +0000</pubDate>
      <link>https://dev.to/edturner/scaling-dag-creation-with-apache-airflow-29cc</link>
      <guid>https://dev.to/edturner/scaling-dag-creation-with-apache-airflow-29cc</guid>
      <description>&lt;p&gt;One of the more difficult tasks within the Data Science community is not designing a model to a well-constructed business problem or developing the code-base to operate in a scalable environment, but, rather, arranging the tasks in the ETL, or in the Data Science pipeline, executing the model on a periodic basis and automating everything in-between.&lt;/p&gt;

&lt;p&gt;This is where Apache Airflow comes to the rescue! With the Airflow UI displaying tasks in graph form, and with the ability to programmatically define your workflow to increase traceability, it is much easier to define and configure your Data Science workflow in production.&lt;/p&gt;

&lt;p&gt;One difficulty still remains, though. There are circumstances when the same monolithic modelling process is applied to different data sources. To increase performance, it is better to have each of these processes run concurrently, rather than add them all to the same DAG.&lt;/p&gt;

&lt;p&gt;No problem: simply create a DAG for each process, all with similar tasks, and schedule them to run at the same time. But if we follow the software development principle of D.R.Y., a question arises.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Is there a way to create multiple different dags with the same-type tasks without having to manually create them?&lt;/em&gt;&lt;/p&gt;
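
&lt;p&gt;One widely used answer to that question is a DAG factory: a single function builds a DAG from parameters, a loop calls it once per data source, and each result is registered at module scope so the scheduler can discover it. The sketch below only shows the shape of the pattern: &lt;code&gt;create_dag&lt;/code&gt; and the source names are illustrative, and &lt;code&gt;DagStub&lt;/code&gt; stands in for Airflow’s real &lt;code&gt;DAG&lt;/code&gt; class so the sketch runs without an Airflow installation.&lt;/p&gt;

```python
# Sketch of the DAG-factory pattern, with a stand-in for airflow.DAG.
class DagStub:
    def __init__(self, dag_id, schedule):
        self.dag_id = dag_id
        self.schedule = schedule
        self.tasks = []

def create_dag(dag_id, source, schedule="@daily"):
    """Build one DAG with the same task shape for a given data source."""
    dag = DagStub(dag_id, schedule)
    # In real Airflow these would be operators bound to `dag`.
    dag.tasks = [f"extract_{source}", f"transform_{source}", f"load_{source}"]
    return dag

# One DAG per data source, all defined from a single template (D.R.Y.).
for source in ("sales", "inventory", "clicks"):
    dag_id = f"etl_{source}"
    # Airflow's scheduler picks up DAG objects found at module scope,
    # so register each generated DAG in globals().
    globals()[dag_id] = create_dag(dag_id, source)

print(sorted(k for k in globals() if k.startswith("etl_")))
# ['etl_clicks', 'etl_inventory', 'etl_sales']
```

&lt;p&gt;In real Airflow code the same loop would construct &lt;code&gt;DAG&lt;/code&gt; objects and bind operators to them; the &lt;code&gt;globals()&lt;/code&gt; registration is what lets the scheduler find each generated DAG.&lt;/p&gt;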

&lt;p&gt;....&lt;/p&gt;

&lt;p&gt;To read more about how to scale your Apache Airflow DAGs, please continue reading here: &lt;a href="https://towardsdatascience.com/scaling-dag-creation-with-apache-airflow-a7b34ba486ac"&gt;https://towardsdatascience.com/scaling-dag-creation-with-apache-airflow-a7b34ba486ac&lt;/a&gt;&lt;/p&gt;

</description>
      <category>python</category>
      <category>architecture</category>
    </item>
    <item>
      <title>Making A Model Is Like Baking A Cake</title>
      <dc:creator>Edward Turner</dc:creator>
      <pubDate>Mon, 06 Jul 2020 00:53:09 +0000</pubDate>
      <link>https://dev.to/edturner/making-a-model-is-like-baking-a-cake-4g3</link>
      <guid>https://dev.to/edturner/making-a-model-is-like-baking-a-cake-4g3</guid>
      <description>&lt;h1&gt;
  
  
  The Types of Cakes Available
&lt;/h1&gt;

&lt;p&gt;As we progress further into our modern era, the advances of Data Science and Technology continue to make marvelous strides in various fields of study and practice. As a result of the vast applicability of Data Science and Technology, many different types of models have been constructed.&lt;br&gt;
To name a few: Generalized Linear Models, Support-Vector Machines, the K-Nearest Neighbors algorithm, Gradient-Boosted Decision Trees, Random Forests, and Neural Networks.&lt;br&gt;
Given the volume of the data and the complexity of the interactions within the data, Data Science-specific packages have been developed in a few different languages. Within Python we have, among others, sklearn, xgboost, lightgbm, pyspark, and H2O. Within R we have, among others, Caret, Prophet, SparkR, and xgboost.&lt;/p&gt;

&lt;p&gt;Each of the aforementioned packages attempts to solve very specific Data Science problems. However, given the nature of the initial problem, we may need a more custom approach to deriving the solution. Therein lies the problem each Data Scientist strives to solve.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Typically, due to the variety of problems in the field, Data Scientists develop solutions using models with predefined architectures, and then seek to improve a KPI metric that the predefined model was not specifically designed to optimize.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Fortunately, there is a way to develop a modelling architecture that specifically solves your problem. By understanding the mathematical and statistical processes each of these predefined model architectures uses, it is possible to reverse engineer the model to fit your specific problem.&lt;br&gt;
First, let us examine one family of models, and then proceed to develop our own model....&lt;/p&gt;
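
&lt;p&gt;To make “reverse engineering” concrete before that walkthrough, here is an illustrative sketch (not taken from the article): ordinary least squares, the simplest member of the Generalized Linear Model family, fit from scratch by gradient descent in plain Python. The toy data and hyperparameters are made up; once the loss and update rule are written out explicitly, either can be swapped to suit the actual problem.&lt;/p&gt;

```python
def fit_linear(xs, ys, lr=0.01, epochs=2000):
    """Fit y = w*x + b by gradient descent on the mean squared error."""
    w, b = 0.0, 0.0
    n = len(xs)
    for _ in range(epochs):
        # Gradients of the MSE loss with respect to w and b.
        grad_w = sum(2 * (w * x + b - y) * x for x, y in zip(xs, ys)) / n
        grad_b = sum(2 * (w * x + b - y) for x, y in zip(xs, ys)) / n
        w -= lr * grad_w
        b -= lr * grad_b
    return w, b

# Toy data generated from y = 2x + 1; the fit should roughly recover that.
xs = [0.0, 1.0, 2.0, 3.0, 4.0]
ys = [1.0, 3.0, 5.0, 7.0, 9.0]
w, b = fit_linear(xs, ys)
print(round(w, 2), round(b, 2))  # 2.0 1.0
```

&lt;p&gt;Customizing the architecture then amounts to replacing the squared-error loss, or the linear link between inputs and outputs, with ones suited to your KPI.&lt;/p&gt;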

&lt;p&gt;To read more about how to create your own custom ML model, please continue reading here: &lt;a href="https://towardsdatascience.com/making-a-model-is-like-baking-a-cake-5f2443894c5f"&gt;https://towardsdatascience.com/making-a-model-is-like-baking-a-cake-5f2443894c5f&lt;/a&gt;&lt;/p&gt;

</description>
      <category>python</category>
      <category>datascience</category>
      <category>machinelearning</category>
    </item>
  </channel>
</rss>
