<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Michael Obed </title>
    <description>The latest articles on DEV Community by Michael Obed  (@obedm16).</description>
    <link>https://dev.to/obedm16</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1126919%2Ffec3859d-8ce9-4041-874f-5ee4ae8c9cab.jpeg</url>
      <title>DEV Community: Michael Obed </title>
      <link>https://dev.to/obedm16</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/obedm16"/>
    <language>en</language>
    <item>
      <title>Creating and Integrating Data Pipeline Using Amazon S3 and Snowflake Data Warehouse. (Using SnowSQL).</title>
      <dc:creator>Michael Obed </dc:creator>
      <pubDate>Thu, 11 Jul 2024 15:38:57 +0000</pubDate>
      <link>https://dev.to/obedm16/creating-and-integrating-data-pipeline-using-amazon-s3-and-snowflake-data-warehouse-using-snowsql-4ekb</link>
      <guid>https://dev.to/obedm16/creating-and-integrating-data-pipeline-using-amazon-s3-and-snowflake-data-warehouse-using-snowsql-4ekb</guid>
      <description>&lt;p&gt;Data pipelines are widely used in Data Engineering and Analytics to fetch data from external sources, for instance AWS Redshift, S3 (Simple Storage Service), GCP (Google Cloud Platform), Oracle, Azure and many other widely used industry technologies.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fet05epkp5vyqkijv4pci.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fet05epkp5vyqkijv4pci.png" alt="Image description" width="714" height="318"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;A data pipeline – according to Snowflake – is concerned with moving data from a source to a destination (such as a data warehouse or large storage service) while simultaneously optimizing and transforming the data. As a result, the data arrives in a state that can be analyzed and used to develop business insights.&lt;/p&gt;

&lt;p&gt;This article covers how to integrate AWS S3 with Snowflake, which is convenient when working with large data sources.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Requirements:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Knowledge of SQL for working in Snowflake.&lt;/li&gt;
&lt;li&gt;A Snowflake account and an AWS account.&lt;/li&gt;
&lt;li&gt;A data file, preferably CSV (Comma-Separated Values). It can be as large as 160 GB; beyond that, AWS requires the cloud developer to use other tools – this is beyond this article’s scope.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Step 1&lt;/strong&gt;&lt;br&gt;
Set up an AWS account as a root user.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmzoxwua3s62q596a1f63.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmzoxwua3s62q596a1f63.png" alt="Image description" width="629" height="331"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 2&lt;/strong&gt;&lt;br&gt;
After successfully logging into your account, go to &lt;strong&gt;Services&lt;/strong&gt; as shown. New users can locate &lt;strong&gt;Storage&lt;/strong&gt; by scrolling down, then &lt;strong&gt;select S3&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl2pa5ixoebwj0oo6byox.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl2pa5ixoebwj0oo6byox.png" alt="Image description" width="577" height="271"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmtmecertyko0zclny36a.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmtmecertyko0zclny36a.png" alt="Image description" width="623" height="335"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 3&lt;/strong&gt;&lt;br&gt;
Upon selecting &lt;strong&gt;S3&lt;/strong&gt; we expect a display like the following – the design might change over time.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnoibxrcuy1wg9flwf1qi.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnoibxrcuy1wg9flwf1qi.png" alt="Image description" width="624" height="335"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Click &lt;strong&gt;Create bucket&lt;/strong&gt; then give it a name – in our case &lt;strong&gt;mybootcampbucket&lt;/strong&gt;:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2zhqvuyvhqug1uwon4px.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2zhqvuyvhqug1uwon4px.png" alt="Image description" width="624" height="337"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7979ssb0ngwsayv8wkx1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F7979ssb0ngwsayv8wkx1.png" alt="Image description" width="624" height="333"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Click &lt;strong&gt;Create bucket&lt;/strong&gt; to finish the process.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3m9hhxiijywokv3zyiu1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3m9hhxiijywokv3zyiu1.png" alt="Image description" width="624" height="333"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;To upload the file, select the newly created bucket (highlighted).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2re1gshm3fpblcyy10k2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2re1gshm3fpblcyy10k2.png" alt="Image description" width="624" height="335"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Upload the file now as depicted below.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyzxy4ixk1ex68ern3lrb.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyzxy4ixk1ex68ern3lrb.png" alt="Image description" width="624" height="337"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6goja0sy9d3xckkim1s7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6goja0sy9d3xckkim1s7.png" alt="Image description" width="624" height="336"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 4&lt;/strong&gt;&lt;br&gt;
After uploading the file/dataset, a policy needs to be set up. The policy grants the permissions an external integration needs when it is associated with an identity in the respective AWS account.&lt;br&gt;
To set up the policy, click &lt;strong&gt;Services&lt;/strong&gt; then head to &lt;strong&gt;Security, Identity &amp;amp; Compliance&lt;/strong&gt;. See below:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw9caa7t13tqvbxn4jpvk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fw9caa7t13tqvbxn4jpvk.png" alt="Image description" width="624" height="333"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Select &lt;strong&gt;IAM&lt;/strong&gt; (Identity and Access Management), then open &lt;strong&gt;Policies&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmp53mu6cbnd6undn83t0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmp53mu6cbnd6undn83t0.png" alt="Image description" width="624" height="334"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx6446gcept50puuew41m.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx6446gcept50puuew41m.png" alt="Image description" width="624" height="335"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;After clicking ‘&lt;strong&gt;Create policy&lt;/strong&gt;’, give the policy to be created a name. In our case it is &lt;strong&gt;Bootcamp2023&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0ci0zcdct6k9w0mcth3f.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F0ci0zcdct6k9w0mcth3f.png" alt="Image description" width="624" height="352"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Under policies, click &lt;strong&gt;JSON&lt;/strong&gt;. Policies are written in JavaScript Object Notation (JSON), a key–value pair data representation syntax.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fug53mog0ni9o3ir6g06n.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fug53mog0ni9o3ir6g06n.png" alt="Image description" width="624" height="335"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx2fa79bzaw3bjyv0mk49.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx2fa79bzaw3bjyv0mk49.png" alt="Image description" width="624" height="333"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3wtr93ypdgdvkuh5kavw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3wtr93ypdgdvkuh5kavw.png" alt="Image description" width="624" height="335"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;After setting up the policy, next we create the &lt;strong&gt;Role&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyzf7lenk56ivm24q5dw6.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fyzf7lenk56ivm24q5dw6.png" alt="Image description" width="624" height="321"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Give your role a name, in this scenario – our role is named &lt;strong&gt;Bootcamp_2023&lt;/strong&gt;. &lt;/p&gt;

&lt;p&gt;&lt;em&gt;&lt;strong&gt;Setup the permission&lt;/strong&gt;&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk22161pqjn62jp05igsx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fk22161pqjn62jp05igsx.png" alt="Image description" width="624" height="334"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frlcersegsc04i64u0tj9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Frlcersegsc04i64u0tj9.png" alt="Image description" width="624" height="334"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I have highlighted &lt;strong&gt;This account&lt;/strong&gt; to note that it is the option we deal with here.&lt;br&gt;
Next, select the policy to attach to the role.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3zilbksdo8dczyh75sp2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3zilbksdo8dczyh75sp2.png" alt="Image description" width="624" height="335"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Finish up with the role creation.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxlawolmmxui5ad212hdy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxlawolmmxui5ad212hdy.png" alt="Image description" width="624" height="335"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Now copy the &lt;strong&gt;ARN&lt;/strong&gt; (Amazon Resource Name), which identifies the role uniquely.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg6q00svh5ahmiv57jc2c.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg6q00svh5ahmiv57jc2c.png" alt="Image description" width="623" height="332"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Step 5 – Setting Up/Creating a Snowflake Account.&lt;/strong&gt;&lt;br&gt;
To create a Snowflake account, head to &lt;a href="https://signup.snowflake.com/?utm_cta=website-homepage-hero-button-start-for-free" rel="noopener noreferrer"&gt;Snowflake&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwk0ac5jtxsz9h8fz91e4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fwk0ac5jtxsz9h8fz91e4.png" alt="Image description" width="624" height="336"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;After creating an account, we need to create our &lt;strong&gt;Warehouse&lt;/strong&gt; – as shown below.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq6fm86803gb08etdc7zo.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq6fm86803gb08etdc7zo.png" alt="Image description" width="624" height="336"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Below, the pipeline dataset loaded from AWS S3 is displayed.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Facwgxwf8ibvjormqtv9c.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Facwgxwf8ibvjormqtv9c.png" alt="Image description" width="624" height="335"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;That is all for this article.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Data Engineering for Beginners: A Step-by-Step Guide.</title>
      <dc:creator>Michael Obed </dc:creator>
      <pubDate>Tue, 31 Oct 2023 05:24:24 +0000</pubDate>
      <link>https://dev.to/obedm16/data-engineering-for-beginners-a-step-by-step-guide-1ane</link>
      <guid>https://dev.to/obedm16/data-engineering-for-beginners-a-step-by-step-guide-1ane</guid>
      <description>&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Foxy15091w1cygi1v0a0p.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Foxy15091w1cygi1v0a0p.png" alt="Image description" width="800" height="451"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Data Engineering&lt;/strong&gt; is one of the fastest-rising career positions in the data technology field. It is a complex field tasked with making raw data usable to data scientists and other groups within an organization, and it encompasses numerous specialties of data science.&lt;br&gt;
A data engineer is a type of software engineer who creates big data ETL pipelines to manage the flow of data through the organization.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Responsibilities of a Data Engineer&lt;/strong&gt;&lt;br&gt;
• Develops and maintains data pipelines&lt;br&gt;
• Ensures data quality &amp;amp; accuracy &lt;br&gt;
• Designs, develops and maintains database architecture&lt;br&gt;
• Owns the ETL (Extract, Transform, Load) process – an automated process that gets data from a source into a database (see the sketch below)&lt;br&gt;
• Since the data engineer will be getting data from a given source, they need to ensure its quality and accuracy.&lt;/p&gt;
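
&lt;p&gt;To make the ETL idea concrete, here is a minimal sketch in Python using pandas and the standard library: extract raw records, transform them, and load them into a SQLite database. The records and table name are made up for illustration.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import sqlite3
import pandas as pd

# Extract: in a real pipeline this would read from an API, file or queue.
raw = [
    {"order_id": 1, "amount": "120.5", "region": " East "},
    {"order_id": 2, "amount": "75.0", "region": "WEST"},
]

# Transform: fix types and normalize messy text fields.
df = pd.DataFrame(raw)
df["amount"] = df["amount"].astype(float)
df["region"] = df["region"].str.strip().str.lower()

# Load: write the cleaned data into a database table.
conn = sqlite3.connect("warehouse.db")
df.to_sql("orders", conn, if_exists="replace", index=False)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;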

&lt;p&gt;&lt;strong&gt;Skills Required to Become a Data Engineer&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;1. SQL/NoSQL&lt;/em&gt;&lt;/strong&gt;&lt;br&gt;
Structured Query Language, or simply SQL (‘sequel’), is the cream of the fundamental skill set for Data Engineers and any other field related to data and databases. Modern data is largely stored in Relational Database Management Systems (RDBMS), which demands that aspiring Data Engineers master SQL. There are a lot of relational database systems, including &lt;em&gt;&lt;strong&gt;Oracle 23c, MySQL, SQL Server, PostgreSQL&lt;/strong&gt;&lt;/em&gt;, to mention just a few. &lt;/p&gt;

&lt;p&gt;With NoSQL, data is not stored in table format: a non-relational database design does not require a schema, and it offers rapid scalability for managing large and typically unstructured data sets. Examples of NoSQL database systems are &lt;em&gt;MongoDB&lt;/em&gt; and &lt;em&gt;Cassandra&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;2. Python/R&lt;/em&gt;&lt;/strong&gt;&lt;br&gt;
To work with data from a source and create automation tasks around it, a data engineer needs knowledge of a programming language and its packages or libraries. Python is highly regarded, given the versatility of libraries like Pandas and Airflow. For R, a data engineer has the Tidyverse at their disposal – a collection of R packages, primarily for data engineering and analytics: &lt;em&gt;ggplot2, purrr, tibble, dplyr, tidyr, stringr, readr, and forcats&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;3. Apache Kafka&lt;/em&gt;&lt;/strong&gt;&lt;br&gt;
This is an open-source distributed event streaming platform widely used in data engineering for building real-time data pipelines and streaming applications. It is designed to handle high-throughput, fault-tolerant, and scalable data streaming, making it a popular choice for processing and managing large volumes of data in real time.&lt;br&gt;
It is used to design and implement robust, real-time data pipelines, enabling the efficient processing, storage, and analysis of streaming data.&lt;/p&gt;
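
&lt;p&gt;As a flavour of what working with Kafka looks like, here is a minimal producer sketch using the community &lt;em&gt;kafka-python&lt;/em&gt; package. The broker address and the &lt;code&gt;events&lt;/code&gt; topic are illustrative assumptions.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import json
from kafka import KafkaProducer

# Connect to a local broker and serialize message values as JSON.
producer = KafkaProducer(
    bootstrap_servers="localhost:9092",
    value_serializer=lambda v: json.dumps(v).encode("utf-8"),
)

# Publish a single event to the (assumed) "events" topic.
producer.send("events", {"user_id": 42, "action": "click"})
producer.flush()  # block until the message is actually sent
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;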

&lt;p&gt;&lt;em&gt;&lt;strong&gt;4. AWS/Azure/GCP&lt;/strong&gt;&lt;/em&gt;&lt;br&gt;
As a Data Engineer, knowledge of cloud computing is an important part of your skill set. Big data that cannot be stored or processed on a single machine is typically handled on cloud platforms such as AWS, Azure or GCP.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;&lt;strong&gt;5. Apache Spark&lt;/strong&gt;&lt;/em&gt;&lt;br&gt;
Apache Spark is a powerful open-source distributed computing system that is widely used in the field of data engineering. It is designed to handle large-scale data processing tasks and is known for its speed, ease of use, and versatility in supporting a wide range of applications. &lt;br&gt;
It serves as a critical tool for performing data transformation, data integration, and data analytics. It provides a unified engine for big data processing that supports Java, Scala, Python, and R. This flexibility makes it easier for data engineers to work with different data formats and leverage their preferred programming languages for data processing tasks.&lt;br&gt;
Apache Spark is particularly valuable in data engineering for the following reasons:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; &lt;strong&gt;&lt;em&gt;Speed and efficiency:&lt;/em&gt;&lt;/strong&gt; Apache Spark can process large datasets much faster than traditional data processing systems, thanks to its in-memory processing capabilities and optimized execution engine.&lt;/li&gt;
&lt;li&gt; &lt;em&gt;&lt;strong&gt;Scalability:&lt;/strong&gt;&lt;/em&gt; It can handle massive datasets and scale seamlessly to accommodate growing data volumes, making it suitable for handling big data challenges.&lt;/li&gt;
&lt;li&gt; &lt;em&gt;&lt;strong&gt;Fault tolerance:&lt;/strong&gt;&lt;/em&gt; It ensures fault tolerance by storing intermediate results during processing, enabling the system to recover from failures without losing data.&lt;/li&gt;
&lt;li&gt; &lt;em&gt;&lt;strong&gt;Versatility:&lt;/strong&gt;&lt;/em&gt; Apache Spark supports a wide range of data processing tasks, including batch processing, real-time streaming, machine learning, and graph processing, making it a versatile solution for diverse data engineering needs.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Data engineers utilize Apache Spark to build and maintain data pipelines, process and analyze large datasets, and perform complex data transformations. By leveraging its capabilities, data engineers can efficiently manage and process big data, extract valuable insights, and build robust data-driven applications and systems.&lt;/p&gt;
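
&lt;p&gt;A tiny PySpark sketch of the kind of aggregation described above – it assumes a local Spark installation and a hypothetical &lt;code&gt;sales.csv&lt;/code&gt; with &lt;code&gt;region&lt;/code&gt; and &lt;code&gt;amount&lt;/code&gt; columns:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from pyspark.sql import SparkSession
from pyspark.sql import functions as F

# Start (or reuse) a local Spark session.
spark = SparkSession.builder.appName("sales-demo").getOrCreate()

# Read a hypothetical CSV file, inferring column types from the data.
df = spark.read.csv("sales.csv", header=True, inferSchema=True)

# A simple distributed aggregation: total sales per region.
totals = df.groupBy("region").agg(F.sum("amount").alias("total_amount"))
totals.show()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;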

&lt;p&gt;&lt;strong&gt;&lt;em&gt;6. Data Structures and Algorithms.&lt;/em&gt;&lt;/strong&gt;&lt;br&gt;
These are crucial in data engineering, being the fundamental building blocks for the efficient management and processing of data.&lt;br&gt;
Data structures refer to the specific ways data is organized and stored in a computer’s memory, enabling efficient access, modification, and deletion of data. Algorithms are step-by-step procedures or formulas for performing specific tasks, such as searching, sorting, and data manipulation. Data engineering relies heavily on various algorithms to handle data operations efficiently, such as data transformation, data integration, and data aggregation.&lt;/p&gt;
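
&lt;p&gt;For example, choosing a hash map (a Python dict) turns a grouping pass into a single scan over the data – a pattern that shows up constantly in aggregation work. A small illustrative sketch:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;from collections import defaultdict

# Aggregate event counts per user in one O(n) pass using a hash map.
events = [("alice", 1), ("bob", 1), ("alice", 1)]
counts = defaultdict(int)
for user, n in events:
    counts[user] += n

print(dict(counts))  # {'alice': 2, 'bob': 1}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;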

&lt;p&gt;&lt;strong&gt;&lt;em&gt;7. Data Visualization&lt;/em&gt;&lt;/strong&gt;&lt;br&gt;
This is the graphical representation of data and information. It plays a crucial role in presenting complex data in a visual format that is easy to understand and interpret. It enables data engineers to communicate insights and findings effectively to stakeholders, facilitating data-driven decision-making and an understanding of complex data relationships.&lt;/p&gt;

&lt;p&gt;Business Intelligence and visualization tools used for this include &lt;em&gt;&lt;strong&gt;Matplotlib, Power BI, Tableau, Seaborn&lt;/strong&gt;&lt;/em&gt;, just to mention a few.&lt;/p&gt;
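
&lt;p&gt;A minimal Matplotlib sketch with made-up monthly figures, just to show how little code a first chart needs:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import matplotlib.pyplot as plt

# Illustrative data: revenue per month.
months = ["Jan", "Feb", "Mar", "Apr"]
revenue = [10.2, 12.5, 9.8, 14.1]

plt.bar(months, revenue)
plt.xlabel("Month")
plt.ylabel("Revenue (USD thousands)")
plt.title("Monthly revenue")
plt.show()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;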

</description>
    </item>
    <item>
      <title>The Complete Guide to Time Series Model</title>
      <dc:creator>Michael Obed </dc:creator>
      <pubDate>Thu, 26 Oct 2023 06:15:44 +0000</pubDate>
      <link>https://dev.to/obedm16/the-complete-guide-to-time-series-model-4fcg</link>
      <guid>https://dev.to/obedm16/the-complete-guide-to-time-series-model-4fcg</guid>
      <description>&lt;p&gt;&lt;strong&gt;Time series analysis&lt;/strong&gt; comprises methods for analyzing time series data in order to extract meaningful statistics and other characteristics of the data while &lt;strong&gt;time series forecasting&lt;/strong&gt; is the use of a model to predict future values based on previously observed values.&lt;/p&gt;

&lt;p&gt;Combining time series analysis and time series forecasting, we get a &lt;strong&gt;time series model&lt;/strong&gt;, which refers to data points ordered in time being used to forecast the future.&lt;/p&gt;

&lt;h2&gt;
  
  
  Forecasting Models
&lt;/h2&gt;

&lt;p&gt;The classic tool for conducting time series modeling is ARIMA, an acronym for AutoRegressive Integrated Moving Average.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;&lt;strong&gt;How ARIMA is applied&lt;/strong&gt;&lt;/em&gt;&lt;br&gt;
It is specified by three order parameters: &lt;strong&gt;p&lt;/strong&gt;, &lt;strong&gt;d&lt;/strong&gt;, &lt;strong&gt;q&lt;/strong&gt;.&lt;br&gt;
&lt;em&gt;&lt;strong&gt;p&lt;/strong&gt;&lt;/em&gt; refers to the number of autoregressive terms (&lt;strong&gt;AR&lt;/strong&gt;)&lt;br&gt;
&lt;strong&gt;&lt;em&gt;d&lt;/em&gt;&lt;/strong&gt; refers to how many non-seasonal differences are needed to achieve stationarity (&lt;strong&gt;I&lt;/strong&gt;)&lt;br&gt;
&lt;strong&gt;&lt;em&gt;q&lt;/em&gt;&lt;/strong&gt; refers to the number of lagged forecast errors in the prediction equation (&lt;strong&gt;MA&lt;/strong&gt;)&lt;/p&gt;
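
&lt;p&gt;As an illustrative sketch (not part of the original walkthrough), fitting an ARIMA(p, d, q) model with &lt;em&gt;statsmodels&lt;/em&gt; on synthetic data looks like the following; the order (1, 1, 1) is an arbitrary example:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import numpy as np
import pandas as pd
from statsmodels.tsa.arima.model import ARIMA

# Synthetic trending series at monthly frequency.
idx = pd.date_range("2020-01-01", periods=60, freq="MS")
y = pd.Series(np.cumsum(np.random.normal(1.0, 2.0, 60)), index=idx)

# p=1 (AR terms), d=1 (one difference), q=1 (MA terms).
result = ARIMA(y, order=(1, 1, 1)).fit()

# Forecast the next 6 periods.
print(result.forecast(steps=6))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;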

&lt;h2&gt;
  
  
  Stationarity of Time Series
&lt;/h2&gt;

&lt;p&gt;The stationarity of a time series depends on:&lt;br&gt;
• Mean&lt;br&gt;
• Variance&lt;br&gt;
• Covariance&lt;/p&gt;

&lt;p&gt;The mean of the series should not be a function of time.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ficov4cvzwoef5f3l7prq.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ficov4cvzwoef5f3l7prq.jpg" alt="Image description" width="800" height="306"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The variance of the series should not be a function of time.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6e0dqhrai25n4xc52vm5.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6e0dqhrai25n4xc52vm5.jpg" alt="Image description" width="800" height="393"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The covariance of the &lt;em&gt;i&lt;/em&gt;th term and the (&lt;em&gt;i&lt;/em&gt; + &lt;em&gt;m&lt;/em&gt;)th term should not be a function of time.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;&lt;strong&gt;A series is stationary when these values are constant over a period of time.&lt;/strong&gt;&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj4mk8d59eboielffo8xv.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fj4mk8d59eboielffo8xv.jpg" alt="Image description" width="563" height="281"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Components Affecting Time Series
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Trend&lt;/em&gt;&lt;/strong&gt;&lt;br&gt;
This is an increase or decrease in the series that persists over a long period of time. &lt;/p&gt;

&lt;p&gt;&lt;em&gt;Example: Population growth over the years can be seen as an upward trend.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Seasonality&lt;/em&gt;&lt;/strong&gt;&lt;br&gt;
Seasonality refers to a regular pattern of up-and-down fluctuations – a short-term variation occurring due to seasonal factors.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Example: sales of ice cream increase in summer.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Cyclicity&lt;/em&gt;&lt;/strong&gt;&lt;br&gt;
This is a medium-term variation caused by circumstances that repeat at irregular intervals.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Example: 5 years of economic growth, followed by 2 years of economic recession, followed by 7 years of economic growth followed by 1 year of economic recession.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Irregularity&lt;/em&gt;&lt;/strong&gt;&lt;br&gt;
This refers to variations that occur due to unpredictable factors and that do not repeat in particular patterns.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Example: Variations caused by incidents like earthquakes, war, floods and other variations that have unpredictable factors.&lt;/em&gt;&lt;/p&gt;


&lt;h2&gt;
  
  
  Moving Average
&lt;/h2&gt;

&lt;p&gt;The moving average model is probably the simplest approach to time series modeling. This model simply states that the next observation is the mean of all past observations.&lt;/p&gt;
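
&lt;p&gt;An illustrative pandas sketch – a rolling (window) mean over a toy series, where the window size of 3 is an arbitrary choice:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import pandas as pd

# Toy series of observations.
s = pd.Series([10, 12, 13, 12, 15, 16, 18])

# Moving average over the last 3 observations at each step.
print(s.rolling(window=3).mean())
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;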

&lt;p&gt;&lt;em&gt;&lt;strong&gt;Exponential Smoothing&lt;/strong&gt;&lt;/em&gt;&lt;br&gt;
Exponential smoothing uses similar logic to moving average, but this time, a different decreasing weight is assigned to each observation. In other words, less importance is given to observations as we move further from the present.&lt;br&gt;
Mathematically, exponential smoothing is expressed as: &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcgp3onov1urznaub99jx.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fcgp3onov1urznaub99jx.jpg" alt="Image description" width="512" height="56"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Alpha is a smoothing factor that takes values between zero and one. It determines how fast the weight decreases for previous observations.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Double Exponential Smoothing&lt;/em&gt;&lt;/strong&gt;&lt;br&gt;
Double exponential smoothing is used when there is a trend in the time series. In that case, we use this technique, which is simply a recursive use of exponential smoothing twice.&lt;br&gt;
Mathematically:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F50yf0yfat8e6ep2ihcta.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F50yf0yfat8e6ep2ihcta.jpg" alt="Image description" width="512" height="108"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Here, beta is the trend smoothing factor, and it takes values between zero and one.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Triple Exponential Smoothing&lt;/em&gt;&lt;/strong&gt;&lt;br&gt;
This method extends double exponential smoothing by adding a seasonal smoothing factor. Of course, this is useful if you notice seasonality in your time series.&lt;br&gt;
Mathematically, triple exponential smoothing is expressed as:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm04u8x2xgdhqndbbtey7.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fm04u8x2xgdhqndbbtey7.jpg" alt="Image description" width="512" height="234"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Time Series Models?
&lt;/h2&gt;

&lt;p&gt;There are a number of reasons why businesses and organizations, through their respective data personnel, conduct time series analysis. The following are some of the applications of time series modeling.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Forecasting Future Trends&lt;/em&gt;&lt;/strong&gt;&lt;br&gt;
Based on previous data collected through time series models, businesses can predict how future trends may develop to protect their financial resources, explore new markets, restock inventory and perform other tasks. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Detecting Anomalies&lt;/em&gt;&lt;/strong&gt;&lt;br&gt;
Time series models also allow organizations to more easily spot data shifts that may signal unusual behavior or changes in the market.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;&lt;strong&gt;Determining Patterns&lt;/strong&gt;&lt;/em&gt;&lt;br&gt;
Businesses that rely on seasonal sales, monthly online traffic spikes and other repetitive behavior can establish expectations based on time series models, gauging their overall health and performance.&lt;/p&gt;

&lt;h2&gt;
  
  
  Forecasting with Time Series Models
&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Healthcare&lt;/em&gt;&lt;/strong&gt;&lt;br&gt;
Time series models can be used to monitor the spread of diseases by observing how many people transmit a disease and how many people die after being infected.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;&lt;strong&gt;Agriculture&lt;/strong&gt;&lt;/em&gt;&lt;br&gt;
Time series models take into account seasonal temperatures, the number of rainy days each month and other variables over the course of years, allowing agricultural workers to assess environmental conditions and ensure a successful harvest. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Finance&lt;/em&gt;&lt;/strong&gt;&lt;br&gt;
Financial analysts can leverage time series models to record sales numbers for each month and predict potential stock market behavior.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Cybersecurity&lt;/em&gt;&lt;/strong&gt;&lt;br&gt;
IT and cybersecurity teams can develop patterns in user behavior with time series models, allowing them to be aware of when behavior doesn’t align with normal trends.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;&lt;strong&gt;Retail&lt;/strong&gt;&lt;/em&gt;&lt;br&gt;
Retailers may apply time series models to study how other companies’ prices and the number of customer purchases change over time, helping them optimize prices.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Exploratory Data Analysis Using Data Visualization Techniques</title>
      <dc:creator>Michael Obed </dc:creator>
      <pubDate>Mon, 16 Oct 2023 14:03:39 +0000</pubDate>
      <link>https://dev.to/obedm16/exploratory-data-analysis-using-data-visualization-techniques-57om</link>
      <guid>https://dev.to/obedm16/exploratory-data-analysis-using-data-visualization-techniques-57om</guid>
      <description>&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F29xbggxej9b3k4tqgmfs.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F29xbggxej9b3k4tqgmfs.jpg" alt="Image description" width="800" height="450"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Exploratory Data Analysis, or EDA, refers to the process of analyzing and summarizing datasets to gain insights into the data. This helps in understanding the data and identifying patterns, relationships and anomalies.&lt;br&gt;
EDA is a crucial step in any data analysis project. Some of the tools used in EDA include Python, R, SQL and Excel.&lt;br&gt;
The objectives of EDA are to:&lt;/p&gt;

&lt;p&gt;- Enable unexpected discoveries in the data&lt;br&gt;
- Suggest hypotheses about the causes of observed phenomena&lt;br&gt;
- Assess assumptions on which statistical inference will be based&lt;br&gt;
- Support the selection of appropriate statistical tools and techniques&lt;br&gt;
- Provide a basis for further data collection through surveys or experiments&lt;/p&gt;

&lt;p&gt;To make analyzing data easier, Python is discussed in more detail below.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;              **Why Python for EDA?**
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;ul&gt;
&lt;li&gt;Python is a general yet powerful and versatile programming language widely used in data analysis.&lt;/li&gt;
&lt;li&gt;Python is rich in libraries and tools that make it easy to perform EDA tasks – among them Pandas, NumPy, Matplotlib, Seaborn and Plotly, discussed further below.&lt;/li&gt;
&lt;li&gt;Python is easy to learn and use for EDA tasks, which makes it an ideal choice for beginners and experts alike.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Introduction to Data Visualization
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Data visualization helps in communicating insights derived from complex datasets.&lt;/li&gt;
&lt;li&gt;Python is preferred for the task since it comes with libraries that enable a Data Analyst or Data Scientist to create visualizations of the prepared data.&lt;/li&gt;
&lt;li&gt;Libraries such as Matplotlib, Seaborn and Plotly help convey the findings effectively.&lt;/li&gt;
&lt;li&gt;Matplotlib helps create 2D plots in Python.&lt;/li&gt;
&lt;li&gt;Plotly offers interactive capabilities for visualizations.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Before we dive into the EDA steps, let’s prepare our environment.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;em&gt;Note:&lt;/em&gt;&lt;/strong&gt; I used VS Code and Anaconda to conduct the Exploratory Data Analysis on the dataset provided. &lt;/p&gt;

&lt;p&gt;Also, your PC should have at least a Core i5 processor and 8 GB of RAM, since data analysis is processing-intensive. &lt;br&gt;
Download VS Code and Anaconda from their official websites.&lt;/p&gt;

&lt;p&gt;After downloading and installing Anaconda, the interface should appear as follows:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1k57pgyvl3okb7k6wtnl.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1k57pgyvl3okb7k6wtnl.jpg" alt="Image description" width="800" height="428"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Now click &lt;strong&gt;Launch&lt;/strong&gt; under &lt;em&gt;Notebook (Jupyter)&lt;/em&gt;. The Jupyter interface is shown below.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fa0g79oo2z1u8ocerpyip.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fa0g79oo2z1u8ocerpyip.jpg" alt="Image description" width="800" height="426"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Now the remaining part is to conduct the EDA process.&lt;/p&gt;

&lt;p&gt;To use VS Code after downloading it, create a folder for your project.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Feny4m3vs4sqtk25ofwzq.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Feny4m3vs4sqtk25ofwzq.jpg" alt="Image description" width="800" height="428"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;To run the Python file, open the terminal via the keyboard shortcut &lt;strong&gt;CTRL + `&lt;/strong&gt; (backtick) or locate the three dots on the tab bar. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl1efiwa9fldkytdt3s4o.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fl1efiwa9fldkytdt3s4o.jpg" alt="Image description" width="800" height="426"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Run this command: &lt;code&gt;C:/ProgramData/anaconda3/python.exe "d:/Data Science BootCamp/Week 2/eda.py"&lt;/code&gt; (replace the path with your own path to the file where the scripts are written).&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;                  **Steps in EDA**
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;To conduct EDA appropriately there are steps to be followed to ensure the Data Scientist ends up with clean data. The steps are: &lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Understand the data at hand – the dataset to be analyzed. This can be done by first importing the Python Libraries.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqh0jtjfdvcy5ekkzc9j5.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fqh0jtjfdvcy5ekkzc9j5.jpg" alt="Image description" width="800" height="426"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ol start="2"&gt;
&lt;li&gt;Clean the data using the imported libraries. This is done by inspecting the data – dirty data can contain null values, duplicates, inconsistent column names and stray spaces within the data.&lt;/li&gt;
&lt;li&gt;Then analyze the relationships between the data variables (see the sketch below).&lt;/li&gt;
&lt;/ol&gt;
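
&lt;p&gt;Putting the steps together, here is a minimal EDA sketch in pandas. The file name &lt;code&gt;data.csv&lt;/code&gt; is a stand-in for whatever dataset you are analyzing:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import pandas as pd

# Step 1: understand the data at hand.
df = pd.read_csv("data.csv")
print(df.head())
print(df.info())

# Step 2: clean the data - nulls, duplicates, inconsistent column names.
df.columns = df.columns.str.strip().str.lower()
df = df.drop_duplicates()
print(df.isnull().sum())

# Step 3: analyze relationships between the variables.
print(df.describe())
print(df.corr(numeric_only=True))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;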

</description>
    </item>
    <item>
      <title>Data Science for Beginners: 2023 - 2024 Complete Roadmap</title>
      <dc:creator>Michael Obed </dc:creator>
      <pubDate>Fri, 06 Oct 2023 10:23:34 +0000</pubDate>
      <link>https://dev.to/obedm16/data-science-for-beginners-2023-2024-complete-roadmap-3j6g</link>
      <guid>https://dev.to/obedm16/data-science-for-beginners-2023-2024-complete-roadmap-3j6g</guid>
      <description>&lt;h2&gt;
  
  
  Data Science, Data Analysis, Data Engineering and Analytical Engineering
&lt;/h2&gt;


&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fycowstgcrc4b9p3mx6zp.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fycowstgcrc4b9p3mx6zp.jpg" alt="Image description" width="613" height="334"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Data is the new gold – The Economist recently called data the world’s most valuable resource. Let’s look at the major fields of data specialization, covering Data Science, Data Analysis, Data Engineering and Analytics Engineering, together with a roadmap into Data Science.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Data Science&lt;/strong&gt; – This is a field of study that combines domain expertise, programming skills and knowledge of mathematics and statistics to extract meaningful insights from data. Data Science is a fairly new field: it took shape from the late 1980s onward, gained popularity around 2015, and within a few years became a highly regarded, high-paying industry.&lt;br&gt;
Regardless of the field, data goes through the same stages: capture, maintain, process, analyze, communicate. After these stages, the aim is to use the data to facilitate strategic business decisions.&lt;/p&gt;

&lt;p&gt;To build a career in Data Science, an individual can either take the formal path – earning a degree in Computer Science or Business Information Technology, or a degree or Masters in Statistics/Mathematics – or go through apprenticeships, boot camps and certifications, or be self-taught.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;&lt;strong&gt;Courses and Certifications for career in Data Science&lt;/strong&gt;&lt;/em&gt;&lt;br&gt;
&lt;strong&gt;&lt;a href="https://www.coursera.org/professional-certificates/ibm-data-science" rel="noopener noreferrer"&gt;IBM Data Science Professional Certificate (Coursera)&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq60drtayn0nf6obv010j.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fq60drtayn0nf6obv010j.jpg" alt="Image description" width="489" height="100"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This is a beginner-level program; paced at 10 hours a week, it will take the learner about 5 months to finish, with a capstone project for their portfolio to cement the Data Science skills.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://www.udemy.com/course/data-science-machine-learning-data-analysis-python-r/?utm_source=adwords&amp;amp;utm_medium=udemyads&amp;amp;utm_campaign=DSA_Catchall_la.EN_cc.ROW&amp;amp;utm_content=deal4584&amp;amp;utm_term=_._ag_88010211481_._ad_535397282061_._kw__._de_c_._dm__._pl__._ti_dsa-392284169515_._li_1009824_._pd__._&amp;amp;matchtype=&amp;amp;gclid=CjwKCAjw4P6oBhBsEiwAKYVkqxBfMP2S4puPCJPHJfUXXnYw01zxGeghJS6-fKGLc2bFydhYkOoOYxoCXHwQAvD_BwE" rel="noopener noreferrer"&gt;Data Science, Machine Learning, Data Analysis, Python &amp;amp; R (Udemy)&lt;/a&gt;&lt;/strong&gt;&lt;br&gt;
&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffbjyq5wsvt32xzgqhrhi.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Ffbjyq5wsvt32xzgqhrhi.jpg" alt="Image description" width="471" height="223"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This is a free Data Science, Machine Learning, Data Analysis and Data Visualization course using Python and R, through which the learner gets the nitty-gritty of Data Science and Machine Learning in about 8 hours.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://www.freecodecamp.org/learn/scientific-computing-with-python/" rel="noopener noreferrer"&gt;FreeCodeCamp&lt;/a&gt;&lt;/strong&gt;&lt;br&gt;
&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc1v7d9r57x1n4xghtlmf.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc1v7d9r57x1n4xghtlmf.jpg" alt="Image description" width="575" height="276"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Python for everybody&lt;/strong&gt; is a free video course series that teaches the basics of using Python 3.&lt;br&gt;
The courses were created by Dr. Charles Severance (also known as Dr. Chuck). He is a Clinical Professor at the University of Michigan School of Information, where he teaches various technology-oriented courses including programming, database design, and web development.&lt;br&gt;
This course is self-paced. You’ll be able to learn consistently at your own convenience.&lt;/p&gt;

&lt;p&gt;In terms of job growth, all jobs in tech are expected to grow by 13% over the next 10 years, according to &lt;strong&gt;Shane Hummus&lt;/strong&gt; – a renowned data content creator on YouTube: &lt;a href="https://youtu.be/O9nf1CqjGzI?si=2E7U8UXHAlguavqv" rel="noopener noreferrer"&gt;https://youtu.be/O9nf1CqjGzI?si=2E7U8UXHAlguavqv&lt;/a&gt; &lt;/p&gt;

&lt;p&gt;A &lt;strong&gt;Data Scientist&lt;/strong&gt; is a professional responsible for collecting, analyzing and interpreting extremely large amounts of data.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Data Analysis&lt;/strong&gt; – The process of inspecting, cleansing, transforming and modeling data with the goal of discovering useful information, informing conclusions, and supporting decision-making.&lt;br&gt;
Data analysis involves working with smaller, structured datasets to answer specific questions or solve specific problems.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;&lt;strong&gt;Courses and Certifications for career in Data Analysis&lt;/strong&gt;&lt;/em&gt;&lt;br&gt;
&lt;strong&gt;&lt;a href="https://www.coursera.org/professional-certificates/google-data-analytics" rel="noopener noreferrer"&gt;Google Data Analytics&lt;/a&gt;&lt;/strong&gt;&lt;br&gt;
&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fobcldg97lk1sl7uncqjj.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fobcldg97lk1sl7uncqjj.jpg" alt="Image description" width="705" height="113"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This is a paid course fit for new learners in the world of Data Analysis. Even though it is subscription-based, the good news is that Financial Aid is available for those unable to raise the per-module cost of around $39 USD. The course covers Excel, R programming, SQL, Tableau, working with GCP – Google Cloud Platform – the foundations of Data Analysis, and much more.&lt;/p&gt;

&lt;p&gt;Much of what is learnt in Data Science is also covered in Data Analysis. Note, however, that a Data Scientist can work with a dataset in ways a Data Analyst cannot.&lt;/p&gt;

&lt;p&gt;A &lt;strong&gt;Data Analyst&lt;/strong&gt; is a professional able to use statistical methods to test hypotheses and draw conclusions from the data.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Data Engineering&lt;/strong&gt; – This refers to the art of designing and building systems for collecting, storing, and analyzing data at scale. It is a broad field with applications in just about every industry.&lt;br&gt;
A &lt;strong&gt;Data Engineer&lt;/strong&gt; is a professional responsible for laying the foundations for the acquisition, storage, transformation, and management of data in an organization. They manage the design, creation, and maintenance of database architecture and data processing systems; this ensures that the subsequent work of analysis, visualization, and machine learning models development can be carried out seamlessly, continuously, securely, and effectively. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Analytics Engineering&lt;/strong&gt; – This is a multi-disciplinary role and can be defined as the discipline of engineering applied to the practice of analytics and big data.&lt;br&gt;
An Analytics Engineer is a professional who blends technical expertise with domain knowledge to craft meaningful insights from data and deliver them to users in a timely manner.&lt;br&gt;
(Separate article coming up)&lt;/p&gt;

&lt;h2&gt;
  
  
  Roadmap to Data Science
&lt;/h2&gt;

&lt;p&gt;A roadmap is a procedure that determines a goal or desired outcome and features the significant steps or milestones required to reach it.&lt;br&gt;
&lt;em&gt;Learn the fundamentals of data science.&lt;/em&gt;&lt;br&gt;
&lt;strong&gt;Programming&lt;/strong&gt;: Good knowledge of programming is a key asset for any aspiring Data Scientist. The top languages are Python and R. Choose one tool and build the skills to help you do the job.&lt;br&gt;
&lt;strong&gt;Statistics&lt;/strong&gt;: Learn probability, descriptive statistics, and inferential statistics.&lt;br&gt;
&lt;strong&gt;Machine learning&lt;/strong&gt;: a subfield of Computer Science that gives computers the ability to learn without being explicitly programmed. Machine Learning is used in image recognition, natural language processing, and fraud detection.&lt;br&gt;
&lt;strong&gt;Visualization&lt;/strong&gt;: once data preparation is done, a Data Scientist might go a step further and create visualizations using Tableau, Power BI and other industry-standard tools and software.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;&lt;strong&gt;Gain experience with data science tools&lt;/strong&gt;&lt;/em&gt;&lt;br&gt;
&lt;strong&gt;Python libraries&lt;/strong&gt;: NumPy, Pandas, Matplotlib, Seaborn, and Scikit-learn are some of the most popular Python libraries for Data Science. They provide a wide range of functions for data manipulation, visualization and Machine Learning.&lt;/p&gt;
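
&lt;p&gt;As a taste of these libraries working together, here is a tiny sketch (with made-up numbers) that prepares data with Pandas and fits a model with Scikit-learn:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import pandas as pd
from sklearn.linear_model import LinearRegression

# Made-up data: hours studied vs. exam score.
df = pd.DataFrame({"hours": [1, 2, 3, 4, 5], "score": [52, 58, 65, 70, 78]})

# Fit a simple linear regression and predict a new point.
model = LinearRegression().fit(df[["hours"]], df["score"])
print(model.predict(pd.DataFrame({"hours": [6]})))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;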

&lt;p&gt;&lt;strong&gt;SQL&lt;/strong&gt;: Structured Query Language is used to query and manage relational databases. MySQL, PostgreSQL, and Oracle are some of the most popular SQL databases.&lt;br&gt;
&lt;strong&gt;Cloud computing platforms&lt;/strong&gt;: Cloud computing platforms like AWS, Azure, and Google Cloud Platform offer a variety of services for data science, such as data storage, data processing, and machine learning.&lt;br&gt;
&lt;strong&gt;&lt;em&gt;Build a portfolio of data science projects&lt;/em&gt;&lt;/strong&gt;&lt;br&gt;
• Choose projects that are challenging yet realistic.&lt;br&gt;
• Document your work thoroughly – write an article about the project.&lt;br&gt;
• Make your code clean, reusable and concise.&lt;br&gt;
• Host your projects on a public platform like GitHub.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;&lt;strong&gt;Network with those in the field of Data Science&lt;/strong&gt;&lt;/em&gt;&lt;br&gt;
Tips for networking:&lt;br&gt;
• Be active on social media platforms like LinkedIn and Twitter.&lt;br&gt;
• Attend data science meetups and conferences.&lt;br&gt;
• Connect with other data scientists on LinkedIn.&lt;br&gt;
• Reach out to data scientists you admire and ask for advice.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;&lt;strong&gt;Apply for data science jobs&lt;/strong&gt;&lt;/em&gt;&lt;br&gt;
Tailor your resume and cover letter to each job you apply for.&lt;br&gt;
Highlight your relevant skills and experience in your resume and cover letter.&lt;br&gt;
Practice answering Data Science interview questions.&lt;/p&gt;

&lt;p&gt;That is all for this work. Happy kick-starting your Data Science career!&lt;/p&gt;

</description>
    </item>
  </channel>
</rss>
