<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: VICTOR MAINA</title>
    <description>The latest articles on DEV Community by VICTOR MAINA (@victormaina001).</description>
    <link>https://dev.to/victormaina001</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F931054%2Fdcb40a88-fa5b-4f6b-b9d2-934d8e5e7932.jpg</url>
      <title>DEV Community: VICTOR MAINA</title>
      <link>https://dev.to/victormaina001</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/victormaina001"/>
    <language>en</language>
    <item>
      <title>Introduction to Python for Data Engineering</title>
      <dc:creator>VICTOR MAINA</dc:creator>
      <pubDate>Thu, 29 Sep 2022 06:23:21 +0000</pubDate>
      <link>https://dev.to/victormaina001/introduction-to-python-for-data-engineering-53ap</link>
      <guid>https://dev.to/victormaina001/introduction-to-python-for-data-engineering-53ap</guid>
      <description>&lt;p&gt;Demand, storage and usage of data is increasingly becoming more of a “must” rather than “if”. It is estimated that “By** 2025*&lt;em&gt;, there will be **175 zettabytes&lt;/em&gt;* of data in the global data-sphere”. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Companies&lt;/strong&gt; are now placing a higher value on data and are discovering new ways to use it to their advantage. Data can be, and is being, used to analyze the current status of a business, forecast the future, model customers, avoid threats, and develop new goods. Data Engineering is the linchpin in all these activities.&lt;/p&gt;

&lt;p&gt;As Data Engineering lies at the core of handling and processing data, the next question that begs to be asked is &lt;em&gt;“What tools and technologies can be leveraged to derive maximum benefit from data with the least and simplest effort?”&lt;/em&gt; Python presents itself as an ideal candidate here. It is &lt;strong&gt;one of today’s most popular programming languages&lt;/strong&gt;, with endless applications in various fields, and it is ideally suited for deployment, analysis, and maintenance thanks to its &lt;strong&gt;flexible and dynamic nature&lt;/strong&gt;. Thus the concept of “Python for Data Engineering” is introduced as one of the most crucial skills required in data engineering: to create data pipelines, set up statistical models, and perform thorough analysis on them.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Python for Data Engineering&lt;/strong&gt; mainly comprises &lt;strong&gt;data wrangling&lt;/strong&gt; (reshaping, aggregating, joining disparate sources), &lt;strong&gt;small-scale ETL (Extract, Transform, Load), API interaction, and automation&lt;/strong&gt;.&lt;/p&gt;
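&lt;p&gt;To make this concrete, here is a minimal sketch of what small-scale ETL can look like in plain Python, using only the standard library; the order data and table name are hypothetical, purely for illustration:&lt;/p&gt;

```python
import csv
import io
import sqlite3

# Hypothetical source data, standing in for a real CSV file.
raw = "order_id,amount\n1,100\n2,250\n3,75\n"

# Extract: read rows from the CSV source.
rows = list(csv.DictReader(io.StringIO(raw)))

# Transform: keep orders of at least 100 and cast the fields to integers.
big_orders = [(int(r["order_id"]), int(r["amount"]))
              for r in rows if int(r["amount"]) >= 100]

# Load: write the transformed rows into a (here in-memory) SQL database.
con = sqlite3.connect(":memory:")
con.execute("CREATE TABLE orders (order_id INTEGER, amount INTEGER)")
con.executemany("INSERT INTO orders VALUES (?, ?)", big_orders)

total = con.execute("SELECT SUM(amount) FROM orders").fetchone()[0]
print(total)
```

&lt;p&gt;Real pipelines swap the in-memory pieces for files, APIs, and production databases, but the extract-transform-load shape stays the same.&lt;/p&gt;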

&lt;p&gt;For numerous reasons, Python is popular. Its ubiquity is one of the greatest advantages. Python is one of the world’s three leading programming languages. For instance, in November 2020 it ranked second in the TIOBE Community Index and third in the 2020 Developer Survey of Stack Overflow.&lt;br&gt;
Python is a general-purpose programming language. Because of its ease of use and its many libraries for accessing databases (Boto3, Psycopg2, MySQL connectors) and storage technologies, it has become a popular tool for executing ETL jobs. Many teams use Python for Data Engineering rather than an ETL tool because it is more versatile and powerful for these activities.&lt;br&gt;
Machine Learning and AI teams also use Python widely. Teams working closely together typically have to communicate in the same language, and Python is the lingua franca of the field.&lt;br&gt;
Another reason Python is popular is its use in technologies such as Apache Airflow, and in libraries for popular tools such as Apache Spark. If you have tools like these in your business, it is important to know the languages they are used with.&lt;br&gt;
Python developer community: there exists a very wide and rich Python community that offers solutions and support for bugs you might encounter.&lt;br&gt;
Python is also more popular for Data Engineering than Java. Python has a broad range of characteristics that distinguish it from other programming languages. Some of those features are given below:&lt;/p&gt;

&lt;p&gt;Ease of use: Both are expressive and we can achieve a high level of functionality with them, but Python is more user-friendly and concise. Python’s simple, easy-to-learn and easy-to-read syntax makes it easy to understand and helps you write shorter code than Java.&lt;br&gt;
Learning curve: In addition to having supportive communities, both are functional and object-oriented languages. Because of its high-level functional characteristics, Java is a bit more complex to master than Python. For simple, intuitive logic Python is preferable, whereas Java is better suited to complex workflows. Python provides concise syntax and good standard libraries.&lt;br&gt;
Wide applications: The biggest benefit of Python over Java is its simplicity of use in Data Science, Big Data, Data Mining, Artificial Intelligence, and Machine Learning.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Top 5 Python Packages used in Data Engineering:&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;1. Pandas&lt;/strong&gt;&lt;br&gt;
Pandas is an open-source Python package that offers high-performance, simple-to-use data structures and tools to analyze data. It is the ideal Python for Data Engineering tool to wrangle or manipulate data, and is meant to handle, read, aggregate, and visualize data quickly and easily.&lt;/p&gt;
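&lt;p&gt;As a small illustrative sketch (assuming pandas is installed; the sales figures below are made up), a typical wrangling step such as aggregating records per group looks like this:&lt;/p&gt;

```python
import pandas as pd

# Hypothetical sales records, purely for illustration.
sales = pd.DataFrame({
    "region": ["east", "west", "east", "west"],
    "amount": [100, 150, 200, 50],
})

# Wrangle: total sales per region with a single group-by aggregation.
totals = sales.groupby("region")["amount"].sum()
print(totals.to_dict())
```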

&lt;p&gt;&lt;strong&gt;2. pygrametl&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;pygrametl delivers commonly used programmatic ETL development functionalities and allows the user to rapidly build effective, fully programmable ETL flows.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;3. petl&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;petl is a Python library for the broad purpose of extracting, manipulating, and loading data tables. It offers a wide range of functions to transform tables with few lines of code, in addition to supporting data imports from CSV, JSON, and SQL.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;4. Beautiful Soup&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Beautiful Soup is a prominent web scraping and parsing tool on the data extraction front. It provides Python for Data Engineering tools to parse hierarchical information formats found on the web, such as HTML and XML documents.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;5. SciPy&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The SciPy module offers a large collection of numerical and scientific routines used in Python for Data Engineering, which an engineer uses to carry out computations and solve problems.&lt;/p&gt;
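&lt;p&gt;As a quick illustrative sketch (assuming SciPy is installed), numerically integrating a function with scipy.integrate is a typical use of these routines; the exact value of the integral below is 1/3:&lt;/p&gt;

```python
from scipy.integrate import quad

# Numerically integrate f(x) = x**2 over [0, 1]; the exact answer is 1/3.
value, error = quad(lambda x: x**2, 0, 1)
print(round(value, 6))
```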

</description>
    </item>
    <item>
      <title>Introduction to Data Engineering</title>
      <dc:creator>VICTOR MAINA</dc:creator>
      <pubDate>Thu, 29 Sep 2022 06:18:46 +0000</pubDate>
      <link>https://dev.to/victormaina001/introduction-to-data-engineering-45gm</link>
      <guid>https://dev.to/victormaina001/introduction-to-data-engineering-45gm</guid>
      <description>&lt;p&gt;what is Data Engineering and what is the role of a data engineer?&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--mbc2Z_0G--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/n8rtsji3wa0as219f0wp.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--mbc2Z_0G--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/n8rtsji3wa0as219f0wp.jpg" alt="Image description" width="850" height="350"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Of the many definitions you will find out there, here is a popular one I like: data engineering is the practice of designing and building systems for collecting, storing, and analyzing data at scale. The work data engineers do falls along the following lines:&lt;/p&gt;

&lt;p&gt;i) Data engineers work in a variety of settings to build systems that collect, manage, and convert raw data into usable information for data scientists and business analysts to interpret. Their ultimate goal is to make data accessible so that organizations can use it to evaluate and optimize their performance.&lt;/p&gt;

&lt;p&gt;ii) Data engineers design and build pipelines that transform and transport data into a format wherein, by the time it reaches the Data Scientists or other end users, it is in a highly usable state. These pipelines must take data from many disparate sources and collect them into a single warehouse that represents the data uniformly as a single source of truth.&lt;/p&gt;

&lt;p&gt;Why Data Engineering?&lt;/p&gt;

&lt;p&gt;Data engineering is a fairly new and emerging role. LinkedIn’s 2020 Emerging Jobs Report and Hired’s 2019 State of Software Engineers Report ranked Data Engineer jobs right up there with Data Scientist and Machine Learning Engineer. However, for some companies, especially those still finding their legs in data science or AI, it’s not always apparent what data engineering is, what role Data Engineers play within the analytics team, and what skills are required (and should be vetted) to do the job. A career in data engineering may be motivated by a passion for data, a desire to be a problem solver, or even the money factor (as data engineering is a fairly new position, the field is not yet saturated and demand is still very high).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--cRIt_CtK--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/mkrnqddwfore9oj6btpn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--cRIt_CtK--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/mkrnqddwfore9oj6btpn.png" alt="Image description" width="880" height="309"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Data Engineering Salary&lt;/strong&gt;&lt;br&gt;
Data engineering is a well-paying career. The average salary in the US is $115,176, with some data engineers earning as much as $168,000 per year, according to Glassdoor (May 2022) [4].&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Data Engineering Tools&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--FtcDL7VS--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/cpzxdp7n84488t32nwf4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--FtcDL7VS--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/cpzxdp7n84488t32nwf4.png" alt="Image description" width="800" height="479"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Apache Hadoop&lt;/strong&gt;: is a foundational data engineering framework for storing and analyzing massive amounts of information in a distributed processing environment. Rather than being a single entity, Hadoop is a collection of open-source tools such as HDFS (Hadoop Distributed File System) and the MapReduce distributed processing engine.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Apache Spark&lt;/strong&gt;: is a Hadoop-compatible data processing platform that, unlike MapReduce, can be used for real-time stream processing as well as batch processing. It is up to 100 times faster than MapReduce and seems to be in the process of displacing it in the Hadoop ecosystem. Spark features APIs for Python, Java, Scala, and R, and can run as a stand-alone platform independent of Hadoop.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Apache Kafka&lt;/strong&gt;: is today’s most widely used data collection and ingestion tool. Easy to set up and use, Kafka, is a high-performance platform that can stream large amounts of data into a target like Hadoop very quickly.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Apache Cassandra&lt;/strong&gt; is widely used to manage large amounts of data with lower latency for users and automatic replication to multiple nodes for fault-tolerance.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;SQL and NoSQL&lt;/strong&gt; (relational and non-relational databases) are foundational tools for data engineering applications. Historically, relational databases such as DB2 or Oracle have been the standard. But with modern applications increasingly handling massive amounts of unstructured, semi-structured, and even polymorphic data in real time, non-relational databases are now coming into their own.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Programming Languages:-&lt;br&gt;
Python&lt;/strong&gt; is a very popular general-purpose language. Widely used for statistical analysis tasks, it could be called the lingua franca of data science. Fluency in Python (along with SQL) appears as a requirement in over two-thirds of data engineer job listings.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;R&lt;/strong&gt; is a unique language with features that other programming languages lack. This vector language is finding use cases across multiple data science categories, from financial applications to genetics and medicine.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Java&lt;/strong&gt;, because of its high execution speeds, is the language of choice for building large-scale data systems. It is the foundation for the data engineering efforts of companies such as Facebook and Twitter. Hadoop is written mostly in Java.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Scala&lt;/strong&gt; is an extension of Java that is particularly suited for use with Apache Spark. In fact, Spark is written in Scala. Although Scala runs on the JVM (Java Virtual Machine), Scala code is cleaner and more concise than the Java equivalent.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;“Torture the data, and it will confess to anything.” — Ronald Coase&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Path to become a Data Engineer&lt;/strong&gt;&lt;br&gt;
With the right set of skills and knowledge, you can launch or advance a rewarding career in data engineering.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Develop your data engineering skills.
Learn the fundamentals of cloud computing (GCP, AWS, or Microsoft Azure), coding (the most preferred language being Python, thanks to its readability and vast community support), and databases and database design (Spark (PySpark, Spark SQL) and SQL (relational &amp;amp; NoSQL)) as a starting point for a career in data engineering.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;
Coding: Proficiency in coding languages is essential to this role, so consider taking courses to learn and practice your skills. Common programming languages include SQL, NoSQL, Python, Java, R, and Scala.&lt;br&gt;
Relational and non-relational databases: Databases rank among the most common solutions for data storage. You should be familiar with both relational and non-relational databases, and how they work.&lt;br&gt;
ETL (extract, transform, and load) systems: ETL is the process by which you’ll move data from databases and other sources into a single repository, like a data warehouse. Common ETL tools include Xplenty, Stitch, Alooma, and Talend.&lt;br&gt;
Data storage: Not all types of data should be stored the same way, especially when it comes to big data. As you design data solutions for a company, you’ll want to know when to use a data lake versus a data warehouse, for example.&lt;br&gt;
Automation and scripting: Automation is a necessary part of working with big data simply because organizations are able to collect so much information. You should be able to write scripts to automate repetitive tasks.&lt;br&gt;
Machine learning: While machine learning is more the concern of data scientists, it can be helpful to have a grasp of the basic concepts to better understand the needs of data scientists on your team.&lt;br&gt;
Big data tools: Data engineers don’t just work with regular data. They’re often tasked with managing big data. Tools and technologies are evolving and vary by company, but some popular ones include Hadoop, MongoDB, and Kafka.&lt;br&gt;
Cloud computing: You’ll need to understand cloud storage and cloud computing as companies increasingly trade physical servers for cloud services. Beginners may consider a course in Amazon Web Services (AWS) or Google Cloud.&lt;br&gt;
Data security: While some companies might have dedicated data security teams, many data engineers are still tasked with securely managing and storing data to protect it from loss or theft.&lt;/p&gt;
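&lt;p&gt;As a tiny illustrative sketch of the scripting skill above (standard library only; the file names are hypothetical), here is a script that automates a repetitive chore by sorting loose files into folders named after their extensions:&lt;/p&gt;

```python
import tempfile
from pathlib import Path

def organize_by_extension(folder):
    # Move every file into a subfolder named after its extension.
    folder = Path(folder)
    for f in list(folder.iterdir()):
        if f.is_file():
            target = folder / f.suffix.lstrip(".")
            target.mkdir(exist_ok=True)
            f.rename(target / f.name)

# Demonstrate on a throwaway directory with some hypothetical files.
with tempfile.TemporaryDirectory() as d:
    for name in ["report.csv", "notes.txt", "data.csv"]:
        (Path(d) / name).write_text("placeholder")
    organize_by_extension(d)
    csv_files = sorted(p.name for p in (Path(d) / "csv").iterdir())
print(csv_files)
```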

&lt;ol&gt;
&lt;li&gt;Get certified and learn from communities
A certification can validate your skills to potential employers, and preparing for a certification exam is an excellent way to develop your skills and knowledge. Options include the Associate Big Data Engineer, Cloudera Certified Professional Data Engineer, IBM Certified Data Engineer, and Google Cloud Certified Professional Data Engineer. Learning as a community also offers group support, with members learning together and encouraging one another (for example, through boot camps). Currently, here in Kenya, Data Science East Africa and Lux Tech Academy are running a boot camp dubbed “Data Engineering Mentorship Program by Data Science East Africa”.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Check out some job listings for roles you may want to apply for. If you notice a particular certification is frequently listed as required or recommended, that might be a good place to start.&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Build a portfolio of data engineering projects.
A portfolio is often a key component in a job search, as it shows recruiters, hiring managers, and potential employers what you can do.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;You can add data engineering projects you’ve completed independently or as part of coursework to a portfolio website (using a service like Wix or Squarespace). Alternatively, post your work to the Projects section of your LinkedIn profile or to a site like GitHub — both free alternatives to a standalone portfolio site.&lt;/p&gt;

&lt;p&gt;Brush up on your big data skills with a portfolio-ready Guided Project that you can complete in under two hours. Here are some options to get you started — no software downloads required:&lt;/p&gt;

&lt;p&gt;Create Your First NoSQL Database with MongoDB and Compass&lt;br&gt;
Database Design with SQL Server Management Studio (SSMS)&lt;br&gt;
Database Creation and Modeling using MYSQL Workbench&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Start with an entry-level position.
Many data engineers start off in entry-level roles, such as business intelligence analyst or database administrator. As you gain experience, you can pick up new skills and qualify for more advanced roles. See an example of a possible learning journey with this Data Engineering Career Learning Path from Coursera.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Do I need a degree to become a data engineer?&lt;/p&gt;

&lt;p&gt;It’s not necessary to have a degree to become a data engineer, though some companies might prefer candidates with at least a bachelor’s degree. If you’re interested in a career in data engineering and plan to pursue a degree, consider majoring in computer science, software engineering, data science, or information systems.&lt;/p&gt;

&lt;p&gt;Next steps&lt;br&gt;
Whether you’re just getting started or looking to pivot to a new career, start building job-ready skills for roles in data engineering. I hope this article gave you some perspective and insight to kick-start your journey.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>Data Structure and Algorithms 102: Deep Dive into Data Structure and Algorithms</title>
      <dc:creator>VICTOR MAINA</dc:creator>
      <pubDate>Thu, 29 Sep 2022 05:54:16 +0000</pubDate>
      <link>https://dev.to/victormaina001/data-structure-and-algorithms-102-deep-dive-into-data-structure-and-algorithms-5h2j</link>
      <guid>https://dev.to/victormaina001/data-structure-and-algorithms-102-deep-dive-into-data-structure-and-algorithms-5h2j</guid>
      <description>&lt;p&gt;Algorithmic knowledge combined with a good understanding of the implementation of Data Structures is a perfect blend for anyone who aspires to work in the IT sector. In this topic I am going to cover a bit of a deep dive into data structures and algorithms. There are multiple algorithms and the general ones are:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Greedy Algorithm&lt;/li&gt;
&lt;li&gt;Dynamic Programming&lt;/li&gt;
&lt;li&gt;Divide &amp;amp; Conquer Algorithm&lt;/li&gt;
&lt;li&gt;Branch and Bound Algorithm&lt;/li&gt;
&lt;li&gt;Backtracking&lt;/li&gt;
&lt;li&gt;Brute-Force Algorithm&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx9r4rrz1x66n3xgx8i18.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fx9r4rrz1x66n3xgx8i18.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;There are also other families of algorithms, such as searching algorithms, sorting algorithms, and domain-specific ML algorithms.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Searching Algorithms:&lt;/strong&gt;&lt;br&gt;
Linear Search — O(n)&lt;br&gt;
Binary Search — O(log n)&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Sorting Algorithms:&lt;/strong&gt;&lt;br&gt;
Selection Sort&lt;br&gt;
Insertion Sort&lt;br&gt;
Bubble Sort&lt;br&gt;
Merge Sort&lt;br&gt;
Quick Sort&lt;br&gt;
Heap Sort&lt;br&gt;
Radix Sort&lt;br&gt;
Bucket Sort&lt;/p&gt;
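&lt;p&gt;To illustrate one of the sorting algorithms listed above, here is a plain Python sketch of Merge Sort, a divide-and-conquer algorithm that runs in O(n log n) time:&lt;/p&gt;

```python
def merge_sort(items):
    # Base case: lists of length 0 or 1 are already sorted.
    if len(items) in (0, 1):
        return list(items)
    # Divide: split the list in half and sort each half recursively.
    mid = len(items) // 2
    left = merge_sort(items[:mid])
    right = merge_sort(items[mid:])
    # Conquer: merge the two sorted halves in linear time.
    merged = []
    i = j = 0
    while i != len(left) and j != len(right):
        if left[i] > right[j]:
            merged.append(right[j])
            j += 1
        else:
            merged.append(left[i])
            i += 1
    merged.extend(left[i:])
    merged.extend(right[j:])
    return merged

print(merge_sort([5, 2, 9, 1, 5, 6]))
```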

&lt;p&gt;&lt;strong&gt;Asymptotic Notations&lt;/strong&gt;&lt;br&gt;
Asymptotic notations are the mathematical notations used to describe the running time of an algorithm when the input tends towards a particular value or a limiting value.&lt;/p&gt;

&lt;p&gt;For example: In bubble sort, when the input array is already sorted, the time taken by the algorithm is linear i.e. the best case.&lt;/p&gt;

&lt;p&gt;But, when the input array is in reverse condition, the algorithm takes the maximum time (quadratic) to sort the elements i.e. the worst case.&lt;/p&gt;

&lt;p&gt;When the input array is neither sorted nor in reverse order, then it takes average time. These durations are denoted using asymptotic notations.&lt;/p&gt;
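&lt;p&gt;The bubble sort behaviour described above can be sketched in Python as follows; the early-exit flag is what makes the already-sorted best case linear, while a reversed input forces the full quadratic number of passes:&lt;/p&gt;

```python
def bubble_sort(data):
    # Repeatedly swap adjacent out-of-order pairs; stop as soon as a
    # full pass makes no swaps (one pass on sorted input: the best case).
    items = list(data)
    for end in range(len(items) - 1, 0, -1):
        swapped = False
        for i in range(end):
            if items[i] > items[i + 1]:
                items[i], items[i + 1] = items[i + 1], items[i]
                swapped = True
        if not swapped:
            break
    return items

print(bubble_sort([4, 3, 2, 1]))
```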

&lt;p&gt;There are mainly &lt;strong&gt;three&lt;/strong&gt; asymptotic notations:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Big Theta (Θ)&lt;/li&gt;
&lt;li&gt;Big Oh (O)&lt;/li&gt;
&lt;li&gt;Big Omega (Ω)&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F88bb1rucqobwpkhcmv7c.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F88bb1rucqobwpkhcmv7c.png" alt="Image description"&gt;&lt;/a&gt;&lt;br&gt;
Big-O notation represents the upper bound of the running time of an algorithm. Thus, it gives the worst-case complexity of an algorithm.&lt;/p&gt;

&lt;p&gt;It tells us that a certain function will never exceed a specified time for any value of input.&lt;/p&gt;

&lt;p&gt;Since it gives the worst-case running time of an algorithm, it is widely used to analyze an algorithm as we are always interested in the worst-case scenario.&lt;/p&gt;

&lt;p&gt;Omega Notation (Ω-notation)&lt;br&gt;
Omega notation represents the lower bound of the running time of an algorithm. Thus, it provides the best case complexity of an algorithm.&lt;/p&gt;

&lt;p&gt;This always indicates the minimum time required by an algorithm for all input values; when we state the time complexity of an algorithm in big-Ω form, we mean that the algorithm will take at least this much time to complete its execution.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fukbvjasy4tkivr6saef0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fukbvjasy4tkivr6saef0.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Theta Notation (Θ-notation)&lt;br&gt;
Theta notation encloses the function from above and below. Since it represents the upper and the lower bound of the running time of an algorithm, it is used for analyzing the average-case complexity of an algorithm.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1kbtx6k0tso8q9apdup3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F1kbtx6k0tso8q9apdup3.png" alt="Image description"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Space complexity&lt;/strong&gt; is the amount of memory used by the algorithm (including the input values to the algorithm) to execute and produce the result. Sometimes Auxiliary Space is confused with Space Complexity, but Auxiliary Space is the extra or temporary space used by the algorithm during its execution.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Space Complexity = Auxiliary Space + Input space&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Memory Usage while Execution&lt;br&gt;
While executing, an algorithm uses memory space for three reasons:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Instruction Space
It’s the amount of memory used to save the compiled version of instructions.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;2. Environmental Stack&lt;/p&gt;

&lt;p&gt;Sometimes an algorithm(function) may be called inside another algorithm(function). In such a situation, the current variables are pushed onto the system stack, where they wait for further execution and then the call to the inside algorithm(function) is made.&lt;/p&gt;

&lt;p&gt;3. Data Space&lt;/p&gt;

&lt;p&gt;Amount of space used by the variables and constants.&lt;/p&gt;

&lt;p&gt;While calculating the Space Complexity of any algorithm, we usually consider only the Data Space and neglect the Instruction Space and the Environmental Stack.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Calculating the Space Complexity&lt;/strong&gt;&lt;br&gt;
For calculating the space complexity, we need to know the amount of memory used by different types of datatype variables, which generally varies between operating systems, but the method for calculating the space complexity remains the same.&lt;/p&gt;

&lt;p&gt;bool, char, unsigned char, signed char, __int8 — 1 byte&lt;/p&gt;

&lt;p&gt;__int16, short, unsigned short, wchar_t, __wchar_t — 2 bytes&lt;/p&gt;

&lt;p&gt;float, __int32, int, unsigned int, long, unsigned long — 4 bytes&lt;/p&gt;

&lt;p&gt;double, __int64, long double, long long — 8 bytes&lt;/p&gt;

&lt;p&gt;Example:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;{
    int z = a + b + c;
    return(z);
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;a, b, c and z are all integer types, hence each takes up 4 bytes, so the total memory requirement will be (4(4) + 4) = 20 bytes; the additional 4 bytes is for the return value. Because this space requirement is fixed for the above example, it is called Constant Space Complexity.&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;int sum(int a[], int n)
{
    int x = 0;  // 4 bytes for x
    for(int i = 0; i &amp;lt; n; i++)  // 4 bytes for i
    {
        x = x + a[i];
    }
    return(x);
}
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;In the above code, 4*n bytes of space is required for the array a[] elements.&lt;br&gt;
4 bytes each for x, n, i and the return value.&lt;br&gt;
Hence the total memory requirement will be (4n + 12) bytes, which increases linearly with the input size n, hence it is called Linear Space Complexity.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Time Complexity of Algorithms&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Time complexity of an algorithm represents the amount of time required by the algorithm to run to completion. Time requirements can be defined as a numerical function T(n), where T(n) can be measured as the number of steps, provided each step consumes constant time.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Calculating time complexities: examples&lt;/strong&gt;&lt;br&gt;
Constant Time — O(1)&lt;br&gt;
An algorithm is said to have constant time complexity when it is not dependent on the input data (n). No matter the size of the input data, the running time will always be the same. For example:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;if a &amp;gt; b:
    return True
else:
    return False
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Now, let’s take a look at the function get_first which returns the first element of a list:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def get_first(data):
    return data[0]

if __name__ == '__main__':
    data = [1, 2, 9, 8, 3, 4, 7, 6, 5]
    print(get_first(data))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Independently of the input data size, it will always have the same running time since it only gets the first value from the list.&lt;/p&gt;

&lt;p&gt;An algorithm with constant time complexity is excellent since we don’t need to worry about the input size.&lt;/p&gt;

&lt;p&gt;Logarithmic Time — O(log n)&lt;br&gt;
An algorithm is said to have logarithmic time complexity when it reduces the size of the input data in each step (it does not need to look at all values of the input data), for example by repeatedly halving the remaining work:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;n = len(data)
while n &amp;gt; 1:
    n = n // 2  # the problem size halves on every step
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Algorithms with logarithmic time complexity are commonly found in operations on binary trees or when using binary search. Let’s take a look at the example of a binary search, where we need to find the position of an element in a sorted list:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;def binary_search(data, value):
    n = len(data)
    left = 0
    right = n - 1
    while left &amp;lt;= right:
        middle = (left + right) // 2
        if value &amp;lt; data[middle]:
            right = middle - 1
        elif value &amp;gt; data[middle]:
            left = middle + 1
        else:
            return middle
    raise ValueError('Value is not in the list')

if __name__ == '__main__':
    data = [1, 2, 3, 4, 5, 6, 7, 8, 9]
    print(binary_search(data, 8))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;

&lt;p&gt;Steps of the binary search:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Calculate the middle of the list.&lt;/li&gt;
&lt;li&gt;If the searched value is lower than the value in the middle of the list, set a new right boundary.&lt;/li&gt;
&lt;li&gt;If the searched value is higher than the value in the middle of the list, set a new left boundary.&lt;/li&gt;
&lt;li&gt;If the searched value is equal to the value in the middle of the list, return the middle (the index).&lt;/li&gt;
&lt;li&gt;Repeat the steps above until the value is found or the left boundary is higher than the right boundary.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It is important to understand that an algorithm that must access all elements of its input data cannot take logarithmic time, as the time taken for reading input of size n is of the order of n.&lt;/p&gt;

&lt;p&gt;The common time complexity classes, in order of growth, are:&lt;/p&gt;

&lt;p&gt;Constant Time O(1), Logarithmic Time O(log n), Linear Time O(n), Quasilinear Time O(n log n), Quadratic Time O(n²), Exponential Time O(2^n), and Factorial Time O(n!).&lt;/p&gt;
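&lt;p&gt;To build intuition for how quickly these classes diverge, here is a small illustrative sketch (an addition for this article, with a made-up helper name) that tabulates approximate step counts for a few input sizes:&lt;/p&gt;

```python
import math

def steps(n):
    """Approximate step counts for common complexity classes at input size n."""
    return {
        "O(1)": 1,
        "O(log n)": math.ceil(math.log2(n)),
        "O(n)": n,
        "O(n log n)": n * math.ceil(math.log2(n)),
        "O(n^2)": n ** 2,
    }

# Print a small growth table for increasing input sizes
for n in (10, 100, 1000):
    print(n, steps(n))
```

&lt;p&gt;Even at n = 1000, O(n²) already costs a million steps while O(log n) costs about ten — which is why choosing the right algorithm matters far more than micro-optimizing code.&lt;/p&gt;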

</description>
    </item>
    <item>
      <title>Introduction to Data Structures and Algorithms</title>
      <dc:creator>VICTOR MAINA</dc:creator>
      <pubDate>Thu, 29 Sep 2022 05:44:42 +0000</pubDate>
      <link>https://dev.to/victormaina001/introduction-to-data-structures-and-algorithms-4m8b</link>
      <guid>https://dev.to/victormaina001/introduction-to-data-structures-and-algorithms-4m8b</guid>
      <description>&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--q34gWD1_--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/z9fd79k9lp28t2swpfhu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--q34gWD1_--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/z9fd79k9lp28t2swpfhu.png" alt="Image description" width="880" height="495"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Data structures and algorithms are “components” of software development that enable one to write efficient software — software with minimal memory and storage needs.&lt;/p&gt;

&lt;p&gt;Data Structures — These can be thought of as the ingredients you need to build efficient algorithms. They are ways of arranging data so that the data items can be used efficiently in main memory. Examples: Array, Stack, Linked List.&lt;/p&gt;

&lt;p&gt;Algorithms — These are sequences of steps performed on data, using efficient data structures, to solve a given problem, be it a basic or a real-life one. Examples include sorting an array or adding two numbers and displaying the result. Simply put, an algorithm is the path between a problem and a solution.&lt;/p&gt;

&lt;p&gt;Characteristics that make up a good algorithm include:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Input specified&lt;/li&gt;
&lt;li&gt;Output specified&lt;/li&gt;
&lt;li&gt;Definiteness&lt;/li&gt;
&lt;li&gt;Effectiveness&lt;/li&gt;
&lt;li&gt;Finiteness&lt;/li&gt;
&lt;li&gt;Independence&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;In recent years, skills tests on data structures and algorithms have become more and more crucial in software development. This stresses the importance of learning and developing this skill, and having it in your arsenal puts you ahead of the crowd.&lt;/p&gt;

&lt;p&gt;Algorithm example for adding two numbers:&lt;/p&gt;

&lt;p&gt;1. Start.&lt;/p&gt;

&lt;p&gt;2. Prompt for the first number as input (num1).&lt;/p&gt;

&lt;p&gt;3. Prompt for the second number as input (num2).&lt;/p&gt;

&lt;p&gt;4. Assign the user input to num1 and num2 (num1 ← first number, num2 ← second number).&lt;/p&gt;

&lt;p&gt;5. Add num1 and num2 and assign the result to sum (sum ← num1 + num2).&lt;/p&gt;

&lt;p&gt;6. Print sum.&lt;/p&gt;

&lt;p&gt;7. Stop.&lt;/p&gt;
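&lt;p&gt;The steps above translate directly into Python. This is a minimal sketch (the function name is my own; the variable is called total because sum is a Python built-in):&lt;/p&gt;

```python
def add_two_numbers(num1, num2):
    # Add num1 and num2 and assign the result (sum <- num1 + num2)
    total = num1 + num2
    return total

# Prompting would normally look like: num1 = int(input("First number: "))
# Here we pass the numbers directly and print the sum
print(add_two_numbers(3, 4))  # prints 7
```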

&lt;p&gt;&lt;strong&gt;Types of Data Structures:&lt;/strong&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Primitive data structures&lt;/li&gt;
&lt;li&gt;Non-primitive data structures&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Primitive Data Structures&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The primitive data structures are essentially the bare data types: int, char, float, double, and pointer, each of which can hold only a single value.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Non-Primitive Data Structures&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The non-primitive data structure is divided into two types:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Linear data structure&lt;/li&gt;
&lt;li&gt;Non-linear data structure&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;strong&gt;Linear Data Structures&lt;/strong&gt;&lt;br&gt;
The arrangement of data is sequential, hence the name “linear data structure”; examples include Arrays, Linked Lists, Stacks, and Queues. In these data structures, each element is connected to only one other element in a linear form.&lt;/p&gt;
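&lt;p&gt;As a quick sketch of two of these linear structures in Python (using only the standard library; collections.deque is the usual choice for a queue):&lt;/p&gt;

```python
from collections import deque

# Stack: last-in, first-out (LIFO), using a plain list
stack = []
stack.append(1)
stack.append(2)
stack.append(3)
top = stack.pop()        # removes and returns 3, the most recently added element

# Queue: first-in, first-out (FIFO), using deque for O(1) removal at the front
queue = deque()
queue.append(1)
queue.append(2)
queue.append(3)
front = queue.popleft()  # removes and returns 1, the earliest added element

print(top, front)  # prints: 3 1
```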

&lt;p&gt;Non-linear data structures are essentially the opposite of linear data structures: elements are not limited to a connection with only one other element. Examples include maps, graphs, and trees.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Static and Dynamic Data Structures&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Static data structures are fixed in size, while dynamic ones can grow or shrink. The size here refers to the space the data structure occupies. The size of a static data structure therefore cannot be changed, only its contents.&lt;/p&gt;

&lt;p&gt;Dynamic data structures are flexible, allowing you to change both the number of elements and their contents at runtime. Each of the two has different use cases depending on what you want to do in your program.&lt;/p&gt;
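&lt;p&gt;Python’s built-in list is a dynamic data structure. Python has no true static array, but as a rough sketch, a pre-allocated list whose size is never changed mimics one: the contents can be replaced in place while the size stays constant.&lt;/p&gt;

```python
# Dynamic: a list grows and shrinks at runtime
items = []
for i in range(5):
    items.append(i)  # size increases with each append
items.pop()          # size decreases again

# Static (mimicked): pre-allocate a fixed number of slots and only
# assign in place - content changes, size does not
fixed = [0] * 3
fixed[0] = 9

print(len(items), fixed)  # prints: 4 [9, 0, 0]
```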

&lt;p&gt;This has been a brief introduction to Data Structures and Algorithms; I hope you had an amazing read.&lt;/p&gt;

</description>
    </item>
  </channel>
</rss>
