<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: SeattleDataGuy</title>
    <description>The latest articles on DEV Community by SeattleDataGuy (@seattledataguy).</description>
    <link>https://dev.to/seattledataguy</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F177966%2F0f8eb580-5b46-4ba7-8390-18d85b04c7be.jpg</url>
      <title>DEV Community: SeattleDataGuy</title>
      <link>https://dev.to/seattledataguy</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/seattledataguy"/>
    <language>en</language>
    <item>
      <title>Let’s Move Fast And Get Rid Of Data Engineers</title>
      <dc:creator>SeattleDataGuy</dc:creator>
      <pubDate>Sat, 17 Sep 2022 20:57:53 +0000</pubDate>
      <link>https://dev.to/seattledataguy/lets-move-fast-and-get-rid-of-data-engineers-4d4e</link>
      <guid>https://dev.to/seattledataguy/lets-move-fast-and-get-rid-of-data-engineers-4d4e</guid>
      <description>&lt;p&gt;We live in a data world that wants to move fast.&lt;/p&gt;

&lt;p&gt;Just shove your data into a data lake and we'll figure it out later.&lt;/p&gt;

&lt;p&gt;Create this table for this one dashboard that we will only look at once, and then it will join the hundreds (if not thousands) of other dashboards that are ignored.&lt;/p&gt;

&lt;p&gt;All of which can work in the short term but often leaves behind a lot of technical debt. One project I recently worked on literally dumped all of its raw data into an S3 bucket with no breakdown by source or timing, and it was quite chaotic to grasp what was going on.&lt;/p&gt;

&lt;p&gt;All of this is driven by executives who need their data yesterday and don't want to wait, as well as by the software and analyst teams, who are often driven by very different motivations.&lt;/p&gt;

&lt;p&gt;With all this pressure to move fast coming from all sides, an interesting solution I have come across a few times is: let's just get rid of data engineering and governance.&lt;/p&gt;

&lt;h2&gt;
  
  
  Let's Just Get Rid Of Data Engineers?
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmiro.medium.com%2Fmax%2F1400%2F0%2AfMupendJH50wJaFu.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmiro.medium.com%2Fmax%2F1400%2F0%2AfMupendJH50wJaFu.png" alt="data engineering diagram"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://youtu.be/QU7jqnSbZQQ?t=448" rel="noopener noreferrer"&gt;Inspired By Codestrap&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;A recent conversation I had with a head of data forced me to pause.&lt;/p&gt;

&lt;p&gt;They brought up how several organizations they had worked for had made the decision to cut data engineering. This allowed their analysts and data scientists to use dbt and &lt;a href="https://www.theseattledataguy.com/scaling-airflow-astronomer-vs-cloud-composer-vs-managed-workflows-for-apache-airflow/" rel="noopener noreferrer"&gt;Cloud Composer&lt;/a&gt; to build tables as needed. After all, these end-users knew what data they needed and were now empowered to get it.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://medium.com/airbnb-engineering/data-quality-at-airbnb-e582465f3ef7#:~:text=The%20Data%20Quality%20Initiative,-In%20early%202019&amp;amp;text=Ensure%20clear%20ownership%20for%20all,is%20trustworthy%20and%20routinely%20validated" rel="noopener noreferrer"&gt;Now in a world where companies like Airbnb recently realized why they &lt;/a&gt;&lt;a href="https://medium.com/airbnb-engineering/data-quality-at-airbnb-e582465f3ef7#:~:text=The%20Data%20Quality%20Initiative,-In%20early%202019&amp;amp;text=Ensure%20clear%20ownership%20for%20all,is%20trustworthy%20and%20routinely%20validated" rel="noopener noreferrer"&gt;need &lt;/a&gt;&lt;a href="https://medium.com/airbnb-engineering/data-quality-at-airbnb-e582465f3ef7#:~:text=The%20Data%20Quality%20Initiative,-In%20early%202019&amp;amp;text=Ensure%20clear%20ownership%20for%20all,is%20trustworthy%20and%20routinely%20validated" rel="noopener noreferrer"&gt;to increase their data engineering initiatives this is always odd to hear.&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Of course, the end result for these companies was that they did need to re-invest in data engineering.&lt;/p&gt;

&lt;p&gt;But why cut data engineering in the first place?&lt;/p&gt;

&lt;p&gt;Costs aside, data engineering is often viewed as a bottleneck.&lt;/p&gt;

&lt;p&gt;If you're a software engineer, you're likely getting chased down by data engineers who want you to tell them before you update tables on the application. Maybe they are even asking you to add in &lt;a href="https://dataproducts.substack.com/p/the-rise-of-data-contracts?r=7hcrb&amp;amp;s=w&amp;amp;utm_campaign=post&amp;amp;utm_medium=web" rel="noopener noreferrer"&gt;extra layers like data contracts&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Yikes. You don't want that. This means you won't be able to deliver that feature before your next review cycle.&lt;/p&gt;
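&lt;p&gt;To make the idea concrete, here is a minimal, hypothetical sketch of what a data contract check might look like. The schema and field names are invented for illustration; real contracts are usually enforced with dedicated tooling, but the core idea is just a schema the producing team promises to honor:&lt;/p&gt;

```python
# A minimal data contract sketch: the producing team declares the schema it
# promises to emit, and a check runs before rows are published downstream.
EXPECTED_SCHEMA = {
    "user_id": int,
    "event_name": str,
    "occurred_at": str,  # an ISO-8601 timestamp
}

def violates_contract(record):
    """Return a list of human-readable violations for one record."""
    problems = []
    for field, expected_type in EXPECTED_SCHEMA.items():
        if field not in record:
            problems.append("missing field: " + field)
        elif not isinstance(record[field], expected_type):
            problems.append(field + " has the wrong type")
    return problems

good = {"user_id": 42, "event_name": "signup", "occurred_at": "2022-09-17T20:57:53Z"}
bad = {"user_id": "42", "event_name": "signup"}

print(violates_contract(good))  # []
print(violates_contract(bad))   # ['user_id has the wrong type', 'missing field: occurred_at']
```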

&lt;p&gt;If you're on the data science and analyst side, well, then you often have to wait for &lt;a href="https://www.youtube.com/watch?v=w24hpnPg-sk&amp;amp;t=1s" rel="noopener noreferrer"&gt;data engineering&lt;/a&gt; to find time to pull your data into "core tables". Some of these &lt;a href="https://seattledataguy.substack.com/p/setting-standards-for-your-data-team" rel="noopener noreferrer"&gt;data engineers might even insist on implementing some level of standardization&lt;/a&gt; and data governance.&lt;/p&gt;

&lt;p&gt;All of which means slower access to data.&lt;/p&gt;

&lt;p&gt;There is a reason that data engineering and data governance exist. But if you really do want to get rid of your data engineering teams, then here are some points you must consider.&lt;/p&gt;

&lt;h2&gt;
  
  
  Tracking Historical Data
&lt;/h2&gt;

&lt;p&gt;On a few projects I completed earlier this year, I noticed a lack of historical data tracking. This will eventually lead to problems when your management asks:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;How much revenue did we have per customer in NYC year over year?&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Because all you will be reporting on is the current state of the data and not how it has changed over time.&lt;/p&gt;

&lt;p&gt;The classic way to track changes in dimensions and entities is to use SCD or &lt;a href="https://towardsdatascience.com/data-analysts-primer-to-slowly-changing-dimensions-d087c8327e08" rel="noopener noreferrer"&gt;slowly changing dimensions&lt;/a&gt;.&lt;/p&gt;
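&lt;p&gt;As a rough illustration, here is a Type 2 SCD sketched against an in-memory SQLite table. The table and column names are illustrative, not a production design; the point is that a change closes out the old row rather than overwriting it:&lt;/p&gt;

```python
import sqlite3

# Type 2 slowly changing dimension sketch: instead of overwriting a
# customer's city, we close out the old row and insert a new one, so
# questions can be answered "as of" any date.
conn = sqlite3.connect(":memory:")
conn.execute("""
    CREATE TABLE dim_customer (
        customer_id INTEGER,
        city        TEXT,
        valid_from  TEXT,
        valid_to    TEXT   -- NULL marks the currently active row
    )
""")

def update_city(customer_id, new_city, change_date):
    # Close the active row, then open a new one starting at change_date.
    conn.execute(
        "UPDATE dim_customer SET valid_to = ? "
        "WHERE customer_id = ? AND valid_to IS NULL",
        (change_date, customer_id),
    )
    conn.execute(
        "INSERT INTO dim_customer VALUES (?, ?, ?, NULL)",
        (customer_id, new_city, change_date),
    )

conn.execute("INSERT INTO dim_customer VALUES (1, 'NYC', '2020-01-01', NULL)")
update_city(1, "Seattle", "2021-06-01")

# Where did customer 1 live at the end of 2020?
row = conn.execute(
    "SELECT city FROM dim_customer "
    "WHERE customer_id = 1 AND ? >= valid_from "
    "AND (valid_to IS NULL OR valid_to > ?)",
    ("2020-12-31", "2020-12-31"),
).fetchone()
print(row[0])  # NYC
```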

&lt;p&gt;But there are other options.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.linkedin.com/in/ACoAAA5yYyYBYUSJbRhZcqtR3-YIXUwGM6vz-aE" rel="noopener noreferrer"&gt;Ahmed Elsamadisi&lt;/a&gt; has discussed the idea of an &lt;a href="https://www.activityschema.com/" rel="noopener noreferrer"&gt;activity schema&lt;/a&gt; that can help track changes on entities in a streaming fashion.&lt;/p&gt;

&lt;p&gt;And some other companies just store a snapshot of each day in a separate partition of a table (not my favorite). The result of all of these approaches is that an end-user can track historical data over time, meaning that when an analyst is asked about a year-over-year comparison, they can answer accurately.&lt;/p&gt;
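&lt;p&gt;A toy sketch of that snapshot-partition approach, with made-up revenue figures: each day's full table state lives in its own partition (modeled here as a dict keyed by snapshot date), so a year-over-year question becomes two lookups:&lt;/p&gt;

```python
# Daily snapshots modeled as partitions keyed by date; the figures are
# hypothetical and only illustrate the lookup pattern.
snapshots = {
    "2021-09-17": {"nyc_revenue": 120_000},
    "2022-09-17": {"nyc_revenue": 150_000},
}

def yoy_change(metric, current_date, prior_date):
    current = snapshots[current_date][metric]
    prior = snapshots[prior_date][metric]
    return (current - prior) / prior

print(yoy_change("nyc_revenue", "2022-09-17", "2021-09-17"))  # 0.25
```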

&lt;p&gt;All that said, I would say there is a far more difficult problem that most companies need to deal with.&lt;/p&gt;

&lt;h2&gt;
  
  
  Data Integration
&lt;/h2&gt;

&lt;p&gt;As pointed out by &lt;a href="https://www.linkedin.com/pulse/snowflake-critique-bill-inmon/?trackingId=2BENsSmxTgWxcc7v6MOyTQ%3D%3D" rel="noopener noreferrer"&gt;Bill Inmon earlier this year&lt;/a&gt;:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;In order to be a data warehouse an organization must integrate data. If you don't do integration of data then you are not a data warehouse...&lt;/p&gt;

&lt;p&gt;...But when it comes to integrating data I don't see that Snowflake understands that at all. If anything Snowflake is an embodiment of the Kimball model.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;There are many cases where in order to move fast a team might load data from each data source as is without thinking about how it integrates with all the other various sources.&lt;/p&gt;

&lt;p&gt;While working at Facebook, yes we were spoiled by the free lunches.&lt;/p&gt;

&lt;p&gt;But truthfully, as a data engineer, I felt spoiled by how well integrated all the data was. Compared to when I worked at a hospital, trying to integrate a finance system with a project management system whose project id field was an open text field that allowed users to put multiple project ids into it (as well as the occasional budget number), Facebook was amazing.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmiro.medium.com%2Fmax%2F1400%2F0%2A_PpNr1dewq8RGlsk.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmiro.medium.com%2Fmax%2F1400%2F0%2A_PpNr1dewq8RGlsk.png" alt="data engineering integration"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;It probably didn't help that the field was called "project number", not "project id"...&lt;/p&gt;

&lt;p&gt;This was because the data sources all interacted with each other on an application level. Meaning there had to be ids that the sources themselves shared.&lt;/p&gt;

&lt;p&gt;In fact, instead of having to figure out how we would join data together, we often had to figure out which IDs we would remove to avoid confusing analysts on which ID is the core ID to join on.&lt;/p&gt;
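&lt;p&gt;A toy example of what that shared-ID integration buys you (the records here are made up): two sources can only be combined because the application layer gave them a common key:&lt;/p&gt;

```python
# Two hypothetical sources sharing a user_id assigned at the application
# level; joining them is trivial because the key is unambiguous.
orders = [{"order_id": 1, "user_id": 7, "total": 30.0}]
users = {7: {"user_id": 7, "city": "Seattle"}}

enriched = [
    {**order, "city": users[order["user_id"]]["city"]}
    for order in orders
]
print(enriched[0])  # {'order_id': 1, 'user_id': 7, 'total': 30.0, 'city': 'Seattle'}
```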

&lt;p&gt;Data integration is often skipped when your team is just ingesting data via EL. Why think about integration? There is a current report that needs to be delivered, and it doesn't require bringing in another data source. The problem is that eventually it will. Eventually, someone will want to ask about data from multiple sources.&lt;/p&gt;

&lt;p&gt;In the end, you start running into not only integration issues but also governance issues.&lt;/p&gt;

&lt;h2&gt;
  
  
  Data Governance
&lt;/h2&gt;

&lt;p&gt;Finally, governance. At large companies, the necessity of data governance is usually pretty obvious. There are committees that spend hours deciding how they will implement the various processes and policies to protect, standardize, and better manage the use of data.&lt;/p&gt;

&lt;p&gt;Entire departments are dedicated to these tasks. In these cases, the data engineering team is usually not involved in these decisions, or at the very least they are not the SME. They might be the ones that programmatically implement the policies made by data governance, but they are not the actual experts.&lt;/p&gt;

&lt;p&gt;Of course, these companies are also dealing with tens if not hundreds of applications (many of them duplicates).&lt;/p&gt;

&lt;p&gt;But there are a lot of data systems in the SMB and mid-market space too, and not having some form of data governance strategy in a world that is becoming more data-aware poses a lot of risks.&lt;/p&gt;

&lt;h2&gt;
  
  
  Everything Has A Cost
&lt;/h2&gt;

&lt;p&gt;Moving fast can work out early on because much of the modeling and data pipeline debt isn't apparent (and I didn't even mention data quality). However, as a company grows and its data needs mature, limitations will bubble to the surface.&lt;/p&gt;

&lt;p&gt;An executive will ask a question and the data team won't be able to answer it.&lt;/p&gt;

&lt;p&gt;The ML team will invest $500k into a model, only to figure out that all of the data was wrong or came from a table that no one supports anymore.&lt;/p&gt;

&lt;p&gt;And &lt;a href="https://seattledataguy.substack.com/p/why-are-we-still-struggling-to-answer" rel="noopener noreferrer"&gt;we will just add to the reasons that data teams fail&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;All of which will expose the true cost of moving fast.&lt;/p&gt;

</description>
      <category>database</category>
      <category>datascience</category>
      <category>watercooler</category>
    </item>
    <item>
      <title>How To Start Your Next Data Engineering Project</title>
      <dc:creator>SeattleDataGuy</dc:creator>
      <pubDate>Sat, 16 Apr 2022 21:08:30 +0000</pubDate>
      <link>https://dev.to/seattledataguy/how-to-start-your-next-data-engineering-project-p7c</link>
      <guid>https://dev.to/seattledataguy/how-to-start-your-next-data-engineering-project-p7c</guid>
      <description>&lt;p&gt;Photo by &lt;a href="https://unsplash.com/@sigmund?utm_source=medium&amp;amp;utm_medium=referral" rel="noopener noreferrer"&gt;Sigmund&lt;/a&gt; on &lt;a href="https://unsplash.com/?utm_source=medium&amp;amp;utm_medium=referral" rel="noopener noreferrer"&gt;Unsplash&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Many programmers who are just starting out struggle with starting new data engineering projects. In our recent poll on YouTube, most viewers admitted that they have the most difficulty with even starting a data engineering project. The most common reasons noted in the poll were:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt; Finding the right data sets for your project&lt;/li&gt;
&lt;li&gt; Deciding which tools to use&lt;/li&gt;
&lt;li&gt; Knowing what to do with the data once you have it&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Let's talk about each of these points, starting with the array of tools you have at your disposal.&lt;/p&gt;

&lt;h1&gt;
  
  
  Picking the Right Tools and Skills
&lt;/h1&gt;

&lt;p&gt;For the first part of this project I am going to borrow from &lt;a href="https://www.youtube.com/watch?v=XYKuslcJp7A" rel="noopener noreferrer"&gt;Thu Vu's advice&lt;/a&gt; for starting a data analytics project.&lt;/p&gt;

&lt;p&gt;Why not look at a data engineering job description on a head-hunting site and figure out what tools people are asking for? Nothing could be easier. We pulled up a job description from Smartsheet's data engineering team. It is easy to see the exact tools and skills they are looking for and match them to the skills you can offer.&lt;/p&gt;

&lt;p&gt;Below are several key tools that you will want to include.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmiro.medium.com%2Fmax%2F1400%2F0%2A7xlLzKWa2rzlTZwH.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmiro.medium.com%2Fmax%2F1400%2F0%2A7xlLzKWa2rzlTZwH.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  Cloud Platforms
&lt;/h1&gt;

&lt;p&gt;One of the first things you'll need for your &lt;a href="https://www.youtube.com/watch?v=385mKftVr3I&amp;amp;" rel="noopener noreferrer"&gt;data engineering project&lt;/a&gt; is a cloud platform like Amazon Web Services (AWS). We always recommend that beginning programmers learn cloud platforms early. A lot of a data engineer's work is not completed on-premises (anymore). We do almost all work in the cloud.&lt;/p&gt;

&lt;p&gt;So, pick a platform you prefer: Google Cloud Platform (GCP), AWS, or Microsoft Azure. Which of these three you pick isn't that important; what's important is that you pick one of them, because people understand that if you have used one, you can likely pick up another easily.&lt;/p&gt;

&lt;h1&gt;
  
  
  Workflow Management Platforms And Data Storage Analytic Systems
&lt;/h1&gt;

&lt;p&gt;In addition to picking a cloud provider, you will want to pick a tool to manage your automated workflows and a tool to manage the data itself. The job description referenced both Airflow and Snowflake.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;a href="https://www.youtube.com/watch?v=eTtrsGc4Wd4" rel="noopener noreferrer"&gt;Airflow --- It also would be a great idea to pick a cloud managed service like MWAA or Cloud Composer&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;  &lt;a href="https://seattledataguy.substack.com/p/snowflake-vs-bigquery-two-cloud-data?s=w" rel="noopener noreferrer"&gt;Snowflake --- Of course BigQuery is another solid choice&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These are not the only choices, by any stretch of the imagination. Other popular options for orchestration are &lt;a href="https://dagster.io/" rel="noopener noreferrer"&gt;&lt;em&gt;Dagster&lt;/em&gt;&lt;/a&gt; and &lt;a href="https://www.prefect.io/" rel="noopener noreferrer"&gt;&lt;em&gt;Prefect&lt;/em&gt;&lt;/a&gt;. We actually recommend starting with &lt;em&gt;Airflow&lt;/em&gt; and then looking at others as you get more familiar with the processes.&lt;/p&gt;

&lt;p&gt;Don't be concerned with learning all these tools at first; keep it simple. Just concentrate on one or two, since this is not our primary focus.&lt;/p&gt;

&lt;h1&gt;
  
  
  Picking Data Sets
&lt;/h1&gt;

&lt;p&gt;Data sets come in every color of the rainbow. Your focus shouldn't be to work with already processed data; your attention should center on developing a data pipeline and finding raw data sources. As the &lt;a href="https://www.theseattledataguy.com/is-data-engineering-for-you-maybe-youre-a-recovering-data-scientist/" rel="noopener noreferrer"&gt;data engineer,&lt;/a&gt; you will need to know how to set up a data pipeline and pull the data you need from a raw source. Some &lt;a href="https://www.youtube.com/watch?v=LJkVvNWlO0g" rel="noopener noreferrer"&gt;great sources of raw data sets&lt;/a&gt; are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;a href="https://openskynetwork.github.io/opensky-api/" rel="noopener noreferrer"&gt;OpenSky&lt;/a&gt;: provides information on where flights currently are and where they are going, and much, much more.&lt;/li&gt;
&lt;li&gt;  &lt;a href="https://www.space-track.org/" rel="noopener noreferrer"&gt;Spacetrack&lt;/a&gt;: is a tracking system that will track, identify, and catalog all artificial satellites orbiting the Earth.&lt;/li&gt;
&lt;li&gt;  Other data sets: &lt;a href="https://catalog.ldc.upenn.edu/LDC2008T19" rel="noopener noreferrer"&gt;New York Times Annotated Corpus&lt;/a&gt;, &lt;a href="https://guides.loc.gov/datasets/repositories" rel="noopener noreferrer"&gt;The Library of Congress Dataset Repository&lt;/a&gt;, the &lt;a href="https://data.gov/" rel="noopener noreferrer"&gt;U.S. government data sets&lt;/a&gt;, and these &lt;a href="https://www.tableau.com/learn/articles/free-public-data-sets" rel="noopener noreferrer"&gt;free public data sets&lt;/a&gt; at Tableau.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Pulling data involves a lot of work, so when you are picking your data sets, you will probably want to find an API to help you extract the data into a comma-separated values (CSV) or Parquet file to load into your data warehouse.&lt;/p&gt;
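&lt;p&gt;A minimal extract-and-load sketch of that step. In practice the JSON would come from an API call (for example, to OpenSky); here it is inlined so the example is self-contained, and an in-memory buffer stands in for the real file headed to the warehouse:&lt;/p&gt;

```python
import csv
import io
import json

# Hypothetical API payload, inlined instead of fetched over the network.
payload = json.loads(
    '[{"callsign": "UAL123", "origin": "KSEA"},'
    ' {"callsign": "DAL456", "origin": "KJFK"}]'
)

buffer = io.StringIO()  # stand-in for a real CSV file destined for the warehouse
writer = csv.DictWriter(buffer, fieldnames=["callsign", "origin"])
writer.writeheader()
writer.writerows(payload)

print(buffer.getvalue())
```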

&lt;p&gt;Now you know how to find your data sets and manipulate them. So, what do you do with all this raw data that you've got?&lt;/p&gt;

&lt;h1&gt;
  
  
  Visualize Your Data
&lt;/h1&gt;

&lt;p&gt;An easy way to display your data engineering work is to create dashboards with metrics. Even if you won't be building too many dashboards in the future, you will want to create some type of final project.&lt;/p&gt;

&lt;p&gt;Dashboards are an easy way to do so.&lt;/p&gt;

&lt;p&gt;Here are a few tools you can look into:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;a href="https://www.tableau.com/" rel="noopener noreferrer"&gt;Tableau&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;  &lt;a href="https://powerbi.microsoft.com/en-us/landing/free-account/?&amp;amp;ef_id=CjwKCAjwo8-SBhAlEiwAopc9W9kbHliQ_kOFZcScsos7JNyPDooe2wulwb1L__QfHuuPxgmkZbJKNBoCtboQAvD_BwE:G:s&amp;amp;OCID=AID2202141_SEM_CjwKCAjwo8-SBhAlEiwAopc9W9kbHliQ_kOFZcScsos7JNyPDooe2wulwb1L__QfHuuPxgmkZbJKNBoCtboQAvD_BwE:G:s&amp;amp;gclid=CjwKCAjwo8-SBhAlEiwAopc9W9kbHliQ_kOFZcScsos7JNyPDooe2wulwb1L__QfHuuPxgmkZbJKNBoCtboQAvD_BwE" rel="noopener noreferrer"&gt;PowerBI&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;  &lt;a href="https://d3js.org/" rel="noopener noreferrer"&gt;D3.js&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;With your data visualization tool selected you can now start to pick some metrics and questions you would like to track.&lt;/p&gt;

&lt;p&gt;Maybe you would like to know how many flights occur in a single day. To build on that, you may want to know destinations by flight, time, or length and distance of travel. Discern some basic metrics and compile a graph.&lt;/p&gt;
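&lt;p&gt;As a sketch, a "flights per day" metric is just an aggregation over raw records (the records below are hypothetical, shaped loosely like an API response):&lt;/p&gt;

```python
from collections import Counter

# Hypothetical raw flight records; deriving the metric is one aggregation.
flights = [
    {"callsign": "UAL123", "day": "2022-04-16"},
    {"callsign": "DAL456", "day": "2022-04-16"},
    {"callsign": "SWA789", "day": "2022-04-17"},
]

flights_per_day = Counter(f["day"] for f in flights)
print(flights_per_day)  # Counter({'2022-04-16': 2, '2022-04-17': 1})
```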

&lt;p&gt;Anything basic like this will help you get comfortable figuring out your question of "why?" This question needs to be answered before you begin your real project. Consider it a warm-up to get the juices flowing.&lt;/p&gt;

&lt;p&gt;Now let's go over a few project ideas that you could try out.&lt;/p&gt;

&lt;h1&gt;
  
  
  3 Data Engineering Projects: Ideas
&lt;/h1&gt;

&lt;p&gt;&lt;em&gt;Beginning Data Engineering Project Example&lt;/em&gt;: Consider using tools like &lt;a href="https://cloud.google.com/composer" rel="noopener noreferrer"&gt;Cloud Composer&lt;/a&gt; or &lt;a href="https://www.theseattledataguy.com/what-is-managed-workflows-for-apache-airflow-on-aws-and-why-companies-should-migrate-to-it/#page-content" rel="noopener noreferrer"&gt;Amazon Managed Workflows for Apache Airflow (MWAA)&lt;/a&gt;. These tools let you circumvent setting up &lt;em&gt;Airflow&lt;/em&gt; from scratch, which gives you more time to learn the functions of &lt;em&gt;Airflow&lt;/em&gt; without the hassle of figuring out how to set it up. From there, use an API such as &lt;a href="https://www.predictit.org/" rel="noopener noreferrer"&gt;PredictIt&lt;/a&gt; to pull the data, which is returned in Extensible Markup Language (XML).&lt;/p&gt;
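&lt;p&gt;A sketch of handling that kind of XML response. The element and field names below are hypothetical, and the document is built programmatically so the example is self-contained; in practice you would parse the raw API response body instead:&lt;/p&gt;

```python
import xml.etree.ElementTree as ET

# Build a stand-in XML document; a real pipeline would parse the API
# response with ET.fromstring() instead.
market = ET.Element("market")
ET.SubElement(market, "name").text = "Example market"
ET.SubElement(market, "lastTradePrice").text = "0.42"

# A downstream pipeline task extracts only the fields it needs:
price = float(market.find("lastTradePrice").text)
print(price)  # 0.42
```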

&lt;p&gt;Maybe you are looking for data on massive swings in trading over a day. You could create a model where you identify certain patterns or swings over the day. If you created a Twitter bot and posted about these swings day after day, some traders would definitely see the value in your posts and data.&lt;/p&gt;

&lt;p&gt;If you wanted to upgrade that idea, track down articles relating to that swing for discussion and post those too. There is definite value in that data, and it is a pretty simple thing to do. You are just using Cloud Composer to ingest the data, storing it in a data warehouse like &lt;a href="https://cloud.google.com/bigquery" rel="noopener noreferrer"&gt;BigQuery&lt;/a&gt; or &lt;em&gt;Snowflake&lt;/em&gt;, and creating a Twitter bot that posts the outputs to Twitter, all orchestrated by something like Airflow.&lt;/p&gt;

&lt;p&gt;It is a fun and simple project because you do not have to reinvent the wheel or invest time in setting up infrastructure.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Intermediate Example&lt;/em&gt;: This &lt;a href="https://www.startdataengineering.com/post/data-engineering-project-for-beginners-batch-edition/" rel="noopener noreferrer"&gt;data engineering project&lt;/a&gt; is brought to us by Start Data Engineering (SDE). While they seem to reference just a basic CSV file about movie reviews, a better option might be to go to the &lt;a href="https://developer.nytimes.com/" rel="noopener noreferrer"&gt;New York Times Developers Portal&lt;/a&gt; and use their API to pull live movie reviews. Use SDE's framework and customize it for your own needs.&lt;/p&gt;

&lt;p&gt;SDE has done a superb job of breaking this project down like a recipe. They tell you exactly what tools you need and when they will be required for your project. They list out the prerequisites you need:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;a href="https://www.docker.com/" rel="noopener noreferrer"&gt;Docker&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;  &lt;a href="http://aws.amazon.com/" rel="noopener noreferrer"&gt;Amazon Web Services (AWS) account&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;  &lt;a href="https://airflow.apache.org/" rel="noopener noreferrer"&gt;Airflow&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In this example, SDE shows you how to set up Apache Airflow from scratch rather than using the presets. You will also be exposed to tools such as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;a href="https://aws.amazon.com/s3/" rel="noopener noreferrer"&gt;Amazon S3&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;  &lt;a href="https://aws.amazon.com/redshift/" rel="noopener noreferrer"&gt;Amazon Redshift&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;  &lt;a href="https://theiam.org/" rel="noopener noreferrer"&gt;IAM&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;There are many components involved, so when you are out there catching the attention of potential future employers, this project will help you demonstrate the in-demand skills employers are looking for.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Advanced Example&lt;/em&gt;: For our advanced example, we will use mostly open-source tools. &lt;a href="https://www.sspaeti.com/blog/data-engineering-project-in-twenty-minutes/" rel="noopener noreferrer"&gt;Sspaeti's website gives a lot of inspiration&lt;/a&gt; for fantastic projects. For this project, they use a kitchen-sink variety of every open-source tool imaginable, such as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  Amazon S3&lt;/li&gt;
&lt;li&gt;  &lt;a href="https://spark.apache.org/" rel="noopener noreferrer"&gt;Apache Spark&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;  &lt;a href="https://delta.io/" rel="noopener noreferrer"&gt;Delta Lake&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;  &lt;a href="https://druid.apache.org/" rel="noopener noreferrer"&gt;Apache Druid&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;  &lt;a href="https://dagster.io/" rel="noopener noreferrer"&gt;Dagster&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In this project, you will scrape real estate data from an actual site, so it's going to be a marriage between an API and a website. You will scrape the website for data and clean up the HTML. You will also implement change data capture (CDC), data science, and visualizations with Superset.&lt;/p&gt;

&lt;p&gt;This can be a basic framework from which you can expand your own idea. These challenges are meant to push you and stretch your boundaries. But if you decide to build this, just pick a few tools and try it. Your project doesn't have to be exactly the same.&lt;/p&gt;

&lt;h1&gt;
  
  
  Now It's Your Turn
&lt;/h1&gt;

&lt;p&gt;If you are struggling to begin a project, hopefully, we've inspired you and given you the tools and direction in which to take that first step.&lt;/p&gt;

&lt;p&gt;Don't let hesitation or a lack of a clear premise keep you from starting. The first step is to commit to an idea, then execute it without delay, even if it is something as simple as a Twitter bot. That Twitter bot could be the inspiration for bigger and better ideas!&lt;/p&gt;

</description>
      <category>datascience</category>
      <category>database</category>
      <category>beginners</category>
      <category>programming</category>
    </item>
    <item>
      <title>Data Platforms Vs Data Warehouses Vs Data Virtualization And Federated SQL Engine</title>
      <dc:creator>SeattleDataGuy</dc:creator>
      <pubDate>Thu, 31 Mar 2022 03:26:09 +0000</pubDate>
      <link>https://dev.to/seattledataguy/data-platforms-vs-data-warehouses-vs-data-virtualization-and-federated-sql-engine-11i4</link>
      <guid>https://dev.to/seattledataguy/data-platforms-vs-data-warehouses-vs-data-virtualization-and-federated-sql-engine-11i4</guid>
      <description>&lt;p&gt;There is a lot of discussion on how you should implement your data stacks today. Should you bundle or unbundle, should you use a traditional data warehouse, lakes, lake houses, or mesh?&lt;/p&gt;

&lt;p&gt;Or a hybrid of all of them?&lt;/p&gt;

&lt;p&gt;With all these different ways you can put together your data stack, the question becomes: where do you even start? How can your team set up a successful data stack and a set of processes that ensure you can build reliable reporting quickly?&lt;/p&gt;

&lt;p&gt;Well, it all starts with data. Data is the base layer for a company's analytics projects. Whether you're building data science models, standardized reporting, dynamic pricing API end-points, or KPIs, these are the appliances that will run on the electricity that is data.&lt;/p&gt;

&lt;p&gt;With all these different ways companies can set up their data, tools to help them set up their core data layer have risen to answer every problem. &lt;/p&gt;

&lt;p&gt;In this article, we will discuss some of the popular routes data solution providers are taking in terms of types of data layers.&lt;/p&gt;

&lt;h3&gt;
  
  
  Strategies for developing your data layer
&lt;/h3&gt;

&lt;p&gt;Looking back a little over a decade ago, there were only one or two ways that were considered best practices in terms of storing data for analytics. &lt;/p&gt;

&lt;p&gt;Most companies would use some version of a standardized EDW (enterprise data warehouse). However, many of the standard best practices are being challenged. For example, data meshes have risen in popularity under the perception that companies can reduce the time from raw data to reportable data.&lt;/p&gt;

&lt;p&gt;Besides data meshes, there are also streaming data warehouses such as Materialize and Vectorized, virtual data layers, and federated SQL engines like &lt;a href="https://www.theseattledataguy.com/what-is-data-virtualization/#page-content"&gt;Denodo&lt;/a&gt;, &lt;a href="https://seattledataguy.substack.com/p/what-is-trino-and-why-is-it-great?s=w"&gt;Trino&lt;/a&gt;, and &lt;a href="https://varada.io/"&gt;Varada&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Not to mention the concept of a data platform is becoming more prominent as tools like Snowflake and Databricks fight for dominance in the data world.&lt;/p&gt;

&lt;h3&gt;
  
  
  Data Platforms
&lt;/h3&gt;

&lt;p&gt;When Snowflake first came out, they described themselves as the cloud's first data warehouse. But that was a decade ago. Snowflake needed to change with the times, and they now describe themselves as a Cloud Data Platform. Their data warehouse is still the core of their offerings. However, their recent purchase of Streamlit is a great example of how they are trying to morph into a data platform. They don't just want to own the storage and management; they want to own the value-add as well.&lt;/p&gt;

&lt;p&gt;This is likely due to several large competitors coming into the market. Databricks being one of those major players.&lt;/p&gt;

&lt;p&gt;On the other side, Databricks originated in academia and the open-source community; the company was founded in 2013 by the original creators of Apache Spark™, Delta Lake, and MLflow, who wanted to create a unified, open platform for all your data.&lt;/p&gt;

&lt;p&gt;The goal of both of these tools is to go beyond just a data warehouse or data lakehouse. Instead, they want to be the platform everyone else will build their applications and data stacks on, essentially the Microsoft or iPhone of data. Everything builds on their platform layer, whether it be internal company data products or third-party solutions like Fivetran and RudderStack.&lt;/p&gt;

&lt;p&gt;But neither Snowflake nor Databricks would be anywhere without the classic data warehouse.&lt;/p&gt;

&lt;h3&gt;
  
  
  Data Warehouse
&lt;/h3&gt;

&lt;p&gt;Data warehouses remain a standard choice when picking how to store your company's data. In particular, companies like Firebolt have decided not to get involved in all the new buzzwords and have instead made a stand as a data warehouse.&lt;/p&gt;

&lt;p&gt;In addition, there are plenty of other solutions that can be utilized in a data warehouse context. Some are more standard databases, like Postgres; others are modern cloud data warehouses like BigQuery, Teradata, or Redshift.&lt;/p&gt;

&lt;p&gt;In general, all of these systems have their nuances, and some aren't strictly a "data warehouse," but they are often built like one, and for now, none of them has a vision that goes much beyond the data warehouse. Data warehouses remain a popular design choice because we, as a community, have been building them forever. Most of us understand concepts such as fact and dimension tables as well as slowly changing dimensions. Even though most modern data warehouses take a less strict approach to many of these design concepts, they still play a role in developing many companies' core data layers.&lt;/p&gt;

&lt;p&gt;But these are usually built as more classic data warehouses (even if BigQuery isn't a fan of the standard data warehouse approach).&lt;/p&gt;

&lt;h3&gt;
  
  
  Streaming Data Warehouses
&lt;/h3&gt;

&lt;p&gt;Yet another form of analytical data storage system is the streaming database. For example, &lt;a href="https://materialize.com/"&gt;Materialize&lt;/a&gt; is a SQL streaming database startup built on top of the open-source Timely Dataflow project. It allows users to ask questions of live, streaming data, connecting directly to existing event streaming infrastructure, like Kafka, and to client applications.&lt;/p&gt;

&lt;p&gt;Engineers can interact with Materialize using a standard PostgreSQL interface, enabling plug-and-play integration with existing tooling. From there, data professionals can build data products such as operational dashboards without ETL pipelines.&lt;/p&gt;

&lt;p&gt;In addition, &lt;a href="https://vectorized.io/"&gt;Vectorized&lt;/a&gt; and Rockset are two other similar streaming database systems that allow developers to connect directly to data sources and run queries on top of them.&lt;/p&gt;

&lt;p&gt;All of these options are attempting to approach data management and reporting in a very different way.&lt;/p&gt;

&lt;h3&gt;
  
  
  Data Virtualization/Federated SQL Engine
&lt;/h3&gt;

&lt;p&gt;Some tools are working to directly query data across multiple data sources. &lt;/p&gt;

&lt;p&gt;One such example is &lt;a href="https://varada.io/"&gt;Varada&lt;/a&gt;. &lt;a href="https://varada.io/blog/data-lake/data-lake-economics-indexing/"&gt;Varada indexes data in Trino&lt;/a&gt; in a way that reduces the CPU time spent scanning data (via indexes) and frees the CPU for other tasks, like fetching data extremely fast or handling concurrency. This lets SQL users run a variety of queries, across dimensions, facts, other types of joins, and data lake analytics, on indexed data from federated data sources. SQL aggregations and grouping are accelerated using nanoblock indexes as well, resulting in highly effective SQL analytics. Paired with &lt;a href="https://varada.io/vcc/"&gt;Varada's Control Center&lt;/a&gt;, data engineers and system admins can more easily monitor their Presto/Trino clusters and make educated decisions about how to manage them.&lt;/p&gt;

&lt;p&gt;Another example is &lt;a href="https://www.theseattledataguy.com/what-is-data-virtualization/#page-content"&gt;Denodo&lt;/a&gt;, which allows you to connect to various data sources from Denodo itself, including Oracle, SQL Server, HBase, etc.&lt;/p&gt;

&lt;p&gt;It also uses VQL, which stands for Virtual Query Language. VQL is similar to SQL and lets developers create views on top of all the databases Denodo is connected to. This means you can connect different databases, join tables across them, and create new data sets.&lt;/p&gt;

&lt;p&gt;In a world that is pushing more toward a Data Mesh every day, tools such as Trino and Varada make a lot of sense. Instead of constantly requesting access to new sources of data in different data warehouses, why not just query all of the siloed warehouses directly? That's easier said than done, of course.&lt;/p&gt;

&lt;h3&gt;
  
  
  What Will Be Your Core Data Layer 
&lt;/h3&gt;

&lt;p&gt;There are many ways companies can set up one of their most important assets, and how a company's data is set up is crucial to its success.&lt;/p&gt;

&lt;p&gt;Whether you pick a data warehouse or utilize tools like Trino, the end goal is to create a robust data system that all end-users can trust.&lt;/p&gt;

&lt;p&gt;Of course, every choice comes with challenges. If you decide to build a data warehouse, it'll take time; if you pick a virtual data layer, there can be a lot of confusion about which data you should pull from.&lt;/p&gt;

&lt;p&gt;Why is all of this important? Setting up an easy-to-use data layer is crucial for businesses to succeed. So do take your time and consider which data tool is right for you. In the end, it's about creating a sustainable, robust, scalable, and cost-effective data system.&lt;/p&gt;

</description>
      <category>sql</category>
      <category>database</category>
      <category>datascience</category>
      <category>analytics</category>
    </item>
    <item>
      <title>16 Data Analytics Buzzwords You Need To Know</title>
      <dc:creator>SeattleDataGuy</dc:creator>
      <pubDate>Mon, 14 Mar 2022 21:28:31 +0000</pubDate>
      <link>https://dev.to/seattledataguy/16-data-analytics-buzzwords-you-need-to-know-182</link>
      <guid>https://dev.to/seattledataguy/16-data-analytics-buzzwords-you-need-to-know-182</guid>
      <description>&lt;p&gt;Photo by &lt;a href="https://unsplash.com/@rvignes?utm_source=medium&amp;amp;utm_medium=referral"&gt;Romain Vignes&lt;/a&gt; on &lt;a href="https://unsplash.com/?utm_source=medium&amp;amp;utm_medium=referral"&gt;Unsplash&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Analytics Engineer&lt;/strong&gt; = A BI Engineer who uses dbt&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Data Mesh&lt;/strong&gt; = We tried an EDW and it was taking too long so we went back to the siloed data approach but now we opened it up to the entire company&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Data Quality Monitoring&lt;/strong&gt; = A bunch of automated SQL statements&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Low Code/No Code&lt;/strong&gt; = Click, click, click, shoot how do I configure the underlying environment. Forget it, I am going back to Python&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Self Service Analytics&lt;/strong&gt; = Self-service after a data engineer or BIE spends 3 hours writing the query&lt;/p&gt;

&lt;p&gt;&lt;a href="https://seattledataguy.substack.com/p/the-baseline-datastack-going-beyond?utm_source=url"&gt;Modern Data Stack &lt;/a&gt;= Everything is SQL now, No more excel&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Source Of Truth&lt;/strong&gt; = I used to trust this, until I learned about Truth V2&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Reverse ETL&lt;/strong&gt; = LTE&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Data Lineage&lt;/strong&gt; = Some unmaintained tool that was accurate once&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Data Warehouse&lt;/strong&gt; = it's just an analytical database where you can drop all your data randomly and maybe spend some years later to try to model them&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;BI For BI Teams&lt;/strong&gt; = Analyzing your analysis tools&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Democratizing Data&lt;/strong&gt; = Let's give everyone access and see how many competing narratives we can get from the same data&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Data Observability&lt;/strong&gt; = Oh look we can parse Snowflake logs&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Data Scientist&lt;/strong&gt; = a Schrodinger's cat: either an analyst or a Stats PhD&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Lakehouse&lt;/strong&gt; = We couldn't set up a Data Lake so we salvaged whatever we could and told our users it now looks like the previous DWH.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Data Lake&lt;/strong&gt; = File server we just dump all our files in.&lt;/p&gt;

&lt;p&gt;Special thanks to &lt;a href="https://medium.com/u/d9259b41277a?source=post_page-----b6eb6d925ee4-----------------------------------"&gt;mehdio&lt;/a&gt;, &lt;a href="https://medium.com/u/a023a229ec8d?source=post_page-----b6eb6d925ee4-----------------------------------"&gt;Galen B&lt;/a&gt;, &lt;a href="https://www.linkedin.com/in/ethanaaron/"&gt;Ethan&lt;/a&gt;, &lt;a href="https://www.linkedin.com/in/laurenbalik/"&gt;Lauren&lt;/a&gt; and several others who all played a vital role in crystallizing these concepts.&lt;/p&gt;

&lt;p&gt;Read/Watch More Videos On Data And Data Analytics Below:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://dev.to/seattledataguy/3-sql-interview-tips-for-data-scientists-and-data-engineers-5hm6"&gt;3 SQL Interview Tips For Data Scientists And Data Engineers&lt;br&gt;
&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://dev.to/seattledataguy/5-tips-on-how-to-become-a-better-data-analyst-4geg"&gt;5 Tips On How To Become a Better Data Analyst&lt;br&gt;
&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.youtube.com/watch?v=qgSqLh_mhXE"&gt;How To Set Up A Successful Data Analytics Team&lt;/a&gt;&lt;/p&gt;

</description>
      <category>database</category>
      <category>datascience</category>
      <category>analytics</category>
      <category>sql</category>
    </item>
    <item>
      <title>3 SQL Interview Tips For Data Scientists And Data Engineers</title>
      <dc:creator>SeattleDataGuy</dc:creator>
      <pubDate>Wed, 16 Feb 2022 20:19:20 +0000</pubDate>
      <link>https://dev.to/seattledataguy/3-sql-interview-tips-for-data-scientists-and-data-engineers-5hm6</link>
      <guid>https://dev.to/seattledataguy/3-sql-interview-tips-for-data-scientists-and-data-engineers-5hm6</guid>
      <description>&lt;p&gt;Photo by &lt;a href="https://unsplash.com/@elisa_ventur?utm_source=medium&amp;amp;utm_medium=referral" rel="noopener noreferrer"&gt;Elisa Ventur&lt;/a&gt; on &lt;a href="https://unsplash.com/?utm_source=medium&amp;amp;utm_medium=referral" rel="noopener noreferrer"&gt;Unsplash&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;SQL has become a common skill requirement across industries and job profiles over the last decade.&lt;/p&gt;

&lt;p&gt;Companies like Amazon and Google often demand that their data analysts, data scientists, and product managers at least be familiar with SQL. This is because SQL remains the language of data.&lt;/p&gt;

&lt;p&gt;This has led to SQL being a staple of many data professionals' interview loops, meaning you had better plan to include SQL in your study sessions. That being said, just walking through SQL problems doesn't always cut it.&lt;/p&gt;

&lt;p&gt;In this article, I will go over three tips to help you improve your SQL, both for interviews and in general.&lt;/p&gt;

&lt;h2&gt;
  
  
  Attempt To Solve A SQL Problem In Multiple Ways
&lt;/h2&gt;

&lt;p&gt;There is more than one way to solve most SQL problems.&lt;/p&gt;

&lt;p&gt;But often we resort to the methods we have most recently used.&lt;/p&gt;

&lt;p&gt;For example, if you just learned about analytic functions, that can pigeonhole a lot of your future solutions. This is why it is important to try to solve SQL problems in multiple ways.&lt;/p&gt;

&lt;p&gt;For all of our problems, I will be using problem sets from InterviewQuery, a site data scientists can use to practice much more than SQL.&lt;/p&gt;

&lt;p&gt;But for now, let's look at a simple problem they offer.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;We're given two tables, a &lt;code&gt;users&lt;/code&gt; table with demographic information and the neighborhood they live in and a &lt;code&gt;neighborhoods&lt;/code&gt; table.&lt;/p&gt;

&lt;p&gt;Write a query that returns all neighborhoods that have 0 users.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The tables we have are listed below.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmiro.medium.com%2Fmax%2F1400%2F1%2AK9mmbdNkKeurvmS85jOaXQ.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmiro.medium.com%2Fmax%2F1400%2F1%2AK9mmbdNkKeurvmS85jOaXQ.png" alt="sql table for interviews"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Here we have two input tables and an expected output. Before reading further do you have any thoughts on how to get neighborhood names that have 0 users?&lt;/p&gt;

&lt;p&gt;What is your first answer?&lt;/p&gt;

&lt;p&gt;Regardless, let's look at a first example of a possible solution.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;N&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;neighborhoods&lt;/span&gt; &lt;span class="n"&gt;N&lt;/span&gt; 
&lt;span class="k"&gt;LEFT&lt;/span&gt; &lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;users&lt;/span&gt; &lt;span class="n"&gt;U&lt;/span&gt; 
&lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;N&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;U&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;neighborhood_id&lt;/span&gt;
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;U&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="k"&gt;is&lt;/span&gt; &lt;span class="k"&gt;null&lt;/span&gt; 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This solution relies on the LEFT JOIN: any neighborhood without users comes back with a u.id that is NULL.&lt;/p&gt;

&lt;p&gt;But this example can be a little confusing and, for some, less readable because it's not explicit. The LEFT JOIN makes it a little harder to follow, since you have to manage the logic in your head.&lt;/p&gt;

&lt;p&gt;Let's take a quick look at the table the LEFT JOIN would create without the filter.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmiro.medium.com%2Fmax%2F1400%2F1%2AOl2R33T_OFrx8vu6C9rnAA.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmiro.medium.com%2Fmax%2F1400%2F1%2AOl2R33T_OFrx8vu6C9rnAA.png" alt="sql for interviews"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Above, you will see that user_id has NULL values. This is because we are LEFT JOINing with neighborhoods as the left table, meaning that where there are no users, the neighborhood rows still appear.&lt;/p&gt;

&lt;p&gt;But, this is just one way to solve this problem. Let's write this query differently.&lt;/p&gt;

&lt;p&gt;Have any ideas?&lt;/p&gt;

&lt;p&gt;You can check out the query below, which is really the same thing but written in a more explicit way.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt;  &lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;neighborhoods&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt; 
&lt;span class="k"&gt;LEFT&lt;/span&gt; &lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;users&lt;/span&gt; &lt;span class="n"&gt;u&lt;/span&gt; 
&lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;u&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;neighborhood_id&lt;/span&gt;
&lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;n&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;
&lt;span class="k"&gt;HAVING&lt;/span&gt; &lt;span class="k"&gt;COUNT&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;u&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;0&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;This query uses a more explicit HAVING clause to count u.id values. COUNT(u.id) only counts rows where u.id is not NULL, so the query truly operates the same way.&lt;/p&gt;

&lt;p&gt;So if there isn't a matching neighborhood ID in the users table, the user ID comes back as NULL. Thus, filtering WHERE u.id IS NULL returns the same neighborhoods as HAVING COUNT(u.id) = 0.&lt;/p&gt;
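Since both formulations come up constantly in interviews, it is worth convincing yourself they agree. Here is a minimal check using Python's sqlite3 as a stand-in engine; the table names follow the problem, but the sample rows are invented:

```python
import sqlite3

# Toy version of the neighborhoods/users problem. Schema follows the article's
# queries; the sample rows are invented for illustration.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE neighborhoods (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE users (id INTEGER PRIMARY KEY, neighborhood_id INTEGER);
INSERT INTO neighborhoods VALUES (1, 'Madrona'), (2, 'Fremont'), (3, 'Ballard');
INSERT INTO users VALUES (1, 1), (2, 1);  -- nobody lives in Fremont or Ballard
""")

# Anti-join: neighborhoods whose joined user columns came back NULL.
anti_join = conn.execute("""
    SELECT n.name
    FROM neighborhoods n
    LEFT JOIN users u ON n.id = u.neighborhood_id
    WHERE u.id IS NULL
""").fetchall()

# Explicit version: COUNT(u.id) skips NULLs, so empty neighborhoods count 0.
having = conn.execute("""
    SELECT n.name
    FROM neighborhoods n
    LEFT JOIN users u ON n.id = u.neighborhood_id
    GROUP BY n.name
    HAVING COUNT(u.id) = 0
""").fetchall()

print(sorted(anti_join) == sorted(having))  # True
```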

&lt;h2&gt;
  
  
  Break Down Questions And Logic
&lt;/h2&gt;

&lt;p&gt;Many questions you get will require you to break the logic down into multiple steps. A straightforward example is when a question asks you to get the first instance of an event.&lt;/p&gt;

&lt;p&gt;InterviewQuery has a question like this where they ask:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Given a table of &lt;code&gt;song_plays&lt;/code&gt; and a table of &lt;code&gt;users&lt;/code&gt;, write a query to extract the earliest date each user played their third unique song.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;The tables are listed below:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmiro.medium.com%2Fmax%2F1400%2F1%2Arh4ymLPlWgoKxKUUmkGrWg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmiro.medium.com%2Fmax%2F1400%2F1%2Arh4ymLPlWgoKxKUUmkGrWg.png" alt="sql interviews"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Here, they have written out the fact that they want the third "unique" song. This should be a dead giveaway that there are multiple instances of song plays that would mess up your solution if you were just to implement a quick &lt;a href="https://www.sqlservertutorial.net/sql-server-window-functions/sql-server-row_number-function/" rel="noopener noreferrer"&gt;ROW_NUMBER&lt;/a&gt;() function.&lt;/p&gt;

&lt;p&gt;A great way to help simplify this is to start rotating some shapes.&lt;/p&gt;

&lt;p&gt;To start, let's create a data set that keeps only one instance of every play.&lt;/p&gt;

&lt;p&gt;What is interesting here is that I disagree with the solution provided by InterviewQuery, or at least I am not 100% behind it. To get each unique song and its first play date, you should be able to just use MIN().&lt;/p&gt;

&lt;p&gt;You can do this with the query below.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt; &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;song_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;min&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;created_at&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;created_at&lt;/span&gt;
 &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;song_plays&lt;/span&gt;
 &lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;song_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;user_id&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;As you can see, all we need to do is get the minimum play date for each song and user. This cleans up the messy repeated plays, making the data much easier to work with.&lt;/p&gt;

&lt;p&gt;InterviewQuery suggests using row_number() to get the first play of a song. I find this a little less clean, as you are then forced to run a second query to get the first value, whereas with the MIN() method you have one query and have already shrunk the data set down.&lt;/p&gt;
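To see the MIN() dedup step in action, here is a quick sketch on a couple of invented rows, again using sqlite3 as a stand-in engine:

```python
import sqlite3

# A repeated play of song_a should collapse to a single row with the earliest
# date. Column names follow the article's query; the rows are made up.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE song_plays (user_id INTEGER, song_name TEXT, created_at TEXT);
INSERT INTO song_plays VALUES
  (1, 'song_a', '2022-01-01'),
  (1, 'song_a', '2022-01-02'),  -- repeat play, should be dropped
  (1, 'song_b', '2022-01-03');
""")

rows = conn.execute("""
    SELECT song_name, user_id, MIN(created_at) AS created_at
    FROM song_plays
    GROUP BY song_name, user_id
""").fetchall()
print(sorted(rows))  # [('song_a', 1, '2022-01-01'), ('song_b', 1, '2022-01-03')]
```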

&lt;p&gt;But as I referenced earlier, there is more than one way to get a lot of SQL answers.&lt;/p&gt;

&lt;p&gt;In no way do I hate the row_number() function. In fact, in the next step, we will be using it. You can see it in the query below.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;t1&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;song_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;min&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;created_at&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;created_at&lt;/span&gt;
    &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;song_plays&lt;/span&gt;
    &lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;song_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;user_id&lt;/span&gt;
&lt;span class="p"&gt;)&lt;/span&gt;
  &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;u&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;t1&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;song_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;t1&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;created_at&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
  &lt;span class="n"&gt;row_number&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="n"&gt;over&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;partition&lt;/span&gt; &lt;span class="k"&gt;by&lt;/span&gt; &lt;span class="n"&gt;u&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="k"&gt;order&lt;/span&gt; &lt;span class="k"&gt;by&lt;/span&gt; &lt;span class="n"&gt;created_at&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;rn&lt;/span&gt; 
  &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;t1&lt;/span&gt; 
  &lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;users&lt;/span&gt; &lt;span class="n"&gt;u&lt;/span&gt; 
  &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;t1&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;u&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt; 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;From here, the rest should be self-explanatory. We need to query for the 3rd row number. This is shown below.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;with&lt;/span&gt; &lt;span class="n"&gt;t1&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;song_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;user_id&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="k"&gt;min&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;created_at&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;created_at&lt;/span&gt;
    &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;song_plays&lt;/span&gt;
    &lt;span class="k"&gt;GROUP&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="n"&gt;song_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;user_id&lt;/span&gt;
&lt;span class="p"&gt;),&lt;/span&gt;
&lt;span class="n"&gt;t2&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;
    &lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;u&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;t1&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;song_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;t1&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;created_at&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
    &lt;span class="n"&gt;row_number&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="n"&gt;over&lt;/span&gt; &lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="k"&gt;partition&lt;/span&gt; &lt;span class="k"&gt;by&lt;/span&gt; &lt;span class="n"&gt;u&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;name&lt;/span&gt; &lt;span class="k"&gt;order&lt;/span&gt; &lt;span class="k"&gt;by&lt;/span&gt; &lt;span class="n"&gt;created_at&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="n"&gt;rn&lt;/span&gt; 
    &lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;t1&lt;/span&gt; 
    &lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;users&lt;/span&gt; &lt;span class="n"&gt;u&lt;/span&gt; 
    &lt;span class="k"&gt;ON&lt;/span&gt; &lt;span class="n"&gt;t1&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;user_id&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;u&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt; 
&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="k"&gt;SELECT&lt;/span&gt; &lt;span class="n"&gt;name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;song_name&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;created_at&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;t2&lt;/span&gt; 
&lt;span class="k"&gt;WHERE&lt;/span&gt; &lt;span class="n"&gt;rn&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;There is one more problem. This actually won't solve InterviewQuery's question.&lt;/p&gt;

&lt;p&gt;This brings up the other issue I have with InterviewQuery's current solution for this problem. InterviewQuery wants us to return all users even if they didn't have a third song play. This wasn't referenced in the question, and I don't see the value. But these are all nits.&lt;/p&gt;

&lt;p&gt;They have a different solution that uses a LEFT JOIN and then returns both users with and without a third song play. So do be aware of this finicky detail.&lt;/p&gt;
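Putting the two CTEs together, the full query runs end to end on a small invented data set (sqlite3 as a stand-in engine; note that, per the nit above, this version only returns users who actually reached a third unique song):

```python
import sqlite3

# Invented song_plays/users rows to exercise the full two-CTE solution.
# Requires SQLite >= 3.25 for window functions.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE users (id INTEGER PRIMARY KEY, name TEXT);
CREATE TABLE song_plays (user_id INTEGER, song_name TEXT, created_at TEXT);
INSERT INTO users VALUES (1, 'ana'), (2, 'bob');
INSERT INTO song_plays VALUES
  (1, 'song_a', '2022-01-01'), (1, 'song_a', '2022-01-02'),  -- repeat play
  (1, 'song_b', '2022-01-03'), (1, 'song_c', '2022-01-05'),
  (2, 'song_a', '2022-01-01'), (2, 'song_b', '2022-01-02');  -- bob never hits 3 songs
""")

rows = conn.execute("""
WITH t1 AS (
    SELECT song_name, user_id, MIN(created_at) AS created_at
    FROM song_plays
    GROUP BY song_name, user_id
),
t2 AS (
    SELECT u.name, t1.song_name, t1.created_at,
           ROW_NUMBER() OVER (PARTITION BY u.name ORDER BY created_at) AS rn
    FROM t1
    JOIN users u ON t1.user_id = u.id
)
SELECT name, song_name, created_at FROM t2 WHERE rn = 3
""").fetchall()
print(rows)  # [('ana', 'song_c', '2022-01-05')]
```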

&lt;h2&gt;
  
  
  Learn How To Do Non-Standard Joins
&lt;/h2&gt;

&lt;p&gt;Many people are accustomed to straightforward joins with a basic "=".&lt;/p&gt;

&lt;p&gt;This is great, but the truth is JOINs can be used like WHERE clauses.&lt;/p&gt;

&lt;p&gt;There was a point when JOINs didn't exist and WHERE clauses were used as JOINs. I still occasionally see people who JOIN using a long string of WHERE clauses instead, because they either currently work with older systems or learned SQL at a time when JOINs weren't commonplace (&lt;a href="https://learnsql.com/blog/history-of-sql-standards/" rel="noopener noreferrer"&gt;explicit JOINs were added in SQL-92&lt;/a&gt;).&lt;/p&gt;

&lt;p&gt;What this means is you can use JOINs in a lot of interesting ways. For example, if you have ever had to calculate a rolling sum without a window function, you will know you can use the "&amp;gt; and &amp;lt;" signs to create duplicate rows that you will then aggregate.&lt;/p&gt;
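As a sketch of that rolling-sum trick (table and column names invented, sqlite3 used as the engine): joining each row to itself and every earlier row duplicates the history, and aggregating collapses it into a running total.

```python
import sqlite3

# Rolling sum without a window function: the "<=" join fans each row out to
# itself plus all earlier rows, and SUM() collapses the duplicates.
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE daily_sales (day INTEGER PRIMARY KEY, amount INTEGER);
INSERT INTO daily_sales VALUES (1, 10), (2, 20), (3, 5);
""")

rows = conn.execute("""
    SELECT a.day, SUM(b.amount) AS running_total
    FROM daily_sales a
    JOIN daily_sales b ON b.day <= a.day
    GROUP BY a.day
    ORDER BY a.day
""").fetchall()
print(rows)  # [(1, 10), (2, 30), (3, 35)]
```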

&lt;p&gt;In our final question we are asked:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Given a table of students and their SAT test scores, write a query to return the two students with the closest test scores with the score difference.&lt;/p&gt;

&lt;p&gt;If there are multiple students with the same minimum score difference, select the student name combination that is higher in the alphabet.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;This is what the table looks like:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmiro.medium.com%2Fmax%2F1400%2F1%2AoRtcD7d1avyzgFLbgt9RMw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmiro.medium.com%2Fmax%2F1400%2F1%2AoRtcD7d1avyzgFLbgt9RMw.png" alt="database interviews"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Take a moment to figure out how you might tackle this question.&lt;/p&gt;

&lt;p&gt;Don't rush here. There are a lot of ways you could overcomplicate this question in particular.&lt;/p&gt;

&lt;p&gt;The answer is pretty short.&lt;/p&gt;

&lt;p&gt;But it all starts with you needing to create a combination of every student and every student's score.&lt;/p&gt;

&lt;p&gt;Let's look at a possible solution and break down why it works.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight sql"&gt;&lt;code&gt;&lt;span class="k"&gt;SELECT&lt;/span&gt; 
&lt;span class="n"&gt;t1&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;student&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;one_student&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="n"&gt;t2&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;student&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;other_student&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;
&lt;span class="k"&gt;abs&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;t1&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;score&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt; &lt;span class="n"&gt;t2&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;score&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="k"&gt;as&lt;/span&gt; &lt;span class="n"&gt;score_diff&lt;/span&gt;
&lt;span class="k"&gt;FROM&lt;/span&gt; &lt;span class="n"&gt;scores&lt;/span&gt; &lt;span class="n"&gt;t1&lt;/span&gt;
&lt;span class="k"&gt;JOIN&lt;/span&gt; &lt;span class="n"&gt;scores&lt;/span&gt; &lt;span class="n"&gt;t2&lt;/span&gt; 
   &lt;span class="k"&gt;ON&lt;/span&gt;  &lt;span class="n"&gt;t1&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt; &lt;span class="n"&gt;t2&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;id&lt;/span&gt;
&lt;span class="k"&gt;ORDER&lt;/span&gt; &lt;span class="k"&gt;BY&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="mi"&gt;2&lt;/span&gt; 
&lt;span class="k"&gt;limit&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;So why does this work? We are telling the query to do a self join in all cases where the ID is smaller than the ID we are joining on.&lt;/p&gt;

&lt;p&gt;Meaning we are creating a table that looks like the table below:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmiro.medium.com%2Fmax%2F1400%2F1%2ABTNp7r4FM2jxb7lGQ7RfDQ.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmiro.medium.com%2Fmax%2F1400%2F1%2ABTNp7r4FM2jxb7lGQ7RfDQ.png" alt="coding interviews"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Now that we have every possible pairing of student IDs, we can just compute the ABS() score difference and set the order properly.&lt;/p&gt;

&lt;p&gt;That's it.&lt;/p&gt;

&lt;p&gt;One final point I will add: I did eventually decide to use "!=" instead of "&amp;lt;". The less-than condition only seemed to work with the specific set of names, and I am not 100% sure the ordering would still hold if the names were changed. Using "!=" guarantees that we test every combination of names in both orders, so if the more "alphabetic" solution were actually "3--2" rather than "2--3", we check and order both sets.&lt;/p&gt;
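&lt;p&gt;To make this concrete, here is a minimal, runnable sketch of the same self-join using Python's built-in sqlite3 module. The table layout mirrors the query above, but the ids, names, and scores are made-up sample data for illustration:&lt;/p&gt;

```python
import sqlite3

# In-memory database mirroring the scores table from the query above;
# the ids, names, and scores here are made-up sample data.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE scores (id INTEGER, name TEXT, score INTEGER)")
conn.executemany(
    "INSERT INTO scores VALUES (?, ?, ?)",
    [(1, "Alice", 90), (2, "Bob", 85), (3, "Carol", 88)],
)

# Self-join on t1.id != t2.id so every ordered pair of rows is compared;
# ORDER BY the score difference (column 3), then the ids, and LIMIT 1
# keeps the closest pair.
row = conn.execute(
    """
    SELECT t1.id, t2.id, abs(t1.score - t2.score) AS score_diff
    FROM scores t1
    JOIN scores t2 ON t1.id != t2.id
    ORDER BY 3, 1, 2
    LIMIT 1
    """
).fetchone()
print(row)  # (1, 3, 2): Alice and Carol are only 2 points apart
```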

&lt;h2&gt;
  
  
  How Will You Take Your SQL To The Next Level?
&lt;/h2&gt;

&lt;p&gt;SQL looks like it is here to stay; if anything, it is picking up speed. The challenge is how to take your SQL to the next level, and it's not just about learning new syntax.&lt;/p&gt;

&lt;p&gt;It's about learning how even simple clauses can manipulate the data you're working with. In doing so, you will write SQL at a much higher level and be better prepared for your &lt;a href="https://www.youtube.com/watch?v=W2L7Q3J-ei0" rel="noopener noreferrer"&gt;SQL interviews&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Whether you're a &lt;a href="https://medium.com/p/c3824cb76c2e" rel="noopener noreferrer"&gt;data scientist&lt;/a&gt;, &lt;a href="https://www.youtube.com/watch?v=PEsCumjfBW0" rel="noopener noreferrer"&gt;data engineer&lt;/a&gt;, or &lt;a href="https://medium.com/coriers/5-tips-on-how-to-become-a-better-data-analyst-30f226c09c39" rel="noopener noreferrer"&gt;analyst&lt;/a&gt;, I wish you the best of luck!&lt;/p&gt;

&lt;p&gt;If you enjoyed this article, then check out some of my other content:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://seattledataguy.substack.com/p/which-managed-version-of-airflow" rel="noopener noreferrer"&gt;Which Managed Version Of Airflow Should You Use?&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.youtube.com/watch?v=wuiJWSz-OLQ" rel="noopener noreferrer"&gt;5 Tips To Become A Great Analyst&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.theseattledataguy.com/what-is-trino-and-why-is-it-great-at-processing-big-data/#page-content" rel="noopener noreferrer"&gt;What Is Trino And How It Manages Big Data&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.youtube.com/watch?v=bqCXVpRqTpE" rel="noopener noreferrer"&gt;What I Learned From 100+ Data Engineering Interviews - Interview Tips&lt;/a&gt;&lt;/p&gt;

</description>
      <category>sql</category>
      <category>database</category>
      <category>datascience</category>
      <category>analytics</category>
    </item>
    <item>
      <title>5 Tips On How To Become a Better Data Analyst</title>
      <dc:creator>SeattleDataGuy</dc:creator>
      <pubDate>Sat, 12 Feb 2022 18:36:10 +0000</pubDate>
      <link>https://dev.to/seattledataguy/5-tips-on-how-to-become-a-better-data-analyst-4geg</link>
      <guid>https://dev.to/seattledataguy/5-tips-on-how-to-become-a-better-data-analyst-4geg</guid>
      <description>&lt;p&gt;&lt;a href="https://www.franklin.edu/blog/how-become-a-data-analyst"&gt;Image Source&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Becoming a data analyst is one of the most common early-career moves for people educated in STEM fields. Not to mention, according to the &lt;a href="https://www.franklin.edu/blog/how-become-a-data-analyst"&gt;Bureau of Labor Statistics&lt;/a&gt;, the demand for data analysts will grow about 20% over the next few years.&lt;/p&gt;

&lt;p&gt;Working as an analyst allows you to put skills learned in college to work while also giving you an opportunity to develop new skills on the job. Average salaries ranging from $60,000 to $80,000 also make data analyst positions lucrative starting points for younger professionals.&lt;/p&gt;

&lt;p&gt;As companies have become more data-driven, the skills that go into working as an analyst have grown more technical. Many analysts today are skilled in programming languages like Python and R that are suitable for processing large data sets. Before developing these highly technical skills, however, there are some basic &lt;a href="https://www.youtube.com/watch?v=wuiJWSz-OLQ"&gt;tips and tricks that all data analysts should learn&lt;/a&gt;. Here are some of the fundamentals you should focus on in order to become a better data analyst or data engineer.&lt;/p&gt;

&lt;h1&gt;
  
  
  1. Set Up a Clear Data Analytics Process
&lt;/h1&gt;

&lt;p&gt;One of the fundamental parts of becoming a successful data analyst is to have a clear process set up for your projects. This will save you the time and trouble of approaching each project in an ad hoc manner. A simple data analytics process is outlined below:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  Define the Question: Fully define the question you're trying to answer and the goals of your data analytics project.&lt;/li&gt;
&lt;li&gt;  Collect Data: Work with data engineers or other data professionals to gather relevant data for your project.&lt;/li&gt;
&lt;li&gt;  Clean the Data: Standardize the data you've collected and remove any incorrect or irrelevant entries.&lt;/li&gt;
&lt;li&gt;  Analyze the Data: Employ data analysis techniques to understand the data and drive answers to your question. This step can take many different forms, depending on the question you're trying to answer.&lt;/li&gt;
&lt;li&gt;  Share Your Results: Create data visualizations and resources that will help others understand the insights you've produced.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;With this simple framework, you'll have a clear road map for outlining and completing data analytics projects. Following this basic process will also keep you from getting sidetracked as you conduct your analysis.&lt;/p&gt;

&lt;h1&gt;
  
  
  2. Don't Bury the Lede
&lt;/h1&gt;

&lt;p&gt;When your analysis is finished, it's important that you're able to communicate your findings to others in an effective way. A key part of this is to keep your reports simple and concise. While it may be tempting to show all of your findings, it's better to condense your results down to a simple, understandable message.&lt;/p&gt;

&lt;p&gt;For optimal communication, consider telling the story of your data with a few carefully selected charts. These should be relevant to the core question and easy for your audience to understand. Sum up your findings with a conclusion that answers the question and drives value for your audience. By doing this, you'll avoid confusion and keep your messaging focused on what your analysis has produced.&lt;/p&gt;

&lt;h1&gt;
  
  
  3. Data Analytics Peer Review
&lt;/h1&gt;

&lt;p&gt;Putting a peer review process in place for your analysis is one of the best ways to ensure your work is sound and accurate. Getting a second set of eyes on your analysis can help you find potential errors or room for improvement. If a fellow analyst confirms your analysis, you'll know that your work is ready for presentation.&lt;/p&gt;

&lt;p&gt;Peer review is especially important for less experienced data analysts. If you can get a more experienced analyst to review your work, you'll be able to learn from their insights and comments. It's also helpful for analysts with more technical roles who may &lt;a href="https://www.youtube.com/watch?v=uEPCxBaRf6A"&gt;be working on projects that were once primarily the domain of data engineers&lt;/a&gt;.&lt;/p&gt;

&lt;h1&gt;
  
  
  4. Triple-Check Your Data
&lt;/h1&gt;

&lt;p&gt;Whenever you're working with data, it's good practice to assume that the datasets have at least some flaws. These flaws can range from simple organizational errors to completely erroneous pieces of data. For this reason, you should get into the habit of triple-checking your data as you conduct your analyses.&lt;/p&gt;

&lt;p&gt;Finding flaws or inaccuracies in your datasets will help you provide a better analysis. In many cases, you'll be able to simply fix the issues you find and then proceed with your work. In others, though, you may discover much larger problems that require substantially more work to resolve.&lt;/p&gt;

&lt;p&gt;There are times when it's even advisable to ignore flawed data altogether. While you might lose some information, you have to know when a problem is too labor-intensive to be worth fixing. This decision will depend on the project and the nature of the problems with your data.&lt;/p&gt;

&lt;p&gt;While checking your data, it's a good idea not to assume that anything is accurate. Even something that seems foolproof may have errors that could throw your analysis off. Columns in which data are entered from a dropdown menu, for example, seem like they should be free from errors. If there are invalid options in the menu, though, you could end up with flawed data as a result. Assume that there are errors in any dataset and conduct a thorough search to find them. By doing this during the data cleaning phase of your project, you can save yourself from having to backtrack and fix mistakes later on.&lt;/p&gt;
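&lt;p&gt;As a minimal sketch of this habit, here is what a dropdown-style check might look like in plain Python. The field name and the allowed values are assumptions for illustration:&lt;/p&gt;

```python
# A minimal sketch of the "assume the data has errors" habit: validate a
# column that was supposedly filled from a dropdown menu against the list
# of valid options. The field name and values are made up for illustration.
VALID_REGIONS = {"North", "South", "East", "West"}

rows = [
    {"id": 1, "region": "North"},
    {"id": 2, "region": "north "},     # stray whitespace / casing
    {"id": 3, "region": "Northeast"},  # option that was never valid
]

# Standardize first, then flag anything still outside the allowed set.
for row in rows:
    row["region"] = row["region"].strip().title()

bad = [row for row in rows if row["region"] not in VALID_REGIONS]
print(bad)  # [{'id': 3, 'region': 'Northeast'}]
```

Catching the invalid entry during the cleaning phase is what saves you from backtracking later.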

&lt;h1&gt;
  
  
  5. Know When To Stop Your Analysis
&lt;/h1&gt;

&lt;p&gt;A final critical skill that is frequently overlooked is knowing when to stop your analysis. Having a set endpoint is a key part of the data analytics process. When you reach that endpoint, you need to be able to stop and finalize your analysis.&lt;/p&gt;

&lt;p&gt;Without a clear endpoint, you can easily think of new questions to ask and find yourself going down rabbit holes that aren't relevant to your project. While there are times when further exploration delivers useful insights, endless data analysis frequently fails to produce valuable results.&lt;/p&gt;

&lt;p&gt;Knowing when to stop relates directly to the first step of the data analytics process outlined above. If you don't know exactly what question you're trying to answer, it's very difficult to know where to stop. With a clearly defined question, you should have a natural endpoint beyond which there's no need for further analysis.&lt;/p&gt;

&lt;h1&gt;
  
  
  Conclusion
&lt;/h1&gt;

&lt;p&gt;The&lt;a href="https://www.youtube.com/watch?v=s40MptE20Tc"&gt; tools and resources available to data&lt;/a&gt; analysts and engineers are constantly changing as technology evolves. The fundamental role of an analyst, however, remains the same. As a data analyst, your primary task will always be to provide valuable, data-driven insights that help your business or organization achieve its goals.&lt;/p&gt;

&lt;p&gt;By focusing on these fundamental skills, you can give yourself an extremely &lt;a href="https://www.youtube.com/watch?v=wuiJWSz-OLQ"&gt;strong foundation as an analyst&lt;/a&gt;. From there, you can build your technical skills to expand your capabilities. Whatever tools you're using, though, keeping these basic principles in mind will help you improve as an analyst and create more value for your employer.&lt;/p&gt;

</description>
      <category>database</category>
      <category>datascience</category>
      <category>career</category>
      <category>sql</category>
    </item>
    <item>
      <title>How to Become a Data Engineer in 2022</title>
      <dc:creator>SeattleDataGuy</dc:creator>
      <pubDate>Sun, 23 Jan 2022 04:04:38 +0000</pubDate>
      <link>https://dev.to/seattledataguy/how-to-become-a-data-engineer-in-2022-3pha</link>
      <guid>https://dev.to/seattledataguy/how-to-become-a-data-engineer-in-2022-3pha</guid>
      <description>&lt;p&gt;Photo by &lt;a href="https://unsplash.com/@thisisengineering?utm_source=unsplash&amp;amp;utm_medium=referral&amp;amp;utm_content=creditCopyText" rel="noopener noreferrer"&gt;ThisisEngineering RAEng&lt;/a&gt; on &lt;a href="https://unsplash.com/?utm_source=unsplash&amp;amp;utm_medium=referral&amp;amp;utm_content=creditCopyText" rel="noopener noreferrer"&gt;Unsplash&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If you were paying attention in 2021, then you would know that data engineering jobs are on the rise.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.interviewquery.com/blog-data-science-interview-report/" rel="noopener noreferrer"&gt;Jay Feng's report&lt;/a&gt; analyzing 10,000 data science interviews found that interviews for data engineering jobs increased by 40% in 2020. There has also been a massive increase in funding in the data engineering space and the explosion of data-driven content has led to a revival of the term data engineering. I've also seen an increase in the number of companies offering data engineering internships. This is not just at Amazon and Facebook, but also at companies like Spotify.&lt;/p&gt;

&lt;p&gt;All of this information is good news for aspiring data engineers. It's possible to come out of college, land a good job, and thrive in this growing field.&lt;/p&gt;

&lt;p&gt;But what does it take to &lt;a href="https://www.youtube.com/watch?v=PEsCumjfBW0" rel="noopener noreferrer"&gt;become a data engineer in 2022&lt;/a&gt;? What skills do you need to have today so that they are still relevant five years from now?&lt;/p&gt;

&lt;p&gt;In this article, I'll cover what skills emerging data engineers need, where they can apply for jobs, and what they can do to stand out in this competitive field.&lt;/p&gt;

&lt;h1&gt;
  
  
  What Skills Do Data Engineers Need?
&lt;/h1&gt;

&lt;p&gt;What skills are companies asking for from junior data engineers?&lt;/p&gt;

&lt;p&gt;It's important to know the basics. Most interviews at a junior or intern level won't expect you to know more than the three skills below.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  Python&lt;/li&gt;
&lt;li&gt;  SQL&lt;/li&gt;
&lt;li&gt;  Data modeling and ETL development&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Those come up in nearly all interviews. One exception: for the coding portion, you could learn Python, Java, or Scala. If you know two of those languages, you can usually figure out the third.&lt;/p&gt;

&lt;p&gt;If you're going into an internship or junior position, you shouldn't be expected to know much more than these requirements. If you're going into a mid-level position, then there might be an expectation to know a little more.&lt;/p&gt;
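&lt;p&gt;Those three areas meet in even the smallest pipeline. Here is a toy sketch, with invented data and names, of what a Python-plus-SQL ETL job looks like end to end:&lt;/p&gt;

```python
import sqlite3

# Extract: pretend these rows came from an API response or a CSV export;
# the users and amounts are invented for illustration.
raw_events = [
    {"user": "a@example.com", "amount": "19.25"},
    {"user": "b@example.com", "amount": "5.00"},
    {"user": "a@example.com", "amount": "12.50"},
]

# Transform: cast the string amounts to numbers.
events = [(e["user"], float(e["amount"])) for e in raw_events]

# Load: write into a simple modeled table, then aggregate with SQL.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE purchases (user TEXT, amount REAL)")
conn.executemany("INSERT INTO purchases VALUES (?, ?)", events)

totals = conn.execute(
    "SELECT user, SUM(amount) FROM purchases GROUP BY user ORDER BY user"
).fetchall()
print(totals)  # [('a@example.com', 31.75), ('b@example.com', 5.0)]
```

Real jobs swap the in-memory database for a warehouse and a scheduler, but the extract-transform-load shape is the same.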

&lt;h1&gt;
  
  
  Getting A Data Engineer Internship At Amazon
&lt;/h1&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmiro.medium.com%2Fmax%2F1400%2F1%2A45UrkvhJ24GxfzSTJqXuDQ.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmiro.medium.com%2Fmax%2F1400%2F1%2A45UrkvhJ24GxfzSTJqXuDQ.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmiro.medium.com%2Fmax%2F1400%2F1%2AzxLuBuZSQ6PfutRSln3Okw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmiro.medium.com%2Fmax%2F1400%2F1%2AzxLuBuZSQ6PfutRSln3Okw.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Let's take a look at this data engineering internship opportunity from Amazon. They expect you to have a bachelor's degree, which makes sense. They also expect you to know Python, how to create data pipelines, and database and warehouse modeling concepts. All of this makes sense.&lt;/p&gt;

&lt;p&gt;When you scroll down to the preferred qualifications, there are some questionable statements. Amazon is asking for a master's degree, which seems unnecessary for any sort of computer engineering, software engineering, or data engineering job. These skills are not generally learned in college and there's not exactly a data engineering master's degree.&lt;/p&gt;

&lt;p&gt;I also find it frustrating that they want applicants to articulate the basic differences between data types. I'm not sure what they're going for there because NoSQL and relational data types are very similar. I understand what they're getting at but wish they had phrased that point better. They should have said something like &lt;em&gt;understanding the difference between SQL and NoSQL&lt;/em&gt; or something similar.&lt;/p&gt;

&lt;p&gt;However, they have stated that you'll need a lot of SQL and data modeling experience. I've interviewed at Amazon, and those are definitely skills you will need.&lt;/p&gt;

&lt;h1&gt;
  
  
  Getting A Data Engineer Internship At Facebook
&lt;/h1&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmiro.medium.com%2Fmax%2F1400%2F1%2ANStQetj9aPcB6JkBwMdxjA.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmiro.medium.com%2Fmax%2F1400%2F1%2ANStQetj9aPcB6JkBwMdxjA.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmiro.medium.com%2Fmax%2F1400%2F1%2AugfjGu-ReEgaL1SunX_8MQ.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmiro.medium.com%2Fmax%2F1400%2F1%2AugfjGu-ReEgaL1SunX_8MQ.png"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Taking a look at this &lt;a href="https://www.facebookcareers.com/jobs/253886259650481/" rel="noopener noreferrer"&gt;data engineer intern role&lt;/a&gt; at Facebook, they also require experience with SQL and Python. They also call out the need to understand distributed systems, which, having worked at Facebook (now Meta), I think is unnecessary. You don't need to know the things that are under the hood, like Presto or Hive, because your job will just involve writing SQL. Nor will you be writing any MapReduce jobs.&lt;/p&gt;

&lt;h1&gt;
  
  
  Where To Apply
&lt;/h1&gt;

&lt;p&gt;I found a lot of junior data engineer positions just by Googling. There are a lot more than were around when I was going into data engineering.&lt;/p&gt;

&lt;p&gt;Some companies don't seem to understand the term junior and are asking for 2+ years of experience, whereas a junior role should require 0--2 years of experience.&lt;/p&gt;

&lt;p&gt;However, there are plenty of junior and internship positions advertised. Thus, you can just ignore the ones that are asking for too much experience.&lt;/p&gt;

&lt;p&gt;At least make sure that the position is paying well (at least $80--100K) if it asks for a lot of experience. The Amazon internship that we looked at is paying $7,700 a month for a position based in Colorado, which makes it close to $100,000 a year.&lt;/p&gt;

&lt;h1&gt;
  
  
  How To Stand Out
&lt;/h1&gt;

&lt;p&gt;Although there are a lot of positions for data engineers, there are also a lot more people wanting to become data engineers. Entry-level positions are paying up to $100K per annum, &lt;a href="https://policyadvice.net/insurance/insights/average-american-income/#:~:text=What%20percentage%20of%20Americans%20makes,the%20population%20earned%20over%20%24200%2C000." rel="noopener noreferrer"&gt;which would put you into the top 15% of American earners&lt;/a&gt;. You could even make more if you worked at a successful startup with equity options. So, it's important to know how to stand out.&lt;/p&gt;

&lt;p&gt;To get a job, you would typically study hard, apply for positions, network, and get referrals. To stand out, however, you need to promote yourself. You could be the smartest data engineer in the world, but if you don't know how to promote yourself, no one will know.&lt;/p&gt;

&lt;p&gt;In 2022, there are a ton of avenues to promote yourself. You can make videos, write articles, and share your code, as well as what you're currently learning and/or working on.&lt;/p&gt;

&lt;p&gt;Channels to publish content include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;  &lt;a href="https://www.linkedin.com/company/18129251/" rel="noopener noreferrer"&gt;LinkedIn&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;  &lt;a href="https://medium.com/@SeattleDataGuy" rel="noopener noreferrer"&gt;Medium&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;  &lt;a href="https://www.theseattledataguy.com/" rel="noopener noreferrer"&gt;Your own website&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;  Your own GitHub repository&lt;/li&gt;
&lt;li&gt;  &lt;a href="https://www.youtube.com/channel/UCmLGJ3VYBcfRaWbP6JLJcpA" rel="noopener noreferrer"&gt;YouTube&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This kind of content helps recruiters and companies get to know you as a person, not just as a resume that hits their desks.&lt;/p&gt;

&lt;p&gt;The content publishing method can be tailored to suit your personality and skills. Some people are great at creating their own projects, like doing open-source work and writing code. For people who don't like coding as much, you can work on more high-level concepts and write blogs. This might involve understanding data modeling and writing basic data modeling breakdowns. For example, you could write articles about what the different tables in a warehouse are or compare modern concepts like data mesh, data fabric, and &lt;a href="https://www.theseattledataguy.com/snowflake-vs-bigquery%e2%80%8a-%e2%80%8atwo-cloud-data-warehouses-of-many-setting-up-your-data-in-the-modern-data-stack/" rel="noopener noreferrer"&gt;data warehousing&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;This type of work helps you to prepare for interviews as well as produce content to share with potential employers. Platforms like YouTube are great because you can share your work and ideas with what feels like an infinite number of people. I've had hiring managers reach out to me because there are not a lot of people publishing videos about data engineering.&lt;/p&gt;

&lt;h1&gt;
  
  
  Conclusion
&lt;/h1&gt;

&lt;p&gt;There are many things you can do to increase your chances of becoming a data engineer in 2022. You can start by learning the right skills and applying for positions. However, it's important to also promote yourself through content publishing. This will help you stand out from the crowd and show potential employers that you're passionate about data engineering. There are many avenues to publish content, so find one that best suits your personality and skillset.&lt;/p&gt;

&lt;h1&gt;
  
  
  Connect With Me:
&lt;/h1&gt;

&lt;p&gt;✅ YouTube:&lt;a href="https://www.youtube.com/channel/UCmLGJ3VYBcfRaWbP6JLJcpA" rel="noopener noreferrer"&gt;https://www.youtube.com/channel/SeattleDataGuy&lt;/a&gt;&lt;br&gt;
✅ Website: &lt;a href="https://www.theseattledataguy.com/" rel="noopener noreferrer"&gt;https://www.theseattledataguy.com/&lt;/a&gt;&lt;br&gt;
✅ LinkedIn: &lt;a href="https://www.linkedin.com/company/18129251" rel="noopener noreferrer"&gt;https://www.linkedin.com/company/18129251&lt;/a&gt;&lt;br&gt;
✅ Personal Linkedin: &lt;a href="https://www.linkedin.com/in/benjaminrogojan/" rel="noopener noreferrer"&gt;https://www.linkedin.com/in/benjaminrogojan/&lt;/a&gt;&lt;br&gt;
✅ FaceBook: &lt;a href="https://www.facebook.com/SeattleDataGuy" rel="noopener noreferrer"&gt;https://www.facebook.com/SeattleDataGuy&lt;/a&gt;&lt;/p&gt;

</description>
      <category>career</category>
      <category>beginners</category>
      <category>database</category>
      <category>datascience</category>
    </item>
    <item>
      <title>8 Great Data Engineering Youtube Channels</title>
      <dc:creator>SeattleDataGuy</dc:creator>
      <pubDate>Sat, 15 Jan 2022 15:33:18 +0000</pubDate>
      <link>https://dev.to/seattledataguy/8-great-data-engineering-youtube-channels-21n7</link>
      <guid>https://dev.to/seattledataguy/8-great-data-engineering-youtube-channels-21n7</guid>
      <description>&lt;p&gt;Photo by &lt;a href="https://unsplash.com/@dariox?utm_source=medium&amp;amp;utm_medium=referral" rel="noopener noreferrer"&gt;Dario&lt;/a&gt; on &lt;a href="https://unsplash.com/?utm_source=medium&amp;amp;utm_medium=referral" rel="noopener noreferrer"&gt;Unsplash&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Looking at the data engineering YouTube space early last year, there weren't that many people talking about data engineering.&lt;/p&gt;

&lt;p&gt;There were a few but overall most data-focused videos were on data science, algorithms, and machine learning.&lt;/p&gt;

&lt;p&gt;But, somewhere in the middle of last year, it seemed like more and more YouTubers were talking about data engineering.&lt;/p&gt;

&lt;p&gt;Some were referencing the trend itself of growth in data engineering job opportunities and others, like myself, decided to start making videos to help future data engineers.&lt;/p&gt;

&lt;p&gt;In this article, I wanted to help signal boost some of these great data engineering Youtube channels you should check out!&lt;/p&gt;

&lt;p&gt;So let's dive in.&lt;/p&gt;

&lt;h1&gt;
  
  
  Zach Wilson
&lt;/h1&gt;

&lt;p&gt;Since the first time I saw a post from&lt;a href="https://www.youtube.com/c/DatawithZach" rel="noopener noreferrer"&gt; Zach Wilson&lt;/a&gt; on LinkedIn, I had been waiting for his YouTube channel.&lt;/p&gt;

&lt;p&gt;And towards the end of 2021, he delivered.&lt;/p&gt;

&lt;p&gt;In one of his first videos, Zach discusses whether data engineers should learn Scala. He does a great job covering the pros and cons, from the fact that Scala allows for a more software engineering-driven approach to pipelines to the fact that only a limited number of positions truly require it.&lt;/p&gt;

&lt;p&gt;But don't take my word for it. This short clip does a great job of answering the question.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.youtube.com/watch?v=YG5-0x9g0eA" rel="noopener noreferrer"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F6m8b8ee8bsxxxkin02uc.png" title="Let's check Jason S' profile page (inline-style)" alt="invisible description of images, read aloud to blind users&amp;lt;br&amp;gt;
"&gt;&lt;/a&gt;&lt;br&gt;
&lt;a href="https://www.youtube.com/c/DatawithZach" rel="noopener noreferrer"&gt;All Videos Here&lt;/a&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  Andreas Kretz
&lt;/h1&gt;

&lt;p&gt;&lt;a href="https://medium.com/u/f28a432ce853?source=post_page-----e957b19ad253-----------------------------------" rel="noopener noreferrer"&gt;Andreas Kretz&lt;/a&gt; is the OG &lt;a href="https://www.youtube.com/c/andreaskayy" rel="noopener noreferrer"&gt;data engineering Youtuber&lt;/a&gt;. He has been putting out loads of content on data engineering for years now.&lt;/p&gt;

&lt;p&gt;Last year, I really enjoyed several of his posts where he commented on videos of companies breaking down their data infrastructure.&lt;/p&gt;

&lt;p&gt;Andreas clearly has a lot of practical experience, and it shows in his reviews.&lt;/p&gt;

&lt;p&gt;In this video, he reviews the AWS architecture that Nielsen built, from a data engineering expert's point of view. They use it to process up to 55TB of data a day.&lt;/p&gt;

&lt;p&gt;It's a really nice use case of Apache Spark and RDS, with AWS Lambda functions for content delivery and AWS SQS for job management. It teaches you a lot about data engineering.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.youtube.com/watch?v=WCQe1VP_q5A" rel="noopener noreferrer"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fjoyagskc3fvhyi8oyeln.png" title="Let's check Jason S' profile page (inline-style)" alt="invisible description of images, read aloud to blind users&amp;lt;br&amp;gt;
"&gt;&lt;/a&gt;&lt;br&gt;
&lt;a href="https://www.youtube.com/c/andreaskayy" rel="noopener noreferrer"&gt;All Videos Here&lt;/a&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  Darshil Parmar
&lt;/h1&gt;

&lt;p&gt;&lt;a href="https://www.youtube.com/c/DarshilParmar" rel="noopener noreferrer"&gt;Darshil&lt;/a&gt; has started putting out a lot of great content for the past year. He has discussed his data engineering journey, how he went from working for companies to freelancing and so much more.&lt;/p&gt;

&lt;p&gt;Actually, I think his most recent video is great for people just getting started.&lt;/p&gt;

&lt;p&gt;Here is his description:&lt;/p&gt;

&lt;p&gt;&lt;em&gt;How did I become a data engineer?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;What are the skills needed for data engineering?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;How to get a job as a data engineer and many more?&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Data engineering is a new branch of engineering, but the data analysis field is growing rapidly and so is the demand for professionals who can handle this field. Data scientists are redefining how businesses use data by helping them access, analyze, and interpret data in their everyday lives. In order to be successful in this field, there are many skills you need to learn. Data engineering, or the ability to store and access data, is one of them. In this video, Darshil Parmar talks about how he became a data engineer and what skills he learned on the job. --- Darshil Parmar&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.youtube.com/watch?v=AS2EyYK4x_Q" rel="noopener noreferrer"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fdzu80ahx1ht2o1simr9h.png" title="Let's check Jason S' profile page (inline-style)" alt="invisible description of images, read aloud to blind users&amp;lt;br&amp;gt;
"&gt;&lt;/a&gt;&lt;br&gt;
&lt;a href="https://www.youtube.com/c/DarshilParmar" rel="noopener noreferrer"&gt;All Videos Here&lt;/a&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  Seattle Data Guy (Me)
&lt;/h1&gt;

&lt;p&gt;Alright. I, of course, am going to put myself on this list. I have personally really been enjoying putting out videos on data engineering. Currently, I would say the focus of &lt;a href="https://www.youtube.com/c/SeattleDataGuy" rel="noopener noreferrer"&gt;my channel&lt;/a&gt; has been more early-career style videos.&lt;/p&gt;

&lt;p&gt;However, in the coming year, I would like to dive more into other topics such as data infrastructure, different data tools like dbt, Trino, Airflow, and so on.&lt;/p&gt;

&lt;p&gt;But we shall see what the future holds!&lt;/p&gt;

&lt;p&gt;The video I picked was me reviewing some of my favorite books for data engineering. If you're new to data engineering I am sure you can find a good new book to read.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.youtube.com/watch?v=xtfuO7kGJeY" rel="noopener noreferrer"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fmvx3gl3fky1ru11l8h09.png" title="Let's check Jason S' profile page (inline-style)" alt="invisible description of images, read aloud to blind users&amp;lt;br&amp;gt;
"&gt;&lt;/a&gt;&lt;br&gt;
&lt;a href="https://www.youtube.com/c/SeattleDataGuy/" rel="noopener noreferrer"&gt;All Videos Here&lt;/a&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  Karolina Sowinska
&lt;/h1&gt;

&lt;p&gt;&lt;a href="https://www.youtube.com/channel/UCAxnMry1lETl47xQWABvH7g" rel="noopener noreferrer"&gt;Karolina Sowinska&lt;/a&gt; has also been a long-standing data Youtuber. She does tend to have a broader set of topics other than data engineering but she has a lot of classic data engineering videos.&lt;/p&gt;

&lt;p&gt;For example, she has a great quick intro for Airflow that I shared below.&lt;/p&gt;

&lt;p&gt;In this long-awaited Airflow for Beginners video, she will show you how to install Airflow from scratch, and how to schedule your first ETL job in Airflow.&lt;/p&gt;

&lt;p&gt;So don't miss out!&lt;br&gt;
&lt;a href="https://www.youtube.com/watch?v=i25ttd32-eo" rel="noopener noreferrer"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fnggcene50cko2jyxsjph.png" title="Let's check Jason S' profile page (inline-style)" alt="invisible description of images, read aloud to blind users&amp;lt;br&amp;gt;
"&gt;&lt;/a&gt;&lt;br&gt;
&lt;a href="https://www.youtube.com/c/KarolinaSowinska" rel="noopener noreferrer"&gt;All Videos Here&lt;/a&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  Mehdio
&lt;/h1&gt;

&lt;p&gt;&lt;a href="https://medium.com/u/d9259b41277a?source=post_page-----e957b19ad253-----------------------------------" rel="noopener noreferrer"&gt;mehdio&lt;/a&gt; is new to the Youtube space but we are glad to have him jump aboard!&lt;/p&gt;

&lt;p&gt;We have collaborated on a writing project and I really do think he has a great perspective on data engineering.&lt;/p&gt;

&lt;p&gt;Although his &lt;a href="https://www.youtube.com/channel/UCiZxJB0xWfPBE2omVZeWPpQ" rel="noopener noreferrer"&gt;channel only has one or two videos&lt;/a&gt;, I can see it growing over the next year.&lt;/p&gt;

&lt;p&gt;The video I picked here was Mehdio's only core video so far. In it, Mehdio looks at data engineering from a historical perspective and tries to predict where the space is going.&lt;br&gt;
&lt;a href="https://www.youtube.com/watch?v=GYqt94348wU" rel="noopener noreferrer"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fot4urrmahe586kchrid0.png" title="The history and future of data engineering" alt="Video thumbnail: the history and future of data engineering"&gt;&lt;/a&gt;&lt;br&gt;
&lt;a href="https://www.youtube.com/channel/UCiZxJB0xWfPBE2omVZeWPpQ" rel="noopener noreferrer"&gt;All Videos Here&lt;/a&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  E-Learning Bridge
&lt;/h1&gt;

&lt;p&gt;Shashank Mishra is a data engineer who has worked at Amazon, McKinsey &amp;amp; Company, and Paytm. He uses his vast experience and wide network to create videos that help young data engineers learn more about the role and how they can get a job at a top tech company.&lt;/p&gt;

&lt;p&gt;In the video below Shashank will go over some of the SQL topics you will need to know as you study for your FAANG interviews.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.youtube.com/watch?v=Y7R4X6ZOSS0" rel="noopener noreferrer"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F2790e2ue0yf4f2xo7uuu.png" title="Let's check Jason S' profile page (inline-style)" alt="invisible description of images, read aloud to blind users&amp;lt;br&amp;gt;
"&gt;&lt;/a&gt;&lt;br&gt;
&lt;a href="https://www.youtube.com/c/LearningBridge" rel="noopener noreferrer"&gt;All Videos Here&lt;/a&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  Ken Jee
&lt;/h1&gt;

&lt;p&gt;Ok, &lt;a href="https://medium.com/u/6ee1f7466557" rel="noopener noreferrer"&gt;Ken Jee&lt;/a&gt; usually puts out content more around data science. However, one of his recent videos was focused on data engineering.&lt;/p&gt;

&lt;p&gt;So I did include him on the list. Maybe he'll jump on the data engineering bandwagon.&lt;/p&gt;

&lt;p&gt;The video I picked below is said data engineering video.&lt;/p&gt;

&lt;p&gt;In this video, he broke down why it seems like companies are hiring more data engineers in general. He also discussed how this would impact data scientists.&lt;/p&gt;

&lt;p&gt;Overall, I believe this video gives a good explanation of why so many companies are struggling to get value from their data strategy. If you're curious how the data world is shaping up, then I think Ken will provide a lot of strong context.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.youtube.com/watch?v=vwvdtXMcNzI" rel="noopener noreferrer"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fpx8vqc5y1dbclrkpiias.png" title="Let's check Jason S' profile page (inline-style)" alt="invisible description of images, read aloud to blind users&amp;lt;br&amp;gt;
"&gt;&lt;/a&gt;&lt;br&gt;
&lt;a href="https://www.youtube.com/c/KenJee1" rel="noopener noreferrer"&gt;All Videos Here&lt;/a&gt;&lt;/p&gt;

&lt;h1&gt;
  
  
  So, Do You Want To Become A Data Engineer?
&lt;/h1&gt;

&lt;p&gt;After all those great videos.&lt;/p&gt;

&lt;p&gt;The question remains.&lt;/p&gt;

&lt;p&gt;Should you become a data engineer?&lt;/p&gt;

&lt;p&gt;This is something only you can answer. Do you like creating data pipelines and automating data systems?&lt;/p&gt;

&lt;p&gt;Do you want to play the middleman between applications and software teams and the analytical teams?&lt;/p&gt;

&lt;p&gt;Hopefully, the combination of videos above has helped you form a better picture of what being a data engineer really means, as well as provided context on how you can grow your data engineering career.&lt;/p&gt;

&lt;p&gt;Thank you for reading!&lt;/p&gt;

&lt;p&gt;If you want to watch or read more:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://seattledataguy.substack.com/p/which-managed-version-of-airflow" rel="noopener noreferrer"&gt;Which Managed Version Of Airflow Should You Use?&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.theseattledataguy.com/what-is-trino-and-why-is-it-great-at-processing-big-data/#page-content" rel="noopener noreferrer"&gt;What Is Trino And How It Manages Big Data&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.youtube.com/watch?v=bqCXVpRqTpE" rel="noopener noreferrer"&gt;What I Learned From 100+ Data Engineering Interviews - Interview Tips&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://seattledataguy.substack.com/p/5-big-data-experts-predictions-for" rel="noopener noreferrer"&gt;5 Big Data Experts Predictions For 2022 - From The Modern Data Stack To Data Science&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.youtube.com/watch?v=PEsCumjfBW0" rel="noopener noreferrer"&gt;How I Would Become A Data Engineer in 2022&lt;/a&gt;&lt;/p&gt;

</description>
      <category>database</category>
      <category>datascience</category>
      <category>sql</category>
      <category>beginners</category>
    </item>
    <item>
      <title>Why You Should Become A Data Engineer And Not A Data Scientist</title>
      <dc:creator>SeattleDataGuy</dc:creator>
      <pubDate>Sun, 26 Dec 2021 22:59:57 +0000</pubDate>
      <link>https://dev.to/seattledataguy/why-you-should-become-a-data-engineer-and-not-a-data-scientist-2kho</link>
      <guid>https://dev.to/seattledataguy/why-you-should-become-a-data-engineer-and-not-a-data-scientist-2kho</guid>
      <description>&lt;p&gt;Photo by &lt;a href="https://unsplash.com/@magnetme?utm_source=unsplash&amp;amp;utm_medium=referral&amp;amp;utm_content=creditCopyText" rel="noopener noreferrer"&gt;Magnet.me&lt;/a&gt; on &lt;a href="https://unsplash.com/s/photos/job?utm_source=unsplash&amp;amp;utm_medium=referral&amp;amp;utm_content=creditCopyText" rel="noopener noreferrer"&gt;Unsplash&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Data Science is a career path that many people are currently choosing based on an essay written a decade ago. Yes, it has been almost ten years since the &lt;a href="https://hbr.org/2012/10/data-scientist-the-sexiest-job-of-the-21st-century" rel="noopener noreferrer"&gt;Harvard Business Review essay&lt;/a&gt; was published, yet we are all still choosing Data Science as a professional path. However, if you read through LinkedIn recently, you will notice that many people have posted that they are recovering data scientists who have now become data engineers.&lt;/p&gt;

&lt;p&gt;Many of us are drawn into the subject of Data Science only to finally realize it isn't for us. And, while there is some gatekeeping around Data Engineering, it appears that many people who initially started as data scientists eventually switched over to Data Engineering.&lt;/p&gt;

&lt;p&gt;What may be the reason for this? In this blog post, I hope to help you avoid the whole transition from data scientist to data engineer by giving you a few fundamental reasons why you should become a data engineer rather than a data scientist.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Disclaimer: Some who read the title of this article might assume this is some form of:&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;"&lt;/em&gt;&lt;em&gt;We Don't Need Data Scientists&lt;/em&gt;&lt;em&gt;" or "&lt;/em&gt;&lt;em&gt;Data Engineering Is Better Than Data Science&lt;/em&gt;&lt;em&gt;" type article. That's not the purpose of this article. It is meant to discuss why someone may prefer being a data engineer. Of course, if you work at a small enough company you might be a little of both.&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  You Like Building Stuff
&lt;/h2&gt;

&lt;p&gt;If you enjoy building infrastructure, programming, and writing object-oriented code rather than merely procedural code to interact with data, you may be more of a data engineer.&lt;/p&gt;

&lt;p&gt;Data engineers develop data pipelines, infrastructure, monitoring, and other aspects that aren't immediately related to models. Data engineers operationalize or &lt;em&gt;productionize&lt;/em&gt; a model, which implies taking the analysis or Jupyter notebook that a data scientist created and applying it in a sustainable and robust system, rather than simply pressing "run" on that Jupyter notebook every day.&lt;/p&gt;

&lt;p&gt;We prefer the discipline and process of constructing infrastructure instead of simply burying it in data frames that no one can access or adequately QA. &lt;a href="https://www.youtube.com/watch?v=LgSHaOvNodA" rel="noopener noreferrer"&gt;Data engineers&lt;/a&gt; enjoy having a tangible end product. We don't want just an analysis; we want a table, a &lt;a href="https://seattledataguy.substack.com/p/what-are-etls-and-why-we-use-them" rel="noopener noreferrer"&gt;pipeline&lt;/a&gt;, a &lt;a href="https://seattledataguy.substack.com/p/snowflake-vs-bigquery-two-cloud-data" rel="noopener noreferrer"&gt;data warehouse&lt;/a&gt;, or a data lake.&lt;/p&gt;

&lt;h2&gt;
  
  
  You Like Feeling Done
&lt;/h2&gt;

&lt;p&gt;Data engineers enjoy the sense of accomplishment that comes with finishing a project. Data science has an unending capacity to generate question after question, making your analysis endless. I have watched my data science counterparts finish an analysis on a single data set, only to have to dig into the data set more because the business asked even more questions.&lt;/p&gt;

&lt;p&gt;But, as data engineers, we have a concrete standard to adhere to: a table, a data pipeline, or something along those lines. Once we've created it, we're done. Sure, our stakeholders may say, "Oh, I wanted to add this column as well," but that's a new project or assignment, and we already know we've completed the previous one. In order to take on this new task, we would need to reprioritize all of our current work.&lt;/p&gt;

&lt;p&gt;The preceding is not necessarily true in the field of Data Science. It can be an endless web of questions, none of which ever leads to a final answer.&lt;/p&gt;

&lt;p&gt;If having no actual end product is what you enjoy about your job, Data Science may be for you. However, if you prefer having a finished product at the end of the day, Data Engineering may be a better fit.&lt;/p&gt;

&lt;h2&gt;
  
  
  You Don't Like Being The Center of Attention
&lt;/h2&gt;

&lt;p&gt;Another advantage of being a data engineer is that we are not always the center of attention. Data Science has a lot of sexiness and glamour, whereas data engineers can hide in the background, which many of us enjoy. Instead of spending a lot of time in front of co-workers explaining the impact of your model, we can often hide behind our keyboards and build our tables for our partners.&lt;/p&gt;

&lt;p&gt;So, being a data engineer is ideal if you prefer completing your task without attracting a lot of attention and questioning. You get to do your work, and you know that once it's done, you can pass it over to the data scientist, who will analyze the data and then jump in front of a stakeholder or manager and explain what their results mean.&lt;/p&gt;

&lt;p&gt;In that way, data engineering is an excellent job for folks who prefer interacting with a keyboard over dealing with other people.&lt;/p&gt;

&lt;p&gt;Of course, I'm not implying that all data engineers are introverts or dislike interacting with others. You can get the opportunity to talk and share with others if you so desire. If you want to take it, it's there for you.&lt;/p&gt;

&lt;p&gt;And I will always be a proponent of improving your soft skills, especially communication. Strong communication can help you leap forward in your career. Why else do you think I started a YouTube channel?&lt;/p&gt;

&lt;p&gt;However, the business is more concerned with the analysts' and data scientists' findings. They care about the amount of money they will save, the final model, and the influence on the business. Data engineering is critical, and data scientists cannot do their jobs without it. But few people care how the sausage is made; they only care that it's on their plate.&lt;/p&gt;

&lt;h2&gt;
  
  
  You Prefer SQL Over Pandas
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmiro.medium.com%2Fmax%2F1400%2F0%2AUkKhUyiOIIJh6Hi2" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmiro.medium.com%2Fmax%2F1400%2F0%2AUkKhUyiOIIJh6Hi2"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Photo by &lt;a href="https://unsplash.com/@millerthachiller?utm_source=medium&amp;amp;utm_medium=referral" rel="noopener noreferrer"&gt;Pascal Müller&lt;/a&gt; on &lt;a href="https://unsplash.com/?utm_source=medium&amp;amp;utm_medium=referral" rel="noopener noreferrer"&gt;Unsplash&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Finally, if you prefer SQL over Pandas, you might be more of a data engineer.&lt;/p&gt;

&lt;p&gt;I've discovered that data scientists seem to prefer using Pandas, and most data engineers tend to lean towards SQL.&lt;/p&gt;

&lt;p&gt;In one way or another, both manipulate data. But if you need to execute a sophisticated, 1,000-line query, I can only imagine how insane it would look in Pandas and how many chained function calls it would entail.&lt;/p&gt;

&lt;p&gt;What's fantastic is that we live in a world where Spark, Spark SQL, and Databricks exist, so we can all play in the same arena and use a similar engine. But SQL will continue to be the language of data, since it has withstood the test of time.&lt;/p&gt;
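&lt;p&gt;As a toy illustration (not from the original articles; the table and column names are made up), here is the same aggregation done both ways in Python, with sqlite3 standing in for a warehouse SQL engine. The SQL version is one declarative statement, while the Pandas version is a chain of method calls:&lt;/p&gt;

```python
import sqlite3

import pandas as pd

rows = [("2021-01-01", "A", 10), ("2021-01-01", "B", 5),
        ("2021-01-02", "A", 7), ("2021-01-02", "B", 3)]
df = pd.DataFrame(rows, columns=["day", "product", "amount"])

# Pandas: a chain of method calls
pandas_result = (df.groupby("product", as_index=False)["amount"]
                   .sum()
                   .sort_values("product"))

# SQL: a single declarative statement (sqlite3 stands in for a warehouse engine)
conn = sqlite3.connect(":memory:")
df.to_sql("sales", conn, index=False)
sql_result = pd.read_sql_query(
    "SELECT product, SUM(amount) AS amount FROM sales "
    "GROUP BY product ORDER BY product", conn)

print(sql_result.equals(pandas_result.reset_index(drop=True)))  # True
```

Both routes produce the same totals; the difference the article points at is readability once the logic grows to hundreds of lines.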

&lt;h2&gt;
  
  
  So Which Data Career Is Right For You?
&lt;/h2&gt;

&lt;p&gt;So there you have it: four reasons why you should consider becoming a data engineer rather than a data scientist. If any of these points resonated with you, you might want to seek a career as a data engineer.&lt;/p&gt;

&lt;p&gt;It's easy to get caught up in the Data Science allure, given that it's still seen as a very sexy job. Still, other data-related areas, such as data engineering, involve completely different day-to-day tasks and team relationships.&lt;/p&gt;

&lt;p&gt;The skills necessary for these jobs, as well as the end deliverables, vary greatly. The bottom line is to be honest with ourselves and select the career that suits us best.&lt;/p&gt;

&lt;p&gt;Of course, there are still other roles like &lt;a href="https://www.youtube.com/watch?v=uEPCxBaRf6A" rel="noopener noreferrer"&gt;analytics engineer and data analyst&lt;/a&gt;. So good luck finding the right career.&lt;/p&gt;

&lt;p&gt;If you're interested in learning more about data engineering or data science, then consider these articles and videos.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.youtube.com/watch?v=xtfuO7kGJeY" rel="noopener noreferrer"&gt;My Favorite Books For Data Engineers - From Streaming To Software Engineering&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://seattledataguy.substack.com/p/which-managed-version-of-airflow" rel="noopener noreferrer"&gt;Which Managed Version Of Airflow Should You Use?&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.theseattledataguy.com/what-is-trino-and-why-is-it-great-at-processing-big-data/#page-content" rel="noopener noreferrer"&gt;What Is Trino And How It Manages Big Data&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.youtube.com/watch?v=bqCXVpRqTpE" rel="noopener noreferrer"&gt;What I Learned From 100+ Data Engineering Interviews - Interview Tips&lt;br&gt;
&lt;/a&gt;&lt;/p&gt;

</description>
      <category>datascience</category>
      <category>career</category>
      <category>beginners</category>
      <category>database</category>
    </item>
    <item>
      <title>What Is Trino And Why Is It Great At Processing Big Data</title>
      <dc:creator>SeattleDataGuy</dc:creator>
      <pubDate>Thu, 09 Dec 2021 20:45:52 +0000</pubDate>
      <link>https://dev.to/seattledataguy/what-is-trino-and-why-is-it-great-at-processing-big-data-8pc</link>
      <guid>https://dev.to/seattledataguy/what-is-trino-and-why-is-it-great-at-processing-big-data-8pc</guid>
      <description>&lt;p&gt;Big data is touted as the solution to many problems. However, the truth is, big data has caused a lot of big problems.&lt;/p&gt;

&lt;p&gt;Yes, big data can provide more context and insights we have never had before. But it also makes queries slow and data expensive to manage, requires a lot of expensive specialists to process it, and just continues to grow.&lt;/p&gt;

&lt;p&gt;Overall, data, especially big data, has forced companies to develop better data management tools to ensure data scientists and analysts can work with all their data.&lt;/p&gt;

&lt;p&gt;One of these companies was Facebook, which decided it needed to develop a new engine to effectively process all of its petabytes. That tool was called Presto, which recently split off into another project called Trino.&lt;/p&gt;

&lt;p&gt;In this article, we outline what Trino is, why people use it, and some of the challenges people face when deploying it.&lt;/p&gt;

&lt;h1&gt;
  
  
  Presto Trino History
&lt;/h1&gt;

&lt;p&gt;Before diving into what Trino is and why people use it, let's recap how Presto turned into Trino.&lt;/p&gt;

&lt;h1&gt;
  
  
  What Is Trino
&lt;/h1&gt;

&lt;p&gt;Let's be clear: &lt;a href="https://trino.io/" rel="noopener noreferrer"&gt;Trino&lt;/a&gt; is not a database. This is a common misconception. Just because you use Trino to run SQL against data doesn't mean it's a database.&lt;/p&gt;

&lt;p&gt;Instead, Trino is a SQL engine. More specifically, Trino is an open-source distributed SQL query engine for ad hoc and batch &lt;a href="https://www.theseattledataguy.com/etls-vs-elts-why-are-elts-disrupting-the-data-market-data-engineering-consulting/#page-content" rel="noopener noreferrer"&gt;ETL&lt;/a&gt; queries against multiple types of data sources. Not to mention, it can handle a whole host of both standard and semi-structured data types like JSON, Arrays, and Maps.&lt;/p&gt;

&lt;p&gt;Another important point to discuss about Trino is the history of Presto. In 2012 &lt;a href="https://www.linkedin.com/in/traversomartin/" rel="noopener noreferrer"&gt;Martin Traverso&lt;/a&gt;, David Phillips, Dain Sundstrom and Eric Hwang were working at Facebook and developed Presto to replace Apache Hive to better process the hundreds of petabytes Facebook was trying to analyze.&lt;/p&gt;

&lt;p&gt;Due to the creators' desire to keep the project open and community-based, they open-sourced it in November 2013.&lt;/p&gt;

&lt;p&gt;But because Facebook wanted tighter control over the project, there was an eventual split.&lt;/p&gt;

&lt;p&gt;The original creators of Presto decided that they wanted to keep Presto open-source and in turn pursued building the Presto Open Source Community full-time. They did this under the new name PrestoSQL.&lt;/p&gt;

&lt;p&gt;Facebook decided to build a competing community using The Linux Foundation®. As a first action, Facebook applied for a trademark on Presto®. This eventually led to litigation and other challenges that forced the original group who developed Presto to rebrand.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmiro.medium.com%2Fmax%2F1400%2F0%2AvJa4YaAaF0qoR7tW" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmiro.medium.com%2Fmax%2F1400%2F0%2AvJa4YaAaF0qoR7tW"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://trino.io/blog/2020/12/27/announcing-trino.html" rel="noopener noreferrer"&gt;Source&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Starting in December 2020, &lt;a href="https://trino.io/blog/2020/12/27/announcing-trino.html" rel="noopener noreferrer"&gt;PrestoSQL was rebranded as Trino&lt;/a&gt;. This has been a little confusing, but Trino now supports a lot of end users and has a large base of developers who commit to it regularly.&lt;/p&gt;

&lt;h1&gt;
  
  
  How does Trino work?
&lt;/h1&gt;

&lt;p&gt;Trino is a distributed system that uses an architecture similar to massively parallel processing (MPP) databases. Like many other big data engines, it has a coordinator node that manages multiple worker nodes to process all the work that needs to be done.&lt;/p&gt;

&lt;p&gt;An analyst or general user runs their SQL, which gets pushed to the coordinator. In turn, the coordinator parses, plans, and schedules a distributed query. Trino supports standard ANSI &lt;a href="https://seattledataguy.substack.com/p/how-to-write-better-sql-advanced" rel="noopener noreferrer"&gt;SQL&lt;/a&gt; as well as more complex operations like JSON and MAP transformations and parsing.&lt;/p&gt;
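&lt;p&gt;To make the coordinator/worker split concrete, here is a toy Python sketch (this is not Trino's actual code; all names here are made up for illustration) of how a coordinator might fan a scan out across workers and then merge their partial aggregates:&lt;/p&gt;

```python
from concurrent.futures import ThreadPoolExecutor


def worker_partial_sum(partition):
    """Each worker scans only its own partition and returns a partial aggregate."""
    return sum(row["amount"] for row in partition)


def coordinator_query(rows, n_workers=4):
    """The coordinator splits the scan across workers, then merges the partials."""
    partitions = [rows[i::n_workers] for i in range(n_workers)]
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partials = list(pool.map(worker_partial_sum, partitions))
    return sum(partials)  # the final merge step happens on the coordinator


rows = [{"amount": i} for i in range(100)]
print(coordinator_query(rows))  # 4950
```

In real MPP engines the partitioning, scheduling, and merging are far more sophisticated, but the shape is the same: one node plans and merges, many nodes scan.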

&lt;h1&gt;
  
  
  Why People Use Trino
&lt;/h1&gt;

&lt;p&gt;Trino, as a Presto spin-off, inherits many benefits from having been developed by a large data company that needed to easily query across multiple data sources without spending too much time processing ETLs. In addition, it was developed to scale on cloud-like infrastructure, although most of Facebook's infrastructure is based on its internal cloud. But let's dig into why people are using Trino.&lt;/p&gt;

&lt;h1&gt;
  
  
  Agnostic Data Source Connections
&lt;/h1&gt;

&lt;p&gt;There are plenty of options when it comes to how you can query your data. There are tools like Athena, Hive and Apache Drill.&lt;/p&gt;

&lt;p&gt;So why use Trino to run SQL?&lt;/p&gt;

&lt;p&gt;Trino provides many benefits for developers. The biggest is that Trino is just a SQL engine. Meaning it sits agnostically on top of various data sources like MySQL, HDFS, and SQL Server.&lt;/p&gt;

&lt;p&gt;So if you want to run a query across these different data sources, you can.&lt;/p&gt;

&lt;p&gt;This is a powerful feature that eliminates the need for users to understand connections and SQL dialects of underlying systems.&lt;/p&gt;
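&lt;p&gt;As a tiny single-node analogue of federated querying (not how Trino is implemented; Trino federates across heterogeneous systems like MySQL and HDFS, while this sketch uses sqlite3's ATTACH with made-up table names), here is one SQL statement joining two separate databases through a single interface:&lt;/p&gt;

```python
import sqlite3

# Two separate "sources" (stand-ins for, say, a CRM database and a billing system)
conn = sqlite3.connect(":memory:")
conn.execute("ATTACH DATABASE ':memory:' AS crm")
conn.execute("ATTACH DATABASE ':memory:' AS billing")
conn.execute("CREATE TABLE crm.customers (id INTEGER, name TEXT)")
conn.execute("CREATE TABLE billing.invoices (customer_id INTEGER, total REAL)")
conn.executemany("INSERT INTO crm.customers VALUES (?, ?)",
                 [(1, "Ada"), (2, "Grace")])
conn.executemany("INSERT INTO billing.invoices VALUES (?, ?)",
                 [(1, 100.0), (1, 50.0), (2, 75.0)])

# One query joins across both sources through a single SQL interface
result = conn.execute("""
    SELECT c.name, SUM(i.total)
    FROM crm.customers c
    JOIN billing.invoices i ON i.customer_id = c.id
    GROUP BY c.name
    ORDER BY c.name
""").fetchall()
print(result)  # [('Ada', 150.0), ('Grace', 75.0)]
```

The user writes one dialect of SQL and never touches the individual sources' connection details, which is the convenience the paragraph above describes.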

&lt;h1&gt;
  
  
  Cloud-focused
&lt;/h1&gt;

&lt;p&gt;Presto's fundamental design of separating storage and compute makes it extremely convenient to operate in cloud environments. Since the Presto cluster doesn't store any data, it can be auto-scaled depending on the load without causing any data loss.&lt;/p&gt;

&lt;h1&gt;
  
  
  Use Cases For Trino
&lt;/h1&gt;

&lt;p&gt;Ad Hoc Queries and Reporting --- Trino allows end users to use SQL to run ad hoc queries where the data resides. More importantly, you can create queries and data sets for reporting and ad hoc needs.&lt;/p&gt;

&lt;p&gt;Data Lake Analytics --- One of the most common use cases for Trino is directly querying data on a data lake without the need for transformation. You can query structured or semi-structured data in various sources. This means you can create operational dashboards without massive transformations.&lt;/p&gt;

&lt;p&gt;Batch ETLs --- Trino is a great engine to run your ETL batch queries. That's because it can process large volumes of data quickly as well as bring in data from multiple sources without always needing to extract data from sources such as MySQL.&lt;/p&gt;

&lt;h1&gt;
  
  
  Challenges For Trino Users
&lt;/h1&gt;

&lt;p&gt;At this point, you might assume that everyone should use Trino. But it's not that simple. For one, using Trino requires a fairly large amount of setup. And there are a few other issues you might deal with after setup.&lt;/p&gt;

&lt;h1&gt;
  
  
  Federated Queries Can Be Slow
&lt;/h1&gt;

&lt;p&gt;The one downside of federated queries is that there can be some trade-offs in speed. This can be caused by the lack of metadata stored and managed by Trino to better run queries. In addition, Presto was initially developed at Facebook, which essentially has its own cloud. For Facebook, expanding the cluster whenever it needs more speed isn't a huge problem. Other organizations, however, might need to spend a lot more money adding machines to their clusters to get the same level of performance. This can become very expensive, all to manage unindexed data.&lt;/p&gt;

&lt;p&gt;One such solution is &lt;a href="https://varada.io/" rel="noopener noreferrer"&gt;Varada&lt;/a&gt;. Varada indexes data in Trino in a way that reduces the CPU time spent scanning data and frees the CPU for other tasks, like fetching data extremely fast or dealing with concurrency. This allows SQL users to run varied queries, whether across dimensions and facts or other types of joins and data lake analytics, on indexed data across federated data sources. SQL aggregations and grouping are accelerated using nanoblock indexes as well, resulting in highly effective SQL analytics.&lt;/p&gt;

&lt;h1&gt;
  
  
  Configuration And Set-Up
&lt;/h1&gt;

&lt;p&gt;Setting up Trino isn't straightforward, especially when it comes to optimizing performance and management. In turn, many organizations will need dedicated system admins and IT teams to both set up and manage their instances of Trino.&lt;/p&gt;

&lt;p&gt;One great example of this is an AWS article titled "&lt;a href="https://aws.amazon.com/blogs/big-data/top-9-performance-tuning-tips-for-prestodb-on-amazon-emr/" rel="noopener noreferrer"&gt;Top 9 Performance Tuning Tips For PrestoDB&lt;/a&gt;".&lt;/p&gt;

&lt;h1&gt;
  
  
  Lack Of Enterprise Features
&lt;/h1&gt;

&lt;p&gt;One of the largest challenges faced by companies utilizing Trino is that, out of the box, there aren't a lot of features geared toward enterprise needs. That is to say, features that revolve around security, access control, and even expanded data source connectivity are limited. Many vendors are trying to provide better solutions in this area.&lt;/p&gt;

&lt;p&gt;A great example of this is &lt;a href="https://seattledataguy.substack.com/p/starburst-data-raised-100m-but-what" rel="noopener noreferrer"&gt;Starburst Data&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Starburst Enterprise has several features that help shore up Trino's limited security features. For example, Starburst makes it easy for your team to set up access control.&lt;/p&gt;

&lt;p&gt;The access control systems all follow role-based access control mechanisms with users, groups, roles, privileges and objects.&lt;/p&gt;

&lt;p&gt;This is demonstrated in the image below.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmiro.medium.com%2Fmax%2F1400%2F0%2AAo14IL0Z3IXE5nwJ" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmiro.medium.com%2Fmax%2F1400%2F0%2AAo14IL0Z3IXE5nwJ"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://docs.starburst.io/latest/security/access-control.html" rel="noopener noreferrer"&gt;Source&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This makes it easy for your security teams and &lt;a href="https://www.theseattledataguy.com/what-is-a-data-warehouse-and-why-we-build-them-a-video/" rel="noopener noreferrer"&gt;data warehouse&lt;/a&gt; administrators to manage who has access to what data.&lt;/p&gt;

&lt;p&gt;Starburst also offers other helpful security features such as auditing and encryption.&lt;/p&gt;

&lt;p&gt;This enables companies to implement a centralized security framework without having to code their own modules for Trino.&lt;/p&gt;
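&lt;p&gt;As a toy sketch of the role-based model described above (users mapped to roles, roles granted privileges on objects), here is a minimal Python illustration; the names and structure are hypothetical and this is not Starburst's or Trino's API:&lt;/p&gt;

```python
# Toy role-based access control: users -> roles -> (privilege, object) grants
ROLE_GRANTS = {
    "analyst": {("SELECT", "warehouse.sales")},
    "admin": {("SELECT", "warehouse.sales"), ("INSERT", "warehouse.sales")},
}
USER_ROLES = {"ada": {"analyst"}, "grace": {"admin"}}


def is_allowed(user, privilege, obj):
    """A user is allowed an action if any of their roles holds the grant."""
    return any((privilege, obj) in ROLE_GRANTS.get(role, set())
               for role in USER_ROLES.get(user, set()))


print(is_allowed("ada", "SELECT", "warehouse.sales"))  # True
print(is_allowed("ada", "INSERT", "warehouse.sales"))  # False
```

Real systems layer groups, hierarchies, and auditing on top, but the core lookup is this shape: resolve the user's roles, then check the grant.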

&lt;h1&gt;
  
  
  Conclusion
&lt;/h1&gt;

&lt;p&gt;Big data will continue to be a big problem for companies that don't start looking to tools like Trino to help them manage all their data. Trino's ability, as an agnostic SQL engine, to query large data sets across multiple data sources makes it a great option for many of these companies. But as discussed, Trino is far from perfect. It isn't fully optimized for enterprise companies to take advantage of all its abilities. In addition, Trino's brute-force approach to speed sometimes comes at a cost: without indexing, getting the speed benefits becomes very expensive.&lt;/p&gt;

&lt;p&gt;This is where many new solutions are coming into the fold to make Trino more approachable. In the end, big data can be maintained and managed, you just need the right tools to help set yourself up for success.&lt;/p&gt;

&lt;p&gt;If you enjoyed this article, then check out these videos and articles below.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.youtube.com/watch?v=uEPCxBaRf6A" rel="noopener noreferrer"&gt;Data Engineer Vs Analytics Engineer Vs Analyst&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.theseattledataguy.com/why-migrate-to-the-modern-data-stack-and-where-to-start/" rel="noopener noreferrer"&gt;Why Migrate To The Modern Data Stack And Where To Start&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="http://youtube.com/watch?v=s40MptE20Tc&amp;amp;t=1s" rel="noopener noreferrer"&gt;5 Great Data Engineering Tools For 2021 --- My Favorite Data Engineering Tools&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.youtube.com/watch?v=kSt9NV-qZkc&amp;amp;t=1s" rel="noopener noreferrer"&gt;4 SQL Tips For Data Scientists&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://logitanalytics.com/what-are-the-benefits-cloud-data-warehousing-and-why-you-should-migrate/" rel="noopener noreferrer"&gt;What Are The Benefits Of Cloud Data Warehousing And Why You Should Migrate&lt;/a&gt;&lt;/p&gt;

</description>
      <category>database</category>
      <category>bigdata</category>
      <category>datascience</category>
      <category>sql</category>
    </item>
    <item>
      <title>ETLs vs ELTs: Why are ELTs Disrupting the Data Market?</title>
      <dc:creator>SeattleDataGuy</dc:creator>
      <pubDate>Tue, 30 Nov 2021 13:53:52 +0000</pubDate>
      <link>https://dev.to/seattledataguy/etls-vs-elts-why-are-elts-disrupting-the-data-market-286o</link>
      <guid>https://dev.to/seattledataguy/etls-vs-elts-why-are-elts-disrupting-the-data-market-286o</guid>
      <description>&lt;p&gt;Photo by &lt;a href="https://unsplash.com/@abduzeedo?utm_source=unsplash&amp;amp;utm_medium=referral&amp;amp;utm_content=creditCopyText"&gt;Fabio Sasso&lt;/a&gt; on &lt;a href="https://unsplash.com/s/photos/pipelines?utm_source=unsplash&amp;amp;utm_medium=referral&amp;amp;utm_content=creditCopyText"&gt;Unsplash&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In the business world, cloud technology has become more and more dominant in recent years. Right now, research shows that about &lt;a href="https://www.computerweekly.com/news/2240241926/Prepare-for-two-modes-of-business-intelligence-says-Gartner"&gt;50% of all business data&lt;/a&gt; is stored in the cloud, which just demonstrates the importance of external data sources and their place in the modern business environment.&lt;/p&gt;

&lt;p&gt;In seeking to keep up with digital transformation and data trends, many businesses are turning to &lt;a href="https://seattledataguy.substack.com/p/what-are-etls-and-why-are-they-important"&gt;ELT (Extract, Load, and Transform)&lt;/a&gt; tools. Besides accommodating heavy workloads, ELTs help teams integrate data.&lt;/p&gt;

&lt;p&gt;In this post, we'll take a look at ELTs and how they compare to ETLs, and why ELTs have become such a disruptive force in the data market.&lt;/p&gt;

&lt;p&gt;Let's dive in.&lt;/p&gt;

&lt;h2&gt;
  
  
  ETLs vs. ELTs
&lt;/h2&gt;

&lt;p&gt;While ETLs and ELTs both deal with data, they are, in fact, different tools. ETL is the Extract, Transform, and Load process for addressing data, while ELT is Extract, Load, and Transform. In an ETL model, data migrates from its original source to a staging area, where it is transformed before landing in the data warehouse. In an ELT model, data is loaded directly into the target system, often a data lake or cloud warehouse, and transformed there.&lt;/p&gt;

&lt;p&gt;Both ELT and ETL involve the following three steps:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Extract&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The extraction phase of both ELT and ETL solutions involves pulling source data from the original database. In an ELT model, the data goes right to a data storage system. In classic ETLs you're pulling the data into a staging area.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Transform&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Transformation is the process of changing the data's structure and is what ultimately allows the data to integrate with the target data system and the rest of the information it contains.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Load&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Loading is the process of moving the data into a data storage system, which prepares it for analysis.&lt;/p&gt;

&lt;p&gt;ETL and ELT perform these steps in different orders. Teams deciding between the two solutions will need to determine whether to transform their data before or after moving it into the data storage repository.&lt;/p&gt;
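&lt;p&gt;The ordering difference can be sketched in a few lines of Python. The functions below are illustrative stand-ins, not any specific tool's API:&lt;/p&gt;

```python
# A minimal sketch of ETL vs. ELT ordering. All names here (extract,
# transform, load) are made up for illustration.

def extract(source):
    # Pull raw rows out of the source system.
    return list(source)

def transform(rows):
    # Normalize a field; returns new rows rather than mutating inputs.
    return [{**row, "email": row["email"].lower()} for row in rows]

def load(rows, target):
    # Append rows into the target store.
    target.extend(rows)
    return target

source = [{"email": "A@EXAMPLE.COM"}, {"email": "b@example.com"}]

# ETL: transform in a staging step, then load the cleaned data.
warehouse_etl = load(transform(extract(source)), [])

# ELT: load the raw data first, then transform inside the target.
warehouse_elt = transform(load(extract(source), []))

assert warehouse_etl == warehouse_elt  # same result, different ordering
```

&lt;p&gt;Either way the warehouse ends up with the same cleaned rows; what changes is where the transform runs: in a staging step you maintain (ETL) or inside the target system's own engine (ELT).&lt;/p&gt;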

&lt;p&gt;In the data science landscape, both ETL and ELT are necessary technologies. Because information sources, including unstructured NoSQL databases and structured SQL databases, almost never use the same formats, data must be transformed and enriched before it can be analyzed as a whole. By ingesting and standardizing this data, ETL and ELT solutions allow business intelligence platforms to do their job.&lt;/p&gt;

&lt;p&gt;What each is good for: ETL is beneficial for organizations concerned about data compliance and privacy, since it cleans sensitive and secure data before sending it to the data warehouse. ELT, in contrast, is excellent for large-scale data transformations and tends to be more affordable than ETL.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Should I Use an ELT?
&lt;/h2&gt;

&lt;p&gt;While both ETLs and ELTs have their place in the data landscape, more and more organizations are choosing to adopt ELT tools to address the volume and speed of their big data sources, which often overload the more traditional ETL tools.&lt;/p&gt;

&lt;p&gt;When used correctly, ELT tools streamline analysis data preparation. Because ELTs load data into the framework where it will eventually be processed, staged, and transformed, they allow teams to skip some busy work associated with data transformation.&lt;/p&gt;

&lt;p&gt;Here are a few of the benefits of using ELT systems:&lt;/p&gt;

&lt;h2&gt;
  
  
  Fewer Physical Infrastructure Requirements
&lt;/h2&gt;

&lt;p&gt;ETL tools provide the infrastructure for the steps between extracting data and loading it into repositories. Organizations that want to integrate data into target systems must therefore purchase and maintain these tools.&lt;/p&gt;

&lt;p&gt;ELTs, meanwhile, don't require that intermediate step, which means they need less dedicated infrastructure and fewer specialized resources. The target system's engine performs the transformation instead of an engine native to an ETL tool.&lt;/p&gt;

&lt;h2&gt;
  
  
  More Efficient Data Staging
&lt;/h2&gt;

&lt;p&gt;ETL tools cleanse data and prepare it for transformation. ELT tools, however, stage data &lt;em&gt;after&lt;/em&gt; loading it into a data warehouse, lake, or cloud storage solution. This makes the data staging process significantly more efficient and reduces latency across the board. Additionally, leading ELT tools make fewer demands on initial data sources and reduce the in-between steps associated with data processing.&lt;/p&gt;
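&lt;p&gt;As a hedged sketch of this 'transform after load' pattern, the snippet below uses Python's built-in sqlite3 as a stand-in for a cloud warehouse engine; the table and column names are made up for illustration:&lt;/p&gt;

```python
# Transform-after-load sketch: raw rows land untouched, then the target
# engine (sqlite3 standing in for a warehouse) runs the transform in SQL.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE raw_events (user_id INTEGER, amount TEXT)")

# Load step: dump the raw source data as-is, with no staging layer.
raw = [(1, "10.50"), (1, "4.25"), (2, "7.00")]
conn.executemany("INSERT INTO raw_events VALUES (?, ?)", raw)

# Transform step: runs inside the "warehouse" engine, after loading.
conn.execute("""
    CREATE TABLE user_totals AS
    SELECT user_id, SUM(CAST(amount AS REAL)) AS total
    FROM raw_events
    GROUP BY user_id
""")

print(conn.execute("SELECT * FROM user_totals ORDER BY user_id").fetchall())
# [(1, 14.75), (2, 7.0)]
```

&lt;p&gt;Because the raw rows are already in the target, re-running or revising the transformation is just another query, with no upstream staging layer to rebuild.&lt;/p&gt;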

&lt;h2&gt;
  
  
  Accelerated Time to Value (Sort of)
&lt;/h2&gt;

&lt;p&gt;Because they transform data within a target system, ELT tools speed up the time to value for teams. This allows data scientists and analysts working with big data to leverage and transform data quickly and to implement machine learning techniques for better analysis.&lt;/p&gt;

&lt;p&gt;ETL tools, on the other hand, require a manual coding process to ensure data conformity and uniformity, which adds time to the experience and increases latency across the board.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Place of ELTs in the Modern Data Market: How ELTs Impact Data Warehousing
&lt;/h2&gt;

&lt;p&gt;One of the largest benefits of ELT systems is the way they improve both data lakes and warehouses. Regardless of which solution a team is using, ELT tools significantly reduce the time required to prepare data for use. By loading data into a data lake framework, ELTs allow organizations to take advantage of the processing engines within the solution when it comes time to stage and transform data.&lt;/p&gt;

&lt;p&gt;This serves a few distinct purposes. Besides providing immense scalability and leveraging parallel processing, it eliminates the requirement that organizations rely on conventional data modeling to unify their data.&lt;/p&gt;

&lt;p&gt;Here are a few of the other ways ELT solutions overhaul data warehousing:&lt;/p&gt;

&lt;h2&gt;
  
  
  Streamlined Architecture
&lt;/h2&gt;

&lt;p&gt;ELT tools streamline the process of preparing data for use. Because there is no in-between layer with built-in processing power limitations, the ELT can handle both data staging and transformation, which streamlines the experience for users.&lt;/p&gt;

&lt;h2&gt;
  
  
  Rapid Data Incorporation
&lt;/h2&gt;

&lt;p&gt;ELT solutions make it possible to incorporate data rapidly into both warehouses and lakes. With traditional methods, these sources can be difficult and clunky to use, leading to unnecessary latency and delays.&lt;/p&gt;

&lt;h2&gt;
  
  
  Some ELT Tools
&lt;/h2&gt;

&lt;p&gt;There are lots of options for ELTs. Honestly, ELTs are less about the tool and more about the method.&lt;/p&gt;

&lt;p&gt;However, many solutions that market themselves as ELTs also offer automated connectors that allow users to quickly develop their pipelines.&lt;/p&gt;

&lt;h2&gt;
  
  
  Fivetran
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://www.theseattledataguy.com/what-is-fivetran-and-why-you-should-use-it/"&gt;Fivetran&lt;/a&gt; is a highly comprehensive ELT tool that is becoming more popular every day. This tool allows efficient collection of customer data from related applications, websites, and servers. The data collected is then transferred to other tools for analytics, marketing, and warehousing purposes.&lt;/p&gt;

&lt;p&gt;Not only that, Fivetran has plenty of functionality. It has your typical source-to-destination connectors, and it allows for both pushing and pulling of data. The pull connectors can pull from data sources in a variety of ways, including ODBC, JDBC, and multiple API methods.&lt;/p&gt;

&lt;p&gt;Like many other ELT tools, Fivetran push connectors receive data that a source sends, or pushes, to them. In push connectors, such as &lt;a href="https://fivetran.com/docs/events/webhooks#webhooks"&gt;Webhooks&lt;/a&gt; or &lt;a href="https://fivetran.com/docs/events/snowplow#snowplow"&gt;Snowplow&lt;/a&gt;, source systems send data to Fivetran as events.&lt;/p&gt;

&lt;p&gt;Most importantly, &lt;a href="https://fivetran.com/docs/transformations/basic-sql"&gt;Fivetran allows for different types of data transformations&lt;/a&gt;, putting the T in ELT. It supports both scheduled and triggered transformations. Depending on the transformations you use, there are also other features like version control, email notifications, and data validations.&lt;/p&gt;

&lt;h2&gt;
  
  
  Stitch... Sort Of
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://www.stitchdata.com/"&gt;Stitch&lt;/a&gt; was developed to take a lot of the complexity out of ETLs and ELTs. One of the ways Stitch does this is by removing the need for data engineers to create pipelines that connect to APIs like in Salesforce and Zendesk.&lt;/p&gt;

&lt;p&gt;It also connects to many databases, like MySQL. Having access to a broad set of API connectors is only one of the many benefits that make Stitch easy to use.&lt;/p&gt;

&lt;p&gt;Stitch also removes a lot of the heavy lifting of setting up cron jobs for when tasks should run, and it manages much of the logging and monitoring. ETL frameworks like Airflow and Luigi do offer some similar features, but they are much less straightforward to use there.&lt;/p&gt;

&lt;p&gt;Stitch is operated almost entirely through a GUI, which can make it a more approachable option for non-data engineers. It does allow you to add rules and set times for when your ETLs will run.&lt;/p&gt;

&lt;h2&gt;
  
  
  Airbyte
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://airbyte.io/"&gt;Airbyte&lt;/a&gt; is a new open-source (MIT) EL+T platform that started in July 2020. It has a fast-growing community and it distinguishes itself by several significant choices:&lt;/p&gt;

&lt;p&gt;Airbyte's connectors are usable out of the box through a UI and an API, with monitoring, scheduling, and orchestration. Their ambition is to support 50+ connectors by EOY 2020. These connectors run as Docker containers so they can be built in the language of your choice. Airbyte components are also modular and you can decide to use subsets of the features to better fit in your data infrastructure (e.g., orchestration with Airflow or K8s or Airbyte's...)&lt;/p&gt;

&lt;p&gt;Similar to Fivetran, Airbyte integrates with DBT for the transformation piece, hence the EL+T. Unlike Singer, Airbyte uses a single open-source repo to standardize and consolidate all developments from the community, leading to higher-quality connectors. They also built a compatibility layer with Singer so that Singer taps can run within Airbyte.&lt;/p&gt;

&lt;p&gt;Airbyte's goal is to commoditize ELT, by addressing the long tail of integrations. They aim to support 500+ connectors by the end of 2021 with the help of its community.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Future of ELT
&lt;/h2&gt;

&lt;p&gt;Teams who need to accommodate the power, size, and speed of big data may search in vain for a solution that can help them achieve their goals. Fortunately, ELT is changing that dynamic. Designed to help teams move past the traditional layers of data processing and transformation and modernize the approach, ELTs simplify both integration and architecture, decreasing latency and offering agile, enhanced performance.&lt;/p&gt;

&lt;p&gt;When compared to traditional ETL methods, it's clear that ELTs are the way of the future, as far as data processing is concerned. More sustainable, effective, and timely overall, ELT methods provide more flexibility and customization for organizations who want to control their data integration and implementation.&lt;/p&gt;

&lt;p&gt;By offering high speeds, rapid load times, and an invitingly low maintenance requirement, cloud-based ELT systems place the burden of transformation on the data destination, eliminating the need for data staging. This helps organizations enjoy a simpler relationship to their data, without sacrificing power.&lt;/p&gt;

&lt;p&gt;If we look into the future of data processing, it stands to reason that ELTs will rapidly become the de facto system for organizations focused on efficiency, scalability, and reliability. While both solutions have their strengths and weaknesses, the ELT has emerged as the undeniable favorite of many organizations around the globe.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.youtube.com/watch?v=uEPCxBaRf6A"&gt;Data Engineer Vs Analytics Engineer Vs Analyst&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.theseattledataguy.com/why-migrate-to-the-modern-data-stack-and-where-to-start/"&gt;Why Migrate To The Modern Data Stack And Where To Start&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="http://youtube.com/watch?v=s40MptE20Tc&amp;amp;t=1s"&gt;5 Great Data Engineering Tools For 2021 -- My Favorite Data Engineering Tools&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.youtube.com/watch?v=kSt9NV-qZkc&amp;amp;t=1s"&gt;4 SQL Tips For Data Scientists&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://logitanalytics.com/what-are-the-benefits-cloud-data-warehousing-and-why-you-should-migrate/"&gt;What Are The Benefits Of Cloud Data Warehousing And Why You Should Migrate&lt;/a&gt;&lt;/p&gt;

</description>
      <category>datascience</category>
      <category>database</category>
      <category>bigdata</category>
      <category>startup</category>
    </item>
    <item>
      <title>Why Not to Become a Data Engineer</title>
      <dc:creator>SeattleDataGuy</dc:creator>
      <pubDate>Mon, 29 Nov 2021 14:16:24 +0000</pubDate>
      <link>https://dev.to/seattledataguy/why-not-to-become-a-data-engineer-44ke</link>
      <guid>https://dev.to/seattledataguy/why-not-to-become-a-data-engineer-44ke</guid>
      <description>&lt;p&gt;As companies struggle to manage their massive and complex data sets, the necessity for data engineers has become more apparent.&lt;/p&gt;

&lt;p&gt;Data engineering became the fastest-growing single job in 2019 with &lt;a href="https://seattledataguy.substack.com/p/are-companies-hiring-fewer-data-scientists"&gt;50 percent year-on-year growth&lt;/a&gt;, and there's little reason to believe demand for data engineers will slow soon. As with all careers, though, there are pros and cons to data engineering. Here's what you need to know about this up-and-coming job field and some of the reasons you may or may not want to pursue it.&lt;/p&gt;

&lt;h2&gt;
  
  
  What Is a Data Engineer?
&lt;/h2&gt;

&lt;p&gt;Data engineers move, remodel, and manage data sets from tens if not hundreds of internal company applications so analysts and data scientists don't need to spend their time constantly pulling data sets.&lt;/p&gt;

&lt;p&gt;They may also create a core layer of data that lets different data sources connect to it to get more information or context. Data engineers spend their time developing data pipelines, managing data warehouses and maintaining all the various infrastructure components they develop along the way.&lt;/p&gt;

&lt;p&gt;These specialists are usually the first people to handle data. They process the data so it's useful for everyone, not just the systems that store it.&lt;/p&gt;

&lt;p&gt;There are obvious reasons to become a data engineer --- like a high salary and numerous opportunities due to limited competition within the job market --- but we're not focusing on those today. Instead, let's ask the question: why not become a data engineer?&lt;/p&gt;

&lt;h2&gt;
  
  
  What Skills Does a Data Engineer Need?
&lt;/h2&gt;

&lt;p&gt;To assist you as a new data engineer, I have created a &lt;a href="https://www.youtube.com/watch?v=LgSHaOvNodA"&gt;skill set pyramid&lt;/a&gt; that can be thought of as a hierarchy of skill set needs. This will help you focus on the skills you should learn first, allowing you to build a solid foundation as you move on to more specific skills. Just remember, the way you learn each step of the pyramid does not need to be overly rigid or stay in a strict order. You can layer each step, helping you progress as you learn. Let’s get started!&lt;/p&gt;

&lt;h2&gt;
  
  
  Reasons Not to Become a Data Engineer
&lt;/h2&gt;

&lt;p&gt;Despite being an in-demand career that promises high earnings and job security, becoming a data engineer isn't for everyone. As with most professions, it's important to consider your own skills, talents, and personality before choosing a career in data engineering. Here are some of the reasons you may not want to become a data engineer.&lt;/p&gt;

&lt;h2&gt;
  
  
  You're Letting Money Drive Your Career Choice
&lt;/h2&gt;

&lt;p&gt;While money is certainly important, it shouldn't be the driving force behind your career choice. Assuming you're planning to work in the tech field anyway, it's better to &lt;a href="https://www.youtube.com/watch?v=MjuQIdSwa3w"&gt;choose a role you will genuinely enjoy&lt;/a&gt;, even if the earnings could be a bit lower. A difference of $5,000 or even $10,000 in earnings won't drastically impact the lifestyle of a highly paid tech worker, especially once taxes are taken into account. The level of enjoyment you derive from your work, on the other hand, will affect your overall happiness and satisfaction in your professional life.&lt;/p&gt;

&lt;p&gt;The average data engineer earns &lt;a href="https://www.payscale.com/research/US/Job=Data_Engineer/Salary"&gt;$92,650 per year&lt;/a&gt;, which is significantly above the &lt;a href="https://www.jobted.com/salary"&gt;overall US average of $53,490&lt;/a&gt;. The financial benefits of becoming a data engineer, however, become much less clear when compared to other jobs in the tech field. The average software engineer, for example, can expect to make &lt;a href="https://www.payscale.com/research/US/Job=Software_Engineer/Salary"&gt;about $87,690&lt;/a&gt;. As you can see, the difference between a software engineer's salary and a data engineer's salary is fairly negligible. If software engineering would be a more fulfilling job for you, the slightly higher average salary isn't worth going into data engineering.&lt;/p&gt;

&lt;h2&gt;
  
  
  You Don't Enjoy the Engineering Mindset
&lt;/h2&gt;

&lt;p&gt;Data engineering requires you to adopt and deploy an engineering mindset, which some people can find rather constrictive. Because data engineers often need to create pieces of infrastructure that other engineers can maintain in the future, they must &lt;a href="https://www.youtube.com/watch?v=qJWZbZ5NNJ8&amp;amp;t=102s"&gt;work within a strict set of rules and standards&lt;/a&gt;. These rules are extremely important but can also seem burdensome to those who prefer more creative freedom in their projects.&lt;/p&gt;

&lt;p&gt;This isn't to say, of course, that there aren't creative aspects to the engineering mindset. High-level problem solving, for instance, often requires engineers to develop creative solutions. Likewise, engineers use creative problem-solving skills to &lt;a href="https://www.engineeringpassion.com/developing-the-engineering-mindset/"&gt;continuously improve&lt;/a&gt; their projects. In order to be a successful data engineer, you'll need to be able to balance your creative impulses with the rigorous mindset of an engineering professional.&lt;/p&gt;

&lt;h2&gt;
  
  
  You Aren't Flexible Enough
&lt;/h2&gt;

&lt;p&gt;One of the most interesting aspects of being a data engineer is the flexibility and lack of definition for the role. Because of its &lt;a href="https://www.youtube.com/watch?v=qJWZbZ5NNJ8&amp;amp;t=300s"&gt;highly interdisciplinary nature&lt;/a&gt;, data engineering combines elements of data analysis, programming, modeling, machine learning, and many other specific skills. Becoming a professional data engineer requires you to flexibly adapt and deploy these various skills as needed for the specific project you're working on.&lt;/p&gt;

&lt;p&gt;The downside of this interdisciplinary approach to data engineering is that it requires more flexibility than most other tech jobs. In your data engineering career, you may take on drastically different roles at different companies while maintaining the same job title. If you prefer to have a well-defined, set role, you likely won't enjoy the somewhat chaotic world of data engineering.&lt;/p&gt;

&lt;h2&gt;
  
  
  You Don't Enjoy Continuous Learning
&lt;/h2&gt;

&lt;p&gt;A final reason you may not want to become a data engineer is that you don't enjoy the process of continuous learning. Technologies are constantly shifting and evolving, requiring data engineers to update their skills on an ongoing basis. The &lt;a href="https://www.theseattledataguy.com/what-is-snowflakedb-and-why-you-should-use-it-for-your-cloud-data-warehouse/#page-content"&gt;cloud data warehouse tool Snowflake&lt;/a&gt;, for example, has seen &lt;a href="https://www.morningstar.com/articles/1020679/snowflake-shows-substantial-growth"&gt;substantial growth over the last 10 years&lt;/a&gt; as companies have embraced cloud computing. As trends like this emerge, data engineers must learn to use new tools and technologies to stay at the cutting edge.&lt;/p&gt;

&lt;p&gt;If one of your career goals is to eventually stop learning and rest on your laurels, data engineering won't be a good fit for you. Indeed, this is &lt;a href="https://www.youtube.com/watch?v=qJWZbZ5NNJ8&amp;amp;t=448s"&gt;true of almost all roles&lt;/a&gt; in the tech industry. Continuous learning is crucial for staying on top of trends and technologies, and even the most seasoned experts must pursue ongoing education to remain relevant. Failure to stay on top of new developments practically guarantees that your skills will eventually become outdated. While you might be able to work with older technologies at a handful of companies, your career options will narrow significantly when you stop updating your skills.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion: Know Yourself
&lt;/h2&gt;

&lt;p&gt;As you can see, knowing yourself and your preferences is essential to deciding whether a career in data engineering is right for you. In addition to knowing what you like to do in terms of specific tasks and working conditions, it's also important to consider your own personality. Data engineering is a suitable role for people who prefer to &lt;a href="https://www.youtube.com/watch?v=qJWZbZ5NNJ8&amp;amp;t=639s"&gt;work in the background&lt;/a&gt; within a company, rather than directly driving conversations with management using data insights. If you prefer that more extroverted side of data, though, you may enjoy a role as a data scientist.&lt;/p&gt;

&lt;p&gt;Overall, becoming a data engineer is a great career choice for people who love detail, following engineering guidelines, and building pipelines that allow raw data to be turned into actionable insights. As mentioned earlier, a career in data engineering also offers excellent earning potential and strong job security. With that said, the job isn't for everyone. If some of the reasons detailed above seem to describe you, it may be a good idea to give data engineering a second thought and explore other tech careers that could fit you better.&lt;/p&gt;

&lt;p&gt;If you want to read/watch more about data engineering, then check out the links below:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.youtube.com/watch?v=uEPCxBaRf6A"&gt;Data Engineer Vs Analytics Engineer Vs Analyst&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.theseattledataguy.com/why-migrate-to-the-modern-data-stack-and-where-to-start/"&gt;Why Migrate To The Modern Data Stack And Where To Start&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="http://youtube.com/watch?v=s40MptE20Tc&amp;amp;t=1s"&gt;5 Great Data Engineering Tools For 2021 -- My Favorite Data Engineering Tools&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.youtube.com/watch?v=kSt9NV-qZkc&amp;amp;t=1s"&gt;4 SQL Tips For Data Scientists&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://logitanalytics.com/what-are-the-benefits-cloud-data-warehousing-and-why-you-should-migrate/"&gt;What Are The Benefits Of Cloud Data Warehousing And Why You Should Migrate&lt;/a&gt;&lt;/p&gt;

</description>
      <category>database</category>
      <category>datascience</category>
      <category>career</category>
      <category>beginners</category>
    </item>
  </channel>
</rss>
