By Ben Rogojan
We recently did an AMA on Reddit. The most common question that came up was what is the difference between a data scientist and a data engineer. So we wanted to make a more in-depth post on the subject.
There are a lot of data specialist positions that sound similar and use similar tools, so it can be difficult to know what each role is supposed to do. In addition, smaller companies might be limited in how many data engineers or data scientists they can hire. This means that the specific tasks and goals of the two roles often start to intermingle.
This can make it much more difficult to clearly differentiate the two roles. So we wanted to go over how the two positions differ by discussing the different goals, mindsets, tools and backgrounds data engineers and data scientists have.
Before we go into the differences, we would like to make a quick preface. The truth is, many data scientists and data engineers will perform the tasks of other technical roles. A data scientist may need to develop an ETL, and a data engineer might need to develop an API and front end. So the distinctions we point out below are just meant to make clear where the technical differences are.
The goals of a data engineer are much more task and development focused. Data engineers build automated systems and model data structures so that data can be processed efficiently. This means the goal of a data engineer is to create and develop tables and data pipelines that support analytical dashboards and other data customers (like data scientists, analysts, and other engineers). In that sense, the role is similar to most engineering roles. A lot of design work, assumptions, limitations, and development go into being able to perform a final task. Each design and solution has its own set of limitations, even if they can all perform the end task.
In comparison, data scientists tend to be question focused, in the sense that they are looking for ways to reduce costs, increase profits, improve customer experience or improve business efficiency. This means they need to ask and then answer questions (ask a question, hypothesize and then conclude). So they need to ask questions like: What impacts patient readmission? Would a customer spend more if shown ad A vs. ad B? Is there a faster route to deliver packages? Skipping over the rest of the process, the goal from here is to find an answer to whatever question is posed. The result might be a final conclusion, or it might lead to more questions. Throughout the process, data scientists analyze the data, gather support and develop a conclusion to the question.
This is where things can get confusing. Data scientists and data engineers both often rely on Python and SQL. However, how the two roles use these skills varies. Again, this ties back to the mindset differences. Python is a very robust language with libraries that help manage operational tasks as well as analytical ones.
Similarly, with SQL, data scientists' queries tend to be ad hoc (i.e., question focused), whereas data engineers' queries tend to focus on cleaning up and transforming data.
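To make the contrast concrete, here is a minimal sketch using Python's built-in sqlite3 module. Everything in it is made up for illustration: the `orders_raw` and `orders_clean` tables, their columns, and the sample rows are assumptions, not a real schema. The first function is the kind of cleanup and transformation query a data engineer might write, and the second is the kind of ad hoc, question-focused query a data scientist might run on top of it.

```python
import sqlite3


def build_clean_table(conn):
    """Engineering-style query: clean raw rows into a reusable table.

    Table and column names are hypothetical.
    """
    conn.execute("DROP TABLE IF EXISTS orders_clean")
    conn.execute(
        """
        CREATE TABLE orders_clean AS
        SELECT DISTINCT
            order_id,
            LOWER(TRIM(ad_variant)) AS ad_variant,  -- normalize labels
            CAST(amount AS REAL) AS amount          -- text to number
        FROM orders_raw
        WHERE amount IS NOT NULL                    -- drop incomplete rows
        """
    )


def average_spend_by_ad(conn):
    """Ad hoc query: do customers spend more when shown ad A or ad B?"""
    rows = conn.execute(
        """
        SELECT ad_variant, AVG(amount)
        FROM orders_clean
        GROUP BY ad_variant
        ORDER BY ad_variant
        """
    ).fetchall()
    return dict(rows)


def demo():
    # In-memory database with a few invented, deliberately messy rows.
    conn = sqlite3.connect(":memory:")
    conn.execute(
        "CREATE TABLE orders_raw (order_id INTEGER, ad_variant TEXT, amount TEXT)"
    )
    conn.executemany(
        "INSERT INTO orders_raw VALUES (?, ?, ?)",
        [
            (1, " A ", "10.0"),
            (1, " A ", "10.0"),  # duplicate row
            (2, "a", "20.0"),
            (3, "B", "30.0"),
            (4, "b ", None),     # missing amount
        ],
    )
    build_clean_table(conn)
    return average_spend_by_ad(conn)
```

Calling `demo()` returns the average spend per ad variant computed from the cleaned table. The point is the division of labor: the cleanup query runs once and is reused, while the question-focused query can change every time a new question comes up.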
Now, one other common question when it comes to the differences between data engineers and data scientists is what background is required.
Data engineering and data science both require some understanding of data and programming, even if it is limited in scope. However, there are some distinctions that go beyond programming, specifically for data scientists. Because a data scientist is more like a researcher, having a research-based background is a benefit.
This might be in economics, psychology, epidemiology, etc. Combine a research background with SQL, Python and a good sense of business and you have a data scientist. These degrees are not set in stone. In fact, we have run into data scientists with a variety of degrees. That said, most employers will prefer to hire a data scientist with at least a master's degree that has some sort of technical or mathematical focus.
Data engineering positions usually won't require a master's degree. Data engineering is more about being a developer, which requires practical experience much more than theoretical knowledge. So gaining a master's degree does not supply the same value.
Let's say a director of a healthcare company decides they would like to figure out how to reduce the number of patients readmitted within 30 days of their original visit. From a data point of view, there are a couple of things that need to occur.
Data scientists will need to figure out what drives patient readmission. That is the question they will be trying to answer. Based on the conclusions they reach, they will work with the business to develop metrics and policies to help improve patient readmission rates.
Data engineers will be developing tables to help the data scientists answer the question, while at the same time developing analytical tables to help track past and future patient readmission metrics. How these metrics are created will be driven by the answers the data scientists get.
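As a rough illustration of the kind of metric those analytical tables might track, here is a small sketch in plain Python. The input format (pairs of a patient ID and a visit date), the function name, and the choice to count a return visit on exactly day 30 as a readmission are all assumptions made for the example. A real definition would come out of the conversation between the business and the data scientists.

```python
from collections import defaultdict
from datetime import date


def readmission_rate(visits, window_days=30):
    """Share of visits followed by another visit from the same patient
    within `window_days`.

    `visits` is an iterable of (patient_id, visit_date) pairs. This input
    shape and the inclusive boundary are illustrative assumptions.
    """
    # Group visit dates by patient.
    by_patient = defaultdict(list)
    for patient_id, visit_date in visits:
        by_patient[patient_id].append(visit_date)

    total = 0
    readmitted = 0
    for dates in by_patient.values():
        dates.sort()
        for i, current in enumerate(dates):
            total += 1
            # A visit counts as leading to a readmission when the same
            # patient's next visit falls inside the window.
            has_next = i + 1 < len(dates)
            if has_next and (dates[i + 1] - current).days <= window_days:
                readmitted += 1

    return readmitted / total if total else 0.0


# Invented sample data: one patient returns after 14 days, then again
# months later; a second patient visits only once.
sample_visits = [
    ("p1", date(2020, 1, 1)),
    ("p1", date(2020, 1, 15)),
    ("p1", date(2020, 6, 1)),
    ("p2", date(2020, 3, 1)),
]
```

With the sample data above, one of the four visits is followed by a return visit inside the window, so `readmission_rate(sample_visits)` gives 0.25. In practice a data engineer would more likely express this as a scheduled SQL transformation, but the logic is the same.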
Data scientists and data engineers have plenty of differences. They have different goals and backgrounds, but this is where the value of using both together comes from. Because data engineers focus on engineering robust systems, data scientists can query data easily and analyze it efficiently. Their partnership is what brings companies value from data.
We hope this post was helpful! Please feel free to reach out with any questions you may have.