DEV Community

Exploratory Data Analysis; Much Time & Effort?

Matt Curcio on April 15, 2019

Kind Ladies & Gentlemen of Data Science, I am curious about your data science workflow. In particular, how much time and effort do you spend o...

Helen Anderson • Apr 15 '19

At my workplace, we have many, many data sources. Some cleaned up nicely with metadata, others kinda cleaned up but with no way of knowing what means what, others are completely raw, granular and need a lot of work.

I take one day a week to do a deep dive into a dataset I'm not familiar with and get to know it really well. That way I know what I'm in for the next time I'm asked for something in relation to that table, bucket or file.

I've heard of teams doing EDA Saturdays once a month to dig into a data set each and then do a show and tell to share the knowledge.

It takes time, but is so worth it. If you don't know what the data is doing there is no way you can gain any insight from it.

Matt Curcio • Apr 18 '19

Hi Helen,
I am curious. What other Data Science blogs or sites do you follow?

Helen Anderson • Apr 18 '19

Hey Matt

I'm not a Data Scientist so don't follow anything specific to that area.

My favourite blogs at the moment are:

Data36 - data36.com/ for tutorials and hands-on learning.

Simple Analytical - simpleanalytical.com/ for commentary on being in the data world and the ups and downs of being an analyst.

Mode - mode.com/blog/ - content for analysts by analysts.

Soft Skills Engineering - softskills.audio/ - my favourite podcast advice show about non-technical topics.

Matt Curcio • Apr 18 '19

Great, thank you very much.

I noticed that your profile says B.I. How different do you consider B.I. to be from D.S.?

Helen Anderson • Apr 18 '19 • Edited

In my workplace BI works on reports and models that show what has happened in the past and a little bit of forecasting.

The Data Scientists work on AI/ML/NLP, sometimes using our models, sometimes using granular raw data.

Matt Curcio • Apr 15 '19

Interesting! I can see where a separate day (hopefully not my wkend ;)) would be useful.

But what statistical tests do you find yourself drawn to? Correlations, normality, or even boxplots?
Thanks,

Fred Ross • Apr 16 '19

My exploratory work tends to fall in one of three categories:

1. Diagnose what went wrong with this system.
2. Estimate if this project is worth pursuing.
3. Try to poke holes in a model before deploying it.

It's all goal directed against a mental model of what I'm doing. For (1), it's trying to eliminate swathes of the control flow and data flow of the system as quickly as possible. For (2), it's getting rough estimates of the economic impact of an idea. For (3), it's trying to find situations that will break a model, rather like the adversarial process a security researcher goes through.

Instead of writing up a huge list of things to explore, treat it the way Tukey did in his book Exploratory Data Analysis that started the field. You start with a data set, and sequentially ask questions about it and answer them as fast as possible, since often you can cut off a whole line of questioning at the very beginning with a quick check. (It's a great book, by the way...especially to see what he developed to do this kind of work by hand.)

Matt Curcio • Apr 16 '19

Thanks Fred,
I have seen references to John Tukey's book in my reading. It does look interesting.

Matt Curcio • Apr 18 '19

Hi Fred, I am curious. What other Data Science blogs or sites do you follow?

Fred Ross • Apr 18 '19

I don't.

Waylon Walker • Jan 8 '21

I am always pushing for my team to do more EDA. EDA can be done really quickly and can help steer a project with little effort. Like you said, it's easy to generate a whole bunch of things out of a script that you can build over the course of an hour. However, it seems that we jump a bit too quickly to wanting to set up a full project with production workflows, tables, and dashboards to maintain. Much of it is necessary to give the end user a nice interface that they can filter and slice on. The benefit of doing an hour of EDA over months of dashboarding can be huge.
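
For anyone wondering what a quick EDA script might look like, here is a minimal sketch using only the Python standard library. It is an illustration, not anyone's actual workflow from this thread: the data, the column names (`order_values`, `items_per_order`) and the helper functions are all made up. It covers three of the quick checks mentioned above: a five-number summary, boxplot-style outlier fences, and a correlation.

```python
import statistics


def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) for a list of numbers."""
    ordered = sorted(values)
    # method="inclusive" treats the data as the whole population of interest.
    q1, median, q3 = statistics.quantiles(ordered, n=4, method="inclusive")
    return ordered[0], q1, median, q3, ordered[-1]


def tukey_outliers(values, k=1.5):
    """Return values lying more than k * IQR outside the quartiles."""
    _, q1, _, q3, _ = five_number_summary(values)
    spread = q3 - q1
    low, high = q1 - k * spread, q3 + k * spread
    return [v for v in values if v < low or v > high]


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mean_x, mean_y = statistics.fmean(xs), statistics.fmean(ys)
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


# Hypothetical sample data, invented purely for illustration.
order_values = [12, 15, 14, 13, 16, 15, 14, 90]
items_per_order = [1, 2, 2, 1, 3, 2, 2, 9]

summary = five_number_summary(order_values)
outliers = tukey_outliers(order_values)
r = pearson_r(order_values, items_per_order)

print("five-number summary:", summary)
print("possible outliers:", outliers)
print("correlation with items per order:", round(r, 3))
```

A single extreme value can dominate a correlation, so it is worth looking at the outlier list before trusting the last number. That is the kind of quick check that can cut off a whole line of questioning early.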