Dylan Lisk

A novice's explanation of Data Science

The first step of this process is data acquisition, which can be accomplished with SQL queries, API calls, web scraping, or simply downloading a file like a .csv. For this brief overview I won't touch on data acquisition.

So you have your data in a .csv file. What now?

   import pandas as pd
   import numpy as np

   df = pd.read_csv('path/data.csv')

What that does is import pandas and name it pd. I then call the pandas read_csv() function to create a pandas.DataFrame object with the data, which I assign to df.
With a DataFrame in hand, we can much more easily address missing values and trim the data.

As a novice, my go-to strategy for dealing with missing values is to simply cut out the offending observations.

   df = df.dropna()

You can also impute values such as the mean or median, which lets you preserve the rest of the data in the row. Be careful if the feature you are imputing is important to the target; it's better to mess as little as possible with highly correlated features.
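
For example, here's a minimal sketch of median imputation, where 'col' stands in for a hypothetical column with missing values:

   # Fill missing values in a hypothetical column 'col' with its median
   df['col'] = df['col'].fillna(df['col'].median())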

Now we deal with outliers. How do we do that?

You guessed it! I cut that shit out. There are many statistical tests and methods to determine what counts as an outlier. I don't really know them at the moment, so I arbitrarily cut the data. To be a little more intelligent, if you have separate test data you can find its min() and max() values and chop your train data to reflect that range.
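
Here's a rough sketch of that kind of arbitrary trimming; the 1st/99th percentile cutoffs and the column name 'col' are just placeholders, not a rule:

   # Trim observations outside arbitrary percentile cutoffs for 'col'
   lower = df['col'].quantile(0.01)
   upper = df['col'].quantile(0.99)
   df = df[(df['col'] >= lower) & (df['col'] <= upper)]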

Ok so let's look at our data.

  import matplotlib.pyplot as plt

  smp = df.sample(100)
  pd.plotting.scatter_matrix(smp)
  plt.show()


[Image: scatter matrix of the sample]
The first thing to look at is the diagonal of histograms. These show us the distribution of each variable's values. For a lot of modeling techniques we want our features to have a normal distribution.
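
A quick numeric complement to eyeballing the histograms is checking skewness; values near 0 suggest a roughly symmetric distribution, while large values flag lopsided ones:

   # Skewness near 0 is roughly symmetric; large magnitude means lopsided
   print(df.skew(numeric_only=True))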

What is normal?

[Image: a normal bell-curve distribution]
This is a normal distribution. If your distribution isn't normal, what can you do besides panic? You can try to transform your data.

   df['log_col'] = np.log1p(df['col'])

This creates a new column of log(1 + x) transformed values; see how that works. Data science is iterative, so if that doesn't work, try something else.

So we have our data where we want it. Now we bust out the models. The go-to Python package for modeling is scikit-learn; I recommend visiting https://scikit-learn.org/dev/index.html. It is a one-stop shop for all things machine learning.

Before you throw your data into a model you need to do a couple of things.
First, split your data into a train and test set. The model is trained on the train set and evaluated on the test set. Pretty self-explanatory. Next, scale the data down. Models are happy when features are on a roughly unit scale so that coefficients stay manageable. Some of sklearn's scalers can help with outliers as well.
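
Here's a minimal sketch of both steps, assuming a hypothetical 'target' column as the thing we're predicting:

   from sklearn.model_selection import train_test_split
   from sklearn.preprocessing import StandardScaler

   # 'target' is a hypothetical label column
   X = df.drop(columns=['target'])
   y = df['target']
   X_train, X_test, y_train, y_test = train_test_split(
       X, y, test_size=0.2, random_state=42)

   # Fit the scaler on the train set only, then apply the same scaling to test
   scaler = StandardScaler()
   X_train = scaler.fit_transform(X_train)
   X_test = scaler.transform(X_test)

(RobustScaler is the sklearn scaler that's less sensitive to outliers.)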

What time is it?

It's modeling time baby

You would think this is where the real work begins, and you would be wrong. You just did most of the work with the data cleaning, exploration, and feature engineering.
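
Just to show how little ceremony the modeling step itself takes, here's a sketch fitting a plain linear regression on the split from above; the model choice is arbitrary:

   from sklearn.linear_model import LinearRegression

   # Fit on the scaled training data, score R^2 on the held-out test set
   model = LinearRegression()
   model.fit(X_train, y_train)
   print(model.score(X_test, y_test))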

I'll touch on model selection in a future post, which I will link here.
