How to Data Science, Pt. 1: Starter Kit

#datascience #jobsearch #learning #career

Hi all,

As a fellow job-seeker, I write this humbly and with the intent of helping others bridge their own gaps as they try to enter this field. Most of this post is just a consolidation of tips and sources from data scientists and educators who have helped me along the way. Though I have not yet succeeded in the immediate goal of employment myself, I am hopeful that at least one person will get started on, or improve at, something that moves them toward this shared goal.

Step 1: Coding Language & Editor

In the last few years, Python has overtaken R and others to take the crown as the language to learn if you want to enter this field. freeCodeCamp has an extensive collection of certifications suited to beginners. This four-hour course is great for an absolute beginner: you get explanations and step-by-step live practice as you move along, and you can track your progress. The site also has data analysis, machine learning, and other tracks that go beyond the basics, but more on solidifying that below!

Now the editor: this comes second because freeCodeCamp takes a hands-on approach, so you are able to code while learning on Scrimba, their platform of choice. You will, however, start delving in and want something for yourself! Code needs a home too.
If you are a beginner, Anaconda works best because it comes with the main packages preinstalled, Python dependencies built in, and a way to add formatted notes as you learn with Jupyter Notebooks. You can transition to other editors later, so please do not take the above statement as gospel if you do not enjoy using Anaconda with Jupyter Notebooks long term. I transitioned to Visual Studio Code because of the centrality and simplicity of the tool. The trade-off is that you will be installing a lot more packages yourself.
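If you do move away from Anaconda, it helps to know what your environment is missing before a tutorial fails halfway through. Here is a minimal sketch, assuming a starter list of packages that I picked purely for illustration (your course or project may need different ones). It only checks whether each name can be imported; it does not install anything.

```python
from importlib.util import find_spec

# Hypothetical starter list (an assumption, not an official requirement).
# These are import names: scikit-learn, for example, is imported as "sklearn".
STARTER_PACKAGES = ["numpy", "pandas", "matplotlib", "sklearn"]


def missing_packages(names):
    """Return the names that cannot be imported in the current environment."""
    return [name for name in names if find_spec(name) is None]


if __name__ == "__main__":
    missing = missing_packages(STARTER_PACKAGES)
    if missing:
        print("Not installed yet:", ", ".join(missing))
    else:
        print("All starter packages are available.")
```

Anything it reports as missing is something you would install yourself, which is exactly the extra work Anaconda saves you at the start.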

Step 1a: Underlying Theory -> Statistics!

Yes. At least for the basics of machine learning, you will need a little bit of foundation underlying your practice. Coding is just a way to help you apply theory to solve decision-driven data problems. You need to know what you are evaluating in the raw data, and how, in order to extract information from it. A great book, first recommended by @DataSciBae on Twitter, is O'Reilly's Practical Statistics for Data Scientists. As a math degree holder, I found it effective at pointing out where to focus, so that you can work on understanding the particular concepts in theoretical and applied mathematics that you need in order to use ML techniques properly. Get the 2nd edition, which has both R and Python examples. See below for the thread and her advice. Also consider following her for more!!

Even more basic is an understanding of probability and statistics. If this book does not spell it out clearly enough for you, consider taking Khan Academy's Statistics and Probability course.
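To give a taste of why the basics matter, here is a small sketch using only Python's built-in statistics module. The numbers are made up for illustration (pretend they are daily website visits for one week). One unusual day is enough to pull the mean well away from the median, which is the kind of thing you want to notice before you start modeling.

```python
import statistics

# Made-up sample: daily website visits over one week (illustrative only).
# The 500 is a deliberate outlier.
visits = [120, 135, 128, 150, 500, 142, 131]

mean_visits = statistics.mean(visits)      # sensitive to the outlier
median_visits = statistics.median(visits)  # the middle value once sorted
stdev_visits = statistics.stdev(visits)    # sample standard deviation

print(f"mean:   {mean_visits:.1f}")
print(f"median: {median_visits}")
print(f"stdev:  {stdev_visits:.1f}")
```

Try removing the 500 and running it again to see how much the mean and standard deviation move while the median barely changes.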

Step 2: Big Data and Business Intelligence

At this point you will have some basics of the landscape, but there's more! Think of this point as having laid down the foundation cement for your home (where I come from we build homes with bricks and cement only, but this is a story for another day). Now we add bricks! The meat of data science is going to be lost if you don't know how to handle data. Understanding concepts such as data cleaning, exploratory analysis, and also data mining (because data does not magically appear!) is important. UC Davis has a Coursera specialization to fit this need. It is also suited for beginners, and will guide you through querying, SQL (Structured Query Language), and Apache Spark among other tools and skills.
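Here is a rough sketch of what cleaning and querying look like in practice. It uses Python's built-in sqlite3 module as a stand-in so it runs anywhere; this is my own toy example, not material from the specialization, and the table, column names, and records are all made up.

```python
import sqlite3

# Made-up raw records: note the stray whitespace, mixed case, and a missing amount.
raw_rows = [
    ("  Alice ", "books", "12.50"),
    ("bob", "Books", "7.25"),
    ("Carol", "games", ""),
    ("alice", "GAMES", "30.00"),
]


def clean(row):
    """Trim and normalize the text fields; turn an empty amount into None."""
    name, category, amount = row
    return (
        name.strip().title(),
        category.strip().lower(),
        float(amount) if amount.strip() else None,
    )


# An in-memory database, so nothing is written to disk.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE purchases (customer TEXT, category TEXT, amount REAL)")
conn.executemany(
    "INSERT INTO purchases VALUES (?, ?, ?)",
    [clean(row) for row in raw_rows],
)

# Number of purchases and total spend per category, skipping missing amounts.
query = """
    SELECT category, COUNT(amount) AS n, SUM(amount) AS total
    FROM purchases
    WHERE amount IS NOT NULL
    GROUP BY category
    ORDER BY category
"""
totals = conn.execute(query).fetchall()
for category, n, total in totals:
    print(category, n, total)

conn.close()
```

Without the cleaning step, "books" and "Books" would be counted as two different categories, which is why cleaning comes before any querying or modeling.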

Step 3: Dig Deeper -> What did you like?

So if you have gotten this far, you would understand how to mine, clean, query, and visualize data, how to create models and explore data for insights, and of course, you are a Pythonista (R people, don't kill me)! So now what?
There are so many roads, reader. Each of the items mentioned above is a job in and of itself. If you are a fan of data collection and visualization, you could be a data analyst. If the process of acquiring and preparing data to be used by others appeals to you more, then you are somewhere between a database administrator (governance) and a data engineer. What about just the models? The only things you care about are normalizing data, building the best model in the world (which, spoiler alert, you should know doesn't exist), and communicating these insights to stakeholders. Well then hello, data scientist! You want more? Get to combining all of these skill sets with some software engineering ones to become a machine learning engineer (which is a minefield by itself and one I intend to enter)!

So all that blabbing and diddly daddle above didn't really guide you towards anything specific because like Nelly Furtado said, the more you know, the less you know.

At this point, you have to make a decision about which direction you want to go. After you do that, come back because we are going to take a deep dive into resources that would enhance any one of those jobs above (also because I have to finish putting those together ha!). Each track has more tools and specialization to focus on. Also note, the data scientist has multiple directions into which they could veer. We'll get into those too in the next (or next next) article of this series.

Until then, enjoy Nelly's music (¡Tambien en Espanol aqui!) and this is 1/?

Sia