
Marriane Akeyo


The Ultimate Guide to Getting Started in Data Science

We can define data science as an interdisciplinary field that uses scientific methods to prepare data for analysis, including cleansing, aggregating, and manipulating the data to perform advanced data analysis. In this article I am going to provide a guide to the various topics you can cover in order to get a head start in this awesome field.

1. Introduction to Python: Mastering Python Basics

Python is a high-level, general-purpose programming language. Its design philosophy emphasizes code readability through the use of significant indentation. Due to its simple structure and the qualities above, it is highly preferred for data science and data analytics. You can visit my posts on introduction to Python and Python data structures to get a quick introduction to this wonderful language.
A quick and easy way to run your Python code is Jupyter Notebook via Anaconda. Jupyter Notebook is an interactive coding environment which comes in very handy when advancing from writing simple Python scripts to using Python libraries to perform data science tasks.
Anaconda, on the other hand, is an open-source distribution of the Python and R programming languages that simplifies package management and deployment. You are first required to install Anaconda, then launch the Jupyter Notebook that ships with it.
Here is a blog post that sheds more light on Jupyter Notebook for Python and Anaconda.

Tip: Once you have Anaconda installed and you are familiar with VS Code, you can install the Jupyter extension and create a new file with the .ipynb extension to use Jupyter notebooks inside VS Code.
Once you run a file, you will be prompted to select the environment you want to use to run your notebook. Select the base environment to keep the process smooth.

You can also use Google Colab, which is even easier since all the packages come pre-installed; all you need to do is import and use them, as long as you have a stable network connection. Consider this option after you get the basics right, so as to ease your coding process.

2. Introduction to Databases

Once you have tried out some Python projects and feel comfortable moving forward, learn about databases.
A database simply consists of a collection of data that is stored and accessed electronically. Data is fetched from the database with the help of queries, which ask the database to perform various operations. There are two types of databases: relational and non-relational.
A relational database has tables, and data is stored in the tables in the form of rows and columns. Examples include MySQL and PostgreSQL, which are queried using SQL.
A non-relational database is also referred to as a NoSQL database. Many of them store data as JSON, a human-readable format for storing and retrieving data. Data is stored as a collection of objects containing key-value pairs: each key is separated from its value by a colon, and the pairs are separated from each other by commas.

user
{
  "id": "qwe245ert",
  "name": "John",
  "occupation": "Doctor"
}

In the example above, we have a user object with the keys id, name and occupation, each containing its own value. Some non-relational databases include MongoDB and Amazon DynamoDB.
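As a quick sanity check, an object like the one above can be parsed with Python's built-in json module (a minimal sketch using the same sample values; note that JSON requires double quotes):

```python
import json

# The user object from above, as a JSON string
raw = '{"id": "qwe245ert", "name": "John", "occupation": "Doctor"}'

user = json.loads(raw)        # parse into a Python dict
print(user['name'])           # John
print(user['occupation'])     # Doctor
```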
For further insight into the differences between relational and non-relational databases, check out this video.
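To make the relational side concrete, here is a minimal sketch using Python's built-in sqlite3 module (the table and values are made up for illustration): we create a table with columns, insert a row, and fetch it back with a query.

```python
import sqlite3

# In-memory database: nothing is written to disk
conn = sqlite3.connect(':memory:')
cur = conn.cursor()

# A table whose rows and columns hold our data
cur.execute('CREATE TABLE users (id TEXT, name TEXT, occupation TEXT)')
cur.execute("INSERT INTO users VALUES ('qwe245ert', 'John', 'Doctor')")

# A query asking the database to fetch data
cur.execute('SELECT name, occupation FROM users')
rows = cur.fetchall()
print(rows)   # [('John', 'Doctor')]
conn.close()
```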

3. Understanding Python Libraries Used in Data Science

This is always the exciting part for me and I do hope it excites you too. Python contains very useful libraries for performing various data science tasks, so understanding these packages is essential to your progress in this field. To use a library, install it with pip3 install library_name and then import it into your file.

The libraries include:

  • Pandas

  • Numpy

  • Matplotlib

  • Seaborn

  • Pyforest

a) Pandas

This is the Python package or library used for data analysis. It provides a DataFrame that allows you to play around with your data and structure it however you want. Pandas is mostly preferred for its flexibility and its ability to work with big data.
Most people name their DataFrame df when analysing their data. Some functionalities that come with Pandas include:
Reading a dataset (here, from a CSV file):

df = pd.read_csv('url_of_the_csv_file')

Viewing the first few rows of the dataset:

df.head()

Locating rows:

df.loc['row_label']

#locating using an integer position
df.iloc[row_index]


Sorting

df.sort_values('column_to_sort_by', ascending=False)

These are just a few examples. Check out this blog for more clarification.
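Here is a minimal runnable sketch tying these calls together; the column names and values are made up for illustration, and the DataFrame is built in code instead of being read from a CSV so the example is self-contained:

```python
import pandas as pd

# A tiny illustrative DataFrame
df = pd.DataFrame({
    'name': ['Ann', 'Ben', 'Cara'],
    'age': [34, 28, 41],
})

print(df.head())       # first rows of the DataFrame
print(df.loc[0])       # row with label 0
print(df.iloc[-1])     # last row, by integer position
print(df.sort_values('age', ascending=False))  # oldest first
```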

b) Numpy
Numpy is similar to lists in Python. It is used to store multidimensional arrays, e.g. 1D, 2D or 3D arrays. Numpy is usually preferred over lists because every element of an array has the same fixed data type, whereas values in a list can vary. Numpy also uses less storage space than a list, because the blocks storing the data sit next to each other in memory (contiguous blocks), whereas list elements can be scattered in memory and are reached through pointers. Numpy enables us to do all the mathematics we need.
Using this library, we can perform operations that we could not with lists, e.g.

a=[1,2,3,4]
b=[4,5,6,7]
a*b

The above code produces an error. However, in Numpy we accomplish this using:

import numpy as np

a=np.array([1,2,3,4])
b=np.array([4,5,6,7])
a*b

we get [4, 10, 18, 28], the element-wise product.
Now click here and get more information on how to get started with Numpy.
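Beyond element-wise arithmetic, a few more operations come for free with Numpy arrays; a small sketch with a made-up 2D array:

```python
import numpy as np

a = np.array([[1, 2, 3],
              [4, 5, 6]])     # a 2D array

print(a.shape)          # (2, 3): two rows, three columns
print(a.mean())         # 3.5, the mean of all six values
print(a.T)              # transpose: shape becomes (3, 2)
print(a.reshape(3, 2))  # same data, rearranged into a new shape
```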

c) Matplotlib
This is an open-source drawing library that turns our dataset into drawings that can be used to understand and further explain it. You can generate various charts, e.g. line plots, histograms, bar charts, pie charts and scatter plots. If you are curious about what I am talking about, let this tutorial feed your interest.
Note: this library feeds on the data produced by the two libraries above it. In short, analyse your data by performing mathematical operations on it, then plot a suitable chart of the result.
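A minimal sketch of that workflow, using made-up numbers and saving the figure to a file (the Agg backend renders off-screen, so no display is needed):

```python
import matplotlib
matplotlib.use('Agg')  # render off-screen so no display is needed
import matplotlib.pyplot as plt

# Hypothetical data, just for illustration
languages = ['Python', 'R', 'SQL']
usage = [70, 20, 10]

fig, ax = plt.subplots()
ax.bar(languages, usage)          # one bar per language
ax.set_xlabel('Language')
ax.set_ylabel('Usage (%)')
ax.set_title('A simple bar chart')
fig.savefig('chart.png')          # write the chart to an image file
```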
d) Seaborn
This is where we visualise information from matrices and DataFrames in order to make attractive statistical plots. Seaborn is a complement to Matplotlib, not a substitute. In short, it adds more flavour to our charts and makes them look more appealing. In order to use Seaborn, you need to understand how it approaches data visualization.
More info here.

Now, after understanding the functionality of the above libraries, we can use one awesome library called Pyforest to import all our required libraries with ease. How cool is that, guys...

#First install the library in your terminal
pip3 install pyforest

#Now write this
from pyforest import *

#You can view all imports by typing
lazy_imports()

4. Connecting to a Database of Your Choice
We now know the Python syntax, the libraries and the databases available. How about we combine that knowledge and come up with something cool...
Yes, I am talking about connecting our Python files to a database. This is very important since, as we have seen, we will work with collections of data that we will regularly store and retrieve. Check this post about connecting a Postgres database to Python, this one on Python and DynamoDB data, and this one on connecting to MySQL. When working with databases, we will also get to understand some useful techniques that data must undergo, like flattening data into a table, and some common problems found in data, like missing values, bad values and duplicates, along with methods of avoiding such errors.
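Those common data problems can be handled directly in Pandas; a minimal sketch with made-up messy data, using `drop_duplicates` and `dropna`:

```python
import numpy as np
import pandas as pd

# Hypothetical messy data: a duplicate row and a missing value
df = pd.DataFrame({
    'name': ['Ann', 'Ben', 'Ben', 'Cara'],
    'age': [34, 28, 28, np.nan],
})

df = df.drop_duplicates()        # remove the repeated 'Ben' row
df = df.dropna(subset=['age'])   # drop rows with a missing age
print(df)
```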

I hope this article helps you get started with data science. Do not let the topics overwhelm you. Try to learn a little each day and you will surely be in a better place within a month or even weeks...
Here are some code examples of the topics covered above, in case you need more examples.
Happy coding!!!
