Data science is a multidisciplinary blend of data analysis, algorithm development, and technology used to solve analytically complex, data-driven problems. Data scientists are a new class of analytical data experts with the technical skills to solve these problems, and theirs has been called the sexiest job of the 21st century. A data scientist's typical tasks comprise data collection, data pre-processing, and data analysis.
Previously, there were few tools to assist data scientists, who relied mostly on writing their own custom code. Over the past five years, however, the landscape has changed so much that most of these tasks and routines have been automated by dozens of tools. Even so, data scientists keep seeking out new tools that help them answer their complex data problems, and many consider programming knowledge an integral part of data science. Not every data scientist can code, but it is helpful to be aware of tools that can assist with and organize programming.
Here is an attempt to list the top 5 must-have tools for a data scientist, in no particular order. This list is more focused on data scientists who would like to get their hands dirty with code.
The tools below take you from setting up, through writing code, version control, and organising that code, to finally exposing your predictive models to the real world.
Anaconda / Anaconda Navigator (Windows)
Anaconda is a free and open-source distribution of the Python and R programming languages for scientific computing (data science, machine learning applications, large-scale data processing, predictive analytics, etc.) that aims to simplify package management and deployment.
Previously, the academic world leaned heavily towards MATLAB. However, thanks to Python's recent advancements and its open-source community, the tide has now shifted towards the Python ecosystem. Anaconda Navigator is a desktop graphical user interface (GUI) included in the Anaconda distribution that lets users launch applications and manage conda packages, environments, and channels without using the command line. It is a must-have tool for getting started with Python programming in no time.
Anaconda Navigator gives you access to many applications, such as:
- Jupyter Notebook
- Visual Studio Code
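Once the distribution is installed, a quick sanity check from Python can confirm that the packages you expect are available in the active environment. The snippet below is only a sketch: the helper `check_packages` and the package list are illustrative assumptions, not part of Anaconda itself.

```python
import importlib.util


def check_packages(names):
    """Map each top-level package name to True if it can be imported here."""
    return {name: importlib.util.find_spec(name) is not None for name in names}


# Hypothetical list; adjust it to the packages your project needs.
report = check_packages(["numpy", "pandas", "flask"])
for name, available in report.items():
    print(f"{name}: {'installed' if available else 'missing'}")
```

Anything reported as missing can then be installed through Anaconda Navigator's environment view.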
Now you have everything set up to start writing code. The next tool helps you write code and see the outcome in real time.
Jupyter Notebook
If you are just getting started with Python, Jupyter Notebooks are one of the best ways to learn. The Jupyter Notebook is a web application that provides interactivity (code as you go): you can create and share documents that contain live code, equations, visualizations, and text. It is an ideal tool for gaining the data science skills you need.
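To give a feel for the "code as you go" style, here is a minimal sketch of what three notebook cells might look like, one each for collection, pre-processing, and analysis. The inline CSV data and the variable names are made up for illustration, and only the Python standard library is used so that it runs anywhere.

```python
import csv
import io
import statistics

# Cell 1: collect. A hypothetical inline CSV stands in for a real data source.
RAW = """city,temp_c
Oslo,4.5
Cairo,
Lima,19.0
Pune,27.5
"""
rows = list(csv.DictReader(io.StringIO(RAW)))

# Cell 2: pre-process. Drop rows with a missing temperature and convert to float.
clean = [
    {"city": r["city"], "temp_c": float(r["temp_c"])}
    for r in rows
    if r["temp_c"]
]

# Cell 3: analyze. Inspect a simple summary statistic.
mean_temp = statistics.mean(r["temp_c"] for r in clean)
print(f"{len(clean)} usable rows, mean temperature {mean_temp:.1f} °C")
```

In a notebook, each commented section would live in its own cell, so you can rerun the analysis step after tweaking the pre-processing without reloading the data.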
Software development is never complete. It is an ongoing process that evolves with the data. The next tool helps you with version control and project management, enabling collaboration.
Git / GitHub
Git is the most widely used version control system, and GitHub is a leading software development platform that uses Git for version control. A version control system records changes to a file or set of files over time so that you can recall specific versions later. Git is an essential tool because it helps you work with others, and you will find it in many workplaces.
With Git, committed work is not lost, as you can always go back to previous versions of your programs. It helps resolve conflicts when synchronizing work done by different people on different machines, so it scales as your team does. Knowing Git also makes it easier to contribute to open-source packages.
Git and other version control systems are vital for keeping a record of entire projects; they are not designed to manage individual code snippets. Committing sensitive information to a Git repository is also generally not good practice.
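As a rough illustration of recalling an earlier version, the sketch below drives the `git` command line from Python in a throwaway directory. It assumes `git` is installed and on your PATH; the file name `analysis.py`, the helper names, and the demo identity are invented for the example.

```python
import pathlib
import shutil
import subprocess
import tempfile


def git(repo, *args):
    """Run a git command inside `repo` and return its standard output."""
    cmd = [
        "git",
        "-c", "user.name=Demo",               # throwaway identity for the demo
        "-c", "user.email=demo@example.com",
        "-c", "commit.gpgsign=false",         # avoid signing prompts
        *args,
    ]
    result = subprocess.run(
        cmd, cwd=repo, check=True, capture_output=True, text=True
    )
    return result.stdout


def demo_recover_previous_version():
    """Commit two versions of a file, then read the first one back."""
    with tempfile.TemporaryDirectory() as repo:
        script = pathlib.Path(repo) / "analysis.py"
        git(repo, "init", "--quiet")

        script.write_text("threshold = 0.5\n")
        git(repo, "add", "analysis.py")
        git(repo, "commit", "--quiet", "-m", "First version")

        script.write_text("threshold = 0.9\n")
        git(repo, "commit", "--quiet", "-am", "Raise threshold")

        # HEAD~1 is the commit before the latest one.
        previous = git(repo, "show", "HEAD~1:analysis.py")
        current = script.read_text()
        return previous, current


if shutil.which("git"):
    previous, current = demo_recover_previous_version()
    print("previous:", previous.strip())
    print("current: ", current.strip())
```

In day-to-day work you would type the same `git add`, `git commit`, and `git show` commands directly in a terminal; wrapping them in Python here just keeps the example self-contained.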
For this purpose, a code snippet manager can be used. One such unique snippet organizer is DECS (Decentralized Encrypted Code Snippets).
DECS
Pricing: FREE / Paid Enterprise Version
DECS offers an all-in-one workspace to securely store, and tightly control access to, code snippets and sensitive information such as proprietary snippets, tokens, configurations, and certificates.
Data analysis pipelines are often much the same from project to project. Rather than memorizing the complete routine for a specific pipeline and rewriting the same code across several experiments or applications, it’s easier to remember just what that pipeline or algorithm is and look up the code.
Using DECS, you can store code snippets or routines fully encrypted and reuse them across several applications by copy-pasting or downloading them. This helps in organizing a large amount of valuable information without having to memorize it. The code-capture-on-the-go feature of DECS lets you copy code from anywhere on the web with one click. Because data stored on DECS is end-to-end encrypted, you can store sensitive data, including API keys, and it is always just a search away, saving much time. It is also decentralised and free.
Next is a tool that helps you expose your work as an API to external tools and users and helps you scale up.
Flask
Flask is a handy tool for exposing machine learning models as web calls and is useful for building microservices. Flask is fun and easy to set up, as it says on the Flask website. That’s true, as this microframework for Python offers a powerful way of annotating a Python function with a REST endpoint. Using Flask, machine learning models can be published as an API accessible to users, customers, clients, or third-party business applications. This way, you could even commercialize your machine learning models as a web service.
These are the tools that can help you get up and running. What do you think of the list? Let me know if you have any other tools that are super useful for a data scientist.
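As a minimal sketch of that idea, the example below annotates a Python function with a `/predict` endpoint. The "model" is a hypothetical stand-in (a fixed linear formula with made-up weights), and the route name and JSON shape are assumptions chosen for illustration; in practice you would load a trained model instead.

```python
from flask import Flask, jsonify, request

app = Flask(__name__)

# Hypothetical stand-in for a trained model: fixed linear coefficients.
WEIGHTS = [0.5, -1.0]
BIAS = 2.0


def predict(features):
    """Return BIAS plus the weighted sum of the input features."""
    return BIAS + sum(w * x for w, x in zip(WEIGHTS, features))


@app.route("/predict", methods=["POST"])
def predict_endpoint():
    """Accept JSON like {"features": [x1, x2]} and return a prediction."""
    payload = request.get_json(silent=True)
    features = payload.get("features") if isinstance(payload, dict) else None
    valid = (
        isinstance(features, list)
        and len(features) == len(WEIGHTS)
        and all(isinstance(x, (int, float)) for x in features)
    )
    if not valid:
        return jsonify(error="expected 'features' to be a list of 2 numbers"), 400
    return jsonify(prediction=predict(features))
```

Saved as, say, `app.py` (a file name assumed here), this could be served locally with `flask --app app run` and called by any client that can send a JSON POST request.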