2.11.19 - Sprinting

#devjournal #datascience #machinelearning #selftaught

My weekly accountability report for my self-study approach to learning data science.

Studying concepts and interview questions. One of my goals was to study flashcards every day, and that happened three times: an improvement over the previous week, but still short of my goal. I'm doing around 25 questions a day, which takes about 15 minutes, and the Anki app is cool because once I start mastering a question it swaps it out for new ones. I currently have around 500 questions, including Chris Albon's Machine Learning Flashcards, but I'm also slowly adding new concepts as I encounter them. I wish I had started this when I first started learning, but in some ways I wasn't ready, since I didn't even have a good idea of how to wrap my head around all the things I needed to learn. My decks are divided between Python & Coding, Machine Learning, Deep Learning, and the hand-drawn Machine Learning cards.
Freelancing. Work is going well. Learning stuff. Getting to work with another cloud service provider. Also Postgres! It's reassuring that my rusty database skills are coming back. I'm also excited to do some named entity recognition at some point soon.
SPRINT! The big news this week is that I participated in the WiMLDS scikit-learn Open Source Sprint this weekend. It was, frankly, a stretch for me, but I learned a lot, mostly about the open-source process, testing, and using git and GitHub to open a PR. I also became more familiar with the types of challenges that open-source contributors face.
The issue I worked on was getting docstrings into the numpydoc format, first by running a script that flagged inconsistencies and then attempting to fix the problems. But there's only so much I want to try to do on my first attempt. I think I changed three minor things, like removing a blank line and rearranging a section. Yet somehow that took the entire day, and I had questions at every step. Fortunately, a Twitter friend was there and we gave each other moral support. The day went by insanely fast because I was concentrating so much.
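For anyone who hasn't seen the numpydoc format, here is a minimal sketch of what a docstring in that style looks like. The function `scale_values` and the helper `section_headers` are made up for illustration; this is not scikit-learn code and not the script used at the sprint. It only shows the underlined section titles and the usual order they appear in (Parameters, then Returns, then Examples).

```python
def scale_values(values, factor=1.0):
    """Multiply each value by a constant factor.

    Parameters
    ----------
    values : list of float
        Numbers to scale.
    factor : float, default=1.0
        Multiplier applied to every element.

    Returns
    -------
    scaled : list of float
        The input values multiplied by ``factor``.

    Examples
    --------
    >>> scale_values([1.0, 2.0], factor=2.0)
    [2.0, 4.0]
    """
    return [v * factor for v in values]


def section_headers(docstring):
    """Return numpydoc-style section titles in the order they appear."""
    lines = [line.strip() for line in docstring.splitlines()]
    # A section title is a non-empty line followed by a dash underline
    # of exactly the same length.
    return [
        title
        for title, underline in zip(lines, lines[1:])
        if title and underline == "-" * len(title)
    ]


if __name__ == "__main__":
    print(section_headers(scale_values.__doc__))
```

A section in the wrong place or a stray blank line is the kind of small inconsistency I was fixing, though the real validation checks far more than this toy helper does.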
I didn't do any fast.ai, and I might put that on the back burner, at least while I'm doing freelance work. The eternal question is what I need to know to get my first job in data science AND how to balance my time, since you can't learn it all. Depending on who you ask, you'll get different answers. While I really enjoy deep learning, I think there are more job opportunities for people with a wide variety of machine learning skills.
Which brings me to my reading list. I keep hearing that understanding the production process is critical for data scientists. Vicki Boykis wrote about it in this "oh-so-helpful-to-the-self-study-data-science-learner" article Data Science is Different Now:
Along with data cleaning, what’s become more clear as the hype cycle continues its way to productivity is that data tooling and being able to put models into production has become even more important than being able to build ML algorithms from scratch on a single machine, particularly with the explosion of the availability of cloud resources.

This article has so many good parts that I keep returning to it every few months or so on my journey. It also seems to confirm Sharpest Minds' portfolio approach, where mentors help ensure that mentees have a project that demonstrates an understanding of the complete pipeline from data acquisition through deployment. From their FAQ:

Normally you'll focus on industry best practices in deploying ML models to production, devops, writing clean code, and doing proper data engineering and data cleaning.

Knowing this, I keep looking for opportunities to get better at git, cloud and Docker tools, development environments, etc. I've also signed up for the Full Stack Deep Learning Bootcamp, which takes place in a few weeks. The fear I always have is that I'll be in over my head, but after watching a few videos of previous sessions, I'm hopeful that while it will be challenging, it won't be overwhelming.

I have a couple of other links that I've come across recently that I'm putting on my reading list as well. Chip Huyen had a really great thread on blog articles that discuss platforms and deployment:

To learn how to design machine learning systems, I find it really helpful to read case studies to see how great teams deal with different deployment requirements and constraints. Here are some of my favorite case studies.
— Chip Huyen (@chipro) October 28, 2019

This O'Reilly book caught my eye: Building Machine Learning Powered Applications: Going from Idea to Product by Emmanuel Ameisen. Finally, Ben Weber is working on a book called Data Science in Production and is sharing it chapter by chapter as blog posts, which I can't wait to start reading.

Cover Photo via Good Free Photos
