Eugene Yan

Posted on Oct 22, 2020 • Originally published at eugeneyan.com

Why Have a Data Science Portfolio and What It Shows

Thinking of building your data science portfolio? If we google “data science portfolio”, we’ll get many results on “how” to build one.

However, most resources don’t say enough about the “why” and the “what”. Why work on personal projects and build a portfolio? What does a portfolio demonstrate, other than technical skills?

Whether you’re starting on your first or fifth personal project, I hope this will help you find a meaningful “why” and make projects more enjoyable and sustainable. We'll also hear from awesome creators on their motivations for building and writing (please view in the original post). In addition, we’ll discuss the various skills (technical and non-technical) and traits that projects demonstrate, so you can pick projects that better showcase your strengths.

(Note: I’ll use “portfolios” and “personal projects” interchangeably. That said, portfolios can extend beyond personal projects to include work-related projects too.)

Getting a job shouldn’t be the only “Why”

Getting a job is usually the main reason for building a portfolio. Sometimes, it’s necessary if we don’t have the relevant education or experience. Nonetheless, it’s extrinsic motivation, where we do something for an external reward (i.e., a job) and not for its own sake. This can reduce intrinsic motivation and lead to dependence on external rewards. We might stop once we get that job, or give up if we keep failing (and never get rewarded).

Thus, other than to get a job, we should also find intrinsic reasons for working on personal projects. These reasons will make the project naturally satisfying, where the work is its own reward.

One reason is to learn and practice. Perhaps we’re fascinated by a branch of deep learning. Or we want to get more hands-on experience to hone our skills. Either way, the knowledge and skills gained are often transferable to work and will make us more effective data scientists. It also aligns with a key factor of motivation: Mastery, the desire to improve.

"Learning is a treasure that will follow its owner everywhere." — Chinese Proverb (学习是永远跟随主人的宝物)

Another reason is to help others. This includes volunteering with non-profit organizations such as DataKind, developing and releasing a helpful package, or sharing what we’ve learned (via writing or talks). This aligns with another key motivation factor: Purpose, the desire to contribute to the bigger picture.

"If you want to lift yourself up, lift up someone else." — Booker T. Washington

Finally, we do personal projects because they’re enjoyable. We start projects for the sake of fun, or to scratch a “this should exist” itch. They could also be hobbies. Nonetheless, over time, they build up into an impressive portfolio created through consistent effort.

"It’s hard to beat someone who’s having fun."

Later in this post, we'll see some amazing personal projects and hear from their creators on their "why". Often, it's to scratch an itch, to learn, or to help others.

Portfolios come in two flavors: Code and Content

Most of the time, discussions on data science portfolios refer to code. Such projects involve acquiring public data, performing statistical analysis, plotting visuals, or training machine learning models. They could also include contributions to open-source libraries, as well as data science competitions. Some people may obsess over how much code they write, but don’t sweat it if you’re not committing daily.

Content-based projects are less discussed. This is (technical) content you share via papers or writing online, or talks you give at conferences and meetups. It includes well-written READMEs on git repos, as well as video walkthroughs (e.g., how-tos, summaries, etc.). After we complete a code-based project, we should follow up by writing about it and sharing it so others can benefit from it too.

Portfolios don’t just demonstrate technical skills

Most portfolios demonstrate skills and traits. On skills, portfolios show both technical and soft skills, and both matter to hiring managers.

Technical skills are straightforward to demonstrate and also the most observable. Code-based portfolios show we’re able to do the work and are another data point beyond the resume. They also help to earn trust with recruiters and hiring managers. Depending on the project, you can demonstrate:

Data acquisition and preparation (e.g., scrape some data, and format and clean it)
Data storytelling (e.g., tell a story around the data, with statistics and visuals)
Machine learning (e.g., train and deploy a model, with validation and metrics)
Deployment (e.g., serve your machine learning app online for others to use)
Software engineering (e.g., readability, maintainability, unit tests, documentation)

Portfolios also demonstrate soft skills. Over the long term, they have as much, if not greater, impact on performance. Their effect is obvious when tackling nebulous problems and working with other people. They include:

Solving problems from scratch: Problem framing and figuring out the right metrics
Writing and talks: Ability to communicate, a key skill for an effective data scientist
Teaching: Understanding of a subject and the ability to explain it simply
Contributing to a project: Teamwork and ability to collaborate remotely on code

"In a high-IQ job pool, soft skills like discipline, drive, and empathy mark those who emerge as outstanding." — Daniel Goleman

Beyond skills, portfolios also demonstrate traits. These are seldom mentioned but I think they can be more important when making hiring decisions.

Having personal projects demonstrates curiosity and passion. It shows you’re curious to learn about something on your own. And working on it in your free time demonstrates you have more passion than 99% of people. Given how fast tech (especially data and machine learning) evolves, this curiosity is essential to staying effective.

It also shows willingness and ability to learn. Working on projects exposes challenges not faced in MOOCs. How to clean data. How to explore the search space of data preparation, feature engineering, and machine learning. How to build a basic front-end. How to train and deploy in the cloud. These aren’t taught in MOOCs; the way to learn is through hands-on experience. Personal projects show self-learning beyond regular MOOCs.

Finally, a portfolio is evidence of persistence. Most data science projects are vague and difficult. If you’re new to programming, you might get frustrated with bugs and syntax errors, or mess up your virtual environment for the 128th time. You’ll also face less obvious issues such as:

How to work with data that doesn’t fit in memory (e.g., images, click logs)
How to make models converge faster, if they converge at all
How to run experiments quickly and cheaply in the cloud

Having a portfolio of non-beginner projects and being able to share the challenges faced while working on them demonstrates grit, which is a good predictor of success.

When deciding between two similar entry/mid-level candidates, one who’s less technically qualified but is high on curiosity, grit, and learning ability (“traits”), and another who’s only strong on technical skills (“tech-skills”), I’m more likely to hire on traits.

I’ve seen both tech-skills and traits candidates get hired and have watched their progress over time. The tech-skills candidate will start contributing value earlier. But with the right environment, challenges, and mentoring, the traits candidate will learn fast, outperform, and eventually deliver superior results.

"Hire for attitude, train for skill." – Herb Kelleher

A great portfolio vs. the traits and skills to build one

Job offers are sometimes attributed to having a great portfolio. That’s no surprise, as portfolio artifacts are more directly observable than skills and traits. (And occasionally, it’s bootcamps touting themselves.) However, I think it’s hard to tell whether someone got a job because of an awesome portfolio, or because they had the skills and traits to build one.

IMHO, the traits and skills are a prerequisite to building a great portfolio. And they reinforce each other. As we work on a project, we gain hands-on experience and improve our technical and soft skills. It also hones our persistence and learning ability. The growth is then reflected in the next project—it’s a virtuous cycle.

What’s more likely to help land a job? A great portfolio? Or the skills and traits to build one? The portfolio will help a resume stand out among the sea of resumes and get a first-round interview. But it’s the traits and skills that will secure the job offer and lead to high performance in the role.

Don’t focus on the portfolio; focus on the process

A portfolio is just an artifact of our skills, traits, and working process. It’s the destination; it’ll take care of itself if we focus on the journey.

While trying to build our portfolios, we should find projects that are intrinsically rewarding. They should be fun, personally meaningful, and stretch our abilities—this makes it more sustainable. Over time, brick by brick, a portfolio emerges. It’ll take a while, so let’s get to work.

Thanks to Vincent Warmerdam, Liling Tan, Jay Alammar, Amit Chaudhary, and Elle O’Brien for generously sharing their work and process.

Thanks to Yang Xinyi, David Golden, Kyla Scanlon, Robert Cobb, Ross Richey, and Compound for reading drafts of this.

Great projects and why their creators built them

Vincent Warmerdam has several projects listed on his site and most of the code is open source. The projects are a combination of useful (e.g., word embedding visualizations, scikit-lego) and fun (e.g., cron scheduler) and have great documentation. Here’s his take on why he builds and shares these projects:

“Some of those tools (mainly; whatlies) are written as part of my job. So I gotta admit that I’m a ‘lil bit lucky there. A lot of the other tools originated more from a “this should exist”-feeling. I’ve learned a lot from making these tools, sure, but the reason why they exist is because it is scratching an itch.”

Another great example is Liling Tan’s work on NLP. He builds corpora and tools for NLP. This includes a multilingual corpus, word sense disambiguation, and “character vomiting”. There’s a mix of quirky and useful, and a lot of learning. Here’s why he built them and his advice on sharing code publicly (without being embarrassed):

“Usually it starts with scratching my own itch or satisfying some curiosity. For example, the “character vomiting” tool was built to identify all possible unicode characters that can be generated for specific languages for an NLP task. So I dug into the unicode specification and learned a whole lot about similarities and peculiarities of languages and how Unicode categorize different character sets.

Like viral TikTok videos, you’ll never know which open source becomes popular, so open sourcing your code often is a good way to expose yourself to feedbacks and sometimes get great ideas from feature requests. And for those that are afraid of people being critical at your code publicly, my two cents worth is never to be ashamed of the code you write/release, we all started from zero and everyone is constantly learning in the computing/data world, see how to concatenate strings.”

Made With ML (MWML) has a thread of personal projects showcased on their platform. It includes applying research to product, building ML apps, as well as teaching and sharing about data science journeys. The MWML team shared that a few of these folks got hired into computer vision and joined Weights and Biases. (Also, here’s a great collection of projects from their DS Incubator.)

Great writing and why their creators share

Jay Alammar’s site is a great example of amazing content-based projects. Virtually no one learns about Transformers, BERT, ELMo, and co. without Jay’s illustrated guides. It’s clear that he enjoys demystifying NLP techniques for the rest of the world, and puts care and effort into creating his content. This is due to his curiosity and desire to help others understand research more easily.

“My ML work is motivated by:

Intense curiosity about the topics I write about and fascination about the developments in NLP.
Writing, visualizing, and publishing my work forces me to learn much deeper than if I was just to read a paper.
Reading cutting-edge work in the field is often intimidating, I find. But I found if I give a certain concept enough time and focus, I can understand it in simpler terms than I would gleam from original papers. By elucidating my new-found understanding visually, I hope to make it easier for others to quickly grasp these concepts.
I love the collaborative and open sharing of code and concepts in software and ML fields. I’ve benefitted from incredible software, documentation, and research that people voluntarily put out there for everyone. I want to be a part of that virtuous cycle.”

In the same vein, Amit Chaudhary writes weekly to explain machine learning concepts using diagrams, animations, and intuition. It’s part of his approach of taking small steps to get better at his craft. I enjoyed his breakdown of behavioral testing for NLP models and information retrieval evaluation metrics. It started as a hobby and has since helped him make new friends.

“I initially started writing just as a hobby to share what I was learning. In the process of helping others, it turned out to be a great way to discover my interest areas, connect with interesting people in the ML space, and build a portfolio. I feel everyone faces some unique challenges and resource gaps in their space and can help fill that gap through their writing.”

Another interesting example is Elle O’Brien’s writing. Her content is data science with a touch of quirky. I enjoyed her content on using machine learning to bake the most average cookie and visualizing big data of big hair (mouse over the visuals!). They go beyond the cookie-cutter content we see on Medium. Here's her process of using side projects to learn and pay it forward, which also led to her current role.

“I use side projects as a way to motivate myself to learn data science techniques really thoroughly. For example, once I realized you could teach neural networks to generate completely ridiculous content, I figured I could finally know how a computer would make up romance novel titles. And that started me wanting to use deep learning.

My process is to start with a question, go wherever that question takes me, and then share the project. Sharing your work is important. Everything I learned about the practical, hands-on aspects of data science, I learned, from people who have shared their software and their datasets and their thinking. So sharing is “paying it forward”. It also helps you build credentials and network; I got my current job through Twitter after I shared a project using a generative adversarial network.

Something worth noting: When I did these side projects, I was doing a doctoral degree that was teaching a lot, but not much about modern machine learning (all the action happening the last few years in deep learning, for example). Side projects made sure I was establishing some credentials there, so I’d be able to get the jobs I wanted when I graduated. And also so I didn’t “miss out” on all the action :) ”