<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Eugene Yan</title>
    <description>The latest articles on DEV Community by Eugene Yan (@eugeneyan).</description>
    <link>https://dev.to/eugeneyan</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F487698%2F45a202cf-f5c8-4cf9-b844-54150d27474a.png</url>
      <title>DEV Community: Eugene Yan</title>
      <link>https://dev.to/eugeneyan</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/eugeneyan"/>
    <language>en</language>
    <item>
      <title>Almost Everything You Need To Know About Data Discovery Platforms</title>
      <dc:creator>Eugene Yan</dc:creator>
      <pubDate>Mon, 02 Nov 2020 01:23:10 +0000</pubDate>
      <link>https://dev.to/eugeneyan/almost-everything-you-need-to-know-about-data-discovery-platforms-13ap</link>
      <guid>https://dev.to/eugeneyan/almost-everything-you-need-to-know-about-data-discovery-platforms-13ap</guid>
      <description>&lt;p&gt;In the past year or two, many companies have shared their data discovery platforms (the latest being &lt;a href="https://engineering.fb.com/data-infrastructure/nemo/"&gt;Facebook’s Nemo&lt;/a&gt;). Based on this &lt;a href="https://github.com/eugeneyan/applied-ml#data-discovery"&gt;list&lt;/a&gt;, we now know of more than 10 implementations.&lt;/p&gt;

&lt;p&gt;I haven’t been paying much attention to these developments in data discovery and wanted to catch up. I was interested in:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The questions these platforms help answer&lt;/li&gt;
&lt;li&gt;The features developed to answer these questions&lt;/li&gt;
&lt;li&gt;How they compare with each other&lt;/li&gt;
&lt;li&gt;What open source solutions are available&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;By the end of this, we’ll learn about the key features that solve 80% of data discoverability problems. We’ll also see how the platforms compare on these features, and take a closer look at open source solutions available. &lt;/p&gt;

&lt;h3&gt;
  
  
  Why do we need data discovery platforms and what do they do?
&lt;/h3&gt;

&lt;p&gt;Data discovery platforms help us find data faster. Imagine yourself as a new joiner in the organization. You need data for analysis, or to build a machine learning system. How would you find the right tables and columns to use? How would you quickly assess their suitability?&lt;/p&gt;

&lt;p&gt;Finding the right data can take a lot of time. Before Lyft implemented their data discovery platform, &lt;a href="https://eng.lyft.com/amundsen-lyfts-data-discovery-metadata-engine-62d27254fbb9#c4df"&gt;25% of the time&lt;/a&gt; in the data science workflow was spent on data discovery. Similarly, &lt;a href="https://engineering.shopify.com/blogs/engineering/solving-data-discovery-challenges-shopify"&gt;80% of Shopify’s data team&lt;/a&gt; felt that the discovery process hindered their ability to deliver results.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--gJFn6hfh--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/tqq6zzvw7zhtns8951q8.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--gJFn6hfh--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/tqq6zzvw7zhtns8951q8.jpg" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p class="image-caption"&gt;Lyft found that 25% of time is spent on data discovery (&lt;a href="https://eng.lyft.com/amundsen-lyfts-data-discovery-metadata-engine-62d27254fbb9"&gt;source&lt;/a&gt;)&lt;/p&gt;

&lt;p&gt;Data discovery platforms catalog data entities (e.g., tables, ETL jobs, dashboards) and their metadata (e.g., ownership, lineage), and make searching them easy. They help answer “Where can I find the data?” and other questions users will have.&lt;/p&gt;



&lt;h2&gt;
  
  
  Questions we ask in the data discovery process
&lt;/h2&gt;

&lt;p&gt;Before discussing platform features, let’s briefly go over some common questions in the data discovery process.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Where can I find data about ____?&lt;/strong&gt; If we don’t know the right terms, this is especially challenging. For user browsing behavior, do we search for “click”, “page views”, or “browse”? A common solution is free-text search over table names and even column names. (We’ll see how Nemo improves on this in the next section.)&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What is the data about?&lt;/strong&gt; What columns does the data have? What are the data types? What do they mean? Displaying table schemas and column descriptions goes a long way here.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Who can I ask for access?&lt;/strong&gt; Ownership and how to get permissions should be part of the metadata displayed for each table.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How is the data created? Can I trust it?&lt;/strong&gt; Before using the data in production, we’ll want to ensure its reliability and quality. Who’s creating the data? Is it a scheduled data cleaning pipeline? Or does an analyst manually run it for monthly reporting? Also, how widely is the data used? Displaying usage statistics and &lt;a href="https://en.wikipedia.org/wiki/Data_lineage"&gt;data lineage&lt;/a&gt; helps with this. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How should I use the data?&lt;/strong&gt; Which columns are relevant? What tables should I join on? What filters should I apply to clean the data? To address this, one way is to display the most frequent users of each table so people can ask them. Alternatively, we can provide statistics on column usage.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;How frequently does the data refresh?&lt;/strong&gt; If delays are common, how severe are they? Stale data can reduce the effectiveness of time-sensitive machine learning systems. Also, how much history does the data cover? If the table is only a few weeks old, we won’t have enough data for machine learning. A simple solution is to show table creation dates, partition dates, and when the table was last updated.&lt;/p&gt;
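&lt;p&gt;As a minimal sketch of that last point, here’s how a platform might flag stale tables from last-updated timestamps. (The table names and the seven-day threshold are made up for illustration; a real platform would pull these timestamps from its metastore.)&lt;/p&gt;

```python
from datetime import datetime, timedelta, timezone

# Hypothetical table metadata; a real platform would read this from a metastore.
tables = {
    "orders": {"last_updated": datetime.now(timezone.utc) - timedelta(hours=2)},
    "clicks": {"last_updated": datetime.now(timezone.utc) - timedelta(days=9)},
}

def staleness_days(meta, now=None):
    """Days since the table was last updated."""
    now = now or datetime.now(timezone.utc)
    return (now - meta["last_updated"]).total_seconds() / 86400

# Flag anything that has not refreshed in over a week.
stale = [name for name, meta in tables.items() if staleness_days(meta) > 7]
print(stale)  # only "clicks" has gone stale
```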

&lt;h2&gt;
  
  
  Features to find, understand, and use data
&lt;/h2&gt;

&lt;p&gt;The features of data discovery platforms can be grouped into the various stages of data discovery. First, identifying the right data. Then, learning and assessing the suitability of the data. Last, figuring out how to use it.&lt;/p&gt;

&lt;h3&gt;
  
  
  Finding via free-text search or other smarter ways
&lt;/h3&gt;

&lt;p&gt;How do we help users find the data they need? All data discovery platforms &lt;strong&gt;allow users to search&lt;/strong&gt; for table names that contain a specified term. Some go beyond that by also searching column names, table and column descriptions, and user-input description and comments. This is usually implemented by indexing the metadata in Elasticsearch.&lt;/p&gt;

&lt;p&gt;Assuming we have many search results, how should we rank them? For &lt;a href="https://eng.lyft.com/amundsen-lyfts-data-discovery-metadata-engine-62d27254fbb9#ebfe"&gt;Lyft&lt;/a&gt; and &lt;a href="https://engineering.atspotify.com/2020/02/27/how-we-improved-data-discovery-for-data-scientists-at-spotify/"&gt;Spotify&lt;/a&gt;, ranking based on popularity (i.e., table usage) was a simple and effective solution. While popularity doesn’t always correlate with relevance, widely used tables tend to be relevant, better maintained, and more production-worthy. This is implemented by parsing query logs for table usage and adding it to Elasticsearch documents (i.e., tables) for ranking.&lt;/p&gt;
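&lt;p&gt;Here’s a toy sketch of popularity-weighted ranking. The tables, scores, and log-damping choice are all illustrative; in a real deployment, Elasticsearch would compute the text score and combine it with a usage-based boost at query time.&lt;/p&gt;

```python
import math

# Toy index: table name, description, and a usage count parsed from query logs.
TABLES = [
    {"name": "fact_page_views", "description": "user browse events", "usage": 5000},
    {"name": "page_views_tmp", "description": "scratch copy of page views", "usage": 3},
]

def text_score(query, table):
    # Naive term overlap between the query and the table's name + description.
    terms = set(query.lower().split())
    doc = set((table["name"] + " " + table["description"]).replace("_", " ").split())
    return len(terms.intersection(doc))

def rank(query, tables):
    # Popularity-weighted relevance: damp usage with log1p so heavily used
    # tables boost the score without drowning out textual relevance.
    scored = [(text_score(query, t) * math.log1p(t["usage"]), t["name"]) for t in tables]
    return [name for score, name in sorted(scored, reverse=True) if score > 0]

print(rank("page views", TABLES))  # the widely used table ranks first
```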

&lt;p&gt;Facebook’s &lt;a href="https://engineering.fb.com/data-infrastructure/nemo/"&gt;Nemo&lt;/a&gt; takes it further. First, search terms are parsed with a &lt;a href="https://spacy.io"&gt;spaCy&lt;/a&gt;-based library. Then, table candidates are generated via &lt;a href="https://research.fb.com/publications/unicorn-a-system-for-searching-the-social-graph/"&gt;Unicorn&lt;/a&gt;, the same infra they use for scalable search on the social graph. Finally, candidates are ranked based on social signals (e.g., table users) and other features such as kNN-based scoring. Taken together, this gives Nemo the ability to parse natural language queries. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--kFNOwXQe--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/770nwrv13ena527cehg9.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--kFNOwXQe--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/770nwrv13ena527cehg9.jpg" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p class="image-caption"&gt;Nemo's search architecture; don't expect this in other platforms (&lt;a href="https://engineering.fb.com/data-infrastructure/nemo/"&gt;source&lt;/a&gt;)&lt;/p&gt;

&lt;p&gt;While seldom mentioned, &lt;strong&gt;another way to help users find data is via recommendations&lt;/strong&gt;. This is usually on the home page. Recommendations can be based on popular tables within the organization and team, or tables recently queried by the user. New, “golden” datasets by data publishers can also be recommended to raise awareness. When Spotify implemented this, 20% of monthly active users used homepage recommendations.&lt;/p&gt;
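&lt;p&gt;A minimal sketch of such homepage recommendations, combining the user’s recently queried tables with team favorites (the query-log rows, users, and teams are invented for illustration):&lt;/p&gt;

```python
from collections import Counter

# Hypothetical query-log rows: (user, team, table). A real platform would
# aggregate these offline from the warehouse's query logs.
QUERY_LOG = [
    ("alice", "growth", "fact_page_views"),
    ("bob", "growth", "fact_page_views"),
    ("bob", "growth", "dim_users"),
    ("carol", "ads", "fact_impressions"),
]

def homepage_recommendations(user, team, k=3):
    team_counts = Counter(t for _, tm, t in QUERY_LOG if tm == team)
    recent_own = [t for u, _, t in QUERY_LOG if u == user]
    # Recently queried tables first, then team favorites, de-duplicated.
    recs = []
    for t in recent_own + [t for t, _ in team_counts.most_common()]:
        if t not in recs:
            recs.append(t)
    return recs[:k]

print(homepage_recommendations("bob", "growth"))
```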

&lt;h3&gt;
  
  
  Understanding via schemas, previews, statistics, lineage
&lt;/h3&gt;

&lt;p&gt;As users browse through tables, how can we help them quickly understand the data? To address this, most platforms &lt;strong&gt;display the data schema&lt;/strong&gt;, including column names, data types, and descriptions.&lt;/p&gt;

&lt;p&gt;If the user has read permissions, we can also provide a &lt;strong&gt;preview of the data&lt;/strong&gt; (e.g., the first 100 rows). Pre-computed column-level statistics can also be made available. They include:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;All columns: Counts and proportion of null values&lt;/li&gt;
&lt;li&gt;Numerical columns: Min, max, mean, median, standard deviation&lt;/li&gt;
&lt;li&gt;Categorical columns: Number of distinct values, top values by proportion&lt;/li&gt;
&lt;li&gt;Date columns: Range of dates in the data&lt;/li&gt;
&lt;/ul&gt;
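&lt;p&gt;The statistics above can be pre-computed by a simple profiling job. A minimal, dependency-free sketch (the column values are made up):&lt;/p&gt;

```python
import statistics

def profile_numeric(values):
    # Null proportion plus min, max, mean, median, and standard deviation.
    present = [v for v in values if v is not None]
    return {
        "null_prop": 1 - len(present) / len(values),
        "min": min(present),
        "max": max(present),
        "mean": statistics.mean(present),
        "median": statistics.median(present),
        "stdev": statistics.stdev(present),
    }

def profile_categorical(values, top=3):
    # Null proportion, distinct count, and top values by proportion.
    present = [v for v in values if v is not None]
    counts = {}
    for v in present:
        counts[v] = counts.get(v, 0) + 1
    by_count = sorted(counts.items(), key=lambda kv: kv[1], reverse=True)
    return {
        "null_prop": 1 - len(present) / len(values),
        "distinct": len(counts),
        "top_values": [(v, c / len(present)) for v, c in by_count[:top]],
    }

print(profile_numeric([3, 1, None, 5, 7]))
print(profile_categorical(["a", "a", "b", None]))
```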

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--whvtvxvz--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/r77kekcwmje0oy6yoa5n.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--whvtvxvz--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/r77kekcwmje0oy6yoa5n.jpg" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p class="image-caption"&gt;Column statistics in Amundsen (&lt;a href="https://eng.lyft.com/amundsen-lyfts-data-discovery-metadata-engine-62d27254fbb9"&gt;source&lt;/a&gt;)&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Providing data lineage&lt;/strong&gt; also helps users learn about upstream dependencies. ETL jobs (e.g., scheduled via Airflow) can be linked to let users inspect scheduling and delays. This is helpful when evaluating data sources for production.&lt;/p&gt;

&lt;h3&gt;
  
  
  Learning how to use via other user behavior
&lt;/h3&gt;

&lt;p&gt;After users have found the tables, how can we help them get started? A simple way is to &lt;strong&gt;show people associated with the table&lt;/strong&gt;. Owners can help with granting permission. Frequent users can help with a walk-through of the data and its idiosyncrasies. (Lyft’s and LinkedIn’s platforms include people as an entity that can be attached to a table). &lt;/p&gt;

&lt;p&gt;However, the availability of such gurus can be a bottleneck. A more scalable approach is to attach additional metadata to the table itself. &lt;/p&gt;

&lt;p&gt;To help users find the most relevant columns, we can provide &lt;strong&gt;column usage statistics&lt;/strong&gt; for each table. Also, users will need to learn which tables to join on. Providing a list of the most &lt;strong&gt;commonly joined tables&lt;/strong&gt;, as well as the joining columns, can help with this. Getting such data requires query log parsing.&lt;/p&gt;
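&lt;p&gt;A toy sketch of that query log parsing: extract join pairs and keys from raw SQL and count them. A production system would use a real SQL parser; this regex only handles the simple “FROM a JOIN b ON a.x = b.y” shape for illustration.&lt;/p&gt;

```python
import re
from collections import Counter

# Hypothetical, deliberately simple pattern for one join shape.
JOIN_RE = re.compile(
    r"from\s+(\w+)\s+join\s+(\w+)\s+on\s+(\w+\.\w+)\s*=\s*(\w+\.\w+)",
    re.IGNORECASE,
)

def join_counts(query_log):
    # Count (left table, right table, left key, right key) tuples.
    counts = Counter()
    for sql in query_log:
        for left, right, key_l, key_r in JOIN_RE.findall(sql):
            counts[(left, right, key_l, key_r)] += 1
    return counts

log = [
    "SELECT * FROM orders JOIN users ON orders.user_id = users.id",
    "SELECT 1 FROM orders JOIN users ON orders.user_id = users.id",
]
print(join_counts(log).most_common(1))
```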

&lt;p&gt;To give users even greater detail on how the data is used, we can provide &lt;strong&gt;recent queries&lt;/strong&gt; on the table. Users can then examine how others are cleaning (which columns to apply &lt;code&gt;IS NOT NULL&lt;/code&gt; on) and filtering (how to filter on product category). This makes tribal knowledge more accessible. Spotify’s platform displays this, together with column usage statistics and commonly joined tables.&lt;/p&gt;

&lt;p&gt;Another useful feature is &lt;strong&gt;data lineage&lt;/strong&gt;. This helps users learn about downstream tables that consume the current table, and perhaps the queries creating them. Users can learn how queries are adapted for different use cases (i.e., tables) and reach out to downstream users to learn more. They might also find downstream tables that fully meet their requirements and use them directly. This reduces compute and storage costs. Several platforms support lineage, including Twitter’s &lt;a href="https://blog.twitter.com/engineering/en_us/topics/insights/2016/discovery-and-consumption-of-analytics-data-at-twitter.html"&gt;Data Access Layer&lt;/a&gt;, Uber’s &lt;a href="https://eng.uber.com/databook/"&gt;Databook&lt;/a&gt;, and Netflix’s &lt;a href="https://netflixtechblog.com/metacat-making-big-data-discoverable-and-meaningful-at-netflix-56fb36a53520"&gt;Metacat&lt;/a&gt;.&lt;/p&gt;
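&lt;p&gt;At its core, lineage is a directed graph of producer-to-consumer edges, and “find everything downstream” is a graph traversal. A minimal sketch with invented table names:&lt;/p&gt;

```python
from collections import deque

# Hypothetical lineage edges: producer table mapped to its consumer tables.
DOWNSTREAM = {
    "raw_events": ["clean_events"],
    "clean_events": ["daily_sessions", "ml_features"],
    "daily_sessions": ["exec_dashboard"],
}

def downstream_of(table):
    """BFS over the lineage graph to find every downstream consumer."""
    seen, queue = set(), deque([table])
    while queue:
        for child in DOWNSTREAM.get(queue.popleft(), []):
            if child not in seen:
                seen.add(child)
                queue.append(child)
    return sorted(seen)

print(downstream_of("raw_events"))
```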

&lt;p&gt;Before using the data in production, users will want to know how frequently it’s updated. Indicating &lt;strong&gt;how the data is partitioned&lt;/strong&gt; by time (e.g., day, hour) can help. Alternatively, data discovery platforms can integrate with an orchestrator like Airflow. Users can then examine scheduled ETL jobs and the time taken for them to complete. Lyft’s Amundsen has Airflow integration, though I’m uncertain about its extent.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--wH05Uj_2--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/ciliv4f6uxrix7vfn0ft.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--wH05Uj_2--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/ciliv4f6uxrix7vfn0ft.jpg" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p class="image-caption"&gt;Task durations on Airflow (&lt;a href="https://www.agari.com/email-security-blog/airflow-agari/"&gt;source&lt;/a&gt;)&lt;/p&gt;

&lt;h2&gt;
  
  
  High-level comparison across features
&lt;/h2&gt;

&lt;p&gt;I’ve compiled a high-level comparison based on publicly available information. (Note: This is likely to be incomplete; please &lt;a href="https://twitter.com/eugeneyan"&gt;reach out&lt;/a&gt; if you have additional information!) A few observations:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;All platforms have free-text search (via Elasticsearch or Solr). Only Amundsen (Lyft) and Lexikon (Spotify) include recommendations on the home page.&lt;/li&gt;
&lt;li&gt;All platforms show basic table information (i.e., schema, description). Amundsen (Lyft) and Databook (Uber) include data previews and column statistics. &lt;/li&gt;
&lt;li&gt;Most platforms have data lineage built-in. A notable exception is Amundsen, though native data lineage is a priority on its &lt;a href="https://www.amundsen.io/amundsen/roadmap/"&gt;2020 roadmap&lt;/a&gt;.&lt;/li&gt;
&lt;li&gt;Five platforms are open-sourced (we’ll discuss them below). Nonetheless, &lt;a href="https://engineering.atspotify.com/2020/02/27/how-we-improved-data-discovery-for-data-scientists-at-spotify/"&gt;Spotify has shared about Lexikon&lt;/a&gt; in great detail with a focus on product features. Maybe it’ll be open-sourced soon?
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;|                             | Search | Recommendations | Schemas &amp;amp; Description | Data Preview | Column Statistics | Space/cost metrics | Ownership | Top Users | Lineage | Change Notification | Open Source | Documentation | Supported Sources                                     | Push or Pull |
|-----------------------------|--------|-----------------|-----------------------|--------------|-------------------|--------------------|-----------|-----------|---------|---------------------|-------------|---------------|-------------------------------------------------------|--------------|
| Amundsen &lt;span class="o"&gt;(&lt;/span&gt;Lyft&lt;span class="o"&gt;)&lt;/span&gt;             | ✔      | ✔               | ✔                     | ✔            | ✔                 |                    | ✔         | ✔         | Todo    |                     | ✔           | ✔             | Hive, Redshift, Druid, RDBMS, Presto, Snowflake, etc. | Pull         |
| Datahub &lt;span class="o"&gt;(&lt;/span&gt;LinkedIn&lt;span class="o"&gt;)&lt;/span&gt;          | ✔      |                 | ✔                     |              |                   |                    | ✔         | ✔         | ✔       |                     | ✔           | ✔             | Hive, Kafka, RDBMS                                    | Push         |
| Metacat &lt;span class="o"&gt;(&lt;/span&gt;Netflix&lt;span class="o"&gt;)&lt;/span&gt;           | ✔      |                 | ✔                     |              | ✔                 | ✔                  |           | Todo      |         | Todo                | ✔           |               | Hive, RDS, Teradata, Redshift, S3, Cassandra          |              |
| Atlas &lt;span class="o"&gt;(&lt;/span&gt;Apache&lt;span class="o"&gt;)&lt;/span&gt;              | ✔      |                 | ✔                     |              |                   |                    |           |           | ✔       | ✔                   | ✔           | ✔             | HBase, Hive, Sqoop, Kafka, Storm                      | Push         |
| Marquez &lt;span class="o"&gt;(&lt;/span&gt;WeWork&lt;span class="o"&gt;)&lt;/span&gt;            | ✔      |                 | ✔                     |              |                   |                    |           |           | ✔       |                     | ✔           |               | S3, Kafka                                             |              |
| Databook &lt;span class="o"&gt;(&lt;/span&gt;Uber&lt;span class="o"&gt;)&lt;/span&gt;             | ✔      |                 | ✔                     | ✔            | ✔                 |                    |           |           | ✔       |                     |             |               | Hive, Vertica, MySQL, Postgres, Cassandra             |              |
| Dataportal &lt;span class="o"&gt;(&lt;/span&gt;Airbnb&lt;span class="o"&gt;)&lt;/span&gt;         | ✔      |                 | ✔                     |              | ✔                 |                    | ✔         | ✔         |         |                     |             |               | Unknown                                               |              |
| Data Access Layer &lt;span class="o"&gt;(&lt;/span&gt;Twitter&lt;span class="o"&gt;)&lt;/span&gt; | ✔      |                 | ✔                     |              |                   |                    |           |           | ✔       |                     |             |               | HDFS, Vertica, MySQL                                  |              |
| Lexikon &lt;span class="o"&gt;(&lt;/span&gt;Spotify&lt;span class="o"&gt;)&lt;/span&gt;           | ✔      | ✔               | ✔                     |              |                   |                    | ✔         | ✔         |         |                     |             |               | Unknown                                               |              |
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p class="image-caption"&gt;Scroll right 👉 (Is there a better way to do this in Markdown?)&lt;/p&gt;

&lt;h2&gt;
  
  
  Assessing five open source solutions
&lt;/h2&gt;

&lt;h3&gt;
  
  
  DataHub (LinkedIn)
&lt;/h3&gt;

&lt;p&gt;LinkedIn’s DataHub started as &lt;a href="https://engineering.linkedin.com/blog/2016/03/open-sourcing-wherehows--a-data-discovery-and-lineage-portal"&gt;WhereHows&lt;/a&gt; (released in 2016). Since then, WhereHows has been re-architected (based on the &lt;a href="https://engineering.linkedin.com/blog/2019/data-hub"&gt;lessons&lt;/a&gt; they’ve learned) into DataHub. In the process, the monolithic WhereHows has been broken into two stacks: a modular UI frontend and a generalized metadata backend. DataHub was officially released on GitHub in &lt;a href="https://engineering.linkedin.com/blog/2020/open-sourcing-datahub--linkedins-metadata-search-and-discovery-p"&gt;Feb 2020&lt;/a&gt; and can be found &lt;a href="https://github.com/linkedin/datahub"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;DataHub has all the essential features including search, table schemas, ownership, and lineage. While WhereHows cataloged metadata around a single entity (datasets), DataHub provides additional support for &lt;a href="https://github.com/linkedin/datahub/blob/master/docs/how/entity-onboarding.md"&gt;users and groups&lt;/a&gt;, with more entities (e.g., jobs, dashboards) &lt;a href="https://github.com/linkedin/datahub/blob/master/docs/features.md#data-constructs-entities"&gt;coming soon&lt;/a&gt;. It has good &lt;a href="https://github.com/linkedin/datahub#documentation"&gt;documentation&lt;/a&gt; and can be &lt;a href="https://github.com/linkedin/datahub/tree/master/docker"&gt;tested locally via docker&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--mZpICR3h--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/nqn24a31wt5ee1nnyaig.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--mZpICR3h--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/nqn24a31wt5ee1nnyaig.jpg" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p class="image-caption"&gt;Ownership types on DataHub (&lt;a href="https://engineering.linkedin.com/blog/2019/data-hub"&gt;source&lt;/a&gt;)&lt;/p&gt;

&lt;p&gt;The open-source version supports metadata from Hive, Kafka, and relational databases. The internal version has support for additional &lt;a href="https://engineering.linkedin.com/blog/2020/open-sourcing-datahub--linkedins-metadata-search-and-discovery-p"&gt;data sources&lt;/a&gt; and more connectors might be made available publicly. &lt;/p&gt;

&lt;p&gt;Given the maturity of DataHub, it’s no wonder that it has been &lt;a href="https://github.com/linkedin/datahub#adoption"&gt;adopted&lt;/a&gt; at nearly 10 organizations, including Expedia, Saxobank, and Typeform. Also, 20+ other organizations are building a POC or evaluating the use of DataHub.&lt;/p&gt;

&lt;h3&gt;
  
  
  Amundsen (Lyft)
&lt;/h3&gt;

&lt;p&gt;Lyft wrote about Amundsen in &lt;a href="https://eng.lyft.com/amundsen-lyfts-data-discovery-metadata-engine-62d27254fbb9"&gt;April 2019&lt;/a&gt; and open-sourced it in &lt;a href="https://eng.lyft.com/open-sourcing-amundsen-a-data-discovery-and-metadata-platform-2282bb436234"&gt;Oct that year&lt;/a&gt;. That said, the &lt;a href="https://github.com/amundsen-io/amundsen"&gt;code&lt;/a&gt; has been available since Feb 2019 as part of an open-source soft launch. Since then, Amundsen has been working with early adopter organizations such as ING and Square. &lt;/p&gt;

&lt;p&gt;Amundsen helps us find data via search (with popularity ranking) and recommendations (via the home page). Table detail pages are rich with information including row previews, column statistics, owners, and frequent users (if they’re made available). While Amundsen lacks native data lineage integration, it’s on the 2020 &lt;a href="https://www.amundsen.io/amundsen/roadmap/"&gt;roadmap&lt;/a&gt;. Other items on the roadmap include integration with a data quality system (&lt;a href="https://github.com/great-expectations/great_expectations"&gt;Great Expectations&lt;/a&gt; perhaps?), improving search ranking, and displaying commonly joined tables.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--xPaXZb4v--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/q9sxugutst29a1ad3nnq.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--xPaXZb4v--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/q9sxugutst29a1ad3nnq.jpg" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p class="image-caption"&gt;Amundsen's detail page (&lt;a href="https://github.com/amundsen-io/amundsen/"&gt;source&lt;/a&gt;)&lt;/p&gt;

&lt;p&gt;Since its release, an amazing community has gathered around Amundsen. The community has contributed valuable features such as extractors for BigQuery and Redshift, integration with Apache Atlas, and markdown support for the UI. &lt;/p&gt;

&lt;p&gt;Amundsen has a rich set of &lt;a href="https://github.com/amundsen-io/amundsen#supported-integrations"&gt;integrations&lt;/a&gt;. This includes connecting to over 15 types of data sources (e.g., Redshift, Cassandra, Hive, Snowflake, and various relational DBs), three dashboard connectors (e.g., Tableau), and integration with Airflow. Many of these have been contributed by the community. It also has good &lt;a href="https://www.amundsen.io/amundsen/"&gt;documentation&lt;/a&gt; to help users get started and &lt;a href="https://www.amundsen.io/amundsen/developer_guide/#testing-amundsen-frontend-locally"&gt;test it locally&lt;/a&gt; via Docker. &lt;/p&gt;

&lt;p&gt;Despite being the new kid on the block, Amundsen has been popular and is &lt;a href="https://github.com/amundsen-io/amundsen#who-uses-amundsen"&gt;adopted&lt;/a&gt; at close to 30 organizations, including Asana, Instacart, iRobot, and Square. In July 2020, it joined the Linux AI Foundation as a &lt;a href="https://lfai.foundation/blog/2020/08/11/amundsen-joins-lf-ai-as-new-incubation-project/"&gt;new incubation project&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Metacat (Netflix)
&lt;/h3&gt;

&lt;p&gt;Netflix shared about &lt;a href="https://netflixtechblog.com/metacat-making-big-data-discoverable-and-meaningful-at-netflix-56fb36a53520"&gt;Metacat&lt;/a&gt; in Jun 2018. In addition to the usual features such as free-text search and schema details, it also includes metrics that can be used for analyzing cost and storage space. There’s also a push notification system for table and partition changes. This allows users to be notified of schema changes, or when a table is dropped so that infra can clean up the data as required.&lt;/p&gt;
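&lt;p&gt;The change-notification idea can be sketched as a simple diff of two schema snapshots. (This is illustrative only, not Metacat’s actual API; the schemas and event shapes are invented.)&lt;/p&gt;

```python
# Diff two schema snapshots (column name mapped to type) and emit the change
# events a notification system could publish to downstream consumers.
def schema_changes(old, new):
    events = []
    for col in new:
        if col not in old:
            events.append(("added", col, new[col]))
        elif old[col] != new[col]:
            events.append(("type_changed", col, new[col]))
    for col in old:
        if col not in new:
            events.append(("dropped", col, old[col]))
    return events

old = {"id": "bigint", "amount": "float"}
new = {"id": "bigint", "amount": "double", "currency": "string"}
print(schema_changes(old, new))
```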

&lt;p&gt;Metacat supports integrations for Hive, Teradata, Redshift, S3, Cassandra, and RDS. In addition to data discovery, Metacat’s goal is to make data easy to process and manage. Thus the emphasis on physical storage metrics (e.g., cost) and schema change notifications. Netflix also shared that it was working on schema and metadata versioning and metadata validation. &lt;/p&gt;

&lt;p&gt;While &lt;a href="https://github.com/Netflix/metacat"&gt;Metacat is open source&lt;/a&gt;, there isn’t any documentation for it (currently &lt;code&gt;TODO&lt;/code&gt; on the project &lt;code&gt;README&lt;/code&gt;). There’s also no information about other organizations adopting Metacat. &lt;/p&gt;

&lt;h3&gt;
  
  
  Marquez (WeWork)
&lt;/h3&gt;

&lt;p&gt;WeWork shared about Marquez in &lt;a href="https://www.datacouncil.ai/talks/marquez-a-metadata-service-for-data-abstraction-data-lineage-and-event-based-triggers"&gt;Oct 2018&lt;/a&gt;, with a focus on data quality and lineage. (Additional &lt;a href="https://www.slideshare.net/WillyLulciuc/marquez-an-open-source-metadata-service-for-ml-platforms"&gt;slides&lt;/a&gt; on Marquez). It focuses on metadata management including data governance and health (via Great Expectations), and catalogs both datasets and jobs.&lt;/p&gt;

&lt;p&gt;Marquez includes components for a &lt;a href="https://github.com/MarquezProject/marquez-web"&gt;web UI&lt;/a&gt; and &lt;a href="https://github.com/MarquezProject/marquez-airflow"&gt;Airflow&lt;/a&gt;, and has clients for &lt;a href="https://github.com/MarquezProject/marquez-java"&gt;Java&lt;/a&gt; and &lt;a href="https://github.com/MarquezProject/marquez-python"&gt;Python&lt;/a&gt;. While it’s easy to &lt;a href="https://github.com/MarquezProject/marquez#quickstart"&gt;test Marquez locally&lt;/a&gt; via docker, there isn’t much documentation on its &lt;a href="https://marquezproject.github.io/marquez/"&gt;website&lt;/a&gt; or &lt;a href="https://github.com/MarquezProject/marquez/tree/main/docs"&gt;GitHub&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Apache Atlas (Hortonworks)
&lt;/h3&gt;

&lt;p&gt;Atlas started incubation at Hortonworks in Jul 2015 as part of the Data Governance Initiative. It had engineers from Aetna, JP Morgan, Merck, SAS, etc. collaborating with Hortonworks. While initially focused on finance, healthcare, pharma, etc., it was &lt;a href="https://cwiki.apache.org/confluence/display/ATLAS/Atlas+Home"&gt;later extended&lt;/a&gt; to address data governance issues in other industries. Atlas 1.0 was released in Jun 2018 and it’s currently on &lt;a href="https://atlas.apache.org/#/WhatsNew-2.1"&gt;version 2.1&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Atlas’ primary goal is data governance and helping organizations meet their security and compliance requirements. Thus, it has rich features for tagging assets (e.g., sensitive, personally identifiable information), tag propagation to downstream datasets, and security on metadata access. It also has notifications on metadata changes. For data discovery, it has free-text search, schema details, and data lineage. It also includes &lt;a href="https://atlas.apache.org/#/SearchAdvance"&gt;advanced search&lt;/a&gt; where users can query via a syntax similar to SQL. &lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--bAEZs8bQ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/oy5ifkqfhs3vx2tu98y9.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--bAEZs8bQ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/oy5ifkqfhs3vx2tu98y9.jpg" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p class="image-caption"&gt;Personally identifiable information tag propagation on Atlas (&lt;a href="https://atlas.apache.org/#/ClassificationPropagation"&gt;source&lt;/a&gt;)&lt;/p&gt;
  

&lt;p&gt;Atlas supports integration with &lt;a href="https://atlas.apache.org/#/Architecture"&gt;metadata sources&lt;/a&gt; such as HBase, Hive, and Kafka, with more to be added in the future. It also allows users to create and update metadata entities via REST API. &lt;a href="https://atlas.apache.org/#/"&gt;Documentation for Atlas&lt;/a&gt; is comprehensive and the code can be found &lt;a href="https://github.com/apache/atlas"&gt;here&lt;/a&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  How other organizations have adopted these platforms
&lt;/h3&gt;

&lt;p&gt;Various organizations have shared their experiences with DataHub and Amundsen. Expedia &lt;a href="https://www.youtube.com/watch?v=ajcRdB22s5o"&gt;shared&lt;/a&gt; about evaluating both Atlas and DataHub and going into production with DataHub (the video also includes a demo). Square &lt;a href="https://developer.squareup.com/blog/using-amundsen-to-support-user-privacy-via-metadata-collection-at-square/"&gt;shared&lt;/a&gt; how they adopted Amundsen to support user privacy.&lt;/p&gt;

&lt;p&gt;It was particularly interesting to see how ING &lt;a href="https://medium.com/wbaa/facilitating-data-discovery-with-apache-atlas-and-amundsen-631baa287c8b"&gt;adopted both Atlas and Amundsen&lt;/a&gt;. Atlas handled metadata management, data lineage, and data quality metrics, while Amundsen focused on search and discovery. Table popularity scores were calculated via Spark on query logs to rank search results in Amundsen.&lt;/p&gt;
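&lt;p&gt;The popularity idea is simple enough to sketch: count how often each table shows up in query logs and use the counts to rank search results. (ING computes this with Spark over real logs; the toy version below, with made-up log entries and a naive regex, just shows the gist.)&lt;/p&gt;

```python
import re
from collections import Counter

# Hypothetical query log entries.
query_logs = [
    "SELECT * FROM sales.orders WHERE order_date > '2020-01-01'",
    "SELECT o.id FROM sales.orders o JOIN sales.customers c ON o.cust_id = c.id",
    "SELECT count(*) FROM marketing.campaigns",
]

# Naive extraction of table names after FROM/JOIN; a real pipeline would parse SQL.
table_pattern = re.compile(r"(?:FROM|JOIN)\s+([\w.]+)", re.IGNORECASE)
popularity = Counter(t for q in query_logs for t in table_pattern.findall(q))

print(popularity.most_common())  # sales.orders is referenced most often
```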

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--N4N6SCvh--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/w2cbd977qcqn140s7uc5.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--N4N6SCvh--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/w2cbd977qcqn140s7uc5.jpg" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p class="image-caption"&gt;How ING uses both Atlas and Amundsen (&lt;a href="https://medium.com/wbaa/facilitating-data-discovery-with-apache-atlas-and-amundsen-631baa287c8b"&gt;source&lt;/a&gt;)&lt;/p&gt;

&lt;p&gt;A few other companies shared how they evaluated various open source and commercial solutions (e.g., &lt;a href="https://www.slideshare.net/SheetalPratik/linkedinsaxobankdataworkbench#14"&gt;SaxoBank&lt;/a&gt;, &lt;a href="https://www.slideshare.net/SheetalPratik/linkedinsaxobankdataworkbench#14"&gt;SpotHero&lt;/a&gt;). Do you know of more? Please let me know!&lt;/p&gt;

&lt;h3&gt;Whale, a lightweight data discovery tool&lt;/h3&gt;

&lt;p&gt;While not a full-fledged data discovery platform, &lt;a href="https://github.com/dataframehq/whale"&gt;Whale&lt;/a&gt; indexes warehouse tables as markdown files, enabling search, editing, and versioning.&lt;/p&gt;

&lt;p&gt;It’s still fairly new and not much has been written about it yet. Given the lack of a UI, it seems targeted at developers for now. Nonetheless, if you’re looking to try a lightweight solution, you might find it useful. &lt;a href="https://github.com/dataframehq/whale-bigquery-public-data"&gt;Testing&lt;/a&gt; it with data scraped from the BigQuery public project seems easy.&lt;/p&gt;

&lt;h2&gt;Not as sexy, but a critical first step&lt;/h2&gt;

&lt;p&gt;While not as sexy as machine learning or deployment, data discovery is a crucial first step of the data science workflow. I’m glad more attention is being paid to it, and grateful for the teams open sourcing their solutions.&lt;/p&gt;

&lt;p&gt;Is your organization struggling with data discovery? If so, take a look at Amundsen, Atlas, and DataHub. Or if you’re trying to develop one in-house, consider how your features will help users answer their questions.&lt;/p&gt;

&lt;p&gt;How has your experience with data discovery platforms been? Would love to hear how they helped, and the challenges you faced.&lt;/p&gt;

&lt;h2&gt;References&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://eng.lyft.com/amundsen-lyfts-data-discovery-metadata-engine-62d27254fbb9"&gt;Amundsen — Lyft’s Data Discovery &amp;amp; Metadata Engine&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://eng.lyft.com/open-sourcing-amundsen-a-data-discovery-and-metadata-platform-2282bb436234"&gt;Open Sourcing Amundsen: A Data Discovery &amp;amp; Metadata Platform&lt;/a&gt; (&lt;a href="https://github.com/lyft/amundsen"&gt;Code&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;&lt;a href="https://eng.lyft.com/amundsen-1-year-later-7b60bf28602"&gt;Amundsen: One Year Later&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://blog.twitter.com/engineering/en_us/topics/insights/2016/discovery-and-consumption-of-analytics-data-at-twitter.html"&gt;Discovery and Consumption of Analytics Data at Twitter&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://medium.com/airbnb-engineering/democratizing-data-at-airbnb-852d76c51770"&gt;Democratizing Data at Airbnb&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://eng.uber.com/databook/"&gt;Databook: Turning Big Data into Knowledge with Metadata at Uber&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://netflixtechblog.com/metacat-making-big-data-discoverable-and-meaningful-at-netflix-56fb36a53520"&gt;Metacat: Making Big Data Discoverable and Meaningful at Netflix&lt;/a&gt; (&lt;a href="https://github.com/Netflix/metacat"&gt;Code&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://engineering.linkedin.com/blog/2019/data-hub"&gt;DataHub: A Generalized Metadata Search &amp;amp; Discovery Tool&lt;/a&gt; (&lt;a href="https://github.com/linkedin/datahub"&gt;Code&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;&lt;a href="https://engineering.atspotify.com/2020/02/27/how-we-improved-data-discovery-for-data-scientists-at-spotify/"&gt;How We Improved Data Discovery for Data Scientists at Spotify&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://engineering.shopify.com/blogs/engineering/solving-data-discovery-challenges-shopify"&gt;How We’re Solving Data Discovery Challenges at Shopify&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://engineering.fb.com/data-infrastructure/nemo/"&gt;Nemo: Data discovery at Facebook&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://atlas.apache.org/#/"&gt;Apache Atlas: Data Goverance &amp;amp; Metadata Framework for Hadoop&lt;/a&gt; (&lt;a href="https://github.com/apache/atlas"&gt;Code&lt;/a&gt;)&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://marquezproject.github.io/marquez/"&gt;Collect, Aggregate, and Visualize a Data Ecosystem's Metadata&lt;/a&gt; (&lt;a href="https://github.com/MarquezProject/marquez"&gt;Code&lt;/a&gt;)&lt;/li&gt;
&lt;/ul&gt;



&lt;p&gt;&lt;strong&gt;Thanks&lt;/strong&gt; to Yang Xinyi and Okkar Kyaw for reading drafts of this.&lt;/p&gt;

</description>
      <category>datascience</category>
      <category>database</category>
      <category>tooling</category>
      <category>productivity</category>
    </item>
    <item>
      <title>Why Have a Data Science Portfolio and What It Shows</title>
      <dc:creator>Eugene Yan</dc:creator>
      <pubDate>Thu, 22 Oct 2020 04:12:02 +0000</pubDate>
      <link>https://dev.to/eugeneyan/why-have-a-data-science-portfolio-and-what-it-shows-3hab</link>
      <guid>https://dev.to/eugeneyan/why-have-a-data-science-portfolio-and-what-it-shows-3hab</guid>
      <description>&lt;p&gt;Thinking of building your data science portfolio? If we google for “data science portfolio”, we’ll get many results on “&lt;strong&gt;how&lt;/strong&gt;” to build one.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--Lb5u7TWq--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/egd7i7p9kltujcrq2yh1.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--Lb5u7TWq--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/egd7i7p9kltujcrq2yh1.jpg" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;However, most resources don’t discuss enough about the “&lt;strong&gt;why&lt;/strong&gt;” and the “&lt;strong&gt;what&lt;/strong&gt;”. Why work on personal projects and build a portfolio? What does a portfolio demonstrate, &lt;em&gt;other&lt;/em&gt; than technical skills?&lt;/p&gt;

&lt;p&gt;Whether you’re starting on your first or fifth personal project, I hope this will help you find a meaningful “why” and make projects more enjoyable and sustainable. &lt;strong&gt;We'll also hear from awesome creators on their motivations&lt;/strong&gt; for building and writing (please view in the original &lt;a href="https://eugeneyan.com/writing/data-science-portfolio-how-why-what/"&gt;post&lt;/a&gt;). In addition, we’ll discuss the various &lt;strong&gt;skills&lt;/strong&gt; (technical and non-technical) and &lt;strong&gt;traits&lt;/strong&gt; projects demonstrate so you can pick projects that better demonstrate your strengths.&lt;/p&gt;

&lt;p&gt;(Note: I’ll use “portfolios” and “personal projects” interchangeably. Nonetheless, portfolios can extend beyond personal projects to include work-related projects too.)&lt;/p&gt;

&lt;h2&gt;Getting a job shouldn’t be the &lt;em&gt;only&lt;/em&gt; “Why”&lt;/h2&gt;

&lt;p&gt;&lt;strong&gt;Getting a job&lt;/strong&gt; is usually the main reason for building a portfolio. Sometimes, it’s necessary if we don’t have the relevant education or experience. Nonetheless, it’s &lt;strong&gt;extrinsic&lt;/strong&gt; motivation, where we do something for an external reward (i.e., a job) and not for its own sake. This can &lt;a href="https://psycnet.apa.org/buy/1999-01567-001"&gt;reduce &lt;strong&gt;intrinsic&lt;/strong&gt; motivation&lt;/a&gt; and &lt;a href="https://www.verywellmind.com/what-is-the-overjustification-effect-2795386#mntl-sc-block_1-0-11"&gt;lead to dependence on external rewards&lt;/a&gt;. We might stop once we get that job, or give up if we continually fail (and don’t get rewarded).&lt;/p&gt;

&lt;p&gt;Thus, other than to get a job, we should also find intrinsic reasons for working on personal projects. These reasons will &lt;strong&gt;make the project naturally satisfying, where the work is its own reward&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;One reason is &lt;strong&gt;to learn and practice&lt;/strong&gt;. Perhaps we’re fascinated by a branch of deep learning. Or we want to get more hands-on experience to hone our skills. Either way, the knowledge and skills gained are often transferable to work and will make us more effective data scientists. It also aligns with a key &lt;a href="https://www.mindtools.com/pages/article/autonomy-mastery-purpose.htm"&gt;factor of motivation&lt;/a&gt;: Mastery, the desire to improve. &lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"Learning is a treasure that will follow its owner everywhere." — Chinese Proverb (学习是永远跟随主人的宝物)&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Another reason is &lt;strong&gt;to help others&lt;/strong&gt;. This includes volunteering with non-profit organizations such as &lt;a href="https://www.datakind.org"&gt;DataKind&lt;/a&gt;, developing and releasing a helpful package, or sharing about what we’ve learned (via writing or talks). This aligns with another key motivation factor: Purpose, the desire to contribute to the bigger picture.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"If you want to lift yourself up, lift up someone else." — Booker T. Washington&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Finally, we do personal projects because they’re &lt;strong&gt;enjoyable&lt;/strong&gt;. We start projects for the sake of fun, or to scratch a “this should exist” itch. They could also be hobbies. Nonetheless, over time, these projects build up into an impressive portfolio created through consistent effort.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"It’s hard to beat someone who’s having fun.”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;In the next section, we'll see some amazing personal projects and hear from their creators on their &lt;strong&gt;"why"&lt;/strong&gt;. Often, it's to scratch an itch, a way to learn, and to help others. &lt;/p&gt;

&lt;h2&gt;Portfolios come in &lt;em&gt;two&lt;/em&gt; flavors: Code and Content&lt;/h2&gt;

&lt;p&gt;Most of the time, discussions on data science portfolios refer to &lt;strong&gt;code&lt;/strong&gt;. Such projects involve acquiring public data, performing statistical analysis, plotting visuals, or training machine learning models. It could also include contributions to open-source libraries, as well as data science competitions. Some people may obsess over how much code they write, but don’t sweat it if you’re not committing daily. &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Content-based&lt;/strong&gt; projects are less discussed. These involve sharing (technical) content via papers or online writing, or giving talks at conferences and meetups. They include well-written READMEs on Git repos, as well as video walkthroughs (e.g., how-tos, summaries, etc.). After we complete a code-based project, &lt;a href="https://eugeneyan.com/writing/why-you-need-to-follow-up-after-your-data-science-project/#dont-keep-your-awesome-work-to-yourselfshare-it"&gt;we should follow up by writing about it&lt;/a&gt; and share it so others can benefit from it too.&lt;/p&gt;

&lt;h2&gt;Portfolios don’t &lt;em&gt;just&lt;/em&gt; demonstrate technical skills&lt;/h2&gt;

&lt;p&gt;Most portfolios demonstrate &lt;strong&gt;skills &lt;em&gt;and&lt;/em&gt; traits&lt;/strong&gt;. On skills, portfolios show &lt;strong&gt;both technical &lt;em&gt;and&lt;/em&gt; soft skills&lt;/strong&gt;, and both matter to hiring managers.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Technical skills&lt;/strong&gt; are straightforward to demonstrate and also the most observable. Code-based portfolios show we’re able to do the work and are another data point beyond the resume. They also help to earn trust with recruiters and hiring managers. Depending on the project, you can demonstrate:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Data acquisition and preparation (e.g., scrape some data, and format and clean it)&lt;/li&gt;
&lt;li&gt;Data storytelling (e.g., tell a story around the data, with statistics and visuals)&lt;/li&gt;
&lt;li&gt;Machine learning (e.g., train and deploy a model, with validation and metrics)&lt;/li&gt;
&lt;li&gt;Deployment (e.g., serve your machine learning app online for others to use)&lt;/li&gt;
&lt;li&gt;Software engineering (e.g., readability, maintainability, unit tests, documentation)&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Portfolios also demonstrate &lt;strong&gt;soft skills&lt;/strong&gt;. Over the long term, they have as much, if not greater, impact on performance. Their effect is obvious when tackling nebulous problems and working with other people. They include: &lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Solving problems from scratch: Problem framing and figuring out the right metrics &lt;/li&gt;
&lt;li&gt;Writing and talks: Ability to communicate, a key skill for an effective data scientist&lt;/li&gt;
&lt;li&gt;Teaching: Understanding of a subject and the ability to explain it simply&lt;/li&gt;
&lt;li&gt;Contributing to a project: Teamwork and ability to collaborate remotely on code&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;"In a high-IQ job pool, soft skills like discipline, drive, and empathy mark those who emerge as outstanding." — Daniel Goleman&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;&lt;strong&gt;Beyond skills, portfolios also demonstrate traits&lt;/strong&gt;. These are seldom mentioned but I think they can be more important when making hiring decisions.&lt;/p&gt;

&lt;p&gt;Having personal projects demonstrates &lt;strong&gt;curiosity and passion&lt;/strong&gt;. It shows you’re curious enough to learn about something on your own. And working on it in your free time shows you have more passion than 99% of people. Given the fast pace at which tech—especially data and machine learning—evolves, this curiosity is essential to staying effective.&lt;/p&gt;

&lt;p&gt;It also shows &lt;strong&gt;willingness and ability to learn&lt;/strong&gt;. Working on projects exposes challenges not faced in MOOCs. How to clean data. How to explore the search space of data preparation, feature engineering, and machine learning. How to build a basic front-end. How to train and deploy in the cloud. These aren’t taught in MOOCs; the way to learn is through hands-on experience. Personal projects show self-learning beyond regular MOOCs.&lt;/p&gt;

&lt;p&gt;Finally, a portfolio is evidence of &lt;strong&gt;persistence&lt;/strong&gt;. Most data science projects are vague and difficult. If you’re new to programming, you might get frustrated with bugs and syntax errors, or mess up your virtual environment for the 128th time. You’ll also face less obvious issues such as:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;How to work with data that doesn’t fit in memory (e.g., images, click logs)&lt;/li&gt;
&lt;li&gt;How to make models converge faster, if they converge at all&lt;/li&gt;
&lt;li&gt;How to run experiments quickly and cheaply in the cloud&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Having a portfolio of non-beginner projects and being able to share the challenges faced while working on them demonstrates grit, which is a &lt;a href="https://www.forbes.com/sites/lisaquast/2017/03/06/why-grit-is-more-important-than-iq-when-youre-trying-to-become-successful/"&gt;good predictor of success&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;When deciding between two similar entry/mid-level candidates, one who’s less technically qualified but is high on curiosity, grit, and learning ability (“traits”), and another who’s &lt;em&gt;only&lt;/em&gt; strong on technical skills (“tech-skills”), &lt;strong&gt;I’m more likely to hire on traits&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;I’ve observed both tech-skills and traits candidates hired and their progress over time. The tech-skills candidate will start contributing value earlier. But with the right environment, challenges, and mentoring, the traits candidate will learn fast, outperform, and eventually deliver superior results.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"Hire for attitude, train for skill." – Herb Kelleher&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h2&gt;A great portfolio vs. the traits and skills to build one&lt;/h2&gt;

&lt;p&gt;Job offers are sometimes attributed to having a great portfolio. That’s no surprise, as portfolio artifacts are directly observable relative to skills and traits. (And occasionally, it’s bootcamps touting themselves.) However, I think it’s hard to distinguish whether someone got a job because of an awesome portfolio, or because they had the skills and traits to build one.&lt;/p&gt;

&lt;p&gt;IMHO, the &lt;strong&gt;traits and skills are a prerequisite to building a great portfolio&lt;/strong&gt;. And they reinforce each other. As we work on a project, we gain hands-on experience and improve our technical and soft skills. It also hones our persistence and learning ability. The growth is then reflected in the next project—it’s a virtuous cycle.&lt;/p&gt;

&lt;p&gt;What’s more likely to help land a job? A great portfolio? Or the skills and traits to build one? The portfolio will help a resume stand out among the sea of resumes and get a first-round interview. But it’s the traits and skills that will secure the job offer and lead to high performance in the role.&lt;/p&gt;

&lt;h2&gt;Don’t focus on the portfolio; focus on the process&lt;/h2&gt;

&lt;p&gt;A portfolio is just an artifact of our skills, traits, and working process. It’s the destination; it’ll take care of itself if we focus on the journey.&lt;/p&gt;

&lt;p&gt;While trying to build our portfolios, we should find projects that are intrinsically rewarding. They should be fun, personally meaningful, and stretch our abilities—this makes it more sustainable. Over time, brick by brick, a portfolio emerges. It’ll take a while, so let’s get to work.&lt;/p&gt;

&lt;p&gt;&lt;br&gt;&lt;br&gt;
&lt;strong&gt;Thanks&lt;/strong&gt; to &lt;a href="https://twitter.com/fishnets88"&gt;Vincent Warmerdam&lt;/a&gt;, &lt;a href="https://twitter.com/alvations"&gt;Liling Tan&lt;/a&gt;, &lt;a href="https://twitter.com/JayAlammar"&gt;Jay Alammar&lt;/a&gt;, &lt;a href="https://twitter.com/amitness"&gt;Amit Chaudhary&lt;/a&gt;, and &lt;a href="https://twitter.com/DrElleOBrien"&gt;Elle O’Brien&lt;/a&gt; for generously sharing their work and process.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Thanks&lt;/strong&gt; to Yang Xinyi, &lt;a href="https://xdg.me"&gt;David Golden&lt;/a&gt;, &lt;a href="https://kylascanlon.com"&gt;Kyla Scanlon&lt;/a&gt;, &lt;a href="https://rob.co.bb"&gt;Robert Cobb&lt;/a&gt;, &lt;a href="https://wearenotsaved.com"&gt;Ross Richey&lt;/a&gt;, and &lt;a href="https://www.compoundwriting.com/"&gt;Compound&lt;/a&gt; for reading drafts of this.&lt;/p&gt;
&lt;h2&gt;Great projects and why their creators built them&lt;/h2&gt;

&lt;p&gt;Vincent Warmerdam has several projects listed on his &lt;a href="https://koaning.io/projects.html#skedulord"&gt;site&lt;/a&gt; and most of the &lt;a href="https://github.com/koaning?tab=repositories"&gt;code&lt;/a&gt; is open source. The projects are a combination of useful (e.g., &lt;a href="https://rasahq.github.io/whatlies/"&gt;word embedding visualizations&lt;/a&gt;, &lt;a href="https://scikit-lego.readthedocs.io/en/latest/"&gt;scikit-lego&lt;/a&gt;) and fun (e.g., &lt;a href="https://koaning.github.io/skedulord/"&gt;cron scheduler&lt;/a&gt;) and have great documentation. Here’s his take on why he builds and shares these projects:&lt;/p&gt;

&lt;p&gt;&lt;em&gt;“Some of those tools (mainly; &lt;a href="https://rasahq.github.io/whatlies/"&gt;whatlies&lt;/a&gt;) are written as part of my job. So I gotta admit that I’m a ‘lil bit lucky there. A lot of the other tools originated more from a “this should exist”-feeling. I’ve learned a lot from making these tools, sure, but the reason why they exist is because it is scratching an itch.”&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Another great example is &lt;a href="https://github.com/alvations"&gt;Liling Tan’s work on NLP&lt;/a&gt;. He builds corpora and tools for NLP. This includes multilingual corpus, &lt;a href="https://github.com/alvations/pywsd"&gt;word sense disambiguation&lt;/a&gt;, and &lt;a href="https://github.com/alvations/charguana"&gt;“character vomiting”&lt;/a&gt;. There’s a mix of quirky and useful, and a lot of learning. Here’s why he built them and his advice on sharing code publicly (without being embarrassed):&lt;/p&gt;

&lt;p&gt;&lt;em&gt;“Usually it starts with scratching my own itch or satisfying some curiosity. For example, the &lt;a href="https://github.com/alvations/charguana"&gt;“character vomiting” tool&lt;/a&gt; was built to identify all possible unicode characters that can be generated for specific languages for an NLP task. So I dug into the &lt;a href="https://unicode.org/versions/Unicode13.0.0/"&gt;unicode specification&lt;/a&gt; and learned a whole lot about similarities and peculiarities of languages and how Unicode categorize different character sets.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Like viral TikTok videos, you’ll never know which open source becomes popular, so open sourcing your code often is a good way to expose yourself to feedbacks and sometimes get great ideas from feature requests. And for those that are afraid of people being critical at your code publicly, my two cents worth is never to be ashamed of the code you write/release, we all started from zero and everyone is constantly learning in the computing/data world, see &lt;a href="https://stackoverflow.com/questions/12453580/how-to-concatenate-items-in-a-list-to-a-single-string."&gt;how to concatenate strings&lt;/a&gt;.”&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://madewithml.com"&gt;Made With ML&lt;/a&gt; (MWML) has a &lt;a href="https://twitter.com/madewithml/status/1284503478685978625"&gt;thread of personal projects&lt;/a&gt; showcased on their platform. It includes applying research to product, building ML apps, as well as teaching and sharing about data science journeys. The MWML team shared that a few of these folks got hiring into computer vision and joined &lt;a href="https://www.wandb.com"&gt;Weights and Biases&lt;/a&gt;. (Also, here’s a great collection of projects from their &lt;a href="https://madewithml.com/collections/7828/ds-incubator-summer-2020/"&gt;DS Incubator&lt;/a&gt;.)&lt;/p&gt;
&lt;h2&gt;Great writing and why their creators share&lt;/h2&gt;

&lt;p&gt;Jay Alammar’s &lt;a href="http://jalammar.github.io"&gt;site&lt;/a&gt; is a great example of amazing content-based projects. Virtually no one learns about &lt;a href="http://jalammar.github.io/illustrated-transformer/"&gt;Transformers&lt;/a&gt; or &lt;a href="http://jalammar.github.io/illustrated-bert/"&gt;BERT, ELMo, and co.&lt;/a&gt; without Jay’s illustrated guides. It’s clear that he enjoys demystifying NLP techniques for the rest of the world, and puts care and effort into creating his content. This stems from his curiosity and desire to help others understand research more easily.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;“My ML work is motivated by:&lt;/em&gt;&lt;/p&gt;

&lt;ul&gt;
  &lt;li&gt;&lt;em&gt;Intense curiosity about the topics I write about and fascination about the developments in NLP.&lt;/em&gt;&lt;/li&gt;
  &lt;li&gt;&lt;em&gt;Writing, visualizing, and publishing my work forces me to learn much deeper than if I was just to read a paper.&lt;/em&gt;&lt;/li&gt;
  &lt;li&gt;&lt;em&gt;Reading cutting-edge work in the field is often intimidating, I find. But I found if I give a certain concept enough time and focus, I can understand it in simpler terms than I would glean from original papers. By elucidating my new-found understanding visually, I hope to make it easier for others to quickly grasp these concepts.&lt;/em&gt;&lt;/li&gt;
  &lt;li&gt;&lt;em&gt;I love the collaborative and open sharing of code and concepts in software and ML fields. I’ve benefitted from incredible software, documentation, and research that people voluntarily put out there for everyone. I want to be a part of that virtuous cycle.”&lt;/em&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Along the same vein, Amit Chaudhary writes weekly to explain machine learning concepts using diagrams, animations, and intuition. It’s part of his approach of taking small steps to get better at his craft. I enjoyed his breakdown of &lt;a href="https://amitness.com/2020/07/checklist/"&gt;behavioral testing for NLP models&lt;/a&gt; and &lt;a href="https://amitness.com/2020/08/information-retrieval-evaluation/"&gt;information retrieval evaluation metrics&lt;/a&gt;. It started as a hobby and has since helped him make new friends.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;“I initially started writing just as a hobby to share what I was learning. In the process of helping others, it turned out to be a great way to discover my interest areas, connect with interesting people in the ML space, and build a portfolio. I feel everyone faces some unique challenges and resource gaps in their space and can help fill that gap through their writing.”&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Another interesting example is Elle O’Brien’s writing. Her content is data science with a touch of quirky. I enjoyed her content on using machine learning to &lt;a href="https://pudding.cool/2018/05/cookies/#conclusion"&gt;bake the most average cookie&lt;/a&gt; and &lt;a href="https://pudding.cool/2019/11/big-hair/"&gt;visualizing big data of big hair&lt;/a&gt; (mouse over the visuals!). They go beyond the cookie-cutter content we see on Medium. Here's her process of using side projects to learn and pay it forward, which also led to her current role.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;“I use side projects as a way to motivate myself to learn data science techniques really thoroughly. For example, once I realized you could teach neural networks to generate completely ridiculous content, I figured I could finally know how a computer would make up romance novel titles. And that started me wanting to use deep learning.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;My process is to start with a question, go wherever that question takes me, and then share the project. Sharing your work is important. &lt;em&gt;Everything&lt;/em&gt; I learned about the practical, hands-on aspects of data science, I learned from people who have shared their software and their datasets and their thinking. So sharing is “paying it forward”. It also helps you build credentials and network; I got my current job through Twitter after I shared a project using a generative adversarial network.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Something worth noting: When I did these side projects, I was doing a doctoral degree that was teaching a lot, but not much about modern machine learning (all the action happening the last few years in deep learning, for example). Side projects made sure I was establishing some credentials there, so I’d be able to get the jobs I wanted when I graduated. And also so I didn’t “miss out” on all the action :) ”&lt;/em&gt;&lt;/p&gt;



</description>
      <category>datascience</category>
      <category>learning</category>
      <category>career</category>
    </item>
    <item>
      <title>How to Install Google’s Scalable Nearest Neighbors (ScaNN) on Mac</title>
      <dc:creator>Eugene Yan</dc:creator>
      <pubDate>Thu, 15 Oct 2020 02:44:49 +0000</pubDate>
      <link>https://dev.to/eugeneyan/how-to-install-google-s-scalable-nearest-neighbors-scann-on-mac-64l</link>
      <guid>https://dev.to/eugeneyan/how-to-install-google-s-scalable-nearest-neighbors-scann-on-mac-64l</guid>
      <description>&lt;p&gt;A few months back, Google shared about Scalable Nearest Neighbors, &lt;a href="https://ai.googleblog.com/2020/07/announcing-scann-efficient-vector.html" rel="noopener noreferrer"&gt;ScaNN&lt;/a&gt; (&lt;a href="https://arxiv.org/abs/1908.10396" rel="noopener noreferrer"&gt;Paper&lt;/a&gt;, &lt;a href="https://github.com/google-research/google-research/tree/master/scann" rel="noopener noreferrer"&gt;Code&lt;/a&gt;) for efficient vector similarity search. It seemed to beat the SOTA benchmarks on angular distance (i.e., &amp;gt;2x throughput for a given recall level).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F0t0f7nbf6x5yj0851dfg.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fi%2F0t0f7nbf6x5yj0851dfg.jpg" alt="Alt Text"&gt;&lt;/a&gt;&lt;br&gt;
ANN Benchmarks on the GloVe embeddings (dim=100) (&lt;a href="https://ai.googleblog.com/2020/07/announcing-scann-efficient-vector.html" rel="noopener noreferrer"&gt;source&lt;/a&gt;)&lt;/p&gt;
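&lt;p&gt;To make the benchmark’s recall axis concrete, here’s a small numpy sketch (not ScaNN itself) of recall@k: the fraction of the true top-k neighbors that an approximate search returns.&lt;/p&gt;

```python
import numpy as np

def recall_at_k(approx_ids, exact_ids) -> float:
    """Fraction of the exact top-k neighbors present in the approximate top-k."""
    return len(set(approx_ids) & set(exact_ids)) / len(exact_ids)

rng = np.random.default_rng(0)
db = rng.normal(size=(1000, 100))  # 1,000 vectors of dim=100, like GloVe-100
query = rng.normal(size=100)

# Exact top-10 by inner product (the setting ScaNN targets).
exact_top10 = np.argsort(-(db @ query))[:10]

# Pretend an approximate index got 8 of the 10 right and returned 2 misses.
misses = [i for i in range(1000) if i not in set(exact_top10)][:2]
approx_top10 = list(exact_top10[:8]) + misses

print(recall_at_k(approx_top10, exact_top10))  # 0.8
```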

&lt;p&gt;Recently, I found some time to try it out but was frustrated by how tricky it was to install on a Mac. Here are the steps I took to install it successfully.&lt;/p&gt;
&lt;h2&gt;Step-by-step walkthrough&lt;/h2&gt;

&lt;p&gt;First, we install the necessary build tools and compilers.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;brew &lt;span class="nb"&gt;install &lt;/span&gt;bazel
brew &lt;span class="nb"&gt;install &lt;/span&gt;llvm
brew &lt;span class="nb"&gt;install &lt;/span&gt;gcc
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then, we set up our Python version via &lt;code&gt;pyenv&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;brew update &lt;span class="o"&gt;&amp;amp;&amp;amp;&lt;/span&gt; brew upgrade pyenv
pyenv &lt;span class="nt"&gt;--version&lt;/span&gt;
&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; pyenv 1.2.21

pyenv &lt;span class="nb"&gt;install &lt;/span&gt;3.8.6. &lt;span class="c"&gt;# Doesn't work with 3.9 yet&lt;/span&gt;
pyenv &lt;span class="nb"&gt;local &lt;/span&gt;3.8.6
python &lt;span class="nt"&gt;--version&lt;/span&gt;
&lt;span class="o"&gt;&amp;gt;&lt;/span&gt; Python 3.8.6
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now, we create our virtual environment.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;python &lt;span class="nt"&gt;-m&lt;/span&gt; venv .venv
&lt;span class="nb"&gt;source&lt;/span&gt; .venv/bin/activate
pip &lt;span class="nb"&gt;install&lt;/span&gt; &lt;span class="nt"&gt;--upgrade&lt;/span&gt; pip
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;&lt;code&gt;ScaNN&lt;/code&gt; is part of the &lt;a href="https://github.com/google-research/google-research" rel="noopener noreferrer"&gt;google-research repo&lt;/a&gt;, which is huge: it contains more than 200 directories, most of which we don’t need. Thus, we’ll do the following to check out only the ScaNN directory.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;git clone &lt;span class="nt"&gt;--depth&lt;/span&gt; 1 &lt;span class="nt"&gt;--filter&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;blob:none &lt;span class="nt"&gt;--no-checkout&lt;/span&gt; https://github.com/google-research/google-research.git
git checkout master &lt;span class="nt"&gt;--&lt;/span&gt; scann
&lt;span class="nb"&gt;cd &lt;/span&gt;scann
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Next, we’ll need to install the Python dependencies.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;wheel
python configure.py
&lt;span class="c"&gt;# There might be complaints about "tensorflow 2.3.1 requires numpy&amp;lt;1.19.0,&amp;gt;=1.16.0, but you'll have numpy 1.19.2 which is incompatible." but it's fine&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Several issues prevent a direct installation, so we’ll fix them manually here. &lt;/p&gt;

&lt;p&gt;First, we’ll update &lt;code&gt;.bazelrc&lt;/code&gt; and &lt;code&gt;.bazel-query.sh&lt;/code&gt;. (It’s not absolutely necessary to update &lt;code&gt;.bazel-query.sh&lt;/code&gt;, but we’ll do it anyway for completeness.) In both files, we should replace:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;TF_SHARED_LIBRARY_NAME&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"ensorflow_framework.2"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;&lt;span class="nv"&gt;TF_SHARED_LIBRARY_NAME&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s2"&gt;"libtensorflow_framework.2.dylib"&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then, we’ll update the C++ includes by replacing each occurrence of the following (there are four of them):&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight cpp"&gt;&lt;code&gt;&lt;span class="cp"&gt;#include&lt;/span&gt; &lt;span class="cpf"&gt;&amp;lt;hash_set&amp;gt;&lt;/span&gt;&lt;span class="cp"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight cpp"&gt;&lt;code&gt;&lt;span class="cp"&gt;#include&lt;/span&gt; &lt;span class="cpf"&gt;&amp;lt;ext/hash_set&amp;gt;&lt;/span&gt;&lt;span class="cp"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now, we can build it via &lt;code&gt;bazel&lt;/code&gt;. Instead of using &lt;code&gt;clang-8&lt;/code&gt; as specified, I just used the latest version of &lt;code&gt;clang&lt;/code&gt; and it worked fine.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight cpp"&gt;&lt;code&gt;&lt;span class="n"&gt;CC&lt;/span&gt;&lt;span class="o"&gt;=/&lt;/span&gt;&lt;span class="n"&gt;usr&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;local&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;opt&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;llvm&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;bin&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;clang&lt;/span&gt; &lt;span class="n"&gt;CXX&lt;/span&gt;&lt;span class="o"&gt;=/&lt;/span&gt;&lt;span class="n"&gt;usr&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;local&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;opt&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;gcc&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;bin&lt;/span&gt;&lt;span class="o"&gt;/&lt;/span&gt;&lt;span class="n"&gt;gcc&lt;/span&gt; &lt;span class="n"&gt;bazel&lt;/span&gt; &lt;span class="n"&gt;build&lt;/span&gt; &lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;c&lt;/span&gt; &lt;span class="n"&gt;opt&lt;/span&gt; &lt;span class="o"&gt;--&lt;/span&gt;&lt;span class="n"&gt;copt&lt;/span&gt;&lt;span class="o"&gt;=-&lt;/span&gt;&lt;span class="n"&gt;mavx2&lt;/span&gt; &lt;span class="o"&gt;--&lt;/span&gt;&lt;span class="n"&gt;copt&lt;/span&gt;&lt;span class="o"&gt;=-&lt;/span&gt;&lt;span class="n"&gt;mfma&lt;/span&gt; &lt;span class="o"&gt;--&lt;/span&gt;&lt;span class="n"&gt;cxxopt&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"-D_GLIBCXX_USE_CXX11_ABI=0"&lt;/span&gt; &lt;span class="o"&gt;--&lt;/span&gt;&lt;span class="n"&gt;cxxopt&lt;/span&gt;&lt;span class="o"&gt;=&lt;/span&gt;&lt;span class="s"&gt;"-std=c++17"&lt;/span&gt; &lt;span class="o"&gt;--&lt;/span&gt;&lt;span class="n"&gt;copt&lt;/span&gt;&lt;span class="o"&gt;=-&lt;/span&gt;&lt;span 
class="n"&gt;fsized&lt;/span&gt;&lt;span class="o"&gt;-&lt;/span&gt;&lt;span class="n"&gt;deallocation&lt;/span&gt; &lt;span class="o"&gt;--&lt;/span&gt;&lt;span class="n"&gt;copt&lt;/span&gt;&lt;span class="o"&gt;=-&lt;/span&gt;&lt;span class="n"&gt;w&lt;/span&gt; &lt;span class="o"&gt;:&lt;/span&gt;&lt;span class="n"&gt;build_pip_pkg&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If it builds successfully, we should see output similar to this.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;INFO: Elapsed &lt;span class="nb"&gt;time&lt;/span&gt;: 316.366s, Critical Path: 206.32s
INFO: 1066 processes: 319 internal, 747 local.
INFO: Build completed successfully, 1066 total actions
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Then, we build the Python wheel:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;./bazel-bin/build_pip_pkg
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;And now we can install it:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;scann-1.1.1-&amp;lt;replace with your package suffix&amp;gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can test if the installation was successful in Python:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;scann&lt;/span&gt;
&lt;span class="n"&gt;scann&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;scann_ops_pybind&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;builder&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt;

&lt;span class="nc"&gt;Traceback &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;most&lt;/span&gt; &lt;span class="n"&gt;recent&lt;/span&gt; &lt;span class="n"&gt;call&lt;/span&gt; &lt;span class="n"&gt;last&lt;/span&gt;&lt;span class="p"&gt;):&lt;/span&gt;
  &lt;span class="n"&gt;File&lt;/span&gt; &lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="s"&gt;&amp;lt;stdin&amp;gt;&lt;/span&gt;&lt;span class="sh"&gt;"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;line&lt;/span&gt; &lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="ow"&gt;in&lt;/span&gt; &lt;span class="o"&gt;&amp;lt;&lt;/span&gt;&lt;span class="n"&gt;module&lt;/span&gt;&lt;span class="o"&gt;&amp;gt;&lt;/span&gt;
&lt;span class="nb"&gt;TypeError&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nf"&gt;builder&lt;/span&gt;&lt;span class="p"&gt;()&lt;/span&gt; &lt;span class="n"&gt;missing&lt;/span&gt; &lt;span class="mi"&gt;3&lt;/span&gt; &lt;span class="n"&gt;required&lt;/span&gt; &lt;span class="n"&gt;positional&lt;/span&gt; &lt;span class="n"&gt;arguments&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;db&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;num_neighbors&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="ow"&gt;and&lt;/span&gt; &lt;span class="sh"&gt;'&lt;/span&gt;&lt;span class="s"&gt;distance_measure&lt;/span&gt;&lt;span class="sh"&gt;'&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you see this &lt;code&gt;TypeError&lt;/code&gt; (rather than an import error), the installation was successful. Here’s a &lt;a href="https://github.com/google-research/google-research/blob/master/scann/docs/example.ipynb" rel="noopener noreferrer"&gt;sample demo on using it&lt;/a&gt;.&lt;/p&gt;

</description>
    </item>
    <item>
      <title>How Prototyping Can Help You to Get Buy-In</title>
      <dc:creator>Eugene Yan</dc:creator>
      <pubDate>Mon, 12 Oct 2020 01:06:34 +0000</pubDate>
      <link>https://dev.to/eugeneyan/how-prototyping-can-help-you-to-get-buy-in-23m</link>
      <guid>https://dev.to/eugeneyan/how-prototyping-can-help-you-to-get-buy-in-23m</guid>
      <description>&lt;p&gt;Sometimes, well-thought-out proposals backed by extensive research and data will fail to convince decision-makers. But a demo of a simple prototype will get them excited and ready to commit. I'm surprised how often this happens.&lt;/p&gt;

&lt;p&gt;While we should still &lt;a href="https://eugeneyan.com/writing/what-i-do-before-a-data-science-project-to-ensure-success/#first-draw-the-map-to-the-destination-one-pager"&gt;start projects with a one-pager&lt;/a&gt; to clarify, socialize, and get feedback on our ideas, if we're not making headway, one option is to build and demo a prototype. &lt;/p&gt;

&lt;h2&gt;
  Why prototypes work: They're more concrete
&lt;/h2&gt;

&lt;p&gt;Prototypes make &lt;strong&gt;our vision more accessible&lt;/strong&gt;. For non-technical stakeholders, ideas and designs can be &lt;em&gt;too abstract&lt;/em&gt;. It’s hard to imagine the end-user experience from a document. In contrast, prototypes—especially those with a graphical user interface (GUI)—make it easier to understand the deliverable's look and feel. By helping others understand our ideas better, we increase our chances of getting buy-in.&lt;/p&gt;

&lt;p&gt;Prototypes also serve as &lt;strong&gt;proof of technology&lt;/strong&gt;. What we claim to do with data and machine learning can be a stretch. (That said, the reverse, where we’re expected to do magic with a 500-row Excel sheet, also happens.) While trying to get buy-in, I’m occasionally met with skepticism—“Is that &lt;em&gt;really&lt;/em&gt; doable?” By building a prototype, the proposal becomes real and achievable, making it easier to convince decision-makers.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;"It always seems impossible until it is done." – Nelson Mandela&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Last, prototypes make it &lt;strong&gt;easier to get feedback&lt;/strong&gt;. Instead of attending a presentation or reading a proposal, people can interact with our app. They can test it with their data (e.g., product images, user interactions) and see the outcomes. By letting them interact with a prototype of our proposed idea, we increase our chances of getting feedback.&lt;/p&gt;

&lt;h2&gt;
  But prototypes don't explain "Why"
&lt;/h2&gt;

&lt;p&gt;While prototypes can show "What does it look like?" and "How will it work?", &lt;strong&gt;they don't explain “Why should we build it?”&lt;/strong&gt; Thus, if the &lt;em&gt;"why"&lt;/em&gt; for the project is still unclear, we're better off writing a &lt;a href="https://eugeneyan.com/writing/what-i-do-before-a-data-science-project-to-ensure-success/#first-draw-the-map-to-the-destination-one-pager"&gt;one-pager&lt;/a&gt; to clarify the problem, intent, and success criteria. Else, it doesn’t matter how well our prototype works if it doesn't solve the &lt;em&gt;right&lt;/em&gt; problem.&lt;/p&gt;

&lt;p&gt;Also, &lt;strong&gt;depending on the scale of the project, a quick prototype may not be viable&lt;/strong&gt;. Large-scale projects (e.g., drone delivery, self-driving cars, in-house ML platform) will need significant resources and can't be prototyped in a week or two. Getting a first prototype off the ground might require months, if not years.&lt;/p&gt;

&lt;h2&gt;
  How a prototype worked when a roadmap didn’t
&lt;/h2&gt;

&lt;p&gt;In a previous role, I tried convincing stakeholders that our team should start working on computer vision. We had a massive catalog of product images and could use it to serve customers better. We would start with image classifiers to improve product categorization, before adapting them for image search and recommendation.  &lt;/p&gt;

&lt;p&gt;However, I failed to get buy-in. Some felt that our team lacked the expertise (read: "You won't be able to do it") and that it wasn’t a viable business opportunity.&lt;/p&gt;

&lt;p&gt;I was disappointed but undeterred. If I couldn’t build it at work, I would have to build it in my free time. Using image data scraped from Amazon, I hacked together a &lt;a href="https://en.wikipedia.org/wiki/Theano_(software)"&gt;&lt;code&gt;Theano&lt;/code&gt;&lt;/a&gt; implementation (it was a &lt;em&gt;while&lt;/em&gt; back) of &lt;a href="https://en.wikipedia.org/wiki/Residual_neural_network"&gt;ResNet&lt;/a&gt; and applied transfer learning for &lt;a href="https://eugeneyan.com/writing/image-categorization-is-now-live/"&gt;image classification&lt;/a&gt;. To build &lt;a href="https://eugeneyan.com/writing/image-search-is-now-live/"&gt;image search&lt;/a&gt;, I used embeddings from the penultimate layer and calculated cosine similarity. This took months to learn and build but it worked decently.&lt;/p&gt;
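&lt;p&gt;The embedding-and-cosine-similarity step above can be sketched in plain Python. This is a minimal illustration, not the original implementation: the toy catalog and three-dimensional vectors below are hypothetical stand-ins for real penultimate-layer embeddings.&lt;/p&gt;

```python
import math

# Cosine similarity between two equal-length embedding vectors.
def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(x * x for x in b))
    return dot / (norm_a * norm_b)

# Hypothetical catalog: item name -> embedding from a model's penultimate layer.
catalog = {
    "red dress": [0.9, 0.1, 0.0],
    "blue sofa": [0.1, 0.8, 0.3],
}

# Image search: return the catalog item whose embedding is closest to the query.
def most_similar(query_embedding):
    return max(catalog, key=lambda name: cosine_similarity(catalog[name], query_embedding))
```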

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--t8ebQBgz--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/1otqahh17n29femmubg1.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--t8ebQBgz--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/1otqahh17n29femmubg1.gif" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p class="image-caption"&gt;Demo of image classifier and image search on fashion products&lt;/p&gt;

&lt;p&gt;The effort was well worth it. Stakeholders were clearly excited when I demoed the prototypes. In particular, they found image-based search &amp;amp; recommendation to be a compelling use case, especially for visual shopping (e.g., fashion, furniture, toys). (A big chunk of transactions came from fashion.) This kickstarted our effort to invest in GPU clusters and work on computer vision applications.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--L8mcSY6m--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/u3gh9qjxk9fu9ttz341m.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--L8mcSY6m--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/u3gh9qjxk9fu9ttz341m.jpg" alt="Alt Text"&gt;&lt;/a&gt;&lt;br&gt;
&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--5XNmC1hP--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/q6lapus3zoi73p0ldwww.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--5XNmC1hP--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/q6lapus3zoi73p0ldwww.jpg" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p class="image-caption"&gt;Image search on toys and furniture&lt;/p&gt;

&lt;p&gt;In 2018, the image search feature went live on the app.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--s1ijxEMB--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/34a64yu9or7b11w2tljf.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--s1ijxEMB--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/34a64yu9or7b11w2tljf.jpg" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p class="image-caption"&gt;Now customers can visually search for products via photos (&lt;a href="https://liveatpc.com/lazada-introduces-ai-powered-image-search-feature/"&gt;source&lt;/a&gt;)&lt;/p&gt;
&lt;h2&gt;
  How to build a prototype now?
&lt;/h2&gt;

&lt;p&gt;Many libraries make it easier to build and deploy machine learning prototypes in &lt;code&gt;Python&lt;/code&gt;. For web application frameworks, &lt;a href="https://palletsprojects.com/p/flask/"&gt;&lt;code&gt;Flask&lt;/code&gt;&lt;/a&gt; and &lt;a href="https://bottlepy.org/docs/dev/"&gt;&lt;code&gt;Bottle&lt;/code&gt;&lt;/a&gt; are widely used while &lt;a href="https://fastapi.tiangolo.com"&gt;&lt;code&gt;FastAPI&lt;/code&gt;&lt;/a&gt; is gaining popularity. I recently switched to &lt;code&gt;FastAPI&lt;/code&gt; and like it very much. Here’s a &lt;a href="https://amitness.com/2020/06/fastapi-vs-flask/"&gt;great comparison&lt;/a&gt; for &lt;code&gt;Flask&lt;/code&gt; users by &lt;a href="https://twitter.com/amitness"&gt;Amit Chaudhary&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;To build a basic front-end, we can use a combination of &lt;a href="https://palletsprojects.com/p/jinja/"&gt;&lt;code&gt;Jinja&lt;/code&gt;&lt;/a&gt; templates and &lt;a href="https://www.w3.org/Style/CSS/Overview.en.html"&gt;&lt;code&gt;CSS&lt;/code&gt;&lt;/a&gt;, and perhaps a framework like &lt;a href="https://getbootstrap.com"&gt;&lt;code&gt;Bootstrap&lt;/code&gt;&lt;/a&gt;. &lt;a href="https://www.streamlit.io"&gt;&lt;code&gt;Streamlit&lt;/code&gt;&lt;/a&gt; is also a popular option. (I haven’t tried &lt;code&gt;Streamlit&lt;/code&gt; though and would love to hear your experience with it.)&lt;/p&gt;

&lt;p&gt;To serve our prototype, we can use &lt;a href="https://www.docker.com"&gt;&lt;code&gt;Docker&lt;/code&gt;&lt;/a&gt; to wrap and deploy it in a &lt;a href="https://www.docker.com/resources/what-container"&gt;container&lt;/a&gt;. For cloud servers, I tend to use Amazon Elastic Compute Cloud (EC2) &lt;a href="https://aws.amazon.com/ec2/spot/"&gt;spot instances&lt;/a&gt; as they’re cheap. There are also options such as &lt;a href="https://www.heroku.com"&gt;Heroku&lt;/a&gt; and &lt;a href="https://www.digitalocean.com"&gt;Digital Ocean&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Not sure how to get started? Here are a couple of additional resources:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://testdriven.io/blog/fastapi-machine-learning/"&gt;Deploying machine learning with FastAPI and Heroku&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://eugeneyan.com/writing/how-to-set-up-html-app-with-fastapi-jinja-forms-templates/"&gt;Setting up FastAPI with Jinja, Forms, and Templates&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://eugeneyan.com/writing/fastapi-html-checkbox-download/"&gt;Adding a checkbox and download button to a FastAPI web app&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;
&lt;h2&gt;
  Learn from my mistake with an early prototype: &lt;code&gt;curl&lt;/code&gt;
&lt;/h2&gt;

&lt;p&gt;One of my first machine learning projects was a title-based product classifier. I was able to build a system that achieved 95% accuracy and was looking to get user feedback. Thus, I wrapped it in a &lt;code&gt;Flask&lt;/code&gt; app and deployed it on our internal servers. &lt;/p&gt;

&lt;p&gt;To use the product classifier, all you had to do was update the product title in the &lt;a href="https://curl.haxx.se/docs/manpage.html"&gt;&lt;code&gt;curl&lt;/code&gt;&lt;/a&gt; command below. It was as simple as it could get (or so I thought). After sharing the &lt;code&gt;curl&lt;/code&gt; command and API specs with business and ops stakeholders, I eagerly monitored the logs for my first users.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;curl &lt;span class="nt"&gt;-d&lt;/span&gt; &lt;span class="s1"&gt;'{"title":"title of product"}'&lt;/span&gt; &lt;span class="nt"&gt;-H&lt;/span&gt; &lt;span class="s2"&gt;"Content-Type: application/json"&lt;/span&gt; &lt;span class="nt"&gt;-X&lt;/span&gt; POST http://internal-url/categorize
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;No one used it 😞. I learned that most of my stakeholders were on Windows machines and didn’t have easy access to &lt;code&gt;curl&lt;/code&gt;. They were also unfamiliar with using the terminal or a tool like &lt;a href="https://blog.postman.com/curl-and-postman-work-wonderfully-together/"&gt;Postman&lt;/a&gt;—I had to simplify it further.&lt;/p&gt;

&lt;p&gt;To get around this, I spent some time building a simple front-end (the predecessor to the image classifier and image search UI) and shared my prototype again. This time around, instead of a &lt;code&gt;curl&lt;/code&gt; command, stakeholders received a web URL.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--kWFSp0Ja--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/rkmv5kcc05rkn8s1zwf9.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--kWFSp0Ja--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://dev-to-uploads.s3.amazonaws.com/i/rkmv5kcc05rkn8s1zwf9.jpg" alt="Alt Text"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p class="image-caption"&gt;A simple GUI for product classification&lt;/p&gt;

&lt;p&gt;The result? My tiny two-thread &lt;code&gt;Flask&lt;/code&gt; server crashed from too many concurrent requests. &lt;/p&gt;

&lt;p&gt;I learned a valuable lesson from this experience: No matter how good your tech is, non-technical users are unlikely to try it unless it has a GUI and is easy to use. Since then, I’ve always taken the time to add a simple GUI to my prototypes.&lt;/p&gt;

&lt;h2&gt;
  Try building a prototype early in your next project
&lt;/h2&gt;

&lt;p&gt;Are you having difficulty communicating your vision and getting buy-in? Why not spend a week or two building a prototype? You'll be surprised how effective it can be.&lt;/p&gt;

&lt;p&gt;Do you have stories and experiences of prototypes helping to push a project forward? I would love to hear about them in the comments below.&lt;/p&gt;

</description>
      <category>datascience</category>
      <category>productivity</category>
      <category>career</category>
    </item>
  </channel>
</rss>
