Neeraj Kashyap


Entropy for developers

Entropy is a mathematical concept which quantifies the amount of information in an observation.

This makes entropy useful to you, as a programmer, whenever you are working with data. When working with large datasets, entropy-based heuristics can tell you where to focus your attention to get the most bang for your buck.

At Bugout, we have a tagging system for knowledge bases very similar to the tags that DEV uses for their posts. We calculate the informative power of each tag using entropy. We use these calculations to help our users categorize their knowledge and navigate their knowledge bases more efficiently. Entropy helps us power these user experiences with simple calculations we do in our database as opposed to using complicated machine learning models.

Entropy and related information theoretic notions are very useful in machine learning, as well.

Decision tree algorithms tend to use entropy to learn effective branching structures.

When training neural networks, entropy is usually incorporated into the training process via a loss function. Cross entropy, for example, is a commonly used loss function when training classifiers. A good understanding of entropy not only helps you understand the behavior of such models, it also helps you tune them more effectively.
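
To make this concrete, here is a minimal sketch, in plain Python, of cross entropy loss for a single classification example; the function name and numbers are purely illustrative, and frameworks such as PyTorch and TensorFlow provide batched, optimized versions of the same idea.

import math

def cross_entropy(predicted_probs, true_class):
    # Cross entropy penalizes the model according to how little
    # probability it assigned to the correct class.
    return -math.log(predicted_probs[true_class])

# A confident, correct prediction incurs a small loss...
print(cross_entropy([0.05, 0.9, 0.05], true_class=1))  # ~0.105
# ...while an unsure prediction incurs a larger one.
print(cross_entropy([0.4, 0.3, 0.3], true_class=1))    # ~1.204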

If you are curious to learn more about entropy and start using it to improve your data-related work, this post will get you started.

Towards the end of the post, I also describe some instances during my career where the concept of entropy has come in handy.

This post assumes a basic understanding of probability and some familiarity with mathematical notation.

Definition

Say you have a probability distribution with outcomes 1, 2, \ldots, n, which occur with probabilities p_1, p_2, \ldots, p_n.

The entropy of this probability distribution is:

H(p_1, \ldots, p_n) := -\sum_{j=1}^n p_j \log(p_j)
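
As a minimal sketch, the definition translates directly into a few lines of Python (the base argument is explained in the Units section below):

import math

def entropy(probs, base=math.e):
    # Entropy of a discrete distribution p_1, ..., p_n.
    # Outcomes with probability 0 contribute nothing (the limit of p*log(p) is 0).
    return -sum(p * math.log(p, base) for p in probs if p > 0)

print(entropy([1/2, 1/2]))  # ln(2) ≈ 0.693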

Properties

Much more important than the definition are the properties that it embodies.

First, entropy increases with the number of (equally likely) outcomes:

H\left(\frac{1}{n}, \ldots, \frac{1}{n}\right) < H\left(\frac{1}{n+1}, \ldots, \frac{1}{n+1}\right)

Example: Coins and dice
H(1/2, 1/2) < H(1/6, 1/6, 1/6, 1/6, 1/6, 1/6),

which says that the roll of a fair die contains more information than the flip of a fair coin.

Second, entropy is maximized when outcomes are equally likely:

H(p_1, \ldots, p_n) < H\left(\frac{1}{n}, \ldots, \frac{1}{n}\right)

for any probability distribution (p_1, \ldots, p_n) for which some p_j \neq 1/n.

This feeds into our intuition that the amount of information in an event is the amount of uncertainty it resolves for us.

Example: Fair coins and biased coins
H(1/4, 3/4) < H(1/2, 1/2),

which says that we get much less information when flipping a coin which is biased to land on heads only 25% of the time than when flipping a fair coin.
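
Both properties are easy to check numerically with the entropy function from the sketch above:

# Property 1: more equally likely outcomes means more entropy.
fair_coin = entropy([1/2, 1/2])  # ≈ 0.693
fair_die = entropy([1/6] * 6)    # ≈ 1.792
assert fair_coin < fair_die

# Property 2: among distributions over the same outcomes,
# the uniform one carries the most entropy.
biased_coin = entropy([1/4, 3/4])  # ≈ 0.562
assert biased_coin < fair_coin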

Units

One of the choices you have to make when calculating the entropy of a probability distribution is the base you will choose for the logarithm.

This choice doesn't usually matter much. This is because changing the base only changes the calculation by a constant factor. Since you usually use entropy to make comparisons between probability distributions, multiplying all calculations by a constant factor doesn't affect how they are placed relative to each other.

Consider the calculation of

H(1/2, 1/2) = -\left[\frac{1}{2}\log\left(\frac{1}{2}\right) + \frac{1}{2}\log\left(\frac{1}{2}\right)\right] = -\log\left(\frac{1}{2}\right)

If you take the logarithm to the base of 2, you get

H(1/2, 1/2) = 1

If you generalize this calculation to powers of 2, you get:

H(1/2^n, \ldots, 1/2^n) = n

For any probability distribution with equally likely outcomes, entropy calculated with a base-2 logarithm counts the minimum number of bits required to represent a sample from the distribution.
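
Continuing the earlier sketch, switching the logarithm to base 2 makes the bit-counting interpretation concrete, and shows that changing the base only rescales every value by the same constant factor:

import math

# With base 2, the uniform distribution over 2**n outcomes has entropy n bits.
for n in range(1, 5):
    outcomes = 2 ** n
    print(n, entropy([1 / outcomes] * outcomes, base=2))  # prints n (up to float error)

# Base change is just a constant rescaling, so comparisons are unaffected:
# H computed in base 2, times ln(2), equals H computed with the natural log.
print(entropy([1/4, 3/4], base=2) * math.log(2))  # ≈ 0.562, same as entropy([1/4, 3/4])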

Claude Shannon, the father of information theory, made this connection more explicit:

In information theory, Shannon's source coding theorem establishes the limits to possible data compression, and the operational meaning of the Shannon entropy.

The important takeaway here is that you can think of entropy as answering the question, "What is the smallest number of bits I need to represent a sample from this probability distribution?"

Real-world applications

I list a few examples from my own career where knowing about entropy was extremely useful.

Bugout tags

Bugout is a knowledge base that accepts knowledge from a variety of sources (Slack, GitHub, IDEs, terminals), and makes it discoverable to users wherever they work.

This is what it looks like when someone adds knowledge to their knowledge base from Slack:
[Screenshot: adding an entry to a Bugout knowledge base from Slack]

We place no restriction on the number of tags a user can associate with an entry in their knowledge base. Instead, we rely on intelligently suggesting tags to users when they create content and when they are searching for knowledge.

The first iteration of our tag suggestions was based on how frequently tags were used in a given knowledge base. Tags which appeared more commonly were higher up in the list of suggestions.

The engineering tag shows up on close to 80% of the entries in our team knowledge base, so it doesn't help much when searching through the knowledge base. Worse, as developers repeatedly accept engineering as a suggested tag when adding content, its informative power keeps decreasing even as its frequency of use increases.

We have found entropy to be a much better quantification of the importance of a tag than frequency. It delivers a much better user experience, especially for users navigating a knowledge base.

For each tag in a knowledge base, we think of each entry as an observation of whether or not the tag appears. With p_{tag} as the proportion of entries having the given tag, the entropy of the tag is H(p_{tag}, 1 - p_{tag}).

The best thing about this is that it's easy to calculate in our Postgres database. Here is an example of an (unoptimized) entropy query for tags in Bugout knowledge bases (internally, knowledge bases are represented as journals):

-- Binary entropy H(p, 1 - p) of each tag in each journal, where p is the
-- proportion of the journal's entries that carry the tag.
select tag,
    journal_id,
    num_entries_with_tag,
    num_entries,
    num_entries_with_tag / num_entries as proportion_entries,
    -- Entropy is 0 when the tag appears on none or all of the entries;
    -- handling those cases explicitly avoids evaluating ln(0).
    case num_entries_with_tag / num_entries
        when 0 then 0
        when 1 then 0
        else -(num_entries_with_tag / num_entries) * ln(num_entries_with_tag / num_entries)
            - ((num_entries - num_entries_with_tag) / num_entries) * ln((num_entries - num_entries_with_tag) / num_entries)
    end as entropy
from (
        -- For each (journal, tag) pair: how many entries carry the tag,
        -- and how many entries the journal contains in total.
        select journal_entry_tags.tag as tag,
            journals.id as journal_id,
            cast (count(journal_entries.id) as float) as num_entries_with_tag,
            cast (
                (
                    select count(*)
                    from journal_entries
                    where journal_entries.journal_id = journals.id
                ) as float
            ) as num_entries
        from journal_entries
            inner join journal_entry_tags on journal_entries.id = journal_entry_tags.journal_entry_id
            inner join journals on journals.id = journal_entries.journal_id
        group by journals.id,
            journal_entry_tags.tag
    ) as intermediate_stats;

Outperforming Amazon, Google, and Microsoft APIs on image classification tasks

I used to run the AI team at Doc.ai. At the time, we built image models that provided our users and their medical providers with valuable insights about their physical and mental health.

Being serious about user privacy, especially when dealing with sensitive health data, we built our classification models to run on our users' mobile phones. This meant we couldn't just plug into AWS Rekognition, the Google ML APIs or Microsoft Cognitive Services, as that would leak user health information to Amazon, Google, and Microsoft.

We set out to train models which were competitive with the major cloud APIs by collecting a LOT of relevant images (10s of millions) and bootstrapping our model with the classifications returned by the cloud APIs.

At the end of the day, we were able to outperform the cloud APIs in many cases because of how we used entropy to our advantage.

You can think of image classification results as a probability vector over the possible classes represented in an image or in a segment of that image. We calculated the entropy of the classification results that the cloud APIs returned for each of the images in our dataset. This calculation quantified the amount of uncertainty their models had about their results. This uncertainty decomposed into two components:

  1. The quality of the cloud API model in question
  2. The quality of the image for the given classification task

For a given cloud API, we could take the entropy (largely) as a measure of image quality. Thus, filtering on entropy allowed us to cheaply build very high quality training and evaluation datasets without contracting any kind of labeling service (which would have been exorbitantly expensive at the scale of data we were working with).
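
Here is a stripped-down sketch of that filtering idea; the data, threshold, and names are illustrative rather than what we actually shipped:

import math

def classification_entropy(class_probs):
    # Entropy of a classifier's probability vector over classes.
    return -sum(p * math.log(p) for p in class_probs if p > 0)

# Hypothetical cloud API results: one probability vector per image.
api_results = {
    "img_001.jpg": [0.97, 0.02, 0.01],  # confident classification -> low entropy
    "img_002.jpg": [0.40, 0.35, 0.25],  # uncertain classification -> high entropy
}

# Keep only the images the API classified with low uncertainty; in practice
# the threshold would be tuned against the dataset.
ENTROPY_THRESHOLD = 0.5
high_quality = [
    name for name, probs in api_results.items()
    if classification_entropy(probs) < ENTROPY_THRESHOLD
]
print(high_quality)  # ['img_001.jpg']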

Entropy heuristics were also very important in helping us understand the hyperparameter spaces of the models we trained. For each model, during our hyperparameter tuning phase, we would run ~30 experiments concurrently, and it cost us thousands of dollars a week. As a tiny startup, it was important that we be efficient about hyperparameter tuning. Simple entropy and cross entropy calculations over our dataset and the classifications performed by various model checkpoints helped us zero in on high-performing hyperparameters very quickly.

My team at Doc.ai released a lot of this work as the open source Tensor/IO framework (which also allows you to train models across many user devices using federated learning):

GitHub: doc-ai/tensorio — Declarative, On-Device Machine Learning for iOS, Android, and React Native. Deploy. Predict. Train.

Tensor/IO is a lightweight, cross-platform library for on-device machine learning, bringing the power of TensorFlow and TensorFlow Lite to iOS, Android, and React Native applications. It does not implement any machine learning itself but works with an underlying library such as TensorFlow to simplify the process of deploying and using models on mobile phones.


Happy to answer any questions and continue the discussion. Just leave a comment below. :)
