<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: bennettandrewm</title>
    <description>The latest articles on DEV Community by bennettandrewm (@bennettandrewm).</description>
    <link>https://dev.to/bennettandrewm</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1106630%2F27baa8e9-65b3-4d88-b2a9-e70cb466fa37.png</url>
      <title>DEV Community: bennettandrewm</title>
      <link>https://dev.to/bennettandrewm</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/bennettandrewm"/>
    <language>en</language>
    <item>
      <title>Some (Pleasant) Surprises about the Surprise Module: A Beginner's Thoughts</title>
      <dc:creator>bennettandrewm</dc:creator>
      <pubDate>Fri, 16 Feb 2024 21:18:44 +0000</pubDate>
      <link>https://dev.to/bennettandrewm/some-pleasant-surprises-about-the-surprise-module-a-beginners-thoughts-hia</link>
      <guid>https://dev.to/bennettandrewm/some-pleasant-surprises-about-the-surprise-module-a-beginners-thoughts-hia</guid>
      <description>&lt;h2&gt;
  
  
  Why this Matters:
&lt;/h2&gt;

&lt;p&gt;Recommendation systems are a critical component to boost engagement on streaming services and social media. By mitigating indecision, users are likely to spend more time on these platforms, improving their financial performance. An obvious example is movie selection, but recommendation systems work well for any widely distributed product with a definitive user impact. A popular module for this is &lt;code&gt;surprise&lt;/code&gt;, a package in the Python scikit family. But is it really helpful? The answer is... yes! Would we, as data scientists, sometimes be better off not using it?... Also, yes!&lt;/p&gt;

&lt;h2&gt;
  
  
  Background:
&lt;/h2&gt;

&lt;p&gt;The surprise module is a tool for collaborative filtering of explicit ratings systems. It has numerous built-in algorithms - including the Singular Value Decomposition (SVD) approach that Simon Funk made famous during the Netflix Prize competition, which launched in 2006. It allows you to tune hyperparameters to test different methods on your particular dataset, much like standard scikit methods. For collaborative filtering, it includes item-based vs user-based analysis and a number of KNN and SVD methods. It has a simple install, and it integrates nicely into the scikit environment, because, well, that's how it was designed. So let's dig deeper.&lt;/p&gt;

&lt;h2&gt;
  
  
  Pleasant Surprises
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Simplicity
&lt;/h3&gt;

&lt;p&gt;The best thing about &lt;code&gt;surprise&lt;/code&gt; is its simple, plug-n-play nature. If you're already working in Python and have a dataset suitable for explicit rating systems, a few very easy operations get you right into collaborative filtering. You can follow along right from your Jupyter Notebook as this blog walks you through the very simple basics (fyi - you may need an up-to-date conda or pip environment installed prior to this). &lt;/p&gt;

&lt;p&gt;First, install it, obviously.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;gt;&amp;gt;&amp;gt; pip install scikit-surprise
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Handling Datasets
&lt;/h3&gt;

&lt;p&gt;One of the best things about &lt;code&gt;surprise&lt;/code&gt; is the ease with which it handles datasets. You just import the relevant functions &lt;code&gt;Reader&lt;/code&gt; and &lt;code&gt;Dataset&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;gt;&amp;gt;&amp;gt; from surprise import Reader, Dataset
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;From here, we go one of two ways, use a stored dataset or pull in a new one.&lt;/p&gt;

&lt;h4&gt;
  
  
  New Datasets
&lt;/h4&gt;

&lt;p&gt;You can load any dataset and it will automatically read the number of unique users and items, provided that it's properly formatted. It requires a "user item rating [timestamp]" structure for the columns. This doesn't save a pre-processing step, per se, but once it's loaded correctly, you can strategize about the best filtering methods prior to the actual modeling and hyperparameter tuning. &lt;/p&gt;

&lt;p&gt;The code is simple for, say, a csv file or a pandas dataframe.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Pandas Dataframe&lt;/em&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;gt;&amp;gt;&amp;gt; reader = Reader(rating_scale=(0.0,5.0))
data = Dataset.load_df("sample_data")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Subtle note - you must instantiate the Reader with the rating scale (there's a default of (1, 5), but it's nice to write it out in code for reference/readability).&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Other files&lt;/em&gt;&lt;br&gt;
This code was taken from the &lt;code&gt;surprise&lt;/code&gt; website and modified for ease.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# sample path to dataset file
file_path = os.path.expanduser("~/sample_data.csv")

# instantiate the Reader class with a "line_format" 
# and a "separator"

reader = Reader(line_format="user item rating timestamp",
                sep="\t", rating_scale=(0.0,5.0))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here, you have to specify the separator used in the file, whether it's &lt;code&gt;.csv&lt;/code&gt;, &lt;code&gt;.data&lt;/code&gt;, etc.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# instatiate your dataset with the Dataset module
data = Dataset.load_from_file(file_path, reader=reader)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Built-in Datasets
&lt;/h4&gt;

&lt;p&gt;The surprise module also has built-in datasets to work with, including Jester (a collection of joke ratings) and MovieLens (the classic database used for movie ratings). This makes for a certain ease in building recommendation systems if you're just looking to get some experience. We'll utilize one of those built-in datasets now.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;#read in movielens dataset to surprise format
data = Dataset.load_builtin("ml-100k")

# we will create a test set for validation, this will be 
# used later when we fit the model
trainset, testset = train_test_split(data, test_size = 0.2)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You'll recognize the familiarity with Python Scikit because...&lt;/p&gt;

&lt;h3&gt;
  
  
  Python Scikit Ecosystem
&lt;/h3&gt;

&lt;p&gt;Chances are you're already working in Python's scikit ecosystem. &lt;code&gt;surprise&lt;/code&gt; has similar verbiage around cross-validation, train/test splits, and estimator methods like &lt;code&gt;.fit&lt;/code&gt;, among others.&lt;/p&gt;

&lt;p&gt;To provide an example, we'll import the Singular Value Decomposition (SVD) algorithm (more on this later). We'll also import the &lt;code&gt;accuracy&lt;/code&gt; module, which includes a variety of metrics.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;gt;&amp;gt; from surprise import accuracy, SVD

# We'll use the famous SVD algorithm.
&amp;gt;&amp;gt; algo = SVD()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now we can utilize our previous testset&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Train the algorithm on the trainset, 
# and predict ratings for the testset
&amp;gt;&amp;gt;&amp;gt; algo.fit(trainset)
&amp;gt;&amp;gt;&amp;gt; predictions = algo.test(testset)

# Then compute RMSE
&amp;gt;&amp;gt;&amp;gt; accuracy.rmse(predictions)

RMSE: 0.9405
0.9405357087305851
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Wow, we were able to instantly get predictions from the SVD algorithm on this dataset. Let's talk about some of the available algorithms in &lt;code&gt;surprise&lt;/code&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Algorithms within Surprise
&lt;/h3&gt;

&lt;h4&gt;
  
  
  Existing Algorithms
&lt;/h4&gt;

&lt;p&gt;To aid in your quest, &lt;code&gt;surprise&lt;/code&gt; has a number of built-in models available. The specialties include a variety of KNN algorithms and SVD, including the now-famous algorithm Simon Funk popularized during the Netflix Prize competition. The full list from the &lt;a href="https://surpriselib.com/"&gt;homepage&lt;/a&gt;, with the RMSE of each algorithm's predictions on a sample dataset (MovieLens), is shown below.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg9u1kladybjc22e4equn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg9u1kladybjc22e4equn.png" alt="algorithm_accuracy" width="746" height="499"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Build-Your-Own Algorithm
&lt;/h4&gt;

&lt;p&gt;One of the nice features about &lt;code&gt;surprise&lt;/code&gt; is that you can build your own algorithms. Big deal, you might think, but it does provide a way to integrate with some of the existing algorithms in a seamless manner. For example, if you're feeling confident, (or have additional domain knowledge) you could build a new algorithm and ensemble it with built-in algorithms to create a (sort-of) hybrid filtering system.&lt;/p&gt;
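In surprise, custom algorithms subclass `AlgoBase` and implement an `estimate(user, item)` method. As a hedged, library-free sketch of that pattern (class and variable names here are my own, not the real API), here's a toy "global mean" predictor:

```python
class ToyAlgoBase:
    """Mimics the surprise pattern: fit() learns state, estimate() predicts."""
    def fit(self, trainset):
        self.trainset = trainset
        return self

    def test(self, testset):
        # return (user, item, actual, predicted) tuples
        return [(u, i, r, self.estimate(u, i)) for u, i, r in testset]

class GlobalMean(ToyAlgoBase):
    """Predict every rating as the mean of all training ratings."""
    def fit(self, trainset):
        super().fit(trainset)
        self.mean = sum(r for _, _, r in trainset) / len(trainset)
        return self

    def estimate(self, u, i):
        return self.mean

train = [(0, 0, 4.0), (0, 1, 2.0), (1, 0, 3.0)]
algo = GlobalMean().fit(train)
print(algo.estimate(5, 7))  # 3.0 - the global mean, even for unseen users
```

In the real library, a subclass like this can be cross-validated and ensembled alongside the built-in algorithms.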

&lt;h2&gt;
  
  
  Downsides/Limitations:
&lt;/h2&gt;

&lt;p&gt;To grasp the limitations of the &lt;code&gt;surprise&lt;/code&gt; module, it's important to understand a few different filtering systems. &lt;code&gt;surprise&lt;/code&gt; module works incredibly well with collaborative filtering of explicit ratings. Maybe too well...&lt;/p&gt;

&lt;h3&gt;
  
  
  Bad for Students
&lt;/h3&gt;

&lt;p&gt;What!?!?! (I can hear you say). Yes, I said it. It's not great for learning because... well... it's &lt;em&gt;too&lt;/em&gt; good and &lt;em&gt;too&lt;/em&gt; focused. It's such a simple, plug-n-play model used only for collaborative filtering of explicit ratings systems that it can be a crutch if you're a student. If you're working on a tight deadline in the private sector, then yes, import the surprise module and get your model finished. But if you need to explore, and learn, and try new things, it can do too much of the work for you - and it won't help at all once you move beyond collaborative filtering for explicit ratings systems. More on that below.&lt;/p&gt;

&lt;h3&gt;
  
  
  Explicit vs Implicit Ratings System
&lt;/h3&gt;

&lt;p&gt;A foundational element to understand is that &lt;code&gt;surprise&lt;/code&gt; does not support &lt;em&gt;implicit&lt;/em&gt; ratings systems or content filtering. Understanding the differences in these systems is critical to successful implementation of &lt;code&gt;surprise&lt;/code&gt;.&lt;/p&gt;

&lt;h4&gt;
  
  
  Explicit Ratings
&lt;/h4&gt;

&lt;p&gt;Explicit ratings rely on a known element to specifically rate satisfaction or preference. A nice example of this is a movie rating system on a scale of, say, 1-10. We can rely on this numeric value to indicate the level of satisfaction a user has with a movie. We then use this information to predict how users would rate movies they haven't seen. It becomes a straightforward prediction model once we've done the collaborative filtering.&lt;/p&gt;
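A minimal sketch of that prediction step (pure Python, with my own toy data): estimate a missing rating as the similarity-weighted average of other users' ratings for that movie.

```python
# ratings[user][movie]: explicit 1-5 ratings (toy data)
ratings = {
    "ann": {"up": 5, "heat": 1, "jaws": 4},
    "bob": {"up": 4, "heat": 2},
    "cat": {"up": 1, "heat": 5, "jaws": 2},
}

def cosine_sim(a, b):
    # cosine similarity, with the dot product over co-rated movies
    shared = set(a) & set(b)
    if not shared:
        return 0.0
    dot = sum(a[m] * b[m] for m in shared)
    na = sum(v * v for v in a.values()) ** 0.5
    nb = sum(v * v for v in b.values()) ** 0.5
    return dot / (na * nb)

def predict(user, movie):
    # similarity-weighted average of other users' ratings for this movie
    num = den = 0.0
    for other, seen in ratings.items():
        if other == user or movie not in seen:
            continue
        s = cosine_sim(ratings[user], seen)
        num += s * seen[movie]
        den += s
    return num / den if den else None

print(round(predict("bob", "jaws"), 2))
```

Bob's tastes track Ann's far more than Cat's, so his predicted rating for "jaws" lands closer to Ann's 4 than Cat's 2.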

&lt;h4&gt;
  
  
  Implicit Ratings
&lt;/h4&gt;

&lt;p&gt;Implicit ratings use other data besides a precise rating to determine satisfaction. Let's take our movie rating example and apply it to a typical evening with Netflix. Netflix doesn't ask us to rate a movie explicitly, but they do have data on WHAT movies we watched previously and, at least, the number of minutes we viewed. If I watch an entire movie, the implication is that I enjoyed it. But it's not certain, as I was never asked explicitly. It's helpful to think of implicit ratings as a confidence metric as opposed to something certain. Perhaps someone watches something while they're scrolling or doing work. They may finish a TV episode or movie, but did they really like it? It's hard to know explicitly. On the other hand, if someone has watched every episode of the Sopranos, start to finish, I have high confidence they enjoyed it. The advantage of implicit ratings is that the data collection is far simpler, only tracking a user's behavior history. It doesn't erode the user's experience with frustrating surveys disrupting their escapism. &lt;/p&gt;
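One hedged way to picture the "confidence, not rating" idea (a toy heuristic of my own, not from any library): turn raw watch behavior into a completion fraction and treat that as confidence the user liked the title.

```python
# (minutes_watched, runtime_minutes) per title for one user - made-up data
history = {
    "sopranos_s1e1": (55, 55),
    "some_movie": (12, 110),
    "other_movie": (95, 100),
}

def confidence(watched, runtime):
    """Fraction completed, treated as confidence the user liked it."""
    return min(watched / runtime, 1.0)

liked = {title: round(confidence(w, r), 2) for title, (w, r) in history.items()}
print(liked)
```

Finishing an episode gives confidence 1.0; bailing twelve minutes into a movie gives roughly 0.11. Neither is a rating - both are evidence.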

&lt;h2&gt;
  
  
  Content Filtering
&lt;/h2&gt;

&lt;p&gt;The other limitation is content filtering - the module has no built-in capabilities for this. But what is it? Content filtering relies on meta-data to tell you about the product. It only needs to know one thing you've watched or enjoyed, and then it can recommend something very similar. It's different from collaborative filtering because it doesn't rely on multiple users, their user histories, and multiple products - just the last thing you watched and a product with similar content meta-data.&lt;/p&gt;

&lt;p&gt;Let's stick with our movie example. A title alone may not tell you much about the movie, but the year it was made, the genre, the actors, or some keyword descriptions can go a long way. This is the meta-data that describes the film. Think about a "hilarious", "Will Ferrell", "comedy" movie that perhaps you've just watched. I can recommend at least five others that you would probably also watch just off the strength of those keywords. Now... you may be all Will Ferrell'd out for the evening, but you might keep it in mind next time.&lt;/p&gt;

&lt;p&gt;It's the epitome of "Because you watched X, you might like Y." It's helpful for "cold start" problems because it needs very little, if any, user history. You just match the user with the product most similar to the one they just experienced. The downside is that it doesn't factor in dissimilar products that you might like. We all like variety in our lives, even if we have consistent taste. The other weakness is that it depends entirely on the quality and trustworthiness of the meta-data. Was that meta-data generated from a single user, or did it come from many users or some larger database? The Will Ferrell example is easy, but sometimes it's just a "period" "comedy"/"drama" starring "Elle Fanning" entitled "The Great". This is a highly rated series available on streaming platforms, and hopefully the metadata contains a reference to "Catherine the Great", or it might miss the Russophile market segment.&lt;/p&gt;
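As a toy sketch of content filtering (the catalog and tags below are made up): score every other item by the overlap of its keyword tags with the item just watched, using Jaccard similarity.

```python
# made-up metadata: each title maps to a set of keyword tags
catalog = {
    "anchorman": {"comedy", "will ferrell", "hilarious"},
    "elf": {"comedy", "will ferrell", "holiday"},
    "the great": {"period", "comedy", "drama", "elle fanning"},
    "heat": {"crime", "drama", "thriller"},
}

def jaccard(a, b):
    # size of the tag overlap relative to the combined tag set
    return len(a & b) / len(a | b)

def recommend(just_watched, n=2):
    scores = {title: jaccard(catalog[just_watched], tags)
              for title, tags in catalog.items() if title != just_watched}
    return sorted(scores, key=scores.get, reverse=True)[:n]

print(recommend("anchorman"))
```

Note there's no user history at all here - one watched title and the metadata are enough, which is exactly why content filtering handles cold starts.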

&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;p&gt;The surprise module is a very simple, streamlined, plug-n-play method for collaborative filtering of explicit rating systems. It's part of the Python scikit family, so it integrates nicely with the data science Python environment. It handles datasets and hyperparameter tuning easily, with a variety of built-in algorithms to help modeling, as well as functionality to build your own algorithms. It's well suited for explicit ratings - things like movies, books, or music, where many, many people have definitive reactions to a shared experience/product. It's too simple, actually. If you're a student needing to learn, or you need a recommendation system besides collaborative filtering with explicit ratings, then I might try something else.&lt;/p&gt;

&lt;h3&gt;
  
  
  SOURCES
&lt;/h3&gt;

&lt;p&gt;Surprise Module &lt;a href="https://surpriselib.com/"&gt;https://surpriselib.com/&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Surprised Kid &lt;a href="https://dev-to-uploads.s3.amazonaws.com/uploads/articles/kqpgywefh202v6xej29w.png"&gt;https://dev-to-uploads.s3.amazonaws.com/uploads/articles/kqpgywefh202v6xej29w.png&lt;/a&gt;&lt;/p&gt;

</description>
      <category>recommendationsystems</category>
      <category>datascience</category>
      <category>machinelearning</category>
      <category>collaborativefiltering</category>
    </item>
    <item>
      <title>Seeding, Reproducibility, and other Random Thoughts on the Random Module</title>
      <dc:creator>bennettandrewm</dc:creator>
      <pubDate>Fri, 26 Jan 2024 16:04:49 +0000</pubDate>
      <link>https://dev.to/bennettandrewm/seeding-reproducibility-and-other-random-thoughts-on-the-random-module-11fc</link>
      <guid>https://dev.to/bennettandrewm/seeding-reproducibility-and-other-random-thoughts-on-the-random-module-11fc</guid>
      <description>&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flhbmlz3pitjm1hv1dv50.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flhbmlz3pitjm1hv1dv50.jpg" alt="Turtle_stack" width="750" height="750"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Random Module
&lt;/h2&gt;

&lt;p&gt;When studying data science and machine learning, the &lt;code&gt;random()&lt;/code&gt; function in Python is vital. Whether you're developing code, experimenting with data visualizations, or just navel-gazing as a data nerd, it's critical to use and understand.&lt;/p&gt;

&lt;p&gt;But what is it? How do you use it? And why does the number 42 always come up? This article will dive into random bits (no pun intended) to know about the Python random number generation (rng).&lt;/p&gt;

&lt;h2&gt;
  
  
  Random or... not so Random
&lt;/h2&gt;

&lt;p&gt;Let's define informally what we mean by random. In Python, the random number generator &lt;a href="https://docs.python.org/3/library/random.html"&gt;creates &lt;em&gt;pseudo&lt;/em&gt; random numbers&lt;/a&gt;, meaning from an algorithm. It uses the system time, with &lt;a href="https://www.sciencedirect.com/topics/computer-science/mersenne-twister"&gt;additional math on top&lt;/a&gt;, to generate these numbers. It's deterministic, so not perfectly random.  But as Larry David would say, they're "pretty, pretty, pretty good." &lt;/p&gt;
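You can see the determinism directly: seed the generator with the same value twice and you get the same "random" numbers back.

```python
import random

random.seed(42)
first_run = [random.random() for _ in range(3)]

random.seed(42)   # re-seed with the same value
second_run = [random.random() for _ in range(3)]

print(first_run == second_run)  # True - same seed, same sequence
```

This is the property the next section leans on for reproducibility.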

&lt;h2&gt;
  
  
  Seeding and Reproducibility
&lt;/h2&gt;

&lt;h4&gt;
  
  
  Import
&lt;/h4&gt;

&lt;p&gt;When using the random function, remember to import the module into python... duh&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import random
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Random.Seed
&lt;/h4&gt;

&lt;p&gt;This function makes your random number reproducible. What does that mean? Every time you call for a random number without it, Python will generate a different number than the previous occasion. Meaning, that random number is unique to that instantaneous request. Sometimes though, you want the SAME random number each time (reproducibility). If you're running the same code over and over for debugging/development/whatever, you want to verify that you're getting the CORRECT result, say, 42. &lt;/p&gt;

&lt;p&gt;This is where &lt;code&gt;.seed&lt;/code&gt; comes in. You're planting a seed, so to speak, so that every time you generate a random number, it's NOT unique to that compiling instant.&lt;/p&gt;

&lt;p&gt;The code is simple:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;random.seed(42)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now we will get reproducibility in our data. Let's move on to generating actual data.&lt;/p&gt;

&lt;h2&gt;
  
  
  Generating Data
&lt;/h2&gt;

&lt;p&gt;Let's give common examples of code to get a number or a sequence of numbers or elements.&lt;/p&gt;

&lt;h3&gt;
  
  
  Generating a Number (Ints or Floats)
&lt;/h3&gt;

&lt;h4&gt;
  
  
  random.randint (a,b)
&lt;/h4&gt;

&lt;p&gt;Returns a random integer between a and b, inclusively. If I send the arguments (4,9), it returns an integer from 4 to 9.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;gt;&amp;gt;&amp;gt; random.randint(4,9)
9
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  random.randrange(start, stop, step)
&lt;/h4&gt;

&lt;p&gt;It will return a random integer between start and stop (exclusive of stop), accounting for the step - here, either 2 or 7.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;gt;&amp;gt;&amp;gt;random.randrange(2,12,5)
7
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  random.random ()
&lt;/h4&gt;

&lt;p&gt;This generates a random float between 0.0 and 1.0.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;gt;&amp;gt;&amp;gt; random.random()
0.11133106816568039
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  random.uniform (a,b)
&lt;/h4&gt;

&lt;p&gt;This generates a random float between the numbers you send it.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;gt;&amp;gt;&amp;gt; random.uniform(3, 6)
5.224651499279499
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Please note from the library "The end-point value b may or may not be included in the range depending on floating-point rounding in the equation a + (b-a) * random()."&lt;/p&gt;

&lt;h4&gt;
  
  
  random.choice(&lt;em&gt;seq&lt;/em&gt;)
&lt;/h4&gt;

&lt;p&gt;This is exactly how it sounds - you're getting a random element from a sequence that you provide. (It's an illusion, of course - pseudo-random, not truly random.) A sequence could be a list, a tuple, anything. Let's see an example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;gt;&amp;gt;&amp;gt; #tummy_prob is a sequence of seven numbers representing the
&amp;gt;&amp;gt;&amp;gt; # probability I will have tummy trouble on a given day of the week
&amp;gt;&amp;gt;&amp;gt; tummy_prob = [0.24, 0.35, .01, .05, .81, 0.36, .06]
&amp;gt;&amp;gt;&amp;gt; random.choice(tummy_prob)
0.35
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Yikes! I'm staying home today...&lt;/p&gt;

&lt;h3&gt;
  
  
  Working with Many Elements
&lt;/h3&gt;

&lt;h4&gt;
  
  
  random.shuffle(x)
&lt;/h4&gt;

&lt;p&gt;It will randomly shuffle, in place, a sequence that you send it. You send it x; it rearranges x into a different order (and returns None). Let's try a safer example with a deck of 5 cards.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;gt;&amp;gt;&amp;gt;cards = [3,5,8,7,9]
&amp;gt;&amp;gt;&amp;gt;random.shuffle(cards)
&amp;gt;&amp;gt;&amp;gt;cards
[3, 7, 8, 5, 9]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  random.sample(population, k)
&lt;/h4&gt;

&lt;p&gt;Returns a list of k unique elements. Used for random sampling without replacement. You send it a population - it returns a list.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;gt;&amp;gt;&amp;gt;people_heights = [5.25, 6.0, 6.2, 5.75, 5.5, 5.9]
&amp;gt;&amp;gt;&amp;gt;random.sample(people_heights, 2)
[6.0, 5.75]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  random.choices(&lt;em&gt;population&lt;/em&gt;)
&lt;/h4&gt;

&lt;p&gt;This returns a list of elements chosen from a population, sampled with replacement (one element by default).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;gt;&amp;gt;&amp;gt;people_heights = [5.25, 6.0, 6.2, 5.75, 5.5, 5.9]
&amp;gt;&amp;gt;&amp;gt;random.choices(people_heights)
[6.2]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Notice how it returned a list - with the &lt;code&gt;k&lt;/code&gt; argument, it can return multiple elements.&lt;/p&gt;
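Pass `k` to pull several elements at once. Unlike `random.sample`, `random.choices` samples with replacement, so the same element can appear more than once.

```python
import random

people_heights = [5.25, 6.0, 6.2, 5.75, 5.5, 5.9]

# draw four elements with replacement - duplicates are possible
picks = random.choices(people_heights, k=4)
print(len(picks))  # always 4
print(all(p in people_heights for p in picks))  # True
```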

&lt;h3&gt;
  
  
  Other Random Notes
&lt;/h3&gt;

&lt;p&gt;If you've made it to the end, you're obviously a dedicated, patient reader who is ever so curious about the...&lt;/p&gt;

&lt;h4&gt;
  
  
  Number 42
&lt;/h4&gt;

&lt;p&gt;It's from &lt;em&gt;The Hitchhiker's Guide to the Galaxy&lt;/em&gt; by Douglas Adams. At the end of the book, the computer Deep Thought, when asked for the "Answer to the Ultimate Question of Life, the Universe, and Everything," responds with 42. &lt;/p&gt;

&lt;p&gt;I hope this helps. That's all for now.&lt;/p&gt;

&lt;h2&gt;
  
  
  SOURCES
&lt;/h2&gt;

&lt;p&gt;Python Library &lt;a href="https://docs.python.org/3/library/random.html"&gt;https://docs.python.org/3/library/random.html&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Turtle Image &lt;a href="https://dev-to-uploads.s3.amazonaws.com/uploads/articles/lhbmlz3pitjm1hv1dv50.jpg"&gt;https://dev-to-uploads.s3.amazonaws.com/uploads/articles/lhbmlz3pitjm1hv1dv50.jpg&lt;/a&gt;&lt;/p&gt;

</description>
      <category>datascience</category>
      <category>random</category>
      <category>python</category>
      <category>beginners</category>
    </item>
    <item>
      <title>Learning how the Machines Learn: An Overview of Statistical Bases</title>
      <dc:creator>bennettandrewm</dc:creator>
      <pubDate>Thu, 31 Aug 2023 16:00:34 +0000</pubDate>
      <link>https://dev.to/bennettandrewm/false-myths-of-false-positives-3kmd</link>
      <guid>https://dev.to/bennettandrewm/false-myths-of-false-positives-3kmd</guid>
      <description>&lt;h2&gt;
  
  
  Overview
&lt;/h2&gt;

&lt;p&gt;To understand the basics of machine learning, it's important to grasp the foundational concepts. This post discusses inferential vs predictive statistics and regression vs classification. It also reviews 6 algorithms foundational to Machine Learning: Linear Regression, Logistic Regression, K Nearest Neighbors, Naive Bayes, Decision Trees, and Support Vector Machines. We'll also do a quick overview of popular loss functions for these algorithms with a brief explanation.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why this is Important?
&lt;/h2&gt;

&lt;p&gt;The real fun of machine learning comes from implementing neural networks and deep learning. Before we can walk there, we must crawl (sorry). These 6 algorithms represent the real basics of machine learning, from which more complex systems form. Once we get here, we can start using statistics to predict and generate content. Predict? Yes... that's correct. What, you thought statistics were just for inferences? Well, it can be, but let's discuss the difference.&lt;/p&gt;

&lt;h2&gt;
  
  
  Inferential vs Predictive
&lt;/h2&gt;

&lt;p&gt;Inferential statistics focuses on the relationships &lt;em&gt;between&lt;/em&gt; variables, establishing causal links between independent and dependent variables. Prediction, while not ignoring causality, focuses on the accuracy with which you can predict a certain outcome. To illustrate this difference, let's use climate. &lt;/p&gt;

&lt;h3&gt;
  
  
  Inferential Statistics - Example
&lt;/h3&gt;

&lt;p&gt;There's consensus that the temperature of the earth is warming, but debate about exactly what's causing it. And, for the sake of the discussion, let's assume we're experts in the domain. If we wanted to understand causation, we would apply inferential principles, gathering data such as tree cover, greenhouse gas emissions, etc. as our independent variables, and gathering some global air temperature data as our dependent variable. We then run some analysis, perhaps a linear regression, and determine which variables have the greatest weight (effect) on that temperature metric. As long as we were cognizant of correlation risks, our results would indicate which variable has the strongest link to global temperatures.&lt;/p&gt;

&lt;h3&gt;
  
  
  Inferential Statistical Metrics
&lt;/h3&gt;

&lt;p&gt;With inferential, we might focus on p-values that could rule out a null hypothesis, perhaps considering R-squared (for accuracy) on certain models. We won't get into details here, but a small enough p-value could statistically rule out the opposite case of what we're trying to prove, which is ultimately the goal in establishing causation.&lt;/p&gt;

&lt;h3&gt;
  
  
  Predictive Statistics - Example
&lt;/h3&gt;

&lt;p&gt;Now, returning to our climate dilemma, let's think about predictions. Can we try to predict what the weather will be tomorrow? Well, yes in fact, we can. And meteorologists do it every day, multiple times a day. Do we care how they got to their conclusion? Maybe. But we really care how accurate they are. Perhaps that's why you hear, "AccuWeather" forecast as a brand name for the technology.&lt;/p&gt;

&lt;h3&gt;
  
  
  Predictive Statistics Metrics
&lt;/h3&gt;

&lt;p&gt;On the predictive side, we focus on things like the confusion matrix, which considers false positives, false negatives, true positives, and true negatives. And from here we dive right into accuracy, which is a measure of correct predictions (the sum of true positives and true negatives) against all observations (the sum of observed positives and negatives). This leads us to measure how "far off" our predicted values are from our observed values. Error, in other words.   &lt;/p&gt;
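A quick sketch of that arithmetic (the counts are made up): accuracy is simply correct predictions over all observations.

```python
# toy confusion-matrix counts: true/false positives and negatives
tp, tn, fp, fn = 40, 45, 5, 10

# correct predictions over all observations
accuracy = (tp + tn) / (tp + tn + fp + fn)
print(accuracy)  # 0.85
```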

&lt;h2&gt;
  
  
  Classifier vs Regression
&lt;/h2&gt;

&lt;p&gt;Now that we have reviewed some of the statistical foundations of Machine Learning, we can focus on predictive analytics. Let's do a quick reminder of some differences between regression and classifier method, and then we'll dig into some algorithms.&lt;/p&gt;

&lt;h3&gt;
  
  
  Regression
&lt;/h3&gt;

&lt;p&gt;With a regression, the goal is to reduce all of the complexities of your data set to a simpler, underlying relationship. We know it won't be perfect, but hopefully it's close. We can think of it as trying to UNIFY the data. &lt;/p&gt;

&lt;h3&gt;
  
  
  Classification
&lt;/h3&gt;

&lt;p&gt;With classification, we SEPARATE the data by making clear distinctions. We look at a big mass of info and start divvying it up.&lt;/p&gt;

&lt;h3&gt;
  
  
  Loss Function
&lt;/h3&gt;

&lt;p&gt;Circling back to error, it's a good time to delve into the idea of loss functions. This is critical to understanding how these programs perform optimization. Mean Squared Error (MSE) is a popular choice. When we use it as a loss function, we're constantly iterating our main algorithm to try and minimize the MSE. This is done by gradient descent: analyzing how the MSE changes with respect to the parameters and adjusting those parameters in the direction that reduces the loss. This is a mouthful, but ultimately loss functions represent an inaccuracy in our model that we're trying to reduce.&lt;/p&gt;
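Here's that loop in miniature (pure Python, with my own toy data): fit a single slope by nudging it against the gradient of the MSE each iteration.

```python
# toy data generated by y = 3x; we try to recover the slope
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 6.0, 9.0, 12.0]

def mse(w):
    return sum((y - w * x) ** 2 for x, y in zip(xs, ys)) / len(xs)

w, lr = 0.0, 0.01
for _ in range(500):
    # gradient of the MSE with respect to w
    grad = sum(-2 * x * (y - w * x) for x, y in zip(xs, ys)) / len(xs)
    w -= lr * grad  # step against the gradient
print(round(w, 2))  # 3.0 - the slope the data was generated with
```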

&lt;h2&gt;
  
  
  Algorithms
&lt;/h2&gt;

&lt;p&gt;So, let's look at six algorithms below with help from a useful blog post (and subsequent diagram) called &lt;a href="https://www.blog.dailydoseofds.com/p/an-algorithm-wise-summary-of-loss?utm_source=post-email-title&amp;amp;publication_id=1119889&amp;amp;post_id=137547091&amp;amp;utm_campaign=email-post-title&amp;amp;isFreemail=true&amp;amp;r=2ce3uv&amp;amp;triedRedirect=true"&gt;Daily Dose of Data Science&lt;/a&gt;. &lt;/p&gt;

&lt;h3&gt;
  
  
  Regression
&lt;/h3&gt;

&lt;p&gt;The below chart shows ML Algorithms and Loss Functions&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc9e8s2tzkzcu2adyt9a2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc9e8s2tzkzcu2adyt9a2.png" alt="Regression" width="584" height="108"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;It's important to remember that some algorithms can be implemented as either regression or classification.&lt;/p&gt;

&lt;h4&gt;
  
  
  1. Linear Regression
&lt;/h4&gt;

&lt;p&gt;Attempts to find a unifying expression for predicting a continuous (non-discrete) variable. MSE (or RMSE) is the accuracy metric used as the loss function that drives optimization.&lt;/p&gt;

&lt;h4&gt;
  
  
  2. Logistic Regression
&lt;/h4&gt;

&lt;p&gt;Attempts to find a unifying expression for binary classification. Cross-entropy loss measures how far the predicted probabilities are from the true binary labels. &lt;/p&gt;
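&lt;p&gt;As a rough illustration (not the optimizer itself), binary cross-entropy can be computed by hand to see why confident wrong answers cost more. The labels and probabilities here are made up:&lt;/p&gt;

```python
import math

def cross_entropy(y_true, p_pred):
    """Average binary cross-entropy; confident wrong predictions are punished hard."""
    return -sum(y * math.log(p) + (1 - y) * math.log(1 - p)
                for y, p in zip(y_true, p_pred)) / len(y_true)

good = cross_entropy([1, 0, 1], [0.9, 0.1, 0.8])  # confident and right: low loss
bad = cross_entropy([1, 0, 1], [0.2, 0.8, 0.3])   # confident and wrong: high loss
print(round(good, 3), round(bad, 3))
```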

&lt;h3&gt;
  
  
  Classification
&lt;/h3&gt;

&lt;p&gt;Again, we can see the following chart for classification.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3tbw3e12ypgftgprjhpw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3tbw3e12ypgftgprjhpw.png" alt="Classification" width="582" height="181"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  3. Decision Tree
&lt;/h4&gt;

&lt;p&gt;Decision Tree creates a series, or path, of splits (into two groups each time) on the values of a single variable. Ideal for binary classification, the algorithm creates each split almost like a rule that tries to group a certain range of values with certain outcomes. Information gain measures how successful each split is.&lt;/p&gt;
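&lt;p&gt;A quick sketch of information gain, computed as the entropy reduction from one split. The label lists are made up for illustration:&lt;/p&gt;

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a label list, in bits."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(parent, left, right):
    """Entropy reduction from splitting parent into left/right branches."""
    n = len(parent)
    children = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(parent) - children

parent = ["yes", "yes", "no", "no"]
# A perfect split sends all "yes" one way and all "no" the other
print(information_gain(parent, ["yes", "yes"], ["no", "no"]))  # 1.0
```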

&lt;h4&gt;
  
  
  4. Support Vector Machines
&lt;/h4&gt;

&lt;p&gt;Tries to subdivide the data with a linear boundary. Hinge loss tells us how thick this "split" (the margin) is in our model, and the thicker it is the better. If this sounds like a vague explanation of hinge loss, well, it is. This &lt;a href="https://towardsdatascience.com/a-definitive-explanation-to-hinge-loss-for-support-vector-machines-ab6d8d3178f1"&gt;article&lt;/a&gt; goes into better detail.&lt;/p&gt;
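&lt;p&gt;Hinge loss itself is simple to compute. In this tiny sketch (the scores are made up), a point costs nothing once it sits on the correct side by at least the margin:&lt;/p&gt;

```python
def hinge_loss(y, score):
    """y is the true label in {-1, +1}; score is the model's signed output.
    Loss is zero once the point clears the margin on the correct side."""
    return max(0.0, 1 - y * score)

print(hinge_loss(+1, 2.5))   # 0.0: correct and outside the margin
print(hinge_loss(+1, 0.4))   # 0.6: correct but inside the margin
print(hinge_loss(-1, 0.4))   # 1.4: wrong side entirely
```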

&lt;h4&gt;
  
  
  5. K Nearest Neighbors
&lt;/h4&gt;

&lt;p&gt;This algorithm locates a data point in feature space and examines the data points around it. It reports a vote of the most likely classification based on the K nearest neighbors of the point you are trying to predict. Essentially, this algorithm is "lazy" and there's no loss function: you give it an input you're looking to predict and it reports a vote. There's no optimization effort.&lt;/p&gt;
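&lt;p&gt;A minimal "lazy" KNN vote might look like this sketch, using 1-D toy data invented for illustration:&lt;/p&gt;

```python
from collections import Counter

def knn_predict(train, query, k=3):
    """Vote among the k training points closest to the query; no training step."""
    neighbors = sorted(train, key=lambda pair: abs(pair[0] - query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# Hypothetical data: (feature value, class label) pairs
train = [(1, "A"), (2, "A"), (3, "A"), (10, "B"), (11, "B"), (12, "B")]
print(knn_predict(train, 2.5))   # "A": all three nearest neighbors are class A
```

&lt;p&gt;Note there's nothing to fit and nothing to minimize; all the work happens at prediction time.&lt;/p&gt;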

&lt;h4&gt;
  
  
  6. Naive Bayes
&lt;/h4&gt;

&lt;p&gt;This algorithm follows from Bayes' theorem, which determines the probability of a classification given certain features. Unlike Decision Tree, the order of the features can vary; each feature's outcome is (naively) treated as independent, so once you know one variable's outcome, you can use it to update the prediction without worrying about the others. There's not much to optimize per se; you just iterate through each variable to determine its effect on the classification.&lt;/p&gt;
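&lt;p&gt;A toy counting-based sketch of the idea (the spam/ham data and feature names are invented): score each class by its prior times the product of per-feature likelihoods, then pick the winner.&lt;/p&gt;

```python
# Hypothetical toy data: two binary features, two classes
data = [
    ({"link": 1, "caps": 1}, "spam"),
    ({"link": 1, "caps": 0}, "spam"),
    ({"link": 0, "caps": 1}, "ham"),
    ({"link": 0, "caps": 0}, "ham"),
]

def naive_bayes(features):
    """Score each class by P(class) * product of P(feature value | class)."""
    labels = [label for _, label in data]
    scores = {}
    for cls in set(labels):
        rows = [feats for feats, label in data if label == cls]
        score = labels.count(cls) / len(labels)      # the prior
        for name, value in features.items():         # naive independence assumption
            score *= sum(1 for r in rows if r[name] == value) / len(rows)
        scores[cls] = score
    return max(scores, key=scores.get)

print(naive_bayes({"link": 1, "caps": 1}))  # spam
```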

&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;p&gt;This blog post summarizes foundational elements of Machine Learning. It discusses inferential vs. predictive statistics and classification vs. regression, then jumps into popular algorithms. We reviewed loss functions, and now you should be ready to jump into neural networks and deep learning.&lt;/p&gt;

&lt;h2&gt;
  
  
  Sources
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;An Algorithm-wise Summary of Loss Functions in Machine Learning Loss functions of 16 ML algorithms in a single frame&lt;/em&gt;, Avi Chawla, Sept 30, 2023.&lt;br&gt;
&lt;a href="https://www.blog.dailydoseofds.com/p/an-algorithm-wise-summary-of-loss?utm_source=post-email-title&amp;amp;publication_id=1119889&amp;amp;post_id=137547091&amp;amp;utm_campaign=email-post-title&amp;amp;isFreemail=true&amp;amp;r=2ce3uv&amp;amp;triedRedirect=true"&gt;https://www.blog.dailydoseofds.com/p/an-algorithm-wise-summary-of-loss?utm_source=post-email-title&amp;amp;publication_id=1119889&amp;amp;post_id=137547091&amp;amp;utm_campaign=email-post-title&amp;amp;isFreemail=true&amp;amp;r=2ce3uv&amp;amp;triedRedirect=true&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;A definitive explanation to the Hinge Loss for Support Vector Machines&lt;/em&gt; Vagif Aliyev, Towards Data Science, Nov 23, 2020&lt;br&gt;
&lt;a href="https://towardsdatascience.com/a-definitive-explanation-to-hinge-loss-for-support-vector-machines-ab6d8d3178f1"&gt;https://towardsdatascience.com/a-definitive-explanation-to-hinge-loss-for-support-vector-machines-ab6d8d3178f1&lt;/a&gt;**&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>beginners</category>
      <category>classification</category>
      <category>lossfunction</category>
    </item>
    <item>
      <title>Axis Headaches? Examples for Formatting Tick Labels (Matplotlib)</title>
      <dc:creator>bennettandrewm</dc:creator>
      <pubDate>Tue, 18 Jul 2023 21:19:51 +0000</pubDate>
      <link>https://dev.to/bennettandrewm/axis-headaches-examples-for-formatting-tick-labels-matplotlib-4o6j</link>
      <guid>https://dev.to/bennettandrewm/axis-headaches-examples-for-formatting-tick-labels-matplotlib-4o6j</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;If you're like me, you go to bed around the 40th tweak to your graph, bleary eyed and beaten. Formatting tick labels in particular is incredibly frustrating, especially when Offset Notation ruins an axis.&lt;/p&gt;

&lt;p&gt;To make things easier, I've laid out some simple formatting examples for an uncooperative axis using the &lt;code&gt;set_major_formatter&lt;/code&gt; feature. This is not a comprehensive list of all formatting options, but it is simple and effective for some of the more obvious cases. &lt;/p&gt;

&lt;h2&gt;
  
  
  Purpose
&lt;/h2&gt;

&lt;p&gt;This is useful because there are many ways to display numbers. Often, it's money ($x,000 for instance), but could include percentages, engineering notation, logarithmic scales, decimals, dates, or countless others. Getting your graph to tell a story is vital for any visualization. Hopefully this will save you some time.&lt;/p&gt;

&lt;h2&gt;
  
  
  Background
&lt;/h2&gt;

&lt;p&gt;If you want to scroll down to the examples below, feel free to skip this section. For those who want more, here's a little background on &lt;code&gt;axis&lt;/code&gt; and the &lt;code&gt;set_major_formatter&lt;/code&gt; function.&lt;/p&gt;

&lt;h3&gt;
  
  
  Axis Class
&lt;/h3&gt;

&lt;p&gt;To format our axis, we're going to work in the &lt;code&gt;axis&lt;/code&gt; class of matplotlib. &lt;code&gt;axis&lt;/code&gt; class, you ask? What's the difference between &lt;code&gt;axis&lt;/code&gt; and &lt;code&gt;axes&lt;/code&gt;? The &lt;code&gt;axes&lt;/code&gt; essentially contains everything in the plot, whereas &lt;code&gt;axis&lt;/code&gt; just pertains to the y or x-axis ticks themselves, especially tick location and formatting. To access the x axis, you would type &lt;code&gt;ax.xaxis&lt;/code&gt;. For the y axis, &lt;code&gt;ax.yaxis&lt;/code&gt;. See the informal diagram below, courtesy of &lt;a href="https://matplotlib.org/stable/gallery/showcase/anatomy.html"&gt;Matplotlib&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--FNvytErD--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/8secr1d6wrtcw90k9mjq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--FNvytErD--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/8secr1d6wrtcw90k9mjq.png" alt="Image description" width="423" height="423"&gt;&lt;/a&gt;&lt;/p&gt;
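&lt;p&gt;A quick way to see the distinction in code (using the off-screen Agg backend so it runs without a display):&lt;/p&gt;

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; no display needed
import matplotlib.pyplot as plt
from matplotlib.axis import XAxis, YAxis

fig, ax = plt.subplots()  # ax is an Axes: the whole plotting region
# ax.xaxis and ax.yaxis are Axis objects: just the ticks, labels, and spine
print(isinstance(ax.xaxis, XAxis), isinstance(ax.yaxis, YAxis))  # True True
```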

&lt;h3&gt;
  
  
  Formatter
&lt;/h3&gt;

&lt;p&gt;Within the Axis class, two common objects that pertain to tick display are &lt;code&gt;Locator&lt;/code&gt; and &lt;code&gt;Formatter&lt;/code&gt;. I'm going to utilize a &lt;code&gt;formatter&lt;/code&gt;, specifically, &lt;code&gt;set_major_formatter&lt;/code&gt;. This &lt;code&gt;formatter&lt;/code&gt; accepts either a str, a function, or a pre-built &lt;code&gt;formatter&lt;/code&gt; instance. We'll discuss each of these three options in the following sections.&lt;/p&gt;

&lt;h2&gt;
  
  
  Examples
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Tick Formatting - String
&lt;/h3&gt;

&lt;h4&gt;
  
  
  Intro
&lt;/h4&gt;

&lt;p&gt;Our first example will show how to format using a string argument. When doing this, we use typical &lt;code&gt;str.format()&lt;/code&gt; syntax with &lt;code&gt;set_major_formatter&lt;/code&gt;, passing an &lt;code&gt;x&lt;/code&gt; followed by a colon inside the &lt;code&gt;{}&lt;/code&gt;. Let's see how a typical line would look.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--puE2FkZ5--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/uvhqwxjzwq3axzb0u0w4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--puE2FkZ5--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/uvhqwxjzwq3axzb0u0w4.png" alt="Image description" width="410" height="230"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;So how does it work? To highlight the different formatting options, I'll use a simple graph as a template.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--SCT3ftfu--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/r85wcs522zjmqgzj0ccc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--SCT3ftfu--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/r85wcs522zjmqgzj0ccc.png" alt="Image description" width="411" height="279"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;With this Source Code underneath:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import pandas as pd
import matplotlib.pyplot as plt

df_sample = pd.DataFrame([400, 200, 200, 800, 100, 0, 1200, 700, 800, 700, 200, 400],
            [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12])

ax = df_sample.plot.bar()

ax.set_title('Sample')
ax.set_xlabel('X-Axis')
ax.set_ylabel('Y-Axis')
ax.legend().remove()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I gave the plot a title and label for the x and y axes but no other major formatting.&lt;/p&gt;

&lt;p&gt;Now let's see how we can pretty this up!&lt;/p&gt;

&lt;h4&gt;
  
  
  Sample 1: Dollars
&lt;/h4&gt;

&lt;p&gt;So, let's say the y-axis represents money and I want to show it in dollars. Specifically, I'd like to add a '$' and commas as thousands separators. Let's see what happens when I add one line of code:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;ax.yaxis.set_major_formatter('${x:,.0f}')&lt;/code&gt;  &lt;/p&gt;

&lt;p&gt;This is typical &lt;code&gt;str.format()&lt;/code&gt; syntax, with the &lt;code&gt;,&lt;/code&gt; used as the thousands separator and &lt;code&gt;.0f&lt;/code&gt; signifying the number of decimal places (0, in this case).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--E4WeBRCF--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/al8ubq8wwskthcmw86hf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--E4WeBRCF--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/al8ubq8wwskthcmw86hf.png" alt="Image description" width="411" height="279"&gt;&lt;/a&gt;&lt;/p&gt;
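&lt;p&gt;For reference, here's Sample 1 as a self-contained sketch. It builds the bars with matplotlib directly instead of pandas, a small variation on the setup above, and uses the off-screen Agg backend so it runs anywhere:&lt;/p&gt;

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; no display needed
import matplotlib.pyplot as plt

values = [400, 200, 200, 800, 100, 0, 1200, 700, 800, 700, 200, 400]

fig, ax = plt.subplots()
ax.bar(range(1, 13), values)
ax.set_title('Sample')
ax.set_xlabel('X-Axis')
ax.set_ylabel('Y-Axis')
ax.yaxis.set_major_formatter('${x:,.0f}')  # the one formatting line

fig.canvas.draw()  # force the tick labels to be generated
labels = [t.get_text() for t in ax.yaxis.get_ticklabels()]
print(labels[:3])  # every y tick now starts with '$'
```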

&lt;h4&gt;
  
  
  Sample 2: Percentages
&lt;/h4&gt;

&lt;p&gt;What about percentages? Let's show the y-axis with a percentage sign and 2 decimal places. I can use our familiar syntax but with a tweak:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;ax.yaxis.set_major_formatter('%{x:,.2f}')&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--sEGhqirj--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ilk2oo2ptsfysippscbm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--sEGhqirj--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ilk2oo2ptsfysippscbm.png" alt="Image description" width="431" height="279"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Sample 3: Additional Text
&lt;/h4&gt;

&lt;p&gt;How does it look with a longer string? Let's have a little fun.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;ax.yaxis.set_major_formatter('Total Days {x:,.0f} until the Apocalypse')&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--X4vOJteO--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/f9mbp43po2ohzler9gqy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--X4vOJteO--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/f9mbp43po2ohzler9gqy.png" alt="Image description" width="571" height="276"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This is bad form, but you can see how the plot adjusts to show the full sentence, with commas as thousands separators.  &lt;/p&gt;

&lt;h3&gt;
  
  
  Tick Formatting - Pre-Built Formatter Instance
&lt;/h3&gt;

&lt;h4&gt;
  
  
  Intro
&lt;/h4&gt;

&lt;p&gt;Besides string formatting, another option is the pre-built formatter instances that matplotlib provides, which can be used for more specific number representations. A few common options deal with dates, engineering notation, etc. We'll explore a few here. &lt;/p&gt;

&lt;h4&gt;
  
  
  Sample 4a: DateFormatter (Manual)
&lt;/h4&gt;

&lt;p&gt;Let's use the same dataset we used above and see how &lt;code&gt;DateFormatter&lt;/code&gt; looks.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--3Q2qx6k5--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/xstez3r7sow4k4rs0ov9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--3Q2qx6k5--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/xstez3r7sow4k4rs0ov9.png" alt="Image description" width="428" height="276"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As you can see, we got the numbers to appear as dates. How did we do that? We had to import &lt;code&gt;matplotlib.dates&lt;/code&gt; and then use &lt;code&gt;DateFormatter&lt;/code&gt;. I used the same dataset to easily see how the formatting works, but it's important to note how &lt;code&gt;int&lt;/code&gt; or &lt;code&gt;float&lt;/code&gt; types are converted to datetimes. In &lt;code&gt;DateFormatter&lt;/code&gt;, this is "done by converting date instances into days since an epoch (by default 1970-01-01T00:00:00)" (from Matplotlib.org). Why 1970? That's the Unix epoch, but I agree that everything before 1970 could have been created by a simulation.&lt;/p&gt;

&lt;p&gt;Let's look at the source code underneath to see what we added, starting with the first line, &lt;code&gt;import matplotlib.dates as mdates&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import matplotlib.dates as mdates
import pandas as pd

df_sample = pd.DataFrame([400, 200, 200, 800, 100, 0, 1200, 700, 800, 700, 200, 400],
            [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12])

ax = df_sample.plot.bar()

ax.set_title('Sample')
ax.set_xlabel('X-Axis')
ax.set_ylabel('Y-Axis')
ax.legend().remove()

ax.yaxis.set_major_formatter(mdates.DateFormatter("%Y-%b"))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Note the last line we used here. I used the &lt;code&gt;mdates.DateFormatter&lt;/code&gt; class to show the year and the 3-letter month using &lt;code&gt;"%Y-%b"&lt;/code&gt;. I could display the date a few other ways, like &lt;code&gt;"%m-%d-%y"&lt;/code&gt; (Month-Day-Year), for example. These are both considered manual ways to utilize &lt;code&gt;DateFormatter&lt;/code&gt;. &lt;/p&gt;

&lt;h4&gt;
  
  
  Sample 4b: Date (Automatic)
&lt;/h4&gt;

&lt;p&gt;For a more automated, layout-conscious format there's a class called &lt;code&gt;ConciseDateFormatter&lt;/code&gt;. This automatically configures the date in the most concise way, based on the plot. Unlike previous examples, it requires the locations of the ticks. For simplicity's sake, I'll reuse the existing tick locations via the locator instance returned by &lt;code&gt;get_major_locator&lt;/code&gt;. Let's see:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;ax.yaxis.set_major_formatter(mdates.ConciseDateFormatter(ax.yaxis.get_major_locator()))&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--pzpbhoyn--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/4fia19oura0xv7fj64b7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--pzpbhoyn--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/4fia19oura0xv7fj64b7.png" alt="Image description" width="392" height="276"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You can see the formatter decided to display just the month at the bottom, the days of the month along the y-axis, and the final year and month at the top. This doesn't always reveal the desired look but it's good to know if you're trying to save space on your figure.&lt;/p&gt;

&lt;h4&gt;
  
  
  Sample 5: Engineering Notation (Manual)
&lt;/h4&gt;

&lt;p&gt;When displaying scientific units, engineering notation is often the best way. To do this, we import &lt;code&gt;EngFormatter&lt;/code&gt;, then use &lt;code&gt;set_major_formatter&lt;/code&gt; while specifying our units. So, with our same original dataset and code, we simply add:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;from matplotlib.ticker import EngFormatter&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;and then &lt;/p&gt;

&lt;p&gt;&lt;code&gt;ax.yaxis.set_major_formatter(EngFormatter(unit='kg'))&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--AR6j_smJ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ux4rlhexwjd4dti6hl66.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--AR6j_smJ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ux4rlhexwjd4dti6hl66.png" alt="Image description" width="412" height="276"&gt;&lt;/a&gt;&lt;br&gt;
The plot now shows the 'k' representing 1,000 next to the kilogram. &lt;/p&gt;

&lt;h4&gt;
  
  
  Sample 6: Logarithm Exponents
&lt;/h4&gt;

&lt;p&gt;And of course, for the ever popular logarithmic exponents, there is a formatter that will return said exponents, in this case log base 10 (remember, our y-axis from the original dataset ran 0-1200).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--2WQUXoGl--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ygpx1lz9iha7m5tbeiqt.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--2WQUXoGl--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ygpx1lz9iha7m5tbeiqt.png" alt="Image description" width="423" height="423"&gt;&lt;/a&gt;&lt;br&gt;
The y axis shows the exponent of the log. How did we get 3 and 3.08? (Obnoxious reminder: 10^x = 1000, so x = 3. And 10^x = 1200, so x ≈ 3.08.)&lt;/p&gt;
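&lt;p&gt;The screenshot above doesn't show the code for this one. One formatter that returns base-10 exponents is &lt;code&gt;LogFormatterExponent&lt;/code&gt; from &lt;code&gt;matplotlib.ticker&lt;/code&gt;, so a sketch along those lines (note the dataset's 0 becomes 1 here, since log axes need strictly positive values) might be:&lt;/p&gt;

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; no display needed
import matplotlib.pyplot as plt
from matplotlib.ticker import LogFormatterExponent

fig, ax = plt.subplots()
ax.bar(range(1, 13), [400, 200, 200, 800, 100, 1, 1200, 700, 800, 700, 200, 400])
ax.set_yscale("log")  # log axes need positive data, hence 1 instead of 0
ax.yaxis.set_major_formatter(LogFormatterExponent())  # label ticks with exponents

fig.canvas.draw()
labels = [t.get_text() for t in ax.yaxis.get_ticklabels()]
print(labels)
```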

&lt;h4&gt;
  
  
  Other Pre-built Function Formatters
&lt;/h4&gt;

&lt;p&gt;Here is the complete list, courtesy of Matplotlib.org:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--VMd3bIdo--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/bhgnt6kvdkbdi7rmapd0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--VMd3bIdo--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/bhgnt6kvdkbdi7rmapd0.png" alt="Image description" width="680" height="651"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Tick Formatting - Custom Function Formatter
&lt;/h3&gt;

&lt;p&gt;The third option for &lt;code&gt;set_major_formatter&lt;/code&gt; is to write a custom formatting function. This function takes in &lt;code&gt;x&lt;/code&gt;, the value of the tick, and &lt;code&gt;pos&lt;/code&gt;, the position of the tick, and returns a &lt;code&gt;str&lt;/code&gt; of what you want displayed. &lt;/p&gt;
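&lt;p&gt;For example, a hypothetical &lt;code&gt;thousands&lt;/code&gt; function (the name and format are my own invention) can be passed straight to &lt;code&gt;set_major_formatter&lt;/code&gt;, which wraps it in a &lt;code&gt;FuncFormatter&lt;/code&gt; for you:&lt;/p&gt;

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; no display needed
import matplotlib.pyplot as plt

def thousands(x, pos):
    """x is the tick value, pos its index; return the string to display."""
    return f"{x / 1000:.1f}k"   # e.g. 1200.0 becomes '1.2k'

fig, ax = plt.subplots()
ax.bar(range(1, 13), [400, 200, 200, 800, 100, 0, 1200, 700, 800, 700, 200, 400])
ax.yaxis.set_major_formatter(thousands)  # a plain function is accepted directly

fig.canvas.draw()
labels = [t.get_text() for t in ax.yaxis.get_ticklabels()]
print(labels[:2])  # every y tick now ends in 'k'
```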

&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;p&gt;This post should familiarize you with the &lt;code&gt;set_major_formatter&lt;/code&gt; function and provide some simple examples. These examples should save you time and are also applicable to other formatters (like &lt;code&gt;set_minor_formatter&lt;/code&gt;). I've also provided additional resources on topics not covered here. This matters because every graph tells a story for your audience; the clearer and more concise your storytelling, the better it is for everyone!&lt;/p&gt;

&lt;h3&gt;
  
  
  Resources
&lt;/h3&gt;

&lt;p&gt;Matplotlib.org - Axis.set_major_formatter(formatter)&lt;br&gt;
&lt;a href="https://matplotlib.org/stable/api/_as_gen/matplotlib.axis.Axis.set_major_formatter.html#matplotlib.axis.Axis.set_major_formatter"&gt;https://matplotlib.org/stable/api/_as_gen/matplotlib.axis.Axis.set_major_formatter.html#matplotlib.axis.Axis.set_major_formatter&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Anatomy of a Figure&lt;br&gt;
&lt;a href="https://matplotlib.org/stable/gallery/showcase/anatomy.html"&gt;https://matplotlib.org/stable/gallery/showcase/anatomy.html&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Matplotlib.org - Tick Locating and Formatting&lt;br&gt;
&lt;a href="https://matplotlib.org/stable/api/ticker_api.html#tick-locating-and-formatting"&gt;https://matplotlib.org/stable/api/ticker_api.html#tick-locating-and-formatting&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Matplotlib.org - Date Tick Labels&lt;br&gt;
&lt;a href="https://matplotlib.org/stable/gallery/text_labels_and_annotations/date.html#sphx-glr-gallery-text-labels-and-annotations-date-py"&gt;https://matplotlib.org/stable/gallery/text_labels_and_annotations/date.html#sphx-glr-gallery-text-labels-and-annotations-date-py&lt;/a&gt;&lt;/p&gt;

</description>
      <category>beginners</category>
      <category>plots</category>
      <category>formatting</category>
      <category>ticklabels</category>
    </item>
  </channel>
</rss>
