<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: bennettandrewm</title>
    <description>The latest articles on DEV Community by bennettandrewm (@bennettandrewm).</description>
    <link>https://dev.to/bennettandrewm</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1106630%2F27baa8e9-65b3-4d88-b2a9-e70cb466fa37.png</url>
      <title>DEV Community: bennettandrewm</title>
      <link>https://dev.to/bennettandrewm</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/bennettandrewm"/>
    <language>en</language>
    <item>
      <title>Some (Pleasant) Surprises about the Surprise Module: A Beginner's Thoughts</title>
      <dc:creator>bennettandrewm</dc:creator>
      <pubDate>Fri, 16 Feb 2024 21:18:44 +0000</pubDate>
      <link>https://dev.to/bennettandrewm/some-pleasant-surprises-about-the-surprise-module-a-beginners-thoughts-hia</link>
      <guid>https://dev.to/bennettandrewm/some-pleasant-surprises-about-the-surprise-module-a-beginners-thoughts-hia</guid>
      <description>&lt;h2&gt;
  
  
  Why this Matters:
&lt;/h2&gt;

&lt;p&gt;Recommendation systems are a critical component to boost engagement on streaming services and social media. By mitigating indecision, users are likely to spend more time on these platforms, improving their financial performance. An obvious example is movie selection, but recommendation systems work well for any widely distributed product with a definitive user impact. A popular module for this is &lt;code&gt;surprise&lt;/code&gt;, a package in the Python scikit family. But is it really helpful? The answer is... yes! Would we, as data scientists, sometimes be better off not using it?... Also, yes!&lt;/p&gt;

&lt;h2&gt;
  
  
  Background:
&lt;/h2&gt;

&lt;p&gt;The surprise module is a tool for collaborative filtering of explicit ratings systems. It has numerous built-in algorithms - including the Singular Value Decomposition (SVD) approach that Simon Funk made famous during the Netflix Prize competition, which launched in 2006. It allows you to tune hyperparameters to test different methods on your particular dataset, much like standard scikit methods. For collaborative filtering, it includes item-based vs user-based analysis and a number of KNN and SVD methods. It has a simple install, and it integrates nicely into the scikit environment, because, well, that's how it was designed. So let's dig deeper.&lt;/p&gt;

&lt;h2&gt;
  
  
  Pleasant Surprises
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Simplicity
&lt;/h3&gt;

&lt;p&gt;The best thing about &lt;code&gt;surprise&lt;/code&gt; is its simple, plug-n-play nature. If you're already working in Python and have a dataset suitable for explicit rating systems, a few very easy operations get you right into collaborative filtering. You can follow along right from your Jupyter Notebook as this blog walks you through the very simple basics (fyi - you may need an up-to-date conda or pip environment installed prior to this). &lt;/p&gt;

&lt;p&gt;First, install it, obviously.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;gt;&amp;gt;&amp;gt; pip install scikit-surprise
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h3&gt;
  
  
  Handling Datasets
&lt;/h3&gt;

&lt;p&gt;One of the best things about &lt;code&gt;surprise&lt;/code&gt; is the ease with which it handles datasets. You just import the relevant functions &lt;code&gt;Reader&lt;/code&gt; and &lt;code&gt;Dataset&lt;/code&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;gt;&amp;gt;&amp;gt; from surprise import Reader, Dataset
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;From here, we go one of two ways, use a stored dataset or pull in a new one.&lt;/p&gt;

&lt;h4&gt;
  
  
  New Datasets
&lt;/h4&gt;

&lt;p&gt;You can load any dataset and it will automatically read the number of unique users and items, provided that it's properly formatted. It requires a "user item rating [timestamp]" structure for the columns. This doesn't save a pre-processing step, per se, but once it's loaded correctly, you can strategize about the best filtering methods prior to the actual modeling and hyperparameter tuning. &lt;/p&gt;

&lt;p&gt;The code is simple for, say, a csv file or a pandas dataframe.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Pandas Dataframe&lt;/em&gt;&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;gt;&amp;gt;&amp;gt; reader = Reader(rating_scale=(0.0,5.0))
data = Dataset.load_df("sample_data")
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Subtle note - you must instantiate the Reader with the rating scale (there's a default of (1, 5), but it's nice to write it out in code for reference/readability).&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Other files&lt;/em&gt;&lt;br&gt;
This code was taken from the &lt;code&gt;surprise&lt;/code&gt; website and modified for ease.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# sample path to dataset file
file_path = os.path.expanduser("~/sample_data.csv")

# instantiate the Reader class with a "line_format" 
# and a "separator"

reader = Reader(line_format="user item rating timestamp",
                sep="\t", rating_scale=(0.0,5.0))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Here, you have to specify the separator used in the file, whether it's &lt;code&gt;.csv&lt;/code&gt;, &lt;code&gt;.data&lt;/code&gt;, etc.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# instatiate your dataset with the Dataset module
data = Dataset.load_from_file(file_path, reader=reader)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Built-in Datasets
&lt;/h4&gt;

&lt;p&gt;The surprise module also has built-in datasets to work with, including Jester (a collection of joke ratings) and MovieLens (the classic database used for movie ratings). This makes for a certain ease in building recommendation systems if you're just looking to get some experience. We'll utilize one of those built-in datasets now.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;#read in movielens dataset to surprise format
data = Dataset.load_builtin("ml-100k")

# we will create a test set for validation, this will be 
# used later when we fit the model
trainset, testset = train_test_split(data, test_size = 0.2)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You'll recognize the familiarity with Python Scikit because...&lt;/p&gt;

&lt;h3&gt;
  
  
  Python Scikit Ecosystem
&lt;/h3&gt;

&lt;p&gt;Chances are you're already working in Python's scikit ecosystem. &lt;code&gt;surprise&lt;/code&gt; has similar verbiage around cross-validation, train/test splits, and estimator methods like &lt;code&gt;.fit&lt;/code&gt;, among others.&lt;/p&gt;

&lt;p&gt;To provide an example, we'll import the Singular Value Decomposition (SVD) algorithm (more on this later). We'll also import the &lt;code&gt;accuracy&lt;/code&gt; module, which includes a variety of metrics.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;gt;&amp;gt; from surprise import accuracy, SVD

# We'll use the famous SVD algorithm.
&amp;gt;&amp;gt; algo = SVD()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now we can utilize our previous testset&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;# Train the algorithm on the trainset, 
# and predict ratings for the testset
&amp;gt;&amp;gt;&amp;gt; algo.fit(trainset)
&amp;gt;&amp;gt;&amp;gt; predictions = algo.test(testset)

# Then compute RMSE
&amp;gt;&amp;gt;&amp;gt; accuracy.rmse(predictions)

RMSE: 0.9405
0.9405357087305851
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Wow, we were able to instantly get predictions from the SVD algorithm on this dataset. Let's talk about some of the available algorithms in &lt;code&gt;surprise&lt;/code&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Algorithms within Surprise
&lt;/h3&gt;

&lt;h4&gt;
  
  
  Existing Algorithms
&lt;/h4&gt;

&lt;p&gt;To aid in your quest, &lt;code&gt;surprise&lt;/code&gt; has a number of built-in models available. The specialties include a variety of KNN algorithms and SVD, including the now-famous algorithm Simon Funk popularized during the Netflix Prize competition. The full list from the &lt;a href="https://surpriselib.com/"&gt;homepage&lt;/a&gt;, with the RMSE of each algorithm's predictions on a sample dataset (MovieLens), is shown below.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg9u1kladybjc22e4equn.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fg9u1kladybjc22e4equn.png" alt="algorithm_accuracy" width="746" height="499"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Build-Your-Own Algorithm
&lt;/h4&gt;

&lt;p&gt;One of the nice features about &lt;code&gt;surprise&lt;/code&gt; is that you can build your own algorithms. Big deal, you might think, but it does provide a way to integrate with some of the existing algorithms in a seamless manner. For example, if you're feeling confident, (or have additional domain knowledge) you could build a new algorithm and ensemble it with built-in algorithms to create a (sort-of) hybrid filtering system.&lt;/p&gt;
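In surprise, custom algorithms subclass `AlgoBase` and implement an `estimate(user, item)` method. As a hedged, library-free sketch of that pattern (class and variable names here are my own, not the real API), here's a toy "global mean" predictor:

```python
class ToyAlgoBase:
    """Mimics the surprise pattern: fit() learns state, estimate() predicts."""
    def fit(self, trainset):
        self.trainset = trainset
        return self

    def test(self, testset):
        # return (user, item, actual, predicted) tuples
        return [(u, i, r, self.estimate(u, i)) for u, i, r in testset]

class GlobalMean(ToyAlgoBase):
    """Predict every rating as the mean of all training ratings."""
    def fit(self, trainset):
        super().fit(trainset)
        self.mean = sum(r for _, _, r in trainset) / len(trainset)
        return self

    def estimate(self, u, i):
        return self.mean

train = [(0, 0, 4.0), (0, 1, 2.0), (1, 0, 3.0)]
algo = GlobalMean().fit(train)
print(algo.estimate(5, 7))  # 3.0 - the global mean, even for unseen users
```

In the real library, a subclass like this can be cross-validated and ensembled alongside the built-in algorithms.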

&lt;h2&gt;
  
  
  Downsides/Limitations:
&lt;/h2&gt;

&lt;p&gt;To grasp the limitations of the &lt;code&gt;surprise&lt;/code&gt; module, it's important to understand a few different filtering systems. &lt;code&gt;surprise&lt;/code&gt; module works incredibly well with collaborative filtering of explicit ratings. Maybe too well...&lt;/p&gt;

&lt;h3&gt;
  
  
  Bad for Students
&lt;/h3&gt;

&lt;p&gt;What!?!?! (I can hear you say). Yes, I said it. It's not great for learning because... well... it's &lt;em&gt;too&lt;/em&gt; good and &lt;em&gt;too&lt;/em&gt; focused. It's such a simple, plug-n-play model used only for collaborative filtering of explicit ratings systems that it can be a crutch if you're a student. If you're working on a tight deadline in the private sector, then yes, import the surprise module and get your model finished. But if you need to explore, and learn, and try new things, it can do too much of the work for you - and it won't help at all once you move beyond collaborative filtering for explicit ratings systems. More on that below.&lt;/p&gt;

&lt;h3&gt;
  
  
  Explicit vs Implicit Ratings System
&lt;/h3&gt;

&lt;p&gt;A foundational element to understand is that &lt;code&gt;surprise&lt;/code&gt; does not support &lt;em&gt;implicit&lt;/em&gt; ratings systems or content filtering. Understanding the differences in these systems is critical to successful implementation of &lt;code&gt;surprise&lt;/code&gt;.&lt;/p&gt;

&lt;h4&gt;
  
  
  Explicit Ratings
&lt;/h4&gt;

&lt;p&gt;Explicit ratings rely on a known element to specifically rate satisfaction or preference. A nice example of this is a movie rating system on a scale of, say, 1-10. We can rely on this numeric value to indicate the level of satisfaction a user has with a movie. We then use this information to predict how users would rate movies they haven't seen. It becomes a straightforward prediction model once we've done the collaborative filtering.&lt;/p&gt;
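A minimal sketch of that prediction step (pure Python, with my own toy data): estimate a missing rating as the similarity-weighted average of other users' ratings for that movie.

```python
# ratings[user][movie]: explicit 1-5 ratings (toy data)
ratings = {
    "ann": {"up": 5, "heat": 1, "jaws": 4},
    "bob": {"up": 4, "heat": 2},
    "cat": {"up": 1, "heat": 5, "jaws": 2},
}

def cosine_sim(a, b):
    # cosine similarity, with the dot product over co-rated movies
    shared = set(a) & set(b)
    if not shared:
        return 0.0
    dot = sum(a[m] * b[m] for m in shared)
    na = sum(v * v for v in a.values()) ** 0.5
    nb = sum(v * v for v in b.values()) ** 0.5
    return dot / (na * nb)

def predict(user, movie):
    # similarity-weighted average of other users' ratings for this movie
    num = den = 0.0
    for other, seen in ratings.items():
        if other == user or movie not in seen:
            continue
        s = cosine_sim(ratings[user], seen)
        num += s * seen[movie]
        den += s
    return num / den if den else None

print(round(predict("bob", "jaws"), 2))
```

Bob's tastes track Ann's far more than Cat's, so his predicted rating for "jaws" lands closer to Ann's 4 than Cat's 2.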

&lt;h4&gt;
  
  
  Implicit Ratings
&lt;/h4&gt;

&lt;p&gt;Implicit ratings use other data besides a precise rating to determine satisfaction. Let's take our movie rating example and apply it to a typical evening with Netflix. Netflix doesn't ask us to rate a movie explicitly, but they do have data on WHAT movies we watched previously and, at least, the number of minutes we viewed. If I watch an entire movie, the implication is that I enjoyed it. But it's not certain, as I was never asked explicitly. It's helpful to think of implicit ratings as a confidence metric as opposed to something certain. Perhaps someone watches something while they're scrolling or doing work. They may finish a TV episode or movie, but did they really like it? It's hard to know explicitly. On the other hand, if someone has watched every episode of the Sopranos, start to finish, I have high confidence they enjoyed it. The advantage of implicit ratings is that the data collection is far simpler, only tracking a user's behavior history. It doesn't erode the user's experience with frustrating surveys disrupting their escapism. &lt;/p&gt;
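One hedged way to picture the "confidence, not rating" idea (a toy heuristic of my own, not from any library): turn raw watch behavior into a completion fraction and treat that as confidence the user liked the title.

```python
# (minutes_watched, runtime_minutes) per title for one user - made-up data
history = {
    "sopranos_s1e1": (55, 55),
    "some_movie": (12, 110),
    "other_movie": (95, 100),
}

def confidence(watched, runtime):
    """Fraction completed, treated as confidence the user liked it."""
    return min(watched / runtime, 1.0)

liked = {title: round(confidence(w, r), 2) for title, (w, r) in history.items()}
print(liked)
```

Finishing an episode gives confidence 1.0; bailing twelve minutes into a movie gives roughly 0.11. Neither is a rating - both are evidence.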

&lt;h2&gt;
  
  
  Content Filtering
&lt;/h2&gt;

&lt;p&gt;The other limitation is content filtering - the module has no built-in capabilities for this. But what is it? Content filtering relies on meta-data to tell you about the product. It only needs to know one thing you've watched or enjoyed, and then it can recommend something very similar. It's different from collaborative filtering because it doesn't rely on multiple users, their user histories, and multiple products - just the last thing you watched and a product with similar content meta-data.&lt;/p&gt;

&lt;p&gt;Let's stick with our movie example. A title alone may not tell you much about the movie, but the year it was made, the genre, the actors, or some keyword descriptions can go a long way. This is the meta-data that describes the film. Think about a "hilarious", "Will Ferrell", "comedy" movie that perhaps you've just watched. I can recommend at least five others that you would probably also watch just off the strength of those keywords. Now... you may be all Will Ferrell'd out for the evening, but you might keep it in mind next time.&lt;/p&gt;

&lt;p&gt;It's the epitome of "Because you watched X, you might like Y." It's helpful for "cold start" problems because it needs very little, if any, user history. You just match the user with the product most similar to the one they just experienced. The downside is that it doesn't factor in dissimilar products that you might like. We all like variety in our lives, even if we have consistent taste. The other weakness is that it depends entirely on the quality and trustworthiness of the meta-data. Was that meta-data generated from a single user, or did it come from many users or some larger database? The Will Ferrell example is easy, but sometimes it's just a "period" "comedy"/"drama" starring "Elle Fanning" entitled "The Great". This is a highly rated series available on streaming platforms, and hopefully the metadata contains a reference to "Catherine the Great", or it might miss the Russophile market segment.&lt;/p&gt;
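As a toy sketch of content filtering (the catalog and tags below are made up): score every other item by the overlap of its keyword tags with the item just watched, using Jaccard similarity.

```python
# made-up metadata: each title maps to a set of keyword tags
catalog = {
    "anchorman": {"comedy", "will ferrell", "hilarious"},
    "elf": {"comedy", "will ferrell", "holiday"},
    "the great": {"period", "comedy", "drama", "elle fanning"},
    "heat": {"crime", "drama", "thriller"},
}

def jaccard(a, b):
    # size of the tag overlap relative to the combined tag set
    return len(a & b) / len(a | b)

def recommend(just_watched, n=2):
    scores = {title: jaccard(catalog[just_watched], tags)
              for title, tags in catalog.items() if title != just_watched}
    return sorted(scores, key=scores.get, reverse=True)[:n]

print(recommend("anchorman"))
```

Note there's no user history at all here - one watched title and the metadata are enough, which is exactly why content filtering handles cold starts.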

&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;p&gt;The surprise module is a very simple, streamlined, plug-n-play method for collaborative filtering of explicit rating systems. It's part of the Python scikit family, so it integrates nicely with the data science Python environment. It handles datasets and hyperparameter tuning easily, with a variety of built-in algorithms to help modeling, as well as functionality to build your own algorithms. It's well suited for explicit ratings - things like movies, books, or music, where many, many people have definitive reactions to a shared experience/product. It's too simple, actually. If you're a student needing to learn, or you need a recommendation system besides collaborative filtering with explicit ratings, then I might try something else.&lt;/p&gt;

&lt;h3&gt;
  
  
  SOURCES
&lt;/h3&gt;

&lt;p&gt;Surprise Module &lt;a href="https://surpriselib.com/"&gt;https://surpriselib.com/&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Surprised Kid &lt;a href="https://dev-to-uploads.s3.amazonaws.com/uploads/articles/kqpgywefh202v6xej29w.png"&gt;https://dev-to-uploads.s3.amazonaws.com/uploads/articles/kqpgywefh202v6xej29w.png&lt;/a&gt;&lt;/p&gt;

</description>
      <category>recommendationsystems</category>
      <category>datascience</category>
      <category>machinelearning</category>
      <category>collaborativefiltering</category>
    </item>
    <item>
      <title>Seeding, Reproducibility, and other Random Thoughts on the Random Module</title>
      <dc:creator>bennettandrewm</dc:creator>
      <pubDate>Fri, 26 Jan 2024 16:04:49 +0000</pubDate>
      <link>https://dev.to/bennettandrewm/seeding-reproducibility-and-other-random-thoughts-on-the-random-module-11fc</link>
      <guid>https://dev.to/bennettandrewm/seeding-reproducibility-and-other-random-thoughts-on-the-random-module-11fc</guid>
      <description>&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flhbmlz3pitjm1hv1dv50.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Flhbmlz3pitjm1hv1dv50.jpg" alt="Turtle_stack" width="750" height="750"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Why Random Module
&lt;/h2&gt;

&lt;p&gt;When studying data science and machine learning, the &lt;code&gt;random()&lt;/code&gt; function in Python is vital. Whether you're developing code, experimenting with data visualizations, or just navel-gazing as a data nerd, it's critical to use and understand.&lt;/p&gt;

&lt;p&gt;But what is it? How do you use it? And why does the number 42 always come up? This article will dive into random bits (no pun intended) to know about the Python random number generation (rng).&lt;/p&gt;

&lt;h2&gt;
  
  
  Random or... not so Random
&lt;/h2&gt;

&lt;p&gt;Let's define informally what we mean by random. In Python, the random number generator &lt;a href="https://docs.python.org/3/library/random.html"&gt;creates &lt;em&gt;pseudo&lt;/em&gt; random numbers&lt;/a&gt;, meaning from an algorithm. It uses the system time, with &lt;a href="https://www.sciencedirect.com/topics/computer-science/mersenne-twister"&gt;additional math on top&lt;/a&gt;, to generate these numbers. It's deterministic, so not perfectly random.  But as Larry David would say, they're "pretty, pretty, pretty good." &lt;/p&gt;
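You can see the determinism directly: seed the generator with the same value twice and you get the same "random" numbers back.

```python
import random

random.seed(42)
first_run = [random.random() for _ in range(3)]

random.seed(42)   # re-seed with the same value
second_run = [random.random() for _ in range(3)]

print(first_run == second_run)  # True - same seed, same sequence
```

This is the property the next section leans on for reproducibility.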

&lt;h2&gt;
  
  
  Seeding and Reproducibility
&lt;/h2&gt;

&lt;h4&gt;
  
  
  Import
&lt;/h4&gt;

&lt;p&gt;When using the random function, remember to import the module into python... duh&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import random
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  Random.Seed
&lt;/h4&gt;

&lt;p&gt;This function makes your random number reproducible. What does that mean? Every time you call for a random number without it, Python will generate a different number than the previous occasion. Meaning, that random number is unique to that instantaneous request. Sometimes though, you want the SAME random number each time (reproducibility). If you're running the same code over and over for debugging/development/whatever, you want to verify that you're getting the CORRECT result, say, 42. &lt;/p&gt;

&lt;p&gt;This is where &lt;code&gt;.seed&lt;/code&gt; comes in. You're planting a seed, so to speak, so that every time you generate a random number, it's NOT unique to that compiling instant.&lt;/p&gt;

&lt;p&gt;The code is simple:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;random.seed(42)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Now we will get reproducibility in our data. Let's move on to generating actual data.&lt;/p&gt;

&lt;h2&gt;
  
  
  Generating Data
&lt;/h2&gt;

&lt;p&gt;Let's give common examples of code to get a number or a sequence of numbers or elements.&lt;/p&gt;

&lt;h3&gt;
  
  
  Generating a Number (Ints or Floats)
&lt;/h3&gt;

&lt;h4&gt;
  
  
  random.randint (a,b)
&lt;/h4&gt;

&lt;p&gt;Returns a random integer between a and b, inclusively. If I send the arguments (4,9), it returns an integer from 4 to 9.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;gt;&amp;gt;&amp;gt; random.randint(4,9)
9
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  random.randrange(start, stop, step)
&lt;/h4&gt;

&lt;p&gt;It will return a random integer between start and stop (exclusive of stop), accounting for the step - here, either 2 or 7.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;gt;&amp;gt;&amp;gt;random.randrange(2,12,5)
7
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  random.random ()
&lt;/h4&gt;

&lt;p&gt;This generates a random float between 0.0 and 1.0.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;gt;&amp;gt;&amp;gt; random.random()
0.11133106816568039
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  random.uniform (a,b)
&lt;/h4&gt;

&lt;p&gt;This generates a random float between the numbers you send it.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;gt;&amp;gt;&amp;gt; random.uniform(3, 6)
5.224651499279499
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Please note from the library "The end-point value b may or may not be included in the range depending on floating-point rounding in the equation a + (b-a) * random()."&lt;/p&gt;

&lt;h4&gt;
  
  
  random.choice(&lt;em&gt;seq&lt;/em&gt;)
&lt;/h4&gt;

&lt;p&gt;This is exactly how it sounds - you're getting a random element from a sequence that you provide. (It's an illusion, of course - pseudo-random, not truly random.) A sequence could be a list, a tuple, anything. Let's see an example:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;gt;&amp;gt;&amp;gt; #tummy_prob is a sequence of seven numbers representing the
&amp;gt;&amp;gt;&amp;gt; # probability I will have tummy trouble on a given day of the week
&amp;gt;&amp;gt;&amp;gt; tummy_prob = [0.24, 0.35, .01, .05, .81, 0.36, .06]
&amp;gt;&amp;gt;&amp;gt; random.choice(tummy_prob)
0.35
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Yikes! I'm staying home today...&lt;/p&gt;

&lt;h3&gt;
  
  
  Working with Many Elements
&lt;/h3&gt;

&lt;h4&gt;
  
  
  random.shuffle(x)
&lt;/h4&gt;

&lt;p&gt;It will randomly shuffle, in place, a sequence that you send it. You send it x; it rearranges x into a different order (and returns None). Let's try a safer example with a deck of 5 cards.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;gt;&amp;gt;&amp;gt;cards = [3,5,8,7,9]
&amp;gt;&amp;gt;&amp;gt;random.shuffle(cards)
&amp;gt;&amp;gt;&amp;gt;cards
[3, 7, 8, 5, 9]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  random.sample(population, k)
&lt;/h4&gt;

&lt;p&gt;Returns a list of k unique elements. Used for random sampling without replacement. You send it a population - it returns a list.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;gt;&amp;gt;&amp;gt;people_heights = [5.25, 6.0, 6.2, 5.75, 5.5, 5.9]
&amp;gt;&amp;gt;&amp;gt;random.sample(people_heights, 2)
[6.0, 5.75]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  random.choices(&lt;em&gt;population&lt;/em&gt;)
&lt;/h4&gt;

&lt;p&gt;This returns a list of elements chosen from a population, sampled with replacement (one element by default).&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;&amp;gt;&amp;gt;&amp;gt;people_heights = [5.25, 6.0, 6.2, 5.75, 5.5, 5.9]
&amp;gt;&amp;gt;&amp;gt;random.choices(people_heights)
[6.2]
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Notice how it returned a list - with the &lt;code&gt;k&lt;/code&gt; argument, it can return multiple elements.&lt;/p&gt;
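Pass `k` to pull several elements at once. Unlike `random.sample`, `random.choices` samples with replacement, so the same element can appear more than once.

```python
import random

people_heights = [5.25, 6.0, 6.2, 5.75, 5.5, 5.9]

# draw four elements with replacement - duplicates are possible
picks = random.choices(people_heights, k=4)
print(len(picks))  # always 4
print(all(p in people_heights for p in picks))  # True
```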

&lt;h3&gt;
  
  
  Other Random Notes
&lt;/h3&gt;

&lt;p&gt;If you've made it to the end, you're obviously a dedicated, patient reader who is ever so curious about the...&lt;/p&gt;

&lt;h4&gt;
  
  
  Number 42
&lt;/h4&gt;

&lt;p&gt;It's from &lt;em&gt;The Hitchhiker's Guide to the Galaxy&lt;/em&gt; by Douglas Adams. At the end of the book, the computer Deep Thought, when asked for the "Answer to the Ultimate Question of Life, the Universe, and Everything," responds with 42. &lt;/p&gt;

&lt;p&gt;I hope this helps. That's all for now.&lt;/p&gt;

&lt;h2&gt;
  
  
  SOURCES
&lt;/h2&gt;

&lt;p&gt;Python Library &lt;a href="https://docs.python.org/3/library/random.html"&gt;https://docs.python.org/3/library/random.html&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Turtle Image &lt;a href="https://dev-to-uploads.s3.amazonaws.com/uploads/articles/lhbmlz3pitjm1hv1dv50.jpg"&gt;https://dev-to-uploads.s3.amazonaws.com/uploads/articles/lhbmlz3pitjm1hv1dv50.jpg&lt;/a&gt;&lt;/p&gt;

</description>
      <category>datascience</category>
      <category>random</category>
      <category>python</category>
      <category>beginners</category>
    </item>
    <item>
      <title>Learning how the Machines Learn: An Overview of Statistical Bases</title>
      <dc:creator>bennettandrewm</dc:creator>
      <pubDate>Thu, 31 Aug 2023 16:00:34 +0000</pubDate>
      <link>https://dev.to/bennettandrewm/false-myths-of-false-positives-3kmd</link>
      <guid>https://dev.to/bennettandrewm/false-myths-of-false-positives-3kmd</guid>
      <description>&lt;h2&gt;
  
  
  Overview
&lt;/h2&gt;

&lt;p&gt;To understand the basics of machine learning, it's important to grasp the foundational concepts. This post discusses inferential vs predictive statistics and regression vs classification. It also reviews 6 algorithms foundational to Machine Learning: Linear Regression, Logistic Regression, K Nearest Neighbors, Naive Bayes, Decision Trees, and Support Vector Machines. We'll also do a quick overview of popular loss functions for these algorithms with a brief explanation.&lt;/p&gt;

&lt;h2&gt;
  
  
  Why this is Important?
&lt;/h2&gt;

&lt;p&gt;The real fun of machine learning comes from implementing neural networks and deep learning. Before we can walk there, we must crawl (sorry). These 6 algorithms represent the real basics of machine learning, from which more complex systems form. Once we get here, we can start using statistics to predict and generate content. Predict? Yes... that's correct. What, you thought statistics were just for inferences? Well, it can be, but let's discuss the difference.&lt;/p&gt;

&lt;h2&gt;
  
  
  Inferential vs Predictive
&lt;/h2&gt;

&lt;p&gt;Inferential statistics focuses on the relationships &lt;em&gt;between&lt;/em&gt; variables, establishing causal links between independent and dependent variables. Prediction, while not ignoring causality, focuses on the accuracy with which you can predict a certain outcome. To illustrate this difference, let's use climate. &lt;/p&gt;

&lt;h3&gt;
  
  
  Inferential Statistics - Example
&lt;/h3&gt;

&lt;p&gt;There's consensus that the temperature of the earth is warming, but debate about exactly what's causing it. And, for the sake of the discussion, let's assume we're experts in the domain. If we wanted to understand causation, we would apply inferential principles, gathering data such as tree cover, greenhouse gas emissions, etc. as our independent variables, and gathering some global air temperature data as our dependent variable. We then run some analysis, perhaps a linear regression, and determine which variables have the greatest weight (effect) on that temperature metric. As long as we were cognizant of correlation risks, our results would indicate which variable has the strongest link to global temperatures.&lt;/p&gt;

&lt;h3&gt;
  
  
  Inferential Statistical Metrics
&lt;/h3&gt;

&lt;p&gt;With inferential, we might focus on p-values that could rule out a null hypothesis, perhaps considering R-squared (for accuracy) on certain models. We won't get into details here, but a small enough p-value could statistically rule out the opposite case of what we're trying to prove, which is ultimately the goal in establishing causation.&lt;/p&gt;

&lt;h3&gt;
  
  
  Predictive Statistics - Example
&lt;/h3&gt;

&lt;p&gt;Now, returning to our climate dilemma, let's think about predictions. Can we try to predict what the weather will be tomorrow? Well, yes in fact, we can. And meteorologists do it every day, multiple times a day. Do we care how they got to their conclusion? Maybe. But we really care how accurate they are. Perhaps that's why you hear, "AccuWeather" forecast as a brand name for the technology.&lt;/p&gt;

&lt;h3&gt;
  
  
  Predictive Statistics Metrics
&lt;/h3&gt;

&lt;p&gt;On the predictive side, we focus on things like the confusion matrix, which considers false positives, false negatives, true positives, and true negatives. And from here we dive right into accuracy, which is a measure of correct predictions (the sum of true positives and true negatives) against all observations (the sum of observed positives and negatives). This leads us to measure how "far off" our predicted values are from our observed values. Error, in other words.   &lt;/p&gt;
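A quick sketch of that arithmetic (the counts are made up): accuracy is simply correct predictions over all observations.

```python
# toy confusion-matrix counts: true/false positives and negatives
tp, tn, fp, fn = 40, 45, 5, 10

# correct predictions over all observations
accuracy = (tp + tn) / (tp + tn + fp + fn)
print(accuracy)  # 0.85
```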

&lt;h2&gt;
  
  
  Classifier vs Regression
&lt;/h2&gt;

&lt;p&gt;Now that we have reviewed some of the statistical foundations of Machine Learning, we can focus on predictive analytics. Let's do a quick reminder of some differences between regression and classifier method, and then we'll dig into some algorithms.&lt;/p&gt;

&lt;h3&gt;
  
  
  Regression
&lt;/h3&gt;

&lt;p&gt;With a regression, the goal is to reduce all of the complexities of your data set to a simpler, underlying relationship. We know it won't be perfect, but hopefully it's close. We can think of it as trying to UNIFY the data. &lt;/p&gt;

&lt;h3&gt;
  
  
  Classification
&lt;/h3&gt;

&lt;p&gt;With classification, we SEPARATE the data by making clear distinctions. We look at a big mass of info and start divvying it up.&lt;/p&gt;

&lt;h3&gt;
  
  
  Loss Function
&lt;/h3&gt;

&lt;p&gt;Circling back to error, it's a good time to delve into the idea of loss functions. This is critical to understanding how these programs perform optimization. Mean Squared Error (MSE) is a popular choice. When we use it as a loss function, we're constantly iterating our main algorithm to try and minimize the MSE. This is done by gradient descent: analyzing how the MSE changes with respect to the parameters and adjusting those parameters in the direction that reduces the loss. This is a mouthful, but ultimately loss functions represent an inaccuracy in our model that we're trying to reduce.&lt;/p&gt;
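Here's that loop in miniature (pure Python, with my own toy data): fit a single slope by nudging it against the gradient of the MSE each iteration.

```python
# toy data generated by y = 3x; we try to recover the slope
xs = [1.0, 2.0, 3.0, 4.0]
ys = [3.0, 6.0, 9.0, 12.0]

def mse(w):
    return sum((y - w * x) ** 2 for x, y in zip(xs, ys)) / len(xs)

w, lr = 0.0, 0.01
for _ in range(500):
    # gradient of the MSE with respect to w
    grad = sum(-2 * x * (y - w * x) for x, y in zip(xs, ys)) / len(xs)
    w -= lr * grad  # step against the gradient
print(round(w, 2))  # 3.0 - the slope the data was generated with
```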

&lt;h2&gt;
  
  
  Algorithms
&lt;/h2&gt;

&lt;p&gt;So, let's look at six algorithms below with help from a useful blog post (and subsequent diagram) called &lt;a href="https://www.blog.dailydoseofds.com/p/an-algorithm-wise-summary-of-loss?utm_source=post-email-title&amp;amp;publication_id=1119889&amp;amp;post_id=137547091&amp;amp;utm_campaign=email-post-title&amp;amp;isFreemail=true&amp;amp;r=2ce3uv&amp;amp;triedRedirect=true"&gt;Daily Dose of Data Science&lt;/a&gt;. &lt;/p&gt;

&lt;h3&gt;
  
  
  Regression
&lt;/h3&gt;

&lt;p&gt;The below chart shows ML Algorithms and Loss Functions&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc9e8s2tzkzcu2adyt9a2.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fc9e8s2tzkzcu2adyt9a2.png" alt="Regression" width="584" height="108"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;It's important to remember that some algorithms can be implemented as either regression or classification.&lt;/p&gt;

&lt;h4&gt;
  
  
  1. Linear Regression
&lt;/h4&gt;

&lt;p&gt;Attempts to find a unifying expression for predicting a continuous (non-discrete) variable. MSE (or RMSE) is the accuracy metric used as the loss function that drives optimization.&lt;/p&gt;

&lt;h4&gt;
  
  
  2. Logistic Regression
&lt;/h4&gt;

&lt;p&gt;Attempts to find a unifying expression for binary classification. Cross-entropy loss measures how far the predicted probabilities are from the true binary labels. &lt;/p&gt;
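&lt;p&gt;As a rough illustration (not the optimizer itself), binary cross-entropy can be computed by hand to see why confident wrong answers cost more. The labels and probabilities here are made up:&lt;/p&gt;

```python
import math

def cross_entropy(y_true, p_pred):
    """Average binary cross-entropy; confident wrong predictions are punished hard."""
    return -sum(y * math.log(p) + (1 - y) * math.log(1 - p)
                for y, p in zip(y_true, p_pred)) / len(y_true)

good = cross_entropy([1, 0, 1], [0.9, 0.1, 0.8])  # confident and right: low loss
bad = cross_entropy([1, 0, 1], [0.2, 0.8, 0.3])   # confident and wrong: high loss
print(round(good, 3), round(bad, 3))
```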

&lt;h3&gt;
  
  
  Classification
&lt;/h3&gt;

&lt;p&gt;Again, we can see the following chart for classification.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3tbw3e12ypgftgprjhpw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/cdn-cgi/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2F3tbw3e12ypgftgprjhpw.png" alt="Classification" width="582" height="181"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  3. Decision Tree
&lt;/h4&gt;

&lt;p&gt;Decision Tree creates a series, or path, of splits (into two groups each time) on the values of a single variable. Ideal for binary classification, the algorithm creates each split almost like a rule that tries to group a certain range of values with certain outcomes. Information gain measures how successful each split is.&lt;/p&gt;
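&lt;p&gt;A quick sketch of information gain, computed as the entropy reduction from one split. The label lists are made up for illustration:&lt;/p&gt;

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a label list, in bits."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def information_gain(parent, left, right):
    """Entropy reduction from splitting parent into left/right branches."""
    n = len(parent)
    children = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(parent) - children

parent = ["yes", "yes", "no", "no"]
# A perfect split sends all "yes" one way and all "no" the other
print(information_gain(parent, ["yes", "yes"], ["no", "no"]))  # 1.0
```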

&lt;h4&gt;
  
  
  4. Support Vector Machines
&lt;/h4&gt;

&lt;p&gt;Tries to subdivide the data with a linear boundary. Hinge loss tells us how thick this "split" (the margin) is in our model, and the thicker it is the better. If this sounds like a vague explanation of hinge loss, well, it is. This &lt;a href="https://towardsdatascience.com/a-definitive-explanation-to-hinge-loss-for-support-vector-machines-ab6d8d3178f1"&gt;article&lt;/a&gt; goes into better detail.&lt;/p&gt;
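&lt;p&gt;Hinge loss itself is simple to compute. In this tiny sketch (the scores are made up), a point costs nothing once it sits on the correct side by at least the margin:&lt;/p&gt;

```python
def hinge_loss(y, score):
    """y is the true label in {-1, +1}; score is the model's signed output.
    Loss is zero once the point clears the margin on the correct side."""
    return max(0.0, 1 - y * score)

print(hinge_loss(+1, 2.5))   # 0.0: correct and outside the margin
print(hinge_loss(+1, 0.4))   # 0.6: correct but inside the margin
print(hinge_loss(-1, 0.4))   # 1.4: wrong side entirely
```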

&lt;h4&gt;
  
  
  5. K Nearest Neighbors
&lt;/h4&gt;

&lt;p&gt;This algorithm locates a data point in feature space and examines the data points around it. It reports a vote of the most likely classification based on the K nearest neighbors of the point you are trying to predict. Essentially, this algorithm is "lazy" and there's no loss function: you give it an input you're looking to predict and it reports a vote. There's no optimization effort.&lt;/p&gt;
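&lt;p&gt;A minimal "lazy" KNN vote might look like this sketch, using 1-D toy data invented for illustration:&lt;/p&gt;

```python
from collections import Counter

def knn_predict(train, query, k=3):
    """Vote among the k training points closest to the query; no training step."""
    neighbors = sorted(train, key=lambda pair: abs(pair[0] - query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]

# Hypothetical data: (feature value, class label) pairs
train = [(1, "A"), (2, "A"), (3, "A"), (10, "B"), (11, "B"), (12, "B")]
print(knn_predict(train, 2.5))   # "A": all three nearest neighbors are class A
```

&lt;p&gt;Note there's nothing to fit and nothing to minimize; all the work happens at prediction time.&lt;/p&gt;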

&lt;h4&gt;
  
  
  6. Naive Bayes
&lt;/h4&gt;

&lt;p&gt;This algorithm follows from Bayes' theorem, which determines the probability of a classification given certain features. Unlike Decision Tree, the order of the features can vary; each feature's outcome is (naively) treated as independent, so once you know one variable's outcome, you can use it to update the prediction without worrying about the others. There's not much to optimize per se; you just iterate through each variable to determine its effect on the classification.&lt;/p&gt;
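&lt;p&gt;A toy counting-based sketch of the idea (the spam/ham data and feature names are invented): score each class by its prior times the product of per-feature likelihoods, then pick the winner.&lt;/p&gt;

```python
# Hypothetical toy data: two binary features, two classes
data = [
    ({"link": 1, "caps": 1}, "spam"),
    ({"link": 1, "caps": 0}, "spam"),
    ({"link": 0, "caps": 1}, "ham"),
    ({"link": 0, "caps": 0}, "ham"),
]

def naive_bayes(features):
    """Score each class by P(class) * product of P(feature value | class)."""
    labels = [label for _, label in data]
    scores = {}
    for cls in set(labels):
        rows = [feats for feats, label in data if label == cls]
        score = labels.count(cls) / len(labels)      # the prior
        for name, value in features.items():         # naive independence assumption
            score *= sum(1 for r in rows if r[name] == value) / len(rows)
        scores[cls] = score
    return max(scores, key=scores.get)

print(naive_bayes({"link": 1, "caps": 1}))  # spam
```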

&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;p&gt;This blog post summarizes foundational elements of Machine Learning. It discusses inferential vs. predictive statistics and classification vs. regression, then jumps into popular algorithms. We reviewed loss functions, and now you should be ready to jump into neural networks and deep learning.&lt;/p&gt;

&lt;h2&gt;
  
  
  Sources
&lt;/h2&gt;

&lt;p&gt;&lt;em&gt;An Algorithm-wise Summary of Loss Functions in Machine Learning Loss functions of 16 ML algorithms in a single frame&lt;/em&gt;, Avi Chawla, Sept 30, 2023.&lt;br&gt;
&lt;a href="https://www.blog.dailydoseofds.com/p/an-algorithm-wise-summary-of-loss?utm_source=post-email-title&amp;amp;publication_id=1119889&amp;amp;post_id=137547091&amp;amp;utm_campaign=email-post-title&amp;amp;isFreemail=true&amp;amp;r=2ce3uv&amp;amp;triedRedirect=true"&gt;https://www.blog.dailydoseofds.com/p/an-algorithm-wise-summary-of-loss?utm_source=post-email-title&amp;amp;publication_id=1119889&amp;amp;post_id=137547091&amp;amp;utm_campaign=email-post-title&amp;amp;isFreemail=true&amp;amp;r=2ce3uv&amp;amp;triedRedirect=true&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;A definitive explanation to the Hinge Loss for Support Vector Machines&lt;/em&gt; Vagif Aliyev, Towards Data Science, Nov 23, 2020&lt;br&gt;
&lt;a href="https://towardsdatascience.com/a-definitive-explanation-to-hinge-loss-for-support-vector-machines-ab6d8d3178f1"&gt;https://towardsdatascience.com/a-definitive-explanation-to-hinge-loss-for-support-vector-machines-ab6d8d3178f1&lt;/a&gt;**&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>beginners</category>
      <category>classification</category>
      <category>lossfunction</category>
    </item>
    <item>
      <title>Axis Headaches? Examples for Formatting Tick Labels (Matplotlib)</title>
      <dc:creator>bennettandrewm</dc:creator>
      <pubDate>Tue, 18 Jul 2023 21:19:51 +0000</pubDate>
      <link>https://dev.to/bennettandrewm/axis-headaches-examples-for-formatting-tick-labels-matplotlib-4o6j</link>
      <guid>https://dev.to/bennettandrewm/axis-headaches-examples-for-formatting-tick-labels-matplotlib-4o6j</guid>
      <description>&lt;h2&gt;
  
  
  Introduction
&lt;/h2&gt;

&lt;p&gt;If you're like me, you go to bed around the 40th tweak to your graph, bleary eyed and beaten. Formatting tick labels in particular is incredibly frustrating, especially when Offset Notation ruins an axis.&lt;/p&gt;

&lt;p&gt;To make things easier, I've laid out some simple formatting examples for an uncooperative axis using the &lt;code&gt;set_major_formatter&lt;/code&gt; feature. This is not a comprehensive list of all formatting options, but it is simple and effective for some of the more obvious cases. &lt;/p&gt;

&lt;h2&gt;
  
  
  Purpose
&lt;/h2&gt;

&lt;p&gt;This is useful because there are many ways to display numbers. Often, it's money ($x,000 for instance), but could include percentages, engineering notation, logarithmic scales, decimals, dates, or countless others. Getting your graph to tell a story is vital for any visualization. Hopefully this will save you some time.&lt;/p&gt;

&lt;h2&gt;
  
  
  Background
&lt;/h2&gt;

&lt;p&gt;If you want to scroll down to the examples below, feel free to skip this section. For those who want more, here's a little background on &lt;code&gt;axis&lt;/code&gt; and the &lt;code&gt;set_major_formatter&lt;/code&gt; function.&lt;/p&gt;

&lt;h3&gt;
  
  
  Axis Class
&lt;/h3&gt;

&lt;p&gt;To format our axis, we're going to work in the &lt;code&gt;axis&lt;/code&gt; class of matplotlib. &lt;code&gt;axis&lt;/code&gt; class, you ask? What's the difference between &lt;code&gt;axis&lt;/code&gt; and &lt;code&gt;axes&lt;/code&gt;? The &lt;code&gt;axes&lt;/code&gt; essentially contains everything in the plot, whereas &lt;code&gt;axis&lt;/code&gt; just pertains to the y or x-axis ticks themselves, especially tick location and formatting. To access the x axis, you would type &lt;code&gt;ax.xaxis&lt;/code&gt;. For the y axis, &lt;code&gt;ax.yaxis&lt;/code&gt;. See the informal diagram below, courtesy of &lt;a href="https://matplotlib.org/stable/gallery/showcase/anatomy.html"&gt;Matplotlib&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--FNvytErD--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/8secr1d6wrtcw90k9mjq.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--FNvytErD--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/8secr1d6wrtcw90k9mjq.png" alt="Image description" width="423" height="423"&gt;&lt;/a&gt;&lt;/p&gt;
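&lt;p&gt;A quick way to see the distinction in code (using the off-screen Agg backend so it runs without a display):&lt;/p&gt;

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; no display needed
import matplotlib.pyplot as plt
from matplotlib.axis import XAxis, YAxis

fig, ax = plt.subplots()  # ax is an Axes: the whole plotting region
# ax.xaxis and ax.yaxis are Axis objects: just the ticks, labels, and spine
print(isinstance(ax.xaxis, XAxis), isinstance(ax.yaxis, YAxis))  # True True
```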

&lt;h3&gt;
  
  
  Formatter
&lt;/h3&gt;

&lt;p&gt;Within the Axis class, two common objects that pertain to tick display are &lt;code&gt;Locator&lt;/code&gt; and &lt;code&gt;Formatter&lt;/code&gt;. I'm going to utilize a &lt;code&gt;formatter&lt;/code&gt;, specifically, &lt;code&gt;set_major_formatter&lt;/code&gt;. This &lt;code&gt;formatter&lt;/code&gt; accepts either a str, a function, or a pre-built &lt;code&gt;formatter&lt;/code&gt; instance. We'll discuss each of these three options in the following sections.&lt;/p&gt;

&lt;h2&gt;
  
  
  Examples
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Tick Formatting - String
&lt;/h3&gt;

&lt;h4&gt;
  
  
  Intro
&lt;/h4&gt;

&lt;p&gt;Our first example will show how to format using a string argument. When doing this, we use typical &lt;code&gt;str.format()&lt;/code&gt; syntax with &lt;code&gt;set_major_formatter&lt;/code&gt;, passing an &lt;code&gt;x&lt;/code&gt; followed by a colon inside the &lt;code&gt;{}&lt;/code&gt;. Let's see how a typical line would look.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--puE2FkZ5--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/uvhqwxjzwq3axzb0u0w4.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--puE2FkZ5--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/uvhqwxjzwq3axzb0u0w4.png" alt="Image description" width="410" height="230"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;So how does it work? To highlight the different formatting options, I'll use a simple graph as a template.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--SCT3ftfu--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/r85wcs522zjmqgzj0ccc.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--SCT3ftfu--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/r85wcs522zjmqgzj0ccc.png" alt="Image description" width="411" height="279"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;With this Source Code underneath:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import pandas as pd
import matplotlib.pyplot as plt

df_sample = pd.DataFrame([400, 200, 200, 800, 100, 0, 1200, 700, 800, 700, 200, 400],
            [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12])

ax = df_sample.plot.bar()

ax.set_title('Sample')
ax.set_xlabel('X-Axis')
ax.set_ylabel('Y-Axis')
ax.legend().remove()
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I gave the plot a title and label for the x and y axes but no other major formatting.&lt;/p&gt;

&lt;p&gt;Now let's see how we can pretty this up!&lt;/p&gt;

&lt;h4&gt;
  
  
  Sample 1: Dollars
&lt;/h4&gt;

&lt;p&gt;So, let's say the y-axis represents money and I want to show it in dollars. Specifically, I'd like to add a '$' and commas as thousands separators. Let's see what happens when I add one line of code:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;ax.yaxis.set_major_formatter('${x:,.0f}')&lt;/code&gt;  &lt;/p&gt;

&lt;p&gt;This is typical &lt;code&gt;str.format()&lt;/code&gt; syntax, with the &lt;code&gt;,&lt;/code&gt; used as the thousands separator and &lt;code&gt;.0f&lt;/code&gt; signifying the number of decimal places (0, in this case).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--E4WeBRCF--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/al8ubq8wwskthcmw86hf.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--E4WeBRCF--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/al8ubq8wwskthcmw86hf.png" alt="Image description" width="411" height="279"&gt;&lt;/a&gt;&lt;/p&gt;
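&lt;p&gt;For reference, here's Sample 1 as a self-contained sketch. It builds the bars with matplotlib directly instead of pandas, a small variation on the setup above, and uses the off-screen Agg backend so it runs anywhere:&lt;/p&gt;

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; no display needed
import matplotlib.pyplot as plt

values = [400, 200, 200, 800, 100, 0, 1200, 700, 800, 700, 200, 400]

fig, ax = plt.subplots()
ax.bar(range(1, 13), values)
ax.set_title('Sample')
ax.set_xlabel('X-Axis')
ax.set_ylabel('Y-Axis')
ax.yaxis.set_major_formatter('${x:,.0f}')  # the one formatting line

fig.canvas.draw()  # force the tick labels to be generated
labels = [t.get_text() for t in ax.yaxis.get_ticklabels()]
print(labels[:3])  # every y tick now starts with '$'
```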

&lt;h4&gt;
  
  
  Sample 2: Percentages
&lt;/h4&gt;

&lt;p&gt;What about percentages? Let's show the y-axis with a percentage sign and 2 decimal places. I can use our familiar syntax but with a tweak:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;ax.yaxis.set_major_formatter('%{x:,.2f}')&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--sEGhqirj--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ilk2oo2ptsfysippscbm.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--sEGhqirj--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ilk2oo2ptsfysippscbm.png" alt="Image description" width="431" height="279"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Sample 3: Additional Text
&lt;/h4&gt;

&lt;p&gt;How does it look with a longer string? Let's have a little fun.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;ax.yaxis.set_major_formatter('Total Days {x:,.0f} until the Apocalypse')&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--X4vOJteO--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/f9mbp43po2ohzler9gqy.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--X4vOJteO--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/f9mbp43po2ohzler9gqy.png" alt="Image description" width="571" height="276"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This is bad form, but you can see how the plot adjusts to show the full sentence, with commas as thousands separators.  &lt;/p&gt;

&lt;h3&gt;
  
  
  Tick Formatting - Pre-Built Formatter Instance
&lt;/h3&gt;

&lt;h4&gt;
  
  
  Intro
&lt;/h4&gt;

&lt;p&gt;Besides string formatting, another option is the pre-built formatter instances that matplotlib provides, which can be used for more specific number representations. A few common options deal with dates, engineering notation, etc. We'll explore a few here. &lt;/p&gt;

&lt;h4&gt;
  
  
  Sample 4a: DateFormatter (Manual)
&lt;/h4&gt;

&lt;p&gt;Let's use the same dataset we used above and see how &lt;code&gt;DateFormatter&lt;/code&gt; looks.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--3Q2qx6k5--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/xstez3r7sow4k4rs0ov9.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--3Q2qx6k5--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/xstez3r7sow4k4rs0ov9.png" alt="Image description" width="428" height="276"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As you can see, we got the numbers to appear as dates. How did we do that? We had to import &lt;code&gt;matplotlib.dates&lt;/code&gt; and then use &lt;code&gt;DateFormatter&lt;/code&gt;. I used the same dataset to easily see how the formatting works, but it's important to note how &lt;code&gt;int&lt;/code&gt; or &lt;code&gt;float&lt;/code&gt; types are converted to datetimes. In &lt;code&gt;DateFormatter&lt;/code&gt;, this is "done by converting date instances into days since an epoch (by default 1970-01-01T00:00:00)" (from Matplotlib.org). Why 1970? That's the Unix epoch, but I agree that everything before 1970 could have been created by a simulation.&lt;/p&gt;

&lt;p&gt;Let's look at the source code underneath to see what we added, starting with the first line, &lt;code&gt;import matplotlib.dates as mdates&lt;/code&gt;:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;import matplotlib.dates as mdates
import pandas as pd

df_sample = pd.DataFrame([400, 200, 200, 800, 100, 0, 1200, 700, 800, 700, 200, 400],
            [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12])

ax = df_sample.plot.bar()

ax.set_title('Sample')
ax.set_xlabel('X-Axis')
ax.set_ylabel('Y-Axis')
ax.legend().remove()

ax.yaxis.set_major_formatter(mdates.DateFormatter("%Y-%b"))
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Note the last line we used here. I used the &lt;code&gt;mdates.DateFormatter&lt;/code&gt; class to show the year and the 3-letter month using &lt;code&gt;"%Y-%b"&lt;/code&gt;. I could display the date a few other ways, like &lt;code&gt;"%m-%d-%y"&lt;/code&gt; (Month-Day-Year), for example. These are both considered manual ways to utilize &lt;code&gt;DateFormatter&lt;/code&gt;. &lt;/p&gt;

&lt;h4&gt;
  
  
  Sample 4b: Date (Automatic)
&lt;/h4&gt;

&lt;p&gt;For a more automated, layout-conscious format there's a class called &lt;code&gt;ConciseDateFormatter&lt;/code&gt;. This automatically configures the date in the most concise way, based on the plot. Unlike previous examples, it requires the locations of the ticks. For simplicity's sake, I'll reuse the existing tick locations via the locator instance returned by &lt;code&gt;get_major_locator&lt;/code&gt;. Let's see:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;ax.yaxis.set_major_formatter(mdates.ConciseDateFormatter(ax.yaxis.get_major_locator()))&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--pzpbhoyn--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/4fia19oura0xv7fj64b7.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--pzpbhoyn--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/4fia19oura0xv7fj64b7.png" alt="Image description" width="392" height="276"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;You can see the formatter decided to display just the month at the bottom, the days of the month along the y-axis, and the final year and month at the top. This doesn't always reveal the desired look but it's good to know if you're trying to save space on your figure.&lt;/p&gt;

&lt;h4&gt;
  
  
  Sample 5: Engineering Notation (Manual)
&lt;/h4&gt;

&lt;p&gt;When displaying scientific units, engineering notation is often the best way. To do this, we import &lt;code&gt;EngFormatter&lt;/code&gt;, then use &lt;code&gt;set_major_formatter&lt;/code&gt; while specifying our units. So, with our same original dataset and code, we simply add:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;from matplotlib.ticker import EngFormatter&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;and then &lt;/p&gt;

&lt;p&gt;&lt;code&gt;ax.yaxis.set_major_formatter(EngFormatter(unit='kg'))&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--AR6j_smJ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ux4rlhexwjd4dti6hl66.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--AR6j_smJ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ux4rlhexwjd4dti6hl66.png" alt="Image description" width="412" height="276"&gt;&lt;/a&gt;&lt;br&gt;
The plot now shows the 'k' representing 1,000 next to the kilogram. &lt;/p&gt;

&lt;h4&gt;
  
  
  Sample 6: Logarithm Exponents
&lt;/h4&gt;

&lt;p&gt;And of course, for the ever popular logarithmic exponents, there is a formatter that will return said exponents, in this case log base 10 (remember, our y-axis from the original dataset ran 0-1200).&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--2WQUXoGl--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ygpx1lz9iha7m5tbeiqt.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--2WQUXoGl--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/ygpx1lz9iha7m5tbeiqt.png" alt="Image description" width="423" height="423"&gt;&lt;/a&gt;&lt;br&gt;
The y axis shows the exponent of the log. How did we get 3 and 3.08? (Obnoxious reminder: 10^x = 1000, so x = 3. And 10^x = 1200, so x ≈ 3.08.)&lt;/p&gt;
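&lt;p&gt;The screenshot above doesn't show the code for this one. One formatter that returns base-10 exponents is &lt;code&gt;LogFormatterExponent&lt;/code&gt; from &lt;code&gt;matplotlib.ticker&lt;/code&gt;, so a sketch along those lines (note the dataset's 0 becomes 1 here, since log axes need strictly positive values) might be:&lt;/p&gt;

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; no display needed
import matplotlib.pyplot as plt
from matplotlib.ticker import LogFormatterExponent

fig, ax = plt.subplots()
ax.bar(range(1, 13), [400, 200, 200, 800, 100, 1, 1200, 700, 800, 700, 200, 400])
ax.set_yscale("log")  # log axes need positive data, hence 1 instead of 0
ax.yaxis.set_major_formatter(LogFormatterExponent())  # label ticks with exponents

fig.canvas.draw()
labels = [t.get_text() for t in ax.yaxis.get_ticklabels()]
print(labels)
```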

&lt;h4&gt;
  
  
  Other Pre-built Function Formatters
&lt;/h4&gt;

&lt;p&gt;Here is the complete list, courtesy of Matplotlib.org:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--VMd3bIdo--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/bhgnt6kvdkbdi7rmapd0.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--VMd3bIdo--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://dev-to-uploads.s3.amazonaws.com/uploads/articles/bhgnt6kvdkbdi7rmapd0.png" alt="Image description" width="680" height="651"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Tick Formatting - Custom Function Formatter
&lt;/h3&gt;

&lt;p&gt;The third option for &lt;code&gt;set_major_formatter&lt;/code&gt; is to write a custom formatting function. This function takes in &lt;code&gt;x&lt;/code&gt;, the value of the tick, and &lt;code&gt;pos&lt;/code&gt;, the position of the tick, and returns a &lt;code&gt;str&lt;/code&gt; of what you want displayed. &lt;/p&gt;
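&lt;p&gt;For example, a hypothetical &lt;code&gt;thousands&lt;/code&gt; function (the name and format are my own invention) can be passed straight to &lt;code&gt;set_major_formatter&lt;/code&gt;, which wraps it in a &lt;code&gt;FuncFormatter&lt;/code&gt; for you:&lt;/p&gt;

```python
import matplotlib
matplotlib.use("Agg")  # render off-screen; no display needed
import matplotlib.pyplot as plt

def thousands(x, pos):
    """x is the tick value, pos its index; return the string to display."""
    return f"{x / 1000:.1f}k"   # e.g. 1200.0 becomes '1.2k'

fig, ax = plt.subplots()
ax.bar(range(1, 13), [400, 200, 200, 800, 100, 0, 1200, 700, 800, 700, 200, 400])
ax.yaxis.set_major_formatter(thousands)  # a plain function is accepted directly

fig.canvas.draw()
labels = [t.get_text() for t in ax.yaxis.get_ticklabels()]
print(labels[:2])  # every y tick now ends in 'k'
```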

&lt;h2&gt;
  
  
  Summary
&lt;/h2&gt;

&lt;p&gt;This post should familiarize you with the &lt;code&gt;set_major_formatter&lt;/code&gt; function and provide some simple examples. These examples should save you time and are also applicable to other formatters (like &lt;code&gt;set_minor_formatter&lt;/code&gt;). I've also provided additional resources on topics not covered here. This matters because every graph tells a story for your audience; the clearer and more concise your storytelling, the better it is for everyone!&lt;/p&gt;

&lt;h3&gt;
  
  
  Resources
&lt;/h3&gt;

&lt;p&gt;Matplotlib.org - Axis.set_major_formatter(formatter)&lt;br&gt;
&lt;a href="https://matplotlib.org/stable/api/_as_gen/matplotlib.axis.Axis.set_major_formatter.html#matplotlib.axis.Axis.set_major_formatter"&gt;https://matplotlib.org/stable/api/_as_gen/matplotlib.axis.Axis.set_major_formatter.html#matplotlib.axis.Axis.set_major_formatter&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Anatomy of a Figure&lt;br&gt;
&lt;a href="https://matplotlib.org/stable/gallery/showcase/anatomy.html"&gt;https://matplotlib.org/stable/gallery/showcase/anatomy.html&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Matplotlib.org - Tick Locating and Formatting&lt;br&gt;
&lt;a href="https://matplotlib.org/stable/api/ticker_api.html#tick-locating-and-formatting"&gt;https://matplotlib.org/stable/api/ticker_api.html#tick-locating-and-formatting&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Matplotlib.org - Date Tick Labels&lt;br&gt;
&lt;a href="https://matplotlib.org/stable/gallery/text_labels_and_annotations/date.html#sphx-glr-gallery-text-labels-and-annotations-date-py"&gt;https://matplotlib.org/stable/gallery/text_labels_and_annotations/date.html#sphx-glr-gallery-text-labels-and-annotations-date-py&lt;/a&gt;&lt;/p&gt;

</description>
      <category>beginners</category>
      <category>plots</category>
      <category>formatting</category>
      <category>ticklabels</category>
    </item>
  </channel>
</rss>
