<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Luciano Strika</title>
    <description>The latest articles on DEV Community by Luciano Strika (@strikingloo).</description>
    <link>https://dev.to/strikingloo</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F156204%2F43b00ecf-1d3a-4661-bea4-6a31c484daa3.jpg</url>
      <title>DEV Community: Luciano Strika</title>
      <link>https://dev.to/strikingloo</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/strikingloo"/>
    <language>en</language>
    <item>
      <title>How to Create a Spoiler Tag in HTML</title>
      <dc:creator>Luciano Strika</dc:creator>
      <pubDate>Sun, 24 Mar 2024 05:46:42 +0000</pubDate>
      <link>https://dev.to/strikingloo/how-to-create-a-spoiler-tag-in-html-1ni7</link>
      <guid>https://dev.to/strikingloo/how-to-create-a-spoiler-tag-in-html-1ni7</guid>
      <description>&lt;p&gt;Many forums or blogs make use of the spoiler tag: a little button or anchor that, if clicked, reveals otherwise invisible content.&lt;/p&gt;

&lt;p&gt;I wanted to add this functionality to the site for Tables of Content, so I figured adding this guide here could be useful both for my own future reference and for anyone else looking for a concise explanation.&lt;/p&gt;

&lt;p&gt;In this post we will code a spoiler tag: an anchor that shows or hides an HTML element when clicked.&lt;/p&gt;

&lt;p&gt;The implementation is divided into three parts: a CSS class, the HTML markup, and a small JavaScript function.&lt;/p&gt;

&lt;h3&gt;
  
  
  CSS Example Class
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight css"&gt;&lt;code&gt;    &lt;span class="nc"&gt;.spoiler-content&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="nl"&gt;display&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;none&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With the &lt;code&gt;display&lt;/code&gt; property set to &lt;code&gt;none&lt;/code&gt;, we make content invisible (but still part of the page's HTML). Setting this to &lt;code&gt;block&lt;/code&gt; would make it visible again.&lt;/p&gt;

&lt;h3&gt;
  
  
  HTML Part
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight html"&gt;&lt;code&gt;&lt;span class="nt"&gt;&amp;lt;a&lt;/span&gt; &lt;span class="na"&gt;href=&lt;/span&gt;&lt;span class="s"&gt;"#"&lt;/span&gt; &lt;span class="na"&gt;onclick=&lt;/span&gt;&lt;span class="s"&gt;"toggleSpoiler(event, '1')"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;Table of Contents&lt;span class="nt"&gt;&amp;lt;/a&amp;gt;&lt;/span&gt;

&lt;span class="nt"&gt;&amp;lt;div&lt;/span&gt; &lt;span class="na"&gt;class=&lt;/span&gt;&lt;span class="s"&gt;"spoiler-content"&lt;/span&gt; &lt;span class="na"&gt;id=&lt;/span&gt;&lt;span class="s"&gt;'1'&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;
    &lt;span class="c"&gt;&amp;lt;!-- Your invisible content here.&amp;gt; &amp;lt;/!--&amp;gt;&lt;/span&gt;
&lt;span class="nt"&gt;&amp;lt;/div&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Pretty straightforward: the &lt;code&gt;div&lt;/code&gt; has the &lt;code&gt;spoiler-content&lt;/code&gt; class that hides it by default, plus a unique &lt;code&gt;id&lt;/code&gt;. We pair that content with the anchor by passing the same id as the second argument to the &lt;code&gt;toggleSpoiler&lt;/code&gt; function.&lt;/p&gt;

&lt;h3&gt;
  
  
  JavaScript Part
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;toggleSpoiler&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;event&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;event&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;preventDefault&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
    &lt;span class="kd"&gt;var&lt;/span&gt; &lt;span class="nx"&gt;spoilerContent&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;document&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getElementById&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;spoilerContent&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;style&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;display&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;none&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="nx"&gt;spoilerContent&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;style&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;display&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="dl"&gt;''&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="nx"&gt;spoilerContent&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;style&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;display&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;block&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="nx"&gt;spoilerContent&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;style&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;display&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;none&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For the script, we define &lt;code&gt;toggleSpoiler&lt;/code&gt; so that, given an id, the element with that id becomes hidden if it is visible, and vice versa. The extra check for &lt;code&gt;display === ''&lt;/code&gt; is needed because &lt;code&gt;element.style.display&lt;/code&gt; reflects only inline styles, not styles applied through a CSS class: on the first click it reads &lt;code&gt;''&lt;/code&gt; even though the class is hiding the element, so without the check you would need to click twice to reveal the content.&lt;/p&gt;

&lt;p&gt;And there you have it: a simple spoiler tag in plain HTML/JS. Note that you could use a button or any other element instead of the anchor, and the div can contain any arbitrary HTML elements.&lt;/p&gt;

</description>
      <category>html</category>
      <category>webdev</category>
      <category>javascript</category>
      <category>css</category>
    </item>
    <item>
      <title>Ant Colony Optimization and the Travelling Salesman Problem</title>
      <dc:creator>Luciano Strika</dc:creator>
      <pubDate>Mon, 12 Sep 2022 14:48:52 +0000</pubDate>
      <link>https://dev.to/strikingloo/ant-colony-optimization-and-the-travelling-salesman-problem-1034</link>
      <guid>https://dev.to/strikingloo/ant-colony-optimization-and-the-travelling-salesman-problem-1034</guid>
      <description>&lt;p&gt;Ant Colony Optimization algorithms always intrigued me. They are loosely based in biology and the real protocols ants use to communicate and plan routes. They do this by coordinating through small pheromone messages: chemical trails they leave as they move forward, signaling for other ants to follow them. Even though each ant is not especially smart, and they follow simple rules individually, collectively they can converge to complex behaviors as a system, and amazing properties emerge.&lt;/p&gt;

&lt;p&gt;In the computational sense, Ant Colony Optimization algorithms tackle complex optimization problems for which no closed-form or polynomial-time solution exists, by trying different "routes" across some relevant space or graph and searching for the most efficient one (typically the shortest) that satisfies the problem's constraints.&lt;/p&gt;

&lt;p&gt;Personally, I had a debt with myself from an &lt;em&gt;Algorithms III&lt;/em&gt; class five years ago, where Ant Colony Optimization was mentioned as an alternative to simulated annealing and Genetic Algorithms, but not expanded on and left as an exercise for future study. The concept sounded interesting back then, but since I was busy with other matters I decided to postpone studying it. Now that I have more free time, I finally decided to give it a try. And what better way to verify that I learned it than coding an Ant Colony Optimization algorithm from scratch and showing it here?&lt;/p&gt;

&lt;p&gt;First, let's start with some motivation: why would you want to learn about Ant Colony Optimization?&lt;/p&gt;

&lt;h2&gt;
  
  
  The Travelling Salesman Problem
&lt;/h2&gt;

&lt;p&gt;One especially important use-case for Ant Colony Optimization (ACO from now on) algorithms is solving the Travelling Salesman Problem (TSP).&lt;/p&gt;

&lt;p&gt;This problem is defined as follows: &lt;em&gt;Given a complete graph G with weighted edges, find the minimum weight Hamiltonian cycle. That is, a cycle that passes through each node exactly once and minimizes the total weight sum.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Note that the graph needs to be &lt;em&gt;complete&lt;/em&gt;: there must exist an edge connecting each possible pair of nodes. For graphs based on real places, this makes sense: you can just connect two places with an edge whose weight equals their distance, or their estimated travel time. &lt;/p&gt;

&lt;p&gt;For a concrete example, look at the following graph.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--T8ckhfFu--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://strikingloo.github.io/resources/post_image/TSP-graph-example.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--T8ckhfFu--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://strikingloo.github.io/resources/post_image/TSP-graph-example.png" alt="an image of a graph for travelling salesman problem" width="788" height="582"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In this case, the salesman wants to visit every home once and get back to where he started. Each edge joining two houses has a numeric label representing the travel time between them in minutes. The salesman is a busy man, and would prefer to spend as little time as possible visiting all the houses. What would be the most efficient route?&lt;/p&gt;

&lt;p&gt;As an example, if we started from the house on the top left, we would want to go to the bottom house, then right, then center, then back left, for a total of 80 minutes of travel. Since this is a small instance, you can take a little time to convince yourself by hand that this is the right answer: try to find a different route that takes less time to visit the four houses.&lt;/p&gt;

&lt;p&gt;Why is the Travelling Salesman Problem important? Many reasons. &lt;/p&gt;

&lt;p&gt;First of all, &lt;strong&gt;TSP appears everywhere in logistics&lt;/strong&gt;. Imagine you need to make multiple deliveries with a truck. You have packages, each of which has to go to a different place. What is the most time-efficient order to deliver them in and then go back to the warehouse? You just found the Travelling Salesman Problem.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;TSP is also important because it is an NP-complete problem&lt;/strong&gt;. That means that within the family of NP (nondeterministic polynomial time) problems, those for which verifying a solution takes polynomial time even if finding one is harder, it sits in the hardest category: any other NP problem can be transformed into a TSP instance in polynomial time (sometimes through very esoteric means, but still), so if we found a polynomial-time algorithm for TSP, we would have found one for every NP problem.&lt;/p&gt;

&lt;p&gt;Showing that TSP can be solved in polynomial time would prove P=NP. This would be huge, to the point of being considered one of this century's biggest open questions. Suddenly swathes of hard problems would become tractable, many new applications would open up, and multiple kinds of software would become vastly more efficient. What it would do for logistics alone would probably contribute significantly to the world's GDP and global trade.&lt;/p&gt;

&lt;p&gt;But before I digress further, now that we know what TSP is, let's see how to solve it. For more information, I recommend the &lt;a href="https://en.wikipedia.org/wiki/Travelling_salesman_problem"&gt;Wikipedia article on TSP&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Ant Colony Optimization: Solving TSP
&lt;/h2&gt;

&lt;p&gt;There are many possible ways to solve the Travelling Salesman Problem for a given graph. As discussed above, there is no known way to find the certified best solution for an arbitrary graph quickly; exact methods take a very long time on large instances.&lt;/p&gt;

&lt;p&gt;The trivial way to solve TSP would be to enumerate all possible Hamiltonian cycles and keep the best one. This means looking at all possible orderings of nodes, which grow factorially, O(N!), with the number N of nodes. Factorial growth is much worse than exponential growth, for any base. It is so bad that even parallelism would not help: since adding a single node makes the problem N times harder, each extra node in the graph would require growing the infrastructure superexponentially just to keep up. This would be extremely inefficient.&lt;/p&gt;
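&lt;p&gt;To make the factorial blow-up concrete, here is a minimal brute-force solver for a toy four-node instance (the distance matrix is made up purely for illustration):&lt;/p&gt;

```python
from itertools import permutations

# Made-up symmetric distance matrix for a toy 4-node instance.
D = [
    [0, 10, 15, 20],
    [10, 0, 35, 25],
    [15, 35, 0, 30],
    [20, 25, 30, 0],
]

def cycle_weight(order):
    """Total weight of the Hamiltonian cycle visiting nodes in this order."""
    n = len(order)
    return sum(D[order[i]][order[(i + 1) % n]] for i in range(n))

def brute_force_tsp(n):
    """Enumerate all (n-1)! cycles starting from node 0 and keep the best."""
    best = (float("inf"), ())
    for perm in permutations(range(1, n)):
        order = (0,) + perm
        best = min(best, (cycle_weight(order), order))
    return best
```

&lt;p&gt;With four nodes this checks only 3! = 6 orderings, but at 12 nodes it would already need to check about 40 million.&lt;/p&gt;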

&lt;p&gt;Due to this, instead of looking for the exact solution for a graph, what most frameworks and solvers do is find approximate solutions: can we find a way of connecting all nodes in a cycle that is "good enough"? To achieve this, multiple optimization algorithms exist. The &lt;em&gt;Networkx&lt;/em&gt; framework for graphs in Python solves TSP with &lt;em&gt;&lt;a href="https://en.wikipedia.org/wiki/Christofides_algorithm"&gt;Christofides&lt;/a&gt;&lt;/em&gt; or &lt;em&gt;&lt;a href="https://en.wikipedia.org/wiki/Simulated_annealing"&gt;Simulated Annealing&lt;/a&gt;&lt;/em&gt;, for example, of which the latter is quite similar to Ant Colony Optimization. Christofides has the nice property of never being wrong by more than 50% on metric graphs (so if the best cycle has a weight of 100, Christofides is guaranteed to find a cycle of weight at most 150).&lt;/p&gt;

&lt;p&gt;The algorithm we will see today is one such way of approximating a solution. &lt;/p&gt;

&lt;p&gt;In &lt;strong&gt;&lt;a href="https://en.wikipedia.org/wiki/Ant_colony_optimization_algorithms"&gt;Ant Colony Optimization Algorithms&lt;/a&gt;&lt;/strong&gt;, we will run a simulation of "ants" traversing the graph, constrained to only move in cycles, visiting each node exactly once. Each ant will leave, after finishing its traversal, a trail of pheromones that is proportional to the inverse weight of the discovered cycle (that is, if the cycle the ant encountered is twice as big, it will leave half the pheromones on each edge of the graph it went through, and so on). &lt;/p&gt;

&lt;p&gt;Finally, though we will make ants choose which edge to take at each step of their traversal randomly, they will assign more preference to edges with more pheromones on them, and less preference to those with fewer. Additionally, longer edges will receive less preference, since they imply higher travel times.&lt;/p&gt;

&lt;p&gt;These two preference adjustments could be linear, or any other polynomial (in my case, I tried many different coefficients and found the optimum to be sublinear for the pheromones, and quadratic or a power of 1.5 for the distance).&lt;/p&gt;

&lt;p&gt;The pseudocode Wikipedia gives is:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;procedure ACO_MetaHeuristic is
    while not terminated do
        generateSolutions()
        daemonActions()
        pheromoneUpdate()
    repeat
end procedure
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;For this post, I coded Ant Colony Optimization (initially proposed by Marco Dorigo in 1992 in his PhD thesis) from scratch in Python using the Wikipedia article as a reference. I then ran a few experiments with it and benchmarked it against other algorithms for different problem instances.&lt;/p&gt;

&lt;p&gt;I used numpy for the traversals and other numerical operations, and pytest for testing. The whole code is &lt;a href="https://github.com/StrikingLoo/ant-colony-optimization"&gt;available on GitHub&lt;/a&gt;, but I will show you the main parts step-by-step now. If you're not interested in how the Ant Colony Optimization algorithm works in detail, you can skip straight to the results and benchmarks.&lt;/p&gt;

&lt;p&gt;First of all, I designed a minimal Graph class, whose code I will not include here since it is very simple. Suffice it to say that its &lt;em&gt;.distance&lt;/em&gt; property holds an adjacency matrix with the weight (distance) for each edge.&lt;/p&gt;

&lt;p&gt;Then I coded the &lt;code&gt;traverse_graph&lt;/code&gt; function, which represents a single ant going through the graph one node at a time, constrained to move in a cycle. &lt;/p&gt;

&lt;p&gt;The ant starts from a given node, and will at each step choose from among every node it has not stepped on yet, with a weighted distribution that assigns preference proportional to an edge's pheromone load and to the inverse of its distance, each raised to a power that is a hyperparameter coefficient (&lt;em&gt;alpha&lt;/em&gt; and &lt;em&gt;beta&lt;/em&gt; respectively).&lt;/p&gt;

&lt;p&gt;That is, the probability of choosing a certain edge will be proportional to:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--7Gd64S9g--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://strikingloo.github.io/resources/post_image/weight.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--7Gd64S9g--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://strikingloo.github.io/resources/post_image/weight.png" alt="weight equation for ant colony optimization" width="462" height="343"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Where P is the level of pheromones in that edge, and D the distance the edge covers. To get the distribution we sample from at each random jump, we normalize these weight coefficients so they add up to one.&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;
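&lt;p&gt;The weighted choice above can be sketched as follows (the names are illustrative, not necessarily the repository's code; it assumes numpy matrices for pheromone levels and distances):&lt;/p&gt;

```python
import numpy as np

def choose_next_node(current, unvisited, pheromone, distance, alpha=0.9, beta=1.5):
    """Sample the ant's next node: preference is proportional to
    P**alpha times (1/D)**beta, normalized so the weights add up to one."""
    cand = np.array(sorted(unvisited))
    weights = pheromone[current, cand] ** alpha * (1.0 / distance[current, cand]) ** beta
    return int(np.random.choice(cand, p=weights / weights.sum()))
```

&lt;p&gt;Normalizing by the sum turns the raw preference weights into the probability distribution we sample from at each jump.&lt;/p&gt;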



&lt;p&gt;After that, the optimization procedure itself consists of:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Initialize the graph with a constant (typically initially very high, to encourage exploration) amount of pheromones on each edge.&lt;/li&gt;
&lt;li&gt;Make &lt;em&gt;k&lt;/em&gt; ants start from random nodes and traverse the graph using the procedure defined above.&lt;/li&gt;
&lt;li&gt;For each traversal, update the level of pheromones in its edges according to the function &lt;em&gt;Q/total_weight&lt;/em&gt;, where Q is a hyperparameter (a constant) and &lt;em&gt;total_weight&lt;/em&gt; is the sum of the distances of all the edges in the cycle. If using &lt;em&gt;elitism&lt;/em&gt;, add to the list of traversals the best one we have encountered so far, to incentivize the ants not to deviate too far from it.&lt;/li&gt;
&lt;li&gt;If a cycle was found that beats the best one so far, update it.&lt;/li&gt;
&lt;li&gt;All pheromone levels are multiplied by a &lt;em&gt;degradation constant&lt;/em&gt;, another hyperparameter between 0 and 1, which represents the passage of time and prevents bad past solutions from influencing good recent ones too much.&lt;/li&gt;
&lt;li&gt;Repeat for a certain number of iterations, or until convergence.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Intuitively, this converges to short cycles because &lt;strong&gt;the shorter an ant's cycle, the more pheromone it leaves on that cycle's edges&lt;/strong&gt;; as old pheromones fade over time and new ants favor edges with more pheromones on them, &lt;strong&gt;new cycles will tend to be ever shorter&lt;/strong&gt;. Crucially, since each ant chooses its next step at random, even though ants will &lt;em&gt;tend&lt;/em&gt; to pick the candidates with the most pheromone every time, they also have a non-negligible probability of picking a different edge and going off exploring. Should that lead to a better cycle overall, that ant will tell future ants about it by leaving even more pheromones, since the cycle is shorter.&lt;/p&gt;

&lt;p&gt;Over time, we would expect the average ant traversal to get shorter and shorter.&lt;/p&gt;

&lt;p&gt;Additionally, I tried a few more modifications to the algorithm: the 'elite' or best candidate can be specified manually at the start (which allows reusing the best solution from other runs), and I designed a protocol for increasing the amount of pheromones everywhere by a constant if progress stagnated (no new best cycle found after &lt;em&gt;patience&lt;/em&gt; iterations), though I did not achieve better results through that. Also, after running &lt;em&gt;k&lt;/em&gt; ants, I only updated the pheromone trails with the best &lt;em&gt;k/2&lt;/em&gt; ants' traversals instead of using them all. This did improve results quite significantly, as did using elite candidates: not keeping them made the algorithm more unstable, and it converged a lot more slowly.&lt;/p&gt;

&lt;p&gt;Here is the whole function in all its glory (with comments for sanity).&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;
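&lt;p&gt;As a condensed, self-contained sketch of the procedure described above (simplified, with illustrative names; the actual repository code differs in its details):&lt;/p&gt;

```python
import numpy as np

def traverse(distance, pheromone, alpha, beta, start=0):
    """One ant builds a Hamiltonian cycle; at each step it samples the next
    node with preference pheromone**alpha times (1/distance)**beta."""
    n = distance.shape[0]
    cycle, current = [start], start
    unvisited = set(range(n)) - {start}
    while unvisited:
        cand = np.array(sorted(unvisited))
        w = pheromone[current, cand] ** alpha * (1.0 / distance[current, cand]) ** beta
        nxt = int(np.random.choice(cand, p=w / w.sum()))
        cycle.append(nxt)
        unvisited.remove(nxt)
        current = nxt
    weight = sum(distance[cycle[i], cycle[(i + 1) % n]] for i in range(n))
    return cycle, weight

def ant_colony(distance, n_ants=10, n_iters=100, alpha=0.9, beta=1.5,
               Q=100.0, degradation=0.9):
    """Simplified main loop: k ants traverse, the best k/2 (plus the elite
    best-so-far cycle) deposit Q/total_weight on their edges, then trails fade."""
    n = distance.shape[0]
    pheromone = np.full((n, n), 1.0)  # high initial load encourages exploration
    best_cycle, best_weight = None, float("inf")
    for _ in range(n_iters):
        ants = sorted((traverse(distance, pheromone, alpha, beta)
                       for _ in range(n_ants)), key=lambda t: t[1])
        if ants[0][1] == min(ants[0][1], best_weight):  # new best cycle found
            best_cycle, best_weight = ants[0]
        deposits = ants[: max(1, n_ants // 2)]          # best k/2 deposit
        deposits.append((best_cycle, best_weight))      # elitism
        for cycle, weight in deposits:
            for i in range(n):
                a, b = cycle[i], cycle[(i + 1) % n]
                pheromone[a, b] += Q / weight
                pheromone[b, a] += Q / weight
        pheromone *= degradation                        # old trails fade
    return best_cycle, best_weight
```

&lt;p&gt;On a small toy matrix this reliably recovers the optimal cycle within a few hundred iterations.&lt;/p&gt;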


&lt;p&gt;Some possible improvements for this algorithm that I didn't have the time for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Traversals could be trivially parallelized, since each ant is independent. This can be done very easily using the &lt;em&gt;multiprocessing&lt;/em&gt; Python module, but it doesn't work on Mac by default. In this tradeoff, I chose portability over speed.&lt;/li&gt;
&lt;li&gt;Choosing the next jump in a traversal can be done in parallel with numpy vector multiplication, which made everything run about 5x faster. However, due to numerical instability, a jump could be performed to the same node over and over even though I was multiplying its weight by zero, and solving this bug would have taken more time than I thought it was worth. If you find a way to make this work for all cases, feel free to make a pull request and you will get the credit and a link.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Tests and Results
&lt;/h2&gt;

&lt;p&gt;After coding the algorithm and testing it in toy cases, I was very happy to find that the internet had provided me with a wealth of different graphs and TSP problems to try it on.&lt;/p&gt;

&lt;p&gt;I got my first small but real test case from this &lt;a href="https://towardsdatascience.com/solving-the-travelling-salesman-problem-for-germany-using-networkx-in-python-2b181efd7b07"&gt;Medium article&lt;/a&gt;, which uses data on real German cities. I was happy to see ACO found the optimal solution in seconds! &lt;/p&gt;

&lt;p&gt;Then I found the huge &lt;a href="http://cs.uef.fi/sipu/santa/data.html"&gt;Santa Claus Challenge&lt;/a&gt; with coordinates data representing millions of houses in Finland (for Santa to visit). The entire dataset did not fit in memory, so I could not verify how close my solution got to the best ones in the challenge, but taking ever bigger samples let me see how fast or slow each part of the program was for profiling. Go to the &lt;a href="https://www.frontiersin.org/articles/10.3389/frobt.2021.689908/full"&gt;challenge's article&lt;/a&gt; for a fun read.&lt;/p&gt;

&lt;p&gt;Finally, my favorite resource for finding TSP problems, often with their optimal cycle's weight, was &lt;a href="http://comopt.ifi.uni-heidelberg.de/software/TSPLIB95/XML-TSPLIB/instances/"&gt;Heidelberg University's site&lt;/a&gt;. &lt;/p&gt;

&lt;p&gt;I used that site's Berlin dataset for most of my benchmarking and hyperparameter optimization, from which I found the best &lt;em&gt;alpha&lt;/em&gt; and &lt;em&gt;beta&lt;/em&gt; values to be around &lt;em&gt;0.9&lt;/em&gt; and &lt;em&gt;1.5&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;I was very happy to see that, while Networkx's &lt;em&gt;TSP solve&lt;/em&gt; took 2 seconds and this program took a couple of minutes, my solution for that dataset had a weight of ~44000 whereas Networkx's was around 46k. This shows that in some cases, even though slower, ACO algorithms can be a good approach for solving TSP problems.&lt;/p&gt;

&lt;h2&gt;
  
  
  Experiments
&lt;/h2&gt;

&lt;p&gt;Encouraged by the comments on Reddit, I decided to experiment further and see how the optimization behaved in different situations. &lt;/p&gt;

&lt;p&gt;Particularly, since ACO can be updated online, it is supposed to perform very well in dynamic network or logistics problems where the graph is shifting in real time, in comparison with other algorithms which need to be re-run from scratch.&lt;/p&gt;

&lt;p&gt;Since the ants update their pheromone trails in real time, whenever the edge distances shift they should eventually notice and change their paths to reflect it. For instance, if two nodes got closer (the distance value on the edge joining them was reduced), more ants should want to cross between them, and that edge's pheromone load should grow larger. Alternatively, if two nodes grow farther apart, the ants should shun the edge between them.&lt;/p&gt;

&lt;p&gt;To test whether this was the case, I tried two experiments. In both of them I started with the Berlin graph I had looked at earlier, on which I knew the algorithm converged after about 500 iterations of 50 ants each.&lt;/p&gt;

&lt;p&gt;For the first experiment, after the 500th iteration I selected the edge with the highest amount of pheromones, and made its weight 10 times bigger. That is, if the edge was joining nodes i and j, then the distance between them grew 10 times larger.&lt;/p&gt;

&lt;p&gt;I wanted to see how quickly the swarm would respond to this change, so I plotted the pheromone load for that edge from iteration 500 onwards for 500 more iterations.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--WzgNSNp4--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://strikingloo.github.io/resources/post_image/ant_trail_smaller.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--WzgNSNp4--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://strikingloo.github.io/resources/post_image/ant_trail_smaller.png" alt="An image depicting a graph of decreasing pheromone trails after an edge's weight grew bigger" width="640" height="480"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As you can see, the ants don't respond instantaneously to the changes, but after 30 iterations they have adapted to them and do not visit that edge nearly as often as before. Its pheromone level remains very low afterward, with occasional peaks probably due to some of the exploration incentives I set.&lt;/p&gt;

&lt;p&gt;For the second experiment, I took a random Hamiltonian cycle and divided all of its edge weights by 10. This cycle suddenly became tempting for the ants: a cheap way of traversing the whole graph, smaller by an order of magnitude. Again this change took place at the 500th iteration, and I wanted to see how the ants reacted.&lt;/p&gt;

&lt;p&gt;I looked at the mean pheromone load for edges in the diminished cycle, and this is what it looked like.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--drztmb7e--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://strikingloo.github.io/resources/post_image/ant_trail_mean.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--drztmb7e--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://strikingloo.github.io/resources/post_image/ant_trail_mean.png" alt="An image depicting a graph of increasing pheromone trails after a cycle grew shorter, incentivizing ants to explore it" width="640" height="480"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As expected, the ants were highly incentivized to deviate from their known paths and explore this cycle (it had a third of the weight of the next smallest cycle that the colony had found so far). After a single iteration, the average pheromone levels for that cycle had increased dramatically. &lt;/p&gt;

&lt;p&gt;This shows that, as long as the algorithm contemplates the possibility of change by always encouraging a minimum level of exploration, new opportunities can be exploited as they arise. &lt;/p&gt;

&lt;p&gt;Interestingly, when I plotted the minimum level of pheromones instead of the mean, it did not rise very much. I think this is because, even after dividing by ten, a few of the edges in the best solution were still not included in this cycle. This is further supported by the dip in average pheromone levels near the end of the graph above. I believe in the last 50 iterations a cycle was found that contained an edge that had not been diminished, but was nonetheless small enough to present an improvement.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusions
&lt;/h2&gt;

&lt;p&gt;We showed that Ant Colony Optimization can be implemented quite easily in Python, and since many of its operations can be vectorized or parallelized it should not be too slow, though it is not as fast as Christofides's algorithm or others.&lt;/p&gt;

&lt;p&gt;More importantly, we showed that in many datasets, ACO can converge to the optimal solution, and in many others its flexibility allows it to find better solutions (shorter traversals) than simpler algorithms.&lt;/p&gt;

&lt;p&gt;Additionally, we saw that one of the best properties of Ant Colony Optimization over other algorithms is its capacity for online adaptation to changes in the system. In certain situations this could prove critical for performance, especially where rapid response is required.&lt;/p&gt;

&lt;p&gt;On a more philosophical level, I think it is beautiful how by specifying a large set of simple agents that each follow very few rules, we could solve a problem that is known to be hard.&lt;/p&gt;

&lt;p&gt;I would like to try Ant Colony Optimization for problems other than TSP in the future, so if you know of any other applications where ACO shines, let me know! &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;If you enjoyed this article, please share it on Twitter or with a friend.&lt;/strong&gt; I write these for you and would be happy if more people can read them and share my love for algorithms.&lt;/p&gt;

&lt;h2&gt;
  
  
  Suggested Further Reading
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://kazimuth.github.io/blog/post/shake-and-pull-gently/"&gt;&lt;em&gt;Shake and Pull Gently&lt;/em&gt;, Kazimuth&lt;/a&gt;: This post reminded me of my love for search and optimization algorithms, and I recommend it full-heartedly.&lt;/li&gt;
&lt;li&gt;Reddit User &lt;em&gt;/u/git&lt;/em&gt;'s comments on &lt;a href="https://www.reddit.com/r/programming/comments/wx69fs/comment/ilplkgs/"&gt;Ant Behavior&lt;/a&gt; and &lt;a href="https://www.reddit.com/r/funny/comments/wt1fcr/comment/il1w9u2/"&gt;Ant Trails&lt;/a&gt;, which originally inspired me to write this post.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.frontiersin.org/articles/10.3389/frobt.2021.689908/full"&gt;Solving the Large-Scale TSP Problem in 1 h: Santa Claus Challenge 2020&lt;/a&gt;: A fun challenge and a good explanation of TSP.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://arxiv.org/pdf/2205.15678v1.pdf"&gt;Automatic Relation-aware Graph Network Proliferation&lt;/a&gt;: Using Graph Neural Networks to solve, among other things, TSP.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/maoaiz/tsp-genetic-python"&gt;TSP Genetic Python&lt;/a&gt;: A genetic algorithm for solving TSP.&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>programming</category>
      <category>python</category>
      <category>algorithms</category>
      <category>math</category>
    </item>
    <item>
      <title>Feature Visualization on Convolutional Neural Networks (or: Making your own Deep-Dream with Keras)</title>
      <dc:creator>Luciano Strika</dc:creator>
      <pubDate>Sat, 30 May 2020 01:20:19 +0000</pubDate>
      <link>https://dev.to/strikingloo/feature-visualization-on-convolutional-neural-networks-keras-11mh</link>
      <guid>https://dev.to/strikingloo/feature-visualization-on-convolutional-neural-networks-keras-11mh</guid>
      <description>&lt;p&gt;According to Wikipedia, &lt;a href="https://en.wikipedia.org/wiki/Apophenia" rel="noopener noreferrer"&gt;apophenia&lt;/a&gt; is &lt;em&gt;“the tendency to mistakenly perceive connections and meaning between unrelated things”&lt;/em&gt; . It is also used as “the human propensity to seek patterns in random information”. Whether it’s a scientist doing research in a lab, or a conspiracy theorist warning us about how “it’s all connected”, I guess people need to feel like we understand what’s going on, even in the face of clearly random information.&lt;/p&gt;

&lt;p&gt;Deep Neural Networks are usually treated like “black boxes” due to their &lt;strong&gt;inscrutability&lt;/strong&gt; compared to more transparent models, like XGBoost or &lt;a href="https://github.com/interpretml/interpret" rel="noopener noreferrer"&gt;Explainable Boosted Machines&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;However, there is a way to interpret what &lt;strong&gt;each individual filter&lt;/strong&gt; is doing in a Convolutional Neural Network, and which kinds of images it is learning to detect.&lt;/p&gt;

&lt;p&gt;Convolutional Neural Networks have been prominent since at least 2012, when &lt;a href="http://image-net.org/challenges/LSVRC/2012/supervision.pdf" rel="noopener noreferrer"&gt;AlexNet&lt;/a&gt; won the &lt;a href="http://www.image-net.org/challenges/LSVRC/2012/index#workshop" rel="noopener noreferrer"&gt;ImageNet computer vision contest&lt;/a&gt; with an accuracy of 85%. Second place came in at a mere 74%, and &lt;a href="http://www.image-net.org/challenges/LSVRC/2013/results.php#cls" rel="noopener noreferrer"&gt;a year later&lt;/a&gt; most competitors had switched to this “new” kind of algorithm.&lt;/p&gt;

&lt;p&gt;They are widely used for many different tasks, mostly relating to &lt;strong&gt;image processing&lt;/strong&gt;. These include Image Classification, Detection problems, and many others.&lt;/p&gt;

&lt;p&gt;I will not go in depth into how a Convolutional Neural Network works, but if you’re getting started in this subject I recommend you read my &lt;a href="https://www.datastuff.tech/machine-learning/convolutional-neural-networks-an-introduction-tensorflow-eager/" rel="noopener noreferrer"&gt;Practical Introduction to Convolutional Neural Networks&lt;/a&gt; with working TensorFlow code.&lt;/p&gt;

&lt;p&gt;If you already have a grasp of how a Convolutional Neural Network works, then this article is all you need to know to understand &lt;strong&gt;what Feature Visualization does&lt;/strong&gt; and how it works.&lt;/p&gt;

&lt;h2&gt;
  
  
  How does Feature Visualization work?
&lt;/h2&gt;

&lt;p&gt;Normally, you would train a CNN feeding it images and labels, and using Gradient Descent or a similar optimization method to &lt;strong&gt;fit the Neural Network’s weights&lt;/strong&gt; so that it predicts the right label.&lt;/p&gt;

&lt;p&gt;Throughout this process, one would expect the image to remain untouched, and the same applies to the label.&lt;/p&gt;

&lt;p&gt;However, what do you think would happen if we took any image, picked one convolutional filter in our (already trained) network, and applied Gradient Ascent &lt;strong&gt;on the input image&lt;/strong&gt; to &lt;strong&gt;maximize that filter’s output&lt;/strong&gt;, while &lt;strong&gt;leaving the Network’s weights constant&lt;/strong&gt;?&lt;/p&gt;
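&lt;p&gt;To make the perspective shift concrete, here is a minimal NumPy sketch of that idea (a toy stand-in, not the Keras code used later in the post): a single hypothetical 3×3 filter plays the role of the frozen, trained weights, and we run gradient ascent on the pixels of the input image to maximize the filter’s mean output.&lt;/p&gt;

```python
import numpy as np

rng = np.random.default_rng(0)

# One fixed 3x3 "filter": a stand-in for the trained weights we never touch.
kernel = rng.standard_normal((3, 3))

def mean_activation(image):
    # Valid cross-correlation of the image with the kernel, averaged.
    h, w = image.shape
    out = np.zeros((h - 2, w - 2))
    for i in range(h - 2):
        for j in range(w - 2):
            out[i, j] = np.sum(image[i:i + 3, j:j + 3] * kernel)
    return out.mean()

def grad_wrt_image(image):
    # Gradient of the mean activation w.r.t. the image. Because the loss
    # is linear in the pixels for this toy filter, it just accumulates
    # the kernel over every window position.
    h, w = image.shape
    grad = np.zeros_like(image)
    n = (h - 2) * (w - 2)
    for i in range(h - 2):
        for j in range(w - 2):
            grad[i:i + 3, j:j + 3] += kernel / n
    return grad

# Start from noise centered on mid-gray, then do gradient ASCENT on the
# input image itself; the kernel (the "network") stays constant.
image = rng.uniform(0.4, 0.6, (16, 16))
before = mean_activation(image)
for _ in range(50):
    image += 0.5 * grad_wrt_image(image)
after = mean_activation(image)
```

&lt;p&gt;After the loop, the image’s mean activation is strictly higher than where it started: the pixels, not the weights, did the learning.&lt;/p&gt;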

&lt;p&gt;Suddenly, we have &lt;strong&gt;shifted perspectives&lt;/strong&gt;. We’re no longer training a model to predict an image’s label. Rather, we’re now kind of fitting the image to the model, to make it generate whatever output we want.&lt;/p&gt;

&lt;p&gt;In a way, it’s like we’re asking the model “See this filter? What kind of images turn it on?”.&lt;/p&gt;

&lt;p&gt;If our Network has been properly trained, then we expect most filters to carry interesting, valuable information that helps the model make accurate predictions for its classification task. We expect a filter’s activation to carry &lt;strong&gt;semantic meaning&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;It stands to reason then, that an image that “activates” a filter, making it have a large output, should have features that resemble those of one of the objects present in the Dataset (and among the model’s labels).&lt;/p&gt;

&lt;p&gt;However, given that convolutions are a &lt;strong&gt;local transformation&lt;/strong&gt;, it is common to see the patterns that trigger that convolutional filter repeatedly “sprout” in many different areas of our image.&lt;/p&gt;

&lt;p&gt;This process generates the kind of picture &lt;a href="https://deepdreamgenerator.com/#gallery" rel="noopener noreferrer"&gt;Google’s Deep Dream&lt;/a&gt; model made popular.&lt;/p&gt;

&lt;p&gt;In this tutorial, we will use &lt;strong&gt;TensorFlow’s Keras&lt;/strong&gt; code to &lt;strong&gt;generate images&lt;/strong&gt; that maximize a given filter’s output (namely, the average of the filter’s outputs, since the output is technically a matrix).&lt;/p&gt;

&lt;h2&gt;
  
  
  Implementing Filter Visualization
&lt;/h2&gt;

&lt;p&gt;As I mentioned before, for this to work we would need to first &lt;strong&gt;train a Neural Network classifier&lt;/strong&gt;. Luckily, we don’t need to go through that whole messy and costly process: &lt;strong&gt;Keras&lt;/strong&gt; already comes with a whole suite of pre-trained Neural Networks we can just download and use.&lt;/p&gt;

&lt;h3&gt;
  
  
  Using a Pre-trained Neural Network
&lt;/h3&gt;

&lt;p&gt;For this article, we will use VGG16, a huge Convolutional Neural Network trained on the same ImageNet competition Dataset. Remember how I mentioned AlexNet won with an 85% accuracy and disrupted the Image Classification field? VGG16 scored 92% on that same task.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;VGG16 is a convolutional neural network model proposed by K. Simonyan and A. Zisserman from the University of Oxford in the paper “Very Deep Convolutional Networks for Large-Scale Image Recognition”. The model achieves 92.7% top-5 test accuracy in ImageNet, which is a dataset of over 14 million images belonging to 1000 classes. It was one of the famous model submitted to &lt;a href="http://www.image-net.org/challenges/LSVRC/2014/results" rel="noopener noreferrer"&gt;ILSVRC-2014&lt;/a&gt;. It makes the improvement over AlexNet by replacing large kernel-sized filters (11 and 5 in the first and second convolutional layer, respectively) with multiple 3×3 kernel-sized filters one after another. VGG16 &lt;strong&gt;was trained for weeks and was using NVIDIA Titan Black GPU’s&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;cite&gt;&lt;a href="https://neurohive.io/en/popular-networks/vgg16/" rel="noopener noreferrer"&gt;https://neurohive.io/en/popular-networks/vgg16/&lt;/a&gt; — VGG16 – Convolutional Network for Classification and Detection (emphasis mine)&lt;/cite&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;For these experiments I will be using &lt;a href="https://colab.research.google.com/" rel="noopener noreferrer"&gt;Google Colab’s&lt;/a&gt; GPU machines, tweaking the Keras library’s &lt;a href="https://github.com/keras-team/keras/blob/master/examples/conv_filter_visualization.py" rel="noopener noreferrer"&gt;Filter Visualization&lt;/a&gt; example code.&lt;/p&gt;

&lt;p&gt;For a breakdown of how the original script works, see the &lt;a href="https://blog.keras.io/how-convolutional-neural-networks-see-the-world.html" rel="noopener noreferrer"&gt;Keras Blog&lt;/a&gt;. I only made slight changes to it, to easily configure file names and other minor details, so I don’t think it’s worth linking to my own notebook.&lt;/p&gt;

&lt;p&gt;Here is what the important function does:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Define a loss function that’s equal to the chosen &lt;strong&gt;filter’s mean output&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Initialize a small &lt;strong&gt;starting picture&lt;/strong&gt;, typically with &lt;strong&gt;random uniform noise&lt;/strong&gt; centered around RGB(128,128,128) (I actually played around with this a bit, and will expand on it later).&lt;/li&gt;
&lt;li&gt;Compute the &lt;strong&gt;gradient of the input picture&lt;/strong&gt; with regard to this loss, and perform gradient ascent.&lt;/li&gt;
&lt;li&gt;Repeat N times, then resize the picture to make it slightly bigger (default value was 20%). We start with a small picture and make it increasingly bigger as we generate the filter’s maximizing image, because otherwise the algorithm tends to create a small pattern that repeats many times, instead of making a lower-frequency pattern with bigger (and, subjectively, more aesthetically pleasing) shapes.&lt;/li&gt;
&lt;li&gt;Repeat the last two steps until reaching the desired resolution.&lt;/li&gt;
&lt;/ul&gt;
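&lt;p&gt;The steps above can be sketched end to end. The following is a self-contained toy (NumPy only, with a hypothetical linear 3×3 filter instead of a real VGG16 layer, and nearest-neighbor resizing instead of proper interpolation), so it shows the shape of the loop rather than the exact Keras script:&lt;/p&gt;

```python
import numpy as np

rng = np.random.default_rng(1)
kernel = rng.standard_normal((3, 3))  # stand-in for one trained filter

def ascent_step(image, lr=0.5):
    # One gradient-ascent step on the image. The loss (the filter's mean
    # output) is linear in the pixels here, so the gradient is just the
    # kernel accumulated over every window position.
    h, w = image.shape
    grad = np.zeros_like(image)
    n = (h - 2) * (w - 2)
    for i in range(h - 2):
        for j in range(w - 2):
            grad[i:i + 3, j:j + 3] += kernel / n
    return image + lr * grad

def upscale(image, factor=1.2):
    # Nearest-neighbor resize, standing in for a proper interpolation.
    h, w = image.shape
    nh, nw = int(h * factor), int(w * factor)
    return image[np.ix_(np.arange(nh) * h // nh, np.arange(nw) * w // nw)]

# Small start: uniform noise centered on mid-gray.
image = rng.uniform(0.4, 0.6, (32, 32))
for _ in range(5):             # repeat until the desired resolution
    for _ in range(20):        # N ascent steps at the current size
        image = ascent_step(image)
    image = upscale(image)     # then grow the canvas by ~20%
```

&lt;p&gt;Swapping the toy &lt;code&gt;ascent_step&lt;/code&gt; for a real gradient computed through a trained network, and the nearest-neighbor resize for a proper zoom, gives essentially the structure of the Keras example.&lt;/p&gt;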

&lt;p&gt;That’s pretty much it. The code I linked to has a few more things happening (image normalization, and stitching together many filters’ generated images into a cute collage) but that is the most important bit.&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;


&lt;p&gt;Here’s the code for that function. Not so scary now that you know what’s going on, right?&lt;/p&gt;

&lt;p&gt;Now for the fun part, let’s try this out and see which kinds of filters come out.&lt;/p&gt;

&lt;h2&gt;
  
  
  My Results Trying out Feature Visualization
&lt;/h2&gt;

&lt;p&gt;I read &lt;a href="https://distill.pub/2017/feature-visualization/" rel="noopener noreferrer"&gt;many&lt;/a&gt; &lt;a href="https://towardsdatascience.com/how-to-visualize-convolutional-features-in-40-lines-of-code-70b7d87b0030" rel="noopener noreferrer"&gt;different&lt;/a&gt; &lt;a href="https://arxiv.org/pdf/1311.2901.pdf" rel="noopener noreferrer"&gt;examples&lt;/a&gt; of Feature Visualization articles before giving it a shot. Here are some of the things I learned.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The first Convolutional Layers (the ones closer to the inputs) generate simpler visuals. They’re usually just rough textures like parallel wavy lines, or multicolored circles.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://www.datastuff.tech/wp-content/uploads/2020/05/blck2_conv1.png" rel="noopener noreferrer"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fwww.datastuff.tech%2Fwp-content%2Fuploads%2F2020%2F05%2Fblck2_conv1-1024x1024.png" alt="Visualization of Convolutional Filter on VGG16, second layer."&gt;&lt;/a&gt;Visualization of Convolutional Filter on VGG16, second layer.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The Convolutional layers closer to the outputs generate more intricate textures and patterns. Some even resemble objects that exist, or sorta look like they may exist (in a very uncanny-valley kind of way).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is also where I had the most fun, to be honest. I tried out many different “starting images”, from random noise to uniform grey to a progressive gradient.&lt;/p&gt;

&lt;p&gt;The results for any given filter all came out pretty similar. This makes me think that, given the number of iterations I used, the starting image itself became pretty irrelevant. At the very least, it did not have a predictable impact on the results.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.datastuff.tech/wp-content/uploads/2020/05/block4_conv1.png" rel="noopener noreferrer"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fwww.datastuff.tech%2Fwp-content%2Fuploads%2F2020%2F05%2Fblock4_conv1-1024x1024.png"&gt;&lt;/a&gt;Feature Visualization for Block 4, filters in first convolutional layer of VGG16. Most of the patterns look regular and granular, but a lot more complicated than the early, rustic textures we saw on the first layers.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.datastuff.tech/wp-content/uploads/2020/05/block4_conv2.png" rel="noopener noreferrer"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fwww.datastuff.tech%2Fwp-content%2Fuploads%2F2020%2F05%2Fblock4_conv2-1024x1024.png"&gt;&lt;/a&gt;Filter Visualization for Block 4, filters in second convolutional layer of VGG16. Note how the patterns are very repetitive, but generate textures that look a lot more sophisticated than in the first layers.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.datastuff.tech/wp-content/uploads/2020/05/block4_conv3.png" rel="noopener noreferrer"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fwww.datastuff.tech%2Fwp-content%2Fuploads%2F2020%2F05%2Fblock4_conv3-1024x1024.png" alt="Filter Visualization for Block 4, filters in third convolutional layer of VGG16"&gt;&lt;/a&gt;Filter Visualization for Block 4, filters in third convolutional layer of VGG16. Some more porous patterns seem to emerge.&lt;/p&gt;

&lt;p&gt;As we go deeper, and &lt;strong&gt;closer to the fully connected layers&lt;/strong&gt;, we reach the &lt;strong&gt;last Convolutional Layer&lt;/strong&gt;. The images it generates are the &lt;strong&gt;most intricate&lt;/strong&gt; by far, and the patterns they make often resemble real-life items.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.datastuff.tech/wp-content/uploads/2020/05/block5_conv1.png" rel="noopener noreferrer"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fwww.datastuff.tech%2Fwp-content%2Fuploads%2F2020%2F05%2Fblock5_conv1-1024x1024.png" alt="Filter Visualization for Block 5, filters in first convolutional layer of VGG16"&gt;&lt;/a&gt;Filter Visualization for Block 5, filters in first convolutional layer of VGG16&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.datastuff.tech/wp-content/uploads/2020/05/block5_conv3.png" rel="noopener noreferrer"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fwww.datastuff.tech%2Fwp-content%2Fuploads%2F2020%2F05%2Fblock5_conv3-1024x1024.png" alt="Filter Visualization for Block 5, filters in third convolutional layer of VGG16"&gt;&lt;/a&gt;Filter Visualization for Block 5, filters in third convolutional layer of VGG16&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.datastuff.tech/wp-content/uploads/2020/05/block5_conv2.png" rel="noopener noreferrer"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fwww.datastuff.tech%2Fwp-content%2Fuploads%2F2020%2F05%2Fblock5_conv2-1024x1024.png" alt="Filter Visualization for Block 5, filters in third convolutional layer of VGG16"&gt;&lt;/a&gt;Block 5, filters in second Convolutional Layer. Isn’t it just crazy that all of these patterns emerge just from maximizing a “simple” (albeit hyperdimensional) mathematical function?&lt;/p&gt;

&lt;p&gt;Now, looking into these images in search of patterns, it is easy to feel like one is falling into apophenia. However, I think we can all agree some of those images have features that &lt;strong&gt;really look like&lt;/strong&gt; … you can zoom in and complete that sentence on your own. Feature Visualization is the new gazing at clouds.&lt;/p&gt;

&lt;p&gt;My own guess is it’s just a new kind of abstract art.&lt;/p&gt;

&lt;p&gt;Let me show you some of the filters I found most visually interesting:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.datastuff.tech/wp-content/uploads/2020/05/block4_conv2_gradual_37_0.png" rel="noopener noreferrer"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fwww.datastuff.tech%2Fwp-content%2Fuploads%2F2020%2F05%2Fblock4_conv2_gradual_37_0.png" alt="block4, second convolutional layer filter"&gt;&lt;/a&gt;The texture kind of reminds me of an Orange peel&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.datastuff.tech/wp-content/uploads/2020/05/block4_conv2_gradual_20_0.png" rel="noopener noreferrer"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fwww.datastuff.tech%2Fwp-content%2Fuploads%2F2020%2F05%2Fblock4_conv2_gradual_20_0.png" alt="Filter Visualization Convolutional Neural Network vgg16"&gt;&lt;/a&gt;This one looks like clouds or cotton.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.datastuff.tech/wp-content/uploads/2020/05/block5_conv2_gradual_40_0-1.png" rel="noopener noreferrer"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fwww.datastuff.tech%2Fwp-content%2Fuploads%2F2020%2F05%2Fblock5_conv2_gradual_40_0-1.png" alt="Filter Vsualization convolutional neural network looks like spiral"&gt;&lt;/a&gt;This one looks like spirals infested with fungi (great name for a band!)&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.datastuff.tech/wp-content/uploads/2020/05/block5_conv2_gradual_47_0-1.png" rel="noopener noreferrer"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fwww.datastuff.tech%2Fwp-content%2Fuploads%2F2020%2F05%2Fblock5_conv2_gradual_47_0-1.png"&gt;&lt;/a&gt;This one is just too crazy, so I’ll wrap this up with it.&lt;/p&gt;

&lt;p&gt;I have about 240 more of these; if there’s enough interest I can make a gallery out of them, but I fear it may get repetitive after a while.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Finally, if you use the classification layer’s cells to generate an image, it will usually &lt;a href="https://blog.keras.io/how-convolutional-neural-networks-see-the-world.html" rel="noopener noreferrer"&gt;come out wrong&lt;/a&gt;: greyish and ugly. I didn’t even try this out, since the results weren’t that interesting. It’s good to keep in mind, however, especially when you read headlines about AI taking over soon or similar unwarranted panic.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;To be honest, I had a lot of fun with this project. I hadn’t really heard about Google Colab until a couple weeks ago (thanks to &lt;a href="https://www.reddit.com/r/MediaSynthesis/" rel="noopener noreferrer"&gt;r/mediaSynthesis&lt;/a&gt;). It feels great to be able to use a good GPU machine for free.&lt;/p&gt;

&lt;p&gt;I’d also read most of the papers on this subject a couple years ago, then never got around to actually testing the code or doing an article like this. I’m glad I finally scratched it out of my list (or Trello, who am I kidding?).&lt;/p&gt;

&lt;p&gt;Finally, in the future I’d like to try out different network architectures and visualize how the images morph at every iteration, instead of simply looking at the finished product.&lt;/p&gt;

&lt;p&gt;Please let me know in the comments which other experiments or bibliography could be worth checking to keep expanding on this subject!&lt;/p&gt;

&lt;p&gt;&lt;em&gt;If you liked this article, please consider tweeting it or sharing anywhere else!&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Follow me on &lt;a href="http://www.twitter.com/strikingloo" rel="noopener noreferrer"&gt;Twitter&lt;/a&gt; to discuss any of this further or keep up to date with my latest articles.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The post &lt;a href="https://www.datastuff.tech/machine-learning/feature-visualization-convolutional-neural-networks-keras/" rel="noopener noreferrer"&gt;Feature Visualization on Convolutional Neural Networks (Keras)&lt;/a&gt; appeared first on &lt;a href="https://www.datastuff.tech" rel="noopener noreferrer"&gt;Data Stuff&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>python</category>
      <category>deeplearning</category>
      <category>tensorflow</category>
    </item>
    <item>
      <title>3 Programming Books to Read During Lockdown</title>
      <dc:creator>Luciano Strika</dc:creator>
      <pubDate>Mon, 20 Apr 2020 04:40:49 +0000</pubDate>
      <link>https://dev.to/strikingloo/3-programming-books-for-beginners-to-read-during-lockdown-1fln</link>
      <guid>https://dev.to/strikingloo/3-programming-books-for-beginners-to-read-during-lockdown-1fln</guid>
      <description>&lt;p&gt;Be it an O’Reilly book, or some of the Computer Science classics, many &lt;strong&gt;programming books&lt;/strong&gt; can help you level up in your &lt;strong&gt;career as a Developer&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;This can be especially important when you are &lt;strong&gt;getting started&lt;/strong&gt; in Software Development, or in a &lt;strong&gt;programming language&lt;/strong&gt; like Python.&lt;/p&gt;

&lt;p&gt;These last months have been quite heavy and stressful for many of us, what with the Apocalypse taking place and all that.&lt;/p&gt;

&lt;p&gt;So why not take advantage of the situation, and use our newfound free time to double down on our studies and read some programming books?&lt;/p&gt;

&lt;p&gt;This may be the time to be twice as productive. As &lt;a href="http://paulgraham.com/gh.html?viewfullsite=1"&gt;Paul Graham&lt;/a&gt; said:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“If it is possible to make yourself into a great hacker, the way to do it may be to make the following deal with yourself: you never have to work on boring projects (…), and in return, you’ll never allow yourself to do a half-assed job.”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Without further ado, let’s see book number 1.&lt;/p&gt;

&lt;h2&gt;
  
  
  Automate the Boring Stuff with Python
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://www.amazon.com/Automate-Boring-Stuff-Python-2nd-ebook/dp/B07VSXS4NK/ref=as_li_ss_il?dchild=1&amp;amp;keywords=Automate+the+Boring+Stuff+with+Python&amp;amp;qid=1587355901&amp;amp;sr=8-1&amp;amp;linkCode=li3&amp;amp;tag=strikingloo-20&amp;amp;linkId=f895004eb4c2d8493cba066d43e4b256"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--U98nZ0MT--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/http://ws-na.amazon-adsystem.com/widgets/q%3F_encoding%3DUTF8%26ASIN%3DB07VSXS4NK%26Format%3D_SL250_%26ID%3DAsinImage%26MarketPlace%3DUS%26ServiceVersion%3D20070822%26WS%3D1%26tag%3Dstrikingloo-20" alt=""&gt;&lt;/a&gt; &lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--evwjMe_K--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://ir-na.amazon-adsystem.com/e/ir%3Ft%3Dstrikingloo-20%26l%3Dli3%26o%3D1%26a%3DB07VSXS4NK" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--evwjMe_K--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://ir-na.amazon-adsystem.com/e/ir%3Ft%3Dstrikingloo-20%26l%3Dli3%26o%3D1%26a%3DB07VSXS4NK" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If you’re new to programming, there’s this very early stage when you’re still realizing the &lt;strong&gt;huge potential software can have&lt;/strong&gt;, especially when applied to &lt;strong&gt;automation&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;There’s a big difference between performing a simple task manually, and doing it a thousand times faster with a script.&lt;/p&gt;

&lt;p&gt;There’s also a case to be made that &lt;strong&gt;Python is the best language to get started with&lt;/strong&gt; in Software Development, since its syntax and environments are less daunting than those of C or Java. This way, you can spend less time on setup, freeing you up to focus on what’s important: solving actual problems.&lt;/p&gt;

&lt;p&gt;I think &lt;em&gt;&lt;a href="https://amzn.to/2yoYd2y"&gt;Automate the Boring Stuff&lt;/a&gt;&lt;/em&gt; really sets itself apart from other programming books in this area: showing you from the get-go which typical day-to-day problems you can solve with Python scripts, or with code, really.&lt;/p&gt;

&lt;p&gt;From basic program flow and logic to more advanced tasks like Web Scraping, this book walks with you &lt;strong&gt;all the way from beginner to proficient&lt;/strong&gt;, without holding your hand too much.&lt;/p&gt;

&lt;p&gt;My favorite project from that book is the one in the chapter &lt;strong&gt;&lt;em&gt;Handle the Clipboard Content&lt;/em&gt;&lt;/strong&gt;, which teaches you how to copy and paste text programmatically, eventually building a &lt;em&gt;super-clipboard&lt;/em&gt; which stores more than one text.&lt;/p&gt;

&lt;p&gt;I have a personal attachment to this book, as I used it to learn Python when I was still in high school, deciding on whether to study Computer Science or not.&lt;/p&gt;

&lt;p&gt;If you work in an office and you’re thinking of &lt;strong&gt;pivoting into programming&lt;/strong&gt;, this book is for you.&lt;/p&gt;

&lt;p&gt;Here’s a link to &lt;em&gt;&lt;a href="https://amzn.to/2yoYd2y"&gt;Automate the Boring Stuff&lt;/a&gt;&lt;/em&gt; on Amazon.&lt;/p&gt;

&lt;h2&gt;
  
  
  Introduction to Algorithms (Cormen)
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://www.amazon.com/Introduction-Algorithms-Press-Thomas-Cormen-ebook/dp/B007CNRCAO/ref=as_li_ss_il?dchild=1&amp;amp;keywords=cormen&amp;amp;qid=1587356486&amp;amp;sr=8-1&amp;amp;linkCode=li3&amp;amp;tag=strikingloo-20&amp;amp;linkId=09fe74bb316b42a1559a9377aabc1f17"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--uMBLqY1Y--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/http://ws-na.amazon-adsystem.com/widgets/q%3F_encoding%3DUTF8%26ASIN%3DB007CNRCAO%26Format%3D_SL250_%26ID%3DAsinImage%26MarketPlace%3DUS%26ServiceVersion%3D20070822%26WS%3D1%26tag%3Dstrikingloo-20" alt=""&gt;&lt;/a&gt; &lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--1QGnt77B--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://ir-na.amazon-adsystem.com/e/ir%3Ft%3Dstrikingloo-20%26l%3Dli3%26o%3D1%26a%3DB007CNRCAO" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--1QGnt77B--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://ir-na.amazon-adsystem.com/e/ir%3Ft%3Dstrikingloo-20%26l%3Dli3%26o%3D1%26a%3DB007CNRCAO" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;To every Computer Science student, &lt;em&gt;&lt;a href="https://amzn.to/2VktSLi"&gt;Cormen et al.’s Introduction to Algorithms&lt;/a&gt;&lt;/em&gt; is our bible.&lt;/p&gt;

&lt;p&gt;This book has been &lt;strong&gt;sitting on my shelf for years&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;It has helped me &lt;strong&gt;prepare for many exams&lt;/strong&gt;, and it’s what I reach for whenever I need to &lt;strong&gt;brush up on Data Structures&lt;/strong&gt; before an interview.&lt;/p&gt;

&lt;p&gt;Especially if you’re planning to get into Software Development without getting a college degree, this book is a definite must-read.&lt;/p&gt;

&lt;p&gt;This Computer Science book is the most comprehensive study of basic &lt;strong&gt;Data Structures and Algorithms&lt;/strong&gt; you will find.&lt;/p&gt;

&lt;p&gt;It covers:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Algorithmic Complexity&lt;/strong&gt; (with the best explanation of Big-O notation I’ve seen so far).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Sorting Algorithms&lt;/strong&gt; (&lt;em&gt;many&lt;/em&gt; sorting algorithms).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Graphs&lt;/strong&gt; and Graph-related Algorithms (especially Binary trees).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Hash tables&lt;/strong&gt; and hashing algorithms.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Dynamic Programming&lt;/strong&gt;, Greedy Algorithms, &lt;strong&gt;Divide-and-Conquer&lt;/strong&gt; Algorithms.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These topics and many others are explained in understandable terms, but with &lt;strong&gt;mathematical rigor and correctness&lt;/strong&gt;. Not only that, but they often come up both in &lt;strong&gt;day-to-day work&lt;/strong&gt; and in &lt;strong&gt;interview problems&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Keep in mind this is a university-level book, packed full of formal proofs and mathematical notation.&lt;/p&gt;

&lt;p&gt;Even so, I think most developers will agree it is &lt;strong&gt;generally entertaining to read&lt;/strong&gt; (&lt;em&gt;if you don’t find Data Structures fun, make sure you’re picking the right career!&lt;/em&gt;), and explains most concepts really clearly and succinctly.&lt;/p&gt;

&lt;p&gt;If you need to learn how a hash table works, or want to be able to build a binary search tree from scratch, or just need a quick brush up on sorting algorithms before an interview, this is the book for you.&lt;/p&gt;

&lt;p&gt;As before, here’s a link to &lt;em&gt;&lt;a href="https://amzn.to/2VktSLi"&gt;Cormen et al.’s Introduction to Algorithms&lt;/a&gt;&lt;/em&gt; on Amazon.&lt;/p&gt;

&lt;p&gt;And, speaking of interviews…&lt;/p&gt;

&lt;h2&gt;
  
  
  Cracking the Coding Interview
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://www.amazon.com/Cracking-Coding-Interview-Programming-Questions/dp/0984782850/ref=as_li_ss_il?dchild=1&amp;amp;keywords=ctci&amp;amp;qid=1587356024&amp;amp;sr=8-1&amp;amp;linkCode=li3&amp;amp;tag=strikingloo-20&amp;amp;linkId=904a242e267cc92cfb582e366ee3e63a"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--lpYvl_VN--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/http://ws-na.amazon-adsystem.com/widgets/q%3F_encoding%3DUTF8%26ASIN%3D0984782850%26Format%3D_SL250_%26ID%3DAsinImage%26MarketPlace%3DUS%26ServiceVersion%3D20070822%26WS%3D1%26tag%3Dstrikingloo-20" alt=""&gt;&lt;/a&gt; &lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--_wsccG4b--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://ir-na.amazon-adsystem.com/e/ir%3Ft%3Dstrikingloo-20%26l%3Dli3%26o%3D1%26a%3D0984782850" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--_wsccG4b--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://ir-na.amazon-adsystem.com/e/ir%3Ft%3Dstrikingloo-20%26l%3Dli3%26o%3D1%26a%3D0984782850" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Ok, hear me out here. If you’re starting from scratch, I think &lt;a href="https://amzn.to/2yoYd2y"&gt;Automate the Boring Stuff&lt;/a&gt; is the most practical way to start learning Python and programming.&lt;/p&gt;

&lt;p&gt;And if you want to &lt;strong&gt;dive deeper&lt;/strong&gt; and learn more advanced or theoretical Computer Science concepts, like Algorithms and Data Structures, then &lt;em&gt;&lt;a href="https://amzn.to/2VktSLi"&gt;Cormen’s Introduction to Algorithms&lt;/a&gt;&lt;/em&gt; is the undisputed go-to book.&lt;/p&gt;

&lt;p&gt;However, when all is said and done, there is a craftsmanship you can only learn by doing and practicing.&lt;/p&gt;

&lt;p&gt;As Charles Darwin once said:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“I have always maintained that, excepting fools, men did not differ much in intellect, only in zeal and hard work.”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;If that’s the stage where you feel you’re at, then the best thing you can do is &lt;strong&gt;practice a lot&lt;/strong&gt;, with &lt;strong&gt;many different problems&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;That’s exactly what &lt;em&gt;&lt;a href="https://amzn.to/3bluxBO"&gt;Cracking the Code Interview&lt;/a&gt;&lt;/em&gt; (CTCI, for friends) has to offer.&lt;/p&gt;

&lt;p&gt;Sure, the first chapter deals more with the “soft” aspects of a Software Interview (which, again, if you plan to apply for a SWE job eventually, then you should master those too!).&lt;/p&gt;

&lt;p&gt;But the rest of the book? Chapter after chapter of &lt;strong&gt;fun, challenging problems&lt;/strong&gt; taken straight out of Google’s, Microsoft’s or Facebook’s interview processes. And, they are divided into categories so you can practice one subject at a time.&lt;/p&gt;

&lt;p&gt;Feel like you need to polish your &lt;strong&gt;bit manipulation&lt;/strong&gt; skills? CTCI has a chapter for you.&lt;/p&gt;

&lt;p&gt;Want to practice thinking on your feet and deciding which &lt;strong&gt;Data Structures&lt;/strong&gt; fit each kind of problem setup? CTCI has you covered, too.&lt;/p&gt;

&lt;p&gt;I did feel my &lt;strong&gt;Software Interview skills&lt;/strong&gt; improved after reading CTCI and going through all its exercises. However, that’s definitely not the most important part. The most valuable thing I got from CTCI is practice: hands-on practice, solving many different problems through code.&lt;/p&gt;

&lt;p&gt;To get started, be sure to check out &lt;em&gt;&lt;a href="https://amzn.to/3bluxBO"&gt;Cracking the Code Interview&lt;/a&gt;&lt;/em&gt; on Amazon.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;So far, I’ve made recommendations for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A very language-focused programming book for your first steps as a Developer.&lt;/li&gt;
&lt;li&gt;A more academic or broader book for the more theoretically-oriented readers.&lt;/li&gt;
&lt;li&gt;A last, very practical book with a lot of exercises for everyone, old and new to coding.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each of these programming books has helped me learn a lot. Some have saved my skin on more than one occasion (or exam!).&lt;/p&gt;

&lt;p&gt;When I am preparing for an interview, or a tough exam, there are no other books I’d rather have &lt;em&gt;(though, if you read this far and are thinking ‘hey, he didn’t mention !’ this is your time to shine! Hit me up in the comments and I’ll make sure to add it to my reading list)&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;I hope at least some of these books will be as helpful to you or your programmer friends, too!&lt;/p&gt;

&lt;p&gt;Have you already read any of these books? Are you reading any of them? Let me know what you think of them in the comments!&lt;/p&gt;

&lt;p&gt;I’d love to know your opinion, both if you liked them or not. Especially if you can offer a recommendation for what you think is a better alternative!&lt;/p&gt;

&lt;p&gt;If you want to get into Data Science or Machine Learning, check out my older post &lt;em&gt;&lt;a href="http://www.datastuff.tech/data-science/3-machine-learning-books-that-helped-me-level-up-as-a-data-scientist/"&gt;Machine Learning Books to Level Up as a Data Scientist&lt;/a&gt;&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;We can also discuss these books on &lt;a href="https://twitter.com/strikingloo"&gt;Twitter&lt;/a&gt;, &lt;a href="https://medium.com/@strikingloo"&gt;Medium&lt;/a&gt; or &lt;a href="http://dev.to/strikingloo"&gt;dev.to&lt;/a&gt; if you’re interested.&lt;br&gt;&lt;br&gt;
I want to hear your opinions!&lt;/p&gt;

&lt;p&gt;&lt;em&gt;(Small disclaimer: all of these links are Amazon affiliate links. This means I get a small commission if you buy them. However, I’ll only review books I’ve actually read, and have genuinely recommended to people in real life.)&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The post &lt;a href="http://www.datastuff.tech/programming/3-programming-books-for-beginners-to-read-during-lockdown/"&gt;3 Programming Books for Beginners to Read During Lockdown&lt;/a&gt; appeared first on &lt;a href="http://www.datastuff.tech"&gt;Data Stuff&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>python</category>
      <category>beginners</category>
      <category>bookreviews</category>
      <category>programming</category>
    </item>
    <item>
      <title>What is the one tip you would give to new bloggers out there?</title>
      <dc:creator>Luciano Strika</dc:creator>
      <pubDate>Sun, 10 Nov 2019 17:50:12 +0000</pubDate>
      <link>https://dev.to/strikingloo/what-is-the-one-tip-you-would-give-to-new-bloggers-out-there-33km</link>
      <guid>https://dev.to/strikingloo/what-is-the-one-tip-you-would-give-to-new-bloggers-out-there-33km</guid>
      <description>&lt;p&gt;I've been blogging for about a year now, and feel like I've learned a lot of things in this time, even though I'm nowhere near as experienced as those big-time guys I see on social networks.&lt;/p&gt;

&lt;p&gt;The few things I'd say to a newcomer are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Make a plan and stick to it: posting consistently over a long period of time is better than making bursts of content every once in a while (I wish I had the discipline to follow this advice, it really pays off).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Focus on promoting your content almost as much as you focus on writing it. At least if you really care about it reaching a big audience.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Don't spend too much time focusing on improving your site speed. As developers, I think we have a tendency to want to optimize every last inch of website performance. I know I do. And I've spent countless hours, enough to write like five good quality articles, just optimizing those last points in Google PageSpeed, or those few points in GTMetrix. My advice? Get a few plugins to do the job for you, get to like 90 or 95 pagespeed and then just focus on content. &lt;br&gt;
I wish I'd known that from the start.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;By the way, if you're using WordPress, the PageSpeed Ninja plugin was a gamechanger for me. It's a lot better than Autoptimize, which is what Google suggested to me most of the time.&lt;/p&gt;

&lt;p&gt;So what about you? Bloggers of the tech world, lords of the web, which pearls of wisdom do you think every other blogger should receive?&lt;/p&gt;

</description>
      <category>discuss</category>
      <category>blogging</category>
      <category>webdev</category>
    </item>
    <item>
      <title>Markov Chains: Training AI to Write Game of Thrones</title>
      <dc:creator>Luciano Strika</dc:creator>
      <pubDate>Fri, 25 Oct 2019 05:29:17 +0000</pubDate>
      <link>https://dev.to/strikingloo/markov-chains-training-ai-to-write-game-of-thrones-25d6</link>
      <guid>https://dev.to/strikingloo/markov-chains-training-ai-to-write-game-of-thrones-25d6</guid>
      <description>&lt;p&gt;Markov chains have been around for a while now, and they are here to stay. From predictive keyboards to applications in trading and biology, they’ve proven to be versatile tools.&lt;/p&gt;

&lt;p&gt;Here are some Markov Chains industry applications:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Text Generation (you’re here for this).&lt;/li&gt;
&lt;li&gt;Financial modelling and forecasting (including trading algorithms).&lt;/li&gt;
&lt;li&gt;Logistics: modelling future deliveries or trips.&lt;/li&gt;
&lt;li&gt;Search Engines: PageRank can be seen as modelling a random internet surfer with a Markov Chain.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So far, we can tell this algorithm is useful, but what exactly are Markov Chains?&lt;/p&gt;

&lt;h2&gt;
  
  
  What are Markov Chains?
&lt;/h2&gt;

&lt;p&gt;A Markov Chain is a &lt;strong&gt;stochastic process&lt;/strong&gt; that models a finite &lt;strong&gt;set of states&lt;/strong&gt;, with fixed &lt;strong&gt;conditional probabilities of jumping&lt;/strong&gt; from a given state to another.&lt;/p&gt;

&lt;p&gt;What this means is, we will have an “agent” that randomly jumps around different states, with a certain probability of going from each state to another one.&lt;/p&gt;

&lt;p&gt;To show what a Markov Chain looks like, we can use a &lt;strong&gt;digraph&lt;/strong&gt;, where each node is a state (with a label or associated data), and the weight of the edge that goes from node &lt;em&gt;a&lt;/em&gt; to node &lt;em&gt;b&lt;/em&gt; is the &lt;strong&gt;probability of jumping from state &lt;em&gt;a&lt;/em&gt; to state &lt;em&gt;b&lt;/em&gt;&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Here’s an example, modelling the weather as a Markov Chain.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/http%3A%2F%2Fwww.datastuff.tech%2Fwp-content%2Fuploads%2F2019%2F10%2Fmarkovdiag.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/http%3A%2F%2Fwww.datastuff.tech%2Fwp-content%2Fuploads%2F2019%2F10%2Fmarkovdiag.png" alt="A diagram showing a Markov Chain as a weather model example."&gt;&lt;/a&gt;&lt;a href="http://techeffigytutorials.blogspot.com/2015/01/markov-chains-explained.html" rel="noreferrer noopener"&gt;Source&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We can express the probability of going from state &lt;em&gt;a&lt;/em&gt; to state &lt;em&gt;b&lt;/em&gt; as a &lt;strong&gt;matrix component&lt;/strong&gt;, where the whole &lt;strong&gt;matrix characterizes our Markov chain&lt;/strong&gt; process, corresponding to the &lt;strong&gt;digraph’s adjacency matrix&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/http%3A%2F%2Fwww.datastuff.tech%2Fwp-content%2Fuploads%2F2019%2F10%2Fmarkovmx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/http%3A%2F%2Fwww.datastuff.tech%2Fwp-content%2Fuploads%2F2019%2F10%2Fmarkovmx.png" alt="The adjacency Matrix for the graph in the previous picture."&gt;&lt;/a&gt;&lt;a href="http://techeffigytutorials.blogspot.com/2015/01/markov-chains-explained.html" rel="noopener noreferrer"&gt;Source&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Then, if we represent the current state as a one-hot encoding, we can obtain the conditional probabilities for the next state’s values by taking the current state, and looking at its corresponding row.&lt;/p&gt;

&lt;p&gt;After that, if we repeatedly sample the discrete distribution described by the &lt;em&gt;n&lt;/em&gt;-th state’s row, we may model a succession of states of arbitrary length.&lt;/p&gt;
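&lt;p&gt;&lt;em&gt;To make that concrete, here is a quick sketch of sampling such a chain. The states and transition probabilities below are made up for illustration, not taken from the diagram above:&lt;/em&gt;&lt;/p&gt;

```python
import numpy as np

# Toy weather model: each row is the distribution over next states.
states = ["sunny", "cloudy", "rainy"]
P = np.array([
    [0.7, 0.2, 0.1],   # sunny  -> sunny / cloudy / rainy
    [0.3, 0.4, 0.3],   # cloudy -> ...
    [0.2, 0.4, 0.4],   # rainy  -> ...
])

def sample_chain(P, start, length, seed=None):
    """Repeatedly sample the next state from the current state's row."""
    rng = np.random.default_rng(seed)
    walk = [start]
    for _ in range(length - 1):
        walk.append(rng.choice(len(P), p=P[walk[-1]]))
    return walk

walk = sample_chain(P, start=0, length=10, seed=42)
print([states[s] for s in walk])
```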

&lt;h2&gt;
  
  
  Markov Chains for Text Generation
&lt;/h2&gt;

&lt;p&gt;In order to generate text with Markov Chains, we need to define a few things:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;What are our states going to be?&lt;/li&gt;
&lt;li&gt;What probabilities will we assign to jumping from each state to a different one?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We could do a character-based model for text generation, where we define our state as the last &lt;em&gt;n&lt;/em&gt; characters we’ve seen, and try to predict the next one.&lt;/p&gt;

&lt;p&gt;I’ve already gone in-depth on this for my &lt;a href="https://dev.to/strikingloo/lstm-how-to-train-neural-networks-to-write-like-lovecraft-2bbk"&gt;LSTM for Text Generation&lt;/a&gt; article, to mixed results.&lt;/p&gt;

&lt;p&gt;In this experiment, I will instead choose to use the previous &lt;em&gt;k&lt;/em&gt; words as my current state, and model the probabilities of the next token.&lt;/p&gt;

&lt;p&gt;In order to do this, I will simply create a vector for each distinct sequence of &lt;em&gt;k&lt;/em&gt; words, having N components, where N is the total quantity of distinct words in my corpus.&lt;/p&gt;

&lt;p&gt;I will then add 1 to the &lt;em&gt;j&lt;/em&gt;-th component of the &lt;em&gt;i&lt;/em&gt;-th vector, where &lt;em&gt;i&lt;/em&gt; is the index of the current &lt;em&gt;k&lt;/em&gt;-sequence of words, and &lt;em&gt;j&lt;/em&gt; is the index of the next word.&lt;/p&gt;

&lt;p&gt;If I normalize each word vector, I will then have a probability distribution for the next word, given the previous &lt;em&gt;k&lt;/em&gt; tokens.&lt;/p&gt;

&lt;p&gt;Confusing? Let’s see an example with a small corpus.&lt;/p&gt;

&lt;h3&gt;
  
  
  Training our chain: toy example
&lt;/h3&gt;

&lt;p&gt;Let’s imagine my corpus is the following sentence.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;This sentence has five words&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;We will first choose &lt;em&gt;k&lt;/em&gt;: the &lt;strong&gt;quantity of words our chain will consider&lt;/strong&gt; before &lt;strong&gt;sampling/predicting the next one&lt;/strong&gt;. For this example, let’s use k=1.&lt;/p&gt;

&lt;p&gt;Now, how many distinct sequences of 1 word does our sentence have? It has 5, one for each word. If it had duplicate words, they wouldn’t add to this number.&lt;/p&gt;

&lt;p&gt;We will first initialize a 5×5 matrix of zeroes.&lt;/p&gt;

&lt;p&gt;After that, we will add 1 to the column corresponding to ‘sentence’ on the row for ‘this’. Then another 1 on the row for ‘sentence’, on the column for ‘has’. We will continue this process until we’ve gone through the whole sentence.&lt;/p&gt;

&lt;p&gt;This would be the resulting matrix:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/http%3A%2F%2Fwww.datastuff.tech%2Fwp-content%2Fuploads%2F2019%2F10%2Fresulting.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/http%3A%2F%2Fwww.datastuff.tech%2Fwp-content%2Fuploads%2F2019%2F10%2Fresulting.png" alt="A diagonal matrix of 5x5."&gt;&lt;/a&gt;The diagonal pattern comes from the ordering of the words.&lt;/p&gt;

&lt;p&gt;Since each word only appears once, this model would simply generate the same sentence over and over, but you can see how adding more words could make this interesting.&lt;/p&gt;
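&lt;p&gt;&lt;em&gt;Here is that toy counting process as code (plain lists are enough at this size):&lt;/em&gt;&lt;/p&gt;

```python
corpus = "This sentence has five words"
tokens = corpus.split()

# Distinct words, in order of first appearance.
words = list(dict.fromkeys(tokens))
index = {w: i for i, w in enumerate(words)}

# k = 1: counts[i][j] counts how often word j follows word i.
counts = [[0] * len(words) for _ in words]
for prev, nxt in zip(tokens, tokens[1:]):
    counts[index[prev]][index[nxt]] += 1

for row in counts:
    print(row)
```

&lt;p&gt;&lt;em&gt;Since every word appears once and each is followed by the next, the 1s land just above the diagonal, matching the matrix in the picture.&lt;/em&gt;&lt;/p&gt;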

&lt;p&gt;I hope things are clearer now. Let’s jump to some code!&lt;/p&gt;

&lt;h2&gt;
  
  
  Coding our Markov Chain in Python
&lt;/h2&gt;

&lt;p&gt;Now for the fun part! We will train a Markov chain on the whole A Song of Ice and Fire corpus (Ha! You thought I was going to reference the show? Too bad, I’m a book guy!).&lt;/p&gt;

&lt;p&gt;We will then generate sentences with varying values for &lt;em&gt;k&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;For this experiment, I decided to treat anything between two spaces as a word or &lt;em&gt;token&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;Conventionally, in NLP we treat punctuation marks (like ‘,’ or ‘.’) as tokens as well. To account for this, I will first pad every punctuation mark with two spaces, so each mark becomes a token of its own.&lt;/p&gt;

&lt;p&gt;Here’s the code for that small preprocessing, plus loading the corpus:&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;
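&lt;p&gt;&lt;em&gt;A minimal sketch of that preprocessing, run on a toy string. The punctuation list is an assumption, and the real code reads the book files instead:&lt;/em&gt;&lt;/p&gt;

```python
def preprocess(text):
    """Pad punctuation with spaces so each mark becomes its own token,
    then split on whitespace (anything between spaces is a token)."""
    for mark in [",", ".", ";", ":", "!", "?"]:
        text = text.replace(mark, f" {mark} ")
    return text.split()

# Loading the corpus would look something like (file names assumed):
# corpus = " ".join(open(f, encoding="utf-8").read() for f in book_files)
tokens = preprocess("Winter is coming, my lord.")
print(tokens)  # ['Winter', 'is', 'coming', ',', 'my', 'lord', '.']
```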


&lt;p&gt;We will start training our Markov Chain right away, but first let’s look at our dataset:&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;
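&lt;p&gt;&lt;em&gt;A sketch of those dataset stats, computed here on a toy token list rather than the actual corpus:&lt;/em&gt;&lt;/p&gt;

```python
# Corpus size: total tokens vs. distinct words (toy input).
tokens = "the quick fox and the lazy dog and the fox".split()
distinct_words = set(tokens)

corpus_length = len(tokens)
vocabulary_size = len(distinct_words)
print(f"{corpus_length} tokens, {vocabulary_size} distinct words")
```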


&lt;p&gt;We have over 2 million tokens, representing over 32000 distinct words! That’s a pretty big corpus for a single writer.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;If only he could add 800k more…&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Training our chain
&lt;/h3&gt;

&lt;p&gt;Moving on, here’s how we initialize our “word after k-sequence” counts matrix for an arbitrary &lt;em&gt;k&lt;/em&gt; (in this case, 2).&lt;/p&gt;

&lt;p&gt;There are 2185918 words in the corpus, and 429582 different sequences of 2 words, each followed by one of 32663 words.&lt;/p&gt;

&lt;p&gt;That means only slightly over 0.015% of our matrix’s components will be non-zero.&lt;/p&gt;

&lt;p&gt;Because of that, I used scipy’s &lt;em&gt;dok_matrix&lt;/em&gt; (&lt;em&gt;dok&lt;/em&gt; stands for Dictionary of Keys), a sparse matrix implementation, since we know this dataset is going to be extremely sparse.&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;
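&lt;p&gt;&lt;em&gt;A sketch of that initialization on a toy corpus. Function and variable names are mine, not necessarily the repo’s:&lt;/em&gt;&lt;/p&gt;

```python
from scipy.sparse import dok_matrix

def train(tokens, k):
    """Count, for every distinct k-word sequence, how often each
    word follows it. Rows index k-sequences, columns index words."""
    words = list(set(tokens))
    word_idx = {w: i for i, w in enumerate(words)}

    seqs = [tuple(tokens[i:i + k]) for i in range(len(tokens) - k)]
    seq_idx = {s: i for i, s in enumerate(set(seqs))}

    # dok (Dictionary of Keys) only stores the non-zero entries.
    counts = dok_matrix((len(seq_idx), len(word_idx)))
    for i in range(k, len(tokens)):
        seq = tuple(tokens[i - k:i])
        counts[seq_idx[seq], word_idx[tokens[i]]] += 1
    return counts, seq_idx, word_idx

tokens = "the cat sat on the mat the cat ran".split()
counts, seq_idx, word_idx = train(tokens, k=2)
```

&lt;p&gt;&lt;em&gt;On the real corpus, the dense version of this matrix would need billions of cells; the dok version only pays for the handful of observed transitions.&lt;/em&gt;&lt;/p&gt;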


&lt;p&gt;After initializing our matrix, sampling it is pretty intuitive.&lt;/p&gt;

&lt;p&gt;Here’s the code for that:&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;
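&lt;p&gt;&lt;em&gt;A sketch of the sampling step, with a simple linear-scan weighted choice standing in for the original weighted_choice helper (the tiny hand-built chain at the bottom is just for demonstration):&lt;/em&gt;&lt;/p&gt;

```python
import random
from scipy.sparse import dok_matrix

def weighted_choice(items, weights):
    """Pick one item with probability proportional to its weight."""
    target = random.random() * sum(weights)
    cumulative = 0.0
    for item, weight in zip(items, weights):
        cumulative += weight
        if target < cumulative:
            return item
    return items[-1]

def sample_next(counts, seq, seq_idx, words, alpha=0.0):
    """Sample the word following the k-sequence `seq`. With probability
    alpha (the chain's 'creativity') pick a uniformly random word."""
    if random.random() < alpha or seq not in seq_idx:
        return random.choice(words)
    weights = counts.tocsr()[seq_idx[seq]].toarray().ravel()
    return weighted_choice(words, weights)

# Tiny hand-built chain: after ('the',) the only observed word is 'cat'.
words = ["the", "cat"]
seq_idx = {("the",): 0}
counts = dok_matrix((1, len(words)))
counts[0, words.index("cat")] = 1

print(sample_next(counts, ("the",), seq_idx, words))
```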


&lt;p&gt;There are two things that may have caught your attention here. The first is the &lt;em&gt;alpha&lt;/em&gt; hyperparameter.&lt;/p&gt;

&lt;p&gt;This is our chain’s &lt;em&gt;creativity&lt;/em&gt;: a (typically small, or zero) chance that it will pick a totally random word instead of the ones suggested by the corpus.&lt;/p&gt;

&lt;p&gt;If the number is high, then the next word’s distribution will approach uniformity. If zero or closer to it, then the distribution will more closely resemble that seen in the corpus.&lt;/p&gt;

&lt;p&gt;For all the examples I’ll show, I used an &lt;em&gt;alpha&lt;/em&gt; value of 0.&lt;/p&gt;

&lt;p&gt;The second thing is the weighted_choice function. I had to implement it since Python’s random package doesn’t support weighted choice over a list with more than 32 elements, let alone 32000.&lt;/p&gt;

&lt;h2&gt;
  
  
  Results: Generated Sentences
&lt;/h2&gt;

&lt;p&gt;First of all, as a baseline, I tried a deterministic approach: what happens if we pick a word, use k=1, and always jump to the most likely word after the current one?&lt;/p&gt;

&lt;p&gt;The results are underwhelming, to say the least.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;**I** am not have been a man , and the Wall . " " " " 
**he** was a man , and the Wall . " " " " " " " 
**she** had been a man , and the Wall . " " " " " "
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Since we’re being deterministic, ‘a’ is always followed by ‘man’, ‘the’ is always followed by ‘Wall’ (hehe) and so on.&lt;/p&gt;

&lt;p&gt;This means our sentences will be boring, predictable and kind of nonsensical.&lt;/p&gt;

&lt;p&gt;Now for some actual generation, I tried using a stochastic Markov Chain of 1 word, and a value of 0 for alpha.&lt;/p&gt;

&lt;h3&gt;
  
  
  1-word Markov Chain results
&lt;/h3&gt;

&lt;p&gt;Here are some of the resulting 15-word sentences, with the seed word in bold letters.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;' **the** Seven in front of whitefish in a huge blazes burning flesh . I had been' ' 
**a** squire , slain , they thought . " He bathed in his head . The' ' 
**Bran** said Melisandre had been in fear I’ve done . " It must needs you will' ' 
**Melisandre** would have feared he’d squired for something else I put his place of Ser Meryn' ' 
**Daenerys** is dead cat - TOOTH , AT THE GREAT , Asha , which fills our' ' 
**Daenerys** Targaryen after Melara had worn rich grey sheep to encircle Stannis . " The deep'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;As you can see, the resulting sentences are quite nonsensical, though a lot more interesting than the previous ones.&lt;/p&gt;

&lt;p&gt;Each individual pair of words makes some sense, but the whole sequence is pure non-sequitur.&lt;/p&gt;

&lt;p&gt;The model did learn some interesting things, like how Daenerys is usually followed by Targaryen, and ‘would have feared’ is a pretty good construction for only knowing the previous word.&lt;/p&gt;

&lt;p&gt;However, in general, I’d say this is nowhere near as good as it could be.&lt;/p&gt;

&lt;p&gt;When I increased the value of alpha for the single-word chain, the sentences I got turned out even more random.&lt;/p&gt;

&lt;h3&gt;
  
  
  Results with 2-word Markov chains
&lt;/h3&gt;

&lt;p&gt;The 2-word chain produced some more interesting sentences.&lt;/p&gt;

&lt;p&gt;Even though it too usually ends up sounding completely random, most of its output may actually fool you for a bit at the beginning &lt;em&gt;(emphasis mine)&lt;/em&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;' **the world**. _And Ramsay loved the feel of grass_ _welcomed them warmly_ , the axehead flew'' 
**Jon Snow**. _You are to strike at him_ . _The bold ones have had no sense_'' 
**Eddard Stark** had done his best to give her _the promise was broken_ . By tradition the'' 
**The game** of thrones , so you must tell her the next buyer who comes running ,'' 
**The game** trail brought her messages , strange spices . _The Frey stronghold was not large enough_'' 
**heard the** scream of fear . I want to undress properly . Shae was there , fettered'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The sentences maintain local coherence (&lt;em&gt;You are to strike at him&lt;/em&gt;, or &lt;em&gt;Ramsay loved the feel of grass&lt;/em&gt;), but then join very coherent word sequences into a total mess.&lt;/p&gt;

&lt;p&gt;Any sense of syntax, grammar or semantics is clearly absent.&lt;/p&gt;

&lt;p&gt;By the way, I didn’t cherry-pick those sentences at all, those are the first outputs I sampled.&lt;/p&gt;

&lt;p&gt;Feel free to &lt;a href="https://github.com/StrikingLoo/ASOIAF-Markov" rel="noopener noreferrer"&gt;play with the code yourself&lt;/a&gt;, and you can share the weirdest sentences you get in the comments!&lt;/p&gt;

&lt;p&gt;As a last experiment, let’s see what we get with a 3-word Markov Chain.&lt;/p&gt;

&lt;h3&gt;
  
  
  3-Word Chain Results
&lt;/h3&gt;

&lt;p&gt;Here are some of the sentences the model generated when trained with sequences of 3 words.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;' **I am a** master armorer , lords of Westeros , sawing out each bay and peninsula until the'' 
**Jon Snow is** with the Night’s Watch . I did not survive a broken hip , a leathern'' 
**Jon Snow is** with the Hound in the woods . He won’t do it . " Please don’t'' 
**Where are the** chains , and the Knight of Flowers to treat with you , Imp . "'' 
**Those were the** same . Arianne demurred . " So the fishwives say , " It was Tyrion’s'' 
**He thought that** would be good or bad for their escape . If they can truly give us'' 
**I thought that** she was like to remember a young crow he’d met briefly years before . "'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Alright, I really liked some of those, especially the last one. It kinda sounds like a real sentence you could find in the books.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Implementing a Markov Chain is a lot easier than it may sound, and training it on a real corpus was fun.&lt;/p&gt;

&lt;p&gt;The results were frankly better than I expected, though I may have set the bar too low after my little LSTM fiasco.&lt;/p&gt;

&lt;p&gt;In the future, I may try training this model with even longer chains, or a completely different corpus.&lt;/p&gt;

&lt;p&gt;In this case, trying a 5-word chain produced basically deterministic results again, since each 5-word sequence was almost always unique, so I did not consider chains of 5 words or more to be of interest.&lt;/p&gt;

&lt;p&gt;Which corpus do you think would generate more interesting results, especially for a longer chain? Let me know in the comments!&lt;/p&gt;

&lt;p&gt;&lt;em&gt;If you wish to learn even more about Markov Chains, consider checking &lt;a href="https://amzn.to/31IDAHp" rel="noopener noreferrer"&gt;this in-depth book&lt;/a&gt;. That’s an affiliate link, which means I get a small commission from it&lt;/em&gt;.&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>markovchain</category>
      <category>nlp</category>
      <category>python</category>
    </item>
    <item>
      <title>Coding MapReduce in C from Scratch using Threads: Map</title>
      <dc:creator>Luciano Strika</dc:creator>
      <pubDate>Sat, 19 Oct 2019 05:45:55 +0000</pubDate>
      <link>https://dev.to/strikingloo/coding-mapreduce-in-c-from-scratch-using-threads-map-5f7</link>
      <guid>https://dev.to/strikingloo/coding-mapreduce-in-c-from-scratch-using-threads-map-5f7</guid>
      <description>&lt;p&gt;Hadoop’s MapReduce is not just a Framework, it’s also a problem-solving philosophy.&lt;/p&gt;

&lt;p&gt;Borrowing from functional programming, the MapReduce team realized a lot of different problems could be divided into two common operations: &lt;strong&gt;map&lt;/strong&gt; and &lt;strong&gt;reduce&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Both &lt;strong&gt;mapping&lt;/strong&gt; and &lt;strong&gt;reducing&lt;/strong&gt; steps can be done &lt;strong&gt;in parallel&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;This meant as long as you could &lt;strong&gt;frame your problem&lt;/strong&gt; in that specific way, there would be a solution to it that could easily be run in parallel. This will usually result in a big &lt;strong&gt;performance boost&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;That all sounds good, and running things in parallel is usually a good thing, especially when working at scale. But, some of you in the back may be wondering, what are &lt;strong&gt;Map&lt;/strong&gt; and &lt;strong&gt;Reduce&lt;/strong&gt;?&lt;/p&gt;

&lt;h2&gt;
  
  
  What is &lt;em&gt;MapReduce&lt;/em&gt;?
&lt;/h2&gt;

&lt;p&gt;In order to understand the MapReduce framework, we need to understand its two basic operations: &lt;strong&gt;Map&lt;/strong&gt; and &lt;strong&gt;Reduce&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;They’re both higher-order functions, meaning they can take other functions as arguments.&lt;/p&gt;

&lt;p&gt;Specifically, when you need to convert a certain sequence of elements of type A into a result, or series of results of type B, you will:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Map&lt;/strong&gt; all your inputs to a different domain: that means you will &lt;strong&gt;transform each of them&lt;/strong&gt; with a chosen function, applying it to each element.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Group&lt;/strong&gt; the mapped elements by some criterion, usually a grouping &lt;strong&gt;key&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Reduce&lt;/strong&gt; the mapped elements on each group with some other function. This function needs to take two arguments and return a single one of the same type, successively running an operation between an &lt;strong&gt;accumulator&lt;/strong&gt; and each value in our collection. It should be &lt;strong&gt;commutative and associative&lt;/strong&gt;, as parallel execution &lt;strong&gt;won’t guarantee any order&lt;/strong&gt; for the operations.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;To make this clearer, let’s see an example.&lt;/p&gt;

&lt;h3&gt;
  
  
  Example of a MapReduce solution
&lt;/h3&gt;

&lt;p&gt;Suppose you’re working for an e-commerce company, and they give you a log file of this form:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;John Surname bought 2 apples Alice Challice bought 3 bananas John Surname bought 5 pineapples
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;


&lt;p&gt;Then they ask you to tell them how many fruits each customer bought.&lt;/p&gt;

&lt;p&gt;In this case, after parsing this file to turn it into an actual format, like CSV, you could easily go through each line, and add the number of bought fruits on a dictionary under each name.&lt;/p&gt;

&lt;p&gt;You could even solve it with a bit of &lt;a href="http://www.datastuff.tech/programming/files-strings-shell-tutorial/"&gt;Bash scripting&lt;/a&gt;, or load the CSV on a &lt;a href="http://www.datastuff.tech/data-science/exploratory-data-analysis-with-pandas-and-jupyter-notebooks/"&gt;Pandas DataFrame&lt;/a&gt; and get some statistics.&lt;/p&gt;

&lt;p&gt;However, if the log file was a trillion lines long, bash scripting wouldn’t really cut it. Especially not if you’re not immortal.&lt;/p&gt;

&lt;p&gt;You would need to run this in parallel. Let me propose a MapReduce-y way of doing it:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Map&lt;/strong&gt; each line to a Pair of the form &amp;lt;Name, Quantity&amp;gt; by parsing each string.&lt;/li&gt;
&lt;li&gt;Group by Name.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Reduce&lt;/strong&gt; by summing the quantities.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you’re familiar with SQL and relational databases, you may have thought of a similar solution. The query would look something like&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;select user, sum(bought_fruits)
from fruit_transactions
group by user;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;h2&gt;
  
  
  Why MapReduce scales
&lt;/h2&gt;

&lt;p&gt;Notice how &lt;strong&gt;the mapper doesn’t need to see the whole file&lt;/strong&gt;, just some of the lines. The &lt;strong&gt;reducer&lt;/strong&gt;, on the other hand, &lt;strong&gt;only needs the lines that have the same Name&lt;/strong&gt; (the ones that belong to the same group).&lt;/p&gt;

&lt;p&gt;You could do this with many different threads on the same computer, and then just join the results.&lt;/p&gt;

&lt;p&gt;Or, you could have many different processes running the map jobs, and feeding their output to another set running the reducing job.&lt;/p&gt;

&lt;p&gt;If the log was big enough, you could even be running Mapper and Reducer processes on many different computers (say, on a cluster), and then joining their results in some data lake at the end.&lt;/p&gt;

&lt;p&gt;This kind of solution is very common in ETL jobs and other data-intensive applications, but I won’t delve any further into applications.&lt;/p&gt;

&lt;p&gt;If you wish to learn more about these kinds of scalable solutions, I recommend you check &lt;a href="https://amzn.to/33Dwh56"&gt;this O’Reilly book on designing applications at scale&lt;/a&gt;.&lt;/p&gt;
&lt;h2&gt;
  
  
  Programming MapReduce in C
&lt;/h2&gt;

&lt;p&gt;Now that you have an understanding of what MapReduce is, and why MapReduce scales, let’s cut to the chase.&lt;/p&gt;

&lt;p&gt;For this first article, we will program two different implementations of the &lt;em&gt;Map&lt;/em&gt; function.&lt;/p&gt;

&lt;p&gt;One of them will be &lt;strong&gt;single-threaded&lt;/strong&gt;, to introduce a few concepts and show a &lt;strong&gt;simple solution&lt;/strong&gt;. The other one will use the &lt;em&gt;pthread&lt;/em&gt; library to make an actually &lt;strong&gt;multi-threaded&lt;/strong&gt;, and &lt;strong&gt;much faster&lt;/strong&gt;, version of &lt;em&gt;Map&lt;/em&gt;. Finally, we will compare the two and run some benchmarks.&lt;/p&gt;

&lt;p&gt;As usual, all the code is available on &lt;a href="https://github.com/StrikingLoo/mapReduCe"&gt;this C GitHub project&lt;/a&gt;.&lt;/p&gt;
&lt;h3&gt;
  
  
  Single threaded implementation of &lt;em&gt;Map&lt;/em&gt; in C
&lt;/h3&gt;

&lt;p&gt;First of all, let’s remember what &lt;em&gt;Map&lt;/em&gt; does.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;The Map function receives a &lt;strong&gt;sequence&lt;/strong&gt; and a &lt;strong&gt;function&lt;/strong&gt;, and returns the result of &lt;strong&gt;applying that function to each element&lt;/strong&gt; in the sequence.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Since this is C, representing a sequence can be very straightforward: we can just use a pointer to whatever type we’re mapping over!&lt;/p&gt;

&lt;p&gt;However, there’s a catch. &lt;strong&gt;C is statically typed&lt;/strong&gt;, and we would like our Map function to be &lt;strong&gt;as generic as possible&lt;/strong&gt;. We want it to be able to map over a sequence of elements of any type (provided they all share a type. Let’s not get carried away here, boys).&lt;/p&gt;

&lt;p&gt;How do we solve this? There are probably a few different solutions to this problem. I chose the one that looked simplest, but feel free to pitch in with other ideas.&lt;/p&gt;

&lt;p&gt;We will use sequences of &lt;code&gt;void*&lt;/code&gt;, and cast everything to this type. This means every element will be represented as a pointer to some memory address, without specifying a type (or size).&lt;/p&gt;

&lt;p&gt;We will trust that whatever function we call on these sequence elements knows how to cast them to the right type before using them. We’re effectively delegating that problem away.&lt;/p&gt;

&lt;p&gt;A smaller problem we need to solve is sequence length. A pointer to void doesn’t carry the information of how many elements the sequence has. It only knows where it starts, not where it ends.&lt;/p&gt;

&lt;p&gt;We will solve this other problem by passing sequence length as a second argument. Knowing that, our &lt;em&gt;Map&lt;/em&gt; function becomes pretty straightforward.&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;
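&lt;p&gt;The embedded gist may not render here, so here is a minimal sketch of what that single-threaded &lt;em&gt;Map&lt;/em&gt; can look like (the exact signature in the linked repo may differ):&lt;/p&gt;

```c
#include <stdlib.h>

/* Apply f to each of the n elements of inputs, returning a freshly
   allocated sequence with the results. The caller knows the real
   types hiding behind each void*. */
void** map(void** inputs, size_t n, void* (*f)(void*)) {
    void** outputs = malloc(sizeof(void*) * n);
    for (size_t i = 0; i < n; i++)
        outputs[i] = f(inputs[i]);
    return outputs;
}

/* Example mapped function: doubles an int, heap-allocating the result. */
void* twice(void* input) {
    int* result = malloc(sizeof(int));
    *result = *(int*)input * 2;
    return result;
}
```

&lt;p&gt;Here &lt;code&gt;twice&lt;/code&gt; is just one function to map with; any &lt;code&gt;void* (*f)(void*)&lt;/code&gt; would do.&lt;/p&gt;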



&lt;p&gt;You see, the function receives a &lt;code&gt;void**&lt;/code&gt; to represent the sequence it will map over, and a &lt;code&gt;void* (*f)(void*)&lt;/code&gt; function that transforms elements of some generic type to another (or the same) one.&lt;/p&gt;

&lt;p&gt;After that, we can use our &lt;em&gt;Map&lt;/em&gt; function on any sequence. We only need to do some awkward wrapping and pointer arithmetic beforehand.&lt;/p&gt;

&lt;p&gt;Here’s an example, using a function that returns 1 for prime numbers and 0 for the others.&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;
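&lt;p&gt;Again, the gist may not render in this feed, so here is a hedged reconstruction of that example (names and types are assumptions, not necessarily the repo’s):&lt;/p&gt;

```c
#include <stdlib.h>

/* Single-threaded Map, repeated here so the snippet stands alone. */
void** map(void** inputs, size_t n, void* (*f)(void*)) {
    void** outputs = malloc(sizeof(void*) * n);
    for (size_t i = 0; i < n; i++)
        outputs[i] = f(inputs[i]);
    return outputs;
}

/* Returns a pointer to 1 for primes, and to 0 for everything else. */
void* is_prime(void* input) {
    long n = *(long*)input;
    long* result = malloc(sizeof(long));
    *result = (n >= 2);
    for (long d = 2; d * d <= n; d++)
        if (n % d == 0) { *result = 0; break; }
    return result;
}

/* The "awkward wrapping": turn an array of longs into an array of void*. */
void** box_longs(long* xs, size_t n) {
    void** boxed = malloc(sizeof(void*) * n);
    for (size_t i = 0; i < n; i++)
        boxed[i] = &xs[i];
    return boxed;
}
```

&lt;p&gt;The “awkward wrapping” is &lt;code&gt;box_longs&lt;/code&gt;: every element has to be addressed through a &lt;code&gt;void*&lt;/code&gt; before &lt;em&gt;Map&lt;/em&gt; can see it.&lt;/p&gt;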


&lt;p&gt;As expected, the resulting pointer points to a sequence of integers: 1 corresponds to prime numbers, 0 to composite ones.&lt;/p&gt;

&lt;p&gt;Now that we’ve gone through the single-threaded &lt;em&gt;Map&lt;/em&gt; function, let’s see how to make it run in parallel in C.&lt;/p&gt;

&lt;h2&gt;
  
  
  Multi-threaded Map function in C
&lt;/h2&gt;

&lt;p&gt;In order to use parallel execution in C, we can turn either to processes or to threads.&lt;/p&gt;

&lt;p&gt;For this project, we will be using threads, as they’re more lightweight and, in my opinion, their API is a bit more intuitive for this kind of tutorial.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;(If you want to add a benchmark using processes and forking, feel free to make a pull request!)&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  How to use threads in C
&lt;/h3&gt;

&lt;p&gt;The threads API in C is quite intuitive, if a bit obscure at first.&lt;/p&gt;

&lt;p&gt;To use them, we will have to &lt;code&gt;#include &amp;lt;pthread.h&amp;gt;&lt;/code&gt;. &lt;code&gt;Pthreads&lt;/code&gt;‘ man page explains their interface quite nicely. However, for this tutorial, all we will use is the &lt;code&gt;pthread_create&lt;/code&gt; function.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;pthread_create&lt;/code&gt; takes four arguments:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A pointer to a &lt;code&gt;pthread_t&lt;/code&gt;: the actual thread.&lt;/li&gt;
&lt;li&gt;A pointer to a &lt;code&gt;pthread_attr_t&lt;/code&gt; with thread attributes. In this case, we will pass &lt;code&gt;NULL&lt;/code&gt; for the default configuration.&lt;/li&gt;
&lt;li&gt;The function we want the thread to run. Unlike a process, a thread will only run a function until it returns, rather than continuing the execution of arbitrary code. This function must take a single &lt;code&gt;void*&lt;/code&gt; argument and return another &lt;code&gt;void*&lt;/code&gt; value.&lt;/li&gt;
&lt;li&gt;The input of the aforementioned function. It must be cast to &lt;code&gt;void*&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;After calling &lt;code&gt;pthread_create&lt;/code&gt;, a parallel thread of execution will begin running the given function.&lt;/p&gt;

&lt;p&gt;Once we call &lt;code&gt;pthread_create&lt;/code&gt; for each of the chunks we wish to map, we will have to call &lt;code&gt;pthread_join&lt;/code&gt; on each of them, which makes the parent (original) thread &lt;strong&gt;wait&lt;/strong&gt; until all the threads it spun up &lt;strong&gt;finish running&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Otherwise, the program would end before the mapping was done.&lt;/p&gt;
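&lt;p&gt;As a warm-up, here is a minimal create-then-join sketch (the function names are mine; only &lt;code&gt;pthread_create&lt;/code&gt; and &lt;code&gt;pthread_join&lt;/code&gt; come from the library):&lt;/p&gt;

```c
#include <pthread.h>

/* A thread routine must take a single void* and return a void*.
   Here we smuggle a long through the pointer itself instead of the heap. */
static void* twice_in_thread(void* arg) {
    return (void*)((long)arg * 2);
}

/* Spawn one thread, wait for it, and read back its return value. */
long double_in_background(long x) {
    pthread_t thread;
    void* result;
    pthread_create(&thread, NULL, twice_in_thread, (void*)x); /* NULL = default attributes */
    pthread_join(thread, &result);                            /* block until it returns   */
    return (long)result;
}
```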

&lt;p&gt;Now, let’s feast our eyes on some code.&lt;/p&gt;

&lt;h3&gt;
  
  
  Using pthread for Parallel MapReduce in C
&lt;/h3&gt;

&lt;p&gt;To code MapReduce’s &lt;em&gt;Map&lt;/em&gt; function in C, the first thing we are going to do is define a &lt;code&gt;struct&lt;/code&gt; that can store the generic inputs and outputs for it, as well as the function we will be mapping with.&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;
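&lt;p&gt;In case the gist doesn’t render here, this is a plausible shape for that &lt;code&gt;struct&lt;/code&gt; (field names assumed):&lt;/p&gt;

```c
#include <stddef.h>

/* Everything one thread needs to map over its slice of the data. */
typedef struct {
    void** inputs;      /* the whole input sequence               */
    void** outputs;     /* where to store the mapped results      */
    void* (*f)(void*);  /* the function we are mapping with       */
    size_t start;       /* first index of this thread's slice     */
    size_t end;         /* one past the last index of the slice   */
} map_argument;
```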


&lt;p&gt;Since parallel execution requires some manner of slicing and &lt;strong&gt;partitioning&lt;/strong&gt; , we will store that logic inside this structure as well, using two different &lt;strong&gt;indices&lt;/strong&gt; for the start and end of our slice.&lt;/p&gt;

&lt;p&gt;Next, we will code the function that actually does the mapping: it will cycle through the inputs from &lt;code&gt;start&lt;/code&gt; to &lt;code&gt;end&lt;/code&gt;, storing the result of applying the mapped function to each input in the outputs’ pointer.&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;
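&lt;p&gt;A sketch of that mapping function, under the same assumed struct (the repo’s version may differ in detail):&lt;/p&gt;

```c
#include <stddef.h>

/* Repeated here so the snippet stands alone. */
typedef struct {
    void** inputs;
    void** outputs;
    void* (*f)(void*);
    size_t start;
    size_t end;
} map_argument;

/* Thread body: apply f to every input in [start, end), storing results. */
void* mapper_thread(void* raw) {
    map_argument* arg = (map_argument*)raw;
    for (size_t i = arg->start; i < arg->end; i++)
        arg->outputs[i] = arg->f(arg->inputs[i]);
    return NULL;
}

/* Example mapped function: doubles a long smuggled through the pointer. */
void* twice(void* p) { return (void*)((long)p * 2); }
```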


&lt;p&gt;Finally, the star of the show: the function that starts the threads, assigns a &lt;code&gt;map_argument&lt;/code&gt; to each of them, waits for all the map jobs to run, and finally returns the results.&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;
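&lt;p&gt;A hedged sketch of that orchestrating function, matching the call to &lt;code&gt;concurrent_map&lt;/code&gt; shown further down (struct and worker repeated so the snippet stands alone):&lt;/p&gt;

```c
#include <pthread.h>
#include <stdlib.h>

typedef struct {
    void** inputs;
    void** outputs;
    void* (*f)(void*);
    size_t start;
    size_t end;
} map_argument;

static void* mapper_thread(void* raw) {
    map_argument* arg = (map_argument*)raw;
    for (size_t i = arg->start; i < arg->end; i++)
        arg->outputs[i] = arg->f(arg->inputs[i]);
    return NULL;
}

/* Partition the inputs into n_threads chunks, map each in its own thread,
   join them all, and return the outputs. */
void** concurrent_map(void** inputs, void* (*f)(void*), size_t n, size_t n_threads) {
    void** outputs = malloc(sizeof(void*) * n);
    pthread_t* threads = malloc(sizeof(pthread_t) * n_threads);
    map_argument* args = malloc(sizeof(map_argument) * n_threads);
    size_t chunk = (n + n_threads - 1) / n_threads;  /* ceiling division */

    for (size_t t = 0; t < n_threads; t++) {
        args[t].inputs = inputs;
        args[t].outputs = outputs;
        args[t].f = f;
        args[t].start = t * chunk;
        args[t].end = (t + 1) * chunk < n ? (t + 1) * chunk : n;
        pthread_create(&threads[t], NULL, mapper_thread, &args[t]);
    }
    for (size_t t = 0; t < n_threads; t++)
        pthread_join(threads[t], NULL);  /* wait until every slice is mapped */

    free(threads);
    free(args);
    return outputs;
}

/* Example mapped function: doubles a long smuggled through the pointer. */
void* twice(void* p) { return (void*)((long)p * 2); }
```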


&lt;p&gt;Notice how this function allows us to choose how many threads we want, and partitions the data accordingly. It also handles the threads’ creation and joining.&lt;/p&gt;

&lt;p&gt;Finally, the way we would call this function in main looks something like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;concurrent_map((void**) numbers, twice, N, NTHREADS)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;Where &lt;code&gt;NTHREADS&lt;/code&gt; is the number of threads we want, and &lt;code&gt;N&lt;/code&gt; is how many elements &lt;code&gt;numbers&lt;/code&gt; has.&lt;/p&gt;

&lt;p&gt;Now that the code is done, let’s run some benchmarks! Is this really going to be faster? Will all this wrapper code make things a lot slower? Let’s find out!&lt;/p&gt;

&lt;h2&gt;
  
  
  Map in C, Benchmarks: Single-threaded vs Multi-threaded
&lt;/h2&gt;

&lt;p&gt;In order to measure performance improvements from using parallel &lt;em&gt;Map&lt;/em&gt;, I tested some single-threaded algorithms against their multi-threaded counterparts.&lt;/p&gt;

&lt;h4&gt;
  
  
  First benchmark: slow_twice
&lt;/h4&gt;

&lt;p&gt;For my first test, I used the &lt;em&gt;slow_twice&lt;/em&gt; function, which simply multiplies each number by 2.&lt;/p&gt;

&lt;p&gt;You may be wondering, ‘why is it called slow?’. The answer is simple: we will double each number 1000 times.&lt;/p&gt;

&lt;p&gt;This makes the operation slower, so we can measure time differences without having to use so many numbers that initialization takes too long. It also lets us benchmark the case of many memory writes.&lt;/p&gt;

&lt;p&gt;Since execution time for each number is constant, the non-parallel algorithm’s time grows pretty much linearly with input size.&lt;/p&gt;

&lt;p&gt;I then ran it with 2, 4 and 8 threads. My laptop has 4 cores, and I found that to be the optimal number of threads as well. For some other algorithms, I’ve found a multiple of the number of cores to be optimal, but that wasn’t the case here.&lt;/p&gt;

&lt;h4&gt;
  
  
  Benchmark Results
&lt;/h4&gt;

&lt;p&gt;I ran each benchmark 10 times and took the average, just in case.&lt;/p&gt;

&lt;p&gt;Here are the results:&lt;/p&gt;

&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;&lt;th&gt;Time (s)&lt;/th&gt;&lt;th&gt;5,000,000 elements&lt;/th&gt;&lt;th&gt;10,000,000 elements&lt;/th&gt;&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;single-threaded&lt;/td&gt;&lt;td&gt;18.91&lt;/td&gt;&lt;td&gt;37.47&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;2 threads&lt;/td&gt;&lt;td&gt;9.78&lt;/td&gt;&lt;td&gt;19.49&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;4 threads&lt;/td&gt;&lt;td&gt;6.46&lt;/td&gt;&lt;td&gt;12.85&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;8 threads&lt;/td&gt;&lt;td&gt;8.60&lt;/td&gt;&lt;td&gt;17.18&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;

&lt;p&gt;For both test cases, using &lt;strong&gt;4 threads&lt;/strong&gt; was about &lt;strong&gt;three times faster&lt;/strong&gt; than the single-threaded implementation. This shows a parallel &lt;em&gt;Map&lt;/em&gt; can be a lot faster than the plain single-threaded version.&lt;/p&gt;

&lt;p&gt;There was also a cost to adding more than 4 threads, probably due to the overhead of initialization and context switching.&lt;/p&gt;

&lt;h4&gt;
  
  
  Second benchmark: is_prime
&lt;/h4&gt;

&lt;p&gt;For this benchmark I coded a naive prime testing function: it simply iterates through all the numbers smaller than the input, and returns 0 if any of them divides it, 1 otherwise.&lt;/p&gt;

&lt;p&gt;Notice how this function takes O(n) instead of O(1) for each element, so a few partitions of our data (which is ordered) will be a lot slower than the others. I wonder how this will affect running times.&lt;/p&gt;

&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;&lt;th&gt;Time (s)&lt;/th&gt;&lt;th&gt;150,000 elements&lt;/th&gt;&lt;th&gt;300,000 elements&lt;/th&gt;&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;single-threaded&lt;/td&gt;&lt;td&gt;5.02&lt;/td&gt;&lt;td&gt;18.73&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;2 threads&lt;/td&gt;&lt;td&gt;3.76&lt;/td&gt;&lt;td&gt;13.78&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;4 threads&lt;/td&gt;&lt;td&gt;2.73&lt;/td&gt;&lt;td&gt;10.14&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;8 threads&lt;/td&gt;&lt;td&gt;2.43&lt;/td&gt;&lt;td&gt;8.70&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;

&lt;p&gt;In this case, the parallel algorithm again beats the single-threaded one. No big surprises there. However, this time there’s an &lt;strong&gt;improvement when using over 4 threads&lt;/strong&gt;!&lt;/p&gt;

&lt;p&gt;I think this is because when partitioning our inputs, dividing them into smaller chunks makes the &lt;strong&gt;slowest partition take less time&lt;/strong&gt;, thus making our bottleneck smaller.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;I had a lot of fun running this experiment.&lt;/p&gt;

&lt;p&gt;Picking &lt;strong&gt;how many threads to use&lt;/strong&gt; turns out to be a lot harder than just “use as many as there are cores”, and &lt;strong&gt;depends a lot on our input&lt;/strong&gt; even for very dumb algorithms.&lt;/p&gt;

&lt;p&gt;This may help us understand why optimizing a cluster’s configuration can be such a daunting task for a big application.&lt;/p&gt;

&lt;p&gt;In the future, I may add a parallel &lt;em&gt;reduce&lt;/em&gt; implementation to complete this little framework.&lt;/p&gt;

&lt;p&gt;A few other benchmarks that might be fun to run in the future are &lt;em&gt;Map&lt;/em&gt; in C vs &lt;a href="http://www.datastuff.tech/programming/pythons-list-comprehensions-uses-and-advantages/"&gt;Python List Comprehensions&lt;/a&gt;, and C vs SIMD-Assembly.&lt;/p&gt;

&lt;p&gt;Remember you can use this code any way you like, or run your own experiments, and if you do &lt;em&gt;please&lt;/em&gt; don’t forget to let me know your results in the comments!&lt;/p&gt;

&lt;p&gt;Feel free to contact me on &lt;a href="http://www.twitter.com/strikingloo"&gt;Twitter&lt;/a&gt;, &lt;a href="http://www.medium.com/@strikingloo"&gt;Medium&lt;/a&gt; or &lt;a href="http://www.dev.to/strikingloo"&gt;dev.to&lt;/a&gt; for anything you want to say or ask me!&lt;/p&gt;

&lt;p&gt;If you want to level up as a Data scientist, check out my &lt;a href="http://www.datastuff.tech/data-science/3-machine-learning-books-that-helped-me-level-up-as-a-data-scientist/"&gt;best Machine Learning books&lt;/a&gt; list and my &lt;a href="http://www.datastuff.tech/programming/terminal-tutorial-more-productive/"&gt;Bash tutorial&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>programming</category>
      <category>beginners</category>
      <category>c</category>
      <category>dataengineering</category>
    </item>
    <item>
      <title>Why do Neural Networks Need an Activation Function?</title>
      <dc:creator>Luciano Strika</dc:creator>
      <pubDate>Mon, 01 Jul 2019 00:21:12 +0000</pubDate>
      <link>https://dev.to/strikingloo/why-do-neural-networks-need-an-activation-function-127m</link>
      <guid>https://dev.to/strikingloo/why-do-neural-networks-need-an-activation-function-127m</guid>
      <description>&lt;p&gt;Why do Neural Networks Need an Activation Function? Whenever you see a Neural Network’s architecture for the first time, one of the first things you’ll notice is they have a lot of interconnected layers.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Each layer in a Neural Network has an activation function, but why are they necessary? And why are they so important? Learn the answer here.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  What are activation functions?
&lt;/h2&gt;

&lt;p&gt;To answer the question of what Activation Functions are, let’s first take a step back and answer a bigger one: What is a Neural Network?&lt;/p&gt;

&lt;h3&gt;
  
  
  What are Neural Networks?
&lt;/h3&gt;

&lt;p&gt;A Neural Network is a Machine Learning model that, given certain input and output vectors, will try to “fit” the outputs to the inputs.&lt;/p&gt;

&lt;p&gt;What this means is, given a set of observed instances with certain values we wish to predict, and some data we have on each instance, it will try to generalize from that data so that it can predict the values correctly for new instances of the problem.&lt;/p&gt;

&lt;p&gt;As an example, we may be designing an image classifier (typically with a &lt;a href="https://dev.to/strikingloo/convolutional-neural-networks-an-introduction-tensorflow-eager-4f4m"&gt;Convolutional Neural Network&lt;/a&gt;). Here, the inputs are a vector of pixels. The output could be a numerical class label (for instance, 1 for dogs, 0 for cats).&lt;/p&gt;

&lt;p&gt;This would train a Neural Network to predict whether an image contains a cat or a dog.&lt;/p&gt;

&lt;p&gt;But what is a mathematical function that, given a set of pixels, returns 1 if they correspond to the image of a dog, and 0 to the image of a cat?&lt;/p&gt;

&lt;p&gt;Coming up with a mathematical function that did that by hand would be impossible. &lt;strong&gt;For a human&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;So what we did is invent a Machine that finds that function for us.&lt;/p&gt;

&lt;p&gt;It looks something like this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdocs.opencv.org%2F2.4%2F_images%2Fmlp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdocs.opencv.org%2F2.4%2F_images%2Fmlp.png" alt="Image result for neural network mlp"&gt;&lt;/a&gt;Single hidden layer Neural Network. &lt;a href="https://docs.opencv.org/2.4/modules/ml/doc/neural_networks.html" rel="noopener noreferrer"&gt;Source&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;But you may have seen this picture many times, recognize it as a Neural Network, and still not know exactly what it represents.&lt;/p&gt;

&lt;p&gt;Here, each circle represents a neuron in our Neural Network, and the vertically aligned neurons represent each layer.&lt;/p&gt;

&lt;h3&gt;
  
  
  How do Neural Networks work?
&lt;/h3&gt;

&lt;p&gt;A neuron is just a mathematical function that takes inputs (the outputs of the neurons pointing to it) and returns outputs.&lt;/p&gt;

&lt;p&gt;These outputs serve as inputs for the next layer, and so on until we get to the final, output layer, which is the actual value we return.&lt;/p&gt;

&lt;p&gt;There is an input layer, where each neuron will simply return the corresponding value in the inputs vector.&lt;/p&gt;

&lt;p&gt;For each set of inputs, the Neural Network’s goal is to make each of its outputs as close as possible to the actual expected values.&lt;/p&gt;

&lt;p&gt;Again, think back at the example of the image classifier.&lt;/p&gt;

&lt;p&gt;If we take 100x100px pictures of animals as inputs, then our input layer will have 30000 neurons. That’s 10000 for all the pixels, times three, since each pixel carries a vector of three values (RGB).&lt;/p&gt;

&lt;p&gt;We will then run the inputs through each layer. We get a new vector as each layer’s output, feed it to the next layer as inputs, and so on.&lt;/p&gt;

&lt;p&gt;Each neuron in a layer will return a single value, so a layer’s output vector will have as many dimensions as the layer has neurons.&lt;/p&gt;

&lt;p&gt;So, which value will a neuron return, given some inputs?&lt;/p&gt;

&lt;h3&gt;
  
  
  What does a Neuron do?
&lt;/h3&gt;

&lt;p&gt;A neuron will take an input vector, and do three things to it:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Multiply it by a weights vector.&lt;/li&gt;
&lt;li&gt;Add a bias value to that product.&lt;/li&gt;
&lt;li&gt;Apply an &lt;strong&gt;activation function&lt;/strong&gt; to that value.&lt;/li&gt;
&lt;/ul&gt;
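&lt;p&gt;Those three steps fit in a few lines of C; here is a sketch of a single neuron (the names are mine, and ReLU stands in for an arbitrary activation):&lt;/p&gt;

```c
#include <stddef.h>

/* ReLU activation: identity for non-negative values, 0 for negative ones. */
double relu(double x) { return x > 0 ? x : 0; }

/* One neuron: weighted sum of the inputs, plus a bias, through an activation. */
double neuron(const double* inputs, const double* weights, size_t n,
              double bias, double (*activation)(double)) {
    double sum = bias;
    for (size_t i = 0; i < n; i++)
        sum += inputs[i] * weights[i];  /* the affine (linear) part     */
    return activation(sum);             /* the non-linear part: step 3  */
}
```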

&lt;p&gt;And we finally got to the core of our business: that’s what activation functions do.&lt;/p&gt;

&lt;p&gt;We’ll typically use non-linear functions as activation functions. This is because the linear part is already handled by the previously applied product and addition.&lt;/p&gt;

&lt;h2&gt;
  
  
  What are the most commonly used activation functions?
&lt;/h2&gt;

&lt;p&gt;I’m saying non-linear functions and it sounds logical enough, but what are the typical, commonly used activation functions?&lt;/p&gt;

&lt;p&gt;Let’s see some examples.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;ReLU&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;ReLU stands for “Rectified Linear Unit”.&lt;/p&gt;

&lt;p&gt;Of all the activation functions, this is the one that’s most similar to a linear one:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;For non-negative values, it just applies the identity.&lt;/li&gt;
&lt;li&gt;For negative values, it returns 0.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In mathematical words,&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fwikimedia.org%2Fapi%2Frest_v1%2Fmedia%2Fmath%2Frender%2Fsvg%2Fbb2c32931fad595832c8e66f2f73760ebcbc0096" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fwikimedia.org%2Fapi%2Frest_v1%2Fmedia%2Fmath%2Frender%2Fsvg%2Fbb2c32931fad595832c8e66f2f73760ebcbc0096" alt="{\displaystyle f(x)=x^{+}=\max(0,x)}"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This means all negative values will become 0, while the rest of the values just stay as they are.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsebastianraschka.com%2Fimages%2Ffaq%2Frelu-derivative%2Frelu_3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsebastianraschka.com%2Fimages%2Ffaq%2Frelu-derivative%2Frelu_3.png" alt="Image result for RELu"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This is a biologically inspired function, since neurons in a brain will either “fire” (return a positive value) or not (return 0).&lt;/p&gt;

&lt;p&gt;Notice how combined with a bias, this actually filters out any value beneath a certain threshold.&lt;/p&gt;

&lt;p&gt;Suppose our bias had a value of -b. Any input value lower than b will become negative after we add the bias, and ReLU then turns it into 0.&lt;/p&gt;

&lt;h3&gt;
  
  
  Sigmoid
&lt;/h3&gt;

&lt;p&gt;The sigmoid function takes any real number as input, and returns a value between 0 and 1. Since it is continuous, it effectively “smushes” values:&lt;/p&gt;

&lt;p&gt;If you apply the sigmoid to 3, you get 0.95. Apply it to 10, you get 0.999… And it will keep approaching 1 without ever reaching it.&lt;/p&gt;

&lt;p&gt;The same happens in the negative direction, except there it converges to 0.&lt;/p&gt;

&lt;p&gt;Here’s the mathematical formula for the sigmoid function.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fwikimedia.org%2Fapi%2Frest_v1%2Fmedia%2Fmath%2Frender%2Fsvg%2F9537e778e229470d85a68ee0b099c08298a1a3f6" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fwikimedia.org%2Fapi%2Frest_v1%2Fmedia%2Fmath%2Frender%2Fsvg%2F9537e778e229470d85a68ee0b099c08298a1a3f6" alt="{\displaystyle S(x)={\frac {1}{1+e^{-x}}}={\frac {e^{x}}{e^{x}+1}}.}"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As you see, it approaches 1 as x approaches infinity, and approaches 0 if x approaches minus infinity.&lt;/p&gt;

&lt;p&gt;It is also symmetric about the point (0, 1/2): it has a value of exactly 1/2 when its input is 0.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fupload.wikimedia.org%2Fwikipedia%2Fcommons%2Fthumb%2F8%2F88%2FLogistic-curve.svg%2F320px-Logistic-curve.svg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fupload.wikimedia.org%2Fwikipedia%2Fcommons%2Fthumb%2F8%2F88%2FLogistic-curve.svg%2F320px-Logistic-curve.svg.png" alt="Image result for sigmoid"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Since it takes values between 0 and 1, this function is extremely useful as an output if you want to model a probability.&lt;/p&gt;

&lt;p&gt;It’s also helpful if you wish to apply a “filter” to partially keep a certain value (like in an &lt;a href="https://dev.to/strikingloo/lstm-how-to-train-neural-networks-to-write-like-lovecraft-2bbk"&gt;LSTM’s forget gate&lt;/a&gt;).&lt;/p&gt;

&lt;h2&gt;
  
  
  Why do Neural Networks Need an Activation Function?
&lt;/h2&gt;

&lt;p&gt;We’ve already talked about the applications some different activation functions have, in different cases.&lt;/p&gt;

&lt;p&gt;Some let a signal through or obstruct it, others filter its intensity. There’s even the &lt;a href="https://en.wikipedia.org/wiki/Hyperbolic_function" rel="noopener noreferrer"&gt;tanh&lt;/a&gt; activation function: instead of filtering, it turns its input into either a negative or positive value.&lt;/p&gt;

&lt;p&gt;But why do our Neural Networks need Activation Functions? What would happen if we didn’t use them?&lt;/p&gt;

&lt;p&gt;I found the explanation for this question in Yoshua Bengio’s awesome &lt;a href="https://amzn.to/305g2MF" rel="noopener noreferrer"&gt;Deep Learning book&lt;/a&gt;, and I think it’s perfectly explained there.&lt;/p&gt;

&lt;p&gt;We could, instead of composing our linear transformations with non-linear functions, make each neuron simply return its result (effectively composing them with the identity instead).&lt;/p&gt;

&lt;p&gt;But then all of our layers would simply stack one affine (product plus addition) transformation after another. Each layer would simply apply another matrix product and vector addition to the previous one’s output.&lt;/p&gt;

&lt;p&gt;It can be shown (and you can even convince yourself if you try the math with a small vector on a whiteboard) that this composition of affine transformations is equivalent to a single affine transformation.&lt;/p&gt;

&lt;p&gt;Effectively, this whole “Neural Network” where all activation functions have been replaced by the identity would be nothing more than a vector product and a bias addition.&lt;/p&gt;

&lt;p&gt;There are many problems a linear transformation can’t solve, so we would effectively be shrinking the quantity of functions our model could estimate.&lt;/p&gt;

&lt;p&gt;As a very simple but earthshaking example, consider the XOR operator.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F1600%2F1%2AFDAAQeaaE8s_Il8K9llr8w.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F1600%2F1%2AFDAAQeaaE8s_Il8K9llr8w.png" alt="XOR values table"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Try to find a two-element weight vector, plus a bias, that can take x1 and x2 and turn them into x1 XOR x2. Go ahead, I’ll wait.&lt;/p&gt;

&lt;p&gt;…&lt;/p&gt;

&lt;p&gt;Exactly: you can’t. Nobody can. However, consider&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fquicklatex.com%2Fcache3%2Fb6%2Fql_95cce8bf433664197906005aa89260b6_l3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fquicklatex.com%2Fcache3%2Fb6%2Fql_95cce8bf433664197906005aa89260b6_l3.png" alt="formula for a neural network that solves the XOR problem."&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fquicklatex.com%2Fcache3%2Fcd%2Fql_64edade5ed24228097cfdaae7c1e0ecd_l3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fquicklatex.com%2Fcache3%2Fcd%2Fql_64edade5ed24228097cfdaae7c1e0ecd_l3.png" alt="defining vectors with latex"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If you work the math, you’ll see this has the desired output for each possible combination of 1 and 0.&lt;/p&gt;
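&lt;p&gt;To check the math, here is the standard two-hidden-unit ReLU solution for XOR from the Deep Learning book, sketched in C (I’m assuming the images above show these same weights; if they differ, the idea is identical):&lt;/p&gt;

```c
/* ReLU activation: 0 for negative values, identity otherwise. */
double relu(double x) { return x > 0 ? x : 0; }

/* Two hidden ReLU units plus a linear output, computing x1 XOR x2. */
double xor_net(double x1, double x2) {
    double h1 = relu(x1 + x2);        /* hidden weights (1, 1), bias 0  */
    double h2 = relu(x1 + x2 - 1.0);  /* hidden weights (1, 1), bias -1 */
    return h1 - 2.0 * h2;             /* output weights (1, -2), bias 0 */
}
```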

&lt;p&gt;Congratulations! You’ve just trained your first Neural Network!&lt;/p&gt;

&lt;p&gt;And it’s learned a problem a linear model could never have learned.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusions
&lt;/h2&gt;

&lt;p&gt;I hope after this explanation, you now have a better understanding of why Neural Networks need an Activation Function.&lt;/p&gt;

&lt;p&gt;In future articles, I may cover other Activation Functions and their uses, like SoftMax and the controversial Cos.&lt;/p&gt;

&lt;p&gt;So what do you think? Did you learn anything from this article? Did you find it interesting? Was the math off?&lt;/p&gt;

&lt;p&gt;Feel free to contact me on &lt;a href="http://www.twitter.com/strikingloo" rel="noopener noreferrer"&gt;Twitter&lt;/a&gt;, &lt;a href="http://www.medium.com/@strikingloo" rel="noopener noreferrer"&gt;Medium&lt;/a&gt; or &lt;a href="http://www.dev.to/strikingloo"&gt;dev.to&lt;/a&gt; for anything you want to say or ask me!&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>deeplearning</category>
      <category>beginners</category>
      <category>ai</category>
    </item>
    <item>
      <title>Data Scientists|Engineers: What are the Frameworks you use the most at your job?</title>
      <dc:creator>Luciano Strika</dc:creator>
      <pubDate>Fri, 28 Jun 2019 05:20:11 +0000</pubDate>
      <link>https://dev.to/strikingloo/data-scientists-engineers-what-are-the-frameworks-you-use-the-most-at-your-job-4dda</link>
      <guid>https://dev.to/strikingloo/data-scientists-engineers-what-are-the-frameworks-you-use-the-most-at-your-job-4dda</guid>
      <description>&lt;p&gt;I've seen a lot of statistics about programmers, but not specifically about Data Scientists or Engineers.&lt;br&gt;
Because of that, I'd like to propose this survey:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Do you identify as a &lt;strong&gt;Data Scientist&lt;/strong&gt; or &lt;strong&gt;Data Engineer&lt;/strong&gt;?&lt;/li&gt;
&lt;li&gt;What are the &lt;strong&gt;languages and frameworks&lt;/strong&gt; you use the most at your job? Say, the top 5 or 6.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;As a bonus:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;What are the Frameworks you &lt;em&gt;wish&lt;/em&gt; you were using instead?&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;I will write back with some visuals or analysis if this gets enough traction. &lt;br&gt;
Of course all the data will be public here, so you can do your own too.&lt;/p&gt;

</description>
      <category>discuss</category>
      <category>datascience</category>
      <category>python</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>LSTM: How to Train Neural Networks to Write like Lovecraft</title>
      <dc:creator>Luciano Strika</dc:creator>
      <pubDate>Mon, 24 Jun 2019 00:37:59 +0000</pubDate>
      <link>https://dev.to/strikingloo/lstm-how-to-train-neural-networks-to-write-like-lovecraft-2bbk</link>
      <guid>https://dev.to/strikingloo/lstm-how-to-train-neural-networks-to-write-like-lovecraft-2bbk</guid>
      <description>&lt;p&gt;LSTM Neural Networks have seen a lot of use in the recent years, both for text and music generation, and for Time Series Forecasting.&lt;/p&gt;

&lt;p&gt;Today, I’ll teach you how to train an LSTM Neural Network for text generation, so that it can write with H. P. Lovecraft’s style.&lt;/p&gt;

&lt;p&gt;In order to train this LSTM, we’ll be using TensorFlow’s Keras API for Python.&lt;/p&gt;

&lt;p&gt;I learned about this subject from this awesome &lt;a href="http://colah.github.io/posts/2015-08-Understanding-LSTMs/" rel="noopener noreferrer"&gt;LSTM Neural Networks tutorial&lt;/a&gt;. My code follows this &lt;a href="https://chunml.github.io/ChunML.github.io/project/Creating-Text-Generator-Using-Recurrent-Neural-Network/" rel="noopener noreferrer"&gt;Text Generation tutorial&lt;/a&gt; closely.&lt;/p&gt;

&lt;p&gt;I’ll show you my Python examples and results as usual, but first, let’s do some explaining.&lt;/p&gt;

&lt;h2&gt;
  
  
  What are LSTM Neural Networks?
&lt;/h2&gt;

&lt;p&gt;The most vanilla, run-of-the-mill Neural Network, called a Multi-Layer-Perceptron, is just a composition of fully connected layers.&lt;/p&gt;

&lt;p&gt;In these models, the input is a vector of features, and each subsequent layer is a set of “neurons”.&lt;/p&gt;

&lt;p&gt;Each neuron performs an affine (linear) transformation to the previous layer’s output, and then applies some non-linear function to that result.&lt;/p&gt;

&lt;p&gt;The output of a layer’s neurons, a new vector, is fed to the next layer, and so on.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxkyyedoredz9bax6h4v6.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxkyyedoredz9bax6h4v6.jpg" alt="Image result for multilayer perceptron"&gt;&lt;/a&gt;&lt;a href="https://www.researchgate.net/figure/A-hypothetical-example-of-Multilayer-Perceptron-Network_fig4_303875065" rel="noopener noreferrer"&gt;Source&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;An LSTM (Long Short-Term Memory) Neural Network is just another kind of &lt;a href="http://www.datastuff.tech/machine-learning/autoencoder-deep-learning-tensorflow-eager-api-keras/" rel="noopener noreferrer"&gt;Artificial Neural Network&lt;/a&gt;, which falls in the category of Recurrent Neural Networks.&lt;/p&gt;

&lt;p&gt;What makes LSTM Neural Networks different from regular Neural Networks is, they have LSTM cells as neurons in some of their layers.&lt;/p&gt;

&lt;p&gt;Much like &lt;a href="https://dev.to/strikingloo/convolutional-neural-networks-an-introduction-tensorflow-eager-4f4m"&gt;Convolutional Layers&lt;/a&gt; help a Neural Network learn about image features, LSTM cells help the Network learn about temporal data, something which other Machine Learning models traditionally struggled with.&lt;/p&gt;

&lt;p&gt;How do LSTM cells work? I’ll explain it now, though I highly recommend you give those tutorials a chance too.&lt;/p&gt;

&lt;h2&gt;
  
  
  How do LSTM cells work?
&lt;/h2&gt;

&lt;p&gt;An LSTM layer will contain many LSTM cells.&lt;/p&gt;

&lt;p&gt;Each LSTM cell in our Neural Network will only look at a single column of its inputs, and also at the previous column’s LSTM cell’s output.&lt;/p&gt;

&lt;p&gt;Normally, we feed our LSTM Neural Network a whole matrix as its input, where each column corresponds to something that “comes before” the next column.&lt;/p&gt;

&lt;p&gt;This way, each LSTM cell will have &lt;strong&gt;two different input vectors&lt;/strong&gt; : the previous LSTM cell’s output (which gives it some information about the previous input column) and its own input column.&lt;/p&gt;

&lt;h3&gt;
  
  
  LSTM Cells in action: an intuitive example.
&lt;/h3&gt;

&lt;p&gt;For instance, if we were training an LSTM Neural Network to predict stock exchange values, we could feed it a vector with a stock’s closing prices from the last three days.&lt;/p&gt;

&lt;p&gt;The first LSTM cell, in that case, would use the first day as input, and send some extracted features to the next cell.&lt;/p&gt;

&lt;p&gt;That second cell would look at the second day’s price, and also at whatever the previous cell learned from yesterday, before generating new inputs for the next cell.&lt;/p&gt;

&lt;p&gt;After doing this for each cell, the last one will actually have a lot of temporal information: besides its own day’s price, it receives what the previous cell extracted from the day before, which in turn encodes what the earlier cells learned from the days before that.&lt;/p&gt;

&lt;p&gt;You can experiment with different time windows, and also change how many units (neurons) will look at each day’s data, but this is the general idea.&lt;/p&gt;
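&lt;p&gt;As a minimal sketch of that idea (the names and the numbers here are made up for illustration), building the three-day input windows could look like this:&lt;/p&gt;

```python
def make_windows(prices, window=3):
    """Slice a price series into overlapping windows. Each window is one
    input to the LSTM layer; its cells read the days one at a time."""
    return [prices[i:i + window] for i in range(len(prices) - window + 1)]

closing_prices = [101.0, 103.5, 102.2, 104.8, 106.1]
windows = make_windows(closing_prices)
# The first window covers days 1-3, the next starts one day later, and so on.
```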

&lt;h3&gt;
  
  
  How LSTM Cells work: the Math.
&lt;/h3&gt;

&lt;p&gt;The actual math behind what each cell extracts from the previous one is a bit more involved.&lt;/p&gt;

&lt;h4&gt;
  
  
  Forget Gate
&lt;/h4&gt;

&lt;p&gt;The “forget gate” is a sigmoid layer that regulates how much of the previous cell’s state will carry over into this one’s.&lt;/p&gt;

&lt;p&gt;It takes as input both the previous cell’s “hidden state” (another output vector), and the actual inputs from the previous layer.&lt;/p&gt;

&lt;p&gt;Since it is a sigmoid, it will return a vector of “probabilities”: values between 0 and 1.&lt;/p&gt;

&lt;p&gt;They will &lt;strong&gt;multiply the previous cell’s state&lt;/strong&gt; element-wise to regulate how much influence it keeps, creating this cell’s state.&lt;/p&gt;

&lt;p&gt;For instance, in a drastic case, the sigmoid may return a vector of zeroes, and the whole state would be multiplied by 0 and thus discarded.&lt;/p&gt;

&lt;p&gt;This may happen if this layer sees a very big change in the input distribution, for example.&lt;/p&gt;
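&lt;p&gt;Here’s a minimal numpy sketch of the forget gate described above (the weight matrix, the sizes, and the vectors are all invented for illustration, and I leave out the bias term):&lt;/p&gt;

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
hidden_size, input_size = 4, 3

W_f = rng.standard_normal((hidden_size, hidden_size + input_size))  # learned weights

h_prev = rng.standard_normal(hidden_size)  # previous cell's hidden state
x_t = rng.standard_normal(input_size)      # this cell's input column
c_prev = rng.standard_normal(hidden_size)  # previous cell's state

# The gate looks at both input vectors and squashes the result into (0, 1)...
f_t = sigmoid(W_f @ np.concatenate([h_prev, x_t]))
# ...then scales the previous state element-wise: 0 forgets, 1 keeps.
c_scaled = f_t * c_prev
```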

&lt;h4&gt;
  
  
  Input Gate
&lt;/h4&gt;

&lt;p&gt;Unlike the forget gate, the input gate’s output is added to the previous cell’s state (after it has been multiplied by the forget gate’s output).&lt;/p&gt;

&lt;p&gt;The input gate is the element-wise product of two different layers’ outputs, though they both take the same inputs as the forget gate (the previous cell’s hidden state, and the previous layer’s outputs):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A &lt;strong&gt;sigmoid unit&lt;/strong&gt; , regulating how much the new information will impact this cell’s output.&lt;/li&gt;
&lt;li&gt;A &lt;strong&gt;tanh unit&lt;/strong&gt; , which actually extracts the new information. Notice tanh takes values between -1 and 1.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The &lt;strong&gt;product of these two units&lt;/strong&gt; (which could, again, be 0, or be exactly equal to the tanh output, or anything in between) is added to this neuron’s cell state.&lt;/p&gt;

&lt;h4&gt;
  
  
  The LSTM cell’s outputs
&lt;/h4&gt;

&lt;p&gt;The cell’s state is what the next LSTM cell will receive as input, along with this cell’s hidden state.&lt;/p&gt;

&lt;p&gt;The hidden state will be &lt;strong&gt;another tanh unit&lt;/strong&gt; applied to this cell’s state, multiplied by another &lt;strong&gt;sigmoid unit&lt;/strong&gt; (the output gate) that, just like the forget gate, takes the previous layer’s outputs and the previous hidden state as inputs.&lt;/p&gt;
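&lt;p&gt;Putting the three gates together, one step of a single LSTM cell can be sketched in numpy like this (this is the standard formulation; the parameter names and sizes are mine, and biases are omitted for brevity):&lt;/p&gt;

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM cell step: forget gate, input gate, then the outputs."""
    z = np.concatenate([h_prev, x_t])
    f = sigmoid(params["W_f"] @ z)  # forget gate
    i = sigmoid(params["W_i"] @ z)  # input gate, sigmoid half
    g = np.tanh(params["W_g"] @ z)  # input gate, tanh half
    o = sigmoid(params["W_o"] @ z)  # output gate
    c_t = f * c_prev + i * g        # new cell state
    h_t = o * np.tanh(c_t)          # new hidden state
    return h_t, c_t

rng = np.random.default_rng(1)
H, D = 4, 3  # hidden size and input-column size, chosen arbitrarily
params = {k: rng.standard_normal((H, H + D)) for k in ["W_f", "W_i", "W_g", "W_o"]}

h, c = np.zeros(H), np.zeros(H)
for day in rng.standard_normal((3, D)):  # three input columns, fed in order
    h, c = lstm_step(day, h, c, params)
```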

&lt;p&gt;Here’s a visualization of what each LSTM cell looks like, borrowed from the tutorial I just linked:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd7m61oxn93t2j2yorbh1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd7m61oxn93t2j2yorbh1.png" alt="LSTM"&gt;&lt;/a&gt;Source: &lt;a href="https://chunml.github.io/ChunML.github.io/project/Creating-Text-Generator-Using-Recurrent-Neural-Network/" rel="noopener noreferrer"&gt;Text Generating LSTMs&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Now that we’ve covered the theory, let’s move on to some practical uses!&lt;/p&gt;

&lt;p&gt;As usual, all of the code is &lt;a href="https://github.com/StrikingLoo/LoveCraftLSTM" rel="noopener noreferrer"&gt;available on GitHub&lt;/a&gt; if you want to try it out, or you can just follow along and see the gists.&lt;/p&gt;

&lt;h2&gt;
  
  
  Training LSTM Neural Networks with TensorFlow Keras
&lt;/h2&gt;

&lt;p&gt;For this task, I used this &lt;a href="https://github.com/vilmibm/lovecraftcorpus" rel="noopener noreferrer"&gt;dataset containing 60 Lovecraft tales&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Since he wrote most of his work in the 20s, and he died in 1937, it’s now mostly in the public domain, so it wasn’t that hard to get.&lt;/p&gt;

&lt;p&gt;I thought training a Neural Network to write like him would be an interesting challenge.&lt;/p&gt;

&lt;p&gt;This is because, on the one hand, he had a very distinct style (with abundant purple prose: weird words and elaborate language), but on the other hand his very complex vocabulary may give a Network trouble.&lt;/p&gt;

&lt;p&gt;For instance, here’s a random sentence from the first tale in the dataset:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;At night the subtle stirring of the black city outside, the sinister scurrying of rats in the wormy partitions, and the creaking of hidden timbers in the centuried house, were enough to give him a sense of strident pandemonium&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;If I can get a Neural Network to write “pandemonium”, then I’ll be impressed.&lt;/p&gt;

&lt;h3&gt;
  
  
  Preprocessing our data
&lt;/h3&gt;

&lt;p&gt;In order to train an LSTM Neural Network to generate text, we must first preprocess our text data so that it can be consumed by the network.&lt;/p&gt;

&lt;p&gt;In this case, since a Neural Network takes vectors as input, we need a way to convert the text into vectors.&lt;/p&gt;

&lt;p&gt;For these examples, I decided to train my LSTM Neural Networks to predict the next M characters in a string, taking as input the previous N ones.&lt;/p&gt;

&lt;p&gt;To be able to feed it the N characters, I did a one-hot encoding of each one of them, so that the network’s input is a matrix of CxN elements, where C is the total number of different characters in my dataset.&lt;/p&gt;

&lt;p&gt;First, we read the text files and concatenate all of their contents.&lt;/p&gt;

&lt;p&gt;We limit our characters to be alphanumerical, plus a few punctuation marks.&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;
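&lt;p&gt;The gist above doesn’t render in this feed, so here’s a rough sketch of the idea (my character whitelist is an assumption; the original gist’s may differ):&lt;/p&gt;

```python
import string

# Which characters survive is an assumption; the original script's exact
# whitelist may differ.
ALLOWED = set(string.ascii_letters + string.digits + " .,;:!?'\n")

def clean(text):
    """Keep only alphanumeric characters plus a few punctuation marks."""
    return "".join(ch for ch in text if ch in ALLOWED)

# In the real script the corpus comes from concatenating every tale:
# corpus = clean("".join(open(path).read() for path in tale_paths))
corpus = clean("Ph'nglui mglw'nafh @@ Cthulhu ##")
```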
 

&lt;p&gt;We can then proceed to one-hot encode the strings into matrices, where every element of the &lt;em&gt;j&lt;/em&gt;-th column is a 0 except for the one corresponding to the &lt;em&gt;j&lt;/em&gt;-th character in the corpus.&lt;/p&gt;

&lt;p&gt;In order to do this, we first define a dictionary that assigns an index to each character.&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;
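&lt;p&gt;Again, the gist isn’t embedded here, so here’s a minimal sketch of the dictionary and the one-hot encoding (variable names are mine, and the tiny corpus is just for illustration):&lt;/p&gt;

```python
import numpy as np

corpus = "the centuried house"
chars = sorted(set(corpus))
char_to_ix = {ch: i for i, ch in enumerate(chars)}  # an index per character

def one_hot(text):
    """Encode a string as a VOCAB_SIZE x len(text) matrix: column j is
    all zeros except for a 1 in the row of the j-th character."""
    m = np.zeros((len(chars), len(text)))
    for j, ch in enumerate(text):
        m[char_to_ix[ch], j] = 1.0
    return m

X = one_hot(corpus[:10])  # the first 10 characters as network input
```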


&lt;p&gt;Notice how, if we wished to sample our data, we could just make the variable &lt;em&gt;slices&lt;/em&gt; smaller.&lt;/p&gt;

&lt;p&gt;I also chose a value for &lt;em&gt;SEQ_LENGTH&lt;/em&gt; of 50, making the network receive 50 characters and try to predict the next 50.&lt;/p&gt;

&lt;h3&gt;
  
  
  Training our LSTM Neural Network in Keras
&lt;/h3&gt;

&lt;p&gt;In order to train the Neural Network, we must first define it.&lt;/p&gt;

&lt;p&gt;This Python code creates an LSTM Neural Network with two LSTM layers, each with 100 units.&lt;/p&gt;

&lt;p&gt;Remember each unit has one cell for each character in the input sequence, so 50 of them here.&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;
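&lt;p&gt;Since the gist doesn’t render here, this is a hedged sketch of what a model like the one described could look like in Keras (the final softmax layer, the optimizer, and the vocabulary size are my assumptions, not something the text specifies):&lt;/p&gt;

```python
from keras.models import Sequential
from keras.layers import LSTM, TimeDistributed, Dense

SEQ_LENGTH = 50  # characters in, characters out
VOCAB_SIZE = 45  # distinct characters in the corpus (assumed value)

model = Sequential()
# Two LSTM layers, 100 units each; return_sequences keeps one output
# per character so the next layer sees all 50 time steps.
model.add(LSTM(100, input_shape=(SEQ_LENGTH, VOCAB_SIZE), return_sequences=True))
model.add(LSTM(100, return_sequences=True))
# TimeDistributed applies the same Dense softmax to every time step.
model.add(TimeDistributed(Dense(VOCAB_SIZE, activation="softmax")))
model.compile(loss="binary_crossentropy", optimizer="adam")
```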


&lt;p&gt;Here &lt;em&gt;VOCAB_SIZE&lt;/em&gt; is just the number of distinct characters we’ll use, and &lt;em&gt;TimeDistributed&lt;/em&gt; is a way of applying a given layer to each different cell, maintaining temporal ordering.&lt;/p&gt;

&lt;p&gt;For this model, I actually tried many different learning rates to test convergence speed vs overfitting.&lt;/p&gt;

&lt;p&gt;Here’s the code for training:&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;
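&lt;p&gt;The training gist is also an embed, so here’s a sketch of the call, assuming the compiled &lt;code&gt;model&lt;/code&gt; described above and the one-hot arrays &lt;code&gt;X&lt;/code&gt; and &lt;code&gt;y&lt;/code&gt; from the preprocessing step (the batch size and the filename are my guesses; the 500 epochs match the text):&lt;/p&gt;

```python
# X: one-hot input windows, shape (num_examples, SEQ_LENGTH, VOCAB_SIZE).
# y: the same windows shifted 50 characters ahead, same shape.
model.fit(X, y, batch_size=64, epochs=500)
model.save("lovecraft_lstm.h5")  # hypothetical filename
```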
 

&lt;p&gt;What you are seeing is what had the best performance in terms of loss minimization.&lt;/p&gt;

&lt;p&gt;However, with a &lt;code&gt;binary_cross_entropy&lt;/code&gt; of 0.0244 in the final epoch (after 500 epochs), here’s what the model’s output looked like.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Tolman hast toemtnsteaetl nh otmn tf titer aut tot tust tot ahen h l the srrers ohre trrl tf thes snneenpecg tettng s olt oait ted beally tad ened ths tan en ng y afstrte and trr t sare t teohetilman hnd tdwasd hxpeinte thicpered the reed af the satl r tnnd Tev hilman hnteut iout y techesd d ty ter thet te wnow tn tis strdend af ttece and tn aise ecn
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;There are many &lt;strong&gt;good things&lt;/strong&gt; about this output, and &lt;strong&gt;many bad ones&lt;/strong&gt; as well.&lt;/p&gt;

&lt;p&gt;The way the spacing is set up, with words mostly between 2 and 5 characters long with some longer outliers, is pretty similar to the actual word length distribution in the corpus.&lt;/p&gt;

&lt;p&gt;I also noticed the &lt;strong&gt;letters&lt;/strong&gt; ‘T’, ‘E’ and ‘I’ were &lt;strong&gt;appearing very commonly&lt;/strong&gt; , whereas ‘y’ or ‘x’ were &lt;strong&gt;less frequent&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;When I looked at &lt;strong&gt;letter relative frequencies&lt;/strong&gt; in the sampled output versus the corpus, they were pretty similar. It’s the &lt;strong&gt;ordering&lt;/strong&gt; that’s &lt;strong&gt;completely off&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;There is also something to be said about how &lt;strong&gt;capital letters only appear after spaces&lt;/strong&gt; , as is usually the case in English.&lt;/p&gt;

&lt;p&gt;To generate these outputs, I simply asked the model to predict the next 50 characters for different 50 character subsets in the corpus. If it’s this bad with training data, I figured testing or random data wouldn’t be worth checking.&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;
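&lt;p&gt;The sampling I just described can be sketched like this, assuming the trained &lt;code&gt;model&lt;/code&gt; plus the &lt;code&gt;one_hot&lt;/code&gt; encoder and &lt;code&gt;chars&lt;/code&gt; list from the preprocessing step (all the names here are mine):&lt;/p&gt;

```python
import numpy as np

def continue_text(seed):
    """Predict the 50 characters that follow a 50-character seed."""
    x = one_hot(seed).T[np.newaxis, ...]  # shape (1, SEQ_LENGTH, VOCAB_SIZE)
    probs = model.predict(x)[0]           # one softmax vector per position
    return "".join(chars[int(np.argmax(p))] for p in probs)

# Feed it 50-character slices taken at different offsets of the corpus:
# print(continue_text(corpus[i:i + 50]))
```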



&lt;p&gt;The nonsense actually reminded me of one of H. P. Lovecraft’s most famous tales, “The Call of Cthulhu”, where people start having hallucinations about this cosmic, eldritch being, and they say things like:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Ph’nglui mglw’nafh &lt;em&gt;Cthulhu R’lyeh&lt;/em&gt; wgah’nagl fhtagn.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Sadly the model wasn’t overfitting that either: it was clearly &lt;strong&gt;underfitting&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;So I tried to make its task smaller, and the model bigger: 125 units, predicting only 30 characters.&lt;/p&gt;

&lt;h3&gt;
  
  
  Bigger model, smaller problem. Any results?
&lt;/h3&gt;

&lt;p&gt;With this smaller model, after another 500 epochs, some patterns began to emerge.&lt;/p&gt;

&lt;p&gt;Even though the loss function wasn’t that much smaller (at 210), the character frequencies remained similar to the corpus’s.&lt;/p&gt;

&lt;p&gt;The ordering of characters improved a lot though: here’s a random sample from its output, see if you can spot some words.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;the sreun troor Tvwood sas an ahet eae rin and t paared th te aoolling onout The e was thme trr t sovtle tousersation oefore tifdeng tor teiak uth tnd tone gen ao tolman aarreed y arsred tor h tndarcount tf tis feaont oieams wnd toar Tes heut oas nery tositreenic and t aeed aoet thme hing tftht to te tene Te was noewked ay tis prass s deegn aedgireean ect and tot ced the sueer anoormal -iuking torsarn oaich hnher tad beaerked toring the sars tark he e was tot tech
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Tech, the, and, was… &lt;strong&gt;small words&lt;/strong&gt; are where it’s at! It also realized many words ended with &lt;strong&gt;common suffixes&lt;/strong&gt; like -ing, -ed, and -tion.&lt;/p&gt;

&lt;p&gt;Out of 10000 words, 740 were “&lt;em&gt;the&lt;/em&gt;”, 37 ended in “&lt;em&gt;tion&lt;/em&gt;” (whereas only 3 contained it without ending in it), and 115 ended in –&lt;em&gt;ing&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;Other common words were “than” and “that”, though the model was clearly still unable to produce English sentences.&lt;/p&gt;

&lt;h3&gt;
  
  
  Even bigger model
&lt;/h3&gt;

&lt;p&gt;This gave me hopes. The Neural Network was clearly learning &lt;em&gt;something&lt;/em&gt;, just not enough.&lt;/p&gt;

&lt;p&gt;So I did what you do when your model underfits: I tried an even bigger Neural Network.&lt;/p&gt;

&lt;p&gt;Take into account that I’m running this on my laptop.&lt;/p&gt;

&lt;p&gt;With a modest 16GB of RAM and an i7 processor, these models take hours to learn.&lt;/p&gt;

&lt;p&gt;So I set the number of units to 150, and tried my hand again at 50 characters.&lt;/p&gt;

&lt;p&gt;I figured maybe giving it a smaller time window was making things harder for the Network.&lt;/p&gt;

&lt;p&gt;Here’s what the model’s output was like, after a few hours of training.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;andeonlenl oou torl u aote targore -trnnt d tft thit tewk d tene tosenof the stown ooaued aetane ng thet thes teutd nn aostenered tn t9t aad tndeutler y aean the stun h tf trrns anpne thin te saithdotaer totre aene Tahe sasen ahet teae es y aeweeaherr aore ereus oorsedt aern totl s a dthe snlanete toase af the srrls-thet treud tn the tewdetern tarsd totl s a dthe searle of the sere t trrd eneor tes ansreat tear d af teseleedtaner nl and tad thre n tnsrnn tearltf trrn T has tn oredt d to e e te hlte tf the sndirehio aeartdtf trrns afey aoug ath e -ahe sigtereeng tnd tnenheneo l arther ardseu troa Tnethe setded toaue and tfethe sawt ontnaeteenn an the setk eeusd ao enl af treu r ue oartenng otueried tnd toottes the r arlet ahicl tend orn teer ohre teleole tf the sastr ahete ng tf toeeteyng tnteut ooseh aore of theu y aeagteng tntn rtng aoanleterrh ahrhnterted tnsastenely aisg ng tf toueea en toaue y anter aaneonht tf the sane ng tf the 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Pure nonsense, except a lot of “the” and “and”s.&lt;/p&gt;

&lt;p&gt;It was actually saying “the” more often than the previous one, but it hadn’t learned about gerunds yet (no -ing).&lt;/p&gt;

&lt;p&gt;Interestingly, many words here ended with “-ed”, which means it was kinda grasping the idea of the &lt;strong&gt;past tense&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;I let it keep going for a few hundred more epochs (to a total of 750).&lt;/p&gt;

&lt;p&gt;The output didn’t change too much, still a lot of “the”, “a” and “an”, and still no bigger structure. Here’s another sample:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Tn t srtriueth ao tnsect on tias ng the sasteten c wntnerseoa onplsineon was ahe ey thet tf teerreag tispsliaer atecoeent of teok ond ttundtrom tirious arrte of the sncirthio sousangst tnr r te the seaol enle tiedleoisened ty trococtinetrongsoa Trrlricswf tnr txeenesd ng tispreeent T wad botmithoth te tnsrtusds tn t y afher worsl ahet then
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;An interesting thing that emerged here though, was the use of prepositions and pronouns.&lt;/p&gt;

&lt;p&gt;The network wrote “I”, “you”, “she”, “we”, “of” and other similar words a few times. All in all, &lt;strong&gt;prepositions and pronouns&lt;/strong&gt; amounted to about &lt;strong&gt;10% of the total sampled words&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;This was an improvement, as the Network was clearly learning low-entropy words.&lt;/p&gt;

&lt;p&gt;However, it was still far from generating coherent English texts.&lt;/p&gt;

&lt;p&gt;I let it train 100 more epochs, and then killed it.&lt;/p&gt;

&lt;p&gt;Here’s its last output.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;thes was aooceett than engd and te trognd tarnereohs aot teiweth tncen etf thet torei The t hhod nem tait t had nornd tn t yand tesle onet te heen t960 tnd t960 wndardhe tnong toresy aarers oot tnsoglnorom thine tarhare toneeng ahet and the sontain teadlny of the ttrrteof ty tndirtanss aoane ond terk thich hhe senr aesteeeld Tthhod nem ah tf the saar hof tnhe e on thet teauons and teu the ware taiceered t rn trr trnerileon and
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I knew it was doing its best, but it wasn’t really going anywhere, at least not quickly enough.&lt;/p&gt;

&lt;p&gt;I thought of accelerating convergence speed with &lt;strong&gt;Batch Normalization&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;However, I read on StackOverflow that BatchNorm is not supposed to be used with LSTM Neural Networks.&lt;/p&gt;

&lt;p&gt;If any of you is more experienced with LSTM nets, please let me know in the comments if that’s right!&lt;/p&gt;

&lt;p&gt;At last, I tried this same task with 10 characters as input and 10 as output.&lt;/p&gt;

&lt;p&gt;I guess the model wasn’t getting enough context to predict things well enough though: the results were awful.&lt;/p&gt;

&lt;p&gt;I considered the experiment finished for now.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusions
&lt;/h2&gt;

&lt;p&gt;While it is clear, looking at other people’s work, that an LSTM Neural Network &lt;em&gt;could&lt;/em&gt; learn to write like Lovecraft, I don’t think my PC is powerful enough to train a big enough model in a reasonable time.&lt;/p&gt;

&lt;p&gt;Or maybe it just needs more data than I had.&lt;/p&gt;

&lt;p&gt;In the future, I’d like to repeat this experiment with a word-based approach instead of a character-based one.&lt;/p&gt;

&lt;p&gt;I checked, and about 10% of the words in the corpus appear only once.&lt;/p&gt;

&lt;p&gt;Is there any good practice I should follow if I removed them before training? Like replacing all nouns with the same one, sampling from &lt;a href="http://www.datastuff.tech/machine-learning/k-means-clustering-unsupervised-learning-for-recommender-systems/" rel="noopener noreferrer"&gt;clusters&lt;/a&gt;, or something? Please let me know! I’m sure many of you are more experienced with LSTM neural networks than I.&lt;/p&gt;

&lt;p&gt;Do you think this would have worked better with a different architecture? Something I should have handled differently? Please also let me know, I want to learn more about this.&lt;/p&gt;

&lt;p&gt;Did you find any rookie mistakes on my code? Do you think I’m an idiot for not trying XYZ? Or did you actually find my experiment enjoyable, or maybe you even learned something from this article?&lt;/p&gt;

&lt;p&gt;Contact me on &lt;a href="http://www.twitter.com/strikingloo" rel="noopener noreferrer"&gt;Twitter&lt;/a&gt;, &lt;a href="http://linkedin.com/in/luciano-strika" rel="noopener noreferrer"&gt;LinkedIn&lt;/a&gt;, &lt;a href="http://medium.com/@strikingloo" rel="noopener noreferrer"&gt;Medium&lt;/a&gt; or &lt;a href="http://www.dev.to/strikingloo"&gt;Dev.to&lt;/a&gt; if you want to discuss that, or any related topic.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;If you want to become a Data scientist, or learn something new, check out my &lt;a href="http://www.datastuff.tech/data-science/3-machine-learning-books-that-helped-me-level-up-as-a-data-scientist/" rel="noopener noreferrer"&gt;Machine Learning Reading List&lt;/a&gt;!&lt;/em&gt;&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>keras</category>
      <category>lstm</category>
      <category>neuralnetworks</category>
    </item>
    <item>
      <title>5 Probability Distributions Every Data Scientist Should Know</title>
      <dc:creator>Luciano Strika</dc:creator>
      <pubDate>Mon, 17 Jun 2019 04:34:11 +0000</pubDate>
      <link>https://dev.to/strikingloo/5-probability-distributions-every-data-scientist-should-know-21di</link>
      <guid>https://dev.to/strikingloo/5-probability-distributions-every-data-scientist-should-know-21di</guid>
      <description>&lt;p&gt;Probability Distributions are like 3D glasses. They allow a skilled Data Scientist to recognize patterns in otherwise completely random variables.&lt;/p&gt;

&lt;p&gt;In a way, most of the other Data Science or Machine Learning skills are based on certain assumptions about the probability distributions of your data.&lt;/p&gt;

&lt;p&gt;This makes probability knowledge part of the basis on which you can build your toolkit as a statistician. The first steps if you are figuring out &lt;a href="http://www.datastuff.tech/data-science/3-machine-learning-books-that-helped-me-level-up-as-a-data-scientist/"&gt;how to become a Data Scientist&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Without further ado, let us cut to the chase.&lt;/p&gt;

&lt;h2&gt;
  
  
  What are Probability Distributions?
&lt;/h2&gt;

&lt;p&gt;In Probability and Statistics, a &lt;strong&gt;random variable&lt;/strong&gt; is a thing that &lt;strong&gt;takes random values&lt;/strong&gt;, like “the height of the next person I see” or “the number of cook’s hairs in my next ramen bowl”.&lt;/p&gt;

&lt;p&gt;Given a random variable &lt;em&gt;X&lt;/em&gt;, we’d like to have a way of describing which values it takes. Even more than that, we’d like to characterize &lt;strong&gt;how likely&lt;/strong&gt; it is for that variable to &lt;strong&gt;take a certain value&lt;/strong&gt; &lt;em&gt;x&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;For instance, if &lt;em&gt;X&lt;/em&gt; is “how many cats my girlfriend has”, then there’s a non-zero chance that number could be 1. One could argue there’s a non-zero probability that value could even be 5 or 10.&lt;/p&gt;

&lt;p&gt;However, there’s no way (and therefore no probability) that a person will have negative cats.&lt;/p&gt;

&lt;p&gt;We therefore would like an unambiguous, mathematical way of expressing every possible value &lt;em&gt;x&lt;/em&gt; a variable &lt;em&gt;X&lt;/em&gt; can take, and how likely the event &lt;em&gt;(X= x)&lt;/em&gt; is.&lt;/p&gt;

&lt;p&gt;In order to do this, we define a function &lt;em&gt;P&lt;/em&gt;, such that &lt;em&gt;P(X = x)&lt;/em&gt; is the probability of the variable &lt;em&gt;X&lt;/em&gt; having a value of &lt;em&gt;x&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;We could also ask for P(X &amp;lt; x), or P(X &amp;gt; x), for intervals instead of discrete values. This will become even more relevant soon.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;P&lt;/em&gt; is the variable’s &lt;strong&gt;density function&lt;/strong&gt; , and it characterizes that variable’s &lt;strong&gt;distribution&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Over time, scientists have come to realize many things in nature and real life tend to &lt;strong&gt;behave similarly&lt;/strong&gt; , with variables sharing a distribution, or having the same density functions (or a similar function changing a few constants in it).&lt;/p&gt;

&lt;p&gt;Interestingly, for &lt;em&gt;P&lt;/em&gt; to be an actual density function, a few conditions have to hold.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;em&gt;&lt;strong&gt;P(X = x)&lt;/strong&gt;&lt;/em&gt; &lt;strong&gt;&amp;lt;= 1&lt;/strong&gt; for any value &lt;em&gt;x&lt;/em&gt;. Nothing’s more certain than certain.&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;&lt;strong&gt;P(X = x)&lt;/strong&gt;&lt;/em&gt; &lt;strong&gt;&amp;gt;= 0&lt;/strong&gt; for any value &lt;em&gt;x&lt;/em&gt;. A thing can be impossible, but not less likely than that.&lt;/li&gt;
&lt;li&gt;And the last one: the &lt;strong&gt;sum&lt;/strong&gt; of &lt;em&gt;P(X = x)&lt;/em&gt; over all possible values &lt;em&gt;x&lt;/em&gt; &lt;strong&gt;is 1&lt;/strong&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This last one means something like “the probability of X taking &lt;em&gt;any&lt;/em&gt; value in the universe, &lt;em&gt;has&lt;/em&gt; to add up to 1, since we know it will take &lt;em&gt;some&lt;/em&gt; value”.&lt;/p&gt;

&lt;h3&gt;
  
  
  Discrete vs Continuous Random Variable Distributions
&lt;/h3&gt;

&lt;p&gt;Lastly, random variables can be thought of as belonging to two groups: &lt;strong&gt;discrete&lt;/strong&gt; and &lt;strong&gt;continuous&lt;/strong&gt; random variables.&lt;/p&gt;

&lt;h4&gt;
  
  
  Discrete Random Variables
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Discrete variables&lt;/strong&gt; have a discrete set of possible values, each of them with a non-zero probability.&lt;/p&gt;

&lt;p&gt;For instance, when flipping a coin, if we say&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;X = “1 if the coin is heads, 0 if tails”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Then &lt;em&gt;P(X = 1) = P(X = 0) = 0.5&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;Note however, that a discrete set need not be finite.&lt;/p&gt;

&lt;p&gt;A &lt;strong&gt;geometric distribution&lt;/strong&gt; is used for modelling the chance of some event with probability &lt;em&gt;p&lt;/em&gt; &lt;strong&gt;happening after &lt;em&gt;k&lt;/em&gt; retries&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;It has the following density formula.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--U03D0Fle--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://quicklatex.com/cache3/38/ql_a1aa64858e88e4fa841ecade06d08038_l3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--U03D0Fle--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://quicklatex.com/cache3/38/ql_a1aa64858e88e4fa841ecade06d08038_l3.png" alt="" width="172" height="47"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Where &lt;em&gt;k&lt;/em&gt; &lt;strong&gt;can take any non-negative value with a positive probability&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Notice how the sum of all possible values’ probabilities still &lt;strong&gt;adds up to 1&lt;/strong&gt;.&lt;/p&gt;
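&lt;p&gt;We can check that claim numerically. With density &lt;em&gt;P(X = k) = (1-p)^k * p&lt;/em&gt; (the formula pictured above), the probabilities over a long run of &lt;em&gt;k&lt;/em&gt; values get arbitrarily close to 1 (the value of &lt;em&gt;p&lt;/em&gt; here is arbitrary):&lt;/p&gt;

```python
p = 0.3  # chance the event happens on any single try

def geom_pmf(k):
    """Probability that the event first happens after k failed retries."""
    return ((1.0 - p) ** k) * p

total = sum(geom_pmf(k) for k in range(200))
# total is within floating-point error of 1; the only missing mass is the
# vanishing chance of 200 or more consecutive failures.
```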

&lt;h4&gt;
  
  
  Continuous Random Variables
&lt;/h4&gt;

&lt;p&gt;If you said&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;X = “the length in millimeters (without rounding) of a randomly plucked hair from my head”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Which possible values can &lt;em&gt;X&lt;/em&gt; take? We can all probably agree a negative value doesn’t make any sense here.&lt;/p&gt;

&lt;p&gt;However if you said it is exactly 1 millimeter, and not 1.1853759… or something like that, I would doubt either your measuring skills or your error reporting.&lt;/p&gt;

&lt;p&gt;A continuous random variable can take &lt;strong&gt;any value&lt;/strong&gt; in a given (continuous) interval.&lt;/p&gt;

&lt;p&gt;Therefore, if we assigned a &lt;strong&gt;non-zero probability to all of its possible values&lt;/strong&gt; , their sum would &lt;strong&gt;not add up to 1&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;To solve this, if &lt;em&gt;X&lt;/em&gt; is continuous, we set &lt;em&gt;P(X=x) = 0&lt;/em&gt; for all &lt;em&gt;x&lt;/em&gt;, and instead assign a non-zero chance to &lt;em&gt;X&lt;/em&gt; taking a value &lt;strong&gt;in a certain interval.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;To express the probability of X lying between values &lt;em&gt;a&lt;/em&gt; and &lt;em&gt;b&lt;/em&gt;, we say&lt;br&gt;
&lt;em&gt;P(a &amp;lt; X &amp;lt; b)&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;Instead of just replacing values in a density function, to get &lt;em&gt;P(a &amp;lt; X &amp;lt; b)&lt;/em&gt; for &lt;em&gt;X&lt;/em&gt; a continuous variable, you’ll integrate &lt;em&gt;X&lt;/em&gt;‘s density function from &lt;em&gt;a&lt;/em&gt; to &lt;em&gt;b&lt;/em&gt;.&lt;/p&gt;
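&lt;p&gt;For a concrete (made-up) example: take a variable with density &lt;em&gt;f(x) = 2x&lt;/em&gt; on the interval [0, 1]. Integrating it from &lt;em&gt;a&lt;/em&gt; to &lt;em&gt;b&lt;/em&gt;, here with a simple Riemann sum, gives the probability of landing in that interval:&lt;/p&gt;

```python
def density(x):
    """f(x) = 2x on [0, 1]; its integral over the whole interval is 1."""
    return 2.0 * x

def prob_between(a, b, steps=100000):
    """Approximate the integral of the density from a to b (midpoint rule)."""
    dx = (b - a) / steps
    return sum(density(a + (i + 0.5) * dx) * dx for i in range(steps))

p_lower_half = prob_between(0.0, 0.5)  # the exact answer is 0.5**2 = 0.25
p_total = prob_between(0.0, 1.0)       # a density integrates to 1 overall
```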

&lt;p&gt;Whoah, you’ve made it through the whole theory section! Here’s your reward.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--py1gYXj0--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/http://www.datastuff.tech/wp-content/uploads/2019/06/pug-690566_640-e1560733151341.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--py1gYXj0--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/http://www.datastuff.tech/wp-content/uploads/2019/06/pug-690566_640-e1560733151341.jpg" alt="A pug puppy." width="382" height="425"&gt;&lt;/a&gt;Reward puppy. Source: &lt;a href="https://pixabay.com/photos/pug-puppy-dog-animal-cute-690566/"&gt;Pixabay&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Now that you know what a probability distribution is, let’s learn about some of the most common ones!&lt;/p&gt;

&lt;h2&gt;
  
  
  Bernoulli Probability Distribution
&lt;/h2&gt;

&lt;p&gt;A Random Variable with a Bernoulli Distribution is among the simplest ones.&lt;/p&gt;

&lt;p&gt;It represents a &lt;strong&gt;binary event&lt;/strong&gt;: “this happened” vs “this didn’t happen”, and takes a value &lt;em&gt;p&lt;/em&gt; as its &lt;strong&gt;only parameter&lt;/strong&gt;, which represents the &lt;strong&gt;probability&lt;/strong&gt; that &lt;strong&gt;the event will occur&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;A random variable &lt;em&gt;B&lt;/em&gt; with a Bernoulli distribution with parameter &lt;em&gt;p&lt;/em&gt; will have the following density function:&lt;/p&gt;

&lt;p&gt;&lt;em&gt;P(B = 1) = p, P(B =0)= (1-p)&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Here &lt;em&gt;B=1&lt;/em&gt; means the event happened, and &lt;em&gt;B=0&lt;/em&gt; means it didn’t.&lt;/p&gt;

&lt;p&gt;Notice how both probabilities add up to 1, and therefore no other value for &lt;em&gt;B&lt;/em&gt; will be possible.&lt;/p&gt;
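&lt;p&gt;A quick simulation makes this concrete: sampling a Bernoulli variable many times, the frequency of 1s approaches &lt;em&gt;p&lt;/em&gt; (the value of &lt;em&gt;p&lt;/em&gt; here is arbitrary):&lt;/p&gt;

```python
import random

random.seed(42)
p = 0.7  # the distribution's only parameter

def bernoulli():
    """Return 1 if the event happened (probability p), else 0."""
    return random.choices([1, 0], weights=[p, 1.0 - p])[0]

samples = [bernoulli() for _ in range(10000)]
frequency = sum(samples) / len(samples)  # hovers around p
```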

&lt;h2&gt;
  
  
  Uniform Probability Distribution
&lt;/h2&gt;

&lt;p&gt;There are two kinds of uniform random variables: discrete and continuous ones.&lt;/p&gt;

&lt;p&gt;A &lt;strong&gt;discrete uniform distribution&lt;/strong&gt; will take a &lt;strong&gt;(finite)&lt;/strong&gt; set of values &lt;em&gt;S&lt;/em&gt;, and assign a probability of &lt;em&gt;1/n&lt;/em&gt; to each of them, where &lt;em&gt;n&lt;/em&gt; is the amount of elements in &lt;em&gt;S&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;This way, if for instance my variable &lt;em&gt;Y&lt;/em&gt; was uniform in {1,2,3}, then there’d be a 33% chance each of those values came out.&lt;/p&gt;

&lt;p&gt;A very typical case of a discrete uniform random variable is found in &lt;strong&gt;dice&lt;/strong&gt;, where a typical die has the set of values {1,2,3,4,5,6}.&lt;/p&gt;

&lt;p&gt;A &lt;strong&gt;continuous uniform distribution&lt;/strong&gt; , instead, only takes &lt;strong&gt;two values &lt;em&gt;a&lt;/em&gt; and &lt;em&gt;b&lt;/em&gt;&lt;/strong&gt; as parameters, and assigns the same density to each value in the &lt;strong&gt;interval between them&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;That means the probability of Y taking a &lt;strong&gt;value in an interval&lt;/strong&gt; (from &lt;em&gt;c&lt;/em&gt; to &lt;em&gt;d&lt;/em&gt;) is &lt;strong&gt;proportional to its length&lt;/strong&gt; versus the length of the whole interval (&lt;em&gt;b-a&lt;/em&gt;).&lt;/p&gt;

&lt;p&gt;Therefore if &lt;em&gt;Y&lt;/em&gt; is uniformly distributed between &lt;em&gt;a&lt;/em&gt; and &lt;em&gt;b&lt;/em&gt;, then&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--e9D-1L18--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://quicklatex.com/cache3/1f/ql_6428ae1ce24b50fa38db8faf6c6e211f_l3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--e9D-1L18--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://quicklatex.com/cache3/1f/ql_6428ae1ce24b50fa38db8faf6c6e211f_l3.png" alt="" width="355" height="19"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This way, if &lt;em&gt;Y&lt;/em&gt; is a uniform random variable between 1 and 2,&lt;/p&gt;

&lt;p&gt;&lt;em&gt;P(1 &amp;lt; Y &amp;lt; 2)=1&lt;/em&gt; and &lt;em&gt;P(1 &amp;lt; Y &amp;lt; 1.5) = 0.5&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Python’s &lt;code&gt;random&lt;/code&gt; module’s &lt;code&gt;random()&lt;/code&gt; function samples a uniformly distributed continuous variable between 0 and 1.&lt;/p&gt;

&lt;p&gt;Interestingly, it can be shown that &lt;strong&gt;any other distribution&lt;/strong&gt; can be sampled given a &lt;a href="https://www.mathworks.com/help/stats/generate-random-numbers-using-the-uniform-distribution-inversion-method.html"&gt;uniform random values generator and some calculus&lt;/a&gt;.&lt;/p&gt;
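&lt;p&gt;As a minimal sketch of that idea (inverse-transform sampling): the exponential CDF is &lt;em&gt;F(x) = 1 - e^(-λx)&lt;/em&gt;, so inverting it turns uniform draws into exponential samples:&lt;/p&gt;

```python
import math
import random

# Inverse-transform sampling sketch: the exponential CDF is F(x) = 1 - exp(-lam*x),
# so solving F(x) = u gives x = -ln(1 - u) / lam for u uniform in [0, 1).
random.seed(0)
lam = 2.0  # rate parameter (an arbitrary choice for the demo)

samples = [-math.log(1 - random.random()) / lam for _ in range(100_000)]

mean = sum(samples) / len(samples)
print(round(mean, 3))  # an exponential with rate lam has mean 1/lam = 0.5
```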

&lt;h2&gt;
  
  
  Normal Probability Distribution
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--cGyW1gjM--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/http://www.datastuff.tech/wp-content/uploads/2019/06/340px-Normal_Distribution_PDF.svg_.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--cGyW1gjM--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/http://www.datastuff.tech/wp-content/uploads/2019/06/340px-Normal_Distribution_PDF.svg_.png" alt="" width="340" height="217"&gt;&lt;/a&gt;Normal Distributions. source: &lt;a href="https://en.wikipedia.org/wiki/Normal_distribution"&gt;Wikipedia&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Normally distributed variables&lt;/strong&gt; are so commonly found in nature that they’re &lt;strong&gt;&lt;em&gt;the norm&lt;/em&gt;&lt;/strong&gt;. That’s where the name comes from.&lt;/p&gt;

&lt;p&gt;If you round up all your workmates and measure their heights, or weigh them all and plot a histogram with the results, odds are it’s gonna approach a normal distribution.&lt;/p&gt;

&lt;p&gt;I actually saw this effect when I showed you &lt;a href="http://www.datastuff.tech/data-analysis/data-analysis-pandas-seaborn-kaggle-dataset/"&gt;Exploratory Data Analysis&lt;/a&gt; examples.&lt;/p&gt;

&lt;p&gt;It can also be shown that if you &lt;strong&gt;take a sample&lt;/strong&gt; of any random variable and &lt;strong&gt;average those measures&lt;/strong&gt;, and repeat that process many times, those averages will be approximately &lt;strong&gt;normally distributed&lt;/strong&gt; (more so the bigger the samples). That fact’s so important, it’s known as the &lt;a href="https://math.tutorvista.com/statistics/fundamental-theorem-of-statistics.html"&gt;Central Limit Theorem&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Normally distributed variables:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Are &lt;strong&gt;symmetrical&lt;/strong&gt;, centered around a mean (usually called &lt;strong&gt;μ&lt;/strong&gt;).&lt;/li&gt;
&lt;li&gt;Can take &lt;strong&gt;any value on the real line&lt;/strong&gt;, but land more than two standard deviations (σ) away from the mean only about 5% of the time.&lt;/li&gt;
&lt;li&gt;Are &lt;strong&gt;literally everywhere&lt;/strong&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;More often than not, if you measure empirical data and it’s symmetrical around its mean, assuming it’s normal will work reasonably well.&lt;/p&gt;

&lt;p&gt;For example, the sum of &lt;em&gt;K&lt;/em&gt; dice rolls will be distributed approximately normally, more so as &lt;em&gt;K&lt;/em&gt; grows.&lt;/p&gt;
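&lt;p&gt;Here’s that dice experiment as a plain-Python sketch: averaging &lt;em&gt;K&lt;/em&gt; rolls many times over yields values clustered symmetrically around 3.5, as the Central Limit Theorem predicts:&lt;/p&gt;

```python
import random
import statistics

random.seed(1)
K = 10           # dice per experiment
trials = 20_000  # how many times we repeat the experiment

# Average K die rolls, many times over; by the Central Limit Theorem the
# resulting averages are approximately normally distributed around 3.5.
averages = [sum(random.randint(1, 6) for _ in range(K)) / K for _ in range(trials)]

mu = statistics.mean(averages)
sigma = statistics.stdev(averages)
print(round(mu, 2), round(sigma, 2))
# Theory: mean = 3.5, stdev = sqrt(35/12) / sqrt(K) ≈ 0.54
```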

&lt;h3&gt;
  
  
  Log-Normal Probability Distribution
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--2pCOqEqH--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/http://www.datastuff.tech/wp-content/uploads/2019/06/300px-PDF-log_normal_distributions.svg_.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--2pCOqEqH--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/http://www.datastuff.tech/wp-content/uploads/2019/06/300px-PDF-log_normal_distributions.svg_.png" alt="" width="300" height="300"&gt;&lt;/a&gt;Lognormal distribution. source: &lt;a href="https://en.wikipedia.org/wiki/Log-normal_distribution"&gt;Wikipedia&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The log-normal probability distribution is the normal distribution’s smaller, less frequently seen sister.&lt;/p&gt;

&lt;p&gt;A variable &lt;em&gt;X&lt;/em&gt; is said to be &lt;strong&gt;log-normally distributed&lt;/strong&gt; if the variable &lt;em&gt;Y = log(X)&lt;/em&gt; follows a normal distribution.&lt;/p&gt;
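&lt;p&gt;That definition gives a direct recipe for sampling one: exponentiate normal draws (the parameter values below are arbitrary demo choices):&lt;/p&gt;

```python
import math
import random
import statistics

random.seed(7)
mu, sigma = 1.0, 0.5  # parameters of the underlying normal (arbitrary demo values)

# If Y = log(X) is normal, then X is log-normal; so exponentiating
# normal draws produces log-normal samples.
samples = [math.exp(random.gauss(mu, sigma)) for _ in range(50_000)]

logs = [math.log(x) for x in samples]
print(round(statistics.mean(logs), 2), round(statistics.stdev(logs), 2))
# taking logs recovers the underlying normal: mean ≈ 1.0, stdev ≈ 0.5
```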

&lt;p&gt;When plotted in a histogram, log-normal probability distributions are &lt;strong&gt;asymmetrical&lt;/strong&gt;, and become more so as their standard deviation grows.&lt;/p&gt;

&lt;p&gt;I believe &lt;strong&gt;lognormal&lt;/strong&gt; distributions to be worth mentioning, because &lt;strong&gt;most money-based variables&lt;/strong&gt; behave this way.&lt;/p&gt;

&lt;p&gt;If you look at the probability distribution of any variable that relates to money, like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The amount sent in a certain bank’s latest transfer.&lt;/li&gt;
&lt;li&gt;The volume of the latest transaction on Wall Street.&lt;/li&gt;
&lt;li&gt;A set of companies’ quarterly earnings for a given quarter.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;it will usually not follow a normal probability distribution, but will behave much more like a lognormal random variable.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;(For other Data Scientists: chime in in the comments if you can think of any other empirical lognormal variables you’ve come across in your work, especially anything outside of finance!)&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Exponential Probability Distribution
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--AvrQ4dk7--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/http://www.datastuff.tech/wp-content/uploads/2019/06/exp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--AvrQ4dk7--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/http://www.datastuff.tech/wp-content/uploads/2019/06/exp.png" alt="" width="325" height="260"&gt;&lt;/a&gt;Source: &lt;a href="https://en.wikipedia.org/wiki/Exponential_distribution"&gt;Wikipedia&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Exponential probability distributions&lt;/strong&gt; appear everywhere, too.&lt;/p&gt;

&lt;p&gt;They are heavily linked to a Probability concept called a &lt;strong&gt;Poisson Process&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Stealing straight from Wikipedia, a &lt;a href="https://en.wikipedia.org/wiki/Poisson_point_process"&gt;Poisson Process&lt;/a&gt; is “&lt;em&gt;a process in which events occur continuously and independently at a constant average rate&lt;/em&gt;“.&lt;/p&gt;

&lt;p&gt;All that means is, if:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You have a lot of events going on.&lt;/li&gt;
&lt;li&gt;They happen at a certain rate, which &lt;strong&gt;does not change&lt;/strong&gt; over time.&lt;/li&gt;
&lt;li&gt;One event happening doesn’t change the chances of the next one happening.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Then you have a Poisson process.&lt;/p&gt;

&lt;p&gt;Some examples could be requests coming to a server, transactions happening in a supermarket, or birds fishing in a certain lake.&lt;/p&gt;

&lt;p&gt;Imagine a Poisson Process with a frequency rate of λ (say, events happen once every second).&lt;/p&gt;

&lt;p&gt;Exponential random variables model the time it takes, after an event, for the next event to occur.&lt;/p&gt;

&lt;p&gt;Interestingly, in a Poisson Process &lt;strong&gt;an event can happen any number of times, from 0 upward&lt;/strong&gt; (&lt;em&gt;with decreasing probability&lt;/em&gt;), in any interval of time.&lt;/p&gt;

&lt;p&gt;This means there’s a &lt;strong&gt;non-zero chance that the event won’t happen, no matter how long you wait&lt;/strong&gt;. It also means it could happen a lot of times in a very short interval.&lt;/p&gt;

&lt;p&gt;In class we used to joke bus arrivals are Poisson Processes. I think the response time when you send a WhatsApp message to &lt;em&gt;some people&lt;/em&gt; also fits the criteria.&lt;/p&gt;

&lt;p&gt;However, the λ parameter &lt;strong&gt;regulates the frequency&lt;/strong&gt; of the events.&lt;/p&gt;

&lt;p&gt;It will make the &lt;strong&gt;expected time&lt;/strong&gt; it actually takes for an event to happen &lt;strong&gt;center around a certain value&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;This means if we know a taxi passes our block every 15 minutes, even though theoretically we &lt;em&gt;could&lt;/em&gt; wait for it forever, it’s extremely likely we won’t wait longer than, say, 30 minutes.&lt;/p&gt;

&lt;h3&gt;
  
  
  Exponential Probability Distribution: In Practice
&lt;/h3&gt;

&lt;p&gt;Here’s the density function for an exponential distribution random variable:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--9DxZcOhi--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://quicklatex.com/cache3/d3/ql_1491ff4bfb47a7894aa4b4021d96f4d3_l3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--9DxZcOhi--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://quicklatex.com/cache3/d3/ql_1491ff4bfb47a7894aa4b4021d96f4d3_l3.png" alt="" width="227" height="23"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Suppose you have a sample from a variable and want to see if it can be modelled with an exponentially distributed variable.&lt;/p&gt;

&lt;p&gt;The optimum &lt;strong&gt;λ parameter can be easily estimated&lt;/strong&gt; as the inverse of your sample’s average (this is the maximum-likelihood estimate).&lt;/p&gt;
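&lt;p&gt;A quick sketch of that estimator, recovering a known rate from a simulated sample:&lt;/p&gt;

```python
import random

random.seed(3)
true_lam = 4.0  # the "unknown" rate we'll try to recover

# Draw a sample from an exponential distribution with rate true_lam...
sample = [random.expovariate(true_lam) for _ in range(50_000)]

# ...then estimate lambda as the inverse of the sample mean.
estimated_lam = 1 / (sum(sample) / len(sample))
print(round(estimated_lam, 2))  # lands close to 4.0
```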

&lt;p&gt;Exponential variables are very good for modelling any probability distributions with very infrequent, but huge (and mean-breaking) &lt;strong&gt;outliers&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;This is because they can &lt;strong&gt;take any non-negative value&lt;/strong&gt; but center in smaller ones, with decreased frequency as the value grows.&lt;/p&gt;

&lt;p&gt;In a particularly &lt;strong&gt;outlier-heavy sample&lt;/strong&gt;, you may want to estimate λ from the &lt;strong&gt;median instead of the average&lt;/strong&gt; (for an exponential variable, λ = ln(2)/median), since the median is more &lt;strong&gt;robust to outliers&lt;/strong&gt;. Your mileage may vary on this one, so take it with a grain of salt.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusions
&lt;/h2&gt;

&lt;p&gt;To sum up, I think it’s important for us Data Scientists to learn the basics.&lt;/p&gt;

&lt;p&gt;Probability and Statistics may not be as flashy as &lt;a href="http://www.datastuff.tech/machine-learning/autoencoder-deep-learning-tensorflow-eager-api-keras/"&gt;Deep Learning&lt;/a&gt; or &lt;a href="http://www.datastuff.tech/machine-learning/k-means-clustering-unsupervised-learning-for-recommender-systems/"&gt;Unsupervised Machine Learning&lt;/a&gt;, but they are the &lt;strong&gt;bedrock of Data Science&lt;/strong&gt;. Especially Machine Learning.&lt;/p&gt;

&lt;p&gt;Feeding a Machine Learning model with features without knowing which distribution they follow is, in my experience, a poor choice.&lt;/p&gt;

&lt;p&gt;It’s also good to remember the &lt;strong&gt;ubiquity of Exponential and Normal Probability Distributions&lt;/strong&gt;, and the normal’s smaller counterpart, the lognormal distribution.&lt;/p&gt;

&lt;p&gt;Knowing their properties, uses and appearance is &lt;strong&gt;game-changing when training a Machine Learning model&lt;/strong&gt;. It’s also generally good to keep them in mind while doing any kind of Data Analysis.&lt;/p&gt;

&lt;p&gt;Did you find any part of this article useful? Was it all stuff you already knew? Did you learn anything new? Let me know in the comments!&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Contact me on &lt;a href="http://www.twitter.com/strikingloo"&gt;Twitter&lt;/a&gt;, &lt;a href="http://www.medium.com/@strikingloo"&gt;Medium&lt;/a&gt; or &lt;a href="http://www.dev.to/strikingloo"&gt;dev.to&lt;/a&gt; if there’s anything you don’t think was clear enough, anything you disagree with, or just anything that’s plain wrong. Don’t worry, I don’t bite.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>datascience</category>
      <category>beginners</category>
      <category>statistics</category>
      <category>datascientist</category>
    </item>
    <item>
      <title>Convolutional Neural Networks: Python Tutorial (TensorFlow Eager API)</title>
      <dc:creator>Luciano Strika</dc:creator>
      <pubDate>Wed, 12 Jun 2019 18:05:35 +0000</pubDate>
      <link>https://dev.to/strikingloo/convolutional-neural-networks-an-introduction-tensorflow-eager-4f4m</link>
      <guid>https://dev.to/strikingloo/convolutional-neural-networks-an-introduction-tensorflow-eager-4f4m</guid>
      <description>&lt;p&gt;Convolutional Neural Networks are a part of what made Deep Learning reach the headlines so often in the last decade. Today we'll train an &lt;strong&gt;image classifier&lt;/strong&gt; to tell us whether an image contains a dog or a cat, using TensorFlow's eager API.&lt;/p&gt;

&lt;p&gt;Artificial Neural Networks have disrupted several industries lately, due to their unprecedented capabilities in many areas. However, different Deep Learning architectures excel at different tasks:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Image Classification (Convolutional Neural Networks).&lt;/li&gt;
&lt;li&gt;Image, audio and text generation (GANs, RNNs).&lt;/li&gt;
&lt;li&gt;Time Series Forecasting (RNNs, LSTM).&lt;/li&gt;
&lt;li&gt;Recommendations Systems.&lt;/li&gt;
&lt;li&gt;A huge et cetera (e.g., regression).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Today we’ll focus on the first item of the list, though each of those deserves an article of its own.&lt;/p&gt;

&lt;h2&gt;
  
  
  What are Convolutional Neural Networks?
&lt;/h2&gt;

&lt;p&gt;In MultiLayer Perceptrons (MLP), the &lt;em&gt;vanilla&lt;/em&gt; Neural Networks, each layer’s neurons connect to &lt;strong&gt;all&lt;/strong&gt; the neurons in the next layer. We call this type of layer &lt;strong&gt;fully connected&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--npsw_Q9A--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/http://www.astroml.org/_images/fig_neural_network_1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--npsw_Q9A--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/http://www.astroml.org/_images/fig_neural_network_1.png" alt=""&gt;&lt;/a&gt;A MLP. Source: &lt;a href="http://www.astroml.org/book_figures/appendix/fig_neural_network.html"&gt;AstroML&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Convolutional Neural Networks are different: they have Convolutional Layers.&lt;/p&gt;

&lt;p&gt;In a fully connected layer, each neuron’s output will be a linear transformation of the previous layer’s outputs, composed with a non-linear activation function (e.g., &lt;em&gt;ReLU&lt;/em&gt; or &lt;em&gt;Sigmoid&lt;/em&gt;).&lt;/p&gt;

&lt;p&gt;Conversely, the output of each neuron in a &lt;strong&gt;Convolutional Layer&lt;/strong&gt; is only a function of a (typically small) &lt;strong&gt;subset&lt;/strong&gt; of the previous layer’s neurons.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--XXatDYus--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://ds055uzetaobb.cloudfront.net/brioche/uploads/MDyKhb5tXY-1_hbp1vrfewnareprrlnxtqq2x.png%3Fwidth%3D1200" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--XXatDYus--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://ds055uzetaobb.cloudfront.net/brioche/uploads/MDyKhb5tXY-1_hbp1vrfewnareprrlnxtqq2x.png%3Fwidth%3D1200" alt=""&gt;&lt;/a&gt; Source: &lt;a href="https://brilliant.org/wiki/convolutional-neural-network/"&gt;Brilliant&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Outputs on a Convolutional Layer will be the result of applying a &lt;strong&gt;convolution&lt;/strong&gt; to a subset of the previous layer’s neurons, and then an activation function.&lt;/p&gt;

&lt;h3&gt;
  
  
  What is a convolution?
&lt;/h3&gt;

&lt;p&gt;The convolution operation, given an input matrix &lt;em&gt;A&lt;/em&gt; (usually the previous layer’s values) and a (typically much smaller) weight matrix called a &lt;strong&gt;kernel&lt;/strong&gt; or &lt;strong&gt;filter&lt;/strong&gt; &lt;em&gt;K&lt;/em&gt;, will output a new matrix &lt;em&gt;B&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--PZ8zI0AQ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/800/1%2A4yv0yIH0nVhSOv3AkLUIiw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--PZ8zI0AQ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/800/1%2A4yv0yIH0nVhSOv3AkLUIiw.png" alt=""&gt;&lt;/a&gt;by @&lt;a href="https://medium.com/@RaghavPrabhu/understanding-of-convolutional-neural-network-cnn-deep-learning-99760835f148"&gt;RaghavPrabhu&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If &lt;em&gt;K&lt;/em&gt; is a &lt;em&gt;CxC&lt;/em&gt; matrix, the first element in &lt;em&gt;B&lt;/em&gt; will be the result of:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Taking the first &lt;em&gt;CxC&lt;/em&gt; submatrix of &lt;em&gt;A&lt;/em&gt;.&lt;/li&gt;
&lt;li&gt;Multiplying each of its elements by its corresponding weight in &lt;em&gt;K&lt;/em&gt;.&lt;/li&gt;
&lt;li&gt;Adding all the products.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These two last steps are equivalent to flattening both A's submatrix and K, and computing the dot product of the resulting vectors.&lt;/p&gt;

&lt;p&gt;We then slide K to the right to get the next element, and so on, repeating this process for each of &lt;em&gt;A&lt;/em&gt;‘s rows.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--S-S1Smhf--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_880/https://cdn-images-1.medium.com/max/800/1%2AMrGSULUtkXc0Ou07QouV8A.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--S-S1Smhf--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_880/https://cdn-images-1.medium.com/max/800/1%2AMrGSULUtkXc0Ou07QouV8A.gif" alt=""&gt;&lt;/a&gt;Convolution visualization by @&lt;a href="https://medium.com/@RaghavPrabhu/understanding-of-convolutional-neural-network-cnn-deep-learning-99760835f148"&gt;RaghavPrabhu&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Depending on what we want, we could start the kernel only at the &lt;em&gt;Cth&lt;/em&gt; row and column, to avoid “going out of bounds”, or assume all elements “outside &lt;em&gt;A&lt;/em&gt;” have a certain default value (typically 0). This choice defines whether &lt;em&gt;B&lt;/em&gt;’s size is smaller than &lt;em&gt;A&lt;/em&gt;’s or the same.&lt;/p&gt;

&lt;p&gt;As you can see, if &lt;em&gt;A&lt;/em&gt; was an &lt;em&gt;NxM&lt;/em&gt; matrix, each neuron’s value in &lt;em&gt;B&lt;/em&gt; won’t depend on &lt;em&gt;N*M&lt;/em&gt; weights, but only on &lt;em&gt;C*C&lt;/em&gt; (far fewer) of them. This makes a convolutional layer much lighter than a fully connected one, helping convolutional models learn a lot faster.&lt;/p&gt;
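&lt;p&gt;The sliding procedure above can be sketched in a few lines of NumPy. This is a minimal “valid” convolution (no padding, and no kernel flip, matching how CNN layers apply their filters); the matrices are toy values for illustration:&lt;/p&gt;

```python
import numpy as np

def convolve2d_valid(A, K):
    """Slide kernel K over matrix A ("valid" mode: no padding),
    taking a weighted sum at each position."""
    n, m = A.shape
    c = K.shape[0]  # assume a square c x c kernel
    B = np.zeros((n - c + 1, m - c + 1))
    for i in range(B.shape[0]):
        for j in range(B.shape[1]):
            # multiply the c x c submatrix elementwise by K, then add it all up
            B[i, j] = np.sum(A[i:i + c, j:j + c] * K)
    return B

A = np.array([[1, 2, 3],
              [4, 5, 6],
              [7, 8, 9]], dtype=float)
K = np.array([[1, 0],
              [0, 1]], dtype=float)  # adds each element to its lower-right neighbor

print(convolve2d_valid(A, K))
# [[ 6.  8.]
#  [12. 14.]]
```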

&lt;p&gt;Granted, we will end up using many kernels on each layer (getting a stack of matrices as each layer’s output). However, that will still require far fewer weights than our good old MLP.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why does this work?
&lt;/h3&gt;

&lt;p&gt;Why can we &lt;strong&gt;ignore&lt;/strong&gt; how each neuron affects most of the others? Well, this whole system holds up on the premise that each neuron is &lt;strong&gt;strongly affected by its “neighbors”&lt;/strong&gt;. Faraway neurons, however, have only a small bearing on it.&lt;/p&gt;

&lt;p&gt;This assumption is &lt;strong&gt;intuitively true in images&lt;/strong&gt;: if we think of the input layer, each neuron will be a pixel or a pixel’s RGB value. And that’s part of the reason why this approach works so well for image classification.&lt;/p&gt;

&lt;p&gt;For example, if I take a region of a picture where there’s a blue sky, it’s likely that nearby regions will show the sky as well, using similar tones.&lt;/p&gt;

&lt;p&gt;A pixel’s neighbors will usually have similar RGB values to it. If they don’t, then that probably means we are on the edge of a figure or object.&lt;/p&gt;

&lt;p&gt;If you do some convolutions with pen and paper (or a calculator), you’ll realize certain kernels will increase an input’s intensity if it’s on a certain kind of edge. In other edges, they could decrease it.&lt;/p&gt;

&lt;p&gt;As an example, let’s consider the following kernels &lt;em&gt;V&lt;/em&gt; and &lt;em&gt;H&lt;/em&gt;:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--3siJzt3_--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://quicklatex.com/cache3/65/ql_ba21e5c0e8d0bca8495df438cd2a7f65_l3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--3siJzt3_--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://quicklatex.com/cache3/65/ql_ba21e5c0e8d0bca8495df438cd2a7f65_l3.png" alt=""&gt;&lt;/a&gt;Filters for vertical and horizontal edges&lt;/p&gt;

&lt;p&gt;&lt;em&gt;V&lt;/em&gt; filters vertical edges (where colors above are very different from colors below), whereas &lt;em&gt;H&lt;/em&gt; filters horizontal edges. Notice how one is the transposed version of the other.&lt;/p&gt;
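&lt;p&gt;The exact numbers in the picture aside, kernels of this shape can be sketched with standard Prewitt-style values (an assumption on my part, not necessarily the ones shown above). Applying them to a tiny synthetic image with a hard top/bottom boundary shows &lt;em&gt;V&lt;/em&gt; firing along the edge while &lt;em&gt;H&lt;/em&gt; stays silent:&lt;/p&gt;

```python
import numpy as np

# Prewitt-style edge kernels (assumed values; the article's image may differ).
V = np.array([[ 1,  1,  1],
              [ 0,  0,  0],
              [-1, -1, -1]])  # responds where rows above differ from rows below
H = V.T                       # transposed: responds where left differs from right

image = np.array([[0, 0, 0, 0],
                  [0, 0, 0, 0],
                  [1, 1, 1, 1],
                  [1, 1, 1, 1]])  # hard boundary between the top and bottom half

def convolve(A, K):
    # plain "valid" convolution, same sliding-window idea as before
    c = K.shape[0]
    out = np.zeros((A.shape[0] - c + 1, A.shape[1] - c + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(A[i:i + c, j:j + c] * K)
    return out

print(convolve(image, V))  # strong (nonzero) response along the boundary
print(convolve(image, H))  # all zeros: no left-right variation in this image
```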

&lt;h3&gt;
  
  
  Convolutions by example
&lt;/h3&gt;

&lt;p&gt;Here’s an unfiltered picture of a litter of kittens:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--QlTo9kC6--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/http://www.datastuff.tech/wp-content/uploads/2019/06/cat.1093.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--QlTo9kC6--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/http://www.datastuff.tech/wp-content/uploads/2019/06/cat.1093.jpg" alt="A cute kitten litter for image preprocessing."&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Here’s what happens if we apply the horizontal and vertical edge filters, respectively:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--h5VWAdJ4--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/http://www.datastuff.tech/wp-content/uploads/2019/06/imgonline-com-ua-twotoone-8mRNYq0lXpVgvgF-e1560317743178.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--h5VWAdJ4--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/http://www.datastuff.tech/wp-content/uploads/2019/06/imgonline-com-ua-twotoone-8mRNYq0lXpVgvgF-e1560317743178.png" alt="kittens after horizontal and vertical convolutional edge filters"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We can see how some features become a lot more noticeable, whereas others fade away. Interestingly, each filter showcases different features.&lt;/p&gt;

&lt;p&gt;This is how Convolutional Neural Networks learn to identify features in an image. Letting it fit its own kernel weights is a lot easier than any manual approach. Imagine trying to figure out how you should express the relationship between pixels… by hand!&lt;/p&gt;

&lt;p&gt;To really grasp what each convolution does to a picture, I strongly recommend you play around on &lt;a href="http://setosa.io/ev/image-kernels/"&gt;this website&lt;/a&gt;. It helped me more than any book or tutorial could. Go ahead, bookmark it. It’s fun.&lt;/p&gt;

&lt;p&gt;Alright, you’ve learned some theory already. Now let’s move on to the practical part.&lt;/p&gt;

&lt;h2&gt;
  
  
  How do you train a Convolutional Neural Network in TensorFlow?
&lt;/h2&gt;

&lt;p&gt;TensorFlow is Python’s most popular Deep Learning framework. I’ve heard good things about PyTorch too, though I’ve never had the chance to try it.&lt;/p&gt;

&lt;p&gt;I’ve already written one tutorial on &lt;a href="http://www.datastuff.tech/machine-learning/autoencoder-deep-learning-tensorflow-eager-api-keras/"&gt;how to train a Neural Network with TensorFlow’s Eager API&lt;/a&gt;, focusing on AutoEncoders.&lt;/p&gt;

&lt;p&gt;Today will be different: we will try three different architectures, and see which one does better. As usual, all the code is available on &lt;a href="https://github.com/StrikingLoo/Cats-and-dogs-classifier-tensorflow-CNN"&gt;GitHub&lt;/a&gt;, so you can try everything out for yourself or follow along. Of course I’ll also be showing you Python snippets.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Dataset
&lt;/h3&gt;

&lt;p&gt;We will be training a neural network to predict whether an image contains a dog or a cat. To do this we’ll use Kaggle’s &lt;a href="https://www.kaggle.com/c/dogs-vs-cats"&gt;cats and dogs Dataset&lt;/a&gt;. It contains 12500 pictures of cats and 12500 of dogs, with different resolutions.&lt;/p&gt;

&lt;h3&gt;
  
  
  Loading and Preprocessing our Image Data with NumPy
&lt;/h3&gt;

&lt;p&gt;A neural network receives a features vector or matrix as an input, typically with &lt;strong&gt;fixed dimensions&lt;/strong&gt;. How do we generate that from our pictures?&lt;/p&gt;

&lt;p&gt;Lucky for us, Python’s PIL imaging library provides an easy way to load an image as a NumPy array: a HeightxWidth matrix of RGB values.&lt;br&gt;&lt;br&gt;
We already did that in &lt;a href="https://dev.to/strikingloo/k-means-clustering-with-dask-image-filters-for-pictures-of-kittens-ip7"&gt;this article&lt;/a&gt;, so I’ll just reuse that code.&lt;/p&gt;

&lt;p&gt;However, we still have to deal with the fixed-dimensions part: which dimensions do we choose for our input layer? This is important, since we will have to resize every picture to the chosen resolution, and we don’t want to distort aspect ratios so much that it adds noise for the network.&lt;/p&gt;

&lt;p&gt;Here’s how we can see what the most common shape is in our dataset.&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;
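&lt;p&gt;The gist isn’t embedded here, so as a stand-in, here’s a hedged sketch of the idea: collect every picture’s shape and count the most frequent one. The &lt;code&gt;most_common_shape&lt;/code&gt; helper and the sample list are made up for illustration; the real script would gather shapes with something like &lt;code&gt;Image.open(path).size&lt;/code&gt; over the files.&lt;/p&gt;

```python
from collections import Counter

def most_common_shape(shapes):
    """Given a list of (width, height) tuples, return the most frequent one."""
    return Counter(shapes).most_common(1)[0][0]

# A made-up sample standing in for the shapes of the first 1000 pictures:
sampled_shapes = [(500, 375)] * 420 + [(499, 375)] * 80 + [(320, 240)] * 60
print(most_common_shape(sampled_shapes))  # (500, 375)
```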


&lt;p&gt;I sampled the first 1000 pictures for this, though the result did not change when I looked at 5000. The most common shape was 375×500, though I decided to divide that by 4 for our network’s input.&lt;/p&gt;

&lt;p&gt;This is what our image loading code looks like now.&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;
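&lt;p&gt;Since that gist isn’t embedded either, here’s a rough, hypothetical sketch of the preprocessing step. It downsamples by striding instead of a proper PIL resize, and scales pixels to [0, 1]; the &lt;code&gt;preprocess&lt;/code&gt; name and target size are my assumptions, not the original code.&lt;/p&gt;

```python
import numpy as np

# Chosen input size: the most common shape (375x500) divided by 4.
TARGET_H, TARGET_W = 375 // 4, 500 // 4  # ≈ 93 x 125

def preprocess(img):
    """Crudely downsample an image array by striding and scale pixel values
    to [0, 1]. (The real code would resize with PIL instead of striding.)"""
    h, w = img.shape[0], img.shape[1]
    small = img[:: max(1, h // TARGET_H), :: max(1, w // TARGET_W)]
    small = small[:TARGET_H, :TARGET_W]  # crop any remainder
    return small.astype(np.float32) / 255.0

# A fake 375x500 RGB image standing in for a loaded picture:
fake_img = np.random.randint(0, 256, size=(375, 500, 3), dtype=np.uint8)
out = preprocess(fake_img)
print(out.shape)  # (93, 125, 3)
```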
 

&lt;p&gt;Finally, you can load the data with this snippet. I chose to use a sample of 4096 pictures for the training set and 1024 for validation. However, that’s just because my PC couldn’t handle much more due to RAM size.&lt;/p&gt;

&lt;p&gt;Feel free to increase these numbers to the max (like 10K for training and 2500 for validation) if you try this at home!&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;
 
&lt;h3&gt;
  
  
  Training our Neural Networks
&lt;/h3&gt;

&lt;p&gt;First of all, as a sort of baseline, let’s see how well a normal &lt;strong&gt;MLP&lt;/strong&gt; does on this task. If Convolutional Neural Networks are so revolutionary, I’d expect this experiment’s results to be &lt;strong&gt;terrible&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;So here’s a single hidden layer fully connected neural network.&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;


&lt;p&gt;All the models for this article were trained using AdamOptimizer, since in my experience it converges fastest. I only tuned the learning rate per model (here it was 1e-5).&lt;/p&gt;

&lt;p&gt;I trained this model for 10 epochs, and it basically converged to &lt;strong&gt;random guessing&lt;/strong&gt;. I made sure to &lt;strong&gt;shuffle the training data&lt;/strong&gt;, since I loaded it in order and that could’ve biased the model.&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;


&lt;p&gt;I used &lt;strong&gt;MSE&lt;/strong&gt; as loss function, since it’s usually &lt;strong&gt;more intuitive to interpret&lt;/strong&gt;. If your MSE is 0.5 in binary classification, you’re as good as &lt;strong&gt;always predicting 0&lt;/strong&gt;. However, MLPs with more layers, or different loss functions &lt;strong&gt;did not perform better&lt;/strong&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Training a Convolutional Neural Network
&lt;/h3&gt;

&lt;p&gt;How much good can a single convolutional layer do? Let’s see.&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;
&lt;p&gt;For this network, I decided to add a single convolutional layer (with 24 kernels), followed by 2 fully connected layers.&lt;/p&gt;

&lt;p&gt;All Max Pooling does is reduce each 2×2 block of neurons to a single one, keeping the highest value among the four.&lt;/p&gt;

&lt;p&gt;After only 5 epochs, it was already &lt;strong&gt;performing much better&lt;/strong&gt; than the previous networks. With a validation MSE of 0.36, it was a lot better than random guessing. Notice however that I had to use a &lt;strong&gt;much smaller learning rate&lt;/strong&gt;. Also, even though it learned in fewer epochs, &lt;strong&gt;each epoch&lt;/strong&gt; took &lt;strong&gt;much longer&lt;/strong&gt;. The model is also quite a lot heavier (200+ MB).&lt;/p&gt;

&lt;p&gt;I decided to also start measuring the Pearson correlation between predictions and validation labels. This model scored a 15.2%.&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;
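&lt;p&gt;For reference, the Pearson correlation itself is a one-liner with NumPy (the prediction and label vectors below are made-up illustration values, not the model’s actual outputs):&lt;/p&gt;

```python
import numpy as np

# Pearson correlation between the model's raw predictions and the 0/1 labels,
# via NumPy's correlation-matrix helper.
def pearson(preds, labels):
    return np.corrcoef(preds, labels)[0, 1]

labels = np.array([1, 0, 1, 1, 0, 0, 1, 0])
preds  = np.array([0.9, 0.2, 0.6, 0.7, 0.4, 0.1, 0.8, 0.3])

print(round(pearson(preds, labels), 3))  # ≈ 0.913 for these toy values
```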
 
&lt;h3&gt;
  
  
  Neural Network with two Convolutional Layers
&lt;/h3&gt;

&lt;p&gt;Since that model had done so much better, I decided I would try out a bigger one. I added &lt;strong&gt;another convolutional layer&lt;/strong&gt;, and made both a lot bigger (48 kernels each). This means the model gets to learn &lt;strong&gt;more complex features&lt;/strong&gt; from the images. However it also predictably meant my RAM almost exploded. Also, training took &lt;strong&gt;a lot longer&lt;/strong&gt; (half an hour for 15 epochs).&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;
 

&lt;p&gt;Results were superb. The Pearson correlation coefficient between predictions and labels reached 0.21, with validation MSE reaching as low as 0.33.&lt;/p&gt;

&lt;p&gt;Let’s measure the network’s accuracy. Since 1 is a cat and 0 is a dog, I could say “If the model predicts a value higher than some threshold t, then predict &lt;em&gt;cat&lt;/em&gt;. Else predict &lt;em&gt;dog&lt;/em&gt;.” After trying 10 straightforward thresholds, this network had a &lt;strong&gt;maximum accuracy of 61%&lt;/strong&gt;.&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;
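&lt;p&gt;That threshold sweep can be sketched like this (toy predictions and labels; the real numbers come from the trained network):&lt;/p&gt;

```python
import numpy as np

# Try 10 straightforward cutoffs and keep the one that classifies
# the most validation examples correctly.
def best_threshold_accuracy(preds, labels):
    best = 0.0
    for t in np.linspace(0.05, 0.95, 10):
        acc = np.mean((preds > t).astype(int) == labels)
        best = max(best, acc)
    return best

labels = np.array([1, 1, 1, 0, 0, 0, 1, 0])
preds  = np.array([0.8, 0.6, 0.4, 0.3, 0.2, 0.55, 0.7, 0.1])

print(best_threshold_accuracy(preds, labels))  # 0.875 on these toy values
```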


&lt;h3&gt;
  
  
  Even bigger Convolutional Neural Network
&lt;/h3&gt;

&lt;p&gt;Since clearly adding size to the model made it learn better, I tried making both convolutional layers a lot bigger, with 128 filters each. I left the rest of the model untouched, and didn’t change the learning rate.&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;


&lt;p&gt;This model finally reached a correlation of 30%! Its best &lt;strong&gt;accuracy was 67%&lt;/strong&gt;, which means it was right two thirds of the time. I assume an even bigger model could’ve fit the data even better. However, this one was taking 7 minutes per epoch already, and I didn’t want to leave the next one training all morning.&lt;/p&gt;

&lt;p&gt;Usually, there’s a &lt;strong&gt;tradeoff&lt;/strong&gt; to be made between a model’s &lt;strong&gt;size&lt;/strong&gt; and &lt;strong&gt;time constraints&lt;/strong&gt;. Size limits how well the network can fit the data (a &lt;strong&gt;small model&lt;/strong&gt; will &lt;strong&gt;underfit&lt;/strong&gt;), but I won’t wait 3 hours for my model to learn.&lt;/p&gt;

&lt;p&gt;The same concerns may apply if you have a business deadline.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusions
&lt;/h2&gt;

&lt;p&gt;We’ve seen Convolutional Neural Networks are &lt;strong&gt;significantly better&lt;/strong&gt; than vanilla architectures at &lt;strong&gt;image classification&lt;/strong&gt; tasks. We also tried different &lt;strong&gt;metrics&lt;/strong&gt; to measure &lt;strong&gt;model performance&lt;/strong&gt; (correlation, accuracy).&lt;/p&gt;

&lt;p&gt;We learned about the &lt;strong&gt;tradeoff&lt;/strong&gt; between a &lt;strong&gt;model’s size&lt;/strong&gt; (which prevents underfitting) and its &lt;strong&gt;convergence speed&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Lastly, we used TensorFlow’s eager API to easily train a Deep Neural Network, and NumPy for our (albeit simple) image preprocessing.&lt;/p&gt;

&lt;p&gt;For future articles, I believe we could experiment a lot more with different pooling layers, filter sizes, striding and a different preprocessing for this same task.&lt;/p&gt;

&lt;p&gt;Did you find this article useful? Would you have preferred to learn more about anything else? Is anything not clear enough? Let me know in the comments!&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Find me on &lt;a href="http://www.twitter.com/strikingloo"&gt;Twitter&lt;/a&gt;, &lt;a href="http://www.medium.com/@strikingloo"&gt;Medium&lt;/a&gt; or &lt;a href="http://www.dev.to/strikingloo"&gt;Dev.to&lt;/a&gt; if you have any questions, or want to contact me for anything. If you want to start a career in Machine Learning, here’s my &lt;a href="http://www.datastuff.tech/data-science/3-machine-learning-books-that-helped-me-level-up-as-a-data-scientist/"&gt;recommended reading list&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>beginners</category>
      <category>deeplearning</category>
      <category>python</category>
    </item>
  </channel>
</rss>
