<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Luciano Strika</title>
    <description>The latest articles on DEV Community by Luciano Strika (@strikingloo).</description>
    <link>https://dev.to/strikingloo</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F156204%2F43b00ecf-1d3a-4661-bea4-6a31c484daa3.jpg</url>
      <title>DEV Community: Luciano Strika</title>
      <link>https://dev.to/strikingloo</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/strikingloo"/>
    <language>en</language>
    <item>
      <title>How to Create a Spoiler Tag in HTML</title>
      <dc:creator>Luciano Strika</dc:creator>
      <pubDate>Sun, 24 Mar 2024 05:46:42 +0000</pubDate>
      <link>https://dev.to/strikingloo/how-to-create-a-spoiler-tag-in-html-1ni7</link>
      <guid>https://dev.to/strikingloo/how-to-create-a-spoiler-tag-in-html-1ni7</guid>
      <description>&lt;p&gt;Many forums or blogs make use of the spoiler tag: a little button or anchor that, if clicked, reveals otherwise invisible content.&lt;/p&gt;

&lt;p&gt;I wanted to add this functionality to the site for Tables of Content, so I figured adding this guide here could be useful both for my own future reference and for anyone else looking for a concise explanation.&lt;/p&gt;

&lt;p&gt;In this post we will code a spoiler tag: an anchor that shows or hides an HTML element when clicked.&lt;/p&gt;

&lt;p&gt;The implementation is divided into three parts: a CSS class, the HTML markup, and a small JavaScript function.&lt;/p&gt;

&lt;h3&gt;
  
  
  CSS Example Class
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight css"&gt;&lt;code&gt;    &lt;span class="nc"&gt;.spoiler-content&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="nl"&gt;display&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt; &lt;span class="nb"&gt;none&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;With the &lt;code&gt;display&lt;/code&gt; property set to &lt;code&gt;none&lt;/code&gt;, we make content invisible (but still part of the page's HTML). Setting this to &lt;code&gt;block&lt;/code&gt; would make it visible again.&lt;/p&gt;

&lt;h3&gt;
  
  
  HTML Part
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight html"&gt;&lt;code&gt;&lt;span class="nt"&gt;&amp;lt;a&lt;/span&gt; &lt;span class="na"&gt;href=&lt;/span&gt;&lt;span class="s"&gt;"#"&lt;/span&gt; &lt;span class="na"&gt;onclick=&lt;/span&gt;&lt;span class="s"&gt;"toggleSpoiler(event, '1')"&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;Table of Contents&lt;span class="nt"&gt;&amp;lt;/a&amp;gt;&lt;/span&gt;

&lt;span class="nt"&gt;&amp;lt;div&lt;/span&gt; &lt;span class="na"&gt;class=&lt;/span&gt;&lt;span class="s"&gt;"spoiler-content"&lt;/span&gt; &lt;span class="na"&gt;id=&lt;/span&gt;&lt;span class="s"&gt;'1'&lt;/span&gt;&lt;span class="nt"&gt;&amp;gt;&lt;/span&gt;
    &lt;span class="c"&gt;&amp;lt;!-- Your invisible content here.&amp;gt; &amp;lt;/!--&amp;gt;&lt;/span&gt;
&lt;span class="nt"&gt;&amp;lt;/div&amp;gt;&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Pretty straightforward: the &lt;code&gt;div&lt;/code&gt; has the &lt;code&gt;spoiler-content&lt;/code&gt; class that hides it by default, plus a unique &lt;code&gt;id&lt;/code&gt;. We pair that content with the anchor by passing the same id as the second argument to the &lt;code&gt;toggleSpoiler&lt;/code&gt; function.&lt;/p&gt;

&lt;h3&gt;
  
  
  JavaScript Part
&lt;/h3&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight javascript"&gt;&lt;code&gt;&lt;span class="kd"&gt;function&lt;/span&gt; &lt;span class="nf"&gt;toggleSpoiler&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;event&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="nx"&gt;id&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
    &lt;span class="nx"&gt;event&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;preventDefault&lt;/span&gt;&lt;span class="p"&gt;();&lt;/span&gt;
    &lt;span class="kd"&gt;var&lt;/span&gt; &lt;span class="nx"&gt;spoilerContent&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="nb"&gt;document&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;getElementById&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;id&lt;/span&gt;&lt;span class="p"&gt;);&lt;/span&gt;
    &lt;span class="k"&gt;if &lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="nx"&gt;spoilerContent&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;style&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;display&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;none&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt; &lt;span class="o"&gt;||&lt;/span&gt; &lt;span class="nx"&gt;spoilerContent&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;style&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;display&lt;/span&gt; &lt;span class="o"&gt;===&lt;/span&gt; &lt;span class="dl"&gt;''&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="nx"&gt;spoilerContent&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;style&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;display&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;block&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt; &lt;span class="k"&gt;else&lt;/span&gt; &lt;span class="p"&gt;{&lt;/span&gt;
        &lt;span class="nx"&gt;spoilerContent&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;style&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nx"&gt;display&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="s1"&gt;none&lt;/span&gt;&lt;span class="dl"&gt;'&lt;/span&gt;&lt;span class="p"&gt;;&lt;/span&gt;
    &lt;span class="p"&gt;}&lt;/span&gt;
&lt;span class="p"&gt;}&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;For the script, we define &lt;code&gt;toggleSpoiler&lt;/code&gt; so that, given an id, the element with that id becomes hidden if it is visible, and vice versa. The extra check for &lt;code&gt;display === ''&lt;/code&gt; is needed because &lt;code&gt;element.style.display&lt;/code&gt; reflects only inline styles, not styles applied through a CSS class: on the first click it reads &lt;code&gt;''&lt;/code&gt; even though the class is hiding the element, so without the check you would need to click twice to reveal the content.&lt;/p&gt;

&lt;p&gt;And there you have it: a simple spoiler tag in plain HTML/JS. Note that you could use a button or any other element instead of the anchor, and the div can contain any arbitrary HTML elements.&lt;/p&gt;

</description>
      <category>html</category>
      <category>webdev</category>
      <category>javascript</category>
      <category>css</category>
    </item>
    <item>
      <title>Ant Colony Optimization and the Travelling Salesman Problem</title>
      <dc:creator>Luciano Strika</dc:creator>
      <pubDate>Mon, 12 Sep 2022 14:48:52 +0000</pubDate>
      <link>https://dev.to/strikingloo/ant-colony-optimization-and-the-travelling-salesman-problem-1034</link>
      <guid>https://dev.to/strikingloo/ant-colony-optimization-and-the-travelling-salesman-problem-1034</guid>
      <description>&lt;p&gt;Ant Colony Optimization algorithms always intrigued me. They are loosely based in biology and the real protocols ants use to communicate and plan routes. They do this by coordinating through small pheromone messages: chemical trails they leave as they move forward, signaling for other ants to follow them. Even though each ant is not especially smart, and they follow simple rules individually, collectively they can converge to complex behaviors as a system, and amazing properties emerge.&lt;/p&gt;

&lt;p&gt;In the computational sense, Ant Colony Optimization algorithms tackle complex optimization problems for which no closed-form or polynomial-time solution exists, by trying different "routes" across some relevant space or graph and searching for the most efficient one (typically the shortest) that satisfies the problem's constraints.&lt;/p&gt;

&lt;p&gt;Personally, I had a debt with myself from an &lt;em&gt;Algorithms III&lt;/em&gt; class five years ago, where Ant Colony Optimization was mentioned as an alternative to simulated annealing and Genetic Algorithms, but not expanded on and left as an exercise for future study. The concept sounded interesting back then, but since I was busy with other matters I decided to postpone studying it. Now that I have more free time, I finally decided to give it a try. And what better way to verify that I learned it than coding an Ant Colony Optimization algorithm from scratch and showing it here?&lt;/p&gt;

&lt;p&gt;First, let's start with some motivation: why would you want to learn about Ant Colony Optimization?&lt;/p&gt;

&lt;h2&gt;
  
  
  The Travelling Salesman Problem
&lt;/h2&gt;

&lt;p&gt;One especially important use-case for Ant Colony Optimization (ACO from now on) algorithms is solving the Travelling Salesman Problem (TSP).&lt;/p&gt;

&lt;p&gt;This problem is defined as follows: &lt;em&gt;Given a complete graph G with weighted edges, find the minimum weight Hamiltonian cycle. That is, a cycle that passes through each node exactly once and minimizes the total weight sum.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Note that the graph needs to be &lt;em&gt;complete&lt;/em&gt;: there must exist an edge connecting each possible pair of nodes. For graphs based on real places, this makes sense: you can just connect two places with an edge whose weight equals their distance, or their estimated travel time. &lt;/p&gt;

&lt;p&gt;For a concrete example, look at the following graph.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--T8ckhfFu--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://strikingloo.github.io/resources/post_image/TSP-graph-example.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--T8ckhfFu--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://strikingloo.github.io/resources/post_image/TSP-graph-example.png" alt="an image of a graph for travelling salesman problem" width="788" height="582"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;In this case, the salesman wants to visit every home once and get back to where he started. Each edge joining two houses has a numeric label representing the travel time between them in minutes. The salesman is a busy man, and would prefer to spend as little time as possible visiting all the houses. What would be the most efficient route?&lt;/p&gt;

&lt;p&gt;As an example, if we started from the house on the top left, we would want to go to the bottom house, then right, then center, then back left, for a total of 80 minutes of travel. Since this is a small instance, you can take a little time to convince yourself by hand that this is the right answer: try to find a different route that takes less time to visit the four houses.&lt;/p&gt;

&lt;p&gt;Why is the Travelling Salesman Problem important? Many reasons. &lt;/p&gt;

&lt;p&gt;First of all, &lt;strong&gt;TSP appears everywhere in logistics&lt;/strong&gt;. Imagine you need to make multiple deliveries with a truck. You have packages, each of which has to go to a different place. What is the most time-efficient order to deliver them in and then go back to the warehouse? You just found the Travelling Salesman Problem.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;TSP is also important because it is an NP-complete problem&lt;/strong&gt;. That means that within the family of NP (nondeterministic polynomial time) problems, those for which verifying a solution takes polynomial time even if finding one is harder, it sits in the hardest category: any other NP problem can be transformed into a TSP instance in polynomial time (sometimes through very esoteric means, but still), so if we found a polynomial-time algorithm for TSP, we would have found one for every NP problem.&lt;/p&gt;

&lt;p&gt;Showing that TSP can be solved in polynomial time would prove P=NP. This would be huge, to the point of being considered one of this century's biggest open questions. Suddenly swathes of hard problems would become tractable, many new applications would open up, and multiple kinds of software would become vastly more efficient. What it would do for logistics alone would probably contribute significantly to the world's GDP and global trade.&lt;/p&gt;

&lt;p&gt;But before I digress further, now that we know what TSP is, let's see how to solve it. For more information, I recommend the &lt;a href="https://en.wikipedia.org/wiki/Travelling_salesman_problem"&gt;Wikipedia article on TSP&lt;/a&gt;.&lt;/p&gt;

&lt;h2&gt;
  
  
  Ant Colony Optimization: Solving TSP
&lt;/h2&gt;

&lt;p&gt;There are many possible ways to solve the Travelling Salesman Problem for a given graph. As discussed above, there is no known way to find the certified best solution for an arbitrary graph quickly; exact methods take a very long time on large instances.&lt;/p&gt;

&lt;p&gt;The trivial way to solve TSP would be to enumerate all possible Hamiltonian cycles and keep the best one. This means looking at all possible orderings of nodes, which grow factorially, O(N!), with the number N of nodes. Factorial growth is much worse than exponential growth, for any base. It is so bad that even parallelism would not help: since adding a single node makes the problem N times harder, each extra node in the graph would require growing the infrastructure superexponentially just to keep up. This would be extremely inefficient.&lt;/p&gt;
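&lt;p&gt;To make the factorial blow-up concrete, here is a minimal brute-force solver for a toy four-node instance (the distance matrix is made up purely for illustration):&lt;/p&gt;

```python
from itertools import permutations

# Made-up symmetric distance matrix for a toy 4-node instance.
D = [
    [0, 10, 15, 20],
    [10, 0, 35, 25],
    [15, 35, 0, 30],
    [20, 25, 30, 0],
]

def cycle_weight(order):
    """Total weight of the Hamiltonian cycle visiting nodes in this order."""
    n = len(order)
    return sum(D[order[i]][order[(i + 1) % n]] for i in range(n))

def brute_force_tsp(n):
    """Enumerate all (n-1)! cycles starting from node 0 and keep the best."""
    best = (float("inf"), ())
    for perm in permutations(range(1, n)):
        order = (0,) + perm
        best = min(best, (cycle_weight(order), order))
    return best
```

&lt;p&gt;With four nodes this checks only 3! = 6 orderings, but at 12 nodes it would already need to check about 40 million.&lt;/p&gt;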

&lt;p&gt;Due to this, instead of looking for the exact solution for a graph, what most frameworks and solvers do is find approximate solutions: can we find a way of connecting all nodes in a cycle that is "good enough"? To achieve this, multiple optimization algorithms exist. The &lt;em&gt;Networkx&lt;/em&gt; framework for graphs in Python solves TSP with &lt;em&gt;&lt;a href="https://en.wikipedia.org/wiki/Christofides_algorithm"&gt;Christofides&lt;/a&gt;&lt;/em&gt; or &lt;em&gt;&lt;a href="https://en.wikipedia.org/wiki/Simulated_annealing"&gt;Simulated Annealing&lt;/a&gt;&lt;/em&gt;, for example, of which the latter is quite similar to Ant Colony Optimization. Christofides has the nice property of never being wrong by more than 50% on metric graphs (so if the best cycle has a weight of 100, Christofides is guaranteed to find a cycle of weight at most 150).&lt;/p&gt;

&lt;p&gt;The algorithm we will see today is one such way of approximating a solution. &lt;/p&gt;

&lt;p&gt;In &lt;strong&gt;&lt;a href="https://en.wikipedia.org/wiki/Ant_colony_optimization_algorithms"&gt;Ant Colony Optimization Algorithms&lt;/a&gt;&lt;/strong&gt;, we will run a simulation of "ants" traversing the graph, constrained to only move in cycles, visiting each node exactly once. Each ant will leave, after finishing its traversal, a trail of pheromones that is proportional to the inverse weight of the discovered cycle (that is, if the cycle the ant encountered is twice as big, it will leave half the pheromones on each edge of the graph it went through, and so on). &lt;/p&gt;

&lt;p&gt;Finally, though we will make ants choose which edge to take at each step of their traversal randomly, they will assign more preference to edges with more pheromones on them, and less preference to those with fewer. Additionally, longer edges will receive less preference, since they imply higher travel times.&lt;/p&gt;

&lt;p&gt;These two preference adjustments could be linear, or any other polynomial (in my case, I tried many different coefficients and found the optimum to be sublinear for the pheromones, and quadratic or a power of 1.5 for the distance).&lt;/p&gt;

&lt;p&gt;The pseudocode Wikipedia gives is:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;procedure ACO_MetaHeuristic is
    while not terminated do
        generateSolutions()
        daemonActions()
        pheromoneUpdate()
    repeat
end procedure
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;For this post, I coded Ant Colony Optimization (initially proposed by Marco Dorigo in 1992 in his PhD thesis) from scratch in Python using the Wikipedia article as a reference. I then ran a few experiments with it and benchmarked it against other algorithms for different problem instances.&lt;/p&gt;

&lt;p&gt;I used numpy for the traversals and other numerical operations, and pytest for testing. The whole code is &lt;a href="https://github.com/StrikingLoo/ant-colony-optimization"&gt;available on GitHub&lt;/a&gt;, but I will show you the main parts step-by-step now. If you're not interested in how the Ant Colony Optimization algorithm works in detail, you can skip straight to the results and benchmarks.&lt;/p&gt;

&lt;p&gt;First of all, I designed a minimal Graph class, whose code I will not include here since it is very simple. Suffice it to say that its &lt;em&gt;.distance&lt;/em&gt; property holds an adjacency matrix with the weight (distance) for each edge.&lt;/p&gt;

&lt;p&gt;Then I coded the &lt;code&gt;traverse_graph&lt;/code&gt; function, which represents a single ant going through the graph one node at a time, constrained to move in a cycle. &lt;/p&gt;

&lt;p&gt;The ant starts from a given node, and will at each step choose from among every node it has not stepped on yet, with a weighted distribution that assigns preference proportional to an edge's pheromone load and to the inverse of its distance, each raised to a power that is a hyperparameter coefficient (&lt;em&gt;alpha&lt;/em&gt; and &lt;em&gt;beta&lt;/em&gt; respectively).&lt;/p&gt;

&lt;p&gt;That is, the probability of choosing a certain edge will be proportional to:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--7Gd64S9g--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://strikingloo.github.io/resources/post_image/weight.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--7Gd64S9g--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://strikingloo.github.io/resources/post_image/weight.png" alt="weight equation for ant colony optimization" width="462" height="343"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Where P is the level of pheromones in that edge, and D the distance the edge covers. To get the distribution we sample from at each random jump, we normalize these weight coefficients so they add up to one.&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;
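&lt;p&gt;The weighted choice above can be sketched as follows (the names are illustrative, not necessarily the repository's code; it assumes numpy matrices for pheromone levels and distances):&lt;/p&gt;

```python
import numpy as np

def choose_next_node(current, unvisited, pheromone, distance, alpha=0.9, beta=1.5):
    """Sample the ant's next node: preference is proportional to
    P**alpha times (1/D)**beta, normalized so the weights add up to one."""
    cand = np.array(sorted(unvisited))
    weights = pheromone[current, cand] ** alpha * (1.0 / distance[current, cand]) ** beta
    return int(np.random.choice(cand, p=weights / weights.sum()))
```

&lt;p&gt;Normalizing by the sum turns the raw preference weights into the probability distribution we sample from at each jump.&lt;/p&gt;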



&lt;p&gt;After that, the optimization procedure itself consists of:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Initialize the graph with a constant (typically initially very high, to encourage exploration) amount of pheromones on each edge.&lt;/li&gt;
&lt;li&gt;Make &lt;em&gt;k&lt;/em&gt; ants start from random nodes and traverse the graph using the procedure defined above.&lt;/li&gt;
&lt;li&gt;For each traversal, update the level of pheromones in its edges according to the function &lt;em&gt;Q/total_weight&lt;/em&gt;, where Q is a hyperparameter (a constant) and &lt;em&gt;total_weight&lt;/em&gt; is the sum of the distances of all the edges in the cycle. If using &lt;em&gt;elitism&lt;/em&gt;, add to the list of traversals the best one we have encountered so far, to incentivize the ants not to deviate too far from it.&lt;/li&gt;
&lt;li&gt;If a cycle was found that beats the best one so far, update it.&lt;/li&gt;
&lt;li&gt;All pheromone levels are multiplied by a &lt;em&gt;degradation constant&lt;/em&gt;, another hyperparameter between 0 and 1, which represents the passage of time and prevents bad past solutions from influencing good recent ones too much.&lt;/li&gt;
&lt;li&gt;Repeat for a certain number of iterations, or until convergence.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Intuitively, this converges to short cycles because &lt;strong&gt;the shorter an ant's cycle, the more pheromone it leaves on that cycle's edges&lt;/strong&gt;; as old pheromones fade over time and new ants favor edges with more pheromones on them, &lt;strong&gt;new cycles will tend to be ever shorter&lt;/strong&gt;. Crucially, since each ant chooses its next step at random, even though ants will &lt;em&gt;tend&lt;/em&gt; to pick the candidates with the most pheromone every time, they also have a non-negligible probability of picking a different edge and going off exploring. Should that lead to a better cycle overall, that ant will tell future ants about it by leaving even more pheromones, since the cycle is shorter.&lt;/p&gt;

&lt;p&gt;Over time, we would expect the average ant traversal to get shorter and shorter.&lt;/p&gt;

&lt;p&gt;Additionally, I tried a few more modifications to the algorithm: the 'elite' or best candidate can be specified manually at the start (which allows reusing the best solution from other runs), and I designed a protocol for increasing the amount of pheromones everywhere by a constant if progress stagnated (no new best cycle found after &lt;em&gt;patience&lt;/em&gt; iterations), though I did not achieve better results through that. Also, after running &lt;em&gt;k&lt;/em&gt; ants, I only updated the pheromone trails with the best &lt;em&gt;k/2&lt;/em&gt; ants' traversals instead of using them all. This did improve results quite significantly, as did using elite candidates: not keeping them made the algorithm more unstable, and it converged a lot more slowly.&lt;/p&gt;

&lt;p&gt;Here is the whole function in all its glory (with comments for sanity).&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;
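&lt;p&gt;As a condensed, self-contained sketch of the procedure described above (simplified, with illustrative names; the actual repository code differs in its details):&lt;/p&gt;

```python
import numpy as np

def traverse(distance, pheromone, alpha, beta, start=0):
    """One ant builds a Hamiltonian cycle; at each step it samples the next
    node with preference pheromone**alpha times (1/distance)**beta."""
    n = distance.shape[0]
    cycle, current = [start], start
    unvisited = set(range(n)) - {start}
    while unvisited:
        cand = np.array(sorted(unvisited))
        w = pheromone[current, cand] ** alpha * (1.0 / distance[current, cand]) ** beta
        nxt = int(np.random.choice(cand, p=w / w.sum()))
        cycle.append(nxt)
        unvisited.remove(nxt)
        current = nxt
    weight = sum(distance[cycle[i], cycle[(i + 1) % n]] for i in range(n))
    return cycle, weight

def ant_colony(distance, n_ants=10, n_iters=100, alpha=0.9, beta=1.5,
               Q=100.0, degradation=0.9):
    """Simplified main loop: k ants traverse, the best k/2 (plus the elite
    best-so-far cycle) deposit Q/total_weight on their edges, then trails fade."""
    n = distance.shape[0]
    pheromone = np.full((n, n), 1.0)  # high initial load encourages exploration
    best_cycle, best_weight = None, float("inf")
    for _ in range(n_iters):
        ants = sorted((traverse(distance, pheromone, alpha, beta)
                       for _ in range(n_ants)), key=lambda t: t[1])
        if ants[0][1] == min(ants[0][1], best_weight):  # new best cycle found
            best_cycle, best_weight = ants[0]
        deposits = ants[: max(1, n_ants // 2)]          # best k/2 deposit
        deposits.append((best_cycle, best_weight))      # elitism
        for cycle, weight in deposits:
            for i in range(n):
                a, b = cycle[i], cycle[(i + 1) % n]
                pheromone[a, b] += Q / weight
                pheromone[b, a] += Q / weight
        pheromone *= degradation                        # old trails fade
    return best_cycle, best_weight
```

&lt;p&gt;On a small toy matrix this reliably recovers the optimal cycle within a few hundred iterations.&lt;/p&gt;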


&lt;p&gt;Some possible improvements for this algorithm that I didn't have the time for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Traversals could be trivially parallelized, since each ant is independent. This can be done very easily using the &lt;em&gt;multiprocessing&lt;/em&gt; Python module, but it doesn't work on Mac by default. In this tradeoff, I chose portability over speed.&lt;/li&gt;
&lt;li&gt;Choosing the next jump in a traversal can be done in parallel with numpy vector multiplication, which made everything run about 5x faster. However, due to numerical instability, a jump could be performed to the same node over and over even though I was multiplying its weight by zero, and solving this bug would have taken more time than I thought it was worth. If you find a way to make this work for all cases, feel free to make a pull request and you will get the credit and a link.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Tests and Results
&lt;/h2&gt;

&lt;p&gt;After coding the algorithm and testing it in toy cases, I was very happy to find that the internet had provided me with a wealth of different graphs and TSP problems to try it on.&lt;/p&gt;

&lt;p&gt;I got my first small but real test case from this &lt;a href="https://towardsdatascience.com/solving-the-travelling-salesman-problem-for-germany-using-networkx-in-python-2b181efd7b07"&gt;Medium article&lt;/a&gt;, which uses data on real German cities. I was happy to see ACO found the optimal solution in seconds! &lt;/p&gt;

&lt;p&gt;Then I found the huge &lt;a href="http://cs.uef.fi/sipu/santa/data.html"&gt;Santa Claus Challenge&lt;/a&gt; with coordinates data representing millions of houses in Finland (for Santa to visit). The entire dataset did not fit in memory, so I could not verify how close my solution got to the best ones in the challenge, but taking ever bigger samples let me see how fast or slow each part of the program was for profiling. Go to the &lt;a href="https://www.frontiersin.org/articles/10.3389/frobt.2021.689908/full"&gt;challenge's article&lt;/a&gt; for a fun read.&lt;/p&gt;

&lt;p&gt;Finally, my favorite resource for finding TSP problems, often with their optimal cycle's weight, was &lt;a href="http://comopt.ifi.uni-heidelberg.de/software/TSPLIB95/XML-TSPLIB/instances/"&gt;Heidelberg University's site&lt;/a&gt;. &lt;/p&gt;

&lt;p&gt;I used that site's Berlin dataset for most of my benchmarking and hyperparameter optimization, from which I found the best &lt;em&gt;alpha&lt;/em&gt; and &lt;em&gt;beta&lt;/em&gt; values to be around &lt;em&gt;0.9&lt;/em&gt; and &lt;em&gt;1.5&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;I was very happy to see that, while Networkx's &lt;em&gt;TSP solve&lt;/em&gt; took 2 seconds and this program took a couple of minutes, my solution for that dataset had a weight of ~44000 whereas Networkx's was around 46k. This shows that in some cases, even though slower, ACO algorithms can be a good approach for solving TSP problems.&lt;/p&gt;

&lt;h2&gt;
  
  
  Experiments
&lt;/h2&gt;

&lt;p&gt;Encouraged by the comments on Reddit, I decided to experiment further and see how the optimization behaved in different situations. &lt;/p&gt;

&lt;p&gt;Particularly, since ACO can be updated online, it is supposed to perform very well in dynamic network or logistics problems where the graph is shifting in real time, in comparison with other algorithms which need to be re-run from scratch.&lt;/p&gt;

&lt;p&gt;Since the ants update their pheromone trails in real time, whenever the edge distances shift they should eventually notice and change their paths to reflect it. For instance, if two nodes got closer (the distance value on the edge joining them was reduced), more ants should want to cross between them, and that edge's pheromone load should grow larger. Alternatively, if two nodes grow farther apart, the ants should shun the edge between them.&lt;/p&gt;

&lt;p&gt;To test whether this was the case, I tried two experiments. In both of them I started with the Berlin graph I had looked at earlier, on which I knew the algorithm converged after about 500 iterations of 50 ants each.&lt;/p&gt;

&lt;p&gt;For the first experiment, after the 500th iteration I selected the edge with the highest amount of pheromones, and made its weight 10 times bigger. That is, if the edge was joining nodes i and j, then the distance between them grew 10 times larger.&lt;/p&gt;

&lt;p&gt;I wanted to see how quickly the swarm would respond to this change, so I plotted the pheromone load for that edge from iteration 500 onwards for 500 more iterations.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--WzgNSNp4--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://strikingloo.github.io/resources/post_image/ant_trail_smaller.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--WzgNSNp4--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://strikingloo.github.io/resources/post_image/ant_trail_smaller.png" alt="An image depicting a graph of decreasing pheromone trails after an edge's weight grew bigger" width="640" height="480"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As you can see, the ants don't respond instantaneously to the changes, but after 30 iterations they have adapted to them and do not visit that edge nearly as often as before. Its pheromone level remains very low afterward, with occasional peaks probably due to some of the exploration incentives I set.&lt;/p&gt;

&lt;p&gt;For the second experiment, I took a random Hamiltonian cycle and divided all of its edge weights by 10. This cycle suddenly became tempting for the ants: a cheap way of traversing the whole graph, smaller by an order of magnitude. Again this change took place at the 500th iteration, and I wanted to see how the ants reacted.&lt;/p&gt;

&lt;p&gt;I looked at the mean pheromone load for edges in the diminished cycle, and this is what it looked like.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--drztmb7e--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://strikingloo.github.io/resources/post_image/ant_trail_mean.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--drztmb7e--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://strikingloo.github.io/resources/post_image/ant_trail_mean.png" alt="An image depicting a graph of increasing pheromone trails after a cycle grew shorter, incentivizing ants to explore it" width="640" height="480"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As expected, the ants were highly incentivized to deviate from their known paths and explore this cycle (it had a third of the weight of the next smallest cycle that the colony had found so far). After a single iteration, the average pheromone levels for that cycle had increased dramatically. &lt;/p&gt;

&lt;p&gt;This shows that, as long as the algorithm contemplates the possibility of change by always encouraging a minimum level of exploration, new opportunities can be exploited as they arise. &lt;/p&gt;

&lt;p&gt;Interestingly, when I plotted the minimum level of pheromones instead of the mean, it did not rise very much. I think this is because, even after dividing by ten, a few of the edges in the best solution were still not included in this cycle. This is further supported by the dip in average pheromone levels near the end of the graph above. I believe in the last 50 iterations a cycle was found that contained an edge that had not been diminished, but was nonetheless small enough to present an improvement.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusions
&lt;/h2&gt;

&lt;p&gt;We showed that Ant Colony Optimization can be implemented quite easily in Python, and since many of its operations can be vectorized or parallelized it should not be too slow, though it is not as fast as Christofides's algorithm or others.&lt;/p&gt;

&lt;p&gt;More importantly, we showed that in many datasets, ACO can converge to the optimal solution, and in many others its flexibility allows it to find better solutions (shorter traversals) than simpler algorithms.&lt;/p&gt;

&lt;p&gt;Additionally, we saw that one of the best properties of Ant Colony Optimization over other algorithms is its capacity for online adaptation to changes in the system. In certain situations this could prove critical for performance, especially where rapid response is required.&lt;/p&gt;

&lt;p&gt;On a more philosophical level, I think it is beautiful how by specifying a large set of simple agents that each follow very few rules, we could solve a problem that is known to be hard.&lt;/p&gt;

&lt;p&gt;I would like to try Ant Colony Optimization for problems other than TSP in the future, so if you know of any other applications where ACO shines, let me know! &lt;/p&gt;

&lt;p&gt;&lt;strong&gt;If you enjoyed this article, please share it on Twitter or with a friend.&lt;/strong&gt; I write these for you and would be happy if more people can read them and share my love for algorithms.&lt;/p&gt;

&lt;h2&gt;
  
  
  Suggested Further Reading
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;a href="https://kazimuth.github.io/blog/post/shake-and-pull-gently/"&gt;&lt;em&gt;Shake and Pull Gently&lt;/em&gt;, Kazimuth&lt;/a&gt;: This post reminded me of my love for search and optimization algorithms, and I recommend it full-heartedly.&lt;/li&gt;
&lt;li&gt;Reddit User &lt;em&gt;/u/git&lt;/em&gt;'s comments on &lt;a href="https://www.reddit.com/r/programming/comments/wx69fs/comment/ilplkgs/"&gt;Ant Behavior&lt;/a&gt; and &lt;a href="https://www.reddit.com/r/funny/comments/wt1fcr/comment/il1w9u2/"&gt;Ant Trails&lt;/a&gt;, which originally inspired me to write this post.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://www.frontiersin.org/articles/10.3389/frobt.2021.689908/full"&gt;Solving the Large-Scale TSP Problem in 1 h: Santa Claus Challenge 2020&lt;/a&gt;: A fun challenge and a good explanation of TSP.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://arxiv.org/pdf/2205.15678v1.pdf"&gt;Automatic Relation-aware Graph Network Proliferation&lt;/a&gt;: Using Graph Neural Networks to solve, among other things, TSP.&lt;/li&gt;
&lt;li&gt;
&lt;a href="https://github.com/maoaiz/tsp-genetic-python"&gt;TSP Genetic Python&lt;/a&gt;: A genetic algorithm for solving TSP.&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>programming</category>
      <category>python</category>
      <category>algorithms</category>
      <category>math</category>
    </item>
    <item>
      <title>Feature Visualization on Convolutional Neural Networks (or: Making your own Deep-Dream with Keras)</title>
      <dc:creator>Luciano Strika</dc:creator>
      <pubDate>Sat, 30 May 2020 01:20:19 +0000</pubDate>
      <link>https://dev.to/strikingloo/feature-visualization-on-convolutional-neural-networks-keras-11mh</link>
      <guid>https://dev.to/strikingloo/feature-visualization-on-convolutional-neural-networks-keras-11mh</guid>
      <description>&lt;p&gt;According to Wikipedia, &lt;a href="https://en.wikipedia.org/wiki/Apophenia" rel="noopener noreferrer"&gt;apophenia&lt;/a&gt; is &lt;em&gt;“the tendency to mistakenly perceive connections and meaning between unrelated things”&lt;/em&gt; . It is also used as “the human propensity to seek patterns in random information”. Whether it’s a scientist doing research in a lab, or a conspiracy theorist warning us about how “it’s all connected”, I guess people need to feel like we understand what’s going on, even in the face of clearly random information.&lt;/p&gt;

&lt;p&gt;Deep Neural Networks are usually treated like “black boxes” due to their &lt;strong&gt;inscrutability&lt;/strong&gt; compared to more transparent models, like XGBoost or &lt;a href="https://github.com/interpretml/interpret" rel="noopener noreferrer"&gt;Explainable Boosted Machines&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;However, there is a way to interpret what &lt;strong&gt;each individual filter&lt;/strong&gt; is doing in a Convolutional Neural Network, and which kinds of images it is learning to detect.&lt;/p&gt;

&lt;p&gt;Convolutional Neural Networks have been prominent since at least 2012, when &lt;a href="http://image-net.org/challenges/LSVRC/2012/supervision.pdf" rel="noopener noreferrer"&gt;AlexNet&lt;/a&gt; won the &lt;a href="http://www.image-net.org/challenges/LSVRC/2012/index#workshop" rel="noopener noreferrer"&gt;ImageNet computer vision contest&lt;/a&gt; with an accuracy of 85%. Second place came in at a mere 74%, and &lt;a href="http://www.image-net.org/challenges/LSVRC/2013/results.php#cls" rel="noopener noreferrer"&gt;a year later&lt;/a&gt; most competitors had switched to this “new” kind of algorithm.&lt;/p&gt;

&lt;p&gt;They are widely used for many different tasks, mostly relating to &lt;strong&gt;image processing&lt;/strong&gt;. These include Image Classification, Detection problems, and many others.&lt;/p&gt;

&lt;p&gt;I will not go in depth into how a Convolutional Neural Network works, but if you’re getting started in this subject I recommend you read my &lt;a href="https://www.datastuff.tech/machine-learning/convolutional-neural-networks-an-introduction-tensorflow-eager/" rel="noopener noreferrer"&gt;Practical Introduction to Convolutional Neural Networks&lt;/a&gt; with working TensorFlow code.&lt;/p&gt;

&lt;p&gt;If you already have a grasp of how a Convolutional Neural Network works, then this article is all you need to know to understand &lt;strong&gt;what Feature Visualization does&lt;/strong&gt; and how it works.&lt;/p&gt;

&lt;h2&gt;
  
  
  How does Feature Visualization work?
&lt;/h2&gt;

&lt;p&gt;Normally, you would train a CNN feeding it images and labels, and using Gradient Descent or a similar optimization method to &lt;strong&gt;fit the Neural Network’s weights&lt;/strong&gt; so that it predicts the right label.&lt;/p&gt;

&lt;p&gt;Throughout this process, one would expect the image to remain untouched, and the same applies to the label.&lt;/p&gt;

&lt;p&gt;However, what do you think would happen if we took any image, picked one convolutional filter in our (already trained) network, and applied Gradient Ascent &lt;strong&gt;on the input image&lt;/strong&gt; to &lt;strong&gt;maximize that filter’s output&lt;/strong&gt;, while &lt;strong&gt;leaving the Network’s weights constant&lt;/strong&gt;?&lt;/p&gt;
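&lt;p&gt;To make the perspective shift concrete, here is a minimal NumPy sketch of that idea (a toy stand-in, not the Keras code used later in the post): a single hypothetical 3×3 filter plays the role of the frozen, trained weights, and we run gradient ascent on the pixels of the input image to maximize the filter’s mean output.&lt;/p&gt;

```python
import numpy as np

rng = np.random.default_rng(0)

# One fixed 3x3 "filter": a stand-in for the trained weights we never touch.
kernel = rng.standard_normal((3, 3))

def mean_activation(image):
    # Valid cross-correlation of the image with the kernel, averaged.
    h, w = image.shape
    out = np.zeros((h - 2, w - 2))
    for i in range(h - 2):
        for j in range(w - 2):
            out[i, j] = np.sum(image[i:i + 3, j:j + 3] * kernel)
    return out.mean()

def grad_wrt_image(image):
    # Gradient of the mean activation w.r.t. the image. Because the loss
    # is linear in the pixels for this toy filter, it just accumulates
    # the kernel over every window position.
    h, w = image.shape
    grad = np.zeros_like(image)
    n = (h - 2) * (w - 2)
    for i in range(h - 2):
        for j in range(w - 2):
            grad[i:i + 3, j:j + 3] += kernel / n
    return grad

# Start from noise centered on mid-gray, then do gradient ASCENT on the
# input image itself; the kernel (the "network") stays constant.
image = rng.uniform(0.4, 0.6, (16, 16))
before = mean_activation(image)
for _ in range(50):
    image += 0.5 * grad_wrt_image(image)
after = mean_activation(image)
```

&lt;p&gt;After the loop, the image’s mean activation is strictly higher than where it started: the pixels, not the weights, did the learning.&lt;/p&gt;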

&lt;p&gt;Suddenly, we have &lt;strong&gt;shifted perspectives&lt;/strong&gt;. We’re no longer training a model to predict an image’s label. Rather, we’re now kind of fitting the image to the model, to make it generate whatever output we want.&lt;/p&gt;

&lt;p&gt;In a way, it’s like we’re asking the model “See this filter? What kind of images turn it on?”.&lt;/p&gt;

&lt;p&gt;If our Network has been properly trained, then we expect most filters to carry interesting, valuable information that helps the model make accurate predictions for its classification task. We expect a filter’s activation to carry &lt;strong&gt;semantic meaning&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;It stands to reason then, that an image that “activates” a filter, making it have a large output, should have features that resemble those of one of the objects present in the Dataset (and among the model’s labels).&lt;/p&gt;

&lt;p&gt;However, given that convolutions are a &lt;strong&gt;local transformation&lt;/strong&gt;, it is common to see the patterns that trigger that convolutional filter repeatedly “sprout” in many different areas of our image.&lt;/p&gt;

&lt;p&gt;This process generates the kind of picture &lt;a href="https://deepdreamgenerator.com/#gallery" rel="noopener noreferrer"&gt;Google’s Deep Dream&lt;/a&gt; model made popular.&lt;/p&gt;

&lt;p&gt;In this tutorial, we will use &lt;strong&gt;TensorFlow’s Keras&lt;/strong&gt; code to &lt;strong&gt;generate images&lt;/strong&gt; that maximize a given filter’s output (namely, the average of the filter’s outputs, since the output is technically a matrix).&lt;/p&gt;

&lt;h2&gt;
  
  
  Implementing Filter Visualization
&lt;/h2&gt;

&lt;p&gt;As I mentioned before, for this to work we would need to first &lt;strong&gt;train a Neural Network classifier&lt;/strong&gt;. Luckily, we don’t need to go through that whole messy and costly process: &lt;strong&gt;Keras&lt;/strong&gt; already comes with a whole suite of pre-trained Neural Networks we can just download and use.&lt;/p&gt;

&lt;h3&gt;
  
  
  Using a Pre-trained Neural Network
&lt;/h3&gt;

&lt;p&gt;For this article, we will use VGG16, a huge Convolutional Neural Network trained on the same ImageNet competition Dataset. Remember how I mentioned AlexNet won with an 85% accuracy and disrupted the Image Classification field? VGG16 scored 92% on that same task.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;VGG16 is a convolutional neural network model proposed by K. Simonyan and A. Zisserman from the University of Oxford in the paper “Very Deep Convolutional Networks for Large-Scale Image Recognition”. The model achieves 92.7% top-5 test accuracy in ImageNet, which is a dataset of over 14 million images belonging to 1000 classes. It was one of the famous model submitted to &lt;a href="http://www.image-net.org/challenges/LSVRC/2014/results" rel="noopener noreferrer"&gt;ILSVRC-2014&lt;/a&gt;. It makes the improvement over AlexNet by replacing large kernel-sized filters (11 and 5 in the first and second convolutional layer, respectively) with multiple 3×3 kernel-sized filters one after another. VGG16 &lt;strong&gt;was trained for weeks and was using NVIDIA Titan Black GPU’s&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;cite&gt;&lt;a href="https://neurohive.io/en/popular-networks/vgg16/" rel="noopener noreferrer"&gt;https://neurohive.io/en/popular-networks/vgg16/&lt;/a&gt; — VGG16 – Convolutional Network for Classification and Detection (emphasis mine)&lt;/cite&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;For these experiments I will be using &lt;a href="https://colab.research.google.com/" rel="noopener noreferrer"&gt;Google Colab’s&lt;/a&gt; GPU machines, tweaking the Keras library’s &lt;a href="https://github.com/keras-team/keras/blob/master/examples/conv_filter_visualization.py" rel="noopener noreferrer"&gt;Filter Visualization&lt;/a&gt; example code.&lt;/p&gt;

&lt;p&gt;For a breakdown of how the original script works, see the &lt;a href="https://blog.keras.io/how-convolutional-neural-networks-see-the-world.html" rel="noopener noreferrer"&gt;Keras Blog&lt;/a&gt;. I only made slight changes to it, to easily configure file names and other minor details, so I don’t think it’s worth linking to my own notebook.&lt;/p&gt;

&lt;p&gt;Here is what the important function does:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Define a loss function that’s equal to the chosen &lt;strong&gt;filter’s mean output&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;Initialize a small &lt;strong&gt;starting picture&lt;/strong&gt;, typically with &lt;strong&gt;random uniform noise&lt;/strong&gt; centered around RGB(128,128,128) (I actually played around with this a bit, and will expand on it later).&lt;/li&gt;
&lt;li&gt;Compute the &lt;strong&gt;gradient of the input picture&lt;/strong&gt; with regard to this loss, and perform gradient ascent.&lt;/li&gt;
&lt;li&gt;Repeat N times, then resize the picture to make it slightly bigger (default value was 20%). We start with a small picture and make it increasingly bigger as we generate the filter’s maximizing image, because otherwise the algorithm tends to create a small pattern that repeats many times, instead of making a lower-frequency pattern with bigger (and, subjectively, more aesthetically pleasing) shapes.&lt;/li&gt;
&lt;li&gt;Repeat the last two steps until reaching the desired resolution.&lt;/li&gt;
&lt;/ul&gt;
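&lt;p&gt;The steps above can be sketched end to end. The following is a self-contained toy (NumPy only, with a hypothetical linear 3×3 filter instead of a real VGG16 layer, and nearest-neighbor resizing instead of proper interpolation), so it shows the shape of the loop rather than the exact Keras script:&lt;/p&gt;

```python
import numpy as np

rng = np.random.default_rng(1)
kernel = rng.standard_normal((3, 3))  # stand-in for one trained filter

def ascent_step(image, lr=0.5):
    # One gradient-ascent step on the image. The loss (the filter's mean
    # output) is linear in the pixels here, so the gradient is just the
    # kernel accumulated over every window position.
    h, w = image.shape
    grad = np.zeros_like(image)
    n = (h - 2) * (w - 2)
    for i in range(h - 2):
        for j in range(w - 2):
            grad[i:i + 3, j:j + 3] += kernel / n
    return image + lr * grad

def upscale(image, factor=1.2):
    # Nearest-neighbor resize, standing in for a proper interpolation.
    h, w = image.shape
    nh, nw = int(h * factor), int(w * factor)
    return image[np.ix_(np.arange(nh) * h // nh, np.arange(nw) * w // nw)]

# Small start: uniform noise centered on mid-gray.
image = rng.uniform(0.4, 0.6, (32, 32))
for _ in range(5):             # repeat until the desired resolution
    for _ in range(20):        # N ascent steps at the current size
        image = ascent_step(image)
    image = upscale(image)     # then grow the canvas by ~20%
```

&lt;p&gt;Swapping the toy &lt;code&gt;ascent_step&lt;/code&gt; for a real gradient computed through a trained network, and the nearest-neighbor resize for a proper zoom, gives essentially the structure of the Keras example.&lt;/p&gt;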

&lt;p&gt;That’s pretty much it. The code I linked to has a few more things happening (image normalization, and stitching together many filters’ generated images into a cute collage) but that is the most important bit.&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;


&lt;p&gt;Here’s the code for that function. Not so scary now that you know what’s going on, right?&lt;/p&gt;

&lt;p&gt;Now for the fun part, let’s try this out and see which kinds of filters come out.&lt;/p&gt;

&lt;h2&gt;
  
  
  My Results Trying out Feature Visualization
&lt;/h2&gt;

&lt;p&gt;I read &lt;a href="https://distill.pub/2017/feature-visualization/" rel="noopener noreferrer"&gt;many&lt;/a&gt; &lt;a href="https://towardsdatascience.com/how-to-visualize-convolutional-features-in-40-lines-of-code-70b7d87b0030" rel="noopener noreferrer"&gt;different&lt;/a&gt; &lt;a href="https://arxiv.org/pdf/1311.2901.pdf" rel="noopener noreferrer"&gt;examples&lt;/a&gt; of Feature Visualization articles before giving it a shot. Here are some of the things I learned.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The first Convolutional Layers (the ones closer to the inputs) generate simpler visuals. They’re usually just rough textures like parallel wavy lines, or multicolored circles.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://www.datastuff.tech/wp-content/uploads/2020/05/blck2_conv1.png" rel="noopener noreferrer"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fwww.datastuff.tech%2Fwp-content%2Fuploads%2F2020%2F05%2Fblck2_conv1-1024x1024.png" alt="Visualization of Convolutional Filter on VGG16, second layer."&gt;&lt;/a&gt;Visualization of Convolutional Filter on VGG16, second layer.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The Convolutional layers closer to the outputs generate more intricate textures and patterns. Some even resemble objects that exist, or sorta look like they may exist (in a very uncanny-valley kind of way).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This is also where I had the most fun, to be honest. I tried out many different “starting images”, from random noise to uniform grey to a progressive gradient.&lt;/p&gt;

&lt;p&gt;The results for any given filter all came out pretty similar. This makes me think that, given the number of iterations I used, the starting image itself became pretty irrelevant. At the very least, it did not have a predictable impact on the results.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.datastuff.tech/wp-content/uploads/2020/05/block4_conv1.png" rel="noopener noreferrer"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fwww.datastuff.tech%2Fwp-content%2Fuploads%2F2020%2F05%2Fblock4_conv1-1024x1024.png"&gt;&lt;/a&gt;Feature Visualization for Block 4, filters in first convolutional layer of VGG16. Most of the patterns look regular and granular, but a lot more complicated than the early, rustic textures we saw on the first layers.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.datastuff.tech/wp-content/uploads/2020/05/block4_conv2.png" rel="noopener noreferrer"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fwww.datastuff.tech%2Fwp-content%2Fuploads%2F2020%2F05%2Fblock4_conv2-1024x1024.png"&gt;&lt;/a&gt;Filter Visualization for Block 4, filters in second convolutional layer of VGG16. Note how the patterns are very repetitive, but generate textures that look a lot more sophisticated than in the first layers.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.datastuff.tech/wp-content/uploads/2020/05/block4_conv3.png" rel="noopener noreferrer"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fwww.datastuff.tech%2Fwp-content%2Fuploads%2F2020%2F05%2Fblock4_conv3-1024x1024.png" alt="Filter Visualization for Block 4, filters in third convolutional layer of VGG16"&gt;&lt;/a&gt;Filter Visualization for Block 4, filters in third convolutional layer of VGG16. Some more porous patterns seem to emerge.&lt;/p&gt;

&lt;p&gt;As we go deeper, and &lt;strong&gt;closer to the fully connected layers&lt;/strong&gt;, we reach the &lt;strong&gt;last Convolutional Layer&lt;/strong&gt;. The images it generates are the &lt;strong&gt;most intricate&lt;/strong&gt; by far, and the patterns they make often resemble real-life items.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.datastuff.tech/wp-content/uploads/2020/05/block5_conv1.png" rel="noopener noreferrer"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fwww.datastuff.tech%2Fwp-content%2Fuploads%2F2020%2F05%2Fblock5_conv1-1024x1024.png" alt="Filter Visualization for Block 5, filters in first convolutional layer of VGG16"&gt;&lt;/a&gt;Filter Visualization for Block 5, filters in first convolutional layer of VGG16&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.datastuff.tech/wp-content/uploads/2020/05/block5_conv3.png" rel="noopener noreferrer"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fwww.datastuff.tech%2Fwp-content%2Fuploads%2F2020%2F05%2Fblock5_conv3-1024x1024.png" alt="Filter Visualization for Block 5, filters in third convolutional layer of VGG16"&gt;&lt;/a&gt;Filter Visualization for Block 5, filters in third convolutional layer of VGG16&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.datastuff.tech/wp-content/uploads/2020/05/block5_conv2.png" rel="noopener noreferrer"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fwww.datastuff.tech%2Fwp-content%2Fuploads%2F2020%2F05%2Fblock5_conv2-1024x1024.png" alt="Filter Visualization for Block 5, filters in third convolutional layer of VGG16"&gt;&lt;/a&gt;Block 5, filters in second Convolutional Layer. Isn’t it just crazy that all of these patterns emerge just from maximizing a “simple” (albeit hyperdimensional) mathematical function?&lt;/p&gt;

&lt;p&gt;Now, looking into these images in search of patterns, it is easy to feel like one is falling into apophenia. However, I think we can all agree some of those images have features that &lt;strong&gt;really look like&lt;/strong&gt; … you can zoom in and complete that sentence on your own. Feature Visualization is the new gazing at clouds.&lt;/p&gt;

&lt;p&gt;My own guess is it’s just a new kind of abstract art.&lt;/p&gt;

&lt;p&gt;Let me show you some of the filters I found most visually interesting:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.datastuff.tech/wp-content/uploads/2020/05/block4_conv2_gradual_37_0.png" rel="noopener noreferrer"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fwww.datastuff.tech%2Fwp-content%2Fuploads%2F2020%2F05%2Fblock4_conv2_gradual_37_0.png" alt="block4, second convolutional layer filter"&gt;&lt;/a&gt;The texture kind of reminds me of an Orange peel&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.datastuff.tech/wp-content/uploads/2020/05/block4_conv2_gradual_20_0.png" rel="noopener noreferrer"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fwww.datastuff.tech%2Fwp-content%2Fuploads%2F2020%2F05%2Fblock4_conv2_gradual_20_0.png" alt="Filter Visualization Convolutional Neural Network vgg16"&gt;&lt;/a&gt;This one looks like clouds or cotton.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.datastuff.tech/wp-content/uploads/2020/05/block5_conv2_gradual_40_0-1.png" rel="noopener noreferrer"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fwww.datastuff.tech%2Fwp-content%2Fuploads%2F2020%2F05%2Fblock5_conv2_gradual_40_0-1.png" alt="Filter Vsualization convolutional neural network looks like spiral"&gt;&lt;/a&gt;This one looks like spirals infested with fungi (great name for a band!)&lt;/p&gt;

&lt;p&gt;&lt;a href="https://www.datastuff.tech/wp-content/uploads/2020/05/block5_conv2_gradual_47_0-1.png" rel="noopener noreferrer"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fwww.datastuff.tech%2Fwp-content%2Fuploads%2F2020%2F05%2Fblock5_conv2_gradual_47_0-1.png"&gt;&lt;/a&gt;This one is just too crazy, so I’ll wrap this up with it.&lt;/p&gt;

&lt;p&gt;I have about 240 more of these; if there’s enough interest I can make a gallery out of them, but I fear it may get repetitive after a while.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Finally, if you use the classification layer’s cells to generate an image, it will usually &lt;a href="https://blog.keras.io/how-convolutional-neural-networks-see-the-world.html" rel="noopener noreferrer"&gt;come out wrong&lt;/a&gt;: greyish and ugly. I didn’t even try this out, since the results weren’t that interesting. It’s good to keep in mind, however, especially when you read headlines about AI taking over soon or similar unwarranted panic.&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;To be honest, I had a lot of fun with this project. I hadn’t really heard about Google Colab until a couple weeks ago (thanks to &lt;a href="https://www.reddit.com/r/MediaSynthesis/" rel="noopener noreferrer"&gt;r/mediaSynthesis&lt;/a&gt;). It feels great to be able to use a good GPU machine for free.&lt;/p&gt;

&lt;p&gt;I’d also read most of the papers on this subject a couple years ago, then never got around to actually testing the code or doing an article like this. I’m glad I finally scratched it out of my list (or Trello, who am I kidding?).&lt;/p&gt;

&lt;p&gt;Finally, in the future I’d like to try out different network architectures and visualize how the images morph at every iteration, instead of simply looking at the finished product.&lt;/p&gt;

&lt;p&gt;Please let me know in the comments which other experiments or bibliography could be worth checking to keep expanding on this subject!&lt;/p&gt;

&lt;p&gt;&lt;em&gt;If you liked this article, please consider tweeting it or sharing anywhere else!&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Follow me on &lt;a href="http://www.twitter.com/strikingloo" rel="noopener noreferrer"&gt;Twitter&lt;/a&gt; to discuss any of this further or keep up to date with my latest articles.&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The post &lt;a href="https://www.datastuff.tech/machine-learning/feature-visualization-convolutional-neural-networks-keras/" rel="noopener noreferrer"&gt;Feature Visualization on Convolutional Neural Networks (Keras)&lt;/a&gt; appeared first on &lt;a href="https://www.datastuff.tech" rel="noopener noreferrer"&gt;Data Stuff&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>python</category>
      <category>deeplearning</category>
      <category>tensorflow</category>
    </item>
    <item>
      <title>3 Programming Books to Read During Lockdown</title>
      <dc:creator>Luciano Strika</dc:creator>
      <pubDate>Mon, 20 Apr 2020 04:40:49 +0000</pubDate>
      <link>https://dev.to/strikingloo/3-programming-books-for-beginners-to-read-during-lockdown-1fln</link>
      <guid>https://dev.to/strikingloo/3-programming-books-for-beginners-to-read-during-lockdown-1fln</guid>
      <description>&lt;p&gt;Be it an O’Reilly book, or some of the Computer Science classics, many &lt;strong&gt;programming books&lt;/strong&gt; can help you level up in your &lt;strong&gt;career as a Developer&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;This can be especially important when you are &lt;strong&gt;getting started&lt;/strong&gt; in Software Development, or in a &lt;strong&gt;programming language&lt;/strong&gt; like Python.&lt;/p&gt;

&lt;p&gt;These last months have been quite heavy and stressful for many of us, what with the Apocalypse taking place and all that.&lt;/p&gt;

&lt;p&gt;So why not take advantage of the situation, and use our newfound free time to double down on our studies and read some programming books?&lt;/p&gt;

&lt;p&gt;This may be the time to be twice as productive. As &lt;a href="http://paulgraham.com/gh.html?viewfullsite=1"&gt;Paul Graham&lt;/a&gt; said:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“If it is possible to make yourself into a great hacker, the way to do it may be to make the following deal with yourself: you never have to work on boring projects (…), and in return, you’ll never allow yourself to do a half-assed job.”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Without further ado, let’s see book number 1.&lt;/p&gt;

&lt;h2&gt;
  
  
  Automate the Boring Stuff with Python
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://www.amazon.com/Automate-Boring-Stuff-Python-2nd-ebook/dp/B07VSXS4NK/ref=as_li_ss_il?dchild=1&amp;amp;keywords=Automate+the+Boring+Stuff+with+Python&amp;amp;qid=1587355901&amp;amp;sr=8-1&amp;amp;linkCode=li3&amp;amp;tag=strikingloo-20&amp;amp;linkId=f895004eb4c2d8493cba066d43e4b256"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--U98nZ0MT--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/http://ws-na.amazon-adsystem.com/widgets/q%3F_encoding%3DUTF8%26ASIN%3DB07VSXS4NK%26Format%3D_SL250_%26ID%3DAsinImage%26MarketPlace%3DUS%26ServiceVersion%3D20070822%26WS%3D1%26tag%3Dstrikingloo-20" alt=""&gt;&lt;/a&gt; &lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--evwjMe_K--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://ir-na.amazon-adsystem.com/e/ir%3Ft%3Dstrikingloo-20%26l%3Dli3%26o%3D1%26a%3DB07VSXS4NK" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--evwjMe_K--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://ir-na.amazon-adsystem.com/e/ir%3Ft%3Dstrikingloo-20%26l%3Dli3%26o%3D1%26a%3DB07VSXS4NK" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If you’re new to programming, there’s this very early stage when you’re still realizing the &lt;strong&gt;huge potential software can have&lt;/strong&gt;, especially when applied to &lt;strong&gt;automation&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;There’s a big difference between performing a simple task manually, and doing it a thousand times faster with a script.&lt;/p&gt;

&lt;p&gt;There’s also a case to be made that &lt;strong&gt;Python is the best language to get started with&lt;/strong&gt; in Software Development, since its syntax and environments are less daunting than those of C or Java. This way, you can spend less time on setup, freeing you up to focus on what’s important: solving actual problems.&lt;/p&gt;

&lt;p&gt;I think &lt;em&gt;&lt;a href="https://amzn.to/2yoYd2y"&gt;Automate the Boring Stuff&lt;/a&gt;&lt;/em&gt; really sets itself apart from other programming books in this area: showing you from the get-go which typical day-to-day problems you can solve with Python scripts, or with code, really.&lt;/p&gt;

&lt;p&gt;From basic program flow and logic to more advanced tasks like Web Scraping, this book walks with you &lt;strong&gt;all the way from beginner to proficient&lt;/strong&gt;, without holding your hand too much.&lt;/p&gt;

&lt;p&gt;My favorite project from that book is the one in the chapter &lt;strong&gt;&lt;em&gt;Handle the Clipboard Content&lt;/em&gt;&lt;/strong&gt;, which teaches you how to copy and paste text programmatically, eventually building a &lt;em&gt;super-clipboard&lt;/em&gt; which stores more than one text.&lt;/p&gt;

&lt;p&gt;I have a personal attachment to this book, as I used it to learn Python when I was still in high school, deciding on whether to study Computer Science or not.&lt;/p&gt;

&lt;p&gt;If you work in an office and you’re thinking of &lt;strong&gt;pivoting into programming&lt;/strong&gt;, this book is for you.&lt;/p&gt;

&lt;p&gt;Here’s a link to &lt;em&gt;&lt;a href="https://amzn.to/2yoYd2y"&gt;Automate the Boring Stuff&lt;/a&gt;&lt;/em&gt; on Amazon.&lt;/p&gt;

&lt;h2&gt;
  
  
  Introduction to Algorithms (Cormen)
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://www.amazon.com/Introduction-Algorithms-Press-Thomas-Cormen-ebook/dp/B007CNRCAO/ref=as_li_ss_il?dchild=1&amp;amp;keywords=cormen&amp;amp;qid=1587356486&amp;amp;sr=8-1&amp;amp;linkCode=li3&amp;amp;tag=strikingloo-20&amp;amp;linkId=09fe74bb316b42a1559a9377aabc1f17"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--uMBLqY1Y--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/http://ws-na.amazon-adsystem.com/widgets/q%3F_encoding%3DUTF8%26ASIN%3DB007CNRCAO%26Format%3D_SL250_%26ID%3DAsinImage%26MarketPlace%3DUS%26ServiceVersion%3D20070822%26WS%3D1%26tag%3Dstrikingloo-20" alt=""&gt;&lt;/a&gt; &lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--1QGnt77B--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://ir-na.amazon-adsystem.com/e/ir%3Ft%3Dstrikingloo-20%26l%3Dli3%26o%3D1%26a%3DB007CNRCAO" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--1QGnt77B--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://ir-na.amazon-adsystem.com/e/ir%3Ft%3Dstrikingloo-20%26l%3Dli3%26o%3D1%26a%3DB007CNRCAO" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;To every Computer Science student, &lt;em&gt;&lt;a href="https://amzn.to/2VktSLi"&gt;Cormen et al.’s Introduction to Algorithms&lt;/a&gt;&lt;/em&gt; is our bible.&lt;/p&gt;

&lt;p&gt;This book has been &lt;strong&gt;sitting on my shelf for years&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;It has helped me &lt;strong&gt;prepare for many exams&lt;/strong&gt;, and it’s what I reach for whenever I need to &lt;strong&gt;brush up on Data Structures&lt;/strong&gt; before an interview.&lt;/p&gt;

&lt;p&gt;Especially if you’re planning to get into Software Development without getting a college degree, this book is a definite must-read.&lt;/p&gt;

&lt;p&gt;This Computer Science book is the most comprehensive study of basic &lt;strong&gt;Data Structures and Algorithms&lt;/strong&gt; you will find.&lt;/p&gt;

&lt;p&gt;It covers:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Algorithmic Complexity&lt;/strong&gt; (with the best explanation of Big-O notation I’ve seen so far).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Sorting Algorithms&lt;/strong&gt; (&lt;em&gt;many&lt;/em&gt; sorting algorithms).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Graphs&lt;/strong&gt; and Graph-related Algorithms (especially Binary trees).&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Hash tables&lt;/strong&gt; and hashing algorithms.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Dynamic Programming&lt;/strong&gt;, Greedy Algorithms, &lt;strong&gt;Divide-and-Conquer&lt;/strong&gt; Algorithms.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These topics and many others are explained in understandable terms, but with &lt;strong&gt;mathematical rigor and correctness&lt;/strong&gt;. Not only that, but they often come up both in &lt;strong&gt;day-to-day work&lt;/strong&gt; and in &lt;strong&gt;interview problems&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Keep in mind this is a university-level book, packed full of formal proofs and mathematical notation.&lt;/p&gt;

&lt;p&gt;Even so, I think most developers will agree it is &lt;strong&gt;generally entertaining to read&lt;/strong&gt; (&lt;em&gt;if you don’t find Data Structures fun, make sure you’re picking the right career!&lt;/em&gt;), and explains most concepts really clearly and succinctly.&lt;/p&gt;

&lt;p&gt;If you need to learn how a hash table works, or want to be able to build a binary search tree from scratch, or just need a quick brush up on sorting algorithms before an interview, this is the book for you.&lt;/p&gt;

&lt;p&gt;As before, here’s a link to &lt;em&gt;&lt;a href="https://amzn.to/2VktSLi"&gt;Cormen et al.’s Introduction to Algorithms&lt;/a&gt;&lt;/em&gt; on Amazon.&lt;/p&gt;

&lt;p&gt;And, speaking of interviews…&lt;/p&gt;

&lt;h2&gt;
  
  
  Cracking the Coding Interview
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://www.amazon.com/Cracking-Coding-Interview-Programming-Questions/dp/0984782850/ref=as_li_ss_il?dchild=1&amp;amp;keywords=ctci&amp;amp;qid=1587356024&amp;amp;sr=8-1&amp;amp;linkCode=li3&amp;amp;tag=strikingloo-20&amp;amp;linkId=904a242e267cc92cfb582e366ee3e63a"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--lpYvl_VN--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/http://ws-na.amazon-adsystem.com/widgets/q%3F_encoding%3DUTF8%26ASIN%3D0984782850%26Format%3D_SL250_%26ID%3DAsinImage%26MarketPlace%3DUS%26ServiceVersion%3D20070822%26WS%3D1%26tag%3Dstrikingloo-20" alt=""&gt;&lt;/a&gt; &lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--_wsccG4b--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://ir-na.amazon-adsystem.com/e/ir%3Ft%3Dstrikingloo-20%26l%3Dli3%26o%3D1%26a%3D0984782850" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--_wsccG4b--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://ir-na.amazon-adsystem.com/e/ir%3Ft%3Dstrikingloo-20%26l%3Dli3%26o%3D1%26a%3D0984782850" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Ok, hear me out here. If you’re starting from scratch, I think &lt;a href="https://amzn.to/2yoYd2y"&gt;Automate the Boring Stuff&lt;/a&gt; is the most practical way to start learning Python and programming.&lt;/p&gt;

&lt;p&gt;And if you want to &lt;strong&gt;dive deeper&lt;/strong&gt; and learn more advanced or theoretical Computer Science concepts, like Algorithms and Data Structures, then &lt;em&gt;&lt;a href="https://amzn.to/2VktSLi"&gt;Cormen’s Introduction to Algorithms&lt;/a&gt;&lt;/em&gt; is the undisputed go-to book.&lt;/p&gt;

&lt;p&gt;However, when all is said and done, there is a craftsmanship you can only learn by doing and practicing.&lt;/p&gt;

&lt;p&gt;As Charles Darwin once said:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;“I have always maintained that, excepting fools, men did not differ much in intellect, only in zeal and hard work.”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;If that’s the stage where you feel you’re at, then the best thing you can do is &lt;strong&gt;practice a lot&lt;/strong&gt;, with &lt;strong&gt;many different problems&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;That’s exactly what &lt;em&gt;&lt;a href="https://amzn.to/3bluxBO"&gt;Cracking the Code Interview&lt;/a&gt;&lt;/em&gt; (CTCI, for friends) has to offer.&lt;/p&gt;

&lt;p&gt;Sure, the first chapter deals more with the “soft” aspects of a Software Interview (which, again, if you plan to apply for a SWE job eventually, then you should master those too!).&lt;/p&gt;

&lt;p&gt;But the rest of the book? Chapter after chapter of &lt;strong&gt;fun, challenging problems&lt;/strong&gt; taken straight out of Google’s, Microsoft’s or Facebook’s interview processes. And, they are divided into categories so you can practice one subject at a time.&lt;/p&gt;

&lt;p&gt;Feel like you need to polish your &lt;strong&gt;bit manipulation&lt;/strong&gt; skills? CTCI has a chapter for you.&lt;/p&gt;

&lt;p&gt;Want to practice thinking on your feet and deciding which &lt;strong&gt;Data Structures&lt;/strong&gt; fit each kind of problem setup? CTCI has you covered, too.&lt;/p&gt;

&lt;p&gt;I did feel my &lt;strong&gt;Software Interview skills&lt;/strong&gt; improved after reading CTCI and going through all its exercises. However, that’s definitely not the most important part. The most valuable thing I got from CTCI is practice: hands-on practice, solving many different problems through code.&lt;/p&gt;

&lt;p&gt;To get started, be sure to check out &lt;em&gt;&lt;a href="https://amzn.to/3bluxBO"&gt;Cracking the Code Interview&lt;/a&gt;&lt;/em&gt; on Amazon.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;So far, I’ve made recommendations for:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A very language-focused programming book for your first steps as a Developer.&lt;/li&gt;
&lt;li&gt;A more academic or broader book for the more theoretically-oriented readers.&lt;/li&gt;
&lt;li&gt;A last, very practical book with a lot of exercises for everyone, old and new to coding.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Each of these programming books has helped me learn a lot. Some have saved my skin on more than one occasion (or exam!).&lt;/p&gt;

&lt;p&gt;When I am preparing for an interview, or a tough exam, there are no other books I’d rather have &lt;em&gt;(though, if you read this far and are thinking ‘hey, he didn’t mention !’ this is your time to shine! Hit me up in the comments and I’ll make sure to add it to my reading list)&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;I hope at least some of these books will be as helpful to you or your programmer friends, too!&lt;/p&gt;

&lt;p&gt;Have you already read any of these books? Are you reading any of them? Let me know what you think of them in the comments!&lt;/p&gt;

&lt;p&gt;I’d love to know your opinion, both if you liked them or not. Especially if you can offer a recommendation for what you think is a better alternative!&lt;/p&gt;

&lt;p&gt;If you want to get into Data Science or Machine Learning, check out my older post &lt;em&gt;&lt;a href="http://www.datastuff.tech/data-science/3-machine-learning-books-that-helped-me-level-up-as-a-data-scientist/"&gt;Machine Learning Books to Level Up as a Data Scientist&lt;/a&gt;&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;We can also discuss these books on &lt;a href="https://twitter.com/strikingloo"&gt;Twitter&lt;/a&gt;, &lt;a href="https://medium.com/@strikingloo"&gt;Medium&lt;/a&gt; or &lt;a href="http://dev.to/strikingloo"&gt;dev.to&lt;/a&gt; if you’re interested.&lt;br&gt;&lt;br&gt;
I want to hear your opinions!&lt;/p&gt;

&lt;p&gt;&lt;em&gt;(Small disclaimer: all of these links are Amazon affiliate links. This means I get a small commission if you buy them. However, I’ll only review books I’ve actually read, and have genuinely recommended to people in real life.)&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;The post &lt;a href="http://www.datastuff.tech/programming/3-programming-books-for-beginners-to-read-during-lockdown/"&gt;3 Programming Books for Beginners to Read During Lockdown&lt;/a&gt; appeared first on &lt;a href="http://www.datastuff.tech"&gt;Data Stuff&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>python</category>
      <category>beginners</category>
      <category>bookreviews</category>
      <category>programming</category>
    </item>
    <item>
      <title>What is the one tip you would give to new bloggers out there?</title>
      <dc:creator>Luciano Strika</dc:creator>
      <pubDate>Sun, 10 Nov 2019 17:50:12 +0000</pubDate>
      <link>https://dev.to/strikingloo/what-is-the-one-tip-you-would-give-to-new-bloggers-out-there-33km</link>
      <guid>https://dev.to/strikingloo/what-is-the-one-tip-you-would-give-to-new-bloggers-out-there-33km</guid>
      <description>&lt;p&gt;I've been blogging for about a year now, and feel like I've learned a lot of things in this time, even though I'm nowhere near as experienced as those big-time guys I see on social networks.&lt;/p&gt;

&lt;p&gt;The few things I'd say to a newcomer are:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;p&gt;Make a plan and stick to it: posting consistently over a long period of time is better than making bursts of content every once in a while (I wish I had the discipline to follow this advice, it really pays off).&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Focus on promoting your content almost as much as you focus on writing it. At least if you really care about it reaching a big audience.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Don't spend too much time focusing on improving your site speed. As developers, I think we have a tendency to want to optimize every last inch of website performance. I know I do. And I've spent countless hours, enough to write like five good quality articles, just optimizing those last points in Google PageSpeed, or those few points in GTMetrix. My advice? Get a few plugins to do the job for you, get to like 90 or 95 pagespeed and then just focus on content. &lt;br&gt;
I wish I'd known that from the start.&lt;/p&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;By the way, if you're using WordPress, the PageSpeed Ninja plugin was a gamechanger for me. It's a lot better than Autoptimize, which is what Google suggested to me most of the time.&lt;/p&gt;

&lt;p&gt;So what about you? Bloggers of the tech world, lords of the web, which pearls of wisdom do you think every other blogger should receive?&lt;/p&gt;

</description>
      <category>discuss</category>
      <category>blogging</category>
      <category>webdev</category>
    </item>
    <item>
      <title>Markov Chains: Training AI to Write Game of Thrones</title>
      <dc:creator>Luciano Strika</dc:creator>
      <pubDate>Fri, 25 Oct 2019 05:29:17 +0000</pubDate>
      <link>https://dev.to/strikingloo/markov-chains-training-ai-to-write-game-of-thrones-25d6</link>
      <guid>https://dev.to/strikingloo/markov-chains-training-ai-to-write-game-of-thrones-25d6</guid>
      <description>&lt;p&gt;Markov chains have been around for a while now, and they are here to stay. From predictive keyboards to applications in trading and biology, they’ve proven to be versatile tools.&lt;/p&gt;

&lt;p&gt;Here are some Markov Chains industry applications:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Text Generation (you’re here for this).&lt;/li&gt;
&lt;li&gt;Financial modelling and forecasting (including trading algorithms).&lt;/li&gt;
&lt;li&gt;Logistics: modelling future deliveries or trips.&lt;/li&gt;
&lt;li&gt;Search Engines: PageRank can be seen as modelling a random internet surfer with a Markov Chain.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So far, we can tell this algorithm is useful, but what exactly are Markov Chains?&lt;/p&gt;

&lt;h2&gt;
  
  
  What are Markov Chains?
&lt;/h2&gt;

&lt;p&gt;A Markov Chain is a &lt;strong&gt;stochastic process&lt;/strong&gt; that models a finite &lt;strong&gt;set of states&lt;/strong&gt;, with fixed &lt;strong&gt;conditional probabilities of jumping&lt;/strong&gt; from a given state to another.&lt;/p&gt;

&lt;p&gt;What this means is, we will have an “agent” that randomly jumps around different states, with a certain probability of going from each state to another one.&lt;/p&gt;

&lt;p&gt;To show what a Markov Chain looks like, we can use a &lt;strong&gt;digraph&lt;/strong&gt;, where each node is a state (with a label or associated data), and the weight of the edge that goes from node &lt;em&gt;a&lt;/em&gt; to node &lt;em&gt;b&lt;/em&gt; is the &lt;strong&gt;probability of jumping from state &lt;em&gt;a&lt;/em&gt; to state &lt;em&gt;b&lt;/em&gt;&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Here’s an example, modelling the weather as a Markov Chain.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/http%3A%2F%2Fwww.datastuff.tech%2Fwp-content%2Fuploads%2F2019%2F10%2Fmarkovdiag.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/http%3A%2F%2Fwww.datastuff.tech%2Fwp-content%2Fuploads%2F2019%2F10%2Fmarkovdiag.png" alt="A diagram showing a Markov Chain as a weather model example."&gt;&lt;/a&gt;&lt;a href="http://techeffigytutorials.blogspot.com/2015/01/markov-chains-explained.html" rel="noreferrer noopener"&gt;Source&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We can express the probability of going from state &lt;em&gt;a&lt;/em&gt; to state &lt;em&gt;b&lt;/em&gt; as a &lt;strong&gt;matrix component&lt;/strong&gt;, where the whole &lt;strong&gt;matrix characterizes our Markov chain&lt;/strong&gt; process, corresponding to the &lt;strong&gt;digraph’s adjacency matrix&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/http%3A%2F%2Fwww.datastuff.tech%2Fwp-content%2Fuploads%2F2019%2F10%2Fmarkovmx.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/http%3A%2F%2Fwww.datastuff.tech%2Fwp-content%2Fuploads%2F2019%2F10%2Fmarkovmx.png" alt="The adjacency Matrix for the graph in the previous picture."&gt;&lt;/a&gt;&lt;a href="http://techeffigytutorials.blogspot.com/2015/01/markov-chains-explained.html" rel="noopener noreferrer"&gt;Source&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Then, if we represent the current state as a one-hot encoding, we can obtain the conditional probabilities for the next state’s values by taking the current state, and looking at its corresponding row.&lt;/p&gt;

&lt;p&gt;After that, if we repeatedly sample the discrete distribution described by the &lt;em&gt;n&lt;/em&gt;-th state’s row, we may model a succession of states of arbitrary length.&lt;/p&gt;
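&lt;p&gt;&lt;em&gt;To make that concrete, here is a quick sketch of sampling such a chain. The states and transition probabilities below are made up for illustration, not taken from the diagram above:&lt;/em&gt;&lt;/p&gt;

```python
import numpy as np

# Toy weather model: each row is the distribution over next states.
states = ["sunny", "cloudy", "rainy"]
P = np.array([
    [0.7, 0.2, 0.1],   # sunny  -> sunny / cloudy / rainy
    [0.3, 0.4, 0.3],   # cloudy -> ...
    [0.2, 0.4, 0.4],   # rainy  -> ...
])

def sample_chain(P, start, length, seed=None):
    """Repeatedly sample the next state from the current state's row."""
    rng = np.random.default_rng(seed)
    walk = [start]
    for _ in range(length - 1):
        walk.append(rng.choice(len(P), p=P[walk[-1]]))
    return walk

walk = sample_chain(P, start=0, length=10, seed=42)
print([states[s] for s in walk])
```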

&lt;h2&gt;
  
  
  Markov Chains for Text Generation
&lt;/h2&gt;

&lt;p&gt;In order to generate text with Markov Chains, we need to define a few things:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;What are our states going to be?&lt;/li&gt;
&lt;li&gt;What probabilities will we assign to jumping from each state to a different one?&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;We could do a character-based model for text generation, where we define our state as the last &lt;em&gt;n&lt;/em&gt; characters we’ve seen, and try to predict the next one.&lt;/p&gt;

&lt;p&gt;I’ve already gone in-depth on this for my &lt;a href="https://dev.to/strikingloo/lstm-how-to-train-neural-networks-to-write-like-lovecraft-2bbk"&gt;LSTM for Text Generation&lt;/a&gt; article, to mixed results.&lt;/p&gt;

&lt;p&gt;In this experiment, I will instead choose to use the previous &lt;em&gt;k&lt;/em&gt; words as my current state, and model the probabilities of the next token.&lt;/p&gt;

&lt;p&gt;In order to do this, I will simply create a vector for each distinct sequence of &lt;em&gt;k&lt;/em&gt; words, having N components, where N is the total quantity of distinct words in my corpus.&lt;/p&gt;

&lt;p&gt;I will then add 1 to the &lt;em&gt;j&lt;/em&gt;-th component of the &lt;em&gt;i&lt;/em&gt;-th vector, where &lt;em&gt;i&lt;/em&gt; is the index of the current &lt;em&gt;k&lt;/em&gt;-sequence of words, and &lt;em&gt;j&lt;/em&gt; is the index of the next word.&lt;/p&gt;

&lt;p&gt;If I normalize each word vector, I will then have a probability distribution for the next word, given the previous &lt;em&gt;k&lt;/em&gt; tokens.&lt;/p&gt;

&lt;p&gt;Confusing? Let’s see an example with a small corpus.&lt;/p&gt;

&lt;h3&gt;
  
  
  Training our chain: toy example
&lt;/h3&gt;

&lt;p&gt;Let’s imagine my corpus is the following sentence.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;This sentence has five words&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;We will first choose &lt;em&gt;k&lt;/em&gt;: the &lt;strong&gt;quantity of words our chain will consider&lt;/strong&gt; before &lt;strong&gt;sampling/predicting the next one&lt;/strong&gt;. For this example, let’s use k=1.&lt;/p&gt;

&lt;p&gt;Now, how many distinct sequences of 1 word does our sentence have? It has 5, one for each word. If it had duplicate words, they wouldn’t add to this number.&lt;/p&gt;

&lt;p&gt;We will first initialize a 5×5 matrix of zeroes.&lt;/p&gt;

&lt;p&gt;After that, we will add 1 to the column corresponding to ‘sentence’ on the row for ‘this’. Then another 1 on the row for ‘sentence’, on the column for ‘has’. We will continue this process until we’ve gone through the whole sentence.&lt;/p&gt;

&lt;p&gt;This would be the resulting matrix:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/http%3A%2F%2Fwww.datastuff.tech%2Fwp-content%2Fuploads%2F2019%2F10%2Fresulting.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/http%3A%2F%2Fwww.datastuff.tech%2Fwp-content%2Fuploads%2F2019%2F10%2Fresulting.png" alt="A diagonal matrix of 5x5."&gt;&lt;/a&gt;The diagonal pattern comes from the ordering of the words.&lt;/p&gt;

&lt;p&gt;Since each word only appears once, this model would simply generate the same sentence over and over, but you can see how adding more words could make this interesting.&lt;/p&gt;
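&lt;p&gt;&lt;em&gt;Here is that toy counting process as code (plain lists are enough at this size):&lt;/em&gt;&lt;/p&gt;

```python
corpus = "This sentence has five words"
tokens = corpus.split()

# Distinct words, in order of first appearance.
words = list(dict.fromkeys(tokens))
index = {w: i for i, w in enumerate(words)}

# k = 1: counts[i][j] counts how often word j follows word i.
counts = [[0] * len(words) for _ in words]
for prev, nxt in zip(tokens, tokens[1:]):
    counts[index[prev]][index[nxt]] += 1

for row in counts:
    print(row)
```

&lt;p&gt;&lt;em&gt;Since every word appears once and each is followed by the next, the 1s land just above the diagonal, matching the matrix in the picture.&lt;/em&gt;&lt;/p&gt;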

&lt;p&gt;I hope things are clearer now. Let’s jump to some code!&lt;/p&gt;

&lt;h2&gt;
  
  
  Coding our Markov Chain in Python
&lt;/h2&gt;

&lt;p&gt;Now for the fun part! We will train a Markov chain on the whole A Song of Ice and Fire corpus (Ha! You thought I was going to reference the show? Too bad, I’m a book guy!).&lt;/p&gt;

&lt;p&gt;We will then generate sentences with varying values for &lt;em&gt;k&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;For this experiment, I decided to treat anything between two spaces as a word or &lt;em&gt;token&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;Conventionally, in NLP we treat punctuation marks (like ‘,’ or ‘.’) as tokens as well. To account for this, I will first pad every punctuation mark with two spaces, so each mark becomes a token of its own.&lt;/p&gt;

&lt;p&gt;Here’s the code for that small preprocessing, plus loading the corpus:&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;
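&lt;p&gt;&lt;em&gt;A minimal sketch of that preprocessing, run on a toy string. The punctuation list is an assumption, and the real code reads the book files instead:&lt;/em&gt;&lt;/p&gt;

```python
def preprocess(text):
    """Pad punctuation with spaces so each mark becomes its own token,
    then split on whitespace (anything between spaces is a token)."""
    for mark in [",", ".", ";", ":", "!", "?"]:
        text = text.replace(mark, f" {mark} ")
    return text.split()

# Loading the corpus would look something like (file names assumed):
# corpus = " ".join(open(f, encoding="utf-8").read() for f in book_files)
tokens = preprocess("Winter is coming, my lord.")
print(tokens)  # ['Winter', 'is', 'coming', ',', 'my', 'lord', '.']
```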


&lt;p&gt;We will start training our Markov Chain right away, but first let’s look at our dataset:&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;
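&lt;p&gt;&lt;em&gt;A sketch of those dataset stats, computed here on a toy token list rather than the actual corpus:&lt;/em&gt;&lt;/p&gt;

```python
# Corpus size: total tokens vs. distinct words (toy input).
tokens = "the quick fox and the lazy dog and the fox".split()
distinct_words = set(tokens)

corpus_length = len(tokens)
vocabulary_size = len(distinct_words)
print(f"{corpus_length} tokens, {vocabulary_size} distinct words")
```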


&lt;p&gt;We have over 2 million tokens, representing over 32000 distinct words! That’s a pretty big corpus for a single writer.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;If only he could add 800k more…&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Training our chain
&lt;/h3&gt;

&lt;p&gt;Moving on, here’s how we initialize our “word after k-sequence” counts matrix for an arbitrary &lt;em&gt;k&lt;/em&gt; (in this case, 2).&lt;/p&gt;

&lt;p&gt;There are 2185918 words in the corpus, and 429582 different sequences of 2 words, each followed by one of 32663 words.&lt;/p&gt;

&lt;p&gt;That means only slightly over 0.015% of our matrix’s components will be non-zero.&lt;/p&gt;

&lt;p&gt;Because of that, I used scipy’s &lt;em&gt;dok_matrix&lt;/em&gt; (&lt;em&gt;dok&lt;/em&gt; stands for Dictionary of Keys), a sparse matrix implementation, since we know this dataset is going to be extremely sparse.&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;
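&lt;p&gt;&lt;em&gt;A sketch of that initialization on a toy corpus. Function and variable names are mine, not necessarily the repo’s:&lt;/em&gt;&lt;/p&gt;

```python
from scipy.sparse import dok_matrix

def train(tokens, k):
    """Count, for every distinct k-word sequence, how often each
    word follows it. Rows index k-sequences, columns index words."""
    words = list(set(tokens))
    word_idx = {w: i for i, w in enumerate(words)}

    seqs = [tuple(tokens[i:i + k]) for i in range(len(tokens) - k)]
    seq_idx = {s: i for i, s in enumerate(set(seqs))}

    # dok (Dictionary of Keys) only stores the non-zero entries.
    counts = dok_matrix((len(seq_idx), len(word_idx)))
    for i in range(k, len(tokens)):
        seq = tuple(tokens[i - k:i])
        counts[seq_idx[seq], word_idx[tokens[i]]] += 1
    return counts, seq_idx, word_idx

tokens = "the cat sat on the mat the cat ran".split()
counts, seq_idx, word_idx = train(tokens, k=2)
```

&lt;p&gt;&lt;em&gt;On the real corpus, the dense version of this matrix would need billions of cells; the dok version only pays for the handful of observed transitions.&lt;/em&gt;&lt;/p&gt;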


&lt;p&gt;After initializing our matrix, sampling it is pretty intuitive.&lt;/p&gt;

&lt;p&gt;Here’s the code for that:&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;
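&lt;p&gt;&lt;em&gt;A sketch of the sampling step, with a simple linear-scan weighted choice standing in for the original weighted_choice helper (the tiny hand-built chain at the bottom is just for demonstration):&lt;/em&gt;&lt;/p&gt;

```python
import random
from scipy.sparse import dok_matrix

def weighted_choice(items, weights):
    """Pick one item with probability proportional to its weight."""
    target = random.random() * sum(weights)
    cumulative = 0.0
    for item, weight in zip(items, weights):
        cumulative += weight
        if target < cumulative:
            return item
    return items[-1]

def sample_next(counts, seq, seq_idx, words, alpha=0.0):
    """Sample the word following the k-sequence `seq`. With probability
    alpha (the chain's 'creativity') pick a uniformly random word."""
    if random.random() < alpha or seq not in seq_idx:
        return random.choice(words)
    weights = counts.tocsr()[seq_idx[seq]].toarray().ravel()
    return weighted_choice(words, weights)

# Tiny hand-built chain: after ('the',) the only observed word is 'cat'.
words = ["the", "cat"]
seq_idx = {("the",): 0}
counts = dok_matrix((1, len(words)))
counts[0, words.index("cat")] = 1

print(sample_next(counts, ("the",), seq_idx, words))
```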


&lt;p&gt;There are two things that may have caught your attention here. The first is the &lt;em&gt;alpha&lt;/em&gt; hyperparameter.&lt;/p&gt;

&lt;p&gt;This is our chain’s &lt;em&gt;creativity&lt;/em&gt;: a (typically small, or zero) chance that it will pick a totally random word instead of the ones suggested by the corpus.&lt;/p&gt;

&lt;p&gt;If the number is high, then the next word’s distribution will approach uniformity. If zero or closer to it, then the distribution will more closely resemble that seen in the corpus.&lt;/p&gt;

&lt;p&gt;For all the examples I’ll show, I used an &lt;em&gt;alpha&lt;/em&gt; value of 0.&lt;/p&gt;

&lt;p&gt;The second thing is the weighted_choice function. I had to implement it since Python’s random package doesn’t support weighted choice over a list with more than 32 elements, let alone 32000.&lt;/p&gt;

&lt;h2&gt;
  
  
  Results: Generated Sentences
&lt;/h2&gt;

&lt;p&gt;First of all, as a baseline, I tried a deterministic approach: what happens if we pick a word, use k=1, and always jump to the most likely word after the current one?&lt;/p&gt;

&lt;p&gt;The results are underwhelming, to say the least.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;**I** am not have been a man , and the Wall . " " " " 
**he** was a man , and the Wall . " " " " " " " 
**she** had been a man , and the Wall . " " " " " "
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Since we’re being deterministic, ‘a’ is always followed by ‘man’, ‘the’ is always followed by ‘Wall’ (hehe) and so on.&lt;/p&gt;

&lt;p&gt;This means our sentences will be boring, predictable and kind of nonsensical.&lt;/p&gt;

&lt;p&gt;Now for some actual generation, I tried using a stochastic Markov Chain of 1 word, and a value of 0 for alpha.&lt;/p&gt;

&lt;h3&gt;
  
  
  1-word Markov Chain results
&lt;/h3&gt;

&lt;p&gt;Here are some of the resulting 15-word sentences, with the seed word in bold letters.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;' **the** Seven in front of whitefish in a huge blazes burning flesh . I had been' ' 
**a** squire , slain , they thought . " He bathed in his head . The' ' 
**Bran** said Melisandre had been in fear I’ve done . " It must needs you will' ' 
**Melisandre** would have feared he’d squired for something else I put his place of Ser Meryn' ' 
**Daenerys** is dead cat - TOOTH , AT THE GREAT , Asha , which fills our' ' 
**Daenerys** Targaryen after Melara had worn rich grey sheep to encircle Stannis . " The deep'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;As you can see, the resulting sentences are quite nonsensical, though a lot more interesting than the previous ones.&lt;/p&gt;

&lt;p&gt;Each individual pair of words makes some sense, but the whole sequence is pure non-sequitur.&lt;/p&gt;

&lt;p&gt;The model did learn some interesting things, like how Daenerys is usually followed by Targaryen, and ‘would have feared’ is a pretty good construction for only knowing the previous word.&lt;/p&gt;

&lt;p&gt;However, in general, I’d say this is nowhere near as good as it could be.&lt;/p&gt;

&lt;p&gt;When I increased the value of alpha for the single-word chain, the sentences I got turned out even more random.&lt;/p&gt;

&lt;h3&gt;
  
  
  Results with 2-word Markov chains
&lt;/h3&gt;

&lt;p&gt;The 2-word chain produced some more interesting sentences.&lt;/p&gt;

&lt;p&gt;Even though it too usually ends up sounding completely random, most of its output may actually fool you for a bit at the beginning &lt;em&gt;(emphasis mine)&lt;/em&gt;.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;' **the world**. _And Ramsay loved the feel of grass_ _welcomed them warmly_ , the axehead flew'' 
**Jon Snow**. _You are to strike at him_ . _The bold ones have had no sense_'' 
**Eddard Stark** had done his best to give her _the promise was broken_ . By tradition the'' 
**The game** of thrones , so you must tell her the next buyer who comes running ,'' 
**The game** trail brought her messages , strange spices . _The Frey stronghold was not large enough_'' 
**heard the** scream of fear . I want to undress properly . Shae was there , fettered'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The sentences maintain local coherence (&lt;em&gt;You are to strike at him&lt;/em&gt;, or &lt;em&gt;Ramsay loved the feel of grass&lt;/em&gt;), but then join very coherent word sequences into a total mess.&lt;/p&gt;

&lt;p&gt;Any sense of syntax, grammar or semantics is clearly absent.&lt;/p&gt;

&lt;p&gt;By the way, I didn’t cherry-pick those sentences at all, those are the first outputs I sampled.&lt;/p&gt;

&lt;p&gt;Feel free to &lt;a href="https://github.com/StrikingLoo/ASOIAF-Markov" rel="noopener noreferrer"&gt;play with the code yourself&lt;/a&gt;, and you can share the weirdest sentences you get in the comments!&lt;/p&gt;

&lt;p&gt;As a last experiment, let’s see what we get with a 3-word Markov Chain.&lt;/p&gt;

&lt;h3&gt;
  
  
  3-Word Chain Results
&lt;/h3&gt;

&lt;p&gt;Here are some of the sentences the model generated when trained with sequences of 3 words.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;' **I am a** master armorer , lords of Westeros , sawing out each bay and peninsula until the'' 
**Jon Snow is** with the Night’s Watch . I did not survive a broken hip , a leathern'' 
**Jon Snow is** with the Hound in the woods . He won’t do it . " Please don’t'' 
**Where are the** chains , and the Knight of Flowers to treat with you , Imp . "'' 
**Those were the** same . Arianne demurred . " So the fishwives say , " It was Tyrion’s'' 
**He thought that** would be good or bad for their escape . If they can truly give us'' 
**I thought that** she was like to remember a young crow he’d met briefly years before . "'
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Alright, I really liked some of those, especially the last one. It kinda sounds like a real sentence you could find in the books.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;Implementing a Markov Chain is a lot easier than it may sound, and training it on a real corpus was fun.&lt;/p&gt;

&lt;p&gt;The results were frankly better than I expected, though I may have set the bar too low after my little LSTM fiasco.&lt;/p&gt;

&lt;p&gt;In the future, I may try training this model with even longer chains, or a completely different corpus.&lt;/p&gt;

&lt;p&gt;In this case, trying a 5-word chain produced basically deterministic results again, since each 5-word sequence was almost always unique, so I did not consider chains of 5 words or more to be of interest.&lt;/p&gt;

&lt;p&gt;Which corpus do you think would generate more interesting results, especially for a longer chain? Let me know in the comments!&lt;/p&gt;

&lt;p&gt;&lt;em&gt;If you wish to learn even more about Markov Chains, consider checking &lt;a href="https://amzn.to/31IDAHp" rel="noopener noreferrer"&gt;this in-depth book&lt;/a&gt;. That’s an affiliate link, which means I get a small commission from it&lt;/em&gt;.&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>markovchain</category>
      <category>nlp</category>
      <category>python</category>
    </item>
    <item>
      <title>Coding MapReduce in C from Scratch using Threads: Map</title>
      <dc:creator>Luciano Strika</dc:creator>
      <pubDate>Sat, 19 Oct 2019 05:45:55 +0000</pubDate>
      <link>https://dev.to/strikingloo/coding-mapreduce-in-c-from-scratch-using-threads-map-5f7</link>
      <guid>https://dev.to/strikingloo/coding-mapreduce-in-c-from-scratch-using-threads-map-5f7</guid>
      <description>&lt;p&gt;Hadoop’s MapReduce is not just a Framework, it’s also a problem-solving philosophy.&lt;/p&gt;

&lt;p&gt;Borrowing from functional programming, the MapReduce team realized a lot of different problems could be divided into two common operations: &lt;strong&gt;map&lt;/strong&gt; and &lt;strong&gt;reduce&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Both &lt;strong&gt;mapping&lt;/strong&gt; and &lt;strong&gt;reducing&lt;/strong&gt; steps can be done &lt;strong&gt;in parallel&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;This meant as long as you could &lt;strong&gt;frame your problem&lt;/strong&gt; in that specific way, there would be a solution to it that could easily be run in parallel. This will usually result in a big &lt;strong&gt;performance boost&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;That all sounds good, and running things in parallel is usually a good thing, especially when working at scale. But, some of you in the back may be wondering, what are &lt;strong&gt;Map&lt;/strong&gt; and &lt;strong&gt;Reduce&lt;/strong&gt;?&lt;/p&gt;

&lt;h2&gt;
  
  
  What is &lt;em&gt;MapReduce&lt;/em&gt;?
&lt;/h2&gt;

&lt;p&gt;In order to understand the MapReduce framework, we need to understand its two basic operations: &lt;strong&gt;Map&lt;/strong&gt; and &lt;strong&gt;Reduce&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;They’re both higher-order functions, meaning they can take other functions as arguments.&lt;/p&gt;

&lt;p&gt;Specifically, when you need to convert a certain sequence of elements of type A into a result, or series of results of type B, you will:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Map&lt;/strong&gt; all your inputs to a different domain: that means you will &lt;strong&gt;transform each of them&lt;/strong&gt; with a chosen function, applying it to each element.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Group&lt;/strong&gt; the mapped elements by some criterion, usually a grouping &lt;strong&gt;key&lt;/strong&gt;.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Reduce&lt;/strong&gt; the mapped elements on each group with some other function. This function needs to take two arguments and return a single one of the same type, successively running an operation between an &lt;strong&gt;accumulator&lt;/strong&gt; and each value in our collection. It should be &lt;strong&gt;commutative and associative&lt;/strong&gt;, as parallel execution &lt;strong&gt;won’t guarantee any order&lt;/strong&gt; for the operations.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;To make this clearer, let’s see an example.&lt;/p&gt;

&lt;h3&gt;
  
  
  Example of a MapReduce solution
&lt;/h3&gt;

&lt;p&gt;Suppose you’re working for an e-commerce company, and they give you a log file of this form:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;John Surname bought 2 apples Alice Challice bought 3 bananas John Surname bought 5 pineapples
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;


&lt;p&gt;Then they ask you to tell them how many fruits each customer bought.&lt;/p&gt;

&lt;p&gt;In this case, after parsing this file to turn it into an actual format, like CSV, you could easily go through each line, and add the number of bought fruits on a dictionary under each name.&lt;/p&gt;

&lt;p&gt;You could even solve it with a bit of &lt;a href="http://www.datastuff.tech/programming/files-strings-shell-tutorial/"&gt;Bash scripting&lt;/a&gt;, or load the CSV on a &lt;a href="http://www.datastuff.tech/data-science/exploratory-data-analysis-with-pandas-and-jupyter-notebooks/"&gt;Pandas DataFrame&lt;/a&gt; and get some statistics.&lt;/p&gt;

&lt;p&gt;However, if the log file was a trillion lines long, bash scripting wouldn’t really cut it. Especially not if you’re not immortal.&lt;/p&gt;

&lt;p&gt;You would need to run this in parallel. Let me propose a MapReduce-y way of doing it:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Map&lt;/strong&gt; each line to a Pair of the form &amp;lt;Name, Quantity&amp;gt; by parsing each string.&lt;/li&gt;
&lt;li&gt;Group by Name.&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Reduce&lt;/strong&gt; by summing the quantities.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;If you’re familiar with SQL and relational databases, you may have thought of a similar solution. The query would look something like&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;select user, sum(bought_fruits)
from fruit_transactions
group by user;
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;
&lt;h2&gt;
  
  
  Why MapReduce scales
&lt;/h2&gt;

&lt;p&gt;Notice how &lt;strong&gt;the mapper doesn’t need to see the whole file&lt;/strong&gt;, just some of the lines. The &lt;strong&gt;reducer&lt;/strong&gt;, on the other hand, &lt;strong&gt;only needs the lines that have the same Name&lt;/strong&gt; (the ones that belong to the same group).&lt;/p&gt;

&lt;p&gt;You could do this with many different threads on the same computer, and then just join the results.&lt;/p&gt;

&lt;p&gt;Or, you could have many different processes running the map jobs, and feeding their output to another set running the reducing job.&lt;/p&gt;

&lt;p&gt;If the log was big enough, you could even be running Mapper and Reducer processes on many different computers (say, on a cluster), and then joining their results in some data lake at the end.&lt;/p&gt;

&lt;p&gt;This kind of solution is very common in ETL jobs and other data-intensive applications, but I won’t delve any further into applications.&lt;/p&gt;

&lt;p&gt;If you wish to learn more about these kinds of scalable solutions, I recommend you check &lt;a href="https://amzn.to/33Dwh56"&gt;this O’Reilly book on designing applications at scale&lt;/a&gt;.&lt;/p&gt;
&lt;h2&gt;
  
  
  Programming MapReduce in C
&lt;/h2&gt;

&lt;p&gt;Now that you have an understanding of what MapReduce is, and why MapReduce scales, let’s cut to the chase.&lt;/p&gt;

&lt;p&gt;For this first article, we will program two different implementations of the &lt;em&gt;Map&lt;/em&gt; function.&lt;/p&gt;

&lt;p&gt;One of them will be &lt;strong&gt;single-threaded&lt;/strong&gt;, to introduce a few concepts and show a &lt;strong&gt;simple solution&lt;/strong&gt;. The other one will use the &lt;em&gt;pthread&lt;/em&gt; library to make an actually &lt;strong&gt;multi-threaded&lt;/strong&gt;, and &lt;strong&gt;much faster&lt;/strong&gt;, version of &lt;em&gt;Map&lt;/em&gt;. Finally, we will compare the two and run some benchmarks.&lt;/p&gt;

&lt;p&gt;As usual, all the code is available on &lt;a href="https://github.com/StrikingLoo/mapReduCe"&gt;this C GitHub project&lt;/a&gt;.&lt;/p&gt;
&lt;h3&gt;
  
  
  Single threaded implementation of &lt;em&gt;Map&lt;/em&gt; in C
&lt;/h3&gt;

&lt;p&gt;First of all, let’s remember what &lt;em&gt;Map&lt;/em&gt; does.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;The Map function receives a &lt;strong&gt;sequence&lt;/strong&gt; and a &lt;strong&gt;function&lt;/strong&gt;, and returns the result of &lt;strong&gt;applying that function to each element&lt;/strong&gt; in the sequence.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Since this is C, representing a sequence can be very straightforward: we can just use a pointer to whatever type we’re mapping over!&lt;/p&gt;

&lt;p&gt;However, there’s a catch. &lt;strong&gt;C is statically typed&lt;/strong&gt;, and we would like our Map function to be &lt;strong&gt;as generic as possible&lt;/strong&gt;. We want it to be able to map over a sequence of elements of any type (provided they all share a type. Let’s not get carried away here, boys).&lt;/p&gt;

&lt;p&gt;How do we solve this? There are probably a few different solutions to this problem. I chose the one that looked simplest, but feel free to pitch in with other ideas.&lt;/p&gt;

&lt;p&gt;We will use sequences of &lt;code&gt;void*&lt;/code&gt;, and cast everything to this type. This means every element will be represented as a pointer to some memory address, without specifying a type (or size).&lt;/p&gt;

&lt;p&gt;We will trust that whatever function we call on these sequence elements knows how to cast them to the right type before using them. We’re effectively delegating that problem away.&lt;/p&gt;

&lt;p&gt;A smaller problem we need to solve is sequence length. A pointer to void doesn’t carry the information of how many elements the sequence has. It only knows where it starts, not where it ends.&lt;/p&gt;

&lt;p&gt;We will solve this other problem by passing sequence length as a second argument. Knowing that, our &lt;em&gt;Map&lt;/em&gt; function becomes pretty straightforward.&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;
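&lt;p&gt;The embedded gist may not render here, so here is a minimal sketch of what that single-threaded &lt;em&gt;Map&lt;/em&gt; can look like (the exact signature in the linked repo may differ):&lt;/p&gt;

```c
#include <stdlib.h>

/* Apply f to each of the n elements of inputs, returning a freshly
   allocated sequence with the results. The caller knows the real
   types hiding behind each void*. */
void** map(void** inputs, size_t n, void* (*f)(void*)) {
    void** outputs = malloc(sizeof(void*) * n);
    for (size_t i = 0; i < n; i++)
        outputs[i] = f(inputs[i]);
    return outputs;
}

/* Example mapped function: doubles an int, heap-allocating the result. */
void* twice(void* input) {
    int* result = malloc(sizeof(int));
    *result = *(int*)input * 2;
    return result;
}
```

&lt;p&gt;Here &lt;code&gt;twice&lt;/code&gt; is just one function to map with; any &lt;code&gt;void* (*f)(void*)&lt;/code&gt; would do.&lt;/p&gt;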



&lt;p&gt;You see, the function receives a &lt;code&gt;void**&lt;/code&gt; to represent the sequence it will map over, and a &lt;code&gt;void* (*f)(void*)&lt;/code&gt; function that transforms elements of some generic type to another (or the same) one.&lt;/p&gt;

&lt;p&gt;After that, we can use our &lt;em&gt;Map&lt;/em&gt; function on any sequence. We only need to do some awkward wrapping and pointer arithmetic beforehand.&lt;/p&gt;

&lt;p&gt;Here’s an example, using a function that returns 1 for prime numbers and 0 for the others.&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;
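&lt;p&gt;Again, the gist may not render in this feed, so here is a hedged reconstruction of that example (names and types are assumptions, not necessarily the repo’s):&lt;/p&gt;

```c
#include <stdlib.h>

/* Single-threaded Map, repeated here so the snippet stands alone. */
void** map(void** inputs, size_t n, void* (*f)(void*)) {
    void** outputs = malloc(sizeof(void*) * n);
    for (size_t i = 0; i < n; i++)
        outputs[i] = f(inputs[i]);
    return outputs;
}

/* Returns a pointer to 1 for primes, and to 0 for everything else. */
void* is_prime(void* input) {
    long n = *(long*)input;
    long* result = malloc(sizeof(long));
    *result = (n >= 2);
    for (long d = 2; d * d <= n; d++)
        if (n % d == 0) { *result = 0; break; }
    return result;
}

/* The "awkward wrapping": turn an array of longs into an array of void*. */
void** box_longs(long* xs, size_t n) {
    void** boxed = malloc(sizeof(void*) * n);
    for (size_t i = 0; i < n; i++)
        boxed[i] = &xs[i];
    return boxed;
}
```

&lt;p&gt;The “awkward wrapping” is &lt;code&gt;box_longs&lt;/code&gt;: every element has to be addressed through a &lt;code&gt;void*&lt;/code&gt; before &lt;em&gt;Map&lt;/em&gt; can see it.&lt;/p&gt;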


&lt;p&gt;As expected, the resulting pointer points to a sequence of integers: 1 corresponds to prime numbers, 0 to composite ones.&lt;/p&gt;

&lt;p&gt;Now that we’ve gone through the single-threaded &lt;em&gt;Map&lt;/em&gt; function, let’s see how to make it run in parallel in C.&lt;/p&gt;

&lt;h2&gt;
  
  
  Multi-threaded Map function in C
&lt;/h2&gt;

&lt;p&gt;In order to use parallel execution in C, we can turn either to processes or to threads.&lt;/p&gt;

&lt;p&gt;For this project, we will be using threads, as they’re more lightweight and, in my opinion, their API is a bit more intuitive for this kind of tutorial.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;(If you want to add a benchmark using processes and forking, feel free to make a pull request!)&lt;/em&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  How to use threads in C
&lt;/h3&gt;

&lt;p&gt;The threads API in C is quite intuitive, if a bit obscure at first.&lt;/p&gt;

&lt;p&gt;To use them, we will have to &lt;code&gt;#include &amp;lt;pthread.h&amp;gt;&lt;/code&gt;. &lt;code&gt;Pthreads&lt;/code&gt;‘ man page explains their interface quite nicely. However, for this tutorial, all we will use is the &lt;code&gt;pthread_create&lt;/code&gt; function.&lt;/p&gt;

&lt;p&gt;&lt;code&gt;pthread_create&lt;/code&gt; takes four arguments:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A pointer to a &lt;code&gt;pthread_t&lt;/code&gt;: the actual thread.&lt;/li&gt;
&lt;li&gt;A pointer to a &lt;code&gt;pthread_attr_t&lt;/code&gt; with thread attributes. In this case, we will pass &lt;code&gt;NULL&lt;/code&gt; for the default configuration.&lt;/li&gt;
&lt;li&gt;The function we want the thread to run. Unlike a process, a thread will only run a function until it returns, rather than continuing the execution of arbitrary code. This function must take a single &lt;code&gt;void*&lt;/code&gt; argument and return another &lt;code&gt;void*&lt;/code&gt; value.&lt;/li&gt;
&lt;li&gt;The input of the aforementioned function. It must be cast to &lt;code&gt;void*&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;After calling &lt;code&gt;pthread_create&lt;/code&gt;, a parallel thread of execution will begin running the given function.&lt;/p&gt;

&lt;p&gt;Once we call &lt;code&gt;pthread_create&lt;/code&gt; for each of the chunks we wish to map, we will have to call &lt;code&gt;pthread_join&lt;/code&gt; on each of them, which makes the parent (original) thread &lt;strong&gt;wait&lt;/strong&gt; until all the threads it spun up &lt;strong&gt;finish running&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Otherwise, the program would end before the mapping was done.&lt;/p&gt;
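&lt;p&gt;As a warm-up, here is a minimal create-then-join sketch (the function names are mine; only &lt;code&gt;pthread_create&lt;/code&gt; and &lt;code&gt;pthread_join&lt;/code&gt; come from the library):&lt;/p&gt;

```c
#include <pthread.h>

/* A thread routine must take a single void* and return a void*.
   Here we smuggle a long through the pointer itself instead of the heap. */
static void* twice_in_thread(void* arg) {
    return (void*)((long)arg * 2);
}

/* Spawn one thread, wait for it, and read back its return value. */
long double_in_background(long x) {
    pthread_t thread;
    void* result;
    pthread_create(&thread, NULL, twice_in_thread, (void*)x); /* NULL = default attributes */
    pthread_join(thread, &result);                            /* block until it returns   */
    return (long)result;
}
```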

&lt;p&gt;Now, let’s feast our eyes on some code.&lt;/p&gt;

&lt;h3&gt;
  
  
  Using pthread for Parallel MapReduce in C
&lt;/h3&gt;

&lt;p&gt;To code MapReduce’s &lt;em&gt;Map&lt;/em&gt; function in C, the first thing we are going to do is define a &lt;code&gt;struct&lt;/code&gt; that can store the generic inputs and outputs for it, as well as the function we will be mapping with.&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;
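&lt;p&gt;In case the gist doesn’t render here, this is a plausible shape for that &lt;code&gt;struct&lt;/code&gt; (field names assumed):&lt;/p&gt;

```c
#include <stddef.h>

/* Everything one thread needs to map over its slice of the data. */
typedef struct {
    void** inputs;      /* the whole input sequence               */
    void** outputs;     /* where to store the mapped results      */
    void* (*f)(void*);  /* the function we are mapping with       */
    size_t start;       /* first index of this thread's slice     */
    size_t end;         /* one past the last index of the slice   */
} map_argument;
```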


&lt;p&gt;Since parallel execution requires some manner of slicing and &lt;strong&gt;partitioning&lt;/strong&gt; , we will store that logic inside this structure as well, using two different &lt;strong&gt;indices&lt;/strong&gt; for the start and end of our slice.&lt;/p&gt;

&lt;p&gt;Next, we will code the function that actually does the mapping: it will cycle through the inputs from &lt;code&gt;start&lt;/code&gt; to &lt;code&gt;end&lt;/code&gt;, storing the result of applying the mapped function to each input in the outputs’ pointer.&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;
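&lt;p&gt;A sketch of that mapping function, under the same assumed struct (the repo’s version may differ in detail):&lt;/p&gt;

```c
#include <stddef.h>

/* Repeated here so the snippet stands alone. */
typedef struct {
    void** inputs;
    void** outputs;
    void* (*f)(void*);
    size_t start;
    size_t end;
} map_argument;

/* Thread body: apply f to every input in [start, end), storing results. */
void* mapper_thread(void* raw) {
    map_argument* arg = (map_argument*)raw;
    for (size_t i = arg->start; i < arg->end; i++)
        arg->outputs[i] = arg->f(arg->inputs[i]);
    return NULL;
}

/* Example mapped function: doubles a long smuggled through the pointer. */
void* twice(void* p) { return (void*)((long)p * 2); }
```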


&lt;p&gt;Finally, the star of the show: the function that starts the threads, assigns a &lt;code&gt;map_argument&lt;/code&gt; to each of them, waits for all the map jobs to run, and finally returns the results.&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;
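&lt;p&gt;A hedged sketch of that orchestrating function, matching the call to &lt;code&gt;concurrent_map&lt;/code&gt; shown further down (struct and worker repeated so the snippet stands alone):&lt;/p&gt;

```c
#include <pthread.h>
#include <stdlib.h>

typedef struct {
    void** inputs;
    void** outputs;
    void* (*f)(void*);
    size_t start;
    size_t end;
} map_argument;

static void* mapper_thread(void* raw) {
    map_argument* arg = (map_argument*)raw;
    for (size_t i = arg->start; i < arg->end; i++)
        arg->outputs[i] = arg->f(arg->inputs[i]);
    return NULL;
}

/* Partition the inputs into n_threads chunks, map each in its own thread,
   join them all, and return the outputs. */
void** concurrent_map(void** inputs, void* (*f)(void*), size_t n, size_t n_threads) {
    void** outputs = malloc(sizeof(void*) * n);
    pthread_t* threads = malloc(sizeof(pthread_t) * n_threads);
    map_argument* args = malloc(sizeof(map_argument) * n_threads);
    size_t chunk = (n + n_threads - 1) / n_threads;  /* ceiling division */

    for (size_t t = 0; t < n_threads; t++) {
        args[t].inputs = inputs;
        args[t].outputs = outputs;
        args[t].f = f;
        args[t].start = t * chunk;
        args[t].end = (t + 1) * chunk < n ? (t + 1) * chunk : n;
        pthread_create(&threads[t], NULL, mapper_thread, &args[t]);
    }
    for (size_t t = 0; t < n_threads; t++)
        pthread_join(threads[t], NULL);  /* wait until every slice is mapped */

    free(threads);
    free(args);
    return outputs;
}

/* Example mapped function: doubles a long smuggled through the pointer. */
void* twice(void* p) { return (void*)((long)p * 2); }
```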


&lt;p&gt;Notice how this function allows us to choose how many threads we want, and partitions the data accordingly. It also handles the threads’ creation and joining.&lt;/p&gt;

&lt;p&gt;Finally, the way we would call this function in main looks something like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight"&gt;&lt;pre class="highlight plaintext"&gt;&lt;code&gt;concurrent_map((void**) numbers, twice, N, NTHREADS)
&lt;/code&gt;&lt;/pre&gt;&lt;/div&gt;



&lt;p&gt;Where &lt;code&gt;NTHREADS&lt;/code&gt; is the number of threads we want, and &lt;code&gt;N&lt;/code&gt; is how many elements &lt;code&gt;numbers&lt;/code&gt; has.&lt;/p&gt;

&lt;p&gt;Now that the code is done, let’s run some benchmarks! Is this really going to be faster? Will all this wrapper code make things a lot slower? Let’s find out!&lt;/p&gt;

&lt;h2&gt;
  
  
  Map in C, Benchmarks: Single-threaded vs Multi-threaded
&lt;/h2&gt;

&lt;p&gt;In order to measure performance improvements from using parallel &lt;em&gt;Map&lt;/em&gt;, I tested some single-threaded algorithms against their multi-threaded counterparts.&lt;/p&gt;

&lt;h4&gt;
  
  
  First benchmark: slow_twice
&lt;/h4&gt;

&lt;p&gt;For my first test, I used the &lt;em&gt;slow_twice&lt;/em&gt; function, which simply multiplies each number by 2.&lt;/p&gt;

&lt;p&gt;You may be wondering, ‘why is it called slow?’. The answer is simple: we will double each number 1000 times.&lt;/p&gt;

&lt;p&gt;This makes the operation slower, so we can measure time differences without having to use so many numbers that initialization takes too long. It also lets us benchmark the case of many memory writes.&lt;/p&gt;

&lt;p&gt;Since execution time for each number is constant, the non-parallel algorithm’s time grows pretty much linearly with input size.&lt;/p&gt;

&lt;p&gt;I then ran it with 2, 4 and 8 threads. My laptop has 4 cores, and I found that to be the optimal number of threads as well. For some other algorithms, I’ve found a multiple of the number of cores to be optimal, but that wasn’t the case here.&lt;/p&gt;

&lt;h4&gt;
  
  
  Benchmark Results
&lt;/h4&gt;

&lt;p&gt;I ran each benchmark 10 times and took the average, just in case.&lt;/p&gt;

&lt;p&gt;Here are the results:&lt;/p&gt;

&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;&lt;th&gt;Time (s)&lt;/th&gt;&lt;th&gt;5,000,000 elements&lt;/th&gt;&lt;th&gt;10,000,000 elements&lt;/th&gt;&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;single-threaded&lt;/td&gt;&lt;td&gt;18.91&lt;/td&gt;&lt;td&gt;37.47&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;2 threads&lt;/td&gt;&lt;td&gt;9.78&lt;/td&gt;&lt;td&gt;19.49&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;4 threads&lt;/td&gt;&lt;td&gt;6.46&lt;/td&gt;&lt;td&gt;12.85&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;8 threads&lt;/td&gt;&lt;td&gt;8.60&lt;/td&gt;&lt;td&gt;17.18&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;

&lt;p&gt;For both test cases, using &lt;strong&gt;4 threads&lt;/strong&gt; was about &lt;strong&gt;three times faster&lt;/strong&gt; than the single-threaded implementation. This shows a parallel &lt;em&gt;Map&lt;/em&gt; can be a lot faster than the plain single-threaded version.&lt;/p&gt;

&lt;p&gt;There was also a cost to adding more than 4 threads, probably due to the overhead of initialization and context switching.&lt;/p&gt;

&lt;h4&gt;
  
  
  Second benchmark: is_prime
&lt;/h4&gt;

&lt;p&gt;For this benchmark I coded a naive prime testing function: it simply iterates through all the numbers smaller than the input, and returns 0 if any of them divides it, 1 otherwise.&lt;/p&gt;

&lt;p&gt;Notice how this function takes O(n) instead of O(1) for each element, so a few partitions of our data (which is ordered) will be a lot slower than the others. I wonder how this will affect running times.&lt;/p&gt;

&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;&lt;th&gt;Time (s)&lt;/th&gt;&lt;th&gt;150,000 elements&lt;/th&gt;&lt;th&gt;300,000 elements&lt;/th&gt;&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;&lt;td&gt;single-threaded&lt;/td&gt;&lt;td&gt;5.02&lt;/td&gt;&lt;td&gt;18.73&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;2 threads&lt;/td&gt;&lt;td&gt;3.76&lt;/td&gt;&lt;td&gt;13.78&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;4 threads&lt;/td&gt;&lt;td&gt;2.73&lt;/td&gt;&lt;td&gt;10.14&lt;/td&gt;&lt;/tr&gt;
&lt;tr&gt;&lt;td&gt;8 threads&lt;/td&gt;&lt;td&gt;2.43&lt;/td&gt;&lt;td&gt;8.70&lt;/td&gt;&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;

&lt;p&gt;In this case, the parallel algorithm again beats the single-threaded one. No big surprises there. However, this time there’s an &lt;strong&gt;improvement when using over 4 threads&lt;/strong&gt;!&lt;/p&gt;

&lt;p&gt;I think this is because when partitioning our inputs, dividing them into smaller chunks makes the &lt;strong&gt;slowest partition take less time&lt;/strong&gt;, thus making our bottleneck smaller.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusion
&lt;/h2&gt;

&lt;p&gt;I had a lot of fun running this experiment.&lt;/p&gt;

&lt;p&gt;Picking &lt;strong&gt;how many threads to use&lt;/strong&gt; turns out to be a lot harder than just “use as many as there are cores”, and &lt;strong&gt;depends a lot on our input&lt;/strong&gt; even for very dumb algorithms.&lt;/p&gt;

&lt;p&gt;This may help us understand why optimizing a cluster’s configuration can be such a daunting task for a big application.&lt;/p&gt;

&lt;p&gt;In the future, I may add a parallel &lt;em&gt;reduce&lt;/em&gt; implementation to complete this little framework.&lt;/p&gt;

&lt;p&gt;A few other benchmarks that might be fun to run in the future are &lt;em&gt;Map&lt;/em&gt; in C vs &lt;a href="http://www.datastuff.tech/programming/pythons-list-comprehensions-uses-and-advantages/"&gt;Python List Comprehensions&lt;/a&gt;, and C vs SIMD-Assembly.&lt;/p&gt;

&lt;p&gt;Remember you can use this code any way you like, or run your own experiments, and if you do &lt;em&gt;please&lt;/em&gt; don’t forget to let me know your results in the comments!&lt;/p&gt;

&lt;p&gt;Feel free to contact me on &lt;a href="http://www.twitter.com/strikingloo"&gt;Twitter&lt;/a&gt;, &lt;a href="http://www.medium.com/@strikingloo"&gt;Medium&lt;/a&gt; or &lt;a href="http://www.dev.to/strikingloo"&gt;dev.to&lt;/a&gt; for anything you want to say or ask me!&lt;/p&gt;

&lt;p&gt;If you want to level up as a Data scientist, check out my &lt;a href="http://www.datastuff.tech/data-science/3-machine-learning-books-that-helped-me-level-up-as-a-data-scientist/"&gt;best Machine Learning books&lt;/a&gt; list and my &lt;a href="http://www.datastuff.tech/programming/terminal-tutorial-more-productive/"&gt;Bash tutorial&lt;/a&gt;.&lt;/p&gt;

</description>
      <category>programming</category>
      <category>beginners</category>
      <category>c</category>
      <category>dataengineering</category>
    </item>
    <item>
      <title>Why do Neural Networks Need an Activation Function?</title>
      <dc:creator>Luciano Strika</dc:creator>
      <pubDate>Mon, 01 Jul 2019 00:21:12 +0000</pubDate>
      <link>https://dev.to/strikingloo/why-do-neural-networks-need-an-activation-function-127m</link>
      <guid>https://dev.to/strikingloo/why-do-neural-networks-need-an-activation-function-127m</guid>
      <description>&lt;p&gt;Why do Neural Networks Need an Activation Function? Whenever you see a Neural Network’s architecture for the first time, one of the first things you’ll notice is they have a lot of interconnected layers.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Each layer in a Neural Network has an activation function, but why are they necessary? And why are they so important? Learn the answer here.&lt;/strong&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  What are activation functions?
&lt;/h2&gt;

&lt;p&gt;To answer the question of what Activation Functions are, let’s first take a step back and answer a bigger one: What is a Neural Network?&lt;/p&gt;

&lt;h3&gt;
  
  
  What are Neural Networks?
&lt;/h3&gt;

&lt;p&gt;A Neural Network is a Machine Learning model that, given certain input and output vectors, will try to “fit” the outputs to the inputs.&lt;/p&gt;

&lt;p&gt;What this means is, given a set of observed instances with certain values we wish to predict, and some data we have on each instance, it will try to generalize from that data so that it can predict the values correctly for new instances of the problem.&lt;/p&gt;

&lt;p&gt;As an example, we may be designing an image classifier (typically with a &lt;a href="https://dev.to/strikingloo/convolutional-neural-networks-an-introduction-tensorflow-eager-4f4m"&gt;Convolutional Neural Network&lt;/a&gt;). Here, the inputs are a vector of pixels. The output could be a numerical class label (for instance, 1 for dogs, 0 for cats).&lt;/p&gt;

&lt;p&gt;This would train a Neural Network to predict whether an image contains a cat or a dog.&lt;/p&gt;

&lt;p&gt;But what is a mathematical function that, given a set of pixels, returns 1 if they correspond to the image of a dog, and 0 to the image of a cat?&lt;/p&gt;

&lt;p&gt;Coming up with a mathematical function that did that by hand would be impossible. &lt;strong&gt;For a human&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;So what we did is invent a Machine that finds that function for us.&lt;/p&gt;

&lt;p&gt;It looks something like this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdocs.opencv.org%2F2.4%2F_images%2Fmlp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdocs.opencv.org%2F2.4%2F_images%2Fmlp.png" alt="Image result for neural network mlp"&gt;&lt;/a&gt;Single hidden layer Neural Network. &lt;a href="https://docs.opencv.org/2.4/modules/ml/doc/neural_networks.html" rel="noopener noreferrer"&gt;Source&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;But you may have seen this picture many times, recognize it as a Neural Network, and still not know exactly what it represents.&lt;/p&gt;

&lt;p&gt;Here, each circle represents a neuron in our Neural Network, and the vertically aligned neurons represent each layer.&lt;/p&gt;

&lt;h3&gt;
  
  
  How do Neural Networks work?
&lt;/h3&gt;

&lt;p&gt;A neuron is just a mathematical function that takes inputs (the outputs of the neurons pointing to it) and returns outputs.&lt;/p&gt;

&lt;p&gt;These outputs serve as inputs for the next layer, and so on until we get to the final, output layer, which is the actual value we return.&lt;/p&gt;

&lt;p&gt;There is an input layer, where each neuron will simply return the corresponding value in the inputs vector.&lt;/p&gt;

&lt;p&gt;For each set of inputs, the Neural Network’s goal is to make each of its outputs as close as possible to the actual expected values.&lt;/p&gt;

&lt;p&gt;Again, think back at the example of the image classifier.&lt;/p&gt;

&lt;p&gt;If we take 100x100px pictures of animals as inputs, then our input layer will have 30000 neurons. That’s 10000 for all the pixels, times three, since each pixel carries a vector of three values (RGB).&lt;/p&gt;

&lt;p&gt;We will then run the inputs through each layer. We get a new vector as each layer’s output, feed it to the next layer as inputs, and so on.&lt;/p&gt;

&lt;p&gt;Each neuron in a layer will return a single value, so a layer’s output vector will have as many dimensions as the layer has neurons.&lt;/p&gt;

&lt;p&gt;So, which value will a neuron return, given some inputs?&lt;/p&gt;

&lt;h3&gt;
  
  
  What does a Neuron do?
&lt;/h3&gt;

&lt;p&gt;A neuron will take an input vector, and do three things to it:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Multiply it by a weights vector.&lt;/li&gt;
&lt;li&gt;Add a bias value to that product.&lt;/li&gt;
&lt;li&gt;Apply an &lt;strong&gt;activation function&lt;/strong&gt; to that value.&lt;/li&gt;
&lt;/ul&gt;
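&lt;p&gt;Those three steps fit in a few lines of C; here is a sketch of a single neuron (the names are mine, and ReLU stands in for an arbitrary activation):&lt;/p&gt;

```c
#include <stddef.h>

/* ReLU activation: identity for non-negative values, 0 for negative ones. */
double relu(double x) { return x > 0 ? x : 0; }

/* One neuron: weighted sum of the inputs, plus a bias, through an activation. */
double neuron(const double* inputs, const double* weights, size_t n,
              double bias, double (*activation)(double)) {
    double sum = bias;
    for (size_t i = 0; i < n; i++)
        sum += inputs[i] * weights[i];  /* the affine (linear) part     */
    return activation(sum);             /* the non-linear part: step 3  */
}
```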

&lt;p&gt;And we finally got to the core of our business: that’s what activation functions do.&lt;/p&gt;

&lt;p&gt;We’ll typically use non-linear functions as activation functions. This is because the linear part is already handled by the previously applied product and addition.&lt;/p&gt;

&lt;h2&gt;
  
  
  What are the most commonly used activation functions?
&lt;/h2&gt;

&lt;p&gt;I’m saying non-linear functions and it sounds logical enough, but what are the typical, commonly used activation functions?&lt;/p&gt;

&lt;p&gt;Let’s see some examples.&lt;/p&gt;

&lt;h3&gt;
  
  
  &lt;strong&gt;ReLU&lt;/strong&gt;
&lt;/h3&gt;

&lt;p&gt;ReLU stands for “Rectified Linear Unit”.&lt;/p&gt;

&lt;p&gt;Of all the activation functions, this is the one that’s most similar to a linear one:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;For non-negative values, it just applies the identity.&lt;/li&gt;
&lt;li&gt;For negative values, it returns 0.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;In mathematical words,&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fwikimedia.org%2Fapi%2Frest_v1%2Fmedia%2Fmath%2Frender%2Fsvg%2Fbb2c32931fad595832c8e66f2f73760ebcbc0096" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fwikimedia.org%2Fapi%2Frest_v1%2Fmedia%2Fmath%2Frender%2Fsvg%2Fbb2c32931fad595832c8e66f2f73760ebcbc0096" alt="{\displaystyle f(x)=x^{+}=\max(0,x)}"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This means all negative values will become 0, while the rest of the values just stay as they are.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsebastianraschka.com%2Fimages%2Ffaq%2Frelu-derivative%2Frelu_3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fsebastianraschka.com%2Fimages%2Ffaq%2Frelu-derivative%2Frelu_3.png" alt="Image result for RELu"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This is a biologically inspired function, since neurons in a brain will either “fire” (return a positive value) or not (return 0).&lt;/p&gt;

&lt;p&gt;Notice how combined with a bias, this actually filters out any value beneath a certain threshold.&lt;/p&gt;

&lt;p&gt;Suppose our bias had a value of -b. Any input value lower than b will become negative after we add the bias, and ReLU then turns it into 0.&lt;/p&gt;

&lt;h3&gt;
  
  
  Sigmoid
&lt;/h3&gt;

&lt;p&gt;The sigmoid function takes any real number as input, and returns a value between 0 and 1. Since it is continuous, it effectively “smushes” values:&lt;/p&gt;

&lt;p&gt;If you apply the sigmoid to 3, you get 0.95. Apply it to 10, you get 0.999… And it will keep approaching 1 without ever reaching it.&lt;/p&gt;

&lt;p&gt;The same happens in the negative direction, except there it converges to 0.&lt;/p&gt;

&lt;p&gt;Here’s the mathematical formula for the sigmoid function.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fwikimedia.org%2Fapi%2Frest_v1%2Fmedia%2Fmath%2Frender%2Fsvg%2F9537e778e229470d85a68ee0b099c08298a1a3f6" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fwikimedia.org%2Fapi%2Frest_v1%2Fmedia%2Fmath%2Frender%2Fsvg%2F9537e778e229470d85a68ee0b099c08298a1a3f6" alt="{\displaystyle S(x)={\frac {1}{1+e^{-x}}}={\frac {e^{x}}{e^{x}+1}}.}"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;As you see, it approaches 1 as x approaches infinity, and approaches 0 if x approaches minus infinity.&lt;/p&gt;

&lt;p&gt;It is also symmetric about the point (0, 1/2): it has a value of exactly 1/2 when its input is 0.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fupload.wikimedia.org%2Fwikipedia%2Fcommons%2Fthumb%2F8%2F88%2FLogistic-curve.svg%2F320px-Logistic-curve.svg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fupload.wikimedia.org%2Fwikipedia%2Fcommons%2Fthumb%2F8%2F88%2FLogistic-curve.svg%2F320px-Logistic-curve.svg.png" alt="Image result for sigmoid"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Since it takes values between 0 and 1, this function is extremely useful as an output if you want to model a probability.&lt;/p&gt;

&lt;p&gt;It’s also helpful if you wish to apply a “filter” to partially keep a certain value (like in an &lt;a href="https://dev.to/strikingloo/lstm-how-to-train-neural-networks-to-write-like-lovecraft-2bbk"&gt;LSTM’s forget gate&lt;/a&gt;).&lt;/p&gt;

&lt;h2&gt;
  
  
  Why do Neural Networks Need an Activation Function?
&lt;/h2&gt;

&lt;p&gt;We’ve already talked about the applications some different activation functions have, in different cases.&lt;/p&gt;

&lt;p&gt;Some let a signal through or obstruct it, others filter its intensity. There’s even the &lt;a href="https://en.wikipedia.org/wiki/Hyperbolic_function" rel="noopener noreferrer"&gt;tanh&lt;/a&gt; activation function: instead of filtering, it turns its input into either a negative or positive value.&lt;/p&gt;

&lt;p&gt;But why do our Neural Networks need Activation Functions? What would happen if we didn’t use them?&lt;/p&gt;

&lt;p&gt;I found the explanation for this question in Yoshua Bengio’s awesome &lt;a href="https://amzn.to/305g2MF" rel="noopener noreferrer"&gt;Deep Learning book&lt;/a&gt;, and I think it’s perfectly explained there.&lt;/p&gt;

&lt;p&gt;We could, instead of composing our linear transformations with non-linear functions, make each neuron simply return its result (effectively composing them with the identity instead).&lt;/p&gt;

&lt;p&gt;But then all of our layers would simply stack one affine (product plus addition) transformation after another. Each layer would simply apply another matrix product and vector addition to the previous one’s output.&lt;/p&gt;

&lt;p&gt;It can be shown (and you can even convince yourself if you try the math with a small vector on a whiteboard) that this composition of affine transformations is equivalent to a single affine transformation.&lt;/p&gt;

&lt;p&gt;Effectively, this whole “Neural Network” where all activation functions have been replaced by the identity would be nothing more than a vector product and a bias addition.&lt;/p&gt;

&lt;p&gt;There are many problems a linear transformation can’t solve, so we would effectively be shrinking the quantity of functions our model could estimate.&lt;/p&gt;

&lt;p&gt;As a very simple but earthshaking example, consider the XOR operator.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F1600%2F1%2AFDAAQeaaE8s_Il8K9llr8w.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F1600%2F1%2AFDAAQeaaE8s_Il8K9llr8w.png" alt="XOR values table"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Try to find a two-element weight vector, plus a bias, that can take x1 and x2 and turn them into x1 XOR x2. Go ahead, I’ll wait.&lt;/p&gt;

&lt;p&gt;…&lt;/p&gt;

&lt;p&gt;Exactly: you can’t. Nobody can. However, consider&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fquicklatex.com%2Fcache3%2Fb6%2Fql_95cce8bf433664197906005aa89260b6_l3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fquicklatex.com%2Fcache3%2Fb6%2Fql_95cce8bf433664197906005aa89260b6_l3.png" alt="formula for a neural network that solves the XOR problem."&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fquicklatex.com%2Fcache3%2Fcd%2Fql_64edade5ed24228097cfdaae7c1e0ecd_l3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fquicklatex.com%2Fcache3%2Fcd%2Fql_64edade5ed24228097cfdaae7c1e0ecd_l3.png" alt="defining vectors with latex"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If you work the math, you’ll see this has the desired output for each possible combination of 1 and 0.&lt;/p&gt;
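&lt;p&gt;To check the math, here is the standard two-hidden-unit ReLU solution for XOR from the Deep Learning book, sketched in C (I’m assuming the images above show these same weights; if they differ, the idea is identical):&lt;/p&gt;

```c
/* ReLU activation: 0 for negative values, identity otherwise. */
double relu(double x) { return x > 0 ? x : 0; }

/* Two hidden ReLU units plus a linear output, computing x1 XOR x2. */
double xor_net(double x1, double x2) {
    double h1 = relu(x1 + x2);        /* hidden weights (1, 1), bias 0  */
    double h2 = relu(x1 + x2 - 1.0);  /* hidden weights (1, 1), bias -1 */
    return h1 - 2.0 * h2;             /* output weights (1, -2), bias 0 */
}
```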

&lt;p&gt;Congratulations! You’ve just trained your first Neural Network!&lt;/p&gt;

&lt;p&gt;And it’s learned a problem a linear model could never have learned.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusions
&lt;/h2&gt;

&lt;p&gt;I hope after this explanation, you now have a better understanding of why Neural Networks need an Activation Function.&lt;/p&gt;

&lt;p&gt;In future articles, I may cover other Activation Functions and their uses, like SoftMax and the controversial Cos.&lt;/p&gt;

&lt;p&gt;So what do you think? Did you learn anything from this article? Did you find it interesting? Was the math off?&lt;/p&gt;

&lt;p&gt;Feel free to contact me on &lt;a href="http://www.twitter.com/strikingloo" rel="noopener noreferrer"&gt;Twitter&lt;/a&gt;, &lt;a href="http://www.medium.com/@strikingloo" rel="noopener noreferrer"&gt;Medium&lt;/a&gt; or &lt;a href="http://www.dev.to/strikingloo"&gt;dev.to&lt;/a&gt; for anything you want to say or ask me!&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>deeplearning</category>
      <category>beginners</category>
      <category>ai</category>
    </item>
    <item>
      <title>Data Scientists|Engineers: What are the Frameworks you use the most at your job?</title>
      <dc:creator>Luciano Strika</dc:creator>
      <pubDate>Fri, 28 Jun 2019 05:20:11 +0000</pubDate>
      <link>https://dev.to/strikingloo/data-scientists-engineers-what-are-the-frameworks-you-use-the-most-at-your-job-4dda</link>
      <guid>https://dev.to/strikingloo/data-scientists-engineers-what-are-the-frameworks-you-use-the-most-at-your-job-4dda</guid>
      <description>&lt;p&gt;I've seen a lot of statistics about programmers, but not specifically about Data Scientists or Engineers.&lt;br&gt;
Because of that, I'd like to propose this survey:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Do you identify as a &lt;strong&gt;Data Scientist&lt;/strong&gt; or &lt;strong&gt;Data Engineer&lt;/strong&gt;?&lt;/li&gt;
&lt;li&gt;What are the &lt;strong&gt;languages and frameworks&lt;/strong&gt; you use the most at your job? Say, the top 5 or 6.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;As a bonus:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;What are the Frameworks you &lt;em&gt;wish&lt;/em&gt; you were using instead?&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;I will write back with some visuals or analysis if this gets enough traction. &lt;br&gt;
Of course all the data will be public here, so you can do your own too.&lt;/p&gt;

</description>
      <category>discuss</category>
      <category>datascience</category>
      <category>python</category>
      <category>machinelearning</category>
    </item>
    <item>
      <title>LSTM: How to Train Neural Networks to Write like Lovecraft</title>
      <dc:creator>Luciano Strika</dc:creator>
      <pubDate>Mon, 24 Jun 2019 00:37:59 +0000</pubDate>
      <link>https://dev.to/strikingloo/lstm-how-to-train-neural-networks-to-write-like-lovecraft-2bbk</link>
      <guid>https://dev.to/strikingloo/lstm-how-to-train-neural-networks-to-write-like-lovecraft-2bbk</guid>
      <description>&lt;p&gt;LSTM Neural Networks have seen a lot of use in the recent years, both for text and music generation, and for Time Series Forecasting.&lt;/p&gt;

&lt;p&gt;Today, I’ll teach you how to train an LSTM Neural Network for text generation, so that it can write with H. P. Lovecraft’s style.&lt;/p&gt;

&lt;p&gt;In order to train this LSTM, we’ll be using TensorFlow’s Keras API for Python.&lt;/p&gt;

&lt;p&gt;I learned about this subject from this awesome &lt;a href="http://colah.github.io/posts/2015-08-Understanding-LSTMs/" rel="noopener noreferrer"&gt;LSTM Neural Networks tutorial&lt;/a&gt;. My code follows this &lt;a href="https://chunml.github.io/ChunML.github.io/project/Creating-Text-Generator-Using-Recurrent-Neural-Network/" rel="noopener noreferrer"&gt;Text Generation tutorial&lt;/a&gt; closely.&lt;/p&gt;

&lt;p&gt;I’ll show you my Python examples and results as usual, but first, let’s do some explaining.&lt;/p&gt;

&lt;h2&gt;
  
  
  What are LSTM Neural Networks?
&lt;/h2&gt;

&lt;p&gt;The most vanilla, run-of-the-mill Neural Network, called a Multi-Layer-Perceptron, is just a composition of fully connected layers.&lt;/p&gt;

&lt;p&gt;In these models, the input is a vector of features, and each subsequent layer is a set of “neurons”.&lt;/p&gt;

&lt;p&gt;Each neuron performs an affine (linear) transformation to the previous layer’s output, and then applies some non-linear function to that result.&lt;/p&gt;

&lt;p&gt;The output of a layer’s neurons, a new vector, is fed to the next layer, and so on.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxkyyedoredz9bax6h4v6.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fxkyyedoredz9bax6h4v6.jpg" alt="Image result for multilayer perceptron"&gt;&lt;/a&gt;&lt;a href="https://www.researchgate.net/figure/A-hypothetical-example-of-Multilayer-Perceptron-Network_fig4_303875065" rel="noopener noreferrer"&gt;Source&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;An LSTM (Long Short-Term Memory) Neural Network is just another kind of &lt;a href="http://www.datastuff.tech/machine-learning/autoencoder-deep-learning-tensorflow-eager-api-keras/" rel="noopener noreferrer"&gt;Artificial Neural Network&lt;/a&gt;, which falls in the category of Recurrent Neural Networks.&lt;/p&gt;

&lt;p&gt;What makes LSTM Neural Networks different from regular Neural Networks is, they have LSTM cells as neurons in some of their layers.&lt;/p&gt;

&lt;p&gt;Much like &lt;a href="https://dev.to/strikingloo/convolutional-neural-networks-an-introduction-tensorflow-eager-4f4m"&gt;Convolutional Layers&lt;/a&gt; help a Neural Network learn about image features, LSTM cells help the Network learn about temporal data, something which other Machine Learning models traditionally struggled with.&lt;/p&gt;

&lt;p&gt;How do LSTM cells work? I’ll explain it now, though I highly recommend you give those tutorials a chance too.&lt;/p&gt;

&lt;h2&gt;
  
  
  How do LSTM cells work?
&lt;/h2&gt;

&lt;p&gt;An LSTM layer will contain many LSTM cells.&lt;/p&gt;

&lt;p&gt;Each LSTM cell in our Neural Network will only look at a single column of its inputs, and also at the previous column’s LSTM cell’s output.&lt;/p&gt;

&lt;p&gt;Normally, we feed our LSTM Neural Network a whole matrix as its input, where each column corresponds to something that “comes before” the next column.&lt;/p&gt;

&lt;p&gt;This way, each LSTM cell will have &lt;strong&gt;two different input vectors&lt;/strong&gt; : the previous LSTM cell’s output (which gives it some information about the previous input column) and its own input column.&lt;/p&gt;

&lt;h3&gt;
  
  
  LSTM Cells in action: an intuitive example.
&lt;/h3&gt;

&lt;p&gt;For instance, if we were training an LSTM Neural Network to predict stock exchange values, we could feed it a vector with a stock’s closing prices from the last three days.&lt;/p&gt;

&lt;p&gt;The first LSTM cell, in that case, would use the first day as input, and send some extracted features to the next cell.&lt;/p&gt;

&lt;p&gt;That second cell would look at the second day’s price, and also at whatever the previous cell learned from yesterday, before generating new inputs for the next cell.&lt;/p&gt;

&lt;p&gt;After doing this for each cell, the last one will actually have a lot of temporal information: besides its own day’s price, it receives what the previous cell extracted from the day before, which in turn encodes what the earlier cells learned from the days before that.&lt;/p&gt;

&lt;p&gt;You can experiment with different time windows, and also change how many units (neurons) will look at each day’s data, but this is the general idea.&lt;/p&gt;
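&lt;p&gt;As a minimal sketch of that idea (the names and the numbers here are made up for illustration), building the three-day input windows could look like this:&lt;/p&gt;

```python
def make_windows(prices, window=3):
    """Slice a price series into overlapping windows. Each window is one
    input to the LSTM layer; its cells read the days one at a time."""
    return [prices[i:i + window] for i in range(len(prices) - window + 1)]

closing_prices = [101.0, 103.5, 102.2, 104.8, 106.1]
windows = make_windows(closing_prices)
# The first window covers days 1-3, the next starts one day later, and so on.
```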

&lt;h3&gt;
  
  
  How LSTM Cells work: the Math.
&lt;/h3&gt;

&lt;p&gt;The actual math behind what each cell extracts from the previous one is a bit more involved.&lt;/p&gt;

&lt;h4&gt;
  
  
  Forget Gate
&lt;/h4&gt;

&lt;p&gt;The “forget gate” is a sigmoid layer that regulates how much of the previous cell’s state will carry over into this one’s.&lt;/p&gt;

&lt;p&gt;It takes as input both the previous cell’s “hidden state” (another output vector), and the actual inputs from the previous layer.&lt;/p&gt;

&lt;p&gt;Since it is a sigmoid, it will return a vector of “probabilities”: values between 0 and 1.&lt;/p&gt;

&lt;p&gt;They will &lt;strong&gt;multiply the previous cell’s state&lt;/strong&gt; element-wise to regulate how much influence it keeps, creating this cell’s state.&lt;/p&gt;

&lt;p&gt;For instance, in a drastic case, the sigmoid may return a vector of zeroes, and the whole state would be multiplied by 0 and thus discarded.&lt;/p&gt;

&lt;p&gt;This may happen if this layer sees a very big change in the input distribution, for example.&lt;/p&gt;
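&lt;p&gt;Here’s a minimal numpy sketch of the forget gate described above (the weight matrix, the sizes, and the vectors are all invented for illustration, and I leave out the bias term):&lt;/p&gt;

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
hidden_size, input_size = 4, 3

W_f = rng.standard_normal((hidden_size, hidden_size + input_size))  # learned weights

h_prev = rng.standard_normal(hidden_size)  # previous cell's hidden state
x_t = rng.standard_normal(input_size)      # this cell's input column
c_prev = rng.standard_normal(hidden_size)  # previous cell's state

# The gate looks at both input vectors and squashes the result into (0, 1)...
f_t = sigmoid(W_f @ np.concatenate([h_prev, x_t]))
# ...then scales the previous state element-wise: 0 forgets, 1 keeps.
c_scaled = f_t * c_prev
```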

&lt;h4&gt;
  
  
  Input Gate
&lt;/h4&gt;

&lt;p&gt;Unlike the forget gate, the input gate’s output is added to the previous cell’s state (after it has been multiplied by the forget gate’s output).&lt;/p&gt;

&lt;p&gt;The input gate is the element-wise product of two different layers’ outputs, though they both take the same inputs as the forget gate (the previous cell’s hidden state, and the previous layer’s outputs):&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A &lt;strong&gt;sigmoid unit&lt;/strong&gt; , regulating how much the new information will impact this cell’s output.&lt;/li&gt;
&lt;li&gt;A &lt;strong&gt;tanh unit&lt;/strong&gt; , which actually extracts the new information. Notice tanh takes values between -1 and 1.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The &lt;strong&gt;product of these two units&lt;/strong&gt; (which could, again, be 0, or be exactly equal to the tanh output, or anything in between) is added to this neuron’s cell state.&lt;/p&gt;

&lt;h4&gt;
  
  
  The LSTM cell’s outputs
&lt;/h4&gt;

&lt;p&gt;The cell’s state is what the next LSTM cell will receive as input, along with this cell’s hidden state.&lt;/p&gt;

&lt;p&gt;The hidden state will be &lt;strong&gt;another tanh unit&lt;/strong&gt; applied to this cell’s state, multiplied by another &lt;strong&gt;sigmoid unit&lt;/strong&gt; (the output gate) that, just like the forget gate, takes the previous layer’s outputs and the previous hidden state as inputs.&lt;/p&gt;
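&lt;p&gt;Putting the three gates together, one step of a single LSTM cell can be sketched in numpy like this (this is the standard formulation; the parameter names and sizes are mine, and biases are omitted for brevity):&lt;/p&gt;

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, params):
    """One LSTM cell step: forget gate, input gate, then the outputs."""
    z = np.concatenate([h_prev, x_t])
    f = sigmoid(params["W_f"] @ z)  # forget gate
    i = sigmoid(params["W_i"] @ z)  # input gate, sigmoid half
    g = np.tanh(params["W_g"] @ z)  # input gate, tanh half
    o = sigmoid(params["W_o"] @ z)  # output gate
    c_t = f * c_prev + i * g        # new cell state
    h_t = o * np.tanh(c_t)          # new hidden state
    return h_t, c_t

rng = np.random.default_rng(1)
H, D = 4, 3  # hidden size and input-column size, chosen arbitrarily
params = {k: rng.standard_normal((H, H + D)) for k in ["W_f", "W_i", "W_g", "W_o"]}

h, c = np.zeros(H), np.zeros(H)
for day in rng.standard_normal((3, D)):  # three input columns, fed in order
    h, c = lstm_step(day, h, c, params)
```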

&lt;p&gt;Here’s a visualization of what each LSTM cell looks like, borrowed from the tutorial I just linked:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd7m61oxn93t2j2yorbh1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Farticles%2Fd7m61oxn93t2j2yorbh1.png" alt="LSTM"&gt;&lt;/a&gt;Source: &lt;a href="https://chunml.github.io/ChunML.github.io/project/Creating-Text-Generator-Using-Recurrent-Neural-Network/" rel="noopener noreferrer"&gt;Text Generating LSTMs&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Now that we’ve covered the theory, let’s move on to some practical uses!&lt;/p&gt;

&lt;p&gt;As usual, all of the code is &lt;a href="https://github.com/StrikingLoo/LoveCraftLSTM" rel="noopener noreferrer"&gt;available on GitHub&lt;/a&gt; if you want to try it out, or you can just follow along and see the gists.&lt;/p&gt;

&lt;h2&gt;
  
  
  Training LSTM Neural Networks with TensorFlow Keras
&lt;/h2&gt;

&lt;p&gt;For this task, I used this &lt;a href="https://github.com/vilmibm/lovecraftcorpus" rel="noopener noreferrer"&gt;dataset containing 60 Lovecraft tales&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Since he wrote most of his work in the 20s, and he died in 1937, it’s now mostly in the public domain, so it wasn’t that hard to get.&lt;/p&gt;

&lt;p&gt;I thought training a Neural Network to write like him would be an interesting challenge.&lt;/p&gt;

&lt;p&gt;This is because, on the one hand, he had a very distinct style (with abundant purple prose: weird words and elaborate language), but on the other hand his very complex vocabulary may give a Network trouble.&lt;/p&gt;

&lt;p&gt;For instance, here’s a random sentence from the first tale in the dataset:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;At night the subtle stirring of the black city outside, the sinister scurrying of rats in the wormy partitions, and the creaking of hidden timbers in the centuried house, were enough to give him a sense of strident pandemonium&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;If I can get a Neural Network to write “pandemonium”, then I’ll be impressed.&lt;/p&gt;

&lt;h3&gt;
  
  
  Preprocessing our data
&lt;/h3&gt;

&lt;p&gt;In order to train an LSTM Neural Network to generate text, we must first preprocess our text data so that it can be consumed by the network.&lt;/p&gt;

&lt;p&gt;In this case, since a Neural Network takes vectors as input, we need a way to convert the text into vectors.&lt;/p&gt;

&lt;p&gt;For these examples, I decided to train my LSTM Neural Networks to predict the next M characters in a string, taking as input the previous N ones.&lt;/p&gt;

&lt;p&gt;To be able to feed it the N characters, I did a one-hot encoding of each one of them, so that the network’s input is a matrix of CxN elements, where C is the total number of different characters in my dataset.&lt;/p&gt;

&lt;p&gt;First, we read the text files and concatenate all of their contents.&lt;/p&gt;

&lt;p&gt;We limit our characters to be alphanumerical, plus a few punctuation marks.&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;
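&lt;p&gt;The gist above doesn’t render in this feed, so here’s a rough sketch of the idea (my character whitelist is an assumption; the original gist’s may differ):&lt;/p&gt;

```python
import string

# Which characters survive is an assumption; the original script's exact
# whitelist may differ.
ALLOWED = set(string.ascii_letters + string.digits + " .,;:!?'\n")

def clean(text):
    """Keep only alphanumeric characters plus a few punctuation marks."""
    return "".join(ch for ch in text if ch in ALLOWED)

# In the real script the corpus comes from concatenating every tale:
# corpus = clean("".join(open(path).read() for path in tale_paths))
corpus = clean("Ph'nglui mglw'nafh @@ Cthulhu ##")
```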
 

&lt;p&gt;We can then proceed to one-hot encode the strings into matrices, where every element of the &lt;em&gt;j&lt;/em&gt;-th column is a 0 except for the one corresponding to the &lt;em&gt;j&lt;/em&gt;-th character in the corpus.&lt;/p&gt;

&lt;p&gt;In order to do this, we first define a dictionary that assigns an index to each character.&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;
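&lt;p&gt;Again, the gist isn’t embedded here, so here’s a minimal sketch of the dictionary and the one-hot encoding (variable names are mine, and the tiny corpus is just for illustration):&lt;/p&gt;

```python
import numpy as np

corpus = "the centuried house"
chars = sorted(set(corpus))
char_to_ix = {ch: i for i, ch in enumerate(chars)}  # an index per character

def one_hot(text):
    """Encode a string as a VOCAB_SIZE x len(text) matrix: column j is
    all zeros except for a 1 in the row of the j-th character."""
    m = np.zeros((len(chars), len(text)))
    for j, ch in enumerate(text):
        m[char_to_ix[ch], j] = 1.0
    return m

X = one_hot(corpus[:10])  # the first 10 characters as network input
```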


&lt;p&gt;Notice how, if we wished to sample our data, we could just make the variable &lt;em&gt;slices&lt;/em&gt; smaller.&lt;/p&gt;

&lt;p&gt;I also chose a value for &lt;em&gt;SEQ_LENGTH&lt;/em&gt; of 50, making the network receive 50 characters and try to predict the next 50.&lt;/p&gt;

&lt;h3&gt;
  
  
  Training our LSTM Neural Network in Keras
&lt;/h3&gt;

&lt;p&gt;In order to train the Neural Network, we must first define it.&lt;/p&gt;

&lt;p&gt;This Python code creates an LSTM Neural Network with two LSTM layers, each with 100 units.&lt;/p&gt;

&lt;p&gt;Remember each unit has one cell for each character in the input sequence, so 50 of them here.&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;
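&lt;p&gt;Since the gist doesn’t render here, this is a hedged sketch of what a model like the one described could look like in Keras (the final softmax layer, the optimizer, and the vocabulary size are my assumptions, not something the text specifies):&lt;/p&gt;

```python
from keras.models import Sequential
from keras.layers import LSTM, TimeDistributed, Dense

SEQ_LENGTH = 50  # characters in, characters out
VOCAB_SIZE = 45  # distinct characters in the corpus (assumed value)

model = Sequential()
# Two LSTM layers, 100 units each; return_sequences keeps one output
# per character so the next layer sees all 50 time steps.
model.add(LSTM(100, input_shape=(SEQ_LENGTH, VOCAB_SIZE), return_sequences=True))
model.add(LSTM(100, return_sequences=True))
# TimeDistributed applies the same Dense softmax to every time step.
model.add(TimeDistributed(Dense(VOCAB_SIZE, activation="softmax")))
model.compile(loss="binary_crossentropy", optimizer="adam")
```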


&lt;p&gt;Here &lt;em&gt;VOCAB_SIZE&lt;/em&gt; is just the number of distinct characters we’ll use, and &lt;em&gt;TimeDistributed&lt;/em&gt; is a way of applying a given layer to each different cell, maintaining temporal ordering.&lt;/p&gt;

&lt;p&gt;For this model, I actually tried many different learning rates to test convergence speed vs overfitting.&lt;/p&gt;

&lt;p&gt;Here’s the code for training:&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;
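&lt;p&gt;The training gist is also an embed, so here’s a sketch of the call, assuming the compiled &lt;code&gt;model&lt;/code&gt; described above and the one-hot arrays &lt;code&gt;X&lt;/code&gt; and &lt;code&gt;y&lt;/code&gt; from the preprocessing step (the batch size and the filename are my guesses; the 500 epochs match the text):&lt;/p&gt;

```python
# X: one-hot input windows, shape (num_examples, SEQ_LENGTH, VOCAB_SIZE).
# y: the same windows shifted 50 characters ahead, same shape.
model.fit(X, y, batch_size=64, epochs=500)
model.save("lovecraft_lstm.h5")  # hypothetical filename
```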
 

&lt;p&gt;What you are seeing is what had the best performance in terms of loss minimization.&lt;/p&gt;

&lt;p&gt;However, with a &lt;code&gt;binary_cross_entropy&lt;/code&gt; of 0.0244 in the final epoch (after 500 epochs), here’s what the model’s output looked like.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Tolman hast toemtnsteaetl nh otmn tf titer aut tot tust tot ahen h l the srrers ohre trrl tf thes snneenpecg tettng s olt oait ted beally tad ened ths tan en ng y afstrte and trr t sare t teohetilman hnd tdwasd hxpeinte thicpered the reed af the satl r tnnd Tev hilman hnteut iout y techesd d ty ter thet te wnow tn tis strdend af ttece and tn aise ecn
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;


&lt;p&gt;There are many &lt;strong&gt;good things&lt;/strong&gt; about this output, and &lt;strong&gt;many bad ones&lt;/strong&gt; as well.&lt;/p&gt;

&lt;p&gt;The way the spacing is set up, with words mostly between 2 and 5 characters long with some longer outliers, is pretty similar to the actual word length distribution in the corpus.&lt;/p&gt;

&lt;p&gt;I also noticed the &lt;strong&gt;letters&lt;/strong&gt; ‘T’, ‘E’ and ‘I’ were &lt;strong&gt;appearing very commonly&lt;/strong&gt; , whereas ‘y’ or ‘x’ were &lt;strong&gt;less frequent&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;When I looked at &lt;strong&gt;letter relative frequencies&lt;/strong&gt; in the sampled output versus the corpus, they were pretty similar. It’s the &lt;strong&gt;ordering&lt;/strong&gt; that’s &lt;strong&gt;completely off&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;There is also something to be said about how &lt;strong&gt;capital letters only appear after spaces&lt;/strong&gt; , as is usually the case in English.&lt;/p&gt;

&lt;p&gt;To generate these outputs, I simply asked the model to predict the next 50 characters for different 50 character subsets in the corpus. If it’s this bad with training data, I figured testing or random data wouldn’t be worth checking.&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;
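&lt;p&gt;The sampling I just described can be sketched like this, assuming the trained &lt;code&gt;model&lt;/code&gt; plus the &lt;code&gt;one_hot&lt;/code&gt; encoder and &lt;code&gt;chars&lt;/code&gt; list from the preprocessing step (all the names here are mine):&lt;/p&gt;

```python
import numpy as np

def continue_text(seed):
    """Predict the 50 characters that follow a 50-character seed."""
    x = one_hot(seed).T[np.newaxis, ...]  # shape (1, SEQ_LENGTH, VOCAB_SIZE)
    probs = model.predict(x)[0]           # one softmax vector per position
    return "".join(chars[int(np.argmax(p))] for p in probs)

# Feed it 50-character slices taken at different offsets of the corpus:
# print(continue_text(corpus[i:i + 50]))
```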



&lt;p&gt;The nonsense actually reminded me of one of H. P. Lovecraft’s most famous tales, “The Call of Cthulhu”, where people start having hallucinations about this cosmic, eldritch being, and they say things like:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Ph’nglui mglw’nafh &lt;em&gt;Cthulhu R’lyeh&lt;/em&gt; wgah’nagl fhtagn.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Sadly the model wasn’t overfitting that either: it was clearly &lt;strong&gt;underfitting&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;So I tried to make its task smaller, and the model bigger: 125 units, predicting only 30 characters.&lt;/p&gt;

&lt;h3&gt;
  
  
  Bigger model, smaller problem. Any results?
&lt;/h3&gt;

&lt;p&gt;With this smaller model, after another 500 epochs, some patterns began to emerge.&lt;/p&gt;

&lt;p&gt;Even though the loss function wasn’t that much smaller (at 210), the character frequencies remained similar to the corpus’s.&lt;/p&gt;

&lt;p&gt;The ordering of characters improved a lot though: here’s a random sample from its output, see if you can spot some words.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;the sreun troor Tvwood sas an ahet eae rin and t paared th te aoolling onout The e was thme trr t sovtle tousersation oefore tifdeng tor teiak uth tnd tone gen ao tolman aarreed y arsred tor h tndarcount tf tis feaont oieams wnd toar Tes heut oas nery tositreenic and t aeed aoet thme hing tftht to te tene Te was noewked ay tis prass s deegn aedgireean ect and tot ced the sueer anoormal -iuking torsarn oaich hnher tad beaerked toring the sars tark he e was tot tech
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Tech, the, and, was… &lt;strong&gt;small words&lt;/strong&gt; are where it’s at! It also realized many words ended with &lt;strong&gt;common suffixes&lt;/strong&gt; like -ing, -ed, and -tion.&lt;/p&gt;

&lt;p&gt;Out of 10000 words, 740 were “&lt;em&gt;the&lt;/em&gt;”, 37 ended in “&lt;em&gt;tion&lt;/em&gt;” (whereas only 3 contained it without ending in it), and 115 ended in –&lt;em&gt;ing&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;Other common words were “than” and “that”, though the model was clearly still unable to produce English sentences.&lt;/p&gt;

&lt;h3&gt;
  
  
  Even bigger model
&lt;/h3&gt;

&lt;p&gt;This gave me hopes. The Neural Network was clearly learning &lt;em&gt;something&lt;/em&gt;, just not enough.&lt;/p&gt;

&lt;p&gt;So I did what you do when your model underfits: I tried an even bigger Neural Network.&lt;/p&gt;

&lt;p&gt;Take into account that I’m running this on my laptop.&lt;/p&gt;

&lt;p&gt;With a modest 16GB of RAM and an i7 processor, these models take hours to learn.&lt;/p&gt;

&lt;p&gt;So I set the number of units to 150, and tried my hand again at 50 characters.&lt;/p&gt;

&lt;p&gt;I figured maybe giving it a smaller time window was making things harder for the Network.&lt;/p&gt;

&lt;p&gt;Here’s what the model’s output was like, after a few hours of training.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;andeonlenl oou torl u aote targore -trnnt d tft thit tewk d tene tosenof the stown ooaued aetane ng thet thes teutd nn aostenered tn t9t aad tndeutler y aean the stun h tf trrns anpne thin te saithdotaer totre aene Tahe sasen ahet teae es y aeweeaherr aore ereus oorsedt aern totl s a dthe snlanete toase af the srrls-thet treud tn the tewdetern tarsd totl s a dthe searle of the sere t trrd eneor tes ansreat tear d af teseleedtaner nl and tad thre n tnsrnn tearltf trrn T has tn oredt d to e e te hlte tf the sndirehio aeartdtf trrns afey aoug ath e -ahe sigtereeng tnd tnenheneo l arther ardseu troa Tnethe setded toaue and tfethe sawt ontnaeteenn an the setk eeusd ao enl af treu r ue oartenng otueried tnd toottes the r arlet ahicl tend orn teer ohre teleole tf the sastr ahete ng tf toeeteyng tnteut ooseh aore of theu y aeagteng tntn rtng aoanleterrh ahrhnterted tnsastenely aisg ng tf toueea en toaue y anter aaneonht tf the sane ng tf the 
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Pure nonsense, except a lot of “the” and “and”s.&lt;/p&gt;

&lt;p&gt;It was actually saying “the” more often than the previous one, but it hadn’t learned about gerunds yet (no -ing).&lt;/p&gt;

&lt;p&gt;Interestingly, many words here ended with “-ed”, which means it was kinda grasping the idea of the &lt;strong&gt;past tense&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;I let it keep going for a few hundred more epochs (to a total of 750).&lt;/p&gt;

&lt;p&gt;The output didn’t change too much, still a lot of “the”, “a” and “an”, and still no bigger structure. Here’s another sample:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Tn t srtriueth ao tnsect on tias ng the sasteten c wntnerseoa onplsineon was ahe ey thet tf teerreag tispsliaer atecoeent of teok ond ttundtrom tirious arrte of the sncirthio sousangst tnr r te the seaol enle tiedleoisened ty trococtinetrongsoa Trrlricswf tnr txeenesd ng tispreeent T wad botmithoth te tnsrtusds tn t y afher worsl ahet then
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;An interesting thing that emerged here though, was the use of prepositions and pronouns.&lt;/p&gt;

&lt;p&gt;The network wrote “I”, “you”, “she”, “we”, “of” and other similar words a few times. All in all, &lt;strong&gt;prepositions and pronouns&lt;/strong&gt; amounted to about &lt;strong&gt;10% of the total sampled words&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;This was an improvement, as the Network was clearly learning low-entropy words.&lt;/p&gt;

&lt;p&gt;However, it was still far from generating coherent English texts.&lt;/p&gt;

&lt;p&gt;I let it train 100 more epochs, and then killed it.&lt;/p&gt;

&lt;p&gt;Here’s its last output.&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;thes was aooceett than engd and te trognd tarnereohs aot teiweth tncen etf thet torei The t hhod nem tait t had nornd tn t yand tesle onet te heen t960 tnd t960 wndardhe tnong toresy aarers oot tnsoglnorom thine tarhare toneeng ahet and the sontain teadlny of the ttrrteof ty tndirtanss aoane ond terk thich hhe senr aesteeeld Tthhod nem ah tf the saar hof tnhe e on thet teauons and teu the ware taiceered t rn trr trnerileon and
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;I knew it was doing its best, but it wasn’t really going anywhere, at least not quickly enough.&lt;/p&gt;

&lt;p&gt;I thought of accelerating convergence speed with &lt;strong&gt;Batch Normalization&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;However, I read on StackOverflow that BatchNorm is not supposed to be used with LSTM Neural Networks.&lt;/p&gt;

&lt;p&gt;If any of you is more experienced with LSTM nets, please let me know in the comments if that’s right!&lt;/p&gt;

&lt;p&gt;At last, I tried this same task with 10 characters as input and 10 as output.&lt;/p&gt;

&lt;p&gt;I guess the model wasn’t getting enough context to predict things well enough though: the results were awful.&lt;/p&gt;

&lt;p&gt;I considered the experiment finished for now.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusions
&lt;/h2&gt;

&lt;p&gt;While it is clear, looking at other people’s work, that an LSTM Neural Network &lt;em&gt;could&lt;/em&gt; learn to write like Lovecraft, I don’t think my PC is powerful enough to train a big enough model in a reasonable time.&lt;/p&gt;

&lt;p&gt;Or maybe it just needs more data than I had.&lt;/p&gt;

&lt;p&gt;In the future, I’d like to repeat this experiment with a word-based approach instead of a character-based one.&lt;/p&gt;

&lt;p&gt;I checked, and about 10% of the words in the corpus appear only once.&lt;/p&gt;

&lt;p&gt;Is there any good practice I should follow if I removed them before training? Like replacing all nouns with the same one, sampling from &lt;a href="http://www.datastuff.tech/machine-learning/k-means-clustering-unsupervised-learning-for-recommender-systems/" rel="noopener noreferrer"&gt;clusters&lt;/a&gt;, or something? Please let me know! I’m sure many of you are more experienced with LSTM neural networks than I.&lt;/p&gt;

&lt;p&gt;Do you think this would have worked better with a different architecture? Something I should have handled differently? Please also let me know, I want to learn more about this.&lt;/p&gt;

&lt;p&gt;Did you find any rookie mistakes on my code? Do you think I’m an idiot for not trying XYZ? Or did you actually find my experiment enjoyable, or maybe you even learned something from this article?&lt;/p&gt;

&lt;p&gt;Contact me on &lt;a href="http://www.twitter.com/strikingloo" rel="noopener noreferrer"&gt;Twitter&lt;/a&gt;, &lt;a href="http://linkedin.com/in/luciano-strika" rel="noopener noreferrer"&gt;LinkedIn&lt;/a&gt;, &lt;a href="http://medium.com/@strikingloo" rel="noopener noreferrer"&gt;Medium&lt;/a&gt; or &lt;a href="http://www.dev.to/strikingloo"&gt;Dev.to&lt;/a&gt; if you want to discuss that, or any related topic.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;If you want to become a Data scientist, or learn something new, check out my &lt;a href="http://www.datastuff.tech/data-science/3-machine-learning-books-that-helped-me-level-up-as-a-data-scientist/" rel="noopener noreferrer"&gt;Machine Learning Reading List&lt;/a&gt;!&lt;/em&gt;&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>keras</category>
      <category>lstm</category>
      <category>neuralnetworks</category>
    </item>
    <item>
      <title>5 Probability Distributions Every Data Scientist Should Know</title>
      <dc:creator>Luciano Strika</dc:creator>
      <pubDate>Mon, 17 Jun 2019 04:34:11 +0000</pubDate>
      <link>https://dev.to/strikingloo/5-probability-distributions-every-data-scientist-should-know-21di</link>
      <guid>https://dev.to/strikingloo/5-probability-distributions-every-data-scientist-should-know-21di</guid>
      <description>&lt;p&gt;Probability Distributions are like 3D glasses. They allow a skilled Data Scientist to recognize patterns in otherwise completely random variables.&lt;/p&gt;

&lt;p&gt;In a way, most of the other Data Science or Machine Learning skills are based on certain assumptions about the probability distributions of your data.&lt;/p&gt;

&lt;p&gt;This makes probability knowledge part of the basis on which you can build your toolkit as a statistician. The first steps if you are figuring out &lt;a href="http://www.datastuff.tech/data-science/3-machine-learning-books-that-helped-me-level-up-as-a-data-scientist/"&gt;how to become a Data Scientist&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Without further ado, let us cut to the chase.&lt;/p&gt;

&lt;h2&gt;
  
  
  What are Probability Distributions?
&lt;/h2&gt;

&lt;p&gt;In Probability and Statistics, a &lt;strong&gt;random variable&lt;/strong&gt; is a thing that &lt;strong&gt;takes random values&lt;/strong&gt;, like “the height of the next person I see” or “the number of cook’s hairs in my next ramen bowl”.&lt;/p&gt;

&lt;p&gt;Given a random variable &lt;em&gt;X&lt;/em&gt;, we’d like to have a way of describing which values it takes. Even more than that, we’d like to characterize &lt;strong&gt;how likely&lt;/strong&gt; it is for that variable to &lt;strong&gt;take a certain value&lt;/strong&gt; &lt;em&gt;x&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;For instance, if &lt;em&gt;X&lt;/em&gt; is “how many cats my girlfriend has”, then there’s a non-zero chance that number could be 1. One could argue there’s a non-zero probability that value could even be 5 or 10.&lt;/p&gt;

&lt;p&gt;However, there’s no way (and therefore no probability) that a person will have negative cats.&lt;/p&gt;

&lt;p&gt;We therefore would like an unambiguous, mathematical way of expressing every possible value &lt;em&gt;x&lt;/em&gt; a variable &lt;em&gt;X&lt;/em&gt; can take, and how likely the event &lt;em&gt;(X= x)&lt;/em&gt; is.&lt;/p&gt;

&lt;p&gt;In order to do this, we define a function &lt;em&gt;P&lt;/em&gt;, such that &lt;em&gt;P(X = x)&lt;/em&gt; is the probability of the variable &lt;em&gt;X&lt;/em&gt; having a value of &lt;em&gt;x&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;We could also ask for P(X &amp;lt; x), or P(X &amp;gt; x), for intervals instead of discrete values. This will become even more relevant soon.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;P&lt;/em&gt; is the variable’s &lt;strong&gt;density function&lt;/strong&gt; , and it characterizes that variable’s &lt;strong&gt;distribution&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Over time, scientists have come to realize many things in nature and real life tend to &lt;strong&gt;behave similarly&lt;/strong&gt; , with variables sharing a distribution, or having the same density functions (or a similar function changing a few constants in it).&lt;/p&gt;

&lt;p&gt;Interestingly, for &lt;em&gt;P&lt;/em&gt; to be an actual density function, a few conditions have to hold.&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;em&gt;&lt;strong&gt;P(X = x)&lt;/strong&gt;&lt;/em&gt; &lt;strong&gt;&amp;lt;= 1&lt;/strong&gt; for any value &lt;em&gt;x&lt;/em&gt;. Nothing’s more certain than certain.&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;&lt;strong&gt;P(X = x)&lt;/strong&gt;&lt;/em&gt; &lt;strong&gt;&amp;gt;= 0&lt;/strong&gt; for any value &lt;em&gt;x&lt;/em&gt;. A thing can be impossible, but not less likely than that.&lt;/li&gt;
&lt;li&gt;And the last one: the &lt;strong&gt;sum&lt;/strong&gt; of &lt;em&gt;P(X = x)&lt;/em&gt; over all possible values &lt;em&gt;x&lt;/em&gt; &lt;strong&gt;is 1&lt;/strong&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This last one means something like “the probability of X taking &lt;em&gt;any&lt;/em&gt; value in the universe, &lt;em&gt;has&lt;/em&gt; to add up to 1, since we know it will take &lt;em&gt;some&lt;/em&gt; value”.&lt;/p&gt;

&lt;h3&gt;
  
  
  Discrete vs Continuous Random Variable Distributions
&lt;/h3&gt;

&lt;p&gt;Lastly, random variables can be thought of as belonging to two groups: &lt;strong&gt;discrete&lt;/strong&gt; and &lt;strong&gt;continuous&lt;/strong&gt; random variables.&lt;/p&gt;

&lt;h4&gt;
  
  
  Discrete Random Variables
&lt;/h4&gt;

&lt;p&gt;&lt;strong&gt;Discrete variables&lt;/strong&gt; have a discrete set of possible values, each of them with a non-zero probability.&lt;/p&gt;

&lt;p&gt;For instance, when flipping a coin, if we say&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;X = “1 if the coin is heads, 0 if tails”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Then &lt;em&gt;P(X = 1) = P(X = 0) = 0.5&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;Note however, that a discrete set need not be finite.&lt;/p&gt;

&lt;p&gt;A &lt;strong&gt;geometric distribution&lt;/strong&gt; is used for modelling the chance of some event with probability &lt;em&gt;p&lt;/em&gt; &lt;strong&gt;happening after &lt;em&gt;k&lt;/em&gt; retries&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;It has the following density formula.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--U03D0Fle--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://quicklatex.com/cache3/38/ql_a1aa64858e88e4fa841ecade06d08038_l3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--U03D0Fle--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://quicklatex.com/cache3/38/ql_a1aa64858e88e4fa841ecade06d08038_l3.png" alt="" width="172" height="47"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Where &lt;em&gt;k&lt;/em&gt; &lt;strong&gt;can take any non-negative value with a positive probability&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Notice how the sum of all possible values’ probabilities still &lt;strong&gt;adds up to 1&lt;/strong&gt;.&lt;/p&gt;
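&lt;p&gt;We can check that claim numerically. With density &lt;em&gt;P(X = k) = (1-p)^k * p&lt;/em&gt; (the formula pictured above), the probabilities over a long run of &lt;em&gt;k&lt;/em&gt; values get arbitrarily close to 1 (the value of &lt;em&gt;p&lt;/em&gt; here is arbitrary):&lt;/p&gt;

```python
p = 0.3  # chance the event happens on any single try

def geom_pmf(k):
    """Probability that the event first happens after k failed retries."""
    return ((1.0 - p) ** k) * p

total = sum(geom_pmf(k) for k in range(200))
# total is within floating-point error of 1; the only missing mass is the
# vanishing chance of 200 or more consecutive failures.
```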

&lt;h4&gt;
  
  
  Continuous Random Variables
&lt;/h4&gt;

&lt;p&gt;If you said&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;X = “the length in millimeters (without rounding) of a randomly plucked hair from my head”&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Which possible values can &lt;em&gt;X&lt;/em&gt; take? We can all probably agree a negative value doesn’t make any sense here.&lt;/p&gt;

&lt;p&gt;However if you said it is exactly 1 millimeter, and not 1.1853759… or something like that, I would doubt either your measuring skills or your error reporting.&lt;/p&gt;

&lt;p&gt;A continuous random variable can take &lt;strong&gt;any value&lt;/strong&gt; in a given (continuous) interval.&lt;/p&gt;

&lt;p&gt;Therefore, if we assigned a &lt;strong&gt;non-zero probability to all of its possible values&lt;/strong&gt; , their sum would &lt;strong&gt;not add up to 1&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;To solve this, if &lt;em&gt;X&lt;/em&gt; is continuous, we set &lt;em&gt;P(X=x) = 0&lt;/em&gt; for all &lt;em&gt;x&lt;/em&gt;, and instead assign a non-zero chance to &lt;em&gt;X&lt;/em&gt; taking a value &lt;strong&gt;in a certain interval.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;To express the probability of X lying between values &lt;em&gt;a&lt;/em&gt; and &lt;em&gt;b&lt;/em&gt;, we say&lt;br&gt;
&lt;em&gt;P(a &amp;lt; X &amp;lt; b)&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;Instead of just replacing values in a density function, to get &lt;em&gt;P(a &amp;lt; X &amp;lt; b)&lt;/em&gt; for &lt;em&gt;X&lt;/em&gt; a continuous variable, you’ll integrate &lt;em&gt;X&lt;/em&gt;‘s density function from &lt;em&gt;a&lt;/em&gt; to &lt;em&gt;b&lt;/em&gt;.&lt;/p&gt;
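&lt;p&gt;For a concrete (made-up) example: take a variable with density &lt;em&gt;f(x) = 2x&lt;/em&gt; on the interval [0, 1]. Integrating it from &lt;em&gt;a&lt;/em&gt; to &lt;em&gt;b&lt;/em&gt;, here with a simple Riemann sum, gives the probability of landing in that interval:&lt;/p&gt;

```python
def density(x):
    """f(x) = 2x on [0, 1]; its integral over the whole interval is 1."""
    return 2.0 * x

def prob_between(a, b, steps=100000):
    """Approximate the integral of the density from a to b (midpoint rule)."""
    dx = (b - a) / steps
    return sum(density(a + (i + 0.5) * dx) * dx for i in range(steps))

p_lower_half = prob_between(0.0, 0.5)  # the exact answer is 0.5**2 = 0.25
p_total = prob_between(0.0, 1.0)       # a density integrates to 1 overall
```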

&lt;p&gt;Whoah, you’ve made it through the whole theory section! Here’s your reward.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--py1gYXj0--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/http://www.datastuff.tech/wp-content/uploads/2019/06/pug-690566_640-e1560733151341.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--py1gYXj0--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/http://www.datastuff.tech/wp-content/uploads/2019/06/pug-690566_640-e1560733151341.jpg" alt="A pug puppy." width="382" height="425"&gt;&lt;/a&gt;Reward puppy. Source: &lt;a href="https://pixabay.com/photos/pug-puppy-dog-animal-cute-690566/"&gt;Pixabay&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Now that you know what a probability distribution is, let’s learn about some of the most common ones!&lt;/p&gt;

&lt;h2&gt;
  
  
  Bernoulli Probability Distribution
&lt;/h2&gt;

&lt;p&gt;A Random Variable with a Bernoulli Distribution is among the simplest ones.&lt;/p&gt;

&lt;p&gt;It represents a &lt;strong&gt;binary event&lt;/strong&gt;: “this happened” vs “this didn’t happen”, and takes a value &lt;em&gt;p&lt;/em&gt; as its &lt;strong&gt;only parameter&lt;/strong&gt;, which represents the &lt;strong&gt;probability&lt;/strong&gt; that &lt;strong&gt;the event will occur&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;A random variable &lt;em&gt;B&lt;/em&gt; with a Bernoulli distribution with parameter &lt;em&gt;p&lt;/em&gt; will have the following density function:&lt;/p&gt;

&lt;p&gt;&lt;em&gt;P(B = 1) = p, P(B =0)= (1-p)&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Here &lt;em&gt;B=1&lt;/em&gt; means the event happened, and &lt;em&gt;B=0&lt;/em&gt; means it didn’t.&lt;/p&gt;

&lt;p&gt;Notice how both probabilities add up to 1, and therefore no other value for &lt;em&gt;B&lt;/em&gt; will be possible.&lt;/p&gt;
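&lt;p&gt;A quick simulation makes this concrete: sampling a Bernoulli variable many times, the frequency of 1s approaches &lt;em&gt;p&lt;/em&gt; (the value of &lt;em&gt;p&lt;/em&gt; here is arbitrary):&lt;/p&gt;

```python
import random

random.seed(42)
p = 0.7  # the distribution's only parameter

def bernoulli():
    """Return 1 if the event happened (probability p), else 0."""
    return random.choices([1, 0], weights=[p, 1.0 - p])[0]

samples = [bernoulli() for _ in range(10000)]
frequency = sum(samples) / len(samples)  # hovers around p
```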

&lt;h2&gt;
  
  
  Uniform Probability Distribution
&lt;/h2&gt;

&lt;p&gt;There are two kinds of uniform random variables: discrete and continuous ones.&lt;/p&gt;

&lt;p&gt;A &lt;strong&gt;discrete uniform distribution&lt;/strong&gt; will take a &lt;strong&gt;(finite)&lt;/strong&gt; set of values &lt;em&gt;S&lt;/em&gt;, and assign a probability of &lt;em&gt;1/n&lt;/em&gt; to each of them, where &lt;em&gt;n&lt;/em&gt; is the amount of elements in &lt;em&gt;S&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;This way, if for instance my variable &lt;em&gt;Y&lt;/em&gt; was uniform in {1,2,3}, then there’d be a 33% chance each of those values came out.&lt;/p&gt;

&lt;p&gt;A very typical case of a discrete uniform random variable is found in &lt;strong&gt;dice&lt;/strong&gt;, where a typical die has the set of values {1,2,3,4,5,6}.&lt;/p&gt;

&lt;p&gt;A &lt;strong&gt;continuous uniform distribution&lt;/strong&gt; , instead, only takes &lt;strong&gt;two values &lt;em&gt;a&lt;/em&gt; and &lt;em&gt;b&lt;/em&gt;&lt;/strong&gt; as parameters, and assigns the same density to each value in the &lt;strong&gt;interval between them&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;That means the probability of Y taking a &lt;strong&gt;value in an interval&lt;/strong&gt; (from &lt;em&gt;c&lt;/em&gt; to &lt;em&gt;d&lt;/em&gt;) is &lt;strong&gt;proportional to its length&lt;/strong&gt; versus the length of the whole interval (&lt;em&gt;b-a&lt;/em&gt;).&lt;/p&gt;

&lt;p&gt;Therefore if &lt;em&gt;Y&lt;/em&gt; is uniformly distributed between &lt;em&gt;a&lt;/em&gt; and &lt;em&gt;b&lt;/em&gt;, then&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--e9D-1L18--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://quicklatex.com/cache3/1f/ql_6428ae1ce24b50fa38db8faf6c6e211f_l3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--e9D-1L18--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://quicklatex.com/cache3/1f/ql_6428ae1ce24b50fa38db8faf6c6e211f_l3.png" alt="" width="355" height="19"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;This way, if &lt;em&gt;Y&lt;/em&gt; is a uniform random variable between 1 and 2,&lt;/p&gt;

&lt;p&gt;&lt;em&gt;P(1 &amp;lt; Y &amp;lt; 2)=1&lt;/em&gt; and &lt;em&gt;P(1 &amp;lt; Y &amp;lt; 1.5) = 0.5&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Python’s &lt;code&gt;random&lt;/code&gt; module’s &lt;code&gt;random()&lt;/code&gt; function samples a uniformly distributed continuous variable between 0 and 1.&lt;/p&gt;

&lt;p&gt;Interestingly, it can be shown that &lt;strong&gt;any other distribution&lt;/strong&gt; can be sampled given a &lt;a href="https://www.mathworks.com/help/stats/generate-random-numbers-using-the-uniform-distribution-inversion-method.html"&gt;uniform random values generator and some calculus&lt;/a&gt;.&lt;/p&gt;
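&lt;p&gt;As a minimal sketch of that idea (inverse-transform sampling): the exponential CDF is &lt;em&gt;F(x) = 1 - e^(-λx)&lt;/em&gt;, so inverting it turns uniform draws into exponential samples:&lt;/p&gt;

```python
import math
import random

# Inverse-transform sampling sketch: the exponential CDF is F(x) = 1 - exp(-lam*x),
# so solving F(x) = u gives x = -ln(1 - u) / lam for u uniform in [0, 1).
random.seed(0)
lam = 2.0  # rate parameter (an arbitrary choice for the demo)

samples = [-math.log(1 - random.random()) / lam for _ in range(100_000)]

mean = sum(samples) / len(samples)
print(round(mean, 3))  # an exponential with rate lam has mean 1/lam = 0.5
```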

&lt;h2&gt;
  
  
  Normal Probability Distribution
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--cGyW1gjM--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/http://www.datastuff.tech/wp-content/uploads/2019/06/340px-Normal_Distribution_PDF.svg_.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--cGyW1gjM--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/http://www.datastuff.tech/wp-content/uploads/2019/06/340px-Normal_Distribution_PDF.svg_.png" alt="" width="340" height="217"&gt;&lt;/a&gt;Normal Distributions. source: &lt;a href="https://en.wikipedia.org/wiki/Normal_distribution"&gt;Wikipedia&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Normally distributed variables&lt;/strong&gt; are so commonly found in nature that they’re &lt;strong&gt;&lt;em&gt;the norm&lt;/em&gt;&lt;/strong&gt;. That’s where the name comes from.&lt;/p&gt;

&lt;p&gt;If you round up all your workmates and measure their heights, or weigh them all and plot a histogram with the results, odds are it’s gonna approach a normal distribution.&lt;/p&gt;

&lt;p&gt;I actually saw this effect when I showed you &lt;a href="http://www.datastuff.tech/data-analysis/data-analysis-pandas-seaborn-kaggle-dataset/"&gt;Exploratory Data Analysis&lt;/a&gt; examples.&lt;/p&gt;

&lt;p&gt;It can also be shown that if you &lt;strong&gt;take a sample&lt;/strong&gt; of any random variable and &lt;strong&gt;average those measures&lt;/strong&gt;, and repeat that process many times, those averages will be approximately &lt;strong&gt;normally distributed&lt;/strong&gt; (more so the bigger the samples). That fact’s so important, it’s known as the &lt;a href="https://math.tutorvista.com/statistics/fundamental-theorem-of-statistics.html"&gt;Central Limit Theorem&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;Normally distributed variables:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Are &lt;strong&gt;symmetrical&lt;/strong&gt;, centered around a mean (usually called &lt;strong&gt;μ&lt;/strong&gt;).&lt;/li&gt;
&lt;li&gt;Can take &lt;strong&gt;any value on the real line&lt;/strong&gt;, but land more than two standard deviations (σ) away from the mean only about 5% of the time.&lt;/li&gt;
&lt;li&gt;Are &lt;strong&gt;literally everywhere&lt;/strong&gt;.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;More often than not, if you measure empirical data and it’s symmetrical around its mean, assuming it’s normal will work reasonably well.&lt;/p&gt;

&lt;p&gt;For example, the sum of &lt;em&gt;K&lt;/em&gt; dice rolls will be distributed approximately normally, more so as &lt;em&gt;K&lt;/em&gt; grows.&lt;/p&gt;
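&lt;p&gt;Here’s that dice experiment as a plain-Python sketch: averaging &lt;em&gt;K&lt;/em&gt; rolls many times over yields values clustered symmetrically around 3.5, as the Central Limit Theorem predicts:&lt;/p&gt;

```python
import random
import statistics

random.seed(1)
K = 10           # dice per experiment
trials = 20_000  # how many times we repeat the experiment

# Average K die rolls, many times over; by the Central Limit Theorem the
# resulting averages are approximately normally distributed around 3.5.
averages = [sum(random.randint(1, 6) for _ in range(K)) / K for _ in range(trials)]

mu = statistics.mean(averages)
sigma = statistics.stdev(averages)
print(round(mu, 2), round(sigma, 2))
# Theory: mean = 3.5, stdev = sqrt(35/12) / sqrt(K) ≈ 0.54
```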

&lt;h3&gt;
  
  
  Log-Normal Probability Distribution
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--2pCOqEqH--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/http://www.datastuff.tech/wp-content/uploads/2019/06/300px-PDF-log_normal_distributions.svg_.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--2pCOqEqH--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/http://www.datastuff.tech/wp-content/uploads/2019/06/300px-PDF-log_normal_distributions.svg_.png" alt="" width="300" height="300"&gt;&lt;/a&gt;Lognormal distribution. source: &lt;a href="https://en.wikipedia.org/wiki/Log-normal_distribution"&gt;Wikipedia&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;The log-normal probability distribution is the normal distribution’s smaller, less frequently seen sister.&lt;/p&gt;

&lt;p&gt;A variable &lt;em&gt;X&lt;/em&gt; is said to be &lt;strong&gt;log-normally distributed&lt;/strong&gt; if the variable &lt;em&gt;Y = log(X)&lt;/em&gt; follows a normal distribution.&lt;/p&gt;
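&lt;p&gt;That definition gives a direct recipe for sampling one: exponentiate normal draws (the parameter values below are arbitrary demo choices):&lt;/p&gt;

```python
import math
import random
import statistics

random.seed(7)
mu, sigma = 1.0, 0.5  # parameters of the underlying normal (arbitrary demo values)

# If Y = log(X) is normal, then X is log-normal; so exponentiating
# normal draws produces log-normal samples.
samples = [math.exp(random.gauss(mu, sigma)) for _ in range(50_000)]

logs = [math.log(x) for x in samples]
print(round(statistics.mean(logs), 2), round(statistics.stdev(logs), 2))
# taking logs recovers the underlying normal: mean ≈ 1.0, stdev ≈ 0.5
```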

&lt;p&gt;When plotted in a histogram, log-normal probability distributions are &lt;strong&gt;asymmetrical&lt;/strong&gt;, and become more so as their standard deviation grows.&lt;/p&gt;

&lt;p&gt;I believe &lt;strong&gt;lognormal&lt;/strong&gt; distributions to be worth mentioning, because &lt;strong&gt;most money-based variables&lt;/strong&gt; behave this way.&lt;/p&gt;

&lt;p&gt;If you look at the probability distribution of any variable that relates to money, like:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;The amount sent in a certain bank’s latest transfer.&lt;/li&gt;
&lt;li&gt;The volume of the latest transaction on Wall Street.&lt;/li&gt;
&lt;li&gt;A set of companies’ quarterly earnings for a given quarter.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;it will usually not follow a normal probability distribution, but will behave much more like a lognormal random variable.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;(For other Data Scientists: chime in in the comments if you can think of any other empirical lognormal variables you’ve come across in your work, especially anything outside of finance!)&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Exponential Probability Distribution
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--AvrQ4dk7--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/http://www.datastuff.tech/wp-content/uploads/2019/06/exp.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--AvrQ4dk7--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/http://www.datastuff.tech/wp-content/uploads/2019/06/exp.png" alt="" width="325" height="260"&gt;&lt;/a&gt;Source: &lt;a href="https://en.wikipedia.org/wiki/Exponential_distribution"&gt;Wikipedia&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Exponential probability distributions&lt;/strong&gt; appear everywhere, too.&lt;/p&gt;

&lt;p&gt;They are heavily linked to a Probability concept called a &lt;strong&gt;Poisson Process&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Stealing straight from Wikipedia, a &lt;a href="https://en.wikipedia.org/wiki/Poisson_point_process"&gt;Poisson Process&lt;/a&gt; is “&lt;em&gt;a process in which events occur continuously and independently at a constant average rate&lt;/em&gt;“.&lt;/p&gt;

&lt;p&gt;All that means is, if:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;You have a lot of events going on.&lt;/li&gt;
&lt;li&gt;They happen at a certain rate, which &lt;strong&gt;does not change&lt;/strong&gt; over time.&lt;/li&gt;
&lt;li&gt;One event happening doesn’t change the chances of the next one happening.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Then you have a Poisson process.&lt;/p&gt;

&lt;p&gt;Some examples could be requests coming to a server, transactions happening in a supermarket, or birds fishing in a certain lake.&lt;/p&gt;

&lt;p&gt;Imagine a Poisson Process with a frequency rate of λ (say, events happen once every second).&lt;/p&gt;

&lt;p&gt;Exponential random variables model the time it takes, after an event, for the next event to occur.&lt;/p&gt;

&lt;p&gt;Interestingly, in a Poisson Process &lt;strong&gt;an event can happen any number of times, from 0 upward&lt;/strong&gt; (&lt;em&gt;with decreasing probability&lt;/em&gt;), in any interval of time.&lt;/p&gt;

&lt;p&gt;This means there’s a &lt;strong&gt;non-zero chance that the event won’t happen, no matter how long you wait&lt;/strong&gt;. It also means it could happen a lot of times in a very short interval.&lt;/p&gt;

&lt;p&gt;In class we used to joke bus arrivals are Poisson Processes. I think the response time when you send a WhatsApp message to &lt;em&gt;some people&lt;/em&gt; also fits the criteria.&lt;/p&gt;

&lt;p&gt;However, the λ parameter &lt;strong&gt;regulates the frequency&lt;/strong&gt; of the events.&lt;/p&gt;

&lt;p&gt;It will make the &lt;strong&gt;expected time&lt;/strong&gt; it actually takes for an event to happen &lt;strong&gt;center around a certain value&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;This means if we know a taxi passes our block every 15 minutes, even though theoretically we &lt;em&gt;could&lt;/em&gt; wait for it forever, it’s extremely likely we won’t wait longer than, say, 30 minutes.&lt;/p&gt;

&lt;h3&gt;
  
  
  Exponential Probability Distribution: In Practice
&lt;/h3&gt;

&lt;p&gt;Here’s the density function for an exponential distribution random variable:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--9DxZcOhi--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://quicklatex.com/cache3/d3/ql_1491ff4bfb47a7894aa4b4021d96f4d3_l3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--9DxZcOhi--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_800/https://quicklatex.com/cache3/d3/ql_1491ff4bfb47a7894aa4b4021d96f4d3_l3.png" alt="" width="227" height="23"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Suppose you have a sample from a variable and want to see if it can be modelled with an exponentially distributed variable.&lt;/p&gt;

&lt;p&gt;The optimum &lt;strong&gt;λ parameter can be easily estimated&lt;/strong&gt; as the inverse of your sample’s average (this is the maximum-likelihood estimate).&lt;/p&gt;
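&lt;p&gt;A quick sketch of that estimator, recovering a known rate from a simulated sample:&lt;/p&gt;

```python
import random

random.seed(3)
true_lam = 4.0  # the "unknown" rate we'll try to recover

# Draw a sample from an exponential distribution with rate true_lam...
sample = [random.expovariate(true_lam) for _ in range(50_000)]

# ...then estimate lambda as the inverse of the sample mean.
estimated_lam = 1 / (sum(sample) / len(sample))
print(round(estimated_lam, 2))  # lands close to 4.0
```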

&lt;p&gt;Exponential variables are very good for modelling any probability distributions with very infrequent, but huge (and mean-breaking) &lt;strong&gt;outliers&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;This is because they can &lt;strong&gt;take any non-negative value&lt;/strong&gt; but center in smaller ones, with decreased frequency as the value grows.&lt;/p&gt;

&lt;p&gt;In a particularly &lt;strong&gt;outlier-heavy sample&lt;/strong&gt;, you may want to estimate λ from the &lt;strong&gt;median instead of the average&lt;/strong&gt; (for an exponential variable, λ = ln(2)/median), since the median is more &lt;strong&gt;robust to outliers&lt;/strong&gt;. Your mileage may vary on this one, so take it with a grain of salt.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusions
&lt;/h2&gt;

&lt;p&gt;To sum up, I think it’s important for us Data Scientists to learn the basics.&lt;/p&gt;

&lt;p&gt;Probability and Statistics may not be as flashy as &lt;a href="http://www.datastuff.tech/machine-learning/autoencoder-deep-learning-tensorflow-eager-api-keras/"&gt;Deep Learning&lt;/a&gt; or &lt;a href="http://www.datastuff.tech/machine-learning/k-means-clustering-unsupervised-learning-for-recommender-systems/"&gt;Unsupervised Machine Learning&lt;/a&gt;, but they are the &lt;strong&gt;bedrock of Data Science&lt;/strong&gt;. Especially Machine Learning.&lt;/p&gt;

&lt;p&gt;Feeding a Machine Learning model with features without knowing which distribution they follow is, in my experience, a poor choice.&lt;/p&gt;

&lt;p&gt;It’s also good to remember the &lt;strong&gt;ubiquity of Exponential and Normal Probability Distributions&lt;/strong&gt;, and the normal’s smaller counterpart, the lognormal distribution.&lt;/p&gt;

&lt;p&gt;Knowing their properties, uses and appearance is &lt;strong&gt;game-changing when training a Machine Learning model&lt;/strong&gt;. It’s also generally good to keep them in mind while doing any kind of Data Analysis.&lt;/p&gt;

&lt;p&gt;Did you find any part of this article useful? Was it all stuff you already knew? Did you learn anything new? Let me know in the comments!&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Contact me on &lt;a href="http://www.twitter.com/strikingloo"&gt;Twitter&lt;/a&gt;, &lt;a href="http://www.medium.com/@strikingloo"&gt;Medium&lt;/a&gt; or &lt;a href="http://www.dev.to/strikingloo"&gt;dev.to&lt;/a&gt; if there’s anything you don’t think was clear enough, anything you disagree with, or just anything that’s plain wrong. Don’t worry, I don’t bite.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>datascience</category>
      <category>beginners</category>
      <category>statistics</category>
      <category>datascientist</category>
    </item>
    <item>
      <title>Convolutional Neural Networks: Python Tutorial (TensorFlow Eager API)</title>
      <dc:creator>Luciano Strika</dc:creator>
      <pubDate>Wed, 12 Jun 2019 18:05:35 +0000</pubDate>
      <link>https://dev.to/strikingloo/convolutional-neural-networks-an-introduction-tensorflow-eager-4f4m</link>
      <guid>https://dev.to/strikingloo/convolutional-neural-networks-an-introduction-tensorflow-eager-4f4m</guid>
      <description>&lt;p&gt;Convolutional Neural Networks are a part of what made Deep Learning reach the headlines so often in the last decade. Today we'll train an &lt;strong&gt;image classifier&lt;/strong&gt; to tell us whether an image contains a dog or a cat, using TensorFlow's eager API.&lt;/p&gt;

&lt;p&gt;Artificial Neural Networks have disrupted several industries lately, due to their unprecedented capabilities in many areas. However, different Deep Learning architectures excel at different tasks:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Image Classification (Convolutional Neural Networks).&lt;/li&gt;
&lt;li&gt;Image, audio and text generation (GANs, RNNs).&lt;/li&gt;
&lt;li&gt;Time Series Forecasting (RNNs, LSTM).&lt;/li&gt;
&lt;li&gt;Recommendations Systems.&lt;/li&gt;
&lt;li&gt;A huge et cetera (e.g., regression).&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Today we’ll focus on the first item of the list, though each of those deserves an article of its own.&lt;/p&gt;

&lt;h2&gt;
  
  
  What are Convolutional Neural Networks?
&lt;/h2&gt;

&lt;p&gt;In MultiLayer Perceptrons (MLP), the &lt;em&gt;vanilla&lt;/em&gt; Neural Networks, each layer’s neurons connect to &lt;strong&gt;all&lt;/strong&gt; the neurons in the next layer. We call this type of layer &lt;strong&gt;fully connected&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--npsw_Q9A--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/http://www.astroml.org/_images/fig_neural_network_1.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--npsw_Q9A--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/http://www.astroml.org/_images/fig_neural_network_1.png" alt=""&gt;&lt;/a&gt;A MLP. Source: &lt;a href="http://www.astroml.org/book_figures/appendix/fig_neural_network.html"&gt;AstroML&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Convolutional Neural Networks are different: they have Convolutional Layers.&lt;/p&gt;

&lt;p&gt;In a fully connected layer, each neuron’s output will be a linear transformation of the previous layer’s outputs, composed with a non-linear activation function (e.g., &lt;em&gt;ReLU&lt;/em&gt; or &lt;em&gt;Sigmoid&lt;/em&gt;).&lt;/p&gt;

&lt;p&gt;Conversely, the output of each neuron in a &lt;strong&gt;Convolutional Layer&lt;/strong&gt; is only a function of a (typically small) &lt;strong&gt;subset&lt;/strong&gt; of the previous layer’s neurons.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--XXatDYus--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://ds055uzetaobb.cloudfront.net/brioche/uploads/MDyKhb5tXY-1_hbp1vrfewnareprrlnxtqq2x.png%3Fwidth%3D1200" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--XXatDYus--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://ds055uzetaobb.cloudfront.net/brioche/uploads/MDyKhb5tXY-1_hbp1vrfewnareprrlnxtqq2x.png%3Fwidth%3D1200" alt=""&gt;&lt;/a&gt; Source: &lt;a href="https://brilliant.org/wiki/convolutional-neural-network/"&gt;Brilliant&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Outputs on a Convolutional Layer will be the result of applying a &lt;strong&gt;convolution&lt;/strong&gt; to a subset of the previous layer’s neurons, and then an activation function.&lt;/p&gt;

&lt;h3&gt;
  
  
  What is a convolution?
&lt;/h3&gt;

&lt;p&gt;The convolution operation, given an input matrix &lt;em&gt;A&lt;/em&gt; (usually the previous layer’s values) and a (typically much smaller) weight matrix called a &lt;strong&gt;kernel&lt;/strong&gt; or &lt;strong&gt;filter&lt;/strong&gt; &lt;em&gt;K&lt;/em&gt;, will output a new matrix &lt;em&gt;B&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--PZ8zI0AQ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/800/1%2A4yv0yIH0nVhSOv3AkLUIiw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--PZ8zI0AQ--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/800/1%2A4yv0yIH0nVhSOv3AkLUIiw.png" alt=""&gt;&lt;/a&gt;by @&lt;a href="https://medium.com/@RaghavPrabhu/understanding-of-convolutional-neural-network-cnn-deep-learning-99760835f148"&gt;RaghavPrabhu&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;If &lt;em&gt;K&lt;/em&gt; is a &lt;em&gt;CxC&lt;/em&gt; matrix, the first element in &lt;em&gt;B&lt;/em&gt; will be the result of:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Taking the first &lt;em&gt;CxC&lt;/em&gt; submatrix of &lt;em&gt;A&lt;/em&gt;.&lt;/li&gt;
&lt;li&gt;Multiplying each of its elements by its corresponding weight in &lt;em&gt;K&lt;/em&gt;.&lt;/li&gt;
&lt;li&gt;Adding all the products.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These two last steps are equivalent to flattening both A's submatrix and K, and computing the dot product of the resulting vectors.&lt;/p&gt;

&lt;p&gt;We then slide K to the right to get the next element, and so on, repeating this process for each of &lt;em&gt;A&lt;/em&gt;‘s rows.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--S-S1Smhf--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_880/https://cdn-images-1.medium.com/max/800/1%2AMrGSULUtkXc0Ou07QouV8A.gif" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--S-S1Smhf--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_66%2Cw_880/https://cdn-images-1.medium.com/max/800/1%2AMrGSULUtkXc0Ou07QouV8A.gif" alt=""&gt;&lt;/a&gt;Convolution visualization by @&lt;a href="https://medium.com/@RaghavPrabhu/understanding-of-convolutional-neural-network-cnn-deep-learning-99760835f148"&gt;RaghavPrabhu&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Depending on what we want, we could start the kernel only at the &lt;em&gt;Cth&lt;/em&gt; row and column, to avoid “going out of bounds”, or assume all elements “outside &lt;em&gt;A&lt;/em&gt;” have a certain default value (typically 0). This choice defines whether &lt;em&gt;B&lt;/em&gt;’s size is smaller than &lt;em&gt;A&lt;/em&gt;’s or the same.&lt;/p&gt;

&lt;p&gt;As you can see, if &lt;em&gt;A&lt;/em&gt; was an &lt;em&gt;NxM&lt;/em&gt; matrix, each neuron’s value in &lt;em&gt;B&lt;/em&gt; won’t depend on &lt;em&gt;N*M&lt;/em&gt; weights, but only on &lt;em&gt;C*C&lt;/em&gt; (far fewer) of them. This makes a convolutional layer much lighter than a fully connected one, helping convolutional models learn a lot faster.&lt;/p&gt;
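&lt;p&gt;The sliding procedure above can be sketched in a few lines of NumPy. This is a minimal “valid” convolution (no padding, and no kernel flip, matching how CNN layers apply their filters); the matrices are toy values for illustration:&lt;/p&gt;

```python
import numpy as np

def convolve2d_valid(A, K):
    """Slide kernel K over matrix A ("valid" mode: no padding),
    taking a weighted sum at each position."""
    n, m = A.shape
    c = K.shape[0]  # assume a square c x c kernel
    B = np.zeros((n - c + 1, m - c + 1))
    for i in range(B.shape[0]):
        for j in range(B.shape[1]):
            # multiply the c x c submatrix elementwise by K, then add it all up
            B[i, j] = np.sum(A[i:i + c, j:j + c] * K)
    return B

A = np.array([[1, 2, 3],
              [4, 5, 6],
              [7, 8, 9]], dtype=float)
K = np.array([[1, 0],
              [0, 1]], dtype=float)  # adds each element to its lower-right neighbor

print(convolve2d_valid(A, K))
# [[ 6.  8.]
#  [12. 14.]]
```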

&lt;p&gt;Granted, we will end up using many kernels on each layer (getting a stack of matrices as each layer’s output). However, that will still require far fewer weights than our good old MLP.&lt;/p&gt;

&lt;h3&gt;
  
  
  Why does this work?
&lt;/h3&gt;

&lt;p&gt;Why can we &lt;strong&gt;ignore&lt;/strong&gt; how each neuron affects most of the others? Well, this whole system holds up on the premise that each neuron is &lt;strong&gt;strongly affected by its “neighbors”&lt;/strong&gt;. Faraway neurons, however, have only a small bearing on it.&lt;/p&gt;

&lt;p&gt;This assumption is &lt;strong&gt;intuitively true in images&lt;/strong&gt;: if we think of the input layer, each neuron will be a pixel or a pixel’s RGB value. And that’s part of the reason why this approach works so well for image classification.&lt;/p&gt;

&lt;p&gt;For example, if I take a region of a picture where there’s a blue sky, it’s likely that nearby regions will show the sky as well, using similar tones.&lt;/p&gt;

&lt;p&gt;A pixel’s neighbors will usually have similar RGB values to it. If they don’t, then that probably means we are on the edge of a figure or object.&lt;/p&gt;

&lt;p&gt;If you do some convolutions with pen and paper (or a calculator), you’ll realize certain kernels will increase an input’s intensity if it’s on a certain kind of edge. In other edges, they could decrease it.&lt;/p&gt;

&lt;p&gt;As an example, let’s consider the following kernels &lt;em&gt;V&lt;/em&gt; and &lt;em&gt;H&lt;/em&gt;:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--3siJzt3_--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://quicklatex.com/cache3/65/ql_ba21e5c0e8d0bca8495df438cd2a7f65_l3.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--3siJzt3_--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://quicklatex.com/cache3/65/ql_ba21e5c0e8d0bca8495df438cd2a7f65_l3.png" alt=""&gt;&lt;/a&gt;Filters for vertical and horizontal edges&lt;/p&gt;

&lt;p&gt;&lt;em&gt;V&lt;/em&gt; filters vertical edges (where colors above are very different from colors below), whereas &lt;em&gt;H&lt;/em&gt; filters horizontal edges. Notice how one is the transposed version of the other.&lt;/p&gt;
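&lt;p&gt;The exact numbers in the picture aside, kernels of this shape can be sketched with standard Prewitt-style values (an assumption on my part, not necessarily the ones shown above). Applying them to a tiny synthetic image with a hard top/bottom boundary shows &lt;em&gt;V&lt;/em&gt; firing along the edge while &lt;em&gt;H&lt;/em&gt; stays silent:&lt;/p&gt;

```python
import numpy as np

# Prewitt-style edge kernels (assumed values; the article's image may differ).
V = np.array([[ 1,  1,  1],
              [ 0,  0,  0],
              [-1, -1, -1]])  # responds where rows above differ from rows below
H = V.T                       # transposed: responds where left differs from right

image = np.array([[0, 0, 0, 0],
                  [0, 0, 0, 0],
                  [1, 1, 1, 1],
                  [1, 1, 1, 1]])  # hard boundary between the top and bottom half

def convolve(A, K):
    # plain "valid" convolution, same sliding-window idea as before
    c = K.shape[0]
    out = np.zeros((A.shape[0] - c + 1, A.shape[1] - c + 1))
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] = np.sum(A[i:i + c, j:j + c] * K)
    return out

print(convolve(image, V))  # strong (nonzero) response along the boundary
print(convolve(image, H))  # all zeros: no left-right variation in this image
```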

&lt;h3&gt;
  
  
  Convolutions by example
&lt;/h3&gt;

&lt;p&gt;Here’s an unfiltered picture of a litter of kittens:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--QlTo9kC6--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/http://www.datastuff.tech/wp-content/uploads/2019/06/cat.1093.jpg" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--QlTo9kC6--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/http://www.datastuff.tech/wp-content/uploads/2019/06/cat.1093.jpg" alt="A cute kitten litter for image preprocessing."&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Here’s what happens if we apply the horizontal and vertical edge filters, respectively:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--h5VWAdJ4--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/http://www.datastuff.tech/wp-content/uploads/2019/06/imgonline-com-ua-twotoone-8mRNYq0lXpVgvgF-e1560317743178.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--h5VWAdJ4--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/http://www.datastuff.tech/wp-content/uploads/2019/06/imgonline-com-ua-twotoone-8mRNYq0lXpVgvgF-e1560317743178.png" alt="kittens after horizontal and vertical convolutional edge filters"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;We can see how some features become a lot more noticeable, whereas others fade away. Interestingly, each filter showcases different features.&lt;/p&gt;

&lt;p&gt;This is how Convolutional Neural Networks learn to identify features in an image. Letting it fit its own kernel weights is a lot easier than any manual approach. Imagine trying to figure out how you should express the relationship between pixels… by hand!&lt;/p&gt;

&lt;p&gt;To really grasp what each convolution does to a picture, I strongly recommend you play around on &lt;a href="http://setosa.io/ev/image-kernels/"&gt;this website&lt;/a&gt;. It helped me more than any book or tutorial could. Go ahead, bookmark it. It’s fun.&lt;/p&gt;

&lt;p&gt;Alright, you’ve learned some theory already. Now let’s move on to the practical part.&lt;/p&gt;

&lt;h2&gt;
  
  
  How do you train a Convolutional Neural Network in TensorFlow?
&lt;/h2&gt;

&lt;p&gt;TensorFlow is Python’s most popular Deep Learning framework. I’ve heard good things about PyTorch too, though I’ve never had the chance to try it.&lt;/p&gt;

&lt;p&gt;I’ve already written one tutorial on &lt;a href="http://www.datastuff.tech/machine-learning/autoencoder-deep-learning-tensorflow-eager-api-keras/"&gt;how to train a Neural Network with TensorFlow’s Eager API&lt;/a&gt;, focusing on AutoEncoders.&lt;/p&gt;

&lt;p&gt;Today will be different: we will try three different architectures, and see which one does better. As usual, all the code is available on &lt;a href="https://github.com/StrikingLoo/Cats-and-dogs-classifier-tensorflow-CNN"&gt;GitHub&lt;/a&gt;, so you can try everything out for yourself or follow along. Of course I’ll also be showing you Python snippets.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Dataset
&lt;/h3&gt;

&lt;p&gt;We will be training a neural network to predict whether an image contains a dog or a cat. To do this we’ll use Kaggle’s &lt;a href="https://www.kaggle.com/c/dogs-vs-cats"&gt;cats and dogs Dataset&lt;/a&gt;. It contains 12500 pictures of cats and 12500 of dogs, with different resolutions.&lt;/p&gt;

&lt;h3&gt;
  
  
  Loading and Preprocessing our Image Data with NumPy
&lt;/h3&gt;

&lt;p&gt;A neural network receives a features vector or matrix as an input, typically with &lt;strong&gt;fixed dimensions&lt;/strong&gt;. How do we generate that from our pictures?&lt;/p&gt;

&lt;p&gt;Lucky for us, Python’s PIL imaging library provides an easy way to load an image as a NumPy array: a HeightxWidth matrix of RGB values.&lt;br&gt;&lt;br&gt;
We already did that in &lt;a href="https://dev.to/strikingloo/k-means-clustering-with-dask-image-filters-for-pictures-of-kittens-ip7"&gt;this article&lt;/a&gt;, so I’ll just reuse that code.&lt;/p&gt;

&lt;p&gt;However, we still have to deal with the fixed-dimensions part: which dimensions do we choose for our input layer? This is important, since we will have to resize every picture to the chosen resolution, and we don’t want to distort aspect ratios so much that it adds noise for the network.&lt;/p&gt;

&lt;p&gt;Here’s how we can see what the most common shape is in our dataset.&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;
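&lt;p&gt;The gist isn’t embedded here, so as a stand-in, here’s a hedged sketch of the idea: collect every picture’s shape and count the most frequent one. The &lt;code&gt;most_common_shape&lt;/code&gt; helper and the sample list are made up for illustration; the real script would gather shapes with something like &lt;code&gt;Image.open(path).size&lt;/code&gt; over the files.&lt;/p&gt;

```python
from collections import Counter

def most_common_shape(shapes):
    """Given a list of (width, height) tuples, return the most frequent one."""
    return Counter(shapes).most_common(1)[0][0]

# A made-up sample standing in for the shapes of the first 1000 pictures:
sampled_shapes = [(500, 375)] * 420 + [(499, 375)] * 80 + [(320, 240)] * 60
print(most_common_shape(sampled_shapes))  # (500, 375)
```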


&lt;p&gt;I sampled the first 1000 pictures for this, though the result did not change when I looked at 5000. The most common shape was 375×500, though I decided to divide that by 4 for our network’s input.&lt;/p&gt;

&lt;p&gt;This is what our image loading code looks like now.&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;
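&lt;p&gt;Since that gist isn’t embedded either, here’s a rough, hypothetical sketch of the preprocessing step. It downsamples by striding instead of a proper PIL resize, and scales pixels to [0, 1]; the &lt;code&gt;preprocess&lt;/code&gt; name and target size are my assumptions, not the original code.&lt;/p&gt;

```python
import numpy as np

# Chosen input size: the most common shape (375x500) divided by 4.
TARGET_H, TARGET_W = 375 // 4, 500 // 4  # ≈ 93 x 125

def preprocess(img):
    """Crudely downsample an image array by striding and scale pixel values
    to [0, 1]. (The real code would resize with PIL instead of striding.)"""
    h, w = img.shape[0], img.shape[1]
    small = img[:: max(1, h // TARGET_H), :: max(1, w // TARGET_W)]
    small = small[:TARGET_H, :TARGET_W]  # crop any remainder
    return small.astype(np.float32) / 255.0

# A fake 375x500 RGB image standing in for a loaded picture:
fake_img = np.random.randint(0, 256, size=(375, 500, 3), dtype=np.uint8)
out = preprocess(fake_img)
print(out.shape)  # (93, 125, 3)
```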
 

&lt;p&gt;Finally, you can load the data with this snippet. I chose to use a sample of 4096 pictures for the training set and 1024 for validation. However, that’s just because my PC couldn’t handle much more due to RAM size.&lt;/p&gt;

&lt;p&gt;Feel free to increase these numbers to the max (like 10K for training and 2500 for validation) if you try this at home!&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;
 
&lt;h3&gt;
  
  
  Training our Neural Networks
&lt;/h3&gt;

&lt;p&gt;First of all, as a sort of baseline, let’s see how well a normal &lt;strong&gt;MLP&lt;/strong&gt; does on this task. If Convolutional Neural Networks are so revolutionary, I’d expect this experiment’s results to be &lt;strong&gt;terrible&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;So here’s a single hidden layer fully connected neural network.&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;


&lt;p&gt;All the models for this article were trained using AdamOptimizer, since in my experience it converges fastest. I only tuned the learning rate per model (here it was 1e-5).&lt;/p&gt;

&lt;p&gt;I trained this model for 10 epochs, and it basically converged to &lt;strong&gt;random guessing&lt;/strong&gt;. I made sure to &lt;strong&gt;shuffle the training data&lt;/strong&gt;, since I loaded it in order and that could’ve biased the model.&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;


&lt;p&gt;I used &lt;strong&gt;MSE&lt;/strong&gt; as loss function, since it’s usually &lt;strong&gt;more intuitive to interpret&lt;/strong&gt;. If your MSE is 0.5 in binary classification, you’re as good as &lt;strong&gt;always predicting 0&lt;/strong&gt;. However, MLPs with more layers, or different loss functions &lt;strong&gt;did not perform better&lt;/strong&gt;.&lt;/p&gt;

&lt;h3&gt;
  
  
  Training a Convolutional Neural Network
&lt;/h3&gt;

&lt;p&gt;How much good can a single convolutional layer do? Let’s see.&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;
&lt;p&gt;For this network, I decided to add a single convolutional layer (with 24 kernels), followed by 2 fully connected layers.&lt;/p&gt;

&lt;p&gt;All Max Pooling does is reduce each 2×2 block of neurons to a single one, keeping the highest value among the four.&lt;/p&gt;

&lt;p&gt;After only 5 epochs, it was already &lt;strong&gt;performing much better&lt;/strong&gt; than the previous networks. With a validation MSE of 0.36, it was a lot better than random guessing. Notice however that I had to use a &lt;strong&gt;much smaller learning rate&lt;/strong&gt;. Also, even though it learned in fewer epochs, &lt;strong&gt;each epoch&lt;/strong&gt; took &lt;strong&gt;much longer&lt;/strong&gt;. The model is also quite a lot heavier (200+ MB).&lt;/p&gt;

&lt;p&gt;I decided to also start measuring the Pearson correlation between predictions and validation labels. This model scored a 15.2%.&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;
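&lt;p&gt;For reference, the Pearson correlation itself is a one-liner with NumPy (the prediction and label vectors below are made-up illustration values, not the model’s actual outputs):&lt;/p&gt;

```python
import numpy as np

# Pearson correlation between the model's raw predictions and the 0/1 labels,
# via NumPy's correlation-matrix helper.
def pearson(preds, labels):
    return np.corrcoef(preds, labels)[0, 1]

labels = np.array([1, 0, 1, 1, 0, 0, 1, 0])
preds  = np.array([0.9, 0.2, 0.6, 0.7, 0.4, 0.1, 0.8, 0.3])

print(round(pearson(preds, labels), 3))  # ≈ 0.913 for these toy values
```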
 
&lt;h3&gt;
  
  
  Neural Network with two Convolutional Layers
&lt;/h3&gt;

&lt;p&gt;Since that model had done so much better, I decided I would try out a bigger one. I added &lt;strong&gt;another convolutional layer&lt;/strong&gt;, and made both a lot bigger (48 kernels each). This means the model gets to learn &lt;strong&gt;more complex features&lt;/strong&gt; from the images. However it also predictably meant my RAM almost exploded. Also, training took &lt;strong&gt;a lot longer&lt;/strong&gt; (half an hour for 15 epochs).&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;
 

&lt;p&gt;Results were superb. The Pearson correlation coefficient between predictions and labels reached 0.21, with validation MSE reaching as low as 0.33.&lt;/p&gt;

&lt;p&gt;Let’s measure the network’s accuracy. Since 1 is a cat and 0 is a dog, I could say “If the model predicts a value higher than some threshold t, then predict &lt;em&gt;cat&lt;/em&gt;. Else predict &lt;em&gt;dog&lt;/em&gt;.” After trying 10 straightforward thresholds, this network had a &lt;strong&gt;maximum accuracy of 61%&lt;/strong&gt;.&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;
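&lt;p&gt;That threshold sweep can be sketched like this (toy predictions and labels; the real numbers come from the trained network):&lt;/p&gt;

```python
import numpy as np

# Try 10 straightforward cutoffs and keep the one that classifies
# the most validation examples correctly.
def best_threshold_accuracy(preds, labels):
    best = 0.0
    for t in np.linspace(0.05, 0.95, 10):
        acc = np.mean((preds > t).astype(int) == labels)
        best = max(best, acc)
    return best

labels = np.array([1, 1, 1, 0, 0, 0, 1, 0])
preds  = np.array([0.8, 0.6, 0.4, 0.3, 0.2, 0.55, 0.7, 0.1])

print(best_threshold_accuracy(preds, labels))  # 0.875 on these toy values
```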


&lt;h3&gt;
  
  
  Even bigger Convolutional Neural Network
&lt;/h3&gt;

&lt;p&gt;Since clearly adding size to the model made it learn better, I tried making both convolutional layers a lot bigger, with 128 filters each. I left the rest of the model untouched, and didn’t change the learning rate.&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;


&lt;p&gt;This model finally reached a correlation of 30%! Its best &lt;strong&gt;accuracy was 67%&lt;/strong&gt;, which means it was right two thirds of the time. I assume an even bigger model could’ve fit the data even better. However, this one was taking 7 minutes per epoch already, and I didn’t want to leave the next one training all morning.&lt;/p&gt;

&lt;p&gt;Usually, there’s a &lt;strong&gt;tradeoff&lt;/strong&gt; to be made between a model’s &lt;strong&gt;size&lt;/strong&gt; and &lt;strong&gt;time constraints&lt;/strong&gt;. Size limits how well the network can fit the data (a &lt;strong&gt;small model&lt;/strong&gt; will &lt;strong&gt;underfit&lt;/strong&gt;), but I won’t wait 3 hours for my model to learn.&lt;/p&gt;

&lt;p&gt;The same concerns may apply if you have a business deadline.&lt;/p&gt;

&lt;h2&gt;
  
  
  Conclusions
&lt;/h2&gt;

&lt;p&gt;We’ve seen Convolutional Neural Networks are &lt;strong&gt;significantly better&lt;/strong&gt; than vanilla architectures at &lt;strong&gt;image classification&lt;/strong&gt; tasks. We also tried different &lt;strong&gt;metrics&lt;/strong&gt; to measure &lt;strong&gt;model performance&lt;/strong&gt; (correlation, accuracy).&lt;/p&gt;

&lt;p&gt;We learned about the &lt;strong&gt;tradeoff&lt;/strong&gt; between a &lt;strong&gt;model’s size&lt;/strong&gt; (which prevents underfitting) and its &lt;strong&gt;convergence speed&lt;/strong&gt;.&lt;/p&gt;

&lt;p&gt;Lastly, we used TensorFlow’s eager API to easily train a Deep Neural Network, and NumPy for our (albeit simple) image preprocessing.&lt;/p&gt;

&lt;p&gt;For future articles, I believe we could experiment a lot more with different pooling layers, filter sizes, striding and a different preprocessing for this same task.&lt;/p&gt;

&lt;p&gt;Did you find this article useful? Would you have preferred to learn more about anything else? Is anything not clear enough? Let me know in the comments!&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Find me on &lt;a href="http://www.twitter.com/strikingloo"&gt;Twitter&lt;/a&gt;, &lt;a href="http://www.medium.com/@strikingloo"&gt;Medium&lt;/a&gt; or &lt;a href="http://www.dev.to/strikingloo"&gt;Dev.to&lt;/a&gt; if you have any questions, or want to contact me for anything. If you want to start a career in Machine Learning, here’s my &lt;a href="http://www.datastuff.tech/data-science/3-machine-learning-books-that-helped-me-level-up-as-a-data-scientist/"&gt;recommended reading list&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>machinelearning</category>
      <category>beginners</category>
      <category>deeplearning</category>
      <category>python</category>
    </item>
  </channel>
</rss>
