<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Pradeep Chhetri</title>
    <description>The latest articles on DEV Community by Pradeep Chhetri (@p_chhetri).</description>
    <link>https://dev.to/p_chhetri</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F177313%2Fb6dc5e92-66e2-47b6-9e5e-6a094999a719.jpg</url>
      <title>DEV Community: Pradeep Chhetri</title>
      <link>https://dev.to/p_chhetri</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/p_chhetri"/>
    <language>en</language>
    <item>
      <title>CSE138 Lecture 2 Notes</title>
      <dc:creator>Pradeep Chhetri</dc:creator>
      <pubDate>Sun, 19 Apr 2020 06:07:41 +0000</pubDate>
      <link>https://dev.to/p_chhetri/cse138-lecture-2-notes-53md</link>
      <guid>https://dev.to/p_chhetri/cse138-lecture-2-notes-53md</guid>
      <description>&lt;h3&gt;
  
  
  CSE138 Lecture 2 Notes
&lt;/h3&gt;

&lt;h3&gt;
  
  
  What is a Distributed System ?
&lt;/h3&gt;

&lt;p&gt;According to Leslie Lamport:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;A system where I can’t get my work done because some computer which I never heard of crashed.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;According to Martin Kleppmann:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;A system running on several nodes and characterised by &lt;em&gt;partial failures.&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h4&gt;
  
  
  What is Partial Failure ?
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Machine crashing&lt;/li&gt;
&lt;li&gt;Network failure&lt;/li&gt;
&lt;li&gt;Messages being dropped&lt;/li&gt;
&lt;li&gt;Software misbehaviour&lt;/li&gt;
&lt;/ul&gt;

&lt;blockquote&gt;
&lt;p&gt;&lt;em&gt;Part of the computation happens while another part doesn’t.&lt;/em&gt;&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;In the event of such a partial failure, you don’t want the whole system to degrade. Your system must continue working.&lt;/p&gt;

&lt;h4&gt;
  
  
  Cloud Computing vs HPC Philosophy
&lt;/h4&gt;

&lt;p&gt;HPC Philosophy:&lt;/p&gt;

&lt;p&gt;&lt;em&gt;Treat partial failure as total failure i.e. if something does fail, the computation starts over completely from scratch. To avoid losing all progress, this relies on checkpointing (every so often the progress is saved, and if a failure happens, the system rolls back to the last checkpoint).&lt;/em&gt;&lt;/p&gt;

&lt;p&gt;Cloud Computing Philosophy:&lt;/p&gt;

&lt;p&gt;&lt;em&gt;It involves working around such partial failures and expecting those kinds of failures.&lt;/em&gt;&lt;/p&gt;

&lt;h4&gt;
  
  
  Two Nodes Scenario
&lt;/h4&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--JwuHNr0Y--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1024/1%2AEgwnP-Wwr3id7Y4ZNtgTdg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--JwuHNr0Y--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1024/1%2AEgwnP-Wwr3id7Y4ZNtgTdg.png" alt=""&gt;&lt;/a&gt;System of two machines&lt;/p&gt;

&lt;p&gt;Possible Failure Scenarios:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Request from &lt;em&gt;Machine A&lt;/em&gt; gets lost: maybe someone removed the cable.&lt;/li&gt;
&lt;li&gt;Request from &lt;em&gt;Machine A&lt;/em&gt; is slow and &lt;em&gt;Machine B&lt;/em&gt; never receives it: network congestion or some sort of message queue either on &lt;em&gt;Machine A&lt;/em&gt; side or &lt;em&gt;Machine B&lt;/em&gt; side. &lt;em&gt;Machine A&lt;/em&gt; thought that it sent the request but &lt;em&gt;Machine B&lt;/em&gt; never received it.&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;Machine B&lt;/em&gt; crashed: &lt;em&gt;Machine A&lt;/em&gt; sent the message and &lt;em&gt;Machine B&lt;/em&gt; received the message and crashed.&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;Machine B&lt;/em&gt; is slow in processing.&lt;/li&gt;
&lt;li&gt;Response from &lt;em&gt;Machine B&lt;/em&gt; is slow and &lt;em&gt;Machine A&lt;/em&gt; never receives it.&lt;/li&gt;
&lt;li&gt;Response from &lt;em&gt;Machine B&lt;/em&gt; gets lost.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Does &lt;em&gt;Machine A&lt;/em&gt; have any way to distinguish between these situations ? No.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;If you send a request to another machine and don’t receive a response, it is impossible to know WHY (without having global knowledge of the system).&lt;/p&gt;
&lt;/blockquote&gt;

&lt;p&gt;Other possible failure scenarios:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;em&gt;Machine B&lt;/em&gt; is lying or refusing to answer.&lt;/li&gt;
&lt;li&gt;Cosmic rays flipping bits.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;These kinds of scenarios are called &lt;a href="https://en.wikipedia.org/wiki/Byzantine_fault"&gt;Byzantine Faults&lt;/a&gt;.&lt;/p&gt;

&lt;p&gt;How do real systems deal with it ?&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;When a machine sends a message, it needs some sort of timeout: if the message gets no response within that time, give up and assume failure.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Why might it be a mistake to assume failure due to a timeout ?&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--iPHR7UVj--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1024/1%2ArNBjv_0AY7y3LsVJmkHD0w.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--iPHR7UVj--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1024/1%2ArNBjv_0AY7y3LsVJmkHD0w.png" alt=""&gt;&lt;/a&gt;System of two machines&lt;/p&gt;

&lt;p&gt;In this case, there is a possibility that the value of x gets incremented twice: &lt;em&gt;Machine A&lt;/em&gt; asked &lt;em&gt;Machine B&lt;/em&gt; to increment x a second time because it never got the &lt;em&gt;ok&lt;/em&gt; for the first message and the timeout expired.&lt;/p&gt;
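&lt;p&gt;This retry hazard can be sketched in a few lines of Python (a toy model for illustration, not real networking code): if only the acknowledgement is lost, the retry makes the non-idempotent increment happen twice.&lt;/p&gt;

```python
class MachineB:
    """Toy server: increments x on request and replies 'ok'."""
    def __init__(self):
        self.x = 0

    def handle(self, msg):
        if msg == "increment x":
            self.x += 1
        return "ok"

def send_with_retry(server, msg, ack_lost=True):
    """Toy Machine A: resends the request when the ack is lost.
    The server already processed the first request, so the
    retry applies the increment a second time."""
    server.handle(msg)       # first attempt: processed, but...
    if ack_lost:             # ...the 'ok' never makes it back
        server.handle(msg)   # timeout fires, A retries
    return "ok"

b = MachineB()
send_with_retry(b, "increment x")
print(b.x)  # 2 -- incremented twice although A intended once
```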

&lt;p&gt;Let’s assume the following:&lt;/p&gt;

&lt;p&gt;Maximum delay between &lt;em&gt;Machine A&lt;/em&gt; and &lt;em&gt;Machine B (and vice-versa)&lt;/em&gt; is  &lt;strong&gt;&lt;em&gt;d&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Maximum time processing a request is  &lt;strong&gt;&lt;em&gt;r&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;How long should &lt;em&gt;Machine A&lt;/em&gt; wait ? &lt;strong&gt;&lt;em&gt;2d + r&lt;/em&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;This rules out the uncertainty due to slowness, but still leaves other kinds of uncertainty to deal with. Furthermore, most of the time we don’t have this sort of guarantee on the maximum delay; we can only make assumptions about it.&lt;/p&gt;

&lt;p&gt;According to Prof. Peter Alvaro:&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;In distributed systems, not only do we have to deal with partial failures but we also have to deal with unbounded latency.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h4&gt;
  
  
  Why do we want to have a distributed system ?
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;Data too big to fit on a single machine.&lt;/li&gt;
&lt;li&gt;You want things to be faster even though the data can fit on a single machine.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Time and Clocks
&lt;/h3&gt;

&lt;h4&gt;
  
  
  What are clocks for ?
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;em&gt;Mark points in time&lt;/em&gt;: E.g. this item in my browser cache will expire on April 10, 2020 at 08:00 hours.&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;Durations or intervals of time:&lt;/em&gt; E.g. this user spent 4 minutes and 55 seconds on our website.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Computers have two types of clocks:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;em&gt;time-of-day clocks&lt;/em&gt;: tell you what time it is. They are usually synchronised between machines using NTP (Network Time Protocol). They are bad for measuring &lt;em&gt;durations or intervals&lt;/em&gt; because time-of-day clocks can jump backward, e.g. due to daylight saving time or leap seconds. They are okayish (but not great) for &lt;em&gt;marking points in time&lt;/em&gt; because clock synchronisation is only so accurate, and we often need finer-grained resolution to prevent certain kinds of bugs. Hence we aren’t going to use them much in distributed systems.&lt;/li&gt;
&lt;li&gt;
&lt;em&gt;monotonic clocks:&lt;/em&gt; they only ever move forward, i.e. they’re a certain kind of counter; maybe it counts the milliseconds since the machine restarted. In Python, the &lt;a href="https://docs.python.org/3/library/time.html"&gt;time&lt;/a&gt; module gives you such a counter via time.monotonic(). It’s completely useless for &lt;em&gt;marking points in time&lt;/em&gt; but it’s good for measuring &lt;em&gt;durations or intervals of time.&lt;/em&gt; We can use these types of clocks to implement &lt;em&gt;timeouts.&lt;/em&gt;
&lt;/li&gt;
&lt;/ul&gt;
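&lt;p&gt;A minimal sketch of a timeout built on the monotonic clock, using the &lt;em&gt;2d + r&lt;/em&gt; bound from above (the values of d and r and the polling loop are illustrative assumptions, not real networking code):&lt;/p&gt;

```python
import time

def wait_for_response(poll, d=0.2, r=0.1):
    """Wait up to 2d + r seconds for a response.
    time.monotonic() only ever moves forward, so the deadline
    is immune to NTP adjustments and leap seconds.
    `poll` returns the response, or None if nothing arrived yet."""
    deadline = time.monotonic() + 2 * d + r
    while deadline > time.monotonic():
        resp = poll()
        if resp is not None:
            return resp
        time.sleep(0.01)
    return None  # timed out: assume failure (possibly wrongly!)

print(wait_for_response(lambda: "ok"))              # ok
print(wait_for_response(lambda: None, 0.01, 0.01))  # None
```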

&lt;p&gt;Check out the &lt;a href="https://blog.cloudflare.com/how-and-why-the-leap-second-affected-cloudflare-dns/"&gt;Cloudflare blog on the leap second&lt;/a&gt;. They tried implementing timeouts using the time-of-day clock.&lt;/p&gt;

&lt;p&gt;Both of these kinds of clocks are &lt;strong&gt;physical clocks&lt;/strong&gt;, but in distributed systems we need a different notion of clocks: &lt;strong&gt;logical clocks.&lt;/strong&gt; Logical clocks don’t measure the time of day or elapsed time; instead they only measure the ordering of events (which event happened before another).&lt;/p&gt;

&lt;p&gt;Suppose A happened before B.&lt;/p&gt;

&lt;p&gt;A — — — → B&lt;/p&gt;

&lt;p&gt;What does it tell us ?&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;A &lt;em&gt;could&lt;/em&gt; have caused B.&lt;/li&gt;
&lt;li&gt;B &lt;em&gt;could not&lt;/em&gt; have caused A.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This notion of potential causality is very important. Why ?&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Debugging: Figuring out possible causes of bug.&lt;/li&gt;
&lt;li&gt;Designing systems.&lt;/li&gt;
&lt;/ul&gt;
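&lt;p&gt;As a concrete (if simplified) illustration, here is a Lamport-style logical clock in Python, sketched as an addition to these notes: it counts events rather than seconds, so if A happened before B, then B’s timestamp is greater than A’s.&lt;/p&gt;

```python
class LamportClock:
    """Logical clock: measures ordering of events, not elapsed time."""
    def __init__(self):
        self.t = 0

    def local_event(self):
        self.t += 1
        return self.t

    def send(self):
        self.t += 1
        return self.t            # timestamp piggybacked on the message

    def receive(self, msg_t):
        # jump ahead of the sender's timestamp, then tick
        self.t = max(self.t, msg_t) + 1
        return self.t

a, b = LamportClock(), LamportClock()
t_send = a.send()           # event on machine A
t_recv = b.receive(t_send)  # causally later event on machine B
print(t_send, t_recv)       # 1 2 -- the receive is ordered after the send
```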

&lt;h4&gt;
  
  
  Resources:
&lt;/h4&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://www.youtube.com/watch?v=G0wpsacaYpE"&gt;Lecture Video&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;em&gt;Thank you Prof. Lindsey Kuper for keeping the lectures online.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>distributedsystems</category>
    </item>
    <item>
      <title>Extending Python with Rust</title>
      <dc:creator>Pradeep Chhetri</dc:creator>
      <pubDate>Wed, 01 May 2019 17:37:44 +0000</pubDate>
      <link>https://dev.to/p_chhetri/extending-python-with-rust-4pna</link>
      <guid>https://dev.to/p_chhetri/extending-python-with-rust-4pna</guid>
      <description>&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s---EbjGIA9--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/800/1%2Am9mLGeyjmV7Flq-t5MA62g.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s---EbjGIA9--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/800/1%2Am9mLGeyjmV7Flq-t5MA62g.png" alt=""&gt;&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  Introduction:
&lt;/h3&gt;

&lt;p&gt;Python is a great programming language but sometimes it can be a bit of a slowcoach when it comes to performing certain tasks. That’s why developers have been &lt;a href="https://docs.python.org/3/extending/building.html"&gt;building C/C++ extensions&lt;/a&gt; and integrating them with Python to speed up performance. However, writing these extensions is a bit difficult because these low-level languages are not memory-safe, so they don’t guarantee defined behavior. This tends to introduce bugs with respect to memory management. Rust ensures memory safety and hence can easily prevent these kinds of bugs.&lt;/p&gt;

&lt;h3&gt;
  
  
  Slow Python Scenario:
&lt;/h3&gt;

&lt;p&gt;One of the many cases where Python is slow is building up large strings. In Python, the string object is immutable. Each time a string is assigned to a variable, a new object is created in memory to represent the new value. This contrasts with languages like Perl, where a string variable can be modified in place. That’s why the common operation of constructing a long string out of several short segments is not very efficient in Python. Each time you append to the end of a string, the Python interpreter must allocate a new string object and copy the contents of both the existing string and the appended string into it. As the string under manipulation becomes large, this process becomes increasingly slow.&lt;/p&gt;

&lt;p&gt;Problem: Write a function which accepts a positive integer as argument and returns a string concatenating a series of integers from zero to that integer.&lt;/p&gt;

&lt;p&gt;So let’s try solving the above problem in python and see if we can improve the performance by extending it via Rust.&lt;/p&gt;

&lt;h3&gt;
  
  
  Python Implementations:
&lt;/h3&gt;

&lt;h4&gt;
  
  
  Method I: Naive appending
&lt;/h4&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;


&lt;p&gt;This is the most obvious approach: using the concatenation operator (+=) to append each segment to the string.&lt;/p&gt;
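&lt;p&gt;The embedded gist no longer renders, but the naive version presumably looks something like this (the function name and exact range are my reconstruction):&lt;/p&gt;

```python
def concat_naive(n):
    """Append each integer with +=; every += allocates a new string."""
    result = ""
    for i in range(n):
        result += str(i)
    return result

print(concat_naive(5))  # 01234
```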

&lt;h4&gt;
  
  
  Method II: Build a list of strings and then join them
&lt;/h4&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;


&lt;p&gt;This approach is commonly suggested as a very pythonic way to do string concatenation. First a list is built containing each of the component strings, then in a single join operation a string is constructed containing all of the list elements appended together.&lt;/p&gt;
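&lt;p&gt;A sketch of this approach (again a reconstruction of the missing gist):&lt;/p&gt;

```python
def concat_join(n):
    """Collect the pieces in a list, then join them in one pass."""
    pieces = []
    for i in range(n):
        pieces.append(str(i))
    return "".join(pieces)

print(concat_join(5))  # 01234
```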

&lt;h4&gt;
  
  
  Method III: List comprehensions
&lt;/h4&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;


&lt;p&gt;This version is extremely compact and is also pretty understandable: create a list of numbers using a list comprehension and then join them all together. This is just an abbreviated version of the last approach, and it consumes pretty much the same amount of memory.&lt;/p&gt;
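&lt;p&gt;A sketch of the list-comprehension version (reconstruction):&lt;/p&gt;

```python
def concat_comprehension(n):
    """Build the list of strings inline and join it."""
    return "".join([str(i) for i in range(n)])

print(concat_comprehension(5))  # 01234
```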

&lt;p&gt;Let’s measure the performance of each of these three approaches and see which one wins. We are going to do this using &lt;a href="https://pypi.org/project/pytest-benchmark/"&gt;pytest-benchmark&lt;/a&gt; module.&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;


&lt;p&gt;Here are the results of the above benchmarks. The lower the value, the better the approach.&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;


&lt;p&gt;Just by looking at the &lt;strong&gt;Mean&lt;/strong&gt; column, one can easily see that the list-comprehension approach is definitely the winner among the three.&lt;/p&gt;

&lt;h3&gt;
  
  
  Rust Implementations:
&lt;/h3&gt;

&lt;p&gt;After trying out a basic implementation of the above problem in Rust, and doing some rough benchmarking using &lt;a href="https://github.com/rust-lang/cargo/blob/master/src/doc/man/cargo-bench.adoc"&gt;cargo-bench&lt;/a&gt;, the results definitely looked promising. Hence, I decided to port the Rust implementation as a shared library using the &lt;a href="https://github.com/dgrunwald/rust-cpython"&gt;rust-cpython&lt;/a&gt; project and call it from a Python program.&lt;/p&gt;

&lt;p&gt;To achieve this, I had to create a Rust crate whose src/lib.rs exposes the function to Python.&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;


&lt;p&gt;Building the above crate creates a &lt;strong&gt;.dylib&lt;/strong&gt; file (on macOS), which needs to be renamed to &lt;strong&gt;.so&lt;/strong&gt; so that Python can import it.&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;


&lt;p&gt;Then we ran the same benchmark as before, this time including the Rust implementation.&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;


&lt;p&gt;This time the result is more interesting.&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;


&lt;p&gt;The Rust extension is definitely the winner. As you increase the number of iterations further, the results are even more promising.&lt;/p&gt;

&lt;p&gt;E.g. for iterations = 1000, the benchmark results were as follows.&lt;/p&gt;


&lt;div class="ltag_gist-liquid-tag"&gt;
  
&lt;/div&gt;


&lt;h3&gt;
  
  
  Code:
&lt;/h3&gt;

&lt;p&gt;You can find the code used in the post:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://github.com/chhetripradeep/rust-python-example"&gt;https://github.com/chhetripradeep/rust-python-example&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://github.com/chhetripradeep/cargo-bench-example"&gt;https://github.com/chhetripradeep/cargo-bench-example&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Conclusion:
&lt;/h3&gt;

&lt;p&gt;I am very new to Rust but these results definitely inspire me to learn more Rust. If you know of a better implementation of the above problem in Rust, do let me know.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://github.com/PyO3/pyo3"&gt;PyO3&lt;/a&gt; started a fork of &lt;a href="https://github.com/dgrunwald/rust-cpython"&gt;rust-cpython&lt;/a&gt; but definitely has lot more active development and hence on my todo-list of experimentation.&lt;/p&gt;

&lt;p&gt;Distributing your Python module will require the Rust extension to be compiled on the target system because of architecture variations. &lt;a href="https://github.com/getsentry/milksnake"&gt;Milksnake&lt;/a&gt; is an extension of &lt;a href="https://pypi.org/project/setuptools/"&gt;python-setuptools&lt;/a&gt; that allows you to distribute dynamically linked libraries in Python wheels in the most portable way imaginable.&lt;/p&gt;

</description>
      <category>performance</category>
      <category>python3</category>
      <category>rust</category>
      <category>python</category>
    </item>
    <item>
      <title>Cassandra 202: Snitches</title>
      <dc:creator>Pradeep Chhetri</dc:creator>
      <pubDate>Wed, 01 May 2019 15:40:41 +0000</pubDate>
      <link>https://dev.to/p_chhetri/cassandra-202-snitches-4ake</link>
      <guid>https://dev.to/p_chhetri/cassandra-202-snitches-4ake</guid>
      <description>&lt;h3&gt;
  
  
  Introduction:
&lt;/h3&gt;

&lt;p&gt;A snitch is a component that determines the network topology of the whole cassandra cluster. It provides the translation from a node’s IP address to the datacenter &amp;amp; rack it belongs to. This ensures that data is placed in such a way that the cluster can handle rack- or datacenter-level outages.&lt;/p&gt;

&lt;p&gt;To improve the resiliency of our cassandra cluster, we decided to move from &lt;a href="https://docs.datastax.com/en/cassandra/3.0/cassandra/architecture/archSnitchSimple.html" rel="noopener noreferrer"&gt;SimpleSnitch&lt;/a&gt;, which is the default snitch, to &lt;a href="https://docs.datastax.com/en/cassandra/3.0/cassandra/architecture/archsnitchGossipPF.html" rel="noopener noreferrer"&gt;GossipingPropertyFileSnitch&lt;/a&gt;, the recommended snitch for production-grade clusters. While the latter is a rack- and datacenter-aware snitch, the former doesn’t recognise any of this information.&lt;/p&gt;

&lt;h3&gt;
  
  
  Facts:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Every node in the cluster must be configured to use same snitch type.&lt;/li&gt;
&lt;li&gt;Since SimpleSnitch assigns every node to rack1 in datacenter1, you can only migrate from SimpleSnitch to GossipingPropertyFileSnitch &lt;em&gt;first&lt;/em&gt;. None of the other snitches like Ec2Snitch or GoogleCloudSnitch are compatible with SimpleSnitch. Migrating to an incompatible snitch &lt;em&gt;directly&lt;/em&gt; can cause data loss.&lt;/li&gt;
&lt;li&gt;Cassandra &lt;em&gt;doesn’t&lt;/em&gt; allow changing the rack or datacenter of a node which already has data in it. Hence the only option in such a case is to first decommission the node and then bootstrap it again.&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  SimpleSnitch to GPFS Migration:
&lt;/h3&gt;

&lt;h4&gt;
  
  
  I. Change the Snitch configuration of the current cluster
&lt;/h4&gt;

&lt;p&gt;Let’s say you have five nodes in your cluster configured with SimpleSnitch. You can visualize your cluster like this:&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F382%2F1%2AuY89Nh2SMhxjjd0wu9-CFA.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F382%2F1%2AuY89Nh2SMhxjjd0wu9-CFA.png"&gt;&lt;/a&gt;5 nodes configured in simplesnitch&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;Stop all nodes of the current cluster.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;Since GPFS refers to cassandra-rackdc.properties to infer the rack and datacenter of a node, update them in each node as follows&lt;br&gt;
&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;dc=datacenter1
rack=rack1
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol&gt;
&lt;li&gt;Update the snitch type in each node in cassandra.yaml
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;endpoint\_snitch: GossipingPropertyFileSnitch
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;ol&gt;
&lt;li&gt;Start all nodes of the current cluster.&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F382%2F1%2AuY89Nh2SMhxjjd0wu9-CFA.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F382%2F1%2AuY89Nh2SMhxjjd0wu9-CFA.png"&gt;&lt;/a&gt;5 nodes configured in GPFS&lt;/p&gt;

&lt;h4&gt;
  
  
  II. Update all cassandra clients to be DCAware
&lt;/h4&gt;

&lt;p&gt;Before adding the new datacenter, we need to fulfill these prerequisites:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Make sure that clients query the existing datacenter datacenter1 cassandra nodes only:&lt;/li&gt;
&lt;/ol&gt;

&lt;ul&gt;
&lt;li&gt;Ensure clients are configured to use DCAwareRoundRobinPolicy&lt;/li&gt;
&lt;li&gt;Ensure clients are pointing to datacenter1&lt;/li&gt;
&lt;li&gt;Change consistency level from QUORUM to LOCAL_QUORUM&lt;/li&gt;
&lt;/ul&gt;

&lt;ol&gt;
&lt;li&gt;Update all cassandra keyspaces to use NetworkTopologyStrategy from SimpleStrategy
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ALTER KEYSPACE sample\_keyspace WITH REPLICATION =
{ 'class' : 'NetworkTopologyStrategy', 'datacenter1' : 3 };
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You need to do this for all system-related keyspaces and user-created keyspaces, except a few system keyspaces whose configuration you can’t change.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;Altering the keyspace replication settings doesn’t actually move any existing data. It only affects new reads/writes.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;h4&gt;
  
  
  III. Add the new datacenter
&lt;/h4&gt;

&lt;p&gt;Now start the nodes of the new datacenter, making sure that they all have the same cluster_name configuration as the current cluster.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F763%2F1%2AwRwXEr20YbKJ6gQ5Be-nug.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F763%2F1%2AwRwXEr20YbKJ6gQ5Be-nug.png"&gt;&lt;/a&gt;5 nodes configured in GPFS in two datacenters without knowing each other&lt;/p&gt;

&lt;p&gt;To make sure that cassandra nodes in one datacenter can see the nodes of the other datacenter, add the seed nodes of the new datacenter in all of the old datacenter’s nodes configuration and restart them. Similarly, add the seed nodes of the old datacenter in all of the new datacenter’s nodes configuration and restart them. It is always recommended to use seeds from all datacenters.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F757%2F1%2AE_Jaq_jw-TDNMs6DK60YoA.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F757%2F1%2AE_Jaq_jw-TDNMs6DK60YoA.png"&gt;&lt;/a&gt;5 nodes configured in GPFS in two datacenters communicating with each other&lt;/p&gt;

&lt;p&gt;Once this is done, you will notice that nodetool status shows both datacenters in its output.&lt;/p&gt;

&lt;p&gt;Although the new datacenter nodes have joined the cluster, they still don’t have any data. To replicate every keyspace from the old datacenter’s nodes to the new datacenter’s nodes:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Update every keyspace, adding the expected replica count for the newer datacenter.
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ALTER KEYSPACE sample\_keyspace WITH REPLICATION =
{ 'class' : 'NetworkTopologyStrategy', 'datacenter1' : 3, 'datacenter2' : 3 };
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can specify a replica count of 1 for the new datacenter to make the rebuild faster.&lt;/p&gt;

&lt;blockquote&gt;
&lt;p&gt;If you specify a replica count of 1 for the new datacenter and change it to, let’s say, 3 later, you’ll need to run nodetool repair -full with the -dc option to repair nodes only in the new datacenter. This may increase the overall time.&lt;/p&gt;
&lt;/blockquote&gt;

&lt;ol&gt;
&lt;li&gt;Rebuild each node in the new datacenter.
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;nodetool rebuild -- _datacenter1_
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;You can run the rebuild on one or more nodes at the same time, but we suggest running it on only one node at a time. Running on one node at a time reduces the impact on the older datacenter. In our case, running it concurrently on multiple nodes caused out-of-memory issues for cassandra.&lt;/p&gt;

&lt;h4&gt;
  
  
  IV. Verify the newer datacenter is in sync with older datacenter
&lt;/h4&gt;

&lt;h4&gt;
  
  
  V. Remove references of older datacenter
&lt;/h4&gt;

&lt;p&gt;Before starting the decommissioning process for the older datacenter, we need to fulfill these prerequisites:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;Make sure that clients are pointing to the newer datacenter datacenter2 cassandra nodes only.&lt;/li&gt;
&lt;li&gt;Run a full repair with nodetool repair -full to ensure that all data is propagated from the datacenter being decommissioned. You need to run it on each of the nodes of the older datacenter (one node at a time).&lt;/li&gt;
&lt;li&gt;Update every keyspace removing the older datacenter datacenter1 from replication configuration.
&lt;/li&gt;
&lt;/ol&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;ALTER KEYSPACE sample\_keyspace WITH REPLICATION =
{ 'class' : 'NetworkTopologyStrategy', 'datacenter2' : 3 };
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;h4&gt;
  
  
  VI. Decommission the older datacenter
&lt;/h4&gt;

&lt;ol&gt;
&lt;li&gt;Run nodetool decommission on every node of the older datacenter.&lt;/li&gt;
&lt;/ol&gt;

&lt;h3&gt;
  
  
  References:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://docs.datastax.com/en/cql/3.3/cql/cql_using/useUpdateKeyspaceRF.html" rel="noopener noreferrer"&gt;Cassandra: Updating the replication factor&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>cassandra</category>
      <category>aws</category>
    </item>
    <item>
      <title>Running a modern infrastructure on Kubernetes</title>
      <dc:creator>Pradeep Chhetri</dc:creator>
      <pubDate>Fri, 09 Feb 2018 01:57:58 +0000</pubDate>
      <link>https://dev.to/p_chhetri/running-a-modern-infrastructure-on-kubernetes-1end</link>
      <guid>https://dev.to/p_chhetri/running-a-modern-infrastructure-on-kubernetes-1end</guid>
      <description>&lt;p&gt;At StashAway, we have been running Docker containers from the very first day. Initially, we were using Rancher as the container orchestrator, but as we grew, we decided to switch to Kubernetes (k8s) — mainly because of its rapidly growing ecosystem and wide adoption.&lt;/p&gt;

&lt;p&gt;This post describes how we use k8s and its tooling stack to run our application in a production-grade environment.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://res.cloudinary.com/practicaldev/image/fetch/s--L2dPRExq--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1024/1%2AG9OVxt8KHclvgfedgmyV-Q.png" class="article-body-image-wrapper"&gt;&lt;img src="https://res.cloudinary.com/practicaldev/image/fetch/s--L2dPRExq--/c_limit%2Cf_auto%2Cfl_progressive%2Cq_auto%2Cw_880/https://cdn-images-1.medium.com/max/1024/1%2AG9OVxt8KHclvgfedgmyV-Q.png" alt=""&gt;&lt;/a&gt;Google trends graph showing how the interest on various container schedulers changed over time.&lt;/p&gt;

&lt;p&gt;All applications, whether stateless or stateful, need an environment with these fundamental necessities built in:&lt;/p&gt;

&lt;h3&gt;
  
  
  Service Discovery
&lt;/h3&gt;

&lt;p&gt;Containers are elastic in nature: they can come up and go down at any time. Since each container gets a dynamic IP address, registration of each container instance is a must, so that others can communicate with it. Kubernetes supports two modes of discovery: environment variables and DNS-based.&lt;/p&gt;

&lt;p&gt;If you are, for example, running Cassandra inside a container, its IP address will be available both as an environment variable &lt;em&gt;CASSANDRA_HOST&lt;/em&gt; and as a domain name &lt;em&gt;cassandra.default.svc.cluster.local&lt;/em&gt;.&lt;/p&gt;

&lt;p&gt;DNS-based service discovery is the more popular of the two, but special care needs to be taken since some DNS client libraries set high DNS cache TTL values. E.g. the &lt;a href="https://docs.oracle.com/javase/8/docs/technotes/guides/net/properties.html"&gt;JVM caches domain names forever by default&lt;/a&gt;.&lt;/p&gt;
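&lt;p&gt;A small sketch of consuming both discovery modes from application code (the names are illustrative: for a service named cassandra, Kubernetes actually injects variables following the SERVICENAME_SERVICE_HOST convention, and the DNS name is resolvable only inside the cluster):&lt;/p&gt;

```python
import os

def cassandra_address():
    """Prefer the environment variable injected at pod start-up,
    falling back to the cluster-internal DNS name (which also works
    for services created after this pod started)."""
    host = os.environ.get("CASSANDRA_SERVICE_HOST")
    if host:
        return host
    return "cassandra.default.svc.cluster.local"

print(cassandra_address())
```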

&lt;h3&gt;
  
  
  Service Addressing
&lt;/h3&gt;

&lt;p&gt;DNS-based names for discovery as shown above are only resolvable from inside the Kubernetes cluster. In order to address a service from outside the cluster, it needs an automatic DNS registration to a third party DNS provider such as AWS Route53, Google CloudDNS, AzureDNS, CloudFlare. To mitigate against the dependency on a single DNS provider, you should consider hosting your zones on multiple providers.&lt;/p&gt;

&lt;p&gt;In the k8s world, this can be achieved easily via &lt;a href="https://kubernetes.io/docs/concepts/overview/working-with-objects/annotations/"&gt;annotations&lt;/a&gt; using &lt;a href="https://github.com/kubernetes-incubator/external-dns"&gt;ExternalDNS&lt;/a&gt;. This incubator project takes care of registering a new (sub-)domain as soon as any new k8s service or ingress controller is created. It is also aware of the records it manages via an extra TXT record alongside the primary A record, hence preventing any accidental overwriting of existing records.&lt;/p&gt;

&lt;h3&gt;
  
  
  Routing
&lt;/h3&gt;

&lt;p&gt;With lots of containers and services popping in and out of existence, routing external traffic to healthy containers is challenging. &lt;a href="https://kubernetes.io/docs/concepts/services-networking/ingress/"&gt;Kubernetes Ingress&lt;/a&gt; is the saviour. It provides load balancing, SSL termination and even name based routing. Ingress is just an abstraction layer which can use any software load-balancer as its implementation.&lt;/p&gt;

&lt;p&gt;Current Ingress controller implementations include &lt;a href="https://github.com/kubernetes/ingress-nginx"&gt;Nginx Ingress&lt;/a&gt; (Nginx based), &lt;a href="https://github.com/appscode/voyager"&gt;Voyager&lt;/a&gt; (HAProxy based) and &lt;a href="https://github.com/heptio/contour"&gt;Contour&lt;/a&gt; (EnvoyProxy based). The first is the most mature, and it is what we use (along with an ELB) for all our traffic routing, but it provides only HTTP-based routing. For TCP-based routing, you’ll need Voyager. Contour is very interesting since it brings all the benefits of Envoy, a service proxy designed specifically for modern cloud-native applications: it has first-class support for gRPC and provides features like circuit breaking which are not available in standard load-balancers.&lt;/p&gt;

&lt;h3&gt;
  
  
  Monitoring
&lt;/h3&gt;

&lt;p&gt;An application is defined collectively by many Kubernetes objects, such as pods, services and ingresses, so it is important to monitor the state of each of them.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://prometheus.io/"&gt;Prometheus&lt;/a&gt; is definitely the right choice available in open-source to monitor your Kubernetes apps and cluster. It has an inbuilt &lt;a href="https://prometheus.io/docs/prometheus/latest/configuration/configuration/#&amp;lt;kubernetes_sd_config"&gt;discovery&lt;/a&gt; for these k8s objects. Since monitoring without alerting is useless, &lt;a href="https://github.com/prometheus/alertmanager"&gt;Alertmanager&lt;/a&gt; perfectly fills the gap by providing nice integrations like Slack notifications.&lt;/p&gt;

&lt;p&gt;Many people use Prometheus along with &lt;a href="https://github.com/kubernetes/heapster"&gt;Heapster&lt;/a&gt;, which can be integrated with many open-source monitoring solutions such as InfluxDB and Riemann. Those who want fine-grained container-level metrics can add &lt;a href="https://github.com/google/cadvisor"&gt;cAdvisor&lt;/a&gt; to their monitoring stack, too.&lt;/p&gt;

&lt;h3&gt;
  
  
  Logging
&lt;/h3&gt;

&lt;p&gt;When you are running multiple instances of the same image, you can’t afford to log in to each container and tail its logs. Each k8s node needs to run an agent that ships these container logs. Perhaps surprisingly, in the k8s world the EFK stack is more popular than the ELK stack.&lt;/p&gt;

&lt;p&gt;&lt;a href="http://fluentbit.io/"&gt;Fluent-bit&lt;/a&gt; is a lightweight (alternative to &lt;a href="https://www.fluentd.org/"&gt;Fluentd&lt;/a&gt;) and is a fully Docker- and Kubernetes-aware agent which can be used to push these logs directly to Elasticsearch. It automatically adds kubernetes labels and annotations in each log line. You can also integrate it with Slack for sending notifications in case of any error/exception.&lt;/p&gt;

&lt;h3&gt;
  
  
  Deploying
&lt;/h3&gt;

&lt;p&gt;We bundle each of our applications, along with its dependencies, as a Docker image. To keep the deployment pipeline clean, it’s very important to templatize the manifests. &lt;a href="https://helm.sh/"&gt;Helm&lt;/a&gt;, the package manager for k8s, is a great way to deploy apps on Kubernetes. There are many community-managed Helm &lt;a href="https://github.com/kubernetes/charts"&gt;charts&lt;/a&gt; which are stable and ready to be used in production.&lt;/p&gt;
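&lt;p&gt;The essence of templatizing with Helm, sketched with illustrative names: values.yaml holds the knobs, and the chart templates interpolate them:&lt;/p&gt;

```yaml
# values.yaml
replicaCount: 3
image:
  repository: registry.example.com/api   # placeholder registry
  tag: "1.4.2"

# templates/deployment.yaml (excerpt, shown here as comments)
#   spec:
#     replicas: {{ .Values.replicaCount }}
#       ...
#         image: "{{ .Values.image.repository }}:{{ .Values.image.tag }}"
```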

&lt;p&gt;Since Helm doesn’t provide a neat way to store secrets, we use &lt;a href="http://docs.ansible.com/ansible/2.4/vault.html"&gt;Ansible Vault&lt;/a&gt; as the source of truth for them. We trigger the helm command-line from Ansible using the &lt;a href="https://gist.github.com/chhetripradeep/12cd9f0b94572cede89e18502b84ced1"&gt;ansible-helm&lt;/a&gt; module.&lt;/p&gt;

&lt;p&gt;One of the pain points of Helm is that someone has to write these charts by first understanding each of the YAML fields. &lt;a href="https://ksonnet.io/"&gt;Ksonnet&lt;/a&gt; aims to &lt;a href="https://blog.heptio.com/the-next-chapter-for-ksonnet-1dcbbad30cb"&gt;remove this pain&lt;/a&gt; by dynamically generating Helm charts on demand.&lt;/p&gt;

&lt;h3&gt;
  
  
  SSL Certificates
&lt;/h3&gt;

&lt;p&gt;The vast majority of modern applications expose an HTTP endpoint. To secure these HTTP services, the basic requirement is to install an SSL certificate and enable encrypted communication.&lt;/p&gt;

&lt;p&gt;Provisioning, installing &amp;amp; updating these certificates can become cumbersome if not automated properly.&lt;/p&gt;

&lt;p&gt;Automatic provisioning of &lt;a href="https://letsencrypt.org/"&gt;Let’s Encrypt&lt;/a&gt; certificates for k8s ingresses can be done via &lt;a href="https://github.com/PalmStoneGames/kube-cert-manager"&gt;kube-cert-manager&lt;/a&gt;. We chose it over &lt;a href="https://github.com/jetstack/kube-lego"&gt;kube-lego&lt;/a&gt; since it supports Let’s Encrypt DNS-based validation challenges, which means it can also issue certs for applications hosted on a private network. It takes care of renewing these certificates, too.&lt;/p&gt;
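&lt;p&gt;Whichever tool you pick, what it ultimately manages is a standard TLS Secret that an Ingress can reference in its tls section (the values here are truncated placeholders):&lt;/p&gt;

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: app-example-tls
type: kubernetes.io/tls
data:
  tls.crt: LS0tLS1CRUdJTiBDRVJUSUZJQ0FURS0t...   # base64 certificate (truncated)
  tls.key: LS0tLS1CRUdJTiBQUklWQVRFIEtFWS0t...   # base64 private key (truncated)
```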

&lt;p&gt;The JetStack folks are developing another tool named &lt;a href="https://github.com/jetstack/cert-manager"&gt;cert-manager&lt;/a&gt; which is pretty interesting, since it will soon be able to use &lt;a href="https://www.vaultproject.io/"&gt;Hashicorp Vault&lt;/a&gt; as a certificate authority.&lt;/p&gt;

&lt;h3&gt;
  
  
  Conclusion
&lt;/h3&gt;

&lt;p&gt;In this post, we talked about why you should choose Kubernetes for your next project. We will go in depth into a few of these topics in our upcoming blog posts.&lt;/p&gt;

&lt;p&gt;&lt;em&gt;We are constantly on the lookout for great tech talent to join our team — &lt;/em&gt;&lt;a href="https://www.stashaway.sg"&gt;&lt;em&gt;visit our website&lt;/em&gt;&lt;/a&gt; &lt;em&gt;to learn more and feel free to reach out to us!&lt;/em&gt;&lt;/p&gt;




</description>
      <category>devops</category>
      <category>containers</category>
      <category>docker</category>
      <category>kubernetes</category>
    </item>
    <item>
      <title>Impact of Meltdown fix on AWS</title>
      <dc:creator>Pradeep Chhetri</dc:creator>
      <pubDate>Sat, 13 Jan 2018 13:46:38 +0000</pubDate>
      <link>https://dev.to/p_chhetri/impact-of-meltdown-fix-on-aws-ki8</link>
      <guid>https://dev.to/p_chhetri/impact-of-meltdown-fix-on-aws-ki8</guid>
      <description>&lt;p&gt;On 3rd January, Google Project Zero Team &lt;a href="https://googleprojectzero.blogspot.com/2018/01/reading-privileged-memory-with-side.html" rel="noopener noreferrer"&gt;disclosed&lt;/a&gt; about the two hardware vulnerabilities: &lt;a href="https://meltdownattack.com/meltdown.pdf" rel="noopener noreferrer"&gt;Meltdown&lt;/a&gt; and &lt;a href="https://spectreattack.com/spectre.pdf" rel="noopener noreferrer"&gt;Spectre&lt;/a&gt;. Whereas Meltdown is specific to Intel processors, Spectre affects almost all modern processors.&lt;/p&gt;

&lt;p&gt;As soon as they were disclosed, all of the cloud providers started working on patching the hypervisors with the fix. In this post, we will talk about how AWS handled the same.&lt;/p&gt;

&lt;p&gt;AWS instances are broadly classified into two categories: &lt;a href="https://en.wikipedia.org/wiki/Paravirtualization" rel="noopener noreferrer"&gt;PVM&lt;/a&gt; and &lt;a href="https://en.wikipedia.org/wiki/Hardware-assisted_virtualization" rel="noopener noreferrer"&gt;HVM&lt;/a&gt;. While HVM hypervisors were patched online without affecting any running instances, AWS notified customers to reboot their PVM instances before 6th January.&lt;/p&gt;

&lt;p&gt;We noticed significantly increased CPU utilisation across almost all of our instance groups.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F1024%2F1%2ACclTRJnjEt24auzmzc0UMg.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F1024%2F1%2ACclTRJnjEt24auzmzc0UMg.png"&gt;&lt;/a&gt;Increased CPU Utilisation on our 3-node Cassandra Cluster running on r4.large instances&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;{1}&lt;/strong&gt; By 4th January, &lt;a href="https://aws.amazon.com/security/security-bulletins/AWS-2018-013/" rel="noopener noreferrer"&gt;AWS&lt;/a&gt; had patched the hypervisor with &lt;a href="https://lwn.net/Articles/741878/" rel="noopener noreferrer"&gt;Kernel Page Table Isolation (KPTI)&lt;/a&gt;, which caused a &amp;gt; 100% increase in CPU utilisation. Several Cassandra consulting and managed-hosting companies noticed &lt;a href="https://twitter.com/BenBromhead/status/950245250504601600" rel="noopener noreferrer"&gt;the&lt;/a&gt; &lt;a href="http://thelastpickle.com/blog/2018/01/10/meltdown-impact-on-latency.html" rel="noopener noreferrer"&gt;same&lt;/a&gt;. The performance impact of the KPTI mitigation depends largely on the system calls made by the application, so it varies from workload to workload.&lt;/p&gt;
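&lt;p&gt;Because KPTI adds overhead to every user-to-kernel transition, a quick syscall micro-benchmark gives a rough feel for how exposed a workload is. A minimal sketch in Python (the absolute number is meaningless on its own; compare it before and after patching):&lt;/p&gt;

```python
import os
import time

def syscalls_per_second(n=100_000):
    """Rough syscall-throughput gauge; KPTI taxes every
    user-to-kernel transition, so this drops on patched kernels."""
    start = time.perf_counter()
    for _ in range(n):
        os.getpid()  # a cheap syscall (modern glibc no longer caches getpid)
    return n / (time.perf_counter() - start)

if __name__ == "__main__":
    print(f"{syscalls_per_second():,.0f} getpid calls/sec")
```

&lt;p&gt;Syscall-heavy workloads such as databases and proxies tend to see the largest regression, which is consistent with the Cassandra and RDS graphs above.&lt;/p&gt;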

&lt;p&gt;&lt;strong&gt;{2}&lt;/strong&gt; On 12th January, AWS rolled out something which brought the performance impact back down to pre-Meltdown-patch levels, although AWS hasn’t disclosed any details about it yet.&lt;/p&gt;

&lt;p&gt;We noticed something similar on our RDS instance too.&lt;/p&gt;

&lt;p&gt;&lt;a href="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F1024%2F1%2A2oDHNPgOFXs1nGwfOUstUw.png" class="article-body-image-wrapper"&gt;&lt;img src="https://media.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fcdn-images-1.medium.com%2Fmax%2F1024%2F1%2A2oDHNPgOFXs1nGwfOUstUw.png"&gt;&lt;/a&gt;Increased CPU Utilisation on our RDS instance&lt;/p&gt;

&lt;p&gt;The AWS patch protects against instance-to-instance concerns (one instance reading the memory of another) and instance-to-hypervisor concerns (an instance reading hypervisor memory). AWS still recommends that all customers upgrade their instance kernels to mitigate any process-to-process concerns.&lt;/p&gt;

</description>
      <category>meltdown</category>
      <category>cloudcomputing</category>
      <category>amazonwebservices</category>
    </item>
  </channel>
</rss>
