<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Pang Yan Han</title>
    <description>The latest articles on DEV Community by Pang Yan Han (@yanhan).</description>
    <link>https://dev.to/yanhan</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F59351%2Fd4686b82-a90e-4a78-acf7-afd3bdd922c3.jpeg</url>
      <title>DEV Community: Pang Yan Han</title>
      <link>https://dev.to/yanhan</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/yanhan"/>
    <language>en</language>
    <item>
      <title>Notes from a Reddit Sysadmins AMA in 2013</title>
      <dc:creator>Pang Yan Han</dc:creator>
      <pubDate>Sun, 18 Mar 2018 15:02:11 +0000</pubDate>
      <link>https://dev.to/yanhan/notes-from-a-reddit-sysadmins-ama-in-2013-b64</link>
      <guid>https://dev.to/yanhan/notes-from-a-reddit-sysadmins-ama-in-2013-b64</guid>
      <description>&lt;p&gt;Source: &lt;a href="https://www.reddit.com/r/sysadmin/comments/r6zfv/we_are_sysadmins_reddit_ask_us_anything/"&gt;https://www.reddit.com/r/sysadmin/comments/r6zfv/we_are_sysadmins_reddit_ask_us_anything/&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Also available at: &lt;a href="https://github.com/yanhan/notes/blob/master/reddit-sysadmins-ama.md"&gt;https://github.com/yanhan/notes/blob/master/reddit-sysadmins-ama.md&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;I came across this Reddit AMA a while ago and wanted to note down the more interesting things I read there. I finally got around to doing it today.&lt;/p&gt;

&lt;h2&gt;
  
  
  Stats
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Peak bandwidth: 924.21 Mbit/s. They used Akamai heavily&lt;/li&gt;
&lt;li&gt;Aggregate size of databases: 2.4 TB, which seems to be growing by a few GB per week&lt;/li&gt;
&lt;li&gt;On the load balancer: ~8K established connections, ~250K in TIME_WAIT (with a very short TIME_WAIT timeout)&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  What they use
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Akamai&lt;/li&gt;
&lt;li&gt;AWS (284 running instances, 161 were app servers)&lt;/li&gt;
&lt;li&gt;Puppet&lt;/li&gt;
&lt;li&gt;&lt;a href="http://ganglia.sourceforge.net/"&gt;Ganglia&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="http://www.zenoss.com/"&gt;Zenoss&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;RabbitMQ&lt;/li&gt;
&lt;li&gt;MCollective&lt;/li&gt;
&lt;li&gt;Central memcached servers (with pylibmc). Each app server also has a small memcached instance for &lt;em&gt;very&lt;/em&gt; local caching that cannot suffer network latency&lt;/li&gt;
&lt;li&gt;rsyslog&lt;/li&gt;
&lt;li&gt;Log consolidation: rsyslog with RELP module&lt;/li&gt;
&lt;li&gt;Hadoop (for in-house data warehouse)&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Interesting stuff
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;They use HAProxy on EC2 instances instead of ELB, 8 instances in total

&lt;ul&gt;
&lt;li&gt;ELB is HAProxy with an API. There is limited control over the instance size of an ELB, which is initially set to a very small instance&lt;/li&gt;
&lt;li&gt;ELB load balancing is done via round-robin DNS. When one of the backing instances crashes, any cached DNS record on the Internet will keep pointing at the dead instance. A lot of devices/software/ISPs still cache DNS incorrectly&lt;/li&gt;
&lt;li&gt;ELB would be more useful if it had:
&lt;ul&gt;
&lt;li&gt;Static VIP support. Round-robin DNS alone is not acceptable&lt;/li&gt;
&lt;li&gt;Granular control over the size of the instances that back an ELB&lt;/li&gt;
&lt;li&gt;More rule functionality in load balancing. It is very limited compared to HAProxy&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;At one point, Postgres replication issues were taking down the site very often.

&lt;ul&gt;
&lt;li&gt;These were due to EBS failures. They had to log in and start addressing replication immediately to prevent really bad breakages&lt;/li&gt;
&lt;li&gt;Upgrading to Postgres 9 and moving away from EBS took care of it&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;When they took Reddit down during the SOPA protest, they had to prepare for a severe amount of immediate load because everyone knew when the site was coming back online

&lt;ul&gt;
&lt;li&gt;So they could not do anything that would cause the caching layers to clear. Otherwise the site would have fallen flat on its face when it came back online&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;Load testing: real users are the load test

&lt;ul&gt;
&lt;li&gt;They do not have a load testing infra that can replicate user traffic&lt;/li&gt;
&lt;li&gt;At every place one of them has worked, one of the most difficult problems has been simulating load properly. With dynamic services like reddit, it takes a lot of work to develop a suitable load simulator&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;Traffic from users who are not logged in hits Akamai's cache&lt;/li&gt;
&lt;li&gt;Security focus: ensuring evildoers cannot get into the app and do evil things. Since they only host web traffic, the infra has a very small number of attack vectors, which are under decent security controls

&lt;ul&gt;
&lt;li&gt;Most common attack: people trying to 'DDOS' them by scraping one URL over and over again&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;For async stuff, RabbitMQ is used. For instance:

&lt;ul&gt;
&lt;li&gt;Votes&lt;/li&gt;
&lt;li&gt;Comment tree recomputing&lt;/li&gt;
&lt;li&gt;New comments&lt;/li&gt;
&lt;li&gt;Thumbnailer&lt;/li&gt;
&lt;li&gt;Search engine updates&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;IPv6: Akamai supports it and takes most burden off them&lt;/li&gt;
&lt;li&gt;They keep a close eye on request rate hitting infra and real time stats from Google Analytics&lt;/li&gt;
&lt;li&gt;Worst downtime: &lt;a href="https://redditblog.com/2011/03/17/why-reddit-was-down-for-6-of-the-last-24-hours/"&gt;https://redditblog.com/2011/03/17/why-reddit-was-down-for-6-of-the-last-24-hours/&lt;/a&gt;
&lt;/li&gt;
&lt;li&gt;Silliest downtime: running &lt;code&gt;iptables -t nat -L&lt;/code&gt; to check rules on the primary load balancer. This loaded all the iptables modules, including conntrack. The conntrack table immediately filled up and took the site down for a few seconds&lt;/li&gt;
&lt;li&gt;Servers are patched as necessary. They subscribe to all security alert notification lists&lt;/li&gt;
&lt;li&gt;Backup strategies: encrypt and send to S3. There's also one backup Postgres server where everything from every database cluster is written to (for more real time backup needs)&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Challenges
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Starting from scratch on a lot of stuff&lt;/li&gt;
&lt;li&gt;Bottlenecks constantly popping up. Fix one bottleneck and the increased throughput introduces multiple new bottlenecks&lt;/li&gt;
&lt;li&gt;Cannot touch memcached boxes. Reheating them will be very painful

&lt;ul&gt;
&lt;li&gt;At their scale, they must make heavy use of caching whenever possible. Hence shutting everything down and starting everything back up is a painful process&lt;/li&gt;
&lt;li&gt;Need to engineer a clean way to reheat caches without having users hit the site&lt;/li&gt;
&lt;li&gt;One idea is to replay access logs against front-end hosts&lt;/li&gt;
&lt;li&gt;Another idea is to send increasing amounts of real traffic. Say every 1 in 4 requests gets to somewhere other than the maintenance page&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;/ul&gt;
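
&lt;p&gt;The "1 in 4 requests" reheating idea above can be sketched roughly as follows. This is only an illustration under my own assumptions, not what Reddit actually built: the function names, the counter-based routing and the ramp schedule are all hypothetical.&lt;/p&gt;

```python
def route_request(request_number, let_through_every):
    """Send every Nth request to the real site and the rest to maintenance.

    Hypothetical helper: request_number is a running counter starting at 1,
    and let_through_every=4 means 1 in 4 requests reaches the site.
    """
    if request_number % let_through_every == 0:
        return "site"
    return "maintenance"


def count_site_hits(total_requests, let_through_every):
    """Count how many of the first total_requests requests reach the site."""
    hits = 0
    for request_number in range(1, total_requests + 1):
        if route_request(request_number, let_through_every) == "site":
            hits += 1
    return hits


if __name__ == "__main__":
    # Assumed ramp: 1 in 4, then 1 in 2, then everything, as caches warm up.
    for let_through_every in (4, 2, 1):
        hits = count_site_hits(100, let_through_every)
        print(f"1 in {let_through_every}: {hits} of 100 requests reach the site")
```

&lt;p&gt;A real setup would do this at the load balancer and step the ratio based on cache hit rates, which this sketch does not attempt.&lt;/p&gt;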

&lt;h2&gt;
  
  
  Advice
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Spend a lot of time working on your own stuff. E.g., set up a web / database server just for the hell of it.

&lt;ul&gt;
&lt;li&gt;Break stuff, rebuild it, repeat&lt;/li&gt;
&lt;li&gt;Find every interesting thing you can do on your home server and try it. Even if you are never going to use it personally.&lt;/li&gt;
&lt;li&gt;If anything breaks or doesn't make sense, don't drop it until you truly understand what is going on&lt;/li&gt;
&lt;li&gt;Avoid adopting any cargo cult mentality at all costs&lt;/li&gt;
&lt;li&gt;If that sounds like an extreme bore, reconsider sysadmin aspirations&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;Certs &lt;em&gt;may&lt;/em&gt; help you get an interview at some companies and give you leverage for promotions at your current workplace

&lt;ul&gt;
&lt;li&gt;But they demonstrate at most a shallow understanding of a system&lt;/li&gt;
&lt;li&gt;If you already know a system inside out, it doesn't hurt to spend a small amount of time getting a cert&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Bare metal vs. cloud
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Bare metal:

&lt;ul&gt;
&lt;li&gt;Load balancers and database servers will benefit from bare metal&lt;/li&gt;
&lt;li&gt;Plus point: can experiment with new hardware&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;li&gt;Cloud:

&lt;ul&gt;
&lt;li&gt;App servers will benefit from cloud&lt;/li&gt;
&lt;li&gt;Plus points: nice to not have to worry about things like networking infra, installing new hardware, ordering new hardware, rack power, etc&lt;/li&gt;
&lt;/ul&gt;


&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Mistakes they made
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Everything used to be in one security group&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  What they were working on
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Automating most infrastructure tasks, such as building out new servers&lt;/li&gt;
&lt;li&gt;Getting the site to run in more than one region. Huge project that will require a lot of work throughout entire stack&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>devops</category>
      <category>infrastructure</category>
    </item>
    <item>
      <title>Cheatsheet on the `top` utility</title>
      <dc:creator>Pang Yan Han</dc:creator>
      <pubDate>Sun, 25 Feb 2018 04:17:11 +0000</pubDate>
      <link>https://dev.to/yanhan/cheatsheet-on-the-top-utility--82c</link>
      <guid>https://dev.to/yanhan/cheatsheet-on-the-top-utility--82c</guid>
      <description>&lt;p&gt;This is available on my GitHub repo: &lt;a href="https://github.com/yanhan/notes/blob/master/top.md"&gt;https://github.com/yanhan/notes/blob/master/top.md&lt;/a&gt;&lt;/p&gt;

&lt;p&gt;Accompanying blog post: &lt;a href="https://yanhan.github.io/posts/my-notes-on-the-top-program.html"&gt;https://yanhan.github.io/posts/my-notes-on-the-top-program.html&lt;/a&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  Stuff you see at the top of the screen
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Load average values
&lt;/h3&gt;

&lt;p&gt;The load average values are located at the top right corner of the screen. They look like the following:&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;load average: 0.45, 0.57, 0.62
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;These 3 numbers are the 1 min, 5 min and 15 min load average values respectively.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Simple way to interpret load averages:&lt;/strong&gt; If the load average is 1.00 and the CPU has 1 core, the server is at capacity. With 2 cores, the server is at capacity when the number is 2.00. With 4 cores, that number is 4.00. And so on.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Longer explanation:&lt;/strong&gt; Think of a CPU core as a road and a process as a car. If there is always 1 car on the road, the load average is 1.00. If there are 2 cars, then the load average is 2.00 and 1 car can be on the road while the other car has to wait for the road to be free. Hence load average is &lt;strong&gt;very roughly&lt;/strong&gt; the number of processes that need to run, and &lt;code&gt;load average / number of CPU cores&lt;/code&gt; measures how overloaded a server is.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;A simple rule of thumb:&lt;/strong&gt; If the 15 min load average exceeds 0.7 (after dividing by the number of CPU cores), then the server may be overloaded.&lt;/p&gt;
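
&lt;p&gt;The rule of thumb above can be sketched in Python. This is a minimal sketch: the function names are my own, the 0.7 threshold is just the rule of thumb from above, and &lt;code&gt;os.getloadavg()&lt;/code&gt; is only available on Unix-like systems.&lt;/p&gt;

```python
import os


def per_core_load(load_average, cpu_cores):
    """Divide a raw load average by the number of CPU cores."""
    if cpu_cores >= 1:
        return load_average / cpu_cores
    raise ValueError("cpu_cores must be at least 1")


def may_be_overloaded(load_15_min, cpu_cores, threshold=0.7):
    """Apply the rule of thumb: per-core 15 min load above the threshold."""
    return per_core_load(load_15_min, cpu_cores) > threshold


if __name__ == "__main__":
    # os.getloadavg() returns the 1, 5 and 15 minute load averages.
    load_1, load_5, load_15 = os.getloadavg()
    cores = os.cpu_count() or 1
    print(f"15 min load per core: {per_core_load(load_15, cores):.2f}")
    print(f"may be overloaded: {may_be_overloaded(load_15, cores)}")
```

&lt;p&gt;For example, a 15 min load average of 3.2 on 4 cores is 0.8 per core, which is over the threshold, while 0.62 on a single core is not.&lt;/p&gt;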

&lt;p&gt;For a better explanation on load averages, see: &lt;a href="http://blog.scoutapp.com/articles/2009/07/31/understanding-load-averages"&gt;http://blog.scoutapp.com/articles/2009/07/31/understanding-load-averages&lt;/a&gt;&lt;/p&gt;

&lt;h3&gt;
  
  
  CPU percentage numbers
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;user time &lt;code&gt;(us)&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;system time &lt;code&gt;(sy)&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;time spent on low priority processes aka nice time &lt;code&gt;(ni)&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;time spent waiting for I/O to complete &lt;code&gt;(wa)&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;time spent handling hardware interrupts &lt;code&gt;(hi)&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;time spent handling software interrupts &lt;code&gt;(si)&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;time stolen from this virtual machine by the hypervisor &lt;code&gt;(st)&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Columns
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;PR&lt;/code&gt;: task's scheduling priority. For normal tasks this is typically 20 plus the nice value (so 0 to 39), and a lower number means a higher priority&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;NI&lt;/code&gt;: nice value, from -20 to 19, which adjusts the priority of the task. A negative number increases the task's priority, a positive number decreases it&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;VIRT&lt;/code&gt;: total virtual memory used by the task, including code, data, shared libraries, swapped-out pages and pages that are mapped but not used&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;RES&lt;/code&gt;: resident size, the non-swapped physical memory the task is using, in KiB&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;SHR&lt;/code&gt;: shared memory size, memory that could potentially be shared with other processes&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;S&lt;/code&gt;: process status. Can be running &lt;code&gt;(R)&lt;/code&gt;, sleeping and unable to be interrupted &lt;code&gt;(D)&lt;/code&gt;, sleeping and able to be interrupted &lt;code&gt;(S)&lt;/code&gt;, traced or stopped &lt;code&gt;(T)&lt;/code&gt;, zombie &lt;code&gt;(Z)&lt;/code&gt;
&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;TIME+&lt;/code&gt;: total CPU time the task has used since it started, shown to hundredths of a second. In cumulative mode it also includes the CPU time of the task's dead children&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Interactive commands
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;M&lt;/code&gt;: sort by memory usage&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;P&lt;/code&gt;: sort by CPU usage&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;s&lt;/code&gt;: change refresh time (will be prompted to enter a value)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;Space / Enter&lt;/code&gt;: refresh&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;n&lt;/code&gt;: change number of processes shown (will be prompted to enter a value)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;k&lt;/code&gt;: kill process (will be prompted to enter a value for the PID)&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;f&lt;/code&gt;: see list of fields and you can choose which to display. Use up and down keys to navigate, press &lt;code&gt;d&lt;/code&gt; to toggle display, press &lt;code&gt;s&lt;/code&gt; to select as sort field&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;H&lt;/code&gt;: show individual threads for all processes&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;i&lt;/code&gt;: toggle whether idle processes are shown&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;U / u&lt;/code&gt;: filter by username&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;1&lt;/code&gt;: toggle between all CPUs as a whole vs. CPU by core&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;L&lt;/code&gt;: locate string&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;W&lt;/code&gt;: write config file&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;h&lt;/code&gt;: open help&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Command line options
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;code&gt;-n 10&lt;/code&gt;: run &lt;code&gt;10&lt;/code&gt; iterations and then quit&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;-b&lt;/code&gt;: batch mode. Prints information on processes at every refresh interval, without accepting input, until all iterations run out (specified with &lt;code&gt;-n&lt;/code&gt;) or top is killed&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;-d[interval]&lt;/code&gt;: set delay time that top uses to refresh results&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;-i&lt;/code&gt;: toggle whether idle processes are shown&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;-p[PID,PID]&lt;/code&gt;: filter to only show the specified processes&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;-u [username]&lt;/code&gt;: filters by user&lt;/li&gt;
&lt;/ul&gt;
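
&lt;p&gt;As a rough illustration of how the options above combine, here is a small Python sketch that builds a batch-mode &lt;code&gt;top&lt;/code&gt; command line. The helper names are hypothetical, and it assumes the procps-ng version of &lt;code&gt;top&lt;/code&gt;, where &lt;code&gt;-p&lt;/code&gt; and &lt;code&gt;-u&lt;/code&gt; cannot be used together.&lt;/p&gt;

```python
import subprocess


def build_top_command(iterations, delay_seconds=None, pids=None, user=None):
    """Build an argv list for a batch-mode top run (hypothetical helper)."""
    if pids and user:
        # Assumption: procps-ng top treats -p and -u as mutually exclusive.
        raise ValueError("use either pids or user, not both")
    command = ["top", "-b", "-n", str(iterations)]
    if delay_seconds is not None:
        command += ["-d", str(delay_seconds)]
    if pids:
        command += ["-p", ",".join(str(pid) for pid in pids)]
    if user:
        command += ["-u", user]
    return command


def run_top(command):
    """Run the command and return its output (needs top to be installed)."""
    completed = subprocess.run(command, capture_output=True, text=True, check=True)
    return completed.stdout


if __name__ == "__main__":
    # Print the command for 3 iterations, 2 seconds apart, for PIDs 1 and 42.
    print(" ".join(build_top_command(3, delay_seconds=2, pids=[1, 42])))
```

&lt;p&gt;Batch mode is what makes the output usable from a script, since interactive mode expects a terminal.&lt;/p&gt;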

&lt;h2&gt;
  
  
  References
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="http://www.tech-faq.com/how-to-use-the-unix-top-command.html"&gt;http://www.tech-faq.com/how-to-use-the-unix-top-command.html&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://www.linode.com/docs/uptime/monitoring/top-htop-iotop/"&gt;https://www.linode.com/docs/uptime/monitoring/top-htop-iotop/&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://coskan.wordpress.com/2008/12/22/how-to-use-top-effectivelly-on-linux-as-a-dba/"&gt;https://coskan.wordpress.com/2008/12/22/how-to-use-top-effectivelly-on-linux-as-a-dba/&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="http://blog.scoutapp.com/articles/2009/07/31/understanding-load-averages"&gt;http://blog.scoutapp.com/articles/2009/07/31/understanding-load-averages&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;

</description>
      <category>linux</category>
      <category>devops</category>
      <category>sysadmin</category>
    </item>
  </channel>
</rss>
