DEV Community: Pang Yan Han

Notes from a Reddit Sysadmins AMA in 2013

Pang Yan Han — Sun, 18 Mar 2018 15:02:11 +0000

Source: https://www.reddit.com/r/sysadmin/comments/r6zfv/we_are_sysadmins_reddit_ask_us_anything/

Also available at: https://github.com/yanhan/notes/blob/master/reddit-sysadmins-ama.md

I came across this Reddit AMA a while ago and wanted to take down some notes of the more interesting stuff I read there. Finally got down to doing it today.

Stats

Peak bandwidth: 924.21MBits / second. They used Akamai heavily
Aggregate size of databases: 2.4TB. Seems to be growing a few GB per week
On load balancer: ~8K established connections, ~250K in TIME_WAIT (with a very short TIME_WAIT timeout)

What they use

Akamai
AWS (284 running instances, 161 were app servers)
Puppet
Ganglia
Zenoss
RabbitMQ
MCollective
Central memcached servers (with pylibmc). Each app server also has a small memcached instance for very local caching that cannot tolerate network latency
rsyslog
Log consolidation: rsyslog with RELP module
Hadoop (for in-house data warehouse)

Interesting stuff

They use HAProxy on EC2 instances instead of ELB. Total 8 instances
- ELB is HAProxy with an API. Limited control over the instance size of the ELB. Initially set to a very small instance
- ELB load balancing is done via round-robin DNS. When one of the backing instances crashes, any DNS record cached on the Internet keeps pointing at the dead instance. A lot of devices/software/ISPs still cache DNS incorrectly
- Features that would make ELB useful to them:
  - Static VIP support. Round-robin DNS alone is not acceptable
  - Granular control over the instance size that backs the ELB
  - More rule functionality in load balancing. ELB is very limited compared to HAProxy
At one point, Postgres replication issues were taking down the site very often.
- These were due to EBS failures. They had to log in and start addressing replication immediately to prevent really bad breakages
- Upgrading to Postgres 9 and moving away from EBS took care of it
When they took Reddit down during the SOPA protest, they had to prepare for a severe amount of immediate load because everyone knew when the site was coming back online
- So they could not do anything that would cause the caching layers to clear. Otherwise the site would have fallen flat on its face when it came back online
Load testing: done by the users themselves
- They do not have a load testing infra that can replicate user traffic
- Everywhere one of them has worked, simulating load properly has been one of the most difficult problems. With dynamic services like reddit, it takes a lot of work to develop a suitable load simulator
Non logged in traffic hits Akamai's cache
Security focus: ensuring evildoers cannot get into app and do evil things. Since they are only hosting web, the infra has a very small number of vectors which are under decent security controls
- Most common attack: people trying to 'DDOS' them by scraping one URL over and over again
For async stuff, RabbitMQ is used. For instance:
- Votes
- Comment tree recomputing
- New comments
- Thumbnailer
- Search engine updates
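To illustrate the pattern in the list above (the web request only enqueues work, and a separate consumer does it later), here is a minimal sketch. It uses Python's standard library `queue.Queue` as a stand-in for RabbitMQ, so it shows the shape of the idea and not Reddit's actual code. The names `handle_vote_request`, `vote_worker` and `vote_totals` are hypothetical.

```python
import queue
import threading

# Stand-in for a RabbitMQ queue (assumption); a real setup would publish
# to a broker so that workers can run on other machines.
vote_queue = queue.Queue()
vote_totals = {}  # hypothetical store: link id -> score


def handle_vote_request(link_id, direction):
    """Web request path: enqueue the vote and return immediately."""
    vote_queue.put({"link_id": link_id, "direction": direction})
    return "accepted"


def vote_worker():
    """Background consumer: applies queued votes one at a time."""
    while True:
        message = vote_queue.get()
        if message is None:  # sentinel value tells the worker to stop
            break
        link_id = message["link_id"]
        vote_totals[link_id] = vote_totals.get(link_id, 0) + message["direction"]


def run_demo():
    """Enqueue a few votes, let the worker drain them, return the totals."""
    worker = threading.Thread(target=vote_worker)
    worker.start()
    handle_vote_request("abc", 1)
    handle_vote_request("abc", 1)
    handle_vote_request("xyz", -1)
    vote_queue.put(None)
    worker.join()
    return dict(vote_totals)
```

The request handler never waits for the vote to be counted, which is the point of pushing votes, comment tree recomputation and similar work onto a queue.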
IPv6: Akamai supports it and takes most burden off them
They keep a close eye on request rate hitting infra and real time stats from Google Analytics
Worst downtime: https://redditblog.com/2011/03/17/why-reddit-was-down-for-6-of-the-last-24-hours/
Silliest downtime: running iptables -t nat -L to check rules on the primary load balancer. This loads all the iptables modules, including conntrack. The conntrack table immediately filled up and took the site down for a few seconds
Servers are patched as necessary. They subscribe to all security alert notification lists
Backup strategies: encrypt and send to S3. There's also one backup Postgres server where everything from every database cluster is written to (for more real time backup needs)

Challenges

Starting from scratch on a lot of stuff
Bottlenecks constantly popping up. Fix one bottleneck and the increased throughput introduces multiple new bottlenecks
Cannot touch memcached boxes. Reheating them will be very painful
- At their scale, they must make heavy use of caching whenever possible. Hence shutting everything down and starting everything back up is a painful process
- Need to engineer a clean way to reheat caches without having users hit the site
  - One idea is to replay access logs against front-end hosts
  - Another idea is to send increasing amounts of real traffic. Say, 1 in every 4 requests gets to somewhere other than the maintenance page
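The second reheating idea (let a growing share of real requests past the maintenance page) could be sketched roughly as below. This is only an illustration under assumptions: the AMA did not describe an implementation, and `admit_fraction`, `is_admitted` and the use of a request id are hypothetical.

```python
import zlib


def admit_fraction(step, total_steps):
    """Fraction of traffic to let through at a given ramp step.

    With total_steps=4, step 1 admits 1 in 4 requests, step 4 admits all.
    """
    if total_steps <= 0:
        raise ValueError("total_steps must be positive")
    step = max(0, min(step, total_steps))
    return step / total_steps


def is_admitted(request_id, fraction):
    """Decide whether a request goes to the real site or the maintenance page.

    The id is hashed into one of 100 buckets, so the same id always gets
    the same answer, and raising the fraction only ever admits more ids.
    """
    bucket = zlib.crc32(request_id.encode("utf-8")) % 100
    return bucket < fraction * 100
```

Hashing instead of picking randomly per request means a client that got through at 25% still gets through at 50%, so the caches warm up on a steadily widening set of users.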

Advice

Spend a lot of time working on your own stuff. E.g., set up a web / database server just for the hell of it.
- Break stuff, rebuild it, repeat
- Find every interesting thing you can do on your home server and try it. Even if you are never going to use it personally.
- If anything breaks or doesn't make sense, don't drop it until you truly understand what is going on
- Avoid adopting any cargo cult mentality at all costs
- If that sounds like an extreme bore, reconsider sysadmin aspirations
Certs may help you get an interview at some companies and give you leverage for promotions at your current workplace
- But they demonstrate at most a shallow understanding of a system
- If you already know a system inside out, doesn't hurt to spend a small amount of time getting a cert

Bare metal vs. cloud

Bare metal:
- Load balancers and database servers will benefit from bare metal
- Plus point: can experiment with new hardware
Cloud:
- App servers will benefit from cloud
- Plus points: nice to not have to worry about things like networking infra, installing new hardware, ordering new hardware, rack power, etc

Mistakes they made

Everything used to be in one security group

What they were working on

Automating most infrastructure tasks, such as building out new servers
Getting the site to run in more than one region. Huge project that will require a lot of work throughout entire stack

Cheatsheet on the `top` utility

Pang Yan Han — Sun, 25 Feb 2018 04:17:11 +0000

This is available on my GitHub repo: https://github.com/yanhan/notes/blob/master/top.md

Accompanying blog post: https://yanhan.github.io/posts/my-notes-on-the-top-program.html

Stuff you see at the top of the screen

Load average values

The load average values are located at the top right corner of the screen. They look like the following:

load average: 0.45, 0.57, 0.62

These 3 numbers are the 1 min, 5 min and 15 min load average values respectively.

Simple way to interpret load averages: If the load average is 1.00 and the CPU has 1 core, the server is at capacity. With 2 cores, server is at capacity when the number is 2.00. With 4 cores, this number should be 4.00. And so on.

Longer explanation: Think of a CPU core as a road and a process as a car. If there is always 1 car on the road, the load average is 1.00. If there are 2 cars, then the load average is 2.00 and 1 car can be on the road while the other car has to wait for the road to be free. Hence load average is very roughly the number of processes that need to run / the number of CPU cores, and it measures how overloaded a server is.

A simple rule of thumb: If the 15 min load average exceeds 0.7 (after dividing by the number of CPU cores), then the server may be overloaded.
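The rule of thumb above is simple arithmetic, sketched here in Python. `os.getloadavg()` and `os.cpu_count()` are standard library calls; the function names and the default threshold of 0.7 are just this note's rule of thumb written as code, not something top itself does.

```python
import os


def load_per_core(load_average, cpu_cores):
    """Divide a load average by the number of CPU cores."""
    if cpu_cores < 1:
        raise ValueError("cpu_cores must be at least 1")
    return load_average / cpu_cores


def may_be_overloaded(fifteen_min_load, cpu_cores, threshold=0.7):
    """Apply the rule of thumb: per-core 15 min load above the threshold."""
    return load_per_core(fifteen_min_load, cpu_cores) > threshold


def report_current_load():
    """Read the live values for this machine (Unix-like systems only)."""
    one_min, five_min, fifteen_min = os.getloadavg()
    cores = os.cpu_count() or 1
    return {
        "per_core_15min": load_per_core(fifteen_min, cores),
        "may_be_overloaded": may_be_overloaded(fifteen_min, cores),
    }
```

For example, a 15 min load average of 3.2 on a 4 core server gives 0.8 per core, which is over the 0.7 line, while the sample value 0.62 on 1 core is under it.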

For a better explanation on load averages, see: http://blog.scoutapp.com/articles/2009/07/31/understanding-load-averages

CPU percentage numbers

user time (us)
system time (sy)
time spent on low priority processes aka nice time (ni)
time spent in wait for I/O processes (wa)
time handling hardware interrupts (hi)
time handling software interrupts (si)
time stolen from this virtual machine by the hypervisor (st)
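To see where such percentages can come from, here is a rough sketch that turns two samples of the aggregate `cpu` line in Linux's `/proc/stat` into the same labels. It assumes the usual field order of that line (user, nice, system, idle, iowait, irq, softirq, steal); top's own calculation may differ in detail, and the helper names are made up for this note.

```python
# Labels as shown by top, in the order the counters appear in /proc/stat.
FIELDS = ("us", "ni", "sy", "id", "wa", "hi", "si", "st")


def parse_cpu_line(line):
    """Parse the aggregate 'cpu ...' line of /proc/stat into a dict of ticks."""
    parts = line.split()
    if not parts or parts[0] != "cpu":
        raise ValueError("expected the aggregate 'cpu' line")
    return dict(zip(FIELDS, (int(value) for value in parts[1:9])))


def cpu_percentages(before, after):
    """Percentage of time spent in each state between two samples."""
    deltas = {name: after[name] - before[name] for name in FIELDS}
    total = sum(deltas.values())
    if total <= 0:
        return {name: 0.0 for name in FIELDS}
    return {name: 100.0 * delta / total for name, delta in deltas.items()}
```

The counters only ever grow, so a single sample says little; it is the difference between two samples taken a short time apart that gives the percentages.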

Columns

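PR: task's scheduling priority as the kernel sees it. For normal tasks this is 20 + NI, so it runs from 0 to 39 with lower being more important; real-time tasks show rt or a negative number
NI: nice value, which adjusts the priority of the task. From -20 to 19. A negative number increases the task's priority, a positive number decreases it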
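VIRT: total virtual memory used by the task (code, data and shared libraries, including pages that are swapped out or mapped but not yet used)
RES: resident size, i.e. non-swapped physical memory the task is using, in KiB
SHR: shared memory size, memory that could potentially be shared with other processes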
S: process status. Can be running (R), sleeping and unable to be interrupted (D), sleeping and able to be interrupted (S), trace / stopped (T), zombie (Z)
TIME+: total CPU time that the task has used, shown to hundredths of a second. CPU time of dead children is included only when cumulative mode is on (toggle with S)
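Several of these columns can be read straight from `/proc/[pid]/stat` on Linux. The sketch below pulls out the state letter, priority and nice value from one such line. The field positions (state is field 3, priority field 18, nice field 19) follow the proc(5) man page as I understand it; the function and dictionary names are hypothetical.

```python
STATE_NAMES = {
    "R": "running",
    "S": "sleeping (interruptible)",
    "D": "sleeping (uninterruptible)",
    "T": "stopped / traced",
    "Z": "zombie",
}


def parse_proc_stat(line):
    """Extract a few top-style columns from a /proc/[pid]/stat line.

    The command name sits in parentheses and may itself contain spaces
    or parentheses, so split around the first '(' and the last ')'.
    """
    open_paren = line.index("(")
    close_paren = line.rindex(")")
    rest = line[close_paren + 1:].split()  # rest[0] is field 3 (state)
    state = rest[0]
    return {
        "pid": int(line[:open_paren]),
        "command": line[open_paren + 1:close_paren],
        "state": state,
        "state_name": STATE_NAMES.get(state, "other"),
        "priority": int(rest[15]),  # field 18, shown by top as PR
        "nice": int(rest[16]),      # field 19, shown by top as NI
    }
```

To try it on a live process: `parse_proc_stat(open("/proc/self/stat").read())`.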

Interactive commands

M: sort by memory usage
P: sort by CPU usage
s: change refresh time (will be prompted to enter a value)
Space / Enter: refresh
n: change number of processes shown (will be prompted to enter a value)
k: kill process (will be prompted to enter a value for the PID)
f: see list of fields and you can choose which to display. Use up and down keys to navigate, press d to toggle display, press s to select as sort field
H: show individual threads for all processes
i: toggle whether idle processes are shown
U / u: filter by username
1: toggle between all CPUs as a whole vs. CPU by core
L: locate string
w: write config file
h: open help

Command line options

-n 10: shows 10 iterations of information and then quits
-b: batch mode: just prints information on processes every specified number of seconds until all iterations run out (specified with -n)
-d[interval]: set delay time that top uses to refresh results
-i: toggle whether idle processes are shown
-p[PID,PID]: filter to only show the specified processes
-u [username]: filters by user
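Batch mode is what makes top usable from scripts. As a small sketch, the options above can be assembled into a command and run with Python's `subprocess` module. The helper names `build_top_command` and `snapshot` are made up for this note, and the check that -p and -u are not combined reflects my reading of the procps top man page, so treat it as an assumption.

```python
import subprocess


def build_top_command(iterations=1, delay=None, pids=None, user=None):
    """Build an argument list for top in batch mode."""
    if pids and user:
        # Assumption: procps top treats -p and -u as mutually exclusive.
        raise ValueError("use either pids or user, not both")
    command = ["top", "-b", "-n", str(iterations)]
    if delay is not None:
        command += ["-d", str(delay)]
    if pids:
        command += ["-p", ",".join(str(pid) for pid in pids)]
    if user:
        command += ["-u", user]
    return command


def snapshot(**options):
    """Run top once with the given options and return its text output."""
    completed = subprocess.run(
        build_top_command(**options),
        capture_output=True,
        text=True,
        check=True,
    )
    return completed.stdout
```

For example, `build_top_command(iterations=10, delay=2, pids=[1, 42])` gives the equivalent of `top -b -n 10 -d 2 -p 1,42`.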