Pang Yan Han

Notes from a Reddit Sysadmins AMA in 2013

Source: https://www.reddit.com/r/sysadmin/comments/r6zfv/we_are_sysadmins_reddit_ask_us_anything/

Also available at: https://github.com/yanhan/notes/blob/master/reddit-sysadmins-ama.md

I came across this Reddit AMA a while ago and wanted to take down some notes on the more interesting stuff I read there. I finally got around to doing it today.

Stats

  • Peak bandwidth: 924.21 Mbit/s. They use Akamai heavily
  • Aggregate size of databases: 2.4 TB, growing by a few GB per week
  • On the load balancer: ~8K established connections, ~250K in TIME_WAIT (with a very short TIME_WAIT timeout)

What they use

  • Akamai
  • AWS (284 running instances, 161 of which were app servers)
  • Puppet
  • Ganglia
  • Zenoss
  • RabbitMQ
  • MCollective
  • Central memcached servers (accessed via pylibmc). Each app server also runs a small local memcached instance for very local caching that cannot tolerate network latency (see the sketch after this list)
  • rsyslog, with the RELP module for log consolidation
  • Hadoop (for in-house data warehouse)
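
The AMA itself doesn't include any code, but a minimal sketch of that two-tier caching setup with pylibmc might look like the following. It is only an illustration: the hostnames, the key name, the TTL and the compute callback are assumptions, not reddit's actual configuration.

    # Two-tier cache sketch (illustrative only): check the app server's local
    # memcached first, fall back to the central memcached pool, and backfill
    # the local cache on a hit. Hostnames, keys and TTLs are made up.
    import pylibmc

    local = pylibmc.Client(["127.0.0.1:11211"])  # tiny per-app-server instance
    central = pylibmc.Client(["cache1:11211", "cache2:11211"], binary=True)

    LOCAL_TTL = 5  # keep local copies short-lived to limit staleness

    def cached_get(key, compute):
        value = local.get(key)
        if value is not None:
            return value            # served without any network hop
        value = central.get(key)
        if value is None:
            value = compute()       # e.g. hit Postgres and render the data
            central.set(key, value)
        local.set(key, value, time=LOCAL_TTL)
        return value

Keeping the local TTL very short is one way to get the latency win without the two cache tiers drifting too far apart.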

Interesting stuff

  • They use HAProxy on EC2 instances instead of ELB (8 instances in total)
    • ELB is essentially HAProxy with an API. There is limited control over the instance size backing an ELB; it initially starts out as a very small instance
    • ELB load balancing is done via round-robin DNS. When one of the backing instances crashes, any cached DNS out on the Internet is going to suck, and a lot of devices/software/ISPs still cache DNS incorrectly
    • ELB would be useful to them if it had:
      • Static VIP support. Round-robin DNS alone is not acceptable
      • Granular control over the instance size that backs the ELB
      • More rule functionality in load balancing (it is very limited compared to HAProxy)
  • At one point, Postgres replication issues were taking down the site very often.
    • These were due to EBS failures. They had to log in and start addressing replication immediately to prevent really bad breakages
    • Upgrading to Postgres 9 and moving away from EBS took care of it
  • When they took Reddit down during the SOPA protest, they had to prepare for a severe amount of immediate load because everyone knew the site was coming back online
    • So they could not do anything that would cause the caching layers to clear; otherwise the site would have fallen flat on its face when it came back online
  • Load testing: users
    • They do not have a load testing infra that can replicate user traffic
    • At every place one of them has worked, one of the most difficult problems is simulating load properly. With dynamic services like reddit, it takes a lot of work to develop a suitable load simulator
  • Non logged in traffic hits Akamai's cache
  • Security focus: ensuring evildoers cannot get into the app and do evil things. Since they are only hosting a web app, the infra has a very small number of attack vectors, and these are under decent security controls
    • Most common attack: people trying to 'DDoS' them by scraping one URL over and over again
  • For async stuff, RabbitMQ is used (a bare-bones sketch of the pattern follows this list). For instance:
    • Votes
    • Comment tree recomputing
    • New comments
    • Thumbnailer
    • Search engine updates
  • IPv6: Akamai supports it and takes most of the burden off them
  • They keep a close eye on the request rate hitting the infra and on real-time stats from Google Analytics
  • Worst downtime: https://redditblog.com/2011/03/17/why-reddit-was-down-for-6-of-the-last-24-hours/
  • Silliest downtime: running iptables -t nat -L to check rules on the primary load balancer. This loads all the iptables modules, including conntrack; the conntrack table immediately filled up and took the site down for a few seconds
  • Servers are patched as necessary. They subscribe to all security alert notification lists
  • Backup strategies: encrypt and send to S3. There is also one backup Postgres server to which everything from every database cluster is written (for more real-time backup needs)
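
None of this comes with code in the AMA, but a bare-bones producer/consumer pair along these lines (using the pika client for RabbitMQ) illustrates the async pattern. The queue name, payload shape and handler are invented for illustration; reddit's actual consumers are not public.

    # Bare-bones RabbitMQ sketch with pika. Queue name, payload shape and the
    # handler body are hypothetical; only the publish/consume pattern matters.
    import json
    import pika

    QUEUE = "vote_queue"  # hypothetical queue name

    def publish_vote(link_id, user_id, direction):
        """App server side: enqueue the vote and return to the user right away."""
        conn = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
        channel = conn.channel()
        channel.queue_declare(queue=QUEUE, durable=True)
        body = json.dumps({"link": link_id, "user": user_id, "dir": direction})
        channel.basic_publish(exchange="", routing_key=QUEUE, body=body)
        conn.close()

    def run_consumer():
        """Worker side: apply votes to the database at its own pace."""
        conn = pika.BlockingConnection(pika.ConnectionParameters("localhost"))
        channel = conn.channel()
        channel.queue_declare(queue=QUEUE, durable=True)

        def handle(ch, method, properties, body):
            vote = json.loads(body)
            # ... update scores, recompute the comment tree, etc.
            ch.basic_ack(delivery_tag=method.delivery_tag)

        channel.basic_consume(queue=QUEUE, on_message_callback=handle)
        channel.start_consuming()

The point of the pattern is that the app server only enqueues a small message; the slow work (recomputing comment trees, thumbnailing, search updates) happens later in separate worker processes.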

Challenges

  • Starting from scratch on a lot of stuff
  • Bottlenecks constantly popping up. Fix one bottleneck and the increased throughput introduces multiple new bottlenecks
  • They cannot touch the memcached boxes; reheating them would be very painful
    • At their scale, they must make heavy use of caching whenever possible. Hence shutting everything down and starting everything back up is a painful process
    • Need to engineer a clean way to reheat caches without having users hit the site
    • One idea is to replay access logs against front-end hosts (a rough sketch follows this list)
    • Another idea is to send increasing amounts of real traffic; say, every 1 in 4 requests gets through to somewhere other than the maintenance page
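
Neither reheating idea comes with code in the AMA; a rough sketch of the log-replay approach might look like the following. The log format (one URL path per line), the front-end host names and the pacing are all assumptions.

    # Rough sketch of the "replay access logs against front-end hosts" idea.
    # Each request repopulates cache entries before real users are let back in.
    import itertools
    import time
    import urllib.request

    FRONTENDS = ["http://app-01.internal", "http://app-02.internal"]  # hypothetical

    def reheat(access_log_path, requests_per_second=50):
        hosts = itertools.cycle(FRONTENDS)
        delay = 1.0 / requests_per_second
        with open(access_log_path) as log:
            for line in log:
                path = line.strip()
                if not path:
                    continue
                try:
                    urllib.request.urlopen(next(hosts) + path, timeout=5).read()
                except OSError:
                    pass  # a failed warm-up request is not fatal
                time.sleep(delay)

The "1 in N real requests" variant would instead be a load balancer rule that lets, say, every fourth request through to the real site while the rest still see the maintenance page.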

Advice

  • Spend a lot of time working on your own stuff. E.g., set up a web/database server just for the hell of it.
    • Break stuff, rebuild it, repeat
    • Find every interesting thing you can do on your home server and try it. Even if you are never going to use it personally.
    • If anything breaks or doesn't make sense, don't drop it until you truly understand what is going on
    • Avoid adopting any cargo cult mentality at all costs
    • If that sounds like an extreme bore, reconsider your sysadmin aspirations
  • Certs may help you get an interview at some companies and give you leverage for promotions at your current workplace
    • But they demonstrate at most a shallow understanding of a system
    • If you already know a system inside out, it doesn't hurt to spend a small amount of time getting a cert

Bare metal vs. cloud

  • Bare metal:
    • Load balancers and database servers will benefit from bare metal
    • Plus point: can experiment with new hardware
  • Cloud:
    • App servers will benefit from cloud
    • Plus points: it's nice not having to worry about things like networking infra, installing new hardware, ordering new hardware, rack power, etc.

Mistakes they made

  • Everything used to be in one security group

What they were working on

  • Automating most infrastructure tasks, such as building out new servers
  • Getting the site to run in more than one region. This is a huge project that will require a lot of work throughout the entire stack

Comments (1)

Peter Kim Frank

This is fascinating. I'd love to see an updated version showing how they're dealing with 2018 traffic and demands five years later. Their platform has gotten way bigger, but infrastructure has also improved dramatically.