Vinicius Uehara

Reliability, scalability and maintainability

Designing Data-Intensive Applications, Chapter 1

  • I recently picked up Martin Kleppmann's DDIA book, and over the next few days I'll be sharing some notes on its chapters
  • Also reading it? Feel free to connect, let's talk about it!

Reliability

  • Roughly, "continuing to work correctly, even when things go wrong"
    • Things go wrong -> fault
    • Systems that anticipate and can cope with faults are called fault-tolerant or resilient
    • Fault =/= Failure
      • Fault: one component deviating from spec
      • Failure: the whole system stops providing the required service to the user
    • ❌ Reduce fault probability to zero ❌
    • ✅ Design fault-tolerance mechanisms that prevent faults from causing failures ✅
      • There are cases where prevention is better, though, as with security matters
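
A minimal sketch (mine, not the book's) of a fault-tolerance mechanism: wrapping a flaky component in retries with a fallback, so a fault doesn't become a user-visible failure. `fetch_recommendations` is a hypothetical caller.

```python
import time

def call_with_tolerance(fetch, fallback, retries=3, delay=0.2):
    """Retry a flaky component a few times; degrade gracefully instead of failing."""
    for attempt in range(retries):
        try:
            return fetch()                     # the component may be faulty...
        except Exception:
            time.sleep(delay * (attempt + 1))  # small backoff before retrying
    return fallback()                          # ...but the user still gets a response

# Hypothetical usage: recommendations are optional, so an empty list is an
# acceptable degraded response while the recommender misbehaves.
# recommendations = call_with_tolerance(fetch_recommendations, lambda: [])
```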

Hardware faults

  • Crashed hard disks, faulty RAM, lack of maintenance, etc.
    • First response: adding redundancy to individual hardware components
  • ⬆ applications' computing demands ⬆ rate of hardware faults
    • And in common cloud services, flexibility and elasticity are prioritized over single-machine reliability
    • Software fault tolerance techniques!

Software faults

  • Harder to anticipate
  • Correlated across nodes
    • Which causes more system failures than uncorrelated hardware faults
  • Examples include:
    • Software bugs given a particular input
    • Runaway processes (infinite loops)
    • Downtime in some service that the system depends on
    • Cascading failures (a timeout sketch follows this list)
  • Prevention includes:
    • Carefully thinking about interactions and assumptions about the system
    • Automated/manual testing
    • Process isolation
    • Measuring, monitoring and analyzing system behavior in production
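
To make the "dependency downtime" and "cascading failures" points concrete, here's a rough sketch (my own, not from the book) of bounding a dependency call with a timeout so a stalled service can't tie up all of our workers. The profile service URL and response shape are invented.

```python
import requests  # third-party HTTP client, assumed to be installed

def get_profile(user_id: int) -> dict:
    """Call a (hypothetical) profile service, but never wait on it forever."""
    try:
        resp = requests.get(
            f"https://profile.internal/users/{user_id}",
            timeout=(0.5, 2.0),  # connect timeout, read timeout (seconds)
        )
        resp.raise_for_status()
        return resp.json()
    except requests.RequestException:
        # A down or slow dependency becomes a handled fault here, instead of
        # blocked workers piling up and taking the rest of the system with them.
        return {"user_id": user_id, "display_name": "unknown"}
```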

Human errors

  • How do we make our systems reliable, in spite of unreliable humans?
    • Designing systems in a way that minimizes opportunities for error. Eg: abstractions, APIs and admin dashboards (high-level interface)
    • Non-production environments (so people can explore and experiment safely)
    • Automated/manual testing to cover corner cases
    • Easy rollback options
    • Deploying code gradually + feature flags (trunk-based development); see the sketch after this list
    • Detailed monitoring and observability. Eg: performance metrics and error rates
    • Good management practices and training
  • Reliability matters for companies big and small (potential loss of revenue and damage to reputation)
    • Balance between reliability and costs
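
A rough sketch of what "deploying gradually + feature flags" can look like (the flag name, percentage and checkout functions are made up): the new code path only runs for a small, configurable slice of users, and "rollback" is flipping a number instead of redeploying.

```python
import hashlib

ROLLOUT_PERCENT = {"new_checkout_flow": 5}  # hypothetical flag, rolled out to 5% of users

def is_enabled(flag: str, user_id: str) -> bool:
    """Deterministically bucket a user into 0..99 and compare against the rollout %."""
    digest = hashlib.sha256(f"{flag}:{user_id}".encode()).hexdigest()
    return int(digest, 16) % 100 < ROLLOUT_PERCENT.get(flag, 0)

def checkout(user_id: str):
    # new_checkout / old_checkout are stand-ins for the two code paths
    if is_enabled("new_checkout_flow", user_id):
        return new_checkout(user_id)   # risky new path, small audience
    return old_checkout(user_id)       # "rollback" = set the percentage to 0
```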

Scalability

  • Ability to cope with increased load
    • If the system grows in a particular way, what are our options for coping with the growth?

Load

  • Actually, first we have to consider what load is
  • Load parameters depend on the system architecture
    • Requests per second to a web server
    • Ratio of reads to writes in a database
    • Number of simultaneously active users in a chat room
    • Hit rate on a cache, etc.
  • Twitter's key load parameter: the distribution of followers per user (fan-out load)
    • SELECT-ing all tweets for the home timeline at read time VS maintaining a per-user timeline cache that is updated when tweets are posted (a rough sketch of both is below)
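
A very rough sketch of the two approaches (table, column and function names are mine, not Twitter's): approach 1 assembles the home timeline with a query at read time; approach 2 fans each tweet out to per-follower timeline caches at write time, which is exactly why the follower distribution becomes the key load parameter.

```python
# Approach 1: pull at read time -- cheap writes, expensive home-timeline reads
HOME_TIMELINE_QUERY = """
    SELECT tweets.*
    FROM tweets
    JOIN follows ON follows.followee_id = tweets.sender_id
    WHERE follows.follower_id = :user_id
    ORDER BY tweets.sent_at DESC
    LIMIT 100;
"""

# Approach 2: fan out at write time -- one tweet becomes many cache inserts,
# which is why the distribution of followers per user is the key load parameter
def post_tweet(db, cache, sender_id, tweet):
    db.insert("tweets", tweet)
    for follower_id in db.followers_of(sender_id):   # can be tens of millions
        cache.prepend(f"home_timeline:{follower_id}", tweet)
```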

Performance

  • Batch processing -> throughput (records processed/second or time to complete a job)
  • Online systems -> response time (request -> response)
  • We should think of response time as a distribution of values, not a single number
    • Since the response time will be different on every request
    • Random added latency, packet loss, garbage collection pauses, mechanical vibrations, etc.
  • Average response time is NOT a good metric
    • It does not tell how many users experienced that delay
  • Use percentiles instead: sort the response times from fastest to slowest and take the median (half of the requests complete in less than that time, half take longer)
    • Also known as p50
  • To know how bad the outliers are: p95, p99, p999
    • Also known as tail latencies
  • If p95 = 1.5 seconds, 5 out of 100 requests take 1.5 seconds or more
  • Tail latencies are important, since they affect users' experience of the service
    • 100ms increase in response time reduces sales by 1% (Amazon)
    • 1 sec slowdown reduces customer satisfaction metrics by 16%
  • These percentiles are used when defining SLAs and SLOs
    • "The service is considered up if it has a median response time under X ms and a p99 under 1 s"
  • Monitoring response times:
    • On an ongoing basis
    • Keep a rolling window of the response times of requests in the last 10 minutes
    • Every minute, calculate the median and other percentiles over that window and plot them on a graph (sketch below)
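
A minimal sketch of that monitoring idea: keep a rolling window of recent response times and compute the median and tail percentiles from it. Keeping a raw in-memory list is a simplification; real systems use approximations such as histograms (e.g. HdrHistogram or t-digest) to make this cheap.

```python
import time
from collections import deque

window = deque()  # (timestamp, response_time_ms) pairs for the last 10 minutes

def record(response_time_ms, now=None):
    now = now if now is not None else time.time()
    window.append((now, response_time_ms))
    while window and window[0][0] < now - 600:   # drop entries older than 10 minutes
        window.popleft()

def percentile(p):
    """p in [0, 100]: percentile(50) is the median, percentile(99) is p99."""
    samples = sorted(rt for _, rt in window)
    if not samples:
        return 0.0
    index = min(len(samples) - 1, int(len(samples) * p / 100))
    return samples[index]

# Every minute, plot percentile(50), percentile(95) and percentile(99) on a dashboard.
```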

Coping with load

  • Rethink the architecture on every order of magnitude load increase
  • Scaling up (moving to a more powerful machine) vs scaling out (distributing load across machines)
    • Elastic systems (automatically add computing resources as load increases)
    • Useful when the increase in load is unpredictable (a toy sketch follows this list)
  • Architecture of large systems is highly specific to the application
    • There is no such thing as a one-size-fits-all architecture
    • 100,000 requests of 1 kB each per second vs 3 requests of 2 GB each per minute
    • An architecture that scales well will be built around the load parameters
    • Iterate quickly on features VS scale to hypothetical future load
    • However, scalable architectures are built from general-purpose building-blocks (arranged in familiar patterns)
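
A toy sketch of the "elastic" idea (the capacity numbers and caps are invented): derive an instance count from an observed load parameter. Real autoscalers add smoothing, cooldowns and cost limits on top of this.

```python
def desired_instances(requests_per_sec, capacity_per_instance=1000, max_instances=100):
    """Pick an instance count for the observed load, within sane bounds."""
    needed = -(-requests_per_sec // capacity_per_instance)   # ceiling division
    return max(1, min(needed, max_instances))                # never zero, never unbounded

# e.g. desired_instances(4200) -> 5 instances at ~1000 req/s each
```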

Maintainability

  • Most of the cost of software is in ongoing maintenance, not in the initial development
    • Fixing bugs, investigating failures, adapting to new platforms, modifying use cases, repaying technical debt, adding new features, etc

Operability

  • Good software cannot run reliably with bad operations
  • Operations squad responsibilities include:
    • System health monitoring/restoring from bad state
    • Tracking down the cause of problems (e.g. failures or degraded performance)
    • Keeping everything up to date
    • Anticipating future problems and solving them
    • Establishing good practices and development tools
    • Performing complex management tasks (such as platform migrations)
    • Maintaining system security
    • Writing docs about the system
  • Good operability of data systems includes:
    • Good monitoring (visibility)
    • Automation and integration with standard tools
    • Good documentation of operations
    • Good and predictable default behavior, but with the option to override it (see the sketch after this list)
    • Self-healing
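
One small illustration of "predictable defaults with the option to override them" (field and environment variable names are invented): ship sensible defaults, and let operators change them without touching code.

```python
import os
from dataclasses import dataclass

@dataclass
class ServerConfig:
    # Predictable defaults, so the system behaves sensibly out of the box...
    request_timeout_s: float = 30.0
    max_connections: int = 100

    @classmethod
    def from_env(cls) -> "ServerConfig":
        # ...with an escape hatch for operators who need to override them
        return cls(
            request_timeout_s=float(os.getenv("REQUEST_TIMEOUT_S", 30.0)),
            max_connections=int(os.getenv("MAX_CONNECTIONS", 100)),
        )
```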

Simplicity

  • Simple and expressive code instead of bloated and complex
  • Symptoms of complexity:
    • Tight coupling of modules
    • Tangled dependencies
    • Inconsistent naming
    • Hacks to solve performance issues, etc
  • When the system is harder for developers to understand and reason about (hidden assumptions, unintended consequences, unexpected interactions), the risk of introducing new bugs is increased
  • Making the system simpler -> removing accidental complexity
    • Through abstractions (e.g. a façade; a sketch follows)
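
A small sketch of an abstraction hiding accidental complexity behind a façade (all the collaborator objects are hypothetical):

```python
class ReportingFacade:
    """A small, stable interface in front of several tangled subsystems."""

    def __init__(self, warehouse, billing, mailer):   # hypothetical collaborators
        self._warehouse = warehouse
        self._billing = billing
        self._mailer = mailer

    def send_monthly_report(self, customer_id):
        usage = self._warehouse.usage_for(customer_id)
        invoice = self._billing.invoice_for(customer_id, usage)
        self._mailer.send(customer_id, invoice)

# Callers depend only on send_monthly_report(); the tangle behind it can be
# simplified or swapped out without touching every call site.
```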

Evolvability

  • Constant flux: we learn new facts, new use cases emerge, business priorities change, users request new features, the architecture changes, etc.
  • Agile development
  • The ease of modifying a system is linked with its simplicity
    • Easy-to-understand systems are easier to modify than complex ones
