System Design & Scalability

System Design & Scalability



  • Types of Scaling & Load Balancing
  • Data Storage Design
  • Message-oriented Middleware (MOM)
  • Fault Handling
  • Performance Metrics
  • MapReduce





Types of Scaling & Load Balancing

Horizontal vs. Vertical Scaling


Load Balancing

  • Round Robin
  • Source IP Hash - a given client IP address will always go to the same server
  • Request Hash - a given request type will always go to the same server/cache; avoids cache duplication
  • Least Connections
  • Least Traffic
  • Least Latency


Data Storage Design

Database - Structure

  • Relational - general purpose for tabular/table-based data
  • Specialized - for data structures that don't easily fit the tabular format (e.g. multi-level nesting & hierarchies)
    • NoSQL
    • Others

Database - Read vs. Write Performance

  • Normalization - the process of organizing data to minimize redundancy (removing duplicate columns in different tables). In relational DBs, normalization involves dividing a DB into several tables and defining relationships between those tables using foreign keys.
  • Normalize vs. Denormalize
    • Normalize - ↓duplicate data ⇨ ↑write perf but ↓read perf
    • Denormalize - ↑duplicate data ⇨ ↑read perf but ↓write perf
  •  Have your cake and… - Use an append-only structure for writes; then asynchronously restructure data into a read-optimized format[*]


  • DB reads are expensive; i.e. hold as much of it in memory as possible
  • Cache Hit - data were found in cache; Cache Miss - data not found, so retrieve it from DB[*]
  • Local vs. Distributed Rule of Thumb - use local cache for small data sets, with predictable number of immutable records[*]
  • Cache Warming - anticipate queries and "prime" the cache not only on startup but also in real-time (e.g. load surrounding tiles of a recently-requested map)

Cache - Replacement Policy

Replacement Policy - algorithm used to maximize cache performance by choosing which data to eject & which data to add in its place[*]

  • LRU - ejects the most Least Recently Used data
  • advanced - considers access frequency, size of items, latency & throughput

Data Store Sharding


Sharding - partition data across multiple nodes




Message-oriented Middleware (MOM)

MOM Considerations

  • Used by distributed systems to communicate amongst nodes[*]
  • Abstracts OS & network intricacies (e.g. endian format, sockets, etc.)

MOM Types


Fault Handling


Fault Handling Types

  • High Availability (HA) - delayed recovery to secondary 
  • Fault Tolerant - immediate recovery
    • Active/Passive - primary fails over to secondary
    • Active/Active - no primary vs. secondary; when 1 fails, the other(s) takes the additional load
  • Great YouTube video on the subject!




Avoid SPOF

A Single Point of Failure (SPOF) is a part of a system that, if it fails, will stop the entire system from working.[*]



Performance Metrics

Performance Metrics - Terminology

Bandwidth - The maximum amount of data that can be transferred in a unit of time (e.g. 100Mbps)[*]
Throughput - The actual amount of data that is transferred in a unit of time (e.g. 88MBps)
Latency - The time it takes to send & receive (round-trip) a packet of data (e.g. 20ms)[*]

Performance Metrics - Analogy

Given a water pipe, its diameter determines its throughput, and its length determines its latency.  Therefore, to improve:

  • Throughput - Get a fatter pipe
  • Latency - Colocate to reduce distance or reduce network hops (point-to-point circuit), which also reduces distance that data have to travel

Performance Metrics - Tail Latencies

  • A tail latency, like P99 or P95, is used to represent the worst-case scenario.[*]
  • e.g. P99 = 0.75ms:  For the 99th percentile, the request latency was 0.75ms.  Therefore, 99% of the requests were faster than 0.75ms.


MapReduce - Explanation

Uses parallel & distributed systems to process large data sets[*]

MapReduce - Steps

Fundamentally, consists of two steps, Map & Reduce, but Shuffle step is also prevalent:

  • Map - Organizes/filters/sorts.  Think of putting elements into a typical Map Interface with key-value pairs (e.g. <key, value>)
  • Shuffle - Redistributes data so that all data pertaining to a given key reside on the same node
  • Reduce - Summary/aggregation (e.g. sum all values for a given key)
Product Management

Product Management

Cassandra Overview

Cassandra Overview