Core distributed systems concepts
- Distributed system
- A collection of independent computers connected by a network that coordinate to accomplish a common goal.
- Autonomous computer
- A computer that has its own processor, memory, operating system, and clock, and operates independently of others.
- Message passing
- Explicit communication between processes using network messages rather than shared memory.
- Partial failure
- A failure mode in which some components fail while others continue operating.
- All-or-nothing failure
- A failure mode in which the entire system stops functioning when a failure occurs.
- Single point of failure
- A component whose failure causes the entire system to fail.
- Horizontal scaling
- Increasing system capacity by adding more machines and distributing work across them.
- Vertical scaling
- Increasing system capacity by adding resources to a single machine.
Laws and principles
- Moore’s Law
- The historical observation that transistor counts, and thus computing capacity, roughly doubled every 18 to 24 months.
- Amdahl’s Law
- A principle stating that the speedup from parallelism is limited by the portion of a task that must remain sequential.
- Metcalfe’s Law
- The idea that the value of a network grows roughly with the square of the number of its participants.
- End-to-end principle
- A network design principle that places functionality such as reliability and security at the communicating endpoints rather than in the network.
- Fate sharing
- The principle that communication state should reside at the endpoints that use it, so that the state is lost only if an endpoint itself fails.
- Best-effort delivery
- A network service model in which the network attempts to deliver each packet but does not guarantee delivery, ordering, or timing.
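As a rough illustration of Amdahl's Law as defined above, the sketch below computes the overall speedup when only part of a task can be parallelized. The function names `amdahl_speedup` and `amdahl_limit` are hypothetical and chosen for this example; the formula is the standard one, speedup = 1 / ((1 - p) + p / n), where p is the parallelizable fraction and n is the number of workers.

```python
def amdahl_speedup(parallel_fraction: float, workers: int) -> float:
    """Speedup when `parallel_fraction` of the work is spread over `workers`."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if workers < 1:
        raise ValueError("workers must be at least 1")
    sequential_fraction = 1.0 - parallel_fraction
    # The sequential part is not sped up; only the parallel part is divided.
    return 1.0 / (sequential_fraction + parallel_fraction / workers)


def amdahl_limit(parallel_fraction: float) -> float:
    """Upper bound on speedup as the number of workers grows without limit."""
    if not 0.0 <= parallel_fraction <= 1.0:
        raise ValueError("parallel_fraction must be between 0 and 1")
    if parallel_fraction == 1.0:
        return float("inf")
    return 1.0 / (1.0 - parallel_fraction)
```

For a task that is 90% parallelizable, `amdahl_speedup(0.9, 10)` is roughly 5.26, and `amdahl_limit(0.9)` is roughly 10: no number of machines can push the speedup past the bound set by the sequential 10%.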
Failure models
- Fail-stop failure
- A failure in which a component halts and produces no further output, and other components can reliably detect that it has failed.
- Fail-silent failure
- A failure in which a component produces no output, but other components cannot reliably distinguish failure from delay.
- Crash-restart failure
- A failure in which a component crashes and later restarts, possibly with lost or stale state.
- Network partition
- A failure that divides a system into disconnected groups that cannot communicate.
- Byzantine failure
- A failure in which a component continues running but does not follow the system specification, producing incorrect or inconsistent behavior.
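The difference between fail-stop and fail-silent failures shows up when one tries to detect failures with timeouts. The sketch below is a minimal heartbeat-based detector, assuming nodes periodically report in and the caller supplies timestamps; the class name `HeartbeatDetector` and its methods are hypothetical. It can only *suspect* a node, because a missing heartbeat looks the same whether the node crashed or its messages were merely delayed.

```python
class HeartbeatDetector:
    """Suspects any node whose last heartbeat is older than `timeout_s`."""

    def __init__(self, timeout_s: float):
        self.timeout_s = timeout_s
        self.last_seen = {}  # node name -> time of most recent heartbeat

    def heartbeat(self, node: str, now: float) -> None:
        """Record that `node` was heard from at time `now`."""
        self.last_seen[node] = now

    def suspected(self, now: float) -> set:
        """Return nodes that have been silent for longer than the timeout."""
        return {
            node
            for node, seen in self.last_seen.items()
            if now - seen > self.timeout_s
        }


detector = HeartbeatDetector(timeout_s=2.0)
detector.heartbeat("a", now=0.0)
detector.heartbeat("b", now=0.0)
detector.heartbeat("a", now=2.5)
suspects_at_3 = detector.suspected(now=3.0)   # "b" has been silent too long

# A late heartbeat from "b" shows it was slow, not dead.
detector.heartbeat("b", now=3.5)
suspects_at_4 = detector.suspected(now=4.0)
```

Here node "b" is suspected at time 3.0 and cleared again at time 4.0, which is exactly the ambiguity the fail-silent model describes. Choosing the timeout trades false suspicions against detection delay.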
Fault tolerance and availability
- Fault tolerance
- The ability of a system to continue operating correctly despite component failures.
- Redundancy
- The use of multiple components to tolerate failures and improve availability.
- Availability
- The fraction of time a system is usable from a client’s perspective.
- Reliability
- A measure of how long a system or component operates correctly before failing, often expressed as mean time to failure.
- Series system
- A system structure in which failure of any component causes system failure.
- Parallel system
- A system structure in which the system continues operating as long as some components remain functional.
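To make the series and parallel structures concrete, the sketch below computes overall availability from per-component availabilities. It assumes component failures are independent, which real systems often violate (shared power, shared software bugs), and the function names are hypothetical.

```python
from math import prod


def series_availability(availabilities) -> float:
    """All components must be up, so the individual availabilities multiply."""
    return prod(availabilities)


def parallel_availability(availabilities) -> float:
    """The system is down only if every component is down at the same time."""
    return 1.0 - prod(1.0 - a for a in availabilities)


two_in_series = series_availability([0.99, 0.99])
two_in_parallel = parallel_availability([0.99, 0.99])
```

With two components that are each available 99% of the time, the series arrangement is available about 98.01% of the time, while the parallel arrangement is available about 99.99% of the time. This is why redundancy improves availability and why long chains of dependencies reduce it.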
Networking fundamentals
- Packet switching
- A networking approach in which data is divided into packets that are routed independently through the network.
- Layered architecture
- A design approach that separates networking functionality into layers with well-defined responsibilities.
- OSI model
- A conceptual seven-layer model used to describe and reason about network protocol design.
- Data link layer
- The layer responsible for transferring frames between machines on the same local network.
- Network layer
- The layer responsible for routing packets between machines across networks.
- Transport layer
- The layer responsible for process-to-process communication.
Internet and IP networking
- Internet Protocol (IP)
- A network-layer protocol that provides connectionless, best-effort delivery of packets between machines.
- Datagram
- An independent packet of data sent over a network without guarantees of delivery or ordering.
- Port
- A transport-layer identifier used to deliver data to the correct process on a machine.
Transport protocols and sockets
- Transmission Control Protocol (TCP)
- A transport protocol that provides reliable, ordered, congestion-controlled byte-stream communication.
- User Datagram Protocol (UDP)
- A transport protocol that provides connectionless, best-effort datagram delivery with minimal overhead.
- Head-of-line blocking
- A delay in which data that has already arrived cannot be delivered to the application until earlier, missing data arrives, because delivery must be in order.
- Socket
- An operating system abstraction that provides an interface for network communication.
- Connection-oriented communication
- Communication that involves explicit connection setup and teardown.
- Connectionless communication
- Communication in which messages are sent independently without establishing a connection.
- QUIC
- A transport protocol built on UDP that provides reliable, multiplexed streams and is typically implemented in user space rather than in the operating system kernel.
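The contrast between connectionless and connection-oriented sockets can be sketched with Python's standard `socket` module. The example below sends a payload over the loopback interface, once with UDP and once with TCP. The helper names `udp_round_trip` and `tcp_round_trip` are hypothetical, both ends live in one process only to keep the example short, and UDP delivery is reliable here only because loopback rarely drops packets.

```python
import socket


def udp_round_trip(payload: bytes) -> bytes:
    """Connectionless: each datagram is addressed and sent independently."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as receiver, \
         socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sender:
        receiver.bind(("127.0.0.1", 0))      # port 0: let the OS pick a free port
        receiver.settimeout(2.0)             # UDP gives no delivery guarantee
        sender.sendto(payload, receiver.getsockname())
        data, _addr = receiver.recvfrom(65535)
        return data


def tcp_round_trip(payload: bytes) -> bytes:
    """Connection-oriented: explicit setup, a byte stream, then teardown."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as listener:
        listener.bind(("127.0.0.1", 0))
        listener.listen(1)
        listener.settimeout(2.0)
        # Connection setup: the TCP handshake happens inside create_connection.
        with socket.create_connection(listener.getsockname(), timeout=2.0) as client:
            conn, _addr = listener.accept()
            with conn:
                conn.settimeout(2.0)
                client.sendall(payload)
                client.shutdown(socket.SHUT_WR)   # signal end of stream
                chunks = []
                while True:
                    chunk = conn.recv(4096)       # a stream has no message boundaries
                    if not chunk:
                        break
                    chunks.append(chunk)
                return b"".join(chunks)
```

Note that the (IP address, port) pair returned by `getsockname()` is what directs data to the right process, and that the TCP receiver must loop on `recv` because a byte stream may arrive in arbitrary pieces, whereas each UDP `recvfrom` returns one whole datagram.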
Data placement
- Replication
- The creation of multiple authoritative copies of data to improve availability and fault tolerance.
- Caching
- The storage of temporary copies of data to reduce latency and load, potentially serving stale data.
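The staleness trade-off in caching can be illustrated with a small time-to-live (TTL) cache. This is a minimal sketch, assuming a hypothetical `fetch` callable that reads from the authoritative store and a caller-supplied clock value; the class name `TTLCache` is also an assumption made for this example.

```python
class TTLCache:
    """Serves a cached copy until it is `ttl_s` seconds old, then refetches."""

    def __init__(self, ttl_s: float, fetch):
        self.ttl_s = ttl_s
        self.fetch = fetch          # reads from the authoritative store
        self._entries = {}          # key -> (value, time the value was fetched)

    def get(self, key, now: float):
        entry = self._entries.get(key)
        if entry is not None and now - entry[1] < self.ttl_s:
            return entry[0]         # cache hit: fast, but possibly stale
        value = self.fetch(key)     # miss or expired: go to the source
        self._entries[key] = (value, now)
        return value


store = {"x": 1}
cache = TTLCache(ttl_s=10.0, fetch=store.__getitem__)

first = cache.get("x", now=0.0)     # miss: fetched from the store
store["x"] = 2                      # the authoritative value changes
stale = cache.get("x", now=5.0)     # hit: still returns the old value
fresh = cache.get("x", now=10.0)    # expired: refetched from the store
```

Here `stale` is the old value 1 even though the store already holds 2, and `fresh` picks up the new value once the entry has expired. A shorter TTL reduces staleness at the cost of more load on the authoritative copy.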