What Is a Distributed System?
A distributed system is a collection of independent computers connected by a communication network that work together to accomplish some goal.
Each computer has its own processor, memory, operating system, and clock. Programs running on different machines cannot directly access each other’s memory or rely on a single shared notion of time. As a result, all coordination must be performed explicitly by sending messages over a network.
Failures are expected. In practice, neither the computers nor the communication network is assumed to be perfectly reliable. Most of the time, systems behave as expected, but delays, message loss, crashes, and restarts do occur. These failures may affect only part of the system at any given time. Distributed systems are designed with the expectation that such failures will happen occasionally and must be handled explicitly.
Leslie Lamport, one of the pioneers of distributed systems, summarized this reality succinctly:
“You know you have a distributed system when the crash of a computer you’ve never heard of stops you from getting any work done.”
This observation captures the defining challenge of distributed systems: components fail independently, and their failures can have system-wide consequences.
Distributed Systems Are Older Than You Think
Distributed systems did not emerge with the web or cloud computing. They predate the modern Internet by decades and were driven by many of the same pressures we see today: scale, availability, and coordination across distance.
One of the earliest large-scale distributed systems was the SAGE (Semi-Automatic Ground Environment) air defense system, deployed beginning in 1957. SAGE connected radar stations across North America to a network of 24 large computers, each site built around a pair of IBM AN/FSQ-7 machines. Each site used two interconnected computers so that one could take over if the other failed, an early example of fault tolerance through redundancy.
In 1960, the first online airline reservation system, Sabre, was deployed. Sabre ran on two IBM 7090 computers and allowed airlines to manage reservations in real time across geographically distributed terminals. This system introduced many ideas that are still central today: remote access, shared state, concurrency, and the need to maintain consistency despite failures.
What changed dramatically was not the nature of the problems, but the environment in which systems operated. The introduction and widespread adoption of the Internet provided a general-purpose, packet-switched communication infrastructure that made it practical to connect large numbers of machines across organizations and continents. As networking became cheaper and more pervasive, distributed systems moved from specialized, tightly controlled environments into everyday computing.
Modern distributed systems operate at far larger scales and in more hostile environments, but they still struggle with the same fundamental issues of coordination, delay, and partial failure that appeared in these early systems.
Why We Build Distributed Systems
Distributed systems exist because some problems cannot be solved effectively on a single machine.
Scale and the Limits of Vertical Growth
One reason is scale. As demand for computation, storage, and I/O grows, a single machine eventually becomes a bottleneck. Historically, the computing industry relied on faster hardware to address this problem.
Gordon Moore, a co-founder of Intel, observed in 1965 that the number of transistors on an integrated circuit was doubling roughly every year, a rate he later revised to approximately every two years, leading to dramatic performance improvements. This observation, known as Moore's Law, held for several decades.
That trend has come under increasing strain. Transistors have continued to shrink, but pushing clock speeds higher leads to excessive power consumption and heat.
In the early 2000s, performance gains shifted from faster single cores to multiple cores per chip, which in turn requires software to be written for parallel execution.
Beyond that, systems increasingly rely on heterogeneous computing, which places a variety of computing elements, such as GPUs, neural processing units, and other specialized accelerators, on a single chip.
Even more recently, system technology co-optimization (STCO) extends the principles of heterogeneous computing by optimizing entire systems rather than individual transistors.
The key point is that we can no longer expect single machines to get “fast enough” to solve our problems. Performance gains from hardware alone are no longer sufficient, which pushes us toward distributed solutions.
Vertical vs. Horizontal Scaling
Vertical scaling (scaling up) increases the capacity of a single machine by adding faster CPUs, more cores, more memory, or larger disks. This approach is limited by hardware constraints, power, cost, and diminishing returns.
Horizontal scaling (scaling out) increases capacity by adding more machines and distributing computation or data across them. Distributed systems are fundamentally about enabling horizontal scaling.
Closely related is Amdahl’s Law, which reminds us that parallel speedup is limited by the portion of a task that cannot be parallelized. Even with many cores or machines, sequential components quickly become bottlenecks.
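As a rough sketch of the standard formulation, if $p$ is the fraction of a task that can be parallelized and $n$ is the number of processors or machines, the achievable speedup is:

$$ S(n) = \frac{1}{(1 - p) + \frac{p}{n}} $$

For example, if 95% of the work is parallelizable ($p = 0.95$), then 100 processors give a speedup of only about $1 / (0.05 + 0.0095) \approx 17$, and no number of processors can push it past $1 / (1 - p) = 20$.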
Collaboration and Network Effects
Distributed systems enable collaboration. Many modern applications derive their value from connecting users and services rather than from raw computation alone. Social networks, collaborative editing, online gaming, and marketplaces all depend on interactions among many participants.
This is often described by Metcalfe’s Law, which states that the value of a network grows roughly with the square of the number of connected users. While not a precise law (neither Moore's Law nor Amdahl's Law is one, either), it captures an important intuition: connectivity itself creates value, and that value depends on distributed infrastructure.
Reduced Latency
Geographic distribution allows systems to reduce latency by placing data and computation closer to users. Content delivery networks, regional data centers, and edge computing all exploit this idea. Rather than serving every request from a single location, systems distribute replicas or cached data across the network to improve responsiveness.
Mobility and Ubiquitous Computing
Mobility is another major driver. Distributed systems are no longer just about desktops and servers. Phones, tablets, cars, sensors, cameras, and other embedded devices all participate in distributed systems. There are now more deployed IoT (Internet-of-Things) devices than traditional computers.
These devices move, disconnect, reconnect, and operate under varying network conditions. Distributed systems provide the infrastructure that allows them to function coherently despite these constraints.
Incremental Growth and Cost
Distributed systems also support incremental growth. A service does not need to be built at full scale from the beginning. It can start on a small number of machines and grow over time as demand increases.
Google is a good example of this model. The early versions of Google ran on a small number of commodity machines in a single location at Stanford. As usage grew, the system scaled by adding more machines, then more racks, and eventually multiple data centers around the world. The basic approach did not change: distribute data and computation across machines and route requests to where the data lives.
All-or-Nothing Failure vs. Partial Failure
In a centralized system, failures are typically all-or-nothing. If the system crashes, everything stops. If it is running, everything works.
Distributed systems behave differently. Components can fail independently. One server may crash while others continue to operate. A network link may fail while the machines on either side remain functional. A slow response may be indistinguishable from a failed component.
This phenomenon is called partial failure, and it is one of the defining challenges of distributed systems. The system must continue operating despite incomplete, delayed, or incorrect information about which components are functioning.
Fault Tolerance and Redundancy
Because failures are expected, distributed systems are designed to tolerate them rather than avoid them entirely.
Fault tolerance involves detecting failures, recovering from them, and continuing to provide service. A central goal is to avoid single points of failure, where the failure of one component brings down the entire system.
Redundancy is the primary tool for achieving this. By replicating components, a system can continue operating even if some replicas fail.
Availability vs. Reliability
Reliability concerns correctness and time-to-failure: whether a system produces correct results and how long it operates before something breaks.
Availability measures the fraction of time a system is usable from a client’s perspective.
A system can be reliable but unavailable, or highly available while occasionally returning stale or inconsistent results. Distributed systems often prioritize availability using redundancy, which introduces consistency challenges.
Series and Parallel Systems
The way components are combined has a dramatic effect on system reliability and availability. Two simple models capture this difference: series systems and parallel systems.
Series systems (all-or-nothing)
In a series system, every component must be functioning correctly for the system to work. If any component fails, the entire system fails.
This is the default behavior of many naïve designs. For example, if a service requires a database server, an authentication server, a logging service, and a configuration service to all be reachable before it can process requests, then the failure of any one of these components makes the service unavailable.
If a system consists of $n$ independent components, each with failure probability $P_i$, then the probability that the system is operational is:
$$ P(\text{system works}) = \prod_{i=1}^{n} (1 - P_i) $$
The probability that the system fails is therefore:
$$ P(\text{system fails}) = 1 - \prod_{i=1}^{n} (1 - P_i) $$
As $n$ grows, the probability that something is broken approaches 1, even if individual components are quite reliable. This is why large systems built as series dependencies tend to be unavailable much of the time. With enough components, something is always failing.
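As a concrete illustration of this formula, a service with 100 hard dependencies that are each available 99.9% of the time is operational with probability $0.999^{100} \approx 0.905$; it is unavailable roughly 10% of the time, about a month per year, even though every individual component looks highly reliable.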
This is the essence of all-or-nothing failure.
Parallel systems (fault-tolerant)
In a parallel system, components provide redundancy. The system continues to operate as long as at least one component is functioning.
A common example is a replicated service behind a load balancer. If one replica crashes, requests can be routed to another. From the client’s perspective, the service remains available.
If two independent components each fail with probability $P$, then the system fails only if both fail:
$$ P(\text{system fails}) = P^2 $$
More generally, for $n$ replicated components:
$$ P(\text{system fails}) = \prod_{i=1}^{n} P_i $$
Even modest replication can dramatically improve availability. For example, two components that are each available 95% of the time yield a system availability of 99.75% when used in parallel.
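To make these formulas concrete, here is a small Python sketch (illustrative only; the function names are not from any particular library) that computes availability for series and parallel compositions of independent components:

```python
def series_availability(availabilities):
    """A series system works only if every component works."""
    result = 1.0
    for a in availabilities:
        result *= a
    return result


def parallel_availability(availabilities):
    """A parallel (replicated) system fails only if every component fails."""
    failure = 1.0
    for a in availabilities:
        failure *= (1.0 - a)
    return 1.0 - failure


# Fifty hard dependencies, each 99% available: the service is up
# only about 60% of the time.
print(series_availability([0.99] * 50))      # ~0.605

# Two replicas, each 95% available: the service is available
# 99.75% of the time, matching the example above.
print(parallel_availability([0.95, 0.95]))   # 0.9975
```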
These simple models explain several core design principles in distributed systems:
- Requiring all components to be operational is a losing proposition at scale.
- High availability comes from structuring systems so that components are redundant and fail independently.
- Adding features or services as hard dependencies reduces availability unless those services are themselves replicated.
Fault tolerance is as much about system structure as it is about component reliability.
This is why distributed systems are designed to avoid single points of failure and minimize long dependency chains. The goal is not to prevent failures, but to ensure that failures do not propagate into system-wide outages.
An important caveat
These probability calculations assume independent failures. In real systems, failures are often correlated. Power outages, network partitions, software bugs, and misconfigurations can take down multiple replicas at once.
This is why simply adding replicas is not enough. Where replicas are placed, how they are managed, and how they fail all matter. We will return to this issue when we discuss replication strategies, failure domains, and consensus.
Failure Models
Different types of failures lead to different design assumptions.
In a fail-stop failure, a component halts execution and produces no further output. This model captures crashes and power failures and is a useful starting point for reasoning about distributed systems.
However, fail-stop behavior assumes that other components can eventually detect that the failure has occurred. In real life, this assumption is often optimistic. In a distributed system, a component that has failed is often indistinguishable from one that is slow or temporarily unreachable due to network delays or partitions. This behavior is sometimes described as a fail-silent failure: the component produces no output, but its failure cannot be reliably distinguished from delay.
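As an illustration of why detection is hard, here is a minimal heartbeat-monitor sketch (hypothetical, not tied to any particular system). Note that a timeout can only lead us to suspect a failure, never to confirm one:

```python
import time

SUSPECT_TIMEOUT = 5.0  # seconds of silence before we suspect a peer has failed


class HeartbeatMonitor:
    """Tracks the most recent heartbeat received from each peer."""

    def __init__(self):
        self.last_heartbeat = {}

    def record_heartbeat(self, peer):
        # Called whenever a heartbeat message arrives from `peer`.
        self.last_heartbeat[peer] = time.monotonic()

    def suspects(self, peer):
        # True if we have not heard from `peer` recently. The peer may have
        # crashed, or it may simply be slow or cut off by a network partition;
        # from the messages alone, we cannot tell the difference.
        last = self.last_heartbeat.get(peer)
        if last is None:
            return True
        return time.monotonic() - last > SUSPECT_TIMEOUT
```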
In a fail-restart failure, a component crashes and later restarts. Restarted components may have stale state, which introduces additional complexity. The component must realize that it restarted and may have obsolete information.
Network failures include message omission (where messages get lost), excessive delays, and partitions, where the network splits into disconnected sub-networks.
In Byzantine failures, a component continues to run but does not behave according to the system’s specification. It may send incorrect, inconsistent, or misleading messages, either due to software bugs, hardware faults, or malicious behavior.
Most systems choose a failure model that matches their expected environment rather than attempting to handle all possible failures.
Caching vs. Replication
Caching and replication are often confused, but they serve different purposes.
Replication creates multiple durable copies of data or services to improve availability and fault tolerance. Replicas are part of the system’s authoritative state and must be kept consistent.
Caching stores temporary copies of frequently accessed data to reduce latency. Cached data may become stale and is typically treated as an optimization rather than a source of truth.
Both introduce consistency challenges, but the design goals and tradeoffs are different.
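The difference shows up in a small read-through cache sketch (the `store` object here is a hypothetical stand-in for the replicated, authoritative system): cached entries carry an expiration time and are discarded when stale, while the store remains the source of truth.

```python
import time


class TTLCache:
    """Holds temporary copies of data; entries expire and may be stale."""

    def __init__(self, ttl_seconds):
        self.ttl = ttl_seconds
        self.entries = {}  # key -> (value, expiration timestamp)

    def get(self, key):
        entry = self.entries.get(key)
        if entry is None:
            return None
        value, expires_at = entry
        if time.monotonic() > expires_at:
            # Stale copy: drop it and fall back to the authoritative store.
            del self.entries[key]
            return None
        return value

    def put(self, key, value):
        self.entries[key] = (value, time.monotonic() + self.ttl)


def read(key, cache, store):
    """Read-through: consult the cache first, then the authoritative store."""
    value = cache.get(key)
    if value is None:
        value = store.get(key)  # `store` is the replicated source of truth
        if value is not None:
            cache.put(key, value)
    return value
```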
No Global Knowledge
There is no global view of a distributed system.
Each component knows only its own state and whatever information it has received from others, which may be delayed or outdated. There is no single place where the “true” system state can be observed.
As a result, failure detection is inherently imperfect. A component that appears to be down may simply be slow or unreachable. Distributed systems must operate despite this uncertainty.
Summary
Distributed systems exist because single machines are no longer sufficient to meet demands for scale, availability, latency, collaboration, and mobility.
They are difficult because components operate independently, communicate over unreliable networks, and fail independently. These constraints shape every design decision that follows in this course.