Engineering Distributed Systems

Building systems that survive scale, failure, latency, change, and bad assumptions

Paul Krzyzanowski – 2026-05-01

Distributed systems are the normal way large computing services are built. We use them because a single machine cannot provide the availability, performance, geographic reach, or operational flexibility that modern services require. A distributed system lets us replicate data, spread computation across many machines, place services near users, and isolate parts of an application so they can evolve independently.

The cost is that distribution changes the character of the engineering problem. A local program usually fails as a unit. A distributed system fails in pieces. A message may be delayed, duplicated, reordered, or lost. A server may be alive but overloaded. A dependency may be reachable from one data center but not another. A client may retry a request after the server already performed the operation. Sending bytes over a network is the easy part. The harder problem is building a system that behaves predictably when the network, the machines, the software, and the people operating them do not.

These notes pull together the engineering lessons behind the course. They are less about one algorithm and more about judgment: when to distribute, how to choose interfaces and frameworks, how to design for scale and availability, how to detect failures, how to keep latency under control, and how to avoid assumptions that break as soon as a system leaves the demo environment.

Why We Build Distributed Systems

A system is distributed when its components run on separate machines and communicate over a network. That definition covers a wide range: cloud storage, payment systems, search engines, multiplayer games, streaming platforms, content delivery networks (CDNs), mobile backends, ride-hailing services, ATM networks, and fleets of remote devices.

We distribute systems for several recurring reasons.

High availability comes from replication. Instead of trying to build one machine that never fails, we run multiple copies and keep the service operating when one machine, rack, availability zone, or data center fails.

Parallel computation comes from dividing a large problem into pieces. MapReduce, Spark, and Pregel (a graph-processing model that organizes work into supersteps) exist because many data-processing tasks are too large for one machine but can be divided into independent chunks of work.

Remote operation is unavoidable when the resources being controlled are physically dispersed. Cars, phones, cameras, sensors, point-of-sale terminals, and ATMs are distributed by nature.

Geographic proximity reduces latency. A user in New Jersey should not need to fetch every image, video segment, or API response from Singapore. CDNs, edge computing, and regional cloud deployments exist because distance imposes delay.

Independent services let organizations divide development and deployment. Authentication, file storage, search, recommendation, billing, and analytics can be separate services with separate teams and release schedules.

Outsourced infrastructure lets teams rent parts of a distributed system instead of building everything themselves. Cloud providers offer compute, databases, queues, object stores, load balancers, identity systems, monitoring, and deployment tools.

Distribution solves real problems, but every benefit comes with extra engineering work in communication, coordination, failure handling, security, deployment, and operations.

The Ecosystem Is Too Large to Memorize

The distributed systems ecosystem is overwhelming. AWS alone offers hundreds of cloud services. Azure, Google Cloud, and Alibaba Cloud have catalogs of comparable scale. The Apache Software Foundation hosts hundreds of open-source projects across data processing, cloud infrastructure, content systems, security, and search. New frameworks appear every year. Some become lasting infrastructure. Many fade.

You will not learn all of them. That is not the goal.

The durable skill is learning how to evaluate a framework. The questions worth asking are: what problem does it solve, what does it assume about scale and failure, what consistency and latency behavior does it provide, what operational burden does it add, and what does adopting it commit you to?

A framework is someone else’s engineering tradeoff packaged as a tool. You inherit that tradeoff when you adopt it.

Services and Interfaces

Many distributed systems are built as a collection of services. This may be called a service-oriented architecture or microservice architecture, depending on the style and scale of decomposition. The name is less important than the discipline of defining clean service boundaries.

A good service has a well-defined interface, minimal dependencies, and behavior that can be tested in isolation. It should be usable from different languages and platforms unless there is a strong reason to restrict it. A Java-only service interface, a Python pickle format, or an iOS-specific protocol may be convenient during development, but it ties every caller to the same technology choice.

Interfaces last longer than implementations. A service can be reimplemented, optimized, moved to another language, or deployed on a different platform while preserving the same interface. Changing the interface is harder because every caller may need to change with it.

Interface Design

Good interfaces are easy to understand, hard to misuse, and stable enough to survive system evolution. They should define what the service promises, what inputs it accepts, what errors it returns, and how clients should handle version changes.

Versioning deserves attention from the start. Servers and clients are rarely updated at exactly the same time. During a deployment, some clients may call the new interface while others still call the old one. Mobile clients make this worse because a version released years ago may still be installed on a phone.

Schema evolution is part of interface design. A data format should allow fields to be added without breaking old clients and should define how unknown fields are handled. Protocol Buffers, Avro, Thrift, and similar systems exist partly to make this manageable.
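As a small illustration of that tolerance, the sketch below decodes a JSON message while ignoring unknown fields and supplying defaults for missing ones; this is the behavior that schema-aware formats automate, and the field names here are hypothetical.

    # Sketch: tolerant decoding of a versioned message.
    # Old clients omit new fields; new clients may send fields old code does not know.
    # Field names ("display_name", "locale") are made up for illustration.
    import json

    DEFAULTS = {
        "display_name": "",    # added in a later schema version; old senders omit it
        "locale": "en-US",     # added even later; old senders omit it
    }

    def decode_user(raw_bytes: bytes) -> dict:
        data = json.loads(raw_bytes)
        user = {"user_id": data["user_id"]}          # required in every version
        for field, default in DEFAULTS.items():
            user[field] = data.get(field, default)   # missing field -> default value
        # Unknown fields are ignored rather than rejected, so newer senders
        # do not break this older reader.
        return user

    print(decode_user(b'{"user_id": 42, "locale": "fr-FR", "theme": "dark"}'))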

Stateless interfaces and stateless services usually go together. A stateless interface does not require the server to remember a session: each call carries the context it needs. A stateless service does not retain per-client state between requests, so any instance can handle any request, which makes load balancing and failover far easier. Stateful services are sometimes necessary, but state has to live somewhere: in a database, a replicated log, a session store, or a carefully designed state machine.

Keep the System Understandable

The KISS principle applies strongly to distributed systems. The more clever the design, the harder it is to debug under failure. This becomes painful because distributed bugs often appear only under load, during partial failures, or after a deployment changes timing.

Readable code is not a luxury. Distributed systems require error handling, retries, timeouts, idempotency, tracing, logging, and operational controls. If the core logic is already hard to follow, the production version will become worse.

The same rule applies to AI-generated code. Code that compiles is not enough. You need to understand what it does, what assumptions it makes, what security properties it has, and how it behaves under concurrent execution. A race condition written by a model is still your race condition when it reaches production.

Brian Kernighan’s warning remains accurate: debugging is harder than writing the code. If code is written at the limit of your cleverness, debugging it will exceed that limit.

Communication Is Still Built on Sockets

Every network service eventually relies on the operating system’s socket interface. Higher-level frameworks hide sockets, but they do not remove the underlying behavior. Connections can close. Writes can fail. A peer can stop responding. A connection can remain open while the remote service is too overloaded to make progress.

Sockets also expose many options because network behavior has many edge cases: address reuse, buffering, timeouts, keepalives, multicast, nonblocking I/O, and more. The surface area below the API is larger than most developers ever see. Linux exposes about 127 tunable parameters under /proc/sys/net/ipv4 alone, and depending on the operating system and release, there are 12 to 29 TCP-specific socket options and another 6 to 11 generic socket options that affect TCP behavior.

Defaults work for almost every service, and most engineers will rarely touch any of these, but each one exists for a reason, and a high-performance system sometimes depends on tuning a handful of them for the specific workload, network path, or traffic pattern.
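As one illustration, the sketch below sets a few commonly adjusted options on a listening socket using Python's standard socket module; the values are placeholders, not recommendations.

    # Sketch: a listening socket with a few commonly tuned options.
    import socket

    srv = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    # Allow a quick restart of the server without waiting out TIME_WAIT sockets.
    srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
    srv.bind(("0.0.0.0", 8080))
    srv.listen(128)                     # backlog of pending connections

    conn, addr = srv.accept()
    # Bound how long a read can block; without this, a silent peer stalls the thread.
    conn.settimeout(5.0)
    # Disable Nagle's algorithm for latency-sensitive, small writes.
    conn.setsockopt(socket.IPPROTO_TCP, socket.TCP_NODELAY, 1)
    # Enable TCP keepalives so a dead peer is eventually detected.
    conn.setsockopt(socket.SOL_SOCKET, socket.SO_KEEPALIVE, 1)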

RPC and Failure

Remote procedure call frameworks try to make a network call look like a local function call. That abstraction is attractive, especially for internal services. It also hides the most dangerous part of distributed programming: a remote call does not fail like a local call.

A local function call either returns, raises an exception, or crashes the process. An RPC may have many ambiguous outcomes. The request may not have reached the server. The server may have completed the operation but the response was lost. The client may time out and retry while the first request is still running. The server may fail after updating state but before sending a response.

Retries are a common source of damage. A retry can help when the failure is transient. It can also duplicate an operation, overload a weak dependency, or create a retry storm. Safe retries usually require three things: timeouts, backoff with jitter, and idempotent operations.
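A minimal sketch of those three pieces, assuming a hypothetical call_service function and an operation that is safe to repeat:

    # Sketch: retry with a per-attempt timeout, exponential backoff, and jitter.
    # call_service() is a stand-in for any idempotent remote operation.
    import random
    import time

    def call_with_retries(call_service, attempts=4, timeout=1.0, base_delay=0.1):
        for attempt in range(attempts):
            try:
                return call_service(timeout=timeout)
            except (TimeoutError, ConnectionError):
                if attempt == attempts - 1:
                    raise                      # out of attempts: surface the failure
                # Exponential backoff with full jitter spreads retries out so many
                # clients do not hammer the dependency in lockstep.
                delay = random.uniform(0, base_delay * (2 ** attempt))
                time.sleep(delay)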

Idempotency and Retry Safety

Retries are unavoidable in distributed systems, but they are safe only when the operation can be repeated without changing the result. An operation with that property is idempotent.

Reading a record is usually idempotent. Setting a user’s display name to “Alice” is idempotent. Charging a credit card, creating an order, reserving a seat, or appending an event to a log is not idempotent unless the system is designed to make duplicate requests harmless.

Production systems usually handle this with request identifiers, transaction identifiers, or idempotency keys. The client sends a unique ID with the request. If the server receives the same request again, it can return the previous result rather than performing the operation again.
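A server-side sketch of that idea, with an in-memory table standing in for the durable store a real service would use:

    # Sketch: deduplicating requests with an idempotency key.
    # completed{} stands in for a durable store; a real service persists it.
    completed = {}

    def create_order(idempotency_key, order):
        if idempotency_key in completed:
            # Duplicate (for example, a client retry): return the earlier result
            # instead of creating a second order.
            return completed[idempotency_key]
        result = {"order_id": len(completed) + 1, "items": order}  # perform the work once
        completed[idempotency_key] = result
        return result

    first = create_order("key-123", ["book"])
    again = create_order("key-123", ["book"])   # retry of the same request
    assert first == again                        # no duplicate order was created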

This is one reason failure handling cannot be completely hidden within an RPC framework. The framework can retry a call, but only the application knows whether the operation is safe to retry.

Data Encoding and RPC Mechanisms

Network communication needs a data representation. The representation affects interoperability, performance, storage size, debugging, and evolution.

JSON over HTTP is popular because it is easy to inspect and works well for public web APIs. It is a good default for browser-facing services and third-party APIs. Its drawbacks are verbosity, parsing overhead, weak schema discipline unless extra tooling is added, and text encoding costs.

XML and SOAP are older enterprise approaches that still appear in banking, finance, government, and legacy systems. They are verbose and often cumbersome, but many organizations depend on them.

For high-performance internal services, two decisions are separable: how data is encoded on the wire, and how the call itself is made. Modern frameworks often bundle the two together, but they are different concerns and can be evaluated separately.

Binary serialization formats such as Protocol Buffers, FlatBuffers, Cap’n Proto, and Apache Avro define schemas, generate code in many languages, and produce more compact messages than JSON. They handle the encoding question: how a structured value becomes a sequence of bytes that another program can decode.

Modern RPC frameworks combine a serialization format with a transport and a code generator. gRPC is the most widely used example. It uses Protocol Buffers for messages, runs over HTTP/2, supports streaming in both directions, and generates client and server stubs in many languages. Apache Thrift packages its own serialization, transport, and code generation into a similar framework. The benefit of a modern RPC framework is that a service interface is described once, in a language-neutral schema, and can be called from any supported language with type-safe generated code.

The main lesson is to avoid writing your own parser or wire format unless the problem demands it. Parser bugs are security bugs. Encoding bugs are interoperability bugs. A custom format also needs documentation, versioning, tests, tooling, and migration rules.

Choosing a Communication Mechanism

A useful rule is to match the mechanism to the boundary.

Boundary | Common choice | Reason
Browser or public API | HTTP with JSON | Easy adoption, tooling, and inspection.
Internal high-throughput service calls | gRPC, Thrift, or similar RPC | Typed interfaces, compact encoding, and generated stubs.
Event streams | Kafka or another log-based system | Decouples producers and consumers and supports replay.
Large object transfer | Object storage, CDN, or streaming protocol | Avoids moving large payloads through request-response APIs.

No mechanism is best for all cases. The mistake is choosing one because it is fashionable rather than because it fits the communication style and failure behavior the service needs.

Avoid Unnecessary Distribution

A microservice is not a substitute for a function call.

An in-process function call is measured in nanoseconds. A call to another service is measured in microseconds at best, often milliseconds, and it can fail in many more ways. It also requires serialization, transport, authentication, authorization, observability, deployment, versioning, and failure handling.

Distribute code when there is a real reason: different scaling needs, independent deployment, separate ownership, fault isolation, security isolation, or geographic placement. If two pieces of code always change together, scale together, fail together, and use the same data, a network boundary probably makes the system worse.

Excessive decomposition multiplies operational cost. Every new service adds another deployment target, another monitoring surface, another dependency graph edge, another interface to version, and another place where latency can accumulate.

Fundamental Issues in Distributed Systems

The same problems appear in different forms across distributed systems. They are the background conditions that every design has to address.

Partial Failure

Partial failure means one part of the system fails while other parts continue running. A server may crash, a network link may fail, a disk may stop responding, a process may be killed, a data center may lose power, or a dependency may become slow enough to be unusable.

Partial failure is worse than total failure because the system may keep accepting work while some of its assumptions are false. A service may believe a peer is down when the peer is only slow. Two partitions may both believe they are allowed to accept writes. A client may time out while the server completes the operation.

Durability mechanisms reduce the damage. A service can write to a transaction log before acknowledging a request. A replicated state machine can apply updates in the same order on multiple replicas. A database can commit only after enough replicas have persisted the update.
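A bare-bones sketch of the first of these mechanisms, appending to a local log and forcing it to disk before acknowledging (the file name and record format are made up):

    # Sketch: append a record to a write-ahead log and force it to durable
    # storage before acknowledging the request.
    import json
    import os

    def log_then_ack(path, record):
        line = json.dumps(record) + "\n"
        with open(path, "a") as log:
            log.write(line)
            log.flush()              # push from the process buffer to the OS
            os.fsync(log.fileno())   # push from the OS cache to the device
        # Only after fsync returns is it safe to acknowledge the request;
        # on restart, the log is replayed to recover unapplied updates.
        return "ack"

    log_then_ack("wal.log", {"op": "set", "key": "balance:42", "value": 100})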

Concurrency

Distributed systems are concurrent by nature. Many clients issue requests simultaneously, many workers process data simultaneously, and many replicas receive updates simultaneously. Concurrency creates race conditions, ordering problems, and conflicts.

Concurrency is not limited to threads within a single process. Two services can race to update the same record. Two replicas can accept conflicting writes. Two deployment systems can change related configurations at the same time. The absence of shared memory does not remove shared state.

Consistency

Consistency is about the relationship among operations, replicas, and observed state. Concurrent updates should not corrupt data. Replicas should not diverge unless the system has a defined reconciliation rule. Users should not see behavior that violates the promises the system made.

Strong consistency usually requires coordination. Techniques include distributed locking, two-phase commit, consensus protocols, ordered logs, and replicated state machines. Coordination adds latency and can reduce availability during partitions.

Eventual consistency reduces coordination by allowing replicas to temporarily differ. This can improve availability and latency, but the application must tolerate stale reads, conflicting updates, or delayed visibility. Eventual consistency is a design choice, not an excuse to ignore correctness.

The CAP theorem states a hard limit: during a network partition, a distributed system must choose between serving all requests and preserving a single, consistent view of the data. A well-engineered system can make partitions rare, but it cannot make the theorem disappear.

Latency

Network latency is variable. Messages may arrive quickly most of the time, but then take much longer due to congestion, routing changes, queueing, retransmissions, garbage collection, lock contention, or a slow dependency.

Latency accumulates across service boundaries. A request that calls ten services in sequence pays for ten round-trips plus the processing time of each service. The average may look fine while the overall latency becomes unacceptable.

Caching, replication, parallel requests, batching, pipelining, and geographic placement are all tools to address latency.

Each one introduces tradeoffs. A cache can serve stale data. Replication needs consistency rules. Parallel requests increase load. Batching improves throughput but can add waiting time.

Security

Distributed systems expose more boundaries than local programs. Every boundary needs authentication, authorization, encryption, input validation, rate limiting, and auditing. The old model of a trusted internal network is not enough. A compromised service inside the network should not become a master key for the rest of the system.

Zero trust means that every request is authenticated and authorized regardless of its origin. Services should authenticate other services. Credentials should be short-lived. Private keys should not live in source code, shared directories, or long-lived configuration files.

TLS protects a connection only if certificates and keys are managed correctly. Mutual TLS (mTLS) extends TLS so both sides authenticate each other, which is useful for service-to-service communication. Service meshes are one way to automate mTLS across many services; the tradeoffs are covered later under Security Engineering.

Security is also availability. A denial-of-service attack, a cloud account suspension, a lost key, an expired certificate, or a ransomware event can take down a system as effectively as a failed disk.

Designing for Scale

Scale is easier to design into a system early than to bolt on after the design assumes one machine, one database, one queue, or one data center.

A scalable design partitions work and data. Large datasets are sharded. Computation moves toward the data when possible. Independent tasks run concurrently. Results are merged after parallel work completes. MapReduce, Spark, Cassandra, Bigtable, and HDFS (the Hadoop Distributed File System) all apply this pattern in different forms.
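A toy sketch of hash-based partitioning, ignoring replication and the rebalancing that real systems need when the shard count changes:

    # Sketch: routing keys to shards by hashing. Real systems add replication and
    # use consistent hashing or range partitioning to limit data movement when
    # the number of shards changes.
    import hashlib

    SHARDS = ["shard-0", "shard-1", "shard-2", "shard-3"]   # hypothetical shard names

    def shard_for(key: str) -> str:
        digest = hashlib.sha256(key.encode()).digest()
        index = int.from_bytes(digest[:8], "big") % len(SHARDS)
        return SHARDS[index]

    print(shard_for("user:1001"))   # the same key always maps to the same shard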

Scaling also means avoiding central bottlenecks. A single coordinator, single database primary, single lock server, or single queue may be acceptable at small scale but become the point that limits the system. Sometimes a coordinator is needed for correctness. The question is whether it is on the critical path of every operation and whether it can fail safely.

Stored Data and Streaming Data

Stored data and streaming data need different designs. Stored data systems process a bounded dataset: a set of files, table rows, objects, or graph vertices. Streaming systems process data that keeps arriving: clicks, logs, metrics, sensor readings, orders, or financial events.

Batch systems optimize throughput over a known input. Streaming systems optimize continuous ingestion, ordering, backpressure, retention, replay, and consumer progress. A system can use both. For example, Kafka may collect events continuously while Spark jobs process historical windows of those events.

Local Parallelism Still Counts

Distribution does not remove the need for local concurrency. A service running on one machine should use multiple cores when the workload permits it. Threads, async I/O, event loops, and worker pools are still part of distributed system design because each node must use its local resources efficiently.

Local concurrency also introduces local failure modes: lock contention, deadlocks, memory leaks, thread-pool exhaustion, file descriptor exhaustion, and connection-pool exhaustion. These often appear only under load.

Designing for High Availability

Everything fails. Disks fail. SSDs fail. Routers fail. Switches fail. Memory fails. Power supplies fail. Network providers fail. Processes crash. Deployments break good code. Operators make mistakes. Configuration changes can do more damage than hardware failures.

Failure rates that look tiny become routine at scale. A fleet with ten thousand machines, each with a mean time between failures of thirty years, should expect about one machine failure per day. A storage system with a million drives should expect frequent drive failures even if each individual drive is highly reliable.

Large-scale studies of disk populations have shown that real failure behavior often differs from vendor estimates. The takeaway is a working assumption: production systems must run under constant background failure, whatever the exact rate turns out to be.

Replication

Replication is the main tool for availability. Copies of data or services run on multiple machines so the system can continue when one copy fails. Replication can also improve performance by placing data closer to users or by spreading read load across replicas.

Replication has a correctness cost. If replicas can accept updates, they need a rule for ordering and conflict resolution. A replicated state machine keeps replicas consistent by applying the same operations in the same order. Consensus protocols such as Raft and Paxos are used to agree on that order in the presence of failures.

Availability Zones and Disaster Recovery

Local replication protects against machine failure. It does not protect against a rack power failure, a network aggregation failure, a data center outage, or a regional cloud event. For higher availability, services and data should be distributed across availability zones or data centers with independent power, cooling, and networking.

Disaster recovery needs more than live replicas. It also needs backups, snapshots, software artifacts, configuration, secrets, deployment scripts, and restoration procedures. Offline or logically isolated backups help protect against catastrophic cyberattacks because ransomware and destructive automation can corrupt online replicas quickly.

A backup that has never been restored is not a recovery plan. Restoration testing is the only way to know whether the backup is complete, readable, and usable under time pressure.

Graceful Degradation

Distributed systems fail in pieces, so the user-visible service should not fail as a single unit. Graceful degradation means deciding in advance how the system behaves when dependencies are slow or unavailable.

Critical features should be protected from non-critical failures. A shopping site can survive without recommendations, reviews, or personalization. It cannot survive without checkout. A search service can return cached results if the live index is temporarily unavailable. A news feed can serve stale content when the personalization service is slow.

Common mechanisms are timeouts, circuit breakers, and fallbacks.

Timeouts cap how long a service waits for a dependency. Waiting forever converts one failure into a thread, connection, or request-pool exhaustion problem.

Circuit breakers stop calling a dependency that is clearly failing. This protects the caller and gives the dependency room to recover.

Fallbacks define what the service returns when a dependency fails. A fallback might be cached data, stale data, a default response, or a partial page.
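The sketch below combines the three mechanisms around a single dependency: a per-call timeout, a crude circuit breaker that opens after repeated failures, and a cached fallback. The thresholds and the fetch_live function are placeholders.

    # Sketch: timeout + circuit breaker + fallback around one dependency.
    # fetch_live() stands in for the real dependency call; the numbers are illustrative.
    import time

    FAILURE_THRESHOLD = 5      # consecutive failures before the breaker opens
    OPEN_SECONDS = 30          # how long to stop calling the dependency

    failures = 0
    opened_at = None
    cached = {"results": [], "stale": True}   # last good response, served as fallback

    def get_results(fetch_live):
        global failures, opened_at, cached
        if opened_at is not None and time.time() - opened_at < OPEN_SECONDS:
            return cached                       # breaker open: do not call at all
        try:
            fresh = fetch_live(timeout=0.5)     # timeout caps how long we wait
            failures, opened_at = 0, None       # success closes the breaker
            cached = {"results": fresh, "stale": False}
            return cached
        except (TimeoutError, ConnectionError):
            failures += 1
            if failures >= FAILURE_THRESHOLD:
                opened_at = time.time()         # open the breaker
            return cached                       # fallback: stale but usable data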

Graceful degradation requires ranking features by importance. Without that ranking, the system tends to propagate failure to the user.

Backpressure, Load Shedding, and Rate Limiting

A distributed system under load has to protect itself. If every service accepts every request and every caller retries aggressively, overload spreads through the system, and recovery becomes harder.

Backpressure tells callers to slow down. A queue may stop accepting new work, a service may return a retry-after response, or a streaming system may stop pulling data until consumers catch up. Backpressure is how a system prevents a temporary overload from becoming a permanent collapse.

Load shedding drops lower-priority work so critical work can continue. A streaming video service under load may pause recommendation refreshes, defer analytics events, lower default video quality, or stop generating thumbnails while keeping playback responsive. The system is still degraded, but it is degraded on purpose.

Rate limiting protects services from clients that send too much traffic, whether by accident, abuse, or attack. It also keeps one tenant, user, or service from consuming capacity needed by others.
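A token bucket is one common way to implement rate limiting; the sketch below allows short bursts while enforcing an average rate, with arbitrary numbers.

    # Sketch: token-bucket rate limiting. Tokens refill at a steady rate;
    # each request spends one token; an empty bucket means "slow down".
    import time

    class TokenBucket:
        def __init__(self, rate_per_sec=100.0, burst=200.0):
            self.rate = rate_per_sec
            self.capacity = burst
            self.tokens = burst
            self.last = time.monotonic()

        def allow(self) -> bool:
            now = time.monotonic()
            self.tokens = min(self.capacity, self.tokens + (now - self.last) * self.rate)
            self.last = now
            if self.tokens >= 1.0:
                self.tokens -= 1.0
                return True
            return False    # caller should reject or return a retry-after response

    bucket = TokenBucket(rate_per_sec=10, burst=20)
    accepted = sum(bucket.allow() for _ in range(100))   # most of this burst is rejected
    print(accepted)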

A system that never says no usually fails by making everything slow. A system that deliberately rejects work can preserve the parts that users need most.

Fault Detection and Observability

Fault detection tells you that something is wrong. Observability helps you determine what is wrong.

Traditional detection mechanisms include process monitoring, watchdog timers, health checks, heartbeats, and active probes. These mechanisms are useful, but they must be designed carefully. A heartbeat failure may mean a process died. It may also mean the network is slow, a partition occurred, or the detector is overloaded.

The detection interval affects recovery time. If it takes thirty seconds to decide that a service is dead and another thirty seconds to restart it, users may experience a long failure even though the system eventually recovers.

Metrics, Logs, and Traces

Logs are timestamped event records. They are useful for forensics and debugging individual cases. They are poor at answering aggregate questions such as whether the error rate is rising or which dependency caused a latency spike.

Metrics are numeric measurements over time: counters, gauges, histograms, rates, queue depths, memory use, CPU utilization, and latency distributions. Prometheus is a common open-source metrics system.
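As an example, a Python service using the prometheus_client package can export a request counter and a latency histogram with a few lines; the metric names here are illustrative.

    # Sketch: exporting RED-style metrics with the prometheus_client library.
    import random
    import time
    from prometheus_client import Counter, Histogram, start_http_server

    REQUESTS = Counter("http_requests_total", "Requests served", ["status"])
    LATENCY = Histogram("request_latency_seconds", "Request latency in seconds")

    def handle_request():
        start = time.perf_counter()
        status = "200" if random.random() > 0.01 else "500"   # stand-in for real work
        LATENCY.observe(time.perf_counter() - start)
        REQUESTS.labels(status=status).inc()

    start_http_server(9100)     # Prometheus scrapes metrics from this port
    while True:                 # a real service handles traffic instead of looping
        handle_request()
        time.sleep(0.1)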

Distributed traces follow one request across many services. A trace ID is propagated through service calls, and each service records data showing where time was spent. Jaeger and Zipkin are common tracing systems, and OpenTelemetry provides a vendor-neutral way to instrument traces, metrics, and logs.

Two acronyms are useful for organizing observability data.

Acronym | Used for | Signals
RED | Services | Rate, Errors, Duration.
USE | Resources | Utilization, Saturation, Errors.

RED describes the work a service is doing.

USE describes the machinery that performs the work.

A service needs both views. A latency spike may be caused by a slow dependency, a saturated thread pool, a full disk queue, a database connection limit, or a network path change.

Tail latency is often more useful than average latency. The p99 latency is the point at which 99 percent of requests are faster and 1 percent are slower. Users feel the tail because a page or workflow may depend on many service calls. A service with a good average and a terrible p99 still has a serious performance problem.

SLOs and Error Budgets

Observability is useful only if the team knows what healthy means. A service-level objective, or SLO, defines the target behavior of a service. A typical SLO might say that 99.9 percent of requests should complete successfully within 200 milliseconds over a 30-day window.

An error budget is the amount of unreliability the service can tolerate while still meeting its SLO. If the SLO allows 0.1 percent of requests to fail or be too slow, that 0.1 percent is the budget.
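The arithmetic is worth doing explicitly. Using the example SLO above and a hypothetical traffic volume:

    # Worked example for the SLO above: 99.9% success over a 30-day window.
    slo = 0.999
    window_minutes = 30 * 24 * 60            # 43,200 minutes in the window
    monthly_requests = 50_000_000            # hypothetical traffic volume

    budget_fraction = 1 - slo                                # 0.1% may fail or be too slow
    budget_requests = monthly_requests * budget_fraction     # about 50,000 bad requests
    budget_minutes = window_minutes * budget_fraction        # about 43 minutes of full outage

    print(round(budget_requests), round(budget_minutes, 1))  # 50000 43.2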

Error budgets turn reliability into an engineering decision. If the service is well within budget, the team can take more deployment risk. If the budget is nearly exhausted, the team should stop pushing risky changes and spend time on reliability.

Perfect reliability is the wrong target for most services. The engineering question is whether the service meets the reliability level its users and business require.

Designing for Low Latency

Latency is constrained by physics and by system design. Signals cannot travel faster than light, and real networks do not follow direct great-circle paths. A round trip across a continent or ocean takes noticeable time before any software overhead is added.

Good low-latency design starts by avoiding unnecessary work. Do not move data that does not need to move. Do not call a remote service when a local function call would do. Do not serialize large payloads through a service that only needs a small field. Do not put a slow dependency on the critical path when the result can be computed later.

Several techniques reduce latency: caching, replication, parallel requests, batching, pipelining, and placing data and computation near users.

Each technique has a tradeoff. Batching may add waiting time. Caching can return stale data. Replication needs consistency rules. Parallel calls increase load and can amplify failure.

Asynchronous Operations

Some work should happen asynchronously. A user-facing request can acknowledge receipt while background workers update indexes, send email, replicate data, generate thumbnails, or refresh analytics.

Asynchrony reduces perceived latency because the user does not wait for every side effect. It also helps balance load because work can be queued and processed when capacity is available.

The risk is consistency. If the user sees a confirmation before all side effects complete, the system must define what happens when a background task fails. A message queue, retry policy, dead-letter queue, idempotency key, and reconciliation job may be needed to make asynchronous processing reliable.
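A minimal single-process sketch of that pattern, with an in-memory queue standing in for a real message broker:

    # Sketch: background work with bounded retries and a dead-letter queue.
    # queue.Queue stands in for a real broker such as Kafka, SQS, or RabbitMQ.
    import queue

    work = queue.Queue()
    dead_letter = []            # tasks that kept failing, kept for inspection and replay
    MAX_ATTEMPTS = 3

    def send_email(task):
        # Stand-in for the real side effect; raises to simulate a transient failure.
        if task["to"] == "broken@example.com":
            raise ConnectionError("mail server unreachable")

    def worker():
        while not work.empty():
            task = work.get()
            try:
                send_email(task)
            except ConnectionError:
                task["attempts"] += 1
                if task["attempts"] < MAX_ATTEMPTS:
                    work.put(task)              # retry later
                else:
                    dead_letter.append(task)    # give up; a reconciliation job reviews these

    work.put({"to": "alice@example.com", "attempts": 0})
    work.put({"to": "broken@example.com", "attempts": 0})
    worker()
    print(len(dead_letter))     # 1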

If a system needs strong consistency, use a storage or transaction framework that provides it. Ad hoc consistency mechanisms are a common source of production bugs.

Measure the Cost of Everything

Engineers without measurement are guessing.

The cost of an operation depends on the hardware, operating system, language runtime, framework, network, load, configuration, and failure mode. Numbers from a blog post or a search result are useful only as rough orientation. The number that governs your design is the one measured in your system under your workload.

Measure the costs that affect critical paths.

Benchmark before the system is in crisis. Baselines from healthy operation make regressions visible later.
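A simple harness like the one below provides such a baseline; critical_operation is a placeholder for whatever call actually sits on the critical path.

    # Sketch: measuring an operation's latency distribution rather than guessing.
    import time

    def critical_operation():
        sum(range(10_000))        # stand-in work

    samples = []
    for _ in range(1_000):
        start = time.perf_counter()
        critical_operation()
        samples.append(time.perf_counter() - start)

    samples.sort()
    p50 = samples[len(samples) // 2]
    p99 = samples[int(len(samples) * 0.99)]
    print(f"p50={p50 * 1e6:.1f}us  p99={p99 * 1e6:.1f}us")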

Testing, Profiling, and Optimization

Testing distributed systems means testing cases that the developer would rather not think about: partial failure, latency spikes, partitions, duplicate messages, slow disks, overloaded dependencies, old clients, malformed inputs, and deployments that only partially complete.

Unit tests check components in isolation. Integration tests check whether components work together. Smoke tests verify that a deployed system starts and answers basic requests. Load tests expose contention, memory leaks, queue buildup, connection-pool exhaustion, and tail-latency problems.

Failure injection deliberately kills services, drops packets, delays responses, fills disks, or partitions networks to observe how the system behaves. Chaos engineering is the discipline built around this idea.

Optimization should focus on critical paths. Interpreted languages and dynamic runtimes can be excellent for many services, but CPU-heavy or latency-sensitive paths may need a compiled language or a specialized implementation. The decision should come from profiling, not habit.

Understand the Tools You Use

Every distributed framework hides some complexity and exposes other complexity. Before depending on a tool, understand its assumptions.

You should know how it scales, how it partitions data, how it handles failures, what consistency guarantees it provides, what latency profile it has, what operational burden it adds, and what happens during upgrades.

MapReduce hides worker failure, data partitioning, and task scheduling. Pregel hides message delivery across graph vertices and organizes computation into supersteps. Dynamo-style systems hide replica selection and quorum reads and writes. Kubernetes hides some deployment and restart mechanics while exposing many configuration and networking concerns.

Good services provide the same property to their callers. The caller should not need to understand the full distributed machinery underneath unless the interface promises require it.

Security Engineering

Security in distributed systems is a collection of related disciplines: authentication, authorization, encryption, key management, protocol design, input validation, logging, auditability, and incident response. Weakness in any one area can undermine the rest.

API gateways are often used at the edge of a system. They can centralize authentication, rate limiting, request filtering, routing, and logging for traffic entering from outside.

OAuth 2.0 and OpenID Connect are common mechanisms for delegated authorization and identity. They let a service act on behalf of a user without sharing the user’s password with every component.

Secrets managers store and distribute credentials at runtime. HashiCorp Vault, AWS Secrets Manager, Google Secret Manager, Azure Key Vault, and Kubernetes secrets encrypted with a key management service (KMS) are common choices. The main rule is that private keys and long-lived credentials do not belong in source code or shared filesystems.

Certificate rotation should be automated. Manual certificate renewal creates outages because certificates expire at inconvenient times, and the runbooks (the written operational procedures used to renew them) drift out of date between renewals. The ACME protocol (Automated Certificate Management Environment) and tools built on it, such as cert-manager, cloud certificate managers, and service mesh control planes, handle issuance and renewal automatically.

Service meshes such as Istio and Linkerd can enforce mTLS, identity, authorization policy, and traffic rules between services. They also add another control plane, proxies on the request path, and more operational state. The benefit is a consistent security policy. The cost is complexity and overhead.

Infrastructure as Code

Infrastructure as code means describing environments in version-controlled text. Networks, subnets, routing tables, load balancers, machines, storage, DNS records, secrets, identity and access management (IAM) policies, and services should be reproducible from files rather than manually assembled in a cloud console.

The goal is reproducibility. A development environment, staging environment, production environment, new customer deployment, or new region should be created by running controlled automation. Manual configuration creates drift, and drift creates outages.

Terraform is a common declarative infrastructure tool. Pulumi and AWS CDK describe infrastructure using programming-language abstractions. Ansible, Chef, and Puppet manage configuration at the machine level.

Kubernetes applies the same idea to application workloads. YAML manifests describe pods, deployments, replica counts, container images, environment variables, mounted storage, services, ingress rules, and secrets. A controller compares desired state with actual state and acts to reconcile the two.

Manual configuration is the failure mode this discipline is designed to remove.

Configuration and Feature Flags

Configuration is part of the system. A bad configuration change can take down a service as effectively as a bad binary.

Configuration should be versioned, reviewed, validated, and rolled out gradually. It should have schemas, defaults, bounds checks, and ownership. A typo in a timeout, replica count, endpoint, permission, or region name should be caught before it reaches production.

Large systems often separate code deployment from feature release. The new code is deployed first, but the feature is enabled later through a feature flag. This allows you to turn behavior on for a single tenant, a single region, 1% of users, or internal traffic before exposing it broadly.
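One common implementation hashes a stable identifier so that the same user stays inside or outside the rollout across requests; in the sketch below, the flag name and percentage are illustrative, and the flag table would normally come from a configuration service.

    # Sketch: percentage-based feature rollout keyed on a stable user ID.
    # Hashing the ID (rather than sampling randomly per request) keeps a user's
    # experience consistent while the rollout percentage grows.
    import hashlib

    FLAGS = {"new_checkout_flow": 1}    # percent of users who see the feature

    def is_enabled(flag: str, user_id: str) -> bool:
        rollout = FLAGS.get(flag, 0)
        digest = hashlib.sha256(f"{flag}:{user_id}".encode()).digest()
        bucket = int.from_bytes(digest[:8], "big") % 100
        return bucket < rollout

    print(is_enabled("new_checkout_flow", "user-1001"))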

Feature flags make canary releases and fast rollbacks easier. They also create long-term risk if old flags are never removed. A flag that controls production behavior is production code, even if it lives in a configuration file.

Deployment Strategies

Deployment is part of distributed systems design because it changes a running system while users are still depending on it. A safe deployment strategy limits the blast radius of bad code.

Blue/Green Deployment

Blue/green deployment uses two production-capable environments. One environment serves live traffic. The other runs the new version. After validation, traffic is switched to the new environment, usually through a load balancer, router, or DNS change.

Rollback is fast because the previous environment is still available. The system does not need to rebuild the old version under pressure.

Canary Deployment

A canary deployment sends a small fraction of traffic to the new version first. Metrics and traces are watched for errors, latency changes, and resource changes. If the canary looks healthy, the fraction increases. If it fails, the new version is withdrawn before most users see it.

Canaries work best when observability is strong. Without metrics and traces, a canary is only a smaller blind deployment.

Version Compatibility

Deployments create mixed-version systems. Some clients and servers run old code while others run new code. Interfaces and schemas must be backward- and forward-compatible across the rollout window.

A safe schema migration often has multiple steps: add a field, deploy code that can read both old and new formats, write both formats if needed, migrate existing data, then remove old behavior only after old clients are gone.

The Eight Fallacies of Distributed Computing

The Eight Fallacies of Distributed Computing are a classic list of false assumptions associated with L. Peter Deutsch and others at Sun Microsystems in the 1990s. They remain useful because most production failures still trace back to some version of these mistakes.

Fallacy | Engineering consequence
The network is reliable. | Code must handle lost connections, timeouts, partitions, and ambiguous outcomes.
Latency is zero. | Remote calls must be treated as expensive operations.
Bandwidth is infinite. | Data formats, payload sizes, and unnecessary transfers affect performance and cost.
The network is secure. | Authentication, encryption, authorization, and auditing are required.
Topology does not change. | Services must tolerate routing changes, scaling events, failover, and migration.
There is one administrator. | Real systems cross teams, organizations, accounts, regions, and providers.
Transport cost is zero. | Communication has monetary cost, latency cost, CPU cost, and operational cost.
The network is homogeneous. | Real paths cross Wi-Fi, Ethernet, cellular, ISP networks, cloud fabrics, and long-haul links.

The fallacies are not historical trivia. They are design checks. When a design assumes one of these claims, it is probably fragile.

More Fallacies for Modern Systems

Modern distributed systems add more assumptions that fail in production.

Operating Environments Are Homogeneous

Real systems run on a mix of operating systems, kernel versions, container runtimes, library versions, CPU types, TLS implementations, and cipher suites. A service that works on one node may fail on another because a syscall, memory limit, certificate store, or crypto library differs.

Clocks Are Synchronized

Clocks drift. Time zones differ. The Network Time Protocol (NTP) can fail or step time backward. Clock synchronization can be engineered, but it should not be assumed. Use UTC for system timestamps and avoid local time except at user-facing boundaries.
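A small illustration of that rule in Python:

    # Sketch: record timestamps in UTC; convert to local time only at display boundaries.
    from datetime import datetime, timezone

    event_time = datetime.now(timezone.utc)       # what gets stored and logged
    print(event_time.isoformat())                 # e.g. 2026-05-01T14:03:07.123456+00:00
    print(event_time.astimezone())                # local time, for a user-facing view only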

Systems that depend on tight clock bounds need explicit clock infrastructure. Google Spanner’s TrueTime is an example of engineering the clock assumption rather than pretending it is free.

Test and Production Environments Are the Same

They are not. Test environments usually have less traffic, smaller datasets, weaker security constraints, fewer regions, fewer clients, and less operational noise. Bugs hide in that gap.

Staging, canaries, feature flags, and production observability reduce the risk. They do not make test identical to production.

Users Will Use Interfaces Correctly

Users and clients will send malformed inputs, omit required fields, pass nulls, exceed length limits, retry aggressively, hold connections open, use old API versions, and interpret documentation differently from what the author intended.

Validate inputs at every trust boundary. Treat every external input as untrusted, including inputs from internal services that may have been compromised or may contain bugs.

All Systems Will Run the Same Software Version

Mixed versions are normal during deployments. They are permanent when mobile clients, embedded devices, customer-hosted agents, or third-party integrations are involved.

API versioning, schema evolution, deprecation policies, and compatibility tests prevent new releases from breaking old clients.

Your Service Will Never Be a Target

Every reachable service is a target. Automated scanners probe open ports, default credentials, known vulnerable paths, and common API mistakes. Internal services can become exposed through misconfiguration.

Defense in depth means the system should not depend on one perimeter. Rate limiting, authentication, authorization, audit logs, patching, and least privilege all reduce the blast radius of compromise.

Failures Are Independent

Failures are often correlated. Services share data centers, networks, deployment pipelines, dependencies, libraries, certificates, configuration systems, cloud accounts, and operators. A single bad deployment or expired certificate can disable many replicas at once.

Redundancy only helps when replicas do not share the same root cause. Designing for availability means separating failure domains, not only counting copies.

Retries Make Things Better

Retries can make things worse. Without backoff, jitter, deadlines, idempotency, and circuit breakers (covered earlier under RPC and Failure), simultaneous retries against a failing service create a retry storm precisely when the service has the least capacity to absorb it.

Backups Work

Backups work only after they have been restored successfully. A backup that has never been exercised end to end is a hope, not a recovery plan.

Engineering Judgment

Distributed systems engineering is the discipline of replacing fragile assumptions with explicit design choices. The network is not reliable, so we use timeouts, retries, idempotency, and replication. Latency is not zero, so we reduce round trips, cache, pipeline, and place data near users. Failures are not independent, so we separate failure domains. Users do not follow the documentation, so we validate inputs and preserve compatibility.

Good distributed systems come from judgment, not from fashion. Their designers understand what could fail, measure the costs on the critical path, choose interfaces that survive change, and build enough operational visibility to find problems while they are still small.

