Distributed Systems

A deep dive into the principles behind Distributed Programming

Anshil Bhansali, Jun 21, 2020

What is it? A collection of independent computers that appears to its users as one computer. 3 characteristics - The computers operate concurrently, The computers can fail independently, The computers don’t share a global clock.

Distributed Storage - In large scale web apps, typically there are more reads than writes. To scale reads, we have ‘read replication’, and keep writes going to master. Here, we sacrifice is consistency; there could be some amount of lag between master and replica.

How do we scale writes? Sharding is an option. We can break up the data by some key (eg - email) into multiple databases, each one having read replicas. Here, we sacrifice is the data model, since we can’t join between shards. A good example of a scalable distributed storage system is Cassandra. The underlying architecture to implement this distributed system is pretty interesting and worth exploring; Consistent Hashing.

Distributed Computation - Distributed computing falls under a popular paradigm called Map Reduce. This design maps a function over several machines, each performing computation with different inputs. The outputs of all these machines are 'reduced' and gathered together. Another term for this paradigm is Scatter Gather.

A common implementation of this paradigm is the Hadoop ecosystem, but a more popular, faster, lighter version of it is Spark . Spark is a great tool for large scale distributed computing because it has a great abstraction over map reduce which makes it extremely developer friendly.

Distributed Messaging - This is a means of loosely coupling subsystems; a way to make microservices communicate with each other. The core principle of distributed messaging is PubSub. On a high level, messages are created by Producers and organized into 'topics'. Brokers are the nodes that process these messages and put them into a queue. And finally, Subscribers are the nodes that consume these messages, and act on them.

Apache Kafka is a great implementation of PubSub, and the most widely used. It is scalable because as the topics get too big, it gets partitioned (or sharded) into multiple nodes, each managed by a broker. It is important to remember that with scale, we are always sacrificing something. Here we are sacrificing a global ordering of messages in this partitioned systems. Therefore, we deal with that limitation. For example, we can only hash the userID sending the message, or the host IP, hence all the messages from that user or host go into one partition and they are ordered.