Key characteristics of a Distributed system

Niyanta Zamindar
3 min readDec 27, 2021

When designing a distributed system ( one of the most favourite aspect of technical interviews, mid-senior level onward ), these are the key non-functional areas that need your attention -

  • Scalability
  • Reliability
  • Availability
  • Efficiency
  • Manageability
Photo by ThisisEngineering RAEng on Unsplash

Let us go over them quickly :)

Scalability, is the ability of a system to incorporate the growing demands ( increasing user base, more requests per second etc ) such that its performance does not take a hit.
There are two ways to implement this,
- Horizontal scaling: adding more machines / servers into the pool of resources. HS is easier to achieve. It is facilitated by data stores like Cassandra, MongoDB.
- Vertical scaling: increasing capacity of the existing servers by adding more power ( CPU, RAM, storage ). VS is limited to the capacity of a single server and trying scale beyond that often involves system downtime. Example of VS can be found in MySQL.

Reliability, is determined by computing the probability of a system failing in a given period of time. A distributed system is deemed reliable if it keeps delivering its services even when one of its hardware / software component fails.
This is achieved with the help of data redundancy — which has its own cost. The aim here is to eliminate every single point of failure by keeping multiple instances around, which can be used to replace the failing instance.

Availability, is a measure of the amount of time a system, service or a machine remains operational under normal conditions. Availability takes into account maintainability, repair time, spares availability and other logistics. A system down for maintenance won’t be considered available during that time.
Reliability vs availability, if a system is reliable, it will be available. However the vice-versa is not always true. High reliability contributes to high availability, but it is possible to achieve a highly available system by minimising the repair time and ensuring that spares are always available when needed.

Efficiency, consider an operation that runs in a distributed manner and delivers a set of items as result. Then, the efficiency can be measured using — response time ( or latency ), which denotes the delay to obtain the first item and throughput ( or bandwidth ) which denotes the number of items delivered in a given time unit ( ex. a second ).
This comes down to the following unit costs —
- number of messages sent globally by the nodes of the system, regardless of the message size.
- size of messages representing the volume of data exchange.

Serviceability / Manageability, talks about
- how easy it is to operate or maintain.
- the time taken during repairs, as the system will not be available for use during repair.
- the ease of diagnosing and understanding the problem, the ease of making updates / modifications. An early detection of fault / faults can significantly lower or in some cases, avoid system downtime.

I hope these made perfect sense to you. Please feel free to expand / add to any topic if you see fit :))

--

--

Niyanta Zamindar

I read. I evolve. I get out of hibernation. I write.