The Lambda Architecture is an approach to building stream processing applications on top of MapReduce and near real-time data processing systems.
This has proven to be a surprisingly popular idea.
All data entering the system is dispatched to both the batch layer and the speed layer for processing.
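This dual dispatch can be sketched in a few lines. The following is a minimal, hypothetical illustration (the in-memory list and dict stand in for HDFS and the speed layer's store, which the original describes but does not show as code):

```python
# Stand-in for HDFS: the immutable, append-only master dataset.
events_master_dataset = []
# Stand-in for the speed layer's real-time view.
realtime_view = {}

def speed_layer_update(event):
    # Incrementally update the real-time view for this event.
    key = event["page"]
    realtime_view[key] = realtime_view.get(key, 0) + 1

def dispatch(event):
    # Every incoming event goes to BOTH layers.
    events_master_dataset.append(event)  # batch layer: store raw data
    speed_layer_update(event)            # speed layer: process immediately

dispatch({"page": "/home"})
dispatch({"page": "/home"})
```

Note that the batch layer only stores the raw event here; it processes the accumulated data later, on its own schedule.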
The batch layer has two functions:
- Managing the master dataset (an immutable, append-only set of raw data)
- Pre-computing the batch views
The serving layer indexes the batch views so that they can be queried in a low-latency, ad-hoc way.
The speed layer compensates for the high latency of updates to the serving layer and deals with recent data only.
Any incoming query can be answered by merging results from batch views and real-time views.
Each of these layers can be realized using various big data technologies. For instance, the batch layer datasets can be stored in a distributed file system, while MapReduce can be used to create batch views that can be fed to the serving layer. The serving layer can be implemented using NoSQL technologies such as HBase or Cassandra, while querying can be implemented by technologies such as Apache Drill or Impala. Finally, the speed layer can be realized with data streaming technologies such as Apache Storm or Spark Streaming. Now let’s consider each of these layers one at a time.
The batch layer is responsible for two things. The first is to store the immutable, constantly growing master dataset (HDFS), and the second is to compute arbitrary views from this dataset (MapReduce). Computing the views is a continuous operation, so when new data arrives it will be aggregated into the views when they are recomputed during the next MapReduce iteration.
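To make the view computation concrete, here is a toy map/reduce pass over a hypothetical master dataset of pageview events (the page names and the counting job are illustrative assumptions, not from the original; a real job would run as a distributed MapReduce over HDFS):

```python
from collections import defaultdict

# Hypothetical master dataset: raw, immutable pageview events.
master_dataset = [
    {"page": "/home"}, {"page": "/about"}, {"page": "/home"},
]

def map_phase(records):
    # Like a MapReduce mapper: emit a (key, 1) pair per event.
    for r in records:
        yield r["page"], 1

def reduce_phase(pairs):
    # Like a MapReduce reducer: sum the counts per key.
    view = defaultdict(int)
    for key, count in pairs:
        view[key] += count
    return dict(view)

# Each batch iteration recomputes the view from the ENTIRE dataset.
batch_view = reduce_phase(map_phase(master_dataset))
# batch_view == {"/home": 2, "/about": 1}
```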
The views should be computed from the entire dataset and therefore the batch layer is not expected to update the views frequently. Depending on the size of your dataset and cluster, each iteration could take hours.
The output from the batch layer is a set of flat files containing the precomputed views. The serving layer is responsible for indexing and exposing the views so that they can be queried. As the batch views are static, the serving layer only needs to provide batch updates and random reads. Users would then be able to query the views immediately.
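As a rough sketch of what the serving layer does, the snippet below bulk-loads a precomputed batch view (imagined here as a tab-separated flat file of page/count lines) into an index that supports random reads; the file format and keys are assumptions for illustration:

```python
import io

# Stand-in for a flat file emitted by the batch layer.
flat_file = io.StringIO("/home\t2\n/about\t1\n")

def load_batch_view(f):
    # Bulk update: index the whole view in one pass.
    index = {}
    for line in f:
        key, value = line.rstrip("\n").split("\t")
        index[key] = int(value)
    return index

serving_index = load_batch_view(flat_file)
serving_index.get("/home")  # random read on the indexed view
```

Because the views are static between batch iterations, this bulk-load-then-read pattern is all the serving layer needs; there are no random writes.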
However, the batch and serving layers alone do not satisfy any real-time requirement because MapReduce is (by design) high latency, and it could take a few hours for new data to be represented in the views and propagated to the serving layer. This is why we need the speed layer.
Just a note about the use of the term real-time. When we say real-time, we actually mean near real-time (NRT), where the delay in question is the time between the occurrence of an event and the availability of processed data from that event.
In essence the speed layer is the same as the batch layer in that it computes views from the data it receives. The speed layer is needed to compensate for the high latency of the batch layer and it does this by computing real-time views. The real-time views contain only the delta results to supplement the batch views.
Whilst the batch layer is designed to continuously recompute the batch views from scratch, the speed layer uses an incremental model whereby the real-time views are incremented as and when new data is received. What's clever about the speed layer is that the real-time views are intended to be transient: as soon as the data propagates through the batch and serving layers, the corresponding results in the real-time views can be discarded.
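The increment-then-discard lifecycle can be sketched as follows (a minimal illustration with assumed pageview counts; a real speed layer would use a store such as HBase or Cassandra rather than a dict):

```python
realtime_view = {}

def on_event(page):
    # Incremental update: no recomputation from scratch.
    realtime_view[page] = realtime_view.get(page, 0) + 1

def on_batch_cycle_complete(pages_absorbed):
    # Once these results have reached the serving layer via the batch
    # views, the transient real-time deltas can be discarded.
    for page in pages_absorbed:
        realtime_view.pop(page, None)

on_event("/home")
on_event("/home")
on_batch_cycle_complete(["/home"])  # batch caught up; delta discarded
```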
The final piece of the puzzle is exposing the real-time views so that they can be queried and merged with the batch views to get the complete results. As the real-time views are incremental, the speed layer requires both random reads and writes.
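Putting it together, a query merges the batch result with the speed layer's delta for the same key. This is a hypothetical sketch with made-up counts, assuming an additive view such as pageview totals:

```python
batch_view    = {"/home": 100, "/about": 40}  # from the serving layer
realtime_view = {"/home": 3}                  # recent data, not yet in batch

def query(page):
    # Complete result = batch result + real-time delta.
    return batch_view.get(page, 0) + realtime_view.get(page, 0)

query("/home")   # batch 100 + delta 3
query("/about")  # batch 40, no recent delta
```

The merge function depends on the view: summing works for counts, but other views (say, unique-visitor sets) need a merge operation appropriate to that data.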