MapReduce
MapReduce is a programming model and an associated implementation for processing and generating big data sets with a parallel, distributed algorithm on a cluster. MapReduce often comes up when learning about distributed systems, so I thought I'd try to get a more in-depth idea of what it is here.
References
- MapReduce Wikipedia Article
Definitions
- Programming Model
- an execution model coupled to an API or a particular pattern of code
- An execution model specifies the behavior of elements of the language
- Big Data
- Refers to data sets that are too large or complex to be dealt with by traditional data-processing application software
- Parallel
- Parallel computing is a type of computation in which many calculations or processes are carried out simultaneously
- Distributed
- Computer systems whose inter-communicating components are located on different networked computers
- Cluster
- Set of computers that work together so that they can be viewed as a single system.
- Map
- Idiom in parallel computing where a simple operation is applied to all elements of a sequence, potentially in parallel. It is used to solve embarrassingly parallel problems: those problems that can be decomposed into multiple subtasks, requiring no communication/synchronization between the subtasks except a join or barrier at the end.
- Reduce
- Type of operator that is commonly used in parallel programming to reduce the elements of an array into a single result.
- Marshalling
- Process of transforming the memory representation of an object into a data format suitable for storage or transmission, especially between different runtimes.
- Redundancy
- The intentional duplication of critical components or functions of a system with the goal of increasing reliability of the system, usually in the form of a backup or fail-safe, or to improve actual system performance.
- Fault Tolerance
- The ability of a system to maintain proper operation despite failures or faults in one or more of its components. This capability is essential for high-availability, mission critical, or even life-critical systems.
- Monoid
- In abstract algebra, a branch of mathematics, a monoid is a set equipped with an associative binary operation and an identity element. For example, the nonnegative integers with addition form a monoid, the identity element being 0.
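Several of these definitions map directly onto Python's built-ins; a minimal illustration of the map and reduce idioms (using `map` and `functools.reduce`, with 0 as the identity element of addition):

```python
from functools import reduce

# Map: apply a simple operation to all elements of a sequence.
# Each application is independent, so this is embarrassingly parallel.
squares = list(map(lambda x: x * x, [1, 2, 3, 4]))  # [1, 4, 9, 16]

# Reduce: fold the elements of a sequence into a single result,
# starting from the identity element 0.
total = reduce(lambda a, b: a + b, squares, 0)  # 30
```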
Related
- Apache Hadoop
- Collection of open-source software utilities that facilitates using a network of many computers to solve problems involving massive amounts of data and computation.
Notes
MapReduce is a programming model and an associated implementation for processing and generating big data sets with a parallel, distributed algorithm on a cluster.
- A MapReduce program is composed of a map procedure, which performs filtering and sorting (such as sorting students by first name into queues, one queue for each name), and a reduce method, which performs a summary operation (such as counting the number of students in each queue, yielding name frequencies).
- The MapReduce System orchestrates the processing by marshalling the distributed servers, running the various tasks in parallel, managing all communications and data transfers between the various parts of the system, and providing for redundancy and fault tolerance.
- The key contributions of the MapReduce framework are the scalability and fault-tolerance achieved for a variety of applications due to parallelization.
- Gains are only seen in multi-threaded implementations on multi-processor hardware.
- The use of this model is beneficial only when the optimized distributed shuffle operation (which reduces network communication cost) and fault tolerance features of the MapReduce framework come into play.
- A popular open source implementation of MapReduce that has support for distributed shuffles is part of Apache Hadoop.
Overview
- MapReduce is a framework for processing parallelizable problems across large datasets using a large number of computers (nodes) collectively referred to as a cluster (if all nodes are on the same local network and use similar hardware) or a grid (if the nodes are shared across geographically and administratively distributed systems, and use more heterogeneous hardware.)
- MapReduce can work on data in a filesystem or in a database, and it takes advantage of the locality of data.
- A MapReduce framework is usually composed of the steps below:
- Map: each worker node applies the map function to the local data and writes the output to local storage.
- Shuffle: worker nodes redistribute the data based on output keys, such that all data belonging to one key is located on the same worker node.
- Reduce: worker nodes now process each group of output data, per key, in parallel.
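The three steps above can be sketched on a single machine (a simplified simulation, not a distributed implementation; the example data and function names are hypothetical):

```python
from collections import defaultdict

def run_mapreduce(map_fn, reduce_fn, inputs):
    # Map step: each input record yields a list of (key, value) pairs.
    mapped = [pair for record in inputs for pair in map_fn(record)]
    # Shuffle step: group all values belonging to the same key together.
    groups = defaultdict(list)
    for key, value in mapped:
        groups[key].append(value)
    # Reduce step: process each key's group of values independently.
    return {key: reduce_fn(key, values) for key, values in groups.items()}

# Example: maximum temperature reading per city (hypothetical data).
readings = [("nyc", 75), ("sf", 60), ("nyc", 80), ("sf", 58)]
hottest = run_mapreduce(
    lambda record: [record],         # each record is already a (key, value) pair
    lambda city, temps: max(temps),  # reduce a city's readings to its maximum
    readings,
)
# hottest == {"nyc": 80, "sf": 60}
```

In a real framework, the map, shuffle, and reduce phases each run across many worker nodes; here the grouping dictionary stands in for the network shuffle.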
- MapReduce can be applied to significantly larger datasets than a single commodity server can handle - a large server farm can use MapReduce to sort a petabyte of data in only a few hours.
- Another way of looking at MapReduce:
- Prepare the Map() input - the MapReduce system designates Map processors, assigns the input key K1 that each processor will work on, and provides that processor with all the input data associated with that key
- Run the user-provided Map() code - Map() is run exactly once for each K1 key, generating output organized by key K2
- Shuffle the Map output to the Reduce processors - the MapReduce system designates Reduce processors, assigns the K2 key each processor should work on, and provides that processor with all the Map-generated data associated with that key
- Run the user-provided Reduce() code - Reduce() is run exactly once for each K2 key produced by the Map step
- Produce the final output - the MapReduce system collects all the Reduce output, and sorts it by K2 to produce the final output
Logical View
- The Map and Reduce functions of MapReduce are both defined with respect to data structured in (key, value) pairs. Map takes one pair of data in one data domain, and returns a list of pairs in a different domain: Map(k1, v1) → list(k2, v2)
- The Map function is applied in parallel to every pair (keyed by k1) in the input dataset. This produces a list of pairs (keyed by k2) for each call. After that, the MapReduce framework collects all pairs with the same key k2 from all lists and groups them together, creating one group for each key.
- The Reduce call typically produces either one key-value pair or an empty return, though one call is allowed to return more than one key-value pair.
- Canonical MapReduce Example:
function map(String name, String document):
    // name: document name
    // document: document contents
    for each word w in document:
        emit(w, 1)

function reduce(String word, Iterator partialCounts):
    // word: a word
    // partialCounts: a list of aggregated partial counts
    sum = 0
    for each pc in partialCounts:
        sum += pc
    emit(word, sum)
Here, each document is split into words, and each word is counted by the map function, using the word as the result key. The framework puts together all the pairs with the same key and feeds them to the same call to reduce. Thus, this function just needs to sum all of its input values to find the total appearance of that word.
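The pseudocode above translates almost line-for-line into runnable Python; here a plain dictionary stands in for the framework's shuffle step:

```python
def map_document(name, document):
    # name: document name; document: document contents
    # Emit a (word, 1) pair for each word in the document.
    return [(word, 1) for word in document.split()]

def reduce_word(word, partial_counts):
    # partial_counts: a list of aggregated partial counts for this word
    return sum(partial_counts)

# Simulate the framework: map, group the pairs by key, then reduce per key.
pairs = map_document("doc1", "to be or not to be")
grouped = {}
for word, count in pairs:
    grouped.setdefault(word, []).append(count)
totals = {word: reduce_word(word, counts) for word, counts in grouped.items()}
# totals == {"to": 2, "be": 2, "or": 1, "not": 1}
```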
Another Example - the following SQL query, which groups people by age and averages their contact counts, can also be expressed as a MapReduce job:
SELECT age, AVG(contacts) FROM social.person GROUP BY age ORDER BY age;
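A sketch of that query as a map keyed on age plus a reduce that averages (the rows of social.person are hypothetical; note that a real distributed job would carry (sum, count) pairs through the shuffle, since averages of partial averages are not directly combinable):

```python
people = [(25, 100), (25, 120), (30, 80)]  # hypothetical (age, contacts) rows

def map_person(row):
    age, contacts = row
    return (age, contacts)  # key each record by age

def reduce_age(age, contact_counts):
    # Average the contact counts for one age group.
    return (age, sum(contact_counts) / len(contact_counts))

# Shuffle: group contact counts by age, then reduce each group; the
# final sort corresponds to the query's ORDER BY age.
grouped = {}
for age, contacts in map(map_person, people):
    grouped.setdefault(age, []).append(contacts)
averages = sorted(reduce_age(age, counts) for age, counts in grouped.items())
# averages == [(25, 110.0), (30, 80.0)]
```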
Data Flow
- Software framework architectures adhere to the open-closed principle, where code is effectively divided into unmodifiable frozen spots and extensible hot spots. The frozen spot of the MapReduce framework is a large distributed sort. The hot spots, which the application defines, are:
- an input reader
- a Map function
- a partition function
- Each Map function output is allocated to a particular reducer by the application's partition function for sharding purposes. The partition function is given the key and the number of reducers and returns the index of the desired reducer.
- A typical default is to hash the key and use the hash value modulo the number of reducers. It is important to pick a partition function that gives an approximately uniform distribution of data per shard for load-balancing purposes.
- a compare function
- a Reduce function
- an output writer
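The typical default partition function described above (hash-mod) can be sketched like this; CRC32 is used here only as an example of a hash that is stable across runs:

```python
import zlib

def partition(key, num_reducers):
    # Hash the key and take it modulo the number of reducers, giving
    # each key a fixed, roughly uniform shard index. Every pair with
    # the same key therefore lands on the same reducer.
    return zlib.crc32(key.encode("utf-8")) % num_reducers
```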
Theoretical Background
- Properties of monoids are the basis for ensuring the validity of MapReduce operations
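A sketch of why this matters: because a monoid's operation is associative and has an identity element, a reduction can be split across workers and the partial results combined in any grouping without changing the answer (addition over the integers, identity 0, shown here):

```python
from functools import reduce

data = list(range(1, 101))

# One worker reducing the whole sequence...
sequential = reduce(lambda a, b: a + b, data, 0)

# ...versus four "workers" each reducing a chunk, with the partial
# results combined afterwards. Associativity (plus the identity 0 for
# empty chunks) guarantees both groupings give the same result.
chunks = [data[i:i + 25] for i in range(0, len(data), 25)]
partials = [reduce(lambda a, b: a + b, chunk, 0) for chunk in chunks]
parallel = reduce(lambda a, b: a + b, partials, 0)
# sequential == parallel == 5050
```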
Performance Considerations
- The author of a MapReduce function has to take the shuffle step into consideration; in particular, the partition function and the amount of data written by the Map function can have a large impact on the performance and scalability
- MapReduce applications can achieve sub-linear speedups under specific circumstances
Durability and Reliability
- MapReduce achieves reliability by parceling out a number of operations on the set of data to each node in the network.
- Implementations are not necessarily highly reliable
Uses
- MapReduce is useful in a wide range of applications, including pattern-based searching, distributed sorting, web-link graph traversal, Singular Value Decomposition, web access log stats, inverted index construction, document clustering, machine learning, and statistical machine translation.
- At Google, MapReduce was used to completely regenerate Google's index of the World Wide Web.