To Hadoop, or not to Hadoop

23 Jul 2012

To Hadoop, or not to Hadoop: that is the question.

To many, Map-Reduce is the panacea for all kinds of performance evils: it appears that merely using Map-Reduce in your application will magically transform it into a high-performing, screaming application.

The fact is that the Map-Reduce algorithm is applicable to only a certain class of problems. It is ideally suited to what is commonly referred to as the "embarrassingly parallel" class of problems: problems that are inherently parallel, for example the creation of inverted indices from web-crawled documents.

Map-Reduce is an algorithm that was popularized by Google. The term map-reduce actually originates from Lisp, in which the "map" function takes a list of arguments and performs the same operation on all of them. The "reduce" function then applies a common criterion to pick a reduced set of values from this list. Google uses Map-Reduce to create an inverted index.
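
As a rough analogy (this is ordinary Java 8 streams, not Hadoop itself), the same map-then-reduce shape looks like this; the input list is hypothetical and for illustration only:

    import java.util.Arrays;
    import java.util.List;

    public class MapReduceAnalogy {
        public static void main(String[] args) {
            // Hypothetical input list, for illustration only.
            List<Integer> values = Arrays.asList(3, 1, 4, 1, 5);

            // "map": apply the same operation to every element.
            // "reduce": fold the mapped values into one result
            // using a common criterion (here, summation).
            int total = values.stream()
                              .map(n -> n * 2)
                              .reduce(0, Integer::sum);
            System.out.println(total); // prints 28
        }
    }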

An inverted index basically provides a mapping from a word to the list of documents in which it occurs. Building one typically happens in two stages. A set of parallel "map" tasks take documents as input, parse them, and emit a sequence of (word, document id) pairs. In other words, the map takes a key-value pair (k1, v1) as input and maps it into an intermediate (k2, v2) pair.
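
A minimal sketch of such a map task, assuming Hadoop's org.apache.hadoop.mapreduce API with line-oriented text input (the class name InvertedIndexMapper and the choice of the input file name as the document id are illustrative assumptions, not part of the original):

    import java.io.IOException;
    import org.apache.hadoop.io.LongWritable;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Mapper;
    import org.apache.hadoop.mapreduce.lib.input.FileSplit;

    // Maps the input pair (k1, v1) = (byte offset, line of text) to
    // intermediate (k2, v2) = (word, document id) pairs.
    public class InvertedIndexMapper
            extends Mapper<LongWritable, Text, Text, Text> {
        @Override
        protected void map(LongWritable offset, Text line, Context context)
                throws IOException, InterruptedException {
            // Assumption: the document id is the name of the input file.
            String docId = ((FileSplit) context.getInputSplit())
                    .getPath().getName();
            for (String word : line.toString().split("\\W+")) {
                if (!word.isEmpty()) {
                    context.write(new Text(word.toLowerCase()),
                                  new Text(docId));
                }
            }
        }
    }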

The reduce tasks take the (word, document id) pairs, reduce them, and emit a (word, list of document ids) pair. Applications like the inverted index clearly make sense for the Map-Reduce algorithm, since several map tasks can work in parallel on separate documents. Other typical applications are counting the occurrences of words in documents, or counting the number of times a URL has been hit in a traffic log.
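
The matching reduce side, under the same assumptions as the mapper sketch above, could look as follows; the framework groups all intermediate pairs by word before invoking reduce:

    import java.io.IOException;
    import java.util.LinkedHashSet;
    import java.util.Set;
    import org.apache.hadoop.io.Text;
    import org.apache.hadoop.mapreduce.Reducer;

    // Receives all (word, document id) pairs for one word and emits a
    // single (word, list of document ids) entry of the inverted index.
    public class InvertedIndexReducer
            extends Reducer<Text, Text, Text, Text> {
        @Override
        protected void reduce(Text word, Iterable<Text> docIds,
                              Context context)
                throws IOException, InterruptedException {
            // Deduplicate document ids while preserving arrival order.
            Set<String> unique = new LinkedHashSet<>();
            for (Text id : docIds) {
                unique.add(id.toString());
            }
            context.write(word, new Text(String.join(",", unique)));
        }
    }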

The key point in all of these problems is that the work can be handled in parallel. Tasks that can execute independently, besides being inherently parallel, are eminently suitable for Hadoop processing. Such tasks also work on extraordinarily large data sets, which is the other criterion for Hadoop-worthy applications.
