10.10.13

Generative InputFormats for MapReduce

InputFormats in Hadoop abstract the process of reading input records for mappers.  Here's how they work:

1) The InputFormat is selected at runtime: the job configuration names the class, and the framework instantiates it by reflection.

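2) The RecordReader created by the InputFormat provides an iterator-like API:

  - nextKeyValue (returns a boolean: true if another record was read)

  - getCurrentKey / getCurrentValue (return the record that was just read)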

3) The InputFormat class also provides the InputSplits, and a RecordReader for each split, to the higher-level MapReduce framework, which creates one mapper per split and feeds it individual records.
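
The contract in the list above can be modelled without Hadoop on the classpath. This is a simplified stand-in, not the real API: `SimpleRecordReader`, `ListRecordReader` and `ReaderLoopDemo` are hypothetical names, and only the three method names mirror Hadoop's `RecordReader`. The `drain` loop is roughly what the framework does on a mapper's behalf.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical, simplified stand-in for Hadoop's RecordReader contract.
interface SimpleRecordReader<K, V> {
    boolean nextKeyValue();   // advance; false once the split is exhausted
    K getCurrentKey();
    V getCurrentValue();
}

// Reads (index, line) pairs from an in-memory list that stands in for one split.
class ListRecordReader implements SimpleRecordReader<Integer, String> {
    private final List<String> lines;
    private int position = -1;

    ListRecordReader(List<String> lines) {
        this.lines = lines;
    }

    @Override
    public boolean nextKeyValue() {
        if (position + 1 >= lines.size()) {
            return false;
        }
        position++;
        return true;
    }

    @Override
    public Integer getCurrentKey() {
        return position;
    }

    @Override
    public String getCurrentValue() {
        return lines.get(position);
    }
}

class ReaderLoopDemo {
    // Drives a reader the way the framework drives it for a mapper:
    // advance, then hand the current key and value to the map step.
    static <K, V> List<String> drain(SimpleRecordReader<K, V> reader) {
        List<String> seen = new ArrayList<>();
        while (reader.nextKeyValue()) {
            seen.add(reader.getCurrentKey() + "=" + reader.getCurrentValue());
        }
        return seen;
    }

    public static void main(String[] args) {
        System.out.println(drain(new ListRecordReader(Arrays.asList("a", "b", "c"))));
    }
}
```

The point of the boolean return is that the caller never needs to know how many records a split holds; it just keeps asking until the reader says it is done.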

The most common InputFormat is FileInputFormat, which provides a series of InputSplits that collectively cover the whole of each input file.

So - what if you want to generate input on the fly?

In this case, we can create our own custom InputFormat, whose RecordReader keeps returning generated key/value pairs until it has produced enough of them.  The number of pairs to return can be read from a configuration parameter if we want.

Here's an example:
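
What follows is a minimal, dependency-free sketch of the idea rather than a drop-in Hadoop class. Assumptions: a plain `Map<String, String>` stands in for Hadoop's `Configuration`, and the class names and the keys `generative.num.splits` and `generative.records.per.split` are hypothetical. In a real job the same logic would live in subclasses of `InputFormat`, `InputSplit` and `RecordReader` from `org.apache.hadoop.mapreduce`.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical split: carries no file data, only which records to generate.
class GeneratedSplit {
    final long firstRecord;
    final long recordCount;

    GeneratedSplit(long firstRecord, long recordCount) {
        this.firstRecord = firstRecord;
        this.recordCount = recordCount;
    }
}

// Produces synthetic (record number, text) pairs until the split's quota is met.
class GeneratingRecordReader {
    private final GeneratedSplit split;
    private long produced = 0;
    private long key;
    private String value;

    GeneratingRecordReader(GeneratedSplit split) {
        this.split = split;
    }

    boolean nextKeyValue() {
        if (produced >= split.recordCount) {
            return false;
        }
        key = split.firstRecord + produced;
        value = "record-" + key;
        produced++;
        return true;
    }

    long getCurrentKey() {
        return key;
    }

    String getCurrentValue() {
        return value;
    }
}

class GenerativeInputFormat {
    // Hypothetical configuration keys, chosen for this sketch only.
    static final String NUM_SPLITS = "generative.num.splits";
    static final String RECORDS_PER_SPLIT = "generative.records.per.split";

    // One split per desired mapper; each split covers its own range of record numbers.
    static List<GeneratedSplit> getSplits(Map<String, String> conf) {
        int numSplits = Integer.parseInt(conf.getOrDefault(NUM_SPLITS, "1"));
        long perSplit = Long.parseLong(conf.getOrDefault(RECORDS_PER_SPLIT, "10"));
        List<GeneratedSplit> splits = new ArrayList<>();
        for (int i = 0; i < numSplits; i++) {
            splits.add(new GeneratedSplit(i * perSplit, perSplit));
        }
        return splits;
    }

    static GeneratingRecordReader createRecordReader(GeneratedSplit split) {
        return new GeneratingRecordReader(split);
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        conf.put(NUM_SPLITS, "2");
        conf.put(RECORDS_PER_SPLIT, "3");
        for (GeneratedSplit split : getSplits(conf)) {
            GeneratingRecordReader reader = createRecordReader(split);
            while (reader.nextKeyValue()) {
                System.out.println(reader.getCurrentKey() + "\t" + reader.getCurrentValue());
            }
        }
    }
}
```

Because Hadoop runs one map task per split, the number of splits returned here is what would control how many mappers generate data in parallel.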


