22.12.11

How to read non-Writable values (not just keys) with Hadoop's SequenceFile.Reader class.

Special thanks to the DataSalt folks for this one.

This is a very, very specific post. Only those confused about Hadoop, Java, and custom serialization will find it useful. Actually, I think only programmers doing Java development at either DataSalt or Peerindex will find it useful... But I guess that will be good for my Peerindex score, because it's really specific. So that's cool.

Anyways...

So, I've been trying to read in key/value pairs from Thrift that are not Writable in Hadoop.

Oddly, when using ThriftSerialization settings, you can easily use a SequenceFile.Reader to do the following:

MyThriftClass myThriftBean = new MyThriftClass();           // the Thrift-generated bean
myThriftBean = (MyThriftClass) myReader.next(myThriftBean); // read the next key into it
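
For context, here is a rough sketch of how such a reader might be wired up. The Thrift serializer class name below is a placeholder (it depends on which Thrift serialization library you have on the classpath); the key point is that the io.serializations property must list a Serialization that can handle your Thrift beans:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;

public class OpenThriftSequenceFile {
  public static SequenceFile.Reader open(String file) throws Exception {
    Configuration conf = new Configuration();
    // Register a Serialization that knows how to (de)serialize the Thrift beans.
    // "com.example.hadoop.ThriftSerialization" is a placeholder -- use whichever
    // implementation your project actually ships with.
    conf.set("io.serializations",
        "com.example.hadoop.ThriftSerialization,"
      + "org.apache.hadoop.io.serializer.WritableSerialization");
    FileSystem fs = FileSystem.get(conf);
    return new SequenceFile.Reader(fs, new Path(file), conf);
  }
}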

However, this works only because the SequenceFile.Reader class supports a

next(Object o); method. This method reads in only a key.

I'm sure we would all expect that there MUST be a corresponding method that reads in both the key and the value of a given entry in a sequence file... right? WRONG!

ODDLY: SequenceFile.Reader does NOT have a

next(Object k, Object v); method!

So - what if you want to read both the keys and the values of a SequenceFile?

You can do the following:

Object k = myKeyClass.newInstance();
Object v = myValueClass.newInstance();

// next(Object) returns the deserialized key, or null at the end of the file.
while ((k = reader.next(k)) != null)
{
    System.out.println(k); // the key has been read in already...
    v = reader.getCurrentValue(v); // now read the value before reading the next key.
    System.out.println(v);
}
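
Putting it all together, here is a minimal, self-contained sketch of the pattern (the configuration is assumed to already have the right io.serializations entry, as in the setup sketch above, and the file path comes from the command line). It uses the reader's own getKeyClass()/getValueClass() plus ReflectionUtils to instantiate the beans, so it works for any non-Writable serialization you've registered:

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.SequenceFile;
import org.apache.hadoop.util.ReflectionUtils;

public class DumpSequenceFile {
  public static void main(String[] args) throws Exception {
    Configuration conf = new Configuration(); // assumes io.serializations is already set up
    FileSystem fs = FileSystem.get(conf);
    SequenceFile.Reader reader =
        new SequenceFile.Reader(fs, new Path(args[0]), conf);
    try {
      // Ask the reader which classes the file was written with, and instantiate them.
      Object k = ReflectionUtils.newInstance(reader.getKeyClass(), conf);
      Object v = ReflectionUtils.newInstance(reader.getValueClass(), conf);
      // next(Object) returns the deserialized key, or null at the end of the file.
      while ((k = reader.next(k)) != null) {
        v = reader.getCurrentValue(v); // the value for the key we just read
        System.out.println(k + "\t" + v);
      }
    } finally {
      reader.close();
    }
  }
}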

Unfortunately, the SequenceFile.Reader documentation says that the

next(Object o)

method "skips" the value, reading only the key. However, that's not quite the case: it appears that, until you read the NEXT key, the reader still has access to the value of the entry you are currently looking at, via getCurrentValue().
