12.12.13

Pig: Loading Data with Multiple Delimiters

Here is how to load row-delimited data that uses TWO different field delimiters into Pig as a uniform series of tuples.

The input line looks like this (the heterogeneous delimiters are the commas inside each half and the single tab between "2" and "yang"):
BigPetStore,storeCode_OK,2     yang,jay,Mon Dec 15 23:33:49 EST 1969,69.56,flea collar
The output we want looks like this:
BigPetStore    storeCode_OK    2    yang    jay    Mon Dec 15 23:33:49 EST 1969    69.56    flea collar
Now let's LOAD this as-is into Pig (this will take two steps).

The default load operation will split by tabs for us.  Our second operation will have to go into the two comma-separated fields and pull out the sub-fields as top-level tuple elements.

csvdata = LOAD 'petstoredata' AS (ID,DETAILS);
Pig's default loader (PigStorage) splits on tabs, so after the LOAD each row looks something like this:
( BigPetStore,storeCode_OK,2    yang,jay,Mon Dec 15 23:33:49 EST 1969,69.56,flea collar )

Attempt 1: Using STRSPLIT to break the individual fields, but forgetting to flatten them:

This got me part of the way there: it split the individual fields up for me.  But each row ultimately comes out as two nested tuples rather than one flat tuple (STRSPLIT returns a tuple, not a bag).

id_details = FOREACH csvdata GENERATE (STRSPLIT (ID,',',3)), (STRSPLIT (DETAILS,',',5));
( BigPetStore,storeCode_OK,2 )    ( yang,jay,Mon Dec 15 23:33:49 EST 1969,69.56,flea collar )
Attempt 2: Combining STRSPLIT with FLATTEN over each tuple:

Okay, so finally: what we want is to merge all of this into one large tuple, so we wrap each STRSPLIT result in the FLATTEN operator inside the GENERATE clause:

id_details = FOREACH csvdata GENERATE  
   flatten(STRSPLIT (ID,',',3)),  
   flatten(STRSPLIT (DETAILS,',',5));

And we get the following uniform tuples:
BigPetStore    storeCode_OK    2    yang    jay    Mon Dec 15 23:33:49 EST 1969    69.56    flea collar
YAY: So that's how you load lines into Pig when multiple delimiters are required.

Now, for a brief aside [in very small font, because I realize it's a little silly and I know it is nit-picking]... but I gotta say: in plain old MapReduce this would be a little more transparent:
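// Sketch of a Mapper.map() body; assumes the usual Hadoop imports
// (LongWritable, Text, NullWritable) plus java.io.IOException and java.util.*.
@Override
protected void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
   // Split the line on the tab into its two comma-separated halves.
   String[] halves = value.toString().split("\t");
   // Break each half into sub-fields (same limits as the STRSPLIT calls above).
   // Copy into an ArrayList, because Arrays.asList alone returns a fixed-size list.
   List<String> fields = new ArrayList<>(Arrays.asList(halves[0].split(",", 3)));
   fields.addAll(Arrays.asList(halves[1].split(",", 5)));
   // Emit all sub-fields as one uniformly delimited record.
   String joined = String.join(",", fields);
   context.write(new Text(joined), NullWritable.get());
}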
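
If you want to sanity-check the splitting logic without a Hadoop cluster, here is a minimal plain-Java sketch of the same idea. The class name FlattenDelimiters and its flatten method are hypothetical helpers made up for illustration; they are not part of Pig or Hadoop. It assumes every line has exactly one tab, and it reuses the limits 3 and 5 from the STRSPLIT calls.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical helper class for illustration only (not a Pig or Hadoop API).
class FlattenDelimiters {

    // Split on the tab first, then split each half on commas,
    // using the same limits (3 and 5) as the STRSPLIT calls.
    // Assumes the line contains a tab; otherwise halves[1] does not exist.
    static List<String> flatten(String line) {
        String[] halves = line.split("\t", 2);
        List<String> fields = new ArrayList<>(Arrays.asList(halves[0].split(",", 3)));
        fields.addAll(Arrays.asList(halves[1].split(",", 5)));
        return fields;
    }

    public static void main(String[] args) {
        String line = "BigPetStore,storeCode_OK,2\t"
                + "yang,jay,Mon Dec 15 23:33:49 EST 1969,69.56,flea collar";
        // Join with tabs to mimic the flattened tuple layout.
        System.out.println(String.join("\t", flatten(line)));
    }
}
```

Running main on the sample line should give eight tab-separated fields, in the same order as the flattened Pig tuple.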
