The input line looks like this: [little yellow marks represent the heterogeneous delimiters].
BigPetStore,storeCode_OK,2 yang,jay,Mon Dec 15 23:33:49 EST 1969,69.56,flea collar
The output we want is like this:
BigPetStore storeCode_OK 2 yang jay Mon Dec 15 23:33:49 EST 1969 69.56 flea collar
Now let's LOAD this as-is into Pig (this will take two steps).
The default load operation will split by tabs for us. Our second operation will have to go into the two comma-separated fields and pull out the subfields as top-level tuple elements.
csvdata = LOAD 'petstoredata' AS (ID,DETAILS);
Pig's default loader uses a tab delimiter, so after the LOAD, it looks something like this:
( BigPetStore,storeCode_OK,2 yang,jay,Mon Dec 15 23:33:49 EST 1969,69.56,flea collar )
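If it helps to see that first step outside of Pig, here is a rough Java sketch of what the tab-delimited load amounts to. The class and method names (TabLoadSketch, loadLine) are made up for illustration, and it assumes the line has exactly one tab separating the two comma-separated groups.

```java
public class TabLoadSketch {
    // Hypothetical helper: mimics a tab-delimited load into (ID, DETAILS).
    // Assumes the input line contains exactly one tab.
    static String[] loadLine(String line) {
        return line.split("\t");
    }

    public static void main(String[] args) {
        String line = "BigPetStore,storeCode_OK,2\t"
                + "yang,jay,Mon Dec 15 23:33:49 EST 1969,69.56,flea collar";
        String[] fields = loadLine(line);
        // Each field still holds its commas; nothing has been split on ',' yet.
        System.out.println("ID      = " + fields[0]);
        System.out.println("DETAILS = " + fields[1]);
    }
}
```

The point is just that after this step there are only two fields, and the commas are still sitting inside them.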
Attempt 1: Using STRSPLIT to break the individual fields, but forgetting to flatten them:
This got me part of the way there: it split the individual fields up for me. But since STRSPLIT returns a tuple, each record ultimately ends up as a tuple holding two nested tuples.
id_details = FOREACH csvdata GENERATE (STRSPLIT(ID, ',', 3)), (STRSPLIT(DETAILS, ',', 5));
( BigPetStore,storeCode_OK,2 ) ( yang,jay,Mon Dec 15 23:33:49 EST 1969,69.56,flea collar )
Attempt 2: Combining the STRSPLIT with flatteners over each tuple:
Okay, so finally: what we want to do is join all this into one large tuple, so we add the flatten operator to each tuple in the generator:
id_details = FOREACH csvdata GENERATE
    FLATTEN(STRSPLIT(ID, ',', 3)),
    FLATTEN(STRSPLIT(DETAILS, ',', 5));
And we get the following uniform, flat tuples:
BigPetStore storeCode_OK 2 yang jay Mon Dec 15 23:33:49 EST 1969 69.56 flea collar
YAY: So that's how you load lines into Pig when multiple delimiters are required.
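For anyone curious what the split-then-flatten step is doing under the hood, here is a small standalone Java sketch. As far as I understand it, STRSPLIT's regex and limit arguments follow the same semantics as Java's String.split(regex, limit), so the sketch leans on that; the class and method names (SplitAndFlattenSketch, splitAndFlatten) are invented for illustration and are not part of Pig.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

public class SplitAndFlattenSketch {
    // Hypothetical helper: split each tab field on commas, then concatenate
    // the pieces into one flat list (the role FLATTEN plays in the Pig script).
    static List<String> splitAndFlatten(String id, String details) {
        List<String> flat = new ArrayList<>();
        flat.addAll(Arrays.asList(id.split(",", 3)));       // at most 3 pieces
        flat.addAll(Arrays.asList(details.split(",", 5)));  // at most 5 pieces
        return flat;
    }

    public static void main(String[] args) {
        String id = "BigPetStore,storeCode_OK,2";
        String details = "yang,jay,Mon Dec 15 23:33:49 EST 1969,69.56,flea collar";
        System.out.println(splitAndFlatten(id, details));

        // With a limit of 5, a comma inside the last field is left alone,
        // so "flea collar, large" stays together as one element.
        System.out.println(splitAndFlatten(id, details + ", large"));
    }
}
```

The limit is the part worth noticing: it caps the number of pieces, so a stray comma in the trailing product description does not produce an extra column.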
Now, for a brief aside [in very small font because I realize it's a little silly and I know it is nit-picking].
But I gotta say: in plain old MapReduce this would be a little more transparent:
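// Sketch of a Hadoop Mapper.map() method; assumes the usual imports
// (java.util.*, java.io.IOException, org.apache.hadoop.io.*).
public void map(LongWritable key, Text value, Context context)
        throws IOException, InterruptedException {
    // Split the line into its two tab-separated fields.
    String[] parts = value.toString().split("\t");
    // Break each field on commas and collect the subfields in one list.
    // (Wrap in ArrayList: Arrays.asList alone is fixed-size and rejects addAll.)
    List<String> fields = new ArrayList<>(Arrays.asList(parts[0].split(",")));
    fields.addAll(Arrays.asList(parts[1].split(",")));
    // Emit the flattened record as a single comma-joined key.
    String joined = String.join(",", fields);
    context.write(new Text(joined), NullWritable.get());
}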