This post describes processing of the BigPetStore input data set, first recapping how we do it in Pig, and then showing an alternative approach in Hive.
The basic data format (subject to change over time) is intentionally a little complex:
BigPetStore,storeCode_AK,1 watt,duane,Tue Dec 30 02:42:47 EST 1969,48.25,flea collar
BigPetStore,storeCode_AK,2 stevens,sanford,Thu Jan 01 00:01:50 EST 1970,66.85,premium cat food
In Pig, we load the data set as-is and then merge the two comma-separated components into one big Pig tuple by "flattening" the first segment and then "flattening" the second segment, yielding something like this:
BigPetStore storeCode AK 1 watt duane Tue Dec 30 02:42:47 EST 1969 48.25 flea collar
In Hive, I decided to try a different approach: using the RegexSerDe class.
So: here's how to define a Hive table over a file with more than one delimiter. The RegexSerDe class lets you load heterogeneously delimited data, and it's pretty simple to use. A quick note on libraries before you try this: I had to make sure that the hive-serde and hive-contrib jars were on my HADOOP_HOME/lib/ path. (In pure Hive, you can just issue an "ADD JAR" command to add jars to the classpath at runtime, but in BigPetStore we do everything in the IDE and in local Hadoop mode, so I manually dropped these jars into HADOOP_HOME/lib/ to get it working from my IDE.)
1) Craft a "base" definition for your Hive table the way you would any normal SQL table... Jot it down... Something like:
String create =
"CREATE TABLE hive_bigpetstore_etl (" +
" a1 STRING," +
" b2 STRING," +
" c3 STRING," +
" d4 STRING," +
" e5 STRING," +
" e6 STRING," +
" e7 STRING," +
" e8 STRING)" +
"STORED AS TEXTFILE";
2) Build a regex where each capturing group corresponds to a column in your Hive table.
3) Copy your regex into the input.regex field of your CREATE TABLE statement.
Hive will then map each group matched in a row of data, in order, to the successive columns defined in your CREATE TABLE statement.
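Before baking a regex into a table definition, it's handy to sanity-check that each capturing group pulls out the column you expect, using plain java.util.regex. A small standalone sketch (the class and method names here are just for illustration, not part of BigPetStore):

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class RegexGroupsDemo {

    // Same shape as the SerDe regex: non-capturing groups for the fixed
    // prefix and the tab, one capturing group per Hive column.
    static final Pattern ROW = Pattern.compile(
        "(?:BigPetStore,storeCode_)" +
        "([A-Z][A-Z])," +   // group 1: state
        "([0-9]*)" +        // group 2: state transaction id
        "(?:\t)" +          // the tab between the two segments, not captured
        "([a-z]*)," +       // group 3: first name
        "([a-z]*)," +       // group 4: last name
        "([A-Z][^,]*)," +   // group 5: date
        "([^,]*)," +        // group 6: price
        "([^,]*).*");       // group 7: product

    /** Returns the captured "columns" for one raw line, or null on no match. */
    static String[] parse(String line) {
        Matcher m = ROW.matcher(line);
        if (!m.matches()) {
            return null;
        }
        String[] cols = new String[m.groupCount()];
        for (int i = 0; i < cols.length; i++) {
            cols[i] = m.group(i + 1); // group 0 is the whole line, skip it
        }
        return cols;
    }

    public static void main(String[] args) {
        String line = "BigPetStore,storeCode_AK,2\tstevens,sanford,"
                + "Thu Jan 01 00:01:50 EST 1970,66.85,premium cat food";
        for (String col : parse(line)) {
            System.out.println(col);
        }
    }
}
```

If a group comes back empty or the line doesn't match at all, you've found the problem before Hive silently gives you NULL columns.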
Here is a snippet from how we use this in the BigPetStore project to create a table over the raw input data set without doing any manual ETL steps:
String create =
"CREATE TABLE hive_bigpetstore_etl (" +
" a1 STRING," +
" b2 STRING," +
" c3 STRING," +
" d4 STRING," +
" e5 STRING," +
" e6 STRING," +
" e7 STRING," +
" e8 STRING)" +
" ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe' " +
"WITH SERDEPROPERTIES (" +
"\"input.regex\" = \"INPUT_REGEX\" , " +
"\"output.format.string\" = \"%1$s %2$s %3$s %4$s %5$s %6$s %7$s %8$s\") " +
"STORED AS TEXTFILE";
Now, in my case, this regex was sufficiently complicated that I defined a "base" string that structured the "create" statement, and then layered the regex in programmatically. You could do this in bash too, or even in Hive itself with Hive variables. In my pure-Java Hive CREATE TABLE statement, I did:
create = create.replaceAll(
    "INPUT_REGEX",
    "(?:BigPetStore,storeCode_)" +
    "([A-Z][A-Z])," +   // state (CT)
    "([0-9]*)" +        // state transaction id (1)
    "(?:\t)" +          // [tab]
    "([a-z]*)," +       // fname (jay)
    "([a-z]*)," +       // lname (vyas)
    "([A-Z][^,]*)," +   // date starts with a capital letter (MTWTFSS)
    "([^,]*)," +        // price (12.19)
    "([^,]*).*");       // product (premium cat food)
System.out.println(create);
And finally, we execute the query and load the raw data into the table:
ResultSet res = stmt.executeQuery(create);
res = stmt.executeQuery(
    "LOAD DATA INPATH '<rawInput>' INTO TABLE hive_bigpetstore_etl"
        .replaceAll("<rawInput>", rawInput));
So there you have it: it's easy to put a uniform, relational view over line-delimited data using Hive and the Regex SerDe... you just have to understand how capturing groups work!