30.12.13

Building a Hive table over multi-delimiter CSV files.

One of the goals in the BigPetStore project is to demonstrate how to process the same data set using multiple different ecosystem components.

Processing the BigPetStore input data set with Pig is described in this post.


The basic data format (subject to change over time) is intentionally a little complex:

BigPetStore,storeCode_AK,1    watt,duane,Tue Dec 30 02:42:47 EST 1969,48.25,flea collar
BigPetStore,storeCode_AK,2    stevens,sanford,Thu Jan 01 00:01:50 EST 1970,66.85,premium cat food


In Pig, we load the data set as-is and then merge the two comma-separated segments into one big Pig tuple by "flattening" the first segment and then "flattening" the second, yielding something like this:

BigPetStore    storeCode    AK    1    watt    duane    Tue Dec 30 02:42:47 EST 1969    48.25    flea collar
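For reference, that load-and-flatten step in Pig looks roughly like the sketch below (the relation names and input path are my own placeholders, not the actual script from the Pig post):

    -- load each line as two tab-separated chunks of comma-separated fields
    raw = LOAD 'bps/generated' USING PigStorage('\t')
          AS (seg1:chararray, seg2:chararray);
    -- split each chunk on commas and flatten both tuples into one wide row
    rows = FOREACH raw GENERATE
           FLATTEN(STRSPLIT(seg1, ',')),
           FLATTEN(STRSPLIT(seg2, ','));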

In Hive, I decided to try a different approach: using the RegexSerDe class.

A quick note on libraries before you try this:

The RegexSerDe class lets you load heterogeneously delimited data, and it's pretty simple to use. I had to make sure that the hive-serde and hive-contrib jars were in my HADOOP_HOME/lib/ path. (In pure Hive you can just issue an "add jar" command to add jars to the classpath at runtime, but in BigPetStore we do everything in the IDE and in local Hadoop mode, so I manually copied these jars into HADOOP_HOME/lib/ to get it working from my IDE.)
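If you are working from the Hive shell, that "add jar" route looks like this (the paths and version numbers below are placeholders; adjust them for your install):

    ADD JAR /path/to/hive-contrib-0.12.0.jar;
    ADD JAR /path/to/hive-serde-0.12.0.jar;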
So: here's how to define a Hive table over a file with more than one delimiter.

1) Craft a "base" definition for your Hive table the way you would any normal SQL table. Jot it down... something like:

     String create =
                "CREATE TABLE hive_bigpetstore_etl (" +
                "  a1 STRING," +
                "  b2 STRING," +
                "  c3 STRING," +
                "  d4 STRING," +
                "  e5 STRING," +
                "  e6 STRING," +
                "  e7 STRING," +
                "  e8 STRING)" +
                " STORED AS TEXTFILE";


2) Build a regex in which each capturing group corresponds to a column in your Hive table.

3) Copy your regex into the input.regex property of your CREATE TABLE statement.

Hive will then put each group matched per row of data, in order, into the successive columns defined in your CREATE TABLE statement.
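As a toy example (mine, not from BigPetStore): say each line looks like "a:1,foo", with a colon after the first field and a comma after the second. Three capturing groups map onto three columns:

    CREATE TABLE toy_example (
      letter STRING,
      num    STRING,
      word   STRING)
    ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
    WITH SERDEPROPERTIES (
      "input.regex" = "([a-z]):([0-9]+),(.*)")
    STORED AS TEXTFILE;

Group 1 lands in letter, group 2 in num, and group 3 in word.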

Here is a snippet showing how we use this in the BigPetStore project to create a table over the raw input data set, without doing any manual ETL steps:

 String create =
                "CREATE TABLE hive_bigpetstore_etl (" +
                "  a1 STRING," +
                "  b2 STRING," +
                "  c3 STRING," +
                "  d4 STRING," +
                "  e5 STRING," +
                "  e6 STRING," +
                "  e7 STRING," +
                "  e8 STRING)" +
                " ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe' " +
                "WITH SERDEPROPERTIES (" +
                "\"input.regex\" = \"INPUT_REGEX\", " +
                "\"output.format.string\" = \"%1$s %2$s %3$s %4$s %5$s %6$s %7$s %8$s\") " +
                "STORED AS TEXTFILE";

Now, in my case, this regex was sufficiently complicated that I defined a "base" string that structured the "create" statement, and then layered the regex in programmatically. You could do this in bash too, or even in Hive itself with Hive variables (sketched after the code below). In my pure-Java Hive CREATE TABLE statement, I did:
        create = create.replaceAll(
                "INPUT_REGEX",
                // eight capturing groups, one per column in the table above
                "(BigPetStore)," +       // chain name (BigPetStore)
                "(?:storeCode_)" +       // skip over the storeCode_ prefix
                "([A-Z][A-Z])," +        // state (CT)
                "([0-9]*)" +             // state transaction id (1)
                "(?:\t)" +               // [tab]
                "([a-z]*)," +            // fname (jay)
                "([a-z]*)," +            // lname (vyas)
                "([A-Z][^,]*)," +        // date starts with a capital letter (MTWTFSS)
                "([^,]*)," +             // price (12.19)
                "([^,]*).*");            // product (premium cat food)

        System.out.println(create);
        ResultSet res =  stmt.executeQuery(create);
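As an aside, the Hive-variables route mentioned above might look something like this (the script name create_toy.q reuses my earlier made-up toy table; this is a sketch, not project code):

    hive --hivevar input_regex='([a-z]):([0-9]+),(.*)' -f create_toy.q

    -- create_toy.q
    CREATE TABLE toy_example (
      letter STRING,
      num    STRING,
      word   STRING)
    ROW FORMAT SERDE 'org.apache.hadoop.hive.contrib.serde2.RegexSerDe'
    WITH SERDEPROPERTIES ("input.regex" = "${hivevar:input_regex}")
    STORED AS TEXTFILE;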
And finally, we load the raw data into the table:
        res = stmt.executeQuery(
                "LOAD DATA INPATH '<rawInput>' INTO TABLE hive_bigpetstore_etl"
                    .replaceAll("<rawInput>", rawInput));
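As a last step, a quick sanity check is worth running (this SELECT is my own addition, not part of the project code) to confirm the capturing groups landed in the right columns:

        // expect the chain name, a state code, and a product on each row
        res = stmt.executeQuery(
                "SELECT a1, b2, e8 FROM hive_bigpetstore_etl LIMIT 5");
        while (res.next()) {
            System.out.println(res.getString(1) + "\t"
                    + res.getString(2) + "\t" + res.getString(3));
        }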

So there you have it: it's easy to put a uniform, relational view over line-delimited data using Hive and the RegexSerDe. You just have to understand how capturing groups work!
