The old days: writing SQL-friendly data types in Hadoop.
In Hadoop, we typically have a problem where we need to process a dataset using SQL. To do that, we need to ETL the dataset into the correct format on disk. This involves doing one of several tasks.
- Serializing it as plain CSV. This is easily readable by SQL tools like Hive.
- Serializing it using Writables, or using something like Thrift/Avro. Since Hive can read data in these formats, given the schema, this also works (a sketch of the Writable approach follows this list).
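To make the second option concrete, here is a rough sketch (in Scala, with a made-up TransactionWritable record) of the kind of hand-written serialization class it implies. Every consumer of the resulting sequence files then needs this class, or a matching schema/SerDe, on its side:

```scala
import java.io.{DataInput, DataOutput}

import org.apache.hadoop.io.Writable

// A hand-rolled Writable for a made-up transaction record.
class TransactionWritable(var customerId: Long,
                          var product: String,
                          var price: Double) extends Writable {

  // Hadoop instantiates Writables reflectively, so a no-arg
  // constructor is mandatory boilerplate.
  def this() = this(0L, "", 0.0)

  // Fields must be written...
  override def write(out: DataOutput): Unit = {
    out.writeLong(customerId)
    out.writeUTF(product)
    out.writeDouble(price)
  }

  // ...and read back in exactly the same order, by hand.
  override def readFields(in: DataInput): Unit = {
    customerId = in.readLong()
    product = in.readUTF()
    price = in.readDouble()
  }
}
```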
This leads to things like: constant churn in serialization integration for the Hadoop ecosystem, continuous integration for a company-wide "data API", fancy integration tests which write data using Java and then read it back with some custom-configured Pig/Hive setup, and so on. These are all great projects, but they are also symptoms of accidental complexity.
Monolithic data apps (in a good way): Spark serialization with case classes.
With Spark, we serialize native Spark data types: case classes. We in turn write those to sequence files, or else refer to them directly in our code as transformed RDDs. Since RDDs can come from disk or from memory, adopting them as our idiom removes a lot of accidental complexity from processing large data sets (see the sketch after the list below). Case classes and RDDs:
- eliminate the need to define intermediate data sets by name
- allow us to ignore the serialization implementation, since we deal natively with the same data structures Scala already supports
- free us from having to maintain a separate data API
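As a minimal sketch of what that workflow looks like (the Transaction case class, the values, and the output path here are all invented for illustration):

```scala
import org.apache.spark.{SparkConf, SparkContext}

// The "data API" is just a case class: no Writables, no Thrift/Avro IDL.
case class Transaction(customerId: Long, product: String, price: Double)

object CaseClassPipeline {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("case-class-pipeline"))

    // An RDD of case classes: it could just as easily have been
    // loaded from disk as parallelized from memory.
    val transactions = sc.parallelize(Seq(
      Transaction(1L, "dog-food", 10.50),
      Transaction(2L, "cat-toy", 3.25)))

    // Transform it like any other Scala collection...
    val bigSpenders = transactions.filter(_.price > 5.0)

    // ...and, when we do want it on disk, persist it without defining a
    // separate on-disk schema (this writes a sequence file of serialized objects).
    bigSpenders.saveAsObjectFile("/tmp/big-spenders")

    sc.stop()
  }
}
```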
Example: SparkSQL
In BigPetStore, we recently needed to create a SQL-friendly version of a data set (i.e., one whose fields SparkSQL or JDBC can map via reflection). We can do this in pure Scala, but we have to map java.util.Calendar into something SparkSQL can handle (i.e., something it can reflect over).
To do this, we simply create an intermediate RDD which maps the calendar field into another data type. Rather than explain it, you can just look at the screenshot above, which also serves as an example of how to convert a non-SparkSQL data type into a SparkSQL-friendly one.
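For readers who want the gist without the screenshot, here is a rough sketch of the same pattern. The case classes, field names, and table name are invented, and the SQLContext/SchemaRDD calls follow the Spark 1.x reflection-based API rather than the exact BigPetStore code:

```scala
import java.util.Calendar

import org.apache.spark.SparkContext
import org.apache.spark.rdd.RDD
import org.apache.spark.sql.SQLContext

// Original record: the java.util.Calendar field is opaque to
// SparkSQL's reflection-based schema inference.
case class RawTransaction(customerId: Long, product: String, date: Calendar)

// SQL-friendly twin: same data, with the date flattened into primitives.
case class SqlTransaction(customerId: Long, product: String,
                          year: Int, month: Int, day: Int)

object SqlFriendlyExample {

  // The intermediate RDD: map the Calendar field into plain fields.
  def toSqlFriendly(raw: RDD[RawTransaction]): RDD[SqlTransaction] =
    raw.map { t =>
      SqlTransaction(t.customerId, t.product,
        t.date.get(Calendar.YEAR),
        t.date.get(Calendar.MONTH) + 1, // Calendar months are 0-based
        t.date.get(Calendar.DAY_OF_MONTH))
    }

  def main(args: Array[String]): Unit = {
    val sc = new SparkContext("local", "sql-friendly-example")
    val sqlContext = new SQLContext(sc)
    import sqlContext.createSchemaRDD // Spark 1.x implicit: RDD[case class] -> SchemaRDD

    val raw = sc.parallelize(Seq(
      RawTransaction(1L, "dog-food", Calendar.getInstance())))

    // Register the SQL-friendly RDD as a table and query it with plain SQL.
    toSqlFriendly(raw).registerTempTable("transactions")
    sqlContext.sql("SELECT product, year, month FROM transactions")
      .collect()
      .foreach(println)

    sc.stop()
  }
}
```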
