9.2.15

Scala case classes and lazy intermediates make for easy SQL-friendly data interop

Laziness in Spark has lots of interesting applications. Most of the focus is on machine learning, but people often forget that things like SparkSQL also benefit from it.
Intermediate data sets are primitives in Spark, rather than accidental entities which we have to maintain by hand. The result is that tasks like massaging data for SparkSQL are surprisingly painless.


The old days: writing SQL-friendly data types in Hadoop

In Hadoop, we typically face the problem of needing to process a dataset using SQL. To do this, we need to ETL the dataset to disk in the correct format, which involves doing one of several things:
  • Serializing it as plain CSV. This is easily readable by SQL tools like Hive.
  • Serializing it using Writables, or using something like Thrift/Avro. Since Hive can read data in these formats, given the schema, this also works.
In either case, SQL friendliness in Hadoop requires an extra step of accidental complexity: we either write intermediate data to disk, or define a Thrift (or other) intermediate serialization definition file. A sketch of that kind of boilerplate follows.
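
To give a sense of what that extra step looks like, here is a minimal sketch of the kind of hand-rolled Writable a Hadoop pipeline tends to accumulate (the TransactionWritable name and its fields are hypothetical; a Thrift or Avro schema file plays the same role in those stacks):

import java.io.{DataInput, DataOutput}
import org.apache.hadoop.io.Writable

// Hypothetical hand-rolled Writable: the kind of boilerplate each intermediate
// type needs before downstream Hive/Pig jobs can read it.
class TransactionWritable(var customerId: Long,
                          var product: String,
                          var price: Double) extends Writable {

  // Hadoop instantiates Writables reflectively, so a no-arg constructor is required.
  def this() = this(0L, "", 0.0)

  // Serialize the fields in a fixed order ...
  override def write(out: DataOutput): Unit = {
    out.writeLong(customerId)
    out.writeUTF(product)
    out.writeDouble(price)
  }

  // ... and deserialize them in exactly the same order.
  override def readFields(in: DataInput): Unit = {
    customerId = in.readLong()
    product = in.readUTF()
    price = in.readDouble()
  }
}

Every intermediate type that a downstream Hive or Pig job needs to read typically requires either a class like this or a schema definition kept in sync with it.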

This leads to things like: constant rollover of serialization integrations for the Hadoop ecosystem, continuous integration for a company-wide "data API", fancy integration tests which write data using Java and then read it back through some custom-configured Pig/Hive setup, and so on. These are all great projects, but they are also symptoms of accidental complexity.

Monolithic data apps (in a good way): Spark serialization with case classes

With Spark, we serialize native Scala data types: case classes. We in turn write those to sequence files, or else refer to them directly in our code as transformed RDDs. Since RDDs can come from disk or from memory, adopting them as our idiom removes a lot of accidental complexity from processing large data sets. RDDs:
  • eliminate the need to define intermediate data sets by name
  • allow us to ignore the serialization implementation, since we deal natively with the same data structures Scala already supports
  • free us from having to maintain a data API
Granted, for sophisticated scenarios you might still want to use Thrift or Parquet, but even then, once you've loaded your RDDs, Spark can still manipulate them efficiently with a minimum of accidental complexity. A minimal sketch of this idiom follows.
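
To make that concrete, here is a small, self-contained sketch of the idiom, assuming a hypothetical Transaction case class, a local master, and a throwaway /tmp path: the case class is the only "schema", the intermediate RDD stays lazy, and we only touch disk when we decide an intermediate is worth keeping.

import org.apache.spark.{SparkConf, SparkContext}

// Hypothetical record type: a plain case class is the whole "data API".
case class Transaction(customerId: Long, product: String, price: Double)

object CaseClassPipeline {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("case-class-pipeline").setMaster("local[*]"))

    val transactions = sc.parallelize(Seq(
      Transaction(1L, "dog-food", 10.99),
      Transaction(2L, "cat-toy", 4.50),
      Transaction(1L, "cat-toy", 4.50)))

    // Lazy intermediate: nothing is computed (or named on disk) until an action runs.
    val bigSpends = transactions.filter(_.price > 5.0)

    // Persist the intermediate only if it is worth keeping; saveAsObjectFile writes a
    // sequence file of serialized case class instances, with no schema file needed.
    bigSpends.saveAsObjectFile("/tmp/big-spends")

    // Read it back as the same case class, without any separate data-API definition.
    val reloaded = sc.objectFile[Transaction]("/tmp/big-spends")
    println(reloaded.count())

    sc.stop()
  }
}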

Example: SparkSQL

In BigPetStore, we recently needed to create a SQL-friendly version of a data set (i.e. one whose fields SparkSQL or JDBC can handle via reflection). We can do this in pure Scala, but we have to map java.util.Calendar into something SparkSQL can infer a schema for by reflection.

To do this we simply create an intermediate RDD which maps the Calendar field into another type. Rather than explain it in prose, the sketch below shows the idea; it also serves as an example of how to convert a non-SparkSQL data type into a SparkSQL-friendly one.
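
Here is a minimal sketch of that conversion, assuming hypothetical RawTransaction/SqlTransaction case classes and the Spark 1.x SQLContext API: the Calendar field is mapped to a java.sql.Timestamp in an intermediate RDD, after which SparkSQL can infer the schema by reflection and query it.

import java.sql.Timestamp
import java.util.Calendar

import org.apache.spark.{SparkConf, SparkContext}
import org.apache.spark.sql.SQLContext

// Hypothetical raw record: Calendar is not a field type SparkSQL's reflection understands.
case class RawTransaction(customerId: Long, product: String, time: Calendar)

// SQL-friendly intermediate: same data, but the time is a java.sql.Timestamp.
case class SqlTransaction(customerId: Long, product: String, time: Timestamp)

object CalendarToSparkSql {
  def main(args: Array[String]): Unit = {
    val sc = new SparkContext(new SparkConf().setAppName("calendar-to-sparksql").setMaster("local[*]"))
    val sqlContext = new SQLContext(sc)
    import sqlContext.implicits._

    val raw = sc.parallelize(Seq(
      RawTransaction(1L, "dog-food", Calendar.getInstance()),
      RawTransaction(2L, "cat-toy", Calendar.getInstance())))

    // The lazy intermediate RDD: map the Calendar field into a type SparkSQL can map.
    val sqlFriendly = raw.map { t =>
      SqlTransaction(t.customerId, t.product, new Timestamp(t.time.getTimeInMillis))
    }

    // Reflection over the case class now gives us a schema for free.
    sqlFriendly.toDF().registerTempTable("transactions")
    sqlContext.sql("SELECT product, COUNT(*) AS n FROM transactions GROUP BY product").show()

    sc.stop()
  }
}

The only "ETL" here is the single map over the lazy intermediate; no serialization format or schema file had to change.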
