I recently got a chance to look at a very interesting pull request against Spark: the drop() functionality, by Erik E. The idea behind a drop() for an RDD is that, in some cases, you want to strip a header from a file (e.g. a CSV file or a binary data file).
I've also been reading "Scala in Depth", specifically the chapter on implicits and their usage.
Reviewing SPARK-2315 gave me an opportunity to go into the depths of how implicits are actually used in "real world" applications.
First, let's take a look at how this pull request modifies SparkContext. It adds a new implicit, rddToDropRDDFunctions, declared alongside the other RDD conversions on the SparkContext companion object.
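To make the mechanism concrete, here is a minimal sketch of what such a conversion might look like. The name rddToDropRDDFunctions comes from the PR; the wrapper class name DropRDDFunctions, the drop(n) signature, and the naive zipWithIndex-based body are my assumptions for illustration, and the real patch is surely more careful about partition boundaries.

```scala
import scala.reflect.ClassTag
import org.apache.spark.rdd.RDD

// Hypothetical enrichment class: exposes drop-style methods on any RDD.
class DropRDDFunctions[T: ClassTag](self: RDD[T]) extends Serializable {
  /** Return a new RDD with the first n elements removed (naive sketch). */
  def drop(n: Int): RDD[T] =
    self.zipWithIndex().filter { case (_, idx) => idx >= n }.map(_._1)
}

// The implicit conversion named in the PR. In pre-1.3 Spark these conversions
// live on the SparkContext companion object; a standalone object is used here
// only to keep the sketch self-contained.
object DropImplicits {
  implicit def rddToDropRDDFunctions[T: ClassTag](rdd: RDD[T]): DropRDDFunctions[T] =
    new DropRDDFunctions(rdd)
}
```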
But alas! Look at how drop() is being called: I don't see any drop function defined on SparkContext or on the RDD class itself! How is it, then, that this implicit lets Spark expose the whole drop API on ordinary RDD objects?
IMPLICITS: A Modern Implementation of Inversion of Control
The Scala compiler sees that an RDD has a certain element type, and when a method call does not resolve on the RDD itself, it scans the enclosing implicit scope to see what conversions are available from that RDD type. If such a conversion is in scope, the methods it provides become callable on the original object, as though they had been defined on the class directly.
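Here is a tiny Spark-free illustration of that lookup, with invented names (RichSeq, second, Enrichment): List has no second method, so the compiler searches the implicit scope, finds the richSeq conversion, and rewrites the call as richSeq(xs).second.

```scala
// Enrichment class providing the extra method.
class RichSeq[T](self: Seq[T]) {
  def second: T = self(1)
}

object Enrichment {
  // Implicit conversion the compiler can apply when `second` is not found on Seq.
  implicit def richSeq[T](xs: Seq[T]): RichSeq[T] = new RichSeq(xs)
}

object Demo extends App {
  import Enrichment._
  println(List(10, 20, 30).second) // prints 20
}
```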
Taking the example above, we have:
1) An implicit which adds DoubleRDD methods. This is an easy one: if the RDD's elements are of type Double, then we provide the new DoubleRDDFunctions(rdd) conversion. These functions add Double-specific operations to an RDD.
2) In the second example above, the numeric RDD is internally mapped to a Double RDD, but otherwise it's the same as (1), just a more general case.
3) Finally, the drop case: since we can drop from an RDD of any type, there is no need to match a particular element type (Double or Numeric). Instead, we can use the generic [T: ClassTag] hook to say "match any RDD, and refer to its element type as 'T'" (see the sketch after this list).
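Side by side, the three conversions differ only in the element type they are willing to match. The first two signatures below approximate the ones on the pre-1.3 SparkContext companion object; the third reuses the hypothetical DropRDDFunctions sketched earlier, and the object name is mine.

```scala
import scala.reflect.ClassTag
import org.apache.spark.rdd.{DoubleRDDFunctions, RDD}

object RDDConversions {
  // (1) matches only RDD[Double]
  implicit def doubleRDDToDoubleRDDFunctions(rdd: RDD[Double]): DoubleRDDFunctions =
    new DoubleRDDFunctions(rdd)

  // (2) matches any RDD[T] for which a Numeric[T] instance exists,
  //     mapping the elements to Double first
  implicit def numericRDDToDoubleRDDFunctions[T](rdd: RDD[T])(implicit num: Numeric[T]): DoubleRDDFunctions =
    new DoubleRDDFunctions(rdd.map(num.toDouble))

  // (3) matches any RDD[T] at all, attaching the drop-style extensions
  implicit def rddToDropRDDFunctions[T: ClassTag](rdd: RDD[T]): DropRDDFunctions[T] =
    new DropRDDFunctions(rdd)
}
```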
So how do implicits make the Spark API more flexible?
The punchline here is that, using implicits, this pull request demonstrates how we can expand the compiler's knowledge of valid RDD operations without adding machinery for explicit conversion between the high-level RDD types. This avoids a lot of the unnecessary over-inheritance or overuse of static utility libraries that we might find in other OO languages.
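As a usage sketch (DropImplicits and the drop(n) signature come from the hypothetical code above, and data.csv is a placeholder path), this is what stripping a CSV header would look like once the conversion is in scope, instead of reaching for a static helper:

```scala
import org.apache.spark.{SparkConf, SparkContext}
import DropImplicits._ // hypothetical conversion object sketched earlier

object DropHeaderExample extends App {
  val sc = new SparkContext(new SparkConf().setAppName("drop-header").setMaster("local[*]"))

  val lines = sc.textFile("data.csv") // placeholder file; first line is a header
  // With the implicit in scope, drop() reads like a native RDD method,
  // rather than a static-utility call such as DropUtils.drop(lines, 1).
  val body = lines.drop(1)
  body.take(5).foreach(println)

  sc.stop()
}
```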
FYI, I just learned that another great example of this is the Cassandra SparkContext extensions...