int sum = 0;
int[] numbers = ....;
for (int i = 0; i < numbers.length; i++)
    sum += numbers[i];
Can be written as
(reduce + numbers)
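For instance, with a concrete vector of numbers (illustrative values, mine rather than the original snippet's):
(reduce + [3 1 4 1 5])
>14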
That was easy, but what about in the real world?
As problems get more complicated, it becomes easy to throw random data structures at them (sets, hashmaps, treemaps, ...). Domain-specific "objects" get created, bloating your code, with names like "SpiderMonkeyRendevouz" and "SpiderMonkeyRelationFrequency"... And before you know it, you begin writing code to transform one data structure into another, for no reason whatsoever, other than the fact that your "initial" data structure did not directly compute the answer to the problem you were trying to solve.
All that work just to calculate how often two spider monkeys were seen together in the same tree :)
So, without further ado, let's work our way through a case study in the removal of intermediate state by proper use of Clojure's "reduce" function (we will completely abandon the SpiderMonkey antics for this exercise). I'm reasonably confident that, if you read this post in its entirety, your ability to avoid such pitfalls in the future will be significantly improved.
"Given a string, count the number of instances of each word in that string, and output the result as a map".
Example 1: A very sub-optimal solution to the word-count problem.
; Here was my first (non-linear in performance, as well as ugly) attempt at building a map of words to their counts in a string:
; assumes (require '[clojure.string :as cs]) for the cs alias used below
(defn word-enrichment
  "input: a string 'a b b'
   output: a map : {'a' 1 'b' 2}"
  [str_in]
  {:pre [(= (type str_in) (type ""))]}
  (let [all (cs/split str_in #"\b+")]   ; split the string on word boundaries
    (into {} ; <- store each emitted word count into this map
          (for [unique_word (set all)]
            [unique_word
             (count (filter #(= unique_word %) all))]))))
How it works: the code first splits the string on word boundaries, then builds a set of the unique words. For each unique word it rescans the full word list to count its occurrences, emitting [word count] pairs into a map (that's what the 'into {}' prefix to the 'for' list comprehension is defining: a map as the collector).
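To see the rescanning in action, here is one step of that inner count in isolation (illustrative input, mine rather than the post's). Each unique word triggers one such pass over the whole token list, so the total work grows roughly with (number of unique words) × (number of tokens):
(count (filter #(= "b" %) ["a" "b" "b"]))
>2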
Improving the above example by decreasing its use of intermediate state.
So the first red flag here is the intermediate set of unique words, which is clearly unnecessary. It would be better to count each word as we read it, storing the running results in some kind of map. That much is obvious to almost anyone...
However, the alternative, in a language like Clojure which eschews mutable data structures, might evade you at first: how can we iterate through the words in the text and build up a map without declaring a global variable?
Let's formalize these issues by adding two requirements to our new solution:
1) You can only read through the input string once.
2) You can't use any global or mutable variables.
How can we count words in a single pass without a mutable data structure?
The key here is to think functionally. Since "for" emits items, wouldn't it be nice to emit them into a function: one which merges each emitted item into a running result, one after another?
Yes... And in fact, that is what "reduce" does!
First, let's take a look at the behaviour of Clojure's reduce function. I know this is beginner's stuff, but don't worry, it will get fancy very quickly.
(reduce + [1 1])
>2
And now, we can see that it works with the sequential output of "for" just as well.
(reduce + (for [x [1 1]] x))
>2
And now, after a little RTFM, i.e. (doc reduce), it becomes clear that we can supply our own two-argument function to reduce.
(reduce (fn add [a b] (+ a b)) [1 1])
>2
So... how does this relate to a word count? Instead of taking two ints and returning a new one, as above, we can take in a "map" and a "word", and return a new "map"! So, as a template for this sort of function, we combine all these concepts below in a three-argument call to reduce, which takes advantage of the fact that reduce can take an initial value.
> (reduce (fn add [a b] (+ a b)) 10 [1 1 1])
13
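For reference (this expansion is mine, not from the original post), the three-argument form seeds the accumulator with the initial value and threads it through successive calls, which is where the 13 comes from:
; (reduce f init [x1 x2 x3]) behaves like (f (f (f init x1) x2) x3)
(+ (+ (+ 10 1) 1) 1)
>13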
Now let's try to build our new word count function. First, let's see if we can output something other than a number from each reduce step (i.e. let's output an updated map with a count of 999 for each newly read word):
> (reduce #(assoc %1 %2 999) {} ["a" "a" "b"])
{"b" 999, "a" 999}
Good! Our anonymous function has emitted "words" as keys and the number 999 as values, and the map is clearly getting updated as it goes through the list. Now let's make it smart enough to increment, rather than overwrite, when a word has already been read:
> (reduce
    #(let [v (%1 %2)]
       (assoc %1 %2
              (if v (inc v) 1)))
    {}
    ["a" "a" "b" "b" "b"])
Example 2: The final result:
Let's do some cleanup...
user=>
(defn word-enrichment
  "input: a string 'a b b'
   output: a map : {'a' 1 'b' 2}"
  [str_in]
  {:pre [(= (type str_in) (type ""))]}
  (let [all (clojure.string/split str_in #"\b+")]
    (reduce
      #(let [v (%1 %2)]              ; look up the current count for this word
         (assoc %1 %2
                (if v (inc v) 1)))   ; increment it, or start at 1
      {} all)))
And a test:
user=> (word-enrichment "The practice of stateless programming is the sign of a fine young gentleman .")
{"" 1, " " 12, "a" 1, "is" 1, "stateless" 1, " ." 1, "The" 1, "the" 1, "of" 2, "young" 1, "programming" 1, "fine" 1, "practice" 1, "gentleman" 1, "sign" 1}
Summary
We have thus replaced a "data-structure" driven solution to the problem (i.e. creating an intermediary set of unique words) with a functional one, built around Clojure's "reduce" function. This is particularly relevant in today's map/reduce-driven age of functional computation: what do you get when you abstract iteration and remove intermediate data on a massive scale? Hadoop!
So, although this post's contents relate specifically to how we can use "reduce" to eliminate intermediate state and data in a very simple situation, there is a deeper lesson: functional, side-effect free iteration is a far superior approach to computation than the typical "dump-it-in-a-domain-specific-object-and-figure-the-details-out-later" approach.

How is the second example stateless? `reduce` is passing data all over the place ... and the first example is side-effect free
I would be interested to see a comparison of the memory usage between the two examples.
The first one reads the list and processes it in memory... the 2nd does the processing in place... I think I may be using the word state in a somewhat over-broad context here. Good point, sir! I might refine this post this week.
FYI others, mister matt has done a quick benchmark... We will have to look into it, but the 2nd one is slower! I would assume reduce should be faster since it doesn't have to re-filter the strings each time... but maybe the creation of a new map at each step is slow.
To be updated soon :)
Ha well, the 2nd one actually is faster (see below comment). Whew... I was scared for a minute there. Evidently, assoc is actually performant because each new map shares structure with the previous one, so the reduction runs reasonably fast.
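A minimal sketch of that point (my illustration, not from the comment): assoc never mutates the old map, it returns a new one that shares most of its structure with the original, which is why building a "fresh" map at every reduce step stays cheap:
(def m1 {"a" 1})
(def m2 (assoc m1 "b" 1))
m1
>{"a" 1}
m2
>{"a" 1, "b" 1}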
(defn word-enrichment1
  "input: a string 'a b b'
   output: a map : {'a' 1 'b' 2}"
  [str_in]
  {:pre [(= (type str_in) (type ""))]}
  ;; set/filter version (Example 1)
  (let [all (clojure.string/split str_in #"\b+")]
    (into {}
          (for [unique_word (set all)]
            [unique_word
             (count (filter #(= unique_word %) all))]))))

(defn word-enrichment2
  "input: a string 'a b b'
   output: a map : {'a' 1 'b' 2}"
  [str_in]
  {:pre [(= (type str_in) (type ""))]}
  ;; reduce version (Example 2)
  (let [all (clojure.string/split str_in #"\b+")]
    (reduce
      #(let [v (%1 %2)]
         (assoc %1 %2
                (if v (inc v) 1)))
      {} all)))
; Test the performance ......
(defn test-performance-wc []
  (let [txt (slurp "http://en.wikipedia.org/wiki/Pegasus")]
    (print "time for set based word count\n")
    (time (word-enrichment1 txt))
    (print "time for reducer word count (should be less)\n")
    (time (word-enrichment2 txt))))

(test-performance-wc)
I was watching Rich Hickey's 'Simple Made Easy' talk, in which he said that he doesn't like folds because they 'complect' the what with the how. I'm not quite sure what he meant by that though.