jayunit100: Real world functional iteration : An alternative to mutable intermediate data-structures.

Functional languages replace the need for mutability by providing us with convenient, idiomatic mechanisms for defining transformations without explicitly modifying or assigning variables. This level of decoupling transparently provides us with natural concurrency constructs and scalability, and is in fact the basis for the now famous map-reduce paradigm for processing data at massive scale.

Its obvious that simplistic loops can be easily replaced by deterministic, declarative functions in a functional language. For example:

int sum = 0; int[] numbers=....;

for(int i = 0 ; i < numbers.length ; i++)

sum += numbers[i];

Can be written as

(reduce + numbers)

That was easy: but what about in the real world ?

As problems get more complicated, it becomes easy to throw random data structures at them (sets, hashmaps, treemaps, ....). Domain specific "objects" get created, bloating your code, with names like "SpiderMonkeyRendevouz" and "SpiderMonkeyRelationFrequency"... And before you know it - you begin writing code to transform one data structure into another, for no reason whatsover, other than the fact that your "inital" data structure did not directly compute the answer to the problem you were trying to solve.

All that work just to calculate how often two spider monkeys were seen together in the same tree :)

So, without further ado, lets work our way through a case study in the removal of intermediate state by proper use of Clojure's "reduce" function (we will completely abandon the SpiderMonkey antics for this exersize). I'm reasonably confident that, if you read this post in its entirety, your ability to avoid such pitfalls will be significantly improved in the future.

"Given a string, count the number of instances of each word in that string, and output the result as a map".

Example 1: A very sub-optimal solution to the word-count problem.

; Here was my first (non-linear performing, as well as ugly) attempt at building a map of words to their counts in a string :
(defn word-enrichment
"input: a string 'a b b'
   output: a map : {'a' 1 'b' 2}"
[str_in]
{:pre [(= (type str_in) (type ""))]}
(let [all (cs/split str_in #"\b+")]
    (into {} ; <- store each emitted word count into this map
          (for [unique_word (set all)]
            [unique_word
             (count (filter #(= unique_word %) all))]))))

How it works : This code first stores all words into a set by splitting a string by word boundaries. Then, it counts the words by rescanning the string for each word, emitting them into a map (thats what the 'into {}' prefix to the 'for' list comprehension is defining : a map as the collector).

Improving the above example by decreasing its use of intermediate state.

So the first red flag here is the unique set of words, which is obviously unnecessary. It would be more ideal to count each word as we read it, storing the results in some kind of map. This would be quite obvious to almost anyone....

However, the alternative, in a language like Clojure eschew's mutable data structures, might evade you at first : how can we iterate through the words in the text and build up a map without declaring a global variable?

Lets formalize these issues by adding to requirements to our new solution:

1) You can only read through the input string once.

2) You can't use any global or mutable variables.

How can we count words in a single pass without a mutable data structure ?

The key here is to think functionally. Since "for" emits items, wouldn't it be nice to emit them into a function, for example, a function which merges results, one after another, from each "for" emission.

Yes... And in fact, that is what "reduce" does !

First, lets take a look at the behaviour of the clojure reduce function - I know this is beginners stuff but don't worry, it will get fancy very quickly.

(reduce + [1 1])
>2

And now, we can see that it works with the sequential output of "for" just as well.

(reduce + (for [x [1 1]] x))
>2

And now, after a little RTFM i.e. (doc reduce), it becomes clear that we can write our own, 2 argument function implementation in reduce.

(reduce (fn add [a b] (+ a b)) [1 1])

So... how does this relate to a word count ? Instead of taking 2 ints, and returning a new one - as above, we can take in a "map", and a "word", and return a new "map" ! So, as a template for this sort of function, we combine all these concepts below in a 3 argument call to reduce, which takes advantage of the fact that reduce can take an initial value.

> (reduce (fn add [a b] (+ a b)) 10 [1 1 1])
13

Now lets try to build our new word count function - first, lets see if we can output something other than a number from each reduce step (i.e. lets output an updated map with a count of 999 for the newly read in word) :

> (reduce #(assoc %1 %2 999) {} ["a" "a" "b"])
{"b" 999, "a" 999}

Good ! Our anonymous function as emitted "words" as keys, and the number 999 as values, and is clearly getting updated as it goes through the list. Now lets make it smart enough to increment, rather than overwrite, when a word has already been read:

> (reduce
#(let [v (%1 %2)]
     (assoc %1 %2
            (if v (inc v) 1) ))
     {}
     ["a" "a" "b" "b" "b"] )

Example 2: The final result:

Lets do some clean up...

user=>

(defn word-enrichment

"input: a string 'a b b'

output: a map : {'a' 1 'b' 2}"

[str_in]

{:pre [(= (type str_in) (type ""))]}

(let [all (clojure.string/split str_in #"\b+")]

(reduce

#(let [v (%1 %2)]

(assoc %1 %2

(if v (inc v) 1) )) {} all)))

And a test:

user=> (word-enrichment "The practice of stateless programming is the sign of a fine young gentleman .")

{"" 1, " " 12, "a" 1, "is" 1, "stateless" 1, " ." 1, "The" 1, "the" 1, "of" 2, "young" 1, "programming" 1, "fine" 1, "practice" 1, "gentleman" 1, "sign" 1}

Summary

We have thus replaced a "data-structure" driven solution to a problem (i.e. creating an intermediary set of unique words) with a functional one, which is based around Clojure's "reduce" function. This is particularly relevant in today's map/reduce driven age of functional computation : What do you get when you abstracting iteration, and removing intermediate data on a massive scale? Hadoop!

So, although this post's contents are related specifically to how we can use "reduce" to eliminate intermediate state and data in a very simple situation, there is a deeper lesson- functional, side-effect free iteration is a far-superior approach to computation than the typical "dump-it-in-a-domain-specific-object-and-figure-the-details-out-later" approach.

7 comments:

UnknownJuly 3, 2012 at 9:45 AM
How is the second example stateless? `reduce` is passing data all over the place ... and the first example is side-effect free
jayunit100July 3, 2012 at 10:05 AM
FYI others, mister matt has done a quick benchmark ... We will have to look into it, but the 2nd one is slower ! I would assume reduce should be faster since it doesnt have to re-filter the strings each time... But maybe the creation of a new map at each step is slow.

To be updated soon :)
jayunit100July 3, 2012 at 10:40 AM
(defn word-enrichment1
"input: a string 'a b b'
output: a map : {'a' 1 'b' 2}"
[str_in]
{:pre [(= (type str_in) (type ""))]}
(let [all (clojure.string/split str_in #"\b+")]
(reduce
#(let [v (%1 %2)]
(assoc %1 %2
(if v (inc v) 1) )) {} all)))

(defn word-enrichment2
"input: a string 'a b b'
output: a map : {'a' 1 'b' 2}"
[str_in]
{:pre [(= (type str_in) (type ""))]}
(let [all (clojure.string/split str_in #"\b+")]
(reduce
#(let [v (%1 %2)]
(assoc %1 %2
(if v (inc v) 1) )) {} all)))

; Test the performance ......

(test-performance-wc
(let [txt (slurp "http://en.wikipedia.org/wiki/Pegasus")]
(print "time for set based word count\n")
(time (word-enrichment1 txt))
(print "time for reducer word count (should be less)\n")
(time (word-enrichment2 txt)))
)
UnknownOctober 10, 2012 at 5:43 AM
I was watching Rich Hickey's 'Simple Made Easy' talk, in which he said that he doesn't like folds because they 'complect' the what with the how. I'm not quite sure what he meant by that though.

1.7.12

Real world functional iteration : An alternative to mutable intermediate data-structures.

7 comments: