8.4.15

balancing bit-rot risks in open source

I recently had a friendly debate w/ rj nowling and andrew purtell (both apache bigtop buddies) about the right way to do "iterations" on bigtop blueprints, which endeavour to demonstrate/exemplify bigdata tooling in bigtop.

Here's a totally made up graph that summarizes my view of things.



in the beginning
 
Originally, we built bigpetstore, which was a full stack application. But as new ideas (like bigtop bazaar) come into the picture, it becomes more important to carve a path for maintainable blueprints in bigtop that justify their maintenance/publishing cost.

the first obvious approach to this sort of thing is to do it iteratively.  the first iteration doesn't cost much.  the next iteration is only done if the first succeeds, and so on... however, iteration which doesn't satisfy a grand vision can lead to bit-rot. an example of this is the bit-rotted hive smoke tests in bigtop, which simply didn't work because they didn't have enough usability/impact to be used.

growing pains

BUT the problem with iterative development is that there are many ways to do it, and you don't want it to be carte blanche to justify bloated or dead code, and/or unnecessarily sloppy code.

in our case, the debate was around whether the next apache bigtop blueprint data generator... bigtop bazaar, should include a bigdata use case (i.e. hbase), or whether it would be sufficient for it to exist as an island to be iterated on.

so... when debating this, we basically had two opposing ways to create bigtop bazaar.

one perfect component at a time. 

in this case, your iterations are vertical.  you build the "first part", then the "second part", and so on.  I'm not a big fan of this, because the "second part" often just won't get done in an open source context: nothing actually depends on it, and it isn't intrinsically of use.

all components at once, but some might be a little broken.

in this case, your iterations are horizontal.  we start off with all components of a system working together, though some might not be 100% functional, or ideal.  the downside here, of course, is that none of the components are incredibly powerful in isolation.  the upside is that the entire use case is outlined, more or less.


how we did it in bigpetstore

in bigpetstore, on ASF bigtop, our approach was to start out with a fully working app which generated data and used multiple ecosystem components - https://issues.apache.org/jira/secure/attachment/12640190/BIGTOP-1089.patch.  This allowed other folks to come in and improve on the parts they were interested in (bashit parekh worked on mahout, I worked some more on pig/hive, and RJ worked on improving the data generator).  it also enabled some interesting off the cuff full stack demos of hadoop (https://www.youtube.com/watch?v=zHYfLNJ7ncI , https://www.youtube.com/watch?v=OVB3nEKN94k ).

why i favor the slightly broken, horizontal approach for creating a new project

having a full stack app with a few rough edges allows many people to leverage it, which decreases the entropy of things like readme docs and so on, since they get more exercise.

in reality, the tension is a good thing

for a project to really succeed, the irony is - you need a little tension.  and thankfully RJ and I have just that :).  I'm a breadth guy - i like to see all components (even if some are slightly broken or incomplete) so that i know there will be some natural stressors keeping the application evolving in the right way.  he's a depth guy: focused on getting the mathematics and component architecture right, which means he's not always keen on cramming together a bunch of components in a first pass.

so the agreement we came to is that when we add new blueprints to bigtop, I'll take on the job of making sure they tell a good story and have context, while RJ focuses on getting the mathematics and component details perfect.

so, what's better? horizontal or vertical evolution?

probably both :), if you have the manpower!

1 comment:

  1. Great overview, Jay! Discussion and disagreement are healthy and important parts of open-source projects. They help the project and individuals reflect on their goals and make them clearer. It can even be useful for a project to say no if contributions are tangential or represent a maintenance burden -- it also gives contributors feedback on how to direct their efforts.

    In our case, our debates and discussions always end up making our collaborations better! :)
