29.1.12

Python, VowpalWabbit, and PyVowpal

How does VowpalWabbit work, what are its inputs, and how do I format my own input data set so that I can start classifying records with its real-time learning capabilities?


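VowpalWabbit (VW) is usually introduced as something that classifies things into one of two categories.  This is nothing new.  It's called BINARY classification [guess why :)].  It can also predict a continuous value, which is what the running-speed example later in this post does.  But what IS new is that it allows high-performance, real-time, online learning -- continuously improving its predictions and classifications.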

In other words, you can continuously feed it new data, and it will process each record in nearly constant time, even if there are thousands of parameters.  This is done using the "hashing trick": each feature name is hashed to a slot in a fixed-size table of weights.  The size of that table is set by the -b option (18 bits, i.e. 2^18 slots, by default), so the model stays the same size no matter how many distinct features show up in your data !
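
To make the hashing trick a bit more concrete, here is a minimal sketch in Python. It is only an illustration: zlib.crc32 stands in for VW's own hash function, and the feature_index helper and the "namespace^feature" key format are names I made up for this example, not anything from VW or PyVowpal.

```python
import zlib

def feature_index(namespace: str, feature: str, bits: int = 18) -> int:
    """Map a namespaced feature name to a slot in a table of 2**bits weights."""
    key = f"{namespace}^{feature}".encode("utf-8")
    # crc32 is a stand-in hash for illustration; VW uses its own hash function.
    return zlib.crc32(key) & ((1 << bits) - 1)

bits = 18
# The model is one fixed-size table of weights, however many features appear.
weights = [0.0] * (1 << bits)

for ns, name in [("body", "weight"), ("body", "height"), ("sports", "football")]:
    print(ns, name, "->", feature_index(ns, name, bits))
```

Two different features can land in the same slot (a collision). With a large enough table this tends to matter little in practice, and that is the trade-off the trick makes for a bounded model size.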

That means that, unlike a lot of other machine learning algorithms, VW SCALES linearly, with an EXTREMELY LOW slope, as the data set grows.

It is hosted here: https://github.com/JohnLangford/vowpal_wabbit (but don't worry - you DON'T need to compile it to run it ! You can get it using apt-get, as described below).

OK, so how do I start predicting stuff?

The simplest way is to echo a pipe-delimited example line into vw, following the example inputs page: http://hunch.net/~vw/validate.html .

It's not particularly easy to get started with VowpalWabbit, because the input data types of machine learning algorithms are not easily understood.... And the documentation on VW input types isn't all that clear.

Simply put - VW is used to predict a target value from an input data set in real time, with predictions improving over time as we add more data to the system.
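
Here is a rough idea of what "predictions improving over time" means. The sketch below is plain stochastic gradient descent on squared error for a linear model, written by me for illustration. It is NOT VW's actual update rule, and the feature names, the "bias" feature and the learning rate are all assumptions.

```python
def predict(weights, features):
    """Linear prediction: sum of weight * value over the example's features."""
    return sum(weights.get(name, 0.0) * value for name, value in features.items())

def learn(weights, features, label, rate=0.1):
    """One online step: predict first, then nudge the weights toward the label."""
    error = predict(weights, features) - label
    for name, value in features.items():
        weights[name] = weights.get(name, 0.0) - rate * error * value
    return error

# A tiny stream of (features, label) pairs; "bias" is a constant feature.
stream = [
    ({"weight": 0.3, "height": 0.8, "bias": 1.0}, 0.4),
    ({"weight": 0.5, "height": 0.8, "bias": 1.0}, 0.2),
    ({"weight": 0.3, "height": 0.9, "bias": 1.0}, 0.7),
]

weights = {}
# Squared error measured BEFORE each update, summed over one pass of the stream.
first_pass = sum(learn(weights, f, y) ** 2 for f, y in stream)
for _ in range(200):
    last_pass = sum(learn(weights, f, y) ** 2 for f, y in stream)

print("error on first pass:", first_pass)
print("error on last pass: ", last_pass)
```

Each record is seen once, used to update the model, and then thrown away, which is why this style of learning can keep ingesting data indefinitely.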

Before we get into the inputs/outputs - let's talk about installation.

Installing Vowpal Wabbit is a single-step process on Linux (with apt-get) :

sudo apt-get install vowpal-wabbit

However - running it is another story.  Vowpal Wabbit's inputs are not entirely clear
from the existing documentation - so I will attempt to clarify the way we build an input
set here.

Basically, each VW input record is a map of feature names to values.

Thankfully, this has been abstracted for us by Shilad Sen, in a wrapper available here:
https://github.com/shilad/PyVowpal

Shilad's wrapper abstracts the task of creating VW inputs, and writes a file such as this for us,
which is used to predict "running speed" for a group of fictional athletes.  The example learning set looks
like this (as a VW input set, created using PyVowpal):

 0.4 1.0 0|body weight:0.3 height:0.8|age age:0.4|sports football
 0.7 1.0 1|body weight:0.3 height:0.8|age age:0.3|sports soccer
 0.2 1.0 2|body weight:0.5 height:0.8|age age:0.7|sports soccer
 0.7 1.0 3|body weight:0.3 height:0.9|age age:0.2|sports track
 0.9 1.0 4|body weight:0.3 height:0.7|age age:0.3|sports track
 0.6 1.0 5|body weight:0.7 height:0.7|age age:0.2|sports track


We can see here that each line starts with a known, normalized speed (.4, .7, ...), which is the label VW
learns from, followed by an importance weight (1.0) and an id for the input record (0, 1, 2 ...), which
VW calls a tag.  After that come the pipe-delimited sources (VW calls them namespaces), where each source
has an associated map of features.  For example, the "body" source has features "weight" and "height".
There are 3 such sources.  You can print these lines yourself with a slight modification to Shilad's code:
add a print statement to the method in the VWExample class which creates a VW string.
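
To check my reading of that format, here is a small sketch that takes one of the lines above apart. The parse_vw_line function is my own helper (it is not part of VW or PyVowpal), and it assumes exactly the layout used in this post: three tokens before the first pipe (label, importance weight, id), then one source per pipe.

```python
def parse_vw_line(line: str):
    """Split one VW example line into (label, importance, tag, namespaces)."""
    header, *sections = line.strip().split("|")
    tokens = header.split()
    # Assumes the three-token header used in this post: label, importance, tag.
    label, importance, tag = float(tokens[0]), float(tokens[1]), tokens[2]

    namespaces = {}
    for section in sections:
        name, *features = section.split()
        parsed = {}
        for feature in features:
            key, sep, value = feature.partition(":")
            # A feature without ":value" is a bare token; VW gives it the value 1.
            parsed[key] = float(value) if sep else None
        namespaces[name] = parsed
    return label, importance, tag, namespaces

line = " 0.4 1.0 0|body weight:0.3 height:0.8|age age:0.4|sports football"
print(parse_vw_line(line))
```

Bare tokens such as "football" come back with a value of None here, which keeps them visibly different from the numeric features.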


So - how do we store these features ?  What is the input to the VWExample class ?

Thankfully, further borrowing from Shilad's code, we can visualize the input format of VW if we write it out as a JSON-like Python data structure, rather than a single | delimited string... This has the advantage of telling us something about the
semantic structure of the data which VW takes as input :

DATA = [
    [0.4, {'body' : {'height' : 0.8, 'weight' : 0.3}, 'age' : {'age' : 0.4}, 'sports' : { 'football' : None }}],
    [0.7, {'body' : {'height' : 0.8, 'weight' : 0.3}, 'age' : {'age' : 0.3}, 'sports' : { 'soccer' : None }}],
    [0.2, {'body' : {'height' : 0.8, 'weight' : 0.5}, 'age' : {'age' : 0.7}, 'sports' : { 'soccer' : None }}],
    [0.7, {'body' : {'height' : 0.9, 'weight' : 0.3}, 'age' : {'age' : 0.2}, 'sports' : { 'track' : None }}],
    [0.9, {'body' : {'height' : 0.7, 'weight' : 0.3}, 'age' : {'age' : 0.3}, 'sports' : { 'track' : None }}],
    [0.6, {'body' : {'height' : 0.7, 'weight' : 0.7}, 'age' : {'age' : 0.2}, 'sports' : { 'track' : None }}],
    [None, {'body' : {'height' : 0.7, 'weight' : 0.2}, 'age' : {'age' : 0.3}, 'sports' : { 'track' : None }}],
    [None, {'body' : {'height' : 0.7, 'weight' : 0.2}, 'age' : {'age' : 0.3}, 'sports' : { 'soccer' : None }}],
]

So - what's in this data structure ? We can easily view it in a formatted style using a JSON viewer such as http://jsonviewer.stack.hu/ .
This is what we get : a list of [score, map] lists.
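
Note that some scores are None. Going by the discussion in the comments at the end of this post, those are records with no known speed, i.e. the ones we want predictions for. A minimal sketch of separating the two kinds of record (split_records is a hypothetical helper of mine, the DATA here is a shortened copy of the list above, and how PyVowpal treats None scores internally is an assumption on my part):

```python
# A shortened copy of the DATA list, to keep the example self-contained.
DATA = [
    [0.4, {"body": {"height": 0.8, "weight": 0.3}, "sports": {"football": None}}],
    [0.7, {"body": {"height": 0.9, "weight": 0.3}, "sports": {"track": None}}],
    [None, {"body": {"height": 0.7, "weight": 0.2}, "sports": {"track": None}}],
]

def split_records(data):
    """Separate records with a known score from records that still need one."""
    training = [(score, sources) for score, sources in data if score is not None]
    to_predict = [sources for score, sources in data if score is None]
    return training, to_predict

training, to_predict = split_records(DATA)
print(len(training), "records to learn from,", len(to_predict), "to predict")
```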

If we look carefully at a single record, we see that there are 3 types of sources : body, age, and sports.  Each source is associated with features (height, weight, age, football).  Finally, and importantly - some features have numerical values (e.g. height) whereas others are free text (e.g. football).  The PyVowpal wrapper will interpret keys with None values essentially as free text attributes, which will be dealt with accordingly by Vowpal Wabbit ( see :  https://github.com/JohnLangford/vowpal_wabbit/wiki/Input-format )
 
[
  [
    0.4,
    {
      'body': {
        'height': 0.8,
        'weight': 0.3
      },
      'age': {
        'age': 0.4
      },
      'sports': {
        'football': None
      }
    }
  ],
  ....
]
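
Going the other way, from one [score, map] record to a VW line, can be sketched in a few lines. This is my own illustration of the mapping, not PyVowpal's actual code: to_vw_line and its arguments are hypothetical, and writing an unlabeled record as just an id touching the pipe is my reading of the Input-format wiki page.

```python
def to_vw_line(label, namespaces, tag, importance=1.0):
    """Render one [score, map] record as a single VW input line."""
    parts = []
    for ns_name, features in namespaces.items():
        tokens = [ns_name]
        for name, value in features.items():
            # None marks a free-text feature: emit the bare name, no ":value".
            tokens.append(name if value is None else f"{name}:{value}")
        parts.append(" ".join(tokens))
    # Unlabeled records carry only the id, which must touch the first pipe.
    header = f"{tag}" if label is None else f"{label} {importance} {tag}"
    return header + "|" + "|".join(parts)

record = [0.4, {"body": {"height": 0.8, "weight": 0.3},
                "age": {"age": 0.4},
                "sports": {"football": None}}]
print(to_vw_line(record[0], record[1], tag=0))
```

The features come out in dictionary order, so "height" may appear before "weight" here even though the file shown earlier has them the other way around. The order of features within a source makes no difference to VW.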

FINALLY : Instructions for running the PyVowpal package.

1) sudo apt-get install vowpal-wabbit
2) make sure that 'vw' is on your path by typing vw - the man page should show up.
3) execute the examples : python ./test_examples.py (before you run it, make sure that you edit the path to vowpal wabbit in the test_examples.py script so that it is simply vw, or whatever the path is on your system !) .
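
If you would rather check step 2 from Python, the sketch below looks up vw on the PATH and builds the command lines for a train-then-predict run. The helper names and the file names (train.vw, model.vw) are hypothetical. The flags are standard vw options: -d for the data file, -f to write the model, -i to load a model, -t for test-only mode and -p for the predictions file.

```python
import shutil

def find_vw(binary: str = "vw"):
    """Return the full path of the vw executable, or None if it is not on PATH."""
    return shutil.which(binary)

def train_command(vw_path, data_file, model_file):
    # -d: input data file, -f: where to write the final model
    return [vw_path, "-d", data_file, "-f", model_file]

def predict_command(vw_path, data_file, model_file, predictions_file):
    # -t: test only (no learning), -i: model to load, -p: predictions output
    return [vw_path, "-t", "-i", model_file, "-d", data_file, "-p", predictions_file]

vw_path = find_vw()
if vw_path is None:
    print("vw not found on PATH; install it or use the full path instead")
else:
    # Hypothetical file names; pass the list to subprocess.run to execute it.
    print(" ".join(train_command(vw_path, "train.vw", "model.vw")))
```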

Closing Remarks

The PyVowpal project not only makes it particularly easy to run VowpalWabbit (it runs it for you).  More importantly, PyVowpal demonstrates an abstract conceptual model of the inputs to VowpalWabbit, so that we can focus on modelling the domain-specific details of our input learning set without worrying about the internal formatting of VW inputs.  Best of all - since Python and JSON interop is virtually seamless, it should be easy for any seasoned Python developer to adopt PyVowpal for their own machine learning needs.

3 comments:

  1. the value for some training inputs is nil? Is that how the program knows what records to use in creating the training ?

  2. In the example above, you could use the last two (None) input values to "test" the algorithm, whereas the former would be for training. But normally you would create the model separately (from a bunch of non-nil values), then test it with all nil values. And then, once happy with performance, run a separate vw instance for production - which would be continually ingesting.
