20.8.14

The fudge-factor portion of Bayesian classifiers.

If you built a classifier for recognizing lyrics to the above song, you would break the song into snippets, each named "A", "B", and so on.  The individual snippets then contain words.  We can then check how many snippets a new input is similar to.  If we refine this strategy and formalize it, we essentially have a Bayesian classifier.
My friend just published a post on Detecting Copied Text With Bayesian Classifiers.

PLEASE NOTE: This is meant as a companion post.  The concepts outlined here are meant to give those trying to understand the mathematical aspects of Bayesian classifiers an intuitive understanding.  For theoretical rigor, go read the original post.

This post should serve as a good prerequisite for understanding the above post, which goes deeper into the applied as well as theoretical aspects of tuning Bayesian classifiers.  In particular, I'll attempt to clarify three things which are always tricky for me about classifiers.
  1. How they relate to the classic Naive Bayes probability equation.
  2. How the feature vectors are created from complex inputs (i.e. a whole document).
  3. The real-world context for how TF-IDF and other algorithms get integrated into classifiers.
A good classifier algorithm reminds me of a good dog in a manhunt.  You have some stuff that smells like a man, you give it to your dog, and he helps you hunt the man down without knowing exactly what he looks like or where he is.  The man in this analogy?  It's a whole document:

The quick brown fox jumped over the lazy dog.

Taken from RJ's original article, above.  These are the pajama snippets of the man we are hunting.  Get 'im!

Now, before we bring in the algorithm: how will we see if this dog can hunt (that is, how will we check whether our algorithm is capable of detecting true copies of our original text)?  Below, we will use A, B, C, D, E and a new string of text which we make up, F.  We will build our classifier using B, C, and D, and test it with A and F as inputs.  Hopefully, A will be predicted as "plagiarized", and F as not.
ALSO taken from RJ's original article, above.
Okay! Now let's take a look at the actual Naive Bayes classifier equation.  Our main focus will be thinking about how the P(...) terms are calculated.  In a very pure system, we might simply calculate P the "right" way, by counting the number of times a word occurs in a phrase of the training set and dividing it by the total number of words.  In other systems, we might fudge the value of P with a priori knowledge.
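The original post shows the equation itself; for reference, the standard Naive Bayes decision rule (which the next paragraph restates in plain English) looks like this:

$$\hat{Y} = \underset{Y}{\arg\max}\; P(Y)\prod_{i=1}^{n} P(X_i \mid Y)$$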

This is a fancy way of saying we will bin a document into class Y iff that class has the highest probability compared to all other classes.  The possible values of Y are "plagiarized" and "non-plagiarized".
Okay, so we've seen some sample text and a fancy equation... but how do we populate the X vector in the equation with terms from our document?  Like this: we build a vector where each index maps to a particular word in our dictionary (i.e. 0 = quick, 1 = over, ...).  The value at each index represents the presence/absence of a particular word, multiplied by the fudge factor for that word.

Xw = [ 1*f0, 1*f1, 3*f2, 2*f3, 2*f4 ] when Y is plagiarized.  The f terms are fudge factors we will talk about later.
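To make the index mapping concrete, here is a minimal sketch (my own, not code from the original post) of turning a document into that kind of vector; the five-word vocabulary is made up purely for illustration:

```python
from collections import Counter

def build_x_vector(document, vocabulary):
    """Count vector aligned with a fixed vocabulary (one slot per word)."""
    counts = Counter(document.lower().replace(".", "").split())
    return [counts[word] for word in vocabulary]

# Hypothetical vocabulary: index 0 = "quick", 1 = "over", and so on.
vocabulary = ["quick", "over", "the", "lazy", "dog"]
print(build_x_vector("The quick brown fox jumped over the lazy dog.", vocabulary))
# -> [1, 1, 2, 1, 1]  (fudge factors of 1 for now; they come next)
```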


Why are fudge factors important?  Because of cases like this!  In the above text, our samples lose a lot of information if we just record the "absence" or "presence" of words without weighting them by overall frequency.  We might actually want our P function to take into account the total number of times we see each word in a training segment, not just tick a box for whether it is present or missing.
One key to understanding how Bayesian theory applies to classifiers is to grok that we don't use "pure" probabilities; instead, we use multipliers on the pure probabilities P(...).  Those multipliers modify the probability by a fudge factor, which weights term presence/absence with additional information.  Each term becomes the probability-related count (e.g. 3) times a fudge factor that is different for every word, and this fudge factor will do one of several things (summarized in the formula after this list):
  • will make our equation more or less sensitive to certain words (TF-IDF)
  • will make the equation sensitive to total word count, not just presence/absence (multinomial)
  • will just be = 1 (Bernoulli)
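In symbols (my paraphrase of this fudge-factor framing, not notation taken from the original post), each element of the X vector is a raw count times a per-word weight:

$$x_i = c_i \cdot f_i$$

For Bernoulli, $c_i$ is clipped to 0 or 1 and $f_i = 1$; for multinomial, $c_i$ is the raw count and $f_i = 1$; for TF-IDF, $f_i$ is an inverse-document-frequency style weight that shrinks very common words and boosts rare ones.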
So, in the original post above, you will see that RJ walks you through the details of the different P calculations.  In particular, we have...

X vector type | Value of element i in the vector | Example vector | # of terms in vector (n = vocab size)
Bernoulli | binary (word is on or off) | {0,1,0,1,0,0,0,0} | n
Multinomial | number of occurrences (value can be > 1) | {0,8,0,2} | n-k
TF-IDF | value ≈ frequency in the doc / frequency across all docs | {0.1, 2.9} | n-k
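As a rough illustration of the table (my own sketch, not code from the original post; the log-based IDF is just one common variant), here is how the three weightings might be applied to the Xw counts from earlier:

```python
import math

def bernoulli_weights(counts):
    # Clip each count to presence/absence (1 or 0); the fudge factor is effectively 1.
    return [1 if c > 0 else 0 for c in counts]

def multinomial_weights(counts):
    # Keep the raw occurrence counts; values can be greater than 1.
    return list(counts)

def tfidf_weights(counts, doc_freqs, n_docs):
    # Scale each count by a smoothed inverse document frequency:
    # words that appear in most documents get shrunk, rare words get boosted.
    return [c * (math.log((1 + n_docs) / (1 + df)) + 1)
            for c, df in zip(counts, doc_freqs)]

counts = [1, 1, 3, 2, 2]         # the Xw counts from the example above
doc_freqs = [2, 5, 9, 1, 3]      # hypothetical: how many training docs contain each word
print(bernoulli_weights(counts))             # [1, 1, 1, 1, 1]
print(multinomial_weights(counts))           # [1, 1, 3, 2, 2]
print(tfidf_weights(counts, doc_freqs, 10))  # rarer words get larger per-count weights
```

(The table's n-k column reflects that, in practice, the multinomial and TF-IDF vectors often drop the k vocabulary words that never occur; the sketch keeps full-length vectors for simplicity.)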

Note that the Bernoulli method counts all words, including missing ones.   This means that you might get strings of (1/Q)(1/Q)(1/Q) in the P term, which means documents with very few matching words will get heavily penalized.  See the original post for details.
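To see why, here is the standard Bernoulli Naive Bayes likelihood (a textbook form, not copied from the original post).  Every vocabulary word contributes a factor, whether it is present ($x_i = 1$) or absent ($x_i = 0$):

$$P(X \mid Y) = \prod_{i=1}^{n} p_i^{x_i}\,(1 - p_i)^{1 - x_i}, \qquad p_i = P(\text{word } i \text{ appears} \mid Y)$$

So a document that shares only a few words with a class still multiplies in all n factors, and a long run of small ones drags the product down fast.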

As you might imagine, the choice of how we build the fudge factor is what is going to make our classifier faster or slower, and more or less scalable.  For example, if we don't have the whole corpus of documents up front, we cannot compute TF-IDF very effectively, so we might just use the Bernoulli method.  If we think raw term frequency is important, for example in a human-defined system with strict thresholds, the multinomial method is the natural fit.

Interested in seeing how you actually implement one of these? 
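(The original post is the real reference for implementation details; purely as a toy illustration of the moving parts, here is a minimal scikit-learn sketch.  The training snippets and labels below are made up, not the A-F text from the post.)

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import MultinomialNB

# Hypothetical training snippets, standing in for the B/C/D-style segments.
train_text = [
    "the speedy brown fox hopped over the sleepy dog",
    "a quick brown fox leaps over a lazy dog",
    "an unrelated paragraph about database indexing",
]
train_labels = ["plagiarized", "plagiarized", "non-plagiarized"]

# CountVectorizer builds the X vectors (multinomial-style word counts).
vectorizer = CountVectorizer()
X_train = vectorizer.fit_transform(train_text)

# MultinomialNB applies the Naive Bayes equation to those counts.
model = MultinomialNB()
model.fit(X_train, train_labels)

# Score new, unseen text (the role A and F played earlier).
X_test = vectorizer.transform([
    "the quick brown fox jumped over the lazy dog",
    "a brand new sentence about cooking pasta",
])
print(model.predict(X_test))
```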

Like I said, please check out the original post.  This is just a roadmap for those in need of some TLC in understanding the slippery road from theoretical Bayes' to the actual implementation of a real-world classifier.

1 comment:

  1. Couldn't u also detect plagiarism by using LSH techniques like SimHashing?
