PLEASE NOTE: This is meant as a companion post. The concepts outlined here are intended to give those trying to understand the mathematical side of Bayesian classifiers an intuitive footing. For theoretical rigor, go read the original post.
This post should serve as a good prerequisite for the post above, which goes deeper into both the applied and the theoretical aspects of tuning Bayesian classifiers. In particular, I'll attempt to clarify three things that are always tricky for me about classifiers:
- How they relate to the classic Naive Bayes probability equation.
- How the feature vectors are created from complex inputs (e.g., a whole document).
- The real-world context for how TF-IDF and other weighting schemes get integrated into classifiers.
Our original text: "The quick brown fox jumped over the lazy dog."
*(Figure from RJ's original article, above: the pajama snippets of the man we are hunting. Get 'im!)*
Now, before we bring in the algorithm: how will we see if this dog can hunt? That is, how will we check whether our algorithm can actually detect true copies of our original text? Below, we will use snippets A, B, C, D, and E, plus a new string of text that we make up ourselves, F. We will build our classifier using B, C, and D, and test it with A and F as inputs. Hopefully, A will be predicted as "plagiarized" and F will not.
*(Figure, also from RJ's original article, above.)*
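The figure itself isn't reproduced here, but for reference, the classic Naive Bayes equation the post leans on has the standard textbook form (RJ's exact notation may differ slightly):

$$
P(y \mid x_1, \ldots, x_n) \;\propto\; P(y) \prod_{i=1}^{n} P(x_i \mid y)
$$

Here $y$ is the class ("plagiarized" or not) and the $x_i$ are the entries of the X vector described next.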
Okay, so we've seen some sample text and a fancy equation... but how do we populate the X vector in that equation with terms from our document? Like this: we build a vector where each index maps to a particular word in our dictionary (e.g., 0 = quick, 1 = over, ...). The value at each index represents the presence or absence of that word, multiplied by a fudge factor which, depending on the flavor of classifier (see the sketch after this list):
- will make our equation more or less sensitive to certain words (TF-IDF)
- will make the equation sensitive to total word count, not just presence/absence (multinomial)
- will just be = 1 (Bernoulli)
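As a concrete (and deliberately tiny) illustration of the index-to-word mapping, here is a hand-rolled Bernoulli-style vector builder. The vocabulary and text are my own toy examples, not RJ's:

```python
# Toy dictionary: each known word maps to one index of the X vector.
vocab = {"quick": 0, "over": 1, "fox": 2, "pajamas": 3, "lazy": 4}

def make_binary_vector(text, vocab):
    """Bernoulli-style X vector: 1 if the word appears in the text, else 0."""
    vec = [0] * len(vocab)
    for word in text.lower().split():
        if word in vocab:
            vec[vocab[word]] = 1
    return vec

print(make_binary_vector("The quick brown fox jumped over the lazy dog", vocab))
# -> [1, 1, 1, 0, 1]  (every vocab word present except "pajamas")
```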
| X vector type | Value of element i in the vector | Example vector | # of terms in vector (n = vocab size) |
|---|---|---|---|
| Bernoulli | Binary (the word is either on or off). | {0, 1, 0, 1, 0, 0, 0, 0} | n |
| Multinomial | Number of occurrences (a word's value can be > 1). | {0, 8, 0, 2} | n-k |
| TF-IDF | A word's value ~ its frequency in this document relative to its frequency across all documents. | {0.1, 2.9} | n-k |
Note that the Bernoulli method counts all words, including missing ones. This means you might get runs of small factors like (1/Q)(1/Q)(1/Q) in the P term, so documents with very few matching words get heavily penalized. See the original post for details.
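To see the three rows of the table side by side, here is a minimal scikit-learn sketch (my own illustration, not code from RJ's post; the documents are throwaway placeholder strings):

```python
from sklearn.feature_extraction.text import CountVectorizer, TfidfVectorizer

# Tiny placeholder corpus -- NOT the actual snippets from RJ's post.
docs = [
    "the quick brown fox jumped over the lazy dog",
    "the quick brown fox jumped over the dog",
    "a quick brown fox leaped over a lazy dog",
]

# Bernoulli-style: binary presence/absence of each vocabulary word.
bernoulli_vec = CountVectorizer(binary=True)
X_bernoulli = bernoulli_vec.fit_transform(docs)

# Multinomial-style: raw occurrence counts (values can be > 1).
count_vec = CountVectorizer()
X_counts = count_vec.fit_transform(docs)

# TF-IDF: counts re-weighted by how rare each word is across the corpus.
tfidf_vec = TfidfVectorizer()
X_tfidf = tfidf_vec.fit_transform(docs)

print(bernoulli_vec.get_feature_names_out())  # the shared vocabulary, one column per word
print(X_bernoulli.toarray())                  # 0/1 entries
print(X_counts.toarray())                     # integer counts
print(X_tfidf.toarray().round(2))             # corpus-weighted values
```

Notice that TfidfVectorizer needs the whole corpus to compute document frequencies, which is exactly the limitation the next paragraph points out.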
As you might imagine, the choice of how we build the fudge factor is what makes our classifier faster or slower, and more or less scalable. For example, if we don't have all the documents up front, we can't compute TF-IDF very effectively, so we might just use the Bernoulli method. If we think raw term frequency is important, for example in a human-defined system with strict count thresholds, the multinomial method is the natural choice.
Interested in seeing how you actually implement one of these?
Like I said, please check out the original post. This is just a roadmap for those in need of some TLC in understanding the slippery road from theoretical Bayes to the actual implementation of a real-world classifier.
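That said, if you want a small taste of an end-to-end implementation, here is a minimal sketch using scikit-learn. The snippet strings are placeholders I made up to stand in for B, C, D (training) and A and F (testing); the real ones are in RJ's figures. I also add a couple of unrelated "not plagiarized" training strings, since scikit-learn's BernoulliNB needs examples of both classes, which is a small departure from the setup described above:

```python
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.naive_bayes import BernoulliNB

# Placeholder stand-ins for RJ's snippets -- NOT the real B, C, D strings.
train_docs = [
    "the quick brown fox jumped over the lazy dog in pajamas",  # ~B (plagiarized)
    "a quick brown fox jumped over a lazy dog",                 # ~C (plagiarized)
    "the quick brown fox leaped over the lazy dog",             # ~D (plagiarized)
    "stock prices rallied today on strong earnings reports",    # unrelated text
    "the committee will meet on tuesday to discuss the budget", # unrelated text
]
train_labels = [1, 1, 1, 0, 0]  # 1 = plagiarized, 0 = not

# Bernoulli representation: binary presence/absence of each vocabulary word.
vectorizer = CountVectorizer(binary=True)
X_train = vectorizer.fit_transform(train_docs)

clf = BernoulliNB()
clf.fit(X_train, train_labels)

# Test on ~A (a near-copy) and F (a string we made up ourselves).
test_docs = [
    "the quick brown fox jumped over the lazy dog",  # ~A, hopefully predicted 1
    "my cat sleeps on the warm windowsill all day",  # F, hopefully predicted 0
]
X_test = vectorizer.transform(test_docs)
print(clf.predict(X_test))  # ideally -> [1 0]
```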