I ran my first couple of training sets today. I must confess, the results are not pretty. Let’s start with the summary:
Summary
The training set for the text categorization example given by Joachims contains 2000 weighted example vectors. The precision of the resultant model, as estimated by svm_learn, is 93.07%.
My first training set used a search for “cars” for positive examples and a search for “film -cars” for negative examples. It contains 63 binary example vectors. The estimated precision is 12.90%.
My second training set used a search for “basketball” for positive examples and a search for “racing -basketball” for negative examples. It contains 61 binary example vectors. The estimated precision is 9.09%. Furthermore, I turned the sentence “Michael Jordan is out to shoot some hoops on the court this week.” into a test vector. It was categorized incorrectly.
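For context, an SVM-light example vector is just a target label followed by sparse feature:value pairs, one line per document. The difference between the example set and mine is what goes in the value slot: the example set carries word weights, while my binary vectors only record presence. The feature IDs and weights below are made up purely for illustration:

```text
# <target> <feature>:<value> <feature>:<value> ...   # optional comment
+1 4:1 27:1 315:1 982:1           # binary vector from a "cars" result
-1 27:1 54:1 118:1 1760:1         # binary vector from a "film -cars" result
+1 4:0.13 27:0.29 315:0.05        # weighted vector, as in the example set
```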
Analysis
These aren’t the sort of statistics I was hoping to see. There are a number of reasons why I might be getting these subpar results.
- Quantity: Sixty-some example vectors simply aren’t going to stand up to the example set of 2000. Of course, the internet is a big place, so there’s no reason (other than Google’s API limitations) that I shouldn’t be generating my own large training sets.
- Quality (Part A): The example set uses weighted vectors while my sets use only binary vectors. In short, I’m not including information about word frequency, only about word appearance (there’s a rough sketch of what that looks like after this list).
- Quality (Part B): I don’t know how counterexamples were selected for the example set, but I’ll admit that my current strategy for finding negative examples is flawed. The selection of a counterexample search term was arbitrary, and using a single search term probably produces an undesirably uniform counterexample set.
- Quality (Part C): The example set was generated by a system trained to ignore trivial words and to reduce complex words to word-parts for consistency. My system currently has no such bells and whistles. I had hoped that the equal presence of elements like markup in positive and negative examples would lead the vector machine to ignore those elements, but the results say otherwise.
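To make Parts A and C concrete, here’s roughly what my binary bag-of-words step boils down to. This is an illustrative sketch, not my actual code; the class and function names are placeholders:

```python
# Illustrative sketch of binary bag-of-words vectorization in SVM-light format.
import re

class SharedDictionary:
    """Maps each word seen so far to a stable feature ID."""
    def __init__(self):
        self.word_ids = {}

    def id_for(self, word):
        if word not in self.word_ids:
            self.word_ids[word] = len(self.word_ids) + 1  # SVM-light feature IDs start at 1
        return self.word_ids[word]

def binary_vector(text, dictionary, label):
    words = set(re.findall(r"[a-z']+", text.lower()))
    ids = sorted(dictionary.id_for(w) for w in words)
    # Every word that shows up gets a weight of 1: frequency is thrown away,
    # and markup or boilerplate words become features just like content words.
    return f"{label} " + " ".join(f"{i}:1" for i in ids)
```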
What remains to be seen is whether these factors can account for a difference of roughly 80 percentage points in estimated precision. Next steps:
- Quantity: Time to switch to Yahoo’s API and start pulling down large result sets.
- Quality (Part A): I can try switching to using word frequency within a document, but I’ll need to modify my shared dictionary class to use the same weight calculation that the example set does (a rough sketch of the idea follows this list).
- Quality (Part B): I’ll either generate counterexamples using a set of searches over other category keywords, or just an OR search. One counter-keyword is not enough.
- Quality (Part C): I’ll start a word filter list to ignore low-content words like “the.”
- Persistence: Everything is runtime right now. I need to rebuild some things and include a mechanism for saving and reloading a common dictionary at the very least. I also need to be able to consult the dictionary to get a feel for which words it’s picking up (also sketched below).
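For Parts A and C, the direction I have in mind looks something like the following: filter stop words before they ever reach the dictionary, and record a frequency-based weight instead of a bare 1. The weighting here is plain normalized term frequency, which may well not match what the example set actually uses, and it reuses the toy SharedDictionary from the earlier sketch:

```python
# Sketch of frequency-weighted vectors with a stop-word filter (illustrative only).
import re
from collections import Counter

STOP_WORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "it"}  # starter list

def weighted_vector(text, dictionary, label):
    words = [w for w in re.findall(r"[a-z']+", text.lower()) if w not in STOP_WORDS]
    counts = Counter(words)
    total = sum(counts.values()) or 1
    # Plain normalized term frequency; swap in whatever weighting the example set uses.
    pairs = sorted((dictionary.id_for(w), count / total) for w, count in counts.items())
    return f"{label} " + " ".join(f"{fid}:{weight:.4f}" for fid, weight in pairs)
```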
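And for persistence, the bare minimum is being able to write the dictionary to disk, read it back, and eyeball its contents. A pickle-based sketch, again using the placeholder SharedDictionary:

```python
# Sketch of saving, reloading, and inspecting the shared dictionary.
import pickle

def save_dictionary(dictionary, path="dictionary.pkl"):
    with open(path, "wb") as f:
        pickle.dump(dictionary.word_ids, f)

def load_dictionary(path="dictionary.pkl"):
    dictionary = SharedDictionary()
    with open(path, "rb") as f:
        dictionary.word_ids = pickle.load(f)
    return dictionary

def dump_words(dictionary):
    # Quick way to see which words the dictionary is actually picking up.
    for word, feature_id in sorted(dictionary.word_ids.items(), key=lambda kv: kv[1]):
        print(feature_id, word)
```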