Here is a list of the potential changes to SVMTrainer that were suggested to me during this weekend’s conference.
Searcher
- Impose limits on acceptable web document sizes to cut document retrieval time
- Try using a small initial search as a seed to get other search terms and expand the diversity of my training set – Yahoo! Term Extraction might be good for this, too.
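
One way the seed-search idea might look, sketched in Python. Everything here is hypothetical rather than part of SVMTrainer: the function name, the tiny stop-word list, and the plain word-frequency ranking, which stands in for a real term extractor such as Yahoo! Term Extraction.

```python
import re
from collections import Counter

# Tiny illustrative stop list (an assumption); a real one would be much longer.
STOP_WORDS = {"the", "a", "an", "and", "or", "of", "to", "in", "is", "for", "on", "with"}


def expand_query_terms(seed_query, seed_documents, max_terms=5):
    """Return up to max_terms frequent words from the seed documents
    that are not already part of the seed query."""
    query_words = set(re.findall(r"[a-z]+", seed_query.lower()))
    counts = Counter()
    for text in seed_documents:
        for word in re.findall(r"[a-z]+", text.lower()):
            # Skip very short words, stop words, and words already searched for.
            if len(word) > 2 and word not in STOP_WORDS and word not in query_words:
                counts[word] += 1
    return [word for word, _ in counts.most_common(max_terms)]
```

The returned words could then be used as additional queries, so that the training set is not limited to documents matching the original search terms.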
WordFilter
- Try integrating WordNet into the WordFilter class
- Find a use for Yahoo! Term Extraction
WebDocument
- Implement parallelism in the retrieval of search results and the retrieval of web documents
- Implement a document retrieval timeout and a URL blacklist to prevent hanging on bad downloads
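
A minimal sketch of how parallel retrieval, a timeout, and a blacklist could fit together, written in Python with only the standard library. The names `fetch_url` and `fetch_all`, the default of 8 workers, and the 10-second timeout are assumptions for illustration, not SVMTrainer's actual design. Note that the `timeout` passed to `urlopen` bounds each blocking socket operation, not the total download time.

```python
from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen


def fetch_url(url, timeout=10.0):
    """Download one URL; raises if a socket operation stalls past `timeout` seconds."""
    with urlopen(url, timeout=timeout) as response:
        return response.read()


def fetch_all(urls, blacklist, fetch=fetch_url, max_workers=8):
    """Fetch URLs in parallel, skipping blacklisted ones.

    Returns a dict of url -> body for successful downloads. Any URL whose
    fetch raises (timeout, DNS failure, HTTP error) is added to `blacklist`
    so later runs do not retry it.
    """
    pending = [url for url in urls if url not in blacklist]

    def attempt(url):
        # Catch errors inside the worker so one bad URL cannot abort the batch.
        try:
            return url, fetch(url), None
        except Exception as error:
            return url, None, error

    results = {}
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        for url, body, error in pool.map(attempt, pending):
            if error is None:
                results[url] = body
            else:
                blacklist.add(url)
    return results
```

Passing the fetch function as a parameter keeps the parallel logic testable without network access, and the same pattern would apply to both the search-result requests and the document downloads.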
Other
- Investigate the use of SVMstruct for categorization/ranking problems in multiple dimensions
- Start doing an independent check on the accuracy of trained sets by holding out 10% of results to test categorization rather than using them for training
- Learn about Xi Alpha estimates and what exactly they mean
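
The 10% hold-out check could be sketched roughly as follows in Python. The function name, the fixed random seed, and the shuffle-then-slice approach are assumptions for illustration; nothing here is taken from SVMTrainer itself.

```python
import random


def split_holdout(examples, holdout_fraction=0.1, seed=0):
    """Shuffle the examples and split them into (training, held_out) lists.

    The held-out portion is never used for training, so accuracy measured
    on it is an independent check on the trained model.
    """
    shuffled = list(examples)
    # A seeded generator keeps the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    if not shuffled:
        return [], []
    # Keep at least one example aside, even for very small sets.
    holdout_size = max(1, round(len(shuffled) * holdout_fraction))
    return shuffled[holdout_size:], shuffled[:holdout_size]
```

Training would use only the first list, and the trained model would then be run on the second list to compare its predictions against the known labels.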