Today I learned about some under-documented limits on Google’s AJAX search API. While working on my Searcher class (which will eventually generate training sets for the SVM), I asked Java to print the first 50 page titles that Google returned. Every time I ran the program, I got a JSONException after 28 results. On closer examination, I found that Google returned the following 400 Bad Request JSON whenever I sent a request with the &start parameter greater than 28:
{
"responseData": null,
"responseDetails": "out of range start",
"responseStatus": 400
}
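To avoid the JSONException, the status can be checked before touching responseData. Here is a minimal sketch of that check. It uses a regular expression only so the example has no dependencies; in the real Searcher the JSON library would do this. The class and method names are made up for illustration, and it assumes a successful response carries a responseStatus of 200.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class ResponseStatusCheck {
    // Matches the "responseStatus" field as it appears in the error body above.
    private static final Pattern STATUS =
            Pattern.compile("\"responseStatus\"\\s*:\\s*(\\d+)");

    /** Returns the responseStatus value, or -1 if the field is not present. */
    static int responseStatus(String json) {
        Matcher m = STATUS.matcher(json);
        return m.find() ? Integer.parseInt(m.group(1)) : -1;
    }

    /** True when the body reports status 200 (assumed to mean responseData is usable). */
    static boolean hasResults(String json) {
        return responseStatus(json) == 200;
    }

    public static void main(String[] args) {
        String error = "{\"responseData\": null, "
                + "\"responseDetails\": \"out of range start\", "
                + "\"responseStatus\": 400}";
        if (!hasResults(error)) {
            // Stop paging instead of trying to read titles from a null responseData.
            System.out.println("Stopping: status " + responseStatus(error));
        }
    }
}
```

Running this on the error body above prints "Stopping: status 400".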
This seemed a little absurd, considering that in previous queries Google claimed to have found over 14 million results for the same search terms. Naturally, I started digging online to see if anyone else had encountered this magic 28 barrier. I soon learned that the AJAX search API is limited to 32 results, and that to get all 32 you must include the &rsz=large directive in your request, which sets 8 results per request instead of 4.
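With 8 results per request and a cap of 32, the only useful &start values are 0, 8, 16, and 24. A rough sketch of generating those requests follows. The class name is hypothetical, and "q" is assumed to be the parameter that carries the search terms; only &start and &rsz=large come from what I found above.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;
import java.util.ArrayList;
import java.util.List;

class PagedQuery {
    static final int PAGE_SIZE = 8;    // results per request with rsz=large
    static final int MAX_RESULTS = 32; // overall cap described above

    /** Start offsets that stay under the cap: 0, 8, 16, 24. */
    static List<Integer> startOffsets() {
        List<Integer> offsets = new ArrayList<Integer>();
        for (int start = 0; start < MAX_RESULTS; start += PAGE_SIZE) {
            offsets.add(start);
        }
        return offsets;
    }

    /** Builds the query string for one page; "q" is an assumed parameter name. */
    static String queryString(String terms, int start) {
        try {
            return "q=" + URLEncoder.encode(terms, "UTF-8")
                    + "&rsz=large&start=" + start;
        } catch (UnsupportedEncodingException e) {
            // UTF-8 is always available on a conforming JVM.
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        for (int start : startOffsets()) {
            System.out.println(queryString("support vector machine", start));
        }
    }
}
```

Each printed line would be appended to the API endpoint, giving four requests in total for the full 32 results.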
This could really hinder the quality of my training sets. I suppose I can just keep adding new results to the 100 most recent for each category (I wrote a nice little class to do just that), but then it could take a while to build a diverse training set: several days, even if the results changed every day. On the other hand, I read that Yahoo’s web search API offers up to 1000 results with a cap of 5000 queries in 24 hours. Switching to Yahoo might be a good option, if their results are kept as up-to-date as Google’s. I’ll have to do some research, or maybe make the search interface modular so I can try both.
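Here is one way the modular idea might look, sketched under a few assumptions. SearchBackend, FixedBackend, and RecentResults are hypothetical names, not my actual classes. The backend here returns canned titles so the example runs on its own; a Google or Yahoo client would implement the same interface. RecentResults illustrates the "most recent N per category" accumulation with a small capacity.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.LinkedHashSet;
import java.util.List;

/** Hypothetical abstraction: anything that can return page titles for a query. */
interface SearchBackend {
    List<String> search(String terms);
}

/** Stand-in backend with canned results, used in place of a real API client. */
class FixedBackend implements SearchBackend {
    private final List<String> titles;

    FixedBackend(String... titles) {
        this.titles = Arrays.asList(titles);
    }

    public List<String> search(String terms) {
        return titles;
    }
}

/** Keeps the most recent distinct titles for one category, up to a fixed capacity. */
class RecentResults {
    private final int capacity;
    private final LinkedHashSet<String> titles = new LinkedHashSet<String>();

    RecentResults(int capacity) {
        this.capacity = capacity;
    }

    void addAll(List<String> fresh) {
        for (String title : fresh) {
            titles.remove(title); // re-adding moves a repeated title to the newest position
            titles.add(title);
        }
        // Evict the oldest entries once over capacity.
        Iterator<String> it = titles.iterator();
        while (titles.size() > capacity) {
            it.next();
            it.remove();
        }
    }

    List<String> snapshot() {
        return new ArrayList<String>(titles);
    }
}

class TrainingSetDemo {
    public static void main(String[] args) {
        SearchBackend dayOne = new FixedBackend("a", "b", "c");
        SearchBackend dayTwo = new FixedBackend("c", "d");
        RecentResults recent = new RecentResults(3); // would be 100 in practice
        recent.addAll(dayOne.search("some category"));
        recent.addAll(dayTwo.search("some category"));
        System.out.println(recent.snapshot()); // prints [b, c, d]
    }
}
```

Because the accumulator only sees the SearchBackend interface, swapping Google for Yahoo would mean writing one new implementation and leaving the training-set code alone.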