
Search Engine Wrapper

This package provides a Java wrapper framework for unifying programmatic access to search engines. A convenience class is also included for downloading the files at the URLs in the search engine results. This package contains an API as well as a command-line application.

The search engine wrapper has the following key features. Firstly, as search engines restrict the number of results they return per request, the search engine wrapper automatically issues as many requests as needed to obtain the required number of results for a particular query string. Secondly, when transient errors such as network errors and server reboots occur, the search engine wrapper retries the query as many times as needed to obtain the results. The waiting time between successive retries is determined by an exponential backoff mechanism. Thirdly, for search engines that require a key to function, the search engine wrapper can manage a pool of keys to be rotated.
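
As a rough illustration of the retry behaviour described above, the following Java sketch doubles the waiting time after each failed attempt, up to a cap. The class name, the method names, the doubling factor, and the bounded number of attempts are all assumptions made for this example; they are not taken from the package's API, which retries as many times as needed.

```java
import java.util.concurrent.Callable;

// Hypothetical sketch of retrying with exponential backoff; not the package's own code.
final class BackoffSketch {

    /** Delay before retry number `attempt` (0-based): initialMillis doubled `attempt` times, capped at maxMillis. */
    static long delayMillis(int attempt, long initialMillis, long maxMillis) {
        long delay = initialMillis;
        for (int i = 0; i < attempt && delay < maxMillis; i++) {
            delay *= 2;
        }
        return Math.min(delay, maxMillis);
    }

    /** Runs the task, sleeping between failed attempts, and rethrows the last failure if every attempt fails. */
    static <T> T retry(Callable<T> task, int maxAttempts, long initialMillis, long maxMillis) throws Exception {
        if (maxAttempts < 1) {
            throw new IllegalArgumentException("maxAttempts must be at least 1");
        }
        Exception last = null;
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            try {
                return task.call();
            } catch (Exception e) {
                last = e; // treated as transient in this sketch
                if (attempt + 1 < maxAttempts) {
                    Thread.sleep(delayMillis(attempt, initialMillis, maxMillis));
                }
            }
        }
        throw last;
    }
}
```

With an initial delay of 100 ms and a cap of 1000 ms, the waits in this sketch would be 100, 200, 400, 800, and then 1000 ms for every later retry.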

Currently, this package contains clients for the following search engines: Google AJAX Search API, Yahoo! Search API, Yahoo! Search BOSS API, and Bing API (formerly known as Live Search API). The search engine wrapper framework is designed such that additional clients for other search engines can easily be integrated into this framework.

This search engine wrapper package is open source and licensed under the GNU General Public License, either version 3 or (at your option) any later version. Sun Java 1.6 is required to use the package.

Status

This package is under general release.

Please note that the Google AJAX Search API does not return a reliable total number of results. Unfortunately, the quality of the data returned by search engine providers is not within our control, so users who require reliable total counts should use another search engine API such as the Yahoo! Search API.

Downloads

Release: 2009 September 30

This release makes two improvements in the output XML files:

  1. The output XML files are now well-formed and can be parsed by an XML parser. Backwards compatibility is ensured: this release can read in XML files produced by previous releases, and previous releases can read in XML files produced by this release. This is achieved through two new classes, ResultsXMLReader and ResultsXMLWriter, which replace the legacy SearchEngineResultsXML class, which is now deprecated.
  2. The output XML files now only contain a single <results> element for each query, rather than having possibly multiple <results> elements for the same query but with differing start indices. This is achieved by modifying the search engine wrapper itself.

Following the retirement of Google SOAP Search, the search engine client for Google SOAP Search has been removed from this package. Other deprecated classes of search engine clients have been removed as well.

This release also includes an updated version of the charset detect stream reader.

Release: 2009 July 17

Added a client for Yahoo! Search BOSS, and removed the dependency of the Yahoo! Search client on yahoo-search.jar, so that yahoo-search.jar could be removed from this distribution. This distribution also includes updated versions of the file downloader and the charset detect stream reader components.

Release: 2009 June 25

Fixed a bug where the Google AJAX Search client threw a NullPointerException wrapped inside a SearchEngineException when a query returned no results. Also, the search engine wrapper now recognizes the maximum number of results that can be returned per query (Google AJAX Search allows the user to retrieve only the first 64 results).
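
To make the interaction between the per-request limit and the per-query maximum concrete, here is a small Java sketch that plans the requests needed for a query. The class name, the zero-based start index, and the per-request limit of 8 used in the note below are assumptions for illustration only; the cap of 64 is the figure quoted above.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch of splitting one query into several requests; not the package's own code.
final class RequestPlanSketch {

    /**
     * Returns {start, count} pairs covering the first `wanted` results,
     * never asking for more than perRequestLimit at once or engineMax in total.
     */
    static List<int[]> plan(int wanted, int perRequestLimit, int engineMax) {
        if (perRequestLimit < 1) {
            throw new IllegalArgumentException("perRequestLimit must be at least 1");
        }
        int total = Math.max(0, Math.min(wanted, engineMax));
        List<int[]> requests = new ArrayList<int[]>();
        for (int start = 0; start < total; start += perRequestLimit) {
            int count = Math.min(perRequestLimit, total - start);
            requests.add(new int[] { start, count });
        }
        return requests;
    }
}
```

Under these assumptions, plan(100, 8, 64) yields eight requests starting at offsets 0, 8, and so on up to 56, because the 100 requested results are first clamped to the 64 the engine allows.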

It is now possible to create input query files in UTF-8, UTF-16, or UTF-32 with the byte-order mark, such as by saving a text file in UTF-8 format in Notepad, and the search engine wrapper package now reads in such files correctly. The project that supports this functionality is the charset detect stream reader.
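
For readers unfamiliar with byte-order marks, the Java sketch below shows one way the encoding of such a file could be recognised from its first bytes. It is only an illustration with a hypothetical class name; it does not reproduce the charset detect stream reader's implementation.

```java
// Hypothetical sketch of byte-order-mark detection; not the charset detect stream reader itself.
final class BomSketch {

    /** Returns the charset name implied by a leading byte-order mark, or null if none is present. */
    static String detect(byte[] data) {
        // UTF-32 is checked before UTF-16 because FF FE 00 00 also begins with the UTF-16LE mark FF FE.
        if (startsWith(data, 0x00, 0x00, 0xFE, 0xFF)) return "UTF-32BE";
        if (startsWith(data, 0xFF, 0xFE, 0x00, 0x00)) return "UTF-32LE";
        if (startsWith(data, 0xEF, 0xBB, 0xBF)) return "UTF-8";
        if (startsWith(data, 0xFE, 0xFF)) return "UTF-16BE";
        if (startsWith(data, 0xFF, 0xFE)) return "UTF-16LE";
        return null;
    }

    /** Compares the first bytes of data, as unsigned values, with the given prefix. */
    private static boolean startsWith(byte[] data, int... prefix) {
        if (data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if ((data[i] & 0xFF) != prefix[i]) {
                return false;
            }
        }
        return true;
    }
}
```

A caller would typically skip the detected mark and decode the remaining bytes with the returned charset, falling back to a default encoding when no mark is found.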

Release: 2009 June 15

This release has googleapi.jar removed, together with all dependencies on it, to avoid possible compilation errors or other conflicts. Also, HTML character entities in the search engine results are now unescaped.
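
As an illustration of what unescaping HTML character entities involves, the Java sketch below decodes numeric entities and a handful of named ones in a single pass. The class name and the short table of named entities are assumptions for this example; the package's actual unescaping code may handle a different set.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical sketch of HTML entity unescaping; not the package's own code.
final class EntitySketch {

    private static final Pattern ENTITY =
            Pattern.compile("&(#[0-9]+|#[xX][0-9a-fA-F]+|[A-Za-z]+);");

    // Deliberately small table; a full implementation would cover all HTML named entities.
    private static final Map<String, String> NAMED = new HashMap<String, String>();
    static {
        NAMED.put("amp", "&");
        NAMED.put("lt", "<");
        NAMED.put("gt", ">");
        NAMED.put("quot", "\"");
        NAMED.put("apos", "'");
    }

    /** Replaces recognised entities once, left to right; unknown entities are left unchanged. */
    static String unescape(String text) {
        Matcher m = ENTITY.matcher(text);
        StringBuffer out = new StringBuffer();
        while (m.find()) {
            String body = m.group(1);
            String replacement;
            if (body.charAt(0) == '#') {
                replacement = decodeNumeric(body, m.group());
            } else {
                replacement = NAMED.containsKey(body) ? NAMED.get(body) : m.group();
            }
            m.appendReplacement(out, Matcher.quoteReplacement(replacement));
        }
        m.appendTail(out);
        return out.toString();
    }

    /** Decodes "#233" or "#xE9" style bodies; returns the original text if the number is not a valid code point. */
    private static String decodeNumeric(String body, String original) {
        boolean hex = body.charAt(1) == 'x' || body.charAt(1) == 'X';
        try {
            int codePoint = hex
                    ? Integer.parseInt(body.substring(2), 16)
                    : Integer.parseInt(body.substring(1));
            return Character.isValidCodePoint(codePoint)
                    ? new String(Character.toChars(codePoint))
                    : original;
        } catch (NumberFormatException e) {
            return original;
        }
    }
}
```

Because the text is scanned only once, a doubly escaped sequence such as &amp;lt; becomes &lt; rather than <.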

The sg.yeefan.filedownloader package has been moved out from the search engine wrapper project into its own project. Both binary and source distributions for that package can be downloaded from the file downloader project page.

Release: 2009 June 9

There are quite a few significant changes in this release. In summary:

  1. Output XML files are now saved as UTF-8. This and later releases should be able to read XML files saved by earlier releases, but earlier releases may not be able to read XML files saved by this and later releases without information loss unless the platform default encoding is UTF-8.
  2. Search engine clients in the sg.yeefan.searchenginewrapper package are now deprecated; replacements are in the newly created package sg.yeefan.searchenginewrapper.clients.
  3. A search engine client for Google AJAX Search API is implemented in this release. As Google will be retiring the Google SOAP Search API on 2009 August 31, current users of the Google SOAP Search API are advised to migrate to Google AJAX Search API or another search engine.
  4. The search engine wrapper now requests as many results as possible per request, subject to the allowances of the search engine. Previously, the search engine wrapper always requested ten results per request.

Although I strive to make the changes backwards compatible, some of these changes may affect backwards compatibility. Please refer to the IMPORTANT NOTICES section of the README.TXT in this distribution for more details.

Release: 2009 May 28

Search engine clients can now throw a SearchEngineQuotaException to indicate that the daily quota has been used up. For search engines that require a key and where we have a pool of keys to manage, the behaviour of the search engine wrapper is now modified to rotate to the next key immediately without waiting when a SearchEngineQuotaException is caught. Also, the parameters of the exponential backoff waiting mechanism can now be modified programmatically through the API.
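
The key rotation described above might look roughly like the following Java sketch. The QuotaException class and the Client interface are hypothetical stand-ins defined only for this example; they are not the package's SearchEngineQuotaException or its client API, whose exact signatures are not shown here.

```java
import java.util.List;

// Hypothetical sketch of rotating through a key pool on quota errors; not the package's own code.
final class KeyRotationSketch {

    /** Stand-in for an exception signalling that a key's daily quota has been used up. */
    static class QuotaException extends Exception {
        QuotaException(String message) {
            super(message);
        }
    }

    /** Stand-in for a client that issues one request with one key. */
    interface Client {
        String search(String query, String key) throws QuotaException;
    }

    /** Tries each key in turn, moving on immediately when a key's quota is exhausted. */
    static String searchWithRotation(Client client, String query, List<String> keys)
            throws QuotaException {
        QuotaException last = null;
        for (String key : keys) {
            try {
                return client.search(query, key);
            } catch (QuotaException e) {
                last = e; // quota used up: rotate to the next key without a backoff wait
            }
        }
        if (last == null) {
            throw new IllegalArgumentException("key pool is empty");
        }
        throw last; // every key in the pool has exhausted its quota
    }
}
```

The design point is that a quota error is not transient in the way a network error is, so waiting before trying the next key would only waste time.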

Finally, the documentation has been expanded with considerably more detail, in both the README.TXT file and the Javadocs.

Release: 2009 March 6

This version adds a utility class for downloading the web pages and other files in the returned search results.

Release: 2009 February 24

This is the initial release of the search engine wrapper package.