Record Matching Package

Record linkage, record deduplication, entity resolution, object consolidation, and citation matching are closely related tasks that all involve matching items. Here, we assume that the input items are expressed as records and refer to all of these tasks simply as record matching.

This record matching package is written as an extensible framework in Java, with the goal of making it easier to write programs that perform record matching tasks. The focus is on pairwise comparison of records, and the package includes building blocks for similarity or distance metrics, blocking algorithms, and clustering algorithms. A few metrics and algorithms are included in the package, and new ones can be implemented by subclassing the suitable classes in this framework.
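To make the three building blocks concrete, here is a minimal, self-contained sketch of how a metric, a blocker, and a clusterer might fit together. It does not use the package itself: every class and method name below (`SimilarityMetric`, `TokenJaccard`, `block`, `cluster`) is a hypothetical stand-in chosen for illustration, and the package's real base classes may look quite different. The code sticks to Java 6 syntax.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch: none of these class or method names come from the package.
public class PairwiseMatchingSketch {

    /** Base class for metrics; a new metric subclasses it and implements similarity(). */
    static abstract class SimilarityMetric {
        /** Returns a score in [0, 1], where 1 means the two values are considered identical. */
        abstract double similarity(String a, String b);
    }

    /** Jaccard similarity over lowercased, whitespace-separated tokens. */
    static class TokenJaccard extends SimilarityMetric {
        @Override
        double similarity(String a, String b) {
            Set<String> ta = tokens(a);
            Set<String> tb = tokens(b);
            if (ta.isEmpty() && tb.isEmpty()) {
                return 1.0;
            }
            Set<String> common = new HashSet<String>(ta);
            common.retainAll(tb);
            Set<String> all = new HashSet<String>(ta);
            all.addAll(tb);
            return (double) common.size() / all.size();
        }

        private static Set<String> tokens(String s) {
            Set<String> out = new HashSet<String>();
            for (String t : s.toLowerCase().split("\\s+")) {
                if (t.length() > 0) {
                    out.add(t);
                }
            }
            return out;
        }
    }

    /** Blocking: group record indices by the first character of the trimmed, lowercased record. */
    static Map<Character, List<Integer>> block(List<String> records) {
        Map<Character, List<Integer>> blocks = new LinkedHashMap<Character, List<Integer>>();
        for (int i = 0; i < records.size(); i++) {
            String r = records.get(i).trim().toLowerCase();
            char key = r.length() == 0 ? ' ' : r.charAt(0);
            List<Integer> members = blocks.get(key);
            if (members == null) {
                members = new ArrayList<Integer>();
                blocks.put(key, members);
            }
            members.add(i);
        }
        return blocks;
    }

    /** Clustering: compare pairs within each block; merge pairs scoring at or above the threshold. */
    static int[] cluster(List<String> records, SimilarityMetric metric, double threshold) {
        int[] label = new int[records.size()];
        for (int i = 0; i < label.length; i++) {
            label[i] = i; // every record starts in its own cluster
        }
        for (List<Integer> members : block(records).values()) {
            for (int x = 0; x < members.size(); x++) {
                for (int y = x + 1; y < members.size(); y++) {
                    int i = members.get(x);
                    int j = members.get(y);
                    if (metric.similarity(records.get(i), records.get(j)) >= threshold) {
                        relabel(label, label[j], label[i]);
                    }
                }
            }
        }
        return label;
    }

    /** Replaces every occurrence of one cluster label with another (a simple transitive merge). */
    private static void relabel(int[] label, int from, int to) {
        for (int k = 0; k < label.length; k++) {
            if (label[k] == from) {
                label[k] = to;
            }
        }
    }

    public static void main(String[] args) {
        List<String> records = Arrays.asList("John A Smith", "john smith", "Jane Doe", "Smith John");
        int[] labels = cluster(records, new TokenJaccard(), 0.5);
        System.out.println(Arrays.toString(labels));
    }
}
```

Running `main` prints `[0, 0, 2, 3]`: the first two records are merged, while "Smith John" stays on its own even though it has the same tokens as "john smith", because the first-character blocker never places the two in the same block. This is the usual trade-off of blocking: fewer comparisons at the risk of missing some matches.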

Java was chosen as the programming language mainly for portability, and to promote code reuse by making it easy to integrate other Java packages into this record matching framework. In particular, it is easy to include bridges to other approximate matching components that are written in Java. The package bundles access to two popular string matching libraries, SecondString and SimMetrics, as well as a means to transform record pairs into instances so that classification algorithms in the Weka collection of machine learners can be used.
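The idea of turning record pairs into classifier instances can be sketched without either library: each pair becomes one row of numeric features (one similarity score per field) plus a match/non-match label. The example below writes such rows in ARFF, the text format that Weka reads. It is only an illustration under stated assumptions: the names `prefixSimilarity`, `features`, and `toArff`, the attribute names, and the crude prefix metric are all hypothetical, and the package's own transformation may work differently (for instance, by building Weka instances in memory).

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch: these names are illustrative and are not the package's API.
public class PairFeaturesSketch {

    /** Length of the shared case-insensitive prefix divided by the longer string's length. */
    static double prefixSimilarity(String a, String b) {
        int longer = Math.max(a.length(), b.length());
        if (longer == 0) {
            return 1.0;
        }
        int shorter = Math.min(a.length(), b.length());
        int shared = 0;
        while (shared < shorter
                && Character.toLowerCase(a.charAt(shared)) == Character.toLowerCase(b.charAt(shared))) {
            shared++;
        }
        return (double) shared / longer;
    }

    /** One numeric feature per field; assumes both records have the same number of fields. */
    static double[] features(String[] left, String[] right) {
        double[] f = new double[left.length];
        for (int i = 0; i < left.length; i++) {
            f[i] = prefixSimilarity(left[i], right[i]);
        }
        return f;
    }

    /** Renders labelled feature vectors as ARFF text with a nominal class attribute. */
    static String toArff(String[] fieldNames, List<double[]> rows, List<Boolean> matches) {
        StringBuilder sb = new StringBuilder("@relation record_pairs\n\n");
        for (String name : fieldNames) {
            sb.append("@attribute ").append(name).append("_sim numeric\n");
        }
        sb.append("@attribute class {match,nonmatch}\n\n@data\n");
        for (int r = 0; r < rows.size(); r++) {
            for (double v : rows.get(r)) {
                sb.append(v).append(',');
            }
            sb.append(matches.get(r) ? "match" : "nonmatch").append('\n');
        }
        return sb.toString();
    }

    public static void main(String[] args) {
        String[] fields = {"name", "year"};
        List<double[]> rows = new ArrayList<double[]>();
        List<Boolean> matches = new ArrayList<Boolean>();
        rows.add(features(new String[] {"Smith, J", "1999"}, new String[] {"Smith, John", "1999"}));
        matches.add(true);
        rows.add(features(new String[] {"Smith, J", "1999"}, new String[] {"Doe, Jane", "2004"}));
        matches.add(false);
        System.out.print(toArff(fields, rows, matches));
    }
}
```

In practice the per-field scores would come from stronger metrics, such as those provided by SecondString or SimMetrics, and a classifier trained on labelled rows like these would then predict the label for unseen pairs.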

This record matching package is open source and licensed under the GNU General Public License, either version 3 or (at your option) any later version. Sun Java 1.6 is required to use the package.

Status

This project is currently in beta.

Downloads

Release: 2009 April 7 (beta)

This release makes a major change to the API. Instead of all classes residing in the sg.yeefan.recordmatching package, the package is now split into several subpackages that group classes by functionality: sg.yeefan.recordmatching.fields for record fields, sg.yeefan.recordmatching.metrics for similarity and distance metrics, sg.yeefan.recordmatching.blockers for blocking algorithms, and sg.yeefan.recordmatching.clusterers for clustering algorithms. The interfaces of the individual classes remain unchanged.

A new subpackage, sg.yeefan.recordmatching.transformers, is also introduced in this release. Currently it holds the base classes that serve as a foundation for transforming records and record fields.

Release: 2009 February 24 (alpha)

This is the initial release of the record matching package.