
Record Matching Package

Record linkage, record deduplication, entity resolution, object consolidation, and citation matching are closely related tasks that all involve matching items. Here, we assume that the input items are expressed as records, and we refer to all of these tasks simply as record matching.

This record matching package is written as an extensible framework in Java, with the goal of making it easier to write programs that perform record matching tasks. The focus here is on pairwise comparison of records, and this package includes building blocks for similarity or distance metrics, blocking algorithms, and clustering algorithms. A couple of metrics and algorithms are included in this package, and new metrics and algorithms can be implemented by subclassing suitable classes in this framework.
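To make the idea of blocking followed by pairwise comparison concrete, here is a minimal standalone sketch. It does not use this package's API: the names PairwiseMatchingSketch, SimilarityMetric, TokenJaccard, block, and match are all hypothetical and chosen only for illustration, and the blocking key (first character of the record) is an assumption made to keep the example short.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;
import java.util.TreeMap;

// Standalone illustration; none of these names come from the record matching package.
public class PairwiseMatchingSketch {

    /** Hypothetical similarity metric returning a score in [0, 1]. */
    interface SimilarityMetric {
        double similarity(String a, String b);
    }

    /** Jaccard similarity over lower-cased, whitespace-separated tokens. */
    static class TokenJaccard implements SimilarityMetric {
        public double similarity(String a, String b) {
            Set<String> ta = tokens(a);
            Set<String> tb = tokens(b);
            if (ta.isEmpty() && tb.isEmpty()) {
                return 1.0;
            }
            Set<String> common = new HashSet<String>(ta);
            common.retainAll(tb);
            Set<String> union = new HashSet<String>(ta);
            union.addAll(tb);
            return (double) common.size() / union.size();
        }

        private static Set<String> tokens(String s) {
            Set<String> result = new HashSet<String>();
            for (String t : s.toLowerCase().trim().split("\\s+")) {
                if (t.length() > 0) {
                    result.add(t);
                }
            }
            return result;
        }
    }

    /** Blocking step: group record indices by lower-cased first character. */
    static Map<Character, List<Integer>> block(List<String> records) {
        Map<Character, List<Integer>> blocks = new TreeMap<Character, List<Integer>>();
        for (int i = 0; i < records.size(); i++) {
            String r = records.get(i).trim();
            if (r.isEmpty()) {
                continue; // empty records have no blocking key and are skipped
            }
            Character key = Character.valueOf(Character.toLowerCase(r.charAt(0)));
            List<Integer> members = blocks.get(key);
            if (members == null) {
                members = new ArrayList<Integer>();
                blocks.put(key, members);
            }
            members.add(Integer.valueOf(i));
        }
        return blocks;
    }

    /** Compares pairs within each block; returns index pairs scoring at least threshold. */
    static List<int[]> match(List<String> records, SimilarityMetric metric, double threshold) {
        List<int[]> matches = new ArrayList<int[]>();
        for (List<Integer> members : block(records).values()) {
            for (int x = 0; x < members.size(); x++) {
                for (int y = x + 1; y < members.size(); y++) {
                    int i = members.get(x).intValue();
                    int j = members.get(y).intValue();
                    if (metric.similarity(records.get(i), records.get(j)) >= threshold) {
                        matches.add(new int[] { i, j });
                    }
                }
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        List<String> records = new ArrayList<String>();
        records.add("Alice Tan");
        records.add("alice tan");
        records.add("Bob Lee");
        records.add("Alice Lim");
        for (int[] pair : match(records, new TokenJaccard(), 0.5)) {
            System.out.println(records.get(pair[0]) + " <-> " + records.get(pair[1]));
        }
    }
}
```

Blocking keeps the number of comparisons down: only records that share a blocking key are compared, so "Bob Lee" is never scored against the "Alice" records. A different metric can be swapped in by implementing the hypothetical SimilarityMetric interface.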

Java was chosen as the programming language mainly for portability, and to promote code reuse by making it easy to integrate other Java packages into this record matching framework. In particular, it is easy to include bridges to other approximate matching components that are written in Java. Bundled in this package is access to two popular string matching libraries, SecondString and SimMetrics, as well as a means to transform record pairs into instances so that classification algorithms in the Weka collection of machine learners can be used.

Oracle Java 1.6 is required to use this package.


The record matching package is open source. All releases are licensed under the GNU General Public License, either Version 3 or (at your option) any later version.


This package is under general release.

However, the sg.yeefan.recordmatching.clusterers package should be considered unstable for now.


Release: 2010 February 14

In the sg.yeefan.recordmatching.clusterers package, the classes related to hierarchical clustering have been restructured. This affects HierarchicalClusterer, DendrogramNode, and their subclasses. At the same time, the AgglomerativeClusterer class now contains a more efficient O(n² log n) implementation of the hierarchical agglomerative clustering algorithm. The changes to the API are not backwards compatible.
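For readers unfamiliar with how hierarchical agglomerative clustering can run in O(n² log n) time, the following standalone sketch shows one common approach: a priority queue of candidate cluster pairs with lazy deletion of stale entries. It is not the code in AgglomerativeClusterer and does not use this package's API. The class and method names (AgglomerativeSketch, Merge, Pair, cluster) are hypothetical, and average linkage is an assumption made for the example.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.PriorityQueue;

// Standalone illustration; not the package's AgglomerativeClusterer.
public class AgglomerativeSketch {

    /** One merge step: clusters left and right were joined at the given distance. */
    static class Merge {
        final int left;
        final int right;
        final double distance;

        Merge(int left, int right, double distance) {
            this.left = left;
            this.right = right;
            this.distance = distance;
        }
    }

    /** Heap entry for a candidate pair of clusters, ordered by distance. */
    static class Pair implements Comparable<Pair> {
        final int a;
        final int b;
        final double distance;

        Pair(int a, int b, double distance) {
            this.a = a;
            this.b = b;
            this.distance = distance;
        }

        public int compareTo(Pair other) {
            return Double.compare(distance, other.distance);
        }
    }

    /**
     * Average-linkage clustering over a symmetric distance matrix.
     * Items are clusters 0..n-1; the k-th merge creates cluster n+k.
     */
    static List<Merge> cluster(double[][] dist) {
        List<Merge> merges = new ArrayList<Merge>();
        int n = dist.length;
        if (n < 2) {
            return merges;
        }
        int total = 2 * n - 1;
        double[][] d = new double[total][total];
        int[] size = new int[total];
        boolean[] active = new boolean[total];
        PriorityQueue<Pair> heap = new PriorityQueue<Pair>();
        for (int i = 0; i < n; i++) {
            size[i] = 1;
            active[i] = true;
            for (int j = i + 1; j < n; j++) {
                d[i][j] = d[j][i] = dist[i][j];
                heap.add(new Pair(i, j, dist[i][j]));
            }
        }
        int next = n;
        while (next < total) {
            Pair p = heap.poll();
            if (!active[p.a] || !active[p.b]) {
                continue; // stale entry: one side has already been merged away
            }
            active[p.a] = false;
            active[p.b] = false;
            size[next] = size[p.a] + size[p.b];
            // Distance from the new cluster to every remaining cluster
            // is the size-weighted average of the two merged clusters' distances.
            for (int k = 0; k < next; k++) {
                if (!active[k]) {
                    continue;
                }
                double dk = (size[p.a] * d[p.a][k] + size[p.b] * d[p.b][k]) / size[next];
                d[next][k] = d[k][next] = dk;
                heap.add(new Pair(k, next, dk));
            }
            active[next] = true;
            merges.add(new Merge(p.a, p.b, p.distance));
            next++;
        }
        return merges;
    }

    public static void main(String[] args) {
        double[] points = { 0, 1, 10, 12, 50 };
        double[][] dist = new double[points.length][points.length];
        for (int i = 0; i < points.length; i++) {
            for (int j = 0; j < points.length; j++) {
                dist[i][j] = Math.abs(points[i] - points[j]);
            }
        }
        for (Merge m : cluster(dist)) {
            System.out.println(m.left + " + " + m.right + " at " + m.distance);
        }
    }
}
```

The heap receives O(n²) entries in total (the initial pairs plus at most n new pairs per merge), and each insertion or removal costs O(log n), which gives the O(n² log n) bound. The sketch trades memory for simplicity by allocating a (2n - 1) by (2n - 1) distance matrix.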

The classes originally in sg.yeefan.util have been moved to their own project in common utilities, with a greatly expanded set of utility classes. Note that there are a number of API changes between the old and new sg.yeefan.util packages.

In the sg.yeefan.recordmatching.metrics package, CombinationMetric and its subclasses have been improved slightly. The changes are backwards compatible.

The whole package is now considered general release, except that the sg.yeefan.recordmatching.clusterers package should be considered unstable in this release.

Release: 2009 April 7 (beta)

There is a major change to the API. Instead of having all classes residing in the sg.yeefan.recordmatching package, this package is now split into several subpackages that group classes by their functionality: sg.yeefan.recordmatching.fields for record fields, sg.yeefan.recordmatching.metrics for similarity and distance metrics, sg.yeefan.recordmatching.blockers for blocking algorithms, and sg.yeefan.recordmatching.clusterers for clustering algorithms. The interface of the individual classes remains unchanged.

A new subpackage, sg.yeefan.recordmatching.transformers, is also introduced in this release. Currently it holds the base classes that serve as a foundation for transforming records and record fields.


Release: 2009 February 24 (alpha)

This is the initial release of the record matching package.