Record Matching Package
A family of related tasks, including record linkage, record deduplication, entity resolution, object consolidation, and citation matching, all involve matching items. Here, we assume that the input items are expressed as records and simply call the task record matching.
This record matching package is written as an extensible framework in Java, with the goal of making the writing of programs that perform record matching tasks easier. The focus here is on pairwise comparison of records, and this package includes building blocks for similarity or distance metrics, blocking algorithms, and clustering algorithms. A couple of metrics and algorithms are included in this package, and new metrics and algorithms can be implemented by subclassing suitable classes in this framework.
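To make the pairwise approach concrete, the following is a minimal, self-contained sketch of how a similarity metric and a blocking step fit together. It does not use this package's API: the names SimilarityMetric, TokenJaccard, block, and match are hypothetical and exist only for illustration, as are the choice of token-based Jaccard similarity and first-letter blocking.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

class PairwiseMatchingSketch {
    /** Hypothetical metric interface (an assumption, not the package's API). */
    interface SimilarityMetric {
        double similarity(String a, String b);
    }

    /** Jaccard similarity over lowercased, whitespace-separated tokens. */
    static class TokenJaccard implements SimilarityMetric {
        public double similarity(String a, String b) {
            Set<String> x = tokens(a);
            Set<String> y = tokens(b);
            if (x.isEmpty() && y.isEmpty()) {
                return 1.0;
            }
            Set<String> common = new HashSet<String>(x);
            common.retainAll(y);
            Set<String> union = new HashSet<String>(x);
            union.addAll(y);
            return (double) common.size() / union.size();
        }

        private static Set<String> tokens(String s) {
            Set<String> result = new HashSet<String>();
            for (String t : s.toLowerCase().trim().split("\\s+")) {
                if (!t.isEmpty()) {
                    result.add(t);
                }
            }
            return result;
        }
    }

    /** Blocking: group records by their lowercased first character. */
    static Map<String, List<String>> block(List<String> records) {
        Map<String, List<String>> blocks = new LinkedHashMap<String, List<String>>();
        for (String r : records) {
            String key = r.isEmpty() ? "" : r.substring(0, 1).toLowerCase();
            List<String> bucket = blocks.get(key);
            if (bucket == null) {
                bucket = new ArrayList<String>();
                blocks.put(key, bucket);
            }
            bucket.add(r);
        }
        return blocks;
    }

    /** Compare only pairs within the same block; keep pairs at or above the threshold. */
    static List<String[]> match(List<String> records, SimilarityMetric metric, double threshold) {
        List<String[]> matches = new ArrayList<String[]>();
        for (List<String> bucket : block(records).values()) {
            for (int i = 0; i < bucket.size(); i++) {
                for (int j = i + 1; j < bucket.size(); j++) {
                    if (metric.similarity(bucket.get(i), bucket.get(j)) >= threshold) {
                        matches.add(new String[] { bucket.get(i), bucket.get(j) });
                    }
                }
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        List<String> records = Arrays.asList("John A Smith", "john smith", "Jane Doe", "Smith John");
        for (String[] pair : match(records, new TokenJaccard(), 0.6)) {
            System.out.println(pair[0] + " <-> " + pair[1]);
        }
    }
}
```

In this sketch, "Smith John" falls into a different block from "john smith" and is never compared with it, which illustrates the usual trade-off of blocking: fewer comparisons at the risk of missing some true matches.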
Java was chosen as the programming language mainly for portability, and to promote code reuse by making it easy to integrate other Java packages into this record matching framework. In particular, it is easy to include bridges to other approximate matching components that are written in Java. Bundled in this package is access to two popular string matching libraries, SecondString and SimMetrics, as well as a means to transform record pairs into instances so that classification algorithms in the Weka collection of machine learners can be used.
Oracle Java 1.6 is required to use this package.
The record matching package is open source. All releases are licensed under the GNU General Public License, either Version 3 or (at your option) any later version.
This package is under general release.
The sg.yeefan.recordmatching.clusterers package should
be considered unstable for now.
Release: 2010 February 14
In the sg.yeefan.recordmatching.clusterers package, the classes
related to hierarchical clustering have been restructured. This affects
DendrogramNode and the related
subclasses. At the same time, the
AgglomerativeClusterer class now
contains a more efficient O(n^2 log n) implementation of the
hierarchical agglomerative clustering algorithm. The changes to the API are not
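For readers unfamiliar with how hierarchical agglomerative clustering can reach O(n^2 log n), here is a generic sketch that uses a priority queue with lazy deletion of stale pairs. It is not the code of AgglomerativeClusterer: the class AgglomerativeSketch, its Merge type, and the choice of single linkage are all assumptions made for illustration. At most O(n^2) pairs are ever queued, and each queue operation costs O(log n).

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;
import java.util.PriorityQueue;

class AgglomerativeSketch {
    /** One merge step: clusters left and right are joined into a new cluster id. */
    static final class Merge {
        final int left, right, id;
        final double distance;

        Merge(int left, int right, int id, double distance) {
            this.left = left;
            this.right = right;
            this.id = id;
            this.distance = distance;
        }
    }

    /** Candidate pair of clusters waiting in the priority queue. */
    private static final class Pair {
        final int a, b;
        final double distance;

        Pair(int a, int b, double distance) {
            this.a = a;
            this.b = b;
            this.distance = distance;
        }
    }

    /**
     * Single-linkage clustering over a symmetric n-by-n distance matrix.
     * Items are clusters 0..n-1; merged clusters receive ids n, n+1, ...
     */
    static List<Merge> cluster(double[][] dist) {
        int n = dist.length;
        List<Merge> merges = new ArrayList<Merge>();
        if (n < 2) {
            return merges;
        }
        int total = 2 * n - 1;                  // n leaves plus n-1 merged clusters
        double[][] d = new double[total][total];
        boolean[] active = new boolean[total];
        PriorityQueue<Pair> queue = new PriorityQueue<Pair>(11, new Comparator<Pair>() {
            public int compare(Pair x, Pair y) {
                return Double.compare(x.distance, y.distance);
            }
        });
        for (int i = 0; i < n; i++) {
            active[i] = true;
            for (int j = i + 1; j < n; j++) {
                d[i][j] = d[j][i] = dist[i][j];
                queue.add(new Pair(i, j, dist[i][j]));
            }
        }
        int next = n;
        while (merges.size() < n - 1) {
            Pair p = queue.poll();
            if (!active[p.a] || !active[p.b]) {
                continue;                       // stale entry: one side was already merged
            }
            active[p.a] = false;
            active[p.b] = false;
            for (int c = 0; c < next; c++) {
                if (!active[c]) {
                    continue;
                }
                // Single linkage: distance to the new cluster is the smaller of the two.
                double dc = Math.min(d[p.a][c], d[p.b][c]);
                d[next][c] = d[c][next] = dc;
                queue.add(new Pair(next, c, dc));
            }
            active[next] = true;
            merges.add(new Merge(p.a, p.b, next, p.distance));
            next++;
        }
        return merges;
    }
}
```

The returned list of merges, read in order, describes a dendrogram from the leaves up to the root. Other linkage criteria would change only the line that computes the distance to the new cluster.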
The classes originally in
sg.yeefan.util have been moved to their
own project, common utilities, with a greatly
expanded set of utility classes. Note that there are a number of API changes
between the old and new sg.yeefan.util packages.
CombinationMetric and its subclasses have been improved slightly.
The changes are backwards compatible.
The whole package is now considered general release, except that the
sg.yeefan.recordmatching.clusterers package should be considered as
unstable in this release.
Release: 2009 April 7 (beta)
There is a major change to the API. Instead of having all classes reside in the
sg.yeefan.recordmatching package, this package is now split
into several subpackages that group classes by their functionality:
sg.yeefan.recordmatching.fields for record fields,
sg.yeefan.recordmatching.metrics for similarity and distance metrics,
sg.yeefan.recordmatching.blockers for blocking algorithms,
sg.yeefan.recordmatching.clusterers for clustering algorithms.
The interface of the individual classes remains unchanged.
A new subpackage
sg.yeefan.recordmatching.transformers is also
introduced in this release. Currently it holds the base classes that serve as
a foundation for transforming records and record fields.
Release: 2009 February 24 (alpha)
This is the initial release of the record matching package.