An Empirical Evaluation of Set Similarity Join Techniques
W. Mann, N. Augsten, P. Bouros. An Empirical Evaluation of Set Similarity Join Techniques.
In The Proceedings of the VLDB Endowment (PVLDB 2016). Link, local copy
Source Code, Version 0.6 enhanced version (support for Hamming distance*, fixes for newer compiler versions)
License: GPL-3
* Hamming distance support:
Full support for Hamming distance has only be implemented in the self-join version of AllPairs.
The problem that had to be addressed is the following: Assume two sets r={a,b} and s={c,d}. The hamming distance is 4. Given a Hamming distance threshold of 4, this pair should be in the output. However, inverted list based algorithms do not detect pairs that do not have any token in common. We solve this problem by a partial nested loop approach, that ensures that all sets pairs where |r|+|s|≤ t are treated as matching pairs. The runtime complexity of the additional code is linear to the number of pairs where |r|+|s|≤ t, but does not affect other similarity functions.