An Empirical Evaluation of Set Similarity Join Techniques

W. Mann, N. Augsten, P. Bouros. An Empirical Evaluation of Set Similarity Join Techniques. In The Proceedings of the VLDB Endowment (PVLDB 2016). Link, local copy


Set similarity joins compute all pairs of similar sets from two collections of sets. We conduct extensive experiments on seven state-of-the-art algorithms for set similarity joins. These algorithms adopt a filter-verification approach. Our analysis shows that verification has not received enough attention in previous works. In practice, efficient verification inspects only a small, constant number of set elements and is faster than some of the more sophisticated filter techniques. Although we can identify three winners, we find that most algorithms show very similar performance. The key technique is the prefix filter, and AllPairs, the first algorithm adopting this techniques is still a relevant competitor. We repeat experiments from previous work and discuss diverging results. All our claims are supported by a detailed analysis of the factors that determine the overall runtime.


Additional Information


Willi Mann (wmann AT