An Empirical Evaluation of Set Similarity Join Techniques - Compilation Instructions

W. Mann, N. Augsten, P. Bouros. An Empirical Evaluation of Set Similarity Join Techniques. In Proceedings of the VLDB Endowment (PVLDB 2016).

Pre-Requisites

Compiling

These instructions have been tested on Debian jessie (Debian 8). They should work on most UNIX-like systems.

Execution

The set_sim_join binary by default expects fully preprocessed input, i.e.:

If your input does not fulfill these conditions, you can request preprocessing with --whitespace (every consecutive sequence of non-whitespace characters is a token) or --qgram N (to build q-grams). Again, the input must contain one set per line.
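
To make the two tokenization modes concrete, here is a minimal Python sketch of what they produce for a single input line. It is only an illustration, not the tool's implementation: the function names are hypothetical, N is assumed to be the q-gram length, and the q-gram variant assumes plain overlapping substrings without padding. How set_sim_join treats duplicate tokens or lines shorter than N is not described here.

```python
def whitespace_tokens(line):
    """Tokens are the maximal runs of non-whitespace characters in the line."""
    return line.split()


def qgram_tokens(line, q):
    """All overlapping substrings of length q (assumption: no padding)."""
    text = line.rstrip("\n")
    # Yields an empty list when the line is shorter than q.
    return [text[i:i + q] for i in range(len(text) - q + 1)]


if __name__ == "__main__":
    print(whitespace_tokens("set  similarity\tjoin"))
    print(qgram_tokens("join", 2))
```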

Available algorithms

Algorithm     cmd-line algorithm string     Required Options
AllPairs      allpairs
PPJoin        ppjoin
PPJoinPlus    ppjoin                        --suffixfilter
MPJoin        ppjoin                        --mpjoin
PEL           ppjoin                        --mpjoin --pljoin
GroupJoin     groupjoin
AdaptJoin     adaptjoin
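
For scripted experiments, the algorithm list above can be captured in a small lookup table. The following Python sketch is a hypothetical helper, not part of the distribution. The argument order (general options, input file, algorithm string, required options, threshold) is an assumption inferred from the example execution.

```python
# Algorithm name -> (cmd-line algorithm string, required options), as listed above.
ALGORITHMS = {
    "AllPairs": ("allpairs", []),
    "PPJoin": ("ppjoin", []),
    "PPJoinPlus": ("ppjoin", ["--suffixfilter"]),
    "MPJoin": ("ppjoin", ["--mpjoin"]),
    "PEL": ("ppjoin", ["--mpjoin", "--pljoin"]),
    "GroupJoin": ("groupjoin", []),
    "AdaptJoin": ("adaptjoin", []),
}


def build_command(algorithm, input_file, threshold, general_options=()):
    """Assemble an argument list for set_sim_join (hypothetical helper).

    Assumed order: general options, input file, algorithm string,
    required options, threshold.
    """
    algorithm_string, required_options = ALGORITHMS[algorithm]
    return (["./set_sim_join", *general_options, input_file, algorithm_string]
            + required_options + [str(threshold)])


if __name__ == "__main__":
    print(" ".join(build_command("PEL", "flickr-dedup-raw.txt", 0.75,
                                 ["--statistics", "--timings"])))
```

The returned list can be passed directly to subprocess.run if the binary is available in the current directory.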

Example execution

./set_sim_join --statistics --timings flickr-dedup-raw.txt ppjoin --mpjoin --pljoin 0.75

This executes the PEL algorithm with Jaccard threshold 0.75 on the preprocessed input file flickr-dedup-raw.txt and outputs algorithm-specific statistics and timing information. Further options are documented in the command-line output:

./set_sim_join --help
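
As background for the threshold in the example execution, the Jaccard similarity of two sets is the size of their intersection divided by the size of their union. The Python sketch below computes it and runs a brute-force join over a few small sets. It is a reference illustration only and has nothing to do with how the algorithms above are implemented. It assumes that a pair qualifies when its similarity is at least the threshold, and the result convention for two empty sets is likewise an assumption.

```python
def jaccard(a, b):
    """Jaccard similarity |a & b| / |a | b| of two token collections."""
    a, b = set(a), set(b)
    if not a and not b:
        return 0.0  # assumption: two empty sets are treated as dissimilar
    return len(a & b) / len(a | b)


def naive_join(sets, threshold):
    """Brute-force self-join: index pairs (i, j), i < j, with similarity >= threshold."""
    return [(i, j)
            for i in range(len(sets))
            for j in range(i + 1, len(sets))
            if jaccard(sets[i], sets[j]) >= threshold]


if __name__ == "__main__":
    data = [{1, 2, 3, 4}, {1, 2, 3}, {5}]
    print(naive_join(data, 0.75))
```

The brute-force version compares every pair and is only practical for very small inputs, for example to spot-check results.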

Contact

Willi Mann (wmann AT cosy.sbg.ac.at)