W. Mann, N. Augsten, P. Bouros. An Empirical Evaluation of Set Similarity Join Techniques. In The Proceedings of the VLDB Endowment (PVLDB 2016)
tar xavf ssjoin-0.1.tar.xz
COMPILE_IN=bin
mkdir $COMPILE_IN
cd $COMPILE_IN
cmake -DCMAKE_BUILD_TYPE=Release ../ssjoin-0.1
make -j2
Note: the -j2 option lets make compile two files in parallel. You might adjust this number according to the number of cores and the available main memory on your system. The code may take some minutes to compile.
tar xavf ssjoin-0.1-sparsehash.tar.xz
export CXXFLAGS="-I../ssjoin-0.1"
cmake -DCMAKE_BUILD_TYPE=Release ../ssjoin-0.1
The set_sim_join binary by default expects fully preprocessed input, i.e.:
If your input does not fulfill these conditions, you can request preprocessing with --whitespace (every consecutive sequence of non-whitespace characters is a token) or --qgram N (to build q-grams of length N). Again, one set per line.
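The two tokenization modes described above can be illustrated with a minimal Python sketch. It is not the code used by set_sim_join: the function names whitespace_tokens and qgrams are hypothetical, and details such as padding or the handling of duplicate tokens are assumptions that may differ from the binary's behavior.

```python
def whitespace_tokens(line):
    """Split a line at whitespace: every maximal run of
    non-whitespace characters becomes one token."""
    return line.split()


def qgrams(line, q):
    """Return all overlapping substrings of length q.
    Assumption: no padding is added at the start or end of the line."""
    return [line[i:i + q] for i in range(len(line) - q + 1)]


if __name__ == "__main__":
    # Each input line corresponds to one set.
    print(whitespace_tokens("set  similarity\tjoin"))
    print(qgrams("join", 2))
```

A line shorter than q yields no q-grams in this sketch; how the binary treats such lines is not stated here.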
allpairs
ppjoin
ppjoin --suffixfilter
ppjoin --mpjoin
ppjoin --mpjoin --pljoin
groupjoin
adaptjoin
./set_sim_join --statistics --timings flickr-dedup-raw.txt ppjoin --mpjoin --pljoin 0.75
This executes the PEL algorithm with Jaccard threshold 0.75 on the preprocessed input file flickr-dedup-raw.txt and outputs algorithm-specific statistics and timing information. Further options are documented in the command-line output:
./set_sim_join --help
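To make the meaning of the Jaccard threshold in the example invocation concrete, the following Python sketch computes the Jaccard similarity of two sets and runs a brute-force self-join. It only illustrates what result a set similarity join is expected to produce. It does not use the filtering techniques of the algorithms listed above, and the names jaccard and naive_self_join as well as the convention for two empty sets are assumptions.

```python
from itertools import combinations


def jaccard(r, s):
    """Jaccard similarity |r ∩ s| / |r ∪ s| of two token collections."""
    r, s = set(r), set(s)
    if not r and not s:
        return 1.0  # assumption: two empty sets count as identical
    return len(r & s) / len(r | s)


def naive_self_join(sets, threshold):
    """Return all index pairs (i, j), i < j, whose Jaccard similarity
    is at least the threshold. Quadratic; for illustration only."""
    return [
        (i, j)
        for (i, a), (j, b) in combinations(enumerate(sets), 2)
        if jaccard(a, b) >= threshold
    ]


if __name__ == "__main__":
    data = [[1, 2, 3, 4], [1, 2, 3], [5, 6]]
    print(naive_self_join(data, 0.75))
```

With threshold 0.75, the sets {1, 2, 3, 4} and {1, 2, 3} qualify because they share 3 of 4 distinct tokens.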
Willi Mann (wmann AT cosy.sbg.ac.at)