W. Mann, N. Augsten, P. Bouros. An Empirical Evaluation of Set Similarity Join Techniques. In The Proceedings of the VLDB Endowment (PVLDB 2016)
tar xzvf aol-data.tar.gz
./aol-data.sh user-ct-test-collection-*.txt.gz > aol-data.txt

./createraw.py --bywhitespace aol-data.txt aol-data-white-raw.txt

./createraw.py --bywhitespace --dedup aol-data.txt aol-data-white-dedup-raw.txt

./createraw.py --bywhitespace --dedup aol-data.txt aol-data-white-dedup-raw.txt
./shuflength.py aol-data-white-dedup-raw.txt > aol-data-white-dedup-lenshuf-raw.txt

unzip KDDCup2000.zip
unzip BMS_ASSOC_DATA.zip
./orkut.py users BMS-POS.dat > bms-pos.txt
./createraw.py --bywhitespace bms-pos.txt bms-pos-raw.txt

wget dblp.dtd
xsltproc dblp_author_title.xsl dblp.xml > dblp.txt
./random_lines.py 100000 dblp.txt > dblp-ss100000.txt

./createraw.py --uppercase --alphanum --qgram 2 dblp-ss100000.txt dblp-ss100000-upper-2q-raw.txt
uniq dblp-ss100000-upper-2q-raw.txt > dblp-ss100000-upper-2q-dedup-raw.txt

./createraw.py --uppercase --alphanum --qgram 3 dblp-ss100000.txt dblp-ss100000-upper-3q-raw.txt
uniq dblp-ss100000-upper-3q-raw.txt > dblp-ss100000-upper-3q-dedup-raw.txt

./createraw.py --uppercase --alphanum --qgram 4 dblp-ss100000.txt dblp-ss100000-upper-4q-raw.txt
uniq dblp-ss100000-upper-4q-raw.txt > dblp-ss100000-upper-4q-dedup-raw.txt

./createraw.py --uppercase --alphanum --qgram 2 dblp-ss100000.txt dblp-ss100000-upper-2q-raw.txt
uniq dblp-ss100000-upper-2q-raw.txt > dblp-ss100000-upper-2q-dedup-raw.txt
./shuflength.py dblp-ss100000-upper-2q-dedup-raw.txt > dblp-ss100000-upper-2q-dedup-lenshuf-raw.txt

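For orientation, the `--uppercase --alphanum --qgram N` flags used for DBLP can be read as: uppercase the line, keep only alphanumeric characters, then split the result into overlapping q-grams. The sketch below illustrates that reading only. It is not taken from `createraw.py`; `qgrams` is a hypothetical helper, and details such as padding or whitespace handling in the real script are assumptions.

```python
def qgrams(text: str, q: int) -> list[str]:
    """Hypothetical helper: overlapping q-grams of a normalised string."""
    # Assumption: --uppercase and --alphanum normalise the line before tokenising,
    # so case differences and all non-alphanumeric characters (including spaces) vanish.
    cleaned = "".join(ch for ch in text.upper() if ch.isalnum())
    # In this sketch a string shorter than q simply yields no q-grams.
    return [cleaned[i:i + q] for i in range(len(cleaned) - q + 1)]


print(qgrams("Set join!", 2))
```

Under these assumptions a larger q gives fewer, more selective tokens per record, which is presumably why the 2-, 3- and 4-gram variants are prepared separately.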
tar xzvf enron.format\(data+query\).tar.gz

./createraw.py --bywhitespace enron.format enron-adaptjoin-paper-dedupitems-raw.txt

./createraw.py --dedup --dedupitems --bywhitespace enron.format \
    enron-adaptjoin-paper-dedup-dedupitems-raw.txt

./createraw.py --dedup --dedupitems --bywhitespace enron.format \
    enron-adaptjoin-paper-dedup-dedupitems-raw.txt
./shuflength.py enron-adaptjoin-paper-dedup-dedupitems-raw.txt > \
    enron-adaptjoin-paper-dedup-dedupitems-lenshuf-raw.txt

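The flag names `--dedupitems` and `--dedup` suggest two different kinds of deduplication: removing repeated tokens inside one record, and removing repeated records from the collection. The sketch below shows that distinction under this assumption. Both function names are hypothetical, and the exact comparison `createraw.py` uses (for example, whether token order matters) is not documented here.

```python
def dedup_items(record: list[str]) -> list[str]:
    """Hypothetical: drop repeated tokens within one record, keeping first occurrences."""
    seen = set()
    result = []
    for token in record:
        if token not in seen:
            seen.add(token)
            result.append(token)
    return result


def dedup_records(records: list[list[str]]) -> list[list[str]]:
    """Hypothetical: drop records whose token sequence was already seen."""
    seen = set()
    result = []
    for record in records:
        key = tuple(record)  # assumption: records are compared as exact token sequences
        if key not in seen:
            seen.add(key)
            result.append(record)
    return result


records = [["a", "b", "a"], ["a", "b"], ["c"]]
cleaned = dedup_records([dedup_items(r) for r in records])
print(cleaned)
```

In this example the first two records become identical only after item deduplication, so the order in which the two steps run affects the result.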
wget kosarak.dat
./createraw.py --bywhitespace --dedupitems kosarak.dat kosarak-raw.txt
uniq kosarak-raw.txt > kosarak-dedup-raw.txt
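Note that plain `uniq` collapses only runs of identical adjacent lines, so it removes every duplicate only when equal lines end up next to each other. Whether the `createraw.py` output guarantees that ordering is not stated here. The sketch below mimics the behaviour on a toy input; the `uniq` function is an illustrative stand-in for the command, not part of the provided scripts.

```python
from itertools import groupby


def uniq(lines: list[str]) -> list[str]:
    """Mimic plain `uniq`: keep one line per run of identical adjacent lines."""
    return [key for key, _ in groupby(lines)]


lines = ["1 2", "1 2", "3", "1 2"]
adjacent_only = uniq(lines)           # the trailing "1 2" survives: it is not adjacent
after_sorting = uniq(sorted(lines))   # sorting first makes all duplicates adjacent
print(adjacent_only)
print(after_sorting)
```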
./orkut.py users <(zcat livejournal-groupmemberships.txt.gz) > livejournal-userswithgroups.txt
./createraw.py --bywhitespace livejournal-userswithgroups.txt livejournal-userswithgroups-raw.txt
uniq livejournal-userswithgroups-raw.txt > livejournal-userswithgroups-dedup-raw.txt

zcat orkut-groupmemberships.txt.gz > tt.txt
./orkut.py users tt.txt > orkut.out
./createraw.py --bywhitespace orkut.out orkut-userswithgroups-raw.txt
uniq orkut-userswithgroups-raw.txt > orkut-userswithgroups-dedup-raw.txt

./spotify.py spotify.csv > spotify-track.txt
./createraw.py --bywhitespace spotify-track.txt spotify-track-raw.txt
uniq spotify-track-raw.txt > spotify-track-dedup-raw.txt

./zipf_dist.py 100000 50 100000 1 > zipf-s100k-ss50-u100k-a1.txt
./createraw.py --dedup --bywhitespace zipf-s100k-ss50-u100k-a1.txt zipf-s100k-ss50-u100k-a1-dedup-raw.txt

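Judging from the output file name (`s100k`, `ss50`, `u100k`, `a1`), the four arguments to `zipf_dist.py` appear to be the number of sets, the set size, the universe size, and the Zipf exponent. That reading is an inference, not something the instructions state. A minimal sketch of such a generator, with a hypothetical `zipf_sets` function and a sampling scheme that is assumed rather than taken from the script, could look like this:

```python
import random
from itertools import accumulate


def zipf_sets(num_sets: int, set_size: int, universe: int, alpha: float, seed: int = 0):
    """Hypothetical generator: token of rank r is drawn with weight 1 / r**alpha."""
    rng = random.Random(seed)
    tokens = range(1, universe + 1)
    # Precompute cumulative weights once so each draw is a fast binary search.
    cum_weights = list(accumulate(1.0 / rank ** alpha for rank in tokens))
    sets = []
    for _ in range(num_sets):
        drawn = rng.choices(tokens, cum_weights=cum_weights, k=set_size)
        # Assumption: draws are with replacement, so after removing repeats
        # a set may contain fewer than set_size distinct tokens.
        sets.append(sorted(set(drawn)))
    return sets


sample = zipf_sets(num_sets=5, set_size=10, universe=100, alpha=1.0)
for tokens in sample:
    print(" ".join(map(str, tokens)))
```

Each output line is a whitespace-separated token list, which matches the `--bywhitespace` input that `createraw.py` is given in the command above.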
./uniform_dist.py 100000 10 50 > uniform-s100000-ss10-u50.txt
./createraw.py --dedup --bywhitespace uniform-s100000-ss10-u50.txt uniform-s100000-ss10-u50-dedup-raw.txt

* Files won't match, but the statistics are the same or similar.
These preprocessing instructions were documented and verified by Anna Steger. Her work was supported by FFG Grant 10894560 (FFG Talente Praktika).
Willi Mann (wmann AT cosy.sbg.ac.at)