W. Mann, N. Augsten, P. Bouros. An Empirical Evaluation of Set Similarity Join Techniques. In Proceedings of the VLDB Endowment (PVLDB), 2016.
tar xzvf aol-data.tar.gz
./aol-data.sh user-ct-test-collection-*.txt.gz > aol-data.txt
./createraw.py --bywhitespace aol-data.txt aol-data-white-raw.txt
./createraw.py --bywhitespace --dedup aol-data.txt aol-data-white-dedup-raw.txt
./createraw.py --bywhitespace --dedup aol-data.txt aol-data-white-dedup-raw.txt
./shuflength.py aol-data-white-dedup-raw.txt > aol-data-white-dedup-lenshuf-raw.txt
unzip KDDCup2000.zip
unzip BMS_ASSOC_DATA.zip
./orkut.py users BMS-POS.dat > bms-pos.txt
./createraw.py --bywhitespace bms-pos.txt bms-pos-raw.txt
wget dblp.dtd
xsltproc dblp_author_title.xsl dblp.xml > dblp.txt
./random_lines.py 100000 dblp.txt > dblp-ss100000.txt
./createraw.py --uppercase --alphanum --qgram 2 dblp-ss100000.txt dblp-ss100000-upper-2q-raw.txt
uniq dblp-ss100000-upper-2q-raw.txt > dblp-ss100000-upper-2q-dedup-raw.txt
./createraw.py --uppercase --alphanum --qgram 3 dblp-ss100000.txt dblp-ss100000-upper-3q-raw.txt
uniq dblp-ss100000-upper-3q-raw.txt > dblp-ss100000-upper-3q-dedup-raw.txt
./createraw.py --uppercase --alphanum --qgram 4 dblp-ss100000.txt dblp-ss100000-upper-4q-raw.txt
uniq dblp-ss100000-upper-4q-raw.txt > dblp-ss100000-upper-4q-dedup-raw.txt
./createraw.py --uppercase --alphanum --qgram 2 dblp-ss100000.txt dblp-ss100000-upper-2q-raw.txt
uniq dblp-ss100000-upper-2q-raw.txt > dblp-ss100000-upper-2q-dedup-raw.txt
./shuflength.py dblp-ss100000-upper-2q-dedup-raw.txt > dblp-ss100000-upper-2q-dedup-lenshuf-raw.txt
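For readers unfamiliar with q-gram tokenization, the following minimal Python sketch shows what the `--uppercase --alphanum --qgram 2` options plausibly amount to. It is not the code of `createraw.py`: the function name `qgrams` is hypothetical, and the handling of non-alphanumeric characters (simply dropped here) is an assumption made for illustration.

```python
def qgrams(text, q=2):
    """Return the overlapping q-grams of text after normalization (illustrative only)."""
    # Assumption: uppercase first, then drop every non-alphanumeric character.
    normalized = "".join(ch for ch in text.upper() if ch.isalnum())
    # Slide a window of width q over the normalized string.
    return [normalized[i:i + q] for i in range(len(normalized) - q + 1)]


print(qgrams("Set join", 2))  # ['SE', 'ET', 'TJ', 'JO', 'OI', 'IN']
```

Under these assumptions, a larger `q` yields fewer but more selective tokens per record, which is presumably why the 2-, 3-, and 4-gram variants are generated separately.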
tar xzvf enron.format\(data+query\).tar.gz
./createraw.py --dedupitems --bywhitespace enron.format enron-adaptjoin-paper-dedupitems-raw.txt
./createraw.py --dedup --dedupitems --bywhitespace enron.format \
enron-adaptjoin-paper-dedup-dedupitems-raw.txt
./createraw.py --dedup --dedupitems --bywhitespace enron.format \
enron-adaptjoin-paper-dedup-dedupitems-raw.txt
./shuflength.py enron-adaptjoin-paper-dedup-dedupitems-raw.txt > \
enron-adaptjoin-paper-dedup-dedupitems-lenshuf-raw.txt
wget kosarak.dat
./createraw.py --bywhitespace --dedupitems kosarak.dat kosarak-raw.txt
uniq kosarak-raw.txt > kosarak-dedup-raw.txt
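Note that `uniq` removes only adjacent duplicate lines, so the dedup steps above remove all duplicate sets only if identical lines sit next to each other in the raw file. Whether `createraw.py` guarantees such an order is not stated here and is an assumption. A minimal Python sketch of the adjacent-only behaviour (the helper name `adjacent_dedup` is hypothetical):

```python
from itertools import groupby


def adjacent_dedup(lines):
    """Collapse each run of equal consecutive lines into one line, as uniq does."""
    # groupby starts a new group whenever the value changes, so a line that
    # reappears later, after a different line, is kept a second time.
    return [key for key, _ in groupby(lines)]


print(adjacent_dedup(["1 2 3", "1 2 3", "4 5", "1 2 3"]))  # ['1 2 3', '4 5', '1 2 3']
```

If the raw file were not ordered this way, running `sort -u` instead of `uniq` would remove all duplicates, at the cost of changing the line order.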
./orkut.py users <(zcat livejournal-groupmemberships.txt.gz) > livejournal-userswithgroups.txt
./createraw.py --bywhitespace livejournal-userswithgroups.txt livejournal-userswithgroups-raw.txt
uniq livejournal-userswithgroups-raw.txt > livejournal-userswithgroups-dedup-raw.txt
zcat orkut-groupmemberships.txt.gz > tt.txt
./orkut.py users tt.txt > orkut.out
./createraw.py --bywhitespace orkut.out orkut-userswithgroups-raw.txt
uniq orkut-userswithgroups-raw.txt > orkut-userswithgroups-dedup-raw.txt
./spotify.py spotify.csv > spotify-track.txt
./createraw.py --bywhitespace spotify-track.txt spotify-track-raw.txt
uniq spotify-track-raw.txt > spotify-track-dedup-raw.txt
./zipf_dist.py 100000 50 100000 1 > zipf-s100k-ss50-u100k-a1.txt
./createraw.py --dedup --bywhitespace zipf-s100k-ss50-u100k-a1.txt zipf-s100k-ss50-u100k-a1-dedup-raw.txt
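Judging only by the output file name (`s100k`, `ss50`, `u100k`, `a1`), the four arguments of `zipf_dist.py` appear to be the number of sets, the set size, the universe size, and the Zipf exponent. The sketch below illustrates a generator of that kind under exactly those assumptions. It is not the code of `zipf_dist.py`: the function name `zipf_sets`, the fixed set size, and the `seed` parameter are all hypothetical.

```python
import random


def zipf_sets(num_sets, set_size, universe, alpha, seed=0):
    """Draw num_sets sets of distinct tokens with Zipf-distributed token frequencies."""
    if set_size > universe:
        raise ValueError("set_size must not exceed the universe size")
    rng = random.Random(seed)
    tokens = range(universe)
    # Zipf weights: the token of rank r is drawn with probability proportional to 1 / r**alpha.
    weights = [1.0 / (rank ** alpha) for rank in range(1, universe + 1)]
    result = []
    for _ in range(num_sets):
        current = set()
        # Redraw until the set holds set_size distinct tokens.
        while len(current) < set_size:
            missing = set_size - len(current)
            current.update(rng.choices(tokens, weights=weights, k=missing))
        result.append(sorted(current))
    return result


# Small illustrative call; one set per output line, tokens separated by spaces.
for tokens in zipf_sets(num_sets=3, set_size=4, universe=20, alpha=1.0):
    print(" ".join(map(str, tokens)))
```

Because the tokens are random, the files produced this way differ between generators and seeds even when their statistics agree.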
./uniform_dist.py 100000 10 50 > uniform-s100000-ss10-u50.txt
./createraw.py --dedup --bywhitespace uniform-s100000-ss10-u50.txt uniform-s100000-ss10-u50-dedup-raw.txt
* Files won't match, but the statistics are the same or similar.
These preprocessing instructions were documented and verified by Anna Steger. Her work was supported by FFG Grant 10894560 (FFG Talente Praktika).
Willi Mann (wmann AT cosy.sbg.ac.at)