Finding the Most Popular SciHub Articles with Approximate Heavy Hitters

I’ve added heavy hitters functionality to the dsrs crate (in addition to a variant of Count-Min). It’s another streaming algorithm which helps us find the most popular repeated lines in a stream. In this blog post, we’ll see how this approximate algorithm saves memory over an exact approach.

For instance, maybe we have access logs which contain IP addresses like so:

1.1.1
1.2.3
1.1.1
2.1.2
1.1.1

where there could be millions of unique IP addresses accessing our server, but we’d only be interested in monitoring the ones like 1.1.1.1 that access it most often to check for possible malicious behavior such as a DoS attack. In principle, we could track every single unique IP address and how often it appears in the log, but this’d require as much memory as there are unique IPs. If we’re only interested in the top-\(k \) IPs by frequency, could we do better?

Indeed, if we’re willing to give approximate answers! Sketching approaches have nuanced guarantees, but generally work well in practice. The dsrs library provides an API for the heavy hitters sketch, which accepts a textual stream and returns the approximate top-\(k \) most popular items in that stream.

Tim Bray has a tuned Go package which I’ve installed as tf below which answers the exact top-\(k \) query. Over several blog posts, Tim’s package has evolved to be mostly I/O bound. So it’ll be tough competition for the approximate approach.

In the experiment below, we seek to answer, what are the 10 most popular of 28 million downloaded SciHub articles from September 2015 to February 2016? At 2.6 GB we’ll see which approach best answers this question on my laptop!

%%bash

cd /tmp
test -f scihub.zip || curl -s -o scihub.zip -L https://datadryad.org/stash/downloads/file_stream/1483
du -hs scihub.zip
unzip -qf scihub.zip
test -d topfew && test -f topfew/bin/tf || ( \
  git clone git@github.com:timbray/topfew.git 2>/dev/null && \
  cd topfew && make 2>&1 >/dev/null)

echo 'will cite' | parallel --citation 1> /dev/null 2> /dev/null 

du -hsc scihub_data/*.tab | tail -1

parallel --pipepart wc -l :::: scihub_data/*.tab \
  | awk '{s+=$1}END{print s " downloads"}'

653M	scihub.zip
2.6G	total
27819965 downloads

# the true exact top-10 most downloaded articles via tbray's topfew
! cat /tmp/scihub_data/*.tab | /usr/bin/time -f "%e sec %M KB" /tmp/topfew/bin/tf -f 3 -n 10

10.1007/978-1-4419-9716-6_11
10.1056/NEJMoa1402121
10.1116/1.4904970
10.1103/PhysRevB.63.224204
10.1182/asheducation-2015.1.8
10.4028/www.scientific.net/AMM.7-8.159
10.1111/j.1365-277X.2004.00520.x
10.1002/pmic.200600525
10.1161/CIRCRESAHA.117.306290
10.1002/smll.201002009
83 sec 2128580 KB

# approximate top-10 (along with very weak upper bounds of counts)
! cat /tmp/scihub_data/*.tab | cut -d$'\t' -f2 | /usr/bin/time -f "%e sec %M KB" dsrs --hh 10

1112828 10.1002/ppsc.201300314
1112828 10.1016/j.physio.2015.03.3636
1112828 10.1177/014920638701300408
1112828 10.1053/j.gastro.2015.08.004
1112828 10.1002/jbm.a.31063
1112828 10.1645/0022-3395(2000)086[1137:EAISMS]2.0.CO;2
1112828 10.1016/j.biortech.2014.11.112
1112828 10.1016/j.reval.2014.02.154
1112828 10.1016/j.tet.2015.07.005
1112828 10.2174/1568026023394443
11.49 sec 4716 KB

%%bash
# hoping that a sketch with only ~10 slots of space can recover the exact top 10 is wishful thinking
# but it really doesn't take that much to get to the top-10. Asking for an *approximate* top-4100
# gets us to the *exact* top-10
cd /tmp
M=4100
cat scihub_data/*.tab | cut -d$'\t' -f2 \| /usr/bin/time -f "%e sec %M KB" dsrs --hh $M > hh-lots
cat scihub_data/*.tab | topfew/bin/tf -f 3 -n 10 > exact

# right outer join minus inner join should be empty if the second argument is a subset
join -v2 <(cut -d" " -f2 hh-lots | sort) <(cut -d" " -f2 exact | sort)

11.04 sec 5868 KB

In the logs above, we observe the total runtime and memory use in KB for a tuned Go implementation based on a hashmap versus two approximate competitors: approximate top-\(k \) and top-\(M \), where \(M \) was found via binary search as roughly the smallest constant for which all of the true top-\(k \) articles appear.

We notice a couple of things

The estimates from the sketch can’t be trusted (nor do they ever purport to be that trustworthy). However, a low-memory second pass could be used to recover exact counts for just the heavy hitters selected by the sketch.
The approximate approach significantly improves on both runtime and memory usage. Even with the larger \(M=4100 \) necessary to recover the true top-\(k \) at \(k=10 \), the approximation was about \(2\times \) faster and used \(362\times \) less memory!

I hope this motivates you to try out dsrs next time you have a lot of logfiles to churn through but don’t want to reach for a heavyweight distributed computing solution.

Try the notebook out yourself.

Just in case you were curious for the actual names:

10.1007/978-1-4419-9716-6_11 Full-scale modal wind turbine tests: comparing shaker excitation with wind excitation. Conference Proceedings of the Society for Experimental Mechanics Series, 113–124
10.1056/NEJMoa1402121 
10.1116/1.4904970 Photosensitive field emission study of SnS2 nanosheets. Journal of Vacuum Science & Technology B, Nanotechnology and Microelectronics: Materials, Processing, Measurement, and Phenomena, 33(3), 03C106
10.1103/PhysRevB.63.224204 Griffiths effects and quantum critical points in dirty superconductors without spin-rotation invariance: One-dimensional examples. Physical Review B, 63(22)
10.1182/asheducation-2015.1.8 Iron deficiency: new insights into diagnosis and treatment. Hematology, 2015(1), 8–13
10.4028/www.scientific.net/AMM.7-8.159 Monitoring the Evolution of Fatigue in Corrugated Paperboard under Random Loads. Applied Mechanics and Materials, 7-8, 159–164
10.1111/j.1365-277X.2004.00520.x Intentional mis-reporting of food consumption and its relationship with body mass index and psychological scores in women. Journal of Human Nutrition and Dietetics, 17(3), 209–218
10.1002/pmic.200600525 Conifer defense against insects: Proteome analysis of Sitka spruce (Picea sitchensis) bark induced by mechanical wounding or feeding by white pine weevils (Pissodes strobi). PROTEOMICS, 7(2), 248–270
10.1161/CIRCRESAHA.117.306290 Efficient Gene Disruption in Cultured Primary Human Endothelial Cells by CRISPR/Cas9Novelty and Significance. Circulation Research, 117(2), 121–128
10.1002/smll.201002009 Graphene-Based Materials: Synthesis, Characterization, Properties, and Applications. Small, 7(14), 1876–1902

Vlad Feinberg

Finding the Most Popular SciHub Articles with Approximate Heavy Hitters