Evaluating Memento Service Optimizations

1. Evaluating Memento Service Optimizations @mart1nkle1n JCDL 2019, 06/04/2019, Urbana-Champaign, IL, USA Evaluating Service Optimizations Martin Klein Lyudmila Balakireva Harihar Shankar Research Library Los Alamos National Laboratory https://arxiv.org/abs/1906.00058

2. Memento Time Travel http://timetravel.mementoweb.org/

3. Memento Time Travel http://timetravel.mementoweb.org/list/20160214085934/http://jcdl.org

4. Memento Time Travel https://arquivo.pt/wayback/20160515103313/http://www.jcdl.org/
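The URLs on slides 3 and 4 both carry a 14-digit timestamp in their path, followed by the original URI. A minimal sketch of pulling the two apart, assuming the timestamp follows the YYYYMMDDhhmmss layout and is expressed in UTC; the helper name `extract_datetime` is hypothetical and not part of any Memento tooling:

```python
import re
from datetime import datetime, timezone

# Assumption: a 14-digit timestamp (YYYYMMDDhhmmss) is followed by "/" and the original URI.
_TS_PATTERN = re.compile(r"/(\d{14})/(.+)$")


def extract_datetime(archive_url):
    """Return (datetime, original_uri) parsed from an archive-style URL, or None."""
    match = _TS_PATTERN.search(archive_url)
    if match is None:
        return None
    stamp = datetime.strptime(match.group(1), "%Y%m%d%H%M%S")
    # Assumption: the timestamp is in UTC.
    return stamp.replace(tzinfo=timezone.utc), match.group(2)


when, original = extract_datetime(
    "https://arquivo.pt/wayback/20160515103313/http://www.jcdl.org/"
)
```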

5. How does this work? Memento Aggregator (simplistic view)

6. Memento Aggregator

7. Memento Aggregator

8. LANL Memento Aggregator - Problem
• As the number of archives grows, sending requests to each archive for every incoming request is not feasible:
  • Response times
  • Memento infrastructure load
  • Load on distributed archives

9. What if…
• We could predict whether or not to issue a request to a specific archive?
  • By merely looking at the requested URI-R
  • A binary classifier per archive
• We could train the classifiers using cached data?
• That would be pretty neat, indeed:
  • Retrain classifiers as web archive collections evolve
  • Not dependent on external data
  • Querying classifiers probably way faster (msec) than polling archives (sec)
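The idea of one binary classifier per archive can be sketched as a routing step: ask every classifier about the URI-R and poll only the archives that predict a holding. This is an illustration, not the aggregator's actual code. The function `select_archives`, the archive names, the 0.5 threshold, and the toy lambda "models" are all hypothetical stand-ins.

```python
def select_archives(uri_r, classifiers, threshold=0.5):
    """Return the archives whose classifier predicts a holding for uri_r.

    classifiers maps an archive name to a callable that returns a score
    in [0, 1] for a URI-R (hypothetical interface).
    """
    selected = []
    for archive, predict in classifiers.items():
        if predict(uri_r) >= threshold:
            selected.append(archive)
    return selected


# Toy stand-ins for trained models; a real deployment would load one
# trained classifier per archive instead.
classifiers = {
    "archive_a": lambda uri: 0.9 if ".pt/" in uri else 0.1,
    "archive_b": lambda uri: 0.8,
}

to_poll = select_archives("http://example.pt/page", classifiers)
```

Only the archives in `to_poll` would then receive a request, which is where the savings in requests and response time would come from.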

10. We can! Published @ JCDL 2016 https://doi.org/10.1145/2910896.2910899
• ML models based on simple URI features
  • Character count, n-grams, domain
• Common ML algorithms used per archive
  • Logistic Regression, Multinomial Bayes, SVM
• Optimized for
  • Prediction time, not training time
  • Reduction of false positive rate
Results:
• Requests per URI-R: 3.96 vs 11
• Response time: 2.16s vs 3.08s
• Recall: 0.847
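The three feature families named on this slide (character count, n-grams, domain) can be derived from the URI-R string alone. A minimal sketch using only the Python standard library; the function name `uri_features`, the choice of character trigrams, and the stripping of a leading "www." are assumptions made for illustration, not details taken from the JCDL 2016 paper:

```python
from collections import Counter
from urllib.parse import urlsplit


def uri_features(uri_r, n=3):
    """Extract simple features from a URI-R string.

    Assumptions: n-grams are character n-grams over the lowercased URI,
    and a leading "www." is dropped from the domain.
    """
    domain = (urlsplit(uri_r).hostname or "").lower()
    if domain.startswith("www."):
        domain = domain[4:]
    text = uri_r.lower()
    ngrams = Counter(text[i:i + n] for i in range(len(text) - n + 1))
    return {"char_count": len(uri_r), "domain": domain, "ngrams": ngrams}


features = uri_features("http://www.jcdl.org/")
```

A dictionary like this would still need to be vectorized before being fed to a classifier such as logistic regression.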

12. In Production…

19. Questions to Ask
• How effective is the cache?
  • What is the hit/miss ratio? Does it vary for different Memento services?
  • Is the cache freshness period appropriate?
• How effective is the ML process?
  • What are the recall and the false positive rate?
  • Do we need to retrain the models? How often?
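The first question, whether the hit/miss ratio varies by Memento service, amounts to grouping log records by service and counting hits. A minimal sketch, assuming each log record has already been reduced to a (service, was_hit) pair; the record layout, the function name, and the service labels are hypothetical:

```python
from collections import defaultdict


def hit_ratio_by_service(records):
    """Compute the cache hit ratio per service.

    records: iterable of (service_name, was_hit) pairs (assumed layout).
    """
    hits = defaultdict(int)
    totals = defaultdict(int)
    for service, was_hit in records:
        totals[service] += 1
        if was_hit:
            hits[service] += 1
    return {service: hits[service] / totals[service] for service in totals}


# Made-up sample records, for illustration only.
sample = [
    ("service_x", True),
    ("service_x", True),
    ("service_x", False),
    ("service_y", False),
]
ratios = hit_ratio_by_service(sample)
```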

20. Evaluation
• Memento Aggregator currently covers
  • 23 web archives
  • 17 with native Memento support
  • 6 with by-proxy Memento support
• Analysis of log files
  • Recorded between July 4th 2017 and October 17th 2018
  • More than 11 million requests in total
  • Approx. 2.6 million requests against the machine learning process
  • Results in 2.6 million lookups to populate the cache
  • Used as “truth” to assess ML prediction
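Using the cache-populating lookups as "truth", recall and false positive rate follow from comparing, for each archive and URI-R, what the classifier predicted with what the archive actually returned. A minimal sketch with hypothetical names and made-up sample data:

```python
def recall_and_fpr(predicted, actual):
    """Return (recall, false_positive_rate) for paired boolean sequences.

    predicted[i]: the classifier said the archive holds a memento.
    actual[i]:    the lookup found a memento (treated as ground truth).
    A rate is None when its denominator is zero.
    """
    tp = fp = tn = fn = 0
    for p, a in zip(predicted, actual):
        if p and a:
            tp += 1
        elif p and not a:
            fp += 1
        elif not p and a:
            fn += 1
        else:
            tn += 1
    recall = tp / (tp + fn) if (tp + fn) else None
    fpr = fp / (fp + tn) if (fp + tn) else None
    return recall, fpr


# Made-up sample: 2 true positives, 1 false positive, 1 false negative, 1 true negative.
recall, fpr = recall_and_fpr(
    [True, True, False, False, True],
    [True, False, True, False, True],
)
```

A missed holding (false negative) lowers recall, while a false positive only costs an unnecessary request to an archive.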

21. Cache Hit/Miss Rate

22. Cache Hit/Miss Rate (figure labels: mostly driven by humans; mostly driven by machines)

23. Recall: 0.847 (JCDL 2016 result) vs 0.727 (this evaluation)

24. False Positives

25. Dynamic Web Archives

26. Takeaways
• Memento Aggregator cache is very effective
  • ~ 60% of requests served from cache
  • Human-driven services benefit the most
• Machine learning process saves requests and time at an acceptable recall level
  • Recall: 0.727
  • Re-training seems necessary, frequency TBD
Optimization
• ML model trained on archival holdings, not usage logs/cache
  • Beneficial for new archives
• Neural network classifier, based on simple URI features, shows promising results
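To put "saves requests and time" into numbers, the figures reported on slide 10 (3.96 vs 11 requests per URI-R, 2.16s vs 3.08s response time) can be turned into relative reductions. This is a worked calculation on the JCDL 2016 results quoted in the deck, not on the production logs; the helper name is hypothetical:

```python
def relative_reduction(baseline, optimized):
    """Fraction by which the optimized value undercuts the baseline."""
    return (baseline - optimized) / baseline


# Figures from slide 10 (JCDL 2016 results).
request_saving = relative_reduction(11, 3.96)   # requests per URI-R
time_saving = relative_reduction(3.08, 2.16)    # response time in seconds

print(f"requests: {request_saving:.0%}, response time: {time_saving:.0%}")
```

Under these figures, roughly two thirds of the archive requests and a little under a third of the response time are saved.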

27. Evaluating Service Optimizations Martin Klein Lyudmila Balakireva Harihar Shankar Research Library Los Alamos National Laboratory https://arxiv.org/abs/1906.00058