
Smart Routing of Memento Requests

Presentation at IIPC WAC 2018

Published in: Internet

  1. Smart Routing of Memento Requests, @mart1nkle1n, IIPC WAC 2018, 11/15/2018, Wellington, NZ
     Smart Routing of Requests
     Martin Klein (1), Lyudmila Balakireva (1), Harihar Shankar (1), James Powell (1), Herbert Van de Sompel (2)
     (1) Research Library, Los Alamos National Laboratory
     (2) Data Archiving and Networked Services, The Netherlands
  2. Memento: http://timetravel.mementoweb.org/
  3. Memento: http://timetravel.mementoweb.org/list/20140809200708/https://www.wellingtonnz.com/
  4. Memento: https://arquivo.pt/wayback/20141207132322/http://www.wellingtonnz.com/
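
The URLs on slides 3 and 4 both embed a 14-digit timestamp (YYYYMMDDhhmmss) followed by the original URI. A minimal Python sketch, assuming that layout holds for the URL at hand, splits such a URL into its parts. The function name `split_memento_url` is hypothetical and not part of any Memento tooling.

```python
import re
from datetime import datetime

# Matches ".../<14-digit timestamp>/<original URI>", the layout seen in the slide URLs.
_PATTERN = re.compile(r"^(?P<prefix>.*?/)(?P<ts>\d{14})/(?P<uri_r>https?://.+)$")

def split_memento_url(url):
    """Return (prefix, datetime, original URI), or None if the layout differs."""
    m = _PATTERN.match(url)
    if m is None:
        return None
    ts = datetime.strptime(m.group("ts"), "%Y%m%d%H%M%S")
    return m.group("prefix"), ts, m.group("uri_r")

parts = split_memento_url(
    "https://arquivo.pt/wayback/20141207132322/http://www.wellingtonnz.com/"
)
```

Other archives may use different URL layouts, in which case the function returns `None` rather than guessing.
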
  5. How does this work? Memento Aggregator (very simplistic view)
  6. Memento Aggregator
  7. Memento Aggregator
  8. LANL Memento Aggregator - Problem
     • As the number of archives grows, sending requests to each archive for every incoming request is not feasible
       • Response times
       • Memento infrastructure load
       • Load on distributed archives
  9. What if…
     • We could predict, by merely looking at a URI-R, whether or not to issue a request to a specific archive?
       • A binary classifier per archive
     • We could train the classifiers using cached data?
     • That would be pretty neat, indeed:
       • Retrain classifiers as web archive collections evolve
       • Not dependent on external data
       • Querying classifiers probably way faster (msec) than polling archives (sec)
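
The "binary classifier per archive" idea on slide 9 can be sketched as a routing step that asks each archive's classifier whether a request is worth sending. The Python below is only an illustration: the archive names, the example URI-R, and the hand-written rules are hypothetical stand-ins for trained models.

```python
from typing import Callable, Dict, List

# A classifier maps a URI-R to True ("query this archive") or False ("skip it").
Classifier = Callable[[str], bool]

def route(uri_r: str, classifiers: Dict[str, Classifier]) -> List[str]:
    """Return the archives whose classifier predicts a hit for uri_r."""
    return [name for name, predict in classifiers.items() if predict(uri_r)]

# Hypothetical stand-ins for trained per-archive models.
example_classifiers: Dict[str, Classifier] = {
    "archive_a": lambda uri: True,            # broad archive: always ask
    "archive_pt": lambda uri: ".pt/" in uri,  # toy rule, not a real model
    "archive_nz": lambda uri: ".nz/" in uri,  # toy rule, not a real model
}

selected = route("https://www.example.nz/page", example_classifiers)
# selected == ["archive_a", "archive_nz"]; "archive_pt" is never contacted.
```

Only the archives in `selected` would then be polled, which is where the savings in requests and response time come from.
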
  10. We can! Published @ JCDL 2016, https://doi.org/10.1145/2910896.2910899
     • ML models based on simple URI features
       • Character count, n-grams, domain
     • Common ML algorithms used per archive
       • Logistic Regression, Multinomial Bayes, SVM
     • Optimized for
       • Prediction time, not training time
       • Reduction of false positive rate
     Results:
     • Requests per URI-R: 3.96 vs 11
     • Response time: 2.16s vs 3.08s
     • Recall: 84.7%
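
Slide 10 names three kinds of URI features: character count, n-grams, and domain. A minimal Python sketch of such a feature extractor follows. It assumes character trigrams over the whole URI-R, which the slide does not specify, and `uri_features` is a hypothetical name rather than the authors' code.

```python
from collections import Counter
from urllib.parse import urlsplit

def uri_features(uri_r, n=3):
    """Extract character count, character n-gram counts, and domain from a URI-R."""
    domain = urlsplit(uri_r).hostname or ""
    # Sliding window of n characters over the raw URI string (assumed n-gram type).
    ngrams = Counter(uri_r[i:i + n] for i in range(len(uri_r) - n + 1))
    return {
        "char_count": len(uri_r),
        "domain": domain,
        "ngrams": ngrams,
    }

features = uri_features("https://www.wellingtonnz.com/")
```

A dictionary like this would still need to be turned into a numeric vector before being passed to a classifier such as logistic regression.
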
  12. In Production…
  19. Populating the Cache
  20. Questions to Ask
     • How effective is the cache?
       • What is the hit/miss ratio? Does it vary for different Memento services?
       • Is the cache freshness period appropriate?
     • How effective is the ML process?
       • What is the false negative and false positive rate?
       • Do we need to retrain the models? How often?
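
The cache questions on slide 20 (hit/miss ratio, freshness period) can be made concrete with a small sketch. The Python class below is a hypothetical in-memory cache, not the aggregator's implementation: entries older than the freshness period count as misses, and the hit ratio is tracked as requests come in.

```python
import time

class FreshnessCache:
    """Tiny in-memory cache with a freshness period and hit/miss counters."""

    def __init__(self, freshness_seconds, clock=time.time):
        self.freshness_seconds = freshness_seconds
        self.clock = clock      # injectable so the behaviour can be tested
        self._store = {}        # uri_r -> (stored_at, value)
        self.hits = 0
        self.misses = 0

    def get(self, uri_r):
        entry = self._store.get(uri_r)
        if entry is not None and self.clock() - entry[0] <= self.freshness_seconds:
            self.hits += 1
            return entry[1]
        self.misses += 1        # absent or stale entry
        return None

    def put(self, uri_r, value):
        self._store[uri_r] = (self.clock(), value)

    def hit_ratio(self):
        total = self.hits + self.misses
        return self.hits / total if total else 0.0
```

Keeping one such counter pair per Memento service would show whether the ratio varies between services, and rerunning a log with different `freshness_seconds` values gives a rough feel for the freshness trade-off.
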
  21. Evaluation
     • Memento Aggregator currently covers
       • 23 web archives
       • 17 with native Memento support
       • 6 with by-proxy Memento support
     • Analysis of log files
       • Recorded between July 4th 2017 and October 17th 2018
       • > 11m requests in total
       • Approx. 2.6m requests against machine learning process
       • Results in 2.6m lookups to populate cache
       • Used as “truth” to assess ML prediction
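
Slide 21 describes using the cache-populating lookups as "truth" to assess the ML prediction. A minimal Python sketch of that comparison for a single URI-R follows. It assumes the standard definitions of recall, false positive rate, and false negative rate; the talk does not spell out its exact formulas, and the function and archive names are hypothetical.

```python
def prediction_rates(predicted, truth, all_archives):
    """Compare predicted archives with the archives that actually hold Mementos."""
    predicted, truth = set(predicted), set(truth)
    tp = len(predicted & truth)                       # asked, and archive had it
    fp = len(predicted - truth)                       # asked, but archive had nothing
    fn = len(truth - predicted)                       # skipped, but archive had it
    tn = len(set(all_archives) - predicted - truth)   # skipped, and rightly so
    return {
        "recall": tp / (tp + fn) if tp + fn else None,
        "false_positive_rate": fp / (fp + tn) if fp + tn else None,
        "false_negative_rate": fn / (fn + tp) if fn + tp else None,
    }

# Hypothetical example with five archives.
example = prediction_rates(
    predicted=["a", "b", "c"],
    truth=["a", "d"],
    all_archives=["a", "b", "c", "d", "e"],
)
```

Averaging these per-request values over a log, or grouping them by archive, would give figures of the kind plotted on the false negative and false positive slides.
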
  22. Cache Hit/Miss Rate
  23. Cache Hit/Miss Rate (figure annotations: mostly driven by humans; mostly driven by machines)
  24. False Negatives by Number of Archives
  25. False Negatives by Archive
  26. False Positives by Number of Archives
  27. False Positives by Archive
  28. Changes in Archive Holdings
  29. Archives Added
  30. Archives Removed
  31. Takeaways
     • Memento Aggregator cache is very effective
       • Especially for human-driven services
     • Machine learning process saves!
       • Requests & time while at acceptable recall level
       • FPR: 0.33 (std dev: 0.16)
       • Re-training seems necessary, frequency TBD
     Optimization
     • ML model trained on archival holdings, not usage logs/cache
       • Beneficial for new archives
     • Neural network classifier, based on simple URI features, shows promising results
  32. Smart Routing of Memento Requests
     Martin Klein (1), Lyudmila Balakireva (1), Harihar Shankar (1), James Powell (1), Herbert Van de Sompel (2)
     (1) Research Library, Los Alamos National Laboratory
     (2) Data Archiving and Networked Services, The Netherlands
