
The Memento Tracer Framework: Balancing Quality and Scalability for Web Archiving

The Memento Tracer Framework: Balancing Quality and Scalability for Web Archiving
Presentation at TPDL 2019


The Memento Tracer Framework: Balancing Quality and Scalability for Web Archiving

  1. The Memento Tracer Framework: Balancing Quality and Scalability for Web Archiving
      TPDL, Oslo, Norway, September 10 2019
      Martin Klein, Los Alamos National Laboratory, martinklein0815@gmail.com, @mart1nkle1n
      with Harihar Shankar (98point6), Lyudmila Balakireva (LANL), Herbert Van de Sompel (DANS)
  2. A major challenge in web archiving: Scale vs. Quality
  3. IA’s Scale! https://twitter.com/brewster_kahle/status/1016003169589981184
  4. IA’s Scale!! https://twitter.com/brewster_kahle/status/1118172506777509890
  5. IA’s Scale!!! https://twitter.com/brewster_kahle/status/1139700494748663809
  6. IA’s Scale!!!! https://twitter.com/brewster_kahle/status/1170820482104348672
  7. Fidelity? http://web.archive.org/web/*/http://cnn.com
  8. Fidelity? http://web.archive.org/web/20190808041346/https://www.cnn.com/
  9. Fidelity? https://ws-dl.blogspot.com/2017/01/2017-01-20-cnncom-has-been-unarchivable.html
  10. Webrecorder’s Fidelity! https://webrecorder.io/martinklein/tpdl_test_collection/20190417221002/https://www.cnn.com/
  11. Webrecorder’s Fidelity!! https://twitter.com/ianmilligan1/status/1136703505442324481 https://twitter.com/MellonFdn/status/1138811967060267011
  12. Webrecorder’s Scale? https://twitter.com/mart1nkle1n/status/1136705116738904067
  13. Scale vs. Quality
      • Crawler-based approaches scale well
        • Crawling quality is not always as desired
      • Human-driven approaches often result in great quality
        • Not necessarily designed for (web) scale
  14. Scale vs. Quality
      • Crawler-based approaches scale well
        • Crawling quality is not always as desired
      • Human-driven approaches often result in great quality
        • Not necessarily designed for (web) scale
      Memento Tracer http://tracer.mementoweb.org
  15. Memento Tracer Framework http://tracer.mementoweb.org
      Inspired by:
      • LOCKSS
        • Same automated approach for resources of a class
      • Webrecorder
        • Manual recording of web resources
      • Various attempts aimed at automating interactions/behaviors
        • E.g., Brozzler, Browsertrix
  16. Memento Tracer Framework http://tracer.mementoweb.org
  17. Memento Tracer Implementation
      • Client-side:
        • Tracer Chrome extension leveraging Selenium IDE
        • JSON-formatted Trace for download
      • Server-side:
        • Stormcrawler (http://stormcrawler.net/)
        • Selenium (Chrome) with Tracer plug-in (https://www.seleniumhq.org/projects/webdriver/)
        • WarcProxy (https://github.com/odie5533/WarcProxy)
        • File-system storage for WARC files
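The implementation slide describes a JSON-formatted trace that is recorded once and later replayed by a browser on the server side. The slides do not show the trace schema, so the following is only a minimal sketch under stated assumptions: the field names (`uri_pattern`, `actions`, `repeat_until_missing`), the CSS selectors, and the `Browser` interface are all hypothetical, and `FakeBrowser` stands in for a Selenium-driven Chrome so the example runs without a browser.

```python
import json
from typing import Protocol

# Hypothetical trace document; the actual Tracer JSON schema is not given in the slides.
EXAMPLE_TRACE = json.loads("""
{
  "uri_pattern": "https://example.org/decks/*",
  "actions": [
    {"type": "click", "selector": "#next-slide", "repeat_until_missing": true},
    {"type": "click", "selector": "a.download"}
  ]
}
""")


class Browser(Protocol):
    """Assumed minimal interface a capture browser would need to expose."""

    def exists(self, selector: str) -> bool: ...
    def click(self, selector: str) -> None: ...


class FakeBrowser:
    """Stand-in for a real browser: the 'next' button disappears on the last slide."""

    def __init__(self, slides: int) -> None:
        self.remaining = slides - 1
        self.log: list[str] = []

    def exists(self, selector: str) -> bool:
        if selector == "#next-slide":
            return self.remaining > 0
        return True

    def click(self, selector: str) -> None:
        if selector == "#next-slide":
            self.remaining -= 1
        self.log.append(selector)


def replay(trace: dict, browser: Browser, max_repeats: int = 1000) -> None:
    """Execute each recorded action in order against the browser."""
    for action in trace["actions"]:
        if action["type"] != "click":
            raise ValueError(f"unsupported action type: {action['type']}")
        selector = action["selector"]
        if action.get("repeat_until_missing"):
            # Repeated click with a stop condition: stop once the element is gone.
            count = 0
            while browser.exists(selector) and count < max_repeats:
                browser.click(selector)
                count += 1
        elif browser.exists(selector):
            browser.click(selector)


browser = FakeBrowser(slides=4)
replay(EXAMPLE_TRACE, browser)
print(browser.log)
```

In a real deployment the requests triggered by these clicks would pass through a recording proxy and be written to WARC files; that part is outside this sketch.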
  18. Tracer Interactions https://github.com/mementoweb/memento_extensions
  19. Tracer Interactions https://github.com/mementoweb/memento_extensions
  20. Tracer Interactions https://github.com/mementoweb/memento_extensions
  21. Tracer Interactions https://github.com/mementoweb/memento_extensions
  22. Tracer Interactions https://www.slideshare.net/martinklein0815/evaluating-memento-service-optimizations
  23. Current Memento Tracer Capabilities
      • Single clicks/links
      • All links in an area
      • Repeated clicks on links, with stop condition
        • Slides
        • Pagination
      • Nested traces, i.e., “trace in a trace”
        • Trace for portal A → follow link to portal B → execute trace for portal B
      • Identification of page/portal for which a trace exists by URI (pattern)
  24. Memento Tracer Benefits
      • Scalability
        • Trace created once is applicable to all web resources of the same class
        • Traces shared via repository (edits, versioning)
      • Quality
        • Trace used as set of instructions for browser-based capture framework
        • Resource boundary explicit
      • Tradeoff
        • Quality vs. performance
  25. Evaluation of Scalability & Quality
      • Dataset made of GitHub repositories and Slideshare slide decks
        • 17,646 GitHub repositories (via changelog.com)
        • 12,280 Slideshare decks (via Explore feature)
      • Archival goals:
        • GitHub: get all repository files and ZIP file
        • Slideshare: get all slides and notes
      • Quality evaluation:
        • Compare against Webrecorder
      • Scalability evaluation:
        • Large number of high-quality captures
        • Compare against crawl time of common crawler
  26. Quality
      • Not a trivial dimension to evaluate!
      • Decision to evaluate by number of URIs in live web version vs. archived snapshot
        • Based on manually generated snapshots with Webrecorder
        • Random sample of 100 repos and slide decks
      • Expectation:
        • 100% of URIs from live web in archived snapshot
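The quality measure above compares the URIs seen in a reference version with those present in the archived snapshot. A minimal sketch of that comparison follows; the function name and the use of exact string matching between URIs are assumptions, and the slides do not say how URIs were normalized before comparison.

```python
from typing import Iterable


def uri_coverage(reference_uris: Iterable[str], archived_uris: Iterable[str]) -> float:
    """Percentage of reference URIs that also appear in the archived snapshot."""
    reference = set(reference_uris)
    if not reference:
        raise ValueError("reference set must contain at least one URI")
    captured = reference & set(archived_uris)
    return 100.0 * len(captured) / len(reference)


reference = ["/index", "/style.css", "/app.js", "/archive.zip"]
archived = ["/index", "/style.css", "/archive.zip", "/extra.png"]
print(f"{uri_coverage(reference, archived):.1f}% of reference URIs captured")
```

Under this definition, the stated expectation of 100% means every URI in the reference set was also captured; extra URIs in the archive do not raise the score.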
  27. Quality: 100 @ GitHub, 100 @ Slideshare
  28. Quality at Scale: 17,646 @ GitHub, 12,280 @ Slideshare
  29. Cost of Quality at Scale
      • Runtime difference between Memento Tracer and common web crawler for the same number of URIs
        • Plus 20 seconds per URI, on average
      • Faster than previous approaches, discovers many more URIs
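To get a feel for what an average of 20 extra seconds per URI means at the dataset sizes used in the evaluation, the overhead can be totaled with simple arithmetic. This is a back-of-the-envelope sketch, not a reported result: it assumes the overhead applies once per seed URI and that the `workers` parameter (hypothetical) divides the work evenly.

```python
def extra_runtime_hours(num_uris: int, extra_seconds_per_uri: float = 20.0,
                        workers: int = 1) -> float:
    """Total added wall-clock time in hours for a given per-URI overhead."""
    if num_uris < 0 or workers < 1:
        raise ValueError("num_uris must be >= 0 and workers >= 1")
    return num_uris * extra_seconds_per_uri / workers / 3600.0


for label, count in [("GitHub", 17_646), ("Slideshare", 12_280)]:
    print(f"{label}: about {extra_runtime_hours(count):.1f} extra hours")
```

Processed sequentially under these assumptions, the overhead comes to roughly 98 hours for the GitHub set and 68 hours for the Slideshare set; parallel workers would shrink the wall-clock figure proportionally.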
  30. Takeaways
      • Memento Tracer aims at finding a balance between quality and scale
      • Human in the loop, benefits from patterns of web resources
      • Experiments provide indicators for high quality, reliability, scale
      • Cost involved, slower than simple crawlers
      • Optimizations possible, further potential and limitations to be explored
  31. Martin Klein, Los Alamos National Laboratory, martinklein0815@gmail.com, @mart1nkle1n
      with Harihar Shankar (98point6), Lyudmila Balakireva (LANL), Herbert Van de Sompel (DANS)
