A Web-Centric Pipeline for Archiving Scholarly Artifacts

TPDL/DCMI 2018 Keynote
A Web-Centric Pipeline for Archiving Scholarly Artifacts
Martin Klein & Herbert Van de Sompel
Los Alamos National Laboratory

  1. A Web-Centric Pipeline for Archiving Scholarly Artifacts
     TPDL2018, Porto, Portugal, 12 Sep 2018
     Martin Klein, Los Alamos National Laboratory, @mart1nkle1n, https://orcid.org/0000-0003-0130-2097
     Herbert Van de Sompel, Los Alamos National Laboratory, @hvdsomp, https://orcid.org/0000-0002-0715-6126
     The Scholarly Orphans project is funded by the Andrew W. Mellon Foundation
  2. Scholarly Orphans – Project Motivation
  3. Research and Research Communication on the Web
     • Consideration
         • Researchers are increasingly using a variety of web platforms for collaboration and communication
     • Why?
         • Many of these platforms have desirable characteristics
             • Versioning
             • Time stamping
             • Social embedding
         • Their institutions do not provide platforms that have global reach
             • Collaboration, cf. GitHub ~ productivity
             • Communication, cf. SlideShare ~ visibility
  4. Emma Schymanski
     https://orcid.org/0000-0001-6868-8145
     https://github.com/schymane
     https://www.slideshare.net/EmmaSchymanski
     https://figshare.com/authors/Emma_Schymanski/5087039
     https://publons.com/author/1538491/emma-schymanski#profile
  5. Shawn Jones
     https://orcid.org/0000-0002-4372-870X
     http://www.shawnmjones.org/
     https://github.com/shawnmjones
     https://www.slideshare.net/shawnmjones
     https://en.wikipedia.org/wiki/User:Shawnmjones
     https://www.blogger.com/profile/17827543974149663194
  6. Research and Research Communication on the Web
     • Consideration
         • Researchers deposit artifacts in these web platforms
     • Web platforms:
         • Dedicated to scholarship:
             • Commercial: e.g., FigShare, Publons
             • Not for profit: e.g., OSF, Zenodo
         • General purpose:
             • Commercial: e.g., GitHub, SlideShare
             • Not for profit: e.g., Wikipedia, Wikidata
  7. Research and Research Communication on the Web
     • Consideration
         • Researchers deposit artifacts in these web platforms
     • Status quo: the researchers’ institutions commonly:
         • Do not know about the existence of these artifacts
         • Do not have a copy of these artifacts
  8. Research and Research Communication on the Web
     • Consideration
         • Researchers deposit artifacts in these web platforms
     • Status quo: uncertainty regarding long-term accessibility of these artifacts:
         • General purpose platforms don’t provide long-term access guarantees; platforms dedicated to scholarship commonly do
         • Uncertainty regarding the sustainability of unhindered long-term access to artifacts in these platforms:
             • Commercial: when is the change in business model coming?
             • Not for profit: will the next round of grant applications and member contributions be successful?
  9. Research and Research Communication on the Web
     • Consideration
         • Researchers deposit artifacts in these web platforms
     • Status quo: these artifacts are not systematically archived:
         • No frameworks like LOCKSS/Portico exist for these artifacts
         • Researchers only selectively deposit artifacts in portals that provide archival guarantees, to obtain a cite-able DOI
         • Can’t expect researchers to (also) upload all artifacts to institutional repositories (IRs)
         • Web archives only incidentally archive these artifacts
             • Anecdotal & Hiberlink evidence
  10. Emma’s SlideShare Artifact: 0 Mementos
     https://www.slideshare.net/EmmaSchymanski/dmcm2018-community-resources-connecting-chemistry-and-toxicity-knowledge
     http://timetravel.mementoweb.org/
  11. Shawn’s GitHub Artifact: 1 Memento
     https://github.com/shawnmjones/mediawiki
     http://web.archive.org/
  12. Hiberlink Evidence
     Web resources referenced in Elsevier corpus (1996-2012) without representative Memento in public web archives
     Martin Klein, Herbert Van de Sompel, et al. (2014) Scholarly context not found. In: PLOS ONE https://doi.org/10.1371/journal.pone.0115253
  13. The Need for an Archiving Infrastructure
     Herbert Van de Sompel & Andrew Treloar (2014) A Perspective on Archiving the Scholarly Web https://hvdsomp.info/papers/Papers/2014/iPres2014_Sompel_Treloar.pdf
  14. Recording versus Archiving
     Recording               | Archiving
     Short-term              | Longer-term
     No guarantees provided  | Attempt to provide guarantees
     Write many/read many    | Write once/read many
     Scholarly process       | Scholarly record
     Herbert Van de Sompel & Andrew Treloar (2014) A Perspective on Archiving the Scholarly Web https://hvdsomp.info/papers/Papers/2014/iPres2014_Sompel_Treloar.pdf
  15. Scholarly Orphans – Project Overview
  16. The Scholarly Orphans Project
     • Funded by the Andrew W. Mellon Foundation
         • Los Alamos National Laboratory & New Mexico Consortium
         • Old Dominion University
         • 04/2016 - 03/2019
     • How to capture Scholarly Orphans (i.e., the scholarly artifacts deposited in web portals) for long-term archiving?
     • Experimental project, aimed at exploring technical possibilities
  17. The Scholarly Orphans Project
     • Explores an institution-driven paradigm
         • Academic institutions typically have a long shelf life
             • A basic premise underlying e.g., LOCKSS, perma.cc
         • An academic institution should be interested in capturing the artifacts (intellectual property) its scholars deposit on the web
         • Collecting and archiving such artifacts aligns with the mission of academic libraries
  18. An Institutional Perspective
  19. The Scholarly Orphans Project
     • Explores a paradigm inspired by web archiving
         • Scale of the problem
         • Can’t expect researchers to upload all artifacts to an institutional repository
         • Bilateral agreements for archival purposes with most web portals are unlikely
  20. A Web Archiving Perspective
  21. Inspiration
     • LOCKSS
         • Web crawling approach
         • Focused on journal literature
     • Archive-It
         • On-demand, subscription-based web archiving
         • Not focused on scholarly orphans
     • Institutional repository, auto-discovery of journal articles
         • Capture an institution’s output
         • Focused on journal literature
     • The Locker Project & Amy Guy’s Personal Web Observatory work
         • Capture an individual’s web presence
         • Not focused on scholarly orphans
     http://rhiaro.co.uk/
     https://rhiaro.github.io/thesis/
  22. Scholarly Orphans – Prototype Pipeline Overview
  23. Prototype Pipeline
  24. Prototype Pipeline
  25. Demo - myresearch.institute
  26. myresearch.institute - Researchers
     • Uniquely identified by ORCIDs
     • Web identities in multiple portals
     • Create various types of artifacts
  27. myresearch.institute - Portals
     • Tracking started August 27, 2018
     • Tracking artifacts created starting August 1, 2018
     • >2,200 artifacts tracked to date for all 16 researchers
  28. myresearch.institute - Artifacts
     • schema.org typology:
         • Answer
         • Article
         • BlogPosting
         • Comment
         • Dataset
         • PresentationDigitalDocument
         • Question
         • Review
         • SoftwareSourceCode
  29. Tracking Artifacts
  30. Tracking Artifacts - Description
     • In order to track artifacts that were recently deposited by an institutional researcher in a portal, one reasonably needs:
         • The web identity of the researcher in the portal
             • Algorithmic discovery
             • Discovery via a registry
  31. Algorithmic Discovery of Web Identities
     James Powell, Harihar Shankar, Marko Rodriguez, and Herbert Van de Sompel (2014) EgoSystem: Where are our alumni? In: code4lib http://journal.code4lib.org/articles/9519
  32. Discovery of Web Identities via a Registry (ORCID)
     Martin Klein and Herbert Van de Sompel (2017) Discovering Scholarly Orphans Using ORCID. In: JCDL2017 https://arxiv.org/abs/1703.09343
  33. Shawn’s ORCID Record
     https://orcid.org/0000-0002-4372-870X
  34. Emma’s ORCID Record
     https://orcid.org/0000-0001-6868-8145
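Registry-based discovery can be pictured as mapping the profile URLs a researcher lists in a registry record to per-portal web identities. The sketch below is only an illustration under stated assumptions: it does not call the ORCID API, the `PORTAL_HOSTS` table and `web_identities` function are hypothetical names, and it assumes the first path segment of a GitHub or SlideShare profile URL is the user name.

```python
from typing import Dict, Iterable
from urllib.parse import urlsplit

# Hypothetical lookup table: host name -> portal label.
# Only portals whose profile URLs start with the user name are listed.
PORTAL_HOSTS = {
    "github.com": "GitHub",
    "www.slideshare.net": "SlideShare",
}


def web_identities(profile_urls: Iterable[str]) -> Dict[str, str]:
    """Map known portals to the user name found in each profile URL."""
    identities: Dict[str, str] = {}
    for url in profile_urls:
        parts = urlsplit(url)
        portal = PORTAL_HOSTS.get(parts.netloc.lower())
        segments = [s for s in parts.path.split("/") if s]
        if portal and segments:
            identities[portal] = segments[0]
    return identities


# Profile URLs taken from the Shawn Jones slide earlier in the deck.
shawn = web_identities([
    "http://www.shawnmjones.org/",
    "https://github.com/shawnmjones",
    "https://www.slideshare.net/shawnmjones",
])
print(shawn)
```

URLs on hosts that are not in the table are skipped, which mirrors the fallback to manual discovery and entry when a registry record is incomplete.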
  35. Tracking Artifacts - Description
     • In order to track artifacts that were recently deposited by an institutional researcher in a portal, one reasonably needs:
         • The web identity of the researcher in the portal
             • Algorithmic discovery
             • Discovery via a registry
         • A portal API that supports:
             • Access by web identity
             • Access to contributions “since …” for the web identity
     • Result of tracking:
         • URI(s) of new artifact(s) discovered in the portal
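The two API requirements (access by web identity, contributions "since …") can be sketched as a small tracker function. This is a minimal sketch, not the project's tracker: `Contribution`, `track_new_artifacts`, and the in-memory `fake_portal` are hypothetical stand-ins for a real portal API client, and the example URIs are invented.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Callable, Iterable, List


@dataclass(frozen=True)
class Contribution:
    """One item returned by a (hypothetical) portal API."""
    uri: str
    published: datetime


# A portal lookup takes a web identity and returns that identity's contributions.
PortalLookup = Callable[[str], Iterable[Contribution]]


def track_new_artifacts(lookup: PortalLookup, web_identity: str, since: datetime) -> List[str]:
    """Return URIs of artifacts the identity published at or after `since`, oldest first."""
    fresh = [c for c in lookup(web_identity) if c.published >= since]
    return [c.uri for c in sorted(fresh, key=lambda c: c.published)]


def fake_portal(web_identity: str) -> List[Contribution]:
    """In-memory stand-in for a real portal API call (assumption for this sketch)."""
    data = {
        "shawnmjones": [
            Contribution("https://portal.example/shawnmjones/old",
                         datetime(2018, 7, 15, tzinfo=timezone.utc)),
            Contribution("https://portal.example/shawnmjones/new",
                         datetime(2018, 8, 20, tzinfo=timezone.utc)),
        ]
    }
    return data.get(web_identity, [])


new_uris = track_new_artifacts(fake_portal, "shawnmjones",
                               datetime(2018, 8, 1, tzinfo=timezone.utc))
print(new_uris)
```

Passing the lookup as a callable keeps the filtering logic independent of any one portal, so each portal only needs its own small adapter.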
  36. Tracking Artifacts - Architecture
  37. Tracking Artifacts - Implementation
     • Tracker event notifications:
         • Linked Data Notifications (JSON-LD) using AS2, PROV-O, schema.org
         • Identifiers: unique tracker event identifier per notification
         • Dates: artifact publication date & artifact tracked date
         • URIs: 1+ artifact URI
     • Event database:
         • Notifications stored/indexed in ElasticSearch
     • Researcher database:
         • SQLite
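To make the notification fields concrete, the following sketch assembles a JSON-LD payload with one event identifier, two dates, and one or more artifact URIs. The property layout is an assumption made for illustration and is not the project's actual payload: `build_tracker_notification` is a hypothetical helper, the AS2 `Create` type and `published` property carry the tracked date, and `schema:datePublished` carries the artifact publication date.

```python
import json
import uuid
from datetime import datetime, timezone


def build_tracker_notification(artifact_uris, researcher_orcid, artifact_published, tracked):
    """Assemble an AS2-style tracker notification as a plain dict (illustrative shape only)."""
    if not artifact_uris:
        raise ValueError("a tracker event needs at least one artifact URI")
    return {
        "@context": [
            "https://www.w3.org/ns/activitystreams",
            {"prov": "http://www.w3.org/ns/prov#", "schema": "http://schema.org/"},
        ],
        # Unique tracker event identifier per notification.
        "id": f"urn:uuid:{uuid.uuid4()}",
        "type": "Create",
        "actor": {"id": researcher_orcid, "type": "Person"},
        # Artifact tracked date (assumed mapping to AS2 `published`).
        "published": tracked.isoformat(),
        "object": [
            {
                "id": uri,
                # Artifact publication date (assumed mapping to schema.org).
                "schema:datePublished": artifact_published.isoformat(),
            }
            for uri in artifact_uris
        ],
    }


notification = build_tracker_notification(
    ["https://github.com/shawnmjones/mediawiki"],
    "https://orcid.org/0000-0002-4372-870X",
    artifact_published=datetime(2018, 8, 20, tzinfo=timezone.utc),
    tracked=datetime(2018, 8, 27, tzinfo=timezone.utc),
)
print(json.dumps(notification, indent=2))
```

Because the payload is plain JSON, the same document could be posted to a Linked Data Notifications inbox and indexed in a search store without conversion.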
  38. Tracking Artifacts - Demo
     Demo: https://myresearch.institute/
  39. Tracking Artifacts - Challenges
     • Discovery of web identities of researchers
         • Algorithmic and registry-based discovery are currently not adequate
         • Fallback: manual discovery and entry
             • With help of the researcher
     • Portal API access by web identity
         • Broadly supported by general purpose portals
         • Typically not supported by scholarly portals
             • Some lack an API altogether
             • Should add ORCID access to APIs
             • OAI-PMH and ResourceSync need sets per web identity
     • Professional versus personal contributions
     • Tracking frequency/scale
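If a scholarly portal exposed one OAI-PMH set per web identity, a tracker could harvest a researcher's recent records with a selective `ListRecords` request. The sketch below only builds the request URL; `verb`, `metadataPrefix`, `from`, and `set` are standard OAI-PMH arguments, while the base URL and the `orcid:<id>` set naming are assumptions, since the slide notes that such sets are not generally available.

```python
from urllib.parse import parse_qs, urlencode, urlsplit


def oai_list_records_url(base_url: str, set_spec: str, from_date: str,
                         metadata_prefix: str = "oai_dc") -> str:
    """Build a selective-harvesting ListRecords URL for one set and a start date."""
    params = {
        "verb": "ListRecords",
        "metadataPrefix": metadata_prefix,
        "from": from_date,   # records changed on or after this date
        "set": set_spec,     # hypothetical per-identity set
    }
    return f"{base_url}?{urlencode(params)}"


url = oai_list_records_url(
    "https://repository.example/oai",   # hypothetical endpoint
    "orcid:0000-0002-4372-870X",        # hypothetical set naming scheme
    "2018-08-01",
)
print(url)
```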
  40. Capturing Artifacts
  41. Capturing Artifacts - Description
     • The capture process takes as input the URI of a new artifact discovered in a portal
     • Its task is to create a representative institutional capture of the artifact
     • Result of capture:
         • WARC file for new artifact in an institutional archive
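For readers unfamiliar with the output format, the sketch below serializes a single WARC/1.0 `resource` record by hand: a header block, a blank line, the payload, and two trailing CRLFs. It is a simplified illustration, not the capture approach used in the pipeline; a real capture would record full HTTP request/response exchanges for every resource inside the artifact's boundary, and `warc_resource_record` is a hypothetical helper name.

```python
import uuid
from datetime import datetime, timezone


def warc_resource_record(target_uri: str, payload: bytes, content_type: str,
                         when: datetime) -> bytes:
    """Serialize one WARC/1.0 'resource' record for the given payload."""
    warc_date = when.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")
    headers = [
        "WARC/1.0",
        "WARC-Type: resource",
        f"WARC-Target-URI: {target_uri}",
        f"WARC-Date: {warc_date}",
        f"WARC-Record-ID: <urn:uuid:{uuid.uuid4()}>",
        f"Content-Type: {content_type}",
        f"Content-Length: {len(payload)}",
    ]
    # Header lines end with CRLF; an empty line separates headers from the block.
    head = "\r\n".join(headers).encode("utf-8") + b"\r\n\r\n"
    # Each record is terminated by two CRLFs.
    return head + payload + b"\r\n\r\n"


record = warc_resource_record(
    "https://portal.example/artifact/1",   # hypothetical artifact URI
    b"hello",
    "text/plain",
    datetime(2018, 9, 12, tzinfo=timezone.utc),
)
print(record.decode("utf-8"))
```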
  42. Capturing Artifacts - Description
     • Challenges:
         • Delineate the web boundary of the artifact
             • More than the input artifact URI
             • The boundary is in the eye of the beholder
         • Create a high-fidelity capture using an approach that scales for a steady stream of new artifacts
             • Unsolved problem
  43. Capturing Artifacts
  44. Capturing Artifacts
  45. Capturing Artifacts
  46. Memento Tracer - Framework
     http://tracer.mementoweb.org
  47. Capturing Artifacts - Architecture
  48. Capturing Artifacts - Implementation
     • Capture event notifications:
         • Identifiers: unique capture event identifier per notification; preceding tracker event identifier conveyed as provenance
         • Dates: datetime of WARC file creation
         • URIs: 1+ WARC file URI
     • Tracer, client-side:
         • Tracer Chrome extension leveraging Selenium IDE
     • Tracer, server-side:
         • Stormcrawler; Selenium (Chrome) with Tracer plug-in; WarcProxy; file-system storage for WARC files
     http://stormcrawler.net/
     https://www.seleniumhq.org/projects/webdriver/
     https://github.com/odie5533/WarcProxy
  49. Capturing Artifacts - Demo
     Demo: https://myresearch.institute/
  50. Capturing Artifacts - Challenges
     • Memento Tracer:
         • Language used to express Traces (interoperability)
         • Organization of the shared repository for Traces
         • Limitations of the browser event listener approach for recording Traces
         • Selection of a Trace for capturing a web publication by other means than URI pattern
     • Legal constraints
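The baseline that the Trace selection challenge wants to move beyond, choosing a Trace by URI pattern, can be sketched as a lookup over regular expressions. The patterns and Trace names below are hypothetical and do not come from the Memento Tracer repository; the first matching pattern wins.

```python
import re
from typing import Optional

# Hypothetical registry: (URI pattern, name of the Trace to apply).
TRACES = [
    (re.compile(r"^https://github\.com/[^/]+/[^/]+/?$"), "github-repository-trace"),
    (re.compile(r"^https://www\.slideshare\.net/[^/]+/[^/]+/?$"), "slideshare-deck-trace"),
]


def select_trace(artifact_uri: str) -> Optional[str]:
    """Return the name of the first Trace whose URI pattern matches, else None."""
    for pattern, trace_name in TRACES:
        if pattern.match(artifact_uri):
            return trace_name
    return None


print(select_trace("https://github.com/shawnmjones/mediawiki"))
print(select_trace("https://example.org/unknown"))
```

A URI that matches no pattern returns `None`, which is the case where selection by other means than URI pattern would be needed.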
  51. Archiving Artifacts
  52. Archiving Artifacts - Description
     • The archiving process takes as input the URI of a WARC file generated by the capture process
     • Its task is to ingest the WARC file in a cross-institutional web archive
         • This can be achieved using off-the-shelf web archiving software, e.g., pywb, Open Wayback
     • Result of archiving:
         • Mementos pertaining to the newly discovered artifact in a cross-institutional, Memento-compliant web archive
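Memento compliance means a client can ask the archive for the capture closest to a given datetime. The sketch below formats the `Accept-Datetime` request header defined by the Memento protocol (RFC 7089) and builds a replay URI in the `/<collection>/<14-digit timestamp>/<original URI>` form used by pywb. The archive base URL and collection name are hypothetical.

```python
from datetime import datetime, timezone
from email.utils import format_datetime


def accept_datetime_header(when: datetime) -> str:
    """Format a datetime as an RFC 1123 date in GMT, as Accept-Datetime expects."""
    return format_datetime(when.astimezone(timezone.utc), usegmt=True)


def replay_uri(archive_base: str, collection: str, when: datetime, original_uri: str) -> str:
    """Build a pywb-style replay URI with a 14-digit UTC timestamp."""
    timestamp = when.astimezone(timezone.utc).strftime("%Y%m%d%H%M%S")
    return f"{archive_base}/{collection}/{timestamp}/{original_uri}"


when = datetime(2018, 9, 12, tzinfo=timezone.utc)
header = accept_datetime_header(when)
memento = replay_uri(
    "https://archive.example",   # hypothetical archive host
    "orphans",                   # hypothetical collection name
    when,
    "https://github.com/shawnmjones/mediawiki",
)
print(header)
print(memento)
```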
  53. Archiving Artifacts - Architecture
  54. Archiving Artifacts - Implementation
     • Archiver event notifications:
         • Identifiers: unique archiver event identifier per notification; preceding tracker/capturer event identifiers conveyed as provenance
         • Dates: WARC file ingest date; Memento-Datetime values
         • URIs: 1+ Memento URI, each corresponding to an artifact URI
     • Web archive:
         • pywb
     • Social card:
         • MementoEmbed
     https://github.com/webrecorder/pywb
     https://github.com/oduwsdl/MementoEmbed
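Since each notification conveys the identifier of the event that preceded it, the history of an archived artifact can be reconstructed by walking those links backwards. The sketch below assumes a simplified in-memory event store; the event identifiers and the `preceded_by` key are hypothetical and stand in for whatever provenance property the notifications actually use.

```python
from typing import Dict, List, Optional

# Hypothetical event store: event identifier -> minimal event record.
EVENTS: Dict[str, Dict[str, Optional[str]]] = {
    "tracker-event-1": {"kind": "tracker", "preceded_by": None},
    "capture-event-1": {"kind": "capturer", "preceded_by": "tracker-event-1"},
    "archive-event-1": {"kind": "archiver", "preceded_by": "capture-event-1"},
}


def provenance_chain(event_id: str, events: Dict[str, Dict[str, Optional[str]]]) -> List[str]:
    """Follow preceding-event links and return the chain from first to last event."""
    chain: List[str] = []
    seen = set()
    current: Optional[str] = event_id
    while current is not None:
        if current in seen:
            raise ValueError(f"provenance loop at {current}")
        seen.add(current)
        chain.append(current)
        current = events[current]["preceded_by"]
    return list(reversed(chain))


print(provenance_chain("archive-event-1", EVENTS))
```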
  55. Archiving Artifacts - Demo
     Demo: https://myresearch.institute/
  56. Archiving Artifacts - Challenges
     • Attempted to use ipwb, a pywb version that uses IPFS
         • Cross-institutional distributed file system with redundancy
         • Ran out of time to get it operationally stable
     Sawood Alam, Mat Kelly, and Michael L. Nelson (2016) InterPlanetary Wayback: The Permanent Web Archive https://doi.org/10.1145/2910896.2925467
  57. Scholarly Orphans – Summary
  58. Summary (1/2)
     • The Scholarly Orphans project explores an institution-driven approach to capture scholarly artifacts deposited in web portals
         • Artifacts are out of scope of existing archival approaches such as LOCKSS, Portico, web archives
         • Institutions have a long shelf life, should be interested in collecting these artifacts, and have feasible scale for identity/artifact discovery
  59. Summary (2/2)
     • Components of the experimental pipeline:
         • Tracker: automatically discover artifacts because researchers will not upload them to the institution
         • Capturer: high-fidelity artifact captures through crowd-sourcing navigation patterns with Memento Tracer
         • Archiver: cross-institutional, Memento-compliant scholarly web archive
  60. Acknowledgments
     • Los Alamos National Laboratory:
         • Lyudmila Balakireva
         • Martin Klein
         • James Powell
         • Harihar Shankar
         • Herbert Van de Sompel
     • Old Dominion University:
         • Sawood Alam
         • Grant Atkins
         • Shawn Jones
         • Mat Kelly
         • Michael L. Nelson
     • myresearch.institute – all volunteering researchers
  61. A Web-Centric Pipeline for Archiving Scholarly Artifacts
     Martin Klein, Los Alamos National Laboratory, @mart1nkle1n, https://orcid.org/0000-0003-0130-2097
     Herbert Van de Sompel, Los Alamos National Laboratory, @hvdsomp, https://orcid.org/0000-0002-0715-6126
     The Scholarly Orphans project is funded by the Andrew W. Mellon Foundation