To the Rescue of Scholarly Orphans

Presentation for PIDapalooza 2019, Dublin, Ireland.

The Scholarly Orphans project, funded by the Andrew W. Mellon Foundation, explores technical approaches aimed at capturing and archiving scholarly artifacts that researchers deposit in web productivity portals as a means to collaborate and communicate with their peers. These artifacts are not collected by other frameworks aimed at archiving the scholarly record (e.g., LOCKSS, Portico, Institutional Repositories) and are only incidentally captured by web archives. The project explores an institution-driven approach inspired by web archiving. To demonstrate the ongoing thinking, the project has devised an experimental automated pipeline that continuously discovers, captures, and archives artifacts. These are created by actual researchers who, for the purpose of the experiment, were virtually enlisted in a fictive research institution. A portal at myresearch.institute provides an overview of the artifacts that were discovered and provides access to archived versions stored in both an institutional and a cross-institutional archive. The set-up leverages a range of technologies that share a flavor of persistence: Memento, Memento Tracer, Robust Links, Signposting.

To the Rescue of Scholarly Orphans

  1. @hvdsomp PIDapalooza 2019, Dublin, Ireland, 23 Jan 2019 • Herbert Van de Sompel, DANS, @hvdsomp, https://orcid.org/0000-0002-0715-6126 • To the Rescue of Scholarly Orphans • The Scholarly Orphans project is funded by the Andrew W. Mellon Foundation
  2. Scholarly Orphans Team • Los Alamos National Laboratory: • Lyudmila Balakireva • Martin Klein • James Powell • Harihar Shankar • Herbert Van de Sompel • Old Dominion University: • Sawood Alam • Grant Atkins • Shawn Jones • Mat Kelly • Michael L. Nelson
  3. Scholarly Orphans – Project Motivation
  4. Research and Research Communication on the Web • Consideration • Researchers deposit artifacts in web platforms • Web Platforms: • Dedicated to scholarship: • Commercial: e.g., FigShare, Publons • Not for profit: e.g., OSF, Zenodo • General purpose: • Commercial: e.g., GitHub, SlideShare • Not for profit: e.g., Wikipedia, Wikidata
  5. Research and Research Communication on the Web • Consideration • Researchers deposit artifacts in web platforms • Status quo - The researchers’ institutions are in the dark • Do not know about the existence of these artifacts • Do not have a copy of these artifacts
  6. Research and Research Communication on the Web • Consideration • Researchers deposit artifacts in web platforms • Status quo – Uncertainty regarding long-term access • Commercial: changing business model, no preservation commitment • Not for profit: unpredictable funding stream
  7. Research and Research Communication on the Web • Consideration • Researchers deposit artifacts in web platforms • Status quo - Not systematically archived • No frameworks like LOCKSS/Portico exist for these artifacts • Researchers only selectively deposit artifacts in portals that provide archival guarantees, in order to obtain a citable DOI • Can’t expect researchers to (also) upload all artifacts in IRs • Web archives only incidentally archive these artifacts, cf. Hiberlink research • Martin Klein, Herbert Van de Sompel, et al. (2014) Scholarly context not found. In: PLOS ONE https://doi.org/10.1371/journal.pone.0115253
  8. Scholarly Orphans – Project Overview • How to capture Scholarly Orphans for long-term archiving?
  9. The Scholarly Orphans Project • Explores an institution-driven paradigm • Academic institutions typically have a long shelf life • A basic premise underlying e.g., LOCKSS, perma.cc • An academic institution should be interested in capturing the artifacts (intellectual property) its scholars deposit on the web • Collecting and archiving such artifacts aligns with the mission of academic libraries
  10. An Institutional Perspective
  11. The Scholarly Orphans Project • Explores a paradigm inspired by web archiving • Scale of the problem • Can’t expect researchers to upload all artifacts in an institutional repository • Bilateral agreements for archival purposes with most web portals unlikely
  12. A Web Archiving Perspective
  13. Scholarly Orphans – Prototype Pipeline Overview
  14. Prototype Pipeline
  15. Prototype Pipeline
  16. Tracking Artifacts
  17. Tracking Artifacts - Description • In order to track artifacts that were recently deposited by an institutional researcher in a portal, one reasonably needs: • The web identity of the researcher in the portal • Algorithmic discovery • Discovery via a registry • Manual collection • A portal API that supports: • Access by web identity • Access to contributions “since …” for the web identity • Result of tracking: • URI(s) of new artifact(s) discovered in the portal
  18. Tracking Artifacts - Challenges • Portal API access by web identity • Broadly supported by general purpose portals • Typically not supported by scholarly portals • Some lack an API altogether • Should add ORCID access to APIs • OAI-PMH and ResourceSync need sets per web identity
  19. Capturing Artifacts
  20. Capturing Artifacts - Description • The capture process takes as input the URI of a new artifact discovered in a portal • Its task is to create a representative institutional capture of the artifact • Result of capture: • WARC file for new artifact in an institutional archive
  21. Capturing Artifacts - Challenges • Delineate the web boundary of the artifact • More than the input artifact URI • The boundary is in the eye of the beholder • Create a high-fidelity capture using an approach that scales for a steady stream of new artifacts • Unsolved problem • We made a significant breakthrough with the Memento Tracer framework • Memento Tracer: http://tracer.mementoweb.org
  22. Capturing Artifacts
  23. Archiving Artifacts
  24. Archiving Artifacts - Description • The archiving process takes as input the URI of a WARC file generated by the capture process • Its task is to ingest the WARC file in a cross-institutional web archive • This can be achieved using off-the-shelf web archiving software, e.g., pywb, OpenWayback • Result of archiving: • Mementos pertaining to newly discovered artifact in a cross-institutional, Memento-compliant web archive • Possibility to link to artifacts using Robust Links: <a href="URI-A" data-versionurl="URI-M" data-versiondate="date-of-capture"> • Robust Links: http://robustlinks.mementoweb.org/about/
  25. Archiving Artifacts - Challenges • Attempted to use ipwb, a pywb version that uses IPFS • Cross-institutional distributed file system with redundancy • Ran out of time to get it operationally stable • Sawood Alam, Mat Kelly, and Michael L. Nelson (2016) InterPlanetary Wayback: The Permanent Web Archive https://doi.org/10.1145/2910896.2925467
  26. Demo - myresearch.institute
  27. myresearch.institute - Researchers • Uniquely identified by ORCIDs • Web identities in multiple portals • Create various types of artifacts
  28. myresearch.institute - Portals • Tracking started August 27 2018 • Tracking artifacts created starting August 1 2018 • 9000+ artifacts tracked to date for all 16 researchers
  29. myresearch.institute - Artifacts • schema.org typology: • Answer • Article • BlogPosting • Comment • Dataset • PresentationDigitalDocument • Question • Review • SoftwareSourceCode
  30. Pipeline Demo https://myresearchinstitute.org
  31. Scholarly Orphans – Summary
  32. Summary (1/2) • The Scholarly Orphans project explores an institution-driven approach to capture scholarly artifacts deposited in web portals • Artifacts out of scope of existing archival approaches such as LOCKSS, Portico, web archives • Institutions have a long shelf life, should be interested in collecting these artifacts, and have feasible scale for identity/artifact discovery
  33. Summary (2/2) • Components of the experimental pipeline: • Tracker: Automatically discover artifacts because researchers will not upload them to the institution • Capturer: High fidelity artifact captures through crowd-sourcing navigation patterns with Memento Tracer • Archiver: Cross-institutional, Memento-compliant scholarly web archive
  34. Herbert Van de Sompel, DANS, @hvdsomp, https://orcid.org/0000-0002-0715-6126 • To the Rescue of Scholarly Orphans • The Scholarly Orphans project is funded by the Andrew W. Mellon Foundation
  35. No Long-Term Access Guarantees • “We reserve the right at any time and from time to time to modify or discontinue, temporarily or permanently, the Website (or any part of it) with or without notice.” • GitHub Terms of Service https://help.github.com/articles/github-terms-of-service/
  36. Funding Not for Profits • FAQ 19: How is arXiv funded and governed? Answer: It is currently supported by funds from a network of member libraries, the Simons Foundation and financial support, labor, and infrastructure provided by the Cornell University Library. The annual membership fees depend on the institutional usage and range from $1,500 to just $3,000 per year, comparable to the author fees for a single open access article. … nearly 200 libraries in 24 countries, who have made contributions to support arXiv via the membership program …. • Paul Ginsparg (2017) Preprint Déjà Vu: an FAQ https://arxiv.org/abs/1706.04188
  37. Hiberlink Evidence • Web resources referenced in Elsevier corpus (1996-2012) without representative Memento in public web archives
  38. Inspiration • LOCKSS • Web crawling approach • Focused on journal literature • Archive-It • On-demand, subscription-based web archiving • Not focused on scholarly orphans • Institutional repository, auto-discovery of journal articles • Capture an institution’s output • Focused on journal literature • The Locker Project & Amy Guy’s Personal Web Observatory work • Capture an individual’s web presence • Not focused on scholarly orphans • http://rhiaro.co.uk/ https://rhiaro.github.io/thesis/
  39. Tracking Artifacts - Implementation • Tracker event notifications: • Linked Data Notifications (JSON-LD) using AS2, PROV-O, schema.org • Identifiers: Unique tracker event identifier per notification • Dates: artifact publication date & artifact tracked date • URIs: 1+ artifact URI • Event database: • Notifications stored/indexed in Elasticsearch • Researcher database: • SQLite
  40. Tracking Artifacts - Architecture
  41. Capturing Artifacts - Implementation • Capture event notifications: • Identifiers: Unique capture event identifier per notification; preceding tracker event identifier conveyed as provenance • Dates: Datetime of WARC file creation • URIs: 1+ WARC file URI • Tracer, client-side: • Tracer Chrome extension leveraging Selenium IDE • Tracer, server-side: • StormCrawler; Selenium (Chrome) with Tracer plug-in; WarcProxy; file-system storage for WARC files • http://stormcrawler.net/ https://www.seleniumhq.org/projects/webdriver/ https://github.com/odie5533/WarcProxy
  42. Capturing Artifacts - Architecture
  43. Archiving Artifacts - Implementation • Archiver event notifications: • Identifiers: Unique archiver event identifier per notification; preceding tracker/capturer event identifiers conveyed as provenance • Dates: WARC file ingest date; Memento-Datetime values • URIs: 1+ Memento URI, each corresponding to an artifact URI • Web Archive: • pywb • Social card: • MementoEmbed • https://github.com/webrecorder/pywb https://github.com/oduwsdl/MementoEmbed
  44. Memento Tracer - Framework http://tracer.mementoweb.org
  45. Capturing Artifacts - Challenges • Memento Tracer: • Language used to express Traces (interoperability) • Organization of the shared repository for Traces • Limitations of the browser event listener approach for recording Traces • Selection of a Trace for capturing a web publication by other means than URI pattern • Legal constraints
  46. Archiving Artifacts - Architecture
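
The Robust Links markup named on the "Archiving Artifacts - Description" slide pairs the original artifact URI (URI-A) with the URI of its archived snapshot (URI-M) and the capture date. A minimal Python sketch of producing such a link is shown here. The function name `robust_link` and the example.org URIs are illustrative assumptions, not part of the project's pipeline, and the date is written as a plain YYYY-MM-DD value.

```python
from datetime import datetime, timezone
from html import escape


def robust_link(uri_a: str, uri_m: str, captured: datetime, text: str) -> str:
    """Return an HTML anchor decorated with Robust Links attributes.

    uri_a    -- original artifact URI (goes in href)
    uri_m    -- URI of the archived snapshot (goes in data-versionurl)
    captured -- timezone-aware datetime of capture (goes in data-versiondate)
    text     -- link text
    """
    # Normalize to UTC and keep only the date part.
    version_date = captured.astimezone(timezone.utc).strftime("%Y-%m-%d")
    return (
        f'<a href="{escape(uri_a, quote=True)}" '
        f'data-versionurl="{escape(uri_m, quote=True)}" '
        f'data-versiondate="{version_date}">{escape(text)}</a>'
    )


# Hypothetical URIs, for illustration only.
example = robust_link(
    "https://portal.example.org/a/1",
    "https://archive.example.org/20190123120000/https://portal.example.org/a/1",
    datetime(2019, 1, 23, 12, 0, tzinfo=timezone.utc),
    "slides",
)
print(example)
```

A reader who follows the link gets the live artifact from `href`, while a Robust Links aware client can fall back to the snapshot in `data-versionurl` if the live page has changed or disappeared.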

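
The "Tracking Artifacts - Description" slide asks for a portal API that supports access by web identity and access to contributions "since …". The sketch below shows, under stated assumptions, how a tracker step could use such an API to return only the URIs of newly discovered artifacts. Everything here is hypothetical: the `Contribution` shape, the `PortalClient` signature, the in-memory `fake_portal`, and the example.org URIs stand in for a real portal API, which the slides do not specify.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Callable, Iterable, List, Set


@dataclass(frozen=True)
class Contribution:
    """One artifact as a portal might report it (hypothetical shape)."""
    uri: str
    published: datetime


# Hypothetical portal client: (web identity, since) -> contributions.
PortalClient = Callable[[str, datetime], Iterable[Contribution]]


def track(identity: str, since: datetime, fetch: PortalClient,
          seen: Set[str]) -> List[str]:
    """Return URIs of artifacts published at or after `since` that were not tracked before."""
    new_uris = []
    for item in fetch(identity, since):
        # Re-check the date locally in case the portal ignores "since".
        if item.published >= since and item.uri not in seen:
            seen.add(item.uri)
            new_uris.append(item.uri)
    return new_uris


# In-memory stand-in for a portal, for illustration only.
_FAKE_DATA = {
    "alice": [
        Contribution("https://portal.example.org/alice/1",
                     datetime(2018, 7, 15, tzinfo=timezone.utc)),
        Contribution("https://portal.example.org/alice/2",
                     datetime(2018, 8, 3, tzinfo=timezone.utc)),
    ]
}


def fake_portal(identity: str, since: datetime) -> List[Contribution]:
    return [c for c in _FAKE_DATA.get(identity, []) if c.published >= since]
```

Calling `track("alice", datetime(2018, 8, 1, tzinfo=timezone.utc), fake_portal, set())` yields only the second artifact, and a repeated call with the same `seen` set yields nothing, which mirrors the "URI(s) of new artifact(s)" result the slide describes.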