Collecting the organizational scholarly record

Slides used for a keynote presentation at the VIVO 2019 Conference in Podgorica, Montenegro.

Abstract: The invitation to present a keynote at the VIVO Conference, together with the goal of the VIVO platform as stated on the DuraSpace site (to create an integrated record of the scholarly work of an organisation), reminded me of various efforts with similar goals that I have been involved in over the past years. EgoSystem (2014) attempted to gather information about postdocs who had left the organisation, leaving little or no contact details behind. Autoload (2017), an operational service, discovers papers by organisational researchers in order to upload them to the institutional repository. myresearch.institute (2018), an experiment that is still in progress, discovers artefacts that researchers deposit in web productivity portals and subsequently archives them. More recently, I have been involved in thinking about the future of NARCIS, a portal that provides an overview of research productivity in The Netherlands. The approaches taken in all these efforts share a characteristic motivated by a desire to devise scalable and sustainable solutions: let machines rather than humans do the work. In this talk, I will provide an overview of these efforts, their motivations, the challenges involved, and the nature of their success (if any).


  1. Collecting the Organizational Scholarly Record
     Herbert Van de Sompel, DANS, @hvdsomp, https://orcid.org/0000-0002-0715-6126
     VIVO Conference 2019, September 5 2019, Podgorica, Montenegro
  2. 2013 - EgoSystem
     James Powell, Harihar Shankar, Marko Rodriguez, and Herbert Van de Sompel (2014) EgoSystem: Where are our Alumni? code{4}lib journal, issue 24. https://journal.code4lib.org/articles/9519
  3. EgoSystem Team
     • Los Alamos National Laboratory:
       • James Powell
       • Harihar Shankar
       • Herbert Van de Sompel
     • Aurelius:
       • Marko Rodriguez
  4. Motivation
     • When postdocs leave LANL, the local information systems retain very little information about them
     • But senior management is interested in engaging them after they leave LANL as Ambassadors and Advocates
     • They need answers to questions like:
       • Who is currently working where?
       • Who is involved in what areas of research?
       • Who might serve as advocates for the Lab?
       • Who knows someone who knows someone we need to connect with?
  5. 2012 - Initial Approach: Set Up a VIVO Instance
     • 2700+ records were ingested from LANL Postdoc Office data to create initial user profiles
     • 8 postdoc alumni were contacted to complete their profile
  6. Doubts about the VIVO Instance
     • Up-to-date information at all times is essential to meet the needs of senior LANL management
     • Some existing VIVO instances seemed to have been pre-populated but then remained static after launch
     • Would current and former postdocs be interested in maintaining a professional profile on a VIVO instance intended to help out LANL?
  7. 2013 - New Approach: Leverage Network-Level Information
     • Leverage public, network-level information pertaining to LANL Alumni
       • Find their network presences: social portals, scientific portals, homepages, etc.
       • Recurrently collect information from those presences: current employer, social network neighborhood, geo location, etc.
       • Create applications based on that information
     • Rationale: People have incentives to keep network-layer information up-to-date
     • Goal: Devise a sustainable approach to gather and use up-to-date information pertaining to LANL Alumni
  9. Available information elements for postdocs:
     • Z#
     • Name
     • Institutions:
       o PhD University; LANL; Institution after LANL
     • Field of Study
     • Discipline
  10. Find network identities:
      • Run various queries based on the information elements in:
        o Yahoo BOSS API; MS Academic Search API
      • Search for candidate identities:
        o LinkedIn; MS Academic; Twitter; Homepage; Blogger; SlideShare; Wikipedia
      • Rank and select candidate identities
        o Reward when: the same identity is returned by various searches; content matches the information elements
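
The ranking step on this slide can be illustrated with a minimal sketch. It is not the EgoSystem code: the `Candidate` type, the additive scoring, and the `threshold` value are assumptions made for illustration. It only encodes the two rewards named on the slide (an identity found by several searches, and content that matches known information elements).

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Candidate:
    """Hypothetical search hit: a profile URL plus the text found with it."""
    url: str
    text: str


def score_candidates(hits_per_query, known_elements):
    """Score candidate identities for one postdoc.

    hits_per_query: list of lists of Candidate, one inner list per query.
    known_elements: strings known about the postdoc (name, institutions, field).
    """
    query_votes = Counter()
    texts = {}
    for hits in hits_per_query:
        # Reward 1: count each URL once per query that returned it.
        for url in {c.url for c in hits}:
            query_votes[url] += 1
        for c in hits:
            texts.setdefault(c.url, c.text)

    scores = {}
    for url, votes in query_votes.items():
        content = texts[url].lower()
        # Reward 2: one point per known information element found in the content.
        matches = sum(1 for e in known_elements if e.lower() in content)
        scores[url] = votes + matches
    return scores


def select_identity(scores, threshold=3):
    """Return the best-scoring URL, or None if nothing reaches the threshold."""
    if not scores:
        return None
    best = max(scores, key=scores.get)
    return best if scores[best] >= threshold else None
```

Returning `None` below the threshold is one way to favor precision over recall: a postdoc is left without an identity rather than linked to a doubtful one.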
  11. LinkedIn Identity
  12. LinkedIn Identity
  13. LinkedIn Identity
  14. Twitter Identity
  15. Network-derived information:
      • Identities:
        o LinkedIn; MS Academic; Twitter; Homepage; Blogger; SlideShare; Wikipedia
      • Additional information elements:
        o Current institution; geo location; updated discipline
  16. [Chart: Web Identities Discovered Per Postdoc; categories: none, one, two, three, four, five]
  17. [Chart: Resulting Identity Types per Postdoc; categories: LANL, MS Academic, LinkedIn, Twitter]
  18. Evaluation of the Discovery Algorithm
      • Random set of 100 postdocs
      • MS Academic
        o 86 correct
          - 71 correctly discovered identities
          - 15 correctly labeled as not having an identity
        o 14 incorrect
          - 2 discovered identities did not match the postdoc
          - 12 existing identities were not discovered
      • Algorithms favored precision over recall
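
The closing remark about precision and recall can be checked against the MS Academic counts on this slide. The sketch below assumes one reading of the four categories as confusion-matrix cells (71 true positives, 15 true negatives, 2 false positives, 12 false negatives); the slide itself does not use those terms.

```python
def evaluate(tp, tn, fp, fn):
    """Accuracy, precision, and recall from confusion-matrix counts."""
    total = tp + tn + fp + fn
    return {
        "accuracy": (tp + tn) / total,
        "precision": tp / (tp + fp),   # share of discovered identities that were right
        "recall": tp / (tp + fn),      # share of existing identities that were found
    }


# Counts taken from the slide; the mapping to tp/tn/fp/fn is an assumption.
metrics = evaluate(tp=71, tn=15, fp=2, fn=12)
for name, value in metrics.items():
    print(f"{name}: {value:.3f}")
```

Under that reading, precision is 71/73 and recall is 71/83, so precision is indeed the higher of the two.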
  19. Network-derived information:
      • Network neighborhood:
        o Social network ~ Twitter: followers, followed
        o Academic network ~ co-authors MS Academic
        o Affiliations ~ LinkedIn, homepage
      • Artifacts: papers, slide decks
      • Concepts
  20. Property Graph Representation of Resulting Information
      • Platonic vertices
        o Persons
        o Institutions
        o Artifacts
        o Concepts
      • Affiliation vertices
        o Different types
        o Different time periods
      • Graph extent, starting from 3,005 postdocs:
        o Vertices: 9,015,844
        o Edges: 19,399,683
  21. Property Graph Representation of Resulting Information
  22. Graph Database for Storage/Retrieval/Analysis
      Titan Distributed Graph Database http://titan.thinkaurelius.com/
  23. EgoSystem Application
      • Simple web query interface
      • Shareable profile page for individuals
      • Graph analytics (aggregate social networks, path analysis) and graph visualization
      • “Who’s where” search (for when the LANL Director travels)
      • Capability to add a non-LANL person to the graph
        o To find the closest path to that person via a LANL postdoc
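
The "closest path" capability can be sketched as a breadth-first search over an undirected graph. This is a plain-Python stand-in, not the Titan query the project used; the adjacency-dict representation and all vertex names are hypothetical.

```python
from collections import deque


def closest_path(graph, start, targets):
    """Shortest path (fewest edges) from start to any vertex in targets.

    graph: dict mapping a vertex to an iterable of neighboring vertices.
    Returns the path as a list of vertices, or None if no target is reachable.
    """
    if start in targets:
        return [start]
    parents = {start: None}          # also serves as the visited set
    queue = deque([start])
    while queue:
        node = queue.popleft()
        for neighbor in graph.get(node, ()):
            if neighbor in parents:
                continue
            parents[neighbor] = node
            if neighbor in targets:
                # Walk parent links back to the start, then reverse.
                path = [neighbor]
                while parents[path[-1]] is not None:
                    path.append(parents[path[-1]])
                return path[::-1]
            queue.append(neighbor)
    return None


# Hypothetical example: an outside person linked to a postdoc via a co-author.
graph = {
    "outsider": ["coauthor"],
    "coauthor": ["outsider", "postdoc_a"],
    "postdoc_a": ["coauthor"],
    "postdoc_b": [],
}
print(closest_path(graph, "outsider", {"postdoc_a", "postdoc_b"}))
```

Because breadth-first search explores vertices in order of distance, the first postdoc it reaches is one of the closest.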
  24. Success?
      • At the end of the demo meeting, the director said (paraphrasing)
        o “I didn’t know what I wanted when we first met but this looks like what I want, what I need.”
      • Project discontinued because of the inability to access LinkedIn data in a legitimate manner
      • Because the processes are heuristic-based, the database and the query results are not necessarily correct or complete. This made EgoSystem an approximating application.
      • Fantastic 2-month (~6 MM) project that did not yield a production system but in which we learned an awful lot
  25. 2016 - Autoload
      James Powell, Martin Klein, and Herbert Van de Sompel (2017) Autoload: a pipeline for expanding the holdings of an Institutional Repository enabled by ResourceSync. code{4}lib journal, issue 36. https://journal.code4lib.org/articles/12427
  26. 2018 - myresearch.institute
      The Scholarly Orphans project is funded by the Andrew W. Mellon Foundation
  27. myresearch.institute Team
      • Los Alamos National Laboratory:
        • Lyudmila Balakireva
        • Martin Klein
        • James Powell
        • Harihar Shankar
        • Herbert Van de Sompel
      • Old Dominion University:
        • Sawood Alam
        • Grant Atkins
        • Shawn Jones
        • Mat Kelly
        • Michael L. Nelson
  28. Research and Research Communication on the Web
      • Consideration
        • Researchers are increasingly using a variety of web platforms for collaboration and communication
      • Why?
        • Many of these platforms have desirable characteristics
          • Versioning
          • Time stamping
          • Social embedding
        • Their institutions do not provide platforms that have global reach
          • Collaboration, cf. GitHub ~ productivity
          • Communication, cf. SlideShare ~ visibility
  29. Research and Research Communication on the Web
      • Consideration
        • Researchers are increasingly using a variety of web platforms for collaboration and communication
      • Web Platforms:
        • Dedicated to scholarship:
          • Commercial: e.g., FigShare, Publons
          • Not for profit: e.g., OSF, Zenodo
        • General purpose:
          • Commercial: e.g., GitHub, SlideShare
          • Not for profit: e.g., Wikipedia, Wikidata
  30. Emma Schymanski
      https://orcid.org/0000-0001-6868-8145
      https://github.com/schymane
      https://www.slideshare.net/EmmaSchymanski
      https://figshare.com/authors/Emma_Schymanski/5087039
      https://publons.com/author/1538491/emma-schymanski#profile
      https://www.eawag.ch/en/aboutus/portrait/organisation/staff/profile/emma-schymanski/
  31. Shawn Jones
      https://orcid.org/0000-0002-4372-870X
      http://www.shawnmjones.org/
      https://github.com/shawnmjones
      https://www.slideshare.net/shawnmjones
      https://en.wikipedia.org/wiki/User:Shawnmjones
      https://www.blogger.com/profile/17827543974149663194
  32. Research and Research Communication on the Web
      • Consideration
        • Researchers deposit artifacts in web platforms
      • Status quo - The researchers’ institutions are in the dark
        • Do not know about the existence of these artifacts
        • Do not have a copy of these artifacts
  33. Research and Research Communication on the Web
      • Consideration
        • Researchers deposit artifacts in web platforms
      • Status quo - Uncertainty regarding long-term access
        • Commercial: changing business model, no preservation commitment
        • Not for profit: unpredictable funding stream
  34. Research and Research Communication on the Web
      • Consideration
        • Researchers deposit artifacts in web platforms
      • Status quo - Not systematically archived
        • No frameworks like LOCKSS/Portico exist for these artifacts
        • Researchers only selectively deposit artifacts in portals that provide archival guarantees, in order to obtain a citable DOI
        • Can’t expect researchers to (also) upload all artifacts in IRs
        • Web archives only incidentally archive these artifacts, cf. anecdotal & Hiberlink project evidence
      Martin Klein, Herbert Van de Sompel, et al. (2014) Scholarly context not found. In: PLOS ONE https://doi.org/10.1371/journal.pone.0115253
  35. Emma’s SlideShare Artifact: 0 Mementos
      https://www.slideshare.net/EmmaSchymanski/dmcm2018-community-resources-connecting-chemistry-and-toxicity-knowledge
      http://timetravel.mementoweb.org/
  36. Shawn’s GitHub Artifact: 1 Memento
      https://github.com/shawnmjones/mediawiki
      https://web.archive.org/web/*/https://github.com/shawnmjones/mediawiki
  37. Evidence from the Hiberlink Project
      Web resources referenced in the Elsevier corpus (1996-2012) without a representative Memento in public web archives
  38. The Scholarly Orphans Project: How to Archive these Artifacts?
      • Explores an institution-driven paradigm
        • Academic institutions typically have a long shelf life
          • A basic premise underlying e.g., LOCKSS, perma.cc
        • An academic institution should be interested in capturing the artifacts (intellectual property) its scholars deposit on the web
        • Collecting and archiving such artifacts aligns with the mission of academic libraries
  39. An Institutional Perspective
  40. The Scholarly Orphans Project: How to Archive these Artifacts?
      • Explores a paradigm inspired by web archiving
        • Scale of the problem
        • Can’t expect researchers to upload all artifacts in an institutional repository
        • Bilateral agreements for archival purposes with most web portals unlikely
  41. A Web Archiving Perspective
  42. myresearch.institute Prototype Pipeline
  43. Tracking Artifacts
  44. Tracking Artifacts - Description
      • In order to track artifacts that were recently deposited by an institutional researcher in a portal, one reasonably needs:
        • The web identity of the researcher in the portal
          • Algorithmic discovery, cf. EgoSystem
          • Discovery via a registry, cf. ORCID paper
          • Manual collection
        • A portal API that supports:
          • Access by web identity
          • Access to contributions “since …” for the web identity
      • Result of tracking:
        • URI(s) of new artifact(s) discovered in the portal
      Klein, M., and Van de Sompel, H. (2017) Discovering Scholarly Orphans Using ORCID. Proceedings of the 2017 ACM/IEEE Joint Conference on Digital Libraries https://arxiv.org/abs/1703.09343
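
The two API capabilities listed on this slide (access by web identity, and access to contributions since a given moment) can be sketched as a small polling loop. Nothing here is the myresearch.institute implementation: `FakePortal`, `contributions_since`, and `track` are hypothetical names, and the in-memory portal stands in for an HTTP API.

```python
from datetime import datetime, timezone


class FakePortal:
    """In-memory stand-in for a portal API; a real portal is queried over HTTP."""

    def __init__(self, contributions):
        # contributions: {web_identity: [(artifact_uri, created_datetime), ...]}
        self._contributions = contributions

    def contributions_since(self, web_identity, since):
        """Artifacts created by this web identity strictly after `since`."""
        items = self._contributions.get(web_identity, [])
        return [(uri, created) for uri, created in items if created > since]


def track(portal, web_identity, last_seen):
    """Return (new artifact URIs in creation order, updated last-seen time)."""
    new_items = sorted(
        portal.contributions_since(web_identity, last_seen),
        key=lambda item: item[1],
    )
    if not new_items:
        return [], last_seen
    return [uri for uri, _ in new_items], new_items[-1][1]


# Hypothetical usage: poll once, remember the newest timestamp, poll again.
start = datetime(2018, 8, 1, tzinfo=timezone.utc)
portal = FakePortal({
    "shawnmjones": [
        ("https://portal.example/artifact/1", datetime(2018, 8, 5, tzinfo=timezone.utc)),
    ],
})
uris, last_seen = track(portal, "shawnmjones", start)
print(uris, last_seen)
```

Storing the newest timestamp per researcher and portal means each run returns only the URIs that have not been seen before, which is the tracking result the slide describes.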
  45. Tracking Artifacts - Challenges
      • Portal API access by web identity
        • Broadly supported by general purpose portals
        • Typically not supported by scholarly portals
          • Some lack an API altogether
          • Should add ORCID access to APIs
          • OAI-PMH and ResourceSync need sets per web identity
      • Professional versus personal contributions
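
To make the point about sets per web identity concrete, the sketch below builds an OAI-PMH `ListRecords` request that combines the protocol's standard `set` and `from` arguments. The repository base URL and the `orcid:` set naming scheme are hypothetical; the slide says such per-identity sets are needed, not that any portal offers them.

```python
from urllib.parse import urlencode


def list_records_url(base_url, set_spec, since, metadata_prefix="oai_dc"):
    """Build an OAI-PMH ListRecords request URL for one set, from a given date.

    verb, metadataPrefix, set, and from are standard OAI-PMH arguments.
    The set_spec value passed in is an assumption of this sketch.
    """
    params = {
        "verb": "ListRecords",
        "metadataPrefix": metadata_prefix,
        "set": set_spec,      # hypothetical: one set per researcher identity
        "from": since,        # only records created or changed since this date
    }
    return f"{base_url}?{urlencode(params)}"


url = list_records_url(
    "https://repository.example.org/oai",   # hypothetical endpoint
    "orcid:0000-0002-4372-870X",            # hypothetical per-identity set
    "2018-08-01",
)
print(url)
```

With a set per identity, a tracker could ask a scholarly portal the same question it asks a general purpose portal: what has this researcher added since the last check?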
  46. Capturing Artifacts
  47. Capturing Artifacts - Description
      • The capture process takes as input the URI of a new artifact discovered in a portal
      • Its task is to create a representative institutional capture of the artifact
      • Result of capture:
        • WARC file for the new artifact in an institutional archive
  48. Capturing Artifacts - Challenges
      • Create a high-fidelity capture using an approach that scales for a steady stream of new artifacts
        • Handle dynamic content & interactive features of web pages
        • Determine the web boundary of the artifact
          • More than the input artifact URI
          • The boundary is in the eye of the beholder
      • We made a significant breakthrough with the Memento Tracer framework
      • Others (cf. webrecorder.io Autopilot, IA Brozzler) are working on the same problem
      Memento Tracer: http://tracer.mementoweb.org
      Autopilot: https://blog.webrecorder.io/2019/08/14/autopilot
      Brozzler: https://github.com/internetarchive/brozzler
  49. Capturing Artifacts
  50. Memento Tracer - Framework
      http://tracer.mementoweb.org
  51. Archiving Artifacts
  52. Archiving Artifacts - Description
      • The archiving process takes as input the URI of a WARC file generated by the capture process
      • Its task is to ingest the WARC file in a cross-institutional web archive
      • This can be achieved using off-the-shelf web archiving software, e.g., pywb, OpenWayback
      • Result of archiving:
        • Mementos pertaining to the newly discovered artifact in a cross-institutional, Memento-compliant web archive
  53. Archiving Artifacts - Challenges
      • Attempted to use ipwb, a pywb version that uses IPFS
        • Cross-institutional distributed file system with redundancy
        • Ran out of time to get it operationally stable
      Sawood Alam, Mat Kelly, and Michael L. Nelson (2016) InterPlanetary Wayback: The Permanent Web Archive https://doi.org/10.1145/2910896.2925467
  54. myresearch.institute - Researchers
      • Uniquely identified by ORCIDs
      • Web identities in multiple portals
      • Create various types of artifacts
  55. myresearch.institute - Portals
      • Tracking started August 27 2018
      • Tracks artifacts created from August 1 2018 onward
  56. Scholarly Orphans - Pipeline
      • 16,005 unique artifacts tracked, captured, and archived between August 1 2018 and August 28 2019
      • 60MB event database
      • 83GB of WARC files
      • 3GB of web archive index
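
A quick back-of-the-envelope calculation puts these pipeline figures on a per-day and per-artifact basis. The only inputs are the numbers on the slide; treating 83GB as decimal gigabytes is an assumption, and the transcript does not give the number of researchers tracked, so no per-researcher figure is derived.

```python
from datetime import date

artifacts = 16_005                     # unique artifacts, from the slide
warc_bytes = 83 * 10**9                # 83GB of WARC files, assuming decimal GB
start, end = date(2018, 8, 1), date(2019, 8, 28)

days = (end - start).days              # length of the tracked period in days
per_day = artifacts / days             # average artifacts handled per day
avg_warc_mb = warc_bytes / artifacts / 10**6   # average WARC volume per artifact

print(f"{days} days, {per_day:.1f} artifacts/day, {avg_warc_mb:.1f} MB/artifact")
```

These are averages over the whole period, so they say nothing about peaks in the stream of new artifacts.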
  57. Showtime: myresearch.institute Portal
      https://myresearchinstitute.org
  58. Success?
      • “Interesting project! I’m happy to participate.” “One more thing, is it possible to get a copy of the URI-Rs that you guys detected so that I can feed them into an archive of my choice?...”
      • Prototype pipeline developed over 8 months (24 MM)
      • Metrics of the prototype demonstrate that researchers generate a lot of artifacts (that their institutions are typically not aware of)
      • Metrics of the prototype suggest it should be possible to run a production pipeline at the scale of an academic institution
      • But would they …?
  59. Some Final Thoughts
      • For a number of reasons, applications that leverage network-level information at scale (e.g. EgoSystem, myresearch.institute, Autoload) tend not to be perfect. But they are automatic.
      • Do institutions reserve sufficient resources for innovation and failure? The alternative seems to be outsourcing and loss of expertise.
      • Ideas/visions are rarely fully realized when working on them. But many times, the work does improve on the status quo. So keep dreaming and working!
  60. Collecting the Organizational Scholarly Record
      Herbert Van de Sompel, DANS, @hvdsomp, https://orcid.org/0000-0002-0715-6126
