
Comparing the Performance of OAI-PMH with ResourceSync


Comparing the Performance of OAI-PMH with ResourceSync
Presentation at Open Repositories 2019
Petr Knoth, Matteo Cancellieri, Martin Klein


  1. Comparing the Performance of OAI-PMH with ResourceSync
     @petrknoth @mart1nkle1n
     OR 2019, 06/12/2019, Hamburg, Germany
     Petr Knoth, Matteo Cancellieri (Knowledge Media Institute, The Open University, UK)
     Martin Klein (Research Library, Los Alamos National Laboratory, USA)
  2. ResourceSync and repositories
     “A single scientific repository is of limited value, real benefits come from the ability to exchange data within a network … interoperability allows us to exploit today's computational power so that we can aggregate, data mine, create new tools and services, and generate new knowledge from repository content.” - COAR
  3. ResourceSync and repositories
     Protocols for data exchange are the blood of the scholarly communication system
  4. Aggregators and ResourceSync
     ResourceSync (CORE FastSync)
     3rd parties: data analysis, TDM
  5. Repository aggregators have large full text collections
     core.ac.uk stats:
     • 13,117,488 Hosted full texts
     • 135,539,113 Metadata records
     • ~78m Links to full text
     • 15TB of raw plain text
     • 4,123 Data providers
  6. Many OAI-PMH implementation challenges …
     • Locating full text URLs in metadata
     • Restrictions on full text downloading
     • Sequential nature of OAI-PMH
     • Failing resumption tokens
     • Incremental updates
     • Scalability
     • Metadata interoperability
     • Reliability
     • No content harvesting support
  7. Speed of OAI-PMH implementations
  8. Aggregators and ResourceSync
     ResourceSync (CORE FastSync)
     3rd parties: data analysis, TDM
  9. Aggregators and ResourceSync
     ResourceSync (CORE FastSync)
     3rd parties: data analysis, TDM
  10. Aggregators have a lot of usage
      • January 2019: CORE reached over 10M monthly active users for the first time
      • 571% increase from January 2018
      • core.ac.uk is, by usage, in the top 0.0009% of global websites
  11. Aggregator’s challenge
      • Stay up to date despite thousands of data providers
      • Efficiently expose large amounts of data to many users:
        • Human users
        • Machines (scalability!)
      • OAI-PMH implementations can hardly deal with the job:
        • Scalability
        • Metadata inconsistency
        • Support for metadata harvesting only
  12. Research question
      Is ResourceSync better suited for the job than OAI-PMH?
  13. OAI-PMH - Background
      http://openarchives.org/pmh/
      • Recurrent metadata exchange from a Data Provider to Service Providers
      • XML metadata only
      • Repository centric
      • Devised 1999-2002, prior to REST, prior to dominance of web search engines
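Slide 6 lists the sequential nature of OAI-PMH and failing resumption tokens among the challenges. The sketch below is not taken from the presentation; it illustrates why a harvest is sequential: each `ListIdentifiers` page carries the `resumptionToken` needed to request the next page, so pages cannot be requested in parallel and one failed token stalls the run. The endpoint URL, the record identifiers and the `fetch` callable are hypothetical, and the canned pages stand in for real HTTP responses.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

OAI = {"oai": "http://www.openarchives.org/OAI/2.0/"}

def parse_page(xml_text):
    """Return (identifiers, resumption token or None) for one response page."""
    root = ET.fromstring(xml_text)
    ids = [el.text for el in root.findall(".//oai:header/oai:identifier", OAI)]
    token_el = root.find(".//oai:resumptionToken", OAI)
    # An absent or empty resumptionToken marks the last page.
    token = token_el.text.strip() if token_el is not None and token_el.text else None
    return ids, token or None

def request_url(base_url, token=None, metadata_prefix="oai_dc"):
    """Build the request URL; resumptionToken excludes all other arguments."""
    if token is None:
        args = {"verb": "ListIdentifiers", "metadataPrefix": metadata_prefix}
    else:
        args = {"verb": "ListIdentifiers", "resumptionToken": token}
    return base_url + "?" + urlencode(args)

def harvest(base_url, fetch):
    """Follow resumption tokens one page at a time; return (ids, request count)."""
    ids, token, requests_made = [], None, 0
    while True:
        page_ids, token = parse_page(fetch(request_url(base_url, token)))
        requests_made += 1
        ids.extend(page_ids)
        if token is None:
            return ids, requests_made

def page(identifiers, token):
    """Build a canned ListIdentifiers response (illustrative data only)."""
    headers = "".join(
        f"<header><identifier>{i}</identifier></header>" for i in identifiers
    )
    return (
        '<OAI-PMH xmlns="http://www.openarchives.org/OAI/2.0/"><ListIdentifiers>'
        f"{headers}<resumptionToken>{token}</resumptionToken>"
        "</ListIdentifiers></OAI-PMH>"
    )

BASE = "http://repo.example.org/oai"  # hypothetical endpoint
PAGES = {
    request_url(BASE): page(["oai:example:1", "oai:example:2"], "page2"),
    request_url(BASE, "page2"): page(["oai:example:3"], ""),
}

# A dictionary lookup plays the role of the HTTP client here.
ids, requests_made = harvest(BASE, PAGES.__getitem__)
print(ids, requests_made)
```

Passing `fetch` as a parameter keeps the token-following loop separate from the transport, so the same loop could be driven by a real HTTP client.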
  14. ResourceSync - Background
      http://www.openarchives.org/rs/1.1/resourcesync
      • Synchronization of resources from a Source to Destinations
      • Web resources, anything with an HTTP URI & representation
      • Resource centric
      • Devised 2012-2013, leverages key ingredients of web interoperability, existing specifications, existing Search Engine Optimization practice
  15. ResourceSync in a Nutshell
  16. ResourceSync Capabilities
  17. ResourceSync Capabilities
  18. ResourceSync Capabilities
  19. ResourceSync Capabilities
  20. ResourceSync Capabilities
  21. Many to One - Aggregator
  22. ResourceSync is based on Sitemaps
      <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
        <url>
          <loc>http://example.com/res1</loc>
          <lastmod>2013-01-02T13:00:00Z</lastmod>
        </url>
        <url>
          <loc>http://example.com/res2</loc>
          <lastmod>2013-01-02T14:00:00Z</lastmod>
        </url>
        …
      </urlset>
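As a minimal sketch of how a harvester might read the Sitemap from slide 22, the following uses only Python's standard library. The helper names `read_sitemap` and `changed_since` are illustrative assumptions, not part of any ResourceSync tool, and the elided entries (the "…" on the slide) are simply left out of the sample.

```python
import xml.etree.ElementTree as ET
from datetime import datetime, timezone

SM = {"sm": "http://www.sitemaps.org/schemas/sitemap/0.9"}

# The two entries shown on the slide.
SITEMAP = """\
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
  <url>
    <loc>http://example.com/res1</loc>
    <lastmod>2013-01-02T13:00:00Z</lastmod>
  </url>
  <url>
    <loc>http://example.com/res2</loc>
    <lastmod>2013-01-02T14:00:00Z</lastmod>
  </url>
</urlset>
"""

def read_sitemap(xml_text):
    """Return a list of (loc, lastmod as an aware UTC datetime) tuples."""
    root = ET.fromstring(xml_text)
    entries = []
    for url in root.findall("sm:url", SM):
        loc = url.findtext("sm:loc", namespaces=SM)
        lastmod = url.findtext("sm:lastmod", namespaces=SM)
        # Assumes the UTC "Z" timestamp form used on the slide.
        when = datetime.strptime(lastmod, "%Y-%m-%dT%H:%M:%SZ")
        entries.append((loc, when.replace(tzinfo=timezone.utc)))
    return entries

def changed_since(entries, since):
    """Keep only the resources modified after the given time."""
    return [loc for loc, when in entries if when > since]

entries = read_sitemap(SITEMAP)
cutoff = datetime(2013, 1, 2, 13, 30, tzinfo=timezone.utc)
print(changed_since(entries, cutoff))
```

Comparing `lastmod` against the time of the previous run is one simple way to decide which resources need to be fetched again.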
  23. ResourceSync Resource List
      <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
              xmlns:rs="http://www.openarchives.org/rs/terms/">
        <rs:md capability="resourcelist" at="2019-06-11T09:00:00Z" completed="2019-06-11T09:00:44Z" />
        <url>
          <loc>http://example.com/res1_metadata.xml</loc>
          <lastmod>2019-06-02T13:00:00Z</lastmod>
          <rs:md hash="md5:1584abdf8ebdc9802ac0c6a7402c03b6" length="823" type="text/xml" />
        </url>
      </urlset>
  24. Resource List with Link
      <urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
              xmlns:rs="http://www.openarchives.org/rs/terms/">
        <rs:md capability="resourcelist" at="2019-06-11T09:00:00Z" completed="2019-06-11T09:00:44Z" />
        <url>
          <loc>http://example.com/res1_metadata.xml</loc>
          <lastmod>2019-06-02T13:00:00Z</lastmod>
          <rs:md hash="md5:1584abdf8ebdc9802ac0c6a7402c03b6" length="823" type="text/xml" />
          <rs:ln href="http://example.com/res1_content.pdf" rel="describes" length="8876" type="application/pdf" />
        </url>
      </urlset>
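The Resource List on slide 24 can be read the same way as a plain Sitemap, with the `rs:md` and `rs:ln` elements supplying the extra information. The sketch below is an illustration only: `read_resource_list` and `verify` are hypothetical helper names, and the code simply shows how a harvester could pick up the explicit `describes` link to the full text and check a downloaded payload against the advertised hash and length.

```python
import hashlib
import xml.etree.ElementTree as ET

NS = {
    "sm": "http://www.sitemaps.org/schemas/sitemap/0.9",
    "rs": "http://www.openarchives.org/rs/terms/",
}

# The Resource List shown on the slide.
RESOURCE_LIST = """\
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9"
        xmlns:rs="http://www.openarchives.org/rs/terms/">
  <rs:md capability="resourcelist" at="2019-06-11T09:00:00Z"
         completed="2019-06-11T09:00:44Z" />
  <url>
    <loc>http://example.com/res1_metadata.xml</loc>
    <lastmod>2019-06-02T13:00:00Z</lastmod>
    <rs:md hash="md5:1584abdf8ebdc9802ac0c6a7402c03b6" length="823"
           type="text/xml" />
    <rs:ln href="http://example.com/res1_content.pdf" rel="describes"
           length="8876" type="application/pdf" />
  </url>
</urlset>
"""

def read_resource_list(xml_text):
    """Return (capability, list of per-resource dictionaries)."""
    root = ET.fromstring(xml_text)
    list_md = root.find("rs:md", NS)  # document-level rs:md only
    capability = list_md.get("capability") if list_md is not None else None
    resources = []
    for url in root.findall("sm:url", NS):
        md = url.find("rs:md", NS)
        attrs = md.attrib if md is not None else {}
        links = {ln.get("rel"): ln.get("href") for ln in url.findall("rs:ln", NS)}
        resources.append({
            "loc": url.findtext("sm:loc", namespaces=NS),
            "hash": attrs.get("hash"),
            "length": int(attrs["length"]) if "length" in attrs else None,
            "type": attrs.get("type"),
            "describes": links.get("describes"),  # explicit full text link
        })
    return capability, resources

def verify(resource, payload):
    """Check downloaded bytes against the advertised length and hash."""
    algorithm, _, expected = resource["hash"].partition(":")
    digest = hashlib.new(algorithm, payload).hexdigest()
    return len(payload) == resource["length"] and digest == expected

capability, resources = read_resource_list(RESOURCE_LIST)
for resource in resources:
    print(resource["loc"], "->", resource["describes"])
```

Because the link is explicit, the harvester does not have to guess the full text URL from the metadata record, which is the problem slide 6 describes for OAI-PMH.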
  25. ResourceSync Characteristics
      • Designed to allow synchronization of resources, not just metadata
      • Explicit link between metadata and the described resource
      • Not prescriptive about the metadata format
      • Web-centric
      • Push-based Change Notifications (WebSub)
  26. Comparative Analysis
      1. Assess the speed of OAI-PMH implementations across repositories (see results on slide #7)
  27. Comparative Analysis
      1. Assess the speed of OAI-PMH implementations across repositories
      2. Understand the recall in full-text harvesting
  28. Recall of full-text harvesting: the power of the explicit full text link
  29. Comparative Analysis
      1. Assess the speed of OAI-PMH implementations across repositories
      2. Understand the recall in full-text harvesting
      3. Evaluate simulated metadata harvesting with ResourceSync implementations for:
         a) Standard Mode
            • Resources synchronized via Resource Lists, one resource at a time (per HTTP transaction)
         b) Resource Dump Mode
            • Resources packaged into a Resource Dump, transferred via one HTTP transaction
         c) Batch Mode
            • Resources are packaged into partial and on-demand Resource Dumps, transferred via multiple HTTP transactions
      4. Comparative Analysis
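The three modes on slide 29 differ mainly in how many HTTP transactions are needed to transfer the same set of resources. The following is a back-of-the-envelope sketch under stated assumptions, not the simulation used in the talk: it counts only the transfers themselves (requests for Capability Lists, Resource Lists or dump manifests are ignored), and `batch_size` is a hypothetical parameter for the number of resources per partial dump.

```python
import math

def http_transactions(n_resources, mode, batch_size=None):
    """Estimate the HTTP transactions needed to transfer n_resources."""
    if n_resources < 0:
        raise ValueError("n_resources must be non-negative")
    if mode == "standard":
        # One resource per HTTP transaction.
        return n_resources
    if mode == "dump":
        # Everything packaged into a single Resource Dump.
        return 1 if n_resources else 0
    if mode == "batch":
        # Partial dumps holding up to batch_size resources each.
        if not batch_size or batch_size < 1:
            raise ValueError("batch mode needs a positive batch_size")
        return math.ceil(n_resources / batch_size)
    raise ValueError(f"unknown mode: {mode}")

for mode, size in (("standard", None), ("dump", None), ("batch", 100)):
    print(mode, http_transactions(1000, mode, batch_size=size))
```

The count says nothing about payload size or server-side packaging cost, so it should be read as a rough indication of per-request overhead rather than a prediction of harvesting speed.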
  30. Speed of simulated ResourceSync implementations
  31. Speed of simulated ResourceSync implementations
  32. Why On Demand Resource Dump
      • Many repositories have hundreds of OAI sets:
        • Cannot materialize (too much data and processing requirements)
        • Cannot rely on Resource List (too slow)
      • HATEOAS approach: https://blog.core.ac.uk/2018/03/17/increasing-the-speed-of-harvesting-with-on-demand-resource-dumps/
  33. Recommendations for data providers
      • Adopt ResourceSync at a platform level (EPrints, DSpace, Fedora, etc.)
      • Many considerations:
        • Support Change Lists? Dump? Naming of Capability Lists? On Demand Dumps? How to link resources? WebSub?
        • Guidelines needed!
      • Resource List adoption only viable for small providers
      • Support for on-demand Resource Dumps needed!
      • ResourceSync Client-Server implementation available: https://github.com/resync/resync
      • CORE happy to benchmark repository platforms
      • LANL working on validator
  34. Take-aways
      • OAI-PMH implementations vary substantially in terms of number of records downloaded per second
      • ResourceSync provides up to 10 times faster harvesting speeds with Resource Dumps
      • On-demand Resource Dumps for optimization
        • Not yet part of the standard
      • Thanks to resource linking, low recall is less of an issue!
  35. Comparing the Performance of OAI-PMH with ResourceSync
      Petr Knoth, Matteo Cancellieri (Knowledge Media Institute, The Open University, UK)
      Martin Klein (Research Library, Los Alamos National Laboratory, USA)
