On the Persistence of Persistent Identifiers of the Scholarly Web

CNI Spring 2020 presentation
pre-print at: https://arxiv.org/abs/2004.03011

Published in: Internet

  1. 1. On the Persistence of Persistent Identifiers of the Scholarly Web @mart1nkle1n CNI Spring Virtual Meeting 2020
      Martin Klein, Los Alamos National Laboratory
      martinklein0815@gmail.com, @mart1nkle1n
  2. 2. For more background, details, and results: https://arxiv.org/abs/2004.03011
  18. 18. Regardless of location or phone used, when calling a well-known number that uniquely identifies an (emergency) resource, would you not expect to get the same response? Do you trust in the persistence of that number (and the response)?
  19. 19. No more scary emergency scenarios!
      • Phones == web clients
      • Locations == network environments
      • 911 calls == HTTP requests against DOIs
      • Regardless of the web client and network location, would you not expect the same response from a web server when requesting the same DOI?
  20. 20. Idea…
      • Comparative study investigating scholarly publishers’ responses
        • To common HTTP requests
        • Against DOIs
      • Using different web clients and request methods, resembling
        • Machines “browsing”, crawling
        • Humans browsing
      • From network environments with different subscriptions/licenses
        • Amazon Web Services EC2 instance
        • LANL internal
  21. 21. How does this work? 10.1007/978-3-540-87599-4_38
  22. 22. How does this (not) work? 10.1007/978-3-540-87599-4_38
  23. 23. How does this work via HTTP? https://doi.org/10.1007/978-3-540-87599-4_38
  24. 24. What do you see? https://doi.org/10.1007/978-3-540-87599-4_38
  28. 28. How does this work via HTTP?
      https://doi.org/10.1007/978-3-540-87599-4_38
      → (HTTP redirect) http://link.springer.com/10.1007/978-3-540-87599-4_38
      → (HTTP redirect) https://link.springer.com/10.1007/978-3-540-87599-4_38
      → (HTTP redirect) https://link.springer.com/chapter/10.1007%2F978-3-540-87599-4_38
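The redirect chain on this slide can be traced one hop at a time. The following is a minimal Python sketch, not the tooling used in the study: `redirect_chain`, `head_fetch`, the `max_hops` limit, and the 30-second timeout are hypothetical names and values chosen for illustration. The fetcher is passed in as a parameter so the chain logic can be exercised without network access.

```python
from typing import Callable, List, Optional, Tuple
from urllib.parse import urljoin

# A fetcher takes a URL and returns (status code, Location header or None).
Fetcher = Callable[[str], Tuple[int, Optional[str]]]


def redirect_chain(url: str, fetch: Fetcher, max_hops: int = 10) -> List[Tuple[str, int]]:
    """Follow HTTP redirects one hop at a time and record (url, status) pairs."""
    chain = []
    for _ in range(max_hops + 1):
        status, location = fetch(url)
        chain.append((url, status))
        if not (300 <= status < 400 and location):
            break  # final link reached, or a redirect without a target
        url = urljoin(url, location)  # the Location value may be relative
    return chain


def head_fetch(url: str) -> Tuple[int, Optional[str]]:
    """Hypothetical live fetcher: a single HEAD request, no automatic redirects."""
    import http.client
    from urllib.parse import urlsplit

    parts = urlsplit(url)
    secure = parts.scheme == "https"
    conn_cls = http.client.HTTPSConnection if secure else http.client.HTTPConnection
    conn = conn_cls(parts.netloc, timeout=30)  # timeout value is an assumption
    try:
        path = parts.path or "/"
        if parts.query:
            path += "?" + parts.query
        conn.request("HEAD", path)
        resp = conn.getresponse()
        return resp.status, resp.getheader("Location")
    finally:
        conn.close()
```

Calling `redirect_chain("https://doi.org/10.1007/978-3-540-87599-4_38", head_fetch)` would issue live requests; the last element of the returned list is the final link of the chain.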
  29. 29. DOI dataset
      • Gathering a representative sample is not trivial!
      • Internet Archive conducts crawls of the scholarly domain
        • June 2018: 93 million DOIs
        • Obtained WARC files and extracted DOI redirect chain
      • Investigate publisher distribution
        • Take the final link of the redirect chain and extract the host, e.g.:
          https://link.springer.com/chapter/10.1007%2F978-3-540-87599-4_38 → Domain: springer.com
      • Randomly pick 100 DOIs from each of the 100 most frequent domains
        • 10,000 DOIs
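The sampling procedure on this slide could look roughly like the Python sketch below. It is an illustration under stated assumptions, not the study's code: `registered_domain` and `sample_dois` are hypothetical names, and reducing a host to its last two labels is a naive shortcut that gives `springer.com` for the example above but would mishandle suffixes such as `co.uk`.

```python
import random
from collections import defaultdict
from typing import Dict, List, Optional
from urllib.parse import urlsplit


def registered_domain(url: str) -> str:
    """Naively reduce a URL's host to its last two labels (assumption)."""
    host = urlsplit(url).hostname or ""
    return ".".join(host.split(".")[-2:])


def sample_dois(final_urls: Dict[str, str], top_domains: int = 100,
                per_domain: int = 100, seed: Optional[int] = None) -> List[str]:
    """Pick per_domain random DOIs from each of the most frequent domains.

    final_urls maps a DOI to the final link of its redirect chain.
    """
    by_domain = defaultdict(list)
    for doi, url in final_urls.items():
        by_domain[registered_domain(url)].append(doi)
    # Rank domains by the number of DOIs that end up there.
    ranked = sorted(by_domain, key=lambda d: len(by_domain[d]), reverse=True)
    rng = random.Random(seed)
    sample = []
    for domain in ranked[:top_domains]:
        dois = by_domain[domain]
        sample.extend(rng.sample(dois, min(per_domain, len(dois))))
    return sample
```

With the defaults, 100 domains times 100 DOIs yields the 10,000 DOIs mentioned above, provided every selected domain has at least 100 DOIs.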
  30. 30. Web clients and HTTP requests 1/4
      • HEAD request
        • Server responds with response headers
        • *but no* response body
      • Client: cURL
  32. 32. Web clients and HTTP requests 2/4
      • GET request
        • Server responds with response headers
        • *and* response body
      • Client: cURL
  34. 34. Web clients and HTTP requests 3/4
      • GET+
        • GET request with request headers
        • User Agent (desktop Chrome browser)
        • Specified connection timeout
        • Specified maximum number of redirects
        • Cookies accepted and stored
        • Insecure connections allowed
      • Client: cURL
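The three cURL-based request styles (HEAD, GET, GET+) map onto standard cURL command-line options. The Python sketch below only assembles the argument lists. The user agent string, timeout, redirect limit, and cookie file name are placeholder values, since the slides do not state the settings actually used, and following redirects with `--location` for every method is likewise an assumption.

```python
from typing import List

# Placeholder values: the slides do not give the actual settings.
CHROME_UA = ("Mozilla/5.0 (X11; Linux x86_64) AppleWebKit/537.36 "
             "(KHTML, like Gecko) Chrome/80.0.0.0 Safari/537.36")
CONNECT_TIMEOUT_S = "30"
MAX_REDIRECTS = "20"
COOKIE_FILE = "cookies.txt"


def curl_command(method: str, url: str) -> List[str]:
    """Return a cURL argument list for the HEAD, GET, or GET+ request style."""
    cmd = ["curl", "--location"]  # follow redirects (assumed for all methods)
    if method == "HEAD":
        cmd.append("--head")  # response headers only, no body
    elif method == "GET+":
        cmd += [
            "--user-agent", CHROME_UA,               # desktop Chrome user agent
            "--connect-timeout", CONNECT_TIMEOUT_S,  # connection timeout
            "--max-redirs", MAX_REDIRECTS,           # maximum number of redirects
            "--cookie-jar", COOKIE_FILE,             # accept and store cookies
            "--insecure",                            # allow insecure connections
        ]
    elif method != "GET":
        raise ValueError(f"unknown method: {method}")
    cmd.append(url)
    return cmd
```

The resulting list can be handed to `subprocess.run`; output handling is left out here to keep the sketch short.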
  36. 36. Web clients and HTTP requests 4/4
      • Chrome:
        • GET request via Selenium Webdriver controlled browser
      • Client: Chrome
  38. 38. Regarding response headers, RFC 7231 states (highlights mine):
      “The server SHOULD send the same header fields in response to a HEAD request as it would have sent if the request had been a GET...”
      https://tools.ietf.org/html/rfc7231
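One way to check a server against this recommendation is to compare the header field names of a HEAD response and a GET response for the same URL. The sketch below assumes both header sets are already available as dictionaries; `header_field_diff` is a hypothetical helper, and it compares field names only (case-insensitively), not values.

```python
from typing import Mapping, Set, Tuple


def header_field_diff(head_headers: Mapping[str, str],
                      get_headers: Mapping[str, str]) -> Tuple[Set[str], Set[str]]:
    """Return (field names only in the HEAD response, only in the GET response)."""
    # Header field names are case-insensitive, so normalise before comparing.
    head_names = {name.lower() for name in head_headers}
    get_names = {name.lower() for name in get_headers}
    return head_names - get_names, get_names - head_names
```

Two empty sets mean the server sent the same header fields for both methods.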
  39. 39. HTTP response codes
      • 2xx: Success
      • 3xx: Redirection
      • 4xx: Client error
      • 5xx: Server error
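Grouping status codes into these classes is a small integer division. In the Python sketch below, `status_class` and `tally` are hypothetical names; mapping a missing response (for example a timeout) or any code outside 200 to 599 to an "Err" bucket is an assumption modelled on the "Err" category in the charts.

```python
from collections import Counter
from typing import Iterable, Optional


def status_class(code: Optional[int]) -> str:
    """Map an HTTP status code to 2xx/3xx/4xx/5xx, or "Err" (assumed bucket)."""
    if code is None or not 200 <= code <= 599:
        return "Err"
    return f"{code // 100}xx"


def tally(codes: Iterable[Optional[int]]) -> Counter:
    """Count how many final responses fall into each class."""
    return Counter(status_class(code) for code in codes)
```

Running `tally` once per request method over the final status codes gives the kind of per-method counts that a stacked bar chart can be drawn from.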
  40. 40. Response codes of last link in redirect chain by DOI
      [Chart: response code classes (2xx, 3xx, 4xx, 5xx, Err) per request method (HEAD, GET, GET+, Chrome) for 10,000 DOIs]
  45. 45. Response codes of last link in redirect chain by DOI
      [Chart: response code classes (2xx, 3xx, 4xx, 5xx, Err) per request method (HEAD, GET, GET+, Chrome); annotated 48.3%]
      • < 50% successful requests across all methods
      • > 40% 300-level responses w/ GET
        • 25% of them 200-level w/ HEAD/Chrome
      • 13% 400-level responses w/ HEAD
        • 25% of them w/ 200-level response w/ any other method
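One reading of "successful requests across all methods" is the share of DOIs whose final response is 2xx for every one of the four methods. The Python sketch below computes that share under this assumed reading; `share_all_successful` and the shape of the `results` mapping (DOI to per-method final status code) are hypothetical.

```python
from typing import Dict, Optional

METHODS = ("HEAD", "GET", "GET+", "Chrome")


def is_success(code: Optional[int]) -> bool:
    """True for a 2xx status code; False for anything else or no response."""
    return code is not None and 200 <= code < 300


def share_all_successful(results: Dict[str, Dict[str, Optional[int]]]) -> float:
    """Fraction of DOIs with a 2xx final response for every method."""
    if not results:
        return 0.0
    consistent = sum(
        all(is_success(per_method.get(method)) for method in METHODS)
        for per_method in results.values()
    )
    return consistent / len(results)
```

A DOI with a missing entry for any method counts as not successful, which is a deliberately strict choice in this sketch.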
  46. 46. Response code comparison external vs internal network
      [Two charts: response code classes (2xx, 3xx, 4xx, 5xx, Err) per request method (HEAD, GET, GET+, Chrome); annotated 66.9% and 48.3%]
  47. 47. Response code comparison OA vs non-OA
      [Three charts: response code classes (2xx, 3xx, 4xx, 5xx, Err) per request method (HEAD, GET, GET+, Chrome)]
      • All: 10,000 DOIs, annotated 48.3%
      • OA: 973 DOIs, annotated 59.5%
      • non-OA: 9,027 DOIs, annotated 47.1%
  48. 48. Response code comparison SUB vs non-SUB
      [Three charts: response code classes (2xx, 3xx, 4xx, 5xx, Err) per request method (HEAD, GET, GET+, Chrome)]
      • All: 10,000 DOIs, annotated 66.9%
      • SUB: 1,266 DOIs, annotated 84.3%
      • non-SUB: 8,734 DOIs, annotated 64.4%
  49. 49. Take-aways
      • Frequently, scholarly publishers respond inconsistently to different requests against the same DOI, depending on:
        • HTTP client, request method, network environment
      • Implications for (perceived) persistence of DOIs?
        • Inconsistent DOI resolution does not build trust in DOIs
        • Lack of adherence to standards does not build trust
      • More work needed but initial findings seem to indicate:
        • OA DOIs more consistent than non-OA DOIs
        • DOIs for subscribed & licensed content show more consistency
      • Implications for archival efforts?
        • Test different combinations of clients/request methods/networks
        • Pretend to be as human as possible
  50. 50. Photo credits
      • https://unsplash.com/photos/SBiVq9eWEtQ
      • https://unsplash.com/photos/9e9PD9blAto
      • https://myescambia.com/our-services/public-safety/communications
      • https://unsplash.com/photos/97HfVpyNR1M
      • https://unsplash.com/photos/UYwjKbrwUos
      • https://unsplash.com/photos/Se7vVKzYxTI
      • https://unsplash.com/photos/_geAgtjqLzY
      • https://unsplash.com/photos/r-enAOPw8Rs
      • https://unsplash.com/photos/A4qmsfG6ywM
      • https://unsplash.com/photos/eWqOgJ-lfiI
      • https://unsplash.com/photos/goholCAVTRs
      • https://unsplash.com/photos/vpXbwh6Qk9U
      • https://unsplash.com/photos/K21Dn4OVxNw
      • https://unsplash.com/photos/HzOclMmYryc
      • https://unsplash.com/photos/OW5KP_Pj85Q
  51. 51. Thank you & stay safe!
      Martin Klein, Los Alamos National Laboratory
      martinklein0815@gmail.com, @mart1nkle1n
