Extract DOI from the current web page URL #1799

Open
wants to merge 2 commits into
base: master
from

3 participants
@mrtcode
Contributor

mrtcode commented Dec 14, 2018

For some URLs the DOI can't be extracted correctly, e.g. http://onlinelibrary.wiley.com/journal/10.1111/%28ISSN%291470-9856/issues would yield 10.1111/%28ISSN%291470-9856/issues instead of 10.1111/%28ISSN%291470-9856.

Some URLs can also contain multiple DOIs, e.g. http://api.crossref.org/works/?filter=doi:10.1117/3.1002595.ch10,doi:10.3403/00522251u,doi:10.3403/00522251,doi:10.3403/30217493,doi:10.3403/30289582,doi:10.1117/12.939903,doi:10.3403/02454346u,doi:10.1364/ofc.1979.thf1,doi:10.5772/7558,doi:10.3758/BF03202760,doi:10.3758/bf03195760,doi:10.1006/jmla.1997.2532,doi:10.1037/h0082866.

@dstillman

Member

dstillman commented Dec 14, 2018

Shouldn't we return document instead of multiple when there's a single DOI? Even if it's invalid or not found, there's still no reason to show the Select Items dialog. (That would only make sense if it could actually result in a different item from the one you were expecting.)

@mrtcode

Contributor

mrtcode commented Dec 14, 2018

Do we want to change this behavior for DOIs that are scraped from the document too? And why was it set to return multiple in the first place?

@dstillman

Member

dstillman commented Dec 14, 2018

No. In the page we don't know if it describes the main item for the page. In the URL we do.

@mrtcode

Contributor

mrtcode commented Dec 17, 2018

If a random string that looks like a DOI is extracted from a URL, it would prevent further DOI extraction from the document. So if we want to return a single item, we first have to resolve the DOI metadata to make sure it's valid, and if it isn't, extract and resolve items from the document. The current commit just puts all DOIs into one list, where all items are resolved and the user just needs to select the one they think is correct. To resolve DOIs extracted from the URL separately from those extracted from the document, we would probably need serious modifications to the translator.
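To make the "resolve the URL DOI first, then fall back to the document" flow concrete, here is a rough sketch. It is not the translator's code: `resolveURLFirst` and the injected `resolveDOI` callback (an async function returning a metadata object, or null when nothing is found) are hypothetical names, and the `mode` strings are placeholders for whatever the translator would actually return.

```javascript
// Hypothetical sketch, not the DOI translator's implementation.
// resolveDOI: assumed async function, DOI string -> metadata object or null.
async function resolveURLFirst(urlDOIs, bodyDOIs, resolveDOI) {
	const tried = new Set();

	// A single DOI in the URL is trusted only if it actually resolves
	if (urlDOIs.length === 1) {
		const item = await resolveDOI(urlDOIs[0]);
		if (item) {
			return { mode: "single", items: [item] };
		}
		tried.add(urlDOIs[0]);
	}

	// Otherwise resolve every remaining candidate and let the user choose
	const candidates = [...new Set([...urlDOIs, ...bodyDOIs])]
		.filter(doi => !tried.has(doi));
	const resolved = await Promise.all(candidates.map(doi => resolveDOI(doi)));
	return { mode: "multiple", items: resolved.filter(Boolean) };
}
```

The extra round trip for the URL DOI is the cost of not letting a bogus match hide the DOIs found in the document.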

DOI.js Outdated
var dois = [], m;

// Extract DOIs from the current URL
var rx = /10.[0-9]{4,}?\/[^\s&"'?#,]*[^\s&"'?#\/.,]/g;

@dstillman

dstillman Dec 17, 2018

Member

We usually use re for this, not rx.

Also, seems like we don't need to exclude quotes in this context.

And maybe we do need to URL decode? I'm not sure whether a URL that passes through here is necessarily decoded.

DOI.js Outdated
// Extract DOIs from the current URL
var rx = /10.[0-9]{4,}?\/[^\s&"'?#,]*[^\s&"'?#\/.,]/g;
while (m = rx.exec(url)) {
if (dois.indexOf(m[0]) === -1) {

@dstillman

dstillman Dec 17, 2018

Member

includes instead of indexOf
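
Taken together, the review points so far (name the regex `re`, drop the quote exclusions, URL-decode first, use `includes`) might look roughly like the sketch below. This is not the code from the pull request: `extractDOIsFromURL` is a hypothetical name, the dot after `10` is escaped here as an assumption about the intent, and decoding the whole URL with `decodeURIComponent` is only one possible approach.

```javascript
// Hypothetical helper, not the code under review.
function extractDOIsFromURL(url) {
	let decoded;
	try {
		decoded = decodeURIComponent(url);
	}
	catch (e) {
		// Malformed percent-escapes throw a URIError; fall back to the raw URL
		decoded = url;
	}

	// "10.", at least four digits, a slash, then a suffix that stops at
	// whitespace or URL delimiters and does not end in "/", "." or ","
	const re = /10\.[0-9]{4,}\/[^\s&?#,]*[^\s&?#\/.,]/g;
	const dois = [];
	let m;
	while ((m = re.exec(decoded)) !== null) {
		if (!dois.includes(m[0])) {
			dois.push(m[0]);
		}
	}
	return dois;
}
```

Note that decoding does not fix the over-matching problem: for the Wiley URL from the description this still returns `10.1111/(ISSN)1470-9856/issues`, trailing path segment included.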

@mrtcode mrtcode force-pushed the mrtcode:extract-doi-from-url branch from b1a50c0 to 92f8713 Dec 17, 2018

@mrtcode

Contributor

mrtcode commented Jan 10, 2019

So as I said previously, it's not that rare to encounter web page URLs that contain a DOI, but we can't extract it reliably. For example:

URL: http://iopscience.iop.org/article/10.1088/0004-637X/768/1/87
Extracted DOI: 10.1088/0004-637X/768/1/87
Actual DOI: 10.1088/0004-637X/768/1/87

URL: http://journal.frontiersin.org/article/10.3389/fmicb.2014.00402/full
Extracted DOI: 10.3389/fmicb.2014.00402/full
Actual DOI: 10.3389/fmicb.2014.00402

And we can't do anything about this.

A DOI can be extracted incorrectly not only from the web page URL but also from URLs found in the body.

For DOI(s) extracted from a web page URL there are a few possible outcomes:

  1. It's correct and results in correct metadata
  2. It's incorrect and results in no metadata
  3. It's correct but results in no metadata (the DOI RAs don't have it, e.g. JSTOR)
  4. It's incorrect and results in incorrect metadata - not possible, I would say, except that in some rare cases it might resolve to the journal instead of the article

So the translator should work like this, depending on where and how many DOIs were found:

  • one in URL - single
  • one in URL, one in body and they are equal - single
  • one in body - multiple
  • in all other cases - multiple
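
The four rules above could be sketched as a small decision function. This is only an illustration: `chooseDetectResult` is a hypothetical name, the `"single"` and `"multiple"` strings mirror the wording of the list rather than the translator's real return values, and returning `null` when nothing is found is an assumption.

```javascript
// Hypothetical sketch of the decision rules listed above.
function chooseDetectResult(urlDOIs, bodyDOIs) {
	const all = [...new Set([...urlDOIs, ...bodyDOIs])];
	if (all.length === 0) {
		return null; // assumption: no DOI anywhere means no detection
	}
	// One DOI in the URL, and the body either has none or only repeats it
	if (urlDOIs.length === 1 && all.length === 1) {
		return "single";
	}
	// One DOI found only in the body, or any other combination
	return "multiple";
}
```
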
@dstillman

Member

dstillman commented Jan 11, 2019

URL: http://iopscience.iop.org/article/10.1088/0004-637X/768/1/87
Extracted DOI: 10.1088/0004-637X/768/1/87
Actual DOI: 10.1088/0004-637X/768/1/87

Is this what you meant? These are the same.

URL: http://journal.frontiersin.org/article/10.3389/fmicb.2014.00402/full
Extracted DOI: 10.3389/fmicb.2014.00402/full
Actual DOI: 10.3389/fmicb.2014.00402

And we can't do anything about this.

We could try stripping some likely suffixes, like /full and /pdf.

@adam3smith

Collaborator

adam3smith commented Jan 11, 2019

We could try stripping some likely suffixes, like /full and /pdf

Given the most common academic CMSs: /full$ and /pdf$ (as per the above examples), plus /abstract$ and /abs$ (speculating about these; I've mainly seen them before the DOI, where they're no problem).

(edit: removed the ones already covered by the regex)
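
A minimal sketch of the suffix-stripping idea, assuming only the four suffixes named in this thread. `doiCandidates` is a hypothetical helper. It returns the stripped form first and keeps the original as a second candidate, since a DOI could legitimately end in one of these segments.

```javascript
// Hypothetical helper; the suffix list comes from this thread, not from a spec.
const LIKELY_SUFFIX = /\/(?:full|pdf|abstract|abs)$/i;

function doiCandidates(doi) {
	const stripped = doi.replace(LIKELY_SUFFIX, "");
	// Keep only the original if nothing was stripped, or if stripping
	// removed the whole suffix part and left just the "10.xxxx" prefix
	if (stripped === doi || !stripped.includes("/")) {
		return [doi];
	}
	return [stripped, doi];
}
```

Trying the stripped candidate first and falling back to the original means one extra lookup at most.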

@mrtcode

Contributor

mrtcode commented Jan 11, 2019

URL: http://iopscience.iop.org/article/10.1088/0004-637X/768/1/87
Extracted DOI: 10.1088/0004-637X/768/1/87
Actual DOI: 10.1088/0004-637X/768/1/87

Is this what you meant? These are the same.

Yeah, I just wanted to demonstrate the difference between the two URLs.

URL: http://journal.frontiersin.org/article/10.3389/fmicb.2014.00402/full
Extracted DOI: 10.3389/fmicb.2014.00402/full
Actual DOI: 10.3389/fmicb.2014.00402
And we can't do anything about this.

We could try stripping some likely suffixes, like /full and /pdf.

This idea is worth investigating, but there can be many variants. More examples:
http://www.oxfordreference.com/view/10.1093/acref/9780199608218.001.0001/acref-9780199608218
http://onlinelibrary.wiley.com/journal/10.1111/%28ISSN%291470-9856/issues
http://iopscience.iop.org/article/10.1088/0022-3727/34/10/311/meta
