Extract DOI from the current web page URL #1799
Shouldn't we return
Do we want to change this behavior for DOIs that are scraped from the document too? Why, then, was it set to return
No. In the page we don't know whether a DOI describes the main item for the page. In the URL we do.
If a random string that looks like a DOI is extracted from a URL, it would prevent any further extraction of DOIs from the document. This means that if we want to return a single item, we first have to resolve the DOI metadata to make sure it's valid, and if it isn't, extract and resolve items from the document. The current commit just puts all DOIs into one list, where all items are resolved and the user just needs to select the one they think is correct. To resolve DOIs extracted from the URL and from the document separately, we would probably need serious modifications to the translator.
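The single-item alternative described in this comment could be sketched roughly as follows. This only illustrates the control flow and is not the translator's code: `selectDois` and the `isResolvable` callback are hypothetical names, and a real validity check would be an asynchronous metadata lookup rather than a synchronous function.

```javascript
// Hypothetical sketch: prefer a DOI taken from the page URL, but only if it
// resolves; otherwise fall back to the DOIs scraped from the document.
// `isResolvable` stands in for a real metadata lookup (an assumption).
function selectDois(urlDois, documentDois, isResolvable) {
    for (var i = 0; i < urlDois.length; i++) {
        if (isResolvable(urlDois[i])) {
            // A valid DOI from the URL is treated as the page's main item
            return [urlDois[i]];
        }
    }
    // No usable DOI in the URL: return the document DOIs without duplicates
    return documentDois.filter(function (doi, index) {
        return documentDois.indexOf(doi) === index;
    });
}
```

With a candidate such as `10.3389/fmicb.2014.00402/full` that fails the check, the function falls through to the document list, which is the extra resolution step the comment says the current translator is not structured for.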
var dois = [], m;

// Extract DOIs from the current URL
var rx = /10.[0-9]{4,}?\/[^\s&"'?#,]*[^\s&"'?#\/.,]/g;
dstillman
Dec 17, 2018
Member
We usually use `re` for this, not `rx`.
Also, seems like we don't need to exclude quotes in this context.
And maybe we do need to URL decode? I'm not sure whether a URL that passes through here is necessarily decoded.
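To make the decoding question concrete: if the URL arrives still percent-encoded, any match keeps the escapes, whereas decoding first yields the literal characters. The sketch below uses the standard `decodeURIComponent`; the helper name `safeDecode` is made up for this example, and whether the translator actually receives an encoded URL remains the open question.

```javascript
// Decode a URL before matching. decodeURIComponent throws a URIError on
// malformed percent-escapes, so fall back to the raw string in that case.
function safeDecode(url) {
    try {
        return decodeURIComponent(url);
    }
    catch (e) {
        return url;
    }
}
```

For example, `safeDecode("http://onlinelibrary.wiley.com/journal/10.1111/%28ISSN%291470-9856/issues")` gives `http://onlinelibrary.wiley.com/journal/10.1111/(ISSN)1470-9856/issues`. Note that `decodeURIComponent` also decodes reserved characters such as `%2F` and `%3F`, which may or may not be desirable before applying the regex.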
// Extract DOIs from the current URL
var rx = /10.[0-9]{4,}?\/[^\s&"'?#,]*[^\s&"'?#\/.,]/g;
while (m = rx.exec(url)) {
    if (dois.indexOf(m[0]) === -1) {
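For reference, a self-contained version of the loop in this hunk might look like the following. It is a sketch rather than the committed code: the function name `extractDoisFromUrl` is hypothetical, the regex variable is named `re` as suggested in the review, and the dot after `10` is escaped so that it only matches a literal period (the pattern in the diff leaves it unescaped).

```javascript
// Collect unique DOI-like strings from a URL.
function extractDoisFromUrl(url) {
    var dois = [], m;
    // Same pattern as in the diff, except that the dot after "10" is escaped
    var re = /10\.[0-9]{4,}?\/[^\s&"'?#,]*[^\s&"'?#\/.,]/g;
    // With the g flag, exec() resumes from re.lastIndex on every call
    while ((m = re.exec(url)) !== null) {
        if (dois.indexOf(m[0]) === -1) {
            dois.push(m[0]);
        }
    }
    return dois;
}
```

Applied to `http://journal.frontiersin.org/article/10.3389/fmicb.2014.00402/full`, this returns a single candidate, `10.3389/fmicb.2014.00402/full`, with the trailing `/full` still attached. That is the kind of unreliable match discussed in this thread.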
mrtcode force-pushed the mrtcode:extract-doi-from-url branch from b1a50c0 to 92f8713 Dec 17, 2018
So as I said previously, it's not that rare to encounter web page URLs that contain a DOI, but we can't extract it reliably. For example:
URL: http://iopscience.iop.org/article/10.1088/0004-637X/768/1/87
URL: http://journal.frontiersin.org/article/10.3389/fmicb.2014.00402/full
And we can't do anything about this. A DOI can be extracted from a URL incorrectly, and this applies not only to the web page URL but also to URLs found in the body.
For DOI(s) extracted from a web page URL there are a few possible outcomes:
So the translator should work like this, depending on where and how many DOIs were found:
Is this what you meant? These are the same.
We could try stripping some likely suffixes, like
given the most common academic CMS's, (edit: removed the ones already covered by the regex)
Yeah, I just wanted to demonstrate the difference between the two URLs.
It's worth investigating this idea, but there can be many variants. More examples:
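The suffix-stripping idea could be prototyped along these lines. The suffix list here is purely an assumption for illustration, taken from the two problem URLs quoted in this thread (`/full` and `/issues`); it is not the list proposed in the comments, and `stripLikelySuffixes` is a hypothetical name.

```javascript
// Hypothetical list of trailing path segments unlikely to belong to a DOI
var LIKELY_SUFFIXES = ["/full", "/issues"];

// Remove one known suffix from the end of a DOI candidate, if present.
function stripLikelySuffixes(doi) {
    for (var i = 0; i < LIKELY_SUFFIXES.length; i++) {
        var suffix = LIKELY_SUFFIXES[i];
        // Strip only when the suffix ends the candidate string
        if (doi.length > suffix.length && doi.slice(-suffix.length) === suffix) {
            return doi.slice(0, -suffix.length);
        }
    }
    return doi;
}
```

This turns `10.3389/fmicb.2014.00402/full` into `10.3389/fmicb.2014.00402` and leaves `10.1088/0004-637X/768/1/87` untouched. Since a DOI suffix can contain almost any characters, a real DOI could end in one of these segments, so a stripped candidate would still need to be verified, and the list would have to grow with every new variant.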
mrtcode commented Dec 14, 2018
From some URLs the DOI can't be extracted correctly, e.g.
http://onlinelibrary.wiley.com/journal/10.1111/%28ISSN%291470-9856/issues
would result in 10.1111/%28ISSN%291470-9856/issues instead of 10.1111/%28ISSN%291470-9856.
Some URLs can also contain multiple DOIs, e.g.
http://api.crossref.org/works/?filter=doi:10.1117/3.1002595.ch10,doi:10.3403/00522251u,doi:10.3403/00522251,doi:10.3403/30217493,doi:10.3403/30289582,doi:10.1117/12.939903,doi:10.3403/02454346u,doi:10.1364/ofc.1979.thf1,doi:10.5772/7558,doi:10.3758/BF03202760,doi:10.3758/bf03195760,doi:10.1006/jmla.1997.2532,doi:10.1037/h0082866