Join GitHub today
GitHub is home to over 40 million developers working together to host and review code, manage projects, and build software together.
Sign upNew Emerald Insight Translator #2037
Conversation
This comment has been minimized.
This comment has been minimized.
We do, unfortunately. |
This looks great! I have just some comments to further improve this as well as some questions for you. |
if (!rows.length) { | ||
// book | ||
rows = doc.querySelectorAll('li.intent_book_chapter>a'); | ||
} |
This comment has been minimized.
This comment has been minimized.
zuphilip
Oct 26, 2019
Collaborator
You can also combine two CSS paths with a comma, i.e.
rows = doc.querySelectorAll('.intent_issue_item h4>a, li.intent_book_chapter>a');
Or will the other path in the wrong case give unwanted results?
This comment has been minimized.
This comment has been minimized.
|
||
|
||
function scrape(doc, url) { | ||
var DOI = url.match(/\/(10\.[^#?/]+\/[^#?/]+)\//)[1]; |
This comment has been minimized.
This comment has been minimized.
zuphilip
Oct 26, 2019
Collaborator
It is (theoretically) possible that this does not match. Is it better to fail in such cases or should we prevent them from the following lines by checking that the url matches already in the detectWeb
?
This comment has been minimized.
This comment has been minimized.
adam3smith
Oct 27, 2019
Author
Collaborator
improved check in detectWeb
and added one more fallback here. Also added the scrape function to work on PDF pages.
// they number authors in their RIS... | ||
text = text.replace(/A\d+\s+-/g, "AU -"); | ||
// add a comma after the last name | ||
text = text.replace(/AU\s\s-\s(\w+)/g, "AU - $1,"); |
This comment has been minimized.
This comment has been minimized.
zuphilip
Oct 26, 2019
Collaborator
I think we should test here first that no comma is present at the moment (because they might fix that at some point). Moreover, can you extend the comment for this a little more?
It seems to me that they are always treating the last word of the author string as the last name and everything else as the first name(s), which they then put into RIS data as "lastname firstname(s)" without a comma. The splitting does not need to be correct, e.g. https://www.emerald.com/insight/content/doi/10.1108/S1572-832320170000026008/full/html , but it is the only thing we can get from their metadata.
This comment has been minimized.
This comment has been minimized.
"mimeType": "application/pdf" | ||
} | ||
], | ||
"tags": [], |
This comment has been minimized.
This comment has been minimized.
zuphilip
Oct 26, 2019
Collaborator
There are keywords which could be saved as tags, i.e. look for doc.querySelectorAll('li .intent_text');
.
This comment has been minimized.
This comment has been minimized.
adam3smith
Oct 28, 2019
Author
Collaborator
Done. Won't use when scraping from PDF view which I think is acceptable.
"creatorType": "author" | ||
} | ||
], | ||
"date": "January 1, 2017", |
This comment has been minimized.
This comment has been minimized.
zuphilip
Oct 26, 2019
Collaborator
Should we make this language independent with ZU.strToISO
? (Maybe even in the RIS translator itself?)
This comment has been minimized.
This comment has been minimized.
adam3smith
Oct 27, 2019
Author
Collaborator
I'll impement this here -- let's think through some RIS scenarios together. Generally sounds like we should.
], | ||
"date": "January 1, 2015", | ||
"ISBN": "9781784415877 9781784415884", | ||
"abstractNote": "AbstractOriginality/value\nThis technique creates opportunities for students to have unique assignments encouraging student to student teaching and can be applied to assignments in any accounting course (undergraduate and graduate). This testing method has been used in Intermediate I and II, Individual Taxation, and Corporate Taxation.", |
This comment has been minimized.
This comment has been minimized.
zuphilip
Oct 26, 2019
Collaborator
Maybe it is cleaner to take here
ZU.cleanInternal(doc.getElementById('abstract').textContent);
for the abstract. See also https://www.emerald.com/insight/content/doi/10.1108/07419051111154758/full/html for more subcontent under abstract.
This comment has been minimized.
This comment has been minimized.
adam3smith
Oct 28, 2019
Author
Collaborator
Done (won't use this when scraping from PDF, which I deemed acceptable)
This comment has been minimized.
This comment has been minimized.
Thanks -- take another look please if you have a moment. |
This comment has been minimized.
This comment has been minimized.
That all looks good, but can you update the test cases as well? |
adam3smith commentedOct 24, 2019
Closes #2036
@dstillman do we still need to include
attr
andtext
code in translators? If not, can we add them to linting as defined?