Skip to content
Please note that GitHub no longer supports your web browser.

We recommend upgrading to the latest Google Chrome or Firefox.

Learn more
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

New Emerald Insight Translator #2037

Open
wants to merge 2 commits into
base: master
from

Conversation

@adam3smith
Copy link
Collaborator

adam3smith commented Oct 24, 2019

Closes #2036

@dstillman do we still need to include attr and text code in translators? If not, can we add them to linting as defined?

Closes #2036
@adam3smith adam3smith requested a review from zuphilip Oct 24, 2019
@dstillman

This comment has been minimized.

Copy link
Member

dstillman commented Oct 24, 2019

do we still need to include attr and text code in translators? If not, can we add them to linting as defined?

We do, unfortunately.

Copy link
Collaborator

zuphilip left a comment

This looks great! I have just some comments to further improve this as well as some questions for you.

if (!rows.length) {
// book
rows = doc.querySelectorAll('li.intent_book_chapter>a');
}

This comment has been minimized.

Copy link
@zuphilip

zuphilip Oct 26, 2019

Collaborator

You can also combine two CSS paths with a comma, i.e.

rows = doc.querySelectorAll('.intent_issue_item h4>a, li.intent_book_chapter>a');

Or will the other path in the wrong case give unwanted results?

This comment has been minimized.

Copy link
@adam3smith

adam3smith Oct 27, 2019

Author Collaborator

done



function scrape(doc, url) {
var DOI = url.match(/\/(10\.[^#?/]+\/[^#?/]+)\//)[1];

This comment has been minimized.

Copy link
@zuphilip

zuphilip Oct 26, 2019

Collaborator

It is (theoretically) possible that this does not match. Is it better to fail in such cases or should we prevent them from the following lines by checking that the url matches already in the detectWeb?

This comment has been minimized.

Copy link
@adam3smith

adam3smith Oct 27, 2019

Author Collaborator

improved check in detectWeb and added one more fallback here. Also added the scrape function to work on PDF pages.

// they number authors in their RIS...
text = text.replace(/A\d+\s+-/g, "AU -");
// add a comma after the last name
text = text.replace(/AU\s\s-\s(\w+)/g, "AU - $1,");

This comment has been minimized.

Copy link
@zuphilip

zuphilip Oct 26, 2019

Collaborator

I think we should test here first that no comma is present at the moment (because they might fix that at some point). Moreover, can you extend the comment for this a little more?

It seems to me that they are always treating the last word of the author string as the last name and everything else as the first name(s), which they then put into RIS data as "lastname firstname(s)" without a comma. The splitting does not need to be correct, e.g. https://www.emerald.com/insight/content/doi/10.1108/S1572-832320170000026008/full/html , but it is the only thing we can get from their metadata.

This comment has been minimized.

Copy link
@adam3smith

adam3smith Oct 27, 2019

Author Collaborator

Done and done.

"mimeType": "application/pdf"
}
],
"tags": [],

This comment has been minimized.

Copy link
@zuphilip

zuphilip Oct 26, 2019

Collaborator

There are keywords which could be saved as tags, i.e. look for doc.querySelectorAll('li .intent_text');.

This comment has been minimized.

Copy link
@adam3smith

adam3smith Oct 28, 2019

Author Collaborator

Done. Won't use when scraping from PDF view which I think is acceptable.

"creatorType": "author"
}
],
"date": "January 1, 2017",

This comment has been minimized.

Copy link
@zuphilip

zuphilip Oct 26, 2019

Collaborator

Should we make this language independent with ZU.strToISO? (Maybe even in the RIS translator itself?)

This comment has been minimized.

Copy link
@adam3smith

adam3smith Oct 27, 2019

Author Collaborator

I'll impement this here -- let's think through some RIS scenarios together. Generally sounds like we should.

],
"date": "January 1, 2015",
"ISBN": "9781784415877 9781784415884",
"abstractNote": "AbstractOriginality/value\nThis technique creates opportunities for students to have unique assignments encouraging student to student teaching and can be applied to assignments in any accounting course (undergraduate and graduate). This testing method has been used in Intermediate I and II, Individual Taxation, and Corporate Taxation.",

This comment has been minimized.

Copy link
@zuphilip

zuphilip Oct 26, 2019

Collaborator

Maybe it is cleaner to take here

ZU.cleanInternal(doc.getElementById('abstract').textContent);

for the abstract. See also https://www.emerald.com/insight/content/doi/10.1108/07419051111154758/full/html for more subcontent under abstract.

This comment has been minimized.

Copy link
@adam3smith

adam3smith Oct 28, 2019

Author Collaborator

Done (won't use this when scraping from PDF, which I deemed acceptable)

Emerald Insight.js Show resolved Hide resolved
@adam3smith

This comment has been minimized.

Copy link
Collaborator Author

adam3smith commented Oct 28, 2019

Thanks -- take another look please if you have a moment.

@zuphilip

This comment has been minimized.

Copy link
Collaborator

zuphilip commented Oct 28, 2019

That all looks good, but can you update the test cases as well?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
3 participants
You can’t perform that action at this time.