Translator for Oxford Reference #1325

Merged
merged 6 commits into from Nov 24, 2017

Contributor

sonali0901 commented Jun 8, 2017

I need some input on how to identify the item type, as I couldn't find a way to reliably distinguish books from book sections.
Fixes #796

Collaborator

zuphilip commented Jun 11, 2017

I looked at your two examples and would suggest differentiating the two types by looking at the classes on body, i.e. try something like this (not tested):

var body = document.getElementsByTagName("body")[0];
var item;
if (body.className.indexOf('dctype-oxencycl-entry') > -1) {
   item = new Zotero.Item("encyclopediaArticle");
} else { // class is then 'dctype-book'
   item = new Zotero.Item("book");
}
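The same check can be sketched as a small pure function that is easy to test outside Zotero. The function name `itemTypeFromBodyClass` is hypothetical; the two class names are the ones mentioned above, and splitting on whitespace is an assumption meant to avoid matching a longer class name by accident.

```javascript
// Hypothetical helper: map the class attribute of <body> to a Zotero item type.
// Only 'dctype-oxencycl-entry' is checked; everything else falls back to "book",
// mirroring the untested suggestion above.
function itemTypeFromBodyClass(className) {
    // Split into individual class names so that only an exact name matches.
    const classes = String(className || "").split(/\s+/);
    if (classes.includes("dctype-oxencycl-entry")) {
        return "encyclopediaArticle";
    }
    return "book";
}

console.log(itemTypeFromBodyClass("foo dctype-oxencycl-entry bar"));
console.log(itemTypeFromBodyClass("dctype-book"));
```

In a translator this would presumably be called with the class attribute of the page's body element.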

Contributor Author

sonali0901 commented Jun 15, 2017

@zuphilip Pages that are locked and only fully visible with a subscription have limited content in the abstract note (e.g. Accutane in the test cases). Is it fine to leave it that way?

sonali0901 changed the title from "WIP: Translator for Oxford Reference" to "Translator for Oxford Reference" on Jun 15, 2017

Collaborator

zuphilip left a comment

Some of your XPaths should already be covered by the Embedded Metadata (EM) translator, and in the future we will also try to scrape the JSON-LD there. I therefore suggest calling EM first and then adding only the missing parts.

Please also have a look at my other comments.
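A rough sketch of the "call EM first, then add the missing parts" pattern follows. It uses the usual way of chaining translators (`loadTranslator`, `setTranslator`, `setHandler("itemDone", ...)`, `getTranslatorObject`). The function name `scrapeWithEM` and the `addMissingParts` callback are hypothetical, and the `zotero` object is passed in as a parameter only so the sketch can be exercised outside Zotero; inside a translator the global `Zotero` would be used.

```javascript
// Translator ID of Zotero's "Embedded Metadata" translator.
const EM_TRANSLATOR_ID = "951c027d-74ac-47d4-a107-9c3069ab7b48";

// Hypothetical helper: run EM on the page, then let the caller fill in
// whatever EM did not provide before the item is saved.
function scrapeWithEM(zotero, doc, url, addMissingParts) {
    const translator = zotero.loadTranslator("web");
    translator.setTranslator(EM_TRANSLATOR_ID);
    translator.setDocument(doc);

    // Called once EM has built its item; site-specific fixes go here.
    translator.setHandler("itemDone", function (obj, item) {
        addMissingParts(item, doc);
        item.complete();
    });

    translator.getTranslatorObject(function (trans) {
        trans.doWeb(doc, url);
    });
}
```

The callback is where site-specific fields such as the edition or the cleaned title would be set.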

"translatorID": "62415874-b53c-4afd-86e8-814e18a986f6",
"label": "Oxford Reference",
"creator": "Sonali Gupta",
"target": "http://www.oxfordreference.com/",


zuphilip Jun 15, 2017

Collaborator

Start the target with ^https? and escape the dots, i.e. write \. in Scaffold or \\. when editing the text file directly.
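To illustrate why the escaping matters, here is a small sketch with plain JavaScript regular expressions. The host is taken from the target above; the "bad" URL is a made-up example, and keeping the `www.` prefix is an assumption.

```javascript
// Inside a string literal (as in the translator's text file) each "\." is written "\\.".
const target = new RegExp("^https?://www\\.oxfordreference\\.com/");

// Without escaping, "." matches any character, so look-alike hosts slip through.
const unescaped = new RegExp("^https?://www.oxfordreference.com/");

const good = "https://www.oxfordreference.com/view/";
const bad = "http://wwwXoxfordreferenceXcom/"; // made-up URL for illustration

console.log(target.test(good), target.test(bad));
console.log(unescaped.test(good), unescaped.test(bad));
```

The escaped pattern accepts both http and https on the real host and rejects the look-alike, while the unescaped one accepts both URLs.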


function detectWeb(doc, url) {
    if (url.indexOf("/search") != -1)
        return "multiple";


zuphilip Jun 15, 2017

Collaborator

Make another (nested) check for getSearchResults(doc, true).
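A minimal sketch of the nested check could look as follows. The function name `detectSearchPage` is hypothetical, and `getSearchResults` is passed in as a parameter only to keep the sketch self-contained; it is assumed to return a truthy value when called with `true` (check-only mode) and at least one result exists.

```javascript
// Hypothetical helper: only report "multiple" when the URL looks like a
// search page AND the page really contains results.
function detectSearchPage(doc, url, getSearchResults) {
    if (url.indexOf("/search") !== -1) {
        if (getSearchResults(doc, true)) {
            return "multiple";
        }
    }
    return false;
}
```

This way a search URL with an empty result list no longer offers a save button for multiple items.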

    return "bookSection";
}
else
    return "book";


zuphilip Jun 15, 2017

Collaborator

I guess this also leads to a lot of misclassifications, because every other page on the domain will return book, e.g. http://www.oxfordreference.com/page/law-subject/law and even simply http://www.oxfordreference.com/

Is it enough to filter on url.indexOf('/view/')>-1?
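The proposed filter can be tried against the URLs mentioned in this thread. The helper name `isEntryUrl` is hypothetical, and whether `/view/` alone is sufficient for all pages on the site is exactly the open question here.

```javascript
// Hypothetical helper: treat only URLs containing "/view/" as single items.
function isEntryUrl(url) {
    return url.indexOf("/view/") > -1;
}

const urls = [
    "http://www.oxfordreference.com/page/law-subject/law",
    "http://www.oxfordreference.com/",
    "http://www.oxfordreference.com/view/10.1093/acref/9780199546572.001.0001/acref-9780199546572-e-0009"
];

for (const url of urls) {
    console.log(isEntryUrl(url), url);
}
```

Only the last URL, the Accutane test case, passes the filter; the subject page and the home page are skipped.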

var items = {};
var found = false;
var rows = ZU.xpath(doc, '//span[@class="titlePart"]/a');
var rowsExtendedTitle = ZU.xpath(doc, '//span[@class="title"]');


zuphilip Jun 15, 2017

Collaborator

Why do you need two xpaths here? This is IMO fragile because the elements of the two xpaths could be numbered differently...


sonali0901 Jun 15, 2017

Author Contributor

I did this because the first XPath only yields the title of the entry we searched for. For example, when I searched for Shalimar the Clown, all the entries in the pop-up had Salman Rushdie as the title. So I used the second XPath to append the name of the book the reference came from, which makes it easier for the user to choose which one to save.


zuphilip Jun 15, 2017

Collaborator

I think it would then be better to try something like:

var rows = ZU.xpath(doc, '//span[@class="titlePart"]');
...
   var href = ZU.xpathText(rows[i], './a/@href');
   var title = ZU.trimInternal(rows[i].textContent);

(Note: this is untested code.)
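The point of reading both values from the same row is that link and label cannot drift apart. A self-contained sketch, using plain objects as stand-ins for the DOM nodes, might look like this; `collectItems` is a hypothetical name, `trimInternal` is a simplified stand-in for `ZU.trimInternal`, and the sample row is invented.

```javascript
// Simplified stand-in for ZU.trimInternal: collapse runs of whitespace and trim.
function trimInternal(s) {
    return s.replace(/\s+/g, " ").trim();
}

// Hypothetical helper: build the href -> title map from one list of rows,
// so each title always belongs to the link taken from the same row.
function collectItems(rows) {
    const items = {};
    for (const row of rows) {
        const title = trimInternal(row.text);
        if (!row.href || !title) continue; // skip incomplete rows
        items[row.href] = title;
    }
    return items;
}

// Invented sample data standing in for the search result rows.
const sampleRows = [
    { href: "/view/entry-1", text: "  Entry title \n   in  Some Book " },
    { href: "", text: "row without a link" }
];
console.log(collectItems(sampleRows));
```

With two separate XPaths, a row lacking one of the elements would shift all following titles against their links; here such a row is simply skipped.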


sonali0901 Jun 15, 2017

Author Contributor

The two spans have different class ids. I will check this though.


zuphilip Jun 15, 2017

Collaborator

Okay, I looked closer now.

How about the xpath

var rows = ZU.xpath(doc, '//h3[@class="source"]/a[span[@class="title"]]');

?


sonali0901 Jun 22, 2017

Author Contributor

With this XPath, the pop-up window shows the book title (not the section title), while the href points to the book section page.


var edition = ZU.xpathText(doc, '//meta[@property="http://schema.org/bookEdition"]/@content');
if (edition)
    item.edition = edition;


zuphilip Jun 15, 2017

Collaborator

You can assign this directly, because ZU.xpathText simply returns null when nothing matches, i.e.

item.edition = ZU.xpathText(doc, '//meta[@property="http://schema.org/bookEdition"]/@content');
"items": [
{
"itemType": "bookSection",
"title": "Accutane - Oxford Reference",


zuphilip Jun 15, 2017

Collaborator

Shouldn't this be just "Accutane"?

"url": "http://www.oxfordreference.com/view/10.1093/acref/9780199546572.001.0001/acref-9780199546572-e-0009",
"items": [
{
"itemType": "bookSection",


zuphilip Jun 15, 2017

Collaborator

The title of the containing book should also be saved, i.e. here it should be "A-Z of Plastic Surgery".

"url": "http://www.oxfordreference.com/view/10.1093/acref/9780199546572.001.0001/acref-9780199546572-e-0009",
"items": [
{
"itemType": "bookSection",


zuphilip Jun 15, 2017

Collaborator

For chapters we should also save the title of the book. This seems missing here.

],
"date": "2008",
"ISBN": "9780199546572",
"abstractNote": "Isotretinoin. The synthetic retinoid derivative 13-cis-retinoic acid (Accutane) used for severe Acne vulgaris. The dose is 1",


zuphilip Jun 15, 2017

Collaborator

That is IMO fine, but maybe add the three dots as well...
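One possible way to add the dots is sketched below. The helper name `markTruncated` is hypothetical, and the heuristic (append "..." when the text does not end in sentence-final punctuation) is an assumption, not something the site or the translator guarantees.

```javascript
// Hypothetical helper: append "..." to an abstract that looks cut off.
// Assumption: a complete abstract ends with ".", "!", "?" or "…".
function markTruncated(abstractNote) {
    const text = abstractNote.trim();
    if (text === "" || /[.!?…]$/.test(text)) {
        return text;
    }
    return text + "...";
}

// Abstract from the Accutane test case above, which stops mid-sentence.
const accutane = "Isotretinoin. The synthetic retinoid derivative 13-cis-retinoic acid (Accutane) used for severe Acne vulgaris. The dose is 1";
console.log(markTruncated(accutane));
```

Abstracts that already end in a full stop are returned unchanged, so complete entries would not gain spurious dots.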

"items": [
{
"itemType": "book",
"title": "Concise Oxford Companion to English Literature - Oxford Reference",


zuphilip Jun 15, 2017

Collaborator

We should clean the title such that the output is without the " - Oxford Reference".
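A small sketch of such a clean-up, using the two titles from the test cases above; the helper name `cleanTitle` is hypothetical, and anchoring the pattern at the end of the string is a design choice so that a title merely containing the phrase is left alone.

```javascript
// Hypothetical helper: strip the site name that the page appends to every title.
function cleanTitle(title) {
    return title.replace(/\s+-\s+Oxford Reference$/, "");
}

console.log(cleanTitle("Accutane - Oxford Reference"));
console.log(cleanTitle("Concise Oxford Companion to English Literature - Oxford Reference"));
```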

sonali0901 and others added some commits Jun 22, 2017

adam3smith merged commit 80bd37d into zotero:master on Nov 24, 2017

1 check passed

continuous-integration/travis-ci/pr The Travis CI build passed

Collaborator

adam3smith commented Nov 24, 2017

Thanks!

psisquared2 added a commit to psisquared2/translators that referenced this pull request Feb 8, 2018
