Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

MIDAS Journal Translator #1030

Merged
merged 8 commits into from Apr 16, 2018

Conversation

@str4w
Copy link
Contributor

commented Mar 23, 2016

This is a translator for the journals hosted by Kitware - Insight Journal, VTK Journal, etc.

splitDownloadPath = downloadAllXPath.iterateNext().value.split('/');
version=splitDownloadPath[splitDownloadPath.length-1]
newItem.extra="Revision: " + version
newItem.issue=splitDownloadPath[splitDownloadPath.length-2]

This comment has been minimized.

Copy link
@zuphilip

zuphilip Apr 10, 2016

Collaborator

All three lines above are missing a semicolon.

//Zotero.debug("splitname");
//Zotero.debug(splitname);
var z=1;
name="";

This comment has been minimized.

Copy link
@zuphilip

zuphilip Apr 10, 2016

Collaborator

var is missing

"translatorID": "86c35e86-3f97-4e80-9356-8209c97737c2",
"label": "MIDAS Journals",
"creator": "Rupert Brooks",
"target": "(insight-journal|midasjournal|vtkjournal).org/browse/publication",

This comment has been minimized.

Copy link
@zuphilip

zuphilip Apr 10, 2016

Collaborator

What is in front this? Normally we have something like ^https?://(www)?. Can we do this here also?

Moreover, escape the point in the regexp.

@zuphilip

This comment has been minimized.

Copy link
Collaborator

commented Apr 10, 2016

Can you give some specific examples for your translator?

*/

function detectWeb(doc, url) {
if (url.match("browse/publication")) return "journalArticle";

This comment has been minimized.

Copy link
@aurimasv

aurimasv Apr 10, 2016

Contributor

this is already guaranteed by the target regexp above. Since the only required field for translation is item.title, you probably want to check here that you can get the title.

var namespace = doc.documentElement.namespaceURI;
var nsResolver = namespace ? function(prefix) {
if (prefix == 'x') return namespace; else return null;
} : null;

This comment has been minimized.

Copy link
@aurimasv

aurimasv Apr 10, 2016

Contributor

Drop all of this and, below, use ZU.xpath(doc, "your xpath here") for an element array or ZU.xpathText(doc, "your xpath here") for the .textContent of the elements.

var submittedXPath = doc.evaluate('//*[@id="publication"]/div[@class="submittedby"]', doc, nsResolver, XPathResult.ANY_TYPE, null);
var abstractXPath = doc.evaluate('//*[@id="publication"]/div[@class="abstract"]', doc, nsResolver, XPathResult.ANY_TYPE, null);
var downloadAllXPath = doc.evaluate('//a[contains(text(),"Download All")]/@href', doc, nsResolver, XPathResult.ANY_TYPE, null);
var pdfXPath1 = doc.evaluate('//a[contains(text(),"Download Paper")]/@href', doc, nsResolver, XPathResult.ANY_TYPE, null);

This comment has been minimized.

Copy link
@aurimasv

aurimasv Apr 10, 2016

Contributor

Move all of these down to where they are actually assigned to the Zotero item.

//var pdfXPath2 = doc.evaluate('//a[contains(text(),".pdf")]/@href', doc, nsResolver, XPathResult.ANY_TYPE, null);


pdfhref=pdfXPath1.iterateNext();

This comment has been minimized.

Copy link
@aurimasv

aurimasv Apr 10, 2016

Contributor

missing var. Apply throughout, please.

pdflink=tmp[0]+'//' +tmp[2]+pdfhref.value;
//Zotero.debug(pdflink);
}
datematch=new RegExp("on\\w+([0-9]+)\\-([0-9]+)\\-([0-9]+)");

This comment has been minimized.

Copy link
@aurimasv

aurimasv Apr 10, 2016

Contributor

Instead of new RegExp(...) use /.../ (and you don't need to declare them ahead of time. Use "foo".match(/o/) instead of regexp.exec("foo"))

//Zotero.debug(pdflink);
}
datematch=new RegExp("on\\w+([0-9]+)\\-([0-9]+)\\-([0-9]+)");
datematch=new RegExp("on +([0-9]+)\\-([0-9]+)\\-([0-9]+)");

This comment has been minimized.

Copy link
@aurimasv

aurimasv Apr 10, 2016

Contributor

add spaces before and after =. Apply throughout please.

newItem.issue=splitDownloadPath[splitDownloadPath.length-2]

var a=0;
while (a<authors.length) {

This comment has been minimized.

Copy link
@aurimasv

aurimasv Apr 10, 2016

Contributor

This (and below) should be a for loop. for (var i=0; i<authors.length; i++) { ...

name=name+splitname[z]+" ";
z=z+1;
}
name=name+splitname[0];

This comment has been minimized.

Copy link
@aurimasv

aurimasv Apr 10, 2016

Contributor

Please leave a comment in the code of what kind of string you are processing in this whole block. I can see that you're reshuffling something like Prastawa M. into M. Prastawa and then calling cleanAuthor to parse it. It would be much simpler and cleaner to

var m = "Prastawa M.".match(/(\S+)\s+(.+)/);
item.creators.push({
    firstName: m[2],
    lastName: m[1],
    creatorType: 'author'
})`

This comment has been minimized.

Copy link
@str4w

str4w Apr 10, 2018

Author Contributor

would be even better to take prestructured data from the XML, see below comment

//pdf
if(pdfhref) {
newItem.attachments.push({
title: newItem.publicationTitle+" "+newItem.issue + " v" + version + " PDF",

This comment has been minimized.

Copy link
@aurimasv

aurimasv Apr 10, 2016

Contributor

Simply "Full Text PDF" is sufficient. The specifics are available in the metadata

title: newItem.publicationTitle+" "+newItem.issue + " v" + version +" Snapshot",
url: url,
mimeType:"text/html"
});

This comment has been minimized.

Copy link
@aurimasv

aurimasv Apr 10, 2016

Contributor

There is generally no reason to attach a snapshot if we're attaching a full text pdf

This comment has been minimized.

Copy link
@str4w

str4w Apr 10, 2018

Author Contributor

I see your point, but most of the other translators I looked at (at least ArXiv, and ACM) do this. It seems to be the standard practice.

}

function doWeb(doc, url) {
if(detectWeb(doc,url)=="journalArticle") {

This comment has been minimized.

Copy link
@aurimasv

aurimasv Apr 10, 2016

Contributor

Given you detectWeb, this is guaranteed.

if(detectWeb(doc,url)=="journalArticle") {
var articles = new Array();
articles = [url];
Zotero.Utilities.processDocuments(articles, scrape, function(){Zotero.done();});

This comment has been minimized.

Copy link
@aurimasv

aurimasv Apr 10, 2016

Contributor

I'm not sure what you're trying to do here. The page is already loaded. Why re-load it?

var articles = new Array();
articles = [url];
Zotero.Utilities.processDocuments(articles, scrape, function(){Zotero.done();});
Zotero.wait();

This comment has been minimized.

Copy link
@aurimasv

aurimasv Apr 10, 2016

Contributor

This is no longer necessary. It was needed a long time ago.

@aurimasv

This comment has been minimized.

Copy link
Contributor

commented Apr 10, 2016

As @zuphilip says, please provide some examples and, preferably, add test pages using Scaffold.

@zuphilip

This comment has been minimized.

Copy link
Collaborator

commented Jan 1, 2018

@str4w What is the status here?

@str4w

This comment has been minimized.

Copy link
Contributor Author

commented Jan 2, 2018

Oops. Honestly, I assessed the number of comments at the time as "Too large to complete this week", and then I totally forgot about it after that. It is back on my radar, I will try to look at it in the near future.

@zuphilip

This comment has been minimized.

Copy link
Collaborator

commented Jan 2, 2018

Okay, thank you for the response. Can you start with the test cases?

I see that there is now also several export options, which we could also use instead of scraping from the website:

midas-export

We could use for example the structured data in the XML for the authors (full first names, split into first and last names, several authors are split). What do you think?

@adam3smith

This comment has been minimized.

Copy link
Collaborator

commented Jan 2, 2018

looking at this quickly, wouldn't it make sense to just import BibTeX or Medline (for which we already have import translators) and then add journal title and ISSN quickly based on the URL?

@zuphilip

This comment has been minimized.

Copy link
Collaborator

commented Jan 2, 2018

BibTeX does not look that good, e.g. for http://www.vtkjournal.org/browse/publication/880 we have

  Author	= "I. Herrera and C. Buchart and D. Borro",

Medline format could work also as there are also the full first names there, but publication title seems missing and maybe more. That is the reason I asked for examples...

@adam3smith

This comment has been minimized.

Copy link
Collaborator

commented Jan 2, 2018

right, publication titles seem to be missing everywhere -- that's why I thought we'd add them based on URL.

@str4w

This comment has been minimized.

Copy link
Contributor Author

commented Apr 10, 2018

Sorry this dragged so long. This should be improved from the last iteration. Questions -

Looking at some of the other examples, sometimes they use
Zotero.
Z.
ZU.
what is the correct way, what are these globals?

More importantly, this would be cleaner and the data would be more complete if it could pull the exported XML (sadly, as the previous comments point out, that data is incomplete and we can't just use that for everything). To get that XML, need to make an HTTP post and parse the output - how does one do that - can the Zotero.doGet method do it?

@zuphilip zuphilip removed the waiting label Apr 15, 2018

@zuphilip

This comment has been minimized.

Copy link
Collaborator

commented Apr 15, 2018

@str4w I worked directly on top of your work here and created a PR into your fork: str4w#1 . If you merge this PR then it will be automatically update here.

There are some special functions which you can use in coding Zotero translators and they either start with Zotero. (or short Z.) or Zotero.Utilities. (short ZU.). See also https://www.zotero.org/support/dev/translators/functions

We need a post call here and therfore I used Zotero.doPost. It is important then to continue inside this function and not afterwards. See my changes.

Merge pull request #1 from zuphilip/str4w-master
Update MIDAS and use XML data
@str4w

This comment has been minimized.

Copy link
Contributor Author

commented Apr 15, 2018

Thanks for the explanations and improvements. I merged with my fork so it appears here.

@zuphilip

This comment has been minimized.

Copy link
Collaborator

commented Apr 15, 2018

Okay, great. Let me know whether there is anything left from your point of view.

@str4w

This comment has been minimized.

Copy link
Contributor Author

commented Apr 16, 2018

I'm running a few tests to be sure. A couple things - I see that you have added multiples. However, I tried this link
http://www.vtkjournal.org/browse/journal/53
and detectWeb did not detect it.

I tried this link http://www.insight-journal.org/browse/publication/921 and the doWeb returns the tags below. I don't understand the + and -. Is it normal? I am not actually using the autogenerated tags in Zotero, so this doesn't trouble me directly, just wondering.

         "tags": [
           {
     -       "tag": "Reproducibility"
     +       "tag": "IPython"
           }
           {
     -       "tag": "Tardigrades"
     +       "tag": "Reproducibility"
           }
           {
             "tag": "Scipy"
           }
           {
     -       "tag": "IPython"
     +       "tag": "Tardigrades"
           }
           {
     -       "tag": "notebook"
     +       "tag": "dexy"
           }
           {
     -       "tag": "dexy"
     +       "tag": "docker"
           }
           {
             "tag": "github"
           }
           {
     -       "tag": "docker"
     +       "tag": "notebook"
           }
         ]
@zuphilip

This comment has been minimized.

Copy link
Collaborator

commented Apr 16, 2018

Okay, I added another path for this multiple with some filtering on the expected urls. Moroever, some clean-ups. Please merge again: str4w#2

Yes, what you see with the tags is normal. The differences which are indicated with the -/+ are just the ordering, which is performed alphabetically for the test cases.

str4w added some commits Apr 16, 2018

Merge pull request #2 from zuphilip/str4w-master
[MIDAS Journals]: Add another multiple, clean-up
@str4w

This comment has been minimized.

Copy link
Contributor Author

commented Apr 16, 2018

I checked a bunch of articles, ran the tests, looks good. I added the ISSN tag, since it is available in many cases, why not have it?

Are you sure you want the article number to go into the pages field? If that is consistent with what other translators do then it would make sense, it just doesn't fit the literal definition of that field as I understand it.

Assuming the ISSN tag is ok with you and the page field for the article number was intentional, looks good to me, go for it.

@zuphilip

This comment has been minimized.

Copy link
Collaborator

commented Apr 16, 2018

Are you sure you want the article number to go into the pages field? If that is consistent with what other translators do then it would make sense, it just doesn't fit the literal definition of that field as I understand it.

@adam3smith Do I remember correctly? Can you confirm (or correct) that saving an article id is best done in the pages field?

(I will merge it as soon as we clarified this.)

@adam3smith

This comment has been minimized.

Copy link
Collaborator

commented Apr 16, 2018

Do I remember correctly? Can you confirm (or correct) that saving an article id is best done in the pages field?

Yes, that's right

@zuphilip zuphilip merged commit d8bfa59 into zotero:master Apr 16, 2018

1 check passed

continuous-integration/travis-ci/pr The Travis CI build passed
Details
@zuphilip

This comment has been minimized.

Copy link
Collaborator

commented Apr 16, 2018

Merged! 🎉 Thank you @str4w for the work! (You should reset your master branch now before any new work.)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
4 participants
You can’t perform that action at this time.