Join GitHub today
GitHub is home to over 36 million developers working together to host and review code, manage projects, and build software together.
Sign upKstudy translator #1308
Conversation
This comment has been minimized.
This comment has been minimized.
I've raised my hand, so ... a question first for the translator gurus listening in. The translator is based on the Framework, but it's going to need some out-of-the-ordinary post-processing of name field content, at least one intermediate xmlhttprequest call to acquire a ToC for one form of "multiple" content, and another to acquire attachments. Is it possible to shoehorn arbitrary JS code into a Framework translator, or should things be recoded from scratch? |
This comment has been minimized.
This comment has been minimized.
So FW does have some capacity to hook into the processing and run arbitrary JS, but I suspect it's better to rewrite than to do much complex work in those hooks |
This comment has been minimized.
This comment has been minimized.
One can add more JS to the functions Lines 17 to 23 in 346e164 Personally, I prefer the non-FW translators and here it seems that anything else would be pretty complex as well. @yunusong I can rewrite what you have as a normal translator without the FW, if this helps. |
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
@fbennett Here is a non-FW version of this code: https://github.com/zuphilip/translators/blob/yunusong-patch-2/KStudy.js |
This comment has been minimized.
This comment has been minimized.
Author field is not that big of a deal, actually. I tried adding ".remove(/(/).remove()/)" after ".split(/,/)" so that it won't take ")" as the last name. Then it took the romanized last name as the last name, and that was good enough for me. Separating last name and first name is almost never necessary in Korean academia. On the other hand, I don't mind you guys rewriting the code at all. If it helps improving the code, I am all for it. What I would really like is the translator to automatically fetch the pdf files. The xpath for link to pdf files is "//div[@Class="choice"]/span[1]/a/@href" and an example link is |
This comment has been minimized.
This comment has been minimized.
This looks great. You should replace the original file with this one on the pull request and also add your name to the creator. Thank you for this.
2017년5월27일 (토요일) 오전 10:02에 Philipp Zumstein <notifications@github.com>님이 쓴 메시지:
@fbennett Here is a non-FW version of this code: https://github.com/zuphilip/translators/blob/yunusong-patch-2/KStudy.js—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub, or mute the thread.
|
This comment has been minimized.
This comment has been minimized.
@yunusong I created a PR towards your fork and branch yunusong#1 . After accepting this, we should see it here as well. |
yunusong
added some commits
May 27, 2017
zuphilip
reviewed
May 27, 2017
{ | ||
"translatorID": "b298ca93-0010-48f5-97fb-e9923519a380", | ||
"label": "KStudy", | ||
"creator": "Yunwoo Song","Philipp Zumstein" |
This comment has been minimized.
This comment has been minimized.
zuphilip
May 27, 2017
•
Collaborator
Okay, thank you, but this needs to be one string containing both names, i.e.
"Yunwoo Song, Philipp Zumstein"
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
I located the JS file which contains the function for downloading pdf. So I coded item.attachments as follows.
But Zotero still won't fetch the pdf. I tried fetching the resulting url by assigning to an unused slot. eg. issn. as I tried pasting this address directly on a browser and it will start downloading the pdf. The odd thing I noticed about the original function was that it does not end the "var url =" statement with a semicolon. The semicolon only appears at the end of the second line. Would this have something to do with the problem? Thank you guys for the comments. |
zuphilip
reviewed
May 27, 2017
@@ -1,7 +1,7 @@ | |||
{ | |||
"translatorID": "b298ca93-0010-48f5-97fb-e9923519a380", | |||
"label": "KStudy", | |||
"creator": "Yunwoo Song","Philipp Zumstein" | |||
"creator": "Yunwoo Song, Philipp Zumstein" |
This comment has been minimized.
This comment has been minimized.
zuphilip
May 27, 2017
Collaborator
This is not valid JSON. You need to add a comma at the end of this line.
This comment has been minimized.
This comment has been minimized.
Can you try |
This comment has been minimized.
This comment has been minimized.
I'll take a look on Monday (I need to be in the University net to get access). As an offhand thought in the meantime, assuming that the translator is loading properly, there may be a redirect that the translator is not able to follow without some further jiggery-pokery. (You can follow the transactions by opening Developer Tools in the browser and then refreshing on the download URL.) |
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
Thanks everyone -- please let me know once this is ready for final review. As for FW, given #1277, I'd prefer us avoiding new FW translators. |
This comment has been minimized.
This comment has been minimized.
@fbennett I have one more question. I think I need the translator to access this final url directly. The gist of the html code that calls the url is as follows.
If you paste this html code into a new text and open it in a browser, you will see that it downloads a pdf file even without requiring proxy. Do you think you can work around this code and have the zotero to fetch the final url? I really appreciate your help. |
This comment has been minimized.
This comment has been minimized.
Yes, that's the plan. Likely it can be done by nesting another call that
runs inside the pop-up.
…On May 29, 2017 00:12, "yunusong" ***@***.***> wrote:
@fbennett <https://github.com/fbennett> I have one more question.
When I erased all other attributes of the attachment and just left the url
as "url : pdfurl" (which gives a long url that would have been the result
of the function SiteSelect1), Zotero saves the snapshot of the page that
leads to downloading, and the browser downloads the pdf file to my download
folder in my computer.
I think I need the translator to access this final url directly. The gist
of the html code that calls the url is as follows.
If you paste this html code into a new text and open it in a browser, you
will see that it downloads a pdf file even without requiring proxy. Do you
think you can work around this code and have the zotero to fetch the final
url? I really appreciate your help.
—
You are receiving this because you were mentioned.
Reply to this email directly, view it on GitHub
<#1308 (comment)>,
or mute the thread
<https://github.com/notifications/unsubscribe-auth/AAEmSj241_OO6f-rfnGR3xNFLU5gZB5tks5r-Y7egaJpZM4Nnvpq>
.
|
This comment has been minimized.
This comment has been minimized.
Can you guys tell me what is wrong with my code? I don't think I quite understand how to use ZU.doPost. But I am quite certain that I have to use it because the actual page that leads to pdf has method="post". So in the scrape function I added the code for attachments as follows.
I was hoping that this would process the getpdf function on the page that opens. And for getpdf function, I coded it as follows.
I checked how other translators use the doPost, and they were very different from how I wrote it, but I don't quite understand how to fix it. |
This comment has been minimized.
This comment has been minimized.
That is a random guess, but I think you have to switch the two things, i.e. something like ZU.doPost(pdflinkurl, postdata, function(pdfdoc) {
item.attachments.push({
doc: pdfdoc,
title: "Fulltext",
mimetype: "pplication/pdf"
});
item.complete();
} But I don't really know if this can work, because usually the |
This comment has been minimized.
This comment has been minimized.
Picking up now (fell asleep last night, but I turned on the office proxy yesterday, so I'm "at work" this morning). Will fiddle around a bit and see what I can come up with. |
This comment has been minimized.
This comment has been minimized.
Ok, I think I figured out how to use ZU.doPost. But I still couldn't get Zotero to save pdf. This is the code I added at the end of the scrape function. The weird thing is, though, when I run test in Scaffold, the result will show "Translation Successful" AND my browser will download the pdf to my computer, even though I only ran test. And when I actually test it with Zotero, it will not fetch pdf. Any thoughts on what I should change?
|
This comment has been minimized.
This comment has been minimized.
Cracked it. I have code that attaches a readable PDF. As it turned out, no further call (to I'll file a pull request in a bit. |
This comment has been minimized.
This comment has been minimized.
|
This comment has been minimized.
This comment has been minimized.
I generalized the translator for the example(s) found by @adam3smith but choosed in the end Thus, I decided to skip the PDF download for the moment. Maybe, we can start with the translator as it is now. Any improvement for PDF download could be added later by someone with access to the database. |
This comment has been minimized.
This comment has been minimized.
Please have a look at the new version. |
This comment has been minimized.
This comment has been minimized.
socheres
commented
Jan 2, 2018
@zuphilip below some more test cases and remarks. test cases: http://kiss.kstudy.com/public/public2-article.asp?key=50789039
http://kiss.kstudy.com/thesis/thesis-view.asp?key=3550330
http://kiss.kstudy.com/public/public2-article.asp?key=50375984
|
zuphilip
reviewed
Jan 2, 2018
@@ -131,11 +133,23 @@ function scrape(doc, url) { | |||
var containerParts = containerValue.match(/(.*)\s+(\d+)\D\s*(\d+)/); | |||
if (containerParts) { | |||
item.publicationTitle = containerParts[1]; | |||
item.volume = containerParts[2]; | |||
if (/\d{4}/.test(containerParts[2])) { | |||
containerParts[2] = null; |
This comment has been minimized.
This comment has been minimized.
zuphilip
Jan 2, 2018
Collaborator
Add a comment why you are doing this. Moreover, I guess you don't need to delete the value in the first case, but rather simply want to concentrate on the second case. Is this correct? Then you could instead try something like
if (containerParts[2].length<4) { // or the regexp if really needed
item.volume = containerParts[2];
}
This comment has been minimized.
This comment has been minimized.
yunusong
Jan 2, 2018
Author
Contributor
Oh, okay thank you for pointing that out.
The reason I added that line was that I found that sometimes the pages have published year in place of the volume number. So sometimes it would appear as Vol. 2013, and I didn't want that to happen, but your correction makes much more sense. Thank you. I'll fix that.
This comment has been minimized.
This comment has been minimized.
@yunusong For your question about the multiple in the public part: You need to tweak also the CSS selector for the different entries, i.e. this line var rows = ZU.xpath(doc, '//div[contains(@class, "thesis-info")]/h5/a[contains(@href, "/thesis")]'); (The last check is not true |
This comment has been minimized.
This comment has been minimized.
@socheres Thank you for the review! The ISSN is a good idea to add as well. I am not sure that I understand the naming correctly. Did you see currently an error in the splitting of author names? Example? |
This comment has been minimized.
This comment has been minimized.
socheres
commented
Jan 2, 2018
@zuphilip authors name splitting works only correct in case 1, not in 2 or 3. (Kim|Park|김) = last name In case 2,3 they should be - if possible - in a full name field, like in Zotero where i can switch author field from double in single field. |
This comment has been minimized.
This comment has been minimized.
socheres
commented
Jan 2, 2018
i think the website structure - authors - is too fragile and i think the effort is not worth it, because all the metadata on http://kiss.kstudy.com/ can be downloaded - third button below 인용하기 - in RIS. |
This comment has been minimized.
This comment has been minimized.
Uh... RIS, yes we should use that in the translator (it is the blue button with the arrow to the right). I will look into this now... |
This comment has been minimized.
This comment has been minimized.
Please have a look at the new version, which is pretty much a rewrite, but much simpler by using the RIS data. |
adam3smith
requested changes
Jan 3, 2018
A couple of small questions & requests. |
|
||
function scrape(doc, url) { | ||
var key = url.match(/\bkey=(\d+)\b/)[1]; | ||
var risURL = "http://kiss.kstudy.com/p-common/export_endnote.asp"; |
This comment has been minimized.
This comment has been minimized.
|
||
|
||
function scrape(doc, url) { | ||
var key = url.match(/\bkey=(\d+)\b/)[1]; |
This comment has been minimized.
This comment has been minimized.
adam3smith
Jan 3, 2018
Collaborator
we should check for this in detect, then, to make sure we have a key
} | ||
item.language = "ko-KR"; | ||
if (item.pages) { | ||
item.pages = item.pages.replace('-', '–'); |
This comment has been minimized.
This comment has been minimized.
adam3smith
Jan 3, 2018
Collaborator
we don't usually do this -- do we actually prefer en-dashes over hyphens in the database?
This comment has been minimized.
This comment has been minimized.
dstillman
Jan 3, 2018
Member
I don't think there's any harm in doing so, though we should probably just be automatically converting hyphens in page ranges to en dashes on save.
This comment has been minimized.
This comment has been minimized.
zuphilip
Jan 3, 2018
Collaborator
Hm... my mistake then. How is the page range then shown in citation styles? Maybe the replacement could also happens at some later step... I will delete it on the translator code, beause this would then be a larger issue.
"items": [ | ||
{ | ||
"itemType": "journalArticle", | ||
"title": "KOLMOGOROV DISTANCE FOR MULTIVARIATE NORMAL APPROXIMATION", |
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
zuphilip
Jan 3, 2018
Collaborator
I tried to think about this before. I don't know how to check easily whether the title is all-uppercase but not a Korean title. Any ideas?
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
zuphilip
Jan 3, 2018
Collaborator
Thank you. This also changes the title 치과용 콘빔 CT를 이용한 상악 정중과잉치의 3차원 분석
, which I wanted to avoid. I guess some test like [\u0000-\u00FF]*
should work for Latin alphabet only titles...
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
socheres
Jan 3, 2018
it will be great if the English title can be normalized, taking capitalization e.g. proper noun into account, and other words such as USA, EU in capitalize form. There are some similar projects (https://github.com/nlp-compromise/compromise), but I haven't found a "killer Javascript function" yet.
This comment has been minimized.
This comment has been minimized.
dstillman
Jan 3, 2018
Member
If a site gives a title in all-caps, as seems to be the case here, we generally just title case it according to some basic rules and leave the fixing of proper nouns and acronyms up to users. (Along other things, we don’t even necessarily know the language in order to reliably exclude words.)
The issue here just seems to be (I haven’t looked at this closely) that some Korean titles include English characters, and we want to avoid changing the case of those, either in the translator or in a more generalized way in capitalizeTitle
.
This comment has been minimized.
This comment has been minimized.
zuphilip
Jan 3, 2018
Collaborator
@dstillman Okay, my mistake. But the code pattern can still make sense then, to prevent making title case, when sentence case is correctly given.
I test that the majority of the title is from Latin alphabet. But your suggestion for capitalizeTitle
on the words
not change words if they contain any characters outside [\u0000-\u00FF]
sounds reasonable for me.
@socheres The function ZU.capitalizeTitle
can be used to "normalize" English titles into title case in translators, see https://github.com/zotero/zotero/blob/master/chrome/content/zotero/xpcom/utilities.js#L921 . However, it will add a capital letter to more or less every word with only a small number of exceptions cf. https://github.com/zotero/zotero/blob/2baa537542fcd5bc6e2826eb0eef6b34c571dcfc/chrome/content/zotero/xpcom/utilities.js#L922-L924 . To do anything similar e.g. for German and thereby differentiating between nouns and verbs etc. might be out of scope for Zotero.
This comment has been minimized.
This comment has been minimized.
zuphilip
Jan 3, 2018
Collaborator
Here is another example http://kiss.kstudy.com/thesis/thesis-view.asp?key=3480580 with the title
Analysis of Enzymes related to Lignin Modification of Phanerochaete chrysosporium ATCC20696 through the Transcriptomic and Proteomic Approaches
which is fine, but not title cased.
This comment has been minimized.
This comment has been minimized.
socheres
reviewed
Jan 3, 2018
"creatorType": "author", | ||
"fieldMode": 1 | ||
} | ||
], |
This comment has been minimized.
This comment has been minimized.
socheres
Jan 3, 2018
romanized korean names with comma create two last names:
Korean names are correct imported in "filedMode 1", but it will be nice to have, if korean names can be separated in last and first name like the function "ZU.cleanAuthor" for korean names.
This comment has been minimized.
This comment has been minimized.
zuphilip
Jan 3, 2018
Collaborator
Okay, you are right. The RIS data contains wrongly four authors in the AU
tags. But I still can reuse some of the previous code for scraping it directly in these cases. Stay tuned..
socheres
reviewed
Jan 3, 2018
"lastName": "Hyun Suk", | ||
"creatorType": "author", | ||
"fieldMode": 1 | ||
} |
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
I did changes according to the reviews of @adam3smith and @socheres . Sometimes it is needed to scrape the information for the authors directly from the page. I hope that we have the author information now correctly. Please have a look at the new version. |
This comment has been minimized.
This comment has been minimized.
socheres
commented
Jan 3, 2018
it looks source > 저자 : Kim, Yoon Tae, Park, Hyun Suk |
This comment has been minimized.
This comment has been minimized.
I checked the update, and things look really great. Thanks for the work. |
adam3smith
merged commit f4a0d8e
into
zotero:master
Jan 4, 2018
1 check passed
This comment has been minimized.
This comment has been minimized.
Great, thanks everyone! |
This comment has been minimized.
This comment has been minimized.
|
yunusong commentedMay 26, 2017
Translator for the Kstudy database.
Help appreciated for attachments, changing authors to a single field, and opening a javascript url at "journal/list_name.asp" detect. But any suggestions to improve the translator is welcome.
Thank you.