Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Kstudy translator #1308

Merged
merged 19 commits into from Jan 4, 2018

Conversation

Projects
None yet
7 participants
@yunusong
Copy link
Contributor

yunusong commented May 26, 2017

Translator for the Kstudy database.

Help appreciated for attachments, changing authors to a single field, and opening a javascript url at "journal/list_name.asp" detect. But any suggestions to improve the translator is welcome.

Thank you.

@fbennett

This comment has been minimized.

Copy link
Contributor

fbennett commented May 27, 2017

I've raised my hand, so ... a question first for the translator gurus listening in. The translator is based on the Framework, but it's going to need some out-of-the-ordinary post-processing of name field content, at least one intermediate xmlhttprequest call to acquire a ToC for one form of "multiple" content, and another to acquire attachments.

Is it possible to shoehorn arbitrary JS code into a Framework translator, or should things be recoded from scratch?

@avram

This comment has been minimized.

Copy link
Contributor

avram commented May 27, 2017

So FW does have some capacity to hook into the processing and run arbitrary JS, but I suspect it's better to rewrite than to do much complex work in those hooks

@zuphilip

This comment has been minimized.

Copy link
Collaborator

zuphilip commented May 27, 2017

Is it possible to shoehorn arbitrary JS code into a Framework translator, or should things be recoded from scratch?

One can add more JS to the functions detectWeb, doWeb for example

function detectWeb(doc, url) {
if(!ZU.xpath(doc, '//article/h1').length && !ZU.xpath(doc, '//div[@class="inner"]//div[contains(@class, "text_wrapper")]/h2/a')) {
return;
}
return FW.detectWeb(doc, url);
}
function doWeb(doc, url) { return FW.doWeb(doc, url); }
; or define a new function and call it somewhere in the FW chain.

Personally, I prefer the non-FW translators and here it seems that anything else would be pretty complex as well. @yunusong I can rewrite what you have as a normal translator without the FW, if this helps.

@fbennett

This comment has been minimized.

Copy link
Contributor

fbennett commented May 27, 2017

@zuphilip if you have time, and if @yunusong agrees, that would be great (Philipp is much more skilled with translators than I am).

With a non-FW translator as a base, I could then splice in conditional operations that would wake up in Juris-M, for transliterated names and whatnot.

@zuphilip

This comment has been minimized.

Copy link
Collaborator

zuphilip commented May 27, 2017

@yunusong

This comment has been minimized.

Copy link
Contributor Author

yunusong commented May 27, 2017

Thanks @zuphilip @fbennett !

Author field is not that big of a deal, actually. I tried adding ".remove(/(/).remove()/)" after ".split(/,/)" so that it won't take ")" as the last name. Then it took the romanized last name as the last name, and that was good enough for me. Separating last name and first name is almost never necessary in Korean academia.

On the other hand, I don't mind you guys rewriting the code at all. If it helps improving the code, I am all for it. What I would really like is the translator to automatically fetch the pdf files. The xpath for link to pdf files is "//div[@Class="choice"]/span[1]/a/@href" and an example link is
"javascript:SiteSelect1('http://210.101.116.18/kiss10/inFTP_Journal.asp', '87300332.pdf', 2201, 25169, 1);"
I tried rewriting the code without the framework myself, but I didn't have any better luck with the attachments. If you can help me with the attachments, that would be wonderful. Thank you.

@yunusong

This comment has been minimized.

Copy link
Contributor Author

yunusong commented May 27, 2017

@zuphilip

This comment has been minimized.

Copy link
Collaborator

zuphilip commented May 27, 2017

@yunusong I created a PR towards your fork and branch yunusong#1 . After accepting this, we should see it here as well.

yunusong added some commits May 27, 2017

Merge pull request #1 from zuphilip/yunusong-patch-2
Rewrite KStudy.js as Non-FW translator
Update href links for Journal Issues
I figured out how to unravel the javascripted links in Journal Issue type of "multiple". I tested on my zotero and confirmed that it works.
KStudy.js Outdated
{
"translatorID": "b298ca93-0010-48f5-97fb-e9923519a380",
"label": "KStudy",
"creator": "Yunwoo Song","Philipp Zumstein"

This comment has been minimized.

@zuphilip

zuphilip May 27, 2017

Collaborator

Okay, thank you, but this needs to be one string containing both names, i.e.

"Yunwoo Song, Philipp Zumstein"

This comment has been minimized.

@yunusong

yunusong May 27, 2017

Author Contributor

Okay

@yunusong

This comment has been minimized.

Copy link
Contributor Author

yunusong commented May 27, 2017

I located the JS file which contains the function for downloading pdf.
The function goes as
function SiteSelect1(ftproot, a_imag, inst_key, publ_key,isDownLoad) {
var url = "../search/download.asp?ftproot=" + ftproot + "&inst_key=" + inst_key + "&a_imag=" + a_imag + "&isDownLoad=" + isDownLoad + "&publ_key=" + publ_key
window.open(url, "", "toolbar=0,location=0,directories=0,status=0,menubar=0,scrollbars=1,resizable=1,copyhistory=1,top=170,left=250,width=540,height=500");
}

So I coded item.attachments as follows.

var pdfJSurl = ZU.xpathText(doc, '//div[@class="search_box"]/div[@class="choice"]/span[1]/a/@href');
var pdfurlKeys = pdfJSurl.match(/Select1\(\'(\S+)\'\,\s?\'(\S+)\'\,\s?(\S+)\,\s?(\S+)\,\s?(\S+)\)\;/);
var pdfurl = "../search/download.asp?ftproot=" + pdfurlKeys[1] + "&inst_key=" + pdfurlKeys[3] + "&a_imag=" + pdfurlKeys[2] + "&isDownLoad=" + pdfurlKeys[5] + "&publ_key=" + pdfurlKeys[4];
item.attachments.push({
	title : item.title,
	url : pdfurl,
	mimeType : "application/pdf",
	snapshot : false
})

But Zotero still won't fetch the pdf.

I tried fetching the resulting url by assigning to an unused slot. eg. issn. as
item.ISSN = pdfurl;
An example of a resulting url is as follows.
../search/download.asp?ftproot=http://210.101.116.17/kiss8/inFTP_Journal.asp&inst_key=7074&a_imag=40323420.pdf&isDownLoad=1&publ_key=25408

I tried pasting this address directly on a browser and it will start downloading the pdf.
So I am wondering what is still blocking Zotero from storing the pdf.

The odd thing I noticed about the original function was that it does not end the "var url =" statement with a semicolon. The semicolon only appears at the end of the second line. Would this have something to do with the problem? Thank you guys for the comments.

KStudy.js Outdated
@@ -1,7 +1,7 @@
{
"translatorID": "b298ca93-0010-48f5-97fb-e9923519a380",
"label": "KStudy",
"creator": "Yunwoo Song","Philipp Zumstein"
"creator": "Yunwoo Song, Philipp Zumstein"

This comment has been minimized.

@zuphilip

zuphilip May 27, 2017

Collaborator

This is not valid JSON. You need to add a comma at the end of this line.

@zuphilip

This comment has been minimized.

Copy link
Collaborator

zuphilip commented May 27, 2017

Can you try snapshot : true in your code above?

@fbennett

This comment has been minimized.

Copy link
Contributor

fbennett commented May 27, 2017

I'll take a look on Monday (I need to be in the University net to get access). As an offhand thought in the meantime, assuming that the translator is loading properly, there may be a redirect that the translator is not able to follow without some further jiggery-pokery. (You can follow the transactions by opening Developer Tools in the browser and then refreshing on the download URL.)

@yunusong

This comment has been minimized.

Copy link
Contributor Author

yunusong commented May 27, 2017

Thank you @fbennett !
@zuphilip I just tried "snapshot : true" but nothing changed. Zotero still shows a big red X for attachments, and the test page at scaffold will not show any value for "url" under attachments. I also tried "snapshot : false," but no pdf either.

@adam3smith

This comment has been minimized.

Copy link
Collaborator

adam3smith commented May 28, 2017

Thanks everyone -- please let me know once this is ready for final review. As for FW, given #1277, I'd prefer us avoiding new FW translators.

@yunusong

This comment has been minimized.

Copy link
Contributor Author

yunusong commented May 28, 2017

@fbennett I have one more question.
When I erased all other attributes of the attachment and just left the url as "url : pdfurl" (which gives a long url that would have been the result of the function SiteSelect1), Zotero saves the snapshot of the page that leads to downloading, and the browser downloads the pdf file to my download folder in my computer.

I think I need the translator to access this final url directly. The gist of the html code that calls the url is as follows.

<body onLoad="document.down.submit();">
<form name="down" method="post" action="http://210.101.116.17/kiss8/inftp_journal.asp">
<input type="hidden" name="a_code" value="40323420.pdf">
<input type="hidden" name="code" value="7624112825580073101">
<input type="hidden" name="isDownLoad" value="1">
</form>
</body>

If you paste this html code into a new text and open it in a browser, you will see that it downloads a pdf file even without requiring proxy. Do you think you can work around this code and have the zotero to fetch the final url? I really appreciate your help.

@fbennett

This comment has been minimized.

Copy link
Contributor

fbennett commented May 28, 2017

@yunusong

This comment has been minimized.

Copy link
Contributor Author

yunusong commented May 29, 2017

Can you guys tell me what is wrong with my code? I don't think I quite understand how to use ZU.doPost. But I am quite certain that I have to use it because the actual page that leads to pdf has method="post".

So in the scrape function I added the code for attachments as follows.

`var` pdfJSurl = ZU.xpathText(doc, '//div[@class="search_box"]/div[@class="choice"]/span[1]/a/@href');
`var` pdfurlKeys = pdfJSurl.match(/Select1\(\'(\S+)\'\,\s?\'(\S+)\'\,\s?(\S+)\,\s?(\S+)\,\s?(\S+)\)\;/);
`var` pdflinkurl = "../search/download.asp?ftproot=" + pdfurlKeys[1] + "&inst_key=" + pdfurlKeys[3] + "&a_imag=" + pdfurlKeys[2] + "&isDownLoad=" + pdfurlKeys[5] + "&publ_key=" + pdfurlKeys[4];

`item.attachments.push({`
	url : ZU.processDocuments(pdflinkurl, getpdf),
	snapshot : true,
})

I was hoping that this would process the getpdf function on the page that opens. And for getpdf function, I coded it as follows.

function getpdf(doc, url) {
var theurl = ZU.xpathText(doc, '//form[@name="down"]/@action');
var a_code = ZU.xpathText(doc, '//input[@name="a_code"]/@value');
var code = ZU.xpathText(doc, '//input[@name="code"]/@value');
var isDownload = ZU.xpathText(doc, '//input[@name="isDownload"]/@value');
var postData = [a_code, code, isDownload];
var headerData = {"a_code": a_code, "code": code, "isDownload": isDownload};
ZU.doPost(theurl, a_code + code + isDownload, function(pdfurl2) {});
}

I checked how other translators use the doPost, and they were very different from how I wrote it, but I don't quite understand how to fix it.

@zuphilip

This comment has been minimized.

Copy link
Collaborator

zuphilip commented May 29, 2017

That is a random guess, but I think you have to switch the two things, i.e. something like

ZU.doPost(pdflinkurl, postdata, function(pdfdoc) {
   item.attachments.push({
	doc: pdfdoc,
	title: "Fulltext",
        mimetype: "pplication/pdf"
   });
   item.complete();
}

But I don't really know if this can work, because usually the function has to work with text...

@fbennett

This comment has been minimized.

Copy link
Contributor

fbennett commented May 29, 2017

Picking up now (fell asleep last night, but I turned on the office proxy yesterday, so I'm "at work" this morning). Will fiddle around a bit and see what I can come up with.

@yunusong

This comment has been minimized.

Copy link
Contributor Author

yunusong commented May 29, 2017

Ok, I think I figured out how to use ZU.doPost. But I still couldn't get Zotero to save pdf. This is the code I added at the end of the scrape function. The weird thing is, though, when I run test in Scaffold, the result will show "Translation Successful" AND my browser will download the pdf to my computer, even though I only ran test. And when I actually test it with Zotero, it will not fetch pdf. Any thoughts on what I should change?

var pdfJSurl = ZU.xpathText(doc, '//div[@class="search_box"]/div[@class="choice"]/span[1]/a/@href');
var pdfurlKeys = pdfJSurl.match(/Select1\(\'(\S+)\'\,\s?\'(\S+)\'\,\s?(\S+)\,\s?(\S+)\,\s?(\S+)\)\;/);
var pdflinkurl = "../search/download.asp?ftproot=" + pdfurlKeys[1] + "&inst_key=" + pdfurlKeys[3] + "&a_imag=" + pdfurlKeys[2] + "&isDownLoad=" + pdfurlKeys[5] + "&publ_key=" + pdfurlKeys[4];
ZU.processDocuments(pdflinkurl, function(doc, url) {
	var actionurl = ZU.xpathText(doc, '//form[@name="down"]/@action');
	var inputData = "a_code=" + ZU.xpathText(doc, '//input[@name="a_code"]/@value') + "&code=" + ZU.xpathText(doc, '//input[@type="code"]/@value') + "&isDownload=" + ZU.xpathText(doc, '//input[@name="isDownload"]/@value');
	ZU.doPost(actionurl, inputData, function(pdfdoc) {
		item.attachments.push({
			url : pdfdoc,
		});
	});
});

item.complete();
@fbennett

This comment has been minimized.

Copy link
Contributor

fbennett commented May 29, 2017

Cracked it. I have code that attaches a readable PDF. As it turned out, no further call (to doGet, doPost or processDocuments) was necessary. Tried various things, but what did the trick was to use the viewer.asp URL that is the ultimate target of a view request. It's a POST call on the site, but setting the post data as a query string on the URL (which I presume the translator calls with GET) worked anyway. The only post data needed is code_num, which is just the server-side filename of the PDF. So it turns out to be super simple, after all our guesswork.

I'll file a pull request in a bit.

@zuphilip

This comment has been minimized.

Copy link
Collaborator

zuphilip commented Jan 2, 2018

🌥 Okay, I am continuing to work here now. I will let you know when I have a new version...

@zuphilip

This comment has been minimized.

Copy link
Collaborator

zuphilip commented Jan 2, 2018

I generalized the translator for the example(s) found by @adam3smith but choosed in the end journalArticle for that documented in new test cases. However, it seems for me, that the PDF download we see in the public part is maybe tailored for that and could be fragile at all:

kiss-public-download

Thus, I decided to skip the PDF download for the moment. Maybe, we can start with the translator as it is now. Any improvement for PDF download could be added later by someone with access to the database.

Update KStudy.js to work also in the public part
Deleted also the comments for downloading PDFs,
which we do not yet handle.
@zuphilip

This comment has been minimized.

Copy link
Collaborator

zuphilip commented Jan 2, 2018

Please have a look at the new version.

@socheres

This comment has been minimized.

Copy link

socheres commented Jan 2, 2018

@zuphilip below some more test cases and remarks.
author in Korean should import in a "single field" in Zotero., because the first letters(Silben) of Korean names are usually surname and the other two letters are first name. Also, there is no middle name in korean.
Authors in latinized form with comma is separated correctly.
one more suggestions for issn.
//e.g. issn = ISSN(Print) var issn = ZU.xpathText(doc, '//li[label[text()="ISSN(Print)"]]'); if(issn){ item.ISSN = ZU.cleanISSN(issn); }

test cases:

http://kiss.kstudy.com/public/public2-article.asp?key=50789039

18:30:10 Returned item:
           {
             "itemType": "journalArticle"
             "creators": [
               {
                 "firstName": "Yoon Tae"
                 "lastName": "Kim"
                 "creatorType": "author"
               }
               {
                 "firstName": "Hyun Suk"
                 "lastName": "Park"
                 "creatorType": "author"
               }
             ]
             "notes": []
             "tags": []
             "seeAlso": []
             "attachments": []
             "title": "KOLMOGOROV DISTANCE FOR MULTIVARIATE NORMAL APPROXIMATION"
             "language": "ko-KR"
             "publicationTitle": "Korean Journal of mathematics (강원경기수학회)"
             "volume": "23"
             "issue": "1"
             "date": "2015"
             "pages": "1–10"
             "abstractNote": "초록 보기"
             "libraryCatalog": "kiss"
           }
18:30:10 Translation successful

http://kiss.kstudy.com/thesis/thesis-view.asp?key=3550330

18:31:51 Returned item:
           {
             "itemType": "journalArticle"
             "creators": [
               {
                 "lastName": "( Yoon Tae Kim ) "
                 "fieldMode": true
                 "creatorType": "author"
               }
               {
                 "lastName": " ( Hyun Suk Park )"
                 "fieldMode": true
                 "creatorType": "author"
               }
             ]
             "notes": []
             "tags": []
             "seeAlso": []
             "attachments": []
             "title": "Convergence rate of a test statistics observed by the longitudinal data with long memory"
             "language": "ko-KR"
             "publicationTitle": "CSAM(Communications for Statistical Applications and Methods)"
             "volume": "24"
             "issue": "5"
             "date": "2017-09"
             "pages": "481–492"
             "abstractNote": "초록 보기 This paper investigates a convergence rate of a test statistics given by two scale sampling method based on Ait-Sahalia and Jacod (Annals of Statistics, 37, 184-222, 2009). This statistics tests for longitudinal data having the existence of long memory dependence driven by fractional Brownian motion with Hurst parameter H ∈ (1/2, 1). We obtain an upper bound in the Kolmogorov distance for normal approximation of this test statistic. As a main tool for our works, the recent results in Nourdin and Peccati (Probability Theory and Related Fields, 145, 75-118, 2009; Annals of Probability, 37, 2231-2261, 2009) will be used. These results are obtained by employing techniques based on the combination between Malliavin calculus and Stein`s method for normal approximation."
             "libraryCatalog": "kiss"
           }
18:31:51 Translation successful

http://kiss.kstudy.com/public/public2-article.asp?key=50375984

18:33:10 Returned item:
           {
             "itemType": "journalArticle"
             "creators": [
               {
                 "firstName": "Ja-Son Y."
                 "lastName": "Kim"
                 "creatorType": "author"
               }
               {
                 "firstName": "Tae-Joon"
                 "lastName": "Park"
                 "creatorType": "author"
               }
               {
                 "firstName": "Jin-Sol"
                 "lastName": "Lee"
                 "creatorType": "author"
               }
               {
                 "firstName": "Ji-Yong"
                 "lastName": "Chun"
                 "creatorType": "author"
               }
               {
                 "firstName": "Joon-Seol"
                 "lastName": "Bae"
                 "creatorType": "author"
               }
               {
                 "firstName": "Byung-Lae"
                 "lastName": "Park"
                 "creatorType": "author"
               }
               {
                 "firstName": "Hyun-Sub"
                 "lastName": "Cheong"
                 "creatorType": "author"
               }
               {
                 "firstName": "Hyo-Suk"
                 "lastName": "Lee"
                 "creatorType": "author"
               }
               {
                 "firstName": "Yoon-Jun"
                 "lastName": "Kim"
                 "creatorType": "author"
               }
               {
                 "firstName": "Hyoung-Doo"
                 "lastName": "Shin"
                 "creatorType": "author"
               }
             ]
             "notes": []
             "tags": []
             "seeAlso": []
             "attachments": []
             "title": "Association Analysis of SERPINB5 Polymorphisms with HBV Clearance and HCC Occurrence in a Korean Population"
             "language": "ko-KR"
             "publicationTitle": "Genomics & informatics (한국유전체학회)"
             "volume": "8"
             "issue": "1"
             "date": "2010"
             "pages": "1–8"
             "abstractNote": "초록 보기"
             "libraryCatalog": "kiss"
           }
18:33:10 Translation successful
Update KStudy.js
I added an itemType "report" for one type of new formats and added some minor details.

I think we need to add "url.indexOf('public2-jrd-sub.asp')>-1" as one of the options for "multiple", but for some reason I keep getting errors. Can someone take a look at this?

The example address is: http://kiss.kstudy.com/public/public2-jrd-sub.asp?key1=100171&key2=10017&selYEAR=2012&selVOL1=1&selNUM1=2&selMon=
KStudy.js Outdated
@@ -131,11 +133,23 @@ function scrape(doc, url) {
var containerParts = containerValue.match(/(.*)\s+(\d+)\D\s*(\d+)/);
if (containerParts) {
item.publicationTitle = containerParts[1];
item.volume = containerParts[2];
if (/\d{4}/.test(containerParts[2])) {
containerParts[2] = null;

This comment has been minimized.

@zuphilip

zuphilip Jan 2, 2018

Collaborator

Add a comment why you are doing this. Moreover, I guess you don't need to delete the value in the first case, but rather simply want to concentrate on the second case. Is this correct? Then you could instead try something like

if (containerParts[2].length<4) { // or the regexp if really needed
   item.volume = containerParts[2];
}

This comment has been minimized.

@yunusong

yunusong Jan 2, 2018

Author Contributor

Oh, okay thank you for pointing that out.
The reason I added that line was that I found that sometimes the pages have published year in place of the volume number. So sometimes it would appear as Vol. 2013, and I didn't want that to happen, but your correction makes much more sense. Thank you. I'll fix that.

@zuphilip

This comment has been minimized.

Copy link
Collaborator

zuphilip commented Jan 2, 2018

@yunusong For your question about the multiple in the public part: You need to tweak also the CSS selector for the different entries, i.e. this line

var rows = ZU.xpath(doc, '//div[contains(@class, "thesis-info")]/h5/a[contains(@href, "/thesis")]');

(The last check is not true contains(@href, "/thesis"); but change it wisely/carefully in order not to influence the other multiples negatively.)

@zuphilip

This comment has been minimized.

Copy link
Collaborator

zuphilip commented Jan 2, 2018

@socheres Thank you for the review! The ISSN is a good idea to add as well. I am not sure that I understand the naming correctly. Did you see currently an error in the splitting of author names? Example?

@socheres

This comment has been minimized.

Copy link

socheres commented Jan 2, 2018

@zuphilip authors name splitting works only correct in case 1, not in 2 or 3.
1.Kim, Moon Tae , Park, Hyun Suk
2.( Yoon Tae Kim ) , ( Hyun Suk Park )
3.김윤태 ( Yuntae Kim )

(Kim|Park|김) = last name

In case 2,3 they should be - if possible - in a full name field, like in Zotero where i can switch author field from double in single field.
3. 김 = last name
윤태 = first name
Kim = last name
Yuntae = first name

@socheres

This comment has been minimized.

Copy link

socheres commented Jan 2, 2018

i think the website structure - authors - is too fragile and i think the effort is not worth it, because all the metadata on http://kiss.kstudy.com/ can be downloaded - third button below 인용하기 - in RIS.

@zuphilip

This comment has been minimized.

Copy link
Collaborator

zuphilip commented Jan 2, 2018

Uh... RIS, yes we should use that in the translator (it is the blue button with the arrow to the right). I will look into this now...

@zuphilip

This comment has been minimized.

Copy link
Collaborator

zuphilip commented Jan 3, 2018

Please have a look at the new version, which is pretty much a rewrite, but much simpler by using the RIS data. 🚀

@adam3smith
Copy link
Collaborator

adam3smith left a comment

A couple of small questions & requests.

KStudy.js Outdated

function scrape(doc, url) {
var key = url.match(/\bkey=(\d+)\b/)[1];
var risURL = "http://kiss.kstudy.com/p-common/export_endnote.asp";

This comment has been minimized.

@adam3smith

adam3smith Jan 3, 2018

Collaborator

lets make this relative in case the site starts offering https



function scrape(doc, url) {
var key = url.match(/\bkey=(\d+)\b/)[1];

This comment has been minimized.

@adam3smith

adam3smith Jan 3, 2018

Collaborator

we should check for this in detect, then, to make sure we have a key

KStudy.js Outdated
}
item.language = "ko-KR";
if (item.pages) {
item.pages = item.pages.replace('-', '');

This comment has been minimized.

@adam3smith

adam3smith Jan 3, 2018

Collaborator

we don't usually do this -- do we actually prefer en-dashes over hyphens in the database?

This comment has been minimized.

@dstillman

dstillman Jan 3, 2018

Member

I don't think there's any harm in doing so, though we should probably just be automatically converting hyphens in page ranges to en dashes on save.

This comment has been minimized.

@zuphilip

zuphilip Jan 3, 2018

Collaborator

Hm... my mistake then. How is the page range then shown in citation styles? Maybe the replacement could also happens at some later step... I will delete it on the translator code, beause this would then be a larger issue.

KStudy.js Outdated
"items": [
{
"itemType": "journalArticle",
"title": "KOLMOGOROV DISTANCE FOR MULTIVARIATE NORMAL APPROXIMATION",

This comment has been minimized.

@adam3smith

adam3smith Jan 3, 2018

Collaborator

let's title case all caps titles

This comment has been minimized.

@zuphilip

zuphilip Jan 3, 2018

Collaborator

I tried to think about this before. I don't know how to check easily whether the title is all-uppercase but not a Korean title. Any ideas?

This comment has been minimized.

@dstillman

dstillman Jan 3, 2018

Member

You should be able to just use ZU.capitalizeTitle(title, true).

This comment has been minimized.

@zuphilip

zuphilip Jan 3, 2018

Collaborator

Thank you. This also changes the title 치과용 콘빔 CT를 이용한 상악 정중과잉치의 3차원 분석, which I wanted to avoid. I guess some test like [\u0000-\u00FF]* should work for Latin alphabet only titles...

This comment has been minimized.

@dstillman

dstillman Jan 3, 2018

Member

How does it change it?

This comment has been minimized.

@socheres

socheres Jan 3, 2018

it will be great if the English title can be normalized, taking capitalization e.g. proper noun into account, and other words such as USA, EU in capitalize form. There are some similar projects (https://github.com/nlp-compromise/compromise), but I haven't found a "killer Javascript function" yet.

This comment has been minimized.

@dstillman

dstillman Jan 3, 2018

Member

If a site gives a title in all-caps, as seems to be the case here, we generally just title case it according to some basic rules and leave the fixing of proper nouns and acronyms up to users. (Along other things, we don’t even necessarily know the language in order to reliably exclude words.)

The issue here just seems to be (I haven’t looked at this closely) that some Korean titles include English characters, and we want to avoid changing the case of those, either in the translator or in a more generalized way in capitalizeTitle.

This comment has been minimized.

@zuphilip

zuphilip Jan 3, 2018

Collaborator

@dstillman Okay, my mistake. But the code pattern can still make sense then, to prevent making title case, when sentence case is correctly given.

I test that the majority of the title is from Latin alphabet. But your suggestion for capitalizeTitle on the words

not change words if they contain any characters outside [\u0000-\u00FF]

sounds reasonable for me.

@socheres The function ZU.capitalizeTitle can be used to "normalize" English titles into title case in translators, see https://github.com/zotero/zotero/blob/master/chrome/content/zotero/xpcom/utilities.js#L921 . However, it will add a capital letter to more or less every word with only a small number of exceptions cf. https://github.com/zotero/zotero/blob/2baa537542fcd5bc6e2826eb0eef6b34c571dcfc/chrome/content/zotero/xpcom/utilities.js#L922-L924 . To do anything similar e.g. for German and thereby differentiating between nouns and verbs etc. might be out of scope for Zotero.

This comment has been minimized.

@zuphilip

zuphilip Jan 3, 2018

Collaborator

Here is another example http://kiss.kstudy.com/thesis/thesis-view.asp?key=3480580 with the title

Analysis of Enzymes related to Lignin Modification of Phanerochaete chrysosporium ATCC20696 through the Transcriptomic and Proteomic Approaches

which is fine, but not title cased.

This comment has been minimized.

@socheres

socheres Jan 3, 2018

thank you @avram @zuphilip for explaining. i totally understand that fixing crappy data records is out of scope for zotero. it was just feedback - nice to have feature - from the users/scientists. publisher who create metadata have to deal it.

KStudy.js Outdated
"creatorType": "author",
"fieldMode": 1
}
],

This comment has been minimized.

@socheres

socheres Jan 3, 2018

romanized korean names with comma create two last names:

저자 : Kim, Yoon Tae, Park, Hyun Suk

Korean names are correct imported in "filedMode 1", but it will be nice to have, if korean names can be separated in last and first name like the function "ZU.cleanAuthor" for korean names.

This comment has been minimized.

@zuphilip

zuphilip Jan 3, 2018

Collaborator

Okay, you are right. The RIS data contains wrongly four authors in the AU tags. But I still can reuse some of the previous code for scraping it directly in these cases. Stay tuned..

KStudy.js Outdated
"lastName": "Hyun Suk",
"creatorType": "author",
"fieldMode": 1
}

This comment has been minimized.

@socheres
@zuphilip

This comment has been minimized.

Copy link
Collaborator

zuphilip commented Jan 3, 2018

I did changes according to the reviews of @adam3smith and @socheres . Sometimes it is needed to scrape the information for the authors directly from the page. I hope that we have the author information now correctly.

Please have a look at the new version.

@socheres

This comment has been minimized.

Copy link

socheres commented Jan 3, 2018

it looks 👍
source: > ( Chang-young Hong ) , ( Su-yeon Lee ) , ( Myungkil Kim ) , ( In-gyu Choi )
{ "itemType": "journalArticle" "creators": [ { "firstName": "Chang-young" "lastName": "Hong" "creatorType": "author" } { "firstName": "Su-yeon" "lastName": "Lee" "creatorType": "author" } { "firstName": "Myungkil" "lastName": "Kim" "creatorType": "author" } { "firstName": "In-gyu" "lastName": "Choi" "creatorType": "author" }

source > 저자 : Kim, Yoon Tae, Park, Hyun Suk
{ "itemType": "journalArticle" "creators": [ { "firstName": "Yoon Tae" "lastName": "Kim" "creatorType": "author" } { "firstName": "Hyun Suk" "lastName": "Park" "creatorType": "author" } ]
source > 김윤태 ( Yuntae Kim )
{ "itemType": "journalArticle" "creators": [ { "lastName": "김윤태" "creatorType": "author" "fieldMode": 1 }👏

@yunusong

This comment has been minimized.

Copy link
Contributor Author

yunusong commented Jan 3, 2018

I checked the update, and things look really great. Thanks for the work.

@adam3smith adam3smith merged commit f4a0d8e into zotero:master Jan 4, 2018

1 check passed

continuous-integration/travis-ci/pr The Travis CI build passed
Details
@adam3smith

This comment has been minimized.

Copy link
Collaborator

adam3smith commented Jan 4, 2018

Great, thanks everyone!

@zuphilip

This comment has been minimized.

Copy link
Collaborator

zuphilip commented Jan 4, 2018

Thank you all @yunusong, @fbennett, @avram, @dstillman, @socheres, @adam3smith for the work and the review! That was a massive collaboration on a Zotero translator!

GuyAglionby added a commit to GuyAglionby/translators that referenced this pull request Jan 28, 2018

Kstudy translator (zotero#1308)
Using RIS data and author tweaks

psisquared2 added a commit to psisquared2/translators that referenced this pull request Feb 8, 2018

Kstudy translator (zotero#1308)
Using RIS data and author tweaks
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.