Semantic Scholar extend PDF extraction + fix errors when logged in #2103

Closed
wants to merge 2 commits

Conversation

GuyAglionby (Contributor) commented Dec 31, 2019

Some relatively minor changes:

  • Change the method of checking whether a PDF is available, as Semantic Scholar removed the hasPDF property.
  • Go through the list of alternative paper URLs to find a PDF if the main link isn't one (the links identified the old way, via 's2' and 'arxiv', all ended in .pdf already). See the sketch after this list.
  • Comb through the encoded data in a more robust way, as the previous method broke when you were logged in.
  • Updated the tests to reflect underlying changes on the website (including new URLs). Removed one test because it now redirects to a different paper; it wasn't covering anything the others don't.
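
A minimal sketch of the PDF selection described above, assembled from the diff fragments quoted in the review below; the function name addPdfAttachment and the exact surrounding code are illustrative, not taken from the PR:

// Pick the primary link if it already points at a PDF, otherwise scan the
// alternative links, then attach whatever was found to the Zotero item.
function addPdfAttachment(rawData, item) {
	let pdfLinkElement;
	if (rawData.primaryPaperLink && rawData.primaryPaperLink.url.endsWith('.pdf')) {
		pdfLinkElement = rawData.primaryPaperLink;
	}
	else if (rawData.alternatePaperLinks) {
		for (let i = 0; i < rawData.alternatePaperLinks.length; i++) {
			let alternateElement = rawData.alternatePaperLinks[i];
			if (alternateElement.url.endsWith('.pdf')) {
				pdfLinkElement = alternateElement;
				break;
			}
		}
	}
	if (pdfLinkElement) {
		item.attachments.push({
			title: 'Full Text PDF',
			url: pdfLinkElement.url,
			mimeType: 'application/pdf'
		});
	}
}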
GuyAglionby changed the title Extend PDF extraction + fix errors when logged in Semantic Scholar extend PDF extraction + fix errors when logged in Dec 31, 2019
zuphilip (Collaborator) left a comment

Thank you! This looks fine. I only have some small comments: simplifying the code, making the extraction of arXiv IDs more general, and one question. Everything should be easy to implement. Let me know if my comments/suggestions are not yet clear.

pdfLinkElement = rawData.primaryPaperLink;
}
else if (rawData.alternatePaperLinks) {
for (let i = 0; i < rawData.alternatePaperLinks.length; i++) {

zuphilip (Collaborator) commented Dec 31, 2019

Suggested change:
- for (let i = 0; i < rawData.alternatePaperLinks.length; i++) {
+ for (let alternateElement of rawData.alternatePaperLinks) {

zuphilip (Collaborator) commented Dec 31, 2019

Then the following line is no longer needed (and you never use the variable i here anyway), so this simplifies your code.

else if (rawData.alternatePaperLinks) {
for (let i = 0; i < rawData.alternatePaperLinks.length; i++) {
let alternateElement = rawData.alternatePaperLinks[i];
if (alternateElement.url.endsWith('.pdf')) {

zuphilip (Collaborator) commented Dec 31, 2019

Suggested change:
- if (alternateElement.url.endsWith('.pdf')) {
+ if (!pdfLinkElement && alternateElement.url.endsWith('.pdf')) {

zuphilip (Collaborator) commented Dec 31, 2019

Then delete the break below and move the arXiv ID code up here. (This then covers the case where there is a PDF link, which we use for the attachment, but another link to arXiv also follows.)
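
A rough sketch (not the actual diff) of how these suggestions could combine: for...of iteration, a !pdfLinkElement guard so the first PDF wins, and the arXiv check done per element so a PDF link followed by a separate arXiv link is still handled. The function name pickPaperLinks and the arxivElement variable are assumptions for illustration:

// Scan the links once: remember the first PDF link and any arXiv link.
function pickPaperLinks(rawData) {
	let pdfLinkElement = null;
	let arxivElement = null;
	if (rawData.primaryPaperLink && rawData.primaryPaperLink.url.endsWith('.pdf')) {
		pdfLinkElement = rawData.primaryPaperLink;
	}
	else if (rawData.alternatePaperLinks) {
		for (let alternateElement of rawData.alternatePaperLinks) {
			if (!pdfLinkElement && alternateElement.url.endsWith('.pdf')) {
				pdfLinkElement = alternateElement;
			}
			if (alternateElement.linkType == 'arxiv') {
				arxivElement = alternateElement;
			}
		}
	}
	return { pdfLinkElement, arxivElement };
}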

mimeType: 'application/pdf'
});

if (pdfLinkElement.linkType == 'arxiv') {

zuphilip (Collaborator) commented Dec 31, 2019

Move this code block up.

"itemID": "Dalvi2018TrackingSC",
"libraryCatalog": "Semantic Scholar",
"proceedingsTitle": "NAACL-HLT",
"publicationTitle": "NAACL-HLT",

zuphilip (Collaborator) commented Dec 31, 2019

Do both of these fields come from Scaffold?

This looks strange, as proceedingsTitle is only some sort of alias for publicationTitle...

GuyAglionby (Author, Contributor) commented Jan 10, 2020

Scaffold doesn't work well for updating the tests -- the element used to determine the type in detectWeb isn't found, even with defer: true. This is despite it appearing when I wget or curl the page. Not sure why, but the tests run and pass as usual.
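
For reference, a deferred translator test case typically has this shape; the URL and item fields below are placeholders, not the actual test data from this PR:

/** BEGIN TEST CASES **/
var testCases = [
	{
		"type": "web",
		"url": "https://www.semanticscholar.org/paper/<paper-id>",
		"defer": true,
		"items": [
			{
				"itemType": "conferencePaper",
				"title": "Example title",
				"libraryCatalog": "Semantic Scholar"
			}
		]
	}
];
/** END TEST CASES **/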

GuyAglionby (Author, Contributor) commented Jan 10, 2020

Thanks for the review -- I don't think I understood the exact changes you suggested for the arXiv IDs, but I think the amended code incorporates the idea.

adam3smith (Collaborator) commented Jan 19, 2020

I believe this is superseded by #2112, which includes PDF scraping, but I haven't compared them closely.

GuyAglionby (Author, Contributor) commented Jan 19, 2020

Yep, I think that's the case.

adam3smith closed this Jan 19, 2020