Join GitHub today
GitHub is home to over 40 million developers working together to host and review code, manage projects, and build software together.
Sign upSemantic Scholar extend PDF extraction + fix errors when logged in #2103
Conversation
Thank you! This looks fine. I only have some small comments for simplification of the code as well as make the extraction of arXiv IDs more general as well as one question. Everything should be easy to implement. Let me know if my comments/suggestions are not yet clear. |
pdfLinkElement = rawData.primaryPaperLink; | ||
} | ||
else if (rawData.alternatePaperLinks) { | ||
for (let i = 0; i < rawData.alternatePaperLinks.length; i++) { |
This comment has been minimized.
This comment has been minimized.
zuphilip
Dec 31, 2019
Collaborator
for (let i = 0; i < rawData.alternatePaperLinks.length; i++) { | |
for (let alternateElement of rawData.alternatePaperLinks) { |
This comment has been minimized.
This comment has been minimized.
zuphilip
Dec 31, 2019
Collaborator
Then the following line is not anymore needed (and you never use the variable i
anyways here), so this is a simplification of your code.
else if (rawData.alternatePaperLinks) { | ||
for (let i = 0; i < rawData.alternatePaperLinks.length; i++) { | ||
let alternateElement = rawData.alternatePaperLinks[i]; | ||
if (alternateElement.url.endsWith('.pdf')) { |
This comment has been minimized.
This comment has been minimized.
zuphilip
Dec 31, 2019
Collaborator
if (alternateElement.url.endsWith('.pdf')) { | |
if (!pdfLinkElement && alternateElement.url.endsWith('.pdf')) { |
This comment has been minimized.
This comment has been minimized.
zuphilip
Dec 31, 2019
Collaborator
Then delete the break
below and move the code about the arXiv ID up here. (This then covers the cases that there is pdf link which we use to the element, but there is also also another link following to arXiv.)
mimeType: 'application/pdf' | ||
}); | ||
|
||
if (pdfLinkElement.linkType == 'arxiv') { |
This comment has been minimized.
This comment has been minimized.
"itemID": "Dalvi2018TrackingSC", | ||
"libraryCatalog": "Semantic Scholar", | ||
"proceedingsTitle": "NAACL-HLT", | ||
"publicationTitle": "NAACL-HLT", |
This comment has been minimized.
This comment has been minimized.
zuphilip
Dec 31, 2019
Collaborator
Are both fields here come from Scaffold?
This looks strange as proceedingsTitle
is only some sort of alias to publicationTitle
...
This comment has been minimized.
This comment has been minimized.
GuyAglionby
Jan 10, 2020
Author
Contributor
Scaffold doesn't work well with updating tests for some reason -- the element used to determine type in detectWeb
isn't found for some reason, even with defer: true
. This is despite it appearing when I wget or curl a page. Not sure why this is, but the tests run/pass as usual.
This comment has been minimized.
This comment has been minimized.
Thanks for the review -- I don't think I understood the exact changes you suggested with the arXiv IDs, but I think the amended code should incorporate the idea |
This comment has been minimized.
This comment has been minimized.
I believe this is superseded by #2112 which includes PDF scraping, but haven't compared closely |
This comment has been minimized.
This comment has been minimized.
Yep, I think that's the case |
GuyAglionby commentedDec 31, 2019
Some relatively minor changes
hasPDF
property