Semantic Scholar extend PDF extraction + fix errors when logged in #2103

Closed
wants to merge 2 commits

Conversation

GuyAglionby (Contributor) commented Dec 31, 2019

Some relatively minor changes:

  • Change the method of checking whether a PDF is available, as Semantic Scholar removed the hasPDF property.
  • Go through the list of alternative paper URLs to find a PDF if the main link isn't one (the links identified the old way, via 's2' and 'arxiv', all ended in .pdf already). See the sketch after this list.
  • Comb through the encoded data in a more robust way, as the previous method broke when you were logged in.
  • Updated the tests to reflect underlying changes on the website (including new URLs). Removed one test because it now redirects to a different paper; it wasn't covering anything the others don't.
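
A minimal sketch of the PDF selection described above, assembled from the diff fragments quoted in the review below; the function name addPdfAttachment and the exact surrounding code are illustrative, not taken from the PR:

// Pick the primary link if it already points at a PDF, otherwise scan the
// alternative links, then attach whatever was found to the Zotero item.
function addPdfAttachment(rawData, item) {
	let pdfLinkElement;
	if (rawData.primaryPaperLink && rawData.primaryPaperLink.url.endsWith('.pdf')) {
		pdfLinkElement = rawData.primaryPaperLink;
	}
	else if (rawData.alternatePaperLinks) {
		for (let i = 0; i < rawData.alternatePaperLinks.length; i++) {
			let alternateElement = rawData.alternatePaperLinks[i];
			if (alternateElement.url.endsWith('.pdf')) {
				pdfLinkElement = alternateElement;
				break;
			}
		}
	}
	if (pdfLinkElement) {
		item.attachments.push({
			title: 'Full Text PDF',
			url: pdfLinkElement.url,
			mimeType: 'application/pdf'
		});
	}
}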
GuyAglionby changed the title Extend PDF extraction + fix errors when logged in Semantic Scholar extend PDF extraction + fix errors when logged in Dec 31, 2019
zuphilip (Collaborator) left a comment

Thank you! This looks fine. I only have some small comments: simplifying the code, making the extraction of arXiv IDs more general, and one question. Everything should be easy to implement. Let me know if my comments/suggestions are not yet clear.

pdfLinkElement = rawData.primaryPaperLink;
}
else if (rawData.alternatePaperLinks) {
for (let i = 0; i < rawData.alternatePaperLinks.length; i++) {

zuphilip (Collaborator) commented Dec 31, 2019

Suggested change:
- for (let i = 0; i < rawData.alternatePaperLinks.length; i++) {
+ for (let alternateElement of rawData.alternatePaperLinks) {

zuphilip (Collaborator) commented Dec 31, 2019

Then the following line is no longer needed (and you never use the variable i here anyway), so this simplifies your code.

else if (rawData.alternatePaperLinks) {
for (let i = 0; i < rawData.alternatePaperLinks.length; i++) {
let alternateElement = rawData.alternatePaperLinks[i];
if (alternateElement.url.endsWith('.pdf')) {

zuphilip (Collaborator) commented Dec 31, 2019

Suggested change:
- if (alternateElement.url.endsWith('.pdf')) {
+ if (!pdfLinkElement && alternateElement.url.endsWith('.pdf')) {

zuphilip (Collaborator) commented Dec 31, 2019

Then delete the break below and move the arXiv ID code up here. (This then covers the case where there is a PDF link, which we use for the attachment, but another link to arXiv also follows.)
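
A rough sketch (not the actual diff) of how these suggestions could combine: for...of iteration, a !pdfLinkElement guard so the first PDF wins, and the arXiv check done per element so a PDF link followed by a separate arXiv link is still handled. The function name pickPaperLinks and the arxivElement variable are assumptions for illustration:

// Scan the links once: remember the first PDF link and any arXiv link.
function pickPaperLinks(rawData) {
	let pdfLinkElement = null;
	let arxivElement = null;
	if (rawData.primaryPaperLink && rawData.primaryPaperLink.url.endsWith('.pdf')) {
		pdfLinkElement = rawData.primaryPaperLink;
	}
	else if (rawData.alternatePaperLinks) {
		for (let alternateElement of rawData.alternatePaperLinks) {
			if (!pdfLinkElement && alternateElement.url.endsWith('.pdf')) {
				pdfLinkElement = alternateElement;
			}
			if (alternateElement.linkType == 'arxiv') {
				arxivElement = alternateElement;
			}
		}
	}
	return { pdfLinkElement, arxivElement };
}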

mimeType: 'application/pdf'
});

if (pdfLinkElement.linkType == 'arxiv') {

zuphilip (Collaborator) commented Dec 31, 2019

Move this code block up.

"itemID": "Dalvi2018TrackingSC",
"libraryCatalog": "Semantic Scholar",
"proceedingsTitle": "NAACL-HLT",
"publicationTitle": "NAACL-HLT",

zuphilip (Collaborator) commented Dec 31, 2019

Do both of these fields come from Scaffold?

This looks strange, as proceedingsTitle is only some sort of alias for publicationTitle...

GuyAglionby (Author, Contributor) commented Jan 10, 2020

Scaffold doesn't work well for updating the tests -- the element used to determine the type in detectWeb isn't found, even with defer: true. This is despite it appearing when I wget or curl the page. Not sure why, but the tests run and pass as usual.
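
For reference, a deferred translator test case typically has this shape; the URL and item fields below are placeholders, not the actual test data from this PR:

/** BEGIN TEST CASES **/
var testCases = [
	{
		"type": "web",
		"url": "https://www.semanticscholar.org/paper/<paper-id>",
		"defer": true,
		"items": [
			{
				"itemType": "conferencePaper",
				"title": "Example title",
				"libraryCatalog": "Semantic Scholar"
			}
		]
	}
];
/** END TEST CASES **/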

GuyAglionby (Author, Contributor) commented Jan 10, 2020

Thanks for the review -- I don't think I understood the exact changes you suggested for the arXiv IDs, but I think the amended code incorporates the idea.

adam3smith (Collaborator) commented Jan 19, 2020

I believe this is superseded by #2112, which includes PDF scraping, but I haven't compared them closely.

GuyAglionby (Author, Contributor) commented Jan 19, 2020

Yep, I think that's the case.

adam3smith closed this Jan 19, 2020