Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Update CiteSeer to work from PDF paper view #1535

Merged
merged 3 commits into from Jan 28, 2018

Conversation

Projects
None yet
2 participants
@GuyAglionby
Copy link
Contributor

GuyAglionby commented Jan 28, 2018

As the title says, as the DOI is available in the PDF url we can get the URL for the summary page and grab the information from there.

Thanks and any comments appreciated.

@zuphilip
Copy link
Collaborator

zuphilip left a comment

Can you add a test case for this as well? Please have a look also at my other two inline comments, just two nits.

CiteSeer.js Outdated
@@ -40,7 +40,8 @@ function detectWeb(doc, url) {
if ((url.indexOf('/search') != -1 || url.indexOf('/showciting') != -1) && getSearchResults(doc).length) {
return "multiple";
}
if (url.indexOf('/viewdoc/') != -1 && doc.getElementById('bibtex')) {
if ((url.indexOf('/viewdoc/') != -1 && doc.getElementById('bibtex'))
|| url.indexOf('/download?doi=') != -1) {

This comment has been minimized.

Copy link
@zuphilip

zuphilip Jan 28, 2018

Collaborator

We prefer nowadays the function .includes() for such tests. Can you change these two (and possibly also the ones in the line above)?

CiteSeer.js Outdated
@@ -66,6 +67,11 @@ function doWeb(doc, url) {
}
ZU.processDocuments(articles, scrape);
});
} else if (url.indexOf('/download?doi=') !== -1) {
let doi = url.replace('http://citeseerx.ist.psu.edu/viewdoc/download?doi=', '');
doi = doi.replace('&rep=rep1', '').replace('&type=pdf', '');

This comment has been minimized.

Copy link
@zuphilip

zuphilip Jan 28, 2018

Collaborator

This looks a little fragile. Thus, I would prefer to work here with a regular expression, i.e. something like:

let doi = url.match(/\/download\?doi=([^&]*)/);
let paperUrl = 'http://citeseerx.ist.psu.edu/viewdoc/summary?doi=' + doi[1];
ZU.processDocuments(paperUrl, scrape);

(Note, this code is not tested.)

@GuyAglionby

This comment has been minimized.

Copy link
Contributor Author

GuyAglionby commented Jan 28, 2018

Thanks for the comments! I added a test with the correct info, however when I run it I get an error:

TranslatorTester: Translating http://citeseerx.ist.psu.edu/viewdoc/download?doi=10.1.1.415.9750&rep=rep1&type=pdf

TranslatorTester: CiteSeer Test 1: failed (Translation failed to initialize: TypeError: doc is null)

I assume this is because it's a direct link to a PDF. I looked at the translator for arXiv.org and it has a PDF test case that also fails with the same error. I'm on Zotero 5.0.34 and Scaffold 3.3.0. Let me know if you want me to push the test case anyway.

I have made the other changes as requested and tested them :)

@zuphilip

This comment has been minimized.

Copy link
Collaborator

zuphilip commented Jan 28, 2018

Thank you! You are right, Scaffold tries to open the pdf directly and therefore a detection is not possible. Thus, I mentioned the test case in the comment and add some other cleanup and pushed directly to your branch.

@zuphilip zuphilip merged commit ba56c70 into zotero:master Jan 28, 2018

1 check passed

continuous-integration/travis-ci/pr The Travis CI build passed
Details

psisquared2 added a commit to psisquared2/translators that referenced this pull request Feb 8, 2018

Update CiteSeer to work from PDF paper view (zotero#1535)
Also hardcode detection of one test case and small cleanup work.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
You can’t perform that action at this time.