Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Updating ThesesFR translator #2150

Merged
merged 16 commits into from May 17, 2020
Merged

Updating ThesesFR translator #2150

merged 16 commits into from May 17, 2020

Conversation

@eonm-abes
Copy link
Contributor

eonm-abes commented Mar 23, 2020

Hi all I updated the ThesesFR.js translator.

Change log :

  • The translator uses a RDF file to get Metadata. This RDF file is more precise and up to date than embedded metadata in the html page
  • The translators is able to import multiples items from organization and person pages
  • Supervisors are considered as contributors of the final work
  • The mention "Thèses de doctorat" (defended thesis) or "Thèse en préparation" (thesis in preparation) has been added to the type field. This change was requested by theses.fr users.
  • More info has been added in the extra field (jury members, laboratory)
  • The mention "Les données de Theses.fr sont sous licence Etalab" has been added in the authorization field. (Can this field be used for licensing purposes ?)

It's the first time I contribute to update a translator so feedback are welcome. I had some trouble to lint the code with eslint I hope there is no more issues.

A few questions nonetheless :

  • I use const as often as possible to avoid reassignment. Is this a good practice in the context of translators ? Should I use let instead ?
  • I use arrow functions. Can it causes compatibility issues ?

Regards

eonm-abes added 5 commits Mar 13, 2020
* Retreinving data from theses.fr .rdf files
* Considering supervisors as contributors
* Adding multiples imports for organization/supervisors/doctors pages
* Making a distinction between "Thèses de doctorat" and "Thèse en préparation"
* Adding extra information such as laboratory, graduate schools, jury members
* Adding licence for theses.fr data
Lint code
Use eslint to fix issues
Lint
Use eslint to fix format issues
@TesteFixe
Copy link
Contributor

TesteFixe commented Mar 23, 2020

Nice Work!

@adam3smith
Copy link
Collaborator

adam3smith commented May 11, 2020

@dstillman -- do you have an opinion on const? See e.g.
https://github.com/zotero/translators/pull/2150/files#diff-ae76dd6444254ff5be2191eeb6d4189cR51
for a respresentative example.
I don't see a downside, but it's not something we've been doing (let alone enforcing) in translators otherwise.

Copy link
Collaborator

adam3smith left a comment

Thanks -- overall this looks great. I have a couple of change requests and a couple of questions. Also waiting to hear from dstillman on your use of const, but for now assume that's fine. Arrow functions are definitely OK. We've given up on IE ;)

ThesesFR.js Outdated
const title = ZU.xpathText(xmlDoc, '//dc:title', ns);

// The case should't exists : preve;nts crashes while importing multiple records
if (!title) return;

This comment has been minimized.

Copy link
@adam3smith

adam3smith May 11, 2020

Collaborator

Let's take this out please or replace it with a throw -- if an item unexpectedly doesn't import, we would want an error to be thrown rather than to continue silently, letting the user assume all selected items have been imported.

ThesesFR.js Outdated
});

const abstractNotes = ZU.xpath(xmlDoc, '//dcterms:abstract', ns);
newItem.abstractNote = abstractNotes.length > 0 ? abstractNotes[0].textContent : undefined;

This comment has been minimized.

Copy link
@adam3smith

adam3smith May 11, 2020

Collaborator

Why no just use ZU.xpathText() here? (and similarly in other cases below)

This comment has been minimized.

Copy link
@eonm-abes

eonm-abes May 12, 2020

Author Contributor

My xpath skills was a bit rusty. I failed to get the first element of the abstract (and other nodes) with pure xpath.

Originally I tried to use this expression //dcterms:abstract[1] with xpathText() function, but this expression was wrong in my context. It should have been (//dcterms:abstract)[1].

So I used a workaround by mixing xpath and javascript which is not the best implementation. Now that I have found the right xpath expression I'll fix the code to use xpathText() function.

This comment has been minimized.

Copy link
@dstillman

dstillman May 12, 2020

Member

(Note that you can also use text(docOrElem, selector[, index]) or attr(docOrElem, selector, attribute[, index]) with CSS selectors (as with querySelector()/querySelectorAll()) instead of using XPaths.)

This comment has been minimized.

Copy link
@eonm-abes

eonm-abes May 12, 2020

Author Contributor

I already fixed the code by using xpath (2c91144) and Z.xpathText() but that's a good trick to know.

ThesesFR.js Outdated
newItem.rights = 'Les données de Theses.fr sont sous licence Etalab';

// Keep extra information such as laboratory, graduate schools, etc.
newItem.extra = Array.from(doc.getElementsByClassName('donnees-ombre')).map((description) => {

This comment has been minimized.

Copy link
@adam3smith

adam3smith May 11, 2020

Collaborator

This is better suited for a note than extra

This comment has been minimized.

Copy link
@eonm-abes

eonm-abes May 12, 2020

Author Contributor

In this case I'll use a note instead

"url": "http://www.theses.fr/2016SACLS590",
"attachments": [
{
"title": "Full Text PDF",

This comment has been minimized.

Copy link
@adam3smith

adam3smith May 11, 2020

Collaborator

no more PDFs? That seems like it'd be useful

This comment has been minimized.

Copy link
@eonm-abes

eonm-abes May 11, 2020

Author Contributor

@adam3smith I agree with you, It would be very useful. I tried to implement it my self but I encountered multiple issues with PDF attachments :

  • Most PDF files are around 2Go or more (Theses.fr doesn't enforce PDF size). It might cause real performance issues when you have to import multiple PDF files at once. In my case my web browser slows down or freezes. Over time PDF files get bigger and bigger on theses.fr
  • Finding full text PDF on theses.fr is a hard task. Currently there is a special path in the URL /document that redirects user to the full text PDF if it exists. But this URL redirects either to :

It seems like the previous implementation is also broken.

To implement it I need more time and maybe help. I'm also concerned about user consent.

This comment has been minimized.

Copy link
@adam3smith

adam3smith May 11, 2020

Collaborator

It's certainly fine to leave this out, I was just checking.

FWIW, I'm not super concerned about user consent to retrieve PDFs from 3rd party academic websites -- Zotero's unpaywall integration does this, too. The fact that we'd only be able to successfully import PDFs from 1 and 2 of your four scenarios with reasonable effort is also fine: that's why we check the mimeType.

However, we don't have a way to check for file size limits in translators and I certainly don't want to auto-import 2GB document, so if this is a regular occurrence, that'd indeed be a problem.

return "multiple";
// Match against a results page or a Ph. D/supervisor/organization page which might contains multiple records e.g.
// http://www.theses.fr/fr/?q=zotero
// http://www.theses.fr/fr/154750417

This comment has been minimized.

Copy link
@adam3smith

adam3smith May 11, 2020

Collaborator

cool -- please add a test for this

@adam3smith
Copy link
Collaborator

adam3smith commented May 11, 2020

I don't quite understand this question:

The mention "Les données de Theses.fr sont sous licence Etalab" has been added in the authorization field. (Can this field be used for licensing purposes ?)

are you asking whether the rights field is the correct place for licensing information? If so, the answer is yes.

@dstillman
Copy link
Member

dstillman commented May 11, 2020

@dstillman -- do you have an opinion on const? […] I don't see a downside, but it's not something we've been doing (let alone enforcing) in translators otherwise.

We generally don't use const anywhere in Zotero beyond actual constants (e.g., all caps variables). It's more verbose without adding much value beyond let (particularly in small scopes, where it's just noise). There's no need to reject translators that use const, but for brevity and consistency we prefer it not be used.

(We also tend to use var at the top level of functions to differentiate variables shared across multiple scopes from block-level let, as well as to allow for quicker try debugging, but we certainly don't enforce that.)

@eonm-abes
Copy link
Contributor Author

eonm-abes commented May 12, 2020

Thank you @adam3smith and @dstillman for your comments and feedback. I updated the code accordingly to your advice.

@adam3smith my question was indeed about the right place to put licensing information. I updated the code to keep only the license name (Etalab).

@dstillman As suggested I replaced const with let.

It should be good now.

@eonm-abes eonm-abes requested review from dstillman and adam3smith May 12, 2020
Copy link
Collaborator

adam3smith left a comment

Thanks -- two remaining nits and we should be good here

ThesesFR.js Outdated
@@ -230,16 +247,19 @@ var testCases = [
"tag": "W boson mass"
}
],
"extra": "Sous la direction de Maarten Boonekamp. Soutenue le 19-09-2016,à l'Université Paris-Saclay (ComUE) , dans le cadre de École doctorale Particules, Hadrons, Énergie et Noyau : Instrumentation, Imagerie, Cosmos et Simulation (Orsay, Essonne ; 2015-....) , en partenariat avec Département de physique des particules (Gif-sur-Yvette, Essonne) (laboratoire) , Centre européen pour la recherche nucléaire (laboratoire) et de Université Paris-Sud (1970-2019) (établissement opérateur d'inscription) .",
"rights": "Les données de Theses.fr sont sous licence Etalab",

This comment has been minimized.

Copy link
@adam3smith

adam3smith May 14, 2020

Collaborator

doesn't look like you updated the tests after changing rights

ThesesFR.js Outdated
newItem.extra = Array.from(doc.getElementsByClassName('donnees-ombre')).map((description) => {
return Array.from(description.getElementsByTagName('p')).map(description => description.textContent.replace(/\n/g, ' '));
if (notePrepa) {
newItem.notes.push({ description: notePrepa });

This comment has been minimized.

Copy link
@adam3smith

adam3smith May 14, 2020

Collaborator

This should be {note: notePrepa} as per https://www.zotero.org/support/dev/translators/coding#notes
Same below

ThesesFR.js Outdated
}).join(' ');

if (note) {
newItem.notes.push({ description: note });

This comment has been minimized.

Copy link
@adam3smith

adam3smith May 14, 2020

Collaborator

ditto

eonm-abes added 2 commits May 14, 2020
@eonm-abes eonm-abes requested a review from adam3smith May 14, 2020
@adam3smith
Copy link
Collaborator

adam3smith commented May 15, 2020

Hi -- sorry, I know I said we'd be done, but I'm just realizing that the combination of target and detect function is significantly too loose: The translator detects any page as a thesis, including search start pages (http://theses.fr/fr/personnes/) and documentation (http://theses.fr/fr/personnes/). This isn't your bug, it's already the case in the current version, but it's... not good.

I don't think you can easily address this in the target, so will probably require something on the page itself -- one of the metatags might make most sense, e.g. DC.type?

Would you mind tightening this up a bit more?

@eonm-abes
Copy link
Contributor Author

eonm-abes commented May 15, 2020

No problem. This is my fault, I forget to check the target regex when I updated the translator.

I come up with this regex ^https?:\/\/(www\.)?theses\.fr\/([a-z]{2}\/)?((s\d+|\d{4}.{8}|\d{8}X|\d{9})(?!\.(rdf|xml)$)|(sujets\/\?q=|\?q=))(?!.*&format=(json|xml)). What do you think of it ?

This regex should exclude search start pages, .xml and .rdf pages and API pages.

More explanation and example here : https://regex101.com/r/IIr1ak/1

@adam3smith adam3smith merged commit a380a6c into zotero:master May 17, 2020
1 check passed
1 check passed
continuous-integration/travis-ci/pr The Travis CI build passed
Details
@adam3smith
Copy link
Collaborator

adam3smith commented May 17, 2020

Thanks so much. This looks great!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
Linked issues

Successfully merging this pull request may close these issues.

None yet

5 participants
You can’t perform that action at this time.