Updating ThesesFR translator #2150

eonm-abes · 2020-03-23T09:19:25Z

Hi all I updated the ThesesFR.js translator.

Change log :

The translator uses a RDF file to get Metadata. This RDF file is more precise and up to date than embedded metadata in the html page
The translators is able to import multiples items from organization and person pages
Supervisors are considered as contributors of the final work
The mention "Thèses de doctorat" (defended thesis) or "Thèse en préparation" (thesis in preparation) has been added to the type field. This change was requested by theses.fr users.
More info has been added in the extra field (jury members, laboratory)
The mention "Les données de Theses.fr sont sous licence Etalab" has been added in the authorization field. (Can this field be used for licensing purposes ?)

It's the first time I contribute to update a translator so feedback are welcome. I had some trouble to lint the code with eslint I hope there is no more issues.

A few questions nonetheless :

I use const as often as possible to avoid reassignment. Is this a good practice in the context of translators ? Should I use let instead ?
I use arrow functions. Can it causes compatibility issues ?

Regards


        Update ThesesFR.js

* Retreinving data from theses.fr .rdf files * Considering supervisors as contributors * Adding multiples imports for organization/supervisors/doctors pages * Making a distinction between "Thèses de doctorat" and "Thèse en préparation" * Adding extra information such as laboratory, graduate schools, jury members * Adding licence for theses.fr data


        Fix typo


        Lint code

Use eslint to fix issues


        Lint

Use eslint to fix format issues


        Lint fixing error 'item' is not defined no-undef

TesteFixe · 2020-03-23T12:27:04Z

Nice Work!

adam3smith · 2020-05-11T01:36:31Z

@dstillman -- do you have an opinion on const? See e.g.
https://github.com/zotero/translators/pull/2150/files#diff-ae76dd6444254ff5be2191eeb6d4189cR51
for a respresentative example.
I don't see a downside, but it's not something we've been doing (let alone enforcing) in translators otherwise.

adam3smith

Thanks -- overall this looks great. I have a couple of change requests and a couple of questions. Also waiting to hear from dstillman on your use of const, but for now assume that's fine. Arrow functions are definitely OK. We've given up on IE ;)

adam3smith · 2020-05-11T01:57:04Z

ThesesFR.js

+		const title = ZU.xpathText(xmlDoc, '//dc:title', ns);
+
+		// The case should't exists : preve;nts crashes while importing multiple records
+		if (!title) return;


Let's take this out please or replace it with a throw -- if an item unexpectedly doesn't import, we would want an error to be thrown rather than to continue silently, letting the user assume all selected items have been imported.

adam3smith · 2020-05-11T01:57:04Z

ThesesFR.js

+		});
+
+		const abstractNotes = ZU.xpath(xmlDoc, '//dcterms:abstract', ns);
+		newItem.abstractNote = abstractNotes.length > 0 ? abstractNotes[0].textContent : undefined;


Why no just use ZU.xpathText() here? (and similarly in other cases below)

My xpath skills was a bit rusty. I failed to get the first element of the abstract (and other nodes) with pure xpath.

Originally I tried to use this expression //dcterms:abstract[1] with xpathText() function, but this expression was wrong in my context. It should have been (//dcterms:abstract)[1].

So I used a workaround by mixing xpath and javascript which is not the best implementation. Now that I have found the right xpath expression I'll fix the code to use xpathText() function.

(Note that you can also use text(docOrElem, selector[, index]) or attr(docOrElem, selector, attribute[, index]) with CSS selectors (as with querySelector()/querySelectorAll()) instead of using XPaths.)

I already fixed the code by using xpath (2c91144) and Z.xpathText() but that's a good trick to know.

adam3smith · 2020-05-11T01:57:04Z

ThesesFR.js

+		newItem.rights = 'Les données de Theses.fr sont sous licence Etalab';
+
+		// Keep extra information such as laboratory, graduate schools, etc.
+		newItem.extra = Array.from(doc.getElementsByClassName('donnees-ombre')).map((description) => {


This is better suited for a note than extra

In this case I'll use a note instead

adam3smith · 2020-05-11T01:57:04Z

ThesesFR.js

 				"url": "http://www.theses.fr/2016SACLS590",
-				"attachments": [
-					{
-						"title": "Full Text PDF",


no more PDFs? That seems like it'd be useful

@adam3smith I agree with you, It would be very useful. I tried to implement it my self but I encountered multiple issues with PDF attachments :

Most PDF files are around 2Go or more (Theses.fr doesn't enforce PDF size). It might cause real performance issues when you have to import multiple PDF files at once. In my case my web browser slows down or freezes. Over time PDF files get bigger and bigger on theses.fr

Finding full text PDF on theses.fr is a hard task. Currently there is a special path in the URL /document that redirects user to the full text PDF if it exists. But this URL redirects either to :

The PDF file on theses.fr (https://www.theses.fr/2011PA040275/document)

The PDF file on another website (University website, Open archive, ...) (https://www.theses.fr/2008ECDL0023/document). In this case we are fetching resources from 3rd party websites without user consent. Furthermore there is no way to know if the URL is pointing to a third-party website before being redirected. Currently half of open access PDF are stored on external websites.

A web page containing the PDF file. In this case we have to parse each pages individually to figure out where the PDF is stored. It implies to make multiple HTTP calls to find and download the PDF file. (https://www.theses.fr/2020REN20023/document).

A restricted access web page (https://www.theses.fr/2019LIL1R054/document)

It seems like the previous implementation is also broken.

To implement it I need more time and maybe help. I'm also concerned about user consent.

It's certainly fine to leave this out, I was just checking.

FWIW, I'm not super concerned about user consent to retrieve PDFs from 3rd party academic websites -- Zotero's unpaywall integration does this, too. The fact that we'd only be able to successfully import PDFs from 1 and 2 of your four scenarios with reasonable effort is also fine: that's why we check the mimeType.

However, we don't have a way to check for file size limits in translators and I certainly don't want to auto-import 2GB document, so if this is a regular occurrence, that'd indeed be a problem.

adam3smith · 2020-05-11T01:57:04Z

ThesesFR.js

-		return "multiple";
+	// Match against a results page or a Ph. D/supervisor/organization page which might contains multiple records e.g.
+	// http://www.theses.fr/fr/?q=zotero
+	// http://www.theses.fr/fr/154750417


cool -- please add a test for this

adam3smith · 2020-05-11T01:58:35Z

I don't quite understand this question:

The mention "Les données de Theses.fr sont sous licence Etalab" has been added in the authorization field. (Can this field be used for licensing purposes ?)

are you asking whether the rights field is the correct place for licensing information? If so, the answer is yes.

dstillman · 2020-05-11T05:13:50Z

@dstillman -- do you have an opinion on const? […] I don't see a downside, but it's not something we've been doing (let alone enforcing) in translators otherwise.

We generally don't use const anywhere in Zotero beyond actual constants (e.g., all caps variables). It's more verbose without adding much value beyond let (particularly in small scopes, where it's just noise). There's no need to reject translators that use const, but for brevity and consistency we prefer it not be used.

(We also tend to use var at the top level of functions to differentiate variables shared across multiple scopes from block-level let, as well as to allow for quicker try debugging, but we certainly don't enforce that.)


        replace const with let


        Use xpathText() when necessary


        Throw real error


        Transform extra to note


        Add note for thesis not yet defended


        Adding tests for organisation and people pages


        Updating rights field


        Fixing lint issues + Throw real error not litteral

eonm-abes · 2020-05-12T10:18:39Z

Thank you @adam3smith and @dstillman for your comments and feedback. I updated the code accordingly to your advice.

@adam3smith my question was indeed about the right place to put licensing information. I updated the code to keep only the license name (Etalab).

@dstillman As suggested I replaced const with let.

It should be good now.

adam3smith

Thanks -- two remaining nits and we should be good here

adam3smith · 2020-05-14T02:27:23Z

ThesesFR.js

@@ -230,16 +247,19 @@ var testCases = [
 						"tag": "W boson mass"
 			   		}
 				],
-				"extra": "Sous la direction de  Maarten Boonekamp.  Soutenue le 19-09-2016,à l'Université Paris-Saclay (ComUE) , dans le cadre de   École doctorale Particules, Hadrons, Énergie et Noyau : Instrumentation, Imagerie, Cosmos et Simulation (Orsay, Essonne ; 2015-....) , en partenariat avec  Département de physique des particules (Gif-sur-Yvette, Essonne)   (laboratoire)  ,  Centre européen pour la recherche nucléaire   (laboratoire)   et de  Université Paris-Sud (1970-2019)   (établissement opérateur d'inscription)  .",
 				"rights": "Les données de Theses.fr sont sous licence Etalab",


doesn't look like you updated the tests after changing rights

adam3smith · 2020-05-14T02:27:23Z

ThesesFR.js

-		newItem.extra = Array.from(doc.getElementsByClassName('donnees-ombre')).map((description) => {
-			return Array.from(description.getElementsByTagName('p')).map(description => description.textContent.replace(/\n/g, ' '));
+		if (notePrepa) {
+			newItem.notes.push({ description: notePrepa });


This should be {note: notePrepa} as per https://www.zotero.org/support/dev/translators/coding#notes
Same below

adam3smith · 2020-05-14T02:27:23Z

ThesesFR.js

 		}).join(' ');

+		if (note) {
+			newItem.notes.push({ description: note });



        Fix notes


        Fix tests with proper rights

adam3smith · 2020-05-15T03:25:28Z

Hi -- sorry, I know I said we'd be done, but I'm just realizing that the combination of target and detect function is significantly too loose: The translator detects any page as a thesis, including search start pages (http://theses.fr/fr/personnes/) and documentation (http://theses.fr/fr/personnes/). This isn't your bug, it's already the case in the current version, but it's... not good.

I don't think you can easily address this in the target, so will probably require something on the page itself -- one of the metatags might make most sense, e.g. DC.type?

Would you mind tightening this up a bit more?

eonm-abes · 2020-05-15T11:09:49Z

No problem. This is my fault, I forget to check the target regex when I updated the translator.

I come up with this regex ^https?:\/\/(www\.)?theses\.fr\/([a-z]{2}\/)?((s\d+|\d{4}.{8}|\d{8}X|\d{9})(?!\.(rdf|xml)$)|(sujets\/\?q=|\?q=))(?!.*&format=(json|xml)). What do you think of it ?

This regex should exclude search start pages, .xml and .rdf pages and API pages.

More explanation and example here : https://regex101.com/r/IIr1ak/1


        Update regex

adam3smith · 2020-05-17T01:37:17Z

Thanks so much. This looks great!

eonm-abes added 5 commits Mar 13, 2020

Fix typo

Verified

This commit was created on GitHub.com and signed with a verified signature using GitHub’s key.

GPG key ID: 4AEE18F83AFDEB23 Learn about signing commits

0f76375

Lint code

Verified

This commit was created on GitHub.com and signed with a verified signature using GitHub’s key.

GPG key ID: 4AEE18F83AFDEB23 Learn about signing commits

f6f215b

Use eslint to fix issues

Lint

Verified

This commit was created on GitHub.com and signed with a verified signature using GitHub’s key.

GPG key ID: 4AEE18F83AFDEB23 Learn about signing commits

e1737f1

Use eslint to fix format issues

Lint fixing error 'item' is not defined no-undef

Verified

This commit was created on GitHub.com and signed with a verified signature using GitHub’s key.

GPG key ID: 4AEE18F83AFDEB23 Learn about signing commits

Loading status checks…

ac73ab1

adam3smith requested changes May 11, 2020

View changes

Eonm and others added 8 commits May 12, 2020

replace const with let

6a142e7

Use xpathText() when necessary

2c91144

Throw real error

8ae7f7e

Transform extra to note

f4a6fc4

Add note for thesis not yet defended

92bb713

Adding tests for organisation and people pages

4286268

Updating rights field

Loading status checks…

50b8a0b

Fixing lint issues + Throw real error not litteral

Loading status checks…

6d6f583

eonm-abes requested review from dstillman and adam3smith May 12, 2020

adam3smith requested changes May 14, 2020

View changes

eonm-abes added 2 commits May 14, 2020

Fix notes

Verified

This commit was created on GitHub.com and signed with a verified signature using GitHub’s key.

GPG key ID: 4AEE18F83AFDEB23 Learn about signing commits

Loading status checks…

e242501

Fix tests with proper rights

Verified

This commit was created on GitHub.com and signed with a verified signature using GitHub’s key.

GPG key ID: 4AEE18F83AFDEB23 Learn about signing commits

Loading status checks…

bdc86d1

eonm-abes requested a review from adam3smith May 14, 2020

Update regex

Verified

This commit was created on GitHub.com and signed with a verified signature using GitHub’s key.

GPG key ID: 4AEE18F83AFDEB23 Learn about signing commits

Loading status checks…

d14772c

adam3smith merged commit a380a6c into zotero:master May 17, 2020
1 check passed

1 check passed

continuous-integration/travis-ci/pr The Travis CI build passed
Details

zotero / translators

Updating ThesesFR translator #2150

Updating ThesesFR translator #2150

eonm-abes commented Mar 23, 2020

TesteFixe commented Mar 23, 2020

adam3smith commented May 11, 2020

adam3smith left a comment

This comment has been minimized.

This comment has been minimized.

This comment has been minimized.

This comment has been minimized.

This comment has been minimized.

This comment has been minimized.

This comment has been minimized.

This comment has been minimized.

This comment has been minimized.

This comment has been minimized.

This comment has been minimized.

adam3smith commented May 11, 2020

dstillman commented May 11, 2020

eonm-abes commented May 12, 2020

adam3smith left a comment

This comment has been minimized.

This comment has been minimized.

This comment has been minimized.

adam3smith commented May 15, 2020

eonm-abes commented May 15, 2020

adam3smith commented May 17, 2020

zotero / translators

Join GitHub today

Updating ThesesFR translator #2150

Updating ThesesFR translator #2150

Conversation

eonm-abes commented Mar 23, 2020

TesteFixe commented Mar 23, 2020

adam3smith commented May 11, 2020

adam3smith left a comment

This comment has been minimized.

This comment has been minimized.

This comment has been minimized.

This comment has been minimized.

dstillman May 12, 2020 • edited

This comment has been minimized.

This comment has been minimized.

This comment has been minimized.

This comment has been minimized.

This comment has been minimized.

This comment has been minimized.

This comment has been minimized.

adam3smith commented May 11, 2020

dstillman commented May 11, 2020

eonm-abes commented May 12, 2020

adam3smith left a comment

This comment has been minimized.

This comment has been minimized.

This comment has been minimized.

adam3smith commented May 15, 2020

eonm-abes commented May 15, 2020

adam3smith commented May 17, 2020

dstillman May 12, 2020 •

edited