Join GitHub today
GitHub is home to over 50 million developers working together to host and review code, manage projects, and build software together.
Sign upUpdating ThesesFR translator #2150
Conversation
* Retreinving data from theses.fr .rdf files * Considering supervisors as contributors * Adding multiples imports for organization/supervisors/doctors pages * Making a distinction between "Thèses de doctorat" and "Thèse en préparation" * Adding extra information such as laboratory, graduate schools, jury members * Adding licence for theses.fr data
Use eslint to fix issues
Use eslint to fix format issues
Nice Work! |
@dstillman -- do you have an opinion on |
Thanks -- overall this looks great. I have a couple of change requests and a couple of questions. Also waiting to hear from dstillman on your use of |
const title = ZU.xpathText(xmlDoc, '//dc:title', ns); | ||
|
||
// The case should't exists : preve;nts crashes while importing multiple records | ||
if (!title) return; |
This comment has been minimized.
This comment has been minimized.
adam3smith
May 11, 2020
Collaborator
Let's take this out please or replace it with a throw
-- if an item unexpectedly doesn't import, we would want an error to be thrown rather than to continue silently, letting the user assume all selected items have been imported.
}); | ||
|
||
const abstractNotes = ZU.xpath(xmlDoc, '//dcterms:abstract', ns); | ||
newItem.abstractNote = abstractNotes.length > 0 ? abstractNotes[0].textContent : undefined; |
This comment has been minimized.
This comment has been minimized.
adam3smith
May 11, 2020
Collaborator
Why no just use ZU.xpathText()
here? (and similarly in other cases below)
This comment has been minimized.
This comment has been minimized.
eonm-abes
May 12, 2020
Author
Contributor
My xpath skills was a bit rusty. I failed to get the first element of the abstract (and other nodes) with pure xpath.
Originally I tried to use this expression //dcterms:abstract[1]
with xpathText()
function, but this expression was wrong in my context. It should have been (//dcterms:abstract)[1]
.
So I used a workaround by mixing xpath and javascript which is not the best implementation. Now that I have found the right xpath expression I'll fix the code to use xpathText()
function.
This comment has been minimized.
This comment has been minimized.
dstillman
May 12, 2020
•
Member
(Note that you can also use text(docOrElem, selector[, index])
or attr(docOrElem, selector, attribute[, index])
with CSS selectors (as with querySelector()
/querySelectorAll()
) instead of using XPaths.)
This comment has been minimized.
This comment has been minimized.
eonm-abes
May 12, 2020
Author
Contributor
I already fixed the code by using xpath (2c91144) and Z.xpathText()
but that's a good trick to know.
newItem.rights = 'Les données de Theses.fr sont sous licence Etalab'; | ||
|
||
// Keep extra information such as laboratory, graduate schools, etc. | ||
newItem.extra = Array.from(doc.getElementsByClassName('donnees-ombre')).map((description) => { |
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
"url": "http://www.theses.fr/2016SACLS590", | ||
"attachments": [ | ||
{ | ||
"title": "Full Text PDF", |
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
eonm-abes
May 11, 2020
Author
Contributor
@adam3smith I agree with you, It would be very useful. I tried to implement it my self but I encountered multiple issues with PDF attachments :
- Most PDF files are around 2Go or more (Theses.fr doesn't enforce PDF size). It might cause real performance issues when you have to import multiple PDF files at once. In my case my web browser slows down or freezes. Over time PDF files get bigger and bigger on theses.fr
- Finding full text PDF on theses.fr is a hard task. Currently there is a special path in the URL
/document
that redirects user to the full text PDF if it exists. But this URL redirects either to :- The PDF file on theses.fr (https://www.theses.fr/2011PA040275/document)
- The PDF file on another website (University website, Open archive, ...) (https://www.theses.fr/2008ECDL0023/document). In this case we are fetching resources from 3rd party websites without user consent. Furthermore there is no way to know if the URL is pointing to a third-party website before being redirected. Currently half of open access PDF are stored on external websites.
- A web page containing the PDF file. In this case we have to parse each pages individually to figure out where the PDF is stored. It implies to make multiple HTTP calls to find and download the PDF file. (https://www.theses.fr/2020REN20023/document).
- A restricted access web page (https://www.theses.fr/2019LIL1R054/document)
It seems like the previous implementation is also broken.
To implement it I need more time and maybe help. I'm also concerned about user consent.
This comment has been minimized.
This comment has been minimized.
adam3smith
May 11, 2020
Collaborator
It's certainly fine to leave this out, I was just checking.
FWIW, I'm not super concerned about user consent to retrieve PDFs from 3rd party academic websites -- Zotero's unpaywall integration does this, too. The fact that we'd only be able to successfully import PDFs from 1 and 2 of your four scenarios with reasonable effort is also fine: that's why we check the mimeType.
However, we don't have a way to check for file size limits in translators and I certainly don't want to auto-import 2GB document, so if this is a regular occurrence, that'd indeed be a problem.
return "multiple"; | ||
// Match against a results page or a Ph. D/supervisor/organization page which might contains multiple records e.g. | ||
// http://www.theses.fr/fr/?q=zotero | ||
// http://www.theses.fr/fr/154750417 |
This comment has been minimized.
This comment has been minimized.
I don't quite understand this question:
are you asking whether the |
We generally don't use (We also tend to use |
Thank you @adam3smith and @dstillman for your comments and feedback. I updated the code accordingly to your advice. @adam3smith my question was indeed about the right place to put licensing information. I updated the code to keep only the license name (Etalab). @dstillman As suggested I replaced It should be good now. |
Thanks -- two remaining nits and we should be good here |
@@ -230,16 +247,19 @@ var testCases = [ | |||
"tag": "W boson mass" | |||
} | |||
], | |||
"extra": "Sous la direction de Maarten Boonekamp. Soutenue le 19-09-2016,à l'Université Paris-Saclay (ComUE) , dans le cadre de École doctorale Particules, Hadrons, Énergie et Noyau : Instrumentation, Imagerie, Cosmos et Simulation (Orsay, Essonne ; 2015-....) , en partenariat avec Département de physique des particules (Gif-sur-Yvette, Essonne) (laboratoire) , Centre européen pour la recherche nucléaire (laboratoire) et de Université Paris-Sud (1970-2019) (établissement opérateur d'inscription) .", | |||
"rights": "Les données de Theses.fr sont sous licence Etalab", |
This comment has been minimized.
This comment has been minimized.
newItem.extra = Array.from(doc.getElementsByClassName('donnees-ombre')).map((description) => { | ||
return Array.from(description.getElementsByTagName('p')).map(description => description.textContent.replace(/\n/g, ' ')); | ||
if (notePrepa) { | ||
newItem.notes.push({ description: notePrepa }); |
This comment has been minimized.
This comment has been minimized.
adam3smith
May 14, 2020
Collaborator
This should be {note: notePrepa}
as per https://www.zotero.org/support/dev/translators/coding#notes
Same below
}).join(' '); | ||
|
||
if (note) { | ||
newItem.notes.push({ description: note }); |
This comment has been minimized.
This comment has been minimized.
Hi -- sorry, I know I said we'd be done, but I'm just realizing that the combination of target and detect function is significantly too loose: The translator detects any page as a thesis, including search start pages (http://theses.fr/fr/personnes/) and documentation (http://theses.fr/fr/personnes/). This isn't your bug, it's already the case in the current version, but it's... not good. I don't think you can easily address this in the target, so will probably require something on the page itself -- one of the metatags might make most sense, e.g. DC.type? Would you mind tightening this up a bit more? |
No problem. This is my fault, I forget to check the target regex when I updated the translator. I come up with this regex This regex should exclude search start pages, .xml and .rdf pages and API pages. More explanation and example here : https://regex101.com/r/IIr1ak/1 |
Thanks so much. This looks great! |
eonm-abes commentedMar 23, 2020
Hi all I updated the ThesesFR.js translator.
Change log :
It's the first time I contribute to update a translator so feedback are welcome. I had some trouble to lint the code with eslint I hope there is no more issues.
A few questions nonetheless :
const
as often as possible to avoid reassignment. Is this a good practice in the context of translators ? Should I uselet
instead ?Regards