New translator for www.springer.com #1440

psisquared2 · Oct 8, 2017

I have written a translator for www.springer.com (not to be confused with link.springer.com). The embedded metadata translator is only capable of extracting the title of the book on that website. I have also implemented various fallbacks to the embedded metadata translator in case the website changes.

Example website: http://www.springer.com/gb/book/9780387952697

adam3smith · Oct 15, 2017

adam3smith requested changes Oct 15, 2017

View changes

Thanks!
We'd really want to add multiples here -- e.g. for search results such as http://www.springer.com/us/search?facet-type=type__book&query=testing&submit=Submit+Query
Looks eminently possible to me. Let me know if you need pointers to template code you can (re)use.

adam3smith · Oct 15, 2017

Springer Books.js

+		//Try to find additional information
+		if (!item.publisher) {
+			publisher = ZU.xpathText(doc, '//dd[@itemprop="publisher"]/span');
+			item.publisher = publisher;


I'd try to split this into publisher and place. Since presumably the first part is always "Springer-Verlag" you could do something like:

if (item.publisher && item.publisher.search(/^Springer-Verlag\s.+/)!=-1) { var publisherInfo = item.publisher.match(/^(Springer-Verlag)\s(.+)/); item.publisher = publisherInfo[1] item.place = publisherInfo[2] }

adam3smith · Oct 15, 2017

Springer Books.js

+		//The abstract note is shortened in the <meta> field; try to load
+		//the full abstract note
+		long_abstractNote = ZU.xpath(doc, '//div[@class="product-about"]//div[@class="springer-html"]');
+		long_abstractNote = long_abstractNote[0].textContent.trim();


test for long_abstractNote.length before doing all this.

psisquared2 · Oct 22, 2017

Thank you for the feedback. I have addressed the issues you mentioned, added support for multiples and also improved a few minor quirks on the way. Please let me know if you see further issues or room for improvement.

One question: In case the publisher is officially listed as "Springer-Verlag Berlin Heidelberg" (which happens a lot), do we want to split this into publisher="Springer-Verlag" and place="Berlin Heidelberg"? Seems a little weird to have two locations in "place", but it's probably the best we can do. The only other option I see is to revert to not filling the "place" field at all.

zuphilip · Jan 3, 2018

zuphilip reviewed Jan 3, 2018

View changes

zuphilip · Jan 3, 2018

Springer Books.js

+	"maxVersion": "",
+	"priority": 100,
+	"inRepository": true,
+	"translatorType": 12,


No, translator type is 4 here.

zuphilip · Jan 3, 2018

Springer Books.js

+	"priority": 100,
+	"inRepository": true,
+	"translatorType": 12,
+	"browserSupport": "gcsbv",


Add all options, also i.

zuphilip · Jan 3, 2018

Springer Books.js

+/*
+	***** BEGIN LICENSE BLOCK *****
+
+	Copyright © 2017 YourName


Replace YourName with your name.

zuphilip · Jan 3, 2018

Springer Books.js

+			// test if any relevant <meta> information is available
+			if(ZU.xpathText(doc, '//meta[@property="og:title"]/@content')) return "book";
+			//test for relevant search entries
+		case "search": return getSearchResults(doc, true) ? "multiple" : false;


Write this on two lines and then move the comment from above to corresponding line, i.e.

case "search": // test for relevant search entries return getSearchResults(doc, true) ? "multiple" : false;

zuphilip · Jan 3, 2018

Springer Books.js

+		var item_type = search_entries[i].getAttribute("class");
+		//test if we actually have a usable result
+		if (!item_type || item_type.search(/result-item result-item-\d+ result-type-book.*/) == -1) continue;
+		var title_link = ZU.xpath(search_entries[i], "h4/a");


This xpath looks strange... Does this work?

This html looks e.g. like

<div class="result-item result-item-3 result-type-book"> <some stuff> <h4> <a href="/us/book/978-981-10-0217-5">National Test</a> </h4>

and the xpath looks for a top level h4 with an a below. Would you suggest a different xpath for that?

zuphilip · Jan 3, 2018

Springer Books.js

+	switch (detection_type) {
+		case "book": scrapeBook(doc, url);
+		case "multiple": {
+			if (!getSearchResults(doc, true)) return;


Test this as part of the detectWeb function instead.

zuphilip · Jan 3, 2018

Springer Books.js

+	for (var i=0; i<search_entries.length; i++) {
+		var item_type = search_entries[i].getAttribute("class");
+		//test if we actually have a usable result
+		if (!item_type || item_type.search(/result-item result-item-\d+ result-type-book.*/) == -1) continue;


This looks fragile, because there is no order for the classes of a node. I would suggest something like

item_type.includes('result-item') && item_type.includes('result-item-') && item_type.includes('result-type-book')

Not tested. Or maybe just part of what is important...

zuphilip · Jan 3, 2018

Springer Books.js

+		//forward compatibility: check if the embedded metadata translator
+		//has found a creator (i.e. relevant metadata)
+		if (item.creators.length > 0) {
+			item.complete; return;


Why can we end here if one creator is found?

Can we do more individual tests that certain parts are possibly missing?

My idea was that if Springer ever changes their website and decides to include more metadata, my translator would fall back to the embedded metadata translator.

zuphilip · Jan 3, 2018

Springer Books.js

+			//Step 1: decompose item.title into the "title" field from the bibliography
+			//and the (potentially abbreviated) subtitle
+			var titlesXpath = ZU.xpath(doc, '//div[@class="product-bibliographic"]/dl/dd/div/div/dl/dd');
+			var titleCand = null;


I don't think this title e.g. Theoretical Physics 7 - Quantum Mechanics - Methods and | Wolfgang Nolting | Springer is a good text to extract any information from. How about the information at the bottom, e.g.

? One should also check for the remaining lines here...

The problem is that the structure of this table is not consistent across different books, and only the author, publisher, and ISBN can be accessed in a stable way. I decided to first parse the information from the metadata because that is guaranteed to be correct, but might be abbreviated. In a second step I search the information at the bottom for the complete title, subtitle, and author and check these against the data I have from step 1. That is where the function shortenedStringTest from above is used. I found that this is the most stable thing one can do.

zuphilip · Jan 3, 2018

Springer Books.js

+	if (origStr.length < 10)
+		return (origStr == newStr.substring(0, origStr.length));
+	else
+		return (origStr.substring(0,9) == newStr.substring(0,9));


I don't understand this function. The 9 looks like an arbitrary number here. Can you clarify?

This function is a plausibility check if one string is a shortened version of a different one, e.g. Quantum Mechanics - Methods and ... and Quantum Mechanics - Methods and Applications. It essentially compares the first 10 characters, or less if the string is shorter. See below why this is needed. One could, alternatively, check the first half of both strings; checking the full string fails in some cases where a "..." appears in the shortened string.

adam3smith · Jan 14, 2018

@zuphilip leaving this for you to review

zuphilip · Jan 14, 2018

@adam3smith I don't think I can a review the next two weeks, sorry...

zuphilip · Feb 4, 2018

@psisquared2: I simplified several things and updated the translator and pushed directly to your fork. Please have a look at the changes. This would be fine for me now to merge.

psisquared2 · Feb 4, 2018

@zuphilip Thank you for the improvements. The changes look fine. However, the tests failed on my machine (both Scaffold 3.3.0 release and the latest commit, which had identical outputs). I updated the tests to match Scaffold. Was there a reason for this discrepancy?

zuphilip · Feb 4, 2018

Okay, thank you!

I updated the tests to match Scaffold. Was there a reason for this discrepancy?

I did the test within your branch with all other files at this stage. I guess that you are using a newer EM file with the changes 7081d6b . That would explain the language info. I don't know exactly why the spaces difference, but I guess we can neglect it for the moment.

zuphilip · Feb 4, 2018

Thank you very much, @psisquared2 ! 🎉

(We squashed the different commits together. Thus you might need to reset your master branch in order to have a clean base again.)

psisquared2 added some commits Oct 7, 2017

added translator for www.springer.com

920ae68

vast improvements to Springer Books translator

94b4e14

added test cases

Loading status checks…

84a18ff

Springer books: added multiple / search support, added date and impro…
…ved subtitle detection, publisher and place separation

Loading status checks…

73a35a5

psisquared2 added some commits Oct 23, 2017

minor improvement of title / subtitle detection

Loading status checks…

19197cd

added license block

Loading status checks…

57f6dd8

zuphilip added the New Translator label Nov 12, 2017

various minor issues

Loading status checks…

a201a2e

Simplified and updated Springer Books.js

Loading status checks…

9b4dfdc

updated tests with Scaffold 3.3.0

Loading status checks…

0501dde

zuphilip merged commit 377d00c into zotero:master Feb 4, 2018
1 check passed

1 check passed

continuous-integration/travis-ci/pr The Travis CI build passed
Details

zotero/translators

Join GitHub today

New translator for www.springer.com #1440

Conversation

psisquared2 commented Oct 8, 2017

psisquared2 added some commits Oct 7, 2017

adam3smith requested changes Oct 15, 2017 View changes

adam3smith left a comment

This comment has been minimized.

This comment has been minimized.

This comment has been minimized.

psisquared2 commented Oct 22, 2017

psisquared2 added some commits Oct 23, 2017

zuphilip added the New Translator label Nov 12, 2017

zuphilip reviewed Jan 3, 2018 View changes

This comment has been minimized.

This comment has been minimized.

This comment has been minimized.

This comment has been minimized.

This comment has been minimized.

This comment has been minimized.

This comment has been minimized.

This comment has been minimized.

This comment has been minimized.

This comment has been minimized.

This comment has been minimized.

This comment has been minimized.

This comment has been minimized.

This comment has been minimized.

This comment has been minimized.

adam3smith commented Jan 14, 2018

This comment has been minimized.

zuphilip commented Jan 14, 2018

This comment has been minimized.

zuphilip commented Feb 4, 2018

This comment has been minimized.

psisquared2 commented Feb 4, 2018

This comment has been minimized.

zuphilip commented Feb 4, 2018

Hide details View details zuphilip merged commit 377d00c into zotero:master Feb 4, 2018 1 check passed

1 check passed

This comment has been minimized.

zuphilip commented Feb 4, 2018

mtrojan-ub added a commit to ubtue/zotero-translators that referenced this pull request Feb 6, 2018

adam3smith requested changes Oct 15, 2017

View changes

zuphilip reviewed Jan 3, 2018

View changes

zuphilip merged commit `377d00c` into zotero:master Feb 4, 2018
1 check passed