Add The Economic Times.js #1282

sonali0901 · Mar 18, 2017

I am learning to write a translator. The following scraping is working fine in Scaffold and the translation is successful. I want to ask how the tests are automatically created by Scaffold?

{
	"translatorID": "6ec8008d-b206-4a4c-8d0a-8ef33807703b",
	.....
	"browserSupport": "gcsibv",
	"lastUpdated": "2014-04-04 09:55:12"
}

I believe, the above section and the test cases that we need to write for each translator are generated by Scaffold and not coded by us. Can anyone explain how to do that?

Which documentation are you using for learning? A lot of the things you're doing in the translator can be done much simpler with current Zotero code.

As for Scaffold, the header is written automatically. You should just save the translator from Scaffold and then open it from it's location on the harddisk, i.e. in the translators folder in the Zotero data director.

The translator tests are created automatically when you click on "New Web" in the "Testing" pane in translator while looking at a relevant page. You'll then want to click "Save" again in Testing once you have all tests and make sure to save your translator (using the "save" icon at the top of Scaffold).

adam3smith · Mar 18, 2017

Which documentation are you using for learning? A lot of the things you're doing in the translator can be done much simpler with current Zotero code.

As for Scaffold, the header is written automatically. You should just save the translator from Scaffold and then open it from it's location on the harddisk, i.e. in the translators folder in the Zotero data director.

The translator tests are created automatically when you click on "New Web" in the "Testing" pane in translator while looking at a relevant page. You'll then want to click "Save" again in Testing once you have all tests and make sure to save your translator (using the "save" icon at the top of Scaffold).

I went through the HWZT guide http://niche-canada.org/member-projects/zotero-guide/chapter1.html
I took help from other pre-written translators in the same category.
I think I'll learn as I explore, thanks for the quick reply 👍

sonali0901 · Mar 18, 2017

I went through the HWZT guide http://niche-canada.org/member-projects/zotero-guide/chapter1.html
I took help from other pre-written translators in the same category.
I think I'll learn as I explore, thanks for the quick reply 👍

yeah, HWZT is almost 10 years old, pretty much everything it says related to code is outdated, so we'll ask you to redo a lot of the things that you base on that. Do check recently updated translators, @zuphilip 's common code blocks (and mistakes): https://github.com/zuphilip/translators/wiki/Common-code-blocks-for-translators as well as the translator coding documentation at https://www.zotero.org/support/dev/translators/coding

adam3smith · Mar 18, 2017

yeah, HWZT is almost 10 years old, pretty much everything it says related to code is outdated, so we'll ask you to redo a lot of the things that you base on that. Do check recently updated translators, @zuphilip 's common code blocks (and mistakes): https://github.com/zuphilip/translators/wiki/Common-code-blocks-for-translators as well as the translator coding documentation at https://www.zotero.org/support/dev/translators/coding

adam3smith · Mar 18, 2017

adam3smith reviewed Mar 18, 2017

View changes

just a couple of pointers so you don't waste time re-doing things.

adam3smith · Mar 18, 2017

The Economic Times.js

+}
+
+function doWeb(doc, url) {
+	var namespace = doc.documentElement.namespaceURI;


e.g. you don't need this for any regular translator

adam3smith · Mar 18, 2017

The Economic Times.js

+	//srcaping titles of search results
+	if (detectWeb(doc, url) == "multiple") {
+		var myXPath = '//main/descendant::h3';
+		var myXPathObject = doc.evaluate(myXPath, doc, nsResolver, XPathResult.ANY_TYPE, null); 


doing this type of thing with ZU.xpath is much easier

adam3smith · Mar 18, 2017

The Economic Times.js

+
+function scrape(doc, url) {
+	newItem = new Zotero.Item("newspaperArticle");
+	newItem.url = doc.location.href;


don't use this if you have the URL from the url variable

Understood.

The test creation for search pages is failing in Scaffold, reporting that the date is null. When each search result is passed for scraping through Zotero.Utilities.processDocuments(articles, scrape);, why is it showing date null? When a single article (through its url) is scraped the tests run fine. Any idea? Should the dates be scraped from search result page?

sonali0901 · Mar 19, 2017

The test creation for search pages is failing in Scaffold, reporting that the date is null. When each search result is passed for scraping through Zotero.Utilities.processDocuments(articles, scrape);, why is it showing date null? When a single article (through its url) is scraped the tests run fine. Any idea? Should the dates be scraped from search result page?

I guess you have to restrict the search results to articles only. Currently I also find "slidesshows" e.g. http://economictimes.indiatimes.com/slideshows/nation-world/many-memorable-firsts-in-2017-up-assembly-elections/many-firsts/slideshow/57577668.cms which will give you an error.

zuphilip · Mar 19, 2017

I guess you have to restrict the search results to articles only. Currently I also find "slidesshows" e.g. http://economictimes.indiatimes.com/slideshows/nation-world/many-memorable-firsts-in-2017-up-assembly-elections/many-firsts/slideshow/57577668.cms which will give you an error.

zuphilip · Mar 19, 2017

zuphilip reviewed Mar 19, 2017

View changes

zuphilip · Mar 19, 2017

The Economic Times.js

+		var myXPath = '//main/descendant::h3';
+		var myXPathObject = ZU.xpathText(doc, '//main/div[1]/div[2]/a/h2',',','*');
+		myXPathObject = myXPathObject.concat(ZU.xpathText(doc, myXPath,',','*'));
+		var array = myXPathObject.split('*')


Try something like this instead:

function getSearchResults(doc, checkOnly) { var items = {}; var found = false; //TODO: adjust the xpath var rows = ZU.xpath(doc, '//main//a[(h2 or h3) and contains(@href, "/articleshow")]'); for (var i=0; i<rows.length; i++) { //TODO: check and maybe adjust var href = rows[i].href; //TODO: check and maybe adjust var title = ZU.trimInternal(rows[i].textContent); if (!href || !title) continue; if (checkOnly) return true; found = true; items[href] = title; } return found ? items : false; } function doWeb(doc, url) { if (detectWeb(doc, url) == "multiple") { Zotero.selectItems(getSearchResults(doc, false), function (items) { if (!items) { return true; } var articles = []; for (var i in items) { articles.push(i); } ZU.processDocuments(articles, scrape); }); } else { scrape(doc, url); } }

This is some stable code which is used in a lot of other translators as well. Check the TODOs and adjust them possibly.

zuphilip · Mar 19, 2017

zuphilip reviewed Mar 19, 2017

View changes

zuphilip · Mar 19, 2017

The Economic Times.js

+	if(!date) date = ZU.xpathText(doc, '//article/div[1]/div[2]');
+	date = date.substring(date.indexOf("|") + 1);
+	var date_day = date.replace('Updated:','');
+	newItem.date = date_day;


This looks very fragile, because it depends on the exact order of the nodes. Usually it is better to use some attributes to filter on the correct elements (especially class). Moreover, if the variables like date can still be null, we should test that before continue.

I suggest here something like:

var date = ZU.xpathText(doc, '//article/div/div[contains(@class, "publish_on") or contains(@class, "byline")]'); if (date) { newItem.date = ZU.strToISO(date); }

Note: The ZU.strToISO is deleting most of the other noise around the date and it seems already to work good for a few examples I tested.

HWZT is almost 10 years old, pretty much everything it says related to code is outdated

@adam3smith Should we remove it from the wiki then?

owcz · Mar 19, 2017

HWZT is almost 10 years old, pretty much everything it says related to code is outdated

@adam3smith Should we remove it from the wiki then?

The original HWZT isn't on the wiki, so we'd have to ask Adam Crymble to take it down or put a warning on the page. There is a version of it on the wiki (https://www.zotero.org/support/dev/how_to_write_a_zotero_translator_plusplus); that version is also outdated and would need major work to be a fully appropriate introduction to translator writing.

avram · Mar 20, 2017

The original HWZT isn't on the wiki, so we'd have to ask Adam Crymble to take it down or put a warning on the page. There is a version of it on the wiki (https://www.zotero.org/support/dev/how_to_write_a_zotero_translator_plusplus); that version is also outdated and would need major work to be a fully appropriate introduction to translator writing.

zuphilip · Mar 22, 2017

zuphilip reviewed Mar 22, 2017

View changes

Looks good for me. The target should be adjusted and the rest is just some nits.

zuphilip · Mar 22, 2017

The Economic Times.js

+	"translatorID": "1a9a7ecf-01e9-4d5d-aa19-a7aa4010da83",
+	"label": "The Economic Times",
+	"creator": "Sonali Gupta",
+	"target": "http://economictimes.indiatimes.com",


This is a regular expression and we can therefore be more strict about the beginning, but already now allow possible https urls, and escape the points, i.e.

-"target": "http://economictimes.indiatimes.com", +"target": "^https?://economictimes\\.indiatimes\\.com",

Note: In Scaffold you only need to escape them once, i.e. \..

I thought since this section is generated automatically, so no changes were to be made. Thanks for letting me know.

I don't think you can generate the target regexp automatically...

Let me say it better: You input the information on the first tab in Saffold and all these information will be automatically written into this part here.

zuphilip · Mar 22, 2017

The Economic Times.js

+*/
+
+function detectWeb(doc, url) {
+	if (url.indexOf("/topic/") != -1) {


We should also check here that there are any results for multiples, i.e.

if (url.indexOf("/topic/") != -1 && getSearchResults(doc, true)) {

zuphilip · Mar 22, 2017

The Economic Times.js

+
+	//get abstract
+	var abstract = ZU.xpathText(doc, '//meta[@property="og:description"]/@content');
+	newItem.abstractNote = abstract;


There is no need for an extra variable abstract here and you can write these two lines in one line.

zuphilip · Mar 22, 2017

The Economic Times.js

+			newItem.creators.push({lastName:authors,  creatorType: "author",fieldMode: 1})
+		}
+	}
+	


Is the language always English? Then you can add this as a constant value. Maybe also an ISSN exists?

ISSN doesn't exist. I will make the desired changes.

zuphilip · Mar 22, 2017

The Economic Times.js

+			authors_org=authors.substring(0,authors.lastIndexOf("|")-1);
+			var regex = /(.*By\s+)(.*)/;
+			authors = authors_org.replace(regex, "$2");
+			newItem.creators.push({lastName:authors,  creatorType: "author",fieldMode: 1})


Add space before fieldMode, delete one of the two spaces before creatorType.

@zuphilip There is an option to choose from different languages, but that redirects to other URL. For example http://hindi.economictimes.indiatimes.com/ or http://gujarati.economictimes.indiatimes.com/
The HTML structure of these pages is also not similar. So I think keeping the language as English for the target URL should be fine.

sonali0901 · Mar 24, 2017

@zuphilip There is an option to choose from different languages, but that redirects to other URL. For example http://hindi.economictimes.indiatimes.com/ or http://gujarati.economictimes.indiatimes.com/
The HTML structure of these pages is also not similar. So I think keeping the language as English for the target URL should be fine.

Thanks you & sorry for the delay with this.

adam3smith · Apr 5, 2017

Thanks you & sorry for the delay with this.

add The Economic Times WIP

6808a54

sonali0901 added some commits Mar 19, 2017

WIP:Test cases added

0b3cba0

WIP:Test cases added

eb95d22

Test case for search added

fe1538f

sonali0901 changed the title from WIP: Add The Economic Times.js to Add The Economic Times.js Mar 19, 2017

Regex for target added

092da16

adam3smith merged commit 4333b9c into zotero:master Apr 5, 2017
1 check passed

1 check passed

continuous-integration/travis-ci/pr The Travis CI build passed
Details

sonali0901 deleted the sonali0901:TET branch Apr 5, 2017

zuphilip added a commit to zuphilip/translators that referenced this pull request Mar 28, 2018

Add The Economic Times.js (#1282)

c4be1b4

zuphilip added a commit to zuphilip/translators that referenced this pull request Mar 28, 2018

Add The Economic Times.js (#1282)

70e4fad

zotero/translators

Join GitHub today

Add The Economic Times.js #1282

Conversation

sonali0901 commented Mar 18, 2017

This comment has been minimized.

adam3smith commented Mar 18, 2017

This comment has been minimized.

sonali0901 commented Mar 18, 2017

This comment has been minimized.

adam3smith commented Mar 18, 2017

adam3smith reviewed Mar 18, 2017 View changes

This comment has been minimized.

This comment has been minimized.

This comment has been minimized.

This comment has been minimized.

sonali0901 added some commits Mar 19, 2017

This comment has been minimized.

sonali0901 commented Mar 19, 2017 • edited Edited 1 time sonali0901 edited Mar 19, 2017 (most recent)

This comment has been minimized.

zuphilip commented Mar 19, 2017

zuphilip reviewed Mar 19, 2017 View changes

This comment has been minimized.

zuphilip reviewed Mar 19, 2017 View changes

This comment has been minimized.

This comment has been minimized.

owcz commented Mar 19, 2017

sonali0901 changed the title from WIP: Add The Economic Times.js to Add The Economic Times.js Mar 19, 2017

This comment has been minimized.

avram commented Mar 20, 2017

zuphilip reviewed Mar 22, 2017 View changes

This comment has been minimized.

This comment has been minimized.

This comment has been minimized.

This comment has been minimized.

This comment has been minimized.

This comment has been minimized.

This comment has been minimized.

This comment has been minimized.

sonali0901 Mar 24, 2017 • edited Edited 1 time sonali0901 edited Mar 24, 2017 (most recent)

This comment has been minimized.

This comment has been minimized.

sonali0901 commented Mar 24, 2017 • edited Edited 1 time sonali0901 edited Mar 24, 2017 (most recent)

Hide details View details adam3smith merged commit 4333b9c into zotero:master Apr 5, 2017 1 check passed

1 check passed

This comment has been minimized.

adam3smith commented Apr 5, 2017

sonali0901 deleted the sonali0901:TET branch Apr 5, 2017

zuphilip added a commit to zuphilip/translators that referenced this pull request Mar 28, 2018

zuphilip added a commit to zuphilip/translators that referenced this pull request Mar 28, 2018

adam3smith reviewed Mar 18, 2017

View changes

sonali0901 commented Mar 19, 2017 •

edited

Edited 1 time

sonali0901 edited Mar 19, 2017 (most recent)

zuphilip reviewed Mar 19, 2017

View changes

zuphilip reviewed Mar 19, 2017

View changes

zuphilip reviewed Mar 22, 2017

View changes

sonali0901 Mar 24, 2017 •

edited

Edited 1 time

sonali0901 edited Mar 24, 2017 (most recent)

sonali0901 commented Mar 24, 2017 •

edited

Edited 1 time

sonali0901 edited Mar 24, 2017 (most recent)

adam3smith merged commit `4333b9c` into zotero:master Apr 5, 2017
1 check passed