New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add The Economic Times.js #1282

Merged
merged 5 commits into from Apr 5, 2017

Conversation

Projects
None yet
5 participants
@sonali0901
Contributor

sonali0901 commented Mar 18, 2017

I am learning to write a translator. The following scraping is working fine in Scaffold and the translation is successful. I want to ask how the tests are automatically created by Scaffold?

{
	"translatorID": "6ec8008d-b206-4a4c-8d0a-8ef33807703b",
	.....
	"browserSupport": "gcsibv",
	"lastUpdated": "2014-04-04 09:55:12"
}

I believe, the above section and the test cases that we need to write for each translator are generated by Scaffold and not coded by us. Can anyone explain how to do that?

@adam3smith

This comment has been minimized.

Show comment
Hide comment
@adam3smith

adam3smith Mar 18, 2017

Collaborator

Which documentation are you using for learning? A lot of the things you're doing in the translator can be done much simpler with current Zotero code.

As for Scaffold, the header is written automatically. You should just save the translator from Scaffold and then open it from it's location on the harddisk, i.e. in the translators folder in the Zotero data director.

The translator tests are created automatically when you click on "New Web" in the "Testing" pane in translator while looking at a relevant page. You'll then want to click "Save" again in Testing once you have all tests and make sure to save your translator (using the "save" icon at the top of Scaffold).

Collaborator

adam3smith commented Mar 18, 2017

Which documentation are you using for learning? A lot of the things you're doing in the translator can be done much simpler with current Zotero code.

As for Scaffold, the header is written automatically. You should just save the translator from Scaffold and then open it from it's location on the harddisk, i.e. in the translators folder in the Zotero data director.

The translator tests are created automatically when you click on "New Web" in the "Testing" pane in translator while looking at a relevant page. You'll then want to click "Save" again in Testing once you have all tests and make sure to save your translator (using the "save" icon at the top of Scaffold).

@sonali0901

This comment has been minimized.

Show comment
Hide comment
@sonali0901

sonali0901 Mar 18, 2017

Contributor

I went through the HWZT guide http://niche-canada.org/member-projects/zotero-guide/chapter1.html
I took help from other pre-written translators in the same category.
I think I'll learn as I explore, thanks for the quick reply 👍

Contributor

sonali0901 commented Mar 18, 2017

I went through the HWZT guide http://niche-canada.org/member-projects/zotero-guide/chapter1.html
I took help from other pre-written translators in the same category.
I think I'll learn as I explore, thanks for the quick reply 👍

@adam3smith

This comment has been minimized.

Show comment
Hide comment
@adam3smith

adam3smith Mar 18, 2017

Collaborator

yeah, HWZT is almost 10 years old, pretty much everything it says related to code is outdated, so we'll ask you to redo a lot of the things that you base on that. Do check recently updated translators, @zuphilip 's common code blocks (and mistakes): https://github.com/zuphilip/translators/wiki/Common-code-blocks-for-translators as well as the translator coding documentation at https://www.zotero.org/support/dev/translators/coding

Collaborator

adam3smith commented Mar 18, 2017

yeah, HWZT is almost 10 years old, pretty much everything it says related to code is outdated, so we'll ask you to redo a lot of the things that you base on that. Do check recently updated translators, @zuphilip 's common code blocks (and mistakes): https://github.com/zuphilip/translators/wiki/Common-code-blocks-for-translators as well as the translator coding documentation at https://www.zotero.org/support/dev/translators/coding

@adam3smith

just a couple of pointers so you don't waste time re-doing things.

Show outdated Hide outdated The Economic Times.js
}
function doWeb(doc, url) {
var namespace = doc.documentElement.namespaceURI;

This comment has been minimized.

@adam3smith

adam3smith Mar 18, 2017

Collaborator

e.g. you don't need this for any regular translator

@adam3smith

adam3smith Mar 18, 2017

Collaborator

e.g. you don't need this for any regular translator

Show outdated Hide outdated The Economic Times.js
//srcaping titles of search results
if (detectWeb(doc, url) == "multiple") {
var myXPath = '//main/descendant::h3';
var myXPathObject = doc.evaluate(myXPath, doc, nsResolver, XPathResult.ANY_TYPE, null);

This comment has been minimized.

@adam3smith

adam3smith Mar 18, 2017

Collaborator

doing this type of thing with ZU.xpath is much easier

@adam3smith

adam3smith Mar 18, 2017

Collaborator

doing this type of thing with ZU.xpath is much easier

Show outdated Hide outdated The Economic Times.js
function scrape(doc, url) {
newItem = new Zotero.Item("newspaperArticle");
newItem.url = doc.location.href;

This comment has been minimized.

@adam3smith

adam3smith Mar 18, 2017

Collaborator

don't use this if you have the URL from the url variable

@adam3smith

adam3smith Mar 18, 2017

Collaborator

don't use this if you have the URL from the url variable

This comment has been minimized.

@sonali0901

sonali0901 Mar 18, 2017

Contributor

Understood.

@sonali0901

sonali0901 Mar 18, 2017

Contributor

Understood.

@sonali0901

This comment has been minimized.

Show comment
Hide comment
@sonali0901

sonali0901 Mar 19, 2017

Contributor

The test creation for search pages is failing in Scaffold, reporting that the date is null. When each search result is passed for scraping through Zotero.Utilities.processDocuments(articles, scrape);, why is it showing date null? When a single article (through its url) is scraped the tests run fine. Any idea? Should the dates be scraped from search result page?

Contributor

sonali0901 commented Mar 19, 2017

The test creation for search pages is failing in Scaffold, reporting that the date is null. When each search result is passed for scraping through Zotero.Utilities.processDocuments(articles, scrape);, why is it showing date null? When a single article (through its url) is scraped the tests run fine. Any idea? Should the dates be scraped from search result page?

@zuphilip

This comment has been minimized.

Show comment
Hide comment
@zuphilip

zuphilip Mar 19, 2017

Collaborator

I guess you have to restrict the search results to articles only. Currently I also find "slidesshows" e.g. http://economictimes.indiatimes.com/slideshows/nation-world/many-memorable-firsts-in-2017-up-assembly-elections/many-firsts/slideshow/57577668.cms which will give you an error.

Collaborator

zuphilip commented Mar 19, 2017

I guess you have to restrict the search results to articles only. Currently I also find "slidesshows" e.g. http://economictimes.indiatimes.com/slideshows/nation-world/many-memorable-firsts-in-2017-up-assembly-elections/many-firsts/slideshow/57577668.cms which will give you an error.

Show outdated Hide outdated The Economic Times.js
var myXPath = '//main/descendant::h3';
var myXPathObject = ZU.xpathText(doc, '//main/div[1]/div[2]/a/h2',',','*');
myXPathObject = myXPathObject.concat(ZU.xpathText(doc, myXPath,',','*'));
var array = myXPathObject.split('*')

This comment has been minimized.

@zuphilip

zuphilip Mar 19, 2017

Collaborator

Try something like this instead:

function getSearchResults(doc, checkOnly) {
	var items = {};
	var found = false;
	//TODO: adjust the xpath
	var rows = ZU.xpath(doc, '//main//a[(h2 or h3) and contains(@href, "/articleshow")]');
	for (var i=0; i<rows.length; i++) {
		//TODO: check and maybe adjust
		var href = rows[i].href;
		//TODO: check and maybe adjust
		var title = ZU.trimInternal(rows[i].textContent);
		if (!href || !title) continue;
		if (checkOnly) return true;
		found = true;
		items[href] = title;
	}
	return found ? items : false;
}


function doWeb(doc, url) {
	if (detectWeb(doc, url) == "multiple") {
		Zotero.selectItems(getSearchResults(doc, false), function (items) {
			if (!items) {
				return true;
			}
			var articles = [];
			for (var i in items) {
				articles.push(i);
			}
			ZU.processDocuments(articles, scrape);
		});
	} else {
		scrape(doc, url);
	}
}

This is some stable code which is used in a lot of other translators as well. Check the TODOs and adjust them possibly.

@zuphilip

zuphilip Mar 19, 2017

Collaborator

Try something like this instead:

function getSearchResults(doc, checkOnly) {
	var items = {};
	var found = false;
	//TODO: adjust the xpath
	var rows = ZU.xpath(doc, '//main//a[(h2 or h3) and contains(@href, "/articleshow")]');
	for (var i=0; i<rows.length; i++) {
		//TODO: check and maybe adjust
		var href = rows[i].href;
		//TODO: check and maybe adjust
		var title = ZU.trimInternal(rows[i].textContent);
		if (!href || !title) continue;
		if (checkOnly) return true;
		found = true;
		items[href] = title;
	}
	return found ? items : false;
}


function doWeb(doc, url) {
	if (detectWeb(doc, url) == "multiple") {
		Zotero.selectItems(getSearchResults(doc, false), function (items) {
			if (!items) {
				return true;
			}
			var articles = [];
			for (var i in items) {
				articles.push(i);
			}
			ZU.processDocuments(articles, scrape);
		});
	} else {
		scrape(doc, url);
	}
}

This is some stable code which is used in a lot of other translators as well. Check the TODOs and adjust them possibly.

Show outdated Hide outdated The Economic Times.js
if(!date) date = ZU.xpathText(doc, '//article/div[1]/div[2]');
date = date.substring(date.indexOf("|") + 1);
var date_day = date.replace('Updated:','');
newItem.date = date_day;

This comment has been minimized.

@zuphilip

zuphilip Mar 19, 2017

Collaborator

This looks very fragile, because it depends on the exact order of the nodes. Usually it is better to use some attributes to filter on the correct elements (especially class). Moreover, if the variables like date can still be null, we should test that before continue.

I suggest here something like:

	var date = ZU.xpathText(doc, '//article/div/div[contains(@class, "publish_on") or contains(@class, "byline")]');
	if (date) {
		newItem.date = ZU.strToISO(date);
	}

Note: The ZU.strToISO is deleting most of the other noise around the date and it seems already to work good for a few examples I tested.

@zuphilip

zuphilip Mar 19, 2017

Collaborator

This looks very fragile, because it depends on the exact order of the nodes. Usually it is better to use some attributes to filter on the correct elements (especially class). Moreover, if the variables like date can still be null, we should test that before continue.

I suggest here something like:

	var date = ZU.xpathText(doc, '//article/div/div[contains(@class, "publish_on") or contains(@class, "byline")]');
	if (date) {
		newItem.date = ZU.strToISO(date);
	}

Note: The ZU.strToISO is deleting most of the other noise around the date and it seems already to work good for a few examples I tested.

@owcz

This comment has been minimized.

Show comment
Hide comment
@owcz

owcz Mar 19, 2017

Contributor

HWZT is almost 10 years old, pretty much everything it says related to code is outdated

@adam3smith Should we remove it from the wiki then?

Contributor

owcz commented Mar 19, 2017

HWZT is almost 10 years old, pretty much everything it says related to code is outdated

@adam3smith Should we remove it from the wiki then?

@sonali0901 sonali0901 changed the title from WIP: Add The Economic Times.js to Add The Economic Times.js Mar 19, 2017

@avram

This comment has been minimized.

Show comment
Hide comment
@avram

avram Mar 20, 2017

Contributor

The original HWZT isn't on the wiki, so we'd have to ask Adam Crymble to take it down or put a warning on the page. There is a version of it on the wiki (https://www.zotero.org/support/dev/how_to_write_a_zotero_translator_plusplus); that version is also outdated and would need major work to be a fully appropriate introduction to translator writing.

Contributor

avram commented Mar 20, 2017

The original HWZT isn't on the wiki, so we'd have to ask Adam Crymble to take it down or put a warning on the page. There is a version of it on the wiki (https://www.zotero.org/support/dev/how_to_write_a_zotero_translator_plusplus); that version is also outdated and would need major work to be a fully appropriate introduction to translator writing.

@zuphilip

Looks good for me. The target should be adjusted and the rest is just some nits.

Show outdated Hide outdated The Economic Times.js
"translatorID": "1a9a7ecf-01e9-4d5d-aa19-a7aa4010da83",
"label": "The Economic Times",
"creator": "Sonali Gupta",
"target": "http://economictimes.indiatimes.com",

This comment has been minimized.

@zuphilip

zuphilip Mar 22, 2017

Collaborator

This is a regular expression and we can therefore be more strict about the beginning, but already now allow possible https urls, and escape the points, i.e.

-"target": "http://economictimes.indiatimes.com",
+"target": "^https?://economictimes\\.indiatimes\\.com",

Note: In Scaffold you only need to escape them once, i.e. \..

@zuphilip

zuphilip Mar 22, 2017

Collaborator

This is a regular expression and we can therefore be more strict about the beginning, but already now allow possible https urls, and escape the points, i.e.

-"target": "http://economictimes.indiatimes.com",
+"target": "^https?://economictimes\\.indiatimes\\.com",

Note: In Scaffold you only need to escape them once, i.e. \..

This comment has been minimized.

@sonali0901

sonali0901 Mar 24, 2017

Contributor

I thought since this section is generated automatically, so no changes were to be made. Thanks for letting me know.

@sonali0901

sonali0901 Mar 24, 2017

Contributor

I thought since this section is generated automatically, so no changes were to be made. Thanks for letting me know.

This comment has been minimized.

@zuphilip

zuphilip Mar 24, 2017

Collaborator

I don't think you can generate the target regexp automatically...

@zuphilip

zuphilip Mar 24, 2017

Collaborator

I don't think you can generate the target regexp automatically...

This comment has been minimized.

@zuphilip

zuphilip Mar 24, 2017

Collaborator

Let me say it better: You input the information on the first tab in Saffold and all these information will be automatically written into this part here.

@zuphilip

zuphilip Mar 24, 2017

Collaborator

Let me say it better: You input the information on the first tab in Saffold and all these information will be automatically written into this part here.

Show outdated Hide outdated The Economic Times.js
*/
function detectWeb(doc, url) {
if (url.indexOf("/topic/") != -1) {

This comment has been minimized.

@zuphilip

zuphilip Mar 22, 2017

Collaborator

We should also check here that there are any results for multiples, i.e.

if (url.indexOf("/topic/") != -1 && getSearchResults(doc, true)) {
@zuphilip

zuphilip Mar 22, 2017

Collaborator

We should also check here that there are any results for multiples, i.e.

if (url.indexOf("/topic/") != -1 && getSearchResults(doc, true)) {
Show outdated Hide outdated The Economic Times.js
//get abstract
var abstract = ZU.xpathText(doc, '//meta[@property="og:description"]/@content');
newItem.abstractNote = abstract;

This comment has been minimized.

@zuphilip

zuphilip Mar 22, 2017

Collaborator

There is no need for an extra variable abstract here and you can write these two lines in one line.

@zuphilip

zuphilip Mar 22, 2017

Collaborator

There is no need for an extra variable abstract here and you can write these two lines in one line.

Show outdated Hide outdated The Economic Times.js
newItem.creators.push({lastName:authors, creatorType: "author",fieldMode: 1})
}
}

This comment has been minimized.

@zuphilip

zuphilip Mar 22, 2017

Collaborator

Is the language always English? Then you can add this as a constant value. Maybe also an ISSN exists?

@zuphilip

zuphilip Mar 22, 2017

Collaborator

Is the language always English? Then you can add this as a constant value. Maybe also an ISSN exists?

This comment has been minimized.

@sonali0901

sonali0901 Mar 24, 2017

Contributor

ISSN doesn't exist. I will make the desired changes.

@sonali0901

sonali0901 Mar 24, 2017

Contributor

ISSN doesn't exist. I will make the desired changes.

Show outdated Hide outdated The Economic Times.js
authors_org=authors.substring(0,authors.lastIndexOf("|")-1);
var regex = /(.*By\s+)(.*)/;
authors = authors_org.replace(regex, "$2");
newItem.creators.push({lastName:authors, creatorType: "author",fieldMode: 1})

This comment has been minimized.

@zuphilip

zuphilip Mar 22, 2017

Collaborator

Add space before fieldMode, delete one of the two spaces before creatorType.

@zuphilip

zuphilip Mar 22, 2017

Collaborator

Add space before fieldMode, delete one of the two spaces before creatorType.

@sonali0901

This comment has been minimized.

Show comment
Hide comment
@sonali0901

sonali0901 Mar 24, 2017

Contributor

@zuphilip There is an option to choose from different languages, but that redirects to other URL. For example http://hindi.economictimes.indiatimes.com/ or http://gujarati.economictimes.indiatimes.com/
The HTML structure of these pages is also not similar. So I think keeping the language as English for the target URL should be fine.

Contributor

sonali0901 commented Mar 24, 2017

@zuphilip There is an option to choose from different languages, but that redirects to other URL. For example http://hindi.economictimes.indiatimes.com/ or http://gujarati.economictimes.indiatimes.com/
The HTML structure of these pages is also not similar. So I think keeping the language as English for the target URL should be fine.

@adam3smith adam3smith merged commit 4333b9c into zotero:master Apr 5, 2017

1 check passed

continuous-integration/travis-ci/pr The Travis CI build passed
Details
@adam3smith

This comment has been minimized.

Show comment
Hide comment
@adam3smith

adam3smith Apr 5, 2017

Collaborator

Thanks you & sorry for the delay with this.

Collaborator

adam3smith commented Apr 5, 2017

Thanks you & sorry for the delay with this.

@sonali0901 sonali0901 deleted the sonali0901:TET branch Apr 5, 2017

zuphilip added a commit to zuphilip/translators that referenced this pull request Mar 28, 2018

zuphilip added a commit to zuphilip/translators that referenced this pull request Mar 28, 2018

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment