Join GitHub today
GitHub is home to over 28 million developers working together to host and review code, manage projects, and build software together.
Sign upAdd The Economic Times.js #1282
Conversation
This comment has been minimized.
Show comment
Hide comment
This comment has been minimized.
adam3smith
Mar 18, 2017
Collaborator
Which documentation are you using for learning? A lot of the things you're doing in the translator can be done much simpler with current Zotero code.
As for Scaffold, the header is written automatically. You should just save the translator from Scaffold and then open it from it's location on the harddisk, i.e. in the translators folder in the Zotero data director.
The translator tests are created automatically when you click on "New Web" in the "Testing" pane in translator while looking at a relevant page. You'll then want to click "Save" again in Testing once you have all tests and make sure to save your translator (using the "save" icon at the top of Scaffold).
Which documentation are you using for learning? A lot of the things you're doing in the translator can be done much simpler with current Zotero code. As for Scaffold, the header is written automatically. You should just save the translator from Scaffold and then open it from it's location on the harddisk, i.e. in the translators folder in the Zotero data director. The translator tests are created automatically when you click on "New Web" in the "Testing" pane in translator while looking at a relevant page. You'll then want to click "Save" again in Testing once you have all tests and make sure to save your translator (using the "save" icon at the top of Scaffold). |
This comment has been minimized.
Show comment
Hide comment
This comment has been minimized.
sonali0901
Mar 18, 2017
Contributor
I went through the HWZT guide http://niche-canada.org/member-projects/zotero-guide/chapter1.html
I took help from other pre-written translators in the same category.
I think I'll learn as I explore, thanks for the quick reply
I went through the HWZT guide http://niche-canada.org/member-projects/zotero-guide/chapter1.html |
This comment has been minimized.
Show comment
Hide comment
This comment has been minimized.
adam3smith
Mar 18, 2017
Collaborator
yeah, HWZT is almost 10 years old, pretty much everything it says related to code is outdated, so we'll ask you to redo a lot of the things that you base on that. Do check recently updated translators, @zuphilip 's common code blocks (and mistakes): https://github.com/zuphilip/translators/wiki/Common-code-blocks-for-translators as well as the translator coding documentation at https://www.zotero.org/support/dev/translators/coding
yeah, HWZT is almost 10 years old, pretty much everything it says related to code is outdated, so we'll ask you to redo a lot of the things that you base on that. Do check recently updated translators, @zuphilip 's common code blocks (and mistakes): https://github.com/zuphilip/translators/wiki/Common-code-blocks-for-translators as well as the translator coding documentation at https://www.zotero.org/support/dev/translators/coding |
adam3smith
reviewed
Mar 18, 2017
just a couple of pointers so you don't waste time re-doing things.
The Economic Times.js
} | ||
function doWeb(doc, url) { | ||
var namespace = doc.documentElement.namespaceURI; |
This comment has been minimized.
Show comment
Hide comment
This comment has been minimized.
The Economic Times.js
//srcaping titles of search results | ||
if (detectWeb(doc, url) == "multiple") { | ||
var myXPath = '//main/descendant::h3'; | ||
var myXPathObject = doc.evaluate(myXPath, doc, nsResolver, XPathResult.ANY_TYPE, null); |
This comment has been minimized.
Show comment
Hide comment
This comment has been minimized.
The Economic Times.js
function scrape(doc, url) { | ||
newItem = new Zotero.Item("newspaperArticle"); | ||
newItem.url = doc.location.href; |
This comment has been minimized.
Show comment
Hide comment
This comment has been minimized.
This comment has been minimized.
Show comment
Hide comment
This comment has been minimized.
sonali0901
added some commits
Mar 19, 2017
This comment has been minimized.
Show comment
Hide comment
This comment has been minimized.
sonali0901
Mar 19, 2017
Contributor
The test creation for search pages is failing in Scaffold, reporting that the date is null. When each search result is passed for scraping through Zotero.Utilities.processDocuments(articles, scrape);
, why is it showing date null? When a single article (through its url) is scraped the tests run fine. Any idea? Should the dates be scraped from search result page?
The test creation for search pages is failing in Scaffold, reporting that the date is null. When each search result is passed for scraping through |
This comment has been minimized.
Show comment
Hide comment
This comment has been minimized.
zuphilip
Mar 19, 2017
Collaborator
I guess you have to restrict the search results to articles only. Currently I also find "slidesshows" e.g. http://economictimes.indiatimes.com/slideshows/nation-world/many-memorable-firsts-in-2017-up-assembly-elections/many-firsts/slideshow/57577668.cms which will give you an error.
I guess you have to restrict the search results to articles only. Currently I also find "slidesshows" e.g. http://economictimes.indiatimes.com/slideshows/nation-world/many-memorable-firsts-in-2017-up-assembly-elections/many-firsts/slideshow/57577668.cms which will give you an error. |
The Economic Times.js
var myXPath = '//main/descendant::h3'; | ||
var myXPathObject = ZU.xpathText(doc, '//main/div[1]/div[2]/a/h2',',','*'); | ||
myXPathObject = myXPathObject.concat(ZU.xpathText(doc, myXPath,',','*')); | ||
var array = myXPathObject.split('*') |
This comment has been minimized.
Show comment
Hide comment
This comment has been minimized.
zuphilip
Mar 19, 2017
Collaborator
Try something like this instead:
function getSearchResults(doc, checkOnly) {
var items = {};
var found = false;
//TODO: adjust the xpath
var rows = ZU.xpath(doc, '//main//a[(h2 or h3) and contains(@href, "/articleshow")]');
for (var i=0; i<rows.length; i++) {
//TODO: check and maybe adjust
var href = rows[i].href;
//TODO: check and maybe adjust
var title = ZU.trimInternal(rows[i].textContent);
if (!href || !title) continue;
if (checkOnly) return true;
found = true;
items[href] = title;
}
return found ? items : false;
}
function doWeb(doc, url) {
if (detectWeb(doc, url) == "multiple") {
Zotero.selectItems(getSearchResults(doc, false), function (items) {
if (!items) {
return true;
}
var articles = [];
for (var i in items) {
articles.push(i);
}
ZU.processDocuments(articles, scrape);
});
} else {
scrape(doc, url);
}
}
This is some stable code which is used in a lot of other translators as well. Check the TODO
s and adjust them possibly.
zuphilip
Mar 19, 2017
Collaborator
Try something like this instead:
function getSearchResults(doc, checkOnly) {
var items = {};
var found = false;
//TODO: adjust the xpath
var rows = ZU.xpath(doc, '//main//a[(h2 or h3) and contains(@href, "/articleshow")]');
for (var i=0; i<rows.length; i++) {
//TODO: check and maybe adjust
var href = rows[i].href;
//TODO: check and maybe adjust
var title = ZU.trimInternal(rows[i].textContent);
if (!href || !title) continue;
if (checkOnly) return true;
found = true;
items[href] = title;
}
return found ? items : false;
}
function doWeb(doc, url) {
if (detectWeb(doc, url) == "multiple") {
Zotero.selectItems(getSearchResults(doc, false), function (items) {
if (!items) {
return true;
}
var articles = [];
for (var i in items) {
articles.push(i);
}
ZU.processDocuments(articles, scrape);
});
} else {
scrape(doc, url);
}
}
This is some stable code which is used in a lot of other translators as well. Check the TODO
s and adjust them possibly.
The Economic Times.js
if(!date) date = ZU.xpathText(doc, '//article/div[1]/div[2]'); | ||
date = date.substring(date.indexOf("|") + 1); | ||
var date_day = date.replace('Updated:',''); | ||
newItem.date = date_day; |
This comment has been minimized.
Show comment
Hide comment
This comment has been minimized.
zuphilip
Mar 19, 2017
Collaborator
This looks very fragile, because it depends on the exact order of the nodes. Usually it is better to use some attributes to filter on the correct elements (especially class
). Moreover, if the variables like date
can still be null
, we should test that before continue.
I suggest here something like:
var date = ZU.xpathText(doc, '//article/div/div[contains(@class, "publish_on") or contains(@class, "byline")]');
if (date) {
newItem.date = ZU.strToISO(date);
}
Note: The ZU.strToISO
is deleting most of the other noise around the date and it seems already to work good for a few examples I tested.
zuphilip
Mar 19, 2017
Collaborator
This looks very fragile, because it depends on the exact order of the nodes. Usually it is better to use some attributes to filter on the correct elements (especially class
). Moreover, if the variables like date
can still be null
, we should test that before continue.
I suggest here something like:
var date = ZU.xpathText(doc, '//article/div/div[contains(@class, "publish_on") or contains(@class, "byline")]');
if (date) {
newItem.date = ZU.strToISO(date);
}
Note: The ZU.strToISO
is deleting most of the other noise around the date and it seems already to work good for a few examples I tested.
This comment has been minimized.
Show comment
Hide comment
This comment has been minimized.
owcz
Mar 19, 2017
Contributor
HWZT is almost 10 years old, pretty much everything it says related to code is outdated
@adam3smith Should we remove it from the wiki then?
@adam3smith Should we remove it from the wiki then? |
sonali0901
changed the title from
WIP: Add The Economic Times.js
to
Add The Economic Times.js
Mar 19, 2017
This comment has been minimized.
Show comment
Hide comment
This comment has been minimized.
avram
Mar 20, 2017
Contributor
The original HWZT isn't on the wiki, so we'd have to ask Adam Crymble to take it down or put a warning on the page. There is a version of it on the wiki (https://www.zotero.org/support/dev/how_to_write_a_zotero_translator_plusplus); that version is also outdated and would need major work to be a fully appropriate introduction to translator writing.
The original HWZT isn't on the wiki, so we'd have to ask Adam Crymble to take it down or put a warning on the page. There is a version of it on the wiki (https://www.zotero.org/support/dev/how_to_write_a_zotero_translator_plusplus); that version is also outdated and would need major work to be a fully appropriate introduction to translator writing. |
zuphilip
reviewed
Mar 22, 2017
Looks good for me. The target
should be adjusted and the rest is just some nits.
The Economic Times.js
"translatorID": "1a9a7ecf-01e9-4d5d-aa19-a7aa4010da83", | ||
"label": "The Economic Times", | ||
"creator": "Sonali Gupta", | ||
"target": "http://economictimes.indiatimes.com", |
This comment has been minimized.
Show comment
Hide comment
This comment has been minimized.
zuphilip
Mar 22, 2017
Collaborator
This is a regular expression and we can therefore be more strict about the beginning, but already now allow possible https
urls, and escape the points, i.e.
-"target": "http://economictimes.indiatimes.com",
+"target": "^https?://economictimes\\.indiatimes\\.com",
Note: In Scaffold you only need to escape them once, i.e. \.
.
zuphilip
Mar 22, 2017
Collaborator
This is a regular expression and we can therefore be more strict about the beginning, but already now allow possible https
urls, and escape the points, i.e.
-"target": "http://economictimes.indiatimes.com",
+"target": "^https?://economictimes\\.indiatimes\\.com",
Note: In Scaffold you only need to escape them once, i.e. \.
.
This comment has been minimized.
Show comment
Hide comment
This comment has been minimized.
sonali0901
Mar 24, 2017
Contributor
I thought since this section is generated automatically, so no changes were to be made. Thanks for letting me know.
sonali0901
Mar 24, 2017
Contributor
I thought since this section is generated automatically, so no changes were to be made. Thanks for letting me know.
This comment has been minimized.
Show comment
Hide comment
This comment has been minimized.
zuphilip
Mar 24, 2017
Collaborator
I don't think you can generate the target regexp automatically...
zuphilip
Mar 24, 2017
Collaborator
I don't think you can generate the target regexp automatically...
This comment has been minimized.
Show comment
Hide comment
This comment has been minimized.
zuphilip
Mar 24, 2017
Collaborator
Let me say it better: You input the information on the first tab in Saffold and all these information will be automatically written into this part here.
zuphilip
Mar 24, 2017
Collaborator
Let me say it better: You input the information on the first tab in Saffold and all these information will be automatically written into this part here.
The Economic Times.js
*/ | ||
function detectWeb(doc, url) { | ||
if (url.indexOf("/topic/") != -1) { |
This comment has been minimized.
Show comment
Hide comment
This comment has been minimized.
zuphilip
Mar 22, 2017
Collaborator
We should also check here that there are any results for multiples, i.e.
if (url.indexOf("/topic/") != -1 && getSearchResults(doc, true)) {
zuphilip
Mar 22, 2017
Collaborator
We should also check here that there are any results for multiples, i.e.
if (url.indexOf("/topic/") != -1 && getSearchResults(doc, true)) {
The Economic Times.js
//get abstract | ||
var abstract = ZU.xpathText(doc, '//meta[@property="og:description"]/@content'); | ||
newItem.abstractNote = abstract; |
This comment has been minimized.
Show comment
Hide comment
This comment has been minimized.
zuphilip
Mar 22, 2017
Collaborator
There is no need for an extra variable abstract
here and you can write these two lines in one line.
zuphilip
Mar 22, 2017
Collaborator
There is no need for an extra variable abstract
here and you can write these two lines in one line.
The Economic Times.js
newItem.creators.push({lastName:authors, creatorType: "author",fieldMode: 1}) | ||
} | ||
} | ||
This comment has been minimized.
Show comment
Hide comment
This comment has been minimized.
zuphilip
Mar 22, 2017
Collaborator
Is the language always English? Then you can add this as a constant value. Maybe also an ISSN exists?
zuphilip
Mar 22, 2017
Collaborator
Is the language always English? Then you can add this as a constant value. Maybe also an ISSN exists?
This comment has been minimized.
Show comment
Hide comment
This comment has been minimized.
The Economic Times.js
authors_org=authors.substring(0,authors.lastIndexOf("|")-1); | ||
var regex = /(.*By\s+)(.*)/; | ||
authors = authors_org.replace(regex, "$2"); | ||
newItem.creators.push({lastName:authors, creatorType: "author",fieldMode: 1}) |
This comment has been minimized.
Show comment
Hide comment
This comment has been minimized.
zuphilip
Mar 22, 2017
Collaborator
Add space before fieldMode
, delete one of the two spaces before creatorType
.
zuphilip
Mar 22, 2017
Collaborator
Add space before fieldMode
, delete one of the two spaces before creatorType
.
This comment has been minimized.
Show comment
Hide comment
This comment has been minimized.
sonali0901
Mar 24, 2017
Contributor
@zuphilip There is an option to choose from different languages, but that redirects to other URL. For example http://hindi.economictimes.indiatimes.com/ or http://gujarati.economictimes.indiatimes.com/
The HTML structure of these pages is also not similar. So I think keeping the language as English for the target URL should be fine.
@zuphilip There is an option to choose from different languages, but that redirects to other URL. For example http://hindi.economictimes.indiatimes.com/ or http://gujarati.economictimes.indiatimes.com/ |
adam3smith
merged commit 4333b9c
into
zotero:master
Apr 5, 2017
1 check passed
This comment has been minimized.
Show comment
Hide comment
This comment has been minimized.
Thanks you & sorry for the delay with this. |
sonali0901 commentedMar 18, 2017
I am learning to write a translator. The following scraping is working fine in Scaffold and the translation is successful. I want to ask how the tests are automatically created by Scaffold?
I believe, the above section and the test cases that we need to write for each translator are generated by Scaffold and not coded by us. Can anyone explain how to do that?