Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add Sacramento Bee #1363

Merged
merged 6 commits into from Jul 15, 2017

Conversation

@owcz
Copy link
Contributor

commented Jul 8, 2017

tests text()/attr() polyfill from #1277


// Authors
var authorMetadata = doc.querySelectorAll('.ng_byline_name');
if (authorMetadata) {

This comment has been minimized.

Copy link
@dstillman

dstillman Jul 8, 2017

Member

querySelectorAll() always returns a NodeList, even if it's empty, so you would want to use authorMetadata.length here.

item.title = attr(doc,'[property="og:title"]','content');
item.date = text(doc,'.published-date');
item.abstractNote = text(doc,'#content-body- p');
item.tags = attr(doc,'meta[name="keywords"]','content').split(", ");

This comment has been minimized.

Copy link
@dstillman

dstillman Jul 8, 2017

Member

attr() will return null if the selector isn't found, which would produce an error, so it'd be safer to assign it to a keywords variable and then do if (keywords) { item.tags = keywords.split(", "); }.

@adam3smith
Copy link
Collaborator

left a comment

Looks good. Some small things.


function scrape(doc, url) {
var item = new Zotero.Item("newspaperArticle");
item.websiteTitle = "The Sacramento Bee";

This comment has been minimized.

Copy link
@adam3smith
function detectWeb(doc, url) {
if (url.match(/article\d+/)) {
return "newspaperArticle";
} else if (url.match(/\/(news|sports|entertainment)\//)) {

This comment has been minimized.

Copy link
@adam3smith

adam3smith Jul 9, 2017

Collaborator

I'd be more comfortable if you implement a version of a getSearchResults function (using querySelectorAll instead of ZU.xpath, of course). These matches cover a broad set of pages and we really want to avoid false positives. There's a reason @zuphilip puts them in all translators.

Also, use .search(/regex/)!= -1 for efficient tests of strings against regexs

This comment has been minimized.

Copy link
@owcz

owcz Jul 9, 2017

Author Contributor

.search is more efficient than .match?

This comment has been minimized.

Copy link
@adam3smith

adam3smith Jul 9, 2017

Collaborator

yes, because it just has to return true/false vs. an actual match. See https://jsperf.com/exec-vs-match-vs-test-vs-search/5 (indexOf is dramatically more efficient, so you could also use that for this one since you don't really need a regex, just three separate strings)

This comment has been minimized.

Copy link
@adam3smith

adam3smith Jul 9, 2017

Collaborator

mozilla actually makes this distinction explicitin in the docs

When you want to know whether a pattern is found and also its index in a string use search() (if you only want to know it exists, use the similar test() method on the RegExp prototype, which returns a boolean); for more information (but slower execution) use match() (similar to the regular expression exec() method).

This comment has been minimized.

Copy link
@owcz

owcz Jul 9, 2017

Author Contributor

I poked around and it looks like test() becomes (15%) more efficient than multiple indexOf()s when comparing for multiple strings

return "multiple";
} else if (url.indexOf("/search/?q=") != -1) {
return "multiple";
} else return null;

This comment has been minimized.

Copy link
@adam3smith

adam3smith Jul 9, 2017

Collaborator

(no need for an explicit return null or false here)

item.tags = keywords.split(", ");
}
item.attachments.push({
title: "The Sacramento Bee snapshot",

This comment has been minimized.

Copy link
@adam3smith

adam3smith Jul 9, 2017

Collaborator

capitalize "snapshot" per convention.

"creatorType": "author"
}
],
"date": "January 08, 2015 1:09 PM",

This comment has been minimized.

Copy link
@adam3smith

adam3smith Jul 9, 2017

Collaborator

use ZU.strToISO to avoid importing English dates into non-English locales.

owcz and others added some commits Jul 9, 2017

function detectWeb(doc, url) {
if (url.search(/article\d+/) != -1) {
return "newspaperArticle";
} else if (url.search(/(\/((news|sports|entertainment)\/)|(search\/\?q=))|sacbee\.com\/?$/) != -1 && getSearchResults(doc, true)) {

This comment has been minimized.

Copy link
@adam3smith

adam3smith Jul 9, 2017

Collaborator

@owcz -- this here is half the reason we want the getSearchResults function (or something like it): by including it in detectWeb, you make sure you don't get false positives. Even if you're quite confident that you have the site covered as it is now, imagine they change their CMS -- with things as you had them, you could get false positives in all sorts of places, including on article pages where the generic metadata translator might perform OK otherwise.

@adam3smith

This comment has been minimized.

Copy link
Collaborator

commented Jul 9, 2017

This is good to merge, but I'll hold off for a little to see if the discussion in #1277 changes anything.

use test() in detectWeb
test() is most efficient for testing regex, more efficient than multiple indexOf()s: https://jsperf.com/zotero/1
@@ -40,9 +40,9 @@
function attr(doc,selector,attr,index){if(index>0){var elem=doc.querySelectorAll(selector).item(index);return elem?elem.getAttribute(attr):null}var elem=doc.querySelector(selector);return elem?elem.getAttribute(attr):null}function text(doc,selector,index){if(index>0){var elem=doc.querySelectorAll(selector).item(index);return elem?elem.textContent:null}var elem=doc.querySelector(selector);return elem?elem.textContent:null}

function detectWeb(doc, url) {
if (url.search(/article\d+/) != -1) {
if (/article\d+/.test(url) != false) {

This comment has been minimized.

Copy link
@adam3smith

adam3smith Jul 9, 2017

Collaborator

no need for != false -- that's what if does aready.

@adam3smith adam3smith merged commit ae41654 into zotero:master Jul 15, 2017

1 check passed

continuous-integration/travis-ci/pr The Travis CI build passed
Details
@adam3smith

This comment has been minimized.

Copy link
Collaborator

commented Jul 15, 2017

Cool, thanks!

zuphilip added a commit to zuphilip/translators that referenced this pull request Mar 28, 2018

Add Sacramento Bee (zotero#1363)
* use test() in detectWeb
test() is most efficient for testing regex, more efficient than multiple indexOf()s: https://jsperf.com/zotero/1

zuphilip added a commit to zuphilip/translators that referenced this pull request Mar 28, 2018

Add Sacramento Bee (zotero#1363)
* use test() in detectWeb
test() is most efficient for testing regex, more efficient than multiple indexOf()s: https://jsperf.com/zotero/1

@owcz owcz deleted the owcz:sacbee branch Jul 8, 2018

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
3 participants
You can’t perform that action at this time.