Join GitHub today
GitHub is home to over 31 million developers working together to host and review code, manage projects, and build software together.
Sign upAdd Sacramento Bee #1363
Conversation
dstillman
reviewed
Jul 8, 2017
|
||
// Authors | ||
var authorMetadata = doc.querySelectorAll('.ng_byline_name'); | ||
if (authorMetadata) { |
This comment has been minimized.
This comment has been minimized.
dstillman
Jul 8, 2017
Member
querySelectorAll()
always returns a NodeList
, even if it's empty, so you would want to use authorMetadata.length
here.
item.title = attr(doc,'[property="og:title"]','content'); | ||
item.date = text(doc,'.published-date'); | ||
item.abstractNote = text(doc,'#content-body- p'); | ||
item.tags = attr(doc,'meta[name="keywords"]','content').split(", "); |
This comment has been minimized.
This comment has been minimized.
dstillman
Jul 8, 2017
•
Member
attr()
will return null
if the selector isn't found, which would produce an error, so it'd be safer to assign it to a keywords
variable and then do if (keywords) { item.tags = keywords.split(", "); }
.
adam3smith
requested changes
Jul 9, 2017
Looks good. Some small things. |
|
||
function scrape(doc, url) { | ||
var item = new Zotero.Item("newspaperArticle"); | ||
item.websiteTitle = "The Sacramento Bee"; |
This comment has been minimized.
This comment has been minimized.
adam3smith
Jul 9, 2017
Collaborator
item.publicationTitle
cf. https://aurimasv.github.io/z2csl/typeMap.xml#map-newspaperArticle
function detectWeb(doc, url) { | ||
if (url.match(/article\d+/)) { | ||
return "newspaperArticle"; | ||
} else if (url.match(/\/(news|sports|entertainment)\//)) { |
This comment has been minimized.
This comment has been minimized.
adam3smith
Jul 9, 2017
Collaborator
I'd be more comfortable if you implement a version of a getSearchResults
function (using querySelectorAll
instead of ZU.xpath
, of course). These matches cover a broad set of pages and we really want to avoid false positives. There's a reason @zuphilip puts them in all translators.
Also, use .search(/regex/)!= -1
for efficient tests of strings against regexs
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
This comment has been minimized.
adam3smith
Jul 9, 2017
Collaborator
yes, because it just has to return true/false vs. an actual match. See https://jsperf.com/exec-vs-match-vs-test-vs-search/5 (indexOf
is dramatically more efficient, so you could also use that for this one since you don't really need a regex, just three separate strings)
This comment has been minimized.
This comment has been minimized.
adam3smith
Jul 9, 2017
Collaborator
mozilla actually makes this distinction explicitin in the docs
When you want to know whether a pattern is found and also its index in a string use search() (if you only want to know it exists, use the similar test() method on the RegExp prototype, which returns a boolean); for more information (but slower execution) use match() (similar to the regular expression exec() method).
This comment has been minimized.
This comment has been minimized.
owcz
Jul 9, 2017
•
Author
Contributor
I poked around and it looks like test()
becomes (15%) more efficient than multiple indexOf()
s when comparing for multiple strings
return "multiple"; | ||
} else if (url.indexOf("/search/?q=") != -1) { | ||
return "multiple"; | ||
} else return null; |
This comment has been minimized.
This comment has been minimized.
item.tags = keywords.split(", "); | ||
} | ||
item.attachments.push({ | ||
title: "The Sacramento Bee snapshot", |
This comment has been minimized.
This comment has been minimized.
"creatorType": "author" | ||
} | ||
], | ||
"date": "January 08, 2015 1:09 PM", |
This comment has been minimized.
This comment has been minimized.
adam3smith
Jul 9, 2017
Collaborator
use ZU.strToISO
to avoid importing English dates into non-English locales.
owcz
and others
added some commits
Jul 9, 2017
adam3smith
reviewed
Jul 9, 2017
function detectWeb(doc, url) { | ||
if (url.search(/article\d+/) != -1) { | ||
return "newspaperArticle"; | ||
} else if (url.search(/(\/((news|sports|entertainment)\/)|(search\/\?q=))|sacbee\.com\/?$/) != -1 && getSearchResults(doc, true)) { |
This comment has been minimized.
This comment has been minimized.
adam3smith
Jul 9, 2017
Collaborator
@owcz -- this here is half the reason we want the getSearchResults
function (or something like it): by including it in detectWeb
, you make sure you don't get false positives. Even if you're quite confident that you have the site covered as it is now, imagine they change their CMS -- with things as you had them, you could get false positives in all sorts of places, including on article pages where the generic metadata translator might perform OK otherwise.
This comment has been minimized.
This comment has been minimized.
This is good to merge, but I'll hold off for a little to see if the discussion in #1277 changes anything. |
adam3smith
reviewed
Jul 9, 2017
@@ -40,9 +40,9 @@ | |||
function attr(doc,selector,attr,index){if(index>0){var elem=doc.querySelectorAll(selector).item(index);return elem?elem.getAttribute(attr):null}var elem=doc.querySelector(selector);return elem?elem.getAttribute(attr):null}function text(doc,selector,index){if(index>0){var elem=doc.querySelectorAll(selector).item(index);return elem?elem.textContent:null}var elem=doc.querySelector(selector);return elem?elem.textContent:null} | |||
|
|||
function detectWeb(doc, url) { | |||
if (url.search(/article\d+/) != -1) { | |||
if (/article\d+/.test(url) != false) { |
This comment has been minimized.
This comment has been minimized.
adam3smith
merged commit ae41654
into
zotero:master
Jul 15, 2017
1 check passed
This comment has been minimized.
This comment has been minimized.
Cool, thanks! |
owcz commentedJul 8, 2017
tests text()/attr() polyfill from #1277