Join GitHub today
GitHub is home to over 28 million developers working together to host and review code, manage projects, and build software together.
Sign upAdd translator for dn.se #1691
Conversation
adam3smith
requested changes
Jul 7, 2018
Thanks -- I commented very thoroughly on this -- it's really in quite good shape already. Please have a look & see if you have any questions. Please also make respective changes (e.g. on target) to your other current pull requests.
Dagens Nyheter.js
"translatorID": "192a29de-3d6e-4850-984f-943764126429", | ||
"label": "Dagens Nyheter", | ||
"creator": "Sebastian Berlin", | ||
"target": "https://www.dn.se/(nyheter|sok)/", |
This comment has been minimized.
Show comment
Hide comment
This comment has been minimized.
adam3smith
Jul 7, 2018
Collaborator
The Target is a regular expression, i.e. special characters should be escaped and for efficiency you should mark the string beginning, i.e. here:
`^https?://www.dn.se/(nyheter|sok)/' (this is when you do it in Scaffold. If you write straight to the translator, you need double backslashes -- www\.dn\.se)
adam3smith
Jul 7, 2018
Collaborator
The Target is a regular expression, i.e. special characters should be escaped and for efficiency you should mark the string beginning, i.e. here:
`^https?://www.dn.se/(nyheter|sok)/' (this is when you do it in Scaffold. If you write straight to the translator, you need double backslashes -- www\.dn\.se)
This comment has been minimized.
Show comment
Hide comment
This comment has been minimized.
adam3smith
Jul 7, 2018
Collaborator
Also, this looks like it might be too restrictive? e.g.
https://www.dn.se/sport/fotboll/sverige-ute-ur-fotbolls-vm-forlust-mot-england-i-kvarten/
https://www.dn.se/ekonomi/ostkriget-skrammer-skanska-bonder/
adam3smith
Jul 7, 2018
Collaborator
Also, this looks like it might be too restrictive? e.g.
https://www.dn.se/sport/fotboll/sverige-ute-ur-fotbolls-vm-forlust-mot-england-i-kvarten/
https://www.dn.se/ekonomi/ostkriget-skrammer-skanska-bonder/
This comment has been minimized.
Show comment
Hide comment
This comment has been minimized.
sebastian-berlin-wmse
Jul 19, 2018
Contributor
True. Changed the regex to cover the other sections too.
sebastian-berlin-wmse
Jul 19, 2018
Contributor
True. Changed the regex to cover the other sections too.
Dagens Nyheter.js
function detectWeb(doc, url) { | ||
if (url.includes('/nyheter/')) { |
This comment has been minimized.
Show comment
Hide comment
This comment has been minimized.
adam3smith
Jul 7, 2018
Collaborator
If I'm right about the regex above, this is too restrictive as well.
adam3smith
Jul 7, 2018
Collaborator
If I'm right about the regex above, this is too restrictive as well.
Dagens Nyheter.js
var author = ZU.cleanAuthor(nameString, "author"); | ||
if(author.firstName === "") { | ||
// If there's only one name, the author is not a person, | ||
// e.g. "TT". |
This comment has been minimized.
Show comment
Hide comment
This comment has been minimized.
adam3smith
Jul 7, 2018
Collaborator
Is this actually helpful information? I'm happy to go either way if you have a strong opinion, but would we really want a citation to start with "TT" -- my assumption is that mostly not. So maybe just remove these sorts of authors entirely?
adam3smith
Jul 7, 2018
Collaborator
Is this actually helpful information? I'm happy to go either way if you have a strong opinion, but would we really want a citation to start with "TT" -- my assumption is that mostly not. So maybe just remove these sorts of authors entirely?
This comment has been minimized.
Show comment
Hide comment
This comment has been minimized.
sebastian-berlin-wmse
Jul 19, 2018
Contributor
Since the paper itself cites TT as a source, I'd say it's relevant. I don't know if Zotero has some other criteria as to what counts as an author, though. If you want to know more about TT: https://en.wikipedia.org/wiki/Tidningarnas_Telegrambyr%C3%A5 and their own "about us": https://tt.se/en/about-us/.
sebastian-berlin-wmse
Jul 19, 2018
Contributor
Since the paper itself cites TT as a source, I'd say it's relevant. I don't know if Zotero has some other criteria as to what counts as an author, though. If you want to know more about TT: https://en.wikipedia.org/wiki/Tidningarnas_Telegrambyr%C3%A5 and their own "about us": https://tt.se/en/about-us/.
Dagens Nyheter.js
author.firstName = undefined; | ||
author.fieldMode = true; | ||
} | ||
item.creators = [author]; |
This comment has been minimized.
Show comment
Hide comment
This comment has been minimized.
This comment has been minimized.
Show comment
Hide comment
This comment has been minimized.
sebastian-berlin-wmse
Jul 19, 2018
Contributor
Actually, yes. When I added tests for more articles, I found some that did. Added logic to handle that.
sebastian-berlin-wmse
Jul 19, 2018
Contributor
Actually, yes. When I added tests for more articles, I found some that did. Added logic to handle that.
Dagens Nyheter.js
var abstractString = | ||
attr(doc, 'div[class="js-article"]', "data-article-description"); | ||
item.abstractNote = abstractString.replace(" ", " "); |
This comment has been minimized.
Show comment
Hide comment
This comment has been minimized.
adam3smith
Jul 7, 2018
Collaborator
This only replaces the first nbsp -- is that on purpose? If you want to replace all, you need to use a regex with the g flag
adam3smith
Jul 7, 2018
Collaborator
This only replaces the first nbsp -- is that on purpose? If you want to replace all, you need to use a regex with the g flag
This comment has been minimized.
Show comment
Hide comment
This comment has been minimized.
sebastian-berlin-wmse
Jul 19, 2018
Contributor
Huh, I didn't know that replace()
was implemented differently depending on if you used string or regex. No, that wasn't on purpose; changed this.
sebastian-berlin-wmse
Jul 19, 2018
Contributor
Huh, I didn't know that replace()
was implemented differently depending on if you used string or regex. No, that wasn't on purpose; changed this.
Dagens Nyheter.js
var abstractString = | ||
attr(doc, 'div[class="js-article"]', "data-article-description"); | ||
item.abstractNote = abstractString.replace(" ", " "); |
This comment has been minimized.
Show comment
Hide comment
This comment has been minimized.
adam3smith
Jul 7, 2018
Collaborator
Also, both here and for section below, please test for the scraped string before processing it, since a missing string will otherwise cause an error, i.e. simply if (abstractString) item.abstractNote .= ..
adam3smith
Jul 7, 2018
Collaborator
Also, both here and for section below, please test for the scraped string before processing it, since a missing string will otherwise cause an error, i.e. simply if (abstractString) item.abstractNote .= ..
// Embedded Metadata | ||
translator.setTranslator('951c027d-74ac-47d4-a107-9c3069ab7b48'); | ||
translator.setHandler('itemDone', function (obj, item) { |
This comment has been minimized.
Show comment
Hide comment
This comment has been minimized.
adam3smith
Jul 7, 2018
Collaborator
it doesn't look like you're using the EM translator results here at all, are you? In that case you might as well just create a new item and scrape, i.e. var item = new Zotero.Item("newspaperArticle")
. While in general we push towards using EM where it helps, this isn't a blanket rule
adam3smith
Jul 7, 2018
Collaborator
it doesn't look like you're using the EM translator results here at all, are you? In that case you might as well just create a new item and scrape, i.e. var item = new Zotero.Item("newspaperArticle")
. While in general we push towards using EM where it helps, this isn't a blanket rule
This comment has been minimized.
Show comment
Hide comment
This comment has been minimized.
sebastian-berlin-wmse
Jul 19, 2018
Contributor
It seems that at least item.title
is set by that translator. I get this if remove it:
Error: No title specified for item
sebastian-berlin-wmse
Jul 19, 2018
Contributor
It seems that at least item.title
is set by that translator. I get this if remove it:
Error: No title specified for item
Dagens Nyheter.js
var timeString = | ||
attr(doc, 'meta[property="article:published_time"]', "content"); | ||
item.date = timeString.split("T")[0]; | ||
This comment has been minimized.
Show comment
Hide comment
This comment has been minimized.
adam3smith
Jul 7, 2018
Collaborator
Let's add the ISSN for the publication as item.ISSN
please. I assume that's going to be hardcoded.
adam3smith
Jul 7, 2018
Collaborator
Let's add the ISSN for the publication as item.ISSN
please. I assume that's going to be hardcoded.
zuphilip
added
the
New Translator
label
Jul 10, 2018
sebastian-berlin-wmse
and others
added some commits
Jul 6, 2018
This comment has been minimized.
Show comment
Hide comment
This comment has been minimized.
adam3smith
Jul 29, 2018
Collaborator
Thanks, looks good. For future reference, please don't squash commits after updating a PR -- it's much easier if I can see what you changed from the PR I already reviewed.
Thanks, looks good. For future reference, please don't squash commits after updating a PR -- it's much easier if I can see what you changed from the PR I already reviewed. |
adam3smith
merged commit ea743e2
into
zotero:master
Jul 29, 2018
1 check passed
This comment has been minimized.
Show comment
Hide comment
This comment has been minimized.
sebastian-berlin-wmse
Aug 13, 2018
Contributor
Thank you and sorry about the mess. I don't usually work with Github, so I didn't quite remember how you're supposed to do it and when I realized I'd messed up, it was already committed. I'll do it properly next time.
Thank you and sorry about the mess. I don't usually work with Github, so I didn't quite remember how you're supposed to do it and when I realized I'd messed up, it was already committed. I'll do it properly next time. |
sebastian-berlin-wmse commentedJul 6, 2018
This is the website for the Swedish newspaper Dagens Nyheter.