New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add translator for dn.se #1691

Merged
merged 2 commits into from Jul 29, 2018

Conversation

Projects
None yet
3 participants
@sebastian-berlin-wmse
Contributor

sebastian-berlin-wmse commented Jul 6, 2018

This is the website for the Swedish newspaper Dagens Nyheter.

@adam3smith

Thanks -- I commented very thoroughly on this -- it's really in quite good shape already. Please have a look & see if you have any questions. Please also make respective changes (e.g. on target) to your other current pull requests.

Show outdated Hide outdated Dagens Nyheter.js
"translatorID": "192a29de-3d6e-4850-984f-943764126429",
"label": "Dagens Nyheter",
"creator": "Sebastian Berlin",
"target": "https://www.dn.se/(nyheter|sok)/",

This comment has been minimized.

@adam3smith

adam3smith Jul 7, 2018

Collaborator

The Target is a regular expression, i.e. special characters should be escaped and for efficiency you should mark the string beginning, i.e. here:
`^https?://www.dn.se/(nyheter|sok)/' (this is when you do it in Scaffold. If you write straight to the translator, you need double backslashes -- www\.dn\.se)

@adam3smith

adam3smith Jul 7, 2018

Collaborator

The Target is a regular expression, i.e. special characters should be escaped and for efficiency you should mark the string beginning, i.e. here:
`^https?://www.dn.se/(nyheter|sok)/' (this is when you do it in Scaffold. If you write straight to the translator, you need double backslashes -- www\.dn\.se)

This comment has been minimized.

@sebastian-berlin-wmse

sebastian-berlin-wmse Jul 19, 2018

Contributor

True. Changed the regex to cover the other sections too.

@sebastian-berlin-wmse

sebastian-berlin-wmse Jul 19, 2018

Contributor

True. Changed the regex to cover the other sections too.

Show outdated Hide outdated Dagens Nyheter.js
function detectWeb(doc, url) {
if (url.includes('/nyheter/')) {

This comment has been minimized.

@adam3smith

adam3smith Jul 7, 2018

Collaborator

If I'm right about the regex above, this is too restrictive as well.

@adam3smith

adam3smith Jul 7, 2018

Collaborator

If I'm right about the regex above, this is too restrictive as well.

Show outdated Hide outdated Dagens Nyheter.js
var author = ZU.cleanAuthor(nameString, "author");
if(author.firstName === "") {
// If there's only one name, the author is not a person,
// e.g. "TT".

This comment has been minimized.

@adam3smith

adam3smith Jul 7, 2018

Collaborator

Is this actually helpful information? I'm happy to go either way if you have a strong opinion, but would we really want a citation to start with "TT" -- my assumption is that mostly not. So maybe just remove these sorts of authors entirely?

@adam3smith

adam3smith Jul 7, 2018

Collaborator

Is this actually helpful information? I'm happy to go either way if you have a strong opinion, but would we really want a citation to start with "TT" -- my assumption is that mostly not. So maybe just remove these sorts of authors entirely?

This comment has been minimized.

@sebastian-berlin-wmse

sebastian-berlin-wmse Jul 19, 2018

Contributor

Since the paper itself cites TT as a source, I'd say it's relevant. I don't know if Zotero has some other criteria as to what counts as an author, though. If you want to know more about TT: https://en.wikipedia.org/wiki/Tidningarnas_Telegrambyr%C3%A5 and their own "about us": https://tt.se/en/about-us/.

@sebastian-berlin-wmse

sebastian-berlin-wmse Jul 19, 2018

Contributor

Since the paper itself cites TT as a source, I'd say it's relevant. I don't know if Zotero has some other criteria as to what counts as an author, though. If you want to know more about TT: https://en.wikipedia.org/wiki/Tidningarnas_Telegrambyr%C3%A5 and their own "about us": https://tt.se/en/about-us/.

Show outdated Hide outdated Dagens Nyheter.js
author.firstName = undefined;
author.fieldMode = true;
}
item.creators = [author];

This comment has been minimized.

@adam3smith

adam3smith Jul 7, 2018

Collaborator

Are there really no articles with multiple authors?

@adam3smith

adam3smith Jul 7, 2018

Collaborator

Are there really no articles with multiple authors?

This comment has been minimized.

@sebastian-berlin-wmse

sebastian-berlin-wmse Jul 19, 2018

Contributor

Actually, yes. When I added tests for more articles, I found some that did. Added logic to handle that.

@sebastian-berlin-wmse

sebastian-berlin-wmse Jul 19, 2018

Contributor

Actually, yes. When I added tests for more articles, I found some that did. Added logic to handle that.

Show outdated Hide outdated Dagens Nyheter.js
var abstractString =
attr(doc, 'div[class="js-article"]', "data-article-description");
item.abstractNote = abstractString.replace(" ", " ");

This comment has been minimized.

@adam3smith

adam3smith Jul 7, 2018

Collaborator

This only replaces the first nbsp -- is that on purpose? If you want to replace all, you need to use a regex with the g flag

@adam3smith

adam3smith Jul 7, 2018

Collaborator

This only replaces the first nbsp -- is that on purpose? If you want to replace all, you need to use a regex with the g flag

This comment has been minimized.

@sebastian-berlin-wmse

sebastian-berlin-wmse Jul 19, 2018

Contributor

Huh, I didn't know that replace() was implemented differently depending on if you used string or regex. No, that wasn't on purpose; changed this.

@sebastian-berlin-wmse

sebastian-berlin-wmse Jul 19, 2018

Contributor

Huh, I didn't know that replace() was implemented differently depending on if you used string or regex. No, that wasn't on purpose; changed this.

Show outdated Hide outdated Dagens Nyheter.js
var abstractString =
attr(doc, 'div[class="js-article"]', "data-article-description");
item.abstractNote = abstractString.replace(" ", " ");

This comment has been minimized.

@adam3smith

adam3smith Jul 7, 2018

Collaborator

Also, both here and for section below, please test for the scraped string before processing it, since a missing string will otherwise cause an error, i.e. simply if (abstractString) item.abstractNote .= ..

@adam3smith

adam3smith Jul 7, 2018

Collaborator

Also, both here and for section below, please test for the scraped string before processing it, since a missing string will otherwise cause an error, i.e. simply if (abstractString) item.abstractNote .= ..

// Embedded Metadata
translator.setTranslator('951c027d-74ac-47d4-a107-9c3069ab7b48');
translator.setHandler('itemDone', function (obj, item) {

This comment has been minimized.

@adam3smith

adam3smith Jul 7, 2018

Collaborator

it doesn't look like you're using the EM translator results here at all, are you? In that case you might as well just create a new item and scrape, i.e. var item = new Zotero.Item("newspaperArticle"). While in general we push towards using EM where it helps, this isn't a blanket rule

@adam3smith

adam3smith Jul 7, 2018

Collaborator

it doesn't look like you're using the EM translator results here at all, are you? In that case you might as well just create a new item and scrape, i.e. var item = new Zotero.Item("newspaperArticle"). While in general we push towards using EM where it helps, this isn't a blanket rule

This comment has been minimized.

@sebastian-berlin-wmse

sebastian-berlin-wmse Jul 19, 2018

Contributor

It seems that at least item.title is set by that translator. I get this if remove it:

Error: No title specified for item
@sebastian-berlin-wmse

sebastian-berlin-wmse Jul 19, 2018

Contributor

It seems that at least item.title is set by that translator. I get this if remove it:

Error: No title specified for item
Show outdated Hide outdated Dagens Nyheter.js
var timeString =
attr(doc, 'meta[property="article:published_time"]', "content");
item.date = timeString.split("T")[0];

This comment has been minimized.

@adam3smith

adam3smith Jul 7, 2018

Collaborator

Let's add the ISSN for the publication as item.ISSN please. I assume that's going to be hardcoded.

@adam3smith

adam3smith Jul 7, 2018

Collaborator

Let's add the ISSN for the publication as item.ISSN please. I assume that's going to be hardcoded.

sebastian-berlin-wmse and others added some commits Jul 6, 2018

Add translator for dn.se
This is the website for the Swedish newspaper Dagens Nyheter.
@adam3smith

This comment has been minimized.

Show comment
Hide comment
@adam3smith

adam3smith Jul 29, 2018

Collaborator

Thanks, looks good. For future reference, please don't squash commits after updating a PR -- it's much easier if I can see what you changed from the PR I already reviewed.

Collaborator

adam3smith commented Jul 29, 2018

Thanks, looks good. For future reference, please don't squash commits after updating a PR -- it's much easier if I can see what you changed from the PR I already reviewed.

@adam3smith adam3smith merged commit ea743e2 into zotero:master Jul 29, 2018

1 check passed

continuous-integration/travis-ci/pr The Travis CI build passed
Details
@sebastian-berlin-wmse

This comment has been minimized.

Show comment
Hide comment
@sebastian-berlin-wmse

sebastian-berlin-wmse Aug 13, 2018

Contributor

Thank you and sorry about the mess. I don't usually work with Github, so I didn't quite remember how you're supposed to do it and when I realized I'd messed up, it was already committed. I'll do it properly next time.

Contributor

sebastian-berlin-wmse commented Aug 13, 2018

Thank you and sorry about the mess. I don't usually work with Github, so I didn't quite remember how you're supposed to do it and when I realized I'd messed up, it was already committed. I'll do it properly next time.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment