New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Handle old pages of BBC #1371

Merged
merged 2 commits into from Jul 22, 2017

Conversation

Projects
None yet
3 participants
@sonali0901
Contributor

sonali0901 commented Jul 16, 2017

Fixes #1364
@mvolz @owcz I need some ideas for the title of older pages. The title provided by metadata is not accurate and if I extract it through ZU.xpathText(doc, '//meta[@name="Headline"]/@content') then it would affect results of other pages. Any workaround? Refer to last test case to see the issue.

@zuphilip

This comment has been minimized.

Show comment
Hide comment
@zuphilip

zuphilip Jul 16, 2017

Collaborator

AFAIS you can distinguish these two cases quite easily by analyzing the url. Thus, I suggest to use some conditional code, i.e. something like

if (url.substr(-4)==".stm") {
   //only for old pages of BBC
   item.title = ZU.xpathText(doc, '//meta[@name="Headline"]/@content');
   item.section = ZU.xpathText(doc, '//meta[@name="Section"]/@content');
}
Collaborator

zuphilip commented Jul 16, 2017

AFAIS you can distinguish these two cases quite easily by analyzing the url. Thus, I suggest to use some conditional code, i.e. something like

if (url.substr(-4)==".stm") {
   //only for old pages of BBC
   item.title = ZU.xpathText(doc, '//meta[@name="Headline"]/@content');
   item.section = ZU.xpathText(doc, '//meta[@name="Section"]/@content');
}
@adam3smith

This comment has been minimized.

Show comment
Hide comment
@adam3smith

adam3smith Jul 16, 2017

Collaborator

What @zuphilip says, though looking at both this and the detectWeb, I think we don't want to require the .stm is at the end of the URL. E.g. some link shorteners add something like ?utm_campaing=mycampaignname to the end of URLs and there's really no reason we should have detectWeb (and then this) break in those cases. I think the safest would be to (again both here and in detect) clean the URL by doing url.replace(/[\?#].+/, "")

Collaborator

adam3smith commented Jul 16, 2017

What @zuphilip says, though looking at both this and the detectWeb, I think we don't want to require the .stm is at the end of the URL. E.g. some link shorteners add something like ?utm_campaing=mycampaignname to the end of URLs and there's really no reason we should have detectWeb (and then this) break in those cases. I think the safest would be to (again both here and in detect) clean the URL by doing url.replace(/[\?#].+/, "")

@sonali0901 sonali0901 changed the title from WIP : Handle old pages of BBC to Handle old pages of BBC Jul 18, 2017

@adam3smith adam3smith merged commit 701f8c5 into zotero:master Jul 22, 2017

1 check passed

continuous-integration/travis-ci/pr The Travis CI build passed
Details

@sonali0901 sonali0901 deleted the sonali0901:BBC branch Jul 23, 2017

zuphilip added a commit to zuphilip/translators that referenced this pull request Mar 28, 2018

zuphilip added a commit to zuphilip/translators that referenced this pull request Mar 28, 2018

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment