Handle old pages of BBC #1371
zuphilip
Jul 16, 2017
Collaborator
AFAIS you can distinguish these two cases quite easily by analyzing the URL. Thus, I suggest using some conditional code, e.g. something like
if (url.slice(-4) === ".stm") {
	// only for old pages of BBC
	item.title = ZU.xpathText(doc, '//meta[@name="Headline"]/@content');
	item.section = ZU.xpathText(doc, '//meta[@name="Section"]/@content');
}
adam3smith
Jul 16, 2017
Collaborator
What @zuphilip says, though looking at both this and detectWeb, I think we don't want to require that .stm be at the very end of the URL. E.g. some link shorteners add something like ?utm_campaign=mycampaignname to the end of URLs, and there's really no reason we should have detectWeb (and then this) break in those cases. I think the safest would be to (again both here and in detectWeb) clean the URL by doing url.replace(/[\?#].+/, "")
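To make the suggestion concrete, here is a minimal sketch of how the URL cleaning and the .stm check could be combined. The helper names cleanUrl and isOldBbcPage and the sample URLs are hypothetical, chosen only for illustration; they are not part of the translator. The regex is a slight variation on the one above (.* instead of .+), so that a bare trailing ? or # is stripped as well.

```javascript
// Hypothetical helpers for illustration only; not the translator's actual code.

// Remove the query string and fragment, i.e. everything from the
// first "?" or "#" to the end of the URL.
function cleanUrl(url) {
	return url.replace(/[?#].*$/, "");
}

// Treat a page as an "old" BBC page if the cleaned URL ends in ".stm".
function isOldBbcPage(url) {
	return cleanUrl(url).endsWith(".stm");
}

// Made-up sample URLs, used only to exercise the helpers.
console.log(isOldBbcPage("http://news.bbc.co.uk/2/hi/example/1234567.stm")); // true
console.log(isOldBbcPage("http://news.bbc.co.uk/2/hi/example/1234567.stm?utm_campaign=mycampaignname")); // true
console.log(isOldBbcPage("http://www.bbc.com/news/example-12345678#section")); // false
```

The same check could then be shared by detectWeb and the scraping code, so that both agree on what counts as an old page.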
sonali0901 commented Jul 16, 2017
Fixes #1364
@mvolz @owcz I need some ideas for the title of older pages. The title provided by metadata is not accurate and if I extract it through
ZU.xpathText(doc, '//meta[@name="Headline"]/@content')
then it would affect the results for other pages. Any workaround? Refer to the last test case to see the issue.