Create Vice.js #1366
Conversation
owcz reviewed Jul 10, 2017
}
}

// archives appear to 404 on load?
owcz (Author, Contributor) Jul 10, 2017
Snapshots on every page I checked save successfully, but when checked later, they all 404. Not sure if there's a workaround or if it would be better to leave the snapshots out altogether.
zuphilip (Collaborator) Jul 16, 2017
Saving this website also fails when I try it in the browser. They use a lot of JavaScript and some iframes, which doesn't seem friendly for saving purposes. Moreover, I haven't found a link to a "print-friendly version", which sometimes exists. I suggest skipping snapshots here.
I guess the problems you are describing are connected to dynamic loading of the website, or to the site being built out of JSON together with JavaScript calls... When coming from a multiple to a single article, I see that some JSON data is loaded. Actually, it looks promising to use such API calls directly and work with that JSON data rather than scraping the page. In some of the examples it is enough to add …
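That approach would look roughly like this in a translator -- a minimal sketch, assuming the ?json=true parameter that this PR ends up using, with made-up field names for the response (Vice's actual JSON schema isn't shown in this thread):

function scrape(doc, url) {
	// Ask the site for the article as JSON instead of scraping the DOM.
	// The field names below are illustrative assumptions, not Vice's
	// confirmed schema.
	ZU.doGet(url + '?json=true', function (text) {
		var data = JSON.parse(text);
		var item = new Zotero.Item("blogPost");
		item.title = data.title;        // assumed field
		item.date = data.publish_date;  // assumed field
		item.url = url;
		item.complete();
	});
}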
The news site on Vice uses the WordPress API, e.g. https://news.vice.com/wp-json/wp/v2/posts/24702
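Since that is the stock WordPress REST API, reading it is straightforward -- a sketch assuming news.vice.com keeps the default wp/v2 post schema:

// Fetch one post from the WordPress REST API and map the default
// wp/v2 fields (title.rendered, date, link).
ZU.doGet('https://news.vice.com/wp-json/wp/v2/posts/24702', function (text) {
	var post = JSON.parse(text);
	var item = new Zotero.Item("blogPost");
	item.title = ZU.unescapeHTML(post.title.rendered); // titles arrive HTML-encoded
	item.date = post.date; // ISO 8601 publication date
	item.url = post.link;
	item.complete();
});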
Is this ready for review or are you still tweaking this?
Still tweaking, thanks
zuphilip added the New Translator label Nov 12, 2017
zuphilip added the waiting label Jan 1, 2018
okay, ready for review
zuphilip removed the waiting label Jul 10, 2018
adam3smith requested changes Jul 26, 2018
A couple of suggestions & questions. Should be quick.
"translatorID": "131310dc-854c-4629-acad-521319ab9f19", | ||
"label": "Vice", | ||
"creator": "czar", | ||
"target": "^https?://((www|broadly|creators|i-d|amuse-i-d|impact|motherboard|munchies|news|noisey|sports|thump|tonic|waypoint)\\.)?vice\\.com", |
adam3smith (Collaborator) Jul 26, 2018
I'd be OK with the simpler ^https://(.+?\\.)vice\\.com unless you need to specifically exclude certain subdomains.
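For what it's worth, a quick check of what the suggested pattern matches as written (https only, subdomain required):

var simple = /^https:\/\/(.+?\.)vice\.com/;
Zotero.debug(simple.test("https://motherboard.vice.com/en_us")); // true
Zotero.debug(simple.test("https://vice.com"));                   // false: the subdomain group is mandatory
Zotero.debug(simple.test("http://www.vice.com"));                // false: no '?' after https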
function detectWeb(doc, url) {
	if (/\/(article|story)\//.test(url)) {
		return "blogPost";
adam3smith (Collaborator) Jul 26, 2018
I would have thought of Vice as an online magazine, not as a series of blog posts -- not going to insist, but consider switching this. To wit: Wikipedia has Vice as a magazine and it appears to have an ISSN.
owcz (Author, Contributor) Jul 28, 2018
I could see potentially adding it for the base vice.com, even though I don't know its overlap with the print circulation. But I'd consider all of its many subdomains to be more blog- than magazine-like in nature.
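One way to act on that compromise, sketched against the detectWeb above (the hostname test for the flagship site is illustrative, not what this PR merged):

function detectWeb(doc, url) {
	if (/\/(article|story)\//.test(url)) {
		// Flagship vice.com as a magazine article, subdomains as blog posts.
		if (/^https?:\/\/(www\.)?vice\.com\//.test(url)) {
			return "magazineArticle";
		}
		return "blogPost";
	}
	return false; // multiple detection as in the PR would go here
}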
function detectWeb(doc, url) {
	if (/\/(article|story)\//.test(url)) {
		return "blogPost";
	} else if (/\.vice\.com\/?($|\w\w(\_\w\w)?\/?$)|\/(search\?q=)|topic\/|category\/|(latest|read)($|\?page=)/.test(url) && getSearchResults(doc, true)) {
adam3smith (Collaborator) Jul 26, 2018
I don't think you want that beginning \. before vice.com, do you? (doesn't match https://vice.com)
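A quick check of that first alternative confirms the point:

var re = /\.vice\.com\/?($|\w\w(\_\w\w)?\/?$)/;
Zotero.debug(re.test("https://www.vice.com/en_us")); // true
Zotero.debug(re.test("https://vice.com/en_us"));     // false: the leading \. requires a subdomain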
owcz (Author, Contributor) Jul 28, 2018
My understanding is that all vice.com addresses redirect to www.vice.com, but I suppose the translator's target detection would have caught any bad addresses by this point.
function scrape(doc, url) {
	var jsonURL = url + '?json=true';
adam3smith (Collaborator) Jul 26, 2018
You'll almost certainly want to clean the URL by doing url.replace(/#.+/, "") -- otherwise internal page navigation as well as social media origin info will get included in the jsonURL and will likely break it.
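For illustration (the article path here is made up):

var url = "https://www.vice.com/en_us/article/abc123/some-title#from-social";
var jsonURL = url.replace(/#.+/, "") + "?json=true";
// jsonURL -> "https://www.vice.com/en_us/article/abc123/some-title?json=true"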
owcz (Author, Contributor) Jul 28, 2018
https://motherboard.vice.com/en_us/article/43pngb/how-to-make-your-own-medicine-four-thieves-vinegar-collective?utm_campaign=sharebutton appears to work fine and I haven't seen # parameters used, but will add anyway.
adam3smith merged commit 114abf4 into zotero:master Aug 4, 2018
1 check passed
Thanks!
owcz commented Jul 10, 2017 (edited)
Handles every text-focused content site in the network. Some subsites (mainly i-D, Amuse i-D, and Vice News) vary significantly from the others, so some selectors are uglier than I'd prefer in order to accommodate them. The rest more or less follow the same format. While i-D uses OpenGraph and might have benefited from using EM (Embedded Metadata), the rest needed custom coding, so I thought it was best to build i-D support into a single Vice translator.
Need help with two issues:
Every snapshot I verified leads to a 404 page. Is there a workaround or should I just remove the snapshots altogether?
Some multis from the network sites that follow the same format have a weird issue:
https://broadly.vice.com/en_us
https://broadly.vice.com/en_us/topic/occult
https://waypoint.vice.com/en_us
https://waypoint.vice.com/en_us/latest
https://thump.vice.com/en_us/search?q=venetian
For instance, try any one of the above (all multis) and the saved articles will lack both author and tags, even though both save properly if you run the scraper on the individual page (without the multi). There's something about the multi that leaves those two fields out.
Scaffold is strange too. Add any one of those sites' individual articles to the test, click "save", and it scrapes fine, as it normally does in the browser. But then "run" any one of those same individual articles and it'll ignore the site's authors and tags (same as above).
I tried to isolate the issue by removing everything that wasn't necessary for, say, broadly.vice.com's multi, but the scraper still returned null for each page's author/tags (even though the scraper returned the right results if run on the same pages individually). So I'm not sure if it's an issue with the multi code or something with text()/attr() (#1277).
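For reference, the standard multiple-handling pattern in Zotero translators re-fetches each selected article before scraping; if the multi path instead scrapes the listing document, selectors that only exist on article pages (authors, tags) would come back empty. That is only a guess at the cause, but it is the first thing worth checking:

function doWeb(doc, url) {
	if (detectWeb(doc, url) == "multiple") {
		Zotero.selectItems(getSearchResults(doc, false), function (items) {
			if (!items) return;
			// Re-fetch each selected article so scrape() runs against the
			// full article DOM rather than the listing page.
			ZU.processDocuments(Object.keys(items), scrape);
		});
	}
	else {
		scrape(doc, url);
	}
}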