Investigate what would be needed to include crawl-sites visualization #146

Open
ruebot opened this issue Jul 3, 2018 · 6 comments

@ruebot (Member) commented Jul 3, 2018

See what is needed to add the crawl-sites visualization.

  • We'd probably need to convert process.py to a helper method
  • We'd probably need to convert all the js here to a standalone js file, like we do with graph.js
  • We need to do some timing tests on large vs. smaller collections (does it scale?)

ruebot self-assigned this Jul 3, 2018

@ianmilligan1 (Member) commented Jul 3, 2018

Here's an example of the output for others following along: http://lintool.github.io/warcbase/vis/crawl-sites/.

I ran this on all the WALK collections, FWIW, and, if I remember correctly, was able to do the full thing in a few minutes on a laptop. Here's one of our 4-5 TB ones: https://web-archive-group.github.io/WALK-CrawlVis/crawl-sites/ALBERTA_government_information_all_urls.html.

@ianmilligan1 (Member) commented Jul 4, 2018

FYI, I dug back into our past workflow, and I'm glad I did, because it's a bit janky.

Here's the latest workflow I was using to do this.

https://github.com/web-archive-group/WALK-CrawlVis/blob/master/WORKFLOW.md

Note that the major problem is that the output from the domain count differs from what process.py expects, mostly because the crawl-viz dates from when we still used Pig! process.py should probably change to process the new format rather than me escaping random stuff with sed.

@ruebot (Member, Author) commented Feb 6, 2019

If we add an additional Spark sub-job:

/home/nruest/bin/spark-2.3.2-bin-hadoop2.7/bin/spark-shell --master local[2] --driver-memory 6G --conf spark.network.timeout=10000000 --conf spark.executor.heartbeatInterval=600s --conf spark.driver.maxResultSize=4G --packages "io.archivesunleashed:aut:0.17.0"


import io.archivesunleashed._
import io.archivesunleashed.matchbox._

// Count (crawl month, domain) pairs across all valid pages in the collection.
val r = RecordLoader.loadArchives("/home/nruest/Projects/tmp/4811/warcs/*.gz", sc)
  .keepValidPages()
  .map(r => (r.getCrawlMonth, ExtractDomain(r.getUrl)))
  .countItems()
  .saveAsTextFile("/home/nruest/Projects/tmp/auk-issue-146")

We'll get output like this:

((201411,corcoran.gwu.edu),36852)
((201412,corcoran.gwu.edu),20232)
((201409,www.corcoran.edu),16089)
((201409,www.corcoran.org),15923)
((201512,newsite.corcoran.org),5432)
((201512,corcoran.gwu.edu),2911)
((201410,unveiled.corcoran.org),1058)
((201412,unveiled.corcoran.gwu.edu),589)
((201411,unveiled.corcoran.org),545)
((201411,next2012.corcoran.edu),487)
((201411,next.corcoran.edu),345)
((201409,legacy.corcoran.edu),329)
((201411,next.corcoran.gwu.edu),277)
((201411,savethecorcoran.org),274)
((201411,next2011.corcoran.edu),211)
((201410,accounts.google.com),192)
((201412,www.facebook.com),177)
((201411,www.youtube.com),169)
((201412,www.youtube.com),160)
((201411,next2012.corcoran.gwu.edu),146)
((201411,next2011.corcoran.gwu.edu),96)
((201410,plus.google.com),96)
((201412,unveiled.corcoran.org),93)
((201411,accounts.google.com),91)
((201412,plus.google.com),88)
((201412,accounts.google.com),88)
((201411,next2013.corcoran.edu),83)
((201608,newsite.corcoran.org),81)
((201410,www.youtube.com),80)
((201411,widget.stagram.com),44)
((201411,player.vimeo.com),42)
((201410,player.vimeo.com),36)
((201409,player.vimeo.com),26)
((201512,player.vimeo.com),23)
((201411,pixel.fetchback.com),23)
((201412,player.vimeo.com),21)
((201411,widget.websta.me),17)
((201409,www.youtube.com),14)
((201411,www.corcoran.org),12)
((201411,s.youtube.com),9)
((201410,www.corcoran.org),6)
((201410,w.soundcloud.com),5)
((201410,gen.xyz),5)
((201411,dublincore.org),5)
((201409,next.corcoran.edu),5)
((201411,ogp.me),5)
((201411,www.gwu.edu),5)
((201410,vine.co),4)
((201410,8tracks.com),4)
((201411,r9---sn-nwj7knls.googlevideo.com),4)
((201608,www.googletagmanager.com),4)
((201410,www.wishpond.com),4)
((201512,www.corcoran.org),4)
((201410,www.ustream.tv),4)
((201411,www.corcoran.edu),3)
((201705,s.w.org),3)
((201411,www.google.com),3)
((201411,r6---sn-nwj7kner.googlevideo.com),3)
((201411,purl.org),3)
((201409,cm.g.doubleclick.net),3)
((201411,www.w3.org),3)
((201411,r18---sn-nwj7kned.googlevideo.com),3)
((201411,platform.twitter.com),3)
((201608,f.vimeocdn.com),3)
((201412,0.gravatar.com),3)
((201411,0.gravatar.com),3)
((201705,next.corcoran.gwu.edu),3)
((201411,r17---sn-nwj7knl7.googlevideo.com),2)
((201409,twitter.com),2)
((201512,cm.g.doubleclick.net),2)
((201411,www.wishpond.com),2)
((201411,r12---sn-nwj7knls.googlevideo.com),2)
((201411,www2.gwu.edu),2)
((201512,www.googletagmanager.com),2)
((201410,corcoran.gwu.edu),2)
((201705,wordpress.org),2)
((201409,server.iad.liveperson.net),2)
((201411,www.ustream.tv),2)
((201411,r14---sn-nwj7kned.googlevideo.com),2)
((201410,www.corcoran.edu),2)
((201412,www.wishpond.com),2)
((201411,plus.googleapis.com),2)
((201410,storify.com),2)
((201411,r13---sn-nwj7knls.googlevideo.com),2)
((201411,vine.co),2)
((201411,r6---sn-nwj7knek.googlevideo.com),2)
((201411,r5---sn-nwj7kner.googlevideo.com),2)
((201410,instagram.com),2)
((201412,r9---sn-nwj7knls.googlevideo.com),2)
((201412,www.ustream.tv),2)
((201411,r10---sn-nwj7knls.googlevideo.com),2)
((201412,8tracks.com),2)
((201411,r4---sn-nwj7kned.googlevideo.com),2)
((201411,r20---sn-nwj7knl7.googlevideo.com),2)
((201411,r10---sn-nwj7kned.googlevideo.com),2)
((201411,r9---sn-nwj7kned.googlevideo.com),2)
((201411,r1---sn-nwj7kned.googlevideo.com),2)
((201412,vine.co),2)
((201512,next.corcoran.edu),2)
((201411,r6---sn-nwj7kne6.googlevideo.com),2)
((201411,r13---sn-nwj7kner.googlevideo.com),2)
((201411,8tracks.com),2)
((201409,collection.corcoran.org),2)
((201411,r17---sn-nwj7kner.googlevideo.com),2)
((201409,www.googletagmanager.com),2)
((201411,plus.google.com),2)
((201411,r4---sn-o097znle.googlevideo.com),2)
((201411,r20---sn-nwj7knls.googlevideo.com),2)
((201411,xmlns.com),2)
((201411,r2---sn-nwj7kned.googlevideo.com),2)
((201410,platform.twitter.com),1)
((201411,r14---sn-nwj7knek.googlevideo.com),1)
((201411,storify.com),1)
((201411,youtu.be),1)
((201412,r9---sn-nwj7kned.googlevideo.com),1)
((201705,fonts.googleapis.com),1)
((201512,next.corcoran.gwu.edu),1)
((201512,www.w3.org),1)
((201409,www.liveperson.com),1)
((201412,r6---sn-nwj7knek.googlevideo.com),1)
((201412,r12---sn-nwj7knls.googlevideo.com),1)
((201412,r9---sn-nwj7knek.googlevideo.com),1)
((201411,f.vimeocdn.com),1)
((201512,www.facebook.com),1)
((201411,r17---sn-nwj7kne6.googlevideo.com),1)
((201409,corcoran.edu),1)
((201412,r13---sn-nwj7knls.googlevideo.com),1)
((201409,pixel.fetchback.com),1)
((201411,redirector.googlevideo.com),1)
((201409,www.w3.org),1)
((201411,ct1.addthis.com),1)
((201411,get.adobe.com),1)
((201411,s.ytimg.com),1)
((201412,ogp.me),1)
((201411,r3---sn-nwj7knls.googlevideo.com),1)
((201411,gwc.lphbs.com),1)
((201512,www.gwu.edu),1)
((201412,r10---sn-nwj7kned.googlevideo.com),1)
((201411,apis.google.com),1)
((201412,r1---sn-nwj7kned.googlevideo.com),1)
((201411,r1---sn-nwj7kner.googlevideo.com),1)
((201411,i.ytimg.com),1)
((201411,w.soundcloud.com),1)
((201411,1.gravatar.com),1)
((201608,pixel.admedia.com),1)
((201411,r18---sn-nwj7kner.googlevideo.com),1)
((201411,r5---sn-nwj7knl7.googlevideo.com),1)
((201412,storify.com),1)
((201411,r5---sn-nwj7knls.googlevideo.com),1)
((201411,m.youtube.com),1)
((201412,docs.google.com),1)
((201512,ogp.me),1)
((201705,www.hugo-creative.com),1)
((201412,1.gravatar.com),1)
((201412,r20---sn-nwj7knl7.googlevideo.com),1)
((201411,r15---sn-nwj7knek.googlevideo.com),1)
((201512,www.sheepandwool.org),1)
((201411,r1---sn-nwj7kne6.googlevideo.com),1)
((201409,ce.corcoran.edu),1)
((201411,r9---sn-nwj7knek.googlevideo.com),1)
((201409,www.uhs.uga.edu),1)
((201512,www.rawartists.org),1)
((201411,r20---sn-nwj7kner.googlevideo.com),1)
((201412,instagram.com),1)
((201409,chat.zoho.com),1)
((201512,dublincore.org),1)
((201412,r5---sn-nwj7kner.googlevideo.com),1)
((201411,r5---sn-nwj7knek.googlevideo.com),1)
((201412,r1---sn-nwj7kne6.googlevideo.com),1)
((201412,r13---sn-nwj7kner.googlevideo.com),1)
((201512,docs.google.com),1)
((201411,docs.google.com),1)
((201512,portfolios.corcoran.gwu.edu),1)
((201608,fpdl.vimeocdn.com),1)
((201412,w.soundcloud.com),1)
((201411,instagram.com),1)
((201512,sheepandwool.org),1)
((201411,next2013.corcoran.gwu.edu),1)
((201409,tcc.noellevitz.com),1)
((201412,r2---sn-nwj7kned.googlevideo.com),1)
((201512,www.corcoran.edu),1)
((201412,r14---sn-nwj7kned.googlevideo.com),1)
((201411,r7---sn-nwj7km7e.c.youtube.com),1)
((201412,r6---sn-nwj7kner.googlevideo.com),1)
((201608,player.vimeo.com),1)
((201411,web.resource.org),1)

Then we'll probably need to adapt process.py into a helper method that creates the CSV file for the visualization. Doing that on the fly would probably be slow, so maybe we should create another job, or add to the clean-up job, to create the CSV file in the background.
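For illustration, here's a minimal Ruby sketch of what such a helper might look like (the method name, file paths, and CSV layout are assumptions for the sketch, not the final implementation; it parses the raw Spark output shown above):

require 'csv'

# Hypothetical helper: turn raw Spark output lines like
#   ((201411,corcoran.gwu.edu),36852)
# into date,domain,count rows for the crawl-sites visualization.
def spark_output_to_csv(input_path, output_path)
  CSV.open(output_path, 'w') do |csv|
    csv << %w[date domain count]
    File.foreach(input_path) do |line|
      # Capture the crawl month, domain, and count from each tuple.
      next unless (m = line.match(/\A\(\((\d{6}),(.+)\),(\d+)\)\s*\z/))
      csv << [m[1], m[2], m[3]]
    end
  end
end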

After that, it'd just be following the path of the Sigmajs visualization for this implementation.

@ianmilligan1 (Member) commented Feb 6, 2019

This sounds promising! I'd defer to you on the implementation, but creating this file and then possibly adding it to the clean-up job sounds like a good route forward?

ruebot added a commit that referenced this issue Feb 6, 2019

Setup additional Spark sub-job for #146.
- Adds additional Spark sub-job to extract info for crawl-viz
- Pre-processes crawl-viz output
- This is ugly
@ruebot (Member, Author) commented Feb 6, 2019

Easy part done. Now I have to port process.py over to Ruby, and make sure it scales. Then implement the actual visualization.
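For the "does it scale" question from the original issue, a rough timing harness (just a sketch; the input paths are hypothetical, and spark_output_to_csv is the hypothetical helper sketched above) might look like:

require 'benchmark'

# Time the CSV conversion on a small and a large collection's Spark output.
['small-collection/part-00000', 'large-collection/part-00000'].each do |input|
  elapsed = Benchmark.realtime { spark_output_to_csv(input, "#{input.tr('/', '-')}.csv") }
  puts format('%s: %.2fs', input, elapsed)
end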

@ruebot (Member, Author) commented Feb 16, 2019

According to this, when we load a CSV via d3.csv() -- like we do here -- the function only takes a path, not a URL. So we'll also need to update that js code to use at least d3 v4.
