Background job for basic analysis #28

Closed
ruebot opened this Issue Feb 7, 2018 · 1 comment

ruebot commented Feb 7, 2018

Once a collection is downloaded, run basic analysis on it.

...we'll need to have Spark up and running with aut (the Archives Unleashed Toolkit) loaded.
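One way to picture the job, judging from the `spark_jobs/9848.scala` file shown later in this thread: render a small Spark shell script per collection and write it next to the WARCs. The sketch below is only an illustration under that assumption. `JobScript`, `render`, `write`, `basePath` and `collectionId` are hypothetical names, and only the all-domains step is included to keep it short.

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.{Files, Path, Paths}

// Hypothetical helper, not the project's actual code.
object JobScript {
  // Builds the text of a Spark shell script for one collection.
  // Layout assumed: <basePath>/<collectionId>/{warcs,derivatives,spark_jobs}
  def render(basePath: String, collectionId: String): String = {
    val root = s"$basePath/$collectionId"
    val warcs = s"$root/warcs/*.gz"
    val out = s"$root/derivatives"
    s"""import io.archivesunleashed.spark.matchbox.{ExtractDomain, RecordLoader}
       |import io.archivesunleashed.spark.rdd.RecordRDD._
       |RecordLoader.loadArchives("$warcs", sc).keepValidPages().map(r => ExtractDomain(r.getUrl)).countItems().saveAsTextFile("$out/all-domains")
       |sys.exit
       |""".stripMargin
  }

  // Writes the rendered script to <basePath>/<collectionId>/spark_jobs/<collectionId>.scala
  def write(basePath: String, collectionId: String): Path = {
    val dir = Paths.get(basePath, collectionId, "spark_jobs")
    Files.createDirectories(dir)
    Files.write(
      dir.resolve(s"$collectionId.scala"),
      render(basePath, collectionId).getBytes(StandardCharsets.UTF_8))
  }
}
```

The thread does not show how the script is handed to Spark. One plausible route is `spark-shell -i <script>` with the aut jar on the classpath, but that is an assumption rather than something confirmed here.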

ruebot closed this in 1e43a20 Feb 27, 2018

ruebot commented Feb 27, 2018

[nruest@roo:990]$ tree 9848 
9848
├── derivatives
│   ├── all-domains
│   │   ├── part-00000
│   │   └── _SUCCESS
│   ├── all-text
│   │   ├── part-00000
│   │   └── _SUCCESS
│   └── links-for-gephi.gexf
├── spark_jobs
│   ├── 9848.scala
│   └── 9848.scala.log
└── warcs
    └── ARCHIVEIT-9848-TEST-JOB511844-20171215174205244-00000.warc.gz

5 directories, 8 files
[nruest@roo:990]$ cat 9848/derivatives/all-domains/part-00000
(theoryandpractice.planning.dal.ca,44)
(r3---sn-n4v7sn76.googlevideo.com,3)
(www.youtube-nocookie.com,1)
(www.youtube.com,1)
[nruest@roo:990]$ head -n 1 9848/derivatives/all-text/part-00000
(20171215,theoryandpractice.planning.dal.ca,http://theoryandpractice.planning.dal.ca/,HTTP/1.1 200 OK Date: Fri, 15 Dec 2017 17:42:07 GMT Server: Apache/2.2.15 (Red Hat) Last-Modified: Mon, 30 Jan 2017 19:55:08 GMT ETag: "280768-14d3-547553145bf00" Accept-Ranges: bytes Content-Length: 5331 Connection: close Content-Type: text/html; charset=UTF-8 Jill Grant Research Projects Planning History Trends In The Suburbs Gated & Private Communities Creative Cities: Halifax Coordinating Multiple Plans Health & The Built Environment Neighbourhood Change In Halifax   a research compendium Welcome to the home page of our research. Our work examines the relationships (and sometimes the contradictions) between the theory planners have about what contributes to designing and building good communities and what actually results from the practice of developing our communities. You'll find links here to our current and past projects with summaries of our findings, early results, and examples of student research. Although copyright regulations prevent us from posting copies of most of the scholarly publications from the work, conference presentations and early drafts are here as working papers when possible. We hope you find the information useful, and welcome any comments. For a recent version of my c.v. and more about my background, please visit my faculty web page. Dr. Jill L. Grant, Professor, School of Planning   Principal Researcher - Dr. Jill L. Grant School of Planning Dalhousie University Contact site by mr.deps)
[nruest@roo:990]$ head -n 20 9848/derivatives/links-for-gephi.gexf
<?xml version="1.0" encoding="UTF-8"?>
      <gexf xmlns="http://www.gexf.net/1.3draft"
            xmlns:xsi="http://www.w3.org/2001/XMLSchema-instance"
            xsi:schemaLocation="http://www.gexf.net/1.3draft
                                http://www.gexf.net/1.3draft/gexf.xsd"
            version="1.3">
        <graph mode="static" defaultedgetype="directed">
          <attributes class="edge">
            <attribute id="0" title="crawlDate" type="string" />
          </attributes>
          <nodes>      <node id="architectureandplanning.dal.ca" label="architectureandplanning.dal.ca" />
      <node id="theoryandpractice.planning.dal.ca" label="theoryandpractice.planning.dal.ca" />

      <node id="dal.ca" label="dal.ca" />
    </nodes>
      <edges>
            <edge source="theoryandpractice.planning.dal.ca" target="architectureandplanning.dal.ca" label="" weight="85"  type="directed">
      <attvalues>
      <attvalue for="0" value="20171215" />
      </attvalues>
$ cat 9848/spark_jobs/9848.scala

      import io.archivesunleashed.spark.matchbox.{ExtractDomain, ExtractLinks, RemoveHTML, RecordLoader, WriteGEXF}
      import io.archivesunleashed.spark.rdd.RecordRDD._
      sc.setLogLevel("INFO")
      RecordLoader.loadArchives("/home/nruest/Projects/tmp/990/9848/warcs/*.gz", sc).keepValidPages().map(r => ExtractDomain(r.getUrl)).countItems().saveAsTextFile("/home/nruest/Projects/tmp/990/9848/derivatives/all-domains")
      RecordLoader.loadArchives("/home/nruest/Projects/tmp/990/9848/warcs/*.gz", sc).keepValidPages().map(r => (r.getCrawlDate, r.getDomain, r.getUrl, RemoveHTML(r.getContentString))).saveAsTextFile("/home/nruest/Projects/tmp/990/9848/derivatives/all-text")
      val links = RecordLoader.loadArchives("/home/nruest/Projects/tmp/990/9848/warcs/*.gz", sc).keepValidPages().map(r => (r.getCrawlDate, ExtractLinks(r.getUrl, r.getContentString))).flatMap(r => r._2.map(f => (r._1, ExtractDomain(f._1).replaceAll("^\\s*www\\.", ""), ExtractDomain(f._2).replaceAll("^\\s*www\\.", "")))).filter(r => r._2 != "" && r._3 != "").countItems().filter(r => r._2 > 5)
      WriteGEXF(links, "/home/nruest/Projects/tmp/990/9848/derivatives/links-for-gephi.gexf")
      sys.exit
$ tail -n 20 9848/spark_jobs/9848.scala.log
2018-02-27 02:12:57,639 [Thread-1] INFO  ContextHandler - Stopped o.s.j.s.ServletContextHandler@6f112f70{/stages/pool/json,null,UNAVAILABLE,@Spark}
2018-02-27 02:12:57,640 [Thread-1] INFO  ContextHandler - Stopped o.s.j.s.ServletContextHandler@30fa8a6b{/stages/pool,null,UNAVAILABLE,@Spark}
2018-02-27 02:12:57,640 [Thread-1] INFO  ContextHandler - Stopped o.s.j.s.ServletContextHandler@73d91faf{/stages/stage/json,null,UNAVAILABLE,@Spark}
2018-02-27 02:12:57,641 [Thread-1] INFO  ContextHandler - Stopped o.s.j.s.ServletContextHandler@3cb04dd{/stages/stage,null,UNAVAILABLE,@Spark}
2018-02-27 02:12:57,641 [Thread-1] INFO  ContextHandler - Stopped o.s.j.s.ServletContextHandler@3bcc8f13{/stages/json,null,UNAVAILABLE,@Spark}
2018-02-27 02:12:57,642 [Thread-1] INFO  ContextHandler - Stopped o.s.j.s.ServletContextHandler@35becbd4{/stages,null,UNAVAILABLE,@Spark}
2018-02-27 02:12:57,642 [Thread-1] INFO  ContextHandler - Stopped o.s.j.s.ServletContextHandler@3ba97962{/jobs/job/json,null,UNAVAILABLE,@Spark}
2018-02-27 02:12:57,642 [Thread-1] INFO  ContextHandler - Stopped o.s.j.s.ServletContextHandler@20231384{/jobs/job,null,UNAVAILABLE,@Spark}
2018-02-27 02:12:57,642 [Thread-1] INFO  ContextHandler - Stopped o.s.j.s.ServletContextHandler@7c0e4e4e{/jobs/json,null,UNAVAILABLE,@Spark}
2018-02-27 02:12:57,642 [Thread-1] INFO  ContextHandler - Stopped o.s.j.s.ServletContextHandler@58bad46f{/jobs,null,UNAVAILABLE,@Spark}
2018-02-27 02:12:57,656 [Thread-1] INFO  SparkUI - Stopped Spark web UI at http://172.17.0.1:4040
2018-02-27 02:12:57,696 [dispatcher-event-loop-1] INFO  MapOutputTrackerMasterEndpoint - MapOutputTrackerMasterEndpoint stopped!
2018-02-27 02:12:57,754 [Thread-1] INFO  MemoryStore - MemoryStore cleared
2018-02-27 02:12:57,754 [Thread-1] INFO  BlockManager - BlockManager stopped
2018-02-27 02:12:57,755 [Thread-1] INFO  BlockManagerMaster - BlockManagerMaster stopped
2018-02-27 02:12:57,758 [dispatcher-event-loop-0] INFO  OutputCommitCoordinator$OutputCommitCoordinatorEndpoint - OutputCommitCoordinator stopped!
2018-02-27 02:12:57,767 [Thread-1] INFO  SparkContext - Successfully stopped SparkContext
2018-02-27 02:12:57,768 [Thread-1] INFO  ShutdownHookManager - Shutdown hook called
2018-02-27 02:12:57,769 [Thread-1] INFO  ShutdownHookManager - Deleting directory /tmp/spark-0e8e1a92-29db-4c2f-b483-bf441704b065
2018-02-27 02:12:57,783 [Thread-1] INFO  ShutdownHookManager - Deleting directory /tmp/spark-0e8e1a92-29db-4c2f-b483-bf441704b065/repl-f6589426-78c2-480a-ae2c-c7cb1ddbd35b
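
The `all-domains/part-00000` file above is plain text with one `(domain,count)` tuple per line, so it can be read back without Spark. A minimal sketch, assuming that line format holds for every row; `DomainCounts` and `parseLine` are hypothetical names, and the sample lines are copied from the output above.

```scala
import scala.util.Try

// Hypothetical reader for the "(domain,count)" lines written by saveAsTextFile.
object DomainCounts {
  // Parses one line of the form "(domain,count)"; returns None for anything else.
  def parseLine(line: String): Option[(String, Int)] = {
    val t = line.trim
    if (t.length < 2 || !t.startsWith("(") || !t.endsWith(")")) None
    else {
      val inner = t.substring(1, t.length - 1)
      // Split on the last comma so the count is always the final field.
      val comma = inner.lastIndexOf(',')
      if (comma < 0) None
      else Try(inner.substring(comma + 1).trim.toInt).toOption
        .map(n => (inner.substring(0, comma), n))
    }
  }

  def main(args: Array[String]): Unit = {
    val sample = Seq(
      "(theoryandpractice.planning.dal.ca,44)",
      "(r3---sn-n4v7sn76.googlevideo.com,3)",
      "(www.youtube-nocookie.com,1)",
      "(www.youtube.com,1)")
    val counts = sample.flatMap(parseLine)
    counts.sortBy(-_._2).foreach { case (domain, n) => println(s"$domain\t$n") }
    println(s"total records: ${counts.map(_._2).sum}")
  }
}
```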

🎉 🎉 🎉 🎉 🎉
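
The link-graph line in `9848.scala` packs several steps into one expression: map each link to a (crawl date, source domain, target domain) triple, strip a leading `www.`, drop triples with an empty domain, count identical triples, and keep those seen more than five times. The sketch below mirrors those steps on an in-memory `Seq` so they can be followed without Spark. It is not aut's implementation: `LinkGraphSketch`, `domainOf` (a stand-in for `ExtractDomain` built on `java.net.URI`) and the sample input are all assumptions made for illustration.

```scala
import java.net.URI
import scala.util.Try

object LinkGraphSketch {
  // Stand-in for aut's ExtractDomain: the URL's host, or "" if there is none.
  def domainOf(url: String): String =
    Try(Option(new URI(url).getHost).getOrElse("")).getOrElse("")

  // Same regex as the job script: drop leading whitespace plus "www.".
  def stripWww(domain: String): String = domain.replaceAll("^\\s*www\\.", "")

  // links: (crawlDate, sourceUrl, targetUrl); keeps edges seen more than `threshold` times.
  def edges(links: Seq[(String, String, String)],
            threshold: Int): Seq[((String, String, String), Int)] =
    links
      .map { case (date, src, dst) => (date, stripWww(domainOf(src)), stripWww(domainOf(dst))) }
      .filter { case (_, s, d) => s != "" && d != "" }
      .groupBy(identity)                          // plain-collections analogue of countItems()
      .map { case (edge, hits) => (edge, hits.size) }
      .filter { case (_, weight) => weight > threshold }
      .toSeq
      .sortBy { case (_, weight) => -weight }     // sorted only to make the output readable

  def main(args: Array[String]): Unit = {
    // Illustrative input only; the counts are made up.
    val page = "http://theoryandpractice.planning.dal.ca/"
    val links =
      Seq.fill(6)(("20171215", page, "http://www.architectureandplanning.dal.ca/")) ++
      Seq.fill(2)(("20171215", page, "http://dal.ca/")) ++
      Seq(("20171215", page, "mailto:someone@example.org"))
    edges(links, 5).foreach(println)
  }
}
```

With the threshold set to 5, as in the job script, only edges with a weight of 6 or more survive, which is consistent with the `weight="85"` edge in the GEXF excerpt above.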
