CommandLineApp DomainGraphExtractor Uses Different Node IDs than WriteGraph #439

Open
ianmilligan1 opened this issue Apr 9, 2020 · 0 comments
Labels: bug

Describe the bug
The CommandLineApp DomainGraphExtractor produces a different node ID type than running WriteGraph directly through the Spark shell. They should be the same.

To Reproduce
Running the following command (with both the DF and RDD extractors):

bin/spark-submit --class io.archivesunleashed.app.CommandLineAppRunner /Users/ianmilligan1/dropbox/git/aut/target/aut-0.50.1-SNAPSHOT-fatjar.jar --extractor DomainGraphExtractor --input /users/ianmilligan1/dropbox/git/aut-resources/sample-data/*.gz --output /users/ianmilligan1/desktop/domaingraph-gexf --output-format GEXF --partition 1

creates an output file that looks like:

<node id="2343ec78a04c6ea9d80806345d31fd78" label="facebook.com" />
<node id="9cce24c55aee4eb39845fde935cca3da" label="web.net" />
<node id="5399465c5b23df17b16c2377e865a0b2" label="PetitionOnline.com" />
<node id="1fbfb6126d36fd25c16de2b0142700d8" label="traduku.net" />
<node id="d1063af181fe606e55ed93dd5b867169" label="en.wikipedia.org" />
<node id="0412791bbc450bbeb5b7d35eaed7e4f2" label="calendarix.com" />
<node id="fb1c73ca981330da55c56e07be521842" label="goodsforgreens.myshopify.com" />
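The 32-character hexadecimal IDs above are consistent with an MD5 digest of the node label (an assumption; the exact hash DomainGraphExtractor uses is not confirmed here). A minimal Scala sketch of producing such a content-derived ID:

```scala
import java.security.MessageDigest

// Hypothetical helper: derive a stable node ID by hashing the label.
// Assumes (unconfirmed) that the extractor emits an MD5 hex digest.
def md5Id(label: String): String =
  MessageDigest.getInstance("MD5")
    .digest(label.getBytes("UTF-8"))
    .map(b => f"$b%02x")
    .mkString

println(md5Id("facebook.com"))
```

Because the ID is derived from the label alone, it is identical across runs and across machines.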

Conversely, if we run this script, as per the aut-docs:

import io.archivesunleashed._
import io.archivesunleashed.app._
import io.archivesunleashed.matchbox._

val links = RecordLoader.loadArchives("/users/ianmilligan1/dropbox/git/aut-resources/sample-data/*.gz", sc)
  .keepValidPages()
  .map(r => (r.getCrawlDate, ExtractLinksRDD(r.getUrl, r.getContentString)))
  .flatMap(r => r._2.map(f => (r._1,
                               ExtractDomainRDD(f._1).replaceAll("^\\s*www\\.", ""),
                               ExtractDomainRDD(f._2).replaceAll("^\\s*www\\.", ""))))
  .filter(r => r._2 != "" && r._3 != "")
  .countItems()
  .filter(r => r._2 > 5)

WriteGraph(links, "/users/ianmilligan1/desktop/script-gexf.gexf")

We get an output that looks like:

<node id="76" label="liberalpartyofcanada-mb.ca" />
<node id="80" label="lpco.ca" />
<node id="84" label="snapdesign.ca" />
<node id="88" label="PetitionOnline.com" />
<node id="92" label="egale.ca" />
<node id="96" label="liberal.nf.net" />
<node id="100" label="policyalternatives.ca" />
<node id="1" label="collectionscanada.ca" />

Expected behavior
The output of DomainGraphExtractor is preferable to the WriteGraph output. In other words, hash-based node IDs are superior to sequential numeric node IDs.
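The advantage can be illustrated with a small sketch (plain Scala, not AUT's actual implementation): positional IDs change whenever the input ordering changes, while content-derived IDs depend only on the label.

```scala
import java.security.MessageDigest

// Hypothetical illustration: positional vs. content-derived node IDs.
val runA = Seq("facebook.com", "web.net", "PetitionOnline.com")
val runB = runA.reverse // same nodes, different encounter order

// Index-based IDs (as in the WriteGraph output above) depend on position:
val idsA = runA.zipWithIndex.toMap
val idsB = runB.zipWithIndex.toMap
// idsA("facebook.com") and idsB("facebook.com") differ

// Hash-based IDs depend only on the label, so they are order-independent:
def hashId(label: String): String =
  MessageDigest.getInstance("MD5")
    .digest(label.getBytes("UTF-8"))
    .map(b => f"$b%02x")
    .mkString

println(idsA("facebook.com"), idsB("facebook.com"))
println(hashId("facebook.com"))
```

This order-independence also means two graphs extracted from overlapping collections will assign the same ID to the same domain.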

Environment information

  • AUT version: Most recent master
  • OS: macOS 10.15.4
  • Java version: Java 8
  • Apache Spark version: 2.4.4
  • Apache Spark w/aut: w/ --jars
  • Apache Spark command used to run AUT: see above