CommandLineApp DomainGraphExtractor Uses Different Node IDs than WriteGraph #439

Open
ianmilligan1 opened this issue Apr 9, 2020 · 0 comments
Labels: bug

Describe the bug
The CommandLineApp DomainGraphExtractor produces a different node ID type than running WriteGraph directly through the Spark shell. They should be the same.

To Reproduce
Running the following command (with both the DF and RDD extractors):

bin/spark-submit --class io.archivesunleashed.app.CommandLineAppRunner /Users/ianmilligan1/dropbox/git/aut/target/aut-0.50.1-SNAPSHOT-fatjar.jar --extractor DomainGraphExtractor --input /users/ianmilligan1/dropbox/git/aut-resources/sample-data/*.gz --output /users/ianmilligan1/desktop/domaingraph-gexf --output-format GEXF --partition 1

creates an output file that looks like:

<node id="2343ec78a04c6ea9d80806345d31fd78" label="facebook.com" />
<node id="9cce24c55aee4eb39845fde935cca3da" label="web.net" />
<node id="5399465c5b23df17b16c2377e865a0b2" label="PetitionOnline.com" />
<node id="1fbfb6126d36fd25c16de2b0142700d8" label="traduku.net" />
<node id="d1063af181fe606e55ed93dd5b867169" label="en.wikipedia.org" />
<node id="0412791bbc450bbeb5b7d35eaed7e4f2" label="calendarix.com" />
<node id="fb1c73ca981330da55c56e07be521842" label="goodsforgreens.myshopify.com" />
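The 32-character hexadecimal IDs above are consistent with an MD5 digest of the node label (an assumption; the exact hash DomainGraphExtractor uses is not confirmed here). A minimal Scala sketch of producing such a content-derived ID:

```scala
import java.security.MessageDigest

// Hypothetical helper: derive a stable node ID by hashing the label.
// Assumes (unconfirmed) that the extractor emits an MD5 hex digest.
def md5Id(label: String): String =
  MessageDigest.getInstance("MD5")
    .digest(label.getBytes("UTF-8"))
    .map(b => f"$b%02x")
    .mkString

println(md5Id("facebook.com"))
```

Because the ID is derived from the label alone, it is identical across runs and across machines.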

Conversely, if we run this script, as per the aut-docs:

import io.archivesunleashed._
import io.archivesunleashed.app._
import io.archivesunleashed.matchbox._

val links = RecordLoader.loadArchives("/users/ianmilligan1/dropbox/git/aut-resources/sample-data/*.gz", sc)
  .keepValidPages()
  .map(r => (r.getCrawlDate, ExtractLinksRDD(r.getUrl, r.getContentString)))
  .flatMap(r => r._2.map(f => (r._1,
                               ExtractDomainRDD(f._1).replaceAll("^\\s*www\\.", ""),
                               ExtractDomainRDD(f._2).replaceAll("^\\s*www\\.", ""))))
  .filter(r => r._2 != "" && r._3 != "")
  .countItems()
  .filter(r => r._2 > 5)

WriteGraph(links, "/users/ianmilligan1/desktop/script-gexf.gexf")

We get an output that looks like:

<node id="76" label="liberalpartyofcanada-mb.ca" />
<node id="80" label="lpco.ca" />
<node id="84" label="snapdesign.ca" />
<node id="88" label="PetitionOnline.com" />
<node id="92" label="egale.ca" />
<node id="96" label="liberal.nf.net" />
<node id="100" label="policyalternatives.ca" />
<node id="1" label="collectionscanada.ca" />

Expected behavior
The output of DomainGraphExtractor is preferable to the WriteGraph output. In other words, hash-based node IDs are superior to sequential numeric node IDs.
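The advantage can be illustrated with a small sketch (plain Scala, not AUT's actual implementation): positional IDs change whenever the input ordering changes, while content-derived IDs depend only on the label.

```scala
import java.security.MessageDigest

// Hypothetical illustration: positional vs. content-derived node IDs.
val runA = Seq("facebook.com", "web.net", "PetitionOnline.com")
val runB = runA.reverse // same nodes, different encounter order

// Index-based IDs (as in the WriteGraph output above) depend on position:
val idsA = runA.zipWithIndex.toMap
val idsB = runB.zipWithIndex.toMap
// idsA("facebook.com") and idsB("facebook.com") differ

// Hash-based IDs depend only on the label, so they are order-independent:
def hashId(label: String): String =
  MessageDigest.getInstance("MD5")
    .digest(label.getBytes("UTF-8"))
    .map(b => f"$b%02x")
    .mkString

println(idsA("facebook.com"), idsB("facebook.com"))
println(hashId("facebook.com"))
```

This order-independence also means two graphs extracted from overlapping collections will assign the same ID to the same domain.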

Environment information

  • AUT version: Most recent master
  • OS: macOS 10.15.4
  • Java version: Java 8
  • Apache Spark version: 2.4.4
  • Apache Spark w/aut: w/ --jars
  • Apache Spark command used to run AUT: see above