CommandLineApp DomainGraphExtractor Uses Different Node IDs than WriteGraph #439

ianmilligan1 · 2020-04-09T18:00:16Z

Describe the bug
The CommandLineApp DomainGraphExtractor produces a different type of node ID than running WriteGraph directly in the Spark shell. They should be the same.

To Reproduce
Running the following command (the result is the same for both DF and RDD):

bin/spark-submit --class io.archivesunleashed.app.CommandLineAppRunner /Users/ianmilligan1/dropbox/git/aut/target/aut-0.50.1-SNAPSHOT-fatjar.jar --extractor DomainGraphExtractor --input /users/ianmilligan1/dropbox/git/aut-resources/sample-data/*.gz --output /users/ianmilligan1/desktop/domaingraph-gexf --output-format GEXF --partition 1

creates an output file that looks like:

<node id="2343ec78a04c6ea9d80806345d31fd78" label="facebook.com" />
<node id="9cce24c55aee4eb39845fde935cca3da" label="web.net" />
<node id="5399465c5b23df17b16c2377e865a0b2" label="PetitionOnline.com" />
<node id="1fbfb6126d36fd25c16de2b0142700d8" label="traduku.net" />
<node id="d1063af181fe606e55ed93dd5b867169" label="en.wikipedia.org" />
<node id="0412791bbc450bbeb5b7d35eaed7e4f2" label="calendarix.com" />
<node id="fb1c73ca981330da55c56e07be521842" label="goodsforgreens.myshopify.com" />

Conversely, if we run this script as per aut-docs:

import io.archivesunleashed._
import io.archivesunleashed.app._
import io.archivesunleashed.matchbox._

val links = RecordLoader.loadArchives("/users/ianmilligan1/dropbox/git/aut-resources/sample-data/*.gz", sc)
  .keepValidPages()
  .map(r => (r.getCrawlDate, ExtractLinksRDD(r.getUrl, r.getContentString)))
  .flatMap(r => r._2.map(f => (r._1,
                               ExtractDomainRDD(f._1).replaceAll("^\\s*www\\.", ""),
                               ExtractDomainRDD(f._2).replaceAll("^\\s*www\\.", ""))))
  .filter(r => r._2 != "" && r._3 != "")
  .countItems()
  .filter(r => r._2 > 5)

WriteGraph(links, "/users/ianmilligan1/desktop/script-gexf.gexf")

We get an output that looks like:

<node id="76" label="liberalpartyofcanada-mb.ca" />
<node id="80" label="lpco.ca" />
<node id="84" label="snapdesign.ca" />
<node id="88" label="PetitionOnline.com" />
<node id="92" label="egale.ca" />
<node id="96" label="liberal.nf.net" />
<node id="100" label="policyalternatives.ca" />
<node id="1" label="collectionscanada.ca" />

Expected behavior
The output of DomainGraphExtractor is preferable to the WriteGraph output. In other words, hashes as node IDs are superior to numeric node IDs.

Environment information

AUT version: Most recent master
OS: MacOS 15.4
Java version: Java 8
Apache Spark version: 2.4.4
Apache Spark w/aut: w/ --jars
Apache Spark command used to run AUT: see above

ruebot · 2020-04-13T14:23:45Z

Do we have a documented rationale for why we have so many write options for graphs? Currently, we have:

~~WriteGraphXML (not documented)~~
WriteGraphML (documented via CommandLineApp)
WriteGraph (documented)
WriteGEXF (documented via CommandLineApp)

Do we really need all of these? I'd argue, at the very least, we can just remove WriteGraph since it is redundant.

WriteGraph

import io.archivesunleashed._
import io.archivesunleashed.app._
import io.archivesunleashed.matchbox._

val links = RecordLoader.loadArchives("/home/nruest/Projects/au/sample-data/geocities", sc)
  .keepValidPages()
  .map(r => (r.getCrawlDate, ExtractLinksRDD(r.getUrl, r.getContentString)))
  .flatMap(r => r._2.map(f => (r._1,
                               ExtractDomainRDD(f._1).replaceAll("^\\s*www\\.", ""),
                               ExtractDomainRDD(f._2).replaceAll("^\\s*www\\.", ""))))
  .filter(r => r._2 != "" && r._3 != "")
  .countItems()
  .filter(r => r._2 > 5)

WriteGraph(links, "/home/nruest/Projects/au/sample-data/issue-439/writegraph.gexf")

WriteGEXF

import io.archivesunleashed._
import io.archivesunleashed.app._
import io.archivesunleashed.matchbox._

val links = RecordLoader.loadArchives("/home/nruest/Projects/au/sample-data/geocities", sc)
  .keepValidPages()
  .map(r => (r.getCrawlDate, ExtractLinksRDD(r.getUrl, r.getContentString)))
  .flatMap(r => r._2.map(f => (r._1,
                               ExtractDomainRDD(f._1).replaceAll("^\\s*www\\.", ""),
                               ExtractDomainRDD(f._2).replaceAll("^\\s*www\\.", ""))))
  .filter(r => r._2 != "" && r._3 != "")
  .countItems()
  .filter(r => r._2 > 5)

WriteGEXF(links, "/home/nruest/Projects/au/sample-data/issue-439/writegexf.gexf")

These two scripts produce the same thing, other than the issue raised here. So, I'm going to open up a PR where we just rip it all out. AUK will need to be updated for the next release, as will all the documentation.

$ wc -l *              
  29186 writegexf.gexf
  29186 writegraph.gexf
  58372 total
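Matching line counts only show that the two files are the same size. A stricter check is to compare the node labels while ignoring the IDs. The following is a rough sketch in plain Scala, assuming node lines shaped like the excerpts earlier in this thread (`<node id="..." label="..." />`); `GexfNodes`, `parse`, `labels`, and `allNumericIds` are hypothetical names, not part of aut.

```scala
object GexfNodes {
  // Matches node lines shaped like the excerpts in this issue:
  //   <node id="76" label="lpco.ca" />
  // Assumption: one node element per line, attributes in this order.
  private val Node = """<node id="([^"]*)" label="([^"]*)"\s*/>""".r

  // Extract (id, label) pairs; lines that are not node elements are skipped.
  def parse(lines: Seq[String]): Seq[(String, String)] =
    lines.flatMap(line => Node.findFirstMatchIn(line).map(m => (m.group(1), m.group(2))))

  // The set of node labels, independent of whichever ID scheme was used.
  def labels(lines: Seq[String]): Set[String] =
    parse(lines).map(_._2).toSet

  // True when every node ID consists only of digits (the WriteGraph style).
  def allNumericIds(lines: Seq[String]): Boolean =
    parse(lines).forall { case (id, _) => id.nonEmpty && id.forall(_.isDigit) }
}
```

To use it on real output, read each file with `scala.io.Source.fromFile(path).getLines().toSeq` and compare the two `labels` sets for equality.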


        Remove WriteGraph; resolves #439.

ianmilligan1 · 2020-04-13T17:57:23Z

For context, issue #289 - way back in November 2018 (!) - discusses the context behind having this. Basically, I think the only difference is that WriteGraph uses zipWithUniqueId and WriteGEXF & WriteGraphML use ComputeMD5. There are pros and cons: WriteGraph is slower (@greebie thought 10-15% slower), but WriteGEXF and WriteGraphML have the chance of an MD5 hash collision.
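To make the two ID schemes concrete, here is a minimal sketch in plain Scala with no Spark dependency. It is not aut's implementation: `NodeIds`, `md5Hex`, `hashIds`, and `positionalIds` are hypothetical names, and `zipWithIndex` on a local `Seq` merely stands in for Spark's `zipWithUniqueId`. It shows why hash IDs stay the same no matter what order the labels arrive in, while positional IDs do not.

```scala
import java.nio.charset.StandardCharsets
import java.security.MessageDigest

object NodeIds {
  // Hypothetical helper: hex-encoded MD5 digest of a label's UTF-8 bytes.
  def md5Hex(label: String): String =
    MessageDigest.getInstance("MD5")
      .digest(label.getBytes(StandardCharsets.UTF_8))
      .map(b => "%02x".format(b & 0xff))
      .mkString

  // Content-derived IDs: the same label always maps to the same ID.
  def hashIds(labels: Seq[String]): Map[String, String] =
    labels.map(l => l -> md5Hex(l)).toMap

  // Positional IDs: the ID depends on where the label sits in the input.
  // (Stand-in for zipWithUniqueId, whose IDs depend on partitioning.)
  def positionalIds(labels: Seq[String]): Map[String, Long] =
    labels.zipWithIndex.map { case (l, i) => l -> i.toLong }.toMap
}
```

With hash IDs, two separately generated graphs can be merged or compared by ID. With positional IDs, the same domain may get a different number on every run.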

Apologies, I should have looked this up earlier. I didn't realize we had these functions existing side by side, but they're both there. We should certainly kill one.

I have no strong feelings on what we keep. I guess part of me thinks that MD5 collisions are, like, very rare (see this random StackOverflow answer), but I'm also a historian so I'd defer to other thoughts.
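For a rough sense of scale, the standard birthday-bound approximation can be computed directly. This sketch assumes MD5 behaves like a uniform random 128-bit function on non-adversarial input (domain names from a crawl, not deliberately crafted collisions); `Md5Birthday` and `collisionProbability` are hypothetical names.

```scala
object Md5Birthday {
  // Approximate probability of at least one collision among n distinct
  // inputs hashed uniformly into 2^bits values:
  //   p ~= 1 - exp(-n(n-1) / 2^(bits+1))
  def collisionProbability(n: Long, bits: Int = 128): Double = {
    val pairs = n.toDouble * (n.toDouble - 1.0) / 2.0
    // expm1 keeps precision when the probability is extremely small.
    -Math.expm1(-pairs / Math.pow(2.0, bits.toDouble))
  }
}
```

Under these assumptions, a graph with a million distinct domains has a collision probability on the order of 1e-27, so accidental collisions are not a practical concern at this scale.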

FWIW I think we could also delete WriteGraphXML - it looks to be a product of some of the GraphX experiments we were doing 2-3 years ago? reference

ianmilligan1 added the bug label Apr 9, 2020

ianmilligan1 assigned ruebot Apr 9, 2020

ruebot added a commit that referenced this issue Apr 13, 2020

Remove WriteGraph; resolves #439.

ruebot Nick Ruest

04ebf04

ruebot mentioned this issue Apr 13, 2020

Remove WriteGraph; resolves #439. #441

Merged

ruebot mentioned this issue Apr 13, 2020

Remove GraphXML and ExtractGraphX #442

Closed

ianmilligan1 closed this in c1f9b31 Apr 14, 2020

archivesunleashed / aut
