CommandLineApp DomainGraphExtractor Uses Different Node IDs than WriteGraph #439
Comments
Do we have a documented rationale for why we have so many write options for graphs? Currently, we have:
Do we really need all of these? I'd argue, at the very least, we can just remove
import io.archivesunleashed._
import io.archivesunleashed.app._
import io.archivesunleashed.matchbox._
val links = RecordLoader.loadArchives("/home/nruest/Projects/au/sample-data/geocities", sc)
  .keepValidPages()
  .map(r => (r.getCrawlDate, ExtractLinksRDD(r.getUrl, r.getContentString)))
  .flatMap(r => r._2.map(f => (r._1,
    ExtractDomainRDD(f._1).replaceAll("^\\s*www\\.", ""),
    ExtractDomainRDD(f._2).replaceAll("^\\s*www\\.", ""))))
  .filter(r => r._2 != "" && r._3 != "")
  .countItems()
  .filter(r => r._2 > 5)
WriteGraph(links, "/home/nruest/Projects/au/sample-data/issue-439/writegraph.gexf")
import io.archivesunleashed._
import io.archivesunleashed.app._
import io.archivesunleashed.matchbox._
val links = RecordLoader.loadArchives("/home/nruest/Projects/au/sample-data/geocities", sc)
  .keepValidPages()
  .map(r => (r.getCrawlDate, ExtractLinksRDD(r.getUrl, r.getContentString)))
  .flatMap(r => r._2.map(f => (r._1,
    ExtractDomainRDD(f._1).replaceAll("^\\s*www\\.", ""),
    ExtractDomainRDD(f._2).replaceAll("^\\s*www\\.", ""))))
  .filter(r => r._2 != "" && r._3 != "")
  .countItems()
  .filter(r => r._2 > 5)
WriteGEXF(links, "/home/nruest/Projects/au/sample-data/issue-439/writegexf.gexf")

These two scripts produce the same thing, other than the issue raised here. So, I'm going to open up a PR where we just rip it all out. AUK will need to be updated for the next release, as will all the documentation.

$ wc -l *
29186 writegexf.gexf
29186 writegraph.gexf
58372 total
For context, issue #289 - way back in November 2018 (!) - discusses the context behind having this. Basically, I think the only difference is that

Apologies, I should have looked this up before, but didn't think we had these functions running in parallel but they're both there. We should certainly kill one. I have no strong feelings on what we keep. I guess part of me thinks that MD5 collisions are like, very rare (i.e. this random StackOverflow answer), but I'm also a historian so I'd defer to other thoughts.

FWIW I think we could also delete
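To put a rough number on "very rare": the sketch below uses the standard birthday-bound approximation, assuming MD5 output behaves like a uniformly random 128-bit value (so it says nothing about deliberately crafted collisions). The object and method names are made up for illustration and are not part of aut.

```scala
object Md5CollisionEstimate {
  /**
   * Birthday-bound approximation for the chance that at least two of
   * `n` distinct inputs share a hash of `bits` bits:
   *   P(collision) ~ n * (n - 1) / 2^(bits + 1)
   * Only meaningful while the result is much smaller than 1.
   */
  def collisionProbability(n: Long, bits: Int = 128): Double =
    n.toDouble * (n - 1).toDouble / math.pow(2.0, bits + 1)
}
```

For a million distinct domains, `Md5CollisionEstimate.collisionProbability(1000000L)` comes out on the order of 1e-27, so accidental collisions between hashed node IDs should not be a practical concern at that scale.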
ianmilligan1 commented Apr 9, 2020
Describe the bug
The output of the CommandLineApp DomainGraphExtractor creates different node ID types than running WriteGraph directly through spark shell. They should be the same.
To Reproduce
The following command line command (both DF and RDD):
bin/spark-submit --class io.archivesunleashed.app.CommandLineAppRunner /Users/ianmilligan1/dropbox/git/aut/target/aut-0.50.1-SNAPSHOT-fatjar.jar --extractor DomainGraphExtractor --input /users/ianmilligan1/dropbox/git/aut-resources/sample-data/*.gz --output /users/ianmilligan1/desktop/domaingraph-gexf --output-format GEXF --partition 1
creates an output file that looks like:
Conversely, if we run this script as per aut-docs:
We get an output that looks like:
Expected behavior
The output of DomainGraphExtractor is preferable to the WriteGraph output. In other words, nodes as hashes are superior to nodes as ID #s.
Environment information
--jars
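
For illustration, the two node ID schemes this issue contrasts (hashes versus ID numbers) can be sketched roughly as follows. This is not aut's actual code: `NodeIds`, `md5Id`, and `sequentialIds` are hypothetical names, and the only library used is the JDK's `java.security.MessageDigest`.

```scala
import java.nio.charset.StandardCharsets
import java.security.MessageDigest

object NodeIds {
  /** Hex-encoded MD5 digest of a node label; depends only on the label itself. */
  def md5Id(label: String): String =
    MessageDigest.getInstance("MD5")
      .digest(label.getBytes(StandardCharsets.UTF_8))
      .map(b => f"${b & 0xff}%02x") // each byte as two lowercase hex digits
      .mkString

  /** Numeric IDs assigned in order of first appearance; depends on input order. */
  def sequentialIds(labels: Seq[String]): Map[String, Long] =
    labels.distinct.zipWithIndex.map { case (label, i) => label -> i.toLong }.toMap
}
```

With hashed IDs a given domain gets the same node ID in every output file and every run, whereas numeric IDs change whenever the order or set of domains changes. That is one reason to prefer the hashed form when graphs from different runs need to be compared or merged.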