CommandLineApp DomainGraphExtractor Uses Different Node IDs than WriteGraph #439
Describe the bug
The CommandLineApp DomainGraphExtractor produces output with different node ID types than running WriteGraph directly through the Spark shell. They should be the same.

To Reproduce
The following command (both DF and RDD):
bin/spark-submit --class io.archivesunleashed.app.CommandLineAppRunner /Users/ianmilligan1/dropbox/git/aut/target/aut-0.50.1-SNAPSHOT-fatjar.jar --extractor DomainGraphExtractor --input /users/ianmilligan1/dropbox/git/aut-resources/sample-data/*.gz --output /users/ianmilligan1/desktop/domaingraph-gexf --output-format GEXF --partition 1
creates an output file that looks like:
Conversely, if we run this script as per aut-docs:
We get an output that looks like:
Expected behavior
The output of DomainGraphExtractor is preferable to the WriteGraph output. In other words, nodes as hashes are superior to nodes as ID numbers.

Environment information
--jars
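
For reference, the contrast drawn under Expected behavior (hashes versus ID numbers as node identifiers) can be sketched in a few lines. This is only an illustration, not the aut implementation: the function names, the choice of MD5, and the sample domains are all assumptions made for the example. It shows one property that differs between the two schemes: a hash depends only on the node label, while a sequential number depends on the order in which nodes are encountered.

```python
import hashlib


def hash_node_id(label: str) -> str:
    """Hypothetical helper: derive a node ID from the label itself (MD5 assumed)."""
    return hashlib.md5(label.encode("utf-8")).hexdigest()


def sequential_node_ids(labels):
    """Hypothetical helper: assign ID numbers in order of appearance."""
    return {label: index for index, label in enumerate(labels)}


# Made-up sample domains, processed in two different orders.
domains = ["example.com", "archive.org"]
reordered = list(reversed(domains))

seq_a = sequential_node_ids(domains)
seq_b = sequential_node_ids(reordered)

hash_a = {d: hash_node_id(d) for d in domains}
hash_b = {d: hash_node_id(d) for d in reordered}

# Sequential IDs change with processing order; hash IDs do not.
print(seq_a == seq_b)
print(hash_a == hash_b)
```

Under these assumptions, two graph files produced by different code paths would agree on node IDs only with the hash-based scheme, which fits the preference stated in the report.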