DomainGraphExtractor produces different output in RDD vs DF #436

Closed
ruebot opened this issue Apr 8, 2020 · 0 comments
ruebot commented Apr 8, 2020

To Reproduce
Steps to reproduce the behavior:

  1. bin/spark-submit --class io.archivesunleashed.app.CommandLineAppRunner /home/nruest/Projects/au/aut/target/aut-0.50.1-SNAPSHOT-fatjar.jar --extractor DomainGraphExtractor --input /home/nruest/Projects/au/sample-data/geocities/* --output /home/nruest/Projects/au/sample-data/app-output/DomainGraphText --output-format TEXT

  2. bin/spark-submit --class io.archivesunleashed.app.CommandLineAppRunner /home/nruest/Projects/au/aut/target/aut-0.50.1-SNAPSHOT-fatjar.jar --extractor DomainGraphExtractor --input /home/nruest/Projects/au/sample-data/geocities/* --output /home/nruest/Projects/au/sample-data/app-output/DomainGraphText --df --output-format TEXT

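  3. Concatenate the part files from each run into a single file (DomainGraphText.txt for the RDD run in step 1, DomainGraphDFtext.csv for the DataFrame run in step 2), then compare the line counts: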

$ wc -l DomainGraphText.txt DomainGraphDFtext.csv
  4935 DomainGraphText.txt
 70368 DomainGraphDFtext.csv
 75303 total
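
Step 3 can be sketched as a small shell helper. This is only an illustration: the function name `combine_parts` and the placeholder paths are assumptions, not part of AUT, and it assumes the output directory contains Spark's usual `part-*` files.

```shell
# combine_parts DIR OUTFILE
# Concatenate the Spark part files found in DIR into OUTFILE.
# Hypothetical helper for illustration; not part of AUT.
combine_parts() {
  # The glob expands in sorted order, so parts are joined in file-name order.
  # Spark's _SUCCESS marker and hidden .crc files do not match part-*.
  cat "$1"/part-* > "$2"
}

# Example usage (placeholder paths, not taken from the issue):
# combine_parts /path/to/rdd-output DomainGraphText.txt
# combine_parts /path/to/df-output DomainGraphDFtext.csv
# wc -l DomainGraphText.txt DomainGraphDFtext.csv
```

Writing each run to its own output directory keeps the two sets of part files from being mixed together.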

Expected behavior

The RDD and DataFrame runs should produce identical output.
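
Line counts alone only show that the outputs differ. A rough way to check whether two combined files hold the same lines, regardless of line order, is to sort both and count the lines unique to each. The sketch below assumes the standard `sort` and `comm` utilities; `compare_outputs` is a hypothetical name, not an AUT command.

```shell
# compare_outputs A B
# Report how many lines appear only in A or only in B, ignoring line order.
# Returns success only when the two files contain the same lines.
compare_outputs() {
  a_sorted=$(mktemp)
  b_sorted=$(mktemp)
  # comm needs both inputs sorted under the same collation, so pin the locale.
  LC_ALL=C sort "$1" > "$a_sorted"
  LC_ALL=C sort "$2" > "$b_sorted"
  only_a=$(LC_ALL=C comm -23 "$a_sorted" "$b_sorted" | wc -l | tr -d ' ')
  only_b=$(LC_ALL=C comm -13 "$a_sorted" "$b_sorted" | wc -l | tr -d ' ')
  rm -f "$a_sorted" "$b_sorted"
  echo "only in first: $only_a, only in second: $only_b"
  [ "$only_a" -eq 0 ] && [ "$only_b" -eq 0 ]
}

# Example usage with the file names from the steps above:
# compare_outputs DomainGraphText.txt DomainGraphDFtext.csv
```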

Environment information

  • AUT version: 0.50.0, 0.50.1-SNAPSHOT
  • OS: Ubuntu 18.04
  • Java version: Java 8
  • Apache Spark version: 2.4.5
  • Apache Spark w/aut: spark-submit
  • Apache Spark command used to run AUT: see above

Additional context

Blocks #435

ruebot added a commit that referenced this issue Apr 8, 2020
- Resolves #436
- Fix WWW prefix removal for RDD, which was double escaping
- Update DF so it matches RDD output (it wasn't even close before :facepalm:)
- Update tests so they're basically testing the same thing