Align RDD and DF output for DomainGraphExtractor. #437

Merged
ianmilligan1 merged 7 commits into master from issue-436 on Apr 8, 2020

ruebot commented Apr 8, 2020

GitHub issue(s): #436

What does this Pull Request do?

Align RDD and DF output for DomainGraphExtractor.

- Resolves #436
- Fix WWW prefix removal for RDD, which was double escaping
- Update DF so it matches RDD output (it wasn't even close before :facepalm:)
- Update tests so they're basically testing the same thing

How should this be tested?

TravisCI, plus some version of this:

bin/spark-submit --class io.archivesunleashed.app.CommandLineAppRunner /home/nruest/Projects/au/aut/target/aut-0.50.1-SNAPSHOT-fatjar.jar --extractor DomainGraphExtractor --input /home/nruest/Projects/au/sample-data/geocities/* --output /home/nruest/Projects/au/sample-data/app-output/issue-436-rdd --output-format TEXT --partition 1
bin/spark-submit --class io.archivesunleashed.app.CommandLineAppRunner /home/nruest/Projects/au/aut/target/aut-0.50.1-SNAPSHOT-fatjar.jar --extractor DomainGraphExtractor --input /home/nruest/Projects/au/sample-data/geocities/* --output /home/nruest/Projects/au/sample-data/app-output/issue-436-df --output-format TEXT --df --partition 1

The output files from these two runs should have the same line count:

[nruest@wombat:app-output]$ wc -l issue-436-rdd/part-00000 issue-436-df/part-00000-10a96d3c-7f35-4bba-9239-8fb23997612c-c000.csv
  4874 issue-436-rdd/part-00000
  4874 issue-436-df/part-00000-10a96d3c-7f35-4bba-9239-8fb23997612c-c000.csv
  9748 total
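
The same check can be scripted when an output directory holds more than one part file. This is a minimal sketch, not part of the PR: `count_lines` and `same_line_count` are hypothetical helpers, and the `part-*` file naming is assumed from the listing above.

```python
from pathlib import Path


def count_lines(output_dir: str, pattern: str = "part-*") -> int:
    """Sum the line counts of all part files in one output directory.

    The pattern skips hidden checksum files because their names do not
    start with "part-". Unlike `wc -l`, this also counts a final line
    that lacks a trailing newline.
    """
    total = 0
    for part in sorted(Path(output_dir).glob(pattern)):
        with part.open("rb") as handle:
            total += sum(1 for _ in handle)
    return total


def same_line_count(rdd_dir: str, df_dir: str) -> bool:
    """Return True when both output directories hold the same number of lines."""
    return count_lines(rdd_dir) == count_lines(df_dir)
```

Calling `same_line_count("issue-436-rdd", "issue-436-df")` from the `app-output` directory would then mirror the `wc -l` comparison.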

Additional Notes

This should unblock #435.

It's also worth noting this:

https://github.com/archivesunleashed/aut/blob/issue-436/src/main/scala/io/archivesunleashed/app/DomainGraphExtractor.scala#L42 vs https://github.com/archivesunleashed/aut/blob/issue-436/src/main/scala/io/archivesunleashed/package.scala#L184

They should be doing the same thing, but on the DataFrame side, we still get empty src or dest values. That's why https://github.com/archivesunleashed/aut/blob/issue-436/src/main/scala/io/archivesunleashed/app/DomainGraphExtractor.scala#L62-L63 is there.
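
To illustrate why such a guard is needed, here is a minimal Python sketch. It is not the aut implementation (which is Scala): the helper names `remove_prefix_www`, `extract_domain`, and `domain_pairs` and the use of `urllib.parse` are assumptions for illustration. It shows how domain extraction can yield an empty string for some links, and how a filter on empty `src` or `dest` values keeps those rows out of the graph.

```python
import re
from urllib.parse import urlparse


def remove_prefix_www(host: str) -> str:
    """Strip a single leading 'www.' (case-insensitive) from a host name."""
    return re.sub(r"^www\.", "", host, flags=re.IGNORECASE)


def extract_domain(url: str) -> str:
    """Return the host part of a URL, or '' when no host can be parsed.

    Note: urlparse(...).hostname lowercases the host, which the real
    extractor may not do.
    """
    try:
        return urlparse(url).hostname or ""
    except ValueError:
        return ""


def domain_pairs(links):
    """Yield (src, dest) domain pairs, skipping pairs with an empty side."""
    for src_url, dest_url in links:
        src = remove_prefix_www(extract_domain(src_url))
        dest = remove_prefix_www(extract_domain(dest_url))
        if src and dest:  # the guard: drop empty src or dest values
            yield src, dest
```

Links such as `mailto:` addresses or empty strings have no host, so without the final check they would produce rows with an empty `src` or `dest`.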

ruebot added 7 commits Feb 10, 2020
@ruebot ruebot requested review from lintool and ianmilligan1 Apr 8, 2020
codecov bot commented Apr 8, 2020

Codecov Report

Merging #437 into master will increase coverage by 0.05%.
The diff coverage is 97.36%.

@@            Coverage Diff             @@
##           master     #437      +/-   ##
==========================================
+ Coverage   77.99%   78.04%   +0.05%     
==========================================
  Files          43       43              
  Lines        1554     1558       +4     
  Branches      286      286              
==========================================
+ Hits         1212     1216       +4     
  Misses        217      217              
  Partials      125      125              

ianmilligan1 left a comment

Looks good - tried it out.

One proviso - the output of this has node IDs like:

<node id="2343ec78a04c6ea9d80806345d31fd78" label="facebook.com" />
<node id="9cce24c55aee4eb39845fde935cca3da" label="web.net" />
<node id="5399465c5b23df17b16c2377e865a0b2" label="PetitionOnline.com" />
<node id="1fbfb6126d36fd25c16de2b0142700d8" label="traduku.net" />
<node id="d1063af181fe606e55ed93dd5b867169" label="en.wikipedia.org" />
<node id="0412791bbc450bbeb5b7d35eaed7e4f2" label="calendarix.com" />
<node id="fb1c73ca981330da55c56e07be521842" label="goodsforgreens.myshopify.com" />

Whereas if we were to run a script like this one in aut-docs, we get:

<node id="76" label="liberalpartyofcanada-mb.ca" />
<node id="80" label="lpco.ca" />
<node id="84" label="snapdesign.ca" />
<node id="88" label="PetitionOnline.com" />
<node id="92" label="egale.ca" />
<node id="96" label="liberal.nf.net" />
<node id="100" label="policyalternatives.ca" />
<node id="1" label="collectionscanada.ca" />

The behaviour of DomainGraphExtractor is preferable to the WriteGraph(links, "links-for-gephi.gexf") output.
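
The 32-character hexadecimal IDs in the first listing have the shape of an MD5 digest. The following Python sketch assumes, without confirmation from this thread, that each ID is a hash of the node's label; `hashed_id` and `sequential_ids` are hypothetical helpers. It shows one reason hash-based IDs are attractive: they depend only on the label, while counter-based IDs depend on the order in which nodes are encountered.

```python
import hashlib


def hashed_id(label: str) -> str:
    """Derive a node ID from the label alone (assumed scheme: MD5 hex digest)."""
    return hashlib.md5(label.encode("utf-8")).hexdigest()


def sequential_ids(labels):
    """Assign counter-based IDs in first-seen order, starting at 1."""
    ids = {}
    for label in labels:
        ids.setdefault(label, str(len(ids) + 1))
    return ids


labels = ["facebook.com", "web.net", "PetitionOnline.com"]
reordered = list(reversed(labels))

# Hash-based IDs are identical no matter how the input is ordered.
hash_ids_a = {label: hashed_id(label) for label in labels}
hash_ids_b = {label: hashed_id(label) for label in reordered}

# Counter-based IDs change when the same labels arrive in another order.
seq_ids_a = sequential_ids(labels)
seq_ids_b = sequential_ids(reordered)
```

Here `hash_ids_a == hash_ids_b` holds, while `seq_ids_a` and `seq_ids_b` differ, so graphs produced by separate runs can only be compared or merged by ID in the hashed case.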

@ianmilligan1 ianmilligan1 merged commit 96899f4 into master Apr 8, 2020
3 checks passed
codecov/patch 97.36% of diff hit (target 77.99%)
codecov/project 78.04% (+0.05%) compared to eed5a4f
continuous-integration/travis-ci/pr The Travis CI build passed
@ianmilligan1 ianmilligan1 deleted the issue-436 branch Apr 8, 2020

ruebot commented Apr 8, 2020

@ianmilligan1 can you open up an issue for that? That's a good catch. Those should all be aligned.


ianmilligan1 commented Apr 9, 2020

@ruebot Will do tomorrow!
