Add extract-simple-site-link-structure DF example. #35

Merged
merged 4 commits into from Jan 8, 2020

Member

ruebot commented Jan 8, 2020

After you cat the files together, you should have something like this:

$ head sample.csv
geocities.com,home.ici.net,8
geocities.com,service.bfast.com,222
geocities.com,multicity.com,6
geocities.com,public.iastate.edu,7
geocities.com,volleyball.org,6
geocities.com,realitypornpass.com,91
geocities.com,ttrehber.gov.tr,6
geocities.com,pub16.bravenet.com,25
geocities.com,terravista.pt,81
geocities.com,img.photobucket.com,325
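
For readers who would rather script the concatenation step than run `cat` by hand, here is a minimal sketch. It assumes the job wrote its output as `part-*` files in a single directory, which the comment above does not state; `ConcatParts` and `concat` are hypothetical names, not part of the toolkit.

```scala
import java.nio.file.{Files, Path, StandardOpenOption}
import scala.collection.mutable.ArrayBuffer

object ConcatParts {
  // Concatenate every part-* file in `dir`, in file-name order, into `out`.
  // Assumption: `out` lives outside `dir` or does not itself match part-*.
  def concat(dir: Path, out: Path): Unit = {
    val ds = Files.newDirectoryStream(dir, "part-*")
    val parts =
      try {
        val it = ds.iterator()
        val buf = ArrayBuffer.empty[Path]
        while (it.hasNext) buf += it.next()
        buf.sortBy(_.getFileName.toString).toList
      } finally ds.close()

    // Create the output file, or truncate it if it already exists.
    Files.write(out, Array.emptyByteArray)
    // Append each part's bytes unchanged.
    parts.foreach { p =>
      Files.write(out, Files.readAllBytes(p), StandardOpenOption.APPEND)
    }
  }
}
```

Files that do not match `part-*` (marker files, checksums) are skipped by the glob.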

ruebot requested review from lintool and ianmilligan1 Jan 8, 2020
Member

lintool commented Jan 8, 2020

The RDD version: https://github.com/archivesunleashed/aut-docs/blob/master/current/link-analysis.md#extract-simple-site-link-structure

has

  .filter(r => r._1 != "" && r._2 != "")

To remove cases where the ExtractDomainRDD UDF returns nothing... does the DF version do this automagically?
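
To make the effect of that RDD filter concrete, here is a small sketch using plain Scala collections in place of an RDD. `extractDomain` is a hypothetical stand-in built on `java.net.URI`, not the toolkit's implementation; the only assumption carried over from the thread is that extraction yields an empty string when it finds nothing.

```scala
object EmptyDomainFilter {
  // Hypothetical stand-in for a domain extractor:
  // returns "" when no host can be parsed from the input.
  def extractDomain(url: String): String =
    try {
      val host = new java.net.URI(url).getHost
      if (host == null) "" else host
    } catch {
      case _: java.net.URISyntaxException => ""
    }

  // Same predicate as the RDD example: keep a pair only if
  // both the source and the destination are non-empty.
  def keepNonEmpty(pairs: Seq[(String, String)]): Seq[(String, String)] =
    pairs.filter(r => r._1 != "" && r._2 != "")
}
```

A pair survives only when both sides produced a domain, so a link from or to something unparseable is dropped.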

Member Author

ruebot commented Jan 8, 2020

Oh, hrm, explode removes nulls, but I'm not sure if this one does. I'll double check. Good call!

Member Author

ruebot commented Jan 8, 2020

No diff here locally, but I'm only testing on a relatively small data set.

$ diff sample2.csv sample.csv

RecordLoader.loadArchives("example.arc.gz", sc).webgraph()
.groupBy(RemovePrefixWWWDF(ExtractDomainDF($"src")).as("src"), RemovePrefixWWWDF(ExtractDomainDF($"dest")).as("dest"))
.count()
.filter(($"src".isNotNull) || ($"dest".isNotNull))


ruebot Jan 8, 2020

Author Member

This could go here, or above.


lintool Jan 8, 2020

Member

I would prefer above since it makes more sense logically.
Don't need the paren, right?
And unless filter works differently in DF vs. RDD, should it be && instead of ||?


lintool Jan 8, 2020

Member

That is, we only want to keep all links where both the src and dest are not null?
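
The difference between the two operators can be sketched without Spark. The model below uses `Option` for a nullable column and plain Boolean logic, which should be a fair stand-in because an `isNotNull` test itself never evaluates to null; `Edge`, `keepEither`, and `keepBoth` are hypothetical names for illustration only.

```scala
object NullFilterSemantics {
  // One edge of the link graph; None models a null column value.
  final case class Edge(src: Option[String], dest: Option[String])

  // Mirrors `isNotNull || isNotNull`:
  // an edge is dropped only when BOTH sides are null.
  def keepEither(edges: Seq[Edge]): Seq[Edge] =
    edges.filter(e => e.src.isDefined || e.dest.isDefined)

  // Mirrors `isNotNull && isNotNull`:
  // an edge is kept only when both sides are present.
  def keepBoth(edges: Seq[Edge]): Seq[Edge] =
    edges.filter(e => e.src.isDefined && e.dest.isDefined)
}
```

With `||`, half-null edges pass through; with `&&`, only complete edges remain, which matches the RDD predicate.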


ruebot Jan 8, 2020

Author Member

🤦‍♂ yeah, &&

I'll update.


lintool Jan 8, 2020

Member

Actually, now that I think about it, wouldn't it make more sense to push
.filter($"src".isNotNull && $"dest".isNotNull) into .webgraph() itself?

I can't imagine the user wanting nulls in the webgraph?
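
As a rough illustration of that design choice, the sketch below puts the null check inside a hypothetical `webgraph` function so that callers never see incomplete edges. It uses plain Scala collections, and none of the names (`RawLink`, `Edge`, `webgraph`, `countEdges`) reflect the toolkit's actual code.

```scala
object WebgraphSketch {
  // Raw link record as an extractor might emit it; None models a null value.
  final case class RawLink(src: Option[String], dest: Option[String])

  // Cleaned edge exposed to callers: both endpoints are guaranteed present.
  final case class Edge(src: String, dest: String)

  // Hypothetical stand-in for webgraph(): the null filter lives inside,
  // so incomplete links are discarded before anything is returned.
  def webgraph(raw: Seq[RawLink]): Seq[Edge] =
    raw.collect { case RawLink(Some(s), Some(d)) => Edge(s, d) }

  // Downstream grouping no longer needs its own null filter.
  def countEdges(raw: Seq[RawLink]): Map[(String, String), Int] =
    webgraph(raw)
      .groupBy(e => (e.src, e.dest))
      .map { case (key, group) => key -> group.size }
}
```

The trade-off is that every consumer gets the filtered view; anyone who did want the incomplete rows would need a separate entry point.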

Member Author

ruebot commented Jan 8, 2020

This PR now depends on this PR.

lintool approved these changes Jan 8, 2020
ruebot merged commit 4186cd9 into master Jan 8, 2020
ruebot deleted the extract-simple-site-link-structure-df branch Jan 8, 2020