Add extract-simple-site-link-structure DF example. #35

ruebot · 2020-01-08T01:51:00Z

After you cat the files together, you should have something like this:

$ head sample.csv
geocities.com,home.ici.net,8
geocities.com,service.bfast.com,222
geocities.com,multicity.com,6
geocities.com,public.iastate.edu,7
geocities.com,volleyball.org,6
geocities.com,realitypornpass.com,91
geocities.com,ttrehber.gov.tr,6
geocities.com,pub16.bravenet.com,25
geocities.com,terravista.pt,81
geocities.com,img.photobucket.com,325
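
A minimal sketch of the concatenation step, assuming the job wrote its results as Spark-style `part-*` files into one directory. The directory name `output/` and the two rows written below are placeholders for illustration, not paths or data from this PR.

```shell
# Work in a scratch directory so nothing real is overwritten.
workdir="$(mktemp -d)"
cd "$workdir"

# Stand-in for the job output: two part files (hypothetical names and rows).
mkdir -p output
printf 'geocities.com,home.ici.net,8\n' > output/part-00000
printf 'geocities.com,service.bfast.com,222\n' > output/part-00001

# Concatenate every part file into a single CSV, then preview it.
cat output/part-* > sample.csv
head sample.csv
```

The glob expands in sorted order, so the part files are joined in numeric order of their suffixes.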


        Add extract-simple-site-link-structure DF example.

lintool · 2020-01-08T02:20:14Z

The RDD version: https://github.com/archivesunleashed/aut-docs/blob/master/current/link-analysis.md#extract-simple-site-link-structure

has

  .filter(r => r._1 != "" && r._2 != "")

to remove cases where the ExtractDomainRDD UDF returns nothing (an empty string). Does the DF version do this automagically?

ruebot · 2020-01-08T02:21:18Z

Oh, hrm. explode removes nulls, but I'm not sure if this one does. I'll double-check. Good call!


        review

ruebot · 2020-01-08T02:52:03Z

No diff here locally, but I'm testing on a relatively small data set.

$ diff sample2.csv sample.csv
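
When comparing two runs this way, row order can differ between the files even when the contents match, so it may help to sort both before diffing. A sketch under that assumption; the file contents below are made up and only reuse the file names from the command above.

```shell
# Work in a scratch directory so nothing real is overwritten.
workdir="$(mktemp -d)"
cd "$workdir"

# Two made-up result files holding the same rows in a different order.
printf 'a.com,b.com,3\nc.com,d.com,1\n' > sample.csv
printf 'c.com,d.com,1\na.com,b.com,3\n' > sample2.csv

# Sort each file, then diff. No diff output and exit status 0 mean
# the two files contain the same rows.
sort sample.csv > sample.sorted.csv
sort sample2.csv > sample2.sorted.csv
if diff sample2.sorted.csv sample.sorted.csv; then
  echo "no differences"
fi
```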

ruebot · 2020-01-08T02:52:36Z

current/link-analysis.md

+RecordLoader.loadArchives("example.arc.gz", sc).webgraph()
+  .groupBy(RemovePrefixWWWDF(ExtractDomainDF($"src")).as("src"), RemovePrefixWWWDF(ExtractDomainDF($"dest")).as("dest"))
+  .count()
+  .filter(($"src".isNotNull) || ($"dest".isNotNull))


ruebot: This could go here, or above.

lintool: I would prefer above since it makes more sense logically.

lintool: We don't need the parens, right?
And unless filter works differently in DF vs. RDD, shouldn't it be && instead of ||?
That is, we only want to keep links where both the src and dest are not null.

ruebot: 🤦‍♂ yeah, &&. I'll update.

lintool: Actually, now that I think about it, wouldn't it make more sense to push
.filter($"src".isNotNull && $"dest".isNotNull) into .webgraph() itself?
I can't imagine the user wanting nulls in the webgraph.

ruebot: archivesunleashed/aut#400
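
The `&&` versus `||` question in this thread can be illustrated outside Spark. This is only an analogy, assuming empty CSV fields stand in for the null or blank domains discussed above; the file names and rows are made up.

```shell
# Work in a scratch directory so nothing real is overwritten.
workdir="$(mktemp -d)"
cd "$workdir"

# Four made-up rows: complete, blank src, blank dest, both blank.
printf 'a.com,b.com,3\n,b.com,2\na.com,,1\n,,4\n' > links.csv

# "||" keeps a row when either side is present, so half-blank rows survive.
awk -F, '$1 != "" || $2 != ""' links.csv > either.csv

# "&&" keeps a row only when both sides are present.
awk -F, '$1 != "" && $2 != ""' links.csv > both.csv

cat both.csv
```

Here `either.csv` still contains the two rows with one blank side, while `both.csv` contains only the complete link, which is the behavior the RDD filter expresses.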


        review


        review

ruebot · 2020-01-08T13:52:00Z

This PR now depends on archivesunleashed/aut#400.

Add extract-simple-site-link-structure DF example.

Verified

This commit was signed with a verified signature.

ruebot Nick Ruest

GPG key ID: 417FAF1A0E1080CD Learn about signing commits

0070873

ruebot requested review from lintool and ianmilligan1 Jan 8, 2020

review

1457a45

ruebot reviewed Jan 8, 2020

review

6505421

ruebot mentioned this pull request Jan 8, 2020

Filter blank src/dest out of webgraph. #400

Merged

review

18292ea

lintool approved these changes Jan 8, 2020

ruebot merged commit 4186cd9 into master Jan 8, 2020

ruebot deleted the extract-simple-site-link-structure-df branch Jan 8, 2020
