Add extract-simple-site-link-structure DF example. #35
The RDD version: https://github.com/archivesunleashed/aut-docs/blob/master/current/link-analysis.md#extract-simple-site-link-structure has
To remove cases where the
Oh, hrm
No diff here locally, but a relatively small data set I'm testing on.
RecordLoader.loadArchives("example.arc.gz", sc).webgraph()
  .groupBy(RemovePrefixWWWDF(ExtractDomainDF($"src")).as("src"), RemovePrefixWWWDF(ExtractDomainDF($"dest")).as("dest"))
  .count()
  .filter(($"src".isNotNull) || ($"dest".isNotNull))
lintool
Jan 8, 2020
Member
I would prefer above since it makes more sense logically.
Don't need the paren, right?
And unless filter works differently in DF vs. RDD, should it be && instead of ||?
lintool
Jan 8, 2020
Member
That is, we only want to keep all links where both the src and dest are not null?
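To make the difference between the two operators concrete, here is a minimal sketch in plain Scala with no Spark dependency. It assumes that None can stand in for a null src or dest value; the object NullFilterSketch, the Link alias, the sample rows, and both helper functions are hypothetical names used only for illustration and are not part of aut.

```scala
// Hypothetical stand-in for webgraph rows: (src, dest), where None models a null domain.
object NullFilterSketch {
  type Link = (Option[String], Option[String])

  // Made-up sample rows covering all four null/non-null combinations.
  val links: Seq[Link] = Seq(
    (Some("example.com"), Some("archive.org")),
    (Some("example.com"), None),
    (None, Some("archive.org")),
    (None, None)
  )

  // Mirrors `$"src".isNotNull || $"dest".isNotNull`:
  // a row survives if at least one side is present, so half-null rows are kept.
  def keepEither(rows: Seq[Link]): Seq[Link] =
    rows.filter { case (src, dest) => src.isDefined || dest.isDefined }

  // Mirrors `$"src".isNotNull && $"dest".isNotNull`:
  // a row survives only if both sides are present.
  def keepBoth(rows: Seq[Link]): Seq[Link] =
    rows.filter { case (src, dest) => src.isDefined && dest.isDefined }
}
```

With these sample rows, keepEither drops only the row where both sides are missing, while keepBoth keeps only the fully populated link, which is the behavior described in the comment above.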
lintool
Jan 8, 2020
Member
Actually, now that I think about it, wouldn't it make more sense to push .filter($"src".isNotNull && $"dest".isNotNull) into .webgraph() itself?
I can't imagine the user wanting nulls in the webgraph?
This PR now depends on this PR.
ruebot commented Jan 8, 2020
After you cat the files together, you should have something like this: