Add "Find Images Shared Between Domains" section. #27

Merged
ianmilligan1 merged 2 commits into master from aut-issue-237 on Nov 26, 2019

Member

ruebot commented Nov 21, 2019

Feel free to wordsmith. Rough draft here 😄

...though I'm not 100% sure this hits the original criteria in the issue for images larger than 50x50 🤷‍♂

val result = total
  .join(links, "MD5")
  .groupBy("Domain","MD5")
  .agg(first("ImageUrl")

lintool Nov 21, 2019

Member

I think this is more a matter of taste, but in cases like this I wouldn't strictly follow indentation conventions. I'd rather do:

  .agg(first("ImageUrl").as("ImageUrl")).orderBy(asc("MD5"))
  .write.format("csv").option("header","true").mode("Overwrite").save("/path/to/output")

Semantically, the calls on each line do something coherent when taken together. (And the line isn't that long...)

But I'm agnostic.
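
For readers without the diff open, here is a minimal sketch of the whole chain laid out the way suggested above, with one coherent step per line. It is not the code from this PR: the toy `total` rows, the way `links` is derived (MD5s seen on at least two distinct domains), and the `SharedImagesSketch` object are assumptions made for illustration. Only the column names and the chained calls come from the snippet under review, and the 50x50 size criterion mentioned earlier is not covered.

```scala
import org.apache.spark.sql.{DataFrame, SparkSession}
import org.apache.spark.sql.functions.{asc, col, countDistinct, first}

// Hypothetical wrapper object; not part of the PR or of the toolkit.
object SharedImagesSketch {

  // Expects columns "Domain", "ImageUrl" and "MD5" (names taken from the snippet).
  def sharedImages(total: DataFrame): DataFrame = {
    // Assumption: "links" holds the MD5s that occur on two or more distinct domains.
    val links = total
      .groupBy("MD5").agg(countDistinct("Domain").as("DomainCount"))
      .where(col("DomainCount") >= 2).select("MD5")

    // One semantic step per line: join, group, aggregate and sort.
    // Note that first() picks an arbitrary URL when a group has several rows.
    total
      .join(links, "MD5")
      .groupBy("Domain", "MD5")
      .agg(first("ImageUrl").as("ImageUrl")).orderBy(asc("MD5"))
  }

  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .appName("shared-images-sketch").master("local[*]").getOrCreate()
    import spark.implicits._

    // Toy stand-in for the extracted image details (made-up values).
    val total = Seq(
      ("example.com", "http://example.com/a.png", "md5-aaa"),
      ("example.org", "http://example.org/img/a.png", "md5-aaa"),
      ("example.com", "http://example.com/b.png", "md5-bbb")
    ).toDF("Domain", "ImageUrl", "MD5")

    // Output path is a placeholder; pass a real one as the first argument.
    val outputPath = args.headOption.getOrElse("/path/to/output")

    sharedImages(total)
      .write.format("csv").option("header", "true").mode("Overwrite").save(outputPath)

    spark.stop()
  }
}
```

With the toy data, only the `md5-aaa` rows survive the join, because `md5-bbb` appears on a single domain. Keeping the transformation in `sharedImages` and the `write` in `main` is a choice made here so the result can be inspected before it is saved; the one-liner style works just as well when everything is chained.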

ruebot Nov 21, 2019

Author Member

I'll create an issue that'll be a TODO before we do our first publish, to go through and make formatting consistent.

Member

ianmilligan1 left a comment

Looks like it satisfies the requirements of that code request. Thanks @ruebot!

Also, maybe swap out the /path/to/warcs with example.arc.gz for consistency.

The code'll have to be updated to reflect our rapidly evolving syntax – ExtractDomain -> ExtractDomainDF; ExtractImageLinks -> ExtractImageLinksDF etc.

(I'm agnostic on formatting!)

Member

ianmilligan1 left a comment

Looks great - sorry for the delay on this review @ruebot.

ianmilligan1 merged commit e774c6c into master on Nov 26, 2019
ianmilligan1 deleted the aut-issue-237 branch on Nov 26, 2019