Skip to content
Please note that GitHub no longer supports your web browser.

We recommend upgrading to the latest Google Chrome or Firefox.

Learn more
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

ExtractDomain or ExtractBaseDomain? #367

Closed
lintool opened this issue Oct 25, 2019 · 6 comments
Assignees
Labels

Comments

@lintool
Copy link
Member

@lintool lintool commented Oct 25, 2019

In Scala RDD, we have:

import io.archivesunleashed._
import io.archivesunleashed.matchbox._

RecordLoader.loadArchives("src/test/resources/warc/example.warc.gz", sc).keepValidPages()
  .map(r => ExtractDomain(r.getUrl))
  .countItems()
  .take(10)

In Scala DF, we have:

import io.archivesunleashed._
import io.archivesunleashed.df._

RecordLoader.loadArchives("src/test/resources/warc/example.warc.gz", sc).extractValidPagesDF()
  .select(ExtractBaseDomain($"Url").as("Domain"))
  .groupBy("Domain").count().orderBy(desc("count"))
  .show(20, False)

We should be consistent: ExtractDomain or ExtractBaseDomain?

@ruebot ruebot self-assigned this Oct 25, 2019
@ruebot

This comment has been minimized.

Copy link
Member

@ruebot ruebot commented Oct 25, 2019

Let's go with ExtractDomain.

@ruebot

This comment has been minimized.

Copy link
Member

@ruebot ruebot commented Oct 25, 2019

Shall we just rename it to ExtractDomainDF (in the interim period between moving migrating all the RDDs to DFs)?

val ExtractBaseDomain = udf(io.archivesunleashed.matchbox.ExtractDomain.apply(_: String, ""))

@lintool

This comment has been minimized.

Copy link
Member Author

@lintool lintool commented Oct 25, 2019

Any reason why you want the DF suffix?
I suppose there's the potential for confusion, but I like the beauty of uniformity...

@ruebot

This comment has been minimized.

Copy link
Member

@ruebot ruebot commented Oct 25, 2019

Oh, just keeping it consistent with how we've done the same with the other DF functions like here

def extractHyperlinksDF(): DataFrame = {

...I'm happy to keep it uniform too.

@lintool

This comment has been minimized.

Copy link
Member Author

@lintool lintool commented Oct 25, 2019

Although per #366 we would rename that to just hyperlinks? To make consistent with Python?

@lintool

This comment has been minimized.

Copy link
Member Author

@lintool lintool commented Oct 25, 2019

Okay, so stick to ExtractDomain?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Projects
None yet
2 participants
You can’t perform that action at this time.