Add Scala DF documentation for AUK derivatives. #34

Draft
wants to merge 1 commit into
base: master
from
Draft

Member

ruebot commented Jan 1, 2020

@lintool @ianmilligan1 here are the first two. We still need to do the third derivative, and I'll move this out of draft when we get it done.

@SinghGursimran can you make this your next focus point in archivesunleashed/aut#223? Converting this (below) to DF?

val links = validPages
  .map(r => (r.getCrawlDate, ExtractLinks(r.getUrl, r.getContentString)))
  .flatMap(r => r._2.map(f => (r._1, ExtractDomain(f._1).replaceAll("^\\s*www\\.", ""), ExtractDomain(f._2).replaceAll("^\\s*www\\.", ""))))
  .filter(r => r._2 != "" && r._3 != "")
  .countItems()
  .filter(r => r._2 > 5)

WriteGraph.asGraphml(links, "example.graphml")
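
For anyone reviewing the conversion, here is a rough plain-Scala sketch (no Spark, no aut) of what the chain above computes on already-extracted domain pairs. `LinkCountSketch`, `stripWww`, and `countLinks` are hypothetical names made up for illustration, and the input triples stand in for the output of `ExtractLinks`/`ExtractDomain`; `countItems()` is approximated here with a local `groupBy`.

```scala
object LinkCountSketch {
  // Strip optional leading whitespace followed by "www." from a domain.
  def stripWww(domain: String): String =
    domain.replaceAll("^\\s*www\\.", "")

  // Normalise domains, drop links with an empty endpoint, count identical
  // (date, source, target) triples, and keep those seen more than minCount times.
  def countLinks(
      links: Seq[(String, String, String)],
      minCount: Int
  ): Map[(String, String, String), Int] =
    links
      .map { case (date, from, to) => (date, stripWww(from), stripWww(to)) }
      .filter { case (_, from, to) => from.nonEmpty && to.nonEmpty }
      .groupBy(identity)
      .map { case (triple, hits) => (triple, hits.size) }
      .filter { case (_, n) => n > minCount }
}
```

Calling `LinkCountSketch.countLinks(triples, 5)` mirrors the `> 5` threshold used above; the DF version should yield the same triples and counts for the same input.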

ruebot requested review from lintool and ianmilligan1 Jan 1, 2020
Member

ianmilligan1 left a comment

Tested and all checks out - I'll keep an eye on the draft PR @ruebot!

Contributor

SinghGursimran commented Jan 2, 2020

@ruebot
Here is a DataFrame implementation of the above query:

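import io.archivesunleashed._
import io.archivesunleashed.df._
import org.apache.spark.sql.functions.{desc, explode_outer, udf}

// Each exploded link is a struct; its string form is split on commas below.
// Second comma-separated field: the link target.
val target = udf((vs: Any) => if (vs != null) vs.toString.split(",")(1) else "")

// First comma-separated field: the link source, minus its leading character
// (the opening "[" of the struct's string form).
val src = udf((vs: Any) => if (vs != null) vs.toString.split(",")(0).drop(1) else "")

// Strip a leading "www." (and any whitespace before it) from a domain.
val modify = udf((str: String) => str.replaceAll("^\\s*www\\.", ""))

// One row per (crawl_date, link) pair extracted from each web page.
val df = RecordLoader.loadArchives("./src/test/resources/arc/example.arc.gz", sc).webpages()
  .select($"crawl_date", explode_outer(ExtractLinksDF($"url", $"content")).as("link"))

// Count links per (crawl_date, Source, Target) and keep those seen more than 5 times.
df.select($"crawl_date", modify(ExtractDomainDF(src($"link"))).as("Source"), modify(ExtractDomainDF(target($"link"))).as("Target"))
  .filter($"Source" =!= "")
  .filter($"Target" =!= "")
  .groupBy("crawl_date", "Source", "Target")
  .count()
  .filter($"count" > 5)
  .orderBy(desc("count"))
  .show(20)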

We do not have a DF implementation for writing the graph yet. I will add that code and then update this comment.
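
In the meantime, a minimal stopgap sketch in plain Scala, assuming the filtered rows are small enough to collect to the driver as `(crawlDate, source, target, count)` tuples. `GraphmlSketch` and `toGraphml` are hypothetical names, not part of aut, and the output is not guaranteed to match what `WriteGraph.asGraphml` produces; it only emits basic GraphML with `weight` and `crawlDate` edge attributes.

```scala
object GraphmlSketch {
  // Escape characters that are unsafe in XML attribute values and text nodes.
  private def esc(s: String): String =
    s.replace("&", "&amp;").replace("<", "&lt;").replace(">", "&gt;").replace("\"", "&quot;")

  // Render (crawlDate, source, target, count) edges as a GraphML document string.
  def toGraphml(edges: Seq[(String, String, String, Long)]): String = {
    // Every distinct domain that appears as a source or target becomes a node.
    val nodes = edges.flatMap { case (_, from, to, _) => Seq(from, to) }.distinct
    val nodeXml = nodes.map(n => s"""    <node id="${esc(n)}"/>""")
    val edgeXml = edges.map { case (date, from, to, n) =>
      s"""    <edge source="${esc(from)}" target="${esc(to)}">""" + "\n" +
      s"""      <data key="weight">$n</data>""" + "\n" +
      s"""      <data key="crawlDate">${esc(date)}</data>""" + "\n" +
      "    </edge>"
    }
    val header = Seq(
      """<?xml version="1.0" encoding="UTF-8"?>""",
      """<graphml xmlns="http://graphml.graphdrawing.org/xmlns">""",
      """  <key id="weight" for="edge" attr.name="weight" attr.type="long"/>""",
      """  <key id="crawlDate" for="edge" attr.name="crawlDate" attr.type="string"/>""",
      """  <graph edgedefault="directed">"""
    )
    (header ++ nodeXml ++ edgeXml ++ Seq("  </graph>", "</graphml>")).mkString("\n")
  }
}
```

The returned string could then be written to a file such as `example.graphml` with any standard file writer; how the tuples are pulled out of the DataFrame is left open here.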
