Add Scala DF documentation for AUK derivatives. #34

Merged
merged 6 commits into from Apr 14, 2020


ruebot commented Jan 1, 2020

@lintool @ianmilligan1 here are the first two. We still need to do the third derivative, and I'll move this out of draft when we get it done.

@SinghGursimran can you make this your next focus point in archivesunleashed/aut#223? Converting this (below) to DF?

val links = validPages
  .map(r => (r.getCrawlDate, ExtractLinks(r.getUrl, r.getContentString)))
  .flatMap(r => r._2.map(f => (r._1,
                               ExtractDomain(f._1).replaceAll("^\\s*www\\.", ""),
                               ExtractDomain(f._2).replaceAll("^\\s*www\\.", ""))))
  .filter(r => r._2 != "" && r._3 != "")
  .countItems()
  .filter(r => r._2 > 5)

WriteGraph.asGraphml(links, "example.graphml")
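
To make the `www` stripping step concrete, here is a minimal standalone sketch in plain Scala that can be pasted into spark-shell or a Scala REPL. It needs neither Spark nor aut. The helper name `stripWww` and the sample domains are hypothetical and chosen only for illustration; the pattern is the `"^\\s*www\\."` form used in the RDD example later in this thread.

```scala
// Hypothetical helper (not part of aut): removes optional leading
// whitespace followed by a literal "www." from the start of a domain.
def stripWww(domain: String): String =
  domain.replaceAll("^\\s*www\\.", "")

// Made-up sample domains for illustration only.
val samples = Seq("www.example.com", "  www.archive.org", "example.com", "wwwexample.com")

val cleaned = samples.map(stripWww)
cleaned.foreach(println)
```

Only a leading `www.` is removed, so `wwwexample.com` (no dot after `www`) is left unchanged.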
@ruebot ruebot requested review from lintool and ianmilligan1 Jan 1, 2020

ianmilligan1 left a comment

Tested and all checks out - I'll keep an eye on the draft PR @ruebot!


SinghGursimran commented Jan 2, 2020

@ruebot
Dataframe implementation for the above query:

import io.archivesunleashed._
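import io.archivesunleashed.df._
// udf, explode_outer and desc; spark-shell normally has these in scope already.
import org.apache.spark.sql.functions._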

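// Second comma-separated field of the link's string form, used as the link target.
val target = udf((vs: Any) =>
  if (vs != null) vs.toString.split(",")(1) else "")

// First comma-separated field with its first character dropped, used as the link source.
val src = udf((vs: Any) => {
  if (vs == null) ""
  else {
    val s = vs.toString.split(",")(0)
    if (s.nonEmpty) s.drop(1) else ""
  }
})

// Strip optional leading whitespace and a "www." prefix from a domain.
val modify = udf((str: String) => str.replaceAll("^\\s*www\\.", ""))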

val df = RecordLoader.loadArchives("./src/test/resources/arc/example.arc.gz", sc)
  .webpages()
  .select($"crawl_date", explode_outer(ExtractLinksDF($"url", $"content")).as("link"))

df.select(
    $"crawl_date",
    modify(ExtractDomainDF(src($"link"))).as("Source"),
    modify(ExtractDomainDF(target($"link"))).as("Target"))
  .filter($"Source" =!= "")
  .filter($"Target" =!= "")
  .groupBy("crawl_date", "Source", "Target")
  .count()
  .filter($"count" > 5)
  .orderBy(desc("count"))
  .show(20)

We do not have a DataFrame implementation for writing the graph yet. I will add that code and then update this comment.
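
The `src` and `target` UDFs above depend on the string form of each exploded link. As a rough sketch of that parsing, assuming a link renders as `[source,destination,anchor text]` (an assumption about the row's `toString` output, not something confirmed in this thread), the same logic in plain Scala looks like this. The names `srcOf` and `targetOf` and the sample value are hypothetical.

```scala
// Assumed string form of one exploded link: "[source,destination,anchor]".
// The URLs here are made up for illustration.
val link = "[http://www.example.com/a,http://www.archive.org/b,About]"

// Mirrors the `src` UDF: first comma-separated field, minus its first character ("[").
def srcOf(vs: Any): String =
  if (vs == null) ""
  else {
    val s = vs.toString.split(",")(0)
    if (s.nonEmpty) s.drop(1) else ""
  }

// Mirrors the `target` UDF: second comma-separated field.
def targetOf(vs: Any): String =
  if (vs == null) "" else vs.toString.split(",")(1)

println(srcOf(link))
println(targetOf(link))
```

Because the fields are split on commas, a URL that itself contains a comma would be cut short, and a value with no comma at all would make `targetOf` throw. Both caveats apply to the UDFs as written too.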

ruebot added 2 commits Apr 14, 2020
…to auk-derv-df
@ruebot ruebot marked this pull request as ready for review Apr 14, 2020

ruebot commented Apr 14, 2020

DataFrame graphml tested locally with:

df:

import io.archivesunleashed._
import io.archivesunleashed.df._
import io.archivesunleashed.app._

sc.setLogLevel("INFO")

// Web archive collection; web graph.
val webgraph = RecordLoader.loadArchives("/home/nruest/Projects/au/sample-data/geocities", sc)
  .webgraph()


val graph = webgraph
  .groupBy(
    $"crawl_date",
    RemovePrefixWWWDF(ExtractDomainDF($"src")).as("src_domain"),
    RemovePrefixWWWDF(ExtractDomainDF($"dest")).as("dest_domain"))
  .count()
  .filter(!($"dest_domain" === ""))
  .filter(!($"src_domain" === ""))
  .filter($"count" > 5)
  .orderBy(desc("count"))

WriteGraphML(graph.collect(), "/home/nruest/Projects/au/sample-data/issue-439/documentation-pr-test-df.graphml")

rdd:

import io.archivesunleashed._
import io.archivesunleashed.app._
import io.archivesunleashed.matchbox._
  
sc.setLogLevel("INFO")

// Web archive collection.
val warcs = RecordLoader.loadArchives("/home/nruest/Projects/au/sample-data/geocities", sc)
  .keepValidPages()


// GraphML.
val links = warcs
  .map(r => (r.getCrawlDate, ExtractLinksRDD(r.getUrl, r.getContentString)))
  .flatMap(r => r._2.map(f => (r._1,
                               ExtractDomainRDD(f._1).replaceAll("^\\s*www\\.", ""),
                               ExtractDomainRDD(f._2).replaceAll("^\\s*www\\.", ""))))
  .filter(r => r._2 != "" && r._3 != "")
  .countItems()
  .filter(r => r._2 > 5)

WriteGraphML(links, "/home/nruest/Projects/au/sample-data/issue-439/documentation-pr-test-rdd.graphml")
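
Both versions above come down to the same idea: count how often each (crawl date, source domain, destination domain) triple occurs, drop triples with an empty domain, and keep those seen more than five times. Below is a minimal sketch of that logic with plain Scala collections and no Spark. The sample edges are made up for illustration and are not taken from the GeoCities collection.

```scala
// Hypothetical sample edges: (crawl_date, src_domain, dest_domain).
val edges: Seq[(String, String, String)] =
  Seq.fill(6)(("20200101", "example.com", "archive.org")) ++
  Seq.fill(2)(("20200101", "example.com", "example.org")) ++
  Seq(("20200101", "", "archive.org"))

val graph = edges
  .filter { case (_, src, dest) => src.nonEmpty && dest.nonEmpty } // drop empty domains
  .groupBy(identity)                                               // group identical triples
  .map { case (edge, occurrences) => (edge, occurrences.size) }    // count each triple
  .filter { case (_, count) => count > 5 }                         // keep edges seen more than 5 times
  .toSeq
  .sortBy { case (_, count) => -count }                            // most frequent first

graph.foreach(println)
```

With this sample data only the `example.com` to `archive.org` edge survives, with a count of 6. In the Spark versions, `countItems()` and `groupBy(...).count()` play roughly the role of the `groupBy(identity)` and `map` steps here, distributed across the cluster.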
ruebot added 3 commits Apr 14, 2020
…to auk-derv-df

ianmilligan1 left a comment

Works like a charm!

@ianmilligan1 ianmilligan1 merged commit f5b2652 into master Apr 14, 2020
2 checks passed
@ianmilligan1 ianmilligan1 deleted the auk-derv-df branch Apr 14, 2020