Add Scala DF documentation for AUK derivatives. #34
Tested and all checks out - I'll keep an eye on the draft PR @ruebot!
@ruebot

import io.archivesunleashed._
import io.archivesunleashed.df._

// Target URL: second comma-separated field of the stringified link.
val target = udf((vs: Any) => {
  var res = ""
  if (vs != null) {
    res = vs.toString.split(",")(1)
  }
  res
})

// Source URL: first comma-separated field, minus its leading character.
val src = udf((vs: Any) => {
  var res = ""
  if (vs != null) {
    val s = vs.toString.split(",")(0)
    if (s.length() != 0)
      res = s.drop(1)
  }
  res
})

// Strip a leading "www." (and any whitespace before it) from a domain.
val modify = udf((str: String) => str.replaceAll("^\\s*www\\.", ""))

val df = RecordLoader.loadArchives("./src/test/resources/arc/example.arc.gz", sc).webpages()
  .select($"crawl_date", explode_outer(ExtractLinksDF($"url", $"content")).as("link"))

df.select($"crawl_date", modify(ExtractDomainDF(src($"link"))).as("Source"), modify(ExtractDomainDF(target($"link"))).as("Target"))
  .filter($"Source" =!= "")
  .filter($"Target" =!= "")
  .groupBy("crawl_date", "Source", "Target")
  .count()
  .filter($"count" > 5)
  .orderBy(desc("count"))
  .show(20)
We do not have a DF implementation for writing the result as a graph yet. I will add the code for that and then update this.
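For anyone reading along, here is a minimal plain-Scala sketch (no Spark) of what the `src` and `target` UDFs above appear to do. It assumes the link value stringifies as `[source,target,anchor]`, which is what the `drop(1)` suggests but is not confirmed in this thread. `parseLink` is a hypothetical helper name, not part of aut. Unlike the UDFs, it returns `None` instead of throwing when fewer than two comma-separated fields are present.

```scala
object LinkParseSketch {
  // Hypothetical helper, not part of aut.
  // Assumes the stringified link looks like "[source,target,anchor]".
  // Returns (source, target), or None for null or malformed input.
  def parseLink(value: Any): Option[(String, String)] =
    Option(value).map(_.toString.split(",")).collect {
      case parts if parts.length >= 2 =>
        (parts(0).stripPrefix("["), parts(1))
    }

  def main(args: Array[String]): Unit = {
    println(parseLink("[http://a.org/,http://b.org/,text]"))
    println(parseLink(null))
  }
}
```

Note that splitting on `,` misparses any URL that itself contains a comma, so reading the struct fields directly would likely be more robust if the schema allows it.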
DataFrame GraphML tested locally with:

df:

import io.archivesunleashed._
import io.archivesunleashed.df._
import io.archivesunleashed.app._

sc.setLogLevel("INFO")

// Web archive collection; web graph.
val webgraph = RecordLoader.loadArchives("/home/nruest/Projects/au/sample-data/geocities", sc)
  .webgraph()

val graph = webgraph.groupBy(
    $"crawl_date",
    RemovePrefixWWWDF(ExtractDomainDF($"src")).as("src_domain"),
    RemovePrefixWWWDF(ExtractDomainDF($"dest")).as("dest_domain"))
  .count()
  .filter(!($"dest_domain"===""))
  .filter(!($"src_domain"===""))
  .filter($"count" > 5)
  .orderBy(desc("count"))

WriteGraphML(graph.collect(), "/home/nruest/Projects/au/sample-data/issue-439/documentation-pr-test-df.graphml")

rdd:

import io.archivesunleashed._
import io.archivesunleashed.app._
import io.archivesunleashed.matchbox._

sc.setLogLevel("INFO")

// Web archive collection.
val warcs = RecordLoader.loadArchives("/home/nruest/Projects/au/sample-data/geocities", sc)
  .keepValidPages()

// GraphML.
val links = warcs
  .map(r => (r.getCrawlDate, ExtractLinksRDD(r.getUrl, r.getContentString)))
  .flatMap(r => r._2.map(f => (r._1,
    ExtractDomainRDD(f._1).replaceAll("^\\s*www\\.", ""),
    ExtractDomainRDD(f._2).replaceAll("^\\s*www\\.", ""))))
  .filter(r => r._2 != "" && r._3 != "")
  .countItems()
  .filter(r => r._2 > 5)

WriteGraphML(links, "/home/nruest/Projects/au/sample-data/issue-439/documentation-pr-test-rdd.graphml")
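A small plain-Scala sketch of the aggregation both versions perform, in case it helps when comparing outputs: strip the `www.` prefix, drop rows with an empty domain, count identical (date, source, target) triples, and keep those seen more than 5 times. The sample data and the names `EdgeCountSketch`, `stripWww`, and `countEdges` are made up for illustration, and `groupBy` only stands in for aut's `countItems`; this is not how aut implements it.

```scala
object EdgeCountSketch {
  // (crawl_date, src_domain, dest_domain)
  type Edge = (String, String, String)

  // Same regex as the RDD example: optional leading whitespace, then "www."
  def stripWww(domain: String): String =
    domain.replaceAll("^\\s*www\\.", "")

  // Hypothetical helper: count identical edges and keep those above minCount,
  // sorted by descending count.
  def countEdges(edges: Seq[Edge], minCount: Int): Seq[(Edge, Int)] =
    edges
      .map { case (date, src, dest) => (date, stripWww(src), stripWww(dest)) }
      .filter { case (_, src, dest) => src.nonEmpty && dest.nonEmpty }
      .groupBy(identity)
      .map { case (edge, group) => (edge, group.size) }
      .filter { case (_, count) => count > minCount }
      .toSeq
      .sortBy { case (_, count) => -count }

  def main(args: Array[String]): Unit = {
    // Made-up sample: six copies of one edge, one edge with an empty target,
    // and one edge that occurs only once.
    val sample: Seq[Edge] =
      Seq.fill(6)(("20200101", "www.a.org", "b.org")) ++
        Seq(("20200101", "a.org", ""), ("20200101", "c.org", "d.org"))
    countEdges(sample, 5).foreach(println)
  }
}
```

With this sample only the `a.org` to `b.org` edge survives, since it is the only triple that occurs more than 5 times after the prefix is stripped.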
ruebot commented Jan 1, 2020 (edited)
@lintool @ianmilligan1 here are the first two. We still need to do the third derivative, and I'll move this out of draft when we get it done.
@SinghGursimran can you make this your next focus point in archivesunleashed/aut#223? Converting this (below) to DF?