Missing loadArchives on WriteGEXF docs #81
Comments
Could replace with:
import io.archivesunleashed._
import io.archivesunleashed.udfs._
import io.archivesunleashed.app._
val graph = RecordLoader.loadArchives("/path/to/warcs",sc)
.webgraph.groupBy(
$"crawl_date",
removePrefixWWW(extractDomain($"src")).as("src_domain"),
removePrefixWWW(extractDomain($"dest")).as("dest_domain"))
.count()
.filter(!($"dest_domain"===""))
.filter(!($"src_domain"===""))
.filter($"count" > 5)
.orderBy(desc("count"))
.collect()
WriteGEXF(graph, "links-for-gephi.gexf")
If that looks good, @ruebot, I can make the change.

I'll just create a PR for the little things I'm catching during testing this afternoon.

Let's get the line formatting like this:
import io.archivesunleashed._
import io.archivesunleashed.udfs._
import io.archivesunleashed.app._
val graph = RecordLoader.loadArchives("/path/to/warcs",sc)
.webgraph.groupBy(
$"crawl_date",
removePrefixWWW(extractDomain($"src")).as("src_domain"),
removePrefixWWW(extractDomain($"dest")).as("dest_domain"))
.count()
.filter(!($"dest_domain"===""))
.filter(!($"src_domain"===""))
.filter($"count" > 5)
.orderBy(desc("count"))
.collect()
WriteGEXF(graph, "links-for-gephi.gexf")

@ianmilligan1 if you're in a position to do this sometime today, can you open a PR for this one, or for both of the open issues you're working on? I want to see if I finally got the gh-action to work correctly. It's supposed to deploy only on merges or commits to the docusaurus branch, which is something we couldn't do with Travis.

@ruebot Opened up a draft PR. There are one or two more things I want to do on that branch before merging (although I can always do a separate one if need be).
ianmilligan1 commented Jun 17, 2020
Our Scala DF script doesn't actually load the WARCs. It is currently: