Missing loadArchives on WriteGEXF docs #81

Closed
ianmilligan1 opened this issue Jun 17, 2020 · 5 comments

Comments

@ianmilligan1 ianmilligan1 commented Jun 17, 2020

Our Scala DF script doesn't actually load the WARCs. It is currently:

import io.archivesunleashed._
import io.archivesunleashed.udfs._
import io.archivesunleashed.app._

val graph = webgraph.groupBy(
                       $"crawl_date",
                       removePrefixWWW(extractDomain($"src")).as("src_domain"),
                       removePrefixWWW(extractDomain($"dest")).as("dest_domain"))
              .count()
              .filter(!($"dest_domain"===""))
              .filter(!($"src_domain"===""))
              .filter($"count" > 5)
              .orderBy(desc("count"))
              .collect()

WriteGEXF(graph, "links-for-gephi.gexf")

@ianmilligan1 ianmilligan1 commented Jun 17, 2020

We could replace it with:

import io.archivesunleashed._
import io.archivesunleashed.udfs._
import io.archivesunleashed.app._

val graph = RecordLoader.loadArchives("/path/to/warcs",sc)
                    .webgraph.groupBy(
                       $"crawl_date",
                       removePrefixWWW(extractDomain($"src")).as("src_domain"),
                       removePrefixWWW(extractDomain($"dest")).as("dest_domain"))
              .count()
              .filter(!($"dest_domain"===""))
              .filter(!($"src_domain"===""))
              .filter($"count" > 5)
              .orderBy(desc("count"))
              .collect()

WriteGEXF(graph, "links-for-gephi.gexf")

If that looks good, @ruebot, I can make the change.

ianmilligan1 added a commit that referenced this issue Jun 17, 2020

@ianmilligan1 ianmilligan1 commented Jun 17, 2020

I'll just create a PR for the little things I'm catching during testing this afternoon.

@ruebot ruebot commented Jun 18, 2020

Let's get the line formatting like this:

import io.archivesunleashed._
import io.archivesunleashed.udfs._
import io.archivesunleashed.app._

val graph = RecordLoader.loadArchives("/path/to/warcs",sc)
              .webgraph.groupBy(
                          $"crawl_date",
                          removePrefixWWW(extractDomain($"src")).as("src_domain"),
                          removePrefixWWW(extractDomain($"dest")).as("dest_domain"))
              .count()
              .filter(!($"dest_domain"===""))
              .filter(!($"src_domain"===""))
              .filter($"count" > 5)
              .orderBy(desc("count"))
              .collect()

WriteGEXF(graph, "links-for-gephi.gexf")
ianmilligan1 added a commit that referenced this issue Jun 18, 2020

@ruebot ruebot commented Jun 18, 2020

@ianmilligan1 if you're in a position to do so sometime today, can you open a PR for this one, or for both of the open issues you're working on? I want to see if I finally got the gh-action to work correctly. It's supposed to deploy only on merges or commits to the docusaurus branch, which is something we couldn't do with Travis.

@ianmilligan1 ianmilligan1 commented Jun 18, 2020

@ruebot Opened up a draft PR. There are one or two more things I want to do on that branch before merging (although I can always do a separate one if need be).

@ruebot ruebot closed this in 11f49f4 Jun 18, 2020