Replace hashing of unique ids with .zipWithUniqueId() #243

Closed
greebie opened this Issue Jul 20, 2018 · 15 comments

@greebie
Contributor

greebie commented Jul 20, 2018

Describe the Enhancement

AUT uses hash values to create unique ids, which can leave duplicates of the same URL in a network graph when hashes collide.

To Reproduce
Steps to reproduce the behavior (e.g.):
  1. Run a Domain Graph Extractor with a large number of network nodes (websites).
  2. Open the result in Gephi.
  3. Discover duplicate websites in the graph.

Expected behavior
All network nodes should be unique.

Screenshots
N/A

Additional context
The .zipWithIndex() feature in Apache Spark would be a better approach. http://spark.apache.org/docs/latest/api/python/pyspark.html#pyspark.RDD.zipWithIndex

.zipWithUniqueId() does not trigger another Spark job, so it could be faster.
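
For illustration only (not the current AUT code), replacing hash-based ids with .zipWithUniqueId() would look roughly like this, assuming the nodes are available as an RDD of domain strings:

// Sketch only: `domains` is assumed to be an RDD[String] of extracted domains.
val nodesWithIds = domains
  .distinct()          // one node per domain
  .zipWithUniqueId()   // RDD[(String, Long)]; ids are unique and no extra Spark job is triggered

// The ids can then be joined against the edge list instead of hashing the URL,
// so two different domains can never share an id.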

See also #228

@greebie greebie self-assigned this Jul 20, 2018

@greebie greebie added the enhancement label Jul 20, 2018

@greebie

Contributor

greebie commented Nov 7, 2018

I found a solution that still needs testing. Here are the current time trials using this code and the UVIC local news WARCs:

def timed(f: => Unit) = {
  val start = System.currentTimeMillis()
  f
  val end = System.currentTimeMillis()
  println("Elapsed Time: " + (end - start))
}

timed {
  import io.archivesunleashed._
  import io.archivesunleashed.app._
  import io.archivesunleashed.matchbox._

  val links = RecordLoader.loadArchives("/Users/ryandeschamps/warcs/", sc)
    .keepValidPages()
    .map(r => (r.getCrawlDate, ExtractLinks(r.getUrl, r.getContentString)))
    .flatMap(r => r._2.map(f => (r._1, ExtractDomain(f._1).replaceAll("^\\s*www\\.", ""), ExtractDomain(f._2).replaceAll("^\\s*www\\.", ""))))
    .filter(r => r._2 != "" && r._3 != "")
    .countItems()
    .filter(r => r._2 > 5)

  WriteGEXF(links, "new-gephi3.gexf")
}
@greebie

Contributor

greebie commented Nov 7, 2018

"Old Way "New Way"
223715 230982
201160 233126
209087 232057

The new way is definitely slower, but within 10-20%.

@ianmilligan1

Member

ianmilligan1 commented Nov 7, 2018

Thanks for this @greebie! This new approach seems better, but I'm just weighing the cost/benefit of longer processing time vs. avoiding all hash collisions.

Right now, when we have collisions, Gephi automatically fixes those. Is that correct?

@greebie

Contributor

greebie commented Nov 7, 2018

Not really. Gephi merges the items that collide, so with large collections it could misrepresent the graph. However, I have not come across it in this example.

@ianmilligan1

Member

ianmilligan1 commented Nov 7, 2018

OK great, thanks for the explanations and documentation here. My sense is that this is a better approach, but I'm not 100% sure given the time trade-off.

Thoughts @ruebot @lintool ?

@lintool

Member

lintool commented Nov 7, 2018

Why not zipWithUniqueId() - which doesn't need to trigger a Spark job according to the docs?

You just need the ids to be unique - they don't need to be sequential, right?
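
For reference, a quick spark-shell sketch of the difference (illustrative only; the exact ids depend on partitioning):

val rdd = sc.parallelize(Seq("a", "b", "c", "d"), 2)
// zipWithIndex assigns consecutive ids, but needs a Spark job to compute partition offsets.
rdd.zipWithIndex().collect()     // Array((a,0), (b,1), (c,2), (d,3))
// zipWithUniqueId assigns k, n+k, 2n+k, ... within each partition: unique, not sequential, no job.
rdd.zipWithUniqueId().collect()  // e.g. Array((a,0), (b,2), (c,1), (d,3))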

@greebie

Contributor

greebie commented Nov 7, 2018

I'm using zipWithUniqueId(). :)

@lintool

Member

lintool commented Nov 7, 2018

Oh, I misread then. In the first comment in the issue you wrote:

The .zipWithIndex() feature in Apache Spark would be a better approach. http://spark.apache.org/docs/latest/api/python/pyspark.html#pyspark.RDD.zipWithIndex

@greebie

Contributor

greebie commented Nov 7, 2018

That's right - the current pushed branch uses zipWithUniqueId() instead, for the reasons you said. (I changed the issue title to avoid future confusion.)

@greebie greebie changed the title from Replace hashing of unique ids with .zipWithIndex() to Replace hashing of unique ids with .zipWithUniqueId() Nov 7, 2018

@greebie

Contributor

greebie commented Nov 7, 2018

Okay - I've decided to keep the existing WriteGraphml and WriteGexf outputs and include a new app called WriteGraph. WriteGraph(rdd, filePath) will default to gexf (this can be switched to graphml), but you can access the "new" approach using WriteGraph.asGraphml(rdd, filepath) (or .asGexf). That means we have no impact on auk in the short run and can take our time deciding which method we prefer.
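
Based on that description, usage would presumably look something like this (sketch only; exact signatures depend on the PR):

// Sketch only, using the names described above.
WriteGraph(links, "graph.gexf")               // default output (GEXF)
WriteGraph.asGexf(links, "graph.gexf")        // explicit GEXF
WriteGraph.asGraphml(links, "graph.graphml")  // GraphML via the same code path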

@ianmilligan1

Member

ianmilligan1 commented Nov 7, 2018

OK, so this proposed approach would keep the WriteGraphml and WriteGexf outputs as original, and add a new WriteGraph app that does much the same but with this new zipWithUniqueId() approach? I think at least in the medium term we're not going to want to have these two approaches co-existing for too long, although short-term that is ok.

Thoughts @ruebot @lintool?

@greebie

Contributor

greebie commented Nov 7, 2018

That's right. It means instead of "fixing" WriteGexf, I am adding this new approach, leaving the following possibilities.

  1. We decide that the WriteGraph approach is not worth the extra processing time, reject the PR and close the issue.
  2. We decide that the WriteGraph approach is better, accept the PR and then I'll create issues to send WriteGEXF & WriteGraphml through a deprecation cycle. Auk would then move to WriteGraph whenever @ruebot feels it's okay to do so.
@ruebot

Member

ruebot commented Nov 7, 2018

This wouldn't truly affect AUK until there was a new release of AUT. That said, can you provide more detail as to what we'd have to do to change the workflow? Would we still produce graphml? Would that graphml, if produced, be passed off to GraphPass? If we're creating gexf in AUT now, is that gexf different from the gexf GraphPass produces? If so, how would we represent them? Looks like there are potential documentation changes to think about too.

@greebie

Contributor

greebie commented Nov 7, 2018

@ruebot The only change to aut should be that the aut command for WriteGEXF(rdd, filepath) would change to WriteGraph.asGraphml(rdd, filepath).

The Graphpass workflow should remain the same.

Alternatively, I can make graphml the default WriteGraph behavior, and then the only difference would be WriteGraph(rdd, filepath).
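
In other words, the auk-side change would be roughly a one-line swap, in either form (names as described in this thread):

// As proposed:   WriteGEXF(links, outputPath)  ->  WriteGraph.asGraphml(links, outputPath)
// Alternatively: WriteGEXF(links, outputPath)  ->  WriteGraph(links, outputPath)   (if graphml becomes the default)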

Basically, I chose this approach because I was duplicating code between WriteGEXF and WriteGraphml, and it started to seem that I should put them both together.

@greebie

Contributor

greebie commented Nov 7, 2018

@ruebot I realize I failed to answer your last questions. The graphs produced by GraphPass include metadata about how to display the graph in Gephi or Sigma (i.e. how big the dots should be, what color they are, and where they should be positioned in the visualization).

Auk-produced graphs, unfortunately, just provide the raw network data with no visualization metadata.

Currently, the best we have in auk is ExtractGraphX, which produces metadata for node sizes and similar things, and can offer a fair way to reduce the size of large graphs for visualization, but it would increase the amount of time it takes to produce derivatives for small graphs. When we accepted that udf into the repo, we decided it might be good for the toolkit, but it's not quite ready to help auk.

@ruebot ruebot closed this in d3aebf4 Nov 22, 2018
