Replace hashing of unique ids with .zipWithUniqueId() #243
Comments
greebie self-assigned this Jul 20, 2018
greebie added the enhancement label Jul 20, 2018
I found a solution, which still needs testing, but here are current time trials using this code and the UVIC local news warcs:
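A minimal sketch of the kind of change under discussion, assuming Spark's Scala RDD API; the object and variable names are illustrative, not the actual aut code:

```scala
import org.apache.spark.sql.SparkSession

object IdAssignmentSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("id-assignment-sketch")
      .getOrCreate()
    val sc = spark.sparkContext

    val urls = sc.parallelize(Seq(
      "http://example.com/a",
      "http://example.com/b",
      "http://example.com/c"))

    // Old approach: derive each node id from a hash of the url.
    // Distinct urls can hash to the same value and collapse into one node.
    val hashedIds = urls.map(url => (url.hashCode.toLong, url))

    // New approach: let Spark assign a guaranteed-unique Long per element.
    val uniqueIds = urls.zipWithUniqueId().map { case (url, id) => (id, url) }

    uniqueIds.collect().foreach(println)
    spark.stop()
  }
}
```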
The new way is definitely slower, but within 10-20% of the old approach.
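For context, a rough harness of the sort one might use for such trials (illustrative only; assumes the SparkContext `sc` from the sketch above, and real numbers will vary with data and cluster):

```scala
// Time a Spark action by forcing evaluation with count().
def time[A](label: String)(block: => A): A = {
  val start = System.nanoTime()
  val result = block
  println(f"$label: ${(System.nanoTime() - start) / 1e9}%.2f s")
  result
}

val manyUrls = sc.parallelize(Seq.tabulate(1000000)(i => s"http://example.com/$i"))
time("hash-based ids")(manyUrls.map(u => (u.hashCode.toLong, u)).count())
time("zipWithUniqueId ids")(manyUrls.zipWithUniqueId().count())
```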
Thanks for this @greebie! This new approach seems better, but I'm just weighing the cost/benefit of longer processing time vs. avoiding all hash collisions. Right now, when we have collisions, Gephi automatically fixes those, is that correct?
Not really. Gephi merges items whose ids collide, so with large collections, it could produce a misrepresentation of the graph. However, I have not come across it in this example.
Why not zipWithUniqueId()? You just need the ids to be unique - they don't need to be sequential, right?
I'm using zipWithUniqueId(). :)
Oh, misread then. In the first comment in the issue you wrote:

> The .zipWithIndex() feature in Apache Spark would be a better approach.
That's right - the current pushed branch uses zipWithUniqueId() instead for the reasons you said. (I changed the issue title to avoid future confusion.)
greebie changed the title from Replace hashing of unique ids with .zipWithIndex() to Replace hashing of unique ids with .zipWithUniqueId() Nov 7, 2018
Okay - I've decided to keep the existing
OK, so this proposed approach would have
That's right. It means instead of "fixing" WriteGexf, I am adding this new approach, leaving the following possibilities.
This wouldn't truly affect AUK until there was a new release of AUT. That said, can you provide more detail as to what we'd have to change in the workflow? Would we still produce
@ruebot The only change to aut should be the command used to write the graph; the Graphpass workflow should remain the same. Alternatively, I can make graphml the default WriteGraph behavior. Basically, I chose this approach because I was duplicating code between WriteGEXF and WriteGraphml, and it started to seem that I should put them both together.
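For illustration, one hypothetical shape for that combined entry point; `WriteGraph`, its signature, and the extension-based dispatch below are assumptions, not the actual aut API:

```scala
import java.io.PrintWriter

// Hypothetical sketch: one entry point that dispatches on the output path's
// extension, with graphml as the default. Not the actual aut implementation.
object WriteGraph {
  type Edge = (String, String) // (source url, destination url)

  def apply(edges: Seq[Edge], path: String): Unit =
    if (path.endsWith(".gexf")) writeGexf(edges, path)
    else writeGraphml(edges, path) // graphml as the default behavior

  private def writeGexf(edges: Seq[Edge], path: String): Unit = {
    val out = new PrintWriter(path)
    out.println("""<gexf xmlns="http://www.gexf.net/1.2draft" version="1.2">""")
    // ... node and edge serialization shared with writeGraphml would go here ...
    out.println("</gexf>")
    out.close()
  }

  private def writeGraphml(edges: Seq[Edge], path: String): Unit = {
    val out = new PrintWriter(path)
    out.println("""<graphml xmlns="http://graphml.graphdrawing.org/xmlns">""")
    // ... node and edge serialization shared with writeGexf would go here ...
    out.println("</graphml>")
    out.close()
  }
}
```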
greebie referenced this issue Nov 7, 2018
Merged: Change Id generation for graphs from using hashes for urls to using .zipWithUniqueIds() #289
@ruebot I realize I failed to answer your last questions. The graphs produced by graphpass include metadata about how to display the graph in Gephi or Sigma (i.e. how big the dots should be, what color they are, and where they should be positioned in the visualization). Auk-produced graphs, unfortunately, just provide the raw network data with no visualization metadata. Currently, the best we have in auk is ExtractGraphX, which produces metadata for node sizes and some other things. It can offer a fair way to reduce the size of large graphs for visualization, but it would increase the amount of time it takes to produce derivatives for small graphs. When we accepted that udf into the repo, we decided it might be good for the toolkit, but it's not quite ready to help auk.
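As a concrete illustration of the kind of metadata being described, GEXF's viz extension carries size, color, and position per node; the values below are made up:

```scala
// Example of per-node visualization metadata (GEXF viz attributes) of the
// kind graphpass can emit; plain aut-derived graphs omit this block entirely.
val nodeWithVizMetadata =
  """<node id="0" label="example.com">
    |  <viz:size value="24.5"/>
    |  <viz:color r="96" g="17" b="27"/>
    |  <viz:position x="12.3" y="-45.6" z="0.0"/>
    |</node>""".stripMargin

println(nodeWithVizMetadata)
```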
greebie commented Jul 20, 2018 (edited)
Describe the Enhancement
AUT uses hash values to create unique ids, which can leave us with duplicates of the same url in a network graph when hashes collide.
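To make the failure mode concrete: Java's String.hashCode has well-known collisions ("Aa" and "BB" both hash to 2112), and a collision survives a shared prefix, so two distinct urls can end up with the same hash-derived id. The urls here are contrived to collide:

```scala
// "Aa" and "BB" collide under String.hashCode (both 2112), so these two
// distinct urls get identical hash-based ids and would merge into one node.
val a = "http://example.com/Aa"
val b = "http://example.com/BB"

assert(a != b)
assert(a.hashCode == b.hashCode)
```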
To Reproduce
Steps to reproduce the behavior (e.g.):
1. Run a Domain Graph Extractor with a large number of network nodes (websites).
2. Run in Gephi.
3. Discover duplicate websites in graph.
Expected behavior
All network nodes should be unique.
Screenshots
N/A
Additional context
The .zipWithIndex() feature in Apache Spark would be a better approach. http://spark.apache.org/docs/latest/api/python/pyspark.html#pyspark.RDD.zipWithIndex
.zipWithUniqueId() does not trigger an extra Spark job the way .zipWithIndex() can, so it could be faster.
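A small sketch of the difference, assuming a local SparkContext `sc`; the ids shown follow Spark's documented scheme for two partitions:

```scala
val rdd = sc.parallelize(Seq("a", "b", "c", "d"), numSlices = 2)

// zipWithIndex() yields consecutive ids 0..3, but must run an extra Spark
// job to learn partition sizes when the RDD has more than one partition.
rdd.zipWithIndex().collect()    // (a,0), (b,1), (c,2), (d,3)

// zipWithUniqueId() computes ids locally: items in partition k get
// k, n+k, 2n+k, ... for n partitions. Unique, but not consecutive.
rdd.zipWithUniqueId().collect() // (a,0), (b,2), (c,1), (d,3)
```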
See also #228