Replace hashing of unique ids with .zipWithUniqueId() #243
Comments
greebie self-assigned this Jul 20, 2018
greebie added the enhancement label Jul 20, 2018
I found a solution, which still needs testing, but here are current time trials using this code and the UVIC local news warcs:
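A minimal sketch of the kind of change under discussion, assuming Spark's Scala RDD API; the object and variable names are illustrative, not the actual aut code:

```scala
import org.apache.spark.sql.SparkSession

object IdAssignmentSketch {
  def main(args: Array[String]): Unit = {
    val spark = SparkSession.builder()
      .master("local[*]")
      .appName("id-assignment-sketch")
      .getOrCreate()
    val sc = spark.sparkContext

    val urls = sc.parallelize(Seq(
      "http://example.com/a",
      "http://example.com/b",
      "http://example.com/c"))

    // Old approach: derive each node id from a hash of the url.
    // Distinct urls can hash to the same value and collapse into one node.
    val hashedIds = urls.map(url => (url.hashCode.toLong, url))

    // New approach: let Spark assign a guaranteed-unique Long per element.
    val uniqueIds = urls.zipWithUniqueId().map { case (url, id) => (id, url) }

    uniqueIds.collect().foreach(println)
    spark.stop()
  }
}
```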
The new way is definitely slower, but within 10-20% of the old approach.
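For context, a rough harness of the sort one might use for such trials (illustrative only; assumes the SparkContext `sc` from the sketch above, and real numbers will vary with data and cluster):

```scala
// Time a Spark action by forcing evaluation with count().
def time[A](label: String)(block: => A): A = {
  val start = System.nanoTime()
  val result = block
  println(f"$label: ${(System.nanoTime() - start) / 1e9}%.2f s")
  result
}

val manyUrls = sc.parallelize(Seq.tabulate(1000000)(i => s"http://example.com/$i"))
time("hash-based ids")(manyUrls.map(u => (u.hashCode.toLong, u)).count())
time("zipWithUniqueId ids")(manyUrls.zipWithUniqueId().count())
```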
Thanks for this @greebie! This new approach seems better, but I'm just weighing the cost/benefit of longer processing time vs. avoiding all hash collisions. Right now, when we have collisions, Gephi automatically fixes those, is that correct?
Not really. Gephi merges items whose ids collide, so with large collections, it could produce a misrepresentation of the graph. However, I have not come across it in this example.
Why not zipWithUniqueId()? You just need the ids to be unique - they don't need to be sequential, right?
I'm using zipWithUniqueId(). :)
Oh, misread then. In the first comment in the issue you wrote:

> The .zipWithIndex() feature in Apache Spark would be a better approach.
That's right - the current pushed branch uses zipWithUniqueId() instead for the reasons you said. (I changed the issue title to avoid future confusion.)
greebie changed the title from Replace hashing of unique ids with .zipWithIndex() to Replace hashing of unique ids with .zipWithUniqueId() Nov 7, 2018
Okay - I've decided to keep the existing
OK, so this proposed approach would have
That's right. It means instead of "fixing" WriteGexf, I am adding this new approach, leaving the following possibilities.
This wouldn't truly affect AUK until there was a new release of AUT. That said, can you provide more detail as to what we'd have to change in the workflow? Would we still produce
@ruebot The only change to aut should be the command used to write the graph; the Graphpass workflow should remain the same. Alternatively, I can make graphml the default WriteGraph behavior. Basically, I chose this approach because I was duplicating code between WriteGEXF and WriteGraphml, and it started to seem that I should put them both together.
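For illustration, one hypothetical shape for that combined entry point; `WriteGraph`, its signature, and the extension-based dispatch below are assumptions, not the actual aut API:

```scala
import java.io.PrintWriter

// Hypothetical sketch: one entry point that dispatches on the output path's
// extension, with graphml as the default. Not the actual aut implementation.
object WriteGraph {
  type Edge = (String, String) // (source url, destination url)

  def apply(edges: Seq[Edge], path: String): Unit =
    if (path.endsWith(".gexf")) writeGexf(edges, path)
    else writeGraphml(edges, path) // graphml as the default behavior

  private def writeGexf(edges: Seq[Edge], path: String): Unit = {
    val out = new PrintWriter(path)
    out.println("""<gexf xmlns="http://www.gexf.net/1.2draft" version="1.2">""")
    // ... node and edge serialization shared with writeGraphml would go here ...
    out.println("</gexf>")
    out.close()
  }

  private def writeGraphml(edges: Seq[Edge], path: String): Unit = {
    val out = new PrintWriter(path)
    out.println("""<graphml xmlns="http://graphml.graphdrawing.org/xmlns">""")
    // ... node and edge serialization shared with writeGexf would go here ...
    out.println("</graphml>")
    out.close()
  }
}
```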
greebie referenced this issue Nov 7, 2018
Merged: Change Id generation for graphs from using hashes for urls to using .zipWithUniqueIds() #289
@ruebot I realize I failed to answer your last questions. The graphs produced by graphpass include metadata about how to display the graph in Gephi or Sigma (i.e. how big the dots should be, what color they are, and where they should be positioned in the visualization). Auk-produced graphs, unfortunately, just provide the raw network data with no visualization metadata. Currently, the best we have in auk is ExtractGraphX, which produces metadata for node sizes and some other things. It can offer a fair way to reduce the size of large graphs for visualization, but it would increase the amount of time it takes to produce derivatives for small graphs. When we accepted that udf into the repo, we decided it might be good for the toolkit, but it's not quite ready to help auk.
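As a concrete illustration of the kind of metadata being described, GEXF's viz extension carries size, color, and position per node; the values below are made up:

```scala
// Example of per-node visualization metadata (GEXF viz attributes) of the
// kind graphpass can emit; plain aut-derived graphs omit this block entirely.
val nodeWithVizMetadata =
  """<node id="0" label="example.com">
    |  <viz:size value="24.5"/>
    |  <viz:color r="96" g="17" b="27"/>
    |  <viz:position x="12.3" y="-45.6" z="0.0"/>
    |</node>""".stripMargin

println(nodeWithVizMetadata)
```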
greebie commented Jul 20, 2018 (edited)
Describe the Enhancement
AUT uses hash values to create unique ids, which can leave us with duplicates of the same url in a network graph when hashes collide.
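To make the failure mode concrete: Java's String.hashCode has well-known collisions ("Aa" and "BB" both hash to 2112), and a collision survives a shared prefix, so two distinct urls can end up with the same hash-derived id. The urls here are contrived to collide:

```scala
// "Aa" and "BB" collide under String.hashCode (both 2112), so these two
// distinct urls get identical hash-based ids and would merge into one node.
val a = "http://example.com/Aa"
val b = "http://example.com/BB"

assert(a != b)
assert(a.hashCode == b.hashCode)
```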
To Reproduce
Steps to reproduce the behavior (e.g.):
1. Run a Domain Graph Extractor with a large number of network nodes (websites).
2. Run in Gephi.
3. Discover duplicate websites in graph.
Expected behavior
All network nodes should be unique.
Screenshots
N/A
Additional context
The .zipWithIndex() feature in Apache Spark would be a better approach. http://spark.apache.org/docs/latest/api/python/pyspark.html#pyspark.RDD.zipWithIndex
.zipWithUniqueId() does not trigger an extra Spark job the way .zipWithIndex() can, so it could be faster.
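A small sketch of the difference, assuming a local SparkContext `sc`; the ids shown follow Spark's documented scheme for two partitions:

```scala
val rdd = sc.parallelize(Seq("a", "b", "c", "d"), numSlices = 2)

// zipWithIndex() yields consecutive ids 0..3, but must run an extra Spark
// job to learn partition sizes when the RDD has more than one partition.
rdd.zipWithIndex().collect()    // (a,0), (b,1), (c,2), (d,3)

// zipWithUniqueId() computes ids locally: items in partition k get
// k, n+k, 2n+k, ... for n partitions. Unique, but not consecutive.
rdd.zipWithUniqueId().collect() // (a,0), (b,2), (c,1), (d,3)
```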
See also #228