Use Monochromatic Ids instead of hash to produce network identifiers. #440
Comments
Hi @greebie - could you just flesh that out a bit for me?
Re: Overloading. For very large datasets, it is possible for two different URLs to convert to the same hash, which means two different nodes end up with the same id, causing data loss and potentially false links in the network.
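To get a rough feel for how likely such a collision is, the standard birthday approximation can be sketched as below. This is only an estimate, assuming the hash spreads URLs uniformly over `bits` bits; the function name `collision_probability` and the 32-bit and 128-bit widths are illustrative choices, not figures taken from this project.

```python
import math


def collision_probability(n_urls: int, bits: int) -> float:
    """Approximate chance that at least two of n_urls distinct URLs share a hash.

    Birthday approximation: p ~= 1 - exp(-n(n-1) / (2 * 2**bits)),
    assuming a uniformly distributed hash (an assumption, not a guarantee).
    """
    if n_urls < 2:
        return 0.0
    pairs = n_urls * (n_urls - 1) / 2
    # -expm1(-x) computes 1 - exp(-x) without losing precision for tiny x.
    return -math.expm1(-pairs / 2.0 ** bits)


if __name__ == "__main__":
    # Compare a hash truncated to 32 bits with a full 128-bit digest.
    for bits in (32, 128):
        print(bits, collision_probability(1_000_000, bits))
```

The risk depends heavily on how many bits of the digest are actually kept as the id: the narrower the id, the sooner collisions become likely as the number of URLs grows.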
(Also related to #243)
In this PR, I have implemented the code for getting a distinct node ID for each node. Instead of a hash, I have used the Spark window concept to generate a distinct ID based on the URL of the node. That is, each distinct URL will have a distinct node ID.
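The PR's code is not reproduced in this thread. As a plain-Python sketch of the idea only (not the Spark implementation), the following assigns one integer id per distinct URL by ranking the sorted distinct URLs, which is roughly what a dense rank over a window ordered by URL would yield. The name `assign_node_ids` and the sample edges are hypothetical.

```python
from typing import Dict, Iterable, Tuple


def assign_node_ids(edges: Iterable[Tuple[str, str]]) -> Dict[str, int]:
    """Map every distinct URL appearing in the edge list to a distinct integer id."""
    # Collect each URL once, whether it appears as a source or a destination.
    distinct_urls = sorted({url for edge in edges for url in edge})
    # Rank the sorted URLs: ids are 1, 2, 3, ... with no gaps and no repeats.
    return {url: rank for rank, url in enumerate(distinct_urls, start=1)}


if __name__ == "__main__":
    sample_edges = [
        ("example.com/a", "example.com/b"),
        ("example.com/b", "example.com/c"),
        ("example.com/a", "example.com/c"),
    ]
    ids = assign_node_ids(sample_edges)
    for src, dst in sample_edges:
        print(ids[src], ids[dst])
```

Because ids come from position in the distinct set rather than from the URL's bytes, two different URLs can never share an id, at the cost of needing a global ordering step.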
greebie commented Apr 9, 2020
Is your feature request related to a problem? Please describe.
Using MD5 or other hashes for ids in GraphML and GEXF network outputs can result in lost data due to overloading (hash collisions). With RDDs it was possible to use .zipWithIndex(), but this function is not available on DataFrames.
Describe the solution you'd like
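It may be possible to produce unique identifiers using monotonicallyIncreasingId() (spelled monotonically_increasing_id() in current Spark releases). The generated ids are unique and increasing, but not consecutive.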
https://stackoverflow.com/a/36946952/3050664
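As a rough illustration of why such ids are unique but not consecutive, here is a small pure-Python model of the layout described in the Spark documentation for this function: the partition index goes in the upper bits and the row number within the partition in the lower 33 bits. It does not call Spark; `monotonic_ids` and the partition sizes are hypothetical names and values used only for the sketch.

```python
from typing import List

RECORD_BITS = 33  # lower 33 bits: row number within a partition (per Spark docs)


def monotonic_ids(partition_sizes: List[int]) -> List[int]:
    """Model the ids given to rows spread over partitions of the given sizes."""
    ids = []
    for partition, size in enumerate(partition_sizes):
        # Each partition starts its own numbering at partition * 2**33.
        base = partition << RECORD_BITS
        ids.extend(base + row for row in range(size))
    return ids


if __name__ == "__main__":
    # Two partitions with three rows each: note the jump between partitions.
    print(monotonic_ids([3, 3]))
```

The gaps are harmless for use as node identifiers, but they mean the ids cannot be treated as a dense 0..n-1 index the way zipWithIndex() output can.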
Describe alternatives you've considered
Other solutions call an RDD zip method such as zipWithIndex() on the underlying RDD and convert the result back to a DataFrame, supplying a schema to increase efficiency.