Use Monochromatic Ids instead of hash to produce network identifiers. #440

Open
greebie opened this issue Apr 9, 2020 · 4 comments

@greebie greebie commented Apr 9, 2020

Is your feature request related to a problem? Please describe.

Using MD5 or other hashes for IDs in GraphML and GEXF network outputs can result in lost data due to overloading. With RDDs it was possible to use .zipWithIndex(), but this function is not available on DataFrames.

Describe the solution you'd like

It may be possible to produce unique identifiers using monotonicallyIncreasingId()

https://stackoverflow.com/a/36946952/3050664
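
For context, here is a minimal plain-Python model of the ID layout that Spark documents for this function: the partition index occupies the upper 31 bits and the record number within the partition occupies the lower 33 bits. This is not Spark code. The helper name `monotonic_id` is hypothetical and only illustrates the bit layout.

```python
# Plain-Python model of the documented bit layout of Spark's
# monotonically increasing IDs. `monotonic_id` is a hypothetical
# helper for illustration, not a Spark API.

RECORD_BITS = 33  # lower 33 bits hold the record number within a partition


def monotonic_id(partition_index: int, record_number: int) -> int:
    """Combine a partition index and a per-partition record number into one ID."""
    if not 0 <= partition_index < 2**31:
        raise ValueError("partition index must fit in 31 bits")
    if not 0 <= record_number < 2**RECORD_BITS:
        raise ValueError("record number must fit in 33 bits")
    return (partition_index << RECORD_BITS) | record_number


if __name__ == "__main__":
    # Two partitions with three rows each.
    for partition in range(2):
        for row in range(3):
            print(partition, row, monotonic_id(partition, row))
```

Under this layout the IDs are unique and increasing but not consecutive: the first row of each new partition jumps by 2^33 (8,589,934,592).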

Describe alternatives you've considered

Other solutions call zipWithUniqueId() or zipWithIndex() on the underlying RDD and convert the result back to a DataFrame, supplying a schema to improve efficiency.

@ianmilligan1 ianmilligan1 commented Apr 9, 2020

lost data due to overloading

Hi @greebie - could you just flesh that out a bit for me?

@greebie greebie commented Apr 9, 2020

Re: Overloading

For very large datasets, it is possible for two different URLs to produce the same hash (a collision), which means two different nodes end up with the same ID, causing data loss and potentially false links in the network.
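
As a rough way to size this risk, the standard birthday approximation can be used: for n distinct inputs and a b-bit identifier, p ≈ 1 - exp(-n(n-1) / 2^(b+1)). The sketch below assumes the hash behaves like a uniform random function, and the function name `collision_probability` is hypothetical. The width of the identifier actually written to the network outputs is not stated in this thread, so the bit widths in the demo are assumptions.

```python
import math


def collision_probability(num_items: int, hash_bits: int) -> float:
    """Approximate chance that at least two of `num_items` distinct inputs
    share an identifier, assuming a uniform hash of `hash_bits` bits."""
    if num_items < 2:
        return 0.0
    space = 2.0 ** hash_bits
    exponent = -num_items * (num_items - 1) / (2.0 * space)
    # -expm1(x) equals 1 - exp(x) and stays accurate when x is very close to 0.
    return -math.expm1(exponent)


if __name__ == "__main__":
    # Assumed identifier widths, for comparison only.
    for bits in (32, 64, 128):
        print(bits, collision_probability(10_000_000, bits))
```

Under this model the risk grows quickly as the identifier gets narrower, so truncating a digest or casting it to a smaller integer type matters more than the choice of hash function.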

@greebie greebie commented Apr 9, 2020

(Also related to #243)

@SinghGursimran SinghGursimran commented Apr 9, 2020

#397

In this PR, I have implemented code that assigns a distinct node ID to each node.
Although the monotonicallyIncreasingId() function does provide monotonically increasing IDs, they are on the order of 10^10, which makes it difficult to inspect the graph manually.

Instead, I have used Spark window functions to generate a distinct ID based on the URL of the node. That is, each distinct URL will have a distinct node ID.
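
The general idea of giving each distinct URL a small, consecutive ID can be sketched in plain Python. This is not the code from the PR. It only models the effect of ranking distinct URLs in sorted order, similar in spirit to a dense rank over a window ordered by URL. The names `assign_node_ids` and `edges_as_ids` and the sample edges are hypothetical.

```python
from typing import Dict, Iterable, List, Tuple

Edge = Tuple[str, str]  # (source URL, destination URL)


def assign_node_ids(edges: Iterable[Edge]) -> Dict[str, int]:
    """Map each distinct URL to a consecutive ID starting at 1, ordered by URL."""
    urls = {url for edge in edges for url in edge}
    return {url: node_id for node_id, url in enumerate(sorted(urls), start=1)}


def edges_as_ids(edges: List[Edge], ids: Dict[str, int]) -> List[Tuple[int, int]]:
    """Rewrite URL edges as ID edges using the mapping from assign_node_ids."""
    return [(ids[src], ids[dst]) for src, dst in edges]


# Hypothetical sample data.
sample_edges = [
    ("b.example", "a.example"),
    ("a.example", "c.example"),
    ("b.example", "c.example"),
]
node_ids = assign_node_ids(sample_edges)
id_edges = edges_as_ids(sample_edges, node_ids)
```

Because the mapping is built from the set of distinct URLs, two different URLs can never share an ID, and the IDs stay small enough to read by eye. The trade-off is that a global sort over all URLs is needed to assign them.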
