Refactor ExtractGraph and assess value of GraphX for producing network graphs #203
Comments
ruebot
Apr 26, 2018
Member
Context: The potential upgrade to Spark 2.3.0 in the issue-197 branch (#197).
lintool
May 2, 2018
Member
As a first task, let's try to extract a domain link graph and convert it into a GraphX graph object - and from there we can get access to GraphX features...
Potentially interesting links:
greebie
May 2, 2018
Contributor
An analysis that would be helpful to me is to produce values for weakly and strongly connected components that output to GEXF. This would be a good thing to pull out of GraphX rather than GraphPass or igraph or whatnot.
Basically, this is a way to extract the "big ball" (giant component) out of a bunch of groups.
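To make the idea of weakly connected components, strongly connected components, and the giant component concrete, here is a minimal sketch in plain Python rather than GraphX. The domain names and the function names (`weak_components`, `strong_components`, `giant_component`) are made up for illustration. In GraphX itself the equivalent results would come from `connectedComponents()` and `stronglyConnectedComponents(numIter)` on a `Graph`.

```python
from collections import defaultdict, deque

def weak_components(edges):
    """Group nodes into weakly connected components (edge direction ignored)."""
    neighbours = defaultdict(set)
    for src, dst in edges:
        neighbours[src].add(dst)
        neighbours[dst].add(src)
    seen, components = set(), []
    for start in neighbours:
        if start in seen:
            continue
        seen.add(start)
        queue, members = deque([start]), set()
        while queue:  # breadth-first search over the undirected view
            node = queue.popleft()
            members.add(node)
            for nxt in neighbours[node]:
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
        components.append(members)
    return components

def strong_components(edges):
    """Strongly connected components via Kosaraju's algorithm (iterative)."""
    forward, backward, nodes = defaultdict(list), defaultdict(list), set()
    for src, dst in edges:
        forward[src].append(dst)
        backward[dst].append(src)
        nodes.update((src, dst))
    # Pass 1: record nodes in order of depth-first finishing time.
    visited, order = set(), []
    for root in nodes:
        if root in visited:
            continue
        visited.add(root)
        stack = [(root, iter(forward.get(root, ())))]
        while stack:
            node, children = stack[-1]
            for child in children:
                if child not in visited:
                    visited.add(child)
                    stack.append((child, iter(forward.get(child, ()))))
                    break
            else:  # all children explored, so this node is finished
                order.append(node)
                stack.pop()
    # Pass 2: walk the reversed graph in reverse finishing order.
    assigned, components = set(), []
    for root in reversed(order):
        if root in assigned:
            continue
        assigned.add(root)
        stack, members = [root], set()
        while stack:
            node = stack.pop()
            members.add(node)
            for prev in backward.get(node, ()):
                if prev not in assigned:
                    assigned.add(prev)
                    stack.append(prev)
        components.append(members)
    return components

def giant_component(edges):
    """The "big ball": the largest weakly connected component (needs >= 1 edge)."""
    return max(weak_components(edges), key=len)

# Hypothetical domain link graph: (source domain, target domain) pairs.
edges = [("a.org", "b.org"), ("b.org", "c.org"), ("c.org", "a.org"),
         ("c.org", "d.org"), ("x.org", "y.org")]
print(giant_component(edges))
print(strong_components(edges))
```

The giant component keeps `d.org` because direction is ignored, while the strongly connected view keeps only the three domains that can all reach each other.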
greebie
changed the title from
Refactor ExtractGraph and assess value of SparkX for producing network graphs
to
Refactor ExtractGraph and assess value of GraphX for producing network graphs
May 2, 2018
greebie
May 2, 2018
Contributor
Note that GraphX (I called it SparkX and that is wrong) is not compatible with DataFrames. GraphFrames may be an alternative: http://graphframes.github.io/
hardiksahi
referenced this issue
May 17, 2018
Closed
Converts WARC RDD into a GraphX object, performs PageRank and converts into GraphML object #228
ruebot
added this to In Progress
in DataFrames and PySpark
May 21, 2018
ruebot
removed this from In Progress
in DataFrames and PySpark
May 21, 2018
ianmilligan1
Jun 25, 2018
Member
Note that Graphx (I called it Sparkx and that is wrong) is not compatible with Dataframes. GraphFrames may be an alternative: http://graphframes.github.io/
Given that GraphX is not compatible with the future direction of the project (DataFrames), should we close this issue?
greebie
Jun 25, 2018
Contributor
@hardiksahi's work provides some helpful new direction, but it will not resolve this issue in the short run. Closing.
greebie
closed this
Jun 25, 2018
greebie
reopened this
Jul 25, 2018
greebie
Jul 25, 2018
Contributor
After looking at the issue, there is another way forward, using connected components and GraphPass.
If we use GraphX instead of the usual approach, we will have access to PageRank values, connected components, and strongly connected components.
In GraphPass, if the node list is > 50,000, then we can try filtering using just the largest connected component (and the data is already there, so it is not a huge run).
If the node list is still > 50,000, then we can use the graph of just the strongly connected components. In this case, it means that we have the "true core" of websites in interaction with each other. It is not a full representation of the graph, but it is a logical criterion for a "first pass" in the graph.
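The two-step fallback described above could be sketched roughly as follows. This is plain Python for illustration, not GraphPass or GraphX code. The function name `first_pass_nodes` and the two input dictionaries are hypothetical. They stand in for per-node component labels that are assumed to be already computed. Reading "just the strongly connected components" as "nodes in a strongly connected component with more than one member" is also an interpretation, not something the comment states.

```python
from collections import Counter

NODE_LIMIT = 50_000  # threshold mentioned in the comment above

def first_pass_nodes(weak_id, strong_id, limit=NODE_LIMIT):
    """Choose which nodes to keep, shrinking the graph in up to two steps.

    weak_id and strong_id (hypothetical names) map every node to the label
    of its weakly / strongly connected component.
    """
    nodes = set(weak_id)
    if len(nodes) <= limit:
        return nodes  # small enough, keep everything

    # Step 1: keep only the largest weakly connected component.
    biggest, _ = Counter(weak_id.values()).most_common(1)[0]
    nodes = {n for n in nodes if weak_id[n] == biggest}
    if len(nodes) <= limit:
        return nodes

    # Step 2 (interpretation): keep only nodes whose strongly connected
    # component has more than one member, i.e. sites that link to each other.
    sizes = Counter(strong_id[n] for n in nodes)
    return {n for n in nodes if sizes[strong_id[n]] > 1}

# Tiny made-up example; the limit is lowered so each step can be seen.
weak_id = {"a.org": 0, "b.org": 0, "c.org": 0, "d.org": 0, "x.org": 1, "y.org": 1}
strong_id = {"a.org": 0, "b.org": 0, "c.org": 0, "d.org": 1, "x.org": 2, "y.org": 3}
for limit in (10, 5, 3):
    print(limit, sorted(first_pass_nodes(weak_id, strong_id, limit=limit)))
```

Note that the second step does not guarantee the result is under the limit. It only applies the narrower criterion, which matches the "first pass" framing in the comment.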
greebie
Jul 25, 2018
Contributor
In other news, as a general feature, it may be possible to offer a GEXF file that at least provides colour (using components) and size visualization features for a Gephi output. Positioning will still be random, but if someone does not want to use GraphPass, it's still better than the status quo.
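As a rough illustration of what such a file might contain, the sketch below writes GEXF 1.2 with the `viz` extension, colouring each node by its component label and sizing it by a rank value. It uses only Python's standard library. The function name `to_gexf`, the palette, and the sample data are assumptions for the example and do not reflect how AUT produces its output.

```python
import xml.etree.ElementTree as ET

GEXF_NS = "http://www.gexf.net/1.2draft"
VIZ_NS = "http://www.gexf.net/1.2draft/viz"
# Arbitrary palette; component labels are mapped onto it modulo its length.
PALETTE = [(31, 119, 180), (255, 127, 14), (44, 160, 44), (214, 39, 40)]

def to_gexf(edges, component, rank):
    """Return a GEXF string with viz:color from component and viz:size from rank.

    component maps node -> integer component label, rank maps node -> number
    (for example a PageRank value). Both inputs are assumed precomputed.
    """
    ET.register_namespace("", GEXF_NS)
    ET.register_namespace("viz", VIZ_NS)
    root = ET.Element(f"{{{GEXF_NS}}}gexf", {"version": "1.2"})
    graph = ET.SubElement(root, f"{{{GEXF_NS}}}graph",
                          {"mode": "static", "defaultedgetype": "directed"})
    nodes_el = ET.SubElement(graph, f"{{{GEXF_NS}}}nodes")
    for node in sorted(component):
        el = ET.SubElement(nodes_el, f"{{{GEXF_NS}}}node",
                           {"id": node, "label": node})
        r, g, b = PALETTE[component[node] % len(PALETTE)]
        ET.SubElement(el, f"{{{VIZ_NS}}}color",
                      {"r": str(r), "g": str(g), "b": str(b)})
        ET.SubElement(el, f"{{{VIZ_NS}}}size", {"value": str(rank[node])})
    edges_el = ET.SubElement(graph, f"{{{GEXF_NS}}}edges")
    for i, (src, dst) in enumerate(edges):
        ET.SubElement(edges_el, f"{{{GEXF_NS}}}edge",
                      {"id": str(i), "source": src, "target": dst})
    return ET.tostring(root, encoding="unicode")

# Made-up sample: two components, with rank values chosen by hand.
sample_edges = [("a.org", "b.org"), ("x.org", "y.org")]
sample_component = {"a.org": 0, "b.org": 0, "x.org": 1, "y.org": 1}
sample_rank = {"a.org": 1.5, "b.org": 3.0, "x.org": 1.0, "y.org": 2.0}
gexf_text = to_gexf(sample_edges, sample_component, sample_rank)
print(gexf_text)
```

No `viz:position` element is written, which is consistent with the comment that positioning is left to Gephi or GraphPass.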
greebie
referenced this issue
Jul 26, 2018
Merged
Add ExtractGraphX including algorithms for PageRank and Components. Issue 203 #245
ruebot
added this to In Progress
in DataFrames and PySpark
Aug 13, 2018
ruebot
added this to To Do
in 1.0.0 Release of AUT
Aug 13, 2018
ruebot
moved this from In Progress
to ToDo
in DataFrames and PySpark
Aug 13, 2018
ruebot
Aug 14, 2018
Member
Can we rewrite this issue, or close it and open up some new ones? It seems like there are a few different things going on here, and it's not exactly clear what the definition of done is.
greebie
Aug 14, 2018
Contributor
I see this as resolved. The rest is just a few notes on what we could do with it.
greebie commented Apr 26, 2018
The upgrade to Spark 2.3 fails ExtractGraphTest, but that test and the associated UDF are under-utilized and could use some refactoring.
Let's use this opportunity to re-examine the value of ExtractGraph in the wasapi - aut - graphpass - auk pipeline.