Refactor ExtractGraph and assess value of GraphX for producing network graphs #203

greebie · Apr 26, 2018

The upgrade to Spark 2.3 fails ExtractGraphTest, but that test and the associated UDF are under-utilized and could use some refactoring.

Let's use this opportunity to re-examine the value of ExtractGraph in the wasapi - aut - graphpass - auk pipeline.

ruebot · Apr 26, 2018

Context: The potential upgrade to Spark 2.3.0 in the issue-197 branch (#197).

lintool · May 2, 2018

As a first task, let's try to extract a domain link graph and convert it into a GraphX graph object - and from there we can get access to GraphX features...

Potentially interesting links:

greebie · May 2, 2018

An analysis that would be helpful to me is to produce values for weakly and strongly connected components that output to GEXF. This would be a good thing to pull out of GraphX rather than GraphPass or igraph or whatnot.

Basically, this is a way to extract the "big ball" (giant component) out of a bunch of groups.
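
To make the "big ball" idea concrete, here is a minimal sketch in plain Python rather than GraphX. It treats links as undirected, groups domains into weakly connected components, and keeps the largest one. The function names and the sample domains are made up for illustration; GraphX would do the same job at scale with its own API.

```python
from collections import defaultdict, deque


def weakly_connected_components(edges):
    """Group nodes into components, ignoring the direction of each link."""
    neighbours = defaultdict(set)
    for src, dst in edges:
        neighbours[src].add(dst)
        neighbours[dst].add(src)

    seen = set()
    components = []
    for start in list(neighbours):
        if start in seen:
            continue
        # Breadth-first search collects everything reachable from `start`.
        seen.add(start)
        queue = deque([start])
        component = set()
        while queue:
            node = queue.popleft()
            component.add(node)
            for nxt in neighbours[node]:
                if nxt not in seen:
                    seen.add(nxt)
                    queue.append(nxt)
        components.append(component)
    return components


def giant_component(edges):
    """Return the node set of the largest weakly connected component."""
    return max(weakly_connected_components(edges), key=len, default=set())


# Hypothetical domain link graph: one three-site cluster and one isolated pair.
links = [
    ("a.com", "b.com"),
    ("b.com", "c.com"),
    ("c.com", "a.com"),
    ("x.org", "y.org"),
]
giant = giant_component(links)
```

With the sample links, `giant` holds the three `.com` domains and the `x.org` / `y.org` pair is left out.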

greebie · May 2, 2018

Note that GraphX (I called it SparkX, which is wrong) is not compatible with DataFrames. GraphFrames may be an alternative: http://graphframes.github.io/

ianmilligan1 · Jun 25, 2018

Given that GraphX is not compatible with the future direction of the project (DataFrames), should we close this issue?

greebie · Jun 25, 2018

@hardiksahi's work provides some helpful new direction, but it will not resolve this issue in the short run. Closing.

greebie · Jul 25, 2018

After looking at the issue, there is another way forward, using connected components and GraphPass.

If we use GraphX instead of the usual approach, we will have access to PageRank values, connected components and strongly connected components.

In GraphPass, if the node list has more than 50,000 nodes, then we can try filtering using just the largest connected component (and the data is already there, so it is not a huge run).

If the node list still has more than 50,000 nodes, then we can use the graph of just the strongly connected components. In this case, it means that we have the "true core" of websites in interaction with each other. It is not a full representation of the graph, but it is a logical criterion for a "first pass" in the graph.
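
The two-step filter above could look roughly like the following plain-Python sketch. It is not GraphX or GraphPass code. The helper names are hypothetical, and "the graph of just the strongly connected components" is read here as "nodes that sit in a strongly connected component of more than one node", which is an assumption about what was meant.

```python
from collections import defaultdict

NODE_LIMIT = 50_000  # threshold from the comment above


def nodes_of(edges):
    return {n for edge in edges for n in edge}


def induced(edges, keep):
    """Keep only the edges whose two endpoints are both in `keep`."""
    return [(s, d) for s, d in edges if s in keep and d in keep]


def largest_weak_component(edges):
    """Largest component when direction is ignored (union-find)."""
    parent = {}

    def find(x):
        parent.setdefault(x, x)
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for s, d in edges:
        parent[find(s)] = find(d)
    groups = defaultdict(set)
    for n in list(parent):
        groups[find(n)].add(n)
    return max(groups.values(), key=len, default=set())


def strong_components(edges):
    """Strongly connected components via Kosaraju's two-pass algorithm."""
    fwd, rev = defaultdict(list), defaultdict(list)
    for s, d in edges:
        fwd[s].append(d)
        rev[d].append(s)
    # Pass 1: record nodes in order of depth-first finishing time.
    order, seen = [], set()
    for root in nodes_of(edges):
        if root in seen:
            continue
        seen.add(root)
        stack = [(root, iter(fwd[root]))]
        while stack:
            node, successors = stack[-1]
            for nxt in successors:
                if nxt not in seen:
                    seen.add(nxt)
                    stack.append((nxt, iter(fwd[nxt])))
                    break
            else:
                order.append(node)
                stack.pop()
    # Pass 2: walk the reversed graph in decreasing finishing time.
    assigned, components = set(), []
    for root in reversed(order):
        if root in assigned:
            continue
        assigned.add(root)
        component, stack = set(), [root]
        while stack:
            node = stack.pop()
            component.add(node)
            for nxt in rev[node]:
                if nxt not in assigned:
                    assigned.add(nxt)
                    stack.append(nxt)
        components.append(component)
    return components


def first_pass_filter(edges, limit=NODE_LIMIT):
    """Shrink the graph step by step until it fits under `limit` nodes."""
    if len(nodes_of(edges)) <= limit:
        return edges
    edges = induced(edges, largest_weak_component(edges))
    if len(nodes_of(edges)) <= limit:
        return edges
    # Assumption: the "true core" is every node in a multi-node SCC.
    core = set()
    for component in strong_components(edges):
        if len(component) > 1:
            core |= component
    return induced(edges, core)
```

Passing a small `limit` makes the cascade easy to try on a toy edge list. Note that the last step can still return more than `limit` nodes; the comment above does not say what should happen in that case, so the sketch leaves it alone.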

greebie · Jul 25, 2018

In other news, as a general feature, it may be possible to offer a GEXF file that at least provides colour (using components) and size visualization features for a Gephi output. Positioning will still be random, but if someone does not want to use GraphPass, it's still better than the status quo.
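
As a rough illustration of what such a file might contain, the sketch below writes GEXF 1.2 with the `viz` module's `color` and `size` elements using only the Python standard library. It is not aut code: the input shape (a component id and a score per node), the palette, and the function name are all assumptions, and the score is simply whatever value should drive node size, for example PageRank.

```python
import xml.etree.ElementTree as ET

GEXF_NS = "http://www.gexf.net/1.2draft"
VIZ_NS = "http://www.gexf.net/1.2draft/viz"

# Arbitrary RGB palette; component ids wrap around when there are more
# components than colours.
PALETTE = [(31, 119, 180), (255, 127, 14), (44, 160, 44), (214, 39, 40)]


def to_gexf(nodes, edges):
    """Serialize a graph to a GEXF string.

    nodes: {node_id: (component_id, score)}
    edges: [(source_id, target_id)]
    """
    ET.register_namespace("", GEXF_NS)
    ET.register_namespace("viz", VIZ_NS)

    root = ET.Element(f"{{{GEXF_NS}}}gexf", {"version": "1.2"})
    graph = ET.SubElement(
        root, f"{{{GEXF_NS}}}graph", {"defaultedgetype": "directed"}
    )

    nodes_el = ET.SubElement(graph, f"{{{GEXF_NS}}}nodes")
    for node_id, (component, score) in nodes.items():
        node = ET.SubElement(
            nodes_el, f"{{{GEXF_NS}}}node", {"id": node_id, "label": node_id}
        )
        # Colour encodes the component, size encodes the score.
        r, g, b = PALETTE[component % len(PALETTE)]
        ET.SubElement(
            node, f"{{{VIZ_NS}}}color", {"r": str(r), "g": str(g), "b": str(b)}
        )
        ET.SubElement(node, f"{{{VIZ_NS}}}size", {"value": str(score)})

    edges_el = ET.SubElement(graph, f"{{{GEXF_NS}}}edges")
    for i, (src, dst) in enumerate(edges):
        ET.SubElement(
            edges_el,
            f"{{{GEXF_NS}}}edge",
            {"id": str(i), "source": src, "target": dst},
        )

    return ET.tostring(root, encoding="unicode")


# Hypothetical two-node graph with made-up component ids and scores.
gexf_xml = to_gexf(
    {"a.com": (0, 2.5), "b.com": (1, 1.0)},
    [("a.com", "b.com")],
)
```

No `viz:position` element is written, which matches the point above that layout would still be left to Gephi.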

ruebot · Aug 14, 2018

Can we rewrite this issue, or close it and open up some new ones? It seems like there are a few different things going on here, and it's not exactly clear what the definition of done is.

ianmilligan1 · Aug 14, 2018

I’m a bit confused too. #245 seems to have resolved the major parts of this (both the assessing/investigating and revising GraphX support). If we go further down the Gephi output road that @greebie discussed above, that might be better served with more specific issues.

greebie · Aug 14, 2018

I see this as resolved. The rest is just a few notes on what we could do with it.

greebie changed the title from Refactor ExtractGraph and assess value of SparkX for producing network graphs to Refactor ExtractGraph and assess value of GraphX for producing network graphs May 2, 2018

hardiksahi referenced this issue May 17, 2018
Closed
Converts WARC RDD into a GraphX object, performs PageRank and converts into GraphML object #228

ruebot added this to In Progress in DataFrames and PySpark May 21, 2018

ruebot removed this from In Progress in DataFrames and PySpark May 21, 2018

greebie closed this Jun 25, 2018

greebie reopened this Jul 25, 2018

greebie referenced this issue Jul 26, 2018
Merged
Add ExtractGraphX including algorithms for PageRank and Components. Issue 203 #245

greebie referenced this issue Aug 10, 2018
Open
Create tests for ExtractGraph.scala #49

ruebot added this to In Progress in DataFrames and PySpark Aug 13, 2018

ruebot added this to To Do in 1.0.0 Release of AUT Aug 13, 2018

ruebot moved this from In Progress to ToDo in DataFrames and PySpark Aug 13, 2018

greebie closed this Aug 14, 2018

1.0.0 Release of AUT automation moved this from To Do to Done Aug 14, 2018

DataFrames and PySpark automation moved this from ToDo to In review Aug 14, 2018

ruebot moved this from In review to Done in DataFrames and PySpark Aug 20, 2018

archivesunleashed/aut
