Refactor ExtractGraph and assess value of GraphX for producing network graphs #203

Closed
greebie opened this Issue Apr 26, 2018 · 11 comments

Comments

4 participants
@greebie
Contributor

greebie commented Apr 26, 2018

The upgrade to Spark 2.3 fails ExtractGraphTest, but that test and the associated UDF are under-utilized and could use some refactoring.

Let's use this opportunity to re-examine the value of ExtractGraph in the wasapi - aut - graphpass - auk pipeline.

@ruebot

Member

ruebot commented Apr 26, 2018

Context: The potential upgrade to Spark 2.3.0 in the issue-197 branch (#197).

@lintool

Member

lintool commented May 2, 2018

As a first task, let's try to extract a domain link graph and convert it into a GraphX graph object - and from there we can get access to GraphX features...
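One way to picture that first task is the plain-Python sketch below, which is not GraphX code. It collapses page-level links into weighted domain-to-domain edges and assigns each domain an integer id, since GraphX identifies vertices by long integers. The helper names `domain_of` and `domain_graph`, the sample URLs, and the choice to strip a leading `www.` are all assumptions made for illustration.

```python
from collections import Counter
from urllib.parse import urlsplit


def domain_of(url):
    """Host part of a URL, with a leading 'www.' dropped (an assumption)."""
    host = urlsplit(url).hostname or ""
    return host[4:] if host.startswith("www.") else host


def domain_graph(url_links):
    """Turn (source_url, target_url) pairs into vertex and edge lists."""
    # Count how many page-level links connect each ordered pair of domains.
    weights = Counter((domain_of(s), domain_of(d)) for s, d in url_links)
    # Assign integer ids in order of first appearance.
    ids = {}
    for s, d in weights:
        for name in (s, d):
            ids.setdefault(name, len(ids))
    vertices = [(vid, name) for name, vid in ids.items()]
    edges = [(ids[s], ids[d], w) for (s, d), w in weights.items()]
    return vertices, edges


# Hypothetical page-level links.
url_links = [
    ("http://www.a.com/page1", "http://b.com/x"),
    ("http://a.com/page2", "http://b.com/y"),
    ("http://b.com/x", "http://c.org/"),
]
vertices, edges = domain_graph(url_links)
```

The `(id, name)` and `(source id, target id, weight)` tuples have the same shape as the vertex and edge collections a GraphX graph is built from.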

Potentially interesting links:

@greebie

Contributor

greebie commented May 2, 2018

An analysis that would be helpful to me is one that produces values for weakly and strongly connected components and outputs them to GEXF. This would be a good thing to pull out of GraphX rather than GraphPass, igraph, or whatnot.

Basically, this is a way to extract the "big ball" (giant component) out of a bunch of groups.
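To make the "big ball" idea concrete, here is a minimal plain-Python sketch, not GraphX itself. It labels nodes by weakly connected component with a union-find structure and keeps only the edges inside the largest component. The function names and the sample domains are hypothetical. In GraphX the labels would presumably come from `connectedComponents()` instead.

```python
from collections import Counter


def weakly_connected_components(edges):
    """Label each node with a component id, ignoring edge direction."""
    parent = {}

    def find(node):
        parent.setdefault(node, node)
        # Walk up to the root, shortening the path along the way.
        while parent[node] != node:
            parent[node] = parent[parent[node]]
            node = parent[node]
        return node

    for src, dst in edges:
        root_src, root_dst = find(src), find(dst)
        if root_src != root_dst:
            parent[root_src] = root_dst

    return {node: find(node) for node in parent}


def giant_component(edges):
    """Return the edges whose endpoints lie in the largest weak component."""
    labels = weakly_connected_components(edges)
    if not labels:
        return []
    biggest = Counter(labels.values()).most_common(1)[0][0]
    return [(s, d) for s, d in edges if labels[s] == biggest]


# Hypothetical domain link graph: a four-node cluster plus a stray pair.
links = [
    ("a.com", "b.com"), ("b.com", "c.com"), ("c.com", "a.com"),
    ("d.com", "a.com"), ("x.org", "y.org"),
]
core = giant_component(links)
```

Here `core` keeps the four edges among `a.com`, `b.com`, `c.com` and `d.com` and drops the isolated `x.org`/`y.org` pair.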

@greebie greebie changed the title from Refactor ExtractGraph and assess value of SparkX for producing network graphs to Refactor ExtractGraph and assess value of GraphX for producing network graphs May 2, 2018

@greebie

Contributor

greebie commented May 2, 2018

Note that GraphX (I called it SparkX, which was wrong) is not compatible with DataFrames. GraphFrames may be an alternative: http://graphframes.github.io/

@ianmilligan1

Member

ianmilligan1 commented Jun 25, 2018

@greebie wrote: "Note that Graphx (I called it Sparkx and that is wrong) is not compatible with Dataframes. GraphFrames may be an alternative: http://graphframes.github.io/"

Given that GraphX is not compatible with the future direction of the project (DataFrames), should we close this issue?

@greebie

Contributor

greebie commented Jun 25, 2018

@hardiksahl's work provides a helpful new direction, but it will not resolve this issue in the short run. Closing.

@greebie greebie closed this Jun 25, 2018

@greebie greebie reopened this Jul 25, 2018

@greebie

Contributor

greebie commented Jul 25, 2018

After looking at the issue, there is another way forward, using connected components and GraphPass.

If we use GraphX instead of the usual approach, we will have access to PageRank values, connected components, and strongly connected components.

In GraphPass, if the node list has more than 50,000 nodes, then we can try filtering down to just the largest connected component (and the data is already there, so it is not a huge run).

If the node list still has more than 50,000 nodes, then we can use the graph of just the strongly connected components. In this case, it means that we have the "true core" of websites in interaction with each other. It is not a full representation of the graph, but it is a logical criterion for a "first pass" at the graph.
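The two-step fallback described above can be sketched in plain Python as follows. This is an illustration only, not GraphPass or GraphX code, and every function name is hypothetical. It also assumes that "just the strongly connected components" means nodes that share a strongly connected component with at least one other node. The strongly connected components are found with Kosaraju's two-pass depth-first search.

```python
from collections import Counter, defaultdict

NODE_LIMIT = 50_000  # threshold quoted in the comment above


def node_set(edges):
    return {n for edge in edges for n in edge}


def largest_weak_component(edges):
    """Edges inside the largest component when direction is ignored."""
    neighbours = defaultdict(set)
    for s, d in edges:
        neighbours[s].add(d)
        neighbours[d].add(s)
    label = {}
    for start in neighbours:
        if start in label:
            continue
        label[start] = start
        stack = [start]
        while stack:
            for nxt in neighbours[stack.pop()]:
                if nxt not in label:
                    label[nxt] = start
                    stack.append(nxt)
    if not label:
        return []
    biggest = Counter(label.values()).most_common(1)[0][0]
    return [(s, d) for s, d in edges if label[s] == biggest]


def mutual_core(edges):
    """Edges between nodes sharing a strongly connected component of size > 1."""
    forward, backward = defaultdict(list), defaultdict(list)
    for s, d in edges:
        forward[s].append(d)
        backward[d].append(s)

    # Pass 1: record nodes in order of DFS completion on the forward graph.
    order, seen = [], set()
    for root in node_set(edges):
        if root in seen:
            continue
        seen.add(root)
        stack = [(root, iter(forward.get(root, ())))]
        while stack:
            node, children = stack[-1]
            child = next((c for c in children if c not in seen), None)
            if child is None:
                order.append(node)
                stack.pop()
            else:
                seen.add(child)
                stack.append((child, iter(forward.get(child, ()))))

    # Pass 2: sweep the reversed graph in reverse completion order.
    component = {}
    for root in reversed(order):
        if root in component:
            continue
        component[root] = root
        stack = [root]
        while stack:
            for prev in backward.get(stack.pop(), ()):
                if prev not in component:
                    component[prev] = root
                    stack.append(prev)

    sizes = Counter(component.values())
    return [(s, d) for s, d in edges
            if component[s] == component[d] and sizes[component[s]] > 1]


def reduce_for_layout(edges, limit=NODE_LIMIT):
    """Apply each filter only while the node count still exceeds the limit."""
    if len(node_set(edges)) <= limit:
        return edges
    edges = largest_weak_component(edges)
    if len(node_set(edges)) <= limit:
        return edges
    return mutual_core(edges)


# Hypothetical graph with a tiny limit so that both fallbacks trigger.
links = [
    ("a.com", "b.com"), ("b.com", "c.com"), ("c.com", "a.com"),
    ("d.com", "a.com"), ("x.org", "y.org"),
]
reduced = reduce_for_layout(links, limit=3)
```

With the limit set to 3, the sample graph first loses the `x.org`/`y.org` pair, then `d.com`, which links into the cycle but receives no link back.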

@greebie

Contributor

greebie commented Jul 25, 2018

In other news, as a general feature, it may be possible to offer a GEXF file that at least provides colour (based on components) and size attributes for visualization in Gephi. Positioning will still be random, but for someone who does not want to use GraphPass, it is still better than the status quo.
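As a rough illustration of what such a file could contain, the sketch below uses Python's standard `xml.etree.ElementTree` to write a GEXF 1.2 document in which each node carries a `viz:color` chosen by component and a `viz:size` taken from a score (for instance a PageRank value). The `to_gexf` function, the palette, and the sample data are hypothetical. No position element is written, so layout is left to whatever tool opens the file.

```python
import xml.etree.ElementTree as ET

GEXF_NS = "http://www.gexf.net/1.2draft"
VIZ_NS = "http://www.gexf.net/1.2draft/viz"
ET.register_namespace("", GEXF_NS)
ET.register_namespace("viz", VIZ_NS)

# Arbitrary RGB palette, reused cyclically if there are more components.
PALETTE = [(31, 119, 180), (255, 127, 14), (44, 160, 44), (214, 39, 40)]


def to_gexf(edges, component, score):
    """Serialize a directed graph with per-node colour and size."""
    gexf = ET.Element(f"{{{GEXF_NS}}}gexf", {"version": "1.2"})
    graph = ET.SubElement(gexf, f"{{{GEXF_NS}}}graph",
                          {"mode": "static", "defaultedgetype": "directed"})
    nodes = ET.SubElement(graph, f"{{{GEXF_NS}}}nodes")
    colour_of = {}
    for name in sorted(component):
        comp = component[name]
        # The first time a component is seen, give it the next palette colour.
        colour_of.setdefault(comp, PALETTE[len(colour_of) % len(PALETTE)])
        r, g, b = colour_of[comp]
        node = ET.SubElement(nodes, f"{{{GEXF_NS}}}node",
                             {"id": name, "label": name})
        ET.SubElement(node, f"{{{VIZ_NS}}}color",
                      {"r": str(r), "g": str(g), "b": str(b)})
        ET.SubElement(node, f"{{{VIZ_NS}}}size",
                      {"value": str(score.get(name, 1.0))})
    edge_list = ET.SubElement(graph, f"{{{GEXF_NS}}}edges")
    for i, (s, d) in enumerate(edges):
        ET.SubElement(edge_list, f"{{{GEXF_NS}}}edge",
                      {"id": str(i), "source": s, "target": d})
    return ET.tostring(gexf, encoding="unicode")


# Hypothetical inputs: edges, component labels, and per-node scores.
edges = [("a.com", "b.com"), ("b.com", "a.com"), ("x.org", "a.com")]
component = {"a.com": 0, "b.com": 0, "x.org": 1}
score = {"a.com": 3.0, "b.com": 2.0}
xml_text = to_gexf(edges, component, score)
```

Nodes without a score fall back to a size of 1.0, and nodes in the same component share a colour.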

@ruebot

Member

ruebot commented Aug 14, 2018

Can we rewrite this issue, or close it and open some new ones? It seems like there are a few different things going on here, and it's not exactly clear what the definition of done is.

@ianmilligan1

Member

ianmilligan1 commented Aug 14, 2018

I’m a bit confused too. #245 seems to have resolved the major parts of this (both the assessing/investigating and revising GraphX support). If we go further down the Gephi output road that @greebie discussed above, that might be better served with more specific issues.

@greebie

Contributor

greebie commented Aug 14, 2018

I see this as resolved. The rest is just a few notes on what we could do with it.

@greebie greebie closed this Aug 14, 2018

1.0.0 Release of AUT automation moved this from To Do to Done Aug 14, 2018

DataFrames and PySpark automation moved this from ToDo to In review Aug 14, 2018

@ruebot ruebot moved this from In review to Done in DataFrames and PySpark Aug 20, 2018
